Posted in corpus linguistics, techy

On analysing concordance lines

I start this post by giving a very quick introduction to concordances. If you are already an experienced corpus linguist, you can skip to the final section on categorising concordance lines. I am curious about your own practices for analysing concordance lines: do you print them out and highlight the different patterns? Or do you annotate the lines electronically, using a concordancer or a spreadsheet? Is there any other option that hasn’t occurred to me yet?

The basic display format in corpus linguistics

Over the past year or so I have been preoccupied with relatively abstract, ‘big picture’-style analyses of my corpus (basically key key word and collocation analysis), but now I have come across a theme for which a smaller-scale, qualitative analysis is more appropriate. (Once I’ve wrapped it all up, I hope to share some insights. Or you may have to wait for my thesis to get done …).

For me as a corpus linguist, the go-to tool for any qualitative investigation is the concordancer. As the name suggests, it produces concordances. A concordance is the basic display format in corpus linguistics: it lists snippets of text that illustrate the use of a particular word or phrase in a corpus. Concordance analysis has brought the discipline a long way, especially since Sinclair developed very systematic ways of analysing concordance lines for making dictionaries. (Sinclair’s guidelines are recorded in his book Reading Concordances; it’s a shame that Google Books has no preview …).

Consider this quote from Martin Wynne’s (2008, online) handbook chapter on concordancing:

For many linguists, searching and concordancing is what they mean by “doing corpus linguistics”.

The way we read concordance lines is quite different from the way we read a text. This vertical reading may take some getting used to. Here’s an example, concordance lines for language on WebCorp:


You can also use WebCorp to produce concordance lines from the web; or you can access corpora that are available online with integrated concordance functionality, such as the BNCweb or the BYU corpora. (If you want to run concordances on specialised subcorpora on the BYU interface, you might be interested in the slides and the handout from my session at the University of Birmingham Summer School in Corpus Linguistics this year).

Of course, we often want to use corpus linguistic tools on materials that haven’t been made widely available, because it is often necessary to prepare a corpus from scratch for a particular research question. To create concordances for your own texts you can use concordancers like AntConc and WordSmith Tools (which you could buy if your institution doesn’t have a license).

What are your personal preferences for analysing concordance lines?

Concordance analysis is all about viewing a word (or phrase) in its co-text to identify any patterns in the way it is used. It’s often helpful to re-sort the concordance lines. Concordance tools usually let you re-sort based on the surrounding words (in positions 1-5 or more on the left and right).
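To make the idea of sorting on the surrounding words a bit more concrete, here is a minimal Python sketch of a keyword-in-context display. It is not how AntConc or WordSmith Tools work internally; the tokeniser is deliberately crude, and the function names (tokenize, concordance, sort_key, print_kwic) and the example sentence are my own inventions for illustration.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, node, span=5):
    """Return a (left, node, right) tuple for every occurrence of the node word."""
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            lines.append((left, token, right))
    return lines

def sort_key(line, position):
    """Word at a context position: 1 is the first word on the right (R1),
    -1 is the first word on the left (L1), and so on."""
    left, _, right = line
    context = right if position > 0 else left[::-1]
    index = abs(position) - 1
    return context[index] if index < len(context) else ""

def print_kwic(lines, width=25):
    """Print the lines with the node word aligned in the middle."""
    for left, node, right in lines:
        left_text = " ".join(left)[-width:]
        right_text = " ".join(right)[:width]
        print(f"{left_text:>{width}}  {node}  {right_text}")

# Invented example text, not taken from any corpus
text = ("I can see the can on the table. You can open a can of coke "
        "if you can find it. She can swim.")

lines = concordance(tokenize(text), "can", span=3)
by_r1 = sorted(lines, key=lambda line: sort_key(line, 1))   # sort on R1
by_l1 = sorted(lines, key=lambda line: sort_key(line, -1))  # sort on L1
print_kwic(by_r1)
```

Sorting on R1 groups lines such as "can of" and "can on" next to each other, which is exactly the kind of pattern that vertical reading is meant to bring out.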


According to Martin Wynne (2008, online),

[t]his type of manual annotation of concordance lines is often done on concordance printouts with a pen. Software which allows the annotation to be done on the electronic concordance data makes it possible to sort on the basis of the annotations, and to thin the concordance to leave only those lines with or without a certain manual categorisation.

Personally, I usually start with a printout of the simple concordance lines. Then, once I have identified some simple categories, I often move on to an Excel spreadsheet. I like being able to add columns for categories (I should just not overdo it, like in the photo…). Moreover, in some versions of Excel, it is possible to select and change the font of particular words in the same cell (seems to work on Excel for Mac but not for Windows). That way, I can highlight the word or phrase which prompts the category for the concordance line. It is also possible to assign a concordance line to particular categories.
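The spreadsheet approach, and the sorting and thinning that Wynne describes, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the concordance lines and the category labels ("modal", "container") are invented, and the helper names (to_csv, thin, sort_by_category) are hypothetical rather than part of any concordancer.

```python
import csv
import io

# Invented, hand-annotated concordance lines; an empty category means
# the line has not been categorised yet.
rows = [
    {"left": "you", "node": "can", "right": "open a can", "category": "modal"},
    {"left": "open a", "node": "can", "right": "of coke", "category": "container"},
    {"left": "she", "node": "can", "right": "swim", "category": "modal"},
    {"left": "the", "node": "can", "right": "on the table", "category": ""},
]

def to_csv(rows):
    """Serialise the lines as CSV text that a spreadsheet programme can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["left", "node", "right", "category"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def thin(rows, category, keep=True):
    """Keep only the lines with (keep=True) or without (keep=False) a category."""
    return [row for row in rows if (row["category"] == category) == keep]

def sort_by_category(rows):
    """Sort the lines by category, with uncategorised lines at the end."""
    return sorted(rows, key=lambda row: (row["category"] == "", row["category"]))

modal_lines = thin(rows, "modal")
other_lines = thin(rows, "modal", keep=False)
print(to_csv(sort_by_category(rows)))
```

The category column plays the same role as the extra columns in my spreadsheet; the difference is only that sorting and thinning become repeatable once the annotations are stored with the lines.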


Some concordancers provide functionality for categorising concordance lines. In WordSmith Tools it is possible to assign categories (‘sets’). I have only recently tried this function and I’m quite impressed with the range of colours that are available, which you can see in the screenshot on the left. More information is available from the manual. BNCweb also provides a (simple) categorisation function with up to 6 categories. In the example from the screenshot below we would distinguish between can as the modal verb and can as the container for a drink. Of course, the modal is much more frequent (in general language usage, not in a text about coke cans…). Therefore all the example concordance lines represent the modal usage.


I am curious about these features and how far people use them. If you don’t use these functions, how else do you categorise concordance lines? Do you do it manually, after printing out? In practice, how often do you analyse concordance lines? Are they quite important in your research or do you focus on more quantitative aspects, checking concordance lines when necessary?

Further reading

Sinclair, J. (2003). Reading Concordances: An Introduction. Harlow: Pearson/Longman.
Wynne, M. (2008). Searching and concordancing. In A. Lüdeling & M. Kytö (Eds.), Corpus Linguistics: An International Handbook (Vol. 1, pp. 706–737). Berlin: Mouton de Gruyter. [pre-publication draft available online]
Posted in techy

A practical one: Steps for installing WordSmith Tools on a Mac

I wrote this post 1.5 months ago, in late September 2015. Now that some time has passed and I have played around with WordSmith and Windows on my Mac I think I’m ready to post it.


I have decided to put something relatively practical down today – compared to my previous posts, which were more generally about feelings related to the PhD. I’m about to start the 2nd year of my PhD (until 1 October I like to take advantage of the ‘1st year status’, though) and therefore things must get more practical. There’s still reason to talk about feelings, the nature of academia and a PhD. Yet, at the moment my feelings are actually somewhat dominated by the need to get something practical done. In corpus linguistics practical tasks often have a technical aspect.

My kitschy mac decoration; sorry for the imprecise application!

In early 2013, at the beginning of my final BA semester, I bought a Macbook, because … my relatively cheap Asus laptop had badly crashed twice, requiring a new hard disk (ok, I poured coffee over it…), was generally getting slow and had some pink and turquoise stripes on the display. At that point I was mainly thinking about my final year project, which I would have to submit in May. At the time I didn’t realise that the area of corpus linguistics, which I had already studied in a BA module, would also become the major focus of my MA and my PhD and that a Macbook might not be the greatest choice for that. [Please feel free to criticise this idea].

The reason that having a Mac is tricky for corpus linguistics is that one of the most popular software packages, WordSmith Tools (WS), does not natively run on a Mac. There are many other options, specifically the freeware AntConc, which runs on basically any operating system. [I recently learned about a new tool called corpkit, which so far seems to be Mac/Linux only though!] Many corpora are also accessible from the web – such as the COCA, the BNC, … If you want to build your own corpus, however, you likely need to have a tool on your own computer (unless you can convince the developers of a system like CQPweb to host it for you). Of course there are more techy options like using programming environments such as R or Python for corpus linguistic analysis. Because of some of the functions available in WS and the fact that my undergraduate and postgraduate corpus linguistics modules were based on this software, I still like to use it for some tasks.

Since I had regular access to a campus-based Windows desktop in the first year of the PhD I avoided the issue of installing WS on my Mac. Now I might need to do more work from my home office, so the question has popped up. I had heard that you need to install Windows in a virtual environment on your Mac by installing either Parallels or VMware. Each of them costs approximately £70, I believe. Add that to the cost of a Windows licence and the effort of installing it all, and I wasn’t too excited. Now that I have done some research I have learned about Oracle’s Virtualbox, which seems to work as well but is free. Disclaimer: I don’t know what the potential disadvantages are in installing WS via the free Virtualbox rather than a paid-for virtual environment! (Anyone?) I also once tried circumventing the step of installing the Windows OS by using the tool WineBottler, which allows you to pretend to your Mac that the Windows programme you want to use is actually in a Mac format. This wasn’t successful in my attempt to use WS and there wasn’t support available for this case, probably because corpus tools are not very widely used in comparison to other software (I suppose only linguists, other academics, and some language teachers know about them…).

So here are the steps that I followed for installing WS in a Virtualbox on my Mac:

  1. Download Virtualbox (Oracle, available for free) + its extension pack (this allows you to have shared folders between your Mac and virtual OS, I think – see this video at 22.30 for guidance on setting up a shared folder)
  2. Install Virtualbox + extension pack
  3. Buy a Windows license (I decided on Windows 7, because that’s the last one I’m familiar with) from a software website & download the operating system (iso file) from there – I found the German site, but I’m sure there are English options available
  4. Install Windows inside a new virtual machine in the Virtualbox. I basically followed the directions in video 1 and video 2. (I settled on 2GB RAM because I have 4GB; 2 CPU because I have 4 and 20GB dynamically allocated space).
    The option of setting up shared folders to access the same files from the Mac OS and the new Windows OS is explained in video 3 (minute…)
  5. Install the latest version of WS from Mike Scott’s website – you will need to have a valid license key, which you can purchase from the same site (but if you are a research student it might be worth checking with your university whether they can provide you with one)
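I did all of this through the Virtualbox windows, but the settings from step 4 (2GB RAM, 2 CPUs, a 20GB dynamically allocated disk) can also be written down as commands for Virtualbox’s command-line tool, VBoxManage. The Python sketch below only builds and prints those commands; it doesn’t run them. The machine name "Win7", the 64-bit OS type and the disk file name are my assumptions, the function name is made up, and the sketch leaves out attaching the disk and the Windows iso file, so please check everything against the VBoxManage documentation for your version before relying on it.

```python
import shlex

def vboxmanage_commands(name="Win7", memory_mb=2048, cpus=2, disk_mb=20480):
    """Build (but do not run) VBoxManage commands for a new virtual machine.

    The name, OS type and disk file name are assumptions for illustration.
    """
    disk_file = f"{name}.vdi"  # hypothetical file name for the virtual disk
    return [
        # Create and register an empty virtual machine
        ["VBoxManage", "createvm", "--name", name,
         "--ostype", "Windows7_64", "--register"],
        # Give it 2GB of RAM and 2 CPUs, as in step 4
        ["VBoxManage", "modifyvm", name,
         "--memory", str(memory_mb), "--cpus", str(cpus)],
        # Create a 20GB disk; the Standard variant is dynamically allocated
        ["VBoxManage", "createmedium", "disk", "--filename", disk_file,
         "--size", str(disk_mb), "--variant", "Standard"],
    ]

commands = vboxmanage_commands()
for command in commands:
    print(shlex.join(command))
```

Keeping the numbers in one place like this also makes it easy to note down what you allocated, which helps when comparing setups with people who use Parallels or VMware.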

The software runs a bit more hesitantly than on my previous university PC, but it does show results. How are people’s experiences with Parallels/VMware? For those, do you also need to allocate a certain percentage of your Macbook’s RAM, CPU and storage for the virtual machine? How much?


November update:

Having used WS on my Mac multiple times now over the course of 1.5 months I’d say it works alright. I can open files and also create keyword lists or concordances without major problems. However, I always have to be careful that I don’t select items or click on buttons too quickly. For example, when I ‘choose texts’ for one of the tools it’s dangerous to hold down the shift key and the downward arrow – usually this makes the whole application freeze and I have to kill it. It’s also worth noting that it’s better not to have too many other programmes running at the same time (also on the Mac OS). This might be a problem with my own computer, though. It was bought on a student budget and is therefore one of the slowest Macbook options from 2012.

One issue that came up regarding Windows is that I forgot to activate it at the beginning (although I had a key! – it didn’t force me to, though…). So last week the Windows screen turned black and I got all blamed and shamed by the operating system (this copy is not genuine!). Unfortunately, when I tried activating it, this didn’t work – the system said I was trying to use a key for the wrong computer. I think this is probably due to confusion caused by the virtual environment. After many stressful attempts at getting through the Microsoft UK customer service hotline I finally got to talk to a human (!) customer service operator who helped me to manually activate my Windows 7…