
Quick tip for R: How to save your dataset in a native R format for future work

This is a note to self more than anything else, but maybe someone learning R out there finds it useful, too.

I lost some time recently because I kept running analyses in R and only saved the results as plots and CSV files. As I'm on a budget MacBook with limited memory, I can't keep many results loaded in an R session (everything stays in memory). So if I want to go back and change a plot, for example to adjust its dimensions, add a title, or filter the data that goes into a subplot… I have to rerun the analysis.

Saving the results in a CSV file is good for future reference, but it won't help with the issue that we can't easily (?) recreate the R objects from it. It seems far easier to save the actual R data in a native R format.

In fact, my 'statistics and programming colleague' has been providing such R files in the RDS format for our project, to save me the time of running the analyses while still giving me the chance to select my own subsets for plots. I'm a bit gutted that I didn't realise the potential of this function for my own work until today. (I am having to rerun the results in order to create nicer plots; but then it's also better to archive the results in an R format than only in CSV, I suppose, because things do change or I might find mistakes in my methodology later…).

To create an RDS file you use this function (signature from the R documentation; for me it's usually sufficient to supply just the object and the file path):

saveRDS(object, file = "", ascii = FALSE, version = NULL,
        compress = TRUE, refhook = NULL)
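
For illustration, here's the round trip I have in mind (a minimal sketch; the 'results' object and the file name are made up for this example):

# create some results and save them to an RDS file
results <- data.frame(word = c("language", "can"), freq = c(42, 17))
saveRDS(results, "results.rds")

# later, or in a fresh R session: read the object back in
# under whatever name you like
my_results <- readRDS("results.rds")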

For the technical details you can refer to the R documentation linked above, or to this post, which explains the difference between 'saveRDS()' and 'save()' in more detail. In a nutshell, 'save()' apparently stores the object together with its name, and loading restores it under that name. So, if my original results were called 'results' and I had meanwhile created another object called 'results', I'd have a problem when I loaded the saved version: it would silently overwrite the new object. With 'saveRDS()' we don't have this problem, because we choose the name when reading the object back in.
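
To make the difference concrete, here is a small sketch (object and file names again made up):

# save() stores the object together with its name ...
results <- data.frame(x = 1:3)
save(results, file = "results.RData")

# ... so load() restores it under that original name, silently
# overwriting any current object called 'results'
results <- "something completely different"
load("results.RData")  # 'results' is the data frame again

# saveRDS() stores the object without its name, so you choose
# the name when reading it back in
saveRDS(results, "results.rds")
archived_results <- readRDS("results.rds")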

Hopefully, this post can be of use to some of you (obviously check what’s most helpful for your work). I’ll start saving all my important R results in this format 🙂



On analysing concordance lines

I start this post by giving a very quick introduction to concordances. If you are already an experienced corpus linguist, you can skip to the final section on categorising concordance lines. I am curious about your own practices for analysing concordance lines: do you print them out and highlight the different patterns? Or do you annotate the lines electronically, using a concordancer or a spreadsheet? Is there any other option that hasn’t occurred to me yet?

The basic display format in corpus linguistics

For the past year or so I was preoccupied with relatively abstract, 'big picture'-style analyses of my corpus (basically key keyword and collocation analysis), but now I have come across a theme for which a smaller-scale, qualitative analysis is more appropriate. (Once I've wrapped it all up, I hope to share some insights. Or you may have to wait for my thesis to get done…).

For me as a corpus linguist, the go-to tool for any qualitative investigation is the concordancer. As the name suggests, it produces concordances. A concordance is the basic display format in corpus linguistics: it lists snippets of text that illustrate the use of a particular word or phrase in a corpus. Concordance analysis has brought the discipline a long way, especially once Sinclair developed very systematic ways of analysing concordance lines for dictionary-making. (Sinclair's guidelines are recorded in his book Reading Concordances; it's a shame that Google Books has no preview…).

Consider this quote from Martin Wynne’s (2008, online) handbook chapter on concordancing:

For many linguists, searching and concordancing is what they mean by “doing corpus linguistics”.

The way we read concordance lines is quite different from the way we read a text. This vertical reading may take some getting used to. Here's an example: concordance lines for language on WebCorp:

[Screenshot: WebCorp concordance lines for 'language']

You can use WebCorp yourself to produce concordance lines from the web; or you can access corpora that are available online with integrated concordance functionality, such as BNCweb or the BYU corpora. (If you want to run concordances on specialised subcorpora via the BYU interface, you might be interested in the slides and the handout from my session at the University of Birmingham Summer School in Corpus Linguistics this year.)

Of course, we often want to use corpus linguistic tools on materials that haven't been made widely available, because it is often necessary to prepare a corpus from scratch for a particular research question. To create concordances for your own texts you can use concordancers like AntConc or WordSmith Tools (which you could buy if your institution doesn't have a licence).

What are your personal preferences for analysing concordance lines?

Concordance analysis is all about viewing a word (or phrase) in its co-text to identify patterns in the way it is used. It's often helpful to re-sort the concordance lines, and concordance tools usually let you re-sort based on the surrounding words (in positions 1-5 or more to the left and right); the toy sketch below shows the basic idea.
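
(Just to make the mechanics concrete, here is a toy KWIC concordance in base R, with a made-up sample text; real concordancers are of course far more capable.)

# a toy KWIC concordance in base R (illustration only)
tokens <- c("you", "can", "open", "a", "can", "of", "coke",
            "if", "you", "can", "find", "one")
node <- "can"
hits <- which(tokens == node)

kwic <- data.frame(
  left  = sapply(hits, function(i) paste(tokens[max(1, i - 3):(i - 1)], collapse = " ")),
  node  = node,
  right = sapply(hits, function(i) paste(tokens[(i + 1):min(length(tokens), i + 3)], collapse = " "))
)

# re-sort alphabetically on the right co-text (the words after the node)
kwic[order(kwic$right), ]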


According to Martin Wynne (2008, online),

[t]his type of manual annotation of concordance lines is often done on concordance printouts with a pen. Software which allows the annotation to be done on the electronic concordance data makes it possible to sort on the basis of the annotations, and to thin the concordance to leave only those lines with or without a certain manual categorisation.

Personally, I usually start with a printout of the simple concordance lines. Then, once I have identified some simple categories, I often move on to an Excel spreadsheet. I like being able to add columns for categories (I should just not overdo it, like in the photo…). Moreover, in some versions of Excel it is possible to select and change the font of particular words within the same cell (this seems to work in Excel for Mac but not for Windows). That way, I can highlight the word or phrase that prompts the category for the concordance line. It also makes it possible to assign a single concordance line to several categories.

[Photo: categorising concordance lines in an Excel spreadsheet]
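
(If your concordancer can export plain text, preparing such a spreadsheet from R is straightforward; a hedged sketch, reusing the toy kwic data frame from the earlier sketch, with a made-up file name:)

# add an empty column for manual categories, then export for Excel
kwic$category <- ""
write.csv(kwic, "concordance_for_annotation.csv", row.names = FALSE)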

[Screenshot: the set colours available in WordSmith Tools]

Some concordancers provide functionality for categorising concordance lines. In WordSmith Tools it is possible to assign categories ('sets'). I have only recently tried this function and I'm quite impressed with the range of colours that are available, which you can see in the screenshot on the left. More information is available from the manual. BNCweb also provides a (simple) categorisation function with up to 6 categories. In the example from the screenshot below we would distinguish between can as the modal verb and can as the container for a drink. Of course, the modal is much more frequent (in general language usage, not in a text about coke cans…). Therefore all the example concordance lines represent the modal usage.

[Screenshot: BNCweb concordance lines for 'can' with the categorisation function]

I am curious about these features and to what extent people use them. If you don't use these functions, how else do you categorise concordance lines? Do you do it manually, after printing them out? And in practice, how often do you analyse concordance lines? Are they quite important in your research, or do you focus on more quantitative aspects, checking concordance lines only when necessary?

Further reading

Sinclair, J. (2003). Reading Concordances: An Introduction. Harlow: Pearson/Longman.
Wynne, M. (2008). Searching and concordancing. In A. Lüdeling & M. Kytö (Eds.), Corpus Linguistics: An International Handbook (Vol. 1, pp. 706–737). Berlin: Mouton de Gruyter. [pre-publication draft available online]