A few words on corpus linguistics
December 12, 2011
Part 1 of 2
by Ron Carter
In the first of a two-part blog entry, Prof. Ronald Carter of the University of Nottingham provides a brief introduction to corpora and corpus linguistics, exploring ways in which corpora are currently being used to inform language teaching and the development of teaching materials.
What is a corpus?
corpus noun (plural corpuses or corpora) the collection of a single writer’s work or of writing about a particular subject, or a large amount of written and sometimes spoken material collected to show the state of a language
Cambridge Advanced Learner’s Dictionary Third Edition (2008) Cambridge: Cambridge University Press
Many corpora these days run to millions of words. The British National Corpus (BNC), for example, consists of 100 million words of English. The written part (90%) includes newspapers, magazines, journals, books, letters, memos, essays, etc. The spoken part (10%) includes conversations, recorded in a way that achieves a demographic balance, as well as a range of spoken language from business or government meetings, radio shows, phone-ins, etc. These large collections of text are stored and read electronically, allowing researchers to employ a variety of software to reveal different patterns of language that exist within the corpus.
The example of CANBEC
CANBEC stands for the Cambridge and Nottingham Business English Corpus. The project was established in the School of English Studies at the University of Nottingham, UK, and was developed together with Cambridge University Press. CANBEC consists of one million words of spoken data recorded in a variety of different businesses, from big multinational companies to small partnerships. Most of the data was obtained in the UK, but some was recorded in other countries and includes a range of non-native speakers. The data covers internal meetings, external meetings (involving two or more different companies), office talk, sales presentations, telephone conversations and general office banter. Meetings form the largest part of the corpus.
The corpus was completed in 2003, and the data has enabled researchers to find out how real people speak and use English today in a work environment, how business language really works, how to teach it better and how to make language learning materials for business English. It has already been used extensively to inform a range of publications produced by the Press, including teaching materials, grammars, dictionaries, course books and, most recently, the Cambridge Business English Dictionary.
Single word frequency
A computerised corpus allows us to study lots of different things about the language. One of the most straightforward and revealing things it does is to generate lists of the most frequent words. Here is a list of the twenty most frequent single words from the CANBEC corpus.
It is interesting to note here how key content words (order, marketing) mix with markers of spoken interaction (ok, hmm) and that a computer reads singular and plural nouns as separate items.
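For readers curious about how such a list is produced, here is a minimal sketch in Python. It is an illustration only, not the software used for CANBEC: the function name word_frequencies, the simple tokenising rule and the sample sentence are all assumptions made for this example, and the sample text is invented rather than taken from the corpus.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=20):
    """Return the top_n most frequent word forms as (word, count) pairs.

    Hypothetical helper for illustration; real corpus tools use far more
    careful tokenisation than this.
    """
    # Lowercase the text and pull out runs of letters, keeping
    # apostrophes that sit inside a word (e.g. "don't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    # Count each distinct word form and keep the most frequent ones.
    return Counter(tokens).most_common(top_n)


# Invented sample text, NOT real CANBEC data.
sample = (
    "Ok, so the order is in. Hmm, ok. The orders from marketing are late, "
    "so marketing will check the order."
)

for word, count in word_frequencies(sample, top_n=5):
    print(word, count)
```

Because the sketch counts word forms exactly as they appear, "order" and "orders" end up as two separate entries, which is the behaviour described above for singular and plural nouns.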
However, whilst knowing the frequency of single words can be useful, exploration of ‘chunks’ of language can be even more revealing. This will be discussed in more detail in the second part of this blog entry, along with an overview of ways in which analysis of corpus data can help inform language teaching and materials development.
Some text for this blog has been extracted from: O’Keeffe, A., McCarthy, M. and Carter, R. (2007) From Corpus to Classroom: Language use and language teaching. Cambridge: Cambridge University Press. For further reading on CANBEC see Handford, M. (2011) The Language of Business Meetings. Cambridge: Cambridge University Press.
Prof. Ronald Carter, School of English Studies, University of Nottingham