A few words on corpus linguistics

Part 1 of 2

In the first part of a two-part blog entry, Prof. Ronald Carter of the University of Nottingham provides a brief introduction to corpora and corpus linguistics, exploring ways in which corpora are currently being used to inform language teaching and the development of teaching materials.

What is a corpus?

corpus noun (plural corpuses or corpora) the collection of a single writer’s work or of writing about a particular subject, or a large amount of written and sometimes spoken material collected to show the state of a language

Cambridge Advanced Learner’s Dictionary Third Edition (2008) Cambridge: Cambridge University Press

Many corpora these days run to millions of words. The British National Corpus (BNC), for example, consists of 100 million words of English. The written part (90%) includes newspapers, magazines, journals, books, letters, memos, essays, etc. The spoken part (10%) includes conversations, recorded in a way that achieves a demographic balance, as well as a range of spoken language from business or government meetings, radio shows, phone-ins, etc. These large collections of text are stored and read electronically, allowing researchers to use a variety of software to reveal patterns of language within the corpus.

The example of CANBEC

CANBEC stands for the Cambridge and Nottingham Business English Corpus. The project was established in the School of English Studies at the University of Nottingham, UK, and was developed together with Cambridge University Press. The CANBEC corpus consists of one million words of spoken data recorded in a variety of businesses, from big multinational companies to small partnerships. Most of the data was recorded in the UK, but some was recorded in other countries and includes a range of non-native speakers. The data covers internal meetings, external meetings (involving two or more different companies), office talk, sales presentations, telephone conversations and general office banter. Meetings form the largest part of the corpus.

The corpus was completed in 2003, and the data has enabled researchers to find out how real people speak and use English today in a work environment, how business language really works, how to teach it better and how to make language learning materials for business English. It has already been used extensively to inform a range of materials produced by the Press, including teaching materials, grammars, dictionaries, course books and, most recently, the Cambridge Business English Dictionary.

Single word frequency

A computerised corpus allows us to study lots of different things about the language. One of the most straightforward and revealing things it does is to generate lists of the most frequent words. Here is a list of the twenty most frequent single words from the CANBEC corpus.

	CANBEC
1	we
2	we’ve
3	hmm
4	customer
5	we’re
6	sales
7	product
8	orders
9	need
10	customers
11	meeting
12	order
13	stock
14	okay
15	company
16	marketing
17	the
18	business
19	mail
20	gonna

It is interesting to note here how key content words (order, marketing) mix with markers of spoken interaction (okay, hmm), and that a computer reads singular and plural nouns as separate items (order and orders, customer and customers).
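To make the idea of a frequency list concrete, here is a minimal sketch in Python of how such a list might be generated. It is not the software used for CANBEC: the function name count_words, the tokenising pattern and the sample sentence are all invented for illustration. It does show why a simple count treats "order" and "orders", or "we" and "we've", as separate items.

```python
import re
from collections import Counter


def count_words(text):
    """Count how often each word form occurs in a text (illustrative only)."""
    # Lowercase the text and keep apostrophes inside words, so that
    # contractions such as "we've" are counted as single items.
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    return Counter(tokens)


# An invented sample, not taken from any real corpus.
sample = (
    "We need the order. We've got two orders in stock, okay? "
    "Hmm, we need to check the order with the customer."
)

counts = count_words(sample)

# Print the five most frequent word forms with their counts.
for word, frequency in counts.most_common(5):
    print(word, frequency)
```

Because the count works on surface forms only, grouping "order" with "orders" would need an extra step (often called lemmatisation), which this sketch deliberately leaves out.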

However, whilst knowing the frequency of single words can be useful, exploration of ‘chunks’ of language can be even more revealing. This will be discussed in more detail in the second part of this blog entry, along with an overview of ways in which analysis of corpus data can help inform language teaching and materials development.

Some text for this blog has been extracted from: O’Keeffe, A., McCarthy, M. and Carter, R. (2007) From Corpus to Classroom: Language use and language teaching. Cambridge: Cambridge University Press. For further reading on CANBEC see Handford, M. (2011) The Language of Business Meetings. Cambridge: Cambridge University Press.

Prof. Ronald Carter, School of English Studies, University of Nottingham
