A few words on corpus linguistics part 2

by Ron Carter

Part 2 of 2

In the second of this two-part blog entry, Prof. Ronald Carter of the University of Nottingham looks in more detail at the kind of information corpora can reveal about the use of language and why this is so important for the development of language teaching materials.

Language chunks

Here is a list of the twenty most frequent three-word chunks from the Cambridge and Nottingham Business English Corpus (CANBEC) compared with a corpus of spoken academic English (ACAD). The numbers are occurrences per million words. The items in bold are discussed below.

      CANBEC           per m         Spoken ACAD      per m
  1   I don’t know     642       1   a lot of         477
  2   a lot of         563       2   I don’t know     469
  3   at the moment    485       3   one of the       442
  4   we need to       438       4   you can see      364
  5   I don’t think    378       5   this is a        358
  6   the end of       376       6   you have to      343
  7   in terms of      243       7   this is the      338
  8   a bit of         241       8   in terms of      300
  9   be able to       237       9   a sort of        297
 10   at the end       235      10   there is a       276
 11   end of the       230      11   and this is      271
 12   and I think      229      12   look at the      268
 13   I think it’s     229      13   the end of       265
 14   to do it         223      14   the sort of      265
 15   we have to       208      15   at the end       253
 16   have a look      196      16   you want to      253
 17   I think we       194      17   you know the     250
 18   you know the     192      18   do you think     247
 19   a couple of      187      19   to do with       247
 20   we’ve got a      184      20   and so on        239
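For readers curious how figures like these are produced, here is a minimal sketch in Python of counting three-word chunks and normalising the counts to occurrences per million words. It is an illustration only, not the method used to build CANBEC or ACAD: the function names, the simple tokeniser and the sample sentence are all assumptions made for this example, and real corpus tools handle transcription conventions, speaker turns and sentence boundaries far more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed simplification: lowercase the text and keep apostrophes
    # inside words, so that "don't" counts as a single token.
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())


def chunk_frequencies(text, n=3):
    # Slide a window of n tokens across the text and count each chunk.
    # Note: this sketch lets chunks run across sentence boundaries.
    tokens = tokenize(text)
    chunks = Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return chunks, len(tokens)


def per_million(count, total_tokens):
    # Normalised frequency: raw count scaled to a corpus of one million words.
    return count * 1_000_000 / total_tokens


# Hypothetical sample text, far too small to say anything about real usage.
sample = "I don't know if we need to do it. We need to look at it, I don't know."
chunks, total = chunk_frequencies(sample)
for chunk, count in chunks.most_common(3):
    print(chunk, count, round(per_million(count, total)))
```

Normalising to a common base is what makes the two columns comparable: a chunk that occurs 642 times in a corpus of exactly one million words has a per-million figure of 642, whatever the size of the corpus it is compared with.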

I don’t know is high in both corpora, and in both cases it is frequently followed by reporting clauses [1] beginning with if or a wh-word.  A lot of, a couple of and sort of, all rather vague expressions, are also evident in both (though not all shown in the table), as is the specifying expression [2] in terms of.  So we have a mix of specific and vague expressions but overall there is more vagueness. The CANBEC list has four chunks involving think, perhaps reflecting the constant speculating and hedging in business negotiations. And I don’t know is often used when beginning to explore possibilities, allowing us not to reveal what we do know too openly. Vague language also allows us to hedge our bets when we speak with one another.

Both corpora have chunks that refer to looking at things (i.e. considering things), with ACAD also including you can see, a structure which mirrors more direct one-to-one instruction typical of many academic contexts. CANBEC has a high occurrence of at the moment, perhaps suggesting the constant flux and change in business situations. The CANBEC list also brings together the high-frequency key words we and need (we need to at no. 4). This reflects the high incidence of statements of collective goals in spoken business English (mirroring the corporate mantra there’s no I in team), since need is often used in business requests and directives when we don’t want to sound too forceful.

We in CANBEC carries a wide range of references, from very broad corporate references to smaller, group references and to the individual speaker, who may use it to shelter behind corporate authority or responsibility or to avoid embarrassment for those present in, say, a meeting.  The following examples illustrate this:

We need to have a close look at it and then you’re gonna check it over aren’t you?
We need to revisit that really.
We need to approach them to see if we can get the price down.
We need to figure it out about the server.

Here is a stretch of conversation from the corpus which illustrates one use of we as a reference to the individual speaker (with all names and places anonymised, of course).

[Meeting between a multinational car manufacturer and a British hydraulics company. They are discussing product development.]

Speaker 3: I mean ultimat… ultimately it’s your decision whether you want a…

Speaker 1: True. But er

Speaker 3: a hard blow fuse if you like or a a resettable fuse.

Speaker 1: You’re right. But the thing is I mean we need to know what your rationale is. And if you say ‘We prefer to have a resettable one because we we know this is a problem’ then it will help Nigel to make that decision you see.

The relevance of corpora data

We have so far simply scratched the surface of what a corpus can reveal, but it may give you a flavour of the possibilities open to course book, materials and dictionary writers. Clearly there are times when frequency lists and hard quantitative evidence of patterns of language from multi-million-word databases are valuable; and there are other times when it is important to look at language more qualitatively, exploring longer stretches of continuous language from the corpus. There are times when it helps for examples to be made up for the purposes of graded learning; and there are other times when it helps for there to be examples of real English. Any examples drawn from the CANBEC database will be authentic, produced in real contexts of use, and the comparison between different corpora allows the different linguistic fingerprints of different registers to emerge.

A corpus allows us to look at the language a little more objectively and to give evidence for our judgements and intuitions. Clearly, CANBEC needs to be supplemented by written business corpora from the Cambridge English Corpus, but this corpus both helps us understand better the differences and distinctions between spoken and written versions of the language and gives us insights and information that go beyond what intuition and conventional knowledge about the language can tell us.

Some text for this blog has been extracted from: O’Keeffe, A., McCarthy, M. and Carter, R. (2007) From Corpus to Classroom: Language use and language teaching. Cambridge: Cambridge University Press. For further reading on CANBEC see Handford, M. (2011) The Language of Business Meetings. Cambridge: Cambridge University Press.

Prof. Ronald Carter, School of English Studies, University of Nottingham

[1] A clause used to report someone’s speech.

[2] A phrase used to specify which person or thing we are talking about.

12 thoughts on “A few words on corpus linguistics part 2”

  1. Harry

    What strikes me, as an American, is the fact that the words and phrases that appear in these messages are very common on this side of the Atlantic as well; there is nothing distinctly “British” here. (The Queen might claim copyright on the pseudo-royal “we,” but I’ve heard it far too often from minor corporate executives.) You are clearly discussing world English.

    Perhaps this reflects the dominance of American voices in popular media (film, TV, hip hop, etc.). There’s also the fact that American academics tend to dominate fields related to business management. (I’ve heard this from an educator in Curacao.) Then there’s the fact that increased international travel and communications — those phone banks in India and the Philippines, for instance — is erasing regional differences, much as regional accents have faded (but not disappeared) in the US.

    The UK, of course, is brilliant at protecting regional accents. I once visited Durham and had no idea what most people were saying. I’m sure, however, that they would quickly revert to your list in a business environment.

  2. charles bornhoeft

    I have a new word:
    (Tals)…..a technologically advanced living species navigating the ufos and including mission controll…these spices are a select group of individuals.
    the tals have a great jump start in this Darwinen universe …..
    alien…….its slang……meaning…..anything not from earth….alien microbe ……
    extraterrestrial …..its slang….to long….meaning……anything not from earth…..
    give the technologically advanced living species navigating the ufos a name thank you. Be advised that we are already using the word in the field…..

  3. Pingback: Linguistics: oddments, miscellany and paraphernalia | ELT Infodump

  4. Being not just in criminal law, but a heavy hitter, you want to tell me about corpus. Lets make a fun duel for the members if you are sure enough of Y O U lol

    could be fun and I am sure the general membership took much away from your good but rather drab assistance.
    God Bless – Keep Tryin

  5. Pingback: New words – 17 February 2020 – About Words – Cambridge Dictionaries Online blog – Get Proficiency in English

  6. praests

    Chomsky sez corpora ar USELESS –

    ‘innit’ is ‘authentic’ – the definition of authenticity is suspect.

    ‘er’ is the sound in ERic.
    why’r pepl insensibl?

    And why can’t pepl say ‘one’ ie WUN
    when they southernise or americanise accent ?

  7. Sparky

    Do your systems track whether or not a word or phrase was used correctly, and should it count if it was not? And at what point (what frequency, I suppose, or maybe how many similarly incorrect uses) would a particular use come to be considered acceptable?

    I’ve been thinking about this a lot lately, as I’ve spent a fair amount of this year of isolation contributing to and editing a blog that’s been developed over the last five or so years. Previous major contributors were American, British, and Brazilian, with varying levels of education in English (and with varying levels of translation services) — and I believe that they’re mostly younger than I am, as well (my assessment, based on their use of the language). It’s been by turns fascinating, somewhat horrifying, amusing, confusing, and — most importantly — time consuming.

  8. Julie Kline

    Question: Are all these posted responses above mine responding to the blog post : “A few words on corpus linguistics part 2”? I am confused by some of them. 🙂 lol

  9. lorimisttia

    English is an interesting language but yes, any language will have two or more meanings so it is very rich. Thanks for this article useful for me and everyone.
