A few words on corpus linguistics part 2
December 14, 2011
by Ron Carter
Part 2 of 2
In the second part of this two-part blog entry, Prof. Ronald Carter of the University of Nottingham looks in more detail at the kind of information corpora can reveal about the use of language and why this is so important for the development of language teaching materials.
Here is a list of the top twenty most frequent three-word chunks from the Cambridge and Nottingham Business English Corpus (CANBEC) compared with a corpus of spoken academic English (ACAD). The numbers are occurrences per million words. The items in bold are discussed below.
|Rank||CANBEC||per m||Rank||Spoken ACAD||per m|
|1||I don’t know||642||1||a lot of||477|
|2||a lot of||563||2||I don’t know||469|
|3||at the moment||485||3||one of the||442|
|4||we need to||438||4||you can see||364|
|5||I don’t think||378||5||this is a||358|
|6||the end of||376||6||you have to||343|
|7||in terms of||243||7||this is the||338|
|8||a bit of||241||8||in terms of||300|
|9||be able to||237||9||a sort of||297|
|10||at the end||235||10||there is a||276|
|11||end of the||230||11||and this is||271|
|12||and I think||229||12||look at the||268|
|13||I think it’s||229||13||the end of||265|
|14||to do it||223||14||the sort of||265|
|15||we have to||208||15||at the end||253|
|16||have a look||196||16||you want to||253|
|17||I think we||194||17||you know the||250|
|18||you know the||192||18||do you think||247|
|19||a couple of||187||19||to do with||247|
|20||we’ve got a||184||20||and so on||239|
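As a rough illustration of how a list like the one above might be produced, the sketch below counts three-word chunks in a text and normalises the raw counts to occurrences per million words. It is not the procedure used for CANBEC or ACAD: the function names (`tokenize`, `chunk_frequencies`, `per_million`) and the sample text are invented for illustration, and the tokenizer is a deliberately simple assumption.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping apostrophes."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())


def chunk_frequencies(tokens, n=3):
    """Count every run of n consecutive tokens (an n-word chunk).

    Simplification: chunks are allowed to run across sentence boundaries.
    """
    chunks = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(chunks)


def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size


# Toy text standing in for a corpus (invented, not taken from CANBEC).
sample = (
    "I don't know if we need to change it. "
    "We need to look at the price. I don't know what they want."
)
tokens = tokenize(sample)
counts = chunk_frequencies(tokens)

# Show the two most frequent chunks with raw and normalised counts.
for chunk, count in counts.most_common(2):
    print(chunk, count, round(per_million(count, len(tokens))))
```

Normalising to a per-million rate is what makes corpora of different sizes comparable. On a sample this small the rates are meaningless in themselves; only with a multi-million-word corpus do they become informative.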
I don’t know is high in both corpora, and in both cases it is frequently followed by reporting clauses beginning with if or a wh-word. A lot of, a couple of and sort of, all rather vague expressions, are also evident in both (though not all shown in the table), as is the specifying expression in terms of. So we have a mix of specific and vague expressions but overall there is more vagueness. The CANBEC list has four chunks involving think, perhaps reflecting the constant speculating and hedging in business negotiations. And I don’t know is often used when beginning to explore possibilities, allowing us not to reveal what we do know too openly. Vague language also allows us to hedge our bets when we speak with one another.
Both corpora have chunks that refer to looking at things (i.e. considering things), with ACAD also including you can see, a structure which mirrors the more direct one-to-one instruction typical of many academic contexts. CANBEC has a high occurrence of at the moment, perhaps suggesting the constant flux and change in business situations. The CANBEC list also brings together the high-frequency key words we and need (we need to at no. 4). This reflects the high incidence of statements of collective goals in spoken business English (mirroring the corporate mantra ‘there’s no I in team’), for need is often used in business requests and directives when we don’t want to sound too forceful.
We in CANBEC carries a wide range of references, from very broad corporate references to smaller, group references and to the individual speaker, who may use it to shelter behind corporate authority or responsibility or to avoid embarrassment for those present in, say, a meeting. The following examples illustrate this:
We need to have a close look at it and then you’re gonna check it over aren’t you?
We need to revisit that really.
We need to approach them to see if we can get the price down.
We need to figure it out about the server.
Here is a stretch of conversation from the corpus which illustrates one use of we as a reference to the individual speaker (with all names and places anonymised, of course).
[Meeting between a multinational car manufacturer and a British hydraulics company. They are discussing product development.]
Speaker 3: I mean ultimat… ultimately it’s your decision whether you want a…
Speaker 1: True. But er …
Speaker 3: a hard blow fuse if you like or a a resettable fuse.
Speaker 1: You’re right. But the thing is I mean we need to know what your rationale is. And if you say ‘We prefer to have a resettable one because we we know this is a problem’ then it will help Nigel to make that decision you see.
The relevance of corpus data
We have so far simply scratched the surface of what a corpus can reveal, but this may give you a flavour of the possibilities open to course book, materials and dictionary writers. Clearly there are times when frequency lists and hard quantitative evidence of patterns of language from multi-million-word databases are valuable; and there are other times when it is important to look at language more qualitatively, exploring longer stretches of continuous language from the corpus. There are times when it helps for examples to be made up for the purposes of graded learning; and there are other times when it helps for there to be examples of real English. Any examples drawn from the CANBEC database will be authentic, produced in real contexts of use, and the comparison between different corpora allows the distinct linguistic fingerprints of different registers to emerge.
A corpus allows us to look at the language a little more objectively and to give evidence for our judgements and intuitions. Clearly, CANBEC needs to be supplemented by written business corpora from the Cambridge English Corpus, but this corpus both helps us better understand the differences and distinctions between spoken and written versions of the language and gives us insights and information that go beyond what intuition and conventional knowledge about the language can tell us.
Some text for this blog has been extracted from: O’Keeffe, A., McCarthy, M. and Carter, R. (2007) From Corpus to Classroom: Language use and language teaching. Cambridge: Cambridge University Press. For further reading on CANBEC see Handford, M. (2011) The Language of Business Meetings. Cambridge: Cambridge University Press.
Prof. Ronald Carter, School of English Studies, University of Nottingham