Standard reference corpora for key analysis

The following corpora have been used to provide standard frequency information against which you can compare your own corpus (at the word, POS or semantic levels):

Corpus    Word    POS    Semantic

BNC Sampler spoken: 982,712 words from BNC Sampler spoken corpus

BNC Sampler written: 968,267 words from BNC Sampler written corpus

BNC Sampler spoken CG: 480,759 words from BNC Sampler spoken context-governed corpus

BNC Sampler spoken Demog: 501,953 words from BNC Sampler spoken demographic corpus

BNC Sampler written Imag: 222,541 words from BNC Sampler written imaginative corpus

BNC Sampler written Inform: 745,726 words from BNC Sampler written informative corpus

BNC Sampler CG (Spoken) Business: 141,143 words from BNC Sampler Context Governed Business corpus
(Company and trades union talks or interviews; business meetings; sales demonstrations etc)

BNC Sampler CG (Spoken) Educational: 86,575 words from BNC Sampler Context Governed Educational and Informative corpus
(Lectures, talks and educational demonstrations; news commentaries; classroom interaction etc)

BNC Sampler CG (Spoken) Leisure: 144,925 words from BNC Sampler Context Governed Leisure corpus
(Sports commentaries; broadcast chat shows and phone-ins; club meetings and speeches etc)

BNC Sampler CG (Spoken) Institutional: 151,445 words from BNC Sampler Context Governed Institutional corpus
(Political speeches; sermons; local and national governmental proceedings etc)

British English 2006 (BE06): 929,862 words from published general written British English. It has the same sampling frame as the LOB and FLOB corpora.

American English 2006 (AmE06): 966,609 words from published general written American English, also using the same sampling frame as the LOB and FLOB corpora.
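As a rough illustration of how a reference corpus from the list above might be used, the sketch below compares the frequency of one item in your own corpus against its frequency in a reference corpus. It assumes the log-likelihood statistic, a common keyness measure; this section does not state which statistic is applied, so treat that choice as an assumption. The function name `log_likelihood`, the size of the user's corpus and both item frequencies are hypothetical; only the BE06 size is taken from the list above.

```python
import math


def log_likelihood(freq_own, size_own, freq_ref, size_ref):
    """Two-corpus log-likelihood (G2) for a single item.

    freq_* are the item's frequencies, size_* the corpus sizes in words.
    """
    total_freq = freq_own + freq_ref
    total_size = size_own + size_ref
    # Expected frequencies if the item were spread evenly over both corpora.
    expected_own = size_own * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_own, expected_own), (freq_ref, expected_ref)):
        if observed > 0:  # a zero observation contributes nothing
            ll += observed * math.log(observed / expected)
    return 2 * ll


BE06_SIZE = 929_862  # word count given for BE06 in the list above

# Hypothetical figures: the item occurs 120 times in a 50,000-word corpus
# and 400 times in the reference corpus.
score = log_likelihood(120, 50_000, 400, BE06_SIZE)
print(f"LL = {score:.2f}")
```

A larger value indicates a bigger difference in relative frequency between the two corpora. The value alone does not show the direction of the difference, so the relative frequencies should be compared as well.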

Notes:
1. More information on the BNC (British National Corpus) Sampler corpus is available from the Oxford BNC website, and explanatory documentation is available from UCREL.
2. The encoding of punctuation in the BNC Sampler CG corpora differs from that in the other corpora in Wmatrix. Punctuation frequencies are included in these lists and should be ignored during comparisons.
3. The descriptions of the contents of the BNC Sampler CG corpora above are taken from pp. 32-33 of Aston, G. and Burnard, L. (1998). The BNC Handbook: Exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press. Those pages describe the context-governed section of the whole BNC rather than just the BNC Sampler.
4. The word counts given here treat each multi-word expression as a single item (see section 5 below).
5. More information on the BE06 and AmE06 corpora can be found on Paul Baker's website.
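Note 2 asks for punctuation frequencies to be ignored when comparing against the BNC Sampler CG lists. One possible way to do this, sketched below, is to drop punctuation-only items from a frequency list before the comparison. The format of the frequency lists is not described here, so the dictionary layout, the function name `drop_punctuation` and the sample counts are all hypothetical. The check uses Python's `string.punctuation`, which covers ASCII punctuation only.

```python
import string


def drop_punctuation(freqs):
    """Return a copy of freqs without items made up only of punctuation."""
    return {
        item: count
        for item, count in freqs.items()
        if not all(ch in string.punctuation for ch in item)
    }


# Hypothetical frequency list mixing words and punctuation marks.
sample = {"the": 500, ",": 430, "meeting": 12, "...": 7, "don't": 9}
cleaned = drop_punctuation(sample)
print(cleaned)
```

Items that merely contain a punctuation character, such as "don't", are kept; only items consisting entirely of punctuation are removed.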