The following corpora have been used to provide standard frequency information against which you can compare your own corpus (at the word, POS or semantic levels):
Corpus | Word | POS | Semantic |
---|---|---|---|
BNC Sampler spoken: 982,712 words from BNC Sampler spoken corpus | |||
BNC Sampler written: 968,267 words from BNC Sampler written corpus | |||
BNC Sampler spoken CG: 480,759 words from BNC Sampler spoken context-governed corpus | |||
BNC Sampler spoken Demog: 501,953 words from BNC Sampler spoken demographic corpus | |||
BNC Sampler written Imag: 222,541 words from BNC Sampler written imaginative corpus | |||
BNC Sampler written Inform: 745,726 words from BNC Sampler written informative corpus | |||
BNC Sampler CG (Spoken) Business: 141,143 words from BNC Sampler Context Governed Business corpus | |||
BNC Sampler CG (Spoken) Educational: 86,575 words from BNC Sampler Context Governed Educational and Informative corpus | |||
BNC Sampler CG (Spoken) Leisure: 144,925 words from BNC Sampler Context Governed Leisure corpus | |||
BNC Sampler CG (Spoken) Institutional: 151,445 words from BNC Sampler Context Governed Institutional corpus | |||
British English 2006 (BE06): 929,862 words from published general written British English. It has the same sampling frame as the LOB and FLOB corpora. | |||
American English 2006 (AmE06): 966,609 words from published general written American English, also using the same sampling frame as the LOB and FLOB corpora. | |||
Notes:
1. More information on the BNC (British National Corpus) Sampler corpus is available from the
Oxford BNC website, and explanatory documentation is available from UCREL.
2. The encoding of punctuation in the BNC Sampler CG corpora differs from
other corpora in Wmatrix. Frequencies of punctuation are included in these lists and should be
ignored during comparisons.
3. The descriptions of the contents of the BNC Sampler CG corpora above are taken
from pp. 32-33 of Aston, G. and Burnard, L. (1998),
The BNC Handbook: Exploring the British National Corpus with SARA, Edinburgh University Press, Edinburgh.
That text describes the context-governed section of the whole BNC rather
than just the BNC Sampler.
4. The word counts here treat each multi-word expression as a single item (see section 5 below).
5. More information on the BE06 and AmE06 corpora can be found on Paul Baker's website.
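
Because the reference corpora differ in size from each other and, most likely, from your own corpus, raw counts are not directly comparable. The following is a minimal sketch, not a description of how Wmatrix itself performs the comparison, showing one common way to normalise counts to a rate per million words. The corpus sizes are taken from the table above; the function name `per_million` and all item counts are hypothetical and chosen purely for illustration.

```python
# Token totals for two of the reference corpora listed in the table above.
REFERENCE_SIZES = {
    "BNC Sampler spoken": 982_712,
    "BNC Sampler written": 968_267,
}


def per_million(count: int, corpus_size: int) -> float:
    """Return an item's frequency normalised to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


# Hypothetical counts, for illustration only (not taken from any real frequency list).
own_count, own_size = 150, 50_000
ref_count = 1_200
ref_size = REFERENCE_SIZES["BNC Sampler written"]

own_rate = per_million(own_count, own_size)
ref_rate = per_million(ref_count, ref_size)

print(f"own corpus: {own_rate:.1f} per million")
print(f"reference:  {ref_rate:.1f} per million")
```

Normalised rates only make the two figures comparable; whether a difference is meaningful is a separate question that needs a significance or effect-size measure.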