Document SAMPLER.HTM
UCREL, Lancaster University, Lancaster LA1 4YT, UK
March 1998
The British National Corpus Sampler Corpus: Explanatory documentation
1. Introduction
The BNC Sampler Corpus is a subcorpus of the British National
Corpus, consisting of approximately one-fiftieth of the whole
corpus, viz. two million words. The Sampler Corpus is word-class
tagged, using a more detailed tagset than has been used for the
BNC as a whole. It also has the advantage that all the word-class
tags assigned to words have been manually checked and, where necessary,
corrected. Consequently, the number of errors in word-class tagging
must be very small. The advantages of the Sampler Corpus, then,
are its more detailed tagset and its manually checked word-class tags.
2. The Constitution of the BNC Sampler Corpus
The Sampler Corpus consists of the following text categories.
Table 2:1 gives the number of words in each category.
Table 2:1 - BNC Sampler Corpus (2,001,394 words)
Context-Governed (496,852) | Demographic (493,852) [by socio-economic class] | Imaginative (231,663) | Informative (779,027) |
Leisure (136,606) | I (AB) (164,933) | Drama (23,786) | Pure science (32,974) |
Educational (80,463) | II (C1) (98,700) | Poetry (30,144) | Applied science (117,685) |
Business (134,275) | III (C2) (137,686) | Prose Fiction (177,733) | Social science (29,868) |
Public/Institutional (145,508) | IV (DE) (92,533) | | World affairs (277,128) |
| | | Commerce & finance (92,057) |
| | | Arts (51,645) |
| | | Belief & thought (43,626) |
| | | Leisure (134,044) |
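The figures in Table 2:1 can be cross-checked by summing the subcategory
counts and comparing them with the stated category totals. The following
is a minimal sketch in Python; the variable names are illustrative and
are not part of the corpus distribution.

```python
# Word counts per subcategory, copied from Table 2:1.
TABLE_2_1 = {
    "Context-Governed": {
        "Leisure": 136_606,
        "Educational": 80_463,
        "Business": 134_275,
        "Public/Institutional": 145_508,
    },
    "Demographic": {
        "I (AB)": 164_933,
        "II (C1)": 98_700,
        "III (C2)": 137_686,
        "IV (DE)": 92_533,
    },
    "Imaginative": {
        "Drama": 23_786,
        "Poetry": 30_144,
        "Prose Fiction": 177_733,
    },
    "Informative": {
        "Pure science": 32_974,
        "Applied science": 117_685,
        "Social science": 29_868,
        "World affairs": 277_128,
        "Commerce & finance": 92_057,
        "Arts": 51_645,
        "Belief & thought": 43_626,
        "Leisure": 134_044,
    },
}

# Totals as stated in the table's header row and caption.
STATED_TOTALS = {
    "Context-Governed": 496_852,
    "Demographic": 493_852,
    "Imaginative": 231_663,
    "Informative": 779_027,
}
STATED_GRAND_TOTAL = 2_001_394

# Sum each column of the table, then sum the columns.
category_sums = {name: sum(rows.values()) for name, rows in TABLE_2_1.items()}
grand_total = sum(category_sums.values())

for name, total in category_sums.items():
    status = "ok" if total == STATED_TOTALS[name] else "MISMATCH"
    share = 100 * total / grand_total
    print(f"{name:<17} {total:>9,} {share:5.1f}%  {status}")

status = "ok" if grand_total == STATED_GRAND_TOTAL else "MISMATCH"
print(f"{'Total':<17} {grand_total:>9,}         {status}")
```

With the figures as printed, each column sums to its stated total, and
the four categories sum to 2,001,394 words. The spoken categories
(Context-Governed and Demographic) account for 990,704 words and the
written categories (Imaginative and Informative) for 1,010,690.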
To maintain comparability with the whole BNC, as well as the
integrity of the text samples already in the BNC, it was decided
not to divide BNC texts into smaller extracts for the purposes
of the Sampler Corpus. The selection of individual texts for the
Sampler was therefore constrained by the size of each text relative
to the amount of 'room' in the Sampler for its text type. Another
constraint was that
the Sampler Corpus had to be compiled in 1994, at a time when
the BNC was under development, and there was virtually no access
to bibliographical information regarding the contents of the BNC.
The consequence was that the choice of texts could not be determined
by random sampling methods, but that a member of the research
team had to select texts by human inspection. Within these constraints,
the texts for the Sampler were chosen so as to mirror, both in size
and in content, the varieties and proportions of text types in
the whole BNC.
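The size constraint described above can be illustrated with a short
Python sketch. It is not the procedure the research team used, since
texts were selected by human inspection; it only shows how a proportional
target for a text type, and the 'room' remaining for a whole, undivided
text, might be computed. The function names and all the numbers in the
usage lines are hypothetical.

```python
def sampler_target(bnc_words_in_type: int, bnc_total_words: int,
                   sampler_size: int = 2_000_000) -> int:
    """Words a text type would receive if the Sampler mirrors BNC proportions."""
    return round(sampler_size * bnc_words_in_type / bnc_total_words)


def fits_room(text_words: int, target_words: int, words_already_chosen: int) -> bool:
    """True if a whole text still fits in the room left for its text type."""
    return words_already_chosen + text_words <= target_words


# Hypothetical figures, for illustration only: a text type holding
# 5 million of 100 million words, with 80,000 words already selected.
target = sampler_target(bnc_words_in_type=5_000_000, bnc_total_words=100_000_000)
already_chosen = 80_000

print(target)
print(fits_room(30_000, target, already_chosen))  # a 30,000-word candidate
print(fits_room(15_000, target, already_chosen))  # a 15,000-word candidate
```

Because texts are never split, a candidate that exceeds the remaining
room has to be passed over in favour of a shorter one, which is why the
category totals in the Sampler only approximate the BNC's proportions.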
2.1 List of texts in the Sampler Corpus
A. Spoken
A.1 Context-governed sampling
Leisure
FL6, FLK, FX5, FX6, FXR, FY8, FYJ, G4N, G5A, G63, HE3, HE4, HEM, HM4, J8G
Business
F7J, FLS, FUG, G3U, H47, H5D, HDF, HDG, HDT, HLW, HYF, J3W, J97
Educational/Informative
F71, F77, F7G, FLY, FM4, FM7, FUH, G4K, JJS
Public/Institutional
DCH, F86, FLU, FMP, FMS, FUT, FUU, H4A, J44, JJA, JJV, JJW, JNG,
JNM
A.2 Demographic sampling
respondent's socio-economic class = A or B
KB8, KBU, KC4, KCV, KP6, KPG, KBK, KCB, KCH, KDU, KP8, KST, KB3,
KC0, KC3, KC8
respondent's socio-economic class = C1
KBG, KCN, KD0, KB9, KD5, KBL, KD2, KDM, KE3
respondent's socio-economic class = C2
KC1, KCG, KD3, KD8, KBX, KCT, KCX, KD1, KDH, KBF, KCE, KCL, KCY,
KPD
respondent's socio-economic class = D or E
KB1, KCA, KDN, KB2, KC2, KC7, KCU, KD6
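For working with these lists programmatically, the demographic texts can
be held in a simple mapping from socio-economic class to text identifiers.
The sketch below, in Python, uses the identifiers listed under A.2; the
variable names are illustrative, and the same layout would extend to the
other categories.

```python
# Text identifiers from section A.2, keyed by the respondent's
# socio-economic class.
DEMOGRAPHIC_TEXTS = {
    "AB": ["KB8", "KBU", "KC4", "KCV", "KP6", "KPG", "KBK", "KCB",
           "KCH", "KDU", "KP8", "KST", "KB3", "KC0", "KC3", "KC8"],
    "C1": ["KBG", "KCN", "KD0", "KB9", "KD5", "KBL", "KD2", "KDM", "KE3"],
    "C2": ["KC1", "KCG", "KD3", "KD8", "KBX", "KCT", "KCX", "KD1",
           "KDH", "KBF", "KCE", "KCL", "KCY", "KPD"],
    "DE": ["KB1", "KCA", "KDN", "KB2", "KC2", "KC7", "KCU", "KD6"],
}

# Number of texts per class, and overall.
texts_per_class = {cls: len(ids) for cls, ids in DEMOGRAPHIC_TEXTS.items()}
all_ids = [text_id for ids in DEMOGRAPHIC_TEXTS.values() for text_id in ids]

# A text should appear under one class only.
duplicates = sorted({i for i in all_ids if all_ids.count(i) > 1})

for cls, count in texts_per_class.items():
    print(f"{cls}: {count} texts")
print(f"Total: {len(all_ids)} texts, duplicates: {duplicates}")
```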
B. Written
B.1 Imaginative
Drama FU6
Poetry CHX, F9M, G11, G1V
Prose Fiction AEA, ALS, CCD, CHR, FRY, FSB, G0A, GUL, GV9,
GW5, GWA, J2G
B.2 Informative
Pure science
FU0, FU9, J2H, J2J
Applied science
CF5, CF7, CF8, CL8, EAP, F98, FA4, FR2, G0K, G3N, H0H, H0S
Belief and thought
CBB, EBK, EVR, GX0
Commerce and finance
AP6, CEL, EVY, FEJ, G0C, GX4, HY1, J24, J6W
Arts
CF6, CN4, J1L, J55
Community and Social science
APJ, EX7, FCF, H8W, JXL
World affairs
A7V, A87, A8J, A8W, A95, A9E, A9M, A9V, AA4, AAB, AAK, AAT, B2E, BMJ, CHP,
EW4, FB4, FU7, G2R, GT9, H7C, HXN
Leisure
BP6, C9C, CAA, CDH, CF9, G22, GUB, GV1, H13, J1N
3. Further information
Detailed information on the word-class tagging of the BNC Sampler
Corpus is given in a separate document, CLAWS-C7.