Document BNC SAMPLER

UCREL, Lancaster University, Lancaster LA1 4YT, UK

March 1998

The British National Corpus Sampler Corpus: Explanatory documentation

1. Introduction

The BNC Sampler Corpus is a subcorpus of the British National Corpus, consisting of approximately one-fiftieth of the whole corpus, viz. two million words. The Sampler Corpus is word-class tagged, using a more detailed tagset than has been used for the BNC as a whole. It also has the advantage that all the word-class tags assigned to words have been manually checked and, where necessary, corrected. Consequently, the number of errors in word-class tagging must be very small. The advantages of the Sampler Corpus, then, are the following:

It has been word-class-tagged using the C7 tagset, a more detailed tagset of 135 tags (plus 12 punctuation tags), instead of the C5 tagset with 61 tags (plus punctuation tags).

The word-class tags have been hand-checked and corrected, so that errors are minimal.

The Sampler Corpus is divided approximately 50%-50% between written and spoken materials. This is a better balance than the 90%-10% division of the whole BNC.

Although the Sampler Corpus lacks much of the detail and variety of the entire BNC, it does contain a wide and balanced sampling of texts from the BNC, chosen so as to maintain the general text types of the BNC as a whole and their proportions (apart from the different written/spoken division).

Experience suggests that for many research and application purposes, a small-scale version of the BNC, such as the Sampler provides, is more convenient to use than the whole 100-million-word BNC.

2. The Constitution of the BNC Sampler Corpus

The Sampler Corpus consists of the following text categories. Table 2:1 gives the number of words in each category.

Table 2:1 - BNC Sampler Corpus (2,001,394 words)

SPOKEN (990,704 words)                                     WRITTEN (1,010,690 words)

Context-Governed (496,852)      Demographic (493,852)      Imaginative (231,663)    Informative (779,027)
                                [by socio-economic class]
Leisure (136,606)               I (AB) (164,933)           Drama (23,786)           Pure science (32,974)
Educational (80,463)            II (C1) (98,700)           Poetry (30,144)          Applied science (117,685)
Business (134,275)              III (C2) (137,686)         Prose Fiction (177,733)  Social science (29,868)
Public/Institutional (145,508)  IV (DE) (92,533)                                    World affairs (277,128)
                                                                                    Commerce & finance (92,057)
                                                                                    Arts (51,645)
                                                                                    Belief & thought (43,626)
                                                                                    Leisure (134,044)
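The figures in Table 2:1 can be cross-checked by summing each level of the hierarchy. The following is a minimal Python sketch of that arithmetic; the word counts are transcribed from the table, while the dictionary layout and the names SAMPLER and subtotal are illustrative choices and not part of any BNC distribution.

```python
# Word counts transcribed from Table 2:1.
# The nested layout and all names here are illustrative assumptions.
SAMPLER = {
    "spoken": {
        "context_governed": {
            "Leisure": 136_606,
            "Educational": 80_463,
            "Business": 134_275,
            "Public/Institutional": 145_508,
        },
        "demographic": {
            "I (AB)": 164_933,
            "II (C1)": 98_700,
            "III (C2)": 137_686,
            "IV (DE)": 92_533,
        },
    },
    "written": {
        "imaginative": {
            "Drama": 23_786,
            "Poetry": 30_144,
            "Prose Fiction": 177_733,
        },
        "informative": {
            "Pure science": 32_974,
            "Applied science": 117_685,
            "Social science": 29_868,
            "World affairs": 277_128,
            "Commerce & finance": 92_057,
            "Arts": 51_645,
            "Belief & thought": 43_626,
            "Leisure": 134_044,
        },
    },
}


def subtotal(node):
    """Sum the word counts under a node (an int leaf or a nested dict)."""
    if isinstance(node, int):
        return node
    return sum(subtotal(child) for child in node.values())


spoken = subtotal(SAMPLER["spoken"])
written = subtotal(SAMPLER["written"])
total = spoken + written

print(spoken, written, total)                   # 990704 1010690 2001394
print(f"spoken share: {spoken / total:.1%}")    # spoken share: 49.5%
```

The subtotals reproduce the headings of the table, and the spoken share of about 49.5% agrees with the approximately equal written/spoken division described in the Introduction.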

To maintain comparability with the whole BNC, and to preserve the integrity of the text samples already in the BNC, it was decided not to divide BNC documents into smaller extracts for the purposes of the Sampler Corpus. The selection of individual texts was therefore constrained by the size of each text relative to the amount of 'room' in the Sampler for its text type. A further constraint was that the Sampler Corpus had to be compiled in 1994, at a time when the BNC was still under development and there was virtually no access to bibliographical information regarding its contents. Consequently, the choice of texts could not be determined by random sampling methods; instead, a member of the research team had to select texts by human inspection. Within these constraints, the texts for the Sampler were chosen so as to reflect, in both size and content, the varieties and proportions of text types in the whole BNC.

2.1 List of texts in the Sampler Corpus

A. Spoken

A.1 Context-governed sampling

Leisure

FL6, FLK, FX5, FX6, FXR, FY8, FYJ, G4N, G5A, G63, HE3, HE4, HEM, HM4, J8G

Business

F7J, FLS, FUG, G3U, H47, H5D, HDF, HDG, HDT, HLW, HYF, J3W, J97

Educational/Informative

F71, F77, F7G, FLY, FM4, FM7, FUH, G4K, JJS

Public/Institutional

DCH, F86, FLU, FMP, FMS, FUT, FUU, H4A, J44, JJA, JJV, JJW, JNG, JNM

A.2 Demographic sampling

respondent's socio-economic class = A or B

KB8, KBU, KC4, KCV, KP6, KPG, KBK, KCB, KCH, KDU, KP8, KST, KB3, KC0, KC3, KC8

respondent's socio-economic class = C1

KBG, KCN, KD0, KB9, KD5, KBL, KD2, KDM, KE3

respondent's socio-economic class = C2

KC1, KCG, KD3, KD8, KBX, KCT, KCX, KD1, KDH, KBF, KCE, KCL, KCY, KPD

respondent's socio-economic class = D or E

KB1, KCA, KDN, KB2, KC2, KC7, KCU, KD6

B. Written

B.1 Imaginative

Drama FU6

Poetry CHX, F9M, G11, G1V

Prose Fiction AEA, ALS, CCD, CHR, FRY, FSB, G0A, GUL, GV9, GW5, GWA, J2G

B.2 Informative

Pure science

FU0, FU9, J2H, J2J

Applied science

CF5, CF7, CF8, CL8, EAP, F98, FA4, FR2, G0K, G3N, H0H, H0S

Belief and thought

CBB, EBK, EVR, GX0

Commerce and finance

AP6, CEL, EVY, FEJ, G0C, GX4, HY1, J24, J6W

Arts

CF6, CN4, J1L, J55

Community and Social science

APJ, EX7, FCF, H8W, JXL

World affairs

A7V, A87, A8J, A8W, A95, A9E, A9M, A9V, AA4, AAB, AAK, AAT, B2E, BMJ, CHP, EW4, FB4, FU7, G2R, GT9, H7C, HXN

Leisure

BP6, C9C, CAA, CDH, CF9, G22, GUB, GV1, H13, J1N
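For anyone working with the listing in section 2.1 programmatically, it can help to turn it into a lookup from text identifier to category. The Python sketch below shows one way this might be done. It accepts identifiers separated by commas, whitespace, or both. The four sample entries are copied from the listing; the category labels, the function names, and the assumption that every identifier is one capital letter followed by two capital letters or digits (inferred only from the identifiers listed here) are illustrative.

```python
import re

# A small sample copied from the listing in section 2.1, not the full corpus.
# The path-like category labels are hypothetical names chosen for this sketch.
LISTING = {
    "Spoken/Demographic/D or E": "KB1, KCA, KDN, KB2, KC2, KC7, KCU, KD6",
    "Written/Imaginative/Poetry": "CHX, F9M, G11, G1V",
    "Written/Informative/Pure science": "FU0, FU9, J2H, J2J",
    "Written/Informative/Arts": "CF6, CN4, J1L, J55",
}

# Assumed identifier shape, inferred from the listing: e.g. "KB1", "F9M", "J55".
ID_PATTERN = re.compile(r"[A-Z][A-Z0-9]{2}")


def parse_ids(line):
    """Split a line of text identifiers on commas and/or whitespace."""
    ids = [token for token in re.split(r"[,\s]+", line.strip()) if token]
    for token in ids:
        if not ID_PATTERN.fullmatch(token):
            raise ValueError(f"unexpected text identifier: {token!r}")
    return ids


def build_index(listing):
    """Map each text identifier to the category it is listed under."""
    index = {}
    for category, line in listing.items():
        for text_id in parse_ids(line):
            if text_id in index:
                raise ValueError(f"{text_id} is listed under two categories")
            index[text_id] = category
    return index


index = build_index(LISTING)
print(index["G11"])    # Written/Imaginative/Poetry
print(len(index))      # 20
```

Rejecting malformed or repeated identifiers is a deliberate choice in this sketch: transcription slips in a hand-copied listing then surface as errors and do not pass silently into the lookup.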

3. Further information

Detailed information on the word-class tagging of the BNC Sampler Corpus is given in a separate document, CLAWS-C7.
