Geoffrey Leech and Nicholas Smith
UCREL, Lancaster University, Lancaster LA1 4YT, UK
The whole of the British National Corpus (BNC) has been retagged with word-class tags: that is, a label is attached to each word, indicating its grammatical class (or part of speech), the tags being the same as in the first version of the BNC.
The word-class tagging was undertaken by the University Centre for Computer Corpus Research on Language (UCREL) at Lancaster University, UK. Those contributing to the tagging were Roger Garside (UCREL director), Geoffrey Leech (UCREL chair), Tony McEnery (co-fund-holder), Nicholas Smith (leading research coordinator), Paul Baker (research associate), Martin Wynne (research associate), Mike Pacey (programmer) and Michael Bryant (programming consultant). Several other researchers worked on version 1 of the BNC and on the BNC Sampler corpus.
This manual contains four main documents:
The BNC Basic Tagset and the BNC Enriched Tagset
The BNC is word-class tagged using a set of 57 tags (known as C5) which we refer to as the "BNC Basic Tagset". (There are also 4 punctuation tags, excluded from consideration here.) Each C5 tag represents a grammatical class of words, and consists of a partially mnemonic sequence of three characters: e.g. NN1 for "singular common noun".
Additionally, a two-million-word subcorpus of the BNC, the "Sampler Corpus", has been tagged using a richer, more detailed tagset (known as C7), which we also refer to as the "BNC Enriched Tagset". The Sampler Corpus is being distributed on CD-ROM independently of the whole BNC. The tagging of the Sampler Corpus has been thoroughly corrected by manual post-editing of the automatic tagging, and may be assumed to be almost error-free.
The BNC, consisting of c.100 million words, was tagged automatically, using the CLAWS4 automatic tagger developed by Roger Garside at Lancaster, and a second program, known as Template Tagger, developed chiefly by Mike Pacey. (See further details in the automatic tagging programs document, and also R. Garside, G. Leech and T. McEnery (eds.), 1997, Corpus Annotation: Linguistic Information from Computer Text Corpora, London: Longman, chapters 7-9.) With such a large corpus, there was no opportunity to undertake post-editing (see note 2), i.e. disambiguation and correction of tagging errors produced by the automatic tagger, and so the errors (about 1.15 per cent of all words) remain in the distributed form of the corpus. In addition, the distributed form of the corpus contains ambiguous taggings (c.3.75 per cent of all words), shown in the form of ambiguity tags (also called ‘portmanteau tags’), consisting of two C5 tags linked by a hyphen: e.g. VVD-VVN. These tags indicate that the automatic tagger was unable to determine, with sufficient confidence, which was the correct category, and so left two possibilities for users to disambiguate themselves, if they should wish to do so. For example, in the case of VVD-VVN, the first (preferred) tag, say for a word such as wanted, is VVD: past tense of lexical verb; and the second (less favoured) tag is VVN: past participle of lexical verb. On the whole, the likelihood of the first tag of an ambiguity tag being correct is over 3 to 1; see, however, details of individual tags in Table 2 of the error report document.
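As a minimal sketch of how a user might handle ambiguity tags when processing the corpus, the following Python splits a tag such as VVD-VVN into its preferred and less favoured parts. The function names (split_tag, preferred_tag, odds_to_probability) are illustrative only and are not part of CLAWS4 or any BNC software; the sketch assumes, as described above, that every C5 tag is three characters long and that an ambiguity tag joins exactly two of them with a hyphen.

```python
def split_tag(tag):
    """Return (preferred, alternative); alternative is None for a plain C5 tag."""
    parts = tag.split("-")
    # A valid tag is one or two three-character C5 labels.
    if len(parts) > 2 or any(len(p) != 3 for p in parts):
        raise ValueError(f"not a C5 tag or ambiguity tag: {tag!r}")
    first = parts[0]
    second = parts[1] if len(parts) == 2 else None
    return first, second


def preferred_tag(tag):
    """Resolve an ambiguity tag by simply taking its first (preferred) label."""
    return split_tag(tag)[0]


def odds_to_probability(in_favour, against):
    """Convert odds such as 3 to 1 into a probability."""
    return in_favour / (in_favour + against)


print(split_tag("VVD-VVN"))       # ('VVD', 'VVN')
print(preferred_tag("NN1"))       # NN1
print(odds_to_probability(3, 1))  # 0.75
```

Taking the first label is only one possible policy: odds of 3 to 1 correspond to a probability of 0.75, so a user who always chose the first label could still expect a substantial minority of those choices to be wrong.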
After the automatic tagging, some manual tagging was undertaken to correct some particularly blatant errors, mainly foreign or classical words embedded in English text. CLAWS is not very successful at detecting these foreign words and tagging them with their appropriate tag (UNC), except when they form part of established expressions such as ad hoc or nom de plume - in which case they are normally given tags appropriate to their grammatical function, e.g. as nouns or adverbs.
The main purpose of the report on estimated error rates is to document the rather small percentage of ambiguities and errors remaining in the tagged BNC, so that users of the corpus can assess the accuracy of the tagging for their own purposes. Since, not surprisingly, we have been unable to inspect each of the 100 million tags in the BNC, we have had to estimate ambiguity rates and error rates on the basis of a manual post-editing of a corpus sample of 50,000 words. The estimate is based on twenty-four 2,000-word text extracts and two 1,000-word extracts, selected so as to be as far as possible representative of the whole corpus. (The texts from which these extracts are taken are listed in an Appendix to the error report. Like the BNC as a whole, the sample for manual analysis contains 10 per cent of spoken data.)
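The arithmetic behind these figures can be checked with a short Python sketch. The extract counts and sizes are those given above; the rates of 1.15 per cent (errors) and 3.75 per cent (ambiguities) are the overall figures quoted in this introduction, and the corpus size of 100 million words is approximate. The helper expected_count is a hypothetical name, and rates are passed as parts per 10,000 so that the calculation stays in whole numbers.

```python
# Sample composition as described in the text.
LONG_EXTRACTS, LONG_SIZE = 24, 2_000
SHORT_EXTRACTS, SHORT_SIZE = 2, 1_000

sample_words = LONG_EXTRACTS * LONG_SIZE + SHORT_EXTRACTS * SHORT_SIZE
spoken_words = sample_words * 10 // 100  # 10 per cent of the sample is spoken data


def expected_count(total_words, rate_per_10k):
    """Expected number of words affected, for a rate given in parts per 10,000."""
    return total_words * rate_per_10k // 10_000


sample_errors = expected_count(sample_words, 115)     # 1.15 per cent
sample_ambiguous = expected_count(sample_words, 375)  # 3.75 per cent
corpus_errors = expected_count(100_000_000, 115)      # whole corpus, approximate

print(sample_words, spoken_words)       # 50000 5000
print(sample_errors, sample_ambiguous)  # 575 1875
print(corpus_errors)                    # 1150000
```

On these assumptions, the published rates correspond to roughly 575 erroneous and 1,875 ambiguous taggings in the 50,000-word sample, and to something over a million erroneous taggings in the corpus as a whole.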
Regarding the segmentation of a text into individual word-tokens (called tokenization), our tagging practice in general follows the default assumption that an orthographic word (separated by spaces, with or without punctuation, from adjacent words) is the appropriate unit for word-class tagging. There are, however, exceptions to this. For example, a single orthographic word may consist of more than one grammatical word: in the case of enclitic verb contractions (as in she’s, they’ll, we’re) and negative contractions (as in don’t, isn’t, won’t), two tags are assigned in sequence to the same orthographic word. Also quite frequent is the opposite circumstance, where two or more orthographic words are given a single grammatical tag: e.g. multi-word adverbs such as of course and in short, and multi-word prepositions such as instead of and up to, are each assigned a single word tag (AV0 for adverbs, PRP for prepositions). Sometimes, whether such orthographic sequences are to be treated as a single word for tagging purposes depends on the context and its interpretation. In short is in some circumstances not an adverb but a sequence of preposition + adjective (e.g. in short, sharp bursts). Up to in some contexts needs to be treated as a sequence of two grammatical words: adverbial-particle + preposition-or-infinitive-marker (e.g. We had to phone her up to get the code.). (For details, see the lists of Multiword units and contracted forms in BNC2.)
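The two departures from the orthographic word can be illustrated with a deliberately naive Python sketch. It is not how CLAWS4 tokenizes text: the function grammatical_words, the four-entry MULTIWORDS table and the CLITIC pattern are all hypothetical, and the sketch ignores punctuation and the context-dependence just described (it would wrongly merge in short in "in short, sharp bursts" if the comma were absent).

```python
import re

# Tiny illustrative table; the real BNC2 multiword lists are far longer.
MULTIWORDS = {
    ("of", "course"): "AV0",
    ("in", "short"): "AV0",
    ("instead", "of"): "PRP",
    ("up", "to"): "PRP",
}

# Enclitic verb forms ('s, 'll, 're) and the negative clitic n't,
# written with either a straight or a curly apostrophe.
CLITIC = re.compile(r"^(.+?)(n['’]t|['’](?:s|ll|re))$", re.IGNORECASE)


def grammatical_words(text):
    """Return (form, tag) pairs; tag is None where this sketch assigns none."""
    words = text.split()
    units = []
    i = 0
    while i < len(words):
        pair = tuple(w.lower() for w in words[i:i + 2])
        if pair in MULTIWORDS:
            # Two orthographic words, one grammatical word, one tag.
            units.append((" ".join(words[i:i + 2]), MULTIWORDS[pair]))
            i += 2
            continue
        match = CLITIC.match(words[i])
        if match:
            # One orthographic word, two grammatical words, two tag slots.
            units.append((match.group(1), None))
            units.append((match.group(2), None))
        else:
            units.append((words[i], None))
        i += 1
    return units


print(grammatical_words("of course she's here"))
# [('of course', 'AV0'), ('she', None), ("'s", None), ('here', None)]
```

A real tagger has to decide between the one-word and multi-word readings from context, which is why the lookup above can only serve as an illustration of the unit boundaries, not as a tokenizer.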
In one respect, we have allowed the orthographic occurrence of spaces to be criterial. This is in the tagging of compound words such as markup, mark-up and mark up. Since English orthographic practice is often variable in such matters, the same ‘compound’ expression may occur in the corpus tagged as two words (if they are separated by spaces) or as one word (if the sequence is printed solid or with a hyphen). Thus mark up (as a noun) will be tagged NN1 AVP, whereas markup or mark-up will be tagged simply NN1.
Many detailed decisions have to be made in deciding how to draw the line between the correct and the incorrect assignment of a tag. So that the concept of what is a ‘correct’ or ‘accurate’ annotation can be determined, there have to be detailed guidelines of tagging practice. These are incorporated in a separate document, the Wordclass Tagging Guidelines.
The Guidelines have to give much attention to borderline phenomena, where the distinction between (say) an adjective and a verb participle in -ing is unclear, and to clarify criteria for differentiating them. To promote consistency of tagging practice, the guidelines may even impose somewhat arbitrary dividing lines between one word class and another. Consider the case of a word such as rising, which may be a verb present participle (VVG), an adjective (AJ0) or a singular common noun (NN1). The difference may be illustrated by the three examples:
Oil prices are rising again. (verb, VVG)
the rising sun (adjective, AJ0)
the attempted rising was put down (noun, NN1)
The assignment of an example of ‘Verb+ing’ to the adjective category relies heavily on a semantic criterion, viz. the ability to paraphrase Verb+ing Noun by ‘Noun + Relative Clause that/which/who be Verb+ing’ or ‘that/which/who Verb(s)’ (e.g. the rising sun = the sun which is/was rising; a working mother = a mother who works). These contrast with a case such as dining table, where the first word dining is judged to be a noun. The reason for this is that the paraphrasable meaning of the expression is not ‘a table which is/was dining or dines’, but rather ‘a table (used) for dining’. Although somewhat arbitrary, this relative clause test is well established in English grammatical literature, and such criteria are useful in enabling a reasonable degree of consistency in tagging practice to be achieved, so that the success rate of corpus tagging can be checked and evaluated. (See the Wordclass Tagging Guidelines (Adjective vs Noun) for further details.)
It also has to be recognized that some borderline cases (occasionally) may have to be considered unresolvable. We may conclude, for example, that the word Hatching (occurring as a heading on its own, without any syntactic context) could be equally well analysed VVG or NN1, and in such a case one would be tempted to leave the ambiguity (VVG-NN1) in the corpus, showing uncertainty where any grammarian would be likely to acknowledge it. However, in our calculations of ambiguity, we have adhered to the common assumption that ideally, all tags should be correctly disambiguated. Other examples of unresolvability from the sample texts are:
the importance of weaving in the East (verb or noun? - VVG-NN1)
Armed with the knowledge (past participle verb or adjective? - VVN-AJ0)
the Lord is my shepherd (common noun or proper noun? - NN1-NP0)
In practice, in our post-edited sample, we chose the first tag to be correct in these cases.
Ambiguity tags, and the principle of asymmetry
As in the first version of the BNC, we have introduced only a limited number of ambiguity tags, to deal with particular cases where the tagger has difficulty in distinguishing two categories, and where incorrect taggings would otherwise result rather frequently. Ambiguity tags involve only the following 18 word-class labels, and each of the ambiguity tags allows only two labels to be named:
|AJ0 general adjective (positive)||NN2 plural common noun|
|AV0 general adverb||NP0 proper noun|
|AVP adverbial particle||PNI indefinite pronoun|
|AVQ wh- adverb||PRP general preposition|
|CJS general subordinator||VVB lexical verb: finite base form|
|CJT subordinator: that||VVD lexical verb: past tense|
|CRD cardinal numeral||VVG lexical verb: present participle (-ing form)|
|DT0 determiner-pronoun||VVN lexical verb: past participle|
|NN1 singular common noun||VVZ lexical verb: -s form|
The list of permitted ambiguity tags is given in the Wordclass Tagging Guidelines.
It will be noted that overall 30 ambiguity tags are recognized. We also observe that each ambiguity tag (e.g. VVD-VVN) is matched by another ambiguity tag which is its mirror image (e.g. VVN-VVD). This is where there has been a different practice from that employed in the first version of the BNC. In the first version, no mirror-image ambiguity tags were allowed, and the ambiguity tag was neutral with respect to which tag was more likely. In this version, however, the ordering of tags is significant: it is the first of the two tags which is estimated by the tagger to be the more likely. Hence the interpretation of an ambiguity tag X-Y may be expressed as follows:
"There is not sufficient confidence to choose between tags X and Y; however, X is considered to be more likely."
NOTE, THEREFORE, THAT THE MEANING OF ANY GIVEN AMBIGUITY TAG IN THE ORIGINAL RELEASE OF THE BNC WAS SOMEWHAT DIFFERENT FROM, AND LESS INFORMATIVE THAN, THE MEANING OF THE SAME TAG IN THIS VERSION OF THE CORPUS (BNC2).
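The asymmetry principle can be made concrete with a short Python sketch. The helper names (mirror, is_mirror_closed, unordered_pairs, gloss) are hypothetical, and the example set contains only a few ambiguity tags mentioned in this introduction, not the full list of 30 given in the Wordclass Tagging Guidelines.

```python
def mirror(tag):
    """Return the mirror image of an ambiguity tag, e.g. VVD-VVN -> VVN-VVD."""
    first, second = tag.split("-")
    return f"{second}-{first}"


def is_mirror_closed(tags):
    """True if every ambiguity tag in the set has its mirror image in the set."""
    tags = set(tags)
    return all(mirror(t) in tags for t in tags)


def unordered_pairs(tags):
    """Collapse ordered ambiguity tags into the unordered pairs they involve."""
    return {frozenset(t.split("-")) for t in tags}


def gloss(tag):
    """Spell out the BNC2 reading of an ambiguity tag X-Y."""
    first, second = tag.split("-")
    return (f"There is not sufficient confidence to choose between tags "
            f"{first} and {second}; however, {first} is considered to be more likely.")


examples = {"VVD-VVN", "VVN-VVD", "VVG-NN1", "NN1-VVG"}
print(is_mirror_closed(examples))      # True
print(len(unordered_pairs(examples)))  # 2
print(gloss("VVD-VVN"))
```

If all 30 ambiguity tags are matched as mirror images in this way, they would correspond to 15 unordered pairs of word classes, each pair appearing once in each order.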
1. The terms "POS-tagging" and "wordclass-tagging" are used interchangeably in the manual.
2. The only exceptions to this statement are: (i) the file F9M, which contains the Rap poetry "City Psalms" by Benjamin Zephaniah. It was thoroughly hand-corrected because the tagger, not familiar with Jamaican Creole, had produced an inordinate number of tagging errors. (ii) files identified as containing many foreign and classical expressions, as mentioned above (section 2, para.2).
Date: 17 March 2000