[ Related documents: Introduction to the Manual | Guidelines to Wordclass Tagging | Error rates | Acknowledgments ]
This document describes the overall process of POS-tagging texts for version 2 of the British National Corpus. Figure 1 below shows the main stages involved: stages A-D are handled by the CLAWS4 tagger; stage E, the Template Tagger, is a corrective phase for CLAWS; and the main part of stage F, Postprocessing, is Ambiguity Tagging, which converts some of the less reliable tags into dual tags containing more than one part of speech.
A. Tokenization
B. Initial tag assignment
C. Tag selection (disambiguation)
D. Idiomtagging
E. Template Tagger
F. Postprocessing: including Ambiguity tagging

Figure 1. Wordclass Tagging schema for BNC2.
The BNC2 was automatically tagged using CLAWS4, an automatic tagger which developed out of the CLAWS1 tagger (written by Roger Garside and Ian Marshall; see Marshall 1983) used to tag the LOB Corpus. The advanced version, CLAWS4, is principally the work of Roger Garside, although many other researchers at Lancaster have contributed to its performance in one way or another. Further information about CLAWS4 can be found in Leech, Garside and Bryant (1994) and Garside and Smith (1997).
CLAWS4 is a hybrid tagger, employing a mixture of probabilistic and non-probabilistic techniques. It assigns a tag (or sometimes two tags) to a word as a result of four main processes:
The first major step in automatic tagging is to divide up the text or corpus to be tagged into individual (1) word tokens and (2) orthographic sentences. These are the segments usually demarcated by (1) spaces and (2) sentence boundaries (i.e. sentence final punctuation followed by a capital letter). This procedure is not so straightforward as it might seem, particularly because of the ambiguity of full stops (which can be abbreviation marks as well as sentence-demarcators) and of capital letters (which can signal a naming expression, as well as the beginning of a sentence). Faults in tokenization occasionally occur, but rarely cause tagging errors.
In tokenization, an orthographic word boundary (normally a space, with or without accompanying punctuation) is the default test for identifying the beginning and end of word-tokens. (See, however, the next paragraph and D below.) Hyphens are counted as word-internal, so that a hyphenated word such as key-ring is given just one tag (NN1). Because of the different ways of writing compound words, the same compound may occur in three forms: as a single word written ‘solid’ (markup), as a hyphenated word (mark-up) or as a sequence of two words (mark up). In the first two cases, CLAWS4 will give the compound a single tag, whereas in the third case, it will receive two tags: one for mark and the other for up.
A set of special cases dealt with by tokenization is the set of enclitic verb and negative contractions such as 's, 're, 'll and n't, which are orthographically attached to the preceding word. These are given a tag of their own, so that (for example) the orthographic forms It's, they're and can't each receive two tags in sequence: pronoun + verb, verb + negative, and so on. There are also some 'merged' forms such as won't and dunno, which are decomposed into more than one word for tagging purposes. For example, dunno ends up with the three tags for do + n't + know. [ View the list of contracted forms ]
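To picture the kind of splitting involved, here is a minimal Python sketch (not the CLAWS4 code itself); the merged-form table, the enclitic pattern and the punctuation handling are illustrative assumptions, with the real decompositions defined in the list of contracted forms.

    import re

    # Illustrative only: a handful of merged forms decomposed into several
    # tokens, in the spirit of the dunno -> do + n't + know example above.
    MERGED = {
        "dunno": ["do", "n't", "know"],
        "won't": ["will", "n't"],
        "can't": ["can", "n't"],
    }

    # Enclitics that are split off and tagged separately ('s, 're, 'll, n't, ...).
    ENCLITIC = re.compile(r"^(.+?)(n't|'s|'re|'ll|'ve|'d|'m)$", re.IGNORECASE)

    def tokenize(text):
        """Split on spaces, peel off trailing punctuation and enclitics,
        and expand merged forms into their component words."""
        tokens = []
        for word in text.split():
            core, punct = re.match(r"^(.*?)([.,;:?!]*)$", word).groups()
            if core.lower() in MERGED:
                tokens.extend(MERGED[core.lower()])
            else:
                m = ENCLITIC.match(core)
                if m:
                    tokens.extend([m.group(1), m.group(2)])   # e.g. It's -> It + 's
                elif core:
                    tokens.append(core)                       # key-ring stays one token
            tokens.extend(punct)                              # each punctuation mark separately
        return tokens

    print(tokenize("It's a key-ring, dunno why they're here."))
    # ['It', "'s", 'a', 'key-ring', ',', 'do', "n't", 'know', 'why', 'they', "'re", 'here', '.']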
The second stage of CLAWS POS-tagging is to assign to each word token one or more tags. Many word tokens are unambiguous, and so will be assigned just one tag: e.g. various AJ0 (adjective). Other word tokens are ambiguous, taking from two to seven potential tags. For example, the token paint can be tagged NN1, VVB, VVI, i.e. as a noun or as a verb; the token broadcast can be tagged as VVB, VVI, VVD, VVN (verb which is either present tense, infinitive, past tense, or past participle). In addition, it can be a noun (NN1) or an adjective (AJ0), as in a broadcast concert.
To find the list of potential tags associated with a word, CLAWS first looks up the word in a lexicon of c.50,000 word entries. This lexicon look-up accounts for a large proportion of the word tokens in a text. However, many rarer words or names will not be found in the lexicon, and are tagged by other test procedures. Some of the other procedures are:
When a word is associated with more than one tag, information is given by the lexicon look-up or other procedures on the relative probability of each tag. For example, the word for can be a preposition or a conjunction, but is much more likely to be a preposition. This information is provided by the lexicon, either in numerical form or, where the available numerical data are insufficient, by a simple distinction between 'unmarked', 'rare' and 'very rare' tags.
Some adjustment of probability is made according to the position of the word in the sentence. If a word begins with a capital, the likelihood of the various tags depends partly on whether the word occurs at the beginning of a sentence. For instance, the word Brown at the beginning of a sentence is less reliably a proper noun than it is elsewhere, since the capital letter may simply mark the start of the sentence and the word may be the adjective or common noun normally written brown. Hence the likelihood of a proper noun tag being assigned is reduced at the beginning of a sentence.
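The following short Python sketch illustrates the combination of lexicon look-up with a sentence-initial adjustment; the lexicon entries, the probabilities and the size of the proper-noun discount are all invented for illustration.

    # A minimal sketch of initial tag assignment (not the CLAWS4 lexicon or its
    # actual figures; all entries and probabilities below are invented).
    LEXICON = {
        "various": {"AJ0": 1.0},
        "paint":   {"NN1": 0.60, "VVB": 0.25, "VVI": 0.15},
        "for":     {"PRP": 0.95, "CJS": 0.05},
        "brown":   {"AJ0": 0.70, "NN1": 0.30},
        "Brown":   {"NP0": 0.90, "AJ0": 0.07, "NN1": 0.03},
    }

    def assign_tags(token, sentence_initial=False):
        """Return a dict of candidate tags with (renormalised) likelihoods.
        A capitalised, sentence-initial token also picks up the readings of its
        lower-case form, and its proper-noun reading is discounted."""
        candidates = dict(LEXICON.get(token, {}))
        if sentence_initial and token[:1].isupper():
            for tag, p in LEXICON.get(token.lower(), {}).items():
                candidates[tag] = candidates.get(tag, 0.0) + p
            if "NP0" in candidates:
                candidates["NP0"] *= 0.5        # illustrative discount only
        total = sum(candidates.values()) or 1.0
        return {tag: round(p / total, 3) for tag, p in candidates.items()}

    print(assign_tags("Brown", sentence_initial=False))  # proper noun strongly favoured
    print(assign_tags("Brown", sentence_initial=True))   # adjective/common noun gain ground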
The next stage, logically, is to choose the most probable tag from any ambiguous set of tags associated with a word token by tag assignment (but see D below). This is another probabilistic procedure, this time making use of the context in which a word occurs. A method known as Viterbi alignment uses the probabilistic estimates available, both the tag-word associations and the sequential tag-tag likelihoods, to calculate the most likely path through the sequence of tag ambiguities. (The model employed is largely equivalent to a hidden Markov model.) At the end of this stage, a single 'winning tag' has been chosen for each word token in the text. (The less likely tags are not obliterated: they follow the winning tag in descending order of probability.) However, the winning tag is not necessarily the right answer. If CLAWS tagging stopped at this point, only c.95-96% of the word tokens would be correctly tagged. This is the main reason for including an additional stage (or rather a set of stages) termed 'idiomtagging'.
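To make the selection procedure concrete before turning to idiomtagging, here is a compact sketch of Viterbi decoding over a toy model; the tag set, the transition figures and the word-tag likelihoods are invented, and CLAWS4's own model and estimates are far richer.

    # A compact sketch of Viterbi tag selection over a toy model.
    TRANS = {                                   # P(next tag | tag), illustrative
        "AT0": {"NN1": 0.8, "AJ0": 0.2},
        "AJ0": {"NN1": 0.9, "AJ0": 0.1},
        "NN1": {"VVZ": 0.5, "NN1": 0.5},
        "VVZ": {"AT0": 1.0},
    }
    START = {"AT0": 0.6, "NN1": 0.2, "AJ0": 0.2}
    FLOOR = 1e-6                                # probability for unseen transitions

    def viterbi(observations):
        """observations: one {tag: P(word | tag)} dict per token.
        Returns the single most probable tag sequence."""
        # Each trellis cell holds (best probability so far, backpointer tag).
        trellis = [{t: (START.get(t, FLOOR) * p, None) for t, p in observations[0].items()}]
        for obs in observations[1:]:
            column = {}
            for tag, emit in obs.items():
                column[tag] = max(
                    ((p_prev * TRANS.get(prev, {}).get(tag, FLOOR) * emit, prev)
                     for prev, (p_prev, _) in trellis[-1].items()),
                    key=lambda x: x[0],
                )
            trellis.append(column)
        # Trace the backpointers from the best final tag.
        tag = max(trellis[-1], key=lambda t: trellis[-1][t][0])
        path = [tag]
        for i in range(len(trellis) - 1, 0, -1):
            tag = trellis[i][tag][1]
            path.append(tag)
        return list(reversed(path))

    # "the paint dries", with invented word-tag likelihoods:
    sentence = [
        {"AT0": 1.0},                              # the
        {"NN1": 0.6, "VVB": 0.3, "VVI": 0.1},      # paint
        {"VVZ": 0.9, "NN2": 0.1},                  # dries
    ]
    print(viterbi(sentence))                       # -> ['AT0', 'NN1', 'VVZ']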
Idiomtagging is a stage of CLAWS4's operation in which sequences of words and tags are matched against a 'template'. Depending on the match, the tags may be disambiguated or corrected. In practice, there are two main reasons for idiomtagging:
Idiomtagging is a matching procedure which operates on lists of rules which might loosely be termed 'idioms'. Among these are:
The idiomtagging component of CLAWS is quite powerful in matching 'template' expressions which may contain wild-card symbols, Boolean operators and gaps of up to n words. These templates are much more variable than 'idioms' in the ordinary sense, and resemble finite-state networks.
Another important point about idiomtagging is that it is split up into two main phases which operate at different points in the tagging system. One part of the idiomtagging takes place at the end of Stage C., in effect retrospectively correcting some of the errors which would otherwise occur in CLAWS output. Another part, however, actually takes place between Stages B. and C. This means it can utilise ambiguous input and also produce ambiguous output, perhaps adjusting the likelihood of one tag relative to another. As an example, consider the case of so long as, which can be a single grammatical item - a conditional conjunction meaning 'provided that'. The difficulty is that so long as can also be a sequence of three separate grammatical items: degree adverb + adjective/adverb + conjunction/preposition. In this case, the tagging ambiguity belongs to a whole word sequence rather than a single word, and the output of the idiomtagging has to be passed on to the probabilistic tag selection stage. Hence, although we have called idiomtagging 'Stage D.', it is actually split between two stages, one preceding C. and one following C.
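The pre-selection phase can be pictured with a small sketch such as the following; this is not CLAWS4's idiomlist formalism, and the treatment of so long as (simply raising the weight of a conjunction reading on the matched words) is an illustrative simplification.

    # An illustrative idiom-style rule applied before tag selection: when
    # "so long as" is found, a conjunction reading is made available (or
    # strengthened) for the whole sequence, while the ordinary word-by-word
    # readings remain for the probabilistic tag selection stage to weigh up.
    def idiomtag_so_long_as(tokens, tag_sets, boost=0.5):
        """tokens: list of word strings; tag_sets: parallel list of {tag: weight}
        dicts from initial tag assignment. Returns the adjusted tag_sets."""
        for i in range(len(tokens) - 2):
            if [w.lower() for w in tokens[i:i + 3]] == ["so", "long", "as"]:
                for j in range(i, i + 3):
                    # The boost is an invented figure, not a CLAWS4 probability.
                    tag_sets[j]["CJS"] = tag_sets[j].get("CJS", 0.0) + boost
        return tag_sets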
When the text emerges from Stages C. and D., each word has an associated set of one or more tags, and each tag is itself associated with a probability represented as a percentage. An example is:
entering VVG 86% NN1 14% AJ0 0%
Clearly VVG (-ing participle of the verb enter) is judged by CLAWS4 to be the most likely tag in this case.
The error rate with CLAWS4 averages around 3%.1 For the BNC Tagging Enhancement project, we decided to concentrate our efforts on the rule-based part of the system, where most of the inroads in error reduction had been made. This involved (a) developing software with more powerful pattern-matching capabilities than the CLAWS Idiomlist, and (b) carrying out a more systematic analysis of errors, to identify appropriate error-correcting rules.
The new program, known as Template Tagger, supplements rather than supplants CLAWS. It takes a CLAWS output file as its input, and "patches"2 any erroneous tags it finds by using hand-written template rules. Figure 1 above shows where Template Tagger fits in the overall tagging scheme. Effectively, it is an elaborate 'search and replace' tool, capable of matching longer-distance and more variable dependencies than is possible with the Idiomlist:
These features can best be understood by an example. In BNC1 there were quite a number of errors in disambiguating prepositions from subordinating conjunctions, in connection with words like after, before, since and so on. The following rule corrects many such cases, changing the subordinating conjunction tag (CJS) to the preposition tag (PRP). It applies the basic grammatical principle that subordinating conjunctions mark the start of clauses and generally require a finite verb somewhere later in the sentence.
#AFTER [CJS^PRP] PRP, ([!#FINITE_VB/VVN])16, #PUNC1
The two commas divide the rule into three units, each containing a word, a tag, or a word+tag combination. Square brackets enclose tag patterns, and a tag following square brackets is the replacement tag (i.e. the action part of the rule). #AFTER refers to a list of words like after, before and since that have similar grammatical properties. These words are defined in a separate file; not all conjunction-preposition words are listed: as, for instance, can be used elliptically, without the requirement for a following verb (see the Tagging Guidelines under as). The definition for #FINITE_VB contains a list of possible POS-tags (rather than word values), e.g. VVZ/VV0/VM0. Finally, #PUNC1 is a 'hard' punctuation boundary (one of . : ; ? and !). The patching rule can be interpreted as:
'If a sequence of the following kind occurs: a word from the #AFTER list tagged CJS, with PRP as an alternative; followed by up to 16 words, none of which is a finite verb (or VVN); followed by a hard punctuation boundary, then change the conjunction tag to preposition.'
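The same logic can be expressed as a short Python paraphrase of the rule; the #AFTER and #FINITE_VB lists below are abridged guesses at their contents, and the triple-based representation of CLAWS output is an assumption made for illustration.

    # A Python paraphrase of the rule above (illustrative only: the real rule
    # is interpreted by Template Tagger).
    AFTER_WORDS = {"after", "before", "since", "until"}       # #AFTER (abridged)
    FINITE_VB = {"VVZ", "VV0", "VVD", "VBZ", "VBD", "VM0"}    # #FINITE_VB (abridged)
    HARD_PUNCT = {".", ":", ";", "?", "!"}                    # #PUNC1

    def patch_cjs_to_prp(tokens):
        """tokens: list of (word, tag, alternative_tags) triples from CLAWS output.
        Change CJS to PRP when a hard punctuation boundary follows within 16
        words, none of which is a finite verb (or VVN)."""
        patched = []
        for i, (word, tag, alts) in enumerate(tokens):
            if word.lower() in AFTER_WORDS and tag == "CJS" and "PRP" in alts:
                intervening = []
                for w, t, _ in tokens[i + 1:i + 18]:          # up to 16 words + the boundary
                    if w in HARD_PUNCT:
                        if not any(x in FINITE_VB or x == "VVN" for x in intervening):
                            tag = "PRP"
                        break
                    intervening.append(t)
            patched.append((word, tag, alts))
        return patched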
The rule doesn't always work accurately, and doesn't cater for all preposition-conjunction errors. (i) It relies to a large extent on CLAWS having correctly identified finite verb tags in the right context of the preposition-conjunction; sometimes, however, a past participle is confused with a past tense form. (We therefore added VVN, i.e. past participle, as a possible alternative to #FINITE_VB in the second part of the pattern. The downside of this was that Template Tagger ignored some conjunction-preposition errors containing a genuine use of VVN in the right context.) (ii) The scope of the rule doesn't cover long sentences in which more than 16 non-finite-verb words occur after the conjunction-preposition; a separate rule had to be written to handle such cases. (iii) Adverb uses of after, before, since, etc. need to be corrected by additional rules.
Targeting and writing the Template rules
The Templates are targeted at the most error-prone categories introduced (or rather, left unresolved) by CLAWS. As with the preposition-conjunction example just shown, many disambiguation errors cluster around pairs of tags, for example adjective and adverb, or noun and verb. Sometimes a triple is involved, e.g. past tense verb (VVD), past participle (VVN) and adjective (AJ0) in the case of surprised.
A small team of researchers sought out patterns in the errors by concordancing a training corpus that contained two parallel versions of the tagging: the automatic version produced by CLAWS and a hand-corrected version, which served as a benchmark. A concordance query of the form "tag A | tag B" would retrieve lines where the former version assigned an incorrect tag A and the latter a correct tag B. An example is shown below, in which A is a subordinating conjunction and B a preposition.
the company which have occurred | since CJS [PRP] | the balance sheet date .
nd-green shirt with epaulettes . | Before CJS [PRP] | the show , the uniforms were approved by
rt towards the library catalogue | since CJS [PRP] | the advent of online systems . The overall
ales . There have been no events | since CJS [PRP] | the balance sheet date which materially af
n in demand , adding 13p to 173p | since CJS [PRP] | the end of October . Printing group Linx h
Hugh Candidus of Peterborough .  | After CJS [PRP] | the appointment of Henry of Poitou , a sel
boys would be in the Ravenna mud | until CJS [PRP] | the spring . Our landlady obviously liked
ution in treatment brought about | since CJS [PRP] | the arrival of penicillin and antibiotics
Figure 2. Parallel concordance showing conjunction-preposition tagging errors
By working interactively with the parallel concordance, sorting on the tags of the immediate context, testing for significant collocates to the left and right, and generally applying his or her linguistic knowledge, the researcher can often detect sufficient commonality among the tagging errors to formulate a patching rule (or a set of rules) such as that shown above. It took several iterations of training and testing to refine the rules to a point where they could be applied by Template Tagger to the full corpus.4
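A minimal sketch of such a "tag A | tag B" query is given below; the representation of the training corpus as aligned (word, CLAWS tag, benchmark tag) triples and the function itself are assumptions made for illustration, not the concordancer actually used.

    def error_concordance(corpus, wrong_tag, right_tag, context=6):
        """Yield KWIC-style lines where CLAWS chose wrong_tag but the
        hand-corrected benchmark has right_tag."""
        for i, (word, claws, bench) in enumerate(corpus):
            if claws == wrong_tag and bench == right_tag:
                left = " ".join(w for w, _, _ in corpus[max(0, i - context):i])
                right = " ".join(w for w, _, _ in corpus[i + 1:i + 1 + context])
                yield f"{left:>40} | {word} {claws} [{bench}] | {right}"

    # Invented mini-corpus for illustration:
    corpus = [
        ("no", "AT0", "AT0"), ("events", "NN2", "NN2"), ("since", "CJS", "PRP"),
        ("the", "AT0", "AT0"), ("balance", "NN1", "NN1"), ("sheet", "NN1", "NN1"),
        ("date", "NN1", "NN1"), (".", "PUN", "PUN"),
    ]
    for line in error_concordance(corpus, "CJS", "PRP", context=3):
        print(line)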
It should be said that some categories of error were easier to write rules for than others. Finding productive rules for noun-verb correction was especially difficult, because of the many types of ambiguity between nouns, verbs and other categories, and the widely differing contexts in which they appear. The errors and ambiguity tags associated with NN1-VVB and NN2-VVZ in BNC2, reported in the error report, testify to this problem. Here a more sophisticated lexicon, detailing the selectional restrictions of individual verbs and nouns (and other categories), would undoubtedly have been useful.
Ordering the rules
In some instances the ordering of rules was important. When two rules in the same ruleset compete, the longer match applies. Clashes arise, for instance, in the case of the multiply ambiguous word as. Besides the clear grammatical choices between a preposition and a complementiser introducing an adverbial clause, there are many "interfering" idiomatic uses (as well as, as regards, etc.) and elliptical uses (The TGV goes as fast as the Bullet train [sc. goes]). To avoid interference between the rules, we found it preferable to let an earlier pass of the rules handle the more idiomatic (or exceptional) structures, and let a later pass deal with the more regular grammatical dependencies.
In many rule sets, however, we found that ordering did not affect the overall result, as we tried to ensure each rule was 'true' in all cases. Since, however, more than one rule sometimes carried out the same tag change to a particular word, the system was not optimised for speed and efficiency.
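The longest-match convention can be sketched as follows; the representation of a rule as a pair of match and action functions is an assumption made for illustration, not the Template Tagger rule format.

    # A sketch of the longest-match convention described above.
    def apply_ruleset(tokens, rules):
        """rules: list of (match_fn, action_fn). At each position, the rule with
        the longest match is applied; the matched tokens are then skipped over.
        match_fn returns the number of tokens matched (0 for no match)."""
        i = 0
        while i < len(tokens):
            best_len, best_action = 0, None
            for match_fn, action_fn in rules:
                length = match_fn(tokens, i)
                if length > best_len:
                    best_len, best_action = length, action_fn
            if best_action is not None:
                best_action(tokens, i, best_len)   # e.g. retag the matched span
                i += best_len
            else:
                i += 1
        return tokens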
Besides the ordering of rules within rulesets, it is worth considering the placement of Template Tagger within the tagging schema (Figure 1). Ideally, it would be sensible to exploit the full pattern-matching functionality of Template Tagger earlier in the schema, using it in place of the CLAWS Idiomlist not only after statistical disambiguation, where it is undoubtedly necessary, but also before it. In this way Template Tagger could have prevented much unnecessary ambiguity from passing through to Stage C. above. The reason we did not do this was pragmatic: Template Tagger was in fact developed as a general-purpose annotation tool (see Fligelstone, Pacey and Rayson 1997), not exclusively for the POS-tagging of BNC2. In future versions of the tagging software we hope to integrate Template Tagger more fully with CLAWS.
The post-processing phase has the task of producing output in the form in which the user is going to find it most usable.
The final phase, "ambiguity tagging", merits a little further discussion. The requirement for such tags is clear when one observes that even using Template Tagger on top of CLAWS, there remains a residuum of error, around 2%, in the corpus. By permitting ambiguity tags we are effectively able to "hedge" in many instances that might otherwise have counted as errors - improving the chances of retrieving a particular tag, but at the cost of retrieving other tags as well. We considered that a reasonable goal would be to employ sufficient ambiguity tags to achieve an overall error rate for the corpus of 1%.
Because CLAWS's reliability in statistical disambiguation varies according to the POS-tags involved, we calculated the thresholds for application of ambiguity tags separately for each relevant tag-pair A-B (where A is CLAWS's first-choice and B its second-choice tag). First, the tag-pairs were chosen according to their error frequencies in a training corpus of 100,000 words. The proportion of A-B errors to the total number of errors indicated how many errors of that type would be allowed in order to achieve the 1% error rate overall; we will refer to this figure as "the target number of errors" for A-B. We then
As we report under Error rates, the BNC2 in fact contains a higher error rate than 1%. This is because some thresholds applied at the 1% rate incurred a very high frequency of potential ambiguity tags: we hand-adjusted such thresholds if permitting a slight rise in errors led to a substantial reduction in the number of ambiguities.
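A minimal sketch of how such thresholds might be applied to the probability-ranked CLAWS output is given below; the hyphenated A-B form follows the ambiguity tags mentioned above (e.g. NN1-VVB), but the tag-pairs chosen and the threshold values are invented for illustration.

    # A minimal sketch of ambiguity tagging; the thresholds below are invented
    # (the real ones were set per tag-pair from training-corpus error counts,
    # then hand-adjusted as described above).
    THRESHOLDS = {("VVD", "VVN"): 0.30, ("NN1", "VVB"): 0.25, ("AJ0", "AV0"): 0.35}

    def final_tag(candidates):
        """candidates: (tag, probability) pairs in descending probability order,
        as in the CLAWS output shown earlier. Return either the winning tag or
        an ambiguity tag 'A-B' when the second choice is too strong to discard."""
        (tag_a, _), *rest = candidates
        if rest:
            tag_b, p_b = rest[0]
            threshold = THRESHOLDS.get((tag_a, tag_b))
            if threshold is not None and p_b >= threshold:
                return f"{tag_a}-{tag_b}"
        return tag_a

    print(final_tag([("VVD", 0.55), ("VVN", 0.40), ("AJ0", 0.05)]))   # VVD-VVN
    print(final_tag([("VVG", 0.86), ("NN1", 0.14), ("AJ0", 0.00)]))   # VVG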
The form and frequency of ambiguity tags are explained in the documents Guidelines to Wordclass Tagging and Error rates respectively. Further comments on stages E. and F. can be found in Smith (1997).
1. That is, the error rate based on CLAWS's first choice tag only.
2. We borrow the term "patching" from Brill (1992), although for his tagging program the patches are discovered by an automatic procedure.
3. The repetition value of up to 16 words was arrived at by trial and error; an occurrence of a finite verb beyond that range was rarely in the same clause as the #AFTER-type word.
4. Training and testing were mostly carried out on the BNC Sampler corpus of 2 million words. For less frequent phenomena we needed to use sections from the full BNC. None of the texts used for the tagging error report are contained in the Sampler.
References

Brill, E. (1992) 'A simple rule-based part-of-speech tagger'. Proceedings of the 3rd Conference on Applied Natural Language Processing. Trento, Italy.
Fligelstone, S., Pacey, M., and Rayson, P. (1997) 'How to Generalize the Task of Annotation'. In Garside et al. (1997).
Garside, R., Leech, G., and McEnery, A. (eds.) (1997) Corpus Annotation. London: Longman.
Garside, R., and Smith, N. (1997) 'A hybrid grammatical tagger: CLAWS4'. In Garside et al. (1997).
Leech, G., Garside, R., and Bryant, M. (1994) 'CLAWS4: the tagging of the British National Corpus'. Proceedings of the 15th International Conference on Computational Linguistics (COLING 94). Kyoto, Japan, pp. 622-628.
Marshall, I. (1983) 'Choice of Grammatical Word-class without Global Syntactic Analysis: Tagging Words in the LOB Corpus'. Computers and the Humanities 17, 139-150.
Smith, N. (1997) 'Improving a Tagger'. In Garside et al. (1997).
Date: 17 March 2000