Part-of-speech (POS) tagging,
also called grammatical tagging, is the
commonest form of corpus annotation,
and was the first form of
annotation to be developed by UCREL at Lancaster.
Our POS tagging software for English text,
CLAWS (the Constituent Likelihood Automatic Word-tagging System), has
been continuously developed since the early 1980s.
The latest version of the
tagger, CLAWS4, was used to POS tag c.100 million words of the
British National Corpus (BNC).
CLAWS has consistently achieved 96-97% accuracy (the precise degree
of accuracy varying according to the type of text).
Judged in terms of major categories,
the system has an error rate of only 1.5%,
with c.3.3% ambiguities unresolved, within the BNC.
More detailed analysis of the error rates for the C5 tagset in
the BNC can be found within the BNC manual.
In the context of the BNC Enhancement project, UCREL devised a
Template Tagger to act as a post-processor for CLAWS. The rule-based
formalism implemented in the Template Tagger is more powerful than that
built into CLAWS itself. Manual corpus analysis and knowledge of frequent CLAWS
tagging errors were used to create a rule base for the tool. This led to
an improvement in the tagging accuracy of the resulting corpus.
For more details, see
Fligelstone, Rayson, and Smith (1996) and
Fligelstone, Pacey, and Rayson (1997).
Please note that the Template Tagger processing is not currently included in the
online tagger or the licensed versions of CLAWS4. Please contact us for further
details of current availability.
UCREL offers access to our latest version of CLAWS4 by:
Several tagsets have been used in CLAWS over the years.
The CLAWS1 tagset has 132 basic wordtags,
many of them identical in form
and application to Brown Corpus tags. A revision of
CLAWS at Lancaster in 1983-6 resulted in a new, much
revised, tagset of 166 word tags, known as the
The tagset for the BNC
(C5 tagset) has just over 60 tags.
This tagset was kept small because
it was designed for handling much larger quantities of data than had been
dealt with up to that point.
For the BNC sampler corpus, the enriched
C6 tagset was used, which has
over 160 tags. The current standard tagset is the
C7 tagset, which is the same as the
C6 tagset apart from the punctuation tags. In C6 these all begin with the letter
The mapping between C7 and C5 is a many-to-one conversion, and is available
in a tab-delimited text file.
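As a minimal sketch of how such a many-to-one mapping might be applied, the Python below reads tab-delimited rows and converts a sequence of C7 tags to C5. The column order (C7 tag first, C5 tag second), the function names, and the sample rows are assumptions for illustration only; they are not copied from the distributed mapping file, whose actual layout should be checked before use.

```python
import csv
import io


def load_tag_mapping(lines):
    """Build a C7 -> C5 dictionary from tab-delimited rows.

    Assumed layout (hypothetical): column 1 is the C7 tag, column 2 the C5 tag.
    Blank rows and rows starting with '#' are skipped.
    """
    mapping = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue
        mapping[row[0].strip()] = row[1].strip()
    return mapping


def to_c5(c7_tags, mapping):
    """Convert a list of C7 tags to C5; raises KeyError on an unmapped tag."""
    return [mapping[tag] for tag in c7_tags]


# Illustrative sample only: several C7 noun tags collapsing to one C5 tag.
sample = io.StringIO("NN1\tNN1\nNNL1\tNN1\nNNT1\tNN1\nJJ\tAJ0\n")
mapping = load_tag_mapping(sample)
converted = to_c5(["JJ", "NNT1", "NNL1"], mapping)
```

Because the conversion is many-to-one, it cannot be reversed: a C5 tag does not identify which C7 tag it came from.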
An extension to the large C7 tagset has been devised. The
C8 tagset makes further distinctions
in the determiner and pronoun categories as well as for auxiliary verbs.
Note that characters '@' and '%' appearing at the end of any CLAWS tags
are rarity markers and can be ignored. These were added manually during the
creation of the CLAWS lexicon to indicate rare tags for words.
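If the rarity markers are to be ignored when processing tagged output, they can simply be stripped from the end of each tag. The following is a small sketch of that step; the function name is hypothetical and the example tags are illustrative.

```python
def strip_rarity_markers(tag):
    """Remove any trailing '@' or '%' rarity markers from a CLAWS tag."""
    return tag.rstrip("@%")


# Illustrative tags: two carry a rarity marker, one does not.
cleaned = [strip_rarity_markers(t) for t in ["NN1@", "VV0%", "JJ"]]
```

Only characters at the end of the tag are removed, so a tag without a marker is returned unchanged.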
Many detailed decisions have to be made in drawing the line between the correct and the incorrect assignment of a tag. To determine what counts as a 'correct' or 'accurate' annotation, there have to be detailed guidelines of tagging practice. These are set out in a separate document, the Wordclass Tagging Guidelines, which is part of the Manual to accompany The British National Corpus (Version 2) and gives guidelines for the C5 tagset. There is a similar document for the C7 tagset: BNC sampler corpus - guidelines to wordclass tagging.
Garside, R. (1987). The CLAWS Word-tagging System. In R. Garside, G. Leech and G. Sampson (eds), The Computational Analysis of English: A Corpus-based Approach. Longman, London.
Leech, G., Garside, R., and Bryant, M. (1994). CLAWS4: The tagging of the British National Corpus. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 94), Kyoto, Japan, pp. 622-628.
Garside, R. (1996). The robust tagging of unrestricted text: the BNC experience. In J. Thomas and M. Short (eds), Using corpora for language research: Studies in the Honour of Geoffrey Leech. Longman, London, pp. 167-180.
Garside, R., and Smith, N. (1997). A hybrid grammatical tagger: CLAWS4. In R. Garside, G. Leech and A. McEnery (eds), Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, London, pp. 102-121.
Fligelstone, S., Rayson, P., and Smith, N. (1996). Template analysis: bridging the gap between grammar and the lexicon. In J. Thomas and M. Short (eds), Using corpora for language research. Longman, London, pp. 181-207.
Fligelstone, S., Pacey, M., and Rayson, P. (1997). How to generalize the task of annotation. In R. Garside, G. Leech and A. McEnery (eds), Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, London, pp. 122-136.