CLAWS part-of-speech tagger for English

Introduction	Part-of-speech (POS) tagging, also called grammatical tagging, is the commonest form of corpus annotation, and was the first form of annotation to be developed by UCREL at Lancaster. Our POS tagging software for English text, CLAWS (the Constituent Likelihood Automatic Word-tagging System), has been continuously developed since the early 1980s. The latest version of the tagger, CLAWS4, was used to POS tag c.100 million words of the British National Corpus (BNC).
Accuracy	CLAWS has consistently achieved 96-97% accuracy (the precise degree of accuracy varying according to the type of text). Judged in terms of major categories, the system has an error-rate of only 1.5%, with c.3.3% ambiguities unresolved, within the BNC. More detailed analysis of the error rates for the C5 tagset in the BNC can be found within the BNC manual.
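To give a sense of scale, the percentages above can be turned into approximate word counts. The sketch below is only a back-of-the-envelope illustration: it assumes a round corpus size of 100 million words and assumes that the major-category rates apply uniformly across the whole corpus, neither of which is stated precisely here. The names used are hypothetical.

```python
# Rough scale of the major-category figures quoted for the BNC.
# Assumptions (not from the BNC manual): a round 100 million words,
# and rates that apply uniformly across the corpus.
BNC_WORDS = 100_000_000

# Rates held as basis points (1 bp = 0.01%) to keep the arithmetic exact.
ERROR_RATE_BP = 150       # 1.5% error rate
AMBIGUITY_RATE_BP = 330   # c.3.3% ambiguities left unresolved


def expected_count(words: int, rate_bp: int) -> int:
    """Return the number of words affected at the given rate."""
    return words * rate_bp // 10_000


mistagged = expected_count(BNC_WORDS, ERROR_RATE_BP)
unresolved = expected_count(BNC_WORDS, AMBIGUITY_RATE_BP)

print(f"Approx. mistagged words:  {mistagged:,}")
print(f"Approx. unresolved words: {unresolved:,}")
```

Under these assumptions, a 1.5% error rate corresponds to roughly 1.5 million words, which is why the choice of tagset and post-processing matters at this corpus size.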
Template tagging	In the context of the BNC Enhancement project, UCREL devised a Template Tagger to act as a post-processor for CLAWS. The rule-based formalism implemented in the Template Tagger is more powerful than that built into CLAWS itself. Manual corpus analysis and knowledge of frequent CLAWS tagging errors were used to create a rule base for the tool, which improved the tagging accuracy of the resulting corpus. For more details, see Fligelstone, Rayson, and Smith (1996) and Fligelstone, Pacey, and Rayson (1997). Please note that Template Tagger processing is not currently included in the online tagger or the licensed versions of CLAWS4. Please contact us for further details of current availability.
Tagging services	UCREL offers access to our latest version of CLAWS4 by: selling site and single-user licences for software use within academic institutions and commercial organisations; providing an in-house tagging service at Lancaster University; and offering our free CLAWS WWW tagger, where you can submit text to be POS tagged via the Internet. CLAWS can also be accessed through the web-based Wmatrix interface.
Tagsets	Several tagsets have been used in CLAWS over the years. The CLAWS1 tagset has 132 basic word tags, many of them identical in form and application to Brown Corpus tags. A revision of CLAWS at Lancaster in 1983-6 resulted in a new, much revised tagset of 166 word tags, known as the 'CLAWS2 tagset'. The tagset for the BNC (the C5 tagset) has just over 60 tags. It was kept small because it was designed for handling much larger quantities of data than had been dealt with up to that point. For the BNC sampler corpus, the enriched C6 tagset was used, which has over 160 tags. The current standard tagset is the C7 tagset, which is the same as the C6 tagset apart from the punctuation tags. In C6 these all begin with the letter 'Y'. The mapping from C7 to C5 is a many-to-one conversion, and is available in a tab-delimited text file. An extension to the large C7 tagset has also been devised: the C8 tagset makes further distinctions in the determiner and pronoun categories, as well as for auxiliary verbs. Note that the characters '@' and '%' appearing at the end of any CLAWS tag are rarity markers and can be ignored. These were added manually during the creation of the CLAWS lexicon to indicate rare tags for words.
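As an illustration of the two points above (ignoring rarity markers and applying the many-to-one C7 to C5 conversion), here is a minimal Python sketch. It assumes the mapping file has two tab-separated columns, C7 tag first and C5 tag second; the actual column layout should be checked against the file itself. The function names and the three sample rows are illustrative only and are not copied from the real mapping file.

```python
import csv
import io
from typing import Dict, TextIO

RARITY_MARKERS = "@%"  # trailing markers that can be ignored


def strip_rarity(tag: str) -> str:
    """Remove trailing '@' or '%' rarity markers from a CLAWS tag."""
    return tag.rstrip(RARITY_MARKERS)


def load_mapping(handle: TextIO) -> Dict[str, str]:
    """Read a tab-delimited table: C7 tag, then C5 tag (assumed layout)."""
    mapping: Dict[str, str] = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:  # skip blank or malformed lines
            mapping[row[0].strip()] = row[1].strip()
    return mapping


def c7_to_c5(tag: str, mapping: Dict[str, str]) -> str:
    """Convert one C7 tag to its C5 equivalent, ignoring rarity markers."""
    return mapping[strip_rarity(tag)]


# Illustrative rows only; a real run would open the downloaded file instead.
SAMPLE = "AT\tAT0\nAT1\tAT0\nNN1\tNN1\n"
mapping = load_mapping(io.StringIO(SAMPLE))

print(c7_to_c5("AT1", mapping))   # several C7 tags can share one C5 tag
print(c7_to_c5("NN1@", mapping))  # rarity marker is dropped before lookup
```

Because the conversion is many-to-one, it cannot be reversed: a C5 tag does not determine which C7 tag it came from, so text tagged only with C5 cannot be upgraded to C7 by table lookup alone.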
Tagging guidelines	Many detailed decisions have to be made in deciding how to draw the line between the correct and the incorrect assignment of a tag. So that the concept of what is a 'correct' or 'accurate' annotation can be determined, there have to be detailed guidelines of tagging practice. These are incorporated in a separate document, the Wordclass Tagging Guidelines. (This is part of the Manual to accompany The British National Corpus (Version 2) and gives guidelines for the C5 tagset). There is a similar document for the C7 tagset: BNC sampler corpus - guidelines to wordclass tagging.

For more information on the CLAWS tagger, see Garside (1987), Leech, Garside and Bryant (1994), Garside (1996), and Garside and Smith (1997):

Garside, R. (1987). The CLAWS Word-tagging System. In: R. Garside, G. Leech and G. Sampson (eds), The Computational Analysis of English: A Corpus-based Approach. London: Longman.

Leech, G., Garside, R., and Bryant, M. (1994). CLAWS4: The tagging of the British National Corpus. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 94), Kyoto, Japan, pp. 622-628.

Garside, R. (1996). The robust tagging of unrestricted text: the BNC experience. In J. Thomas and M. Short (eds), Using corpora for language research: Studies in the Honour of Geoffrey Leech. Longman, London, pp. 167-180.

Garside, R., and Smith, N. (1997). A hybrid grammatical tagger: CLAWS4. In R. Garside, G. Leech, and A. McEnery (eds), Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, London, pp. 102-121.

Fligelstone, S., Rayson, P., and Smith, N. (1996). Template analysis: bridging the gap between grammar and the lexicon. In J. Thomas and M. Short (eds), Using corpora for language research. Longman, London, pp. 181-207.

Fligelstone, S., Pacey, M., and Rayson, P. (1997). How to generalize the task of annotation. In R. Garside, G. Leech, and A. McEnery (eds), Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, London, pp. 122-136.
