CLAWS part-of-speech tagger for English




Introduction

Part-of-speech (POS) tagging, also called grammatical tagging, is the commonest form of corpus annotation, and was the first form of annotation to be developed by UCREL at Lancaster. Our POS tagging software for English text, CLAWS (the Constituent Likelihood Automatic Word-tagging System), has been continuously developed since the early 1980s. The latest version of the tagger, CLAWS4, was used to POS tag c.100 million words of the British National Corpus (BNC).

Accuracy

CLAWS has consistently achieved 96-97% accuracy (the precise degree of accuracy varying according to the type of text). Judged in terms of major categories, the system has an error-rate of only 1.5%, with c.3.3% ambiguities unresolved, within the BNC. More detailed analysis of the error rates for the C5 tagset in the BNC can be found within the BNC manual.
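The two figures above (errors versus unresolved ambiguities) can be kept apart when scoring tagger output against a hand-checked reference. The following is a minimal sketch, not UCREL's evaluation procedure. It assumes, for illustration only, that an unresolved ambiguity is written as two tags joined by a hyphen (for example `VVD-VVN`) and that tokens are already aligned. The function name `tagging_rates` is hypothetical.

```python
def tagging_rates(predicted, gold):
    """Return (error_rate, ambiguity_rate) as fractions of all tokens.

    predicted: list of tags output by the tagger, one per token.
    gold:      list of hand-checked reference tags, same length.
    Assumption: a hyphen inside a predicted tag marks an unresolved ambiguity.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("no tokens to score")

    errors = 0
    ambiguous = 0
    for pred_tag, gold_tag in zip(predicted, gold):
        if "-" in pred_tag:
            # The tagger declined to choose: count as unresolved, not as an error.
            ambiguous += 1
        elif pred_tag != gold_tag:
            errors += 1

    total = len(gold)
    return errors / total, ambiguous / total


# Usage with made-up data: one wrong tag and one unresolved tag out of four tokens.
error_rate, ambiguity_rate = tagging_rates(
    ["AT0", "NN1", "VVD-VVN", "PRP"],
    ["AT0", "NN1", "VVN", "AV0"],
)
print(error_rate, ambiguity_rate)
```

Counting unresolved tokens separately, rather than as errors, mirrors the way the two percentages are reported side by side in the text.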

Template tagging

In the context of the BNC Enhancement project, UCREL devised a Template Tagger to act as a post-processor for CLAWS. The rule-based formalism implemented in the Template Tagger is more powerful than that built into CLAWS itself. Manual corpus analysis and knowledge of frequent CLAWS tagging errors were used to create a rule base for the tool, which improved the tagging accuracy of the resulting corpus. For more details, see Fligelstone, Rayson, and Smith (1996) and Fligelstone, Pacey, and Rayson (1997). Please note that Template Tagger processing is not currently included in the online tagger or the licensed versions of CLAWS4. Please contact us for further details of current availability.

Tagging services

UCREL offers access to the latest version of CLAWS4 in the following ways:
  • site and single-user licences for software use within academic institutions and commercial organisations
  • an in-house tagging service at Lancaster University
  • the free CLAWS WWW tagger, where you can submit text to be POS tagged via the Internet
  • the web-based Wmatrix interface, through which CLAWS can also be accessed

Tagsets

    Several tagsets have been used in CLAWS over the years. The CLAWS1 tagset has 132 basic wordtags, many of them identical in form and application to Brown Corpus tags. A revision of CLAWS at Lancaster in 1983-6 resulted in a new, much revised tagset of 166 wordtags, known as the 'CLAWS2 tagset'. The tagset for the BNC (the C5 tagset) has just over 60 tags. It was kept small because it was designed to handle much larger quantities of data than had been dealt with up to that point. For the BNC sampler corpus, the enriched C6 tagset was used, which has over 160 tags. The current standard tagset is the C7 tagset, which is the same as the C6 tagset apart from the punctuation tags. In C6 these all begin with the letter 'Y'. The mapping between C7 and C5 is a many-to-one conversion, and is available in a tab-delimited text file. An extension to the large C7 tagset has also been devised: the C8 tagset makes further distinctions in the determiner and pronoun categories, as well as for auxiliary verbs. Note that the characters '@' and '%' appearing at the end of any CLAWS tag are rarity markers and can be ignored. They were added manually during the creation of the CLAWS lexicon to indicate tags that are rare for a given word.
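As an illustration of the last two points, the sketch below strips trailing rarity markers and then applies a many-to-one C7-to-C5 lookup read from tab-delimited text. It is a sketch under stated assumptions, not the UCREL tooling: the two-column layout (C7 tag, tab, C5 tag) is assumed, the sample rows are illustrative and not copied from the actual mapping file, and the function names are hypothetical.

```python
import csv
import io

RARITY_MARKERS = "@%"  # trailing characters that only signal a rare tag


def strip_rarity_markers(tag):
    """Remove any '@' or '%' characters from the end of a CLAWS tag."""
    return tag.rstrip(RARITY_MARKERS)


def load_mapping(handle):
    """Read tab-delimited rows into a dict.

    Assumption: column 1 is the C7 tag and column 2 is the C5 tag.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        mapping[row[0].strip()] = row[1].strip()
    return mapping


def to_c5(tag, mapping):
    """Convert one C7 tag (with or without a rarity marker) to its C5 tag."""
    return mapping[strip_rarity_markers(tag)]


# Illustrative rows only; several C7 tags may share one C5 tag (many-to-one).
SAMPLE = "NN1\tNN1\nNNT1\tNN1\nJJ\tAJ0\n"
mapping = load_mapping(io.StringIO(SAMPLE))
print(to_c5("NNT1@", mapping))
```

Because the conversion is many-to-one, a plain dictionary keyed on the C7 tag is sufficient; going back from C5 to C7 would not be possible from this table alone.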

Tagging guidelines

    Many detailed decisions have to be made about where to draw the line between the correct and the incorrect assignment of a tag. To determine what counts as a 'correct' or 'accurate' annotation, there have to be detailed guidelines on tagging practice. These are set out in a separate document, the Wordclass Tagging Guidelines, which is part of the Manual to accompany The British National Corpus (Version 2) and gives guidelines for the C5 tagset. There is a similar document for the C7 tagset: BNC sampler corpus - guidelines to wordclass tagging.

    For more information on the CLAWS tagger, see Garside (1987), Leech, Garside and Bryant (1994), Garside (1996), and Garside and Smith (1997):

    Garside, R. (1987). The CLAWS Word-tagging System. In R. Garside, G. Leech and G. Sampson (eds), The Computational Analysis of English: A Corpus-based Approach. Longman, London.

    Leech, G., Garside, R., and Bryant, M. (1994). CLAWS4: The tagging of the British National Corpus. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 94), Kyoto, Japan, pp. 622-628.

    Garside, R. (1996). The robust tagging of unrestricted text: the BNC experience. In J. Thomas and M. Short (eds), Using corpora for language research: Studies in the Honour of Geoffrey Leech. Longman, London, pp. 167-180.

    Garside, R., and Smith, N. (1997). A hybrid grammatical tagger: CLAWS4. In R. Garside, G. Leech and A. McEnery (eds), Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, London, pp. 102-121.

    Fligelstone, S., Rayson, P., and Smith, N. (1996). Template analysis: bridging the gap between grammar and the lexicon. In J. Thomas and M. Short (eds), Using corpora for language research. Longman, London, pp. 181-207.

    Fligelstone, S., Pacey, M., and Rayson, P. (1997). How to generalize the task of annotation. In R. Garside, G. Leech and A. McEnery (eds), Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, London, pp. 122-136.

