below and Wilson and Rayson, 1993), UCREL is collaborating with Reflexions Communication Research Ltd. to develop software which will undertake the semantic tagging of words and then automatically assign 'content tags' to the words in a set of interview transcripts, and provide a statistical analysis of the resulting tag frequency profile. The project aims to extend previous work by developing enhanced disambiguation techniques, larger lexical resources and word sense frequency data for spoken English, automatic pronoun resolution and a broader syntactic analysis.

ACASD is a suite of programs for the automated semantic field tagging and content analysis of spontaneous spoken English. It is made up of 5 main modules: the semantic tagging program itself; a program for manual postediting of the output; a program for the identification of key syntactico-semantic links; a program for mapping semantic field tags onto research-specific content categories; and a dedicated statistics and concordance module with a user-friendly X-windows interface. This project aims to combine the best of both qualitative and quantitative survey research, but it has many other potential applications in linguistics and more generally in the social sciences and humanities: for example, a pilot study of a large corpus of doctor-patient interactions has been carried out using ACASD, and its application to the stylistic analysis of written as well as spoken English has been piloted by Wilson and Leech (1993). The project is funded by the EPSRC (Engineering and Physical Sciences Research Council), and is part of a collaborative (JFIT) project with commercial partnership and the support of the DTI.
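As a rough illustration of the mapping and statistics modules described above, the following sketch maps semantic field tags onto content categories and counts the result. The tag names, the category names and the `content_profile` function are hypothetical; they are not taken from ACASD itself.

```python
from collections import Counter

# Hypothetical mapping from semantic field tags to research-specific
# content categories; ACASD's real tagset is not reproduced here.
CATEGORY_MAP = {
    "FOOD": "product",
    "DRINK": "product",
    "MONEY": "price",
    "EVALUATION": "opinion",
}


def content_profile(tagged_words, category_map):
    """Return a frequency profile of content categories.

    `tagged_words` is a list of (word, semantic_tag) pairs. Words whose
    tag has no entry in `category_map` are left out of the profile.
    """
    profile = Counter()
    for _word, semantic_tag in tagged_words:
        category = category_map.get(semantic_tag)
        if category is not None:
            profile[category] += 1
    return profile


# A tiny invented transcript fragment, already semantically tagged.
transcript = [
    ("coffee", "DRINK"),
    ("expensive", "MONEY"),
    ("nice", "EVALUATION"),
    ("biscuit", "FOOD"),
    ("well", "DISCOURSE"),  # unmapped tag, so it is ignored
]

profile = content_profile(transcript, CATEGORY_MAP)
for category, count in profile.most_common():
    print(category, count)
```

Keeping the mapping in a plain dictionary means the same tagged transcript can be re-profiled for a different research question simply by swapping the category map.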

Treebank development for probabilistic parsing (ATR project)

[1994-1997] UCREL collaborated with the Advanced Telecommunications Research (ATR) Institute of Kyoto, Japan, in the development of a computer/human interactive system for the treebanking of unconstrained modern English texts. CLAWS assigns a syntactic tag to each word and this is enriched with a semantic tag selected by the human analyst using XANTHIPPE, an editing program written by Roger Garside. The fully-labelled sentences are then grammatically analysed ("parsed") with the aid of ATR's treebanking tool, which allows the grammarian to select the correct analysis according to the ATR Grammar which is contained within the tool. The sentences will form a training corpus for probabilistic parsing.

Automatic content analysis of interview transcripts

[1990-92] This SERC project, funded under IEATP (Information Engineering Advanced Technology Programme), was a collaborative project with DTI funding. The lead industrial partner was Reflexions Market Research, London. The task was to develop software for assigning content tags to the words in interview transcripts, identifying syntactico-semantic links between words, and outputting a statistical profile of the resulting content analysis. The project ventured into a field where natural language processing had not operated before. (see the

Benedict - The New Intelligent Dictionary

[2002-2005] The Benedict project brings together language technology providers, academia, the dictionary publishing world, and user organizations to discover the best way to cater for the needs of dictionary users by combining state-of-the-art language technology with research results on user needs and on the potential of future dictionaries.

The project integrates these approaches to produce a future dictionary, Benedict, which meets users and their needs halfway by providing an interactive, user-specified access interface; tailoring the dictionary information supplied to user specifications; and incorporating a multilayered entry structure with new information categories, links to corpus data, and syntactically and semantically based corpus search tools in the dictionary database.

The new product, Benedict, is aimed particularly at the demands of the multilingual corporate world. Our project partners are Kielikone Oy, HarperCollins Publishers Ltd, Gummerus Kustannus Oy, University of Tampere, and Nokia.

The British National Corpus (BNC)

Oxford University Press, Oxford University Computing Service, The British Library, Longman Group Ltd. and W. & R. Chambers Ltd.). The project goal was to compile a 100 million word national corpus of machine readable text, including a wide variety of written and spoken British English. UCREL's contribution to this project was primarily in the linguistic analysis of the corpus. Funding: SERC/DTI.

British National Corpus Enhancement Project

[1995-6] This EPSRC-funded Project will produce an enhanced version of the 100 million word British National Corpus. The Corpus consists of a wide variety of machine-readable texts selected from many written and spoken genres of British English. UCREL was responsible for providing the linguistic description for this Corpus, in the form of an automatic part-of-speech analysis for the whole Corpus, and a detailed hand-corrected analysis of a 2-million word sample selected from the main Corpus. The BNC Enhancement Project aims to improve the accuracy of this linguistic analysis by identifying and correcting errors and ambiguities in the Corpus using various automatic techniques. The Project also aims to produce documentation for the linguistic analysis. (See above for information on the first project.)

A Corpus of English Dialogues, 1560-1760

The 'Corpus of English Dialogues, 1560-1760' is an AHRB-funded project which aims to construct a computerised corpus of more than one million words of Early Modern English dialogue texts. Why dialogue texts? The focus is on dialogue because it will allow insight into the nature of impromptu speech and interactive two-way communication in the Early Modern English period, aspects which have received little research attention.

The corpus will cover a two-hundred-year time span from 1560 to 1760. It draws texts from six text-types: trial proceedings, depositions, drama, handbooks in dialogue form, prose fiction and language teaching books. In this way, examples of both recorded and constructed dialogue are included.

This project also represents a collaboration with Prof. Merja Kytö (Uppsala University).

Grammatical tagging of the Corpus of London Teenage English (COLT project)

[1996] This is a small-scale project funded by the University of Bergen (Norway). UCREL is undertaking the grammatical tagging of the 500,000 word COLT corpus.

Corpus-based grammar in contrast (CORGRAM)

Funded by the AHRC, this project explores an application of novel corpus-based methods to a set of issues in grammatical analysis, in the context of a language, Nepali, for which corpus linguistics is in its infancy. It will also extend the analysis to a cross-linguistic comparison bringing in English and Russian. The methodology that will be employed by the project is a novel empirical approach to grammatical categories and the quantitative patterns in which they are distributed in textual data. At the core of the methodology are co-occurrence statistics derived from text corpora (primarily written corpora, because spoken corpora of sufficient size are only patchily available). These statistics will take two primary forms: raw co-occurrence counts of grammatical categories, and collocation lists.
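To make the first kind of statistic concrete, here is a minimal sketch of raw co-occurrence counting over grammatical categories. It assumes the corpus is already available as sentences of (word, tag) pairs; the tag labels, the `window` parameter and the function name are illustrative assumptions, not the project's own tagset or software.

```python
from collections import Counter


def tag_cooccurrence(tagged_sentences, window=1):
    """Count how often one tag is followed by another within `window` tokens.

    Counts never cross sentence boundaries. With window=1 this is a plain
    count of adjacent tag pairs.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        tags = [tag for _word, tag in sentence]
        for i, left in enumerate(tags):
            for right in tags[i + 1 : i + 1 + window]:
                counts[(left, right)] += 1
    return counts


# Two invented English sentences with hypothetical coarse tags.
sample = [
    [("the", "DET"), ("old", "ADJ"), ("house", "NOUN"), ("stands", "VERB")],
    [("a", "DET"), ("house", "NOUN"), ("stands", "VERB")],
]

counts = tag_cooccurrence(sample)
for (left, right), n in counts.most_common():
    print(left, right, n)
```

Collocation lists could be built on the same counts by comparing each observed pair frequency with what the individual tag frequencies would predict, but that step is left out of this sketch.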

The questions to be addressed are:

Corpus Resources and Terminology Extraction (CRATER)

[1994-5] This EU-funded project, of which Tony McEnery of UCREL was the lead partner, led to the alignment of a multilingual corpus of English, French and Spanish texts which are mutual translations (a parallel corpus). The tri-lingual aligned corpus is being distributed to users in Europe and elsewhere, being the first corpus of its kind to be made generally available through the Internet. The project has also created software for corpus alignment, and for extraction of terminology from the parallel corpus. We collaborated with IBM Paris, the Paris software house (C2V), and the Universidad Autonoma de Madrid.

EAGLES Phase II: Guidelines for the Syntactic Annotation of Corpora

This phase of the EAGLES (Expert Advisory Group for Language Engineering Standards) Initiative began in February 1995, ran until the end of July 1995, and was run from Lancaster. The final aim was to produce guidelines for the standardisation of syntactic annotation of corpora. The work has been carried out in three phases:
  1. collection of data on existing syntactically annotated corpora
  2. production of two reports (SASG2 and SASG3 - available via ftp from Pisa) in which a detailed overview of existing annotation practices has been carried out.
  3. production of guidelines for standardisation (document SASG1 - under development).
This work carries on directly from the previous EAGLES phase on morphosyntactic annotation of corpora. However, existing practices in syntactic annotation are much more varied and tend, in most cases, to be more theory-dependent. For these reasons the guidelines include the following sections:
  1. recommended annotation using a constituency grammar
  2. recommended annotation using a dependency grammar
  3. guidelines regarding documentation of an annotated corpus
  4. example annotated texts from a variety of European languages.
EAGLES Preliminary Reports, published in electronic form by the European Commission (some are available at the EAGLES Home Page in Pisa, Italy):

The coordinators of this subgroup were Geoffrey Leech, Ruthanna Barnett and Peter Kahrel. Other members of the subgroup were Hans van Halteren (Univ. of Nijmegen), Jean-Marc Langé (IBM Paris), Simonetta Montemagni (Institute of Computational Linguistics, Pisa) and Atro Voutilainen (Univ. of Helsinki).

Enabling Minority Language Engineering Project (EMILLE)

EMILLE (Enabling Minority Language Engineering) is a three-year EPSRC project at Lancaster University and Sheffield University, designed to build a 63 million word electronic corpus of South Asian languages, especially those spoken in the UK.

Extending CLAWS

[July 2003 - February 2004] Extending CLAWS was a project funded by Lancaster University's small grants scheme for 4 months. It enabled the following work:
  a) Investigate and scope the changes required to allow CLAWS to be applied to American English.
  b) Create parallel (American English) versions of the machine-readable dictionaries used by CLAWS, and begin implementing the changes required for American English.
  c) Carry out an evaluation by applying part-of-speech tagging to the first release of the American National Corpus and, in addition, the Michigan Corpus of Academic Spoken English (MICASE).
  d) Port CLAWS to run under the Windows operating systems.

We are grateful for the support of external developers and beta-testers as part of this work including a) Jukka-Pekka Juntunen (Kielikone Ltd, Helsinki) b) Dr John Milton (Hong Kong University of Science and Technology) c) Dr Yukio Tono (Meikai University, Tokyo).

EUROTRA Machine Translation Project (ET10/63)

[May 1992 - Oct 1993] UCREL was involved (along with Essex University, IBM France, and the Paris software house C2V) in the European Community's machine translation project. UCREL's role was mainly in the grammatical analysis of English texts for the project, and semi-automated argument frame extraction for lexicography. Funding: EC.

The Lancaster Database of Text Corpora

[1990-5] A major outcome of this project is a corpus of grammatically and semantically annotated texts drawn from varied genres of written English. It is planned to lodge the corpus in the ESRC Data Archive, so that it will be available to other researchers.

Language Engineering Resources for the British Isles Indigenous Minority Languages

[2002-] The indigenous minority languages of the British Isles (or "BIMLs"), namely Cornish, (Scottish) Gaelic, Irish, Manx, Scots, Ulster Scots (Ullans) and Welsh, are becoming increasingly widely used in both public and private life. Speech and language technology applications for these languages are therefore becoming an urgent need. They are needed not only for monolingual content management, but also to aid translators and interpreters, since the BIMLs are nearly always used in bilingual contexts alongside English. To develop such applications, basic language resources (such as corpora of machine-readable texts, machine-readable dictionaries, speech databases and so on) are required.

Two recent EPSRC-funded projects at Lancaster (MILLE and EMILLE) have provided a great service to the non-indigenous minority language communities in the UK by locating existing resources, investigating end-user needs and wants, examining basic technical issues and beginning to generate appropriate resources. However, no such consolidated survey or examination of issues has yet been undertaken for the BIMLs. The present project thus has three broad goals: first, to survey the existing resources and tools for the various BIMLs; second, to obtain information about end-user needs and wants in these areas; and, third, to investigate some of the technical and practical issues that the BIMLs raise, primarily for the collection, transcription and annotation of spoken corpus material. The latter will involve collecting and annotating a small sample corpus of spoken Welsh and Gaelic.

For more information, see the project web site.

Development of a Language Model for Speech Recognition

[1987-1992] This project ran for over five years with IBM funding, and involved collaboration with the IBM T.J. Watson Research Center, Yorktown Heights, New York (up to the end of 1990, IBM UK Scientific Centre, Winchester, was also involved).

UCREL's work in this collaboration entailed compiling a large treebank (or syntactically analysed corpus - more than 3 million words) to act as a testbed for development of probabilistic grammars. Another type of corpus processing, anaphoric annotation, has been extensively carried out, using the same treebank material. UCREL's work involved developing specialised editing tools for these tasks.

These kinds of corpus processing have primary applications in (1) speech recognizers, and (2) probabilistic machine translation.

For more information, see the book: Black, Garside, and Leech (eds) (1993).

Integrated resources for written and spoken language resources

[1997-8] This is part of the EC's EAGLES initiative, funded by DGXIII, to establish standards for the development of natural language resources in EC countries. UCREL is subcontracted to the University of Bielefeld, Germany, to undertake a survey of practices in the representation and annotation of dialogue, and to formulate provisional guidelines for standardization of such practices.

For more information, see the EAGLES WP4 homepage. Also see Grice et al (2000).

The Machine Readable Spoken English Corpus (MARSEC)

This project, run by Dr. Gerry Knowles jointly with the University of Leeds, aims to convert the Lancaster-IBM Spoken English Corpus into a relational speech database (including phonetic transcription). Funding: ESRC. See the related work at Reading.

Minority Language Engineering Project (MILLE)

The Minority Language Engineering Project is a joint project between the Department of Linguistics at Lancaster University and Oxford University Computer Services. It seeks to investigate the development of corpus resources for UK non-indigenous minority languages (NIMLs).

A multimedia corpus-based longitudinal study of children's writing

[1996-1998] A two-year project supported by the Leverhulme Trust in collaboration with the Centre for Language in Social Life research group in the Department of Linguistics and Modern English Language. The particular interest of this project lies in the combination of textual and non-textual (visual) material in the same corpus.
See the project publications in the proceedings of PALC 97 (Practical Applications of Language Corpora, Lodz, Poland) and the November 1998 issue of Literary and Linguistic Computing (Smith et al 1998). The Lancaster Corpus of Children's Project Writing is online.

Parallel Wordclass Tagging of the Parole 2 Corpus and the British National Corpus

[July 1997-] (Funded by EPSRC.) The aim of this project, which involves collaboration with the University of Birmingham, is to tag morphosyntactically 250,000 words of text from the PAROLE2 corpus and 1 million words from the British National Corpus using two different taggers: Lancaster's CLAWS tagger and Birmingham's QTAG tagger. Both taggers will be developed to make use of an EAGLES-conformant tagset. The resulting outputs will be compared, and the comparison will focus on discovering how far the two taggers correct each other's errors in predictable ways. The analysis will result in a set of contextual rules to convert dispreferred tags to preferred tags, so that the result of parallel tagging will be more accurate than the output of a single tagger. If there is time, it is hoped that the experiment can be extended to take in further taggers.
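The idea of letting two taggers correct each other can be sketched in a few lines. The example below is a deliberately simplified assumption, not the project's method: it ignores context and simply learns, from a small hand-checked sample, which tag is usually right for each pattern of disagreement. The tag labels and function names are hypothetical and do not represent the CLAWS, QTAG or EAGLES tagsets.

```python
from collections import Counter, defaultdict


def learn_preferences(tags_a, tags_b, gold):
    """For each disagreement pattern (tag from A, tag from B), record the
    tag that the hand-checked sample most often shows to be correct."""
    tallies = defaultdict(Counter)
    for a, b, correct in zip(tags_a, tags_b, gold):
        if a != b:
            tallies[(a, b)][correct] += 1
    return {pair: tally.most_common(1)[0][0] for pair, tally in tallies.items()}


def combine(tags_a, tags_b, preferences):
    """Merge two aligned tag sequences.

    Where the taggers agree, keep the shared tag. Where they disagree, use
    the learned preferred tag, falling back on tagger A for unseen patterns.
    """
    merged = []
    for a, b in zip(tags_a, tags_b):
        if a == b:
            merged.append(a)
        else:
            merged.append(preferences.get((a, b), a))
    return merged


# Invented, token-aligned output from two taggers plus a hand-checked sample.
tagger_a = ["DET", "NOUN", "VERB", "NOUN"]
tagger_b = ["DET", "VERB", "VERB", "ADJ"]
checked = ["DET", "NOUN", "VERB", "ADJ"]

preferences = learn_preferences(tagger_a, tagger_b, checked)
merged = combine(tagger_a, tagger_b, preferences)
print(preferences)
print(merged)
```

A contextual version would key the preferences on neighbouring tags or words as well as on the disagreeing pair, at the cost of needing a larger hand-checked sample.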

Polish/English Corpus Research and Applications

A British Council funded project, PELCRA, which is being carried out together with the University of Lodz in Poland, aims to provide a databank of spoken and written Polish which could help build better dictionaries and help in English language education in Poland.

The data for the project is being gathered from newspapers in Poland such as Gazeta Wyborcza and from tape recordings of spontaneous speech in a variety of settings.

Leverhulme Corpus Project

Investigator: Geoffrey Leech
Research Associate: Nick Smith (half-time)

This project is supported by the Leverhulme Trust under its Emeritus Fellowship Scheme. The research project runs for 15 months from October 2003. The plan is to build a corpus which matches as closely as possible the LOB and FLOB corpora of written British English, except that the year of data collection is 1931, or near to that date (+/- 3 years). The immediate purpose of building this corpus is to make it possible to compare these three temporally equidistant corpora (1931, 1961, 1991): "Pre-LOB", LOB, and FLOB. This will enable us to track grammatical change through a period of 60 years of the 20th century. In previous projects on recent grammatical change in English funded by the AHRB and the British Academy, we have been able to observe some notable trends through the differences between corpora of the 1960s and the 1990s, such as a declining frequency of the modal auxiliaries (especially shall, must, ought to and may) and a growing frequency of semi-modals such as have to, need to, and want to. By projecting this comparison back to the beginning of the 1930s, we will be able to confirm whether these trends are a continuation of earlier changes. The early decades of the 20th century are virtually unrepresented in corpora of English, and so the planned new corpus will fill an important empirical gap in our historical knowledge of the language. The new corpus under construction is as yet unnamed.

Self-Access Grammar Tools

[1994-5] The second phase of this project (funded by IHE [Innovation in Higher Education]) has been completed, and the software is ready to be used operationally in the teaching of Linguistics and English Language at Part I level. The current widespread inadequacy of grammar teaching in schools makes it highly desirable to provide this facility for university foundation courses, and the software is already being used experimentally in two universities apart from Lancaster.

Corpus of Written British Creole

[Feb.-May 1995] The purpose of this project was to collect and computerize, for the first time, a corpus from the writings of British Creole speakers of Afro-Caribbean origin.
More details are available.

Requirements reverse engineering to support business process change

[May 1998-April 2001] The aim of the research proposed here is to improve the requirements analysis for legacy system evolution where the underlying business process has already changed. As far as we are aware, this is a unique focus which, despite addressing a real-world problem, has not been systematically addressed before. Our approach is to investigate the reverse engineering of requirements documents by the novel integration of techniques for the textual analysis of documentation; modelling of business processes; and modelling the organisational structures serving the business processes.

Rhythm and Timing in Spoken English

[1994-5] The objective of this project has been to extract a model of rhythm and timing from the MARSEC database (a multi-layered corpus of spoken English developed at Lancaster in collaboration with Leeds University). The MARSEC database is a unique resource which can now be used to investigate in microscopic detail areas of spoken language performance which have baffled phoneticians for years. The analysis of duration (closely allied to rhythm), which is the focus of this project, has application to high-quality speech synthesis.

Scragg Revisited - a quantitative investigation of spelling across the centuries

[April 05 - July 05, funded by the British Academy with the University of Central Lancashire] In January 2003, Dawn Archer and Paul Rayson began to explore the feasibility of retraining the UCREL Semantic Analysis System (USAS) so that it could be applied to historical texts dating from 1600 onwards (Archer et al 2003). Previously, the USAS system had been used to annotate written and spoken English from the modern period. Our initial experiments suggested that the statistical model adopted in the part-of-speech tagging component (CLAWS) was sufficiently robust to cope with grammatical variation across time, but spelling variants caused difficulty, as variants of known words were not recognized by the system. Our response was to implement a prototype spelling detector as a pre-processing step in USAS. This work continued in the WordHoard project funded by the Mellon Foundation with Prof. Martin Mueller (Northwestern University, Chicago), in which we linguistically annotated the Nameless Shakespeare corpus and Chadwyck-Healey's Eighteenth and Nineteenth Century Fiction corpora.

The approach we have adopted thus far is to produce a list of variant spellings, which we manually match to normalized forms. A Variant Detector computer program (VARD) then inserts modern equivalents of these forms when they appear in a given text while preserving the original variant. This approach is proving to be very effective: we have identified 45,800+ variants from our analyses of different historical texts to date, and have undertaken a preliminary empirical study of spelling variation across the 16th-19th centuries based on 4,000 of these variants (Archer & Rayson 2004).
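The list-based approach described above amounts to a lookup that pairs each historical form with its modern equivalent while keeping the original. The following is a minimal sketch under that reading; the three-entry variant list, the lower-casing of lookups and the (original, normalized) pair representation are assumptions made for illustration and do not reflect VARD's actual data or output format.

```python
# Hypothetical, hand-matched variant list: historical spelling -> modern form.
VARIANTS = {
    "loue": "love",
    "doe": "do",
    "vertue": "virtue",
}


def normalise_tokens(tokens, variants):
    """Pair every token with its modern equivalent.

    Each result is an (original, normalized) tuple, so the original variant
    is preserved alongside the inserted modern form. Tokens not found in the
    variant list are paired with themselves. Lookups are lower-cased, so this
    simple version does not restore capitalisation on the modern form.
    """
    return [(token, variants.get(token.lower(), token)) for token in tokens]


pairs = normalise_tokens("I doe loue vertue".split(), VARIANTS)
for original, modern in pairs:
    print(original, "->", modern)
```

Because the original token is kept in every pair, later analysis can run on the normalized forms while still reporting or counting the historical spellings.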

Whilst this early work has fostered considerable interest from historical linguists, dialectologists and historians, we recognize that such an approach will prove to be too time consuming in the long-term. We therefore intend to identify patterns from our existing variants as a means of developing the VARD so that it is able to detect and "normalize" spelling variants automatically (via fuzzy matching procedures: edit distance, letter replacement heuristics and the Soundex algorithm). We believe that the VARD enables a more comprehensive study of variation than has been previously possible (cf. Scragg 1974; Osselton 1984). We will demonstrate this by undertaking a study of variation across four centuries (i.e. 16th-19th), using the datasets mentioned above. We will also study variation across text-types dating from the same period (i.e. the 17th century). This particular study will draw from Shakespeare's Complete Works, the Lampeter Corpus (1640-1760) and the Corpus of English Dialogues (1560-1760). The latter study will also take note of any idiosyncratic variant usage (cf. Osselton 1984). Because of the way in which variants are categorized, the potential of the VARD is much greater than our proposed pilot studies intimate: indeed, the VARD will enable quantitative analyses of not only non-standard spellings and contractions, but also variation at the morphological, phonetic, orthographic and syntactic levels.
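The three fuzzy matching procedures named above can be combined in a small candidate generator. The sketch below is an illustration only: it uses the standard Soundex coding and Levenshtein edit distance, while the letter replacement rules (u/v, i/j, y/i), the `max_distance` threshold, the ranking and the toy lexicon are assumptions and are not taken from VARD.

```python
# Standard Soundex digit groups.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the four-character Soundex code of `word`."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    previous = _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so repeats are coded again
    return (result + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Assumed replacement rules; a real rule set would be derived from the data.
RULES = (("u", "v"), ("v", "u"), ("i", "j"), ("j", "i"), ("y", "i"))


def letter_variants(word, rules=RULES):
    """Return the forms produced by applying one rule at one position."""
    forms = set()
    for old, new in rules:
        for i, ch in enumerate(word):
            if ch == old:
                forms.add(word[:i] + new + word[i + 1:])
    return forms


def suggest(variant, lexicon, max_distance=2):
    """Rank modern candidates: rule matches first, then by edit distance."""
    variant = variant.lower()
    replaced = letter_variants(variant)
    scored = []
    for entry in lexicon:
        distance = edit_distance(variant, entry)
        by_rule = entry in replaced
        by_sound = soundex(variant) == soundex(entry)
        if by_rule or (by_sound and distance <= max_distance):
            scored.append((not by_rule, distance, entry))
    return [entry for _rank, _distance, entry in sorted(scored)]


LEXICON = ["love", "lone", "virtue", "vertex", "do"]  # toy modern word list
print(suggest("loue", LEXICON))
print(suggest("vertue", LEXICON))
```

In this toy run the u/v rule is what links "loue" to "love", since their Soundex codes differ, whereas "vertue" reaches "virtue" through a shared Soundex code and an edit distance of one. This is why the procedures are more useful in combination than alone.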

A Speech-Act Annotated Corpus for Dialogue Systems: Pilot Project (SPAAC)

Geoffrey Leech, Tony McEnery, and Martin Weisser

The SPAAC dialogue annotation system has been developed, under EPSRC grant GR/R371542, primarily for XML speech-act annotation of service dialogues. But in the course of working on it we broadened the annotation to include six dimensions:

Dimensions (c) and (d) can be seen as respectively concerned with domain-restricted semantics and domain semantic-pragmatics. These need more work. Dimension (f) provides the key level of annotation, for which other dimensions contribute diagnostic information.

The tagging is semi-automatic: after an initial parsing-assisted structural segmentation of the unpunctuated turns, the annotation tool SPAACy (developed by Martin Weisser) then automatically adds XML mark-up to the dialogue and undertakes a preliminary analysis of all five dimensions (b)-(f). The output of SPAACy is then manually post-edited.

The result so far has been a tagging of two major kinds of telephone dialogue: telephone operator and service dialogues (data provided by BT) and train booking dialogues (data provided by Arrangements for making the annotated dialogue corpora available to users are under way.

For details of the speech act annotation scheme, see the SPAAC Annotation Scheme document (available as MS Word or PDF).

Unlocking the Word Hoard

[2004-5 Funded by the Andrew W. Mellon Foundation] The WordHoard Project is named after an Old English phrase for the verbal treasure "unlocked" by a wise speaker. It applies to highly canonical literary texts the insights and techniques of corpus linguistics, that is to say, the empirical and computer-assisted study of large bodies of written texts or transcribed speech. In the WordHoard environment, such texts are annotated or tagged by morphological, lexical, semantic, prosodic, and narratological criteria. They are mediated through a "digital page" or user interface that lets scholarly but non-technical users explore the greatly increased query potential of textual data kept in such a form.

It is a basic assumption of WordHoard that new kinds of historical, literary, or broadly cultural analysis will be supported through the forms of data access that are made possible when literary texts are treated in the manner of linguistic corpora. Deeply tagged corpora of course support more finely grained inquiries at a verbal or stylistic level. But more importantly, access to the words of a text at such microscopic levels also lets you look in new ways at the imaginative worlds created by those words.

WordHoard consists of the following components:

WordHoard is a project of Academic Technologies at Northwestern University, with support from the Northwestern University Library. The linguistic annotation will be developed in collaboration with Paul Rayson and Dawn Archer from UCREL. The project leaders at Northwestern University are Martin Mueller (Department of English and Classics) and Bill Parod (Academic Technologies).