A LITTLE HISTORY...

UCREL began its existence in 1970 when Geoffrey Leech founded a group under the name of CAMET (Computer Archive of Modern English Texts) within the then single Department of English. The CAMET group's aim was to compile a 1,000,000 word corpus of written British English for use on the computer as a parallel to the American Brown corpus developed at Brown University, Rhode Island, by Nelson Francis and Henry Kucera (the first computer corpus of English). In its later stages this project, which reached completion in 1978, was assisted by the involvement of the Norwegian universities of Oslo and Bergen, and the completed corpus was hence called the Lancaster/Oslo-Bergen (or LOB) corpus (Johansson, Leech and Goodluck, 1978).

After completion of the LOB corpus project, CAMET successfully applied for funding to carry out a grammatical analysis of the corpus. The Brown University team had developed computer software (TAGGIT) for assigning a part of speech (or word class) to each word in a corpus or text (Greene and Rubin, 1971). This software had a success rate of around 77% without manual intervention. Lancaster set out to develop part-of-speech tagging software with a higher success rate than the Brown programs. This work, which involved collaboration with Roger Garside of the Computer Studies Department, culminated in the first version of the software package called CLAWS (Constituent Likelihood Automatic Word-tagging System) which, in much revised and improved form, is still a major component of UCREL's work (Garside, 1987). The CLAWS system employed a rich blend of decision-making techniques, based in particular on statistical probabilities of tag co-occurrences using data derived from the manually-corrected tagged LOB corpus (Marshall, 1983). Without manual intervention, it achieved a success rate in the high 90s per cent.
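The idea of choosing tags from the statistical probabilities of tag co-occurrences can be illustrated with a minimal sketch. This is not the CLAWS algorithm or its tagset: the training sentences, the tag names (DET, NOUN, PRON, VERB), the add-one smoothing and the brute-force search over candidate tag sequences are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Toy hand-tagged training data (hypothetical; not taken from the LOB corpus).
TAGGED = [
    [("the", "DET"), ("run", "NOUN"), ("ends", "VERB")],
    [("they", "PRON"), ("run", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("ends", "VERB"),
     ("the", "DET"), ("run", "NOUN")],
]


def train(sentences):
    """Count tag bigrams, left-context tags, and the tags seen for each word."""
    bigrams, context, lexicon = Counter(), Counter(), {}
    for sentence in sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sentence:
            bigrams[(prev, tag)] += 1
            context[prev] += 1
            lexicon.setdefault(word, set()).add(tag)
            prev = tag
    return bigrams, context, lexicon


def tag(words, bigrams, context, lexicon):
    """Pick the candidate tag sequence with the highest bigram likelihood."""
    tagset = sorted({t for tags in lexicon.values() for t in tags})
    # Unknown words may take any tag; known words only the tags seen in training.
    candidates = [sorted(lexicon.get(w, tagset)) for w in words]

    def score(sequence):
        p, prev = 1.0, "<s>"
        for t in sequence:
            # Add-one smoothed estimate of P(t | prev).
            p *= (bigrams[(prev, t)] + 1) / (context[prev] + len(tagset))
            prev = t
        return p

    # Brute force over all combinations: fine for a short toy sentence only.
    best = max(product(*candidates), key=score)
    return list(zip(words, best))


bigrams, context, lexicon = train(TAGGED)
print(tag(["the", "run"], bigrams, context, lexicon))   # run tagged NOUN
print(tag(["they", "run"], bigrams, context, lexicon))  # run tagged VERB
```

In this toy data the ambiguous word "run" is resolved by its neighbour's tag alone: a noun reading is more likely after a determiner, a verb reading after a pronoun.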

The increased emphasis on corpus analysis rather than compilation, and especially the use of computational methods, led to a proposal to the University's Senate that the link between what were by then called the Department of Linguistics and Modern English Language and the Department of Computing be formalized by setting up an interdisciplinary research unit, UCREL. The proposal was approved by Senate on 21 March 1984 and CAMET was thus transformed into UCREL.

In 1983-84 UCREL received funding for two further major projects: one (from the Science and Engineering Research Council, SERC) for further work on the grammatical analysis of the LOB corpus, the other (from the computer manufacturers ICL) for the development of a context-sensitive spelling checker. The grammatical analysis project (see Atwell, Leech and Garside, 1984), which ran until 1986, involved four work packages:

The development of an improved version of the CLAWS part-of-speech tagger (CLAWS2).
The development of a syntactic parser using probabilistic models of analysis to determine the most likely analysis of a sentence (Beale, 1985a; 1985b; Garside and Leech, 1985; Leech, 1987).
The production of a manually parsed subsample of the LOB corpus as a data source for the probabilistic parser (Sampson, 1987a). This analysis was carried out by Professor Geoffrey Sampson (then Reader in Linguistics at Lancaster, subsequently Professor at Leeds and Sussex). This manually parsed sample is now known as the Lancaster-Leeds Treebank.
The production of software which can derive a distributional lexicon from CLAWS-tagged text, i.e., a lexicon showing the base form of each word, with all its inflexional variants and frequencies of its lexical collocations in the corpus (Beale, 1987; 1989).

The spelling checker project, which also ran to 1986, employed a probabilistic approach to identifying spelling errors that an ordinary spelling checker would not detect, because the error happens to form a correctly-spelled English word different from the `target' word, for example the confusion of there and their (Atwell and Elliott, 1987).
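A minimal sketch of how context can expose such real-word errors is given below. It is not the method of the project itself: the confusion sets, the toy training text and the use of smoothed word-bigram counts as a crude stand-in for probabilities are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical confusion sets and toy training text (illustrative assumptions).
CONFUSION = {"there": {"their"}, "their": {"there"}}
TRAINING = ("they lost their keys . there is a key over there . "
            "their dog is there").split()


def train_bigrams(tokens):
    """Count adjacent word pairs in the training text."""
    return Counter(zip(tokens, tokens[1:]))


def context_score(bigrams, prev, word, nxt):
    """Add-one smoothed fit of a word to its left and right neighbours."""
    return (bigrams[(prev, word)] + 1) * (bigrams[(word, nxt)] + 1)


def suspicious(tokens, bigrams):
    """Flag words whose confusable alternative fits the context better."""
    flagged = []
    for i, word in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else "."
        nxt = tokens[i + 1] if i + 1 < len(tokens) else "."
        for alt in sorted(CONFUSION.get(word, ())):
            if (context_score(bigrams, prev, alt, nxt)
                    > context_score(bigrams, prev, word, nxt)):
                flagged.append((i, word, alt))
    return flagged


bigrams = train_bigrams(TRAINING)
print(suspicious("they lost there keys".split(), bigrams))
# flags position 2: "there" where "their" fits the context better
```

A dictionary lookup accepts every word in "they lost there keys"; only the comparison with the confusable alternative in context reveals the likely error.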

In 1984-5 Dr. Ted Briscoe joined the University from the University of Cambridge as Lecturer in Linguistics, bringing with him the grammar tools part of a project under the Alvey Knowledge Based Systems initiative. This project produced software written in Lisp which provided an environment for developing semantically enriched phrase structure grammars for automatic parsing, and also used this environment to produce a unification-based phrase structure grammar of English (Carroll et al., 1988; Grover et al., 1989). In that this was a rule-based and not a probabilistic grammar, the Alvey project differed from most other UCREL projects.

Also in 1984, UCREL received a small grant from the University's Humanities Research Committee to begin research into text-to-speech synthesis. Previous UCREL projects had dealt only with written text: this project marked the beginning of UCREL's involvement with speech processing. The project soon received more substantial support from IBM UK Laboratories and ran until 1987. The main goal of this project was to compile a corpus of modern spoken English, with a prosodic transcription showing features of stress, intonation and pauses (Taylor and Knowles, 1988). The resulting corpus of c. 53,000 words, the Lancaster-IBM Spoken English Corpus (SEC), also exists on audio-tape for instrumental analysis. Funding was later extended for a study of prosodic stylistics to examine how prosodic patterns differ between different kinds of discourse (Wichmann, 1991), and, as noted above, a grant has recently been obtained to produce a speech database (MARSEC) based on the SEC. Among other things, the MARSEC project will produce a digitized sound version of the corpus, and a phonetic transcription.

UCREL's collaboration with IBM was strengthened through involvement in IBM's ongoing research on the development of continuous speech recognition systems. Because of the difficulty in reliably identifying word boundaries in continuous speech, such systems need to employ analysis at levels other than the acoustic and phonetic. Lancaster's role, in a collaboration which lasted until 1991, was to perform the automatic tagging and subsequent manual parsing of large amounts of text using a fast annotation interface (Leech and Garside, 1991). The resulting `treebank' was applied in the development of probabilistic models of language. At a later stage in the project, the scope of annotation was extended to provide texts annotated with anaphoric relations between pronouns and noun phrases (Fligelstone, 1992).

In summer 1988, UCREL branched out into semantic analysis. A pilot project was completed, and in 1990 a two-year project began to build upon this work (Wilson, 1991; Wilson and Rayson, 1993). This project was successfully completed in 1992, and funding was approved for further development (ACASD project).

Work on semantic analysis was also carried out on the Lancaster Database project.

A most significant development for UCREL was its participation in the British National Corpus. UCREL was a member of a national consortium of academic and industrial partners (the other members being Oxford University Press, Oxford University Computing Service, The British Library, Longman Group Ltd. and W. & R. Chambers Ltd.), with the goal of compiling a representative 100 million word corpus containing a wide variety of present-day written and spoken British English (see Leech, Garside and Bryant, 1994).

UCREL has recently moved into a new multilingual phase of development, and has been involved in projects within the European Community's EUROTRA machine translation programme (McEnery and Daille, 1993) as well as its Multilingual Action Plan.

In order to reflect this changing nature of UCREL's research, and to emphasize our position as a research centre within the University, we decided to change our name in June 1995. Hence the Unit for Computer Research on the English Language became the University Centre for Computer Corpus Research on Language. However, we retain the acronym of UCREL.

Details of more recent and current projects can be found on a separate page.

Members of UCREL in the BNC lab.

Members of BNC research team with Roger Garside and Geoffrey Leech.