Introduction to UCREL

WHAT IS UCREL?

UCREL (University Centre for Computer Corpus Research on Language) is a research group at Lancaster University, based in the School of Computing and Communications and the Department of Linguistics and English Language.

As recognised in the EPSRC/BCS/IEE International Computer Science Review in 2001, it has led the way in an approach to statistical natural language processing based upon information from large bodies of naturally occurring text and in the application of corpus data to industrial problems in areas as diverse as dictionary creation and speech processing. UCREL's recent projects have been funded nationally by EPSRC, ESRC, AHRC and the Leverhulme Trust, as well as by the EU and the Andrew W. Mellon Foundation in the US. Previous projects include the British National Corpus project, carried out by a national consortium of academic and industrial partners (Oxford University Press, Oxford University Computing Service, The British Library, Longman Group Ltd., W. & R. Chambers Ltd.). Further international collaborators have included HarperCollins, Nokia, IBM Paris, Universidad Autonoma de Madrid and Shogakukan Inc, Japan.

Since 1994, UCREL has launched three continuing series of highly successful conferences (Teaching and Language Corpora; Discourse Anaphora and Anaphor Resolution Colloquium; Corpus Linguistics), with the first two events in each series taking place at Lancaster University.

The work of Lancaster University has enabled language technology applications, such as the software for market and social survey research developed under EPSRC grants GR/J93733/01 and GR/F36385/01. Lancaster's innovations in corpus construction and annotation have enhanced the competitiveness of the UK's commercial sector. For example, its contributions to the British National Corpus project have been successfully exploited by the UK's language industries, including publishing companies such as Longman and Oxford University Press.

For more than four decades, UCREL has led the way in an approach to natural language processing that is based upon information derived from large bodies of naturally occurring text. These bodies of text are stored on computer and are known as corpora (sg. corpus).

The vast majority of UCREL's work is carried out within this corpus-based paradigm. The corpora are used to derive empirical knowledge about language, which can supplement, and frequently supplant, information from reference sources and introspection (Leech, 1991; 1992).

Because they are well suited to quantitative analysis, corpora can provide information about the relative frequencies of many aspects of language. These frequencies can then be employed in probabilistic analysis techniques, which are another major feature of UCREL's work.
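As a minimal sketch of what a relative frequency is, the Python fragment below counts the words in a tiny invented sample and divides each count by the total number of tokens. The sample text and the function name relative_frequencies are assumptions made purely for illustration; they are not taken from any UCREL corpus or tool, and a real corpus would contain millions of words.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Return each token's share of the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Toy sample invented for illustration only (11 tokens).
toy_corpus = "the cat sat on the mat and the dog sat too".split()
freqs = relative_frequencies(toy_corpus)

# "the" occurs 3 times out of 11 tokens.
print(round(freqs["the"], 3))  # prints 0.273
```

The same calculation could be applied to grammatical tags, word pairs or any other countable feature of the text.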

Rather than using hard-and-fast rules, probabilistic systems use frequency data along with sophisticated statistical models to make a 'best guess' about the correct analysis of a piece of language (Sampson, 1987b). Although probabilistic systems make mistakes, they often achieve a very high degree of accuracy (in the high nineties per cent). Compared with rule-based systems, they are exceptionally robust, and can analyse 'real' language containing performance errors (as opposed to idealised invented examples) where rule-based systems would often fail. Because of this robustness and overall accuracy, mainstream computational linguists are now taking an increased interest in probabilistic methods and corpora.
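The 'best guess' idea can be illustrated with a deliberately simple sketch: for each word, pick the grammatical tag it most often carried in a hand-tagged sample. This is not UCREL's own method; it is a most-frequent-tag baseline chosen only because it fits in a few lines. The sample data, the tag labels, the function names train and best_guess, and the fallback tag for unseen words are all assumptions made for illustration. Practical probabilistic systems would normally also take the surrounding context into account.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged sample, invented for illustration; the tag labels are hypothetical.
tagged_sample = [
    ("the", "DET"), ("run", "NOUN"), ("was", "VERB"), ("long", "ADJ"),
    ("they", "PRON"), ("run", "VERB"), ("daily", "ADV"),
    ("we", "PRON"), ("run", "VERB"), ("home", "ADV"),
]

def train(tagged_tokens):
    """Count how often each word appears with each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return counts

def best_guess(word, counts, fallback="NOUN"):
    """Return the most frequent tag for a word and its estimated probability.

    Unseen words get the fallback tag with probability 0.0, so the
    function always returns an answer instead of failing.
    """
    if word not in counts:
        return fallback, 0.0
    tag, n = counts[word].most_common(1)[0]
    return tag, n / sum(counts[word].values())

model = train(tagged_sample)
print(best_guess("run", model))
print(best_guess("xyzzy", model))
```

In this sample "run" is tagged as a verb twice and as a noun once, so the guess is VERB with an estimated probability of two thirds. The guess is wrong for the noun use, which shows how such a system can make mistakes, while the fallback for unseen words shows where the robustness comes from.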

UCREL's work is very much focussed on practical outcomes. We have engaged in corpus-based research contributing to such practical applications as:

speech synthesis
speech recognition
machine-aided translation and support for human translators
dictionary publishing
social survey interview analysis
computer-aided language teaching
software engineering

Our work focusses on:

English - we were a leading partner in the British National Corpus consortium and are now exploiting the BNC to arrive at new, data-grounded analyses of present-day British speech and writing. We are also involved in corpus-based work on the historical development of the English language, as well as on learner English.
Modern foreign languages - we have built, annotated, and exploited corpora of modern languages such as French and Spanish, and we are presently involved (in collaboration with the University of Lodz) in producing a major corpus of contemporary Polish.
Minority, endangered, and ancient languages - we have pioneered corpus work on non-indigenous minority languages in the UK (e.g. Chinese, Hindi, Punjabi), and we are now extending this work to European indigenous minority languages. We have also carried out computer-aided linguistic research on ancient languages such as Latin.

We are always looking for new ways to apply our expertise and are interested in research projects and individual or team-based consultancy, as well as sharing ideas or techniques.
If there is a natural language topic you would like to explore with us, please contact us at the address shown on the home page.