Corpora

UCREL members have been involved in the compilation and annotation of many electronic corpora, often in collaboration with other institutions. Some corpora are held only as plain orthographic text, whilst others are held with several kinds of annotation.

Some of the corpora listed below are available via ICAME in Bergen, Norway, and information on how to obtain some of the others is available at the same site. A selection of the corpus manuals is also online. Further corpora are made available via ELDA or OTA.

Speech, Thought, and Writing Presentation

Two corpora, one of spoken data and one of written texts, tagged using the categories of speech, writing and thought presentation first outlined in Leech and Short (1981) and developed in the work of Short, Semino and Wynne (see, for example, Short, Wynne and Semino 1998). See the project homepage.

Corpora of South Asian languages

Generated by the EMILLE (Enabling Minority Language Engineering) project at Lancaster University and Sheffield University. EMILLE collected a 97-million-word electronic corpus of South Asian languages, especially those spoken in the UK. See http://www.emille.lancs.ac.uk/.

Lancaster Corpus of Mandarin Chinese

The Lancaster Corpus of Mandarin Chinese (LCMC) was designed as a Chinese match for the Freiburg-LOB Corpus of British English (FLOB). As such, it provides a valuable resource for contrastive studies of English and Chinese, as well as a sound basis for monolingual investigations of Chinese. The LCMC is distributed by the European Language Resources Association (Cat. No ELRA-W0039) and the Oxford Text Archive (Cat. No 2474).

Lancaster Newsbooks Corpus

A corpus of 17th-century news texts, collected to study the journalism of the period. See http://www.ling.lancs.ac.uk/newsbooks for more details.

20th-century corpora

To match the existing LOB and FLOB corpora, the Lancaster1931 and Lancaster1901 corpora are being collected. See the project website.

Corpora

The British National Corpus (BNC)

The Lancaster/Oslo-Bergen Corpus (LOB)

The Longman-Lancaster Corpus

The Lancaster/IBM Spoken English Corpus (SEC)

The ET10-63 Corpus

The International Telecommunications Union (ITU) or CRATER Corpus

The Lampeter Corpus of Early Modern English Tracts

The Lancaster-Leeds Treebank

The Lancaster Parsed Corpus (LPC)

The American Printing House for the Blind Treebank (APHB)

The Associated Press Treebank (AP)

The Canadian Hansard Treebank

The IBM Manuals Treebank

The Anaphoric Treebank