Corpora


UCREL members have been involved in the compilation and annotation of many electronic corpora, often in collaboration with other institutions. Some corpora are held only as plain orthographic text, whilst others are held with several kinds of annotation.

Some of the corpora listed below are available via ICAME in Bergen, Norway, and information on how to obtain some of the others is available at the same site. A selection of the corpus manuals is on-line too. Yet more corpora are made available via ELDA or OTA.


Speech, Thought, and Writing Presentation

Two corpora - one of spoken data and one of written texts - tagged using categories of speech, writing and thought presentation outlined initially in Leech and Short (1981) and developed in the work of Short, Semino and Wynne (see for example Short, Wynne and Semino 1998). See the project homepage.

Corpora of South Asian languages

Generated by the EMILLE (Enabling Minority Language Engineering) project at Lancaster University and Sheffield University. EMILLE collected a 97-million-word electronic corpus of South Asian languages, especially those spoken in the UK. See http://www.emille.lancs.ac.uk/.

Lancaster Corpus of Mandarin Chinese

The corpus was designed as a Chinese match of the Freiburg-LOB Corpus of British English (FLOB), and, as such, provides a valuable resource for contrastive studies between English and Chinese as well as a sound basis for monolingual investigations of Chinese. The LCMC corpus is distributed by the European Language Resources Association (Cat. No ELRA-W0039) and the Oxford Text Archive (Cat. No 2474).

Lancaster Newsbooks Corpus

A corpus of 17th-century news texts, collected to study the journalism of the period. See http://www.ling.lancs.ac.uk/newsbooks for more details.

20th century corpora

To match the existing LOB and FLOB corpora, the Lancaster1931 and Lancaster1901 corpora are being collected. See the project website.

The British National Corpus (BNC)

The BNC is a 100,000,000-word corpus of written and spoken British English from the early 1990s. Approximately 90% of the corpus is made up of written material and approximately 10% is made up of spoken material. The corpus is tagged for part of speech. Full details of the corpus can be found on the BNC web page.

The Lancaster/Oslo-Bergen Corpus (LOB)

Approximately 1,000,000 words of British written English dating from 1961. The corpus is made up of 15 different genre categories. Available as orthographic text, and tagged with the CLAWS1 part-of-speech tagging system. The Lancaster-Leeds Treebank and Lancaster Parsed Corpus are analyzed subsamples of the LOB corpus. For further information see the corpus manual (1978) and the tagged corpus manual (1986).

The Longman-Lancaster Corpus

Approximately 14.5 million words of written English from various geographical locations in the English-speaking world and of various dates and text types. Orthographic text only.

The Lancaster/IBM Spoken English Corpus (SEC)

Approximately 53,000 words of British spoken English, mainly taken from radio broadcasts dating from the mid-1980s. Available as orthographic text, tagged with the CLAWS2 part-of-speech tagging system, parsed, and prosodically annotated. There are also tapes of a standard suitable for the instrumental analysis of F0 values.

The ET10-63 Corpus

The ET10-63 corpus is a bilingual parallel corpus of English and French, containing EC official documents on telecommunications. The corpus is part-of-speech tagged and also lemmatized. Approximately 1,250,000 words of each language.

The International Telecommunications Union (ITU) or CRATER Corpus

A 1,000,000-word trilingual corpus of Spanish, French and English, aligned at the sentence level. The corpus is made up of texts from the telecommunications domain. It has been part-of-speech tagged in all three languages.

The Lampeter Corpus of Early Modern English Tracts

A corpus of approx. 1,000,000 words of English pamphlet literature covering the years 1640-1740. Text samples are taken from each decade within this century and several genres are represented. This corpus contains the whole text of pamphlets, rather than sub-samples. It is being tagged for part-of-speech and lemmatized at the TU Chemnitz-Zwickau's REAL Centre, in association with Lancaster.

The Lancaster-Leeds Treebank

A manually parsed subsample of the LOB corpus showing the surface phrase structure of each sentence, prepared by Professor Geoffrey Sampson. Approximately 45,000 words taken from all the genre categories of the LOB corpus.

The Lancaster Parsed Corpus (LPC)

A subsample of the LOB corpus, parsed by computer and manually corrected by several researchers. Approximately 140,000 words with samples from each of the 15 categories in the LOB corpus.

The American Printing House for the Blind Treebank (APHB)

A skeleton-parsed corpus of a wide range of English texts. 200,000 words.

The Associated Press Treebank (AP)

A skeleton-parsed corpus of American newswire reports. 1,000,000 words.

The Canadian Hansard Treebank

A skeleton-parsed corpus of proceedings in the Canadian Parliament. 750,000 words.

The IBM Manuals Treebank

A skeleton-parsed corpus of computer manuals. 800,000 words.

The Anaphoric Treebank

A subsample of the AP corpus, annotated to show the reference of pronouns and lexical cohesion. Approximately 100,000 words.
