Corpora
UCREL members have been involved in the compilation and annotation
of many electronic corpora, often in collaboration with other institutions.
Some corpora are held only as plain orthographic
text, whilst others are held with several kinds of
annotation.
Some of the corpora listed below are available via
ICAME in Bergen, Norway, and
information on how to obtain some of the others is available at the
same site.
A selection of the corpus manuals
is also available online. Further corpora are made available via
ELDA or the Oxford Text Archive (OTA).
Speech, Thought, and Writing Presentation
Two corpora - one of spoken data and one of written texts - tagged
using categories of speech, writing and thought presentation
outlined initially in Leech and Short (1981) and developed in the work
of Short, Semino and Wynne (see for example Short, Wynne and Semino
1998). See the
project homepage.
Corpora of South Asian languages
Generated by the EMILLE (Enabling Minority Language Engineering) project
at Lancaster University and Sheffield University.
EMILLE collected a 97-million-word electronic corpus of South Asian
languages, especially those spoken in the UK.
See http://www.emille.lancs.ac.uk/.
Lancaster Corpus of Mandarin Chinese
The Lancaster Corpus of Mandarin Chinese (LCMC)
was designed as a Chinese counterpart to the Freiburg-LOB Corpus of
British English (FLOB), and, as such, provides a valuable resource
for contrastive studies between English and Chinese as well as a sound
basis for monolingual investigations of Chinese. The LCMC corpus is
distributed by the European Language Resources Association (Cat. No
ELRA-W0039) and the Oxford Text Archive (Cat. No 2474).
Lancaster Newsbooks Corpus
A corpus of 17th-century news texts, collected to study the
journalism of the period.
See http://www.ling.lancs.ac.uk/newsbooks
for more details.
20th century corpora
To match the existing LOB and FLOB corpora, the
Lancaster1931 and Lancaster1901 corpora are being collected.
See the project website.
The British National Corpus (BNC)
The BNC is a 100-million-word corpus of written and spoken
British English from the early 1990s. Approximately 90% of
the corpus is made up of written material and approximately
10% is made up of spoken material. The corpus is tagged
for part of speech.
Full details of the corpus can be
found on the BNC web page.
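As a rough illustration of what the written/spoken split means in word counts, the short Python sketch below applies the approximate percentages quoted above to the approximate total. The function name and variable names are hypothetical, and the results are only as precise as the rounded figures they are derived from.

```python
# Approximate figures taken from the description above; the real
# corpus counts differ slightly from these rounded values.
TOTAL_WORDS = 100_000_000
SHARES_PERCENT = {"written": 90, "spoken": 10}


def component_sizes(total_words, shares_percent):
    """Return an approximate word count for each component.

    Integer arithmetic is used so that the results are exact for
    whole-number percentages.
    """
    return {
        name: total_words * percent // 100
        for name, percent in shares_percent.items()
    }


sizes = component_sizes(TOTAL_WORDS, SHARES_PERCENT)
for name, words in sizes.items():
    print(f"{name}: about {words:,} words")
```

On these figures the spoken component alone is roughly ten times the size of the whole LOB corpus described below.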
The Lancaster/Oslo-Bergen Corpus (LOB)
Approximately 1,000,000 words
of British written English dating from 1961. The corpus is made up of
15 different genre categories. Available as orthographic text, and
tagged with the CLAWS1 part-of-speech tagging system. The
Lancaster-Leeds Treebank and Lancaster Parsed Corpus are analyzed
subsamples of the LOB corpus.
For further information
see the corpus manual (1978)
and the tagged corpus manual (1986).
The Longman-Lancaster Corpus
Approximately 14.5 million words of written English from
various geographical locations in the English-speaking world and of
various dates and text types. Orthographic text only.
The Lancaster/IBM Spoken English Corpus (SEC)
Approximately 53,000 words of British spoken English,
mainly taken from radio broadcasts dating from the mid 1980s. Available
as orthographic text, tagged with the CLAWS2 part-of-speech tagging
system, parsed, and prosodically annotated. There are also tapes of a
standard suitable for the instrumental analysis of F0 values.
The ET10-63 Corpus
The ET10-63 corpus is a bilingual parallel corpus of English and
French, containing EC official documents on telecommunications. The
corpus is part-of-speech tagged and also lemmatized.
Approximately 1,250,000 words of each language.
The International Telecommunications Union (ITU) or CRATER Corpus
A 1,000,000-word trilingual corpus of Spanish, French
and English, aligned at the sentence level. The
corpus is made up of texts from the telecommunications domain.
It has been part-of-speech tagged in all three languages.
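To make "aligned at the sentence level" concrete, here is a minimal Python sketch of how aligned units might be represented and queried once loaded. The `AlignedUnit` class, the `find_translations` helper, and the example sentences are all invented for illustration; they do not reflect the corpus's actual file format or content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedUnit:
    """One sentence-level alignment: the same sentence in three languages."""
    english: str
    french: str
    spanish: str


# Invented illustrative sentences, NOT taken from the corpus.
units = [
    AlignedUnit("The network is digital.",
                "Le réseau est numérique.",
                "La red es digital."),
    AlignedUnit("The signal is weak.",
                "Le signal est faible.",
                "La señal es débil."),
]


def find_translations(units, english_word):
    """Return (french, spanish) pairs aligned with English sentences
    that contain the given word (case-insensitive, whole-word match)."""
    needle = english_word.lower()
    return [
        (u.french, u.spanish)
        for u in units
        if needle in u.english.lower().rstrip(".").split()
    ]


print(find_translations(units, "network"))
```

Because each unit holds all three versions of a sentence, a match in one language immediately yields its counterparts in the other two, which is what makes an aligned corpus useful for contrastive and translation studies.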
The Lampeter Corpus of Early Modern English Tracts
A corpus of approx. 1,000,000 words of English pamphlet literature
covering the years 1640-1740. Text samples are taken from each
decade within this century and several genres are represented.
This corpus contains the whole text of pamphlets, rather than
sub-samples. It is being tagged for part-of-speech and lemmatized
at TU Chemnitz-Zwickau's REAL Centre, in association with Lancaster.
The Lancaster-Leeds Treebank
A manually parsed subsample of the LOB
corpus showing the surface phrase structure of each sentence, prepared
by Professor Geoffrey Sampson. Approximately 45,000 words taken from
all the genre categories of the LOB corpus.
The Lancaster Parsed Corpus (LPC)
A subsample of the LOB corpus, parsed by computer and
manually corrected by several researchers. Approximately 140,000 words
with samples from each of the 15 categories in the LOB corpus.
The American Printing House for the Blind Treebank (APHB)
A skeleton-parsed corpus of a wide range of
English texts. 200,000 words.
The Associated Press Treebank (AP)
A skeleton-parsed corpus of American newswire reports.
1,000,000 words.
The Canadian Hansard Treebank
A skeleton-parsed corpus of proceedings in the Canadian Parliament. 750,000 words.
The IBM Manuals Treebank
A skeleton-parsed corpus of computer manuals. 800,000 words.
The Anaphoric Treebank
A subsample of the AP corpus, annotated to show the reference of
pronouns and lexical cohesion. Approximately
100,000 words.