Lancaster University UCREL Home Page
University Centre for Computer Corpus Research on Language 
 


 


'LEADING THE WAY IN CORPUS-BASED NLP RESEARCH'

UCREL is a research centre of Lancaster University.

  • We specialize in the automatic or computer-aided analysis of large bodies of naturally-occurring language ('corpora').
  • We have a record of achievement of more than twenty years as pioneers in this field.
  • We remain at the leading edge of computer corpus construction and analysis.
  • Our work focusses on modern English, early modern English, modern foreign languages, and minority, endangered and ancient languages.

News:

December 2007:

Call for subscriptions to ICAME Journal, published under the auspices of the Aksis Centre (The Department of Culture, Language and Information Technology, University of Bergen, Norway) and UCREL. To subscribe to the ICAME Journal for the next issue, to be published in May 2008, please visit our secure on-line order form, which offers credit card, fax, post, phone and purchase order options.

November 2007:

New issue of the Empirical Text and Culture Research Journal published.

September 2007:

Corpus Linguistics Advanced Research Education and Training (CLARET) has been funded by the AHRC. The first workshop will take place at Liverpool University on 29th-30th November 2007.

July 2007:

New project funded by the AHRC: Corpus-based grammar in contrast (CORGRAM).

November 2006:

Call for papers: the Corpus Linguistics 2007 conference will be held at the University of Birmingham, 27-30 July 2007.

September 2006:

Call for submissions and subscriptions to ICAME Journal, published under the auspices of the Aksis Centre (The Department of Culture, Language and Information Technology, University of Bergen, Norway) and UCREL. To subscribe to the ICAME Journal for the next issue, to be published in May 2007, please visit our secure on-line order form, which offers credit card, fax, post, phone and purchase order options. The price for one issue is USD52, around 43 Euros or GBP30 depending on the current exchange rate. Issue 30 is still on sale via the issue 30 order form.

August 2006:

Call for submissions to two journals:

June 2006:

Call for participation: workshop on Historical Text Mining. July 20th and 21st, Lancaster University, UK. For more details, see the workshop webpage: http://ucrel.lancs.ac.uk/events/htm06/

January 2006:

We are now taking orders for issue 30 of the ICAME Journal to be published in the spring of 2006. Please visit the secure on-line order page.

December 2005:

Call for papers: Third International Workshop on Language Resources for Translation Work, Research & Training.
A Satellite Event of LREC 2006 (5th Language Resources and Evaluation Conference).
Date: 28th May 2006.
Venue: Magazzini del Cotone Conference Center, Genoa, Italy.

November 2005:

Call for papers: EACL 2006 Workshop on Multi-word-expressions in a multilingual context. April 3rd 2006, Trento, Italy. http://ucrel.lancs.ac.uk/EACL06MWEmc/

October 2005:

Announcing a new journal: Corpora
Corpora is a new journal focusing on the many and varied uses of corpora both in linguistics and beyond. The journal accepts articles presenting research findings based on the exploitation of corpora as well as accounts of corpus building, corpus tool construction and corpus annotation schemes. The journal will be published by Edinburgh University Press. For more details, see the Corpora Journal Home Page.

August 2005:

New project funded by the Leverhulme Trust entitled "Changing English Across the 20th Century: a corpus-based study". The main aim of the research is to carry out an investigation of areas of change in grammatical usage in 20th Century British English, focussing on the verb phrase. The study will be based on a package of four corpora sampled at regular intervals: 1991 – 1961 – 1931 – 1901. Two sub-goals are to: a) Compile a new corpus of British English called Lancaster1901 focussing on the beginning of the twentieth century. b) Enhance the encoding and annotation of Lancaster1901 and the three existing corpora (Lancaster1931, LOB and FLOB), and release the enhancements to the academic community.

For more information see the Lancaster University press release and the project details.

April 2005:

New project funded by the EPSRC: Assist (Automated Semantic Assistance for Translators) aims to address the problem of providing contextual examples of translation equivalents for words from the general lexicon. We will employ comparable corpora and an existing semantic field annotation system for English, and will develop a new semantic field tagger for Russian.

New project funded by the British Academy: Scragg revisited - a quantitative investigation of spelling variation across the centuries.

October 2004:

Lancaster is involved in the AHRB ICT Methods Network; see the AHRB press release and the network site at CCH for more details.

September 2004:

ELRA have announced that new Written Language Resources are available in their catalogue. Short descriptions are given below. For more detailed information, please visit their on-line catalogue: www.elda.fr and www.elra.info

*** ELRA-W0037 The EMILLE/CIIL Corpus ***

The EMILLE/CIIL Corpus consists of monolingual corpora containing approximately 92,799,000 words for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu), including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel corpus of 200,000 words in English with translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. Annotations include Urdu monolingual and parallel corpora annotated for parts-of-speech, and 20 written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode.

*** ELRA-W0038 The EMILLE Lancaster Corpus ***

The EMILLE Lancaster Corpus consists of monolingual corpora containing approximately 58,880,000 words for seven South Asian languages (Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil and Urdu), including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel corpus of 200,000 words in English with translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. Annotations include Urdu monolingual and parallel corpora annotated for parts-of-speech, and 20 written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode.

*** ELRA-W0039 The Lancaster Corpus of Mandarin Chinese (LCMC) ***

The Lancaster Corpus of Mandarin Chinese (LCMC) samples 15 written text categories, including news, literary texts, academic prose and official documents, published in P. R. China in the early 1990s, for a total of approximately 1 million words. LCMC uses the same sampling frame and period as FLOB/FROWN. The corpus is encoded in Unicode (UTF-8) and marked up in XML.

August 2004:

Call for pre-conference workshop proposals: Corpus Linguistics 2005, Birmingham, July 14th-17th 2005. Deadline: December 3rd, 2004. Organisers: University of Birmingham and University of Lancaster. http://www.corpus.bham.ac.uk/conference

Proposals are invited for pre-conference workshops on July 14th at the University of Birmingham. The conference, Corpus Linguistics 2005, is run jointly by the universities of Birmingham and Lancaster, and is the third biennial conference in the series on Corpus Linguistics. The workshops and the conference will be held at the University of Birmingham between July 14th and 17th 2005.

May 2004:

New project funded by the Andrew W. Mellon Foundation: WordHoard applies the insights and techniques of corpus linguistics to highly canonical literary texts.

January 2004:

Release of the Lancaster Corpus of Mandarin Chinese, a Mandarin Chinese match for the FLOB and FROWN corpora. The corpus is part-of-speech tagged and available, free of charge, for use in non-profit-making research.

Release of the EMILLE/CIIL corpus. The corpus contains monolingual written corpus data for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu). It also contains orthographically transcribed spoken data and parallel corpus data for five South Asian languages (Bengali, Gujarati, Hindi, Punjabi and Urdu). In addition, the parallel corpus contains the English originals from which the translations stored in the corpus were derived. All data in the corpus is CES and Unicode compliant. The EMILLE corpus totals some 94 million words. The corpora were built as part of a collaboration between Lancaster University and the Central Institute of Indian Languages, Mysore.

December 2003:

Call for papers: 5th International Conference on Discourse Anaphora and Anaphor Resolution (DAARC2004)

November 2003:

New project: the Leverhulme Corpus Project plans to build a corpus which matches as closely as possible the LOB and FLOB corpora of written British English, except that the year of data collection is 1931, or near to that date (+/- 3 years).

2nd call for papers: the sixth Teaching And Language Corpora conference (TaLC 2004)

Local Corpus research group meetings will continue this term on Mondays at 4pm in B81, Bowland.

September 2003:

New books containing a selection of papers from the CL2001 conference:


Wilson, A., Rayson, P. and McEnery, T. (eds.) (2003) Corpus Linguistics by the Lune: a festschrift for Geoffrey Leech. Peter Lang, Frankfurt. (Volume 8 in the Lodz studies in Language Series edited by Lewandowska-Tomaszczyk, B. and Melia, P. J.) ISBN 3-631-50952-2


Wilson, A., Rayson, P. and McEnery, T. (eds.) (2003) A Rainbow of Corpora: Corpus Linguistics and the Languages of the World. Lincom-Europa, München. ISBN 3 89586 872 8. Linguistics Edition 40. 174 pp.


August 2003:

Conference Announcement: The sixth Teaching And Language Corpora conference (TaLC 2004)

CLAWS part-of-speech tagger free web trial extended to 10,000 words for academic users.

March 2003:

Initial release of the Lancaster Newsbooks Corpus

EMILLE Corpus Beta version released

Recent conference: Corpus Linguistics 2003 (CL2003)

Book series announcement: Routledge Advances in Corpus Linguistics

   
UCREL,
Lancaster University,
Lancaster,
LA1 4WA.
Tel: +44 1524 510357 Fax: +44 1524 510492
email: ucrel at lancaster.ac.uk

Lancaster University approved pages maintained by Paul Rayson and Chris Needham, Lancaster University, UK.
All material in these pages © 1993-2007 UCREL, Lancaster University.