Events

Forthcoming Events

Lancaster Summer Schools in Corpus Linguistics and other Digital Methods, Lancaster University, 27-30 June 2017.
The UCREL Corpus Research Seminar Series brings together researchers in the Linguistics, Computing and other Departments.

Major Past Events (held at Lancaster)

Lancaster Summer Schools in Corpus Linguistics and other Digital Methods, Lancaster University, 12-15 July 2016.
Corpus Linguistics 2015, Lancaster University, 21-24 July 2015.
Lancaster Summer Schools in Interdisciplinary Digital Methods, Lancaster University, 14-17 July 2015.
31st South Asian Languages Analysis Roundtable (SALA-31), Lancaster University, 14-16 May 2015
UCREL Summer School in Corpus Linguistics, Lancaster University, 15-18 July 2014.
11th Teaching and Language Corpora Conference (TaLC 11), Lancaster University, 20-23 July 2014.
4th Using Corpora in Contrastive and Translation Studies Conference (UCCTS4), Lancaster University, 24-26 July 2014.
UCREL Summer School in Corpus Linguistics, Lancaster University, 16-19 July 2013.
Corpus Linguistics 2013, Lancaster University, 23-26 July 2013.
UCREL Summer School in Corpus Linguistics, Lancaster University, 10-12 July 2012.
UCREL Summer School in Corpus Linguistics, Lancaster University, 13-15 July 2011.
Workshop on Arabic Corpus Linguistics, Lancaster University, 11-12 April 2011. (workshop archive)
Text-mining in the Digital Humanities: The Interface between Conceptual History, Critical Discourse Analysis and Corpus Linguistics. Lancaster University Thu 13 - Fri 14 May 2010.
BAAL Gender and Language SIG event: Gender and Corpus Linguistics, Tuesday March 30th 2010, Lancaster University.
Corpus Linguistics Advanced Research Education and Training (CLARET) PhD training workshop series 2009/10
ICAME 2009 hosted in Lancaster and co-organised by UCLAN, May 2009.
CLARET PhD Training Workshop, 31 March - 1 April 2008.
Workshop on Historical Text Mining, July 20th and 21st 2006, Lancaster University, UK.
Digital Resources for the Humanities (DRH 2005) conference, 4-7 September 2005, Lancaster University.
Corpus Linguistics 2003, Lancaster University, 28 March - 1 April 2003.
Corpus Linguistics 2001, Lancaster University, 30 March - 2 April 2001.
DAARC2000 - Discourse, Anaphora and Reference Resolution Conference, Lancaster University, 16-18 November, 2000.
The Discourse Anaphora and Anaphor Resolution Colloquium (DAARC2) was hosted at Lancaster University, 1 - 4th August, 1998.
Teaching and Language Corpora (Talc96) was hosted by UCREL in August 1996.
The Discourse Anaphora and Anaphor Resolution Colloquium (DAARC96) was hosted by UCREL in July 1996.
The first Teaching and Language Corpora conference (Talc94) was hosted by UCREL in April 1994.
In September 1993, UCREL hosted an ESRC-funded workshop on computerized spoken discourse.
Many researchers have visited us and given us seminars on their work (see below).

Related events (not held at Lancaster)

Corpus Linguistics 2011, University of Birmingham, 19-22 July 2011.
Corpus Linguistics 2009 held in Liverpool, July 2009.
8th Teaching And Language Corpora (TALC 8) Conference 4-6 July 2008, Lisbon.
Corpus Linguistics 2007, University of Birmingham, July 2007.
Text mining for historians, 17 - 18 July 2007 at University of Glasgow.
The 6th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC'2007), Lagos (Algarve), Portugal, hosted by University of Lisbon, Faculty of Sciences. March 29 - 30, 2007.
7th Teaching and Language Corpora Conference : TaLC2006: 1-4 July 2006. Organised by the Department of Intercultural Studies and Applied Languages, University Paris 7 Denis Diderot in Paris, France. Held at the Bibliothèque Nationale de France in Paris.
Corpus Linguistics 2005, University of Birmingham, 14-17 July 2005.
5th Discourse Anaphora and Anaphor Resolution Colloquium DAARC 2004. S. Miguel, Azores, Portugal, September 23-24, 2004 hosted by the University of Lisbon, Faculty of Sciences
The sixth Teaching And Language Corpora conference TaLC 2004 Tuesday 6th - Friday 9th July 2004 Granada, Spain.
The 4th International Conference on Discourse Anaphora and Anaphor Resolution (DAARC2002) September 18 - 20, 2002 hosted by the University of Lisbon, Faculty of Sciences Organizers: Antonio Branco, Tony McEnery and Ruslan Mitkov
5th Teaching and Language Corpora (TALC 2002) 27 - 31 July 2002, Bertinoro, Italy.
4th Teaching and Language Corpora (TALC2000), Graz, 19-23 July 2000.
3rd Teaching and Language Corpora (TALC98) Keble College, Oxford, 24-27 July, 1998.

Seminars from visiting scholars

Humanities computing: what is it and why is it important?: Dr Marilyn Deegan, Oxford University. Bowland SCR, 2 p.m. Friday 24th October 2003.
Corpus linguistics and grammaticalisation theory: Sebastian Hoffmann, University of Zurich. September 25th 2003, 11am B61 Bowland.
CoreNet: Chinese-Japanese-Korean Wordnet with Shared Semantic Hierarchy and its Development Procedure: Prof. Key-Sun Choi, KAIST (Korea Advanced Institute of Science and Technology). August 29th 2003, 9.15am Bowland B61.
Parsing dialogue moves into higher level dialogue structure: Jean Carletta, Edinburgh University. March 4th 2002, 1pm B61.
The intonation of requests: a corpus-based approach: Anne Wichmann, University of Central Lancashire. February 25th 2002, 1pm B61.
Corpus linguistics and its alternatives: Wolfgang Teubert, Birmingham University. January 24th 2002 4.30pm George Fox LT4.
Problems of Frequency Count in a Corpus of German: Randall Jones, Brigham Young University. November 5th 2001 1pm, B66, Linguistics.
NLP meets TEFL: Tracing the Zero Article: Oliver Mason, University of Birmingham. Monday 7th June 1999 12pm A62, Linguistics. (full presentation available)
Towards the `Corpusization' of Spoken Discourse Transcripts: Paired Oral Interviews in English Language Testing as a Pilot Study: Gordon Tucker, Centre for Language and Communication Research, Cardiff University. Monday 24th May 1999 12pm A62, Linguistics.
Author's briefing on WordSmith: Mike Scott, University of Liverpool. Monday 17th May 1999 12pm B66, Linguistics.
Cambridge International Dictionaries and Corpora: Andrew Harley, Systems Development Manager - ELT Reference, Cambridge University Press. Monday 22nd March 1999, 12pm, North Spine Seminar Room 1.
Recognising proper names in text, by context-sensitive pattern-matching: Bill Black, (Centre for Computational Linguistics, UMIST) On study leave in Dept of Computing, Lancaster. Friday 21st November 1997 2 p.m B80, Linguistics.
Attacking Anaphora on All Fronts: Ruslan Mitkov (University of Wolverhampton). Friday 6th December 1996 3.00 p.m. B80, Linguistics.
GATE: A General Architecture for Text Engineering: Robert Gaizauskas, Department of Computer Science, University of Sheffield. Tuesday 14th May 1996 2pm A62 Linguistics Department.
Using word frequency lists to measure corpus homogeneity and similarity between corpora: Adam Kilgarriff, Information Technology Research Institute, University of Brighton. Friday 3rd May 1996 1pm B39 Computing Department, SECAMS.
Introducing SARA: Lou Burnard, Oxford University Computing Services. Friday 26th April 1996 1pm B66 Linguistics Department.
Text Categorisation: Chris Paice (Department of Computing at Lancaster). Wednesday 24th January 1996 2pm Skylab meeting room (SECAMS)
Language Id (statistical language id with a surprise twist): Ted Dunning (Computing Research Laboratory at New Mexico State University) Thursday 4th January 1996.
Dimensions in English: by Doug Biber and Ed Finegan Friday May 13th 1994.
Use and usefulness: frequency of use and the teaching of English: Graeme Kennedy (Victoria University of Wellington). 3rd November 1993.
Taking Corpora to Lancaster: A new study of the modal verb WILL: John M. Kirk (The Queen's University of Belfast). 10th September 1993.
Universalism in Phonology: Atoms, Structures, Derivations: Jacques Durand (University of Salford). 20th January 1993.
Constraint-based approaches to Machine Translation: Louisa Sadler (Department of Language and Linguistics, University of Essex). April 24th 1992.

Abstracts

Towards the `Corpusization' of Spoken Discourse Transcripts: Paired Oral Interviews in English Language Testing as a Pilot Study
Gordon Tucker
Centre for Language and Communication Research
Cardiff University
Monday 24th May 1999 12pm A62 Linguistics.

Abstract:

There is (hopefully) a growing awareness amongst researchers of spoken discourse that corpus linguistic approaches to their data may provide an additional investigative and analytic resource to complement traditional forms of discourse and conversation analysis.

Whilst spoken data transcripts in word processing format often abound in linguistic departments where discourse is studied, considerable work is necessary to convert them into a useful resource for corpus linguistic type queries. This work is made more difficult when such centres do not have substantial computer science, computational linguistic or corpus linguistic support.

The central question that needs to be addressed is whether and to what extent it is possible to realistically `corpusize' these data sources, given the very limited specialist support. This general question encompasses a good number of issues. These range from mark-up (with SGML as a strong candidate) to the availability of suitable search engine software, where, again, there is little possibility of customized software tools being developed in-house. Furthermore, with specialized corpora of this kind, there are issues of optimum size and representativeness.

Mark-up is a major issue, both in terms of what can or cannot be achieved automatically (e.g. tagging, parsing etc.) and what forms of encoding are necessary or potentially helpful to enable discourse researchers to exploit the corpus (e.g. semantic tagging, turn-transition and other relevant discourse features etc.).

The potential to harness spoken discourse data in this way is currently being explored in Cardiff through a pilot compilation of a corpus of paired oral interviews taken from the University of London - Edexcel English Language Tests. These data bring with them the additional problem of handling Non Native Speaker (NNS) language.

One important purpose of the talk is therefore to share this rather solitary experience and to outline my approach to the numerous problems encountered in dealing with spoken discourse and taking the original transcript as a point of departure. This is very much `work in progress' and it is hoped that the enterprise will benefit from its airing to an informed and specialist audience.

Recognising proper names in text, by context-sensitive pattern-matching
Bill Black,
(Centre for Computational Linguistics, UMIST)
Friday 21st November 1997 2 p.m B80, Linguistics.

Abstract:

Proper names constitute a problem for many natural language processing applications, since with a few exceptions they are not to be found in dictionaries, and they can be constructed in a similar way to descriptive noun phrases. The FACILE project (4th Framework Language Engineering) is concerned with the categorisation of news texts and with extracting information from them. In the FACILE system, named entity analysis (using the operational definition used in the Message Understanding Conferences) takes place at an early, preprocessing stage, building on tokenisation, tagging and morphological analysis (in 4 languages). The problem is not just to syntactically label possibly complex proper names, but to classify them as persons, organisations and locations, and to recognise abbreviated forms as coreferent.

The approach we are taking uses a hybrid pattern-matching/parsing machinery with a rule metalanguage we have designed and refined in the light of results in the MUC-7 "dry run" a few weeks ago.

Attacking Anaphora on All Fronts

Ruslan Mitkov
(University of Wolverhampton)
Friday 6th December 1996 3.00 p.m. B80, Linguistics

Abstract:

Anaphor resolution is a complicated problem in Natural Language Processing and has attracted the attention of many researchers. Most of the approaches developed so far have been traditional linguistic ones with the exception of a few projects where statistical, machine learning or uncertainty reasoning methods have been proposed. The approaches offered - from the purely syntactic to the highly semantic and pragmatic (or the alternative) - provide only a partial treatment of the problem. Given this situation and with a view to achieving greater efficiency, it would be worthwhile to develop a framework which combines various methods to be used selectively depending on the situation.

The talk will outline the approaches to anaphor resolution developed by the speaker. First, he will present an integrated architecture which makes use of traditional linguistic methods (constraints and preferences) and which is supplemented by a Bayesian engine for center tracking to increase the accuracy of resolution: special attention will be paid to the new method for center tracking which he developed to this end. Secondly, an uncertainty reasoning approach will be discussed: the idea behind such an underlying AI strategy is that (i) in Natural Language Understanding the program is likely to propose the antecedent of an anaphor on the basis of incomplete information and (ii) since the initial constraint and preference scores are subjective, they should be regarded as uncertain facts. Thirdly, the talk will focus on a two-engine approach which was developed with a view to improving performance: the first engine searches for the antecedent using the integrated approach, whereas the second engine performs uncertainty reasoning to rate the candidates for antecedents. Fourthly, a recently developed practical approach which is knowledge-independent and which does not need parsing and semantic knowledge will be outlined.

In the last part of his talk, R. Mitkov will explain why Machine Translation adds a further dimension to the problem of anaphor resolution. He will also report on the results from two projects which he initiated and which deal with anaphor resolution in English-to-Korean and English-to-German Machine Translation.

Attacking anaphora on all fronts is a worthwhile strategy: performance is enhanced when all available means are enlisted (i.e. the two-engine approach), or a trade-off is possible between more expensive, time-consuming approaches (the integrated, uncertainty-reasoning and two-engine approaches) and a more economical, but slightly less powerful approach (the practical "knowledge-independent" approach).

GATE: A General Architecture for Text Engineering

Robert Gaizauskas
Department of Computer Science
University of Sheffield
Tuesday 14th May 1996 2pm A62 Linguistics Department.

Abstract:

In this talk I will discuss the GATE project currently underway at Sheffield. GATE aims to provide a software infrastructure in which heterogeneous natural language processing modules may be evaluated and refined independently, or may be integrated into larger application systems. Thus, GATE will support both researchers working on component technologies (e.g. parsing, tagging, morphological analysis) and those working on developing end-user applications (e.g. information extraction, text summarisation, document generation, machine translation, and second language instruction). GATE should promote reuse of component technology, permit specialisation and collaboration in large-scale projects, and allow for the comparison and evaluation of alternative technologies.

GATE comprises three principal components:

the GATE document manager, an annotation database storing arbitrary information about documents and document collections;
CREOLE, a Collection of REusable Objects for Language Engineering;
the GATE Graphical Interface, an easy-to-use graphical interface for running and evaluating modules and systems against document collections.

In the talk I discuss these three components and illustrate their use by describing how the Sheffield MUC-6 information extraction system has been embedded in GATE.

Using word frequency lists to measure corpus homogeneity and similarity between corpora.

Adam Kilgarriff
Research Fellow
Information Technology Research Institute
University of Brighton
Friday 3rd May 1996
1pm B39 Computing Department, SECAMS.

Abstract:

How similar are two corpora? A measure of corpus similarity would be very useful for lexicography and language engineering. Word frequency lists are cheap and easy to generate so a measure based on them would be of use as a quick guide in many circumstances where a more extensive analysis of the two corpora was not viable; for example, to judge how a newly available corpus related to existing resources. The paper presents a measure, based on the Chi-squared statistic, for measuring both corpus similarity and corpus homogeneity. We show that corpus similarity can only be interpreted in the light of corpus homogeneity. Issues relating to the normalisation of similarity scores are discussed, and some results of using the measure are presented.
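As a rough illustration of the kind of measure described, the sketch below computes a Chi-squared statistic over the most frequent words of two tokenised corpora. It is a minimal sketch only: the function name chi_squared, the top_n cut-off and the whitespace tokenisation are assumptions made here, and the abstract does not give the exact formulation or normalisation used in the paper.

```python
from collections import Counter


def chi_squared(tokens_a, tokens_b, top_n=500):
    """Chi-squared statistic over the top_n most frequent words of two corpora.

    Both token lists must be non-empty. Lower values suggest more similar
    word frequency profiles. The top_n cut-off is an illustrative assumption.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    total = size_a + size_b
    combined = freq_a + freq_b
    score = 0.0
    for word, joint in combined.most_common(top_n):
        # Expected counts if the word were spread in proportion to corpus size.
        expected_a = joint * size_a / total
        expected_b = joint * size_b / total
        score += (freq_a[word] - expected_a) ** 2 / expected_a
        score += (freq_b[word] - expected_b) ** 2 / expected_b
    return score


if __name__ == "__main__":
    corpus_a = "the cat sat on the mat".split()
    corpus_b = "the dog ate the bone on the mat".split()
    print(chi_squared(corpus_a, corpus_b))
```

Under the same assumptions, homogeneity could be estimated by applying the function to two halves of a single corpus, which gives a baseline against which a between-corpus score can be read.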

Introducing SARA

Lou Burnard
(Manager, Humanities Computing)
Oxford University Computing Services
Friday 26th April 1996
1pm B66 Linguistics Department
Abstract:

SARA (SGML-Aware Retrieval Application) is a client/server software tool allowing a central database of texts with SGML mark-up to be queried by remote clients. The system was developed at Oxford University Computing Services, with funding from the British Library Research and Development Department (1993-4) and the British Academy. The original motivation for its development was the need to provide a robust low-cost search-engine for use with the 100 million word British National Corpus, and several features of the system design necessarily reflect this.

The SARA system has four key parts:

the indexing program, which generates an index of tokens from an SGML marked-up text
the server program, which accepts messages in the Corpus Query Language (see below) and returns results from the SGML text
the SARA protocol, a formally defined set of message types which determines legal interactions between the client and server programs; this protocol makes use of a high-level query language known as CQL (for Corpus Query Language)
one or more client programs, with which a user interacts in any appropriate platform-specific way, and which communicate with the server program using the protocol

This presentation will introduce the SARA architecture and its intended application. An overview of the server and the CQL protocol will be given, together with a full description of the currently available MS-Windows client.
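The role of the indexing program can be pictured with a small sketch that strips mark-up from an SGML-like text and records where each token occurs. This is not SARA's actual index format or query language, neither of which is described here; the names build_index and lookup, the regular expressions and the sample text are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

TAG = re.compile(r"<[^>]*>")          # crude match for one SGML tag (assumption)
WORD = re.compile(r"[A-Za-z0-9']+")   # crude tokeniser (assumption)


def build_index(sgml_text):
    """Map each lower-cased token to the list of its token positions."""
    plain = TAG.sub(" ", sgml_text)   # drop the mark-up, keep the words
    index = defaultdict(list)
    for position, match in enumerate(WORD.finditer(plain)):
        index[match.group().lower()].append(position)
    return index


def lookup(index, word):
    """Return the positions of word, or an empty list if it never occurs."""
    return index.get(word.lower(), [])


if __name__ == "__main__":
    text = "<s><w>The</w> <w>cat</w> saw the <w>dog</w></s>"
    index = build_index(text)
    print(lookup(index, "the"))
```

In a client/server arrangement of the kind outlined above, a structure like this would live on the server side, with clients sending queries and receiving results rather than reading the text directly.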

Text Categorisation
Chris Paice
Department of Computing at Lancaster.

Wednesday 24th January 1996
2pm Skylab meeting room (SECAMS)
Abstract:

Automatic text categorisation is just what it sounds like: assigning texts to predefined categories by computer. Usually, the categories are subject areas. Some applications look promising - e.g., forwarding incoming stories to appropriate departments in a news agency.

In this talk I will first outline some methods which are used for text categorisation, and point out various problems. I will then turn to a possibility which appears to have been neglected so far, but may be of interest to corpus linguists: the automatic identification of text types or genres. This leads on to the related problem of text segmentation, which is potentially of use for text understanding and information extraction systems.
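The abstract does not say which categorisation methods the talk covers. As one simple possibility, the sketch below builds a word-count profile for each predefined category from labelled examples and assigns a new text to the category whose profile it most resembles under cosine similarity. The function names and the tiny training set are invented for illustration.

```python
import math
from collections import Counter


def vectorise(text):
    """Bag-of-words vector: lower-cased, whitespace-separated word counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
        math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def train(labelled_texts):
    """Sum the word counts of the training texts in each category."""
    profiles = {}
    for category, text in labelled_texts:
        profiles.setdefault(category, Counter()).update(vectorise(text))
    return profiles


def categorise(profiles, text):
    """Return the category whose profile is most similar to the text."""
    vec = vectorise(text)
    return max(profiles, key=lambda c: cosine(profiles[c], vec))


if __name__ == "__main__":
    # Invented training data, standing in for a news agency's departments.
    profiles = train([
        ("sport", "the team won the match"),
        ("finance", "the bank raised interest rates"),
    ])
    print(categorise(profiles, "the match was won by the home team"))
```

The same scheme could in principle be pointed at text types or genres rather than subject areas by changing the labels and the features counted, though whether word counts alone would separate genres well is an open question.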

Language Id (statistical language id with a surprise twist)
Ted Dunning
Computing Research Laboratory at New Mexico State University

Thursday 4th January 1996
2pm Skylab meeting room (SECAMS)
Abstract:

Given the following 20 character strings,

e pruebas bioquimica
man immunodeficiency
faits se sont produi

it is hardly surprising that a person can identify the languages as Spanish, English and French, respectively. It is not even surprising that a person who speaks very little French or Spanish can do this correctly. Clearly, language understanding is not required for language identification. Furthermore, only the French string contains any closed class words, and none of the strings contain any accented characters which are unique to that language (erroneously so in the Spanish example).

Given this simple fact, it is a natural step to wonder just how simple a computer program could be which is capable of performing a comparable job of language identification. It is also interesting to ask whether such a program might be derived from relatively general mathematical principles, or whether it could be written in such a way that it could learn the characteristics of the languages to be distinguished. Another important question is how many characters are needed to reliably identify the language of a string. Finally, it is important to know just how broad the applicability of such a program actually is.

In this talk, I will discuss fully automatic methods for performing this distinction. These methods are robust, simple and fast.

Furthermore, they have much wider applicability than was originally thought. I will also describe how they were used in "The Case of the Missing Sequences".
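The abstract does not spell out the methods themselves. To give a feel for how little machinery a program of this kind might need, the sketch below learns character bigram counts from a short sample of each language and picks the language under which a new string's bigrams are most probable. The add-one smoothing, the alphabet_size value, the function names and the training sentences are all assumptions made here, not details taken from the talk.

```python
import math
from collections import Counter


def train_model(sample, n=2):
    """Count the character n-grams in a training sample for one language."""
    return Counter(sample[i:i + n] for i in range(len(sample) - n + 1))


def log_score(model, text, n=2, alphabet_size=256):
    """Add-one smoothed log-probability of the text's n-grams under a model."""
    denom = sum(model.values()) + alphabet_size ** n
    return sum(math.log((model[text[i:i + n]] + 1) / denom)
               for i in range(len(text) - n + 1))


def identify(models, text, n=2):
    """Return the language whose model gives the text the highest score."""
    return max(models, key=lambda lang: log_score(models[lang], text, n))


if __name__ == "__main__":
    # Invented, unaccented training sentences; real use would need far more text.
    models = {
        "English": train_model(
            "the quick brown fox jumps over the lazy dog and the cat sat on the mat"),
        "Spanish": train_model(
            "el perro come la comida y el gato duerme en la casa de la ciudad"),
        "French": train_model(
            "le chien mange la nourriture et le chat dort dans la maison de la ville"),
    }
    print(identify(models, "the dog sat on the cat"))
```

With samples this small the result is only reliable for strings that resemble the training text; how many characters are needed for reliable identification is exactly the kind of question the abstract raises.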