Seminars from visiting scholars

Humanities computing: what is it and why is it important?
Dr Marilyn Deegan, Oxford University. Bowland SCR, 2 p.m. Friday 24th October 2003.
Corpus linguistics and grammaticalisation theory
Sebastian Hoffmann, University of Zurich. September 25th 2003, 11am B61 Bowland.
CoreNet: Chinese-Japanese-Korean Wordnet with Shared Semantic Hierarchy and its Development Procedure
Prof. Key-Sun Choi, KAIST (Korea Advanced Institute of Science and Technology). August 29th 2003, 9.15am Bowland B61.
Parsing dialogue moves into higher level dialogue structure
Jean Carletta, Edinburgh University. March 4th 2002, 1pm B61.
The intonation of requests: a corpus-based approach
Anne Wichmann, University of Central Lancashire. February 25th 2002, 1pm B61.
Corpus linguistics and its alternatives
Wolfgang Teubert, Birmingham University. January 24th 2002 4.30pm George Fox LT4.
Problems of Frequency Count in a Corpus of German
Randall Jones, Brigham Young University. November 5th 2001 1pm, B66, Linguistics.
NLP meets TEFL: Tracing the Zero Article
Oliver Mason, University of Birmingham. Monday 7th June 1999 12pm A62, Linguistics. (full presentation available)
Towards the `Corpusization' of Spoken Discourse Transcripts: Paired Oral Interviews in English Language Testing as a Pilot Study
Gordon Tucker, Centre for Language and Communication Research, Cardiff University. Monday 24th May 1999 12pm A62, Linguistics.
Author's briefing on WordSmith
Mike Scott, University of Liverpool. Monday 17th May 1999 12pm B66, Linguistics.
Cambridge International Dictionaries and Corpora
Andrew Harley, Systems Development Manager - ELT Reference, Cambridge University Press. Monday 22nd March 1999, 12pm, North Spine Seminar Room 1.
Recognising proper names in text, by context-sensitive pattern-matching
Bill Black (Centre for Computational Linguistics, UMIST), on study leave in Dept of Computing, Lancaster. Friday 21st November 1997 2 p.m. B80, Linguistics.
Attacking Anaphora on All Fronts
Ruslan Mitkov (University of Wolverhampton). Friday 6th December 1996 3.00 p.m. B80, Linguistics.
GATE: A General Architecture for Text Engineering.
Robert Gaizauskas, Department of Computer Science, University of Sheffield. Tuesday 14th May 1996 2pm A62 Linguistics Department.
Using word frequency lists to measure corpus homogeneity and similarity between corpora.
Adam Kilgarriff, Information Technology Research Institute, University of Brighton. Friday 3rd May 1996 1pm B39 Computing Department, SECAMS.
Introducing SARA.
Lou Burnard, Oxford University Computing Services. Friday 26th April 1996 1pm B66 Linguistics Department.
Text Categorisation.
Chris Paice (Department of Computing at Lancaster). Wednesday 24th January 1996 2pm Skylab meeting room (SECAMS)
Language Id (statistical language id with a surprise twist)
Ted Dunning (Computing Research Laboratory at New Mexico State University). Thursday 4th January 1996.
Dimensions in English
Doug Biber and Ed Finegan. Friday May 13th 1994.
Use and usefulness: frequency of use and the teaching of English
Graeme Kennedy (Victoria University of Wellington). 3rd November 1993.
Taking Corpora to Lancaster: A new study of the modal verb WILL
John M. Kirk (The Queen's University of Belfast). 10th September 1993.
Universalism in Phonology: Atoms, Structures, Derivations.
Jacques Durand (University of Salford). 20th January 1993.
Constraint-based approaches to Machine Translation
Louisa Sadler (Department of Language and Linguistics, University of Essex). April 24th 1992.


Towards the `Corpusization' of Spoken Discourse Transcripts: Paired Oral Interviews in English Language Testing as a Pilot Study
Gordon Tucker
Centre for Language and Communication Research
Cardiff University
Monday 24th May 1999 12pm A62 Linguistics.


There is (hopefully) a growing awareness amongst researchers of spoken discourse that corpus linguistic approaches to their data may provide an additional investigative and analytic resource to complement traditional forms of discourse and conversation analysis.

Whilst spoken data transcripts in word processing format often abound in linguistics departments where discourse is studied, considerable work is necessary to convert them into a useful resource for corpus-linguistic queries. This work is made more difficult when such centres do not have substantial computer science, computational linguistic or corpus linguistic support.

The central question that needs to be addressed is whether and to what extent it is possible to realistically `corpusize' these data sources, given the very limited specialist support. This general question encompasses a good number of issues. These range from mark-up (with SGML as a strong candidate) to the availability of suitable search engine software, where, again, there is little possibility of customized software tools being developed in-house. Furthermore, with specialized corpora of this kind, there are issues of optimum size and representativeness.

Mark-up is a major issue, both in terms of what can or cannot be achieved automatically (e.g. tagging, parsing etc.) and what forms of encoding are necessary or potentially helpful to enable discourse researchers to exploit the corpus (e.g. semantic tagging, turn-transition and other relevant discourse features etc.).

The potential to harness spoken discourse data in this way is currently being explored in Cardiff through a pilot compilation of a corpus of paired oral interviews taken from the University of London - Edexcel English Language Tests. These data bring with them the additional problem of handling Non Native Speaker (NNS) language.

One important purpose of the talk is therefore to share this rather solitary experience and to outline my approach to the numerous problems encountered in dealing with spoken discourse and taking the original transcript as a point of departure. This is very much `work in progress' and it is hoped that the enterprise will benefit from its airing to an informed and specialist audience.

Recognising proper names in text, by context-sensitive pattern-matching
Bill Black
(Centre for Computational Linguistics, UMIST)
Friday 21st November 1997 2 p.m. B80, Linguistics.


Proper names constitute a problem for many natural language processing applications, since with a few exceptions they are not to be found in dictionaries, and they can be constructed in a similar way to descriptive noun phrases. The FACILE project (4th Framework Language Engineering) is concerned with the categorisation of news texts and with extracting information from them. In the FACILE system, named entity analysis (using the operational definition used in the Message Understanding Conferences) takes place at an early, preprocessing stage, building on tokenisation, tagging and morphological analysis (in 4 languages). The problem is not just to syntactically label possibly complex proper names, but to classify them as persons, organisations and locations, and to recognise abbreviated forms as coreferent.

The approach we are taking uses a hybrid pattern-matching/parsing machinery with a rule metalanguage we have designed and refined in the light of results in the MUC-7 "dry run" a few weeks ago.

Attacking Anaphora on All Fronts

Ruslan Mitkov
(University of Wolverhampton)
Friday 6th December 1996 3.00 p.m. B80, Linguistics


Anaphor resolution is a complicated problem in Natural Language Processing and has attracted the attention of many researchers. Most of the approaches developed so far have been traditional linguistic ones with the exception of a few projects where statistical, machine learning or uncertainty reasoning methods have been proposed. The approaches offered - from the purely syntactic to the highly semantic and pragmatic (or the alternative) - provide only a partial treatment of the problem. Given this situation and with a view to achieving greater efficiency, it would be worthwhile to develop a framework which combines various methods to be used selectively depending on the situation.

The talk will outline the approaches to anaphor resolution developed by the speaker. First, he will present an integrated architecture which makes use of traditional linguistic methods (constraints and preferences) and which is supplemented by a Bayesian engine for center tracking to increase the accuracy of resolution: special attention will be paid to the new method for center tracking which he developed to this end. Secondly, an uncertainty reasoning approach will be discussed: the idea behind such an underlying AI strategy is that (i) in Natural Language Understanding the program is likely to propose the antecedent of an anaphor on the basis of incomplete information and (ii) since the initial constraint and preference scores are subjective, they should be regarded as uncertain facts. Thirdly, the talk will focus on a two-engine approach which was developed with a view to improving performance: the first engine searches for the antecedent using the integrated approach, whereas the second engine performs uncertainty reasoning to rate the candidates for antecedents. Fourthly, a recently developed practical approach which is knowledge-independent and which does not need parsing and semantic knowledge will be outlined.

In the last part of his talk, R. Mitkov will explain why Machine Translation adds a further dimension to the problem of anaphor resolution. He will also report on the results from two projects which he initiated and which deal with anaphor resolution in English-to-Korean and English-to-German Machine Translation.

Attacking anaphora on all fronts is a worthwhile strategy: performance is enhanced when all available means are enlisted (i.e. the two-engine approach), or a trade-off is possible between more expensive, time-consuming approaches (the integrated, uncertainty-reasoning and two-engine approaches) and a more economical, but slightly less powerful approach (the practical "knowledge-independent" approach).

GATE: A General Architecture for Text Engineering

Robert Gaizauskas
Department of Computer Science
University of Sheffield
Tuesday 14th May 1996 2pm A62 Linguistics Department.


In this talk I will discuss the GATE project currently underway at Sheffield. GATE aims to provide a software infrastructure in which heterogeneous natural language processing modules may be evaluated and refined independently, or may be integrated into larger application systems. Thus, GATE will support both researchers working on component technologies (e.g. parsing, tagging, morphological analysis) and those working on developing end-user applications (e.g. information extraction, text summarisation, document generation, machine translation, and second language instruction). GATE should promote reuse of component technology, permit specialisation and collaboration in large-scale projects, and allow for the comparison and evaluation of alternative technologies.

GATE comprises three principal components:

In the talk I discuss these three components and illustrate their use by describing how the Sheffield MUC-6 information extraction system has been embedded in GATE.

Using word frequency lists to measure corpus homogeneity and similarity between corpora.

Adam Kilgarriff
Research Fellow
Information Technology Research Institute
University of Brighton
Friday 3rd May 1996
1pm B39 Computing Department, SECAMS.


How similar are two corpora? A measure of corpus similarity would be very useful for lexicography and language engineering. Word frequency lists are cheap and easy to generate, so a measure based on them would be of use as a quick guide in many circumstances where a more extensive analysis of the two corpora was not viable; for example, to judge how a newly available corpus related to existing resources. The paper presents a measure, based on the Chi-squared statistic, for measuring both corpus similarity and corpus homogeneity. We show that corpus similarity can only be interpreted in the light of corpus homogeneity. Issues relating to the normalisation of similarity scores are discussed, and some results of using the measure are presented.
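The abstract does not spell out the formula, so the following is only a minimal sketch of one way a Chi-squared comparison of two word frequency lists could be computed. The function name `chi_squared_score`, the `top_n` cut-off and the toy token lists are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter


def chi_squared_score(tokens_a, tokens_b, top_n=500):
    """Chi-squared statistic over the top_n most frequent words of both corpora combined.

    Assumption: expected counts are derived from the relative sizes of the
    two corpora, as in a standard test of homogeneity.
    """
    size_a, size_b = len(tokens_a), len(tokens_b)
    if size_a == 0 or size_b == 0:
        raise ValueError("both corpora must be non-empty")

    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    joint = freq_a + freq_b
    total = size_a + size_b

    score = 0.0
    for word, joint_count in joint.most_common(top_n):
        # Counts we would expect if the word were spread evenly across both corpora.
        expected_a = joint_count * size_a / total
        expected_b = joint_count * size_b / total
        score += (freq_a[word] - expected_a) ** 2 / expected_a
        score += (freq_b[word] - expected_b) ** 2 / expected_b
    return score


if __name__ == "__main__":
    # Hypothetical toy corpora, already tokenised.
    corpus_a = "the cat sat on the mat".split()
    corpus_b = "the dog sat on the log".split()
    print(chi_squared_score(corpus_a, corpus_b))
```

Under this reading, a lower score means the two frequency lists are more alike. One possible way to approach homogeneity, again as an assumption, would be to apply the same function to two halves of a single corpus and use that score as a baseline when interpreting between-corpus scores.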

Introducing SARA

Lou Burnard
(Manager, Humanities Computing)
Oxford University Computing Services
Friday 26th April 1996
1pm B66 Linguistics Department

SARA (SGML-Aware Retrieval Application) is a client/server software tool allowing a central database of texts with SGML mark-up to be queried by remote clients. The system was developed at Oxford University Computing Services, with funding from the British Library Research and Development Department (1993-4) and the British Academy. The original motivation for its development was the need to provide a robust low-cost search-engine for use with the 100 million word British National Corpus, and several features of the system design necessarily reflect this.

The SARA system has four key parts:

This presentation will introduce the SARA architecture and its intended application. An overview of the server and the CQL protocol will be given, together with a full description of the currently available MS-Windows client.

Text Categorisation

Chris Paice
Department of Computing at Lancaster.

Wednesday 24th January 1996
2pm Skylab meeting room (SECAMS)

Automatic text categorisation is just what it sounds like: assigning texts to predefined categories by computer. Usually, the categories are subject areas. Some applications look promising - e.g., forwarding incoming stories to appropriate departments in a news agency.

In this talk I will first outline some methods which are used for text categorisation, and point out various problems. I will then turn to a possibility which appears to have been neglected so far, but may be of interest to corpus linguists: the automatic identification of text types or genres. This leads on to the related problem of text segmentation, which is potentially of use for text understanding and information extraction systems.

Language Id (statistical language id with a surprise twist)
Ted Dunning
Computing Research Laboratory at New Mexico State University

Thursday 4th January 1996
2pm Skylab meeting room (SECAMS)

Given the following 20-character strings,

e pruebas bioquimica
man immunodeficiency
faits se sont produi

it is hardly surprising that a person can identify the languages as Spanish, English and French, respectively. It is not even surprising that a person who speaks very little French or Spanish can do this correctly. Clearly, language understanding is not required for language identification. Furthermore, only the French string contains any closed class words, and none of the strings contain any accented characters which are unique to that language (erroneously so in the Spanish example).

Given this simple fact, it is a natural step to wonder just how simple a computer program could be which is capable of performing a comparable job of language identification. It is also interesting to ask whether such a program might be derived from relatively general mathematical principles, or whether it could be written in such a way that it could learn the characteristics of the languages to be distinguished. Another important question is how many characters are needed to reliably identify the language of a string. Finally, it is important to know just how broad the applicability of such a program actually is.

In this talk, I will discuss fully automatic methods for performing this distinction. These methods are robust, simple and fast.

Furthermore, they have much wider applicability than was originally thought. I will also describe how they were used in "The Case of the Missing Sequences".
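The abstract does not say which statistical method the talk presents. Purely as an illustration of how small a learning-based language identifier can be, here is a sketch that assumes character bigram counts with add-one smoothing. The function names, the choice of bigrams and the tiny training sentences are all hypothetical, not the speaker's implementation.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return every overlapping character n-gram in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(samples, n=2):
    """Build one n-gram count table per language from {language: training text}."""
    return {lang: Counter(ngrams(text.lower(), n)) for lang, text in samples.items()}


def identify(models, text, n=2):
    """Return the language whose smoothed n-gram model gives text the highest log-probability."""
    # Shared vocabulary size for add-one smoothing (plus one slot for unseen n-grams).
    vocab_size = len(set().union(*models.values())) + 1
    best_lang, best_score = None, -math.inf
    for lang, counts in models.items():
        total = sum(counts.values())
        score = sum(
            math.log((counts[gram] + 1) / (total + vocab_size))
            for gram in ngrams(text.lower(), n)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


if __name__ == "__main__":
    # Hypothetical toy training data; a real system would need far more text.
    models = train({
        "english": "the quick brown fox jumps over the lazy dog and the cat",
        "spanish": "el rapido zorro marron salta sobre el perro perezoso y el gato",
        "french": "le renard brun rapide saute par dessus le chien paresseux et le chat",
    })
    print(identify(models, "the lazy dog"))
```

Because the models are learned from sample text rather than hand-written rules, the same code can in principle be pointed at any set of labelled strings, which is in the spirit of the wider applicability mentioned above.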