ACASD is a suite of programs for the automated semantic field tagging
and content analysis of spontaneous spoken English. It is made up of five
main modules: the semantic tagging program itself; a program for manual
postediting of the output; a program for the identification of key
syntactico-semantic links; a program for mapping semantic field tags onto
research-specific content categories; and a dedicated statistics and
concordance module with a user-friendly X-windows
interface.
This project aims to combine the best of both qualitative
and quantitative survey research,
but it has many other potential applications in linguistics and more generally
in the social sciences and humanities:
for example, a pilot study of a large corpus of doctor-patient
interactions has been carried out using ACASD, and its application to
the stylistic analysis of written as well as spoken English has been
piloted by Wilson and Leech (1993).
The project is funded by the EPSRC (Engineering and Physical Sciences
Research Council),
and is part of a collaborative (JFIT) project with
commercial partnership and
the support of the DTI.
There is more information on the USAS web page.
Contact Paul Rayson
Treebank development for probabilistic parsing (ATR project)
[1994-1997]
UCREL collaborated with the Advanced Telecommunications Research (ATR)
Institute of Kyoto, Japan, in the development of a computer/human
interactive
system for the treebanking of unconstrained modern English texts.
CLAWS assigns a syntactic tag to each word and this is enriched with a
semantic tag selected by the human analyst using XANTHIPPE,
an editing program written by Roger Garside.
The fully-labelled sentences are then grammatically
analysed ("parsed") with the aid of ATR's treebanking tool, which allows the
grammarian to select the correct analysis according to the ATR Grammar
which is contained within the tool.
The sentences will form a training corpus for probabilistic parsing.
Automatic content analysis of interview transcripts
[1990-92]
This SERC project, funded under IEATP (Information Engineering Advanced
Technology Programme), was a collaborative project, with DTI funding.
The lead industrial partner was Reflexions Market Research, London. The task
was to develop software for assigning content tags to the words in
interview transcripts, identifying syntactico-semantic links between words, and outputting
a statistical profile of the resulting content analysis. The project ventured
into a field where natural language processing had not operated before
(see the ICAME91 paper: Wilson and Rayson, 1993).
The project integrates the different approaches to produce a future
dictionary, Benedict, which meets the users and their needs halfway by
providing an interactive user-specified access interface, tailoring the
dictionary information supply according to user specifications,
incorporating multilayered entry structure with new information
categories and links to corpus data and syntactically- and
semantically-based corpus search tools in the dictionary data base.
The new product, Benedict, is particularly aimed to cater for the
demands of the multilingual corporate world.
Our project partners are Kielikone Oy, HarperCollins Publishers Ltd,
Gummerus Kustannus Oy, University of Tampere, and Nokia.
For more information, see the project website at
Kielikone.
The corpus will cover a two-hundred-year time span from 1560 to 1760. It draws
texts from six text-types: trial proceedings, depositions, drama, handbooks in
dialogue form, prose fiction and language teaching books. In this way,
examples of both recorded and constructed dialogue are included.
This project also represents a collaboration with Prof. Merja Kytö
(Uppsala University).
Contact: Jonathan Culpeper (E-mail: j.culpeper@lancaster.ac.uk)
See the project home page
in Bergen for further information.
The questions to be addressed are:
For more details see the project web page.
We are grateful for the support of external developers and beta-testers
as part of this work including a) Jukka-Pekka Juntunen (Kielikone Ltd,
Helsinki) b) Dr John Milton (Hong Kong University of Science and
Technology) c) Dr Yukio Tono (Meikai University, Tokyo).
Two recent EPSRC-funded projects at Lancaster (MILLE and EMILLE) have
provided a great service to the non-indigenous minority language
communities in the UK by locating existing resources, investigating
end-user needs and wants, examining basic technical issues and
beginning to generate appropriate resources. However, no such
consolidated survey or examination of issues has yet been undertaken
for the BIMLs. The present project thus has three broad goals: first,
to survey the existing resources and tools for the various BIMLs;
second, to obtain information about end-user needs and wants in these
areas; and, third, to investigate some of the technical and practical
issues that the BIMLs raise, primarily for the collection,
transcription and annotation of spoken corpus material. The latter will
involve collecting and annotating a small sample corpus of spoken Welsh
and Gaelic.
For more information, see the
project web site.
UCREL's work in this collaboration entailed compiling a large
treebank (or syntactically analysed corpus - more than 3 million words)
to act as a testbed for development of probabilistic grammars. Another type of corpus
processing,
anaphoric annotation, has been extensively carried out,
using the same treebank material. UCREL's work involved developing specialised editing
tools for these tasks.
These kinds of corpus processing have primary applications in (1) speech
recognizers, and (2) probabilistic machine translation.
For more information, see the book:
Black, Garside, and Leech (eds) (1993).
For more information, see the EAGLES WP4 homepage.
Also see Grice et al (2000).
For more details see the project web page.
The data for the project is being gathered from newspapers in Poland
such as Gazeta Wyborcza and from tape recordings of spontaneous speech
in a variety of settings.
This project is supported by the Leverhulme Trust under its Emeritus
Fellowship Scheme. The research project runs for 15 months from
October 2003.
The plan is to build a corpus which matches as closely as possible the
LOB and FLOB corpora of written British English, except that the year
of data collection is 1931, or near to that date (+/- 3 years).
The immediate purpose of building this corpus is to make it possible to
compare these three temporally equidistant corpora (1931, 1961, 1991):
"Pre-LOB", LOB, and FLOB. This will enable us to track grammatical
change through a period of 60 years of the 20th century. In previous
projects on recent grammatical change in English funded by the AHRB and
the British Academy, we have been able to observe some notable trends
through the differences between corpora of the 1960s and the 1990s,
such as declining frequency of the modal auxiliaries (especially shall,
must, ought to and may) and a growing frequency of semi-modals such as
have to, need to, and want to. By projecting this comparison back to
the beginning of the 1930s, we will be able to establish whether these
trends are a continuation of earlier changes.
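A comparison of this kind usually rests on frequencies normalised by corpus size, so that corpora of slightly different lengths can be set side by side. The sketch below is only an illustration of that arithmetic; the function names are hypothetical and no project figures are used.

```python
def per_million(count, corpus_size):
    """Frequency of an item per million words of corpus text."""
    return count * 1_000_000 / corpus_size


def relative_change(count_old, size_old, count_new, size_new):
    """Proportional change in normalised frequency between two corpora.

    A negative value means the item became less frequent.
    """
    old = per_million(count_old, size_old)
    new = per_million(count_new, size_new)
    return (new - old) / old
```

Applied to the counts for a modal auxiliary in two of the corpora, a negative result would indicate a decline between the two sampling dates.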
The early decades of the 20th century are virtually unrepresented in
corpora of English, and so the planned new corpus will fill an
important empirical gap in our historical knowledge of the language.
The new corpus under construction is as yet unnamed.
For more details,
see the project web page.
The approach we have
adopted thus far is to produce a list of variant spellings, which we
manually match to normalized forms. A Variant Detector computer program
(VARD) then inserts modern equivalents of these forms when they appear
in a given text while preserving the original variant. This approach is
proving to be very effective: we have identified 45,800+ variants from
our analyses of different historical texts to date, and have undertaken
a preliminary empirical study of spelling variation across the
16th-19th centuries based on 4,000 of these variants (Archer & Rayson
2004).
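The lookup step described above can be sketched roughly as follows. This is not the VARD code: the variant entries, the function name and the output markup are all assumptions made for illustration.

```python
import re

# Hypothetical variant list: historical spelling -> normalized modern form.
# These entries are illustrative only, not taken from the project's data.
VARIANTS = {
    "loue": "love",
    "vpon": "upon",
    "doe": "do",
}

TOKEN = re.compile(r"[A-Za-z]+")


def normalize(text, variants=VARIANTS):
    """Insert the modern form of each known variant, keeping the original."""
    def tag(match):
        word = match.group(0)
        modern = variants.get(word.lower())
        if modern is None:
            return word  # not a listed variant: leave untouched
        # The markup format is an assumption made for this sketch.
        return f'<reg orig="{word}">{modern}</reg>'

    return TOKEN.sub(tag, text)
```

Because the original spelling is kept as an attribute, later analysis can work with either the variant or the normalized form. The sketch does not restore capitalisation on the inserted form.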
Whilst this early work has fostered considerable interest from
historical linguists, dialectologists and historians, we recognize that
such an approach will prove to be too time-consuming in the long term.
We therefore intend to identify patterns from our existing variants as
a means of developing the VARD so that it is able to detect and
"normalize" spelling variants automatically (via fuzzy matching
procedures: edit distance, letter replacement heuristics and the Soundex algorithm).
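As a rough illustration of two of the fuzzy matching procedures named above, the sketch below implements Levenshtein edit distance and standard American Soundex, and uses them to rank candidate modern forms. It is a minimal sketch, not the VARD implementation: the `candidates` function and its threshold are hypothetical, and the letter replacement heuristics are not shown.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Soundex digit for each consonant; vowels, h, w and y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Standard four-character Soundex code, e.g. a letter plus three digits."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    last = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != last:
            result += code
        if ch not in "hw":  # h and w do not separate letters with equal codes
            last = code
    return (result + "000")[:4]


def candidates(variant, lexicon, max_distance=2):
    """Rank lexicon words that are close in spelling or share a Soundex code."""
    key = soundex(variant)
    scored = []
    for word in lexicon:
        distance = edit_distance(variant.lower(), word.lower())
        if distance <= max_distance or soundex(word) == key:
            scored.append((distance, word))
    return [word for _distance, word in sorted(scored)]
```

Combining the two measures is a design choice of this sketch: edit distance catches single-letter variation, while Soundex catches spellings that differ more on the page but sound alike.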
We believe that the VARD enables
a more comprehensive study of variation than has been previously
possible (cf. Scragg 1974; Osselton 1984). We will demonstrate this by
undertaking a study of variation across four centuries (i.e.
16th-19th), using the datasets mentioned above. We will also study
variation across text-types dating from the same period (i.e. the 17th
century). This particular study will draw from Shakespeare's Complete
Works, the Lampeter Corpus (1640-1760) and the Corpus of English
Dialogues (1560-1760). The latter study will also take note of any
idiosyncratic variant usage (cf. Osselton 1984). Because of the way in
which variants are categorized, the potential of the VARD is much
greater than our proposed pilot studies intimate: indeed, the VARD will
enable quantitative analyses of not only non-standard spellings and
contractions, but also variation at the morphological, phonetic,
orthographic and syntactic levels.
The SPAAC dialogue annotation system has been developed, under EPSRC
grant GR/R371542, primarily for XML speech-act annotation of service
dialogues. But in the course of working on it we broadened the
annotation to include six dimensions:
Dimensions (c) and (d) can be seen as respectively concerned with
domain-restricted semantics and domain semantic-pragmatics. These need
more work. Dimension (f) provides the key level of annotation, for
which other dimensions contribute diagnostic information.
The tagging is semi-automatic: after an initial parsing-assisted
structural segmentation of the unpunctuated turns, the annotation tool
SPAACy (developed by Martin Weisser) then automatically adds XML
mark-up to the dialogue and undertakes a preliminary analysis of all
five dimensions (b)-(f). The output of SPAACy is then manually
post-edited.
The result so far has been a tagging of two major kinds of telephone
dialogue: telephone operator and service dialogues (data provided by
BT) and train booking dialogues (data provided by theTrainline.com).
Arrangements for making the annotated dialogue corpora available to
users are under way.
For details of the speech act annotation scheme,
see the SPAAC Annotation Scheme document (available as
MS Word
or
PDF).
The project website is also available.
It is a basic assumption of WordHoard that new kinds of historical,
literary, or broadly cultural analysis will be supported through the
forms of data access that are made possible when literary texts are
treated in the manner of linguistic corpora. Deeply tagged corpora of
course support more finely grained inquiries at a verbal or stylistic
level. But more importantly, access to the words of a text at such
microscopic levels also lets you look in new ways at the imaginative
worlds created by those words.
WordHoard consists of the following components:
Benedict - The New Intelligent Dictionary
[2002-2005]
The Benedict project combines forces from language technology
providers, academia, the dictionary publishing world, and user
organizations to discover the best way to cater for the needs of
dictionary users by combining state-of-the-art language technology with
research results on user needs and on the potential of future
dictionaries.
The British National Corpus (BNC)
UCREL was a member of a national
consortium of academic and industrial partners (the other members being
Oxford University Press,
Oxford University Computing Service,
The British Library,
Longman Group Ltd.
and W. & R. Chambers Ltd.). The project
goal was to compile a 100 million word national corpus of machine
readable text, including a wide variety of written and spoken British
English. UCREL's contribution to this project was primarily in the
linguistic analysis of the corpus. Funding: SERC/DTI.
See the distribution information.
British National Corpus Enhancement Project
[1995-6]
This EPSRC-funded Project will produce an enhanced version of the 100
million word British National Corpus.
The Corpus consists of a wide
variety of machine-readable texts selected from many written and
spoken genres of British English. UCREL was responsible for providing
the linguistic description for this Corpus, in the form of an automatic
part-of-speech analysis for the whole Corpus, and a detailed
hand-corrected analysis of a 2-million word sample selected from the
main Corpus. The BNC Enhancement Project aims to improve the accuracy
of this linguistic analysis by identifying and correcting errors and
ambiguities in the Corpus using various automatic techniques. The
Project also aims to produce documentation for the linguistic analysis.
(See above for information on the first project.)
A Corpus of English Dialogues, 1560-1760
The 'Corpus of English Dialogues, 1560-1760' is an A.H.R.B. funded project,
which aims to construct a one-million-word-plus computerised corpus of Early
Modern English dialogue texts. Why dialogue texts? The focus is on dialogue,
because it will allow
insight into the nature of impromptu speech and interactive two-way
communication in the Early Modern English period - aspects which have received
little research attention.
Grammatical tagging of the Corpus of London Teenage English (COLT project)
[1996]
This is a small scale project funded by the University of Bergen (Norway). UCREL
is undertaking the grammatical tagging of the 500,000 word COLT corpus.
Corpus-based grammar in contrast (CORGRAM)
Funded by the AHRC,
this project explores an application of novel corpus-based methods to a
set of issues in grammatical analysis, in the context of a language,
Nepali, for which corpus linguistics is in its infancy. It will also
extend the analysis to a cross-linguistic comparison bringing in
English and Russian.
The methodology that will be employed by the project is a novel
empirical approach to grammatical categories and the quantitative
patterns in which they are distributed in textual data.
At the core of the methodology are co-occurrence statistics derived
from text corpora (primarily written corpora, due to patchy
availability of spoken corpora of sufficient extent). These statistics
will take two primary forms: raw co-occurrence counts of grammatical
categories, and collocation lists.
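One simple reading of "raw co-occurrence counts of grammatical categories" is a count of adjacent tag pairs in a part-of-speech tagged corpus. The sketch below assumes that reading; the adjacency window, the function name and the tag labels used in the usage note are assumptions, not the project's own scheme.

```python
from collections import Counter


def tag_cooccurrences(tagged_sentences):
    """Count adjacent pairs of grammatical categories.

    Each sentence is a list of (word, tag) pairs; pairs are counted within
    sentences only, never across a sentence boundary.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        tags = [tag for _word, tag in sentence]
        counts.update(zip(tags, tags[1:]))
    return counts
```

Given sentences tagged with labels such as DET, NOUN and VERB, the result maps pairs like ("DET", "NOUN") to the number of times a determiner is immediately followed by a noun.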
Project website: http://www.lancs.ac.uk/staff/hardiea/nepali/corgram.php
and http://www.ling.lancs.ac.uk/activities/534/
Contact: Dr Andrew Hardie
Corpus Resources and Terminology Extraction (CRATER)
[1994-5]
This EU-funded project, of which Tony McEnery of UCREL was the lead
partner, led to the alignment of a multilingual corpus of English, French
and Spanish texts which are mutual translations (a parallel corpus). The
tri-lingual aligned corpus is being distributed to users in Europe and
elsewhere, being the first corpus of its kind to be made generally
available through the Internet. The project has also created software for
corpus alignment, and for extraction of terminology from the parallel
corpus.
We collaborated with IBM Paris, the Paris software house
(C2V),
and the Universidad Autonoma de Madrid.
The corpus can be accessed on-line.
Contact Tony McEnery
EAGLES Phase II: Guidelines for the Syntactic Annotation of Corpora
This phase of the EAGLES
(Expert Advisory Group for Language Engineering Standards)
Initiative began in February 1995 and ran until
the end of July 1995, and was run from Lancaster.
The final aim was to produce guidelines for the standardisation of syntactic
annotation of corpora. However, this work has been carried out in three phases:
This work carries on directly from the previous EAGLES phase on morphosyntactic
annotation of corpora. However, in dealing with syntactic annotation, existing
practices are already much more varied and tend, in most cases, to be more
theory-dependent. For these reasons the guidelines include the following sections:
EAGLES Preliminary Reports, published in electronic form by the European
Commission
(some are available at the
EAGLES Home Page
in Pisa, Italy):
The coordinators of this subgroup were
Geoffrey Leech,
Ruthanna Barnett,
Peter Kahrel.
Other members of the subgroup:
Hans van Halteren (Univ. of Nijmegen),
Jean-Marc Langé (IBM Paris),
Simonetta Montemagni (Institute of Computational Linguistics, Pisa),
Atro Voutilainen (Univ. of Helsinki).
Enabling Minority Language Engineering Project (EMILLE)
EMILLE (Enabling Minority Language Engineering) is a 3 year EPSRC
project at Lancaster University and Sheffield University, designed to
build a 63 million word electronic corpus of South Asian languages,
especially those spoken in the UK.
Contact Tony McEnery
or Paul Baker
Extending CLAWS
[July 2003 - February 2004]
Extending CLAWS was a project funded by Lancaster University's small grants
scheme for 4 months and enabled the following work:
a) Investigate and scope the changes required to allow CLAWS to be applied to American English.
b) Create parallel (American English) versions of the machine-readable dictionaries used by CLAWS, and begin implementing changes required for American English
c) Carry out an evaluation by applying part-of-speech tagging to the first release of the
American National Corpus and in addition the
Michigan Corpus of Academic Spoken English (MICASE).
d) Port CLAWS to allow it to run under the Windows operating systems.
EUROTRA Machine Translation Project (ET10/63)
[May 1992 - Oct 1993]
UCREL was involved (along with Essex University, IBM France, and the
Paris software house C2V) in the European Community's machine
translation project. UCREL's role was mainly in the grammatical
analysis of English texts for the project, and semi-automated argument
frame extraction for lexicography. Funding: EC.
Contact Tony McEnery
The Lancaster Database of Text Corpora
[1990-5]
A major outcome of this project is a corpus of grammatically and
semantically annotated texts drawn from varied genres of written English.
It is planned to lodge the corpus in the ESRC Data Archive, so that it will
be available to other researchers.
Language Engineering Resources for the British Isles Indigenous Minority Languages
[2002-]
The indigenous minority languages of the British Isles (or "BIMLs") –
Cornish, (Scottish) Gaelic, Irish, Manx, Scots, Ulster Scots (Ullans)
and Welsh – are becoming increasingly widely used in both public and
private life. Thus, speech and language technology applications for
these languages are now becoming an urgent need. These are needed not
only for monolingual content management, but also to aid translators
and interpreters since the BIMLs are nearly always used in bilingual
contexts alongside English. To develop such applications, basic
language resources (such as corpora of machine-readable texts,
machine-readable dictionaries, speech databases and so on) are
required.
Development of a Language Model for Speech Recognition
[1987-1992]
This project ran for over five years with IBM funding, and involved collaboration
with the IBM T.J.Watson Research Center, Yorktown Heights, New York
(up to the end of 1990, IBM UK Scientific Centre, Winchester, was also
involved).
Integrated resources for written and spoken language resources
[1997-8]
This is part of the EC’s EAGLES initiative, funded by DGXIII, to
establish standards for the development of natural language resources
in EC countries. UCREL is subcontracted to the University of Bielefeld,
Germany, to undertake a survey of practices in the representation and
annotation of dialogue, and to formulate provisional guidelines for
standardization of such practices.
The Machine Readable Spoken English Corpus (MARSEC)
This project, run by Dr Gerry Knowles jointly
with the University of Leeds, aims to convert the Lancaster-IBM Spoken
English Corpus into a relational speech database (including phonetic
transcription). Funding: ESRC.
See the related work at Reading.
Contact Gerry Knowles
Minority Language Engineering Project (MILLE)
The Minority Language Engineering Project is a joint project
between the Department of Linguistics at Lancaster University and
Oxford University Computer Services. It seeks to investigate the
development of corpus resources for UK non-indigenous minority
languages (NIMLs).
Contact Tony McEnery
or Paul Baker
A multimedia corpus-based longitudinal study of children's writing
[1996-1998]
A two-year project supported by the Leverhulme Trust
in collaboration with the Centre for Language in Social Life research
group in the Department of Linguistics and Modern English Language.
The particular interest of this project lies in the combination of textual
and non-textual (visual) material in the same corpus.
See the project publications in the proceedings of PALC 97
(Practical Applications of Language Corpora, Lodz, Poland) and the November 1998
issue of Literary and Linguistic Computing
(Smith et al 1998).
The Lancaster Corpus of Children's Project Writing is online.
Contact Roz Ivanic
Parallel Wordclass Tagging of the Parole 2 Corpus and the British National Corpus
[July 1997-] (Funded by EPSRC.)
The aim of this project, which involves collaboration with the
University of Birmingham, is to tag morphosyntactically 250,000
words of text from the PAROLE2 corpus and 1 million words from the
British National Corpus using two different taggers - Lancaster's
CLAWS tagger and Birmingham's QTAG tagger. Both taggers will be
developed to make use of an EAGLES-conformant tagset. The resulting
outputs will be compared and the comparison will focus on discovering
how far the two taggers correct each other's errors in predictable ways.
The analysis will result in a set of contextual rules to convert
dispreferred tags to preferred tags, so that the result of parallel
tagging will be more accurate than the output of a single tagger.
If there is time, it is hoped that the experiment can be extended to
take in further taggers.
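The comparison step can be sketched as follows, assuming a hand-corrected reference tagging is available for a sample of the text (that reference, and all names below, are assumptions of this sketch). It records which tagger was right for each pattern of disagreement and turns the majority outcome into a preferred tag. The contextual conditions that the project's rules would add are not modelled here.

```python
from collections import Counter


def disagreement_profile(tags_a, tags_b, reference):
    """For each (tag_a, tag_b) disagreement, count which tagger was right."""
    if not (len(tags_a) == len(tags_b) == len(reference)):
        raise ValueError("the three tag sequences must be aligned token by token")
    profile = {}
    for a, b, ref in zip(tags_a, tags_b, reference):
        if a == b:
            continue  # the taggers agree, so there is nothing to arbitrate
        wins = profile.setdefault((a, b), Counter())
        if a == ref:
            wins["a"] += 1
        elif b == ref:
            wins["b"] += 1
        else:
            wins["neither"] += 1
    return profile


def preferred_tags(profile):
    """Map each disagreement pattern to the tag that was correct more often."""
    rules = {}
    for (a, b), wins in profile.items():
        if wins["a"] != wins["b"]:  # ties give no basis for a preference
            rules[(a, b)] = a if wins["a"] > wins["b"] else b
    return rules
```

A mapping built this way only helps if the taggers' errors are predictable, which is the question the comparison is meant to answer.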
Polish/English Corpus Research and Applications
A British Council funded project, PELCRA, which is being carried out
together with the
University of Lodz in Poland,
aims to provide a
databank of spoken and written Polish which could help build better
dictionaries and help in English language education in Poland.
PELCRA website.
Contact Tony McEnery
Leverhulme Corpus Project
Investigator: Geoffrey Leech
Research Associate: Nick Smith (half-time)
Self-Access Grammar Tools
[1994-5] The
second phase of this project (funded by IHE [Innovation in Higher
Education]) has been completed, and the software is ready to be used
operationally in the teaching of Linguistics and English Language at
Part I level. The current widespread inadequacy of grammar teaching in
schools makes it highly desirable to provide this facility for
university foundation courses, and the software is already being
experimentally used in two universities apart from Lancaster.
Contact Tony McEnery
Corpus of Written British Creole
[Feb.-May 1995]
The purpose of this project was to collect and computerize, for the first
time, a corpus from the writings of British Creole speakers of
Afro-Caribbean origin.
More details are available.
Contact Dr Mark Sebba
Requirements reverse engineering to support business process change
[May 1998-April 2001]
The aim of the research proposed here is to improve the requirements
analysis for legacy system evolution where the underlying business
process has already changed. As far as we are aware, this is a unique
focus which, despite addressing a real-world problem, has not been
systematically addressed before. Our approach is to investigate the
reverse engineering of requirements documents by the novel integration
of techniques for the textual analysis of documentation; modelling of
business processes; and modelling the organisational structures serving
the business processes.
Rhythm and Timing in Spoken English
[1994-5]
The objective of this project has been to extract a model of rhythm and
timing from the MARSEC database (a multi-layered corpus of spoken English
developed at Lancaster in collaboration with Leeds University).
The MARSEC database is a unique resource which can now be used to investigate
in microscopic detail areas of spoken language performance which have baffled
phoneticians for years. The analysis of duration (closely allied to rhythm),
which is the focus of this project, has application to high-quality speech synthesis.
Contact Gerry Knowles
Scragg Revisited - a quantitative investigation of spelling across the centuries
[April 05 - July 05, funded by the British Academy with the University of Central Lancashire]
In January 2003, Dawn Archer and Paul Rayson began to explore the
feasibility of retraining the
UCREL Semantic Analysis System
(USAS) so that it was applicable to historical texts dating
from 1600 onwards (Archer et al 2003). Previously, the USAS system has
been used to annotate written and spoken English from the modern
period. Our initial experiments suggested that the statistical model
adopted in the part-of-speech tagging component (CLAWS) was sufficiently robust
to cope with grammatical variation across time, but spelling variants
caused difficulty, as variants of known words were not recognized by
the system. Our response was to implement a prototype spelling detector
as a pre-processing step in USAS. This work continued in the
WordHoard project
funded by the Mellon Foundation with Prof. Martin Mueller (Northwestern
University, Chicago), in which we linguistically annotated the Nameless
Shakespeare corpus and Chadwyck-Healey's Eighteenth and Nineteenth
Century Fiction corpora.
Contact Paul Rayson.
A Speech-Act Annotated Corpus for Dialogue Systems: Pilot Project (SPAAC)
Geoffrey Leech, Tony McEnery, and Martin Weisser
Unlocking the Word Hoard
[2004-5 Funded by the Andrew W. Mellon Foundation]
The WordHoard Project is named after an Old English phrase for the
verbal treasure "unlocked" by a wise speaker. It applies to highly
canonical literary texts the insights and techniques of corpus
linguistics, that is to say, the empirical and computer-assisted study
of large bodies of written texts or transcribed speech. In the
WordHoard environment, such texts are annotated or tagged by
morphological, lexical, semantic, prosodic, and narratological
criteria. They are mediated through a "digital page" or user interface
that lets scholarly but non-technical users explore the greatly
increased query potential of textual data kept in such a form.
WordHoard is a project of Academic Technologies at Northwestern
University, with support from the Northwestern University Library.
The linguistic annotation will be developed in collaboration with Paul Rayson and Dawn Archer from
UCREL. The project leaders at Northwestern University are Martin Mueller (Department of English
and Classics) and Bill Parod (Academic Technologies).