There is (hopefully) a growing awareness amongst researchers of spoken discourse that corpus linguistic approaches to their data may provide an additional investigative and analytic resource to complement traditional forms of discourse and conversation analysis.
Whilst spoken data transcripts in word-processing format often abound in linguistics departments where discourse is studied, considerable work is necessary to convert them into a useful resource for corpus-linguistic queries. This work is made more difficult when such centres do not have substantial computer science, computational linguistics or corpus linguistics support.
The central question that needs to be addressed is whether and to what extent it is possible to realistically `corpusize' these data sources, given the very limited specialist support. This general question encompasses a good number of issues. These range from mark-up (with SGML as a strong candidate) to the availability of suitable search engine software, where, again, there is little possibility of customized software tools being developed in-house. Furthermore, with specialized corpora of this kind, there are issues of optimum size and representativeness.
Mark-up is a major issue, both in terms of what can or cannot be achieved automatically (e.g. tagging, parsing etc.) and what forms of encoding are necessary or potentially helpful to enable discourse researchers to exploit the corpus (e.g. semantic tagging, turn-transition and other relevant discourse features etc.).
The potential to harness spoken discourse data in this way is currently being explored in Cardiff through a pilot compilation of a corpus of paired oral interviews taken from the University of London - Edexcel English Language Tests. These data bring with them the additional problem of handling Non-Native Speaker (NNS) language.
One important purpose of the talk is therefore to share this rather solitary experience and to outline my approach to the numerous problems encountered in dealing with spoken discourse, taking the original transcript as the point of departure. This is very much `work in progress', and it is hoped that the enterprise will benefit from its airing to an informed and specialist audience.
Proper names constitute a problem for many natural language processing applications, since with a few exceptions they are not to be found in dictionaries, and they can be constructed in a similar way to descriptive noun phrases. The FACILE project (4th Framework Language Engineering) is concerned with the categorisation of news texts and with extracting information from them. In the FACILE system, named entity analysis (using the operational definition adopted at the Message Understanding Conferences) takes place at an early, preprocessing stage, building on tokenisation, tagging and morphological analysis (in four languages). The problem is not just to syntactically label possibly complex proper names, but to classify them as persons, organisations and locations, and to recognise abbreviated forms as coreferent.
The approach we are taking uses a hybrid pattern-matching/parsing machinery with a rule metalanguage we have designed and refined in the light of results in the MUC-7 "dry run" a few weeks ago.
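The FACILE rule metalanguage itself is not reproduced here, but the flavour of pattern-based named entity classification can be suggested with a toy sketch. Everything below (the gazetteers, trigger words and the `acronym` helper) is invented for illustration and is not the FACILE machinery:

```python
import re

# Toy gazetteers and trigger words; a real system would use far larger resources.
PERSON_TITLES = {"Mr", "Mrs", "Dr", "President"}
ORG_SUFFIXES = {"Ltd", "Inc", "Corp", "University"}
LOCATION_NAMES = {"London", "Paris", "Cardiff"}

def classify_entity(phrase):
    """Classify a capitalised phrase as PERSON, ORG or LOCATION."""
    words = phrase.split()
    if words[0] in PERSON_TITLES:
        return "PERSON"
    if words[-1] in ORG_SUFFIXES:
        return "ORG"
    if phrase in LOCATION_NAMES:
        return "LOCATION"
    return "UNKNOWN"

def acronym(phrase):
    """Abbreviated form used to link coreferent mentions, e.g. an initialism."""
    return "".join(w[0] for w in phrase.split() if w[0].isupper())

def find_entities(text):
    """Find maximal runs of capitalised tokens and classify each one."""
    pattern = r"(?:[A-Z][A-Za-z]*)(?:\s+[A-Z][A-Za-z]*)*"
    return [(m, classify_entity(m)) for m in re.findall(pattern, text)]
```

A run such as `find_entities("Dr Smith works at Acme Corp in London")` would label "Dr Smith" a person, "Acme Corp" an organisation and "London" a location, while `acronym("International Business Machines")` yields the abbreviated form for coreference matching.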
Anaphor resolution is a complicated problem in Natural Language Processing and has attracted the attention of many researchers. Most of the approaches developed so far have been traditional linguistic ones, with the exception of a few projects where statistical, machine learning or uncertainty reasoning methods have been proposed. The approaches offered - from the purely syntactic to the highly semantic and pragmatic - provide only a partial treatment of the problem. Given this situation, and with a view to achieving greater efficiency, it would be worthwhile to develop a framework which combines various methods to be used selectively depending on the situation.
The talk will outline the approaches to anaphor resolution developed by the speaker. First, he will present an integrated architecture which makes use of traditional linguistic methods (constraints and preferences) and which is supplemented by a Bayesian engine for center tracking to increase the accuracy of resolution; special attention will be paid to the new method for center tracking which he developed to this end. Secondly, an uncertainty reasoning approach will be discussed: the idea behind this underlying AI strategy is that (i) in Natural Language Understanding the program is likely to propose the antecedent of an anaphor on the basis of incomplete information and (ii) since the initial constraint and preference scores are subjective, they should be regarded as uncertain facts. Thirdly, the talk will focus on a two-engine approach which was developed with a view to improving performance: the first engine searches for the antecedent using the integrated approach, whereas the second engine performs uncertainty reasoning to rate the candidates for antecedents. Fourthly, a recently developed practical approach which is knowledge-independent and which needs neither parsing nor semantic knowledge will be outlined.
In the last part of his talk, R. Mitkov will explain why Machine Translation adds a further dimension to the problem of anaphor resolution. He will also report on the results from two projects which he initiated and which deal with anaphor resolution in English-to-Korean and English-to-German Machine Translation.
Attacking anaphora on all fronts is a worthwhile strategy: performance is enhanced when all available means are enlisted (i.e. the two-engine approach), or a trade-off is possible between more expensive, time-consuming approaches (the integrated, uncertainty-reasoning and two-engine approaches) and a more economical, but slightly less powerful approach (the practical "knowledge-independent" approach).
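As a generic illustration of the constraint-and-preference style of architecture discussed above (not the speaker's actual engines), a minimal resolver might filter candidates by hard agreement constraints and then rank the survivors by summed soft preference scores; the features and weights below are invented:

```python
def resolve_anaphor(anaphor, candidates):
    """
    Pick an antecedent: hard constraints filter, soft preferences rank.

    anaphor: dict with 'gender' and 'number'
    candidates: list of dicts with 'text', 'gender', 'number',
                'distance' (sentences back) and 'is_subject'
    """
    # Constraints: gender and number agreement rule candidates out entirely.
    viable = [c for c in candidates
              if c["gender"] == anaphor["gender"]
              and c["number"] == anaphor["number"]]
    if not viable:
        return None

    def preference_score(c):
        score = 0
        score += 2 if c["is_subject"] else 0   # subjecthood preference
        score += max(0, 2 - c["distance"])     # recency preference
        return score

    # Preferences only reorder; unlike constraints, they never eliminate.
    return max(viable, key=preference_score)
```

The design point the sketch makes is the one in the abstract: constraints are absolute filters, whereas preference scores are graded (and, on the uncertainty reasoning view, subjective) evidence that merely ranks the remaining candidates.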
In this talk I will discuss the GATE project currently underway at Sheffield. GATE aims to provide a software infrastructure in which heterogeneous natural language processing modules may be evaluated and refined independently, or may be integrated into larger application systems. Thus, GATE will support both researchers working on component technologies (e.g. parsing, tagging, morphological analysis) and those working on developing end-user applications (e.g. information extraction, text summarisation, document generation, machine translation, and second language instruction). GATE should promote reuse of component technology, permit specialisation and collaboration in large-scale projects, and allow for the comparison and evaluation of alternative technologies.
GATE comprises three principal components.
In the talk I discuss these three components and illustrate their use by describing how the Sheffield MUC-6 information extraction system has been embedded in GATE.
How similar are two corpora? A measure of corpus similarity would be very useful for lexicography and language engineering. Word frequency lists are cheap and easy to generate, so a measure based on them would be of use as a quick guide in many circumstances where a more extensive analysis of the two corpora was not viable; for example, to judge how a newly available corpus related to existing resources. The paper presents a measure, based on the Chi-squared statistic, for measuring both corpus similarity and corpus homogeneity. We show that corpus similarity can only be interpreted in the light of corpus homogeneity. Issues relating to the normalisation of similarity scores are discussed, and some results of using the measure are presented.
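One plausible reading of such a frequency-list measure, sketched under the assumption that the statistic is computed over the most frequent words of the joint corpus and normalised by the number of words compared (the paper's details may differ), is:

```python
from collections import Counter

def chi2_by_df(freq_a, freq_b, n_words=500):
    """
    Corpus similarity from two word-frequency lists: a chi-squared
    statistic over the most common words of the joint corpus,
    normalised by the number of words compared. Lower = more similar.
    """
    joint = Counter(freq_a) + Counter(freq_b)
    common = [w for w, _ in joint.most_common(n_words)]
    n_a, n_b = sum(freq_a.values()), sum(freq_b.values())
    chi2 = 0.0
    for w in common:
        o_a, o_b = freq_a.get(w, 0), freq_b.get(w, 0)
        # expected counts if both corpora drew from the same population
        e_a = (o_a + o_b) * n_a / (n_a + n_b)
        e_b = (o_a + o_b) * n_b / (n_a + n_b)
        chi2 += (o_a - e_a) ** 2 / e_a + (o_b - e_b) ** 2 / e_b
    return chi2 / len(common)
```

On this reading, a corpus compared with itself scores zero, and homogeneity can be estimated by applying the same statistic to two random halves of a single corpus, which is why similarity scores are only interpretable relative to homogeneity.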
SARA (SGML-Aware Retrieval Application) is a client/server software tool allowing a central database of texts with SGML mark-up to be queried by remote clients. The system was developed at Oxford University Computing Services, with funding from the British Library Research and Development Department (1993-4) and the British Academy. The original motivation for its development was the need to provide a robust low-cost search-engine for use with the 100 million word British National Corpus, and several features of the system design necessarily reflect this.
The SARA system has four key parts.
This presentation will introduce the SARA architecture and its intended application. An overview of the server and the CQL protocol will be given, together with a full description of the currently available MS-Windows client.
Automatic text categorisation is just what it sounds like: assigning texts to predefined categories by computer. Usually, the categories are subject areas. Some applications look promising - e.g. forwarding incoming stories to the appropriate departments in a news agency.
In this talk I will first outline some methods which are used for text categorisation, and point out various problems. I will then turn to a possibility which appears to have been neglected so far, but may be of interest to corpus linguists: the automatic identification of text types or genres. This leads on to the related problem of text segmentation, which is potentially of use for text understanding and information extraction systems.
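As a concrete, deliberately crude baseline for the task described above (not one of the methods surveyed in the talk), a categoriser can simply count keyword overlap; the categories and keyword sets below are invented for illustration:

```python
def categorise(text, category_keywords):
    """
    Assign a text to the predefined category whose keyword set
    overlaps most with the text's vocabulary. Real systems weight
    terms statistically rather than counting set overlap.
    """
    words = set(text.lower().split())
    return max(category_keywords,
               key=lambda cat: len(words & category_keywords[cat]))

CATEGORIES = {
    "finance": {"shares", "market", "bank", "profits"},
    "sport": {"match", "goal", "team", "season"},
}
```

A news-agency routing application would then call `categorise(story, CATEGORIES)` on each incoming story; the harder problems the talk raises (genre identification, segmentation) are precisely where such bag-of-words overlap breaks down.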
Given the following 20-character strings, "e pruebas bioquimica", "man immunodeficiency" and "faits se sont produi", it is hardly surprising that a person can identify the languages as Spanish, English and French, respectively. It is not even surprising that a person who speaks very little French or Spanish can do this correctly. Clearly, language understanding is not required for language identification. Furthermore, only the French string contains any closed-class words, and none of the strings contains any accented characters unique to its language (erroneously so in the Spanish example).
Given this simple fact, it is a natural step to wonder just how simple a computer program could be which is capable of performing a comparable job of language identification. It is also interesting to ask whether such a program might be derived from relatively general mathematical principles, or whether it could be written in such a way that it could learn the characteristics of the languages to be distinguished. Another important question is how many characters are needed to reliably identify the language of a string. Finally, it is important to know just how broad the applicability of such a program actually is.
In this talk, I will discuss fully automatic methods for making this distinction. These methods are robust, simple and fast.
Furthermore, they have much wider applicability than was originally thought. I will also describe how they were used in "The Case of the Missing Sequences".
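To suggest just how simple such a program can be, here is a generic character-trigram identifier, invented for illustration rather than taken from the talk; the tiny training texts below are assumptions, and real models would need far more data:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram profile, padded so word edges count."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def train(samples):
    """samples: dict mapping a language code to a training text."""
    return {lang: char_ngrams(text) for lang, text in samples.items()}

def identify(models, s):
    """Pick the language whose n-gram profile best matches the string."""
    probe = char_ngrams(s)
    def score(lang):
        model = models[lang]
        total = sum(model.values())
        # sum of relative frequencies of each probe n-gram under this model
        return sum(cnt * model.get(g, 0) / total for g, cnt in probe.items())
    return max(models, key=score)
```

Even with a sentence or two of training text per language, a profile like this separates the three example strings, which is the point of the abstract: no understanding, no closed-class word lists and no accented characters are required.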