Using Parallel Corpora for Language Analysis and Language Teaching


Michael Barlow


Parallel corpora have been used in translation studies, contrastive linguistics, and language teaching. This workshop introduces the use of parallel (translation) corpora for contrastive studies and for language teaching. A parallel concordancer can analyse parallel corpora to provide information on translation "equivalences" or nonequivalences (false friends) and can provide a much richer picture of the lexicogrammatical relationship holding between two languages than that presented in a bilingual dictionary. The software can (i) present the user with several instances of the search term, (ii) provide various kinds of frequency information, and (iii) make available a large context for each instance of the search term, thereby allowing a thorough analysis of usage.

The theory and practice of parallel concordancing will be discussed briefly, with the majority of the workshop devoted to hands-on practice, based on worksheets related to the operation of the software. We carry out simple text searches for words or phrases, sort the resulting concordance lines, and examine frequency information of various kinds, including the frequency of collocates of the search term. More complex searches are also possible, including context searches, searches based on regular expressions, and word/part-of-speech searches (assuming that the corpus is tagged for POS). The software includes utilities for highlighting potential translations, including an automatic component. Both the “Translation” and “Hot words” functions use frequency data to provide information about possible translations of the search word.
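The idea of using frequency data to suggest possible translations can be illustrated with a minimal sketch. It is not the algorithm used by the workshop software, which is not described here; the toy sentence pairs, the function names and the scoring formula are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sentence-aligned English-French pairs (toy data, not a real corpus).
ALIGNED = [
    ("the house is big", "la maison est grande"),
    ("i see the house", "je vois la maison"),
    ("the door is small", "la porte est petite"),
    ("a house by the sea", "une maison au bord de la mer"),
]


def parallel_concordance(pairs, term):
    """Return the aligned pairs whose source side contains the search term."""
    return [(src, tgt) for src, tgt in pairs if term in src.split()]


def candidate_translations(pairs, term, top=3):
    """Rank target-side words by how strongly they are tied to the hit segments.

    Illustrative score: (segments with the word among the hits) squared,
    divided by (segments with the word in the whole corpus). Words that
    occur everywhere, such as articles, are pushed down the list.
    """
    hits = parallel_concordance(pairs, term)
    in_hits = Counter(w for _, tgt in hits for w in set(tgt.split()))
    overall = Counter(w for _, tgt in pairs for w in set(tgt.split()))
    scored = {w: in_hits[w] ** 2 / overall[w] for w in in_hits}
    return sorted(scored, key=lambda w: (-scored[w], w))[:top]


hits = parallel_concordance(ALIGNED, "house")
best = candidate_translations(ALIGNED, "house", top=1)
```

On this toy data the search term "house" retrieves three aligned pairs, and "maison" outranks the article "la" because "la" also occurs in a segment that does not contain the search term.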

At the end of the workshop, the pros and cons of the use of translation corpora are examined.

























Exploring phraseological variation and aboutness with ConcGram

Chris Greaves, Elena Tognini-Bonelli and Martin Warren


This 2-hour workshop demonstrates ConcGram (Greaves, 2009), a corpus linguistics program specifically designed to identify phraseology automatically and, in particular, phraseological variation. The workshop demonstrates how ConcGram can make a significant contribution to a better understanding of phraseology, which is at the heart of Sinclair’s (1987) idiom principle.

            The workshop begins with a brief introduction to the theoretical rationale behind the program, a brief overview of the kinds of study which have been undertaken, and the three main categories of phraseology which we have identified so far. The original idea for the program came from a desire to be able to fully automatically retrieve the co-selections which comprise lexical items (Sinclair, 1996 and 1998) from a corpus. In other words, we wanted to be able to identify lexical items without relying on potentially misleading clues in single word frequency lists, or lists of n-grams (also termed ‘clusters’ or ‘bundles’), or some form of user-nominated search. We felt that single word frequencies are not the best indicators of frequent phraseologies in a corpus, and we were concerned that n-grams miss countless instances of phraseology that have constituency (AB, A*B, A**B, etc., where ‘*’ is an intervening word) and/or positional (AB, BA, B*A, etc.) variation. We decided that a program was needed which could identify all of the co-occurrences of two or more words irrespective of constituency and/or positional variation in order to more fully account for phraseological variation and provide the raw data for identifying lexical items (see Cheng, Greaves, Sinclair and Warren, 2009) and other forms of phraseology (for example, co-selections of grammatical words in ‘collocational frameworks’ [Renouf and Sinclair, 1991] such as ‘a ... of’ and ‘the ... of the’, and co-selections of organisation-oriented words such as ‘because ... so’). In addition, we wanted the program to be able to identify these co-occurrences fully automatically in order to support corpus-driven research (Tognini-Bonelli, 2001), by making it unnecessary for the user to input any prior search parameters.
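            The difference between n-grams and co-occurrences that allow constituency and positional variation can be shown with a small sketch. This is not ConcGram's own procedure, which is not specified here; the function name, the window size and the sample sentence are assumptions chosen for illustration.

```python
from collections import Counter


def two_word_concgrams(tokens, span=4):
    """Count unordered word pairs that co-occur within `span` following tokens.

    Each pair is stored in sorted order, so 'A B', 'A * B' and 'B * A'
    all add to the same count (constituency and positional variation).
    """
    counts = Counter()
    for i, origin in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + span]:
            if other != origin:
                counts[tuple(sorted((origin, other)))] += 1
    return counts


text = "hard work pays and work that is hard pays too"
counts = two_word_concgrams(text.split(), span=3)
print(counts[("hard", "work")])
```

In this sentence the pair "hard" and "work" is counted twice: once as the contiguous sequence "hard work" (AB) and once as "work that is hard" (B**A). A list of contiguous bigrams would record only the first of these.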

            After the brief introduction of theoretical issues, the workshop is devoted to a hands-on introduction to the main functions of ConcGram. This begins with how to create a corpus, which is then fully automatically concgrammed from two-word to five-word concgrams. Examples of concgram concordances are viewed and the issue of ‘associatedness’ versus ‘chance co-occurrence’ is discussed. The functions which enable the user to review concgram configurations are explained, as well as the function which enables the user to switch the centred word.

            Functions related to ‘list management’ are explained and the various ways of managing the extent of the concgram outputs are discussed and demonstrated.  Using ConcGram in ‘user-nominated’ mode is covered and the various options available to users to search for specific concgrams are detailed.  Workshop participants will be given ample opportunity to enter their own search items to try out the program for themselves.

            Just as it is possible to use single word frequencies to arrive at keywords in a specialised corpus relative to a general corpus, so ConcGram enables the user to determine the phraseological profile of a text or corpus (i.e. all of the phraseologies in the text or corpus) which can then be referred to a general corpus. A procedure is described whereby it is possible to arrive at the aboutness of a text or corpus based on its most frequent concgrams (Sinclair, 2006). Those word associations which are specific to a text or corpus are termed ‘aboutgrams’ (Sinclair, personal communication; Sinclair in Tognini-Bonelli (ed.), in press).

            The last 10-15 minutes of the workshop consist of a round-up of the main potential applications of ConcGram, and a chance to discuss issues and ask questions arising from the hands-on session.



The research described in this workshop was substantially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region (Project No. PolyU 5459/08H, B-Q11N).



Cheng, W., C. Greaves and M. Warren. (2006). “From n-gram to skipgram to concgram”. International Journal of Corpus Linguistics 11/4, 411-433.

Cheng, W., C. Greaves, J. McH. Sinclair and M. Warren. (In press, 2009). “Uncovering the extent of the phraseological tendency: towards a systematic analysis of concgrams”. Applied Linguistics.

Greaves, C. (2009). ConcGram 1.0: A phraseological Search Engine. Amsterdam: John Benjamins.

Renouf, A. J. and J. McH. Sinclair. (1991). “Collocational Frameworks in English”. In K. Aijmer and B. Altenberg (eds). English Corpus Linguistics, 128-43.

Sinclair, J. McH. (1987). “Collocation: A Progress Report”. In R. Steele and T. Threadgold (eds). Language Topics: Essays in Honour of Michael Halliday. Amsterdam: John Benjamins, 319-331.

Sinclair, J. McH. (1996). “The search for units of meaning”. Textus 9/1, 75-106.

Sinclair, J. McH. (1998). “The lexical item”. In E. Weigand (ed.). Contrastive Lexical Semantics. Amsterdam: John Benjamins, 1-24.

Sinclair, J. McH. (2006). Aboutness 2. (manuscript), Tuscan Word Centre, Italy.

Tognini-Bonelli, E. (2001). Corpus Linguistics at Work. Amsterdam: John Benjamins.

Tognini-Bonelli, E. (ed.) (forthcoming). John Sinclair on Essential Corpus Linguistics. London: Routledge.






















A multi-modal approach to the construction and analysis of spoken corpora


Dawn Knight, Svenja Adolphs, Ronald Carter and Paul Tennent


This workshop outlines a novel approach for the construction and analysis of multi-modal corpora. It draws upon developments made as part of two projects based at the University of Nottingham: DReSS and DReSS II.

            DReSS (Understanding New Forms of the Digital Record for e-Social Science) was a 3-year NCeSS funded project (ESRC) which sought to combine the knowledge of linguists and the expertise of computer scientists in the construction of the ‘multi-modal’ corpus software: the Digital Replay System (DRS). DRS presents ‘data’ in three different modes: as spoken (audio), video and textual records of real-life interactions, allowing alignment within a functional, searchable corpus setting (known as the Nottingham Multi-Modal Corpus, i.e. NMMC). The DRS environment therefore allows for the exploration of the lexical, prosodic and gestural features of conversation and how they interact in everyday speech.

            During the workshop we will provide a real-time demonstration of key features of DRS, and will discuss how this free software can provide an ideal platform for constructing bespoke multi-modal corpus datasets. In particular, we will guide participants through the processes of organising, coding and arranging datasets for re-use within this tool (using the DRS ‘track viewer’), and show how the data can be navigated and manipulated for CL-based analysis.

            We will further showcase the novel multi-modal concordancing facility that has been integrated within the DRS interface. In addition to providing the standard mono-modal concordance facilities that are commonly available with current corpora (i.e. text-based searches of data), this concordancer is capable of interrogating data constructed from textual transcriptions anchored to video and/or audio, and from coded annotations of specific features of gesture-in-talk. In other words, once presented with a list of occurrences (concordances) and their surrounding context, the analyst may jump directly to the temporal location of each occurrence within the video or audio clip to which the annotation pertains.
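            The link between a concordance line and its position in a recording can be sketched as follows. This is only an illustration of the general idea of a time-anchored transcript; it does not reproduce the DRS data model or interface, and the `Utterance` class, its fields and the sample transcript are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    start: float  # seconds from the start of the recording
    end: float
    speaker: str
    text: str


# Hypothetical time-aligned transcript (invented for illustration).
TRANSCRIPT = [
    Utterance(0.0, 2.4, "A", "so yeah I think that works"),
    Utterance(2.4, 3.1, "B", "yeah"),
    Utterance(3.1, 6.8, "A", "we could meet on Friday then"),
    Utterance(6.8, 8.0, "B", "yeah Friday is fine"),
]


def concordance(transcript, term):
    """Return (start_time, speaker, text) for each utterance containing the term."""
    term = term.lower()
    return [
        (u.start, u.speaker, u.text)
        for u in transcript
        if term in u.text.lower().split()
    ]


hits = concordance(TRANSCRIPT, "yeah")
for start, speaker, text in hits:
    # The start time is what a player would seek to in the audio or video.
    print(f"{start:6.1f}s  {speaker}: {text}")
```

Each hit carries the start time of the utterance, so a media player could seek to that offset in the aligned audio or video. Coded gesture annotations could be searched in the same way if they were stored with start and end times.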

            Following this demonstration, participants will have the opportunity to test-drive DRS for themselves (guidelines for use will be provided), and to ask any technical questions that might arise as a result of this. A range of practical, methodological and ethical challenges and considerations will also be discussed here. Participants are also encouraged to provide feedback (and any related questions) on the system and to fuel discussions on the potential long-term applications of such a tool in the future of CL research (encouraging participants to draw upon their own research experiences to contextualise their ideas/ feedback). 

            As an extension to the work on multi-modal representation of spoken discourse, we will also briefly discuss how DRS is currently being adapted to support the collection and collation of more heterogeneous datasets, including SMS messages, MMS messages, interaction in virtual environments, GPS data and face-to-face situated discourse, as part of the DReSS II project (preliminary datasets collected as part of this new project will be showcased, within the DRS environment). The focus of this work is on enabling a more detailed investigation of the interface between a variety of different communicative modes from an individual’s perspective, tracking a specific person’s (inter)actions over time (i.e. across an hour, day or even week).





The Sketch Engine

Adam Kilgarriff


The Sketch Engine is a leading corpus query tool, used for dictionary projects and language research around the world.  It is available on a website, already loaded with large corpora for twelve of the world’s major languages (with more to follow) so it allows people to use corpora over the web without the need to install software or find or build the corpora themselves.  Our slogan is “corpora for all”: we want to facilitate corpus use across the language industries and in language teaching and research, by making it very easy to access high-quality corpora with user-friendly and powerful tools. The website also includes a tool for uploading and installing the user’s own corpora into the Sketch Engine, and another, WebBootCaT, for building “instant corpora” from the web, and exploring them in the Sketch Engine and outside.

The Sketch Engine is the one corpus query tool which provides summaries of a word’s behaviour (‘word sketches’) which bring together grammatical and collocational analysis.

The tool is regularly extended in response to user requests and to developments in corpus linguistics, lexicography and computational linguistics.  Recent additions include:



The workshop will be 50% lecture and 50% hands-on.  In the lecture I will:



The practical session will give participants an opportunity to use the tool and to explore its functions, with a guide to hand.










Mapping meaning onto use using corpus pattern analysis


Patrick Hanks


It is a truism that context determines meaning, but what exactly is context, and what counts as a meaning?

Three aspects of context have been identified in the literature: context of utterance, domain, and collocation/colligation. In this workshop, we explore the third aspect and ask some fundamental questions about meaning in language. 

The workshop illustrates both the potential and the pitfalls of doing meaning analysis of content words by collocational analysis, and in particular by the semantic classification of collocates. In the first half of the workshop, we start with a corpus-driven analysis of a couple of fairly straightforward, uncontroversial words, to show how the procedure works.

We shall ask whether the same procedures are equally applicable to nouns, verbs, and adjectives.

Next, we shall look at a more complex case, in which hierarchical meaning analysis will be proposed. For example, take the verb 'admit'.  It is a transitive verb; the semantic class of the direct object determines the meaning. At the most general level, this meaning is either “say reluctantly” or “allow to enter”.  English speakers know intuitively that admitting a fault activates the first sense, while admitting a member activates the second. But what sort of priming is this intuition based on, and how can knowledge such as this be made machine-tractable or available for logical inferencing? We shall ask whether an ontology such as WordNet is of any help in answering such questions, or whether some other procedure is needed or, indeed, possible.

At a more delicate level, we shall ask whether admitting doing something necessarily implies that the thing done is bad (i.e. does it necessarily have a negative semantic prosody in this context?) and we shall ask the same question about the semantics of admitting that something has been done.

Then we shall turn to the “allow to enter” sense and ask whether admitting a new member to a club is the same meaning as admitting a patient to a hospital, or a convicted criminal to a prison, or a drunken yob to a night club.

Finally in the first half, we shall look at statistically valid methods of measuring the comparative importance of senses.

In the second half of the workshop, participants will be offered an opportunity to gain practical hands-on experience in using sophisticated on-line tools for corpus-based semantic analysis.














Applications of Corpus Linguistics for EFL classrooms

Renata Condi de Souza, Márcia Veirano Pinto, Marilisa Shimazumi, Jose Lopes Moreira Filho, Tony Berber Sardinha and Tania Shepherd

This series of presentations problematizes the place of Corpus Linguistics in the life of EFL practitioners and students. To this end, the presentations focus on three important EFL routines (preparation of materials, teacher education, and error correction) and their respective interfaces with Corpus Linguistics. The colloquium will last 2 hours, with 20 minutes for audience participation.

Helping third-world teachers into the 21st century: an interface between films, TV programmes, Internet videos and Corpus Linguistics in the EFL classroom.


The use of films, TV programmes and Internet-targeted videos in the EFL classroom is a widely described practice. However, the coupling of these media with corpus-driven or corpus-based activities within well-established TEFL methodological frameworks has rarely been attempted.

This paper presents research results involving 100 EFL teachers from public and private schools and language institutes attending a 32-hour teacher training course on how to teach English by exploring the language of films, TV programmes and Internet videos. This course was designed to familiarize EFL teachers with certain theoretical and practical issues involved in the use of corpora in the classroom as an aid for the teaching of spoken English. The data presented consist of corpus-based materials produced by the same EFL teachers, who prior to the course had little or no knowledge of Corpus Linguistics and the use of corpora in EFL classrooms.

The paper also reports on the strategies used for raising teachers’ awareness of language as a probabilistic system and for dealing with language and technological barriers.   Recent results indicate not only a change of paradigm in terms of the aim of materials produced due to the incorporation of Corpus Linguistics concepts and the use of interactive technologies, but also an improvement in teachers’ awareness of certain  features of spoken English. 


Corpus-based materials in the real world: Challenges from curriculum development in a large institution


This presentation focuses on issues related to curriculum development and the integration of corpus-based materials into contemporary ELT coursebooks in a large language teaching institution. It stems from my experience as teacher trainer and materials designer for one of the world’s largest language schools, with thousands of students and hundreds of teachers. This context sets important constraints for the use of corpus-based materials, given the sheer scale of the operation. Firstly, it is impossible to develop all materials in house: ready-made coursebooks printed by major publishers must be adopted, and although most recent coursebooks claim to be ‘corpus-informed’, they include few corpus-based materials. Consequently, corpus-based materials need to be added to these coursebooks, in the form of teacher’s notes. The second issue is that corpus-based materials demand detailed preparation, as they require a minimum of language research. With so many teachers’ notes needed, time is crucial. This restricts the number of corpus-based tasks incorporated into any given course. The third issue is that corpus-based research normally challenges long-held assumptions about language and may contradict the coursebooks adopted. When this occurs, decisions must be made about whether to keep the coursebook materials or to add corpus-based materials challenging the coursebook. The resolution of such conflicts is normally a compromise between the ideal of corpus-based materials and research and the realities of a large language teaching operation whose students and teachers may be unaware of corpus linguistics. These issues, and examples of materials designed to meet these challenges, will be illustrated with several contemporary coursebooks.

Preparing corpus-based teaching materials on the fly

This paper focuses on the Reading Class Builder (RCB), a Windows-based application that generates reading materials for teaching EFL. The tool was designed to meet the needs of teachers who want to use corpus-based classroom materials but are unfamiliar with corpus processing software and/or have little time for lesson planning. The tool’s Planning Wizzard guides users speedily through the generation of materials. Firstly, the user selects the material size, i.e., the number of items. Secondly, the user chooses a reading text (either from the tool’s built-in text bank or from another source) as teaching material. Thirdly, the tool analyzes the text in a range of ways and displays the frequency of words, the text’s keywords, a list of cognates, parts of speech, bundles, and the text’s lexical density. Finally, the user selects a set of words as focal points and a corpus (again either from the tool’s internal corpora, including the BNC, or a user-defined corpus). Then over 15 exercises are instantly prepared, including concordance-based data-driven activities, guessing, matching, fill in the blanks, language awareness, and critical reading items. These activities are created based on a user-modifiable template. Visual Basic 6 was the main programming language for developing the software. This presentation demonstrates how the program works and evaluates the tool’s functions in terms of precision and recall. A trial conducted with Brazilian teachers using the software in their classrooms is described.
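Two of the text measures listed above, word frequency and lexical density, can be computed in a few lines. The sketch below is not taken from the Reading Class Builder (which was written in Visual Basic 6); it is a Python illustration in which the tokenizer, the short function-word list and the sample sentence are assumptions. Lexical density is taken here as the proportion of tokens that are not function words, a common simplification when no part-of-speech tagging is available.

```python
import re
from collections import Counter

# Tiny illustrative stoplist; a real tool would use a fuller list or POS tags.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "on", "and", "to", "is", "are", "was", "it", "that",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Return a Counter mapping each word form to its number of occurrences."""
    return Counter(tokenize(text))


def lexical_density(text):
    """Return the share of tokens that are content (non-function) words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)


sample = "The cat sat on the mat and the dog sat in the garden."
freqs = word_frequencies(sample)
density = lexical_density(sample)
```

For the sample sentence, 6 of the 13 tokens are content words, so the density is 6/13, or roughly 0.46.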


Judging correctedness: a corpus-based investigation of apprentice errors in written English

The paper’s premise is that an investigation of correctedness in EFL university writing is justified as a potential means of enabling apprentice-writers to be less disadvantaged when writing for professional and/or academic purposes in the present-day globalised, English-writing world. Thus, the paper describes the development of an online system for the identification of error in written EFL, aimed at finding patterns of error in a training corpus of student writing and determining the probability of such patterns predicting error in other student text corpora. The error marking scheme is somewhat different from, and less granular than, existing, well-established manual error annotation systems, including Granger (2003) and Chuang and Nesi (2006), in terms of both its aims and procedures. The objective is to facilitate the work of EFL practitioners, rather than that of the SLA researcher. The errors in the training corpus were manually tagged in two stages, using an adaptation of Nicholls (1999). In the first stage, errors were marked which covered large stretches (two or three words) of text. In the second stage, errors were marked on a word-by-word basis, focusing on the most salient error-inducing item, termed ‘the locus of the error’. The data were fed into a pre-processor, which then extracted patterns of erroneous usage and calculated the probability of each pattern predicting an error. Based on this information, an online application was developed and tested, which takes either a single learner composition or an entire learner corpus and outputs a list of each word followed by the probability of its erroneous use. The present version of the tool is demonstrated, including an evaluation of its precision and recall, and the problems encountered during the error marking phase are discussed.
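The step from a manually tagged training corpus to per-word error probabilities can be illustrated with a minimal sketch. It is not the authors' system: it treats a single word form as the simplest possible ‘pattern’, and the training data, function names and the default probability for unseen words are all invented for illustration.

```python
from collections import Counter

# Hypothetical training data: (word, was it the locus of a marked error?).
TRAINING = [
    ("informations", True), ("informations", True),
    ("advices", True), ("advice", False),
    ("discuss", False), ("discuss", False),
    ("about", True), ("about", False), ("about", False), ("about", False),
]


def error_probabilities(tagged_tokens):
    """Estimate P(error | word) as errors / occurrences in the training data."""
    totals, errors = Counter(), Counter()
    for word, is_error in tagged_tokens:
        totals[word] += 1
        if is_error:
            errors[word] += 1
    return {word: errors[word] / totals[word] for word in totals}


def score_text(text, probs, unseen=0.0):
    """List each word of a learner text with its estimated error probability."""
    return [(w, probs.get(w, unseen)) for w in text.lower().split()]


probs = error_probabilities(TRAINING)
scored = score_text("We discuss about informations", probs)
```

A fuller treatment would condition on context (for example, "about" directly after "discuss") rather than on the word alone, which is presumably closer to what ‘patterns’ means in the abstract.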


Corpus Linguistics and Literature


Bettina Fischer-Starcke and Martin Wynne


Corpus stylistics is the analysis of literary texts by using corpus linguistic techniques. It combines the analytic techniques of corpus linguistics with the goals of stylistics, extracting meanings from literary texts by using computational means of analysis. In recent years, corpus stylistics has become an accepted part of corpus linguistics. Studies have been published by, for instance, Louw (1993), Semino and Short (2004), Stubbs (2005), Starcke (2006), Wynne (2006), Mahlberg (2007a, b), O’Halloran (2007a, b) and Fischer-Starcke (forthcoming). There have also been a number of workshops at international linguistic conferences, for instance at Corpus Linguistics 2005, Corpus Linguistics 2007, the PALA conferences in 2006 and 2007, ISLE 1, and the workshop "Corpus Approaches to the Language of Literature" in Oxford in 2006. This colloquium is intended to continue the process of the presentation and discussion of approaches to corpus stylistic analysis. It aims to further the discussion of what is possible in corpus stylistic research, to discuss current approaches, and to provide a forum for practitioners of the field to meet and to discuss future directions.


Until now it has often been said (Stubbs 2005, Wynne 2006) that it is still surprisingly uncommon for the methods of corpus linguistics to be applied to literary style, even though many theories and techniques from various fields of linguistics are regularly used to analyse the language of literature, usually under the umbrella term 'stylistics'. This colloquium will present and consider the latest methods of analysis in corpus linguistics that are helping to shed light on various aspects of the language of literature, and consider whether corpus stylistics has now come of age.


The presentations below will attempt to demonstrate that it is becoming increasingly possible to test empirically claims about the language of literature, to search for and provide evidence from texts, and to establish the norms of literary and non-literary style.


Monika Bednarek (University of Technology, Sydney): “Dialogue and characterisation in scripted television discourse”


This paper uses corpus stylistics to analyse constructed spoken discourse in a contemporary American television series, and to compare it with ‘naturally occurring’ conversation. The results are thus linked to the genre of the television series, its ideology/cultural stereotyping and the construal of characters.


Doug Biber (University of Arizona): “Using corpus-based analysis to study fictional style: A multi-dimensional analysis of variation among and within novels”


Multi-dimensional analysis has been widely used to identify the linguistic parameters of variation among spoken and written registers. The present study applies this analytical framework to analyze the patterns of linguistic variation in a large corpus (over 5 million words) of fictional prose in English.


Bettina Fischer-Starcke (Vienna University of Economics and Business): “The Phraseology of Jane Austen’s Pride and Prejudice”


Pride and Prejudice has been extensively discussed in literary critical studies. Nevertheless, using corpus linguistic techniques in its analysis yields new and more detailed insight into the text’s meanings than have been discussed in the literary critical literature. This demonstrates the large potential of corpus stylistic studies for the analysis of literary texts.


Jane Johnson (University of Bologna): “A corpus-assisted analysis of salient stylistic features in nineteenth-century Italian novels and their English translations”


An analysis of multi-word expressions in a corpus of novels by the Nobel prize winner for literature, Grazia Deledda, addressing issues such as how far multiword items can provide an indication of authorial style, how translations of the novels compare with the originals as regards the style of each novel, and how a corpus-assisted analysis of literary texts may be exploited to provide an instrument for translators and trainee translators.


Andrew Kehoe & Matt Gee (Birmingham City University): “A wiki tool for the collaborative study of literary texts”


This paper introduces a new ‘wiki’ tool for the collaborative analysis and annotation of literary texts. Building upon the WebCorp Linguist’s Search Engine, the tool supports the collaborative close reading and analysis of texts by allowing researchers, teachers and students to attach comments to individual words or phrases within these texts or to whole texts. These comments take the form of analyses or interpretations, and can generate intra- and inter-textual links, thus relating cutting-edge technological innovations to a time-honoured critical and interpretative tradition.



Lesley Moss (University College, London): “Henry James Beyond the Numbers: Applying corpus analysis to the text”


Investigating the syntactic differences between Henry James’s early novel, Washington Square, and a late novel, The Golden Bowl, with the aim of linking syntactic complexity and parenthesis to particular literary functions.



Kieran O'Halloran (Open University): “The discourse of reading groups: argument and collaboration”


This paper will make a contribution to understanding the discourse of reading groups, shining light on the kind of argumentation used in evaluation and interpretation of novels read in a variety of groups, examining patterns of co-occurrence between lexico-grammatical collocation and discoursal function in argumentation. Such patterns of co-occurrence illuminate relationships between time, space and experience for reading group members.










Fischer-Starcke, B. (forthcoming). Corpus Linguistics and Literature. Corpus Stylistic Analyses of Literary Works by Jane Austen and her Contemporaries. London: Continuum.

Louw, B. (1993). “Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies”. In M. Baker, G. Francis and E. Tognini-Bonelli (eds) Text and Technology. In Honour of John Sinclair. Philadelphia, Amsterdam: John Benjamins, 157–176.

Mahlberg, M. (2007a). “Corpus stylistics: bridging the gap between linguistic and literary studies”. In M. Hoey, M. Mahlberg, M. Stubbs and W. Teubert (eds) Text, Discourse and Corpora. Theory and Analysis. London: Continuum, 219–246.

Mahlberg, M. (2007b). “Clusters, key clusters and local textual functions in Dickens”. Corpora, 2, (1), 1–31.

O’Halloran, K.A. (2007a). “The subconscious in James Joyce’s “Eveline”: a corpus stylistic analysis which chews on the “Fish hook””. Language and Literature, 16, (3), 227–244.

O’Halloran, K.A. (2007b). “Corpus-assisted literary evaluation”. Corpora, 2, (1), 33–63.

Semino, E. and Short, M. (2004). Corpus Stylistics: Speech, Writing and Thought Presentation in a Corpus of English Narratives. London: Routledge.

Starcke, B. (2006). “The phraseology of Jane Austen's Persuasion: Phraseological units as carriers of meaning”. ICAME Journal, 30, 87–104.

Stubbs, M. (2005). “Conrad in the computer: examples of quantitative stylistic methods”. Language and Literature, 14, (1), 5–24.

Wynne, M. (2006). “Stylistics: Corpus Approaches”. In K. Brown (ed.) Encyclopedia of Language & Linguistics. Second Edition, 12. Oxford: Elsevier, 223–226.



Beyond corpus construction: Exploitation and maintenance of parallel corpora


Oliver Čulo, Silvia Hansen-Schirra and Stella Neumann


Parallel corpora, i.e. collections of originals and their translations, can be used in various ways for the benefit of translation studies, machine translation, linguistics, computational linguistics or simply the human translator. In computational linguistics, parallel data collections serve as training material for machine translation, a data basis for multilingual grammar induction, automatic lexicography etc. Translation scholars use parallel corpora in order to investigate specific properties of translations. For professional translators, parallel corpora serve as reference works, which enable quick and interactive access and information processing.


With issues of the creation of large data collections including multiple annotation and alignment largely solved, exploitation of these collections remains a bottleneck. In order to effectively use annotated and aligned parallel corpora, there are certain considerations to be made:

·         Query: We can expect basic computer literacy from academics nowadays. However, the gap between writing query or evaluation scripts and program usability is immense. One way to address this is by building web query interfaces. Yet in general, what are the claims and possibilities for creating interfaces that address a broader public of researchers using multiply annotated and aligned corpora? An additional ongoing question is the most efficient storage form: are database formats superior to other formats?

·         Information extraction: The quality of the information extracted by a query heavily depends on the quality of the annotation of the underlying corpus, i.e. on precision and recall of annotation and alignment. Furthermore, the question arises how we can ensure high precision and recall of queries (while possibly keeping query construction efficient). What are the strategies to compose queries which produce high-quality results? How can the query software contribute to this goal?

·         Corpus quality: Several criteria for corpus quality have been developed (e.g. in the context of standardization initiatives). Quality can be influenced before compilation, by ensuring the balance of the corpus (in terms of register and sample size), its representativeness etc. Also, inter-annotator agreement and - to a lesser extent - intra-annotator agreement are an issue. But, how can we make corpora thus created fit for automatic exploitation? This involves issues such as data format validity throughout the corpus, robust (if not 100% correct) processing with corpus tools/APIs and the like. What are the criteria and how can these be addressed?

·         Maintenance: Beyond the validity of the data format, maintenance of consistent data collections is a more complex task, particularly if the data collection is continually expanded. A change of the annotation scheme entails adjustments in the existing annotation. Questions to this end include whether automatic adjustment is possible and how it can be achieved. Maintenance may also involve compatibility with and adaptations to new data formats. How can we ensure sustainability of the data formats?


The planned colloquium is intended as a platform for interdisciplinary research questions on corpus exploitation.


The abstracts of contributors to the colloquium are organized as follows. We start with two topics that concentrate on corpus formats – the first being about the balance between academia and economy, the other with a view on several distinct languages – and how they influence query and maintenance. The third presentation introduces a tool for parallel treebanks of interest particularly with a view to querying. Presentation four is an example of enriching and extending an existing corpus resource thus addressing aspects of corpus sustainability and maintenance. The last two presentations focus on extracting information for more theoretical topics, querying parallel corpora for studies on translation effects.


Each presentation is scheduled for 35 minutes, with at least 5 minutes for questions and discussion. In order to have time for a general discussion of the overall picture of current developments, scheduled for about 30 minutes at the end of the colloquium, we will not solicit further contributions. The discussion will be guided by the questions raised above.


Bettina Schrader, René-Martin Tudyka and Thomas Ossowski: “Colliding worlds: Academic and industrial solutions in parallel corpus formats”


Within academia, there is a lively discussion of how corpus data should be stored and accessed. The same questions arise within industrial software development. However, the answers provided by academia and industry seem to be opposite extremes.


Academia focuses on annotation schemes and equivalent storage formats where a storage format should i) allow encoding linguistic structures as precisely as possible, ii) be extendible to additional annotation levels, and iii) keep the annotations separable from the original data. As XML meets all of these requirements, it has become an important storage format, at the cost of relatively slow, cumbersome data access.


The industrial requirements for storing language data, however, are i) quick data access, ii) user-friendliness, and iii) a data format that is adequate only for the purpose at hand. Accordingly, corpus data is often stored in data bases due to their powerful indexing and data organization functionality. Linguistic annotations are kept to a minimum, and if possible, the data base structure is not changed at all.


Thus, finding a data storage format that satisfies the goals of corpus linguists and software developers alike seems difficult. While a corpus linguist may put up with slow query processing, as long as the resulting data points are well-annotated and interesting, a translator can do without linguistic annotations, but needs software that instantaneously presents likely translation suggestions.


We are going to discuss some of these academic and industrial requirements on corpus storage and access by focusing on one specific scenario: the development of a translation memory, i.e. a corpus query and storage system for parallel corpus data. Furthermore, we will discuss how academic and industrial requirements may be brought closer together.


Marko Tadić, Božo Bekavac and Željko Agić: “Building a Southeast European Parallel Corpus: an exercise in ten languages”


This contribution will give an insight into work in progress that has been going on for almost a year. Besides the task of enlarging the Croatian National Corpus, a parallel corpus covering nine Southeast European languages and English is being collected. The languages involved are: Albanian, Bosnian, Bulgarian, Croatian, Greek, Macedonian, Romanian, Serbian, Turkish and English. The texts are collected from the Southeast Times, a newspaper that is published in these ten languages on a daily basis. Automatic procedures for collecting, storing and maintaining the texts will be demonstrated. So far around a million words per language have been collected, and the conversion into XCES format and the alignment are being done using custom-made software that uses user-defined scripts in the conversion process. Preliminary alignment procedures, which are completed using CORAL, a human-controlled program for aligning parallel corpora, will also be shown. Future processing such as PoS/MSD tagging and lemmatisation, which would broaden the possible uses of this corpus, will also be mentioned. We have developed such tools for Croatian and intend to use them; this further processing for the other languages largely depends on the existence and availability of such tools.


The collection of this parallel corpus represents an exemplary exercise that covers typical problems in building parallel corpora for a larger number of languages, such as different scripts and encodings, different sentence segmentation rules and alignment procedures covering pairs of genetically and typologically distinct languages.


Martin Volk, Torsten Marek and Yvonne Samuelsson: “Building and Querying Parallel Treebanks with the Stockholm TreeAligner”


The Stockholm TreeAligner has been developed for aligning syntax trees of a parallel corpus. We have used it to pairwise align the English, German and Swedish treebanks in SMULTRON, the Stockholm Multilingual Treebank. In this treebank both words (terminal symbols) and phrases (non-terminal nodes) are aligned across corresponding sentences. The primary design goal for the TreeAligner was to create an easy-to-use environment to manually annotate or correct the alignments in a parallel treebank. The input to the TreeAligner typically consists of two treebanks that are built beforehand. The required input format is TIGER-XML. If the two treebanks have been automatically aligned in advance, then the alignment information can be imported into the TreeAligner.


Annotating the alignments in the TreeAligner is easy: you simply drag a line with your mouse between the words or nodes that you wish to annotate as having an alignment relation. In this way, it is possible to create one-to-one as well as one-to-many relations. We allow different types of alignment annotations; in the current version we distinguish “good” alignments (i.e. exact translation correspondence) and “fuzzy” alignments (i.e. rough translation correspondence). Once a parallel treebank is aligned, it can be browsed (with helpful features like highlighting of alignments on mouse-over) and searched.


The TreeAligner comes with a powerful query language that was modeled after the TIGERSearch query language and follows its syntax. But unlike TIGERSearch which only works on monolingual treebanks, the TreeAligner also permits queries over parallel treebanks and their alignments.

Moreover we have recently added universal quantification to the query language which overcomes some of the limitations in TIGERSearch.


We envision that future corpus annotation will add more annotation layers. Therefore we are currently extending the query language towards searches over parallel treebanks with frame-semantic annotation.


The TreeAligner is a powerful tool for creating, browsing and searching parallel treebanks. It is useful in several scenarios when working with parallel annotated corpora. It is freely available in source code (Python) under the GNU General Public License.


Špela Vintar and Darja Fišer: “Enriching Slovene WordNet with Multi-word Terms”


The paper describes an innovative approach to expanding the domain coverage of wordnet by exploiting a domain-specific parallel corpus and harvesting the terminology using term extraction methods.


Over time, WordNet has become one of the most valuable resources for a wide range of NLP applications, which initiated the development of wordnets for many other languages as well. One such enterprise is the building of the Slovene WordNet. The first version of the Slovene wordnet was created from the Serbian wordnet, which was translated into Slovene with a Serbian-Slovene dictionary. The result was manually edited, and the Slovene wordnet thus obtained contained 4,688 synsets, all from Base Concept Sets 1 and 2. Using multilingual word alignments from the JRC-Acquis corpus, the Slovene Wordnet was further enriched and refined. However, the approach was limited to single-word literals, which is a major shortcoming, considering the fact that Princeton WordNet contains about 67,000 multi-word expressions. This is why we further developed our corpus-based approach to also extract multi-word term candidates for inclusion in the Slovene Wordnet. Initial experiments were performed for the fishing domain using a subcorpus from JRC-Acquis.


In the experiment described here we are using an English-Slovene parallel and comparable corpus of texts from the financial domain. We first identify the core terms of the domain in English using the Princeton Wordnet, and then we translate them into Slovene using a bilingual lexicon produced from the parallel corpus. In the next step we extract multi-word terms from the Slovene part of the corpus using a hybrid statistical / pattern-based approach, and finally match the term candidates to existing Wordnet synsets.


Sebastian Pado: “Parallel Corpora with Semantic Roles: A Resource for Translation Studies?”


Semantic roles, and the automatic analysis of text in terms of semantic roles, are a topic of intensive research in computational linguistics (see e.g. the 2008 special issue of the journal "Computational Linguistics"). Since semantic roles reflect predicate-argument structure, it seems an attractive idea to use semantic role annotations of parallel texts for corpus-based studies of translational shifts.


This talk discusses the creation of such a corpus, specifically a 1000-sentence trilingual parallel corpus (English-German-French) with independent semantic role annotations for all three languages in the FrameNet paradigm. The corpus was developed in the context of a computational study on the automatic cross-lingual creation of semantic role resources.


A central observation on this corpus is that not all cross-lingual mismatches in semantic role annotation are truly indicative of translational shifts. We discuss the causes and implications of this observation:


- From the linguistic point of view, the formulation of theories of semantic roles that are completely independent of individual languages is an arguably very difficult goal. We sketch the design decisions behind the English and German FrameNet annotation schemes and their consequences on the emergence of cross-lingual semantic role mismatches.


- We attempt to relate the mismatches to a recent classification scheme for translational shifts. However, the complex nature of the shifts we encounter makes this difficult. We present a preliminary scheme that classifies semantic role mismatches in terms of properties of the lexical relations involved.


Silvia Hansen-Schirra, Oliver Čulo and Stella Neumann: “Crossing lines and empty links as indicators for translation phenomena: Querying the CroCo corpus”


In an idealised translation, all translation units should match corresponding units in the source text, both in semantics and in grammatical analysis. This is of course not realistic, not only because languages simply diverge but also because translators make individual decisions. Very broadly speaking, originals and their translations therefore diverge in two respects: units in the target text may not have matches in the source text and vice versa, thus exhibiting empty links; or a unit that is part of a larger unit is aligned with a unit outside the larger unit, thus, metaphorically speaking, crossing lines. These two concepts are related on the one hand to concepts used in formal syntax and semantics (like null elements and discontinuous constituency types in LFG or HPSG). On the other hand they are in the tradition of well-known concepts in translation studies such as 1:0 correspondences, translation shifts etc.


The CroCo corpus comprises English and German original texts from eight different registers together with their German and English translations, respectively. The corpus is annotated on several linguistic layers, e.g. with PoS information, chunk types and grammatical functions. More interestingly, the corpus is aligned on three levels: word level, clause level and sentence level. This alignment allows us to query the corpus for both crossing lines and empty links.


This presentation will discuss query methods as well as quality issues concerning the query results and will present findings showing how crossing lines and empty links can be exploited to detect translation phenomena.
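The two query targets can be illustrated with a minimal sketch in Python. It is not the CroCo query machinery: the function names, the identifiers and the toy alignment below are assumptions made purely for illustration. Word alignment is represented as a list of (source, target) pairs, and a crossing line is taken to be a word link whose two ends lie in clauses that are not aligned with each other.

```python
def empty_links(source_ids, target_ids, links):
    """Return the source and target units that take part in no alignment link."""
    linked_src = {s for s, _ in links}
    linked_tgt = {t for _, t in links}
    unlinked_src = [s for s in source_ids if s not in linked_src]
    unlinked_tgt = [t for t in target_ids if t not in linked_tgt]
    return unlinked_src, unlinked_tgt


def crossing_lines(word_links, src_clause_of, tgt_clause_of, clause_links):
    """Return word links whose endpoints lie in clauses not aligned with each other."""
    aligned_clauses = set(clause_links)
    return [
        (s, t)
        for s, t in word_links
        if (src_clause_of[s], tgt_clause_of[t]) not in aligned_clauses
    ]


# Hypothetical toy data: four words and two clauses on each side.
src_words = ["s1", "s2", "s3", "s4"]
tgt_words = ["t1", "t2", "t3", "t4"]
word_links = [("s1", "t1"), ("s2", "t3"), ("s3", "t4")]
src_clause_of = {"s1": "C1", "s2": "C1", "s3": "C2", "s4": "C2"}
tgt_clause_of = {"t1": "D1", "t2": "D1", "t3": "D2", "t4": "D2"}
clause_links = [("C1", "D1"), ("C2", "D2")]

unlinked_source, unlinked_target = empty_links(src_words, tgt_words, word_links)
crossings = crossing_lines(word_links, src_clause_of, tgt_clause_of, clause_links)
```

In this toy case, s4 and t2 are empty links, and the link between s2 and t3 crosses a clause boundary because clause C1 is aligned with D1 rather than D2.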




Working with the Spanish Corpus PUCV-2006: Academic and professional genre identification and variations


Giovanni Parodi, René Venegas, Romualdo Ibáñez and Dr. Cristian González


Each presentation will be given 25 minutes and, at the end, 20 minutes will be given to discussion (120 minutes in total).


The identification and description of written disciplinary-oriented discourse genres and their lexicogrammatical variations across academic and professional settings is a relatively recent concern in linguistics. Although efforts to find more or less exact taxonomies from a rather textual point of view were recorded earlier, the consensual acceptance of a genre theory built on corpus-based approaches and principles of variation is only now being developed. Thus, empirically-based deep descriptions of academic and professional genres, specialized in diverse disciplines, become a relevant challenge. Our objectives focus on a multidimensional description of the written genres that are used in four university undergraduate programmes and their corresponding professional workplaces by collecting and studying the written texts that are read in these contexts and through which specialised knowledge provides access to disciplinary interactions. Part of the objective of this colloquium is to describe in depth a corpus-based research project on academic and professional Spanish genres underway at Pontificia Universidad Católica de Valparaíso (Chile). We examine assigned student readings in four scientific disciplines at university settings (Psychology, Social Work, Construction Engineering, and Industrial Chemistry) and the written texts that form the core of daily written communication in the professional workplaces that correspond to each academic degree programme. The computational tool employed to tag, upload and interrogate the corpus is the El Grial interface. The findings show that twenty-eight genres are detected in academic and professional settings, and that distinctive linguistic features emerge as prototypical of some disciplines and genres. The genre classification reveals interesting differences between Basic Sciences and Engineering and Social Sciences and Humanities, and interesting lexicogrammatical patterns are identified across genres and disciplines.


Four presentations constitute this colloquium, in which four focuses are presented as complementary parts of the description and analysis of the largest available on-line tagged specialized corpus of Spanish: PUCV-2006 (57 million words). The specific research focuses involve the identification and classification of the discourse genres emerging from the 491 texts that make up the corpus; a qualitative and quantitative analysis of the two most important genres identified in the corpus in terms of the number of texts and number of words across four disciplines (University Textbooks and Disciplinary Text); and a multidimensional description and analysis of all academic genres, using Biber’s (1988) classic factor analysis and also including a comparison with four other corpora from different registers (i.e. Latin-American Literature Corpus, Oral Didactic Corpus, Scientific Research Articles Corpus and Public Policies Corpus).




Dr. Giovanni Parodi (Pontificia Universidad Católica de Valparaíso, Chile): “Analyzing the rhetorical organization of the University Textbook genre in four disciplines: Abstraction and concreteness dimensions”


Historically, the study of the rhetorical organization of discourse genres has been restricted to only a few genres and has especially concentrated on the research article (RA) (Swales, 1981). Starting from a large specialized tagged corpus of 126 textbooks written in Spanish and collected from four disciplines (Social Work, Psychology, Construction Engineering and Industrial Chemistry), we describe the rhetorical organization of the University Textbook genre, based on part of the PUCV-2006 Corpus (57 million words). Our aim is to identify and describe the rhetorical moves of the Textbook genre and to describe the communicative purposes of each of the moves and steps identified, based on a corpus-based empirical approach. We also studied the variation across the four disciplines involved. A new macro-level of analysis is introduced and justified, which becomes necessary for a deeper description of an extensive textual unit such as this genre normally involves; we have named it the Macromove. The qualitative and quantitative analysis of the macromoves and moves helps us to identify that access to disciplinary knowledge is constructed through a varying repertoire of rhetorical macromoves, depending on how concrete or abstract the scientific domain is. The findings show that Social Work and Psychology textbooks employ more resources based on an abstract dimension, while Construction Engineering and Industrial Chemistry textbooks display a more concrete organization of resources.

Dr. René Venegas (Pontificia Universidad Católica de Valparaíso): “A multidimensional analysis of academic genres: Description and comparison of the PUCV-2006 Corpus”


Recent studies on Spanish that aim to describe specialized and multi-register corpora have been carried out using diverse corpus methodologies, such as multidimensional analysis. In this report, we present two studies based on the written academic PUCV-2006 Corpus of Spanish (491 texts and 58,594,630 words). Both studies employ the five dimensions (i.e. Contextual and Interactive Focus, Narrative Focus, Commitment Focus, Modalizing Focus, and Informational Focus) identified by Parodi (2005). Each of these dimensions emerged out of the functional interpretation of co-occurring lexicogrammatical features identified through a multidimensional and multiregister analysis. The main assumption underlying these studies is that the dimensions determined by a previous multidimensional analysis can be used to characterize a new corpus of university genres. In the first study, we calculate linguistic density across the five dimensions that provide a lexicogrammatical description of the nine academic genres of which the corpus is composed. In the second one, we compare the PUCV-2006 Corpus with four corpora from different registers (i.e. Latin-American Literature Corpus, Oral Didactic Corpus, Scientific Research Articles Corpus and Public Policies Corpus). The findings confirm the specialized nature of the genres in the PUCV-2006 Corpus, in which both a strong lexicogrammatical compactness of meanings and an emphasis on regulating the degree to which certainty is manifested are strongly expressed.






Dr. Romualdo Ibáñez (Pontificia Universidad Católica de Valparaíso): “Disciplinary Text genre and the discursive negotiation of knowledge”


During the last decades, the study of language use has been carried out from several different perspectives, including research in pragmatics, systemic functional linguistics, register studies, and genre analysis, among others. In the latter line of investigation, several researchers have focused on describing Academic Discourse in order to identify and characterize particular genres which are thought to be representative of a discipline. The aim of this investigation is to describe the Disciplinary Text, an academic genre that has emerged as one of the most frequent means of written communication in the PUCV-2006 Academic Corpus, in three different disciplinary domains (Social Work, Psychology, and Construction Engineering). To do so, we carried out an ascendant/descendent analysis of the rhetorical organization of 270 texts. The results revealed not only the persuasive communicative purpose of the genre but also its particular rhetorical organization, which is realized by three rhetorical macromoves, eight moves and eighteen steps. In addition, quantitative results showed relevant variation between Social Sciences and Humanities and Basic Sciences and Engineering in terms of the way the rhetorical steps and moves realize the communicative purpose of the genre. These results allow us to say that the genre under study transcends the limits of a discipline, but, at the same time, some of its central characteristics are affected by disciplinary variation. One of the main implications of this analysis is the empirical evidence that the ways of negotiating knowledge through discourse vary depending on the discipline and area of study.


Dr. Cristian González (Pontificia Universidad Católica de Valparaíso): “Academic and professional genres: identification and characterization in the PUCV-2006 Corpus of Spanish”


The identification and classification of discourse genres have been among the permanent concerns of linguistic studies. In particular, since genres as complex objects have become the focus of analysis, multidimensional approaches have been developed in order to capture their dynamic nature. In this presentation, we identify, define, classify, and exemplify the discourse genres that have emerged from the 491 texts forming the PUCV-2006 Academic and Professional Corpus of Spanish. In order to accomplish these objectives, we applied a methodology that combines deductive and inductive procedures. Five criteria were selected and employed to identify all the texts: Communicative Macropurpose, Relations between the Participants, Mode of Discourse Organization, Modality, and Ideal Context of Circulation. After analysing all the texts of the corpus, twenty-eight genres were identified and grouped according to the criteria mentioned above. Interesting clusterings emerged, reflecting cognitive, social, functional/communicative and linguistic variations. The use of a group of different interacting variables in the identification of genres showed its potential and proved to be a powerful methodology.





Using parallel newspaper corpora for Modern Diachronic Corpus-Assisted Discourse Studies (MD-CADS)


Alan Scott Partington, Alison Duguid, Anna Marchi, Charlotte Taylor and Caroline Clark


This colloquium presents a collection of papers within the nascent discipline of Modern Diachronic Corpus-Assisted Discourse Studies (MD-CADS). This discipline is characterised by the novelty both of its methodology and of the topics it is consequently in a position to treat. It employs relatively large corpora of a parallel structure and content from different moments of contemporary time; in the case of this research group, the SiBol group, these are two large corpora of newspaper discourse (see corpus description below). Although a number of contemporary comparative diachronic studies have been performed (for example the body of work by Mair and colleagues on the LOB and FLOB corpora), these have used far smaller corpora and have of necessity therefore concentrated on relatively frequently occurring, mostly grammatical, phenomena. This research group, from Siena and Bologna universities, by using much larger corpora, is able to study not only grammatical developments over time but also variations in lexical and phrasal preferences. The larger corpora enable us to observe changes in newspaper prose style over the period (which reflect shifting relationships between newspapers and their readerships as well as perhaps overall changes in language) and also to perform various sorts of content analyses, that is, to examine new - and older - attitudes to social, cultural and political phenomena, as construed and projected by the UK broadsheets. Given that newspapers can be accessed individually, the authors can also, when relevant, compare the ways such issues are discussed by papers of different political stances. This collection is evenly divided between the two sorts of research: studies concerned with tracking changes in (newspaper) language and socio-cultural case studies.


MD-CADS derives from the field of Corpus-Assisted Discourse Studies. Both combine a quantitative approach, that is, statistical overviews of large collections of the discourse type(s) under study compiled in a corpus using corpus analysis tools such as frequency, keyword and key-cluster lists, with the more qualitative approach typical of discourse analysis, that is, the close, detailed analysis of particular stretches of discourse with the aim of better understanding the specific processes at play in the discourse type(s) under scrutiny.

The corpus approach allows one to improve the reliability and the generalizability of the analysis by grounding the results on extended empirical evidence; at the same time, in-depth analysis is supported by the close reading of concordance lines.

The use of an item found to be of interest in a particular text can, by concordancing, be compared to its more general uses in (parts of) the corpus. The overall aim is to access “non-obvious” meanings, and the perspective of this group is that adopting a mixed-methodology, or more precisely a methodology that combines complementary approaches,  will allow for a deeper and sounder understanding of discourse.
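Concordancing of this kind can be sketched in a few lines of Python. This is only an illustration, not the behaviour of XAIRA or WordSmith: the function name `kwic`, the context width and the sample sentence are assumptions introduced here. A trailing asterisk is treated as a wildcard over word characters, in the manner of search items such as moral* or ethic*.

```python
import re


def kwic(text, pattern, width=30):
    """Return (left context, match, right context) triples for a wildcard item."""
    # Escape the search item, then turn each asterisk into "any word characters".
    body = re.escape(pattern).replace(r"\*", r"\w*")
    regex = re.compile(r"\b" + body + r"\b", re.IGNORECASE)
    lines = []
    for match in regex.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append((left, match.group(0), right))
    return lines


# Made-up sample text, for illustration only.
sample = ("The moral case is clear. Morality and ethics differ; "
          "an ethical stance matters.")
moral_hits = [hit for _, hit, _ in kwic(sample, "moral*")]
ethic_hits = [hit for _, hit, _ in kwic(sample, "ethic*")]
```

Sorting the returned triples on the left or right context would give the familiar sorted concordance display.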


Corpus-assisted studies of discourse types are properly comparative: it is only possible to both uncover and evaluate the particular features of a discourse type, or a set thereof, by comparing it with others. In the same way, it is only possible to grasp the full relevance of many social, cultural and political attitudes if we are able to compare them to other possible attitudes, including those held at other times. MD-CADS helps us perform rather detailed comparisons of this type by providing data from comparable corpora with twelve years between them.
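One common way of comparing the frequency of an item in two corpora of different sizes is the log-likelihood keyness statistic. The sketch below is an assumption about how such a comparison might be computed, not a description of the settings actually used by the group; the function name and the figures in the usage lines are invented for illustration.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood keyness of an item occurring freq_a times in corpus A
    (size_a tokens) and freq_b times in corpus B (size_b tokens)."""
    total = freq_a + freq_b
    # Expected frequencies if the item were equally common in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


# Invented counts: same relative frequency in both corpora, then a clear rise.
same_rate = log_likelihood(100, 1000, 150, 1500)
risen = log_likelihood(200, 1000, 150, 1500)
```

An item with the same relative frequency in both corpora scores zero; the larger the score, the less likely it is that the difference in frequency is due to chance.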


The corpora


The principal corpus used in the present study is SiBol,[1] a newspaper corpus which is composed of two roughly parallel subcorpora: SiBol 93, consisting of around 100 million words of texts from the Guardian, Telegraph and The Times from the year 1993, and SiBol 05, approximately 150 million words from the same papers from the year 2005. They both contain the entire output of these papers for their respective years.


The structure of the colloquium


1. Partington will give an overview of MD-CADS and the rationale behind the corpus compilation and interrogation, which was carried out using XAIRA and WordSmith 5. Then the other participants will present the work carried out so far.


2. One paper is a mainly lexically-based study. Duguid scrutinises lists of key-items from both the 1993 and 2005 corpora in the search for groups of salient lexis which might reflect changes in news values, in the balance between news creation and news reporting, between news and comment and the relationship between the media and the public. She considers some lexical evidence of increasing conversationalisation or informalisation in UK quality newspaper prose particularly in the way in which evaluation has become salient in the discourse.


3. Marchi examines and compares the use of the items moral* and ethic* to see what the UK qualities construed as moral issues in 1993 and then in 2005 and how they are evaluated. Although the main focus is diachronic she also looks at differences across individual papers.


4. Taylor conducts an empirical investigation of the rhetorical function of science itself in the news, focussing in particular on how science is increasingly invoked to augment the moral significance of an argument and how, from being a largely descriptive term associated with the semantic group of institutions in 1993 (e.g. museum, council, curriculum, institute), it came in 2005 to co-occur with a group of reporting and justifying processes (e.g. shows, found, published, suggests).


5. Finally, Clark investigates the evolution of journalistic ‘style’, showing how newspaper reports have had to adjust to the reality of contemporary journalism. She addresses the ‘familiarisation’ of language in the quality press and focuses on changes in the linguistic marking of evidence and the representation of knowledge. She also attempts to draw some conclusions about the role of evidentials in epistemological stance, that is, the reporter’s attitude towards knowledge.



Partington, A. (forthcoming). “The armchair and the machine”: Corpus-Assisted Discourse Studies. In Corpora for University Language Teachers, edited by C. Taylor Torsello, K. Ackerley and E. Castello. Bern: Peter Lang.

Partington, A. and Duguid, A. (2008). Modern diachronic corpus-assisted discourse studies (MD-CADS). In Bertuccelli Papa, M. and Bruti, S. (eds) Threads in the complex fabric of Language. Pisa: Felici editore, pp. 5-19.

Channell, J. (1994). Vague Language. Oxford: Oxford University Press.

Hunston, S. and Thompson, G. (eds) (2000). Evaluation in texts. Authorial Stance and the Construction of Discourse. Oxford: Oxford University Press.

Converging and diverging evidence: corpora and other (cognitive) phenomena?


Dagmar Divjak and Stefan Gries


Over the last decade, it has become increasingly popular for (cognitive) linguists who believe that language emerges from use, to turn to corpora for authentic usage data (see Gries & Stefanowitsch 2006). Recently, a trend has emerged to supplement such corpus analyses with experimental data that presumably reflect aspects of cognitive representation and/or processing (more) directly. If converging evidence is obtained, cognitive claims made on the basis of corpus data are supported (Gries et al. 2005, to appear; Grondelaers & Speelman 2007; Divjak & Gries 2008; Dabrowska in press) and the status of corpora as a legitimate means to explore cognition is strengthened. Yet, recently, diverging evidence has been made available, too: frequency data collected from corpora sometimes make predictions conflicting with those made by experimental data (cf. Arppe & Järvikivi 2007, McGee to appear, Nordquist 2006, to appear) or do not as reliably approximate theoretical concepts such as prototypicality and entrenchment as desired (cf. Gilquin 2006; Wiechmann 2008; Divjak to appear; Gries to appear). This undermines hopes that the linguistic properties of texts produced by speakers straightforwardly reflect the way linguistic knowledge is represented in their minds.


In this colloquium, we want to focus on how well corpora predict cognitive phenomena. Despite the importance attributed to frequency in contemporary linguistics, the relationship between frequencies of occurrence in texts on the one hand, and status or structure in cognition as reflected in experiments on the other hand has not been studied in great detail, and hence remains poorly understood. This colloquium explores the relationship between certain aspects of language and their representation in cognition as mediated by frequency counts in both text and experiment. Do certain types of experimental data fit certain types of corpus data better than others? Which corpus-derived statistics correlate best with experimental results? Or do corpus data have to be understood and analyzed radically differently to obtain the wealth of cognitive information they (might) contain?


For this colloquium we have selected submissions that report on converging as well as diverging evidence between corpus and experimental data and interpret the implications of this from a cognitive-linguistic or psycholinguistic perspective. The six contributions deal with phenomena from morphology and morphosyntax (A. Caines “You talking to me? Testing corpus data with a shadowing experiment”, J. Svanlund “Frequency, familiarity, and conventionalization”), lexico-grammar (M. Mos, P. Berck and A. van den Bosch “The predictive value of word-level perplexity in human sentence processing”, L. Teddiman “Conversion & the Lexicon: Comparing evidence from corpora and experimentation”) and lexical semantics (D. Dobrovol’skij “The Lexical Co-Occurrence of the Russian Degree Modifier črezvyčajno. Corpus Evidence vs. Survey Data”, J. Littlemore and F. MacArthur “Figurative extensions of word meaning: How do corpus data and intuition match up?”). They provide both converging (M. Mos, P. Berck and A. van den Bosch, L. Teddiman, D. Dobrovol’skij) and diverging (J. Svanlund, J. Littlemore and F. MacArthur) evidence.


After a short introduction to the colloquium, in which we take up some fundamental issues, A. Caines presents a paper assessing the correlation between corpus and experimental data, asking to what degree the former can be said to predict the latter. First, a corpus-based study yields a distribution of an emerging construction in British English: the ‘zero auxiliary’ interrogative with progressive aspect (we going out tonight? why they bothering you?). The frequency of this construction is found to vary according to subject type. In the spoken section of the BNC, second person interrogatives occur most frequently in zero auxiliary form (you doing this? what you saying?), while first person interrogatives occur least frequently in zero auxiliary form (I doing this? what I saying?). A shadowing experiment was designed to test whether the difference in frequency is reflected in a difference in processing. This technique requires subjects to repeat sound files as quickly as they can; the dependent variable is the latency between word onset in the input (recording) and output (subject response). This is work in progress but results will be ready for the workshop, and implications for the potential for corpora to predict cognitive phenomena will be discussed.


J. Svanlund reports on research regarding the degree of conventionalization of some recently formed compounds in Swedish. Their text frequencies, as a measure of conventionalization, are compared to how familiar informants judge them to be, as another measure more closely related to entrenchment. It turns out that these two measures rarely coincide. The most frequent compound is not one of the more familiar ones, and the most familiar ones are not the most frequent. J. Svanlund covers the conventionalization process through a large corpus of nearly all the instantiations of these compounds in major newspapers. Some of the familiarity differences could be explained by skewed frequency distributions, but not all. Familiarity seems to be connected to several factors concerning word-formation type and to features of the discourse where a word is conventionalized. One key factor is how noticeable a word is. Non-transparent words seem to be noticed more, partly because they sometimes cause interpretation problems and must be explained more often. Words pertaining to controversies are also more likely to be noticed, especially where the phenomenon being described and the word describing it are both controversial. Some of these factors are mirrored in usage contexts. During the conventionalization process, and often later as well, writers often have to make judgements concerning the novelty, transparency and applicability of as yet non-conventional innovations.

Relatively new words are therefore often marked by various kinds of meta-traits in the text: citation marks, hedges, explanations, overt comments on usage, etc. Such traits can be examined and counted, and such investigations can complement mere text frequency as indications of conventionality status and entrenchment.


M. Mos, P. Berck and A. van den Bosch present work on the predictive value of word-level perplexity in sentence processing. They address the question of how well a measure of word-level perplexity of a stochastic language model can predict speakers' processing of sentences in chunks. They focus on one particular type of combination: modifiers with a fixed preposition of the type TROTS_OP (proud of). As with phrasal verbs, the PP following the adjective functions as its complement (Daan was trots op de carnavalswagen/Daan was proud of the carnival float). Although these adjectives and their prepositions are strong collocates, they can co-occur without triggering the complement reading (Daan stond trots op de carnavalswagen/Daan stood proudly on the carnival float). This raises the question whether the former combination is processed more as a unit than the latter. By contrasting complement versus non-complement structures and different verbs (either attracting a complement or a non-complement reading), they investigate the influence of these two factors. M. Mos, P. Berck and A. van den Bosch conducted a copy task in which 25 monolingual adult speakers of Dutch and 35 children (grade 6, mean age = 11) saw a sentence on a screen, which they then had to copy onto another screen. The participants' segmentation of the sentences was observed as they recreated the sentences. To estimate how predictable a word is given a context of previous words, the authors generated a memory-based language model trained on 48 million words of Dutch text. This model is able to generate a perplexity score on a per-word basis, measuring the model's surprise at seeing the next word. Based on the experimental outcomes and the application of the language model to the same data, they find that the children's switch data strongly correlate with the model's predictiveness.
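To make the notion of a per-word perplexity score concrete, the following is a minimal sketch only. It assumes a bigram model with add-one smoothing, which is a much simpler stand-in for the memory-based language model described above; the function names and the toy sentences are hypothetical and serve purely as illustration.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect context counts, bigram counts and the vocabulary
    from a list of tokenised sentences (hypothetical toy input)."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>"] + tokens          # "<s>" marks the sentence start
        vocab.update(tokens)
        contexts.update(padded[:-1])       # every token that has a successor
        bigrams.update(zip(padded, padded[1:]))
    return contexts, bigrams, vocab

def word_surprisal(tokens, contexts, bigrams, vocab):
    """Return (word, surprisal in bits) pairs under add-one smoothing."""
    padded = ["<s>"] + tokens
    scores = []
    for prev, word in zip(padded, padded[1:]):
        prob = (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))
        scores.append((word, -math.log2(prob)))
    return scores

def perplexity(scores):
    """Perplexity of a sentence: 2 raised to the mean per-word surprisal."""
    return 2 ** (sum(bits for _, bits in scores) / len(scores))

# Toy training data, assumed for illustration only.
toy_corpus = [
    "daan was trots op de wagen".split(),
    "daan was trots op zijn werk".split(),
    "daan stond op de wagen".split(),
]
model = train_bigram(toy_corpus)
for word, bits in word_surprisal("daan was trots op de wagen".split(), *model):
    print(f"{word}\t{bits:.2f}")
```

In this sketch a word that follows a strong collocate receives a lower surprisal than the same word after a weaker one, which is the kind of per-word signal that could be compared with segmentation behaviour in a copy task.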


L. Teddiman compares experimental and corpus evidence on lexical conversion, the process through which one word is derived from another without overt marking. For example, work can be used as a noun (I have a lot of work to do) or a verb (I work at the lab), and is easily interpretable in both roles. Lexical underspecification, a proposal describing how such words are encoded in the mind, holds that the root is not specified for lexical category in the lexicon, but is realized as a member of a lexical category when placed in a supporting syntactic environment. Under a strict interpretation, it predicts that speakers should not be able to determine the lexical category of a root without a context. L. Teddiman explores native speaker sensitivity to lexical categories in English when words are presented in isolation and draws upon corpus data to investigate the role of experience. In an online category decision task, participants decided whether a presented word was a noun or a verb as quickly as possible. Stimuli consisted of unambiguous nouns (e.g., food), unambiguous verbs (e.g., think), and words that could be either (e.g., work), as determined through the use of frequency data from the CELEX Lexical Database. Results showed that participants were able to determine the lexical category of unambiguous words in isolation and categorized ambiguous items in accordance with their relative frequencies. Each time a word is used, its context supports a particular interpretation, and this is reflected in the decisions participants made about words in these tasks. Here, corpora represent a measure of summed linguistic experience over time. The corpus presents the contexts by which the lexical category of an item can be determined, and L. Teddiman's results suggest that speakers are sensitive to this information in the absence of context.


D. Dobrovol’skij studies the lexical co-occurrence of the Russian degree modifier črezvyčajno (extremely) from a corpus and an experimental perspective. His assumption is that many combinatorial phenomena traditionally regarded as arbitrary are, in fact, motivated semantically and/or pragmatically, at least to some extent. Semantic structures of near-synonymous adverbials (cf. črezvyčajno, neobyknovenno, neobyčajno, krajne, beskonečno, bezmerno) do not, at first glance, reveal any relevant factors motivating selection constraints. But they differ from each other in terms of lexical co-occurrence. If relevant factors can be found, then the combinatorial behaviour of such degree modifiers is not quite arbitrary and can be described in terms of semantic (and maybe also pragmatic) correlations. The analysis of contexts from corpora of present-day texts reveals the following tendencies that are typical of the contemporary usage of the word črezvyčajno: 1. The degree modifier črezvyčajno combines with adjectives and adverbs rather than with verbs. 2. Words in the scope of črezvyčajno more often denote 'non-physical' properties. 3. Words in the scope of črezvyčajno tend to denote features which are biased towards the positive pole of various scales. Currently, the adverbial črezvyčajno is perceived as a non-ordinary word, indicative of a discourse type which can be labelled "non-ordinary discourse" (in contrast to everyday language). To validate this observation a survey was conducted. Twenty-nine educated native speakers of Russian saw two types of lexical co-occurrences with črezvyčajno: expressions which did not violate rules 1-3, and expressions which did to at least some degree. The respondents rated the compliance of these expressions with the lexical norms on a 5-point scale. D. Dobrovol’skij concludes that the survey data corroborate the corpus evidence.



J. Littlemore and F. MacArthur study figurative extensions of word meaning and ask how corpus data and intuition match up with how linguistic knowledge is stored in the minds of native speakers and second language learners of English and Spanish. In a usage-based system, native speakers build up knowledge of the semantic extension potential of the words in their language through encountering them in multiple discourse situations. In contrast, language learners have less access to such frequent, meaningful and varied types of linguistic interaction and are expected to have impoverished knowledge. J. Littlemore and F. MacArthur investigated the intuitions that both native speakers and learners of English and Spanish had of the categories of senses associated with the words ‘thread’ (including ‘threaded’ and ‘threading’) and ‘wing’ (including ‘winged’ and ‘winging’). They compared these intuitions with each other and with their corpus findings. They show that 1. compared with corpus data, intuitive data for both speaker groups are relatively impoverished, 2. even advanced learners have very limited knowledge of the senses at the periphery of this category compared with that of native speakers, 3. even native speakers exhibited considerable variation, with younger speakers exhibiting much less knowledge than older speakers, 4. different word forms trigger different senses and phraseologies in both corpus and intuitive data, 5. the intuitive data for the native speakers largely reflect the participants' backgrounds, 6. popular culture has a strong influence on intuitive data, suggesting that this knowledge is dynamic and unstable, and 7. fixed phraseologies appear to help native speakers access more meanings. J. Littlemore and F. MacArthur conclude that radial category knowledge builds up over a lifetime. Speakers can access this knowledge much more readily in natural communicative contexts than in decontextualised, controlled settings, which underscores the usage-based nature of language processing and production.


After the presentations, we devote 30 minutes to a discussion in which we, as quasi-discussants, synthesize the major findings from the papers and then open a discussion of the implications of, and conclusions about, ways of combining data from different methodologies.



Arppe, A. and J. Järvikivi. (2007). “Every method counts - Combining corpus-based and experimental evidence in the study of synonymy”. Corpus Linguistics and Linguistic Theory 3(2), 131-59.

Divjak, D.S. (to appear). “On (in)frequency and (un)acceptability”. In B. Lewandowska-Tomaszczyk (ed.) Corpus linguistics, computer tools and applications - state of the art. Frankfurt: Peter Lang, 1-21.

Dabrowska, E. (in press). “Words as constructions”. In V. Evans & S. Pourcel (eds) New directions in cognitive linguistics. Amsterdam, Philadelphia: John Benjamins.

Divjak, D.S. & St.Th. Gries. (2008). “Clusters in the Mind? Converging evidence from near-synonymy in Russian”. The Mental Lexicon 3(2), 188-213.

Gilquin, G. (2006). “The place of prototypicality in corpus linguistics”. In St.Th. Gries and A. Stefanowitsch (eds), 159-91.

Gries, St.Th. (to appear). Dispersions and adjusted frequencies in corpora: further explorations.

Gries, St.Th., B. Hampe and D. Schönefeld. (2005). “Converging evidence: bringing together experimental and corpus data on the association of verbs and constructions”. Cognitive Linguistics 16(4), 635-76.

Gries, St.Th. and A. Stefanowitsch (eds) (2006). Corpora in cognitive linguistics: corpus-based approaches to syntax and lexis. Berlin, New York: Mouton de Gruyter.


Corpus-Based Approaches to Figurative Language


Alan Wallington, John Barnden, Mark Lee, Rosamund Moon, Gill Philip and Jeanette Littlemore




Since the inception of the Corpus Linguistics Conference in 2001 at Lancaster, our group, whose core members are in the School of Computer Science at the University of Birmingham, has held accompanying colloquia on the topic of Corpus-Based Approaches to Figurative Language. The period since the first colloquium has seen a steady growth of interest in figurative language, and for this year’s colloquium we have sought and gained endorsement from RaAM (Researching and Applying Metaphor), one of the leading international associations concerned with metaphor and figurative language.


Colloquium Theme


It is evident from papers presented at recent metaphor conferences and colloquia that there is an increasing interest in using corpora and corpus-based methods in investigating metaphor. This has influenced our proposal in two ways.

Firstly, in the first two colloquia that were associated with the Corpus Linguistics Conference, there was no theme to the colloquium other than the use of corpora and corpus-based methods to elucidate figurative language. However, for the third colloquium, we felt that there was sufficient corpus-based work being undertaken to propose a general theme. We will do the same this time and propose the theme of “Corpus-based accounts of Variety and Variability in Metaphor.”

Secondly, we have noted that although corpora are being used more and more, the use is often rather simplistic. In particular, there often appears to be an assumption that data should be compared to a general reference corpus, such as the BNC, even when it is too generic a data set for many comparisons.

We would like our colloquium to help redress this latter tendency, and to this end we proposed a theme of variety and variability. This broad theme can include papers that examine the use of the same metaphor sources/targets, for example metaphors for ‘time’, across different data sources and genres, such as news articles, personal blogs, general conversation, and so on. This will also be an important aspect of the discussion period planned for the final session of the colloquium.

However, the theme can be interpreted more broadly. Whilst the above interpretation emphasizes variability of genre, we also wished to include papers that investigate the variability found in the use of individual metaphorical utterances, that is in the degree of conventionality or entrenchment, and how conventional metaphors may be changed or elaborated upon.

As noted, we intend to devote the final sessions of the colloquium to a discussion on the topic of variety and variability, and from this perspective, researchers who have undertaken good corpus-based studies of a particular topic, but who have used only a single genre or corpus, may find fruitful interaction with other participants who have investigated similar topics but used different genres. Such interaction would by itself be an important contribution to the theme of variety and variability. Therefore, we have also accepted contributions that did not stick rigidly to the theme, subject only to the proviso that they examined figurative language from a corpus-based perspective.

Therefore, we will have both papers and posters at the colloquium. In choosing papers, priority will be given to those submissions that stick to the theme of the colloquium, whilst poster priority will be given to submissions that give less emphasis to the theme. We believe that a more informal poster session will help facilitate the interaction we want. Consequently, we shall have seven oral presentations followed by a poster session with 12 posters. We have also suggested to the paper presenters that they may, if they so wish, include a small poster to display during the poster session. By including these posters, we believe that interaction around the theme of variety and variability will be maximised.

Both oral presenters and poster presenters will provide papers for inclusion in the proceedings, which can be viewed from the week before the conference via a link from the colloquium website.


Titles of Papers and Posters


Khurshid Ahmad & MT Musacchio  Variation and variability of economics metaphors in an English-Italian corpus of reports, newspaper and magazine articles.

Anke Beger  Experts versus Laypersons: Differences in the metaphorical conceptualization of ANGER, LOVE and SADNESS.

Hanno Biber  Hundreds of Examples of Figurative Language from the AAC-Austrian Academy Corpus.

Olga Boriskina  A Cryptotype Approach to the Study of Metaphorical Collocations in English.

Rosario Caballero  Tumbling buildings, wine aromas and tennis players: Fictive and metaphorical motion across genres.

Claudia Marcela Chapeton  A Comparative Corpus-driven study of Animation metaphor in Native and Non-native Student Writing.

D Corts & S Campbell  Novel Metaphor Extensions in Political Satire.

E Gola & S Federici  Words on the edge: conceptual rules and usage variability.

Helena Halmari  Variation in death metaphors in the last statements by executed Texas death row inmates.

Janet Ho  A Corpus approach to figurative expressions of fear in business reports.

M L Lorenzetti  That girl is hot, her dress is so cool, and I'm just chilling out now: Emergent Metaphorical Usages of Temperature Terms in English and Italian.

Chiara Nasti  Metaphors On The Way To Lisbon: Will The Treaty Go On The Trunk Line Or Will It Be Stopped In Its Course?

Regina Gutierrez Perez  Metaphor Variation across Languages. A Corpus-driven Study.

Jing Peng & Anna Feldman  Automatic Detection of Idiomatic Expressions.

Sara Piccioni  What is creative about a creative figurative expression? Comparing distribution of metaphors in a literary and a non-literary corpus.

Elena Semino, Veronika Koller, Andrew Hardie & Paul Rayson  A computer-assisted approach to the analysis of metaphor variation across genres.

Judit Simo  The metaphorical uses of blood in American English and Hungarian.

Tony Veale  The Structure of Creative Comparisons: A Corpus-guided Computational Exploration.

Julia Williams  Variation of cancer metaphors in scientific texts and popularisations of science in the press.



Corpus, context and ubiquitous computing

Svenja Adolphs

The development of techniques and tools to record, store and analyse naturally occurring interaction in large language corpora has revolutionised the way in which we describe language and human interaction. Language corpora serve as an invaluable resource for the research of a large range of diverse communities and disciplines, including computer scientists, social scientists and researchers in the arts and humanities, policy makers and publishers. Until recently, spoken and written corpora have mainly taken the form of relatively homogenous renderings of transcribed spoken interaction and textual capture from different contexts. On this basis, we have seen the development of descriptions of language use based on distributional patterns of words and phrases in the different contexts in which they are used. However, everyday communication has evolved rapidly over the past decade with an increase in the use of digital devices, and so have techniques for capturing and representing language in context. There is now a need to take account of this change and to develop corpora which include information about context and its dynamic nature.

The idea of being able to respond to the context of use is also one of the most fundamental and difficult challenges facing the field of ubiquitous computing today, and there thus seems to be an opportunity for interdisciplinary research between Corpus Linguists and Computer Scientists in this area. The ability to understand how different aspects of context, such as location for example, influence language use is important for future context-aware computing applications and also for developing more contextually sensitive descriptions of language in use.

In this talk I will explore different ways in which we may relate measurements of different aspects of context gathered from multiple sensors (e.g. position, movement, time and physiological state) to people’s use of language. I will discuss some of the issues that arise from the design, representation and analysis of spoken corpora that contain additional measurements of different aspects of text and context. I will argue that such descriptions may generate useful insights into the extent to which everyday language and communicative choices are determined by different spatial, temporal and social contexts.




A corpus-driven approach to formulaic language in English:

Multi-word patterns in speech and writing

Douglas Biber

The present study utilizes a radically corpus-driven approach to identify the most common multi-word patterns in conversation and academic writing, and to investigate the differing ways in which those patterns are variable in the two registers.  Following a review of previous corpus-driven research on formulaic language, the paper compares two specific quantitative methods used to identify formulaic sequences:  probabilistic statistics (e.g., Mutual Information Score) versus simple frequency.  When applied to multi-word sequences, these two methodologies produce widely different results, representing ‘multi-word lexical collocations’ versus ‘multi-word formulaic sequences’ (which incorporate both function words and content words).  The primary focus of the talk is an empirical investigation of the ‘patterns’ represented by these multi-word sequences.  It turns out that the multi-word patterns typical of speech are fundamentally different from those typical of academic writing:  Patterns in conversation tend to be fixed sequences (including both function words and content words).  In contrast, most patterns in academic writing are formulaic frames consisting of invariable function words with an intervening variable slot that can be filled by many different content words.
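The contrast between the two quantitative methods can be illustrated with a minimal sketch. It assumes two-word sequences only (the study itself concerns longer multi-word sequences) and uses pointwise mutual information as one common form of a Mutual Information score; the function name and the toy token list are hypothetical.

```python
import math
from collections import Counter

def bigram_scores(tokens):
    """Return {bigram: (raw frequency, pointwise mutual information)}."""
    n_tokens = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_bigrams = n_tokens - 1
    scores = {}
    for (w1, w2), freq in bigrams.items():
        p_pair = freq / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        # PMI compares the observed pair probability with the probability
        # expected if the two words occurred independently of each other.
        scores[(w1, w2)] = (freq, math.log2(p_pair / (p_w1 * p_w2)))
    return scores

# Toy data, assumed for illustration only.
tokens = "the end of the day at the end of the week in hong kong".split()
scores = bigram_scores(tokens)

by_frequency = sorted(scores, key=lambda bg: scores[bg][0], reverse=True)
by_pmi = sorted(scores, key=lambda bg: scores[bg][1], reverse=True)
print("top by frequency:", by_frequency[:3])
print("top by PMI:", by_pmi[:3])
```

Even on this toy input the two rankings diverge: raw frequency favours sequences built around function words such as "of the", while the association score favours rarer pairs of content words such as "hong kong".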


The priming of narrative strategies (and the overriding of that priming)

Michael Hoey

Lexical priming theory predicts that macro-textual decisions will be marked lexically. In other words, when writers decide to begin a new paragraph in a news story or start a new chapter in a novel, they draw, often without awareness of what they are doing, upon their primings from all the news stories or novels they have encountered.  A research project funded by the AHRC, led by the author, in conjunction with Matt O’Donnell, Michaela Mahlberg and Mike Scott, showed that a large number of words and clusters were specifically associated with text-initial and paragraph-initial sentences, occurring in such sentences with a far greater frequency than in non-initial sentences despite their being drawn from the same news stories. In this paper, the principles and processes identified in the project are applied to traditional functional narratives – folk tales and 19th century novels – with a view to seeing whether the narrative model developed by Robert Longacre between 1968 and 1993 is supported. It is argued that the narrative model is indeed supported by priming theory and at the same time supports the theory by showing once again a systematic relationship between macro-textual organisation and micro-lexical choices.
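The comparison between text-initial and non-initial sentences described above can be sketched as follows. This is a minimal illustration only, not the procedure used in the project: it assumes each text is already split into tokenised sentences, and the function name and toy texts are hypothetical.

```python
from collections import Counter

def position_rates(texts, word):
    """Relative frequency of `word` in text-initial sentences versus
    all other sentences. `texts` is a list of texts, each a list of
    tokenised sentences."""
    initial, other = Counter(), Counter()
    for sentences in texts:
        initial.update(sentences[0])        # first sentence of each text
        for sentence in sentences[1:]:      # every later sentence
            other.update(sentence)
    n_initial = sum(initial.values())
    n_other = sum(other.values())
    rate_initial = initial[word] / n_initial if n_initial else 0.0
    rate_other = other[word] / n_other if n_other else 0.0
    return rate_initial, rate_other

# Toy texts, assumed for illustration only.
texts = [
    [["yesterday", "the", "minister", "resigned"],
     ["he", "gave", "no", "reason"]],
    [["yesterday", "a", "storm", "hit"],
     ["the", "damage", "was", "severe"]],
]
print(position_rates(texts, "yesterday"))
print(position_rates(texts, "the"))
```

A word whose rate in initial sentences is far higher than its rate elsewhere in the same texts is a candidate for the kind of positional association discussed here; a real study would add a significance test and a much larger corpus.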


English in South Asia: corpus-based perspectives on the lexis-grammar interface

Joybrato Mukherjee

South Asian English(es) have been described in detail over the past few decades. Special emphasis has been placed on Indian English, by far the largest institutionalised second-language variety of English, e.g. in Kachru's (2005) work. However, in spite of some laudable exceptions (cf. e.g. Shastri 1996 on verb-complementation, Schilk 2006 on collocations), lexicogrammatical phenomena have so far been widely neglected in corpus-based research into varieties of English in South Asia.

The present paper reports on various projects at the University of Giessen (and collaborating institutions) that are intended to shed new light on the lexis-grammar interface of South Asian English(es) in general and verb-complementational patterns in particular on the basis of standard-size corpora and web-derived mega-corpora. Of particular relevance are ditransitive verbs and their complementation patterns in South Asian Englishes, as they have been shown to be especially prone to complementational variation in New Englishes and notably so in Asian Englishes (cf. Mukherjee/Gries 2009).

In this paper, I will focus on the lexicogrammar of Indian English and Sri Lankan English as two particularly relevant South Asian varieties of English that mark different stages of the evolutionary development of New Englishes as described by Schneider (2007). Specifically, I will present various findings from corpus analyses (e.g. with regard to the use of the prototypical ditransitive verb give), but I will also show that for various questions related to speaker attitudes, exonormative and endonormative orientations, one has to go beyond corpus data. In particular, this holds true for the assessment of the epicentre hypothesis, i.e. the assumption that Indian English as the dominant South Asian variety serves as a model for smaller neighbouring communities, such as Sri Lanka, in which English also has an official status.


Kachru, B.B. (2005). Asian Englishes: Beyond the Canon. Hong Kong: Hong Kong University Press.
Mukherjee, J. and S.Th. Gries (2009). "Collostructional nativisation in New Englishes: verb-construction associations in the International Corpus of English". English World-Wide 30(1), 27-51.
Schilk, M. (2006). "Collocations in Indian English: A corpus-based sample analysis". Anglia 124(2), 276-316.
Schneider, E.W. (2007). Postcolonial English: Varieties around the World. Cambridge: Cambridge University Press.
Shastri, S.V. (1996). "Using computer corpora in the description of language with special reference to complementation in Indian English". In R. J. Baumgardner (ed.) South Asian English: Structure, Use. Urbana, IL: University of Illinois Press, 70-81.


Key cluster and concgram patterns in Shakespeare

Mike Scott

In recent years key words (KWs) in Shakespeare plays have been shown to belong to certain category-types such as theme-related KWs and character-related KWs. Other KWs, for some purposes the more interesting ones, seem to be pointers to other patterns indicative of quite specific features of the language, or of the status of characters or of individual sub-themes. It may be that there is a tension between global KWs and much more localised, "bursty" ones in this regard.

At the same time there has been steadily increasing awareness of the importance of collocation -- no word is an island, to misquote John Donne -- and corpus focus has turned more and more to n-grams, clusters, bundles, skipgrams and concgrams. Whatever the name, they all recognise the connections between words, the fact that words hang about in gangs and sometimes terrorise old ladies.

This presentation, accordingly, moves forward from single-word KWs to larger ones which are shown to occur distinctively in each individual play, or in the speeches of an individual character. The diverse types of patterns are what will be explored here. Are n-grams a mere coincidence of relatively frequent words co-occurring frequently so that they are but sound and fury signifying nothing? Or do they play some purposeful role in Shakespeare's patternings?

The presentation represents work in progress, not expected to conclude for many years.

Arabic question-answering via instance based learning from an FAQ corpus

Bayan Abu Shawar and Eric Atwell


We describe a way to access information in an Arabic Frequently-Asked Questions corpus, without the need for sophisticated natural language processing or logical inference. We collected an Arabic FAQ corpus from a range of medical websites, covering 412 subtopics from 5 domains:  motherhood and pregnancy issues; dental care issues; fasting and related health issues; blood disease issues such as cholesterol level, and diabetes; and blood charity issues. 

FAQs are documents designed to capture the key concepts or logical ontology of a given domain. Any Natural Language interface to an FAQ is expected to reply with the given Answers, so there is no need for Arabic NL generation to recreate well-formed answers, or for Arabic NLP analysis or logical inference to map user input questions onto this logical ontology; instead, a simple set of pattern-template matching rules can be extracted from each Question and Answer in the FAQ corpus, to be used in an Instance Based Learning system which finds the nearest match for any user input question and outputs the corresponding answer(s). The training corpus is in effect transformed into a large number of categories or pattern-template pairs: from the 412 subtopics, our system generated 5,665 pattern-template rules to match against user input. User input is used to search the categories extracted from the training corpus for a nearest match, and the corresponding reply is output.
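The nearest-match idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system described above: it swaps the corpus-derived pattern-template rules for a plain word-overlap (Jaccard) similarity, it uses English toy questions rather than Arabic, and the function names, questions and placeholder answers are all hypothetical.

```python
def tokenize(text):
    """Lower-case a question and split it into a set of word tokens."""
    return set(text.lower().replace("?", " ").split())

def jaccard(a, b):
    """Word-overlap similarity between two token sets (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def nearest_answer(question, faq):
    """Return the stored answer whose question best matches the input.
    `faq` is a list of (question, answer) pairs taken from the corpus."""
    query = tokenize(question)
    _, best_answer = max(faq, key=lambda pair: jaccard(query, tokenize(pair[0])))
    return best_answer

# Toy FAQ with placeholder answers, assumed for illustration only.
faq = [
    ("How often should I brush my teeth?", "ANSWER-DENTAL-1"),
    ("Can I fast while pregnant?", "ANSWER-FASTING-1"),
]
print(nearest_answer("how often do I need to brush teeth", faq))
```

As in the approach described above, the stored answer is returned verbatim, so no language generation step is required; only the matching step would need adapting to a given language.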

Previous work demonstrated this Instance Based Learning approach with English and other European languages. Arabic offers additional challenges: a different writing system; no Capitalization to identify proper names; highly inflectional and derivational morphology; and little standardization of format for Questions and Answers across documents in the Corpus.  Nevertheless, we were able to adapt the Instance Based Learning system to the Arabic FAQ corpus straightforwardly, demonstrating that the Instance Based Learning approach is much more language-independent than systems which build in more linguistic knowledge. 

Initial tests showed that a high proportion of answers (93%) to a limited set of test questions were correct.   The same questions were submitted to Google and AskJeeves. However, because Google and AskJeeves return complete documents that hold the answers, we tried to measure how easy it was to find the answers inside the documents; we found that, given the documents, users had to then spend more time searching the documents to find the answer, and they only succeeded around half the time.

Future work includes extending the Arabic FAQ Corpus to a wider range of medical questions and answers; and more rigorous evaluation, including comparisons with more sophisticated NLP-based Arabic Question-Answering systems.  At least we have demonstrated that a simple Instance Based Learning engine can be used as a tool to access Arabic WWW FAQs. We did not need sophisticated natural language analysis or logical inference; a simple (but large) set of corpus-derived pattern-template matching rules is sufficient.


Using monolingual corpora to improve Arabic/English cross-language information retrieval


Farag Ahmed, Andreas Nürnberger and Ernesto William De Luca


In a time of wide availability of communication technologies, language barriers are a serious issue to world communication and to economic and cultural exchanges. More comprehensive tools to overcome such barriers, such as machine translation and cross-lingual information application, are nowadays in strong demand. In this paper, related research problems will be considered especially in the context of Arabic/English Cross Language Information Retrieval (CLIR).

    Word sense disambiguation (WSD) is a non-trivial task. For some languages the problem is even more complicated because appropriate resources, e.g., lexical resources like WordNet [1], do not yet exist. Furthermore, sense-annotated corpora are rare for those languages, because obtaining labelled data is expensive and time-consuming. In addition to these problems, written Arabic usually omits the vowel signs, which are placed above and below the letters in order to indicate the proper pronunciation. Their absence causes high ambiguity, since an unvowelled Arabic word can have several different meanings and interpretations.

    The lack of a one-to-one mapping between a lexical item and its meaning causes translation ambiguity, which produces translation errors that degrade the performance of a multilingual retrieval system. Therefore, extra effort is needed in Arabic to deal with ambiguous word translations. For example, due to the absence of vowel signs, the Arabic word “يعد” can have the meanings "promise", "prepare", "count", "return", "bring back" and the word “علم” can have the meanings "flag", "science", "he knew", "it was known", "he taught", "he was taught" in English. One solution to this problem is to use statistical information extracted from corpora.

    In this paper, we will briefly review the major issues related to CLIR, e.g., translation ambiguity, inflection, translating proper names, spelling variants and special terms [2], concentrating on WSD tasks from Arabic to English. Furthermore, building on the encouraging results we achieved in previous work, which was based on a modification of the Naïve Bayesian algorithm [3, 4], we will propose an improved statistical method to disambiguate the user’s query terms using statistical data from a monolingual corpus and the web. This method consists of two main steps. First, using an Arabic analyzer, the query terms are analyzed and the senses of the ambiguous query terms are defined. Secondly, the correct senses of the ambiguous query terms are selected, based on co-occurrence statistics. To compensate for the lack of training data for some test queries, we used web queries to obtain the statistical co-occurrence data needed to disambiguate the ambiguous query terms. This was achieved by constructing all possible translation combinations of the query terms and then sending them individually to a particular search engine, in order to identify how often these translation combinations appear together. The evaluation was based on 50 Arabic queries. Two experiments were carried out: one was based on co-occurrence data obtained from the monolingual corpus, and the other on co-occurrence data obtained from the web.
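    The step of building all translation combinations and choosing the one that co-occurs most often can be sketched as follows. This is only an illustration under stated assumptions: the function name, the candidate lists and the counts are invented, the counts stand in for corpus or search-engine hit counts, and raw counts are used in place of the modified Naïve Bayesian scoring mentioned above.

```python
from itertools import product

def best_translation(candidates, cooccurrence):
    """Pick the translation combination with the highest co-occurrence count.

    candidates   -- one list of candidate English translations per query term
    cooccurrence -- dict mapping a frozenset of words to how often they were
                    seen together (hypothetical corpus or web hit counts)
    """
    def score(combo):
        return cooccurrence.get(frozenset(combo), 0)

    # product() enumerates every possible combination of translations.
    return max(product(*candidates), key=score)

# Invented example: two senses of an ambiguous first term, and two
# candidate translations of a second, hypothetical query term.
candidates = [["flag", "science"], ["country", "computer"]]
counts = {
    frozenset(["science", "computer"]): 120,
    frozenset(["flag", "country"]): 45,
    frozenset(["flag", "computer"]): 2,
}
chosen = best_translation(candidates, counts)
```

    The number of combinations grows multiplicatively with the number of ambiguous terms, which is one reason the number of queries sent to a search engine has to be kept in mind.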



Ahmed, F. and A. Nürnberger (forthcoming). “Corpora based Approach for Arabic/English Word Translation Disambiguation”. Journal of Speech and Language Technology, 11.

Ahmed, F. and A. Nürnberger (2008). “Arabic/English word translations disambiguation using parallel corpus and matching scheme”. In Proceedings of the 12th European Machine Translation Conference (EAMT08), 6–11.

Hedlund, T., E. Airio, H. Keskustalo, R. Lehtokangas, A. Pirkola and K. Järvelin (2004). “Dictionary-based cross-language information retrieval: Learning experiences from CLEF 2000-2002”. Information Retrieval, 7(1/2), 99-119.


A corpus-based analysis of conceptual love metaphors

Yesim Aksan, Mustafa Aksan, Taner Sezer and Türker Sezer


Corpus-based conceptual metaphor studies have underscored the significant impact of authentic data analysis on the theoretical development of conceptual metaphor theory (Charteris-Black 2004; Deignan 2005; Stefanowitsch and Gries 2006). In this context, most of the linguistic metaphors constituting the conceptual metaphors identified by Lakoff and Johnson (1980) have been subjected to corpus-based analysis (Stefanowitsch 2006; Deignan 2008 among others). In this study, we will focus on metaphors of romantic love in English and we will argue that the 25 source domains identified by Kövecses (1988) in the conceptualization of romantic love display differences in British and American English.

            The aim of this paper is twofold: (1) to show how contrastive corpus-based conceptual metaphor analysis will unveil language specific elaboration of linguistic metaphors of romantic love in British and American English; (2) to explicate frequent and current use of the proposed source domains for English love metaphors through the analysis of large corpora.

            To achieve these ends, the main lexical items from the proposed source domains of romantic love (namely ECONOMIC EXCHANGE, SOCIAL CONSTRUCT, INSANITY, DISEASE, BOND, FIRE, JOURNEY, etc.) will be concordanced using BNCweb (CQP-edition) and COCA (the Corpus of Contemporary American English, by Mark Davies). The concordance data will be analysed for differences between literal and metaphorical uses, and the frequency of metaphorical uses of the lexical items referring to the above-mentioned source domains will be identified in both corpora. For instance, while the lexical item invest from the ECONOMIC EXCHANGE source domain does not collocate with the nouns relationship, feeling and emotion in the BNC, it collocates with all of these nouns in COCA. In other words, the linguistic metaphor ‘She has invested a lot on that relationship’ appears to be a more likely usage in American English than in British English. The distribution of the linguistic metaphors of romantic love in the corpora with respect to medium of text and text type will also be examined.

            Overall, the methodology applied in this paper will reveal a number of different linguistic metaphors used to conceptualize romantic love in British and American English. The analysis proposed in this study, that is, the analysis of lexical items together with their metaphorical patterns, might also yield a database that could be employed in studies aiming to develop software for identifying metaphor candidates in corpora.



Charteris-Black, J. (2004). Corpus approaches to critical metaphor analysis. Palgrave MacMillan.

Deignan, A. (2005). Metaphor and corpus linguistics. John Benjamins.

Deignan, A. (2008). “Corpus linguistic data and conceptual metaphor theory”. In M.S. Zanotto, L. Cameron and M. C. Cavalcanti (eds) Confronting metaphor in use. Amsterdam: John Benjamins, 149-162.

Kövecses, Z. (1988). The language of love. Bucknell University Press.

Lakoff, G. and M. Johnson (1980). Metaphors we live by. University of Chicago Press.

Stefanowitsch, A. (2006). “Words and their metaphors: A corpus based approach”. In A. Stefanowitsch and S. Th. Gries (eds) Corpus-based approaches to metaphor and metonymy. Amsterdam: John Benjamins, 63-105.


A corpus-based reappraisal of the role of biomechanics in lexical phonotactics

Eleonora C. Albano


In a number of influential papers, MacNeilage and Davis (e.g., 2000) have put forward the view that CV co-occurrence is shaped by biomechanics. The claim is based on certain phonotactic biases recurring in several languages in babbling, first words, and the lexicon. These are: labial C’s favor open V’s; coronal C’s favor front V’s; velar C’s favor back V’s; positions C1 and C2 favor labials and coronals, respectively.

The frequency counts are performed on oral language transcripts and instantiate a productive use of corpora in phonology. They are held to support the frame-then-content theory of the syllable, viz.: frames are biomechanical CV gestalts emerging from primitive vocal tract physiology in ontogeny and phylogeny; content is, in turn, the richer phonetic inventory emerging gradually from linguistic experience.

However thought-provoking, this theory rests on questionable data: the corpora are generally small; the statistics are limited to observed-to-expected (O/E) ratios (derived from chi-squared tables); and no effect size measure is offered.

This paper is an attempt to illuminate the lexical biases in question with substantial data from two related languages: Spanish and Portuguese. It is assumed that CV biases should be investigated in greater depth in languages with known grammar and history before they can be attributed to biomechanics.

The data on CV co-occurrence are drawn from two lexicons (about 45,000 words each) derived from public oral language databases of each language. This sample size has been validated by correlation with larger, written language databases. The phonetic coding includes the entire segment inventory given by orthography to phone conversion. Statistics includes chi-squared, Cramer’s V, and Freeman-Tukey deviates.
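The three statistics named above can be computed from a consonant-by-vowel contingency table as in the following sketch. The function name and the 2x2 counts are invented for illustration and are not the paper's data; the formulas are the standard ones (Pearson's chi-squared, Cramér's V as the square root of chi-squared over N times the smaller table dimension minus one, and the Freeman-Tukey deviate sqrt(O) + sqrt(O+1) - sqrt(4E+1)).

```python
import math

def association_stats(table):
    """Return (chi-squared, Cramér's V, Freeman-Tukey deviates) for a
    contingency table given as a list of rows of observed counts.
    Assumes every row and column total is greater than zero."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    chi2 = 0.0
    deviates = []
    for i, row in enumerate(table):
        dev_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
            # Positive deviate: cell over-represented; negative: under-represented.
            dev_row.append(math.sqrt(observed) + math.sqrt(observed + 1)
                           - math.sqrt(4 * expected + 1))
        deviates.append(dev_row)

    k = min(len(table), len(table[0]))
    cramers_v = math.sqrt(chi2 / (n * (k - 1)))
    return chi2, cramers_v, deviates

# Invented counts: rows = two consonant classes, columns = two vowel classes.
chi2, v, ft = association_stats([[10, 20], [20, 10]])
```

Chi-squared grows with sample size, whereas Cramér's V is bounded between 0 and 1, which is why an effect size measure is reported alongside it; the per-cell deviates show which particular CV combinations drive the association.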

The results confirm most of the tendencies reported by MacNeilage and Davis, but lend themselves to a subtler interpretation, in which the role indeed played by biomechanics is subject to linguistic constraints.

Firstly, the biases only attain a non-negligible effect size when all C’s and V’s in a phonetic class are pooled together. Secondly, strength of association only becomes appreciable if manner and place of articulation are separated. Thirdly, the biases are sensitive to stress and position in word. Thus, the largest effect sizes, which agree with the general trends reported by the authors, occur in initial unstressed position. Smaller appreciable effect sizes occur in medial stressed position also, but are partly due to derivational morphology. Fourthly, most of the latter biases are common to the two languages, as predictable from their history.

The findings support a view of the lexicon in which the role of biomechanics is under linguistic control. “Natural” CV biases are found in grammatically less constrained environments, while freer, mainly contrastive, C-V combination is found where grammar is at work. Grammar does, however, accommodate the less flexible biomechanical constraints, e.g., those affecting the tongue dorsum. Thus, both languages exhibit a sizeable set of productive unstressed derivational morphemes combining velar C’s with back V’s.

To conclude, our findings are not entirely congruent with MacNeilage and Davis’ claims and call for a re-examination of the Frame-then-Content theory in light of more carefully prepared corpora.



MacNeilage, P.F., and B. L. Davis. (2000). “On the origin of internal structure of word forms”. Science, 288, 527-531.

Identifying discourse functions through parallel text corpora

Pavlina Aldová


Since corpus searches rely predominantly on form, retrieving data on the basis of communicative function is a more complex task. Using a parallel English-Czech corpus, my paper will show that a parallel corpus can serve as a tool for identifying, relating and comparing the various constructions that express a discourse function, not only in the study of translation equivalents but also in the study of a language (e.g. English) as it naturally occurs. The approach is based on using original untranslated language as primary data, and employing translation equivalents as ancillary intermediate material only.

      To identify an array of means performing a given function, we start with a simple search in the English original using the selected marker to obtain a set of functional equivalents in the translated version. The results of this search are subsequently used for reversed searches to obtain other, functionally equivalent (or related) means in the original language (English). A wider list of means expressing, or related to, the given function can thus be compiled based on original untranslated language, extending our knowledge into the range of forms that are not prone to identification or comparison in the sense of a ‘specific form’ in English:


      Function F expressed in English by Form A (B, C)

      English (original)      Czech (translation)      English (original)

      Form A   →              Form X   →               Form A
                                                       Form B
                              Form Y   →               Form A,
                                                       Form C ...
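      The forward and reversed searches outlined above can be sketched on a toy set of aligned sentence pairs. The function name and the data layout are assumptions; the first three pairs reuse examples quoted in this abstract, the fourth is invented, and recognising which Czech form corresponds to the English marker is left to the analyst, as in the approach described.

```python
import re

def search(pairs, pattern, side):
    """Return the aligned pairs whose text on the given side matches pattern.
    side 0 = English original, side 1 = Czech translation."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [pair for pair in pairs if rx.search(pair[side])]

# (English original, Czech translation) pairs.
pairs = [
    ("Let him wonder", '"Ať si dělá hlavu."'),
    ('"Tell Bannister to come in."', '"Ať sem přijde Bannister."'),
    ("Can't he stop shouting?", "Ať neřve!"),
    ("She left early.", "Odešla brzy."),  # invented pair without the marker
]

# Step 1: forward search on the English marker (Form A).
forward = search(pairs, r"\blet\b", 0)

# Step 2: the analyst notes the Czech equivalent (Form X), here the particle ať.
# Step 3: reversed search on Form X collects other English forms (B, C, ...).
reverse = search(pairs, r"\bať\b", 1)
english_forms = [english for english, _ in reverse]
```

      The reversed search returns not only the original let construction but also the imperative with an infinitive and the negative question, which is how the wider list of means is compiled.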


      To illustrate this approach, various discourse functions will be searched: permission, exclamation, and directives for the third person. In English, the core means to express appeal to the third person subjects mediated through the addressee is let. Translation equivalents indicate that this function is performed by the Czech particles ať or nechť. Assessing the corresponding results for [[ať]] and [[nechť]] in Czech translations of original English texts, it seems possible to correlate the central means let (ex 1) with other constructions, e.g. modal verbs, negative yes/no questions (2), but also to observe a tendency of English to indicate the appeal mediated through the addressee explicitly, using the infinitive construction after the second person imperative (3), or through the use of causative constructions where the addressee is the beneficiary of the action (4), including the use of the passive, or formulaic imprecations (5), etc.



Czech (translation): "[[Ať]] si dělá hlavu."
English (original): Let him wonder

Czech (translation): [[Ať]] neřve!
English (original): Can't he stop shouting?

Czech (translation): "[[Ať]] se v Ramsey modlí za mou duši," pravila ...
English (original): ‘Ramsey may pray for my soul,’ she said serenely, ‘and that will settle all accounts.’

Czech (translation): "[[Ať]] sem přijde Bannister."
English (original): After a pause, "Tell Bannister to come in."

Czech (translation): [[Ať]] vám poví něco z těch pitomostí, kterým ...
English (original): Get her to tell you some of the stuff she believes.

Czech (translation): "[[Ať]] jde Eustace Swayne k čertu i se všemi s ...
English (original): He said explosively, "To hell with Eustace Swayne and all his works!"



Dušková, L. (2005) “Syntaktická konstantnost mezi jazyky” [Syntactic constancy across languages], Slovo a slovesnost 66, 243-260.

Hunston, S. (2006) “Corpus Linguistics”. In Encyclopedia of Language and Linguistics 2nd Edition, Elsevier, 234-248.

Izquierdo Fernández, M. (2007) “Corpus-based Cross-linguistic Research: Directions and Applications [James’ Interlingual linguistics revisited]”, Interlingüistica, 17, 520-527.


Using a corpus of simplified news texts to investigate features of the intuitive approach to simplification


David Allen


The current paper reports on the investigation of a corpus of news texts which have been simplified for learners of English. Following on from a previous study by the author (accepted by System, 2009), this paper details a number of quantitative and qualitative findings from the analysis which highlight the impact of simplification upon the linguistic features of texts, specifically relative clauses. Using quantitative methods alone is shown, in this case, to be insufficient to address the effects of simplification upon the surrounding text. A discussion of further quantitative approaches to the study of modified texts will then be presented. The study will be of interest to researchers interested in graded texts, the debate surrounding simplified vs. authentic materials in second language learning, and reading in a second language. The corpus is anticipated to serve a number of pedagogic and research purposes in the future.


Corpora and headword lists for learner’s dictionaries: The case of the Spanish Learner’s Dictionary DAELE


Araceli Alonso Campo and Elene Estremera Paños


A learner’s dictionary for non-native speakers differs from a general monolingual dictionary in several aspects: it covers less vocabulary, its definitions are usually written in a controlled vocabulary in order to facilitate comprehension, and it includes more grammatical information because the main target users, who do not have full command of the grammar of the language, need information about the use of words in a specific context. These aspects determine the selection of the main headwords to be included in the dictionary.

      Learner’s dictionaries for non-native speakers are based on the assumption that users look up words in the dictionary not only for decoding texts, but also for encoding. On the one hand, the headword list should constitute a representative sample of the common vocabulary of the average speaker of the language; on the other hand, it should include those lexical units which can cause grammatical problems in text production. In addition, the dictionary should include the words and expressions that learners need for oral communication (Nation 2004). 

      The English language has a significant tradition in learner’s dictionaries, which means that there are many studies on the selection of the vocabulary to be included in a dictionary for non-native learners. The Spanish lexicographic tradition, by contrast, has dedicated itself mainly to compiling monolingual dictionaries for native speakers, and as a result there are currently no studies available on how to obtain and select the headwords to be included in a Spanish learner’s dictionary.

      Our paper focuses on the selection of a list of 7,000 headwords (nouns and adjectives) for elaborating a prototype for a digital Spanish dictionary for learners. We first briefly describe the Spanish Learner’s Dictionary for Foreigners Project DAELE (Diccionario de aprendizaje del español como lengua extranjera), which is currently underway at Universitat Pompeu Fabra in Barcelona. Our dictionary project aims to include information which is contrastive with English, as many learners around the world have experience with English or are speakers of English before they start studying Spanish. We compare four lists based on frequency of use, two of which were based on corpora (El Corpus del Español by Mark Davies and the Corpus PAAU created at the Institut Universitari de Lingüística Aplicada of Universitat Pompeu Fabra), one of which was based on a lexical availability index (Ahumada 2006) and one of which was taken from a monolingual Spanish dictionary for native-speaking children (Battaner 1997). The comparison points up many differences in the distribution of nouns and adjectives, which may be summarized as follows:


* 9107 nouns and adjectives were found on only one list

* 3853 nouns and adjectives were found on two lists

* 2380 nouns and adjectives were found on three lists

* 836 nouns and adjectives were found on all four lists


We discuss why there should be such divergence across lists and, more importantly for our dictionary prototype project, the implications of using corpora to determine the headword list for learner’s dictionaries.
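A tally of the kind summarized above (how many headwords occur on exactly one, two, three or all four lists) can be produced with a short sketch such as the following. The function name and the four toy lists are invented for illustration and are unrelated to the project's actual data.

```python
from collections import Counter

def overlap_profile(lists):
    """Map k -> number of distinct words that appear on exactly k lists."""
    membership = Counter()
    for words in lists:
        for word in set(words):  # set(): count each list at most once per word
            membership[word] += 1
    return Counter(membership.values())

# Invented toy lists standing in for the four headword sources.
lists = [
    ["casa", "libro", "azul"],
    ["casa", "libro"],
    ["casa", "perro"],
    ["casa"],
]
profile = overlap_profile(lists)
```

On this toy input two words occur on one list only, one word on two lists and one word on all four; a large count for "one list only", as in the figures reported here, signals how strongly the sources diverge.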



Ahumada, I. (2006). El léxico disponible de los estudiantes preuniversitarios de la provincia de Jaén. Jaén: Servicio de Publicaciones de la Universidad de Jaén.

Battaner, P. (dir.) (1997). Diccionario de Primaria de la Lengua española Anaya-Vox. Barcelona: Vox-Biblograf.

Davies, M. (2006). A frequency dictionary of Spanish. New York: Routledge.

Nation, P. (2004) “A study of the most frequent word families in the British National Corpus”, in P. Bogaards, B. Laufer (eds) Vocabulary in a Second Language. Amsterdam: John Benjamins, 3-13.

Torner, S. and P. Battaner (eds) (2005) El corpus PAAU 1992: estudios descriptivos, textos y vocabulario. Barcelona: IULA, documenta universitaria.


Making it plainer: Revision tendencies in a corpus of edited English texts


Amanda Murphy


There are currently 23 official languages in the EU, but English is widely spoken as a lingua franca within its Institutions, and most official written texts – at least within the European Commission – are drawn up in English by non-native speakers. They are subsequently translated into the other 22 official EU languages. Since the setting up of the Editing Unit in 2005, a considerable number of these texts have been voluntarily sent to native-speaker editors, who revise the texts before they are published and/or translated.

      The present research examines some of the revisions in a one-million token corpus of documents from the Editing Unit. On a sample of the documents, all the revisions were classified by hand, and then examined using Wordsmith Tools (Scott 2006).

      While many revisions concern objective grammatical mistakes, this paper focuses on two other aspects of language naturalness, one lexical and one syntactic. Regarding lexis, the paper focuses on revisions of common colligational-collocational patterns such as transitive verb + direct object, where the editor changes the verb. Examples of this type of revision include:


* To perform responsibilities (edited version: to fulfil responsibilities);

* To give a contribution (edited version: to make a contribution);

* To take a commitment (edited version: to make a commitment);

* To perform a study (edited version: to conduct a study);

* To match gaps (edited version: to fill gaps);

* To realize targets (edited version: to meet targets).


      Noun phrases where both the pre-modifying adjective and noun are substituted with a more natural-sounding pair are also examined. Examples of these include great discussion (changed to major debate) or important role (changed to big part).

      The second focus is on revisions beyond the phrase. By examining larger chunks of text, whole sentences or groups of sentences, two opposing tendencies can be observed: on the one hand, simplification, which shortens and ‘straightens out’ sentences, and on the other, explicitation, which adds elements for the sake of clarity, and generally lengthens sentences. 

      Reflections are made on the potential of using such texts for teaching advanced level students to write more plainly, along the lines of Cutts (2004) and Williams (2007).



Cutts, M. (2006). The Plain English Guide. Oxford: Oxford University Press.

Scott, M. (2006). Wordsmith Tools. Version 4.0. Oxford: Oxford University Press.

Williams, J. M. (1997). Style: Toward Clarity and Grace. Chicago: The University of Chicago Press.




Word order phenomena in spoken French: A study on four corpora of task-oriented dialogue and its consequences for language processing.


Jean-Yves Antoine, Jérome Goulian, Jeanne Villaneau and Marc Le Tallec


We present a corpus study that investigates word order variations (WOV) in a spoken language (namely French) and their consequences for the parsing techniques used in Natural Language Processing (NLP). It is common practice to distinguish between free and rigid word order languages (Hale 1983, Covington 2000). This presentation examines the question along two lines:

1) We investigate whether language registers (Biber 1988) have an influence on WOV. French is usually considered a rigid order language. However, our observations on task-oriented interactions tend to show that spontaneous spoken French presents a higher variability. This study aims at quantifying this influence.

2) WOV should result in the presence of syntactic discontinuities (Hudson 2000; Bartha et al. 2006) which cannot be parsed by projective parsing formalisms (Holan et al. 2000). Given a particular interactive situation (task-oriented Human-Machine Interaction), we estimate the frequency of occurrence of discontinuities in spoken French.

Our study investigates WOV in four task-oriented spoken dialogue corpora which concern three different application tasks:

* air transport reservation (Air France Corpus)

* tourism information (Murol corpus and OTG corpus)

* switchboard call (UBS corpus)

Two corpora (Air France, UBS) consist of phone conversations, while the other two correspond to direct interaction.

Every word order variation has been manually annotated by 3 experts, following a cross-validation procedure. The annotation gives a detailed description of the observed variations: type (inversion, extraction, cleft sentence, pseudo-cleft sentence), direction (ante-position or post-position), syntactic function of the extracted element and the possible presence of a discontinuity.

This study shows that conversational spoken French is affected by a high rate of WOV. The occurrence of word order variations increases significantly with the degree of interactivity of the spoken dialogue. For instance, around 26.6% of the speech turns of the Murol corpus (informal interaction with a high interactivity) present a word order variation, while the other corpora (moderate interactivity) are less affected (from 11.5% to 13.6% of the speech turns). This difference is statistically significant.

Despite this observation, our results show that task-oriented spontaneous spoken French should still be considered a rigid order language: most of the observed variations correspond to weak variations (Holan et al. 2000), which very rarely result in discontinuous syntactic structures (from 0.2% to 0.4% of the speech turns, depending on the corpus considered). Non-projective parsers therefore remain well adapted to conversational spoken French.

Besides this important result, this study shows that WOV follow some striking regularities:

* antepositions are preferred to postpositions (82.5% to 91.1% of the variations) and usually respect the SVO order (98.0% to 98.7% of the speech turns)

* subjects are significantly more affected (25.4% to 42.8% of the observed variations) than objects (5.3% to 15.4%), while order variations also concern modifiers and phrasal complements.

* most subject WOV (95.4% to 100%) are lexically (pronoun) or syntactically (cleft or pseudo-cleft sentence) marked, while modifier variations usually result in a simple inversion (84.9% to 96.8%).



Bartha, C., T. Spiegelhauer, R. Dormeyer and I. Fischer. (2006). “Word order and discontinuities in dependency grammar”. Acta Cybernetica, 17(3), 617-632.

Biber, D. (1988). Variation across speech and writing, Cambridge University Press, Cambridge.

Holan, T., O. Kubon and M. Plátek. (2000). “On complexity of word order.” TAL, 41(1), 273-300.

Arabic and Arab English in the Arab world


Eric Atwell


A broad range of research is undertaken in Arab countries, and researchers have carried out and reported on their research using either English or Arabic (or both). English is widely accepted as the international language of science and technology, but little formal research has been done on the English used in the Arab world. Is it dominated by British or American English influences? Or is it a recognisable regional variant, on a footing with Indian English or Singapore English?

            Arabic is the first language of most Arabs, but contemporary use shows noticeable variation from Modern Standard Arabic. Some Arab researchers may feel doubly stigmatised, in that their local variant of Arabic differs from MSA, and their English differs from UK or US standard English.

            Al-Sulaiti and Atwell (2006) used the World Wide Web to gather a representative selection of contemporary Arabic texts from a wide range of Middle Eastern and North African countries, approximately one million words in total. We have used WWW sources to gather selections of contemporary English texts from across the Arab region. We have collected a Corpus of Arab English, parallel to the International Corpus of English used to study national and regional varieties of English. We then used text-mining tools to compare these against UK and US English standards, to show that some countries in the Arab world prefer British English, some prefer American English, and some fit neither category straightforwardly.

            Natural Language Processing and Corpus Linguistics research has previously focussed on British and American English, and applying these techniques to Arab English presents interesting new challenges. I will present our resources to the Corpus Linguistics audience, to elicit useful applications of these resources in Arab English research, and also to receive requests and/or suggestions for extensions to our corpus which the community might find useful. My goal is to document the contemporary use of Arab English and Arabic across the Arab world, and develop computational resources for both; and to raise the status of both Arab English and Arabic, so they are recognised as different but equal alongside American English and British English in the Arab world and beyond.



Al-Sulaiti, L. and E. Atwell. (2006). “The design of a corpus of contemporary Arabic”. International Journal of Corpus Linguistics, 11, 135-171.



Civil partnership – “Gay marriage in all but name”!? Uncovering discourses of same-sex relationships


Ingo Bachmann


It has been demonstrated (Gabrielatos and Baker 2008, Teubert 2001, Baker 2005) that corpus linguistic techniques are an appropriate means of uncovering and critically analysing discourses – be that discourses of refugees and asylum seekers, of Euro-sceptics or of gay men.

      The present paper aims to contribute to the study of discourses of gay men, but with a shifting focus: whereas so far gay (single) men and the question of identity were concentrated on (Baker 2005), I want to show how gay men (and to some extent lesbian women) in relationships are represented and what discourses of same-sex relationships are prevalent.

      As a suitable starting point for such an analysis I singled out the debates in the Houses of Parliament in 2004 which led to the introduction of civil partnerships in the United Kingdom. Since the end of 2005 gay and lesbian couples in the UK have been able to register their relationship as a civil partnership, which gives them more or less the same rights and responsibilities that heterosexual married couples enjoy. As the concept of civil partnership had not existed before, it was during these debates that it was decided whether civil partnerships should be introduced at all, what rights and responsibilities they entail and who is actually allowed to register for them.

      After compiling a corpus containing all the relevant debates in both Houses, a statistical keyword analysis (WordSmith Tools) was conducted. These keywords are categorised and they cluster around ideas such as

* the concept of a relationship (long-term, committed)

* the controversial issue of whether civil partnership is actually gay marriage in all but name and whether it undermines the institution of marriage

* legal implications of the bill (tax, equality, rights)

* identity and sexual orientation – same-sex emerges as a new term next to homosexual, gay and lesbian, reflecting a shift from a focus on sexual acts via a person's identity to the relationship of two people.

Keywords from each cluster are investigated in context to try and uncover discourses of civil partnership.
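The keyword step can be illustrated with a minimal sketch. The abstract does not say which keyness statistic was selected in WordSmith Tools, so the log-likelihood measure used here is an assumption, and the function names, the 3.84 threshold (p < 0.05) and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(freq_study, freq_ref, size_study, size_ref):
    """Log-likelihood keyness for one word (assumed statistic, see lead-in)."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll


def keywords(study_tokens, ref_tokens, threshold=3.84):
    """Return (word, score) pairs overused in the study corpus, best first."""
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    n_study, n_ref = len(study_tokens), len(ref_tokens)
    scored = []
    for word, freq in study.items():
        score = log_likelihood(freq, ref[word], n_study, n_ref)
        overused = freq / n_study > ref[word] / n_ref
        if overused and score >= threshold:
            scored.append((word, score))
    return sorted(scored, key=lambda pair: -pair[1])


if __name__ == "__main__":
    # Toy data only; a real study would pass tokenised debate transcripts.
    debates = ["partnership"] * 20 + ["the"] * 80
    reference = ["the"] * 100
    print(keywords(debates, reference))
```

Words that pass the threshold would then be grouped by hand into clusters such as those listed above; the grouping itself is an interpretive step that the sketch does not attempt.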



Baker, P. (2005). Public discourses of gay men. London: Routledge.

Gabrielatos, C. and P. Baker (2008). "Fleeing, Sneaking, Flooding: A Corpus Analysis of Discursive Constructions of Refugees and Asylum Seekers in the UK Press, 1996-2005." Journal of English Linguistics, 36 (1), 5-38.

Teubert, W. (2001). "A province of a federal superstate, ruled by an unelected bureaucracy. Keywords of the Euro-sceptic discourse in Britain." In A. Musolff, C. Good, P. Points and R. Wittlinger (eds) Attitudes towards Europe. Language in the Unification Process. Aldershot: Ashgate, 45-88.

The British English '06 Corpus - using the LOB model to build a contemporary corpus from the internet.

Paul Baker


This paper describes the collection of the BE06 corpus, a one million word corpus of contemporary written British English texts which followed the same sampling framework as the LOB and Brown corpora. Problems associated with collecting 500 texts via the internet (the texts were all previously published in paper form) are described, and some preliminary findings from the corpus data are given, including comparisons of the data to equivalent corpora from the 1930s, 1960s and 1990s. The paper also addresses the issue of whether diachronic corpora can ever be really 'equivalent' because genres of language do not remain constant.


Corpus tagging of connectors: The case of Slovenian and Croatian academic discourse


Tatjana Balažic Bulc and Vojko Gorjanc


Two issues stand out as a special challenge in corpus research on connectors in Slovenian and Croatian academic discourse. The first issue concerns tagging connectors, which explicitly signal textual cohesion and also establish logical or semantic connections between parts of a text through their meaning and function and, at the same time, indicate the relations between the constituent parts of a text. Our research used a corpus-based approach; that is, we did not tag connectors according to a predefined list, but instead defined them on the basis of two corpora constructed for this purpose, comprised of 19 Slovenian and 17 Croatian original scholarly articles in linguistics. In the first phase, connectors were tagged manually and therefore the sample is somewhat smaller. In the second phase, we tested various methods of automatic connector tagging, making it possible to analyze a larger corpus and yielding more relevant statistical data. We continued by comparing variation in the frequency of hits for both tagging methods. As in other languages, automatic tagging in Slovenian and Croatian is hindered by morphosyntactic and discourse ambiguities because certain connectors are synonymous with conjunctions, adverbs, and so on; there are also a considerable number of synonymous connectors that serve various functions in a text. The relevance of hits with automatic tagging in Slovenian and Croatian is also encumbered by the inflectability of these parts of speech and quite flexible word order, and so the validity of hits using automatic tagging is especially questionable. Automatic tagging also indirectly changes the research methodology because a corpus tagged in advance is already marked by a particular theory, which makes a corpus-based approach impossible.

            Tagging connectors also raises a second issue: text segmentation. Specifically, it turned out that the function and position of connectors is considerably more transparent within smaller text units than in an entire text. Various attempts at text segmentation exist in corpus linguistics; for example, the Linguistic Discourse Model (LDM), Discourse GraphBank Annotation Procedure, and the Penn Discourse Treebank (PDTB). These all have different theoretical premises, but they all define the boundaries of individual text segments using orthographic markers such as periods, semicolons, colons, and commas. Our study also confirmed that investigating connectors as explicit links between two sentences is not sufficient because they also often appear within a sentence. In this sense we segmented the text into smaller units that we refer to as utterances and we understand these as syntactically and semantically complete contextualized units of linguistic production, whether spoken or written.


A comparative study of binominals of the keyword democracy in corpora of the US and South Korean press


Minhee Bang and Seoin Shin


This paper presents findings of a comparative analysis of collocational patterns of the word democracy with a focus on noun collocates occurring as part of binominals (e.g. democracy and freedom) in the corpora of two US and South Korean newspapers. Democracy is identified as one of the keywords of the US newspaper corpus of 42 million words, which comprises foreign news reports taken from the New York Times and Washington Post from 1999 to 2003. This illustrates that democracy is a major issue in the context of foreign news reporting by the US press. The analysis of noun phrases occurring with democracy in the US newspaper corpus shows different degrees in the strength of the semantic link of the noun phrases with democracy. The semantic link between democracy and noun phrases such as freedom or human rights is relatively established and uncontroversial, whereas the semantic link is less transparent in democracy and noun phrases such as stability or independence. Semantically, by being coordinated as a binominal, it is possible that a semantic link is ‘manufactured’ between democracy and these noun phrases, such that democracy is contingent on the things represented by these noun phrases, or democracy brings about these things. The analysis also shows that certain noun phrases tend to be associated with references to certain countries or regions (for example, noun phrases referring to a capitalist economy, such as free market economy, most frequently occur in the context of Russia). The noun phrases paired with democracy seem to reveal where the US interest is vested in its relations with different foreign countries. It is thus hypothesised that noun phrases occurring as binominals with democracy may differ in newspapers from other countries, reflecting the countries’ own perspective and interest.
In order to test this hypothesis, binominals of democracy are analysed in a corpus of news articles collected from two South Korean newspapers, the Dong-A Ilbo and the Hankyoreh, which are two of the most influential broadsheet newspapers in South Korea. The corpus covers the same period, 1999 to 2003, and is of a relatively modest size of 200 thousand words. The analysis will show that there are considerable differences in frequency and types of noun phrases occurring with democracy in the South Korean context. It is argued that democracy, as a positively connoted ‘banner’ word (Teubert, 2001: 49), is used to highlight and justify what is deemed to be in the country’s own economic and political interest.


Analysing the output of individual speakers


Michael Barlow


Virtually all corpus linguistic analyses have been based on corpora such as the British National Corpus, which are an amalgamation of the writing or speech of many individuals, an exception being Coniam (2004) which examined the writing of one individual.

            This presentation examines the speech of individuals in order to (i) determine the extent of the differences in the patterns found in an amalgamated corpus and an idiolectal corpus; (ii) investigate whether individual speakers exhibit a characteristic fingerprint; and (iii) assess the nature of the variation from one speaker to the next.

            To pursue this study, the speech of five White House press secretaries was examined.

One advantage of analysing White House press conferences is the fact that a large number of transcripts are available and the context in which the speech is produced is held constant across the different speakers. The sample covered around a year of output for each speaker. The transcripts were downloaded from the web, transformed to a standard format, and POS-tagged. In addition, the speech of all participants other than the press secretary was removed, making for the rather odd format shown below.


<PERINO> I'll let the DNI's office disclose that if they want to. We never --


<PERINO> I don't know.

The data was divided into slices of 200,000 words. The amount of data available for each of the speakers ranged from 200,000 words for two speakers to 1,200,000 words for one of the speakers. This format made it easy to compare inter- and intra-speaker variability. The results of bigram and trigram analyses reveal the distinctive patterns of speech of each speaker.

If, for example, we examine the ranked bigram profiles for four speakers, we can see differences in use of the most frequent bigrams -- the president, of the, going to, in the, I think, and the, to the, that we, on the, that the, think that, to be, I don't, continue to, and we have. In these graphs, the y axis represents rank, which means that the shorter the line, the more frequent the bigram. The lines extending all the way down represent bigrams with a ranking greater than 60. Each line on the x axis represents the rank of one bigram. The left side of the diagram displays the profiles for three samples of the output of Mike; the right-hand side contains profiles of three different press secretaries. It can be seen that the bigram profiles for the speaker Mike are very similar across the different 200,000-word samples and that these profiles differ from those of the other speakers shown.
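A ranked bigram profile of the kind described above could be computed along the following lines. This is a sketch under stated assumptions rather than the author's procedure: the function names are hypothetical, tokenisation is taken as given, and ties in frequency are broken by order of first occurrence. The cap of 60 mirrors the ranking cut-off mentioned for the graphs.

```python
from collections import Counter


def slices(tokens, size):
    """Split a token list into consecutive, non-overlapping slices of `size`."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def bigram_ranks(tokens):
    """Map each bigram to its frequency rank (1 = most frequent)."""
    counts = Counter(zip(tokens, tokens[1:]))
    ranked = [bigram for bigram, _ in counts.most_common()]
    return {bigram: rank for rank, bigram in enumerate(ranked, start=1)}


def rank_profile(tokens, reference_bigrams, cap=60):
    """Rank of each reference bigram in `tokens`; ranks beyond `cap`
    (or unseen bigrams) are all reported as cap + 1."""
    ranks = bigram_ranks(tokens)
    return [min(ranks.get(bigram, cap + 1), cap + 1) for bigram in reference_bigrams]


if __name__ == "__main__":
    # Toy sample; a real profile would use one 200,000-word slice per call.
    sample = "the president said the president is going to the president".split()
    reference = [("the", "president"), ("going", "to"), ("I", "think")]
    print(rank_profile(sample, reference))
```

Running `rank_profile` on each slice of one speaker and on slices from other speakers, with the same list of reference bigrams, gives vectors that can be compared within and between speakers.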

            The theoretical consequences of these and other results will be discussed in the paper.



Coniam, D. (2004). “Concordancing oneself: Constructing individual textual profiles”. International Journal of Corpus Linguistics, 9 (2), 271–298.

Automatic standardization of texts containing spelling variation:

How much training data do you need?


Alistair Baron and Paul Rayson


Spelling variation within corpora results in a considerable effect on corpus linguistic techniques; this has been shown particularly for Early Modern English, the latest period of the English language to contain large amounts of spelling variation. The accuracy of key word analysis (Baron et al., 2009), part-of-speech annotation (Rayson et al., 2007) and semantic analysis (Archer et al., 2003) are all adversely affected by spelling variation.

            Some researchers have avoided this issue by using modernized versions of texts; for example, Culpeper (2002) opted to use a modern edition of Shakespeare’s Romeo and Juliet in his study to avoid spelling variation affecting his statistical results. However, modern editions are not always readily available for historical texts. Another potential solution is to manually standardize the spelling variation within these texts, although this approach is likely to be unworkable with the large historical corpora now being made available through digitization initiatives such as Early English Books Online.

            The solution we offer is a piece of software named VARD 2. The tool can be used to manually and automatically standardize spelling variation in individual texts or corpora of any size. An important feature of VARD 2 is that through manual standardization of corpus samples by the user, the tool “learns” how best to standardize the spelling variation in a particular corpus being processed. This results in VARD 2 being better equipped to automatically standardize the remainder of the corpus. After automatic processing, corpus linguistic techniques can be used with far more accuracy on the standardized version of the corpus, and this avoids the need for difficult and time-consuming manual standardization of the full text.

            This paper studies and quantifies the effect of training VARD 2 with different sample sizes, calculating the optimum sample size required for manual standardization by the user. To test the tool with Early Modern English, the Innsbruck Letters corpus, part of the Innsbruck Computer-Archive of Machine-Readable English Texts (ICAMET) corpus (Markus, 1999), has been used. The corpus is a collection of 469 complete letters dated between 1386 and 1688, totaling 182,000 words. The corpus is particularly useful here as it has been standardized and manually checked with parallel lines of original and standardized text.
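Because the corpus provides parallel original and standardized text, automatic output can be scored token by token against the manual standardization. The sketch below assumes the three versions are already aligned one token to one token; the function name and the precision/recall definitions are illustrative assumptions, not the evaluation measures reported by the authors, and no VARD 2 interface is used.

```python
def standardization_scores(original, predicted, gold):
    """Score automatic standardization against a gold standard.

    precision: share of changed tokens that match the gold form
    recall:    share of true variants (original != gold) that were fixed
    """
    if not (len(original) == len(predicted) == len(gold)):
        raise ValueError("token lists must be aligned and of equal length")
    variants = changed = correct = 0
    for orig_tok, pred_tok, gold_tok in zip(original, predicted, gold):
        if orig_tok != gold_tok:
            variants += 1
        if pred_tok != orig_tok:
            changed += 1
            if pred_tok == gold_tok:
                correct += 1
    precision = correct / changed if changed else 0.0
    recall = correct / variants if variants else 0.0
    return precision, recall


if __name__ == "__main__":
    # Invented example tokens, not taken from the Innsbruck Letters corpus.
    original = ["goe", "to", "bedde", "and", "slepe"]
    predicted = ["go", "to", "bedde", "and", "sleep"]
    gold = ["go", "to", "bed", "and", "sleep"]
    print(standardization_scores(original, predicted, gold))
```

Repeating such a scoring after training on samples of increasing size would trace how performance changes with the amount of manual standardization supplied.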

            Whilst VARD 2 was built initially to deal with spelling variation in Early Modern English, it is not restricted to this particular variety of English, or indeed to the English language itself. Many language varieties other than historical ones contain spelling variation; these include first and second language acquisition data, SMS messaging, weblogs, emails and other web-based communication platforms such as chat-rooms and message boards. VARD 2 has the potential to deal with this large variety of spelling variation in any language, although training will be required for the tool to “learn” how best to automatically standardize corpora. The second part of this paper evaluates VARD 2’s capability in dealing with one of these language varieties, namely a corpus of children’s written English.



Archer, D., T. McEnery, P. Rayson and A. Hardie (2003). “Developing an automated semantic analysis system for Early Modern English”. In D. Archer, P. Rayson, A. Wilson and T. McEnery (eds) Proceedings of the Corpus Linguistics 2003 conference. UCREL technical paper number 16, 22-31.

Baron, A., P. Rayson and D. Archer (2009). “Word frequency and key word statistics in historical corpus linguistics”. In R. Ahrens and H. Antor (eds) Anglistik: International Journal of English Studies, 20 (1), 41-67.

Culpeper, J. (2002). “Computers, language and characterisation: An analysis of six characters in Romeo and Juliet”. In U. Merlander-Marttala, C. Ostman and M. Kytö (eds) Conversation in Life and in Literature: Papers from the ASLA Symposium, 15. Uppsala: Universitetstryckeriet, 11-30.

Rayson, P., D. Archer, A. Baron, J. Culpeper and N. Smith (2007). “Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English Corpora”. In Proceedings of Corpus Linguistics 2007, July 27-30, University of Birmingham, UK.

Fat cats, bankers, greed and gloom - the language of financial meltdown


Del Barrett


This paper offers a two-part dissection of the language used in press reporting of the credit crunch.

            The starting point for part 1 is a keyword analysis of two general corpora comprising news items from the tabloid and broadsheet press.  This is then used as a control against which the discourse of the credit crunch is analysed through an investigation of keywords, pivotal words and collocates.  The initial hypothesis was that the language of the broadsheets is beginning to show characteristics normally associated with the tabloid press, particularly through headlines designed to sensationalise. In testing this hypothesis, it was revealed that while the broadsheets are certainly becoming more tabloid through their lexical choices, the tabloids are also changing and have started to adopt broadsheet characteristics in their reporting of ‘systemic meltdowns’ and ‘financial turbulence’.

            The second part of the paper examines the way that the ‘credit crunch’ itself is portrayed.  Using a frame-tracing technique, which combines elements of frame theory and (critical) discourse analysis, the study identifies a number of ‘discourse scenarios’ which are constructed through collocate chains.  Investigation of these scenarios reveals a clear distinction between the way in which the broadsheets and the tabloids view the current financial crisis.  The former invokes a strong storm metaphor indicating that the credit-crunch is a force of nature and that we are all helpless victims, whilst the latter sees the crisis as a war that has been declared by an enemy.  Again, we are helpless victims as the other side seems to be winning, but there are strategies that can be employed to win some of the battles.  In other words, the broadsheets do not apportion blame for the crisis, whereas the tabloids place the blame firmly on the shoulders of the ‘Other’.  It is particularly interesting that the ‘Other’ is not one of the minority groups commonly found in critical discourse studies, but rather the ‘fat cat bankers’ – a label ascribed to anyone working in the financial sector. A number of other lesser scenarios are also uncovered, in which the credit-crunch is reified as a thug, a monster, a villain and a pandemic.




A corpus analysis of English article use in Chinese students’ EFL writing


Neil Barrett, Zhi-jun Yang, Yue-wen Wang and Li-Mein Chen


It is well known that many English learners have trouble using the English article system, and it is even more troublesome for students whose first language does not contain an article system, such as Mandarin Chinese. This can cause problems in learners’ academic writing, where accuracy is lost due to article errors. In order to investigate these article problems in academic writing, a corpus-based coding system was developed to analyze Chinese EFL students’ compositions. This was based on a larger learner corpus built to investigate Chinese EFL learners’ coherence problems in academic writing (AWTA learner’s corpus). First, over 600 essays were collected from 1st, 2nd and 3rd year undergraduate students who studied English composition. Following this, an article coding system was constructed using Hawkins’ (1978) Location Theory with adaptations based on research by Robertson (2000) and Moore (2004). This system identifies the pragmatic, semantic and syntactic noun phrase environments, including tags for other determiners, adjectives and numerals which have been identified (Moore, 2004) as having an effect on article errors. The coding system was used for a preliminary analysis of article usage and article errors in 30 papers taken from the learner’s corpus. Results show that all learners were able to distinguish between definite and indefinite contexts in terms of article use, but problems occurred in 24% of the noun phrases. Two main error types were identified, the first of which involves a misanalysis of noun countability for the indefinite article and for the null article. For example: ‘This didn’t help her grandpa acquire admirable education’. The second error was the overuse or underuse of the definite article. This error frequently occurred in generic, plural nouns and non-count nouns where a native speaker would typically use the null article. The following example illustrates this: ‘the modern women are different to the women in the past’.
All the data and the coding system can be accessed online from the learner’s corpus, providing an additional resource for teachers to help identify article errors in students’ writing.



Patterns of verb colligation and collocation in scientific discourse


Sabine Bartsch


Studies of scientific discourse tend to focus on those linguistic units that are deemed to convey information regarding scientific concepts. In light of the observation that a high frequency of nouns, nominalizations and noun phrases is characteristic of scientific writing, these linguistic units have received a lot of due attention. Yet, it is indisputable that verbs are likewise central to the linguistic construal and communication of meaning in scientific discourse, not only because many verbs have domain-specific meanings, but also in view of the fact that verbs establish the relations between concepts denoted by nouns (cf. Halliday & Martin 1993).

            This paper presents a corpus-study of the lexico-grammatical and lexical combinatorial properties of verbs in scientific discourse, i.e. colligation and collocation (cf. Hunston & Francis 2000; Sinclair 1991; Hoey 2005; Gledhill 2000). The study is based on a corpus of scientific writing from engineering, the natural sciences and the humanities. The focus of the paper is on the contribution of verbs of communication (e.g. 'explain', 'report') and construal of scientific knowledge (e.g. 'demonstrate', 'depict') to the textual rendition of scientific meaning. The classification of the verbs is based on Levin (1993) and the FrameNet project.

            Many of these verbs are of interest as an object of research on scientific texts because of their discursive function in the communication of the scientific research process. Their occurrence in specific colligational patterns is indicative of their function. For example, 'that'-clauses following communication verbs can be shown to be frequently employed in the reporting of research results, whereas 'to'-infinitive clauses tend to present the scientist's ideas or statements of known facts.

            Furthermore, verbs of the types under study here play a central role in establishing connections between what is said in the natural language text and other modalities such as graphics, diagrams, tables and formulae which are commonly employed in scientific text in order to present experimental settings, quantitative data or symbolic representations ('Figure 1.1 demonstrates ...'). They are thus instrumental in the establishment of relations between different modalities in multimodal texts, which is also reflected in specific colligational patterns. Another aspect of interest in the study of verbs of communication and construal of scientific knowledge is their patterns of lexical co-occurrence, i.e. their specific patterns of collocation. Many of the verbs under study display characteristic patterns of lexical collocation which highlight domain-specific aspects of their potential meaning (e.g. ‘represent the arithmetic mean’).

Results of this study include qualitative and quantitative findings about characteristic patterns of colligation of the verbs under study as well as a profile of recurrent patterns of collocation. The study also reveals some interesting semantic patterns in the characteristic participant structure of these verbs in the domains represented in the corpus.



FrameNet Project (URL: Last visited: 20.01.2009.

Gledhill, C. (2000). “The discourse function of collocation in research article introductions”, English for Specific Purposes, 19, 115-135.

Halliday, M. A. K. and J.R. Martin. (1993). Writing Science. Literacy and Discursive Power. Washington, D.C.: The Falmer Press.

Hoey, M. (2005). Lexical Priming: A New Theory of Words and Language. London: Routledge.

Hunston, S., and G. Francis. (2000). Pattern Grammar: A corpus-driven approach to the lexical grammar of English. Amsterdam: John Benjamins.

Kjellmer, G. (1991). "A mint of phrases." In K. Aijmer and B. Altenberg (eds) English Corpus Linguistics. London, New York: Longman, 111–127.

Levin, B. (1993). English Verb Classes and Alternations. A Preliminary Investigation. Chicago: The University of Chicago Press.

Putting corpora into perspective: Sampling strategies for synchronic corpora


Cyril Belica, Holger Keibel, Marc Kupietz, Rainer Perkuhun and Marie Vachková


The scientific interest of most corpus-based work is to extrapolate observations from a corpus to a specific language domain. In theory, such inferences are only justified when the corpus constitutes a sufficiently representative sample of the language domain. Because for most language domains the representativeness of a corpus cannot be evaluated in practice, the sampling of corpora usually seeks to approximate representativeness by intuitively estimating some qualitative and quantitative properties of the respective language domain (e.g., distribution and proportions of registers, genres and text types) and requiring the corpus to roughly display these properties, as far as time, budget and other practical constraints permit.

            When the object of investigation is some present-day language domain (e.g., contemporary written German), the question arises how its dependency on time is to be represented in a corresponding synchronic corpus. A practical solution typically taken is to consider only language material that was produced in some prespecified time range (e.g., 1964-1994 for the written component of the British National Corpus). Sometimes, the corpus is additionally required to be balanced across time (i.e., to contain roughly the same amount of texts or running words for each year).

            However, this practical solution is not satisfactory: as language undergoes continuous processes of change, more recent language productions are certainly better candidates for being included in a synchronic corpus than are older ones. On the other hand, restricting a corpus to the most recent data (e.g., to language productions of the last six months) would exclude a lot of potential variety in language use and capture only those phenomena that happened to have a sufficient number of occasions to become manifest in this short time period. Consequently, rare phenomena might not appear at all in an overly short time period, and more frequent phenomena might be over- or underrepresented, due to statistical and language-external factors rather than to language change. This is clearly not an appropriate solution either: the relevance of phenomena as part of a given synchronic language domain does not suddenly decrease when their observed frequency has decreased recently, and in particular, rare phenomena do not suddenly cease to be a part of the synchronic language domain just because they did not occur at all in the available sample of recent texts.

            In order to reconcile the effects of language change with statistical and language-external effects, this paper proposes a solution that does without any sharp cut-off and instead takes a vanishing point perspective which posits a gradually fading relevance of time slices with increasing “age”. The authors argue from a corpus-linguistic point of view (cognitive arguments are presented elsewhere) that for synchronic studies such a vanishing point perspective on language use is a more adequate approach than the common bird's eye view where all time slices are weighted equally. The authors further present language-independent explorations on how a vanishing point perspective can be achieved in corpus-linguistic work, by what formal functions the relevance of time slices is to be modelled and how the choice of a particular function may be justified in principle. Finally, they evaluate, for the case of contemporary written German, the empirical consequences of using a synchronic corpus in the above sense for phenomena such as simple word frequency, collocational patterning and the complex similarity structure between collocation profiles.
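The contrast between equal weighting and a gradually fading weighting of time slices can be made concrete with a small sketch. The abstract leaves open which formal function models the fading relevance, so the exponential decay with a half-life used here is purely an assumption for illustration, as are the function names and the toy figures.

```python
def slice_weight(age_years, half_life=10.0):
    """Relevance weight of a time slice; halves every `half_life` years.
    Exponential decay is an assumed choice, not the authors' function."""
    return 0.5 ** (age_years / half_life)


def weighted_relative_frequency(counts_by_year, sizes_by_year,
                                current_year, half_life=10.0):
    """Relative frequency of an item with older slices weighted down."""
    numerator = denominator = 0.0
    for year, size in sizes_by_year.items():
        weight = slice_weight(current_year - year, half_life)
        numerator += weight * counts_by_year.get(year, 0)
        denominator += weight * size
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Toy figures: an item attested only in the older of two equal slices.
    counts = {2000: 10, 2010: 0}
    sizes = {2000: 1000, 2010: 1000}
    equal = sum(counts.values()) / sum(sizes.values())
    fading = weighted_relative_frequency(counts, sizes, current_year=2010)
    print(equal, fading)
```

In the toy figures the item still contributes to the estimate although it is absent from the most recent slice, only with reduced weight, which is the behaviour the argument above asks for in place of a sharp cut-off.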





A parsimonious measure of lexical similarity for the characterisation of latent stylistic profiles


Edward J. L. Bell, Damon Berridge and Paul Rayson


Stylometry is the study of the computational and mathematical properties of style. The aim of a stylometrist is to derive accurate stylometrics and models based upon those metrics to gauge stylistic propensities. This paper presents a method of formulating a parsimonious stylistic distance measure via a weighted combination of both non-parametric and parametric lexical stylometrics.

            Determining stylistic distance is harder than black-box classification because the algorithm must not only assign texts to a probable category (are the works penned by the same or different authors); it must also provide the degree of stylistic similarity between texts. The concept of stylistic distance was first indirectly broached by Yule (1944) when he derived the characteristic constant K from vocabulary frequency distributions. Recent work has focused on judging distance by considering the intersection of lexical frequencies between two documents (Labbé, 2007).                                                               
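For orientation, Yule's characteristic K can be computed from a frequency spectrum as K = 10^4 (Σ m² V_m − N) / N², where V_m is the number of word types occurring exactly m times and N is the number of tokens. The sketch below illustrates that single statistic only; it is not the composite measure proposed in this paper, and the function name and whitespace tokenisation are assumptions.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's characteristic K for a list of word tokens."""
    n_tokens = len(tokens)
    if n_tokens == 0:
        raise ValueError("need at least one token")
    word_freqs = Counter(tokens)
    # spectrum[m] = number of word types that occur exactly m times (V_m)
    spectrum = Counter(word_freqs.values())
    sum_m2_vm = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (sum_m2_vm - n_tokens) / (n_tokens * n_tokens)


if __name__ == "__main__":
    text = "it was the best of times it was the worst of times"
    print(yules_k(text.split()))
```

K is 0 when every token is a distinct type and grows as a text repeats its vocabulary more heavily.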

            We tackle the problem with a ratio of composite predictors: the higher the ratio, the more the styles diverge. The coefficients of the authorship predictor are estimated using Powell’s conjugate gradient method of function minimisation (Powell, 1964) on a corpus of 19th Century literature.

            The corpus comprises 248 fictional works by 103 authors and totals around 40 million words.

            The utility of the selected stylometrics is initially demonstrated by performing an exploratory analysis on the most prolific authors in the corpus (Austen, Dickens, Hardy and Trollope). The exploratory model shows that each author is characterised by a unique set of stylometric values. This unique set of values corresponds to a lexical subset of the author’s latent stylistic profile.

            After estimating the authorship ratio coefficients, the metric is used to compare large samples of text and empirically assess the stylistic distance between the authors of those samples. The discrimination ability of the method proves accurate over 30,000 binary comparisons and rivals the discernment aptitude of established techniques, namely support vector machines. Finally, it is shown that the proposed authorship ratio fares well in terms of discriminatory aptitude and sample size invariance when compared to the intertextual distance metric (Labbé and Labbé, 2001).



Labbé, C. and D. Labbé (2001). “Intertextual distance and authorship attribution -- Corneille and Moliere”. Journal of Quantitative Linguistics, 8(19):213–231.

Labbé, D. (2007). “Experiments on authorship attribution by intertextual distance in English”. Journal of Quantitative Linguistics, 14(48):33–80.

Powell, M. J. D. (1964). “An efficient method for finding the minimum of a function of several variables without calculating derivatives”. The Computer Journal, 7(2):155–162.

Yule, G. U. (1944). The Statistical Study Of Literary Vocabulary. Cambridge University Press, Cambridge.





Annotating a multi-genre corpus of Early Modern German


Paul Bennett, Martin Durrell, Silke Scheible and Richard J. Whitt


This study addresses the challenges in automatically annotating a spatialised multi-genre corpus of Early Modern German with linguistic information, and describes how the data can be used to carry out a systematic evaluation of state-of-the-art corpus annotation tools on historical data, with the goal of creating a historical text processing pipeline.

            The investigation is part of an ongoing project funded jointly by ESRC and AHRC whose goal is to develop a representative corpus of Early Modern German from 1650-1800. This period is particularly relevant to standardisation and the emergence of German as a literary language, with the gradual elimination of regional variation in writing. In order to provide a broad picture of German during this period, the corpus includes a total of eight different genres (representing both print-oriented and orally-oriented registers), and is ‘spatialised’ both temporally and topographically, i.e. subdivided into three 50-year periods and the five major dialectal regions of the German Empire. The corpus consists of sample texts of 2,000 words, and to date contains around 400,000 words (50% complete). The corpus is comparable with extant English corpora for this period, and promises to be an important research tool for comparative studies of the development of the two languages.

            In order to facilitate a thorough linguistic investigation of the data, we plan to add the following types of annotation:

- Word tokens

- Sentence boundaries

- Lemmas

- POS tags

- Morphological tags

Due to the lexical, morphological, syntactic, and graphemic peculiarities characteristic of this particular stage of written German, and the additional variation introduced by the three variables of genre, region, and time, automatic annotation of the texts poses a major challenge. While current annotation tools tend to perform very well on the type of data on which they have been developed, only a few specialised tools are available for processing historical corpora and other non-standard varieties of language. The present study therefore proposes to do the following:

(1) Carry out a systematic evaluation of current corpus annotation tools and assess their robustness across the various sub-corpora (i.e. across different genres, regions, and periods)

(2) Identify procedures for building/improving annotation tools for historical texts, and incorporate them in a historical text processing pipeline
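The evaluation in point (1) can be pictured as computing tagging accuracy separately for each sub-corpus. The following is only a minimal sketch: the record layout, the genre, region and period labels, and the tag names are all hypothetical and are not taken from the project.

```python
from collections import defaultdict

# Hypothetical evaluation records: (genre, region, period, gold_tag, predicted_tag).
records = [
    ("sermon", "North", "1650-1700", "NN", "NN"),
    ("sermon", "North", "1650-1700", "VVFIN", "NN"),
    ("newspaper", "West", "1750-1800", "ART", "ART"),
    ("newspaper", "West", "1750-1800", "NN", "NN"),
]


def accuracy_by(records, field):
    """Tagging accuracy grouped by one metadata field (0=genre, 1=region, 2=period)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        key = record[field]
        total[key] += 1
        correct[key] += record[3] == record[4]  # True counts as 1
    return {key: correct[key] / total[key] for key in total}


by_genre = accuracy_by(records, 0)
by_period = accuracy_by(records, 2)
```

Grouping by one variable at a time keeps the comparison across genres, regions and periods readable; a fuller evaluation would probably also cross the variables.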

            Point (1) is of particular interest to historical corpus linguists faced with the difficult decision of which tools are most suitable for processing their data, and are likely to require the least manual correction. Point (2) utilises the findings of this investigation to improve the performance of existing tools, with the goal of creating a historical text processing pipeline. We plan to experiment with a novel spelling variant detector for Early Modern German which can identify the modern spelling of earlier non-standard variants in context, similar to Rayson et al.’s variant detector tool (VARD) for English (2005), and complementing the work of Ernst-Gerlach and Fuhr (2006) and Pilz et al. (2006) on historic search term variant generation in German.



Ernst-Gerlach, A. and N. Fuhr (2006). “Generating Search Term Variants for Text Collections with Historic Spellings”. In Proceedings of the 28th European Conference on Information Retrieval Research.

Pilz, T., W. Luther, U. Ammon and N. Fuhr (2006). “Rule-based Search in Text Databases with Nonstandard Orthography”. Literary and Linguistic Computing, 21(2), 179–186.

Rayson, P., D. Archer and N. Smith (2005). “VARD vs. Word: A comparison of the UCREL variant detector and modern spell checkers on English historical corpora”. In Proceedings of Corpus Linguistics 2005, Birmingham.

A tool for finding metaphors in corpora using lexical patterns


Tony Berber Sardinha


Traditionally, retrieving metaphor from corpora has been carried out with general-purpose corpus linguistics tools, such as concordancers, wordlisters, and frequency markedness identifiers. These tools present a problem for metaphor researchers in that they require users to know in advance which words or expressions are more likely to be metaphors. Hence, researchers may turn to familiar terms, words that have received attention in previous research or have been noticed by reading portions of the corpus, and so on. To alleviate this problem, we need a computer tool that can process a whole corpus, evaluate the metaphoric potential of each word in the corpus, and then present researchers with a list of words and their metaphoric potential in the corpus. In this paper, I present an online metaphor identification program that retrieves potential metaphors from both English and Portuguese corpora. This is currently the only publicly available tool of this kind. The program looks for ‘metaphor candidates’, or metaphorically used words. A word is considered metaphorically used when it is part of a linguistic metaphor, which in turn means a stretch of text in which it is possible to interpret incongruity between two domains from the surface lexical content (Cameron 2002: 10). Analysts may then inspect the possible metaphorically used words suggested by the program and determine whether they are part of an actual metaphor or not. The program works by analyzing the patterns (bundles and collocational frameworks) and part of speech of each word and then matching these patterns to the information in its databases.
The databases are the following: (1) a word database, which records the probability of individual words being used metaphorically in the training corpora; (2) a left bundles database, with the probability of bundles (3-grams) occurring in front of a metaphorically used word in the training data; (3) a right bundles database (3-grams), with the probability of bundles occurring to the right of a metaphorically used word in the training corpora; (4) a frameworks database, with the probability of frameworks (the single words immediately to the left and right) occurring around a metaphorically used word in the training data (e.g. ‘a … of’, etc.); (5) a part of speech database, with the probability of each word class being used metaphorically in the training corpora. The training corpora were the following: for Portuguese, (1) a corpus of conference calls held in Portuguese by an investment bank in Brazil, with 85,438 tokens and 5,194 types, and (2) the Banco de Português (Bank of Portuguese), a large, register-diversified corpus, containing nearly 240 million words of written and spoken Brazilian Portuguese; for English, the British National Corpus, from which a sample of 500 types was taken; each word form was then concordanced, and each concordance was hand analyzed for metaphor. In addition to a description of the way the tools were set up, the paper will include a demonstration of the tools and examples of analyses, discuss problems, and propose possible future developments.
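To make the database lookup concrete, here is a minimal sketch of how the five probabilities for one word might be retrieved and combined. The abstract does not say how the program combines them, so the plain average used here is an assumption, and every database entry and probability value is invented for illustration.

```python
from statistics import mean

# Hypothetical toy databases; every probability below is invented for illustration.
WORD_DB = {"flow": 0.60, "table": 0.05}
LEFT_BUNDLE_DB = {("we", "expect", "a"): 0.40}
RIGHT_BUNDLE_DB = {("of", "new", "capital"): 0.50}
FRAMEWORK_DB = {("a", "of"): 0.30}
POS_DB = {"NOUN": 0.20, "VERB": 0.35}
DEFAULT = 0.0  # assumed score for patterns absent from a database


def metaphor_potential(tokens, tags, i):
    """Average the five database probabilities for the token at position i."""
    left = tuple(tokens[max(0, i - 3):i])   # 3-gram before the word
    right = tuple(tokens[i + 1:i + 4])      # 3-gram after the word
    frame = (
        tokens[i - 1] if i > 0 else "",
        tokens[i + 1] if i + 1 < len(tokens) else "",
    )
    scores = [
        WORD_DB.get(tokens[i], DEFAULT),
        LEFT_BUNDLE_DB.get(left, DEFAULT),
        RIGHT_BUNDLE_DB.get(right, DEFAULT),
        FRAMEWORK_DB.get(frame, DEFAULT),
        POS_DB.get(tags[i], DEFAULT),
    ]
    return mean(scores)  # assumption: unweighted average of the five scores


tokens = ["we", "expect", "a", "flow", "of", "new", "capital"]
tags = ["PRON", "VERB", "DET", "NOUN", "ADP", "ADJ", "NOUN"]
score = metaphor_potential(tokens, tags, 3)
```

Ranking all words in a corpus by such a score would yield the list of metaphor candidates that the analyst then inspects by hand.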



Cameron, L. (2002). Metaphor in Educational Discourse. London: Continuum.






Institutional English in Italian university websites: The acWaC corpus

Silvia Bernardini, Adriano Ferraresi and Federico Gaspari


The study of English for Academic Purposes, including learner and lingua franca varieties, has been a constant focus of interest within corpus linguistics. However, the bulk of research has been on disciplinary writing (see e.g. L. Flowerdew 2002, Cortes 2004, Mauranen 2003). Institutional university language (e.g., course catalogues, brochures) remains an understudied subject, despite being of great interest, both from a descriptive point of view -- Biber (2006:50) suggests that it is "more complex than any other university register" -- and from an applied one. Institutional texts, especially those published on the web, are read by a vast audience, including, crucially, prospective students, who more and more often make their first encounter with a university through its website. Producing appropriate texts of an institutional nature in English is especially relevant for institutions in non-English speaking countries, nowadays under increasing pressure to make their courses accessible to an international public.

            Focusing on Italy as a case in point, our paper looks at institutional university language as used in the websites of Italian universities vs. universities in the UK and Ireland (taken as examples of native English standards within the EU). Two corpora were created by retrieving a set number of English web pages published by a selection of universities based in these countries. Our aim was to obtain corpora which would be comparable in terms of size and variety of text types, and excluding as far as possible disciplinary texts (e.g. research articles). The same construction procedure was adopted for the Italian corpus and for the UK/Irish corpus: we used the BootCaT toolkit (Baroni and Bernardini 2004), relying on Google's language identifier and excluding .pdf files from the search.

            Since the procedure is semi-automatic and allows limited control over the corpus contents (a general issue with web-as-corpus resources, see Fletcher 2004), the preliminary part of the analysis is aimed at assessing to what extent the two corpora may be considered as comparable in terms of topics covered and (broadly speaking) text types included. To this end, a random sample of pages from each corpus is manually checked, following Sharoff (2006).

            The bulk of the paper is a double comparison of the two corpora. First, word and part-of-speech distributions (of both unigrams and n-grams) across the two corpora are compared. Second, taking as our starting point the analysis of institutional university registers in Biber (2006), we carry out a comparative analysis of characteristic lexical bundles and of stance expressions, focusing specifically on ways of expressing obligation.

            The study is part of a larger project which seeks to shed light on the most salient differences between institutional English in the websites of British/Irish vs. Italian universities. The project's short-term applied aim is to provide resources for Italian authors and translators working in this area, as a first step towards creating a pool of corpora of non-native institutional English as used in other European countries.



Baroni, M. and S. Bernardini (2004). "BootCaT: Bootstrapping corpora and terms from the web". Proceedings of LREC 2004, 1313-1316.

Biber, D. (2006). University Language: A Corpus-based Study of Spoken and Written Registers. Amsterdam: John Benjamins.

Cortes, V. (2004). "Lexical bundles in published and student disciplinary writing: Examples from history and biology". English for Specific Purposes, 23, 397-423.

Fletcher, W.H. (2004). "Making the web more useful as a source for linguistic corpora". In U. Connor and T. Upton (eds) Corpus Linguistics in North America 200,. 191-205.

Flowerdew, L. (2002). "Corpus-based analyses in EAP". In J. Flowerdew (ed.) Academic Discourse. London: Longman, 95-114.

Mauranen, A. (2003). "The corpus of English as Lingua Franca in academic settings". TESOL Quarterly, 37 (3), 513-527.

Sharoff, S. (2006). "Creating general-purpose corpora using automated search engine queries". In M. Baroni and S. Bernardini (eds) WaCky! Working Papers on the Web as Corpus. Bologna: GEDIT, 63-98.

Nominalizations and nationality expressions: A corpus analysis


Daniel Berndt, Gemma Boleda, Berit Gehrke and Louise McNally


Many languages offer competing strategies for expressing the participant roles associated with event nominalizations, such as the use of a denominal adjective ((1a)) vs. a PP ((1b)):


(1)        a. French agreement to participate in the negotiations (Nation_adjective+N)

            b. agreement by France to participate in the negotiations (N+P+Nation_noun)


Since Kayne (1981), theoretical work on nominalizations has focused mainly on whether such PPs and adjectives are true arguments of the nominalization or simply modifiers which happen to provide participant role information, with no conclusive results (see e.g. Grimshaw 1990, Alexiadou 2001, Van de Velde 2004, McNally & Boleda 2004); little attention has been devoted to the factors determining when one or the other option is used (but see Bartning 1986). The goal of this research is to address this latter question, in the hope that a better understanding of these factors will also offer new insight into the argument vs. adjunct debate.


Methodology: We hypothesize that the choice between (1a)-(1b) depends on various factors, including whether the noun is deverbal, the argument structure of the underlying verb, prior or subsequent mention of the participant in question (in (1), France), and what we call concept stability – the degree to which the full noun phrase describes a well-established class of (abstract or concrete) entities. We tested for these factors in a study on the British National Corpus. To reduce unintended sources of variation, we limited our study to nationality adjectives/nouns. We examined Nation_adjective+N and N+P+Nation_noun examples from 49 different nations whose adjective (French) and proper noun (France) forms occur 1,000-30,000 times in the BNC, filtering out the examples whose head noun was too infrequent (≤ 24 occurrences) or too nation-specific (e.g., reunification). To determine the semantic class of the head nouns, we used the WordNet-based Top Concept Ontology (Álvez et al. 2008). For the analysis of nominalizations, we considered only a manually-selected list of 45 nouns.
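The frequency filters described above can be sketched as follows. This is an illustration only: the counts are invented rather than taken from the BNC, the nation-specific exclusion list is hypothetical, and the sketch assumes that the 1,000-30,000 band applies to the adjective and the noun form separately.

```python
# Hypothetical counts, invented for illustration (not BNC figures).
nation_freq = {
    "France": {"adjective": 12000, "noun": 9000},
    "Luxembourg": {"adjective": 150, "noun": 700},
    "America": {"adjective": 45000, "noun": 28000},
}
head_noun_freq = {"agreement": 310, "wine": 95, "reunification": 12}
NATION_SPECIFIC = {"reunification"}  # assumed manual exclusion list


def keep_nation(freqs, low=1000, high=30000):
    """Keep a nation only if both forms fall inside the frequency band."""
    return all(low <= freqs[form] <= high for form in ("adjective", "noun"))


def keep_head_noun(noun, min_count=25):
    """Drop head nouns with 24 or fewer occurrences or flagged as nation-specific."""
    return head_noun_freq.get(noun, 0) >= min_count and noun not in NATION_SPECIFIC


nations = [n for n, f in nation_freq.items() if keep_nation(f)]
nouns = [n for n in head_noun_freq if keep_head_noun(n)]
```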


Results: Unlike nouns denoting physical objects, abstract nouns, including nominalizations, prefer the prepositional construction (nouns in the categories Part and Place, e.g. border, area, are an exception). The adjective construction occurs with a much smaller range of nouns than does the PP construction, an effect that is more pronounced with infrequent nations and when only nominalizations are considered. These results suggest that use of the adjective construction positively correlates with concept stability: Adjective+nominalization combinations are arguably less likely to form stable concepts than adjective+concrete noun combinations (cp. e.g. French wine and a French agreement).

Though the other factors are pending analysis, these initial results indicate an asymmetry in the distribution of the constructions in (1); and the strong association between the use of the adjective construction and concept stability specifically lends support to an analysis of nationality adjectives as classifying modifiers rather than as argument-saturating expressions.

The presence of lexical collocations in bilingual dictionaries: A study of nouns related to feelings in English and Italian


Barbara Berti


Nowadays, it is generally accepted that naturalness in languages results from the synergy between grammatical well-formedness and an interlocking series of precise choices at the lexical level. From an SLA perspective, the awareness of lexical relations between words becomes particularly crucial as it allows learners to avoid combinatorial mistakes that result in awkward linguistic production. One of the most important reference tools for students is the dictionary. It is therefore essential that it should contain as much information as possible concerning the lexical environment proper to each word.

            Lexicographic research has mainly focused on investigating the presence of collocations in learners' dictionaries or general purpose dictionaries (Cowie 1981, Benson 1990, Hausmann 1991, Béjoint 1981, Cop 1988, Siepmann 2006). Nevertheless, despite teachers' efforts to encourage the use of monolingual resources, it has been widely shown (Atkins 1985, Atkins and Knowles 1990, Atkins and Varantola 1997, Nuccorini 1992, 1994, Béjoint 2002) that learners tend to prefer bilingual ones. Although bilingual lexicography has already taken the issue of collocations into account (Hausmann 1988, Cop 1988, Siepmann 2006), a systematic study concerning the actual presence of collocations in bilingual dictionaries has not been carried out yet, especially from a contrastive English-Italian perspective. This is of particular relevance since dictionary editors often acknowledge the importance of collocations as well as the use of a corpus-based methodology for the compilation of dictionaries.

            In this paper, I investigate the presence of lexical collocations of some nouns related to feelings in a sample of bilingual dictionaries. In an attempt to compromise between theoretical issues and practical applications, I will apply a little flexibility to the notion of lexical collocation on the one hand, and set some boundaries on the other. In general, I will consider combinations of words such as lexical word + preposition + lexical word (e.g. jump for joy) as lexical collocations (this choice derives from the scarce presence of pure lexical collocations in my sample of dictionaries). I will point out that too often collocations do not stand out among the definitions of lemmas but are present in a more 'silent' form as parts of examples (e.g. He is in for a nasty shock, where nasty shock is not properly highlighted). I will reflect on the fact that many nouns share a good deal of collocates (e.g. the adjective deep collocates with pleasure, satisfaction, happiness, contentment, delight), and try to understand the semantic boundaries of this phenomenon. I will underline the extent to which collocations are interlocking, in the sense expressed by Hoey (2005) (e.g. squeal with sheer delight is made up of two collocations: squeal with delight and sheer delight), and what their combinatorial limitations are (e.g. although mischievous delight is a collocation, it is not possible to combine it with squeal with delight and form *squeal with mischievous delight). All this will be done with reference to the BNC and the Oxford Collocations Dictionary. As will be shown, the problem grows when collocations are investigated from an Italian perspective, due to the anisomorphism of the two languages.



Atkins, B. T. and F. E. Knowles (1990). “Interim report on the EURALEX/AILA research project into dictionary use”. In T. Magay and J. Zigàny (eds), BudaLEX '88 proceedings, 381-392.

Atkins, B. T. and K. Varantola (1997). “Monitoring dictionary use”. IJL, 10 (1), 1-45.

Cop, M. (1988). “The function of collocations in dictionaries”. In T. Magay and J. Zigàny (eds), BudaLEX '88 proceedings, 35-46.

Cowie, A. P. (1981). “The treatment of collocations and idioms in learners' dictionaries”. Applied Linguistics, 2 (3), 223-35.

Hausmann, F. J. (1991). “Collocations in monolingual and bilingual dictionaries”. In V. Ivir and D. Kalojera (eds), Languages in contact and contrast. Walter de Gruyter, 225-236.

Hausmann, F. J. (2002). “La lexicographie bilingue en Europe. Peut-on l’améliorer ?”. In E. Ferrario and V. Pulcini (eds), La lessicografia bilingue tra presente e avvenire, Atti del Convegno Vercelli, 11-21.

Nuccorini, S. (1994). “On dictionary misuse”. EURALEX '94 Proceedings, Amsterdam, 586-597.

Siepmann, D. (2006). “Collocation, Colligation and Encoding Dictionaries. Part II: Lexicographical Aspects”. International Journal of Lexicography, 19 (1), 1-39.

Understanding culture: Automatic semantic analysis of a general Web corpus vs. a corpus of elicited data


Francesca Bianchi


A long established tradition (Szalay & Maday, 1973; Wilson & Mudraya, 2006) has analysed cultural orientations by extracting Elementary Meaning Units (EMUs) – i.e. subjective meaning reactions elicited by a particular word – from corpora of elicited data. In line with this tradition, Bianchi (2007) attempted the analysis of EMUs in two different corpora of non-elicited data. Her results seemed to support the hypothesis that non-elicited data are as suitable as elicited ones for the extraction of EMUs.

            The current study aims to further assess the contribution that corpora from non-elicited data, and large Web corpora in particular, may provide to the analysis of cultural orientations. This study will focus on British cultural orientations around two specific node words: wine and chocolate. A general Web corpus of British English (UKWAC; Baroni and Kilgarriff, 2006) will be used, alongside data elicited from British native speakers. Elicited data will be collected via specifically designed questionnaires based on sentence completion and picture description items. Sentences including the node words will be singled out in the Web and the elicited corpora and subsequently tagged using specific automatic tagging systems (CLAWS for POS tagging, and SemTag for semantic tagging). Semantic tagging will highlight the semantic preference of the node word, i.e. its EMUs, while frequency and strength calculations will distinguish cultural from personal EMUs. The emerging EMUs will be analysed and commented on with reference to existing literature on cultural orientations and previous studies based on the chosen node words (Fleischer, 2002; Bianchi, 2007). Particular attention will be given to comparing and contrasting the advantages and limitations of general Web corpora with respect to more traditional elicited data.



Baroni, M. and A. Kilgarriff (2006). “Large linguistically processed web corpora for multiple languages”. Proceedings of EACL. Trento.

Bianchi, F. (2007). “The cultural profile of chocolate in current Italian society: A corpus-based pilot study”. ETC – Empirical Text and Culture Research, 3, 106-120.

Fleischer, M. (2002). “Das Image von Getränken in der polnischen, deutschen und französischen Kultur”. ETC: Empirical Text and Culture Research, 2, 8-47.

Szalay, L. B. and B. C. Maday (1973). “Verbal Associations in the Analysis of Subjective Culture”. Current Anthropology, 14 (1-2), 33-42.

Wilson, A. and O. Mudraya (2006). “Applying an Evenness Index in Quantitative Studies of Language and Culture: A Case Study of Women’s Shoe Styles in Contemporary Russia”. In P. Grzybek and R. Köhler (eds), Exact Methods in the Study of Language and Text. Berlin: Mouton de Gruyter, 709-722.

Introducing probabilistic information in Constraint Grammar parsing

Eckhard Bick


Constraint Grammar (CG), a framework for robust parsing of running text (Karlsson et al. 1995), is a rule-based methodology for assigning grammatical tags to tokens and disambiguating them. Traditionally, rules build on lexical information and sentence context, providing annotation for e.g. part of speech (PoS), syntactic function and dependency relations in a modular and progressive way. However, apart from the generative tradition, e.g. HPSG and AGFL (Koster 1991), most robust taggers and parsers today employ statistical methods and machine learning, e.g. the MALT parser (Nivre & Scholz 2004), and though CG has achieved astonishing results for a number of languages, e.g. Portuguese (Bick 2000), the formalism has so far only addressed probabilistic information in a very crude and “manual” way – by allowing lexical <Rare> tags and by providing for delayed use of more heuristic rules.

     It would seem likely, therefore, that a performance increase could be gained from integrating statistical information proper into the CG framework. Also, since mature CG rule sets are labour-intensive and contain thousands of rules, statistical information could help to ensure a more uniform and stable performance at an earlier stage in grammar development, especially for less-resourced languages.

     To address these issues, our group designed and programmed a new CG rule compiler, allowing the use of numerical frequency tags in parallel with morphological, syntactic and semantic tags. For our experiments, we used our unmodified, preexisting CG system (EngGram) to analyse a number of English text corpora (BNC, Europarl [Koehn 2005] and a 2005 Wikipedia dump), and then extracted frequency information for both lemma_POS and wordform_tagstring pairs. This information was then made accessible to the CG rules in the form of (a) isolated and (b) cohort-relative frequency tags, the latter providing a derived frequency percentage for each of a wordform's possible readings relative to the sum of all frequencies for this particular wordform.
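     The cohort-relative frequency described above amounts to a simple normalisation over the readings of one wordform. The following minimal sketch illustrates the arithmetic only; the counts and tag strings are invented, and it does not reproduce the format of the actual CG frequency tags.

```python
from collections import defaultdict

# Hypothetical (wordform, tagstring) counts, invented for illustration.
counts = {
    ("run", "V INF"): 600,
    ("run", "V PR"): 300,
    ("run", "N S"): 100,
    ("table", "N S"): 50,
}


def cohort_relative(counts):
    """Return each reading's share (in percent) of its wordform's total frequency."""
    totals = defaultdict(int)
    for (wordform, _tags), n in counts.items():
        totals[wordform] += n
    return {
        (wordform, tags): 100.0 * n / totals[wordform]
        for (wordform, tags), n in counts.items()
    }


relative = cohort_relative(counts)
```

With these toy counts, the noun reading of "run" receives 10% of the cohort, whereas the only reading of "table" receives 100% regardless of its low absolute frequency.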

     We then introduced layered threshold rules, culling progressively less infrequent readings after each batch of ordinary, linguistic-contextual CG rules. Also, frequency-based exceptions were added to a few existing rules. Next, we constructed two revised gold-standard annotations - 4,300 tokens each of (a) randomized Wikipedia sentences and (b) Leipzig Wortschatz sentences (Quasthoff et al. 2006) from the business domain - and tested the performance of the modified grammar against the original grammar, achieving a marked increase in F-scores (weighted recall/precision) for both text types. Though the new rules all addressed morphological/PoS disambiguation, the syntactic function gain (91.1->93.4 and 90.8->93.4) was about twice as big as the PoS gain (96.9->97.9 and 96.9->98.2), reflecting the propagating impact that PoS errors will typically have on subsequent modules.
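     The idea of layered threshold rules can be sketched in ordinary code rather than in CG rule syntax. Everything specific here is an assumption made for illustration: the threshold values, the tag strings, the placeholder for a batch of contextual rules, and the safeguard of never removing the last remaining reading.

```python
def cull(cohort, threshold):
    """Remove readings below the threshold, but always keep at least one reading."""
    kept = {tags: pct for tags, pct in cohort.items() if pct >= threshold}
    if not kept:  # assumed safeguard: fall back to the most frequent reading
        best = max(cohort, key=cohort.get)
        kept = {best: cohort[best]}
    return kept


def layered_culling(cohort, thresholds=(1.0, 5.0, 20.0), contextual_rules=None):
    """Alternate (placeholder) contextual rule batches with rising frequency thresholds."""
    for threshold in thresholds:
        if contextual_rules is not None:
            cohort = contextual_rules(cohort)  # stand-in for a batch of ordinary CG rules
        cohort = cull(cohort, threshold)
    return cohort


# Hypothetical cohort-relative percentages for the readings of one wordform.
cohort = {"V INF": 60.0, "V PR": 30.0, "N S": 9.5, "ADJ": 0.5}
result = layered_culling(cohort)
```

Raising the threshold only gradually gives the contextual rules a chance to resolve an ambiguity before frequency alone decides it.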

     A detailed inspection of annotation differences revealed that of all changes, about 80% were for the better, while 20% introduced new errors, suggesting an added potential for improvement by crafting more specific rule contexts for these cases, or by using machine learning techniques to achieve a better ordering of rules.



Bick, E. (2000). The Parsing System Palavras - Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus: Aarhus University Press.

Karlsson et al. (1995). “Constraint Grammar - A Language-Independent System for Parsing Unrestricted Text”. Natural Language Processing, 4.

Koehn, P. (2005). “Europarl: A Parallel Corpus for Statistical Machine Translation”. Machine Translation Summit X, 79-86.

Koster, C.H.A. (1991). “Affix Grammars for natural languages”. In Lecture Notes in Computer Science, 545.

Nivre, J. and M. Scholz (2004). “Deterministic Dependency Parsing of English Text”. In Proceedings of COLING 2004, Geneva, Switzerland.

Quasthoff, U., M. Richter and C. Biemann (2006). “Corpus Portal for Search in Monolingual Corpora”. In Proceedings of the fifth international conference on Language Resources and Evaluation, 1799-1802.


Word distance distribution in literary texts


Gemma Boleda, Álvaro Corral, Ramon Ferrer i Cancho and Albert Díaz-Guilera


We study the distribution of distances between consecutive repetitions of words in literary texts, measured as the number of words between two occurrences plus one. This reflects the dynamics of language usage, in contrast to the static properties captured in Zipf's law (Zipf 1949). We focus on distance distributions within single documents, as opposed to different documents or amalgamated sources as in previous work.
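The distance measure defined above (intervening words plus one) is simply the difference between the positions of two consecutive occurrences. A minimal sketch, with a hypothetical function name and a toy sentence in place of a novel:

```python
from collections import defaultdict


def repetition_distances(tokens):
    """Map each word to the distances between its consecutive occurrences.

    The distance is the number of intervening words plus one, which equals
    the difference between the two token positions.
    """
    last_seen = {}
    distances = defaultdict(list)
    for position, word in enumerate(tokens):
        if word in last_seen:
            distances[word].append(position - last_seen[word])
        last_seen[word] = position
    return dict(distances)


tokens = "the cat saw the dog and the cat ran".split()
distances = repetition_distances(tokens)
```

Words that occur only once contribute no distance and are therefore absent from the result.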

            We have examined eight novels in English, Spanish, French, and Finnish. The following consistent results are found:

1. All words (including “function” words such as determiners or prepositions) display burstiness, that is, one occurrence of the word tends to trigger more occurrences of the same word (Church and Gale 1995, Bell et al. 2009).

2. Burstiness fades in short distances. In very short distances even a repulsion effect is observed, which can be explained on both linguistic and communicative grounds.

3. We observe systematic differences in the degree of clustering of different words. In particular, words that are relevant from a rhetorical or narrative point of view (alternatively, keywords; Church and Gale 1995) display higher clustering than the rest.

4. The distribution of the rest of the words follows a scaling law, independent of language and frequency.

5. If words are shuffled, neither the scaling law nor burstiness is observed. A Poisson distribution is obtained instead. Note that e.g. Zipf's law is not altered in this circumstance.

6. There are striking similarities between our data and other physical and social phenomena that have been studied as complex systems (Corral 2004, Bunde et al. 2005, Barabási 2005).
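The shuffling control in point 5 can be illustrated with a toy example. This sketch only shows that shuffling leaves word frequencies untouched while destroying the original ordering; the toy text, the seed and the function names are assumptions, and the sketch does not attempt to fit a scaling law or a Poisson distribution.

```python
import random
from collections import Counter


def shuffled_copy(tokens, seed=0):
    """Return a randomly reordered copy of the token list (seed fixed for repeatability)."""
    rng = random.Random(seed)
    copy = list(tokens)
    rng.shuffle(copy)
    return copy


def gaps(tokens, word):
    """Distances between consecutive occurrences of one word."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    return [b - a for a, b in zip(positions, positions[1:])]


# Toy text in which "whale" occurs in two tight clusters.
tokens = ["whale"] * 5 + ["x"] * 90 + ["whale"] * 5
shuffled = shuffled_copy(tokens)

same_frequencies = Counter(tokens) == Counter(shuffled)
original_gaps = gaps(tokens, "whale")
shuffled_gaps = gaps(shuffled, "whale")
```

Because the frequency counts are identical before and after shuffling, any frequency-based law is unaffected, while the clustered gap pattern of the original order is not preserved in general.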

Our data reveal universal statistical properties of language as used in literary texts. The factors that determine this distribution, that is, the extent to which linguistic as opposed to more general communicative or cognitive processes are involved, as well as the relationship between word distance distribution and other phenomena with similar properties (see point 6), remain an exciting challenge for the future.



Barabási, A-L. (2005). “The origin of bursts and heavy tails in human dynamics”. Nature, 435, 207–211.

Bell, A., J. M. Brenier, M. Gregory, C. Girand and D. Jurafsky (2009). “Predictability effects on durations of content and function words in conversational English”. Journal of Memory and Language, 60 (1), 92–111.

Bunde, A., J.F. Eichner, J.W. Kantelhardt and S. Havlin (2005). “Long-term memory: a natural mechanism for the clustering of extreme events and anomalous residual times in climate records”. Phys. Rev. Lett. 94, 048701.

Church, K. W. and W. A. Gale (1995). “Poisson mixtures”. Nat. Lang. Eng. 1, 163–190.

Corral, A. (2004). “Long-term clustering, scaling, and universality in the temporal occurrence of earthquakes”. Phys. Rev. Lett. 92, 108501.

Zipf, G. K. (1949). Human Behavior and the Principle of Least-Effort. Cambridge, MA: Addison-Wesley. (Hafner reprint, New York, 1972.)


Frequency effects on the evolution of discourse markers in spoken vs. written French


Catherine Bolly and Liesbeth Degand


The current paper discusses the role of frequency effects in the domain of language variation and language change (Bybee 2003), with particular attention to differences between spoken and written language (Chafe & Danielewicz 1987). The following questions will be addressed: How do spoken and written modes interact with each other in the grammaticalization process? Are uses in spoken language the mirror of what we could expect for the future in writing? The question is not new, but needs systematic analyses. The aim of the paper is thus to explore the extent to which evolution of frequent linguistic phenomena may be predicted from their observed frequency in use. Conversely, it also focuses on how the variation of such phenomena in contemporary French can be explained by their diachronic evolution. To shed some light on these issues, we analyzed two types of pragmatic markers in French: (i) the parenthetical construction tu vois (‘you see’), and (ii) the causal conjunctions car and parce que (‘because’). We argue that the historical functional shift from the conceptual domain to the pragmatic/causal domain of these elements could be interpreted as a consequence of the primacy of spoken uses at a given stage of language evolution.

            First, the evolution from non-parenthetical to parenthetical uses, typical of spoken language (Brinton 2008), will be explored with respect to their context of production. A preliminary analysis shows an increase in frequency of tu vois (‘you see’) in literature (from infrequent in Preclassical French to frequent in Contemporary French) and in drama (from frequent in Preclassical French to highly frequent in Contemporary French). According to the cline of (inter-)subjectification in language change (Traugott to appear), these results allow for two hypotheses: (i) we expect a higher frequency of non-parenthetical constructions (objective uses) in written vs. “speech-like” French at any historical period; (ii) at the same time, we expect an increase over time in the frequency of parentheticals (subjectified uses) as they undergo a grammaticalization process, mainly in “speech-based” corpora.

            Secondly, the reversed frequency evolution of car (from highly frequent in Old French to dramatically infrequent in Contemporary spoken French) and parce que (from infrequent in Old French to highly frequent in Spoken French) is analyzed in terms of different subjectification patterns for the two conjunctions. Parce que shows an increasing cline of subjective uses, typical of spoken language, while car (being subjective from Old French on) does not show the evolution that would make it particularly fit for (renewed) use in spoken contexts, leading to its quasi-disappearance from speech and probable specialization in restricted (written) genres.



Brinton, L. J. (2008). The Comment Clause in English. Syntactic Origins and Pragmatic Development. Cambridge: CUP/ Studies in English Language.

Bybee, J. (2003). “Mechanisms of change in grammaticalization: The role of frequency”. In B. D. Joseph and R. D. Janda (eds) The Handbook of Historical Linguistics. Oxford: Blackwell, 604-623.

Chafe, W. and J. Danielewicz (1987). “Properties of spoken and written language”. In R. Horowitz and S. J. Samuels (eds) Comprehending Oral and Written Language. San Diego: Academic Press, 83-113.

Traugott, E. C. (to appear). “(Inter)subjectivity and (inter)subjectification: a reassessment”. In H. Cuyckens, K. Davidse and L. Vandelanotte (eds) Subjectification, intersubjectification and grammaticalization. Berlin and New York: Mouton de Gruyter.

“What came to be called”: Evaluative what and interpreter identity in the discourse of history

Marina Bondi


Research into specialized discourses has often paid particular attention to specific lexis, and  how this instantiates the meanings and values associated with specific communities or institutions. Studies on academic discourse, however, have paid growing attention to general lexis and its disciplinary specificity, for example in metadiscourse or evaluative language use. This finer-grained analysis of disciplinary language can be shown to profit from recent approaches to EAP-specific phraseology, whether paying attention to automatic derivation of statistically significant clusters (Biber 2004, Biber et al. 2004, Hyland 2008), or to grammar patterns and semantic sequences (Groom 2005, Charles 2006, Hunston 2008).

      The paper explores the notion of semantic sequences and its potential in identifying distinctive features of specialized discourses. The study is based on a corpus of academic journal articles (2.5 million words) in the field of history. The methodology adopted combines a corpus and a discourse perspective. A preliminary analysis of frequency data (wordlists and keywords) offers an overview of quantitative variation. Starting from the frequencies of word forms and Multi-Word-Units, attention is paid to semantic associations between different forms and to the association of the unit with further textual-pragmatic meanings.

      The analysis focuses on sequences involving the use of what and a range of signals referring to a shift in time perspectives or attribution. Paraphrasing Hyland and Tse’s analysis of ‘evaluative that’ (2005), we have here an ‘interpretative what’, signalling various patterns of interaction of the writer’s voice with other interpretations of historical fact. A tentative classification of the sequences is provided, as an extension of previous studies on a local grammar of evaluation (Hunston and Sinclair 2000). Co-textual analysis looks both at the lexico-semantic patterns and at the pragmatic functions involved. Frequencies and patterns are then interpreted in the light of factors characterizing academic discourse and specific features of writer identity in historical discourse (Coffin 2006).



Biber, D. (2004). “Lexical bundles in academic speech and writing”. In B. Lewandowska-Tomaszczyk (ed.) Practical Applications in Language and Computers. Frankfurt am Main: Peter Lang, 165-178.

Biber, D., S. Conrad and V. Cortes (2004). “If you look at: Lexical bundles in university teaching and textbooks”. Applied Linguistics, 25 (3), 371-405.

Charles, M. (2006). “Phraseological Patterns in Reporting Clauses Used in Citation: A Corpus-Based Study of Theses in Two Disciplines”. English for Specific Purposes, 25 (3), 310-331.

Coffin, C. (2006). Historical Discourse. London: Continuum.

Groom, N. (2005). “Pattern and Meaning across Genres and Disciplines: An Exploratory Study”. Journal of English for Academic Purposes, 4 (3), 257-277.

Hunston, S. and J. Sinclair (2000). “A Local Grammar of Evaluation”. In S. Hunston and G. Thompson (eds) Evaluation in Text. Oxford: Oxford University Press, 74-101.

Hunston, S. (2008). “Starting with the small words: Patterns, lexis and semantic sequences”. International Journal of Corpus Linguistics, 13 (3), 271-295.

Hyland, K. (2008). “Academic Clusters: text patterning in published and postgraduate writing”. International Journal of Applied Linguistics, 18 (1), 41-62.

Hyland, K. and P. Tse (2005). “Hooking the reader: a corpus study of evaluative that in abstracts”. English for Specific Purposes, 24 (2), 123-139.


A comparative study of PhD abstracts written in English by native and non-native speakers across different disciplines


Genevieve Bordet


PhD abstracts constitute a specific case of specialized academic genre. In the particular case of the French scientific community, they are now nearly always translated into or directly written in English by non-native speakers. Therefore, they offer an interesting opportunity to understand how would-be researchers try to build up their authorial stance through this specific type of writing. Besides defining their field of investigation and formulating new scientific propositions, authors have to demonstrate their ability to belong to a discourse community through their mastery of a specific genre.

            Based on the study of parallel corpora of abstracts written in English by NS and NNS in different disciplines, evidence is given of phraseological regularities specific to NS and NNS. The abstracts belong to different disciplines and fields of research, mainly in the hard sciences, such as “material science”, “information research” and “mathematics didactics”.

            We analyse more specifically the expression of authorial stance through pronouns and impersonal tenses, modal verbs, adverbs and adjectives. The realization of lexical cohesion is considered as part of the building up of an authorial stance.

            Using the theoretical and methodological approach of lexicogrammar and genre study, the structure analysis of several texts in different disciplines and both languages (L1 and L2) provides evidence of different discourse planes specific to this type of text (Hunston 2000): text and metatext, the abstract presenting an account of the thesis, which, in turn, accounts for and comments on the research work. This detailed study makes it possible to distinguish the different markers used to separate the different planes. It also shows that all abstracts include the following elements or “moves” (Biber et al. 2008): referential discourse (theoretical basis), definition of a proposition or hypothesis, an account of the research work, conclusions and/or prescriptions. This will be mainly shown through the use of a combination of tenses and pronouns. Verbs and verb collocations can also be semantically categorized and their distribution analysed, providing further evidence on communicative strategies in abstracts.

            Based on these findings and using concordancing tools, a comparative study of abstracts written by NS and NNS reveals significant differences between these two types of texts. This in turn raises the question of the consequences for their specific semantic prosody and their acceptance by the academic community (Swales 1990).

            The results of this study, since they show evidence of phraseological regularities specific to an academic genre, could be useful for information research, giving leads towards the automatic recognition of the genre.

            In addition, the evidence of important differences between NS and NNS realisations and the analysis of their consequences will help to define more suitable goals for EAP teaching.



Biber, D., U. Connor and T. A. Upton (2008). Discourse on the move. Amsterdam: John Benjamins.

Biber, D. (2006). University language: a corpus based study of written and spoken language. John Benjamins.

Gledhill, C. J. (2000). Collocations in science writing. Tübingen: Gunter Narr Verlag.

Halliday, M. A. K. and J. R. Martin (1993). Writing science: literacy and discursive power. Pittsburgh: University of Pittsburgh Press.

Hunston, S. and G. Thompson (1999). “Evaluation in text: authorial stance and the construction of discourse”. Oxford Linguistics.

Partington, A. (1998). Using corpora for English language research and teaching. John Benjamins.

Swales, J. (1990). Genre Analysis: English in academic and research settings. Cambridge: Cambridge University Press.

Corpora for all? Learning styles and data-driven learning


Alex Boulton


One possible exploitation of corpus linguistics is for learners to access the data themselves, in what Johns (e.g. 1991) has called “data-driven learning” or DDL. The approach has generated considerable empirical research, generally with positive results, although quantitative outcomes tend to remain fairly small or not statistically significant. One possible explanation is that such quantitative data conceal substantial variation, with some learners benefiting considerably, others not at all. This paper sets out to see if the variation can be related to learning styles and preferences, a field as yet virtually unexplored even for inductive/deductive preferences (e.g. Chan & Liou, 2005).

            In this study, learners of English at a French architectural college were encouraged to explore the BNC for 15 minutes at the end of each class for specific points which had come up during the lesson. The learners completed a French version of the Index of Learning Styles (ILS) based on the model first developed by Felder and Silverman (1988/2002). This instrument, originally designed for engineering students, has also been widely used in language learning (e.g. Felder & Henriques, 1995), is quick and easy to administer, and provides numerical scores for each learner on four scales (active / reflective, sensing / intuitive, visual / verbal, sequential / global). The results are described, and compared against two further sets of data collected from the participants at the end of the semester. Firstly, they completed a questionnaire concerning their reactions to the DDL activities in their course; secondly, their corpus consultation skills were tested in the lab on a number of new points.

            If it can be shown that DDL is particularly appropriate for certain learner profiles, it may help teachers to tailor its implementation in class for more or less receptive learners, and perhaps increase the appeal of DDL to a wider learner population.



Chan, P-T. & H-C. Liou. (2005). “Effects of web-based concordancing instruction on EFL students’ learning of verb–noun collocations”. Computer Assisted Language Learning, 18 (3), 231-251.

Felder, R. & E. Henriques. (1995). “Learning and teaching styles in foreign and second language education”. Foreign Language Annals, 28 (1), 21-31.

Felder, R. & L. Silverman. (1988/2002). “Learning and teaching styles in engineering education”. Engineering Education, 78 (7), 674-681.

Johns, T. (1991). “Should you be persuaded: two examples of data-driven learning”. In T. Johns & P. King (eds) Classroom Concordancing. English Language Research Journal, 4, 1-16.



Exploring future constructions in Medieval Spanish using the Biblia Medieval Parallel Corpus


Miriam Bouzouita


 The subject of this paper is the variation and change found in Medieval Spanish future constructions. To be more specific, this study aims to uncover the differences in use between the so-called analytic future structures, such as tornar-m-é ‘I shall return’ in example (1), and the synthetic ones, such as daras ‘you will give’ in example (2). It also aims to trace the diachronic development of these future structures, which ended with the analytic ones being lost.


(1)        E     dixo:      “Tornar-m-é              a    Jherusalem […]
           and   said.3SG   return.INF-CL-will.1SG   to   Jerusalem
           ‘And he said: “I shall return to Jerusalem […]”’ (Fazienda: 194)

(2)        E     dyxo       ella:   “Que   me   daras?”
           And   said.3SG   she     what   CL   will-give.2SG
           ‘And she said: “What will you give me?”’ (Fazienda: 52)


            As can be observed, these future constructions differ in that a clitic intervenes between the two parts that form the future tense (an infinitive and a present indicative form of the verb aver ‘to have’) in the analytic structures but not in the synthetic ones.

Although some scholars have observed a correlation between the use of these future variants and general clitic placement principles present in Medieval Spanish, such as the restriction precluding clitics from appearing in sentence-initial positions (e.g. Eberenz 1991; Castillo Lluch 1996, 2002), hardly any attention has been paid to this view, and alternative explanations, which completely disregard this correlation, continue to enjoy widespread acceptance. The most widely held view is that the variation in future constructions is conditioned by discourse-pragmatic factors: analytic constructions are said to be emphatic while the synthetic ones are unmarked (e.g. Company Company 1985-86, 2006; Girón Alconchel 1997). By disentangling the various syntactic environments in which each of the future constructions appears, I hope to make apparent the undeniable link that exists between the use of the various future constructions and clitic placement. I shall show that there exists (i) a complementary distribution between the synthetic forms with preverbal placement and the analytic forms, and (ii) a distributional parallelism between the analytic futures and postverbal placement in non-future tenses. Since this study has been carried out using the Biblia Medieval corpus, a parallel corpus containing Medieval Spanish Bibles dating from different periods, the same token can be traced through time, which helps to establish the locus of the change more easily. Finally, I shall account for the observed variation and change within the Dynamic Syntax framework (Cann et al. 2005; Kempson et al. 2001). I shall show that within a processing (parsing/production) perspective, synchronic variation is expected for the future forms. The conclusion is that the variation and change in the future constructions cannot be satisfactorily explained without taking into account (i) clitic phenomena, such as positional pressures that preclude clitics from appearing sentence-initially, and (ii) the processing strategies used for the left-peripheral expressions.


Regrammaticalization as a restrategizing device in political discourse


Michael S. Boyd


Political actors exploit different genres to express their ideas, opinions and messages, legitimize their own policies, and delegitimize their opponents in different situations and contexts. What they say and how they say it are constructed by the particular type of social activity being pursued (Fairclough 1995: 14).  Campaign speeches and debates are two important, yet very different genres of political discourse in which political actors often operate. While campaign speeches are an example of mostly scripted, one-to-many communication, in which statements generally remain uninterrupted and unchallenged, political debates are an example of spontaneous, interactive communication, in which statements are often interrupted and challenged at the time of the speech event. In such different contexts it is not surprising that speakers should adopt different linguistic strategies to frame their message. The work is based on the hypothesis that genres and their contexts deeply influence surface lexico-grammatical as well as syntactic realizations. In such an approach, it naturally follows that socially defined contextual features such as role, location, timing, etc. are all “pivotal” for the (different) discourse realizations in different genres (Chilton and Schäffner 2002: 16). Moreover, this type of analysis cannot ignore “the broader societal and political context in which such discourse is embedded” (Schäffner 1996: 201).

            To demonstrate both the differences and similarities of such textual and discursal realizations the study is based on a quantitative and qualitative analysis of two small corpora taken from the discourse of Barack Obama. The first corpus consists of campaign speeches given during the Democratic primary campaign and elections (2007-8), while the second consists of the statements made by Obama in the three televised debates with Republican candidate John McCain (2008) and taken from written transcripts. The corpora are analyzed individually with particular reference to lexical frequency and keyness. The lexical data are subsequently analyzed qualitatively on the basis of concordances to determine grammatical and syntactic usage.

            The study is particularly interested in regrammaticalization strategies and how they are used in the different corpora. One such strategy is nominalization, which, according to Fowler, “offers extensive ideological opportunities” (1991, cited in Partington 2003: 15). We can see this, for example, in Obama’s overwhelming nominal use of the lexemes CHANGE and HOPE in his speeches. Nominalization also allows for syntactic fronting, giving the terms an almost slogan-like quality (as in, for example, “change we can believe in”). In the debates, by contrast, not only do these lemmas have a much lower relative frequency but, when they are used, nominalization is rarely encountered. The message of hope and change is reframed to reflect the different micro- and macro-contexts, and in the debates modality appears to be the preferred grammatical (or, indeed, regrammaticalization) strategy. Thus, diverse grammaticalization strategies are used to varying degrees to reflect the differences in context and genre.



Chilton, P. and C. Schäffner (2002). “Introduction: Themes and principles in the analysis of political discourse”. In P. Chilton and C. Schäffner (eds) Politics as Text and Talk: Analytic approaches to political discourse. Amsterdam: John Benjamins, 1-41.

Fairclough, N. (1995). Critical Discourse Analysis. London: Longman.

Partington, A. (2003). The Linguistics of Political Argument: The Spin-doctor and the Wolf-pack at the White House. London: Routledge.

Schäffner, C. (1996). “Editorial: Political Speeches and Discourse Analysis”. Current Issues in Language & Society, 3 (3), 201-204.

Exploring imagery in literary corpora with the Natural Language Toolkit


Claire Brierley and Eric Atwell


Introduction: Corpus linguists have used concordance and frequency-analysis tools such as WordSmith (Scott, 2004) and WMatrix (Rayson, 2003) for the exploration and analysis of English literature. Version 0.9.8 of the Python Natural Language Toolkit (Bird et al., 2009) includes a range of sophisticated NLP tools for corpus analysis, such as multi-level tokenization and lexical semantic analysis. We have explored the application of NLTK to literary analysis, in particular to explore imagery or ‘imaginative correspondence’ (Wilson-Knight, 2001: 161) in Shakespeare’s Macbeth and Hamlet; and we anticipate that code snippets and experiments can be adapted for research and research-led teaching with other literary texts.


(1) Preparing the eText for NLP: Modern English versions of these plays appear in NLTK’s Shakespeare corpus and the initial discussion refers to NLTK guides for Corpus Readers and Tokenizers to provide a rationale and user access code for preserving form during verse tokenization – for example, preserving orthographic, compositional and rhythmic integrity in hyphenated compounds like ‘…trumpet-tongued…’ while detaching punctuation tokens as normal. 
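The form-preserving tokenization described in (1) can be illustrated with a minimal sketch. It uses only Python's standard `re` module rather than NLTK's own tokenizers, and the pattern and the function name `tokenize_verse_line` are assumptions made for illustration, not the authors' code.

```python
import re

# One token is either a word (optionally with internal hyphens or apostrophes,
# so compounds such as 'trumpet-tongued' stay whole) or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")

def tokenize_verse_line(line):
    """Split one verse line into word tokens and detached punctuation tokens."""
    return TOKEN_PATTERN.findall(line)

line = "Will plead like angels, trumpet-tongued, against"
print(tokenize_verse_line(line))
```

Keeping the compound as one token preserves its rhythmic unit within the line, while the commas still surface as separate tokens for later counting or filtering.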


(2) Raw counts and the cumulative resonance of words in context: Having transformed the raw text of a play into a nested structure of different kinds of tokens (e.g. line and word tokens), we report on various investigations which involve some form of counting, with coverage of: frequency distributions; lexical dispersion; and distributional similarity. Raw counts for the following words in Macbeth highlight the phenomenon of repetition and its immersive effect: fantastical: 2; fear: 35; fears: 8; horrid: 3; horrible: 3. We also perform analyses on the plays as Text objects via associated methods such as concordance() and collocations().
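The kind of counting described in (2) can be sketched with the standard library alone. This is a hedged illustration, not NLTK's implementation: the helper names `word_tokens`, `raw_counts` and `concordance` are hypothetical, and the two-line sample stands in for the full play, so its counts are not those reported above.

```python
import re
from collections import Counter

# Two lines from Macbeth used as a stand-in for the full text.
SAMPLE = (
    "Fair is foul, and foul is fair: "
    "Hover through the fog and filthy air."
)

def word_tokens(text):
    """Lowercased word tokens; hyphenated compounds are kept whole."""
    return re.findall(r"\w+(?:[-'’]\w+)*", text.lower())

def raw_counts(tokens, targets):
    """Raw frequency of each target word form."""
    freq = Counter(tokens)
    return {word: freq[word] for word in targets}

def concordance(tokens, node, width=3):
    """Keyword-in-context lines with `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

tokens = word_tokens(SAMPLE)
print(raw_counts(tokens, ["fair", "foul", "fear"]))
for line in concordance(tokens, "foul"):
    print(line)
```

Run over a whole play, the same two functions would give the raw counts and the contexts in which each repetition occurs.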


(3) Semantic distance and the element of surprise in metaphor: Spatial measures of semantic relatedness, such as physical closeness (co-occurrence and collocation) or taxonomic path length in a lexical network, are used in a number of NLP applications (Budanitsky and Hirst, 2006). The imaginative atmosphere of Macbeth is one of confusion, disorder, unreality and nightmare; things are not what they seem: ‘…fair is foul, and foul is fair’. These and similar instances suggest that antonymy is a feature of the play. To investigate such palpable tension in Macbeth, plus the ‘jarring opposites’ in Hamlet (cf. Wilson-Knight, 2001: 347), we might evaluate how well WordNet (Fellbaum, 1998), another of NLTK’s datasets, can uncover the prevalence of antonymy or the degree of semantic distance within selected phrases - since it is ‘distance’ that surprises us in the ‘fusion’ of metaphor. Dictionary classes in NLTK’s wordnet package provide access to the four content-word dictionaries in WordNet {nouns; verbs; adjectives; and adverbs}; and sub-modules enable users to explore synsets of polysemous words; move up and down the concept hierarchy; and measure the semantic similarity of any two words as a correlate of path length between their senses in the hypernym-hyponym taxonomy in WordNet.
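The idea of similarity as a correlate of path length, mentioned in (3), can be made concrete with a small sketch. The taxonomy below is hypothetical and hand-made for illustration; it is not taken from WordNet. The score 1 / (path length + 1) follows the usual definition of path similarity, assumed here rather than computed with NLTK.

```python
# Hypothetical mini-taxonomy (child -> parent); NOT WordNet data.
HYPERNYM = {
    "quality": "attribute",
    "state": "attribute",
    "beauty": "quality",
    "ugliness": "quality",
    "fairness": "beauty",
    "foulness": "ugliness",
    "sleep": "state",
}

def path_to_root(concept):
    """Chain of concepts from `concept` up to the top of the taxonomy."""
    path = [concept]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def path_length(a, b):
    """Number of edges on the shortest route between a and b in the tree."""
    up_b = {c: steps for steps, c in enumerate(path_to_root(b))}
    for steps_a, concept in enumerate(path_to_root(a)):
        if concept in up_b:
            return steps_a + up_b[concept]
    raise ValueError("no common ancestor")

def path_similarity(a, b):
    """Similarity falls as the path between the two concepts grows."""
    return 1.0 / (path_length(a, b) + 1)

for pair in [("fairness", "beauty"), ("fairness", "foulness"), ("fairness", "sleep")]:
    print(pair, path_length(*pair), round(path_similarity(*pair), 3))
```

In this toy tree the opposed pair fairness/foulness sits four edges apart, further than fairness/beauty but closer than fairness/sleep. A taxonomic path measure therefore treats antonyms as fairly close relatives, which is one thing an evaluation of WordNet for this purpose would need to take into account.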


A multilingual annotated corpus for the study of Information Structure


Lisa Brunetti, Stefan Bott, Joan Costa and Enric Vallduví


An annotated speech corpus in Catalan, Italian, Spanish, English, and German is presented. The aim of the corpus compilation is to create an empirical resource for a comparative study of Information Structure (IS).

Description. A total of 68 speakers were asked to tell a story by looking at the pictures of three text-less books by M. Mayer (cf. Strömqvist & Verhoeven 2004 and references quoted therein). The participants were mostly university students. Catalan and Spanish speakers were from Catalonia; Italian, English, and German speakers had recently arrived in Barcelona. The result is 222 narrations of about 2-9 minutes each (a total of about 16 hours of speech). The recordings are transcribed orthographically. Transcriptions and annotations of some selected high-quality recordings have been aligned to the acoustic signal stream using the program PRAAT (Boersma et al. 2009) and its specific format (cf. the Corpus of Interactional Data, Bertrand et al. 2008).

Annotation. An original annotation is proposed of non-canonical constructions (NCCs) for the Romance subgroup, namely of syntactically/prosodically marked constructions that represent informational categories such as topic, focus, contrast. The list of NCCs to be annotated is chosen on the basis of our knowledge of the typical NCCs of these languages: left/right dislocations, cleft and pseudocleft clauses, subject inversion, null subjects, focus fronting, etc.

            The analysis of NCCs in context is extremely useful for the study of IS, as they show explicitly what IS strategy the speaker uses within a specific discourse context. Despite their importance, only one example of this kind of annotation is available in the literature, to the best of our knowledge: the MULI corpus of written German (Baumann 2006). Therefore, our corpus provides a so-far missing empirical resource, which will enhance the research on IS based on quantitative analysis of sentences in real context.

Exploitation. The annotation allows for a comparative description of IS strategies in languages with very similar linguistic potential, in particular with respect to the ‘IS-syntax’ and ‘IS-prosody’ interface (via the alignment to the acoustic signal), and the study of the ‘IS-discourse’ interface. A survey of the differences in frequency and use of NCCs in the three languages is given. For instance, it will be shown that these languages differ in their strategies to hide the agent of the event. Passives are widely used in Italian but not in Spanish and Catalan, while Spanish makes greater use of arbitrary subjects. This is presumably connected to the greater use of left dislocations that we observe in Spanish compared with the other two languages. Another topicalizing strategy, namely the pseudo-cleft, is more common in Spanish than in Italian. These and similar data allow us to make generalizations concerning the degree of transparency of languages in their linguistic representation of IS (cf. Leonetti 2008).



Baumann, S. (2006). “Information Structure and Prosody: Linguistic Categories for Spoken Language Annotation”. In S. Sudhoff et al. (eds) Methods in Empirical Prosody Research. Berlin: W. de Gruyter.

Bertrand, R., P. Blache, R. Espesser, G. Ferré, C. Meunier, B. Priego-Valverde, and S. Rauzy (2008). “Le CID - Corpus of Interactional Data - Annotation et Exploitation Multimodale de Parole Conversationnelle”. Traitement Automatique des Langues, 49, 3.

Boersma, P. and Weenink, D. (2009). Praat: doing phonetics by computer (Version 5.0.47) [Computer program]. Retrieved January 21, 2009, from

Leonetti, M. (2008). “Alcune differenze tra spagnolo e italiano relative alla struttura informativa”. Convegno dell’Associazione Internazionale dei Professori d’Italiano, Oviedo, Sept. 2008.

Early Modern English expressions of stance


Beatrix Busse


Stance refers to epistemic or attitudinal comments on propositional information by the speaker and to the information perspectives of utterances in discourse. It describes how speakers express their attitudes and sources of information communicated (Biber et al. 1999).

Modality and modals, complement clauses, attitudinal expressions and adverbials can be linguistic indicators of stance. Yet, while some very fruitful diachronic studies exist on, for example, modality and mood (including EModE phenomena, e.g. Krug 2000, Closs Traugott 2006), stance adverbials have not been extensively studied diachronically (Biber 2004).

            This paper will identify and analyse possible Early Modern English stylistic and epistemic stance adverbials, such as forsooth, in regard of or to say precisely/true/the truth/the sooth. According to Biber et al. (1999), modern English stance adverbials can be subdivided into a) “epistemic stance adverbials,” which express the truth value of the proposition in terms of, for example, actuality or limitation or viewpoint, b) “attitudinal stance adverbials,” which express the speaker’s attitude to or evaluation of the content, and c) “stylistic stance adverbials,” which represent a speaker’s comment on the style or form of the utterance. Following a function-to-form and a form-to-function mapping as well as a discussion of methodological difficulties associated with these mappings and modern categorisations, my study investigates the frequencies of a selected set of these stance adverbials in relation to change and stability in the course of the Early Modern English period, as well as in relation to grammaticalisation, lexicalisation and pragmatisation (e.g. Brinton and Closs Traugott 2005, Brinton and Closs Traugott 2007). Corpora to be examined will be The Helsinki Corpus of English Texts, The Corpus of Early English Correspondence, A Corpus of English Dialogues 1560-1760, and a corpus of all Shakespeare’s plays. Reference will also be made to Early Modern Grammars and other sources in order to reconstrue contemporary statements of (and about expressing) stance.

            Furthermore, the registers (Biber 2004) in which these constructions occur will be analysed on the assumption that oral and literate styles do not directly correspond to medium or to high and low style. They have to be seen as a continuum. Moreover, the functional potential of stance adverbials will be investigated, for example, in relation to their positions in the clause and to the accompanying speech acts of the clause. Sociolinguistic and pragmatic parameters drawn on will be medium, domain, age of speaker, and other contextual factors, such as degree of construed formality. Within a more qualitative framework, it will be argued that Early Modern epistemic and style adverbials have interpersonal, experiential and textual meanings as discourse markers, indicating scope, attitude and speaker-hearer relationship.



Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999). Longman Grammar of Spoken and Written English. London: Longman.

Biber, D. (2004). “Historical Patterns for the Grammatical Marking of Stance: A Cross-Register Comparison”. Journal of Historical Pragmatics, 5 (1), 107-136.

Brinton, L. and E. Closs-Traugott (2005). Lexicalization and Language Change. Cambridge: Cambridge University Press.

Brinton, L. and E. Closs-Traugott (2005). “Lexicalization and Grammaticalization All over Again”. Historical Linguistics 2005: Selected Papers from the 17th International Conference on Historical Linguistics, Madison, Wisconsin.

Closs-Traugott, E. (2006). “Historical aspects of modality”. In W. Frawley (ed.) The Expression of Modality. Berlin: Mouton de Gruyter, 107-139.

Krug, M. (2000). Emerging English Modals: A Corpus-Based Study of Grammaticalization. Berlin: Mouton de Gruyter.

Integration of morphology and syntax unsupervised techniques for language induction

Héctor Fabio Cadavid and Jonatan Gomez


The unsupervised learning of language elements has a wide field of applications, including protein sequence identification, studies of child language induction and evolution, and natural language processing. In particular, the development of Semantic Web technology requires techniques for inducing grammatical knowledge at the syntactic, morphological and phonological levels. These techniques should be as detailed as possible to build up concepts and relations out of dynamic and error-prone contents.

      Almost all unsupervised language learning techniques are extensions of some classic technique or principle, like ABL (Alignment Based Learning), MDL (Minimum Description Length), or LSV (Letter Successor Variety). For instance, Alignment Based Learning techniques try to induce syntactic categories by identifying interchangeable elements in sentence patterns at the syntactic level; the Minimum Description Length principle is used to incrementally improve the search over a space of morphologies; and the Letter Successor Variety principle provides a morpheme segmentation criterion at the morphological level.
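The Letter Successor Variety principle can be illustrated with a minimal sketch. It is not the technique proposed in this paper: the fixed threshold used as the segmentation criterion, the function names and the toy lexicon are all assumptions chosen for illustration.

```python
from collections import defaultdict

def successor_varieties(word, lexicon):
    """For each proper prefix of `word`, count the distinct letters
    that follow that prefix anywhere in the lexicon."""
    followers = defaultdict(set)
    for entry in lexicon:
        for i in range(1, len(entry)):
            followers[entry[:i]].add(entry[i])
    return [len(followers[word[:i]]) for i in range(1, len(word))]

def segment(word, lexicon, threshold=2):
    """Cut after every prefix whose successor variety reaches `threshold`
    (an assumed criterion; peak detection is another common choice)."""
    pieces, start = [], 0
    for i, variety in enumerate(successor_varieties(word, lexicon), start=1):
        if variety >= threshold:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces

# Toy lexicon, invented for the example.
LEXICON = ["walk", "walks", "walked", "walking",
           "talk", "talks", "talked", "talking"]

print(successor_varieties("walking", LEXICON))
print(segment("walking", LEXICON))
```

After the prefix walk the lexicon continues with three different letters (s, e, i), whereas every other prefix of walking has a single continuation, so the variety peaks at the stem boundary and the word is cut into walk and ing.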

      However, such techniques are always focused on a single element of the language’s grammar. This has two disadvantages when trying to combine them, as is required in the Semantic Web case: first, two or more separate outputs from different techniques may be difficult to consolidate, since such outputs can be very different in format and content; and second, the information induced by one technique at a lower level is not easy to use in higher-level techniques to improve their performance. This paper discusses the integration of an unsupervised morphology learning technique with an unsupervised syntax learning technique for language induction on corpora extracted from the internet.

      The proposed technique integrates an LSV (Letter Successor Variety)-based unsupervised morphology learning technique with a MEX (Motif Extraction)-based unsupervised syntax learning technique. This integration enhances some of the grammatical categories obtained by the MEX-based technique with additional word categories identified by the LSV-based technique. In this way, a hierarchical clustering technique for finding morphologically related words is developed. The integrated solution is tested with corpora built from two different natural languages: English (a predominantly non-flexive/concatenative language) and Spanish (a predominantly flexive language).



Bordag, S. (2007). “Unsupervised and knowledge-free morpheme segmentation and analysis”. In Proceedings of the Working Notes for the CLEF Workshop 2007.

Clark, A. (2002). Unsupervised Language Acquisition: Theory and Practice. PhD thesis.

Demberg, V. (2007). “A language-independent unsupervised model for morphological segmentation”. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, 920-927.

Goldsmith, J. (2006). “An algorithm for the unsupervised learning of morphology”. Natural Language Engineering, 12(4), 353-371.

Roberts, A. and E. Atwell (2002). “Unsupervised grammar inference systems for natural language”. Technical Report 2002.20, School of Computing, University of Leeds.

Solan, Z., D. Horn, E. Ruppin, and S. Edelman (2005). “Unsupervised learning of natural languages”. Proceedings of the National Academy of Sciences of the USA, 102(33), 11629-11634.

van Zaanen, M. (2000). “ABL: alignment-based learning”. In Proceedings of the 18th conference on Computational linguistics, 961-967.

From theory to corpus and back again – the case of inferential constructions


Andreea Calude


Technological advances, in the form of increased storage and computational power, have revolutionised linguistic research through the use of corpora. However, the shift in paradigm from full introspection to pure usage-based models (Barlow and Kemmer 2000) has not gone without criticism (Newmeyer 2003). The current paper sets out to show that a combined approach, in which both methods play a role in the analysis, is most fruitful and provides a more complete picture of linguistic phenomena. The case is made by presenting an investigation of the inferential construction.

            Starting from a theoretical viewpoint, constructions such as It’s that he’s so self-satisfied that I find off-putting have been placed under the umbrella of it-clefts (Huddleston and Pullum 2002: 1418-1419). They are focusing structures, where the cleft constituent happens to be coded by a full clause rather than a noun phrase as is (typically) found in it-clefts (compare the above example to It’s a plate of soggy peas that I find off-putting).

            However, an inspection of excerpts of spontaneous conversation from the Wellington Corpus of Spoken New Zealand English (WCSNZE) (Holmes et al. 1998) suggests a different profile of the construction. As also found by Koops (2007), the inferential typically occurs in the negative form and/or with qualifying adverbs such as just, like, and more. More problematically, the inferential only rarely contains a relative clause (Delahunty 2001). This has caused considerable debate in the literature as to whether the that-clause following the copula in fact represents the cleft constituent (Delahunty 2001, Lambrecht 2001), or whether it is instead best analysed as the relative/cleft clause (Collins 1991, and personal communication). Such debates culminated in some authors banning the inferential from the cleft class altogether (Collins 1991).

Returning to the drawing board, we see further evidence that the inferential examples identified in the corpus differ from it-clefts. While in it-clefts, the truth conditions of the presupposition change with a change in the cleft’s polarity, in the inferential, this is not the case, as exemplified below from the WCSNZE.

(1) It-cleft

a. It’s Norwich Union he’s working for. (WCSNZE, DPC083)

Presupposition: He is working for Norwich Union.

b. It’s not Norwich Union he’s working for.

Presupposition: He is not working for Norwich Union.

(2) Inferential

a. It’s just that she needs to be on the job. (WCSNZE, DPC059)

Presupposition: She needs to be on the job.

b. It’s not just that she needs to be on the job.

Presupposition: She [still] needs to be on the job [but the real issue is something else, not just that she needs to be on the job].

As illustrated by the inferential construction, the power of a hybrid approach in linguistic inquiry comes from the ability to draw on both resources, namely, real language examples found in corpus data, and also, the wealth of theoretical tools developed by linguists to account for, question, and (hopefully) explain linguistic phenomena.


References:

Barlow, M. and S. Kemmer (2000). Usage-Based Models of Language. Stanford: CSLI Publications.

Collins, P. (1991). Cleft and Pseudo-cleft constructions in English. London: Routledge.

Delahunty, G. (2001). “Discourse functions of inferential sentences”. Linguistics. 39 (3), 517-545.

Holmes, J., B. Vine and G. Johnson (1998). Guide to the Wellington corpus of spoken New Zealand English.

Koops, C. (2007). “Constraints on Inferential Constructions”. In G. Radden, K.-M. Köpcke, T. Berg and P. Siemund (eds) Aspects of Meaning Construction, 207-224. Amsterdam: John Benjamins.

Lambrecht, K. (2001). “A framework for the analysis of cleft constructions”. Linguistics 39, 463–516.

Newmeyer, F. (2003). “Grammar is grammar and usage is usage”. Language, 79 (4), 682-707.

Speech disfluencies in formal context: Analysis based on spontaneous speech corpora

Leonardo Campillos and Manuel Alcántara


This paper examines disfluencies in a corpus of Spanish spontaneous speech in formal contexts. This topic has recently gained interest in the research community, as the last ICAME workshop or the NIST competitive evaluations have shown. The goal of this paper is to propose a classification of disfluencies in formal spoken Spanish and to show their relations with both linguistic and extra-linguistic factors.

A total of 122,000 words taken from the MAVIR and C-ORAL-ROM corpora have been analyzed. Both corpora have been annotated with prosodic and linguistic tags, including 2,800 hand-annotated disfluencies. The corpora comprise 33 documents, classified according to context: business, political speech, professional explanation, preaching, conference, law, debate, and education.

The main disfluency phenomena considered in this work are repeats, false starts, filled pauses, incomplete words, and non-grammatical utterances. Their main characteristics are described together with samples. For example, the analysis shows that the most frequently repeated forms are function words (mainly prepositions, conjunctions, and articles), with "de" (of), "en" (in), "y" (and), and "el" (the) at the top of the list. The data also show that the frequencies of the different disfluency phenomena are not related: documents with a high rate of repeats do not necessarily have a high rate of filled pauses.

The linguistic and extra-linguistic factors considered were chosen following the state-of-the-art literature on disfluencies in other languages, since they have so far received little attention for Spanish. The linguistic factors are word-related aspects of disfluencies (e.g. which words are repeated) and syntactic complexity (e.g. how syntactically complex the parts following a disfluency are). The extra-linguistic factors are document class, number of speakers, and speech domain. Together they help to explain differences in the frequency of disfluencies. In the corpora, frequency varies widely, from one disfluency every 1,000 words to 12 disfluencies every 100 words (with a mean of almost 5 disfluencies every 100 words).
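
The rates quoted above are normalised frequencies (disfluencies per 100 words). As a hedged illustration of how such figures can be derived, the sketch below uses three hypothetical documents with invented counts (not data from MAVIR or C-ORAL-ROM) and shows that the mean of per-document rates need not coincide with the rate pooled over all words.

```python
def rate_per_100_words(disfluencies, words):
    """Normalised frequency: disfluencies per 100 running words."""
    if words <= 0:
        raise ValueError("a document must contain at least one word")
    return 100.0 * disfluencies / words


# Hypothetical documents: (label, number of disfluencies, number of words).
DOCUMENTS = [
    ("doc_a", 1, 1000),   # one disfluency every 1,000 words
    ("doc_b", 12, 100),   # 12 disfluencies every 100 words
    ("doc_c", 30, 1000),
]

per_document = {
    label: rate_per_100_words(d, w) for label, d, w in DOCUMENTS
}

# Mean of the per-document rates (every document weighs the same).
mean_rate = sum(per_document.values()) / len(per_document)

# Rate pooled over the whole collection (every word weighs the same).
pooled_rate = rate_per_100_words(
    sum(d for _, d, _ in DOCUMENTS),
    sum(w for _, _, w in DOCUMENTS),
)

for label, rate in per_document.items():
    print(f"{label}: {rate:.2f} per 100 words")
print(f"mean of document rates: {mean_rate:.2f}")
print(f"pooled rate: {pooled_rate:.2f}")
```

Short documents with many disfluencies raise the mean of the per-document rates more than they raise the pooled rate, so it is worth stating which of the two a reported figure refers to.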

On the one hand, the results are compared with two previous studies carried out with the C-ORAL-ROM corpus. The first proposed a comprehensive typology of phenomena affecting transcription tasks. Phenomena were classified into types, and frequencies were compared for informal, formal, and media interactions. Though the goal of that work was not a study of disfluencies, we now show that these are one of the main factors that make a document difficult for the transcriber. The second study also focused on differences between informal, formal, and media interactions, but from an acoustic point of view. Instead of using manual transcriptions, it tested the difficulty of automatically processing different types of spontaneous speech by performing acoustic-phonetic decoding with a recognizer on parts of the corpus.

On the other hand, the main figures of our study are compared to those of spontaneous English in formal contexts in order to show how language-dependent these phenomena are.

   Results are important not only for the linguistic insights they provide. They will also help improve current automatic speech recognition systems for spontaneous language since disfluencies are one of the main problems in ASR.



Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999). Longman Grammar of Spoken and Written English. London: Longman.

Huang, X., A. Acero and H.-W. Hon (2001). Spoken Language Processing: A Guide to Theory, Algorithm and System Development. New Jersey: Prentice Hall PTR.

González Ledesma, A., G. de la Madrid, M. Alcántara Plá, R. de la Torre and A. Moreno-Sandoval (2004). “Orality and Difficulties in the Transcription of Spoken Corpora”. Proceedings of the Workshop on Compiling and Processing Spoken Language Corpora, LREC,

Shriberg, E. (1994). Preliminaries to a Theory of Speech Disfluencies. Ph.D. Thesis.

Toledano, D. T., A. Moreno, J. Colas and J. Garrido (2005). “Acoustic-phonetic decoding of different types of spontaneous speech in Spanish”. Disfluencies in Spontaneous Speech Workshop 2005.

Featuring linguistic decline in Alzheimer’s disease: A corpus-based approach


Pascual Cantos-Gomez


Alzheimer’s disease (AD) is a brain disorder, the most common form of dementia, named after the German physician Alois Alzheimer, who first described it in 1906. It is a progressive and fatal brain disease that destroys brain cells, causing problems with memory, thinking and behaviour. The neuropsychological deficits attributable to Alzheimer’s disease have been documented extensively.

      The various stages of cognitive decline in AD patients include a linguistic decline. Language is known to be vulnerable to the earliest stages of Alzheimer's disease, and the findings of the Iris Murdoch Project (Garrard et al. 2005) confirmed that linguistic changes can appear even before the symptoms are recognised by either the patient or their closest associates.

      This paper describes a longitudinal language assessment of Prime Minister Harold Wilson’s speeches (1964-1970 and 1974-1976), extracted from the Hansard transcripts (the edited verbatim report of proceedings in both Houses), in order to explore possible effects of the AD process on his language use.

      We are confident that this longitudinal study and the language use variables selected might provide us with some new understanding of the evolution and progression of language deterioration in AD.



Burrows, J. F. (1987). “Word-patterns and Story-shapes: The Statistical Analysis of Narrative Style”. Literary and Linguistic Computing, 2 (2), 61-70.

Cantos, P. (2000). “Investigating Type-token Regression and its Potential for Automated Text-Discrimination”. Cuadernos de Filología Inglesa, 9 (1), 71-92.

Chaski, C. (2005). “Computational Stylistics in Forensic Author Identification”. [Document available on the Internet at]

Conrad, S. and D. Biber (eds) (2001). Variation in English: Multi-Dimensional studies. London: Longman.

Garrard, P. (in press). “Cognitive Archaeology: Uses, Methods, and Result”. Journal of Neurolinguistics, doi:10.1016/j-neurolinguistics.2008.07.006.

Garrard, P. et al. (2005). “The Effects of very Early Alzheimer’s Disease on the Characteristics of Writing by a Renowned Author”. Brain 128: 250-260.

Gómez Guinovart, X. and J. Pérez Guerra (2000). “A multidimensional corpus-based analysis of English spoken and written-to-be-spoken discourse”. Cuadernos de Filología Inglesa, 9 (9), 39-70.

Kempler, D. et al. (1987). “Syntactic Preservation in Alzheimer’s Disease”. Journal of Speech and Hearing Research, 30, 343-350.

March, G. E. et al. (2006). “The uses of nouns and deixis in discourse production in Alzheimer's disease”. Journal of Neurolinguistics, 19, 311-340.

March, G. E. et al. (2009). “The Role of Cognition in Context-dependent Language Use: Evidence from Alzheimer's Disease”. Journal of Neurolinguistics, 22, 18-36.

Mollin, S. (2007). “The Hansard Hazard. Gauging the accuracy of British parliamentary transcripts.” Corpora, 2 (2), 187-210.

Rosenberg, S. and L. Abbeduto (1987). “Indicators of Linguistic Competence in Peer Group Conversational Behavior of Mildly Retarded Adults”. Applied Psycholinguistics, 8 (1), 19-32.

The evolution of institutional genres in time: The case of the White House press briefings


Amelia Maria Cava, Silvia De Candia, Giulia Riccio, Cinzia Spinzi and Marco Venuti


Literature on genre analysis mainly focuses on the description of language use in the different professional and institutional domains (Bhatia 2004). Despite the different directions of the studies on genre (Martin and Christie 1997; Swales 1990), however, a common orientation may be seen in their tendency to describe homogeneous concepts, such as communicative situation, register and function. Nevertheless, genre-specific features are subject to changes due to the ongoing processes of internationalisation and globalisation (Candlin and Gotti 2004; Cortese and Duszak 2005). Within the framework of a wider research project titled “Tension and change in English domain-specific genres”, funded by the Italian Ministry of Research, the present paper aims to outline, through a corpus-based analysis of lexico-grammatical and syntactic features (Baker 2006), in what ways the White House press briefings, as a genre, have evolved in the last 16 years under the pressure of technological developments and of media market transformation. White House press briefings are meetings between the White House press secretary and the press, held on an almost daily basis, and they may be regarded as the main official channel of communication for the White House. Embracing a diachronic perspective, our analysis aims at identifying the main features of the evolution of the briefings as a genre during the Clinton and George W. Bush administrations. A corpus (DiaWHoB) including all the briefings from January 1993 to January 2009, available on the American Presidency Project website, has been collected in order to carry out the analysis. The corpus consists of about 4,000 briefings and is made up of more than 18 million words. The scope and size of a specialised corpus of this kind make it a powerful tool to investigate the evolution of the White House press briefing. In order to manage the data more efficiently, the corpus has been annotated.
The XML mark-up includes information about individual speakers and their roles, date, briefing details and text structure. Our intent is to compare different discourse strategies adopted by speakers in the briefings at different points in time, and also to identify differences between discourse features employed by the press secretary and those more typical of the press. The present paper outlines the corpus structure and the ways in which the corpus architecture helps in investigating the evolution of the genre, and also presents some preliminary results. In particular, the focus is on some examples of evolution in phraseology within the genre of briefings, in order to support the hypothesis that a diachronic corpus-based investigation facilitates comparisons among different speakers thanks to the XML mark-up, while providing interesting insight into the evolution of a genre.
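
To make concrete how speaker and role mark-up supports such comparisons, here is a minimal sketch. The element and attribute names (briefing, turn, speaker, role, date) and the sample text are assumptions made for illustration; the actual DiaWHoB annotation scheme is not specified in this abstract.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Invented sample in a hypothetical annotation scheme.
SAMPLE = """<briefing date="1993-01-25">
  <turn speaker="SPEAKER_A" role="press_secretary">I think we have made that clear.</turn>
  <turn speaker="SPEAKER_B" role="press">Can you make that clear again?</turn>
  <turn speaker="SPEAKER_A" role="press_secretary">I think so.</turn>
</briefing>"""


def word_counts_by_role(xml_text):
    """Return the briefing date and one word-frequency Counter per role."""
    root = ET.fromstring(xml_text)
    counts = defaultdict(Counter)
    for turn in root.iter("turn"):
        # Crude tokenisation: lower-cased runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", (turn.text or "").lower())
        counts[turn.get("role")].update(tokens)
    return root.get("date"), counts


date, counts = word_counts_by_role(SAMPLE)
print(date)
for role, counter in counts.items():
    print(role, counter.most_common(3))
```

Because each turn carries its role and each briefing its date, the same counts can be grouped by administration or by year to follow a phrase over time.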



Baker, P. (2006). Using Corpora in Discourse Analysis. London: Continuum.

Bhatia, V. K. (2004). Worlds of Written Discourse. London: Continuum.

Candlin, C. and M. Gotti (eds) (2004). Intercultural Discourse in Domain-Specific English. Textus, 17(1).

Cortese, G. and A. Duszak (eds) (2005). Identity, Community, Discourse: English in Intercultural Settings. Bern: Peter Lang.

Crystal, D. (2003). English as a Global Language. Second Edition. Cambridge: CUP.

Gotti, M. (2003). Specialized Discourse. Bern: Peter Lang.

Martin, J. and F. M. Christie (1997). Genre and Institutions. London: Cassell.

Partington, A. (2003). The Linguistics of Political Argument: the Spin-doctor and the Wolf-pack at the White House. London/New York: Routledge.

Swales, J. M. (1990). Genre Analysis. Cambridge: CUP.


Parallel corpora: The case of InterCorp

Frantisek Cermák


There is a growing awareness, which began decades ago, that parallel corpora might contribute substantially to contrastive language research and to various applications based on it. However, except for well-known and rather one-sided parallel corpora, such as the Canadian Hansard and Europarl corpora, the attention paid to them has been oddly restricted, mostly to two things. On the one hand, computer scientists seem to compete fiercely in the field of tools, including the search for optimal alignment methods, and when they have arrived at a solution and become convinced that there is no more to be achieved here, they drop the subject and lose interest in it. On the other hand, a parallel corpus hardly ever means anything more than a bilingual parallel corpus. Thus, the whole field seems to be lacking in a number of respects, including both real use and exploitation, preferably linguistic, and the broader goal of comparing and researching more languages, a goal which should suggest itself in today's multilingual Europe. Moreover, most attention is paid, understandably, to language pairs in which at least one member is a large language, such as English.

            InterCorp, a subproject of the Czech National Corpus currently in progress, is a joint attempt by linguists, language teachers and representatives of over 25 languages to change this picture a little and to make Czech, a language spoken by 10 million people, a centre and, if possible, a hub for the rest of the languages included. The list now contains most European state languages, small and large. Given the familiar limited supply of translations, the plan is to cover as much as possible of (1) contemporary language (starting with the end of World War 2), (2) non-fiction of any type, if available (fiction prevails in any case), (3) translations from a third language, apart from the pair of languages in question (in case of need), and (4) translations into more than one language, if possible. The plan, general guidelines and problems will be discussed in detail.

            Obviously, this contribution is aimed at redressing the balance by looking at linguistic types of exploitation, although some thought will also be given to non-linguistic ones. Such a large general multilingual corpus, which seems to have few parallels elsewhere, could be a basis and a tool for finding out more, including answers to questions such as what is possible on such a large scale and what its major problems and desiderata, still to be discovered, might be. First results (the project will run until 2011) will be available at a conference held in Prague in August 2009.



Using linguistic metaphor to express modality: A corpus study of argumentative writing


Claudia Marcela Chapetón Castro and Danica Salazar


This paper is part of a larger study that investigates the use of linguistic metaphor in native and non-native argumentative writing by means of corpus methods. Cognitive research has shown that many concepts, especially abstract ones, are structured and mentally represented in terms of metaphor (Lakoff, 1993; Lakoff and Johnson, 1980), which is claimed to be highly systematic and pervasive in different registers. However, until recently, cognitive metaphor research has been based largely on elicited, invented and decontextualized data. The present study takes a corpus-driven approach to the analysis of metaphor in order to connect the conceptual with the linguistic by using empirical, naturally occurring written data.

            Three corpora are compared in the study. The first corpus is a collection of professionally written editorials published in American newspapers. The second is a sample from the Louvain Corpus of Native English Essays (LOCNESS), which contains argumentative essays written by native-speaker American students. The third is a sample from a sub-corpus of the International Corpus of Learner English (ICLE) composed of essays written by advanced Spanish-speaking learners of English. These corpora were chosen to represent expert native writing (editorials), novice native writing (LOCNESS) and learner writing (SPICLE).

            Linguistic metaphor was identified in context through the application of a rigorous, systematic procedure that combined Cameron’s (2003) Metaphor Identification through Vehicle Terms (MIV) procedure with the Pragglejaz Group’s (2007) Metaphor Identification Procedure (MIP). Once identified, instances of linguistic metaphor were analyzed in terms of their function in the texts.

            This paper focuses on one of the functions that metaphors have been found to perform in the data: the expression of modality. Several studies have shown modality to be important in argumentative writing, since modal devices help writers convey their assessments and degree of confidence in their propositions and enable them to set the appropriate tone and distance from their readers (Aijmer, 2002; Hyland and Milton, 1997). In other words, modal devices fulfill pragmatic and interpersonal functions crucial to argumentation.

            This paper compares how the three writer groups under study use linguistic metaphor to communicate modal meanings. In the presentation, patterns of occurrence will be described and differences among the three corpora with regard to the appropriateness of the degree of commitment and certainty expressed through metaphor will be discussed. Aspects such as the use of metaphor in personalized and impersonalized structures and the influence of cultural and linguistic variations on the use of metaphor to express epistemic commitment will be explained. The pedagogical implications of the results obtained will be highlighted.



Aijmer, K. (2002). “Modality in advanced Swedish learners’ written interlanguage”. In S. Granger, J. Hung and S. Petch-Tyson (eds) Computer learner corpora, second language acquisition and foreign language teaching, 55-76. Amsterdam: John Benjamins.

Cameron, L. (2003). Metaphor in educational discourse. London: Continuum.

Hyland, K. and J. Milton (1997). “Qualification and certainty in L1 and L2 students’ writing”. Journal of Second Language Writing, 6, 183-205.

Lakoff, G. (1993). “The contemporary theory of metaphor”. In A. Ortony (ed.) Metaphor and thought, 202-251. Cambridge: Cambridge University Press.

Lakoff, G. and M. Johnson (1980). Metaphors we live by. Chicago: University of Chicago Press.

Pragglejaz Group (2007). “MIP: A method for identifying metaphorically used words in discourse”. Metaphor and Symbol, 22 (1), 1-39.

DDL for the EFL classroom: Effective uses of a Japanese-English parallel corpus and the development of a learner-friendly, online parallel concordance


Kiyomi Chujo, Laurence Anthony and Kathryn Oghigian


There is no question that the use of corpora in the classroom has value, but how useful is concordancing with beginner level EFL students?  In this presentation, we outline a course in which a parallel Japanese-English concordancer is used successfully to examine specific grammar features in a newspaper corpus. We will also discuss the benefits and limitations of existing parallel concordance programs, and introduce a new freeware parallel concordancer, WebConc-bilingual, currently being developed.

            In the debate over whether an inductive (Seliger, 1975) or deductive (Shaffer, 1989) approach to grammar learning is more effective, we agree with Corder (1973) that a combination is most effective and we have incorporated both into a successful DDL classroom procedure: (1) hypothesis formation through inductive DDL exercises; (2) explicit explanations from the teacher to confirm or correct these hypotheses; (3) hypothesis testing through follow-up exercises; and (4) learner production. Through this procedure we show that incorporating cognitive processes such as noticing, hypothesis formation, and hypothesis testing enables learners to develop skills effectively.

            The course was based on the findings of two studies. Uchibori et al. (2006) identified a set of grammatical features and structures used in practical English expressions that are found in TOEIC questions but not generally taught in Japanese high school textbooks. Chujo (2003) identified the vocabulary found in TOEIC but not taught in Japanese high school textbooks. These grammatical structures and vocabulary form the basis of twenty DDL lessons taught over two semesters.

            A case study was used to assess the validity of the course design. Students were asked to follow carefully crafted guidelines to explore various noun or verb phrases using parallel Japanese-English concordancing. In addition, students completed follow-up activities using targeted vocabulary to reinforce the grammar patterns discovered through the DDL activities. The evaluation of learning outcomes showed that the course design was effective for understanding the basic patterns of these noun and verb phrases, as well as for learning vocabulary.

            Although we have previously relied on ParaConc (Barlow, 2008), for this course we have started to develop a new web-based parallel concordancer. It is built on a standard LAMP (Linux, Apache, MySQL, PHP) framework and is designed to be both powerful and intuitive for teachers and learners to use. In addition, the concordancer engine has an architecture similar to that of the Google search engine, allowing it to work comfortably on very large corpora of hundreds of millions of words. The concordancer is also built to Unicode standards; it can therefore process both English and Japanese texts smoothly and does not require any cumbersome token definition settings. Preliminary results show that the new software is considerably easier to use than standard desktop programs, and that it gives students the opportunity to carry out DDL exercises at home or even during their commute using a mobile device.
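
The core operation of a parallel concordancer can be sketched in a few lines. The example below is an illustrative assumption, not the WebConc-bilingual implementation (which is described above as PHP/MySQL-based): it assumes a corpus already aligned at sentence level, held in memory as (English, Japanese) pairs with invented example sentences, and returns each English hit in keyword-in-context form together with its aligned Japanese sentence.

```python
import re

# Hypothetical sentence-aligned pairs (English, Japanese).
PAIRS = [
    ("I read a book.", "私は本を読みます。"),
    ("She bought a new book.", "彼女は新しい本を買いました。"),
    ("He drinks coffee.", "彼はコーヒーを飲みます。"),
]


def parallel_concordance(pairs, term, width=20):
    """Return (left context, keyword, right context, aligned sentence)
    for every whole-word match of `term` on the source side."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = []
    for source, target in pairs:
        for match in pattern.finditer(source):
            left = source[max(0, match.start() - width):match.start()]
            right = source[match.end():match.end() + width]
            hits.append((left, match.group(0), right, target))
    return hits


for left, keyword, right, target in parallel_concordance(PAIRS, "book"):
    print(f"{left:>20}[{keyword}]{right:<20} | {target}")
```

Python strings are Unicode, so the Japanese side needs no special handling here. Searching on the Japanese side would require a different matching strategy, since Japanese text has no spaces between words and the word-boundary pattern used above would not apply.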



Barlow, M. (2008). ParaConc (Build 269) [Software].

Chujo, K. (2003). “Selecting TOEIC Vocabulary 1 & 2 for Beginner Level Students and Measuring its Effect on a Sample TOEIC Test”. Journal of the College of Industrial Technology, 36, 27-42.

Seliger, H. (1975). “Inductive Method and Deductive Method in Language Teaching: A Re-Examination”. International Review of Applied Linguistics, 13, 1-18.

Shaffer, C. (1989). “A Comparison of Inductive and Deductive Approaches to Teaching Foreign Languages”. The Modern Language Journal, 73(4), 395-402.

Uchibori, A., K. Chujo and S. Hasegawa (2006). “Toward Better Grammar Instruction: Bridging the Gap between High School Textbooks and TOEIC”. The Asian EFL Journal, 8(2), 228-253.

A case study of economic growth based on U.S. presidential speeches


Siaw-Fong Chung


In addition to being a metaphor itself, growth is found to be highly lexicalized in the expression economy/economic growth, which can form other metaphors (cf. White (2003) for discussion of ‘growth’). Using ECONOMIC GROWTH as a target domain, this work addresses the issue of source domain indeterminacy in metaphor research. Data from thirty-eight U.S. presidential State of the Union speeches from 1970 through 2005 were analyzed (comprising data from presidents Nixon, Ford, Carter, Reagan, G. H. W. Bush, Clinton and G. W. Bush). Seventy-six instances of growth (of a total of 178; 43%) were found to be related to the economy. Three methods (shown in the table below) were used to extract the source domains for ECONOMIC GROWTH, namely (a) examining syntactic positions (intuition-based); (b) using WordNet and SUMO (Suggested Upper Merged Ontology) (top-down); and (c) using Sketch Engine for collocations (cf. Kilgarriff and Tugwell, 2001) (bottom-up). (A similar idea, but with the aid of computational tools, was used in Chung (forthcoming), based on Mandarin newspaper data.)


Table: Determining Source Domains

Metaphorical Expressions | Possible Source Domains: Syntactic Positions | Possible Source Domains: WordNet 1.6 and SUMO

the real engine of economic growth in this country is private sector | Engine of X | X: car, aircraft, war

to promote freedom as the key to economic growth | Key to X | X: success, understanding

the fruits of growth must be widely shared | Fruits of X | X: tree, effort, work


            Even though ECONOMIC GROWTH can serve as a target domain and create new metaphors, inconsistency arises in determining source domains, as is also shown by the use of varied source domains for similar metaphorical expressions in previous studies. Of the three methods suggested above, this paper finds that collocations usually do not work well with items whose metaphorical uses are more frequent than their literal uses (such as slow in slow down). Combining the analyses from syntactic positions, ontologies, and collocations may be advantageous in providing more information for source domain determination, but it may also cause methodological difficulties, especially when different methods yield contradictory results.
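
A frame such as "Engine of X" can also be explored mechanically. The sketch below is an assumption-laden illustration, not the procedure used in this study (which relies on intuition, WordNet/SUMO and Sketch Engine): it uses a plain regular expression over a handful of sentences, two taken from the examples above and two invented, and counts the word that fills the X slot after "of".

```python
import re
from collections import Counter

# Two sentences from the examples above plus two invented ones.
TEXTS = [
    "the real engine of economic growth in this country is private sector",
    "the fruits of growth must be widely shared",
    "The engine of the car failed.",           # invented
    "The engine of an aircraft is complex.",   # invented
]


def slot_fillers(texts, head):
    """Count the first word after '<head>(s) of', skipping one article.
    No part-of-speech tagging is used, so the filler may be an adjective."""
    pattern = re.compile(
        r"\b" + re.escape(head) + r"s?\s+of\s+(?:(?:the|a|an)\s+)?([a-z]+)",
        re.IGNORECASE,
    )
    counts = Counter()
    for text in texts:
        for match in pattern.finditer(text):
            counts[match.group(1).lower()] += 1
    return counts


print(slot_fillers(TEXTS, "engine"))
print(slot_fillers(TEXTS, "fruit"))
```

The filler "economic" returned for the first sentence shows why such surface patterns need tagging or manual checking before literal fillers (car, aircraft) are compared with metaphorical ones.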



Kilgarriff, A. and D. Tugwell (2001). “WORD SKETCH: Extraction and Display of Significant Collocations for Lexicography.” In the Proceedings of the ACL Workshop COLLOCATION: Computational Extraction, Analysis and Exploitation. Toulouse, 32-38.

White, M. (2003). “Metaphor and Economics: The Case of Growth.” English for Specific Purposes. 22, 131-151.

Chung, S-F. (forthcoming). A Corpus-driven Approach to Source Domain Determination. Language and Linguistics Monograph Series. Nankang: Academia Sinica.


Corpora and EFL / ELT: Losses, gains and trends in a computerized world

Adrian Ciupe


In the last decade or so, thanks to unprecedented developments in computer technology, corpus research has been yielding unique insights into the workings of the lexis of English, a language which, as a second language, is an almost ubiquitous requirement in curricula across the world, from primary to tertiary level. Catching up with these ground-breaking advances (derived from the painstaking research of linguists and lexicographers) are the EFL / ELT beneficiaries of such endeavours, prestigious publishers and non-native learners / teachers of English alike. Nonetheless, although each major EFL publisher boasts a proprietary corpus (running to billions of words and counting!), the ultimate general challenge lies in the usability of the final product by the intended target audience. Such publishers have embarked on the assiduous task of processing raw corpus data in terms of frequency, usage, register, jargon etc. towards its (twice-filtered) inclusion in learner dictionaries (notably, their electronic versions), as well as in various course books (especially exam preparation courses). They are vying for market-leader positions (as is obvious from blurbs alone), but with what benefits to non-native users at large? Are there any grey areas left to be covered? Electronic dictionaries may have become sophisticated enough to accommodate more user-friendly replicas of book formats, placing advanced search tools at the disposal of learners and thus blending language and computer literacy towards a more efficient acquisition of native-like competence. Despite the apparent benefits, several shortcomings crop up on closer inspection: electronic learner dictionaries draw on mammoth professional corpora, ultimately emerging as mini/guided corpora targeted at non-native speakers. But how does the processed data lend itself to relevance and searchability? How much does the software replicate the compiling strategies behind book-format counterparts?
How truly advanced are the incorporated search tools? How much vertical vs. horizontal exploration do they allow? How much of the information is deductively structured (i.e. headword > phraseology > real-life language / examples = the classic approach) and how much inductively (i.e. real-life language / examples > phraseology (> headword) = the new computer-assisted methods)? Do interfaces and layout make a difference? Similarly, regarding corpora use in EFL book formats, what kinds of publications are currently available, whom do they target (cf. a predilection for exam preparation), what do they contain, how are they organized and how do they practise what they preach? How can the end user really benefit from this classic / modern tandem of data processing (the experts), transfer (the EFL publications) and acquisition (the learners)? My paper will address the questions above, based on practice-derived findings. I will contend that although cutting-edge technology is readily available both to experts and to an intended target audience (non-native speakers of English), corpora-based EFL / ELT resources are still in relative infancy (cf. their novelty) yet remarkably capable of reshaping focus through need-based end-products that could eventually live up to the much vaunted reliability, usability and flexibility of technology-assisted language learning.





A corpus approach to images and keywords in King Lear


Maria Cristina Consiglio


This paper aims to apply the tools and methodology of Corpus Linguistics to Shakespeare’s King Lear in order to explore the relationship between images and keywords. King Lear is a play where language is particularly important in both the characterization and the development of the plot. Traditional critical studies – like Spurgeon (1935); Armstrong (1946); Evans (1952); Clemen (1966); Wilson Knight (1974); Frye (1986) – have identified some thematic words, which contribute strongly to conveying to the audience the general atmosphere of the play and to creating an imagery which effectively ‘suggest[s] to us the fundamental problems lying beneath the complex construction of [the] play’ (Clemen, 4). The plot develops in an atmosphere of strokes, blows, fights, shakes, strains and spasms, and the imagery contributes to exciting, intensifying and multiplying the emotions; sometimes, through the use of symbols and metaphors, it foregrounds aspects of the characters’ thought.

            This paper intends to analyse the above-mentioned characteristics using the software WordSmith Tools. A first step consists in a quantitative analysis aimed at identifying the keywords of the play, using a reference corpus made up of the other so-called great tragedies by Shakespeare; a second step consists in a qualitative analysis based on the study of concordance lines, which makes it possible to verify the different uses and meanings a given word assumes in a given text. The results will be evaluated against those of the ‘traditional’ studies of the imagery of the play in order to verify whether and to what extent they coincide. The reason for this choice is to ascertain the usefulness of Corpus Linguistics in studying a linguistically complex literary text like King Lear, where language plays such an important role.
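The quantitative step described above can be illustrated with a small keyness calculation. The following is only a sketch: it assumes log-likelihood as the keyness statistic (the abstract does not say which statistic is used), and the function names and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score.

    a = frequency in the study text, b = frequency in the reference corpus,
    c = size of the study text, d = size of the reference corpus.
    """
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:
        total += a * math.log(a / expected_study)
    if b > 0:
        total += b * math.log(b / expected_ref)
    return 2 * total


def keywords(study_tokens, reference_tokens, min_freq=2):
    """Return (word, score) pairs for positive keywords, strongest first."""
    study = Counter(study_tokens)
    reference = Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        if a < min_freq:
            continue
        b = reference[word]
        # Keep only words that are relatively MORE frequent in the study text.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data standing in for the play and the reference tragedies.
study = ["nature"] * 3 + ["nothing"] * 2 + ["the"] * 2 + ["king"]
reference = ["the"] * 4 + ["king"] * 2 + ["love"] * 2
for word, score in keywords(study, reference):
    print(word, round(score, 2))
```

In a real study the token lists would come from the tokenised play and reference corpus, and a significance threshold would normally be applied to the scores.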



Armstrong, E.A. (1946). Shakespeare’s Imagination: A Study of the Psychology of Association and Inspiration. London: Drummond.

Clemen, W. (1959). The Development of Shakespeare’s Imagery. London: Methuen.

Culpeper, J. (2002). “Computer, Language, and Characterization: An Analysis of six Characters in Romeo and Juliet”. In U. Melander-Marttala, C. Ostman and M. Kyto (eds) Conversation in Life and in Literature: Papers from the ASLA Symposium, Association Suedoise de Linguistique Appliquée, 15, Universitetstryckeriet: Uppsala, 11-30.

Culpeper, J. (2007). “A New kind of Dictionary for Shakespeare’s Plays: An Immodest Proposal”. Yearbook of the Spanish and Portuguese Society for English Renaissance Studies, 17, 47-73.

Evans, B.I. (1952). The Language of Shakespeare’s Plays. Bloomington: Indiana University Press.

Frye, N. (1986). Northrop Frye on Shakespeare, ed. by R. Sandler. New Haven / London: Yale University Press.

Louw, B. (1993). “Irony in the Text or Insincerity of the Writer? The Diagnostic Potential of Semantic Prosodies”. In M. Baker, G. Francis and E. Tognini Bonelli (eds) Text and Technology. Amsterdam: John Benjamins, 157-176.

Louw, B. (1997). “The Role of Corpora in Critical Literary Appreciation”. In A. Wichman, S. Fligelstone, T. McEnery and G. Knowles (eds) Teaching and Language Corpora. Harlow: Longman, 240-251.

Scott, M. (2006). “Key words of Individual Texts”. In M. Scott and C. Tribble (eds) Textual Patterns. Key Words and Corpus Analysis in Language Education. Amsterdam: John Benjamins, 55-72.

Spurgeon, C. (1952). Shakespeare’s Imagery and What It Tells Us. Cambridge: UP.

Wilson Knight, G. (1959). The Wheel of Fire: Interpretations of Shakespearian Tragedy with Three New Essays. London: Methuen.


Corpus-based lexical analysis for NNS academic writing


Alejandro Curado-Fuentes


The analysis of NNS (non-native speakers of English) writing for academic purposes has led to extensive research over the past three decades. A main focus is on university compositions or essays where L2 learners ought to go through re-writing procedures. Less consideration seems to be given, by comparison, to NNS research writing for publication aims, although, to give just two examples, Burrough-Boenisch (2003; 2005) examines proof-reading procedures.

This paper aims to present a corpus-based lexical analysis of nine journal papers written by Spanish Computer Science faculty. The texts available for the corpus analysis are those authored by the writers in their final versions; however, they are accessed prior to the editors’ last review. The corpus examination has been done by comparing the texts with NS material from a selection of the BNC (British National Corpus) Sampler (Burnard and McEnery, 1999). The chief objective in the process has been to identify both similarity and divergence in terms of the significant lexical items used. Word co-occurrence and use probability in the contrasted contexts determine academic competence, since the mastery of specific lexical patterns should indicate specialised writing (cf. Hoey, 2005). Based on the literature, an attempt at assessing major idiosyncratic traits as well as flaws has been made by comparing main native English, non-native English, and Spanish writing patterns. The contrastive study may constitute an interesting case study as reference for the design of teaching material and / or NNS writing guidelines.



Burnard, L. and T. McEnery (1999). The BNC Sampler. Oxford: Oxford University Press.

Burrough-Boenisch, J. (2003). “Shapers of published NNS research articles”. Journal of Second Language Writing, 12, 223–243.

Burrough-Boenisch, J. (2005). “NS and NNS scientists’ amendments of Dutch scientific English and their impact on hedging”. English for Specific Purposes, 24, 25–39.

Hoey, M. (2005). Lexical Priming. A New Theory of Words and Language. London: Routledge.



Edinburgh Academic Spoken English Corpus: Tutor interaction with East Asian students in tutorials


Joan Cutting


Some East Asian students are quiet in tutorials (De Vita 2007), and lecturers struggle to help them participate interactively (Jin 1992; Ryan and Hellmundt 2003). Training to increase lecturers’ cultural awareness of non-western learning styles is essential, as is training on interactive tutorial management with East Asian students (Grimshaw 2008; Stokes and Harvey 2008).

            This paper discusses a pilot study of UK university postgraduate tutorials. The tutorials were video-recorded and form a sub-corpus (20,000 words) of the nascent Edinburgh Academic Spoken English Corpus. The transcripts were analysed using corpus linguistics and interactional sociolinguistics, to find what happens when students are required to interact. The database was coded for linguistic features (technical words, vague language, etc.), structural features (length of speaking time, interruptions and overlaps, comprehension checks, etc.), interactional features (speech acts, cooperative maxims, politeness strategies, etc.) and pedagogical strategies (pair work, presentations, posters, etc.). The features were then cross-tabulated to identify which linguistic forms, interactional strategies and teaching strategies used by lecturers correlated with students’ active participation. Findings revealed that student participation varied according to some aspects of tutor behaviour. For example, if the tutor used vague language (e.g. general nouns and verbs, and indefinite pronouns) and colloquialisms, student participation was delayed until the tutor initiation was reformulated satisfactorily; and if the tutor used positive politeness strategies (e.g. using student first names and hedging), student participation increased. This paper discusses reasons for the correlations and lack of correlation, taking into account social variables such as individual differences, gender and language proficiency level.

            The paper is essentially one that focuses on interactional information in the corpus, with a view to drawing up guidelines to support professional development of lecturers. It works on the assumption that findings of linguistic analysis could guide them in good practices (Cutting 2008; Reppen 2008).



Cutting, J. and Feng, J. (2008). 'Getting them to participate: East Asian students' interaction with tutors in UK seminars'. Unpublished paper at East Asian Learner conference, University of Portsmouth.

De Vita, G. (2007). “An Appraisal of the Literature on Internationalising HE Learning”. In E. Jones and S. Brown (eds) Internationalising Higher Education. Oxford: Routledge.

Edwards, V. and A. Ran (2006). Meeting the needs of Chinese students in British Higher Education. Reading: University of Reading.

Grimshaw, T. (2008). 'But I don't have any issues': perceptions and constructions of Chinese-speaking students at a British university. Unpublished paper at East Asian Learner conference, Portsmouth.

Jin, L. (1992). “Academic Cultural Expectations and Second Language Use: Chinese Postgraduate Students in the UK: A Cultural Synergy Model”. Unpublished PhD Thesis, University of Leicester.

Ryan, J. and Hellmundt, S. (2003). “Excellence through diversity: Internationalisation of curriculum and pedagogy”. Refereed paper presented at the 17th IDP Australian International Education Conference, Melbourne.

Stokes, P. and Harvey, S. (2008). Expectation, experience in the East Asian learner. Unpublished paper at East Asian Learner conference, Portsmouth.


CLIPS: diatopic, diamesic and diaphasic variations of spoken Italian


Francesco Cutugno and Renata Savy


We present here the largest corpus of spoken Italian ever collected. Compared to the average dimension of available speech corpora, CLIPS (Corpora e Lessici di Italiano Parlato e Scritto), in its spoken component [1], is among the largest ones (see table 1).

 Table 1

In recent years Italian linguistics has dedicated an increasing amount of resources to the study of spoken communication, constructing its own analytic tools and reducing the historical lack of available data for research. Among the various sources of variability naturally encountered in human languages along different dimensions of expression, diatopic variation is particularly relevant in Italian: it cannot be neglected and is difficult to represent. Standard ‘Italian’ is an abstraction built on mixing and combining all regional varieties, each derived from one or more local Romance dialects (De Mauro, 1972; Lepschy and Lepschy, 1977; Harris and Vincent, 2001).

CLIPS has been collected in 15 locations representative of 15 diatopic varieties of Italian, chosen on the basis of detailed socio-economic and socio-linguistic analyses. Moreover, the corpus is structured into 5 diaphasic/diamesic layers and presents a variety of textual typologies (see table 2). 30% of the entire corpus is orthographically transcribed, and around 10% is labelled at various levels.

Table 2

Specific standard protocols have been used both for transcription and labelling. Orthographic transcription includes metadata description and classification, non-speech and non-lexical phenomena annotation (empty and filled pauses, false starts, repetitions, noises, etc.). Labelling is delivered in TIMIT format. It includes the following levels: lexical, phonological (citation form), phonetic, acoustic, extra-text (comments). The corpus is freely available for research and is carefully described by a set of public documents (website:

The CLIPS project has been funded by the Italian Ministry of Education, University and Research (L.488/92) and coordinated by Prof. Federico Albano Leoni.



De Mauro, T. (1972). Storia linguistica dell’Italia unita, Bari: Laterza.

Lepschy, A.L. and G. C. Lepschy (1977). The Italian Language Today, London: Hutchinson.

Harris, M. and N. Vincent (2001). The Romance Languages (4th ed.), London: Routledge.


[1] Lexicon and text corpora in CLIPS have been collected by ILC, Pisa.

Corpus-driven morphematic analysis

Václav Cvrček


In this paper I would like to present a computer tool for corpus-driven morphematic analysis as well as a new corpus approach to derivational and inflectional morphology. The tool is capable of segmenting Czech and English word-forms into morphemes (i.e. the smallest parts of words which have a meaning or function) by combining distributional and graphotactic methods. An important fact for further research in this area is that the segmentation process described below can be quite efficient without involving any previous linguistic knowledge about the word structure or its meaning.

            The program recursively forms a series of hypotheses about morpheme-candidates (esp. about morpheme boundaries) and evaluates these candidates on the basis of their distribution among other words in the corpus. A working hypothesis assumes that when a morphematic boundary occurs, the distribution of the morpheme-candidates derived from this partitioning will be decisive (i.e. many words should contain these morphemes) in comparison to a hypothetical division separating two parts of one morpheme from each other (Urrea 2000). Another criterion is derived from graphotactics: assuming that a morpheme boundary is less probable between graphemes which often co-occur (as measured with MI-score and t-score), this criterion provides a less important but still valuable clue in segmenting the word into morphemes. The final boundary is drawn at the position which best satisfies both criteria. The results obtained from this computer program can be improved by involving some linguistic information in the process, such as a list of characters which are not capable of forming a prefix.
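The distributional criterion can be illustrated with a deliberately simplified sketch. It is not the tool described in this abstract: it proposes a single boundary per word, ignores the graphotactic (MI-score / t-score) criterion and the recursion, and uses a hypothetical scoring rule (the smaller of the number of other words sharing the left part and the number sharing the right part). All names and the toy lexicon are assumptions made for illustration.

```python
def boundary_scores(word, lexicon):
    """Score every split position by how widely both parts are distributed.

    For a split word[:i] | word[i:], count the OTHER words in the lexicon
    that start with the left part and those that end with the right part;
    the score of the split is the smaller of the two counts.
    """
    scores = {}
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        n_left = sum(1 for w in lexicon if w != word and w.startswith(left))
        n_right = sum(1 for w in lexicon if w != word and w.endswith(right))
        scores[i] = min(n_left, n_right)
    return scores


def segment(word, lexicon):
    """Split the word at the best-scoring position (first one on ties)."""
    scores = boundary_scores(word, lexicon)
    best = max(scores, key=scores.get)
    return word[:best], word[best:]


# Toy lexicon standing in for the word types of a corpus.
lexicon = ["walked", "walking", "walks", "talked", "talking", "talks",
           "jumped", "jumping"]
print(segment("walked", lexicon))
print(segment("jumping", lexicon))
```

A fuller version would apply the procedure recursively to the resulting parts and combine the score with a graphotactic measure, as the abstract outlines.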

            This tool can help to improve the linguistic marking (lemmatisation, tagging) of corpora (which is necessary for languages with rich morphology), and it may also be seen as a first possible step towards a semantic marking or clustering of words (identifying morphologically and therefore semantically related words). Besides these advantages, this statistical approach can help in redefining the term 'morpheme' on the grounds of corpus linguistics (only by statistical and formal properties of morpheme-candidates) as a part of a word which is regularly distributed in the set of words with the same morphological characteristics (i.e. part of speech or words created by the same derivational process etc.).


Corpus-based derivation of a “basic scientific vocabulary” for indexing purposes


Lyne Da Sylva


The work described here pertains to developing language resources for NLP-based document description and retrieval algorithms, and to exploiting a specialized corpus for this purpose. Retrieval applications seek to help users find documents relevant to their information needs; thus document collections must be described, or indexed, using expressive and discriminating keywords.

            Our work focuses on the definition of different lexical classes within the vocabulary of a given language, where each has a different role in indexing. Most indexing terms stem from the specialized scientific and technical vocabulary (SSTV). Of special interest however is what we call the “Basic scientific vocabulary”, which contains general words not belonging to any specialized vocabulary, but rather used in all scientific or scholarly domains, e.g. “model”, “structure”, “absence”, “example”. This allows subdivisions of the SSTV (e.g. “parabolic antenna, structure”) and may also help in improving term extraction (e.g. removing “structure” in “structure of parabolic antenna”) for automatic indexing. Our goal is to provide a proper linguistic description of the BSV class and develop means of identifying members of this class in a given language, including extracting this type of words from a large corpus (Drouin, 2007, describes a similar endeavour on a French corpus).

            We have identified linguistic criteria (categorial, semantic and morphological) which may define this class, which differs from Ogden’s Basic English (Ogden, 1930), the Academic Word List (Coxhead, 2000) or frequency lists such as those extracted from the British National Corpus. We have also conducted an experiment which attempts to derive this class automatically from a large corpus of scholarly writing. Our originality lies in the choice of corpus: the documents consist of titles and abstracts taken from bibliographic databases for a large number of different scholarly disciplines: ARTbibliographies Modern, ASFA 1: Biological sciences and living resources, Inspec, Library and Information Science Abstracts, LLBA, Sociological Abstracts, Worldwide Political Science Abstracts. The rationale behind this is that they (i) contain scholarly vocabulary, (ii) tend to use general concepts (more than the full texts), and (iii) cover a wide range of scholarly disciplines, so the frequency of vocabulary items used in all should outweigh disciplinary specialities. The total corpus contains close to 14 million words.

            An initial, manually-constructed list for English will be sketched (approximately 344 words), based on the linguistic criteria. Then details of the experiment will be given; 80% of the first 75 extracted nouns do meet the linguistic criteria, as do around 70% of the first 1000. This method has allowed us to increase our working BSV list to 756 words. We will discuss problems with the methodology, explain how the new list has been incorporated into an automatic indexing program yielding richer index entries, and present issues for further research.
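The intuition that vocabulary shared by all disciplines should outweigh disciplinary specialities can be sketched as follows. This is an illustrative assumption about how such candidates might be ranked, not the extraction method actually used in the experiment: words are kept only if they occur in a minimum number of discipline sub-corpora and are then ordered by total frequency. The function name and the toy data are hypothetical.

```python
from collections import Counter


def basic_vocabulary_candidates(subcorpora, min_range=None):
    """Rank words that occur in at least `min_range` discipline sub-corpora.

    subcorpora maps a discipline name to its list of tokens. By default a
    word must occur in every sub-corpus. Candidates are sorted by total
    frequency (highest first), then alphabetically.
    """
    if min_range is None:
        min_range = len(subcorpora)
    total = Counter()        # overall frequency of each word
    range_count = Counter()  # number of sub-corpora containing each word
    for tokens in subcorpora.values():
        counts = Counter(tokens)
        total.update(counts)
        range_count.update(counts.keys())
    candidates = [w for w in total if range_count[w] >= min_range]
    return sorted(candidates, key=lambda w: (-total[w], w))


# Toy sub-corpora standing in for titles and abstracts from three databases.
subcorpora = {
    "physics": "model of particle structure model".split(),
    "sociology": "model of social structure".split(),
    "art": "structure of painting model".split(),
}
print(basic_vocabulary_candidates(subcorpora))
```

Function words such as "of" also satisfy the range condition, which is one reason a part-of-speech filter (for instance, keeping only nouns) would be needed before comparing candidates with the linguistic criteria.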



Coxhead, A. (2000). “A New Academic Word List”. TESOL Quarterly, 34(2), 213-238.

Drouin, P. (2007). “Identification automatique du lexique scientifique transdisciplinaire”. Revue française de linguistique appliquée, 12(2), 45-64.

Ogden, C. K. (1930). Basic English: A General Introduction with Rules and Grammar. London: Paul Treber.


A corpus-based approach to the language of the social services: Identifying changes in the profession of the social worker and priorities in social problems.


Adriana Teresa Damascelli


In recent times the need for highly specialised professional profiles has brought about changes in the job market. New professions are increasingly required to fill job positions in fields which have developed their own specificity and produced linguistic items, i.e. specialised terminology, which are used to connote concepts, objects, actions, and activities relating to them. Nowadays, the proliferation of specialised fields is so rapid that it is not always possible to trace back their origin, their state, and what kind of evolution has taken place in terms of specialty, or to identify changes in terms of communication. On the other hand, there are some professions whose theoretical background has well-established roots which date back a long way and which have gained the status of a specialised field.

This paper focuses on the domain of the social services and in particular on the social worker profession. The aim is to determine if any further process of specialisation has taken place and/or has developed in ten years, and in what way. For present purposes a corpus of all the issues published from 1998 to 2008 by the British Journal of Social Workers is being collected. The corpus will maintain annual subdivision, although this may not be of help, because journal policy may or may not impose the development of certain issues or topics in each volume or year. Subsequently, different computer tools, namely wordlists, concordances, and key words, will be used in order to investigate the corpus. The corpus-based approach will also help to trace any trend in the way topics have been covered and to hypothesise any relationship with general attitudes towards social matters. Examples as well as data will be provided as evidence.



Lexical bundles in English abstracts: A corpus-based study of published and non-native graduate writing


Carmen Dayrell


English for Academic Purposes (EAP) poses major challenges to novice writers who, in addition to dealing with the various difficulties involved in the writing process itself, also have to comply with the conventions adopted by their academic discourse community. This is mainly because the presence (or absence) of some specific linguistic features is usually viewed as a key indicator of communicative competence in a field of study. The main rationale behind this suggestion is that expert writers who regularly participate in a given discourse community are familiar with the lexical and syntactical patterns more frequently used by their community and hence often resort to them. Not surprisingly, a number of corpus-based studies have paid increasing attention to lexical bundles (or clusters) in academic texts with a view to developing teaching materials which can assist novice researchers, and non-native speakers in particular, to write academic texts in English. This study follows this line of research and focuses on scientific abstracts. The primary purpose is to investigate lexical bundles in English abstracts written by Brazilian graduate students as opposed to abstracts of published papers from the same disciplines. The comparison is made across three disciplines, namely, pharmaceutical sciences, physics and computer science and takes into consideration the frequencies of lexical bundles with respect to their forms and structures. The data is drawn from two separate corpora of English abstracts. One corpus is made up of 158 abstracts (approximately 32,500 words) written by Brazilian graduate students from the abovementioned disciplines. These abstracts were collected in seven courses on academic writing offered between 2004 and 2008 by the relevant departments of a Brazilian university. The other corpus consists of 1,170 abstracts (over 205,000 words) taken from papers published by various leading academic journals. 
It has been designed to match the specifications of the corpus of students’ abstracts in terms of disciplines and percentages of texts in each. The long-term objective of this study is to translate the most relevant differences between the two corpora into pedagogic materials in order to raise students’ awareness of their most frequent inadequacies as well as of the chunks or phrases which are most regularly used within their academic discourse community.
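As an illustration of how lexical bundles are typically retrieved from a corpus of abstracts, the sketch below counts recurrent four-word sequences and keeps those that reach a minimum frequency and occur in a minimum number of different texts. The thresholds, the whitespace tokenisation and the function name are assumptions made for this example; the study's own cut-off points are not stated in the abstract.

```python
from collections import Counter, defaultdict


def lexical_bundles(texts, n=4, min_freq=2, min_texts=2):
    """Return {bundle: frequency} for recurrent n-word sequences.

    A sequence counts as a bundle if it occurs at least `min_freq` times
    overall and in at least `min_texts` different texts.
    """
    freq = Counter()
    text_range = defaultdict(set)
    for idx, text in enumerate(texts):
        tokens = text.lower().split()  # naive tokenisation, for illustration
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            freq[gram] += 1
            text_range[gram].add(idx)
    return {
        " ".join(gram): count
        for gram, count in freq.items()
        if count >= min_freq and len(text_range[gram]) >= min_texts
    }


# Toy abstracts standing in for the student and published corpora.
abstracts = [
    "the results of this study show that",
    "the results of this study suggest that",
    "in this paper we present",
]
print(lexical_bundles(abstracts))
```

Running the same function over each corpus separately and normalising the counts per number of words would allow the two sets of bundles to be compared.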


Identifying speech acts in emails: Business English and non-native speakers


Rachele De Felice and Paul Deane


This work presents an approach to the automated identification of speech acts in a corpus of workplace emails. As well as contributing to ongoing research in automated processing of emails for task identification (see e.g. Carvalho and Cohen 2006; Lampert, Dale et al. 2008; Mildinhall and Noyes 2008), its main goal is to determine whether learners of English are using the appropriate structures for this type of text. In particular, tests of English for the business environment often require learners to write emails responding to specific task requirements, such as asking for information, or agreeing to a proposal, which can be assigned to basic speech act categories. Automated identification of such items can be used in the scoring of these open-content answers, e.g. noting whether all the speech acts required by the task are present in the students’ texts.

      Our data consists of a corpus of over 1000 such emails written by L2 English students in response to a test prompt. Each email is manually annotated at the sentence level for a speech act, using a scheme which includes categories such as Acknowledgement, Question, Advisement (requesting the recipient do something), Disclosure (of personal thoughts or intentions), and Edification (stating objective information).

      We then identify a set of features which discriminates among the categories. Similar methods have been found successful in acquiring usage patterns for lexical items such as prepositions and determiners (De Felice and Pulman 2008; Gamon, Gao et al. 2008; Tetreault and Chodorow 2008); one of the novel aspects of this work is to extend the classifier approach to aspects of the text above the lexical level. The features refer to several characteristics of the sentence: punctuation, use of pronouns, use of modal verbs, presence of particular lexical items, and so on. A machine learning classifier is trained to associate particular combinations of features to a given speech act category, to automatically assign a speech act label to novel sentences. The success of this approach is tested by using ten-fold cross-validation on the L2 email set. First results show accuracy of up to 80%. We also intend to assess the performance of this classifier on comparable L1 data, specifically a subset of the freely available Enron email corpus.
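To make the kind of sentence-level features mentioned above more concrete, here is a minimal sketch of a feature extractor covering punctuation, pronouns, modal verbs and a few lexical items. The specific feature names and word lists are hypothetical choices made for illustration, not the feature set used in this work, and the classifier training step is omitted.

```python
import re

# Small illustrative word lists; a real feature set would be much richer.
MODALS = {"can", "could", "may", "might", "must", "shall", "should",
          "will", "would"}
FIRST_PERSON = {"i", "we", "me", "us", "my", "our"}
SECOND_PERSON = {"you", "your"}
THANKS = {"thank", "thanks"}


def sentence_features(sentence):
    """Map a sentence to a dictionary of binary features."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return {
        "ends_with_question_mark": sentence.rstrip().endswith("?"),
        "has_modal": any(t in MODALS for t in tokens),
        "has_first_person": any(t in FIRST_PERSON for t in tokens),
        "has_second_person": any(t in SECOND_PERSON for t in tokens),
        "has_please": "please" in tokens,
        "has_thanks": any(t in THANKS for t in tokens),
    }


print(sentence_features("Could you please send me the report?"))
print(sentence_features("Thank you for your email."))
```

Feature dictionaries of this kind, paired with the manually assigned speech act labels, are the sort of input a machine learning classifier could be trained on and evaluated with cross-validation.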

      Our work combines methods from corpus linguistics and natural language processing to offer new perspectives on both speech act identification in emails and the wider domain of automated scoring of open-content answers. Additionally, analysis of the annotated learner data can offer further insights into the lexicon and phraseology used by L2 English writers in this text type.



Carvalho, V. and W. Cohen (2006). “Improving email speech act analysis via n-gram selection”. HLT-NAACL ACTS Workshop. New York City.

De Felice, R. and S. Pulman (2008). “A classifier-based approach to preposition and determiner error correction in L2 English”. COLING. Manchester, UK.

Gamon, M., J. Gao, et al. (2008). “Using contextual speller techniques and language modeling for ESL error correction”. IJCNLP.

Lampert, A., R. Dale, et al. (2008). “The nature of requests and commitments in email messages”. AAAI Workshop on Enhanced Messaging. Chicago, 42-47.

Mildinhall, J. and J. Noyes (2008). “Toward a stochastic speech act model of email behaviour”. Conference on Email and Anti-Spam (CEAS).

Tetreault, J. and M. Chodorow (2008). “The ups and downs of preposition error detection”. COLING. Manchester, UK.

What they do in lectures is…: The functions of wh-cleft clauses in university lectures


Katrien Deroey


As part of a study of lecture functions in general and highlighting devices in particular, this paper presents the findings of an investigation into the discourse functions of basic wh-cleft clauses in a corpus of lectures. These clauses, such as What our brains do is complicated information processing, are identifying constructions which background the information in the relative clause (What our brains do) and present the information foregrounded in the complement (complicated information processing) as newsworthy. Corpus-based studies of this construction to date have mainly described its function in writing (Collins 1991, Herriman 2003 & 2004) and spontaneous speech (Collins 1991 & 2006). From his examination of wh-clefts in speech and writing, Collins (1991: 214) concludes that ‘the linear progression from explicitly represented non-news to news offers speakers an extended opportunity to formulate the message’, while Biber et al. (1999: 963) note that in conversation, speakers may use this construction with its typically low information content as ‘a springboard in starting an utterance’. With regard to academic speech, Rowley-Jolivet & Carter-Thomas (2005) found that basic wh-clefts are particularly effective in conference presentations to highlight the New and that their apparent ‘underlying presupposed question’ (p. 57) adds a dialogic dimension to monologic speech. All these features suggest that wh-clefts may be useful in lectures, which are typically monologic and mainly concerned with imparting information. So far, however, studies on the function of these clefts in lectures have generally focussed on the function of part of the wh-clause as a lexical bundle (Biber 2006, Nesi & Basturkmen 2006) and mostly discussed its role as a discourse organising device.

            For the current investigation, a corpus of 12 lectures drawn from the British Academic Spoken English (BASE) Corpus was analysed. This yielded 132 basic wh-clefts, which were classified for their main discourse functions based on the presence of certain lexico-grammatical features, the functional relationship between the clefts and their co-text, and an understanding of the purposes of and disciplinary variation within the lecture genre. Four main functional categories thus emerged: informing, evaluating, discourse organizing and managing the class. These functions of wh-clefts and their relative frequency are discussed and related to lecture purposes; incidental findings on their co-occurrence with pauses and discourse markers are also touched upon. The study of this highlighting device in a lecture corpus thus aims to contribute to our understanding of what happens in authentic lectures and how this is reflected in the language.



Biber, D. (2006). University language: a corpus-based study of spoken and written registers: Studies in Corpus Linguistics 23. Amsterdam: John Benjamins.

Collins, P. C. (1991). Cleft and pseudo-cleft constructions in English. London: Routledge.

Collins, P. C. (2006). “It-clefts and wh-clefts: prosody and pragmatics”. Journal of Pragmatics, 38, 1706-1720.

Herriman, J. (2003). “Negotiating identity: the interpersonal functions of wh-clefts in English”. Functions of Language, 10 (1), 1-30.

Herriman, J. (2004). “Identifying relations: the semantic functions of wh-clefts in English”. Text, 24 (4), 447-469.

Nesi, H. and H. Basturkmen (2006). “Lexical bundles and discourse signalling in academic lectures”. International Journal of Corpus Linguistics, 11 (3), 283-304.

Rowley-Jolivet, E. and S. Carter-Thomas (2005). “Genre awareness and rhetorical appropriacy: manipulation of information structure by NS and NNS scientists in the international conference setting”. English for Specific Purposes, 24, 41-64.


Building a dynamic and comprehensive field association terms dictionary from domain-specific corpora using linguistic knowledge


Tshering Cigay Dorji, Susumu Yata, El-sayed Atlam, Masao Fuketa, Kazuhiro Morita and Jun-ichi Aoe


With the exponential growth of digital data in recent years, it remains a challenge to retrieve and process this vast amount of data into useful information and knowledge. A novel technique based on Field Association Terms (FA Terms) has been found to be effective in document classification, similar file retrieval, and passage retrieval. It also holds much potential in areas like machine translation, ontology building, text summarization, cross-language retrieval etc.

            The concept of FA Terms is based on the fact that the subject of a text (document field) can usually be identified by looking at the occurrence of certain specific terms or words in that text. An FA Term is defined as a minimum word or phrase that serves to identify a particular field and cannot be divided further without losing its semantic meaning. Since an FA Term may belong to more than one field, its strength to indicate a specific field is determined by assigning one of the five pre-defined levels.

            However, the main drawback today is the lack of a comprehensive FA Terms dictionary. This paper proposes a method to build a dynamically-updatable comprehensive FA Terms dictionary by extracting and selecting FA Terms from large collections of domain-specific corpora. Firstly, the documents in a domain-specific corpus are part-of-speech (POS) tagged and lemmatized using a program called TreeTagger. Secondly, the FA Term candidates which consist of “rigid noun phrases” are extracted by matching predefined POS patterns. Thirdly, the term frequencies and the document frequencies of the extracted FA Term candidates are compared with those of the candidate terms from a reference corpus. The FA Term candidates are then weighted and ranked by using a special formula based on tf-idf. Finally, the level of each newly selected FA Term is decided by comparing with the existing FA Terms in the dictionary. The process is repeated on a regular basis for newly obtained corpora so that the FA Terms dictionary remains up-to-date.
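The candidate extraction and ranking steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the input is taken to be already tagged and lemmatized, the simplified tags "ADJ" and "NOUN" stand in for the tagger's real tagset, the noun-phrase pattern is a guess at what "rigid noun phrases" might cover, and the weight is a plain tf-idf-style score against the reference corpus rather than the authors' own formula, which the abstract does not give. The function names are hypothetical.

```python
import math
from collections import Counter

def extract_candidates(tagged_doc):
    """Return maximal adjective/noun runs that end in a noun, as lemma strings.

    tagged_doc is a list of (lemma, tag) pairs; the tags are simplified
    placeholders, not the tagset of any particular tagger.
    """
    candidates, run = [], []
    for lemma, tag in tagged_doc + [("", "END")]:
        if tag in ("ADJ", "NOUN"):
            run.append((lemma, tag))
            continue
        # Trim trailing adjectives so that the phrase ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if run:
            candidates.append(" ".join(l for l, _ in run))
        run = []
    return candidates

def rank_candidates(domain_docs, reference_docs):
    """Rank domain candidates by term frequency times a reference-corpus idf.

    Assumption: a stand-in weighting, not the formula used in the paper.
    """
    dom = [extract_candidates(d) for d in domain_docs]
    ref = [extract_candidates(d) for d in reference_docs]
    tf = Counter(t for doc in dom for t in doc)
    ref_df = Counter(t for doc in ref for t in set(doc))
    n_ref = len(ref)
    scores = {t: f * math.log((1 + n_ref) / (1 + ref_df[t]))
              for t, f in tf.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Candidates that are frequent in the domain corpus but rare in the reference corpus rise to the top of the ranking; assigning one of the five field-association levels would be a separate step not shown here.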

            Experimental evaluation for 21 fields using 306.28 MB of domain-specific corpora obtained from English Wikipedia dumps selected 497 to 2,517 Field Association Terms for each field at precision and recall of 74-97% and 65-98% respectively.  The relevance of the automatically selected FA Terms was checked by human experts. The newly selected FA Terms were added to the existing dictionary consisting of 14 super-fields, 50 median fields and 393 terminal fields. Experimental results showed that the proposed method is an effective method for building a dynamically-updatable comprehensive FA Terms dictionary. Future studies will improve the proposed methodology by adding a new module for automatic classification of new documents for extraction of new FA Terms, and explore the application of FA Terms in cross-language retrieval and machine translation.


Encoding intonation: The use of italics and the challenges for translation


Peter Douglas


English, Halliday reminds us, is ‘a language in which a relatively heavy semantic load is carried by rhythm and intonation’. The position of the accented word or syllable is largely determined by the speaker’s involvement in an ongoing discourse in which New (rather than Given) information receives tonal prominence. It has been noted that intonation, an overt signal of information status in spoken English, is difficult to convey in the written language. However, the use of italics, particularly in literary texts, can be seen to represent such emphasis; they are a hybrid convention, signs located at the border between written and spoken discourse. General opinion that regards italics as mere punctuation options belies their complex nature and masks their importance for the translation scholar and for the theorist interested in the written representation of spoken discourse.

            The present paper claims that as languages have their own prosodic conventions and their own codes of how these conventions are represented in writing, problems will arise during the translation process. Research into English and Italian source and target texts illustrates how two languages represent intonation in written language and the challenges that this poses for the translator. To this effect, a small, bidirectional parallel corpus of 19th and 20th century English and Italian source texts and their respective target texts was created, and each occurrence of italics identified. Given the typographical nature of the item, compilation of the corpus presented its own challenges.

            Successive rounds of investigation provided quantitative data which, when analysed, revealed not only the considerable communicative potential of italics, but also the marked differences between the two language codes, their respective textual conventions and the strategies available to the translator. A distinctive feature that emerged from the analysis of the English source texts is that italicised items tend to signal marked rather than unmarked default intonation patterns. They highlight words which would generally be expected to go unselected, i.e. they evidence part of the information unit that is Given rather than New. In addition, analysis shows the extent to which italics communicating emphasis appear to be culture-specific to the English texts. However, further examination of Italian target texts also reveals that a wide range of lexico-grammatical resources can be drawn upon to approximate the source text meaning.



Halliday, M. A. K. (1985). An Introduction to Functional Grammar. London: Edward Arnold, 271.


Lexical bundles in classroom discourse: A corpus-based approach to tracking disciplinary variation and knowledge


Paul Doyle


Corpus-based approaches to discourse analysis provide sophisticated analytical tools for investigating linguistic patterning beyond the clause level. Coupled with interpretations sensitive to the functional role of such patterning in texts, these approaches can facilitate deeper insights into differences between genres and registers, and, in the context of the discourse of primary and secondary classroom teaching, relate these understandings to broader pedagogic processes. In this paper, I focus on recurrent word sequences, or lexical bundles (Biber et al. 1999), as markers of disciplinary variation in a corpus of primary and secondary teacher talk. Frequently occurring lexical bundles can be classified using functional categories such as epistemic stance expressions, modality and topic-related discourse organising expressions (ibid). However, in order to account for variation in lexical bundle distribution across disciplines, there is a need for an interpretative framework that relates to a specific community of language users operating in a single genre (Hyland, 2008). Classroom talk is a hybrid discourse (Biber, Conrad and Cortes, 2004) that exhibits both the characteristic interpersonal features of spoken language and ‘literate’ features of written language from textbooks, and it is especially rich in lexical bundles. Using data from the Singapore Corpus of Research in Education, an ongoing educational research project utilising a corpus of classroom interactions, textbooks and student artefacts in key curriculum disciplines, I trace variations in pedagogic practice as evidenced in teacher talk from English medium lessons in English language, Mathematics, Science and Social Studies in Singapore classrooms. Frequent lexical bundles are classified using a framework adapted from Hyland’s (2008) taxonomy, and the distribution of the various categories is compared across the four school disciplines. The approach is evaluated in terms of its ability to relate linguistic variation to significant disciplinary differences, and to highlight processes of knowledge construction in the classroom.
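As a rough illustration of how recurrent word sequences of this kind are usually retrieved, the sketch below counts n-grams and keeps those that recur often enough and in enough different texts. The bundle length, the frequency cut-off and the dispersion cut-off are assumptions chosen for the example, not the settings used in the study, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def lexical_bundles(texts, n=4, min_freq=2, min_texts=2):
    """Return n-grams that occur at least min_freq times in at least min_texts texts.

    texts is a list of transcripts as plain strings; tokenisation here is a
    simple lower-cased whitespace split, which is an assumption of this sketch.
    """
    freq = Counter()
    spread = defaultdict(set)  # n-gram -> ids of the texts it occurs in
    for text_id, text in enumerate(texts):
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            freq[gram] += 1
            spread[gram].add(text_id)
    return {" ".join(g): c for g, c in freq.items()
            if c >= min_freq and len(spread[g]) >= min_texts}
```

Running the function separately on the teacher talk from each discipline would give per-discipline bundle lists whose functional categories can then be compared.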



Biber, D., S. Conrad and V. Cortes (2004). “If you look at...: Lexical Bundles in University Teaching and Textbooks”. Applied Linguistics, 25 (3), 371-405.

Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999). Longman Grammar of Spoken and Written English. London: Longman Pearson.

Hyland, K. (2008). “As can be seen: Lexical bundles and disciplinary variation”. English for Specific Purposes, 27 (1), 4-21.




A function-first approach to identifying formulaic language


Philip Durrant


There has been much recent interest in creating pedagogically-oriented descriptions of formulaic language (e.g., Biber, Conrad, & Cortes, 2004; Ellis, Simpson-Vlach, & Maynard, 2008; Hyland, 2008; Shin & Nation, 2008). Corpus-based research in this area has typically taken what might be called a ‘form-first’ approach, in which candidate formulas are identified as the most frequent collocations or n-grams in a relevant corpus. The functional correlates of these high frequency forms are then determined at a later stage of analysis. This approach has much to commend it. It enables fast, reliable analysis of large bodies of data, and has produced some important insights, retrieving patterns which are often not otherwise evident.

            While this research continues to yield valuable results, the present paper argues that much can also be gained by taking what I will call a ‘function-first’ approach to identifying formulaic language. On this approach, a corpus is first annotated for communicative functions; formulas are then identified as the recurrent patterns used to express each function. This method has two key advantages. First, it enables the analyst to identify not only how frequent a given form is, but also what proportion of attempts to express a particular message use that form. As Wray (2002, p. 30) has pointed out, information of this sort is often vital, alongside overall frequency information, in determining whether an item is formulaic.
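The first advantage can be made concrete with a small sketch. Assuming the corpus annotation has been reduced to (function, form) pairs, one per attempt to express a function, the share of each form within its function is a simple ratio. The function labels and the helper name below are hypothetical and only serve the illustration.

```python
from collections import Counter

def form_share_by_function(annotations):
    """Map each (function, form) pair to the proportion of that function's
    instances which are realised by that form.

    annotations is a list of (function, form) tuples, an assumed
    simplification of a functionally annotated corpus.
    """
    by_function = Counter(function for function, _ in annotations)
    pairs = Counter(annotations)
    return {(function, form): count / by_function[function]
            for (function, form), count in pairs.items()}
```

A form with a modest overall frequency may still account for most expressions of one function, which is the kind of information that raw frequency counts alone do not show.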

            The second advantage of a function-first approach is that it enables the analyst to take account of the range of variation associated with a particular expression. Form-first approaches are required to specify in advance the formal features that two strings of language must have in common to count as instances of a single formula. This has usually meant restricting analysis to simple forms such as two-word collocations or invariant ‘lexical bundles’. Since many formulas are to a certain degree internally variable, this restriction inevitably involves the loss of potentially important information. The function-first approach aims to account for the range of variation typical in the expression of a given function and so to include a fuller description of such variable formulas.

            While these features of the function-first approach have the potential to be of much theoretical interest, the primary motivation for the approach is pedagogical. For language learners, the key information is often not which strings of language are the most frequent, but rather which functions they are most likely to need, what formulas will most appropriately meet these needs, and how those formulas can be manipulated. This is precisely the information which a function-first approach aims to provide.

            This paper will demonstrate a function-first approach to identifying formulaic language through an investigation of a collection of introductions to academic essays written by MA students in the social sciences, taken from the corpus of British Academic Written English. The methodological issues involved will be discussed and sample results presented.



Biber, D., S. Conrad and V. Cortes (2004). “If you look at ...: Lexical Bundles in University Teaching and Textbooks”. Applied Linguistics, 25 (3), 371-405.

Ellis, N. C., R. Simpson-Vlach and C. Maynard (2008). “Formulaic language in native and second-language speakers: psycholinguistics, corpus linguistics, and TESOL”. TESOL Quarterly, 42 (3), 375-396.

Hyland, K. (2008). “Academic clusters: text patterning in published and postgraduate writing”. International Journal of Applied Linguistics, 18 (1).

Shin, D. and P. Nation (2008). “Beyond single words: the most frequent collocations in spoken English”. ELT Journal, 62 (4), 339-348.

Wray, A. (2002). Formulaic language and the lexicon. Cambridge: CUP.

Corpus-based identification and disambiguation of reading indicators for German nominalizations


Kurt Eberle, Gertrud Faaß and Ulrich Heid


We present and discuss an automatic system for the disambiguation of certain lexical semantic ambiguities and for the identification of disambiguation clues in corpus text.

            As an example, we analyse German nominalizations of verbs of information (e.g. Äußerung “statement”, Darstellung “(re-)presentation”, Mitteilung “information”) and nominalizations of verbs of measuring and recording (e.g. Beobachtung “observation”, Messung “measurement”), which can be ambiguous between an event reading (“event of uttering, informing, observing”, etc.) and a fact-like object reading which makes reference to the content in question. For language understanding, content analysis, and translation, it is desirable to be able to resolve this type of ambiguity.

            The ambiguity can only be resolved in context; relevant types of contextual information include the following:


* semantic class of the verb underlying the nominalization;

* syntactic embedding selectors of the nominalization, e.g.

            - verbs (e.g. modal vs. full verbs, tense);

            -  prepositions;

* complements and modifiers of the nominalizations, e.g. determiners, adjectives, and PPs expressing participants of the verb frame, such as the agent.


            There are many contexts, however, where indicators can themselves be ambiguous, like the German preposition nach, in constructions of the kind nach (den) Äußerungen von X . . . : nach can mean a temporal sequence (“after”, event indicator) or reference to a content (“according to”); in resolving this combined ambiguity, definiteness and agent phrases play a role.

            Such fine-grained interpretations require a sophisticated syntactic and semantic analysis tool. Many interactions need to be identified and tested. Our system provides both these functions: an automatic syntactic and semantic analysis with partial disambiguation and with an underspecified representation of ambiguity, and a search mode which allows us to browse texts for relevant constellations to find indicator candidates.

            Our tool is a research prototype based on the large-coverage grammar of the machine translation product translate; it includes a dependency parser based on Slot Grammar (McCord 1991) and provides underspecified representations in the style of FUDRT (Eberle 2004). We added a functionality to the parser that allows us to modularly add hypotheses about the reading indicator function of certain words or combinations thereof; by adding these hypotheses to the system’s knowledge source, we can test their disambiguation effect.

            A search/browse mode allows us to search and parse results for specific syntactic/semantic constellations, to cross-check them in large corpora and to group and count context with related properties, as, e.g. nach Äußerungen von Maier “according to statements by Maier” versus nach den Äußerungen von Maier “after the statements by Maier”.

            In the paper, we present the tool architecture and functions, and we elaborate on details of the event vs. fact disambiguation with nominalizations of verbs of information.


Natural ontologies at work: Investigating fairy tales


Ismaïl El Maarouf


The goal of the EMOTIROB project is to build a smart robot which could respond adequately to children experiencing emotional difficulties. One of the research activities is to set up a comprehension module (Achour et al. 2008) signalling the emotional contour of utterances in order to stimulate an adequate (non-linguistic) response from the robot when faced with child linguistic input. This will require a corpus analysis of input language. Whilst the oral corpus is being completed, research is being carried out on another kind of child-related material: Fairy Tales, that is, a corpus of adults addressing children through texts destined to arouse their imagination. This corpus will mainly serve as a reference for comparison with a corpus of authentic child language.

            The purpose of this study is to characterize semantically this subset of children's linguistic environment. Studies of child language in the mainstream of corpus linguistics are scarce (Thompson & Sealey 2007), and so are child-related corpora, which means that we know very little about their specificities.

            This study focuses on the most frequent verbs and investigates concordances to classify their collocations. The theoretical framework used here is Patrick Hanks's Corpus Pattern Analysis (Hanks 2008). This corpus-driven model makes the most of linguistic contexts to define each meaning unit. Each verb pattern is extracted and associated with a particular meaning, while every pattern argument is semantically typed when it helps to disambiguate the meaning of the verb. Building semantic patterns is not a straightforward procedure: we analyse how predefined ontological categories can be used, confronted with, or combined with, the set of collocates found in a similar pattern position. Each collocate set can then be grouped with another under the label of a natural semantic category. Natural ontologies is, in this perspective, a term which designates the network of semantic categories emerging from corpus data.

            The corpus has been annotated semantically and manually with Anaphora and Proper Noun Resolution, and verb patterns have been extracted and their arguments semantically typed. This annotation enables the analysis of preferences of semantic types and linguistic units separately, since the same unit may pertain to diverse semantic categories. The corpus, though very small by the standards of Corpus Linguistics (fewer than 200 000 running words), reveals peculiar semantic deviances compared to studies conducted so far on adult-addressed texts such as newspapers. In order to account for such deviances, we propose to characterize verbal arguments around prototype categories, which are defined through corpus frequencies. This article shows, among other things, how child-addressed corpora can be described thanks to the specific semantic organisation of verb arguments and to corpus-derived natural ontologies.



Achour, A., M. Le Tallec, S. Saint-Aime, B. Le Pévédic, J. Villaneau, J.-Y. Antoine and D. Duhaut (2008). “EmotiRob: from understanding to cognitive interaction”. In 2008 IEEE International Conference on Mechatronics and Automation – ICMA 2008. Takamatsu, Kagawa, Japan.

Hanks, P. (2008). “Lexical Patterns: From Hornby to Hunston and Beyond”. In E. Bernal and J. Decesaris (eds) (2008). Proceedings of the XIII EURALEX International Congress, Barcelona, Sèrie Activitats 20. Barcelona: Universitat Pompeu Fabra, 89-129.

Hanks, P. and E. Jezec (2008). “Shimmering lexical sets”. In Euralex XIII 2008 Proceedings, Pompeu Fabra University, Barcelona.

A corpus-based analysis of reader positioning: Robert Mugabe as a pattern in the Guardian newspaper

Izaskun Elorza


Several attempts have been carried out to describe genres or text types within newspaper discourse by means of contrastive analyses of bilingual comparable corpora. Recent approaches to the description of different features of this type of discourse focus on the analysis of evaluative language and reader positioning (e.g., McCabe & Heilman 2007; Marín Arrese & Núñez Perucha 2006). It is my contention that contrastive analyses on evaluation and metadiscourse (Hyland 2005) can benefit from corpus methodology as a complementary tool for revealing patterns used by newsworkers with various purposes which include reader positioning and, in order to illustrate this, an analysis of the Guardian newspaper is presented in this paper.

            In cross-cultural contrastive analyses, one way of minimising the great variety of variables at stake is to choose the coverage of the same event as the criterion for the compilation of the comparative corpora, so that the event coverage or the topic is taken as the control variable. Along this line, a contrastive analysis has been carried out of the coverage by El País and the Guardian of the opening day of a summit organised by the UN’s Food and Agriculture Organization in June 2008. A list of keywords has been produced by means of WordSmith Tools 4.0 (Scott 2004), revealing that the participant who received the most attention in El País was Ban Ki-moon (UN secretary general) but that the Guardian devoted much more attention to Robert Mugabe (the controversial president of Zimbabwe). An imbalance of this kind in the coverage of the summit might be due to cultural differences in consonance (Bell 1991), which relates the newsworkers’ choice of devoting attention to some participants or others to the readers’ expectations in each case. And, as the newspaper readers’ expectations are shaped by, among other things, the repetition of the same patterns in the newspaper (O’Halloran 2007; 2009), a second analysis has been carried out to try to find out whether the presence of Robert Mugabe at the summit has been constructed by means of some recurrent pattern and, if so, which elements belong to that pattern.

            For this analysis a second corpus of approximately 1 million running words has been compiled, consisting of news articles and editorials from the Guardian (2003 to 2008). Concordances of ‘Mugabe’ and ‘summit’ have been obtained from this corpus, allowing me to identify some recurrent elements involving different kinds of evaluative resources, including attitudinal expressions of judgement and appreciation (Martin & White 2005). The results presented here reinforce the suggestion that an approach integrating discourse analysis and corpus linguistics is more productive for in-depth discourse description than qualitative approaches alone.



Hyland, K. (2005). Metadiscourse. London: Continuum.

O’Halloran, K. (2007). “Using Corpus Analysis to Ascertain Positioning of Argument in a Media Text”. In M. Davies, P. Rayson, S. Hunston, and P. Danielsson (eds) Proceedings of the Corpus Linguistics Conference CL2007.

O’Halloran, K. (2009). “Inferencing and Cultural Reproduction: A Corpus-based Critical Discourse Analysis”. Text & Talk, 29 (1), 21-51.

Marín Arrese, J. I. and B. Núñez Perucha (2006). “Evaluation and Engagement in Journalistic Commentary and News reportage”. Revista Alicantina de Estudios Ingleses, 19, 225-248.

Martin, J. R. and P. R. R. White (2005). The Language of Evaluation: Appraisal in English. Houndmills: Palgrave MacMillan.

McCabe, A. & K. Heilman (2007). “Textual and Interpersonal Differences between a News Report and an Editorial”. Revista Alicantina de Estudios Ingleses, 20, 139-156.

Scott, M. (2004). WordSmith Tools. Oxford: Oxford University Press.

Constructing relationships with disciplinary knowledge: Continuous aspect in reporting verbs and disciplinary interaction in lecture discourse


Nick Endacott


This paper examines use of the continuous aspect of reporting verbs in statements recontextualising disciplinary knowledge in lecture discourse, focusing on how it sets up complex participation frameworks and dialogic interaction patterns in such statements and the roles these patterns play in negotiating lecturer and disciplinary identities in the genre.

            Authentic academic lecture discourse exhibits many complexities in its lexico-grammar realising reporting and discourse representation. Significant among these complexities is the use of the continuous aspect with reporting verbs. Although reporting and discourse representation have been investigated in more lexico-grammatically ‘standard’ genres such as the Research Article (Thompson & Yiyun 1991, Hyland 2000), the use of the continuous aspect falls outside existing typologies for studying reporting and discourse representation in academic discourse (Thompson & Yiyun 1991, Hyland 2000). This is likely to be a reflection of the lexico-grammatical complexity of lecture discourse, comprising as it does features of both written and spoken discourse (e.g. Dudley-Evans & Johns 1981, Flowerdew 1994), and a reflection too of the pedagogic nature of lecture discourse. Broadly, the continuous aspect is used in lecture discourse both in metadiscourse and in discourse putting forward propositional content. Although the focus on message as opposed to ‘original words’ (McCarthy 1998; cf. Tannen 1989) and the construction of highly dialogic co-authorship in discourse (Bakhtin 1981, Goffman 1974) made through this usage is similar in both functional areas, the pragmatic purposes and participants involved differ. Using data from the BASE* corpus, this paper investigates the use of continuous aspect in the latter area, propositional content, and in particular examines the complex disciplinary interaction patterns constructed in lecture discourse by the continuous aspect and the highly dialogic participation frameworks (Goffman 1974) co-constructed with it. These interaction patterns create varying identities for lecturers, for the disciplinary knowledge they are recontextualising in lectures, and for the disciplines they lecture from, and as such are another resource for comparing patterns of social interaction in academic genres (Hyland 2000) and assessing disciplinary structure (Becher 2001, Hyland 2000).


* The British Academic Spoken English Corpus (BASE) was developed at the universities of Warwick and Reading under the directorship of Hilary Nesi and Paul Thompson. Corpus development was assisted by funding from BALEAP, EURALEX, the British Academy and the Arts and Humanities Research Council.




Exploring morphosyntactic variation in Old Spanish with Biblia Medieval (a parallel corpus of Spanish medieval Bible translations)


Andrés Enrique-Arias


This paper discusses the main advantages and shortcomings of the Biblia Medieval corpus in the historical study of morphosyntactic variation and change. Containing over 5 million words, Biblia Medieval is a freely accessible online tool that enables linguists to consult and compare side-by-side the existing medieval Spanish versions of the Bible and that gives them access to the facsimiles of the originals.

            The historical study of morphosyntactic variation and change through written texts involves some well-known challenges. One of them is the difficulty of identifying and defining the contexts of occurrence of morphosyntactic variants, since these are normally conditioned by a complex set of syntactic, semantic and discursive factors. In order to minimize this problem, it is necessary to locate and to examine a large number of occurrences of the same linguistic structure in versions that were produced at different time periods. Ideally, these occurrences should proceed from texts that have been influenced by the same textual conventions. Conventional corpora, however, are not well-suited for locating such occurrences. This problem is alleviated when the linguist has access to a parallel corpus. Parallel corpora consist of texts that are translation equivalents of a single original. Therefore they have the same underlying content and have been influenced by similar textual conventions. When working with a parallel corpus, the analyst may abstract away from the influence of contextual properties and focus instead on the diachronic evolution of structural phenomena. This will lead to more nuanced generalizations about the historical evolution of the phenomena being scrutinized. Alternatively, one can focus on contextual properties seeking to identify, for instance, variation across subgenres within texts, or the differences between translated texts in the corpus and non-translated ones. While a parallel corpus does not solve all the problems inherent in working with historical texts, it does enable the analyst to observe the historical evolution of structural phenomena while controlling the situational dimensions that condition variation.

            In order to illustrate the possibilities that Biblia Medieval offers, this paper gives an overview of specific case studies of Spanish morphosyntactic variation: (i) possessive structures, (ii) differential object marking, and (iii) clitic related phenomena. In sum, this study demonstrates how a parallel corpus of medieval biblical texts may enrich our theoretical understanding of change and variation phenomena in Spanish.

“To be perfectly honest …” Patterns of amplifier use in speech and writing.

David Evans


Quirk et al.’s (1985) A Comprehensive Grammar of the English Language has been influential in shaping much of the subsequent literature on ‘upscaling’ degree modifiers in two important ways. First, Quirk et al., and subsequent commentators such as Allerton (1987) and Paradis (1997), take an essentially semantic view of the relationship between the modifying and modified elements. In this view of degree modification, the main constraint on an amplifier is that the modified lexical item and modifier “must harmonize … in terms of scalarity and totality to make a successful match” (Paradis, 1997: 158). Beyond this constraint, amplifiers were viewed as strong exponents of the open-choice principle (Sinclair, 1991); free to co-occur with a wide variety of types, with the choice of one amplifier over another having little impact on meaning.

            Second, the traditional structure of major grammars, such as Quirk et al., puts an emphasis on parts of speech and their interaction with each other. This structure has had two effects: adverbs have been, and to a large extent remain, the focus of much work on intensification, ignoring the fact that taboo words such as fucking are more frequent as emphatic modifiers than a number of -ly adverbs, particularly in demographically sampled spoken data. It also means that the focus has been quite narrow, focussing very much on the amplifier’s immediate neighbours.

            Subsequent corpus-based studies by Partington (1998, 2004) and Kennedy (2002, 2003) have shown that collocation, semantic prosody and semantic preference place further limits on the occurrence of amplifiers.

            Examining a wide range of degree modifiers, both standard (very important, utterly devastated) and non-standard (fucking massive, crystal clear), in spoken and written English, this paper attempts to address some of these issues. It argues that patterns of amplifier use vary according to aspects such as the meaning of the adjective, where it is polysemous, and the nominal expression an adjective is itself modifying. It also suggests that grammatical environment, i.e. whether an adjective is operating predicatively or attributively, has an impact on the meaning and frequency of certain degree modifiers.

            In so doing, this paper challenges the notion that some degree words are interchangeable, concluding that degree modifiers are subject to ‘priming’ (Hoey, 2005) to a much greater extent than had previously been thought.



Allerton, D.J. (1987) “English intensifiers and their idiosyncrasies”. In R. Steele and T. Threadgold (eds)       Language topics: Essays in honour if Michael Halliday 2. Amsterdam: Benjamins, 15-31.

Hoey, M. (2005) Lexical Priming. Abingdon: Routledge

Kennedy, G. (1998) “Absolutely diabolical or relatively straightforward: Modification of adjectives by           degree   adverbs in the British National Corpus.” In A. Fischer, G. Tottie and H.M. Lehmann (eds) Text             types and corpora: Studies in honour of Udo Fries Tubingen: Gunter Narr, 151-163.

Kennedy, G. (2003) “Amplifier Collocations in the British National Corpus: Implications for English             Language Teaching”. TESOL Quarterly, 37(3), 467-487.

Paradis, C. (1997) Degree modifiers of adjectives in spoken British English. Lund: Lund University Press.

Partington, A. (1998) Patterns and Meanings. Amsterdam: Benjamins.

Partington, A. (2004) “‘Utterly content in each other’s company’: Semantic prosody and semantic     preference.” International Journal of Corpus Linguistics, 9(1), 131-156.

Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985) A Comprehensive Grammar of English. Harlow: Longman.

Sinclair, J.M. (1991) Corpus, concordance, collocation. Oxford: Oxford University Press.


Large and noisy vs small and reliable: Combining two types of corpora for adjective valence extraction


Cécile Fabre and Anna Kupsc


This work investigates the possibility of combining two different types of corpora to build a valence lexicon for French adjectives. We complete adjectival frames extracted from a Treebank with statistical cues computed from a large automatically parsed corpus. This experiment shows how linguistic knowledge and large amounts of annotated data can be used in a complementary manner.

            On the one hand, we use a 1 million-word Treebank, manually revised and enriched with syntactic and functional annotations for major constituents. On the other hand, we have a 200 million-word corpus, automatically parsed with no subsequent human validation, in which the texts have been annotated with dependency relations. In neither corpus is the argument/adjunct distinction specified for dependents of adjectives.

            In the first step, adjectival frames are extracted from the Treebank. The main issue is to separate higher-level constructions from valency information. We develop a set of rules to filter out constructions such as comparative, superlative, intensifying or impersonal. At this stage, 41 different frames are extracted for 304 adjectives. This result needs further exploration: first, due to imperfect or insufficient corpus annotations, the data is not totally reliable. In particular, frames of adjectives in impersonal constructions are often incorrectly extracted. Second, we are not able to separate real PP-arguments from adjuncts (e.g. applicable à N vs applicable sur N: applicable to / on N). Finally, due to the small size of the corpus, the number of extracted frames and the size of the lexicon are reduced.

            In the second step, we use much more data and apply statistical methods with two objectives: to rank the frames on the scale of the argument/adjunct continuum and to discover additional frames. We focus on PP-complements as they turned out to be the most problematic for the previous approach. We use several simple statistical measures to evaluate valence properties of the frames, focusing in particular on the obligatoriness and autonomy of the PP with respect to the adjective. We define optimal valency criteria which we apply for ranking and identifying new frames. A PP is all the more likely to be an argument of the adjective when it meets the following three conditions: the frame is productive (the adjective combines with a large range of nouns or infinitives introduced by the preposition), the adjective is rarely found alone (i.e., without the accompanying frame), and prepositional expansions that are attached to this adjective are mostly introduced by the preposition which appears in the frame. Applying these criteria allowed us to rank the frames extracted from the Treebank and extract dependency information for about 2600 new adjectives.
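The three conditions can be made concrete with a small sketch. The Python below is purely illustrative and is not the authors' implementation: the observations are invented, the names `frame_scores` and `rank_frames` are hypothetical, and the way the three measures are combined into a ranking (product of the two proportions, with productivity as tie-breaker) is an assumption, since the abstract does not specify one.

```python
from collections import Counter, defaultdict

# Invented toy observations: (adjective, preposition, head of the PP).
# A preposition of None means the adjective occurred with no PP expansion.
observations = [
    ("applicable", "à", "cas"),
    ("applicable", "à", "situation"),
    ("applicable", "à", "loi"),
    ("applicable", "sur", "territoire"),
    ("applicable", None, None),
    ("content", "de", "résultat"),
    ("content", None, None),
    ("content", None, None),
    ("content", None, None),
]

def frame_scores(observations):
    """Return {(adjective, preposition): (productivity, attachment, dominance)}."""
    adj_total = Counter()        # all occurrences of each adjective
    adj_with_pp = Counter()      # occurrences with any PP expansion
    frame_count = Counter()      # occurrences of each adjective + preposition frame
    frame_heads = defaultdict(set)
    for adj, prep, head in observations:
        adj_total[adj] += 1
        if prep is not None:
            adj_with_pp[adj] += 1
            frame_count[(adj, prep)] += 1
            frame_heads[(adj, prep)].add(head)
    scores = {}
    for (adj, prep), n in frame_count.items():
        productivity = len(frame_heads[(adj, prep)])  # distinct heads in the frame
        attachment = n / adj_total[adj]    # how rarely the adjective lacks the frame
        dominance = n / adj_with_pp[adj]   # share of this preposition among its PPs
        scores[(adj, prep)] = (productivity, attachment, dominance)
    return scores

def rank_frames(scores):
    """Rank frames from most to least argument-like (assumed combination)."""
    return sorted(scores, key=lambda f: (scores[f][1] * scores[f][2], scores[f][0]),
                  reverse=True)

scores = frame_scores(observations)
ranked = rank_frames(scores)
```

On this toy data the frame applicable à N ranks above applicable sur N, which mirrors the argument/adjunct contrast mentioned for the Treebank step.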


An empirical survey of French, Latin and Greek words in the British National Corpus


Alex Chengyu Fang, Jing Cao, Nancy Ide, Hanhong Li, Wanyin Li and Xing Zhang


In this article we report the most extensive survey of foreign words in contemporary British English. In particular, we report the distribution of French, Latin and Greek words according to different textual settings in the British National Corpus (BNC). The reported study has several motivations. Firstly, the English language has borrowed extensively from other languages, mainly from French, Latin and Greek (Roberts 1965; Greenbaum 1996). Yet, no large-scale empirical survey has been conducted or reported for their significant presence in our present-day linguistic communication. Secondly, many text genres are typically characterized by the use of foreign words, such as Latinate ones (De Forest and Johnson 2001); however, there is no quantitative indication that relates proportions of foreign words to different types of texts. A third motivation is the fact that technical or professional writing tends to have a high density of specialized terms as basic knowledge units and the automatic recognition of such terms has so far ignored the important role that foreign words typically play in the formation of such terms. The study to be reported here is therefore significant not only for its descriptive statistics of foreign words in the 100-million-word BNC. It is also significant that the empirical results that we will report here will lend themselves readily to practical applications such as automatic term recognition and text classification.

            Our survey has focused on the distribution of foreign words according to two settings: text types (such as writing vs. speech and academic prose vs. non-academic prose) and subject domains (such as sciences and humanities). A central question we attempt to answer is whether there is a significant difference in the distribution of foreign words across different text types and the various subject domains concerned in the study. As our results will show, there is an uneven use of foreign words across text categories and subject domains. With regard to text categories, the proportion of foreign words successfully separates speech from writing, and, within writing, academic prose from non-academic prose. This finding shows a strong correlation between degrees of formality and proportions of foreign words. Our investigation also shows that different domains have their own preferences for the use of foreign words. To be more exact, French is more related to humanities while the proportion of Latinate words is higher in scientific texts. Greek words behave in the same manner as Latinate words but they are relatively few and thus insignificant compared with the other two groups.

            In conclusion, this survey demonstrates on an empirical basis that the use of foreign words such as French, Latin and Greek distinguishes texts on a scale of different formalities. The survey also demonstrates that different domains seem to have different proportions of, and therefore preferences for, foreign words. These findings will lead to applications in automatic text classification, genre detection and term recognition, a possible development that we are currently investigating in a separate study.



De Forest, M. and E. Johnson (2001). “The Density of Latinate Words in the Speeches of Jane Austen’s Characters”. Literary and Linguistic Computing, 16(4), 389-401.

Greenbaum, S. (1996). The Oxford English grammar. Oxford: Oxford University Press.

Roberts, A. H. (1965). A Statistical Linguistic Analysis of American English. The Hague: Mouton.

The use of vague formulaic language by native and non-native speakers of English in synchronous one-on-one CMC

Julieta Fernandez and Aziz Yuldashev


Vague lexical units serve essential functions in everyday conversation. People use vague language when they want to soften their utterances, create in-group membership (Carter & McCarthy, 2006) or invite interlocutors into a presumably shared social space (Thorne & Lantolf, 2007). Although speakers hardly ever notice it, vague language appears rather widespread in everyday communication (Channell, 1994). Recent corpus linguistic studies of attested utterances point to a prominence of multiword vague expressions, the most recurrent being “I don’t know if,” “or something like that” and “and things like that” (McCarthy, 2004). This body of research suggests that multiword vague language constructions constitute an essential resource for language-in-use.

            To date, research on vague language has mostly focused on its prevalence in spoken discourse, with more recent studies also addressing vague language use in written discourse. However, we are aware of no studies that have described vague language use in synchronous computer-mediated communication (CMC) contexts, such as Instant Messaging (IM) chat. The present paper reports on a study that investigates the use of multiword vague units in one-on-one IM chats. It specifically examines vague expressions that fall under the category of general extenders (adjunctive [e.g., ‘and stuff like that’, ‘and what not’] and disjunctive general extenders [e.g., ‘or anything’, ‘or something’]) as defined by Overstreet (1999). The analysis focuses on the comparison of variations in the use of general extenders (GEs) in IM between native and non-native expert users of English, the impact of their use on fluency and the implications of the findings for computer assisted second language pedagogy.

Drawing on the analysis of a corpus of 524 instant messaging conversations of different lengths, the study sheds light on the patterns of use and functions of GEs in one-on-one text-based IM chat in relation to their use and functions in oral conversations as documented in previous studies. The GEs utilized by participants have been found to operate at several levels (Evinson et al., 2007), where some GEs implicate knowledge that is ostensibly shared by all human beings; some other GEs are more locally constrained or group-bound; while a third group appears to be culturally bound to conceptual and conventional speech communities.

            The examination of the data reveals a number of tendencies in the use of GEs by native and non-native expert English language users. Variations in the frequencies of GE use are discussed in view of the existing evidence regarding the overuse of multi-word vague expressions by non-native language users. The trends of vague expression use are scrutinized in light of a body of research on the affordances provided in CMC environments for language learning and use. The findings and conclusions underscore the pedagogical significance of sensitizing L2 learners to the interactional value of vague formulaic sequences in CMC contexts, their potential for conveying a variety of meanings for a range of purposes, and their contribution to fluency.


Carter, R. and M. McCarthy (2006). Cambridge Grammar of English. Cambridge: CUP

Channell, J. (1994). Vague Language. Oxford: OUP

Evinson, J., M. McCarthy and A. O’Keeffe (2007). “‘Looking Out for Love and All the Rest of It’: Vague Category Markers as Shared Social Space”. In J. Cutting (ed.) Vague Language Explored. New York: Palgrave Macmillan, 138-157.

Thorne, S. L. and J. P. Lantolf (2007). “A Linguistics of Communicative Activity”. In S. Makoni & A. Pennycook (eds), Disinventing and Reconstituting Languages. Clevedon: Multilingual Matters, 170-195.

McCarthy, M. (2004). “Lessons from the analysis of chunks”. The Language Teacher, 28 (7), 9-12.

Units of meaning in language = units of teaching in ESP classes? The case of Business English


Bettina Fischer-Starcke


Both words and phrases are units of meaning in language. When teaching business English to university students of economics, however, the question arises whether words, phrases, or both are suitable units of teaching. This is particularly relevant since teaching materials in business English courses frequently focus on single words, possibly with their collocations, for example to effect payment, or terminology consisting of one or more words, for example seasonal adjustment. This paper takes a corpus linguistic approach to the question of which linguistic units distinguish business English from general English and should therefore be taught to students of business English. It does so by analyzing words and phrases of the Wolverhampton Business English Corpus for their business specificity.

            In this paper, (1) quantitative keywords from the Wolverhampton Business English Corpus when compared with the BNC and (2) the Wolverhampton Business English Corpus’ most frequent phrases between three and six words are analysed and classified into the categories general English and business specific. The phrases are (1) uninterrupted strings of words, so-called n-grams, for example net asset value, and (2) phrases that are variable in one slot, so-called p-frames, for example the * of. The phrases can be either compositional or non-compositional entities.
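As an illustration of the two phrase types, the following Python sketch counts n-grams and p-frames in a tokenised text. It is a minimal sketch rather than the procedure used in the study: the sample sentence is invented, the function names are hypothetical, and the variable slot is assumed to be restricted to internal positions (so that a three-word frame has the shape the * of).

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all uninterrupted strings of n words as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def p_frames(tokens, n):
    """Count n-word frames in which one internal slot is variable ('*')."""
    frames = Counter()
    for gram in ngrams(tokens, n):
        for slot in range(1, n - 1):  # assumption: first and last words stay fixed
            frames[gram[:slot] + ("*",) + gram[slot + 1:]] += 1
    return frames

# Invented sample text, whitespace-tokenised for simplicity.
text = "the value of the fund and the price of the shares and the cost of the loan"
tokens = text.split()

ngram_counts = Counter(ngrams(tokens, 3))
frame_counts = p_frames(tokens, 3)
top_frame, top_count = frame_counts.most_common(1)[0]
```

In this sample no three-word n-gram recurs, yet the p-frame the * of occurs three times, which shows why the two phrase types can give different pictures of the same text.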

            The list of keywords is dominated by business specific lexis, for instance securities, shares and investment. This dominance indicates large lexical differences between general English and business English. This demonstrates the usefulness of teaching words in ESP classes.

            Looking at the corpus’ most frequent phrases, however, the distinction between general English and business English is less clear. Frequently, the genre specificity of a phrase is effected by the occurrence of specialized lexis as part of the phrase. This finding calls (1) the genre specificity of a phrase as a whole and (2) the distinction between words and phrases as characteristic units of business English into question.

            This paper concludes by discussing (1) reasons for the differences of genre specificity between the keywords and the frequent phrases, (2) implications of the lexical identification of genre specificity on units of teaching in business English classes and (3) relevant units of teaching in ESP classes.


The application of corpus linguistics to pedagogy: The role of mediation


Lynne Flowerdew


Corpus-driven learning is usually associated with an inductive approach to learning. However, in this approach students may find it difficult to extrapolate rules or phraseological tendencies from corpus data. In such cases, some kind of ‘pedagogic mediation’ may be necessary (Widdowson 2002), but to date this aspect has not received much attention in the literature.

            I will describe a writing course which, although primarily taking a corpus-driven approach (Johns 1991; Tognini-Bonelli 2001), also makes use of various types of pedagogic mediation. Peer response activities, drawing on Vygotskian socio-cultural theories of co-constructing knowledge through collaborative dialogue, were set up for discussion of the corpus data. Weaker students were intentionally grouped with more proficient ones to foster productive dialogue through ‘assisted performance’. It was found that the student discussions of corpus data focused not only on rule-based grammar queries, but also on phraseological queries and on whether a phrase was suitable from a socio-pragmatic perspective to transfer to their own context of writing. In some instances, teacher intervention was necessary to supply a hint or prompt to carry the discussions forward. By so doing, the teacher would seem to be moving away from a totally inductive approach towards a more deductive one.



Flowerdew, L. (2008) “Corpus linguistics for academic literacies mediated through discussion activities”. In D. Belcher and A. Hirvela (eds) The Oral-literate Connection: Perspectives on L2 Speaking, Writing and Other Media Interactions. Ann Arbor, MI: University of Michigan Press, 268-287.

Flowerdew, L. (in press, 2009) “Applying corpus linguistics to pedagogy: a critical evaluation”. International Journal of Corpus Linguistics, 14 (3).

Flowerdew, L. (in press, 2010) “Using corpora for writing instruction”. In M. McCarthy and A. O’Keeffe (eds) The Routledge Handbook of Corpus Linguistics.

Johns, T. (1991) “From printout to handout: Grammar and vocabulary teaching in the context of data-driven learning”. In T. Odlin (ed.) Perspectives on Pedagogical Grammar. Cambridge: CUP, 293-313.

Tognini-Bonelli, E. (2001) Corpus Linguistics at Work. Amsterdam: John Benjamins.

Widdowson, H.G. (2002) “Corpora and language teaching tomorrow”. Keynote lecture delivered at the Fifth Teaching and Language Corpora Conference, Bertinoro, Italy.



Spontaneity reloaded: American face-to-face and movie conversation compared


Pierfranca Forchini


The present paper empirically examines the linguistic features characterizing American face-to-face and movie conversation, two domains which are usually claimed to differ especially in terms of spontaneity (Taylor 1999, Sinclair 2004). Face-to-face conversation, indeed, is usually considered the quintessence of the spoken language for it is totally spontaneous: it is neither planned, nor edited (Chafe 1982, McCarthy 2003, Miller 2006) in that it takes place in real time; since the context is often shared by the participants, it draws on implicit meaning and, consequently, lacks semantic and grammatical elaboration (Halliday 1985, Biber et al. 1999). Fragmented language (Chafe 1982:39) and normal dysfluency (Biber et al. 1999:1048) phenomena, such as repetitions, pauses, and hesitation (Tannen 1982, Halliday 2005), are some examples which clearly illustrate the unplanned, spontaneous nature of face-to-face conversation. On the other hand, movie conversation is usually described as non-spontaneous: it is artificially written-to-be spoken, it lacks the spontaneous traits which are typical of face-to-face conversation, and, consequently, it is not likely to represent the general usage of conversation (Sinclair 2004). In spite of what is generally maintained by the literature, the Multi-Dimensional analysis presented here shows that the two conversational domains do not differ to a great extent in terms of linguistic features and, thus, confutes the claim that movie language has “a very limited value” in that it does not reflect natural conversation and, consequently, is “not likely to be representative of the general usage of conversation” (Sinclair 2004:80). This finding suggests that movies could be used as potential sources in teaching spoken English.

            The present analysis follows Biber’s (1988) Multi-Dimensional analysis approach, which applies multivariate statistical techniques by observing and analyzing more than one statistical variable at a time. It is based on empirical data retrieved from spoken and movie corpora: the data about American face-to-face conversation are from the Longman Spoken American Corpus, whereas those about American movie conversation are from the AMC (American Movie Corpus), a new corpus purposely built and manually transcribed to study movie language.



Biber, D. (1988). Variation across speech and writing. Cambridge: CUP.

Biber, D., S. Johansson, G. Leech, S. Conrad, and E. Finegan (1999). Longman grammar of spoken and written English. London: Longman.

Chafe, W. (1982). “Integration and involvement in speaking, writing, and oral Literature”. In D. Tannen (ed.) Spoken and written Language: Exploring orality and literacy. Norwood: Ablex, 35–53.

Halliday, M. A. K. (1985). Spoken and written language. Oxford: OUP.

Halliday, M. A. K. (2005). “The spoken language corpus: a foundation for grammatical theory”. In J. Webster (ed.) Computational and quantitative studies. London: Continuum, 157-190.

McCarthy, M. (2003). Spoken language and applied linguistics. Cambridge: CUP.

Miller, J. (2006). “Spoken and Written English”. In B. Aarts and A. McMahon (eds) The handbook of English linguistics. Oxford: Blackwell, 670-691.

Pavesi, M. (2005). La Traduzione Filmica. Roma: Carocci.

Sinclair, J. M. (2004). “Corpus creation”. In G. Sampson and D. McCarthy (eds) Corpus linguistics: Readings in a widening discipline. London: Continuum, 78-84.

Taylor, C. (1999). “Look who’s talking”. In L. Lombardo, L. Haarman, J. Morley and C. Taylor (eds) Massed medias. Milano: LED, 247-278.


Keyness as correlation: Notes on extending the notion of keyness from categorical to ordinal association


Richard Forsyth and Phoenix Lam


The notion of 'keyness' has proved valuable to linguists since it was given a practical operational definition by Scott (1997). Several numeric indices of keyness have been proposed (Kilgarriff, 2001), the most popular of which is computed from the log-likelihood formula (Dunning, 1993). The basis for all such measures is the different frequency of occurrence of a word between two contrasting corpora. In other words, existing measures of keyness assess the strength of association between the occurrence rate of a word (more generally, a term, since short phrases or symbols such as punctuation may be considered) and a categorical variable, namely, presence in one collection of documents or another.
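For readers unfamiliar with the categorical case, the sketch below computes a log-likelihood keyness score for one term in two corpora. It uses the two-cell simplification commonly found in keyword tools (observed versus expected frequency in each corpus), which is an assumption here and not necessarily the exact variant the authors have in mind; the function name and the figures in the example are hypothetical.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood keyness of a term given its frequency in two corpora.

    freq_a, freq_b: occurrences of the term in corpus A and corpus B.
    size_a, size_b: total number of tokens in each corpus.
    """
    total = size_a + size_b
    # Expected frequencies if the term were equally common in both corpora.
    expected_a = size_a * (freq_a + freq_b) / total
    expected_b = size_b * (freq_a + freq_b) / total
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Invented figures: a term occurring 50 times in one 1,000-token corpus
# and 10 times in another corpus of the same size.
score = log_likelihood(50, 1000, 10, 1000)
```

A term with the same relative frequency in both corpora scores zero, and the score grows as the two frequencies diverge. The score itself does not say which corpus the term is key in, so the direction has to be read off the relative frequencies.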

            The present study is motivated by the desire to find an analogous measure for the case in which the variable to be associated with a term's frequency is not simply binary (presence in one or other corpus) but measured on an ordinal or possibly an interval scale. This motivation arose in connection with the affective scoring of texts, specifically rating of emails according to degree of hostility or friendliness expressed in them. In our test data, each text has been manually assigned a score between -4 and +4 on five rating scales, including expressed hostility. However, such text-to-numeric linkages are not specific to this particular application, but may be of interest in many areas, for example: in Diachronic Linguistics, with the link between a text and its date of composition; in Economics, with the link between a news story about a company and its financial performance; in Medicine, with the link between a transcript and the speaker's severity of cognitive impairment; in Political Science, with the link between an election speech and its degree of support for a policy position; and so on.

            This paper examines a number of potential indicator functions for the keyness of terms in such text collections, i.e. those in which the dependent variable is ordered, not categorical, and discusses the results of applying them to a corpus of electronic messages annotated with scores for emotive tone as well as to comparison corpora where the dependent variable is the date of composition of the text. Repeated split-half subsampling suggests that a novel index (frequency-adjusted z-score) is more reliable than either the Pearson correlation coefficient or a straightforward modification of the log-likelihood statistic for this purpose. The paper also discusses some criteria for assessing the quality of such indices.
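Of the indices mentioned, the Pearson correlation coefficient is the simplest to illustrate. The sketch below correlates a term's relative frequency in each text with that text's numeric rating. It is an illustrative baseline only: the rated messages are invented, the function names are hypothetical, and the frequency-adjusted z-score is not shown because the abstract does not define it.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined correlation is treated as "no association"
    return cov / math.sqrt(var_x * var_y)

def term_rate(term, text):
    """Relative frequency of a term in a whitespace-tokenised text."""
    tokens = text.lower().split()
    return tokens.count(term) / len(tokens) if tokens else 0.0

def correlation_keyness(term, rated_texts):
    """Correlate a term's rate per text with the score assigned to each text."""
    rates = [term_rate(term, text) for text, _ in rated_texts]
    scores = [score for _, score in rated_texts]
    return pearson(rates, scores)

# Invented messages with ratings on a -4..+4 friendliness scale.
rated_texts = [
    ("thanks so much for your help", 3),
    ("thanks for the update", 2),
    ("please send the file", 0),
    ("this is unacceptable and stupid", -3),
    ("stupid mistake again", -4),
]
```

With this toy data, `correlation_keyness("thanks", rated_texts)` is positive and `correlation_keyness("stupid", rated_texts)` is negative, so the sign of the index indicates the end of the scale with which a term is associated.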




Dunning, T. (1993). “Accurate methods for the statistics of surprise and coincidence”. Computational Linguistics, 19(1), 61-74.

Kilgarriff, A. (2001). “Comparing corpora”. International Journal of Corpus Linguistics, 6(1), 1-37.

Scott, M. (1997). “PC analysis of key words -- and key key words”, System, 25 (2), 233-245.


A classroom corpus resource for language learners


Gill Francis and Andrew Dickinson


This paper reports on the development of a classroom corpus tool that will give teachers and learners an easy-to-use interface between them and the world of language out there on the web and in a range of compiled corpora. The resource is intended to be used by UK secondary schools and first-year university students, as well as in ELT classrooms here and in other countries. It will enable teachers and learners to access a corpus during a class, whenever they want to investigate how a word or phrase is used in a range of real language texts and situations.

            The aim is to provide users with the facility to type a word or phrase into a web page without any special punctuation or symbols. Corpus output is returned in a clear, uncluttered, and visually attractive display, which learners can easily interpret.

To simplify and clarify the output, there are restrictions on how it is selected and manipulated – for example there is an upper limit on the number of concordance lines retrievable; alphabetical sorting is simplified; and significant collocates are presented positionally without statistical information. Only the most important and often-used options are offered, and ease of use is the priority.
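The kind of restricted output described here can be sketched in a few lines of Python. This is not the resource's actual software, which is said to rely on an off-the-shelf concordancer; the function names, the default limit and the sample sentence are all hypothetical.

```python
from collections import Counter

def concordance(tokens, node, width=4, limit=20):
    """Return up to `limit` (left, node, right) lines for a search word."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, token, right))
    # Simplified sorting: alphabetical by the words to the right of the node.
    lines.sort(key=lambda line: [word.lower() for word in line[2]])
    return lines[:limit]

def positional_collocates(lines):
    """Count the words immediately left and right of the node, without statistics."""
    left1 = Counter(left[-1].lower() for left, _, _ in lines if left)
    right1 = Counter(right[0].lower() for _, _, right in lines if right)
    return left1, right1

def format_line(line, width=30):
    """Lay out one line with the node word aligned in a central column."""
    left, node, right = line
    return f"{' '.join(left):>{width}}  {node}  {' '.join(right)}"

# Invented sample text, whitespace-tokenised for simplicity.
tokens = "I had a look at it and she had a look too but they did not look away".split()
lines = concordance(tokens, "look")
left1, right1 = positional_collocates(lines)
```

Printing `format_line(line)` for each line gives a display with the search word in a fixed column, and `left1` shows that a is the most frequent word immediately before look in this sample.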

            The user will have access to a range of different corpora, although output from all of these will of course ‘look the same’. The web-server software will use an off-the-shelf concordancer to access licensed corpora such as the BNC, as well as some of the hundreds of corpora freely available on the web. Corpora can also be compiled using text from the web, for example from out-of-copyright literary works studied for GCSE, or by trawling through the web looking for key words and sites. Different corpora will be compiled for different groups of users, such as English schoolchildren or intermediate level EFL students.

For initial guidance and ideas, the website also offers a large number of suggestions for stand-alone classroom activities practising points of grammar, lexis, and phraseology, stressing their interconnectedness. The materials owe much to Data-Driven Learning, the pioneering enterprise of Tim Johns, who has sadly recently died.

            Activities include many on collocation, for example the collocates of common ‘delexicalized’ verbs (make a choice, have a look, give a reason). There is also a thread that addresses language change and the tension between description and prescription in language teaching. One example is changing pronoun use in English (e.g. they as a general pronoun to replace he or she; the shifting use of subject and object pronouns, as in Alena and I left together / Alena and me left together / Me and Alena left together). Teachers and learners will be able to use these resources as a springboard for developing their own approach to looking at language in the bright light of a corpus.


A corpus-driven approach to identifying distinctive lexis in translation


Ana Frankenberg-Garcia


Translated texts have traditionally been regarded as inferior, owing to the general perception that they are contaminated by the source text language. In recent years, however, a growing number of translation scholars have begun to question this assumption. While the constraints imposed by the source text language upon the translated text are certainly inevitable, they are not necessarily negative. For Frawley (1984), translated texts are different from the source texts that give rise to them and, at the same time, they are different from equivalent target language texts that are not translations. As such, this third code (Frawley 1984:168) deserves to be studied in its own right.

            A third-code feature that has received relatively little attention in the literature is the distribution of lexis in translated texts. In one of the few studies available, Shama'a (1978, cited in Baker 1993) found that the words day and say could be twice as frequent in English translated from Arabic as in original English texts. In contrast, Tirkkonen-Condit (2004) focuses her analysis on typically Finnish verbs of sufficiency and notices that they are markedly less frequent in translations than in texts originally written in Finnish. Both these studies adopt a bottom-up approach, taking a selection of words as a starting point and subsequently comparing their distributions in translated and non-translated texts. In the present study, I adopt an exploratory, top-down, corpus-driven approach instead. I begin with a comparable corpus of translated and non-translated texts and then attempt to identify which lemmas are markedly over- and under-represented in the translations. The results not only appear to support existing bottom-up intuitions regarding distinctive lexical distributions, but also disclose a number of unexpected contrasts that would not have been discernible without recourse to corpora.
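One simple way to operationalise over- and under-representation is the log ratio of a lemma's relative frequency in the two subcorpora. The sketch below is an assumption for illustration, not the measure used in the study: the counts are invented, the function names are hypothetical, and the smoothing constant is an arbitrary choice that avoids division by zero for lemmas missing from one subcorpus.

```python
import math
from collections import Counter

def log_ratio(lemma, translated, original, smoothing=0.5):
    """Binary log of the ratio of a lemma's relative frequencies.

    Positive values: over-represented in the translated texts.
    Negative values: under-represented in the translated texts.
    """
    t_rate = (translated[lemma] + smoothing) / (sum(translated.values()) + smoothing)
    o_rate = (original[lemma] + smoothing) / (sum(original.values()) + smoothing)
    return math.log2(t_rate / o_rate)

def rank_lemmas(translated, original):
    """Order all lemmas from most over- to most under-represented."""
    lemmas = set(translated) | set(original)
    return sorted(lemmas, key=lambda l: log_ratio(l, translated, original), reverse=True)

# Invented lemma counts for two subcorpora of 1,000 tokens each.
translated = Counter({"say": 40, "day": 20, "other": 940})
original = Counter({"say": 20, "day": 10, "other": 970})
ranking = rank_lemmas(translated, original)
```

A value near 1 means the lemma is roughly twice as frequent in the translations, and a value near -1 means it is roughly half as frequent.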



Baker, M. (1993). 'Corpus linguistics and translation studies. Implications and applications.' In M. Baker, G. Francis and E. Tognini-Bonelli (eds) Text and Technology: In Honour of John Sinclair. Amsterdam & Philadelphia: John Benjamins, pp 233-250.

Frawley, W. (1984). 'Prolegomenon to a theory of translation'. In W. Frawley (ed.) Translation, literary, linguistic and philosophical perspectives. London & Toronto: Associated University Presses, pp 159-175.

Shama'a, N. (1978). A linguistic analysis of some problems of Arabic to English translation. D. Phil thesis, Oxford University.

Tirkkonen-Condit, S. (2004). 'Unique items – over- or under-represented in translated language?' In A. Mauranen and P. Kujamäki (eds) Translation Universals, Do They Exist? Amsterdam & Philadelphia: John Benjamins, pp 177-184.



The role of language in the popular discursive construction of belonging in Quebec: A corpus assisted discourse study of the Bouchard-Taylor Commission briefs


Rachelle Freake


This study investigates how identity is constructed in relation to language in Quebec popular discourse. Since its rebirth after the Quiet Revolution, Quebec has found itself in the midst of debate as to how to adapt French Canadian culture and language to globalizing and immigration-based contexts. The debate culminated in 2007 with the establishment of the Bouchard-Taylor Commission, which had the purpose of making recommendations concerning religious and cultural accommodation to the Premier of Quebec; these recommendations were based on public consultations held across Quebec. One method of participating in the consultation was through the submission of a brief; over 900 briefs in both French and English were submitted, and these serve as data for this research. The briefs were compiled into two corpora, one English and one French. Using a corpus-assisted discourse studies (CADS) approach (Baker, Gabrielatos, Khosravinik, Krzyzanowski, McEnery & Wodak, 2008), this research analyses the French and English corpora separately, using bottom-up corpus analysis tools (frequency, concordance and keyword) as well as the top-down Discursive Construction of National Identity approach to Critical Discourse Analysis (Wodak, de Cillia, Reisigl & Liebhart, 1999). Collocation and cluster patterns reveal categories of reference that represent Quebec in terms of both civic and ethnocultural nationalism. Corpus findings are enhanced by the in-depth critical discourse analysis of three individual briefs, which show how individuals adopt and adapt different versions of nationalism to advance their respective interests. Together, corpus and discourse analysis show that the majority of participants construct their identity in relation to Quebec civic society, drawing on the “modernizing” (cf. Heller, 2001) French Canadian identity discourse. This discourse blends civic and ethnocultural elements, and emphasizes language as an identity pillar and a key sign of membership.
With regard to methodological approach, this study finds that discourse analysis reveals important contextual dimensions that are not apparent from the corpus analysis alone, thus highlighting the advantage of a combined corpus and discourse analysis approach. Also, comparing corpora of different languages and different sizes, and compiling keyword lists for these corpora using different comparator corpora, raises questions about the comparability of keywords.



Baker, P., C. Gabrielatos, M. Khosravinik, M. Krzyzanowski, T. McEnery and R. Wodak (2008). “A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press”. Discourse & Society, 19(3), 273-306.

Heller, M. (2001). “Critique and sociolinguistic analysis of discourse”. Critique of Anthropology, 21(2), 117-141.

Wodak, R., R. de Cillia, M. Reisigl and K. Liebhart (1999). The discursive construction of national identity. Edinburgh: Edinburgh University Press.


A linguistic mapping of science popularisation: Feynman’s lectures on physics and what corpus methods can tell us


Maria Freddi


The present paper addresses the popularization of science by scientists themselves, a topic that has so far been neglected (Gross 2006) and that is analysed here along the cline from teaching to popularising science. The aim is twofold: on the one hand, to analyse Richard Feynman’s collection of didactic lectures, entitled Six Easy Pieces, in order to investigate the physicist’s pedagogical style; on the other hand, to compare them with a collection of popularising lectures published as The Character of Physical Law, in order to pinpoint differences and/or similarities in terms of language choices, word choice and phraseology, and textual strategies. Corpus methods are used to quantify variation internal to each text and across texts. The approach is corpus-driven and draws on different models of word distribution and recurrent word-combinations (see Sinclair 1991, 2004; Moon 1998; Biber and Conrad 1999; Stubbs 2007). In particular, work by Biber et al. on how different multi-word sequences – lexical bundles – occur with different distributions in different text-types is referred to. The paper also draws on other computer-aided text analysis methods to describe meaning in texts such as Mahlberg’s (2007) concept of ‘local textual functions’, i.e. the textual functions of frequent phraseology, and Hoey’s notion of ‘textual colligation’ (Hoey 2006).

Each collection of lectures is thus treated both as text and corpus, so that keywords resulting from the comparison of the two corpora are then considered and related to the textual and registerial specificities of each (see among others, Baker 2006; Scott and Tribble 2006). Finally, findings are related to qualitative studies of scientific discourse, particularly in the wake of rhetoric of science (e.g. Gross 2006). It is suggested that corpus tools, by helping identify frequent phraseology and textual functions and compare the academic and public lectures, will lead to a more thorough mapping of Feynman’s unconventional style of communicating science as well as a deeper understanding of successful science popularisation.
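The keyword comparison described above can be illustrated with a minimal sketch. The abstract does not say which keyness statistic is used; the log-likelihood measure below is an assumption chosen because it is common in keyword analysis, and the function names `log_likelihood` and `keywords` are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, n1, n2):
    """Log-likelihood keyness for a word occurring a times in a study
    corpus of n1 tokens and b times in a reference corpus of n2 tokens."""
    # Expected frequencies if the word were equally likely in both corpora.
    e1 = n1 * (a + b) / (n1 + n2)
    e2 = n2 * (a + b) / (n1 + n2)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(study_tokens, reference_tokens, top=5):
    """Return the words that are relatively more frequent in the study
    corpus, ranked by log-likelihood (highest first)."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    n1, n2 = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]
        if a / n1 > b / n2:  # keep positive keywords only
            scored.append((log_likelihood(a, b, n1, n2), word))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(word, round(score, 2)) for score, word in scored[:top]]
```

Each collection of lectures would serve in turn as the study corpus and as the reference corpus, so that two keyword lists are obtained.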



Baker, P. (2006), Using Corpora in Discourse Analysis. London: Continuum

Biber, D. and S. Conrad (1999), “Lexical bundles in conversation and academic prose”, in H. Hasselgard and S. Oksefjell (eds) Out of Corpora. Studies in Honour of Stig Johansson. Amsterdam: Rodopi, 181-190

Gross, A. G. (2006). Starring the Text. The Place of Rhetoric in Science Studies. Carbondale: Southern Illinois Press.

Hoey, M. (2006). “Language as choice: what is chosen?”. In G. Thompson and S. Hunston (eds) System and Corpus. London: Equinox, 37-54

Mahlberg, M. (2007). “Clusters, key clusters and local textual functions in Dickens.” Corpora, 2 (1), 1-3

Moon, R. (1998). Fixed Expressions and Idioms in English. A corpus-based approach. Oxford: Clarendon.

Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press

Sinclair, J. (2004). Trust the Text: Language, corpus and discourse. London: Routledge

Automatic grouping of morphologically related collocations


Fabienne Fritzinger and Ulrich Heid


Term variants have been extensively investigated in the past [2], but less is known about the variability of collocations in specialised languages. Variability phenomena include differences in article use, number and modifiability, but also different morphological realisations of the same combination of lexical elements. German examples include the relation between the verb+object collocation Warenzeichen+eintragen („register+trademark“) and the collocation of a nominalisation and its genitive attribute, Eintragung des Warenzeichens („registration of+trademark“), or the relation with a noun and (adjectival) participle collocation, eingetragenes/einzutragendes Warenzeichen („registered/to-be-registered trademark“).

Quantitative data for each collocation type show preferences in specialised phraseology: the collocation Patentanmeldung/Anmeldung+einreichen („submit a (patent) registration“) is equally typical in verb+noun, noun+genitive and participle+noun form. By contrast, the combination Klage+einreichen shows an uneven distribution over the collocation types: Einreichung+KlageGen and eingereichte Klage are rather frequent (ranks 7 and 8), whereas combinations of the verb and the noun, Klage+einreichen („file a lawsuit“), are quite rare (rank 43).

            We present a system for grouping such morphologically related collocation candidates together, using a standard corpus linguistic processing pipeline and a morphology system.

A systematic account of collocation variants (and tools to prepare it) is needed for terminography and for resource building for both symbolic and statistical Machine Translation: grouping the collocation variants together in a terminological database or a specialised dictionary improves the structure of such a resource, making it more compact. Adding quantitative data about preferences serves technical writers who need to select adequate expressions. „Bundles“ of related collocations of two languages can be integrated into lexicons of symbolic Machine Translation systems, and they can serve to reduce data sparseness in alignment and equivalent learning from parallel and comparable corpora.

The architecture and application of our tools are exemplified with German data from the field of IPR and trademark legislation; however, the tools are applicable to general language as well as to any specialised domain. To identify the vocabulary of the trademark legislation sublanguage, single word candidate terms are extracted from a 78 million word corpus (of German juridical literature) and are compared with terms of general language texts [1]. For these single word term candidates, we then extract collocation candidates. In order to identify even non-adjacent collocations (which frequently occur in German), we use dependency parsing [3] to extract the following collocational patterns: adjectives+nouns, nouns+genitive nouns and verb+object pairs.

            The verb+object pairs are the starting point for grouping the collocations, as nominalisations and derived adjectives can easily be mapped onto their underlying verbal predicates. For this mapping, we use SMOR [4]. It analyses complex words into a sequence of morphemes which include base and affixes or, in the case of compounds, head and non-head components. Based on this output, we group the different morphological realisations together.
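The grouping step can be sketched as follows. This is only an illustration under stated assumptions: the hand-written table `VERB_BASE` is a hypothetical stand-in for the output of a morphological analyser, the function name `group_collocations` is hypothetical, and the frequencies in the sample data are invented.

```python
from collections import defaultdict

# Hypothetical stand-in for a morphological analysis: each predicate lemma
# is mapped to the verb it derives from. A real system would obtain this
# mapping from a morphology system rather than from a hand-written table.
VERB_BASE = {
    "eintragen": "eintragen",
    "Eintragung": "eintragen",
    "eingetragen": "eintragen",
    "einzutragend": "eintragen",
    "einreichen": "einreichen",
    "Einreichung": "einreichen",
    "eingereicht": "einreichen",
}


def group_collocations(candidates):
    """Group collocation candidates that share a verbal base and a noun.

    candidates: iterable of (pattern, predicate_lemma, noun_lemma, frequency).
    Returns {(verb, noun): {pattern: summed frequency}}.
    """
    groups = defaultdict(dict)
    for pattern, predicate, noun, freq in candidates:
        base = VERB_BASE.get(predicate)
        if base is None:
            continue  # predicate cannot be traced back to a verb
        key = (base, noun)
        groups[key][pattern] = groups[key].get(pattern, 0) + freq
    return dict(groups)


# Invented frequencies, for illustration only.
candidates = [
    ("verb+object", "eintragen", "Warenzeichen", 12),
    ("noun+genitive", "Eintragung", "Warenzeichen", 30),
    ("participle+noun", "eingetragen", "Warenzeichen", 25),
    ("noun+genitive", "Einreichung", "Klage", 40),
    ("verb+object", "einreichen", "Klage", 3),
]
groups = group_collocations(candidates)
```

Each resulting group keeps the per-pattern frequencies side by side, which is the kind of quantitative preference data discussed above.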



[1] K. Ahmad, A. Davies, H. Fulford and M. Rogers (1992). “What is a term? The semi-automatic extraction of terms from text”. In M. Snell-Hornby et al. Translation studies – an interdiscipline.

[2] C. Jacquemin (2001). Spotting and discovering terms through Natural Language Processing.

[3] M. Schiehlen (2003). “A cascaded finite-state parser for German”. In Proceedings of the Research Note Sessions of the 10th Conference of the EACL'03.

[4] H. Schmid, A. Fitschen and U. Heid (2004). “A German computational morphology covering derivation, composition and inflection”. In Proceedings of the LREC 2004.

Mediated modality: A corpus-based investigation of original, translated and non-native ESP

Federico Gaspari and Silvia Bernardini


Although linguists disagree on the finer points involved in classifying and describing English modal verbs, it does not seem controversial to argue that modality is a central and challenging topic in English linguistics. In this respect, Palmer (1979/1990) observes that “[t]here is, perhaps, no area of English grammar that is both more important and more difficult than the system of the modals”, and Perkins (1983: 269) claims that “the English modal system tends to more anarchy than any other area of the English language”. Due to its complexities, modality is likely to be particularly problematic for translators and non-native speakers writing in English. It has been suggested that these two sets of language users employ similar strategies and produce mediated language that has common and distinctive patterns, showing properties that differ in a number of respects from those found in native/original texts (Ulrych & Murphy, 2008).

      This paper attempts to shed light on this claim, investigating a range of features of modality by means of two specialised monolingual comparable corpora of English: one containing original/native and translated financial reports (with the source language of all the translations being Italian), the other comprising research working papers in economics written by native and non-native speakers of English (in this case all the L2 authors have Italian as their mother tongue). While assembling a set of more closely comparable corpora would have been desirable, this proved impossible to achieve due to the unavailability of appropriate texts for all the components that were required.

      Phenomena discussed in the paper include the distribution of epistemic vs. deontic modals in each sub-corpus, their collocational patterning (particularly concerning co-occurrence with adverbs expressing judgements of the writer’s confidence, such as possibly and perhaps for possibility, as opposed to surely and certainly for necessity), as well as an analysis of their discursive functions. Looking at how modality is expressed in L2 English writing by Italian native speakers and in translations from Italian, we aim first of all to determine the extent to which these two distinct forms of mediated language differ from each other, and secondly to understand how they compare to original/native usage.

      At the descriptive and theoretical levels, this study provides insights about English as a Lingua Franca, while contributing to the ongoing debate on translation universals taking place within Corpus-Based Translation Studies (Baker, 1993; Chesterman, 2004). From an applied point of view, on the other hand, the findings of this research have implications for English language teaching (particularly ESP and EAP in the fields of economics and finance), translation teaching and translation quality evaluation.



Baker, M. (1993) “Corpus Linguistics and Translation Studies: Implications and Applications”. In M. Baker, G. Francis and E. Tognini-Bonelli (eds) Text and Technology: in Honour of John Sinclair. Amsterdam: John Benjamins, 233-250.

Chesterman, A. (2004) “Hypotheses about Translation Universals”. In G. Hansen, K. Malmkjær and D. Gile (eds) Claims, Changes and Challenges in Translation Studies. Amsterdam: John Benjamins, 1-13.

Palmer, F. R. (1979/1990) Modality and the English modals. Second edition. London: Longman.

Perkins, M. R. (1983) “The Core Meaning of the English Modals”. Journal of Linguistics 18, 245-273.

Ulrych, M. & A. Murphy (2008) “Descriptive Translation Studies and the Use of Corpora: Investigating Mediation Universals”. In C. Taylor Torsello, K. Ackerley & E. Castello (eds) Corpora for University Language Teachers. Bern: Peter Lang.


Speaking in piles: Paradigmatic annotation of a spoken French corpus


Kim Gerdes


This presentation focuses on the “paradigmatic” syntactic schemes that are currently used in the annotation of the Rhapsodie corpus project, an ongoing 3-year project sponsored by the French National Research Agency (ANR). The project is developing a 30-hour (360,000-word) reference corpus of spoken French whose key features are that it is free, representative, and annotated with sound, syllable-aligned phonological transcription, orthographically corrected transcription and, on at least 20% of the corpus, prosodic and syntactic layers of information.

Based on the Aix School grid analysis of spoken French (Blanche-Benveniste et al. 1979), the notion of « pile » is introduced in the syntactic annotation, allowing for a unified elegant description of various paradigmatic phenomena like disfluency, reformulation, apposition, question-answer relationships, and coordinations. Piles naturally complete dependency annotations by modeling non-functional relations between phrases. We consider that a segment Y of an utterance piles up with a previous segment X when Y fills the same syntactic position as X: Y can be a (voluntary or not) reformulation of X, Y can instantiate X or Y can be added to X in a coordination. If the layers X and Y are adjacent, we note {X| C Y} the pile, where C is the pile marker (i.e. a conjunction in a coordination or an interregnum in a disfluency (Shriberg 1994)).

            on imaginait {des bunkers | tout ce {qui avait pu faire le rideau de fer | qui avait été supprimé {très rapidement | mais sans en enlever toutes les infrastructures}}}

'One imagined bunkers, everything that the Iron Curtain was made up of that was removed very quickly but without taking down its entire infrastructure.'

Although our encoding of layers seems rather symmetric, we give {, | and } very different interpretations. We call | the junction position: it marks the point where a backtrack to { is made. The final position, marked }, is less relevant, and even irrelevant in the case of disfluency (see Heeman et al. 2006, who propose a similar encoding for disfluency without the closing }). Only the junction position corresponds to a prosodic break.
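To make the bracket notation concrete, the following minimal sketch reads a string in the { | } encoding into a nested structure. It is an illustration only, not the project's annotation tool: the function name `parse_piles` and the dictionary layout are hypothetical, pile markers are left inside their layer rather than separated out, and a missing closing } is tolerated in line with the remark above.

```python
def parse_piles(text):
    """Parse pile notation into a nested list.

    Plain text becomes a string; a pile {A | B | ...} becomes a dict
    {"layers": [A, B, ...]}, where each layer is a list of strings and piles.
    """
    pos = 0

    def flush(buf, items):
        # Store accumulated plain text, ignoring whitespace-only stretches.
        chunk = "".join(buf).strip()
        if chunk:
            items.append(chunk)

    def parse_sequence(closers):
        nonlocal pos
        items, buf = [], []
        while pos < len(text) and text[pos] not in closers:
            ch = text[pos]
            pos += 1
            if ch == "{":
                flush(buf, items)
                buf = []
                items.append(parse_pile())
            else:
                buf.append(ch)
        flush(buf, items)
        return items

    def parse_pile():
        nonlocal pos
        layers = [parse_sequence("|}")]
        while pos < len(text) and text[pos] == "|":
            pos += 1  # junction position: backtrack to the opening brace
            layers.append(parse_sequence("|}"))
        if pos < len(text) and text[pos] == "}":
            pos += 1  # the closing brace is optional
        return {"layers": layers}

    return parse_sequence("")


example = (
    "on imaginait {des bunkers | tout ce {qui avait pu faire le rideau de fer"
    " | qui avait été supprimé {très rapidement"
    " | mais sans en enlever toutes les infrastructures}}}"
)
tree = parse_piles(example)
```

In `tree`, the first element is the left context and the second is the outermost pile, whose second layer contains the two embedded piles.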

We use a two-dimensional graphical representation of piles where layers, as well as pile markers, are vertically aligned. We add horizontal lines, called junctions, between the various layers of a pile, in order to indicate the extension of each layer. Only the first layer is syntactically linked by a functional dependency to the left context.


     on imaginait des bunkers

                   tout ce   qui avait pu faire le rideau de fer

                                 qui avait été supprimé        très rapidement

                                                                  mais sans en enlever …


Blanche-Benveniste C. et al. (1979) Des grilles pour le français parlé, Recherches sur le français parlé, 2, 163-205.

Heeman P., A. McMillin, J.S. Yaruss (2006) An annotation scheme for complex disfluencies, International Conference on Spoken Language Processing, Pittsburgh.

Shriberg E. (1994). Preliminaries to a Theory of Speech Disfluencies, PhD Thesis, Berkeley University.

eHumanities Desktop: An extensible Online System for Corpus Management and Analysis


Rüdiger Gleim, Alexander Mehler, Ulli Waltinger and Peter Menke


Beginning with quantitative corpus linguistics and related fields the application of computer-based methods increasingly reaches all kinds of disciplines in the humanities. The spectrum of processed documents spans textual content (e.g. historical texts, newspaper articles, lexica and dialog transcriptions), annotation formats (e.g. of multimodal data), and images as well as multimedia. This raises new challenges in maintaining, processing and analyzing resources, especially because research groups are often distributed over several institutes. Finally, sustainability and harmonization of linguistic resources also become important issues. Thus, the ability to collaboratively work on shared resources, while ensuring interoperability, is an important issue in the field of corpus linguistics (Dipper et al., 2006; Ide and Romary, 2004).

The eHumanities Desktop is designed as a general-purpose platform for research in the humanities. It offers web-based access to (i) managing documents of arbitrary type and organizing them freely in repositories, (ii) sharing and working collaboratively on resources with other users and user groups, (iii) browsing and querying resources (including e.g. XQueries on XML documents), and (iv) processing and analyzing documents via an extensible Tool API (including e.g. raw text preprocessing in multiple languages, categorization and lexical chaining). The Desktop is platform independent and offers a full-fledged desktop environment to work on corpora via any modern browser.

The eHumanities Desktop is an advancement of the Ariadne System, which was developed in the DFG funded project SFB673/X1 "Multimodal alignment corpora: statistical modeling and information management". We invite researchers to contact us and use the Desktop for their studies.



Dipper, S., E. Hinrichs, T. Schmidt, A. Wagner and A. Witt (2006). “Sustainability of linguistic resources”. In E. Hinrichs, N. Ide, M. Palmer and J. Pustejovsky (eds) Proceedings of the LREC 2006 Workshop on Merging and Layering Linguistic Information. Genoa, Italy.

Ide, N. and L. Romary (2004). “International standard for a linguistic annotation framework”. Nat. Lang. Eng., 10 (3-4), 211–225.

Mehler, A., R. Gleim, A. Ernst and U. Waltinger (2008). “WikiDB: Building Interoperable Wiki-Based Knowledge Resources for Semantic Databases”. In Sprache und Datenverarbeitung. International Journal for Language Data Processing, 2008.

Mehler, A., U. Waltinger and A. Wegner (2007). “A formal text representation model based on lexical chaining”. In Proceedings of the KI 2007 Workshop on Learning from Non-Vectorial Data (LNVD 2007) September 10, Osnabrück, 17–26. Universität Osnabrück.

Waltinger, U., A. Mehler and G. Heyer (2008a). “Towards automatic content tagging: Enhanced web services in digital libraries using lexical chaining.” In 4th Int. Conf. on Web Information Systems and Technologies (WEBIST ’08), 4-7 May, Funchal, Portugal. Barcelona.





Analysing speech acts with the Corpus of Greek Texts: Implications for a theory of language


Dionysis Goutsos


The paper takes as its starting point the analysis of two instances of language use with the help of material provided by the Corpus of Greek Texts (CGT). CGT is a new reference corpus of Greek (30 million words), developed at the University of Athens for the purposes of linguistic analysis and pedagogical applications (Goutsos 2003). It has been designed as a general monolingual corpus, including data from a wide variety of spoken and written genres in Greek from two decades (1990 to 2010). The examples discussed here concern:


      a) an utterance on a public sign, used to enforce a prohibition, and

      b) the different forms of the verb ksexnao (‘to forget’) in Greek.


      The question in both cases is whether there is evidence in the corpus that can be brought to clarify the role and illocutionary force of the speech acts involved. This evidence comes from phraseological patterns, which are found to be used for specific discourse purposes. It is suggested thus that discursive acts manifest themselves in linguistic traces, retrievable from the wider co-text and characterizing particular contexts. These traces allow us to identify repeatable relations between forms and functions and reveal the predominantly evaluative orientation of language.

            On the basis of this analysis, the paper moves on to discuss the claim of corpus linguistics and its implications for a theory of language. In particular, it is argued that corpus linguistics can bridge the gap between formalist insights and interactional approaches to language, by emphasizing recurrent relations between forms and function and thus subscribing neither to a view of fixed correspondences between form and function nor to a totally context-dependent approach. As such, corpus linguistics points to a view of language as a form of historical praxis of a fundamentally social and dialogical nature and in this it concurs with Voloshinov’s (1973 [1929]) approach to language as a systematic accident, produced by repeatable and recurring forms of utterances in particular circumstances. A Bakhtinian theory, emphasizing evaluation and addressivity as fundamental characteristics of language, can mediate between today’s theories of ‘individualistic subjectivism’ and ‘abstract objectivism’, offering a more fruitful approach to what language is and how it works.



Goutsos, D. (2003). “Corpus of Greek Texts: Design and implementation [In Greek]”. Proceedings of the 6th International Conference of Greek Linguistics, University of Crete, 2003. Available at:

Voloshinov, N. (1973 [1929]). Marxism and the Philosophy of Language. Cambridge, MA: Harvard University Press.

Bigrams in registers, domains, and varieties: A bigram gravity approach to the homogeneity of corpora


Stefan Th. Gries


One of the most central domains in corpus linguistics is concerned with issues of corpus comparability, homogeneity, and similarity. In spite of the importance of this topic (after all, only when we know how homogeneous and/or comparable our corpora are can we be sure as to how robust our findings and generalizations are), there is comparatively little work on it. In addition, most of the little work that is there is based on the frequencies with which particular lexical items occur in a corpus or in parts of corpora (cf. Kilgarriff 2001, 2005, Gries 2005). However, this is problematic because (i) researchers working on grammatical topics may not benefit at all from corpus homogeneity measures that are based on lexical frequencies alone, and (ii) raw frequencies of lexical items are a rather blunt instrument that does not take much of the information that corpora provide into account.

In this paper, I will explore how we can compare corpora or parts of corpora in a way that I will argue is somewhat more appropriate. First, I will compare parts of corpora on the basis of the strength of association of all bigrams in the corpus (parts). Second, the measure of association strength is one that has so far not received much attention although it exhibits a few interesting characteristics, namely Daudaravičius and Marcinkevičienė's (2004) gravity counts.

      Unlike nearly all other collocational statistics, this measure has the attractive property that it is not only based on the token numbers of occurrence and co-occurrence, but also considers the number of different types the parts of a bigram co-occur with.
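The abstract does not spell out the formula. The sketch below assumes one formulation of gravity counts reported in the literature, G(x, y) = log(f(x, y) · n(x) / f(x)) + log(f(x, y) · n'(y) / f(y)), where n(x) is the number of different types that follow x and n'(y) the number of different types that precede y. The function name `gravity_counts` is hypothetical, and the details should be checked against the original publication.

```python
import math
from collections import Counter, defaultdict


def gravity_counts(tokens):
    """Return {(x, y): gravity} for every bigram type in a token list.

    Assumed formula (see lead-in):
    G(x, y) = log(f(x, y) * n(x) / f(x)) + log(f(x, y) * n'(y) / f(y))
    """
    word_freq = Counter(tokens)                     # token frequencies f(x)
    bigram_freq = Counter(zip(tokens, tokens[1:]))  # bigram frequencies f(x, y)

    right_types = defaultdict(set)  # types seen directly after a word
    left_types = defaultdict(set)   # types seen directly before a word
    for x, y in bigram_freq:
        right_types[x].add(y)
        left_types[y].add(x)

    scores = {}
    for (x, y), f_xy in bigram_freq.items():
        scores[(x, y)] = (
            math.log(f_xy * len(right_types[x]) / word_freq[x])
            + math.log(f_xy * len(left_types[y]) / word_freq[y])
        )
    return scores
```

The type counts `len(right_types[x])` and `len(left_types[y])` are what distinguish this measure from statistics based on token frequencies alone.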

The corpus whose homogeneity I study is the BNC Baby. In order to determine the degree of homogeneity and at the same time validate the approach, I split up the BNC Baby into its 4 main parts (aca, dem, fic, and news) and its 19 different domains, and, within each of these subdivisions, compute each bigram's gravity measure of collocational strength. This setup allows several comparisons, whose results I will present:

- the 4 corpus parts and the 19 corpus domains can be compared with regard to the overall average bigram gravity;

- the 4 corpus parts and the 19 corpus domains can be compared with regard to the average bigram gravity per file;

- the 4 corpus parts and the 19 corpus domains can be compared with regard to the average bigram gravity per sentence.

More interestingly still, I show to what degree the domain classifications of the corpus compilers are replicated in the sense that, when the 19 corpus domains are clustered in a hierarchical cluster analysis, they largely form groups that resemble the four main corpus parts. Similar results for the comparison of different varieties of English will also be reported briefly.

These findings not only help evaluate the so far understudied measure of gravity counts, but also determine to what degree top-down defined register differences are reflected in distributional patterns.


Multimodal Russian Corpus (MURCO): Types of annotation and annotator’s workbench


Elena Grishina


2009 is the first year of the actual design of the Multimodal Russian Corpus (MURCO), which will be created within the framework of the Russian National Corpus (RNC). The RNC contains the Spoken Subcorpus (its volume is currently circa 8 million tokens), but this subcorpus does not include oral speech proper; it includes only the transcripts of spoken texts (see [Grishina 2006]). Therefore, to supplement the Spoken Subcorpus of the RNC, we have decided to develop the first generally accessible and relatively large multimodal corpus. To avoid problems concerning copyright and invasion of privacy, we have decided to use cinematographic material in the MURCO. The total volume of cinematographic transcripts in the Spoken Subcorpus of the RNC is about 5 million tokens. If we manage to transform this subcorpus into a multimodal state, we will obtain the largest open multimodal corpus, so the task is ambitious enough.

The MURCO will include the annotation of different types. These are:

·         the standard RNC annotation, which consists of morphological, semantic, accentological and sociological annotation

·         the orthoepic annotation

·         the speech act annotation

·         the gesture annotation (see [Grishina 2009a]).

In addition to the standard RNC annotation, we plan to annotate the MURCO from the point of view of phonetics and orthoepy (see [Grishina 2009b]). The orthoepic annotation includes two types of information: information concerning sound combinations and information concerning the accentological structure of a word.

The speech act annotation will characterize any phrase in a clip from the point of view of 1) typical social situation, 2) types of speech acts, 3) speech manner, 4) types of repetition, 5) types of vocal gestures and non-verbal words.

The gesture annotation will characterize every clip from the point of view of 1) types of gestures, 2) meaning of gestures, 3) Russian gesture names, 4) active/passive organs, 5) spatial orientation of an active organ and direction of its movement.

The annotation process in the MURCO may be

·         1) automatic (morphology, semantics, orthoepy),

·         2) manual: to annotate the speech acts and the gestures in the MURCO, an annotator has to work in manual mode only. To facilitate this work we have decided to construct two special-purpose workbenches: the “Maker”, to annotate speech acts, vocal gestures, interjections, repetitions, and so on, and the “GesturesMarker”, to annotate gestures and their components. The workbenches have user-friendly interfaces and considerably increase the intensity and speed of the annotation process.



Grishina E. (2006) “Spoken Russian in the Russian National Corpus (RNC)”. LREC2006: 5th International Conference on Language Resources and Evaluation. ELRA, 2006. p. 121-124.

Grishina E. (2009a) “Multimedijnyj corpus russkogo jazyka (MURCO): problemy annotacii”. RNC2009.

Grishina E. (2009b forthcoming) “National’nyj corpus russkogo jazyka kak istochnik svedenij ob ustnoj rechi”. Rechevyje tekhnologii, 3, 2009.


Synergies between transcription and lexical database building: the case of German Sign Language


Thomas Hanke, Susanne Koenig, Reiner Konrad, Gabriele Langer and Christian Rathmann


In corpus-based lexicography research, tokenising and lemmatising play a crucial role for empirical analyses, including frequency statistics or lemma context analyses. For languages with conventionalised writing systems and available lexical database resources, lemmatising a corpus is more or less an automatic process. Changes in the lexical database resource are rather a secondary issue.

Evidently, this approach does not apply to German Sign Language (DGS), a language in the visuo-gestural modality. Owing to the lack of an established writing system and of close-to-complete lexical database resources for DGS, it is necessary to do both tokenising and lemmatising manually and to extend the lexical resources in parallel with the lemmatisation process. Instead of processing transcripts with the support of available lexical database resources and then storing them again as independent entities, we adopt an integrated approach using a multi-user relational database. The database combines transcripts with the lexical resource. Transcript tags which identify stretches of the utterance as tokens are not text strings but are treated as database entities. For lemma review in the lexical database, tokens currently assigned to a type can be accessed directly. Consequently, changes of type information, e.g. gloss or citation form, are immediately reflected in all transcripts.
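The integrated design can be illustrated with a minimal relational sketch. Everything in it is hypothetical: the table and column names, the sample gloss and the helper functions are invented for illustration and do not describe the authors' actual database schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE types (
    id INTEGER PRIMARY KEY,
    gloss TEXT NOT NULL
);
CREATE TABLE tokens (
    id INTEGER PRIMARY KEY,
    transcript TEXT NOT NULL,
    start_ms INTEGER NOT NULL,
    end_ms INTEGER NOT NULL,
    type_id INTEGER NOT NULL REFERENCES types(id)
);
""")

# One type and two tokens of it in different transcripts (invented data).
conn.execute("INSERT INTO types (id, gloss) VALUES (1, 'HOUSE1')")
conn.executemany(
    "INSERT INTO tokens (transcript, start_ms, end_ms, type_id) "
    "VALUES (?, ?, ?, ?)",
    [("t01", 0, 400, 1), ("t02", 1200, 1650, 1)],
)


def glosses(transcript):
    """Glosses of a transcript, looked up through the token-type link."""
    rows = conn.execute(
        "SELECT ty.gloss FROM tokens tk JOIN types ty ON ty.id = tk.type_id "
        "WHERE tk.transcript = ? ORDER BY tk.start_ms",
        (transcript,),
    )
    return [row[0] for row in rows]


def tokens_of_type(type_id):
    """All tokens currently assigned to a type, for lemma review."""
    return list(conn.execute(
        "SELECT transcript, start_ms FROM tokens WHERE type_id = ? "
        "ORDER BY transcript, start_ms",
        (type_id,),
    ))


def rename_type(type_id, new_gloss):
    """Change a gloss once; every transcript sees the change immediately."""
    conn.execute("UPDATE types SET gloss = ? WHERE id = ?", (new_gloss, type_id))
```

Because tokens store a reference to the type rather than a copy of the gloss string, calling `rename_type(1, 'HOUSE1A')` changes what `glosses` returns for both transcripts without touching any token row.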

This approach has three advantages. First, it allows productive forms (i.e. non-conventionalised forms) to be treated in parallel with lemmas under review. Productive forms consist of sublexical elements with visually-motivated (i.e. iconic) properties. Providing structures to access productive forms in the database is important as these forms are perfect candidates for lexicalisation. Second, the approach enables transcribers to constantly switch back and forth between bottom-up and top-down analyses as the token-type matching requires reviewing, extending, and refining the lexical database. Third, this contributes to quality assurance of the transcripts since the standard approach including completely independent double-transcription would be prohibitively expensive. The database avoids problems inherent in applying the standard text tagging approach to sign languages, such as glossing consistency (cf. Johnston 2008).

However, the disadvantage inherent in both our approach and the more standard one is that the phonetic form of the utterance is not adequately represented in the transcript. While immediate access to digital video is available, the one-step procedure going from digital video directly to tags (combining tokenisation and lemmatisation) leaves out the step of describing phonetic properties. A transcriber is supposed to notate any deviation in form from the type’s standard inflected form, but it is easy to forget to fill this in, or to update it later when the type’s standard form is changed. To identify problems in this area, we use avatar technology: the avatar is fed the phonetic information combined from the type and token data (encoded in HamNoSys, an IPA-like notation system for sign languages), and the transcriber can then compare the avatar performance with the original video data.



Syntactic annotation of transcriptions in the Czech Academic Corpus: Then and now


Barbora Hladká and Zdeňka Urešová


Rich corpus annotation is performed manually (at least initially) using predefined annotation guidelines, available now for many languages in their written form. No speech-specific syntactic annotation guidelines exist, however. For English, the Fisher Corpus (Cieri et al., 2004) and the Switchboard part of the Penn Treebank [1] had been annotated using written-text guidelines. For Czech, the Czech Academic Corpus [2] (CAC) created by a team of linguists at the Institute of Czech Language in Prague in 1971-1985, also did not use any such specific guidelines, despite the fact that a large part (~220,000 tokens) of the CAC contains transcriptions of spoken Czech and their morphological and syntactic annotation. In our contribution, we discuss syntactic annotation of these spoken-language transcriptions by comparing their original annotation style and the modern syntactic representation as used in the Prague Dependency Treebank (PDT); it was quite natural to try to merge (convert) CAC into PDT using the latter treebank’s format (Vidová Hladká et al. 2008).

      While most of the conversion has been relatively straightforward (albeit labor-intensive), the spoken part posed a more serious problem. Many phenomena typical for spoken language had not appeared previously in written texts, such as unfinished sentences (fragments), ellipsis, false beginnings (restarts), repetition of words in the middle of sentence (fillers), redundant and ungrammatically used words – none of these are covered in our text annotation guidelines. Since the PDT annotation principles do not allow word deletion, regrouping or word order changes, one would have to introduce arbitrary, linguistically irrelevant spoken-language-specific rules with doubtful use even if applied consistently to the corpus. The spoken part of the CAC thus was not converted (at the syntactic layer).

      However, we plan to complete its annotation (i.e. that of the spoken-language transcriptions) using the results of the “speech reconstruction” project (Mikulová et al., 2008). Speech reconstruction will enable the use of the text-based guidelines for syntactic annotation of spoken material by introducing a separate layer of annotation, which allows for “editing” the original transcript into a grammatical text (for the original ideas see Fitzgerald and Jelinek, 2008).


Acknowledgements: This work is supported by the Charles University Grant Agency under the grant GAUK 52408/2008 and by the Czech Ministry of Education project MSM0021620838.



Cieri, C., D. Miller and K. Walker (2004). “The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text”. In Proceedings of the 4th LREC, Lisbon, Portugal, 69-71.

Fitzgerald, E. and F. Jelinek (2008). “Linguistic resources for reconstructing spontaneous speech text”. In Proceedings of the 6th LREC, Marrakesh, Morocco, ELRA.

Mikulová, M. (2008). “Rekonstrukce standardizovaného textu z mluvené řeči v Pražském závislostním korpusu mluvené češtiny”. Manuál pro anotátory. Tech. report no. 2008/TR-2008-38, ÚFAL MFF UK, Prague, Czech Republic.

Vidová Hladká, B., J. Hajič, J. Hana, J. Hlaváčová, J. Mírovský and J. Raab (2008). “The Czech Academic Corpus 2.0 Guide”. Prague Bulletin of Mathematical Linguistics, 89, 41-96.





A top-down approach to discourse-level annotation


Lydia-Mai Ho-Dac, Cécile Fabre, Marie-Paule Péry-Woodley and Josette Rebeyrolle


The ANNODIS project aims at developing a diversified French corpus annotated with discourse information. Its design innovates in combining bottom-up and top-down approaches to discourse. In the bottom-up perspective, basic constituents are identified and linked via discourse relations. In a complementary manner, the top-down approach starts from the text as a whole and focuses on the identification of configurations of cues signalling higher-level text segments, in an attempt to address the interplay of continuity and discontinuity within discourse. We describe the annotation scheme used in this top-down approach in terms of three defining choices: the type of discourse object to be annotated, the type of texts to be included in the corpus, and the environment to assist the annotation task.

            The discourse object chosen to “bootstrap” our top-down approach is the list, or rather the enumerative structure, envisaged as a meta-structure covering a wide range of discourse organization phenomena. Enumerative structures are perceived and interpreted through interacting high- and low-level cues (e.g. visual cues such as bullets vs. lexical choices). They epitomise the linearity constraint, setting out a great diversity of discourse elements in the linear format of written text: a canonical enumerative structure comprises a trigger (introductory segment), a list of co-items and a closure. Enumerating can be seen as a basic strategy in written text through which different facets of discourse organization may be approached. Of particular interest to our annotation project is the fact that enumerative structures occur at all levels of granularity (from within the sentence to across text sections), and can be nested. We also see them as a particularly interesting object for the study of the interaction between the ideational and textual metafunctions (Halliday 1985).

            Two main corpus-selection requirements follow from our decision to view discourse organization from a top-down perspective in order to identify high-level structures. First, text genres must be carefully selected so as to include long expository texts, such as scientific papers or essays. Second, the corpus must be composed of texts in which crucial elements of discourse organization such as subdivisions and layout are available (Power et al. 2003).

            The annotation procedure calls upon pre-processing techniques. It consists of three successive steps: tagging, marking and annotation. The texts are tagged for POS and syntactic dependencies. This information is used in the marking phase, whereby cues associated with the signalling of enumerative structures are automatically located. These cues may signal components of the structure (list markers, encapsulations…) or various elements that help reveal the text’s organization (section headings, frame introducers, parallelisms…). In the annotation stage, the colour-coded highlighting of cues guides the annotator’s scanning of the text. The interface allows the annotator to approximate the top-down approach by zooming in on parts of the texts that are dense in enumerative cues. This combination of a marking procedure and specific interface facilities is a key element of our method, making it possible for the annotator to navigate the text and identify relevant spans at different granularity levels.



Halliday, M. A. K. (1985). An Introduction to Functional Grammar. London: Edward Arnold.

Power, R., D. Scott and N. Bouayad-Agha (2003). “Document Structure”. Computational Linguistics, 29(2), 211-260.


Nominalization in scientific discourse: A corpus-based study of abstracts and research articles


Mônica Holtz


This paper reports on a corpus-based comparative analysis of research articles and abstracts. Systemic Functional Linguistics (Halliday 2004a; Halliday and Martin 1993) and register analysis (Biber 1988, 1995, 1998) are the theoretical backgrounds of this research.

Abstracts are considered to be one of the most important parts of research articles. Abstracts represent the main thoughts of research articles and are almost a surrogate for them. Although abstracts have been quite intensively studied, most of the existing studies are concerned only with abstracts themselves, not comparing them to the full research articles (e.g. Hyland 2007; Swales 2004, 1990, 1987; Ventola 1997).

            The most distinctive feature of abstracts is their information density. Investigations of how this information density is linguistically construed are highly relevant, not just in the context of scientific discourse but also for texts more generally. It is commonly known that complexity in scientific language is achieved mainly through specific terminology and through nominalization, which is part of grammatical metaphor (cf. Halliday and Martin 1993; Halliday 2004b). The ultimate goal of this research is to investigate how abstracts and research articles deploy these linguistic means in different ways.

            Nominalization ‘is the single most powerful resource for creating grammatical metaphor’ (Halliday 2004a: 656). Through nominalization, processes (linguistically realized as verbs) and properties (linguistically realized, in general, as adjectives) are re-construed metaphorically as nouns, enabling an informationally dense discourse.

            This work focuses on the quantitative analysis of instances of nominalization, i.e., nouns ending in suffixes such as -age (cover - coverage), -al (refuse - refusal), -(e)ry (deliver - delivery), -sion/-tion (convert - conversion / adapt - adaptation), -ing (draw - drawing), -ity (intense - intensity), -ment (judge - judgment), -ness (empty - emptiness), -sis (analyze - analysis), -ure (depart - departure), and -th (grow - growth), in a corpus of research articles.
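            The suffix-based count described above might be sketched roughly as follows. This is only an illustrative sketch, not the procedure actually used in the study: the function names, the (word, POS) input format, the assumption that noun tags begin with "N", and the minimum-stem-length filter are all hypothetical.

```python
from collections import Counter

# Suffixes taken from the abstract; "-(e)ry" is expanded to "ery" and "ry".
SUFFIXES = ["age", "al", "ery", "ry", "sion", "tion", "ing", "ity",
            "ment", "ness", "sis", "ure", "th"]


def match_suffix(noun):
    """Return the longest listed suffix that the noun ends in, or None."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Assumed filter: require at least three characters before the
        # suffix, so that short words such as "thing" or "path" are skipped.
        if noun.endswith(suffix) and len(noun) > len(suffix) + 2:
            return suffix
    return None


def count_nominalization_candidates(tagged_tokens):
    """Count noun tokens per suffix in a list of (word, pos_tag) pairs.

    Assumes a tagset in which noun tags start with "N" (hypothetical).
    """
    counts = Counter()
    for word, pos in tagged_tokens:
        if not pos.startswith("N"):
            continue
        suffix = match_suffix(word.lower())
        if suffix is not None:
            counts[suffix] += 1
    return counts


sample = [("coverage", "NN"), ("refusal", "NN"), ("grow", "VB"),
          ("growth", "NN"), ("thing", "NN"), ("adaptation", "NN")]
print(count_nominalization_candidates(sample))
```

            A purely orthographic match of this kind over-generates (for example, "month" ends in -th but is not a nominalization), so the resulting candidates would still need manual or lexicon-based filtering.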

            The corpus under study consists of 94 full research articles from several scientific journals in English of the disciplines of computer science, linguistics, biology, and mechanical engineering, comprising over 440,000 words. The corpus was compiled, pre-processed, and automatically annotated for parts-of-speech and lemmata. Emphasis will be given to the analysis and discussion of the use of nominalization in abstracts and research articles, across corpora and domains.



Halliday, M. A. K. (2004a). An Introduction to Functional Grammar (3rd ed.). London: Arnold.

Halliday, M. A. K. (2004b). “The Language of Science”. In Collected Works of M. A. K. Halliday (vol. 5). London: Continuum.

Halliday, M. A. K. and J. Martin (1993). Writing Science: Literacy and Discursive Power. London: University of Pittsburgh Press.

Hyland, K. (2007). Disciplinary Discourses. Ann Arbor: University of Michigan Press.

Swales, J. M. (1987). “Aspects of article introductions”. Aston-ESP-research-reports No. 1, 5th Impression, The University of Aston, Birmingham, England.

Swales, J. M. (1990). Genre Analysis. Cambridge: CUP.

Swales, J. M. (2004). Research Genres: Explorations and Applications. Cambridge: CUP.

Ventola, E. (1997). “Abstracts as an object of linguistic study”. In F. Danes, E. Havlova and S. Cmejrkova (eds) Writing vs. Speaking: Language, Text, Discourse, Communication. Proceedings of the Conference held at the Czech Language, 333–352.

Construction or obstruction: A corpus-based investigation of diglossia in Singapore classrooms


Huaqing Hong and Paul Doyle


This paper takes a corpus linguistics approach to investigating the pattern of language use, diglossia in particular, in teacher-student interactions in Singapore primary and secondary schools, in relation to the effectiveness of students’ learning in such a situation. Diglossia is an interesting sociolinguistic phenomenon in a bilingual or multilingual setting where two languages or language varieties (H-variety and L-variety) occur side by side in the community, and each has a clear range of functions. In the Singapore English-speaking community, use of the H-variety, Singapore Standard English (SSE), is encouraged in classrooms, while the L-variety, Singapore Colloquial English (SCE), is supposed to be eliminated from classrooms. This policy has been criticized by many scholars on the grounds that it is almost impossible to eliminate SCE from local education settings (Pakir 1991a & 1991b; Gupta 1994; Deterding 1998; to name just a few). We do not attempt to enter this debate; rather, we look at the issue from a corpus linguistics perspective.

            This study makes use of SCE-annotated data from the Singapore Corpus of Research in Education (SCoRE) (Hong 2005), a corpus of 544 lessons totaling about 600 hours of recordings of real classroom interactions in Singapore schools. Having identified the pervasive use of SCE features in classroom interactions and their distribution patterns across four disciplinary subjects, we look into how the diglossic use of English in classroom interaction affects the learning process in class. Then, questions such as to what extent diglossia can hinder or facilitate learner contribution in classroom communication, and in what way teachers and students can take advantage of this to improve their meaning negotiation, will be addressed with corpus evidence and statistical justification. Finally, we present the implications of this study and recommend a practical and useful approach to classroom practices. The conclusion, that increasing teachers’ and students’ awareness of their use of language in class is at least as important as their ability to select appropriate methodologies, has implications for both teacher education and pedagogical practices.



Deterding, D. (1998). “Approaches to diglossia in the classroom: the middle way”. REACT, 2, 18-23.

Gupta, A. F. (1994). The Step-tongue: children's English in Singapore. Clevedon: Multilingual Matters.

Hong, H. (2005). “SCoRE: A multimodal corpus database of education discourse in Singapore schools”. Proceedings of the Corpus Linguistics Conference Series, 1 (1), 2005. University of Birmingham, UK.

Pakir, A. (1991a). “The range and depth of English-knowing bilinguals in Singapore”. World Englishes, 10 (2), 167-179.

Pakir, A. (1991b). “The status of English and the question of ‘standard’ in Singapore: a sociolinguistic perspective”. In M. L. Tickoo (ed.) Language and Standards: Issues, Attitudes, Case Studies. Singapore: SEAMEO.


A GWAPs approach to collaborative annotation of learner corpora


Huaqing Hong and Yukio Tono


It is generally agreed that a richly annotated corpus can be useful for a number of research questions which otherwise may not be answered. Annotation of learner corpora presents some theoretical and practical challenges to researchers in corpus linguistics and applied linguistics (Granger 2003, Nesselhauf 2004, Pravec 2002, Díaz-Negrillo & García-Cumbreras 2007, Tono & Hong 2008, to name just a few). With regard to the efficiency and reliability of learner corpus annotation, we present ongoing work that applies the innovative GWAPs approach to the online collaborative annotation of two learner corpora. The GWAPs (Games With A Purpose) approach was originally designed in the ESP Game for image annotation (von Ahn & Dabbish 2004), was further developed, and has recently been applied successfully to linguistic annotation (e.g. Phrase Detectives, see Chamberlain, Poesio & Kruschwitz 2008 for more). This approach has proved more efficient in terms of time and cost and is able to yield more reliable output with a higher inter-annotator agreement rate in corpus annotation. Inspired by this, we adapted the GWAPs approach to our purposes and data in order to apply it to the online collaborative annotation of learner corpora. This paper starts with a comparative analysis of various approaches to learner corpus annotation and some common issues. Then, the design of the GWAPs-based corpus annotation architecture and its implementation are introduced, followed by a demonstration. In so doing, we hope that such an online collaborative approach can fulfil various requirements of annotation in learner corpora, will eventually be applied to related areas, and will benefit researchers from a wider community.



Chamberlain, J., Poesio, M., and Kruschwitz, U. (2008). “Phrase detectives: a web-based collaborative annotation game”. Paper presented at International Conference on Semantic Systems (I-SEMANTICS'08). September 3-5, 2008. Messecongress Graz, Austria.

Díaz-Negrillo, A., and Fernández-Domínguez, J. (2007). “A tagging tool for error analysis on learner corpora”. ICAME Journal, 31, 197-203.

Granger, S. (2003). “Error-tagged learner corpora and CALL: a promising synergy”. CALICO Journal, 20 (3), 465-480.

Nesselhauf, N. (2004). “Learner corpora and their potential for language teaching”. In J. Sinclair (ed.), How to Use Corpora in Language Teaching. Amsterdam: John Benjamins, 125-152.

Pravec, N. A. (2002). “Survey of learner corpora”. ICAME Journal, 26, 81-114.

von Ahn, L., and Dabbish, L. (2004). “Labeling images with a computer game”. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (ISBN 1-58113-702-8). Vienna, Austria, 319-326.



Phraseology and evaluative meaning in corpus and in text: The study of intensity and saturation


Susan Hunston


This paper develops the argument that the investigation of language features associated with saturation and intensity in Appraisal Theory (Martin and White 2005) can be enhanced by attention to corpus-based methodologies in accord with language theories such as those of Sinclair (1991; 2004).

            There have been numerous responses from corpus linguistics to the challenge of attitudinal language. For example, markers of stance have been identified and quantified in texts that differ in genre or field (Biber 2006; Hyland 2000; Charles 2006). When it comes to what Martin and White call Appraisal, it is less clear how best corpus approaches can assist discourse analysis. This paper explores one aspect of appraisal theory – that relating to the systems of saturation and intensity. Saturation is the presence of relevant items in different parts of the clause (e.g. I think he may be in the library, possibly) whereas intensity refers to a cline of meaning (e.g. good versus excellent). Alongside the inscribed / evoked distinction these systems explain how one text can be ‘more evaluative’ than another.

            Corpus research that focuses on the patterning and prosody of individual words and phrases (e.g. Sinclair 2004; Hoey 2004) often reveals that given words / phrases tend to co-occur with evaluative meaning even though they do not very obviously themselves indicate affect. This has been extensively researched in the contexts of particular phraseologies, but the effect of the use of such items on the texts in which they appear is less frequently argued. It might be said that there is an open-ended class of phraseologies that predict attitudinal meaning, such that there is a considerable amount of redundancy in the expression of that attitude. These phraseologies increase both the saturation and the intensity of appraisal in a text.

            The paper will demonstrate this in two ways: through examination of a number of ‘attitude predicting’ phraseologies (e.g. bordering on and to the point of) and through the comparative analysis of sample texts. It represents an example of the interface between corpus and discourse studies.



Biber, D. (2006). University Language: a corpus-based study of spoken and written registers. Amsterdam: Benjamins.

Charles, M. (2006). “The construction of stance in reporting clauses: a cross-disciplinary study of theses”. Applied Linguistics, 27, 492-518.

Hoey, M. (2004). Lexical Priming: A new theory of words and language. London: Routledge.

Hyland, K. (2000). Disciplinary discourses: Social interactions in academic writing. Harlow: Longman.

Martin, J. and P. White (2005). The Language of Evaluation: Appraisal in English. London: Palgrave.

Sinclair, J. (1991). Corpus Concordance Collocation. Oxford: Oxford University Press.

Sinclair, J. (2004). Trust the Text: language, corpus, discourse. London: Routledge.


A corpus-based approach to the modalities in Blood & Chocolate


Wesam Ibrahim and Mazura Muhammad


This paper will use a corpus-based approach to investigate the modal worlds in the text world of Annette Curtis Klause’s (1997) Blood and Chocolate, a crossover fantasy novel addressing children over 12 years of age. Gavins (2001, 2005 and 2007) has modified the third conceptual layer of Werth’s (1999) Text World Theory, i.e. subworlds. She divides subworlds into two categories, namely world switches and modal worlds; modal worlds will be the focus of this paper. Following Simpson (1993), Gavins classifies modal worlds into Deontic, Boulomaic and Epistemic. She provides a number of linguistic indicators functioning as modal world builders, including modal auxiliaries, adjectival participial constructions, modal lexical verbs, modal adverbs, conditional-if and so on. The prime interest of this paper is to determine the feasibility of using corpus tools to investigate modal worlds. The Blood and Chocolate corpus, comprising 58,368 words, will be compared to the BNC written sample. This paper combines quantitative and qualitative methods. WordSmith 5 (Scott, 1999) will be utilised to ascertain the frequency and collocations of these modal world builders. Additionally, a description of the impact of the prevalence of one modality over the others will be provided. The mode of narration in Blood and Chocolate, according to Simpson (1993: 55, 62-3), belongs to category B in reflector mode (B(R)), since it is a third-person narrative taking place within the confines of a single character’s consciousness. Furthermore, in accordance with Simpson’s (1993: 62-3) subdivisions of category B in reflector mode (B(R)), Blood and Chocolate is a mixture of the subcategories B(R)+ve and B(R)-ve.
The former is characterised by the dominance of deontic and boulomaic modality systems in addition to the use of evaluative adjectives and adverbs, verba sentiendi and generic sentences which possess universal or timeless reference; while the latter is distinguished by the dominance of the epistemic modality. The preliminary findings of the study show that there is a significant clash between deontic and boulomaic modalities since the protagonist, Viviane, is traumatised by a conflict between her duty towards her family of werewolves and her desire to continue her relationship with a human being. However, the epistemic modality is also prevalent since the whole narrative is focalised through Viviane’s perspective.



Gavins, J. (2007). Text World Theory: An Introduction. Edinburgh: Edinburgh University Press.

Gavins, J. (2005). “(Re)thinking modality: A text-world perspective”. Journal of Literary Semantics, 34, 79–93.

Gavins, J. (2001). “Text world theory: A critical exposition and development in relation to absurd prose fiction”. Unpublished PhD thesis, Sheffield Hallam University.

Klause, A. C. (1997). Blood and Chocolate. London: Corgi.

Simpson, P. (1993). Language, Ideology, and Point of View. London: Routledge.

Scott, M. (1996). WordSmith Tools. Oxford: OUP.

Werth, P. (1999). Text Worlds: Representing Conceptual Space in Discourse. London: Longman.


Gender representation in the Harry Potter series: A corpus-based study


Wesam Ibrahim and Amir Hamza


The aim of this study is to probe the question of gender representation in J.K. Rowling’s Harry Potter series; specifically, to investigate the gender-indicative words (function and content) as clues to any potential gender bias on Rowling’s part. We intend to tackle the following question: What are the stereotypical representations associated with male and female characters in Harry Potter and how are they linguistically realised?

            The methodology used in this study combines quantitative and qualitative analysis. On the one hand, we use the corpus tool WordSmith 5 (Scott 1999) to calculate the keywords of each book as compared to two reference corpora, namely, a corpus composed of the seven books (1,182,715 words) and the BNC written sample. The significant keyword collocates of each gender-indicative node will be calculated using the log-likelihood (LL) score, with a p-value threshold of 0.000001. The extraction of those keywords will help us 1) determine the gender focus in each book and track the development of gender representation across the whole series, and 2) see the difference between our data and general British English in the BNC in terms of gender-indicative words.
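            As a rough illustration of the keyness statistic mentioned above, the following sketch computes a log-likelihood score for one word in a study corpus against a reference corpus, using the two-term form common in corpus linguistics, and converts it to a p-value for a chi-square distribution with one degree of freedom. The function names are hypothetical, and the exact computation performed by WordSmith may differ from this sketch.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of a word: study corpus vs. reference corpus."""
    total_size = size_study + size_ref
    total_freq = freq_study + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    score = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


def p_value(ll_score):
    """Upper-tail probability of a chi-square (1 d.f.) value."""
    return math.erfc(math.sqrt(ll_score / 2))


def is_key(freq_study, size_study, freq_ref, size_ref, alpha=0.000001):
    """True if the frequency difference is significant at the given alpha."""
    score = log_likelihood(freq_study, size_study, freq_ref, size_ref)
    return p_value(score) < alpha
```

            The score itself does not indicate direction: whether a word is over- or under-represented in the study corpus has to be read off by comparing the two relative frequencies.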

            The analysis will proceed as follows. To start with, the collocates will be linguistically investigated within the parameters of modality (Simpson 1993). The modality system should prove an essential clue to the negative or positive prosodies associated with each of these characters: first, the use of deontic and/or boulomaic modality markers in the collocational environments of the names of certain characters might textually foreground a positive modal shading of duty, responsibility, desire, hope, etc.; second, by contrast, the use of epistemic and/or perception modality markers in some other collocational environments might textually foreground a negative modal shading of uncertainty or estrangement. So, overall, the modality system may linguistically reveal a certain positive or negative prosody in the collocates that characteristically occur with the names of male and female characters in the text; this could be an avenue to how gender is narratively framed by Rowling within the fictional world of Harry Potter from an authorial point of view. On the other hand, a CDA perspective (Fairclough 1995, 2001; Van Dijk 1998) will be employed, with a special focus on 1) interpreting the gender representation in terms of the schemata, frames and scripts of the discourse participants (in this case the characters in fiction), and 2) explaining the biased position of the text producer by problematising her authorial stance in relation to her readership.



Fairclough, N. (2001). Language and Power (2nd ed.). New York: London.

Fairclough, N. (1995). Critical Discourse Analysis: The Critical Study of Language. London and New York: Longman.

Scott, M. (1996). WordSmith Tools. Oxford: OUP.

Van Dijk, T. A. (1998). Ideology: A Multidisciplinary Approach. London: Sage.

Simpson, P. (1993). Language, Ideology and Point of View. London: Routledge.


A corpus stylistic study of To the Lighthouse


Reiko Ikeo


As a stream-of-consciousness novel, Virginia Woolf’s To the Lighthouse has been of great interest in literary studies. The main concern of stylistic discussions is that each character’s psyche is presented in free indirect thought and that this is closely related to the shift of viewpoints. However, the novel involves many more features than speech and thought presentation in free indirect forms. Direct forms of speech or presentation of characters’ dialogues are also important in developing the plot and in triggering characters’ inner thoughts. The leading character, Mrs. Ramsay, has her personality and beauty depicted and conveyed to the reader more by other characters’ speech and thought presentation than by descriptions of her own speech and actions. By applying a corpus approach, this study examines how speech and thought presentation in the novel construct the narratives and how characters interact verbally and psychologically. Approximately 17% of the text is direct speech and thought presentation, 44% is indirect speech and thought presentation and 39% is narration. In addition to the categories of the speech and thought presentation model proposed by Semino and Short (2004), the corpus includes information about the identity of the speaker, the person to whom the speaker is referring, and a summary of the content of the presented discourse. By sorting each character’s speech and thought presentation according to the speaker and the referents, relationships between characters, characterisation and chronological changes in each character will be shown.

    Another interest of this corpus approach is the variation of narrative voices. The third-person omniscient narrator who maintains neutrality is hardly present in the novel. Instead, the narrator, being sympathetic to characters in each scene, takes over the character’s point of view. Since the corpus makes a collective examination of narration possible, such shifts of viewpoints can systematically be examined. Whose point of view the narrator is taking is often indicated by the use of metaphors.



Semino, E. and M. Short (2004). Corpus Stylistics: Speech, Writing and Thought Presentation in a Corpus of English Writing. London: Routledge.

Use of modal verbs by Japanese learners of English: A comparison with NS and Chinese learners


Shin’ichiro Ishikawa


1. Introduction

English learner corpus studies have illuminated various features of NNS’s use of L2 vocabulary (Granger, 1998; Granger et al., 2002). Many previous studies in the field of computational linguistics emphasize that high-frequency words are stable in terms of distribution, but analysis of learner corpora often reveals that an NS/NNS gap can be seen even in the use of the most frequent function words. The aim of the current study is to probe the features of the use of modal verbs by Japanese learners of English (JLE). Adopting the methodology of multi-layered contrastive interlanguage analysis (MCIA), we will compare the English writing of JLE with that of NS and that of Chinese learners of English (CLE). Examining essays by NS and European NNS, Ringbom (1988) concludes that NNS overuse auxiliary and modal verbs such as “be,” “do,” “have,” and “can.” Aijmer (2002) suggests that Swedish, French, and German learners tend to overuse “will,” “would,” “might,” and “should,” which she says is caused by “transfer from L1.” Also, Tono (2007) shows that Japanese junior high and high school students have a clear tendency to overuse almost all of the modal verbs.


2. Data and Analysis

We prepared three kinds of corpora: JLE corpus (169,654 tokens), CLE corpus (20,367 tokens), and NS corpus (37,173 tokens). All of these are part of the Corpus of English Essays Written by Asian University Students (CEEAUS), which the author has recently released. Unlike other major learner corpora, CEEAUS rigidly controls writing conditions such as topic, length, time, and dictionary use, which enables us to conduct a robust comparison among different writer groups (Ishikawa, 2008). CEEAUS also holds detailed L2 proficiency data for NNS writers. JLE’s essays are thus classified into four levels: Lower (-495 in TOEIC® test), Middle (500+), Semi-upper (600+), and Upper (700+). The data was analyzed both qualitatively and quantitatively. For the latter, we conducted correspondence analysis, which visually summarizes internal correlation among observed variables.
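Because the three corpora differ considerably in size, raw counts of a modal verb cannot be compared directly across writer groups. The sketch below shows one common way of normalising counts (frequency per 10,000 tokens) and of expressing a learner group's usage relative to NS. It is not the correspondence analysis used in the study; the function names are hypothetical, the corpus sizes are those reported in this abstract, and any raw counts passed in would be the user's own.

```python
# Token counts of the three corpora as reported in the abstract.
CORPUS_SIZES = {"JLE": 169_654, "CLE": 20_367, "NS": 37_173}


def per_10k(raw_count, corpus):
    """Normalised frequency of an item per 10,000 tokens of the corpus."""
    return raw_count / CORPUS_SIZES[corpus] * 10_000


def usage_ratio(learner_count, learner_corpus, ns_count, ns_corpus="NS"):
    """Ratio of learner to NS normalised frequency.

    Values above 1 suggest overuse relative to NS, values below 1 underuse.
    """
    return per_10k(learner_count, learner_corpus) / per_10k(ns_count, ns_corpus)
```

A ratio of this kind is only descriptive; whether a difference is statistically reliable would have to be checked separately.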


3. Conclusion

Contrary to the findings of previous studies, JLE do not necessarily overuse all the modal verbs. For instance, they underuse “could,” “used to,” and “would,” which suggests that they cannot properly control epistemic hedges and the time frame. JLE’s and CLE’s usage patterns of modal verbs are essentially identical, though CLE show a stronger tendency to overuse “can” and “will.” In the case of Asian learners, L1 transfer is limited. Regardless of L2 proficiency, JLE continue to over/underuse many of the modal verbs. This shows the need for explicit teaching of modality in TEFL in Japan. In particular, we should focus on the epistemic use of “would” and “could,” which is closely related to “native-likeness” in writing.



A corpus-based study on gender-specific words


Yuka Ishikawa


The feminist movement, dating back to the 1970s, aims to secure equality between the sexes in language as well as in real life. Its advocates have argued that language shapes society as much as it reflects society. If we eliminate discriminatory elements from the language circulated in the community, we may be able to remove old gender stereotypes from our society.

        The campaign was initiated mainly by feminists in the U.S., and since the mid-1970s, generic use of “man”, including compounds ending in “-man”, has been regarded as discriminatory by many (Maggio, 1987; Pilot, 1999). In 1975, the U.S. Department of Labor revised the Dictionary of Occupational Titles. This was meant to eliminate occupational titles ending with gender-marking suffixes, such as “fireman” or “chairman”, in order to conform to the equal employment legislation which was enacted in the late 1960s. The trend to avoid sexist language first spread to other English-speaking countries, then to non-English-speaking countries around the world. By the beginning of the 1990s, this movement had reached Japan. The Japanese government introduced language reform and revised the Equal Opportunity Act in 1999. Using sexist job titles and expressions in classified advertisements has been legally banned since then. Japanese newspaper reporters, officers, and people in the business world are now expected to avoid using gender-biased expressions in public.

        In this paper, the Kotonoha Corpus, which is compiled by the National Institute for Japanese Language and available to the public via the Internet, is used to examine gender-specific expressions, mainly in the White Paper and Minutes of the Diet. The purpose of the study is to dig deeper into official text that appears to be non-sexist at first glance in order to reveal the hidden images of men and women. The use of traditional job titles ending in a gender-specific suffix has been dramatically reduced since the government started the language reform. However, through a close examination of the use of gender-specific words, including job titles, we will gain a clearer image of men and women in Japanese society.



Maggio, R. (1987). The nonsexist word finder. Phoenix, Arizona: The Oryx Press.

Pilot, M. J. (1999). “Occupational outlook handbook: A review of 50 years of change”. Monthly Labor Review, 8-26. Retrieved November 14, 2002, from mlr/1999/ 05/art2abs.htm.


English-Spanish equivalence unveiled by translation corpora: The case of the Gerund in the ACTRES Parallel Corpus


Marlén Izquierdo


The notion of equivalence is central to translation. There seems to be such a bias towards equivalence that it might call into question the usability of translation corpora for contrastive research. This paper aims to show in what ways translation corpora may be useful and usable for contrastive studies.

      Assuming that translation is a situation of languages in contact might be a suitable starting point to explain how translational equivalence is conceived of and why this is paramount in translation and contrastive linguistics, as well as a linking factor between them. From a functional approach, the notion of equivalence can be observed through translation units, which are made up of an original text (OT) item and its equivalent in the target text (TT). In an attempt at preserving equivalence, translators might resort to system equivalents, that is, resources which the codes involved share as common means of expressing the same meaning. However, not always do translators know what resources they have at their disposal for translating a given item from English into Spanish. In addition, more often than not, translators make use of what they assume they can do instead of what is typically done. In other words, their choice deviates from the outcome expected by the target audience, questioning, thus, the acceptance of the translation and, by default, the underlying degree of equivalence.

      Assessing the degree of equivalence of a translation unit depends on knowledge about the functionality of each of its constituents, as well as their common ground and the various possibilities for matching them with other resources in either language. Yet such an assessment is not always complete, because certain linguistic correspondences may go unnoticed by the contrastive linguist or the translator, who have much to contribute to each other.

      Bearing this in mind, this paper explores the features of the Spanish Gerund as a translational option of English texts. The ultimate goal of this piece of research is to spot unexpected matches which are supposed to be functional equivalents in English-Spanish translation. Assessing to what extent they keep functional equivalence relies on the matches expected in view of their assumed similarity and on the equivalent patterns recently found in previous, corpus-based, contrastive studies. Along with these, this study seeks to unveil other relations, of sense and/or form, between the Spanish translational option and its corresponding original text(s). This leads us to the discussion about the reliability of translation corpora as a source of linguistic correspondence, which cannot be easily contested if equivalence is a condition of translation, and it is precisely translated facts that are under examination, from a corpus-based, functional approach. 



Baker, M. (2001). “Investigating the language of translation: a corpus-based approach”. In J. M. Bravo Gozalo and P. Fernández Nistal (eds), Pathways of Translation Studies.

Granger, S., J. Lerot and S. Petch-Tyson (2003). Corpus-based Approaches to Contrastive Linguistics and Translation Studies. New York: Rodopi, 17-29.

Izquierdo, M., K. Hofland and O. Reigem (2008). “The ACTRES Parallel Corpus: an English-Spanish Translation Corpus”. Corpora, 3, Edinburgh: Edinburgh UP.

Köller, W. (1989). “Equivalence in translation theory”. In A. Chesterman (ed.) Readings in translation theory, Helsinki: Oy Finn Lectura Ab, 99-104.

Mauranen, A. (2002). “Will 'translationese' ruin a CA?”. Languages in Contrast, 2 (2), 161-185.

Mauranen, A. (2004). “Contrasting Languages and Varieties with Translational Corpora”. Languages in Contrast, 5 (1), 73-92.

Rabadán, R. (1991). Equivalencia y Traducción. León: Universidad de León.

From a specialised corpus to classrooms for specific purposes


Reka Jablonkai


A productive way to use corpus-based techniques to analyse the lexis of a special field for pedagogic purposes is to create word lists based on the specialised corpus of the given field. Word lists are generated on the basis of lexical frequency information and are used for making informed decisions about the selection of vocabulary for course and materials design in language teaching in general and in English for Specific Purposes in particular. Examples of earlier studies focusing on the variety of English used in a special field include Mudraya’s (2006) Student Engineering Word List and the Medical Academic Word List created by Wang, Liang and Ge (2008).

            The present paper will report on a similar study that looked into the lexical characteristics of the variety of English used in the official documents of the European Union. The English EU Discourse (EEUD) corpus used for the investigation comprises EU written genres like regulations, decisions and reports of about 1 million running words. Genres and texts were selected on the basis of a needs analysis questionnaire survey among EU professionals.

            Overall analysis of the lexical items shows that the word families of the General Service List cover 76% of the total number of tokens in the EEUD corpus and the word families of the Academic Word List (AWL) account for another 14%. Altogether these lexical items make up 36% of all word types in the corpus.
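
            Coverage figures of this kind can be computed once each word list is available as a set of word forms. The following is a minimal sketch and not the Range program itself; the miniature word lists and the example sentence are invented for illustration, and the function name `coverage` is an assumption.

```python
from collections import Counter

def coverage(tokens, word_list):
    """Return (token coverage, type coverage) of word_list over tokens.

    word_list is a set of lower-cased word forms, i.e. every member
    of each word family is assumed to be listed explicitly.
    """
    tokens = [t.lower() for t in tokens]
    counts = Counter(tokens)
    covered_tokens = sum(n for w, n in counts.items() if w in word_list)
    covered_types = sum(1 for w in counts if w in word_list)
    return covered_tokens / len(tokens), covered_types / len(counts)

# Toy data, invented for illustration only (not the real GSL or AWL).
gsl = {"the", "of", "shall", "be", "by"}
awl = {"implemented", "regulation"}
tokens = "The regulation shall be implemented by the Commission".split()

gsl_tok, gsl_typ = coverage(tokens, gsl)
awl_tok, awl_typ = coverage(tokens, awl)
print(f"GSL: {gsl_tok:.1%} of tokens, {gsl_typ:.1%} of types")
print(f"AWL: {awl_tok:.1%} of tokens, {awl_typ:.1%} of types")
```

            Token coverage and type coverage are reported separately because a small set of very frequent families can cover most running words while accounting for a much smaller share of the distinct word types.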

            Exploring the lexis of English EU documents included the creation and analysis of the EU Word List. The initial word list of the EEUD corpus contained 17 084 different word types. These entries were organized by word families according to Level 7 of Bauer and Nation’s scale (1993) which resulted in 7 207 word families. The word selection criteria established by Coxhead (2000), that is specialised occurrence, range and frequency, were applied to create the final EU Word List. Headwords of the most frequent word families include Europe, Commission, Community, finance, regulate, implement, proceed, treaty, EU and EC. Further analysis showed that about half of the headwords appear in the AWL and around 5% of the headwords are abbreviations.
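
            The three selection criteria can be illustrated with a short sketch. It assumes that the corpus has already been reduced to word-family headwords and divided into sections; the function name, the section names and the threshold values are hypothetical and are not the values used by Coxhead (2000) or in this study.

```python
from collections import Counter

def select_families(sections, general_list, min_freq, min_range):
    """Select word families by specialised occurrence, frequency and range.

    sections: dict mapping a section name to a list of family headwords,
              one per running word.
    general_list: headwords excluded as general vocabulary.
    min_freq: minimum total frequency across the corpus.
    min_range: minimum number of sections in which the family occurs.
    """
    total = Counter()   # overall frequency of each family
    spread = Counter()  # number of sections containing each family
    for headwords in sections.values():
        counts = Counter(headwords)
        total.update(counts)
        spread.update(counts.keys())
    return sorted(
        fam for fam in total
        if fam not in general_list
        and total[fam] >= min_freq
        and spread[fam] >= min_range
    )

# Toy data, invented for illustration only.
sections = {
    "regulations": ["commission", "treaty", "the", "commission"],
    "decisions": ["commission", "the", "subsidy"],
    "reports": ["treaty", "commission", "the"],
}
selected = select_families(sections, {"the"}, min_freq=3, min_range=2)
print(selected)
```

            Range is counted per section rather than per text here; the appropriate unit depends on how the corpus is structured.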

            The EU Word List can serve as a reference for EU English courses and can form the basis for a lexical syllabus and for data-driven corpus-based materials.

            The tools used for the analysis were the Range program and the WordSmith corpus analysis software.



Bauer, L. and P. Nation (1993). “Word families”. International Journal of Lexicography, 6 (4), 253-279.

Coxhead, A. (2000). “A new academic word list”. TESOL Quarterly, 34 (2), 213-238.

Heatley, A., I. S. P. Nation and A. Coxhead (2002). Range program software. Available at:   

Mudraya, O. (2006). “Engineering English: A lexical frequency instructional model”. English for Specific Purposes, 25, 235-256.

Scott, M. (1996). WordSmith tools. Oxford: Oxford University Press. Available at:   

Wang, J., S. Liang, and G. Ge (2008). “Establishment of a Medical English word list”. English for Specific Purposes, 27, 442-458.



Collecting spoken learner data: Challenges and benefits


Joanna Jendryczka-Wierszycka


Although compilation of a spoken corpus always demands more time and human resources than gathering written data (e.g. the BNC project, Hoffmann et al. 2008), it is a worthwhile venture.

            This paper reports on the stages of collection and annotation of a spoken learner corpus of English, namely the Polish component of LINDSEI project (De Cock 1998). It starts with a short introduction into learner corpora, and proceeds to the experience of the Polish LINDSEI project, beginning with the project design, through interviewee recruitment, data recording, transcription and finishing with part-of-speech annotation.

            LINDSEI (Louvain International Database of Spoken Learner Interlanguage) is the first spoken learner English corpus, with 11 L1 backgrounds to date. A wide range of linguistic analyses has already been performed with the use of the data by all of the LINDSEI coordinators, and this year (2009) the corpus is to be made available for scientific use for all interested scholars.

            Apart from the corpus compilation description itself, it is problem areas, present at each stage of corpus collection, that are being reported on. The first group of problems comprises those occurring during the stages of spoken data collection as such, and of team work. Another group of problematic issues is connected with the adaptation of a POS tagger (Garside 1995) from native written language, through native spoken language to non-native spoken language.

            It is worth mentioning that while studies on the Polish component of LINDSEI (to be precise, on vagueness and discourse markers) have been published (Jendryczka-Wierszycka 2008a, 2008b), the corpus compilation stages and design application have never been reported on.

            The corpus has already been used for the investigation of spoken learner language, and it is hoped that, especially in its enriched, annotated version, it will become a resource not only for linguists, but also for language teachers and translators, particularly for those interested in the L1 background of the speakers.



De Cock, S. (1998). “Corpora of Learner Speech and Writing and ELT”. In A. Usoniene (ed.) Proceedings from the International Conference on Germanic and Baltic Linguistic Studies and Translation. University of Vilnius. Vilnius: Homo Liber, 56-66.

Garside, R. (1995). “Grammatical tagging of the spoken part of the British National Corpus: a progress report”. In Leech, Myers, Thomas (eds) Spoken English on computer. Harlow: Longman, 161-167.

Hoffmann, S., S. Evert, N. Smith, D. Lee, and Y. Berglund Prytz (2008). Corpus Linguistics with BNCweb - a Practical Guide. Frankfurt am Main: Peter Lang.

Jendryczka-Wierszycka, J. (2008a). “Vagueness in Polglish Speech”. In Lewandowska-Tomaszczyk (ed.) PALC'07: Practical Applications in Language and Computers. Papers from the International Conference at the University of Łódź, 2007. Frankfurt am Main: Peter Lang Verlag, 599-615.

Jendryczka-Wierszycka, J. (2008b). “'Well I don't know what else can I say' Discourse markers in English learner speech”. In Frankenberg-Garcia, A. et al. (eds) 8th Teaching and Language Corpora Conference. Lisboa: Associação de Estudos de Investigação Científica do ISLA-Lisboa, 158-166.


Self editing using online concordance


Stephen Jennings


Self-editing of learner writing may be thought of as one of the goals of advanced writers in a foreign language. However, previous studies have shown that learners expect correction of writing assignments by teachers. While there is a question about how much learners need grammatical correction, some scholarly papers point out student requests for it, while others point out the need for attention to structural coherence. Whether it is correction for grammar or for coherence, it can be seen that both learner and teacher need to be involved in the process of editing drafts of written work.

       This paper analyses data from learner writing in the context of Japan. The analysis is carried out by learners, who compare samples of their own writing with the output of an online concordance program (The Collins Cobuild Corpus Concordance Sampler). Learners in this study are at the higher end of the spectrum of interlanguage achievement; as a result, their pragmatic or other errors are deemed not to interfere irrevocably with intelligibility, but are sufficient to warrant a follow-up on coherence or on accepted grammatical norms.

       The teaching involved in the analysis of concordances is student-centred and tailored to each individual learner, i.e. the learner himself analyses his own use of English. Rather than being corrected at the earliest stage by the teacher, the error is pointed out obliquely. The learner then uses this clue to search for a more commonly used English phrasing.

       Results indicate that learners perceive that this approach has helped them learn patterns of English more thoroughly than before and that this use of online concordance is useful and encourages them to think more carefully as they write.

   The thesis of this paper is that advanced learners will have the chance to:

·         Fix their own errors

·         Respond in an appropriate way to errors

·         Become more independent in their learning

·         Become aware of language in such a way as to become careful of error making







The SYN concept: Towards a one-billion corpus of Czech


Michal Kren


One of the aims of the Czech National Corpus (CNC) is continuous mapping of contemporary written language. This effort results mainly in the compilation and maintenance of, and provision of access to, a SYN-series of corpora (the prefix indicates their synchronic nature and is followed by the year of publication): SYN2000, SYN2005 and SYN2006PUB. The first two are balanced 100-million corpora selected from a large variety of written text types that cover two consecutive time periods, while the last is a complementary 300-million newspaper corpus aimed at users who require large data. The corpora are lemmatised and morphologically tagged using a combination of stochastic and rule-based methods.

      The paper has two main aims: first, to introduce a new 700-million newspaper corpus SYN2009PUB in the context of previously published synchronic corpora. SYN2009PUB is currently being prepared and it will comprise about 70 titles including also a number of non-specialised magazines and regional newspapers. Since all the SYN-series corpora are disjoint, their total size will reach 1 200 million tokens.

      Second, the paper concentrates on national corpora architecture and corpus updating policy in the CNC. Corpus building is easier as more and more texts become available in electronic form. Specifically, it is the enormous growth of the web that undoubtedly brings many new possibilities for using on-line data or creating new corpora almost instantly. However, there are also a number of limitations of the web-as-corpus approach, mainly because pure quantity is often not what a user is looking for. The traditional corpora with their specialised query engines, detailed and reliable metadata annotation and other markup are thus still in great demand.

      The CNC policy is conservative in this respect, so that quality of the corpus data is not compromised despite their growing amount. Moreover, CNC guarantees that all corpora are invariable entities once published, which ensures that identical queries will always give identical results. However, this static approach obviously lacks updating mechanisms concerning also corrections of metadata annotation, lemmatisation, part-of-speech tagging etc. Similarly, conservation of older versions of the corpora can lead to misinterpretations of the data because of various processing differences that may be significant.

      This is why corpus SYN was introduced, a regularly updated unification of all SYN-series corpora into one super-corpus where the original corpora are consistently re-processed with state-of-the-art versions of available tools (tokenisation, morphological analysis, disambiguation etc.). It is also easily possible to create subcorpora of SYN, so that their composition exactly corresponds to the original corpora. Furthermore, a unifying concept similar to SYN will be applied to spoken and diachronic corpora in the near future. Corpus SYN can thus also be seen as a traditional response to the web-as-corpus approach, showing that good old static corpora can keep their virtues while being made large and fairly dynamic at the same time.


Extracting semantic themes and lexical patterns from a text corpus


Margareta Kastberg Sjöblom


This article focuses on the analysis of textual data and the extraction of lexical semantics. The techniques provided by different lexical statistics tools, such as Hyperbase, Lexico3, Alceste and Weblex (Brunet, Salem, Reinert, Heiden), today open the door to many avenues of research in the field of corpus linguistics, including reconstructing the major semantic themes of a textual corpus in a systematic way, thanks to computer-assisted semantic extraction. The object used as a testing ground is a corpus made up of the literary work of one of France's most famous contemporary writers: Jean-Marie Le Clézio, Nobel Prize winner 2008. The literary production of Le Clézio is vast, spanning more than forty years of writing and including several genres. The corpus consists of over 2 million tokens (51,000 lemmas) obtained from 31 novels and texts by the author.

      Beyond the literary and genre dimensions, the corpus is used here as a field of experimentation and as a means of applying a different methodology. A key issue in this article is to focus on various forms of approximation of lexical items in a text corpus: what are the various constellations and semantic patterns of a text? We try here to take advantage of the latest innovations in the analysis of textual data to apply different French corpus theories such as “isotopies” (Rastier 1987-1996) and “isotropies” (Viprey 2005).

      The automatic extraction of collocations and the micro-distribution of lexical items encourage the further development of different methods of automatic extraction of semantic poles and collocations: on one side by the extraction of thematic universes, revolving around a pole, on the other side by the extraction of co-occurrences and sequences of lexical items. These methods are also interesting when compared to other approaches: co-occurrences in between the items of a text and co-occurrence networks “mondes lexicaux” (Reinert, Alceste), semantic patterns based on ontologies (Tropes), but also other techniques such as simple or recursive “lexicogrammes” (Weblex, Heiden) or W. Martinez methods for multiple co-occurrence extraction.



Adam, J.-M. and U. Heidemann (2006). Sciences du texte et analyse de discours, Enjeux d’une interdisciplinarité. Genève: Slatkine Erudition.

Heiden, S. (2004). “Interface hypertextuelle à un espace de cooccurrences: implémentation dans Weblex”. 7èmes Journées internationales d'Analyse Statistique des Données Textuelles (JADT 2004), Louvain-la-Neuve.

Kastberg Sjöblom, M. (2006). L’écriture de J.M.G. Le Clézio – des mots aux thèmes. Paris: Honoré Champion.

Lafon, P. (1984). Dépouillements et Statistiques en Lexicométrie. Paris: Slatkine-Champion.

Leblanc, J.M. (2005). “Les vœux des présidents de la cinquième République (1959-2001). Recherches et expérimentations lexicométriques à propos de l’ethos dans un genre discursif rituel”. Thèse de Doctorat en Sciences du Langage, Université de Paris.

Martinez, W. (2003). “Contribution à une méthodologie de l’analyse des cooccurrences lexicales multiples dans les corpus textuels”. Thèse de Doctorat en Sciences du Langage, Université de la Sorbonne nouvelle.

Rastier, F. (1987). Sens et textualité, Paris: Hachette.

Viprey, J.-M. (2005). “Philologie numérique et herméneutique intégrative”. In J.-M. Adam and U. Heidemann (2006). Sciences du texte et analyse de discours, Enjeux d’une interdisciplinarité. Genève: Slatkine Erudition.

Discovering and exploring academic primings with academic learners


Przemysław Kaszubski


A specially designed online pedagogical concordancer has been in use with several groups of EFL academic writers for the past two years. The essential components of the environment are three interfaces: a corpora search interface (with small disciplinary corpora of written EAP texts contrasted with learner-written texts and with general registers), an annotation-supporting search history interface (personal and/or collaborative) and a resources site, where some of the more valuable findings are deployed and are being arranged into a structured, textbook-like environment. Re-use of previously accessed searches is facilitated by the fact that each corpus search and each history search produces a unique URL that can be exploited as a web-link (cf. Gaskell & Cobb 2004). An intricate network of variously connected searches can thus emerge and be utilised by the student and the teacher, who can collaborate on search annotations. Discovery learning assumes new perspectives – it can be informed by pre-existing annotations illustrated with search links or remain more traditionally inductive, yet collaborative and constructionist.

            The research goals of this highly practical venture are twofold – concerning both linguistic description and EAP pedagogy. On the descriptive side, the aim is to try to harness EAP students' noticing potential for the discovery and annotation of units, strings and patterns of use that (learner) academic writers in the relevant disciplines are likely to need. The encouraged framework of reference is Hoey's (2005) lexical priming, which combines phraseological as well as stylistically important textual-lexical patterning. On the pedagogical side, the search history module can be used to oversee and monitor students' work and to gather feedback to help track the problems and obstacles that have been holding data-driven learning back from taking a more central position in everyday language pedagogy.

            The first-year pilot studies brought some conflicting opinions as to the friendliness of the overall environment (especially among the less proficient users/levels). There was, however, much agreement over the value of the collaborative options. There also emerged interesting trends between web-link-driven as opposed to independently undertaken searches. Consequently, decisions and programming efforts were made to enhance the profile of the search history interface to enable new learners to take up more challenging tasks, more swiftly.

            The goal of my presentation is to show selected fruits of the work undertaken by and with students, last year and this year, with respect to the patterning of academic discourse, and to extract some of the linguistic trends behind the observations, such as the priming types identified. On the pedagogical level, I will try to assess whether the students appear ready for the intensive and extensive DDL activity my tool offers, or whether the tool is ready for them, given the needs, expectations and track records observed.



Gaskell, D. and T. Cobb (2004). "Can learners use concordance feedback for writing errors?", System, 32 (3), 301-19.

Hoey, M. (2005). Lexical priming: A new theory of words and language. London: Routledge.


Give your eyes the comfort they deserve: Imperative constructions in English print ads


Elma Kerz


Advertising operates within certain constraints, including the communicative situation, as well as space and time limitations. The whole aim of advertising copywriters is to get us to register their communication, either for the purpose of immediate action or to make us favourably disposed in general terms to the advertised product or service. To this end, copywriters make use of a rather restricted set of constructions, among which we frequently encounter imperatives. As already pointed out by Leech (1966:30), “one of the striking features of grammar of advertising is an extreme frequency of imperative sentences”. Leech (1966) provides a list of verb items which he considers to be especially frequent fillers of the V-slot of imperative constructions in advertising: (i) items which have to do with the acquisition of the product, such as get, buy or ask for, (ii) items which have to do with the consumption or use of the product, such as have, try, use or enjoy, (iii) items which act as appeals for notice, such as look, see, watch, make sure or remember.

            To date, however, there has been no comprehensive corpus-based study on the use and function of imperative constructions (henceforth ICs) in the genre of English print advertisements. By adopting a usage-based constructionist approach (cf. Langacker 1987, 2000, 2008; Croft 2001; Goldberg 2006), the present paper seeks to close this gap by (i) casting light upon the nature of lexical items filling the V-slot of ICs, (ii) considering a general constructional meaning of ICs, (iii) investigating how the use of ICs contributes to the selling effectiveness of an advertised product/service. The corpus used in this study is composed of 200,183 words from a wide range of ‘mainstream’ magazine ads, covering issues from the years 2004 to 2008.



Croft, W. (2001). Radical construction grammar: syntactic theory in typological perspective. Oxford: Oxford University Press.

Goldberg, A. (2006). Constructions at work: the nature of generalization in language. Oxford: Oxford University Press.

Langacker, R. (1987). Foundations of cognitive grammar, vol. 1: theoretical prerequisites. Stanford, CA: Stanford University Press.

Langacker, R. (2000). “A dynamic usage-based model”. In M. Barlow and S. Kemmer (eds), Usage-based models of language. Stanford: CSLI, 1-63.

Langacker, R. (2008). Cognitive grammar: a basic introduction. Oxford: Oxford University Press.

Leech, G. (1966). English in advertising: a linguistic study of advertising in Great Britain. London: Longman.


A corpus-based study of Pashto


Mohammad Abid Khan and Fatima Tuz Zuhra


This research paper presents a corpus-based study of the inflectional morphological system and collocations in the Pashto language. For this purpose, a corpus containing a large amount of Pashto text is used. There exist two other considerable Pashto corpora: one developed for the BBN Byblos Pashto OCR System (Decerbo et al., 2004) and the other developed by Khan and Zuhra (2007). The former corpus contains Pashto data in the form of images and the latter contains a relatively small amount of data that cannot be fruitfully used for language study purposes. A 1.225-million-word Pashto corpus is developed using the corpus development tool Xaira (, 2008).  The corpus contains written Pashto text. The text is taken from various fields such as news, essays, letters, research publications, books, novels, sports and short stories. The text in the corpus is tagged according to Text Encoding Initiative (TEI) guidelines, using Extensible Markup Language (XML). To work out more productive inflectional morphological rules of affixation, hapax legomena were also searched (Aronoff and Fudeman, 2005). The claim of Aronoff and Fudeman (2005) appears to be true in this study, because the authors observed that these words, though not formed from the frequently occurring stems, are constructed by rules that prove productive when implemented afterwards. Based on this corpus-based study, an inflectional morphological analyzer for Pashto is designed, using finite state transducers (FSTs) (Beesley and Karttunen, 2003) as a tool. These FSTs are implemented using the Xerox finite state tools, i.e. lexc and xfst. To develop a lexicon for the Pashto morphological analyzer containing a reasonable number of stems, the Xaira word searching facility is used. The lexicon of the morphological analyzer is stored in Microsoft Access tables. The interface to the morphological analyzer is developed using C# under the .Net framework. This whole development process is presented in detail.
A detailed collocation study of the 30 most frequently occurring words in the corpus is provided in this paper. The statistical formula used for the identification of collocations is the z-score. The z-score is used because it is widely used in Xaira. A higher z-score indicates a greater degree of collocability of an item with the node word (McEnery et al., 2006). The collocations of these words that have a higher z-score are tabulated and presented in this paper; they can be used in decision-making, in part-of-speech (POS) tagging and in word-sense disambiguation.
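
As an illustration of the statistic, the sketch below implements one common formulation of the collocation z-score, in which the expected number of co-occurrences is derived from the collocate's overall frequency and the size of the span around the node. Whether Xaira computes the score in exactly this way is not confirmed here, and all counts in the example are invented; only the corpus size of 1.225 million words is taken from the abstract.

```python
import math

def z_score(f_node, f_coll, f_joint, corpus_size, span):
    """z-score of a collocate occurring within `span` words of the node.

    f_node:      frequency of the node word in the corpus
    f_coll:      frequency of the collocate in the corpus
    f_joint:     observed co-occurrences of collocate and node in the span
    corpus_size: number of tokens in the corpus
    span:        total window size in tokens (e.g. 4 left + 4 right = 8)
    """
    # Probability of meeting the collocate at any position outside the node.
    p = f_coll / (corpus_size - f_node)
    # Expected co-occurrences if the collocate were distributed at random.
    expected = p * f_node * span
    return (f_joint - expected) / math.sqrt(expected * (1 - p))

# Hypothetical counts for one node/collocate pair.
z = z_score(f_node=100, f_coll=50, f_joint=10, corpus_size=1_225_000, span=8)
print(round(z, 2))
```

Because the expected value sits in the denominator, pairs involving rare collocates can receive very high scores from a handful of co-occurrences, so a minimum joint frequency is often applied alongside the score.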



Aronoff, M. and K. Fudeman (2005). What is Morphology. Blackwell Publishing.

Beesley, K. R. and L. Karttunen (2003). Finite State Morphology: CSLI studies in Computational Linguistics.

Decerbo, M. et al. (2004). The BBN Byblos Pashto OCR System. ACM Press.

Khan, M. A. and Zuhra, F. T. (2007). “A General-Purpose Monitor Corpus of Written Pashto”. In Proceedings of the Conference on Corpus Linguistics, Birmingham, 2007.

McEnery, T. et al. (2006). Corpus-Based Language Studies. Routledge.

[Online] Retrieved (2008).


1 corpus + 1 corpus = 1 corpus


Adam Kilgarriff and Pavel Rychlý


If we add one pile of sand to another pile of sand, we do not get two piles of sand.  We get one bigger pile of sand. 

            Corpora (and other collections) are like piles of sand.  A corpus is a collection of texts, and when two or more collections are added together, they make one bigger collection.

            Where the corpora are for the same language, and do not have too radically different designs and text types, there are at least three benefits to treating them as one, and loading the data into a corpus query system as a single object:

1. users only need to make one query to interrogate all the data

2. better analyses from bigger datasets

3. where the corpus tool has functions for comparing and contrasting subcorpora, they can be used to compare and contrast the components.

            This is frequently an issue, particularly for English, where there are many corpora available.  To consider one particular set of corpus users - lexicographers: they work to tight schedules, so want to keep numbers of queries to a minimum; they want as much data as possible (to give plenty of evidence for rare items); and they often need to quickly check a word’s distribution across text types (for example, spoken vs. written, British vs. American.)

            Corpora like the BNC or LDC’s Gigaword contain well-prepared data in large quantities and of known provenance so it is appealing to use them in building a large corpus of general English.  Both have limitations: the BNC is old, and British only; the Gigaword is American and journalism only: so lexicographers and others would like both these components (and others too).

            For academic English, we would like one query to interrogate MICASE, BASE and BAWE. 

            A related issue arises with access restrictions.  Where one corpus is publicly available, but another is available only to, for example, members of a particular organisation, who also want to benefit from the public corpus, then how should it be managed (if a single setup is to support everyone)? As a principle of data management, there should be one master copy of each dataset, to support maintenance and updating.

            In our corpus system, we have recently developed a mechanism for defining a corpus as a number of component corpora, to be described and demonstrated in the talk.


Finally, two cautions:

·         Where two components may contain some of the same texts, care must be taken not to include them twice.

·         Any two corpora typically have different mark-up schemes, whereas a single corpus should have a single scheme.  This applies to both text mark-up (tokenisation, lemmatisation, POS-tagging) and headers.   For text mark-up, we re-process components so they all use the same software.  For header mark-up, mappings are required from the schemes used in each component to a single unified scheme.  The mapping must be designed with care, to reconcile different perspectives and categories in the different sources while keeping the scheme simple for the user and avoiding misrepresenting the data.
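
The two cautions can be illustrated with a short sketch. It is not the mechanism implemented in the corpus system described above: the document structure, the header fields and the component names are hypothetical, and duplicates are detected only when two texts are identical after case and whitespace normalisation.

```python
import hashlib

def fingerprint(text):
    """Hash of lower-cased, whitespace-normalised text, used to spot repeats."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

def unify_header(header, field_map, value_map):
    """Rename fields and recode values into the unified scheme.

    Fields without an entry in field_map are dropped.
    """
    unified = {}
    for field, value in header.items():
        if field in field_map:
            new_field = field_map[field]
            unified[new_field] = value_map.get((new_field, value), value)
    return unified

def merge(components, mappings):
    """Combine component corpora, skipping texts that were already included."""
    seen = set()
    merged = []
    for name, docs in components.items():
        field_map, value_map = mappings[name]
        for doc in docs:
            fp = fingerprint(doc["text"])
            if fp in seen:
                continue  # same text already taken from an earlier component
            seen.add(fp)
            merged.append({
                "component": name,
                "text": doc["text"],
                "header": unify_header(doc["header"], field_map, value_map),
            })
    return merged

# Toy components with different header schemes (invented for illustration).
components = {
    "A": [{"text": "Hello  world.", "header": {"medium": "S"}},
          {"text": "Another text.", "header": {"medium": "W"}}],
    "B": [{"text": "hello world.", "header": {"mode": "spoken"}},
          {"text": "Third text.", "header": {"mode": "written"}}],
}
mappings = {
    "A": ({"medium": "mode"}, {("mode", "S"): "spoken", ("mode", "W"): "written"}),
    "B": ({"mode": "mode"}, {}),
}
merged = merge(components, mappings)
print(len(merged), [d["header"]["mode"] for d in merged])
```

Recording the source component on each text keeps the components available as subcorpora for comparison after merging.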



Simple maths for keywords

Adam Kilgarriff


“This word is twice as common here as there.”  Such observations are central to corpus linguistics.  We very often want to know which words are distinctive of one corpus, or text type, versus another. “Twice as common” means the word’s relative frequency in corpus one (the focus corpus, fc) is twice that in corpus two (the reference corpus, rc).  We count occurrences in each corpus, divide by the number of words in that corpus and divide the one by the other to give a ratio.   If we find ratios for all words and sort by the ratio we have a first pass “keywords” list of the words with the highest ratios.
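The calculation just described can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `simple_ratio_keywords` and the toy token lists are invented for the example, and words that do not occur in the reference corpus are simply skipped because their ratio is undefined.

```python
from collections import Counter


def simple_ratio_keywords(fc_tokens, rc_tokens):
    """Rank words by relative frequency in fc divided by relative frequency in rc."""
    fc_counts, rc_counts = Counter(fc_tokens), Counter(rc_tokens)
    fc_size, rc_size = len(fc_tokens), len(rc_tokens)
    ratios = {}
    for word, fc_count in fc_counts.items():
        rc_count = rc_counts[word]
        if rc_count == 0:
            continue  # ratio is undefined when the word is absent from rc
        ratios[word] = (fc_count / fc_size) / (rc_count / rc_size)
    # Highest ratio first: a first-pass "keywords" list.
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Toy token lists, invented purely for illustration.
fc = "the cat sat on the mat the cat purred".split()
rc = "the dog sat on the log and the dog barked at the cat".split()
keywords = simple_ratio_keywords(fc, rc)
for word, ratio in keywords:
    print(f"{word}\t{ratio:.2f}")
```

With real corpora the token lists would come from a tokeniser, and the skipped words would need the separate treatment discussed in the rest of the abstract.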

            In addition to issues arising from unwanted biases in the corpus contents, and “burstiness” (Gries 2008), there are two problems with the method:

1. You can’t divide by zero, so it is not clear what to do about words which are in fc but not rc.

2. The list will be dominated by words with few hits in rc: a contrast between 10 in fc and 1 in rc (assuming fc  and rc  are the same size) giving a ratio of 10 is not unusual, but 100,000 vs. 10,000 is, even though the ratio is still 10.  Simple ratios give lists of rarer words.

            The last problem has been the launching point for an extensive literature which is shared with collocation statistics, since formally, the problems are similar.  Popular statistics include Mutual Information, Log Likelihood and Fisher’s Exact Test (see Manning and Schütze 1999).  However the mathematical sophistication of these statistics is of no value to us, since all it serves to do is disprove a null hypothesis - that language is random - which is patently untrue, as argued in (anonymised). 

            Perhaps we can meet our needs with simple maths.

            A common solution to the zeros problem is “add one”.  If we add one to all frequencies, including those for words which were present in fc but absent in rc, then we have no zeros and can compute a ratio for all words.   “Add one” is widely used as a solution to problems with zeros in language technology and elsewhere (Manning and Schütze 1999). 

            This suggests a solution to problem 2.  Consider what happens when we add 1, 100, or 1000 to all counts from both corpora.  The results, for the three words obscurish, middling  and common, in two hypothetical corpora, are presented below:






[Table 1: column groups “Add 1”, “Add 100” and “Add 1000”; cell values not preserved.]

Table 1: Frequencies, adjusted frequencies (AdjFs), ratio (rtio) and keyword rank (rk) for three “add-N” settings for rare, medium and common words.


All three words are notably more common in fc than rc so are candidates for the keyword list, but are in different frequency ranges. 

* Adding 1, the order is obscurish, middling, common. 

* Adding 100, it is middling, common, obscurish.

* Adding 1000, it is common, middling, obscurish. 

             Different values for the “add-N” parameter focus on different frequency ranges.
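The effect of the “add-N” parameter can be reproduced with a short Python sketch. The counts below are hypothetical values chosen so that the rankings match the orderings stated above for N = 1, 100 and 1000; they are not recovered from Table 1, and the function name `add_n_ranking` is likewise invented for illustration. The two corpora are assumed to be the same size, so raw counts can be compared directly.

```python
def add_n_ranking(counts, n):
    """Rank words by (fc + n) / (rc + n), highest ratio first.

    n must be positive so that words absent from rc do not cause
    a division by zero.
    """
    ratios = {word: (fc + n) / (rc + n) for word, (fc, rc) in counts.items()}
    return sorted(ratios, key=ratios.get, reverse=True)


# Hypothetical (fc, rc) counts for two equal-sized corpora.
counts = {
    "obscurish": (10, 0),
    "middling": (200, 100),
    "common": (12000, 10000),
}

for n in (1, 100, 1000):
    print(n, add_n_ranking(counts, n))
```

A small N lets the rare word dominate because its ratio is inflated by the near-zero denominator; a large N swamps small counts, so only words with large absolute differences rise to the top.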

            For some purposes a keyword list focusing on commoner words is wanted, for others, we want to focus on rarer words.  Our model lets the user specify the keyword list they want by using N as a ‘slider’.

             The model provides a way of identifying keywords without unwarranted mathematical sophistication, and reflects the fact that there is no one-size-fits-all list, but different lists according to the frequency range the user is interested in.


Gries, S. Th. (2008). “Dispersion and Adjusted Frequencies in Corpora”. International Journal of Corpus Linguistics 13 (4), 403-437.

Manning, C. and H. Schütze (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.

English equivalents of the most frequent Czech prepositions. A contrastive corpus-based study


Ales Klegr and Marketa Mala


Prepositions in cross-linguistic perspective are not a widely studied area despite their frequency in text and interference rate in SLA. Although English and Czech are typologically different (analytic and inflectional respectively), the 25 most frequent words include 8 prepositions in the BNC and 10 prepositions (!) in the CzechNC. The study, based on aligned fiction texts and using the parallel concordancer ParaConc, focuses on the first and the last of the 10 most frequent Czech prepositions, spatiotemporal v/ve and po. It seeks to explore their English equivalents and compare the similarities and idiosyncrasies of the v/ve and po translation equivalence patterns. The analysis works with three kinds of translation: prepositional and two kinds of non-prepositional, i.e. lexical-structural transpositions and zero translation (omitting the preposition and its complement, though otherwise preserving the rest of the sentence). Instances of textual non-correspondence were excluded. For each preposition 600 parallel concordance lines were analysed. The sources are three contemporary Czech novels for v/ve (200 occurrences from each) and four texts (and translators) for po (some contained fewer than 200 occurrences). The hypothesis was that, as synsemantic function words, prepositions are more susceptible to translation shifts than lexical words, although there are apparently uses where neither type of language can do without them.

            The findings for v/ve: the patterns in the three texts are remarkably similar; prepositional equivalents account for 68.2 % on average and non-prepositional equivalents (transpositions and zero) for 31.8 %. There is one dominant equivalent, the preposition in (49.7 %). The range of prepositional equivalents (even considering the polysemy of v/ve) was rather wide, including 21 different prepositions. The findings for po: the distribution of equivalents in the four texts was far more varied than in the v/ve texts; the average representation (62.2 % prepositional equivalents, 37.8 % non-prepositional equivalents) shows a 6-percentage-point difference in favour of non-prepositional translation compared to v/ve. There are two other noticeable differences. First, the most frequent prepositional equivalent, after (18 %), is significantly less prominent than in is among the v/ve equivalents. Second, the variety of prepositional equivalents is enormous: 37 different prepositions (almost twice as many as for v/ve).

            The study correlates the distribution of the equivalents with syntactic analysis, the polysemy of the Czech prepositions and the types of structures they are part of (regular combinations, MWUs, lexical phrases, chunks, idioms, etc.). It seems that po, with a high number of transpositions and less clear-cut central prepositional equivalents, is more often part of higher lexical units than v/ve, heading a number of free-standing adjuncts. In sum, the prepositions exhibit a relatively high incidence of different-word-class translations (e.g., compared to the noun), and a surprisingly wide range of prepositional equivalents (far greater than given in the largest bilingual dictionaries). Although both prepositions display a rich variety of translation solutions, it is possible to discern some clear tendencies and specific sense-equivalent pairings, but also specific and distinct combinatory propensities. The findings provide a useful starting point for both theoretical and practical description.



Bennett, D.C. (1975). Spatial and Temporal Uses of English Prepositions. Longman.

Cermák, F., Kren M. et al. (2004). Frekvencní slovník ceštiny. Prague: NLN.

Hoffmann, S. (2005). Grammaticalization and English Complex Prepositions. A Corpus-Based Study. London and New York: Routledge.

Lindkvist, K.-G. (1978). AT versus ON, IN, BY: On the Early History of Spatial AT and Certain Primary Ideas Distinguishing AT from ON, IN, BY. Stockholm: Almquist & Wiksell.

Lindstromberg, S. (1998). English Prepositions Explained, Amsterdam: John Benjamins.

Saint-Dizier, P. (ed.)(2006). Syntax and Semantics of Prepositions. Dordrecht: Springer.

Sandhagen, H. (1956). Studies of the Temporal Senses of the Prepositions AT, ON, IN, BY, and FOR in Present-day English. Uppsala: Almquist and Wiksells.

Tyler, A., Evans, V. (2003). The Semantics of English Prepositions, Cambridge: CUP.


Collecting and collating heterogeneous datasets for multi-modal corpora


Dawn Knight


This paper examines the future of multi-modal CL research. It discusses the need for constructing corpora which concentrate on language reception rather than production, and starts to investigate the means by which this may be achieved.

            This paper presents preliminary findings from the DReSS II (Understanding New Forms of Digital Records for e-Social Science) project, funded by NCeSS (National Centre for e-Social Science) and based at the University of Nottingham. DReSS II seeks to allow for the collection and collation of a wider range of heterogeneous datasets for linguistic research, with the aim of facilitating the investigation of the interface between multiple modes of communication in everyday life.

            Records of everyday (inter)actions, including SMS messages, MMS messages, interaction in virtual environments (instant messaging, entries on personal notice boards etc), GPS data, face-to-face situated discourse, phone calls and video calls provide the ‘data’ for this research. I will demonstrate (using actual datasets from pilot recordings) how this data can be utilised to enable a more detailed investigation of the interface between these different communicative modes. This is undertaken from an individual’s perspective, by means of tracking and recording a specific person’s (inter)actions over time (i.e. across an hour, day or even week). The analysis of these investigations will help us to question to what extent language choices are determined by different communicative contexts, proposing the need for a redefinition of the notion of ‘context’, one that is relevant to the future of multi-modal CL research. 

            This presentation will provide an overview of the methodological, technical and practical issues and challenges faced when attempting to collect and collate these datasets. In addition, the presentation will discuss the key ethical issues and challenges that will be faced in DReSS II, looking at ethics from three different perspectives: the institutional, the professional and the personal (discussing the key moral and legal obligations imposed by each of these, and their potential impact on data collection and analysis). I will review the notions of consent and anonymisation in digital datasets, and deliberate over what informed consent may mean at every stage of this research, from the collection to re-use and purpose. Finally, I will discuss some of the possible applications of corpora (and research) of this nature.



Deductive versus inductive approaches to metaphor identification


Tina Krennmayr


Work within a cognitive linguistic framework tends to favor deductive approaches to finding metaphor in language (e.g. Koller 2004; Charteris-Black 2004), starting either from complete conceptual metaphors or from particular target domains. However, if the analyst assumes, for instance, the conceptual metaphor FOOTBALL IS WAR, he or she may be misled into identifying linguistic expressions as evidence of such a mapping, without considering that those very same linguistic data could be manifestations of an alternative mapping. “When a word or phrase like ‘defend’, ‘position’, ‘maneuver’, or ‘strategy’ is used, there is no a priori way to determine whether the intended underlying conceptual metaphor is war, an athletic contest, or game of chess.” (Ritchie 2003: 125).

      While it is tempting to think of global mappings consistent with the themes of a text, the actual mapping may not fit the scenario in every instance. Therefore, identifying mappings locally can prevent the analyst from assuming the most (subjectively) obvious mapping from the outset. Such bottom-up approaches do not start out from the presumption of existing conceptual metaphors but decide on underlying conceptual structures for each individual case.

      By using newspaper excerpts from the BNC-baby corpus, I will make explicit how a deductive and an inductive approach lead to two different outcomes when explicating conceptual metaphors and their mappings. For example, a top-down analysis of “the company won the bid” assuming the conceptual metaphor BUSINESS IS WAR may align the company with soldiers or a country and the bid with battle or war. A bottom-up analysis, however, does not restrict itself to a mapping that is coherent with war imagery, but considers alternatives such as the more general structure SUCCEEDING IN A BID IS LIKE WINNING A COMPETITION.

      In order to demonstrate the fundamental difference between bottom-up and top-down approaches, I use the 5-step method (Steen 1999), an inductive approach, which has been designed to bridge the gap between linguistic and conceptual metaphor. This bottom-up approach can be modified in a way to serve as a useful tool for explaining the different thought processes employed in deductive and inductive approaches and their consequences for possible mappings.

      The outcome of this analysis is verified by using the semantic annotation tool Wmatrix (Rayson 2008), which provides a web interface for corpus analysis. It contains the UCREL semantic analysis system (USAS) (Rayson et al. 2004), a framework that automatically annotates each word of a running text semantically. Hardie et al. (2007) have suggested an exploitation of USAS for metaphor analysis, since the semantic fields automatically assigned to words of a text by USAS roughly correspond to metaphorical domains, as suggested by conceptual metaphor theory. Wmatrix can therefore offer yet another alternative perspective to manual deductive and inductive approaches to metaphor analysis.



Charteris-Black, J. (2004). Corpus Approaches to Critical Metaphor Analysis. Houndmills, Basingstoke: Palgrave.

Hardie, A., Koller, V., Rayson, P. and E. Semino (2007). “Exploiting a semantic annotation tool for metaphor analysis”. In M. Davies, P. Rayson, S. Hunston and P. Danielsson (eds) Proceedings of the Corpus Linguistics 2007 conference.

Koller, V. (2004). Metaphor and gender in business media discourse. Palgrave.

Rayson, P., Archer, D., Piao, S. L., McEnery, T. (2004). “The UCREL semantic analysis system”. In Proceedings of the workshop on Beyond Named Entity Recognition Semantic labelling for NLP tasks in association with LREC 2004, pp. 7-12.

Rayson, P. (2008). Wmatrix: a web-based corpus processing environment. Computing Department, Lancaster University.



Metaphors pop songs live by


Rolf Kreyer


Pop-song lyrics are often felt to be highly stereotypical and clichéd. The present paper is an attempt to shed light on the linguistic substance of these stereotypes and clichés by analysing the use of metaphors in pop-song lyrics within the framework of conceptual metaphor theory. The data underlying the present study are drawn from an early version of the Giessen-Bonn Corpus of Popular Music (GBoP), a corpus containing the lyrics of 48 albums from the US Album Charts of 2003, comprising 758 songs and some 350,000 words (see Kreyer and Mukherjee 2007 for details on the sampling of the pilot version of GBoP).

            From this corpus, metaphors are extracted with the help of metaphorical-pattern analysis (Stefanowitsch 2006). A metaphorical pattern is defined as "a multi-word expression from a given source domain (SD) into which a specific lexical item from a given target domain (TD) has been inserted" (66). Such patterns can be identified by analysing all instances of a lexeme (or a group of lexemes) relating to a particular target domain and determining all metaphorical uses of this lexeme in the data. These can then be "grouped into coherent groups representing general mappings" (66). The following examples show the use of the metaphor LOVE IS AN OPPONENT IN A FIGHT.


(1)        Ten rounds in the ring with love

(2)        Knocked out of the ring by love

(3)        K-O, knocked out by technicality / The love has kissed the canvas


With regard to clichédness, we would assume that either pop-song lyrics use the same metaphors again and again, thus showing a lack of variation in the use of metaphors, or that pop songs do not exploit metaphors in a very creative way (or both). However, the analysis of the use of love metaphors in GBoP shows that this is not the case: pop-song lyrics show a fair amount of variation as well as creativity. The paper suggests an alternative explanation along the lines of the Russian formalists' approach to poetic language.

            On the whole, the paper shows how a combination of corpus-based approaches and conceptual metaphor theory can be fruitfully applied to the study of the important genre of pop-song lyrics.



Kreyer, R. and J. Mukherjee (2007). "The style of pop song lyrics: a corpus-linguistic pilot study". Anglia, 125, 31-58.

Stefanowitsch, A. (2006). "Words and their metaphors: A corpus-based approach". In A. Stefanowitsch and S. Th. Gries (eds), Corpus-based Approaches to Metaphor and Metonymy, Berlin and New York: Mouton de Gruyter, 61-105.


Nominal coreference in translations and originals – a corpus-based study


Kerstin Anna Kunz


The present work deals with the empirical investigation of nominal coreference relations in a corpus of English originals, their translations into German and comparable original texts in German.

Coreference between nominal expressions is an essential linguistic strategy to establish coherence and topic continuity in texts. Cohesive chains between nominal expressions on the linguistic level of the text evoke a cognitive relation of reference identity on the level of text processing. The chains may contain more than two nominal expressions and may link these expressions below and above the sentence level. Nominal coreference therefore also is a focal means employed in translations to express the same or similar relations of meaning as in their source texts.

As is commonly known, languages provide a variety of means to create nominal coreference in texts, concerning the form of coreferential expressions, their semantic relation to each other, their syntactic function and position as well as their frequency and textual distance (cf. Halliday & Hasan 1976, Gundel et al. 1993, Grosz et al. 1995). It has been widely accepted in the literature that these variations reflect different conditions of mental processing (cf. Lambrecht 1994, Schwarz 2000, Sanders, Prince 1981, Ariel 1990). However, systemic constraints on different linguistic levels are also assumed to have an impact on the coreferential structure in German and English original texts (Hawkins 1986, Steiner & Teich 2003, Doherty 2004, Fabricius-Hansen 1999). As for translations, the question has to be raised whether these criteria affect the process of translation and whether this manifests itself in the coreferential features of the translation product.

In order to obtain a very detailed picture of the coreferential properties of translations, a small corpus of the register of political essays was compiled consisting of ten American political essays (13300 words), their translation into German (13449 words) and ten comparable original political essays in German (13679 words). The corpus was encoded manually with linguistic information about the nominal coreference expressions contained. First, the marked expressions were assigned to their respective coreference chains. A distinction was drawn within each coreference chain between antecedent and anaphor(s). Second, linguistic information was indicated with each expression as to its form, its syntactic function and position and its semantic relation to other coreference expressions in the same coreference chain. Third, the coreference expressions in the German translations were aligned with their equivalent coreference expressions in the English originals.

The current paper presents some first findings from this fine-grained corpus-linguistic analysis. By comparing the distribution of nominal coreference expressions in the three subcorpora, we seek to interpret the differences traced in the translations against the background of several sources of explanation: language typology, register and the translation process. Apart from that, we show how the different types of properties of translations as outlined by Baker (1992), Teich (2003) and others also manifest themselves in the coreference structure of translations.



Baker, M. (1992). In Other Words. A Coursebook on Translation. Routledge.

Doherty, M. (2004). “Strategy of incremental parsimony”. SPRIKreports, 25.

Fabricius-Hansen, C. (1999). “Information packaging and translation: Aspects of translational sentence splitting”. Studia Grammatica, 47: 175-214.

Grosz, B. J., A. K. Joshi & S. Weinstein (1995). “Centering: A framework for modelling the local coherence of discourse”. Computational Linguistics, 21: 203-225.

Hawkins, J. A. (1986). A comparative typology of English and German. Unifying the contrasts. London & Sydney: Croom Helm.

König, E. & V. Gast (2007). Understanding English-German Contrasts. Grundlagen der Anglistik und Amerikanistik. Berlin: Erich Schmidt Verlag.


Subcategorisation “inheritance”: A corpus-based extraction and classification approach


Ekaterina Lapshinova-Koltunski


In this study, we analyse the subcategorisation of German predicates. Predicates such as verbs, nouns and multiwords are automatically extracted from German corpora and classified according to their subcategorisation properties.

            Our aim is not only to classify the extracted data by subcategorisation but also to compare the properties of different morphologically related predicates, i.e. verbs, deverbal nouns and support verb constructions. We thus analyse the phenomenon of “inheritance” in subcategorisation. One example is deverbal nouns (occurring both alone and within a multiword), most of which share their subcategorisation properties with their underlying verbs. For instance, the verb bedingen (“to condition”) subcategorises only for a dass-clause in our data. So does its nominalisation Bedingung, which can be used both as a simple predicate, e.g. die Bedingung, dass... (“the condition that...”), and within a multiword, e.g. zur Bedingung machen, dass... (“to make it a condition that...”).

            We intend to distinguish the cases where nominal predicates share their subcategorisation properties with their base verbs from those where they have their own properties. An example of the second case is the nominalisation Ankündigung, which can subcategorise only for a dass-clause, whereas its underlying verb ankündigen also takes a wh-clause as a complement.
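The distinction between the two cases can be illustrated with a small Python sketch. It assumes, purely for illustration, that each predicate's subcategorisation properties have already been extracted as a set of frame labels; the function name `classify_pair` and the label strings are hypothetical and do not describe the authors' system. The frame sets are those reported above for the two example pairs.

```python
def classify_pair(verb_frames, noun_frames):
    """Compare the frame sets of a base verb and its nominalisation.

    Returns a label plus the frames found only with the verb and only
    with the nominalisation.
    """
    verb_frames, noun_frames = set(verb_frames), set(noun_frames)
    label = "inheritance" if verb_frames == noun_frames else "non-inheritance"
    return label, verb_frames - noun_frames, noun_frames - verb_frames


# Frame sets taken from the two examples discussed in the text.
frames = {
    "bedingen": {"dass-clause"},
    "Bedingung": {"dass-clause"},
    "ankündigen": {"dass-clause", "wh-clause"},
    "Ankündigung": {"dass-clause"},
}

for verb, noun in [("bedingen", "Bedingung"), ("ankündigen", "Ankündigung")]:
    label, verb_only, noun_only = classify_pair(frames[verb], frames[noun])
    print(verb, noun, label, sorted(verb_only), sorted(noun_only))
```

Exact set equality is the strictest possible criterion; a real lexicon-building workflow might instead tolerate rare frames or weight them by corpus frequency.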

            Our preliminary experiments show that different kinds of predicates have their own subcategorisation and contextual properties. There are both correspondences and differences in the subcategorisation of verbs and deverbal predicates, which should be considered in lexicon building for NLP. In “inheritance” cases, we do not even need to describe the predicate-argument structure of a nominalisation, as we can simply derive it from that of the underlying verb. The differences between the subcategorisation of nominalisations and their base verbs must be taken into account in dictionaries or NLP lexicons. For “non-inheritance” cases some additional information should be included. Our system can identify such cases by extracting them from tokenised, POS-tagged and lemmatised text corpora.

            For our study, we used a corpus of German newspaper texts (220M tokens). Further extractions are planned from a corpus of German newspaper and literary texts from Germany, Austria and Switzerland, a total of ca. 1300M words.

            The reasons for non-correspondences could be contextual or semantic in character. For instance, many nominalisations subcategorising only for a dass-clause are semantically weak. They serve as a container for the content expressed in the subcategorised dass-clause. W-/ob-clauses presuppose an open set of answers, which does not correspond to the semantics of this kind of nominalisation. A deeper semantic analysis of such phenomena is necessary to reveal systematicities of the inheritance process.



A corpus-based study of the phraseological behaviour of abstract nouns in medical English


Natalia Judith Laso


It has long been acknowledged (Carter 1998; Williams 1998; Biber 2006; Hyland 2008) that writing a text entails not only the accurate selection of correct terms and grammatical constructions but also a good command of appropriate lexical combinations and phraseological expressions. This assumption becomes especially apparent in scientific discourse, where a precise expression of ideas and description of results is expected. Several scholars (Gledhill 2000; Flowerdew 2003; Hyland 2008) have pointed to the importance of mastering the prototypical formulaic patterns of scientific discourse so as to produce phraseologically competent scientific texts.

            Research on specific-domain phraseology has demonstrated that acquiring the appropriate phraseological knowledge (i.e. mastering the prototypical lexico-grammatical patterns in which multiword units occur) is particularly difficult for non-native speakers, who must gain control of the conventions of native-like discourse (Howarth 1996/1998; Wray 1999; Oakey 2002; Williams 2005; Granger & Meunier 2008).

            This paper aims to analyse native speakers’ usage of abstract nouns in medical English, which will contribute to the linguistic characterisation of the discourse of medical science. More precisely, this research study intends to explore native speakers’ prototypical lexico-grammatical patterns around abstract nouns. This analysis is based entirely on corpus evidence, since all collocational patterns discussed have been extracted from the Health Science Corpus (HSC), which consists of a 4 million word collection of health science (i.e. medicine, biomedicine, biology and biochemistry) texts, specifically compiled for the current research study. The exploration of the collocational behaviour of abstract nouns in medical English will serve as a benchmark against which to measure non-native speakers’ production.



Biber, D. (2006). University language: A corpus-based study of spoken and written registers. Benjamins.

Carter, R. (1998) (2nd edition). Vocabulary: Applied Linguistic Perspectives. London: Routledge.

Flowerdew, J. (2003). “Signalling nouns in discourse” English for Specific Purposes 22, 329-346.

Gledhill, C. (2000a). “The discourse function of collocation in research article introductions” English for Specific Purposes, 19 (2), 115-135.

Gledhill, C. (2000b). Collocations in science writing, Gunter Narr, Tübingen.

Howarth, P. A. (1998a). “The phraseology of learners’ academic writing” in Cowie, A. P. (Ed.) Phraseology: Theory, analysis and applications. Oxford: Clarendon Press, 161-186.

Howarth, P. A. (1998b). “Phraseology and second language proficiency” Applied Linguistics 19 (1), 24-44.

Oakey, D. (2002a). “A corpus-based study of the formal and functional variation of a lexical phrase in different academic disciplines” in Reppen, R. et al. (eds) Using corpora to Explore Linguistic Variation. Benjamins, 111-129.

Oakey, D. (2002b). “Lexical Phrases for Teaching Academic Writing in English: Corpus Evidence” in Nuccorini, S. Phrases and phraseology – data and descriptions. Bern: Peter Lang, 85-105.

Williams, G. (1998). “Collocational Networks: Interlocking Patterns of Lexis in a Corpus of Plant Biology Research Articles”. International Journal of Corpus Linguistics, 3 (1), 151–171.

Williams, G. (2005). “Challenging the native-speaker norm: a corpus-driven analysis of scientific usage” in Barnbrook, G., Danielsson, P. & Mahlberg, M. (eds) Meaningful Texts. The Extraction of Semantic Information from Monolingual and Multilingual Corpora. London/New York: Continuum, 115-127.


“According to the equation…” : Key words and clusters in Chinese and British students’ undergraduate assignments from UK universities


Maria Leedham


Chinese students are now the largest non-native English group in UK universities (British Council, 2008), yet relatively little is known of this group’s undergraduate-level writing. This paper describes a corpus study of Chinese and British students’ undergraduate assignments from UK universities. A corpus of 267,000 words from first language (L1) Mandarin and Cantonese students is compared with a reference corpus of 1.3 million words of L1 English students’ writing. Both corpora were compiled from the 6.5 million-word British Academic Written English (BAWE) corpus with some additionally collected texts. Each corpus contains successful assignments from the same disciplines and from a similar range of genres (such as essays, empathy writing, laboratory reports and case studies).

            The aims of this study are to explore similarities and differences in the writing of the two student groups, and to track development from year 1 to year 3 of undergraduate study with a view to making pedagogical recommendations. WordSmith Tools was used to extract key words, key key words and key clusters from the Chinese corpus, and compare these with the British students’ writing within different disciplines. Clusters were further explored using WordSmith’s Concgram feature to consider non-contiguous n-grams. Both key words and clusters were assigned to categories based on Halliday’s metafunctions (Halliday and Matthiessen 2004).

            The key word findings showed the influence of the discipline of study in the writing of both student groups, even at first year level and with topic-specific key words excluded. This has implications for the teaching of English for Academic Purposes (EAP) for both native and non-native speakers, as discipline-specific teaching is still the exception rather than the norm in EAP classes.

            The comparison of cluster usage suggests that Chinese students use fewer clusters within the textual and interpersonal categories and many more topic-based clusters. One reason for this may be the students’ strategy of using features other than connected text to display information: for example the greater use of lists and pseudo lists in methodology sections and tables in the results sections of scientific reports. The Chinese students also employed three times as many “listlike” text sections as the British students; these consist of text within paragraphs in a pseudo list format (Heuboeck et al, 2007:29).



British Council 2007. “China Market Introduction”. Retrieved from   on 070409.

Halliday, M.A.K., and C.M.I.M. Matthiessen. (2004).  An Introduction to Functional         Grammar. London: Arnold.

Heuboeck, A., J. Holmes and H. Nesi (2007). The Bawe Corpus Manual. Retrieved from    on 07049.

Scott, M. (2008). Wordsmith Tools version 5. Available from:


Note: The British Academic Written English (BAWE) corpus is a collaboration between the universities of Warwick, Reading and Oxford Brookes. It was collected as part of the project, 'An Investigation of Genres of Assessed Writing in British Higher Education' funded by the ESRC (2004 - 2007  RES-000-23-0800).


The Mandarin (reflexive) pronoun of SELF (zi4ji3) in the Bible: A corpus-based study


Wang-Chen Ling and Siaw-Fong Chung

National Chengchi University


 Coindexation of the reflexive pronoun can be easily recognized if there is only one referent. An example of self in English is seen in (1) (coindexed referents are marked x). 


(1) Heidi(x) bopped herself(x) on the head with a zucchini. (Carnie, 2002: 94)

However, in Mandarin, it is not always easy to determine what zi4ji3 is coindexed with, especially when there is more than one referent (illustrated in (2) below; example modified from Yuan (1997: 226)).

(2) zhang1san3(x) ren4wei2 li3si4(y) zhi1dao4 wang2wu5(z) bu4 xiang1xin4 zi4ji3

              Zhang-San    think    Li-Si   know   Wang-Wu   NEG believe   self

             ‘Zhang-San thinks that Li-Si knows that Wang-Wu does not believe himself.’


Zi4ji3 in (2) can optionally indicate x, y, or z (although the English translation has the interpretation that himself refers to x). With regard to the difficulties mentioned above, this study aims to investigate the patterns of zi4ji3 and its referent(s) by examining 1,511 instances in Mandarin corpus data taken from a parallel corpus of the Bible online. The Holy Bible Chinese Union Version was chosen because it has been widely used among Mandarin speakers since 1919. Five patterns which indicate different coindexation of zi4ji3 are proposed.





Pattern 1: Referent(x) zi4ji3(x)

Pattern 2: Referent(x) … zi4ji3(x)

Pattern 3: Referent A(x) … Referent B(y) … Referent C(z) … zi4ji3(x)

Pattern 4: zi4ji3(x) … Referent(x)

Pattern 5: No referent (the referent(s) may be the author or the readers)







Patterns and distribution of zi4ji3 and its referent.


    Based on the results in the above table, 97.9% of the total instances (patterns 1, 2 and 3) have the referent(s) of zi4ji3 appearing before it. This result demonstrates that word order is a crucial clue for determining the coindexation of Mandarin zi4ji3, a result which conforms to the findings in Huang and Chui (1997), which indicate that the clause-initial NP carries important information. We also found that pragmatic interpretation is needed when processing the meaning of zi4ji3. Future work will investigate the collocations of zi4ji3 so as to better predict the cognitive mechanisms of Mandarin speakers when they use coindexation.



Li, J.L. (2004). “The blocking effect of the long-distance reflexive zi4ji3 in Chinese”. Foreign Language and Literature Studies 1, 6-9.

Huang, S. and Chui, K. (1997). “Is Chinese a pragmatic order language?” In Chiu-Yu Tseng (ed.) Typological Studies of Languages in China 4, 51-79. Taipei: Academia Sinica.

Zhang, J. (2002). “Long-distance nonsyntactic anaphoring of English and Chinese reflexives”. Journal of Anqing Teachers College, 21, 93-96.

Yuan, B.P. (1997). “The comprehension of Mandarin reflexive pronoun by English and Japanese- experiment of second language acquisition”. Proceedings of 5th World Chinese Language Conference: Studies in Linguistics.

Establishing a historiography for corpus-events from their frequency: A celebration of Bertrand Russell's (1948) five postulates


Bill Louw


Over time, concordance lines have become associated, trivially, with the retrieval of linguistic forms for entirely utilitarian purposes rather than with the possibility that each occurrence-line represents an instance of a 'repeatable event' (Russell, 1948; Firth, 1957) within the world or a world. The latter type of search becomes possible if and only if the collocates that make up 'facts' or 'states of affairs' (Wittgenstein, 1921: 7) are co-selected from a corpus rather than recovered as single strings. Once this procedure has been adopted, the fabric of the resulting concordance material 'gestures' both truth (Louw, 2003) and the logical structure of the world (Carnap, 1928). The nature of the events recovered by co-selection of their component collocates is of scientific interest in a number of ways, in terms of: (1) procedures needed to determine the quantum of real time in the world that must be deemed to occupy the authentic, logical and temporal 'space' between one concordance line and the next; (2) additional event-bound collocates that the procedure identifies; (3) the extent to which several states of affairs ever share the same collocates; and (4) whether delexicalisation and relexicalisation are part of the latter process. The methods (Gk. meta+hodos: after+path) used for determining spatial sampling would need to be determined scientifically (Kuhn, 1962). In this regard, Russell's early and later work on logic and perception plays a key role. The early work (written in prison) (1914; 2nd edition 1926) considers the nature of sense-data. However, by 1948, Russell abandons the latter term in favour of the word events. In particular, his five postulates offer the corpus-based investigator scientific and automated insights into determining the boundaries between events and occasions upon which events are the same rather than merely similar. The paper will be extensively illustrated.



Carnap, R. (1928) Der logische Aufbau der Welt. Leipzig: Felix Meiner Verlag.

Firth, J.R. (1957) Papers in Linguistics 1934-1951. Oxford: OUP.

Kuhn, T. (1962) The Structure of Scientific Revolutions. Chicago: University of Chicago Press.

Louw, W.E. (2003) A stochastic-collocational reading of the Truth and Reconciliation Commission. Bologna: CESLIC.

Russell, B. (1926) Our Knowledge of the External World. London: Allen and Unwin.

Russell, B. (1948) Human Knowledge: Its Scope and Limits. London: Allen and Unwin.


Subjunctives in the argumentative writing of learners of German


Ursula Maden-Weinberger


The compilation of learner corpora and the analysis of learner language have proven useful in research on learner errors, second language acquisition and the improvement of language learning and teaching (cf. e.g. Granger/Hung/Petch-Tyson, 2002; Kettemann/Marko, 2002). This study utilises a corpus of learner German (CLEG) to investigate the use of subjunctive forms in the argumentative writing of university students of German (L1 English). CLEG is a 200,000-word corpus of free compositions collected from all three years of the post A-level undergraduate course of German at a British university.

            The German subjunctive plays a pivotal role in all texts and discourses that deal with the discussion or examination of practical or theoretical problems, as it attests the freedom of human beings to step out of the boundaries of the immediate situation and allows the speaker/writer to explore new possibilities in hypothetical scenarios (cf. Brinkmann 1971; Zifonun et al. 1997). These types of texts and discourses are the typical contexts in which students of modern foreign languages at university level produce language – they write argumentative essays, critical commentaries and discussions of controversial current topics. Together with other modal means (e.g. modal verbs, modal adverbials etc.) the subjunctive is therefore a crucial tool that learners are regularly required to use in their argumentative writing.

            There are two subjunctive verb paradigms in the German language: the subjunctive I is generally used in indirectness contexts (indirect speech), while the subjunctive II is used in non-factuality contexts such as conditional clauses or hypothetical argumentation. Owing to extensive syncretism between these synthetic subjunctives and indicative verb forms, there is also a so-called “replacement subjunctive”, which is formed analytically (würde + bare infinitive). This distinction sounds straightforward and clear-cut, but the reality of subjunctive use in contemporary German is very complex. Firstly, the subjunctive is only obligatory in a very limited set of circumstances and, secondly, the two subjunctive forms can, to a degree, reach into each other’s domains (e.g. subjunctive II is often used in indirect speech). Corpus investigations have shown that subjunctive use is mostly dependent on text type, genre, language variety and medium rather than on contextual circumstances.

            The present investigation shows that while the morphological forms themselves pose problems for the learners even at advanced stages, the complex conventions of subjunctive application also seem to induce usage patterns that diverge from those of native speakers. Through a multiple-comparison approach to learner corpus data at different learning stages and native speaker data, it is possible to link these patterns of over-/under- and misuse to their possible causes as transfer-related (interlingual), developmental (intralingual) or teaching (material)-induced phenomena.



Granger, S., J. Hung and S. Petch-Tyson (eds) (2002). Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam: John Benjamins.

Kettemann, B. and G. Marko (eds) (2002). Teaching and Learning by Doing Corpus Analysis. Amsterdam: Rodopi.

Brinkmann, H. (1971). Die Deutsche Sprache. Gestalt und Leistung. 2nd revised and extended edition. Düsseldorf: Schwann.

Zifonun, G., L. Hoffmann and B. Strecker (1997). Grammatik der deutschen Sprache. Schriften des Instituts für deutsche Sprache Bd. 7. Berlin-New York: De Gruyter.


The British National Corpus and the archives of The Guardian: the influence of American phraseology on British English


Ramón Martí Solano


It is a well-known fact that the influence of American English on British English is not a recent phenomenon. However, the use of strictly American phraseology is definitely something that marks the linguistic production of the last decade. It has already been pointed out that this influence is far more widespread in certain language registers or genres in British English (Moon, 1998: 134-135). What is more, in the case of lexicalised variants of the same phraseological unit (PhU), “corpus evidence shows the increasing incidence of American variants in British English” (Moon, 2006: 230-231).

            Does the British National Corpus suffice to examine the influence of American phraseology on British English? Phraseological phenomena, as is also the case with neology and loanwords, need to be observed in the light of up-to-date corpora. The language of the written press thus represents the ideal environment for lexicological and phraseological research. Only the combination of a reference corpus such as the BNC and the archives of at least one British daily paper can guarantee the empirical corpus-based results concerning the extent of phraseological Americanisms in British English.

            The problems that can be encountered concern the representativeness and the reliability of the results retrieved from newspaper archives used as linguistic corpora. Newspaper archives are not real corpora for a number of reasons, but mainly because they have been created not for linguistic purposes but as a practical means for readers to find information of interest to them.

            How could we then examine the associations established between some idiomatic expressions and the type of discourse or the lexico-grammatical patterns in which they are inserted if a corpus such as the BNC does not provide any tokens or if the idiomatic expressions themselves are poorly represented? Newspaper archives, although unsuitable for analysing grammar words (which require a morphosyntactically tagged corpus), are perfectly suitable for the analysis of PhUs (Minugh, 1999: 68).

            In order to account for the pervasiveness of strictly American PhUs in British English we have used the Longman Idioms Dictionary (LID) as a reference guide, since it registers a considerable number of idioms whose entries are labelled as AmE. Among others, we have selected idioms such as come unglued, make a pit stop, eat my shorts, hell on wheels or get a bum steer with the aim of assessing their occurrence and frequency of use in the corpora. It should be underlined that the large majority of PhUs labelled as American by the LID are likewise labelled as slang or spoken.

            We examine the case of the PhU not amount to a hill of beans/not worth a row of beans as a revealing example of the evolution of phraseological variant forms in English, but also as a representative instance of how the American variant is gaining ground and establishing itself not only in everyday use in British English but also in the lexicographic treatment of this type of unit.



Minugh, D. (1999). “You people use such weird expressions: the frequency of idioms in newspaper CDs as corpora”. In J.M. Kirk (ed.) Corpora Galore: Analyses and Techniques in Describing English. Amsterdam: Rodopi, 57-71.

Moon, R. (1998). Fixed Expressions and Idioms in English. Oxford: OUP.

Moon, R. (2001). “The Distribution of Idioms in English”. In Studi Italiani di Linguistica Teorica e Applicata, 30 (2), 229-241.

Moon, R. (2006). “Corpus Approaches to Idiom”. In K. Brown, Encyclopedia of Language and Linguistics, 2nd edition, (3). Oxford: Elsevier, 230-234.

Stubbs, M. (2001). Words and Phrases: Corpus Studies of Lexical Semantics. Oxford: Blackwell.

A corpus-based approach to discourse presentation in Early Modern English writing


Dan McIntyre and Brian Walker


This paper reports on a small pilot project to build and analyse a 40,000-word corpus in order to investigate the forms and functions of discourse presentation (also known as speech, writing and thought presentation) in Early Modern English writing. Prototypically, discourse presentation refers to the presentation of speech, writing and thought from an anterior discourse in a later discourse situation. Our corpus consists of approximately 20,000 words of fiction and 20,000 words of journalism, and has been manually annotated for categories of discourse presentation using the model of speech, writing and thought presentation (SW&TP) outlined originally in Leech and Short (1981) and later developed in Semino and Short (2004). Our aims have been (i) to test the model of SW&TP on an older form of English; (ii) to determine the dominant forms and functions of SW&TP in Early Modern English writing; and (iii) to compare our results against a similarly annotated corpus of Present Day English writing in order to determine the diachronic development of discourse presentation from the Early Modern period to the present day. In so doing, an overarching aim of the project has been to contribute to what Watts and Trudgill (2001) have described as the alternative history of English; that is, a perspective on the diachronic development of English that goes beyond formal elements of language such as syntax, lexis and phonology. In our talk we will describe the construction of our corpus and issues in annotating it, as well as presenting the quantitative and qualitative results of our analysis. We will discuss similarities and differences between SW&TP in Early Modern and Present Day English, and the steps we are currently taking to extend the scope of our project.



Leech, G. and M. Short (1981). Style in Fiction. London: Longman.

Semino, E. and M. Short (2004). Corpus Stylistics: Speech, Writing and Thought Presentation in a Corpus of English Writing. London: Routledge.

Watts, R. and P. Trudgill (2001). Alternative Histories of English. London: Routledge.


WriCLE:  A learner corpus for second language acquisition research


Amaya Mendikoetxea, Michael O’Donnell and Paul Rollinson


The validity and reliability of Second Language Acquisition hypotheses rest heavily on adequate methods of data gathering. Much current SLA research relies on elicited experimental data and disfavours natural language use. This situation, however, is beginning to change thanks to the availability of computerized linguistic databases and learner corpora. The area of linguistic inquiry known as ‘learner corpus research’ has recently come into being as a result of the confluence of two previously disparate fields: corpus linguistics and Second Language Acquisition (see Granger 2002, Barlow 2005). Despite the interest in what learner corpora can tell us about learner language, studies in corpus-based SLA research are mostly descriptive, focusing on differences between the language of native speakers and that of learners, as observed in the written performance of advanced learners from a variety of L2 backgrounds (see e.g. Granger 2002, 2004 and Myles 2005, 2007 for a similar evaluation). In this talk, we analyse the reasons why many SLA researchers are still reluctant to use corpora and how good corpus design and adequate tools to annotate and search corpora could help overcome some of the problems observed. We present the key design principles of a learner corpus we are compiling (WriCLE: Written Corpus of Learner English, 1,000,000 words) which will soon be made available to the academic community through an online interface (    We present some WriCLE-based studies and show how the corpus can be annotated semi-automatically using freely available software: Corpus Tool (by Michael O’Donnell, available at, which permits annotation of texts in different layers, allowing searches across levels, and incorporates a comparative statistics package. We then show how our corpus (and software) can be used to test current hypotheses in SLA.
First, we present the results of a study we conducted, which supports the psychological reality of the Unaccusative Hypothesis, and we then show how the corpus can be used to inform the current debate on the nature and source of optionality in learner language (Interface Hypothesis) and how it can complement research based on acceptability judgements.   Our paper concludes with an analysis of future challenges for corpus-based SLA research.



Barlow, M. (2005). “Computer-based analysis of learner language”.  In R. Ellis and G. Barkhuizen (eds) Analysing Learner Language.  Oxford: OUP.

Granger, S. (2002). “A bird's-eye view of computer learner corpus research”. In S. Granger, J. Hung and S. Petch-Tyson (eds) Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Language Learning and Language Teaching 6. Amsterdam & Philadelphia: John Benjamins, 3-33.

Granger, S. (2004). “Computer learner corpus research: current status and future prospects”. In G. Aston et al. (eds) Corpora and Language Learners. Amsterdam & Philadelphia: John Benjamins.

Myles, F. (2005). Review article. “Language Corpora and Second Language Acquisition Research”. Second Language Research, 21 (4), 373-391.

Myles, F. (2007). “Using electronic corpora in SLA research”.  In D. Ayoun (ed.) French Applied Linguistics. Amsterdam & Philadelphia: John Benjamins, 377-400.


Semi-automatic morphosyntactic tagging of a diachronic corpus of Russian


Roland Meyer


The task of annotating a historical corpus with morphosyntactic information poses problems which differ in quality and extent from those involved in the tagging of modern languages: Resources such as morphological analyzers, electronic lexica and pre-annotated training data are lacking; historical word forms occur in numerous orthographic and morphological variants, even within a single text; diachronic change leads to different realizations or loss of morphological categories. However challenging these issues are, it should be borne in mind that when tagging a historical corpus, we are not just dealing with yet another language, but with a variant more or less closely related to the modern language in many respects. This paper presents an approach to tagging a corpus of Old Russian which capitalizes on the latter point, using a parallel corpus of (digitized and edited) Old Russian documents, as well as their normalized version and modern Russian translation, as it is available from the Biblioteka literatury drevnej Rusi published by the Russian Academy of Sciences in twelve volumes from 1976 to 1994. We proceed in the following manner: An automatic sentence aligner first assigns segments of normalized Old Russian text to their modern equivalents. The modern Russian text is then tagged and lemmatized using the stochastic TreeTagger (Schmid 1994) and its most advanced modern Russian parameter files (Sharoff et al. 2008). The Old Russian version is provisionally annotated by a morphological guesser which we programmed using the Xerox finite-state tools (Beesley & Karttunen 2003) and a restricted electronic basic dictionary of about 10,000 frequent lemmas. The goal is then to reduce the ambiguities left by the guesser through information available from the modern Russian translation, without falling into the trapdoors of diachronic change. 
To this end, automatic word alignment is performed between Old Russian and modern Russian textual segments, based on an adapted version of the Levenshtein edit distance, up to a certain threshold level and relativized to segment length. A more “costly” part of these alignments is manually double-checked. Finally, for each of the aligned word pairs, we reduce the number of potential tags in the Old Russian version using a set of mapping rules applied to the grammatical information available from the modern Russian tag. This is certainly the most intricate and error-prone, but also the linguistically most interesting part of the procedure. In devising the mapping rules, special care was taken not to unduly overwrite information implicit in the Old Russian tags. Morphological categories which are known to have changed over time are never touched (e.g., Old Russian dual number vs. modern Russian plural; modern Russian verbal aspect, which was not yet present in Old Russian). Part-of-speech categories are matched where possible, which already leads to quite effective disambiguation in the Old Russian version. Where syncretisms in Old vs. modern Russian overlap only partly, the set of potential tags can regularly be reduced to that subset of the paradigm which is common to both versions of the text. This applies e.g. when an adjective form is ambiguous between nominative and accusative case in Old Russian and between genitive and accusative in modern Russian. If the information present in the two paired forms cannot be unified – e.g., if an Old Russian noun was assigned only dative case by the guesser, and the modern Russian noun only instrumental –, the Old Russian value is never changed. While this automatic projection of tagging information by no means always leads to perfect results, it drastically reduces the effort needed for manual annotation – it is well known that disambiguation takes less time than annotation from scratch.
In our case, only the remaining untagged or ambiguously tagged forms must be checked at any rate. It should also be noted that the alignment between the normalized version of the Old Russian text and its original scanned edition is almost invariably token-by-token and can thus be very easily restored. Thus, we end up with a philologically acceptable and linguistically annotated digital edition which contains a rendition of the original forms, and the normalized forms annotated with part-of-speech categories, grammatical information and lemmata. The procedure described has been implemented in Python and successfully applied to an important longer Old Russian document, the Laurentian version of  Nestor’s chronicle (14th century). The resulting resource can now be used to bootstrap a larger lexicon and to train a stochastic tagger which may subsequently be applied to yet untranslated Old Russian documents.
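
The two core steps described above, length-relative edit-distance alignment and tag projection by intersection, can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the greedy pairing strategy, the threshold of 0.5, the division by word length (one possible reading of "relativized"), the tag labels and the transliterated example forms are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein edit distance with unit costs (row-by-row DP)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def relative_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer word's length (assumed normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def align_words(old_tokens, modern_tokens, threshold=0.5):
    """Greedily pair each old token with its closest modern token.

    Tokens whose best candidate exceeds the (hypothetical) threshold stay
    unaligned (None) and would be left for manual checking.
    """
    pairs = []
    for old in old_tokens:
        best = min(modern_tokens, key=lambda m: relative_distance(old, m), default=None)
        if best is not None and relative_distance(old, best) <= threshold:
            pairs.append((old, best))
        else:
            pairs.append((old, None))
    return pairs


def reduce_tags(old_tags: set, modern_tags: set) -> set:
    """Keep only tags shared with the modern form; if the two sets cannot be
    unified, the Old Russian tags are returned unchanged."""
    common = old_tags & modern_tags
    return common if common else set(old_tags)


# Illustrative, made-up transliterations (not taken from the corpus).
pairs = align_words(["grad", "velikyi"], ["gorod", "velikij"])
# Adjective ambiguous between Nom/Acc (old) and Gen/Acc (modern): Acc survives.
case_tags = reduce_tags({"Nom", "Acc"}, {"Gen", "Acc"})
# Dative vs. instrumental cannot be unified: the old value is kept.
kept_tags = reduce_tags({"Dat"}, {"Ins"})
```

A fuller version would also exclude categories known to have changed over time (such as dual number or verbal aspect) from the intersection step, as the abstract requires; the sketch leaves that filter out for brevity.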

Today and yesterday: What has changed? Aboutness in American politics

Denise Milizia


The purpose of the present paper is twofold: it aims firstly at identifying keywords in American political discourse today and secondly at analysing the phraseology that these keywords create. A fresh spoken corpus of Democratic speeches is compared to a larger corpus of Republican speeches assembled during the Bush administration. The procedure used for identifying keywords in this research is the one implemented in WordSmith Tools 5.0 (Scott 2007) and is based on simple verbatim repetition. The wordlists generated from the two corpora are analysed and then compared, and the items that emerge are those which occur unusually frequently in the study corpus (the Democratic one) in comparison with the reference corpus (the Republican one in this case). Unchanging topics will not be taken into account in the present investigation, and issues such as Afghanistan, for example, which occur in both corpora with exactly the same percentage (0.05%), will not emerge as prominent, namely as keywords. Likewise, the topics that both Democrats and Republicans talk about will not surface in the comparison, whereas those where there is a significant departure from the reference corpus become prominent for inspection (Scott 2008), and thus the object of the current analysis.
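
The logic of such a keyword comparison can be sketched in a few lines of Python. The sketch assumes a simple two-cell log-likelihood statistic and a cut-off of 3.84; the abstract does not state which statistic or threshold was used, and this is not necessarily the computation WordSmith Tools performs. The function names and the toy frequency counts are invented for illustration.

```python
import math
from collections import Counter


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-cell log-likelihood keyness score for one word."""
    total = size_study + size_ref
    combined = freq_study + freq_ref
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:
            score += observed * math.log(observed / expected)
    return 2 * score


def keywords(study: Counter, reference: Counter, threshold=3.84):
    """Words relatively more frequent in the study corpus, ranked by keyness."""
    size_study = sum(study.values())
    size_ref = sum(reference.values())
    scored = []
    for word, freq in study.items():
        ref_freq = reference.get(word, 0)
        if freq / size_study <= ref_freq / size_ref:
            continue  # not over-represented in the study corpus
        score = log_likelihood(freq, size_study, ref_freq, size_ref)
        if score >= threshold:
            scored.append((word, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy counts (invented): "war" has the same relative frequency in both corpora.
study_counts = Counter({"change": 30, "war": 10, "the": 960})
reference_counts = Counter({"change": 10, "war": 100, "the": 9890})
key_items = keywords(study_counts, reference_counts)
```

In this toy data "war" accounts for 1% of both corpora, so its score is zero and it does not surface, which mirrors the point made above about topics occurring with the same percentage in both corpora.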

            The words and phrases emerging from the comparison are indicative not only of the “aboutness” of the text but also of the context in which they are embedded, relating, predictably, to the major ongoing topics of debate (Partington 2003), signaling a change in priorities of the new government. 

            Bearing in mind that phraseology is not fixed and that, in political discourse in particular, some phrases have a relatively short "shelf life" compared to others (Cheng 2004), the aim here is to unveil “aboutgrams” (Sinclair in Warren 2007) which are prioritized today but were not an issue in the previous administration, and to show that phrases are usually much better at revealing the ‘ofness’ of the text (and the context) than individual words.

            Relying on the assumption that the unit of language is “the phrase, the whole phrase, and nothing but the phrase” (Sinclair 2008), and that words are attracted to one another also at a distance, phraseology is also investigated using another piece of software, ConcGram 1.0 (Greaves 2009), capable of identifying not only collocations that are strictly adjacent but also discontinuous phrasal frameworks, handling both positional (AB, BA) and constituency (AB, ACB) variants.
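
What positional and constituency variation mean for a two-word search can be shown with a short Python sketch. It is only an illustration of the idea and makes no claim about how ConcGram works internally; the function name, the `max_gap` parameter and the example sentence are assumptions.

```python
def cooccurrences(tokens, word_a, word_b, max_gap=3):
    """Find co-occurrences of two words in either order.

    Returns (order, gap, text) tuples, where order is "AB" or "BA"
    (positional variation) and gap is the number of intervening words
    (constituency variation: gap 0 is AB, gap 1 is ACB, and so on).
    """
    hits = []
    for i, token in enumerate(tokens):
        if token not in (word_a, word_b):
            continue
        partner = word_b if token == word_a else word_a
        # Look ahead at most max_gap intervening words for the partner.
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            if tokens[j] == partner:
                order = "AB" if token == word_a else "BA"
                hits.append((order, j - i - 1, " ".join(tokens[i:j + 1])))
                break
    return hits


# Invented example sentence.
sentence = "we will create new jobs and the jobs we create will last"
found = cooccurrences(sentence.split(), "create", "jobs")
```

Here `found` contains one AB instance ("create new jobs") and one BA instance ("jobs we create"), each with one intervening word, which a search restricted to adjacent strings would miss.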



Cheng, W. (2004). “FRIENDS, LAdies and GENtlemen: some preliminary findings from a corpus of spoken public discourses in Hong Kong”. In U. Connor and T. Upton (eds) Applied Corpus Linguistics: A Multidimensional Perspective. Rodopi, 35-50.

Greaves, C. (2009). ConcGram 1.0. Amsterdam: John Benjamins.

Partington, A. (2003). The Linguistics of Political Argument. London: Routledge.

Sinclair, J. (2008). “The phrase, the whole phrase, nothing but the phrase”. In S. Granger and F. Meunier (eds), Phraseology. An Interdisciplinary Perspective. Amsterdam: John Benjamins, 407-410.

Scott, M. (2007). WordSmith Tools 5.0. Oxford: Oxford University Press.

Scott, M. (2008). “In Search of a Bad Reference Corpus”. In D. Archer (ed.) What’s in a word-list?. Ashgate.

Warren, M. (2007). “Making sense of phraseological variation and some possible implications for notions of aboutness and keyness”. Paper presented at the International Conference: Keyness in Text, University of Siena, Italy.

Phraseological choice as ‘register-idiosyncratic’ evaluative meaning? A corpus-assisted comparative study of Congressional debate


Donna Rose Miller and Jane Helen Johnson


Linguistic research has been focussing in recent times on the possibly ‘register-idiosyncratic’ (Miller & Johnson, to appear) significance of lexical bundles/clusters. Biber & Conrad (1999) examine these in conversation and academic prose; others, such as Partington & Morley (2004), look at how lexical bundles might mark the expression of ideology, also through patterns of metaphors in newspaper editorials and news reports; Goźdź-Roszkowski (2006) classifies lexical bundles in legal discourse; Morley (2004) and Murphy & Morley (2006) examine their discourse-marking function of introducing the writer’s evaluation in newspaper editorials.

We propose to report select findings from our investigation into evaluative and speaker-positioning function bundles (Halliday 1985) in one sub-variety of political discourse: US congressional speech. Our corpus of nearly 1.5 million words consists of speeches on the homogeneous topic of the Iraq war, compiled from the transcribed sessions of the US House of Representatives for the year 2003. Methodologically, and in the wake of Hunston (2004), our study begins with a ‘text’: a 1-minute speech, whose evaluative patterns, analysed with the appraisal systems model (Martin and White 2005), serve as the basis for subsequent, comparative, corpus investigation, using primarily Wordsmith Tools and Xaira.

            This particular presentation will recount results with reference to the phraseology it’s/it is + adj + that/to… and it’s/it is time…. Our corpus findings were tested against some large general corpora of English as well as other smaller UK and US political corpora. The purpose was to see: 1) whether the patterns proved statistically ‘salient’ and/or semantically ‘primed’ (Hoey 2005); 2) how the albeit circumscribed semantic prosody in the congressional corpus compared to that in the reference corpora; 3) to what extent evaluative distinctions depended on what was being evaluated; and 4) whether use of these bundles could be said to transcend register boundaries and be rather a consequence of ideological ‘saturation’ – a cross-fertilization of comparable cultural paradigms. Also probed were variations according to gender and political party.
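
As a rough illustration of how the frame it’s/it is + adj + that/to might be retrieved from untagged text, the Python sketch below uses a regular expression with an open one-word slot. It is an assumption-laden approximation, not the authors' procedure (they report using Wordsmith Tools and Xaira): without part-of-speech tags the slot accepts any single word, so non-adjectives such as "time" are also captured and would need manual filtering. The pattern, function name and sample text are invented.

```python
import re
from collections import Counter

# "it's" / "it’s" / "it is", then one open slot, then "that" or "to".
FRAME = re.compile(r"\bit(?:['’]s| is)\s+(\w+)\s+(that|to)\b", re.IGNORECASE)


def slot_fillers(text: str) -> Counter:
    """Count (slot word, that/to) pairs found in the frame."""
    return Counter(
        (match.group(1).lower(), match.group(2).lower())
        for match in FRAME.finditer(text)
    )


# Invented sample text.
sample = (
    "It is clear that we must act. It's important to remember the troops. "
    "It is time to act."
)
fillers = slot_fillers(sample)
```

On the sample text the counter holds ("clear", "that"), ("important", "to") and ("time", "to"), which shows both the adjective frame and the overlap with the separate it’s/it is time pattern.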

            The study forms part of a long-running investigation into register-idiosyncratic features of evaluation and stance in parliamentary debate (cf. Miller 2007), with register being defined as “[…] a tendency to select certain combinations of meanings with certain frequencies”, and register variation as the “[…] systematic variation in [such] probabilities” (Halliday 1991: 33).



Goźdź-Roszkowski S. (2006) ‘Frequent Phraseology in Contractual Instruments’, in Gotti M. & D.S. Giannoni (eds) New Trends in Specialized Discourse Analysis. Bern: Peter Lang, 147-161.

Hoey M. (2005) Lexical Priming, Abingdon: Routledge.

Hunston S. (2004) ‘Counting the uncountable: Problems of identifying evaluation in a text and in a corpus’, in Partington A., J. Morley, L. Haarman (eds) Corpora and Discourse. Bern: Peter Lang, 157-188.

Martin J.R. and White P.R.R. (2005) The Language of Evaluation: Appraisal in English, Palgrave.

Miller D.R. (2007) ‘Towards a Typology of Evaluation in Parliamentary debate: From theory to Practice – and back again’, in Dossena M. & A. Jucker (eds), (Re)volutions in Evaluation: Textus XX n.1, 159-180.

Miller D.R. & Johnson J. H., (to appear) ‘Evaluation, speaker-hearer positioning and the Iraq war: A corpus-assisted study of Congressional argument’, in Morley J. & P. Bayley, Wordings of War: Corpus Assisted Discourse Studies on the Iraq War, Routledge, chapt. 2.

Compiling and exploring frequent collocations


Maria Moreno-Jaen


Collocations are described by many authors (Bahns 1993; Lewis 2000) as an essential component for developing second language learning in general and lexical competence in particular. However, until quite recently the teaching of collocations was a neglected area in most teaching scenarios. Perhaps one of the main reasons for this neglect is the lack of a reliable and graded bank of collocations.

            Therefore, this study intends to present a systematic and reliable corpus-driven compilation of frequent collocations extracted from the Bank of English and the British National Corpus. Taking the first 400 nouns of the English language, a list of their most frequent collocations has been drawn up. The first part of this paper will provide a careful explanation of the procedures followed to obtain this database, where not only statistical significance but also pedagogical factors were taken into consideration for the final selection. But, as is often the case in corpus-based research, the output produced by this study also provided insights into new and unexpected linguistic patterning through what Johns (1988) calls the “serendipity process”. Thus, the discussion of this knock-on effect will be the focus of the second part of this paper.



Bahns, J. (1993). “Lexical collocations: a contrastive view.” English Language Teaching Journal, 47 (1), 56-63.

Johns, T. (1988). "Whence and Whither Classroom Concordancing?". In E. van Els et al. (eds), Computer Applications in Language Learning.

Lewis, M. (ed.) (2000). Teaching collocation. Further developments in the lexical approach. Hove: Language Teaching Publications.

The problem of authorship identification in African-American slave narratives: Whose voice are we hearing?


Emma Moreton


‘The History of Mary Prince, a West Indian Slave’ was first published in 1831. It is a short narrative of fewer than 15,000 words which outlines Prince’s experiences as a slave from her birth in Bermuda in 1788 until 1828 when, whilst living in London, she walked out on her masters. The dictated narrative was transcribed by amanuensis Susan Stringland and is accompanied by extensive supporting documentation, including a sixteen-page editorial supplement. The purpose of this supporting documentation (typical of most slave narratives of the period) was to authenticate the narrative and to give credence to the claims made by the narrator. Arguably, however, it is this documentation (extensive and intrusive as it is) that has led scholars to question the authenticity of the voice of the slave, raising questions regarding the extent to which the slave narrative was tailored to hegemonic discourses about the slave experience (see Andrews, 1988; Beckles, 2000; Ferguson, 1992; Fisch, 2007; Fleischner, 1996; McBride, 2001; Paquet, 2002; and Stepto, 1979). Important questions, therefore, are how these ‘other voices’ influence the narrative and whose voice we are hearing: the editor’s, the transcriber’s, or the slave’s?

            This paper is an attempt to address these questions. Taking ‘The History of Mary Prince’ as a pilot study, this research begins by separating out the different voices within the text. Then, drawing on the theory and practice of forensic linguistics, and using computational methods of analysis, the linguistic features of the supporting documentation, as well as of the narrative itself, are examined in order to identify idiolectal differences. Although the findings are preliminary and by no means conclusive, it is hoped that this research will encourage further discussion regarding the authenticity of the voice of the slave.



Andrews, W.L. (1988). Six Women’s Slave Narratives. Oxford: Oxford University Press.

Beckles, H.McD. (2000). “Female Enslavement and Gender Ideologies in the Caribbean”. In P.E. Lovejoy (ed.) Identity in the Shadow of Slavery, pp. 163-182. London: Continuum.

Ferguson, M. (1992). Subject To Others: British Women Writers and Colonial Slavery, 1670-1834. London: Routledge.

Fisch, A. (ed.) (2007). The Cambridge Companion to The African American Slave Narrative. Cambridge: Cambridge University Press.

Fleischner, J. (1996). Mastering Slavery: Memory, Family, and Identity in Women’s Slave Narratives. New York: New York University Press.

McBride, D.A. (2001). Impossible Witnesses: Truth, Abolitionism, and Slave Testimony. New York: New York University Press.

Paquet, P.S. (2002). Caribbean autobiography: Cultural Identity and Self-Representation. Wisconsin: The University of Wisconsin Press.

Prince, M. (1831). The History of Mary Prince, a West Indian Slave. London: F. Westley and A.H. Davis, Stationers’ Hall Court.



A combined corpus approach to the study of adversative discourse markers


Liesbeth Mortier and Liesbeth Degand


The current paper proposal offers a corpus-based analysis of discourse markers of adversativity, i.e. markers expressing more or less strong contrasting viewpoints in language use (Schwenter 2000: 259), by comparing the semantic status and grammaticalization patterns observed for Dutch eigenlijk ('actually') and French en fait ('in fact'). Previous synchronic studies of their quantitative and qualitative distribution in spoken and written comparable corpora, as well as in translation corpora, showed these markers to be highly polysemous in nature, with meanings ranging from opposition over (counter)expectation to reformulation (author, submitted). The present study aims to show the implications of these semantic profiles for a description of en fait and eigenlijk in terms of discourse markers and (inter)subjectification (cf. Traugott and Dasher 2005, Schwenter and Traugott 2000 for in fact). To this effect, the range of corpus data will be extended to include diachronic corpora, which are expected to confirm the claim that these markers are involved in an ongoing process of subjectification and intersubjectification.

Drawing in part on previous synchronic studies and on pilot studies in the diachronic realm, these data will be analyzed both qualitatively and quantitatively, and take into account intra-linguistic as well as inter-linguistic tendencies and differences. More specifically, we will test the hypothesis according to which current meanings of opposition and reformulation in en fait and eigenlijk have a foundational meaning of deviation or "écart" (Rossari 1992: 159) in diachrony. If this is the case, then this would enable us to explain some of the subjective and intersubjective uses of eigenlijk and en fait: deviation may serve as a "discourse marker hedge" to soften what is said, with the purpose of acknowledging the addressee's actual or possible objections (cf. Traugott and Dasher 2005 for in fact).

            A four-part corpus approach, based on the criteria of text variation and time span, will be applied to eigenlijk and en fait, resulting in analyses of (i) different texts over different periods of time (comparable diachronic data), (ii) different texts within the same period of time (comparable synchronic data, written and spoken), (iii) same texts within the same period (present-day translations of literary and journalistic texts), and (iv) same texts over different periods of time (Bible translations). Applying such an approach to French and Dutch causal connectives, Evers-Vermeul et al. (2008) already succeeded in determining levels and traces of (inter)subjectification, and the approach is expected to yield equally promising results in the field of adversativity. The present study would thus reflect the growing interest of grammaticalization theorists in corpus linguistics (cf. Mair 2004; Rissanen, Kytö & Heikkonen 1997), and further its application by showing the advantages of taking a corpus-based approach to the study of discourse markers.



Evers-Vermeul, J. et al. (in prep). "Historical and comparative perspectives on the subjectification of causal connectives".

Mair, C. (2004). "Corpus linguistics and grammaticalisation theory: Statistics, frequencies, and beyond". In H. Lindquist & C. Mair (eds). Corpus approaches to grammaticalization in English. Amsterdam: John Benjamins, 121-150.

Rissanen, M., M. Kytö, and K. Heikkonen. (eds) (1997). Grammaticalization at work: Studies of long-term developments in English. Berlin/New York: Mouton de Gruyter.

Rossari, C. (1992). "De fait, en fait, en réalité: trois marqueurs aux emplois inclusifs". Verbum 3, 139-161.

Schwenter, S. (2000). "Viewpoints and polysemy: linking adversative and causal meanings of discourse markers". In Couper-Kuhlen, E. and B. Kortmann (eds). Cause - Condition - Concession - Contrast. Berlin: Mouton de Gruyter. 257-282.

Schwenter, S. and E. Traugott. (2000). "Invoking scalarity: the development of in fact". Journal of Historical Pragmatics 1 (1). 7-25.

Traugott, E. and R. Dasher. (2002). Regularity in semantic change. Cambridge: CUP.

LIX 68 revisited - an extended readability measure


Katarina Mühlenbock and Sofie Johansson Kokkinakis


The readability of a text has to be established from a linguistic perspective, and is not to be confused with legibility, which concerns aspects of layout. Consequently, readability variables must be studied and set up specifically for a certain language. Linguistically, Swedish is characterized as an inflecting and compounding language, prompting Björnsson (1968) to formulate the readability index LIX by simply adding average sentence length in terms of number of words to the percentage of words > 6 characters. LIX is still used for determining the readability of texts intended for persons with specific linguistic needs, due to cognitive disabilities or dyslexia, second language learners or beginning readers. However, we regard the LIX value as insufficient for determining the degree of adaptation required for these highly heterogeneous groups, and suggest additional parameters to be considered when aiming at tailored text production for individual readers.
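The LIX calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name, the regular-expression tokeniser and the naive sentence splitter are all assumptions introduced here.

```python
import re


def lix(text: str, long_word_min_chars: int = 7) -> float:
    """Return LIX: average sentence length plus percentage of long words.

    A word counts as long when it has more than 6 characters,
    i.e. at least `long_word_min_chars` (7 by default).
    """
    # Assumption: sentences end at '.', '!' or '?'. Real corpora need a
    # proper sentence segmenter (abbreviations, headings, and so on).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a word is a maximal run of letters (Unicode-aware).
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    long_words = [w for w in words if len(w) >= long_word_min_chars]
    average_sentence_length = len(words) / len(sentences)
    percent_long_words = 100 * len(long_words) / len(words)
    return average_sentence_length + percent_long_words
```

For a two-sentence text of ten words, four of which are longer than six characters, the sketch gives 10/2 + 100 × 4/10 = 45.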

            To date, readability indices have mainly been constructed for English texts. In the 1920s and 30s a vast number of American formulae were constructed, based on a large variety of parameters. Realizing that statistical measurements could be useful, especially multiple variable analysis, Chall (1958) concluded that “only four types of elements seem to be significant for a readability criterion”, namely vocabulary load, sentence structure, idea density and human interest. Björnsson’s study of additional Swedish textual factors may well fit into most of these categories. They were, however, gradually abandoned in favour of factors focusing on features of surface structure alone, i.e. running words and sentences. Many of the factors initially considered were regarded as useless at the time, owing to the lack of suitable means and methods for carrying out statistical calculations on sufficiently large text collections. Our investigation is carried out on the 1.5 million-word corpus LäSBarT (Mühlenbock 2008), consisting of simplified texts and children’s books, divided into four subcorpora: community information, news, easy-to-read fiction and children’s fiction. The corpus is POS-tagged with the TnT tagger and semiautomatically lemmatized.

            Chall’s categories are used as a repository for adding further parameters to Swedish readability studies. In addition to LIX, vocabulary load is calculated by the number of extra long words (≥ 14 characters), indicating the proportion of long compounds. Sentence structure is reflected by the number of subordinate clauses per word and sentence. Idea density is indicated by measuring lexical variation OVIX (Hultman and Westman 1977), lexical density (Laufer and Nation 1995) and nominal quotient, NQ (Johansson Kokkinakis 2008), indicating information load. Finally, we regard human interest as mirrored by the proportion of names, such as those of places, companies and people (Kokkinakis 2004).
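Two of the additional parameters lend themselves to a short sketch: the share of extra long words and the lexical variation index OVIX. The OVIX formula used below, log(N) / log(2 - log(V) / log(N)) with N tokens and V types, is the form in which the index is commonly cited and is assumed here rather than taken from the abstract; the function names and the lower-casing of types are likewise illustrative assumptions.

```python
import math


def extra_long_word_share(words, min_chars=14):
    """Proportion of words with at least `min_chars` characters."""
    if not words:
        raise ValueError("words must not be empty")
    return sum(len(w) >= min_chars for w in words) / len(words)


def ovix(words):
    """Lexical variation index, assuming the commonly cited formula
    OVIX = log(N) / log(2 - log(V) / log(N)),
    where N is the number of tokens and V the number of types.
    """
    n_tokens = len(words)
    # Assumption: types are counted case-insensitively on word forms.
    n_types = len({w.lower() for w in words})
    if n_tokens < 2 or n_types == n_tokens:
        # With V == N the denominator is log(1) = 0, so the index is undefined.
        raise ValueError("OVIX needs at least one repeated word form")
    return math.log(n_tokens) / math.log(2 - math.log(n_types) / math.log(n_tokens))
```

Because every word form in a very short text tends to be unique, the index is only meaningful for samples long enough to contain repetitions.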

            Applying the extended calculations to the LIX formula and comparing the new results across the four subcorpora, we found that a measure based on more factors on lexical, syntactic and semantic levels contributes strongly to a more appropriate weighing of text difficulty. Texts adapted to the specific needs of an individual reader are valuable assets for various types of applications connected to research and education, constituting a prerequisite for the integration into society of language-impaired persons.



Björnsson, C. H. (1968). Läsbarhet. Stockholm: Bokförlaget Liber.

Chall, J. S. (1958). Readability. An appraisal of research and application. Ohio.

Hultman, T. G. and M. Westman (1977). Gymnasistsvenska. Lund: Liber Läromedel.

Johansson Kokkinakis, S. (2008). En datorbaserad lexikalisk profil. Meijerbergs institut, Göteborg.

Kokkinakis, D. (2004). “Reducing the effect of name explosion”. Proceedings of the LREC workshop: Beyond named entity recognition - semantic labelling for NLP, Lisbon, Portu

Laufer, B. and P. Nation (1995). "Vocabulary Size and Use: Lexical richness in L2 Written Production." Applied Linguistics 16 (3), 307-322.

Mühlenbock, K. (2008). “Readable, legible or plain words - presentation of an easy-to-read Swedish corpus”. Readability and Multilingualism, workshop at 23rd Scandinavian Conference of Linguistics. Uppsala, Sweden.

Hedges, boosters and attitude markers in L1 and L2 research article Introductions


Pilar Mur

English has no doubt become the language of publication in academia. Most, if not all, high-impact journals are nowadays published in English, and getting one’s research article accepted in any of them is a great concern for scholars worldwide, as tenure, promotion and other reward systems are based on them. The drafting of a research article in English is even harder for non-native scholars who are used to different writing conventions and styles in their own disciplinary national contexts and who very often need to turn to ‘literacy brokers’ (i.e. language reviewers, translators, proof readers, etc.) (Curry and Lillis 2006) for help. In this context, carrying out specific intercultural analyses may be useful to better determine where the differences in writing in the two socio-cultural contexts (local in a national language, and international in English) lie and thus better inform non-native scholars on the necessary adjustments to be made to have better chances of successful publication in the competitive international context. This paper aims at analysing interactional metadiscourse features in Business Management research article Introductions written in English, by Anglo-American and by Spanish authors. A total of 25 interactional features, corresponding to the categories of hedges, boosters and attitude markers, which were found to be the most common metadiscourse features in a preliminary analysis, will be contrastively analysed. Their frequency of use and phraseology will be studied in a corpus of 48 research articles, 24 written in English by scholars based at American institutions and 24 written in English by Spanish scholars, which is part of the SERAC (Spanish English Research Article Corpus) compiled by the InterLAE research group at the University of Zaragoza (Spain).
This analysis will help us determine whether the use of these common metadiscourse features is rather homogeneous in international publications, regardless of the writers’ L1 and their academic writing conventions in their national contexts, or whether a degree of divergence is allowed for as regards certain specific features. An insight into the degree of (in)variability apparently acceptable in international publications as regards the use of the most common hedges, boosters and attitude markers, that is, the degree of required acculturation, will enable us to give more accurate guidance to Spanish (or other L2) writers when it comes to drafting their research in English for an international readership.



You’re so dumb man. It’s four thousand years old. Who’s dumber me cos I said it or the guys that thought I was serious?: Age and gender-related examination of insults in Irish English


Brona Murphy


Insults are emotionally harmful expressions (Jay, 1999) which come under the term of ‘taboo language’. This paper examines insults in a 90,000 word (approx) spoken corpus of Irish English comprising three all-female and three all-male age-differentiated sub-corpora spanning the following age groups: 20s, 40s, and 70s/80s. The examination is carried out by using quantitative and qualitative corpus-based tools and methodologies such as relative frequency lists and concordances, as well as details of formulaic strings including significant clusters. Interviews, which were carried out after the compilation of the corpus, are also referred to as a form of data in this paper. This study examines age and gender-related variation in the use of insults and concludes that their usage seems to be predominantly a linguistic feature that is characteristic of young adulthood and, in particular, young males. This, we found, seems to be primarily due to the influence of their particular life-stage, as well as the relationships and typical linguistic interactional behaviour common to them. In order to highlight this, the study will make reference to insults functioning at two levels of language: the level of lexis (a. intellectual insults, b. sexual insults, c. animal insults, d. expletives) as well as at the level of discourse, for example, she's not like she's not repulsive I mean look at you like. The paper will discuss both the use of insults directly to one’s interlocutor as well as the use of insults about other people who are not present. The study highlights the importance of corpus linguistics as an extremely effective tool in facilitating such analyses which, consequently, allow us to build on research in conversation analysis, discourse analysis, sociolinguistics and variational pragmatics.



Jay, T. (1999). Why We Curse – A Neuro-psycho Social Theory of Speech. Amsterdam: John Benjamins.


k dixez?: A corpus study of Spanish internet orthography


Mark Myslín and Stefan Th. Gries


New technologies have always influenced communication, by adding new ways of communication to the existing ones and/or changing the ways in which existing forms of communication are utilized. This is particularly obvious in the way in which computer-mediated communication, CMC, has had an impact on communication. In this paper, we are concerned with a form of communication that is often regarded as somewhat peripheral, namely orthography. CMC has given rise to forms of orthography that deviate from standardized conventions and are motivated by segmental phonology, discourse pragmatics, and other exigencies of the channel (e.g., the fact that typed text does not straightforwardly exhibit prosody). We focus here on the characteristics of a newly evolving form of Spanish internet orthography (hereafter SIO); consider (1) and (2) for an example of SIO and its standardized equivalent respectively.


            (1) hace muxo k no pasaba x aki,, jaja,, pz aprovehio pa saludart i dejar un komentario aki n tu space q sta xidillo :)) ps ia m voi


            (2) Hace mucho que no pasaba por aquí, jaja. Pues aprovecho para saludarte y dejar un comentario aquí en tu space que está chidillo. Pues ya me voy.


It's been a long time since I've been here, haha. Well, I thought I'd take the opportunity to leave you a comment here on your space, which is really cool. Well, off I go.


SIO, which has hardly been studied, differs markedly but not arbitrarily from standard Spanish orthography but also exhibits considerable internal variation. In (1), for instance, que ('that') is spelt in two different ways: k and q. Since the number of non-arbitrary spelling variations is huge, in this paper, we only explore a small set of phenomena. These include the influence of colloquialness, word frequency, and word length on


- post-vocalic d/[ð] deletion in the past participle marker -ado: for example, we found that deletion in SIO is most strongly preferred in high-frequency words (cf. Figure 1);

- b/v interchanges: for example, we found that b can represent [b] and v can represent [β] even for words whose standard spelling is the other way round, but also that high-frequency words are much less flexible;

- ch → x substitution: for example, we found that vulgar words are surprisingly resistant to this spelling change;

- the repetition of letters: for example, we found a strong preference to repeat letters representing vowels and continuant consonants, which we can explain both with reference to phonological and cognitive/iconic motivations.
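As an illustration of how the last phenomenon might be operationalised, the following Python sketch finds letters repeated in a token and tallies them by class. It is not the authors' procedure: the threshold of three identical letters in a row (chosen so that standard spellings such as ll, rr or ee are not counted) and the simple vowel/consonant split are assumptions made here.

```python
import re
from collections import Counter

# Assumption: Spanish vowel letters, including accented forms.
VOWELS = set("aeiouáéíóúü")

# A letter followed by at least two more copies of itself (three or more in a row).
REPEAT = re.compile(r"([^\W\d_])\1{2,}", re.IGNORECASE)


def repeated_letters(token):
    """Return the letters that occur three or more times in a row in `token`."""
    return [match.group(1).lower() for match in REPEAT.finditer(token)]


def repetition_profile(tokens):
    """Count repeated-letter sequences by letter class (vowel or consonant)."""
    counts = Counter()
    for token in tokens:
        for letter in repeated_letters(token):
            counts["vowel" if letter in VOWELS else "consonant"] += 1
    return counts
```

A finer-grained version would replace the vowel/consonant split with phonological classes, for instance continuants versus stops, which is the distinction the finding above relies on.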


      Our study is based on our own corpus of approx. 2.7 million words of regionally balanced informal internet Spanish. Our corpus consists of brief comments and messages (mean length of entry = 19.5 words; sd = 36.2) and was compiled in May 2008 using the scripting language R to crawl various forums and social networking websites. In addition, we use for comparison Mark Davies's (2002-) 100 million word Corpus del Español. For the quantitative study, we used Leech and Fallon's (1992) difference coefficient as well as a slightly modified version of Gries and Stefanowitsch's (2004) distinctive collexeme analysis.
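The difference coefficient can be illustrated with a short Python sketch. It assumes the form in which the measure is usually stated, (a - b) / (a + b), applied to relative frequencies so that corpora of different sizes can be compared; the function name and the example counts are hypothetical and are not taken from the study.

```python
def difference_coefficient(freq_a, size_a, freq_b, size_b):
    """Difference coefficient of a form's frequency in corpus A versus corpus B.

    Assumed form: (rel_a - rel_b) / (rel_a + rel_b), where rel_x is the
    frequency divided by the corpus size. The result ranges from -1
    (form found only in B) to +1 (form found only in A).
    """
    rel_a = freq_a / size_a
    rel_b = freq_b / size_b
    if rel_a + rel_b == 0:
        raise ValueError("the form does not occur in either corpus")
    return (rel_a - rel_b) / (rel_a + rel_b)


# Hypothetical counts: a spelling occurring 30 times per 1,000 words in one
# corpus and 10 times per 1,000 words in the other scores about 0.5.
example = difference_coefficient(30, 1000, 10, 1000)
```

Values near zero indicate that the form is about equally frequent in both corpora.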

Mock trial and interpreters’ choices of lexis: Issues involving lexicalization and re-lexicalization of the crime


Sachiko Nakamura and Makiko Mizuno


It is widely believed and actually practiced by many interpreters that when they fail to come up with an exact matching word, opting for a substitution is better than omitting or discontinuing translating (e.g. “approximization” in Gile, 1995; “close renditions” in Rosenburg, 2002; “nearest translations” in Komatsu, 2005). In court interpretation training and practice, an emphasis has been placed on mastering legal terminology while relatively little attention has been paid to risks of inadvertent use of synonyms of common lexis. Some researchers also suggest that the amount of research on “the role of lexis in creating nuances of meaning for the jury” is still limited (Cotterill, 2004 p.513).

            At present, Japan is in the midst of a major reform of its legal system; it will introduce the citizen-judge system, a system similar to the jury system, in May 2009. Anticipating possible impacts of interpreter intervention on lay judges, the legal discourse analysis team of the Japan Association of Interpreting and Translation Studies conducted the second mock trial based on a scenario involving a typical injury case which had been written by the team. The mock trial focused on an interpreter-mediated prosecutor questioning session, inviting mock lay-judges and two interpreters to identify impacts of interpreter interventions.

            The mock trial data revealed some interesting phenomena. We found that there were two storylines running in parallel in the questioning session; one was the prosecutor’s storyline expressed in lexis representing the court norms, the other was the defendant’s storyline representing his lexical world, which was, however, interpreted and re-lexicalized only through the voice of the interpreter. Although the prosecutor consistently used the lexis naguru to describe the defendant’s act of assaulting the victim, the defendant used the lexis hit in his testimony.

            Furthermore, we found that two interpreters used different lexis to translate naguru in their English rendition: One interpreter used beat(en) while the other used hit. Although they used different lexis to describe the same criminal act, this difference did not manifest in the court, where Japanese language testimony alone is treated as authentic evidence.

            The corpus data show that collocates of hit and beat(en) differ widely, suggesting that these lexical items are not necessarily substitutable. Advertent or inadvertent use of synonymous lexis might lead to manipulation of meanings (Nakamura 2006), and we suggest that it might also alter the legal implication of key expressions in the criminal procedures, which may result in different legal judgments concerning the gravity of an offense (Mizuno 2006).

            In our paper we will pick out some Japanese expressions for which the two interpreters gave markedly different translations, and examine problems concerning the choice of lexis and its legal consequences.



Cotterill, J. (2004) “Collocation, Connotation, and Courtroom Semantics: Lawyers’ Control of Witness Testimony through Lexical Negotiation”. Applied Linguistics, 25 (4), 513-537.

Gile, D. (1995) Basic Concepts and Models for Interpreter and Translator Training. Amsterdam: John Benjamins.

Komatsu, T. (2005) Tsuyaku no gijyutsu (Interpreting Skills) Kenkyusha.

Mizuno, M. (2006). “Possibilities and Limitations for Legally Equivalent Interpreting of Written Judgment”. Speech Communication Education, 19, 113-130.

Nakamura, S. (2006). “Legal Discourse Analysis – a Corpus Linguistic Approach”. Interpretation Studies: The Journal of the Japan Association for Interpretation Studies, 6, 197-206.

How reliable are modern English tagged corpora? A case-study of modifier and degree adverbs in Brown, LOB, the BNC-Sampler, and ICE-GB.


Owen Nancarrow and Eric Atwell


Adverbs heading adverb phrases with a phrase-internal function are generally termed “modifiers”, or less often “qualifiers” (e.g. enormously enjoyable), as opposed to those heading clause-internal adverb phrases (e.g. I enjoyed it enormously). Since most modifier adverbs are also semantically “degree” (or equivalently “intensifier”) adverbs, many adverbs may be classified both syntactically as modifiers, and semantically as degree adverbs. Only word-classes containing the adverb very will be discussed in this paper. Each corpus has just one such word-class.

            Both Brown and LOB use the same syntactically-based tag QL (qualifier). The Sampler and ICE-GB, on the other hand, use semantically-based tags: RG (degree adverb), and ADV(inten) (intensifier adverb), respectively.

            However, severe restrictions are placed on the use of the LOB QL tag, and the Sampler RG tag. Essentially, adverbs tagged QL in LOB must also be degree adverbs, and never function in the clause. And in the Sampler, adverbs tagged RG must also function as qualifiers, and, again, never function in the clause. Though the tags are different, the definitions are quite similar. Thus the Brown tag is used for qualifiers, both the LOB and Sampler tags for degree adverbs which are also qualifiers, and the ICE tag for degree adverbs. The authors of LOB and the Sampler could have made this clear by using a tag which combined the two properties of being both a qualifier and a degree adverb.

            Assessing the reliability of the tagging depends on a clear understanding of the requirements which it must meet. For these, the user must consult online documentation, and other reference materials. These documents vary greatly in reliability. The Brown documentation has conflicting accounts. The LOB Manual is not altogether clear. The Sampler material, however, is clear and straightforward, as is that of ICE.

            The tagging of three of the four corpora, although all have been manually-corrected, has numerous inconsistencies, where the same word in the same context is tagged differently. LOB is almost free of this kind of error, the Sampler has a number, ICE has a lot, and Brown probably has even more. An example from the Sampler is That is far (RR) short versus majorities that ... are far (RG) short, one from Brown is he was enormously (QL) happy, versus this was an enormously (RB) long building.
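One way to search for inconsistencies of this kind automatically is sketched below in Python. It is not the procedure used in the study: "same context" is approximated here, as an assumption, by the word form together with the word that follows it, and the tags on the neighbouring words in the example are placeholders rather than real Sampler tags.

```python
from collections import defaultdict


def inconsistent_taggings(tagged_sentences):
    """Return contexts in which the same word receives more than one tag.

    `tagged_sentences` is a list of sentences, each a list of (word, tag)
    pairs. The context key is (word, following word), lower-cased, which
    is an illustrative simplification.
    """
    tags_by_context = defaultdict(set)
    for sentence in tagged_sentences:
        words = [word.lower() for word, _ in sentence]
        for i, (_, tag) in enumerate(sentence):
            following = words[i + 1] if i + 1 < len(words) else "</s>"
            tags_by_context[(words[i], following)].add(tag)
    return {ctx: tags for ctx, tags in tags_by_context.items() if len(tags) > 1}


# The "far short" example from the Sampler; "X" marks placeholder tags.
sample = [
    [("That", "X"), ("is", "X"), ("far", "RR"), ("short", "X")],
    [("are", "X"), ("far", "RG"), ("short", "X")],
]
conflicts = inconsistent_taggings(sample)
```

Each conflict found this way is only a candidate: a human still has to decide whether the two tags reflect a real difference in function or a tagging error.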

            When the tagging is inconsistent and the documentation is inadequate, it may not always be possible to know what tag an author really intended. This is particularly true for the degree adverb tag, because of the many cases where the semantic property of “degreeness” is only part of the meaning of the word.  Thus, in ICE the word depressingly is tagged once as a general adverb and once as a degree adverb.

            The reliability of the tagging of adverbs may suggest how reliable the tagging of these corpora is in the case of other tags. There are few, if any, published accounts of error-rates, and most users suppose them to be almost free of all but a few unavoidable errors. The reliability of the tagging of qualifier and degree adverbs in these four corpora, as suggested by our investigation, is: LOB, highly reliable; the Sampler, good but still with too many undesirable errors; ICE, a large number of inconsistent taggings; and Brown, by far the most unreliable.


Populating a framework for readability analysis


Neil Newbold and Lee Gillam


This paper discusses a computational approach to readability that is expected to lead eventually towards a new and configurable metric for text readability. This metric will be comprised of multiple parts for assessing and comparing different features of texts and of readers. Our research involves the elaboration, implementation and evaluation of an 8-part framework proposed by Oakland and Lane (2004) that requires consideration of both textual and cognitive factors. Oakland and Lane account for, but do not sufficiently elaborate a computational handling of factors involving language, vocabulary, background knowledge, motivation and cognitive load. We have considered the integration of techniques from a variety of fragmented research that will provide for a system for comparing the various elements involved with this view of readability. Historically, readability research has been largely focussed on simple measures of sentence and word length, though many have found these measures to be overly simplistic or an inadequate means to evaluate and compare the readability of documents, or even sentences. And yet limited attention has been paid to the contribution that simplifying and/or improving textual content can make for both human-readability and machine-readability.

            We will discuss our work to date that has examined the limitations of current measures of readability and considered how the wider textual and cognitive phenomena may be accounted for. This has, to some extent, validated Oakland and Lane’s framework. We have explored how these factors are realised in the construction of prototypes that implement several parts of this framework for: (i) document quality control for technical authors (Newbold and Gillam, 2008); (ii) automatic video annotation. This includes techniques for statistical and linguistic approaches to terminology extraction (Gillam, Ahmad and Tariq, 2005), approaches to lexical and grammatical simplification (Williams and Reiter, 2008), and considerations of plain language (Boldyreff et al., 2001). We also address other elements of Oakland and Lane's framework, for example how using lexical cohesion (Hoey, 1991) could address idea density, and we consider the difficulty of semantic leaps (Halliday and Martin, 1993) and how this relates to cognitive load. Repetitious patterns may help readers form an understanding of the text, but this may assume a reader has a complete understanding of the terms being used.

            Our interest in readability is motivated towards making semantic content more readily available and, as a consequence, improving quality of documents. We are expecting in due course to begin to provide new methods for measuring readability that through human evaluation will lead on to a new metric. There are already indications of support for a British Standard for readability, including an inaugural working group meeting, to which it is hoped that this work will contribute in the short-to-medium term.



Boldyreff, C., Burd, E., Donkin, J. and Marshall, S. (2001). “The Case for the Use of Plain English to Increase Web Accessibility.” In Proceedings of the 3rd Intl. Workshop on Web Site Evolution (WSE’01).

Gillam, L., Tariq, M. and Ahmad, K. (2005). “Terminology and the construction of ontology”. Terminology, 11(1), 55-81.

Halliday, M.A.K. and Martin J.R. (1993). Writing Science: Literacy and Discursive Power. London: Falmer Press.

Hoey, M. (1991). Patterns of Lexis in Text. Oxford: OUP.

Oakland, T. and Lane, H.B. (2004), “Language, Reading, and Readability Formulas”. International Journal of Testing, 4 (3), 239-252.

Newbold, N. and Gillam, L. (2008). “Automatic Document Quality Control”. In Proceedings of the Sixth Language Resources and Evaluation Conference (LREC).

A corpus-based comparison among CMC, speech and writing in Japanese


Yukiko Nishimura


Linguistic aspects of computer-mediated communication (CMC) in English have been compared with speech and writing from corpus-based approaches (Yates 1996, Collot & Belmore 1996). Message board corpora and online newsgroup discussions in English have also been studied (Marcoccia 2004, Lewis 2005, Claridge 2007). However, studies on Japanese and Japanese CMC are limited (Wilson et al eds. (2003) contains no article on Japanese). Earlier studies of Japanese CMC qualitatively found that users employ informal conversational styles and creative orthography (Nishimura 2003, 2007). This paper attempts to fill a gap in corpus-based analyses of Japanese CMC by comparing speech and writing. Based on the parts of speech (POS) distribution, it quantitatively reveals how CMC resembles or differs from speech and writing in Japanese.

     Because large-scale spoken and written corpora in Japanese equivalent to the British National or Cobuild Corpora are not yet available (they are currently under compilation), this study creates smaller written and spoken corpora for comparison with the CMC corpus. While the word is the basic unit of quantitative study in English, the morpheme takes this role in Japanese. ChaSen morphological parsing software was therefore used as a tagging device. The CMC corpus for this study consists of messages from two major bulletin board system (BBS) websites, Channel 2 and Yahoo! Japan BBS. The written corpus was created by scanning magazine articles on topics similar to those discussed in CMC; the spoken corpus consists of transcriptions of 21 hours of casual conversation, mostly among college-aged friends, on everyday topics. The mismatch in topics between CMC and speech does not seem to affect the result, as the analysis is based not on the lexical frequency of vocabulary but on the grammatical categorisation of POS.
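
     As an illustration only, a POS-based comparison of this kind could be sketched as follows. The sketch assumes that each corpus has already been tagged (for example by a morphological analyser such as ChaSen) and is available as a list of (morpheme, POS) pairs. The function names, the distance measure and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter


def pos_distribution(tagged):
    """Return the relative frequency of each POS tag.

    `tagged` is a list of (morpheme, pos) pairs; this input format is an
    assumption made for the sketch, not the output format of any tagger.
    """
    counts = Counter(pos for _, pos in tagged)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pos: n / total for pos, n in counts.items()}


def distribution_distance(p, q):
    """Sum of absolute differences between two POS distributions.

    This simple measure is chosen for illustration; the study itself does
    not state which statistic it uses.
    """
    tags = set(p) | set(q)
    return sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tags)


# Invented toy data, four morphemes per "corpus".
speech = [("ee", "interjection"), ("sou", "adverb"),
          ("da", "auxiliary"), ("ne", "particle")]
cmc = [("kore", "noun"), ("wa", "particle"),
       ("ii", "adjective"), ("ne", "particle")]
writing = [("kenkyuu", "noun"), ("wo", "particle"),
           ("okonau", "verb"), ("ta", "auxiliary")]

distributions = {
    "speech": pos_distribution(speech),
    "cmc": pos_distribution(cmc),
    "writing": pos_distribution(writing),
}

# Compare CMC with each of the other two corpora.
for name in ("speech", "writing"):
    d = distribution_distance(distributions["cmc"], distributions[name])
    print(f"CMC vs {name}: {d:.2f}")
```

     Working with relative rather than raw frequencies keeps corpora of different sizes comparable, which matters when the spoken, written and CMC samples differ in length.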

     With the 9 POS categories as variables, the study finds that interjections distinguish speech from CMC and writing. A more detailed analysis of two key areas, particles and auxiliaries, finds that CMC differs from writing in the distribution of case particles and sentence-final particles, and that polite auxiliary verbs separate the two target websites, Channel 2 and Yahoo! Japan BBS, within CMC. Case particles specify grammatical relations to express content explicitly, fulfilling the “ideational function,” while interjections, sentence-final particles and polite auxiliaries embody the “interpersonal function” (Halliday 1978). This corpus-based study shows quantitatively that CMC occupies an intermediate position on a continuum from writing to speech. These findings generally conform to the results of English CMC studies comparing speech and writing (Yates 1996), though the details differ. The study has also discovered that variations exist within CMC in the open-access BBS context, where participant background is disclosed.

     This study employs POS and their subcategories alone as a measure of comparison, unlike the multi-feature/multi-dimensional (MF-MD) approach (Biber 1988). The three corpora have limitations in terms of representativeness and scale. However, the present study, part of a larger research project on Japanese CMC (Nishimura 2008), is expected to contribute to the still limited body of corpus-based studies of Japanese and of variation in CMC. It has also clarified the medium-specific realisation of language functions as embodied by the grammatical categories.



Claridge, C. (2007). “Constructing a corpus from the web: Message boards”. In M.Hundt, N. <