Arabic question-answering via instance based learning from an FAQ corpus

Bayan Abu Shawar and Eric Atwell

We describe a way to access information in an Arabic Frequently-Asked Questions corpus, without the need for sophisticated natural language processing or logical inference. We collected an Arabic FAQ corpus from a range of medical websites, covering 412 subtopics from 5 domains: motherhood and pregnancy issues; dental care issues; fasting and related health issues; blood disease issues such as cholesterol level, and diabetes; and blood charity issues.

FAQs are documents designed to capture the key concepts or logical ontology of a given domain. Any Natural Language interface to an FAQ is expected to reply with the given Answers, so there is no need for Arabic NL generation to recreate well-formed answers, or for Arabic NLP analysis or logical inference to map user input questions onto this logical ontology; instead, a simple set of pattern-template matching rules can be extracted from each Question and Answer in the FAQ corpus, to be used in an Instance Based Learning system which finds the nearest match for any user input question and outputs the corresponding answer(s). The training corpus is in effect transformed into a large number of categories or pattern-template pairs: from the 412 subtopics, our system generated 5,665 pattern-template rules to match against user input. User input is used to search the categories extracted from the training corpus for a nearest match, and the corresponding reply is output.

Previous work demonstrated this Instance Based Learning approach with English and other European languages. Arabic offers additional challenges: a different writing system; no Capitalization to identify proper names; highly inflectional and derivational morphology; and little standardization of format for Questions and Answers across documents in the Corpus. Nevertheless, we were able to adapt the Instance Based Learning system to the Arabic FAQ corpus straightforwardly, demonstrating that the Instance Based Learning approach is much more language-independent than systems which build in more linguistic knowledge.

Initial tests showed that a high proportion of answers (93%) to a limited set of test questions were correct. The same questions were submitted to Google and AskJeeves. However, because Google and AskJeeves return complete documents that hold the answers, we tried to measure how easy it was to find the answers inside the documents; we found that, given the documents, users had to then spend more time searching the documents to find the answer, and they only succeeded around half the time.

Future work includes extending the Arabic FAQ Corpus to a wider range of medical questions and answers; and more rigorous evaluation, including comparisons with more sophisticated NLP-based Arabic Question-Answering systems. At least we have demonstrated that a simple Instance Based Learning engine can be used as a tool to access Arabic WWW FAQs. We did not need sophisticated natural language analysis or logical inference; a simple (but large) set of corpus-derived pattern-template matching rules is sufficient.

Using monolingual corpora to improve Arabic/English cross-language information retrieval

Farag Ahmed, Andreas Nürnberger and Ernesto William De Luca

In a time of wide availability of communication technologies, language barriers are a serious issue to world communication and to economic and cultural exchanges. More comprehensive tools to overcome such barriers, such as machine translation and cross-lingual information application, are nowadays in strong demand. In this paper, related research problems will be considered especially in the context of Arabic/English Cross Language Information Retrieval (CLIR).

Word sense disambiguation is a non-trivial task. For some languages the problem is even more complicated where appropriate resources, e.g., lexical resources like WordNet [1], not yet exist. Furthermore, sense-annotated corpora are rare for those languages. This is because obtaining labelled data is expensive and time-consuming. In addition to these problems, the Arabic language suffers from the absence of the vowel in written text which causes a high ambiguity in which the Arabic words could have different meanings and interpretations. The vowel signs are signs placed above and below the letters in order to indicate the proper pronunciation.

The lack of one-to-one mapping of a lexical item and its meaning causes a translation ambiguity, which produce translation errors that impact the multilingual retrieval system performance. Therefore, extra effort is needed in Arabic to deal with the ambiguous word translations. For example, due to the absence of vowel signs, the Arabic word “يعد “ can have the meanings "promise", "prepare", "count", "return", "bring back" and the word “علم “ can have the meanings "flag", "science", "he knew", "it was known", "he taught", "he was taught" in English. One solution to tackle this problem is to use statistical information extracted from corpora.

In this paper, we will briefly review the major issues related to CLIR, e.g., translation ambiguity, inflection, translating proper names, spelling variants and special terms [2], concentrating on WSD tasks from Arabic to English. Furthermore, based on the encouraging results we achieved in previous work, which is based on a modification of Naïve Bayesian algorithm [3, 4], we will propose in this paper an improved statistical method to disambiguate the user’s query terms using statistical data from a monolingual corpus1 and the web. This method consists of two main steps: First, using an Arabic analyzer, the query terms are analyzed and the senses of the ambiguous query terms are defined. Secondly, the correct senses of the ambiguous query terms are selected, based on co-occurrence statistics. To compensate for the lack of training data for some test queries, we used web queries in order to obtain the statistical co-occurrence data needed to disambiguate the ambiguous query terms. This was achieved by constructing all possible translation combinations of the query terms and then sending them individually to a particular search engine, in order to identify how often these translation combinations appear together. The evaluation was based on 50 Arabic queries. Two experiments have been done, one was based on the co-occurrence data, which was obtained from the monolingual corpus and the other was based on the co-occurrence data which was obtained from the web.

References:

Ahmed, F. and A. Nürnberger (forthcoming) “Corpora based Approach for Arabic/English Word Translation Disambiguation”. Journal of Speech and Language Technology, 11.

Ahmed, F. and A. Nürnberger. (2008) “Arabic/english word translations disambiguation using parallelcorpus and matching scheme”. In Proceedings of the 12th European Machine Translation Conference (EAMT08), 6–11.

Hedlund, T., E. Airio, H. Keskustalo,R. Lehtokangas, A. Pirkola and K. Järvelin (2004). “Dictionary-based cross-language information retrieval: Learning experiences from CLEF 2000-2002”. Information Retrieval, 7(1/2), 99-119.

A corpus-based analysis of conceptual love metaphors

Yesim Aksan, Mustafa Aksan, Taner Sezer and Türker Sezer

Corpus-based conceptual metaphor studies have underscored the significant impact of authentic data analysis on the theoretical development of conceptual metaphor theory (Charteris-Black 2004; Deignan 2005; Stefanowitsch and Gries 2006). In this context, most of the linguistic metaphors constituting the conceptual metaphors identified by Lakoff and Johnson (1980) have subjected to corpus-based analysis (Stefanowitsch 2006; Deignan 2008 among others). In this study, we will focus on metaphors of romantic love in English and we will argue that 25 source domains identified by Kövecses (1988) in the conceptualization of romantic love display differences in British and American English.

The aim of this paper is twofold: (1) to show how contrastive corpus-based conceptual metaphor analysis will unveil language specific elaboration of linguistic metaphors of romantic love in British and American English; (2) to explicate frequent and current use of the proposed source domains for English love metaphors through the analysis of large corpora.

To achieve these ends, main lexical items from the proposed source domains of romantic love— namely, ECONOMIC EXCHANGE, SOCIAL CONSTRUCT, INSANITY, DISEASE, BOND, FIRE, JOURNEY, etc.— will be concordanced using the BNCweb (CQP-edition) and COCA (Corpus of Contemporary American English by Mark Davies). The concordance data will be analysed for differences between literal and metaphorical uses, and the frequency of metaphorical uses of the lexical items referring to the above-mentioned source domains will be identified in both corpora. For instance, while the lexical item invest from the ECONOMIC EXCHAGE source domain does not collocate with the nouns relationship, feeling and emotion in the BNC, it collocates with all of these nouns in the COCA. In other words, the linguistic metaphor ‘She has invested a lot on that relationship’ appears to be a more likely usage in American English rather than in British English. The distribution of the linguistic metaphors of romantic love in the corpora with respect to medium of text and text type will also be examined.

Overall, the applied methodology of this paper will reveal a number of different linguistic metaphors used to conceptualize romantic love in British and American English. We might even think that the analysis proposed in this study, — analysis of lexical items with their metaphorical patterns— would yield a database, which might be employed in studies that aim to develop software for identifying metaphor candidates in corpora.

References:

Charteris-Black, J. (2004). Corpus approaches to critical metaphor analysis. Palgrave MacMillan.

Deignan, A. (2005). Metaphor and corpus linguistics. John Benjamins.

Deignan, A. (2008). “Corpus linguistic data and conceptual metaphor theory”. In M.S. Zanotto, L. Cameron and M. C. Cavalcanti (eds) Confronting metaphor in use. Amsterdam: John Benjamins, 149-162.

Kövecses, Z. (1988). The language of love. Bucknell University Press.

Lakoff, G. and M. Johnson (1980). Metaphors we live by. University of Chicago Press.

Stefanowitsch, A. (2006). “Words and their metaphors: A corpus based approach”. In A. Stefanowitsch and S. Th. Gries (eds) Corpus-based approaches to metaphor and metonymy. Amsterdam: John Benjamins, 63- 105.

A corpus-based reappraisal of the role of biomechanics in lexical phonotactics

Eleonora C. Albano

In a number of influential papers, MacNeilage and Davis (e.g., 2000) have forwarded the view that CV co-occurrence is shaped by biomechanics. The claim is based on certain phonotactic biases recurring in several languages in babbling, first words, and the lexicon. These are: labial C’s favor open V’s; coronal C’s favor front V’s; velar C’s favor back V’s; positions C1 and C2 favor labials and coronals, respectively.

The frequency counts are performed on oral language transcripts and instantiate a productive use of corpora in phonology. They are held to support the frame-then-content theory of the syllable, viz.: frames are biomechanical CV gestalts emerging from primitive vocal tract physiology in ontogeny and phylogeny; content is, in turn, the richer phonetic inventory emerging gradually from linguistic experience.

However instigating, this theory rests on questionable data: the corpora are generally small; the statistics is limited to observed-to-expected (O/E) ratios (derived from chi-squared tables); and no effect size measure is offered.

This paper is an attempt to illuminate the lexical biases in question with substantial data from two related languages: Spanish and Portuguese. It is assumed that CV biases should be investigated in greater depth in languages with known grammar and history before they can be attributed to biomechanics.

The data on CV co-occurrence are drawn from two lexicons (about 45,000 words each) derived from public oral language databases of each language. This sample size has been validated by correlation with larger, written language databases. The phonetic coding includes the entire segment inventory given by orthography to phone conversion. Statistics includes chi-squared, Cramer’s V, and Freeman-Tukey deviates.

The results confirm most of the tendencies reported by MacNeilage and Davis, but lend themselves to a subtler interpretation, in which the role indeed played by biomechanics is subject to linguistic constraints.

Firstly, the biases only attain a non-negligible effect size when all C’s and V’s in a phonetic class are pooled together. Secondly, strength of association only becomes expressive if manner and place of articulation are separated. Thirdly, the biases are sensitive to stress and position in word. Thus, the largest effect sizes, which agree with the general trends reported by the authors, occur in initial unstressed position. Smaller appreciable effect sizes occur in medial stressed position also, but are partly due to derivational morphology. Fourthly, most of the latter biases are common to the two languages, as predictable from their history.

The findings support a view of the lexicon in which the role of biomechanics is under linguistic control. “Natural” CV biases are found in grammatically less constrained environments, while freer, mainly contrastive, C-V combination is found where grammar is at work. Grammar does, however, accommodate the less flexible biomechanical constraints, e.g., those affecting the tongue dorsum. Thus, both languages exhibit a sizeable set of productive unstressed derivational morphemes combining velar C’s with back V’s.

To conclude, our findings are not entirely congruent with MacNeilage and Davis’ claims and call for a re-examination of the Frame-then-Content theory in light of more carefully prepared corpora.

References:

MacNeilage, P.F., and B. L. Davis. (2000). “On the origin of internal structure of word forms”. Science, 288, 527-531.

Identifying discourse functions through parallel text corpora

Pavlina Aldová

Since the use of corpora predominantly relies on searches based on form, to retrieve data based on communicative functions is a more complex task. Using a parallel English-Czech corpus, my paper will exemplify that a parallel corpus can provide a tool for identification, relating and comparison of varying constructions expressing a discourse function not only in studying translation equivalents, but also to study a language (eg English) as it naturally occurs. The approach is based on using original untranslated language as primary data, and employing translation equivalents as ancillary intermediate material only.

To identify an array of means performing a given function, we start with a simple search in the English original using the selected marker to obtain a set of functional equivalents in the translated version. The results of this search are subsequently used for reversed searches to obtain other, functionally equivalent (or related) means in the original language (English). A wider list of means expressing, or related to, the given function can thus be compiled based on original untranslated language, extending our knowledge into the range of forms that are not prone to identification or comparison in the sense of a ‘specific form’ in English:

Function F expressed in English by Form A (B, C)
English (original)	Czech (translation)	English (original)
Form A →	Form X →	Form A
	→	Form B
→	Form Y →	Form A,
	→	Form C ...

To illustrate this approach, various discourse functions will be searched: permission, exclamation, and directives for the third person. In English, the core means to express appeal to the third person subjects mediated through the addressee is let. Translation equivalents indicate that this function is performed by the Czech particles a? or nech?. Assessing the corresponding results for [[a?]] and [[nech?]] in Czech translations of original English texts, it seems possible to correlate the central means let (ex 1) with other constructions, e.g. modal verbs, negative yes/no questions (2), but also to observe a tendency of English to indicate the appeal mediated through the addressee explicitly, using the infinitive construction after the second person imperative (3), or through the use of causative constructions where the addressee is the beneficiary of the action (4), including the use of the passive, or formulaic imprecations (5), etc.

Example:

	Czech (translation)	English (original)
1	"[[Ať]] si dělá hlavu."	Let him wonder
2	[[Ať]] neřve! "[[Ať]] se v Ramsey modlí za mou duši," pravila ...	Can‘t he stop shouting? ‘Ramsey may pray for my soul,’ she said serenely, ‘and that will settle all accounts.’
3	"[[Ať]] sem přijde Bannister."	After a pause, "Tell Bannister to come in."
4	[[Ať]] vám poví něco z těch pitomostí, kterým ...	Get her to tell you some of the stuff she believes.
5	"[[Ať]] jde Eustace Swayne k čertu i se všemi s ...	He said explosively, "To hell with Eustace Swayne and all his works!"

References:

Dušková, L. (2005) “Syntaktická konstantnost mezi jazyky” [Syntactic constancy across languages], Slovo a slovesnost 66, 243-260.

Hunston, S. (2006) “Corpus Linguistics”. In Encyclopedia of Language and Linguistics 2nd Edition, Elsevier, 234-248.

Izquierdo Fernández, M. (2007) “Corpus-based Cross-linguistic Research: Directions and Applications [James’ Interlingual linguistics revisited]”, Interlingüistica, 17, 520-527.

Using a corpus of simplified news texts to investigate features of the intuitive approach to simplification

David Allen

The current paper reports on the investigation of a corpus of news texts which have been simplified for learners of English. Following on from a previous study by the author (accepted by System, 2009), this paper details a number of quantitative and qualitative findings from the analysis which highlight the impact of simplification upon the linguistic features of texts, specifically relative clauses. Using quantitative methods alone is shown, in this case, to be insufficient to address the effects of simplification upon the surrounding text. A discussion of further quantitative approaches to the study of modified texts will then be presented. The study will be of interest to researchers interested in graded texts, the debate surrounding simplified vs. authentic materials in second language learning, and reading in a second language. The corpus is anticipated to serve a number of pedagogic and research purposes in the future.

Corpora and headword lists for learner’s dictionaries: The case of the Spanish Learner’s Dictionary DAELE

Araceli Alonso Campo and Elene Estremera Paños

A learner’s dictionary for non-native speakers differs from a general monolingual dictionary in several aspects: it covers less vocabulary, its definitions are usually written in a controlled vocabulary in order to facilitate comprehension, and it includes more grammatical information because the main target users, who do not have full command of the grammar of the language, need information about the use of words in a specific context. These aspects determine the selection of the main headwords to be included in the dictionary.

Learner’s dictionaries for non-native speakers are based on the assumption that users look up words in the dictionary not only for decoding texts, but also for encoding. On the one hand, the headword list should constitute a representative sample of the common vocabulary of the average speaker of the language; on the other hand, it should include those lexical units which can cause grammatical problems in text production. In addition, the dictionary should include the words and expressions that learners need for oral communication (Nation 2004).

The English language has a significant tradition in learner’s dictionaries, which means that there are many studies on the selection of the vocabulary to be included in a dictionary for non-native learners. Spanish lexicographic tradition, on the contrary, has dedicated itself mainly to elaborating monolingual dictionaries for native speakers, and as a result there are currently no studies available on how to obtain and select the headwords to be included in a Spanish learner’s dictionary.

Our paper focuses on the selection of a list of 7,000 headwords ?nouns and adjectives? for elaborating a prototype for a digital Spanish Dictionary for learners. We first briefly describe the Spanish Learner’s Dictionary for Foreigners Project DAELE (Diccionario de aprendizaje del español como lengua extranjera), which is currently underway at Universitat Pompeu Fabra in Barcelona. Our dictionary project aims to include information which is contrastive with English, as many learners around the world have experience with English or are speakers of English before they start studying Spanish. We compare four lists based on frequency of use, two of which were based on corpora (El Corpus del Español by Mark Davies and the Corpus PAAU created at the Institut Universitari de Lingüística Aplicada of Universitat Pompeu Fabra), one of which was based on lexical availability index (Ahumada 2006) and one of which was taken from a monolingual Spanish dictionary for native-speaking children (Battaner 1997). The comparison points up many differences in the distribution of nouns and adjectives, which may be summarized as follows:

* 9107 nouns and adjectives were found on only one list

* 3853 nouns and adjectives were found on two lists

* 2380 nouns and adjectives were found on three lists

* 836 nouns and adjectives were found on all four lists

We discuss why there should be such divergence across lists and, more importantly for our dictionary prototype project, the implications of using corpora to determine the headword list for learner’s dictionaries.

References:

Ahumada, I. (2006). El léxico disponible de los estudiantes preuniversitarios de la provincia de Jaén. Jaén: Servicio de Publicaciones de la Universidad de Jaén.

Battaner, P. (dir.) (1997). Diccionario de Primaria de la Lengua española Anaya-Vox. Barcelona: Vox-Biblograf.

Davies, M. (2006). A frequency dictionary of Spanish. New York: Routledge.

Nation, P. (2004) “A study of the most frequent word families in the British National Corpus, in P. Bogaards, B. Laufer (eds) Vocabulary in a Second Language. Amsterdam: John Benjamins, 3-13.

Torner, S. and P. Battaner (eds) (2005) El corpus PAAU 1992: estudios descriptivos, textos y vocabulario Barcelona: IULA, documenta universitaria. http://www.iula.upf.edu/rec/corpus92/

Making it plainer: Revision tendencies in a corpus of edited English texts

Amanda Murphy

There are currently 23 official languages in the EU, but English is widely spoken as a lingua franca within its Institutions, and most official written texts – at least within the European Commission - are drawn up in English by non-native speakers. They are subsequently translated into the other 22 official EU languages. Since the setting up of the Editing Unit in 2005, a considerable amount of these texts are voluntarily sent to native-speaker editors, who revise the texts before they are published and/or translated.

The present research examines some of the revisions in a one-million token corpus of documents from the Editing Unit. On a sample of the documents, all the revisions were classified by hand, and then examined using Wordsmith Tools (Scott 2006).

While many revisions concern objective grammatical mistakes, this paper focuses on two other aspects of language naturalness, one lexical and one syntactic. Regarding lexis, the paper focuses on revisions of common colligational-collocational patterns such as transitive verb + direct object, where the editor changes the verb. Examples of this type of revision include:

* To perform responsibilities (edited version: to fulfil responsibilities);

* To give a contribution (edited version: to make a contribution);

* To take a commitment (edited version: to make a commitment);

* To perform a study (edited version: to conduct a study);

* To match gaps (edited version: to fill gaps);

* To realize targets (edited version: to meet targets).

Noun phrases where both the pre-modifying adjective and noun are substituted with a more natural-sounding pair are also examined. Examples of these include great discussion (changed to major debate) or important role (changed to big part).

The second focus is on revisions beyond the phrase. By examining larger chunks of text, whole sentences or groups of sentences, two opposing tendencies can be observed: on the one hand, simplification, which shortens and ‘straightens out’ sentences, and on the other, explicitation, which adds elements for the sake of clarity, and generally lengthens sentences.

Reflections are made on the potential of using such texts for teaching advanced level students to write more plainly, along the lines of Cutts (2004) and Williams (2007).

References:

Cutts, M. (2006). The Plain English Guide. Oxford: Oxford University Press.

Scott, M. (2006). Wordsmith Tools. Version 4.0. Oxford: Oxford University Press.

Williams, J. M. (1997). Style: Toward Clarity and Grace, Chicago: The University of Chicago Press.

Word order phenomenain spoken French: A study on four corpora of task-oriented dialogue and its consequences on language processing.

Jean-Yves Antoine, Jérome Goulian, Jeanne Villaneau and Marc Le Tallec

We are presenting a corpus study that investigates the question of word order variations (WOV) in a spoken language (namely French) and its consequences on the parsing techniques that are used in Natural Language Processing (NLP). It is a common practice to distinguish between free or rigid word order languages (Hale 1983, Covington 2000). This presentation aims at examining this question according to two directions:

1) We investigate whether language registers (Biber 1988) have an influence on WOV. French is usually considered as a rigid order language. However, our observations on task oriented interactions tend to show that sponntaneous spoken French presents a higher variability. This study aims at quantifying this influence.

2) WOV should result in the presence of syntactic discontinuities (Hudson 2000 ; Bartha and al. 2006) which can not be parsed by projective parsing formalisms (Holan and al. 2000). Given a particular interactive situation (task oriented Human-Machine Interaction), we estimate the occurrence frequency of discontinuities in spoken French.

Our study investigates WOV on four task-oriented spoken dialogue corpora which concern three different application tasks :

* air transport reservation (Air France Corpus)

* tourism information (Murol corpus and OTG corpus)

* switchboard call (UBS corpus)

Two corpora (Air France, UBS) concern phone conversations while the two other are corresponding to direct interaction.

Every word order variation has been manually annotated by 3 experts, following a cross-validation procedure. The annotation gives a detailed description of the observed variations : type (inversion, extraction, cleft sentence, pseudo-cleft sentence), direction (ante-position or post-position), syntactic function of the extracted element and the eventual presence of a discontinuity.

This study shows that conversational spoken French should be affected by a high rate of WOV. The occurrence of word order variations increases significantly with the degree of interactivity of the spoken dialogue. For instance, around 26.6% of the speech turns of the Murol corpus (informal interaction with a high interactivity) present a word order variation, while the other corpora (moderate interactivity) are less concerned (from 11.5% to 13.6% of the speech turns). This difference is statistically significant.

Despite this observation, our results show that task-oriented spontaneous spoken French should still be considered as a rigid order language : most the observed variations correspond to weak variations (Holan et al. 2000) which result very rarely in discontinuous syntactic structures (from 0.2% to 0.4% of the speech turns according to the considered corpus). Non-projective parsers remains therefore well adapted to conversational spken French.

Besides this important result, this study shows that WOV follow some impressive regularities :

* antepositions are prefered to postpositions (82.5% to 91.1% of the variations) and respect usually the SVO order (98.0% to 98.7% of the speech turns)

* subjects are significantly more affected (25.4% to 42.8% of the observed variations) than objects (5.3% to 15.4%), while order varations concern also modifiers and phrasal complements.

* Most of subject WOV (95.4% to 100%) are lexically (pronoum) or syntactically (cleft or pseudo-cleft sentence) marked, while modifiers variations result usually in a simple inversion (84.9% to 96.8 %).

References:

Bartha, C., T. Spiegelhaue., R. Dormeyer and I. Fischer. (2006). “Word order and discontinuities in dependency grammar”. Acta Cybernetica, 17(3), 617-632.

Biber, D. (1988). Variation across speech and writing, Cambridge University Press, Cambridge.

Holan T., O. Kubon and M. Plátek M. (2000). “On complexity of word order.” TAL, 41(1), 273-300.

Arabic and Arab English in the Arab world

Eric Atwell

A broad range of research is undertaken in Arab countries, and researchers have carried out and reported on their research using either English or Arabic (or both). English is widely accepted as the international language of science and technology, but little formal research has been done on the English used in the Arab world. Is it dominated by British or American English influences? Or is it a recognisable regional variant, on a footing with Indian English or Singapore English?

Arabic is the first language of most Arabs, but contemporary use shows noticeable variation from Modern Standard Arabic. Some Arab researchers may feel doubly stigmatised, in that their local variant of Arabic differs from MSA, and their English differs from UK or US standard English.

(Al-Sulaiti and Atwell 2006) used the World Wide Web to gather a representative selection of contemporary Arabic texts from a wide range of Middle Eastern and North African countries, approximately one million words in total. We have used WWW sources to gather selections of contemporary English texts from across the Arab region. We have collected a Corpus of Arab English, parallel to the International Corpus of English used to study national and regional varieties of English. We then used text-mining tools to compare these against UK and US English standards, to show that some countries in the Arab world prefer British English, some prefer American English, and some fit neither category straightforwardly.

Natural Language Processing and Corpus Linguistics research has previously focussed on British and American English, and applying these techniques to Arab English presents interesting new challenges. I will present our resources to the Corpus Linguistics audience, to elicit useful applications of these resources in Arab English research, and also to receive requests and/or suggestions for extensions to our corpus which the community might find useful. My goal is to document the contemporary use of Arab English and Arabic across the Arab world, and develop computational resources for both; and to raise the status of both Arab English and Arabic, so they are recognised as different but equal alongside American English and British English in the Arab world and beyond.

References:

Al-Sulaiti, L. and E. Atwell. (2006). “The design of a corpus of contemporary Arabic”. International Journal of Corpus Linguistics, 11, 135-171.

Civil partnership – “Gay marriage in all but name”!? Uncovering discourses of same-sex relationships

Ingo Bachmann

It has been demonstrated (Gabrielatos and Baker 2008, Teubert 2001, Baker 2005) that corpus linguistic techniques are an appropriate means of uncovering and critically analysing discourses – be that discourses of refugees and asylum seekers, of Euro-sceptics or of gay men.

The present paper aims to contribute to the study of discourses of gay men, but with a shifting focus: whereas so far gay (single) men and the question of identity were concentrated on (Baker 2005), I want to show how gay men (and to some extent lesbian women) in relationships are represented and what discourses of same-sex relationships are prevalent.

As a suitable starting point for such an analysis I singled out the debates in the Houses of Parliament in 2004 which lead to the introduction of civil partnerships in the United Kingdom. Since the end of 2005 gay and lesbian couples in the UK have been able to register their relationship as a civil partnership, thus transferring to them more or less the same rights and responsibilities that heterosexual married couples enjoy. As the concept of civil partnership had not existed before, it was during these debates that it was decided upon if civil partnerships should be introduced at all, what rights and responsibilities they entail and who actually is allowed to register for them.

After compiling a corpus containing all the relevant debates in both Houses, a statistical keyword analysis (WordSmith Tools) was conducted. These keywords are categorised and they cluster around ideas such as

* the concept of a relationship (long-term, committed).

* the controversial issue if civil partnership is actually gay marriage in all but name and if it undermines the institution of marriage.

* legal implications of the bill (tax, equality, rights)

* identity and sexual orientation – same-sex emerges as a new term next to homosexual, gay and lesbian, reflecting a shift from a focus on sexual acts via a person's identity to the relationship of two people.

Keywords from each cluster are investigated in context to try and uncover discourses of civil partnership.

References:

Baker, P. (2005). Public discourses of gay men. London: Routledge.

Gabrielatos, C. and P. Baker (2008). "Fleeing, Sneaking, Flooding: A Corpus Analysis of Discursive Constructions of Refugees and Asylum Seekers in the UK Press, 1996- 2005." Journal of English Linguistics, 36 (1), 5-38.

Teubert, W. (2001). "A province of a federal superstate, ruled by an unelected bureaucracy. Keywords of the Euro-sceptic discourse in Britain." In A. Musolff, C. Good, P. Points and R Wittlinger (eds) Attitudes towards Europe. Language in the Unification Process. Aldershot: Ashgate, 45-88.

The British English '06 Corpus - using the LOB model to build a contemporary corpus from the internet.

Paul Baker

This paper describes the collection of the BE06 corpus, a one million word corpus of contemporary written British English texts which followed the same sampling framework as the LOB and Brown corpora. Problems associated with collecting 500 texts via the internet (the texts were all previously published in paper form) are described, and some preliminary findings from the corpus data are given, including comparisons of the data to equivalent corpora from the 1930s, 1960s and 1990s. The paper also addresses the issue of whether diachronic corpora can ever be really 'equivalent' because genres of language do not remain constant.

Corpus tagging of connectors: The case of Slovenian and Croatian academic discourse

Tatjana Balažic Bulc and Vojko Gorjanc

Two issues stand out as a special challenge in corpus research on connectors in Slovenian and Croatian academic discourse. The first issue concerns tagging connectors, which explicitly signal textual cohesion and also establish logical or semantic connections between parts of a text through their meaning and function and, at the same time, indicate the relations between the constituent parts of a text. Our research used a corpus-based approach; that is, we did not tag connectors according to a predefined list, but instead defined them on the basis of two corpora constructed for this purpose, comprised of 19 Slovenian and 17 Croatian original scholarly articles in linguistics. In the first phase, connectors were tagged manually and therefore the sample is somewhat smaller. In the second phase, we tested various methods of automatic connector tagging, making it possible to analyze a larger corpus and yielding more relevant statistical data. We continued by comparing variation in the frequency of hits for both tagging methods. As in other languages, automatic tagging in Slovenian and Croatian is hindered by morphosyntactic and discourse ambiguities because certain connectors are synonymous with conjunctions, adverbs, and so on; there are also a considerable number of synonymous connectors that serve various functions in a text. The relevance of hits with automatic tagging in Slovenian and Croatian is also encumbered by the inflectability of these parts of speech and quite flexible word order, and so the validity of hits using automatic tagging is especially questionable. Automatic tagging also indirectly changes the research methodology because a corpus tagged in advance is already marked by a particular theory, which makes a corpus-based approach impossible.

Tagging connectors also raises a second issue: text segmentation. Specifically, it turned out that the function and position of connectors is considerably more transparent within smaller text units than in an entire text. Various attempts at text segmentation exist in corpus linguistics; for example, the Linguistic Discourse Model (LDM), Discourse GraphBank Annotation Procedure, and the Penn Discourse Treebank (PDTB). These all have different theoretical premises, but they all define the boundaries of individual text segments using orthographic markers such as periods, semicolons, colons, and commas. Our study also confirmed that investigating connectors as explicit links between two sentences is not sufficient because they also often appear within a sentence. In this sense we segmented the text into smaller units that we refer to as utterances and we understand these as syntactically and semantically complete contextualized units of linguistic production, whether spoken or written.

A comparative study of binominals of the keyword democracy in corpora of the US and South Korean press

Minhee Bang and Seoin Shin

This paper presents findings of a comparative analysis of collocational patterns of the word democracy with a focus on noun collocates occurring as part of binominals (e.g. democracy and freedom) in the corpora of two US and South Korean newspapers. Democracy is identified as one of the keywords of the US newspaper corpus of 42 million words, which comprises of foreign news reports taken from the New York Times and Washington Post from 1999 to 2003. This illustrates that democracy is a major issue in the context of foreign news reporting by the US press. The analysis of noun phrases occurring with democracy in the US newspaper corpus shows different degrees in the strength of the semantic link of the noun phrases with democracy. The semantic link between democracy and noun phrases such as freedom or human rights is relatively established and uncontroversial, whereas the semantic link is less transparent in democracy and noun phrases such as stability or independence. Semantically, by being coordinated as a binominal, it is possible that a semantic link is ‘manufactured’ between democracy and these noun phrases, such that democracy is contingent on the things represented by these noun phrases, or democracy brings about these things. The analysis also shows that certain noun phrases tend to be associated with references to certain countries or regions (for example, noun phrases referring to a capitalist economy, such as free market economy most frequently occur in the context of Russia). The noun phrases paired with democracy seem to reveal where the US interest is vested in its relations with different foreign countries. It is thus hypothesised that noun phrases occurring as binominals with democracy may differ in newspapers from other countries, reflecting the countries’ own perspective and interest. In order to test this hypothesis, binomials of democracy are analysed in a corpus of news articles collected from the two South Korean newspapers, the Dong-A Ilbo and the Hankyoreh, which are two of the most influential broadsheet newspapers in South Korea. The corpus covers the same period between the year 1999 and 2003, and is of a relatively modest size of 200 thousand words. The analysis will show that there are considerable differences in frequency and types of noun phrases occurring with democracy in the South Korean context. As a positively connoted ‘banner’ word (Teubert, 2001: 49), it is argued that democracy is used to highlight and justify what is deemed to be in the country’s own economic and political interest.

Analysing the output of individual speakers

Michael Barlow

Virtually all corpus linguistic analyses have been based on corpora such as the British National Corpus, which are an amalgamation of the writing or speech of many individuals, an exception being Coniam (2004) which examined the writing of one individual.

This presentation examines the speech of individuals in order to (i) determine the extent of the differences in the patterns found in an amalgamated corpus and an idiolectal corpus; (ii) investigate whether individual speakers exhibit a characteristic fingerprint; and (iii) to assess the nature of the variation from one speaker to the next.

To pursue this study, the speech of five White House press secretaries was examined.

One advantage of analysing White House press conferences is the fact that a large number of transcripts are available and the context in which the speech is produced is held constant across the different speakers. The sample covered around of year of output for each speaker. The transcripts were downloaded from the web, transformed to a standard format, and POS-tagged. In addition, the speech of all participants other than the press secretary was removed, making for the rather odd format shown below.

<Q>

<PERINO> I'll let the DNI's office disclose that if they want to. We never --

<Q>

<PERINO> I don't know.

The data was divided into slices of 200,000 words. The amount of data available for each of the speakers ranged from 200,000 words for two speakers to 1,200,000 words for one of the speakers. This format made it easy to compare inter- and intra-speaker variability. The results of bigram and trigram analyses reveal the distinctive patterns of speech of each speaker.

If, for example, we examine the ranked bigram profiles for four speakers, we can see differences in use of the most frequent bigrams -- the president, of the, going to, in the, I think, and the, to the, that we, on the, that the, think that, to be, I don't, continue to, and we have. In these graphs, the y axis represents rank, which means that the shorter the line, the more frequent the bigram. The lines extending all the way down represents bigrams with a ranking greater than 60. Each line on the x axis represents the rank of one bigram. The left side of the diagram display the profiles for three samples of the output of Mike; the right hand side contains profiles of three different press secretaries. It can be seen that the bigram profile for the speaker Mike are very similar across the different 200,000 word samples and that these profiles differ from that of the other speakers shown.

The theoretical consequences of these and other results will be discussed in the paper.

References:

Coniam,D. (2004). “Concordancing oneself: Constructing individual textual

Profiles”. International Journal of Corpus Linguistics, 9 (2), 271–298.

Automatic standardization of texts containing spelling variation:

How much training data do you need?

Alistair Baron and Paul Rayson

Spelling variation within corpora results in a considerable effect on corpus linguistic techniques; this has been shown particularly for Early Modern English, the latest period of the English language to contain large amounts of spelling variation. The accuracy of key word analysis (Baron et al., 2009), part-of-speech annotation (Rayson et al., 2007) and semantic analysis (Archer et al., 2003) are all adversely affected by spelling variation.

Some researchers have avoided this issue by using modernized versions of texts; for example, Culpeper (2002) opted to use a modern edition of Shakespeare’s Romeo and Juliet in his study to avoid spelling variation affecting his statistical results. However, modern editions are not always readily available for historical texts. Another potential solution is to manually standardize the spelling variation within these texts, although this approach is likely to be unworkable with the large historical corpora now being made available through digitization initiatives such as Early English Books Online1.

The solution we offer is a piece of software named VARD 2. The tool can be used to manually and automatically standardize spelling variation in individual texts or corpora of any size. An important feature of VARD 2 is that through manual standardization of corpus samples by the user, the tool “learns” how best to standardize the spelling variation in a particular corpus being processed. This results in VARD 2 being better equipped to automatically standardize the remainder of the corpus. After automatic processing, corpus linguistic techniques can be used with far more accuracy on the standardized version of the corpus, and this avoids the need for difficult and time-consuming manual standardization of the full text.

This paper studies and quantifies the effect of training VARD 2 with different sample sizes, calculating the optimum2 sample size required for manual standardization by the user. To test the tool with Early Modern English, the Innsbruck Letters corpus, part of the Innsbruck Computer-Archive of Machine-Readable English Texts (ICAMET) corpus (Markus, 1999) has been used. The corpus is a collection of 469 complete letters dated between 1386 and 1688, totaling 182,000 words. The corpus is particularly useful here as it has been standardized and manually checked with parallel lines of original and standardized text.

Whilst VARD 2 was built initially to deal with spelling variation in Early Modern English, it is not restricted to this particular variety of English, or indeed the English language itself. There are many other language varieties other than historical containing spelling variation; these include: first and second language acquisition, SMS messaging, weblogs, emails and other web based communication platforms such as chat-rooms and message boards. VARD 2 has the potential to deal with this large variety of spelling variation in any language, although training will be required for the tool to “learn” how to best automatically standardize corpora. The second part of this paper evaluates VARD 2’s capability in dealing with one of these language varieties, namely a corpus of children’s written English.

References:

Archer, D., T. McEnery, P. Rayson and A. Hardie (2003). “Developing an

automated semantic analysis system for Early Modern English”. In D. Archer, P. Rayson,

A. Wilson and T. McEnery (eds) Proceedings of the Corpus Linguistics 2003

conference. UCREL technical paper number 16., 22 - 31.

Baron, A., P. Rayson and D. Archer (2009). “Word frequency and key word

statistics in historical corpus linguistics”. In R. Ahrens and H. Antor (eds) Anglistik: International Journal of English Studies, 20 (1), 41-67.

Culpeper, J. (2002). “Computers, language and characterisation: An analysis of six characters in Romeo and Juliet”. In U. Merlander-Marttala, C Ostman and M. Kytö (eds) Conversation in Life and in Literature: Papers from the ASLA Symposium, 15. Uppsala: Universitetstryckeriet, 11-30.

Rayson, P., D. Archer, A. Baron, J. Culpeper and N. Smith (2007). “Tagging the

Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English Corpora”. In Proceedings of Corpus Linguistics 2007, July 27-30, University of Birmingham, UK.

Fat cats, bankers, greed and gloom - the language of financial meltdown

Del Barrett

This paper offers a two-part dissection of the language used in press reporting of the credit crunch.

The starting point for part 1 is a keyword analysis of two general corpora comprising news items from the tabloid and broadsheet press. This is then used as a control against which the discourse of the credit crunch is analysed through an investigation of keywords, pivotal words and collocates. The initial hypothesis was that the language of the broadsheets is beginning to show characteristics normally associated with the tabloid press, particularly through headlines designed to sensationalise. In testing this hypothesis, it was revealed that while the broadsheets are certainly becoming more tabloid through their lexical choices, the tabloids are also changing and have started to adopt broadsheet characteristics in their reporting of ‘systemic meltdowns’ and ‘financial turbulence’.

The second part of the paper examines the way that the ‘credit crunch’ itself is portrayed. Using a frame-tracing technique, which combines elements of frame theory and (critical) discourse analysis, the study identifies a number of ‘discourse scenarios’ which are constructed through collocate chains. Investigation of these scenarios reveals a clear distinction between the way in which the broadsheets and the tabloids view the current financial crisis. The former invokes a strong storm metaphor indicating that the credit-crunch is a force of nature and that we are all helpless victims, whilst the latter sees the crisis as a war that has been declared by an enemy. Again, we are helpless victims as the other side seems to be winning, but there are strategies that can be employed to win some of the battles. In other words, the broadsheets do not apportion blame for the crisis, whereas the tabloids place the blame firmly on the shoulders of the ‘Other’. It is particularly interesting that the ‘Other’ is not one of the minority groups commonly found in critical discourse studies, but rather the ‘fat cat bankers’ – a label ascribed to anyone working in the financial sector. A number of other lesser scenarios are also uncovered, in which the credit-crunch is reified as a thug, a monster, a villain and a pandemic.

A corpus analysis of English article use in Chinese students’ EFL writing

Neil Barrett, Zhi-jun Yang, Yue-wen Wang and Li-Mein Chen

It is a well known fact that many English learners have trouble using the English article system and it is even more troublesome for students whose first language doesn’t contain an article system, such as Mandarin Chinese. This can cause problems in learner’s academic writing where accuracy is lost due to article errors. In order to investigate these article problems in academic writing, a corpus based coding system was developed to analyze Chinese EFL student’s compositions. This was based on a larger, learner based corpus into Chinese EFL learner’s coherence problems in academic writing (AWTA learner’s corpus). First, over 600 essays were collected from 1st, 2nd and 3rd year undergraduate students who studied English composition. Following this, an article coding system was constructed using Hawkins (1978) Location Theory with adaptations based on research by Robertson (2000) and Moore (2004). This system identifies the pragmatic, semantic and syntactic noun phrase environments including tags for other determiners, adjectives and numerals which have been identified (Moore, 2004) as having an effect on article errors. The coding system was used for a preliminary analysis of article usage and article errors in 30 papers taken from the learner’s corpus. Results show that all learners were able to distinguish between definite and indefinite contexts in terms of article use but problems occurred in 24% of the noun phrases. Two main error types were indentified, the first of which involves a misanalysis of noun countability for the indefinite article and for the null article. For example: ‘This didn’t help her grandpa acquire admirable education’. The second error was the overuse or underuse of the definite article. This error frequently occurred in generic, plural nouns and non count nouns where a native speaker would typically use the null article. The following example illustrates this: ‘the modern women are different to the women in the past’. All the data and the coding system can be accessed online from the learner’s corpus, providing an additional resource for teachers to help identify article errors in student’s writing.

Patterns of verb colligation and collocation in scientific discourse

Sabine Bartsch

Studies of scientific discourse tend to focus on those linguistic units that are deemed to convey information regarding scientific concepts. In light of the observation that a high frequency of nouns, nominalizations and noun phrases is characteristic of scientific writing, these linguistic units have received a lot of due attention. Yet, it is indisputable that verbs are likewise central to the linguistic construal and communication of meaning in scientific discourse, not only because many verbs have domain-specific meanings, but also in view of the fact that verbs establish the relations between concepts denoted by nouns (cf. Halliday & Martin 1993).

This paper presents a corpus-study of the lexico-grammatical and lexical combinatorial properties of verbs in scientific discourse, i.e. colligation and collocation (cf. Hunston & Francis 2000; Sinclair

1991; Hoey 2005; Gledhill 2000). The study is based on a corpus of scientific writing from engineering, the natural sciences and the humanities. The focus of the paper is on the contribution of verbs of communication (e.g. 'explain', 'report') and construal of scientific knowledge (e.g. 'demonstrate', 'depict') to the textual rendition of scientific meaning. The classification of the verbs is based on Levin (1993) and the FrameNet project.

Many of these verbs are of interest as an object of research on scientific texts because of their discursive function in the communication of the scientific research process. Their occurrence in specific colligational patterns is indicative of their function. For example, 'that'-clauses following communication verbs can be shown to be frequently employed in the reporting of research results, whereas 'to'-infinitive clauses tend to present the scientist's ideas or statements of known facts.

Furthermore, verbs of the types under study here play a central role in establishing connections between what is said in the natural language text and other modalities such as graphics, diagrams, tables and formulae which are commonly employed in scientific text in order to present experimental settings, quantitative data or symbolic representations ('Figure 1.1 demonstrates ...'). They are thus instrumental in the establishment of relations between different modalities in multimodal texts which is also reflected in specific colligational patterns. Another aspect that is of interest in the study of verbs of communication and construal of scientific knowledge are patterns of lexical co-occurrence, i.e. the specific patterns of collocation. Many of the verbs under study display characteristic patterns of lexical collocation which highlight domain-specific aspects of their potential meaning (e.g. ‘represent the arithmetic mean’).

Results of this study include qualitative and quantitative findings about characteristic patterns of colligation of the verbs under study as well as a profile of recurrent patterns of collocation. The study also reveals some interesting semantic patterns in the characteristic participant structure of these verbs in the domains represented in the corpus.

References:

FrameNet Project (URL: http://framenet.icsi.berkeley.edu/) Last visited: 20.01.2009.

Gledhill, C. (2000). “The discourse function of collocation in research article introductions”, English for Specific Purposes, 19, 115-135.

Halliday, M. A. K. and J.R. Martin. (1993). Writing Science. Literacy and Discursive Power. Washington, D.C.: The Falmer Press.

Hoey, M. (2005). Lexical Priming: A New Theory of Words and Language. London: Routledge.

Hunston, S., and G. Francis. 2000. Pattern Grammar: A corpus-driven approach to the lexical grammar of English. Amsterdam: John Benjamins.

Kjellmer, G. (1991). "A mint of phrases." In K. Aijmer and B. Altenberg (eds) English

Corpus Linguistics. London, New York: Longman, 111–127.

Levin, B. (1993). English Verb Classes and Alternations. A Preliminary Investigation. Chicago: The University of Chicago Press.

Putting corpora into perspective: Sampling strategies for synchronic corpora

Cyril Belica, Holger Keibel, Marc Kupietz, Rainer Perkuhun and Marie Vachková

The scientific interest of most corpus-based work is to extrapolate observations from a corpus to a specific language domain. In theory, such inferences are only justified when the corpus constitutes a sufficiently representative sample of the language domain. Because for most language domains the representativeness of a corpus cannot be evaluated in practice, the sampling of corpora usually seeks to approximate representativeness by intuitively estimating some qualitative and quantitative properties of the respective language domain (e.g., distribution and proportions of registers, genres and text types) and requiring the corpus to roughly display these properties, as far as time, budget and other practical constraints permit.

When the object of investigation is some present-day language domain (e.g., contemporary written German), the question arises how its dependency on time is to be represented in a corresponding synchronic corpus. A practical solution typically taken is to consider only language material that was produced in some prespecified time range (e.g., 1964-1994 for the written component of the British National Corpus). Sometimes, the corpus is additionally required to be balanced across time (i.e., to contain roughly the same amount of texts or running words for each year).

However, this practical solution is not satisfactory: as language undergoes continuous processes of change, more recent language productions are certainly better candidates for being included in a synchronic corpus than are older ones. On the other hand, restricting a corpus to the most recent data (e.g., to language productions of the last six months) would exclude a lot of potential variety in language use and capture only those phenomena that happened to have a sufficient number of occasions to become manifest in this short time period. Consequently, rare phenomena might not appear at all in an overly short time period, and more frequent phenomena might be over- or underrepresented, due to statistical and language-external factors rather than to language change. This is clearly not an appropriate solution either: the relevance of phenomena as part of a given synchronic language domain does not suddenly decrease when their observed frequency has decreased recently, and in particular, rare phenomena do not suddenly cease to be a part of the synchronic language domain just because they did not occur at all in the available sample of recent texts.

In order to reconcile the effects of language change with statistical and language-external effects, this paper proposes a solution that does without any sharp cut-off and instead takes a vanishing point perspective which posits a gradually fading relevance of time slices with increasing “age”. The authors argue from a corpus-linguistic point of view (cognitive arguments are presented elsewhere) that for synchronic studies such a vanishing point perspective on language use is a more adequate approach than the common bird's eye view where all time slices are weighted equally. The authors further present language-independent explorations on how a vanishing point perspective can be achieved in corpus-linguistic work, by what formal functions the relevance of time slices is to be modelled and how the choice of a particular function may be justified in principle. Finally, they evaluate, for the case of contemporary written German, the empirical consequences of using a synchronic corpus in the above sense for phenomena such as simple word frequency, collocational patterning and the complex similarity structure between collocation profiles.

A parsimonious measure of lexical similarity for the characterisation of latent stylistic profiles

Edward J. L. Bell, Damon Berridge and Paul Rayson

Stylometry is the study of the computational and mathematical properties of style. The aim of a stylometrist is to derive accurate stylometrics and models based upon those metrics to gauge stylistic propensities. This paper presents a method of formulating a parsimonious stylistic distance measure via a weighted combination of both non-parametric and parametric lexical stylometrics.

Determining stylistic distance is harder than black-box classification because the algorithm must not only assign texts to a probable category (are the works penned by the same or different authors); it must also provide the degree of stylistic similarity between texts. The concept of stylistic distance was first indirectly broached by Yule (1944) when he derived the characteristic constant K from vocabulary frequency distributions. Recent work has focused on judging distance by considering the intersection of lexical frequencies between two documents (Labbé, 2007).

We tackle the problem with a ratio of composite predictors, the higher the ratio the more the styles diverge. The coefficients of the authorship predictor are estimated using Powells conjugate gradient method of function minimisation (Powell, 1964) on a corpus of 19th Century literature.

The corpus comprises of 248 fictional works by 103 authors and totals to around 40 million words.

The utility of the selected stylometrics is initially demonstrated by performing an exploratory analysis on the most prolific authors in the corpus (Austen, Dickens, Hardy and Trollope). The exploratory model shows that each author is characterised by a unique set of stylometric values. This unique set of values corresponds to a lexical subset of the author’s latent stylistic profile.

After estimating the authorship ratio coefficients, the metric is used to compare large samples of text and empirically assess the stylistic distance between the authors of those samples. The discrimination ability of the method proves accurate over 30,000 binary comparisons and rivals the discernment aptitude of established techniques; namely support vector machines. Finally it is shown that the proposed authorship ratio fairs well in terms of discriminatory aptitude and sample size invariance when compared to the intertextual distance metric (Labbé and Labbé. 2001).

References:

Labbé, C. and D. Labbé (2001). “Intertextual distance and authorship attribution -- Corneille and Moliere”. Journal of Quantitative Linguistics, 8(19):213–231.

Labbé, D. (2007)” Experiments on authorship attribution by intertextual distance in English”. Journal of Quantitative Linguistics, 14(48):33–80.

Powell, M. J. D. (1964) “An efficient method for finding the minimum of a function of several variables without calculating derivatives”. The Computer Journal, 7(2):155– 162.

Yule, G. U. (1994) The Statistical Study Of Literary Vocabulary. Cambridge University Press, Cambridge.

Annotating a multi-genre corpus of Early Modern German

Paul Bennett, Martin Durrell, Silke Scheible and Richard J. Whitt

This study addresses the challenges in automatically annotating a spatialised multi-genre corpus of Early Modern German with linguistic information, and describes how the data can be used to carry out a systematic evaluation of state-of-the-art corpus annotation tools on historical data, with the goal of creating a historical text processing pipeline.

The investigation is part of an ongoing project funded jointly by ESRC and AHRC whose goal is to develop a representative corpus of Early Modern German from 1650-1800. This period is particularly relevant to standardisation and the emergence of German as a literary language, with the gradual elimination of regional variation in writing. In order to provide a broad picture of German during this period, the corpus includes a total of eight different genres (representing both print-oriented and orally-oriented registers), and is ‘spatialised’ both temporally and topographically, i.e. subdivided into three 50-year periods and the five major dialectal regions of the German Empire. The corpus consists of sample texts of 2,000 words, and to date contains around 400,000 words (50% complete). The corpus is comparable with extant English corpora for this period, and promises to be an important research tool for comparative studies of the development of the two languages.

In order to facilitate a thorough linguistic investigation of the data, we plan to add the following types of annotation:

- Word tokens

- Sentence boundaries

- Lemmas

- POS tags

- Morphological tags

Due to the lexical, morphological, syntactic, and graphemic peculiarities characteristic of this particular stage of written German, and the additional variation introduced by the three variables of genre, region, and time, automatic annotation of the texts poses a major challenge. While current annotation tools tend to perform very well on the type of data on which they have been developed, only few specialised tools are available for processing historical corpora and other non-standard varieties of language. The present study therefore proposes to do the following:

(1) Carry out a systematic evaluation of current corpus annotation tools and assess their robustness across the various sub-corpora (i.e. across different genres, regions, and periods)

(2) Identify procedures for building/improving annotation tools for historical texts, and incorporate them in a historical text processing pipeline

Point (1) is of particular interest to historical corpus linguists faced with the difficult decision of which tools are most suitable for processing their data, and are likely to require the least manual correction. Point (2) utilises the findings of this investigation to improve the performance of existing tools, with the goal of creating a historical text processing pipeline. We plan to experiment with a novel spelling variant detector for Early Modern German which can identify the modern spelling of earlier non-standard variants in context, similar to Rayson et al.’s variant detector tool (VARD) for English (2005), and complementing the work of Ernst-Gerlach and Fuhr (2006) and Pilz et al. (2006) on historic search term variant generation in German.

References:

Ernst-Gerlach, A. and N. Fuhr (2006). “Generating Search Term Variants for Text Collections with Historic Spellings”. In Proceedings of the 28th European Conference on Information Retrieval Research.

Pilz, T., W. Luther, U. Ammon and N. Fuhr (2006). ‘Rule-based Search in Text Databases with Nonstandard Orthography’. Literary and Linguistic Computing, 21(2), 179–186.

Rayson, P., D. Archer and N. Smith (2005). “VARD vs. Word: A comparison of the UCREL variant detector and modern spell checkers on English historical corpora”. In Proceedings of Corpus Linguistics 2005, Birmingham.

A tool for finding metaphors in corpora using lexical patterns

Tony Berber Sardinha

Traditionally, retrieving metaphor from corpora has been carried out with general-purpose corpus linguistics tools, such as concordancers, wordlisters, and frequency markedness identifiers. These tools present a problem for metaphor researchers in that they require that users know which words or expressions are more likely to be metaphors in advance. Hence, researchers may turn to familiar terms, words that have received attention in previous research or have been noticed by reading portions of the corpus, and so on. To alleviate this problem, we need a computer tool that can process a whole corpus, evaluate the metaphoric potential of each word in the corpus, and then present researchers with a list of words and their metaphoric potential in the corpus. In this paper, I present an online metaphor identification program that retrieves potential metaphors from both English and Portuguese corpora. This is currently the only publicly available such tool. The program looks for ‘metaphor candidates’, or metaphorically used words. A word is considered metaphorically used when it is part of a linguistic metaphor, which in turn means a stretch of text in which it is possible to interpret incongruity between two domains from the surface lexical content (Cameron 2002: 10). Analysts may then inspect the possible metaphorically used words suggested by the program and determine whether they are part of an actual metaphor or not. The program works by analyzing the patterns (bundles and collocational framework) and part of speech of each word and then matching these patterns to the information in its databases. The databases are the following: (1) word database, which records the probability of individual words being used metaphorically in the training corpora; (2) a left bundles database, with the probability of bundles (3-grams) occurring in front of a metaphorically used word in the training data; (3) a right bundles database (3-grams), with the probability of bundles occurring to the right of a metaphorically used word in the training corpora; (4) frameworks database, with the probability of frameworks (the single words immediately to the left and right) occurring around a metaphorically used word in the training data (e.g. ‘a … of’, etc.); (5) a part of speech database, with the probability of each word class being used metaphorically in the training corpora. The training corpora were the following: for Portuguese, (1) a corpus of conference calls held in Portuguese by an investment bank in Brazil, with 85,438 tokens and 5,194 types, and (2) the Banco de Português (Bank of Portuguese), a large, register-diversified corpus, containing nearly 240 million words of written and spoken Brazilian Portuguese; for English, the British National Corpus, from which a sample of 500 types was taken; each word form was then concordanced, and each concordance was hand analyzed for metaphor. In addition to a description of the way the tools were set up, the paper will include a demonstration of the tools, examples of analyses, discuss problems and propose possible future developments.

References:

Cemeron, L. (2002). Metaphor in Educational Discourse. London: Continuum.

Institutional English in Italian university websites: The acWaC corpus

Silvia Bernardini, Adriano Ferraresi and Fredrico Gaspari

The study of English for Academic Purposes, including learner and lingua franca varieties, has been a constant focus of interest within corpus linguistics. However, the bulk of research has been on disciplinary writing (see e.g. L. Flowerdew 2002, Cortes 2004, Mauranen 2003). Institutional university language (e.g., course catalogues, brochures) remains an understudied subject, despite being of great interest, both from a descriptive point of view -- Biber (2006:50) suggests that it is "more complex than any other university register" -- and from an applied one. Institutional texts, especially those published on the web, are read by a vast audience, including, crucially, prospective students, who more and more often make their first encounter with a university through its website. Producing appropriate texts of an institutional nature in English is especially relevant for institutions in non-English speaking countries, nowadays under increasing pressure to make their courses accessible to an international public.

Focusing on Italy as a case in point, our paper looks at institutional university language as used in the websites of Italian universities vs. universities in the UK and Ireland (taken as examples of native English standards within the EU). Two corpora were created retrieving a set number of English web pages published by a selection of universities based in these countries. Our aim was to obtain corpora which would be comparable in terms of size and variety of text types, and excluding as far as possible disciplinary texts (e.g. research articles). The same construction procedure was adopted for the Italian corpus and for the UK/Irish corpus: we used the BootCaT toolkit (Baroni and Bernardini 2004), relying on Google's language identifier and excluding .pdf files from the search.

Since the procedure is semi-automatic and allows limited control over the corpus contents (a general issue with web-as-corpus resources, see Fletcher 2004), the preliminary part of the analysis is aimed at assessing to what extent the two corpora may be considered as comparable in terms of topics covered and (broadly speaking) text types included. To this end, a random sample of pages from each corpus is manually checked, following Sharoff (2006).

The bulk of the paper is a double comparison of the two corpora. First, word and part-of-speech distributions (of both unigrams and n-grams) across the two corpora are compared. Second, taking as our starting point the analysis of institutional university registers in Biber (2006), we carry out a comparative analysis of characteristic lexical bundles and of stance expressions, focusing specifically on ways of expressing obligation.

The study is part of a larger project which seeks to shed light on the most salient differences between institutional English in the websites of British/Irish vs. Italian universities. The project's short-term applied aim is to provide resources for Italian authors and translators working in this area, as a first step towards creating a pool of corpora of non-native institutional English as used in other European countries.

References:

Baroni, M. and S. Bernardini (2004). "BootCaT: Bootstrapping corpora and terms from the web". Proceedings of LREC 2004, 1313-1316.

Biber, D. (2006). University Language: A Corpus-based Study of Spoken and Written Registers. Amsterdam: John Benjamins.

Cortes, V. (2004). "Lexical bundles in published and student disciplinary writing: Examples from history and biology". English for Specific Purposes, 23, 397-423.

Fletcher, W.H. (2004). "Making the web more useful as a source for linguistic corpora". In U. Connor and T. Upton (eds) Corpus Linguistics in North America 200,. 191-205.

Flowerdew, L. (2002). "Corpus-based analyses in EAP". In J. Flowerdew (ed.) Academic Discourse. London: Longman, 95-114.

Mauranen, A. (2003). "The corpus of English as Lingua Franca in academic settings". TESOL Quarterly, 37 (3), 513-527.

Sharoff, S. (2006). "Creating general-purpose corpora using automated search engine queries". In M. Baroni and S. Bernardini (eds) WaCky! Working Papers on the Web as Corpus. Bologna: GEDIT, 63-98.

Nominalizations and nationality expressions: A corpus analysis

Daniel Berndt, Gemma Boleda, Berit Gehrke and Louise McNally

Many languages offer competing strategies for expressing the participant roles associated with event nominalizations, such as the use of a denominal adjective ((1a)) vs. a PP ((1b)):

(1) a. French agreement to participate in the negotiations (Nation_adjective+N)

b. agreement by France to participate in the negotiations (N+P+Nation_noun)

Since Kayne (1981) theoretical work on nominalizations has focused mainly on whether such PPs and adjectives are true arguments of the nominalization or simply modifiers which happen to provide participant role information, with no conclusive results (see e.g. Grimshaw 1990, Alexiadou 2001, Van de Velde 2004, McNally & Boleda 2004); little attention has been devoted to the factors determining when one or the other option is used (but see Bartning 1986). The goal of this research is to address this latter question, in the hope that a better understanding of these factors will also offer new insight into the argument vs. adjunct debate.

Methodology: We hypothesize that the choice between (1a)-(1b) depends on various factors, including whether the noun is deverbal, the argument structure of the underlying verb, prior or subsequent mention of the participant in question (in (1), France), and what we call concept stability – the degree to which the full noun phrase describes a well-established class of (abstract or concrete) entities. We tested for these factors in a study on the British National Corpus. To reduce unintended sources of variation, we limited our study to nationality adjectives/nouns. We examined Nation_adjective+N and N+P+Nation_noun examples from 49 different nations whose adjective (French) and proper noun (France) forms occur 1,000-30,000 times in the BNC, filtering the examples whose head noun was too infrequent (≤ 24 occurrences) or too nation-specific (e.g., reunification). To determine the semantic class of the head nouns, we used the WordNet-based Top Concept Ontology (Álvez et al. 2008). For the analysis of nominalizations, we considered only a manually-selected list of 45 nouns.

Results: Unlike nouns denoting physical objects, abstract nouns, including nominalizations, prefer the prepositional construction (nouns in the categories Part and Place, e.g. border, area, are an exception). The adjective construction occurs with a much smaller range of nouns than does the PP construction, an effect that is more pronounced with infrequent nations and when only nominalizations are considered. These results suggest that use of the adjective construction positively correlates with concept stability: Adjective+nominalization combinations are arguably less likely to form stable concepts than adjective+concrete noun combinations (cp. e.g. French wine and a French agreement).

Though the other factors are pending analysis, these initial results indicate an asymmetry in the distribution of the constructions in (1); and the strong association between the use of the adjective construction and concept stability specifically lends support to an analysis of nationality adjectives as classifying modifiers rather than as argument-saturating expressions.

The presence of lexical collocations in bilingual dictionaries:

A study of nouns related to feelings in English and Italian.

Barbara Berti

Nowadays, it is generally accepted that naturalness in languages results from the synergy between grammatical well-formedness and an interlocking series of precise choices at lexical level. From a SLA perspective, the awareness of lexical relations between words becomes particularly crucial as it allows learners to avoid combinatorial mistakes that result in an awkward linguistic production. One of the most important reference tools for students is the dictionary. It is therefore essential that it should contain the largest amount of information concerning the lexical environment proper to each word.

Lexicographic research has mainly focused on investigating the presence of collocations in learners' dictionaries or general purpose dictionaries (Cowie 1981, Benson 1990, Hausmann 1991, Béjoint 1981, Cop 1988, Siepmann 2006). Nevertheless, despite teachers' effort to encourage the use of monolingual resources, it has been largely shown (Atkins 1985, Atkins and Knowles 1990, Atkins and Varantola 1997, Nuccorini 1992,1994, Béjoint 2002) that learners tend to prefer bilingual ones. Although bilingual lexicography has already taken the issue of collocations into account (Hausmann 1988, Cop 1988, Siepmann 2006), a systematic study concerning the actual presence of collocations in bilingual dictionaries has not been carried out yet, especially in a contrastive perspective English-Italian. This is of particular relevance since dictionary editors often acknowledge the importance of collocations as well as the use of a corpus-based methodology for the compilations of dictionaries.

In this paper, I investigate the presence of lexical collocations of some nouns related to feelings in a sample of bilingual dictionaries. In the attempt to compromise between theoretical issues and practical applications, I will apply a little flexibility to the notion of lexical collocation on the one hand, as well as set some boundaries on the other. In general, I will consider combinations of words such as lexical word + preposition + lexical word (eg. jump for joy) as lexical collocations (this choice derives from the scarce presence of pure lexical collocations in my sample of dictionaries). I will pinpoint that too often collocations do not stand out among the definitions of lemmas but are present in a more 'silent' form as parts of examples (eg. He is in for a nasty shock where nasty shock is not properly highlighted). I will reflect on the fact that many nouns share a good deal of collocates (eg. The adjective deep collocates with pleasure, satisfaction, happiness, contentment, delight), and try to understand the semantic boundaries of this phenomenon. I will underline the extent to which collocations are interlocking, in the sense expressed by Hoey (2005) (e.g. squeal with sheer delight is made up of two collocations: squeal with delight and sheer delight) and what are their combinatorial limitations (eg. although mischievious delight is a collocation, it is not possible to combine it with squeal with delight and form *squeal with mischievious delight). All this will be done with reference to the BNC and The Oxford Collocation Dictionary. As will be shown, the problem extends when collocations are to be investigated from an Italian perspective, due to the anisomorphism of the two languages.

References:

Atkins, B. T. and F. E. Knowles (1990). "Interim report on the EURALEX/AILA research project into dictionary use”. In T. Magay and J. Zigàny (eds), BudaLEX '88 proceedings, 381-392.

Atkins, B. T. and K Varantola (1997). “Monitoring dictinary use”, IJL, 10 (1), 1-45.

Cop, M. (1988). “The function of collocations in dictionaries”. In T. Magay and J. Zigàny (eds), BudaLEX '88 proceedings, 35-46.

Cowie, A. P. (1981). "The treatment of collocations and idioms in learners' dictionaries," Applied Linguistics, 2 (3), 223-35.

Hausmann, F. J. (1991). “Collocations in monolingual and bilingual dictionaries”, V. Ivir and D. Kalojera (eds), Languages in contact and contrast,.Walter de Gruyter, 225-236.

Hausmann, F. J. (2002). "La lexicographie bilingüe en Europe. Peut-on l’améliorer ?". In E. Ferrario and V. Pulcini, (eds), La lessicografia bilingue tra presente e avvenire, Atti del Convegno Vercelli, 11-21.

Nuccorini, S. (1994). “On dictionary misuse”, EURALEX '94 Proceedings, Amsterdam, 586-597.

Siepmann, D. (2006) “Collocation, Colligation and Encoding Dictionaries. Part II: Lexicographical Aspects”, International Journal of Lexicography, 19 (1), 1-39.

Understanding culture: Automatic semantic analysis of a general Web corpus vs. a corpus of elicited data

Francesca Bianchi

A long established tradition (Szalay & Maday, 1973; Wilson & Mudraya, 2006) has analysed cultural orientations by extracting Elementary Meaning Units (EMUs) – i.e. subjective meaning reactions elicited by a particular word – from corpora of elicited data. In line with this tradition, Bianchi (2007) attempted the analysis of EMUs in two different corpora of non-elicited data. Her results seemed to support the hypothesis that non-elicited data are as suitable as elicited ones for the extraction of EMUs.

The current study aims to further assess the contribution that corpora from non-elicited data, and large Web corpora in particular, may provide to the analysis of cultural orientations. This study will focus on British cultural orientations around two specific node words: wine, and chocolate. A general Web corpus of British English (UKWAC; Baroni and Kilgarriff, 2006) will be used, alongside with data elicited from British native speakers. Elicited data will be collected via specifically designed questionnaires based on sentence completion and picture description items. Sentences including the node words will be singled out in the Web and the elicited corpora and subsequently tagged using specific automatic tagging systems (Claws for POS tagging, and SemTag for semantic tagging). Semantic tagging will highlight semantic preference of the node word, i.e. its EMUs; while frequency and strength calculations will distinguish cultural from personal EMUs. The emerging EMUs will be analysed and commented with reference to existing literature on cultural orientations and previous studies based on the chosen node words (Fleischer, 2002; Bianchi, 2007). Attention will be dedicated in particular to comparing and contrasting the advantages and limitations of Web general corpora with respect to more traditional elicited data.

References:

Baroni M., and A. Kilgarrif (2006). “Large linguistically processed web corpora for multiple languages”. Proceedings of EACL. Trento.

Bianchi F. (2007). “The cultural profile of chocolate in current Italian society: A corpus-based pilot study”, ETC – Empirical Text and Culture Research, 3, 106-120.

Fleischer M. (2002). “Das Image von Getränken in der polnischen, deutschen und französischen Kultur”. ETC: Empirical Text and Culture Research, 2, 8-47.

Szalay L.B. and B. C. Maday (1973). “Verbal Associations in the Analysis of Subjective Culture”. Current Anthropology, 14 (1-2) 33-42.

Wilson, A., and O. Mudraya. (2006). “Applying an Evenness Index in Quantitative Studies of Language and Culture: A Case Study of Women’s Shoe Styles in Contemporary Russia”. In P. Grzybek and R Köhler (eds), Exact Methods in the Study of Language and Text, Berlin: Mouton de Gruyter, 709-722.

Introducing probabilistic information in Constraint Grammar parsing

Eckhard Bick

Constraint Grammar (CG), a framework for robust parsing of running text (Karlsson et al. 1995), is a rule based methodology for assigning grammatical tags to tokens and disambiguating them. Traditionally, rules build on lexical information and sentence context, providing annotation for e.g. part of speech (PoS), syntactic function and dependency relations in a modular and progressive way. However, apart from the generative tradition, e.g. HPSG and AGFL (Koster 1991), most robust taggers and parsers today employ statistical methods and machine learning, e.g. the MALT parser (Nivre & Scholz 2004), and though CG has achieved astonishing results for a number of languages, e.g. Portuguese (Bick 2000), the formalism has so far only addressed probabilistics in a very crude and “manual” way – by allowing lexical <Rare> tags and by providing for delayed use of more heuristic rules.

It would seem likely, therefore, that a performance increase could be gained from integrating statistical information proper into the CG framework. Also, since mature CG rule sets are labour intensive and contain thousands of rules, statistical information could help to ensure a more uniform and stable performance at an earlier stage in grammar development, especially for less-resourced languages.

To address these issues, our group designed and programmed a new CG rule compiler, allowing the use of numerical frequency tags in parallel with morphological, syntactic and semantic tags. For our experiments, we used our unmodified, preexisting CG system (EngGram) to analyse a number of English text corpora (BNC [http://www.natcorp.ox.ac.uk/], Europarl [Koehn 2005] and a 2005 Wikipedia dump), and then extracted frequency information for both lemma_POS and wordform_tagstring pairs. This information was then made accessible to the CG rules in the form of (a) isolated and (b) cohort-relative frequency tags, the latter providing a derived frequency percentage for each of a wordform's possible readings relative to the sum of all frequencies for this particular wordform.

We then introduced layered threshold rules, culling progressively less infrequent readings after each batch of ordinary, linguistic-contextual CG rules. Also, frequency based exceptions were added to a few existing rules. Next, we constructed two revised gold-standard annotations - 4.300 tokens each of (a) randomized Wikipedia sentences and (b) Leipzig Wortschatz sentences (Quasthoff et al. 2006) from the business domain - and tested the performance of the modified grammar against the original grammar, achieving a marked increase in F-scores (weighted recall/precision) for both text types. Though the new rules all addressed morphological/PoS disambiguation, the syntactic function gain (91.1->93.4 and 90.8->93.4) was about twice as big as the PoS gain (96.9->97.9 and 96.9->98.2), reflecting the propagating impact that PoS errors will typically have on subsequent modules.

A detailed inspection of annotation differences revealed that of all changes, about 80% were for the better, while 20% introduced new errors, suggesting an added potential for improvement by crafting more specific rule contexts for these cases, or by using machine learning techniques to achieve a better ordering of rules.

References:

Bick, E. (2000), The Parsing System Palavras - Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus: Aarhus University Press

Karlsson et al. (1995), “Constraint Grammar - A Language-Independent System for Parsing Unrestricted Text”. Natural Language Processing, 4.

Koehn, P. (2005). "Europarl: A Parallel Corpus for Statistical Machine Translation", Machine Translation Summit X, 79-86,

Koster, C.H.A. (1991). “Affix Grammars for natural languages”. In Lecture Notes in Computer Science, 545.

Nivre, J. & M. Scholz, M. (2004) “Deterministic Dependency Parsing of English Text”. In Proceedings of COLING 2004, Geneva, Switzerland, 2004.

Quasthoff, U, M. Richter and C. Biemann (2006) “Corpus Portal for Search in Monolingual Corpora”. In Proceedings of the fifth international conference on Language Resources and Evaluation, 1799-1802.

Word distance distribution in literary texts

Gemma Boleda, Álvaro Corral, Ramon Ferrer i Cancho and Albert Díaz-Guilera

We study the distribution of distances between consecutive repetitions of words in literary texts, measured as the number of words between two occurrences plus one. This reflects the dynamics of language usage, in contrast to the static properties captured in Zipf's law (Zipf 1949). We focus on distance distributions within single documents, as opposed to different documents or amalgamated sources as in previous work.

We have examined eight novels in English, Spanish, French, and Finnish. The following consistent results are found:

1. All words (including “function” words such as determiners or prepositions) display burstiness, that is, one occurrence of the word tends to trigger more occurrences of the same word (Church and Gale 1995, Bell et al. 2009).

2. Burstiness fades in short distances. In very short distances even a repulsion effect is observed, which can be explained on both linguistic and communicative grounds.

3. We observe systematic differences in the degree of clustering of different words. In particular, words that are relevant from a rhetorical or narrative point of view (alternatively, keywords; Church and Gale 1995) display higher clustering than the rest.

4. The distribution of the rest of the words follows a scaling law, independent of language and frequency.

5. If words are shuffled, neither the scaling law nor burstiness are observed. A Poisson distribution is obtained instead. Note that e.g. Zipf's law is not altered in this circumstance.

6. There are striking similarities between our data and other physical and social phenomena that have been studied as complex systems (Corral 2004, Bunde et al. 2005, Barabási 2005).

Our data reveal universal statistical properties of language as used in literary texts. The factors that determine this distribution, that is, to what extent are linguistic as opposed to more general communicative or cognitive processes involved, as well as the relationship between word distance distribution and other phenomena with similar properties (see point 6), remain an exciting challenge for the future.

References:

Barabási, A-L (2005). “The origin of bursts and heavy tails in human dynamics”. Nature, 435, 207–211.

Bell, A, J. M. Brenier, M. Gregory, C. Girand and D .Jurafsky (2009). “Predictability effects on durations of content and function words in conversational English”. Journal of Memory and Language, 60 (1), 92–111.

Bunde, A, J.F. Eichner, J.W. Kantelhardt and S. Havlin, (2005). “Long-term memory: a natural mechanism for the clustering of extreme events and anomalous residual times in climate records”. Phys. Rev. Lett. 94, 048701

Church, K. W. and W. A. Gale (1995). “Poisson mixtures”. Nat. Lang. Eng. 1, 163–190.

Corral, A. (2004). “Long-term clustering, scaling, and universality in the temporal occurrence of earthquakes”. Phys. Rev. Lett. 92, 108501.

Zipf, G. K. (1949). Human Behavior and the Principle of Least-Effort. Cambridge, MA: Addison-Wesley. (Hafner reprint, New York, 1972.)

Frequency effects on the evolution of discourse markers in spoken vs. written French

Catherine Bolly and Liesbeth Degand

The current paper discusses the role of frequency effects in the domain of language variation and language change (Bybee 2003), with particular attention to differences between spoken and written language (Chafe & Danielewicz 1987). The following questions will be addressed: How do spoken and written modes interact with each other in the grammaticalization process? Are uses in spoken language the mirror of what we could expect for the future in writing? The question is not new, but needs systematic analyses. The aim of the paper is thus to explore the extent to which evolution of frequent linguistic phenomena may be predicted from their observed frequency in use. Conversely, it also focuses on how the variation of such phenomena in contemporary French can be explained by their diachronic evolution. To shed some light on these issues, we analyzed two types of pragmatic markers in French: (i) the parenthetical construction tu vois (‘you see’), and (ii) the causal conjunctions car and parce que (‘because’).We argue that the historical functional shift from the conceptual domain to the pragmatic/causal domain of these elements could be interpreted as a consequence of the primacy of spoken uses at a given stage of language evolution. First, the evolution from non-parenthetical to parenthetical uses, typical of spoken language (Brinton 2008), will be explored with respect to their context of production. A preliminary analysis shows an increase in frequency of tu vois (‘you see’) in literature (from infrequent in Preclassical French to frequent in Contemporary French) and in drama (from frequent in Preclassical French to highly frequent in Contemporary French). According to the cline of (inter-)subjectification in language change (Traugott to appear), these results allow for two hypotheses: (i) we expect a higher frequency of non-parenthetical constructions (objective uses) in written vs. “speech-like” French at any historical period; (ii) at the same time, we expect an increase over time in the frequency of parentheticals (subjectified uses) as they undergo a grammaticalization process, mainly in “speech-based” corpora.

Secondly, the reversed frequency evolution of car (from highly frequent in Old French to dramatically infrequent in Contemporary spoken French) and parce que (from infrequent in Old French to highly frequent in Spoken French) is analyzed in terms of different subjectification patterns for the two conjunctions. Parce que shows an increasing cline of subjective uses, typical of spoken language, while car (being subjective from Old French on) does not show this evolution that makes it particularly fit to (renewed) use in spoken contexts, leading to its quasi disappearance from speech, and probable specialization to restricted (written) genres.

References:

Brinton, L. J. (2008). The Comment Clause in English. Syntactic Origins and Pragmatic Development. Cambridge: CUP/ Studies in English Language.

Bybee, J. (2003). “Mechanisms of change in grammaticalization: The role of frequency”. In B. D. Joseph and R. D. Janda (eds) The Handbook of Historical Linguistics. Oxford: Blackwell, 604-623.

Chafe, W. and Danielewicz, J. (1987). “Properties of spoken and written language”. In

R. Horowitz and S. J. Samuels (eds) Comprehending Oral and Written Language. San Diego: Academic Press, 83-113.

Traugott, E. C. (to appear). “(Inter)subjectivity and (inter)subjectification: a reassessment”. In

H. Cuyckens, K. Davidse and L. Vandelanotte (eds) Subjectification, intersubjectification and grammaticalization. Berlin and New York: Mouton de Gruyter.

“What came to be called”: Evaluative what and interpreter identity in the discourse of history

Marina Bondi

Research into specialized discourses has often paid particular attention to specific lexis, and how this instantiates the meanings and values associated with specific communities or institutions. Studies on academic discourse, however, have paid growing attention to general lexis and its disciplinary specificity, for example in metadiscourse or evaluative language use. This finer-grained analysis of disciplinary language can be shown to profit from recent approaches to EAP-specific phraseology, whether paying attention to automatic derivation of statistically significant clusters (Biber 2004, Biber et al. 2004, Hyland 2008), or to grammar patterns and semantic sequences (Groom 2005, Charles 2006, Hunston 2008).

The paper explores the notion of semantic sequences and its potential in identifying distinctive features of specialized discourses. The study is based on a corpus of academic journal articles (2.5 million words) in the field of history. The methodology adopted combines a corpus and a discourse perspective. A preliminary analysis of frequency data (wordlists and keywords) offers an overview of quantitative variation. Starting from the frequencies of word forms and Multi-Word-Units, attention is paid to semantic associations between different forms and to the association of the unit with further textual-pragmatic meanings.

The analysis focuses on sequences involving the use of what and a range of signals referring to a shift in time perspectives or attribution. Paraphrasing Hyland and Tse’s analysis of ‘evaluative that’ (2005), we have here an ‘interpretative what, signalling various patterns of interaction of the writer’s voice with other interpretations of historical fact. A tentative classification of the sequences is provided, as an extension of previous studies on a local grammar of evaluation (Hunston and Sinclair 2000). Co-textual analysis looks both at the lexico-semantic patterns and at the pragmatic functions involved. Frequencies and patterns are then interpreted in the light of factors characterizing academic discourse and specific features of writer identity in historical discourse (Coffin 2006).

References:

Biber, D. (2004). “Lexical bundles in academic speech and writing”. In B. Lewandowska-

Tomaszczyk (ed.) Practical Applications in Language and Computers. Frankfurt am Mein:

Peter Lang, 165-178.

Biber, D., S. Conrad and V. Cortes (2004). “If you look at: Lexical bundles in university teaching

and textbooks”. Applied Linguistics, 25(3), 371-405.

Charles, M. (2006). “Phraseological Patterns in Reporting Clauses Used in Citation: A Corpus-Based Study of Theses in Two Disciplines”. English for Specific Purposes, 25 (3) 310-331.

Coffin,C. (2006). Historical Discourse. London: Continuum.

Groom, N. (2005). “Pattern and Meaning across Genres and Disciplines: An Exploratory Study”. Journal of English for Academic Purposes, 2005, 4 (3) 257-277.

Hunston, S. and J. Sinclair (2000). “A local Grammar of Evaluation”. In S. Hunston, G. Thompson (eds). Evaluation in Text. Oxford: Oxford University Press, 74-101.

Hunston, S. 2008. “Starting with the small words: Patterns, lexis and semantic sequences”. International Journal of Corpus Linguistics 13 (3) 271-295.

Hyland, K. (2008). “Academic Clusters: text patterning in published and postgraduate writing. International Journal of Applied Linguistics”, 18 (1), 41-62.

Hyland, K and P. Tse (2005). “Hooking the reader: a corpus study of evaluative that in abstracts”. English for Specific Purposes. 24 (2) 123-139.

A comparative study of Phd abstracts written in English by native and non native speakers across different disciplines

Genevieve Bordet

Phd abstracts constitute a specific case of specialized academic genre. In the particular case of the French scientific community, they are now nearly always translated into or directly written in English by non native speakers. Therefore, they offer an interesting opportunity to understand how would-be researchers try and build up their authorial stance through this specific type of writing. Besides defining their field of investigation and formulating new scientific propositions, authors have to demonstrate their ability to belong to a discourse community through their mastering of a specific genre.

Based on the study of parallel corpora of abstracts written in English by NS and NNS in different disciplines, evidence is given of phraseological regularities specific to NS and NNS.The abstracts belong to different disciplines and fields of research, mainly in hard sciences, such as “material science”, “information research”,” mathematics didactics”.

We analyse more specifically the expression of authorial stance through pronouns and impersonal tenses, modal verbs, adverbs and adjectives. The realization of lexical cohesion is considered as part of the building up of an authorial stance.

Using the theoretical and methodological approach of lexicogrammar and genre study, the structure analysis of several texts in different disciplines and both languages (L1 and L2) provides evidence of different discourse planes specific to this type of text (Hunston 2000): text and metatext, the abstract presenting an account of the thesis, which, in turn, accounts for and comments the research work. This detailed study makes it possible to distinguish the different markers used to separate the different planes. It also shows that all abstracts include the following elements or “moves” (Biber 2008): referential discourse (theoretical basis), definition of a proposition or hypothesis, an account of the research work, conclusions and/or prescriptions. This will be mainly shown through the use of a combination of tenses and pronouns. Verbs and verbs collocations could also be semantically categorized and their distribution analysed, distribution analysis and semantic categorisation of verbs and their collocations providing further evidence on communicative strategies in abstracts.

Based on these findings and using concordancing tools, a comparative study of abstracts written by NS and NNS t reveal significant differences between these two types of texts. This in turn leads to an interrogation as to the consequences on their specific semantic prosody and their acceptance by the academic community (Swales 1990).

The results of this study, since they show evidence of phraseological regularities specific to an academic genre, could be useful for information research, giving leads towards the automatic recognition of the genre.

Besides, the evidence of important differences between NS and NNS realisations and the analysis of their consequences will help to define more adapted goals to EAP teaching.

References:

Biber, D ., U. Connor and T. A. Upton (2008). Discourse on the move. Amsterdam: John Benjamins.

Biber, D. (2006). University language : a corpus based study of written and spoken language. John Benjamins.

Gledhill, C. J. (2000). Collocations in science writing. Gunter Narr Verlag Tubingen.

Halliday, M. A. K and J. R. Martin (1993). Writing science : literacy and discursive power. Pittsburgh: University of Pittsburgh Press.

Hunston, S. and G. Thompson (1999). “Evaluation in text : authorial stance and the construction of discourse”. Oxford Linguistics.

Partington, A. (1998). Using corpora for English language research and teaching. John Benjamins.

Swales, J. (1990). English in academic and research settings. Cambridge: Cambridge University Press.

Corpora for all? Learning styles and data-driven learning

Alex Boulton

One possible exploitation of corpus linguistics is for learners to access the data themselves, in what Johns (e.g. 1991) has called “data-driven learning” or DDL. The approach has generated considerable empirical research, generally with positive results, although quantitative outcomes tend to remain fairly small or not statistically significant. One possible explanation is that such quantitative data conceal substantial variation, with some learners benefiting considerably, others not at all. This paper sets out to see if the variation can be related to learning styles and preferences, a field as yet virtually unexplored even for inductive/deductive preferences (e.g. Chan & Liou, 2005).

In this study, learners of English at a French architectural college were encouraged to explore the BNC (http://corpus.byu.edu/bnc/) for 15 minutes at the end of each class for specific points which had come up during the lesson. The learners completed a French version of the Index of Learning Styles (ILS) based on the model first developed by Felder and Silverman (1988/2002). This instrument, originally designed for engineering students, has also been widely used in language learning (e.g. Felder & Henriques, 1995), is quick and easy to administer, and provides numerical scores for each learner on four scales (active / reflective, sensing / intuitive, visual / verbal, sequential / global). The results are described, and compared against two further sets of data collected from the participants at the end of the semester. Firstly, they completed a questionnaire concerning their reactions to the DDL activities in their course; secondly, their corpus consultation skills were tested in the lab on a number of new points.

If it can be shown that DDL is particularly appropriate for certain learner profiles, it may help teachers to tailor its implementation in class for more or less receptive learners, and perhaps increase the appeal of DDL to a wider learner population.

References:

Chan, P-T. & H-C. Liou. (2005). “Effects of web-based concordancing instruction on EFL students’ learning of verb–noun collocations”. Computer Assisted Language Learning, 18 (3), 231-251.

Felder, R. & E. Henriques. (1995). “Learning and teaching styles in foreign and second language education”. Foreign Language Annals, 28 (1), 21-31.

Felder, R. & L. Silverman. (1988/2002). “Learning and teaching styles in engineering education”. Engineering Education, 78 (7), p. 674-681. http://www4.ncsu.edu/unity/lockers/users/f/felder/public/Papers/LS-1988.pdf

Johns, T. (1991). “Should you be persuaded: two examples of data-driven learning”. In T. Johns & P. King (eds) Classroom Concordancing. English Language Research Journal, 4, 1-16.

Exploring future constructions in Medieval Spanish using the Biblia Medieval Parallel Corpus

Miriam Bouzouita

The subject of this paper is the variation and change found in Medieval Spanish future constructions. To be more specific, this study aims to uncover the differences in use between the so-called analytic future structures, such as tornar-m-é ‘I shall return’ in example (1), and the synthetic ones, such as daras ‘you will give’ in example (2). It also aims to trace the diachronic development of these future structures, which ended with the analytic ones being lost.

(1) E dixo: “Tornar-m-é a Jherusalem […]

and said.3SG return.INF-CL-will.1SG to Jerusalem

‘And he said: “I shall return to Jerusalem […]”’ (Fazienda: 194)

(2) E dyxo ella: “Que me daras?”

And said.3SG she what CL will-give.2SG

‘And she said: “What will you give me?”’ (Fazienda: 52)

As can be observed, these future constructions differ in that a clitic intervenes between the two parts that form the future tense (an infinitive and a present indicative form of the verb aver ‘to have’) in the analytic structures but not in the synthetic ones.

Although some scholars observed a correlation between the use of these future variants and general clitic placement principles present in Medieval Spanish, such as the restriction precluding clitics from appearing in sentence-initial positions (e.g. Eberenz 1991; Castillo Lluch 1996, 2002), hardly any attention has been paid to this view and alternative explanations, which completely disregard this correlation, continue to enjoy widespread acceptance. The most widely held view is that the variation in future constructions is conditioned by discourse-pragmatic factors: analytic constructions are said to be emphatic while the synthetic ones unmarked (e.g. Company Company 1985-86, 2006; Girón Alconchel 1997). By disentangling the various syntactic environments in which each of the future constructions appears, I hope to make apparent the undeniable link that exists between the use of the various future constructions and clitic placement. I shall show that there exists (i) a complementary distribution between the synthetic forms with preverbal placement and the analytic forms, and (ii) a distributional parallelism between the analytic futures and postverbal placement in non-future tenses. Since this study has been carried out using the Biblia Medieval corpus, a parallel corpus containing Medieval Spanish Bibles dating from different periods, a same token can be traced through time which aids to establish the locus of the change more easily. Finally, I shall account for the observed variation and change within the Dynamic Syntax framework (Cann et al. 2005; Kempson et al. 2001). I shall show that within a processing (parsing/production) perspective, synchronic variation is expected for the future forms. The conclusion is that the variation and change in the future constructions cannot be satisfyingly explained without taking into account (i) clitic phenomena, such as positional pressures that preclude clitics from appearing sentence-initially, and (ii) the processing strategies used for the left-peripheral expressions.

Regrammaticalization as a restrategizing device in political discourse

Michael S. Boyd

Political actors exploit different genres to express their ideas, opinions and messages, legitimize their own policies, and delegitimize their opponents in different situations and contexts. What they say and how they say it are constructed by the particular type of social activity being pursued (Fairclough 1995: 14). Campaign speeches and debates are two important, yet very different genres of political discourse in which political actors often operate. While campaign speeches are an example of mostly scripted, one-to-many communication, in which statements generally remain uninterrupted and unchallenged, political debates are an example of spontaneous, interactive communication, in which statements are often interrupted and challenged at the time of the speech event. In such different contexts it is not surprising that speakers should adopt different linguistic strategies to frame their message. The work is based on the hypothesis that genres and their contexts deeply influence surface lexico-grammatical as well as syntactic realizations. In such an approach, it naturally follows that socially defined contextual features such as role, location, timing, etc. are all “pivotal” for the (different) discourse realizations in different genres (Chilton and Schäffner 2002: 16). Moreover, this type of analysis cannot ignore “the broader societal and political context in which such discourse is embedded” (Schäffner 1996: 201).

To demonstrate both the differences and similarities of such textual and discursal realizations the study is based on a quantitative and qualitative analysis of two small corpora taken from the discourse of Barack Obama. The first corpus consists of campaign speeches given during the Democratic primary campaign and elections (2007-8), while the second consists of the statements made by Obama in the three televised debates with Republican candidate John McCain (2008) and taken from written transcripts. The corpora are analyzed individually with particular reference to lexical frequency and keyness. The lexical data are subsequently analyzed qualitatively on the basis of concordances to determine grammatical and syntactic usage.

The study is particularly interested in regrammaticalization strategies and how they are used in the different corpora. One such strategy is nominalization, which, according to Fowler “offers extensive ideological opportunities” (1991, cited in Partington 2003: 15). We can see this, for example, in Obama’s overwhelming nominal use of the lexemes CHANGE and HOPE in his speeches. Nominalization also allows for syntactic fronting, giving the terms an almost slogan-like quality (such as, for example, “change we can believe in”). In the debate, on the contrary, not only do these lemmas have a much lower relative frequency but, when they are used, nominalization is rarely encountered. The message of hope and change is reframed to reflect the different micro- and macro-contexts, and in the debates modality appears to be the preferred grammatical (or, indeed, regrammaticalization) strategy. Thus, diverse grammaticalization strategies are used to varying degrees to reflect the differences in context and genre.

References:

Chilton, P. and C. Schäffner (2002). “Introduction: Themes and principles in the analysis of political discourse”. In P. Chilton and C. Schäffner (eds) Politics as Text and Talk Analytic approaches to political discourse. Amsterdam: John Benjamins, 1-41.

Fairclough, N. (1995). Critical Discourse Analysis. London: Longman.

Partington, A. (2003). The Linguistics of Political Argument: The Spin-doctor and the Wolf-pace at the White House. London: Routledge.

Schäffner, C. (1996). “Editorial: Political Speeches and Discourse Analysis”. Current Issues in Language & Society, 3 (3), 201-204.

Exploring imagery in literary corpora with the Natural Language ToolKit

Claire Brierley and Eric Atwell

Introduction: Corpus Linguists have used concordance and frequency-analysis tools such as WordSmith (Scott, 2004) and WMatrix (Rayson, 2003) for the exploration and analysis of English literature. Version 0.9.8 of the Python Natural Language Toolkit (Bird et al, 2009) includes a range of sophisticated NLP tools for corpus analysis, such as multi-level tokenization and lexical semantic analysis. We have explored the application of NLTK to literary analysis, in particular to explore imagery or ‘imaginative correspondence’ (Wilson-Knight, 2001: 161) in Shakespeare’s Macbeth and Hamlet; and we anticipate that code snippets and experiments can be adapted for research and research-led teaching with other literary texts.

(1) Preparing the eText for NLP: Modern English versions of these plays appear in NLTK’s Shakespeare corpus and the initial discussion refers to NLTK guides for Corpus Readers and Tokenizers to provide a rationale and user access code for preserving form during verse tokenization – for example, preserving orthographic, compositional and rhythmic integrity in hyphenated compounds like ‘…trumpet-tongued…’ while detaching punctuation tokens as normal.

(2) Raw counts and the cumulative resonance of words in context: Having transformed the raw text of a play into a nested structure of different kinds of tokens (e.g. line and word tokens), we report on various investigations which involve some form of counting, with coverage of: frequency distributions; lexical dispersion; and distributional similarity. Raw counts for the following words in Macbeth highlight the phenomenon of repetition and its immersive effect: fantastical: 2; fear: 35; fears: 8; horrid: 3; horrible: 3. We also perform analyses on the plays as Text objects via associated methods such as concordance()and collocations().

(3) Semantic distance and the element of surprise in metaphor: Spatial measures of semantic relatedness, such as physical closeness (co-occurrence and collocation) or taxonomic path length in a lexical network, are used in a number of NLP applications (Budanitsky and Hirst, 2006). The imaginative atmosphere of Macbeth is one of confusion, disorder, unreality and nightmare; things are not what they seem: ‘…faire is foul, and foul is fair’. These and similar instances suggest that antonymy is a feature of the play. To investigate such palpable tension in Macbeth, plus the ‘jarring opposites’ in Hamlet (cf. Wilson-Knight, 2001: 347), we might evaluate how well WordNet (Fellbaum, 1998), another of NLTK’s datasets, can uncover the prevalence of antonymy or the degree of semantic distance within selected phrases - since it is ‘distance’ that surprises us in the ‘fusion’ of metaphor. Dictionary classes in NLTK’s wordnet package provide access to the four content-word dictionaries in WordNet {nouns; verbs; adjectives; and adverbs}; and sub-modules enable users to explore synsets of polysemous words; move up and down the concept hierarchy; and measure the semantic similarity of any two words as a correlate of path length between their senses in the hypernym-hyponym taxonomy in WordNet.

A multilingual annotated corpus for the study of Information Structure

Lisa Brunetti, Stefan Bott, Joan Costa and Enric Vallduví

An annotated speech corpus in Catalan, Italian, Spanish, English, and German is presented. The aim of the corpus compilation is to create an empirical resource for a comparative study of Information Structure (IS).

Description. A total of 68 speakers were asked to tell a story by looking at the pictures of three text-less books by M. Meyer (cf. Strömqvist & Verhoven 2004 and references quoted therein). The participants were mostly university students. Catalan and Spanish speakers were from Catalonia; Italian, English, and German speakers had recently arrived in Barcelona. The results are 222 narrations of about 2-9 minutes each (a total of about 16 hours of speech). The recordings are transcribed with an orthographic transcription. Transcriptions and annotations of some selected high quality recordings have been aligned to the acoustic signal stream using the program PRAAT (Boersma et al. 2009) and its specific format (cf. the Corpus of Interactional Data, Bertrand et al. 2008).

Annotation. An original annotation is proposed of non-canonical constructions (NCCs) for the Romance subgroup, namely of syntactically/prosodically marked constructions that represent informational categories such as topic, focus, contrast. The list of NCCs to be annotated is chosen on the basis of our knowledge of the typical NCCs of these languages: left/right dislocations, cleft and pseudocleft clauses, subject inversion, null subjects, focus fronting, etc.

The analysis of NCCs in context is extremely useful for the study of IS, as they show explicitly what IS strategy the speaker uses within a specific discourse context. Despite their importance, only one example of this kind of annotation is available in the literature, to the best of our knowledge: the MULI corpus of written German (Baumann 2006). Therefore, our corpus provides a so-far missing empirical resource, which will enhance the research on IS based on quantitative analysis of sentences in real context.

Exploitation. The annotation allows for a comparative description of IS strategies in languages with very similar linguistic potential, in particular with respect to the ‘IS-syntax’ and ‘IS-prosody’ interface (via the alignment to the acoustic signal), and the study of the ‘IS-discourse’ interface. A survey of the difference in frequency and use of NCCs in the three languages is given. For instance, it will be shown that these languages differ in their strategies to hide the agent of the event. Passives are largely used in Italian but not in Spanish and Catalan, while Spanish makes a larger use of arbitrary subjects. This is presumably connected to a larger use of left dislocations that we witness in Spanish than in the other two languages. Another topicalizing strategy, namely a pseudo-cleft, is more common in Spanish than in Italian. These and similar data allow us to make generalizations concerning the degree of transparency of languages in their linguistic representation of IS (cf. Leonetti 2008).

References:

Baumann, S. (2006). “Information Structure and Prosody: Linguistic Categories for Spoken Language Annotation”. In S. Sudhoff et al. (eds). Methods in Empirical Prosody Research Berlin: W. de Gruyter.

Bertrand, R., P. Blache, R. Espesser, G. Ferré, C. Meunier, B. Priego-Valverde, and S. Rauzy (2008), “Le CID - Corpus of Interactional Data - Annotation et Exploitation Multimodale de Parole Conversationnelle”. Traitement Automatique des Langues, 49, 3.

Boersma, P. and Weenink, D. (2009). Praat: doing phonetics by computer (Version 5.0.47) [Computer program]. Retrieved January 21, 2009, from http://www.praat.org/

Leonetti, M. 2008, “Alcune differenze tra spagnolo e italiano relative alla struttura informative”, Convegno dell’Associazione Internazionale dei Professori d’Italiano, Oviedo, Sept. 2008.

Early Modern English expressions of stance

Beatrix Busse

Stance refers to epistemic or attitudinal comments on propositional information by the speaker and to the information perspectives of utterances in discourse. It describes how speakers express their attitudes and sources of information communicated (Biber et al. 1999).

Modality and modals, complementary clauses, attitudinal expressions and adverbials can be linguistic indicators of stance. Yet, while some very fruitful diachronic studies exist on, for example, modality and mood (including EModE phenomena, e.g. Krug 2000, Closs Traugott

2006), stance adverbials have not been extensively studied diachronically (Biber 2004).

This paper will identify and analyse possible Early Modern English stylistic and epistemic stance adverbials, such as forsooth, in regard of or to say precisely/true/the truth/the sooth. According to Biber et al. (1999), modern English stance adverbials can be subdivided into a) “epistemic stance adverbials,” which express the truth value of the proposition in terms of, for example, actuality or limitation or viewpoint, b) “attitudinal stance adverbials,” which express the speaker’s attitude to or evaluation of the content, and c) “stylistic stance adverbial,” which represent a speaker’s comment on the style or form of the utterance. Following a function-to-form and a form-to-function mapping as well as a discussion of methodological difficulties associated with these mappings and modern categorisations, my study investigates the frequencies of a selected set of these stance adverbials in relation to change and stability in the course of the Early Modern English period, as well as in relation to grammaticalisation, lexicalisation and pragmatisation (e.g. Brinton and Closs Traugott 2005, Brinton and Closs Traugott 2007). Corpora to be examined will be The Helsinki Corpus of English Texts, The Corpus of Early English Correspondence, A Corpus of English Dialogues 1560-1760, and a corpus of all Shakespeare’s plays. Reference will also be made to Early Modern Grammars and other sources in order to reconstrue contemporary statements of (and about expressing) stance.

Furthermore, the registers (Biber 2004) in which these constructions occur will be analysed on the assumption that oral and literate styles do not directly correspond to medium or to high and low style. They have to be seen as a continuum. Moreover, the functional potential of stance adverbials will be investigated, for example, in relation to their positions in the clause and to the accompanying speech acts of the clause. Sociolinguistic and pragmatic parameters drawn on will be medium, domain, age of speaker, and other contextual factors, such as degree of construed formality. Within a more qualitative framework, it will be argued that Early Modern epistemic and style adverbials have interpersonal, experiential and textual meanings as discourse markers, indicating scope, attitude and speaker-hearer relationship.

References:

Biber, D., S. Johansson, G. Leech, S.Conrad and E. Finegan (1999). Longman Grammar of Spoken and Written English. London: Longman.

Biber, D. (2004). “Historical Patterns for the Grammatical Marking of Stance: A Cross-

Brinton, L., and E. Closs-Traugott. (2005). Lexicalization and Language Change. Cambridge: Cambridge University Press.

Brinton, L. and E. Closs-Traugott. (2005). “Lexicalization and Grammaticalization All

over Again.” Historical Linguistics 2005: Selected Papers from the 17th International

Conference on Historical Linguistics, Madison, Wisconsin.

Closs-Traugott, E. (2006)“Historical aspects of modality”. In W. Frawley, (2006). The Expression of Modality, Berlin: Mouton de Gruyter, 107-139.

Krug, M. (2000). Emerging English Modals: A Corpus-Based Study of Grammaticalization.

Berlin: Mouton de Gruyter.

Integration of morphology and syntax unsupervised techniques for language induction

Héctor Fabio Cadavid and Jonatan Gomez

The unsupervised learning process of language elements has a wide field of applications, including protein sequences identification, studies on child’s language induction and evolution, and natural language processing. In particular, the development of the Semantic Web technology requires some techniques for inducing grammatical knowledge, including the syntactical, morphological and phonological levels. These techniques should be as detailed as possible to build up concepts and relations out of dynamic and error-prone contents.

Almost all unsupervised language learning techniques are extensions of some classic technique or principle, like ABL (Alignment Based Learning), MDL (Minimum Description Length), or LSV (Letter Successor Variety). For instance, Alignment Based Learning techniques try to induce syntactical categories by identifying some interchangeable elements on sentences’ patterns at the syntactic level, the minimum description length principle is used to incrementally improve the search over a morphologies space, and the Letter Successor Variety principle gives a morpheme’s segmentation criteria at the morphologic level.

However, such techniques are always focused on a single element of the language’s grammar. This fact has two disadvantages when trying to combine them, as it is required in the semantic Web case: first, two or more separated outputs from different techniques may be difficult to be consolidated since such outputs can be very different in format and content, and second, the information induced by one technique at a lower level is not easy to use into techniques of higher levels to improve their performance. An integration of a morphology unsupervised learning technique with a syntax unsupervised learning technique for language induction on corpuses extracted from internet is discussed in this paper.

The proposed technique integrates a LSV (Letter Successor Variety)-based unsupervised morphology learning technique with a MEX (Motif Extraction)-based unsupervised syntax learning technique.

This integration enhances some of the grammatical categories obtained by the MEX-based technique with additional word categories identified by the LSV based technique.

In this way, a hierarchical clustering technique for finding morphologically-related words is developed.

The integrated solution is tested with corpuses build from two different natural languages: English (predominantly non-flexive/concatenative language) and Spanish (predominantly a flexive language).

References:

Bordag, S. (2007). “Unsupervised and knowledge-free morpheme segmentation and analysis”. In Proceedings of the Working Notes for the CLEF Workshop 2007.

Clark, A. (2002). Unsupervised Language Acquisition: Theory and Practice. PhD thesis.

Demberg, V. (2007). “A language-independent unsupervised model for morphological segmentation”. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, 920- 927.

Goldsmith, J. (2006) “An algorithm for the unsupervised learning of morphology”. Nat. Lang. Eng., 12(4), 353-371.

Roberts, A. and E. Atwell (2002). “Unsupervised grammar inference systems for natural language”. Technical Report 2002.20, School of Computing, University of Leeds.

Solan, Z., D. Horn, E. Ruppin, and S. Edelman. (2005). “Unsupervised learning of natural languages”. Proc Natl Acad Sci U S A, 102(33), 11629-11634.

van Zaanen, M. (2000) “ABL: alignment-based learning”. In Proceedings of the 18th conference on Computational linguistics, 961-967.

From theory to corpus and back again – the case of inferential constructions

Andreea Calude

Technological advances have revolutionised linguistics research via the use of corpora through increased storage and computation power. However, the shift in paradigm from full introspection to pure usage-based models (Barlow and Kemmer 2000) has not gone without criticism (Newmeyer 2003). The current paper sets out to show that a combined approach, where both these methods play a role in the analysis is most fruitful and provides a more complete picture of linguistic phenomena. The case is made by presenting an investigation of the inferential construction.

Starting from a theoretical viewpoint, constructions such as It’s that he’s so self-satisfied that I find off-putting have been placed under the umbrella of it-clefts (Huddleston and Pullum 2002: 1418-1419). They are focusing structures, where the cleft constituent happens to be coded by a full clause rather than a noun phrase as is (typically) found in it-clefts (compare the above example to It’s a plate of soggy peas that I find off-putting).

However, an inspection of excerpts of spontaneous conversation from the Wellington Corpus of Spoken New Zealand English (WCSNZE) (Homes et al 1998) suggests a different profile of the construction. As also found by Koops (2007), the inferential typically occurs in the negative form, and/or with qualifying adverbs such as just, like, and more. More problematically, the inferential only rarely contains a relative clause (Delahunty 2001). This has caused great debate in the literature with regard to whether the that-clause following the copula represents in fact the cleft constituent (Delahunty 2001, Lambrecht 2001), or whether it is instead best analysed as the relative/cleft clause (Collins 1991, and personal communication). Such debates culminated with some banning the inferential from the cleft class altogether (Collins 1991).

Returning to the drawing board, we see further evidence that the inferential examples identified in the corpus differ from it-clefts. While in it-clefts, the truth conditions of the presupposition change with a change in the cleft’s polarity, in the inferential, this is not the case, as exemplified below from the WCSNZE.

(1) It-cleft

a. It’s Norwich Union he’s working for. (WCSNZE, DPC083)

Pressupposition. He is working for Norwich Union.

b. It’s not Norwich Union he’s working for.

Pressupposition. He is not working Norwich Union.

(2) Inferential

a. It’s just that she needs to be on the job. (WCSNZE, DPC059)

Pressuposition: She needs to be on the job.

b. It’s not just that she needs to be on the job.

Pressuposition: She [still] needs to be on the job [but the real issue is something else, not just she needs to be on the job].

As illustrated by the inferential construction, the power of a hybrid approach in linguistic inquiry comes from the ability to draw on both resources, namely, real language examples found in corpus data, and also, the wealth of theoretical tools developed by linguists to account for, question, and (hopefully) explain linguistic phenomena.

References :

Barlow, M. and S. Kemmer (2000). Usage-Based Models of Language. Stanford: CSLI Publications.

Collins, P. (1991). Cleft and Pseudo-cleft constructions in English. London: Routledge.

Delahunty, G. (2001). “Discourse functions of inferential sentences”. Linguistics. 39 (3), 517-545.

Holmes, J., B. Vine and G. Johnson (1998). Guide to the Wellington corpus of spoken New Zealand English.

Koops, C. (2007). “Constraints on Inferential Constructions”. In G. Radden, K. M. Kopcke, B. Thomas and P. Sigmund (eds) Aspects of Meaning Construction, 207-224. Amsterdam: John Benjamins.

Lambrecht, K. (2001). “A framework for the analysis of cleft constructions”. Linguistics 39, 463–516.

Newmeyer, F. (2003). “Grammar is grammar and usage is usage”. Language, 79 (4), 682-707.

Speech disfluencies in formal context: Analysis based on spontaneous speech corpora

Leonardo Campillos and Manuel Alcantra

This paper examines disfluencies in a corpus of Spanish spontaneous speech in formal contexts. This topic has recently gained interest in the research community, as the last ICAME workshop or the NIST competitive evaluations have shown. The goal of this paper is to propose a classification of disfluencies in formal spoken Spanish and to show their relations with both linguistic and extra-linguistic factors.

122,000 words taken from the MAVIR and C-ORAL-ROM corpora have been analyzed. Both corpora have been annotated with prosodic and linguistic tags, including 2,800 hand‑annotated disfluencies. Corpora are made up of 33 documents classified in different classes depending on contexts: business, political speech, professional explanation, preaching, conference, law, debate, and education.

Main disfluency phenomena which are considered in this work are repeats, false starts, filled pauses, incomplete words, and non-grammatical utterances. Main characteristics are described together with samples. To take an example, analysis shows that most repeated forms are function words (mainly prepositions, conjunctions, articles), with "de" (of ), "en" (in), "y" (and), and "el" (the) on the top of the list. Regarding different disfluency phenomena, data show that frequencies are not related. For example, documents with high rate of repeats do not necessarily have a high rate of pause fillers.

Linguistic and extra-linguistic factors to be considered have been chosen following the state‑of‑the-art literature of disfluencies in other languages since they have not received much attention till now for Spanish. Linguistic factors are: word-related aspects of disfluencies (e.g. which words are repeated) and syntactic complexity (e.g. how syntactically complex are parts following a disfluency). Extralinguistic factors are: document class, number of speakers, and speech domain. They altogether help us to explain differences in occurrence frequency of disfluencies. In the corpora, frequency clearly varies going from a disfluency every 1,000 words to 12 disfluencies every 100 words (with a mean of almost 5 words every 100 words).

On the one hand, results are compared with two previous studies carried out with the C‑ORAL-ROM corpus. The first one proposed a comprehensive typology of phenomena affecting transcription tasks. Phenomena where classified into types and frequencies were compared for informal, formal, and media interactions. Though the goal of that work was not a study of disfluencies, we show now that these are one of the main factors to make difficult a document for the transcriber. The second study was also focused on differences between informal, formal, and media interactions, but from an acoustic point of view. Instead of manual transcriptions, difficulty of automatic processing of different types of spontaneous speech was tested by performing acoustic-phonetic decoding by means of a recognizer on parts of the corpus.

On the other hand, main figures of our study are compared to those of spontaneous English in formal contexts from in order to show how dependent on the language these phenomena are.

Results are important not only for the linguistic insights they provide. They will also help improve current automatic speech recognition systems for spontaneous language since disfluencies are one of the main problems in ASR.

References:

Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan. (1999). Longman

Grammar of Spoken and Written English. London: Longman.

Huang, X., A. Acero, H.-W. Hon (2001). Spoken Language Processing: A Guide to Theory, Algorithm and System Development. New Jersey: Prentice Hall PTR.

González Ledesma, A., G. de la Madrid, M. Alcántara Plá, R. de la Torre and A. Moreno‑Sandoval (2004). “Orality and Difficulties in the Transcription of Spoken Corpora”. Proceedings of the Workshop on Compiling and Processing Spoken Language Corpora, LREC,

Shriberg, E. (1994). Preliminaries to a Theory of Speech Disfluencies. Ph.D. Thesis.

Toledano, D. T., A. Moreno, J. Colas and J. Garrrido. (2005). “Acoustic-phonetic decoding of different types of spontaneous speech in Spanish”.Disfluencies in Spontaneous Speech Workshop 2005.

Featuring linguistic decline in Alzheimer’s disease: A corpus-based approach

Pascual Cantos-Gomez

Alzheimer’s disease (AD) is a brain disorder -the most common form of dementia named for German physician Alois Alzheimer, who first described it in 1906. It is a progressive and fatal brain disease that destroys brain cells, causing problems with memory, thinking and behaviour. The neuropsychological deficits attributable to Alzheimer’s disease have been documented extensively.

The various stages of cognitive decline in AD patients include a linguistic decline. Language is known to be vulnerable to the earliest stages of Alzheimer's disease, and the findings of the Iris Murdoch Project (Garrard et al. 2005) confirmed that linguistic changes can appear even before the symptoms are recognised by either the patient or their closest associates.

This paper describes a longitudinal language assessment of Prime Minister Harold Wilson’s speeches (1964-1970 and 1974-1976), extracted from the Hansard transcripts (the edited verbatim report of proceedings in both Houses) in order to explore possible effects of AD process on his language use.

We are confident that this longitudinal study and the language use variables selected might provide us with some new understanding of the evolution and progression of language deterioration in AD.

References:

Burrows, J. F. (1987). “Word-patterns and Story-shapes: The Statistical Analysis of Narrative Style”. Literary and Linguistic Computing, 2 (2), 61-70.

Cantos, P. (2000). “Investigating Type-token Regression and its Potential for Automated Text-Discrimination”. Cuadernos de Filología Inglesa, 9 (1), 71-92.

Chaski, C. (2005). “Computational Stylistics in Forensic Author Identification”. [Document available on the Internet at http://eprints.sics.se/26/01/style2005.pdf]

Conrad, S. and D. Biber (eds) (2001). Variation in English: Multi-Dimensional studies. London: Longman.

Garrard, P. (in press) “Cognitive Archaeology: Uses, Methods, and Result”. Journal of Neurolinguistics, doi:10.1016/j-neurolinguistics.2008.07.006.

Garrard, P. et al. (2005) “The Effects of very Early Alzheimer’s Disease on the Characteristics of Writing by a Renowned Author”. Brain 128: 250-260.

Gómez Guinovart, X. And J. Pérez Guerra (2000). “A multidimensional corpus-based analysis of English spoken and written-to-be-spoken discourse”. Cuadernos de Filología Inglesa, 9 (9), 39-70.

Kempler, D. et al. (1987) “Syntactic Preservation in Alzheimer’s Disease”. Journal of Speech and Hearing Research, 30, 343-350.

March, G. E. et al. (2006) “The uses of nouns and deixis in discourse production in Alzheimer's disease”. Journal of Neurolinguistics, 19, 311-340.

March, G. E. et al. (2009) “The Role of Cognition in Context-dependent Language Use: Evidence from Alzheimer's Disease”. Journal of Neurolinguistics, 22, 18-36.

Mollin, S. (2007) “The Hansard Hazard. Gauging the accuracy of British parliamentary transcripts.” Corpora, 2 (2), 187-210.

Rosenberg, S. and L. Abbeduto (1987) “Indicators of Linguistic Competence in Peer Group Conversational Behavior of Mildly Retarded Adults”. Applied Psycholinguistics, 8 (1), 19-32.

The evolution of institutional genres in time: The case of the White House press briefings

Amelia Maria Cava, Silvia De Candia, Giulia Riccio, Cinzia Spinzi and Marco Venuti

Literature on genre analysis mainly focuses on the description of language use in the different professional and institutional domains (Bhatia 2004). Despite the different directions of the studies on genre (Martin and Christie 1997; Swales 1990), however, a common orientation may be seen in their tendency to describe homogeneous concepts, such as communicative situation, register and function. Nevertheless, genre-specific features are subject to changes due to the ongoing processes of internationalisation and globalisation (Candlin and Gotti 2004; Cortese and Duszak 2005). Within the framework of a wider research project titled “Tension and change in English domain-specific genres” funded by the Italian Ministry of Research, the present paper aims to outline, through a corpus-based analysis of lexico-grammatical and syntactic features (Baker 2006), in what ways the White House press briefings, as a genre, have evolved in the last 16 years under the pressure of technological developments and of media market transformation. White House press briefings are meetings between the White House press secretary and the press, held on an almost daily basis and they may be regarded as the main official channel of communication for the White House. Embracing a diachronic perspective, our analysis aims at identifying the main features of the evolution of the briefings as a genre, during the Clinton and George W. Bush administrations. A corpus (DiaWHoB) including all the briefings from January 1993 to January 2009, available on the American Presidency Project website (http://www.presidency.ucsb.edu/press_briefings.php), has been collected in order to carry out the analysis. The corpus consists of about 4,000 briefings and is made up of more than 18 million words. The scope and size of a specialised corpus of this kind make it a powerful tool to investigate the evolution of the White House press briefing. In order to manage the data more efficiently, the corpus has been annotated. The XML mark-up includes information about individual speakers and their roles, date, briefing details and text structure. Our intent is to compare different discourse strategies adopted by speakers in the briefings at different points in time, and also to identify differences between discourse features employed by the press secretary and those more typical of the press. The present research paper draws up the corpus structure and in what ways the corpus architecture helps in investigating the evolution of the genre, and also presents some preliminary results. In particular, focus is on some examples of evolution in phraseology within the genre of briefings in order to support the hypothesis that a diachronic corpus-based investigation facilitates comparisons among different speakers thanks to the XML mark-up while providing interesting insight into the evolution of a genre.

References:

Baker, P. (2006). Using Corpora in Discourse Analysis. London: Continuum.

Bhatia, V. K. (2004). Worlds of Written Discourse. London: Continuum.

Candlin C. And M Gotti (eds) (2004). Intercultural Discourse in Domain-Specific English. Textus, 17(1).

Cortese, G. and A. Duszak (eds) (2005). Identity, Community, Discourse: English in Intercultural

Settings. Bern: Peter Lang.

Crystal, D. (2003). English as a Global Language. Second Edition. Cambridge: CUP

Gotti, M. (2003). Specialized Discourse. Bern: Peter Lang.

Martin, J. and F. M. Christie (1997). Genre and Institutions. London: Cassell.

Partington, A. (2003). The Linguistics of Political Argument: the Spin-doctor and the Wolf-pack at the

White House. London/New York: Routledge.

Swales J. M. (1990). Genre Analysis. Cambridge: CUP.

Xaira: http://www.oucs.ox.ac.uk/rts/xaira/

Parallel corpora: The case of InterCorp

Frantisek Cermák

There is a growing awareness, started decades ago, that parallel corpora might substantially contribute to language contrastive research and various applications based on them. However, except for notorious and rather one-sided type of parallel corpora, such as the Canadian Hansard and Europarl corpora, most of the attention paid to them has been oddly restricted, mostly to two things. On the one hand, computer scientists seem to compete fiercely in the field of tools including search of optimal alignment methods and when they have arrived at a solution and become convinced that there is no more to be achieved here, they drop the subject and interest in it as well. On the other hand, parallel corpora hardly ever means anything more than a bilingual parallel corpus. Thus, the whole field seems to be lacking in a number of aspects, including both real use and exploitation, that should be linguistic, preferably, and a broader goal of comparing and researching more languages, a goal which should suggest itself in today´s multilingual Europe. Moreover, most attention is being paid, understandably, to such language pairs where at least one is a large language, such as English.

InterCorp, a subproject of Czech National Corpus (www.korpus.cz), currently under progress, is a joint attempt of linguists, language teachers and representatives of over 25 languages to change this picture a little and to make Czech, a language spoken by 10-million people, a centre and, if possible, a hub, for the rest of languages included. The list contains now most state European languages, small and large. Given the familiar limited supply of translations the plan is to cover as much as possible from (1) contemporary language (starting with the end of World War 2), (2) also non-fiction of any type (fiction prevails in any case), if available, (3) also translations from a third language, apart from the pair of languages in question (in case of need), and (4) translations into more than one language, if possible. A detailed description of this, general guidelines and problems will be discussed.

Obviously, this contribution is aimed at redressing the balance looking at linguistic types of exploitation, although some thoughts will also be given to non-linguistic ones. It seems that such a large general multilingual corpus, which seems to have not many parallels elsewhere, could be a basis and tool for finding out more, including answers to questions such as what is possible on such a large scale, what its major problems and desiderata might be (which are still to be discovered). First results (the project will run till 2011) will be available at a conference in August 2009 held in Prague.

Using linguistic metaphor to express modality: A corpus study of argumentative writing

Claudia Marcela Chapetón Castro and Danica Salazar

This paper is part of a larger study that investigates the use of linguistic metaphor in native and non-native argumentative writing by means of corpus methods. Cognitive research has shown that many concepts, especially abstract ones, are structured and mentally represented in terms of metaphor (Lakoff, 1993; Lakoff and Johnson, 1980), which is claimed to be highly systematic and pervasive in different registers. However, until recently, cognitive metaphor research has been based largely on elicited, invented and decontextualized data. The present study takes a corpus-driven approach to the analysis of metaphor in order to connect the conceptual with the linguistic by using empirical, naturally occurring written data.

Three corpora are compared in the study. The first corpus is a collection of professionally written editorials published in American newspapers. The second is a sample from the Louvain Corpus of Native English Essays (LOCNESS), which contains argumentative essays written by native-speaker American students. The third is a sample from a sub-corpus of the International Corpus of Learner English (ICLE) composed of essays written by advanced Spanish-speaking learners of English. These corpora were chosen to represent expert native writing (editorials), novice native writing (LOCNESS) and learner writing (SPICLE).

Linguistic metaphor was identified in context through the application of a rigorous, systematic procedure that combined Cameron’s (2003) Metaphor Identification through Vehicle Terms (MIV) procedure with the Pragglejaz Group’s (2007) Metaphor Identification Procedure (MIP). Once identified, instances of linguistic metaphor were analyzed in terms of their function in the texts.

This paper focuses on one of the functions that metaphors have been found to perform in the data: the expression of modality. Several studies have shown modality to be important in argumentative writing, since they help writers convey their assessments and degree of confidence in their propositions and enable them to set the appropriate tone and distance from their readers (Aijmer, 2002; Hyland and Milton, 1997). In other words, modal devices fulfill pragmatic and interpersonal functions crucial to argumentation.

This paper compares how the three writer groups under study use linguistic metaphor to communicate modal meanings. In the presentation, patterns of occurrence will be described and differences among the three corpora with regard to the appropriateness of the degree of commitment and certainty expressed through metaphor will be discussed. Aspects such as the use of metaphor in personalized and impersonalized structures and the influence of cultural and linguistic variations on the use of metaphor to express epistemic commitment will be explained. The pedagogical implications of the results obtained will be highlighted.

References:

Aijmer, K. (2002). “Modality in advanced Swedish learners’ written interlanguage”. In S. Granger, J. Hung and S. Petch-Tyson (eds) Computer learner corpora, second language acquisition and foreign language teaching, 55-76. Amsterdam: John Benjamins.

Cameron, L. (2003). Metaphor in educational discourse. London: Continuum.

Hyland, K. and Milton, J. (1997). “Qualification and certainty in L1 and L2 students’ writing”. Journal of Second Language Writing, 6, 183-205.

Lakoff, G. (1993). “The contemporary theory of metaphor”. In A. Ortony (ed.) Metaphor and thought. Cambridge: Cambridge University Press. 202-251

Lakoff, G. And M. Johnson (1980). Metaphors we live by. Chicago: University of Chicago Press.

Pragglejaz Group. (2007). “MIP: A method for identifying metaphorically used words in discourse”. Metaphor and Symbol, 22 (1), 1-39.

DDL for the EFL classroom: Effective uses of a Japanese-English parallel corpus and the development of a learner-friendly, online parallel concordance

Kiyomi Chujo, Laurence Anthony and Kathryn Oghigian

There is no question that the use of corpora in the classroom has value, but how useful is concordancing with beginner level EFL students? In this presentation, we outline a course in which a parallel Japanese-English concordancer is used successfully to examine specific grammar features in a newspaper corpus. We will also discuss the benefits and limitations of existing parallel concordance programs, and introduce a new freeware parallel concordancer, WebConc-bilingual, currently being developed.

In the debate over whether an inductive (Seliger, 1975) or deductive (Shaffer, 1989) approach to grammar learning is more effective, we agree with Corder (1973) that a combination is most effective and we have incorporated both into a successful DDL classroom procedure: (1) hypothesis formation through inductive DDL exercises; (2) explicit explanations from the teacher to confirm or correct these hypotheses; (3) hypothesis testing through follow-up exercises; and (4) learner production. Through this procedure we show that incorporating cognitive processes such as noticing, hypothesis formation, and hypothesis testing enables learners to develop skills effectively.

The course was based on the findings of two studies. Uchibori et al. (2007) identified a set of grammatical features and structures used in practical English expressions that are found in TOEIC questions but not generally taught in Japanese high school textbooks. Chujo (2003) identified the vocabulary found in TOEIC but not taught in Japanese high school textbooks. These grammatical structures and vocabulary form the basis of twenty DDL lessons taught over two semesters.

A case study was used to assess the validity of the course design. Students were asked to follow carefully crafted guidelines to explore various noun or verb phrases using parallel Japanese-English concordancing. In addition, students completed follow-up activities using targeted vocabulary to reinforce the grammar patterns discovered through the DDL activities. The evaluation of learning outcomes showed that the course design was effective for understanding the basic patterns of these noun and verb phrases, as well as for learning vocabulary.

Although we have previously relied on Paraconc (Barlow, 2008), for this course we have started to develop a new online web-based parallel concordancer. Built on a standard LAMP (Linux, Apache, MySQL, PHP) framework, we designed the concordancer to be both powerful and intuitive for teachers and learners to use. In addition, the concordancer engine is designed on a similar architecture to the Google search engine, allowing it to work comfortably on very large corpora of hundreds of millions of words. The concordancer is also built to Unicode standards, and thus, it can process both English and Japanese texts smoothly, and does not require any cumbersome token definition settings. Preliminary results show that the new software is considerably easier to use than standard desktop programs, and also, it provides students the opportunity to carry out DDL exercises at home or even during their commute using a mobile device.

References:

Barlow, M. (2008). ParaConc (Build 269) [Software]. http://www.athel.com/para.html.

Chujo, K. (2003). “Selecting TOEIC Vocabulary 1 & 2 for Beginner Level Students and Measuring its Effect on a Sample TOEIC Test”. Journal of the College of Industrial Technology, 36, 27-42.

Seliger, H. (1975). “Inductive Method and Deductive Method in Language Teaching: A Re-Examination”. International Review of Applied Linguistics, 13, 1-18.

Shaffer, C. (1989). “A Comparison of Inductive and Deductive Approaches to Teaching Foreign Languages”. The Modern Language Journal, 73(4), 395-402.

Uchibori A., K. Chujo and S. Hasegawa (2006). “Toward Better Grammar Instruction: Bridging the Gap between High School Textbooks and TOEIC”. The Asian EFL Journal, 8(2), 228-253.

A case study of economic growth based on U.S. presidential speeches

Siaw-Fong Chung

In addition to being a metaphor itself, growth is found to be highly lexicalized in the expression economy/economic growth which can form other metaphors (cf. White (2003) for discussion of ‘growth’). Using ECONOMIC GROWTH as a target domain, this work addresses the issues of source domain indeterminacy in metaphor research. Data from thirty-eight U.S. presidential speeches from the State of the Union from 1970 through 2005 were analyzed (comprises data from presidents Nixon, Ford, Carter, Reagan, G. H. W. Bush, Clinton and G. W. Bush). Seventy-six instances (from total 178; 43%) of growth were found to be related to economy. Three methods (shown in the table below) were used to extract the source domains for ECONOMIC GROWTH, namely through (a) examining syntactic positions (intuition-based); (b) using WordNet and SUMO (Suggested Upper Merged Ontology) (top-down); and (c) using Sketch Engine for collocations (cf. Kilgarriff and Tugwell, 2001) (bottom-up). (Similar idea but with the aid of computational tools was used in Chung (Forthcoming) based on Mandarin newspaper data.)

Determining Source Domains

Metaphorical Expressions	Possible Source Domains
Metaphorical Expressions	Syntactic Positions	WordNet 1.6 and SUMO	Collocates
the real engine of economic growth* in this country is private sector*	machine	machine	*Engine* *of X* X: car, aircraft, war
to promote freedom as the key to economic growth	lock/ success	lock	*Key to* X X: success, understanding
the fruits of growth* must be widely shared*	plant	plant/ organism	*Fruits of* X X: tree, effort, work

Even though ECONOMIC GROWTH can serve as a target domain and create new metaphors, inconsistency arises in determining source domains, which is also shown by the use of varied source domains for similar metaphorical expressions in previous studies. From the three methods suggested above, this paper finds collocations usually do not work well with items which metaphorical frequency overrules the literal uses (such as slow in slow down). Combining the analyses from syntactic positions, ontologies, and collocations may show advantages in providing more information regarding source domain determination but it may also cause difficulties methodology-wise especially when contradicting results are found from different methods.

References:

Kilgarriff, A. and D. Tugwell. (2001). “WORD SKETCH: Extraction and Display of Significant Collocations for Lexicography.” In the Proceedings of the ACL Workshop COLLOCATION: Computational Extraction, Analysis and Exploitation. Toulouse, 32-38.

White, M. (2003). “Metaphor and Economics: The Case of Growth.” English for Specific Purposes. 22, 131-151.

Chung, S-F. (forthcoming). A Corpus-driven Approach to Source Domain Determination. Language and Linguistics Monograph Series. Nankang: Academia Sinica.

Corpora and EFL / ELT: Losses, gains and trends in a computerized world

Adrian Ciupe

In the last decade or so, thanks to unprecedented developments in computer technology, corpora research has been yielding unique insights into the workings of English lexis – one of the languages which, as a second acquisition, is an almost ubiquitous requirement in curricula across the world, from primary to tertiary level. Catching up with these ground-breaking advances (derived from the painstaking research of linguists and lexicographers) are also the EFL / ELT beneficiaries of such endeavours, prestigious publishers and non-native learners / teachers of English alike. Nonetheless, although major EFL publishers boast each a proprietor corpus (running to billions of words and counting!), the ultimate general challenge lies in final-product usability by the intended target audience. Such publishers have embarked on the assiduous task of processing raw corpus data in terms of frequency, usage, register, jargon etc towards its (twice-filtered) inclusion into learner dictionaries (notably, their electronic versions), as well as into various course books (especially, exam preparation courses). They are vying (suffice it to say, obvious from blurbs alone) for market leader positions, but with what benefits to the non-native users at large? Are there any grey areas left to be covered? Electronic dictionaries may have become sophisticated enough as to accommodate more user-friendly replicas of book formats, laying at the disposal of learners advanced search tools and thus blending language and computer literacy towards a more efficient acquisition of native-like competence. Despite the apparent benefits, several shortcomings crop up on closer inspection: electronic learner dictionaries draw on mammoth professional corpora, ultimately emerging as mini/guided corpora targeted at non-native speakers. But how does the processed data lend itself to relevance and searchability? How much does the software replicate the compiling strategies behind book-format counterparts? How truly advanced are the incorporated search tools? How much vertical vs. horizontal exploration do they allow? How much of the information is deductively structured (i.e. headword > phraseology > real-life language / examples = the classic approach) and how much inductively (i.e. real-life language / examples > phraseology (> headword) = the new computer-assisted methods)? Do interfaces and layout make a difference? Similarly, regarding corpora use in EFL book formats, what kinds of publications are currently available, whom do they target (c.f. a predilection for exam preparation), what do they contain, how are they organized and how do they practise what they preach? How can the end user really benefit from this classic / modern tandem of data processing (the experts), transfer (the EFL publications) and acquisition (the learners)? My paper will be addressing the questions above, based on practice-derived findings. I will be contending that although cutting-edge technology is readily available both to experts and to an intended target audience (non-native speakers of English), corpora-based EFL / ELT resources are still in a relative infancy (c.f. their novelty) but still, remarkably capable of reshaping focus through need-based end-products that could eventually live up to the much vaunted reliability, usability and flexibility of technology-assisted language learning.

A corpus approach to images and keywords in King Lear

Maria Cristina Consiglio

This paper aims to apply tools and methodology of Corpus Linguistics to Shakespeare’s King Lear in order to explore the relationship between images and keywords. King Lear is a play where language is particularly important in both the characterization and the development of the plot. Traditional critical studies – like Spurgeon (1935); Armstrong (1946); Evans (1952); Clemen, (1966); Wilson Knight (1974); Frye (1986) – have identified some thematic words, which strongly contribute to convey to the audience the general atmosphere of the play and to create an imagery which effectively ‘suggest[s] to us the fundamental problems lying beneath the complex construction of [the] play’ (Clemen, 4). The plot develops in an atmosphere of strokes, blows, fights, shakes, strains, spasms and the imagery contributes to excite, intensify and multiply the emotions and sometimes, through the use of symbols and metaphors, they foreground aspects of the characters’ thought.

This paper intends to analyse the above-mentioned characteristics using the software WordSmith Tools. A first step consists in a quantitative analysis aiming at identifying the keywords of the play, using a reference corpus made of the other so-called great tragedies by Shakespeare; a second step consists in a qualitative analysis consisting in the study of concordance lines, which allows to verify the different uses and meanings a given word assumes in a given text. The results will be evaluated against those of the ‘traditional’ studies about the imagery of the play in order to verify if and to what extent they coincide. The reason for this choice is to ascertain the usefulness of Corpus Linguistics in studying a linguistically complex literary text like King Lear where language plays such an important role.

References:

Armstrong, E.A. (1946). Shakespeare’s Imagination: A Study of the Psychology of Association and Inspiration. London: Drummond.

Clemen, W. (1959). The Development of Shakespeare’s Imagery. London: Methuen.

Culpeper, J. (2002). “Computer, Language, and Characterization: An Analysis of six Characters in Romeo and Juliet”. In U. Melander-Marttala, C. Ostman and M. Kyto (eds) Conversation in Life and in Literature: Papers from the ASLA Symposium, Association Suedoise de Linguistique Appliquée, 15, Universitetstryckeriet: Uppsala, 11-30.

Culpeper, J. (2007). “A New kind of Dictionary for Shakespeare’s Plays: An Immodest Proposal”. Yearbook of the Spanish and Portuguese Society for English Renaissance Studies, 17, 47-73.

Evans, B.I. (1952). The Language of Shakespeare’s Plays. Bloomington: Indiana University Press.

Frye, N. (1986) Northrop Frye on Shakespeare, ed. by R. Sandler, New Haven / London: Yale University Press.

Louw, B. (1993). “Irony in the Text or Insincerity of the Writer? The Diagnistic Potential of Semantic Prosodies”. In M. Baker, G. Francis and E. Tognini Bonelli (eds) Text and Technology. Amsterdam: John Benjamins, 157-176.

Louw, B. (1997). “The Role of Corpora in Critical Literary Appreciation”, in A. Wichman, S. Fligelstone, T. McEnery and G. Knowles (eds) Teaching and Language Corpora. Harlow: Longman, 240-251.

Scott ,M. (2006) “Key words of Individual Texts”. In M. Scott and C. Trimble (eds) Textual Patterns. Key Words and Corpus Analysis in Language Education. Amsterdam: John Benjamins, 55- 72.

Spurgeon, C. (1952) Shakespeare’s Imagery and What It Tells Us, Cambridge: UP.

Wilson Knight, G. (1959) The Wheel of Fire: Interpretations of Shakespearian Tragedy with Three New Essays. London: Methuen.

Corpus-based lexical analysis for NNS academic writing

Alejandro Curado-Fuentes

The analysis of NNS (non-native speakers of English) writing for academic purposes has led to extensive research over the past three decades. A main focus is on university compositions or essays where L2 learners ought to go through re-writing procedures. Less consideration seems to be given, by comparison, to NNS research writing for publication aims, although just to give two examples, Burrough-Boenisch (2003; 2005) examine proof-reading procedures.

This paper aims to present a corpus-based lexical analysis of nine journal papers written by Spanish Computer Science faculty. The texts available for the corpus analysis are those authored by the writers in their final versions; however, they are accessed prior to the editors’ last review. The corpus examination has been done by comparing the texts with NS material from a selection of the BNC (British National Corpus) Sampler (Burnard and McEnery, 1999). The chief objective in the process has been to identify both similarity and divergence in terms of the significant lexical items used. Word co-occurrence and use probability in the contrasted contexts determine academic competence, since the mastery of specific lexical patterns should indicate specialised writing (cf. Hoey, 2005). Based on the literature, an attempt at assessing major idiosyncratic traits as well as flaws has been made by comparing main native English, non-native English, and Spanish writing patterns. The contrastive study may constitute an interesting case study as reference for the design of teaching material and / or NNS writing guidelines.

References:

Burnard, L. and T. McEnery (1999). The BNC Sampler. Oxford: Oxford University Press.

Burrough-Boenisch, J. (2003). “Shapers of published NNS research articles”. Journal of Second Language Writing, 12, 223–243.

Burrough-Boenisch, J. (2005). “NS and NNS scientists’ amendments of Dutch scientific English and their impact on hedging”. English for Specific Purposesm 24, 25–39.

Hoey, M. (2005). Lexical Priming. A New Theory of Words and Language. London: Routledge.

Edinburgh Academic Spoken English Corpus: Tutor interaction

with East Asian students in tutorials

Joan Cutting

Some East Asian students are quiet in tutorials (De Vita 2007), and lecturers struggle to help them participate interactively (Jin 1992; Ryan and Hellmundt 2003). Training to increase lecturers’ cultural awareness of non-western learning styles is essential , as is training on interactive tutorial management with East Asian students (Grimshaw 2008; Stokes and Harvey 2008).

This paper discusses a pilot study of UK university postgraduate tutorials. The tuorials were video-recorded and form a sub-corpus (20,000 words) of the nascent Edinburgh Academic Spoken English Corpus. The transcripts were analysed using corpus linguistics and interactional sociolinguistics, to find what happens when students are required to interact. The database was coded for linguistic features (technical words, vague language, etc), structural features (length of speaking time, interruptions and overlaps, comprehension checks, etc), interactional features (speech acts, cooperative maxims, politeness strategies, etc) and pedagogical strategies (pair work, presentations, posters, etc). The features were then cross-tabulated to identify which linguistic forms, interactional and teaching strategies used by lecturers correlated with students’ active participation. Findings revealed that student participation varied according to some of the tutor behaviour. For example, if the tutor used vague language (e.g. general nouns and verbs, and indefinite pronouns) and colloquialisms, student participation was delayed until the tutor initiation was reformulated satisfactorily; and if the tutor used positive politeness strategies (e.g. using student first names and hedging) student participation increased. This paper discusses reasons for the correlations and lack of correlation, taking into account social variables such as individual differences, gender and language proficiency level.

The paper is essentially one that focuses on interactional information in the corpus, with a view to drawing up guidelines to support professional development of lecturers. It works on the assumption that findings of linguistic analysis could guide them in good practices (Cutting 2008; Reppen 2008).

References:

Cutting, J. and Feng, J. (2008). 'Getting them to participate: East Asian students' interaction with tutors in UK seminars'. Unpublished paper at East Asian Learner conference, University of Portsmouth.

De Vita, G. (2007). “An Appraisal of the Literature on Internationalising HE Learning”. In E. Jones and S. Brown (eds) Internationalising High Education. Oxford: Routledge.

Edwards, V. And A. Ran (2006). Meeting the needs of Chinese students in British Higher Education. Reading: University of Reading.

Grimshaw, T. (2008). 'But I don't have any issues': perceptions and constructions of Chinese-speaking students at a British university. Unpublished paper at East Asian Learner conference, Portsmouth.

Jin, L. (1992). “Academic Cultural Expectations and Second Language Use: Chinese Postgraduate Students in the UK: A Cultural Synergy Model”. Unpblished PhD Thesis, University of Leicester.

Ryan, J. and Hellmundt, S. (2003). “Excellence through diversity: Internationalisation of curriculum and pedagogy”. Refereed paper presented at the 17th IDP Australian International Education Conference, Melbourne

Stokes, P. and Harvey, S. (2008) Expectation, experience in the East Asian learner. Unpublished paper at East Asian Learner conference, Portsmouth.

CLIPS: diatopic, diamesic and diaphasic variations of spoken Italian

Francesco Cutugno and Renata Savy

We present here the largest corpus of spoken Italian ever collected. Compared to the average dimension of available speech corpora, CLIPS (Corpora e Lessici di Italiano Parlato e Scritto), in its spoken component [1], is among the largest ones (see table 1)

Table 1

In recent years Italian linguistics has dedicated an increasing amount of resources to the study of spoken communication, constructing their proper analytic tools and reducing the historical lack of available data for research. Among the various sources of variability naturally encountered in human languages along different dimensions of expression, Italian presents a particular relevance of diatopic variance which cannot be neglected and that is difficult to be represented: standard ‘Italian’ is an abstraction built on mixing and combining all regional varieties, each derived by one or more local romance dialects (De Mauro, 1972; Lepschy&Lepschy. 1977; Harris&Vincent, 2001).

CLIPS has been collected in 15 locations representative of 15 diatopic varieties of Italian chosen on the base of a detailed socio-economic and socio-linguistic analyses. Moreover the corpus is structured into 5 diaphasic/diamesic layers, presents a variety of textual typologies (see table 2). 30% of the entire corpus is orthographically transcribed, around 10% is labelled at various levels.

Table 2

Specific standard protocols have been used both for transcription and labelling. Orthographic transcription includes metadata description and classification, non-speech and non-lexical phenomena annotation (empty and filled pauses, false starts, repetitions, noises, etc.). Labelling is delivered in TIMIT format. It includes the following levels: lexical, phonological (citation form), phonetic, acoustic, extra-text (comments). The corpus is freely available for research and is carefully described by a set of public documents (website: http://www.clips.unina.it).

The CLIPS project has been funded by Italian Ministry of Education, University and Research (L.488/92) and coordinated by prof. Federico Albano Leoni.

References:

De Mauro, T. (1972). Storia linguistica dell’Italia unita, Bari: Laterza.

Lepschy, A.L. and G. C Lepschy (1977). The Italian Language Today, London: Hutchinson.

Harris M. and N. Vincent (2001). The Romance Languages (4th ed.), London: Routledge.

[1] Lexicon and text corpora in CLIPS have been collected by ILC, Pisa http://www.ilc.cnr.it/viewpage.php/sez=ricerca/id=41/vers=ing

Corpus-driven morphematic analysis

Václav Cvrček

In this paper I would like to present a computer tool for a corpus-driven morphematic analysis as well as a new corpus approach to derivational and inflectional morphology. The tool is capable of segmenting the Czech and English word-forms into morphemes (i.e. the smallest parts of words which have a meaning or function) by combining distributional and graphotactic methods. The important fact for a further research in this area is that the segmentation process described below can be quite efficient without involving any previous linguistic knowledge about the word structure or its meaning.

The program recursively forms a series of hypotheses about morpheme-candidates (esp. about morpheme boundaries) and evaluates these candidates on the basis of their distribution among other words in corpus. A working hypothesis assumes that when a morphematic boundary occurs, distribution of morpheme-candidates derived from this partitioning would be decisive (i.e. many words should contain these morphemes) in comparison to a hypothetical division separating two parts of one morpheme from each other (Urrea 2000). Another criterion is derived from graphotactics; assuming that between graphemes which co-occur often (measured with MI-score and t-score) is the probability of the morpheme boundary smaller, this criterion provides a less important but still valuable clue in segmentating the word into morphemes. Final boundary is drawn at a position which suits best for both of the criteria. Results gained from this computer program can be improved by involving some of linguistic information in the process, such as list of characters which are not capable of forming a prefix etc.

This tool can help to improve the linguistic marking (lemmatisation, tagging) of corpora (which is necessary for languages with rich morphology) and it, too, may be seen as a first possible step towards a semantic marking or clustering of words (identifying morphologically and therefore semantically related words). Besides these advantages, this statistical approach can help in redefining the term 'morpheme' on the grounds of corpus linguistics (only by statistical and formal properties of morpheme-candidates) as a part of word which is regularly distributed in the set of words of the same morphological characteristics (i.e. part of speech or words created by the same derivational process etc.).

Corpus-based derivation of a “basic scientific vocabulary” for indexing purposes

Lyne Da Sylva

The work described here pertains to developing language resources for NLP-based document description and retrieval algorithms, and to exploiting a specialized corpus for this purpose. Retrieval applications seek to help users find documents relevant to their information needs; thus document collections must be described, or indexed, using expressive and discriminating keywords.

Our work focuses on the definition of different lexical classes within the vocabulary of a given language, where each has a different role in indexing. Most indexing terms stem from the specialized scientific and technical vocabulary (SSTV). Of special interest however is what we call the “Basic scientific vocabulary”, which contains general words not belonging to any specialized vocabulary, but rather used in all scientific or scholarly domains, e.g. “model”, “structure”, “absence”, “example”. This allows subdivisions of the SSTV (e.g. “parabolic antenna, structure”) and may also help in improving term extraction (e.g. removing “structure” in “structure of parabolic antenna”) for automatic indexing. Our goal is to provide a proper linguistic description of the BSV class and develop means of identifying members of this class in a given language, including extracting this type of words from a large corpus (Drouin, 2007, describes a similar endeavour on a French corpus).

We have identified linguistic criteria (categorial, semantic and morphological) which may define this class, which differs from Ogden’s Basic English (Ogden, 1930), the Academic Word List (Coxhead, 2000) or frequency lists such as those extracted from the British National Corpus. We have also conducted an experiment which attempts to derive this class automatically, from a large corpus of scholarly writing. Our originality lies in the choice of corpus: the documents consist of titles and abstracts taken from bibliographic databases for a large number of different scholarly disciplines: ARTbibliographies Modern, ASFA1 : Biological sciences and living resources, Inspec, Library and Information Science Abstracts, LLBA, Sociological Abstracts, Worldwide Political Science Abstracts. The rationale behind this is they (i) contain scholarly vocabulary, (ii) tend to use general concepts (more than the full texts), and (iii) cover a wide range of scholarly disciplines, so the frequency of vocabulary items used in all should outweigh disciplinary specialities. The total corpus contains close to 14 million words.

An initial, manually-constructed list for English will be sketched (approximately 344 words), based on the linguistic criteria. Then details of the experiment will be given; 80% of the first 75 extracted nouns do meet the linguistic criteria, as do around 70% of the first 1000. This method has allowed us to increase our working BSV list to 756 words. We will discuss problems with the methodology, explain how the new list has been incorporated into an automatic indexing program yielding richer index entries, and present issues for further research.

References:

Coxhead, A. (2000). “A New Academic Word List”. TESOL Quarterly, 34(2), 213-238.

Drouin, P. (2007). “Identification automatique du lexique scientifique transdisciplinaire”. Revue française de linguistique appliquée, 12(2), 45-64.

Ogden, C. K. (1930) Basic English: A General Introduction with Rules and Grammar. London: Paul Treber.

A corpus-based approach to the language of the social services: Identifying changes in the profession of the social worker and priorities in social problems.

Adriana Teresa Damascelli

In recent times the need for highly specialised professional profiles has brought about changes in the job market. New professions are increasingly required to cover job position in fields which have developed their own specificity and produced linguistic items, i.e. specialised terminology, which are used to connote concepts, objects, actions, and activities relating to them. Nowadays, the proliferation of specialised fields is so rapid that it is not always possible to trace back their origin, their state, and what kind of evolution has taken place in terms of specialty as well as identify changes in terms of communication. On the other hand, there are some professions whose theoretical background has well-established roots which date back to the past and have gained their status of specialised field.

This paper focuses on the domain of the social services and in particular on the social worker profession. The aim is to determine if any further process of specialisation has taken place and/or has developed in ten years, and in what way. For present purposes a corpus of all the issues published from 1998 to 2008 by the British Journal of Social Workers is being collected. The corpus will maintain annual subdivision, although this would be not of help because of journal policy which may or may not impose the development of certain issues or topics in each volume or year. Subsequently, different computer tools, namely wordlists, concordances, and key words, will be used in order to investigate the corpus. The corpus-based approach will also be of help to trace any trend in the way topics have been covered and hypothesise any relationship with general attitude towards social matters. Examples as well data will be provided as evidence.

Lexical bundles in English abstracts: A corpus-based study of published and non-native graduate writing

Carmen Dayrell

English for Academic Purposes (EAP) poses major challenges to novice writers who, in addition to dealing with the various difficulties involved in the writing process itself, also have to comply with the conventions adopted by their academic discourse community. This is mainly because the presence (or absence) of some specific linguistic features is usually viewed as a key indicator of communicative competence in a field of study. The main rationale behind this suggestion is that expert writers who regularly participate in a given discourse community are familiar with the lexical and syntactical patterns more frequently used by their community and hence often resort to them. Not surprisingly, a number of corpus-based studies have paid increasing attention to lexical bundles (or clusters) in academic texts with a view to developing teaching materials which can assist novice researchers, and non-native speakers in particular, to write academic texts in English. This study follows this line of research and focuses on scientific abstracts. The primary purpose is to investigate lexical bundles in English abstracts written by Brazilian graduate students as opposed to abstracts of published papers from the same disciplines. The comparison is made across three disciplines, namely, pharmaceutical sciences, physics and computer science and takes into consideration the frequencies of lexical bundles with respect to their forms and structures. The data is drawn from two separate corpora of English abstracts. One corpus is made up of 158 abstracts (approximately 32,500 words) written by Brazilian graduate students from the abovementioned disciplines. These abstracts were collected in seven courses on academic writing offered between 2004 and 2008 by the relevant departments of a Brazilian university. The other corpus consists of 1,170 abstracts (over 205,000 words) taken from papers published by various leading academic journals. It has been designed to match the specifications of the corpus of students’ abstracts in terms of disciplines and percentages of texts in each. The long-term objective of this study is to translate the most relevant differences between the two corpora into pedagogic materials in order to raise students’ awareness of their most frequent inadequacies as well as of the chunks or phrases which are most regularly used within their academic discourse community.

Identifying speech acts in emails: Business English and non-native speakers

Rachele De Felice and Paul Deane

This work presents an approach to the automated identification of speech acts in a corpus of workplace emails. As well as contributing to ongoing research in automated processing of emails for task identification (see e.g. Carvalho and Cohen 2006; Lampert, Dale et al. 2008; Mildinhall and Noyes 2008), its main goal is to determine whether learners of English are using the appropriate structures for this type of text. In particular, tests of English for the business environment often require learners to write emails responding to specific task requirements, such as asking for information, or agreeing to a proposal, which can be assigned to basic speech act categories. Automated identification of such items can be used in the scoring of these open-content answers, e.g. noting whether all the speech acts required by the task are present in the students’ texts.

Our data consists of a corpus of over 1000 such emails written by L2 English students in response to a test prompt. Each email is manually annotated at the sentence level for a speech act, using a scheme which includes categories such as Acknowledgement, Question, Advisement (requesting the recipient do something), Disclosure (of personal thoughts or

intentions), and Edification (stating objective information).

We then identify a set of features which discriminates among the categories. Similar methods have been found successful in acquiring usage patterns for lexical items such as prepositions and determiners (De Felice and Pulman 2008; Gamon, Gao et al. 2008; Tetreault and Chodorow 2008); one of the novel aspects of this work is to extend the classifier approach to aspects of the text above the lexical level. The features refer to several characteristics of the sentence: punctuation, use of pronouns, use of modal verbs, presence of particular lexical items, and so on. A machine learning classifier is trained to associate particular combinations of features to a given speech act category, to automatically assign a speech act label to novel sentences. The success of this approach is tested by using ten-fold cross-validation on the L2 email set. First results show accuracy of up to 80%. We also intend to assess the performance of this classifier on comparable L1 data, specifically a subset of the freely available Enron email corpus.

Our work combines methods from corpus linguistics and natural language processing to offer new perspectives on both speech act identification in emails and the wider domain of automated scoring of open-content answers. Additionally, analysis of the annotated learner data can offer further insights into the lexicon and phraseology used by L2 English writers in this text type.

References:

Carvalho, V. and W. Cohen (2006). “Improving email speech act analysis via n-gram Selection”. HLT-NAACL ACTS Workshop. New York City.

De Felice, R. and S. Pulman (2008). “A classifier-based approach to preposition and

determiner error correction in L2 English”. COLING. Manchester, UK.

Gamon, M., J. Gao, et al. (2008). “Using contextual speller techniques and language

modeling for ESL error correction”. IJCNLP.

Lampert, A., R. Dale, et al. (2008). “The nature of requests and commitments in email

Messages”. AAAI Workshop on Enhanced Messaging. Chicago, 42-47.

Mildinhall, J. and J. Noyes (2008). “Toward a stochastic speech act model of email Behaviour”. Conference on Email and Anti-Spam (CEAS).

Tetreault, J. and M. Chodorow (2008). “The ups and downs of preposition error detection”. COLING, Manchester, UK.

What they do in lectures is…: The functions of wh-cleft clauses in university lectures

Katrien Deroey

As part of a study of lecture functions in general and highlighting devices in specific, this paper presents the findings of an investigation into the discourse functions of basic wh-cleft clauses in a corpus of lectures. These clauses, such as What our brains do is complicated information processing, are identifying constructions which background the information in the relative clause (What our brains do) and present the information foregrounded in the complement (complicated information processing) as newsworthy. Corpus-based studies of this construction to date have mainly described its function in writing (Collins 1991, Herriman 2003 & 2004) and spontaneous speech (Collins 1991 & 2006). From his examination of wh-clefts in speech and writing, Collins (1991: 214) concludes that ‘the linear progression from explicitly represented non-news to news offers speakers an extended opportunity to formulate the message’, while Biber et al (1999: 963) note that in conversation, speakers may use this construction with its typically low information content as ‘a springboard in starting an utterance’. With regard to academic speech, Rowley-Jolivet & Carter-Thomas (2003) found that basic wh-clefts are particularly effective in conference presentations to highlight the New and that their apparent ‘underlying presupposed question’ (p. 57) adds a dialogic dimension to monologic speech. All these features suggest that wh-clefts may be a useful in lectures, which are typically monologic and mainly concerned with imparting information. So far, however, studies on the function of these clefts in lectures have generally focussed on the function of part of the wh-clause as a lexical bundle (Biber 2006, Nesi & Basturkmen 2006) and mostly discussed its role as a discourse organising device.

For the current investigation, a corpus of 12 lectures drawn from the British Academic Spoken English (BASE) Corpus were analysed. This yielded 132 basic wh-clefts, which were classified for their main discourse functions based on the presence of certain lexico-grammatical features, the functional relationship between the clefts and their co-text and an understanding of the purposes of and disciplinary variation within the lecture genre. Four main functional categories thus emerged: informing, evaluating, discourse organizing, evaluating and managing the class. These functions of wh-clefts and their relative frequency are discussed and related to lecture purposes; incidental findings on their co-occurrence with pauses and discourse markers are also touched upon. The study of this highlighting device in a lecture corpus thus aims to contribute to our understanding of what happens in authentic lectures and how this is reflected in the language.

References:

Biber, D. (2006). University language: a corpus-based study of spoken and written registers: Studies in Corpus Linguistics 23, Amsterdam: John Benjamins.

Collins, P. C. (1991). Cleft and pseudo-cleft constructions in English. London: Routledge.

Collins, P. C. (2006). “It-clefts and wh-clefts: prosody and pragmatics”. Journal of Pragmatics, 38, 1706-1720.

Herriman, J. (2003). “Negotiating identity: the interpersonal functions of wh-clefts in English”. Functions of Language, 10 (1), 1-30.

Herriman, J. (2004). “Identifying relations: the semantic functions of wh-clefts in English”. Text, 24 (4), 447-469.

Nesi, H. and H. Basturkmen, (2006). “Lexical bundles and discourse signalling in academic lectures”. International Journal of Corpus Linguistics, 11 (3), 283-304.

Rowley-Jolivet, E., and S. Carter-Thomas (2005). “Genre awareness and rhetorical appropriacy: manipulation of information structure by NS and NNS scientists in the international conference setting”. English for Specific purposes, 24, 41-64.

Building a dynamic and comprehensive field association terms dictionary from domain-specific corpora using linguistic knowledge

Tshering Cigay Dorji, Susumu Yata, El-sayed Atlam, Masao Fuketa,

Kazuhiro Morita and Jun-ichi Aoe

With the exponential growth of digital data in recent years, it remains a challenge to retrieve and process this vast amount of data into useful information and knowledge. A novel technique based on Field Association Terms (FA Terms) has been found to be effective in document classification, similar file retrieval, and passage retrieval. It also holds much potential in areas like machine translation, ontology building, text summarization, cross-language retrieval etc.

The concept of FA Terms is based on the fact that the subject of a text (document field) can usually be identified by looking at the occurrence of certain specific terms or words in that text. An FA Term is defined as a minimum word or phrase that serves to identify a particular field and cannot be divided further without losing its semantic meaning. Since an FA Term may belong to more than one field, its strength to indicate a specific field is determined by assigning one of the five pre-defined levels.

However, the main drawback today is the lack of a comprehensive FA Terms dictionary. This paper proposes a method to build a dynamically-updatable comprehensive FA Terms dictionary by extracting and selecting FA Terms from large collections of domain-specific corpora. Firstly, the documents in a domain-specific corpus are part-of-speech (POS) tagged and lemmatized using a program called TreeTagger. Secondly, the FA Term candidates which consist of “rigid noun phrases” are extracted by matching predefined POS patterns. Thirdly, the term frequencies and the document frequencies of the extracted FA Term candidates are compared with those of the candidate terms from a reference corpus. The FA Term candidates are then weighted and ranked by using a special formula based on tf-idf. Finally, the level of each newly selected FA Term is decided by comparing with the existing FA Terms in the dictionary. The process is repeated on a regular basis for newly obtained corpora so that the FA Terms dictionary remains up-to-date.

Experimental evaluation for 21 fields using 306.28 MB of domain-specific corpora obtained from English Wikipedia dumps selected 497 to 2,517 Field Association Terms for each field at precision and recall of 74-97% and 65-98% respectively. The relevance of the automatically selected FA Terms was checked by human experts. The newly selected FA Terms were added to the existing dictionary consisting of 14 super-fields, 50 median fields and 393 terminal fields. Experimental results showed that the proposed method is an effective method for building a dynamically-updatable comprehensive FA Terms dictionary. Future studies will improve the proposed methodology by adding a new module for automatic classification of new documents for extraction of new FA Terms, and explore the application of FA Terms in cross-language retrieval and machine translation.

Encoding intonation: The use of italics and the challenges for translation

Peter Douglas

English, Halliday reminds us, is ‘a language in which a relatively heavy semantic load is carried by rhythm and intonation’. The position of the accented word or syllable is largely determined by the speaker’s involvement in an ongoing discourse in which New (rather than Given) information receives tonal prominence. It has been noted that intonation, an overt signal of information status in spoken English, is difficult to convey in the written language. However, the use of italics, particularly in literary texts, can be seen to represent such emphasis; they are a hybrid convention, signs located at the border between written and spoken discourse. General opinion that regards italics as mere punctuation options belies their complex nature and masks their importance for the translation scholar and for the theorist interested in the written representation of spoken discourse.

The present paper claims that as languages have their own prosodic conventions and their own codes of how these conventions are represented in writing, problems will arise during the translation process. Research into English and Italian source and target texts illustrates how two languages represent intonation in written language and the challenges that this poses for the translator. To this effect, a small, bidirectional parallel corpus of 19th and 20th century English and Italian source texts and their respective target texts was created, and each occurrence of italics identified. Given the typographical nature of the item, compilation of the corpus presented its own challenges.

Successive rounds of investigation provided quantitative data which, when analysed, revealed not only the considerable communicative potential of italics, but also the marked differences between the two language codes, their respective textual conventions and the strategies available to the translator. A distinctive feature that emerged from the analysis of the English source texts is that italicised items tend to signal marked rather than unmarked default intonation patterns. They highlight words which would generally be expected to go unselected, i.e. they evidence part of the information unit that is Given rather than New. In addition, analysis shows the extent to which italics communicating emphasis appear to be culture-specific to the English texts. However, further examination of Italian target texts also reveals that a wide range of lexico-grammatical resources can be drawn upon to approximate the source text meaning.

References:

Halliday, M. A. K. (1985). An Introduction to Functional Grammar. London: Edward Arnold, 271.

Lexical bundles in classroom discourse: A corpus-based approach to tracking disciplinary variation and knowledge

Paul Doyle

Corpus-based approaches to discourse analysis provide sophisticated analytical tools for investigating linguistic patterning beyond the clause level. Coupled with interpretations sensitive to the functional role of such patterning in texts, these approaches can facilitate deeper insights into differences between genres and registers, and, in the context of the discourse of primary and secondary classroom teaching, relate these understandings to broader pedagogic processes. In this paper, I focus on recurrent word sequences, or lexical bundles (Biber et al. 1999), as markers of disciplinary variation in a corpus of primary and secondary teacher talk. Frequently occurring lexical bundles can be classified using functional categories such as epistemic stance expressions, modality and topic related discourse organising expressions (ibid). However, in order to account for variation in lexical bundle distribution across disciplines, there is a need for an interpretative framework that relates to a specific community of language users operating in a single genre (Hyland, 2008). Classroom talk is a hybrid discourse (Biber, Conrad and Cortes, 2004) that exhibits both the characteristic interpersonal features of spoken language and ‘literate’ features of written language from textbooks, and it is especially rich in lexical bundles. Using data from the Singapore Corpus of Research in Education, an ongoing educational research project utilising a corpus of classroom interactions, textbooks and students artefacts in key curriculum disciplines, I trace variations in pedagogic practice as evidenced in teacher talk from English medium lessons in English language, Mathematics, Science and Social Studies in Singapore classrooms. Frequent lexical bundles are classified using a framework adapted from Hyland’s (2008) taxonomy, and the distribution of the various categories is compared across the four school disciplines. The approach is evaluated in terms of its ability to relate linguistic variation to significant disciplinary differences, and to highlight processes of knowledge construction in the classroom.

References:

Biber, D., S. Conrad and V. Cortes (2004). “If you look at...: Lexical Bundles in University Teaching and Textbooks”. Applied Linguistics, 25 (3), 371-405.

Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999). Longman Grammar of Spoken and Written English. London: Longman Pearson.

Hyland, K. (2008). “As can be seen: Lexical bundles and disciplinary variation”. English for Specific Purposes, 27 (1), 4-21.

A function-first approach to identifying formulaic language

Philip Durrant

There has been much recent interest in creating pedagogically-oriented descriptions of formulaic language (e.g., Biber, Conrad, & Cortes, 2004; Ellis, Simpson-Vlach, & Maynard, 2008; Hyland, 2008; Shin & Nation, 2008). Corpus-based research in this area has typically taken what might be called a ‘form-first’ approach, in which candidate formulas are identified as the most frequent collocations or n-grams in a relevant corpus. The functional correlates of these high frequency forms are then determined at a later stage of analysis. This approach has much to commend it. It enables fast, reliable analysis of large bodies of data, and has produced some important insights, retrieving patterns which are often not otherwise evident.

While this research continues to yield valuable results, the present paper argues that much can also be gained by taking what I will call a ‘function-first’ approach to identifying formulaic language. On this approach, a corpus is first annotated for communicative functions; formulas are then identified as the recurrent patterns used to express each function. This method has two key advantages. First, it enables the analyst to identify not only how frequent a given form is, but also what proportion of attempts to express a particular message use that form. As Wray (2002, p. 30) has pointed out, information of this sort is often vital, alongside overall frequency information, in determining whether an item is formulaic.

The second advantage of a function-first approach is that it enables the analyst to take account of the range of variation associated with a particular expression. Form-first approaches are required to specify in advance the formal features that two strings of language must have in common to count as instances of a single formula. This has usually meant restricting analysis to simple forms such as two-word collocations or invariant ‘lexical bundles’. Since many formulas are to a certain degree internally variable, this restriction inevitably involves the loss of potentially important information. The function-first approach aims to account for the range of variation typical in the expression of a given function and so to include a fuller description of such variable formulas.

While these features of the function-first approach have the potential to be of much theoretical interest, the primary motivation for the approach is pedagogical. For language learners, the key information is often not which strings of language are the most frequent, but rather which functions they are most likely to need, what formulas will most appropriately meet these needs, and how those formulas can be manipulated. This is precisely the information which a function-first approach aims to provide.

This paper will demonstrate a function-first approach to identifying formulaic language through an investigation of a collection of introductions to academic essays written by MA students in the social sciences, taken from the corpus of British Academic Written English. The methodological issues involved will be discussed and sample results presented.

References:

Biber, D., S. Conrad and V. Cortes (2004). “If you look at ...: Lexical Bundles in University Teaching and Textbooks”. Applied Linguistics, 25 (3), 371-405.

Ellis, N. C., R. Simpson-Vlach and C. Maynard (2008). “Formulaic language in native and second- language speakers: psycholinguistics, corpus linguistics, and TESOL”. TESOL Quarterly, 41 (3), 375-396.

Hyland, K. (2008). “Academic clusters: text patterning in published and postgraduate writing”. International Journal of Applied Linguistics, 18 (1).

Shin, D. and P. Nation (2008). “Beyond single words: the most frequent collocations in spoken English”. ELT Journal, 62 (4), 339-348.

Wray, A. (2002). Formulaic language and the lexicon. Cambridge: CUP.

Corpus-based identification and disambiguation of reading indicators for German nominalizations

Kurt Eberle, Gertrud Faaß and Ulrich Heid

We present and discuss an automatic system for the disambiguation of certain lexical semantic ambiguities and for the identification of disambiguation clues in corpus text.

As an example, we analyse German nominalizations of verbs of information (e.g. Äußerung “statement”, Darstellung “(re-)presentation”, Mitteilung “information”) and nominalizations of verbs of measuring and recording (e.g. Beobachtung “observation”, Messung “measurement”, which can be ambiguous between an event reading (“event of uttering, informing, observing”, etc.) and a fact-like object reading which makes reference to the content in question. For language understanding, content analysis, and translation, it is desirable to be able to resolve this type of ambiguity.

The ambiguity can only be resolved in context; relevant types of contextual information include the following:

* semantic class of the verb underlying the nominalization;

* syntactic embedding selectors of the nominalization, e.g.

- verbs (e.g. modal vs. full verbs, tense);

- prepositions;

* complements and modifiers of the nominalizations, e.g. determiners, adjectives, PPs and expressing participants of the verb frame, such as the agent.

There are many contexts, however, where indicators can themselves be ambiguous, like the German preposition nach, in constructions of the kind nach (den) Äußerungen von X . . . : nach can mean a temporal sequence (“after”, event indicator) or reference to a content (“according to”); to resolve this combined ambiguity definiteness and agent phrases play a role.

Such fine-grained interpretations require a sophisticated syntactic and semantic analysis tool. Many interactions need to be identified and tested. Our system provides both these functions: an automatic syntactic and semantic analysis with partial disambiguation and with an underspecified representation of ambiguity, and a search mode which allows us to browse texts for relevant constellations to find indicator candidates.

Our tool is a research prototype based on the large coverage grammar of the machine translation product translate (http://lingenio.de/English/Products/translation-software.htm) it includes a dependency parser based on Slot Grammar (McCord 1991) and provides underspecified representations in the style of FUDRT (Eberle 2004).We added a functionality to the parser that allows us to modularly add hypotheses about the reading indicator function of certain words or combinations thereof; by adding these hypotheses to the system’s knowledge source, we can test their disambiguation effect.

A search/browse mode allows us to search and parse results for specific syntactic/semantic constellations, to cross-check them in large corpora and to group and count context with related properties, as, e.g. nach Äußerungen von Maier “according to statements by Maier” versus nach den Äußerungen von Maier “after the statements by Maier”.

In the paper, we present the tool architecture and functions, and we elaborate on details of the event vs. fact disambiguation with nominalizations of verbs of information.

Natural ontologies at work: Investigating fairy tales

Ismaïl El Maarouf

The goal of the EMOTIROB project is to build a smart robot which could respond adequately to children experiencing emotional difficulties. One of the research activities is to set up a comprehension module (Achour et alli 2008) signalling the emotional contour of utterances in order to stimulate an adequate response (non-linguistic) from the robot when faced with child linguistic input. This will require a corpus analysis of input language. Whilst the oral corpus is being completed, research is being carried out on another kind of child-related material: Fairy Tales, that is, a corpus of adults addressing children through texts destined to arouse their imagination. This corpus will mainly serve as a reference for a comparison to a corpus of authentic child language.

The purpose of this study is to characterize semantically this subset of children's linguistic environment. Studies of child language in the mainstream of corpus linguistics are scarce (Thompson & Sealey 2007), and so are child-related corpora, which means that we know very little about their specificities.

This study focuses on most frequent verbs and investigates concordances to classify their collocations. The theoretical framework used here is Patrick Hanks's Corpus Pattern Analysis (Hanks 2008). This corpus-driven model makes the most of linguistic contexts to define each meaning unit. Each verb pattern is extracted and associated to a particular meaning while every pattern argument is semantically typed when it helps to disambiguate the meaning of the verb. Building semantic patterns is not a straightforward procedure: we analyse how predefined ontological categories can be used, confronted to, or combined with, the set of collocates found in a similar pattern position. Each collocate set can then be grouped with another under the label of a natural semantic category. Natural ontologies is, in this perspective, a term which designates the network of semantic categories emerging from corpus data.

The corpus has been annotated semantically and manually with Anaphora and Proper Noun Resolution and verb patterns have been extracted and their arguments semantically typed. This annotation enables the analysis of preferences of semantic types and linguistic units separately since the same unit may pertain to diverse semantic categories. The corpus, though very small according to the standards of Corpus Linguistics (less than 200 000 running words), reveals peculiar semantic deviances compared to studies led so far on adult-addressed texts such as newspapers. In order to account for such deviances, we propose to characterize verbal arguments around prototype categories, which are defined through corpus frequencies. This article shows, among other things, how child-addressed corpora can be described thanks to the specific semantic organisation of verb arguments and to corpus-derived natural ontologies.

References:

Achour A., M. Le Tallec, S. Saint-Aime, B. Le Pévédic, J. Villaneau, J.-Y. Antoine and D. Duhautd

(2008). “EmotiRob: from understanding to cognitive interaction”, In 2008 IEEE International Conference on Mechatronics and Automation – ICMA 2008. Takamatsu, Kagawa, Japan.

Hanks, P., 2008, “Lexical Patterns: From Hornby to Hunston and Beyond”, In E. Bernal and

J. Decesaris (eds) (2008). Proceedings of the XIII EURALEX International Congress, Barcelona, Sèrie Activitats 20. Barcelona: Universitat Pompeu Fabra, 89-129

Hanks P. and E. Jezec (2008), “Shimmering lexical sets”, In Euralex XIII 2008 Proceedings, Pompeu Fabra University, Barcelona.

A corpus-based analysis of reader positioning: Robert Mugabe as a pattern in the Guardian newspaper

Izaskun Elorza

Several attempts have been carried out to describe genres or text types within newspaper discourse by means of contrastive analyses of bilingual comparable corpora. Recent approaches to the description of different features of this type of discourse focus on the analysis of evaluative language and reader positioning (e.g., McCabe & Heilman 2007; Marín Arrese & Núñez Perucha 2006). It is my contention that contrastive analyses on evaluation and metadiscourse (Hyland 2005) can benefit from corpus methodology as a complementary tool for revealing patterns used by newsworkers with various purposes which include reader positioning and, in order to illustrate this, an analysis of the Guardian newspaper is presented in this paper.

In cross-cultural contrastive analyses, one way of minimising the great variety of variables at stake is to choose the coverage of the same event as the criterion for the compilation of the comparative corpora, so that the event coverage or the topic is taken as the control variable. Along this line, a contrastive analysis has been carried out of the coverage by El País and the Guardian of the opening day of a summit organised by UN’s Food and Agricultural Organisation in June 2008. A list of keywords has been produced by means of WordSmith Tools 4.0 (Scott 2004), revealing that the participant who received greater attention in El País was Ban Ki-moon (UN secretary general) but that the Guardian devoted much more attention to Robert Mugabe (the controversial president of Zimbabwe). An imbalance of this kind in the coverage of the summit might be due to cultural differences in consonance (Bell 1991), which relates the newsworkers’ choice of devoting attention to some participants or others to the readers’ expectations in each case. And, as the newspaper readers’ expectations are conformed by, among other things, the repetition of the same patterns in the newspaper (O’Halloran 2007; 2009), a second analysis has been carried out to try to find out whether the presence of Robert Mugabe in the summit has been constructed by means of some recurrent pattern and, if so, which elements belong to that pattern.

For this analysis a second corpus of approximately 1 million running words has been compiled consisting of news articles and editorials from the Guardian (2003 to 2008). Concordances of ‘Mugabe’ and ‘summit’ have been obtained of this corpus allowing me to identify some recurrent elements involving different kinds of evaluative resources, including attitudinal expressions of judgement and appreciation (Martin & White 2005). The results presented here come to reinforce the suggestion that an approach integrating discourse analysis and corpus linguistics is more productive for in-depth discourse description than qualitative approaches alone.

References:

Hyland, K. (2005). Metadiscourse. London: Continuum.

O’Halloran, K. (2007). “Using Corpus Analysis to Ascertain Positioning of Argument in a Media Text”. In M. Davies, P. Rayson, S. Hunston, and P. Danielsson (eds) Proceedings of the Corpus Linguistics Conference CL2007.

O’Halloran, K. (2009). “Inferencing and Cultural Reproduction: A Corpus-based Critical Discourse Analysis”. Text & Talk, 29 (1), 21-51.

Marín Arrese, J. I. and B. Núñez Perucha (2006). “Evaluation and Engagement in Journalistic Commentary and News reportage”. Revista Alicantina de Estudios Ingleses, 19, 225-248.

Martin, J. R. and P. R. R. White (2005). The Language of Evaluation: Appraisal in English. Houndmills: Palgrave MacMillan.

McCabe, A. & K. Heilman (2007). “Textual and Interpersonal Differences between a News Report and an Editorial”. Revista Alicantina de Estudios Ingleses, 20,139-156.

Scott, M. (2004). WordSmith Tools. Oxford: Oxford University Press.

Constructing relationships with disciplinary knowledge: Continuous aspect in reporting verbs and disciplinary interaction in lecture discourse

Nick Endacott

This paper examines use of the continuous aspect of reporting verbs in statements recontextualising disciplinary knowledge in lecture discourse, focusing on how it sets up complex participation frameworks and dialogic interaction patterns in such statements and the roles these patterns play in negotiating lecturer and disciplinary identities in the genre.

Authentic academic lecture discourse exhibits many complexities in its lexico-grammar realising reporting and discourse representation. Significant among these complexities is the use of the continuous aspect with reporting verbs. Although reporting and discourse representation have been investigated in more lexico-gramatically ‘standard’ genres such as the Research Article (Thompson & Yiyun 1995, Hyland 2000), the use of the continuous aspect falls outside existing typologies for studying reporting and discourse representation in academic discourse (Thompson & Yiyun 1991, Hyland 2000). This is likely to be a reflection of the lexico-grammatical complexity of lecture discourse, comprising as it does features of both written and spoken discourse (e.g. Dudley-Evans & Johns 1981, Flowerdew 1994), and a reflection too of the pedagogic nature of lecture discourse. Broadly, the continuous aspect is used in lecture discourse both in metadiscourse and in discourse putting forward propositional content. Although the focus on message as opposed to ‘original words’ (McCarthy 1998; cf. Tannen 1989) and the construction of highly dialogic co-authorship in discourse (Bakhtin 1981, Goffman 1974) made through this usage is similar in both functional areas, the pragmatic purposes and participants involved differ. Using data from the BASE* corpus, this paper investigates the use of continuous aspect in the latter area, propositional content, and in particular examines the complex disciplinary interaction patterns constructed in lecture discourse by the continuous aspect and the highly dialogic participation frameworks (Goffman 1974) co-constructed with it. These interaction patterns create varying identities for lecturers, for the disciplinary knowledge they are recontextualising in lectures, and for the disciplines they lecture from, and as such are another resource for comparing patterns of social interaction in academic genres (Hyland 2000) and assessing disciplinary structure (Becher 2001, Hyland 2000).

* The British Academic Spoken English Corpus (BASE) was developed at the universities of Warwick and Reading under the directorship of Hilary Nesi and Paul Thompson. Corpus development was assisted by funding from BALEAP, EURALEX, the British Academt and the Arts and Humanities Research Council.

Exploring morphosyntactic variation in Old Spanish with Biblia Medieval (a parallel corpus of Spanish medieval Bible translations)

Andrés Enrique-Arias

This paper discusses the main advantages and shortcomings of the Biblia Medieval corpus in the historical study of morphosyntactic variation and change. Containing over 5 million words, Biblia Medieval is a freely accessible online tool that enables linguists to consult and compare side-by-side the existing medieval Spanish versions of the Bible and that gives them access the facsimiles of the originals.

The historical study of morphosyntactic variation and change through written texts involves some well-known challenges. One of them is the difficulty to identify and to define the contexts of occurrences of morphosyntactic variants since these are normally conditioned by a complex set of syntactic, semantic and discursive factors. In order to minimize this problem, it is necessary to locate and to examine a large number of occurrences of the same linguistic structure in versions that were produced at different time periods. Ideally, these occurrences should proceed from texts that have been influenced by the same textual conventions. Conventional corpora, however, are not well-suited for locating such occurrences. This problem is alleviated when the linguist has access to a parallel corpus. Parallel corpora consist of texts that are translated equivalents of a single original. Therefore they have the same underlying content and have been influenced by similar textual conventions. When working with a parallel corpus, the analyst may abstract away from the influence of contextual properties and focus instead on the diachronic evolution of structural phenomena. This will lead to more nuanced generalizations of the historical evolution of the phenomena being scrutinized. Alternatively, one can focus on contextual properties seeking to identify, for instance, variation across subgenres within texts, or the differences between translated texts in the corpus and non-translated ones. While a parallel corpus does not solve all the problems inherent to working with historical texts, it does enable the analyst to observe the historical evolution of structural phenomena while controlling the situational dimensions that condition variation.

In order to illustrate the possibilities that Biblia Medieval offers, this paper gives an overview of specific case studies of Spanish morphosyntactic variation: (i) possessive structures, (ii) differential object marking, and (iii) clitic related phenomena. In sum, this study demonstrates how a parallel corpus of medieval biblical texts may enrich our theoretical understanding of change and variation phenomena in Spanish.

“To be perfectly honest …” Patterns of amplifier use in speech and writing.

David Evans

Quirk et al’s (1985) A Comprehensive Grammar of English has been influential in shaping much of the subsequent literature on ‘upscaling’ degree modifiers in two important ways. First, Quirk et al, and subsequent commentators such as Allerton (1987) and Paradis (1997), take an essentially semantic view of the relationship between the modifying and modified elements. In this view of degree modification, the main constraint on an amplifier is that the modified lexical item and modifier “must harmonize … in terms of scalarity and totality to make a successful match” (Paradis, 1997: 158). Beyond this constraint, amplifiers were viewed as strong exponents of the open-choice principle (Sinclair, 1991); free to co-occur with a wide variety of types, with the choice of one amplifier over another having little impact on meaning.

Second, the traditional structure of major grammars, such as Quirk et al, puts an emphasis on parts of speech and their interaction with each other. This structure has had two effects: adverbs have been, and to a large extent remain, the focus of much work on intensification, ignoring the fact that taboo words such as fucking are more frequent as emphatic modifiers than a number of -ly adverbs, particularly in demographically sampled spoken data. It also means that the focus has been quite narrow, focussing very much on the amplifier’s immediate neighbours.

Subsequent corpus-based studies by Partington (1998, 2004) and Kennedy (2002, 2003) have shown that collocation, semantic prosody and semantic preference place further limits on the occurrence of amplifiers.

Examining a wide range of degree modifiers, both standard (very important, utterly devastated) and non-standard (fucking massive, crystal clear) in spoken and written English this paper attempts to address some of these issues. It argues that patterns of amplifier use vary according to aspects such as the meaning of the adjective, where it is polysemous, and the nominal expression an adjective is itself modifying. It also suggests that grammatical environment, i.e. whether an adjective is operating predicatively or attributively, has an impact on the meaning and frequency of certain degree modifiers.

In so doing, this paper challenges the notion that some degree words are interchangeable, concluding that degree modifiers are subject to ‘priming’ (Hoey, 2005) to a much greater extent than had previously been thought.

References:

Allerton, D.J. (1987) “English intensifiers and their idiosyncrasies”. In R. Steele and T. Threadgold (eds) Language topics: Essays in honour if Michael Halliday 2. Amsterdam: Benjamins, 15-31.

Hoey, M. (2005) Lexical Priming. Abingdon: Routledge

Kennedy, G. (1998) “Absolutely diabolical or relatively straightforward: Modification of adjectives by degree adverbs in the British National Corpus.” In A. Fischer, G. Tottie and H.M. Lehmann (eds) Text types and corpora: Studies in honour of Udo Fries Tubingen: Gunter Narr, 151-163.

Kennedy, G. (2003) “Amplifier Collocations in the British National Corpus: Implications for English Language Teaching”. TESOL Quarterly, 37(3), 467-487.

Paradis, C. (1997) Degree modifiers of adjectives in spoken British English. Lund: Lund University Press.

Partington, A. (1998) Patterns and Meanings. Amsterdam: Benjamins.

Partington, A. (2004) “‘Utterly content in each other’s company’: Semantic prosody and semantic preference.” International Journal of Corpus Linguistics, 9(1), 131-156.

Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985) A Comprehensive Grammar of English. Harlow: Longman.

Sinclair, J.M. (1991) Corpus, concordance, collocation. Oxford: Oxford University Press.

Large and noisy vs small and reliable: Combining two types

of corpora for adjective valence extraction

Cécile Fabre and Anna Kupsc

This work investigates a possibility of combining two different types of corpora to build a valence lexicon for French adjectives. We complete adjectival frames extracted from a Treebank with statistical cues computed from a large automatically parsed corpus. This experiment shows how linguistic knowledge and large amount of annotated data can be used in a complementary manner.

On one hand, we use a 1 million-word Treebank, manually revised and enriched with syntactic and functional annotations for major constituents. On the other hand, we have a 200 million-word corpus automatically parsed, with no subsequent human validation, where the texts have been annotated with dependency relations. In none of the two corpora is the argument/adjunct distinction specified for dependents of adjectives.

In the first step, adjectival frames are extracted from the Treebank. The main issue is to separate higher-level constructions from valency information. We develop a set of rules to filter out constructions such as comparative, superlative, intensifying or impersonal. At this stage, 41 different frames are extracted for 304 adjectives. This

result needs further exploration: first, due to imperfect or insufficient corpus annotations, the data is not totally reliable. In particular, frames of adjectives in impersonal constructions are often incorrectly extracted. Second, we are not able to separate real PP-arguments from adjuncts (e.g. applicable à N vs applicable sur N: applicable to / on N). Finally, due to the small size of the corpus, the number of extracted frames and the size of the lexicon are reduced.

In the second step, we use much more data and apply statistical methods with two objectives: rank the frames on the scale of the argument/adjunct continuum and discover additional frames. We focus on PP-complements as they turned out to be the most problematic for the previous approach. We use several simple statistical measures to evaluate valence properties of the frames, focusing in particular on the obligatoriness and autonomy of the PP with respect to the adjective. We define optimal valency criteria which we apply for ranking and identifying new frames. PP is all the more likely to be an argument of the adjective when it meets the following three conditions: the frame is productive (the adjective combines with a large range of nouns or infinitives introduced by the preposition), the adjective is rarely found alone (i.e., without the accompanying frame), and prepositional expansions that are attached to this adjective are mostly introduced by the preposition which appears in the frame. Applying these criteria allowed us to rank the frames extracted from the Treebank and extract dependency information for about 2600 new adjectives.

An empirical survey of French, Latin and Greek words in the British National Corpus

Alex Chengyu Fang, Jing Cao, Nancy Ide, Hanhong Li, Wanyin Li and Xing Zhang

In this article we report the most extensive survey of foreign words in contemporary British English. In particular, we report the distribution of French, Latin and Greek words according to different textual settings in the British National Corpus (BNC). The reported study has several motivations. Firstly, the English language has borrowed extensively from other languages, mainly from French, Latin and Greek (Roberts 1965; Greenbaum 1996). Yet, no large-scale empirical survey has been conducted or reported for their significant presence in our present-day linguistic communication. Secondly, many text genres are typically characterized by the use of foreign words, such as Latinate ones (De Forest and Johnson 2001); however, there is no quantitative indication that relates proportions of foreign words to different types of texts. A third motivation is the fact that technical or professional writing tends to have a high density of specialized terms as basic knowledge units and the automatic recognition of such terms has so far ignored the important role that foreign words typically play in the formation of such terms. The study to be reported here is therefore significant not only for its descriptive statistics of foreign words in the 100-million-word BNC. It is also significant that the empirical results that we will report here will lend themselves readily to practical applications such as automatic term recognition and text classification.

Our survey has focused on the distribution of foreign words according to two settings: text types (such as writing vs. speech and academic prose vs. non-academic prose) and subject domains (such as sciences and humanities). A central question we attempt to answer is whether there is significant difference in the distribution of foreign words across different text types and the various subject domains concerned in the study. As our results will show, there is an uneven use of foreign words across text categories and subject domains. With regard to text categories, the proportion of foreign words successfully separates speech from writing, and, within writing, academic prose from non-academic prose. This finding shows a strong correlation between degrees of formality and proportions of foreign words. Our investigation also shows that even different domains have their own preferences for the use of foreign words. To be more exact, French is more related to humanities while the proportion of Latinate words is higher in scientific texts. Greek words behave in the same manner as Latinate words but they are relatively few and thus insignificant compared with the other two groups.

In conclusion, this survey demonstrates on empirical basis that the use of foreign words such as French, Latin and Greek distinguishes texts on a scale of different formalities. The survey also demonstrates that different domains seem to have a different proportion and therefore preference for the use of foreign words. These findings will lead to applications in automatic text classification, genre detection and term recognition, a possible development that we are currently investigating in a separate study.

References:

De Forest, M. and E. Johnson (2001). “The Density of Latinate Words in the Speeches of Jane Austen’s Characters”. Literary and Linguistic Computing, 16(4), 389-401

Greenbaum, S. (1996). The Oxford English grammar. Oxford : Oxford University Press.

Roberts, A. H. (1965). A Statistical Linguistic Analysis of American English. The Hague: Mouton.

The use of vague formulaic language by native and non-native speakers of English in synchronous one-on-one CMC

Julieta Fernandez and Aziz Yuldashev

Vague lexical units serve essential functions in everyday conversation. People use vague language when they want to soften their utterances, create in-group membership (Carter & McCarthy, 2006) or invite interlocutors into a presumably shared social space (Thorne & Lantolf, 2007). Although speakers hardly ever notice it, vague language appears rather widespread in everyday communication (Channell, 1994). Recent corpus linguistic studies of attested utterances point to a prominence of multiword vague expressions, the most recurrent being “I don’t know if,” “or something like that” and “and things like that” (McCarthy, 2004). This body of research suggests that multiword vague language constructions constitute an essential resource for language-in-use.

To date, research on vague language has mostly focused on its prevalence in spoken discourse, with more recent studies also addressing vague language use in written discourse. However, we are aware of no studies that have described vague language use in synchronous computer-mediated communication (CMC) contexts, such as Instant Messaging (IM) chat. The present paper reports on a study that investigates the use of multiword vague units in one-on-one IM chats. It specifically examines vague expressions that fall under the category of general extenders (adjunctive [e.g., ‘and stuff like that’, ‘and what not’] and disjunctive general extenders [e.g., ‘or anything’, ‘or something’]) as defined by Overstreet (1999). The analysis focuses on the comparison of variations in the use of general extenders (GEs) in IM between native and non-native expert users of English, the impact of their use on fluency and the implications of the findings for computer assisted second language pedagogy.

Drawing on the analysis of a corpus of 524 instant messaging conversations of different lengths, the study sheds light on the patterns of use and functions of GEs in one-on-one text-based IM chat in relation to their use and functions in oral conversations as documented in previous studies. The GEs utilized by participants have been found to operate at several levels (Evinson et al., 2007), where some GEs implicate knowledge that is ostensibly shared by all human beings; some other GEs are more locally constrained or group-bound; while a third group appears to be culturally bound to conceptual and conventional speech communities.

The examination of the data reveals a number of tendencies in the use of GEs by native and non-native expert English language users. Variations in the frequencies of GE use are discussed in view of the existing evidence regarding the overuse of multi-word vague expressions by non-native language users. The trends of vague expression use are scrutinized in light of a body of research on the affordances provided in CMC environments for language learning and use. The findings and conclusions underscore the pedagogical significance of sensitizing L2 learners to the interactional value of vague formulaic sequences in CMC contexts, their potential for conveying a variety of meanings for a range of purposes, and their contribution to fluency.

References:

Carter, R. and M. McCarthy (2006). Cambridge Grammar of English. Cambridge: CUP

Channell, J. (1994). Vague Language. Oxford: OUP

Evinson, J., M. McCarthy and A. O’Keefe, ““Looking Out for Love and All the Rest of It”: Vague Category Markers as Shared Social Space” In J. Cutting (ed.) Vague Language

Explored. New York: Palgrave MacMillan, 138-157.

Thorne, S. L. And J. P. Lantolf(2007). “A Linguistics of Communicative Activity” . In S. Makoni & A. Pennycook (eds), Disinventing and Reconstituting Languages. Clevedon: MM, 170-195.

McCarthy, M. (2004). “Lessons from the analysis of chunks”. The Language Teacher, 28 (7), 9-12.

Units of meaning in language = units of teaching in ESP classes? The case of Business English

Bettina Fischer-Starcke

Both words and phrases are units of meaning in language. When teaching business English to university students of economics, however, the question arises whether either words or phrases or both words and phrases are suitable units of teaching. This is particularly relevant since teaching materials in business English courses frequently focus on single words, possibly with their collocations, for example to effect payment, or terminology consisting of one or more words, for example seasonal adjustment. This paper takes a corpus linguistic approach to answering the question which linguistic units distinguish business English from general English and should therefore be taught to students of business English by analyzing words and phrases of the Wolverhampton Business English Corpus for their business specificity.

In this paper, (1) quantitative keywords from the Wolverhampton Business English Corpus when compared with the BNC and (2) the Wolverhampton Business English Corpus’ most frequent phrases between three and six words are analysed and classified into the categories general English and business specific. The phrases are (1) uninterrupted strings of words, so-called n-grams, for example net asset value, and (2) phrases that are variable in one slot, so-called p-frames, for example the * of. The phrases can be either compositional or non-compositional entities.

The list of keywords is dominated by business specific lexis, for instance securities, shares and investment. This dominance indicates large lexical differences between general English and business English. This demonstrates the usefulness of teaching words in ESP classes.

Looking at the corpus’ most frequent phrases, however, the distinction between general English and business English is less clear. Frequently, the genre specificity of a phrase is effected by the occurrence of specialized lexis as part of the phrase. This finding calls (1) the genre specificity of a phrase as a whole and (2) the distinction between words and phrases as characteristic units of business English into question.

This paper concludes by discussing (1) reasons for the differences of genre specificity between the keywords and the frequent phrases, (2) implications of the lexical identification of genre specificity on units of teaching in business English classes and (3) relevant units of teaching in ESP classes.

The application of corpus linguistics to pedagogy: The role of mediation

Lynne Flowerdew

Corpus-driven learning is usually associated with an inductive approach to learning. However, in this approach students may find it difficult to extrapolate rules or phraseological tendencies from corpus data. In such cases, some kind of ‘pedagogic mediation’ may be necessary (Widdowson 2002), but to date this aspect has not received much attention in the literature.

I will describe a writing course which, although primarily taking a corpus-driven approach (Johns 1991; Tognini-Bonelli 2001) also makes use of various types of pedagogic mediation. Peer response activities, drawing on Vygotskian socio-cultural theories of co-constructing knowledge through collaborative dialogue, were set up for discussion of the corpus data. Weaker students were intentionally grouped with more proficient ones to foster productive dialogue through ‘assisted performance’. It was found that the student discussions of corpus data not only focused on rule-based grammar queries, but also on phraseological queries and whether a phrase was suitable from a socio-pragmatic perspective to transfer to their own context of writing. In some instances, teacher intervention was necessary to supply a hint or prompt to carry the discussions forward. By so doing, the teacher would seem to be moving away from a totally inductive approach towards a more deductive one.

References:

Flowerdew, L. (2008) “Corpus linguistics for academic literacies mediated through discussion activities”. In D. Belcher and A. Hirvela (eds) The Oral-literate Connection: Perspectives on L2 Speaking, Writing and Other Media Interactions. Ann Arbor, MI: University of Michigan Press, 268-287.

Flowerdew, L. (in press, 2009) “Applying corpus linguistics to pedagogy: a critical evaluation”. International Journal of Corpus Linguistics, 14 (3).

Flowerdew, L. (in press, 2010) “Using corpora for writing instruction”. In M. McCarthy and A. O’Keeffe (eds) The Routledge Handbook of Corpus Linguistics.

Johns, T. (1991) “From printout to handout: Grammar and vocabulary teaching in the context of data-driven learning”. In T. Odlin (ed.) Perspectives on Pedagogical Grammar, Cambridge: CUP, 293-313.

Tognini-Bonelli, E. (2001) Corpus Linguistics at Work. Amsterdam: John Benjamins.

Widdowson, H.G. (2002) “Corpora and language teaching tomorrow”. Keynote lecture delivered at the Fifth Teaching and Language Corpora Conference, Bertinoro, Italy.

Spontaneity reloaded: American face-to-face and movie conversation compared

Pierfranca Forchini

The present paper empirically examines the linguistic features characterizing American face-to-face and movie conversation, two domains which are usually claimed to differ especially in terms of spontaneity (Taylor 1999, Sinclair 2004). Face-to-face conversation, indeed, is usually considered the quintessence of the spoken language for it is totally spontaneous: it is neither planned, nor edited (Chafe 1982, McCarthy 2003, Miller 2006) in that it takes place in real time; since the context is often shared by the participants, it draws on implicit meaning and, consequently, lacks semantic and grammatical elaboration (Halliday 1985, Biber et al. 1999). Fragmented language (Chafe 1982:39) and normal dysfluency (Biber et al. 1999:1048) phenomena, such as repetitions, pauses, and hesitation (Tannen 1982, Halliday 2005), are some examples which clearly illustrate the unplanned, spontaneous nature of face-to-face conversation. On the other hand, movie conversation is usually described as non-spontaneous: it is artificially written-to-be spoken, it lacks the spontaneous traits which are typical of face-to-face conversation, and, consequently, it is not likely to represent the general usage of conversation (Sinclair 2004). In spite of what is generally maintained by the literature, the Multi-Dimensional analysis presented here shows that the two conversational domains do not differ to a great extent in terms of linguistic features and, thus, confutes the claim that movie language has “a very limited value” in that it does not reflect natural conversation and, consequently, is “not likely to be representative of the general usage of conversation” (Sinclair 2004:80). This finding suggests that movies could be used as potential sources in teaching spoken English.

The present analysis is in line with Biber’s (1988) Multi-Dimensional analysis approach, which applies multivariate statistical techniques by observing and analyzing more than one statistical variable at a time and it is based on empirical data retrieved from spoken and movie corpora: the data about American face-to-face conversation are from the Longman Spoken American Corpus, whereas those about American movie conversation are from the AMC (American Movie Corpus), a new corpus purposely built and manually transcribed to study movie language.

References:

Biber, D. 1988. Variation across speech and writing. Cambridge: CUP.

Biber, D, S. Johansson, G. Leech, S. Conrad, and E. Finegan (1999). Longman grammar of spoken and written English. London: Longman.

Chafe, W. (1982). “Integration and involvement in speaking, writing, and oral Literature”. In D. Tannen (ed.) Spoken and written Language: Exploring orality and literacy. Norwood: Ablex, 35–53.

Halliday, M. A. K. (1985). Spoken and written language. Oxford: OUP.

Halliday, M. A. K. (2005) “The spoken language corpus: a foundation for grammatical theory”. In J. Webster (ed.) Computational and quantitative studies. London: Continuum, 157-190.

McCarthy, M. (2003). Spoken language and applied linguistics. Cambridge: CUP.

Miller, J. (2006). “Spoken and Written English”. In B. Aarts and A. McMahon (ed.) The handbook of English linguistics. Oxford: Blackwell, 670-691.

Pavesi, M. (2005). La Traduzione Filmica. Roma: Carocci.

Sinclair, J. M. (2004) “Corpus creation”. In G. Sampson and D. McCarthy (ed.) Corpus linguistics: Readings in a widening discipline. London: Continuum, 78-84.

Taylor, C. (1999). “Look who’s talking”. In L. Lombardo, L. Haarman, J. Morley and C. Taylor (ed.) Massed medias. Milano: LED, 247-278.

Keyness as correlation: Notes on extending the notion of keyness from categorical to ordinal association

Richard Forsyth and Phoenix Lam

The notion of 'keyness' has proved valuable to linguists since it was given a practical operational definition by Scott (1997). Several numeric indices of keyness have been proposed (Kilgarriff, 2001), the most popular of which is computed from the log-likelihood formula (Dunning, 1993). The basis for all such measures is the different frequency of occurrence of a word between two contrasting corpora. In other words, existing measures of keyness assess the strength of association between the occurrence rate of a word (more generally, a term, since short phrases or symbols such as punctuation may be considered) and a categorical variable, namely, presence in one collection of documents or another.

The present study is motivated by the desire to find an analogous measure for the case in which the variable to be associated with a term's frequency is not simply binary (presence in one or other corpus) but measured on an ordinal or possibly an interval scale. This motivation arose in connection with the affective scoring of texts, specifically rating of emails according to degree of hostility or friendliness expressed in them. In our test data, each text has been manually assigned a score between -4 and +4 on five rating scales, including expressed hostility. However, such text-to-numeric linkages are not specific to this particular application, but may be of interest in many areas, for example: in Diachronic Linguistics, with the link between a text and its date of composition; in Economics, with the link between a news story about a company and its financial performance; in Medicine, with the link between a transcript and the speaker's severity of cognitive impairment; in Political Science, with the link between an election speech and its degree of support for a policy position; and so on.

This paper examines a number of potential indicator functions for the keyness of terms in such text collections, i.e. those in which the dependent variable is ordered, not categorical, and discusses the results of applying them to a corpus of electronic messages annotated with scores for emotive tone as well as to comparison corpora where the dependent variable is the date of composition of the text. Repeated split-half subsampling suggests that a novel index (frequency-adjusted z-score) is more reliable than either the Pearson correlation coefficient or a straightforward modification of the log-likelihood statistic for this purpose. The paper also discusses some criteria for assessing the quality of such indices.

References:

Dunning, T. (1993). “Accurate methods for the statistics of surprise and coincidence”. Computational Linguistics, 19(1), 61-74.

Kilgarriff, A. (2001). “Comparing corpora”. International Journal of Corpus Linguistics 6 (1): 1-37.

Scott, M. (1997). “PC analysis of key words -- and key key words”, System, 25 (2), 233-245.

A classroom corpus resource for language learners

Gill Francis and Andrew Dickinson

This paper reports on the development of a classroom corpus tool that will give teachers and learners an easy-to-use interface between them and the world of language out there on the web and in a range of compiled corpora. The resource is intended to be used by UK secondary schools and first-year university students, as well as in ELT classrooms here and in other countries. It will enable teachers and learners to access a corpus during a class, whenever they want to investigate how a word or phrase is used in a range of real language texts and situations.

The aim is to provide users with the facility to type a word or phrase into a web page without any special punctuation or symbols. Corpus output is returned in a clear, uncluttered, and visually attractive display, which learners can easily interpret.

To simplify and clarify the output, there are restrictions on how it is selected and manipulated – for example there is an upper limit on the number of concordance lines retrievable; alphabetical sorting is simplified; and significant collocates are presented positionally without statistical information. Only the most important and often-used options are offered, and ease of use is the priority.

The user will have access to a range of different corpora, although output from all of these will of course ‘look the same’. The web-server software will use an off-the-shelf concordancer to access licensed corpora such as the BNC, as well as some of the hundreds of corpora freely available on the web. Corpora can also be compiled using text from the web, for example from out-of-copyright literary works studied for GCSE, or by trawling through the web looking for key words and sites. Different corpora will be compiled for different groups of users, such as English schoolchildren or intermediate level EFL students.

For initial guidance and ideas, the website also offers a large number of suggestions for stand-alone classroom activities practising points of grammar, lexis, and phraseology, stressing their interconnectedness. The materials owe much to Data-Driven Learning, the pioneering enterprise of Tim Johns, who has sadly recently died.

Activities include many on collocation, for example the collocates of common ‘delexicalized’ verbs (make a choice, have a look, give a reason). There is also a thread that addresses language change and the tension between description and prescription in language teaching. For example, the changing pronoun use in English (eg they as a general pronoun to replace he or she; the shifting use of subject and object pronouns – Alena and I left together / Alena and me left together / Me and Alena left together). Teachers and learners will be able to use these resources as a springboard for developing their own approach to looking at language in the bright light of a corpus.

A corpus-driven approach to identifying distinctive lexis in translation

Ana Frankenberg-Garcia

Translated texts have traditionally been regarded as inferior, owing to the general perception that they are contaminated by the source text language. In recent years, however, a growing number of translation scholars have begun to question this assumption. While the constraints imposed by the source text language upon the translated text are certainly inevitable, they are not necessarily negative. For Frawley (1984), translated texts are different from the source texts that give rise to them and, at the same time, they are different from equivalent target language texts that are not translations. As such, this third code (Frawley 1984:168) deserves being studied in its own right.

A third-code feature that has received relatively little attention in the literature is the distribution of lexis in translated texts. In one of the few studies available, Shama'a (Shama'a 1978, cited in Baker 1993) found that the words day and say could be twice as frequent in English translated from Arabic than in original English texts. In contrast, Tirkkonen-Condit (2004) focuses her analysis on typically Finnish verbs of sufficiency and notices that they are markedly less frequent in translations than in texts originally written in Finnish. Both these studies adopt a bottom-up approach, taking a selection of words as a starting point and subsequently comparing their distributions in translated and non-translated texts. In the present study, I adopt an exploratory, top-down, corpus-driven approach instead. I begin with a comparable corpus of translated and non-translated texts and then attempt to identify which lemmas are markedly over and under-represented in the translations. The results not only appear to support existing bottom-up intuitions regarding distinctive lexical distributions, but also disclose a number of unexpected contrasts that would not have been discernible without recourse to corpora.

References:

Baker, M. (1993). 'Corpus linguistics and translation studies. Implications and applications.' In M. Baker, G. Francis and E. Tognini-Bonelli (eds) Text and Technology: In Honour of John Sinclair, pp 233-250. Amsterdam & Philadelphia: John Benjamins.

Frawley, W. (1984). 'Prolegomenon to a theory of translation'. In W. Frawley (ed.) Translation, literary, linguistic and philosophical perspectives. London & Toronto: Associated University Presses, pp 159-175.

Shama'a, N. (1978). A linguistic analysis of some problems of Arabic to English translation. D. Phil thesis, Oxford University.

Tirkkonen-Condit, S. (2004). 'Unique items – over – or under-represented in translated language?' In A. Mauranen and P. Kujamäki (eds) Translation Universals, Do They Exist? Amsterdam & Philadelphia: John Benjamins, p. 177-184.

The role of language in the popular discursive construction of belonging in Quebec:

A corpus assisted discourse study of the Bouchard-Taylor Commission briefs

Rachelle Freake

This study investigates how identity is constructed in relation to language in Quebec popular discourse. Since its rebirth after the Quiet Revolution, Quebec has found itself in the midst of debate as to how to adapt French Canadian culture and language to globalizing and immigration-based contexts. The debate culminated in 2007 with the establishment of the Bouchard Taylor Commission, which had the purpose of making recommendations concerning religious and cultural accommodation to the Premier of Quebec; these recommendations were based on public consultations held across Quebec. One method of participating in the consultation was through the submission of a brief; over 900 briefs in both French and English were submitted, and these serve as data for this research. The briefs were compiled into two corpora, one English and one French. Using a corpus-assisted discourse studies (CADS) approach (Baker, Gabrielatos, Khosravinik, Krzyzanowski, McEnery & Wodak, 2008), this research analyses both the French and English corpora separately, using bottom-up corpus analysis frequency, concordance, and keyword tools as well as the top-down Discursive Construction of National Identity approach to Critical Discourse Analysis (Wodak, de Cillia, Reisigl & Liebhart, 1999). Collocation and cluster patterns reveal categories of reference that represent Quebec in terms of both civic and ethnocultural nationalism. Corpus findings are enhanced by the in-depth critical discourse analysis of three individual briefs, which show how individuals adopt and adapt different versions of nationalism to advance their respective interests. Together, corpus and discourse analysis show that the majority of participants construct their identity in relation with Quebec civic society, drawing on the “modernizing” (c.f. Heller, 2001) French Canadian identity discourse. This discourse blends civic and ethnocultural elements, and emphasizes language as an identity pillar and a key sign of membership. With regard to methodological approach, this study finds that discourse analysis reveals important contextual dimensions that are not apparent from the corpus analysis alone, thus highlighting the advantage of a combined corpus and discourse analysis approach. Also, comparing corpora of different languages and different sizes, and compiling keyword lists for these corpora using different comparator corpora, raises questions about the comparability of keywords.

References:

Baker, P., C. Gabrielatos, M. Khosravinik, M. Krzyzanowski, T. Mcenery and R. Wodak (2008). “A useful methodological synergy? combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press”. Discourse & Society, 19(3), 273-306.

Heller, M. (2001). “Critique and sociolinguistic analysis of discourse”. Critique of Anthropology, 21(2), 117-141.

Wodak, R., R. de Cillia, M. Reisigl and K. Liebhart (1999). The discursive construction of national identity. Edinburgh: Edinburgh University Press.

A linguistic mapping of science popularisation: Feynman’s lectures on physics and what corpus methods can tell us

Maria Freddi

The present paper addresses the popularization of science by scientists themselves, a topic that has so far been neglected (Gross 2006), analysed along the cline from teaching to popularising science. The aim is twofold: on the one hand, to analyse Richard Feynman’s collection of didactic lectures, entitled Six Easy Pieces, to investigate the physicist’s pedagogical style. On the other hand, to compare them with a collection of popularising lectures published as The Character of Physical Law, to pinpoint differences and/or similarities in terms of language choices, word-choice and phraseology, and textual strategies. Corpus methods are used to quantify variation internal to each text and across texts. The approach is corpus-driven and draws on different models of words distribution and recurrent word-combinations (see Sinclair 1991, 2004; Moon 1998; Biber and Conrad 1999; Stubbs 2007). In particular, work by Biber et al. on how different multi-word sequences – lexical bundles – occur with different distributions in different text-types is referred to. The paper also draws on other computer-aided text analysis methods to describe meaning in texts such as Mahlberg’s (2007) concept of ‘local textual functions’, i.e. the textual functions of frequent phraseology, and Hoey’s notion of ‘textual colligation’ (Hoey 2006).

Each collection of lectures is thus treated both as text and corpus, so that keywords resulting from the comparison of the two corpora are then considered and related to the textual and registerial specificities of each (see among others, Baker 2006; Scott and Tribble 2006). Finally, findings are related to qualitative studies of scientific discourse, particularly in the wake of rhetoric of science (e.g. Gross 2006). It is suggested that corpus tools, by helping identify frequent phraseology and textual functions and compare the academic and public lectures, will lead to a more thorough mapping of Feynman’s unconventional style of communicating science as well as a deeper understanding of successful science popularisation.

References:

Baker, P. (2006), Using Corpora in Discourse Analysis. London: Continuum

Biber, D. and S. Conrad (1999), “Lexical bundles in conversation and academic prose”, in H.

Hasselgard and S. Oksefjell (eds) Out of Corpora. Studies in Honour of Stig Johansson.

Amsterdam: Rodopi, 181-190

Gross, A. G. (2006). Starring the Text. The Place of Rhetoric in Science Studies. Carbondale:

Southern Illinois Press.

Hoey, M. (2006). “Language as choice: what is chosen?”. In G. Thompson and S. Hunston (eds) System and Corpus. London: Equinox, 37-54

Mahlberg, M. 2007, “Clusters, key clusters and local textual functions in Dickens.” Corpora, 2 (1), 1-3

Moon, R. (1998). Fixed Expressions and Idioms in English. A corpus-based approach. Oxford: Clarendon.

Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press

Sinclair, J. (2004). Trust the Text: Language, corpus and discourse. London: Routledge

Automatic grouping of morphologically related collocations

Fabienne Fritzinger and Ulrich Heid

Term variants have been extensively investigated in the past [2], but there is less knowledge about the variability of collocations in specialised languages. Variability phenomena include differences in article use, number and modifiability, but also different morphological reali-sations of the same combination of lexical elements: German examples include the relation between the verb+object collocation Warenzeichen+eintragen („register+trademark“) and the collocation of a nominalisation and its genitive attribute Eintragung des Warenzeichens, „registration of +trade-mark“), or the relation with a noun and an (adjectival) participle collocation: eingetragenes/ einzutragendes Warenzeichen („registered/to-be-registered trademark'“) .

Quantitative data for each collocation type show preferences in specialised phraseology: the collocation Patentanmeldung/Anmeldung+einreichen („submit a (patent) registration“) is equally typical in verb+noun, noun+genitive and participle+noun form; By contrast, the combination Klage+einreichen shows an uneven distribution over the collocation types: Einreichung+KlageGen and eingereichte Klage are rather frequent (ranks 7 and 8), whereas combinations of the verb and the noun, Klage+einreichen („file a law suit“), are quite rare (rank 43).

We present a system for grouping such morphologically related collocation candidates together, using a standard corpus linguistic processing pipeline and a morphology system.

A systematic account of collocation variants (and tools to prepare it) are needed for terminography and resource building for both symbolic and statistical Machine Translation: grouping the collocation variants together in a terminological database or a specialised dictionary, improves the structure of such a resource making it more compact. Adding quantitative data about preferences serves technical writers who need to select adequate expressions. „Bundles“ of related collocations of two languages can be integrated into lexicons of symbolic Machine Translation systems, and they can serve to reduce data sparseness in alignment and equivalent learning from parallel and comparable corpora.

The architecture and application of our tools is exemplified with German data from the field of IPR and trademark legislation; however, the tools are applicable to general language as well as to any specialised domain. To identify the vocabulary of the trademark legislation sublanguage, single word candidate terms are extracted from a 78 million word corpus (of German juridical literature) and are compared with terms of general language texts [1]. For these single word term candidates, we then extract collocation candidates. In order to identify even non-adjacent collocations (which frequently occur in German), we use dependency parsing [3] to extract the following collocational patterns: adjectives+nouns, nouns+genitive nouns and verb+object pairs.

The verb+object pairs are the starting point for grouping the collocations, as nominalisations and derived adjectives can easily be mapped onto their underlying verbal predicates. For this mapping, we use SMOR [4]. It analyses complex words into a sequence of morphemes which include base and affixes or, in the case of compounds, head and non-head components. Based on this output, we group the different morphological realisations together.

References:

[1] K. Ahmad, A. Davies, H. Fulford and M. Rogers (1992). “What is a tern? The semi-automatic extraction of terms from text”. In M. Snell-Hornby et al. Translation studies – an interdiscipline.

[2] C. Jacquemin (2001). Spotting and discovering terms through Natural Language Processing.

[3] M. Schiehlen (2003). “A cascaded finite-state parser for German”. In Proceedings of the Research Note Sessions of the 10th Conference of the EACL'03.

[4] H. Schmid, A. Fitschen and U. Heid (2004). “A German computational morpholgy covering derivation, composition and inflection”. In Proceedings of the LREC 2004.

Mediated modality: A corpus-based investigation of original, translated and non-native ESP

Federico Gaspari and Silvia Bernardini

Although linguists disagree on the finer points involved in classifying and describing English modal verbs, it does not seem controversial to argue that modality is a central and challenging topic in English linguistics. In this respect, Palmer (1979/1990) observes that “[t]here is, perhaps, no area of English grammar that is both more important and more difficult than the system of the modals”, and Perkins (1983: 269) claims that “the English modal system tends to more anarchy than any other area of the English language”. Due to its complexities, modality is likely to be particularly problematic for translators and non-native speakers writing in English. It has been suggested that these two sets of language users employ similar strategies and produce mediated language that has common and distinctive patterns, showing properties that differ in a number of respects from those found in native/original texts (Ulrych & Murphy, 2008).

This paper attempts to shed light on this claim, investigating a range of features of modality by means of two specialised monolingual comparable corpora of English: one containing original/native and translated financial reports (with the source language of all the translations being Italian), the other comprising research working papers in economics written by native and non-native speakers of English (in this case all the L2 authors have Italian as their mother tongue). While assembling a set of more closely comparable corpora would have been desirable, this proved impossible to achieve due to the unavailability of appropriate texts for all the components that were required.

Phenomena discussed in the paper include the distribution of epistemic vs. deontic modals in each sub-corpus, their collocational patterning (particularly concerning co-occurrence with adverbs expressing judgements of the writer’s confidence, such as possibly and perhaps for possibility, as opposed to surely and certainly for necessity), as well as an analysis of their discursive functions. Looking at how modality is expressed in L2 English writing by Italian native speakers and in translations from Italian, we aim first of all to determine the extent to which these two distinct forms of mediated language differ from each other, and secondly to understand how they compare to original/native usage.

At the descriptive and theoretical levels, this study provides insights about English as a Lingua Franca, while contributing to the ongoing debate on translation universals taking place within Corpus-Based Translation Studies (Baker, 1993; Chesterman, 2004). From an applied point of view, on the other hand, the findings of this research have implications for English language teaching (particularly ESP and EAP in the fields of economics and finance), translation teaching and translation quality evaluation.

References:

Baker, M. (1993) “Corpus Linguistics and Translation Studies: Implications and Applications”. In M. Baker, G. Francis and E. Tognini-Bonelli (eds) Text and Technology: in Honour of John Sinclair. Amsterdam: John Benjamins, 233-250.

Chesterman, A. (2004) “Hypotheses about Translation Universals”. In G. Hansen, K. Malmkjær and D. Gile (eds) Claims, Changes and Challenges in Translation Studies. Amsterdam: John Benjamins, 1-13.

Palmer, F. R. (1979/1990) Modality and the English modals. Second edition. London: Longman.

Perkins, M. R. (1983) “The Core Meaning of the English Modals”. Journal of Linguistics 18, 245-273.

Ulrych, M. & A. Murphy (2008) “Descriptive Translation Studies and the Use of Corpora: Investigating Mediation Universals”. In C. Taylor Torsello, K. Ackerley & E. Castello (eds) Corpora for University Language Teachers. Bern: Peter Lang.

Speaking in piles: Paradigmatic annotation of a spoken French corpus

Kim Gerdes

This presentation focuses on the “paradigmatic” syntactic schemes that are currently used in the annotation of the Rhapsodie corpus project (http://rhapsodie.risc.cnrs.fr/en), an ongoing 3-year project sponsored by the French National Research Agency (ANR) and consisting of developing a 30 hours (360,000 words) reference corpus of spoken French with the key features of being free, representative, and annotated with sound, syllable-aligned phonological transcription, orthographically corrected transcription and, on at least 20% of the corpus, prosodic and syntactic layers of information.

Based on the Aix School grid analysis of spoken French (Blanche-Benveniste et al. 1979), the notion of « pile » is introduced in the syntactic annotation, allowing for a unified elegant description of various paradigmatic phenomena like disfluency, reformulation, apposition, question-answer relationships, and coordinations. Piles naturally complete dependency annotations by modeling non-functional relations between phrases. We consider that a segment Y of an utterance piles up with a previous segment X when Y fills the same syntactic position as X: Y can be a (voluntary or not) reformulation of X, Y can instantiate X or Y can be added to X in a coordination. If the layers X and Y are adjacent, we note {X| C Y} the pile, where C is the pile marker (i.e. a conjunction in a coordination or an interregnum in a disfluency (Shriberg 1994)).

• on imaginait {des bunkers | tout ce {qui avait pu faire le rideau de fer | qui avait été supprimé {très rapidement | mais sans en enlever toutes les infrastructures}}}

'One imagined bunkers, everything that the Iron Curtain was made up of that was removed very quickly but without taking down its entire infrastructure.

Although our layers’ encoding seem rather symmetric, we give to {, | and } very different interpretations. We call | the junction position: it marks the point where a backtrack to { is done. The last position marked } is less relevant, and even irrelevant in case of disfluency (see Heeman et al. 2006, who propose a similar encoding for disfluency without the closing } ). Only the junction position corresponds to a prosodic break.

We use a two-dimensional graphical representation of piles where layers, as well as pile markers, are vertically aligned. We add horizontal lines, called junctions, between the various layers of a pile, in order to indicate the extension of each layer. Only the first layer is syntactically linked by a functional dependency to the left context.

• on imaginait des bunkers

tout ce qui avait pu faire le rideau de fer

qui avait été supprimé très rapidement

mais sans en enlever …

References:

Blanche-Benveniste C. et al., Des grilles pour le français parlé, Recherches sur le français parlé, 2, 163-205.

Heeman P., A. McMillin, J.S. Yaruss (2006) An annotation scheme for complex disfluencies, International Conference on Spoken Language Processing, Pittsburgh.

Shriberg E. (1994). Preliminaries to a Theory of Speech Disfluencies, PhD Thesis, Berkeley University.

eHumanities Desktop — An extensible Online System for

Corpus Management and Analysis

Rüdiger Gleim, Alexander Mehler, Ulli Waltinger and Peter Menke

Beginning with quantitative corpus linguistics and related fields the application of computer-based methods increasingly reaches all kinds of disciplines in the humanities. The spectrum of processed documents spans textual content (e.g. historical texts, newspaper articles, lexica and dialog transcriptions), annotation formats (e.g. of multimodal data), and images as well as multimedia. This raises new challenges in maintaining, processing and analyzing resources, especially because research groups are often distributed over several institutes. Finally, sustainability and harmonization of linguistic resources also become important issues. Thus, the ability to collaboratively work on shared resources, while ensuring interoperability, is an important issue in the field of corpus linguistics (Dipper et al., 2006; Ide and Romary, 2004).

The eHumanities Desktop1 is designed as a general purpose platform for sciences in humanities. It offers web based access to (i) managing documents of arbitrary type and organizing them freely in repositories, (ii) sharing and working collaboratively on resources with other users and user groups, (iii) browsing and querying resources (including e.g. XQueries on XML documents), (iv) processing and analyzing documents via an extensible Tool API (including e.g. raw text preprocessing in multiple languages, categorization and lexical chaining). The Desktop is platform independent and offers a full-fledged desktop environment to work on corpora via any modern browser.

The eHumanities Desktop is an advancement of the Ariadne System2, which was developed in the DFG funded project SFB673/X1 "Multimodal alignment corpora: statistical modeling and information management". We invite researchers to contact us3 and use the Desktop for their studies.

References:

Dipper, S., E. Hinrichs, T. Schmidt, A. Wagner and A. Witt (2006). “Sustainability of

linguistic resources”. In E. Hinrichs, N. Ide, M. Palmer and J. Pustejovsky (eds) Proceedings of the LREC 2006 Workshop on Merging and Layering Linguistic Information. Genoa, Italy.

Ide, N. and L. Romary (2004). “International standard for a linguistic annotation framework”.

Nat. Lang. Eng., 10 (3-4), 211–225.

Mehler, A., R. Gleim, A. Ernst and U. Waltinger (2008). “WikiDB: Building Interoperable

Wiki-Based Knowledge Resources for Semantic Databases”. In Sprache und Datenverarbeitung. International Journal for Language Data Processing, 2008.

Mehler, A., U. Waltinger and A Wegner (2007). “A formal text representation model based

on lexical chaining”. In Proceedings of the KI 2007 Workshop on Learning from Non-Vectorial Data (LNVD 2007) September 10, Osnabrück, 17–26. Universität Osnabrück.

Waltinger, U., A. Mehler and G. Heyer (2008a). “Towards automatic content tagging:

Enhanced web services in digital libraries using lexical chaining.” In 4th Int. Conf. on Web Information Systems and Technologies (WEBIST ’08), 4-7 May, Funchal, Portugal. Barcelona.

1 http://hudesktop.hucompute.org

2 http://ariadne-cms.eu

3 http://www.hucompute.org

Analysing speech acts with the Corpus of Greek Texts: Implications for a theory of language

Dionysis Goutsos

The paper takes as its starting point the analysis of two instances of language use with the help of material provided by the Corpus of Greek Texts (CGT). CGT is a new reference corpus of Greek (30 million words), developed at the University of Athens for the purposes of linguistic analysis and pedagogical applications (Goutsos 2003). It has been designed as a general monolingual corpus, including data from a wide variety of spoken and written genres in Greek from two decades (1990 to 2010). The examples discussed here concern:

a) an utterance on a public sign, used to enforce a prohibition, and

b) the different forms of the verb ksexnao (‘to forget’) in Greek.

The question in both cases is whether there is evidence in the corpus that can be brought to clarify the role and illocutionary force of the speech acts involved. This evidence comes from phraseological patterns, which are found to be used for specific discourse purposes. It is suggested thus that discursive acts manifest themselves in linguistic traces, retrievable from the wider co-text and characterizing particular contexts. These traces allow us to identify repeatable relations between forms and functions and reveal the predominantly evaluative orientation of language.

On the basis of this analysis, the paper moves on to discuss the claim of corpus linguistics and its implications for a theory of language. In particular, it is argued that corpus linguistics can bridge the gap between formalist insights and interactional approaches to language, by emphasizing recurrent relations between forms and function and thus subscribing neither to a view of fixed correspondences between form and function nor to a totally context-dependent approach. As such, corpus linguistics points to a view of language as a form of historical praxis of a fundamentally social and dialogical nature and in this it concurs with Voloshinov’s (1973 [1929]) approach to language as a systematic accident, produced by repeatable and recurring forms of utterances in particular circumstances. A Bakhtinian theory, emphasizing evaluation and addressivity as fundamental characteristics of language, can mediate between today’s theories of ‘individualistic subjectivism’ and ‘abstract objectivism’, offering a more fruitful approach to what language is and how it works.

References:

Goutsos, D. (2003). “Corpus of Greek Texts: Design and implementation [In Greek]”. Proceedings of the 6th International Conference of Greek Linguistics, University of Crete, 2003. Available at: http://www.philology.uoc.gr/conferences/6thICGL/gr.htm

Voloshinov, N. (1973 [1929]). Marxism and the Philosophy of Language. Cambridge, MA: Harvard University Press.

Bigrams in registers, domains, and varieties: A bigram gravity approach to the homogeneity of corpora

Stefan Th. Gries

One of the most central domains in corpus linguistics is concerned with issues of corpus comparability, homogeneity, and similarity. In spite of the importance of this topic – after all, only when we know how homogeneous and/or comparable our corpora are can we be sure as top how robust our findings and generalizations are. In addition, most of the little work that is there is based on the frequencies with which particular lexical items occur in a corpus or in parts of corpora (cf. Kilgarriff 2001, 2005, Gries 2005). However, this is problematic because (i) researchers working on grammatical topics may not benefit at all from corpus homogeneity measures that are based on lexical frequencies alone, and (ii) raw frequencies of lexical items are a rather blunt instrument that does not take much of the information that corpora provide into account.

In this paper, I will explore how we can compare corpora or parts of corpora in a way that I will argue is somewhat more appropriate. First, I will compare parts of corpora on the basis of the strength of association of all bigrams in the corpus (parts). Second, the measure of association strength is one that has so far not received a lot of attention although it exhibits a few interesting characteristics, namely Daudaravičius and Marcinkevičiené's (2004) gravity counts.

Unlike nearly all other collocational statistics, this measure has the attractive property that it is not only based on the token numbers of occurrence and co-occurrence, but also considers the number of different types the parts of a bigram co-occur with.

The corpus whose homogeneity I study is the BNC baby. In order to determine the degree of homogeneity and at the same time validate the approach, I split up the BNC Baby into its 4 main parts (aca, dem, fic, and news) and its 19 different domains, and, within each of these subdivisions, compute each bigram's gravity measure of collocational strength. This setup allows to perform several comparisons, whose results I will present

- the 4 corpus parts and the 19 corpus domains can be compared with regard to the overall average bigram gravity;

- the 4 corpus parts and the 19 corpus domains can be compared with regard to the average bigram gravity per file;

- the 4 corpus parts and the 19 corpus domains can be compared with regard to the average bigram gravity per sentence.

More interestingly even, I show to what degree the domain classifications of the corpus compilers are replicated in the sense that, when the 19 corpus domains are clustered in a hierarchical cluster analysis, they largely form groups that resemble the four main corpus parts.Similar results for the comparison of different varieties of English will also be reported briefly.

These findings not only help evaluate the so far understudied measure of gravity counts, but also determines to what degree top-down defined register differences are reflected in distributional patterns.

Multimodal Russian Corpus (MURCO): Types of annotation and annotator’s workbench

Elena Grishina

2009 is the first year of the actual design of the Multimodal Russian Corpus (MURCO), which will be created in the framework of the Russian National Corpus (RNC). The RNC contains the Spoken Subcorpus (just now its volume is circa 8 million tokens), but this subcorpus does not include the oral speech proper – it includes only the transcripts of the spoken texts (see [Grishina 2006]). Therefore, to replenish the Spoken Subcorpus of the RNC, we have decided to work out the first generally accessible and relatively fair-sized multimodal corpus. To avoid the problems concerning copyright and privacy invasion, we have decided to use the cinematographic material in the MURCO. The total volume of cinematographic transcripts in the Spoken Subcorpus of the RNC is about 5 million tokens? If we manage to transform this subcorpus into multimodal state, we will obtain the largest open multimodal corpus, so the task is ambitious enough.

The MURCO will include the annotation of different types. These are:

· the standard RNC annotation, which consists of morphological, semantic, accentological and sociological annotation

· the orthoepic annotation

· the speech act annotation

· the gesture annotation (see [Grishina 2009a]).

So, we plan to annotate MURCO from the point of view of phonetics and orthoepy, in addition to the standard RNC annotation (see [Grishina 2009b]). It includes two types of information: information concerning sound combination and information concerning accentological structure of a word.

The speech act annotation will characterize any phrase in a clip from the point of view of 1) typical social situation, 2) types of speech acts, 3) speech manner, 4) types of repetition, 5) types of vocal gestures and non-verbal words.

The gesture annotation will characterize every clip from the point of view of 1) types of gestures, 2) meaning of gestures, 3) Russian gesture names, 4) active/passive organs, 5) spatial orientation of an active organ and direction of its movement.

The annotation process in the MURCO may be

· 1) automatic (morphology, semantics, orthoepy),

· 2) manual: to annotate the speech acts and the gestures in MURCO an annotator ought to work in manual mode only. To facilitate the work we have decided to construct two special-purpose workbenches: the “Maker” – to annotate speech acts, vocal gestures, interjections, repetitions, and so on, the “GesturesMarker” – to annotate gestures and their components. The workbenches have user-friendly interface and considerably increase the intensity and speed of the annotation process.

References:

Grishina E. (2006) “Spoken Russian in the Russian National Corpus (RNC)” . LREC2006: 5th International Conference on Language Resources and Evaluation. ELRA, 2006. p. 121-124.

Grishina E. (2009a) “Multimedijnyj corpus russkogo jazyka (MURCO): problemy annotacii”. RNC2009.

Grishina E. (2009b forthcoming) “National’nyj corpus russkogo jazyka kak istochnik

svedenij ob ustnoj rechi”. Rechevyje tekhnologii, 3, 2009.

Synergies between transcription and lexical database building: the case of German Sign Language

Thomas Hanke, Susanne Koenig, Reiner Konrad, Gabriele Langer and Christian Rathmann

In corpus-based lexicography research, tokenising and lemmatising play a crucial role for empirical analyses, including frequency statistics or lemma context analyses. For languages with conventionalised writing systems and available lexical database resources, lemmatising a corpus is more or less an automatic process. Changes in the lexical database resource are rather a secondary issue.

Apparently, this approach does not apply for German Sign Language (DGS), a language in the visuo-gestural modality. Due to the lack of an established writing system and close-to-complete lexical database resources for DGS, it is necessary to do both tokenising and lemmatising manually and to extend the lexical resources in parallel with the lemmatisation process. In lieu of using transcripts being processed with support of available lexical database resources and then stored again as independent entities, we adopt an integrated approach using a multi-user relational database. The database combines transcripts with the lexical resource. Transcript tags which identify stretches of the utterance as tokens are not text strings but are treated as database entities. For lemma review in the lexical database, tokens currently assigned to a type can be accessed directly. Consequently, changes of type information, e.g. gloss or citation form, are immediately reflected in all transcripts.

This approach has three advantages. First, it allows to treat productive forms (i.e. non-conventionalised forms) in parallel with lemmas under review. Productive forms consist of sublexical elements with visually-motivated (i.e. iconic) properties. Providing structures to access productive forms in the database is important as these forms are perfect candidates for lexicalisation. Second, the approach enables transcribers to constantly switch back and forth between bottom-up and top-down analyses as the token-type matching requires reviewing, extending, and refining the lexical database. Third, this contributes to quality assurance of the transcripts since the standard approach including completely independent double-transcription would be prohibitively expensive. The database avoids problems inherent in applying the standard text tagging approach to sign languages, such as glossing consistency (cf. Johnston 2008).

However, the disadvantage inherent in both our approach and the more standard one is that the phonetic form of the utterance is not adequately represented in the transcript. While immediate access to digital video is available, the one-step procedure going from digital video directly to tags (combining tokenisation and lemmatisation) leaves out the step of describing phonetic properties. A transcriber is supposed to notate any form derivation from the type’s standard inflected form, but this can easily be forgotten to be filled out or later to be modified when the type’s standard form is changed. To identify problems in this area, we use avatar technology: The avatar is fed the phonetic information combined from the type and token data (encoded in HamNoSys, an IPA-like notation system for sign languages), and the transcriber can then compare the avatar performance with the original video data.

Syntactic annotation of transcriptions in the Czech Academic Corpus: Then and now

Barbora Hladká and Zdeňka Urešová

Rich corpus annotation is performed manually (at least initially) using predefined annotation guidelines, available now for many languages in their written form. No speech-specific syntactic annotation guidelines exist, however. For English, the Fisher Corpus (Cieri et al., 2004) and the Switchboard part of the Penn Treebank [1] had been annotated using written-text guidelines. For Czech, the Czech Academic Corpus [2] (CAC) created by a team of linguists at the Institute of Czech Language in Prague in 1971-1985, also did not use any such specific guidelines, despite the fact that a large part (~220,000 tokens) of the CAC contains transcriptions of spoken Czech and their morphological and syntactic annotation. In our contribution, we discuss syntactic annotation of these spoken-language transcriptions by comparing their original annotation style and the modern syntactic representation as used in the Prague Dependency Treebank (PDT); it was quite natural to try to merge (convert) CAC into PDT using the latter treebank’s format (Vidová Hladká et al. 2008).

While most of the conversion has been relatively straightforward (albeit labor-intensive), the spoken part posed a more serious problem. Many phenomena typical for spoken language had not appeared previously in written texts, such as unfinished sentences (fragments), ellipsis, false beginnings (restarts), repetition of words in the middle of sentence (fillers), redundant and ungrammatically used words – none of these are covered in our text annotation guidelines. Since the PDT annotation principles do not allow word deletion, regrouping or word order changes, one would have to introduce arbitrary, linguistically irrelevant spoken-language-specific rules with doubtful use even if applied consistently to the corpus. The spoken part of the CAC thus was not converted (at the syntactic layer).

However, we plan to complete its annotation (i.e. the spoken language transcriptions) using the results of the “speech reconstruction” project (Mikulová et al., 2008). Speech reconstruction will enable the use of the text-based guidelines for syntactic annotation of spoken material by introducing a separate layer of annotation, which allows for “editing” the original transcript into a grammatical text (for the original ideas see Fitzgerald, Jelinek, 2008).

Acknowledgements: This work is supported by the Charles University Grant Agency under the grant GAUK 52408/2008 and by the Czech Ministry of Education project MSM0021620838.

References:

Cieri, C., D. Miller and K. Walker (2004). “The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text”. In Proceedings of the 4th LREC, Lisbon, Portugal, 69-71,

Fitzgerald E. and F. Jelinek (2008). “Linguistic resources for reconstructing spontaneous speech text”. In Proceedings of the 6th LREC, Marrakesh, Morocco, ELRA.

Mikulová, M. (2008). “Rekonstrukce standardizovaného textu z mluvené řeči v Pražském závislostním korpusu mluvené češtiny”. Manuál pro anotátory. Tech. report no. 2008/TR-2008-38, ÚFAL MFF UK, Prague, Czech Republic.

Vidová Hladká B., J. Hajič, J. Hana, J. Hlaváčová, J. Mírovský and J. Raab (2008). “The Czech Academic Corpus 2.0 Guide”. Prague Bulletin of Mathematical Linguistics,

89, 41-96.

------------------------------

[1] http://www.cis.upenn.edu/~treebank/

[2] http://ufal.mff.cuni.cz/rest/CAC/cac_20.html

A top-down approach to discourse-level annotation

Lydia-Mai Ho-Dac, Cécile Fabre, Marie-Paule Péry-Woodley and Josette Rebeyrolle

The ANNODIS project aims at developing a diversified French corpus annotated with discourse information. Its design innovates in combining bottom-up and top-down approaches to discourse. In the bottom-up perspective, basic constituents are identified and linked via discourse relations. In a complementary manner, the top-down approach starts from the text as a whole and focuses on the identification of configurations of cues signalling higher-level text segments, in an attempt to address the interplay of continuity and discontinuity within discourse. We describe the annotation scheme used in this top-down approach in terms of three defining choices: the type of discourse object to be annotated, the type of texts to be included in the corpus, and the environment to assist the annotation task.

The discourse object chosen to “bootstrap” our top-down approach is the list or rather the enumerative structure, envisaged as a meta-structure covering a wide range of discourse organization phenomena. Enumerative structures are perceived and interpreted through interacting high-and low-level cues (e.g. visual cues such as bullets vs. lexical choices). They epitomise the linearity constraint, setting out a great diversity of discourse elements in the linear format of written text: a canonical enumerative structure comprises a trigger (introductory segment), a list of co-items and a closure. Enumerating can be seen as a basic strategy in written text through which different facets of discourse organization may be approached. Of particular interest to our annotation project is the fact that enumerative structures occur at all levels of granularity (from within the sentence to across text sections), and can be nested. We also see them as a particularly interesting object for the study of the interaction between the ideational and textual metafunctions (Halliday 1985).

Two main corpus-selection requirements follow on from our decision to view discourse organization from a top-down perspective in order to identify high-level structures. First, text genres must be carefully selected so as to include long expository texts, such as scientific papers or essays. Second, the corpus must be composed of texts in which crucial elements of discourse organization such as subdivisions and layout are available (Power et al 2003).

The annotation procedure calls upon pre-processing techniques. It consists in three successive steps: tagging, marking and annotation. The texts are tagged for POS and syntactic dependences. This information is used in the marking phase, whereby cues associated with the signalling of enumerative structures are automatically located. These cues may signal components of the structure (list markers, encapsulations…) or various elements that help reveal the text’s organization (section headings, frame introducers, parallelisms…). In the annotation stage, the colour-coded highlighting of cues guides the annotator’s scanning of the text. The interface allows the annotator to approximate the top-down approach by zooming on parts of the texts that are dense in enumerative cues. This combination of a marking procedure and specific interface facilities is a key element of our method, making it possible for the annotator to navigate the text and identify relevant spans at different granularity levels.

References:

Halliday, M. A. K. (1985). An Introduction to Functional Grammar. London: Edward Arnold.

Power, R., D. Scott and N Bouayad-Agha (2003). “Document Structure”. Computational Linguistics, 29(2), 211-260.

Nominalization in scientific discourse: A corpus-based study of abstracts and research articles.

Mônica Holtz

This paper reports on a corpus-based comparative analysis of research articles and abstracts. Systemic Functional Linguistics (Halliday 2004a; Halliday and Martin 1993) and register analysis (Biber 1988, 1995, 1998) are the theoretical backgrounds of this research.

Abstracts are considered to be one of the most important parts of research articles. Abstracts represent the main thoughts of research articles and are almost a surrogate for them. Although abstracts have been quite intensively studied, most of the existing studies are concerned only with abstracts themselves, not comparing them to the full research articles (e.g. Hyland 2007; Swales 2004, 1990, 1987; Ventola 1997).

The most distinctive feature of abstracts is their information density. Investigations of how this information density is linguistically construed are highly relevant, not just in the context of scientific discourse but also for texts more generally. It is commonly known that complexity in scientific language is achieved mainly through specific terminology, and nominalization, which is part of grammatical metaphor (cf. Halliday and Martin 1993; Halliday 2004b). The ultimate goal of this research is to investigate how abstracts and research articles deploy these linguistic means in different ways.

Nominalization ‘is the single most powerful resource for creating grammatical metaphor’ (Halliday 2004a: 656). Through nominalization, processes (linguistically realized as verbs) and properties (linguistically realized, in general, as adjectives) are re-construed metaphorically as nouns, enabling an informationally dense discourse.

This work focuses on the quantitative analysis of instances of nominalization, i.e., nouns ending in e.g., -age (cover - coverage), -al (refuse - refusal), -(e)ry (deliver - delivery), -sion/-tion (convert - conversion / adapt - adaptation), -ing (draw - drawing), -ity (intense - intensity), -ment (judge - judgment), -ness (empty - emptiness), -sis (analyze - analysis), -ure (depart - departure), and -th (grow - growth), in a corpus of research articles.

The corpus under study consists of 94 full research articles from several scientific journals in English of the disciplines of computer science, linguistics, biology, and mechanical engineering, comprising over 440,000 words. The corpus was compiled, pre-processed, and automatically annotated for parts-of-speech and lemmata. Emphasis will be given to the analysis and discussion of the use of nominalization in abstracts and research articles, across corpora and domains.

References:

Halliday, M. (2004a). An Introduction to Functional Grammar 3rd ed. London: Arnold.

Halliday, M. A. K. (2004b). “The Language of Science”, In Collected Works of M. A. K. Halliday (vol. 5). London: Continuum.

Halliday, M. A. K. and J. Martin (1993). Writing Science: Literacy and Discursive Power. London: University of Pittsburgh Press.

Hyland, K. (2007). Disciplinary Discourses. Ann Arbor: U. Of Michigan Press.

Swales, J. M. (1987). “Aspects of article introductions”. Aston-ESP-research-reports NO. 1, 5th Impression, The University of Aston, Birmingham, England.

Swales, J. M. (1990). Genre Analysis. Cambridge: CUP.

Swales, J. M. (2004). Research Genres. Exploration and Applications. Cambridge: CUP.

Ventola, E. (1997). “Abstracts as an object of linguistic study”. In F. Danes, E. Havlova and S. Cmejrkova (eds) Writing vs. Speaking: Language, Text, Discourse, Communication. Proceedings of the Conference held at the Czech Language, 333–352.

Construction or obstruction: A corpus-based investigation of diglossia in Singapore classrooms

Huaqing Hong and Paul Doyle

This paper takes a corpus linguistics approach to investigating the pattern of language use, diglossia in particular, in the teacher?student interactions in Singapore primary and secondary schools in relation to the effectiveness of students learning in such a situation. Diglossia is an interesting sociolinguistic phenomenon in a bilingual or multilingual setting where two languages or language varieties (H-variety and L-variety) occur side by side in the community, and each has a clear range of functions. Given the Singapore English-speaking community, the H-variety, Singapore Standard English (SSE) is encouraged to be used in classrooms, while the L-variety, Singapore Colloquial English (SCE) has been supposed to be eliminated from classrooms, and this has been criticized by many scholars for it’s almost impossible to eliminate SCE from the local education settings (Pakir 1991a & 1991b; Gupta 1994; Deterding 1998; to name just a few). We have no attempt to involve in the debate, but we would rather to look at this issue from a corpus linguistics perspective.

This study makes use of SCE?annotated data from the Singapore Corpus of Research in Education (SCoRE) (Hong 2005), a corpus of 544 lessons totaling about 600 hours recording of real classroom interactions in Singapore schools. With the identification of the pervasive use of SCE features in classroom interactions and their distribution patterns across four disciplinary subjects, we’ll look into how the diglossia use of English language in classroom interaction affects the learning process in class. Then, questions, such as to what extent diglossia can hinder or facilitate learner contribution in classroom communication and in what way teachers and students can take advantage of this to improve their meaning negotiation, will be addressed with the corpus evidence and statistical justification. Finally, we present the implication of this study and provide the recommendation for a practical and useful approach in classroom practices. The conclusion, that increasing teachers and students’ awareness of their use of language in class is at least as important as their ability to select appropriate methodologies, has implications for both teacher education and pedagogical practices.

References:

Deterding, D. (1998). “Approaches to diglossia in the classroom: the middle way”. REACT, 2, 18-23.

Gupta, A. F. (1994). The Step-tongue: children's English in Singapore. Clevedon: Multilingual Matters.

Hong, H. (2005). “SCoRE: A multimodal corpus database of education discourse in Singapore schools”. Proceedings of the Corpus Linguistics Conference Series, 1 (1), 2005. University of Birmingham, UK.

Pakir, A. (1991a). “The range and depth of English-knowing bilinguals in Singapore”. World Englishes, 10 (2), 167-179.

Pakir, A. (1991b). “The status of English and the question of ‘standard’ in Singapore: a sociolinguistic perspective”. In M. L. Tickoo (ed.) Language and Stadards: Issues, Attitudes, Case Studies. Singapore: SEAMEO.

A GWAPs approach to collaborative annotation of learner corpora

Huaqing Hong and Yukio Tono

It is generally agreed that a richly annotated corpus can be useful for a number of research questions which otherwise may not be answered. Annotation of learner corpora presents some theoretic and practical challenges to researchers of corpus linguistics and Applied Linguistics (Granger 2003, Nesselhauf, 2004, Pravec 2002, Díaz-Negrillo & García-Cumbreras 2007, Tono & Hong 2008, to name just a few). Regarding the efficiency and reliability of annotation of learner corpora, here we present an on?going work with the innovative GWAPs approach to online collaborative annotation of two learner corpora. GWAPs (Games With A Purpose) approach was originally designed in ESP Game for image annotation (von Ahn & Dabbish 2004), further developed in gwap.com, and recently applied to linguistic annotation successfully (e.g. Phrase Detectives, see Chamberlain, Poesio & Kruschwitz 2008 for more). This approach has been proved more efficient in time and cost saving and is able to yield more reliable output with a higher inter-annotator agreement rate in corpus annotation. Inspired by this, we adapted the GWAPs approach according to our purposes and data to apply it to online collaborative annotation of learner corpora. This paper starts with the comparative analysis of various approaches to learner corpus annotation, some common issues and. Then, the design of the GWAPs-based corpus annotation architecture and its implementation are introduced with demonstration followed. In so doing, we hope such an online collaborative approach can fulfil various requirements of annotation in learner corpora, and will eventually be applied to related areas, and benefit researchers from a wider community.

References:

Chamberlain, J., Poesio, M., and Kruschwitz, U. (2008). “Phrase detectives: a web-based collaborative annotation game”. Paper presented at International Conference on Semantic Systems (I-SEMANTICS'08). September 3-5, 2008. Messecongress I graz, Austria.

Díaz-Negrillo, A., and Fernández-Domínguez J. (2007). “A tagging tool for error analysis on learner corpora”. ICAME Journal, 31, 197-203.

Granger, S. (2003). “Error-tagged learner corpora and CALL: a promising synergy”. CALICO Journal, 20 (3), 465-480.

Nesselhauf, N. (2004). “Learner corpora and their potential for language teaching”. In J. Sinclair (ed.), How to Use Corpora in Language Teaching. Amsterdam: John Benjamins, 125-152.

Pravec, N. A. (2002). “Survey of learner corpora”. ICAME Journal, 26, 81-114.

von Ahn, L., and Dabbish, L. (2004). “Labeling images with a computer game”. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (ISBN 1-58113-702-8). Vienna, Austria, 319-326.

Phraseology and evaluative meaning in corpus and in text: The study of intensity and saturation

Susan Hunston

This paper develops the argument that the investigation of language features associated with saturation and intensity in Appraisal Theory (Martin and White 2005) can be enhanced by attention to corpus-based methodologies in accord with language theories such as those of Sinclair (1991; 2004).

There have been numerous responses from corpus linguistics to the challenge of attitudinal language. For example, markers of stance have been identified and quantified in texts that differ in genre or field (Biber 2006; Hyland 2000; Charles 2006). When it comes to what Martin and White call Appraisal, it is less clear how best corpus approaches can assist discourse analysis. This paper explores one aspect of appraisal theory – that relating to the systems of saturation and intensity. Saturation is the presence of relevant items in different parts of the clause (e.g. I think he may be in the library, possibly) whereas intensity refers to a cline of meaning (e.g. good versus excellent). Alongside the inscribed / evoked distinction these systems explain how one text can be ‘more evaluative’ than another.

Corpus research that focuses on the patterning and prosody of individual words and phrases (e.g. Sinclair 2004; Hoey 2004) often reveals that given words / phrases tend to co-occur with evaluative meaning even though they do not very obviously themselves indicate affect. This has been extensively researched in the contexts of particular phraseologies, but the effect of the use of such items on the texts in which they appear is less frequently argued. It might be said that there is an open-ended class of phraseologies that predict attitudinal meaning, such that there is a considerable amount of redundancy in the expression of that attitude. These phraseologies increase both the saturation and the intensity of appraisal in a text.

The paper will demonstrate this in two ways: through examination of a number of ‘attitude predicting’ phraseologies (e.g. bordering on and to the point of) and through the comparative analysis of sample texts. It represents an example of the interface between corpus and discourse studies.

References:

Biber, D. (2006). University Language: a corpus-based study of spoken and written registers. Amsterdam: Benjamins.

Charles, M. (2006). 'The construction of stance in reporting clauses: a cross-disciplinary

study of theses' Applied Linguistics 27, 492-518.

Hoey, M. (2004). Lexical Priming: A new theory of words and language. London: Routledge.

Hyland, K. (2000). Disciplinary discourses: Social interactions in academic writing. Harlow: Longman.

Martin, J. and P White (2005). The Language of Evaluation: Appraisal in English. London: Palgrave.

Sinclair, J. (1991). Corpus Concordance Collocation. Oxford: Oxford University Press.

Sinclair, J (2004). Trust the Text: language, corpus, discourse. London: Routledge.

A corpus-based approach to the modalities in Blood & Chocolate

Wesam Ibrahim and Mazura Muhammad

This paper will use a corpus-based approach to investigate the modal worlds in the text world of Annette Curtis Klause’s (1997) Blood and Chocolate, a crossover fantasy novel addressing children over 12 years of age. Gavins (2001, 2005 and 2007) has modified the third conceptual layer of Werth’s (1999) Text World Theory, i.e. subworlds. She divides subworlds into two categories, namely world switches and modal worlds. Hence, modal worlds will be the focus of this paper. Following Simpson (1993), Gavins classifies modal worlds into Deontic, Boulomaic and Epistemic. She provides a number of linguistic indicators functioning as modal world builders including modal auxiliaries, adjectival participial constructions, modal lexical verbs, modal adverbs, conditional-if and so on. The prime interest of this paper is to determine the feasibility of using corpus tools to investigate modal worlds. The Blood and Chocolate corpus comprising 58, 368 words will be compared to the BNC written sample. This paper combines the quantitative as well as qualitative methods. Wordsmith 5 (Scott, 1999) will be utilised to ascertain the frequency and collocations of these modal world builders. Additionally, a description of the impact of the prevalence of one modality over the others will be provided. The mode of narration in Blood and Chocolate, according to Simpson (1993: 55, 62-3) belongs to category B in reflector mode (B(R)), since it is a third-person narrative taking place within the confines of a single character’s consciousness. Furthermore, in accordance with Simpson’s (1993: 62-3) subdivisions of category B in reflector mode (B(R)), Blood and Chocolate is a mixture of the subcategories B(R)+ve and B(R)-ve. The former is characterised by the dominance of deontic and boulomaic modality systems in addition to the use of evaluative adjectives and adverbs, verba sentiendi and generic sentences which possess universal or timeless reference; while the latter is distinguished by the dominance of the epistemic modality. The preliminary findings of the study show that there is a significant clash between deontic and boulomaic modalities since the protagonist, Viviane, is traumatised by a conflict between her duty towards her family of werewolves and her desire to continue her relationship with a human being. However, the epistemic modality is also prevalent since the whole narrative is focalised through Viviane’s perspective.

References:

Gavins, J. (2007). Text World Theory: An Introduction. Edinburgh: Edinburgh University Press.

Gavins, J. (2005). “(Re)thinking modality: A text-world perspective”. Journal of Literary Semantics, 34, 79–93.

Gavins, J. (2001). “Text world theory: A critical exposition and development in relation to absurd prose fiction”. Unpublished PhD thesis, Sheffield Hallam University.

Klause, C. A. (1997). Blood and Chocolate. London: Corgi.

Simpson, Paul (1993). Language, Ideology, and Point of View. London: Routledge.

Scott, M. (1996). WordSmith Tools. Oxford: OUP.

Werth, P. (1999). Text Worlds: Representing Conceptual Space in Discourse. London: Longman.

Gender representation in Harry Potter series: A corpus-based study

Wesam Ibrahim and Amir Hamza

The aim of this study is to probe the question of gender representation in J.K. Rowling’s Harry Potter series; specifically, to investigate the gender-indicative words (function and content) as clues to any potential gender bias on Rowling’s part. We intend to tackle the following question: What are the stereotypical representations associated with male and female characters in Harry Potter and how are they linguistically realised?

The methodology used in this study is a combination of quantitative and qualitative. On the one hand, we are concerned to use the corpus tool of WordSmith 5 (Scott 1999) to calculate the keywords of each book as compared to two reference corpora, namely, a corpus composed of the seven books (1,182,715 words) and the BNC written sample. The significant keyword collocates of each gender-indicative node will be calculated against the LL score. (The p-value is 0.000001.) The extraction of those keywords will help us 1) determine the gender focus in each book and track the gender-representation development across the whole series, and 2) see the difference between our data and general British English in the BNC in terms of gender indicative words.

The procedural analysis will proceed thus. For a start, the collocates will be linguistically investigated within the parameters of modality (Simpson 1993). The modality system will prove so essential a clue to the negative or positive prosodies associated with each of these characters: first, the use of deontic and/or boulomaic modality markers in the collocational environments of the names of certain characters might textually foreground a positive modal shading of duty, responsibly, desire, hope, etc.; second, by contrast, the use of epistemic and/or perception modality markers in some other collocational environments might textually foreground a negative modal shading of uncertainty or estrangement. So, overall, the modality system may linguistically reveal certain positive or negative prosody in the collocates that characteristically occur with the names male and female characters in text; this could be an avenue to how gender is narratively framed by Rowling within the fictional world of Harry Potterfrom an authorial point of view. On the other hand, a CDA perspective (Fairclough 1995, 2001; Van Dijk 1998) will be employed, with a special focus on 1) interpreting the gender representation in terms of the schemata, frames and scripts of the discourse participants (in this case the characters in fiction), and 2) explaining the biased position of the text producer by problematising her authorial stance in relation to her readership.

References:

Fairclough, N. (2001). Language and Power. (2nd ed.). New York: London.

Fairclough, N. (1995). Critical Discourse Analysis: The Critical Study of Language. London and New York: Longman.

Scott, M. (1996). WordSmith Tools. Oxford: OUP.

Van Dijk. (1998). Ideology: A Multidisciplianry Approach. London: Sage.

Simpson. (1993). Language, Ideology and Point of View. London: Routledge.

A corpus stylistic study of To the Lighthouse

Reiko Ikeo

As a novel of stream-of-consciousness, Virginia Woolf’s To the Lighthouse has been of great interest in literary studies. The main concern of stylistic discussions is that each character’s psyche is presented in free indirect thought and this is closely related to the shift of viewpoints. However, the novel involves many more features than speech and thought presentation in free indirect forms. Direct forms of speech or presentation of characters’ dialogues are also important in developing the plot and in triggering characters’ inner thoughts. The leading character, Mrs. Ramsay has her personality and beauty depicted and conveyed to the reader more by other characters’ speech and thought presentation than descriptions of her own speech and actions. By applying a corpus approach, this study examines how speech and thought presentation in the novel construct the narratives and how characters interact verbally and psychologically. Approximately 17% of the text is direct speech and thought presentation, 44% is indirect speech and thought presentation and 39% is narration. In addition to the categories of speech and thought presentation model which was proposed by Semino and Short (2004), the corpus includes information about the identification of the speaker, to whom the speaker is referring and the summary of the content of the presented discourse. By sorting each character’s speech and thought presentation according to the speaker and the referees, relationships between characters, characterisation and chronological changes of each character will be shown.

Another interest of this corpus approach is the variation of narrative voices. The third-person omniscient narrator who maintains neutrality is hardly present in the novel. Instead, the narrator, being sympathetic to characters in each scene, takes over the character’s point of view. Since the corpus makes a collective examination of narration possible, such shifts of viewpoints can systematically be examined. Whose point of view the narrator is taking is often indicated by the use of metaphors.

Reference:

Semino, E. And M. Short (2004). Corpus Stylistics: Speech, Writing and Thought Presentation in a Corpus of English Writing. London: Routledge.

Use of modal verbs by Japanese learners of English: A comparison with NS and Chinese learners

Shin’ichiro Ishikawa

1. Introduction

English learner corpus studies have illuminated various features of NNS’s use of L2 vocabulary (Granger, 1998; Granger et al., 2002). Many of the previous studies in the field of computational linguistics emphasize that high frequent words are stabilized in terms of distribution, but analysis of learner corpus often reveals that NS/NNS gap can be seen even in the use of the most frequent functional words. The aim of the current study lies in probing the features of the use of modal verbs by Japanese learners of English (JLE). Adopting the methodology of the multi-layered contrastive interlanguage analysis (MCIA), we will compare the English writing by JLE with that by NS and that by Chinese learners of English (CLE). Examining the essays by NS and European NNS, Ringbom (1988) concludes that NNS overuse modal verbs such as “be,” “do,” “have,” and “can.” Aijmer (2002) suggests that Swedish, French, and German learners tend to overuse “will,” “would,” “might,” and “should,” which she says is caused by the “transfer from L1.” Also, Tono (2007) shows that Japanese junior high and high school students have a clear tendency to overuse almost all of the modal verbs.

2. Data and Analysis

We prepared three kinds of corpora: JLE corpus (169,654 tokens), CLE corpus (20.367 tokens), and NS corpus (37,173 tokens). All of these are a part of the Corpus of English Essays Written by Asian University Students (CEEAUS), which the author has recently released. Unlike other major learner corpora, CEEAUS rigidly controls writing conditions such as topic, length, time, and a dictionary use, which enables us to conduct a robust comparison among different writer groups (Ishikawa, 2008). CEEAUS also holds the detailed L2 proficiency data of NNS writers. JLE’s essays are thus classified into the four levels: Lower (-495 in TOEIC® test), Middle (500+), Semi-upper (600+), and Upper (700+). The data was analyzed both qualitatively and quantitatively. For the latter, we conducted correspondence analysis, which visually summarizes internal correlation among observed variables.

3. Conclusion

Contrary to the findings of previous studies, JLE do not necessarily overuse all the modal verbs. For instance, they underuse “could,” “used to,” and “would,” which suggests that they cannot properly control the epistemic hedges and the time frame.?JLE and CLE’s usage pattern of modal verbs are essentially identical, though CLE show a stronger overuse tendency in use of “can” and “will.” In case of Asian learners, L1 transfer is limited. In spite of L2 proficiency, JLE continue to over/underuse many of the modal verbs. This shows the need of explicit teaching of modality in TEFL in Japan. Especially, we should focus on the epistemic use of “would” and “could,” which is closely related to the “native-likeness” in the writings.

A corpus-based study on gender-specific words

Yuka Ishikawa

The feminist movement, dating back to the 1970s, aims to secure equality between the sexes in language as well as in real life. The advocators have argued that language shapes society as much as it reflects society. If we eliminate discriminatory elements from the language circulated in the community, we may be able to remove the old gender stereotype in our society.

The campaign was initiated mainly by feminists in the U.S., but since the mid-1970s, generic use of “man”, including compounds ending in “-man”, has been regarded as discriminatory by many (Maggio, 1987; Pilot, 1999). In 1975, the U.S. Department of Labor revised the Dictionary of Occupational Titles. This was meant to eliminate occupational titles ending with gender-marking suffixes, such as “fireman” or “chairman”, in order to conform to the equal employment legislation which was enacted in the late 1960s. Trends to avoid sexist language first spread to other English-speaking countries, then to non-English speaking countries around the world. By the beginning of 1990s, this movement reached Japan. The Japanese government introduced language reform and revised the Equal Opportunity Act in 1999. Using sexist job titles and expressions in classified advertisements has been legally banned since then. Japanese newspaper reporters, officers, and people in the business world are now to avoid using gender-biased expressions in public.

In this paper, Kotonoha Corpus, which is compiled by the National Institute for Japanese Language and available to the public via the Internet, is used to examine gender-specific expressions mainly in the White Paper and Minutes of the Diet. The purpose of the study is to dig deeper into official text that appears to be non-sexist at first glance to reveal the hidden images of men and women. The use of traditional job titles ending in the gender-specific suffix have been dramatically reduced since the government started the language reform. However, a close examination of the use of gender-specific words, including job titles, we will have a clearer image of men and women in the Japanese society.

References:

Maggio, R. (1987). The nonsexist word finder. Phoenix, Arizona: The Oryx Press.

Pilot, M. J. (1999). “Occupational outlook handbook: A review of 50 years of change”. Monthly Labor Review, 8-26. Retrieved November 14, 2002, from http://www.bls.gov/opub/ mlr/1999/ 05/art2abs.htm.

English-Spanish equivalence unveiled by translation corpora: The case of the Gerund in the ACTRES Parallel Corpus

Marlén Izquierdo

The notion of equivalence is central to translation. There seems to be such a bias for equivalence that this issue might question the usability of translation corpora for contrastive research. This paper intends to prove in which way translation corpora may be useful and usable for contrastive studies.

Assuming that translation is a situation of languages in contact might be a suitable starting point to explain how translational equivalence is conceived of and why this is paramount in translation and contrastive linguistics, as well as a linking factor between them. From a functional approach, the notion of equivalence can be observed through translation units, which are made up of an original text (OT) item and its equivalent in the target text (TT). In an attempt at preserving equivalence, translators might resort to system equivalents, that is, resources which the codes involved share as common means of expressing the same meaning. However, not always do translators know what resources they have at their disposal for translating a given item from English into Spanish. In addition, more often than not, translators make use of what they assume they can do instead of what is typically done. In other words, their choice deviates from the outcome expected by the target audience, questioning, thus, the acceptance of the translation and, by default, the underlying degree of equivalence.

Assessing the degree of equivalence of a translation unit depends on the knowledge about the functionality of each of their constituents, as well as their common ground and the various possibilities for matching them with other resources in either language. Yet, such an assessment is not always complete due to the fact that certain linguistic correspondences could be unnoticed by the contrastive linguists or the translator, who have much to contribute to each other.

Bearing this in mind, this paper explores the features of the Spanish Gerund as a translational option of English texts. The ultimate goal of this piece of research is to spot unexpected matches which are supposed to be functional equivalents in English-Spanish translation. Assessing to what extent they keep functional equivalence relies on the matches expected in view of their assumed similarity and on the equivalent patterns recently found in previous, corpus-based, contrastive studies. Along with these, this study seeks to unveil other relations, of sense and/or form, between the Spanish translational option and its corresponding original text(s). This leads us to the discussion about the reliability of translation corpora as a source of linguistic correspondence, which cannot be easily contested if equivalence is a condition of translation, and it is precisely translated facts that are under examination, from a corpus-based, functional approach.

References:

Baker, M. (2001). “Investigating the language of translation: a corpus-based approach”. In J. M. Bravo Gozalo and P. Fernández Nistal (eds), Pathways of Translation Studies.

Granger, S., J. Lerot and S. Petch-Tyson (2003). Corpus-based Approaches to Contrastive Linguistics and Translation Studies. New York: Rodopi, 17-29.

Izquierdo, M., K. Hofland and O. Reigem (2008). “The ACTRES Parallel Corpus: an English-Spanish Translation Corpus”. Corpora, 3, Edinburgh: Edinburgh UP.

Köller, W. (1989). “Equivalence in translation theory”. In A. Chesterman (ed.) Readings in translation theory, Helsinki: Oy Finn Lectura Ab, 99-104.

Mauranen, A. (2002). “Will 'translationese' ruin a CA?”. Languages in Contrast, 2 (2), 161-185.

Mauranen, A. (2004). “Contrasting Languages and Varieties with Translational Corpora”. Languages in Contrast, 5 (1), 73-92.

Rabadán, R. (1991). Equivalencia y Traducción. León: Universidad de León.

From a specialised corpus to classrooms for specific purposes

Reka Jablonkai

A productive way to use corpus-based techniques to analyse the lexis of a special field for pedagogic purposes is to create word lists based on the specialised corpus of the given field. Word lists are generated on the basis of lexical frequency information and are used for making informed decisions about the selection of vocabulary for course and materials design in language teaching in general and in English for Specific Purposes in particular. Examples of earlier studies focusing on the variety of English used in a special field include Mudraya’s (2006) Student Engineering Word List and the Medical Academic Word List created by Wang, Liang and Ge (2008).

The present paper will report on a similar study that looked into the lexical characteristics of the variety of English used in the official documents of the European Union. The English EU Discourse (EEUD) corpus used for the investigation comprises EU written genres like regulations, decisions and reports of about 1 million running words. Genres and texts were selected on the basis of a needs analysis questionnaire survey among EU professionals.

Overall analysis of the lexical items shows that the word families of the General Service List cover 76% of the total number of tokens in the EEUD corpus and the word families of the Academic Word List (AWL) account for another 14%. Altogether these lexical items make up 36% of all word types in the corpus.

Exploring the lexis of English EU documents included the creation and analysis of the EU Word List. The initial word list of the EEUD corpus contained 17 084 different word types. These entries were organized by word families according to Level 7 of Bauer and Nation’s scale (1993) which resulted in 7 207 word families. The word selection criteria established by Coxhead (2000), that is specialised occurrence, range and frequency, were applied to create the final EU Word List. Headwords of the most frequent word families include Europe, Commission, Community, finance, regulate, implement, proceed, treaty, EU and EC. Further analysis showed that about half of the headwords appear in the AWL and around 5% of the headwords are abbreviations.

The EU Word List can serve as a reference for EU English courses and can form the basis for a lexical syllabus and for data-driven corpus-based materials.

The tools used for the analysis were the Range program and the WordSmith corpus analysis software.

References:

Bauer, L., Nation, P. (1993). “Word families”. International Journal of Lexicography, 6 (4), 253-279.

Coxhead, A. (2000). “A new academic word list”. TESOL Quarterly, 34 (2) 213-238.

Heatley, A., I. S. P. Nation and A. Coxhead (2002). Range program software. Available at: http://www.vuw.ac.nz/lals/staff/Paul_Nation.

Mudraya, O. (2006). “Engineering English: A lexical frequency instructional model”. English for Specific Purposes, 25, 235-256.

Scott, M. (1996). WordSmith tools. Oxford: Oxford University Press. Available at: http://www.lexically.net/wordsmith

Wang, J., S. Liang, and G. Ge (2008). “Establishment of a Medical English word list”. English for Specific Purposes, 27, 442-458.

Collecting spoken learner data: Challenges and benefits

Joanna Jendryczka-Wierszycka

Although compilation of a spoken corpus always demands more time and human resources than gathering written data (e.g. the BNC project, Hoffmann et al.2008), it is a worthwhile venture.

This paper reports on the stages of collection and annotation of a spoken learner corpus of English, namely the Polish component of LINDSEI project (De Cock 1998). It starts with a short introduction into learner corpora, and proceeds to the experience of the Polish LINDSEI project, beginning with the project design, through interviewee recruitment, data recording, transcription and finishing with part-of-speech annotation.

LINDSEI (Louvain International Database of Spoken Learner Interlanguage) is the first spoken learner English corpus, with 11 L1 backgrounds to date. A wide range of linguistic analyses has already been performed with the use of the data by all of the LINDSEI coordinators, and this year (2009) the corpus is to be made available for scientific use for all interested scholars.

Apart from the corpus compilation description itself, it is problem areas, present at each stage of corpus collection, that are being reported on. The first group of problems comprises those occurring during the stages of spoken data collection as such, and of team work. Another group of problematic issues is connected with the adaptation of a POS tagger (Garside 1995) from native written language, through native spoken language to non-native spoken language.

It is worth mentioning that while studies on the Polish component of LINDSEI (to be precise, on vagueness and discourse markers) have been published (Jendryczka-Wierszycka 2008a, 2008b), the corpus compilation stages and design application have never been reported on.

While already used for the investigation of spoken learner language, it is hoped that the new existing corpus, especially in its enriched, annotated version, will become a resource not only for linguists, but also for language teachers and translators, particularly for those interested in the L1 background of the speakers.

References:

De Cock, S. (1998). “Corpora of Learner Speech and Writing and ELT”. In A. Usoniene (ed.) Proceedings from the International Conference on Germanic and Baltic Linguistic Studies and Translation. University of Vilnius. Vilnius: Homo Liber., 56-66.

Garside, R. (1995) “Grammatical tagging of the spoken part of the Britidh National Corpus: a progress report”. In Leech, Myers, Thomas (eds) Spoken English on computer. Harlow: Longman, 161-167.

Hoffmann, S, S. Evert, N. Smith , D. Lee, and Y. Berglund Prytz. (2008). Corpus Linguistics with BNCweb - a Practical Guide. Frankfurt am Main: Peter Lang.

Jendryczka-Wierszycka, J. (2008a) “Vagueness in Polglish Speech”. In Lewandowska-Tomaszczyk (ed.) PALC'07: Practical Applications in Language and Computers.Papers from the International Conference at the University of Lódz, 2007. Frankfurt am Main: Peter Lang Verlag, 599-615.

Jendryczka-Wierszycka, J. (2008b) “'Well I don't know what else can I say' Discourse markers in English learner speech”. In Frankenberg-Garcia, A. et al. (eds) 8th Teaching and Language Corpora Conference. Lisboa: Associação de Estudos de Investigação Científica do ISLA-Lisboa, 158-166.

Self editing using online concordance

Stephen Jennings

Self-editing of learner writing may be thought of as one of the goals of advanced writers in a foreign language. However, previous studies have shown that learners expect correction of writing assignments by teachers. While there is a question about how much learners need grammatical correction, some scholarly papers point out student requests for it; while others point out the need for attention on structural coherence. Whether it is correction for grammar or for coherence, it can be seen that both learner and teacher need to be involved in the process of editing drafts of written work.

This paper uses analysis of data provided by learner writing in the context of Japan. The analysis is provided by learners comparing samples of their own writing to that of an online concordance program (The Collins Cobuild Corpus Concordance Sampler). Learners in this study are at a higher end on the spectrum of interlanguage achievement, as a result, they are able to make sufficient pragmatic or other errors deemed not to irrevocably interfere with intelligibility, but enough so as to warrant a follow-up on coherence, or on accepted grammatical norms.

The teaching involved in the analysis of concordance is student centred and tailored to each individual learner, i.e. the learner himself analyses his own use of English. Rather than having been corrected at the earliest stage by the teacher, the error is pointed out obliquely. The learner then uses this clue to search for the solution to a more often-used English phrasing.

Results indicate that learners perceive that this approach has helped them learn patterns of English more thoroughly than before and that this use of online concordance is useful and encourages them to think more carefully as they write.

The thesis of this paper is that advanced learners will have the chance to;

· Fix their own errors

· Respond in an appropriate way to errors

· Become more independent in their learning

· Become aware of language in such a way as to become careful of error making

The SYN concept: Towards one-billion corpus of czech

Michal Kren

One of the aims of the Czech National Corpus (CNC) is continuous mapping of contemporary written language. This effort results mainly in compilation, maintenance and providing access to a SYN-series of corpora (the prefix indicates their synchronic nature and is followed by the year of publication): SYN2000, SYN2005 and SYN2006PUB. The former two are balanced 100-million corpora selected from a large variety of written text types that cover two consecutive time periods, the latter is a complementary 300-million newspaper corpus aimed at users that require large data. The corpora are lemmatised and morphologically tagged using a combination of stochastic and rule-based methods.

The paper has two main aims: first, to introduce a new 700-million newspaper corpus SYN2009PUB in the context of previously published synchronic corpora. SYN2009PUB is currently being prepared and it will comprise about 70 titles including also a number of non-specialised magazines and regional newspapers. Since all the SYN-series corpora are disjoint, their total size will reach 1 200 million tokens.

Second, the paper concentrates on national corpora architecture and corpus updating policy in the CNC. Corpus building is easier as more and more texts become available in electronic form. Specifically, it is enormous growth of the web that undoubtedly brings a lot of new possibilities of how to use the on-line data or to create new corpora almost instantly. However, there are also a number of limitations of the web-as-corpus approach, mainly because pure quantity is often not what a user is looking for. The traditional corpora with their specialised query engines, detailed and reliable metadata annotation and other markup are thus still in great demand.

The CNC policy is conservative in this respect, so that quality of the corpus data is not compromised despite their growing amount. Moreover, CNC guarantees that all corpora are invariable entities once published, which ensures that identical queries will always give identical results. However, this static approach obviously lacks updating mechanisms concerning also corrections of metadata annotation, lemmatisation, part-of-speech tagging etc. Similarly, conservation of older versions of the corpora can lead to misinterpretations of the data because of various processing differences that may be significant.

This is why corpus SYN was introduced, a regularly updated unification of all SYN-series corpora into one super-corpus where the original corpora are consistently re-processed with state-of-the-art versions of available tools (tokenisation, morphological analysis, disambiguation etc.). It is also easily possible to create subcorpora of SYN, so that their composition exactly corresponds to the original corpora. Furthermore, similar unifying concept as SYN will be applied to spoken and diachronic corpora in the near future. Corpus SYN can thus be seen also as a traditional response to the web-as-corpus approach showing that good old static corpora can keep their virtues while being made large and fairly dynamic at the same time.

Extracting semantic themes and lexical patterns from a text corpus

Margareta Kastberg Sjöblom

This article focuses on the analysis of textual data and the extraction of lexical semantics. The techniques provided by different lexical statistics tools, such as Hyperbase, Lexico3, Alceste and Weblex (Brunet, Salem, Reinert, Heiden) today opens the door to many avenues of research in the field of corpus linguistics, including reconstructing the major semantic themes of a textual corpus in a systematic way, thanks to a computer-assisted semantic extraction. The object used as a testing ground is a corpus made up by the literary work of one of France most famous contemporary writers: Jean-Marie Le Clézio, Nobel Prize winner 2008. The literary production of Le Clézio is vast, spanning more than forty years of writing and including several genres. The corpus consists of over 2 million tokens (51,000 lemmas) obtained from 31 novels and texts by the author.

Beyond the literary and genre dimensions, the corpus is here used as a field of experimentation, and to implement a different methodology application. A key issue in this article is to focus on various forms of approximation of lexical items in a text corpus: what are the various constellations and semantics patterns of a text? We here try to take advantage of the latest innovations in the analysis of textual data to apply different French corpus theories as “isotopies” (Rastier 1987-1996) and “isotropies” (Viprey 2005).

The automatic extraction of collocations and the micro-distribution of lexical items encourage the further development of different methods of automatic extraction of semantic poles and collocations: on one side by the extraction of thematic universes, revolving around a pole, on the other side by the extraction of co-occurrences and sequences of lexical items. These methods are also interesting when compared to other approaches: co-occurrences in between the items of a text and co-occurrence networks “mondes lexicaux” (Reinert, Alceste), semantic patterns based on ontologies (Tropes), but also other techniques such as simple or recursive “lexicogrammes” (Weblex, Heiden) or W. Martinez methods for multiple co-occurrence extraction.

References:

Adam, J.-M. and U. Heidemann (2006). Sciences du texte et analyse de discours, Enjeux d’une interdisciplinarité. Genève: Slatkine Erudition.

Heiden, S. (2004). “Interface hypertextuelle à un espace de cooccurrences: implémentation dans Weblex”. 7èmes Journées internationales d'Analyse Statistique des Données Textuelles (JADT 2004), Louvain-la-Neuve.

Kastberg Sjöblom, M. (2006). L’écriture de J.M.G. Le Clézio – des mots aux themes. Paris: Honoré Champion.

Lafon, P. (1984). Dépouillements et Statistiques en Lexicométrie. Paris: Slatkine-Champion.

Leblanc, J.M. (2005). “Les vœux des présidents de la cinquième République (1959-2001). Recherches et expérimentations lexicométriques à propos de l’ethos dans un genre discursif ritual”. Thèse de Doctorat en Sciences du Langage, Université de Paris.

Martinez, W. (2003). “Contribution à une méthodologie de l’analyse des cooccurrences lexicales multiples dans les corpus textuels”. Thèse de Doctorat en Sciences du Langage, Université de la Sorbonne nouvelle.

Rastier, F. (1987). Sens et textualité, Paris: Hachette.

Viprey, J.-M. (2005). “Philologie numérique et herméneutique intégrative”. In J.-M. Adam and U. Heidemann (2006). Sciences du texte et analyse de discours, Enjeux d’une interdisciplinarité. Genève: Slatkine Erudition

Discovering and exploring academic primings with academic learners

Przemysław Kaszubski

A specially designed online pedagogical concordancer has been in use with several groups of EFL academic writers for the past two years. The essential components of the environment are three interfaces: a corpora search interface (with small disciplinary corpora of written EAP texts contrasted with learner-written texts and with general registers), an annotation-supporting search history interface (personal and/or collaborative) and a resources site, where some of the more valuable findings are deployed and being arranged into a structured, textbook-like environment. Re-use of previously accessed searches is facilitated by the fact that each corpora and each history search produce unique URL addresses that can be exploited as web-links (cf. Gaskell & Cobb 2004). An intricate network of variously connected searches can thus emerge and be utilised by the student and the teacher, who can collaborate on search annotations. Discovery learning assumes new perspectives – it can be informed by pre-existing annotations illustrated with search links or remain more traditionally inductive, yet collaborative and constructionist.

The research goals of this highly practical venture are twofold – concerning both linguistic description and EAP pedagogy. On the descriptive side, the aim is to try to harness EAP students' noticing potential for the discovery and annotation of units, strings and patterns of use that (learner) academic writers in the relevant disciplines are likely to need. The encouraged framework of reference is Hoey's (2005) lexical priming, which combines phraseological as well as stylistically important textual-lexical patterning. On the pedagogical side, the search history module can be used to oversee and monitor students' work and to gather feedback to help track the problems and obstacles that have been holding data-driven learning from taking a more central position in every-day language pedagogy.

The first-year pilot studies brought some conflicting opinions as to the friendliness of the overall environment (especially among the less proficient users/levels). There was, however, much agreement over the value of the collaborative options. There also emerged interesting trends between web-link-driven as opposed to independently undertaken searches. Consequently, decisions and programming efforts were made to enhance the profile of the search history interface to enable new learners to take up more challenging tasks, more swiftly.

The goal of my presentation would be to show selected fruits of the work undertaken by and with students, last year and this year, with respect to the patterning of academic discourse, and to extract some of the linguistic trends behind the observations, such as the priming types identified. On the pedagogical level, I will try to assess whether the students appear ready for the intensive and extensive DDL activity my tool offers, or whether the tool is ready for them, given the needs, expectations and track records observed.

References:

Gaskell, D. and T. Cobb. (2004). "Can learners use concordance feedback for writing errors?", System, 32 (3),301-19.

Hoey, M. (2005). Lexical priming: A new theory of words and language. London: Routledge.

Give your eyes the comfort they deserve: Imperative constructions in English print ads

Elma Kerz

Advertising operates within certain constraints, including the communicative situation, as well as space and time limitations. The whole aim of the advertising copywriters is to get us to register their communication, either for the purpose of immediate action or to make us favourably disposed in general terms to the advertised product or service. To this end, copywriters make use of a rather restricted set of constructions, among which we frequently encounter imperatives. As already pointed out by Leech (1966:30), “one of the striking features of grammar of advertising is an extreme frequency of imperative sentences”. Leech (1966) provides a list of verb items which he considers to be especially frequent fillers of the V-slot of imperative constructions in advertising: (i) items which have to do with the acquisition of the product such as get, buy or ask for (ii) items which have to do with the consumption or use of the product, have, try, use or enjoy, (iii) items which act as appeals for notice, look, see, watch, make sure or remember.

To date, however, there has been no comprehensive corpus-based study on the use and function of imperative constructions (henceforth ICs) in the genre of English print advertisements. By adopting a usage-based constructionist approach (cf. Langacker 1987, 2000, 2008; Croft 2001; Goldberg 2006;), the present paper seeks to close this gap by (i) casting light upon the nature of lexical items filling the V-slot of ICs, (ii) considering a general constructional meaning of ICs, (iii) investigating how the use of ICs contribute to the selling effectiveness of an advertised product/service. The corpus used in this study is composed of 200,183 words from a wide range of ‘mainstream’ magazine ads, covering the issues from the years 2004 to 2008.

References:

Croft, W. (2001). Radical construction grammar: syntactic theory in typological perspective. Oxford: Oxford University Press.

Goldberg, A. (2006). Constructions at work: the nature of generalization in language. Oxford: Oxford University Press.

Langacker, R. (1987). Foundations of cognitive grammar, vol. 1: theoretical prerequisites. Stanford, CA: Stanford University Press.

Langacker, R. (2000). “A dynamic usage-based model. In Usage-based models of language”. In M. Barlow and S. Kemmer (eds), Stanford: CSLI, 1-63

Langacker, R. (2008). Cognitive grammar: a basic introduction. Oxford: Oxford University Press.

Leech, G. (1996). English in advertising: a linguistic study of advertising in Great Britain. London: Longman.

A corpus-based study of Pashto

Mohammad Abid Khan and Fatima Tuz Zuhra

This research paper presents a corpus-based study of the inflectional morphological system and collocations in Pashto language. For this purpose, a corpus containing a huge amount of Pashto text is used. There exist two other considerable Pashto corpora: one developed for the BBN Byblos Pashto OCR System (Decerbo et al., 2004) and the other developed by Khan and Zuhra (2007). The former corpus contains Pashto data in the form of images and the later contains relatively little amount of data that cannot be fruitfully used for the language study purposes. A 1.225 million words Pashto corpus is developed using the corpus development tool Xaira (http://www.xaira.org, 2008). The corpus contains written Pashto text. The text is taken from various fields such as news, essays, letters, research publications, books, novels, sports and short stories. The text in the corpus is tagged according to Text Encoding Initiative (TEI) guidelines, using Extensible Markup Language (XML). To work out more productive inflectional morphological rules of affixation, hapax legomena were also searched (Aronoff and Fudeman, 2005). The claim of Aronoff and Fudeman (2005) appears to be true in this study, because the authors observed that these words, though not formed from the frequently occurring stems, are constructed by such rules that are productive when implemented afterwards. Based on this corpus-based study, an inflectional morphological analyzer for Pashto is designed, using finite state transducers (FSTs) (Beesley and Kattunen, 2003) as a tool. These FSTs are implemented using Xerox finite state tools i.e. lexc and xfst. To develop a lexicon for Pashto morphological analyzer, containing a reasonable amount of stems, the Xaira word searching facility is used. The lexicon of the morphological analyzer is stored in Microsoft Access tables. The interface to the morphological analyzer is developed using C# under .Net framework. This whole development process is presented in detail. The detailed collocation study of 30 most frequently occurring words in the corpus is provided in this paper. The statistical formula used for the identification of collocations is z-score. Z-score is used because it is widely used in Xaira. A higher z-score indicates a greater degree of collocability of an item with the node word (McEnery et al., 2006). The collocations of these words, having a higher z-score, are tabulated and presented in this paper that can be used in decision-making, in part-of-speech (POS) tagging and word-sense disambiguation.

References:

Aronoff, M. and K. Fudeman (2005). What is Morphology. Blackwell Publishing.

Beesley, K. R. and L. Karttunen (2003). Finite State Morphology: CSLI studies in Computational Linguistics.

Decerbo, M. et al. (2004). The BBN Byblos Pashto OCR System. ACM Press.

Khan, M. A. and Zuhra, F. T. (2007). “A General-Purpose Monitor Corpus of Written Pashto”, In Proceedings of the Conference on Corpus Linguistics, Birmingham, 2007.

McEnery, T. et al. (2006). Corpus-Based Language Studies. Routledge.

[Online] http://www.xaira.org. Retrieved (2008).

1 corpus + 1 corpus = 1 corpus

Adam Kilgarriff and Pavel Rychlý

If we add one pile of sand to another pile of sand, we do not get two piles of sand. We get one bigger pile of sand.

Corpora (and other collections) are like piles of sand. A corpus is a collection of texts, and when two or more collections are added together, they make one bigger collection.

Where the corpora are for the same language, and do not have too radically different designs and text types, there are at least three benefits to treating them as one, and loading the data into a corpus query system as a single object:

1. users only need to make one query to interrogate all the data

2. better analyses from bigger datasets

3. where the corpus tool has functions for comparing and constrasting subcorpora, they can be used to compare and contrast the components.

This is frequently an issue, particularly for English, where there are many corpora available. To consider one particular set of corpus users - lexicographers: they work to tight schedules, so want to keep numbers of queries to a minimum; they want as much data as possible (to give plenty of evidence for rare items); and they often need to quickly check a word’s distribution across text types (for example, spoken vs. written, British vs. American.)

Corpora like the BNC or LDC’s Gigaword contain well-prepared data in large quantities and of known provenance so it is appealing to use them in building a large corpus of general English. Both have limitations: the BNC is old, and British only; the Gigaword is American and journalism only: so lexicographers and others would like both these components (and others too).

For academic English, we would like one query to interrogate MICASE, BASE and BAWE.

A related issue arises with access restrictions. Where one corpus is publicly available, but another is available only to, for example, members of a particular organisation, who also want to benefit from the public corpus, then how should it be managed (if a single setup is to support everyone)? As a principle of data management, there should be one master copy of each dataset, to support maintenance and updating.

In our corpus system, we have recently developed a mechanism for defining a corpus as a number of component corpora, to be described and demonstrated in the talk.

Finally, two cautions:

· Where two components may contain some of the same texts, care must be taken not to include them twice.

· Any two corpora typically have different mark-up schemes, whereas a single corpus should have a single scheme. This applies to both text mark-up (tokenisation, lemmatisation, POS-tagging) and headers. For text mark-up, we re-process components so they all use the same software. For header mark-up, mappings are required from the schemes used in each component to a single unified scheme. The mapping must be designed with care, to reconcile different perspectives and categories in the different sources while keeping the scheme simple for the user and avoiding misrepresenting the data.

Simple maths for keywords

Adam Kilgarriff

“This word is twice as common here as there.” Such observations are central to corpus linguistics. We very often want to know which words are distinctive of one corpus, or text type, versus another. “Twice as common” means the word’s relative frequency in corpus one (the focus corpus, fc) is twice that in corpus two (the reference corpus, rc). We count occurrences in each corpus, divide by the number of words in that corpus and divide the one by the other to give a ratio. If we find ratios for all words and sort by the ratio we have a first pass “keywords” list of the words with the highest ratios.

In addition to issues arising from unwanted biases in the corpus contents, and “burstiness” (Gries 2008), there are two problems with the method:

1. You can’t divide by zero, so it is not clear what to do about words which are in fc but not rc.

2. The list will be dominated by words with few hits in rc: a contrast between 10 in fc and 1 in rc (assuming fc and rc are the same size) giving a ratio of 10 is not unusual, but 100,000 vs. 10,000 is, even though the ratio is still 10. Simple ratios give lists of rarer words.

The last problem has been the launching point for an extensive literature which is shared with collocation statistics, since formally, the problems are similar. Popular statistics include Mutual Information, Log Likelihood and Fisher’s Exact Test (see Manning and Schütze 1999). However the mathematical sophistication of these statistics is of no value to us, since all it serves to do is disprove a null hypothesis - that language is random - which is patently untrue, as argued in (anonymised).

Perhaps we can meet our needs with simple maths.

A common solution to the zeros problem is “add one”. If we add one to all frequencies, including those for words which were present in fc but absent in rc, then we have no zeros and can compute a ratio for all words. “Add one” is widely used as a solution to problems with zeros in language technology and elsewhere (Manning and Schütze 1999).

This suggests a solution to problem 4. Consider what happens when we add 1, 100, or 1000 to all counts from both corpora. The results, for the three words obscurish, middling and common, in two hypothetical corpora, are presented below:

Word	fc freq	rc freq	Add 1			Add 100			Add 1000
	fc freq	rc freq	AdjFs	rtio	Rk	AdjFs	rtio	rk	AdjFs	rtio	rk
obscurish	10	0	11, 1	11.0	1	110, 100	1.1	3	1010, 1000	1.01	3
middling	200	100	201, 101	1.99	2	300, 200	1.50	1	1200, 1100	1.09	2
Common	12000	10000	12001, 10001	1.20	3	12100, 10100	1.20	2	13000, 11000	1.18	1

Table 1: Frequencies, adjusted frequencies (AdjFs), ratio (rtio) and keyword rank (rk) for three “add-N” settings for rare, medium and common words.

All three words are notably more common in fc than rc so are candidates for the keyword list, but are in different frequency ranges.

* Adding 1, the order is obscurish, middling, common.

* Adding 100, it is middling, common, obscurish.

* Adding 10,000, it is common, middling, obscurish.

Different values for the “add-N” parameter focus on different frequency ranges.

For some purposes a keyword list focusing on commoner words is wanted, for others, we want to focus on rarer words. Our model lets the user specify the keyword list they want by using N as a ‘slider’.

The model provides a way of identifying keywords without unwarranted mathematical sophistication, and reflects the fact that there is no one-size-fits-all list, but different lists according to the frequency range the user is interested in.

Reference:

Gries, S (2008). “Dispersion and Adjusted Frequencies in Corpora”. Corpus Linguistics 13 (4), 403-437.

Manning, C. and H. Schütze (1999). Foundations of Statistical Natural Language Processing, MIT Press: Cambridge, MA.

English equivalents of the most frequent Czech prepositions.

A contrastive corpus-based study

Ales Klegr and Marketa Mala

Prepositions in cross-linguistic perspective are not a widely studied area despite their frequency in text and interference rate in SLA. Although English and Czech are typologically different (analytic and inflectional respectively), among the first 25 most frequent words are 8 prepositions in the BNC and 10 prepositions (!) in the CzechNC. The study, based on aligned fiction texts and using the parallel concordancer ParaConc, focuses on the first and the last of the 10 most frequent Czech prepositions, spatiotemporal v/ve and po. It seeks to explore their English equivalents and compare the similarities and idiosyncrasies of the v/ve and po translation equivalence patterns. The analysis works with three kinds of translation: prepositional and two kinds of non-prepositional, i.e. lexicalstructural transpositions and zero translation (omitting the preposition and its complement, though otherwise preserving the rest of the sentence). Instances of textual non-correspondence were excluded. For each preposition 600 parallel concordance lines were analysed. The sources are three contemporary Czech novels for v/ve (200 occurrences from each), four texts (and translators) for po (some contained fewer than 200 occurrences). The hypothesis was that, as a synsemantic function word, prepositions are more susceptible to translation shifts than lexical words, although there are apparently uses where neither type of langue can do without them.

The findings for v/ve: the patterns in the three texts are remarkably similar, the prepositional equivalents form 68.2 % on the average, the non-prepositional equivalents (transpositions and zero) account for 31.8 %. There is one dominant equivalent, the preposition in (49.7 %). The range of prepositional equivalents (even considering the polysemy of v/ve) was rather wide, including 21 different prepositions. The findings for po: the distribution of equivalents in the four texts was far more varied than in v/ve texts, the average representation – 62.2 % of prepositional equivalents,

37.8 % of non-prepositional equivalents – shows a 6 % difference in favour of non-prepositional translation compared to v/ve. There are two other noticeable differences: the most frequent prepositional equivalent, after (18 %), is significantly less prominent than in among v/ve equivalents. Second, the variety of prepositional equivalents is enormous, 37 (almost twice as many as the v/ve prepositional equivalents).

The study correlates the distribution of the equivalents with syntactic analysis, the polysemy of the Czech prepositions and the types of structures they are part of (regular combinations, MWUs, lexical phrases, chunks, idioms, etc.). It seems that po with a high number of transpositions and less clear-cut central prepositional equivalents is oftener part of higher lexical units than v/ve, heading a number of free-standing adjuncts. In sum, the prepositions exhibit a relatively high incidence of different-word class translations (e.g., compared to the noun), and a surprisingly wide range of prepositional equivalents (far greater than given in the largest bilingual dictionaries). Although both prepositions display a rich variety of translation solutions, it is possible to discern some clear tendencies and specific sense-equivalent pairings, but also specific and distinct combinatory propensities. The findings provide a useful starting point for both theoretical and practical description.

References:

Bennett, D.C. (1975). Spatial and Temporal Uses of English Prepositions. Longman.

Cermák, F., Kren M. et al. (2004). Frekvencní slovník ceštiny. Prague: NLN.

Hoffmann, S. (2005). Grammaticalization and English Complex Prepositions. A Corpus-Based Study. London and New York: Routledge.

Lindkvist, K.-G. (1978). AT versus ON, IN, BY: On the Early History of Spatial AT and Certain Primary Ideas Distinguishing AT from ON, IN, BY. Stockholm: Almquist & Wiksell.

Lindstromberg, S. (1998). English Prepositions Explained, Amsterdam: John Benjamins.

Saint-Dizier, P. (ed.)(2006). Syntax and Semantics of Prepositions. Dordrecht: Springer.

Sandhagen, H. (1956). Studies of the Temporal Senses of the Prepositions AT, ON, IN, BY, and FOR in Present-day English. Uppsala: Almquist and Wiksells.

Tyler, A., Evans, V. (2003). The Semantics of English Prepositions, Cambridge: CUP.

Collecting and collating heterogeneous datasets for multi-modal corpora

Dawn Knight

This paper examines the future of multi-modal CL research. It discusses the need for constructing corpora which concentrate on language reception rather than production, and starts to investigate the means by which this may be achieved.

This paper presents preliminary findings made as part of the NCeSS funded (National Centre for e-Social Science) DReSS II (Understanding New Forms of Digital Records for e-Social Science) project based at the University of Nottingham. DReSS II seeks to allow for the collection and collation of a wider range of heterogeneous datasets for linguistic research, with the aim of facilitating for the investigation of the interface between multiple modes of communication in everyday life.

Records of everyday (inter)actions, including SMS messages, MMS messages, interaction in virtual environments (instant messaging, entries on personal notice boards etc), GPS data, face-to-face situated discourse, phone calls and video calls provide the ‘data’ for this research. I will demonstrate (using actual datasets from pilot recordings) how this data can be utilised to enable a more detailed investigation of the interface between these different communicative modes. This is undertaken from an individual’s perspective, by means of tracking and recording a specific person’s (inter)actions over time (i.e. across an hour, day or even week). The analysis of these investigations will help us to question to what extent language choices are determined by different communicative contexts, proposing the need for a redefinition of the notion of ‘context’, one that is relevant to the future of multi-modal CL research.

This presentation will provide an overview of the methodological, technical and practical issues/challenges faced when attempting to collect and collate these datasets. In addition, the presentation will also discuss the key ethical issues and challenges that will be faced in DReSS II, looking at ethics from three different perspectives; the institutional, professional and the personal (discussing the key moral and legal obligations imposed by each of these, and their potential impact on data collection and analysis). I will review the notions of consent and anonymisation in digital datasets, and deliberate over what informed consent may mean to every stage of this research, from the collection to re-use and purpose. Finally I will discuss some of the possible applications of corpora (and research) of this nature.

Deductive versus inductive approaches to metaphor identification

Tina Krennmayr

Work within a cognitive linguistic framework tends to favor deductive approaches to finding metaphor in language (e.g. Koller 2004; Charteris-Black 2004), starting either from complete conceptual metaphors or from particular target domains. However, if the analyst assumes, for instance, the conceptual metaphor FOOTBALL IS WAR, he or she may be misled into identifying linguistic expressions as evidence of such a mapping, without considering that those very same linguistic data could be manifestations of an alternative mapping. “When a word or phrase like ‘defend’, ‘position’, ‘maneuver’, or ‘strategy’ is used, there is no a priori way to determine whether the intended underlying conceptual metaphor is war, an athletic contest, or game of chess.” (Ritchie 2003: 125).

While it is tempting to think of global mappings consistent with the themes of a text, the actual mapping may not fit the scenario in every instance. Therefore, identifying mappings locally can prevent the analyst from assuming the most (subjectively) obvious mapping from the outset. Such bottom-up approaches do not start out from the presumption of existing conceptual metaphors but decide on underlying conceptual structures for each individual case.

By using newspaper excerpts from the BNC-baby corpus, I will make explicit how a deductive and an inductive approach lead to two different outcomes when explicating conceptual metaphors and their mappings. For example, a top-down analysis of “the company won the bid” assuming the conceptual metaphor BUSINESS IS WAR may align the company with soldiers or a country and the bid with battle or war. A bottom-up analysis, however, does not restrict itself to a mapping that is coherent with war imagery, but considers alternatives such as the more general structure SUCCEEDING IN A BID IS LIKE WINNING A COMPETITION.

In order to demonstrate the fundamental difference between bottom-up and top-down approaches, I use the 5-step method (Steen 1999), an inductive approach, which has been designed to bridge the gap between linguistic and conceptual metaphor. This bottom-up approach can be modified in a way to serve as a useful tool for explaining the different thought processes employed in deductive and inductive approaches and their consequences for possible mappings.

The outcome of this analysis is verified by using the semantic annotation tool Wmatrix (Rayson 2008), which provides a web interface for corpus analysis. It contains the UCREL semantic analysis system (USAS) (Rayson et al. 2004), a framework that automatically annotates each word of a running text semantically. Hardie et al. (2007) have suggested an exploitation of USAS for metaphor analysis, since the semantic fields automatically assigned to words of a text by USAS roughly correspond to metaphorical domains, as suggested by conceptual metaphor theory. Wmatrix can therefore offer yet another alternative perspective to manual deductive and inductive approaches to metaphor analysis.

References:

Charteris-Black, J. (2004). Corpus Approaches to Critical Metaphor Analysis. Houndmills, Basingstoke: Palgrave.

Hardie, A., Koller, V., Rayson, P. and E. Semino (2007). “Exploiting a semantic annotation tool for metaphor analysis”. In M. Davies, P. Rayson, S. Hunston and P. Danielsson (eds) Proceedings of the Corpus Linguistics 2007 conference.

Koller, V. (2004). Metaphor and gender in business media discourse. Palgrave.

Rayson, P., Archer, D., Piao, S. L., McEnery, T. (2004). “The UCREL semantic analysis system”. In Proceedings of the workshop on Beyond Named Entity Recognition Semantic labelling for NLP tasks in association with LREC 2004, pp. 7-12.

Rayson, P. (2008). Wmatrix: a web-based corpus processing environment. Computing Department, Lancaster University. http://ucrel.lancs.ac.uk/wmatrix/.

Metaphors pop songs live by

Rolf Kreyer

Pop-song lyrics are often felt to be highly stereotypical and clichéd. The present paper is an attempt to shed light on the linguistic substance of these stereotypes and clichés by analysing the use of metaphors in pop song lyrics within the framework of conceptual metaphor theory. The data underlying the present study are drawn from an early version of the Giessen-Bonn Corpus of Popular Music (GBoP), a corpus containing the lyrics of 48 Albums from the US Album Charts of 2003 containing 758 songs and some 350,000 words (see Kreyer/ Mukherjee 2007 for details on the sampling of the pilot version of GBoP).

From this corpus, metaphors are extracted with the help of metaphorical-pattern analysis (Stefanowitsch 2006). A metaphorical pattern is defined as "a multi-word expression from a given source domain (SD) into which a specific lexical item from a given target domain (TD) has been inserted" (66). Such patterns can be identified by analysing all instances of a lexeme (or a group of lexemes) relating to a particular target domain and determining all metaphorical uses of this lexeme in the data. These can then be "grouped into coherent groups representing general mappings" (66). The following examples show the use of the metaphor LOVE IS AN OPPONENT IN A FIGHT.

(1) Ten rounds in the ring with love

(2) Knocked out of the ring by love

(3) K-O, knocked out by technicality / The love has kissed the canvas

With regard to clichédness, we would assume that either pop song lyrics use the same metaphors again and again, thus showing a lack of variation in the use of metaphors, or that pop songs do not exploit metaphors in a very creative way (or both). However, the analysis of the use of love metaphors in GBoP shows that this is not the case: pop songs lyrics show a fair amount of variation as well as creativity. The paper suggests an alternative explanation along the lines of the Russian formalists' approach to poetic language.

On the whole, the paper shows how a combination of corpus-based approaches and conceptual metaphor theory can be fruitfully applied to the study of the important genre of pop-song lyrics.

References:

Kreyer, R. and J. Mukherjee (2007). "The style of pop song lyrics: a corpus-linguistic pilot study". Anglia, 125, 31-58.

Stefanowitsch, A. (2006): "Words and their metaphors: A corpus-based approach". In A. Stefanowitsch and S. Th. Gries (eds), Corpus-based Approaches to Metaphor and Metonymy, Berlin and New York: Mouton de Gruyter, 61-105.

Nominal coreference in translations and originals – a corpus-based study

Kerstin Anna Kunz

The present work deals with the empirical investigation of nominal coreference relations in a corpus of English originals, their translations into German and comparable original texts in German.

Coreference between nominal expressions is an essential linguistic strategy to establish coherence and topic continuity in texts. Cohesive chains between nominal expressions on the linguistic level of the text evoke a cognitive relation of reference identity on the level of text processing. The chains may contain more than two nominal expressions and may link these expressions below and above the sentence level. Nominal coreference therefore also is a focal means employed in translations to express the same or similar relations of meaning as in their source texts.

As is commonly known, languages provide a variety of means to create nominal coreference in texts, concerning the form of coreferential expressions, their semantic relation to each other, their syntactic function and position as well as their frequency and textual distance (cf. Halliday & Hasan 1976, Gundel et al. 1993, Grosz et al. 1995). It has been widely accepted in the literature that these variations reflect different conditions of mental processing. (cf. Lambrecht 1994, Schwarz 2000, Sanders, Prince 1981, Ariel 1990). However, systemic constraints on different linguistic levels are also assumed to impact on the coreferential structure in German and English original texts (Hawkins 1968, Steiner & Teich 2003, Doherty 2004, Fabricius-Hansen 1999). As for translations, the question has to be raised whether these criteria affect the process of translation and whether this manifests in the coreferential features of the translation product.

In order to obtain a very detailed picture of the coreferential properties of translations, a small corpus of the register of political essays was compiled consisting of ten American political essays (13300 words), their translation into German (13449 words) and ten comparable original political essays in German (13679 words). The corpus was encoded manually with linguistic information about the nominal coreference expressions contained. First, the marked expressions were assigned to their respective coreference chains. A distinction was drawn within each coreference chain between antecedent and anaphor(s). Second, linguistic information was indicated with each expression as to its form, its syntactic function and position and its semantic relation to other coreference expressions in the same coreference chain. Third, the coreference expressions in the German translations were aligned with their equivalent coreference expressions in the English originals.

The current paper presents some first findings from this fine-grained corpuslinguistic analysis. By comparing the distribution of nominal coreference expressions in the three subcorpora we seek to interpret the differences traced in the translations behind the background of several sources of explanation: language typology, register and the translation process. Apart from that, we show how the different types of properties of translations as outlined by Baker (1992), Teich (2003) and others also manifest in the coreference structure of translations.

References:

Baker, M. (1992). In Other Words. A Coursebook on Translation. Routledge

Doherty, M. (2004). “Strategy of incremental parsimony”. SPRIKreports, 25.

Fabricius-Hansen, C. (1999). “Information packaging and translation: Aspects of translational sentence splitting”. Studia Grammatica, 47:175-214

Grosz, B. J. , A. K. Joshi & S. Weinstein. (1995). Centering: A framework for modelling the local coherence of discourse. Computational Linguistics, 21: 203-225.

Hawkins, J. A. (1986). A comparative typology of ENGLISH AND GERMAN. Unifying the contrasts. London & Sydney: Croom Helm.

König E. & V. Gast. (2007). Understanding English-German Contrasts. Grundlagen der Anglistik und Amerikanistik. Berlin: Erich Schmidt Verlag.

Subcategorisation “inheritance”: A corpus-based extraction and classification approach

Ekaterina Lapshinova-Koltunski

In this study, we analyse the subcategorisation of predicates, extracting them from German corpora. Predicates, such as verbs, nouns and multiwords, are automatically extracted from German corpora and are classified according to their subcategorisation properties.

Our aim is not only to classify the extracted data by subcategorisation but also to compare the properties of different morphologically related predicates, i.e. verbs, deverbal nouns and support verb constructions. We thus analyse the phenomenon of “inheritance” in subcategorisation. An example are deverbal nouns (occurring both alone and within a multiword), most of which which share their subcategorisation properties with their underlying verbs. For instance, the verb bedingen (“to condition”) subcategorises only for a dass-clause in our data. So does its nominalisation Bedingung, which can be used both as a simple predicate, e.g. die Bedingung, dass... (“the condition that...”), and within a multiword, e.g zur Bedingung machen, dass... (“to make it a condition that...”).

We intend to distinguish the cases where nominal predicates share their subcategorisation properties with their base verbs from those where they have their own properties. An example of the second case is the nominalisation Ankündigung can subcategorise only for a dass-clause, whereas its underlying verb ankündigen has also a wh-clause as a complement.

Our preliminary experiments show that different kinds of predicates have their own subcategorisation and contextual properties. There are both correspondencesand differences in the subcategorisation of verbs and deverbal predicates, which should be considered in lexicon building for NLP. In “inheritance” cases, we don’t even need to describe the predicate-argument structure of a nominalisation, as we can just rewrite it from that of the underlying verb. The differences between the subcategorisation of nominalisations and their base verbs are necessary to be taken into account in dictionaries or NLP lexicons. For “non-inheritance” cases some additional information should be included. Our system can identify such cases by means of extracting them from tokenised, pos-tagged and lemmatised text corpora.

For our study, we used a corpus German newspaper texts (220M tokens). Further extractions are planned to be done from a corpus of German texts consisting of newspaper and literary texts from Germany, Austria and Switzerland, a total of ca. 1300M words.

The reasons for non-correspondences could be of contextual or semantic character. For instance, many nominalisations subcategorising only for a dass-clause are semantically weak. They serve as a container for the content expressed in the subcategorised dass-clause. W-/ob-clauses presuppose an open set of answers which doesn’t correspond to the semantics of such kind of nominalisations. A deeper semantic analysis of such phenomena is necessary to reveal systematicities of the inheritance process.

A corpus-based study of the phraseological behaviour of abstract nouns in medical English

Natalia Judith Laso

It has been long acknowledged (Carter 1998, Williams 1998; Biber 2006; Hyland 2008) that writing a text not only entails the accurate selection of correct terms and grammatical constructions but also a good command of appropriate lexical combinations and phraseological expressions. This assumption becomes especially apparent in scientific discourse, where a precise expression of ideas and description of results is expected. Several scholars (Gledhill 2000; Flowerdew 2003; Hyland 2008) have pointed to the importance of mastering the prototypical formulaic patterns of scientific discourse so as to produce phraseologically competent scientific texts.

Research on specific-domain phraseology has demonstrated that acquiring the appropriate phraseological knowledge (i.e. mastering the prototypical lexico-grammatical patterns in which multiword units occur) is particularly difficult for non-native speakers, who must gain control of the conventions of native-like discourse (Howarth 1996/1998; Wray 1999; Oakey 2002; Williams 2005; Granger & Meunier 2008).

This paper aims to analyse native speakers’ usage of abstract nouns in medical English, which will contribute to the linguistic characterisation of the discourse of medical science. More precisely, this research study intends to explore native speakers’ prototypical lexico-grammatical patterns around abstract nouns. This analysis is based entirely on corpus evidence, since all collocational patterns discussed have been extracted from the Health Science Corpus (HSC), which consists of a 4 million word collection of health science (i.e. medicine, biomedicine, biology and biochemistry) texts, specifically compiled for the current research study. The exploration of the collocational behaviour of abstract nouns in medical English will serve as a benchmark against which to measure non-native speakers’ production.

References:

Biber, D. (2006). University language: A corpus-based study of spoken and written registers. Benjamins.

Carter, R. (1998) (2nd edition). Vocabulary: Applied Linguistic Perspectives. London: Routledge.

Flowerdew, J. (2003). “Signalling nouns in discourse” English for Specific Purposes 22, 329-346. Gledhill, C. (2000a). “The discourse function of collocation in research article introductions” English for Specific Purposes, 19 (2), 115-135.

Gledhill, C. (2000b). Collocations in science writing, Gunter Narr, Tübingen.

Howarth, P. A. (1998a). “The phraseology of learners’ academic writing” in Cowie, A. P. (Ed.) Phraseology: Theory, analysis and applications. Oxford: Clarendon Press, 161-186.

Howarth, P. A. (1998b). “Phraseology and second language proficiency” Applied Linguistics 19 (1), 24-44.

Oakey, D. (2002a). “A corpus-based study of the formal and functional variation of a lexical phrase in different academic disciplines” in Reppen, R. et al. (eds) Using corpora to Explore Linguistic Variation. Benjamins, 111-129.

Oakey, D. 2002b. “Lexical Phrases for Teaching Academic Writing in English: Corpus Evidence” in Nuccorini, S. Phrases and phraseology –data and descriptions. Bern: Peter Lang, 85-105.

Williams, G. 1998. “Collocational Networks: Interlocking Patterns of Lexis in a Corpus of Plant Biology Research Articles”. International Journal of Corpus Linguistics, 3 (1), 151–171.

Williams, G. 2005. “Challenging the native-speaker norm: a corpus-driven analysis of scientific usage” in Barnbrook, G., Danielsson, P. & Mahlberg, M. (eds) Meaningful Texts. The Extraction of Semantic Information from Monolingual and Multilingual Corpora. London/ New York: Continuum, 115-127.

“According to the equation…” : Key words and clusters in Chinese and British students’ undergraduate assignments from UK universities

Maria Leedham

Chinese students are now the largest non-native English group in UK universities (British Council, 2008), yet relatively little is known of this group’s undergraduate-level writing. This paper describes a corpus study of Chinese and British students’ undergraduate assignments from UK universities. A corpus of 267,000 words from first language (L1) Mandarin and Cantonese students is compared with a reference corpus of 1.3 million words of L1 English students’ writing. Both corpora were compiled from the 6.5 million- word British Academic Written English (Bawe) corpus with some additionally-collected texts. Each corpus contains successful assignments from the same disciplines and from a similar range of genres (such as essays, empathy writing, laboratory reports and case studies).

The aims of this study are to explore similarities and differences in the writing of the two student groups, and to track development from year 1 to year 3 of undergraduate study with a view to making pedagogical recommendations. WordSmith Tools was used to extract key words, key key words and key clusters from the Chinese corpus, and compare these with the British students’ writing within different disciplines. Clusters were further explored using WordSmith’s Concgram feature to consider non-contiguous n-grams. Both key words and clusters were assigned to categories based on Halliday’s metafunctions (Halliday and Matthiessen 2004).

The key word findings showed the influence of the discipline of study in the writing of both student groups, even at first year level and with topic-specific key words excluded. This has implications for the teaching of English for Academic Purposes (EAP) for both native and non-native speakers, as discipline-specific teaching is still the exception rather than the norm in EAP classes.

The comparison of cluster usage suggests that Chinese students use fewer clusters within the textual and interpersonal categories and many more topic-based clusters. One reason for this may be the students’ strategy of using features other than connected text to display information: for example the greater use of lists and pseudo lists in methodology sections and tables in the results sections of scientific reports. The Chinese students also employed three times as many “listlike” text sections as the British students; these consist of text within paragraphs in a pseudo list format (Heuboeck et al, 2007:29).

References:

British Council 2007. “China Market Introduction”. Retrieved from http://www.britishcouncil.org/eumd-information-background-china.htm on 070409.

Halliday, M.A.K., and C.M.I.M. Matthiessen. (2004). An Introduction to Functional Grammar. London: Arnold.

Heuboeck, A., J. Holmes and H. Nesi (2007). The Bawe Corpus Manual. Retrieved from http://www.coventry.ac.uk/researchnet/d/505/a/5160 on 07049.

Scott, M. (2008). Wordsmith Tools version 5. Available from: www.lexically.net//wordsmith/purchasing.htm

Note: The British Academic Written English (BAWE) corpus is a collaboration between the universities of Warwick, Reading and Oxford Brookes. It was collected as part of the project, 'An Investigation of Genres of Assessed Writing in British Higher Education' funded by the ESRC (2004 - 2007 RES-000-23-0800).

The Mandarin (reflexive) pronoun of SELF (zi4ji3) in Bible: A corpus-based study

Wang-Chen Ling and Siaw-Fong Chung

National Chengchi University

Coindexation of the reflexive pronoun can be easily recognized if there is only one referent. An example of self in English is seen in (1) (coindexed referents are marked x).

(1) Heidi(x) bopped herself(x) on the head with a zucchini. (Carnie, 2002: 94)

However, in Mandarin, it is not always easy to distinguish what zi4ji3 is conindexed with, especially when there is more than one referent (illustrated in (2) below; example modified from Yuan (1997: 226)).

(2) zhang1san3(x) ren4wei2 li3si4(y) zhi1dao4 wang2wu5(z) bu4 xiang1xin4

Zhang-San think Li-Si know Wang-Wu NEG believe

zi4ji3(x/y/z)

SELF

‘Zhang-San thinks that Li-Si knows that Wang-Wu do not believe himself.’

Zi4ji3 in (2) can optionally indicate x, y, or z (although the English translation has the interpretation that himself refers to x). With regard to the difficulties mentioned above, this study aims to investigate the patterns of zi4ji3 and its referent(s) through observing 1,511 Mandarin corpora data taken from a parallel corpus of the Bible online. The Holy Bible Chinese Union Version was chosen because it is widely used among Mandarin speakers since 1,919. Five patterns which indicate different coindexation of zi4ji3 are proposed.

	Prescription	Frequency	%
Pattern 1	Referent(x) zi4ji3 (x)	143	9.5
Pattern 2	Referent(x)… zi4ji3 (x)	796	52.7
Pattern 3	Referent A(x)… Referent B(y)... ReferentC(z)....… zi4ji3 (x)	540	35.7
Pattern 4	Zi4ji3(x)….Referent (x)	21	1.4
Pattern 5	No referent (Referent(s) may be the author or the readers.)	11	0.7
Total		1,511	100

Patterns and distribution of zi4ji3 and its referent.

Based on the results in the above table, it is found that 97.9% of the total instances (patterns 1, 2 and 3) show to have the referent(s) of zi4ji3 appearing before it. This results demonstrates that word order is a crucial clue for determining the conindexation of Mandarin zi4ji3, a result which conforms to the findings in Huang and Chui (1997) which indicates that clause initial NP carries important information. We also found that pragmatic interpretation is needed when processing the meaning of zi4ji3. Future work will investigate the collocations of zi4ji3 so as to better predict the cognitive mechanisms of Mandarin speakers when they use coindexation.

References:

Li, J.L. (2004). “The blocking effect of the long-distance reflexive zi4ji3 in Chinese”. Foreign Language and Literature Studies 1, 6-9.

Huang, S. and Chui, K. (1997). “Is Chinese a pragmatic order language?” In Chiu-Yu Tseng (ed.) Typological Studies of Languages in China 4, 51-79. Taipei: Academia Sinica.

Zhang, J. (2002) “Long-distance nonsyntactic anaphoring of English and Chinese reflexives”. Journal of Anqing Teachers College, 21, 93-96.

Yuan, B.P. (1997). “The comprehension of Mandarin reflexive pronoun by English and Japanese- experiment of second language acquisition”. Proceedings of 5th World Chinese Language Conference: Studies in Linguistics.

Establishing a historiography for corpus-events from their frequency: A celebration of Bertrand Russell's (1948) five postulates

Bill Louw

Over time, concordance lines have become associated, trivially, with the retrieval of linguistic forms for entirely utilitarian purposes rather than with the possibility that each occurrence-line represents an instance of a 'repeatable event (Russell, 1948; Firth, 1957) within the world or a world. The latter type of search becomes possible if and only if the collocates that make up 'facts''or 'states of affairs' (Wittgenstein, 1921: 7) are co-selected from a corpus rather than recovered as single strings. Once this procedure has been adopted, the fabric of the resulting concordance material 'gestures' both truth (Louw, 2003) and the logical structure of the world (Carnap, 1928). The nature of the events recovered by co-selection of their component collocates is of scientific interest in a number of ways, in terms of: (1) procedures needed to determine the quantum of real time in the world that must be deemed to occupy the authentic, logical and temporal 'space' between one concordance line and the next; (2) additional event-bound collocates that the procedure identifies; (3) the extent to which several states of affairs ever share the same collocates; and (4) whether delexicalisation and relexicalisation are part of the latter process. The methods (Gk. meta+hodos: after+path) used for determining spatial sampling would need to be determined scientifically (Kuhn, 1962). In this regard, Russell's early and later work on logic and perception plays a key role. The early work (written in prison) (1914; 2nd edition 1926) considers the nature of sense-data. However, by 1948, Russell abandons the latter term in favour of the word events. In particular, his five postulates offer the corpus-based investigator scientific and automated insights into determining the boundaries between events and occasions upon which events are the same rather than merely similar. The paper will be extensively illustrated.

References:

Carnap, R. (1928) Der logische Aufbau der Welt. Leipzig: Felix Meiner Verlag.

Firth, J.R. 1957. Papers in Linguistics 1934-1951. Oxford: OUP.

Kuhn, T. (1962) The Structure of Scientific Revolutions. Chicago: University of Chicago Press.

Louw, W.E. (2003) A stochastic-collocational reading of the Truth and Reconciliation Commission. Bologna: CESLIC.

Russell, B. (1926) Our Knowledge of the External World. London: Allen and Unwin.

Russell, B. (1948) Human Knowledge: Its Scope and Limits. London: Allen and Unwin.

Subjunctives in the argumentative writing of learners of German

Ursula Maden-Weinberger

The compilation of learner corpora and the analysis of learner language has proven useful in research on learner errors, second language acquisition and the improvement of language learning and teaching (cf. e.g. Granger/Hung/Petch-Tyson, 2002; Kettemann/Marko, 2002). This study utilises a corpus of learner German (CLEG) to investigate the use of subjunctive forms in the argumentative writing of university students of German (L1 English). CLEG is a 200,000 word corpus of free compositions collected from all three years of the post A-level undergraduate course of German at a British university.

The German subjunctive plays a pivotal role in all texts and discourses that deal with the discussion or examination of practical or theoretical problems as it attests the freedom of human beings to step out of the boundaries of the immediate situation and allows the speaker/writer to explore new possibilities in hypothetical scenarios (cf. Brinkman 1971; Zifonun et al. 1997). These types of texts and discourses are the typical contexts in which students of modern foreign languages at university level produce language – they write argumentative essays, critical commentaries and discussions of controversial current topics. Together with other modal means (e.g. modal verbs, modal adverbials etc.) the subjunctive is therefore a crucial tool that learners are regularly required to use in their argumentative writing.

There are two subjunctive verb paradigms in the German language: The subjunctive I is generally used in indirectness contexts (indirect speech), while the subjunctive II is used in non-factuality contexts such as conditional clauses or hypothetical argumentation. Due to a great extent of syncretism between these synthetic subjunctives and indicative verb forms, there is also a so-called “replacement subjunctive”, which is formed analytically (würde + bare infinitive). This distinction sounds straight-forward and clear-cut, but the reality of subjunctive uses in contemporary German is, indeed, very complex. Firstly, the subjunctive is only obligatory in a very limited set of circumstances and, secondly, the two subjunctive forms can, to a degree, reach into each other’s domains (e.g. subjunctive II is often used in indirect speech). Corpus investigations have shown that subjunctive use is mostly dependent on text type, genre, language variety and medium rather than on contextual circumstances.

The present investigation shows that while the morphological forms themselves pose problems for the learners even at advanced stages, the complex conventions of subjunctive application also seem to induce usage patterns that diverge from those of native speakers. Through a multiple-comparison approach of learner corpus data at different learning stages and native speaker data, it is possible to link these patterns of over-/under- and misuse to their possible causes as transfer-related (interlingual), developmental (intralingual) or teaching(material) induced phenomena.

References:

Granger, S., J. Hung and S. Petch-Tyson (eds)(2002). Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam: John Benjamins

Kettemann, B. and G. Marko (eds) (2002). Teaching and Learning by Doing Corpus Analysis. Amsterdam: Rodopi

Brinkmann, H. (1971). Die Deutsche Sprache. Gestalt und Leistung. 2nd revised and extended edition. Düsseldorf: Schwann

Zifonun, G., L. Hoffman and B. Strecker (1997). Grammatik der deutschen Sprache. Schriften des Instituts für deutsche Sprache Bd. 7. Berlin-New York: De Gruyter

The British National Corpus and the archives of The Guardian: the influence of American phraseology on British English

Ramón Martí Solano

It is a well-known fact that the influence of American English on British English is not a recent phenomenon. However, the use of a strictly American phraseology is definitely something that bears on the linguistic production of the last decade. It has already been pointed out that this influence is far more widespread in certain language registers or genres in British English (Moon, 1998: 134-135). What is more, and in the case of lexicalised variants of the same phraseological unit (PhU), « corpus evidence shows the increasing incidence of American variants in British English » (Moon, 2006: 230-231).

Does the British National Corpus suffice to examine the influence of American phraseology on British English? Phraseological phenomena, as is also the case with neology and loanwords, need to be observed in the light of up-to-date corpora. The language of the written press thus represents the ideal environment for lexicological and phraseological research. Only the combination of a reference corpus such as the BNC and the archives of at least one British daily paper can guarantee the empirical corpus-based results concerning the extent of phraseological Americanisms in British English.

The problems that can be encountered concern the representativeness and the reliability of the results retrieved by means of newspaper archives used as linguistic corpora. Newspaper archives are not real corpora for a number of reasons but mainly because they have not been created for linguistic purposes but as a practical means for readers to find out information of their interest.

How could we then examine the associations established between some idiomatic expressions and the type of discourse or the lexico-grammatical patterns in which they are inserted if a corpus such as the BNC does not provide any tokens or if the idiomatic expressions themselves are poorly represented? Newspaper archives, unable to be used for analysing grammar words—needing in their turn a morpho-syntactical tagged corpus— are perfectly suitable for the analysis of PhUs (Minugh, 1999: 68).

In order to account for the pervasiveness of strictly American PhUs in British English we have used the Longman Idioms Dictionary as a reference guide since it registers a considerable number of idioms whose entries are labelled as AmE. Among others, we have selected idioms such as come unglued, make a pit stop, eat my shorts, hell on wheels or get a bum steer with the aim of assessing their occurrence and frequency of use in the corpora. It should be underlined that the large majority of PhUs labelled as American by the LID are likewise labelled as slang or spoken.

We examine the case of the PhU not amount to a hill of beans/not worth a row of beans as a revealing example of the evolution of phraseological variant forms in English but also as a representative instance of how the American variant is gaining ground and is establishing itself not only in everyday use in British English but also in the lexicographic treatment of this type of units.

References:

Minugh, D. (1999). “You people use such weird expressions: the frequency of idioms in newspaper CDs as corpora”. In J.M. Kirk (ed.) Corpora Galore: Analyses and Techniques in Describing English. Amsterdam: Rodopi. 57-71.

Moon, R. (1998). Fixed Expressions and Idioms in English. Oxford: OUP

Moon, R. (2001). “The Distribution of Idioms in English”. In Studi Italiani di Linguistica Teorica e Applicata , 30 (2), 229-241.

Moon, R. (2006). “Corpus Approaches to Idiom”. In K. Brown, Encyclopedia of Language and Linguistics, 2nd edition, (3). Oxford: Elsevier, 230-234.

Stubbs, M. (2001). Words and Phrases: Corpus Studies of Lexical Semantics. Oxford: Blackwell.

A corpus-based approach to discourse presentation in Early Modern English writing

Dan McIntyre and Brian Walker

This paper reports on a small pilot project to build and analyse a 40,000-word corpus in order to investigate the forms and functions of discourse presentation (also known as speech, writing and thought presentation) in Early Modern English writing. Prototypically, discourse presentation refers to the presentation of speech, writing and thought from an anterior discourse in a later discourse situation. Our corpus consists of approximately 20,000 words of fiction and 20,000 words of journalism, and has been manually annotated for categories of discourse presentation using the model of speech, writing and thought presentation (SW&TP) outlined originally in Leech and Short (1981) and later developed in Semino and Short (2004). Our aims have been (i) to test the model of SW&TP on an older form of English; (ii) to determine the dominant forms and functions of SW&TP in Early Modern English writing; and (iii) to compare our results against a similarly annotated corpus of Present Day English writing in order to determine the diachronic development of discourse presentation from the Early Modern period to the present day. In so doing, an overarching aim of the project has been to contribute to what Watts and Trudgill (2001) have described as the alternative history of English; that is, a perspective on the diachronic development of English that goes beyond formal elements of language such as syntax, lexis and phonology. In our talk we will describe the construction of our corpus and issues in annotating it, as well as presenting the quantitative and qualitative results of our analysis. We will discuss similarities and differences between SW&TP in Early Modern and Present Day English, and the steps we are currently taking to extend the scope of our project.

References:

Leech, G. and M. Short (1981). Style in Fiction. London: Longman.

Semino, E. and M. Short (2004). Corpus Stylistics: Speech, Writing and Thought Presentation in a Corpus of English Writing. London: Routledge.

Watts, R. and P. Trudgill (2001). Alternative Histories of English. London: Routledge.

WriCLE: A learner corpus for second language acquisition research

Amaya Mendikoetxea, Michael O’Donnell and Paul Rollinson

The validity and reliability of Second Language Acquisition hypotheses rest heavily on adequate methods of data gathering. Much current SLA research relies on elicited experimental data and disfavours natural language use. This situation, however, is beginning to change thanks to the availability of computerized linguistic databases and learner corpora. The area of linguistic inquiry known as ‘learner corpus research’ has recently come into being as a result of the confluence of two previously disparate fields: corpus linguistics and Second Language Acquisition (see Granger 2002, Barlow 2005). Despite the interest in what learner corpora can tell us about learner language, studies in corpus-based SLA research are mostly descriptive, focusing on differences between the language of native speakers and those of learners, as observed in the written performance of advanced learners from a variety of L2 backgrounds (see e.g. Granger 2002, 2004 and Myles 2005, 2007 for a similar evaluation). In this talk, we analyse the reasons why many SLA researchers are still reticent about using corpora and how good corpus design and adequate tools to annotate and search corpora could help overcome some of the problems observed. We present the key design principles of a learner corpus we are compiling (WriCLE: Written Corpus of Learner English, 1,000.000 words) which will soon be made available to the academic community through an online interface (http://www.uam.es/proyectosinv/woslac/Wricle). We present some WriCLE-based studies and show how the corpus can be annotated semi-automatically using freely available software: Corpus Tool (by Michael O’Donnell, available at http://www.wagsoft.com/CorpusTool/), which permits annotation of texts in different layers, allowing searches across levels, and incorporates a comparative statistics package. We then show how our corpus (and software) can be used to test current hypotheses in SLA. First, we present the results of a study we conducted, which supports the psychological reality of the Unaccusative Hypothesis, and we then show how the corpus can be used to inform the current debate on the nature and source of optionality in learner language (Interface Hypothesis) and how it can complement research based on acceptability judgements. Our paper concludes with an analysis of future challenges for corpus-based SLA research.

References:

Barlow, M. (2005). “Computer-based analysis of learner language”. In R. Ellis and G. Barkhuizen (eds) Analysing Learner Language. Oxford: OUP.

Granger, S, 2002. “A bird's-eye view of computer learner corpus research”. In S.

Granger, S., J. Hung and S. Petch-Tyson (eds) Computer Learner Corpora,

Second Language Acquisition and Foreign Language Teaching. Language

Learning and Language Teaching 6. Amsterdam & Philadelphia: John

Benjamins, 3-33.

Granger, S. (2004). “Computer learner corpus research: current status and future

Prospects”. In G. Aston et al. (eds) Corpora and Language Learners. Amsterdam & Philadelphia: John Benjamins.

Myles, F. (2005). Review article. “Language Corpora and Second Language Acquisition

Research”. Second Language Research, 21 (4), 373-391.

Myles, F. (2007). “Using electronic corpora in SLA research”. In D. Ayoun (ed.) French Applied Linguistics. Amsterdam & Philadelphia: John Benjamins, 377-400.

Semi-automatic morphosyntactic tagging of a diachronic corpus of Russian

Roland Meyer

The task of annotating a historical corpus with morphosyntactic information poses problems which differ in quality and extent from those involved in the tagging of modern languages: Resources such as morphological analyzers, electronic lexica and pre-annotated training data are lacking; historical word forms occur in numerous orthographic and morphological variants, even within a single text; diachronic change leads to different realizations or loss of morphological categories. However challenging these issues are, it should be borne in mind that when tagging a historical corpus, we are not just dealing with yet another language, but with a variant more or less closely related to the modern language in many respects. This paper presents an approach to tagging a corpus of Old Russian which capitalizes on the latter point, using a parallel corpus of (digitized and edited) Old Russian documents, as well as their normalized version and modern Russian translation, as it is available from the Biblioteka literatury drevnej Rusi published by the Russian Academy of Sciences in twelve volumes from 1976 to 1994. We proceed in the following manner: An automatic sentence aligner first assigns segments of normalized Old Russian text to their modern equivalents. The modern Russian text is then tagged and lemmatized using the stochastic TreeTagger (Schmid 1994) and its most advanced modern Russian parameter files (Sharoff et al. 2008). The Old Russian version is provisionally annotated by a morphological guesser which we programmed using the Xerox finite-state tools (Beesley & Karttunen 2003) and a restricted electronic basic dictionary of about 10,000 frequent lemmas. The goal is then to reduce the ambiguities left by the guesser through information available from the modern Russian translation, without falling into the trapdoors of diachronic change. To this end, automatic word alignment is performed between Old Russian and modern Russian textual segments, based on an adapted version of the Levenshtein edit distance, up to a certain threshold level and relativized to segment length. A more “costly” part of these alignments is manually double-checked. Finally, for each of the aligned word pairs, we reduce the number of potential tags in the Old Russian version using a set of mapping rules applied to the grammatical information available from the new Russian tag. This is certainly the most intricate and error-prone, but also the linguistically most interesting part of the procedure. In devising the mapping rules, special caution was taken not to unduly overwrite information implicit in the Old Russian tags. Morphological categories which are known to have changed over time, are never touched upon (e.g., Old Russian dual number vs. modern Russian plural; modern Russian verbal aspect, which was not present yet in Old Russian). Part-of-speech categories are matched where possible, leading already to quite effective disambiguation in the Old Russian version. Where syncretisms in Old vs. modern Russian overlap only partly, the set of potential tags can regularly be reduced to that subset of the paradigm which is common to both versions of the text. This applies e.g. when an adjective form is ambiguous between nominative and accusative case in Old Russian and between genitive and accusative in modern Russian. If the information present in the two paired forms cannot be unified – e.g., if an Old Russian noun was assigned only dative case by the guesser, and the modern Russian noun only instrumental –, the Old Russian value is never changed. While this automatic projection of tagging information by no means always leads to perfect results, it drastically reduces the effort needed for manual annotation – it is well-known that disambiguation takes less time than annotation from scratch. In our case, only the remaining untagged or ambiguously tagged forms must be checked at any rate. It should also be noted that the alignment between the normalized version of the Old Russian text and its original scanned edition is almost invariably token-by-token and can thus be very easily restored. Thus, we end up with a philologically acceptable and linguistically annotated digital edition which contains a rendition of the original forms, and the normalized forms annotated with part-of-speech categories, grammatical information and lemmata. The procedure described has been implemented in Python and successfully applied to an important longer Old Russian document, the Laurentian version of Nestor’s chronicle (14th century). The resulting resource can now be used to bootstrap a larger lexicon and to train a stochastic tagger which may subsequently be applied to yet untranslated Old Russian documents.

Today and yesterday: What has changed? Aboutness in American politics

Denise Milizia

The purpose of the present paper is twofold: it aims firstly at identifying keywords in American political discourse today and secondly at analysing the phraseology that these keywords create. A fresh spoken corpus of Democratic speeches is compared to a larger corpus of Republican speeches assembled during the Bush administration. The procedure used for identifying keywords in this research is the one devised by WordSmith Tools 5.0 (Scott 2007) and is based on simple verbatim repetition. The two wordlists generated in both corpora are analysed and then compared, and the items that emerge are those which occur unusually frequently in the reference corpus – the Republican in this case – or, to put it another way, the ones which occur more frequently in the study corpus – the Democratic one. Unchanging topics will not be taken into account in the present investigation, and issues such as Afghanistan, for example, which occur in both corpora with exactly the same percentage (0.05%) will not emerge as prominent, namely as keywords. Also the topics that both Democrats and Republicans talk about will not surface in the comparison, whereas those where there is a significant departure from the reference corpus become prominent for inspection (Scott 2008), and thus the object of the current analysis.

The words and phrases emerging from the comparison are indicative not only of the “aboutness” of the text but also of the context in which they are embedded, relating, predictably, to the major ongoing topics of debate (Partington 2003), signaling a change in priorities of the new government.

Bearing in mind that phraseology is not fixed and that, in political discourse in particular, some phrases have a relatively short "shelf life" compared to others (Cheng 2004), the aim here is to unveil “aboutgrams” (Sinclair in Warren 2007) which are prioritized today but were not an issue in the previous administration, and to show that phrases are usually much better at revealing the ‘ofness’ of the text (and the context) than individual words.

Relying on the assumption that the unit of language is “the phrase, the whole phrase, and nothing but the phrase” (Sinclair 2008), and that words are attracted to one another also at a distance, phraseology is investigated using also another piece of software, ConcGram 1.0 (Greaves 2009), capable of identifying not only collocations that are strictly adjacent but also discontinuous phrasal frameworks, handling both positional (AB, BA) and constituency (AB, ACB) variants.

References:

Cheng, W. (2004). “FRIENDS, LAdies and GENtlemen: some preliminary findings from a corpus of spoken public discourses in Hong Kong”. In U. Connor U. and T. Upton (eds) Applied Corpus Linguistics: A Multidimensional Perspective. Rodopi,35-50.

Greaves C. (2009). ConcGram 1.0. Amsterdam: John Benjamins.

Partington, A. (2003). The Linguistics of Political Argument. London: Routledge.

Sinclair, J. (2008). “The phrase, the whole phrase, nothing but the phrase”. In S. Granger and F. Meunier (eds), Phraseology. An Interdisciplinary Perspective. Amsterdam: John Benjamins, 407-410.

Scott, M. (2007). WordSmith Tools 5.0. Oxford: Oxford University Press.

Scott, M. (2008). “In Search of a Bad Reference Corpus”. In D. Archer (ed.) What’s in a word-list?. Ashgate.

Warren, M. (2007). “Making sense of phraseological variation and some possible implications for notions of aboutness and keyness”. Paper presented at the International Conference: Keyness in Text, University of Siena, Italy.

Phraseological choice as ‘register-idiosyncratic’ evaluative meaning? A corpus-assisted comparative study of Congressional debate

Donna Rose Miller and Jane Helen Johnson

Linguistic research has been focussing in recent times on the possibly ‘register-idiosyncratic’ (Miller & Johnson, to appear) significance of lexical bundles/clusters. Biber & Conrad (1999) examine these in conversation and academic prose; others, such as Partington & Morley (2004), look at how lexical bundles might mark the expression of ideology, also through patterns of metaphors in newspaper editorials and news reports; Goźdź-Roszkowski (2006) classifies lexical bundles in legal discourse; Morley (2004) and Murphy & Morley (2006) examine their discourse-marking function of introducing the writer’s evaluation in newspaper editorials.

We propose to report select findings from our investigation into evaluative and speaker-positioning function bundles (Halliday 1985) in one sub-variety of political discourse: US congressional speech. Our corpus of nearly 1.5 million words consists of speeches on the homogeneous topic of the Iraq war, compiled from the transcribed sessions of the US House of Representatives for the year 2003. Methodologically, and in the wake of Hunston (2004), our study begins with a ‘text’: a 1-minute speech, whose evaluative patterns, analysed with the appraisal systems model (Martin and White 2005), serve as the basis for subsequent, comparative, corpus investigation, using primarily Wordsmith Tools and Xaira.

This particular presentation will recount results with reference to the phraseology it’s/it is + adj + that/to… and it’s/ it is time…. Our corpus findings were tested against some large general corpora of English as well as other smaller UK and US political corpora. The purpose was to see: 1) whether the patterns proved statistically ‘salient’ and/or semantically ‘primed’ (Hoey 2005); 2) how the albeit circumscribed semantic prosody in the congressional corpus compared to that in the reference corpora; 3) to what extent evaluative distinctions depended on what was being evaluated and 4) whether use of these bundles could be said to transcend register boundaries and be rather a consequence of ideological ‘saturation’ – a cross-fertilization of comparable cultural paradigms. Also probed were variations according to gender and political party.

The study is located in long-ongoing investigation into register-idiosyncratic features of evaluation and stance in parliamentary debate (cf. Miller 2007), with register being defined as “[…] a tendency to select certain combinations of meanings with certain frequencies”, and register variation as the “[…] systematic variation in [such] probabilities” (Halliday 1991: 33).

References:

Goźdź Roszkowski S. (2006) ‘Frequent Phraseology in Contractual Instruments’, in Gotti M. & D.S. Giannoni (eds) New Trends in Specialized Discourse Analysis. Bern: Peter Lang, 147-161.

Hoey M. (2005) Lexical Priming, Abingdon: Routledge.

Hunston S. (2004) ‘Counting the uncountable: Problems of identifying evaluation in a text and in a corpus’, in Partington A., J. Morley, L. Haarman (eds) Corpora and Discourse. Bern: Peter Lang, 157-188.

Martin J.R. and White P.R.R. (2005) The Language of Evaluation: Appraisal in English, Palgrave.

Miller D.R. (2007) ‘Towards a Typology of Evaluation in Parliamentary debate: From theory to Practice –and back again’, in Dossena M. & A. Jucker (eds), (Re)volutions in Evaluation: Textus XX n.1, 159-180

Miller D.R. & Johnson J. H., (to appear) ‘Evaluation, speaker-hearer positioning and the Iraq war: A corpus-assisted study of Congressional argument’, in Morley J. & P. Bayley, Wordings of War: Corpus Assisted Discourse Studies on the Iraq War, Routledge, chapt. 2.

Compiling and exploring frequent collocations

Maria Moreno-Jaen

Collocations are described by many authors (Bahns 1993; Lewis 2000) as an essential component for developing second language learning in general and lexical competence in particular. However, until quite recently teaching collocations has been an ignored area in most teaching scenarios. Perhaps one of the main reasons for this neglect is the lack of a reliable and graded bank of collocations.

Therefore, this study intends to present a systematic and reliable corpus–driven compilation of frequent collocations extracted from the Bank of English and the British National Corpus. Taking the first 400 nouns of the English language, a list of their most frequent collocations has been drawn. The first part of this paper will provide a careful explanation of the procedures followed to obtain this database, where not only statistical significance but also pedagogical factors were taken into consideration for the final selection. But, as it is often the case in corpus-based research, the output produced by this study also provided insights into new and unexpected linguistic patterning by, what according to Johns (1988) is called the “serendipity process”. Thus, the discussion of this knock-on effect will be the focus of the second part of this paper.

References:

Bahns, J. (1993). “Lexical collocations: a contrastive view.” English Language Teaching Journal, 47 (1), 56-63.

Johns, T. (1988). "Whence and Whither Classroom Concordancing?". In E. van Els et al. (eds), Computer Applications in Language Learning.

Lewis, M. (ed.) (2000). Teaching collocation. Further developments in the lexical approach. Hove: Language Teaching Publications.

The problem of authorship identification in African-American slave narratives:

Whose voice are we hearing?

Emma Moreton

‘The History of Mary Prince, a West Indian Slave’ was first published in 1831. It is a short narrative of less than 15,000 words which outlines Prince’s experiences as a slave from her birth in Bermuda in 1788 until 1828 when, whilst living in London, she walked out on her masters. The dictated narrative was transcribed by amanuensis Susan Stringland and consists of extensive supporting documentation, including a sixteen page editorial supplement. The purpose of this supporting documentation (typical of most slave narratives of the period) was to authenticate the narrative and to give credence to the claims made by the narrator. Arguably, however, it is this documentation (extensive and intrusive as it is) that has led scholars to question the authenticity of the voice of the slave, raising questions regarding the extent to which the slave narrative was tailored to hegemonic discourses about the slave experience (see Andrews, 1988; Beckles, 2000; Ferguson, 1992; Fisch, 2007; Fleischner, 1996; McBride, 2001; Paquet, 2002; and Stepto, 1979). Important questions, therefore, are how do these ‘other voices’ influence the narrative and whose voice are we hearing - the editor’s, the transcriber’s, or the slave’s?

This paper is an attempt to address these questions. Taking ‘The History of Mary Prince’ as a pilot study, this research begins by separating out the different voices within the text. Then, drawing from the theory and practice of forensic linguistics, and using computational methods of analysis, the linguistic features of the supporting documentation, as well as the narrative itself are examined in order to identify idiolectal differences. Although the findings are preliminary and by no means conclusive, it is hoped that this research will encourage further discussion regarding the authenticity of the voice of the slave.

References:

Andrews, W.L. (1988). Six Women’s Slave Narratives. Oxford: Oxford University Press.

Beckles, H.McD. (2000). “Female Enslavement and Gender Ideologies in the Caribbean”. In P.E. Lovejoy (ed.) Identity in the Shadow of Slavery, pp. 163-182. London: Continuum.

Ferguson, M. (1992). Subject To Others: British Women Writers and Colonial Slavery, 1670- 1834. London: Routledge.

Fisch, A. (ed.) (2007). The Cambridge Companion to The African American Slave Narrative. Cambridge: Cambridge University Press.

Fleischner, J. (1996). Mastering Slavery: Memory, Family, and Identity in Women’s Slave Narratives. New York: New York University Press.

McBride, D.A. (2001). Impossible Witnesses: Truth, Abolitionism, and Slave Testimony. New York: New York University Press.

Paquet, P.S. (2002). Caribbean autobiography: Cultural Identity and Self-Representation. Wisconsin: The University of Wisconsin Press.

Prince, M. (1831). The History of Mary Prince, a West Indian Slave. London: F. Westley and A.H. Davis, Stationers’ Hall Court.

A combined corpus approach to the study of adversative discourse markers

Liesbeth Mortier and Liesbeth Degand

The current paper proposal offers a corpus-based analysis of discourse markers of adversativity, i.e. markers expressing more or less strong contrasting viewpoints in language use (Schwenter 2000: 259), by comparing the semantic status and grammaticalization patterns observed for Dutch eigenlijk ('actually') and French en fait ('in fact'). Previous synchronic studies of their quantitative and qualitative distribution in spoken and written comparable corpora, as well as in translation corpora, showed these markers to be highly polysemous in nature, with meanings ranging from opposition over (counter)expectation to reformulation (author, submitted). The present study aims to show the implications of these semantic profiles for a description of en fait and eigenlijk in terms of discourse markers and (inter)subjectification (cf. Traugott and Dasher 2005, Schwenter and Traugott 2000 for in fact). To this effect, the range of corpus data will be extended to include diachronic corpora, which are expected to confirm the claim that these markers are involved in an ongoing process of subjectification and intersubjectification.

Drawing in part on previous synchronic studies and on pilot studies in the diachronic realm, these data will be analyzed both qualitatively and quantitatively, and take into account intra-linguistic as well as inter-linguistic tendencies and differences. More specifically, we will test the hypothesis according to which current meanings of opposition and reformulation in en fait and eigenlijk have a foundational meaning of deviation or "écart" (Rossari 1992: 159) in diachrony. If this is the case, then this would enable us to explain some of the subjective and intersubjective uses of eigenlijk and en fait: deviation may serve as a "discourse marker hedge" to soften what is said, with the purpose of acknowledging the addressee's actual or possible objections (cf. Traugott and Dasher 2005 for in fact).

A four-partite corpus approach, based on the criteria of text variation and time span, will be applied to eigenlijk and en fait, resulting in analyses of (i) different texts over different periods of time (comparable diachronic data), (ii) different texts within the same period of time (comparable synchronic data, written and spoken), (iii) same texts within the same period (present-day translations of literary and journalistic texts), and (iv) same texts over different periods of time (Bible translations). The confrontation of the results of such an approach to French and Dutch causal connectives in Evers-Vermeul et al. (2008) already proved successful in determining levels and traces of (inter)subjectification, and is expected to yield equally promising results in the field of adversativity. The present study would thus reflect the growing interest of grammaticalization theorists in corpus linguistics (cf. Mair 2004; Rissanen, Kytö & Heikkonen 1997), and further its application by showing the advantages of taking a corpus-based approach to the study of discourse markers.

References:

Evers-Vermeul, J. et al. (in prep). "Historical and comparative perspectives on the subjectification of

causal connectives".

Mair, C. (2004). "Corpus linguistics and grammaticalisation theory: Statistics, frequencies, and beyond". In

H. Lindquist & C. Mair (eds). Corpus approaches to grammaticalization in English. Amsterdam: John Benjamins, 121-150.

Rissanen, M., M. Kytö, and K. Heikkonen. (eds) (1997). Grammaticalization at work: Studies of long-term

developments in English. Berlin/New York: Mouton de Gruyter.

Rossari, C. (1992). "De fait, en fait, en réalité: trois marqueurs aux emplois inclusifs". Verbum 3, 139 - 161.

Schwenter, S. (2000). "Viewpoints and polysemy: linking adversative and causal meanings of discourse

markers". In Couper-Kuhlen, E. and B. Kortmann (eds). Cause - Condition - Concession - Contrast. Berlin Mouton de Gruyter. 257 - 282.

Schwenter, S. and E. Traugott. (2000). "Invoking scalarity: the development of in fact". Journal of

Historical Pragmatics 1 (1). 7 - 25.

Traugott, E. and R. Dasher. (2002). Regularity in semantic change. Cambridge: CUP.

LIX 68 revisited - an extended readability measure

Katarina Mühlenbock and Sofie Johansson Kokkinakis

The readability of a text has to be established from a linguistic perspective, and is not to be confused with legibility, which concerns aspects of layout. Consequently, readability variables must be studied and set up specifically for a certain language. Linguistically, Swedish is characterized as an inflecting and compounding language, prompting Björnsson (1968) to formulate the readability index LIX by simply adding average sentence length in terms of number of words to the percentage of words > 6 characters. LIX is still used for determining the readability of texts intended for persons with specific linguistic needs, due to cognitive disabilities or dyslexia, second language learners or beginning readers. However, we regard the LIX value as insufficient for determining the degree of adaptation required for these highly heterogeneous groups, and suggest additional parameters to be considered when aiming at tailored text production for individual readers.

To date, readability indices have mainly been constructed for English texts. In the 1920s and 30s a vast number of American formulae were constructed, based on a large variety of parameters. Realizing that statistical measurements could be useful, especially multiple variable analysis, Chall (1958) concluded that “only four types of elements seem to be significant for a readability criterion”, namely vocabulary load, sentence structure, idea density and human interest. Björnsson’s study of additional Swedish textual factors may well fit into most of these categories. They were, however, gradually abandoned in favour of factors focusing on features of surface structure alone, i.e. running words and sentences. Many of the factors initially considered were regarded as useless at the time, owing to the lack of suitable means and methods for carrying out statistical calculations on sufficiently large text collections. Our investigation is carried out on the 1.5 million-word corpus LäSBarT (Mühlenbock 2008), consisting of simplified texts and children’s books, divided into four subcorpora: community information, news, easy-to-read fiction and children’s fiction. The corpus is POS-tagged with the TNT-tagger and semiautomatically lemmatized.

Chall’s categories are used as a repository for adding further parameters to Swedish readability studies. In addition to LIX, vocabulary load is calculated by the number of extra long words (≥ 14 characters), indicating the proportion of long compounds. Sentence structure is reflected by the number of subordinate clauses per word and sentence. Idea density is indicated by measuring lexical variation OVIX (Hultman and Westman 1977), lexical density (Laufer and Nation 1995) and nominal quote, NQ (Johansson Kokkinakis 2008), indicating information load. Finally, we regard human interest as mirrored by the proportion of names, such as those of places, companies and people (Kokkinakis 2004)

Applying the extended calculations to the LIX formula and comparing the new results across the four subcorpora, we found that a measure based on more factors on lexical, syntactic and semantic levels contributes strongly to a more appropriate weighing of text difficulty. Texts adapted to the specific needs of an individual reader are valuable assets for various types of applications connected to research and education, constituting a prerequisite for the integration into society of language-impaired persons.

References:

Björnsson, C. H. (1968). Läsbarhet. Stockholm: Bokförlaget Liber.

Chall, J. S. (1958). Readability. An appraisal of research and application. Ohio.

Hultman, T. G. and M. Westman (1977). Gymnasistsvenska. Lund: Liber Läromedel.

Johansson Kokkinakis, S. (2008). En datorbaserad lexikalisk profilMeijerbergs institut,Göteborg.

Kokkinakis, D. (2004). “Reducing the effect of name explosion”. Proceedings of the LREC workshop: Beyond named entity recognition - semantic labelling for NLP., Lisbon, Portu

Laufer, B. and P. Nation (1995). "Vocabulary Size and Use: Lexical richness in L2 Written Production." Applied Linguistics 16 (3), 307-322.

Mühlenbock, K. (2008). “Readable, legible or plain words - presentation of an easy-to-read Swedish corpus”. Readability and Multilingualism, workshop at 23rd Scandinavian Conference of Linguistics. Uppsala, Sweden.

Hedges, boosters and attitude markers in L1 and L2 research article Introductions

Pilar Mur

English has no doubt become the language of publication in the academia. Most –if not all– high impact journal are nowadays published in English, and getting one’s research article accepted in any of them is a great concern for scholars worldwide, as tenure, promotion and other reward systems are based on them. The drafting of a research article in English is even harder for non-native scholars who are used to different writing conventions and styles in their own disciplinary national contexts and who very often need to turn to ‘literacy brokers’ (i.e. language reviewers, translators, proof readers, etc.) (Curry and Lillis 2006) for help. In this context carrying out intercultural specific analyses may be useful to better determine where the differences in writing in the two socio-cultural contexts (local in a national language, and international in English) lie and thus better inform non-native scholars on the necessary adjustments to be made to have better chances of successful publication in the competitive international context. This paper aims at analysing interactional metadiscourse features in Business Management research article Introductions written in English, by Anglo-American and by Spanish authors. A total of 25 interactional features, corresponding to the categories of hedges, boosters and attitude markers, which were found to be the most common metadiscourse features in a preliminary analysis will be contrastively analysed. Their frequency of use and phraseology will be studied in a corpus of 48 research articles, 24 written in English by scholars based at American institutions and 24 written in English by Spanish scholars, which is part of the SERAC (Spanish English Research Article Corpus) compiled by the InterLAE research group at the University of Zaragoza (Spain). This analysis will help us determine whether the use of these common metadiscourse features is rather homogenous in international publications, regardless of the writers’ L1 and their academic writing conventions in their national contexts, or whether a degree of divergence is allowed for as regards certain specific features. An insight into the degree of (in)variability apparently acceptable in international publications as regards the use of the most common hedges, boosters and attitude markers, that is, the degree of required acculturation, will enable us to give more accurate guidance to Spanish (or other L2 writers) when it comes to draft their research in English for an international readership.

You’re so dumb man. It’s four thousand years old. Who’s dumber me cos I said it or the guys that thought I was serious?: Age and gender-related examination of insults in Irish English

Brona Murphy

Insults are emotionally harmful expressions (Jay, 1999) which come under the term of ‘taboo language’. This paper examines insults in a 90,000 word (approx) spoken corpus of Irish English comprising three all-female and three all-male age-differentiated sub-corpora spanning the following age groups: 20s, 40s, and 70s/80s. The examination is carried out by using quantitative and qualitative corpus-based tools and methodologies such as relative frequency lists, concordances and as well as details of formulaic strings including significant clusters. Interviews, which were carried out after the compilation of the corpus, are also referred to as a form of data in this paper. This study examines age and gender-related variation in the use of insults and concludes that their usage seems to be, almost predominantly, a linguistic feature that is characteristic of young adulthood and in particular, young males. This, we found, seems to be primarily due to the influence of their particular life-stage, as well as the relationships and typical linguistic interactional behaviour common to them. In order to highlight this, the study will make reference to insults functioning at two levels of language: the level of lexis (a. intellectual insults, b. sexual insults, c. animal insults, d. expletives) as well as at the level of discourse, for example, she's not like she's not repulsive I mean look at you like. The paper will discuss both the use of insults directly to one’s interlocutor as well as the use of insults about other people who are not present. The study highlights the importance of corpus linguistics as an extremely effective tool in facilitating such analyses which, consequently, allow us to build on research in conversation analysis, discourse analysis, sociolinguistics and variational pragmatics.

References:

Jay, T. (1999). Why We Curse – A Neuro-psycho Social Theory of Speech. Amsterdam: John Benjamins.

k dixez?: A corpus study of Spanish internet orthography

Mark Myslín and Stefan Th. Gries

New technologies have always influenced communication, by adding new ways of communication to the existing ones and/or changing the ways in which existing forms of communication are utilized. This is particularly obvious in the way in which computer-mediated communication, CMC, has had an impact on communication. In this paper, we are concerned with a form of communication that is often regarded as somewhat peripheral, namely orthography. CMC has given rise to forms of orthography that deviate from standardized conventions and are motivated by segmental phonology, discourse pragmatics, and other exigencies of the channel (e.g., the fact that typed text does not straightforwardly exhibit prosody). We focus here on the characteristics of a newly evolving form of Spanish internet orthography (hereafter SIO); consider (1) and (2) for an example of SIO and its standardized equivalent respectively.

(1) hace muxo k no pasaba x aki,, jaja,, pz aprovehio pa saludart i dejar un komentario aki n tu space q sta xidillo :)) ps ia m voi

<http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendID=198138943>

(2) Hace mucho que no pasaba por aquí, jaja. Pues aprovecho para saludarte y dejar un comentario aquí en tu space que está chidillo. Pues ya me voy.

It's been a long time since I've been here, haha. Well, I thought I'd take the opportunity to leave you a comment here on your space, which is really cool. Well, off I go.

SIO, which has hardly been studied, differs markedly but not arbitrarily from standard

Spanish orthography but also exhibits considerable internal variation. In (1), for instance, que ('that') is spelt in two different ways: k and q. Since the number of non-arbitrary spelling variations is huge, in this paper, we only explore a small set of phenomena. These include the influence of colloquialness, word frequency, and word length on

- post-vocalic d/[ð] deletion in the past participle marker -ado: for example, we found that deletion in SIO is most strongly preferred in high-frequency words (cf. Figure 1);

- b/v interchanges: for example, we found that b can represent [b] and v can represent [ß] even for words whose standard spelling is the other way round, but also that high frequency words are much less flexible;

- ch à x substitution: for example, we found that vulgar words are surprisingly resistant to this spelling change;

- the repetition of letters: for example, we found a strong preference to repeat letters representing vowels and continuant consonants, which we can explain both with reference to phonological and cognitive/iconic motivations.

Our study is based on our own corpus of approx. 2.7 million words of regionally balanced informal internet Spanish. Our corpus consists of brief comments and messages (mean length of entry = 19.5 words; sd = 36.2) and was compiled in May 2008 using the scripting language R to crawl various forums and social networking websites. In addition, we use for comparison Mark Davies's (2002-) 100 million word Corpus del Español. For the quantitative study, we used Leech and Fallon's (1992) difference coefficient as well as a slightly modified version of Gries and Stefanowitsch's' (2004) distinctive collexeme analysis.

Mock trial and interpreters’ choices of lexis: Issues involving lexicalization and re-lexicalization of the crime

Sachiko Nakamura and Makiko Mizuno

It is widely believed and actually practiced by many interpreters that when they fail to come up with an exact matching word, opting for a substitution is better than omitting or discontinuing translating (e.g. “approximization” in Gile, 1995; “close renditions” in Rosenburg, 2002; “nearest translations” in Komatsu, 2005). In court interpretation training and practice, an emphasis has been placed on mastering legal terminology while relatively little attention has been paid to risks of inadvertent use of synonyms of common lexis. Some researchers also suggest that the amount of research on “the role of lexis in creating nuances of meaning for the jury” is still limited (Cotterill, 2004 p.513).

At present, Japan is in the midst of a major reform of its legal system; it will introduce the citizen-judge system, a system similar to the jury system, in May 2009. Anticipating possible impacts of interpreter intervention on lay judges, the legal discourse analysis team of the Japan Association of Interpreting and Translation Studies conducted the second mock trial based on a scenario involving a typical injury case which had been written by the team. The mock trial focused on an interpreter-mediated prosecutor questioning session, inviting mock lay-judges and two interpreters to identify impacts of interpreter interventions.

The mock trial data revealed some interesting phenomena. We found that there were two storylines running in parallel in the questioning session; one was the prosecutor’s storyline expressed in lexis representing the court norms, the other was the defendant’s storyline representing his lexical world, which was, however, interpreted and re-lexicalized only through the voices of interpreter. Although the prosecutor consistently used the lexis naguru to describe the defendant’s act of assaulting the victim, the defendant used the lexis hit in his testimony.

Furthermore, we found that two interpreters used different lexis to translate naguru in their English rendition: One interpreter used beat(en) while the other used hit. Although they used different lexis to describe the same criminal act, this difference did not manifest in the court, where Japanese language testimony alone is treated as authentic evidence.

The corpus data show that collocates of hit and beat(en) differ widely, suggesting that these lexis are not necessarily substitutable. Advertent or inadvertent use of synonymous lexis might lead to manipulation of meanings (Nakamura 2006), and we suggest that it might also alter the legal implication of key expressions in the criminal procedures, which may result in different legal judgments concerning the gravity of an offense (Mizuno 2006).

In our paper we will pick up some Japanese expressions to which two interpreters rendered decisively different translations, and examine problems concerning the choice of lexis and its legal consequences.

References:

Coterill, J. (2004) “Collocation, Connotation, and Courtroom Semantics: Lawyers’ Control of Witness

Testimony through Lexical Negotiation”. Applied Linguistics, 25 (4), 513-537.

Gile, D. (1995) Basic Concepts and Models for Interpreter and Translator Training. Amsterdam:

John Benjamins.

Komatsu, T. (2005) Tsuyaku no gijyutsu (Interpreting Skills) Kenkyusha.

Mizuno, M. (2006). “ Possibilities and Limitations for Legally Equivalent Interpreting of Written

Judgment”. Speech Communication Education, 19, 113-130.

Nakamura, S. (2006). “Legal Discourse Analysis – a Corpus Linguistic Approach. Interpretation

Studies”. The Journal of the Japan Association for Interpretation Studies, 6, 197-206.

How reliable are modern English tagged corpora? A case-study of modifier and degree adverbs in Brown, LOB, the BNC-Sampler, and ICE-GB.

Owen Nancarrow and Eric Atwell

Adverbs heading adverb phrases with a phrase-internal function are generally termed “modifiers”, or less often “qualifiers” (e.g. enormously enjoyable), as opposed to those heading clause-internal adverb phrases (e.g. I enjoyed it enormously). Since most modifier adverbs are also semantically “degree” (or equivalently “intensifier”) adverbs, many adverbs may be classified both syntactically as modifiers, and semantically as degree adverbs. Only word-classes containing the adverb very will be discussed in this paper. Each corpus has just one such word-class.

Both Brown and LOB use the same syntactically-based tag QL (qualifier). The Sampler and ICE-GB, on the other hand, use semantically-based tags: RG (degree adverb), and ADV(inten) (intensifier adverb), respectively.

However, severe restrictions are placed on the use of the LOB QL tag, and the Sampler RG tag. Essentially, adverbs tagged QL in LOB must also be degree adverbs, and never function in the clause. And in the Sampler, adverbs tagged RG must also function as qualifiers, and, again, never function in the clause.. Though the tags are different, the definitions are quite similar. Thus the Brown tag is used for qualifiers, both the LOB and Sampler tags for degree adverbs which are also qualifiers, and the Sampler tag for degree adverbs. The authors of LOB and the Sampler could have made this clear by using a tag which combined the two properties of being both a qualifier and a degree adverb.

Assessing the reliability of the tagging depends on a clear understanding of the requirements which it must meet. For these, the user must consult online documentation, and other reference materials. These documents vary greatly in reliability. The Brown documentation has conflicting accounts. The LOB Manual is not altogether clear. The Sampler material, however, is clear and straightforward, as is that of ICE.

The tagging of three of the four corpora, although all have been manually-corrected, has numerous inconsistencies, where the same word in the same context is tagged differently. LOB is almost free of this kind of error, the Sampler has a number, ICE has a lot, and Brown probably has even more. An example from the Sampler is That is far (RR) short versus majorities that ... are far (RG) short, one from Brown is he was enormously (QL) happy, versus this was an enormously (RB) long building.

When the tagging is inconsistent and the documentation is inadequate, it may not always be possible to know what tag an author really intended. This is particularly true for the degree adverb tag, because of the many cases where the semantic property of “degreeness” is only part of the meaning of the word. Thus, in ICE the word depressingly is tagged once as a general adverb and once as a degree adverb.

The reliability of the tagging of adverbs may suggest how reliable the tagging of these corpora is in the case of other tags. There are few, if any, published accounts of error-rates, and most users suppose them to be almost free of all but a few unavoidable errors. The reliability of the tagging of qualifier and degree adverbs in these four corpora, as suggested by our investigation, is: LOB, highly reliable; the Sampler, good but still with too many undesirable errors; ICE, a large number of inconsistent taggings; and Brown, by far the most unreliable.

Populating a framework for readability analysis

Neil Newbold and Lee Gillam

This paper discusses a computational approach to readability that is expected to lead eventually towards a new and configurable metric for text readability. This metric will be comprised of multiple parts for assessing and comparing different features of texts and of readers. Our research involves the elaboration, implementation and evaluation of an 8-part framework proposed by Oakland and Lane (2004) that requires consideration of both textual and cognitive factors. Oakland and Lane account for, but do not sufficiently elaborate a computational handling of factors involving language, vocabulary, background knowledge, motivation and cognitive load. We have considered the integration of techniques from a variety of fragmented research that will provide for a system for comparing the various elements involved with this view of readability. Historically, readability research has been largely focussed on simple measures of sentence and word length, though many have found these measures to be overly simplistic or an inadequate means to evaluate and compare the readability of documents, or even sentences. And yet limited attention has been paid to the contribution that simplifying and/or improving textual content can make for both human-readability and machine-readability.

We will discuss our work to date that has examined the limitations of current measures of readability and made consideration for how the wider textual and cognitive phenomena may be accounted for. This has, to some extent, validated Oakland and Lane’s framework. We have explored how these factors are realised in the construction of prototypes that that implement several parts of this framework for: (i) document quality control for technical authors (Newbold and Gillam, 2008); (ii) automatic video annotation. This includes techniques for statistical and linguistic approaches to terminology extraction (Gillam, Ahmad and Tariq, 2005), approaches to lexical and grammatical simplification (Williams and Reiter, 2008), and considerations of plain language (Boldyreff et al, 2001). We also address other elements of Oakland and Lane's framework for example how using lexical cohesion (Hoey, 1991) could address idea density, and we consider the difficulty of semantic leaps (Halliday and Martin, 1993) and how this relates to cognitive load. Repetitious patterns may help readers form an understanding of the text, but this may assume a reader has a complete understanding of the terms being used.

Our interest in readability is motivated towards making semantic content more readily available and, as a consequence, improving quality of documents. We are expecting in due course to begin to provide new methods for measuring readability that through human evaluation will lead on to a new metric. There are already indications of support for a British Standard for readability, including an inaugural working group meeting, to which it is hoped that this work will contribute in the short-to-medium term.

References:

Boldyreff, C., Burd, E., Donkin, J. and Marshall, S. (2001). “The Case for the Use of Plain English to Increase Web Accessibility.” In Proceedings of the 3rd Intl. Workshop on Web Site Evolution (WSE’01).

Gillam, L., Tariq, M. and Ahmad, K. (2005). “Terminology and the construction of ontology”. Terminology, 11(1), 55-81.

Halliday, M.A.K. and Martin J.R. (1993). Writing Science: Literacy and Discursive Power. London : Falmer Press.

Hoey, M. (1991). Patterns of Lexis in Text. Oxford: OUP.

Oakland, T. and Lane, H.B. (2004), “Language, Reading, and Readability Formulas”. International Journal of Testing, 4 (3), 239-252.

Newbold, N. and Gillam, L. (2008). “Automatic Document Quality Control”. In Proceedings of the Sixth Language Resources and Evaluation Conference (LREC).

A corpus-based comparison among CMC, cpeech and writing in Japanese

Yukiko Nishimura

Linguistic aspects of computer-mediated communication (CMC) in English have been compared with speech and writing from corpus-based approaches (Yates 1996, Collot & Belmore 1996). Message board corpora and online newsgroup discussions in English have also been studied (Marcoccia 2004, Lewis 2005, Claridge 2007). However, studies on Japanese and Japanese CMC are limited (Wilson et al eds. (2003) contains no article on Japanese). Earlier studies of Japanese CMC qualitatively found that users employ informal conversational styles and creative orthography (Nishimura 2003, 2007). This paper attempts to fill a gap in corpus-based analyses of Japanese CMC by comparing speech and writing. Based on the parts of speech (POS) distribution, it quantitatively reveals how CMC resembles or differs from speech and writing in Japanese.

Because large-scale spoken and written corpora in Japanese equivalent to the British National or Cobuild Corpora are unavailable (currently under compilation), this study creates smaller written and spoken corpora for comparison with the CMC corpus. While the word is the basic unit of quantitative study in English, the morpheme takes this role in Japanese. ChaSen morphological parsing software was therefore used as a tagging device. The CMC corpus for this study consists of messages from two major bulletin board system (BBS) websites, Channel 2 and Yahoo! Japan BBS. The written corpus was created by scanning magazine articles on topics similar to those discussed in CMC; the spoken corpus consists of transcriptions of 21 hours of casual conversation mostly among college-aged friends on everyday topics. The mismatch on the topics between CMC and speech does not seem to affect the result, as the analysis is not based on lexical frequency of vocabulary but grammatical categorisation of POS.

With the 9 POS categories as variables, the study finds that interjections distinguish speech from CMC and writing. A more detailed analysis of two key areas, particles and auxiliaries find that CMC differs from writing on the distribution of case and sentence final particles and that polite auxiliary verbs separate the two target websites, Channel 2 and Yahoo within CMC. Case particles specify grammatical relations to express content explicitly, fulfilling the “ideational function,” while interjections, sentence final particles and polite auxiliaries embody the “interpersonal function” (Halliday 1978). This corpus-based study quantitatively reveals that CMC occupies an intermediate position in a continuum from writing to speech. These findings generally conform to results of English CMC studies comparing speech and writing (Yates 1996), though details are different. The study has also discovered that variations exist within CMC in the open-access BBS context, where participant background is disclosed.

This study employs POS and their subcategories alone as a measure of comparison, unlike the multi-feature/multi- dimensional (MF-MD) approach (Biber 1988). The three corpora have limitations in terms of representativeness and scale. However, the present study, part of a larger research project on Japanese CMC (Nishimura 2008), is expected to contribute to still limited corpus-based studies of Japanese and variations in CMC. It has also clarified the medium-specific realisation of language functions as embodied by the grammatical categories.

References:

Claridge, C. (2007). “Constructing a corpus from the web: Message boards”. In M.Hundt, N. Nesselhauf & C. Biewer (eds). Corpus linguistics and the web. Language and computers: studies in practical linguistics; no. 59. Amsterdam and New York: Rodopi. 87-108.

Marcoccia, M. (2004). “On-line polylogues: conversation structure and participation framework in internet newsgroups”. Journal of Pragmatics. 36. 115-145.

Nishimura, Y. (2003) “Linguistic innovations and interactional features of casual online communication in Japanese.” Journal of Computer-Mediated Communication. 9.1. <http://jcmc.indiana.edu/vol9/issue1/nishimura.html>.

Nishimura, Y. (2007) “Linguistic innovations and interactional features in Japanese BBS communication”. In B. Danet & S. Herring (eds) Multilingual Internet: Language culture and communication online. Oxford: OUP, 163-183.

How does a grammar interplay with other grammars?:

A corpus-based approach to the Korean verbal negation system in panel discussions

Jini Noh

This study starts with several fundamental questions about how language works: Does a particular genre acquire its own distinctive grammar? If it so, how and to what extent do the features in a particular genre interact or interplay with the general features in other genres?

Though previous studies put emphasis on ordinary conversation in contrast to written discourse (Green 1982, Chafe 1994, Linell 2005), we should pay close attention to in-between genres, such as panel discussion discourse, in that they comprise most of our linguistic activities, and a grammar in a particular genre consists of genre-specific features adapted to the genre as well as features shared with other genres (Ferguson 1983, Biber 1988). Acknowledging that the evolution of grammar is the consequence of the adaptation to the conditions of use where interlocutors’ past experience of language and their assessment of present context are constructed (Pawley and Syder 1983, Hopper 1998, Iwasaki m.s.), this paper intends to reveal the dynamic nature of a grammar in panel discussion (PD) in comparison to ordinary conversation (OC) by examining the use of the Korean negation system including two forms: short form (SN) and long form negations(LN).

Overall frequencies and grammatical and lexical collocations of Korean negation forms in PD and in OC are quantitatively presented to understand what motivates the deployment of the two forms in a particular way adapted to the context of PD. First, the interplay between PD and OC is found as two shared features: (1) the preferred negations in each genre are almost evenly distributed in main and embedded clauses. (2) In both genres, more than half of LN instances in main clauses are used in interrogative sentences, while SN is used in interrogative sentences in fewer cases. Second, the adaptation to PD can be summarized as follows: (1) LN is much more preferred than SN and is distributed evenly in both main and embedded clauses. (2) Less preferred SN occurs in embedded clauses much more than in main clauses. (3) SN is collocated with toyta, ‘to become’ more frequently, while LN occurs with issta, ‘to exist/have’ and epsta, ‘not to exist/have’ more frequently than SN does.

Grammar is adaptive to the condition of use; thus, the informative and argumentative content among expert participants, the partly planned and literary style of discourse, and the question and answer format in PD lead to the genre-specific grammar of the Korean negation system. That is, the two forms are selectively employed for genre-specific needs such as turn-holding strategies in question and answer sequences, stance-marking strategies for providing opinions or arguing with other participants, genre-switching strategies for presenting argument grounds, and so forth.

Indeed, as illustrated in this study, adaptation to a particular genre is a very intriguing phenomenon, especially because there is no clear-cut boundary between shared and adaptive features which means they are very dynamically intertwined. Further studies in discourse analysis should, therefore, pay more attention to the dynamic aspects of multiple facets of grammar.

Discourse functions of lexical bundles: Isotextual comparisons

David Oakey

In the decade since its initial description by Biber et al. (1999), the lexical bundle has been widely studied as a phraseological unit for making comparisons between corpora of language use in different registers (Cortes 2002; Biber et al. 2004; Biber & Barbieri 2007). In addition to comparisons of the forms and structure of lexical bundles, later work has focused on their different discourse functions (Cortes 2004; Biber 2006; Hyland 2008). The original definition of a lexical bundle was a fixed string of three or more words which occurred more than 10 times per million words in a register corpus. This paper attempts to show that this definition is problematic in the case of lexical bundles which have been assigned discourse functions. It first discusses methodological issues relating to the construction of comparative corpora and suggests a distinction between isolexical comparisons, in which subcorpora containing a similar number of tokens are compared, and isotextual comparisons, in which subcorpora containing a similar number of texts are compared. It then presents a comparison of lexical bundle frequencies between isolexical and isotextual subcorpora of research articles in different disciplines. The results from this study suggest that isotextual comparisons reveal more about the discourse functions of lexical bundles the current definition of the lexical bundle as an isolexical phraseological unit needs revisiting.

References:

Biber, D. (2006). University English: A Corpus-Based Study of Spoken and Written Registers. Amsterdam: John Benjamins.

Biber, D., and F. Barbieri (2007). “Lexical Bundles in University Spoken and Written Registers”. English for Specific Purposes, 26, 263-286.

Biber, D., S. Conrad and V. Cortes (2004). “If You Look At ...: Lexical Bundles in University Teaching and Textbooks”. Applied Linguistics, 25(3), 371-405.

Biber, D., S. Johansson, G. N. Leech, S. M. Conrad and E. Finegan (1999). Longman Grammar of Spoken and Written English. London: Longman.

Cortes, V. (2002). “Lexical Bundles in Freshman Composition”. In R. Reppen, S. Fitzmaurice, D. Biber, (eds) Using Corpora to Explore Linguistic Variation. Amsterdam: John Benjamins, 131-145.

Cortes, V. (2004). “Lexical Bundles in Published and Student Writing in History and Biology”. English for Specific Purposes, 23(4), 397-423.

Hyland, K. (2008). “As Can Be Seen: Lexical Bundles and Disciplinary Variation”. English for Specific Purposes, 27, 4-21.

Measuring formulaic language in corpora from the perspective of language as a complex system

Matthew Brook O'Donnell and Nick Ellis

Phraseology and formulaic language are increasingly important phenomena both in the analysis of text corpora and in models of language processing (Ellis 2008; Wray 2008). Corpus linguistics, in particular, has been at the forefront of the widespread empirical investigation of phraseology/formulaicity (Gries 2008). This paper is part of a larger project to investigate the range of measures available for the quantification of formulaicity across a range of corpora compiled to provide variety in the variables that have potential interactions with formulaic language, including genre/text-type, native speaker status, education level, text length and number and sample size.

A defining characteristic of a complex system is the scale-free (Zipfian) nature of the networks that it consists of. For language, the frequency of phonemes, morphemes, lexical items and so on, have been shown to form scale-free distributions (Zipf 1935; Baayen 2001). Other components of language structure (including syntactic frames or constructions and the co-occurrence of items) behave similarly, as do speakers and learners of a language as they join a language system (Ninio 2006). Using the lists of formulas obtained from our analysis across a range of corpora (including BNCBaby, the ICLE and LOCNESS corpora, CHILDES and spoken SLA corpora) we examine the resulting distributions from three perspectives: 1. The frequency of items (2-9 grams as individual and combined lists) under various constraints (frequency thresholds and significance measures), 2. the degree of formulaic items, that is their co-occurrence frequency with other formulas (e.g. if in a text I_think is followed by at_the_end_of and then again by you_know_that it would have a degree of two) and 3. the use of formulas by language users, that is in a given corpus how many speakers make use of each of the identified formulas. Each of these appear to produce scale-free distributions indicating that a small number of formulas are particularly frequent, particularly well connected to other formulas and particularly utilized by language users joining or linking to the language system.

Our findings highlight the importance of including formulaic language among the inventory of items considered in describing language as a complex system. We also discuss the implications for language learning and processing.

References:

Baayen, H. (2001). Word frequency distributions. Dordrecht: Kluwer.

Ellis, N. C. (2008). “Phraseology: The periphery and the heart of language”. In F.

Meunier & S. Grainger (eds) Phraseology in language learning and teaching.

Amsterdam: John Benjamins, 1-13.

Gries, S. T. (2008). Phraseology and linguistic theory: a brief survey. In S. Granger &

F. Meunier (eds) Phraseology: an interdisciplinary perspective. Amsterdam: John

Benjamins, 3-25.

Ninio, A. (2006). Language and the learning curve. Oxford: Oxford University Press.

Wray, A. (2008). Formulaic language: pushing the boundaries. Oxford: OUP.

Zipf, G. K. (1935). The psycho-biology of language: An introduction to dynamic

philology. Cambridge, MA: The M.I.T. Press.

Annotating a Learner Corpus of English

Michael O'Donnell, Susana Murcia and Rebeca García

This paper describes our experiences annotating the WriCLE corpus, a large Learner Corpus of English, containing 750 essays (750,000 words) written by Spanish University students.

A standard way to explore a learner corpus is by exploring the errors in the corpus (e.g. Dagneaux et al 1998). However, we have found that some students make few errors because they avoid using structures they are not sure about, while more adventurous students take risks and thus make more errors. We are thus concerned not only with exploring the errors the students make at each level of proficiency, but also with what structures each learner attempts.

To follow this two-pronged approach, we have developed corpus software providing facilities both for hand-annotating errors and also for automatically annotating syntactic tags. To discover what the learners are doing right, we are semi-automatically tagging sentences, clauses and NPs with features representing the structural classes which the segment embodies. For instance, clauses are tagged for type of clause structure (finite/infinitive, etc.), modality-tense-aspect, clause dependence, transitivity, etc. For this tagging, the software calls on the Stanford Parser (Klein and Manning 2003) to parse each sentence in the text, and our software then converts the parse trees to features on the segments. The software is configured to allow human intervention at each step of the tagging, to eliminate false matches.

To discover what the learners are doing wrong, the software offers an interface for the hand tagging of each text, allowing a user to swipe each observed error, and then select a tag for the error from a hierarchically organised error scheme. The cross-layer search facilities of the software allow us to associate each error segment with the syntactic unit it belongs to, for instance, we can locate correctly performed passive-clauses by searching for all passive-clauses which do not contain an error of type ‘passive-formation-error’.

A large part of this study has been completed, including the preparation of the software, and automatic annotation (in a 500,000 word sub-corpus) of many of the clause structures we are exploring. Error analysis is under way as of December 2008.

When finished, our intention is to use the corpus to redesign the grammar teaching syllabus at our university. We will firstly define the degree of acquisition of each syntactic structure in terms of the degree it is attempted (compared with native writers) and also the degree it is used successfully. We will then explore those cases where grammatical structures are acquired before they are actually taught, and only those cases where structures are only adequately acquired long after they are taught. These cases suggest a change in the curriculum is needed.

References:

Dagneaux E., E. Denness and S. Granger (1998). “Computer-aided Error Analysis. System”. An International Journal of Educational Technology and Applied Linguistics, 26(2), 163-174.

Klein D. and C. Manning (2003). “Fast Exact Inference with a Factored Model for Natural Language Parsing”. Advances in Neural Information Processing Systems 15 (NIPS 2002), Cambridge, MA: MIT Press, 3-10.

Why to ‘fill a niche’ in Spanish is not that bad anymore: A corpus-based look at anglicisms in Spanish

José L. Oncins-Martínez

Even though the study of Anglicism in contemporary Spanish has produced a few excellent works in the last decades (e.g. Pratt 1980, Lorenzo 1996), these have not enjoyed the advantages of large corpora of Spanish, unavailable for linguists and lexicographers until very recently (Davies and Face 2006). In consequence, some of the observations made on this phenomenon remain too vague or still lack adequate characterization. For instance, little has been said about the distribution of these foreign forms across genres or whether they abound more in Peninsular or American Spanish.

More recently, Spanish lexicographers have voiced the need for studies on Anglicisms to resort to digitized corpora as a sounder and more reliable way of exploring and characterizing the phenomenon (Rodríguez-González 2003). The present paper situates itself in this corpus-based approach to Anglicisms. Its main aim is to show how corpora can help us track down the occurrence of foreign usages more systematically and assess the extent of their presence in Spanish more in a more accurate way. The paper reports on research carried out on Anglicism in contemporary Spanish using CORDE and CREA, the two corpora of the Spanish Royal Academy of Language. It concentrates on semantic Anglicisms, a variety that has not received much attention, probably due to the difficulties involved in their identification. Through several case studies the paper aims to show how some Spanish words are being affected by the pressing influence of some cognate English forms, for instance by extending their denotational range or altering their collocational and colligational behaviour. Since the paper seeks to show in what ways a part of the Spanish lexicon is ‘becoming English’, data from the BNC will be used for comparison.

References:

C. Pratt (1980) El anglicismo en el español contemporáneo. Madrid: Gredos

E. Lorenzo (1996) Anglicismos hispánicos. Madrid: Gredos.

Davies, M. and T.L. Face (2006) “Vocabulary Coverage in Spanish Textbooks: How Respresentative Is It?”, Proceedings of the 9th Hispanic Linguistics Symposium, Somerville, MA: Cascadilla Proceedings Project, 132-43.

Rodríguez-González, F. (2003). “Orientaciones en torno a la elaboración de un corpus de anglicismos”. In Echenique, M.T and J. Sánchez (ed.) Lexicografía y lexicología en Europa y América. Homenaje a Günther Haensch en su 80 aniversario. Gredos: Madrid, 559-575.

WMatrix and the language of electronic gaming

Vincent B. Y. Ooi

This paper presents a corpus of electronic gaming in English and aspects of its analysis using the WMatrix software. Electronic gaming is a popular genre especially among younger netizens and should count its place among the more recent Web genres which increasingly draw the attention of corpus linguistics (e.g. Hundt et al 2006). Its sub-text types would go beyond the OED’s mere characterization of the term to include only ‘war-games’ and ‘role-playing games.’ As a rich genre that comprises sports games, fighting games, puzzle games, real-time strategy games, role-playing games, first or third-person action games, and simulation-oriented games (see Meigs 2003 for a more complete list), the compilation of a reasonably large linguistic corpus of gaming challenges us to look at observable sources that include manufacturer’s products descriptions, gaming reviews, gaming web/chat logs and discussion forums for gaming participants. Such external criteria (cf Sinclair 2005), rather than internal ones which involve pre-selecting well-known gaming terms, form the basis of the G-(aming) corpus of approximately 1-million words.

The corpus is analysed using the WMatrix software (Rayson 2008), which is an integrated corpus analysis and comparison tool for generating word frequency profiles and concordances, annotated texts (including part-of-speech and word sense categories); such profiles can be statistically compared against standard corpora samplers, e.g. the British National Corpus (or ‘BNC’) word sampler. An examination of these profiles includes part-of-speech and semantic category rankings. For the latter, the top five rankings (in order of descending frequency) for the semantic profiling (as measured against the BNC’s Spoken Sampler for ‘overused categories) include “Sports”, “Warfare, defence and the army; weapons”, “Unmatched”, “Children’s games and toys”, and “Able/Intelligent”. As both a quantitative and qualitative tool, WMatrix readily complements linguistic theories such as Sinclair (2004) and Hoey (2005) that offer a lexical-grammatical perspective. While an analysis of a few specimen entries (according to the Sinclair-Hoey approach) has been made in Ooi (2008), the present paper further examines the perspectives that WMatrix offers. For instance, the wordforms that comprise the “Sports” category include multiplayer, online, and role-playing; the wordforms in the ‘‘Warfare’’ category include sword, shoot, weapons, combat; the wordforms for “Children’s games and toys’’ include player and players (hence an imprecise characterization by WMatrix here); the wordforms for the ‘‘Able/intelligent’’ category include intelligence, ability, gameplay, skills etc. and the ‘Unmatched’ category includes those terms which WMatrix would understandably would miss as a software designed originally to analyze general English (but yet is still robust enough to handle electronic gaming features in English).

References:

Hoey, M. (2005). Lexical Priming. London: Routledge.

Hundt, M., Nesselhauf, N. and Biewer, C. (eds) (2006). Corpus Linguistics and the Web. Amsterdam: Rodopi.

Meigs, T. (2003). Ultimate Game Design: Building Game Worlds. New York: McGraw-Hill.

Ooi, V.B.Y. (2008). “The lexis of electronic gaming on the Web: a Sinclairian approach”. International Journal of Lexicography. 31 (3), 311-323.

Rayson, P. (2008). “From key words to key semantic domains”. International Journal of Corpus Linguistics. 13 (4), 519-549. DOI: 10.1075/ijcl.13.4.06ray

Sinclair, J. (2004). Trust the Text: Language, Corpus and Discourse. London: Routledge.

Sinclair, J. (2005). “Corpus and text - basic principles”. In M. Wynne (ed). Developing Linguistic Corpora: a Guide to Good Practice, Oxford: Oxbow Books.

The GENTT corpus: A genre-based approach within specialised translation

Maria del Pilar Ordonez-Lopez

Textual genres, considered ‘conventionalised forms of texts which reflect the functions and goals involved in particular social occasions as well as the purposes of the participants in them’ (Hatim and Mason, 1990) are a key element in the study of specialised communication.

The concept of textual genre is the core element of an empirical investigation covering three fields of specialisation – legal, medical and technical – that is currently being carried out by the research group GENTT (Géneros textuales para la traducción/Textual Genres for Translation) in the Department of Translation and Communication at the Universitat Jaume I in Castellón de la Plana (Spain). In GENTT’s approach, genre is defined as a dynamic category which changes according to the evolution of socio-professional and cultural parameters (García-Izquierdo, 2005) and constitutes an interface between the text and the context (both source and target) (Montalt, 2003), allowing the translator to tackle the distance between his/her position as an outsider (García-Izquierdo & Montalt, 2002) and specialised professional fields. Genre makes it possible for the translator to familiarise himself/herself with the professional, social and linguistic conventions which govern specialised communication.

The GENTT research group is developing a multilingual specialised corpus (English, Spanish, Catalan, German and French), including the legal, medical and technical fields. On the one hand, this corpus is aimed at providing researchers with a comprehensive sample of specialised texts for textual analysis purposes. On the other hand, the GENTT corpus is intended to constitute a useful and dynamic tool for specialised translators and writers of specialised texts.

This paper will provide a revision the main characteristics of specialised communication, placing particular emphasis on legal translation. This revision will constitute the theoretical background for a practical analysis of the role of the GENTT corpus in the translating process performed by a specialised legal translator in order to illustrate the usefulness of this tool.

References:

Garcia Izquierdo, I., and V. Montalt (2002). ‘Translating into Textual Genres’, Linguistica Antverpiensia, 1, 135-145.

Hatim, B., and I. Mason (1990). Discourse and the translator, London: Longman.

Montalt, V. (2003). ‘El génere textual com a interfície pedagògica en la docència de la traducción cientifico-tècnica’. In M. Cánovas et al. (eds) Actes de les VII Jornades de Traducció a Vic.

Talking about risk in the MMR debate

Debbie Orpin

In 1998, the publication in the Lancet of a paper claiming a link between the measles, mumps and rubella (MMR) triple vaccine and regressive autism and inflammatory bowel disorder led to a sharp fall in uptake of the vaccine in the UK. Several medical/sociological studies of parents’ attitudes (e.g. Cassell et al, 2006) highlighted differences between scientific understanding and public perceptions of risk, a distinction often commented on in ‘risk’ literature (Hamilton et al, 2007). However, the current approach in discourse analysis to scientific popularization eschews a ‘deficit’ model of public understanding, acknowledging that ‘lay’ people can exhibit a degree of expertise in certain areas (Myers, 2003). The work reported in this paper is part of a PhD project on the discourse of the MMR debate. The data drawn on for this paper come from a 5 million word corpus of web-based vaccination related texts. The corpus is divided into four sub-corpora, one of which consists of texts from the JABS website (Justice Awareness and Basic Support: ‘the support group for vaccine damaged children’) and another of which consists of texts from UK National Health Service (NHS) websites. The aims of this paper are to examine the ways in which ‘lay’ people talk about risk in connection with vaccination and to discover the extent to which their conceptions of risk differ from and are similar to ‘scientific’ conceptions of risk. After an initial discussion of technical and generic definitions of risk and an analysis of the lemma RISK as evidenced in the Bank of English, an analysis of RISK in the JABS and NHS sub-corpora will be presented. Using Wordsmith Tools, key collocates in each of the two sub-corpora will compared and contrasted. This analysis will be enhanced with closer examination of concordances. In an attempt to find non-explicit ways of talking about risk, the collocates of certain key collocates of RISK will be examined. Finally, the strengths and limitations of a corpus approach to the analysis of ‘risk discourse’ will be evaluated.

References:

Cassell, J. A., M. Leach, M. S. Poltorak, C. H. Mercer, A. Iversen and J. R. Fairhead, J. (2006). “Is the cultural context of MMR rejection a key to an effective public health discourse?”. Public Health, 120 (9), 783-794.

Hamilton, C., Adolphs, S. and Nerlich, B. (2007). “The Meanings of ‘Risk’: a view from corpus linguistics”. Discourse and Society ,18 (2), 163-181.

Myers, G. (2003). “Discourse Studies of Scientific Popularization: questioning the boundaries”. Discourse Studies, 5 (2), 265-279.

Metaphor, part of speech, and register variation: A corpus approach to metaphor in Dutch discourse

Tryntje Pasma

Since Lakoff and Johnson’s (1980) influential work on Conceptual Metaphor Theory, the phenomenon of metaphor has been studied repeatedly within different traditions of linguistics. Over the years, numerous insights into the nature of metaphorical language, the understanding and processing of metaphors and the different linguistic representations of conceptual metaphors have been presented in a great number of studies (e.g. Lakoff & Johnson, 1980; Gibbs, 1994; Bowdle & Gentner, 2005; and Steen, 2007 for an overview). However, many of these studies have predominantly illustrated their insights into the nature of metaphors with invented language data, and have not looked at naturally-occurring discourse.

In recent years, a number of studies have substantiated the importance of naturally occurring discourse to support theoretical claims about metaphor, having made use of corpora consisting of naturally elicited language data to examine the nature of metaphor in discourse (cf. Charteris-Black, 2004; Deignan, 2005). They have given insight into the linguistic structures of metaphor, and have illustrated patterns of use in different registers of discourse. Some of these structures and patterns seem considerably different from what has been focused on in previous work. For example, as Deignan (2005) states, ‘the focus on nominal examples in much metaphor theory is not representative of the diversity of use in naturally-occurring data.’ (2005: 147). In addition, ‘there seem to be quantitative and qualitative differences in the way that different parts of speech behave metaphorically’ (2005: 147).

For the current research project, we have looked at the combination of metaphorically used words and parts of speech in a corpus of Dutch naturally-occurring language data. The corpus consists of Dutch news texts from 1950 and 2002 (30,000 words and 50,000 words respectively) and recent spontaneous conversations (50,000 words). During the initial stage, we have manually identified the metaphorically used words in both registers, using a systematic identification method (Pragglejaz Group, 2007). The second stage included generating frequency lists of metaphorically used words in the two registers, and obtaining statistical information on the occurrence of metaphor in different parts of speech. The preliminary figures and patterns illustrate that the registers of news and conversation show major dissimilarities in the way parts of speech behave metaphorically. In addition, the figures for some parts of speech, for instance for prepositions, reveal interesting patterns on a metaphorical level, regardless of the register they occur in. This paper will discuss these figures and patterns of metaphor in Dutch news discourse and conversations in more detail.

References:

Bowdle, B. F. And D. Gentner. (2005). “The career of metaphor”. Psychological Review, 112,

193-216.

Charteris-Black, J. (2004). Corpus approaches to critical metaphor analysis. London:

Palgrave MacMillan.

Deignan, A. (2005). Metaphor and corpus linguistics. Amsterdam and Philadelphia: John

Benjamins.

Gibbs, R. W., jr. (1994). The poetics of mind: Figurative thought, language, and

understanding. Cambridge: Cambridge University Press.

Lakoff, G and M. Johnson. (1980). Metaphors we live by. Chicago: Chicago University Press.

Pragglejaz Group (2007). MIP: A method for identifying metaphorically used words in

discourse. Metaphor and Symbol, 22, 1-39.

Steen, G. J. (2007). Finding metaphor in grammar and usage. Amsterdam and Philadelphia:

John Benjamins.

Signalling connections: Linking adverbials in research writing across eight disciplines

Matthew Peacock

This paper describes a corpus-based analysis of linking adverbials e.g. however and thus in research writing across eight disciplines, four science and four non-science: Chemistry, Computer Science, Materials Science, Neuroscience, Economics, Language and Linguistics, Management, and Psychology. Apart from Biber (2006), there seems to be little research in the area since Biber et al. (1999), who say linking adverbials perform important cohesive and connective functions by signalling connections between units of discourse. They found them much more common in academic prose than fiction and news.

Our aim and approach was not only to explore these cohesive and connective functions, but to investigate in detail how these adverbials perform other important functions in research writing: presenting, developing, and supporting arguments. Our corpus was 320 published research articles (RAs), 40 from each discipline. We built up new lists of 46 linking adverbials from Biber et al. and a thesaurus, and examined frequency, function, and disciplinary variation using the Concord and Contexts functions of WordSmith Tools. We investigated four Biber et al. semantic categories: Contrast/concession e.g. although, which mark contrasts, Addition e.g. also, which marks the next unit of discourse as additional, Apposition e.g. for example and i.e., which show the following units are examples or restatements, and Result/inference e.g. therefore, which marks conclusions and links claims to supporting facts.

We found three categories to be far more frequent than previously thought. Frequency per million words over all eight disciplines was contrast/concession 4590, addition 3487, apposition 3221 and result/inference 2441, whereas Biber et al. found 1200, 1000, 1800 and 3000 respectively. One reason was the greater number of linking adverbials we looked for. We also found many statistically significant disciplinary differences, e.g. the sciences (particularly Chemistry and Materials Science) used significantly fewer than the non-sciences. Linking adverbials functioned mostly as predicted by Biber et al. However, we found them to be often clustered together in long and complex sequences, which appear to be an important method of constructing, developing, supporting and strengthening arguments in RAs. Additionally, we found that also often introduces claims. A close examination of Chemistry and Materials Science RAs revealed some reasons for the much lower rate of occurrence. Authors developed arguments in a different way, describing methods and results in a more narrative or descriptive style. Apparently it is much less necessary for these readers to be explicitly told the connections between ideas, claims and facts.

Conclusions are that linking adverbials are much more important in research writing as signalling and cohesive devices, and for constructing and strengthening arguments, than previously thought. Also, different disciplines achieve this in significantly different ways, confirming the importance of discipline variation when researching their use. We hope the results help us better understand scientific expression and the RA.

References:

Biber, D. (2006). University language: a corpus-based study of spoken and written registers.

Amsterdam: John Benjamins.

Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan. (1999). Longman grammar of

spoken and written English. Harlow: Pearson Education.

Distributional characterization of constructional meaning

Florent Perek

Despite their descriptive adequacy, constructional approaches to language are still challenging for corpus linguistics. A common problem is that the tokens of a particular construction may be difficult to identify automatically in a corpus. While some constructions can be identified on the basis of lexical material (e.g. way in the way-construction) or part-of-speech tags, these strategies do not work for Argument Structure Constructions (hereafter ASCs), defined by Goldberg (1995) as independent form-meaning pairs that associate a set of argument roles and their syntactic realization with a basic clausal meaning. For instance, the caused-motion construction superficially consists of the same set of constituents that is found in a transitive clause with a locative adjunct:

(1) a. He swept the dirt under the carpet.

b. He saw the dirt under the carpet.

Moreover, ASCs convey their own independent meaning to the clause, which eludes corpusbased analyses because only formal cues are available in most corpora; there is no direct way to access the meaning of words, let alone that of larger units. The main goal of this paper is to design and test ways to derive constructional meaning from corpus data.

Drawing on recent corpus findings about the interaction of verbs and constructions

(Stefanowitsch and Gries 2005), we suggest that the meaning of ASCs can be approximated by the distribution of the verbs that occur with it, a prediction actually supported by Goldberg (1995). This predicts that differences in constructional meaning should be reflected by differences in verbal distribution; in other words, the more similar the distribution of two distinct constructions, the closer they are semantically. If successful, this technique could be ultimately used to objectively determine which ASCs are available for a given syntactic pattern, in other words, to resolve constructional homonymy.

In order to implement this hypothesis, we used several indices quantifying the similarity between distributions: the cosine distance metric of vector space and our own indices based on type and token frequencies. We derived a construction sample from the ICE-GB corpus by manually identifying several constructions defined in Goldberg (1995), namely the ditransitive, the conative, and the resultative constructions. We used the indices to quantify the similarity between the verbal distributions of these constructions and checked whether they adequately reflect semantic differences between them.

Our results show that the basic strategy is successful at distinguishing between the meaning of different constructions, since constructions with very different meanings have very divergent distributions, and conversely, the distributions of constructions that share some meaning, like the ditransitive and the caused motion constructions that both feature a transfer component, are identified as more closely related (but not identical). Since those two constructions assign different semantic roles, the latter statement suggest that the arguments of each construction play a significant role in further contrasting semantically close patterns, which indicates a way in which our method could be refined.

References:

Goldberg, A. E. (1995). Constructions: a construction grammar approach to argument

structure.Chicago: University of Chicago Press.

Stefanowitsch, A. and S. T. Gries (2005). Covarying collexemes. Corpus Linguistics and

LinguisticTheory 1 (1), 1-43.

The Tatian corpus of Old High German: Information-structural and grammatical annotation

Svetlana Petrova, Christian Chiarcos, Julia Ritz and Amir Zeldes

This paper describes the development and the evaluation of a historical corpus for Old High German which was designed as part of a more general investigation on the development of word order in the Germanic languages. In particular, we are interested in the influence of factors pertaining to information structure. In a theoretical approach to syntactic change, Hinterhölzl (2004), for instance, suggests that for the needs of communicative explicitness and rhetorical expressivity, novel word order patterns and constructions emerge, which can lose their stylistic weight and become the unmarked pattern of the language.

Building upon this hypothesis, we set up a pilot corpus project aiming to offer a profound empirical investigation on the correlation between the pragmatic value of sentence constituents and their positional realisation in the clause. For this purpose, a corpus of Old High German sentences (approx. 1600 clauses) has been compiled. The sentences were retrieved from the largest prose texts of the Old High German period, the Tatian translation from Latin (Cod. St. Gall. 56, 9^th century). In order to base the analysis on native word orders, only sentences deviating from the Latin word order were excerpted, giving the corpus its name T-CODEX (Tatian Corpus of Deviating Examples).

Sentences were annotated for various grammatical and information structural features implementing a multi-layer corpus architecture. The grammatical annotation comprises layers such as parts of speech, phrase structures, grammatical functions, phonological weight, clause status, etc. The annotation of information structure is based on recent approaches (cf. Krifka 2007) suggesting that information packaging is a complex phenomenon comprising at least the following three layers: (1) cognitive status, i.e. ‘given’ vs. ‘new’, (2) predication structure, i.e. ‘topic’ vs. ‘comment’, and (3) informational relevance, i.e. ‘focus’ vs. ‘background’. With a range of factors annotated independently from each other, we are able to query the corpus across multiple levels and thus investigate possible interdependencies between particular pragmatic features, or combinations of features (e.g. searching for focused preverbal constituents in main clauses). Building on the findings of these investigations, we aim at an information-structural cartography of the left and right sentence periphery in Old High German.

Our paper presents results and selected query examples, and in particular an empirical evaluation of the possible correlation between (a) grammatical function and positioning, (b) givenness and grammatical function, and (c) givenness and positioning of discourse referents. We show a statistically highly significant correlation between grammatical function and positioning. Similarly, we tested question (b) and found a highly significant correlation between givenness and grammatical function. However, if we compare preverbal discourse referents in main clauses to those occurring directly after the verb, we cannot find any significant evidence that given elements prefer a pre-verbal position.

References:

Hinterhölzl, R. (2004). “Language Change versus Grammar Change: What Diachronic Data Reveal about the Distinction between Core Grammar and Periphery”. In E. Fuβ and C. Trips (eds). Diachronic Clues to Synchronic Grammar. Amsterdam and Philadelphia: John Benjamins, 131–160.

Krifka, M. (2007). “Basic Notions of Information Structure”. In C. Fery et al. (eds). The Notions of Information Stucture. Potsdam: Universitätsverlag, 12–56.

Investigating lexical convergence as a representativeness criterion in the National Corpus of Polish

Piotr Pezik and Lukasz Drozdz

Ensuring a proper level of representativeness is a widely recognized corpus compilation issue. Depending on the structural aspect in question, a priori qualitative corpus design principles (Atkins et al. 1992) can be complemented with more quantitative criteria (Biber 1993). In this paper we investigate the lexical dimension of representativeness by adopting the notion of lexical convergence. We first define lexical convergence in terms of Heaps’ law (Heaps 1978) as the limit of the sublinear growth of vocabulary size. We then discuss some theoretical aspects of measuring lexical convergence, such as defining lexical units, the treatment of named entities, morphological variants and multi-word expressions. Next, we compare the performance of two textual data sampling methods to optimize the compilation of a corpus: a K-farthest neighbours algorithm is benchmarked against a text coverage procedure described in (Chujo and Utiyama 2005). The basic goal of applying these procedures is to maximize the number of lexical types, while minimizing the number of lexical tokens within the predefined typological sections of a corpus. As a result, a comparable level of lexical representativeness can be achieved in a relatively smaller corpus. Another advantage of using this approach is the increased number of rare lexical types in the corpus, which might otherwise not be represented in the corpus at all due to the predefined size limits. We have carried out our experiments on a pool of more than 600,000 texts (400 million words) collected as part of the National Corpus of Polish. We conclude that the process of corpus compilation can be optimized for lexicographic purposes by applying the abovementioned text sampling algorithms thus maximizing the number of possible lexical types for a given corpus size. Moreover, we believe this approach can be applied to other quantifiable grammatical constituents of texts, so as to maximize the representation of morphological, syntactic and semantic types in a corpus.

References:

Atkins S., J. Clear and N. Ostler. (1992). “Corpus Design Criteria”. In Literary and Linguistic

Computing 1992 7(1), 1-16.

Biber D. (1993). “Representativeness in Corpus Design”. In Literary and Linguistic

Computing 1993 8(4), 243-257.

Chujo, K. and M. Utiyama. (2005). “Understanding the role of text length, sample size and

vocabulary size in determining text coverage”. In Reading in a Foreign Language,

17:1, 1-22.

Heaps, H. S. (1978). Information Retrieval - Computational and Theoretical Aspects.

Academic Press.

Why prosodies aren’t always present: Insights into the idiom principle

Gill Philip

One of the enduring arguments against the existence of semantic prosodies is the fact that they cannot always be detected. One cause of this problem can be found in overlooking the fact that prosodies are a feature of an extended unit of meaning rather than of the stand-alone (and often underspecified) orthographic word. Yet this explanation does not account for the rather more problematic observation that some extended units of meaning seem not to have a discernable semantic prosody; or, rather, the prosody that is detected contributes nothing to the overall meaning that is not already present in the salient meanings of the units component parts. Stewart (2008), for example, finds this to be the case for the phrase “from bad to worse”; while Philip (forthcoming) notes that “caught in the act” resists any attempt at attributing a semantic prosody.

This contribution looks in detail at the reasons why semantic prosodies can sometimes be self-evident (as in Stewart’s example) or, indeed, undetectable (in Philip’s case). Approaching the issue within the Sinclairean framework of open choice vs. idiom principle (Sinclair 1991), it examines factors that seem to play a part in determining the need for pragmatic information to be included as an integral part of the extended unit of meaning, especially for multi-word nodes such as “from bad to worse” or “caught in the act”. The argument revolves around notions of salience and compositionality/analysability, and in doing so highlights the fact that not all phrases are equal: some are noncompositional and thus governed by the idiom principle, while others are conventionalised sequences essentially drawn from the open choice model. Others still, despite being perceived as complete are in fact better defined as phraseological fragments. The potential significance of such divisions for phraseology and idiom scholarship will be discussed.

The unit of meaning is seen to be tightly bound up with the idiom principle. The need for a prosody seems only to arise when phraseological meaning is both noncompositional and non-analysable. This is not to claim that only idioms have semantic prosodies, but that the presence of a semantic prosody which contributes additional meaning seems to be an essential feature of idiomaticity in general. Not all phrases are idiomatic, and “self-evident” or “missing” prosodies are likely to indicate the presence of conventionalised sequences which are essentially compositional in nature. The contribution intends to provide insights not only into semantic prosody itself, but also into the very nature of the open choice/idiom principle continuum.

References:

Philip, G. (forthcoming). From Collocation to Connotation: exploring meaning in context.

Amsterdam: John Benjamins.

Sinclair, JM. (1991) Corpus, Concordance, Collocation. Oxford: Oxford University Press.

Stewart, D. (2008) “In Sinclair’s footsteps?” Paper presented at Semantic Prosody: has it set

in, or should it be budged? Forlì, Italy. 23 June 2008.

Working with nonstandard documents: A server-based trainable fuzzy search-plugin for Mozilla Firefox

Thomas Pilz, Christoph Buck and Wolfram Luther

The interdisciplinary project Rule-Based Search in Text Databases with Nonstandard Orthography develops mechanisms to ease working with unstandardized documents and text-databases. Several types of nonstandard spellings impede working with these resources, the most prominent being historical spellings and recognition errors (cf. Pilz and Luther 2009). Many historical online archives still do not support fuzzy search or, in some cases, even any full-text search at all. In this paper, we do not intend to discuss specialized measures used for linguistic analyses or language comparison but provide researchers, as well as interested amateurs, with an easy-to-use system readily available on the websites they work on.

The distance measure FlexMetric was implemented in Java and is ideally trained on ~4,000–word data sets of standard and variant spellings using a stochastic learning algorithm based on expectation maximization (Ristad and Yianilos 1998). The representative historical German Manchester Corpus (GerManC, Durrell et al. 2007) was used in a recent evaluation of the system, but it has also been tested with English data since Pilz et al. (2008) suggest an automatic approach for both languages. The FlexMetric system was embedded into a plugin for the popular browser Mozilla Firefox, which is available on several operating systems, including a portable edition to run from removable media.

The architecture WebMetric consists of a server-side and a client-side component. The server provides clients with an HTTP Interface for communication. The calculations are performed by the server; the clients present the results visually. The client-side plugin is less than 200kb in size. It was implemented in XUL, the XML User Interface Language supported by Mozilla Firefox. With WebMetric, users are provided with instant fuzzy search on any HTML page rendered in Firefox. The browser is extended by an additional search bar on the bottom of the page, where users can enter their queries in standard orthography. Matching spelling variants on the page are highlighted according to their word distance, and a side bar can be displayed for detailed results. The server-side Web Archive can be deployed on any Java-capable server. The system is currently being tested by scientists from the University of Frankfurt on their indogermanic text corpus (TITUS, titus.uni-frankfurt.de).

Implementation in client-server architecture offers several benefits. Given a reasonable client-to-server ratio, numerous users can be handled by one server and still obtain instant results. Also, both components can be further developed separately without imposing further duties on the users. In much the same way, the trained distance measure can be changed unobtrusively. A version currently under development adapts the distance measure to the text that is being searched using text classification technology.

References:

Durrell, M., A. Ensslin and P. Bennett. (2007). “The GerManC project”. In Sprache und

Datenverarbeitung, 31, 71–80.

Pilz, T., A. Ernst-Gerlach, S. Kempken, P. Rayson and D. Archer. (2008). “The identification

of spelling variants in English and German historical texts: Manual or automatic?”. In

Literary and Linguistic Computing, 23(1), 65–72. doi: 10.1093/llc/fqm044

Pilz, T. and W. Luther (2009). “Automated support for evidence retrieval in documents with

nonstandard orthography”. In S. Featherston and S. Winkler (eds). The fruits of

empirical linguistics, Vol. 1: Process. Berlin: de Gruyter.

Ristad, E. and P. Yianilos. (1998). “Learning String Edit Distance”. In IEEE Transactions on

Pattern Recognition and Machine Intelligence, 20(5), 522–532

Capturing the generic structure of French linguistic articles with a focus on the core features of the genre

Céline Poudat

This paper will describe the generic structure of French linguistic articles, using a contrastive and corpus-based methodology. The main question that we wish to answer is the following: “To what extent the generic structure of scientific articles – and more particularly linguistics ones – can be captured and how instructive and useful are the structure and the regularities observed to organize and select the core features of the genre?”.

We will provide some answers to this linguistic question using techniques from computational and corpus linguistics. Indeed,

The notion of genre is more and more present as much in linguistics as in information retrieval or in didactics. Genres and texts are intimately connected, as genres could not be tackled within the restricted framework of the word or the sentence. Indeed, genres can only be perceptible using text corpora both generically homogeneous and representative of the genre studied. The progress of information technology and the possibilities of digitization have made it possible to gather homogeneous and synchronic corpora of written texts to analyse and characterize genres. Moreover, the development of computational linguistics, of linguistic statistics and more generally of corpus linguistics has led to that of tools and methods to process large corpora which make it possible nowadays to detect linguistic phenomena and regularities that could not have been traced before. In that sense, inductive typological methods and multi-dimensional statistical methods (see Biber, 1988) seem crucial to make the criteria which define the genres appear more clearly.

The study is based on a generically homogeneous corpus composed of 224 French journal articles issued between 1995 and 2001 that all belong to the linguistic domain.Three smaller text corpora are also used by way of comparison: 49 articles belonging to the domain of mechanics and 98 texts belonging to two other linguistic genres (linguistic book reviews and journal presentations). Texts have all been marked up in XML according to the TEI Guidelines and two types of tags belonging to two levels of annotation have been encoded: (i) tags allowing us to observe the document structure, including article sections, titles and specific components (i.e. examples, citations), and (ii) linguistic features devoted to the description of scientific texts. The linguistic observations are mostly morphosyntactic, as the level proved to be particularly relevant to discriminate genres (e.g. Karlgren&Cutting 1994, Malrieu&Rastier 2001). The tagset gathers 145 features, including the main part of speech as well as the general descriptive hypothesis put forward in the literature concerning scientific discourse (i.e. modals, connectives, dates, symbols, title cues, etc.).

The data were first analyzed with Factor Analysis (FA) – and more precisely Main Constituents Analysis (MCA) – to bring out the generic structure of the genre, which was then described and assessed in contrast with the other corpora. The features taken into account were then reduced to the most differential ones, in order to capture the core features of the genre. We will finally discuss these features, and the conclusions that could be drawn from these findings.

References:

Biber, D. (1988). Variation across Speech and Writing. Cambridge: University Press.

Karlgren, J. and D. Cutting. (1994), “Recognizing text genres with simple metrics using discriminant analysis”. In Proceedings of COLING 94, Kyoto.

Malrieu, D. and F. Rastier. (2001). “Genres et variations morphosyntaxiques”. In TAL, 42 (2).

Swales, J. (1990). Genre Analysis : English in Academic and research settings. Cambridge: University Press.

“On quoting …” – a corpus-based study on the phraseology of well-known quotations

Sixta Quassdorf

Times columnist Bernard Levin once wrote a little sketch "On Quoting Shakespeare", in which he embedded more than 40 Shakespearean quotations into the reiterated (structure try phrase) "if you say ... you are quoting Shakespeare". This rhetoric form nicely echoes the mantra-like quality with which Shakespeare's influence on the English language is often pronounced. Yet these claims need to be put onto firmer, i.e. empirically verifiable ground: The questions of what, when, where, why and how  an expression from a literary work leaves its original context and is re-applied in a more or less comparable new setting, call for a linguistic investigation. For this purpose, Shakespearean phrases and expressions have to be traced in both their historical and present-day usage. Various digitally available text corpora, such as Literature Online, The British National Corpus, 19^th/20^th c. House of Commons Parliamentary Papers 1801-2004, Eighteen's Century Collections Online (ECCO) and, of course, the World Wide Web are useful working tools in this endeavour.

The proposed paper focuses on three lines from Hamlet: "A little more than kin, and less than kind" (I, ii, line 65), "It is a custom more honour'd in the breach than the observance" (I, iv, lines 15-16) and "For 'tis sport to have the engineer hoist with his own petard" (III, iv, lines 206-207). In my presentation, I will concentrate on the description of the what, when and where these lines occur outside the context of the Shakespeare play and its reception history. The domains and period of re-application of these lines differ despite some similarities such as the expression of contrast or unexpectedness, their proverbial origin and that they are all “spoken” by Hamlet. While honoured breach and hoist petard are frequently used in the language of world affairs, the kin/kind contrast is more often used in cultural domains. Hoist petard and the kin/kind lines are popular for titles or headers, while honoured breach occurs nearly exclusively within the body of a text. Lastly, the kin/kind line may have become popular only in the second half of the 20^th century, whereas honoured breach and the hoist petard had already gained phraseological status in the 19^th century. Moreover, the data indicate that Shakespearean quotations are not necessarily used continuously, but that they can be “forgotten” for as much as 50 years and then rise as freshly as a phoenix.

Suggestions are made which evaluate these findings in view of the semantic field of the keywords and the rhetoric make-up of the phrase.

Automatic term recognition as a resource for theory of terminology

Dominika Srajerova

This paper presents a new corpus-driven method for automatic term recognition (ATR) which is based on data-mining techniques. In our research, ATR is seen as a valuable resource for theory of terminology and for the definition of the term.

Existing methods of ATR extract terms from academic texts on the basis of statistical and/or linguistic features; in most cases, these features used for the given extraction approach were selected prior to experiments that followed. In contrast, our method does not aim primarily at the extraction of terms but rather at the criteria for their selection. The data-mining tools process chosen academic texts. For each of the word-forms a number of features is listed potentially contributing to the specific ‘essence’ of a term. The relevance and significance of individual features is automatically detected by the data-mining tools.

Thus, the ultimate goal of the approach proposed here is not the automatic recognition of terms in given academic disciplines but rather a substantive contribution to the theory of terminology. That is why this corpus-driven approach is not aimed at practical goals such as building a terminology dictionary or a database. The main objective is to contribute to the definition of the term that, in its traditional form (semantically and pragmatically oriented), is not sufficient. The basis for such a definition refinement is provided by a feature-ranking (built-in function of the data-mining tools) which is able to determine the impact of individual statistical and linguistic features on the accuracy of automatic term recognition procedure. The degree of termhood (or terminological strength), as a basic characteristic of any term, depends on combination of most significant features in given wordform.

Another contribution to the theory of terminology may be seen in a demonstration of cases where the use of terms is not restricted to academic texts only. Terminological units may be automatically recognized in any type of non-academic text as well.

The method suggested here is aimed at one-word terms or one-word constituents of multi-word terms. The next step in the research should be concentration on multi-word terms as undivided units.

The use of marginal and complex prepositions in learner English

Tom Rankin and Barbard Schiftner

This paper will examine the use of a particular semantic class of marginal and complex prepositions in learner English, all of which have the meaning ‘to be about something or someone’ or ‘relating to something or someone’. Experience of teaching English at university level has shown that this class of prepositions is persistently used in a non-target fashion by German-speaking students even at higher levels of proficiency.

In order to investigate how learner usage diverges from native usage, we compared the distribution of concerning, regarding, in connection with, with/in regard to, with respect to, with/in reference to, in terms of in the German L1 component of ICLE (Granger et al 2002) and the written component of BNC Baby. The comparison shows that, in native English, the different prepositions are used in distinct structural and collocational environments, while the German-speaking learners permit a greater degree of interchangeability.

To investigate the possibility that this involves an L1 specific transfer problem, patterns of usage where also studied in the Dutch, French, Finnish and Russian L1 subcorpora of ICLE. It was found that, in comparison to the native corpus, there were significant patterns of overuse or underuse across all the different learner groups regardless of L1. However, a qualitative analysis shows that collocational behaviour and sentence structure varies in the individual learner subcorpora. While this suggests that there could be evidence of a transfer problem in the German subcorpus, it would appear that this class of prepositions poses problems for learners irrespective of their L1.

This one particular linguistic phenomenon sheds light on how even advanced learners of EFL continue to struggle with mastering discourse structure and cohesion in their writing. Consequences for teaching in this instance could include using corpus data to illustrate how the distribution of these prepositions is more differentiated than might be suggested by the current standard learner reference works.

References:

The BNC Baby, version 2. 2005. Distributed by Oxford University Computing Services on

behalf of the BNC Consortium.

Granger S., E. Dagneaux and F. Meunier. (2002). The International Corpus of Learner

English. Handbook and CD-ROM. Louvain-la-Neuve: Presses Universitaires de

Louvain.

Visualising corpus linguistics

Paul Rayson and John Mariani

Corpus linguists are not unusual in that they share the problem that every computer user in the world faces – an ever-increasing onslaught of data. It has been said that companies are drowning in data but starving for information. How are we supposed to take the mountain of data and extract nuggets of information? Perhaps corpus linguists have been ahead of the pack, having had large volumes of textual data available for a considerable number of years, but visualisation techniques have not been widely explored within corpus linguistics.

Information visualisation is one technique that can be usefully applied to support human cognitive analysis of data allowing the discovery of information through sight. Commercially available systems provide the ability to display data in many ways, including as a set of points in a scatterplot or as a graph of links and nodes. There are many different established techniques for visualising large bodies of data and to allow the user to browse, select and explore the data space.

In this paper, we explore two aspects of visualisation in corpus linguistics. First, using one type of visualisation technique – key word clouds (Rayson, 2008) – as an exemplar, we are able to visualise the development of the field of corpus linguistics itself. Second, we have been investigating how other visualisation techniques may be able to assist in applying the standard corpus linguistic methodology to ever increasingly large datasets, for example in the web-as-corpus paradigm (Kilgarriff and Grefenstette, 2003).

During the fifth conference in the series that has previously been hosted twice in Lancaster and twice in Birmingham, we are now in a position to look back and see how the focus of the conference and therefore the field has changed over time. It would be a difficult task to carry out manually as one would need to read all 540 papers or abstracts. Hence we have decided to use corpus techniques in combination with information visualisation tools to find key themes from the data itself. For each conference proceedings we extracted the titles and authors, cleaned up the remaining texts by removing headers and footers, created a word frequency list and then applied the key words procedure as implemented in the Wmatrix tool (Rayson, 2008). For each one of the four datasets (one for each conference) we generated key word clouds by comparing the samples to a standard reference of the BNC sampler. Using the key word clouds as our main guide, the patterns that we can see emerging are as follows[2]:

Annotation and tagging are well represented throughout all four conferences
There is a move over time from grammar to semantic and phraseology
Representation of Spoken data is increasing since 2003
Translation has been a strong recurring theme (apart from in 2003)
Web appears as a major theme from 2005 onwards

In the second part of this paper, we comparatively present further visualisation techniques for corpus linguistics including variants of the word cloud display utilised above, and others such as DocuBurst (Collins, 2006), Compus (Fekete, 2006), Many Eyes[3], and Collocate Clouds (Beavan, 2008). With significantly larger corpora being compiled, we predict that the need for visualisation techniques will grow stronger in order to allow interesting patterns to be seen within the language data.

References:

Collins, C. (2006). DocuBurst: Document Content Visualization Using Language Structure. IEEE Information Visualization Symposium 2006.

Beavan, D., ‘Glimpses though the clouds: collocates in a new light’. Proceedings of Digital Humanities 2008, University of Oulu, 25-29 June 2008.

Fekete, J-D. (2006). “Information visualisation for corpora”. In Proceedings of Digital Historical Corpora, Dagstuhl-Seminar 06491, Wadern, Germany, 2006.

Kilgarriff, A. and G. Grefenstette. (2003). “Introduction to the Special Issue on the Web as Corpus”. Computational Linguistics 29 (3), 333-347.

Rayson, P. (2008). “From key words to key semantic domains”. In International Journal of Corpus Linguistics. 13 (4), 519-549.

Who said what? Methodological issues in applying corpus-based methods to analyse online chat data

Paul Rayson, Phil Greenwood, Awais Rashid and James Walkerdine

In this paper, we will examine practical and methodological problems when applying corpus-based techniques to the analysis of online chat data. The positive and negative aspects of utilising corpus methods on standard reference corpora of written text and spoken transcripts are well understood (Hoffmann et al, 2008) and techniques for exploring web-derived data are under development (Kilgarriff and Grefenstette, 2003). However, much less corpus research has focussed on language data derived from social networking sites such as Facebook and MySpace (Thelwall, 2008), chat systems such as Skype and IRC (Lopresti et al, 2008) and from wikis and blogs (Karlgren, 2006).

Availability of online chat language is one problem, but there are a number of ethical and legal issues that the corpus researcher must be aware of in enterprises such as this. Using online data without first obtaining permission from the speaker or writer is problematic although it can be argued that such data is already in the public domain (Seale et al, 2005). Also, is it possible to redistribute this data as a corpus to allow replicable research to be carried out? Previous research in the web-as-corpus paradigm has identified corpus cleanup as a vital first step before further analysis can be undertaken (Evert et al, 2008). The same issue has to be addressed when dealing with chat logs. Here, we will also focus on subsequent encoding issues for meta-level information including speaker identifiers, which are very important for this kind of data (Hoffmann, 2007). In addition, we describe our work on how to train corpus annotation tools such as a part-of-speech tagger to deal with non-standard vocabulary and grammar. Similar work has been carried out tackling parallel problems for historical data (Rayson et al, 2008) and dialectal corpora (Beal et al, 2007).

This research fits within a larger project which aims to develop an ethics-centred monitoring framework and tools for supporting law enforcement agencies in policing online social networks for the purpose of protecting children. Key requirements are recording who said what and being able to support retrieval and analysis from chat logs in a traceable and precise manner and thus to enable further stylistic analysis over time and across multiple speakers in the online chat room setting.

References:

Beal, J., K. Corrigan, N. Smith and P. Rayson. (2007). “Writing the Vernacular: Transcribing and Tagging the Newcastle Electronic Corpus of Tyneside English”. In Studies in Variation, Contacts and Change in English. Volume 1. Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki. http://www.helsinki.fi/varieng/journal/volumes/01/beal_et_al/

Evert, S., A. Kilgarriff and S. Sharoff. (2008). “Preface”. Proceedings of the 4th Web as Corpus workshop (WAC-4) Can we beat Google? Marrakech, Morocco, June 2008.

Hoffmann, S. (2007). “Processing internet-derived text – creating a corpus of Usenet messages”. In Literary and linguistic computing, 22 (2), 151-165.

Hoffmann, S., S. Evert, N. Smith, D. Y. W. Lee and Y. Berglund Prytz. (2008). Corpus Linguistics with BNCweb—a Practical Guide. Frankfurt am Main: Peter Lang.

Karlgren, J. (2006). Preface to the proceedings of the workshop on NEW TEXT: Wikis and

blogs and other dynamic text sources. 11th Conference of the European Chapter of the

Association for Computational Linguistics.

Kilgarriff, A. and G. Grefenstette. (2003). “Introduction to the Special Issue on the Web as Corpus”. In Computational Linguistics 29 (3), 333-347.

Lopresti, D., S. Roy, K. Schulz and L. V. Subramaniam. (2008). “Foreword”. In Proceedings of the Second Workshop on Analytics For Noisy Unstructured Text Data. AND '08, vol. 303. ACM, New York, 1-1.

Rayson, P., D. Archer, A. Baron and N. Smith. (2008). “Travelling Through Time with

Corpus Annotation Software”. In B. Lewandowska-Tomaszczyk (ed.) Corpus Linguistics, Computer Tools, and Applications - State of the Art. PALC 2007. Frankfurt am Main: Peter Lang, 29-46.

Filling in the gaps: Supplementing modern diachronic corpora using WebCorpLSE

Andrew Kehoe

Corpus linguists are reasonably well served with modern-day finite corpora, but still lack data to support short-term modern diachronic corpus study. Several notable and well-documented initiatives have succeeded in providing small-scale but collectively useful diachronic text resources, most notable being the collection of one-million-word corpora dubbed the ‘BROWN family’, now extending from 1931 to 1991, at thirty-year intervals. Such an arrangement supports the study of language changes observable across such time gaps, but not those occurring in the shorter term. As a solution, Mair (2007) advocated the combined study of corpus and web data. An attempt to provide diachronic data was attempted by the WebCorp project (Renouf, 2002; Kehoe, 2006), but success was limited by unreliable or absent date information. This paper will show how the next stage WebCorpLSE can facilitate the combined approach Mair sought to achieve, by scouring web-accessible newspaper texts of the last century to fill in the gaps in the picture of language use provided by existing corpora. Some aspects of language change which are made available by this means will be illustrated.

References:

Kehoe, A. (2006) “Diachronic Linguistic Analysis on the Web with WebCorp”. In A. Renouf

and A. Kehoe (eds). The Changing Face of Corpus Linguistics. Amsterdam/Atlanta

GA: Rodopi

Kehoe, A. and M. Gee. (2007). “New corpora from the web: making web text more “text-like””. In P. Pahta, I. Taavitsainen, T. Nevalainen and J. Tyrkkö (eds). Towards Multimedia in Corpus Studies. University of Helsinki:

http://www.helsinki.fi/varieng/journal/volumes/02/kehoe_gee

Mair, C. (2007). “Change and variation in present-day English: integrating the analysis of closed corpora and web-based monitoring”. In M. Hundt, N. Nesselhauf and C. Biewer (eds). Corpus Linguistics and the Web. Amsterdam/New York: Rodopi, 233-247.

Renouf, A. (2002). “WebCorp: providing a renewable data source for corpus linguists”. In S. Granger and S. Petch-Tyson (eds). Extending the scope of corpus-based research: new applications, new challenges. Amsterdam: Rodopi, 38-53.

Negotiating the agenda through the White House press briefings:

A CADS analysis of important and importance in the Bush administration press briefings

Giulia Riccio

Today’s world is witnessing the development of a “social technology of influence” (Manheim 1998), orchestrated by political actors by exploiting communication and marketing techniques, in order to shape public opinion to achieve political goals. Discourse is one of the primary means through which such communication strategies are realized. I argue that the exploitation of specific discourse strategies by political actors goes hand in hand with their political strategies, and that the former are realized through repeated patterns and subtly conveyed meanings. In this perspective, I exploit the Corpus-Assisted Discourse Studies (CADS) approach (Partington 2004, 2006, 2008; Bayley 2008) to explore a discourse type by combining quantitative and qualitative analysis (Baker 2006), in order to identify the way discourse features are exploited to achieve strategic goals.

In particular, I explore the discourse strategies in one of the most important arenas of political communication today: the daily White House press briefings, where the president’s press secretary meets reporters to respond to their demand for presidential news and, more importantly from the White House point of view, to set the day’s agenda by prioritising specific issues to suit the administration’s needs. The briefings I focus on date back to George W. Bush’s first term as president (2001-2005). This administration deliberately chose to assign a vital role to its communication strategies (Kumar 2007), in a transformed context characterised by unprecedented change in the media system. Therefore, its briefings are likely to provide interesting insights into the way discourse strategies are exploited by today’s political communication wizards. I have thus assembled a corpus of the official transcripts of the 697 briefings and ‘gaggles’ that were available on the Bush administration’s website for its first term. I have added XML markup to be able to carry out a corpus-based investigation without losing sight of the context of interaction.

This paper focuses on a typical CADS research question: how are the items important and importance, typical of political communication jargon and, in theory, agenda-setting-related, used by the press secretary in relation to specific topics during Bush’s first term and in order to give prominence to some issues on the agenda? The analysis reveals that these items often carry ‘non-obvious meanings’, such as those CADS aims to detect. Rather than straightforwardly referring to the agenda-setting process, these words often replace explicit markers of deontic modality, through patterns that emphasize the need for counterparts to act in a given way, not out of an imposition from the US government, but for the sake of the international community or the US citizens.

References:

Baker, P. (2006). Using Corpora in Discourse Analysis. London: Continuum.

Bayley, P. (2008). “Weakness and fear. A fragment of corpus-assisted discourse analysis”. In A. Martelli and V. Pulcini (eds), Investigating english with corpora. Studies in honour of Maria Teresa Prat. Monza: Polimetrica.

Kumar, M. J. (2007). Managing the President’s Message. The White House Communications Operation. Baltimore: The Johns Hopkins University Press.

Manheim, J. B. (1998). “The news shapers: strategic communication as a third force in news making”, in D. A. Graber, D. McQuail and P. Norris (eds), The Politics of News. The News of Politics. Washington, DC: Congressional Quarterly Press, 94-109.

Partington, A. (2004). “Corpora and discourse, a most congruous beast”, in L. Haarman, J. Morley and A. Partington (eds), Corpora and Discourse. Bern: Peter Lang, 11-19.

Partington, A. (2006). “Metaphors, motifs and similes across discourse types: Corpus-Assisted Discourse Studies (CADS) at work”, in A. Stefanowitsch, S. Th. Gries (eds), Corpus-Based Approaches to Metaphor and Metonymy. Berlin and New York: Mouton de Gruyter, 267-304.

Partington, A. (2008). “The armchair and the machine: Corpus-assisted discourse research”, in C. Taylor Torsello, K. Ackerley and E. Castello (eds), Corpora for University Language Teachers. Bern: Peter Lang, 95-118.

Establishing the phraseological profile of a text or text type

Ute Römer

Corpus research centres around textual patterning and aims to examine how meanings are encoded in language. Starting from Sinclair’s (2005) observation that “the normal carrier of meaning is the phrase”, this paper focuses on the examination of recurring phrases in language. It discusses a new analytical model that leads corpus researchers to a profile of the central phraseological items (defined as frequently occurring word combinations that express a certain meaning) in a selected text or text collection.

The model integrates some core features of current corpus linguistic practice and shows how Sinclair’s (1996) search for units of meaning can be continued with more powerful software tools. Central to the identification of phraseological items in a text or collection of texts is the use of phraseological search engines, i.e. software tools that automatically extract recurring contiguous and non-contiguous word combinations from corpora. These tools include Collocate (Barlow 2004), kfNgram (Fletcher 2002-2007) and ConcGram (Greaves 2005). Further steps in the model (after the identification of phraseological items in a text or corpus) involve the examination of item-internal variation (e.g. *BCD, A*CD, AB*D and ABC* for the 4-word item ABCD), a functional analysis of identified items, and the analysis of item distribution across texts that relates phraseological items with text structure and hence highlights instances of textual colligation (cf. Hoey 2005). The outcome of these steps is a text-type specific inventory of phraseological items together with their functions, variation and distribution.

In this paper, the model will be applied to a 3.5-million word corpus of online academic book reviews that represents part of the specialized discourse of the global community of linguists (a ‘restricted language’ in Firth’s 1956/1968 sense). This will demonstrate how the model facilitates the study of the occurrence and distribution of the central phraseological items in linguistic book reviews, and how it helps to determine the extent of the phraseological tendency of language.

References:

Barlow, M. (2004). Collocate 1.0: Locating Collocations and Terminology. Houston, TX:

Athelstan.

Firth, J. R. (1956). “Descriptive linguistics and the study of English”. In F. R. Palmer (ed.)

(1968). Selected Papers of J. R. Firth 1952-59. Bloomington: Indiana University

Press, 96-113.

Fletcher, W. H. (2002-2007). KfNgram. Annapolis, MD: USNA.

Greaves, C. (2005). ConcGram Concordancer with ConcGram Analysis. Hong Kong:

HKUST.

Hoey, M. P. (2005). Lexical Priming: A New Theory of Words and Language. London:

Routledge.

Sinclair, J. M. (1996). “The search for units of meaning”. Textus IX(1): 75-106

Sinclair, J. M. (2005). “The phrase, the whole phrase, and nothing but the phrase”. Plenary

lecture at Phraseology 2005 conference, 13-15 Oct 2005, Louvain-la-Neuve,

Belgium.

Exploring the variation and distribution of academic phrase-frames in MICUSP

Ute Römer and Matthew Brook O’Donnell

The Michigan Corpus of Upper-level Student Papers (MICUSP) is a new corpus of proficient student academic writing samples compiled at the English Language Institute of the University of Michigan, Ann Arbor. The corpus consists of more than 800 papers, ranging from essays to lab reports, collected across a selection of disciplines within four subject divisions (humanities, social sciences, physical sciences, and biological and health sciences) from final year undergraduate and first to third year graduate students (including native and non-native speakers of English) obtaining an A grade for a paper. Each of the papers in the corpus has been marked up in XML according to the TEI Guidelines and maintains the structural divisions (sections, headings and paragraphs) of the original paper.

The first part of the paper reports on the construction and analysis of a phraseological profile of academic writing by students across the different disciplines and levels in MICUSP. Following the methodological approach taken in Römer (forthcoming), we extract and classify frequent n-grams and phrase-frames (‘p-frames’) of different lengths in the four subject divisions of MICUSP. We use the Key Word procedure (Scott 1997) to isolate items with particular associations for use in the different areas along with items shared across disciplines, i.e. belonging to student academic writing in general (e.g. at the end of, the fact that, on the other hand, and it is clear that). Our p-frame analysis complements these findings in that it provides insights into pattern variability (e.g., the p-frame at the * of summarizes the 4-grams at the end of, at the beginning of, and at the risk of).

The second part of the paper makes use of Hoey’s (2005) notion of textual colligation, namely that words and phrases may carry with them particular associations for occurrence at a specific location in text (i.e. beginning or end of a text, paragraph or sentence). This phenomenon has been demonstrated in a corpus of newspaper articles (Hoey and O’Donnell 2008). Here we select items from our n-gram and p-frame analysis and make use of the XML annotation in MICUSP to identify where in a text they are most commonly found. For example, 62% of the instances of in addition to begin a sentence (and one in five of these also begin a paragraph), whereas only 15% of the instances of the fact that occur in sentence-initial position.

This paper serves to introduce MICUSP as a new resource for the study of apprentice academic writing. The findings from our initial study on repeated word combinations in the corpus suggest that students are not only learning to make use of the appropriate phraseological items for their specific discipline as they move from undergraduate to graduate level but also that they become more and more sensitive to their common textual locations.

References:

Hoey, M. (2005). Lexical Priming: A New Theory of Words and Language. London:

Routledge.

Hoey, M. and M. B. O’Donnell (2008). “The Beginning of Something Important? Corpus

Evidence on the Text Beginnings of Hard News Stories”. In Lewandowska-

Tomaszczyk (ed.). Corpus Linguistics, Computer Tools, and Applications-State of the

Art. PALC 2007. Bern: Peter Lang.

Römer, U. (forthcoming). “English in academia: Does nativeness matter?” In C. M. Bongartz

and J. Mukherjee (eds). Anglistik: International Journal of English Studies 20, 2

(2009). Special Issue on "Non-native Englishes: Exploring second-language varieties

and learner Englishes".

Scott, M. (1997). “PC Analysis of key words - and key key words”. In System 25(1), 1-13.

Modality in South African English: A historical and comparative study

Ronel Rossouw and Bertus van Rooy

The grammar of South African English (SAE) is an under-researched aspect of this variety of English, which developed from the input of the so-called Settler English of the early 19^th century. Mesthrie and West (1995) have outlined some grammatical characteristics of this input form of SAE or Proto South African English, whereas Bowerman (2004) has compiled a short list of grammatical features of contemporary SAE (mostly from spoken data), but the scope of the grammar domain, especially that of written SAE, remains largely uncharted.

This paper will attempt to close the gap between the input and contemporary forms of SAE by investigating the diachronic development in the usage of modals and semi-modals from 1820 until today. A corpus of correspondence available in various archives, covering the period from the first settlement in 1820 to the mid twentieth century, will be compared to the written correspondence section from ICE South Africa, in order to trace the changes in the meaning nuances or illocutionary force of modals such as must (briefly mentioned by Bowerman 2004), will, shall, can, may, should and would, and semi-modals such as have to, ought to, need to, supposed to and (have) got to.

Previous historical corpus analyses relating to the changes in modal and semi-modal relationships in English have revealed a growing tendency toward the wider use of semi-modals (see e.g. Biber et al. 1998). But are these tendencies and changing relationships mirrored in the development of SAE modality? After this question has been investigated by analysing modality within the SAE corpus collections, a comparison will be drawn between our findings and work on the development of British English during the same period. This will reveal whether SAE has developed along similar lines than the variety from which it descends, or has developed in a different direction.

References:

Biber, D., S. Conrad and R. Reppen. (1998). Corpus Linguistics. Investigating Language

Structure and Use. Cambridge: Cambridge University Press.

Bowerman, S. (2004). “White South African English: morphology and syntax”. In B.

Kortmann, K. Burridge, R. Mesthrie, E. W. Schneider and C. Upton (eds). A

Handbook of Varieties of English, Volume 2: Morphology and Syntax Berlin/New

York: Mouton de Gruyter, 948-961

Mesthrie, R. and P. West. (1995). “Towards a grammar of Proto South African English”. In

English World-Wide, 16(1), 105-133.

Corpus linguistics and literary translation: An impossible liaison?

Guadalupe Ruiz Yepes

Many researchers in the field of literary translation are reluctant to include corpus methodologies in their work. They argue that computers are of no use when it comes to exploring literature, and that a ‘top-down’ analysis, starting with the study of the sociocultural context down to the rhetorical figures, is the most appropriate approach considering the complexity of the task.

With this paper, I would like to show, however, that some software, such as WordSmith Tools, can be very useful for comparing a specific aspect of two translations of the same literary masterpiece, for instance the translation of proper names. The use of frequency lists facilitates the search for proper names and shows us how frequently certain translation decisions have been taken by the translator. On the other hand, the concordance tool shows us the co-ocurrence of other words with proper names and gives us the opportunity of studying its co-text. Therefore, we can say that corpus software is an extraordinary aid in trying to find out which translation strategy has been used predominantly by the translator. Is the translator trying to domesticate the original masterpiece by adapting the proper names to the target culture? Or is he/she rather trying to keep the proper names in their original form to provide his/her translation with the exotic taste that the world of the original means to the target reader? Are these decisions motivated by certain ideological tendencies, cultural movements or political pressures?

What I am suggesting is that carrying out a ‘bottom-up’ analysis is a very valid approach to the analysis of literary texts and their translations, and that the way to achieve the best results in this case is by combining corpus methodologies with other more traditional approaches, such as discourse analysis, contrastive linguistics and sociolinguistics.

References:

Berger, K. and C. Nord. (1999). Das Neue Testament und frühchristliche Schriften. Übersetzung und Kommentar. Frankfurt a. M.: Insel Verlag.

Bowker, L. And J. Pearson. (2002). Working with Specialized Language. A practical guide to using corpora. London and New York: Routledge.

Cervantes Saavedra, M. de. (1998). Don Quijote de la Mancha (ed. del Instituto Cervantes; dirigida por Francisco Rico; con la colaboración de Joaquín Forradellas; estudio peeliminar de Fernado Lázaro Carreter). Barcelona: Instituto Cervantes/Crítica.

Cervantes Saavedra, M. de. (1961). Leben und Taten des scharfsinnigen edlen Don Quixote von la Mancha (mit Zeichnungen von Gerhart Kraaz; in der Übertragung von Ludwig Tieck; Geleitwort von Heinrich Heine) (trad. de Ludwig Tieck; prólogo de Heinrich Heine). Hamburgo: Rütten und Loening.

Cervantes Saavedra, M. de. (1964). Don Quijote de la Mancha (Herausgegeben und neu übersetzt von Anton Maria Rothbauer) (ed. y trad. de Anton Maria Rothbauer). Stuttgart: Henry Govert.

Cervantes Saavedra, M. de. (1827). Histoire de don Quichotte de la Manche (traduite de l’espagnol par Filleau de Saint-Martin ; précedée d’une notice historique sur la vie et les ouvrages de Cervantes, par M. Pr. Merimée ; tome premier [-sixième]). Paris : Imprimerie d’Auguste Barthelemy.

Nord, C. (2003). “Proper Names in Translations for Children: Alice in Wonderland as a Case in Point”. In Meta 182-196, vol. 48, 1-2.

Strosetzki, C. (2005-2006). “Alemania”. In C. Alvar, A. A. Ezquerra and F. Sevilla Arroyo (eds). Gran Enciclopedia Cervantina, vol. I. Madrid: Castalia, 304-322.

Venuti, L. (1997). The Translator´s Invisibility. Londres: Routledge.

Venuti, L. (1998). The Scandals of Translation. Towards an ethics of difference. Londres: Routledge.

Blurring dichotomies: Why lexical semantics needs corpus analysis

Irene Russo

Corpus analysis offers many advantages for theoretical linguistics: a wide range of data can be fast and easily accessed; focus on phenomena, based on frequency, can be selective; inter- accessibility of data is guaranteed.

However, some caveats should be considered: it is extremely hard to find corpora that are representative and really balanced. The naive assumption that overwhelmingly large amount of text will work better is unproven. Peripheral areas of ranked frequencies are problematic: how to deal with rare phenomena? Are they marginal cases or the corpus is not representative enough? Throughout two case studies of adjectival semantics, advantages and shortcomings of corpus analysis for lexical semantics will be illustrated.

First and foremost, some important notions in semantics - e.g. semantic schema, lexical pattern, collocation - has been refined thanks to corpus analysis. Today these kind of abstractions are not just an empty formal shell: lexical units has been recognised as highly variable since decades, but only corpus semantics can specify probabilistic relations between components (Stubbs 2001, Hoey 2005). This methodology corresponds to a clear epistemological attitude: previous hypothesis are verified through data analysis.

Radical opinions in cognitive linguistics about vagueness are effectively contrasted; prototypical representations and fuzzy semantic classifications are not just generated by linguist’s intuitions – as in Givón (1993) - but are the outcomes of inductive processes from data to abstract generalisation.

Corpus semantics can change past well-ordered classifications blurring dichotomies and offers kind of data analysis that can be reiterated to build up sound theoretical statements.

To perform more refined semantic analysis through statistical tools, operationalizations of semantic and/ or pragmatic variables are necessary. They are quantitative measures of qualitative phenomena that rely on information directly extracted from the corpus (Wulff 2003). But sometimes these “translations” are not straightforward: the defining criteria of a linguistic phenomenon can be operazionalized in multiple ways.

More problematic is the issue of corpus representativeness. If our knowledge of a language is not only a knowledge of individual words but of their predictable combinations, corpus analysis creates a window on mental lexicon - frequency is interpreted as typicality -. However, because of the nature of available corpora, claims should be restricted to specific genres. For example, it has been shown the impact of corpus selection on the performance of measures of semantic association respect to human behaviour (Lindsey 2007). Our exposure to speech and text may, to some degree, be considered open to many sources and corpora cannot be perfect mirrors of our linguistic competence.

References:

Givón, T. (1993). English Grammar: a function-based introduction. Amsterdam: John

Benjamins

Hoey, M. (2005), Lexical priming. A new theory of words and language. London: Routledge.

Lindsey, R. et al. (2007). “Be Wary of What Your Computer Reads: The Effects of Corpus

Selection on Measuring Semantic Relatedness”. In Proceedings of the 8th

International Conference on Cognitive Modeling.

Stubbs, M. (2001). Words and Phrases. Corpus studies of lexical semantics. Oxford:

Blackwell.

Wulff, S. (2003). “A multifactorial corpus analysis of adjective order in English”.

International Journal of Corpus Linguistics, 8 (2), 245-282.

Distinctive-collexeme analysis of an Italian epistemic construction

Irene Russo and Francesca Strik Lievers

Epistemicity is traditionally recognized as the linguistic coding of the relationship between the speaker and his enunciation. Speaker’s commitment toward a proposition may be codified by specific grammatical means or by lexical choices, but also by specific constructions correlated to distinctive lexical items. Therefore, modality can be fruitfully analysed by constructional theories of grammar (e.g. Goldberg 1995). We propose a corpus based analysis (La Repubblica corpus, 380M tokens, newspaper text) of Italian constructions with sembrare, parere, apparire (all of them roughly meaning ‘to seem’), followed by an adjective, and with a subject clause. In particular, we are interested in a specific alternation, infinitive (1) vs. clausal (che-clause) subject (2):

(1) Sembrava improbabile convincere tutti gli italiani d’un colpo ad adeguarsi.

‘It seemed unlikely to persuade all of the Italians to adapt themselves suddenly’.

(2) Sembra improbabile che i Cobas possano decidere di sospendere lo sciopero.

‘It seems unlikely that the Cobas may decide to call of the strike’.

Some adjectives occur in both of these constructions, which may lead us to assume that the two constructions are similar. However, analysing the data with the distinctive-collexeme analysis (Gries & Stefanowitsch 2004), it appears that the two constructions are clearly differentiated: each construction may be characterised by the collocational associations with epistemic and evidential adjectives or with evaluative adjectives.

Sembrare/parere/apparire+ADJ+infinite giusto (‘right’) 27.67, facile (‘easy’) 22.72,

opportune (‘opportune’) 14.82, lecito (‘right’) 7.34

Sembrare/parere/apparire+ADJ+che chiaro (‘clear’) 29.71, improbabile (‘unlikely’)

17.84, evidente (‘evident’) 17.63, probabile

(‘likely’) 14.88

Epistemic (e.g. probabile) and evidential (e.g. chiaro) adjectives show a stronger association with the che-clause, while the infinitive shows a strong association with evaluative (e.g. giusto) adjectives. Moreover, within the che-clause construction the presence of an experiencer shows a stronger association with evaluative adjectives. Sembrare+che exhibits the same preference in comparison with essere+che.

Clitic+ sembrare/parere/apparire+ADJ+che giusto (‘right’) 32.98, logico (‘logical’) 5.29,

assurdo (‘absurd’) 4.11, importante (‘important’) 4.06

Sembrare/parere/apparire+ADJ+che probabile (‘likely’) 10.10, improbabile (‘unlikely’)

6.03, scontato (‘expected’) 5.89, chiaro (‘clear’) 4.78

Essere+ADJ+che vero (‘true’) 184.09, probabile (‘likely’) 27.44, chiaro

(‘clear’) 17.16, possibile (‘possible’) 12.86

Sembrare+ADJ+che strano (‘strange’) 72.22, impossibile (‘impossibile’)

61.20, incredibile (‘incredibile’) 29.39, giusto (‘right’)

19.38

Even if it is possible to isolate epistemic adjectival constructions that are prototypical, different choices of lexical items cause a shift from epistemic judgments to evaluative judgments. Through corpus analysis and distinctive-collexeme analysis the epistemicity of a sentence emerges as a gradable property dynamically determined by the interaction of elements (verb, adjectives and clitics) in constructions.

References:

Gries, S. Th. (2007). Coll.analysis 3.2. A program for R for Windows 2.x.

Goldberg, A. E. (1995).. Constructions: A Construction Grammar Approach to Argument

Structure. Chicago, University of Chicago Press.

Gries, S. Th. And A. Stefanowitsch. (2004). “Extending Collostructional Analysis: A Corpus-

based Perspective on Alternations”. In International Journal of Corpus Linguistics 9, 97-129.

Using data mining to trace changing association patterns over time

Rob Sanderson, Catherine Smith and Matthew Brook O’Donnell

The computational analysis of recurring word patterns (collocations, ngrams, frames, phraseological expressions) has been widely explored (Granger and Meunier 2008 demonstrates the breadth of contemporary approaches). Patterns of association for the same vocabulary item vary according to a range of variables, including genre/text-type (Gledhill 2000), speaker/writer characteristics, language variety, time period (Miall 1992; Baker et al. 2007) and even position within a text (Mahlberg and O'Donnell 2008). This study examines the change and development of associations in 10 years of home news articles from the Guardian newspaper using techniques developed for analysis of industry scale datasets and machine learning.

The field of data mining offers a group of highly optimized techniques for the discovery and classification of patterns within large data sets. Here we make use of association rule mining (ARM; Agrawal et al 1993) to extract sets of 2 or more words commonly occurring with each of the candidate items in each time period. A key advantage of ARM over standard collocational techniques is that can identify multiple co-occurring items. For example, Mahlberg and O'Donnell (2008) found that the word CONTROVERSY (C) collocates with FACE, FRESH, OVER, GOVERNMENT and YESTERDAY. Each of these is discovered as a collocation pair (C-FACE), (C-FRESH), (C-OVER) etc. The ARM procedure would find 'item sets' including (C, FACE, FRESH), (C, FACE, OVER), (C, GOVERNMENT, OVER, FACE).

We use the Cheshire3 Information Architecture as a platform to manage, index, search and analyze the corpus of nearly 80 million words. Articles are grouped into months and the text extracted as the input for the ARM technique, repeated over each month individually. The resulting item sets are then ordered according to their temporal dispersion across the entire corpus, clustering techniques used to discover sets of itemsets with similar characteristics, and finally plotted as an aid for visualization. We also investigate the effects of different sizes of collocate window on the patterns discovered.

This methodology reveals interesting collocations with similar behavior over time, showing trends in language use, or trends in social and historical phenomena.

References:

Agrawal R, T. Imielinski and A. Swami. (1993). “Mining Association Rules between Sets of Items in

Large Databases”. SIGMOD. June 1993, 22(2), 207-16

Baker, P., T. McEnery, and G. Gabrielatos. (2007). “Using Collocation Analysis to Reveal the

Construction of Minority Groups: The Case of Refugees, Asylum Seekers and Immigrants in

the UK Press'. Proceedings of Corpus Linguistics 2007.

Gledhill, C.J. (2000). Collocations in Science Writing. Tübingen: Gunter Narr Verlag.

Granger, S. and F. Meunier (eds). (2008). Phraseology: An interdisciplinary perspective. Amsterdam:

John Benjamins.

Mahlberg, M. and M. B. O'Donnell. (2008). “A fresh view of the structure of hard news stories”. In

The 19th European Systemic Functional Linguistics Conference and Workshop.

http://scidok.sulb.uni-saarland.de/volltexte/2008/1700/

Miall, D. S. (1992). “Estimating Changes in Collocations of Key Words Across a Large Text: A Case

Study of Coleridge's Notebooks”. Computers and the Humanities 26, 1-12.

The Corpus of Eighteenth-Century Texts: Towards the construction of the diachronic corpus of Russian

Svetlana Savchuk

The paper presents the results of a project aimed at the creation of a sub-corpus of 18^th-century Russian texts within the Russian National Corpus. The main problems concerned with text preparation and morphological annotation and prospects for Corpus development are discussed.

The corpus nowadays contains more than 2.5 million tokens in the form of about 800 full texts dating from 1700 to 1800. These texts cover main functional spheres (fiction, journalism, spheres of science and education, law, religion and everyday life) and represent a wide range of text types, for example, novels, plays, handbooks, documents, sermons, diaries, private and official correspondence, etc.). In addition to the basic selection of prose there is a supplementary part of the corpus - 18^th-century poetry as a component of the Poetic Corpus within the RNC.

The corpus of 18^th-century Russian prose is supplied with three types of annotation. Metatextual annotation includes information regarding author’s name, age, sex, text characteristics (creation date, functional sphere, text type, domain, etc.), bibliographical sources, and so on. Each word has morphological and semantic annotation.

As far as the corpus of 18^th-century Russian texts is being constructed at the base of the technology developed for the RNC, it has two special features.

1. The electronic versions of texts included in the corpus reproduce the printed or manuscript texts in the orthography of the 20^th century, i. e. for modern editions of the 18^th-century texts spelling is meet the requirements of modern orthographical rules (accepted in 1956), for pre-revolutionary editions changes in spelling are made according to the reform of orthography in 1918.

2. The morphological annotation is performed automatically by the parser developed for modern Russian texts and based on the grammar dictionary of modern Russian. Tagging and lemmatization of approximately 10 percent of text volume are checked manually.

Arguments pro and contra as well as positive results of this method are regarded in the paper. The principal benefits are as follows.

1. The Corpus of the 18^th-century Russian texts is a module in the RNC supplied with the same types of annotation and compatible with the other diachronic parts of the RNC (texts dating from 1800 to 1950) as well as with its main part of modern Russian texts. It not only allows of linguistic analysis of different types of 18^th-century texts, it has also made it possible to compare the features of a certain type of texts (for instance, academic prose or news texts) in the 18^th-century corpus and in modern language.

2. Errors of grammar tagging, especially the list of the “unknown”, unidentified word forms (belonging to words beyond the corpus grammar vocabulary), supplied with hypothetic lemmas give essentially valuable material for the analysis of the orthographic, phonetic and different kinds of grammar variation in the 18^th-century language. At the same time the study of variation has practical use for customizing tagger according to special cases typical of 18^th-century texts.

Linguistically informed and corpus informed morphological analysis of Arabic

Majdi Sawalha and Eric Atwell

Standard English PoS-taggers generally involve tag-assignment (via dictionary-lookup etc) followed by tag-disambiguation (via a context model, eg PoS-ngrams or Brill transformations). We want to PoS-tag our Arabic Corpus, but evaluation of existing PoS-taggers has highlighted shortcomings; in particular, about a quarter of all word tokens are not assigned a fully correct morphological analysis. Tag-assignment is significantly more complex for Arabic. An Arabic lemmatiser program can extract the stem or root, but this is not enough for full PoS-tagging; words should be decomposed into five parts: proclitics, prefixes, stem or root, suffixes and postclitics. The morphological analyser should then add the appropriate linguistic information to each of these parts of the word; in effect, instead of a tag for a word, we need a subtag for each part (and possibly multiple subtags if there are multiple proclitics, prefixes, suffixes and postclitics).

Four main knowledge-based approaches have been applied for computational morphology. First, syllable-based morphology (SBM), where morphological realisations are defined in terms of their syllables. Second, root-and-pattern morphology, where stems are recognised by matching the word against a list of canonical patterns of roots and affixes. Third, lexeme-based morphology (LBM), where stem is the only morphologically relevant form of lexeme. Finally, using a stem-based Arabic lexicon with grammar and lexis specifications (Soudi et al, 2007). All four methods rely on a “hand-crafted” lexical database of stems and/or roots and/or patterns. Another approach is to use Machine Learning from a morphologically-analysed corpus to build a lexical database.

Many challenges face the implementation of Arabic morphology. The rich “root-and-pattern” nonconcatenative (or nonlinear) morphology and the high complex word formation process of root and patterns, especially if one or two long vowels are part of the root letters. Moreover, the orthographic issues of Arabic such as short vowels ( َ ُ ِ ), Hamzah (ء أ إ ؤ ئ), Taa’ Marboutah ( ة ) and Ha’ ( ه ), Ya’ ( ي ) and Alif Maksorah( ى ) , Shaddah ( ّ ) or gemination, and Maddah ( آ ) or extension which is a compound letter of Hamzah and Alif ( أا ).

Our morphological analyzer uses linguistic knowledge of the language as well as corpora to verify the linguistic information. To understand the problem, we started by analyzing fifteen established Arabic language dictionaries, to build a broad-coverage lexicon which contains not only roots and single words but also multi-word expressions, idioms, collocations requiring special part-of-speech assignment, and words with special part-of-speech tags. The morphological analyzer first searches the broad lexicon for a word; if the word is found in the lexicon, the correct analysis of the word is given. The next stage of research was a detailed analysis and classification of Arabic language roots to address the “tail” of hard cases for existing morphological analyzers, and analysis of the roots, word-root combinations and the coverage of each root category of the Qur’an and the word-root information stored in our lexicon. From authoritative Arabic grammar books, we extracted and generated comprehensive lists of affixes, clitics and patterns. These lists were then cross-checked by analyzing words of three corpora: the Qur’an, the Corpus of Contemporary Arabic and Penn Arabic Treebank (as well as our Lexicon, considered as a fourth cross-check corpus). We also developed a novel algorithm that generates the correct pattern of the words, which deals with the orthographic issues of the Arabic language and other word derivation issues, such as the elimination or substitution of root letters.

Using a parser as a heuristic tool for the description of New Englishes

Gerold Schneider and Marianne Hundt

We propose the use of an automatic parser as a tool for descriptive linguistics and show its application to a selection of New Englishes. The novel approach of using an automatic parser has two advantages. First, quantitative data on very large amounts of texts are available instantly, a process which would take years of work if manual annotation were used. Second, it allows variational linguists to use a corpus-driven approach, where results emerge from the data rather than having to rely solely on observational and anecdotal cues. The disadvantage of the parser-based approach is, however, that the level of precision (false positives) and recall (missed instances) is considerably lower, due to tagging and parsing errors. We deliver detailed evaluations to assess precision and recall levels of the parser we use (Schneider 2008). The results delivered by a parser-based quantitative analysis allow the linguist to obtain an overview of interesting phenomena, allowing him or her to investigate the reported hits in detail.

Our application to New Englishes uses several parsed subparts of the International Corpus of English (ICE). We employ two methods for our corpus-driven study. On the one hand, quantitative differences in the use of established syntactic patterns are investigated: we focus on ditransitive verbs and on noun-PP modification, comparing our findings to existing regional investigations (Mukherjee 2005) and extending on existing lexicographical research (Lehmann and Schneider 2008). On the other hand, pathways into discovering regional syntactic aberrations are illustrated: we investigate aberrant collocations and the correlation between consistent tagging and parsing errors and regional syntactic innovations.

References:

Schneider, G. (2008). “Hybrid Long-Distance Functional Dependency Parsing”. Doctoral Thesis, Institute of Computational Linguistics, University of Zurich.

Lehmann, H. M. and G. Schneider (2009). “Parser-Based Analysis of Syntax-Lexis Interactions”. In Proceedings of ICAME 2008 (International Conference of Modern and Medieval Corpus Linguistics), Ascona, Switzerland.

Joybrato M. (2005). “English Ditransitive Verbs: Aspects of Theory, Description and a Usage-based Model”. In C, Mair, C. F. Meyer and N. Oostdijk (eds) Language and Computers Amsterdam/New York: Rodopi.

Tracing the sentential complements of Prevent through centuries

Elina Sellgren

The verb prevent can select three sentential complements in present-day British English: the types prevent me from going, prevent me going and prevent my going. In American English, only me from going is used alongside the archaic my going.

Christian Mair (2002) proposes that this is a case of a constellation where British and American English are diverging, with BrE the innovative language variant. The LOB, FLOB, Brown and Frown corpora show that the type me going has become equal in frequency to me from going over the 20th century, whereas the type prevent my going is rare and archaic. Prevent may be leading a grammatical change among semantically similar verbs (hinder, block, stop, etc.), which may also start dropping out from in BrE.

A study by Heyvaert et al. (2005) corroborates the finding that my going is very rare today with prevent in BrE. What is not known is what part this variant has played in the past in the competition between the more common variants. Visser (1973) claims that the decreasing use of my going is peculiar to prevent, but does not provide systematical corpus evidence. The competition has been chiefly studied synchronically. Rohdenburg’s (e.g. 1996) complexity principle has been offered as an explanation. The principle assumes that the structurally more explicit variant me from going is used more often in syntactically complex environments than the less explicit variants me/my going. The prepositional variant is also assumed to have advanced diachronically at the expense of the other variants.

In present-day AmE the prepositional variant is dominant. In BrE, the less explicit variant me going seems to defy the principle. Rohdenburg’s principle provides a partial explanation from a synchronic point of view, but its diachronic dimension has been somewhat neglected. Kirsten (1957, cited in van Ek: 1966) has examined the variation in the 18th and 19th centuries with corpora of modest sizes, but naturally not considering the complexity principle.

The author traced the evolution of the sentential complements of prevent from the 18th century to early 20th century, using the Corpus of Late Modern English (CLMET: 1710-1920) and the corpus of Early American Fiction. The quantitative analysis of CLMET shows that me from going has been consistently dominant from the 18th century. My going has followed closely behind, until it underwent a dramatic dip in frequency in the 19th century. Me going has gained foothold slowly over centuries and is no newcomer. In early AmE, the same variants were found. Me going is found occasionally, unlike in present-day AmE.

The paper will examine the interplay between the variants: Is the decline of my going related to the success of me going? Does the complexity principle apply to the diachronic data?

References:

Heyvaert, L., H. Rogiers and N. Vermeylen. (2005). “Pronominal Determiners in Gerundive

Nominalization: A “Case” Study”. In English Studies Vol. 86, February 2005, 71-88.

Mair, C. (2002). “Three changing patterns of verb complementation in English”. In English

Language and Linguistics 6, 105-132.

Rohdenburg, G. (1996). “Cognitive complexity and increased grammatical explicitness in

`Late Modern English”. English Studies 76, 367-388.

Van Ek, J. A. (1966). Four Complementary Structures of Predication in Contemporary

British English. An Inventory. Groningen: Wolters.

Visser, F. Th. (1973). An Historical Syntax of the English Language, Part Three. Leiden:

Brill.

An integrated environment for extracting and translating collocations

Violeta Seretan

To date, substantial efforts have been made to devise accurate methods for extracting collocations from corpora; much fewer works are devoted, in contrast, to the subsequent processing of results in order to enable their integration in the various Natural Language Processing applications which require them – for instance, in machine translation. Identifying collocations in the source text (e.g., break record) and providing an adequate translation in the target language (i.e., battre record rather than the literal translation casser record in French) is a key issue in machine translation. There are two main reasons why collocations are particularly problematic from this point of view: one is the prevalence of this subtype of multi-word expressions in language, and the second is their high syntactic flexibility making their treatment much more difficult than in the case of other (more rigid) types of expressions.

This article describes a method—an extension of Seretan (2008)—for identifying collocations and their translation equivalents from parallel text corpora. The method relies on syntactic information provided by a deep symbolic parser (Wehrli, 2007) and is capable of dealing with a whole gammut of grammatical transformations that collocations can undergo (for instance, passivisation, as in Records are made to be broken), as well as with structural divergences across languages (e.g., the English verb-object collocation reach compromise is translated into French as a verb-preposition-argument collocation, parvenir à compromis).

Particular attention is paid to those collocations made up of strictly more that two items (such as draw a clear distinction). Their identification in corpora is possible, on the one hand, thanks a specific extraction module which post-process the previous binary extraction results (i.e., if both clear distinction and draw distinction are identified in the same phrase, then the collocation draw clear distinction can be inferred), and, on the other hand, thanks to parser’s ability to recognise already-known collocations and to treat them as single units (i.e., once added to the lexicon, the collocation clear distinction can be recognised in the new text and treated as a complex noun, which is later identified as the direct object of draw).

The method described is used to increase the lexical coverage of a multilingual translation system (Wehrli et al., 2009). The article briefly presents this system and describes the manner in which collocations are processed during the syntactic transfer from one language to another. Experimental results are provided for four main languages—English, French, Italian and Spanish—and are accompanied by an appropriate evaluation for each of the modules described. The whole system constitutes a unique environment providing an advanced and comprehensive treatment of collocations.

References:

Seretan, V. (2008). “Collocation Extraction Based on Syntactic Parsing”. Ph.D. thesis,

University of Geneva.

Wehrli, E. (2007). “Fips, a “deep” linguistic multilingual parser”. In Proceedings of ACL

2007 Workshop on Deep Linguistic Processing. Prague, Czech Republic, 120–127.

Wehrli, E., L. Nerima and Y. Scherrer. (2009). “Deep linguistic multilingual translation and

bilingual dictionaries”. In Proceedings of the Fourth Workshop on Statistical Machine Translation Athens, Greece, 90–94.

On the distribution of English reflexives

Makoto Shimizu

English reflexives have been one of the most thoroughly discussed topics in linguistics. Various approaches have tried to explain their syntactic behaviours and semantic interpretations. To the best of my knowledge, however, there have been few researches done on the actual distribution of English reflexives. Arguments tend to be based on the intuitions of the individual researchers’ rather than the actual use.

What I would like to do in this paper is threefold: first, to demonstrate the distribution of English reflexives in computer corpora, second, to discuss combinations of grammatical functions of reflexive pronouns and their anaphors, and third, to present several examples which could be problematic to some syntactically oriented approaches, especially when English reflexives are translated into Japanese and Japanese reflexives into English.

The data I use are the LOB Corpus, the Brown Corpus, and some bilingual parallel texts which I have collected. I extract English reflexive pronouns and their Japanese counterparts from the data, analyse them, and categorise them according to the grammatical functions and positions in text of the reflexive pronouns and their antecedents.

The grammatical function of the reflexive pronouns discussed in previous studies has almost always been the direct object of transitive verbs. The percentage of direct objects in LOB is approximately 50% while that of appositives is 17% and that of adverbials is 22%. Four other grammatical functions are recognised. As for the positions of reflexives in text, approximately 81% occur within the same clauses where their antecedents occur while approximately 19% occur in complement clauses.

The combinations of grammatical functions of reflexive pronouns and their anaphors mainly discussed in this paper are direct object and subject, appositive and adverbial, direct object and second person reflexives.

The problematic examples discussed in this paper are the reflexives which have antecedents in the determiner position, in the subject position in the main clauses, in separate sentences, in speech situations, and in previous contexts.

Much ado about much, a lot and a bit:

Quantifiers and colloquialization in 20th-century English

Nicholas Smith and Amy Yun-Ting Wang

The last decade has seen a surge of interest in recent change in standard English (especially at the levels of lexicon and grammar), and corpus methods have featured prominently among the approaches taken. One of the most popular hypotheses put forward to explain recent changes in standard written English is that of colloquialization; see e.g. Hundt and Mair (1999), Smith (2003), Hinrichs and Szmrecsanyi (2007). Colloquialization is essentially a process of social-stylistic change: the forms of everyday conversational speech are increasingly being used in printed texts, as the norms of those genres become less formal and conservative.

Our paper seeks to test the colloquialization hypothesis with respect to quantifiers in English (e.g. much, many, a lot of, lots of, several). Apart from increasing use of amount with count nouns, there has been little discussion of recent change in quantifier usage. In the descriptive literature, certain forms such as a lot (of) and lots (of) have been explicitly identified as colloquial or informal (see e.g. Quirk et al. 1985: 262-4). If the colloquialization hypothesis is supported, we would expect these forms to have expanded in use in recent times. Conversely, uses such as much with mass nouns, in non-assertive contexts (e.g. much consideration has been given to these proposals) – which are associated with higher formality – should have declined. Our paper addresses both the overall frequency of the quantifiers, and their different semantic and pragmatic functions. At both levels of analysis we find interesting variation in development. For example, in British English a lot of, an expression associated with informal use in Quirk et al. (1985: 264), increases dramatically across the mid- to late 20th century. On the other hand the adverbial and pronominal uses of a lot seem to reach a peak in the 1960s. We have also found growth in the frequency of a bit, as a result of increased use of its pragmatic functions, e.g. mitigation.

As in numerous other studies of short-term change in standard English, we will use the Brown family of corpora as primary data. These matching one-millionword corpora of British and American English now span most of the 20th century. However, we will also make use of two reference corpora (BNC and ICE-GB) that contain both spoken and written texts, in order to gauge the degree of colloquiality of each quantifier, and its functions. This step highlights, for example, that plenty is not as colloquial as sometimes suggested in the literature (cf. Quirk et al. 1985: 264). Lots of is more common in speech than writing, but it is more prevalent in formal and planned speech (as in the context-governed part of the BNC) than casual conversation. Thus, in addition to its findings on diachronic change in the quantifiers, our paper provides an important methodological contribution to discussion on colloquialization and recent language change.

References:

Hinrichs, L. and B. Szmrecsanyi. (2007). "Recent changes in the function and frequency of

standard English genitive constructions: a multivariate analysis of tagged corpora". In

English Language and Linguistics 11(3), 437–474.

Hundt, M. and C. Mair. (1999). “‘Agile’ and ‘uptight’ genres: the corpus-based approach to

language change in progress. In International Journal of Corpus Linguistics 4(2). 221-

242.

Quirk, R., S. Greenbaum, G. Leech, and J. Svartvik. (1985). A Comprehensive Grammar of

the English Language. London and New York: Longman.

Smith, N. (2003). “Changes in the modals and semi–modals of strong obligation and

epistemic necessity in recent British English”. In R. Facchinetti, M. Krug and F.

Palmer (eds). Modality in Contemporary English. Berlin/New York: Mouton de Gruyter, 241-267.

Comparing forensic linguistic analysis against SVM classification

for authorship attribution of newspaper opinion columns

Rui Sousa Silva, Luís Sarmento, Belinda Maia, Eugénio Oliveira and Tim Grant

Authorship attribution is an important analysis task both in plagiarism detection and in information filtering. In particular, it is decisive to determine whether two given texts (or text snippets), or a text section within the same text, were produced by the same author or by different authors.

Authorship attribution can thus be related to a text classification procedure. In order to attribute the authorship of a text (i.e. to classify the texts according to their author) from a limited set of possible authors, two different approaches can be adopted. Traditional text classification procedures, on the one hand, focus on extracting a wide variety of relatively simple features from text. For example, features can be obtained using bag-of-word approaches, with more or less sophisticated weighting schemes and filtering procedure (Joachims, 1998). Forensic linguistic approaches, on the other have, shown that there are several stylistic markers that help determine the authorship of a text (Coulthard & Johnson, 2007; McMenamin, 2002), independently of the topic of those texts (Hänlein, 1998). One of these markers is lexical richness. As different authors make a different, individual use of the same language, they show particular preferences for a range of lexical features (Coulthard & Johnson, 2007; Honoré, 1979; McMenamin, 2002) that differentiate them from other authors.

In this paper we compare the efficiency of these two approaches to authorship attribution, using a corpus of texts (opinion articles) published in a Portuguese quality daily newspaper, composed of 250 articles written by 23 authors over a period of 10 weeks (between November 2008 and January 2009), covering a wide set of topics. We use this corpus to perform a blind authorship attribution exercise using, on the one hand, a state-of-the-art text classification algorithm – Support Vector Machines (Joachims, 1998) – based on relatively simple features such as word and bigram frequencies, and, on the other hand, a calculation of the lexical richness (see Coulthard & Johnson, 2007) of the reference corpus texts and the queried texts.

As this research will show, while traditional text-classification algorithms based on simple features are able to provide relatively good performances with straightforward technical implementations, they are not very efficient in isolating biases that the topic of the text has on author attributions. Forensic linguistics approaches, on the other hand, rely on other important features of the text, e.g. lexical richness, which are important validity and reliability features in authorship attribution, and therefore less prone to topic bias. We conclude by showing that the combination of both approaches not only proves useful, but desirable.

References:

Coulthard, M. and A. Johnson. (2007). An Introduction to Forensic Linguistics: Language in

Evidence. London and New York: Routledge.

Hänlein, H. (1998). Studies in Authorship Recognition - A Corpus-based Approach.

Frankfurt: Peter Lang.

Honoré, A. (1979). “Some simple measures of richness of vocabulary”. In Association for

Literary and Linguistic Computing Bulletin, 7(2), 172-177.

Joachims, T. (1998). “Text categorization with support vector machines: learning with many

relevant features”. Paper presented at the ECML-98, 10th European Conference on

Machine Learning, Heidelberg.

McMenamin, G. R. (2002). Forensic Linguistics: Advances in Forensic Stylistics. Boca Raton

and New York: CRC Press.

A corpus of ambiguous biomedical abbreviations

Mark Stevenson, Abdulaziz Al Amri, Yikun Guo and Robert Gaziauskas

Abbreviations occur in many types of language and their use is particularly prevalent in technical domains such as biomedicine. Many abbreviations are ambiguous in the sense that they have more than one possible expansion. For example, ``BSA” is an abbreviation of both “Body Surface Area” and “Bovine Serum Albumin”. Identifying the correct expansion of ambiguous abbreviations is often important for correct understanding. In fact, misinterpretation of abbreviations has been shown to lead to medical practitioners making fatal errors (Fred and Cheng, 1999).

Automatic language processing can assist in the disambiguation of ambiguous abbreviations using a process known as Word Sense Disambiguation (WSD). The most successful WSD systems are provided with a corpus containing examples of ambiguous abbreviations annotated with the relevant expansion. However, the annotation of such corpora is a difficult, time-consuming task for which it is difficult to generate annotations with a high enough level of agreement between annotators to be considered reliable. It is, however, possible to automatically create a corpus of abbreviations by making use of the fact that abbreviations are often introduced in text together with their expansion, for example “BSA (bovine serum albumin)”. A corpus can be created by replacing the abbreviation and expansion with the abbreviation alone and using the expansion to generate the annotation.

We describe the creation of a corpus of ambiguous abbreviations found in biomedical abstracts using this process. A set of 21 abbreviations used by previous researchers (Liu et. al., 2001, 2004) were chosen and examples searched for in PubMed, a database containing more than 18 million abstracts from publications in the life sciences. For each abbreviation we searched for documents containing the abbreviation immediately followed by one of its possible expansions in parentheses (or vice versa). These were then automatically processed to remove the expansion and provide the necessary annotation. For convenience documents were transformed into a similar format to the one used in the NLM-WSD corpus (Weeber et. al. 2001). Our corpus contains 55,655 documents. There was a wide variation between the numbers of abstracts found for each abbreviation ranging from 14,871 (CSF) to just 71 (ASP).

To allow comparison of WSD algorithms we wanted to make our corpus freely available. Although the texts on PubMed are available on the internet they can not be freely distributed without permission of the large number of copyright holders. Consequently we distributed a tool that allowed the corpus to be reconstructed using a small program that automatically downloads the relevant documents from PubMed and carries out the necessary reformatting. This tool has been distributed under GNU General Public License and is available from http://nlp.shef.ac.uk/BioWSD/downloads/abbreviationdata/

References:

Fred, H. and T. Cheng. (1999). "Acronymesis: the exploding misuse of acronyms". In Texas Heart Institute Journal, 30, 255-257.

Liu, H., Y. Lussier and C. Friedman. (2001). "Disambiguating ambiguous biomedical terms in biomedical narrative text: An unsupervised method". In Journal of Biomedical Informatics, 34, 249-261.

Liu, H., V. Teller and C. Friedman. (2004). "A Multi-aspect Comparison Study of Supervised Word Sense Disambiguation". In Journal of the American Medical Informatics Association, 11(4), 320-331.

Weeber, M., J. Mork and A. Aronson. (2001). "Developing a Test Collection for Biomedical Word Sense Disambiguation". In Proceedings of AMAI Symposium. Washington DC, 746-50.

Colligational patterns in a corpus and their lexicographic documentation

Petra Storjohann

The purpose of this paper is to show how corpora and tools can be used to analyse significant colligational patterns, i.e. lexico-grammatical combinations (cf. Sinclair 1996, Hoey 2005), lexicographically. In German, patterns such as das nötige Wissen vermitteln (to impart necessary knowledge to sbd), sein Wissen unter Beweis stellen (to prove one’s own knowledge to sbd), das erworbene Wissen in die Praxis umsetzen (to put acquired knowledge into practice) play a vital role when learning the language, as they exhibit relevant idiomatic usage and lexical and syntactic rules of combination. Often, they form more complex syntagms where each item has specific semantic and grammatical functions and particular preferences with respect to position and distribution. An analysis of adjectives, for example, shows whether there are preferences as to adverbial, attributive, or predicative functions. The investigation of colligational behaviour of nouns, on the other hand, demonstrates typical distribution of subject or complement functions.

Traditionally, corpus analyses of syntagmatic constructions have not been conducted for lexicographic purposes, although corpora and their tools offer quick and systematic access to colligational structures. The aim of this paper is to show how to utilize corpora to extract and examine typical syntagms and how results of such an analysis are documented lexicographically in one specific dictionary for contemporary German, in elexiko, a large-scale Internet reference work. In this online dictionary, information on significant combinational patterns is based on a model which accounts for the lexical and grammatical interplay between units in a syntagm. Its presentation of colligational structures underlies a systematic classification which groups similar patterns together and illustrates prototypical combinations extensively. As individual patterns contain authentic corpus material and complementary prose style usage notes, the result is a useful guide to text production or reception. The classification principles of colligational patterns, their corpus-driven retrieval, and their recently integrated presentation in elexiko will be elucidated with concrete dictionary examples. This is followed by a comparison with other existing dictionaries.

References:

Sinclair, J. (1996). “The search for units of meaning”. Textus 9, 75-106.

Hoey, M. (2005). Lexical Priming: A New Theory of Words and Language. London:

Routledge.

A corpus-based constructionist approach to the concept of ‘fear’ in English

Ariadna Strugielska

The domain of emotions has always been of interest to linguists. With the emergence of Cognitive Linguistics and the advent of Conceptual Metaphor Theory, figurative language was rediscovered as an intriguing field of study, and special attention was paid to non-expert conceptions of emotions (e.g. Kövecses 1986; Wierzbicka 1990). Early research into the semantics of emotions was based on an introspective analysis of data and it was not until recently that lexically based studies were re-evaluated (e.g. Kövecses 2000; Glynn 2002). The new methodology disregards the explicatory value of content words and proposes that more emphasis be placed on schematic structures (cf.: Goatly 1997; Cameron 2003; Steen 2002). Moreover, there is a concurrently observable tendency towards employing naturally occurring data rather than prefabricated sentences. Consequently, corpus-based studies of emotion language are now available (e.g. Deignan 2005, 2008; Stefanowitsch 2006). However, while these analyses verify Conceptual Metaphor Theory against a new set of data, little attention is paid to a systematic study of what grammatical structure contributes to the semantics of abstract concepts.

The current paper sets out to demonstrate that adopting conceptual metaphors to naturally occurring data cannot be validated if we assume that emotions have a semantic structure partly arrived at via constructions they occur in. Thus, through establishing a constructional profile of fear (cf.: Kemmer and Hilpert 2005), we are going to show that corpus data do not corroborate the metaphorical mappings previously proposed. Through analyzing nominal and prepositional phrases with fear, we are going to show that the semantic content of the concept is in fact different from what is evidenced by metaphorical pairings. Moreover, sentence-level constructions with fear will place it in paradigmatic grids shared by items not even remotely related to the domain of emotion. Simultaneously, then, the question of synonymy within the chosen lexical field will be addressed. Hopefully, then, it will be revealed that a corpus-based constructionist approach to the concept of fear sheds new light on the internal structure of emotions.

References:

Deignan, A. (2008). “Corpus linguistics and conceptual metaphor theory”. In M. S. Zanotto,

L. Cameron and M. C. Cavalcanti (eds). Confronting Metaphor in Use. An applied

linguistic approach. Amsterdam, Philadelphia: John Benjamins Publishing Company.

Glynn, D. (2002). “Love and Anger: the grammatical structure of conceptual metaphors”. In

Style 36, 541–559.

Goatly, A. (1997). The Language of Metaphors. London, New York: Routlege.

Kemmer, S. and M. Hilpert. (2005). “Constructional grammaticalization in the English make-

causative”. Presented at ICHL in Madison, Wisc. August 2005.

Kövecses, Z. (1986). Metaphors of Anger, Pride, and Love: A lexical approach to the

structure of concepts. Amsterdam: John Benjamins.

Kövecses, Z. (2000). Metaphor and Emotion: Language, Culture, and Body in Human

Feeling. Cambridge: Cambridge University Press.

Steen, G. (2002). “Metaphor identification: a cognitive approach”. In Style 36, 386–407.

Stefanowitsch, A. (2006). “Words and their metaphors”. In A. Stefanowitsch and S. Th. Gries

(eds). Corpus-Based Approaches to Metaphor and Metonymy. Berlin, New York:

Mounton de Gruyter.

Wierzbicka, A. (1990). “The semantics of emotions: Fear and its relatives in English”. In

Australian Journal of Linguistics, 10(2), 359–375.

Learning English bare singulars: Corpus approaches for the L2 classroom

Laurel Stvan

This project reports on the distribution of English singular count nouns (such as school, camp, town and prison) that may sometimes occur without determiners. Contrasting these with the more typical bare forms of mass and proper nouns, bare singular count nouns are shown to comprise a problematic set for language learners. Although article usage is one of the trickiest areas of ESL to master (Master 1997, Ionin et al. 2008), bare singulars are only recently surfacing as a focus for the language classroom (Byrd 2005), since reference grammars have usually suggested that these forms do not occur in English. The ramifications for learners of English are part of an ongoing examination of bare singulars, a noun phrase construction that, although low in frequency, is used to convey highly marked meanings (Stvan 2007). This paper addresses how corpus data can be integrated into a lesson on the syntax and pragmatics of such bare singular count nouns for the ESL classroom.

As a snapshot of the current use of the nouns, a dataset of the first 100 uses of 15 bare singular count nouns was culled from the online Corpus of Contemporary American English (Davies 2008). This paper reports on methodological as well as pedagogical aspects of the findings. For the methodology of accessing and concordancing a relevant corpus, several issues arise: 1) To find singular nouns in bare uses, access to a tagged corpus is crucial. 2) Since the presence or absence of articles is key, articles cannot be included on a stop list when indexing or concordancing. 3) Using these pieces, a baseline is presented in which the frequency of the occurrence of zero articles with count nouns in the corpus is contrasted with the definite and indefinite article uses of these same nouns. From this basis, a pedagogical application is outlined that is both controlled in scope while incorporating active participation and critical thinking by the students. After reviewing English article use, advanced L2 learners examine sentences with bare forms from naturally occurring L1 data. By assisting them to discern which bare forms are used and to what effect, the task encourages learners to hypothesize about the subtypes of noun phrases in the dataset, helping them to develop more fine-tuned rules for using bare forms in English and a richer understanding of the English article system. The material outcome to be focused on is increased student awareness of covers both the types semantics of the nouns and the contexts in which the bare forms are used, By seeing a range of forms distilled from naturally occurring uses, students are aided in discerning which nouns can be used in bare forms and to what effect. This task encourages advanced L2 learners to hypothesize about the categories of subtypes of noun phrases in the dataset, helping them to develop more fine-tuned rules for using bare forms in English and a richer understanding of the English article system. providing a more structured basis for students to try out their own production of bare and articulated English noun forms.

References:

Byrd, P. (2005). “Nouns without Articles: Focusing Instructions for ESL/EFL Learners within the Context of

Authentic Discourse”. In J. Frodesen and C. Holten (eds). The Power of Context in Language Teaching

and Learning. Boston: Thomptson/Heinle, 13-25

Davies, M. (2008). The Corpus of Contemporary American English (COCA): 385 million words, 1990-present.

Available online at http://www.americancorpus.org

Ionin, T., M. L. Zubizarreta and S. Bautista Maldonado. (2008). “Sources of linguistic knowledge in the second

language acquisition of English articles”. In Lingua 118, 554–576.

Master, P. (1997). “The English Article System: Acquisition, Function, and Pedagogy”. System 25, 215-232.

Stvan, L. S. (2007). "The Functional Range of Bare Singular Count Nouns in English". In E. Stark, E. Leiss and

W. Abraham (eds). Nominal Determination: Typology, Context Constraints, and Historical Emergence. Amsterdam: John Benjamins, 171-187.

Issues in anonymising a corpus of text messages: Whether, why, who, what and how?

Caroline Tagg

“cant believe you forgot my surname, Mr NAME242. Ill give u a clue, its spanish and begins with m…”

As the personal data embedded into the above text message suggests, the anonymisation of linguistic data held in language corpora is not the straightforward procedure it may at first appear. Instead, it involves carefully balancing the principle of protecting participants with the practicalities of what can feasibly be anonymised without excessively altering or distorting the data. Consequently, most attempts to anonymise data can only be the result of decisions made at every turn as to what anonymisation means and how this understanding can be applied to the corpus in question with consideration of the research purpose it aims to fulfil. Furthermore, legal guidelines are neither explicit nor comprehensive and it is generally up to language researchers to interpret the legalities and ensure that they keep within the law.

This talk focuses on the anonymisation of CorTxt, a corpus of over 11,000 text messages compiled from between March 2004 and May 2007 in the UK, in order to explore spelling variation, grammatical features, phraseologies and everyday creativity in text messaging. In the talk, I explore whether and why CorTxt should be anonymised, from whom participants must be protected, what needs to be replaced or removed in the corpus, and how to both identify these items and anonymise them. With as yet no ‘standard’ by which linguists can anonymise data, my answers to the above questions draw on Rock (2001), perhaps the only attempt to investigate the possibility of formulating standard policies which can be adapted to different linguistic research projects. In adapting these policies to the anonymisation of CorTxt, my account includes challenges thrown up by the specific nature of text message data and should prove useful to researchers exploring textese and online communication as well as spoken data.

References:

Rock, F. (2001). "Policy and Practice in the Anonymisation of Linguistic Data". In

International Journal of Corpus Linguistics 6(1), 1-26.

RHECITAS: Citation analysis of French humanities articles

Ludovic Tanguy, Fanny Lalleman, Claire François, Phillipe Muller and Patrick Séguéla

The RHECITAS project aims at the content analysis of citations in French humanities articles using natural language processing techniques. It is based on a growing corpus of online articles, currently consisting of 275 articles across 6 journals, totalling 2 million words, 7000 references and 6000 identified citation contexts. The project is funded by TGE-ADONIS (CNRS, French National Research Center). Its primary aim is to provide reading aids for online humanities journals, but it also contributes to the contrastive study of citation behaviour across disciplines. Although citation analysis is now a well-established field, our task is innovative for several reasons:

● no wide-coverage citation analysis study has ever been done on French;

● humanities publications are absent (or extremely rare) from citation indexes: ready-to-use data is therefore unavailable;

● studies on citations only marginally deal with humanities, focusing on experimental sciences. As a result, proposed typologies of citation functions are inadequate.

We defined two main objectives. First, the distinction between "important" citations versus background or perfunctory ones. This distinction, although less ambitious than existing typologies, is more appropriate to humanities articles and to a multi-disciplinary corpus. Second, the extraction of specific information about what the author "borrowed" from a reference. Our approach is based on specific patterns and cue-phrases, specifically developed for this application, and fully automated using NLP techniques. In this regard, we follow the work of Teufel et al (2006) on the identification of citation function.

For the first objective, we designed an "integration scale" for citations, measuring the embedding of the citation marker in the author's discourse (Swales 1990). Features used for this task are the syntactic function of the citation, its relative position in the article, cooccurrence with first person pronouns, etc. The first results allow us to automatically provide a short list of the most important references for a given article.

The second objective is to provide the reader with more specific information for each reference selected by the previous method. We presently focus on the extraction of key terms or concepts associated with a reference. We based our approach on specific patterns, such as quotation marks and definitions of terms (eg "X, which Y calls Z").

Technically, Rhecitas consists of a fully operational tool that performs the automatic annotation (in XML format) and the subsequent analysis. A module is responsible for harvesting online articles and extracting references and citation contexts. The GATE platform is used for analysing citation contexts, using the French syntactic parser Cordial Analyseur, and a set of local grammars and lexical resources for the extraction of the linguistic features. An additional result is that these resources and corpus are available for linguistic studies.

Beyond the two main objectives described above, this project opens the way to the linguistic investigation of citation behaviour. Further development can lead to better information retrieval for online humanities publications, and more insightful analysis of a research field, as opposed to the kind of quantitative analysis developed in bibliometrics.

References:

Bornmann, L. and H. D. Daniel. (2008). "What do citation counts measure? A review of

studies on citing behavior". In Journal of Documentation (64:1), 45-80.

Moed, H. F. (2005). Citation analysis in research evaluation. Springer.

Swales, J. (1990). English in academic and research settings, Cambridge University Press.

Teufel, S., A. Siddharthan and D. Tidhar. (2006). "Automatic classification of citation

function". In Proceedings of EMNLP 6, 2006, 103-110.

Register differentiation in East African English: A multidimensional study

Lize Terblanche, Bentus van Rooy, Ronel Rossouw, Christoph Haase and Josef Schmied

Seemingly contradictory statements are often made in the literature about African and other New Varieties of English, to the effect that it is sometimes more formal and at other times less formal than native varieties (cf. Schmied 1991: 37 and 51). A closer examination of these claims shows that the label of excessive formality is more often attributed to spoken language, while written language is regarded as too informal (cf. Van Rooy & Terblanche 2006). These suggestions point to the possibility that register differentiation is not as stark in African varieties. Terblanche (2009) finds positive evidence for this possibility from the use of nominalizations in Black South African English. Furthermore, Schmied (2008) finds differences between Kenyan and Tanzanian English in journalistic writing, which can be explained by the different sociolinguistic status in the two countries: In Kachru's concentric model of inner, outer and expanding circles, Kenya is more an outer-circle country than Tanzania, where English is less entrenched. We thus expect register differentiation to be more extensive in Kenyan English, which will explain more complex styles in general, and may be attributed to better training of newspaper journalists to produce more simplified, reader-adapted styles. Evidence for a training effect comes from a study of Shaw and Liu (1998), who tested non-native students before and after an English language proficiency course. They found extensive adjustment of the style of academic essays in the direction of native written norms after intervention. Thus, social factors such as the functions of English in a country and the educational context may contribute to the extent of register differentiation in an African variety of English.

Alongside these findings about non-native varieties, native varieties of English have seen a narrowing of the gap between written and spoken forms during the twentieth century, in part due to wider schooling and general literacy (Mair 2006). Biber (2006) finds, nevertheless, that the differences between spoken and written language remain pervasive in university language used among native speakers. A more conclusive test would be to examine a wider range of registers, which is what we propose to do in this presentation, by submitting ICE East Africa to the multidimensional approach of Biber (1988). Within ICE East Africa, it is possible to separate Kenyan and Tanzanian data, which allows for a more refined look at the influence of the national contexts on English usage. If register differentiation is a function of the diversity of contexts of use, then the data will offer relevant evidence to determine this issue. Almost twenty years have passed since the ICE-EA data were collected. Thus, to determine if any recent changes have taken place, a more recent WWW complement is added to the existing ICE corpus, offering a diachronic dimension and a tentative outlook to the study.

References:

Biber, D. (2006). University Language: A corpus-based study of spoken and written registers.

Amsterdam/Philadelphia: Benjamins.

Mair, C. (2006). Twentieth Century English. Oxford: Oxford University Press.

Schmied, J. (1991). English in Africa. An Introduction. London/New York: Longman.

Schmied, J. (2008). “Clause complexity and clause linking in African academic writing: a critical

empirical comparison”. Paper presented at IAWE 14, Hong Kong, December 2008.

Shaw, P. And E. T-K. Liu. (1998). “What develops in the development of second-language writing?” In Applied Linguistics, 19, 225-254.

Terblanche, L. (2009). “A comparative study of nominalisation in L1 and L2 writing and speech”. In Southern African Linguistics and Applied Language Studies, 27(1).

Van Rooy, B. and L. Terblanche. (2006). A corpus-based analysis of involved aspects of student writing.

In Language Matters, 37, 160-182.

Populism and democracy: The construction of meaning in the discourse

Wolfgang Teubert

Populist politicians cater to the interests of the people who have elected them. Democratic politicians wouldn’t do such a thing. They do what is right for their country. We tend to find populist leaders in countries we don’t like, such as Venezuela or Iran, and we find democratic leaders in countries on our side, like the Ukraine or Georgia. Democracy is a banner word, populism a stigma word. Our political reality is a reality constructed by our public discourse, by the media. The reality constructed in our discourse is indeed the only reality to which we have access and which we can share. The task of linguists, as I see it, is not to look at language as a mechanism outside of our control, but to present the evidence of the discourse to us, the discourse participants. By analysing collocation profiles, paraphrases, chains of arguments and intertextuality, linguists can highlight ruptures, dissonances, inconsistencies and prosodies that will inform our interpretation. Any such interpretation is of course arbitrary and provisional. But any interpretation becomes part of the existing discourse. It is thus an opportunity for us to change the reality the discourse has constructed for us.

Tackling data insufficiency: Automatically extending a (richly annotated) data set

Daphne Theijssen, Nelleke Oostdijk, Hans van Halteren and Lou Boves

Much effort has been – and continues to be – put into developing corpora to provide linguists with suitable data in sufficient quantities to perform their research. Still, for many types of research this remains an issue: even when numerous corpora are available, most of these are too small and/or have not been annotated with the required information. This paper addresses the problem of data insufficiency and presents an approach to automatically extend a data set.

In our project, we study situations where speakers can choose between several syntactic options that are equally grammatical, but where usually one variant is most acceptable given the context. The current focus is on the dative alternation in English, where speakers and writers can choose between a double object (She handed the student the book) or a prepositional dative construction (She handed the book to the student).

In the past, researchers have tried to explain the speakers’ choices in different ways, e.g. by taking a syntactic (e.g. Quirk et al 1972), semantic (e.g. Gries and Stefanowitsch 2004) or discourse approach (e.g. Collins 1995). The aim of the current project is to explore the suitability of various statistical techniques to combine these approaches in a single model (e.g. Bresnan et al. 2007). For this, we need a richly annotated data set.

To be able to find the two constructions, and to incorporate the various approaches mentioned, we need a corpus with syntactic, semantic and discourse annotations. Since such a corpus does not exist, we selected a corpus meeting at least the first requirement: the one-million-word ICE-GB Corpus, containing written and spoken language in various genres. We automatically extracted and manually filtered instances of the dative alternation. The result was a set of 915 relevant instances, which we manually enriched with the information desired.

Initial experiments with the data have indicated that the set is still too small for our purposes. We therefore developed a (semi-)automatic approach to extend the our data set, using the 100-million-word British National Corpus (BNC). Since the BNC has no syntactic annotations, we made a list of ditransitive verbs and extracted all sentences containing them. These sentences were fed to the Connexor Machinese parser, yielding syntactic dependency trees. Sentences in which the parser identified a ditransitive construction were kept.

Next, we developed algorithms for automatically enriching our data with the information desired: the animacy, concreteness, definiteness, discourse givenness, pronominality, person and number of the objects (the book and him in the example), and the semantic class of the verb. The algorithms employed the part-of-speech tags available in the BNC, the dependency parses produced by the Connexor Machinese parser, and the noun classes and synonym sets found in WordNet. We evaluated both steps of the process by applying it to the existing data set of 915 manually annotated instances. The details of the method and the results found are presented at the conference.

References:

Bresnan, J., A. Cueni, T. Nikitina and R. H. Baayen. (2007). “Predicting the Dative Alternation”. In G. Bouma, I. Kraemer and J. Zwarts (eds). Cognitive Foundations of Interpretation: 69-94. Amsterdam: Royal Netherlands Academy of Science.

Collins, P. (1995). The indirect object construction in English: an informational approach. Linguistics 33, 35-49.

Gries, S. Th. and A. Stefanowitsch. (2004). “Extending Collostructional Analysis: A Corpus-based Perspective on ‘Alternations’”. In International Journal of Corpus Linguistics 9, 97-129.

Quirk, R., S. Greenbaum, G. Leech and J. Svartvik. (1972). A Grammar of Contemporary English. London: Longman.

Toward automatic error identification of learner corpora: A DP-matching approach

Yukio Tono and Hajime Mochizuki

With the increasing availability of learner corpora, there is a growing interest in automatic analysis of learner language (e.g. AALL08 & 09). One of such areas of automatization includes identification of learner errors. Since error annotation is very time-consuming and problematical in its consistency and accuracy, we tend to focus on the detailed problems and heuristics of how to identify more precisely particular types of errors such as article errors, verb morphological errors, among others. Whilst it is important to pursue for increasing accuracy in error annotation for each lexico-grammatical category, it is also necessary to grasp the over-all picture of major error patterns across different proficiency levels with reasonably economical and efficient methods.

The present paper aims to achieve this second goal, looking at the overall pictures of L2 learner errors across proficiency by using automatic analysis of learner corpora. To this end, we have developed parallel corpora (the original and the corrected versions) of the JEFLL Corpus, a 700,000-word learner corpus of approximately 10,000 Japanese EFL learners' in-class timed essays. We employed a classical, dynamic programming-based pattern matching technique, or so-called the DP-matching algorithm. It basically compares the original sentences written by learners with the reference sentences, the ones corrected by a native-speaker. With DP matching, we automatically identified the three types of potential errors; omission, addition and misformation. Then the error candidates for each error type were POS-tagged and processed for further multivariate analysis, in order to see the relationship between particular error categories based on POS and proficiency levels of learners. We will argue that this kind of automatic analysis of learner errors on proficiency-indexed learner corpora will shed light on the process of Interlanguage development and some implications for SLA and ELT research will be discussed.

A corpus-based study of invariant tags in London English

Eivind Torgerson and Costras Gabrielatos

This paper reports on the analysis of the use of a number of invariant tags in spoken London English, which formed a part of the completed project Analysis of spoken London English using corpus tools (funded by the British Academy). The tags examined were: innit, okay, right, yeah, you get me and you know, as well as three semi-fixed expressions containing you know, which functioned as tags: do you know what I mean, if you know what I mean and do you know what I’m saying.

The study used the Linguistic Innovators Corpus (LIC), a 1.4 million word corpus comprising the transcribed and marked-up interview data from the Lancaster/Queen Mary ESRC-funded project, Linguistic innovators: the English of adolescents in London (Kerswill et al. 2008), as well as the Corpus of London Teenage English (COLT) (Stenström et al. 2002). The research methodology combined approaches and techniques from sociolinguistics and corpus linguistics. Five variables were examined, four of which had two values: age (young: 16-19 / old: over 70), sex (male/female), ethnicity (Anglo/non-Anglo) and place of residence (Hackney/Havering). The fifth variable was a self-assessed measure of the multi-ethnicity of the friendship networks that speakers belonged to, with scores ranging from 1 (all friends same ethnicity as self) to 5 (60%-80% of friends different ethnicity as self). The analysis took into account the relative frequency of use of each tag, as well as the proportion of speakers in each sociolinguistic group that used each tag.

The comparison of LIC and COLT revealed an increase in yeah and, in particular, innit, and a dramatic increase in you get me, but a decrease in the relative frequencies of right and okay. The analysis of LIC showed that all the innovative tags, such as innit and you get me, were clearly a feature of young people’s speech. In addition, the most innovative tag, you get me, was by far most frequent in Hackney (inner London), and the highest frequency was observed among the non-Anglo speakers. The ethnic minority speakers, and male speakers in general, are the most innovative tag users, particularly of innit and you get me, but the ethnic minority speakers also had high frequencies of yeah, okay and right, and they were therefore the highest users of tags overall. Overall, there is a difference in tag usage between inner and outer London: the more innovative tags are more frequent in inner London and the more traditional ones in outer London. The innovative tags you get me and innit were most frequent, and were used by a larger proportion of speakers, among male, non-Anglo, Hackney residents.

The results indicate that young people, ethnic minorities, an urban environment, and dialect contact are of great importance in language change, a fact that can feed into an exploratory model of language variation and change.

Corpus-driven study of underlying syntactic structures

Aleksandar Trklja

It has become widely accepted in corpus linguistics that meaning is ‘typically dispersed over several words‘ (Stubbs 2001: 63) and that ‘people do not speak in words‘ but that they communicate through phrasemes (Mel‘ cuk 1995: 169). Since Sinclair (1991) published his influential paper on ‘open choice and idiom principle‘ a great body of work focusing on different aspects of idiomaticity has been produced. Some recent researches have shown that contemporary English texts are composed of about 55% of prefabricated multi-word units (Erman and Warren 2000). In spite of these findings there is still no consensus among the corpus linguists about what constitute a unit of meaning. Some of the units of meaning corpus linguists nowadays usually study are phrases, collocations, colligations, lexical bundles, clusters, etc.

In Liverpool I want to contribute to this ongoing discussion by presenting a study of the units of meaning composed of more than five words. The study is based on the exploration of the underlying syntactic structures. The purpose of my presentation will be to illustrate how it is possible to identify parallel language constructions by investigating the distribution of fixed words and the items with fixed positions occuring in various forms. As my data show the variations of the lexical items in a syntactic structure are limited and systematic, which makes possible to classify these items for each structure into several categories. My research indicates that the greater the degree of overlapping in the constructions the closer they are in meaning. In addition, it is also possible to put the constructions with the same types of unfixed items into the same meaning groups. All these imply that the lexical items in the structures mutually influence each other and that the choices language users can make when they combine the prefabricated multi-word units with other words are not endless and that they do not only depend on register.

I will also briefly describe the study of the identified constructions across languages on the example of German and Serbian-Croatian language. With the help of parallel and monolingual corpora I will try to extend Tognini-Bonelli’s idea of ‘functionally complete units of meaning‘ (2001).

A browser-based corpus-driven error correction tool for individualized second language learning

Nai-Lung Tsao, David Wible and Anne Li-E Liu

This paper describes a corpus-driven tool designed to provide error checking for second language learners on a browser-based toolbar along with automated individualized follow up to user queries. The motivation for the tool is a common practice among Google users for whom English is a second language and who use Google searches to check for errors in their own English. We describe a number of drawbacks of this practice, primarily that matches returned for such queries do not entail that the query string is correct. Google finds, for example, about half as many hits for the erroneous in my point of view as for the correct counterpart from my point of view. Rather than try to change this user practice, however, we provide for users a novel tool that supports it.

The tool has two components: the interface, which is simply a query box on the browser toolbar, and the error detection and correction algorithm. There are no error-specific rules but one general algorithm for all errors. Our hybrid n-gram pattern extraction algorithm runs on XXX words of BNC to create the knowledgebase. The coverage is broad because the algorithm extracts not only contiguous strings of lexemes (typical n-grams) but computes associations among contiguous parts of speech, lexemes, and word forms in any sequence and all combinations of these within the same string, thus extracting hybrid n-grams such as: spend time V-ing; from POSS point of view; as ADJ as possible.

We borrow the notion of edit distance both to run the pattern matches against any string of English users submit for checking and to provide correction suggestions in the form of k-best correction candidates. The extracted target n-grams plus the weighted edit distance measure can thus determine not only that spend time V-ing is acceptable, but that spend time to V or spend time V-base is likely unacceptable counterpart to be thus corrected. Hence, a single algorithm yields corrections of the following sort among others:

Submitted string à Suggested correction

I afraid of dogs à I am afraid of dogs

pay attention on à pay attention to

let people to know à let people know

spend time to study à spend time studying

in my point of view à from my point of view

pay time studying à spend time studying

We further exploit the fact that the tool is embedded in a Web-browser toolbar to provide automated individualized follow-up for errors that the user has previously submitted for query. For a learner submitting the string in my point of view, the system will suggest the corrected form as from my point of view and in this user’s future Web browsing will detect this correct form appearing in context within the web pages that the learner browses. It will offer to highlight these and to give further example sentences with from my point of view. This turns simple queries and corrections into long-term sustained and repeated exposure to pinpoint example types that address the learner’s specific difficulties.

New corpus tools for investigating the acquisition of formulaic chunks in L2

Nai-Lung Tsao, Anne Li-E Liu and David Wible

Studies in formulaic sequences in first and second language have suggested that English learners’ uses of formulaic sequences are uncompetitive with NSs in terms of amount, accuracy, and appropriateness. A seemingly straightforward explanation is that English learners do not tend to see formulaic sequences as wholes in the first place and thus fail to retrieve and use them in production. Two premises of this paper are: (1) learner production of formulaic language is most fruitfully seen not as simply present or absent from their production (as either/or) but as also likely to include approximative uses of formulaic language. (2) Corpus linguistics can add depth and breadth to research on this relatively unexplored approximative space of learners’ less-than-perfect attempts at producing formulaic expressions. We describe a new suite of corpus tools designed to discover learners’ approximative uses of formulaic language and support research into the acquisition processes involved.

We employ BNC as the source from which standard chunks are extracted, and a learner corpus, English TLC, which consists of 5-million words of English running text produced by learners in Taiwan in our study. Chunks extracted from BNC are not limited to strings of specific lexemes or word forms as are typical n-grams; rather, we innovate a technique for extraction of hybrid n-grams, that is, chunks comprising combinations of lexemes, part-of-speech, or even specific word forms of lexemes. For instance, the frozen chunk point of view can be added with others to form a semi-frozen string like from my/his/their point of view or from John’s point of view. On the basis of our hybrid pattern extractor, these will be stored under the pattern from+POSS+point+of+view.

Extracted target chunks are then matched against the learner corpus. Searching a learner corpus for exact matches of the identified standard English chunks will yield successful or accurate learner uses of chunks. Such work is straightforward with widely available existing corpus tools. We propose in addition a way of identifying likely cases of learners’ approximative uses of (or less-than-perfect attempts to use) these standard formulas by implementing an edit distance algorithm. Both successful attempts and ill-formed formulaic language can then be retrieved. For instance, with the pattern from+POSS+point+of+view and the edit distance algorithm, we find from English TLC a frequent ill-formed chunk *in my point of view and assign a value to the distance of this learner string from the standard formulaic expression. Another common semi-frozen chunk is in the middle/beginning/end of the and the hybrid pattern for that is in+the+ NOUN+of+the+NOUN. With that and the edit distance algorithm, we successfully find from English TLC the approximative uses: ‘*in the middle of night’ where the second the is omitted and ‘*At the middle of the movie..’ where a wrong preposition is used.

We show how our tool can extract exactly such data and patterns in learner output automatically without requiring the researcher to choose target formulas a priori. That is, the algorithm feeds the discovery of unanticipated instances of learners’ successful and unsuccessful uses of unanticipated formulaic language.

Lexical priming and polysemous verbs: A critique

Fanie Tsiamita

Hoey’s theory of Lexical Priming attempts to provide a theoretical framework to explain the long-established phenomena of collocation (cf. Firth [1951]1957; Sinclair 1991; Stubbs 1996), colligation (cf. Firth [1951]1957; Sinclair 2004; Hunston 2001), and several kinds of semantic relationships (cf. Sinclair 1991, 2004; Louw 1993, Partington 2004). The framework generates a number of hypotheses which call for closer examination. One of them concerns the phenomenon of polysemy.

The claim of Lexical Priming with respect to polysemy is that the collocations, semantic associations and colligations that a polysemous word is characteristically primed for will systematically differentiate its various senses (Hoey 2005: 81).The claim is further that the different senses avoid use of each other’s collocations, colligations and semantic associations.

My paper aims to provide a critique of this claim on the basis of an analysis of a number of polysemous verbs drawn from the narrative sections of the BNC, with senses that can be classified along the cline concrete - abstract. Hoey’s exposition of his claim focused primarily on words at the abstract end of the cline and he analysed no verbs. The choice of verbs along the cline of concrete-abstract allows for examination of possible effects of concreteness and abstractness of meaning on the types of primings for each sense as well as checking the applicability of the claim to verbs. The focus however will be less on clear cases of confirmation of the claims of Lexical Priming, and more on cases which are covered by the claims of Lexical Priming only when certain contextual aspects are taken into account that were not considered in Hoey (2005). Particular attention will be given to cases on the basis of which the theory of Lexical Priming could be revisited and potentially revised (or rejected).

My data are all drawn from a corpus of about 17,000,000 words, extracted from the BNC and comprising texts from novels, as general corpora may “iron out primings associated with particular genres or domains” (Hoey 2007: 10). At the same time, the fact that fiction texts pose little restriction in terms of choice of topic considerably raises the chances of the occurrence of multiple senses of a given polysemous item, therefore making this piece of research more practicable. A further reason for choosing a corpus of novels was the expectation that the relative flexibility of the genre would be more likely to allow for unexpected primings which could not be accounted for by the theory in its present form.

References :

Firth, J. R. (1951-1957). “A synopsis of linguistic theory, 1930-1955”. In F. Palmer (ed.).

Selected Papers of JR Firth 1952-59. London: Longman, 168-205.

Hoey, M. (2005). Lexical Priming. Abingdon and New York: Routledge.

Hoey, M. (2007). “Lexical Priming and Literary Creativity”. In M. Hoey, M. Mahlberg, M. Stubbs and W. Teubert. Text, Discourse and Corpora: Theory and Analysis. London: Continuum, 7-29.

Hunston, S. (2001). “Colligation, lexis, pattern, and text”. In M. Scott and G. Thompson (eds). Patterns of Text: In Honour of Michael Hoey. Amsterdam/Philadelphia: John Benjamins, 13-33.

Louw, B. (1993). “Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies”. In M. Baker, G. Francis and E. Tognini-Bonelli (eds). Text and Technology: In Honour of John Sinclair. Amsterdam/Philadelphia: John Benjamins, 157-176.

Partington, A. (2004). “’Utterly content in each other’s company’: Semantic prosody and semantic preference”. International Journal of Corpus Linguistics, 9 (1), 131-156.

Sinclair, J. M. (1991). Corpus, Concordance, Collocation. Oxford : Oxford University Press.

Sinclair, J. M. (2004). Trust the Text: Language, Corpus and Discourse. London/New York: Routledge.

Stubbs, M. (1996). Text and Corpus Analysis. Oxford: Blackwell.

New developments in optimization of linguistic corpus research:

TextAnalyst and NetXtract

Victoria Tuzlukova

Rapid developments in computer technologies, creation of innovative software and algorithms have dramatically changed the traditional research landscape in linguistics (Baranov 2001) and its traditional methods of research. They have also changed its new active lines. Nowadays linguistics moves towards cross-disciplinary and trans-disciplinary approaches. In their turn they presuppose the integration of different computer programs (Garvin 1962) that can perform varied linguistic tasks including book-keeping, monitoring and description of real language input data, where descriptions based in a linguist’s intuition are not usually helpful. The new trends are now towards such effective computer-based research projects as solving problems related to higher classes in hierarchy of the analysis of the language. These trends can be characterized as a move from language descriptive and statistical analysis, grammatical tagging of texts and corpora to contextual analysis, simulation and analysis of language functioning in different contexts and areas of discourse.

The aim of this paper is to contribute to deeper understanding of the importance of quantitative approaches for linguistics and language studies by highlighting some computer software that could be effectively used for optimization of linguistic corpus research. This paper presents the software that was effectively used in the research of the bilingual corpus of the educational texts. It does not only explore innovative software that is intended to satisfy a successful quantitative and qualitative analysis of medium-size and big-size bilingual and multilingual textual corpora. It also describes some useful characteristics of such software as TextAnalyst and NetXtract within the focus of such functions as automatic indexing of the educational texts; search for key words/phrases; analysis of the contexts of keywords/phrases and their frequencies. The observations of the software, which is considered in regard to the enhancement and optimization of the linguistic corpus research, are exemplified by the data of the research of a bilingual corpus of educational texts.

References:

Baranov, A. N. (2001). Introduction into Applied Linguistics. Moscow.

Garvin, P. (1962). “Computer participation in Linguistic Research”. In Language, 38,

(4), 385-389

Building the PDT-VALLEX valency lexicon

Zdeňka Urešová

In our contribution, we relate the development of a richly annotated corpus and a computational valency lexicon. Our valency lexicon, called PDT-VALLEX (Hajič et al., 2003) has been created as a “byproduct” of the annotation of the Prague Dependency Treebank (PDT) but it became an important resource for further linguistic research as well as for computational processing of the Czech language.

We will present a description of the verbal part of this lexicon (more than 5300 verbs with 8200 valency frames) that has been built on the basis of the PDT corpus. Rigorous approach and the linking of each verb occurrence to the valency lexicon has made it possible to verify and refine the very notion of valency as introduced in the Functional Generative Description theory (Sgall et al., 1986; Panevová, 1974-5).

Every occurrence of a verb in the corpus contains a reference to its valency frame (i.e., to an entry in the PDT-VALLEX valency lexicon). The annotators insert the verbs (verb senses) found in the course of the annotation and their associated valency frames into the lexicon, adding an example (or more examples) of its usage (directly from the corpus). They also insert a note that refers to another verb that has one of its valency frames related to the current one (a synonym/antonym, an aspectual counterpart, etc.).

A functor as well as its surface realization is recorded in every slot of each valency frame. The mapping between the valency frame and its surface realization is generally quite complex (Hajič, Urešová, 2003). The surface realizations through the morphemic case, preposition and a case, and subordinate sentence are the most common.

Ex.: snížit (to lower): ACT(.1) PAT(.4) ?ORIG(z+2) ?EFF(na+4)

usage: snížit nájem z 8 na 6 tisíc (lit.: lower the rent from 8 to 6 thousand)

The valency frame is fully formalized to allow for automatic computerized processing of the valency dictionary entries. The question mark in front of the valency member in the above example denotes optionality, the other valency members are obligatory. The realization of inner participants is always given in full, since there is no “standard” or “default” realization; free modifications’ realization need not be specified.

The PDT-VALLEX is available as part of the PDT version 2 published by the Linguistic Data Consortium (http://www.ldc.upenn.edu, LDC2006T01).

Acknowledgements: The research reported in this paper was supported by the GAUK 52408/2008, and Ministry of Education Projects ME09008 and MSM0021620838.

References:

Hajič J. and Z. Urešová. (2003). “Linguistic Annotation: from Links to Cross-Layer

Lexicons”. In Proceedings of The Second Workshop on Treebanks and Linguistic Theories. Vaxjö University Press, 69-80.

Hajič, J., J. Panevová, Z. Urešová, A. Bémová, V. Kolářová. (2003). “PDT-VALLEX:

Creating a Large-coverage Valency Lexicon for Treebank Annotation”. In Proceedings of The Second Workshop on Treebanks and Linguistic Theories. Vaxjö University Press, 57-68.

Panevová, J. (1974-75). “On Verbal Frames in Functional generative Description”. Part I,

The Prague Bulletin of Mathematical Linguistics 22, 3-40. Part II, The Prague Bulletin of Mathematical Linguistics 23, 17-52.

Sgall, P., E. Hajičová and J. Panevová. J. Mey (ed.). (1986). “The Meaning of the Sentence in

Its Semantic and Pragmatic Aspects”. Dordrecht: Reidel and Prague: Academia.

Complementation patterns in causative verbs across varieties of English

Bertus van Rooy and Lize Terblanche

Verb complementation, at the interface of grammar and lexis, is identified as one of the areas of most instability and extensive change in late twentieth century standard varieties of English, by Mair (2006:119-140). In his extensive review of linguistic features across a range of Englishes, Schneider (2007) notes unique patterns of verb complementation in Malaysian, Singaporean, Indian, South African, Kenyan, Tanzanian, Nigerian, Cameroonian, Jamaican, and even earlier forms of American English. Likewise, Mukherjee & Hoffman (2006) find that Indian and British Englishes differ from each other in this respect. However, apart from taking cognizance of the fact of difference, very little is known about the possibility of regularity or irregularity in this area of English grammar.

This paper investigates the hypothesis that the New (non-native) varieties of English, display more regularity than Old Varieties. The hypothesis follows from the observation of regularization observed in other aspects of the grammar of English (Szmrecsanyi & Kortmann, 2009), in terms of which non-native varieties are more likely to generalise regular patterns, rather than be constrained by stabilized irregularity more characteristic of native varieties. However, their observations are based mainly on morphological evidence.

On the syntactic level, Van Huyssteen and Van Rooy (2003) note that the causative construction in Black South African English is frequently encoded lexically rather than in a verb-predicate structure. In consequence, certain standard English complementation patterns are unattested in their data. It is expected that verb complementation patterns will be different in this variety. Taken with the findings of Mukherjee & Hoffmann (2006), there is a case for the investigation of complementation patterns in New Varieties of English.

Kemmer & Verhagen (1994: 115) define the prototypical analytic (as opposed to lexical) causative as a two-verb structure that expresses a predicate of causation and a predicate of effect. To operationalise the investigation, we propose to take the list of frequent causative verbs provided by Biber et al. (1999: 363): allow, cause, enable, force, help, let, require and permit, and analyse their complementation patterns, with a view to determining whether New Varieties exhibit more or less regularity than Old (native) varieties, and also whether any innovation has taken place as far as the complementation patterns in new varieties are concerned. Available ICE corpora representing the Old and New Varieties will be the principal data for the investigation.

References:

Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finnegan. (1999). Longman Grammar of

Spoken and Written Language. Essex: England.

Kemmer, S. and A. Verhagen. (1994). “The grammar of causatives and the conceptual

structure of events”. Cognitive Linguistics, 5(2), 115-156.

Mair, C. (2006). Twentieth Century English. Oxford: Oxford University Press.

Mukherjee, J. and S. Hoffmann. (2006). “Describing verb-complementational profiles in New

Englishes: a case study of Indian English”. English World-Wide, 27, 143-174.

Schneider, E. (2007). Post-Colonial Englishes. Cambridge; Cambridge University Press.

Szmrecsanyi, B. and B. Kortmann. (2009). “Between simplification and

complexification: non-standard varieties of English around the world”. Language

Complexity as an Evolving Variable. Oxford: Oxford University Press.

Van Huyssteen, Gerhard & Van Rooy, Bertus (2003). The Causative Construction in “Black

South African English: emergent or emerging?” Presented at the 8^th International Cognitive Linguistic Conference, La Rioja, Spain.

NOT-negation and NO-negation in contemporary spoken British English:

A corpus-based study

José Ramón Varela Pérez

This contribution explores the interface between corpus linguistics, diachronic typology and usage-based approaches to the study of grammatical variation. I will address the alternation between two types of negation in contemporary spoken English mainly involving non-specific indefinite Noun Phrases under the scope of negation: NOT-negation (He did not see anything) and NO-negation (He saw nothing) (Tottie 1991a, 1991b, 1994).

Historically, the possibility of variation between the older construction with NO-negation and the newer one with NOT-negation was only effective after the disappearance of ne and the rise of not as a marker of verbal negation at the end of the ME period (Jespersen 1917; Mazzon 2004).

There have not been many corpus-based studies of the variation between NOT-negation and NO-negation. Most of the existing research on this topic includes contexts where variation between the two constructions is not possible and/or offers quantitative findings with little qualitative analysis of the data (e.g. Biber et al. 1999; Westin 2002; Peters 2008). Only Tottie (1991b) has offered a comprehensive study of this area although she relies on corpora dating back to the 1960s and the early 1970s. In this paper, I will use a sample of contemporary spoken British English taken from the British component of the International Corpus of English (ICE-GB), including conversations recorded in the early 1990s.

I will also address the impact of several internal factors on the choice between the two constructions, including some that have not yet been considered in the literature. Ultimately, the variation between NOT-negation and NO-negation must be placed against the backdrop of diachronic typology: the history of sentence negation in English (the so-called Jespersen’s Cycle) and two competing typological tendencies that bear opposite results in the expression of negation: (a) the ‘Neg First’ principle, i.e. the universal psycholinguistic tendency for negative markers to be placed before the verb (Jespersen 1917; Horn 1989); and (b) the End-weight principle, i.e. the tendency to concentrate communicatively significant elements towards the second part of the sentence (Mazzon 2004).

References:

Biber, D. et al. (1999). Longman Grammar of Spoken and Written English. Harlow, Essex: Pearson Education Ltd.

Horn, L. R. (1989). A Natural History of Negation. London and Chicago: University of Chicago.

Jespersen, O. (1917). Negation in English and other languages. Copenhagen: Munksgaard.

Mazzon, G. (2004). A History of English Negation. Harlow: Longman.

Peters, P. (2008). “Patterns of negation: The relationship between NO and NOT in regional varieties of English.” T. Nevalainen, I. Taavitsainen, P. Pahta and M. Korhonen (eds). The Dynamics of Linguistic Variation: Corpus Evidence on English Past and Present. Amserdam: Benjamins, 147-162.

Tottie, G. (1991a. “Lexical diffusion in syntactic change: Frequency as a determinant of linguistic conservatism in the development of negation in English.” In D. Kastovsky (ed.). Historical English Syntax. Berlin/New York: Mouton de Gruyter, 439-467.

Tottie, G. (1991b). Negation in English Speech and Writing. A Study in Variation. San Diego, California: Academic Press.

Tottie, G. (1994). “Any as an indefinite determiner in non-assertive clauses: Evidence from Present-day and Early Modern English.”. In D. Kastovsky (ed.). Studies in Early Modern English, 413-427.

Westin, I. (2002). Language Change in English Newspaper Editorials. Amsterdam/New York: Rodopi.

AN.ANA.S.: aligning text to temporal syntagmatic progression in Treebanks

Miriam Voghera and Francesco Cutugno

AN.ANA.S. (Annotazione Analisi Sintattica) is a system for the syntactic annotation of both spoken and written texts developed within a Treebank project. The system is a public resource.

The basic properties of AN.ANA.S. are the following: 1) it works on unrestricted texts, including all disfluencies as well as typical spoken elements; 2) it is aligned to temporal syntagmatic progression of the texts; 3) it is based on a set of symbols and rules which are reflected in a formal, computationally treatable, metadata structure expressed in terms of XML elements, sub-elements, attributes and dependencies among elements resumed into a formal descriptive document (Document Type Definition - DTD); 4) it uses a hybrid framework, allowing the annotation of both constituents and functional relations; 5) it aims to capture the basic syntactic feature of both spoken and written texts, allowing a direct comparison.; 5) it has a limited number of language-specific annotation tags, so to widen its the application to different languages.; 6) it is addressed to linguists with little computational background; 7) it has a searchable interface.

Presently we have annotated a pilot corpus of written and spoken Italian texts of 2100 clauses; 420 clauses of spoken Spanish and 800 clauses of spoken English. An AN.ANA.S. L2 version has been developed and tested on an Italian learner corpus of 3150 clauses. Data. Data on the frequency and distribution of all the constituents have been collected. This has allowed not only to design the syntax of different texts, but also to enlighten linguistic properties of the structures. In fact, statistical regularities contribute to increase not only the data for which a theory can account, but also its explanatory adequacy.

References:

Abeillé, A. (ed.) (2003). Treebanks: Building and Using parsed Corpora, Dordrecht: Kluwer.

CHRISTINE Project, http://www.grsampson.net/RChristine.html

Clark, B. (2005). “On Stochastic Grammar”, Language, 81 (1), 207-217.

International corpus of English, http://ice-corpora.net/ice/

Newmeyer, F. (2003). “Grammar is Grammar, Usage is Usage”, Language, 79 (4), 682-707.

Marcus, M., B. Santorini and M. A. Manrcinkiewicz, “Building a large annotated corpus of English: the Penn Treebank”. Computational Linguistics 19(21), 313-330.

Voghera, M., G. Basile, D. Cerbasi and G. Fiorentino. “La sintassi della clausola nel

dialogo”. In F. Albano Leoni, F. Cutugno, M. Pettorino, and R. Savy (eds) Il parlato

italiano. Atti del Convegno Nazionale, Napoli, D’Auria Editore, 2004. CD-rom, B17.

Voghera, M. and F. Cutugno. (2006). “An observatory on Spoken Italian linguistic resources

and descriptive standards”. In Proceedings of 5th International Conference on Language Resources and Evaluation. LREC 24-26 Maggio 2006, Genova, ELRA, 643-646.

Voghera, M. and G. Turco. (forthcoming). “From text to lexicon: the annotation of pre-target

structures in an Italian learner corpus”. III International Workshop on Corpus Linguistics, Florence 2008.

Building a corpus-based specialist writing-aid tool for non-native speakers of English

Alexandra Volanschi and Natalie Kübler

It is a well-established fact that French-speaking researchers (and more generally non-native speakers of English) are increasingly required to publish in English in order to be internationally acknowledged (and nationally evaluated). French universities and research institutions are well aware of this situation; however, it is still very difficult for French speakers to write their papers in native-like English and for their papers to be accepted as easily those written by researchers from the “inner circle” (Kachru 1985) countries. Based on the analysis of a corpus composed of research articles, this paper describes a method of exploring the combinatory properties of terms belonging to a particular domain in life sciences: yeast biology. The applicative objective of this work consists in creating a writing-aid tool aimed at meeting the specific needs of researchers who have to publish in English as a second language. A questionnaire submitted to researchers in biology at the Life Sciences department at the University Paris Diderot helped us better evaluate those needs. Among other things, its results have shown that, while terminology is acquired by reading literature in the field, phraseology constitutes one of the major difficulties for non-native speakers.

In order to extract terminological collocations belonging to the field, but also those belonging to General Scientific Language (Pecman 2004), we compiled a 5 million-word specialised corpus composed of research articles focusing on yeast biology. The corpus was POS tagged, lemmatised and dependency-parsed. Starting from this annotated corpus we have devised a collocation extraction method which, along the lines suggested by Lin 1998 or Kilgarriff 2002, combines syntactic parsing with statistical criteria such as frequency, mutual information and coverage (number of documents in which a co-occurrence is observed). We show that collocational information extracted using this technique is sometimes incomplete; we have consequently tried to give a clearer picture of the phraseological behaviour of specialised verbs by providing “extended” argument structures, also extracted from the dependency parse.

The results of this process of collocation and argument structure extraction have been incorporated into a preliminary web-based query tool which provides an access to terms and their collocations (and the associated statistical information mentioned above). Each combination is illustrated with examples from the corpus. Several testing stages with students and researchers in biology are planned in order to help us better structure the access to the data and adapt it to users who are not expert linguists. We intend to develop corpus-driven materials for ESP teaching and to test the transferability of this approach to other fields.

References:

Kachru, B. B. (1985). “Standards, codification, and sociolinguistic realism: The English

language in the outer circle”. In R. Quirk and H. G.Widdowson (eds). English in the

world: Teaching and learning the language and literatures. Cambridge: Cambridge

University Press, 11-30

Kilgarriff, A. and D. Tugwell. (2002). “Sketching words”. In M-H. Corréard (ed.).

Lexicography and Natural Language Processing: A Festschrift in Honour of B. T. S.

Atkins. EURALEX, 125-137.

Lin, D. (1998). “Extracting Collocations from Text Corpora”. First Workshop on

Computational Terminology, Montreal, Canada, August, 1998.

Pecman, M. (2004). Phraséologie contrastive anglais-français : analyse et traitement en vue

de l’aide à la rédaction scientifique. PhD dissertation, 2004, University of Nice.

Howevering the howevers and butting the buts: Investigating contrastive and concessive constructions in media discourse

John Wade and Stefano Federici

“But me no buts”

(Henry Fielding 1707-1754)

The adroit use of adversative constructions (Halliday and Hasan 1976) in written text is a key element in building a logical, cohesive and convincing argument.

It is the purpose of this paper to examine a number of examples starting from selected connective devices and collocating them in their specific syntactic contexts in order to identify any recurrent patterns which come to the fore.

An ad hoc corpus was first constructed with material from a widely circulated British national daily newspaper. It was decided to focus on media discourse, since, while newspaper reporting theoretically represents an impartial account of events, “There is always a decision to interpret and represent it in one way rather than another” (Fairclough 1995: 54). The use of adversative constructions plays an important role in representing reality from one perspective, while appearing at the same time to be balanced and unbiased.

An initial small-scale manual analysis was carried out on the corpus in order to classify the selected connective devices according to meaning and function and this was then extended to an analogical comparison which allowed the further extraction of material which was then analysed automatically.

The setting up of the corpus, analysis of the data acquired and conclusions will be fully illustrated in the paper.

References:

Fairclough, N. (1995). Critical Discourse Analysis. Harlow: Longman.

Halliday, M. A. K. and R. Hasan. (1976). Cohesion in English. Harlow: Longman

A corpus-based study of the speech, writing and thought presentation in Julian Barnes’ Talking It Over.

Brian Walker

Julian Barnes’ Talking It Over is a novel with a fairly familiar theme – a love triangle – that is told in a rather unusual way – there are nine first person narrators. This paper describes part of my PhD research into characterisation in Talking It Over, and focuses on my current corpus-based investigation of the narrators in the novel. Of the nine narrators there are three main narrators, Oliver, Stuart and Gillian, who are the three people involved in the love triangle, and who say substantially more than the other narrators in the novel. The remaining six narrators are people whose involvement in the story is more peripheral, but come into contact with one or more of the main narrators. My paper will describe how I use an annotated, electronic version of the novel to explore the differences in speech, writing and thought presentation (SW&TP) usage in the narrations and how these variations relate to distinct approaches to telling the story.

My analysis uses the SW&TP model described in Semino and Short (2004), which is based on the model originally conceived by Leech and Short (1981). My paper will describe how I added XML tags to the text to mark up the different SW&TP categories using an annotation scheme based on that developed by McIntyre et al (2004). I will mention some of the problems I encountered while adding tags manually and how I attempted to address them. I will also mention the benefits of this approach. I will then explain how I derived statistics about the SW&TP used by the three main narrators from the fully tagged version of the novel, and show the results of this statistical analysis. I will then give a brief interpretive response to a comparison of these statistics.

References:

Barnes, J. (1991). Talking It Over. London: Jonathan Cape Ltd.

Leech, G. and M. Short. (1981). Style in Fiction. London: Longman.

McIntyre, D., C. Bellard-Thompson, J. Heywood, T. McEnery, E. Semino and M. Short.

(2004). “Investigating the presentation of speech, writing and thought in spoken

British English: A corpus-based approach”. ICAME Journal 28, 49-76.

Semino, E. And M. Short. (2004). Corpus Stylistics: Speech, Writing and Thought

Presentation in a Corpus of English Narratives. London: Routledge.

A corpus-based discourse analysis of global warming in British, American and Chinese newspapers

Fang Wang

This thesis aims at investigating the meanings of the lexical item global warming in three discourses which are represented by three prestige world newspapers: the Guardian, the Washington Post and the People’s Daily(人民日报: Renmin Ribao). Through the building of a diachronic corpus which consists of the news articles in which global warming occurs at least once during the last 20 years in the three newspapers, I draw a comparison between the different meanings of global warming displayed. Besides, the corpus is further divided into three time periods based on two frequency explosions of the articles concerned. In this way, the changes of the meanings of global warming over time can be traced. Therefore, my study focuses on both a comparative-synchronic axis (simultaneous depictions of global warming in different newspapers) and a historical-diachronic axis (temporal sequences and evolutions).

The research methodology that I adopt here is collocation profiles. A collocation profile is a list of all words found in the immediate context of the keyword, listed according to their statistical significance as collocates of the keyword. Here, by setting the word span of 5 words to the left and five words to the right of global warming, nine collocation profiles are generated based on three sub corpora with each being divided into three time periods. Then, both the common collocates of all these nine collocation profiles and salient collocates from each specific profile are selected and applied to further concordance analyses. Through detail examination of these concordance lines and systematic interpretation, it can be said that there are striking differences between the meanings of global warming constructed in three discourses: in the Guardian discourse, global warming is represented as an irrefutable fact, and tough actions against it are required. The Washington Post discourse is employing different strategies to cast doubts on the existence of global warming. The People’s Daily (人民日报: Renmin Ribao) discourse is (re)producing the legitimacy of any actions to combat global warming and representing developing countries’ contributions to the problem.

A corpus-based study of Chinese learners’ use of synonymous nouns in college English writing

Hong Wang

Modern English has an unusually large number of synonyms or near-synonyms, and some pairs or groups of synonyms have only one equivalent translation in Chinese. Since Chinese learners tend to learn English words only by remembering their Chinese equivalents, English synonyms, especially those with only one Chinese equivalent, may mean exactly the same to them. This paper investigates Chinese learners’ use of five pairs of synonymous nouns, namely aspect/respect, ability/capability, chance/opportunity, relation/relationship, and safety/security in their College English writing. Each pair has only one identical equivalent in Chinese, respectively 方面 (fang mian), 能力 (neng li), 机会 (ji hui), 关系(huan xi) and 安全 (an quan). The aim of the study is to find out whether Chinese learners have difficulty in using these kinds of synonyms, what deviations concerning the use of these synonyms are most typical in their writing, and what factors might have contributed to the inappropriateness. To answer these questions, a comparative study is conducted based on data from a learner corpus COLEC (Chinese College Learner English Corpus) and a native-speaker corpus LOCNESS. Concordincing lines of each pair of the synonyms are extracted from both corpora using WordSmith Tools 4.0 and colligational and collocational analyses are conducted to look for the similarities and disparities between the COLEC learner English and the LOCNESS native speaker English. Results show that the confusion of the synonyms is clearly a wide-spread phenomenon and Chinese learners tend to use the synonyms interchangeably, thus inappropriate collocations or deviant usages arise. The main reason is that when learning synonyms, learners focus mostly on the Chinese translations of the English words, giving little or no attention to their semantic differences in English, nor paying attention to the colligational and collocational restrictions of each synonym. Pedagogical implications of the study are then discussed and suggestions made for using concordance-based exercises as a way of helping learners detect the semantic differences of synonyms and raising learners’ awareness of the colligational and collocational restrictions of synonyms.

Taking stance across languages: Modal verbs of epistemic necessity and inference in English and Polish research articles

Krystyna Warchal

Defined as “the ways the writers project themselves into their texts to communicate their integrity, credibility, involvement, and a relationship to their subject matter and their readers” (K. Hyland, Disciplinary discourses: Writer stance in research articles, in Ch. Candlin and K. Hyland, eds., Writing: Texts, processes and practices, London: Longman, 1999, p. 101), stance can be expressed by a variety of means, including, among others, hedges, emphatics and attitude markers. The use of these elements – their frequency, distribution and variety in various text types – is language and culture-specific. This paper focuses on selected exponents of stance by which speakers of English and Polish express their assessment of the truth of a proposition and their commitment to the assessment, and more specifically, on modal verbs of epistemic necessity and inference used in linguistics research articles in these two languages. The analysis is based on two corpora of research articles published in the years 2001-2006 in English and Polish linguistics-related journals. The English corpus consists of 200 articles published in five internationally recognised journals; on the basis of affiliation notes it is assumed that the writers are native users of English or have a native-like command of this language. The Polish corpus comprises 200 articles, all of them included in the 2003 list of Polish scientific journals issued by the Polish Committee for Scientific Research and written in Polish by Polish authors. The analysis focuses on the following modal and quasi-modal verbs: MUST, NEED, HAVE (GOT) TO / MUSIEĆ and SHOULD, OUGHT TO / POWINIEN in an attempt to discuss their meaning and function in one specific genre and discipline but across languages and cultures.

The forms and functions of organisational frameworks across different registers

Martin Warren

Organisation-oriented words, such as conjunctions, connectives and discourse markers, are sometimes co-selected by speakers and writers to form ‘clause collocations’ (Hunston, 2002), which in this paper are termed ‘organisational frameworks’, to link distinct sections of the discourse (for example, ‘because ... so’). While this type of word association is quite common, there have been few research studies examining the extent of this form of phraseology beyond the descriptions of well-known correlative conjunctions (for example, ‘either ... or’), and also whether they differ in form, function or frequency across different registers. This paper builds on the methodology of previous studies of ‘concgrams’ (see, for example, Cheng, Greaves and Warren, 2006) and looks at the most frequent organisational frameworks in general English use and compares them with those found in a specialised corpus of Financial Services English. The method used to fully automatically extract the organisational frameworks is described, along with examples of both the forms and functions of the organisational frameworks.

Acknowledgements:

The research described in this paper was substantially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region (Project No. PolyU 5459/08H, B-Q11N).

References:

Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: Cambridge University Press.

Cheng, W., C. Greaves, and M. Warren. (2006). “From n-gram to skipgram to concgram”. International Journal of Corpus Linguistics 11/4, 411-433.

Distance between words as an alternative way to model language

Justin Washtell and Eric Atwell

The prevailing computational approach to discovering associated words within a body of text is to move a window along the text and count the number of times words occur together within it. If this is more than we would expect by chance for any pair of words, given their individual frequencies, then it is concluded that they are semantically related. As well as directly inferring semantic association, this approach also forms the basis of word similarity measures (e.g. if two words tend to co-occur with the same words) and various higher-level language processing tasks such as disambiguating word senses (e.g. discerning a river bank from a financial bank).

In order to perform any of these measurements, it is necessary to decide upon an appropriate window size, in order to define “co-occurrence”. This presents a problem. While there may be an optimum window size for certain applications, the choice invariably involves trade-offs.

Techniques are used in biogeography to estimate the dispersion or spatial variance of a species or feature, by measuring the distances between individuals. It is possible for us to take a similar approach to identifying associated words in text. If two words tend to occur more closely to each other than we would expect by chance – given their individual distributions – we can infer that they are associated. We refer to this approach as co-dispersion.

To illustrate, we compare the 12 most significant word co-occurring with “bank”, using different window-sizes: 1 word each side, 10 words, 100 words. The results overlap, and all window sizes produce some intuitive results; but there is disagreement between them. So which is best? The problem is that “co-occurrence” is scale-dependent: we cannot interpret the results without reference to the scale of the analysis. As a simple proxy of “semantic association”, co-occurrence is problematic.

While there is limited overlap between the most significant collocates determined via smallest and largest window-sizes, co-dispersion can be seen to exhibit a more-or-less constant level of agreement with all window-sizes. This illustrates scale-independence, and also illustrates that the two approaches are otherwise largely equivalent. This is important because, despite its limitations, the window-based approach has proven to be effective in various language-processing tasks, and a good proxy for human intuitions of relatedness.

Another side-effect of considering information at all scales is a greater effective data density, and therefore dramatically increased statistical significance. If we plot on a graph the spread of apparent associations found in an example body of text (the Brown Corpus), and compare these with apparent associations in randomly-generated text, we can estimate a level of significance or confidence in these apparent associations. The confidence levels for word-associations found using our windowless co-dispersion measure are noticeably higher than for word-associations based on a fixed window. Co-dispersion can take into account all occurrences of both words, whereas window-based co-occurrence can only take into account cases where both words are in the window. This reduces data-sparseness and hence increases statistical confidence.. Zipf’s law tells us that the vast majority of word-types are rare, and the chances of two words co-occuring in a limited window are even poorer; so a modelling technique which can make the most of every example is particularly useful when looking at medium- or low-frequency words.

A new computing measure for extracting contiguous

phraseological units from academic texts

Naixing Wei, Jingjie Li and Junying Zhou

This paper introduces a new computing method for extracting contiguous phraseological units from academic texts by measuring the internal associations within n-grams. We have undertaken a series of research activities to develop the new computing method.

(1) In the first place, we further developed the concept of pseudo-bigram transformation (Silva & Lopes 1999), which assumes that every n-gram w₁, w₂, w₃, ……, w_n () may be seen as a pseudo-bigram in terms of having one dispersion point “located” between a left part, w₁…w_i, and a right part, w_(i+1)…w_n_,,of the n-gram(，). In this way we may extend the use of current statistics-based measures (e.g. MI, Entropy) to the computing of n-grams, where.

(2) We then constructed a new normalizing algorithm of probability-weighted average in order to calculate the internal association of an n-gram by taking a weighted average of the “glue” values for all the dispersion points, where weighting depends on the probability of each “glue”. Let G be the “glue” value for each dispersion point and P(G) be the probability of each event G. The formula for probability-weighted average (Expression 1) is as follows:

(1)

The new algorithm is proposed for refining the current measures, enhancing the precision and recall of phraseological units extracted by these measures. Expressions 2 and 3 are the respective formulas for refining the MI and the Entropy by utilizing the normalizing algorithm of probability-weighted average.

(2)

(3)

(3) Thirdly, we applied the new computing measure to the processing of our corpus data and extracted 197,108 different types of phraseological units, with a total of 3,045,514 tokens. Compared to the current statistics-based measures, the new measure reach a higher extraction precision, and the extracted data better reflect the semantic and structural characteristics of n-grams. Take the set of multi-word sequences “a final”, “a final report”, “a final report no”, “a final report no later” and “final report no later than” for example. The current Entropy measure extracts the sequence “a final report no” as a multi-word unit, for the value of its entropy ratio is higher than the other four. The Wordsmith will output all the five sequences as “clusters”. By contrast, our new computing measure will identify only the two structurally and semantically complete sequences, “a final report” and “final report no later than”, as phraseological units, excluding those irrelevant sequences which meddle in the process.

This research can have potentially useful applications in corpus linguistics and natural language processing. Particularly, the new computing method can be used to extract phraseological sequences from corpora with greater precision, to identify multi-word units more accurately, thus, improving the validity and effectiveness of phraseological studies to a great extent.

Corpus profiling with the Posit Tools

George R. S. Weir

A common activity in corpus linguistics is the comparison of one corpus with another. Sometimes, this relies on the presumption that the comparator serves as a reference standard and is representative, in some significant manner, of texts (or language) in general. The sample corpus may then be scrutinised in order to determine how or the extent to which it varies from 'the norm'. In other contexts, the aim is simply to contrast several example text collections, in search of significant similarities or differences between them.

Of course, there are many possible dimensions for corpus comparison and their selection usually depends upon specific practical objectives, such as language translation, authorship confirmation, genre classification, etc. The present paper describes recently developed software tools for textual analysis (Posit) and focuses on their use in corpus comparison. The Posit system is primarily designed to analyse corpus content in terms of parts-of-speech (POS), word and n-gram frequency.

An extension to the system allows for easy comparison of POS, word and n-gram frequency profiles between corpora. Thereby, Posit affords a useful facility for shedding light on these dimensions of comparison between corpora. The extensive system output includes pie charts to display distributions of major parts-of-speech for individual corpora and line graph projections for comparisons across multiple corpora.

In its initial design, the Posit system is script-based, highly modular and allows for easy interfacing to other software applications. Posit is able to handle arbitrarily large corpora and may be controlled from a command-line (Anonymised2, 2007). In its second-generation, the Posit system includes additional features such as concordance display, the choice of using internal, external POS tagger or pre-tagged corpora, and the option of outputting results to an

integrated SQL database (Anonymised3, 2008).

The application of the Posit tools as a means of shedding light on corpus content, particularly crosscorpus comparison, is illustrated in terms of POS, word and n-gram profiles using newspaper corpus data. The recent use of Posit analysis in the context of Japanese historical EFL texts is also described (Anonymised3, 2008). The Posit tools are freely available to the research community.

References:

Anonymised1. The Posit Toolset with Graphical User Interface. Proceedings of the

[Anonymised]International Workshop, Sri Lanka, September 2008.

Anonymised2. The Posit Text Profiling Toolset. Proceedings of the [Anonymised]

Conference, Thailand, December 2007.

Anonymised3. Multiword vocabulary in Japanese ESL texts. Proceedings of the

[Anonymised]Conference, Hawaii, August 2008.

Design and compilation of pedagogical corpora for secondary schools

Johannes Widmann and Kurt Kohn

Although on the whole corpus use in teacher education contexts slowly seems to be on the increase, an entirely different picture emerges when we look at the use of corpora in actual teaching practice in secondary schools. Most teachers are very hesitant to include corpus-based teaching approaches in their courses. The most important reason for this is the lack of a sound and innovative pedagogic concept for corpus-based language learning that goes beyond corpus concepts motivated by descriptive linguistic needs. Furthermore, dedicated teachers often do not have the necessary support from their institution to embark on these new ways of teaching and learning. However, from a corpus linguistic perspective, it is apparent that there is also a lack of pedagogically motivated corpus tools and a lack of corpora suited for the use in educational contexts. This lack of appropriate corpora also shows in the fact that corpus-based teaching materials have not experienced wide-spread acceptance in course books, either.

The present paper presents a research project that has developed pedagogical corpora that are based on the thematic needs and wishes of teachers in secondary schools. Before the corpora were compiled, a survey was done to find out about these needs and about the required corpus materials to satisfy them. The paper then shows how these corpora have been constructed based on these needs on the one hand and based on pedagogical principles combined with an applied linguistics perspective on the other hand. It argues that the construction of these corpora has to follow very different design principles than those that are followed when constructing large representative reference corpora. The same is true for the tools that are used to exploit the corpora. The tools of the project have been tailored to the needs of the teaching context. They put pedagogical suitability and relevance before linguistic and functional sophistication. So the paper demonstrates that on all three levels (corpus design, corpus compilation, and engineering of the search tools) pedagogical corpora must follow different principles than descriptive corpora.

Finally, it is argued that the corpora must be enriched with additional resources to unfold their full pedagogical potential (cf. Braun 2007). The corpus texts and annotation alone do not guarantee that learners can make optimal use of the language in the corpora and authenticate it for themselves in the sense of Widdowson (2003). The corpus must thus be complemented by suitable learning materials that embed it in a comprehensive teaching concept. The project has developed two types of exercises that serve exactly these purposes. First, ready-made learning packages have been produced that are directly related to individual sections of the corpus. Second, exploratory and communicative exercises have been developed that allow for guided exploration of the corpus materials and using it for communicative tasks. Both types of exercises can be retrieved with the search tool and they help the learners to make corpus work more rewarding for them.

References:

Braun, S. (2007). "Integrating corpus work into secondary education: from data-driven

learning to needs-driven corpora." ReCALL 19/3, 307-328.

Widowson, H. (2003). The appropriate language for learning. In Widdowson, H. (ed.). Defining Issues in English Language Teaching. Oxford, Oxford UP.

The general and the specific: Collocational resonance of scientific language

Geoffrey Williams and Chrystel Millon

Special language has long sought special words, but the wealth of language lies in semi-technical words and 'general' words in specific contexts. The BMC Corpus is a 33 million word resource built as part of the Scientext initiative, a comparable corpus project consisting of a consortium of three French Universities led by the Université de Grenoble 3. It has been fully part of speech tagged and lemmatised. Rather than being a mass of data it has been structured into themes to make comparative analysis possible. In the paper we shall look at some general words to see how their use varies in scientific contexts. This will be done by looking at the collocational networks of the certain high frequency nouns and verbs.

Variations of meaning between ‘general’ and specialised usage will be done using the mechanism of collocational resonance that sets out to map how wordings carry over aspects of meaning from one context to another. Collocational resonance is the notion that a word can carry over part of its sense from one collocational environment to another. This notion has had two, simultaneous, births. One looks at the move from the literal to the metaphoric (Hanks 2006), the other takes a more inter-textual standpoint (Williams 2008). In this paper it will be shown that the two are quite compatible, both having been developed from Sinclair’s idiom principle. What brings them together is lexical priming (Hoey 2005), in that the meaning of both words and phraseological units can only be found in the environment and that the collocational resonances arise from primings of the components.

This can be illustrated with the word 'probe', both as noun and verb. The primary sense of the word is as a noun denoting a long thin instrument, a sense readily attested in dictionaries. This has resulted in a transfer of meaning to a verb form of ‘probe’, which describes the action of searching using a long thin instrument. However, in molecular biology, neither the probe as pointed instrument, nor the notion of searching with such an instrument, is found, but ‘probes’ are widely used, so somehow we have to see what elements of meaning are carried over, if any, and how the polysemy of ‘probe’ can be described. This will be done using lexicographical prototypes (Hanks 1994) which provide the means to see the variations of meaning, to map the collocational resonance of the word and also to perceive the effects of lexical priming in generating particular senses of the word within a given context. The thematic division of the BMC corpus will allow is to map the different senses and primings of both noun and verb in different scientific uses and compare these to 'general' language use.

The outcomes of this work are of specific interest in the field of learners dictionaries providing means whereby the standard dictionary can be expanded to take into account usage in scientific contexts.

References:

Hanks, P. (1994). “Linguistic Norms and Pragmatic Exploitations or, Why Lexicographers

Need Prototype Theory, and Vice Verse.” Papers in Computational Lexicography:

Complex 94, 89-113.

Hanks, P. (unpublished 2006). “Resonance and the Phraseology of Metaphors”. Conference

paper. The Many faces of Phraseology. Louvain-la-neuve 2006.

Hoey, M. (2005). Lexical Priming: A New Theory of Words and Language. London:

Routledge.

Williams G. (forthcoming 2008). “The Good Lord and his works : A corpus-based study of

collocational resonance.”. In Granger, S. and F. Meunier (eds). Phraseology: an

interdisciplinary perspective. Amsterdam : Benjamins.

Expressing gratitude by non-native speakers of English: Using the International Corpus of English (Hong Kong component)

May Wong

This paper presents research findings from the analysis of corpus linguistic data

generated by the International Corpus of English (ICE) project (Nelson 1996). Specifically, the research reported on in this paper focuses on the use of expressions of gratitude in spoken discourse, and presents results derived from the data from the Hong Kong component of the International Corpus of English project (ICE-HK). Whereas previous studies of thanking expressions have dealt with issues relating to a repertoire of conversational routines of expressing gratitude, this present study confines itself to the analysis of expressions of gratitude containing the stem ‘thank’, and also includes a discussion of methodological issues. Such issues include, first, the traditional way of studying expressions of gratitude in (interlanguage) pragmatic research, and, second, the corpus linguistic analytic approach adopted in this study. A key argument here is that the corpus-linguistic methodology used in such studies has an important effect on throwing up salient pragmatic patterns in the use of expressions of gratitude.

The results show that Hong Kong speakers of English do not employ a wide variety of thanking strategies investigated in previous literature. Their expressions of gratitude are usually brief, with thanks and thank you being the most common forms of gratitude expression. They are frequently used as a closing signal and as a complete turn. The Hong Kong Chinese tend to be rather reserved in expressing their gratitude explicitly and thus they seldom show appreciation of the interlocutors in their expression of gratitude, nor do they employ an extended turn for thanking. In addition, it is less likely for them to express gratitude and reject an offer. Responses to an act of thanking seem to be infrequent in ICE-HK and only a few strategies are represented.

The paper also considers the pedagogical implications of the way this function can be acquired in a second/foreign language with the help of the corpus findings. Corpus data can provide us with information which is not easily accessible by intuition, such as frequency information. While two thanking strategies, namely, the ‘thanking + stating reason’ and ‘thanking + refusing’ structures, are commonplace in native-speakers’ interactions (Schauer and Adolphs 2006: 127), my investigation has shown that Hong Kong speakers of English seldom adopt the latter strategy. The frequency of occurrence of different strategies of expressing gratitude can thus be used as one of the guiding principles for the selection and prioritising of language content i.e. sequencing in ELT materials (e.g. Leech 2001) with regard to the teaching of formulaic sequences.

References:

Schauer, G. and S. Adolphs. (2006). “Expressions of gratitude in corpus and DCT data: vocabulary, formulaic sequences, and pedagogy”. System 34 (1), 119-134.

Leech, G. (2001). “The role of frequency in ELT: new corpus evidence brings a re-appraisal”. Foreign Language Teaching and Research 33 (5), 328-339.

How different is translated Chinese from native Chinese?

Richard Xiao

Since the 1990s, the rapid development of the corpus-based approach to linguistic investigation in general, and the development of multilingual corpora in particular, have brought even more vigour into Descriptive Translation Studies (Toury 1995). Presently, corpus-based Descriptive Translation Studies has focused on translation as a product, by comparing comparable corpora of translational and non-translational texts, especially translated and L1 English. A number of distinctive features of translational English in relation to native English have since been uncovered. For example, Laviosa (1998) finds that translational language has four core patterns of lexical use. Beyond the lexical level, translational English is characterised by normalisation, simplification, explicitation, and sanitisation. Features of translational English such as these are sometimes called “translation universals” (TU) in the literature. Similar features have also been reported in the translational variants of a few languages other than English (e.g. Swedish). Nevertheless, research of this area has so far been confined largely to translational English translated from closely related European languages (e.g. Mauranen and Kujamäki 2004). If the features of translational language that have been reported are to be generalised as “translational universals”, the language pairs involved must not be restricted to English and closely related European languages. Clearly, evidence from “genetically” distinct language pairs such as English and Chinese is arguably more convincing, if not indispensable.

In this paper, I will first briefly review previous research of translation universals, and then introduce the newly created ZJU Corpus of Translational Chinese (ZCTC), which is designed as a translational match for the Lancaster Corpus of Mandarin Chinese (LCMC), a one-million-word balanced corpus representing native Chinese. The ZCTC corpus is created with the explicit aim of uncovering the potential features of translational Chinese. The remainder of the paper presents a number of case studies of the lexical and syntactic features of translational Chinese on the basis of the two monolingual comparable corpora, which may shed new light on some core features of translational Chinese. The implications of the study for TU hypotheses such as simplification, normalisation and explicitation will also be discussed.

References:

Laviosa, S. (1998). “Core patterns of lexical use in a comparable corpus of English narrative prose.” Meta 43(4), 557-570.

Mauranen, A. And P. Kujamäki. (2004). Translation Universals: Do They Exist? Amsterdam: John Benjamins.

Toury, G. (1995). Descriptive Translation Studies and Beyond. Amsterdam: John Benjamins.

On Chinese Learners’ Use of English Tenses: A Corpus-based Study

Xuemei Zhang and Yingying Yang

Based on the researchers’ previous empirical studies on Chinese English learners’ acquisition of English tenses with special reference to the past simple and the present perfect, and the effects of temporality expressions on the students’ choices of English tenses, this paper mainly reports a further follow-up investigation: to explore Chinese English learners’ performance in the light of tense agreement with the following research questions:

1. In what kind of context or time framework will Chinese learner perform better in the aspect of tense consistency?

2. Do the Chinese temporal expressions affect students’ use of tense consistency?

This study selected randomly 100 translation passages of the same 180-words Chinese article, which were collected from 5 different universities of China (with each 20) from The Corpus for English Major, or the CEM, a newly-built corpus, funded by 2007 National Social Science Foundation (No. 07BYY037). All the passages have already been error-tagged with a careful classification of different tense errors. In addition, the passages were retagged with a well-designed scheme to attach more information such as marking both the wrong and correct types of tense and the types of temporality expressions in each sentence. After that another empirical study was followed to further validate the research finding -- both a new translation passage was carried out by a group of students of similar English proficiency level and a questionnaire was administered.

The analysis shows that Chinese learners can use the simple present and the simple past tenses correctly, and they can use them consistently within the same time framework, even when time framework changes, they can still use tenses correctly, with the aid of different time adverbials as well as Chinese aspect marker “le”. Some other revealing findings regarding Chinese English learners’ use of English tenses and English learners’ acquisition of tenses in general are also discussed.

References:

Bardovi-Harlig, K. (2000). Tense and Aspect in Second Language Acquisition: Form, Meaning, and Use. Oxford: Blackwell

Granger, S. (1999). “Use of Tenses by Advanced EFL Learners: Evidence from an Error- tagged Computer Corpus”. In H. Hasselgard and S. Oksefjell (eds) Out of Corpora — Studies in Honour of Stig Johansson. Amsterdam Atlanta: Rodopi, 191 - 202.

Bardovi-Harlig, K. (1999). “From morpheme studies to temporal semantics”. Studies in Second Language Acquisition, 2, 341–382.

Zhang, X. and Y. Yingying (2009). “On Chinese English Major’s Acquisition of English Tenses: a study based on the writing corpora of CEM”. Journal of Sichuan International Studies University. 3, 87-94.

ANNIS: A aearch tool for multi-layer annotated corpora

Amir Zeldes, Julia Ritz, Anke Lüdeling and Christian Chiarcos

ANNIS is a flexible web-based corpus search architecture for search and visualization of multiple-layer corpora. By multi-layer we mean that the same primary datum may be annotated with (i) annotations of different types (spans, hierarchical trees/graphs with labelled edges and arbitrary pointing relations between terminals or non-terminals), and (ii) annotation structures that possibly overlap and/or conflict hierarchically. The supported search functionalities of ANNIS include exact and regular expression matching on word forms and annotations, as well as complex relations between individual elements, such as all forms of overlapping, contained or adjacent annotation spans, hierarchical dominance (children, ancestors, left- or rightmost child etc.) and more. Alternatively to the query language, data can be accessed using a graphical query builder. Query matches are visualized depending on annotation types: annotations referring to tokens (e.g. lemma, POS, morphology) are shown immediately in the match list. Spans (covering one or more tokens) are displayed in a grid view, trees/graphs in a tree/graph view, and pointing relations (such as anaphoric links) in a discourse view, with same-colour highlighting for coreferent elements. Full Unicode support is provided and a media player is embedded for rendering audio files linked to the data, allowing for a large variety of corpora.

Corpus data is annotated with automatic tools (taggers, parsers etc.) or task-specific expert tools for manual annotation, and then mapped onto the interchange format PAULA (Dipper 2005), where stand-off annotations refer to the same primary data. Importers exist for many formats, including EXMARaLDA (Schmidt 2004), TigerXML (Brants/Plaehn 2000), MMAX2 (Müller/Strube 2006), RSTTool (O’Donnell 2000), PALinkA (Orasan 2003) and Toolbox (Stuart et al. 2007). Data is compiled into a relational DB for optimal performance. Query matches and their features can be exported in the ARFF format and processed with the data mining tool WEKA (Witten/Frank 2005), which offers implementations of clustering and classification algorithms. ANNIS compares favourably with search functionalities in the above tools as well as corpus search engines (EXAKT¹, TIGERSearch², CQP³) and other frameworks/architectures (NITE, Carletta et al. 2003, GATE⁴).

References:

T. Brants and O. Plaehn. (2000). Interactive Corpus Annotation. In Proc. of LREC 2000. Athens, Greece.

J. Carletta, S. Evert, U. Heid, J. Kilgour, J. Robertson and H. Voormann (2003). "The NITE XML Toolkit: Flexible Anno-tation for Multi-modal Language Data”. Behavior Research Methods, Instruments and Computers 35(3), 353-363.

S. Dipper (2005). “XML-Based Stand-Off Representation and Exploitation of Multi-Level Linguistic Annotation”. In Proc. of Berliner XML Tage 2005 (BXML 2005). Berlin, Germany, 39-50.

C. Müller and M. Strube (2006). “Multi-Level Annotation of Linguistic Data with MMAX2”. In S. Braun, K. Kohn and J. Mukherje (eds), Corpus Technology and Language Pedagogy. Peter Lang, 197–214.

M. O’Donnell (2000). “RSTTool 2.4 - A Markup Tool for Rhetorical Structure Theory”. In Proc. International Natural Language Generation Conference (INLG'2000), Mitzpe Ramon, Israel, 253–256.

C. Orasan (2003). “Palinka: A Highly Customisable Tool for Discourse Annotation”. In Proc. of the 4th SIGdial Workshop on Discourse and Dialogue. Sapporo, Japan.

T. Schmidt (2004). “Transcribing and Annotating Spoken Language with Exmaralda”. In Proc. of the LREC- Workshop on XML Based Richly Annotated Corpora, Lisbon 2004. Paris: ELRA.

R. Stuart, G. Aumann and S. Bird (2007). “Managing Fieldwork Data with Toolbox and the Natural Language Toolkit”. Language Documentation & Conservation 1(1), 44–57.

I. H. Witten and E. Frank (2005). Data mining: Practical Machine Learning Tools and Techniques, 2nd Ed. San Francisco: Morgan Kaufman.

-------------------

1 http://www.exmaralda.org/exakt.html

2 http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/

3 http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPUserManual/HTML/

4 http://gate.ac.uk

On Chinese learners’ use of English tenses: A corpus-based study

Xuemei Zhang and Yingying Yang

1. In what kind of context or time framework will Chinese learner perform better in the aspect of tense consistency?

2. Do the Chinese temporal expressions affect students’ use of tense consistency?

References:

Bardovi-Harlig, K. (2000). Tense and Aspect in Second Language Acquisition: Form, Meaning, and Use. Oxford: Blackwell

Bardovi-Harlig, K. (1999). “From morpheme studies to temporal semantics”. Studies in Second Language Acquisition, 2, 341–382.

Capturing the generic structure of French linguistic articles with a focus on the core features of the genre

Berger, K. and C. Nord. (1999). Das Neue Testament und frühchristliche Schriften. Übersetzung und Kommentar. Frankfurt a. M.: Insel Verlag.

An integrated environment for extracting and translating collocations

References :

New developments in optimization of linguistic corpus research:

TextAnalyst and NetXtract

Victoria Tuzlukova

A corpus-based discourse analysis of global warming in British, American and Chinese newspapers

Fang Wang

Annotated corpus of Karelian dialects: An interim report

Dmitri Evmenov

Compiling, annotating and publishing corpora in DK-CLARIN, the Danish incarnation of the pan-European initiative for a common resource infrastructure

Jakob Halskov and Jørg Asmussen

'Diamonds may not be forever'. Keyness in discourse conflict: Identifying indicators of persuasive stance.

Ersilia Incelli

An analysis of Japanese EFL learners' use and management of discourse strategies across different proficiency levels

Aika Miura

Towards a discourse corpus of Czech

Lucie Mladová, Zuzanna Bedrichová, Šárka Zikánová and Eva Hajicová

A corpus-based computational model of the Quran using Frame Semantics

Remission among the Mentally Ill: An intertextual study of essentialist discourse in psychiatry

Laura Strakova

Corpus-based analysis of 'Master' in translations of The Analects

Wang Yuanting