UCREL research centre

Corpus Linguistics 2013

Lancaster University, UK – 22nd to 26th July 2013

Pre-conference workshops

The main conference will be preceded by a workshop day on Monday 22nd July, on which the pre-conference workshops summarised below will be offered.

More information -- including detailed workshop descriptions and a provisional schedule -- will be published on this webpage as it becomes available.

Workshop summaries

Evaluative Language and Corpus Linguistics

The term 'evaluative language' covers a wide range of language resources used to express subjectivity, value and opinion. Corpus approaches to the study of evaluative language range from Sentiment Analysis (e.g. Liu 2010) to the study of stance (e.g. Biber 2006) to the application of corpus techniques to the study of appraisal (e.g. Bednarek 2008). In this workshop we plan to present advances in the use of corpora to study evaluative language, focusing on three main areas: the relationship between corpus evidence and theoretical models relating to evaluative language, in particular using corpus investigation techniques to generate, test or enhance such models; the use of evaluative language in specific contexts; and applications of studies of evaluative language. The workshop will consist of a series of papers and a round-table discussion. More details...

Fifth Interdisciplinary Workshop on Corpus-Based Approaches to Figurative Language: Metaphor and Austerity

This fifth Interdisciplinary Workshop on Corpus-Based Approaches to Figurative Language will consist of a day-long colloquium including oral presentations, a poster session and a round-table discussion. It is the organizers' intention to showcase original research into the figurative language associated with Austerity in its many guises and in various spheres of life, and to stimulate interdisciplinary debate between established and early-career researchers who are investigating Austerity in corpus data. Proposals are welcome on any aspect of figurative language relevant to the central theme of Austerity, including, but not limited to, the economy, work and unemployment, immigration and asylum seeking, social inclusion and exclusion. Given the dominance of English in the literature on metaphor, research dealing with other languages will be particularly welcome, whether contrastive or otherwise. More details...

Workshop on Arabic Corpus Linguistics

Following on from the successful first WACL in 2011, this second event will again take place at Lancaster University. The aim of this series of workshops is to create a venue for exploring progress in the field of research into the Arabic language using corpora, from across the many areas of corpus linguistics and computational linguistics where the analysis of Arabic structure and usage is an active issue.

The scope of the workshop encompasses both (a) the design, construction and annotation of Arabic corpora, and (b) the use of corpora in research on the Arabic language -- in any relevant area, including (but not limited to!) lexis and lexicography, syntax, collocation, NLP systems and analysis tools, contrastive and historical studies, stylistics, and discourse analysis. All varieties of Arabic -- including the different Colloquial Arabics as well as Classical/Qur'anic and Modern Standard forms of the language -- are within the workshop's purview. More details...

Web as Corpus Workshop

Web corpora and other Web-derived data have become a gold mine for corpus linguistics and natural language processing. The Web is an easy source of unprecedented amounts of linguistic data from a broad range of registers and text types. However, a collection of Web pages is not immediately suitable for exploration in the same way a traditional corpus is.

The 8th Web as Corpus Workshop continues a highly successful series of yearly Web as Corpus workshops, and provides a venue for interested researchers to meet, share ideas and discuss the problems and possibilities of compiling and using Web corpora. With the return to its roots in the corpus linguistics community, the leading theme of this workshop is the application of Web data in language research, including linguistic evaluation of Web-derived corpora as well as strategies and tools for high-quality automatic annotation of Web text. More details...

Corpus Analysis with Noise in the Signal

This workshop will bring together studies which highlight and quantify the impact of noisy textual data on corpus-based research and/or present methods to negate the effect of this noise. We consider a corpus with "noisy textual data" to be one that contains substantial amounts of non-standard words, i.e. words which would not appear in a typical lexicon for the corpus's base language. Examples include:

  • Historical corpora from periods of a language when the orthography had not been widely standardised, hence containing large amounts of spelling variation.
  • Corpora of computer-mediated language varieties (chatroom, SMS, social media, blogs, etc.), where irregular orthography is prevalent for a variety of reasons (language play, abbreviations, typing errors, etc.).
  • Learner corpora (first language and second language) in which misspellings (and other language errors) are frequent.
  • Inaccurately digitised texts, e.g. badly OCRed or badly transcribed corpora.

Participants will be able to gain insights into the characteristics of the noise in different language varieties, the effect of the noise on different corpus linguistic techniques and different methods to either negate the noise or to produce more robust tools that can accurately process noisy textual data. More details...

Annotating correspondence corpora

This workshop will be of interest to researchers working with digital correspondence collections -- historical, contemporary, regional or professional, and derived from manuscript, typescript or email formats. Such collections are important resources for historians, industrial archivists, sociolinguists, discourse analysts and researchers in the fields of migrant and cultural studies, but many are unannotated or only partially annotated, and differences in the transcription and markup practices of different correspondence projects hamper interconnectivity and prevent the full use of modern corpus query methods.

The aims of the workshop are to:

  • investigate ways of organizing, interpreting, and using the various types of information embedded within correspondence;
  • bring together experts and novices in the field, those studying correspondence from different disciplinary perspectives, those who are working with established correspondence corpora, and those who are compiling new corpora, e.g. for their doctoral studies.

This will be a full-day workshop with five 30-minute presentations in the morning and hands-on practical sessions in the afternoon. The afternoon sessions will provide opportunities to examine existing annotation schemes, experiment with new types of annotation, for example GISTools, and try out various corpus query methods and visualisation techniques. More details...

Compiling and analysing a spoken academic corpus

The workshop will focus on the main issues connected with building a spoken academic corpus. The aim of the session is to equip the participants with transferable skills related to spoken academic corpus design and analysis which they can apply in their own contexts. In a series of practical exercises we will go through the individual stages of corpus design and explore new possibilities of corpus analysis including visualizations and statistics. The following areas will be covered: data gathering, transcription, annotation and corpus analysis.

The workshop will be of particular interest to researchers (both students and lecturers) and practitioners interested in spoken academic discourse in general as well as to those interested in individual lexico-grammatical patterns in academic speech. More details...

A Fully-annotated Pragmatic Corpus -- the SPICE-Ireland Corpus

SPICE-Ireland is an annotated version of the spoken component of ICE-Ireland, one of the national components comprising the International Corpus of English (cf. Greenbaum 1996). SPICE stands for 'Systems of Pragmatic Annotation in the Spoken Component of ICE-Ireland'. The SPICE-Ireland corpus comprises 600,000 words of transcribed spoken language, equally divided between Northern Ireland and the Republic of Ireland. In line with other ICE corpora, the spoken material in SPICE-Ireland contains 300 texts of 2,000 words each from 15 spoken discourse situations. The transcriptions are annotated for the following features:

  • utterance speech-act function
  • prosody (pitch movements)
  • utterance tags
  • discourse markers
  • quotatives

The aim of this workshop is to show that the SPICE-Ireland annotation scheme is without precedent in the information it provides for unravelling a speaker's pragmatic intent and for interpreting how a speaker accommodates face needs. More specifically, it aims:

  • to introduce participants to each of the annotation sets and to explain their motivation and rationale;
  • to share with participants some findings using the corpus;
  • to provide an opportunity for using the corpus with the help of some specifically-designed tasks and topics;
  • to provide consultation for any searches or analyses which participants might wish to undertake.
More details...

