The first Corpus Analysis with Noise in the Signal (CANS 2013) workshop will be held on July 22nd 2013 at the seventh international Corpus Linguistics conference (CL2013), Lancaster University, UK. The schedule is now available, with six papers to be presented, along with a round-table discussion and software demonstrations. Registration for the workshop (and the CL2013 conference) is open until 30th June 2013.

The workshop will bring together studies which highlight and quantify the impact of noisy textual data on corpus-based research and/or present methods to negate the effect of this noise. We consider a corpus with "noisy textual data" to be one that contains substantial amounts of non-standard words, i.e. words which would not appear in a typical lexicon for the corpus's base language. Examples include:

Participants will be able to gain insights into the characteristics of the noise in different language varieties, the effect of the noise on different corpus linguistic techniques and different methods to either negate the noise or to produce more robust tools that can accurately process noisy textual data.


This is a draft schedule, subject to change. A PDF version is also available.

2:00 - 2:30pm Introduction and VARD 2.5
2:30 - 3:00pm Marcel Bollmann
Spelling normalization of historical German with sparse training data (abstract)
3:00 - 3:30pm Felix Bildhauer & Roland Schäfer
Token-level noise in large Web corpora and nondestructive normalization for linguistic applications (abstract)
3:30 - 4:00pm Tea break & software demonstrations
4:00 - 4:30pm Elena Klyachko, Timofey Arkhangelskiy, Olesya Kisselev & Ekaterina Rakhilina
Automatic error detection in Russian learner language (abstract)
4:30 - 4:45pm Turo Hiltunen & Jukka Tyrkkö
Tagging Early Modern English Medical Texts (1500-1700) (abstract)
4:45 - 5:00pm Verena Möller
Retrieving passive structures from the Secondary-Level Corpus of Learner English (SCooLE) - How can we make part-of-speech tagging more successful? (abstract)
5:00 - 6:00pm Discussion

Call for Papers

Please note that submission is now closed.

Whilst many widely-used corpora include mainly standard written text on which a range of automatic corpus analysis and Natural Language Processing (NLP) techniques can be accurately performed, an increasing number of corpora contain substantial amounts of noisy textual data and irregular language. Such corpora range from relatively small specialised historical corpora (e.g. Early Modern English Medical Texts (EMEMT)) and second language learner corpora (e.g. French Learner Language Oral Corpora (FLLOC)) to very large datasets such as the transcribed Early English Books Online collection (EEBO-TCP), large collections of OCRed books (e.g. from Google Books) and the very large corpora being crawled from the web (e.g. from Twitter, and Web as Corpus). These non-standard language varieties can cause significant issues for corpus analysis tools, which in the majority of cases are set up and trained to deal with clean standard texts.

Our response to some of these issues has been the development of a Variant Detector tool (VARD2). Originally developed to normalize spelling variants within historical English datasets, VARD2 has since been adapted for use with SMS, Twitter, child language, learner corpora, other languages, etc. The purpose of this workshop is to provide a forum in which we can discuss, and compare, our approach with other researchers' approaches to noise. This may include work in which researchers have used and adapted VARD2, or have utilised new tools and methods.

We invite submissions to present research highlighting the impact of noisy textual data on corpus-based research and/or providing methods to negate the effect of such noise. We are interested in research concerning any corpora with substantial textual noise and are particularly keen to have a range of languages and noise sources represented at the workshop.

Noise sources may include but are not limited to:

Topics of interest include but are not limited to:

Two types of submission are sought: full paper presentations and shorter work-in-progress reports. For full papers we require an extended abstract of 1,000-2,000 words. For work-in-progress reports we require shorter abstracts of 500-1,000 words. The (extended) deadline for submitting abstracts is 1st March 2013. Abstracts will then be reviewed by the organising committee, and you will receive a response by 11th March 2013. The organising committee consists of:

Accepted full papers will be allocated 20 minutes + 5 minutes for questions; accepted work-in-progress reports will be allocated 10 minutes + 5 minutes for questions. The remaining time will include an open discussion of the papers presented and general topics such as:

We expect to select papers from the workshop for a peer-reviewed journal special issue.


Papers should be submitted to, and should use the same guidelines and template as those for the main CL2013 conference, with the exception of text length restrictions. Two types of submissions are sought:

All abstracts must be formatted according to the CL2013 style-sheet, which can be downloaded from the links below. These template files both explain and exemplify the format that you should use.

The (extended) deadline for abstract submission is 1st March 2013. Abstracts will then be reviewed by the organising committee, and you will receive a response by 11th March 2013.

In line with the policy of the conference organisers, you are welcome to submit abstracts both for this workshop and for the main Corpus Linguistics 2013 conference. However, if you give two papers they should be different, without substantial overlap.

Key dates

1st March 2013
Abstract submission deadline

11th March 2013
Notification of acceptance

15th April 2013
Early bird registration closes

30th June 2013
Final deadline for registration

22nd July 2013
Day of workshop


Alistair Baron
Lancaster University

Paul Rayson
Lancaster University

Dawn Archer
University of Central Lancashire


