CANS 2013: Corpus Analysis with Noise in the Signal

About

The first Corpus Analysis with Noise in the Signal (CANS 2013) workshop will be held on 22nd July 2013 at the seventh international Corpus Linguistics conference (CL2013), Lancaster University, UK. The schedule is now available, with six papers to be presented, along with a round-table discussion and software demonstrations. Registration for the workshop (and the CL2013 conference) is open until 30th June 2013.

The workshop will bring together studies which highlight and quantify the impact of noisy textual data on corpus-based research and/or present methods to negate the effect of this noise. We consider a corpus with "noisy textual data" to be one that contains substantial amounts of non-standard words, i.e. words which would not appear in a typical lexicon for the corpus's base language. Examples include:

Historical corpora from periods of a language when the orthography had not been widely standardised, hence containing large amounts of spelling variation.
Corpora of computer-mediated language varieties (chatroom, SMS, social media, blogs, etc.), where irregular orthography is prevalent for a variety of reasons (language play, abbreviations, typing errors, etc.).
Learner corpora (first language and second language) in which misspellings (and other language errors) are frequent.
Inaccurately digitised texts, e.g. badly OCRed or badly transcribed corpora.
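
As a rough illustration of the definition above, the share of tokens missing from a lexicon can serve as a simple indicator of how noisy a text is. The sketch below is a minimal example only: the function name, the toy lexicon and the sample sentences are hypothetical, and a real study would need a full lexicon and a tokeniser suited to the corpus.

```python
import re

def out_of_lexicon_rate(text, lexicon):
    """Return the fraction of word tokens that are not in the lexicon."""
    # Very simple tokenisation: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    unknown = [t for t in tokens if t not in lexicon]
    return len(unknown) / len(tokens)

# Hypothetical toy lexicon; a real one would hold many thousands of entries.
LEXICON = {"the", "king", "loves", "his", "people"}

modern = "The king loves his people"
historical = "The kynge loueth his people"

print(out_of_lexicon_rate(modern, LEXICON))      # 0.0
print(out_of_lexicon_rate(historical, LEXICON))  # 0.4
```

Here the two spelling variants "kynge" and "loueth" account for two of the five tokens in the second sentence, so 40% of its tokens count as non-standard under this measure.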

Participants will gain insights into the characteristics of noise in different language varieties, the effect of that noise on different corpus linguistic techniques, and methods either to negate the noise or to produce more robust tools that can accurately process noisy textual data.

Schedule

This is a draft schedule, subject to change. A PDF version is also available.

2:00 - 2:30pm	Introduction and VARD 2.5
2:30 - 3:00pm	Marcel Bollmann: Spelling normalization of historical German with sparse training data (abstract)
3:00 - 3:30pm	Felix Bildhauer & Roland Schäfer: Token-level noise in large Web corpora and nondestructive normalization for linguistic applications (abstract)
3:30 - 4:00pm	Tea break & software demonstrations
4:00 - 4:30pm	Elena Klyachko, Timofey Arkhangelskiy, Olesya Kisselev & Ekaterina Rakhilina: Automatic error detection in Russian learner language (abstract)
4:30 - 4:45pm	Turo Hiltunen & Jukka Tyrkkö: Tagging Early Modern English Medical Texts (1500-1700) (abstract)
4:45 - 5:00pm	Verena Möller: Retrieving passive structures from the Secondary-Level Corpus of Learner English (SCooLE) - How can we make part-of-speech tagging more successful? (abstract)
5:00 - 6:00pm	Discussion

Call for Papers

Please note that submission is now closed

Whilst many widely-used corpora include mainly standard written text on which a range of automatic corpus analysis and Natural Language Processing (NLP) techniques can be accurately performed, an increasing number of corpora contain substantial amounts of noisy textual data and irregular language. Such corpora range from relatively small specialised historical corpora (e.g. Early Modern English Medical Texts (EMEMT)) and second language learner corpora (e.g. French Learner Language Oral Corpora (FLLOC)) to very large datasets such as the transcribed Early English Books Online collection (EEBO-TCP), large collections of OCRed books (e.g. from Google Books) and the very large corpora being crawled from the web (e.g. from Twitter, and Web as Corpus). These non-standard language varieties can cause significant issues for corpus analysis tools, which in the majority of cases are set up and trained to deal with clean standard texts.

Our response to some of these issues has been the development of a Variant Detector tool (VARD2). Originally developed to normalize spelling variants within historical English datasets, VARD2 has since been adapted for use with SMS, Twitter, child language, learner corpora, other languages, etc. The purpose of this workshop is to provide a forum in which we can discuss, and compare, our approach with other researchers' approaches to noise. This may include work where researchers have used and adapted VARD2, or have utilised new tools and methods.

We invite submissions to present research highlighting the impact of noisy textual data on corpus-based research and/or providing methods to negate the effect of such noise. We are interested in research concerning any corpora with substantial textual noise and are particularly keen to have a range of languages and noise sources represented at the workshop.

Noise sources may include but are not limited to:

Historical spelling variation
Computer-mediated language varieties (e.g. chatroom, SMS, social networks, blogs, Twitter, etc.)
First and second language learner corpora
Inaccurately digitised texts, e.g. badly OCRed or badly transcribed corpora
Idiosyncratic language usage/idiolect features

Topics of interest include but are not limited to:

Evaluations of established corpus analysis methodology when processing noisy corpora.
Methods for pre-processing noise in corpora, such as spelling normalisation and error correction.
Development of noise-aware corpus analysis methods which are robust enough to deal with noisy corpora and process them with accuracy, e.g. new automatic part-of-speech taggers.
Analyses of the characteristics and trends of spelling variation and language irregularities.
Studies which highlight the importance of maintaining original spellings and language irregularities and how these can assist in some aspects of corpus analysis.

Two types of submissions are sought: full paper presentations and shorter work-in-progress reports. For full papers we require an extended abstract of 1,000-2,000 words. For work-in-progress reports we require shorter abstracts of 500-1,000 words. The (extended) deadline for submitting abstracts is 1st March 2013. Abstracts will then be reviewed by the organising committee, and you will receive a response by 11th March 2013. The organising committee consists of:

Alistair Baron (Lancaster University)
Paul Rayson (Lancaster University)
Dawn Archer (University of Central Lancashire)

Accepted full papers will be allocated 20 minutes + 5 minutes for questions; accepted work-in-progress reports will be allocated 10 minutes + 5 minutes for questions. The remaining time will include an open discussion of the papers presented and general topics such as:

What are the key challenges of dealing with noisy textual data going forward?
When should we leave "noise" where it is? And for what reason(s)?
What are the dangers of ignoring the noise?

We expect to select papers from the workshop for a peer-reviewed journal special issue.

Submission

Please note that submission is now closed

Papers should be submitted to cans2013@comp.lancs.ac.uk, and should use the same guidelines and template as those for the main CL2013 conference, with the exception of text length restrictions. Two types of submissions are sought:

Full papers: Extended abstract of 1,000-2,000 words.
Work-in-progress reports: Shorter abstract of 500-1,000 words.

All abstracts must be formatted according to the CL2013 style-sheet, which can be downloaded from the links below. These template files both explain and exemplify the format that you should use.

Download abstract template in Microsoft Word Document format (.doc)
Download abstract template in Microsoft Word 2007 Document format (.docx)
Download abstract template in Open Document Format (for OpenOffice.org) (.odt)
Download abstract template in Rich Text Format (.rtf)

The (extended) deadline for abstract submission is 1st March 2013. Abstracts will then be reviewed by the organising committee, and you will receive a response by 11th March 2013.

In line with the policy of the conference organisers, you are welcome to submit abstracts both for this workshop and for the main Corpus Linguistics 2013 conference. However, if you give two papers they should be different, without substantial overlap.

Key dates

1st March 2013
Abstract submission deadline

11th March 2013
Notification of acceptance

15th April 2013
Early bird registration closes

30th June 2013
Final deadline for registration

22nd July 2013
Day of workshop

Organisers

Alistair Baron
Lancaster University

Paul Rayson
Lancaster University

Dawn Archer
University of Central Lancashire

Links

Corpus Linguistics 2013

Variant spelling research

VARD2 - Variant Detector