UCREL's Variant Spelling Research
Origins of the research:
In January 2003, Dawn Archer and Paul Rayson began to explore the
feasibility of retraining the
UCREL Semantic Analysis System (USAS) so
that it was applicable to historical texts dating from 1600 onwards
(Archer et al. 2003). Previously, USAS had been used to
annotate written and spoken English from the modern period. Our initial
experiments suggested that the statistical model adopted in the
part-of-speech tagging component (CLAWS) was sufficiently robust to
cope with grammatical variation across time, but spelling variants
caused difficulty, as variants of known words were not recognized by
the system. Our response was to implement a prototype spelling detector
as a pre-processing step in USAS. This work continued in the
WordHoard
project funded by the Mellon Foundation with Prof. Martin Mueller
(Northwestern University, Chicago), in which we linguistically
annotated the Nameless Shakespeare corpus and Chadwyck-Healey's
Eighteenth and Nineteenth Century Fiction corpora. The approach we
adopted was to produce a list of variant spellings, which we
manually matched to normalized forms. A Variant Detector computer program
(VARD) inserts modern equivalents of these forms when they appear
in a given text while preserving the original variant. This approach
proved to be very effective: we have identified more than 45,800 variants in
our analyses of different historical texts to date, and have undertaken
a preliminary empirical study of spelling variation across the
16th-19th centuries based on 4,000 of these variants (Archer & Rayson
2004).
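The lookup-and-insert step described above can be illustrated with a minimal sketch. This is not the VARD implementation: the three-entry variant list, the function name normalise, and the <reg orig="..."> markup are all assumptions chosen purely for illustration.

```python
import re

# Hypothetical hand-built variant list: historical spelling -> modern form.
# A real list would hold tens of thousands of manually matched entries.
VARIANTS = {
    "loue": "love",
    "vertue": "virtue",
    "haue": "have",
}

# Simple alphabetic tokens; punctuation and spacing pass through untouched.
TOKEN = re.compile(r"[A-Za-z]+")


def normalise(text, variants=VARIANTS):
    """Insert modern equivalents of known variants, keeping the original.

    The tag format used here is an illustrative assumption, not VARD's.
    """
    def substitute(match):
        word = match.group(0)
        modern = variants.get(word.lower())
        if modern is None:
            return word  # not a known variant: leave as-is
        if word[0].isupper():
            modern = modern.capitalize()  # carry over initial capital
        return '<reg orig="{}">{}</reg>'.format(word, modern)

    return TOKEN.sub(substitute, text)


print(normalise("Loue and vertue haue power."))
```

Keeping the original spelling as an attribute means that downstream tools such as a tagger see the modern form, while the source reading remains recoverable.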
Continuation of the work: For more recent information on the
research, see the links and publications below.
Information about the VARD2 software is
available on Alistair Baron's VARD2 page.
This work has been supported by grants from the British Academy (Scragg Revisited),
the Mellon Foundation (WordHoard), an EPSRC PhD Plus grant for Alistair Baron,
and more recently by AHRC/ESRC in the Samuels project.
Coming soon! Paul Rayson and Alistair Baron have been awarded £7,000 by JISC for the crowdsourcing of Early English Books Online (EEBO) spelling regularisation through VARD.
We will shortly be putting out a call for volunteers (who will be paid from the grant) to carry out VARDing of EEBO phase 1 samples.
You can already sign up on our website in advance: ucrel-vardsourcing.lancs.ac.uk
Main contacts:
- Paul Rayson (School of Computing and Communications, Lancaster University)
- Dawn Archer (Department of Languages, Information and Communications, Manchester Metropolitan University)
- Alistair Baron (School of Computing and Communications, Lancaster University)
- Nick Smith (School of Education, University of Leicester)
- Mahmoud El-Haj (School of Computing and Communications, Lancaster University)
- Andrew Moore (School of Computing and Communications, Lancaster University)
Related publications
- Archer, D., Kytö, M., Baron, A. & Rayson, P. (2015). "Guidelines for normalising Early Modern English corpora: decisions and
justifications." ICAME Journal: 39.
doi: 10.1515/icame-2015-0001
- Alexander, M., Dallachy, F., Piao, S., Baron, A., & Rayson, P. (2015). Metaphor, popular science and semantic tagging: distant reading
with the Historical Thesaurus of English. Digital Scholarship in the Humanities, 30(Suppl. 1).
doi: 10.1093/llc/fqv045
- Rayson, P., Baron, A., Piao, S., & Wattam, S. (2015). Large-scale time-sensitive semantic analysis of historical corpora. Abstract from
The 36th Meeting of ICAME (International Computer Archive of Modern and Medieval English), Trier, Germany.
- Tagg, C., Baron, A., & Rayson, P. (2012). "i didn't spel that wrong did i. Oops": Analysis and normalisation of SMS spelling variation.
Lingvisticae Investigationes, 35(2), 367-388.
doi: 10.1075/li.35.2.12tag
- Rayson, P. and Baron, A. (2011). Automatic error tagging
of spelling mistakes in learner corpora. In Meunier F., De Cock S.,
Gilquin G. and Paquot M. (eds.) A Taste for Corpora. In honour of
Sylviane Granger, Studies in Corpus Linguistics, 45. John
Benjamins, Amsterdam.
- Baron, A., Tagg, C., Rayson, P., Greenwood, P., Walkerdine, J. and
Rashid, A. (2011). Using verifiable author data: Gender and spelling
differences in Twitter and SMS. Presented at ICAME 32, Oslo,
Norway, 1-5 June 2011.
- Baron, A., Rayson, P. and Archer, D. (2011). Quantifying Early Modern
English spelling variation: Change over time and genre. Presented at
Conference on New Methods in Historical Corpora, University of
Manchester 29-30 April 2011.
- Lehto, A., Baron, A., Ratia, M. and Rayson, P. (2010). Improving the
precision of corpus methods: The standardized version of Early Modern
English Medical Texts. In Taavitsainen, I. and Pahta, P. (eds.)
Early Modern English Medical Texts: Corpus description and
studies, pp. 279-290. John Benjamins, Amsterdam.
- Tagg, C., Baron, A. and Rayson, P. (2010). "I didn't spel that wrong
did i. Oops": Analysis and standardisation of SMS spelling variation.
Presented at ICAME 31, Giessen, Germany, 26-30 May 2010.
- Baron, A., Rayson, P. and Archer, D. (2009). Word frequency and key
word statistics in historical corpus linguistics. In
Anglistik: International Journal of English
Studies, 20 (1), pp. 41-67.
- Baron, A. and Rayson, P. (2009). Automatic standardization of texts
containing spelling variation, how much training data do you need? In
M. Mahlberg, V. González-Díaz and C. Smith (eds.) Proceedings of
the Corpus Linguistics Conference, CL2009, University of
Liverpool, UK, 20-23 July 2009.
- Baron, A., Rayson, P. and Archer, D. (2009). Automatic Standardization
of Spelling for Historical Text Mining. In Proceedings of Digital
Humanities 2009, University of Maryland, USA, 22-25 June 2009.
- Baron, A., Rayson, P. and Archer, D. (2009). The extent of spelling
variation in Early Modern English. Presented at ICAME 30,
Lancaster University, UK, 27-31 May 2009.
- Rayson, P., Archer, D., Baron, A. and Smith, N. (2008).
Travelling Through Time with Corpus Annotation Software.
In Lewandowska-Tomaszczyk, B. (ed) Corpus Linguistics, Computer Tools, and Applications -
State of the Art. PALC 2007. Peter Lang, Frankfurt am Main,
pp. 29-46.
- Baron, A. and Rayson, P. (2008). VARD2: A tool for dealing
with spelling variation in historical corpora.
In proceedings of the
Postgraduate Conference in Corpus Linguistics, Aston University,
Birmingham, 22nd May 2008.
- Pilz, T., Ernst-Gerlach, A., Kempken, S., Rayson, P. and Archer, D. (2008).
The identification of spelling variants in English and German historical texts:
manual or automatic?
Literary and Linguistic Computing, 23, 1, pp. 65-72.
doi:10.1093/llc/fqm044
- Rayson, P., Archer, D., Baron, A., Culpeper, J. and
Smith, N. (2007).
Tagging the Bard: Evaluating the accuracy of a modern POS tagger
on Early Modern English corpora.
In proceedings of Corpus Linguistics 2007, July 27-30, University of
Birmingham, UK.
- Rayson, P., Archer, D., Baron, A. and Smith, N. (2007).
Tagging historical corpora - the problem of spelling variation.
In proceedings of Digital Historical Corpora,
Dagstuhl-Seminar 06491,
International Conference and Research Center for
Computer Science, Schloss Dagstuhl, Wadern, Germany,
December 3rd-8th 2006.
(http://drops.dagstuhl.de/opus/volltexte/2007/1055/)
ISSN 1862-4405.
- Archer, D., Ernst-Gerlach, A., Kempken, S., Pilz, T., and Rayson, P.,
(2006), "The identification of spelling variants in English and German
historical texts: manual or automatic?", In proceedings of Digital
Humanities 2006, The Sorbonne, Centre Cultures Anglophones et
Technologies de l'Information, Paris, France, July 5-9, 2006, pp. 3-5.
- Archer, D., Culpeper, J. and Rayson, P. (2005)
Love - a familiar or a devil? An exploration of key domains in Shakespeare’s
Comedies and Tragedies.
Presented at the AHRC ICT Methods Network Expert Seminar on Linguistics.
Lancaster University, 8 September 2005.
- Rayson, P., Archer, D., Smith, N., (2005), VARD versus WORD: A
comparison of the UCREL variant detector and modern spellcheckers on
English historical corpora. In Proceedings of Corpus Linguistics 2005,
Birmingham University, July 14-17, Proceedings from the Corpus
Linguistics Conference Series on-line e-journal, Vol. 1, no. 1, ISSN
1747-9398.
- Archer, D., Culpeper, J., Rayson, P., (2005), "Love - a familiar or a
devil? An exploration of key domains in Shakespeare's Comedies and
Tragedies", Presented as part of the Keyword Extraction in Information
Retrieval panel at the ACH/ALLC Conference, June 15 - 18, 2005,
Victoria, BC, Canada.
- Mueller, M. (2005). The Nameless Shakespeare.
Working Papers from the First and Second Canadian Symposium on Text Analysis Research (CaSTA).
Computing in the Humanities Working Papers (CHWP 34).
- Archer, D., McEnery, T., Rayson, P., Hardie, A. (2003).
Developing an automated semantic analysis system for Early Modern English.
In
Proceedings of the Corpus Linguistics 2003 conference. UCREL technical paper
number 16. UCREL, Lancaster University, pp. 22 - 31.
Related presentations
- Rayson, P. (2007).
Travelling through time with corpus annotation software.
Invited talk at the 10th Practical Applications in Language and Computers
(PALC 2007) conference.
19-22 April 2007, Lodz University, Poland.
- Archer, D. and Rayson, P. (2006).
Teaching a computer to read Shakespeare - the problem of spelling variation.
Invited talk at the OED Forum, Oxford University Press and Kellogg College, Oxford, 21 June 2006.
- Archer, D. and Rayson, P. (2006)
Dealing with variation: spelling.
Invited talk at SCOTS Symposium on Linguistic Variation and Electronic Projects.
University of Glasgow, 28th April 2006.
- Archer, D. and Rayson, P. (2004)
Using an historical semantic tagger as a diagnostic tool for variation in spelling.
Presented at
Thirteenth International Conference on English Historical Linguistics
(ICEHL 13)
University of Vienna, Austria
23-29 August, 2004.