UCREL's Variant Spelling Research

Origins of the research: In January 2003, Dawn Archer and Paul Rayson began to explore the feasibility of retraining the UCREL Semantic Analysis System (USAS) so that it could be applied to historical texts dating from 1600 onwards (Archer et al. 2003). Previously, USAS had been used to annotate written and spoken English from the modern period. Our initial experiments suggested that the statistical model adopted in the part-of-speech tagging component (CLAWS) was sufficiently robust to cope with grammatical variation across time, but spelling variants caused difficulty, because the system did not recognize variants of known words. Our response was to implement a prototype spelling detector as a pre-processing step in USAS. This work continued in the WordHoard project, funded by the Mellon Foundation, with Prof. Martin Mueller (Northwestern University, Chicago), in which we linguistically annotated the Nameless Shakespeare corpus and Chadwyck-Healey's Eighteenth and Nineteenth Century Fiction corpora. The approach we adopted was to produce a list of variant spellings, which we manually matched to normalized forms. A Variant Detector computer program (VARD) inserts the modern equivalents of these forms when they appear in a given text, while preserving the original variant. This approach proved to be very effective: we have identified more than 45,800 variants from our analyses of different historical texts to date, and have undertaken a preliminary empirical study of spelling variation across the 16th-19th centuries based on 4,000 of these variants (Archer & Rayson 2004).
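
The list-based approach described above can be illustrated with a minimal Python sketch. It is not the VARD implementation: the three variant entries, the `normalise` function name and the `modern[orig=variant]` output format are all assumptions made for illustration. It only shows the general idea of looking each word up in a manually compiled variant list and inserting the modern form alongside the original spelling.

```python
import re

# Hypothetical variant list: historical spelling -> modern equivalent.
# These entries are illustrative only, not taken from the UCREL list.
VARIANTS = {
    "loue": "love",
    "vertue": "virtue",
    "doe": "do",
}

# Treat any run of ASCII letters as a word (a deliberate simplification).
TOKEN = re.compile(r"[A-Za-z]+")


def normalise(text, variants=VARIANTS):
    """Insert the modern form of each known variant, keeping the original."""

    def replace(match):
        word = match.group(0)
        modern = variants.get(word.lower())
        if modern is None:
            return word  # not a listed variant: leave the word untouched
        if word[0].isupper():
            modern = modern.capitalize()  # mirror initial capitalisation
        # Assumed output format: modern form followed by the original spelling.
        return f"{modern}[orig={word}]"

    return TOKEN.sub(replace, text)


print(normalise("Loue and vertue doe not fade."))
# prints: Love[orig=Loue] and virtue[orig=vertue] do[orig=doe] not fade.
```

A plain lookup of this kind treats every occurrence of a listed form as a variant, so a form such as "doe", which is also a valid modern word, would always be replaced. Handling such cases needs context that this sketch does not model.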

Continuation of the work: For more recent information on the research, see the links and publications below. Information about the VARD2 software is available on Alistair Baron's VARD2 page.

This work has been supported by grants from the British Academy (Scragg Revisited), the Mellon Foundation (WordHoard), an EPSRC PhD Plus grant for Alistair Baron and, more recently, the AHRC/ESRC through the Samuels project.

Coming soon! Paul Rayson and Alistair Baron have been awarded £7,000 by JISC for the crowdsourcing of Early English Books Online (EEBO) spelling regularisation through VARD. We will shortly be putting out a call for volunteers (who will be paid from the grant) to carry out VARDing of EEBO phase 1 samples. You can already sign up on our website in advance: ucrel-vardsourcing.lancs.ac.uk

Main contacts:

Paul Rayson (School of Computing and Communications, Lancaster University)
Dawn Archer (Department of Languages, Information and Communications, Manchester Metropolitan University)
Alistair Baron (School of Computing and Communications, Lancaster University)
Nick Smith (School of Education, University of Leicester)
Mahmoud El-Haj (School of Computing and Communications, Lancaster University)
Andrew Moore (School of Computing and Communications, Lancaster University)

Related publications

Archer, D., Kytö, M., Baron, A. & Rayson, P. (2015). "Guidelines for normalising Early Modern English corpora: decisions and justifications." ICAME Journal: 39. doi: 10.1515/icame-2015-0001
Alexander, M., Dallachy, F., Piao, S., Baron, A., & Rayson, P. (2015). Metaphor, popular science and semantic tagging: distant reading with the Historical Thesaurus of English. Digital Scholarship in the Humanities, 30(Suppl. 1). doi: 10.1093/llc/fqv045
Rayson, P., Baron, A., Piao, S., & Wattam, S. (2015). Large-scale time-sensitive semantic analysis of historical corpora. Abstract from The 36th Meeting of ICAME (International Computer Archive of Modern and Medieval English), Trier, Germany.
Tagg, C., Baron, A., & Rayson, P. (2012). "i didn't spel that wrong did i. Oops": Analysis and normalisation of SMS spelling variation. Lingvisticae Investigationes, 35(2), 367-388. doi: 10.1075/li.35.2.12tag
Rayson, P. and Baron, A. (2011). Automatic error tagging of spelling mistakes in learner corpora. In Meunier F., De Cock S., Gilquin G. and Paquot M. (eds.) A Taste for Corpora. In honour of Sylviane Granger, Studies in Corpus Linguistics, 45. John Benjamins, Amsterdam.
Baron, A., Tagg, C., Rayson, P., Greenwood, P., Walkerdine, J. and Rashid, A. (2011). Using verifiable author data: Gender and spelling differences in Twitter and SMS. Presented at ICAME 32, Oslo, Norway, 1-5 June 2011.
Baron, A., Rayson, P. and Archer, D. (2011). Quantifying Early Modern English spelling variation: Change over time and genre. Presented at Conference on New Methods in Historical Corpora, University of Manchester 29-30 April 2011.
Lehto, A., Baron, A., Ratia, M. and Rayson, P. (2010). Improving the precision of corpus methods: The standardized version of Early Modern English Medical Texts. In Taavitsainen, I. and Pahta, P. (eds.) Early Modern English Medical Texts: Corpus description and studies, pp. 279-290. John Benjamins, Amsterdam.
Tagg, C., Baron, A. and Rayson, P. (2010). "I didn't spel that wrong did i. Oops": Analysis and standardisation of SMS spelling variation. Presented at ICAME 31, Giessen, Germany, 26-30 May 2010.
Baron, A., Rayson, P. and Archer, D. (2009). Word frequency and key word statistics in historical corpus linguistics. In Anglistik: International Journal of English Studies, 20 (1), pp. 41-67.
Baron, A. and Rayson, P. (2009). Automatic standardization of texts containing spelling variation, how much training data do you need? In M. Mahlberg, V. González-Díaz and C. Smith (eds.) Proceedings of the Corpus Linguistics Conference, CL2009, University of Liverpool, UK, 20-23 July 2009.
Baron, A., Rayson, P. and Archer, D. (2009). Automatic Standardization of Spelling for Historical Text Mining. In Proceedings of Digital Humanities 2009, University of Maryland, USA, 22-25 June 2009.
Baron, A., Rayson, P. and Archer, D. (2009). The extent of spelling variation in Early Modern English. Presented at ICAME 30, Lancaster University, UK, 27-31 May 2009.
Rayson, P., Archer, D., Baron, A. and Smith, N. (2008). Travelling Through Time with Corpus Annotation Software. In Lewandowska-Tomaszczyk, B. (ed) Corpus Linguistics, Computer Tools, and Applications - State of the Art. PALC 2007. Peter Lang, Frankfurt am Main, pp. 29-46.
Baron, A. and Rayson, P. (2008). VARD2: A tool for dealing with spelling variation in historical corpora. In proceedings of the Postgraduate Conference in Corpus Linguistics, Aston University, Birmingham, 22nd May 2008.
Pilz, T., Ernst-Gerlach, A., Kempken, S., Rayson, P. and Archer, D. (2008) The identification of spelling variants in English and German historical texts: manual or automatic? Literary and Linguistic Computing, 23, 1, pp. 65-72. doi:10.1093/llc/fqm044
Rayson, P., Archer, D., Baron, A., Culpeper, J. and Smith, N. (2007). Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In proceedings of Corpus Linguistics 2007, July 27-30, University of Birmingham, UK.
Rayson, P., Archer, D., Baron, A. and Smith, N. (2007). Tagging historical corpora - the problem of spelling variation. In proceedings of Digital Historical Corpora, Dagstuhl-Seminar 06491, International Conference and Research Center for Computer Science, Schloss Dagstuhl, Wadern, Germany, December 3rd-8th 2006. (http://drops.dagstuhl.de/opus/volltexte/2007/1055/) ISSN 1862-4405.
Archer, D., Ernst-Gerlach, A., Kempken, S., Pilz, T., and Rayson, P., (2006), "The identification of spelling variants in English and German historical texts: manual or automatic?", In proceedings of Digital Humanities 2006, The Sorbonne, Centre Cultures Anglophones et Technologies de l'Information, Paris, France, July 5 - 9, 2006, pp. 3 - 5.
Archer, D., Culpeper, J. and Rayson, P. (2005) Love - a familiar or a devil? An exploration of key domains in Shakespeare’s Comedies and Tragedies. Presented at the AHRC ICT Methods Network Expert Seminar on Linguistics. Lancaster University, 8 September 2005.
Rayson, P., Archer, D., Smith, N., (2005), VARD versus WORD: A comparison of the UCREL variant detector and modern spellcheckers on English historical corpora. In Proceedings of Corpus Linguistics 2005, Birmingham University, July 14-17, Proceedings from the Corpus Linguistics Conference Series on-line e-journal, Vol. 1, no. 1, ISSN 1747-9398.
Archer, D., Culpeper, J., Rayson, P., (2005), "Love - a familiar or a devil? An exploration of key domains in Shakespeare's Comedies and Tragedies", Presented as part of the Keyword Extraction in Information Retrieval panel at the ACH/ALLC Conference, June 15 - 18, 2005, Victoria, BC, Canada.
Mueller, M. (2005). The Nameless Shakespeare. Working Papers from the First and Second Canadian Symposium on Text Analysis Research (CaSTA). Computing in the Humanities Working Papers (CHWP 34).
Archer, D., McEnery, T., Rayson, P., Hardie, A. (2003). Developing an automated semantic analysis system for Early Modern English. In Proceedings of the Corpus Linguistics 2003 conference. UCREL technical paper number 16. UCREL, Lancaster University, pp. 22 - 31.
