UCREL's Variant Spelling Research

Origins of the research: In January 2003, Dawn Archer and Paul Rayson began to explore the feasibility of retraining the UCREL Semantic Analysis System (USAS) so that it could be applied to historical texts dating from 1600 onwards (Archer et al. 2003). Previously, USAS had been used to annotate written and spoken English from the modern period. Our initial experiments suggested that the statistical model adopted in the part-of-speech tagging component (CLAWS) was sufficiently robust to cope with grammatical variation across time, but spelling variants caused difficulty, as the system did not recognize variants of known words. Our response was to implement a prototype spelling detector as a pre-processing step in USAS. This work continued in the WordHoard project, funded by the Mellon Foundation, with Prof. Martin Mueller (Northwestern University, Chicago), in which we linguistically annotated the Nameless Shakespeare corpus and Chadwyck-Healey's Eighteenth and Nineteenth Century Fiction corpora. The approach we adopted was to produce a list of variant spellings, which we manually matched to normalized forms. A Variant Detector computer program (VARD) inserts modern equivalents of these forms when they appear in a given text while preserving the original variant. This approach has proved very effective: to date we have identified more than 45,800 variants from our analyses of different historical texts, and we have undertaken a preliminary empirical study of spelling variation across the 16th-19th centuries based on 4,000 of these variants (Archer & Rayson 2004).
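The list-lookup approach described above can be illustrated with a minimal sketch. This is not the VARD implementation: the variant entries, the `normalise` function and the `<reg orig="...">` markup are all assumptions made for illustration. It shows only the general idea of inserting a modern equivalent for each listed variant while keeping the original spelling alongside it.

```python
import re

# Hypothetical, hand-built variant list: illustrative entries only,
# not taken from the UCREL variant list.
VARIANTS = {
    "loue": "love",
    "haue": "have",
    "neuer": "never",
    "vp": "up",
}

# Match runs of ASCII letters; punctuation and spacing pass through unchanged.
WORD = re.compile(r"[A-Za-z]+")


def normalise(text, variants=VARIANTS):
    """Insert the modern form for each listed variant, keeping the original.

    The <reg orig="..."> markup is an assumption made for this sketch.
    """
    def replace(match):
        word = match.group(0)
        modern = variants.get(word.lower())
        if modern is None:
            return word  # not in the variant list: leave untouched
        if word[0].isupper():
            modern = modern.capitalize()  # carry over initial capitalisation
        return f'<reg orig="{word}">{modern}</reg>'

    return WORD.sub(replace, text)


print(normalise("I haue no loue for it"))
# I <reg orig="haue">have</reg> no <reg orig="loue">love</reg> for it
```

Keeping the original form in the markup means later processing (for example, tagging) can use the modern spelling, while the historical spelling remains available for study.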

Continuation of the work: For more recent information on the research, see the links and publications below. Information about the VARD2 software is available on Alistair Baron's VARD2 page.

This work has been supported by grants from the British Academy (Scragg Revisited), the Mellon Foundation (WordHoard), an EPSRC PhD Plus grant for Alistair Baron, and, more recently, by AHRC/ESRC in the Samuels project.

Coming soon! Paul Rayson and Alistair Baron have been awarded £7,000 by JISC for the crowdsourcing of Early English Books Online (EEBO) spelling regularisation through VARD. We will shortly be putting out a call for volunteers (who will be paid from the grant) to carry out VARDing of EEBO phase 1 samples. You can already sign up on our website in advance: ucrel-vardsourcing.lancs.ac.uk

Related publications

  1. Archer, D., Kytö, M., Baron, A. & Rayson, P. (2015). "Guidelines for normalising Early Modern English corpora: decisions and justifications." ICAME Journal: 39. doi: 10.1515/icame-2015-0001
  2. Alexander, M., Dallachy, F., Piao, S., Baron, A., & Rayson, P. (2015). Metaphor, popular science and semantic tagging: distant reading with the Historical Thesaurus of English. Digital Scholarship in the Humanities, 30(Suppl. 1). doi: 10.1093/llc/fqv045
  3. Rayson, P., Baron, A., Piao, S., & Wattam, S. (2015). Large-scale time-sensitive semantic analysis of historical corpora. Abstract from The 36th Meeting of ICAME (International Computer Archive of Modern and Medieval English), Trier, Germany.
  4. Tagg, C., Baron, A., & Rayson, P. (2012). "i didn't spel that wrong did i. Oops": Analysis and normalisation of SMS spelling variation. Lingvisticae Investigationes, 35(2), 367-388. doi: 10.1075/li.35.2.12tag
  5. Rayson, P. and Baron, A. (2011). Automatic error tagging of spelling mistakes in learner corpora. In Meunier F., De Cock S., Gilquin G. and Paquot M. (eds.) A Taste for Corpora. In honour of Sylviane Granger, Studies in Corpus Linguistics, 45. John Benjamins, Amsterdam.
  6. Baron, A., Tagg, C., Rayson, P., Greenwood, P., Walkerdine, J. and Rashid, A. (2011). Using verifiable author data: Gender and spelling differences in Twitter and SMS. Presented at ICAME 32, Oslo, Norway, 1-5 June 2011.
  7. Baron, A., Rayson, P. and Archer, D. (2011). Quantifying Early Modern English spelling variation: Change over time and genre. Presented at Conference on New Methods in Historical Corpora, University of Manchester 29-30 April 2011.
  8. Lehto, A., Baron, A., Ratia, M. and Rayson, P. (2010). Improving the precision of corpus methods: The standardized version of Early Modern English Medical Texts. In Taavitsainen, I. and Pahta, P. (eds.) Early Modern English Medical Texts: Corpus description and studies, pp. 279-290. John Benjamins, Amsterdam.
  9. Tagg, C., Baron, A. and Rayson, P. (2010). "I didn't spel that wrong did i. Oops": Analysis and standardisation of SMS spelling variation. Presented at ICAME 31, Giessen, Germany, 26-30 May 2010.
  10. Baron, A., Rayson, P. and Archer, D. (2009). Word frequency and key word statistics in historical corpus linguistics. In Anglistik: International Journal of English Studies, 20 (1), pp. 41-67.
  11. Baron, A. and Rayson, P. (2009). Automatic standardization of texts containing spelling variation, how much training data do you need? In M. Mahlberg, V. González-Díaz and C. Smith (eds.) Proceedings of the Corpus Linguistics Conference, CL2009, University of Liverpool, UK, 20-23 July 2009. PDF version
  12. Baron, A., Rayson, P. and Archer, D. (2009). Automatic Standardization of Spelling for Historical Text Mining. In Proceedings of Digital Humanities 2009, University of Maryland, USA, 22-25 June 2009.
  13. Baron, A., Rayson, P. and Archer, D. (2009). The extent of spelling variation in Early Modern English. Presented at ICAME 30, Lancaster University, UK, 27-31 May 2009.
  14. Rayson, P., Archer, D., Baron, A. and Smith, N. (2008). Travelling Through Time with Corpus Annotation Software. In Lewandowska-Tomaszczyk, B. (ed) Corpus Linguistics, Computer Tools, and Applications - State of the Art. PALC 2007. Peter Lang, Frankfurt am Main, pp. 29-46.
  15. Baron, A. and Rayson, P. (2008). VARD2: A tool for dealing with spelling variation in historical corpora. In proceedings of the Postgraduate Conference in Corpus Linguistics, Aston University, Birmingham, 22nd May 2008. PDF version
  16. Pilz, T., Ernst-Gerlach, A., Kempken, S., Rayson, P. and Archer, D. (2008). The identification of spelling variants in English and German historical texts: manual or automatic? Literary and Linguistic Computing, 23 (1), pp. 65-72. doi: 10.1093/llc/fqm044
  17. Rayson, P., Archer, D., Baron, A., Culpeper, J. and Smith, N. (2007). Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In proceedings of Corpus Linguistics 2007, July 27-30, University of Birmingham, UK. PDF version
  18. Rayson, P., Archer, D., Baron, A. and Smith, N. (2007). Tagging historical corpora - the problem of spelling variation. In proceedings of Digital Historical Corpora, Dagstuhl-Seminar 06491, International Conference and Research Center for Computer Science, Schloss Dagstuhl, Wadern, Germany, December 3rd-8th 2006. (http://drops.dagstuhl.de/opus/volltexte/2007/1055/) ISSN 1862-4405.
  19. Archer, D., Ernst-Gerlach, A., Kempken, S., Pilz, T., and Rayson, P., (2006), "The identification of spelling variants in English and German historical texts: manual or automatic?", In proceedings of Digital Humanities 2006, The Sorbonne, Centre Cultures Anglophones et Technologies de l'Information, Paris, France, July 5 - 9, 2006, pp. 3 - 5. PDF version
  20. Archer, D., Culpeper, J. and Rayson, P. (2005). Love - a familiar or a devil? An exploration of key domains in Shakespeare’s Comedies and Tragedies. Presented at the AHRC ICT Methods Network Expert Seminar on Linguistics. Lancaster University, 8 September 2005. PDF version
  21. Rayson, P., Archer, D., Smith, N., (2005), VARD versus WORD: A comparison of the UCREL variant detector and modern spellcheckers on English historical corpora. In Proceedings of Corpus Linguistics 2005, Birmingham University, July 14-17, Proceedings from the Corpus Linguistics Conference Series on-line e-journal, Vol. 1, no. 1, ISSN 1747-9398. PDF version
  22. Archer, D., Culpeper, J., Rayson, P., (2005), "Love - a familiar or a devil? An exploration of key domains in Shakespeare's Comedies and Tragedies", Presented as part of the Keyword Extraction in Information Retrieval panel at the ACH/ALLC Conference, June 15 - 18, 2005, Victoria, BC, Canada.
  23. Mueller, M. (2005). The Nameless Shakespeare. Working Papers from the First and Second Canadian Symposium on Text Analysis Research (CaSTA). Computing in the Humanities Working Papers (CHWP 34). (HTML version)
  24. Archer, D., McEnery, T., Rayson, P., Hardie, A. (2003). Developing an automated semantic analysis system for Early Modern English. In Proceedings of the Corpus Linguistics 2003 conference. UCREL technical paper number 16. UCREL, Lancaster University, pp. 22 - 31. PDF version

Related presentations

  1. Rayson, P. (2007). Travelling through time with corpus annotation software. Invited talk at the 10th Practical Applications in Language and Computers (PALC 2007) conference. 19-22 April 2007, Lodz University, Poland. PDF version
  2. Archer, D. and Rayson, P. (2006). Teaching a computer to read Shakespeare - the problem of spelling variation. Invited talk at the OED Forum, Oxford University Press and Kellogg College, Oxford, 21 June 2006.
  3. Archer, D. and Rayson, P. (2006). Dealing with variation: spelling. Invited talk at SCOTS Symposium on Linguistic Variation and Electronic Projects. University of Glasgow, 28 April 2006.
  4. Archer, D. and Rayson, P. (2004). Using an historical semantic tagger as a diagnostic tool for variation in spelling. Presented at Thirteenth International Conference on English Historical Linguistics (ICEHL 13), University of Vienna, Austria, 23-29 August 2004.