UCREL's Variant Spelling Research
Origins of the research:
In January 2003, Dawn Archer and Paul Rayson began to explore the
feasibility of retraining the UCREL Semantic Analysis System (USAS) so
that it could be applied to historical texts dating from 1600 onwards
(Archer et al. 2003). Previously, USAS had been used to
annotate written and spoken English from the modern period. Our initial
experiments suggested that the statistical model adopted in the
part-of-speech tagging component (CLAWS) was sufficiently robust to
cope with grammatical variation across time, but spelling variants
caused difficulty, as variants of known words were not recognized by
the system. Our response was to implement a prototype spelling detector
as a pre-processing step in USAS. This work continued in the WordHoard
project, funded by the Mellon Foundation, with Prof. Martin Mueller
(Northwestern University, Chicago), in which we linguistically
annotated the Nameless Shakespeare corpus and Chadwyck-Healey's
Eighteenth and Nineteenth Century Fiction corpora. The approach we
adopted was to produce a list of variant spellings, which we
manually matched to normalized forms. A Variant Detector computer program
(VARD) inserts modern equivalents of these forms when they appear
in a given text while preserving the original variant. This approach
proved to be very effective: we identified 45,800+ variants from
our analyses of different historical texts to date, and undertook
a preliminary empirical study of spelling variation across the
16th-19th centuries based on 4,000 of these variants (Archer & Rayson
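
The list-based approach described above can be illustrated with a minimal sketch. It is not the VARD implementation: the three-entry variant list, the normalise function and the <reg orig="..."> markup are all assumptions made for illustration. It shows only the general idea of looking each word up in a hand-built list, inserting the modern form, and keeping the original spelling alongside it.

```python
import re

# Hypothetical hand-built list mapping variant spellings to modern forms.
# The real list described above is far larger; these entries are invented.
VARIANTS = {
    "loue": "love",
    "vpon": "upon",
    "neuer": "never",
}

WORD = re.compile(r"[A-Za-z]+")


def normalise(text, variants=VARIANTS):
    """Insert modern equivalents of known variants, preserving the original.

    The <reg orig="..."> markup is an illustrative choice, not VARD's format.
    """
    def replace(match):
        word = match.group(0)
        modern = variants.get(word.lower())
        if modern is None:
            return word  # not a known variant: leave the word untouched
        if word[0].isupper():
            modern = modern.capitalize()  # carry over initial capitalisation
        return f'<reg orig="{word}">{modern}</reg>'

    return WORD.sub(replace, text)


print(normalise("Vpon my life, I neuer did loue thee."))
```

Because the original spelling is kept in an attribute, a later tool can read either the modern form (for tagging) or the variant (for studying spelling variation).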
Continuation of the work:
For more recent information on the research, see the links and
publications below.
Information about the VARD2 software is
available on Alistair Baron's VARD2 page.
This work has been supported by grants from the British Academy (Scragg Revisited),
the Mellon Foundation (WordHoard), an EPSRC PhD Plus grant for Alistair Baron,
and, more recently, the AHRC/ESRC through the Samuels project.
Coming soon! Paul Rayson and Alistair Baron have been awarded £7,000 by JISC for the crowdsourcing of Early English Books Online (EEBO) spelling regularisation through VARD.
We will shortly be putting out a call for volunteers (who will be paid from the grant) to carry out VARDing of EEBO phase 1 samples.
You can already sign up on our website in advance: ucrel-vardsourcing.lancs.ac.uk
Main contacts:
- Paul Rayson (School of Computing and Communications, Lancaster University)
- Dawn Archer (Department of Languages, Information and Communications, Manchester Metropolitan University)
- Alistair Baron (School of Computing and Communications, Lancaster University)
- Nick Smith (School of Education, University of Leicester)
- Mahmoud El-Haj (School of Computing and Communications, Lancaster University)
- Andrew Moore (School of Computing and Communications, Lancaster University)
