Methods Network

Workshop on Historical Text Mining

Thursday 20th and Friday 21st July 2006, Lancaster University, UK.

Organisers: Paul Rayson (Lancaster University) and Dawn Archer (University of Central Lancashire)

Post-workshop resources are now listed alongside the workshop programme.

This is one of a series of workshops sponsored by the AHRC ICT Methods Network.

We want to develop a network of scholars interested in 'Historical Text Mining' via a workshop for experts from the relevant fields: text mining and E-Science, corpus development and annotation, historical linguistics, dialectology and computational linguistics. We believe that a discussion of the effective text mining of historical data is overdue and much needed, given the growth in (historical) digital resources (e.g. Open Content Alliance, Google Print, Early English Books Online). We particularly want to better define the relationship between the text mining/E-Science community, who are often involved in applying basic techniques to large-scale datasets, and the corpus linguistic community, who tend to apply data-driven linguistic analysis and annotation techniques to relatively small datasets.

The 'Historical Text Mining' workshop will seek:

  1. to raise awareness of the various techniques utilised and/or tools developed by researchers working within the various fields,
  2. to make scholars who work with historical data aware of existing text mining techniques that are applicable to their research needs,
  3. to familiarise such scholars with the use of these techniques and tools, by means of a series of tutorial sessions (e.g. GATE, WordSmith, VARD, VIEW, Wmatrix),
  4. to investigate the problems of applying some "modern" large-scale corpus annotation and analysis techniques to historical data, and
  5. to encourage/enable a roundtable discussion, with the ultimate aim of determining what needs to be done to improve historical text mining and (importantly) identify possible future workshops and collaborative projects.

One of the tools to be demonstrated, the VARD (= Variant Detector), presently "matches" spelling variants to their "normalised" equivalents using a search and replace script and a list of terms. It is being extended so that variants can be detected and "normalised" automatically, via fuzzy matching procedures. The VARD will enable historical linguists to undertake an empirical exploration of variation across four centuries (i.e. 16th-19th), but its usefulness is not limited to the (historical) lexicographer. Indeed, the VARD will facilitate annotation of, and text retrieval from, previously unseen pre-20th century corpora, and thus is of potential benefit to the historian, the English scholar, and those researchers interested in (historical) dialectology. VARD techniques will be applicable to detecting variants in, for example, the Scottish Corpus of Texts and Speech (SCOTS) and the Newcastle Electronic Corpus of Tyneside English (NECTE).
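The two approaches described above (lookup in a list of known variants, then fuzzy matching against modern word forms) can be illustrated with a minimal sketch. This is not the VARD's actual implementation: the variant list, the modern lexicon, the function names and the similarity cutoff are all hypothetical, and Python's standard difflib module stands in for whatever fuzzy matching procedure the VARD uses.

```python
import difflib
import re

# Hypothetical hand-built list mapping known variants to normalised forms
# (the "list of terms" approach). Not taken from the VARD itself.
KNOWN_VARIANTS = {
    "loue": "love",
    "vpon": "upon",
    "doe": "do",
}

# Hypothetical modern lexicon used only for the fuzzy fallback.
MODERN_LEXICON = ["love", "upon", "do", "heaven", "never", "the", "king"]


def normalise_token(token, cutoff=0.8):
    """Return a normalised (lower-case) form of token, or token unchanged.

    1. Look the token up in the list of known variants.
    2. Accept it if it is already a modern form.
    3. Otherwise fall back to fuzzy matching against the modern lexicon.
    """
    lower = token.lower()
    if lower in KNOWN_VARIANTS:
        return KNOWN_VARIANTS[lower]
    if lower in MODERN_LEXICON:
        return lower
    # get_close_matches ranks candidates by difflib's similarity ratio and
    # discards any candidate scoring below the cutoff.
    candidates = difflib.get_close_matches(lower, MODERN_LEXICON, n=1, cutoff=cutoff)
    return candidates[0] if candidates else token


def normalise_text(text, cutoff=0.8):
    """Normalise every alphabetic token in text, leaving other characters alone."""
    return re.sub(r"[A-Za-z]+", lambda m: normalise_token(m.group(0), cutoff), text)


if __name__ == "__main__":
    print(normalise_text("vpon heauen"))
```

In this sketch "vpon" is resolved by the list, while "heauen" is not in the list and is resolved by the fuzzy fallback. The cutoff trades recall against false matches: a lower value catches more variants but risks replacing genuine historical words with unrelated modern ones, which is one reason a real tool would need more than string similarity alone.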

The tutorial sessions will make use of licensed and freely available material, including the Lancaster Newsbook Corpus (1640-1661), Nameless Shakespeare, the Lampeter Corpus of English Tracts (1640-1760), Corpus del Español (1200s-1900s), and the EEBO-TCP collection, which contains structured SGML/XML text editions for a significant portion of the Short Title Catalogue of Early English books published between 1473 and 1700.

Draft Programme

The draft programme is as follows. All sessions will take place in room C60b/c in Infolab21. Room C60b/c is directly above reception and below Cafe21 in the Infolab21 building. For further information about the presentations, download the latest version of the workshop handout.

Thursday 20th July
10:00 Registration and coffee
10:20 Welcome and introductions
Slides (PDF)
Morning Session: What's possible with modern data?
10:30 "Historical Text Mining", Historical "Text Mining" and "Historical Text" Mining: Challenges and Opportunities
Robert Sanderson (University of Liverpool)
Slides (PDF)
11:00 Introduction to GATE (General Architecture for Text Engineering)
Wim Peters (University of Sheffield)
Slides (PDF)
11:30 Introduction to WordSmith
Mike Scott (University of Liverpool)
URL: WordSmith Tools including multilingual guides and FAQ
12:00 Tutorial: Corpus annotation and Wmatrix
Paul Rayson (Lancaster University)
URL: Wmatrix introduction and tutorial
12:30 Lunch
Afternoon session: Problems re-applying such tools to historical data
13:30 Search methods for documents in non-standard spelling
Thomas Pilz and Andrea Ernst-Gerlach (Universität Duisburg-Essen)
Slides (PDF)
14:00 The Potential of the Historical Thesaurus of English
Christian Kay (University of Glasgow)
Slides (PDF)
URL: Historical Thesaurus of English
14:30 Discussion
Slides (PDF)
15:00 Coffee/tea break
15:30 The CEEC corpora and their external databases
Samuli Kaislaniemi (University of Helsinki)
Slides (PDF)
15:50 Exploring Speech-related Early Modern English Texts: Lexical Bundles Re-visited
Jonathan Culpeper (Lancaster University) and Merja Kytö (Uppsala University)
16:30 Introducing nora: a text-mining tool for literary scholars
Tom Horton (University of Virginia)
Slides (PDF)
17:00 Closing discussion
Slides (PDF)
Friday 21st July
Morning Session: Possible solutions to the problems identified on day 1
9:30 Teaching a computer to read Shakespeare: the problem of spelling variation and Tutorial on the variant detector tool (VARD)
Dawn Archer (University of Central Lancashire) and Paul Rayson (Lancaster University)
Slides (PDF)
10:15 The advantages of using relational databases for large historical corpora
Mark Davies (Brigham Young University)
Slides (PDF)
Notes (PDF)
URL: VIEW: Variation In English Words and phrases
11:15 Coffee/tea break
11:30 Discussion
Slides (PDF)
12:00 Managing Momus: following the fortuna [sic] and frequency of a trope in Early English Books Online
Stephen Pumfrey (Lancaster University)
Slides (PDF)
12:15 Digitisation of historical texts at ProQuest and ways of accessing variant word forms
Tristan Wilson (ProQuest Information and Learning)
Slides (PDF)
12:30 Lessons learnt from transcribing and tagging the Newcastle Electronic Corpus of Tyneside English
Joan Beal (University of Sheffield) and Nick Smith (Lancaster University)
13:00 Lunch
13:30 Visit to the Rare Book Archive in Lancaster University Library to view the Hesketh Collection
Participants who have registered in advance of the workshop will need to bring photo-ID.
Afternoon session: final round-up
14:00 Software demonstrations and small-group discussion
14:30 Nineteenth Century Serials Edition Project
Suzanne Paylor and Jim Mussell (Birkbeck College)
14:45 LICHEN: The Linguistic and Cultural Heritage Electronic Network
Lisa Lena Opas-Hänninen (University of Oulu)
Slides (PDF)
15:00 Discussion: Software requirements
Slides (PDF)
15:15 Coffee/tea break
15:45 Round-up discussion: where to next?
Slides (PDF)
16:30 Close

Registration and accommodation

Participation is free but, since places are very limited, we request that potential participants apply in advance, and explain why they wish to attend and what they expect from the workshop. Please contact the organisers - preferably by email (details below) - with the following information:
  • Name
  • Email address
  • Institutional affiliation
  • Where you heard about the event
  • A brief statement including your current area of research and why you wish to attend this workshop
The deadline for applications is Friday 14th July. If you do have a confirmed place, and then are unable to attend, please let us know as soon as possible. As of 6th July, there are very few places remaining.

Why are we asking for the supporting statement? We would like to know what you expect to gain from attendance at the workshop, and how you might apply the methods and tools presented at the workshop in your own research area in the future. We hope that all applicants can be included, but the number of places is limited to approximately thirty-five, so if we are over-subscribed, we will have to select participants from the applications received. The selection criteria will be based on fulfilling the aims of the workshop as described above.

Refreshments and lunches will be provided free of charge. Participants should make their own arrangements for accommodation in Lancaster if required. Suitable hotels are the Lancaster House Hotel, which is situated on the university campus, and the Royal Kings Arms, which is near the train station in Lancaster. Further away, but within a taxi ride of Lancaster and the University, is the Holiday Inn, which is also handy for junction 34 of the M6 Motorway. Guestrooms are also available on the Lancaster University campus. For details, see the College and residence office website. A list of other hotels and guest houses in Lancaster is available from the Virtual Lancaster website. The workshop will take place during graduation week at Lancaster University, so please reserve accommodation as soon as possible.


The workshop will take place in the Infolab21 building at Lancaster University (number 51 on the campus map). Wireless internet access will be available to participants on campus and in Infolab21 during the workshop. We will allocate temporary usernames during registration so that participants can access the campus wireless network. Because the workshop takes place during graduation week at Lancaster University, parking on campus will be restricted. If you wish to reserve a parking space, please contact the organisers well in advance of the workshop.

Maps and travel details are available from the University's web site. In particular, train times between Lancaster and Manchester Airport can be found on the railway website. Train tickets are around £20 (return) depending on your time of travel. Once at Lancaster railway station we recommend you take a taxi (available from just outside the station) to the University; this taxi journey should take at most 10 minutes and cost about £5 depending on the time of day.

If you prefer, you can pre-book an airport taxi direct from Manchester or Liverpool Airport to the University. Please contact the organisers for further details.

Contact information

Dr Paul Rayson
Computing Department, Infolab21, South Drive, Lancaster University, Lancaster, LA1 4WA, UK.
Tel: +44 (0)1524 510357
Fax: +44 (0)1524 510492

Dr Dawn Archer
Lecturer in English Language and Linguistics, Department of Humanities, University of Central Lancashire, Preston, Lancashire, PR1 2HE
Tel: +44 (0)1772 893032