Thursday 20th and Friday 21st July 2006, Lancaster University, UK.
Organisers: Paul Rayson (Lancaster University) and Dawn Archer (University of Central Lancashire)
Post workshop resources are now listed alongside the workshop programme.
This is one of a series of workshops sponsored by the AHRC ICT Methods Network.
We want to develop a network of scholars interested in 'Historical Text
Mining' via a workshop for experts from the various fields: text mining
and E-Science, corpus development and annotation, historical
linguistics, dialectology and computational linguistics. We believe
that a discussion relating to the effective text mining of historical
data is particularly overdue and much needed, because of the growth in
(historical) digital resources (e.g. Open Content Alliance, Google
Print, Early English Books Online). We particularly want to better
define the relationship between the text mining/E-Science community,
who are often involved in applying basic techniques to large scale
datasets, and the corpus linguistic community, who tend to apply
data-driven linguistic analysis and annotation techniques to relatively
small datasets.
The 'Historical Text Mining' workshop will seek:
One of the tools to be demonstrated, the VARD (Variant Detector),
presently "matches" spelling variants to their "normalised" equivalents
using a search-and-replace script and a list of terms. It is
being extended so that variants can be detected and "normalised"
automatically, via fuzzy matching procedures. The VARD will enable
historical linguists to undertake an empirical exploration of variation
across four centuries (i.e. 16th-19th), but its usefulness is not
limited to the (historical) lexicographer. Indeed, the VARD will
facilitate annotation of, and text retrieval from, previously unseen
pre-20th century corpora, and thus is of potential benefit to the
historian, the English scholar, and those researchers interested in
(historical) dialectology. VARD techniques will be applicable to
detecting variants in, for example, the Scottish Corpus of Texts and
Speech (SCOTS) and the Newcastle Electronic Corpus of Tyneside English
(NECTE).
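To make the two approaches described above concrete, here is a minimal sketch in Python. It is not the VARD's own code: the variant list, the modern lexicon, the function names and the similarity cutoff are all hypothetical, and the standard library's difflib stands in for whatever fuzzy matching procedure the tool eventually adopts.

```python
import difflib
import re

# Hypothetical variant list (historical spelling -> modern form); illustrative only.
KNOWN_VARIANTS = {"loue": "love", "vpon": "upon", "haue": "have"}

# Hypothetical modern lexicon used when a word is not in the variant list.
MODERN_LEXICON = ["love", "upon", "have", "heaven", "virtue", "doubt", "the", "in"]


def normalise_word(word, cutoff=0.8):
    """Return a modern form for `word`, or `word` unchanged if none is found."""
    lower = word.lower()
    # Step 1: search and replace against the list of known variants.
    if lower in KNOWN_VARIANTS:
        return KNOWN_VARIANTS[lower]
    # Words that are already modern forms are left alone.
    if lower in MODERN_LEXICON:
        return word
    # Step 2: fuzzy matching against the lexicon (assumed cutoff of 0.8).
    candidates = difflib.get_close_matches(lower, MODERN_LEXICON, n=1, cutoff=cutoff)
    return candidates[0] if candidates else word


def normalise_text(text):
    """Normalise every alphabetic token in `text`, keeping punctuation and spacing."""
    return re.sub(r"[A-Za-z]+", lambda m: normalise_word(m.group(0)), text)


if __name__ == "__main__":
    print(normalise_text("I loue heauen and vertue"))
```

In this sketch the list lookup always takes priority, so hand-checked mappings are never overridden by a fuzzy guess; a lower cutoff would catch more variants at the cost of more false matches.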
The tutorial sessions will make use of licensed and freely available
material, including the Lancaster Newsbook Corpus (1640-1661), Nameless
Shakespeare, the Lampeter Corpus of English Tracts (1640-1760), Corpus
del Español (1200s-1900s), and the EEBO-TCP collection, which contains
structured SGML/XML text editions for a significant portion of the
Short Title Catalogue of Early English books published between 1473 and
1700.
Why are we asking for the supporting statement? We would like to know
what you expect to gain from attendance at the workshop, and how you
might apply the methods and tools presented at the workshop in your own
research area in the future. We hope that all applicants can be
included, but the number of places is limited to approximately
thirty-five, so if we are over-subscribed, we will have to select
participants from the applications received. The selection criteria
will be based on fulfilling the aims of the workshop as described
above.
Refreshments and lunches will be provided free of charge.
Participants should make their own arrangements for accommodation in Lancaster if required.
Suitable hotels are the Lancaster House Hotel,
which is situated on the university campus, and the
Royal Kings Arms,
which is near the train station in Lancaster.
Further away, but within a taxi ride of Lancaster and the University, is the
Holiday Inn,
which is also handy for junction 34 of the M6 Motorway.
Guestrooms are also available on Lancaster University campus. For details,
see the College and residence office website.
A list
of other hotels and guest houses in Lancaster is available from the
Virtual Lancaster website.
The workshop will take place during graduation week at Lancaster University,
so please reserve accommodation as soon as possible.
Maps and travel details
are available from the University's web site. In particular, train times between
Lancaster and Manchester Airport can be found on
the railway website.
Train tickets are around £20 (return) depending on your time of travel.
Once at Lancaster railway station, we recommend you take a taxi (available
from just outside the station) to the University; this
taxi journey should take at most 10 minutes and cost about £5 depending on the
time of day.
If you prefer, you can pre-book an airport taxi direct from Manchester or
Liverpool Airport
to the University. Please contact the organisers for further details.
Draft Programme
The draft programme is as follows.
All sessions will take place in room C60b/c in Infolab21. Room C60b/c is directly above reception and
below Cafe21 in the Infolab21 building.
For further information about the presentations, download the
latest version of the workshop handout.
Thursday 20th July (post workshop resources are listed with each item)
10:00 Registration and coffee
10:20 Welcome and introductions
Slides

Morning session: What's possible with modern data?
10:30 "Historical Text Mining", Historical "Text Mining" and "Historical Text" Mining: Challenges and Opportunities
Robert Sanderson (University of Liverpool)
Slides
11:00 Introduction to GATE (General Architecture for Text Engineering)
Wim Peters (University of Sheffield)
Slides
URL: www.gate.ac.uk
11:30 Introduction to WordSmith
Mike Scott (Liverpool University)
URL: WordSmith Tools including multilingual guides and FAQ
12:00 Tutorial: Corpus annotation and Wmatrix
Paul Rayson (Lancaster University)
URL: Wmatrix introduction and tutorial
12:30 Lunch

Afternoon session: Problems re-applying such tools to historical data
13:30 Search methods for documents in non-standard spelling
Thomas Pilz and Andrea Ernst-Gerlach (Universität Duisburg-Essen)
Slides
14:00 The Potential of the Historical Thesaurus of English
Christian Kay (University of Glasgow)
Slides
URL: Historical Thesaurus of English
14:30 Discussion
Slides
15:00 Coffee/tea break
15:30 The CEEC corpora and their external databases
Samuli Kaislaniemi (University of Helsinki)
Slides
15:50 Exploring Speech-related Early Modern English Texts: Lexical Bundles Re-visited
Jonathan Culpeper (Lancaster University) and Merja Kytö (Uppsala University)
16:30 Introducing nora: a text-mining tool for literary scholars
Tom Horton (University of Virginia)
Slides
URL: www.noraproject.org
17:00 Closing discussion
Slides

Friday 21st July

Morning session: Possible solutions to the problems identified on day 1
9:30 Teaching a computer to read Shakespeare: the problem of spelling variation and Tutorial on the variant detector tool (VARD)
Dawn Archer (University of Central Lancashire) and Paul Rayson (Lancaster University)
Slides
10:15 The advantages of using relational databases for large historical corpora
Mark Davies (Brigham Young University)
Slides
Notes
URL: VIEW: Variation In English Words and phrases
11:15 Coffee/tea break
11:30 Discussion
Slides
12:00 Managing Momus: following the fortuna [sic] and frequency of a trope in Early English Books Online
Stephen Pumfrey (Lancaster University)
Slides
12:15 Digitisation of historical texts at ProQuest and ways of accessing variant word forms
Tristan Wilson (ProQuest Information and Learning)
Slides
12:30 Lessons learnt from transcribing and tagging the Newcastle Electronic Corpus of Tyneside English
Joan Beal (University of Sheffield) and Nick Smith (Lancaster University)
13:00 Lunch
13:30 Visit to the Rare Book Archive in Lancaster University Library to view the Hesketh Collection
Participants who have registered in advance of the workshop will need to bring photo-ID.

Afternoon session: final round-up
14:00 Software demonstrations and small-group discussion
14:30 Nineteenth Century Serials Edition Project
Suzanne Paylor, Jim Mussell (Birkbeck College)
14:45 LICHEN: The Linguistic and Cultural Heritage Electronic Network
Lisa Lena Opas-Hänninen (University of Oulu)
Slides
15:00 Discussion: Software requirements
Slides
15:15 Coffee/tea break
15:45 Round-up discussion: where to next?
Slides
16:30 Close

Registration and accommodation
Participation is free but, since places are very limited, we request that
potential participants apply in advance, and explain why they wish to
attend and what they expect from the workshop. Please contact the
organisers - preferably by email (details below) - with the following
information:
The deadline for applications is Friday 14th July. If you do have a
confirmed place, and then are unable to attend, please let us know as
soon as possible. As of 6th July, there are very few places remaining.
Location
The workshop will take place in the Infolab21 building at Lancaster University
(number 51 on the campus map).
Wireless internet access will be available to participants on campus and in Infolab21 during the workshop.
We will allocate temporary usernames during registration so that participants can access the
campus wireless network.
The workshop will take place during graduation week at Lancaster University, so parking on campus will be restricted.
If you wish to reserve a parking space, please contact the organisers well in advance of the workshop.
Contact information
Dr Paul Rayson
Computing Department, Infolab21, South Drive, Lancaster University, Lancaster, LA1 4WA, UK.
Tel: +44 (0)1524 510357
Fax: +44 (0)1524 510492
Email: paul@comp.lancs.ac.uk
Dr Dawn Archer
Lecturer in English Language and Linguistics,
Department of Humanities,
University of Central Lancashire,
Preston, Lancashire,
PR1 2HE
Tel: +44 (0)1772 893032
Email: dearcher@uclan.ac.uk