UCREL Corpus Research Seminar

Collocation statistics in OCR data: are they reliable?

The increasing availability of digitized texts is good news for scholars, including those in the Humanities and Social Sciences, who are interested in analysing large quantities of text in both quantitative and qualitative ways. Unfortunately, these texts are often digitized using Optical Character Recognition (OCR) software, with variable success. This is especially an issue for historical texts, which often yield lower-quality output. In this presentation, I explore the impact of OCR errors on two common collocation statistics (Mutual Information and Log Likelihood), comparing statistics generated from a set of matching samples: one hand-corrected ('gold') sample, an uncorrected sample, and an automatically corrected sample. These matching samples are excerpts from the British Library's c19th British Newspapers (part 1) collection. I find a clear effect of OCR errors, especially on larger collocation spans, and describe this effect in terms of differences between the values of the statistics generated in the various samples, the rankings of the statistics in the various samples, and the rates of false positives and false negatives in the uncorrected and automatically corrected samples when compared with the gold sample.
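For readers unfamiliar with the two statistics named in the abstract, the following is a minimal sketch of how Mutual Information and Log Likelihood are commonly computed from a 2x2 contingency table of observed counts. It is not the speaker's implementation: the function name `collocation_scores`, its parameters, and the counts in the usage lines are assumptions made for illustration, and the sketch ignores how span size affects the marginal counts.

```python
import math

def collocation_scores(o11, f_node, f_collocate, n):
    """Return (MI, log-likelihood) for a node/collocate pair.

    o11         -- number of co-occurrences of node and collocate
    f_node      -- total frequency of the node
    f_collocate -- total frequency of the collocate
    n           -- total number of tokens (table total)
    All names here are hypothetical, chosen for this sketch.
    """
    # Fill in the remaining cells of the 2x2 contingency table.
    o12 = f_node - o11
    o21 = f_collocate - o11
    o22 = n - f_node - f_collocate + o11
    observed = [o11, o12, o21, o22]

    # Expected count for each cell = row total * column total / n.
    rows = [f_node, f_node, n - f_node, n - f_node]
    cols = [f_collocate, n - f_collocate, f_collocate, n - f_collocate]
    expected = [r * c / n for r, c in zip(rows, cols)]

    # Mutual Information: log2 of observed over expected co-occurrence.
    mi = math.log2(o11 / expected[0]) if o11 > 0 else float("-inf")

    # Log Likelihood (G2): 2 * sum of O * ln(O / E) over non-empty cells.
    ll = 2 * sum(o * math.log(o / e)
                 for o, e in zip(observed, expected) if o > 0)
    return mi, ll

# Invented counts, for illustration only: the same pair scored in a
# hand-corrected sample and in an uncorrected OCR sample where some
# occurrences have been lost to misrecognised characters.
gold_mi, gold_ll = collocation_scores(10, 100, 50, 10000)
ocr_mi, ocr_ll = collocation_scores(6, 80, 40, 10000)
```

Comparing the two pairs of scores is the kind of value-level comparison the abstract describes; ranking many such pairs in each sample would give the rank-level comparison.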

University Centre for Computer Corpus Research on Language

Computing & Communications | Linguistics and English Language

Amelia Joulain-Jay

Department of History; CASS

Week 16 2016/2017

Thursday 23rd February 2017
3:00-4:00pm

Management School LT9
