Collocation statistics in OCR data: are they reliable?

Amelia Joulain-Jay

Department of History; CASS

The increasing availability of digitized texts is good news for scholars, including those in the Humanities and Social Sciences, who want to analyse large quantities of text both quantitatively and qualitatively. Unfortunately, these texts are often digitized using Optical Character Recognition (OCR) software, with variable success. This is a particular problem for historical texts, which often yield lower-quality output. In this presentation, I explore the impact of OCR errors on two common collocation statistics (Mutual Information and Log Likelihood) by comparing statistics generated from a set of matching samples, including one hand-corrected ('gold') sample, an uncorrected sample, and an automatically corrected sample. These matching samples are excerpts from the British Library's c19th British Newspapers (part 1) collection. I find a clear effect of OCR errors, especially for larger collocation spans, and describe this effect in terms of differences in the values of the statistics across the samples, differences in the rankings of the statistics across the samples, and rates of false positives and false negatives in the uncorrected and automatically corrected samples when compared with the gold sample.
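
For readers unfamiliar with the two statistics, the following is a minimal sketch of how Mutual Information and Log Likelihood are commonly computed from a 2x2 contingency table, and of how false positives and false negatives could be counted against a gold sample. It is not the procedure used in the talk: the function names, the use of a score threshold to decide what counts as a collocate, and the choice not to scale expected frequencies by span size are all assumptions made for illustration.

```python
import math


def contingency(o11, f_node, f_coll, n):
    """2x2 table of observed counts for one node/collocate pair.

    o11    : co-occurrences of node and collocate within the span
    f_node : total frequency of the node
    f_coll : total frequency of the collocate
    n      : corpus size in tokens
    """
    return [[o11, f_node - o11],
            [f_coll - o11, n - f_node - f_coll + o11]]


def expected(table):
    """Expected counts under independence, from the table's marginals."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [table[0][j] + table[1][j] for j in range(2)]
    return [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]


def mutual_information(o11, f_node, f_coll, n):
    """MI = log2(observed / expected) for the co-occurrence cell."""
    if o11 == 0:
        return float("-inf")
    return math.log2(o11 / (f_node * f_coll / n))


def log_likelihood(o11, f_node, f_coll, n):
    """G2 = 2 * sum(O * ln(O / E)) over the four cells (empty cells add 0)."""
    obs = contingency(o11, f_node, f_coll, n)
    exp = expected(obs)
    total = 0.0
    for i in range(2):
        for j in range(2):
            if obs[i][j] > 0:
                total += obs[i][j] * math.log(obs[i][j] / exp[i][j])
    return 2 * total


def compare_to_gold(gold_scores, test_scores, threshold):
    """Return (false positives, false negatives) relative to the gold sample.

    Assumption: a word counts as a collocate when its score reaches the
    threshold. Both arguments map collocate -> score.
    """
    gold = {w for w, s in gold_scores.items() if s >= threshold}
    test = {w for w, s in test_scores.items() if s >= threshold}
    return test - gold, gold - test
```

As a usage sketch with invented scores (not results from the study), `compare_to_gold({"steam": 5.2, "engine": 4.1}, {"steam": 5.0, "cngine": 4.3}, 3.0)` would report the OCR variant "cngine" as a false positive and "engine" as a false negative.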

Week 16 2016/2017

Thursday 23rd February 2017

Management School LT9