Corpus software and related tools

The main corpus tools and related resources developed by researchers at Lancaster are listed below. You will also find open-source tools available on UCREL's GitHub.


BNCweb is a web-based client program for searching and retrieving lexical, grammatical and textual data from the British National Corpus (BNC). It relies on the Corpus Query Processor (CQP) of the IMS Open Corpus Workbench to provide a convenient interface between the user and the richly annotated text of the 100-million-word BNC in its most recent incarnation, the XML version.

BNC Web Index

This is the web front end to David Lee's BNC Index spreadsheet. For an introduction to BNC Index, please see David's web site.


Part of speech tagging software for English.


Clustertool allows you to perform Hierarchical Agglomerative Cluster Analysis on your own data.
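As a rough illustration of what hierarchical agglomerative clustering does, the sketch below starts with every data point in its own cluster and repeatedly merges the two closest clusters. It is not Clustertool's implementation: the function names, the Euclidean distance measure and the choice of linkage (min for single linkage, max for complete linkage) are all assumptions made for this example.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def agglomerate(points, linkage=min):
    """Hypothetical helper: merge clusters bottom-up until one remains.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters under the chosen linkage.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = linkage(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((clusters[i], clusters[j], dist))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


if __name__ == "__main__":
    for step in agglomerate([(0.0,), (1.0,), (10.0,)]):
        print(step)
```

The merge history is the information a dendrogram displays: which clusters were joined, in what order, and at what distance. This naive version recomputes all pairwise distances at every step, so it is only suitable for small data sets.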


An extension of BNCweb, designed for use with any corpus.

LL Calculator

This calculates log-likelihood (LL) values from a 2x2 contingency table. LL is generally considered more reliable than the standard Pearson's chi-squared test, particularly when expected frequencies are low; see Dunning (1993).
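For readers who want to see the arithmetic, here is a minimal sketch of the log-likelihood statistic for a 2x2 table, computed as twice the sum of observed * ln(observed / expected) over the four cells. The function name and the cell layout (a word's frequency in two corpora against the count of all other words in each) are assumptions for this example and may not match how the LL Calculator expects its input.

```python
import math


def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) for the 2x2 table [[a, b], [c, d]].

    Assumed layout for this sketch:
      a = frequency of the word in corpus 1
      b = frequency of the word in corpus 2
      c = count of all other words in corpus 1
      d = count of all other words in corpus 2
    """
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    total = a + b + c + d
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


if __name__ == "__main__":
    # Hypothetical counts: a word occurring 20 times in a 100-word sample
    # and 10 times in another 100-word sample.
    print(log_likelihood(20, 10, 80, 90))
```

When the word has the same relative frequency in both corpora, observed and expected counts coincide and the statistic is zero; larger values indicate a larger difference between the corpora.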


LWAC is a tool for constructing corpora from web data.


A stream-oriented Java library and a set of command-line tools for high-quality sentence boundary detection (also known as sentence segmentation, splitting or disambiguation). It currently has one model, for German, trained on general text and Wikipedia lynx dumps.


Flexible Significance Test System: chi-squared test, log-likelihood test and Fisher exact test for any kind of contingency table, using R.


Semantic tagger developed for English and extended to Finnish and Russian.


Variant Detector software that facilitates the pre-processing of corpora by normalising spelling variation (e.g. in Early Modern English).


A corpus comparison and annotation tool incorporating CLAWS and USAS in a web front end.