Corpus software and related tools


The main corpus tools and related resources developed by researchers at Lancaster are listed below. Open-source tools are also available on UCREL's GitHub.

BNCweb

BNCweb is a web-based client program for searching and retrieving lexical, grammatical and textual data from the British National Corpus (BNC). It relies on the Corpus Query Processor (CQP) of the IMS Open Corpus Workbench to provide a convenient interface between the user and the richly annotated text of the 100-million-word BNC in its most recent incarnation, the XML version.

BNC Web Index

This is the web front end to David Lee's BNC Index spreadsheet. For an introduction to BNC Index, please see David's web site.

CLAWS

Part-of-speech tagging software for English.

Clustertool

Clustertool allows you to perform Hierarchical Agglomerative Cluster Analysis on your own data.
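To illustrate what hierarchical agglomerative clustering does in general, here is a minimal sketch in Python. It is not Clustertool's implementation: the function name `agglomerative_cluster`, the use of single linkage, and the restriction to one-dimensional numeric values are all assumptions made for the example. Each value starts in its own cluster, and the two closest clusters are merged repeatedly until the requested number remains.

```python
def agglomerative_cluster(values, target_clusters):
    """Hypothetical helper: single-linkage agglomerative clustering of numbers.

    Starts with one cluster per value and repeatedly merges the two
    closest clusters until only `target_clusters` clusters remain.
    """
    if target_clusters < 1:
        raise ValueError("target_clusters must be at least 1")

    # Every value begins as its own singleton cluster.
    clusters = [[v] for v in values]

    def single_link(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(abs(x - y) for x in c1 for y in c2)

    while len(clusters) > target_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = single_link(clusters[i], clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        # Merge cluster j into cluster i, then drop cluster j.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    return clusters


# Illustrative data only (for instance, normalised frequencies of five texts).
groups = agglomerative_cluster([1.0, 1.2, 5.0, 5.1, 9.0], target_clusters=3)
```

Other linkage criteria (complete, average, Ward) differ only in how the distance between two clusters is defined, so they could be swapped in by replacing `single_link`.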

CQPweb

An extension of BNCweb, designed for use with any corpus.

LL Calculator

This calculates log-likelihood (LL) values from a 2x2 contingency table. LL is a more reliable alternative to the standard Pearson's chi-squared test; see Dunning (1993).
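As a rough illustration of the statistic involved, the sketch below computes the general log-likelihood ratio (G2) for a 2x2 table as twice the sum of O * ln(O / E) over the four cells, where E is the expected count under independence. The function name `log_likelihood_2x2` and the example frequencies are assumptions for this example, and the calculator itself may use a different but related formulation.

```python
import math


def log_likelihood_2x2(a, b, c, d):
    """Hypothetical helper: G2 statistic for the table [[a, b], [c, d]].

    G2 = 2 * sum(O * ln(O / E)), where E = row_total * col_total / grand_total.
    Cells with an observed count of zero contribute nothing to the sum.
    """
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    total = a + b + c + d

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                expected = row_totals[i] * col_totals[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Illustrative figures only: a word occurring 120 times in one corpus of
# 1,000,000 words and 60 times in another corpus of the same size.
ll = log_likelihood_2x2(120, 1_000_000 - 120, 60, 1_000_000 - 60)
```

Here each row is a corpus and the two columns are "occurrences of the word" and "all other words". A larger value indicates a greater difference in relative frequency between the two corpora.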

LWAC

LWAC is a tool for constructing corpora from web data.

Sentrick

A stream-oriented Java library and set of command-line tools for high-quality sentence boundary detection (also known as sentence segmentation, splitting or disambiguation). It currently has one model, for German, trained on general text and Wikipedia lynx dumps.

SigTest

Flexible Significance Test System: chi-squared test, log-likelihood test and Fisher's exact test for any kind of contingency table, using R.

USAS

Semantic tagger developed for English and extended to Finnish and Russian.

VARD

Variant Detector software that pre-processes corpora by normalising spelling variation (e.g. in Early Modern English).

Wmatrix

A corpus comparison and annotation tool incorporating CLAWS and USAS in a web front end.

