Wmatrix corpus analysis and comparison tool

Wmatrix is a software tool for corpus analysis and comparison. It provides a web interface to the English USAS and CLAWS corpus annotation tools, and standard corpus linguistic methodologies such as frequency lists and concordances. It also extends the keywords method to key grammatical categories and key semantic domains.

Wmatrix allows the user to run these tools via a web browser such as Chrome, Firefox or Internet Explorer, and so will run on any computer (Mac, Windows, Linux, Unix) with a web browser and a network connection. Wmatrix was initially developed by Paul Rayson in the REVERE project, was extended and applied to corpus linguistics during his PhD work, and is still updated regularly. Earlier versions were available for Unix via terminal-based command-line access (tmatrix) and for Unix via Xwindows (Xmatrix), but these only offered retrieval of text pre-annotated with USAS and CLAWS.

Sections in this introduction to Wmatrix: screenshots, screencasts (short video introductions), acknowledgements and references for Wmatrix, and example applications and publications.

Tutorial for Wmatrix: step-by-step instructions on how to compare the Liberal Democrat and Labour Party manifestos for the 2005 UK General Election (updated February 2010). Further examples of the application to the 2010 general election manifestos can be seen on Paul's blog. The plain text versions of the 2010 UK election manifestos can be downloaded for use in your favourite text analysis software (with thanks to Martin Wynne for editing two of the files). TEI encoded versions of the 2010 election manifestos are now available (with thanks to Lou Burnard). Wmatrix has also been applied to the 2015 General Election manifestos, and downloadable versions of the documents from seven main parties have been produced.

Access the tool online at http://ucrel.lancs.ac.uk/wmatrix3.html (Wmatrix2 has now been retired). Wmatrix3 runs in Lancaster University's cloud infrastructure. Note that Wmatrix3 is suitable for English texts only.

Usernames for Wmatrix are free to members of Lancaster University. If you would like access to Wmatrix, please contact Paul Rayson.

Usernames for academic research and teaching (non-Lancaster academics): A free one-month trial is available for individual academic users. Please contact Paul Rayson to set up a username and password. Please note that you should apply from an academic email address. If you are a student, please apply via your course lecturer or supervisor. Once the one-month trial has expired, usernames are available for £50 per username per year from the online secure order page run by Lancaster University. Multiple usernames (or years) may be purchased at a reduced cost. Please ask Paul for details. Further development and external availability of Wmatrix currently depends on licensing its use.

Introduction to Wmatrix



Wmatrix users can upload their own corpus data to the system, so that it can be automatically annotated and viewed within the web browser. Each file is stored in a folder (equivalent to a folder in Windows or directory on Unix).

Input format guidelines

The analysis may be improved with some pre-editing of the input text, although pre-editing is not normally required. There are guidelines provided for texts to be tagged by CLAWS. Most important is the replacement of less-than (<) and greater-than (>) characters by the corresponding SGML entity references (&lt;) and (&gt;) respectively. The text may contain well-formed HTML, SGML or XML tags. If the text contains less-than or greater-than symbols in formulae, for example, then CLAWS may mistake large quantities of the following text for SGML tags, or fail to POS tag the file. The guidelines mention start and end text markers, but these are not required since they are inserted for you by Wmatrix.
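The replacement of angle brackets described above could be done before upload with a few lines of Python. This is only a minimal sketch: the function name is hypothetical, and it assumes the text contains no genuine HTML, SGML or XML tags, since any real tags would be escaped as well.

```python
def escape_angle_brackets(text: str) -> str:
    """Replace literal < and > with the entity references &lt; and &gt;.

    Assumption: the input is plain text with no markup that should be kept.
    Genuine tags, if present, would be escaped too.
    """
    return text.replace("<", "&lt;").replace(">", "&gt;")


if __name__ == "__main__":
    # A formula-like fragment of the kind that can confuse the tagger.
    print(escape_angle_brackets("if x < 10 and y > 3"))
```

If a file mixes real tags with stray symbols in formulae, a blanket replacement like this is too blunt, and the stray symbols would need to be edited by hand or matched more selectively.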

Tag wizard

Wmatrix users can upload their file and complete the automatic tagging process by clicking on the tag wizard. Once the file has been uploaded to the web server, it is POS tagged by CLAWS and semantically tagged by USAS. This process can also be carried out step by step, starting with the 'load file without tagging' option in the advanced interface. As a shortcut, you can simply upload frequency profiles if you have them. A frequency list uses a very simple two-column format with a total line at the head of the file. You can see an example of this. The column widths are not significant.

My Tag Wizard

My Tag Wizard is a variant of the tag wizard which allows you to override or extend the system dictionaries for your own data. There are two main uses. First, you can override the current most likely tag for any word or MWE. Second, you can extend the dictionaries in terms of coverage of vocabulary and tagset. For example, you can create a new tag by listing the words and MWEs that you wish to be tagged with it.

Viewing folders

By clicking on the folder name, the user can see its contents. Following the application of the tag wizard, the folder contains the original text, POS and semantically tagged versions of that text, and a set of frequency profiles.

Simple and advanced interfaces

The user can toggle between simple and advanced interfaces in Wmatrix. The advanced interface offers more options and more control over the data.

Frequency profiles

From the folder view, the user can click on a frequency list to see the most frequent items in their corpus. Frequency lists are available for words in the simple interface, and in the advanced interface for POS tags and semantic tags. The lists can be sorted alphabetically or by frequency.
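As an illustration of what a word frequency list contains and how the two sort orders differ, here is a minimal sketch in Python. It is not Wmatrix's own code: the function name and the very simple tokenisation rule (lower-cased runs of letters and apostrophes) are assumptions made for the example.

```python
import re
from collections import Counter


def frequency_list(text, by="frequency"):
    """Return (word, count) pairs sorted by frequency or alphabetically.

    Assumption: words are lower-cased runs of letters and apostrophes,
    which is far cruder than a real tokeniser.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if by == "alphabetical":
        return sorted(counts.items())
    # Highest frequency first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = "the cat and the dog"
    print(frequency_list(sample))
    print(frequency_list(sample, by="alphabetical"))
```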


Concordances

From the frequency list view, the user can click on 'concordance' to see standard concordances. These can show the usual word-based concordance as well as all occurrences of words in one POS or semantic category.
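For readers unfamiliar with concordances, the idea of a word-based (key word in context) display can be sketched as follows. This is a toy illustration, not the concordancer used by Wmatrix. The function name, the pre-tokenised input and the context width measured in tokens are all assumptions of the example.

```python
def concordance(tokens, node, width=3):
    """Return (left context, node word, right context) for each match.

    Assumptions: `tokens` is an already tokenised text, matching is
    case-insensitive, and `width` counts tokens rather than characters.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


if __name__ == "__main__":
    text = "the cat sat on the mat".split()
    for left, word, right in concordance(text, "the", width=2):
        # Right-align the left context so the node words line up.
        print(f"{left:>10}  {word}  {right}")
```

A concordance by POS or semantic category would work the same way, except that the match would be made on the tag attached to each token instead of on the word form.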

Key words, key POS and key domains: comparison of frequency lists

From the folder view, the user can click on 'compare frequency list' to compare the frequency list for their corpus against a larger normative corpus such as the BNC sampler, or against another of their own texts (once that text has been loaded into Wmatrix). This comparison can be carried out at the word level to see keywords, at the POS level (in the advanced interface), or at the semantic level (to see key concepts or domains). Wmatrix uses the log-likelihood statistic. For more details, see the log-likelihood calculator. In the simple interface, word and tag clouds visualise the more significant differences in larger font sizes. In the advanced interface, more detailed frequency information is also displayed in table form. The key comparison is sorted on the LL (log-likelihood) field, which shows how significant the difference is, so the most significant key items appear towards the top of the list. Items with a '+' code are overused in your text relative to the comparison corpus, and these are normally the ones to look at. For statistical significance, look at items with an LL value over about 7, since 6.63 is the cut-off for 99% confidence (p < 0.01).
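The log-likelihood comparison for a single item can be sketched in Python as below. The formula is the usual two-corpus form (expected frequencies from the pooled counts, then twice the sum of observed times log of observed over expected). That Wmatrix computes it in exactly this way, and the function and parameter names, are assumptions of this sketch.

```python
import math


def log_likelihood(a, b, c, d):
    """Log-likelihood for one item across two corpora.

    a, b: frequency of the item in corpus 1 and corpus 2
    c, d: total number of words in corpus 1 and corpus 2
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    total = 0.0
    if a > 0:  # a zero count contributes nothing to the sum
        total += a * math.log(a / e1)
    if b > 0:
        total += b * math.log(b / e2)
    return 2 * total


def usage_sign(a, b, c, d):
    """'+' if the item is relatively more frequent in corpus 1, else '-'."""
    return "+" if a / c > b / d else "-"


if __name__ == "__main__":
    # Hypothetical counts: 10 hits in a 1,000-word text, none in the other.
    ll = log_likelihood(10, 0, 1000, 1000)
    print(usage_sign(10, 0, 1000, 1000), round(ll, 2), ll > 6.63)
```

Sorting all items by this value in descending order and keeping those marked '+' gives a list of the kind described above.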

N-grams and c-grams

Recurrent sequences of words are called n-grams in Wmatrix. These are similar to clusters in WordSmith and lexical bundles in Biber's work. You can calculate n-grams of length 2 to 5 for each text. Collapsed-grams (or c-grams) are a merged version of these lists. They show you which 2-grams are subsets of 3-grams, which 3-grams are subsets of 4-grams, and so on. The resulting c-gram list is a tree structure with the longest n-grams on the left and shortest n-grams on the right.
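The relationship between n-grams and the longer n-grams that contain them can be illustrated with a short Python sketch. This is not the c-gram software used in Wmatrix. The function names are hypothetical, and the sketch only finds which longer n-grams contain a shorter one; it does not build the full tree display.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))


def parents(shorter, longer_counts):
    """List the longer n-grams that contain the given shorter n-gram."""
    return [gram for gram in longer_counts if contains(gram, shorter)]


if __name__ == "__main__":
    tokens = "at the end of the day at the end of the week".split()
    bigrams = ngram_counts(tokens, 2)
    trigrams = ngram_counts(tokens, 3)
    print(bigrams[("the", "end")])
    print(parents(("the", "end"), trigrams))
```

Repeating the `parents` step for lengths 2 to 5 links each n-gram to the longer sequences it belongs to, which is the information a collapsed list presents.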


Collocations

Collocations in Wmatrix are pairs of words that occur together more often than would be expected by chance. There is a choice of 11 different statistics that can be used to calculate the strength of association between the two words. For further details about these statistics, see the following paper:

Piao, S. (2002) Word alignment in English-Chinese parallel corpora. Literary and linguistic computing, 17 (2), 207-230. doi:10.1093/llc/17.2.207

The collocation feature was introduced in September 2009 and is currently in beta testing.
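To make the idea of "more often than expected by chance" concrete, the sketch below scores adjacent word pairs with pointwise mutual information, one widely used association measure. Whether this measure is among the 11 offered by Wmatrix is not stated here, so treat the choice of statistic, the function name and the restriction to adjacent pairs as assumptions of the example.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=1):
    """Pointwise mutual information (log base 2) for adjacent word pairs.

    Assumption: a 'pair' is two words directly next to each other, and
    probabilities are simple relative frequencies from the same text.
    """
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_pairs = max(n - 1, 1)
    scores = {}
    for (w1, w2), f12 in pair_counts.items():
        if f12 < min_count:
            continue
        p12 = f12 / n_pairs          # observed probability of the pair
        p1 = word_counts[w1] / n     # probability of the first word
        p2 = word_counts[w2] / n     # probability of the second word
        scores[(w1, w2)] = math.log2(p12 / (p1 * p2))
    return scores


if __name__ == "__main__":
    tokens = "new york is new york".split()
    for pair, score in sorted(pmi_scores(tokens).items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 3))
```

A positive score means the pair occurs more often than its individual word frequencies would predict. On very small texts the scores are unreliable, which is one reason a frequency threshold such as `min_count` is useful.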


This section shows short video introductions to the Wmatrix software. Further videos will be appearing soon.

Acknowledgements and references:

Wmatrix was initially developed within the REVERE project (REVerse Engineering of Requirements) funded by the EPSRC, project number GR/MO4846.

Lancaster University Proof of concept funding in July 2006 provided support for a new server and continued software development. In December 2006, further interface design using XHTML/CSS was carried out by Andrew Foote (InfoLab21 Knowledge Business Centre) funded under support from the European Regional Development Fund. Through a Lancaster University small grant (Towards an Online Conceptual Database of the Latin Vulgate Bible) a 'reader' interface is being developed for pre-tagged corpora.

Why the name, Wmatrix? Originally, I wrote a piece of software called Matrix which presented tables of frequency information from corpora, hence the name is partially derived from mathematical 'matrices'. This was Unix terminal-based, using 'curses'. I then wrote an X-windows version with a graphical user interface and named it Xmatrix. The web-based version came next, hence Wmatrix. I also have a Java API to the website called Jmatrix. There's a note in my PhD saying that it has nothing to do with any films featuring Keanu Reeves, but if you're a Doctor Who fan like me, you may recognise another meaning of the Matrix.

The collocation feature in Wmatrix uses software derived from MLCT developed by Scott Piao.

The C-grams feature uses software developed by Andrew Stone.

Thanks are due to Steve Wattam who ported the semantic tagger, frequency profiling and concordance software to Linux from Solaris.

Please reference Wmatrix as one of the following:
Rayson, P. (2008). From key words to key semantic domains. International Journal of Corpus Linguistics, 13(4), 519-549. DOI: 10.1075/ijcl.13.4.06ray
Rayson, P. (2009). Wmatrix: a web-based corpus processing environment. Computing Department, Lancaster University. http://ucrel.lancs.ac.uk/wmatrix/
Rayson, P. (2003). Matrix: A statistical method and software tool for linguistic analysis through corpus comparison. Ph.D. thesis, Lancaster University. (abstract or full text PDF version)

Publications and applications using Wmatrix:

