Wmatrix tutorials (for version 6)

Step by step instructions using a case study of the linguistic analysis of Political Party Manifestos for the UK General Elections (updated June 2023)

This document describes the method, using the Wmatrix tool, to carry out a comparison and corpus analysis of the Liberal Democrat and Labour Party Manifestos for the 2005 General Election. Although these are older manifestos, they have been retained to allow comparability with the tutorial for version 5. Tutorials A, B and C describe the steps required to collect, prepare, annotate and analyse the two documents. For more advanced users, tutorial D shows details of further analyses that can be performed and tutorial E introduces the methods that you can use to alter the way that Wmatrix semantically tags the text. Tutorial F describes the collocation table and metrics. Tutorial G provides an overview of metaphor analysis features in Wmatrix.

Further examples of the application to the 2010 general election manifestos can be seen on Paul's blog. The plain text versions of the 2010 UK election manifestos can be downloaded for use in your favourite text analysis software (with thanks to Martin Wynne for editing two of the files). A similar analysis for the 2015, 2017, 2019 and 2024 General Election manifestos has been carried out. The recent manifesto files are also available on GitHub.

Getting started on Wmatrix6

As with Wmatrix5, you need to apply for an account with your email address. Please follow the instructions at https://ucrel-wmatrix6.lancaster.ac.uk/ to apply if you don't already have an account. It is a two step process where you first validate your email address, and then the account is created once this step is completed. After the account is created, then you can login.
The same page (https://ucrel-wmatrix6.lancaster.ac.uk/) also lists the key differences between version 5 and version 6. Please read and digest these. They represent the largest set of changes for a new version of Wmatrix since version 1 in 2000.
With the introduction of new languages in Wmatrix6, you should also familiarise yourself with the differences between the taggers for each language. Apart from the English tag wizard, which uses the same tagging pipeline (CLAWS and USAS) as version 5, the tag wizards for the other languages use PyMUSAS, which is an open source implementation in Python. For further details of the contributors to the taggers and lexicons for each language, see the USAS web page, in particular, you should see our NAACL 2015 and LREC 2016 papers which show the differences in the taggers and the linguistic resources employed. Latest statistics on the relative sizes of the dictionaries can be found on the GitHub lexicon statistics table.
Currently, Wmatrix6 uses PyMUSAS version 0.3.0 (May 2022) with pymusas-model versions 0.3.2 (March 2023).
With all these caveats in mind, you can now start to explore Wmatrix version 6.

Tutorial A: Data collection and preparation

This tutorial guides you through how to prepare corpus files for loading into Wmatrix. You can either use the manifesto examples provided here, or follow similar steps for your own data.
Note that the set of election manifesto files are available, pre-tagged and pre-analysed, in the new corpus library feature available from Wmatrix5 onwards. For details on how to use the corpus library, please jump straight to the second half of Tutorial B. If you'd like to continue to follow the steps to prepare the data yourself, please continue with Tutorial A and then the first half of Tutorial B.
Originally the two manifestos were downloaded from the Labour and LibDem websites, however they have now been removed. Local saved copies are provided here.
Accessed on 5th May 2005, Labour provided their manifesto in PDF at: http://www.labour.org.uk/fileadmin/manifesto_13042005_a3/pdf/manifesto.pdf (local copy)
Using Adobe Acrobat Reader, save the Labour PDF document as plain text (File Menu -> Save As -> then change the file type to plain text). Please note that with some versions of the Adobe tools, you will need to manually remove running heads and bullet points by editing the resulting plain text file. Bullet points may be converted to the single character 'n'. Running heads may appear in the middle of some sentences that run over a page break.
This plain text version still contains some non-ASCII characters. You can remove these by opening the text file in Microsoft Word and saving the file as text with line breaks (File Menu -> Save As -> then change the file type to text with line breaks: in MSWord 2000 onwards).
The resulting file contains a few remaining non-ASCII characters (e.g. pound sign) but these can be left in for now (local copy).
Accessed on 5th May 2005, the Libdems provided a PDF version of their English General Election manifesto at: http://www.libdems.org.uk/media/documents/policies/manifesto2005.pdf (local copy)
Since this file contains multiple columns and the Adobe conversion to text from multiple column format does not always preserve the correct word order, it is preferable to use the text version of the manifesto in RTF format originally downloaded from: http://www.libdems.org.uk/media/documents/manifesto05.rtf (local copy)
Open the RTF file in Microsoft Word and save the file as text with line breaks (File Menu -> Save As -> then change the file type to text with line breaks: MSWord 2000 onwards).
If you are using MS-Word 2003 or later, please select 'plain text', then in the dialog box click 'insert line breaks' and 'allow character substitution' and then save. This last option is required for replacing Windows or smart apostrophes for their ASCII equivalents.
The resulting file is now ready for use in Wmatrix (local copy).
The above steps illustrate the process for these two files. The same steps should be applied to other PDF, DOC(X) or RTF files in order to convert them to TXT format that is suitable for Wmatrix.
Wmatrix is also capable of dealing with text in HTML, SGML or XML format. The taggers do not require parsable encoding, it is necessary only that left and right angled brackets are well-balanced.
If you have a very large number of files that you wish to load into Wmatrix, then several options are open to you. First, you can group them together in one or a small group of files (suggested maximum size is one million words per file). Unix, Linux and Mac OSX users can use the 'cat' command to concatenate files. On Windows, you can use the 'copy' command with a list of files to concatenate them (see https://en.wikipedia.org/wiki/Copy_(command)). If you group your files into a smaller number and then load these in to Wmatrix, the resulting folders can be grouped into one Wmatrix folder using the 'join' option in the advanced user interface (in Wmatrix5 rather than Wmatrix6). Alternatively, in Wmatrix version 6, you can also retain the original files separately, create a zip file and upload this.
When loading any English data in to Wmatrix, care should be taken when the file contains angled brackets (< or >). These can be misinterpreted by CLAWS as XML tags and some of the text may be left untagged and not counted by Wmatrix. See the input format guidelines for further instructions on how to avoid these problems. If you wish (for the English tag wizard only), you can deliberately force Wmatrix to ignore sections of your text e.g. headers or speaker markers. In order to do this, enclose this text within angled brackets e.g.
```
<speaker id="A1">
```
or
```
<A>
```
or
```
<head content="Any header text here">
```
would all be ignored by Wmatrix.
For the English tag wizard (CLAWS and USAS), it is best to use ASCII format files in order to avoid characters being ignored. However, when using the PyMUSAS tag wizard for other languages, UTF-8 format files are supported.

Tutorial B: Upload to Wmatrix or use the corpus library feature

Log in to Wmatrix using your existing username and password. From Wmatrix5 onwards, your username is the email address that you used to create your account.
You can upload files to Wmatrix by clicking on the tag wizard option and using the following steps.

For the LibDem text file, follow the instructions to select the language (English in this case), then choose a name the folder, then select the file on your local computer, and click the button to upload the data.
In Wmatrix6, you don't need to wait with the web page open while the tag wizard completes the annotation, frequency counting, and indexing process. But it should take around one minute to complete for files of this size. Note that larger files, or zips containing large numbers of files can take much longer. There is a status message in the folder view which tells you how far through the tag wizard process your files have progressed.
Repeat the above two steps for the Labour text file.
The two manifestos are now ready for analysis in Wmatrix.

If you wish to use the election manifesto corpus files without having to manually upload the data yourself, you can use the Wmatrix corpus library feature, as follows:

Click on the Library option in the menu.
You will see a set of folders that are available for you to use from the Wmatrix corpus library.
Select which folders you want to use, and then click the "Access corpus library folders" button.
The corpus library folders that you have selected will be copied to your Wmatrix account, and you can now proceed with Tutorial C.

Tutorial C: Initial data analysis (using the simple interface)

For this and later tutorials, it is useful to have the one page PDF listing the whole semantic tagset available as a reference document in another window or browser tab. Through the steps in this tutorial, you'll start to explore the semantic tagger output and learn more about the way that the USAS tagset works. Much more information about the USAS semantic tagger is available.
Click on the "My Folders" option to see the Labour and LibDem folders that you have created using the tag wizard, or copied using the corpus library feature.
Click on the LibDem folder to see what you can do using the simple interface. The folder view for the LibDem manifesto is shown here:
First, you can view the word list to see words that are most frequently used in the LibDem manifesto by clicking on "Frequency" under item 1, as shown here:
In the word frequency list, you can see that 'the', 'and', 'to' are most frequent, not surprisingly, along with other closed class words. Also, note that in Wmatrix6, you can see punctuation and XML tags included in the frequency lists, whereas in Wmatrix5 these were automatically removed, and could not be searched for or concordanced.
Looking further down the frequency list, we see open class words that reflect the topic of the text. For example, 'government' is used 89 times, 'tax' 57 times, 'environment' 34 times. The full list can be saved as a text file by right clicking on the file icon and clicking 'save as'. Note that in Wmatrix5, some multi-word-expressions (MWEs) are marked by the system as words joined by underscore characters e.g. red_tape, tuition_fees, public_transport. This feature is not yet available in Wmatrix6, but MWEs are still semantically tagged as one item.
In the same way, you can view word frequencies for the Labour Manifesto by moving to that folder.
Note that using the advanced interface, you can also view frequency lists by part-of-speech and semantic tag (see Tutorial step D.3).
To see a concordance of a particular word or tag, click on the concordance link alongside the word in a frequency list, or type in a word in the "word search" box under item 2 in the main folder page. A sample of a concordance of the word "government" from the LibDem manifesto is shown here:
In Wmatrix6, the concordance view is paginated, so you can step forwards and backwards through all the concordance lines by clicking on the 'Previous' and 'Next' links in the box just above the concordances. Line numbers are shown on the left hand side, and (if your corpus consists of more than one file loaded via a zip file) you can see the filename displayed on the right hand side of each concordance line.
You can save the current displayed set of concordance lines by right clicking on the save icon at the top. The output is tab delimited so you can easily import this into a spreadsheet and see the left and right context, centre key word and filename in separate columns.
Wmatrix6 also allows you to sort the concordance lines using the drop down list above the set of lines. Concordance sorting wasn't previously possible in Wmatrix5. The default sort is corpus order (i.e. as the lines occur in the input), but you can also select to sort on the filename, up to three words to the left, the centre or key word itself, up to three words to the right, or the POS or semantic tag of the key word. The sorted elements appear highlighted in red. Sorting concordance lines helps you to determine the patterns of usage of words of interest, and how often the patterns occur by sorting similar lines next to each other in a concordance view. A sample of a concordance sorted by one word to the right (R1) of the word "government" from the LibDem manifesto is shown here:
Now, return to the main folder page for the LibDem manifesto file by clicking on the folder name in the breadcrumbs menu at the top right of any page (i.e. next to "You are here").
To compare the LibDem and Labour manifestos using the simple interface, select the Labour Manifesto from the drop down lists under the word clouds (item 3) and tag clouds (item 4) options. This shows a visualisation of the items that are significantly more frequent in the LibDem manifesto relative to the Labour one. A larger font indicates a greater significance value. In the advanced interface, much more information appears before the clouds in table format showing the underlying frequency, relative frequency, and significance (log-likelihood) value, see step D.4 for more information.
Here's the Key word cloud for the LibDem Manifesto compared to Labour Manifesto. You can see that the LibDems talk about themselves, the environment, pollution, and green issues, tax, red tape, tuition fees, and so on, relatively more frequently than Labour do in their manifesto:
Here's the Key semantic tag cloud for the LibDem Manifesto compared to Labour Manifesto, and you can see similar concepts, represented by the semantic tags for green issues, government, farming and horticulture, appearing relatively more frequently in the LibDem manifesto:
Concordances can again be seen by clicking on the word or tags in the clouds in the simple interface, so you should further explore what topics are being discussed using the key word and semantic tag clouds as a data-driven guide towards finding the important items for your analysis. Tooltips, which appear by hovering your mouse over the words or tags, show the frequency and log-likelihood information.
By default, the cloud visualisations show the overused items in the LibDem manifesto. These are the items with positive keyness values, i.e. where the relative frequency in the LibDem manifesto is higher than the relative frequency in the Labour manifesto. If you want to see overused items in the Labour manifesto, you need to change to the Labour folder (click on "My folders" at the top right and then choose the Labour folder) and then select the LibDem manifesto in the drop down list to compare against.
Compare the word cloud and the semantic tag cloud for the LibDem versus Labour Manifesto comparison. Which features in the text are represented in both word and tag clouds? Which features only emerge when you use the semantic tag cloud? Are there any words in the word cloud which cannot be found in the key semantic tags in the semantic tag cloud?
So far, you've compared the two manifestos directly against each other. You can also compare one manifesto to a much larger reference corpus to discover key words and key semantic categories relative to that reference corpus. In the drop down comparision lists for the clouds, you will find all the other folders that you've loaded in to Wmatrix, under "My folders". Above this, you will find the built in "Standard reference corpora", which are always available for keyness comparison purposes. The choice of which reference corpus to use is very important, and you should consider whether you need a more written reference corpus or a more speech based reference corpus, or something closer to the domain or topic of your corpus, for comparability. Experiment with some of the other standard reference corpora provided in the tool e.g. British English 2021 (BE21), British English 2006 (BE06) and American English 2006 (AmE06) to see what differences emerge in the results. Further details about the reference corpora that are available in Wmatrix can be seen in the help system by clicking on the "Help" menu at the top of the screen and then selecting the topic "Standard reference corpora for keyness analysis".
If you repeat these steps for the LibDem 2010 manifesto (compared against the BNC 1994 Sampler Written) and contrast the results at the word level to the results at the key domains (semantic tags) level then you will begin to see where some of the results at the key domain level reinforce results at the key words level e.g. "sustainable" and "climate" appear as key words and "Green issues" appears at the semantic level. In addition, some further patterns can only be seen at the key semantic level e.g. "Law and order" appears at the semantic level but is harder to spot at the key words level (other than with key words lower down the list such as "law" and "prison"). This illustrates the advantage of the key semantic domains approach over the key words approach since it allows you to spot further items of interest that otherwise do not appear with other techniques. For more details about this, see the 2008 paper in IJCL:
Rayson, P. (2008). From key words to key semantic domains. International Journal of Corpus Linguistics. 13:4 pp. 519-549. DOI: 10.1075/ijcl.13.4.06ray
If you haven't already done so, you should now switch to the advanced user interface by going to the top of the Wmatrix screen, clicking on "My folders" and then click on "switch to advanced interface". Then when you click on any folder, you will see much more detailed information and many more options for exploring and analysing the data therein. For more guidance on the advanced interface, please continue to Tutorial D.

Tutorial D: Advanced data analysis (tokenisation, MWEs, and n-grams)

In step C.6 above, we saw that the word frequency list contains some multiword expressions (MWEs) which the semantic tagger automatically annotates (from an existing list) with one semantic tag for the whole phrase. Using the advanced interface for Wmatrix6, there is an additional method which you can use to extract repeated multiword elements in a corpus to locate what is known as n-grams, or sometimes called clusters, lexical bundles, chunks, fixed expressions or formulaic sequences. In this tutorial, we will look at some of the advanced features in Wmatrix which include MWE analysis.
At the top right of the Wmatrix screen in the breadcrumbs to the right of "You are here", click on "My folders" and then click on "switch to advanced interface" in the menus on the left (if you haven't already done so by now).
Select the LibDem Manifesto 2005 folder as before and you will see a more complicated view than the simple interface, as shown below. This lets you see frequency lists of POS tags as well as compare POS tag profiles just as you did previously with word and semantic tag frequency lists. Try experimenting with these more advanced features to see how the Labour and LibDem manifestos compare.
Also, at this point you should familiarise yourself with which features are the same as in the simple interface that you used in Tutorial C: the word frequency list is the same, but word search is called concordancing in the advanced interface, and the word cloud visualisations include much more information in table format with detailed statistics for keyness comparisons, as show in the next step. Some features are new to the advanced interface: lemma (i.e. dictionary headword) frequency lists, Part-of-Speech (POS) tag frequencies, key POS analysis, and a download section where you can save the complete frequency lists, the original untagged file, plus the SQLite Database which Wmatrix6 uses for its internal indexing.
As an example of the more detailed information in the advanced interface, once you are in the LibDem folder you can select the Labour Manifesto folder in the 'keyness analysis' drop down lists at the word, lemma, part-of-speech and semantic level. At the semantic level, the table shown below is produced. The keyness table, by default, is sorted on the log-likelihood value, resulting in the most significant differences at the top of the table. However, a variety of effect size measures can be selected and sorted on, and there are filters to include or exclude certain elements or features. Full word lists for each semantic tag can be seen by clicking on the 'List1' link (for the list of words in that semantic category in the LibDem manifesto) and 'List2' (for the list of words in that semantic category in the Labour manifesto) on the left handside in the advanced interface, along with a concordance link.
The USAS semantic tagger marks multiword expressions (defined here as a single meaningful unit) and they are assigned a single semantic tag and counted as one item in the frequency lists. For further information on how Wmatrix identifies MWEs using manually defined rules or templates in the USAS dictionaries, you can read three papers, Piao et al (2003), Piao et al (2005) and Rayson et al (2004). Full references are listed on the USAS website. It is also worth noting here that the tokenisation principles (i.e. how word boundaries are defined) in Wmatrix are designed to help annotate and count meaningful linguistic units and chunks. For English corpora, Wmatrix relies on CLAWS to do its word tokenisation, where contracted forms are split into separate words to give each part an individual POS tag (see CLAWS tagging guidelines for more information). For other languages, tagged using the spaCy pipeline and the PyMUSAS semantic tagger, other tokenisation principles are applied.
Another way to identify MWEs is using the n-gram technique for counting recurring 'n'-words-long consecutive sequences or patterns in the text. Note that a 1-gram list is equivalent to a plain word frequency list (without any semantic MWEs). In the advanced interface, you can see n-grams from 2 words long up to 5 words long from the main folder interface. These are found by clicking on the numbers 2, 3, 4, and 5 in the first line of the table alongside the word frequency lists. In Wmatrix6, the n-gram lists are produced by the tag wizard using the SQLite Database so you don't need to make them using an additional process as was required in Wmatrix5 (which used the Ngram Statistics Package (NSP)). Shown below is the 2-gram frequency list.
You can export the n-gram lists as tab-delimited files by clicking on the save link on the top left of the table when viewing the lists. Concordances for each n-gram can be displayed using the 'concordance' link in the usual way.
Compare the 2-gram and 3-gram lists with the lists of MWEs extracted using the semantic tagger as described in step D.6 above. There is some overlap e.g. "make_sure" and "such_as". Other items in the 2-gram list ("we will", "of the") are not included in the MWE list because they are not identified as semantically coherent units by the USAS tagger. Further items in the 2-gram list e.g. "liberal democrats" and "liberal democrat" should appear in the MWE list but do not because they are not listed in the USAS dictionary.
Compare the items listed in the 2, 3, 4 and 5-gram frequency lists. What items in the 2-gram list are also contained in the patterns shown in the 3-gram list? You should also find items in the 3-gram list that are part of items in the 4-gram list and so on with 4 and 5-grams. These overlaps between the lists shows that you should consider all the n-gram lists together for your analysis because some of the frequent shorter n-grams will be contained with the longer n-grams.

Tutorial E: Extending the Wmatrix dictionaries

The My Tag Wizard feature is not yet implemented in Wmatrix6, so you will need to head over to the tutorial for Wmatrix version 5 to learn about these steps.

Tutorial F: Collocation

This tutorial assumes that you have already completed tutorials A and B. You should also be using the advanced interface to Wmatrix. If you're not using the advanced interface, switch to it now by clicking 'switch to advanced interface' at the top left of the Wmatrix screen.
Select the folder for the Labour manifesto (2005) that was uploaded to Wmatrix, or copied from the corpus library, in tutorial B by clicking on the icon or the name.
In the main table, you should see a column headed 'Collocation'. Underneath this will be a link called 'Word'. The word collocations are calculated automatically by the tag wizard in wmatrix6 using the SQLite Database, unlike in wmatrix5 where you had to start the calculations as an additional process (with Java code created by Scott Piao). To view the word level collocations, click on the 'Word' link in the 'Collocation' column of the main table.
You will see the full table of all the collocates extracted from the Labour manifesto, as shown below.
The first column of the table shows the word and then the second column shows the collocate which occurs within three words either side of the word. The following six columns show the frequency of occurrence of the collocate one word to the left (L1) of the word, two words to the left (L2), and three words to the left (L3). Similarly, R1, R2 and R3, show the number of times the collocate has appeared in those positions to the right of the word. The next column is Total, which shows the sum of all those positions where the collocate occurs in close proximity to the word. Further to the right, we have the "Word Freq" column which shows how often the word appears anywhere in the Labour Manifesto. Similarly, in the "Collocate Freq" column, this shows how many times the collocate appears anywhere in the Labour Manifesto. With all of this information, we can construct a contingency table for each word and collocate pair, and calculate a variety of statistics.
The first metric calculated in wmatrix is the Multual Information (MI). Wmatrix also calculates the two-cell (sometimes called LL-short) Log-Likelihood (LL2) statistic. These are shown at the right hand side of the table. By default, the collocates table only shows collocate pairs that occur at least three times.
You can sort the collocation table by any of the columns with a clickable link in the title row. For example, to show the word and collocate pairs with the strongest links according to MI, click on MI in the header row, as shown below.
Using MI, you will see that the strongest collocates are "21st century", "tony blair", "prime minister", and "penalty notices". From the L3-R3, total, word frequency and collocate frequency columns, you can observe that most of these word-collocate pairs only occur together, hence why they score highly. Scroll further down the table (to line 120) to see the "real terms" collocate. This pair occurs six times together (within the +/- three word window), but the frequencies of real (12) and terms (10) show that not every occurrence of each word is in close proximity with the other word. Hence why this collocate pair scores slightly lower (10.438).
Now sort the table using Log-Likelihood (LL2), and beyond the high frequency pairs of sentence break XML tags, commas and full stops, you will see the strongest collocates are "of the", "we will", "to the" and "per cent", as shown below:
Contrast the high word and collocate frequencies here at the top of the LL2-sorted table, with the lower frequencies you can see at the top of the MI-sorted table. This clearly illustrates the difference between collocates extracted using different statistics. For more information on these statistics and others which may be used for collocation calculations, please see the following sources:
- Piao, S. (2002) Word alignment in English-Chinese parallel corpora. Literary and linguistic computing, 17 (2), 207-230. doi:10.1093/llc/17.2.207
- Statistics in #LancsBox, see the manual section 12, pages 48-49.
- Lancaster Stats Tools online, see section 3 "Semantics and discourse: Collocations, keywords and lockwords"
If you wish to find collocates of a given word, or vice versa, then enter the word in the 'search this list' box and click 'Go'. For example, using the MI statistic, you can enter "health" in the search box and see that there are many collocates extracted: "mental health", "healthy choices" "restored health", "health problems, "good health" and "health services" amongst others, as shown here:
At any point, you can right-click on the save icon and download a tab-delimited file containing the information in the table. This tab-delimited format can be imported into a spreadsheet or document editor of your choice. After using the search box, you must clear it and click 'Go' in order to see the full list of collocates again.
You can use substrings in the search box. You will notice that entering "school" will find collocates containing both the singular and plural forms e.g. "primary schools" and "secondary school". A search for "polic" will find "police officers" and "neighbourhood policing" but also "international policy"
Using the word collocates lists, you are now equipped to carry out a collocational analysis of the Labour manifesto and compare it to collocates found in the LibDem manifesto.

Tutorial G: metaphor analysis

The metaphor analysis features are not yet implemented in Wmatrix6, so you will need to head over to the tutorial for Wmatrix version 5 to learn about these steps.