Wmatrix tutorials

Linguistic analysis of Liberal Democrat and Labour Manifestos for the 2005 General Election (updated December 2013)

This document describes how to use the Wmatrix tool to carry out a comparison and corpus analysis of the Liberal Democrat and Labour Party Manifestos for the 2005 General Election. Tutorials A, B and C describe the steps required to collect, prepare, annotate and analyse the two documents. For more advanced users, tutorial D shows details of further analyses that can be performed and tutorial E introduces the methods that you can use to alter the way that Wmatrix semantically tags the text. Tutorial F describes the word collocation and semantic collocation facilities.

Further examples of the application to the 2010 general election manifestos can be seen on Paul's blog. The plain text versions of the 2010 UK election manifestos can be downloaded for use in your favourite text analysis software (with thanks to Martin Wynne for editing two of the files). A similar analysis for the 2015 General Election manifestos has been carried out.

Tutorial A: data collection and preparation

  1. Locate the two manifestos on the Labour and LibDem websites, if they are still there. Otherwise use the local copies provided here.
  2. Labour provide their manifesto in PDF (accessed on 5th May 2005) at: http://www.labour.org.uk/fileadmin/manifesto_13042005_a3/pdf/manifesto.pdf (local copy)
  3. Using Adobe Reader (or Adobe Acrobat), save the Labour document as plain text (File Menu -> Save As -> then change the file type to plain text). Please note that with some versions of the Adobe tools, you will need to manually remove running heads and bullet points by editing the resulting plain text file. Bullet points may be converted to the single character 'n'. Running heads may appear in the middle of some sentences that run over a page break.
  4. This plain text version still contains some non-ASCII characters. Remove these by opening the text file in Microsoft Word and saving the file as text with line breaks (File Menu -> Save As -> then change the file type to text with line breaks: MSWord 2000).
  5. The resulting file contains a few remaining non-ASCII characters (e.g. pound sign) but these can be left in for now (local copy).
  6. The LibDems provide a PDF version of their English General Election manifesto (accessed on 5th May 2005) at: http://www.libdems.org.uk/media/documents/policies/manifesto2005.pdf (local copy)
  7. Since this file contains multiple columns and the Adobe conversion to text from multiple column format does not always preserve the correct word order, it is preferable to use the text version of the manifesto in RTF format at: http://www.libdems.org.uk/media/documents/manifesto05.rtf (local copy)
  8. Open the RTF file in Microsoft Word and save the file as text with line breaks (File Menu -> Save As -> then change the file type to text with line breaks: MSWord 2000).
  9. The resulting file is now ready for use in Wmatrix (local copy).
  10. The above steps illustrate the process for these two files. The same steps can be applied to other PDF, DOC or RTF files in order to convert them to TXT format for Wmatrix.
  11. If you have a very large number of files that you wish to load into Wmatrix, then they need to be combined into one file or a small number of files. Unix, Linux and Mac OS X users can use the 'cat' command to concatenate files. On Windows, you can use the 'copy' command with a list of files to concatenate them (see http://en.wikipedia.org/wiki/List_of_DOS_commands#copy). If you group your files into a smaller number and then load these into Wmatrix, the resulting folders can be grouped into one Wmatrix folder using the 'join' option in the advanced user interface.
  12. When loading any data into Wmatrix, care should be taken when the file contains angled brackets (< or >). These can be misinterpreted by Wmatrix as XML tags and some of the text may be left untagged and not counted by Wmatrix. See the input format guidelines for further instructions on how to avoid these problems.
  13. If you wish, you can deliberately force Wmatrix to ignore sections of your text e.g. headers or speaker markers. In order to do this, enclose this text within angled brackets e.g.
    <speaker id="A1">
    or
    <A>
    or
    <head content="Any header text here">
    would all be ignored by Wmatrix.
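
The preparation steps above (concatenating files, checking for leftover non-ASCII characters and spotting stray angled brackets) can also be scripted. The following is a minimal Python sketch and not part of Wmatrix itself; the function names are hypothetical and the files are assumed to be plain text encoded as UTF-8.

```python
from pathlib import Path


def concatenate_texts(paths, out_path, encoding="utf-8"):
    """Join several plain text files into one, separated by blank lines.

    This mirrors what 'cat' (Unix) or 'copy' (Windows) would do.
    """
    parts = [Path(p).read_text(encoding=encoding).strip() for p in paths]
    Path(out_path).write_text("\n\n".join(parts) + "\n", encoding=encoding)


def non_ascii_characters(text):
    """Return the set of non-ASCII characters in the text, e.g. a pound sign."""
    return {ch for ch in text if ord(ch) > 127}


def lines_with_angle_brackets(text):
    """Return the lines containing '<' or '>' so they can be checked by hand.

    Some of these may be deliberate markers (e.g. <speaker id="A1">),
    so nothing is changed automatically.
    """
    return [line for line in text.splitlines() if "<" in line or ">" in line]
```

For example, calling concatenate_texts with a list of file names and an output name produces a single file ready for upload, and lines_with_angle_brackets lists the lines that should be checked against the input format guidelines before loading.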

Tutorial B: upload to Wmatrix

  1. Log in to Wmatrix using your existing username and password.
  2. Click on the tag wizard option.
  3. For the LibDem text file, follow the instructions to name the folder, select the file on your local disk, and click to upload the data.
  4. Wait while the tag wizard completes the annotation and counting process. It should take around one minute.
  5. Repeat the above two steps for the Labour text file.
  6. The two manifestos are now ready for analysis in Wmatrix.

Tutorial C: data analysis (using the simple interface)

  1. Click on the "My Folders" option to see the Labour and LibDem folders that you have created using the tag wizard.
  2. Click on the LibDem folder and view the word list to see words most used in the LibDem manifesto: 'the', 'and', 'to' are most frequent, not surprisingly, but 'government' is used 84 times, 'tax' 50 times, 'environment' 34 times. If you are using the advanced interface, the full list can be saved as a text file by right clicking on the file icon and clicking 'save as'. Note that some multi-word-expressions are marked by the system as words joined by underscore characters e.g. red_tape, tuition_fees, public_transport.
  3. In the same way, you can view word frequencies for the Labour Manifesto (LibDem full word frequency list and Labour full word frequency list).
  4. Note that using the advanced interface, you can also view frequency lists by part-of-speech and semantic tag (see step D.3).
  5. To see a concordance of a particular word or tag, click on the concordance link alongside the word or tag in a frequency list. A sample of a concordance is shown here:
    Concordance from the LibDem manifesto
  6. To compare the LibDem and Labour manifestos using the simple interface, select the Labour Manifesto folder from the drop down lists under word cloud and tag cloud options. This shows a visualisation of the items that are significantly more frequent in the LibDem manifesto than the Labour one. A larger font indicates greater significance. In the advanced interface, much more information appears alongside the clouds in table format showing the underlying frequency and log-likelihood information, see step D.4.
    Key word cloud Key domains cloud
  7. Concordances can be seen by clicking on the word or tags in the simple interface. Tooltips, which appear by hovering your mouse over the words or tags, show frequency and log-likelihood information.
  8. By default the clouds show overused items in the LibDem manifesto. If you want to see overused items in the Labour manifesto, you need to change to the Labour folder (click on My Folders top right and then choose the Labour folder) and then select the LibDem manifesto in the drop down list.
  9. Compare the word cloud and the tag cloud for the LibDem versus Labour comparison. Which features in the text are shown in both word and tag clouds? Which features only emerge when you use the tag cloud? Are there any words in the word cloud which cannot be found in the key semantic tags in the tag cloud?
  10. So far you've compared the two manifestos directly against each other. You can also compare one manifesto to a large reference corpus to discover key words and key semantic categories. For example, if you compare the LibDem manifesto against the 'BNC Sampler Written' corpus, then items such as "pensioners" and "people" appear as key. The choice of which reference corpus to use is very important. Experiment with some of the other standard reference corpora provided in the tool e.g. British English 2006 (BE06) and American English 2006 (AmE06) to see what differences emerge in the results. Further details about the reference corpora that are available in Wmatrix can be seen in the help system by clicking on "Contents" in the Help menu at the top of the screen and then selecting the topic "Standard reference corpora for key analysis".
  11. Repeat these steps for the LibDem 2010 manifesto (compared against the BNC Sampler Written) and contrast the results at the key words level with those at the key domains (semantic tags) level. Some results at the key domains level reinforce results at the key words level e.g. "sustainable" and "climate" appear as key words and "Green issues" appears at the semantic level. Other patterns can only be seen at the key semantic level e.g. "Law and order" appears at the semantic level but is harder to spot at the key words level (other than with key words lower down the list such as "law" and "prison"). This illustrates the advantage of the key semantic domains approach over the key words approach: it allows you to spot further items of interest that do not otherwise appear. For more details about this, see the 2008 paper in IJCL:
    Rayson, P. (2008). From key words to key semantic domains. International Journal of Corpus Linguistics. 13:4 pp. 519-549. DOI: 10.1075/ijcl.13.4.06ray
  12. If you haven't already done so, you should now switch to the advanced user interface by going to the top of the Wmatrix screen, clicking on "My folders" and then clicking on "switch to advanced interface". Then when you click on any folder, you will see much more detailed information and many more options for exploring and analysing the data therein. For more guidance on the advanced interface, please view the "Advanced Folder Interface" video that appears on the Wmatrix home page.
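
The log-likelihood values behind the clouds and keyness tables can be reproduced by hand for a single item. The Python sketch below uses the two-corpus form of the log-likelihood statistic that is commonly used for corpus comparison; it is assumed, not confirmed, that Wmatrix computes its values in exactly this way. The function name is hypothetical, and the corpus sizes and the Labour frequency in the usage note are invented for illustration.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood for one item observed in corpus A and corpus B.

    freq_a, freq_b: frequency of the item in each corpus
    size_a, size_b: total number of words in each corpus
    """
    total = size_a + size_b
    # Expected frequencies if the item were spread evenly across both corpora
    expected_a = size_a * (freq_a + freq_b) / total
    expected_b = size_b * (freq_a + freq_b) / total
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero frequency contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll
```

For instance, log_likelihood(84, 16000, 40, 23000) compares an item used 84 times in a 16,000 word text with 40 times in a 23,000 word text (illustrative figures only). A larger value indicates a more significant difference; the direction (overuse or underuse) is read from the relative frequencies.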

Tutorial D: Advanced data analysis (MWEs, n-grams and c-grams)

Please note that the n-grams and c-grams features in Wmatrix are currently switched off while a new faster and more powerful n-gram tool is implemented.
  1. In step C.2 above, we saw that the word frequency list contains some multiword expressions. Using the advanced interface for Wmatrix, you can extract the multiword elements marked by the tool and find n-grams and c-grams. In this part of the tutorial, we will look at some of the advanced features in Wmatrix which include MWE analysis.
  2. At the top right of the Wmatrix screen, click on "My folders" and then click on "switch to advanced interface" (if you haven't already done so by now).
  3. Select the LibDem Manifesto 2005 folder as before and you will see a more complicated interface than the simple view. This lets you see frequency lists of POS tags as well as compare POS tag profiles just as you did before with word and semantic tag profiles. Try experimenting with these features to see how the Labour and LibDem manifestos compare.
  4. Also, at this point you can familiarise yourself with which features are the same as in the simple interface: word frequency is the same, word search is called concordancing in the advanced interface, and the word clouds views include much more information in table format. Some features are new to the advanced interface: POS tag frequencies, key POS analysis, file format conversion and download. As an example of the more detailed information in the advanced interface, once you are in the LibDem folder you can select the Labour Manifesto folder in the 'key analysis' drop down lists at the word, part-of-speech and semantic level. At the semantic level, the table below is produced. The keyness table is sorted on the log-likelihood value, resulting in the most significant differences at the top of the table. Full word lists for each tag can be seen by clicking on the 'list' links on the left in the advanced interface.
    Key domains in the advanced interface
  5. In order to see multiword expressions (MWEs), first select the word frequency list sorted by frequency (top left in the main table in the LibDem folder), then ...
  6. In the predefined filter drop down list, select multiword expressions and click 'go'. You should see just the MWEs contained in the word frequency list. For example, "make_sure" occurs 22 times, "such_as" 17 times, "red_tape" 13 times. If you click on any concordance link, these will be shown just as for single word concordances.
  7. For further information on how Wmatrix identifies MWEs using manually defined rules or templates in the USAS dictionaries, you can read two papers, Piao et al (2005) and Rayson et al (2004). Full references are listed on the USAS website.
  8. Another way to identify MWEs is to use the n-gram technique for counting recurrent 'n'-words-long patterns in the text. Note that a 1-gram list is equivalent to a plain word frequency list. In the advanced interface, you can see n-grams from 2 words long up to 5 words long from the main interface. These are found by clicking on the numbers 2, 3, 4, and 5 in the first line of the table alongside the word frequency lists. If these numbers are not shown, you need to click on "Make: n-grams (2 to 5) and c-grams" in the bottom right of the main table and follow the instructions displayed there.
  9. You can export the n-gram lists as tab-delimited files by clicking on the link on the top right of the table when viewing the lists.
  10. Compare the 2-gram and 3-gram lists with the lists of MWEs extracted using the drop down filter in step D.6 above. There is some overlap e.g. "make_sure" and "such_as". Other items in the 2-gram list ("we will", "of the") are not included in the MWE list because they are not identified as semantically tagged units by the USAS tagger. Further items in the 2-gram list e.g. "liberal democrats" and "liberal democrat" should appear in the MWE list but do not because they are not listed in the USAS dictionary. Using the advanced features in "My Tag Wizard" you can add your own words and MWEs to the lists in the USAS dictionaries contained in Wmatrix (see tutorial E below). A good way to identify candidates for adding to your own dictionary in Wmatrix is to use the n-gram lists to find recurring items.
  11. Compare the items listed in the 2, 3, 4 and 5-gram frequency lists. What items in the 2-gram list are also contained in the patterns shown in the 3-gram list? You should also find items in the 3-gram list that are part of items in the 4-gram list and so on with 4 and 5-grams.
  12. In order to avoid the manual work of identifying overlapping items, you can use c-grams (collapsed-grams). Click on the letter "C" in the main LibDem folder. You will see a tree representation. This shows which 4-grams appear within each 5-gram, which 3-grams are contained within each 4-gram and so on recursively down the tree. For example, "commission on long term care" contains "commission on long term" and "on long term care". These in turn contain "commission on long", "on long term" and "long term care". Finally, the 2-grams "on long", "commission on" and "term care" are subsets of the preceding 3-grams.
  13. You can export the c-gram table as a tab-delimited file by clicking on the link on the top right of the table.
  14. The c-gram approach is intended to help with filtering shorter length n-gram lists to find more useful n-gram units that might be sensible chunks for linguistic analysis. This is a new feature in Wmatrix and feedback is most welcome!
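
The n-gram counting and the nesting shown in the c-gram tree can be illustrated offline. The following Python sketch is an assumption about the general technique, not the Wmatrix implementation; the function names are hypothetical and tokenisation is reduced to splitting on spaces.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every n-word sequence in the token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, n):
    """Frequency list of n-grams (n=1 gives a plain word frequency list)."""
    return Counter(ngrams(tokens, n))


def contained_ngrams(gram):
    """Return the two (n-1)-grams nested directly inside an n-gram.

    2-grams are treated as the leaves of the tree, as in step 12.
    """
    if len(gram) < 3:
        return []
    return [gram[:-1], gram[1:]]


def print_tree(gram, depth=0):
    """Print an n-gram followed by the shorter n-grams it contains."""
    print("  " * depth + " ".join(gram))
    for child in contained_ngrams(gram):
        print_tree(child, depth + 1)
```

Calling print_tree(tuple("commission on long term care".split())) prints the 5-gram, then its two 4-grams, and so on down to the 2-grams. Unlike a collapsed table, this simple tree repeats shared items such as "on long term" under each parent.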

Tutorial E: Extending the Wmatrix dictionaries

  1. As we saw in step D.10 above, there are some MWE items that you might wish to add to the Wmatrix dictionaries since they are not tagged in your data, e.g. "liberal democrat". In addition, words with domain specific meanings e.g. "party" meaning "political group", might not be correctly classified in Wmatrix. Thirdly, some words may not be known to the semantic tagger and therefore receive a Z99 tag meaning 'unknown'. They will still receive a POS tag from the CLAWS tagger. There are two features in Wmatrix that allow you to change the way that the system works:
    • The "Domain tag wizard" allows you to give the semantic tagger extra information about the semantic domains (or fields) if you know in advance which domain(s) the text in your file is about. This will mean that the tagger increases the likelihood of the preferred domain(s) and will usually choose the tag(s) when an ambiguous word or MWE occurs in the text. For example, the most likely tag for "party" in Wmatrix is K1/S1.1.3+ to represent the celebration/entertainment sense. The second most likely sense is G1.2 for political party. If you wish to force the system to choose the political sense, then you can use the domain tag wizard and enter "G", "G1" or "G1.2" in the 'preferred semantic field' box when you load the data into Wmatrix. Further instructions on this are contained in the "Domain tag wizard" screen. You need to be using the advanced interface in order to see this feature. Take the data prepared in tutorial A and run it through the domain tag wizard using different 'preferred semantic fields' and see what effect they have.
    • In order to add new words and MWEs to the USAS dictionaries contained in Wmatrix, you can use the "My Tag Wizard" feature. This allows you to create your own supplementary word list and MWE list which you can then merge with the system dictionaries for your personal use. Running new data through "My Tag Wizard" rather than the normal tag wizard activates these extended dictionaries for tagging. You could use this feature to add new words or new senses of words that you have found in your own data. Further instructions are contained in the "My Tag Wizard" page within Wmatrix. You could also employ this wizard to introduce a new semantic tag to the system by listing the words and MWEs that you wish to be categorised with this tag. If you do that, then bear in mind that the reference corpora (e.g. BNC sampler) are not tagged with your new scheme, so comparability is an issue to take account of. You need to be using the advanced Wmatrix interface in order to have this feature available.
    • Once you have created your own dictionaries to use in "My Tag Wizard", you can still use the normal Wmatrix versions by returning to the plain tag wizard and running new data through it.
  2. The following steps guide you through the process of creating a personal lexicon to use in the My Tag Wizard feature:
    1. In order to find words that you might add to your supplementary dictionary, you can look at the list of unknown words in each folder. The icon is a question mark. If you click on it you will see a list of words that are not in the main dictionaries of the semantic tagger. You will also see typos, misspelt words and words that have been incorrectly tokenised by the system. For example, in the Labour Manifesto for 2005, you will see items such as "full.They", "globalisation", "G7" and "G8". The first of these is caused by a lack of space character at the end of the sentence between "full" and "They". Errors like this are due to the automatic way that the PDF, DOC and RTF files were converted. You can manually correct them in the original text files and reload them into Wmatrix if you wish, although it should be noted that they are very infrequent. Words like "globalisation", "G7" and "G8" are good candidates to be added to your personal dictionaries in the My Tag Wizard feature.
    2. Open a local file to store your personal lexicon and copy the appropriate lines from the unknown list into your file. This local file should be in plain text format, so you should use WordPad or NotePad on Windows, TextEdit on Mac OS X or a plain text editor on Linux. If you use MSWord to edit it, then make sure that you save the personal lexicon in plain text format. For each line that you have copied, you will need to edit the semantic tag field. For most of the lines in the list of unknown words, the semantic tag is shown as Z99. However, for each of the unknown words, the USAS tagger within Wmatrix tries to make a best guess at possible semantic categories. It looks up the unknown word in WordNet to find a list of synonyms. The tagger then looks up each of these synonyms in its dictionary. If the synonyms occur in the USAS dictionary, then the semantic tags from the synonyms are used to make a candidate list of tags for the unknown word. You may find that some or all of these candidate tags are not appropriate so you can discard them. If you are not sure which tag to use, then read the guide to the semantic tagset on the USAS web page. You can also search the main dictionary for similar words and see how they are coded. To do this, click on "USAS: Lexicon" in the Help menu. It is also worth viewing this file since you will need to replicate the format in your own personal dictionary.
    3. In addition to unknown words, you can use the personal lexicon to override the tagging of words that are already in the main system lexicon if the meanings tagged by the system are incorrect for your data. You may notice these incorrect tags through frequency profiling and concordancing on your data. For example, the most likely tag for "party" in Wmatrix is K1/S1.1.3+ to represent the celebration/entertainment sense. The second most likely sense is G1.2 for political party. If you wish to improve the chances of the system choosing the political sense, then you can copy the line containing "party" from the main system lexicon to your personal lexicon file and edit the order of the tags. If you move G1.2 to be first in the semantic tag list, it is much more likely to be chosen.
    4. You may also use the personal lexicon to introduce a completely new semantic tag for your data. In the My Tag Wizard page, you will see example files which introduce a new tag 'I5' for an innovation category. Bear in mind that new tags should follow the same format as existing ones, i.e. an upper case letter followed by subcategories indicated by numbers.
    5. Once you are happy with your local copy of your personal lexicon, save it to your local disk, then click on My Tag Wizard in the tagging menu in Wmatrix. Follow the instructions there to create your personal dictionaries and load the local copy of your lexicon to the system. Note that for My Tag Wizard to be activated you need to load a personal MWE list (see steps below) or merge a blank MWE list. When it is activated you can test running the Labour Manifesto through My Tag Wizard. Check that your personal lexicon changes have taken effect in the tagging of the file.
  3. The following steps guide you through the process of creating a personal MWE list:
    1. Finding candidate MWEs for your data is trickier than finding unknown words for the personal lexicon. You may find items such as 'liberal democrat(s)' through the process of concordancing. Another approach is to use the n-gram lists generated in step D.10 for extracting candidate phrases, terms, or idioms for inclusion. Of course, you may know in advance names of people, organisations or places that you wish to include.
    2. Editing a local copy of your personal MWE list is much the same as editing a local copy of your personal lexicon. The main difference is the format of each line. There are two places to view MWEs in the main system MWE list. Click on USAS MWEs in the Help menu to see the main system list. There is also a sample file in the My Tag Wizard screen of new MWEs to be tagged as I5. To start off, you can copy this format. Each part of the MWE should have "_*" which matches any POS tag. Later, you can revise this if you wish the MWE entries to be tied more specifically to POS tags. Each line ends with a semantic tag or a list of semantic tags with the most likely one in the first position.
    3. While you are editing your personal MWE list, you should check if similar entries already exist in the main system MWE list, accessed as described above. If there are no similar entries, then you can add a simple version of the new MWE containing "_*" for POS tags. However, if a similar entry already exists, or you wish to override the tags for a particular MWE, then you need to replicate the format in the main system entry. Due to the way that the MWE templates in the system dictionary are written it is quite possible that several 'match' at the same point in a given sentence. Hence, a set of heuristics is applied by the semantic tagger to say which one should be chosen in such a case. New personal rules may not be used because they are less tightly specified than those already in the system. For example, if you create a new MWE entry:
      going_* downhill_* M1
      This overlaps with one already in the main system MWE list:
      going_* {R*} downhill_RL A5.1-/A2.1
      Since the 'downhill' part has a fixed POS tag "RL" the system rule wins over your new rule and is applied instead. The way to fix this is to change your new rule to be
      going_* {R*} downhill_RL M1
    4. Once you have finished editing the local copy of your personal MWE list, you should load it into the system by clicking on My Tag Wizard and then the link for editing your personal dictionaries. Follow the instructions on the screen to load the file. Don't forget to merge both the personal lexicon and MWE list. When that is complete, try running a new file through My Tag Wizard to see if your new entries are being applied correctly.
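
If you have many candidate phrases, the simple "_*" form of a personal MWE entry described above can be generated automatically. The Python sketch below is only an illustration: the function name is hypothetical, the single-space separator is copied from the "going_* downhill_* M1" example and should be checked against the sample file in the My Tag Wizard screen, and the tag G1.2 for "liberal democrat" in the usage note is an assumed choice.

```python
def simple_mwe_entry(phrase, tags):
    """Build one personal MWE line with the "_*" wildcard on every word.

    phrase: the multiword expression, e.g. "going downhill"
    tags:   semantic tags, most likely one first, e.g. ["M1"]
    """
    words = phrase.split()
    if len(words) < 2:
        raise ValueError("an MWE needs at least two words")
    if not tags:
        raise ValueError("at least one semantic tag is required")
    # "_*" after each word matches any POS tag
    template = " ".join(word + "_*" for word in words)
    return template + " " + " ".join(tags)
```

For example, simple_mwe_entry("liberal democrat", ["G1.2"]) gives "liberal_* democrat_* G1.2". Entries that overlap with an existing system rule still need to be edited by hand to replicate the system format, as explained in step 3.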

Tutorial F: word collocations and semantic collocations

  1. This tutorial assumes that you have already completed tutorials A and B. You should also be using the advanced interface to Wmatrix. If you're not using the advanced interface, switch to it now by clicking 'switch to advanced interface' at the top left of the Wmatrix screen.
  2. Enter the folder for the Labour manifesto (2005) that was uploaded to Wmatrix in tutorial B by clicking on the icon or the name.
  3. In the main table, you should see a column headed 'Collocation'. Underneath this will be two links called 'Word' and 'Word-Semantic'.
    • If you don't see these links, then scroll further down the table and locate the link called 'Calculate: word and semantic collocations' in the manual operations and file conversions section
    • Click on that calculate link and follow the instructions. The word level and semantic level collocations will now be calculated. You may need to refresh the main folder view in Wmatrix while the collocations are being calculated. Once the calculations are completed, the links in the main table will be activated.
  4. To view the word level collocations, click on the 'Word' link in the 'Collocation' column of the main table.
  5. You will see a table of all the collocates extracted from the Labour manifesto, for example "Sure Start", "Tony Blair" and "Preface Tony" are the first three.
  6. The default statistic used is the Mutual Information (MI). You can change the statistic using the drop down box and clicking "Go". Eleven different statistics are available for you to use. Select Log-Likelihood and click "Go".
  7. Using Log-Likelihood, you will see the first two collocates are "We will" and "per cent". This illustrates the difference between collocates extracted using different statistics. For more information on the statistics used, please see the paper by Scott Piao:
    Piao, S. (2002) Word alignment in English-Chinese parallel corpora. Literary and linguistic computing, 17 (2), 207-230. doi:10.1093/llc/17.2.207
  8. No matter which collocation statistic you use, the table of collocates will be ordered by the collocation statistic score itself. A T-score filter is also applied to remove non-significant collocates.
  9. If you wish to find collocates of a given word, then enter the word in the 'search this list' box and click 'Go'. For example, using the MI statistic, you can enter "health" in the search box and see that there are four collocates extracted: "mental health", "restored health", "good health" and "health services"
  10. At any point, you can right-click on the save icon and download a tab-delimited file containing the information in the table. This tab-delimited form can be imported into a spreadsheet or word-processor document.
  11. After using the search box, you must clear it and click 'Go' in order to see the full list of collocates again.
  12. You can use substrings in the search box. You will notice that entering 'school' will find collocates containing both the singular and plural forms e.g. "primary schools" and "secondary school". A search for 'polic' will find "police officers" and "neighbourhood policing" but also "International policy"
  13. To explore the semantic collocations, return to the Labour manifesto folder by clicking on the folder name at the top right of the Wmatrix window. Then, click on 'Word-Semantic' in the 'Collocation' column of the main table.
  14. The view of the list is very similar to the word collocations, but this time you can see which semantic tags are collocated with certain words. For example, in the list for the MI statistic, "prime G1.1/S2mf" and "lawabiding N5+++c" are shown as the most significant collocates.
  15. You can change the statistic and search the list as before with the word level collocations.
  16. You will need a list of semantic tags in order to see what groups of words are represented by these word-tag collocate pairs. You could open another Wmatrix window and run concordances for the tags in order to see what sorts of words are being picked out.
  17. In the MI list, search on the word 'council'. You will see four semantic collocates: "council G1.1/I1", "councils S1.2.1-", "councils S7.1+", and "councils A1.1.1".
    • If you run concordances for the tag 'A1.1.1' and then search for the word 'councils' in the context, you will see that this collocation represents 'councils' either tackling or dealing (tagged as A1.1.1) with antisocial behaviour, unauthorised sites or pubs and clubs.
    • If you run concordances for the tag 'S7.1+' and then search for the word 'councils' in the context, you will see that this collocation represents 'councils' linked with words such as power, leading and organised (all tagged as S7.1+). Read the wider concordance lines in order to see exactly what is written in the manifesto about this link.
    • A similar search for the tag 'S1.2.1-' reveals that the word 'councils' is collocated with 'antisocial behaviour'.
    • Carrying out the same operation for the tag 'G1.1/I1' shows the collocate 'council tax'.
  18. Using the word collocates and semantic collocates lists, you are now equipped to carry out a collocational analysis of the Labour manifesto and compare it to the LibDem manifesto.
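
To see why different statistics rank collocates differently, it can help to compute two of them by hand. The Python sketch below uses the standard textbook definitions of Mutual Information and T-score for a word pair; it is an assumption that Wmatrix follows these definitions exactly, and all function names are hypothetical. The last function imitates the substring matching of the 'search this list' box.

```python
import math


def expected_cooccurrence(freq_x, freq_y, corpus_size):
    """Expected frequency of the pair if the two words were independent."""
    return freq_x * freq_y / corpus_size


def mutual_information(freq_xy, freq_x, freq_y, corpus_size):
    """MI: log2 of observed over expected pair frequency."""
    expected = expected_cooccurrence(freq_x, freq_y, corpus_size)
    return math.log2(freq_xy / expected)


def t_score(freq_xy, freq_x, freq_y, corpus_size):
    """T-score: (observed - expected) / sqrt(observed)."""
    expected = expected_cooccurrence(freq_x, freq_y, corpus_size)
    return (freq_xy - expected) / math.sqrt(freq_xy)


def search_collocates(collocates, substring):
    """Return the collocates containing the substring, ignoring case."""
    needle = substring.lower()
    return [pair for pair in collocates if needle in pair.lower()]
```

MI tends to favour pairs whose words rarely occur apart, even when the pair itself is infrequent, which is consistent with "Sure Start" heading the MI list while the very frequent "We will" heads the Log-Likelihood list. A call such as search_collocates(["police officers", "neighbourhood policing", "International policy"], "polic") returns all three items, as in step 12.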

Background reading