UCREL Semantic Analysis System (USAS)




The UCREL semantic analysis system is a framework for undertaking the automatic semantic analysis of text. The framework has been designed and used across a number of research projects, and this page collects pointers to those projects and to the publications produced since 1990.

The semantic tagset used by USAS was originally loosely based on Tom McArthur's Longman Lexicon of Contemporary English (McArthur, 1981). It has a multi-tier structure with 21 major discourse fields, which are subdivided, with the possibility of further fine-grained subdivision in certain cases. We have written an introduction to the USAS category system (PDF file) with examples of prototypical words and multi-word units in each semantic field.
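The multi-tier structure can be illustrated with a minimal sketch. It assumes, purely for illustration, that a tag is written as one uppercase letter for the major discourse field followed by dot-separated digits for each finer subdivision (for example "A1.1.1"). The function name `tag_tiers` is hypothetical and is not part of the USAS software, and any additional markers a tag may carry are out of scope here.

```python
import re

# Assumed tag shape (not taken from this page): one uppercase letter for the
# major discourse field, optionally followed by dot-separated digit groups.
TAG_PATTERN = re.compile(r"^([A-Z])(\d+(?:\.\d+)*)?$")


def tag_tiers(tag):
    """Return the chain of categories from the major field down to the tag itself."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognised tag shape: {tag!r}")
    field, digits = match.groups()
    tiers = [field]  # tier 0: the major discourse field
    if digits:
        parts = digits.split(".")
        # Each additional digit group adds one finer level of subdivision.
        for depth in range(1, len(parts) + 1):
            tiers.append(field + ".".join(parts[:depth]))
    return tiers


print(tag_tiers("A1.1.1"))  # ['A', 'A1', 'A1.1', 'A1.1.1']
```

Grouping tagged words by the first element of this chain gives counts per major field, while later elements give progressively finer breakdowns.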

The full tagset is available on-line in plain text form and formatted on one page in PDF. The tagset has been translated to Finnish by Laura Löfberg (University of Tampere, Finland). You can also see the USAS semantic tagset in Russian as a two page PDF and text file.

A visual representation showing the USAS tagset hierarchy is now on-line, along with those for the Louw-Nida model and the Hallig/Von Wartburg/Schmidt/Wilson Model.

You can try out the English semantic tagger online.

Extension of Semantic Tagger Framework for Chinese, Dutch, Italian, Portuguese and Spanish

The USAS framework is now being extended to cover five more languages: Chinese, Dutch, Italian, Portuguese and Spanish. The Java software framework developed in the Benedict and ASSIST projects has been modified to accommodate these languages, and semantic lexicons are being compiled for them by automatically "translating" the English semantic lexicon entries, with some manual improvement where possible. Because translations and part-of-speech correspondences between languages are inevitably ambiguous, the automatically translated lexicons contain errors, which need to be corrected manually. A website interface is provided for users to test the semantic taggers. This is a beta release of the tools, which will be improved in future. Please get in touch with Paul Rayson if you would like to be involved in further improvements of the tools.
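The automatic "translation" of lexicon entries might be sketched roughly as follows. This is an illustrative sketch only, not the project's Java implementation: the data layout, the function name `translate_lexicon`, the placeholder tag labels and the toy English-Spanish entries are all assumptions made for the example.

```python
from collections import defaultdict


def translate_lexicon(english_lexicon, bilingual_dictionary):
    """Project semantic tags from English entries onto their translations.

    english_lexicon:      {(english_word, pos): [semantic tags]}
    bilingual_dictionary: {(english_word, pos): [target-language words]}
    """
    target_lexicon = defaultdict(list)
    for (word, pos), tags in english_lexicon.items():
        # Entries with no dictionary translation are simply skipped.
        for translation in bilingual_dictionary.get((word, pos), []):
            for tag in tags:
                if tag not in target_lexicon[(translation, pos)]:
                    target_lexicon[(translation, pos)].append(tag)
    return dict(target_lexicon)


# Toy data with placeholder tag labels (not real USAS tags).
english = {("bank", "noun"): ["TAG_MONEY", "TAG_GEOGRAPHY"]}
dictionary = {("bank", "noun"): ["banco", "orilla"]}

spanish = translate_lexicon(english, dictionary)
print(spanish[("orilla", "noun")])  # ['TAG_MONEY', 'TAG_GEOGRAPHY']
```

In this toy case every translation inherits every tag of the ambiguous English word, so "orilla" (river bank) wrongly receives the money tag. This is the kind of error that manual checking has to remove.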

Chinese Semantic Tagger (http://phlox.lancs.ac.uk/ucrel/semtagger/chinese)

The Chinese semantic tagger has been developed by incorporating the Stanford Chinese word segmenter and the Chinese POS tagger into the USAS Java framework. The Chinese semantic lexicons have been automatically generated by translating the English semantic lexicon entries using a Chinese-English Dictionary (Xiao et al., 2010) and an LDC (Linguistic Data Consortium) English-Chinese Wordlist. Because the Stanford Chinese POS tagger and Xiao et al.'s dictionary use different Chinese POS tags, both sets of tags are mapped into a simplified common tagset to be used internally by the software system. The Chinese lexicon also employs a set of extended kinship semantic tags designed by Qian and Piao (2009). The full tagset is available in Chinese (txt, docx, pdf). We are grateful for the assistance of Dr Richard Xiao (Lancaster University, UK) and Qian Yufang (Zhejiang University of Media and Communications, China) with this research. Currently the Chinese single word and multi-word unit semantic lexicons contain over 64,000 and over 19,000 entries respectively. As automatically generated lexicons, they contain errors. The Chinese lexicons can be accessed here (download as UTF-8): (a) Single word lexicon (b) MWE lexicon
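The idea of mapping two different POS tagsets into one simplified common tagset can be sketched as below. All tag labels, mapping tables and function names here are hypothetical placeholders chosen for the example. They are not the tagsets or code actually used by the tagger, and English words stand in for lexicon entries to keep the example short.

```python
# Hypothetical mapping tables: labels on the left are placeholders for the
# POS tagger's tagset and the dictionary's tagset respectively.
TAGGER_TO_COMMON = {"NN": "noun", "VV": "verb", "JJ": "adj"}
DICTIONARY_TO_COMMON = {"n": "noun", "v": "verb", "adj": "adj"}


def to_common(pos, mapping):
    """Map a source POS label to the common tagset, defaulting to 'other'."""
    return mapping.get(pos, "other")


def build_lexicon(dictionary_entries):
    """Key the semantic lexicon by (word, common POS) instead of dictionary POS."""
    lexicon = {}
    for word, dictionary_pos, tags in dictionary_entries:
        key = (word, to_common(dictionary_pos, DICTIONARY_TO_COMMON))
        lexicon.setdefault(key, []).extend(tags)
    return lexicon


def lookup(word, tagger_pos, lexicon):
    """Look up a tagged token by converting the tagger's POS to the common tagset."""
    return lexicon.get((word, to_common(tagger_pos, TAGGER_TO_COMMON)), [])


lexicon = build_lexicon([
    ("run", "v", ["TAG_MOVEMENT"]),
    ("run", "n", ["TAG_SPORT"]),
])
print(lookup("run", "VV", lexicon))  # ['TAG_MOVEMENT']
```

Because both sides are converted before comparison, a token tagged by the POS tagger can be matched against a lexicon entry that came from the dictionary, even though the two resources never share a label.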

Creative Commons License: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Dutch Semantic Tagger (http://phlox.lancs.ac.uk/ucrel/semtagger/dutch)

The Dutch semantic tagger has been developed using a similar process to that of the Italian semantic tagger, using the Dutch version of TreeTagger. The Dutch lexicon has been compiled by translating the English semantic lexicon entries using a Dutch-English dictionary developed by Tiberius and Schoonheim (2014). As the dictionary and TreeTagger use different POS tagsets, both are mapped into a simplified common tagset to be used by the software system. Currently the Dutch single-word semantic lexicon contains 4,203 entries. We are grateful for the assistance of Dr Carole Tiberius (INL, Netherlands) with this research. The Dutch lexicon can be accessed here (download as UTF-8): (a) Single word lexicon (b) The Dutch multiword lexicon is not available yet.

Creative Commons License: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Italian Semantic Tagger (http://phlox.lancs.ac.uk/ucrel/semtagger/italian)

The Italian semantic tagger is being developed in collaboration with Dr Francesca Bianchi (Dip. di Studi Umanistici, Università del Salento, Italy) and Prof. Elena Semino (Dept. of Linguistics and English Language, Lancaster University, UK). The original Java software framework has been modified by incorporating the TreeTagger Italian POS tagger. The English semantic lexicon entries have been automatically translated into Italian counterparts using FreeLang and other English-Italian dictionaries with the help of Italian native speakers. Although some lexicon entries were manually checked, most were automatically generated and therefore inevitably contain errors, which need to be corrected manually in future. The full tagset is available in Italian (doc, pdf). Currently, there are two Italian semantic lexicons: a single-word lexicon (over 20,400 entries) and a multi-word lexicon (over 4,100 entries, which were manually checked). The Italian lexicons can be accessed here (download as UTF-8): (a) Single word lexicon (b) MWE lexicon

Creative Commons License: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Portuguese Semantic Tagger (http://phlox.lancs.ac.uk/ucrel/semtagger/portuguese)

The Portuguese semantic tagger is being developed using a similar process to that of the Italian semantic tagger, using the Portuguese TreeTagger. The Portuguese lexicons have been compiled by translating the English semantic lexicon entries using a Portuguese-English dictionary developed by Davies and Preto-Bay (2007) and the FreeLang Portuguese-English bilingual lexicon. A small section of the lexicons was manually checked, but most of the lexicon entries were automatically generated and therefore contain errors, which need to be corrected manually in the future. As the POS tagger and lexical resources use different POS tagsets, these tagsets are mapped into a simplified common tagset to be used by the software. Currently, there are two Portuguese semantic lexicons: a single-word lexicon (over 13,900 entries) and a multi-word lexicon (over 1,780 entries). The Portuguese lexicons are being created with the help of Carmen Dayrell (CASS, Lancaster University, UK). These lexicons can be accessed here (download as UTF-8): (a) Single word lexicon (b) MWE lexicon.

Creative Commons License: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Spanish Semantic Tagger (http://phlox.lancs.ac.uk/ucrel/semtagger/spanish)

The Spanish semantic tagger is in an early stage of development, following a similar process to that of the Italian semantic tagger, using the Spanish TreeTagger. Currently the Spanish tagger has only a single-word semantic lexicon, compiled by translating the English semantic lexicon entries using a Spanish-English dictionary compiled by Mark Davies (2006). Because it was generated by an automatic process, the lexicon contains errors and needs to be cleaned manually in future. Because the POS tagger and the dictionary employ different POS tagsets, these tagsets are mapped into a simplified common POS tagset to be used by the software. Currently, the Spanish semantic lexicon contains over 2,000 entries, which can be accessed here (download as UTF-8): (a) Single word lexicon

Creative Commons License: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Funded projects

The software and linguistic resources underpinning the semantic analysis have been designed and produced during five projects. The ACASD and ACAMRIT projects led to the initial design and implementation of the tools and applied them in the area of interview transcripts. The REVERE project applied the tools in the domain of software engineering documentation using a web front end called Wmatrix. In Benedict, we re-implemented the English semantic tagging (EST) tool in Java and improved the linguistic resources in the tool. In addition, we developed a Finnish Semantic Tagging (FST) tool. In the ASSIST project, we extended the existing USAS framework to construct a Russian Semantic Tagger (RST). In 2013, the UCREL research centre funded initial development of the Italian, Dutch and Chinese lexicons.

People

Andrew Wilson was the RA in Linguistics on the first two projects and Paul Rayson was the RA in Computing on all five projects. Scott Piao and Dawn Archer were the RAs on the Benedict project. Olga Mudraya was the RA in Linguistics and Scott Piao was the Computing RA on the ASSIST project. Scott Piao is the Computing RA on the initial development of the Italian, Dutch and Chinese taggers. The grant holders and supervisors were Roger Garside (Computing), Geoff Leech (Linguistics) and Jenny Thomas (Linguistics, now at Bangor). Tony McEnery was the principal investigator for Benedict. For ASSIST, Roger Garside, Tony McEnery, Andrew Wilson and Paul Rayson were the grant holders.

Availability

Publications describing the system (or extensions of the system)

  1. Wilson, A. and Rayson, P. (1993). Automatic Content Analysis of Spoken Discourse. In: C. Souter and E. Atwell (eds), Corpus Based Computational Linguistics. Amsterdam: Rodopi. pp. 215-226 (text)
  2. Wilson, A. (1993). Towards an Integration of Content Analysis and Discourse Analysis: The Automatic Linkage of Key Relations in Text. UCREL Technical Paper 3, Linguistics Department, Lancaster University. PDF version
  3. Rayson, P., and Wilson, A. (1996). The ACAMRIT semantic tagging system: progress report. In L. J. Evett, and T. G. Rose (eds) Language Engineering for Document Analysis and Recognition, LEDAR, AISB96 Workshop proceedings, pp. 13-20. Brighton, England. Faculty of Engineering and Computing, Nottingham Trent University, UK. ISBN 0 905 488628 PDF version
  4. Wilson, A. and Thomas, J.A. (1997) Semantic annotation, in Garside, R., Leech, G., and McEnery, A. (eds.) Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, London, pp. 53-65.
  5. Garside, R., and Rayson, P. (1997). Higher-level annotation tools. In R. Garside, G. Leech, and A. McEnery (eds.) Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, London. pp. 179-193.
  6. Paul Rayson (2002). USAS: UCREL semantic analysis system. Invited talk at Daito Bunka University, Tokyo, Japan. February 2002. (HTML slides)
  7. Dawn Archer, Andrew Wilson, Paul Rayson (2002). Introduction to the USAS category system. Benedict project report, October 2002. (PDF version)
  8. Dawn Archer, Tony McEnery, Paul Rayson, Andrew Hardie (2003). Developing an automated semantic analysis system for Early Modern English. In Dawn Archer, Paul Rayson, Andrew Wilson and Tony McEnery (eds.) Proceedings of the Corpus Linguistics 2003 conference. UCREL technical paper number 16. UCREL, Lancaster University, pp. 22 - 31. PDF version
  9. Laura Löfberg, Dawn Archer, Scott Piao, Paul Rayson, Tony McEnery, Krista Varantola, Jukka-Pekka Juntunen (2003). Porting an English semantic tagger to the Finnish language. In Dawn Archer, Paul Rayson, Andrew Wilson and Tony McEnery (eds.) Proceedings of the Corpus Linguistics 2003 conference. UCREL technical paper number 16. UCREL, Lancaster University, pp. 457 - 464. PDF version
  10. Scott S. L. Piao, Paul Rayson, Dawn Archer, Andrew Wilson and Tony McEnery (2003). Extracting Multiword Expressions with a Semantic Tagger. In proceedings of the Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, at ACL 2003, 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, July 12, 2003, pp. 49-56. PDF version
  11. Piao, Scott S. L., Paul Rayson, Dawn Archer, Tony McEnery (2004). Evaluating Lexical Resources for A Semantic Tagger. In proceedings of 4th International Conference on Language Resources and Evaluation (LREC 2004), May 2004, Lisbon, Portugal, Volume II, pp. 499-502. ISBN 2-9517408-1-6. PDF version
  12. Rayson, P., Archer, D., Piao, S. L., McEnery, T. (2004). The UCREL semantic analysis system. In proceedings of the workshop on Beyond Named Entity Recognition Semantic labelling for NLP tasks in association with 4th International Conference on Language Resources and Evaluation (LREC 2004), 25th May 2004, Lisbon, Portugal, pp. 7-12. PDF version
  13. Archer, D., Rayson, P., Piao, S., McEnery, T. (2004). Comparing the UCREL Semantic Annotation Scheme with Lexicographical Taxonomies. In Williams G. and Vessier S. (eds.) Proceedings of the 11th EURALEX (European Association for Lexicography) International Congress (Euralex 2004), Lorient, France, 6-10 July 2004. Université de Bretagne Sud. Volume III, pp. 817-827. ISBN 2-9522-4570-3. PDF version
  14. Paul Rayson, Scott Piao, Dawn Archer (2004). Modern and Historical Aspects of the UCREL Semantic Analysis System. Invited talk at the University of Sheffield, UK, 16th November 2004. (PDF version of slides)
  15. Rayson, P. (2005) Right from the word go: identifying multi-word-expressions for semantic tagging. Invited talk at BAAL Corpus Linguistics SIG / OTA Workshop: Identifying and Researching Multi-Word Units. Thursday 21st April 2005, Oxford University Computing Services. (PDF version of slides)
  16. Scott S.L. Piao, Dawn Archer, Olga Mudraya, Paul Rayson, Roger Garside, Tony McEnery, Andrew Wilson (2005) A Large Semantic Lexicon for Corpus Annotation. In proceedings of the Corpus Linguistics 2005 conference, July 14-17, Birmingham, UK. Proceedings from the Corpus Linguistics Conference Series on-line e-journal, Vol. 1, no. 1, ISSN 1747-9398. PDF version
  17. Piao, S., Rayson, P., Archer, D., McEnery, T. (2005) Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, (Special issue on Multiword expressions), Volume 19, issue 4, pp. 378 - 397, Elsevier. doi:10.1016/j.csl.2004.11.002
  18. Mudraya, O., Babych, B., Piao, S., Rayson, P., Wilson, A. (2006). Developing a Russian semantic tagger for automatic semantic annotation. In proceedings of Corpus Linguistics 2006, St. Petersburg, from 10-14 October 2006. English PDF version Russian PDF version (slides)
  19. Qian, Yufang and Scott Piao (2009). The Development of A Semantic Annotation Scheme for Chinese Kinship. Corpora, Vol. 4 (2), Edinburgh University Press. pp. 189-208.

Publications describing applications of the system

  1. Wilson, A. and Leech, G.N. (1993). Automatic Content Analysis and the Stylistic Analysis of Prose Literature. Revue: Informatique et Statistique dans les Sciences Humaines 29: 219-234.
  2. Thomas, J., and Wilson, A. (1996). Methodologies for studying a corpus of doctor-patient interaction. In J. Thomas and M. Short (eds) Using corpora for language research. Longman, London, pp. 92-109.
  3. Rayson, P., Garside, R., and Sawyer, P. (1999). Recovering Legacy Requirements. In Proceedings of REFSQ'99 Fifth International Workshop on Requirements Engineering: Foundations of Software Quality, June 14-15 1999, Heidelberg, Germany. Published by University of Namur, pp. 49-54. ISBN 2 87037 307 4. PDF version
  4. Rayson, P., Garside, R., and Sawyer, P. (2000). Assisting requirements engineering with semantic document analysis. In Proceedings of Content-based multimedia information access RIAO 2000 (Recherche d'Informations Assistie par Ordinateur, Computer-Assisted Information Retrieval) International Conference, College de France, Paris, France, April 12-14, 2000. C.I.D., Paris, pp. 1363 - 1371. ISBN 2-905450-07-X PDF version
  5. Rayson, P., Emmet, L., Garside, R., and Sawyer, P. (2000). The REVERE Project: Experiments with the application of probabilistic NLP to Systems Engineering. In proceedings of 5th International Conference on Applications of Natural Language to Information Systems (NLDB'2000). Versailles, France, June 28-30th, 2000. PDF version
  6. Rayson, P., Garside, R., and Sawyer, P. (2000). Assisting Requirements Recovery from Legacy Documents. In Henderson, P. (ed.) Systems Engineering for Business Process Change: collected papers from the EPSRC research programme. Springer-Verlag, London, pp. 251 - 263. ISBN 1-85233-2220 PDF version
  7. Barbara Lewandowska-Tomaszczyk, Michael Oakes & Paul Rayson (2001). Annotated Corpora for Assistance with English-Polish Translation. Paper presented at Corpus Linguistics 2001, Lancaster University, UK, March 30-April 2, 2001. PDF version
  8. S. Sharoff, P. Rayson, O. Mudraya, A. Wilson and T. McEnery (2004). A tool for assisting translators using automatic semantic annotation. Presented at Corpus Use and Learning to Translate (CULT-BCN) Barcelona, January 22nd-24th 2004.
  9. Marilyn Deegan, Harold Short, Dawn Archer, Paul Baker, Tony McEnery, Paul Rayson (2004) Computational Linguistics Meets Metadata, or the Automatic Extraction of Key Words from Full Text Content. RLG Diginews, Vol. 8, No. 2. ISSN 1093-5371.
  10. Jones, M., Rayson, P. and Leech, G. (2004) Key category analysis of a spoken corpus for EAP. Presented at The 2nd Inter-Varietal Applied Corpus Studies (IVACS) International Conference on "Analyzing Discourse in Context" The Graduate School of Education, Queen’s University, Belfast, Northern Ireland, 25 - 26 June, 2004. PDF version
  11. Löfberg L, Juntunen J-P, Nykanen A, Varantola K, Rayson P, Archer D. (2004). Using a semantic tagger as dictionary search tool. In Williams G. and Vessier S. (eds.) Proceedings of the 11th EURALEX (European Association for Lexicography) International Congress (Euralex 2004), Lorient, France, 6-10 July 2004. Université de Bretagne Sud. Volume I, pp. 127-134. ISBN 2-9522-4570-3.
  12. Archer, D. and Rayson, P. (2004) Using an historical semantic tagger as a diagnostic tool for variation in spelling. Presented at Thirteenth International Conference on English Historical Linguistics (ICEHL 13) University of Vienna, Austria 23-29 August, 2004.
  13. Sharoff, S., Babych, B., Rayson, P., Mudraya, O. and Piao, S. (2006) ASSIST: Automated Semantic Assistance for Translators. In companion proceedings to the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006), Trento, Italy, April 3-7, 2006, pp. 139 - 142. ISBN 1-932432-60-4. PDF version
  14. Piao, S. L., Rayson, P., Mudraya, O., Wilson, A. and Garside, R. (2006) Measuring MWE compositionality using semantic annotation. In proceedings of COLING/ACL workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, July 23, 2006, Sydney, Australia. PDF version (Download data for human ratings)
  15. Andrew Wilson, Olga Moudraia (2006) Quantitative or Qualitative Content Analysis? Experiences from a cross-cultural comparison of female students' attitudes to shoe fashions in Germany, Poland and Russia. In Andrew Wilson, Paul Rayson and Dawn Archer (eds.) Corpus Linguistics around the world. Rodopi, Amsterdam.
  16. For more recent applications of the English Semantic Tagger, see the list on the Wmatrix website