Wim Peters
NLP Group
Department of Computer Science
University of Sheffield
Regent Court
211 Portobello St.
Sheffield S1 4DP
WordNet (Fellbaum, 1998) is a lexical database built on psycholinguistic principles. It has been developed at Princeton University and has been freely available for ten years. It has been used very extensively in language engineering research. WordNet is an English-language resource. In the recent EU project EuroWordNet (Vossen, 1998) (http://www.hum.uva.nl/~wn) and in a number of associated research activities, wordnets for several other languages have been developed on the same basic plan, but with further refinements and with the added benefit of the Inter-Lingual Index, which links synsets (sets of synonymous words) of different languages. The synset is the basic conceptual unit in (Euro)WordNet. Each meaning of each word is located in a synset, and synsets stand in hierarchical and other relations to each other.
The WordNet and EuroWordNet databases have been analysed in order to detect patterns of sense combinations that display some kind of semantic regularity. Another term used for this regularity is regular polysemy (Apresjan 1973). Regular polysemy holds where a set of word senses is related in systematic and predictable ways. In principle, a pattern is considered regular if the same combination of related senses holds for more than one word. Regular polysemy has mostly been approached from a theoretical perspective. The literature identifies a limited set of default relations, such as plant/food and producer/product (Ostler and Atkins, 1991; Pustejovsky, 1995). Most of these relations have been arrived at by examining a limited quantity of linguistic material (texts, dictionaries) or by introspection.
We have taken a data-driven approach to the identification of regular polysemic patterns. For this purpose we have operationalized Apresjan's definition of regular polysemy by exploiting WordNet's hierarchical structure: wherever two or more words have senses in one part of the hierarchy and also have senses in another part of the hierarchy, we have a candidate pattern of regular polysemy. The resulting candidate regular polysemic sets of words with their corresponding senses were further pruned by the application of two criteria: 1) Multilingual lexicalization. Because metonymically related word senses are often translated by the same word in other languages (Seto 1996), only the candidate regular polysemic sense combinations that are lexicalized by the same language-specific word in three languages (English, Dutch and Spanish) were selected. 2) Sense co-occurrence within the same document in SemCor, the semantically tagged version of the Brown corpus. Sense combinations that co-occur in this way are selected whenever they are also found in the candidate sets described above.
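The candidate extraction step described above can be illustrated with a minimal sketch. It does not reproduce the actual implementation: the toy hierarchy, the `SENSES` data and the function name `candidate_patterns` are all assumptions made for illustration, and each sense is simplified to the chain of hypernyms above it. A pair of hypernyms counts as a candidate pattern when at least two words have one sense under each member of the pair.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: each word maps to its senses, and each sense is
# written as its chain of hypernyms, ordered from most general to most specific.
SENSES = {
    "school": [("entity", "artifact", "building"), ("entity", "group", "institution")],
    "court": [("entity", "artifact", "building"), ("entity", "group", "institution")],
    "chamber": [("entity", "artifact", "building"), ("entity", "group", "institution")],
    "trio": [("entity", "abstraction", "composition"), ("entity", "group", "ensemble")],
    "barrier": [("entity", "artifact", "obstruction")],
}


def candidate_patterns(senses, min_words=2):
    """Map each hypernym pair to the words that have one sense under each hypernym."""
    words_by_pair = defaultdict(set)
    for word, chains in senses.items():
        for chain_a, chain_b in combinations(chains, 2):
            # Keep only hypernyms that separate the two senses; a shared
            # ancestor such as the root says nothing about the relation.
            only_a = [h for h in chain_a if h not in chain_b]
            only_b = [h for h in chain_b if h not in chain_a]
            for hyper_a in only_a:
                for hyper_b in only_b:
                    words_by_pair[frozenset((hyper_a, hyper_b))].add(word)
    # A pattern is a candidate only if it holds for more than one word.
    return {pair: words for pair, words in words_by_pair.items()
            if len(words) >= min_words}


patterns = candidate_patterns(SENSES)
for pair, words in patterns.items():
    print(sorted(pair), sorted(words))
```

In this toy data the pairs contributed by "trio" are discarded because no second word shares them, and "barrier" contributes nothing because it has a single sense. The two pruning criteria (multilingual lexicalization and co-occurrence in SemCor) would then be applied to the surviving candidates.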
These methods, individually applied, reduce the set of regular polysemy candidates from WordNet. The remaining candidates have been examined, and a distinction has been made between the different types of regular polysemic groups that have been extracted. Participating words and their senses are linked by pairwise combinations of hypernyms that subsume the word senses in the WordNet hierarchy. These hypernymic pairs can be at a high level in the hierarchy and therefore correspond to coarse-grained semantic relations between the senses of the words participating in the regular polysemic pattern, or they can be at lower hierarchical levels that correspond to fine-grained semantic relations between the word senses. The latter give rise to regular polysemic relations that constitute semantically more coherent groups, such as publication - publisher (paper, newspaper, magazine), musical composition - group of singers (trio, quartet, suite) and building - institution/association (school, chamber, court). Also, different types of metaphorical transfer can be identified, such as supporting structure - theory (framework, foundation, base), musical theme - idea (theme, motif, strain) and concrete obstruction - abstract obstruction (barrier, roadblock, hurdle). Techniques to automate the fine-tuning of the extraction of suitable conceptual pairs in order to select relevant groups of words are then discussed. Some concluding observations are that mining a linguistic resource with semantic information in search of regularities in lexicalization patterns can give insight into the level of encoded semantic regularity in the resource and the orientation of lexicographic practice towards this phenomenon.
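The contrast between coarse-grained and fine-grained hypernymic pairs can be sketched as a ranking by hierarchical depth. This is one possible heuristic, not necessarily the fine-tuning technique discussed in the paper: the depth values, the `PATTERNS` data and the function names are hypothetical, and a pair is scored by the depth of its shallower member on the assumption that deeper hypernyms express more specific relations.

```python
# Hypothetical depths of hypernyms in a toy hierarchy (the root has depth 0).
DEPTH = {"entity": 0, "artifact": 1, "group": 1, "building": 2, "institution": 2}

# Hypothetical candidate patterns: hypernym pair -> participating words.
PATTERNS = {
    ("artifact", "group"): {"school", "court", "chamber", "paper"},
    ("building", "institution"): {"school", "court", "chamber"},
}


def specificity(pair, depth=DEPTH):
    """Score a hypernym pair by the depth of its shallower member."""
    return min(depth[hypernym] for hypernym in pair)


def rank_patterns(patterns, depth=DEPTH):
    """Order hypernym pairs from most fine-grained to most coarse-grained."""
    return sorted(patterns, key=lambda pair: specificity(pair, depth), reverse=True)


ranked = rank_patterns(PATTERNS)
for pair in ranked:
    print(pair, specificity(pair), sorted(PATTERNS[pair]))
```

Under this heuristic the deeper pair (building, institution) is ranked above (artifact, group), reflecting the observation that lower-level pairs tend to yield semantically more coherent word groups.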
REFERENCES
Fellbaum, Christiane (ed.) (1998) WordNet: An Electronic Lexical Database. Cambridge, Mass.: MIT Press.
Ostler, N. and Atkins, B. (1991). Predictable Meaning Shift: Some Linguistic Properties of Lexical Implication Rules. In: Pustejovsky, J., Bergler, S. (eds.), Lexical Semantics and Knowledge Representation, ACL SIGLEX Workshop, Berkeley, California.
Pustejovsky, J. (1995). The Generative Lexicon, MIT Press, Cambridge MA, U.S.A.
Seto, Ken-ichi (1996). On the Cognitive Triangle: the Relation of Metaphor, Metonymy and Synecdoche. In: A. Burkhardt & N. Norrick (eds.), Tropic Truth. De Gruyter, Berlin - New York.
Vossen, P. (1998). EuroWordNet: Building a Multilingual Database with Wordnets for European Languages. In: K. Choukri, D. Fry, M. Nilsson (eds.), The ELRA Newsletter, Vol. 3, No. 1, 1998. ISSN: 1026-8200.