Acronyms as an Integral Part of Multi-Word Term Recognition - A Token of Appreciation

Irena Spasic

University of Cardiff

The increasing amount of textual information requires effective term recognition methods to identify textual representations of domain-specific concepts as the first step toward automating its semantic interpretation. Dictionary look-up approaches may not always be suitable for dynamic domains such as biomedicine or for newly emerging types of media such as patient blogs, the main obstacles being the use of non-standardised terminology and a high degree of term variation. Term conflation is the process of linking together different variants of the same term. In automatic term recognition approaches, all term variants should be aggregated into a single normalized term representative, which is associated with a single domain-specific concept as a latent variable. FlexiTerm is an unsupervised method for recognition of multi-word terms from a domain-specific corpus. It uses regular expressions to constrain the search space based on term formation patterns and then processes the matching candidates statistically to identify the largest frequently occurring bags of words and the corresponding terms. FlexiTerm uses a range of methods to normalize three types of term variation: orthographic, morphological, and syntactic. Acronyms, which represent a highly productive type of term variation, were not originally supported. In this talk, we describe how the functionality of FlexiTerm has been extended to recognize acronyms and incorporate them into the term conflation process. We evaluated the effects of term conflation in the context of information retrieval as one of its most prominent applications. On average, relative recall increased by 32 percentage points, whereas index compression factor increased by 7 percentage points. This evidence suggests that integrating acronyms provides a non-trivial improvement in term conflation.
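
To make the idea of term conflation with acronyms concrete, here is a minimal Python sketch. It is not FlexiTerm's implementation: the stopword list, the crude plural stripping (standing in for proper stemming), the `ACRONYMS` dictionary and the function names `normalize_token`, `term_key` and `conflate` are all assumptions made for illustration. It only shows how variants that share the same normalized bag of words, after acronym expansion, could be grouped under one representative.

```python
import re
from collections import defaultdict

# Assumed stopword list; a real system would use a fuller one.
STOPWORDS = {"of", "the", "in", "for"}


def normalize_token(token):
    """Crude plural stripping, a stand-in for real stemming or lemmatisation."""
    if len(token) > 4 and token.endswith("ies"):
        return token[:-3] + "y"
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def term_key(term, acronyms):
    """Map a term variant to an order-independent bag of normalized words."""
    tokens = re.findall(r"[a-z0-9]+", term.lower())
    expanded = []
    for tok in tokens:
        # Replace a known acronym by the words of its (assumed) expansion.
        expanded.extend(acronyms.get(tok, tok).split())
    return frozenset(normalize_token(t) for t in expanded if t not in STOPWORDS)


def conflate(terms, acronyms):
    """Group term variants that share the same normalized bag of words."""
    groups = defaultdict(list)
    for term in terms:
        groups[term_key(term, acronyms)].append(term)
    return list(groups.values())


# Hypothetical acronym dictionary and term variants, for illustration only.
ACRONYMS = {"acl": "anterior cruciate ligament"}
variants = [
    "ACL injury",
    "injuries of the anterior cruciate ligament",
    "anterior cruciate ligament injury",
    "knee pain",
]
groups = conflate(variants, ACRONYMS)
for group in groups:
    print(group)
```

In this toy example the first three variants fall into one group because acronym expansion, stopword removal and plural stripping reduce them to the same set of words, while "knee pain" stays separate. Without the acronym dictionary, "ACL injury" would form a group of its own, which is the kind of missed link the talk addresses.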

Week 6 2019/2020

Thursday 14th November 2019
3:00-4:00pm

Management School LT5

Joint UCREL and DSG talk