Unsupervised Word Sense Disambiguation and Concept Extraction in Clinical and Biomedical Documents

Stéphan Tulkens

CLiPS, University of Antwerp

The automated analysis of clinical text presents a set of unique challenges. First, clinical text differs from conventional textual domains, which diminishes the performance of off-the-shelf resources that were not developed with this domain in mind. For example, clinical text is usually written in a relatively informal style, featuring abbreviations and idiomatic language use that may be specific to a given caregiver or hospital. Second, training data is scarce due to privacy constraints, which also make it difficult to reuse annotated datasets across projects, hampering progress.

In this talk, I will describe research on the extraction and disambiguation of concepts from patient notes and biomedical texts using unsupervised methods based on distributional semantics.

Specifically, we have shown that domain-specific distributional semantic vectors, when appropriately composed into higher-order context vectors, are powerful enough to distinguish between multiple highly related senses in a biomedical Word Sense Disambiguation (WSD) task. Our unsupervised method obtains performance comparable to that of knowledge-based and supervised methods on the same task.
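
To make the general idea concrete, here is a minimal sketch of disambiguation with composed context vectors. It is not the method presented in the talk: the toy embeddings, the sense descriptions, the choice of averaging as the composition function, and the names `compose` and `disambiguate` are all assumptions made for illustration. Each sense and each context is represented as the average of its word vectors, and the sense whose vector is closest to the context by cosine similarity is selected.

```python
import math

# Toy 3-dimensional "embeddings"; hypothetical values for illustration only.
# A real system would load vectors trained on clinical or biomedical text.
EMBEDDINGS = {
    "fever": [0.9, 0.1, 0.0],
    "cough": [0.8, 0.2, 0.1],
    "virus": [0.9, 0.0, 0.1],
    "temperature": [0.1, 0.9, 0.0],
    "ice": [0.0, 1.0, 0.1],
    "winter": [0.1, 0.8, 0.2],
}

# Hypothetical sense inventory for the ambiguous word "cold":
# each sense is described by a handful of related words.
SENSES = {
    "common_cold": ["virus", "cough", "fever"],
    "low_temperature": ["temperature", "ice", "winter"],
}


def compose(words):
    """Average the vectors of all known words (assumed composition function)."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        raise ValueError("no known words to compose")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def disambiguate(context_words, senses):
    """Return the sense whose composed vector is most similar to the context."""
    context_vec = compose(context_words)
    scores = {
        sense: cosine(context_vec, compose(words))
        for sense, words in senses.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Words surrounding an ambiguous mention of "cold" in a note.
    print(disambiguate(["fever", "cough"], SENSES))
```

No labelled training examples are needed in this sketch, only word vectors and a description of each sense, which is what makes approaches of this kind attractive when annotated clinical data is hard to obtain.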

Additionally, we apply similar distributional semantic methods to concept extraction on free clinical text. On this task, our approach is outperformed by supervised concept extraction on the same dataset, but significantly outperforms other unsupervised concept extraction methods.

Week 9 2017/2018

Thursday 7th December 2017

Charles Carter A17

Joint Data Science Group and UCREL talk.