Detecting Meaningful Multi-word Expressions in Political Text

Ken Benoit

London School of Economics and Political Science

The rapid growth of applications treating text as data has transformed our ability to gain insight into important political phenomena. Almost universal among existing methods is the bag-of-words approach, which counts each word as a feature without regard to grammar or order. This approach remains extremely useful despite being an obviously inaccurate model of how observed words are generated in natural language. Many politically meaningful textual features, however, occur not as unigram words but rather as pairs of words or phrases, especially in language relating to policy, political economy, and law. Here we present a hybrid model for detecting these associated words, known as collocations. Using a combination of statistical detection, human judgement, and machine learning, we extract and validate a dictionary of meaningful collocations from three large corpora totalling over 1 billion words, drawn from political manifestos and legislative floor debates. We then examine how the word scores of phrases in a text model compare to the scores of their component terms.

