Automatic detection of Spanish and Japanese modal markers and presence in spoken corpora

Carlos Herrero-Zorita

Autonomous University of Madrid

The main aim of this study is to automatically find and classify elements that signal modality in Spanish and Japanese sentences, drawing on both theoretical and empirical information. Bringing together typology, logic, corpus linguistics and computational linguistics, we address three main questions: (1) What is the best definition and classification of modality for cross-linguistic computational work? (2) How is modality used in spoken Spanish and Japanese, and how are modal markers modified in discourse? (3) How can this information be formalised into a program that annotates modals automatically in new texts?

The result is a rule-based program that outputs an XML file in which markers are annotated and classified in the same way for both languages. Modality is approached from the perspective of logic, as a semantic feature that adds meanings of necessity or possibility to the predicate of the sentence by means of a series of auxiliaries. The corpus shows that these auxiliaries can be affected by negation, ellipsis, syntactic separation and ambiguity, all of which the program must detect in order to maintain precision and recall.
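As an illustration only, the kind of rule-based XML annotation described above might be sketched as follows. This is not the program developed in the study: the lexicon of Spanish forms, the `<s>` and `<modal>` element names, the `type` and `negated` attributes, and the single negation rule (a "no" immediately before the marker) are all hypothetical choices made for this example, and Japanese, ellipsis, syntactic separation and ambiguity are not handled.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical toy lexicon: a few Spanish surface forms mapped to a modal class.
# These patterns and labels are illustrative assumptions, not the study's rules.
MODAL_RULES = [
    (re.compile(r"\b(?:tengo|tienes|tiene|tenemos|tienen) que\b", re.IGNORECASE), "necessity"),
    (re.compile(r"\b(?:debo|debes|debe|debemos|deben)\b", re.IGNORECASE), "necessity"),
    (re.compile(r"\b(?:puedo|puedes|puede|podemos|pueden)\b", re.IGNORECASE), "possibility"),
]

# Assumed negation rule: the word "no" directly before the marker.
NEGATION = re.compile(r"\bno\s+$", re.IGNORECASE)


def annotate(sentence: str) -> ET.Element:
    """Return an <s> element in which each matched marker is wrapped in <modal>."""
    matches = []
    for pattern, label in MODAL_RULES:
        for m in pattern.finditer(sentence):
            matches.append((m.start(), m.end(), label))
    matches.sort()

    root = ET.Element("s")
    cursor = 0
    last = None  # most recently created <modal> element, if any
    for start, end, label in matches:
        before = sentence[cursor:start]
        if last is None:
            root.text = before
        else:
            last.tail = before
        node = ET.SubElement(root, "modal", type=label)
        negated = NEGATION.search(sentence[:start]) is not None
        node.set("negated", "true" if negated else "false")
        node.text = sentence[start:end]
        last = node
        cursor = end

    # Attach whatever text follows the last marker (or the whole sentence).
    rest = sentence[cursor:]
    if last is None:
        root.text = rest
    else:
        last.tail = rest
    return root


if __name__ == "__main__":
    tree = annotate("No puedo ir, pero tienes que venir.")
    print(ET.tostring(tree, encoding="unicode"))
```

In this sketch the original sentence text is preserved around the inserted elements, so the annotation can be stripped to recover the input unchanged.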

The corpus study also provides information about modality usage and reveals that its frequency is correlated with the type of interaction, possibly because of social constraints. Monologues yield similar results in both languages, as do the non-linguistic factors of speaker sex and age. Dialogues, on the other hand, show a completely different picture: necessity predominates in Spanish, whereas possibility is slightly more frequent in Japanese.

