SANTI-morf: a new morphological annotation system for Indonesian texts

Prihantoro

CASS, Lancaster University

SANTI-morf is a new morphological annotation system that annotates Indonesian texts at the morpheme level. It can analyse words formed by affixation, reduplication and compounding, the three major morphological operations in Indonesian (Mueller 2007:1221-1222), as well as clitics. SANTI-morf implements a robust tokenisation and annotation scheme devised by Prihantoro (2019). Indonesian words are tokenised into morphemes, and both the orthographic form and the citation form of each morpheme are presented. SANTI-morf tags are fine-grained: they include analytic labels that identify sub-categories of affixes (prefixes, suffixes, circumfixes, infixes) and of reduplication (full, partial, imitative), as well as labels that describe how each morpheme functions, for instance outcome POS (the POS of the word formed by a given affix), active, passive, reciprocal, iterative, etc. I developed SANTI-morf in Nooj (Silberztein 2003), a rule-based linguistic text analyser. Nooj lets users write lexicons and rules and apply them to analyse texts at multiple linguistic levels (morphology, morphosyntax, syntax, etc.). Nooj can also be used as a corpus query program to retrieve morphemes from a corpus annotated by SANTI-morf. The lexicons and rules I wrote for SANTI-morf are grouped into four modules that are applied sequentially. This multi-module pipeline helps reduce ambiguity even before the disambiguation rules are applied. The first module is the Annotator, which is applied to all words in the target text(s). The second module is the Guesser, which analyses all words left unanalysed by the Annotator. The third module is the Improver, which identifies incorrect analyses given by the Annotator or the Guesser and adds the correct analyses. The last module is the Disambiguator, which contains contextual and non-contextual disambiguation rules to resolve ambiguities. Ambiguities that remain unresolved are kept. The evaluation shows that SANTI-morf achieves 99% precision and 99% recall.
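
The sequential four-module design described in the abstract can be illustrated with a minimal Python sketch. This is not SANTI-morf's implementation, which is written as Nooj lexicons and rules. The toy lexicon, the correction table, the tag names (PREFIX, ROOT, GUESSED_ROOT), the "di" guessing heuristic and all function names are assumptions made up for illustration only.

```python
from typing import Dict, List, Tuple

Morpheme = Tuple[str, str]            # (citation form, illustrative tag)
Analysis = Tuple[Morpheme, ...]       # one candidate analysis of a token
Analyses = Dict[str, List[Analysis]]  # token -> candidate analyses

# Toy data: hypothetical entries and tags, not SANTI-morf's lexicon or tagset.
LEXICON: Dict[str, List[Analysis]] = {
    "membaca": [(("meN-", "PREFIX"), ("baca", "ROOT"))],
}
CORRECTIONS: Dict[str, Analysis] = {
    "dinding": (("dinding", "ROOT"),),  # a root, not di- plus a stem
}


def annotator(tokens: List[str], analyses: Analyses) -> Analyses:
    """Analyse every token that is found in the lexicon."""
    for token in tokens:
        if token in LEXICON:
            analyses[token] = list(LEXICON[token])
    return analyses


def guesser(tokens: List[str], analyses: Analyses) -> Analyses:
    """Guess an analysis for tokens the annotator left unanalysed."""
    for token in tokens:
        if token in analyses:
            continue
        if token.startswith("di") and len(token) > 2:
            # Assumed heuristic: treat an initial "di" as the prefix di-.
            analyses[token] = [(("di-", "PREFIX"), (token[2:], "GUESSED_ROOT"))]
        else:
            analyses[token] = [((token, "GUESSED_ROOT"),)]
    return analyses


def improver(tokens: List[str], analyses: Analyses) -> Analyses:
    """Add the correct analysis for tokens known to be misanalysed."""
    for token in tokens:
        fix = CORRECTIONS.get(token)
        if fix is not None and fix not in analyses.get(token, []):
            analyses.setdefault(token, []).append(fix)
    return analyses


def disambiguator(tokens: List[str], analyses: Analyses) -> Analyses:
    """Non-contextual rule: prefer candidates without guessed morphemes."""
    for token in tokens:
        candidates = analyses.get(token, [])
        certain = [a for a in candidates
                   if not any(tag.startswith("GUESSED") for _, tag in a)]
        if certain:
            analyses[token] = certain
        # Otherwise every candidate is kept, so ambiguity is preserved.
    return analyses


def run_pipeline(tokens: List[str]) -> Analyses:
    """Apply the four modules in order, each refining the previous output."""
    analyses: Analyses = {}
    for module in (annotator, guesser, improver, disambiguator):
        analyses = module(tokens, analyses)
    return analyses
```

Calling run_pipeline(["membaca", "dibaca", "dinding"]) sends "membaca" through the lexicon, lets the guesser split "dibaca" and (wrongly) "dinding", has the improver add the root analysis for "dinding", and leaves the disambiguator to discard the guessed split in favour of the corrected one.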

Keywords: SANTI-morf, morpheme annotation, Indonesian, rules, pipeline, Nooj

Week 1 2021/2022

Thursday 14th October 2021
1:00-2:00pm

Teams meeting - please contact for link