A new tagset for morphological analysis of Indonesian texts

Prihantoro

CASS, Lancaster University

This project proposes a new tagset for a Morphological Analyser (MA) of Indonesian that is currently under development and will be implemented in NooJ (Silberztein, 2004). Indonesian is a variety of Malay, a member of the Austronesian language family, and is the national and official language of Indonesia. According to the Ethnologue (Lewis et al., 2009), it has more than 200 million speakers. The tagset is intended to be applied to the full text of an Indonesian corpus so that users of that corpus can run NooJ queries based on morphological criteria as well as on raw word forms. It is distinct from the existing state-of-the-art tagset used by MorphInd, an MA for Indonesian (Larasati et al., 2011), and from the MA tagset of Pisceldo et al. (2008). The new tagset provides annotations for affixes, clitics, particles, reduplication, morphophonemics, and functional categories such as passive voice, reciprocal voice, and agentive and instrumental nouns. A large portion of the scheme is dedicated to affixation, which is the most productive word-formation device in Indonesian. Prentice (1987) notes that word formation in Indonesian involves a combination of syntactic and semantic factors. Taking this view into account, the new scheme allows some ambiguities to remain in the morphological annotation, to be resolved later by syntactic and semantic annotation.
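As a rough illustration of what querying by morphological criteria as well as raw word forms could look like, the Python sketch below filters a handful of annotated tokens. It is not the proposed tagset and not NooJ syntax: the `Token` class, the `query` function, and tag labels such as `PASSIVE` or `REDUP` are hypothetical names invented for this example, and the five tokens are a toy sample rather than corpus data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Token:
    """One annotated word form (hypothetical structure, not the real scheme)."""
    form: str                     # raw word form as it appears in the text
    root: str                     # root the form is built on
    tags: frozenset = field(default_factory=frozenset)  # illustrative labels


# Toy sample; the tag labels are placeholders, not tags from the proposed tagset.
CORPUS = [
    Token("membaca", "baca", frozenset({"PREFIX:meN-", "ACTIVE"})),
    Token("dibaca", "baca", frozenset({"PREFIX:di-", "PASSIVE"})),
    Token("pembaca", "baca", frozenset({"PREFIX:peN-", "AGENTIVE"})),
    Token("buku-buku", "buku", frozenset({"REDUP"})),
    Token("bukunya", "buku", frozenset({"CLITIC:-nya"})),
]


def query(corpus, form=None, root=None, tags=()):
    """Return tokens matching a raw form, a root, and/or a set of tags."""
    wanted = frozenset(tags)
    return [
        t for t in corpus
        if (form is None or t.form == form)
        and (root is None or t.root == root)
        and wanted <= t.tags      # every requested tag must be present
    ]


# Query by morphological criterion, then by root, then by raw word form.
passives = query(CORPUS, tags=["PASSIVE"])
baca_family = query(CORPUS, root="baca")
exact = query(CORPUS, form="bukunya")
```

Keeping the raw form, the root, and the tags as separate fields is what lets a single query mix surface-level and morphological conditions.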

Week 22 2018/2019

Thursday 2nd May 2019
3:00-4:00pm

Charles Carter A15