The compilation and annotation of the Reference Corpus of Contemporary Portuguese

Amalia Mendes

University of Lisbon

In this talk, I will present the Reference Corpus of Contemporary Portuguese (CRPC), which has been under development at the Centre for Linguistics at the University of Lisbon (CLUL) for more than two decades. It is an electronic linguistic corpus of written and spoken materials, totalling 311 million tokens and covering different varieties of Portuguese around the world. The CRPC is now available for online queries through the CQPweb interface.

After briefly reporting on the processes and tools involved in the automatic annotation of the corpus with lemmas, PoS tags and NP chunks, I will focus on our annotation scheme for modality. Modality is usually defined as the expression of the speaker's opinion and attitude towards the proposition (Palmer, 1986). It traditionally covers epistemic modality, which concerns the speaker's degree of commitment to the truth of the proposition, as well as deontic modality, capacity and volition, among others. Modality detection is therefore also clearly linked to the current interest within NLP in sentiment analysis and opinion mining. I will report on a corpus sample of approximately 2000 sentences fully annotated with modal values, which gives us insights into the distribution of the types of modality and into the validity of our annotation scheme. This manually annotated corpus was recently used as training data for the automatic tagging of modality in Portuguese, with promising results.

Week 7 2013/2014

Thursday 28th November 2013

FASS Meeting Room 1