NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

Shamsuddeen Hassan Muhammad

MAPi-Joint Doctoral Program, University of Porto

Abstract:
Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria—Hausa, Igbo, Nigerian-Pidgin, and Yorùbá—consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing, and labelling methods that enable us to create datasets for these low-resource languages. We evaluate a range of pre-trained models and transfer strategies on the dataset. We find that language-specific models and language-adaptive fine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.

Bio:
Shamsuddeen is a PhD candidate in the MAPi-Joint Doctoral Program in Computer Science at the University of Porto. His current research focuses on natural language processing for low-resource languages. He received his Master's degree from the University of Manchester, UK, and his Bachelor's degree from Bayero University, Kano, Nigeria. He is a member of MasakhaneNLP and a faculty member at the Faculty of Computing at Bayero University, Kano, Nigeria.


Week 18 2021/2022

Thursday 10th March 2022
1:00-2:00pm

Microsoft Teams

The talk is on creating a Nigerian sentiment corpus. The paper has been submitted to LREC and is under review: https://arxiv.org/pdf/2201.08277.pdf