Language resources: Arabic Song Lyrics Corpus & Igbo-English Machine Translation Benchmark

Mahmoud El-Haj & Ignatius Ezeani

SCC, Lancaster University

This session presents two articles recently published at LREC 2020 and the ICLR AfricaNLP workshop.

Part 1: Habibi - a multi Dialect multi National Arabic Song Lyrics Corpus (Mahmoud El-Haj)

This paper introduces Habibi, the first Arabic song lyrics corpus. The corpus comprises more than 30,000 Arabic song lyrics in 6 Arabic dialects by singers from 18 Arab countries. The lyrics are segmented into more than 500,000 sentences (song verses) containing more than 3.5 million words. I provide the corpus in both comma-separated values (csv) and annotated plain text (txt) file formats. In addition, I converted the csv version into JavaScript Object Notation (json) and eXtensible Markup Language (xml) file formats.
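As a rough illustration of the csv-to-json and csv-to-xml conversion mentioned above, a minimal Python sketch might look like the following. It is not the script used to build the corpus: the column names, the tag names, and the two sample rows are hypothetical and do not reflect the actual corpus schema.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts, one dict per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def records_to_json(records):
    """Serialise the rows as a JSON array of objects."""
    # ensure_ascii=False keeps Arabic script readable instead of \u escapes.
    return json.dumps(records, ensure_ascii=False, indent=2)


def records_to_xml(records, root_tag="corpus", row_tag="verse"):
    """Serialise the rows as XML, one child element per column.

    Assumes every column name is a valid XML tag name.
    """
    root = ET.Element(root_tag)
    for record in records:
        row = ET.SubElement(root, row_tag)
        for field, value in record.items():
            ET.SubElement(row, field).text = value
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(root, encoding="unicode")


# Hypothetical two-row sample; the column names are illustrative only.
sample_csv = "singer,dialect,lyrics\nA,Egyptian,حبيبي\nB,Levantine,يا ليل\n"

records = csv_to_records(sample_csv)
json_text = records_to_json(records)
xml_text = records_to_xml(records)
```

Reading the csv into a list of dictionaries first means both output formats are produced from the same intermediate representation, so they stay consistent with each other.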

To experiment with the corpus, I ran extensive binary and multi-class experiments for dialect and country-of-origin identification. The identification tasks use several classical machine learning and deep learning models with different word embeddings. For the binary dialect identification task, the best-performing classifier achieved a testing accuracy of 93%, using a word-based Convolutional Neural Network (CNN) with a Continuous Bag of Words (CBOW) word embeddings model. Overall, the results show that all classical and deep learning models outperform the baseline, which demonstrates the suitability of the corpus for both dialect and country-of-origin identification tasks. I am making the corpus and the trained CBOW word embeddings freely available for research purposes.

Part 2: Building Evaluation Benchmark for Igbo-English Machine Translation (Ignatius Ezeani, Paul Rayson, Ikechukwu Onyenwe, Chinedu Uchechukwu, Mark Hepple)

Although researchers and practitioners are pushing the boundaries and enhancing the capacities of NLP tools and methods, work on African languages is lagging. Much of the focus is on well-resourced languages such as English, Japanese, German, French, Russian and Mandarin Chinese. Over 97% of the world's 7000 languages, including African languages, are low-resourced for NLP, i.e. they have little or no data, tools, and techniques for NLP research. For instance, only 5 of the 2965 authors (0.19%) of full-text papers in the ACL Anthology, extracted from the 5 major conferences in 2018 (ACL, NAACL, EMNLP, COLING and CoNLL), are affiliated with African institutions.

In this work, we will focus on Igbo, a low-resourced Nigerian language spoken by more than 50 million people globally, with over 50% of the speakers in southeastern Nigeria. We will discuss our effort toward building a standard machine translation benchmark dataset for Igbo. Our team identified some key challenges to achieving conceptual equivalence in Igbo-English translation and made a few suggestions. We will also present our plan for developing a benchmark translation model for Igbo-English translation.

Week 28 2019/2020

Thursday 11th June 2020

Online: join mailing list or contact organisers to receive link