Previous seminars

Seminars from previous years are still being added; in the meantime, the archive remains available on the old website.

Week 9

Thursday 9th December 2021


Microsoft Teams - request a link via email

Social Media Monitoring of the COVID-19 Pandemic and Influenza Epidemic With Adaptation for Informal Language in Arabic Twitter Data: Qualitative Study

Lama Alsudias

SCC, Lancaster University

  • Abstract



Twitter is a real-time messaging platform widely used by people and organizations to share information on many topics. Systematic monitoring of social media posts (infodemiology or infoveillance) could be useful to detect misinformation outbreaks as well as to reduce reporting lag time and to provide an independent complementary source of data compared with traditional surveillance approaches. However, such an analysis is currently not possible in the Arabic-speaking world owing to a lack of basic building blocks for research and dialectal variation.


We collected around 4000 Arabic tweets related to COVID-19 and influenza. We cleaned and labeled the tweets relative to the Arabic Infectious Diseases Ontology, which includes nonstandard terminology, as well as 11 core concepts and 21 relations. The aim of this study was to analyze Arabic tweets to estimate their usefulness for health surveillance, understand the impact of the informal terms in the analysis, show the effect of deep learning methods in the classification process, and identify the locations where the infection is spreading.


We applied the following multilabel classification techniques: binary relevance, classifier chains, label power set, adapted algorithm (multilabel adapted k-nearest neighbors [MLKNN]), support vector machine with naive Bayes features (NBSVM), bidirectional encoder representations from transformers (BERT), and AraBERT (transformer-based model for Arabic language understanding) to identify tweets appearing to be from infected individuals. We also used named entity recognition to predict the place names mentioned in the tweets.
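As a rough illustration of the first technique listed above, binary relevance trains one independent yes/no classifier per label and returns every label whose classifier fires. The sketch below is not the study's code: the `KeywordScorer` base classifier, the English toy tweets and the two labels are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from collections import Counter


def tokenize(text):
    return text.lower().split()


class KeywordScorer:
    """Hypothetical toy binary classifier: each token votes for the
    class (positive or negative) in which it appeared in more documents."""

    def fit(self, texts, targets):
        self.pos, self.neg = Counter(), Counter()
        for text, target in zip(texts, targets):
            (self.pos if target else self.neg).update(set(tokenize(text)))
        return self

    def predict(self, text):
        score = sum(self.pos[t] - self.neg[t] for t in set(tokenize(text)))
        return score > 0


class BinaryRelevance:
    """Binary relevance: one independent binary model per label."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.models = {}

    def fit(self, texts, label_sets):
        for label in self.labels:
            # Reduce the multilabel problem to "does this text carry `label`?"
            targets = [label in labels for labels in label_sets]
            self.models[label] = KeywordScorer().fit(texts, targets)
        return self

    def predict(self, text):
        return {label for label, m in self.models.items() if m.predict(text)}


# Invented training data, for illustration only.
train_texts = [
    "fever and cough today",
    "wash hands wear mask",
    "cough so wear mask",
    "nice weather today",
]
train_labels = [{"symptom"}, {"prevention"}, {"symptom", "prevention"}, set()]

model = BinaryRelevance(["symptom", "prevention"]).fit(train_texts, train_labels)
print(sorted(model.predict("fever wear mask")))  # ['prevention', 'symptom']
```

Because each label is modelled separately, binary relevance ignores correlations between labels; classifier chains and label power set, also listed above, are the usual ways of taking such correlations into account.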


We achieved an F1 score of up to 88% in the influenza case study and 94% in the COVID-19 one. Adapting for nonstandard terminology and informal language helped to improve accuracy by as much as 15%, with an average improvement of 8%. Deep learning methods achieved an F1 score of up to 94% during the classifying process. Our geolocation detection algorithm had an average accuracy of 54% for predicting the location of users according to tweet content.


This study identified two Arabic social media data sets for monitoring tweets related to influenza and COVID-19. It demonstrated the importance of including informal terms, which are regularly used by social media users, in the analysis. It also proved that BERT achieves good results when used with new terms in COVID-19 tweets. Finally, the tweet content may contain useful information to determine the location of disease spread.

JMIR Med Inform 2021;9(9):e27670

Week 7

Thursday 25th November 2021


Microsoft Teams - request a link via email

Advances in NLP for African Languages

Bonaventure Dossou¹ & Chris Emezue²

¹Mila Quebec AI Institute  ²Mila Quebec AI

  • Abstract

"Language is inherent and compulsory for human communication. Whether written, spoken, or signed, language ensures understanding between people of the same and different regions. African languages (over 2000) are complex and truly low-resourced. These languages receive minimal attention: the datasets required for NLP applications are difficult to discover and existing research is hard to reproduce. However, this is changing. With the growing awareness and effort to include more low-resourced languages in NLP research, African languages have recently been a major subject of research in natural language processing. The presentation is the story of the dynamic duo, Chris and Bonaventure, as they work towards tackling the NLP challenges facing African languages. The talk will cover a range of topics with a focus on the OkwuGbé end-to-end speech recognition system for Fon and Igbo."

Chris and Bona are members of the Masakhane Community.

Week 6

Thursday 18th November 2021


Microsoft Teams - request a link via email

Early Modern English Trials: Introducing the Corpus

Emma Pasquali

University of Naples L’Orientale

  • Abstract

This paper illustrates the peculiarities of the Corpus of Early Modern English Trials (1650-1700), henceforth EMET, a highly specialized historical corpus of trial proceedings. The main purpose of creating this corpus is to shed light on the pragmatic aspects of Early Modern spoken English, since trial proceedings are considered records of authentic dialogues (Culpeper and Kytö 2010:17).

The initial part of the essay will illustrate the archive consultation phase and the criteria behind the selection of the trials, and it will also discuss the technical stages necessary for uploading a corpus to #LancsBox and studying it. Afterwards, the EMET itself will be presented by specifying the number of documents, the total number of tokens, the types of charges involved and the average number of tokens per text. The paper will also present a prototype trial information sheet that will provide a guide for users.

The final part of the study will display a comparison between the EMET and A Corpus of English Dialogues 1560-1760 (CED), in order to underline the kinship and the differences they



Primary Sources

CED = A Corpus of English Dialogues 1560-1760 (2005). Compiled by Merja Kytö (Uppsala University, Sweden) and Jonathan Culpeper (Lancaster University, England).

Secondary Sources

Brezina, V. / Timperley, M. / McEnery, T. (2018). #LancsBox v. 4.x [software]. Available at:

Culpeper, J. / Kytö, M. (1997). "Towards a corpus of dialogues, 1550-1750." In Language in Time and Space: Studies in Honour of Wolfgang Viereck on the Occasion of his 60th Birthday, edited by H. Ramisch and K. Wynne, 60-73. Stuttgart: Franz Steiner Verlag.

Culpeper, J. / Kytö, M. (2010). Early Modern English Dialogues: Spoken Interaction as Writing. Cambridge: Cambridge University Press.

Garside, R. / Leech, G. / McEnery, T. (eds). (1997). Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Taylor & Francis.

Kytö, M. / Walker, T. (2003). "The Linguistic Study of Early Modern English Speech-Related Texts: How 'Bad' Can 'Bad' Data Be?" Journal of English Linguistics, 31:221-248.

Week 1

Thursday 14th October 2021


Teams meeting - please contact for link

SANTI-morf: a new morphological annotation system for Indonesian texts


CASS, Lancaster University

  • Abstract

SANTI-morf is a new morpheme-level morphological annotation system for Indonesian texts. It can analyse words formed by affixation, reduplication and compounding, the three major morphological operations in Indonesian (Mueller 2007:1221-1222), as well as clitics. SANTI-morf implements a robust tokenisation and annotation scheme devised by Prihantoro (2019). Indonesian words are tokenised into morphemes, and both the orthographic and the citation form of each morpheme must be presented. SANTI-morf tags are fine-grained: they include analytic labels that identify sub-categories of affixes (prefixes, suffixes, circumfixes, infixes) and reduplications (full, partial, imitative), as well as labels that describe how each morpheme functions, for instance outcome POS (the POS of the words formed by certain affixes), active, passive, reciprocal, iterative, etc.

The program that I use to develop SANTI-morf is Nooj (Silberztein 2003), a rule-based linguistic text analyser. With Nooj we can write lexicons and rules and apply them to analyse texts at multiple linguistic levels (morphology, morphosyntax, syntax, etc.). Nooj can also be used as a corpus query program to retrieve the morphemes in a corpus annotated by SANTI-morf.

The lexicons and rules I wrote for SANTI-morf are grouped into four modules that are applied sequentially. This multi-module pipeline helps reduce ambiguity even before the disambiguation rules are applied. The first module is the Annotator, which analyses all words in the target text(s). The second is the Guesser, which analyses all words left unanalysed by the Annotator. The third is the Improver, which identifies incorrect analyses given by the Annotator or the Guesser and adds the correct analyses. The last is the Disambiguator, which contains contextual and non-contextual disambiguation rules to resolve ambiguities. Unresolved ambiguities are kept.

The evaluation shows that SANTI-morf achieves 99% precision and 99% recall.
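The sequential four-module flow described above could be sketched roughly as follows. This is plain Python, not Nooj code, and it does not reproduce SANTI-morf's lexicons, rules or tagset: the tiny lexicon, the `di-` guessing heuristic, the correction table, the single contextual rule and the `prefix-+root` notation are all hypothetical, chosen only to show how each module builds on the output of the previous one. The sketch also assumes that the Improver replaces a known-incorrect analysis with the correct one.

```python
# Toy lexicon: surface word -> candidate analyses (illustrative notation only).
LEXICON = {
    "membaca": ["meN-+baca"],
    "beruang": ["beruang", "ber-+uang"],  # ambiguous: 'bear' vs 'to have money'
    "seekor": ["se-+ekor"],
}

# Known bad guesses and their corrections, consulted by the Improver.
CORRECTIONS = {("dinding", "di-+nding"): "dinding"}


def annotator(words):
    """Module 1: look every word up in the lexicon."""
    return [(word, list(LEXICON.get(word, []))) for word in words]


def guesser(analysed):
    """Module 2: guess analyses only for words module 1 left unanalysed."""
    out = []
    for word, cands in analysed:
        if not cands:
            # Naive heuristic: treat an initial "di" as the passive prefix.
            cands = ["di-+" + word[2:]] if word.startswith("di") else [word]
        out.append((word, cands))
    return out


def improver(analysed):
    """Module 3: swap known-incorrect analyses for the correct ones."""
    return [
        (word, [CORRECTIONS.get((word, c), c) for c in cands])
        for word, cands in analysed
    ]


def disambiguator(analysed):
    """Module 4: apply one contextual rule; unresolved ambiguities are kept."""
    out = []
    for i, (word, cands) in enumerate(analysed):
        prev = analysed[i - 1][0] if i > 0 else None
        # After the animal classifier "seekor", read "beruang" as the noun.
        if word == "beruang" and prev == "seekor":
            cands = ["beruang"]
        out.append((word, cands))
    return out


def run_pipeline(words):
    result = annotator(words)
    for module in (guesser, improver, disambiguator):
        result = module(result)
    return result


print(run_pipeline(["seekor", "beruang"]))
print(run_pipeline(["dinding", "ditulis"]))
```

In this toy setup the Guesser mis-splits "dinding" ('wall') as a passive form and the Improver repairs it, while "beruang" stays ambiguous unless the preceding word triggers the contextual rule.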

Keywords: SANTI-morf, morpheme annotation, Indonesian, rules, pipeline, Nooj