Previous seminars

Seminars from previous years are still being added. The archive is still available on the old website.


Week 22

Thursday 5th May 2022


Microsoft Teams

SafeSpacesNLP: Exploring behaviour classification around online mental health conversations from a multi-disciplinary context - NLP, applied linguistics, social science and human-in-the-loop AI

Dr Stuart Middleton1 & Dr Elena Nichele2

1University of Southampton, ECS  2University of Nottingham

  • Abstract

In this seminar, we will present the UKRI TAS hub SafeSpacesNLP research project. The SafeSpacesNLP project is exploring how multi-disciplinary teams can use NLP-based behaviour classification algorithms to identify online harmful behaviours for children and young people. We will present our research from the viewpoint of Computer Science and the viewpoint of Applied Linguistics / Social Science, highlighting both the existing research challenges for each discipline and how SafeSpacesNLP is trying to address some of these. We will conclude with the longer-term challenges we see for this research area.

Dr Stuart Middleton:
Dr Stuart E. Middleton is an Associate Professor at the University of Southampton, Electronics and Computer Science (ECS), AIC Group. He has over the last 17 years made internationally recognized contributions to research into natural language processing and information extraction. Many of his grants are cross-disciplinary in nature, featuring consortia with a mixture of academic and commercial partners experienced in a range of domains and disciplines.

He is currently PI of NERC-funded platform grant GloSAT (Global Surface Air Temperature - NLP for data rescue) and UKRI TAS Hub funded SafeSpacesNLP (Behaviour classification NLP for online harmful behaviours for children and young people). He is CoI of ESRC-funded grant FloraGuard (Tackling the illegal trade in endangered plants - Socio-technical NLP) and ESRC-funded ProTecThem (Building Awareness for Safer and Technology-Savvy Sharenting).

He is a Turing Fellow, was chair of the recent WebSci'20 Workshop 'Socio-technical AI systems for defence, cybercrime and cybersecurity' and is a member of the Sector Leads Committee of the UKRI Trustworthy and Autonomous Systems (TAS) Hub. He has been an invited AI expert at both the ATI/DSTL workshop 2019 on 'Decision Support for Military Commanders' and the UK Cabinet Office ministerial AI roundtable event 2019 on 'use of AI in policing' which was chaired by the policing minister.

Dr Elena Nichele:
Dr Elena Nichele is a Research Fellow at the School of Computer Science of the University of Nottingham (UK), where she is a member of the Horizon Digital Economy Research Institute and the Trustworthy Autonomous Systems Hub and is involved in several multi-disciplinary projects.

Her main research interests lie in business communication, human-computer interactions, cross-cultural exchanges, and computer-mediated communication. She has been using linguistic approaches to examine perceptions surrounding social, cultural and political matters. Her previous research has involved corpus linguistics, (critical) discourse analysis, commodification, and digital marketing. In particular, she has been focusing on the concept (and communication) of (in)authenticity.

Her expertise combines linguistics and marketing. She was awarded a PhD in Applied Linguistics by Lancaster University (UK) and holds degrees in International Business (MA) from the University of Florida (USA) and Languages for Business Communications (MA) from the University of Verona (Italy).


Week 20

Thursday 24th March 2022


Microsoft Teams - request a link via email

Corpus framework analysis: integrating computational linguistics, corpus linguistics, and clinical psychology to analyse Reddit posts on personal recovery in bipolar disorder

Glorianna Jagfeld

Spectrum Centre for Mental Health Research, Lancaster University

  • Abstract

The concept of personal recovery, 'a way of living a satisfying, hopeful life even with the limitations caused by the illness' (Anthony, 1993), is of particular value in bipolar disorder, where symptoms often persist despite adequate treatment, but it has been under-researched. A recent systematic review defined the first conceptual framework for personal recovery in bipolar disorder, POETIC (Purpose & meaning, Optimism & hope, Empowerment, Tensions, Identity, Connectedness) (Jagfeld, Lobban, Marshall, et al., 2021). So far, personal recovery has only been studied in researcher-constructed environments (interviews, focus groups). Peer online support forum posts can serve as a complementary source of non-reactive data to study health beliefs and experiences.

By integrating corpus and computational linguistics and health research methods, this study analyses a corpus of public bipolar support forum posts from the discussion platform Reddit in relation to the lived experience of personal recovery. As people talk about a wide variety of topics on Reddit, selecting what is relevant presents a challenge in working with non-reactive data and led to our innovative corpus construction process. Starting from a 1B word dataset of Reddit posts by people with a self-reported bipolar disorder diagnosis (Jagfeld, Lobban, Rayson, et al., 2021), a series of automatic filtering steps involving computational linguistic methods and manual coding resulted in the 1.3M word PR-BD corpus of personal recovery-relevant posts.

To analyse the PR-BD corpus, I coded its lemmas into the POETIC framework via concordance analysis using #LancsBox 6.0. This constitutes a novel integration of corpus and computational linguistics and deductive framework analysis, which we have named corpus framework analysis (CFA). Preliminary CFA results show that three POETIC domains featured most in discussions on Reddit: Connectedness (particularly romantic relationships and social support), Purpose & meaning (parenting, work), and Empowerment (self-management and personal responsibility).


Anthony, W. A. (1993). Recovery from mental illness: the guiding vision of the mental health system in the 1990s. Psychosocial Rehabilitation Journal, 16(4), 11-23.

Jagfeld, G., Lobban, F., Marshall, P., & Jones, S. H. (2021). Personal recovery in bipolar disorder: Systematic review and "best fit" framework synthesis of qualitative evidence - a POETIC adaptation of CHIME. Journal of Affective Disorders, 292, 375-385.

Jagfeld, G., Lobban, F., Rayson, P., & Jones, S. H. (2021). Understanding who uses Reddit: Profiling individuals with a self-reported bipolar disorder diagnosis. Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access at NAACL 2021.

Week 19

Thursday 17th March 2022


Microsoft Teams - request a link via email

Finding and analyzing reported speech in interviews with clinical voice-hearers: im/politeness

Zsofia Demjen1 & Luke Collins2

1University College London   2CASS, Lancaster University

Week 18

Thursday 10th March 2022


Microsoft Teams

NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

Shamsuddeen Hassan Muhammad

MAPi-Joint Doctoral Program, University of Porto

  • Abstract

Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria—Hausa, Igbo, Nigerian-Pidgin, and Yorùbá—consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing, and labelling methods that enable us to create datasets for these low-resource languages. We evaluate a range of pre-trained models and transfer strategies on the dataset. We find that language-specific models and language-adaptive fine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.

Shamsuddeen is a PhD candidate at the MAPi-Joint Doctoral Program in Computer Science at the University of Porto. His current research interests focus on natural language processing for low-resource languages. He received his Master's degree from the University of Manchester, UK, and a Bachelor's Degree from Bayero University, Kano, Nigeria. He is a member of MasakhaneNLP and a faculty member at the Faculty of Computing at Bayero University, Kano, Nigeria.

The talk is on creating a Nigerian sentiment corpus. The paper has been submitted to LREC and is under review:

Week 17

Thursday 3rd March 2022


Microsoft Teams

A Manyfesto for Decolonizing AI

Sabelo Mhlambi

Stanford’s Digital Civil Society Lab

  • Abstract

"Decolonizing AI" is both a critique and an emerging movement among AI researchers, activists, and practitioners in both the West and the non-Western world. While its proponents have identified parallels between historical colonialism and the colonial-like scale and extractive nature of AI-related technologies developed by big tech companies, can a decolonial framing address broader socio-economic questions about power and agency within the creation and use of AI? This talk will explore the varying views on "decolonizing" AI and will build upon work from the "AI Decolonial Manyfesto" collaborative effort (

Sabelo is a computer scientist and researcher whose work focuses on the ethical implications of technology in the developing world, particularly in Sub-Saharan Africa, along with the creation of tools to make Artificial Intelligence more accessible and inclusive to underrepresented communities.

He is a Practitioner Fellow at Stanford's Digital Civil Society Lab and a Fellow at the Berkman-Klein Center for Internet & Society at Harvard. His research centers on examining the risks and opportunities of AI in the developing world, and on the use of indigenous ethical models as a framework for creating a more humane and equitable internet. His current technical projects include the creation of Natural Language Processing models for African languages, an alternative design of web platforms for decentralizing data, and an open-source library for offline networks.

Week 16

Thursday 24th February 2022


Microsoft Teams - request a link via email

Obesity in the News: A Corpus-Based Critical Analysis of the British Press

Gavin Brookes

CASS, Lancaster University

  • Abstract

In this talk, I present findings from the Representations of Obesity in the News project - a recent programme of work carried out within the ESRC Centre for Corpus Approaches to Social Science. The aim of the project was to examine the discourses that the British press draws upon to represent the topic of obesity. To do this, we assembled and analysed a 36-million-word corpus of national British newspaper articles about obesity published between 2008 and 2017. Using a corpus-based approach to Critical Discourse Studies, the project has explored how the press represents people with obesity in ways that stigmatise them, how such representations vary according to newspaper formats and political leanings and social variables such as gender and social class, as well as how these representations have changed over time. This seminar will cover some of the main findings in relation to these and our other areas of focus and consider their implications for people living with obesity and society more broadly.

Week 14

Thursday 10th February 2022


Microsoft Teams

Systematic Inequalities in Language Technology Performance across the World's Languages

Antonios Anastasopoulos

George Mason University

  • Abstract

Abstract: Natural language processing (NLP) systems have become a central technology in communication, education, medicine, artificial intelligence, and many other domains of research and development. While the performance of NLP methods has grown enormously over the last decade, this progress has been restricted to a minuscule subset of the world's 6,500 languages. We introduce a framework for estimating the global utility of language technologies as revealed in a comprehensive snapshot of recent publications in NLP. Our analyses involve the field at large, but also more in-depth studies on both user-facing technologies (machine translation, language understanding, question answering, text-to-speech synthesis) as well as more linguistic NLP tasks (dependency parsing, morphological inflection). In the process, we (1) quantify disparities in the current state of NLP research, (2) explore some of its associated societal and academic factors, and (3) produce tailored recommendations for evidence-based policy making aimed at promoting more global and equitable language technologies.

Bio: Antonios Anastasopoulos is an Assistant Professor in Computer Science at George Mason University. He received his PhD in Computer Science from the University of Notre Dame, advised by David Chiang, and then did a postdoc at the Language Technologies Institute at Carnegie Mellon University. His research is on natural language processing with a focus on low-resource settings, endangered languages, and cross-lingual learning, and is currently funded by the National Science Foundation, the National Endowment for the Humanities, Google, Amazon, Meta, and the Virginia Research Investment Fund.

Week 12

Thursday 27th January 2022


Microsoft Teams - request a link via email

"Erm you know that thing over there..." Towards a general service list of conversational British English for TESOL - A Pilot Study

Kevin Gerigk

LAEL, Lancaster University

  • Abstract

This pilot study examines the 500 most frequent words (K0.5 Level) in spoken British English. The overall aim of the project is to compile a novel General Service List of Spoken English based on the spoken British National Corpus 2014 (BNC2014). Linguists have been interested in vocabulary lists, especially in the field of language teaching, for almost 70 years now. Whilst intuition and experience have been replaced with corpus-driven methods, the majority of vocabulary lists are based either on written data or on written and minimal spoken data (cf. O'Keeffe et al., 2007). This ignores the uniqueness of the spoken genre, a gap that this research aims to fill. In this pilot study, I focused on the K0.5 Level of the spoken BNC2014, accessed via SketchEngine. I will argue in favour of the lemma as the lexical unit of interest, in contrast with word families (cf. Nation, 1990), to allow a more fine-grained analysis of important vocabulary. In terms of statistics, I use Average Reduced Frequency, following the example used in the recent compilation of the new General Service List (Brezina & Gablasova, 2015). The advantage of this statistic is that it takes into account both the frequency of an item and its distribution across the corpus, thus eliminating outliers. After retrieval, the data is sifted through manually to eliminate a closed category of lemmas, e.g. swear words, which have little pedagogical value. My presentation will focus mainly on the methodological side of the pilot study but will also report findings on 1) the coverage and distribution of core vocabulary and 2) a description of core vocabulary items, and their implications for ELT.
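
As an illustration of the statistic mentioned in the abstract, the following is a minimal sketch of Average Reduced Frequency, assuming its usual definition: the corpus is treated as circular, the distance between consecutive occurrences of a word is capped at the average distance v = N / f, and the capped distances are summed and divided by v. The function name and the toy positions are assumptions made for this example and are not taken from the study.

```python
def average_reduced_frequency(positions, corpus_size):
    """ARF of a word given the 0-based token positions where it occurs."""
    f = len(positions)
    if f == 0:
        return 0.0
    positions = sorted(positions)
    v = corpus_size / f  # average distance between occurrences
    # Cyclic distances: the gap before the first hit wraps around from the last hit.
    distances = [positions[0] + corpus_size - positions[-1]]
    distances += [b - a for a, b in zip(positions, positions[1:])]
    # Each gap contributes at most v, so bursts of close hits count for little.
    return sum(min(d, v) for d in distances) / v


# Two toy words, each with 4 hits in a 100-token corpus.
evenly_spread = average_reduced_frequency([0, 25, 50, 75], 100)
clustered = average_reduced_frequency([0, 1, 2, 3], 100)
print(evenly_spread)  # 4.0
print(clustered)      # 1.12
```

Both toy words have the same raw frequency, but the evenly spread one keeps an ARF equal to that frequency, while the clustered one drops towards 1. This is the sense in which the statistic combines frequency and distribution.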


Brezina, V., & Gablasova, D. (2015). Is there a core general vocabulary? Introducing the new general service list. Applied Linguistics, 36(1), 1-22.

Nation, P. (1990). Teaching & learning vocabulary. Newbury House Publishers.

O'Keeffe, A., McCarthy, M., & Carter, R. (2007). From corpus to classroom - Language use and language teaching. Cambridge University Press (CUP).

Week 9

Thursday 9th December 2021


Microsoft Teams - request a link via email

Social Media Monitoring of the COVID-19 Pandemic and Influenza Epidemic With Adaptation for Informal Language in Arabic Twitter Data: Qualitative Study

Lama Alsudias

SCC, Lancaster University

  • Abstract



Twitter is a real-time messaging platform widely used by people and organizations to share information on many topics. Systematic monitoring of social media posts (infodemiology or infoveillance) could be useful to detect misinformation outbreaks as well as to reduce reporting lag time and to provide an independent complementary source of data compared with traditional surveillance approaches. However, such an analysis is currently not possible in the Arabic-speaking world owing to a lack of basic building blocks for research and dialectal variation.


We collected around 4000 Arabic tweets related to COVID-19 and influenza. We cleaned and labeled the tweets relative to the Arabic Infectious Diseases Ontology, which includes nonstandard terminology, as well as 11 core concepts and 21 relations. The aim of this study was to analyze Arabic tweets to estimate their usefulness for health surveillance, understand the impact of the informal terms in the analysis, show the effect of deep learning methods in the classification process, and identify the locations where the infection is spreading.


We applied the following multilabel classification techniques: binary relevance, classifier chains, label power set, adapted algorithm (multilabel adapted k-nearest neighbors [MLKNN]), support vector machine with naive Bayes features (NBSVM), bidirectional encoder representations from transformers (BERT), and AraBERT (transformer-based model for Arabic language understanding) to identify tweets appearing to be from infected individuals. We also used named entity recognition to predict the place names mentioned in the tweets.
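
To make two of the problem-transformation techniques named above more concrete, here is a minimal sketch of how binary relevance and label power set recast a multilabel task. It assumes the standard definitions (one binary target per label, versus one class per distinct label combination). The label names and function names are hypothetical and are not the labels used in the study.

```python
def binary_relevance_targets(label_sets, labels):
    """Binary relevance: one independent 0/1 target vector per label."""
    return {lab: [int(lab in s) for s in label_sets] for lab in labels}


def label_powerset_targets(label_sets):
    """Label power set: each distinct label combination becomes one class id."""
    class_ids = {}
    targets = []
    for s in label_sets:
        key = tuple(sorted(s))
        # Assign the next free id the first time a combination is seen.
        targets.append(class_ids.setdefault(key, len(class_ids)))
    return targets, class_ids


# Hypothetical label sets for four tweets (illustrative names only).
tweet_labels = [{"infected", "symptom"}, {"symptom"}, set(), {"infected", "symptom"}]

br = binary_relevance_targets(tweet_labels, ["infected", "symptom"])
lp_targets, lp_classes = label_powerset_targets(tweet_labels)
print(br)          # {'infected': [1, 0, 0, 1], 'symptom': [1, 1, 0, 1]}
print(lp_targets)  # [0, 1, 2, 0]
```

Binary relevance trains one classifier per target vector and ignores label co-occurrence, whereas label power set trains a single multiclass classifier over the combinations. Classifier chains sit between the two by feeding earlier label predictions into later classifiers.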


We achieved an F1 score of up to 88% in the influenza case study and 94% in the COVID-19 one. Adapting for nonstandard terminology and informal language helped to improve accuracy by as much as 15%, with an average improvement of 8%. Deep learning methods achieved an F1 score of up to 94% during the classifying process. Our geolocation detection algorithm had an average accuracy of 54% for predicting the location of users according to tweet content.


This study identified two Arabic social media data sets for monitoring tweets related to influenza and COVID-19. It demonstrated the importance of including informal terms, which are regularly used by social media users, in the analysis. It also proved that BERT achieves good results when used with new terms in COVID-19 tweets. Finally, the tweet content may contain useful information to determine the location of disease spread.

JMIR Med Inform 2021;9(9):e27670

Week 7

Thursday 25th November 2021


Microsoft Teams - request a link via email

Advances in NLP for African Languages

Bonaventure Dossou1 & Chris Emezue2

1Mila Quebec AI Institute  2Mila Quebec AI

  • Abstract

"Language is inherent and compulsory for human communication. Whether written, spoken, or signed, language ensures understanding between people of the same and different regions. African languages (over 2000) are complex and truly low-resourced. These languages receive minimal attention: the datasets required for NLP applications are difficult to discover and existing research is hard to reproduce. However, this is changing. With the growing awareness and effort to include more low-resourced languages in NLP research, African languages have recently been a major subject of research in natural language processing. The presentation is the story of the dynamic duo, Chris and Bonaventure, as they work towards tackling the NLP challenges facing African languages. The talk will cover a range of topics with a focus on the OkwuGbé end-to-end speech recognition system for Fon and Igbo."

Chris and Bona are members of the Masakhane Community (

Week 6

Thursday 18th November 2021


Microsoft Teams - request a link via email

Early Modern English Trials: Introducing the Corpus

Emma Pasquali

University of Naples L’Orientale

  • Abstract

This paper illustrates the peculiarities of the Corpus of Early Modern English Trials (1650-1700), henceforth EMET, a highly specialized historical corpus of trial proceedings. The main purpose of the creation of the above-mentioned corpus is to shed light on the pragmatic aspects of Early Modern spoken English, since trial proceedings are considered records of authentic dialogues (Culpeper and Kytö 2010:17).

The initial part of the essay will illustrate the phase of the archives consultation, the criteria behind the selection of the trials and it will also discuss the technical stages that are necessary to the uploading of a corpus on #LancsBox and its study. Afterwards, the EMET itself will be presented by specifying the number of documents, the total number of tokens, the types of charges involved and the average number of tokens per text. Besides, the paper will also present a prototype of trial information sheet that will provide a guide for users.

The final part of the study will display a comparison between the EMET and A Corpus of English Dialogues 1560-1760 (CED), in order to underline the kinship and the differences they



Primary Sources

CED = A Corpus of English Dialogues 1560-1760 (2005). Compiled by Merja Kytö (Uppsala University, Sweden) and Jonathan Culpeper (Lancaster University, England).

Secondary Sources

Brezina, V. / Timperley, M. / McEnery, T. (2018). #LancsBox v. 4.x [software]. Available at:

Culpeper, J. / Kytö, M. (1997). "Towards a corpus of dialogues, 1550-1750." In Language in Time and Space: Studies in Honour of Wolfgang Viereck on the Occasion of his 60th Birthday, edited by H. Ramisch and K. Wynne, 60-73. Stuttgart: Franz Steiner Verlag.

Culpeper, J. / Kytö, M. (2010). Early Modern English Dialogues: Spoken Interaction as Writing. Cambridge: Cambridge University Press.

Garside, R. / Leech, G. / McEnery, T. (eds). (1997). Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Taylor & Francis.

Kytö, M. / Walker, T. (2003). "The Linguistic Study of Early Modern English Speech-Related Texts: How 'Bad' can 'Bad' Data Be?" Journal of English Linguistics, 31:221-248.

Week 1

Thursday 14th October 2021


Teams meeting - please contact for link

SANTI-morf: a new morphological annotation system for Indonesian texts


CASS, Lancaster University

  • Abstract

SANTI-morf is a new morphological annotation system at morpheme level for Indonesian texts. It can analyse words formed by means of affixation, reduplication, compounding (which are three major morphological operations in Indonesian morphology (Mueller 2007:1221-1222)) and clitics. SANTI-morf implements a robust tokenisation and annotation scheme devised by Prihantoro (2019). Indonesian words are tokenised into morphemes: the orthographic and citation forms of the morphemes must be presented. SANTI-morf tags are fine-grained. Analytic labels to identify sub-categories of affixes (prefixes, suffixes, circumfixes, infixes) and reduplications (full, partial, imitative) are included. SANTI-morf also includes various analytic labels to analyse how each morpheme functions, for instance outcome POS (the POS of the words formed by certain affixes), active, passive, reciprocal, iterative etc.

The program that I use to develop SANTI-morf is Nooj (Silberztein 2003), a rule-based linguistic text analyser. By using Nooj we can write lexicon and rules and apply them to analyse texts at multiple linguistic levels (morphology, morphosyntax, syntax, etc). Nooj can also be used as a corpus query program to retrieve the morphemes in the corpus annotated by SANTI-morf.

The lexicons and rules I wrote for SANTI-morf can be grouped into four modules and are applied sequentially. This multi-module pipeline helps reduce ambiguity even before the disambiguation rules are applied. The first module is the Annotator, which analyses all words in the target text(s). The second module is the Guesser, which analyses all words left unanalysed by the Annotator. The third module is the Improver, which identifies incorrect analyses given by the Annotator or the Guesser, and adds the correct analyses. The last module is the Disambiguator which contains contextual and non-contextual disambiguation rules to resolve ambiguities. Unresolved ambiguities are kept.

The evaluation shows that SANTI-morf gives 99% precision and 99% recall.

Keywords: SANTI-morf, morpheme annotation, Indonesian, rules, pipeline, Nooj
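
The four-module pipeline described in the abstract can be pictured with a toy sketch. This is not SANTI-morf or Nooj code. The lexicon, prefix list, correction table, disambiguation rule and all function names are hypothetical, and for brevity the modules here run word by word instead of over a whole text. The sketch only shows the order of the stages and the idea that unresolved ambiguities are kept.

```python
# Hypothetical toy lexicon: word -> candidate morpheme analyses.
LEXICON = {
    "membaca": [("meN-", "baca")],
    "seekor": [("se-", "ekor")],
    "beruang": [("beruang",), ("ber-", "uang")],  # 'bear' vs 'to have money'
}
GUESS_PREFIXES = ("di", "ter")        # hypothetical prefixes the guesser strips
CORRECTIONS = {"diri": [("diri",)]}   # words the guesser is assumed to split wrongly


def annotate(word):
    """Annotator: look the word up in the lexicon."""
    return list(LEXICON.get(word, []))


def guess(word):
    """Guesser: analyse words the Annotator left unanalysed."""
    splits = [(p + "-", word[len(p):]) for p in GUESS_PREFIXES
              if word.startswith(p) and len(word) > len(p)]
    return splits or [(word,)]


def improve(word, analyses):
    """Improver: substitute a known correct analysis (a simplification)."""
    return list(CORRECTIONS.get(word, analyses))


def disambiguate(prev_word, word, analyses):
    """Disambiguator: one contextual rule; anything unresolved is kept."""
    if word == "beruang" and prev_word == "seekor":
        return [("beruang",)]
    return analyses


def analyse(tokens):
    results, prev = [], None
    for tok in tokens:
        analyses = annotate(tok) or guess(tok)
        analyses = improve(tok, analyses)
        analyses = disambiguate(prev, tok, analyses)
        results.append((tok, analyses))
        prev = tok
    return results


print(analyse(["seekor", "beruang"]))
print(analyse(["beruang", "dibaca", "diri"]))
```

In the first call the animal classifier "seekor" resolves "beruang" to the single-morpheme reading. In the second call there is no such context, so both readings of "beruang" are kept.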