Previous seminars

Seminars from previous years are still being added; in the meantime, the archive remains available on the old website.



Week 25

Thursday 25th May 2023


Microsoft Teams - request a link via email

A Corpus-based Analysis of North Korean Defectors in Public Discourse: Comparison of South Korean Newspapers and Western Media

Prof. Sun-Hee Lee

Department of East Asian Languages and Cultures, Wellesley College, MA, USA

  • Abstract


In March 2022, the number of North Korean Defectors (NKDs) residing in South Korea reached 33,882 according to the Ministry of Unification. Despite a surge in the number of defectors over two decades, socio-economic challenges and prejudice against those who crossed the border continue to intensify. In this presentation, I provide a corpus-based analysis of public discourse regarding NKDs. The analysis examines how media function to formulate the identity of NKDs and stereotypes/prejudices through linguistic representation. Noting how power is exercised through language in social and political structure, my study presents an analysis of how the South Korean and Western news media identify, categorize, and represent NKDs and explores the dynamics of language, identity, and power in public discourse. The analysis of public discourse must be interrogated from broad realms of social, historical, and political contexts. In conjunction with the long-term research project on the comprehensive discourse analysis of NKDs, the current work focuses on four major broadsheet newspapers that have distinct political stances and investigates interactive discourse features that contribute to representations of NKDs in the South Korean community. Additionally, I examine how the same topics have been represented in the Western media, including The New York Times, The Guardian, and The Times (London). Both qualitative analysis and quantitative tools explicate how language is used to substantiate stereotypes and bias by media, resulting in a credible analysis. The outcomes are expected to reveal empirical issues and challenges not only for current South Korean society and its inclusion of NKDs but also for a reunified Korea in the future.

Bio Profile

Dr. Sun-Hee Lee is a Professor of Korean in the Department of East Asian Languages and Cultures at Wellesley College in the US. She earned her doctoral degrees from the Linguistics Department at The Ohio State University and from the Korean Language and Literature Department at Yonsei University. Dr. Lee's research areas include corpus linguistics, learner corpora, and discourse analysis. She has published several books and articles on Korean grammatical constructions, corpus analysis, and learner language. Her recent research interest is in a corpus-based analysis of media, gender, and personal narratives in addition to learner corpus research.

For queries or a meeting link, contact Dr. Ignatius Ezeani.

Week 24

Thursday 18th May 2023


Microsoft Teams - request a link via email

Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models

Victor Dibia

Microsoft HAX Team

  • Abstract

Systems that support users in the automatic creation of visualizations must address several subtasks: understanding the semantics of data, enumerating relevant visualization goals, and generating visualization specifications. In this work, we pose visualization generation as a multi-stage generation problem and argue that well-orchestrated pipelines based on large language models (LLMs) and image generation models (IGMs) are suitable for addressing these tasks. This talk presents LIDA, a novel tool for generating grammar-agnostic visualizations and infographics. LIDA comprises four modules: a SUMMARIZER that converts data into a rich but compact natural language summary, a GOAL EXPLORER that enumerates visualization goals given the data, a VISGENERATOR that generates, refines, executes, and filters visualization code, and an INFOGRAPHER module that yields data-faithful stylized graphics using IGMs. LIDA provides a Python API and a hybrid user interface (direct manipulation and multilingual natural language) for interactive charts, infographics, and data story generation.
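As a rough illustration of what "multi-stage generation" can mean in practice, the sketch below chains the first three stages named in the abstract as plain Python functions. It is not LIDA's API: every function name, the prompt wording, and the `fake_llm` stand-in are hypothetical, and the INFOGRAPHER stage is left out because it would need an image generation model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Chart:
    goal: str   # the visualization goal this chart answers
    code: str   # generated plotting code (never executed in this sketch)


def summarize(rows: List[Dict]) -> str:
    """Stage 1 (hypothetical): compact text summary of each column's range."""
    parts = []
    for col in sorted(rows[0].keys()):
        values = [row[col] for row in rows]
        parts.append(f"{col}: {min(values)}..{max(values)}")
    return "; ".join(parts)


def explore_goals(summary: str, generate: Callable[[str], str]) -> List[str]:
    """Stage 2 (hypothetical): ask the text generator for one goal per line."""
    reply = generate(f"Suggest visualization goals for: {summary}")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def generate_charts(summary: str, goals: List[str],
                    generate: Callable[[str], str]) -> List[Chart]:
    """Stage 3 (hypothetical): generate code per goal, drop empty results."""
    charts = []
    for goal in goals:
        code = generate(f"Write plotting code for '{goal}' given: {summary}")
        if code.strip():  # simple filter step; a real system would also execute
            charts.append(Chart(goal=goal, code=code))
    return charts


def run_pipeline(rows: List[Dict], generate: Callable[[str], str]) -> List[Chart]:
    """Chain the stages so that each one consumes the previous stage's output."""
    summary = summarize(rows)
    goals = explore_goals(summary, generate)
    return generate_charts(summary, goals, generate)


def fake_llm(prompt: str) -> str:
    """Stand-in for a language model so the sketch runs offline (assumption)."""
    if prompt.startswith("Suggest"):
        return "Show price by year\nShow the distribution of price"
    return "# plotting code would go here"


rows = [{"year": 2021, "price": 10.0}, {"year": 2022, "price": 12.5}]
charts = run_pipeline(rows, fake_llm)
for chart in charts:
    print(chart.goal)
```

Passing the generator in as a parameter keeps the orchestration independent of any particular model, which is one way to read "grammar-agnostic" at the pipeline level.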

Overall, the talk will cover:

  • The design of a tool (LIDA) that concretely applies LLMs to the task of automated visualization generation
  • Challenges with evaluating LLM-based systems and how they are addressed with LIDA
  • Current research challenges and opportunities for LLM-enabled visualization tools.

Project Page:


Victor Dibia is a Principal Research Software Engineer at the Human-AI eXperiences (HAX) team, Microsoft Research, where he focuses on Generative AI. His research interests span human-computer interaction, computational social science, and applied machine learning. Victor's work has been published at conferences such as ACL, EMNLP, AAAI, and CHI, earning multiple best paper awards and garnering attention from media outlets like the Wall Street Journal and VentureBeat. He is an IEEE Senior member, a Google Certified Professional in Data Engineering and Cloud Architect, and a Google Developer Expert in Machine Learning. Victor holds a Ph.D. in Information Systems from the City University of Hong Kong and a Masters in Information Networking from Carnegie Mellon University.

For queries or a meeting link, contact Dr. Ignatius Ezeani.

Week 23

Thursday 11th May 2023


Microsoft Teams - request a link via email

Computational Interpretation of Embedded Narrative Discourse

Professor Pablo Gervás

Universidad Complutense de Madrid

  • Abstract
The discourse for narratives beyond the simplest ones often conveys at different points of its span conflicting views on the events it is describing. The characters in the narrative will tell stories to other characters, and these stories are not always true. Authors will often exploit this to mislead someone--sometimes particular characters, sometimes the reader. In doing this, the author is hoping to achieve not a discourse intended to reflect the truth of the events at a given point in time, but rather an evolving sequence of views about those events presented in the specific order in which the narrator wants the reader to experience them. The discourse for a complete narrative will usually include several conflicting views in terms of what events are true, and these views need to be managed. Sometimes the reader needs to keep more than one view in mind. For instance, in the tale of Snow White, the hunter tells the queen that he has killed Snow White, but he has really let her go. Readers easily keep track of the beliefs of the different characters, and then adjust well when the queen discovers the truth and sets out to kill her herself. Sometimes they only have one view but it can change radically at a particular point in the story. In the Star Wars saga, Luke Skywalker has been told that Darth Vader killed his father, and both he and the viewers believe that is so until the big reveal in The Empire Strikes Back. A process of interpretation of narrative would need to model how these various views are constructed, stored, and validated or falsified as the reader progresses through the reading of the discourse. Computational procedures for interpreting a story need to account for these embedded stories in terms of how to represent them and how to process them.
The talk will present ongoing work towards the construction of a computational model that can identify embedded discourses, represent the way they are used to convey recursively embedded stories and interpret the resulting structures into a sequence of evolving interpretations of what is happening in the story.

Pablo Gervás holds a PhD in Computing from Imperial College, University of London (1995), and he is currently a full professor of computational creativity and natural language processing (Catedrático de Universidad) at Universidad Complutense de Madrid. He is the director of the NIL research group, and for many years he was the director of the Instituto de Tecnología del Conocimiento. He has been the national coordinator for Spain of the FP7 EU projects PROSECCO, WHIM, and ConCreTe in the area of Computational Creativity. He has been the coordinator for two national research projects (GALANTE and MILES) involving several institutions and the principal investigator for two more (IDiLyCo and CANTOR). His main research interest currently lies in the study of the role that computers can play in helping people interested in literary creativity. Prof Gervás has expertise in automatic generation of (fictional) stories and poetry, and has a background in natural language generation, Computational Creativity, and narratology. He is the author of the PropperWryter software which was used in the process of creating Beyond the Fence -- the first computer-generated musical, staged at the London West End in 2016.

For queries or a meeting link, contact Dr. Ignatius Ezeani.

Week 21

Thursday 27th April 2023


Microsoft Teams - request a link via email

Creating and Visualising Semantic Story Maps

Valentina Bartalesi Lenzi


  • Abstract

A narrative is a conceptual basis of collective human understanding. Humans use stories to represent characters' intentions, feelings and the attributes of objects and events. A widely-held thesis in psychology to justify the centrality of narrative in human life is that humans make sense of reality by structuring events into narratives. Therefore, narratives are central to human activity in cultural, scientific, and social areas. Story maps are computer science realizations of narratives based on maps. They are online interactive maps enriched with text, pictures, videos, and other multimedia information, whose aim is to tell a story over a territory. This talk presents a semi-automatic workflow that, using a CRM-based ontology and Semantic Web technologies, produces semantic narratives in the form of story maps (and timelines as an alternative representation) from textual documents. An expert user first assembles one territory-contextual document containing text and images. Then, automatic processes use natural language processing and Wikidata services to (i) extract entities and geospatial points of interest associated with the territory, (ii) assemble a logically-ordered sequence of events that constitute the narrative, enriched with entities and images, and (iii) openly publish online semantic story maps and an interoperable Linked Open Data-compliant knowledge base for event exploration and inter-story correlation analyses. Once the story maps are published, users can review them through a user-friendly web tool. Overall, our workflow complies with Open Science directives of open publication and multi-discipline support and is appropriate to convey "information going beyond the map" to scientists and the general public.
As demonstrations, the talk will show workflow-produced story maps to represent (i) 23 European rural areas across 16 countries, their value chains and territories, (ii) a medieval journey, and (iii) the history of the legends, biological investigations, and AI-based modelling for habitat discovery of the giant squid Architeuthis dux.

Valentina Bartalesi Lenzi is a researcher at the Institute of Information Science and Technologies (ISTI) of The National Research Council of Italy (CNR) and external professor of Semantic Web in the Computer Science master's degree course at the University of Pisa. She earned her PhD in Information Engineering from the University of Pisa and graduated in Digital Humanities from the University of Pisa. Her research fields mainly concern Knowledge Representation, Semantic Web technologies, and the development of formal ontologies for representing textual content and narratives. She has participated in several European and National research projects, including CRAEFT, MOVING, MINGEI, PARTHENOS, E-RIHS PP, IMAGO, and DanteSources. She is the author of over 50 peer-reviewed articles in national and international conferences and scientific journals.

For queries or a meeting link, contact Dr. Ignatius Ezeani.

Week 20

Thursday 13th April 2023


Microsoft Teams - request a link via email

Scaling contrastive training of auto encoders for NLP and low resource settings

Stephen Mander

Postgraduate Researcher, School of Computing and Communications

  • Abstract

Contrastive training still underlies many technologies within the realm of machine learning. It has shown much promise in multimodal activations and logical abilities. However, replication remains an ongoing challenge in academic and low-resource communities. This talk showcases an exploration of using different data shapes to train models with multiple input streams. There are myriad applications in supervised training, low-resource languages, cross-modal training, and machine translation tasks where annotations are almost non-existent.

For queries, contact Ignatius Ezeani.

Week 20

Thursday 23rd March 2023


Microsoft Teams - request a link via email

Text mining for health knowledge discovery from social media

Suzan Verberne

Leiden Institute of Advanced Computer Science, Leiden University

  • Abstract

Patient forums are forums centered around patient communities. Previous qualitative work has shown that patients gather on patient forums to exchange information and experiences, and support each other emotionally. Patient forums can also be a source for medical hypotheses, e.g. on the effectiveness of medication and side effects. This specifically benefits patients with a rare disease for which clinical trials are often too costly. In the Ph.D. project of Anne Dirkson, we have developed text mining techniques to process and extract information from the large volume of messages on a patient forum. Specifically, we have focussed on extracting the side effects of medications, and the coping strategies of patients who suffer from these side effects. The extraction and aggregation of this information are more challenging than extracting regular named entities (like names and locations) because side effects and coping strategies are not proper nouns; they can be described in widely varying ways. For example, one could describe their headache with 'my head is bursting', 'throbbing pain in my head', or 'pounding headache' to name a few. In my presentation, I will explain the challenges of knowledge discovery from patient forum data, the methods that we developed, and the results that we obtained. I will also show how the extracted information relates to results from questionnaire data among patients.

Short bio:

Suzan Verberne is an associate professor at the Leiden Institute of Advanced Computer Science at Leiden University. She is the group leader of Text Mining and Retrieval. She obtained her Ph.D. in 2010 on the topic of Question Answering and has since then been working on the edge between Natural Language Processing (NLP) and Information Retrieval (IR). She has supervised projects involving a large number of application domains: from social media to law and from archaeology to health. Her research focus is to advance NLP "beyond the benchmark", addressing challenging problems in specific domains. She is highly active in the NLP and IR communities, holding chairing positions in large worldwide conferences.

For queries, contact Chloe or Ignatius.

Week 10

Thursday 15th December 2022


Microsoft Teams - request a link via email

Describing pain: using corpus linguistic techniques to investigate language used by sufferers to express and communicate their experiences of pain

Jane Demmen

LAEL, Lancaster University

  • Abstract

This talk describes the analysis of ways in which pain is described by people experiencing a particular health condition, trigeminal neuralgia (TN), in comparison to people experiencing a wider range of painful conditions. The research was prompted by a request from a healthcare professional with a view to gaining a more nuanced understanding of the ways people voluntarily describe pain relating to TN and pain relating to more generic musculoskeletal conditions, to assist in clinical practice and patient communication. Using a range of corpus linguistic techniques, the use of different terms to describe and evaluate pain is explored in two corpora of online forum contributions, with particular focus on the pain descriptors which feature in the short version of the McGill Pain Questionnaire (a widely-used instrument in healthcare settings in the diagnosis and treatment of pain).

Week 9

Thursday 8th December 2022



Assessing Hybrid Identities in Online Extremist Communities through sociolinguistic styles

Shengnan Liu

Psychology, Lancaster University

  • Abstract

Style-shifting has been the focus of language variation and change in sociolinguistics since the 1960s. As sociolinguistic styles are sensitive to social change (Ure, 1982), it is not surprising that they have become a focus of social psychologists who seek to assess social identities through linguistic styles. ASIA (Automated Social Identity Assessment toolkit) (Koschate et al., 2021), a toolkit which leverages machine learning and natural language processing to automatically assess which identity is situationally salient through sociolinguistic styles, has been proven to be successful in assessing feminist and parent identity in Reddit and Mumsnet online communities. Cork et al. (2022) have applied ASIA to assess entrepreneur and libertarian identities. With an interest in the recent rise in online influence of hybrid communities which are characterised by ideological mutations, this study investigates the dynamic nature and influence of hybrid eco-fascist identities. It trains and validates an ASIA model to automatically assess which identity (eco or fascist) is situationally salient. This allows us to examine the dynamic interplay of these identities over time, and the role that linguistic style plays in the expression of the ecological and the fascist identities in eco-fascist movements. To train the model, the study used Reddit data from environmental and far-right forums that were publicly available for the period 2016-2020. Once trained, ASIA was applied to public data from Reddit eco-fascist forums. Topic modelling and corpus linguistics analysis are then adopted to validate the results produced by the ASIA model. The results demonstrate that 1) sociolinguistic styles can indeed be used to detect and assess hybrid identities, and 2) interdisciplinary research on hybrid identity assessment provides new methodological and theoretical insights to social psychology, sociolinguistics, and computational linguistics.

Week 8

Thursday 1st December 2022


Microsoft Teams

Categorising keywords: a case study on German conspiracy discourse

Nathan Dykes

Friedrich-Alexander-Universität Erlangen-Nürnberg

  • Abstract

Keyword analysis is central to corpus-assisted discourse studies (CADS), as a means of comparing two corpora on a high level. It is typically used to identify starting points for a more detailed analysis.

Usually, keywords are grouped into thematic categories, which are seen as pointers to central topics of the discourse at hand.

However, there is no best practice as to how these categories are formed, and this question has so far received little attention.

In this talk, two different approaches to keyword categorisation in CADS are compared on the keywords of two actors known to spread conspiracies and misinformation on German Telegram channels.

The first strategy examined is the classic approach of topic-based categories, where the categories formed by two independent researchers are compared to explore how individual experts might differ in what central topics are identified.

The second strategy places more focus on linguistic form by annotating surface-level semantic and grammatical features rather than discourse-dependent topics.

Overall, the study hopes to shift the methodological discussion towards the role of the researcher and of linguistic versus thematic categories.

Week 4

Thursday 3rd November 2022


Microsoft Teams - request a link via email

Speech Analytics for the Detection of Neurological Conditions in Global English

Sam Hollands

University of Sheffield

  • Abstract

Dementia is an umbrella term for the loss of cognitive and memory abilities caused by a wide variety of neurological conditions. It has been discovered that both the content of an individual's discourse and the acoustics of their produced speech can be automatically analysed to help detect dementia and other neurological conditions. Whilst the cutting edge demonstrates effective diagnostic capabilities on L1 (native) speakers of English, this talk will explore ongoing research assessing the efficacy and exploring solutions for L2+ (non-native) English performance. This research treats a dementia classification pipeline as a modular system containing an automatic speech recognition (ASR) component to extract transcribed language, followed by a classification stage that uses features extracted from the acoustic signal and the transcribed output. Limitations of ASR across a wide range of L2+ backgrounds will be explored, challenging existing beliefs about the competency of state-of-the-art cloud-based ASR APIs on non-native speech and critically assessing the limitations of word error rate (WER) as the ubiquitous metric for ASR evaluation. My talk will then explore ongoing research into the features of dementia, potential issues in the generalisability of sparse dementia corpora, and early work looking at the impact of features of non-native speech.
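For readers unfamiliar with the metric mentioned above, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the ASR output into the reference transcript, divided by the number of reference words. The sketch below computes it with a standard word-level edit distance; the function name and the example sentences are illustrative and are not taken from the talk.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions are needed to reach an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative transcripts (assumed): one substitution plus one deletion.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))
```

Because every word counts equally in this calculation, dropping a function word costs the same as dropping a content word, which is one commonly noted reason a single WER figure can hide differences that matter for downstream tasks.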

Week 4

Monday 31st October 2022


Microsoft Teams - request a link via email

FoLD: a permanent, controlled-access, online repository for forensic linguistic research

Tim Grant

School of Languages & Social Sciences, Aston University

  • Abstract

This talk presents an innovative online resource for sharing and accessing forensic linguistics data, the Forensic Linguistic Databank (FoLD), developed in the Aston Institute for Forensic Linguistics (AIFL) at Aston University, Birmingham. FoLD is a permanent, controlled-access online repository for forensic linguistic data, including malicious communication data, investigative interview data, hate speech, and legal language.

Access to relevant forensic linguistic data has been notoriously challenging since the conception of the discipline in the 1960s. FoLD represents the first attempt to provide researchers with the opportunity of sharing datasets of different levels of sensitivity and ethical concern.

In this talk we present the FoLD repository, how to donate data, and how to access already existing datasets from the website.

We further showcase a project carried out by researchers in the FoLD research centre at AIFL using data from FoLD.

This talk is a cross-over with FORGE, who provide seminars on forensic linguistics.

Week 3

Thursday 27th October 2022


Microsoft Teams - request a link via email

Towards a methodological tree: combining Discourse Theory, Critical Discourse Studies and Corpus Linguistics

Katy Brown

University of Bath

  • Abstract

Discourse studies as a broad field has demonstrated openness to incorporating mixed methodologies and perspectives to provide a range of insights into complex phenomena. This paper seeks to propose a new framework which brings together the diverse traditions of Discourse Theory (DT), Critical Discourse Studies (CDS) and Corpus Linguistics (CL). While there are some excellent examples of work combining two of these approaches, particularly CDS and CL (e.g., Subtirelu and Baker, 2018; Baker, 2012), and a growing discussion around the potential compatibility of DT and CDS (Brown, 2020; De Cleen et al., 2021), or DT and CL (Wilkinson, 2022; Nikisianis et al., 2019), there have been very few attempts to bring them all together into a coherent research programme. The aim here then, expanding on recent studies conducted using this framework (Brown and Mondon, 2020; Brown, Mondon and Winter, 2021), is to develop a detailed account of how this combination can be achieved and what benefits it brings to the field of discourse studies. To demonstrate the way this can be implemented in textual analysis, examples are drawn from a study of far-right Brexit discourse and the process of mainstreaming.

Week 1

Thursday 13th October 2022


Welcome Lecture LT1 / Teams link available via email

A year to remember? Introducing the BE21 corpus and exploring recent part of speech tag change in British English

Paul Baker

CASS, Lancaster University

  • Abstract

This talk describes the collection and analysis of the most recent edition of the Brown family, the BE21 corpus, consisting of 1 million words of written British English texts, published in 2021. Using measures of the Coefficient of Variance, the frequencies of part-of-speech tags in BE21 are compared against the other four British members of the Brown family (from 1931, 1961, 1991 and 2006). Part-of-speech tags that are steadily increasing or decreasing in all five or the latest three corpora are examined via concordance lines and their distributions in order to identify new and emerging trends in British English. The analysis points to the continuation of some trends (such as declines in modal verbs and titles of address), along with newer trends like the rise of first person pronouns. The analysis indicates that more general trends of densification, democratisation and colloquialisation are continuing in British English.
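The comparison described above can be sketched in a few lines, assuming the measure is the standard deviation of a tag's frequency across the corpora divided by its mean (often called the coefficient of variation), together with a check for a steady rise or fall. The frequencies below are invented for illustration and are not BE21 results; the function names and the choice of population standard deviation are likewise assumptions.

```python
from statistics import mean, pstdev
from typing import List, Optional


def coefficient_of_variation(values: List[float]) -> float:
    """Standard deviation divided by the mean (population form, an assumption)."""
    average = mean(values)
    if average == 0:
        raise ValueError("mean is zero, so the ratio is undefined")
    return pstdev(values) / average


def steady_direction(values: List[float]) -> Optional[str]:
    """Return 'increasing' or 'decreasing' if every step moves the same way."""
    if len(values) < 2:
        return None
    steps = list(zip(values, values[1:]))
    if all(earlier < later for earlier, later in steps):
        return "increasing"
    if all(earlier > later for earlier, later in steps):
        return "decreasing"
    return None


# Hypothetical frequencies of one part-of-speech tag in five corpora,
# ordered 1931, 1961, 1991, 2006, 2021. Not real data.
tag_frequencies = [15000, 14200, 13100, 12400, 11800]

print(round(coefficient_of_variation(tag_frequencies), 4))
print(steady_direction(tag_frequencies))        # trend across all five corpora
print(steady_direction(tag_frequencies[-3:]))   # trend across the latest three
```

A larger ratio flags tags whose frequency has changed more in relative terms, while the direction check narrows these down to tags that move consistently, which are the ones the abstract says are then inspected through concordance lines.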