Previous seminars

Seminars from previous years are still being added; in the meantime, the archive remains available on the old website.


Week 28

Thursday 11th June 2020


Online: join mailing list or contact organisers to receive link

Language resources: Arabic Song Lyrics Corpus & Igbo-English Machine Translation Benchmark

Mahmoud El-Haj & Ignatius Ezeani

SCC, Lancaster University

  • Abstract
This session presents two articles recently published at LREC 2020 and the ICLR AfricaNLP workshop.

Part 1: Habibi - a multi Dialect multi National Arabic Song Lyrics Corpus (Mahmoud El-Haj)

This paper introduces Habibi, the first Arabic song lyrics corpus. The corpus comprises more than 30,000 Arabic song lyrics in 6 Arabic dialects for singers from 18 different Arabic countries. The lyrics are segmented into more than 500,000 sentences (song verses) with more than 3.5 million words. I provide the corpus in both comma-separated values (csv) and annotated plain text (txt) file formats. In addition, I converted the csv version into JavaScript Object Notation (json) and eXtensible Markup Language (xml) file formats.

To experiment with the corpus, I ran extensive binary and multi-class experiments for dialect and country-of-origin identification. The identification tasks include the use of several classical machine learning and deep learning models utilising different word embeddings. For the binary dialect identification task, the best performing classifier achieved a testing accuracy of 93%. This was achieved using a word-based Convolutional Neural Network (CNN) utilising a Continuous Bag of Words (CBOW) word embeddings model. Overall, the results show that all classical and deep learning models outperform our baseline, which demonstrates the suitability of the corpus for both dialect and country-of-origin identification tasks. I am making the corpus and the trained CBOW word embeddings freely available for research purposes.

Part 2: Building Evaluation Benchmark for Igbo-English Machine Translation (Ignatius Ezeani, Paul Rayson, Ikechukwu Onyenwe, Chinedu Uchechukwu, Mark Hepple)

Although researchers and practitioners are pushing the boundaries and enhancing the capacities of NLP tools and methods, work on African languages is lagging. A lot of the focus is on well-resourced languages such as English, Japanese, German, French, Russian, Mandarin Chinese etc. Over 97% of the world's 7000 languages, including African languages, are low-resourced for NLP, i.e. they have little or no data, tools, and techniques for NLP research. For instance, only 5 out of 2965 (0.19%) authors of full-text papers in the ACL Anthology, extracted from the 5 major conferences in 2018 (ACL, NAACL, EMNLP, COLING and CoNLL), are affiliated to African institutions.

In this work, we will focus on Igbo, a low-resourced Nigerian language spoken by more than 50 million people globally, with over 50% of the speakers in southeastern Nigeria. We will discuss our effort toward building a standard machine translation benchmark dataset for Igbo. Our team identified some key challenges to achieving conceptual equivalence in Igbo-English translation and made a few suggestions. We will also present our plan for developing a benchmark translation model for Igbo-English translation.

Week 27

Thursday 4th June 2020


Online: join mailing list or contact organisers to receive link

Desirable and undesirable differentness: Exploring representations of people with schizophrenia in the British tabloids and broadsheets

James Balfour

LAEL, Lancaster University

  • Abstract

Widespread stigma towards people with schizophrenia and its negative impact on patient outcomes has long been documented (e.g. Wing, 1978). Over the last couple of decades, researchers in psychiatry and health studies have increasingly drawn attention to the role played by the British press in reproducing intolerant and inaccurate representations of people with schizophrenia in the British media, with many studies pointing to the British tabloids as a particular area of concern (e.g. Bowen et al., 2019; Clement & Foster, 2008). While insightful, previous studies have typically had a narrow focus. For instance, they have tended to focus on similarities between the tabloid and broadsheets' reportage over differences, with the result that they tend to dwell on the ubiquity of the topic of 'violence' over other topics. This research has also had more to say about the sensationalist reporting style in the tabloids over representations in the broadsheets.

In this talk, I discuss findings from a keyword analysis carried out as part of a larger project which examines representations of schizophrenia in the British national press between 2000 and 2015 using corpus-based techniques. By conducting a contrastive keyword analysis, I identify the distinctive lexis in each subcorpus, which provides insights into the unique topics which the tabloids and broadsheets associate with people with schizophrenia. The analysis reveals a more detailed and nuanced picture than previous studies account for. While it is true that the tabloids provide sensationalistic stories that represent people with schizophrenia as violent criminals, they also dwell on issues of problematised agency, blame and retribution as a distinctive feature of their reporting. In contrast, keywords in the broadsheets point towards a more nuanced and positive picture, in which people with schizophrenia are distinctively represented as artists and creative thinkers. I conclude by linking these findings with Goffman's (1963) definition of stigma as a perception of 'undesired differentness', whereby the broadsheets represent people with schizophrenia in terms of 'desired differentness' and the tabloids in terms of 'undesirable differentness'. I then discuss some of the potential wider implications these representations have for the lived experience of the disorder.

For more information about the project, see:


Bowen, M., Kinderman, P. & Cooke, A. (2019). Stigma: A linguistic analysis of the UK red-top tabloids press' representation of schizophrenia. Perspectives in Public Health, 139(3), 147-152. doi:10.1177/1757913919835858

Clement, S. and Foster, N. (2008). Newspaper reporting on schizophrenia: A content analysis of five national newspapers at two time points. Schizophrenia Research, 98(1), 178-183. doi:10.1016/j.schres.2007.09.028

Goffman, E. (1990). Stigma: Notes on the management of spoiled identity. Harmondsworth: Penguin. (Original work published in 1963).

Wing, J. K. (1978). Schizophrenia: towards a new synthesis. London: Academic Press.

Week 26

Thursday 28th May 2020


Online: join mailing list or contact organisers to receive link

Lived experience in bipolar disorder: a corpus-based study of Reddit social media posts

Glorianna Jagfeld

Spectrum Centre for Mental Health Research, Lancaster University

  • Abstract

Bipolar disorder is a severe mental health problem characterised by recurring episodes of depressed and elevated mood. About 1.5-2% of the European population are estimated to meet diagnostic criteria during their lifetime. Complementary to clinical recovery, mainly concerned with symptom reduction and restoration of functioning, personal recovery is a way of living a satisfying, hopeful life even with limitations caused by mental health issues. Since bipolar disorder symptoms often persist despite adequate treatment, this concept might be of importance.

Social media posts can serve as an alternative data source to study health beliefs and experiences, mitigating the questionnaire or interviewer bias that may occur in traditional health research methods. In my PhD project I'm aiming to shed more light on the experience of personal recovery in bipolar disorder via corpus and computational linguistic analyses of social media posts.

This talk will cover the following:

1) How I constructed a large dataset of 24M public posts on the social media platform Reddit by 20K people with a self-reported bipolar disorder diagnosis.

2) What demographic information about the people in the dataset I could automatically infer or extract from their posts.

3) First analyses of a corpus of posts in bipolar disorder-related subforums constructed from the full dataset. Particularly, I'll describe salient topics in the corpus that emerged from a key semantic domain analysis with the UCREL Semantic Analysis System (USAS).

4) Ideas for further analyses whose results I can relate to findings of previous qualitative research on personal recovery in bipolar disorder.

Since this is work in progress, I'm very much looking forward to your comments and suggestions.

Week 25

Thursday 21st May 2020


Online: join mailing list or contact organisers to receive link

Arabic NLP: infectious disease ontology with non-standard terminology & metaphorical expressions in sentiment analysis

Lama Alsudias & Israa Alsiyat

SCC, Lancaster University

  • Abstract

Part 1: Developing an Arabic Infectious Disease Ontology to Include Non-Standard Terminology (Lama Alsudias and Paul Rayson)

Building ontologies is a crucial part of the semantic web endeavour. In recent years, research interest has grown rapidly in supporting languages such as Arabic in NLP in general, but there has been very little research on medical ontologies for Arabic. We present a new Arabic ontology in the infectious disease domain to support various important applications, including the monitoring of infectious disease spread via social media. This ontology meaningfully integrates the scientific vocabularies of infectious diseases with their informal equivalents. We use ontology learning strategies with manual checking to build the ontology. We applied three statistical methods for term extraction from selected Arabic infectious disease articles: TF-IDF, C-value, and YAKE. We also conducted a study, consulting around 100 individuals, to discover the informal terms related to infectious diseases in Arabic. In future work, we will automatically extract the relations for infectious disease concepts, but for now these are manually created. We report two complementary experiments to evaluate the ontology: first, a quantitative evaluation of the term extraction results, and second, a qualitative evaluation by a domain expert.
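As a rough illustration of the first of the three term extraction methods named above, the following is a minimal sketch of TF-IDF scoring. It is not the authors' pipeline: the function name `tfidf_scores`, the unsmoothed weighting scheme and the toy documents (English stand-ins for tokenised Arabic articles) are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Return one {term: tf-idf score} dict per tokenised document.

    Uses relative term frequency and an unsmoothed idf, log(N / df).
    This is one common variant; the talk does not say which was used.
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy documents, already tokenised.
docs = [
    "influenza virus spreads through droplets".split(),
    "cholera spreads through contaminated water".split(),
    "the virus mutates quickly".split(),
]
scores = tfidf_scores(docs)

# Candidate terms for the first document, highest score first.
top_terms = sorted(scores[0], key=scores[0].get, reverse=True)[:3]
```

Terms that occur in only one document ("influenza", "droplets") score highest, while terms shared across documents ("spreads", "through") are down-weighted, which is why the measure is useful for proposing domain-specific candidate terms for manual checking.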

Part 2: Metaphorical Expressions in Automatic Arabic Sentiment Analysis (Israa Alsiyat and Scott Piao)

In recent years, Arabic language resources and NLP tools have been under rapid development. One of the important tasks for Arabic natural language processing is sentiment analysis. While significant improvement has been achieved in this research area, existing computational models and tools still struggle to deal with Arabic metaphorical expressions. Metaphor has an important role in the Arabic language due to its unique history and culture. Metaphors provide a linguistic mechanism for expressing ideas and notions that can be different from their surface form. Therefore, in order to efficiently identify the true sentiment of Arabic language data, a computational model needs to be able to "read between the lines". In this paper, we examine the issue of metaphors in automatic Arabic sentiment analysis by carrying out an experiment in which we observe the performance of a state-of-the-art Arabic sentiment tool on metaphors and analyse the result to gain a deeper insight into the issue. Our experiment clearly shows that metaphors have a significant impact on the performance of current Arabic sentiment tools, and that developing Arabic language resources and computational models for Arabic metaphors is an important task.

Week 18

Friday 6th March 2020


Charles Carter A19

Cross-lingual (English-Urdu) text reuse detection

Muhammad Sharjeel

SCC, Lancaster University

  • Abstract

Cross-lingual text reuse occurs when pre-existing texts in one language are used to create new texts in another language. Due to the increasing amount of digital text readily available on the Web and social media in multiple languages and freely accessible efficient Machine Translation systems, cross-lingual text reuse has increased to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailability of large-scale gold standard evaluation resources built on real cases.

The main objective of this talk is twofold: (1) to present the development of a large-scale benchmark standard evaluation corpus named the TREU (Text Reuse in English-Urdu language) Corpus, and (2) to show how the corpus could be used in the development and evaluation of cross-lingual (English-Urdu) text reuse detection systems.

The TREU Corpus is the first cross-lingual, cross-script text reuse corpus developed for a significantly under-resourced language pair (English-Urdu). It contains text from the journalism domain, manually annotated at three levels of text reuse, i.e. Wholly Derived, Partially Derived, and Non-Derived. It includes real cases of text reuse (in total 4,514 texts) at document level from English to Urdu. The corpus has been evaluated using a diverse range of methods categorised under three types, i.e. Translation plus Mono-lingual Analysis (T+MA), cross-lingual embeddings, and cross-lingual Vector Space Model. The use of these methods on the TREU Corpus shows its usefulness and how it can be utilised in the development of automatic methods for measuring cross-lingual (English-Urdu) text reuse at the document level.

Although the talk focuses on the English-Urdu language pair, the proposed corpus and methods could serve as a framework for future research in other languages and language pairs.

Week 17

Thursday 27th February 2020


Management School LT 10

Short-term Language Change of Groups in House of Commons Debates

Ed Dearden

SCC, Lancaster University

  • Abstract

This talk will cover work I am currently conducting, looking at language change in groups over relatively short periods of time. By using and adapting established methods for studying language change, I am hoping we can learn more about how the language of different groups changes within communities. As an initial dataset for this work, I am looking at House of Commons debates between 2015 and 2019, with specific interest in the groups formed around Brexit. One key challenge is in trying to establish whether methods created for looking at long time ranges are applicable over much shorter periods - in my case ~4 years.

In this talk, I will show what I've been working on, discuss the challenges of looking at short-term language change, and suggest future directions for taking this work forward. As this is a presentation of work-in-progress, I'm very keen to gain feedback and insight from as many people as possible.

Week 15

Thursday 13th February 2020


Charles Carter A18

Public opinion mining: analysing comments and reviews

Teh Phoey Lee

Sunway University

  • Abstract

We now live in a world where many people would much rather connect and share their experience on social media than speak to each other. There are many different types of social media platform, including Twitter, Blogosphere, Facebook, and Instagram. The opinions that are shared on social media platforms are easily picked up by other readers using the same platform, well beyond those the posts are primarily aimed at. Postings may be of significant interest to companies seeking to understand market sentiment or to political parties in helping them to predict the outcome of national elections. Postings may contain pictures and comments and are often followed by responses from friends and others. Using sentiment analysis, postings can be systematically identified, extracted, quantified, and studied. The information in them can be processed, managed, clustered and analysed for use in various applications, such as marketing, customer service, healthcare benefits or even civil protection: these uses may be both benign and malign.

In this talk, Dr Teh will share her research on sentiment analysis (also known as opinion mining or emotion AI), where natural language processing, text analysis, and computational linguistics are applied. A particular focus of Dr Teh's research is the understanding of sentiment in different countries, where different languages and cultures have enormous implications for the understanding of sentiment.


Dr Teh Phoey Lee is Associate Professor in the School of Science and Technology, Sunway University, and is a senior member of the IEEE Association. She completed her PhD in Management Information System at University of Putra, Malaysia in 2011 and has published over 50 papers, several of which have been co-authored with colleagues at Lancaster University. Dr Teh is fluent in many languages, particularly different Chinese dialects, and other Asian languages. This fluency enables her to shed a particularly knowledgeable light on the analysis of social media posts of Asian posters. Dr Teh has also served as Head of Department and Programme Leader for several postgraduate programmes, and led the development of a new undergraduate programme in Mobile Computing and Entrepreneurship.

Research Interest(s)

1. Sentiment analytics

2. Social media analytics

3. Textual analysis

4. Information Acquisition

Week 14

Thursday 6th February 2020


Charles Carter A15

Understanding caregivers' experiences of supporting suicidal relatives with psychosis or bipolar disorder: a qualitative analysis of an online peer support forum

Paul Marshall

Spectrum Centre for Mental Health Research, Lancaster University

  • Abstract

Over recent decades, significant research attention has focused on understanding the experiences of those who support or care for people with psychosis or bipolar disorder. Over the same period, research has consistently shown that those experiencing bipolar disorder, psychosis, and related diagnoses such as schizophrenia, are at a much greater risk of suicidal ideation, attempting suicide, or dying by suicide than the general population. Despite this, few studies to date have investigated caregivers' experiences of providing support to people with psychosis or bipolar disorder during periods of increased suicide risk.

A qualitative thematic analysis of data derived from an online peer support forum was undertaken to address this previously understudied topic. The forum was established by researchers at Lancaster University as part of a randomised controlled trial to evaluate an intervention for relatives of people with psychosis and bipolar disorder - the Relatives Education and Coping Toolkit. This seminar will present the initial findings of this analysis in addition to reviewing some of the challenges and advantages of using online forum data in qualitative research. There will also be an opportunity to discuss how corpus methods could be applied to further investigate this dataset.

Week 12

Thursday 23rd January 2020


Charles Carter A18

Internal validity in learner corpus research

Pascual Pérez-Paredes

Universidad de Murcia, RSLE University of Cambridge

  • Abstract

This talk will report findings from two case studies where different corpora have been used to investigate the use of stance adverbs in spoken communication. All the corpora discussed were collected using the same design criteria. The focus of this discussion is on the range of inferences from the data available (Gray, 2017) and ultimately on the nature of learner corpus research (LCR), both ontologically and epistemologically.

The first case study will probe into the use of native speaker corpora (Aguado-Jiménez et al., 2012), while the second will focus on the analyses of different English L2 (learner) corpora (Pérez-Paredes & Bueno, 2019). I will discuss the implications of using triangulation techniques (Baker & Egbert, 2016; Flick, 2018) in LCR and how researchers may benefit from increased criticality in their research designs.

Keywords: corpus linguistics, stance adverbs, data triangulation, research validity


Aguado-Jiménez, P., Pérez-Paredes, P. & Sánchez, P. 2012. Exploring the use of multidimensional analysis of learner language to promote register awareness, System 40(1), 90-103.

Baker, P. & Egbert, J. (eds). 2016. Triangulating methodological approaches in corpus linguistic research. London: Routledge.

Flick, U. 2018. Doing triangulation and mixed methods. London: Sage.

Gray, D. 2017. Doing research in the real world. 4th Edition. London: Sage.

Marchi, A. & Taylor, C. 2009. If on a winter's night two researchers...: a challenge to assumptions of soundness of interpretation. Critical Approaches to Discourse Analysis across Disciplines: CADAAD, 3(1), 1-20.

Pérez-Paredes, P. & Bueno, C. 2019. A corpus-driven analysis of certainty stance adverbs: obviously, really and actually in spoken native and learner English. Journal of Pragmatics, 140, 22-3

Week 11

Thursday 16th January 2020


Charles Carter A18

Corpus linguistics and clinical psychology: examining the psychosis continuum

Luke Collins¹ & Elena Semino²

¹CASS, Lancaster University; ²LAEL, Lancaster University

  • Abstract

We present our work with the 'Hearing the Voice' project, a study exploring experiences of Auditory Verbal Hallucinations (AVHs), or voices that others cannot hear. Auditory Verbal Hallucinations are experienced by a large proportion of individuals with a psychiatric diagnosis (such as schizophrenia or bipolar disorder) and approximately 1% of people with no psychiatric diagnosis (Kråkvik et al., 2015). Researchers have investigated similarities/differences across 'clinical' and 'non-clinical' populations (i.e. those who seek clinical support for their experiences and those who do not) and proposed a 'continuum' model for those experiences. Our corpus linguistic approach offers a novel contribution to debates in clinical psychology around the validity of the 'psychosis continuum' model.

We analysed semi-structured interviews with 67 'voice-hearers': 27 self-identified 'Spiritualists' (non-clinical) and 40 individuals registered with Early Intervention in Psychosis services (clinical) to consider what evidence there is for a 'continuum' with respect to their reports. We conducted a keyness analysis at the level of semantic domains, using the USAS tagger (Rayson, 2008). From the list of key semantic domains, we identified four major themes through which to investigate the (dis)similarity of aspects of the voice-hearing experience across our two cohorts: Affect; Control; Meaning-making; and Sensory input. These themes corresponded with aspects of the voice-hearing experience identified by psychologists as points of similarity/difference between clinical and non-clinical populations (Baumeister et al., 2017).
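A keyness analysis of this kind compares how often each semantic domain tag occurs in one set of transcripts relative to another. The sketch below uses the log-likelihood statistic, which is widely used for keyness in corpus linguistics; whether it matches the exact settings used in this study is an assumption. The domain labels, frequencies and corpus sizes are hypothetical and are not taken from the project data.

```python
import math


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood keyness score for one item across two corpora.

    freq_a, freq_b: occurrences of the item in corpus A and corpus B.
    size_a, size_b: total number of tagged tokens in each corpus.
    """
    combined = freq_a + freq_b
    expected_a = size_a * combined / (size_a + size_b)
    expected_b = size_b * combined / (size_a + size_b)
    score = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score


# Hypothetical tag counts: (frequency in cohort A, frequency in cohort B).
counts = {
    "domain_A": (120, 60),
    "domain_B": (45, 50),
    "domain_C": (10, 40),
}
size_a, size_b = 10_000, 12_000

# Rank domains from most to least distinctive between the two cohorts.
ranked = sorted(
    counts,
    key=lambda tag: log_likelihood(*counts[tag], size_a, size_b),
    reverse=True,
)
```

The statistic only measures how strongly a domain's frequency differs between the cohorts; comparing relative frequencies is still needed to see which cohort overuses it.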

We found that there is evidence for continuity between the reports of clinical and non-clinical participants, though in some circumstances there are also grounds for considering sub-categories of the clinical population. Our analysis thereby offers the means through which to critically assess the validity of the 'continuum' model and consider its implications for clinical treatment.


Baumeister, D., Sedgwick, O., Howes, O., Peters, E. (2017) Auditory verbal hallucinations and continuum models of psychosis: A systematic review of the healthy voice-hearer literature. Clinical Psychology Review 51: 125-41.

Kråkvik, B., Larøi, F., Kalhovde, A. M., Hugdahl, K., Kompus, K., Salvesen, Ø., Stiles, T. C. and Vedul-Kjelsås, E. (2015) Prevalence of auditory verbal hallucinations in a general population: a group comparison study. Scandinavian Journal of Psychology 56: 508-15.

Rayson, P. (2008) From key words to key semantic domains. International Journal of Corpus Linguistics 13(4): 519-549.

Week 10

Thursday 12th December 2019


Management School LT5

Weird and non-WEIRD: Introducing the Corpus of Indonesian Sign Language (BISINDO)

Nick Palfreyman

University of Central Lancashire

  • Abstract

We have entered the age of the sign language corpus, with several comprehensive corpora already available, including for Australian Sign Language, NGT (Sign Language of the Netherlands) and British Sign Language. However, there is currently a noticeable bias towards SLs of WEIRD (Western, educated, industrialized, rich, democratic) countries. This presentation introduces the BISINDO Corpus, which features over 45,000 tokens from spontaneous conversation between 131 participants using Indonesian Sign Language. For this corpus, data were collected between 2010 and 2017 from six Indonesian cities/islands. I begin by discussing some of the challenges in compiling the BISINDO Corpus, including, in some cases, finding deaf sign language users in the field. Other challenges are not particular to sign language research and seem to be faced by corpus linguists in many non-WEIRD societies, especially around ethics. I then move on to look at two examples of how the corpus can shed light on processes of language change in BISINDO. First, I look at the grammatical domain of negation, and second at signs based on BISINDO's two manual alphabets.

Week 9

Thursday 5th December 2019


B78 (DSI Space) InfoLab21

Extractive Summarisation for Scientific Articles; making them more discoverable

Daniel Kershaw


  • Abstract
At Elsevier, a lot of effort is focussed on content discovery for users, allowing them to find the most relevant articles for their research. This, at its core, blurs the boundaries of search and recommendation, as we are both pushing content to the user and allowing them to search the world's largest catalogue of scientific research. Apart from using the content as is, we can make new content more discoverable with the help of authors at submission time, for example by getting them to write an executive summary of their paper. However, doing this at submission time means that this additional information is not available for older content. This raises the question of how we can utilise the author's input on new content to create the same feature retrospectively for the whole Elsevier corpus. Focusing on one use case, we discuss how an extractive summarization model, trained on the user-submitted summaries, is used to retrospectively generate executive summaries for articles in the catalogue. Further, we show how extractive summarization is used to highlight the salient points (methods, results and findings) within research articles across the complete corpus. This helps users to identify whether an article is of particular interest to them. As a logical next step, we investigate how these extractions can be used to make research papers more discoverable by connecting them to other papers which share similar findings, methods or conclusions. In this talk we start from the beginning, understanding what users want from summarization systems. We discuss how the proposed use cases were developed and how these tie into the discovery of new content. We then look in more technical detail at what data is available and which deep learning methods can be utilised to implement such a system.
Finally, while we are working toward taking this extractive summarization system into production, we need to understand the quality of what is being produced before going live. We discuss how internal annotators were used to confirm the quality of the summaries. The monitoring of quality does not stop there, however: we continually monitor user interaction with the extractive summaries as a proxy for quality and satisfaction.

Joint UCREL and DSG talk

Week 6

Thursday 14th November 2019


Management School LT5

Acronyms as an Integral Part of Multi-Word Term Recognition - A Token of Appreciation

Irena Spasic

University of Cardiff

  • Abstract
The increasing amount of textual information requires effective term recognition methods to identify textual representations of domain-specific concepts as the first step toward automating its semantic interpretation. Dictionary look-up approaches may not always be suitable for dynamic domains such as biomedicine or newly emerging types of media such as patient blogs, the main obstacles being the use of non-standardised terminology and a high degree of term variation. Term conflation is the process of linking together different variants of the same term. In automatic term recognition approaches, all term variants should be aggregated into a single normalized term representative, which is associated with a single domain-specific concept as a latent variable. FlexiTerm is an unsupervised method for recognition of multi-word terms from a domain-specific corpus. It uses regular expressions to constrain the search space based on term formation patterns and then processes them statistically to identify the largest frequently occurring bags of words and the corresponding terms. FlexiTerm uses a range of methods to normalize three types of term variation: orthographic, morphological, and syntactic. Acronyms, which represent a highly productive type of term variation, were not originally supported. In this talk, we describe how the functionality of FlexiTerm has been extended to recognize acronyms and incorporate them into the term conflation process. We evaluated the effects of term conflation in the context of information retrieval as one of its most prominent applications. On average, relative recall increased by 32 points, whereas index compression factor increased by 7 percentage points. Therefore, evidence suggests that integration of acronyms provides a non-trivial improvement in term conflation.

Joint UCREL and DSG talk

Week 4

Thursday 31st October 2019


Charles Carter A18

Representations of health conditions in the UK and US press: A corpus linguistic approach

Ewan Hannaford

University of Glasgow

  • Abstract

Mental and physical illness have traditionally been seen as distinct categories, but this division is now recognised by medical experts and health professionals as largely unhelpful and inaccurate. Medical experts now often encourage a more holistic approach to healthcare, whereby mental health conditions are treated as just another type of illness (RCP Report, 2010; Kolappa et al., 2013). However, amongst the general public, mental illness remains highly stigmatised and viewed as distinct from physical illness (Kendell, 2001; Pescosolido et al., 2010). Media representations have a large impact on public perceptions and significantly influence the prevalence of stigmas and stereotypes surrounding both traditionally physical and traditionally mental disorders (Wahl, 1995; Stuart, 2006; Young, Norman, & Humphreys, 2008). Differences in media coverage of such illnesses may therefore be contributing to the persistence of a societal distinction between 'physical' and 'mental' illness.

My research is investigating this using two regional corpora of UK and US press coverage, each spanning over 20 years and covering a range of disorders across the traditional physical/mental health spectrum. Through corpus linguistic analysis of these datasets, my work aims to uncover differences and similarities in the themes, topics, and attitudes present in different health condition discourses, and to identify the potential causes of these features. My talk will provide a brief background on my work and previous research into press representations of health conditions, before discussing my methodological approach and presenting some preliminary results from a recent pilot study I conducted. The implications of these pilot findings for my full study will then be explored, with a current update on the status of my research.


American Psychiatric Association. (1994). Diagnostic and Statistical Manual of Mental Disorders (4th Ed.). Washington, DC: American Psychiatric Association.

Kendell, R. (2001). The distinction between mental and physical illness. British Journal of Psychiatry 178. 490-493.

Kolappa, K., Henderson, D., & Kishore, S. (2013). No physical health without mental health: Lessons unlearned? Bulletin of the World Health Organization 91:3. 3-3A.

Pescosolido, B., Martin, J., Long, J., Medina, T., Phelan, J., & Link, B. (2010). "A disease like any other"? A decade of change in public reactions to schizophrenia, depression, and alcohol dependence. American Journal of Psychiatry 167. 1321-1330.

Royal College of Psychiatrists. (2010). No Health Without Mental Health: The Supporting Evidence. RCP Report. Available from:

Stuart, H. (2006). Media portrayal of mental illness and its treatments: What effect does it have on people with mental illness? CNS Drugs 20:2. 99-106.

Wahl, O. (1995). Media Madness: Public Images of Mental Illness. New Brunswick, NJ: Rutgers University Press.

Young, M., Norman, G., & Humphreys, K. (2008). Medicine in the popular press: The influence of the media on perceptions of disease. PLoS ONE 3:10. e3552. Available from:

Week 3

Thursday 24th October 2019


Charles Carter A18

Social Networks in Early Modern English Comedies

Jakob Ladegaard & Ross Deans Kristensen-McLachlan

Aarhus University

  • Abstract

Social network analysis is used in sociological and sociolinguistic research to study patterns of verbal interaction between members of a community. This method has rarely been applied to literary texts at scale, but in this talk I present a work in progress that attempts to use computationally assisted social network analysis on a corpus of around 20 dramatic texts: so-called prodigal son comedies written in English between 1590 and 1640. Literary criticism of these plays often focuses on the relationship between prodigal sons and their father figures but pays relatively little attention to the social networks they are part of. However, we believe these networks are important not only for the dramatic plots, but also for what the plays might tell us about social and economic questions of the time. We therefore wanted to study the plays' social networks. This can be done in terms of the plays' overall network metrics, which might reveal structural changes in this dramatic subgenre over time, but mainly we were interested in comparing characters with specific traits across plays. This was done by constructing networks with characters as nodes and their verbal exchanges as links. We extracted overall network metrics for all plays as well as count measures (line and word counts) and metric measures (degree and centrality measures) for all characters in all the texts. We then ranked the characters in each play according to their scores on these measures. This allowed us to compare the roles of different types of characters across plays, in particular the prodigal sons, their father figures and the minor characters who in some cases mediate their relation. The talk will present some preliminary results of these comparisons and end with a discussion of the possibility of combining this social network approach with other, more stylistically oriented corpus-based approaches to these texts.
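
As a rough illustration of the kind of network described above (characters as nodes, verbal exchanges as links), the following minimal Python sketch builds a network from a list of exchanges and ranks characters by degree centrality and by word count. It is not the speakers' implementation: the character names, the word counts, and the function names build_network, degree_centrality and rank are all hypothetical, and the network is assumed to be undirected.

```python
from collections import defaultdict

# Hypothetical toy data: (speaker, addressee, words spoken) per exchange.
exchanges = [
    ("Son", "Father", 40),
    ("Father", "Son", 55),
    ("Son", "Servant", 12),
    ("Servant", "Father", 8),
    ("Son", "Usurer", 30),
]

def build_network(exchanges):
    """Return undirected adjacency sets and per-character word counts."""
    neighbours = defaultdict(set)
    words = defaultdict(int)
    for speaker, addressee, n_words in exchanges:
        # A verbal exchange links the two characters in both directions.
        neighbours[speaker].add(addressee)
        neighbours[addressee].add(speaker)
        words[speaker] += n_words
        words[addressee] += 0  # make sure silent characters are listed too
    return neighbours, words

def degree_centrality(neighbours):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(neighbours)
    if n < 2:
        return {c: 0.0 for c in neighbours}
    return {c: len(adj) / (n - 1) for c, adj in neighbours.items()}

def rank(scores):
    """Characters ordered from highest to lowest score (ties by name)."""
    return sorted(scores, key=lambda c: (-scores[c], c))

neighbours, words = build_network(exchanges)
centrality = degree_centrality(neighbours)

print("By degree centrality:", rank(centrality))
print("By word count:", rank(words))
```

Ranking within each play, rather than comparing raw scores, is one way to make characters comparable across plays of different lengths and cast sizes.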

The work presented here was done in collaboration with Ross Deans Kristensen-McLachlan, Aarhus University.


Jakob Ladegaard is Associate Professor in Comparative Literature, Aarhus University. His research is primarily concerned with the relations between modern literature, politics and economy. He is the PI of the research project: 'Unearned Wealth - A Literary History of Inheritance, 1600-2015', 2017-2021. The project uses digital methods to study English and French literary representations of inheritance. Recent publications include Context in Literary and Cultural Studies (ed. with J.G. Nielsen), UCL Press, 2019.

Week 2

Thursday 17th October 2019


Charles Carter A18

Detecting Meaningful Multi-word Expressions in Political Text

Ken Benoit

London School of Economics and Political Science

  • Abstract
The rapid growth of applications treating text as data has transformed our ability to gain insight into important political phenomena. Almost universal among existing approaches is the adoption of the bag of words approach, counting each word as a feature without regard to grammar or order. This approach remains extremely useful despite being an obviously inaccurate model of how observed words are generated in natural language. Many politically meaningful textual features, however, occur not as unigram words but rather as pairs of words or phrases, especially in language relating to policy, political economy, and law. Here we present a hybrid model for detecting these associated words, known as collocations. Using a combination of statistical detection, human judgement, and machine learning, we extract and validate a dictionary of meaningful collocations from three large corpora totalling over 1 billion words, drawn from political manifestos and legislative floor debates. We then examine how the word scores of phrases in a text model compare to the scores of their component terms.
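
To give a sense of what a statistical detection step for collocations might look like, here is a minimal Python sketch that scores adjacent word pairs by pointwise mutual information (PMI). The abstract does not say which statistic the speaker uses, so PMI is an assumption made for illustration, as are the function name bigram_pmi, the min_count threshold and the toy sentence.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Pairs seen fewer than min_count times are skipped, since PMI is
    unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI compares the observed pair frequency with what would be
        # expected if the two words occurred independently.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Hypothetical toy text standing in for a political corpus.
text = ("the national debt rose and the national debt fell "
        "while the minimum wage rose and the minimum wage fell")
tokens = text.split()
scores = bigram_pmi(tokens)

for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

On this toy input the pair "rose and" scores exactly as highly as "national debt", which suggests why a purely statistical pass would need to be followed by the human judgement step the abstract mentions.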

Week 1

Thursday 10th October 2019


Charles Carter A18

Verbs in specialized language: the case of the knowledge base EcoLexicon

Míriam Buendía-Castro

University of Granada (Spain)

  • Abstract
This research presents EcoLexicon, a multilingual terminological knowledge base on the environment developed at the University of Granada which contains over 3,500 concepts and over 20,000 terms in English, Spanish, German, French, Russian, and Modern Greek. This talk focuses on how verb phraseological information is encoded in EcoLexicon. As is well known, verbs are an extremely important part of language; however, very few specialized knowledge resources include them. It is our assertion that verbs and their potential arguments can be classified and structured in a set of conceptual-semantic categories typical of a given specialized domain. In this context, when semantic roles and macroroles are specified as well as the resulting phrase structure, it is then possible to establish templates that represent this meaning for entire frames. In this regard, within the context of a specialized knowledge domain, the range of verbs generally associated with potential arguments can be predicted within the frame of a specialized event.