Previous seminars

Seminars from previous years are still being added; in the meantime, the archive remains available on the old website.



Week 18

Thursday 8th March 2018


Management School LT 9

Digging into Early Colonial Mexico: Developing Computational Approaches to study 16th Century historical documents

Patricia Murrieta-Flores

History, Lancaster University

Week 17

Thursday 1st March 2018


Management School LT 9

A Solution to the Problem of High Variance When Tuning NLP Models With K-fold Cross Validation

Henry Moss

STOR-i, Lancaster University

  • Abstract

K-fold cross validation (CV) is a popular method for estimating the true performance of machine learning models, allowing model selection and parameter tuning. However, the very process of CV requires random partitioning of the data and so our performance estimates are in fact stochastic, with variability that can be substantial for natural language processing tasks. We demonstrate that these unstable estimates cannot be relied upon for effective parameter tuning. The resulting tuned parameters are highly sensitive to how our data is partitioned, meaning that we often select sub-optimal parameter choices and have serious reproducibility issues.

We propose to instead use performance estimates based on the less variable J-K-fold CV. Our main contributions are extending the use of J-K-fold CV from performance estimation to parameter tuning and investigating how best to choose J and K. To balance effectiveness and computational efficiency we advocate lower choices of K than are typically seen in the NLP literature and instead use the saved computation to increase J. We provide empirical evidence for this claim across a range of NLP tasks.
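As a rough illustration of the idea, J-K-fold CV can be read as J independent repetitions of K-fold CV, each with a fresh random partition, with the J × K held-out scores averaged into one estimate. The sketch below uses only the Python standard library. The function names and the `evaluate(train, test)` callback are hypothetical and are not taken from the talk.

```python
import random
from statistics import mean


def k_fold_splits(n_items, k, rng):
    """Randomly partition the indices 0..n_items-1 into k folds of near-equal size."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def j_k_fold_estimate(data, k, j, evaluate, seed=0):
    """Average the held-out score over j independent repetitions of k-fold CV.

    `evaluate(train, test)` is a hypothetical callback that fits a model on
    `train` and returns its score on `test`.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(j):
        # A fresh random partition for every repetition.
        for held_out in k_fold_splits(len(data), k, rng):
            held_out_set = set(held_out)
            train = [x for i, x in enumerate(data) if i not in held_out_set]
            test = [data[i] for i in held_out]
            scores.append(evaluate(train, test))
    return mean(scores)


def toy_evaluate(train, test):
    """Toy 'model': predict the training mean; score is negative mean absolute error."""
    centre = mean(train)
    return -mean([abs(x - centre) for x in test])


data = [float(x) for x in range(30)]
estimate = j_k_fold_estimate(data, k=5, j=4, evaluate=toy_evaluate)
```

For parameter tuning, the same estimate would be computed for each candidate setting and the best-scoring setting kept. Lowering `k` and raising `j` keeps the number of model fits (`j * k`) roughly constant.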

Joint Data Science Group and UCREL talk

Week 15

Thursday 15th February 2018


Faraday Seminar Room 4

Second-person plural forms in World Englishes: A Corpus-based Study

Liviana Galiano

LAEL, Lancaster University

  • Abstract

Varieties of English world-wide show that a linguistically explicit difference between singular and plural second-person pronouns is still alive and well. Corpus-based research shows how plural second-person forms behave linguistically, providing information about frequencies, old and new functions and patterns, and suggesting the direction of the grammaticalisation processes that are taking place.

Week 14

Thursday 8th February 2018


Infolab C60b/c

Scaling Entity Linking with Crowdsourcing

Dyaa Albakour

Signal Media

  • Abstract

Signal Media is a research-led technology company that uses Artificial Intelligence (AI) and Machine Learning (ML) to turn streams of unstructured text, e.g. news articles, into useful information.

One of the core components of Signal's text analytics pipeline is entity linking. In this presentation, I first review the current state of the art for the task of entity linking (EL) and make the case for using supervised learning approaches to tackle EL. These approaches require large amounts of labelled data, which represents a bottleneck for scaling them out to cover large numbers of entities. To mitigate this, we have developed a production-ready solution to efficiently collect high-quality labelled data at scale using Active Learning and Crowdsourcing. In particular, I will discuss the different steps and the challenges in tuning the design parameters of the crowdsourcing task to limit the noise, reduce the cost and maximise the effectiveness of the resulting machine learning models for EL.

Joint Data Science Group and UCREL talk

Week 13

Thursday 1st February 2018


Management School LT 9

Beyond the checkbox: understanding what patients say in online feedback

Paul Baker

CASS, Lancaster University

  • Abstract

This talk describes the analysis of a 29 million word corpus of online patient feedback about the NHS submitted to the website NHS Choices. I focus on ways that corpus methods can be implemented in order to answer questions that were set by the Patients and Information Directorate at NHS England. This includes examining 1) how linguistic markers differ in relation to the quantitative rating of between 1 and 5 which patients gave along with their written feedback, 2) how patients evaluated different members of NHS staff and 3) how patients' self-descriptors relating to age and gender affected the way they gave feedback. A mixture of corpus techniques including keywords, collocates and concordancing is implemented, with findings indicating a range of patient expectations and motivations for leaving feedback. I end by reflecting on the potential consequences of incorporating a 'market values' model of the NHS.

Week 12

Thursday 25th January 2018


George Fox LT2

Profiling Medical Journal Articles Using a Gene Ontology Semantic Tagger

Mahmoud El-Haj¹ & Jo Knight²

¹SCC, Lancaster University  ²CHICAS, Medical School, Lancaster University

  • Abstract

In many areas of academic publishing, there is an explosion of literature, and sub-division of fields into subfields, leading to stove-piping where sub-communities of expertise become disconnected from each other. This is especially true in the genetics literature over the last 10 years where researchers are no longer able to maintain knowledge of previously related areas.

This paper extends several approaches based on natural language processing and corpus linguistics which allow us to examine corpora derived from bodies of genetics literature and will help to make comparisons and improve retrieval methods using domain knowledge via an existing gene ontology.

We derived two open access medical journal corpora from PubMed related to psychiatric genetics and immune disorder genetics. We created a novel Gene Ontology Semantic Tagger (GOST) and lexicon to annotate the corpora and are then able to compare subsets of literature to understand the relative distributions of genetic terminology, thereby enabling researchers to make improved connections between them.

Week 11

Thursday 18th January 2018


Management School LT9

Teacher scaffolding in SEN classrooms: insights from corpus methods

Gill Smith

CASS, Lancaster University

  • Abstract

This research applies a corpus-based method to the study of teaching (and specifically teacher scaffolding) in SEN classrooms, thus enabling the exploitation of a larger and therefore more representative sample of language used than previous research in this field. In particular, this talk shall focus upon the use of directives. Directives are utterances which function to elicit some kind of action or response on the part of the listener. I shall begin by outlining the scaffolding literature's definition of directives, making distinctions between verbal and physical action directives. However, these definitions of directives are often extremely ill-defined from a linguistic perspective. Thus, in order to define corpus queries, the next step is to move from these vague descriptions to grammatically sound definitions of linguistic forms. This process was informed by the grammars of Biber et al. (1999) and Quirk et al. (1987), where links were made between directive forms and imperative sentence structures. The next step is to translate these grammatical definitions into corpus queries. In the case of the directives feature, this was done by translating the grammatical forms into a multiword regular expression query appropriate for CWB/CQPweb. For example, the linguistic structure of verbal directives was translated to the following CQP syntax: [pos="V.*0" & semtag="Q2.*"]. These queries were created through a trial and error process, and were successful to varying degrees, which shall be explained throughout this talk. The final stage is to apply these queries to the SEN corpus to look at the use of directives in SEN classroom interactions. This is a two-step analysis: first looking at the frequency and linguistic structure of different teacher directives and then looking in depth at pupil responses to these directives.
We can use these findings to make inferences about teacher-pupil interactions in SEN classrooms and the significant role teacher directives play in this, which I shall explore in more depth throughout this UCREL talk.
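To make the query in the abstract concrete: a single-token CQP pattern such as [pos="V.*0" & semtag="Q2.*"] asks for a token whose part-of-speech tag and semantic tag both match the given regular expressions in full. The Python sketch below imitates that behaviour on a tiny hand-made token list. The tokens and their tags are invented for illustration and are not drawn from the SEN corpus, and `matches` is a hypothetical helper rather than part of CWB or CQPweb.

```python
import re


def matches(token, constraints):
    """Return True if every named attribute of the token fully matches its regex."""
    return all(
        re.fullmatch(pattern, token.get(attribute, "")) is not None
        for attribute, pattern in constraints.items()
    )


# Invented example tokens carrying a POS tag and a semantic tag each.
tokens = [
    {"word": "Tell", "pos": "VV0", "semtag": "Q2.1"},
    {"word": "me", "pos": "PPIO1", "semtag": "Z8"},
    {"word": "said", "pos": "VVD", "semtag": "Q2.1"},
    {"word": "look", "pos": "VV0", "semtag": "X3.4"},
]

# Python counterpart of the CQP token pattern [pos="V.*0" & semtag="Q2.*"].
query = {"pos": r"V.*0", "semtag": r"Q2.*"}

hits = [token["word"] for token in tokens if matches(token, query)]
```

Only a token that satisfies both constraints is kept, which mirrors the `&` in the CQP pattern.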

Week 11

Wednesday 20th December 2017


Management School LT9

'Excuse me but are you a blooming idiot'. The use of apologies in teenage talk.

Karin Aijmer

University of Gothenburg

  • Abstract

Research on (im)politeness has drawn attention to the ways in which norms and values may differ depending on factors such as the speech situation or the age and gender of the speakers. Apologies are, for example, closely associated with politeness and face concerns but they also have a potential for discursive and strategic impoliteness. This paper will explore how teenagers use apology expressions such as excuse me, (I'm) sorry, (I beg your) pardon in mocking or sarcastic ways to construct a particular identity.

The apology expressions are collected from the COLT Corpus (the Bergen Corpus of London Teenage Language) which consists of about half a million words of teenage spoken language. The starting-point for the research comes from the observation that the apology phrases were strikingly more frequent in COLT than in corresponding corpora with adult speakers (the London-Lund Corpus of Spoken English, the spoken part of the British component of the ICE-Corpus). It can therefore be hypothesized that teenagers use apologies differently from adults.

The focus is on the following research questions: How are apologies used by teenagers for fighting, teasing, mock politeness, joking, novelty and creativity? In what ways is the insincere interpretation of the apology expression signaled linguistically? It will be argued that the strategies are used consciously by the teenagers to construct a social identity and to establish solidarity with members of their peer group.

Week 10

Thursday 14th December 2017


Faraday Seminar Room 4

The Psychological Status of Collocation: Evidence from ERPs

Jennifer Hughes

CASS, Lancaster University

  • Abstract

In this presentation, I discuss the results of four ERP experiments which collectively aim to find out whether or not there is a neurophysiological difference in the way that the brain processes pairs of words which form collocations compared to pairs of words which do not form collocations. The ERP (or event-related potential) technique is a method of measuring the changes in voltage that occur in the brain in response to particular stimuli. The stimuli used in this research consist of corpus-derived adjective-noun bigrams which form strong collocations (Condition 1), and matched adjective-noun bigrams which do not form collocations (Condition 2). The bigrams are embedded into sentences which are presented in a word-by-word fashion.

In Experiment 1, I pilot a procedure for determining whether or not a detectable neurophysiological difference exists between Conditions 1 and 2 for native speakers of English. In Experiment 2, I replicate the pilot study using a different group of native English speakers; then, in Experiment 3, I replicate the procedure using non-native speakers of English (specifically, native speakers of Mandarin Chinese). In Experiment 4, I then investigate the psychological validity of different association measures, namely transition probability, mutual information, log-likelihood, z-score, t-score, Dice coefficient, MI3, and raw frequency.

The results reveal that there is a neurophysiological difference in the way that the brain processes corpus-derived collocational bigrams compared to matched non-collocational bigrams, and that this difference is larger for the non-native speakers than for the native speakers. Moreover, while there is a strong correlation between the amplitude of the brain response and all of the association measures studied in Experiment 4, the strongest correlations exist between amplitude and hybrid association measures, including z-score, MI3, and Dice coefficient. This suggests that mutual information and log-likelihood, which are two of the most commonly used association measures in corpus linguistics (Gries 2014a:37), are not necessarily always the optimal choice. I discuss these results in relation to prior literature from the fields of corpus linguistics and cognitive neuroscience.
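For readers unfamiliar with the association measures named above, the sketch below computes several of them from raw bigram counts using common textbook formulations. The exact definitions used in the experiments may differ, for example in how the expected frequency is derived. The function and argument names are illustrative, and log-likelihood and transition probability are left out for brevity.

```python
import math


def association_measures(pair_freq, freq_a, freq_b, corpus_size):
    """Common association scores for a bigram (a, b) from raw frequencies.

    pair_freq:   observed frequency of the bigram
    freq_a/b:    frequencies of the two words on their own
    corpus_size: total number of tokens in the corpus
    """
    # Frequency expected if the two words co-occurred only by chance.
    expected = freq_a * freq_b / corpus_size
    return {
        "MI": math.log2(pair_freq / expected),
        "MI3": math.log2(pair_freq ** 3 / expected),
        "t-score": (pair_freq - expected) / math.sqrt(pair_freq),
        "z-score": (pair_freq - expected) / math.sqrt(expected),
        "Dice": 2 * pair_freq / (freq_a + freq_b),
    }


# Invented counts for a single adjective-noun bigram.
scores = association_measures(pair_freq=10, freq_a=100, freq_b=50, corpus_size=100_000)
```

On these formulations, MI rewards exclusivity and tends to favour rare pairs, whereas the t-score leans towards frequent ones. This is one reason the choice of measure can change which bigrams count as strong collocations.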

Week 9

Thursday 7th December 2017


Charles Carter A17

Unsupervised Word Sense Disambiguation and Concept Extraction in Clinical and Biomedical Documents

Stéphan Tulkens

CLiPS, University of Antwerp

  • Abstract

The automated analysis of clinical text presents us with a set of unique challenges. First, clinical text differs from conventional textual domains, diminishing the performance of off-the-shelf resources which are not specifically developed with this domain in mind. As an example, clinical text is usually written in a relatively informal style, featuring abbreviations and idiomatic language use which might be specific to a given caregiver or hospital. Second, training data is sparse due to privacy constraints, which also makes it difficult to reuse annotated data and datasets in different projects, hampering progress.

In this talk, I will describe research on the extraction and disambiguation of concepts from patient notes and biomedical texts using unsupervised methods based on distributional semantics.

Specifically, we've shown that domain-specific distributional semantic vectors, when appropriately composed into higher-order context vectors, provide us with a sufficiently powerful instrument to be able to distinguish between multiple highly related senses in a biomedical Word Sense Disambiguation (WSD) task. Our unsupervised method obtains comparable performance to knowledge-based and supervised methods on the same task.
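One simple way to picture this kind of approach, offered here as an assumption rather than as the method used in the talk, is to average the vectors of the words around an ambiguous term into a context vector and pick the sense whose vector is closest by cosine similarity. The two-dimensional embeddings, the sense vectors and all function names below are invented for illustration.

```python
import math


def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dimension = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dimension)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def disambiguate(context_words, embeddings, sense_vectors):
    """Pick the sense whose vector is most similar to the averaged context vector."""
    known = [embeddings[w] for w in context_words if w in embeddings]
    if not known:
        return None  # no usable context words
    context = average(known)
    return max(sense_vectors, key=lambda sense: cosine(context, sense_vectors[sense]))


# Toy embeddings and two toy senses of the ambiguous word "cold".
embeddings = {
    "fever": [0.9, 0.1],
    "chills": [0.8, 0.2],
    "weather": [0.1, 0.9],
    "winter": [0.2, 0.8],
}
sense_vectors = {"illness": [1.0, 0.0], "temperature": [0.0, 1.0]}

chosen = disambiguate(["fever", "chills"], embeddings, sense_vectors)
```

No labelled training examples are needed in such a setup, only vectors learned from unannotated text, which is what makes the approach unsupervised.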

Additionally, we apply similar distributional semantic methods to concept extraction on free clinical text. On this task, our approach is outperformed by supervised concept extraction on the same dataset, but significantly outperforms other unsupervised concept extraction methods.

Joint Data Science Group and UCREL talk.

Week 8

Thursday 30th November 2017


Management School LT9

Error analysis in a learner corpus from Spanish-speaking EFL learners: A corpus-based study

María Victoria Pardo

Universidad de Antioquia, Colombia

  • Abstract

This study was conducted to identify and analyse the most common errors in the written output of learners of English as a foreign language (EFL) at university level at Universidad del Norte in Barranquilla, Colombia. Students were placed in different levels according to the Common European Framework of Reference for Languages (Europe 2001) (CEFR). The data was collected through the methodology of corpus linguistics, with the compilation of a learner corpus from university students aiming to complete their bachelor's degree. The analysis of the data was based on Corder's theory of Error Analysis (Corder 1981) and followed the description of errors given by James (James 1998). Errors were tagged using the Louvain University tagger (Estelle et al. 2005) in order to obtain results comparable with similar work worldwide. The analysis finds that, among the eight error categories proposed by Louvain's tagger, grammatical errors are the most recurrent. This paper gives a preview of the results from the study, which is part of a doctoral thesis whose final results will be presented in 2018.

Week 7

Thursday 23rd November 2017


Fylde D28

"It doesn't stop, it never, never stops, er, it doesn't stop evolving": Observing Spoken British English of the past 20 years through apparent and real-time evidence

Susan Reichelt

CASS, Lancaster University

  • Abstract

This presentation introduces the secondary data analysis of the Spoken BNC, past and present. A sociolinguistic approach to carefully compiled subsets of both corpora enables us to investigate language change in both synchronic and diachronic ways.

Tracking language change as it happens is, according to Chambers (1995:147), "the most striking single accomplishment of contemporary linguistics". Starting with Labov's work on Martha's Vineyard (1963) and in New York City (1966), sociolinguists have established apparent-time studies as a convenient method for tracing changes in language use. Notwithstanding its appeal through relative ease of data collection, the apparent-time method comes with caveats, as "data cannot uncritically be assumed to represent diachronic linguistic developments" (Bailey, 2008:314). Real-time studies, though arguably more difficult to conduct, offer the researcher the opportunity to see how language has progressed in actuality. The compilation of socially balanced subsets of Spoken BNC from 1994 and 2014 provides us with resources for a double tracked approach, as "in the best of circumstances, of course, researchers will be able to combine apparent-time data with real-time evidence, with the relative strengths of one approach offsetting the weaknesses of the other" (Bailey, 2008:330).

This talk presents some theoretical musings on age (and time more generally) as a sociolinguistic factor as well as first findings from the secondary data analysis project of the Spoken BNC. Intensifiers, some of the most well-researched discourse devices (e.g. Stenström et al., 2002; Ito and Tagliamonte, 2003; Tagliamonte and Roberts, 2005; Macaulay, 2006; Rickford et al., 2007; Tagliamonte, 2008; Barnfield and Buchstaller, 2010), provide an exemplary backdrop against which I will highlight ongoing language change and ways to observe it.

Week 6

Thursday 16th November 2017


Faraday SR4

Arabic Dialect Identification in the Context of Bivalency and Code-Switching

Mahmoud El-Haj¹ & Mariam Aboelezz²

¹SCC, Lancaster University  ²British Library

  • Abstract

In this work we use a novel approach towards Arabic dialect identification using language bivalency and written code-switching. Bivalency between languages or dialects is where a word or element is treated by language users as having a fundamentally similar semantic content in more than one language or dialect. Arabic dialect identification in writing is a difficult task even for humans because words are used interchangeably between dialects. The task of automatically identifying dialect is harder, and classifiers trained using only n-grams will perform poorly when tested on unseen data. Such approaches require significant amounts of annotated training data, which is costly and time-consuming to produce. Currently available Arabic dialect datasets do not exceed a few hundred thousand sentences, thus we need to extract features other than word and character n-grams. In our work we present experimental results from automatically identifying dialects from the four main Arabic dialect regions (Egypt, North Africa, Gulf and Levant) in addition to Standard Arabic. We extend previous work by incorporating additional grammatical and stylistic features and define a subtractive bivalency profiling approach to address issues of bivalent words across the examined Arabic dialects. The results show that our new methods can reach more than 97% and score well (66%) when tested on completely unseen data.

Week 5

Thursday 9th November 2017


Fylde D28

'God', 'nation' and 'family' vote for the impeachment of Dilma Rousseff: a corpus-based approach to discourse

Rozane Rebechi

Universidade Federal do Rio Grande do Sul

  • Abstract

Phraseological diversity is operationalized as root type-token ratio computed for each syntactic relation. Two methods are tested to approach phraseological sophistication. First, sophisticated word combinations are defined as academic collocations that appear in the Academic Collocation List (Ackermann & Chen, 2013). Second, it is approximated with the average pointwise mutual information score, as this measure has been shown to bring out word combinations made up of closely associated medium to low-frequency (i.e. advanced or sophisticated) words.
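The two quantitative measures mentioned here can be sketched briefly. Root type-token ratio is taken below in its usual form, the number of distinct items divided by the square root of the total number of items. Pointwise mutual information is computed from reference-corpus frequencies and then averaged over the word pairs found in a text. The function names, the shape of the frequency tables and the toy counts are all assumptions made for illustration.

```python
import math


def root_ttr(items):
    """Root type-token ratio: distinct items divided by sqrt(total items)."""
    return len(set(items)) / math.sqrt(len(items))


def pmi(pair_freq, freq_a, freq_b, corpus_size):
    """Pointwise mutual information (base 2) from raw reference-corpus frequencies."""
    return math.log2(pair_freq * corpus_size / (freq_a * freq_b))


def mean_pmi(pairs, pair_freqs, word_freqs, corpus_size):
    """Average PMI over the pairs that are attested in the reference corpus."""
    scores = [
        pmi(pair_freqs[(a, b)], word_freqs[a], word_freqs[b], corpus_size)
        for a, b in pairs
        if (a, b) in pair_freqs
    ]
    return sum(scores) / len(scores) if scores else None


# Invented adjective + noun pairs extracted from one learner text.
learner_pairs = [("major", "role"), ("big", "role"), ("major", "role"), ("key", "issue")]

# Invented reference-corpus counts.
pair_freqs = {("major", "role"): 8, ("key", "issue"): 4}
word_freqs = {"major": 16, "role": 32, "key": 16, "issue": 16}

diversity = root_ttr(learner_pairs)
sophistication = mean_pmi(learner_pairs, pair_freqs, word_freqs, corpus_size=1024)
```

Pairs missing from the reference corpus are skipped in this sketch. A real study would need an explicit policy for such unattested combinations.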

Week 4

Thursday 2nd November 2017


Cavendish Colloquium

Phraseological complexity in EFL learner writing across proficiency levels

Magali Paquot

Université Catholique de Louvain

  • Abstract

This presentation reports on the first results of a large-scale research programme that aims to define and circumscribe the construct of phraseological complexity and to theoretically and empirically demonstrate its relevance for second language theory (cf. Paquot, 2017). Within this broad agenda, the study has two main objectives. First, it investigates to what extent measures of phraseological complexity can be used to describe L2 performance at different proficiency levels in a corpus of linguistic term papers written by French EFL learners. Second, it compares measures of phraseological complexity with traditional measures of syntactic and lexical complexity. Variety and sophistication are postulated to be the first two dimensions of phraseological complexity, which is approached via relational co-occurrences, i.e. co-occurring words that appear in a specific structural or syntactic relation (e.g. adjective + noun, adverbial modifier + verb, verb + direct object).

Week 3

Thursday 26th October 2017


Management School LT9

Social change and discourse-semantic shifts in The Times (London): The 'at risk' construct in historical perspective

Jens Zinn

CASS, Lancaster University

  • Abstract

This presentation reports results from an interdisciplinary research project which utilises corpus linguistics tools to examine long term social changes in the print news media in the UK and Germany. It uses 'risk words' as an entry point to analyse the changing meaning of risk. The presentation focuses on the occurrence and discourse-semantic shifts of the 'at the risk', 'at risk' and 'at-risk' construct in the news coverage of the London Times during the late 19C to early 21C. Building on a number of text corpora of the London Times (all articles published in the volumes from 1870 until today) which are available at the Corpus Approaches to Social Sciences (CASS) research centre, it shows how 'at the risk', 'at risk' and 'at-risk' have become part of the cultural repertoire used in the print news media. The presentation explores how the construct developed in the context of broader social changes. It starts with the notion of 'at the risk', which was originally bound to particular technical issues in relation to shipping and trading until the early 19th Century. It became increasingly overtaken by the 'at risk' construct, which during the 20C became linked to economics and in the later decades was used to characterise particular social groups, such as babies, children, youth and women, as vulnerable. It is now common to describe or scandalise when decision makers 'put others at risk' rather than bearing the possible negative outcomes themselves.

Week 2

Thursday 19th October 2017


Hannaford Lab

#LancsBox: A new corpus tool for researchers, students and teachers

Vaclav Brezina & Matt Timperley

CASS, Lancaster University

  • Abstract

In this practical UCREL CRS session, we introduce #LancsBox v. 3 (just released), a software package for the analysis and visualisation of language data and corpora, which was developed at Lancaster University. #LancsBox can be used by linguists, language teachers, translators, historians, sociologists, educators and anyone interested in quantitative language analysis. It is free to use for non-commercial purposes and works with any major operating system. In this workshop, we introduce the main functionalities of #LancsBox, including tagging and searching corpora, building collocation graphs and visualising keywords. The workshop will take place in a computer lab, but please feel free to bring your own laptop.

Week 1

Thursday 12th October 2017


Management School LT9

Corpus and software resources available at Lancaster

Andrew Hardie¹ & Paul Rayson²

¹CASS, Lancaster University  ²SCC, Lancaster University

  • Abstract

This talk will provide a brief introduction to the UCREL research centre, and an overview of the corpus resources, software tools and infrastructure that are available for corpus linguistics and NLP researchers at Lancaster University. The talk will cover corpora of English and non-English varieties, and there will be brief descriptions of annotation, retrieval and other software. Two web-based systems (CQPweb and Wmatrix) will be briefly demonstrated. CQPweb is a corpus retrieval and analysis tool which provides fast access to a range of very large standard corpora. Wmatrix, on the other hand, allows uploading of your own English corpora, carries out tagging and provides key word and key domain analysis, plus frequency lists and concordancing.