Previous seminars

Seminars from previous years are still being added; the archive remains available on the old website.



Week 9

Thursday 7th December 2017


Charles Carter A17

Unsupervised Word Sense Disambiguation and Concept Extraction in Clinical and Biomedical Documents

Stéphan Tulkens

CLiPS, University of Antwerp

  • Abstract

The automated analysis of clinical text presents us with a set of unique challenges. First, clinical text differs from conventional textual domains, diminishing the performance of off-the-shelf resources which are not specifically developed with this domain in mind. As an example, clinical text is usually written in a relatively informal style, featuring abbreviations and idiomatic language use which might be specific to a given caregiver or hospital. Second, training data is sparse due to privacy constraints, which also makes it difficult to reuse annotated data and datasets in different projects, hampering progress.

In this talk, I will describe research on the extraction and disambiguation of concepts from patient notes and biomedical texts using unsupervised methods based on distributional semantics.

Specifically, we've shown that domain-specific distributional semantic vectors, when appropriately composed into higher-order context vectors, provide us with a sufficiently powerful instrument to be able to distinguish between multiple highly related senses in a biomedical Word Sense Disambiguation (WSD) task. Our unsupervised method obtains comparable performance to knowledge-based and supervised methods on the same task.
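
As a toy illustration of this general idea (not the speaker's actual system), composing context words into a higher-order vector and comparing it to candidate sense vectors by cosine similarity can be sketched as follows; all vectors, words and sense labels below are invented for the example:

```python
from math import sqrt

def compose(words, vectors):
    """Compose a higher-order context vector by averaging word vectors."""
    vecs = [vectors[w] for w in words if w in vectors]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

def disambiguate(context_words, sense_vectors, word_vectors):
    """Pick the sense whose vector is most similar to the composed context."""
    ctx = compose(context_words, word_vectors)
    return max(sense_vectors, key=lambda s: cosine(ctx, sense_vectors[s]))

# Invented 3-dimensional vectors for the ambiguous clinical term "discharge".
word_vectors = {
    "patient": [0.9, 0.1, 0.0],
    "home":    [0.8, 0.2, 0.1],
    "wound":   [0.1, 0.9, 0.1],
    "fluid":   [0.1, 0.8, 0.2],
}
sense_vectors = {
    "discharge_release":   [0.9, 0.1, 0.1],  # release from hospital
    "discharge_secretion": [0.1, 0.9, 0.1],  # bodily secretion
}

print(disambiguate(["patient", "home"], sense_vectors, word_vectors))
# prints "discharge_release"
```

The attraction of such a scheme for clinical text is that it needs no annotated data: only word vectors trained on in-domain text and some representation of each sense.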

Additionally, we apply similar distributional semantic methods to concept extraction on free clinical text. On this task, our approach is outperformed by supervised concept extraction on the same dataset, but significantly outperforms other unsupervised concept extraction methods.

Joint Data Science Group and UCREL talk.

Week 8

Thursday 30th November 2017


Management school LT9

Error analysis in a learner corpus of Spanish-speaking EFL learners: A corpus-based study

María Victoria Pardo

Universidad de Antioquia, Colombia

  • Abstract

This study was conducted to identify and analyse the most common errors in the written output of learners of English as a foreign language (EFL) at university level at Universidad del Norte in Barranquilla, Colombia. Students were placed in different levels according to the Common European Framework of Reference for Languages (CEFR) (Council of Europe, 2001). The data was collected using the methodology of corpus linguistics, through the compilation of a learner corpus of university students working towards their bachelor's degree. The analysis of the data was based on Corder's theory of Error Analysis (Corder, 1981) and followed the description of errors given by James (1998). Errors were tagged using the Louvain University tagger (Estelle et al., 2005) in order to obtain results comparable with similar work worldwide. The analysis finds that, among the eight error categories proposed by the Louvain tagger, the most recurrent are grammatical errors. This paper previews results from a study that is part of a doctoral thesis, whose final results will be presented in 2018.

Week 7

Thursday 23rd November 2017


Fylde D28

"It doesn't stop, it never, never stops, er, it doesn't stop evolving": Observing Spoken British English of the past 20 years through apparent and real-time evidence

Susan Reichelt

CASS, Lancaster University

  • Abstract

This presentation introduces secondary data analysis of the Spoken BNC, past and present. A sociolinguistic approach to carefully compiled subsets of the 1994 and 2014 corpora enables us to investigate language change both synchronically and diachronically.

Tracking language change as it happens is, according to Chambers (1995:147), "the most striking single accomplishment of contemporary linguistics". Starting with Labov's work on Martha's Vineyard (1963) and in New York City (1966), sociolinguists have established apparent-time studies as a convenient method for tracing changes in language use. Notwithstanding its appeal through relative ease of data collection, the apparent-time method comes with caveats, as "data cannot uncritically be assumed to represent diachronic linguistic developments" (Bailey, 2008:314). Real-time studies, though arguably more difficult to conduct, offer the researcher the opportunity to see how language has progressed in actuality. The compilation of socially balanced subsets of Spoken BNC from 1994 and 2014 provides us with resources for a double tracked approach, as "in the best of circumstances, of course, researchers will be able to combine apparent-time data with real-time evidence, with the relative strengths of one approach offsetting the weaknesses of the other" (Bailey, 2008:330).

This talk presents some theoretical musings on age (and time more generally) as a sociolinguistic factor as well as first findings from the secondary data analysis project of the Spoken BNC. Intensifiers, some of the most well-researched discourse devices (e.g. Stenström et al., 2002; Ito and Tagliamonte, 2003; Tagliamonte and Roberts, 2005; Macaulay, 2006; Rickford et al., 2007; Tagliamonte, 2008; Barnfield and Buchstaller, 2010), provide an exemplary backdrop against which I will highlight ongoing language change and ways to observe it.

Week 6

Thursday 16th November 2017


Faraday SR4

Arabic Dialect Identification in the Context of Bivalency and Code-Switching

Mahmoud El-Haj1 & Mariam Aboelezz2

1SCC, Lancaster University  2British Library

  • Abstract

In this work we present a novel approach to Arabic dialect identification that uses language bivalency and written code-switching. Bivalency between languages or dialects is where a word or element is treated by language users as having a fundamentally similar semantic content in more than one language or dialect. Arabic dialect identification in writing is a difficult task even for humans, because words are used interchangeably between dialects. The task of automatically identifying dialect is harder still, and classifiers trained using only n-grams perform poorly when tested on unseen data. Such approaches also require significant amounts of annotated training data, which is costly and time-consuming to produce. Currently available Arabic dialect datasets do not exceed a few hundred thousand sentences, so we need to extract features other than word and character n-grams. In our work we present experimental results from automatically identifying dialects from the four main Arabic dialect regions (Egypt, North Africa, Gulf and Levant) in addition to Standard Arabic. We extend previous work by incorporating additional grammatical and stylistic features, and we define a subtractive bivalency profiling approach to address issues of bivalent words across the examined Arabic dialects. The results show that our new methods reach scores of more than 97%, and perform well (66%) when tested on completely unseen data.
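
The intuition behind subtractive bivalency profiling can be sketched roughly as follows; the function names and the toy transliterated wordlists are our own invention for illustration, not the authors' implementation. Words shared by several dialect vocabularies (bivalent words) carry little discriminative signal, so they are subtracted before profiling a text:

```python
def bivalent_words(dialect_vocab):
    """Words that occur in more than one dialect's vocabulary."""
    counts = {}
    for vocab in dialect_vocab.values():
        for word in set(vocab):
            counts[word] = counts.get(word, 0) + 1
    return {word for word, n in counts.items() if n > 1}

def subtractive_profile(tokens, dialect_vocab):
    """Drop bivalent words, keeping only dialect-discriminative tokens."""
    shared = bivalent_words(dialect_vocab)
    return [t for t in tokens if t not in shared]

# Toy transliterated wordlists (invented for illustration).
dialect_vocab = {
    "egyptian": ["ezayak", "keteer", "delwa2ty"],
    "gulf":     ["shlonak", "keteer", "alheen"],
}

print(subtractive_profile(["ezayak", "keteer"], dialect_vocab))
# "keteer" is bivalent, so only "ezayak" remains: ['ezayak']
```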

Week 5

Thursday 9th November 2017


Fylde D28

'God', 'nation' and 'family' vote for the impeachment of Dilma Rousseff: a corpus-based approach to discourse

Rozane Rebechi

Universidade Federal do Rio Grande do Sul

  • Abstract

Week 4

Thursday 2nd November 2017


Cavendish Colloquium

Phraseological complexity in EFL learner writing across proficiency levels

Magali Paquot

Université Catholique de Louvain

  • Abstract

This presentation reports on the first results of a large-scale research programme that aims to define and circumscribe the construct of phraseological complexity and to demonstrate, both theoretically and empirically, its relevance for second language theory (cf. Paquot, 2017). Within this broad agenda, the study has two main objectives. First, it investigates to what extent measures of phraseological complexity can be used to describe L2 performance at different proficiency levels in a corpus of linguistic term papers written by French EFL learners. Second, it compares measures of phraseological complexity with traditional measures of syntactic and lexical complexity. Variety and sophistication are postulated to be the first two dimensions of phraseological complexity, which is approached via relational co-occurrences, i.e. co-occurring words that appear in a specific structural or syntactic relation (e.g. adjective + noun, adverbial modifier + verb, verb + direct object).

Phraseological diversity is operationalized as a root type-token ratio computed for each syntactic relation. Two methods are tested to approach phraseological sophistication. First, sophisticated word combinations are defined as academic collocations that appear in the Academic Collocation List (Ackermann & Chen, 2013). Second, sophistication is approximated with the average pointwise mutual information score, as this measure has been shown to bring out word combinations made up of closely associated medium- to low-frequency (i.e. advanced or sophisticated) words.
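
Both measures of phraseological diversity and sophistication mentioned here reduce to simple corpus counts. As a minimal sketch (the function names and toy counts are ours, not the speaker's code), root type-token ratio and pointwise mutual information can be computed as follows:

```python
from math import log2, sqrt

def root_ttr(tokens):
    """Root type-token ratio: number of types / sqrt(number of tokens)."""
    return len(set(tokens)) / sqrt(len(tokens))

def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information of a word pair, from raw corpus counts."""
    p_pair = pair_count / total
    p_w1 = w1_count / total
    p_w2 = w2_count / total
    return log2(p_pair / (p_w1 * p_w2))

# 4 tokens, 3 types ("strong" repeats): RTTR = 3 / sqrt(4) = 1.5
print(root_ttr(["strong", "tea", "strong", "argument"]))  # 1.5

# "strong tea" occurring 10 times among 400 pairs, each word 20 times:
print(round(pmi(10, 20, 20, 400), 2))  # 3.32
```

PMI rewards pairs that co-occur far more often than their individual frequencies would predict, which is why it surfaces closely associated medium- to low-frequency combinations.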

Week 3

Thursday 26th October 2017


Management school LT9

Social change and discourse-semantic shifts in The Times (London): The 'at risk' construct in historical perspective

Jens Zinn

CASS, Lancaster University

  • Abstract

This presentation reports results from an interdisciplinary research project which utilises corpus linguistics tools to examine long-term social changes in the print news media in the UK and Germany. It uses 'risk words' as an entry point to analyse the changing meaning of risk. The presentation focuses on the occurrence of, and discourse-semantic shifts in, the 'at the risk', 'at risk' and 'at-risk' constructs in the news coverage of the London Times from the late 19th century to the early 21st century. Building on a number of text corpora of the London Times (all articles published in the volumes from 1870 until today) which are available at the Corpus Approaches to Social Science (CASS) research centre, it shows how 'at the risk', 'at risk' and 'at-risk' have become part of the cultural repertoire used in the print news media. The presentation explores how the construct developed in the context of broader social changes. It starts with the notion of 'at the risk', which was originally bound to particular technical issues in relation to shipping and trading until the early 19th century. It was increasingly superseded by the 'at risk' construct, which during the 20th century became linked to economics and, in the later decades, was used to characterise particular social groups as vulnerable, such as babies, children, youth and women. It is now common to describe, or scandalise, decision makers who 'put others at risk' rather than bearing the possible negative outcomes themselves.

Week 2

Thursday 19th October 2017


Hannaford Lab

#LancsBox: A new corpus tool for researchers, students and teachers

Vaclav Brezina & Matt Timperley

CASS, Lancaster University

  • Abstract

In this practical UCREL CRS session, we introduce #LancsBox v. 3 (just released), a software package for the analysis and visualisation of language data and corpora, which was developed at Lancaster University. #LancsBox can be used by linguists, language teachers, translators, historians, sociologists, educators and anyone interested in quantitative language analysis. It is free to use for non-commercial purposes and works with any major operating system. In this workshop, we introduce the main functionalities of #LancsBox, including tagging and searching corpora, building collocation graphs and visualising keywords. The workshop will take place in a computer lab, but please feel free to bring your own laptop.

Week 1

Thursday 12th October 2017


Management school LT9

Corpus and software resources available at Lancaster

Andrew Hardie1 & Paul Rayson2

1CASS, Lancaster University  2SCC, Lancaster University

  • Abstract

This talk will provide a brief introduction to the UCREL research centre, and an overview of the corpus resources, software tools and infrastructure that are available to corpus linguistics and NLP researchers at Lancaster University. The talk will cover corpora of English and non-English varieties, and there will be brief descriptions of annotation, retrieval and other software. Two web-based systems (CQPweb and Wmatrix) will be briefly demonstrated. CQPweb is a corpus retrieval and analysis tool which provides fast access to a range of very large standard corpora. Wmatrix, on the other hand, allows you to upload your own English corpora, carries out tagging, and provides key word and key domain analysis, plus frequency lists and concordancing.