Previous seminars

Seminars from previous years are still being added; the archive remains available on the old website.


Week 23

Thursday 13th June 2019


Fylde LT 2

Visualizing Dispersion: a new tool for Corpus Linguistics software

Andressa Gomide

CASS, Lancaster University

  • Abstract

Measuring word distribution in a corpus is a very important, yet underused technique in Corpus Linguistics (CL). Although recent work has highlighted and discussed the relevance of dispersion measures (DM) when analysing a corpus (e.g. Biber et al. 2016), their presence in research is still very limited (Gries 2008).
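As an illustration of what a dispersion measure computes, the sketch below implements two measures discussed in the cited literature: Juilland's D (the measure examined by Biber et al. 2016) and DP, the deviation of proportions proposed by Gries (2008). It is not the code of the tool presented in this talk; the function names and the example frequencies are assumptions made for illustration only.

```python
from math import sqrt


def juilland_d(part_freqs):
    """Juilland's D for a word counted in n equally sized corpus parts.

    D = 1 - V / sqrt(n - 1), where V is the coefficient of variation
    (population standard deviation divided by the mean) of the counts.
    Values near 1 mean an even spread; values near 0 mean a clumped one.
    """
    n = len(part_freqs)
    if n < 2:
        raise ValueError("at least two corpus parts are required")
    mean = sum(part_freqs) / n
    if mean == 0:
        raise ValueError("the word does not occur in the corpus")
    variance = sum((f - mean) ** 2 for f in part_freqs) / n
    return 1 - (sqrt(variance) / mean) / sqrt(n - 1)


def gries_dp(part_freqs, part_sizes):
    """DP (deviation of proportions): half the summed absolute difference
    between each part's share of the word's occurrences and that part's
    share of the corpus. 0 means a perfectly even spread.
    """
    total_freq = sum(part_freqs)
    total_size = sum(part_sizes)
    if total_freq == 0 or total_size == 0:
        raise ValueError("frequencies and part sizes must not sum to zero")
    return 0.5 * sum(
        abs(f / total_freq - s / total_size)
        for f, s in zip(part_freqs, part_sizes)
    )


# Hypothetical counts of one word in four equally sized corpus parts.
even_counts = [10, 10, 10, 10]
clumped_counts = [40, 0, 0, 0]
sizes = [1000, 1000, 1000, 1000]

for counts in (even_counts, clumped_counts):
    print(counts, juilland_d(counts), gries_dp(counts, sizes))
```

Both words occur 40 times in total, so frequency alone cannot tell them apart; the dispersion values do.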

This talk presents the development and implementation of a CL tool designed to help users understand dispersion measures and apply them in research. This tool was created as part of a project which aims to enhance the experience of users with CL software through the creation of graphical data visualisation.

In this presentation, I will outline the steps taken to achieve the final tool and give a software demonstration.

The development of the tool consisted of three steps: (a) identifying the target audience and understanding their needs; (b) developing and implementing the visualization; and (c) assessing the newly developed tool with users. User needs were assessed via (a) a literature investigation into papers reporting corpus-based methods and (b) a contextual design approach (Beyer and Holtzblatt 1998), allowing observation of how users interact with CL software in their own environment. Key issues for a successful data visualization, such as its functionality, aesthetics and accuracy (Cairo 2016), were also considered. The new functionality was implemented in CQPweb (Hardie 2012), an open-source piece of software for corpus linguistic analysis. Finally, a user assessment was conducted so that final adjustments could fit the system more closely to the users' needs.

Towards the end of this presentation, I will offer a demonstration of the dispersion tool using the Sydney Corpus of Television Dialogue, a recently-launched corpus by Bednarek (2018).


Bednarek, M. (2018) Language and Television Series. A Linguistic Approach to TV Dialogue. Cambridge: Cambridge University Press.

Beyer, H. & Holtzblatt, K. (1998). Contextual Design: Defining Customer-Centered Systems. San Francisco: Morgan Kaufmann. ISBN 1-55860-411-1

Biber, D., Reppen, R., Schnur, E., & Ghanem, R. (2016). "On the (Non)Utility of Juilland's D to Measure Lexical Dispersion in Large Corpora." International Journal of Corpus Linguistics 21 (4):439-64.

Cairo, A. (2016). The truthful art: data, charts, and maps for communication. New Riders.

Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403-437.

Hardie, A. (2012). CQPweb — combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3), 380-409.

Week 26

Thursday 30th May 2019


Charles Carter A15

Super-infrastructure for Biomedical Text Mining

Mahmoud El-Haj & Nathan Rutherford

SCC, Lancaster University

  • Abstract

In this talk we describe a novel super-infrastructure for biomedical text mining which incorporates an end-to-end pipeline for the collection, annotation, storage, retrieval and analysis of biomedical and life sciences literature, combining NLP and corpus linguistics methods. The infrastructure permits extreme-scale research on the open access PubMed Central archive. It combines an updatable Gene Ontology Semantic Tagger (GOST) for entity identification and semantic markup in the literature, with an NLP pipeline scheduler (Buster) to collect and process the corpus, and a bespoke columnar corpus database (LexiDB) for indexing. The corpus database is distributed to permit fast indexing, and provides a simple web front-end with corpus linguistics methods for sub-corpus comparison and retrieval. GOST is also connected as a service in the Language Application (LAPPS) Grid, in which context it is interoperable with other NLP tools and data in the Grid and can be combined with them in more complex workflows. In a literature-based discovery setting, we have created an annotated corpus of 9,776 papers with 5,481,543 words.

Week 25

Thursday 16th May 2019


CHC - Charles Carter A15

Exploring and classifying the Arabic copula and auxiliary kāna via enhanced part-of-speech tagging

Andrew Hardie (1) & Wessam Ibrahim (2)

(1) CASS, Lancaster University  (2) Tanta University

  • Abstract

Arabic syntax is understudied relative to the language's famously complex morphology - both generally and from a corpus-based perspective. The copula kāna, 'be', functions additionally as an auxiliary, creating periphrastic tense-aspect constructions; but the literature on these functions of kāna is far from exhaustive. To analyse kāna within the million-word Leeds Corpus of Contemporary Arabic, part-of-speech tagging (via a newly-enhanced system) is applied to disambiguate copula and auxiliary at a high rate of accuracy. Concordances of both are extracted, and 10% samples of each (499 instances of copula kāna, 387 of auxiliary kāna) are manually analysed to identify surface-level grammatical patterns and meanings. This raw analysis is then systematised according to the more general patterns' main parameters of variation; special descriptions are developed for specific apparently fixed-form expressions (including two phraseologies which afford expression of verbal and adjectival modality). Overall, substantial new detail, not mentioned in existing reference grammars, is discovered (e.g. the great predominance of the past imperfect construction over other uses of auxiliary kāna); there exists notable potential for these corpus-based findings to inform and enhance not only grammatical descriptions, but also pedagogy of Arabic as a first or second/foreign language.

Week 23

Thursday 9th May 2019


B78 (DSI Space) InfoLab21

Beyond word vectors: adding structure to vector representations of meaning

Steven Schockaert

Cardiff University

  • Abstract

While the use of word vectors is now standard in natural language processing, from a knowledge representation point of view such vectors have important shortcomings. In this talk, I will discuss two central issues with word vectors, and outline possible strategies for addressing them.

First, word vectors are inherently limited in how they can express relational information. While vector difference based approaches have been found to successfully model some types of relations, such approaches are provably limited in the kinds of relationships they can capture. As an alternative, in our recent work, we have proposed to combine word vectors with relation vectors, where the latter are explicitly learned to capture how different pairs of words are related.
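The vector-difference approach mentioned above can be sketched as follows. The three-dimensional vectors are hand-made toy values, not trained embeddings, and the helper names are assumptions; the sketch shows only the baseline idea that a relation is read off as the offset between two word vectors, not the relation-vector method proposed by the speaker.

```python
from math import sqrt

# Toy "word vectors", invented for illustration only.
VECTORS = {
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
}


def offset(source, target):
    """Relation between two words modelled as the difference of their vectors."""
    return [t - s for s, t in zip(VECTORS[source], VECTORS[target])]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two pairs that stand in the same relation yield nearly parallel offsets.
same_relation = cosine(offset("man", "woman"), offset("king", "queen"))
# A pair standing in a different relation yields a dissimilar offset.
other_relation = cosine(offset("man", "woman"), offset("man", "king"))

print(same_relation, other_relation)
```

A single offset per relation works in this toy case because the vectors were built that way; the abstract's point is that real relations often cannot be captured by one difference vector.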

Second, while representing entities as vectors seems natural, in many applications concepts (or categories) also play a central role. Such concepts formally correspond to sets of entities, hence they are more naturally represented as regions in a vector space. Estimating regions in high-dimensional spaces is challenging, however, which may explain why this topic has not yet received much attention. In our recent work, we have proposed a solution based on Bayesian estimation of Gaussian densities, where prior information about the semantic relationships between different concepts is exploited to learn more faithful concept representations.

Joint Data Science Group and UCREL talk

Week 22

Thursday 2nd May 2019


Charles Carter A15

A new tagset for morphological analysis of Indonesian texts


CASS, Lancaster University

  • Abstract

This project proposes a new tagset for a Morphological Analyser (MA) of Indonesian currently under development, which will be implemented in Nooj (Silberztein, 2004). Indonesian is a variety of Malay, a member of the Austronesian language family. It is the national and official language of Indonesia. According to the Ethnologue (Lewis et al., 2009), Indonesian has more than 200 million speakers. The purpose of this tagset is that, when it is applied to the full text of an Indonesian corpus, users of that corpus will be able to perform queries in Nooj based on morphological criteria as well as raw word forms. The tagset is distinct from the existing state-of-the-art tagset used by Morphind, an MA for Indonesian (Larasati et al., 2011), and from Pisceldo et al.'s MA tagset (2008). The new tagset provides annotations for affixes, clitics, particles, reduplications, morphophonemics, and functional categories such as passive voice, reciprocal voice, and agentive and instrumental nouns. A large portion of the scheme is dedicated to affixation, which is the most productive word formation device in Indonesian. Prentice (1987) mentions that word formation in Indonesian is a combination of syntactic and semantic factors. Taking this view into account, the new scheme allows for some ambiguities to be resolved later by syntactic and semantic annotations.

Thursday 25th April 2019


Charles Carter A15

Lexicogrammar: Lexical Grammar or Construction Grammar? Two corpus-based case studies

Costas Gabrielatos

Edge Hill University

Week 21

Thursday 28th March 2019


Management School LT 12

Corpus methods and multimodal data: A new approach

William Dance

LAEL, Lancaster University

  • Abstract

Within corpus linguistics, multimodality is a subject which is often overlooked. While there are multiple projects tackling multimodal interactional elements in corpora, such as the French interaction corpus RECOLA and the video meeting repository REPERE, corpus linguistic approaches generally tend to struggle when faced with extra-textual content such as images. Until now, the only viable approach to including such content in a corpus has been manual image annotation, but such an approach runs into two overarching issues.

First, as Fanelli et al. (2010) note, visual modality is the most labour-intensive form of multimodal corpus annotation when performed by traditional methods 'by hand'. Second, multimodal corpora are often limited in terms of scope and remain "domain specific, mono-lingual [...] and/or of a specialist nature" (Knight, 2011, p. 397) in order to reduce variables and make corpus construction less complex. This new approach redresses both these issues as it automates the annotation process and consequently widens the scope so that studies can be extended to millions of images.

To interpret images, we utilise the Google Cloud Vision service, which allows images to be automatically annotated by machine learning algorithms. Of the available annotation types, we use two in particular: 'label annotation' and 'web detection'. The former provides general annotations ("speaker"; "public"; "television") while the latter provides specific content labels ("Hillary Clinton"; "Presidential Election"; "DNC").

This approach was created in response to Twitter's recently released elections integrity dataset. In October 2018, Twitter released a massive trove of data containing all communications from and between accounts believed to be connected to the Russian organisation known as the Internet Research Agency (IRA). The dataset (hereafter 'T-IRA') contains over 9 million tweets and 1.7 million images, GIFs and videos, rendering traditional corpus linguistic and multi-modal methods ineffective. This necessitated a new form of combined visual and textual analysis which can efficiently encode images and text to create fully integrated multimodal corpora.

This talk will comprise two main discussions. The first will be a demonstration of the method, reflecting on its reliability, consistency and other methodological implications. The second will discuss the results from a pilot study of the T-IRA dataset, making use of critical approaches to multimodal discourse analysis (Kress, 2011) to help shed light on how hostile state information operations are carried out on social media.


Fanelli, G., Gall, J., Romsdorfer, H., Weise, T., & van Gool, L. (2010). 3D vision technology for capturing multimodal corpora: chances and challenges. LREC Workshop on Multimodal Corpora (pp. 70-73). Valletta: European Language Resources Association (ELRA).

Giraudel, A., Carré, M., Mapelli, V., Kahn, J., Galibert, O., & Quintard, L. (2012, May). The REPERE Corpus: a multimodal corpus for person recognition. In LREC (pp. 1102-1107).

Knight, D. (2011). The future of multimodal corpora. Revista brasileira de linguistica aplicada, 391-415.

Kress, G. (2011). Multimodal discourse analysis. In J. P. Gee, & M. Handford, The Routledge Handbook of Discourse Analysis (pp. 35-50). Abingdon: Routledge.

Ringeval, F., Sonderegger, A., Sauer, J., & Lalanne, D. (2013, April). Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG) (pp. 1-8). IEEE.

Week 20

Friday 22nd March 2019


Infolab C60b/c

Imitation learning, zero-shot learning and automated fact checking

Andreas Vlachos

University of Cambridge

  • Abstract

In this talk I will give an overview of my research in machine learning for natural language processing. I will begin by introducing my work on imitation learning, a machine learning paradigm I have used to develop novel algorithms for structure prediction that have been applied successfully to a number of tasks such as semantic parsing, natural language generation and information extraction. Key advantages are the ability to handle large output search spaces and to learn with non-decomposable loss functions. Following this, I will discuss my work on zero-shot learning using neural networks, which enabled us to learn models that can predict labels for which no data was observed during training. I will conclude with my work on automated fact-checking, a challenge we proposed in order to stimulate progress in machine learning, natural language processing and, more broadly, artificial intelligence.

Joint UCREL and DSG

Week 19

Wednesday 13th March 2019


County South C89

Evaluating the effect of data-driven learning (DDL) on the acquisition of academic collocations by Chinese learners of English

Tanjun Liu

CASS, Lancaster University

  • Abstract

Collocations, prefabricated multi-word combinations, are considered to be a crucial component of language competence and also a challenge to L2 learners at different proficiency levels. This study focuses on the evaluation of a specific pedagogical approach to teaching collocations, the corpus-based data-driven learning approach (DDL), which has been argued to offer an effective teaching method in language learning. However, large-scale quantitative studies that evaluate the effectiveness of DDL for the acquisition of academic collocations against a different method of teaching collocations remain limited in number (Boulton & Cobb, 2017).

This study, therefore, uses data from 100 Chinese students of English from a Chinese university and employs a quasi-experimental method, using a pre-test-and-post-test (including delayed test) control-group research design to compare the outcomes of using DDL and an online dictionary in teaching academic collocations to Chinese EFL learners. One experimental group uses #LancsBox (Brezina, McEnery & Wattam, 2015), an innovative and user-friendly corpus tool. The other experimental group uses the online version of the Oxford Collocations Dictionary. The results are analysed for the differences in collocation gains within and between the two groups. These quantitative data are supported by findings from semi-structured interviews linking learners' results with their attitudes towards DDL. The findings contribute to our understanding of the effectiveness of DDL for teaching academic collocations and suggest that the incorporation of technology into language learning can enhance collocation knowledge.

References:

Boulton, A., & Cobb, T. (2017). Corpus use in language learning: A meta‐analysis. Language Learning, 67(2), 348-393.

Brezina, V., McEnery, T., & Wattam, S. (2015). Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2), 139-173.

Week 18

Thursday 7th March 2019



Misogyny and The Red Pill (MANTRaP)

Frazer Heritage & Alex Krendel

LAEL, Lancaster University

  • Abstract

In this presentation, we introduce current research into various online communities which are characterised by online misogyny and anti-feminism. Although these communities are loosely networked, they form a nexus on the Reddit platform. This presentation will be split into two different sections, further details of which are listed below. The first talk will examine the representation of gendered social actors in the Incel (involuntary celibate) community, while the second uses corpus approaches in tandem with appraisal theory to explore how gendered social actors are represented in The Red Pill subreddit.

Exploring the representation of the gendered 'other' in Reddit's 'Incel' community

- Frazer Heritage

This paper presents a study of the online discussion forum Reddit, specifically the r/braincels subreddit. Posters within this sub-reddit identify as involuntary celibates or 'incels'. Incels are an online imagined community of (typically heterosexual) men who wish to, but do not, have sexual relations with women, seeing women as the cause of their problems. Incels are a group who are explicitly marked for their sexuality, their lack of sexual interactions, and the ideologies which are mutually constitutive with this lack of sexual interaction. In this paper, we take a small but representative corpus of 65,000 words generated from 50 threads created and commented on by incels. We analyse word frequencies, collocations and concordance lines to explore the representation of gendered social actors. Preliminary findings show that the most frequent terms for 'women' are not pejorative and that male social actors are referred to in the corpus with only slightly lower frequencies. However, we also observe a pervasive generalisation of these gendered social actors, which is indicative of how the members of this online community create, maintain and reinforce problematic views of gender and sexuality. We also explore how women and certain men are constructed as an 'outgroup' who are partly responsible for incels failing to engage in sexual interaction. We then discuss how incels position themselves with regard to social status and social capital and argue that incels view the type of masculinity they perform as marginalised.

"Hypergamy: a woman's inability to love unconditionally like men can love (and dogs)": masculinity, femininity, and sexuality across the Reddit "manosphere"

- Alexandra Krendel

Over the past five years, a number of men have committed violent acts against women in the name of sexual entitlement and misogyny, after voicing their hatred of women online. To explore this social problem, this research analyses three sub-groups of The Red Pill community, which has approximately 300,000 frequent users on the online discussion forum Reddit. The Red Pill is part of the "manosphere", an online community for mostly white, heterosexual men whose identity is constructed in opposition to feminist ideals. This research adds to a growing body of literature (Schmitz and Kazyak 2016, Ging 2017) which has conducted thematic analyses of sections of the "manosphere", but has not yet applied a corpus linguistic approach to explore how different social actor roles are allocated to men and women. For this research, I took three sub-corpora of approximately 70,000 words each from the men going their own way, men's rights activists and Red Pill theory sections, to examine how attitudes to gender and sexuality vary across the community. The keywords and collocates of man, woman and girl were analysed to determine how each sub-community varied thematically, and then concordance lines were investigated to more qualitatively analyse these results. Although all communities generalised women as being naturally incapable of loyalty in relationships, and resented women for being social agents, each sub-section of The Red Pill conceptualised sexual relationships with women in subtly different ways.

Week 17

Thursday 28th February 2019



Corpus analysis or close reading? A case study on press representations of obesity

Paul Baker

CASS, Lancaster University

  • Abstract

This talk is based on an ESRC-funded research project currently running at Lancaster University which aims to examine newspaper representations around obesity. Media reporting of obesity has been criticised in academic research as alarmist and uncritical (Holland et al 2011) and is perceived by obese people as portraying them as freaks and enemies of society who are rarely given a voice unless successfully losing weight, which Couch et al (2015) argue is a form of 'synoptical' social control.

Taking five years of newspaper data from the Daily Mail, I examine representations but also consider methods of analysis, comparing a traditional 'close reading' method with one associated with corpus assisted discourse studies.

Four sampling techniques were used in order to identify sets of 10 articles for a 'close reading'. These were 1) sampling articles from the week in which the highest number of articles were published, 2) sampling articles that contain the highest number of references to obesity, 3) random sampling and 4) sampling based on the tool ProtAnt, which ranks the proto-typicality of articles based on the number of keywords found in them. The close reading considered phenomena such as quotation patterns, narrative structure, argumentation strategies and fallacies as well as lexical choice, grammatical relationships and metaphor.
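The fourth sampling technique, ranking articles by the number of keywords found in them, can be sketched as below. This is a simplified illustration of the idea only; it is not ProtAnt's implementation, and the keyword list, the article texts and the function names are all invented for the example. In practice the keywords would first be derived by comparing the corpus against a reference corpus.

```python
import re


def tokenize(text):
    """Lower-case a text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def rank_by_keywords(articles, keywords):
    """Rank articles by how many distinct keywords each one contains.

    `articles` maps an article name to its text. Ties are broken
    alphabetically by name so that the ranking is deterministic.
    """
    keyword_set = {k.lower() for k in keywords}
    scored = []
    for name, text in articles.items():
        found = keyword_set & set(tokenize(text))
        scored.append((name, len(found)))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical keywords and miniature "articles".
keywords = ["obesity", "obese", "diet"]
articles = {
    "article_1": "Obesity crisis: obese children face diet warnings",
    "article_2": "Weather stays mild this weekend",
    "article_3": "New diet advice published",
}

ranking = rank_by_keywords(articles, keywords)
print(ranking)
```

The top-ranked articles would then form the sample passed on for close reading.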

For the corpus analysis, collocates of the terms obese and obesity were identified, grouped into semantic categories, and then concordance lines of a range of collocates taken from different categories were analysed. To ensure a degree of comparability across the different analytical conditions, the same amounts of time were spent on each form of analysis.
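Identifying collocates of the terms obese and obesity can be illustrated with a simple window-based count. The abstract does not say which window size or association statistic was used, so the sketch below makes assumptions: a symmetric window and raw co-occurrence counts, with an invented example sentence and function name.

```python
import re
from collections import Counter


def collocates(text, node_words, window=5):
    """Count words occurring within `window` tokens of any node word.

    Occurrences of the node words themselves are not counted as collocates.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    nodes = {w.lower() for w in node_words}
    counts = Counter()
    for i, token in enumerate(tokens):
        if token in nodes:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(t for t in left + right if t not in nodes)
    return counts


# Hypothetical text; a real analysis would run over the newspaper corpus.
text = "Obesity epidemic warning as obesity epidemic grows"
counts = collocates(text, ["obese", "obesity"], window=1)
print(counts.most_common())
```

In a real study the raw counts would normally be filtered with an association measure before the collocates are grouped into semantic categories.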

After the analyses were carried out, a meta-analysis compared the findings elicited by the different techniques in order to identify the extent to which they overlap or give dissonant results. Rather than attempting to judge which approach was the most successful, the paper ends with a more reflective discussion of their strengths and weaknesses and makes suggestions for how they can be combined in order to complement one another, following Baker and Levon (2015).


Baker, P. and Levon, E. (2015) 'Picking the right cherries?: a comparison of corpus-based and qualitative analyses of news articles about masculinity.' Discourse and Communication 9(2): 221-236.

Couch, D., Thomas, S. L., Lewis, S., Blood, R. W. and Komesaroff, P. (2015) Obese adult's perception of news reporting on obesity. The panopticon and synopticon at work. Sage Open 5(4) 2158244015612522.

Holland K., Blood R. W., Thomas S. L., Lewis S., Komesaroff P. A. and Castle D. J. (2011). "Our girth is plain to see": An analysis of newspaper coverage of Australia's future "Fat Bomb." Health, Risk & Society, 13, 31-46.

Week 16

Thursday 21st February 2019



'The mythological marauding violent schizophrenic': using the word sketch tool to examine collocates of SCHIZOPHRENIC (n.) relating to dangerousness in the UK press

James Balfour

LAEL, Lancaster University

  • Abstract

Schizophrenia is much more common than we think, with roughly 1 in 100 people diagnosed with schizophrenia in the U.K. (Frith and Johnstone, 2003). That said, the press and other media repeatedly mischaracterise people with schizophrenia as dangerous criminals (The Schizophrenia Commission Report, 2012; Clement and Foster, 2008), despite evidence to the contrary. Indeed, statistical evidence shows that people with schizophrenia are not significantly more likely to commit violent crimes than the general population (Fazel and Grann, 2006). Rather, a study carried out in the U.S. showed that people with schizophrenia are 14 times more likely to be the victims of violent crime than the perpetrators of it (Brekke et al, 2001).

In this talk, I focus on some preliminary findings from a corpus of British news articles that refer to schizophrenia published between 2000 and 2015 in 9 national newspapers. In particular, I focus on lexicogrammatical patterns around the lemma SCHIZOPHRENIC (n.) using Sketch Engine's word sketch tool (Kilgarriff et al, 2014) and examine the ways in which certain patterns implicitly characterise people with schizophrenia as violent and a threat to others. In doing so, I reflect on the link between these representations and 'news values' (Galtung and Ruge, 1965; Jewkes, 2015), and assess the role they play in sustaining dominant misconceptions.

Brekke, J. S. (2001). Risks for Individuals With Schizophrenia Who Are Living in the Community. (Psychiatric Services).(Abstract). JAMA, The Journal of the American Medical Association, 286(23), 2922.

Fazel, S., & Grann, M. (2006). The population impact of severe mental illness on violent crime. American Journal of Psychiatry, 163(8), 1397. doi:10.1176/ajp.2006.163.8.1397

Frith, C. D. and Johnstone, E. (2003). Schizophrenia: a very short introduction. Oxford: Oxford University Press.

Jewkes, Y. (2015). Media and crime (3rd edition). London: Sage.

Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., Suchomel, V. (2014). The Sketch Engine: ten years on. Lexicography, 1, 7-36.

The Schizophrenia Commission (2012). The abandoned illness: a report from the Schizophrenia Commission. London: Rethink Mental Illness.

Week 15

Thursday 14th February 2019



Usage fluctuation analysis: A new way of analysing shifts in historical discourse

Tony McEnery

CASS, Lancaster University

  • Abstract

Words change their usage over time. This profoundly simple and accessible fact about language is part of our everyday shared experience; if we reflect on the pre-internet meanings of words such as web, network and tweet, usage change is instantly apparent. Lexical changes are often prompted by changes in society, culture and technology requiring new naming strategies for new or modified concepts. This talk introduces a methodology for the diachronic analysis of large historical corpora that looks at the fluctuation of word usage manifested through collocation, that is, the co-occurrence of words in texts. In essence, this technique is theory neutral and does not presuppose commitment to any one semantic theory. Instead, it helps to accurately describe large amounts of evidence about word usage, in different contexts, that are available in historical corpora. This talk first addresses the issue of diachronic change in word usage and meaning, after which I will present the new technique, Usage Fluctuation Analysis (UFA), detailing some guidelines for the interpretation of the results of the analysis. I will then present three short case studies by applying the technique to three words related to social actors in seventeenth-century Britain - whore, harlot and banker. These case studies will demonstrate the value of the technique by relating the observations to corpus and historical analyses carried out manually (as a validation of the technique) as well as by showing novel observations that the technique affords and that were not previously available.

Week 14

Thursday 7th February 2019



Metaphors and narratives of climate change across genres and discourse communities: A corpus-based comparison

Elena Semino (1) & Alice Deignan (2)

(1) LAEL, Lancaster University  (2) University of Leeds

  • Abstract

Climate change is one of the world's most urgent issues. Young people, in particular, are likely to be affected in their lifetimes, and will also influence future developments through their lifestyle choices and decisions as citizens. As with other scientific topics, however, knowledge about climate change is mediated through discourse. In this paper, we consider the use of metaphors and narratives to represent climate change in texts for and by members of different discourse communities: scientists, school teachers and secondary school students. We present the results of a comparative analysis of three corpora, consisting of: (1) Academic articles about climate science (approximately 500,000 words); (2) Educational materials for secondary school students in the UK (approximately 250,000 words); (3) Interviews with secondary school students in the North of England (approximately 90,000 words). The three corpora were compared using a combination of corpus linguistic techniques, including the analysis of word frequency lists, key semantic domains and collocational patterns. We will focus particularly on the findings that relate to how the students' use of metaphors and narratives for climate change contrasts with the metaphors and narratives used in the other two corpora.

Week 13

Thursday 31st January 2019


Management School LT 11

The Hansard at Huddersfield project

Alexander Von Lunen & Hugo Sanjurjo Gonzalez

University of Huddersfield

  • Abstract

The AHRC-funded Hansard at Huddersfield project is following up on the recent SAMUELS project, which semantically tagged the Hansard Corpus. The Hansard Corpus is the collection of UK House of Commons and Lords debates from 1803 to 2005, and the SAMUELS project tagged this corpus with grammatical and semantic annotation based on the Historical Thesaurus Semantic Tagger.

The main goal of the Hansard at Huddersfield project is to introduce some common corpus linguistic methods to the general public in a simplified manner. Most methods from corpus linguistics cater for a specialist audience, yet these methods could also help the general public by making the record of the British parliament more accessible. The project pursues this by means of intuitive searches and associated visualisations. Timelines, word clouds, sunburst visualisations and line charts show linguistic information such as frequencies, linguistic tags and relations in a simpler and more understandable way. Thus, we expect that non-academic users and the general public will be able to make the most of the Hansard Corpus without needing any linguistic expertise, and to obtain more than a simple list of full-text search results.

Week 12

Thursday 24th January 2019


Management School LT 11

Using corpora to investigate the linguistic challenges of the transition from primary to secondary school

Duygu Candarli & Robbie Love

University of Leeds

  • Abstract

This talk introduces an ESRC-funded project which aims to build understanding of the language challenges of the transition from Key Stage (KS) 2 to KS3 in the UK school context. We focus on academic registers, that is, instructional and regulative registers (Christie, 2002) rather than, say, the language of the playground.

The KS2/3 transition is known to be difficult for many children, and there is a well-documented dip in attainment and motivation at the beginning of KS3 (DfE, 2011; Howe & Richards, 2011). Academic language issues are probably exacerbated at this point. Braund and Driver, writing about science learning, note that '[t]eaching environments [...] and teachers' language are very different in secondary schools from primary schools' (2005, p. 78). Students' writing has been extensively researched in UK education contexts (e.g. Durrant & Brenchley, 2018; Nesi & Gardner, 2012); however, little is known about the educational registers that students encounter at school, and how they differ from the everyday language they use outside of school. Our project addresses the following research questions:

● How does the academic language of KS3 differ from that of KS2?

● How does the language of both KS2 and KS3 differ from everyday language?

Our data consist of two bespoke corpora, and various pre-existing reference corpora. The bespoke corpora represent (1) KS2 (years 5 and 6) and (2) KS3 (years 7 and 8). Both include audio recordings of lessons, students' textbooks, teacher-designed worksheets, information sheets, vocabulary and glossary booklets, PowerPoint presentations, exams, marking rubrics and other formative assessments in the subjects of English, history, maths, geography and the sciences. We are gathering these data from primary and secondary schools in England, mainly from the Yorkshire region.

This project aims to advance our understanding of the potential language challenges of students at the transition stage in the UK education system, by using methods from corpus linguistics to gain a 'bird's eye' view on a pressing educational issue. Methodologically, we hope to further bridge the disciplines of corpus linguistics and education, and in doing so help to improve the accessibility of curricula for all students.

This is an ESRC-funded research project based at the School of Education at the University of Leeds, working in partnership with Lancaster University's Centre for Corpus Approaches to Social Science and affiliated with Cambridge University Press.


Braund, M. & Driver, M. (2005). Pupils' perceptions of practical science in primary and secondary school: implications for improving progression and continuity of learning. Educational Research, 47, 77-91

Christie, F. (2002). Classroom discourse analysis: A functional perspective. London: Continuum.

Department for Education. (2011). How do pupils progress between Key Stages 3 and Research Report.

Durrant, P., & Brenchley, M. (2018). Development of vocabulary sophistication across genres in English children's writing. Reading and Writing.

Nesi, H., & Gardner, S. (2012). Genres across the disciplines: Student writing in higher education. Cambridge: Cambridge University Press.

Week 11

Wednesday 16th January 2019


Management School A001c (PC/Learning Lab)

Wmatrix for forensic linguistics: a practical hands-on demo

Paul Rayson

SCC, Lancaster University

  • Abstract

Wmatrix was originally conceived in the REVERE project (1998-2001) as a web interface to facilitate the availability of Natural Language Processing (NLP) and Corpus Linguistics (CL) tools and methods to software engineers who were studying legacy systems through document archaeology alone (Rayson et al 2001, 2005). Since then, its web interface has been extended to expose more underlying details of the language analysis rather than hiding them away, and it has supported applications of NLP and CL methods in many other areas such as political discourse analysis, tracing facework, corpus stylistics, metaphor analysis, topic modelling, evaluating problem based learning and the language of illness. In the short talk at the beginning of this session, I will highlight applications in forensic, legal, and policing settings, for example: online child protection (Rashid et al 2013), predicting collective action (Charitonidis et al 2017), scientific fraud (Markowitz and Hancock 2014), and studies of the language of international criminal tribunals (Potts and Kjær 2015), sex offenders (Lord et al 2008), extremism and counter extremism (Prentice et al 2012), and psychopaths (Hancock et al 2013). In the remainder of the two-hour session, participants will follow the online tutorials which introduce the key semantic domains method. We will use the new version 4 of Wmatrix running on a dedicated server with secure HTTPS access, which went public in December 2018. Participants will be provided with existing manifesto datasets but are welcome to bring their own English corpora to upload.

Charitonidis C., Rashid A., Taylor P.J. (2017) Predicting Collective Action from Micro-Blog Data. In: Kawash J., Agarwal N., Özyer T. (eds) Prediction and Inference from Social Networks and Social Media. Lecture Notes in Social Networks. Springer, Cham

Jeffrey T. Hancock, Michael T. Woodworth and Stephen Porter (2013) Hungry like the wolf: A word-pattern analysis of the language of psychopaths. Legal and Criminological Psychology. Volume 18, Issue 1, pages 102-114.

Lord V, Davis B, Mason P. 2008. Stance-shifting in language used by sex offenders. Psychology, Crime & Law 14, 357-379.

Markowitz DM, Hancock JT (2014) Linguistic Traces of a Scientific Fraud: The Case of Diederik Stapel. PLoS ONE 9(8): e105937. doi:10.1371/journal.pone.0105937

Potts, A. and Kjær, A.L. (2015) Constructing Achievement in the International Criminal Tribunal for the Former Yugoslavia (ICTY): A Corpus-Based Critical Discourse Analysis. International Journal for the Semiotics of Law. doi: 10.1007/s11196-015-9440-y

Prentice, S, Rayson, P & Taylor, P 2012, 'The language of Islamic extremism: towards an automated identification of beliefs, motivations and justifications' International Journal of Corpus Linguistics, vol. 17, no. 2, pp. 259-286. DOI: 10.1075/ijcl.17.2.05pre

Rashid, A, Baron, A, Rayson, P, May-Chahal, C, Greenwood, P & Walkerdine, J 2013, 'Who am I? Analysing Digital Personas in Cybercrime Investigations' Computer, vol. 46, no. 4, pp. 54-61. DOI: 10.1109/MC.2013.68

Rayson, P., Emmet, L., Garside, R., & Sawyer, P. (2001). The REVERE project: Experiments with the application of probabilistic NLP to systems engineering. In Natural Language Processing and Information Systems - 5th International Conference on Applications of Natural Language to Information Systems, NLDB 2000, Revised Papers (pp. 288-300).

Sawyer, P., Rayson, P., & Cosh, K. (2005). Shallow Knowledge as an Aid to Deep Understanding in Early-Phase Requirements Engineering. DOI: 10.1109/TSE.2005.129

Joint session with FORGE

Week 10

Thursday 13th December 2018



Colloquialisation of academic British English: Evidence from the Written BNC2014

Abi Hawtin

CASS, Lancaster University

  • Abstract

Is academic British English becoming more colloquial? Evidence from the Written BNC2014

Leech (2002:72) defines colloquialisation as "a tendency for features of the conversational spoken language to infiltrate and spread in the written language". In this presentation I will discuss the results of a study which investigates whether academic British English has become more colloquial since the 1990s. I use some early data from the Written BNC2014, and compare this to data from the BNC1994, to investigate whether linguistic features associated with colloquialisation have changed in frequency between the two corpora. I find that in some respects academic British English has certainly become more colloquial since the 1990s, although this pattern is not straightforward. It seems that academic books have changed much more in the direction of colloquialisation theory than academic journal articles, and that genres of writing with a 'social' aspect are showing more changes in line with colloquialisation theory than the 'hard' science genres. I also consider whether these results can tell us anything about the colloquialisation of language in general.

Leech, G. (2002). Recent grammatical change in English: Data, description, theory. In K. Aijmer & B. Altenberg (Eds.), Advances in Corpus Linguistics. Papers from the 23rd International Conference on English Language Research on Computerized Corpora (ICAME 23), Göteborg 22-26 May 2002 (pp. 61-81), Amsterdam: Rodopi.

Week 9

Thursday 6th December 2018


Fylde LT 3

The UK 'at risk' - A corpus approach to historical social change 1785-2009

Jens O. Zinn

CASS, Lancaster University

  • Abstract

Since the 1980s, sociologists have tried to explain the proliferation of risk language. Theoretical explanations provided by risk society theorists such as Anthony Giddens and Ulrich Beck, for example, were mainly developed on the basis of scholarly observations rather than empirical evidence. With the digitization of newspaper archives, new opportunities have arisen to examine long-term social change empirically, such as the shift towards risk in public debate. This presentation reports results from an interdisciplinary research project which utilises corpus linguistics tools to examine long-term social change towards risk. It uses 'risk words' as an entry point to analyse the changing meaning of risk and uses The Times corpus at CASS to examine the discourse-semantic shifts of at risk-constructs ('at the risk', 'at risk' and 'at-risk') from the 18th to the early 21st century. Combining collocation analysis with detailed qualitative analysis of concordances, the presentation shows how at risk-constructs have become part of the cultural repertoire used in the print news media, representing what could be called a new zeitgeist. The research argues that socio-structural changes, cultural changes, institutional practices and socially significant historical events shaped and fostered the use of at risk-constructs.

Jens Zinn is Associate Professor in Sociology at the University of Melbourne and Guest Professor at the Risk and Crisis Research Centre, Mid-Sweden University. He is currently a Marie Skłodowska-Curie research fellow at CASS.

Earlier publications on discourse semantic change of risk: Zinn, J.O. 2019 (in preparation) The UK 'at risk'. A corpus approach to social change 1785-2009. Basingstoke, UK: Palgrave Macmillan.

Zinn, J.O. 2018 The Proliferation of 'at risk' in The Times: A Corpus Approach to Historical Social Change, 1785-2009. Historical Social Research 43(2), 313-364.

Zinn, J.O. and McDonald, D. 2018: Risk in the New York Times (1987-2014) - A corpus-based exploration of sociological theories. Basingstoke, UK: Palgrave Macmillan.

Zinn, J. O. and McDonald, D. 2016: Changing Discourses of Risk and Health Risk: A Corpus Analysis of the Usage of Risk Language in the New York Times. In: Chamberlain, M. (ed.): Medicine, Risk, Discourse and Power, London, New York: Routledge, 207-240.

Zinn, J.O. 2010: Risk as Discourse: Interdisciplinary Perspectives. CADAAD journal 4(2), 10624.

Week 8

Thursday 29th November 2018



Climate Change, Ocean Acidification and the Nitrogen Cycle. A Corpus-based Discourse Analysis of the concept of Anthropocene in the press

Angela Zottola1 & Claudio de Majo2

1University of Nottingham  2Rachel Carson Center for Environment and Society (Ludwig-Maximil

  • Abstract

The term Anthropocene first appeared in the early 2000s when scientist Paul Crutzen attempted to define the effects of human societies on the environment (Steffen et al. 2011). Since then, it has become an increasingly widespread, but also controversial, word in the scientific community. As environmental discourses increasingly permeate our lives, it has trespassed the borders of scholarly traditions, becoming acknowledged in popular culture (Autin & Holbrook 2012).

Bearing in mind the pivotal role the press has in the popularization, dissemination and consequent understanding of given topics, this contribution aims at investigating the representation of the notion of Anthropocene provided by the press in the USA, UK, India and Australia, highlighting different stances and ideas related to this concept.

In the framework of the Environmental Humanities (Trischler 2016, 2013; Steffen et al. 2007) and Corpus-based Discourse Analysis (Baker et al. 2008; Baker, Gabrielatos & McEnery 2013), this work analyses three corpora of newspaper articles collected from the three countries, covering the period from 2002, the year in which this term was first employed in a scientific paper (Zalasiewicz et al. 2008), to the present day. It critically investigates the way the press's representation influences our understanding of the Anthropocene.

The analysis shows that climate change, ocean acidification and the nitrogen cycle are among the most common human-induced natural phenomena the term Anthropocene is associated with, and that the account of the consequences of the Anthropocene mostly relies on emotional discourses rather than on scientifically grounded evidence.


Autin, J.W. & Holbrook, J.M. 2012. Is the Anthropocene an issue of stratigraphy or pop culture? GSA Today, 60-61.

Baker, Paul, Gabrielatos, Costas and Tony McEnery. 2013. Discourse Analysis and Media Attitudes. The Representation of Islam in the British Press. New York: Cambridge.

Baker, Paul, Gabrielatos, Costas, KhosraviNik, Majid, Krzyzanowski, Michal, McEnery, Tony and Ruth Wodak. 2008. A Useful Methodological Synergy? Combining Critical Discourse Analysis and Corpus Linguistics to Examine Discourses of Refugees and Asylum Seekers in the UK Press. Discourse and Society, 19(3): 273-306.

Carvalho, A. 2007. Ideological Cultures and Media Discourses on Scientific Knowledge: Re-reading News on Climate Change. Public Understanding of Science, 16(2007): 223-243.

Hajer, M. 1995. The Politics of Environmental Discourse: Ecological Modernization and the Policy Process. Oxford, New York: Oxford University Press.

Steffen et al. 2011. The Anthropocene: conceptual and historical perspectives. Philosophical transactions of the Royal Society, 369: 842-867.

Steffen et al. 2007. The Anthropocene: Are Humans Now Overwhelming the Great Forces of Nature? Royal Swedish Academy of Sciences, 36(8).

Trischler, H. (ed.). 2013. "Anthropocene: Exploring the Future of the Age of Humans," RCC Perspectives, 3.

Trischler, H. 2016. "The Anthropocene: A Challenge for the History of Science, Technology, and the Environment", National Centre for Biotechnology Information, 24(3), 309-355.

Zalasiewicz et al. 2008. Are we now living in the Anthropocene? GSA Today, 18(2):4-8.

Week 7

Thursday 22nd November 2018


Fylde LT 1

Fool's Gold: Understanding the Linguistic Features of Deception and Humour Through April Fools' Hoaxes

Ed Dearden

SCC, Lancaster University

  • Abstract

Every year on April 1st, people play practical jokes on one another and news websites fabricate false stories with the goal of making fools of their audience. In an age of disinformation, with Facebook under fire for allowing "Fake News" to spread on its platform, every day can feel like April Fools' day. We create a dataset of April Fools' hoax news articles and build a set of features based on past research examining deception, humour, and satire. Analysis of our dataset and features suggests that structural complexity and level of detail in a text are the most important types of feature in characterising April Fools' hoaxes. We propose that these features are also very useful for understanding Fake News, and disinformation more widely.

Week 5

Thursday 8th November 2018



A Genre Analysis of Discourses Surrounding Venereal Disease in Seventeenth-Century England

Tony McEnery & Helen Baker

CASS, Lancaster University

  • Abstract

Sufferers of venereal disease in seventeenth-century England faced an array of difficulties. Not only did they have to cope with the painful and often worsening symptoms of syphilis, gonorrhea or whichever type of sexually transmitted illness they had contracted, they were also obliged to hide these symptoms to avoid being marked and condemned as carriers of such a disease. This talk is about perceptions of the disease and the types of texts which referred to it.

Using the term pox as a starting point, we gather together a selection of suitable search queries - terms of interest which were used to refer to venereal disease in early modern England - and describe how this list was compiled step-by-step. We demonstrate the challenges inherent in achieving a comprehensive list of names due to the necessary inclusion of many near-synonyms and spelling variants of each term. A large proportion of these terms were constructed by the insertion of a nationality adjective in front of the noun pox or disease, e.g. Italian pox, American disease, with French pox being the most commonly used alternative to the pox. Accordingly we investigate to what extent English writers associated venereal disease with different nations.

Following on from that, in order to uncover the kinds of written works in which references to venereal disease appear, we undertake a genre analysis. Such a genre-based approach has only recently become possible due to the addition of a genre categorisation framework for titles within the EEBO corpus. The talk will present the findings of that analysis and reflect on the possible reasons for the changing pattern of reference to venereal disease by genre over the century.


Siena, K.P. (2001), The "Foul Disease" and Privacy: The Effects of Venereal Disease and Patient Demand on the Medical Marketplace in Early Modern London. Bulletin of the History of Medicine, 75, 2: 199-224.

Szreter, S. (2017), Treatment rates for the pox in early modern England: a comparative estimate of the prevalence of syphilis in the city of Chester and its rural vicinity in the 1770s. Continuity and Change, 32, 2: 183-223.

Week 4

Thursday 1st November 2018



A thematically oriented analysis of the Financial Services Annual Reports (FinSerAR) Corpus: the UK financial services' narrative towards Brexit

Vasiliki Simaki

CASS, Lancaster University

  • Abstract

In this study, I present a corpus consisting of annual reports from UK financial companies: the Financial Services Annual Reports Corpus. Given the potential impact of Brexit on financial services, I decided to focus my study on the financial domain of the UK. For this corpus, I downloaded the 2015, 2016 and 2017 annual reports from five UK-based banks: Barclays, HSBC, Lloyds, Royal Bank of Scotland and Santander UK. Using thematic keywords, I identified all the content in this corpus referring to the 2016 UK referendum, its outcome, and the response of the financial services to the process of the UK's exit from the European Union. I extracted this subset, the Brexit-related content, and performed different analytical tasks on it. I explored the context in which the thematic keywords are found, statistically compared the three yearly sets of the subset, and identified the significant words of the subset in terms of keyness. The results of the analysis showed that the Brexit-related content cannot be considered neutral or objective, and this conclusion led me to the identification of stance in the subset. For this task, I used a functional-cognitive stance framework and stance constructions (Simaki et al. 2017, Simaki et al. 2019) to detect stance markers in the Brexit-related content, and the findings are discussed.


Simaki, V., C. Paradis, M. Skeppstedt, M. Sahlgren, K. Kucher and A. Kerren. 2017. 'Annotating speaker stance in discourse: the Brexit Blog Corpus', Corpus Linguistics and Linguistic Theory. DOI: 10.1515/cllt-2016-0060

Simaki, V., Paradis, C., Kerren, A. (2019). A two-step procedure to identify lexical elements of stance constructions in discourse from political blogs. In Corpora (accepted).

Week 3

Thursday 25th October 2018


Fylde LT 1

Pero, Bueno, Pues: Testing new methodological approaches for the identification and disambiguation of discourse markers in spoken peninsular Spanish

Zoé Broisson

University of Louvain

  • Abstract

In this presentation, I introduce a first attempt at testing the operationality of a corpus-based, cross-linguistic definition and taxonomy of discourse markers developed by Crible (2017) to reliably annotate spoken peninsular Spanish data. To this end, I manually extract and annotate 737 DMs in a subset of 8,300 words of semi-formal interviews sourced from the Spanish component of the Backbone corpus (Kurt 2012). Using a step-wise annotation procedure inspired by Scholman et al. (2016), I demonstrate that Crible's (2017) taxonomy constitutes a valuable tool for the annotation of spoken Spanish discourse markers, and I provide suggestions to further enhance its replicability. With this study, I hope to modestly contribute to a line of methodological research I believe opens promising avenues for the exchange of comparable research within the field of discourse analysis.

Week 2

Thursday 18th October 2018


Management School LT 11

Narratives of Voice-hearers

Luke Collins

CASS, Lancaster University

  • Abstract

In this talk, I will introduce a new CASS project that explores first-person accounts of voice-hearers. Voice-hearing refers to the perception of voices that others cannot hear and is typically associated with mental health problems. However, there are also many people who experience voices that are not distressing. Using the UCREL Semantic Analysis System (USAS), I present some initial observations of the key themes observed in the two corpora of interview data collected from: i) participants engaging with clinical services to help them cope with their distressing voices; ii) participants who self-identify as Spiritualists and view their experiences as communication with the spirit world. I offer some analytical observations of the ways in which the participants position themselves as agentive to demonstrate how a linguistic analysis of their accounts can help us to better understand their respective experiences.

Week 1

Thursday 11th October 2018


LUMS Computer Lab A001c

#LancsBox v. 4 and other brand-new corpus tools

Vaclav Brezina & Matt Timperley

CASS, Lancaster University

  • Abstract

In this practical session and software demonstration, we briefly introduce three brand-new software tools: #LancsBox, BNClab, and Lancaster Stats Tools online, all developed at Lancaster University. Following the recent debate in the field (e.g. McEnery & Hardie 2011; Kilgarriff 2012; Gries 2013; Lijffijt et al. 2014; Brezina & Meyerhoff 2014; Brezina et al. 2015; Gablasova et al. 2017) and responding to the challenges identified in the debate, we have developed software and tools that incorporate a number of existing analytical techniques and add innovative new methods that enable more efficient and sophisticated exploration of the data. These tools can be used by linguists, language teachers, translators, historians, sociologists, educators, and anyone interested in quantitative language analysis. They are free to use for non-commercial purposes. This practical session highlights innovative features of the new tools and focuses on practical demonstration of the new version of #LancsBox (v. 4).