Corpus Construction Seminar: Written BNC2014, Spoken BNC2014, and CorCenCC

Abi Hawtin (1), Robbie Love (1) & Dawn Knight (2)

(1) CASS, Lancaster University; (2) Cardiff University

This week's UCREL CRS will consist of three presentations, all concerning corpus construction.

Abi Hawtin will present 'Introducing the Written BNC2014 project'.

"The Centre for Corpus Approaches to Social Science (CASS) at Lancaster University and Cambridge University Press are collaborating on the creation of a new, publicly accessible corpus of contemporary written British English: Written BNC2014. The Written BNC1994 is even now used as a proxy for contemporary British English, despite being 20 years old; a new standard, broad-coverage, widely-available corpus is needed to allow the same kind of research fostered by the Written BNC1994 to generate results that are truly representative of contemporary British English.

This presentation introduces the corpus construction project and provides a progress report. I will cover, first, the decisions made in the development of the sampling frame, and in particular, the divisive issue of 'representativeness (of contemporary language) vs comparability (with the 1994 corpus)'. Second, I will present the decisions which have been made regarding the collection and tagging of the texts and the implications these decisions will have for the research affordances of the completed corpus. The estimated release date for the corpus is the third quarter of 2018."

Robbie Love will present '"Normal with a brummy twang": dealing with metadata in the Spoken BNC2014'.

"The Spoken BNC2014 is a new corpus of spoken British English, currently being compiled by Lancaster University and Cambridge University Press. This presentation focuses on metadata, the collection and processing of which is a crucial aspect of spoken corpus compilation. I consider certain categories of speaker metadata which have presented particularly interesting methodological problems, and critically assess the decisions made to overcome them:

• Dialect: What is the best approach to capturing and characterising speaker dialect, given that responses such as that quoted in my title may be elicited? I evaluate the use of birthplace, current location, location of L1 acquisition and self-reported dialect.

• Socio-economic status: Can one single classification system account for all speakers? I evaluate Social Grade (Collis 2009), Social Class based on Speaker Occupation (Rose & Pevalin 2005) and NS-SEC (ibid.).

• Sexuality: How far is too far? Here, I discuss speakers' unwillingness to disclose some (potential) metadata categories including sexuality and religion.

My assessment of the problems relating to the classification of such categories has forced me to call into question some of the decisions made with regard to existing spoken corpora, including the original Spoken BNC1994 (Leech 1993). My findings thus demonstrate the need to pursue a delicate balance between comparability (with the earlier BNC) and methodological advancement."

Dawn Knight will present 'The National Corpus of Contemporary Welsh: A community driven approach to linguistic corpus construction'.

"This paper provides a detailed overview of the plans for the Economic and Social Research Council (ESRC) and the Arts and Humanities Research Council (AHRC)-funded CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes - The National Corpus of Contemporary Welsh) project: A community driven approach to linguistic corpus construction.

Although there are spoken, written and e-language Welsh corpora, CorCenCC will be the first large-scale corpus (with a projected word count of 10,000,000) to incorporate all data types reflecting the use of the Welsh language by its contemporary users. The project will engage and work closely with Welsh language users and, as part of this presentation, I will discuss the strategies and techniques that we will employ to engage and recruit participants from across Wales. The project will use new technologies, such as crowdsourcing, to collect data, and draw contributors from the 562,000+ Welsh speakers in the UK.

CorCenCC will be open-source and freely available for use by professional communities and anyone with an interest in language. The corpus will enable, for example, community users to investigate dialect variation or idiosyncrasies of their own language use; professional users to profile texts for readability or develop digital language tools; learners to learn from real-life models of Welsh; and researchers to investigate patterns of language use and change. During this presentation I will also discuss some of the practical, methodological and technical issues involved in the design and construction of CorCenCC, and will welcome feedback on the draft outline of the corpus sampling frame that will feature as part of these discussions.

Project website: http://sites.cardiff.ac.uk/corcencc/ Follow us on Twitter: @CorCenCC"

Week 28 2015/2016

Thursday 9th June 2016
2:00-4:00pm

Furness LT 3