The Centre for Corpus Approaches to Social Science (CASS) at Lancaster University and Cambridge University Press (CUP) are collaborating on a new corpus of spoken British English, known as the Spoken British National Corpus 2014 (Spoken BNC2014). This will be the first publicly-accessible corpus of its kind since the spoken component of the original British National Corpus (Leech 1993; henceforth Spoken BNC1994).
As with all spoken corpora, the definition of a suitable transcription scheme is not only an essential preparatory step, but also a locus for the critical examination of certain issues in corpus construction. The Spoken BNC2014's transcription scheme was developed cooperatively by CASS and CUP via a process of reviewing previous work, pilot-testing, and extensive discussion and consultation. We could in theory have reused, unedited, the same scheme as Spoken BNC1994 (Crowdy 1994); however, we argue that this scheme is insufficiently detailed to minimize ambiguity in transcription, as too much is left to the discretion of individual transcribers. Instead, we adapted a modern, highly-detailed system, one currently being used in another CASS project (Gablasova et al., under review), while also, in the interests of comparability, taking into account Crowdy (1994) where possible.
In this talk I describe how we refined the transcription scheme for the Spoken BNC2014, and discuss some of the issues encountered along the way, with particular focus on a hitherto unexamined area, speaker identification. The resulting scheme adheres to no particular prior encoding standard. However, its coding is defined so as to be computationally unambiguous, allowing automated mapping to standard XML for distribution and archiving; our target encoding instantiates "Modest XML for Corpora" (Hardie 2014). A critical comparison of this scheme to that of the Spoken BNC1994 explores the delicate balance we sought between backwards-compatibility and optimal practice in the context of the new corpus.
References
Crowdy, S. (1994). Spoken Corpus Transcription. Literary and Linguistic Computing, 9(1), 25-28.
Gablasova, D., Brezina, V., McEnery, T. & Boyd, E. (under review). Epistemic stance in spoken L2 English: The effect of task type and speaker style. Submitted to Applied Linguistics.
Hardie, A. (2014). Modest XML for corpora: not a standard, but a suggestion. ICAME Journal, 38, 73-103.
Leech, G. (1993). 100 million words of English. English Today, 9-15. doi:10.1017/S0266078400006854