Rayson, P., Leech, G., and Hodges, M. (1997). Social differentiation in the use of English vocabulary: some analyses of the conversational component of the British National Corpus. International Journal of Corpus Linguistics. Volume 2, number 1. pp 133 - 152. John Benjamins, Amsterdam/Philadelphia. ISSN 1384-6655.

SOCIAL DIFFERENTIATION IN THE USE OF ENGLISH VOCABULARY: SOME ANALYSES OF THE CONVERSATIONAL COMPONENT OF THE BRITISH NATIONAL CORPUS1

Paul Rayson, Geoffrey Leech and Mary Hodges
UCREL (University Centre for Computer Corpus Research on Language)
Lancaster University,
Lancaster LA1 4YT,
United Kingdom.


Abstract


In this article we undertake selective quantitative analyses of the demographically-sampled spoken English component of the British National Corpus (for brevity, referred to here as the "Conversational Corpus"). This is a subcorpus of c.4.5 million words, in which speakers and respondents (see 1 below) are identified by such factors as gender, age, social group and geographical region. Using a corpus analysis tool developed at Lancaster, we undertake a comparison of the vocabulary of speakers, highlighting those differences which are marked by a very high X2 value of difference between different sectors of the corpus according to gender, age and social group. A fourth variable, that of geographical region of the United Kingdom, is not investigated in this article, although it remains a promising subject for future research. (As background we also briefly examine differences between spoken and written material in the British National Corpus (BNC).) This study is illustrative of the potentiality of the Conversational Corpus for future corpus-based research on social differentiation in the use of language. There are evident limitations, including (a) the reliance on vocabulary frequency lists, and (b) the simplicity of the transcription system employed for the spoken part of the BNC. The conclusion of the article considers future advances in the research paradigm illustrated here.


Keywords: British National Corpus, spoken English vocabulary frequency, chi-squared test

1. Introduction

The British National Corpus (BNC) is a c.100-million-word corpus of present-day British English, containing a c.10-million-word subcorpus of spoken language recorded in the period 1991-19932. The spoken subcorpus, in its turn, is subdivided into a part sampled by demographic methods (the "Conversational Corpus"), and a part sampled by context-governed methods3.

The Conversational Corpus, consisting of 4,552,555 words4, was collected by the following method. A market-research firm, the British Market Research Bureau, sampled the population of the UK (over the age of 15) using well-tried methods of social survey research. Individuals willing to take part in the project were equipped with a high-quality Walkman sound recorder, and recorded any linguistic transactions in which they engaged during a period of two days. These individuals, whom we will henceforth call "respondents", numbered 153, and were sampled in order to obtain a good representation of the population of the UK, given the unavoidable practical limitation in the number of respondents, according to

Region: south, midland, north5

Gender: male, female

Age: 15-24; 25-34; 35-44; 45-59; 60 and over6

Social Group: A, B, C1, C2, D, E7

In addition, respondents undertook to obtain the permission of other speakers with whom they engaged in conversation, and to note for each speaker details of gender, age, social group, etc.8 Thus the Conversational Corpus provides an unparalleled resource for investigating, on a large scale, the conversational behaviour of the British population in the 1990s. In the longer term it will be possible to use the corpus, and its various subdivisions, for the investigation of many lexical, grammatical, sociolinguistic, or other aspects of the productions of British native speakers of English. We are, however, conscious of the limitations imposed by the corpus's simple orthographic transcription, and in the present study are restricting our attention to phenomena of lexical frequency.

The software used to investigate lexical frequency in the corpus is described further in section 6 below. The particular virtues of the software for the present purpose are that it can be used to

(a) list two parallel frequency lists derived from different sectors of the corpus

(b) sort these lists not only in alphabetical or in rank order, but according to the order of significance of X2 (chi-square) values, so that (for example) words which are most significant in differentiating the speech of females from the speech of males are ordered at the top of the list.

(c) provide a KWIC concordance of any word according to need.

(d) provide (a)-(c) above not only for orthographic words, but for part-of-speech tagged words, i.e. grammatical forms or lexemes, or the grammatical tags themselves.

In referring to "sectors of the corpus" above, we mean not only subcorpora identified by header information regarding the gender, age, etc. of the respondent, but also utterances extracted from the corpus on the basis of the gender, age, etc. of the speaker. The ability to isolate utterance information of this kind is another advantage of the software.
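As an illustration of facility (c), a keyword-in-context (KWIC) listing can be sketched in a few lines of Python. This is a minimal sketch only, not the Lancaster software itself: the function name `kwic`, the `width` parameter and the sample token list are assumptions introduced for the example.

```python
def kwic(tokens, keyword, width=4):
    """Return one keyword-in-context line per occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            # Up to `width` tokens of context on either side of the hit.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so that keywords line up.
            lines.append(f"{left:>30} [{tok}] {right}")
    return lines


# Hypothetical token sequence, already split into words and punctuation.
sample = "What are you doing ? Making a fortune teller .".split()
for line in kwic(sample, "a", width=3):
    print(line)
```

Matching is on whole tokens, so the keyword a does not match inside are or Making.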

In the following sections 2-4, we examine some of the salient results of the lexical frequency analysis of gender variation (section 2), age variation (section 3) and social group variation (section 4). The social interest of these results is obvious, although the results themselves are partly predictable. In section 5, we turn to some salient contrasts between the spoken part and the written part of the BNC, as shown in the two-million-word "Sampler Corpus", a broadly representative subsample of the whole BNC.

2. Gender Variation

Using the whole of the Conversational Corpus material for which gender of speaker is indicated, we find that female speakers have a larger share of the corpus than male speakers according to a number of different measures. Firstly, there is a small built-in bias in the corpus, in that 75 female respondents but only 73 male respondents were enlisted as volunteers to participate in the collection of data. In addition, female speakers overall took a larger share of the language collected, as shown by these figures:

TABLE 1 Distribution of the Conversational Corpus between Female and Male Speakers
                              Female Speakers   Male Speakers
Number of Speakers                        561             536
Number of turns                       250,955         179,844
Number of words spoken              2,593,452       1,714,443
Number of turns per speaker            447.33          335.53
Number of words per turn                10.33            9.53

Not only were there more female respondents, but also more female speakers, who on the whole took more turns and longer turns than the male speakers. All this led to a greater female than male representation in the Conversational Corpus. It should be borne in mind, therefore, in the following tables, that for every 100 word tokens spoken by men in the demographic corpus, 151 were spoken by women9. Because of this disparity, the normalized frequency of each word is presented in the following tables, in the form of a percentage of all word tokens (M%= percentage of the number of word tokens spoken by males; F% = percentage of the number of word tokens spoken by females). The X2 value is based on comparison of such normalized frequencies. The 25 most significant words showing overrepresentation in male and female speech are the following, in order of significance.

TABLE 2: Words most characteristic of male speech10
WORD MALES M % FEMALES F % X2
fucking 1401 0.08 325 0.01 1233.1
er 9589 0.56 9307 0.36 945.4
the 44617 2.60 57128 2.20 698.0
yeah 22050 1.29 28485 1.10 310.3
aye 1214 0.07 876 0.03 291.8
right 6163 0.36 6945 0.27 276.0
hundred 1488 0.09 1234 0.05 251.1
fuck 335 0.02 107 0.00 239.0
is 13608 0.79 17283 0.67 233.3
of 13907 0.81 17907 0.69 203.6
two 4347 0.25 5022 0.19 170.3
three 2753 0.16 2959 0.11 168.2
a 28818 1.68 39631 1.53 151.6
four 2160 0.13 2279 0.09 145.5
ah 2395 0.14 2583 0.10 143.6
no 14942 0.87 19880 0.77 140.8
number 615 0.04 463 0.02 133.9
quid 484 0.03 339 0.01 124.2
one 9915 0.58 12932 0.50 123.6
mate 262 0.02 129 0.00 120.8
which 1477 0.09 1498 0.06 120.5
okay 1313 0.08 1298 0.05 119.9
that 31014 1.81 43331 1.67 114.2
guy 211 0.01 95 0.00 108.6
da 459 0.03 338 0.01 105.3
yes 7102 0.41 9167 0.35 101.0

TABLE 3: Words most characteristic of female speech
WORD MALES M % FEMALES F % X2
she 7134 0.42 22623 0.87 3109.7
her 2333 0.14 7275 0.28 965.4
said 4965 0.29 12280 0.47 872.0
n't 24653 1.44 44087 1.70 443.9
I 55516 3.24 92945 3.58 357.9
and 29677 1.73 50342 1.94 245.3
to 23467 1.37 39861 1.54 198.6
cos 3369 0.20 6829 0.26 194.6
oh 13378 0.78 23310 0.90 170.2
Christmas 288 0.02 1001 0.04 163.9
thought 1573 0.09 3485 0.13 159.7
lovely 414 0.02 1214 0.05 140.3
nice 1279 0.07 2851 0.11 134.4
mm 7189 0.42 12891 0.50 133.8
had 4040 0.24 7600 0.29 125.9
did 6415 0.37 11424 0.44 109.6
going 3139 0.18 5974 0.23 109.0
because 1919 0.11 3861 0.15 105.0
him 2710 0.16 5188 0.20 99.2
really 2646 0.15 5070 0.20 97.6
school 501 0.03 1265 0.05 96.3
he 15993 0.93 26607 1.03 90.4
think 4980 0.29 8899 0.34 88.8
home 734 0.04 1662 0.06 84.0
me 5182 0.30 9186 0.35 83.5
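
The M % and F % columns in Tables 2 and 3 can be reproduced from the raw counts together with the word totals in Table 1. The following Python sketch shows the arithmetic for the word she; the function and constant names are assumptions introduced for illustration.

```python
# Word-token totals for speakers of known gender (Table 1).
MALE_TOKENS = 1_714_443
FEMALE_TOKENS = 2_593_452


def percent_of_tokens(count, total):
    """Normalized frequency: count as a percentage of all word tokens."""
    return 100.0 * count / total


# Raw counts for "she" (Table 3).
m_pct = percent_of_tokens(7134, MALE_TOKENS)
f_pct = percent_of_tokens(22623, FEMALE_TOKENS)

# Female word tokens for every 100 male word tokens.
ratio = 100.0 * FEMALE_TOKENS / MALE_TOKENS

print(f"she: M% = {m_pct:.2f}, F% = {f_pct:.2f}")
print(f"female tokens per 100 male tokens: {ratio:.0f}")
```

Normalizing in this way is what allows the two columns to be compared directly despite the larger female share of the corpus.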

Perhaps the most notable (though predictable) finding illustrated in Table 2 is the tendency for taboo words ("swear words") to be more characteristic of male speech than female speech11. (This tendency is found not only with fucking, but with other "four-letter words" lower down the list: shit (X2 = 37.4), hell (X2 = 22.8), crap (X2 = 44.3).) Another tendency is for males to use number words: not only hundred (X2 = 251.1) and one (X2 = 123.6) but also, for example, three (X2 = 168.2), two (X2 = 170.3) and four (X2 = 145.5). Females, on the other hand, make strikingly greater use of the feminine pronoun she/her/hers12 and also of the first-person pronoun I/me/my/mine13.

TABLE 4 USE OF SOME PRONOUNS IN FEMALE SPEECH
WORD MALES M % FEMALES F % X2
she 7134 0.42 22623 0.87 3109.7
her 2333 0.14 7275 0.28 965.4
hers 26 0.00 110 0.00 24.3
I 55516 3.24 92945 3.58 357.9
me 5182 0.30 9186 0.35 83.5
mine 505 0.03 818 0.03 1.5

As Table 2 shows, men are more likely to use the filled-pause marker er, and certain informal interjections or word-isolates, such as yeah, aye, okay, ah (X2 = 143.6), eh (X2 = 77.5), and hmm (X2 = 28.5). Women, on the other hand, make more use of mm (X2 = 133.8) and really (X2 = 97.6); yes (X2 = 101.0), by contrast, appears in Table 2 among the male-favoured words. The preference for the and of in male speech in Table 2 may appear more puzzling, but accords with the stronger male preference for nouns (in particular common nouns) over verbs and pronouns (see Table 7). One would hypothesise, on the basis of figures presented in Table 7 below, that male speech shows a stronger propensity to build noun phrases (including articles and common nouns) where female speech has a tendency to rely more on pronouns and proper nouns14.

Many fascinating lexical patterns of gender preference could be followed up here, and their reasons explored, but we will content ourselves with mentioning one other area - that of family relationships - where the sexes are strongly differentiated in lexical preference. On the whole, women are more strongly oriented towards family terms, although there are some interesting exceptions.

TABLE 5 SOME FAMILY TERMS USED MORE BY FEMALES
WORD MALES M % FEMALES F % X2
mother 272 0.02 627 0.02 34.2
father 115 0.01 307 0.01 27.7
sister 105 0.01 257 0.01 17.6
brother 136 0.01 229 0.01 1.0
daughter 70 0.00 171 0.01 11.6
daddy 353 0.02 624 0.02 5.5
grandma 58 0.00 207 0.01 35.5
aunty/auntie 74 0.00 179 0.01 11.8

TABLE 6 SOME FAMILY TERMS USED MORE BY MALES
WORD MALES M % FEMALES F % X2
mummy 742 0.04 755 0.03 59.6
mum 1647 0.10 1856 0.07 76.2
son 171 0.01 149 0.01 24.8
dad 941 0.05 1275 0.05 6.6

Since the Conversational Corpus has been grammatically tagged, it is worthwhile taking a look at patterns of part-of-speech distribution.

TABLE 7 Parts of Speech as Percentages of all word tokens
                 Males %   Females %        X2
Common Nouns        8.49        7.93    395.18
Proper Nouns        1.44        1.64    257.78
Pronouns           13.37       14.55   1016.27
Verbs              20.30       21.52    721.51


It turns out that male speakers favour common nouns, whereas female speakers favour proper nouns, personal pronouns and verbs. To an extent, this appears to bear out the hypothesis (Tannen 1991: 76-77) that male speech is more factual and concerned with reporting information, whereas female speech is more interactive and concerned with establishing and maintaining relationships ("report" versus "rapport").

The greater frequency of proper nouns in women's speech needs a slightly more subtle explanation. Our hypothesis was that, since the large majority of proper nouns in conversation consists of persons' names, this pattern of preference illustrates the tendency for women speakers to be more concerned with persons as individuals. To test this further, we examined the most frequent personal names in male and female speech, with the following result:

FIFTY MOST FREQUENT PERSONAL NAMES IN FEMALE SPEECH IN THE CORPUS15


God John David Paul Richard Ann Charlotte Michael Dave James Tim Jonathan Emma Chris Geoff Jim Mark June Christopher Peter Margaret Neil Jane Brian Ben Gary Arthur Amy Mary Phil Tony George Andrew Steve Sarah Lee Clare Martin Andy Mike Sue Jean Joe Bryony Scott Jenny Sally Pete Robert Stuart

FIFTY MOST FREQUENT PERSONAL NAMES IN MALE SPEECH IN THE CORPUS


God John Jean Paul Chris David Jesus Michael Tony Christ Mark Tim Ann Dave Peter Richard Mary Andrew Brian Jim June Nick Andy Sarah Christopher James George Jack Ken Steve Alex Robert Matt Ian Paula Bob Sue Raymond Colin Gary Bruce Phil Laura Alan Steven Joe Terry Mike May


It was seen that the 50 most common personal names in female speech account for 0.412 per cent of all word tokens, whereas the 50 most common personal names in male speech account for only 0.338 per cent of all word tokens. So women tend to use personal names considerably more than men do. The opposite tendency was observed with geographical names: men show a stronger predilection for place-names than women. The 50 most common place-names formed 0.096 per cent of all word tokens in male speech but only 0.064 per cent of all word tokens in female speech.

Before leaving the topic of gender, let us note that another way to investigate gender variation is to observe the lexical differentiation between the spoken data recorded by male and female respondents. This is easier to extract from the corpus, since each respondent's data constitutes a distinct file. However, it is obvious that the results here will be less easy to interpret, because female respondents will generally record a mixture of male and female speech. As one might expect, though, the results for male and female respondents are not all that different from the results for male and female speakers. Two examples are the female-favoured words nice and baby:

TABLE 8 Effects of Speaker Gender and Respondent Gender
Word F-Speakers F % M-Speakers M % X2
baby 411 0.02 115 0.01 70.6
nice 2851 0.11 1279 0.07 134.4
Word F-Resp F % M-Resp M % X2
baby 394 0.02 132 0.01 60.0
nice 2676 0.10 1446 0.08 72.1

It is noted that the figures for gender of respondents show a similar tendency to those for gender of speakers, but the effect is "diluted". This makes sense: one expects a woman respondent to record more female speech, and a male respondent to record more male speech. This bias is predictable because of the effect of self-recording, although an added explanation may be found in the tendency for people to build networks with people of the same gender16.

The provisional conclusion of this section is that the differences between male speech and female speech in the Conversational Corpus are pronounced (to judge from X2 values, they are greater than differences based on age or social group), but are matters of frequency, rather than of absolute choice. For example, although taboo words are much more frequent in male speech, they are by no means absent from female speech. A further investigation, which we have not been able to carry out, would be to find out how speech in the corpus varies according to the gender of the addressee, as well as the gender of the speaker.

3. Lexical Variation by Age Group

For simplicity, we divided the age groups into just two classes: speakers under 35, and speakers aged 35 or over. We will refer to these two classes, for convenience, as "younger speakers" and "older speakers". The words most significantly differentiating the two classes are as follows:

TABLE 9 Words used more by under-35's
WORD Under 35 % Over 35 % X2
mum 2603 0.14 850 0.04 1409.3
fucking 1457 0.08 259 0.01 1184.6
my 5768 0.31 4289 0.18 762.4
mummy 1176 0.06 321 0.01 755.2
like 9961 0.54 8612 0.36 745.2
na* 5085 0.27 3710 0.16 712.8
goes 1941 0.10 988 0.04 606.6
shit 742 0.03 72 0.00 410.1
dad 1419 0.08 763 0.03 403.7
daddy 728 0.04 247 0.01 380.1
me 7317 0.40 6825 0.29 371.9
what 16104 0.87 16855 0.71 357.3
fuck 380 0.02 58 0.00 330.1
wan* 1121 0.06 601 0.03 320.6
really 4043 0.22 3562 0.15 277.0
okay 1476 0.08 997 0.04 257.0
cos 5175 0.28 4855 0.20 254.4
just 8442 0.46 8531 0.36 251.8
why 2980 0.16 2534 0.11 240.0

*na is counted as a separate word in the combinations gonna (=going to) and wanna (=want to).

TABLE 10 Words used more by over-35's
WORD Under 35 % Over 35 % X2
yes 3958 0.21 12077 0.51 2365.0
well 10002 0.54 19203 0.81 1059.8
mm 6627 0.36 13338 0.56 895.2
er 14182 0.34 12388 0.52 773.8
they 14182 0.77 24073 1.01 682.2
said 5885 0.32 11006 0.46 538.3
says 1085 0.06 2908 0.12 443.1
were 3140 0.17 6201 0.26 385.8
the 40786 2.20 59293 2.49 352.2
of 12078 0.65 19119 0.80 314.6
and 32352 1.75 46464 1.95 224.7
to 25388 1.37 36828 1.54 211.2
mean 3727 0.20 6211 0.26 155.0
he 17109 0.92 24835 1.04 144.0
but 9551 0.52 14378 0.60 139.0
perhaps 216 0.01 673 0.03 136.0
that 30426 1.64 42723 1.79 131.3
see 4346 0.23 6932 0.29 122.1
had 4433 0.24 7034 0.30 118.3

As with the gender tables, these tables illustrate tendencies which are confirmed by items lower down the list. The tendency for older speakers to use yes is contrasted with the younger speakers' tendency to use the more informal variant yeah (X2 = 47.8). (Yep (X2 = 4.2) shows no strong bias either way17.) Perhaps related to this is the younger speakers' preference for certain interjections: okay (X2 = 257.0), ah (X2 = 163.7), ow (X2 = 156.9), hi (X2 = 85.1), hey (X2 = 113.1), ha (X2 = 106.6), no (X2 = 107.7), ooh (X2 = 77.5), wow (X2 = 70.0), hello (X2 = 88.7). Younger speakers, like male speakers, show a marked tendency in favour of certain taboo words: fucking (X2 = 1184.6), shit (X2 = 410.1), fuck (X2 = 330.1), crap (X2 = 155.1), arse (X2 = 48.0), bollocks (X2 = 92.1). More surprising, perhaps, is the stronger tendency to use the polite words please (X2 = 72.9), sorry (X2 = 36.5), pardon (X2 = 10.2), and excuse (as a verb, X2 = 42.7). Counter to this we note older speakers' preference for bloody (X2 = 14.7) and bugger (X2 = 12.2) - a sign that these pre-eminently British imprecations are nowadays flourishing more among older age-groups?

Other tendencies are more elusive, although it is worth noting that the following adjectives and adverbs are favoured by younger speakers: weird, massive, horrible, sick, funny, disgusting, brilliant; really, alright, basically.

In the transcribed speech of under-35s, the spellings wanna (= "want to") and gonna (= "going to") appear more frequently, presumably reflecting a greater tendency towards phonological reduction, consonant with the tendency to treat these expressions as quasi-auxiliary verbs. In this respect the speech of younger British speakers appears to be following the lead of American English.

Turning to the speech of older speakers, we note some words which are suggestive of hesitation, uncertainty or turn manipulation: well, mm, er. It is notable that says and said are more widely used in the speech of older speakers, whereas goes and going are more widely used by younger speakers. At least part of this difference may be due to young people's habit of using go as a verb of speech reporting.

4. Lexical Variation by Social Group

Although the occupationally-graded social class categories A-E are commonly used in market research and other social survey work, they are no more than a convenient way of stratifying the population of the country in terms of measurable criteria. However, we find once again that these subdivisions reveal highly significant differences of lexical frequency. As with age-groups, we have found it convenient to divide the population into two large groups: into an "upper" (A, B, C1) bracket and a "lower" (C2, D, E) bracket.

Tables 11 and 12 show the most significantly different frequency patterns in the Conversational Corpus:

TABLE 11 Words used more by social classes A/B/C1
WORD A/B/C1 % C2/D/E % X2
yes 6485 0.49 5089 0.28 876.9
really 2897 0.22 2833 0.16 155.1
okay 1073 0.08 858 0.05 136.5
are 5150 0.39 5622 0.31 127.8
actually 1159 0.09 983 0.05 119.7
just 5924 0.44 6707 0.37 103.5
good 2622 0.20 2748 0.15 90.0
you 38616 2.89 49263 2.72 82.6
erm 4874 0.36 5551 0.31 79.9
right 4468 0.33 5158 0.28 62.7
school 716 0.05 633 0.03 62.6
think 4849 0.36 5670 0.31 58.0
need 1023 0.08 992 0.05 57.4
your 4436 0.33 5199 0.29 51.5
basically 155 0.01 83 0.00 50.2
guy 153 0.01 84 0.00 47.5
sorry 592 0.04 546 0.03 42.9
hold 304 0.02 238 0.01 41.4
difficult 156 0.01 96 0.01 39.1
wicked 71 0.01 26 0.00 37.6
rice 79 0.01 32 0.00 37.5
class 144 0.01 88 0.00 36.6

TABLE 12 Words used more by social classes C2/D/E
WORD A/B/C1 % C2/D/E % X2
he 11308 0.85 19707 1.09 452.1
says 731 0.05 2332 0.13 432.0
said 4168 0.31 8178 0.45 379.7
fucking 235 0.02 1006 0.06 280.3
ain't 312 0.02 1031 0.06 202.6
yeah 14017 1.05 22132 1.22 197.3
its 122 0.01 571 0.03 174.8
them 3748 0.28 6550 0.36 153.4
aye 373 0.03 1031 0.06 144.6
she 8284 0.62 13249 0.73 137.9
bloody 598 0.04 1425 0.08 137.1
pound 356 0.03 939 0.05 118.3
I 43727 3.27 63450 3.50 116.3
hundred 572 0.04 1323 0.07 116.3
well 8657 0.65 13536 0.75 106.2
n't 20452 1.53 30414 1.68 102.6
mummy 264 0.02 728 0.04 101.6
that 21886 1.64 32378 1.79 97.4
they 11195 0.84 17081 0.94 93.0
him 2094 0.16 3689 0.20 91.5
were 2564 0.19 4313 0.24 74.5
four 1076 0.08 2012 0.11 72.7
bloke 77 0.01 294 0.02 71.3
five 1112 0.08 2056 0.11 69.6
thousand 219 0.02 562 0.03 66.2

Tendencies observable here are greater use of second-person pronouns and the emphatic adverbs actually and really among A/B/C1 speakers, as contrasted with the greater use of third-person pronouns, the verb says/said, numbers and swearwords among C2/D/E speakers. (It is notable that the distribution of taboo vocabulary is highly significant along all three dimensions of gender, age and social group, demonstrating (for those who needed it) that the archetypal user of swearwords is to be found among male speakers in the social range C2/D/E under the age of 35.) Two other comparisons of note are yes and yeah, and guy and bloke. In both cases, the former variant is more favoured in A/B/C1, and the latter in C2/D/E.

5. Spoken versus written

For comparison, we present below a table of lexical frequencies in the Conversational Corpus as a whole, compared to the written half of the Sampler Corpus (1 million words), taken as a representative subcorpus from the written part of the BNC. As before, the most significantly different items are listed first.

TABLE 13: Comparison of Conversational Corpus to written Sampler Corpus.
Word Written % X2 Spoken %
the 67114 6.59 47410.7 107495 2.36
of 32648 3.20 42372.5 33740 0.74
I 7061 0.69 21542.9 157308 3.46
you 4564 0.45 18996.2 125403 2.75
by 5894 0.58 13316.3 3143 0.07
n't 1789 0.18 12527.2 72441 1.59
it 8432 0.83 11817.5 120006 2.64
's 6755 0.66 10389.3 100836 2.21
in 20986 2.06 9394.6 42235 0.93
oh 224 0.02 8208.1 38852 0.85
do 1765 0.17 7783.3 50520 1.11
to 26622 2.61 6488.5 66865 1.47
which 3711 0.36 5861.5 3162 0.07
from 4720 0.46 5638.5 5243 0.12
got 425 0.04 5593.4 29024 0.64
as 6547 0.64 5338.4 9614 0.21
know 640 0.06 5237.9 29347 0.64
well 1076 0.11 4843.5 31254 0.69
what 1661 0.16 4793.4 35705 0.78
no 1889 0.19 4664.7 36849 0.81

The list holds no particular surprises in the light of previous corpus-based studies of spoken and written English (especially Biber 1988). For example, prepositions and the are highly associated with the 'informative' and nominal tendency of written language, whereas first- and second-person pronouns, contractions such as 's and interjections such as oh are associated with the 'interactive' or 'involved' tendency of spoken language (see Biber's Factor 1, Biber 1988: 56-90). What is of particular interest, however, is that the X2 values are very much larger than we have found in comparing the different varieties of speech from the Conversational Corpus. This is possibly due to the comparison of corpora of such unequal size.

6. Frequency lists and the X2 statistic

The software used is a statistics and concordance package developed by UCREL at Lancaster University. The frequency profiles it produces can be split according to a classification scheme based on SGML (Standard Generalised Mark-up Language) contained in the transcribed text.

Once the software system was familiarised with the format of the annotation applied to the BNC files and could retrieve the list of people from the file headers, it was possible to use a classification system to split the transcribed data based on any of the fields stored in the personal details section of a BNC file header.

Here is a section of the header from BNC file KB8:

<partics>
      <person age=4 dialect=XNC educ=2 flang=EN-GBR id=PS14B n=W0001
        role=other sex=f soc=AB>
        Age:          53
        BMRB code:    601
        BNC name:     Ann2
        Name:         Ann
        Occupation:   registered child minder
      </person>

The person id value (PS14B in this case) is unique within the BNC, and in the transcribed text we can follow references to it (in the form of utterance tags <u who=XX>) within the text to discover who is actually saying what. An example, from the same file, is:

<u who=PS14B>
<s n=00047>
 <w DTQ>What <w VBB>are <w PNP>you <w VDG>doing<c PUN>?
</u>
<u who=PS14F>
<s n=00048>
 <w VVG>Making <w AT0>a <w NN1>fortune <w NN1>teller<c PUN>.
</u>

(where grammatical word tags are marked <w XX>).
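
To show how such markup can be followed from utterance to speaker, here is a minimal Python sketch that attributes each tagged word to the person id on the enclosing utterance tag. It assumes only the simplified markup shown in the excerpt above (`<u who=XX>` ... `</u>` and `<w TAG>word`); the regular expressions and the function name `words_by_speaker` are illustrative assumptions and do not describe the UCREL package.

```python
import re
from collections import Counter, defaultdict

# The two utterances quoted above, in the simplified BNC markup.
SAMPLE = """<u who=PS14B>
<s n=00047>
 <w DTQ>What <w VBB>are <w PNP>you <w VDG>doing<c PUN>?
</u>
<u who=PS14F>
<s n=00048>
 <w VVG>Making <w AT0>a <w NN1>fortune <w NN1>teller<c PUN>.
</u>
"""

# An utterance: speaker id plus everything up to the closing tag.
UTTERANCE = re.compile(r"<u who=(\w+)>(.*?)</u>", re.DOTALL)
# A tagged word: part-of-speech tag, then the word form itself.
WORD = re.compile(r"<w (\w+)>([^<\s]+)")


def words_by_speaker(text):
    """Map each speaker id to a Counter of lower-cased word forms."""
    counts = defaultdict(Counter)
    for speaker, body in UTTERANCE.findall(text):
        for _tag, word in WORD.findall(body):
            counts[speaker][word.lower()] += 1
    return counts


profile = words_by_speaker(SAMPLE)
for speaker, counter in profile.items():
    print(speaker, dict(counter))
```

Punctuation marked with `<c PUN>` is deliberately left out of the counts here; whether the original frequency profiles include it is not stated.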

A classification scheme to split text from male and female speakers is defined as follows (* is a wildcard matching any string):

class Men
person age = *
person dialect = *
person educ = *
person flang = *
person id = *
person n = *
person resp = *
person role = *
person sex = m
person soc = *
class Women
person age = *
person dialect = *
person educ = *
person flang = *
person id = *
person n = *
person resp = *
person role = *
person sex = f
person soc = *
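
The effect of such a scheme can be sketched in Python as a set of wildcard patterns tested against each person's header attributes. This is an assumption-laden illustration: the helper names `make_class` and `classify`, the use of shell-style matching, and the treatment of a missing attribute as an empty string are ours, not a description of how the original software works.

```python
from fnmatch import fnmatchcase

# Attribute names used in the classification scheme above.
FIELDS = ["age", "dialect", "educ", "flang", "id", "n",
          "resp", "role", "sex", "soc"]


def make_class(**fixed):
    """Build a pattern set: '*' for every field unless a value is fixed."""
    return {field: fixed.get(field, "*") for field in FIELDS}


CLASSES = {
    "Men": make_class(sex="m"),
    "Women": make_class(sex="f"),
}


def classify(person):
    """Return the first class whose patterns all match, else None."""
    for name, patterns in CLASSES.items():
        # Assumption: an attribute absent from the header counts as "".
        if all(fnmatchcase(person.get(field, ""), pattern)
               for field, pattern in patterns.items()):
            return name
    return None


# Attributes of the person element quoted from file KB8.
ann = {"age": "4", "dialect": "XNC", "educ": "2", "flang": "EN-GBR",
       "id": "PS14B", "n": "W0001", "role": "other", "sex": "f",
       "soc": "AB"}
print(classify(ann))
```

Restricting a class further, for example to one social group, would only require fixing another field, as in `make_class(sex="f", soc="AB")`.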

Once defined, this classification will result in the standard word frequency profile having two columns, one for male speakers and one for female speakers; see Table 14.

TABLE 14: Sample frequency profile
Word Men Women Over/under-use Chi-squared Log-likelihood
i 55825 92945 X- 362.9 365.5
you 46547 72443 X- 33.8 33.8
it 46006 68376 X+ 3.7 3.7
the 44846 57128 X+ 692.0 685.0
's 38278 57673 X- 0.1 0.1
that 31166 43331 X+ 111.2 110.6
and 29833 50342 X- 249.7 251.8
a 28995 39631 X+ 152.3 151.4
n't 24796 44087 X- 447.1 452.6
to 23658 39861 X- 192.7 194.3
yeah 22168 28485 X+ 308.3 305.4
do 18258 29526 X- 59.9 60.2
he 16099 26607 X- 89.8 90.4
in 15960 24095 X- 0.2 0.2
they 15429 23553 X- 2.1 2.1
no 15007 19880 X+ 137.3 136.2
of 13999 17907 X+ 205.7 203.7
what 13894 19875 X+ 20.3 20.2

For each word we apply the X2 test to obtain a value which shows whether there is a significant difference across the observed frequencies. For a frequency profile with 2 columns the X2 value is significant (with 99% confidence with 1 degree of freedom) if it is greater than 6.63. Most of the X2 values quoted in this paper are significant at this level.

The value is calculated from the following formula:

$$ X^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} $$

where

$$ E_i = \frac{N_i \sum_j O_j}{\sum_j N_j} $$

and where $O_i$ is the observed frequency, $E_i$ is the expected frequency, and $N_i$ is the total frequency in class $i$. The X2 value becomes unreliable when the expected frequency is less than 5 (see Dunning 1993: 61-74), and possibly overestimates with high frequency words and when comparing a relatively small corpus to a much larger one. We usually omit lower frequency words from consideration. In fact (as Dunning suggests), we can use the Log-likelihood value (to which the X2 statistic is an approximation), given by the formula:

$$ -2 \ln \lambda = 2 \sum_i O_i \ln\left(\frac{O_i}{E_i}\right) $$

to produce a more accurate result in these cases. These values are shown in Table 14, but not reproduced elsewhere in this paper.
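
The two statistics can be sketched in Python as follows. The sketch assumes the usual Pearson form, X2 = sum of (Oi - Ei)^2 / Ei with Ei = Ni * (sum of O) / (sum of N), and the log-likelihood form 2 * sum of Oi * ln(Oi / Ei). The function names are assumptions introduced here, and the significance check uses the standard identity for one degree of freedom rather than anything specific to the software described. Counts are taken from Tables 1 to 4.

```python
import math


def expected(observed, totals):
    """Expected frequencies E_i = N_i * sum(O) / sum(N)."""
    pooled_rate = sum(observed) / sum(totals)
    return [n * pooled_rate for n in totals]


def chi_squared(observed, totals):
    """Pearson X2 = sum((O_i - E_i)^2 / E_i)."""
    return sum((o - e) ** 2 / e
               for o, e in zip(observed, expected(observed, totals)))


def log_likelihood(observed, totals):
    """2 * sum(O_i * ln(O_i / E_i)); empty cells contribute nothing."""
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected(observed, totals))
                     if o > 0)


def p_value_1df(x2):
    """Upper-tail probability of chi-squared with 1 degree of freedom."""
    return math.erfc(math.sqrt(x2 / 2.0))


# Word-token totals for male and female speakers (Table 1).
TOTALS = (1_714_443, 2_593_452)

# (male count, female count) for three words from Tables 2 to 4.
COUNTS = {"she": (7134, 22623), "fucking": (1401, 325), "mine": (505, 818)}

# Rank the words by X2, most strongly differentiating first.
ranked = sorted(COUNTS, key=lambda w: chi_squared(COUNTS[w], TOTALS),
                reverse=True)
for word in ranked:
    x2 = chi_squared(COUNTS[word], TOTALS)
    ll = log_likelihood(COUNTS[word], TOTALS)
    print(f"{word:8s} X2={x2:8.1f} LL={ll:8.1f} p={p_value_1df(x2):.3g}")
```

With the Table 1 totals, this reproduces to within rounding the X2 values printed for she and fucking, while mine falls well below the 6.63 threshold.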

As mentioned by Francis and Kucera (1982: 461), when we build a rank frequency list even from a representative corpus we need to bear in mind that the actual frequency of an item in the list may need adjustment. We might want to take into account its frequency distribution within the corpus. Particularly for the lower frequency words, a large proportion of the occurrences might take place in one conversation and this would skew the results. To counteract this effect, we can automatically apply an adjusted frequency measure, or an index of dispersion (see Carroll et al 1971: xxix). However, in this study we have relied on the X2 statistic and a count of the number of speakers for each item, checking hypotheses by looking more closely at actual concordance examples where necessary.
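
For readers unfamiliar with dispersion measures, the following Python sketch shows one widely used index, Juilland's D, alongside the simple speaker count relied on in this study. Juilland's D is chosen here purely for illustration and is not necessarily the measure intended in the works cited above; it assumes the corpus has been divided into parts of equal size, and the frequencies in the example are invented.

```python
import math


def juilland_d(part_freqs):
    """Juilland's D = 1 - V / sqrt(n - 1), with V = sd / mean.

    part_freqs holds one word's frequency in each of n equal-sized parts.
    D is 1 for a perfectly even spread and 0 when all occurrences
    fall in a single part.
    """
    n = len(part_freqs)
    if n < 2:
        raise ValueError("need at least two corpus parts")
    mean = sum(part_freqs) / n
    if mean == 0:
        raise ValueError("word does not occur in any part")
    # Population standard deviation of the per-part frequencies.
    sd = math.sqrt(sum((f - mean) ** 2 for f in part_freqs) / n)
    return 1.0 - (sd / mean) / math.sqrt(n - 1)


def speaker_range(per_speaker_counts):
    """Number of speakers who use the word at least once."""
    return sum(1 for count in per_speaker_counts.values() if count > 0)


# Hypothetical frequencies of one word in four equal-sized parts.
even = juilland_d([5, 5, 5, 5])
skewed = juilland_d([20, 0, 0, 0])
print(f"even spread: D = {even:.2f}; one conversation only: D = {skewed:.2f}")
```

A word with a high X2 value but a low D, or a low speaker count, would be a candidate for closer inspection in the concordance.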

7. Conclusion

As the preceding two paragraphs make clear, further refinements have yet to be made in the calculation of differences of lexical frequency profiles between different subcorpora of the BNC. Nevertheless, the present preliminary study has demonstrated the potential, using a quantitative corpus analysis tool, for investigating significant contrasts of lexical frequency between different text categories or subcorpora of the BNC. It is particularly needful to add to the X2 statistic an index of dispersion. Although this study has been largely based on frequency of orthographic lexical forms, use has also been made of frequency data for tagged word forms (e.g. distinguishing mine as a pronoun from mine as a noun). A further variant of the analysis would be to use frequency data for lemmas, since these are automatically retrievable from a grammatically tagged corpus. Yet a further development of the same paradigm of research would be to undertake comparisons on the basis of more abstract phenomena, such as syntactic constructions and word senses. This is also possible for the BNC, but will have to await the syntactic and semantic annotation of the corpus. The exploitation of the BNC for these purposes has only just begun.


Notes

1. We are grateful to Jane Sunderland for her comments on a draft of this paper, especially with reference to Section 2.

2. The BNC was compiled in the years 1991-1995 by a collaborative team consisting of Oxford University Press (the lead partner), Longman Group UK Ltd., Chambers Harrap, the British Library, and the Universities of Oxford (Oxford University Computing Services) and Lancaster (University Centre for Computer Corpus Research on Language). For details of availability for research purposes please contact Oxford University Computing Services, 13 Banbury Road, Oxford OX2 6NN, UK, or email natcorp@oucs.ox.ac.uk. Major funding for the project was provided by the Science and Engineering Research Council and by the Department of Trade and Industry.

The Conversational Corpus which is the subject of this article was collected, digitized, edited and transcribed by Longman, with agencies subcontracted to them.

3. For a description and rationale of the spoken component of the BNC, and in particular of the Conversational Corpus, see Crowdy 1993: 259-265, and 1995: 224-234.

4. This total counts multi-word units as separate orthographic words (e.g. in spite of counts as three not one).

5. There were 38 geographical locations in which recording took place, spread relatively evenly across the population of the country. Scotland and Northern Ireland were included in the north region, and Wales in the midland region.

6. There is also an age category 0-14, which has only a small representation in the corpus, since children were excluded from acting as respondents.

7. The categories are occupationally defined as follows: A - higher managerial, administrative or professional; B - intermediate managerial, administrative or professional; C1 - supervisory or clerical, and junior managerial, administrative or professional; C2 - skilled manual workers; D - semi- and unskilled manual workers; E - state pensioners or widows (no other earner), casual or lowest grade workers.

The classification of speakers and respondents in terms of these and other factors is not always exhaustive. Thus, there is a minority of speakers and respondents for whom the age, gender or social group is unknown. In practice, we have limited our study to data for which the relevant categories are known. This means that the total occurrences of a given word will not necessarily remain the same from one table to another.

8. Further details are given in L. Burnard, The Users Reference Guide for the British National Corpus, obtainable from Oxford University Computing Services (see note 2 above).

9. More precisely, female speakers in the corpus utter 51.27% more word tokens than male speakers. Each female speaker, on average, utters 44.53% more words than each male speaker. Each female speaker takes, on average, 33.32% more conversational turns than each male speaker. Turns are defined, simplistically, as stretches of speech spoken by one person and preceded and followed by the speech of someone else, or by silence. The fact that women's speech has a much larger representation in the corpus than men's speech appears to run counter to reports, in previous studies, of male verbosity in mixed-sex conversation (Coates 1986: 114-117). It should be remembered, however, that much of the BNC data will represent female-to-female talk.

10. In this and subsequent Tables, words are omitted if they occur in the Conversational Corpus less frequently than once every 10,000 words.

11. This finding accords with claims made by Lakoff (1975).

12. In contrast, use of the masculine pronouns he/him/his is almost even between female and male speakers.

13. Cf. somewhat similar findings on gender differences in pronoun usage in Hirschman 1994: 427-428.

14. Relevant here is Hudson's claim that frequency of nouns + pronouns relative to other parts of speech tends to be a constant across text type (Hudson 1994: 331-339).

15. These lists of names are included for interest, but it has to be borne in mind that the occurrence of personal names is often heavily biased by their frequent repetition in the recordings of particular respondents. It should also be noted that the list does not include surnames, which were anonymised in the original transcription.

16. For male respondent files there is 63.6% male speech and 36.4% female speech, whereas in female respondent files there is 22.7% male speech and 77.3% female speech.

17. It must be borne in mind that these are simply variant spellings of the same word, and may have been somewhat unreliably transcribed, on the basis of limited phonetic evidence. On the other hand, the result, in terms of differentiation between the age-groups, is so marked that it must reflect a real difference of linguistic behaviour.


References

Biber, D. (1988). Variation Across Speech and Writing. Cambridge University Press.

Burnard, L. (compiler) (1995), The Users Reference Guide for the British National Corpus. Oxford: Oxford University Computing Services.

Carroll, J.B., Davies, P., and Richman, B. (1971). The American Heritage Word Frequency Book. Houghton Mifflin Company, Boston.

Coates, J. (1986). Women, Men and Language. London: Longman, 114-117.

Crowdy, S. (1993), 'Spoken corpus design and transcription', Literary and Linguistic Computing, 8(4), 259-265.

Crowdy, S. (1995), 'The BNC Spoken Corpus', in G. Leech, G. Myers and J. Thomas (eds.), Spoken English on Computer: Transcription, Mark-up and Application, London: Longman, 224-234.

Dunning, T. (1993), 'Accurate methods for the statistics of surprise and coincidence', Computational Linguistics, 19(1), 61-74.

Francis, W.N., and Kucera, H. (1982). Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin Company, Boston.

Hirschman, L. (1994), 'Female-male differences in conversational interaction', Language in Society, 23, 427-442.

Hudson, R. A. (1994), 'About 37% of word-tokens are nouns', Language, 70, 331-339.

Lakoff, R. (1975) Language and Woman's Place. New York: Harper and Row.

Tannen, D. (1991), You Just Don't Understand: Women and Men in Conversation. London: Virago Press, 76-77.