Paul Rayson, Geoffrey Leech and Mary Hodges
UCREL (University Centre for Computer Corpus Research on Language)
Lancaster University,
Lancaster LA1 4YT,
United Kingdom.
In this article we undertake selective quantitative analyses of the demographically-sampled spoken English component of the British National Corpus (for brevity, referred to here as the "Conversational Corpus"). This is a subcorpus of c.4.5 million words, in which speakers and respondents (see section 1 below) are identified by such factors as gender, age, social group and geographical region. Using a corpus analysis tool developed at Lancaster, we undertake a comparison of the vocabulary of speakers, highlighting those differences which are marked by a very high χ² value of difference between different sectors of the corpus according to gender, age and social group. A fourth variable, that of geographical region of the United Kingdom, is not investigated in this article, although it remains a promising subject for future research. (As background we also briefly examine differences between spoken and written material in the British National Corpus (BNC).) This study is illustrative of the potentiality of the Conversational Corpus for future corpus-based research on social differentiation in the use of language. There are evident limitations, including (a) the reliance on vocabulary frequency lists, and (b) the simplicity of the transcription system employed for the spoken part of the BNC. The conclusion of the article considers future advances in the research paradigm illustrated here.
Keywords: British National Corpus, spoken English vocabulary frequency, chi-squared test
The British National Corpus (BNC) is a c.100-million-word corpus
of present-day British English, containing a c.10-million-word
subcorpus of spoken language recorded in the period 1991-1993 [2].
The spoken subcorpus, in its turn, is subdivided into a part sampled
by demographic methods (the "Conversational Corpus"),
and a part sampled by context-governed methods [3].
The Conversational Corpus, consisting of 4,552,555 words [4], was
collected by the following method. A market-research firm, the
British Market Research Bureau, sampled the population of the
UK (over the age of 15) using well-tried methods of social survey
research. Individuals willing to take part in the project were
equipped with a high-quality Walkman sound recorder, and recorded
any linguistic transactions in which they engaged during a period
of two days. These individuals, whom we will henceforth call
"respondents", numbered 153, and were sampled in order
to obtain a good representation of the population of the UK, given
the unavoidable practical limitation in the number of respondents,
according to
Region: south, midland, north [5]
Gender: male, female
Age: 15-24; 25-34; 35-44; 45-59; 60 and over [6]
Social Group: A, B, C1, C2, D, E [7]
In addition, respondents undertook to obtain the permission of
other speakers with whom they engaged in conversation, and to
note for each speaker details of gender, age, social group, etc. [8]
Thus the Conversational Corpus provides an unparalleled resource
for investigating, on a large scale, the conversational behaviour
of the British population in the 1990s. In the longer term it
will be possible to use the corpus, and its various subdivisions,
for the investigation of many lexical, grammatical, sociolinguistic,
or other aspects of the productions of British native speakers
of English. We are, however, conscious of the limitations imposed
by the corpus's simple orthographic transcription, and in the
present study are restricting our attention to phenomena of lexical
frequency.
The software used to investigate lexical frequency in the corpus
is described further in section 6 below. The particular virtue
of the software for the present purpose is that it can be used
to
(a) produce two parallel frequency lists derived from different
sectors of the corpus;
(b) sort these lists not only in alphabetical or in rank order,
but according to the order of significance of
χ² (chi-square) values,
so that (for example) words which are most significant in differentiating
the speech of females from the speech of males are ordered at
the top of the list;
(c) provide a KWIC concordance of any word according to need;
(d) provide (a)-(c) above not only for orthographic words, but
for part-of-speech tagged words, i.e. grammatical forms or lexemes,
or the grammatical tags themselves.
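Capability (c) can be illustrated with a minimal Python sketch of a KWIC (key word in context) concordance. The function name `kwic`, its parameters and the whitespace tokenisation are assumptions made for illustration; they do not describe the Lancaster tool itself.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, hit, right context) for each match of keyword.

    Matching is case-insensitive; `width` is the number of tokens of
    context kept on each side of the hit.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Toy input: a short exchange, already split into tokens.
tokens = "What are you doing ? Making a fortune teller .".split()
lines = kwic(tokens, "a", width=2)
for left, hit, right in lines:
    # Right-align the left context so that the key words line up.
    print(f"{left:>20}  [{hit}]  {right}")
```

A real concordancer would work on the tagged corpus files rather than on a whitespace-split string, but the alignment of hits around a fixed centre column is the same idea.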
In referring to "sectors of the corpus" above, we mean
not only subcorpora identified by header information regarding
the gender, age, etc. of the respondent, but also utterances extracted
from the corpus on the basis of the gender, age, etc. of the speaker.
The ability to isolate utterance information of this kind is another
advantage of the software.
In the following sections 2-4, we examine some of the salient
results of the lexical frequency analysis of gender variation
(section 2), age variation (section 3) and social group variation
(section 4). The social interest of these results, although partly
predictable, is obvious. In section 5, we turn to some salient
contrasts between the spoken part and the written part of the
BNC, as shown in the two-million-word "Sampler Corpus",
a broadly representative subsample of the whole BNC.
Using the whole of the Conversational Corpus material for which
gender of speaker is indicated, we find that female speakers have
a larger share of the corpus than male speakers according to a
number of different measures. Firstly, there is a small built-in
bias in the corpus, in that 75 female respondents but only 73
male respondents were enlisted as volunteers to participate in
the collection of data. In addition, female speakers overall took
a larger share of the language collected, as shown by these figures:
Measure | Female Speakers | Male Speakers |
Number of Speakers | 561 | 536 |
Number of turns | 250,955 | 179,844 |
Number of words spoken | 2,593,452 | 1,714,443 |
Number of turns per speaker | 447.33 | 335.53 |
Number of words per turn | 10.33 | 9.53 |
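The derived rows of this table, and the female-to-male ratios quoted in the text, follow from the raw counts. The short Python sketch below recomputes them; the variable and function names are illustrative only, and the results agree with the published figures to within rounding.

```python
# Raw counts copied from the table above.
COUNTS = {
    "female": {"speakers": 561, "turns": 250_955, "words": 2_593_452},
    "male": {"speakers": 536, "turns": 179_844, "words": 1_714_443},
}


def turns_per_speaker(c):
    return c["turns"] / c["speakers"]


def words_per_turn(c):
    return c["words"] / c["turns"]


def words_per_speaker(c):
    return c["words"] / c["speakers"]


def percent_more(a, b):
    """How much larger a is than b, as a percentage of b."""
    return 100 * (a / b - 1)


f, m = COUNTS["female"], COUNTS["male"]
more_tokens = percent_more(f["words"], m["words"])
more_turns_each = percent_more(turns_per_speaker(f), turns_per_speaker(m))
more_words_each = percent_more(words_per_speaker(f), words_per_speaker(m))

print(f"turns per speaker: F {turns_per_speaker(f):.2f}, M {turns_per_speaker(m):.2f}")
print(f"words per turn:    F {words_per_turn(f):.2f}, M {words_per_turn(m):.2f}")
print(f"female tokens exceed male tokens by {more_tokens:.2f}%")
```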
Not only were there more female respondents, but also more female
speakers, who on the whole took more turns and longer turns than
the male speakers. All this led to a greater female than male
representation in the Conversational Corpus. It should be borne
in mind, therefore, in the following tables, that for every 100
word tokens spoken by men in the demographic corpus, 151 were
spoken by women [9]. Because of this disparity, the normalized frequency
of each word is presented in the following tables, in the form
of a percentage of all word tokens (M% = percentage of the number
of word tokens spoken by males; F% = percentage of the number
of word tokens spoken by females). The
χ² value is based on comparison of such normalized
frequencies. The 25 most significant words showing overrepresentation
in male and female speech are the following, in order of significance.
WORD | MALES | M % | FEMALES | F % | χ² |
fucking | 1401 | 0.08 | 325 | 0.01 | 1233.1 |
er | 9589 | 0.56 | 9307 | 0.36 | 945.4 |
the | 44617 | 2.60 | 57128 | 2.20 | 698.0 |
yeah | 22050 | 1.29 | 28485 | 1.10 | 310.3 |
aye | 1214 | 0.07 | 876 | 0.03 | 291.8 |
right | 6163 | 0.36 | 6945 | 0.27 | 276.0 |
hundred | 1488 | 0.09 | 1234 | 0.05 | 251.1 |
fuck | 335 | 0.02 | 107 | 0.00 | 239.0 |
is | 13608 | 0.79 | 17283 | 0.67 | 233.3 |
of | 13907 | 0.81 | 17907 | 0.69 | 203.6 |
two | 4347 | 0.25 | 5022 | 0.19 | 170.3 |
three | 2753 | 0.16 | 2959 | 0.11 | 168.2 |
a | 28818 | 1.68 | 39631 | 1.53 | 151.6 |
four | 2160 | 0.13 | 2279 | 0.09 | 145.5 |
ah | 2395 | 0.14 | 2583 | 0.10 | 143.6 |
no | 14942 | 0.87 | 19880 | 0.77 | 140.8 |
number | 615 | 0.04 | 463 | 0.02 | 133.9 |
quid | 484 | 0.03 | 339 | 0.01 | 124.2 |
one | 9915 | 0.58 | 12932 | 0.50 | 123.6 |
mate | 262 | 0.02 | 129 | 0.00 | 120.8 |
which | 1477 | 0.09 | 1498 | 0.06 | 120.5 |
okay | 1313 | 0.08 | 1298 | 0.05 | 119.9 |
that | 31014 | 1.81 | 43331 | 1.67 | 114.2 |
guy | 211 | 0.01 | 95 | 0.00 | 108.6 |
da | 459 | 0.03 | 338 | 0.01 | 105.3 |
yes | 7102 | 0.41 | 9167 | 0.35 | 101.0 |
WORD | MALES | M % | FEMALES | F % | χ² |
she | 7134 | 0.42 | 22623 | 0.87 | 3109.7 |
her | 2333 | 0.14 | 7275 | 0.28 | 965.4 |
said | 4965 | 0.29 | 12280 | 0.47 | 872.0 |
n't | 24653 | 1.44 | 44087 | 1.70 | 443.9 |
I | 55516 | 3.24 | 92945 | 3.58 | 357.9 |
and | 29677 | 1.73 | 50342 | 1.94 | 245.3 |
to | 23467 | 1.37 | 39861 | 1.54 | 198.6 |
cos | 3369 | 0.20 | 6829 | 0.26 | 194.6 |
oh | 13378 | 0.78 | 23310 | 0.90 | 170.2 |
Christmas | 288 | 0.02 | 1001 | 0.04 | 163.9 |
thought | 1573 | 0.09 | 3485 | 0.13 | 159.7 |
lovely | 414 | 0.02 | 1214 | 0.05 | 140.3 |
nice | 1279 | 0.07 | 2851 | 0.11 | 134.4 |
mm | 7189 | 0.42 | 12891 | 0.50 | 133.8 |
had | 4040 | 0.24 | 7600 | 0.29 | 125.9 |
did | 6415 | 0.37 | 11424 | 0.44 | 109.6 |
going | 3139 | 0.18 | 5974 | 0.23 | 109.0 |
because | 1919 | 0.11 | 3861 | 0.15 | 105.0 |
him | 2710 | 0.16 | 5188 | 0.20 | 99.2 |
really | 2646 | 0.15 | 5070 | 0.20 | 97.6 |
school | 501 | 0.03 | 1265 | 0.05 | 96.3 |
he | 15993 | 0.93 | 26607 | 1.03 | 90.4 |
think | 4980 | 0.29 | 8899 | 0.34 | 88.8 |
home | 734 | 0.04 | 1662 | 0.06 | 84.0 |
me | 5182 | 0.30 | 9186 | 0.35 | 83.5 |
Perhaps the most notable (though predictable) finding illustrated
in this table is the tendency for taboo words ("swear words")
to be more characteristic of male speech than female speech [11]. (This
tendency is found not only with fucking, but with other
"four-letter words" lower down the list: shit (χ² = 37.4),
hell (χ² = 22.8), crap (χ² = 44.3).)
Another tendency is for males to use number words: not only hundred
(χ² = 251.1) and one (χ² = 123.6) but, for example, three (χ² = 168.2),
two (χ² = 170.3) and four (χ² = 145.5).
Females, on the other hand, make strikingly greater use of the
feminine pronoun she/her/hers [12] and also of the first-person
pronoun I/me/my/mine [13].
WORD | MALES | M % | FEMALES | F % | χ² |
she | 7134 | 0.42 | 22623 | 0.87 | 3109.7 |
her | 2333 | 0.14 | 7275 | 0.28 | 965.4 |
hers | 26 | 0.00 | 110 | 0.00 | 24.3 |
I | 55516 | 3.24 | 92945 | 3.58 | 357.9 |
me | 5182 | 0.30 | 9186 | 0.35 | 83.5 |
mine | 505 | 0.03 | 818 | 0.03 | 1.5 |
As Table 2 shows, men are more likely to use the filled-pause
marker er, and certain informal interjections or word-isolates,
such as yeah, aye, okay, ah (χ² = 143.6), eh (χ² = 77.5), and
hmm (χ² = 28.5).
Women, on the other hand, make more use of yes (χ² = 101.0),
mm (χ² = 133.8) and really (χ² = 97.6).
The preference for the and of in male speech in
Table 2 may appear more puzzling, but accords with the stronger
male preference for nouns (in particular common nouns) over verbs
and pronouns (see Table 7). One would hypothesise, on the basis
of figures presented in Table 7 below, that male speech shows
a stronger propensity to build noun phrases (including articles
and common nouns) where female speech has a tendency to rely more
on pronouns and proper nouns [14].
Many fascinating lexical patterns of gender preference could be
followed up here, and their reasons explored, but we will content
ourselves with mentioning one other area - that of family relationships
- where the sexes are strongly differentiated in lexical preference.
On the whole, women are more strongly oriented towards family
terms, although there are some interesting exceptions.
WORD | MALES | M % | FEMALES | F % | χ² |
mother | 272 | 0.02 | 627 | 0.02 | 34.2 |
father | 115 | 0.01 | 307 | 0.01 | 27.7 |
sister | 105 | 0.01 | 257 | 0.01 | 17.6 |
brother | 136 | 0.01 | 229 | 0.01 | 1.0 |
daughter | 70 | 0.00 | 171 | 0.01 | 11.6 |
daddy | 353 | 0.02 | 624 | 0.02 | 5.5 |
grandma | 58 | 0.00 | 207 | 0.01 | 35.5 |
aunty/auntie | 74 | 0.00 | 179 | 0.01 | 11.8 |
WORD | MALES | M % | FEMALES | F % | χ² |
mummy | 742 | 0.04 | 755 | 0.03 | 59.6 |
mum | 1647 | 0.10 | 1856 | 0.07 | 76.2 |
son | 171 | 0.01 | 149 | 0.01 | 24.8 |
dad | 941 | 0.05 | 1275 | 0.05 | 6.6 |
Since the Conversational Corpus has been grammatically tagged,
it is worthwhile taking a look at patterns of part-of-speech distribution.
Part of speech | Males % | Females % | χ² |
Common Nouns | 8.49 | 7.93 | 395.18 |
Proper Nouns | 1.44 | 1.64 | 257.78 |
Pronouns | 13.37 | 14.55 | 1016.27 |
Verbs | 20.30 | 21.52 | 721.51 |
It turns out that male speakers favour common nouns, whereas female
speakers favour proper nouns, personal pronouns and verbs. To
an extent, this appears to bear out the hypothesis (Tannen 1991:
76-77) that male speech is more factual and concerned with reporting
information, whereas female speech is more interactive and concerned
with establishing and maintaining relationships ("report"
versus "rapport").
The greater frequency of proper nouns in women's speech needs
a slightly more subtle explanation. Our hypothesis was that, since
the large majority of proper nouns in conversation is comprised
of persons' names, this pattern of preference illustrates the
tendency for women speakers to be more concerned with persons
as individuals. To test this further, we examined the most frequent
personal names in male and female speech, with the following result:
FIFTY MOST FREQUENT PERSONAL NAMES IN FEMALE SPEECH IN THE CORPUS [15]
FIFTY MOST FREQUENT PERSONAL NAMES IN MALE SPEECH IN THE CORPUS
God John Jean Paul Chris David Jesus Michael Tony Christ Mark Tim Ann Dave Peter Richard Mary Andrew Brian Jim June Nick Andy Sarah Christopher James George Jack Ken Steve Alex Robert Matt Ian Paula Bob Sue Raymond Colin Gary Bruce Phil Laura Alan Steven Joe Terry Mike May
It was seen that the 50 most common personal names in female speech
account for 0.412 per cent of all word tokens, where the 50 most
common personal names in male speech account for only 0.338 per
cent of all word tokens. So women tend to use personal names
considerably more than men do. The opposite tendency was observed
with geographical names: men show a stronger predilection for
place-names than women. The 50 most common place-names formed
0.096 per cent of all word tokens in male speech but only 0.064
per cent of all word tokens in female speech.
Before leaving the topic of gender, let us note that another way
to investigate gender variation is to observe the lexical differentiation
between the spoken data recorded by male and female respondents.
This is easier to extract from the corpus, since each respondent's
data constitutes a distinct file. However, it is obvious that
the results here will be less easy to interpret, because female
respondents will generally record a mixture of male and female speech.
As one might expect, though, the results for male and female
respondents are not all that different from the results for male
and female speakers. Two examples are the female-favoured
words nice and baby:
Word | F-Speakers | F % | M-Speakers | M % | χ² |
baby | 411 | 0.02 | 115 | 0.01 | 70.6 |
nice | 2851 | 0.11 | 1279 | 0.07 | 134.4 |
Word | F-Resp | F % | M-Resp | M % | χ² |
baby | 394 | 0.02 | 132 | 0.01 | 60.0 |
nice | 2676 | 0.10 | 1446 | 0.08 | 72.1 |
It is noted that the figures for gender of respondents show a
similar tendency to those for gender of speakers, but the effect
is "diluted". This makes sense: one expects a woman
respondent to record more female speech, and a male respondent
to record more male speech. This bias is predictable because of
the effect of self-recording, although an added explanation may
be found in the tendency for people to build networks with people
of the same gender [16].
The provisional conclusion of this section is that the differences
between male speech and female speech in the Conversational Corpus
are pronounced (to judge from χ²
values, they are greater than differences
based on age or social group), but are matters of frequency, rather
than of absolute choice. For example, although taboo words are
much more frequent in male speech, they are by no means absent
from female speech. A further investigation, which we have not
been able to carry out, would be to find out how speech in the
corpus varies according to the gender of the addressee, as well
as the gender of the speaker.
For simplicity, we divided the age groups into just two classes:
speakers under 35, and speakers aged 35 or over. We will refer
to these two classes, for convenience, as "younger speakers"
and "older speakers". The words most significantly differentiating
the two classes are as follows:
WORD | Under 35 | % | Over 35 | % | χ² |
mum | 2603 | 0.14 | 850 | 0.04 | 1409.3 |
fucking | 1457 | 0.08 | 259 | 0.01 | 1184.6 |
my | 5768 | 0.31 | 4289 | 0.18 | 762.4 |
mummy | 1176 | 0.06 | 321 | 0.01 | 755.2 |
like | 9961 | 0.54 | 8612 | 0.36 | 745.2 |
na* | 5085 | 0.27 | 3710 | 0.16 | 712.8 |
goes | 1941 | 0.10 | 988 | 0.04 | 606.6 |
shit | 742 | 0.03 | 72 | 0.00 | 410.1 |
dad | 1419 | 0.08 | 763 | 0.03 | 403.7 |
daddy | 728 | 0.04 | 247 | 0.01 | 380.1 |
me | 7317 | 0.40 | 6825 | 0.29 | 371.9 |
what | 16104 | 0.87 | 16855 | 0.71 | 357.3 |
fuck | 380 | 0.02 | 58 | 0.00 | 330.1 |
wan* | 1121 | 0.06 | 601 | 0.03 | 320.6 |
really | 4043 | 0.22 | 3562 | 0.15 | 277.0 |
okay | 1476 | 0.08 | 997 | 0.04 | 257.0 |
cos | 5175 | 0.28 | 4855 | 0.20 | 254.4 |
just | 8442 | 0.46 | 8531 | 0.36 | 251.8 |
why | 2980 | 0.16 | 2534 | 0.11 | 240.0 |
*na is counted as a separate word in the combinations
gonna (=going to) and wanna (=want to).
WORD | Under 35 | % | Over 35 | % | χ² |
yes | 3958 | 0.21 | 12077 | 0.51 | 2365.0 |
well | 10002 | 0.54 | 19203 | 0.81 | 1059.8 |
mm | 6627 | 0.36 | 13338 | 0.56 | 895.2 |
er | 14182 | 0.34 | 12388 | 0.52 | 773.8 |
they | 14182 | 0.77 | 24073 | 1.01 | 682.2 |
said | 5885 | 0.32 | 11006 | 0.46 | 538.3 |
says | 1085 | 0.06 | 2908 | 0.12 | 443.1 |
were | 3140 | 0.17 | 6201 | 0.26 | 385.8 |
the | 40786 | 2.20 | 59293 | 2.49 | 352.2 |
of | 12078 | 0.65 | 19119 | 0.80 | 314.6 |
and | 32352 | 1.75 | 46464 | 1.95 | 224.7 |
to | 25388 | 1.37 | 36828 | 1.54 | 211.2 |
mean | 3727 | 0.20 | 6211 | 0.26 | 155.0 |
he | 17109 | 0.92 | 24835 | 1.04 | 144.0 |
but | 9551 | 0.52 | 14378 | 0.60 | 139.0 |
perhaps | 216 | 0.01 | 673 | 0.03 | 136.0 |
that | 30426 | 1.64 | 42723 | 1.79 | 131.3 |
see | 4346 | 0.23 | 6932 | 0.29 | 122.1 |
had | 4433 | 0.24 | 7034 | 0.30 | 118.3 |
As with the gender tables, these tables illustrate tendencies
which are confirmed by items lower down the list. The tendency
for older speakers to use yes is contrasted with the younger
speakers' tendency to use the more informal variant yeah
(χ² = 47.8). (Yep (χ² = 4.2) shows no strong bias either way [17].)
Perhaps related to this is the younger speakers' preference for
certain interjections: okay (χ² = 257.0), ah (χ² = 163.7),
ow (χ² = 156.9), hi (χ² = 85.1), hey (χ² = 113.1), ha (χ² = 106.6),
no (χ² = 107.7), ooh (χ² = 77.5), wow (χ² = 70.0) and
hello (χ² = 88.7).
Younger speakers, like male speakers, show a marked tendency
in favour of certain taboo words: fucking (χ² = 1184.6),
shit (χ² = 410.1), fuck (χ² = 330.1), crap (χ² = 155.1),
arse (χ² = 48.0) and bollocks (χ² = 92.1).
More surprising, perhaps, is the stronger tendency to use the
polite words please (χ² = 72.9), sorry (χ² = 36.5),
pardon (χ² = 10.2) and excuse (as a verb, χ² = 42.7).
Counter to this we note older speakers' preference for bloody
(χ² = 14.7) and bugger (χ² = 12.2)
- a sign that these pre-eminently British imprecations are nowadays
flourishing more among older age-groups?
Other tendencies are more elusive, although it is worth notice
that the following adjectives and adverbs are favoured by younger
speakers: weird, massive, horrible, sick, funny, disgusting,
brilliant; really, alright, basically.
In the transcribed speech of under-35s, the spellings wanna
(= "want to") and gonna (="going to")
more frequently appear in the transcription, presumably reflecting
a greater tendency towards phonological reduction, consonant with
the tendency to treat these expressions as quasi-auxiliary verbs.
In this respect the speech of younger British speakers appears
to be following the lead of American English.
Turning to the speech of older speakers, we note some words which
are suggestive of hesitation, uncertainty or turn manipulation:
well, mm, er. It is notable that says and
said are more widely used in the speech of older speakers,
whereas goes and going are more widely used by younger
speakers. At least part of this difference may be due to young
people's habit of using go as a verb of speech reporting.
Although the occupationally-graded social class categories A-E
are commonly used in market research and other social survey work,
they are no more than a convenient way of stratifying the population
of the country in terms of measurable criteria. However, we find
once again that these subdivisions reveal highly significant differences
of lexical frequency. As with age-groups, we have found it convenient
to divide the population into two large groups: into an "upper"
(A, B, C1) bracket and a "lower" (C2, D, E) bracket.
Tables 11 and 12 show the most significantly different frequency
patterns in the Conversational Corpus:
WORD | A/B/C1 | % | C2/D/E | % | χ² |
yes | 6485 | 0.49 | 5089 | 0.28 | 876.9 |
really | 2897 | 0.22 | 2833 | 0.16 | 155.1 |
okay | 1073 | 0.08 | 858 | 0.05 | 136.5 |
are | 5150 | 0.39 | 5622 | 0.31 | 127.8 |
actually | 1159 | 0.09 | 983 | 0.05 | 119.7 |
just | 5924 | 0.44 | 6707 | 0.37 | 103.5 |
good | 2622 | 0.20 | 2748 | 0.15 | 90.0 |
you | 38616 | 2.89 | 49263 | 2.72 | 82.6 |
erm | 4874 | 0.36 | 5551 | 0.31 | 79.9 |
right | 4468 | 0.33 | 5158 | 0.28 | 62.7 |
school | 716 | 0.05 | 633 | 0.03 | 62.6 |
think | 4849 | 0.36 | 5670 | 0.31 | 58.0 |
need | 1023 | 0.08 | 992 | 0.05 | 57.4 |
your | 4436 | 0.33 | 5199 | 0.29 | 51.5 |
basically | 155 | 0.01 | 83 | 0.00 | 50.2 |
guy | 153 | 0.01 | 84 | 0.00 | 47.5 |
sorry | 592 | 0.04 | 546 | 0.03 | 42.9 |
hold | 304 | 0.02 | 238 | 0.01 | 41.4 |
difficult | 156 | 0.01 | 96 | 0.01 | 39.1 |
wicked | 71 | 0.01 | 26 | 0.00 | 37.6 |
rice | 79 | 0.01 | 32 | 0.00 | 37.5 |
class | 144 | 0.01 | 88 | 0.00 | 36.6 |
WORD | A/B/C1 | % | C2/D/E | % | χ² |
he | 11308 | 0.85 | 19707 | 1.09 | 452.1 |
says | 731 | 0.05 | 2332 | 0.13 | 432.0 |
said | 4168 | 0.31 | 8178 | 0.45 | 379.7 |
fucking | 235 | 0.02 | 1006 | 0.06 | 280.3 |
ain't | 312 | 0.02 | 1031 | 0.06 | 202.6 |
yeah | 14017 | 1.05 | 22132 | 1.22 | 197.3 |
its | 122 | 0.01 | 571 | 0.03 | 174.8 |
them | 3748 | 0.28 | 6550 | 0.36 | 153.4 |
aye | 373 | 0.03 | 1031 | 0.06 | 144.6 |
she | 8284 | 0.62 | 13249 | 0.73 | 137.9 |
bloody | 598 | 0.04 | 1425 | 0.08 | 137.1 |
pound | 356 | 0.03 | 939 | 0.05 | 118.3 |
I | 43727 | 3.27 | 63450 | 3.50 | 116.3 |
hundred | 572 | 0.04 | 1323 | 0.07 | 116.3 |
well | 8657 | 0.65 | 13536 | 0.75 | 106.2 |
n't | 20452 | 1.53 | 30414 | 1.68 | 102.6 |
mummy | 264 | 0.02 | 728 | 0.04 | 101.6 |
that | 21886 | 1.64 | 32378 | 1.79 | 97.4 |
they | 11195 | 0.84 | 17081 | 0.94 | 93.0 |
him | 2094 | 0.16 | 3689 | 0.20 | 91.5 |
were | 2564 | 0.19 | 4313 | 0.24 | 74.5 |
four | 1076 | 0.08 | 2012 | 0.11 | 72.7 |
bloke | 77 | 0.01 | 294 | 0.02 | 71.3 |
five | 1112 | 0.08 | 2056 | 0.11 | 69.6 |
thousand | 219 | 0.02 | 562 | 0.03 | 66.2 |
Tendencies observable here are greater use of second-person pronouns
and the emphatic adverbs actually and really among
A/B/C1 speakers, as contrasted with the greater use of third-person
pronouns, the verb says/said, numbers and swearwords among
C2/D/E speakers. (It is notable that the distribution of taboo
vocabulary is highly significant along all three dimensions of
gender, age and social group, demonstrating (for those who needed
it) that the archetypal user of swearwords is to be found among
male speakers in the social range C2/D/E under the age of 35.)
Two other comparisons of note are yes and yeah,
and guy and bloke. In both cases, the former variant
is more favoured in A/B/C1, and the latter in C2/D/E.
For comparison, we present below a table of lexical frequencies
in the Conversational Corpus as a whole, compared to the written
half of the Sampler Corpus (1 million words), taken as a representative
subcorpus from the written part of the BNC. As before the most
significantly different items are listed first.
Word | Written | % | χ² | Spoken | % |
the | 67114 | 6.59 | 47410.7 | 107495 | 2.36 |
of | 32648 | 3.20 | 42372.5 | 33740 | 0.74 |
I | 7061 | 0.69 | 21542.9 | 157308 | 3.46 |
you | 4564 | 0.45 | 18996.2 | 125403 | 2.75 |
by | 5894 | 0.58 | 13316.3 | 3143 | 0.07 |
n't | 1789 | 0.18 | 12527.2 | 72441 | 1.59 |
it | 8432 | 0.83 | 11817.5 | 120006 | 2.64 |
's | 6755 | 0.66 | 10389.3 | 100836 | 2.21 |
in | 20986 | 2.06 | 9394.6 | 42235 | 0.93 |
oh | 224 | 0.02 | 8208.1 | 38852 | 0.85 |
do | 1765 | 0.17 | 7783.3 | 50520 | 1.11 |
to | 26622 | 2.61 | 6488.5 | 66865 | 1.47 |
which | 3711 | 0.36 | 5861.5 | 3162 | 0.07 |
from | 4720 | 0.46 | 5638.5 | 5243 | 0.12 |
got | 425 | 0.04 | 5593.4 | 29024 | 0.64 |
as | 6547 | 0.64 | 5338.4 | 9614 | 0.21 |
know | 640 | 0.06 | 5237.9 | 29347 | 0.64 |
well | 1076 | 0.11 | 4843.5 | 31254 | 0.69 |
what | 1661 | 0.16 | 4793.4 | 35705 | 0.78 |
no | 1889 | 0.19 | 4664.7 | 36849 | 0.81 |
The lists hold no particular surprises in the light of previous
corpus-based studies of spoken and written English (especially
Biber 1988). For example, prepositions and the are highly
associated with the 'informative' and nominal tendency of written
language, whereas first- and second-person pronouns, contractions
such as 's and interjections such as oh are associated
with the 'interactive' or 'involved' tendency of spoken language
(see Biber's Factor 1, Biber 1988: 56-90). What is of particular
interest, however, is that the χ²
values are very much larger than we have found in comparing the
different varieties of speech from the Conversational Corpus.
This is possibly due to the comparison of such unequally sized corpora.
The software used is a statistics and concordance package developed
by UCREL at Lancaster University. The frequency profiles it produces
can be split according to a classification scheme based on SGML
(Standard Generalised Mark-up Language) contained in the transcribed
text.
Once the software system was familiarised with the format of the
annotation applied to the BNC files and could retrieve the list
of people from the file headers, it was possible to use a classification
system to split the transcribed data based on any of the fields
stored in the personal details section of a BNC file header.
Here is a section of the header from BNC file KB8:
<partics>
<person age=4 dialect=XNC educ=2 flang=EN-GBR id=PS14B n=W0001 role=other sex=f soc=AB>
Age: 53
BMRB code: 601
BNC name: Ann2
Name: Ann
Occupation: registered child minder
</person>
The person id value (PS14B in this case) is unique within the BNC, and in the transcribed text we can follow references to it (in the form of utterance tags <u who=XX>) within the text to discover who is actually saying what. An example, from the same file, is:
<u who=PS14B>
<s n=00047>
<w DTQ>What <w VBB>are <w PNP>you <w VDG>doing<c PUN>?
</u>
<u who=PS14F>
<s n=00048>
<w VVG>Making <w AT0>a <w NN1>fortune <w NN1>teller<c PUN>.
</u>
(where grammatical word tags are marked <w XX>).
A classification scheme to split text from male and female speakers
is defined as follows (* is a wildcard matching any string):
class Men
person age = *
person dialect = *
person educ = *
person flang = *
person id = *
person n = *
person resp = *
person role = *
person sex = m
person soc = *
class Women
person age = *
person dialect = *
person educ = *
person flang = *
person id = *
person n = *
person resp = *
person role = *
person sex = f
person soc = *
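The way such a scheme drives the frequency profile can be sketched in Python. This is a rough illustration under stated assumptions, not the UCREL implementation: the regular expressions, the function names (`parse_people`, `matches`, `profile`) and the "Unclassified" label are hypothetical, and fields omitted from a class pattern are treated as wildcards.

```python
import re
from collections import Counter, defaultdict

# Header and text fragments taken from the BNC file KB8 example.
HEADER = ("<person age=4 dialect=XNC educ=2 flang=EN-GBR id=PS14B "
          "n=W0001 role=other sex=f soc=AB>")
TEXT = (
    "<u who=PS14B> <s n=00047> <w DTQ>What <w VBB>are <w PNP>you "
    "<w VDG>doing<c PUN>? </u> "
    "<u who=PS14F> <s n=00048> <w VVG>Making <w AT0>a <w NN1>fortune "
    "<w NN1>teller<c PUN>. </u>"
)

# Class patterns: "*" (or an omitted field) matches any value.
CLASSES = {"Men": {"sex": "m"}, "Women": {"sex": "f"}}


def parse_people(header):
    """Map each person id to a dict of its header attributes."""
    people = {}
    for attrs in re.findall(r"<person\s+([^>]*)>", header):
        fields = dict(re.findall(r"(\w+)=([^\s>]+)", attrs))
        people[fields["id"]] = fields
    return people


def matches(person, pattern):
    """True if every non-wildcard field in the pattern equals the person's."""
    return all(v == "*" or person.get(k) == v for k, v in pattern.items())


def profile(text, people, classes):
    """Count word forms per class, following the who= attribute of each <u>."""
    counts = defaultdict(Counter)
    for who, body in re.findall(r"<u who=(\w+)>(.*?)</u>", text, flags=re.S):
        person = people.get(who)
        label = "Unclassified"  # speaker missing from the header excerpt
        if person is not None:
            for name, pattern in classes.items():
                if matches(person, pattern):
                    label = name
                    break
        for word in re.findall(r"<w\s+\w+>([^<]+)", body):
            counts[label][word.strip().lower()] += 1
    return counts


result = profile(TEXT, parse_people(HEADER), CLASSES)
for label, words in result.items():
    print(label, dict(words))
```

Because only speaker PS14B appears in the header excerpt, the second utterance (by PS14F) falls into the fallback class here; with the full file header it would be assigned by that speaker's recorded sex.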
Once defined, this classification
will result in the standard word frequency profile having two
columns, one for male speakers and one for female speakers; see
Table 14.
Word | Men | Women | Chi-squared | Log-likelihood |
i | 55825 | 92945 | 362.9 | 365.5 |
you | 46547 | 72443 | 33.8 | 33.8 |
it | 46006 | 68376 | 3.7 | 3.7 |
the | 44846 | 57128 | 692.0 | 685.0 |
's | 38278 | 57673 | 0.1 | 0.1 |
that | 31166 | 43331 | 111.2 | 110.6 |
and | 29833 | 50342 | 249.7 | 251.8 |
a | 28995 | 39631 | 152.3 | 151.4 |
n't | 24796 | 44087 | 447.1 | 452.6 |
to | 23658 | 39861 | 192.7 | 194.3 |
yeah | 22168 | 28485 | 308.3 | 305.4 |
do | 18258 | 29526 | 59.9 | 60.2 |
he | 16099 | 26607 | 89.8 | 90.4 |
in | 15960 | 24095 | 0.2 | 0.2 |
they | 15429 | 23553 | 2.1 | 2.1 |
no | 15007 | 19880 | 137.3 | 136.2 |
of | 13999 | 17907 | 205.7 | 203.7 |
what | 13894 | 19875 | 20.3 | 20.2 |
For each word we apply the χ² test to obtain a value which shows whether there is a significant difference across the observed frequencies. For a frequency profile with 2 columns the χ² value is significant (with 99% confidence with 1 degree of freedom) if it is greater than 6.63. Most of the χ² values quoted in this paper are significant at this level.
The value is calculated from the following formula:
\[ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}, \qquad E_i = \frac{N_i \sum_j O_j}{\sum_j N_j} \]
where O_i is the observed frequency, E_i is the expected frequency, and N_i is the total frequency in class i. The χ² value becomes unreliable when the expected frequency is less than 5 (see Dunning 1993: 61-74), and possibly overestimates with high frequency words and when comparing a relatively small corpus to a much larger one. We usually omit lower frequency words from consideration. In fact (as Dunning suggests), we can use the Log-likelihood value (to which the χ² statistic is an approximation), given by the formula:
\[ -2 \ln \lambda = 2 \sum_i O_i \ln \frac{O_i}{E_i} \]
to produce a more accurate result in these cases. These values
are shown in table 14, but not reproduced elsewhere in this paper.
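A minimal Python sketch of the two statistics follows. It assumes that the expected frequency for class i is that class's share of all word tokens multiplied by the word's total observed frequency; this reading is an assumption, although with the male and female token totals given in section 2 it reproduces the χ² values quoted there for the (698.0) and she (3109.7). The function names are illustrative and are not those of the UCREL software.

```python
import math

CRITICAL_99_1DF = 6.63  # threshold quoted in the text (1 degree of freedom)


def expected(observed, totals):
    """Expected frequency per class: N_i * sum(O) / sum(N) (assumed form)."""
    grand_o, grand_n = sum(observed), sum(totals)
    return [n * grand_o / grand_n for n in totals]


def chi_squared(observed, totals):
    """Sum over classes of (O_i - E_i)^2 / E_i."""
    exp = expected(observed, totals)
    return sum((o - e) ** 2 / e for o, e in zip(observed, exp))


def log_likelihood(observed, totals):
    """2 * sum over classes of O_i * ln(O_i / E_i); zero counts contribute 0."""
    exp = expected(observed, totals)
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, exp) if o > 0)


# Word tokens spoken by male and female speakers (from section 2).
TOTALS = [1_714_443, 2_593_452]

# Observed male and female frequencies from the gender tables.
the = [44_617, 57_128]
she = [7_134, 22_623]

for word, obs in (("the", the), ("she", she)):
    x2 = chi_squared(obs, TOTALS)
    ll = log_likelihood(obs, TOTALS)
    flag = "significant" if x2 > CRITICAL_99_1DF else "not significant"
    print(f"{word}: chi-squared {x2:.1f}, log-likelihood {ll:.1f} ({flag})")
```

The sketch sums over the word's own row only, as the definitions of O_i, E_i and N_i suggest; a full contingency-table test would also include the counts of all other words in each class.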
As mentioned by Francis and Kucera (1982: 461), when we build
a rank frequency list even from a representative corpus
we need to bear in mind that the actual frequency of an item in
the list may need adjustment. We might want to take into account
its frequency distribution within the corpus. Particularly for
the lower frequency words, a large proportion of the occurrences
might take place in one conversation and this would skew the results.
To counteract this effect, we can automatically apply an adjusted
frequency measure, or an index of dispersion (see Carroll et al
1971: xxix). However, in this study we have relied on the χ²
statistic and a count of the number of speakers for each item,
checking hypotheses by looking more closely at actual concordance
examples where necessary.
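To make the idea of a dispersion index concrete, the sketch below computes Juilland's D, a widely used measure that is swapped in here for simplicity; it is not the Carroll et al. measure cited above, and it was not applied in this study. It assumes the corpus has been divided into parts of equal size (for example, equal-length stretches of conversation), and the function name is illustrative.

```python
import math


def juilland_d(freqs):
    """Juilland's D for a word's frequencies in n equal-sized corpus parts.

    D = 1 - V / sqrt(n - 1), where V is the coefficient of variation
    (population standard deviation divided by the mean). D is 1 when the
    word is spread perfectly evenly and 0 when it occurs in one part only.
    """
    n = len(freqs)
    mean = sum(freqs) / n if n else 0
    if n < 2 or mean == 0:
        raise ValueError("need at least two parts and a non-zero total")
    sd = math.sqrt(sum((f - mean) ** 2 for f in freqs) / n)
    return 1 - (sd / mean) / math.sqrt(n - 1)


even = juilland_d([10, 10, 10, 10])     # same frequency in every part
clumped = juilland_d([40, 0, 0, 0])     # all occurrences in one conversation
print(f"even: {even:.2f}, clumped: {clumped:.2f}")
```

Both toy words have the same overall frequency (40), so a raw frequency list or a χ² comparison alone would not distinguish them, whereas the dispersion value does.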
As the preceding two paragraphs make clear, further refinements
have yet to be made in the calculation of differences of lexical
frequency profiles between different subcorpora of the BNC. Nevertheless,
the present preliminary study has demonstrated the potential,
using a quantitative corpus analysis tool, for investigating significant
contrasts of lexical frequency between different text categories
or subcorpora of the BNC. It is particularly needful to add to
the χ² statistic an index of dispersion. Although this
study has been largely based on frequency of orthographic lexical
forms, use has also been made of frequency data for tagged word
forms (e.g. distinguishing mine as a pronoun from mine
as a noun). A further variant of the analysis would be to
use frequency data for lemmas, since these are automatically retrievable
from a grammatically tagged corpus. Yet a further development
of the same paradigm of research would be to undertake comparisons
on the basis of more abstract phenomena, such as syntactic constructions
and word senses. This is also possible for the BNC, but will
have to await the syntactic and semantic annotation of the corpus.
The exploitation of the BNC for these purposes has only just begun.
1. We are grateful to Jane Sunderland for her comments on a draft of
this paper, especially with reference to Section 2.
2. The BNC was compiled in the years 1991-1995 by a collaborative team
consisting of Oxford University Press (the lead partner), Longman Group
UK Ltd., Chambers Harrap, the British Library, and the Universities of
Oxford (Oxford University Computing Services) and Lancaster (University
Centre for Computer Corpus Research on Language). For details of
availability for research purposes please contact Oxford University
Computing Services, 13 Banbury Road, Oxford OX2 6NN, UK, or email
natcorp@oucs.ox.ac.uk. Major funding for the project was provided by
the Science and Engineering Research Council and by the Department of
Trade and Industry.
The Conversational Corpus which is the subject of this article was
collected, digitized, edited and transcribed by Longman, with agencies
subcontracted to them.
3. For a description and rationale of the spoken component of the BNC,
and in particular of the Conversational Corpus, see Crowdy 1993:
259-265, and 1995: 224-234.
4. This total counts multi-word units as separate orthographic words
(e.g. in spite of counts as three not one).
5. There were 38 geographical locations in which recording took place,
spread relatively evenly across the population of the country. Scotland
and Northern Ireland were included in the north region, and Wales in
the midland region.
6. There is also an age category 0-14, which has only a small
representation in the corpus, since children were excluded from acting
as respondents.
7. The categories are occupationally defined as follows: A - higher
managerial, administrative or professional; B - intermediate
managerial, administrative or professional; C1 - supervisory or
clerical, and junior managerial, administrative or professional; C2 -
skilled manual workers; D - semi- and unskilled manual workers; E -
state pensioners or widows (no other earner), casual or lowest grade
workers.
The classification of speakers and respondents in terms of these and
other factors is not always exhaustive. Thus, there are a minority of
speakers and respondents for which the age, gender and social group are
unknown. In practice, we have limited our study to data for which the
relevant categories are known. This means that the total occurrences of
a given word will not necessarily remain the same from one table to
another.
8. Further details are obtainable from L. Burnard, The Users Reference
Guide for the British National Corpus, obtainable from Oxford
University Computing Services (see note 2 above).
9. More precisely, female speakers in the corpus utter 51.27% more word
tokens than male speakers. Each female speaker, on average, utters
44.53% more words than each male speaker. Each female speaker takes, on
average, 33.32% more conversational turns than each male speaker. Turns
are defined, simplistically, as stretches of speech spoken by one
person and preceded and followed by the speech of someone else, or by
silence. The fact that women's speech has a much larger representation
in the corpus than men's speech appears to run counter to reports, in
previous studies, of male verbosity in mixed-sex conversation (Coates
1986: 114-117). It should be remembered, however, that much of the BNC
data will represent female-to-female talk.
10. In this and subsequent Tables, words are omitted if they occur in
the Conversational Corpus less frequently than once every 10,000 words.
11. This finding accords with claims made by Lakoff (1975).
12. In contrast, use of the masculine pronouns he/him/his is almost
even between female and male speakers.
13. Cf somewhat similar findings on gender differences in pronoun usage
in Hirschman 1994: 427-428.
14. Relevant here is Hudson's claim that frequency of nouns + pronouns
relative to other parts of speech tends to be a constant across text
type (Hudson 1994: 331-339).
15. These lists of names are included for interest, but it has to be
borne in mind that the occurrence of personal names is often heavily
biased by their frequent repetition in the recordings of particular
respondents. It should also be noted that the list does not include
surnames, which were anonymised in the original transcription.