The Automatic Content Analysis of Spoken Discourse A Report on Work in Progress Andrew Wilson Paul Rayson Unit for Computer Research on the English Language Lancaster University Bowland College Lancaster LA1 4YT United Kingdom 1. Introduction. The research reported here represents a collaboration between the Unit for Computer Research on the English Language (UCREL) at Lancaster University, a London market research firm, and an advertising agency. Our aim is to develop the software which will automatically assign semantic tags to large bodies of transcribed spoken interviews between market researchers and members of the public and perform statistical tests on the resulting tag frequency profiles. We hope to bridge the gap between the limitations of response data in traditional large-scale quantitative research surveys using fixed multiple-choice questionnaires, and the lack of quantifiable and comparable evidence of very small scale qualitative interviews (Siddall and Santry, 1988). 2. Content Analysis. Traditionally, survey research has relied upon two techniques which have imposed their own restrictions on the data collected. Quantitative research has largely relied on the use of pre- designed multiple-choice questionnaires which allow statistical profiles to be easily and quite quickly obtained because the variables are fixed, that is to say that the respondents may only choose from those responses for which the questionnaire designer has allowed. However, this approach means that potentially useful and interesting responses may not be being collected because the respondents have no opportunity to express an opinion which is not included in the choice of options on the questionnaire. Qualitative research --- the second technique which is commonly used --- involves the in-depth interviewing of individuals or groups. These interviews will in general have some planned structure but will also be quite open and the interviewer will often follow up a response. Respondents are free to answer in whatever way they like. For this reason it is difficult to code and obtain reliable statistics from such data, and statistical reliability is also compromised by the amount of person hours involved in the collection and analysis of the interview data which means that the samples are often very small. This is why such data tend to be analyzed qualitatively rather than quantitatively. Clearly a compromise between the two methodologies would be advantageous. We contend that this compromise may be found in the method of automated content analysis. Content analysis is a form of quantitative research, but it differs from traditional quantitative research because it can be applied to free text --- either spoken or written --- and so it is not bound by a pre-coded questionnaire. Berelson (1952:14-15), one of the leading proponents of content analysis in the post-second world war period, collated several early definitions of content analysis. Despite their age, two of these offer perhaps the most succinct and accurate descriptions of content analysis as it is commonly practised. Kaplan (1943:230) states that `the technique known as content analysis...attempts to characterize the meanings in a given body of discourse in a systematic and quantitative fashion' whilst Leites and Pool (1942:1-2) are more explicit about the quantitative dimension: `it [content analysis] must refer either to syntactic characteristics of symbols...or to semantic characteristics...[and] must indicate frequencies of such characteristics with a high degree of precision. One could perhaps define more narrowly: it must assign numerical values to such frequencies'. Content analysis is thus concerned with the statistical analysis of primarily the semantic features of texts. In practice this generally means using a given set of content categories such as the specimen market research categories shown in Figure 1 into which the words in the text are classified. These may be chosen ad hoc for a particular research survey, but this will involve some initial effort working on the analysis system and performing the initial coding of words in a lexicon. We at Lancaster hope that we have partly solved this problem by having two sets of tags, that is by having a more specific set of semantic classes which map onto the particular project's content tags. The underlying assumption of content analysis is that we can automatically code free text with such semantic labels, and the frequencies will give us quantifiable data about the topics being raised, and the relative frequencies of different opinions. But unlike qualitative research, we have a finite set of possible categories so the analysis may be done in a more readily quantifiable way. This means that hypotheses about the semantic content of texts can be generated and tested with reference to standard text norms in a way which is not possible with traditional qualitative research. ACCURACY ACTION AFFLUENCE AGE AGGRESSION APPROACHABILITY/ENTHUSIASM APPROBATION APPROPRIATENESS ATTAINABLE AUTHENTICITY AVARICE BEGINNINGS CAUTION CHANGE IN CIRCUMSTANCE CHANGE IN NUMBER COLOURS Figure 1. Specimen content categories. Thus automatic content analysis combines the flexibility of traditional qualitative research with the statistical dimension of quantitative research. It provides a source of highly flexible and detailed responses, but these data can be quantified, and the results could be compared to further studies, for example to compare the popularity of a certain product between one year and the next. Historically, content analysis can be traced back to 18th century Sweden and a careful doctrinal analysis of a hymn book by the Lutheran church (Dovring, 1954). In the post- second world war period it gained increased acceptance as a research methodology in the social sciences, and in the 1960s researchers began to look to the increasingly accessible computer technology to do automatically what had previously been done by hand. This culminated in the development of the system called The General Inquirer at Harvard University (Stone et al., 1966) which despite the development of other programs for computer-assisted content analysis (cf. Tesch, 1990) has until now maintained its position as the most sophisticated and fully automated content analysis system. In more recent years, content analysis has had to compete to a greater degree than before with other approaches to textual analysis in the social sciences but there is still considerable interest in the technique, especially in the market research sector (e.g. Kassarjian, 1977; Siddall and Santry, 1988) --- hence our joint sponsorship by a London market research firm. We at Lancaster believe that more robust and refined technology will make content analysis a more widely used technique once again in other forms of social science research which rely on textual data. It can be applied to any text --- written or spoken --- so long as the robustness of rules is checked for the new text type. Thus it may find uses in the semantic analysis of many different kinds of natural language corpora with applications beyond social scientific analysis, for example for the crucial task of thesaural semantic annotation in information retrieval (cf. Paice, 1991). In Summer 1988, Lancaster carried out a pilot project to test the feasibility of producing an automatic content analyzer, and in May 1990 work began on a two year project to produce such a content analyzer. 3. The Lancaster Content Analyzer. This section will describe the envisaged operation of the Lancaster Content Analysis System. The first step is to have the spoken data transcribed into machine readable form. This will be done by audio typists: details of prosody and other features are not required in this form of analysis. Each text will be identified by a markup in Standard Generalized Markup Language (SGML), and divided according to the questions in the interview and respondent characteristics such as age, sex, social group etc. The only special linguistic markup at this stage will be for phoric relations. A string of references to `it' and `they', for example, is not very useful to the user of the system who will want to know exactly to what or whom a speaker is referring, but as the automatic resolution of anaphora is beyond the present state of the art, these will be taken care of by human transcribers at the transcription stage, e.g. ... and the thing they fought for comes about in spite of their defeat, and when it comes turns out not to be what they expected ... Manual assignment of glosses will be made more efficient by using short markup sequences which will then be converted automatically to SGML form. The data will be run through a spell checker: the system will fail to match misspelled words in its lexicon. The next step is to run the data through the CLAWS part-of-speech tagging system (Garside, Leech and Sampson, 1987), which assigns a part-of-speech to every lexical item or syntactic idiom in the text, using probabilistic Markov models of likely part-of-speech sequences where a word is not in the system's lexicon. The output from CLAWS is then fed into the first component of the semantic tagging suite. This assigns a semantic tag or tags to each lexical item in the text, as can be seen in Figure 2. 0000000 000 UH Yes Z4 0000000 000 PPIS1 I Z8 0000000 000 VD0 do A1.1 0000000 000 RR sometimes N6 0000000 000 . . 0000000 000 APP$ My Z8 0000000 000 NN1 Avon Z3 0000000 000 NN1 lady S2.1 0000000 000 IF for Z5 0000000 000 DD1 this Z5 0000000 000 NN1 area M7 N3.6 A4.1 0000000 000 , , 0000000 000 RR just A14 T3--- 0000000 000 VVN called M1 Q1.3 Q2.2 0000000 000 II on Z5 0000000 000 APP$ my Z8 0000000 000 NN1 door H5 0000000 000 CC and Z5 0000000 000 VVD asked Q2.2 0000000 000 CS if Z7 0000000 000 PPIS1 I Z8 0000000 000 VM would A7+ 0000000 000 VV0 like E2+ 0000000 000 TO to Z5 0000000 000 VV0 look X3.4 A8 0000000 000 II through Z5 0000000 000 AT1 a Z5 0000000 000 NN1 book Q4.1 I1/K5.1% Figure 2. Semantically tagged text before postediting. The tags were developed specially for this project, the aim being to ensure that as far as possible the semantic information contained in them is sufficiently fine-grained so as to make mapping to another set of content tags for a particular research survey as easy as possible. This is, of course, not as easy as it sounds: Bloomfield's famous dictum that `the statement of meanings is therefore the weak point in language study, and will remain so until human knowledge advances very far beyond its present state' (Bloomfield, 1933) still rings as true today as it did in 1933. Semantic categorization which is thesaural in approach rather than based on a small set of semantic primitives will probably always remain a matter of some personal judgement due to the problem of fuzzy sets, where occasionally one sense of a particular word may be classified in two or more semantic categories. However, this is precisely the kind of classification which is required for content analysis. We have attempted to provide a categorization for the Lancaster analyzer which is as observationally adequate as possible to avoid too many potential problems in mapping to possible sets of content tags. The initial tagset was loosely based on Tom McArthur's Longman Lexicon of Contemporary English (McArthur, 1981) as this appeared to offer the most appropriate thesaurus type classification of word senses for this kind of analysis. We have since considerably revised the tagset in the light of practical tagging problems met in the course of the research. The semantic tagset has a multi-tier structure. There are currently 21 major discourse fields, subdivided, and with the possibility of further fine-grained subdivision in certain cases. Each tag is represented by a decimal notation, the major discourse field being shown by a capital letter, the subdivisions by numerals, and further subdivisions by further numerals separated off by points, e.g. A12, A5.1, S1.2.3 etc. Figure 3 shows some examples of categories from the semantic tagset. E - EMOTIONAL ACTIONS, STATES AND PROCESSES 1 General 2 Liking 3 Calm/Violent/Angry 4 Happy/sad: 1 Happy 2 Contentment 5 Fear/bravery/shock 6 Worry, concern, confident H - ARCHITECTURE, BUILDINGS, HOUSES AND THE HOME 1 Architecture and kinds of houses and buildings 2 Parts of buildings 3 Areas around or near houses 4 Residence 5 Furniture and household fittings L - LIFE AND LIVING THINGS 1 Life and living things 2 Living creatures generally 3 Plants Figure 3. Sample semantic classifications from the tagset. Antonymity of conceptual classifications is indicated by +/- markers on tags, e.g. A5.1+ (good) and A5.1- (bad) both belong to the same category (A5.1) but are clearly distinguishable; comparatives and superlatives receive double and triple +/- markers respectively. Certain words and collocational units show a clear double membership of categories, for example sportswear and riding boots may be members of the category clothing and the category sport: their basic field is clothing and their subsidiary connotation sport. In certain contexts this information may be of little use, for example in a text about sport, the sport dimension will be superfluous; but in others it may be important. Such cases are therefore dealt with using slash tags. Here both tags are indicated and separated by a slash, e.g. B5/K5.1 where B5 represents clothing and personal effects, and K5.1 represents sport. The end interface can search on either or both of the tags. Closed class words such as prepositions and proper nouns are assigned a tag beginning with `Z', which indicates a grammatical bin category: of these, all but the proper nouns are not counted in the final statistical analysis, but can easily be retrieved if necessary. Once tagged as such, proper nouns --- for example product names --- need to be extracted and counted individually in the statistical analysis. The lexicon used by this program consists of lexical items each of which has one syntactic tag and one or more semantic tags assigned to it. In cases where a word has more than one syntactic tag, it is duplicated with a separate entry for each syntactic tag, for example: table NN1 H5 N2 table VV0 S7.1+ firm NN1 S5/I2.1 firm VV0 O4.5 firm JJ O4.5 A7+ Figure 4. Sample lexicon entries. The semantic tags for each entry in the lexicon are arranged in approximate rank frequency order to assist in manual postediting, and to allow for gross automatic selection of the commonest tag, subject to weighting by domain of discourse (see below). A morphology analyzer removes the need to store multiple forms except where these are semantically distinct, e.g. the system does not require -ing and regular past tense forms. Only the base form of regular conjugations and declensions is required, and a set of spelling rules will match the other forms. The system reads in the text and assigns the tag or tags indicated for that word and part-of-speech. If no exact match for the word and syntactic tag pair is found, then morphological rules built into the system are used to strip away the suffix of the word. The new stem becomes the search key in the second pass through the lexicon; however, the same syntactic tag as before should not be used. For instance if the word finding with a syntactic tag VVG does not appear in the lexicon, the -ing ending is discarded and a search is performed for find with a syntactic tag VV0 to indicate that it is the base form of the verb. If a match does not occur this time the syntactic tag is simply ignored and the appropriate semantic tags assigned, given a match on the lexical item alone (probabilistically this may be better than an unmatched search). Failing that a special `unmatched' tag, Z99, is attached. The suffix-stripping techniques used in this program reduce the size of the lexicon required by up to 30% as only the base form of regular verbs and singular form of regular nouns need to be included. There is a further list of semantic phrasal idioms such as all in all and up the creek without a paddle. This list also includes phrasal verbs. The use of wildcards in the idiom list entries removes the need to enter all the variants of a variable idiom such as kick the bucket, kicks the bucket, kicked the bucket etc. Techniques similar to those applied to the lexicon are repeated and single-word tagging is corrected if a match is found, using a form of ditto tagging similar to that in the CLAWS system (cf. Blackwell, 1987) which will ensure that the classification of an idiom is counted only once in the statistical output. Discontinuity of idioms and phrasal verbs is allowed for by permitting intercalation of up to five words between members; for example, in (1) (1) Laura handed the 50 reward money over the phrasal verb handed over is interrupted by the noun phrase the 50 reward money. Typically, however, the number of intercalations between parts of an idiom will not be so great and the program currently allows the user to select a maximum length of intercalation between 0 and 5 items whilst we are working on the optimal length which will have the maximal effectiveness in linking items which should be linked, with the minimum number of erroneous links. We are attempting to perform some automatic disambiguation for words with more than one semantic tag. Our initial intuition was that a weighting by domain of discourse would resolve a large number of cases of semantic ambiguity, so that if respondents have been talking a lot about clothing, the clothing sense of boot will be more likely than the vehicle part sense. The texts would be manually pre-scanned to obtain an idea of what topics were being discussed, and the semantic codes for these fields would be entered into a file. When the tagging program was run, the program would look at cases where a word in text could be given more than one tag, and if it found a match for one or more of the tags in the file it would delete all possible tags except that or those tags. Some words would therefore not be disambiguated, and others would remain ambiguous, but with a generally smaller number of options, because the file matched more than one tag in the list for those particular words. However, our intuitions were not confirmed. When the program was run over sample texts, the overall ambiguity ratio (calculated as the percentage of ambiguous words and idioms in all tagged items excluding punctuation) fell by only around 3% from approximately 21% to 18%. This reflected a large number of ambiguities which were not accounted for by domain of discourse; the actual domain weighting technique had an error rate of only around 4% of those words affected by this method. Given that this technique was in general accurate but compromised by lack of coverage, we have decided to augment the domain weighting with a rank ordering of sense likelihood for the language as a whole. The tags for each lexical item and idiom have now been arranged in general rank frequency order, with rarity markers assigned to particularly rare senses, for example the noun book in the betting sense rather than the media/paper documents sense. (Unfortunately due to the lack of published data, much of this rank ordering has had to rely on intuition.) The domain weighting method just described will be used to `promote' tags where those fields occur in the tag list, whilst the rarity markers will act as weights to prevent promotion of rare senses which were the main source of error when this method was used in isolation. After promotion and weighting, the first tag in the list will be automatically selected for the particular word or idiom: this will hopefully be the most probable sense, taking into account the domain of discourse. However, a large number of ambiguities in the text are caused by the very common verbs be, do and have, which may have either an auxiliary function or one or more `content' senses. Rank frequency ordering in these cases is unhelpful as, for example, auxiliary be has approximately the same frequency of occurrence as content be. Given that each of the English auxiliaries collocates with a particular form of the verb, we have been able to develop rules based on characteristic sequences of CLAWS part-of-speech tags within which these collocations occur, also including sequences to take account of tag questions. These rules distinguish the auxiliary uses of these verbs with a high degree of accuracy. In the longer term, further work on word sense disambiguation could look at bigram and trigram methods using the tag sequences on either side as the grammatical tagging system does, but this is beyond the time-scale of the present funding. At this stage of the program run, manual postediting takes place for ambiguous tags and unmatched items using software tools specially developed at Lancaster. This will probably only be done in the training stage where the system is being run for the first time over a new discourse field, to improve the probability weightings for different word senses. There next follows a suite of programs which will make the Lancaster content analysis system more specific in its output than other attempts. Previous analyzers such as General Inquirer have simply provided frequency counts on the content categories for the texts being studied. However, simple frequency counts on individual categories cannot provide information about the referents of adjectives, whether and how adjectives are modified by intensifiers or downtoners, and whether or not assertions are negated. We are therefore providing for a certain amount higher level linguistic processing which will make the output counts more finely-tuned and hence --- it is hoped --- more useful. It should perhaps be said that this is not intended as a competitor to the method of a full syntactic parse and semantic analysis: rather, it is an attempt to link together the most important descriptive and modifying words in a corpus without the additional computation and manual intervention that the former kind of analysis would require. Based on the analysis of a preliminary corpus of approximately 206,906 words of market research interviews, rules were developed for the following by extracting recurrent sequences of CLAWS tags and examining the incidence of the relevant links:-- * Linking adjectives to the nouns they modify, e.g. (2) Ruth is attractive The program will cope with scoping across Boolean and and or links. * Linking modifiers to adjectives, e.g. (3) Ruth is very attractive * Ensuring that negative words are linked to the items which they negate (cf. Wilson, 1991), e.g. (4) I do not like mushrooms This also includes transfer of negation, where in an indirect statement a negative structurally negates one of a set of introductory verbs (e.g. think) but semantically negates the main verb or adjective in the indirect statement, e.g. (5) I do not think (that) asparagus is tasty (6) I do not think (that) Paula likes it The links for the various kinds of relation form chains of relationships linking the relevant negatives, adjectives, modifiers, and nouns, which are interpreted and applied in producing the differential output from the end interface (see below). These rules have a success rate of over 90% when run over raw, unedited tagged interview text. We do not envisage the manual postediting of CLAWS output in the practical application of this system, as the degree of accuracy required in academic research is unlikely to be necessary or worth the additional person hours required in the commercial context. With postedited texts, the results are even more satisfactory. At this stage the basic semantic tags are converted by general and specific rules to the content tags chosen for the particular research project being undertaken. This will be achieved by having a two-level mapping procedure, for example we might have a rule that says all words tagged A5.1+ map into content tag 5, except for the word smart which maps into content tag 12. The final module is the `front end' of the system, allowing the user interactively to obtain frequency profiles and concordances from the data. A menu system accesses the information according to various combinations of the respondent and question details encoded at the transcription stage. Initially the user is presented with a table showing the absolute or relative frequency of each content tag. The next level shows frequencies of each particular word grouped by tag, and we can then subdivide even further so that each line is the frequency of a word with a particular modifier (including the null modifier). Modifiers and negatives can be clearly displayed so that the qualitative aspect is not lost in trying to lump boosters together in a numerical value:-- content tag 12 30 sensible 20 clever 10 very sensible 6 not very sensible 7 not at all sensible 5 quite sensible 2 quite clever 5 not clever 5 Figure 5. Part of envisaged statistical output. At each level of detail in the tag frequency profile, statistical chi-squared tests will be performed to indicate significant frequencies with respect to standard text norms. The text norms will be derived from an initial 1.5 million word semantically analyzed and postedited corpus currently being collected. Finally, the user can select an item in the frequency table and request a key word in context (KWIC) concordance, which allows the examination of the linguistic detail and shows the respondent data for each instance. Work is approaching the end of the current development of the Lancaster analyzer (April 1992). It is hoped that by the end of summer 1992, the analyzer described in this paper will be fully operational and that the advantages of this kind of semantic analysis will become clear as real interview texts are analyzed. Note. This research is supported by a SERC/DTI grant number IED4/1/1143 to Prof. G.N. Leech, Dr. J.A. Thomas and Mr. R.G. Garside. The authors would like to express their thanks to Dr. J.A. Thomas for her helpful comments on the initial manuscript. References. Berelson, B. (1952): Content Analysis in Communication Research. Glencoe, IL: The Free Press Publishers. Blackwell, S. (1987): `Syntax versus Orthography: Problems in the Automatic Parsing of Idioms'. In: R. Garside, G. Leech and G. Sampson (eds), The Computational Analysis of English: A Corpus-Based Approach. London: Longman. Bloomfield, L. (1933): Language. New York: Holt, Rinehart, Winston. Dovring, K. (1954): `Quantitative Semantics in 18th Century Sweden'. Public Opinion Quarterly 18:4, 389-94. Garside, R., G. Leech and G. Sampson. (1987): The Computational Analysis of English: A Corpus-Based Approach. London: Longman. Kaplan, A. (1943): `Content Analysis and the Theory of Signs'. Philosophy of Science 10, 230-247. Kassarjian, H.H. (1977): `Content Analysis in Consumer Research'. Journal of Consumer Research 4, 8-18. Leites, N.C. and I. de S. Pool. (1942): On Content Analysis. Library of Congress, Experimental Division for Study of War-Time Communications, Document 26. McArthur, T. (1981): Longman Lexicon of Contemporary English. London: Longman. Paice, C.D. (1991): A Thesaural Model of Information Retrieval. Information Processing and Management 27:5, 433-447. Siddall, J. and E. Santry. (1988): `Tabloids of Stone'. The Market Research Society Conference. Stone, P.J. et al. (1966): The General Inquirer: A Computer Approach to Content Analysis Studies in Psychology, Sociology, Anthropology and Political Science. Cambridge, MA: MIT Press. Tesch, R. (1990): Qualitative Research: Analysis Types & Software Tools. London: The Falmer Press. Wilson, A. (1991): `"No", "Not", and "Never" Negation in a Corpus of Spoken Interview Transcripts'. Lancaster Papers in Linguistics, number 73.