Gerard Steen
Department of English, Free University
Amsterdam
GJ.STEEN@hetnet.nl
Elena Semino
Department of Linguistics and Modern English Language
Lancaster University
Lancaster
LA1 4YT
e.semino@lancaster.ac.uk
This report presents and discusses the results of the first stage of reliability testing for the first step of a metaphor identification procedure. The complete procedure is a theoretical reconstruction of what linguists do when they identify a stretch of language as metaphorical. It involves five steps, and has been presented in Steen (1999) and applied by Semino, Heywood, and Short (in press). The paper will begin with a brief presentation of the complete five-step procedure. Last summer the procedure was discussed at a small three-day meeting of nine metaphor specialists from different disciplines: stylistics, Cognitive Linguistics, psycholinguistics, and applied linguistics. As a result of the meeting it was decided to test the reliability of the first step of the procedure. The purpose of these tests is to calibrate the first step and make it available for application in various kinds of linguistic research, including the tagging of existing and new corpora of spoken and written English, the analysis of experimental texts for psycholinguistic research, and the analysis of collections of metaphors for theoretical and empirical work in Cognitive Linguistics. The rest of the paper will report on the findings of the reliability test. In a companion paper, we shall address the problem of how to tag words identified as metaphorical in electronic corpora. The aim of the first stage of the reliability testing has been to collect coding data from the analysts involved in the meeting in order to gain a first impression of the agreements and disagreements at the present stage of development of the project. In particular, the metaphor identification data pertain to five 19th-century poems which were discussed at the meeting and which gave rise to positive conclusions about the potential for a reliable procedure.
These texts are a good starting point for further improvement of the procedure in several respects: theoretical assumptions, instructions, handling of specific groups of phenomena (including classes of metaphor), and detection of important sources of difficulty and error. That is why all participants were requested to go over the texts again and send in their codings of the metaphorically used words. The discussion of the data and of the theoretical and methodological issues has taken place on a specially designed electronic discussion site. It contains the texts and the codings and allows the participants to comment on the codings and on each other's comments. A brief demonstration of the site will be part of the paper. The discussion site provides an efficient means for the members of the group to create consensus about errors, comparable cases, difficult issues, and so on. This kind of consensus has meanwhile been reached, and a new reliability test on different types of materials will soon be undertaken to examine whether progress has actually been made. This stepwise approach is expected to lead to incremental increases in the reliability of the procedure. The results of the study are based on the metaphor identification data of four analysts for 414 content words that were used literally or metaphorically in five poems. The reliability of the identification procedure was tested by calculating Cochran's Q, which provides a test of marginal homogeneity for binary data. The results before discussion showed that agreement between analysts was ordered according to their experience with the procedure, and that sufficient reliability for all items was achieved for only one pair of analysts.
More interesting, however, was the fact that agreement could also be ordered according to word class: all four analysts achieved sufficient reliability for nouns (n=194), and a combination of three of the four analysts achieved agreement for verbs (n=101) and almost sufficient agreement for adjectives (n=87). After the analysis of the original data, a discussion followed between the four analysts in order to remove error and to align judgements according to explicitly identical interpretations of the instructions. This discussion took place in three steps. First, the poem with the lowest reliability score was used to remove the most obvious obstacles caused by unclear instructions. A test of all of the data with corrected entries for the first poem 'after discussion' showed improvement, but not enough to reach adequate reliability. Then the poem with the second lowest reliability score was discussed by the four analysts, who now also had access to a list of additional clarifications on the discussion site, which had emerged from the discussion of the first poem. A test of all of the data with corrected entries for the first two poems 'after discussion' showed sufficient reliability for the combination of all four analysts. A breakdown according to word class exhibited the same order of reliability for nouns, verbs, and adjectives, with verbs on the brink of becoming completely reliable for all four analysts. The third round, for the last three poems, is completed at the moment of submitting this abstract.
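As an illustration of the statistic used in the reliability tests described above, the following is a minimal sketch of Cochran's Q for a table of binary codings, with one row per content word and one column per analyst (1 = coded as metaphorical, 0 = coded as literal). The function name `cochrans_q` and the toy data are assumptions made for illustration only; they are not the study's codings or software.

```python
from typing import Sequence, Tuple


def cochrans_q(codings: Sequence[Sequence[int]]) -> Tuple[float, int]:
    """Return (Q, degrees of freedom) for a table of binary codings.

    Each row is one item (here: a content word); each column is one analyst.
    A cell is 1 if the analyst coded the word as metaphorical, else 0.
    """
    if not codings:
        raise ValueError("need at least one item")
    k = len(codings[0])  # number of analysts
    if k < 2:
        raise ValueError("need at least two analysts")
    for row in codings:
        if len(row) != k or any(v not in (0, 1) for v in row):
            raise ValueError("every row needs k codings, each 0 or 1")

    col_totals = [sum(row[j] for row in codings) for j in range(k)]
    row_totals = [sum(row) for row in codings]
    grand_total = sum(row_totals)

    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - grand_total ** 2)
    denominator = k * grand_total - sum(r * r for r in row_totals)
    if denominator == 0:
        # All analysts agree on every item, so there is no variation to test.
        raise ValueError("Q is undefined when every item is coded unanimously")
    return numerator / denominator, k - 1


# Hypothetical toy data: six words coded by four analysts.
toy_codings = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]

q, df = cochrans_q(toy_codings)
print(round(q, 3), df)  # prints: 8.077 3
```

Under the null hypothesis that all analysts mark the same proportion of words as metaphorical, Q approximately follows a chi-square distribution with k - 1 degrees of freedom, so a large Q points to systematic differences between analysts. To reproduce a breakdown by word class, the same function could be applied separately to the rows for nouns, verbs, and adjectives.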
References
Semino, E., Heywood, J. and Short, M. (in press) 'Methodological Problems in the Analysis of Metaphors in a Corpus of Conversations about Cancer', Journal of Pragmatics.
Steen, G. (1999) 'From Linguistic to Conceptual Metaphor in Five Steps', in R. W. Gibbs Jr. and G. Steen (eds) Metaphor in Cognitive Linguistics, pp. 57-77. Amsterdam: John Benjamins.