LIX 68 revisited
An extended readability measure
Katarina Mühlenbock and Sofie Johansson Kokkinakis
Department of Swedish
Gothenburg University
katarina.muhlenbock@svenska.gu.se
sofie.johansson.kokkinakis@svenska.gu.se
Abstract
Language- and genre-specific readability measures are useful for texts adapted for a specific group of readers. We regard the Swedish readability index LIX, based solely on the surface structure of a text, as insufficient for determining the degree of adaptation required for highly heterogeneous groups of readers. Therefore, we suggest additional parameters to be considered when aiming at tailored text production. A corpus-based comparative study suggests that features mirroring the vocabulary load, sentence structure, idea density and human interest might contribute to a more accurate and reliable scoring of readability.
Introduction
The readability of a text has to be established from a language-specific perspective, implying that different languages possess more or less accurate and elaborated readability formulae. To date, there exist a vast number of readability indices for English, but for less common (smaller) languages only a few measures are currently used. For Swedish, the LIX formula, based on sentence and word length, is almost exclusively used. We regard LIX as insufficient to reflect the complex relation between the human reader and the text, and would propose additional measures based on vocabulary load, sentence structure, idea density and human interest for determining text difficulty levels.
A corpus of simplified Swedish text and children's fiction and a reference corpus of standard Swedish were used in order to test how different features were realized across genres and text types. The features studied were vocabulary load, measured by the number of long words; sentence structure, measured by average sentence length; and idea density, measured by lexical variation and nominal ratio. Finally, we measured the factor human interest as the proportion of proper nouns.
Adding the extended calculations to the LIX formula and comparing the new results across text types and genres, we found that a measure based on additional factors on lexical, syntactic and semantic levels contributes strongly to a more correct weighting of text difficulty and appropriateness for different readers. Texts adapted to the specific needs of an individual reader are valuable assets for various types of applications connected to research, education and information, constituting a prerequisite for the integration into society of second language learners, language-impaired persons and beginning readers.
Theoretical background
Scientific studies approach the process of reading - focusing on the individual - and readability specifics - focusing on the text - from various points of view, and most often as separate areas of research. Focus is often put on the individual's shortcomings in understanding written texts, but rarely on the text itself and the way it is presented to the reader. In a similar way, text research often neglects the importance of the individual's cognitive and emotional prerequisites, composed of perception, memory, intelligence, mother tongue or motivation.
The reading process
Reading skill is the basis for the individual's integration, development and interaction in the information society (Lynch & Hudson, 1991). A fundamental aspect of participation is the possibility to acquire information and to communicate. Sweden is a country where, for centuries, almost the entire population has been considered literate. Nevertheless, today about 25% of the adult population needs specifically adapted information (OECD, 1994). These persons form a highly heterogeneous group with large individual variations in prerequisites and needs, affected for instance by dyslexia, aphasia, or intellectual or cognitive disability, or being second language learners. Consequently, the concept of easy-to-read, coined as a term in Sweden more than 30 years ago, cannot be universal, and it will not be possible to write a text that perfectly suits the demands of every individual. However, easy-to-read material is generally characterized by the use of simple, straightforward language, without being childish or simplistic. From a purely linguistic view this might be achieved by eliminating a certain number of grammatical features in a text while keeping close to the original meaning. The features in question operate on a syntactic, lexical or textual level, contributing to the semantics, and are more or less readable and comprehensible for the individual.
Readability
Readability research started in the 1920s and has mainly been carried out in the US (Lively & Pressey, 1923, Vogel & Washburne, 1928, Lewerentz, 1929, Dale & Tyler, 1934, Morris & Holversen, 1938). Factors considered in these studies comprise word length, percentage of multisyllabic words, subordinate clauses, etc. Gray and Leary (1935) examined readability very thoroughly and investigated more than 200 style elements and the relationships between them. By combining variables that were highly predictive but not related to each other they created a readability formula with five variables and a high correlation to reading-difficulty scores, assigned by informants. Chall (1958) included the reader and concluded that only four types of elements seem to be significant for a readability criterion, namely vocabulary load, sentence structure, idea density and human interest. This is completely in accordance with her previous definition of readability, i.e.:
The sum total (including all the interactions) of all those elements within a given piece of printed material that affect the success a group of readers have with it. The success is the extent to which they understand it, read it at an optimal speed, and find it interesting. (Dale & Chall, 1949)
In the 1970s research eventually established that the two variables commonly used in readability formulas - a semantic (meaning) measure such as difficulty of vocabulary and a syntactic (sentence structure) measure such as average sentence length - are the best predictors of textual difficulty. By virtue of this, Björnsson (1974) presented his readability formula LIX for Swedish, based on sentence and word length.
Returning to Chall (1958) we will try to analyze in what way the properties of a text having a certain vocabulary load, sentence structure, idea density and being of human interest are to be identified.
Chall's readability perspectives
Researchers who stressed a connection between vocabulary load and text difficulty have studied word dispersion across different text types (Lively & Pressey, 1923), and lexical variation and difficulty measured as word frequency ratios. The knowledge of words has always been considered a strong measure of a reader's development, reading comprehension, and verbal intelligence. Commonly used word-frequency lists, such as the early Thorndike's Teacher's Word Book (1921), provide objective means for measuring the difficulty of words and texts.
A syntactically parsed text can be analyzed with regard to the frequency or proportion of simple sentences in relation to complex sentences, or with regard to the average number of clauses. The average sentence length is easy to calculate, and can give a good indication of sentence structure. Early readability studies pointed at a high positive correlation between sentence length and structural complexity (Dale & Tyler, 1934, Gray & Leary, 1935).
Lexical variation among the content words (nouns and verbs) and the proportion of content words vs. total number of words can be good indicators of the idea density of a text.
Regarding the degree of human interest, the measures to be considered might be the proportion of personal pronouns or the number of person words. The number of proper names can also be a fair indicator of the text's appeal to the reader.
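The proportion-based indicators mentioned above (content words for idea density, proper names for human interest) can be sketched as a single helper over part-of-speech-tagged tokens. This is only an illustration: the function name, the sample sentence and the tag labels ("PM" for proper nouns, "NN" for nouns, "VB" for verbs, and so on) are assumptions made for the example, not the tagset or procedure used in the study.

```python
from collections import Counter


def tag_proportion(tagged_tokens, tags):
    """Return the share of tokens whose part-of-speech tag is in `tags`.

    `tagged_tokens` is a sequence of (word, tag) pairs.
    """
    if not tagged_tokens:
        raise ValueError("need at least one token")
    counts = Counter(tag for _, tag in tagged_tokens)
    return sum(counts[tag] for tag in tags) / len(tagged_tokens)


# Hypothetical tagged sentence; the tag labels are illustrative assumptions.
sample = [
    ("Anna", "PM"), ("läser", "VB"), ("en", "DT"),
    ("bok", "NN"), ("i", "PP"), ("Göteborg", "PM"),
]

# Human interest: proportion of proper nouns.
proper_noun_share = tag_proportion(sample, {"PM"})

# Idea density: proportion of content words (here nouns and verbs).
content_word_share = tag_proportion(sample, {"NN", "VB"})
```

The same helper could be pointed at personal pronouns instead of proper nouns by passing a different tag set.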
Swedish research on readability
The Swedish researcher Platzack (1974) asserts that readability can be regarded as a function giving a measurable output in the form of human effort. The inputs, or arguments, of this readability function are, according to Platzack, content, typography, language, reader and understanding. However, Platzack considers readability to be an interesting property mainly in texts whose primary goal is to provide information. A high readability score for these texts can thus, figuratively speaking, be regarded as repaying the individual with maximal information for minimal cognitive effort invested.
From a structural point of view Swedish is characterized as an inflecting and compounding language. The compounding patterns are increasingly productive in modern language, a fact that might have an impact on the LIX scores established by Björnsson (1968). The readability index LIX is calculated by simply adding the average sentence length, in number of words, to the percentage of words longer than 6 characters. LIX is still used for determining the readability of texts intended for persons with specific linguistic needs. However, we consider the LIX score alone to be insufficient for determining the degree of adaptation required for these highly heterogeneous groups, and would suggest additional parameters to be measured when aiming at tailored text production for individual readers. Admittedly, Björnsson made a thorough study of additional Swedish textual factors that may well fit into most of the categories suggested by Chall (1958). They were, however, gradually abandoned in favour of factors focusing on features of surface structure alone, i.e. running words and sentences. Many of the factors initially considered were regarded as useless at the time, owing to the lack of suitable means and methods for carrying out statistical calculations on sufficiently large text collections.
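The LIX calculation described above (average sentence length plus the percentage of words longer than 6 characters) can be sketched as follows. The tokenization is an assumption made for this example: sentences are split on runs of '.', '!' and '?', and a word is any run of letters. Neither choice is prescribed by the text.

```python
import re


def lix(text: str) -> float:
    """Average sentence length in words + percentage of words > 6 characters."""
    # Assumption: a sentence ends at one or more of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a word is a run of letters (Unicode-aware, so å, ä, ö count).
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    long_words = [w for w in words if len(w) > 6]
    average_sentence_length = len(words) / len(sentences)
    percent_long_words = 100.0 * len(long_words) / len(words)
    return average_sentence_length + percent_long_words


# Two sentences, seven words, one of them ("springer") longer than 6 characters.
score = lix("Katten sover. Hunden springer snabbt i parken.")
```

Because Swedish compounds are written as single words, a productive compounding pattern raises the share of long words directly, which is one reason the score is sensitive to the properties discussed above.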
Other studies suggest further measures: the nominal ratio, which describes the relation between content words and function words, and the word variation index, which describes the variation between the different words or word forms used in a text (Hultman & Westman, 1977, Melin & Lange, 2000).
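As an illustration of the word variation index, the sketch below uses the formulation commonly cited for OVIX, log(N) / log(2 - log(U) / log(N)), where N is the number of running words and U the number of distinct word forms. That this exact formulation is the one intended here is an assumption, as are the lower-casing of tokens and the function name.

```python
import math


def ovix(tokens: list[str]) -> float:
    """Word variation index in its commonly cited form (assumed here):
    log(N) / log(2 - log(U) / log(N)).
    """
    n = len(tokens)                        # N: running words
    u = len({t.lower() for t in tokens})   # U: distinct word forms (case-folded, an assumption)
    if n < 2 or u == n:
        # log(N) is 0 for N = 1, and the denominator is log(1) = 0 when U = N.
        raise ValueError("OVIX is undefined for fewer than two tokens or when no token is repeated")
    return math.log(n) / math.log(2 - math.log(u) / math.log(n))


# Four running words, two distinct forms.
value = ovix(["katt", "hund", "katt", "hund"])
```

The logarithmic form is meant to make the index less dependent on text length than a plain type/token ratio, which matters when comparing subcorpora of different sizes.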
Material
In order to investigate the parameters vocabulary load, sentence structure, idea density and human interest - instantiated in measurable units as suggested above - we made a corpus-based study of two text types. We used the LSBarT corpus, consisting of easy-to-read texts and children's books, and SUC, a balanced corpus of 1 million words of written Swedish from the 1990s, in order to test how different features were realized across genres and text types.
LSBarT
LSBarT is a Swedish corpus of 1.5 million words, divided into four subcorpora or genres: easy-to-read News text, Fiction, Community information and ordinary Children's Fiction. Published easy-to-read texts, by definition shorter than most written material, are still rather scarce, which obviously restrains the collection work. A crucial question at this point regards representativity and to what degree the samples cover the total range of the population. Can we trust the material included in the corpus to be representative of the texts we want to examine? For easy-to-read texts, the range is not very comprehensive, due to the fact that the supply is very small. It is obvious that the authors concentrate on writing texts in order to meet specific requirements, and that the driving force is not authorship in itself.
The modest size of LSBarT was therefore compensated for by letting text representativity be decisive during compilation. As an approximation of the publicly available easy-to-read texts, we estimate that roughly the same amounts of News and Community information are produced, but twice as much fiction. News texts are from three sources. Web versions of two daily newspapers, 8 sidor and Klartext, constitute the major part. A small amount of text intended to be read by immigrants, Invandrartidningen, is also included. Community information consists mainly of texts published by the government, municipalities and public authorities on the Internet, dealing with citizenship and public services. Fiction texts are gathered from two publishing houses, one aimed at easy-to-read literature, Lättlästförlaget, and the other at ordinary children's literature, BonnierCarlsen, with a target group of readers aged 6 to 15 years.
The corpus is tagged with parts of speech with the TnT-tagger (Brants, 2000), manually corrected and annotated with lemma forms. In order to facilitate the comparative studies of the two corpora, we used a subset of LSBarT, containing slightly above 1 million words. The composition of this subcorpus is displayed in Table 1.
             All texts   Easy-to-read texts                             Ordinary children's books
                         Fiction   News texts   Community information   Fiction
All words    1,142,666   223,110   353,049      201,769                 364,738
Percentage               19.5%     30.9%        17.7%                   31.9%
Table 1. Composition of the LSBarT subcorpus
The reference corpus SUC
The Swedish corpus SUC (Källgren, 1998) of 1 million words is represented in SGML/XML format and annotated with parts of speech, morphological analysis and lemma (base form), as well as a range of structural tags and functionally interpreted tags. All the texts in the corpus were written in the 1990s and are balanced according to genre, following the principles used in the Brown (Francis & Kucera, 1964) and LOB (Johansson et al., 1978) corpora. SUC was compiled broadly to mirror what a Swedish person might have read in the early 1990s, and all texts in the corpus were originally written in Swedish.
Method
Chall's categories were used as a repository for adding further parameters to Swedish readability studies. In addition to LIX, the vocabulary load is calculated by the number of extra long words (≥ 14 characters), indicating the proportion of long compounds. By measuring the sentence length, we hope to get an indication of sentence structure. Idea density is indicated by measuring lexical variation OVIX