Seminars from previous years are still being added; in the meantime, the archive remains available on the old website.
The increasing availability of digitized texts is good news for scholars, including in the Humanities and Social Sciences, who are interested in analysing large quantities of texts in both quantitative and qualitative ways. Unfortunately, these texts are often digitized using Optical Character Recognition (OCR) software, with variable success. This is especially an issue for historical texts, which often attract lower quality output. In this presentation, I explore the impact of OCR errors on two common collocation statistics (Mutual Information and Log Likelihood), comparing statistics generated from a set of matching samples, including one hand-corrected ('gold') sample, an uncorrected sample, and an automatically corrected sample. These matching samples are excerpts from the British Library's c19th British Newspapers (part 1) collection. I find a clear effect of OCR errors especially on larger collocation spans and describe this effect in terms of differences between the values of the statistics generated in the various samples, rankings of the statistics in the various samples, and rates of false positives and false negatives in the uncorrected and automatically corrected samples when compared with the gold sample.
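As a rough illustration of the two collocation statistics named in the abstract above, the sketch below computes Mutual Information and Log Likelihood from a 2x2 contingency table. It is not the speaker's implementation: the function names and the counts are invented for illustration, the formulas are the common textbook versions (pointwise MI and the G2 log-likelihood ratio), and no adjustment is made for the size of the collocation span.

```python
import math


def mutual_information(o11, node_freq, colloc_freq, n):
    """Pointwise MI: log2 of observed over expected co-occurrence frequency."""
    e11 = node_freq * colloc_freq / n  # expected frequency under independence
    return math.log2(o11 / e11)


def log_likelihood(o11, node_freq, colloc_freq, n):
    """G2 log-likelihood ratio over the full 2x2 contingency table."""
    # Observed cells: node with/without collocate, collocate without node, neither.
    o12 = node_freq - o11
    o21 = colloc_freq - o11
    o22 = n - node_freq - colloc_freq + o11
    observed = ((o11, o12), (o21, o22))
    row_totals = (node_freq, n - node_freq)
    col_totals = (colloc_freq, n - colloc_freq)

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                e = row_totals[i] * col_totals[j] / n
                g2 += o * math.log(o / e)
    return 2 * g2


# Hypothetical counts: the same word pair in a 'gold' and an uncorrected sample,
# where OCR errors have hidden three of the ten co-occurrences.
gold_mi = mutual_information(10, 100, 100, 10_000)
ocr_mi = mutual_information(7, 100, 100, 10_000)
gold_ll = log_likelihood(10, 100, 100, 10_000)
ocr_ll = log_likelihood(7, 100, 100, 10_000)
```

With these made-up counts, both statistics come out lower for the uncorrected sample than for the gold sample, which is the kind of difference in values and rankings the talk sets out to measure.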
The talk examines the ideological functions of metonymy by exploring metonymic location names in Chinese and American political discourse. Based on evidence extracted from self-built corpora of news articles on Sino-US relations, with the aid of the USAS Semantic Tagger and the Concordance Tool, the study investigates the ideological motivation underlying the use of PLACE metonymies. Following MIP (Zhang, et al., 2011), it identifies and summarises the (sub)models of PLACE metonymy, namely PLACE FOR INSTITUTION, PLACE FOR INHABITANT, PLACE FOR POWER and PLACE FOR PRODUCT. A statistical analysis of these metonymies, based on the chi-square test and on conceptual and discursive parameters, presents their distributional features at finer granularity, along with the commonalities and differences in the coverage of the same metonymies in the two discourse communities. The findings reveal that the metonymies, while operating on an embodied basis, are tied up with and reinforce the hidden ideologies of the discourse communities. They thus carry the political stance or disposition and the evaluative partiality of the two media groups, implicitly exercising a manipulative check on their audiences.
The talk is based on collaborative research with Yishan Zhou, a postgraduate at the College of Foreign Languages, University of Shanghai for Science and Technology (USST), China.
Identity resolution capability for social networking profiles is important for a range of purposes, from open-source intelligence applications to forming semantic web connections. Yet research in this area is hampered by the lack of access to ground-truth data linking the identities of profiles from different networks. Almost all data sources previously used by researchers are no longer available, and historic datasets are both of decreasing relevance to the modern social networking landscape and ethically troublesome regarding the preservation and publication of personal data. We present and evaluate a method which provides researchers in identity resolution with easy access to a realistically-challenging labelled dataset of online profiles, drawing on four of the currently largest and most influential online social networks. We validate the comparability of samples drawn through this method and discuss the implications of this mechanism for researchers and potential alternatives and extensions.
This is a joint talk held in conjunction with The FORGE (The Forensic Linguistics Research Group).
The main aim of this study is to automatically find and classify elements that signal modality in Spanish and Japanese sentences, taking into account theoretical and empirical information. In an effort to bring together different disciplines such as typology, logic, corpus linguistics and computational linguistics, we aim to answer three main questions: (1) What is the best definition and classification of modality for cross-linguistic computational work? (2) How is modality used in spoken Spanish and Japanese, and how are modal markers modified in discourse? (3) How can we formalise this information into a program that can annotate modals automatically in new texts?
The result is a rule-based program that outputs XML in which markers are annotated and classified in the same way for both languages. Modality is seen from the logic perspective as a semantic feature that adds necessity or possibility meanings to the predicate of the sentence using a series of auxiliaries. The corpus shows how these auxiliaries can be affected by negation, ellipsis, syntactic separation and ambiguity, which need to be detected by the program for the sake of precision and recall.
The corpus study also provides information about modality usage, and reveals that its frequency is correlated with the type of interaction, possibly related to social constraints. Monologues yield similar results in both languages, as do the non-linguistic factors of speaker sex and age. Dialogues, on the other hand, show a completely different picture: necessity predominates in Spanish, whereas possibility is slightly more frequent in Japanese.
How do we talk about the books we read together? How do teachers guide reading of study texts in schools? This seminar reports on the continuing British Academy-funded project Literature's Lasting Impression which investigates shared reading of novels and reading aloud in primary schools, secondary schools, universities and public reading groups. In particular, it will attend to teachers' action of quoting study texts aloud during collective reading activity in primary and secondary classrooms. What functions does this appear to serve? Informed by Conversation Analysis, the presentation also extends exploration of quoting aloud as distinct from quotation in writing, which I have termed echo in earlier work investigating pupils' responses to poetry. Drawing on my role as a teacher educator in the field of Secondary English, I will also reflect on methodological issues and the role of empirical research in teacher education and the pedagogy of literary reading. How can transcripts of classroom interaction be used to refine and improve teacher education, and what is the potential of a corpus dedicated to this distinctive form of spoken language?
This talk examines the forces that trigger two word-order designs in English: (i) object-verb sentences (*?The teacher the student hit) and (ii) adjunct-complement vs complement-adjunct constructions (He taught yesterday Maths vs He taught Maths yesterday). The study focuses both on the diachronic tendencies observed in the data in Middle English, Early Modern and Late Modern English, and on their synchronic design in Present-Day English. The approach is corpus-based (or even corpus-driven) and the data, representing different periods and text types, are taken from a number of corpora (the Penn-Helsinki Parsed Corpus of Middle English, the Penn-Helsinki Parsed Corpus of Early Modern English, the Penn Parsed Corpus of Modern British English and the British National Corpus, among others). The aim of this talk is to look at the consequences that the placement of major constituents (e.g. complements) has for the parsing of the phrases in which they occur. I examine whether the data are in keeping with determinants of word order like complements-first (complement plus adjunct) and end-weight in the periods under investigation. Some statistical analyses will help determine the explanatory power of such determinants.
On 12th April 2003, 83.6% of Hungarians voted in support of Hungary joining the European Union. This decisive result followed a massive parliamentary discussion about the issue and guaranteed Hungary access to the EU. But what was the attitude of Hungarian MPs towards the European Union? And how was Hungarian identity shaped in discourses about EU membership? In this talk I will present the preliminary results of a corpus-assisted study of Hungarian parliamentary speeches delivered between 1998 and 2003. After a brief historical introduction I will first outline the methodological approach I adopted to sketch attitudes and identities by means of collocation analysis. I will then describe the data I employed, namely the self-collected HUNPOL corpus. Finally, using the GraphColl software, I will show how semantic and discourse prosody can highlight the Hungarian politicians' stance regarding the European Union and the status they posit for themselves in a (possibly) new political dimension.
The SAMS project (Software Architecture for Mental health Self-management) is investigating whether monitoring data from everyday computer-use activity can be used to effectively detect subtle signs of cognitive impairment that may indicate the early stages of Alzheimer's disease.
In this talk I will discuss the SAMS project, the collection of data and text from participants, and our approach to mining the text to infer cognitive health. During the SAMS project, bespoke software installed on the participants' home PCs is used to collect data and text. The collection software passively and unobtrusively collects many forms of data and text from the participants' PCs (including typed email and document text), which are securely logged and later transferred to our server for analysis. The analysis consists of various data and text mining techniques that attempt to map trends and patterns in the data to clinical indicators of Alzheimer's disease, e.g. working memory and motor control.
Tool usage within the SAMS project will also be discussed, including the development of the bespoke collection and analysis software, as well as existing tools that are re-used (Part of Speech Tagger, Semantic Tagger).
In this talk, I will present ongoing work on the development of the UCREL multilingual semantic annotation system. Over the past years, the original UCREL English semantic tagger has been adjusted and extended to cover more and more languages, including Finnish, Italian, Portuguese, Chinese, Spanish, French etc. Currently, a major project, CorCenCC, is underway in which a Welsh semantic tagger is being developed in collaboration with Welsh project partners. This tool is useful for various kinds of corpus-based research, such as cross-lingual studies. I'll discuss the linguistic resources involved in the development of the tool, and introduce a GUI tool which links the multilingual tagger web services and helps researchers to process corpus data conveniently.
This talk will present an overview of newly available resources from the Mellon-funded Visualising English Print project (University of Wisconsin-Madison, the University of Strathclyde and the Folger Shakespeare Library) for engaging with the Text Creation Partnership texts. She will discuss some of the curation and standardisation principles guiding the project and how we envision scholars will use our resources, and present a case study of how to use our resources to conduct an analysis of Early Modern scientific writing.
Our paper will explore the ways in which religious identities are negotiated in a setting characterised by religious diversity and proximity: Yorubaland in South West Nigeria. We will explore how interreligious relationships are discursively constructed in extensive survey data (2,819 respondents in total) collected as part of an anthropological project focussing on the coexistence of Islam, Christianity and traditional practice in Yoruba-speaking parts of southwest Nigeria: 'Knowing each other: Everyday religious encounters, social identities and tolerance in southwest Nigeria'. Corpus tools and techniques will be used to examine the 1,535 questionnaires filled in English, particularly the answers to open-ended questions (our corpus). The premise is that by exploring discursive choices made by Christian and Muslim respondents in this corpus, we can gain insights into Yoruba Muslims' and Christians' perceptions of themselves and each other and their experiences of inter-religious encounters.
Owing to the focus of the paper, two sub-corpora of the above-mentioned corpus were compiled: one with all answers by respondents of Muslim faith and another with all answers by respondents of Christian faith. We will use four-grams for each of these corpora to show how corpus-assisted investigations into phraseology have helped us gain insights into the data which traditional anthropological methods alone would not have allowed. Our findings will concern, for example, the specific boundaries our Christian and Muslim respondents draw around their religious behaviour and their shared understanding of religion.
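A minimal sketch of the kind of four-gram extraction described above, counting four-word sequences separately for each sub-corpus. This is not the authors' tooling: the function names, the whitespace tokenisation and the two sample answers are all hypothetical, and a real study would use a dedicated corpus tool with proper tokenisation.

```python
from collections import Counter


def four_grams(tokens):
    """Return every run of four consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3)]


def count_four_grams(answers):
    """Count four-grams across a list of answers, never spanning two answers."""
    counts = Counter()
    for answer in answers:
        tokens = answer.lower().split()  # naive tokenisation (assumption)
        counts.update(four_grams(tokens))
    return counts


# Invented answers standing in for one sub-corpus of questionnaire responses.
sample_answers = [
    "We live together in peace",
    "we live together in harmony",
]
sample_counts = count_four_grams(sample_answers)
top_four_grams = sample_counts.most_common(1)
```

Counting within each answer, rather than over one concatenated text, avoids four-grams that straddle two different respondents. Running the same function over each sub-corpus gives two frequency lists that can then be compared.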
English texts on Chinese governmental websites are often criticised for being 'Chinglish' or 'lifeless'. This project investigates how English versions of Chinese governmental websites can improve their stylistic quality. The project is a computational stylistic comparison between English texts on Chinese governmental websites and English texts on UK and US governmental websites. The approach is corpus-based and employs Biber's (1988) multidimensional analysis. A corpus of websites (including two subcorpora) had previously been downloaded using the wget -m method. Perl scripts were used to extract text content from web pages to form a txt file for each website, and word frequency lists and trigrams have also been extracted. Keyword lists for the two subcorpora have been generated based on a COCA word frequency list. Several issues remain to be dealt with before further analysis can be conducted, including: whether it is possible to separate 'real content' from purely repetitive content (such as menus, navigation, copyright) when data comes from web pages; the alternatives to manual annotation when this is not a practical option given the massive size of the corpus; and how to identify which features to consider to make the comparison more significant.
This talk will provide a basic introduction to FireAnt, a freeware tool that I have co-developed with Laurence Anthony (Waseda University). FireAnt offers three main utilities: the live collection of real-time tweets; the ability to filter that (and many other kinds of) data based on user-defined parameters; and the ability to export the data in formats suitable for corpus tools, network graphing, time-series analysis, and so forth. I'll demonstrate each of these steps in turn, and provide suggestions along the way for possible types of investigation that FireAnt has been and can be used for. There will be time at the end for questions.