Cross-lingual (English-Urdu) text reuse detection

Muhammad Sharjeel

SCC, Lancaster University

Cross-lingual text reuse occurs when pre-existing texts in one language are used to create new texts in another language. With the growing amount of digital text readily available in multiple languages on the Web and social media, and with freely accessible, efficient machine translation systems, cross-lingual text reuse has increased to an unprecedented scale. Although researchers have proposed methods to detect it, a major obstacle is the lack of large-scale gold-standard evaluation resources built on real cases.

This talk has two objectives: (1) to present the development of a large-scale benchmark evaluation corpus, the TREU (Text Reuse in English-Urdu language) Corpus, and (2) to show how the corpus can be used in the development and evaluation of cross-lingual (English-Urdu) text reuse detection systems.

The TREU Corpus is the first cross-lingual, cross-script text reuse corpus developed for a significantly under-resourced language pair (English-Urdu). It contains text from the journalism domain, manually annotated at three levels of text reuse: Wholly Derived, Partially Derived, and Non-Derived. It includes real cases of document-level text reuse from English to Urdu (4,514 texts in total). The corpus has been evaluated using a diverse range of methods of three types: Translation plus Mono-lingual Analysis (T+MA), cross-lingual embeddings, and the cross-lingual Vector Space Model. Applying these methods to the TREU Corpus demonstrates its usefulness and shows how it can be used to develop automatic methods for measuring cross-lingual (English-Urdu) text reuse at the document level.
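As a rough illustration of the T+MA idea only (not the method evaluated in the talk), the sketch below translates one text into the other's language and then applies a mono-lingual word n-gram containment measure. The function names, the `translate` callable, and the thresholds 0.8 and 0.3 are all hypothetical assumptions made for this example; a real system would plug in an actual machine translation system and tune the thresholds on annotated data.

```python
from typing import Callable, Set, Tuple


def word_ngrams(text: str, n: int = 1) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams of a whitespace-tokenised text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(derived: str, source: str, n: int = 1) -> float:
    """Fraction of the derived text's n-grams that also occur in the source."""
    derived_grams = word_ngrams(derived, n)
    if not derived_grams:
        return 0.0
    return len(derived_grams & word_ngrams(source, n)) / len(derived_grams)


def reuse_level(score: float, wholly: float = 0.8, partially: float = 0.3) -> str:
    """Map a similarity score to one of the three annotation levels.

    The thresholds are illustrative assumptions, not values from the corpus.
    """
    if score >= wholly:
        return "Wholly Derived"
    if score >= partially:
        return "Partially Derived"
    return "Non-Derived"


def t_plus_ma(derived: str, source: str,
              translate: Callable[[str], str], n: int = 1) -> str:
    """Translate the derived text into the source language, then compare.

    `translate` is a placeholder for any machine translation function.
    """
    return reuse_level(containment(translate(derived), source, n))


# Toy usage: the "translation" is the identity because both texts are
# already in English here; a real pipeline would translate Urdu to English.
identity = lambda text: text
print(t_plus_ma("the cat ran away", "the cat sat", identity))
```

Containment is asymmetric by design: it asks how much of the possibly derived text is covered by the source, which suits document-level reuse where the derived text may be much shorter or longer than the original.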

Although the talk focuses on the English-Urdu language pair, the proposed corpus and methods could serve as a framework for future research on other languages and language pairs.

Week 18 2019/2020

Friday 6th March 2020

Charles Carter A19