COrpus of Urdu News TExt Reuse (COUNTER)

The COUNTER corpus can be downloaded from this page. The corpus contains 600 source-derived document pairs collected from the field of journalism. We believe these documents will be useful for evaluating mono-lingual text reuse detection systems in general, and those for the Urdu language in particular.

Download the corpus: COUNTER.zip

Licence of the Corpus

The corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Corpus Statistics

The corpus has 600 source and 600 derived documents. In total, it contains 275,387 words (tokens), 21,426 unique words and 10,841 sentences. It has been manually annotated at document level with three levels of reuse: wholly derived (135), partially derived (288) and non-derived (177).
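
As a quick sanity check on the figures above, the three document-level labels should sum to the 600 document pairs, and their relative shares can be computed directly. The following is a minimal sketch that uses only the counts stated on this page; the names `label_counts` and `label_shares` are illustrative and are not part of the corpus distribution.

```python
# Document-level annotation counts as stated on this page.
label_counts = {
    "wholly derived": 135,
    "partially derived": 288,
    "non-derived": 177,
}


def label_shares(counts):
    """Return each label's share of the total, as a percentage."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


total_pairs = sum(label_counts.values())
shares = label_shares(label_counts)

print(f"total pairs: {total_pairs}")
for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
```

The partially derived class is the largest of the three, at just under half of the pairs, so the label distribution is not balanced.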

Acknowledging the Corpus

We release this corpus with the intention of fostering research on mono-lingual text reuse detection systems, specifically for the Urdu language. If you use the corpus, we kindly ask you to cite it in your publications as follows:

Muhammad, S., Nawab, R. M. A., and Rayson, P. (2016). COUNTER - corpus of Urdu news text reuse. Language Resources and Evaluation. DOI: 10.1007/s10579-016-9367-2

The corpus also has a permanent DOI from Lancaster University: 10.17635/lancaster/researchdata/96

Contact Information

If you have any questions, suggestions or other feedback, please send an email to ucrel@lancaster.ac.uk

Please see the README file included with the corpus for further information. Note: The news agency and newspaper texts in the corpus have been collected from a wide range of sources (online, email, print). The user acknowledges that the use of these news documents is restricted to research and/or academic purposes only.

