Text Reuse @ UCREL & NLPT

Cross-Language English-Urdu Corpus (CLEU)

The Cross-Language English-Urdu Corpus (CLEU) can be downloaded from this page. Its source texts are in English and the derived texts are in Urdu. It contains 3,235 sentence/passage pairs in total, each manually tagged with one of three categories: near copy, paraphrased copy, or independently written.

Download the corpus: CLEU.zip

Licence of the Corpus

The corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Corpus Statistics

The corpus contains 3,235 pairs of real examples of cross-language text reuse at sentence/passage level (the source text is in English while the derived text is in Urdu). Each sentence/passage pair is categorised as (i) Near Copy (NC) (751 pairs), (ii) Paraphrased Copy (PC) (1,751 pairs), or (iii) Independently Written (IW) (733 pairs). A spreadsheet describing detailed results from a 300-pair sample subset is also available.
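As a quick sanity check on the figures above, the following minimal sketch recomputes the total and the share of each category from the reported counts. It uses only the numbers stated on this page; the variable names (`counts`, `shares`) are illustrative and are not part of the corpus distribution, and nothing here assumes anything about the file layout inside CLEU.zip.

```python
# Pair counts per category, as reported in the corpus statistics.
counts = {
    "NC": 751,   # Near Copy
    "PC": 1751,  # Paraphrased Copy
    "IW": 733,   # Independently Written
}

# The three categories should add up to the stated corpus size.
total = sum(counts.values())

# Percentage share of each category in the whole corpus.
shares = {label: 100 * n / total for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label}: {n} pairs ({shares[label]:.1f}%)")
print(f"Total: {total} pairs")
```

The counts sum to the stated 3,235 pairs, and Paraphrased Copy makes up just over half of the corpus. That imbalance may be worth accounting for when choosing evaluation metrics for a three-way classifier.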

Acknowledging the Corpus

We release this corpus with the intention of fostering cross-lingual text reuse research, specifically for the Urdu language. If you use the corpus, we kindly ask you to cite it in your publications as follows:

Publication under review.

The corpus also has a permanent DOI from Lancaster University: 10.17635/lancaster/researchdata/176

Contact Information

If you have any questions, suggestions, or other feedback, please send an email to ucrel@lancaster.ac.uk

This page last modified on Wednesday 25 June 2025 at 9:14 pm.