Text Reuse @ UCREL & NLPT

Urdu Short Text Reuse Corpus (USTRC)

The Urdu Short Text Reuse Corpus (USTRC) can be downloaded from this page. USTRC is a gold-standard benchmark corpus for measuring short text reuse in the Urdu language. It contains 2,684 source-reused short text pairs in total.

Download the corpus: USTRC.zip

Licence of the Corpus

The corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Corpus Statistics

The corpus contains 2,684 source-reused short text pairs. Each pair falls into one of three categories: (1) verbatim (496 pairs), (2) paraphrased (1,329 pairs), and (3) independently written (859 pairs). A spreadsheet is also available that gives detailed results from a paper under review.
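As a quick sanity check on the figures above, the following minimal sketch recomputes the total and each category's share of the corpus. The dictionary keys and the helper name `category_shares` are illustrative assumptions, not part of the corpus distribution; only the three counts come from this page.

```python
# Category counts as stated in the corpus statistics on this page.
# The key names are illustrative labels, not identifiers from the corpus files.
category_counts = {
    "verbatim": 496,
    "paraphrased": 1329,
    "independently written": 859,
}


def category_shares(counts):
    """Return each category's share of all pairs, as a percentage (1 decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_pairs = sum(category_counts.values())
shares = category_shares(category_counts)

for name, count in category_counts.items():
    print(f"{name}: {count} pairs ({shares[name]}%)")
print(f"total: {total_pairs} pairs")
```

The three counts sum to the stated 2,684 pairs. Paraphrased pairs make up roughly half of the corpus (about 49.5%), which may be worth keeping in mind when splitting the data or reporting per-class results.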

Acknowledging the Corpus

We release this corpus with the intention that it will foster text reuse research, specifically for the Urdu language. If you use the corpus, we kindly ask you to cite it in your publications as follows:

@ARTICLE{8118088, 
author={S. Sameen and M. Sharjeel and R. M. A. Nawab and P. Rayson and I. Muneer}, 
journal={IEEE Access}, 
title={Measuring Short Text Reuse for the Urdu Language}, 
year={2018}, 
volume={6}, 
number={}, 
pages={7412-7421}, 
keywords={Benchmark testing;Gold;Guidelines;Natural language processing;Plagiarism;Standards;Urdu corpus;Urdu text reuse detection;natural language processing}, 
doi={10.1109/ACCESS.2017.2776842}, 
ISSN={}, 
month={}
}

The corpus also has a permanent DOI from Lancaster University: 10.17635/lancaster/researchdata/192

Contact Information

If you have any questions, suggestions, or other feedback, please send an email to ucrel@lancaster.ac.uk.

This page was last modified on Wednesday 25 June 2025 at 9:14 pm.