Urdu Short Text Reuse Corpus (USTRC)

USTRC can be downloaded from this page. It is a gold-standard benchmark corpus for measuring short text reuse in the Urdu language, and it contains 2,684 source-reused short text pairs in total.

Download the corpus: USTRC.zip

Licence of the Corpus

The corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Corpus Statistics

The corpus contains 2,684 source-reused short text pairs. Each pair falls into one of three categories: (1) verbatim (496 pairs), (2) paraphrased (1,329 pairs), and (3) independently written (859 pairs). A spreadsheet is also available that gives detailed results from a paper under review.
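As a quick sanity check on the figures above, the following minimal sketch sums the three category counts and prints each category's share of the corpus. It uses only the numbers stated on this page; the dictionary keys and variable names are illustrative and do not reflect the file layout inside USTRC.zip.

```python
# Category counts as stated in the corpus statistics on this page.
# The labels are illustrative; they are not taken from the corpus files.
counts = {
    "verbatim": 496,
    "paraphrased": 1329,
    "independently written": 859,
}

# Total number of source-reused pairs across all three categories.
total = sum(counts.values())

# Share of each category as a percentage of all pairs.
shares = {label: 100 * n / total for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:<22} {n:>5}  {shares[label]:5.1f}%")
print(f"{'total':<22} {total:>5}")
```

The three counts add up to the stated total of 2,684 pairs, and the paraphrased category is the largest at just under half of the corpus. This imbalance may be worth keeping in mind when choosing evaluation measures.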

Acknowledging the Corpus

We release this corpus with the intention of fostering text-reuse research, specifically for the Urdu language. If you use the corpus, we kindly ask you to refer to it in your publications as follows:

The corpus also has a permanent DOI from Lancaster University: 10.17635/lancaster/researchdata/192

Contact Information

If you have any questions, suggestions, or other feedback, please send an email to ucrel@lancaster.ac.uk

