Text Reuse @ UCREL & NLPT

Urdu Short Text Reuse Corpus (USTRC)

The USTRC corpus can be downloaded from this page. USTRC is a gold-standard benchmark corpus for measuring short text reuse in the Urdu language. It contains 2,684 source-reused short text pairs in total.

Download the corpus: USTRC.zip

Licence of the Corpus

The corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Corpus Statistics

The corpus contains 2,684 source-reused short text pairs. Each pair falls into one of three categories: (1) verbatim (496 pairs), (2) paraphrased (1,329 pairs), and (3) independently written (859 pairs). A spreadsheet describing detailed results from a paper under review is also available.
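
The category counts above add up to the stated total, and the three categories are not equally sized. The short Python sketch below checks the total and computes each category's share of the corpus. The counts are taken from this page; the variable and function names (`counts`, `class_shares`) are illustrative only and are not part of the corpus distribution.

```python
# Pair counts per category, as stated in the corpus statistics on this page.
counts = {
    "verbatim": 496,
    "paraphrased": 1329,
    "independently written": 859,
}

# Total number of source-reused pairs across all three categories.
total = sum(counts.values())


def class_shares(counts):
    """Return each category's share of all pairs as a fraction of 1."""
    n_pairs = sum(counts.values())
    return {label: n / n_pairs for label, n in counts.items()}


shares = class_shares(counts)

if __name__ == "__main__":
    print(f"total pairs: {total}")
    for label, share in shares.items():
        print(f"{label}: {counts[label]} pairs ({share:.1%})")
```

Because paraphrased pairs are the largest category, anyone evaluating a classifier on this corpus may want to report per-category scores alongside overall accuracy.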

Acknowledging the Corpus

We release this corpus with the intention that it will foster text reuse research, specifically for the Urdu language. If you use the corpus, we kindly ask you to cite it in your publications as follows:

author={S. Sameen and M. Sharjeel and R. M. A. Nawab and P. Rayson and I. Muneer}, 
journal={IEEE Access}, 
title={Measuring Short Text Reuse for the Urdu Language}, 
keywords={Benchmark testing;Gold;Guidelines;Natural language processing;Plagiarism;Standards;Urdu corpus;Urdu text reuse detection;natural language processing}, 

The corpus also has a permanent DOI from Lancaster University: 10.17635/lancaster/researchdata/192

Contact Information

If you have any questions, suggestions, or other feedback, please send an email to ucrel@lancaster.ac.uk


This page was last modified on Friday 13 August 2021 at 4:43 pm.