Text Reuse @ UCREL & NLPT
Urdu Short Text Reuse Corpus (USTRC)

The USTRC corpus can be downloaded from this page. USTRC is a gold standard benchmark corpus for measuring short text reuse in the Urdu language. It contains a total of 2,684 source-reused short text pairs.

Download the corpus: USTRC.zip

Licence of the Corpus

The corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Corpus Statistics

The corpus contains 2,684 source-reused short text pairs. Each pair falls into one of three categories: (1) verbatim (496 pairs), (2) paraphrased (1,329 pairs), and (3) independently written (859 pairs). A spreadsheet is available which describes detailed results from a paper under review.

Acknowledging the Corpus

We release this corpus with the intention that it will foster text reuse research, specifically for the Urdu language. If you use the corpus, we kindly ask you to refer to it in your publications as follows:
@ARTICLE{8118088,
  author={S. Sameen and M. Sharjeel and R. M. A. Nawab and P. Rayson and I. Muneer},
  journal={IEEE Access},
  title={Measuring Short Text Reuse for the Urdu Language},
  year={2018},
  volume={6},
  pages={7412-7421},
  keywords={Benchmark testing;Gold;Guidelines;Natural language processing;Plagiarism;Standards;Urdu corpus;Urdu text reuse detection;natural language processing},
  doi={10.1109/ACCESS.2017.2776842},
}

The corpus also has a permanent DOI from Lancaster University: 10.17635/lancaster/researchdata/192

Contact Information

If you have any questions, suggestions, or other feedback, please send an email to ucrel@lancaster.ac.uk
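
As a quick check of the figures under Corpus Statistics, the following minimal sketch sums the three category counts and prints each category's share of the corpus. The counts are taken from this page; the variable names are illustrative only, and the sketch does not read USTRC.zip, since the file layout of the archive is not described here.

```python
# Category counts as stated under Corpus Statistics on this page.
counts = {
    "verbatim": 496,
    "paraphrased": 1329,
    "independently written": 859,
}

# The three categories should add up to the reported corpus size.
total = sum(counts.values())

# Share of each category in the whole corpus.
shares = {label: n / total for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:<22} {n:>5}  {shares[label]:.1%}")
print(f"{'total':<22} {total:>5}")
```

The shares show that the corpus is not evenly balanced across the three categories, which is worth keeping in mind when choosing evaluation measures for a text reuse classifier.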
This page was last modified on Wednesday 25 June 2025 at 9:14 pm.