Text Reuse @ UCREL & NLPT
Urdu Short Text Reuse Corpus (USTRC)

The USTRC corpus can be downloaded from this page. USTRC is a gold standard benchmark corpus for measuring short text reuse in the Urdu language. It contains a total of 2,684 source-reused short text pairs.

Download the corpus: USTRC.zip

Licence of the Corpus

The corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Corpus Statistics

The corpus contains 2,684 source-reused short text pairs. Each pair falls into one of three categories:

(1) verbatim (496 pairs)
(2) paraphrased (1,329 pairs)
(3) independently written (859 pairs)

A spreadsheet is available which describes detailed results from a paper under review.

Acknowledging the Corpus

We release this corpus with the intention that it will foster text reuse research, specifically for the Urdu language. If you use the corpus, we kindly ask you to refer to it in your publications as follows:

@ARTICLE{8118088,
  author={S. Sameen and M. Sharjeel and R. M. A. Nawab and P. Rayson and I. Muneer},
  journal={IEEE Access},
  title={Measuring Short Text Reuse for the Urdu Language},
  year={2018},
  volume={6},
  number={},
  pages={7412-7421},
  keywords={Benchmark testing;Gold;Guidelines;Natural language processing;Plagiarism;Standards;Urdu corpus;Urdu text reuse detection;natural language processing},
  doi={10.1109/ACCESS.2017.2776842},
  ISSN={},
  month={},}

The corpus also has a permanent DOI from Lancaster University: 10.17635/lancaster/researchdata/192

Contact Information

If you have any questions, suggestions, or other feedback, please send an email to ucrel@lancaster.ac.uk
This page was last modified on Friday 13 August 2021 at 4:43 pm.