Text Reuse @ UCREL & NLPT
COrpus of Urdu News TExt Reuse (COUNTER)

The COUNTER corpus can be downloaded from this page. The corpus contains 600 source-derived document pairs collected from the field of journalism. We believe these documents will be useful for evaluating mono-lingual text reuse detection systems in general, and for the Urdu language in particular.

Download the corpus: COUNTER.zip

Licence of the Corpus

The corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Corpus Statistics

The corpus has 600 source and 600 derived documents. In total it contains 275,387 words (tokens), 21,426 unique words and 10,841 sentences. It has been manually annotated at document level with three levels of reuse: wholly derived (135), partially derived (288) and non derived (177).

Acknowledging the Corpus

We release this corpus with the intention of fostering research on mono-lingual text reuse systems, specifically for the Urdu language. If you use the corpus, we kindly ask you to cite it in your publications as follows:

Sharjeel, M., Nawab, R. M. A., and Rayson, P. (2016). COUNTER - corpus of Urdu news text reuse. Language Resources and Evaluation. DOI: 10.1007/s10579-016-9367-2
Bibtex:

@Article{Sharjeel2016,
  author="Sharjeel, Muhammad and Nawab, Rao Muhammad Adeel and Rayson, Paul",
  title="COUNTER: corpus of Urdu news text reuse",
  journal="Language Resources and Evaluation",
  year="2016",
  pages="1--27",
  issn="1574-0218",
  doi="10.1007/s10579-016-9367-2",
  url="http://dx.doi.org/10.1007/s10579-016-9367-2"
}

The corpus also has a permanent DOI from Lancaster University: 10.17635/lancaster/researchdata/96

Contact Information

If you have any questions, suggestions or other feedback, please send an email to ucrel@lancaster.ac.uk. Please see the README file included with the corpus for further information.

Note: The news agency and newspaper texts in the corpus were collected from a wide range of sources (online, email, print). The user acknowledges that the use of these news documents is restricted to research and/or academic purposes only.
This page last modified on Friday 13 August 2021 at 4:43 pm.