Text Reuse @ UCREL & NLPT
Cross-Language English-Urdu Corpus (CLEU)

The CLEU corpus can be downloaded from this page. In the Cross-Language English-Urdu Corpus (CLEU), the source text is in English and the derived text is in Urdu. It contains 3,235 sentence/passage pairs in total, each manually tagged with one of three categories: near copy, paraphrased copy, or independently written.

Download the corpus: CLEU.zip

Licence of the Corpus

The corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Corpus Statistics

The corpus contains 3,235 pairs of real examples of cross-language text reuse at sentence/passage level (the source text is in English while the derived text is in Urdu). Each sentence/passage pair is categorised as (i) Near Copy (NC) (751 pairs), (ii) Paraphrased Copy (PC) (1,751 pairs), or (iii) Independently Written (IW) (733 pairs). A spreadsheet is available which describes detailed results from a 300-pair sample subset.

Acknowledging the Corpus

We release this corpus with the intention that it will foster cross-lingual text-reuse research, specifically for the Urdu language. If you use the corpus, we kindly ask you to refer to it in your publications as follows: Publication under review.

The corpus also has a permanent DOI from Lancaster University: 10.17635/lancaster/researchdata/176

Contact Information

If you have any questions, suggestions, or other feedback, please send an email to ucrel@lancaster.ac.uk
This page last modified on Friday 13 August 2021 at 4:43 pm.