** CALL FOR PAPERS **

Multilingual Corpora: Linguistic Requirements and Technical Perspectives

A pre-conference workshop to be held at Corpus Linguistics 2003
Lancaster, 27 March 2003

http://www.comp.lancs.ac.uk/ucrel/cl2003
http://www.coli.uni-sb.de/mocu03

ORGANIZED BY:

Stella Neumann (Department of Applied Linguistics, Translation and Interpreting)
Silvia Hansen (Department of Computational Linguistics)
Saarland University, Saarbrücken, Germany

TOPIC AND MOTIVATION:

How do researchers go about building multilingual corpora? For the development of a linguistically interpreted corpus based on more than one language, there seem to be two methods. In the first, the multilingual corpus is split into monolingual sub-corpora, which are then annotated independently. In the second, one language serves as the basis for building and interpreting the multilingual corpus, while the other has to be adapted.

Both methods, however, are rather problematic. Neither takes sufficient account of the differences and commonalities between the languages in question at each stage of corpus-based research: the comparability of the corpus design, the different kinds of segmentation, the diverging annotation schemes, the corpus representations and, finally, querying across different languages, where the strands converge again. Mistakes or inconsistencies introduced at one stage of multilingual corpus development carry over into the following steps and lead to worse mistakes or inconsistencies there.

These problems not only arise at each methodological step; they also multiply with the growing complexity of the research design. If the research aims at interpreting linguistic data on several levels, cross-linguistic comparability has to be taken into account on each level.
The goal of the workshop is to bring together researchers who formulate specific requirements for working with corpora from a linguistic perspective and engineers who can offer technical solutions but need input from users to adapt their tools to the needs of linguists. Within this context, questions like the following are to be discussed:

- What happens if the units under investigation diverge on the different levels?
- At present, the preferred solution is to use XML at all stages and on all layers. But is this really practicable?
- Do linguists get along with stand-off mark-up?
- Is this maybe a technical compromise?

The workshop should result in a catalogue of requirements combined with technical solutions. It could thus serve as a starting point for the development of an annotation typology which takes into account different languages as well as different annotation layers. On the basis of this typology, the comparability of a multilingual multi-layer annotated corpus can be guaranteed. With this in mind, a multilingual corpus builder should be able to cope with possible problems in each of the corpus development steps explained above.

Papers are invited on the following topics:

- linguistic requirements in the different methodological steps
- state-of-the-art technical solutions
- international standards which facilitate the development and exchange of multilingual corpora

WORKSHOP PROFILE:

The workshop will run for a full day and comprise about 8-10 papers. Presentations are expected to be short, leaving enough time for discussion and assessment of the methodologies used as well as for the development of possible solutions. This already points to the workshop agenda: the first third will deal with linguistic fundamentals, the second third will discuss the technical aspects and the last third will provide a platform for integrating both perspectives. Workshop proceedings will be produced.

PROGRAMME COMMITTEE:

to be announced!
SCHEDULE:

20 January 2003: Deadline for submitted papers
21 February 2003: Notification of acceptance
7 March 2003: Camera ready copy
27 March 2003: Workshop

REGISTRATION:

Please refer to the main conference web page (http://www.comp.lancs.ac.uk/ucrel/cl2003) for registration details.

SUBMISSIONS:

Please send submissions in English as RTF or plain text files (preferably by email) to the address below. Paper length should be 8-10 pages, formatted in the same way as for the main conference (see http://www.comp.lancs.ac.uk/ucrel/cl2003/style.html for paper format guidelines).

Stella Neumann (st.neumann@mx.uni-saarland.de)
Department of Applied Linguistics, Translation and Interpreting (FR 4.6)
Saarland University
Postfach 15 11 50
66041 Saarbrücken
Germany