Shallow Processing of Large Corpora
SProLaC 2003

CORPUS LINGUISTICS 2003
Lancaster University (UK), 27 March, 2003

Workshop motivation and aims

Corpora have developed in two main directions:

large corpora of at least 100 million tokens, and
small corpora of up to 1 million tokens.

The data in the former are annotated only morpho-syntactically, while the data in the latter are assigned more detailed syntactic and/or semantic information. Both types of language corpora are valuable. The question arises, however, whether it is possible to build a really large corpus that is fully processed linguistically. Since this is a hard task that also involves metadata problems (theories, availability of appropriate tools, etc.), we put the stress on shallow parsing of unrestricted data. In our view, creating such a resource by automatic means is a task of great importance. It would serve as a basis for linguistic research, consistency checking and validation, large-scale applications in Information Retrieval and Information Extraction, testing of machine learning algorithms, and many other uses. The task involves several subtasks, such as combining diverse shallow processing techniques adequately in a sound and robust processor, and adapting shallow parsing approaches to later stages of deeper linguistic analysis.

The workshop aims to be a forum where researchers in Computational Corpus Linguistics and Language Engineering can present their work and discuss problems in the design, management, linguistic interpretation and exploration of unrestricted data from both perspectives.

We envisage a one-day workshop and 10-12 presentations.

Topics of interest:

design principles for shallow-parsed large corpora;
text segmentation and preprocessing;
definition of the connection between the levels of processing;
chunking and partial parsing of large amounts of text;
machine learning methods with large coverage;
software systems for managing and accessing shallow-parsed large corpora;
applications of shallow-parsed large corpora.

There will be a general discussion at the end of the workshop.

Important dates

Deadline for workshop abstract submission: 10th January 2003

Notification of acceptance: 3rd February 2003

Final version of paper for workshop proceedings: 3rd March 2003

Submissions

Papers should describe existing research connected to the topics of the workshop. Each presentation slot at the workshop will be 25 minutes long (20 minutes for the presentation and 5 minutes for questions and discussion). Each submission should include: title; author(s); affiliation(s); and the contact author's e-mail address, postal address, telephone and fax numbers. Abstracts (maximum 500 words, plain-text format) should be sent to:

Kiril Simov
Email: kivs@bultreebank.org

The final version of the accepted papers should not be longer than 4,000 words or 10 A4 pages. Instructions for formatting and presentation of the final version will be sent to authors upon notification of acceptance.

There will be workshop proceedings.

Programme committee

To be announced

Organizing committee

Kiril Simov
BulTreeBank Project
Linguistic Modelling Laboratory, CLPP,
Bulgarian Academy of Sciences
Acad. G.Bonchev St. 25A
1113 Sofia, Bulgaria
Tel: (+359 2) 979 2825
Fax: (+359 2) 70 72 73
www.BulTreeBank.org

Petya Osenova
BulTreeBank Project
Linguistic Modelling Laboratory, CLPP,
Bulgarian Academy of Sciences
Acad. G.Bonchev St. 25A
1113 Sofia, Bulgaria
Tel: (+359 2) 979 2825
Fax: (+359 2) 70 72 73
www.BulTreeBank.org
