CLAWS Input / Output Format Guidelines

CLAWS has been designed to be as robust as possible and should be able to cope with many kinds of input. When it cannot cope with a particular situation, it should report a clear error message, so that the user has some understanding of what they can do to rectify the problem.

Here, we provide guidelines which should be adhered to in order to obtain the best quality (and least error-prone) results from the system.

The input to the system is a plain text (ASCII) file. It consists of one or more lines, separated Unix-fashion by 'newline' characters and terminated by the standard 'EOF' (end of file) marker. Each line consists of the text in normal orthography, with each 'word' separated from the next by one or more spaces; a 'word' consists of a word proper, perhaps preceded and/or followed by punctuation without intervening spaces.

If the text originates on a PC or Macintosh, the correct format is best obtained by using the 'Save as' option and selecting the 'Text Only w/line breaks' format, which is available in most word processing packages.
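Where re-saving from a word processor is not convenient, the line endings can also be converted programmatically. The following Python sketch is illustrative only: the function name normalise_newlines is hypothetical and is not part of CLAWS. It converts PC-style (CR LF) and old Macintosh-style (CR) line endings to the Unix 'newline' described above.

```python
def normalise_newlines(data: bytes) -> bytes:
    """Convert CR LF and bare CR line endings to Unix LF."""
    # Replace CR LF first so that the CR in each pair is not converted twice.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    sample = b"First line.\r\nSecond line.\rThird line.\n"
    print(normalise_newlines(sample))
```

Working on bytes rather than decoded text avoids any dependence on the platform's default newline translation.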

CLAWS determines sentence breaks according to normal orthography. Therefore a sentence-terminating full stop (or other character) should be followed by at least one space, and the first word in a sentence should be capitalised.

The restrictions which apply to the character set of a text are as follows:

a to z, A to Z, 0 to 9 are normal characters forming words.
# is allowed as a pound sign and $ as a dollar sign (if necessary), although it is preferable to spell them out as 'pounds' and 'dollars' or to use SGML entities (see below).
- is treated as a hyphen (as in 'daughter-in-law') or a dash or minus (as in 'working 9-5' and '-3' respectively). It is not allowed to mark a word continuation across a line break, so each word must appear complete on one line.
' is an apostrophe which can be used in the normal way in a word (as in O'Neill, she'll, the dog's tail and the dogs' tail).
space is the normal word-delimiter, but Tab is also allowed.
. , : ; ? ! are the only punctuation marks allowed; a single dot may be used as a decimal point, but periods should be omitted in abbreviations and initials; use three dots (...) to represent any sort of gap, ellipsis, etc.
" (double quote) is used for any sort of quotation mark.
at least one space should follow any punctuation.
multiple asterisks (*) are disallowed.
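
Some of the restrictions above can be checked mechanically before a file is submitted. The Python sketch below is a rough pre-check, not the validation that CLAWS itself performs. The function name find_problems is hypothetical. It also makes three assumptions that the guidelines do not state: that & < > and / are accepted so that SGML entities and simple tags pass, that a single asterisk is tolerated, and that a line ending in a letter followed by a hyphen signals a word continued across a line break.

```python
import re

# Letters, digits, # $ ' space, Tab, the punctuation . , : ; ? ! and the
# double quote come from the list above. The characters & < > / and a
# single * are ASSUMED acceptable (for SGML entities and simple tags).
ALLOWED_CHAR = re.compile(r"[A-Za-z0-9#$' \t.,:;?!\"&<>/*-]")


def find_problems(line):
    """Return (column, message) pairs for one line of input text."""
    problems = []
    for col, ch in enumerate(line, start=1):
        if not ALLOWED_CHAR.fullmatch(ch):
            problems.append((col, "disallowed character " + repr(ch)))
    # Runs of two or more asterisks are disallowed.
    for match in re.finditer(r"\*{2,}", line):
        problems.append((match.start() + 1, "multiple asterisks"))
    # Heuristic (assumption): a trailing hyphen after a letter suggests a
    # word split across a line break, which is not allowed.
    stripped = line.rstrip()
    if re.search(r"[A-Za-z]-$", stripped):
        problems.append((len(stripped), "possible word continuation across a line break"))
    return problems


if __name__ == "__main__":
    for text in ["The quick brown fox.", "Cost: £5", "a ** b", "the daugh-"]:
        print(repr(text), find_problems(text))
```

The check does not cover the spacing rule for punctuation, because decimal points, ellipses and closing quotation marks would each need their own exceptions.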

SGML entities should be used to encode extended character set symbols and accented letters. Lists of these can be found on the web in the HTML 4 specification: http://www.w3.org/TR/REC-html40/sgml/entities.html#iso-88591

The standard entities are:

&pound; (£)
&amp; (&)
&eacute; (é)
&lt; (<)
&gt; (>)
&lsqb; ([)
&rsqb; (])
&bquo;
&equo;
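
Replacing extended characters with entities can be automated. The Python sketch below uses the HTML 4 entity names shipped with the standard library (html.entities.codepoint2name), which correspond to the list linked above. The function name encode_entities is hypothetical, and it is an assumption that every HTML 4 entity name is accepted by CLAWS. ASCII characters, including & < and >, are left untouched so that existing tags and entities are not altered.

```python
from html.entities import codepoint2name


def encode_entities(text):
    """Replace non-ASCII characters with named SGML/HTML 4 entities."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            # ASCII is copied through unchanged (including & < >).
            out.append(ch)
        elif code in codepoint2name:
            out.append("&" + codepoint2name[code] + ";")
        else:
            # No named entity is known; stop rather than guess.
            raise ValueError("no named entity for character " + repr(ch))
    return "".join(out)


if __name__ == "__main__":
    print(encode_entities("café costs £3"))
```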

SGML tags may be included in the text to mark up various features or to act as a header giving information about the text. The tag names should be declared in a configuration file (see section on running CLAWS).

By default, every file to be run through CLAWS should include a start marker and an end marker for the text, each encoded as an SGML tag. For example,

<text>
The quick brown fox jumps over the lazy dog.
</text>

CLAWS outputs a verticalised form of the text where each word has a list of possible POS tags. The most likely tag is the first in the list.

0000001 001 **6;0;START				01 NULL
0000001 002 ----------------------------------------------------
0000003 010 The					93 AT
0000003 020 quick				93 [JJ/99] RR@/1 NN1%/0
0000003 030 brown				93 [JJ/93] NN1@/7 VV0%/0
0000003 040 fox					93 [NN1/100] VV0@/0
0000003 050 jumps				93 [VVZ/97] NN2@/3
0000003 060 over				93 [II/59] RP/41 NN1%/0 JJ%/0
0000003 070 the					93 AT
0000003 080 lazy				93 JJ
0000003 090 dog					93 [NN1/100] VV0%/0
0000003 091 .					03 .
0000004 001 **7;7;text				01 NULL

The reference number at the start of each line shows which line of the input file a word comes from. Sentence breaks are identified by lines of hyphens. The two-digit number to the left of the POS tags is a decision code produced by CLAWS to aid manual post-editing. Each POS tag on an ambiguous word is followed by a slash and a likelihood value, expressed as a percentage.
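
The layout just described can be read with a few lines of code. The Python sketch below is based only on the sample output shown above and is not an official parser. The function name parse_line and the dictionary keys are hypothetical. Square brackets around a tag are dropped without interpretation, and any @ or % suffix is kept as part of the tag string, since neither is defined in this section.

```python
def parse_line(line):
    """Split one line of vertical output into its fields."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("unrecognised output line: " + repr(line))
    ref, position, word = fields[0], fields[1], fields[2]
    # A sentence break has only a row of hyphens after the two numbers.
    if len(fields) == 3 and set(word) == {"-"}:
        return {"line": int(ref), "position": position, "sentence_break": True}
    if len(fields) < 5:
        raise ValueError("missing decision code or tags: " + repr(line))
    tags = []
    for token in fields[4:]:
        # "[JJ/99]" -> ("JJ", 99); an unambiguous "AT" -> ("AT", None)
        tag, _, likelihood = token.strip("[]").partition("/")
        tags.append((tag, int(likelihood) if likelihood else None))
    return {
        "line": int(ref),
        "position": position,
        "sentence_break": False,
        "word": word,
        "code": fields[3],
        "tags": tags,
    }


if __name__ == "__main__":
    print(parse_line("0000003 020 quick\t\t\t\t93 [JJ/99] RR@/1 NN1%/0"))
```

Splitting on any run of whitespace handles the variable number of tabs between the word and the decision code.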

The first line of this example (**6;0;START) contains a reference to a supplementary (supp) file produced by CLAWS. The supp file contains words in the input text which are longer than 25 characters and SGML tags which contain a space. The start text symbol is always copied to the supp file along with any text in the file preceding it. The two numbers in the supp file reference give the number of characters transferred (six in this case) and the starting position in the supp file to which the reference points.
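
A supp file reference can be resolved along the following lines. This Python sketch is illustrative only. The function names parse_supp_ref and lookup are hypothetical, and it is an assumption that the starting point is a zero-based character offset into the supp file. The supp file content in the usage example is likewise made up to match the two references in the sample output.

```python
import re

# "**6;0;START" -> count 6, offset 0, label "START"
SUPP_REF = re.compile(r"\*\*(\d+);(\d+);(.*)$")


def parse_supp_ref(word):
    """Return (count, offset, label) for a supp reference, else None."""
    match = SUPP_REF.match(word)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


def lookup(supp_text, word):
    """Fetch the characters that a supp reference points to."""
    ref = parse_supp_ref(word)
    if ref is None:
        raise ValueError("not a supp reference: " + repr(word))
    count, offset, _label = ref
    # ASSUMPTION: offset counts characters from zero.
    return supp_text[offset:offset + count]


if __name__ == "__main__":
    supp = "<text>\n</text>\n"  # hypothetical supp file content
    print(lookup(supp, "**6;0;START"))
    print(lookup(supp, "**7;7;text"))
```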