

Page last modified: Friday 11th October 2013


This section aims to give instructions for using all of the software's different areas and functions. For information on downloading the software see the availability section. If you have a question that is not answered in this user guide then see the FAQ section.

This user guide describes VARD 2.5.4.

Running the tool

VARD 2 is built with the Java programming language, and version 6 or later of the Java Runtime Environment (JRE) is required to use the tool. The JRE is free and can be downloaded from the Java website if it is not already present on your computer.

The tool is delivered as a compressed ".zip" file, which must be extracted before the program can be run. Most operating systems now have built-in functionality to do this; alternatively, any unzipping program, such as WinRAR or WinZip, can be used.

To run the main user interface of VARD 2, use run.bat in Windows, run.command in Mac OS X, or run.sh in Unix/Linux (run "chmod a+x run.sh" first to make the script executable). Alternatively, open a command prompt, change to the VARD directory and type the following command:

java -Xms256M -Xmx512M -jar gui.jar

The values 256 and 512 are the initial and maximum amounts of memory, in megabytes, allocated to the Java Virtual Machine. The default values should be sufficient for most purposes; however, if your system has only 512MB of memory, they should be reduced to 128 and 256. Lowering the values below 128/256 is not recommended. The memory can be raised if you are dealing with particularly large text files (100,000 tokens or more) and your system has sufficient memory. The higher figure should be no more than half of your system memory, and the lower figure should usually be half of the higher figure.

It is possible to run more than one instance of VARD at the same time, but this is not recommended. In particular, do not run more than one instance simultaneously with the same setup folder or training folder, as this could cause concurrency issues. You should also check how much system memory you have available.
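As a rough illustration of the rules of thumb above, the following Python sketch derives the two memory values from the amount of system memory. It is not part of VARD: the function names suggest_jvm_memory and java_command are hypothetical, and the values returned are the upper limit implied by the guidance rather than a recommended setting.

```python
def suggest_jvm_memory(system_mb):
    """Return (lower_mb, higher_mb) following the guide's rules of thumb.

    Hypothetical helper, not part of VARD.
    """
    # Higher figure: no more than half of system memory,
    # but not below the 256 MB floor mentioned in the guide.
    higher = max(256, system_mb // 2)
    # Lower figure: usually half of the higher figure.
    lower = higher // 2
    return lower, higher


def java_command(system_mb):
    """Build the launch command with the suggested memory values."""
    lower, higher = suggest_jvm_memory(system_mb)
    return f"java -Xms{lower}M -Xmx{higher}M -jar gui.jar"


print(java_command(512))
print(java_command(1024))
```

For a system with 1024MB of memory this reproduces the default command shown above; for 512MB it gives the reduced 128/256 values.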

There are also various command line interfaces available.



Input

VARD 2 can process a variety of text formats, including plain text, rich text format (".rtf") and text containing SGML and XML tags. The tool cannot be used with Microsoft Word documents (".doc") or Adobe Portable Document Format (".pdf") files. There is no restriction on file extensions; VARD will attempt to process any file opened by the user. If your files are not processed successfully, feedback would be appreciated so that the possibility of handling such files in future versions can be explored.

Previously processed files saved with XML tags can also be opened with the tool, and processing continues from the point at which the file was saved. How the text is processed by VARD is defined by the setup, particularly the input setup.



Main user interface

As of VARD 2.5, all of VARD's main functionality is contained within one user interface, a screenshot of which is shown below. Note that all interface screenshots in this user guide are from VARD running on Mac OS X; the interface looks similar on other operating systems, but is probably 'prettiest' on the Mac. The main functions offered are manually normalising single texts, normalising a randomly selected sample for training, batch processing multiple texts, batch training VARD on previously normalised texts, and editing VARD's setup options.

The interface is split into a main text area which is used during manual normalisation, a sidebar to the right which contains several different panels that give access to various methods for interacting with VARD, and a toolbar which offers various functions. These components and VARD's functions are described in the following sections.

[Screenshot: VARD's main user interface]



Processing single texts

One of the main functions of the user interface is to allow for the manual processing of a single text at a time. The single text interface is similar to that found in modern word processing applications, with a main window displaying the text and spelling variants highlighted in yellow. One can right-click on any word to be presented with different options depending on the category of the word (variant, not variant or normalised).



Inserting text

Text can either be processed by opening a file or by pasting text into the application. Before you process any text you should check that the setup is correct for the files/text you are processing.

To open a file, use 'Open' in the 'File' menu or the open icon. A standard file opening dialog will be displayed; simply select the file you wish to process. Alternatively, drag and drop a file from your system into the main text area, and the file will be loaded as if selected using 'Open'. If you have any unsaved data, you will be prompted to save all current data before the new file is loaded.

To paste text, use 'Paste' in the 'Edit' menu or the paste icon. The text will always be appended to the end of the document, with a line-break inserted above the pasted text. You can also add text in a similar way by dragging and dropping selected text from another application into the main text area.

To clear the current document, use the 'New' command in the 'File' menu or the new icon. If you have any unsaved data, you will be prompted to save all current data.

Once text has been inserted it will be automatically analysed by the software and the words will be placed into the different categories.

Note: an empty line will always be inserted at the start of the document to ease issues with updating the very first position in the document. This line is not saved in VARD's output.



Main text area

Once text is available to the software it is displayed in the main text area. The text is not manually editable, as any changes made to words need to be tracked by the software. Words belonging to the category selected in the text sidebar (variant, not variant or normalised) are highlighted within the text with a yellow background. Right-clicking any word in the document will bring up a pop-up menu which differs depending on the word's category; the options are detailed in the category descriptions below.



Word categories

Words are placed (either manually or automatically) into one of three categories: variants, not variants and normalised words. Right-clicking on any word in the text (or in the types list) will bring up a popup menu with different options depending on the word's category; these are detailed below. One option common to all categories is Select type. This changes the currently selected word category to that of the clicked-on word and selects the word in the types list. As a result, all instances of the word are highlighted with a green background (the current instance is highlighted in red), and the user can cycle through and process each instance of the word in step complete.

Variants

[Screenshot: variant popup menu]

VARD 2 currently adds a word to the variant category when its form is not found in the real word list; words from a text can also be added to the category manually by the user. The variant category contains all words which the system deems necessary to normalise (either manually or through automatic processing). The options offered when right-clicking on a word from the variant category are:

Normalisation suggestions: A list of suggested normalisations is given, ranked by confidence score. Each suggested normalisation can be applied to the single instance (Normalise instance) or to all instances of the word form (Normalise All). If the replacement word is not in the modern lexicon it can be added for future reference; this option is offered in a dialog after the normalisation has been made.

Normalise to...: If the correct normalisation is not offered in the suggested list, the user can provide their own normalisation with this option. The user is presented with a textbox to input the replacement and is given the option to normalise all instances. As with normalising from the suggestions list, if the replacement word is not in the modern lexicon it can be added for future reference; this option is offered in a dialog after the normalisation has been made. The Normalise to... option also allows the casing of a word to be changed. For example, if "william" is encountered in a text, you can normalise it to "William" with Normalise to..., and the casing you give will always be applied, whereas for normalisation suggestions the casing of the variant form is maintained.

Mark as not variant: The variant can be 'ignored' by changing its category to not variant. This option can be invoked on the clicked-on instance (Mark instance as Not Variant) or on all instances of the word (Mark all as Not Variant). If Mark all as Not Variant is invoked, the user is given the option to add the word to the real word list so that it will not be marked as a variant in future sessions.

Normalised

[Screenshot: normalised word popup menu]

The normalised category contains all words which have been normalised by the user or through automatic processing. The original word form is stored and displayed with the replacement in the types list.

The only option offered other than finding the word in the list is to revert the replacement to the original variant. The user can revert just the clicked-on instance or all instances.

Not variants

[Screenshot: not variant popup menu]

The not variants category contains all words which are in the modern lexicon.

The user may mark any word in this category as a variant, either the clicked-on instance only (Mark instance as Variant) or all instances (Mark all as Variant). If Mark all as Variant is selected the user will be prompted to remove the word from the real words list.



Text sidebar

[Screenshot: Text sidebar]

The Text sidebar contains extra information and options. The word category can be changed with the radio buttons in the "Display" section of the panel; the number of tokens is shown for each category. The other components in the panel are as follows:

Types list

The types list contains each word in the currently selected category, with the number of tokens for each word displayed in brackets. Selecting a word in the list highlights all instances of the word in the main text area in green and begins the step complete component to cycle through instances of the word. Right-clicking on a word in the list gives similar options to those when right-clicking on a word in the main text area - these options are detailed in the word category descriptions.

The types list can be ordered alphabetically or by the frequency of each type in the text. Select the sort a-z icon to sort A-Z, or the sort by frequency icon to sort by frequency (descending).

The current word list can be copied to the clipboard using the copy list icon button. The list is copied in tab-delimited format, so it can easily be pasted into programs such as Microsoft Excel.

The current selection can be cleared with the clear selection icon button. This also clears the step complete dialog and the selected instance in the main text area.
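The idea of a types list with token counts and two sort orders can be sketched in a few lines of Python. This is only an illustration: the regular expression used to split words is an assumption (VARD uses the word pattern defined in its setup), and the function name types_list is hypothetical.

```python
import re
from collections import Counter


def types_list(text, by_frequency=False):
    """Return (type, token_count) pairs, sorted A-Z or by descending frequency."""
    # Assumed word pattern for illustration only; VARD's own pattern is set in the setup.
    tokens = re.findall(r"[A-Za-z']+", text)
    counts = Counter(tokens)
    if by_frequency:
        # Most frequent first; ties are broken alphabetically.
        return sorted(counts.items(), key=lambda item: (-item[1], item[0].lower()))
    return sorted(counts.items(), key=lambda item: item[0].lower())


sample = "the cat and the dogge and the cat"
for word, count in types_list(sample, by_frequency=True):
    # Tab-delimited output, similar in spirit to the copied list.
    print(f"{word}\t{count}")
```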

Step complete

The step complete component allows the user to cycle through and process instances in the text (text instances) or instances of the currently selected word (type instances). The same options are available as when right-clicking a word, and are dependent on the word's category. Once an instance is dealt with, the next available instance is selected and the relevant options are displayed. The currently selected instance is highlighted in red in the text. The instances can also be cycled through using the First Instance, Previous Instance, Next Instance and Last Instance buttons.

You can choose to use the step complete feature to cycle through either text instances or type instances. Text instances are the tokens found in the text, taken sequentially. Type instances are the instances of the currently selected type in the Types List. If type instances is selected and all instances of the currently selected type have been dealt with, the next type will be automatically selected for processing. If the Types List is currently sorted alphabetically, this will be the next type in the list. If the Types List is sorted by frequency, the most frequent type in the list will be selected.

Auto normalise

There is also an option to automatically process variants in the text currently in the main text area. This is very similar to the batch processing module of VARD, except that the output is not produced immediately; instead, the replacements made can be viewed in the types list and main text area, and further manual processing can take place. Normalisations are automatically made if the top normalisation suggestion has a confidence score equal to or higher than the selected threshold. The threshold can be lowered and automatic processing invoked again, replacing any variants which now have a top replacement with a high enough confidence score. The automatic processor will always use the current confidence scores and f-score weight.
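The threshold rule described here can be sketched as follows. This is a minimal Python illustration, not VARD's implementation: the function name auto_normalise, the data layout and the confidence scores in the example are all made up for the purpose of the sketch.

```python
def auto_normalise(suggestions_by_variant, threshold):
    """Apply the top suggestion for each variant if it meets the threshold.

    suggestions_by_variant maps a variant to a list of
    (replacement, confidence_percent) pairs. Returns a dict of the
    replacements made and a list of variants left unchanged.
    """
    replaced = {}
    remaining = []
    for variant, suggestions in suggestions_by_variant.items():
        if suggestions:
            # Top suggestion is the one with the highest confidence score.
            best, score = max(suggestions, key=lambda pair: pair[1])
            if score >= threshold:
                replaced[variant] = best
                continue
        # No suggestion, or the top score is below the threshold.
        remaining.append(variant)
    return replaced, remaining


# Invented scores, for illustration only.
example = {
    "vertue": [("virtue", 82.0), ("venture", 40.0)],
    "bee": [("be", 55.0)],
    "xyzzy": [],
}
print(auto_normalise(example, 60.0))
print(auto_normalise(example, 50.0))
```

Lowering the threshold from 60 to 50 in this example also normalises "bee", which mirrors the effect of lowering the threshold and invoking automatic processing again.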



Join function

Because the main text area is not editable, a function is provided for joining two or more word instances in the text. To do this, select the words you wish to join in the main text area (by clicking and dragging the mouse over them) and select 'Join' from the 'Edit' menu or click the join icon in the text toolbar.

Once the join function has been invoked all selected words will be concatenated with any characters (including ignored text, hyphens and white space) in between words removed. The whole of a word need not be selected for it to be included and words can be joined over line-breaks. The resulting new word instance will be evaluated by the system as a variant or not variant. The user can then manipulate the word like any other.

A word instance which has already been normalised cannot be joined with other words; it must first be reverted to a variant and then joined. You also cannot join across text which has been marked as a language, although you can join words within the same language-marked span.



Language markup

This function allows you to mark spans as different languages so they can be processed separately (or not at all). Text can be marked as any language defined by the user, but the language must first be present either in the list provided in the setup or in another foreign XML tag in the text. Languages can be marked as "to process" in the setup, which means that the words will be treated as normal text within VARD (i.e. can be normalised).

You can select any span of text (by clicking and dragging the mouse over it) and mark it up as a language using 'Mark as language' from the 'Edit' menu or by clicking the mark as foreign icon in the text toolbar. You will be presented with a list of the languages available; once you have selected a language, the markup will be made. If the language is not "to process" in the setup, the span will be treated as if ignored, i.e. the text will not be available for processing and will not be normalised automatically. If the language is "to process", the text will be available to process just as before, but options will be available for changing the language of the span. If the language is later changed to "not process" in the setup, the text will be ignored as described.

If you right-click on any text which is marked up as a language, you will be given options to remove the language markup and to change the language with which the span is marked up. The current language is also displayed.



Undo/Redo

All edits made to words within the text can be undone and redone using 'Undo' and 'Redo' in the 'Edit' menu or with the undo icon and redo icon. Edits made to the known variants list and the real word list are also undoable and redoable. Edits can only be undone (or redone) one at a time.



Text formatting

Various formatting options are available for the text in the main text area. Two dropdown menus are available in the toolbar to change the selected text's font face and size. The toolbar and Style menu also have functions for making the selected text Bold (bold icon), Italic (italic icon) and Underlined (underline icon). Formatting functions are always applied to the currently selected text. To select all of the text in the current document, use 'Select All' from the 'Edit' menu or the select all icon in the toolbar.

Formatting will only be saved if the file is saved as Rich Text Format (.rtf).



Saving document

The current document can be saved with or without XML markup. To save with XML tags, use 'Save With XML Tags' (save with xml tags icon) in the 'File' menu; to save without XML tags, use 'Save Without XML Tags' (save without xml tags icon). A ".rtf" extension can be used to save the current formatting; this is not possible when saving with XML tags.

'Save' in the 'File' menu or the save icon in the toolbar can be used to save to the last file the document was saved to. If the document has not previously been saved, the user will be able to choose whether or not to save with XML tags.

The currently selected text can be copied to the clipboard, to be pasted into other programs, with 'Copy' in the 'Edit' menu or the copy icon in the toolbar; however, no formatting or XML tags will be included.



Saving other data

The training data can be written to files for use in future sessions from the 'File' menu; the options are detailed below. Each option is only available if a change has occurred in the particular data. For more information on the actual files written see the training files section.

'Save Dictionary' (save dictionary icon): This option saves any changes to the real words list, i.e. any words removed or added. This is the list checked against when judging whether a word is a variant and also used when searching for potential variant replacements.

'Save Variant List' (save variants list icon): Saves any additions to the known variants list. A variant to replacement mapping will be added when a replacement is made which does not already appear in the saved list. Frequency of selection data will also be changed each time a normalisation is made.

'Save Confidence Weights' (save confidence weights icon): Saves the current normalisation searching method weights to weights.txt.

'Save All' (save all icon): Gives the option to save all of the above as well as the document and the rules list. The same prompt is displayed as when exiting the program.



Exiting

[Screenshot: exit dialog]

The program can be exited with 'Exit' (exit icon) in the 'File' menu or by simply closing the window.

Upon exiting the user will be prompted to save any changes made, as shown in the screenshot to the right. Note: options will only be given to save data which has changed.



Setup

The setup defines various aspects of how VARD reads, processes and outputs text. All of the options are editable in the setup sidebar and the setup files. You can change the current setup by selecting a setup folder to use. Previously loaded setups are available to select from the Setup dropdown (these are defined in previous_setups.txt). You can select another folder by choosing it after clicking the browse icon button or by dragging a folder onto the dropdown text box. If the folder doesn't exist, or if any of the required setup files are missing, the folder will be populated with files containing default values. Once the setup is loaded, any texts processed in that session of VARD will be processed using that setup, and you will be able to edit the setup's options in the setup sidebar. If you change the setup, you will be prompted to save any changes; the setup will then change and the current text will be reloaded using the newly selected setup's options.



Methods and confidence scores

Each replacement offered for normalisation of a given variant is given a confidence score, which is based on the predicted precision and recall of each method used to find it. For each method, all previous replacements made are taken into account as well as the current replacement being offered, and these scores are combined to produce an f-score (based on the current f-score weight). The current f-score, precision and recall for each method are displayed in a table in the top-right of the user interface. The scores for each individual replacement are displayed for each method in the variant popup menu in the format F-Score (Precision|Recall). These are based on the score given for the replacement and how many other replacements the method has suggested. The methods used to find and rank variant replacements are as follows:

KV: Known Variants

This method returns any replacement mappings found in the variants list for the given variant string. In most cases only one replacement will be offered, but the variants list may contain different options for a given variant string. A count is incremented each time a variant replacement is selected and decremented each time it is reversed. These counts are used to produce a precision and recall score for each offered replacement. If there is no ambiguity (i.e. only one replacement is offered for the current variant string), these scores will be 100%. If there is more than one option, the most frequently picked replacement will have the higher recall score, and the precision score will depend on how often the other replacements have been picked. The variants, replacements and counts are stored in the variants list file.

LR: Letter Replacement

A list of letter replacement rules (stored in the rules list file) is used to find potential variant replacements. The rules are applied to the variant string, and any resulting strings found in the real words list or in the variants list are offered as potential replacements. The recall score is reduced if more than one rule, or a mapping in the variants list, is required to create the replacement string. The precision score is based on how many other replacements are found for the current variant string. The rules list can be edited in the rule list sidebar.
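The general idea of rule-based candidate generation can be sketched as below. This is a simplified Python illustration under stated assumptions: the rules and the tiny lexicon are invented examples (VARD's actual rules live in its rules list file), only a single rule application per candidate is tried, the variants list is not consulted, and no scores are calculated. The function name letter_replacement_candidates is hypothetical.

```python
def letter_replacement_candidates(variant, rules, lexicon):
    """Apply each (old, new) rule at each position; keep results found in the lexicon."""
    candidates = set()
    for old, new in rules:
        start = variant.find(old)
        while start != -1:
            # Replace this one occurrence of `old` with `new`.
            candidate = variant[:start] + new + variant[start + len(old):]
            if candidate in lexicon:
                candidates.add(candidate)
            start = variant.find(old, start + 1)
    return sorted(candidates)


# Invented rules and lexicon, for illustration only.
rules = [("u", "v"), ("ie", "y"), ("vv", "w")]
lexicon = {"love", "very", "have", "merry"}
print(letter_replacement_candidates("loue", rules, lexicon))
print(letter_replacement_candidates("verie", rules, lexicon))
```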

PM: Phonetic Matching

A modified version of the Soundex phonetic matching algorithm is used to find matching replacements in the real words list or variants list. The recall score is reduced if a mapping from the variants list is required. The precision score is based on how many other options the algorithm produces from the current variant string.
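For readers unfamiliar with Soundex, the sketch below implements the standard (American) Soundex code in Python. Note that VARD uses a modified version whose details are not given in this guide, so this is only an illustration of the underlying idea: words that sound alike receive the same short code and can therefore be matched.

```python
# Standard Soundex consonant groups.
_GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
           ("l", "4"), ("mn", "5"), ("r", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(word):
    """Return the standard four-character Soundex code for a word."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w do not separate letters with the same code.
            continue
        code = _CODES.get(ch, "")
        if code and code != previous:
            result += code
        # Vowels reset `previous`, so a repeated code after a vowel is kept.
        previous = code
    # Pad with zeros or truncate to four characters.
    return (result + "000")[:4]


print(soundex("dogge"), soundex("dog"))
print(soundex("Robert"), soundex("Rupert"))
```

In this sketch the variant "dogge" and the modern form "dog" receive the same code, which is the kind of match a phonetic method relies on.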

ED: Edit Distance

A normalised Levenshtein Distance is calculated for all replacements offered by other methods. Each replacement's score is based on its own distance score (recall) and the scores given to all other offered replacements (precision).
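A normalised edit distance can be sketched in Python as follows. The guide does not say how VARD normalises the distance; dividing by the length of the longer string and expressing the result as a similarity between 0 and 1 is an assumption made for this illustration, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalised_similarity(a, b):
    """Similarity in [0, 1]; normalising by the longer length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("loue", "love"))
print(normalised_similarity("loue", "love"))
```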

F-Score Weight

To calculate overall confidence scores for methods and replacements, an f-score is calculated by combining the precision and recall scores. Usually, equal weight is given to precision and recall (F-Score weight 1), but a user may give priority to either precision or recall by altering this weight using the slider in the toolbar. Moving the slider towards precision or towards recall will bias all F-Scores in VARD accordingly. Weights under 1 bias towards precision; weights over 1 bias towards recall. A weight of 1/2 equates to precision being weighted twice as much as recall, a weight of 2 equates to recall being weighted twice as much as precision, a weight of 1/3 or 3 equates to three times as much, and so on. The difference this weighting makes is shown instantly in the f-scores presented for each method in the confidence weights table. Adjusting this weight may change the ranking of replacements offered for variants, and may move some replacements above the current automatic replacement threshold.
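The behaviour described matches the standard weighted F-measure, in which a weight w gives F = (1 + w²) × P × R / (w² × P + R). The Python sketch below uses that formula; that VARD uses exactly this formula is an assumption, and the precision and recall values in the example are invented.

```python
def f_score(precision, recall, weight=1.0):
    """Weighted F-measure: weight < 1 favours precision, weight > 1 favours recall."""
    w2 = weight * weight
    denominator = w2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + w2) * precision * recall / denominator


# Invented scores: precision is higher than recall here.
p, r = 0.9, 0.6
print(f_score(p, r))        # balanced
print(f_score(p, r, 0.5))   # biased towards precision
print(f_score(p, r, 2.0))   # biased towards recall
```

Because precision is the higher of the two values in this example, a weight below 1 raises the f-score and a weight above 1 lowers it.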



Sample sidebar

[Screenshot: Sample sidebar]

The Sample sidebar (added in version 2.5.1) provides the ability to train VARD with a randomly selected sample of your text, sub-corpus (e.g. a group of texts from a certain time period) or whole corpus. This is done by marking partitions of random length (in words) within the text(s), then choosing from the whole set of partitions at random and presenting these to the user for manual normalisation. The purpose of this functionality is to allow VARD to be trained on the spelling variants distributed throughout the text(s)/corpus, rather than on a small subset which may not be representative of the spelling variation in the texts to be automatically normalised.

Choosing files

You can choose a list of files to be partitioned, or a list of already partitioned files; these are shown in a list in the 'Files' section. Files can be added to this list with the add files icon, or the entire contents of one or more folders with the add folder icon. A standard opening dialog is displayed, from which text files can be added. You can also add files or folders by dragging and dropping them from your system into the list. The entire directory tree rooted at a loaded folder can also be added by enabling the 'Search Subfolders' button (search subfolders icon).

Files can be removed from the list by selecting them and clicking the delete items icon, or the entire list can be cleared with the clear list icon.

Partitioning files

To insert partition markup into the files, first choose the files to partition; these are usually the same files which will be automatically normalised after training. Then provide a minimum and maximum partition size; each partition will contain a random number of words between these values. Note that the final partition in each file may be below the minimum size so that each file can be used in full, and that word counting is based on tokenization with the set word pattern. Finally, provide an output folder for the partitioned files: use the browse icon button to select a folder on your system, or drag a folder onto the text box. If the input files are from different folders, the folder structure will be retained. The output folder can be the same as the input folder, in which case the input files will be overwritten by the partitioned files.

Once you have provided the list of files to partition, a minimum and maximum partition size and an output folder, click on '(Re-)partition files' (partition files icon). Each file will be processed in turn, with partitions of random size between the provided minimum and maximum created and marked up in the text according to the set partition XML markup. Any existing partition markup will be removed; you will be prompted to confirm that you want to remove it. The partitioned files will replace those listed in the files list and will be used as the current sample.
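The way a file is divided into randomly sized partitions can be sketched in Python as follows. This only illustrates the sizing rule (random sizes between the minimum and maximum, with a possibly shorter final partition); it does not write any XML markup, and the function name partition_sizes is hypothetical.

```python
import random


def partition_sizes(total_words, min_size, max_size, rng=None):
    """Split total_words into random partition sizes between min_size and max_size.

    The final partition may be smaller than min_size so the whole file is used.
    """
    if not 1 <= min_size <= max_size:
        raise ValueError("sizes must satisfy 1 <= min_size <= max_size")
    if rng is None:
        rng = random.Random()
    sizes = []
    remaining = total_words
    while remaining > 0:
        # Never take more words than are left in the file.
        size = min(rng.randint(min_size, max_size), remaining)
        sizes.append(size)
        remaining -= size
    return sizes


sizes = partition_sizes(1000, 50, 150, random.Random(0))
print(len(sizes), sum(sizes))
```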

Use existing partitions

You can use partitions previously marked up in files by choosing those files and then clicking 'Use files for sample' (use existing icon). The marked-up partitions will be read and placed in the relevant categories based on the state attribute. Partitions marked queued are placed in a "random queue", from which partitions are chosen to be normalised next; this is equivalent to putting all of the partitions in a bag, mixing them around and picking one out at a time blindly. Partitions marked started are placed in the In progress list, available for manual processing. Partitions marked done are placed in the Completed list and count towards the normalised sample size.
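The sorting of partitions by state, and the "bag" behaviour of the random queue, can be sketched in Python as below. The data layout (a list of identifier and state pairs) and the function names are hypothetical; only the three state values queued, started and done are taken from the description above.

```python
import random


def load_partitions(partitions, rng=None):
    """Sort (partition_id, state) pairs into a shuffled queue and two lists."""
    if rng is None:
        rng = random.Random()
    queue = [pid for pid, state in partitions if state == "queued"]
    in_progress = [pid for pid, state in partitions if state == "started"]
    completed = [pid for pid, state in partitions if state == "done"]
    # Mixing the bag: the order in which queued partitions are drawn is random.
    rng.shuffle(queue)
    return queue, in_progress, completed


def next_random_partition(queue, in_progress):
    """Draw one partition from the bag and move it to the In progress list."""
    if not queue:
        return None
    partition_id = queue.pop()
    in_progress.append(partition_id)
    return partition_id


parts = [("p1", "queued"), ("p2", "started"), ("p3", "done"), ("p4", "queued")]
queue, in_progress, completed = load_partitions(parts, random.Random(1))
print(next_random_partition(queue, in_progress), in_progress, completed)
```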

Remove existing partitions

After completing the normalisation sample you may wish to remove the partition markup, particularly because it may create invalid XML structure with crossed tags. To do this, simply choose the partitioned files and click 'Remove partition markup' (remove existing icon).

Processing a sample

Once you have partitioned your files and these are being used, you can start processing randomly selected partitions to build a normalised sample. To randomly select the next partition to process manually, click 'Next random partition' (next partition icon). This partition will then be added to the In progress list, marked as started in the XML markup and loaded in the main text area. The whole file is loaded, with just the partition available for processing and the rest of the text 'ignored' (grayed out). The next random partition button (next partition icon) is also available in the main toolbar.

Once you are happy that the partition has been processed fully, you can mark the partition as complete by clicking the partition done icon button in the main toolbar. The partition will then be moved from the In progress list to the Completed list and marked as done in the XML markup. The file containing the partition will also be saved along with the training data. The next available partition will then be loaded; this will be either the first in the In progress list (see next paragraph) or a newly chosen random partition.

You can leave a partition to come back to by simply requesting the next random partition (next partition icon). The partitions which have been started but not finished can be found in the In progress list. You can process any of these partitions by selecting them in the drop-down list and clicking the adjacent edit partition icon button. This In progress list and the edit partition icon button can also be found in the main toolbar.

All partitions marked as done are placed in the Completed list. The word count and percentage above this list give the current sample size; you may, for example, choose to normalise 10% of your corpus. You can select one of the completed partitions to go back to and edit by clicking the adjacent edit partition icon button. The partition will then be added to the In progress list, marked as started in the XML markup and loaded in the main text area.



Batch sidebar

[Screenshot: Batch sidebar]

The batch sidebar allows a user to exploit VARD's automatic processing capability over any number of text files, even an entire corpus or set of corpora. The current rules, F-Score weight and setup will be used, including the current confidence weights, real word list and variants list.

Once files are chosen, a threshold set and an output type and folder selected, processing can be started using 'Process list'. The current progress of the batch processing will be shown in the progress bar.

Choosing files

All files to be processed are shown in a list in the 'Files' section. Files can be added to this list with the add files icon, or the entire contents of one or more folders with the add folder icon. A standard opening dialog is displayed, from which text files can be added. You can also add files or folders to be processed by dragging and dropping them from your system into the list. The entire directory tree rooted at a loaded folder can also be added by enabling the 'Search Subfolders' button (search subfolders icon).

Files can be removed from the list by selecting them and clicking the delete items icon, or the entire list can be cleared with the clear list icon.

Any texts which can be processed by the single text mode can be processed by the batch processing mode. Any changes made to texts with VARD and saved with XML tags will be loaded when processing in batch.

Normalisation control

VARD 2 will generally find numerous potential normalisations for a given variant, and each of these suggestions is given a confidence score. When automatic processing takes place, the suggested normalisation with the highest confidence score is used to replace each variant. However, if the system's normalisation methods 'struggle' with a particular variant, the highest confidence score may be relatively low. For these cases a threshold is required: the minimum confidence score needed for a normalisation to take place. If the threshold is not met by the top normalisation suggestion, the word is left as a variant.

The user can select a threshold to use when processing the current list of texts; the value must be between 0% and 100%. It is recommended either to use the single text mode to train the tool on your texts/corpus and get a feel for a suitable threshold, or to test a threshold on a sample first to gauge the recall and precision of the replacements made. A higher threshold will increase precision but reduce recall.
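As a rough illustration of the threshold rule described above, the following Python sketch picks the top-scoring suggestion and applies it only if it meets the threshold. The function name and the (candidate, confidence) pair representation are assumptions made for illustration, as is the choice to treat a score exactly equal to the threshold as meeting it; this is not VARD's own code.

```python
def choose_normalisation(variant, suggestions, threshold):
    """Return the replacement for `variant`, or `variant` itself if left as is.

    suggestions: list of (candidate, confidence) pairs, confidence in 0..100.
    threshold:   minimum confidence (0..100) needed for a normalisation.
    """
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    if not suggestions:
        return variant  # nothing was suggested, so the word stays a variant
    best_word, best_score = max(suggestions, key=lambda pair: pair[1])
    # Only the top suggestion is considered; lower-ranked ones are never used.
    return best_word if best_score >= threshold else variant
```

With the invented suggestions `[("have", 82.0), ("hue", 40.0)]` for the variant `haue`, a threshold of 50 yields `have`, while a threshold of 90 leaves `haue` unchanged.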

When processing in batch mode, each text in the list is normalised in turn. For each text, all variants are found; for each variant a suitable normalisation is searched for and, if its confidence is above the threshold, the normalisation is made. The same variants are often found in multiple files, but by default the search for normalisations is repeated every time a variant is encountered. By enabling 'Use normalisation cache', all normalisations made are saved in a fast-access hash map. When a variant is encountered it is first looked up in the normalisation cache and, if found, the cached normalisation is made rather than searching for suitable normalisations again.
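The caching idea can be sketched in a few lines of Python. Here `find_normalisation` is a hypothetical stand-in for VARD's (expensive) search, assumed to return None when no suggestion meets the threshold; only normalisations that were actually made are cached, following the description above.

```python
def normalise_tokens(variants, find_normalisation, cache):
    """Normalise a sequence of variant words, reusing earlier results.

    cache is a plain dict (a hash map) from variant to replacement and can be
    shared across files so repeated variants are only searched for once.
    """
    output = []
    for variant in variants:
        if variant in cache:
            output.append(cache[variant])  # cache hit: no search needed
            continue
        replacement = find_normalisation(variant)  # the expensive step
        if replacement is None:
            output.append(variant)  # left as a variant, nothing to cache
        else:
            cache[variant] = replacement
            output.append(replacement)
    return output
```

Passing the same dictionary for every file in a batch gives the cross-file reuse that the option describes.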

Selecting output

The text can be output with or without changes tagged, or in both formats simultaneously by selecting both check boxes. If the original text was ".rtf", the formatting will be preserved and the output saved as ".rtf"; this only applies to the untagged output.

An output folder must also be chosen: use the browse button (browse icon) to select a folder on your system, or drag a folder onto the text box. All output files and the stats file are placed in this folder. If the input files are from different folders, the folder structure will be retained.

Stats file

Each time the batch processor is run ('Process list') a stats file is created in tab-delimited plain text. A row is present for each file processed, with columns indicating the file name, total number of words and tokens, variant words and tokens remaining, normalised variant words and tokens, and non-variant words and tokens.

A Microsoft Excel file is available which serves as a rough template for further stats from this file. The yellow columns should be replaced with the columns from the stats file described above; be careful to make sure that totals are calculated in the bottom section from the entire list of files, especially if new rows are added. Stats available include totals, minimums, maximums and average of the given columns as well as percentages of how many variants were normalised.


Train sidebar

The training sidebar allows the user to train VARD with previously normalised (VARDed) texts. Each text is processed as if it were being manually processed in the single text mode, with weights and the variants list being updated accordingly. However, the dictionary will not be "trained" in the same way, as there is currently no way to detect automatically whether words should be added or removed. Instead, options are provided to add or remove all marked variants, not variants, normalised forms and original forms from normalised tags. In a future version of VARD the dictionary will be managed in a much 'smarter' way.

To use the training mode, simply select VARDed files in the same way as selecting files in the batch sidebar and click 'Train from file(s)'. VARD will process each file in turn and automatically save relevant lists. You can choose to edit the dictionary based on XML markup found in the files; the following options are available:

  • Add marked not variants?: Any words marked as Not variant will be added to the dictionary.
  • Remove marked variants?: Any words marked as Variant will be removed from the dictionary.
  • Add normalisations?: Any normalisations will be added to the dictionary, provided they are not marked as automatic.
  • Remove normalisation originals?: Any original forms (the variant) from normalisation tags will be removed from the dictionary, provided the normalisation isn't marked as automatic.


Setup sidebar

The setup sidebar provides access to the various setup options available. The sidebar is split into Options, Input, Override and XML, all of which are described below. Changes are not immediately applied to the current session of VARD. To apply the changes you must first 'Save' the setup, upon which the current document will be reloaded (you will be prompted to save any changes in the existing session) so that the newly saved setup can be applied. 'Revert' resets all fields in the Setup sidebar to the last saved values. 'Save as' allows you to save the setup to a new folder; this effectively allows you to duplicate an existing setup or use an existing setup as the base for a new one.

Note that many of the setup options require regular expressions, in which certain characters, such as {, }, [ and ], need escaping with \. For more information about Java regular expressions, see the Java Pattern class documentation.

Setup options sidebar

The 'Options' section of the setup sidebar provides the ability to change the name of the setup, define the training folder, set an annotator ID and edit the languages with which text can be marked up.

Setup name

The setup name is a required field. It allows for quick distinction between different setups and is displayed in the setup bar so you can easily see which setup is currently in use.

Training folder

The training folder is the location of the words file, variants file, rules file and confidence weights file. In previous versions (pre 2.5) these files were included with the rest of the setup files; they are now kept separate so that training can be shared across multiple setups, allowing for training from multiple corpora which may require different setups.

To change the training folder, either type in the new name, use the browse button (browse icon) or drag a folder into the text box. If the training folder is a subfolder of the setup folder, its path will be stored relative to the setup folder path. If the training folder is changed to one that doesn't contain the necessary files, default versions of the missing files are created.

Annotator ID

The annotator ID is an optional field that, if provided, will be added as an attribute to each VARD XML tag. If multiple VARD users are normalising the same text or corpora, this might be worthwhile as it allows the different users' decisions to be distinguished. If an existing VARD tag with an annotator ID attribute is read into VARD, the annotator ID will be maintained when saving that file, unless an annotator with a different ID reverts or removes the decision represented by the XML tag.

Languages

The languages section lists all of the languages available for marking text as a language in the single text mode. You can also set whether each language should be processed or not: if set to process, the text will be treated like any other text; if not, it will be 'ignored' and not processed. This applies to the single text mode, automatic normalisation and training. To add a language to the list, simply type its name and click the add language icon. To remove languages, select them in the list and click the remove language icon.


Setup input sidebar

The input sidebar provides the options for reading text into VARD. You can set the encoding to use, the word pattern to use for tokenization, a list of patterns for 'ignoring' spans of text and a list of encoded entity conversions.

Word pattern

The user can define how a word should be detected, i.e. what characters are valid in a word. The pattern is a regular expression describing what should constitute a word when reading text into VARD. A good knowledge of (Java) regular expressions is recommended before editing this entry. The default expression, and what each part means, is as follows:

(?:\p{L}\p{M}*|[\'\-\^~=])+

  • (?:...) a non-capturing group.
  • \p{L} any 'letter' in Java, including letters with diacritics (except where the diacritic is a separate combining mark, which is dealt with by \p{M}*).
  • \p{M}* zero or more combining marks (e.g. macrons, breves, etc.).
  • | separates the alternatives in the group: either the sequence to the left or the one to the right can match.
  • [...] defines a set of alternative characters.
  • \' apostrophe
  • \- hyphen
  • \^ caret
  • ~ tilde
  • = equals sign
  • + one or more repetitions of the preceding group.

Possible additions to the set of square brackets ([...]) to indicate extra possible characters include 0-9 for digits (e.g. to deal with SMS spelling variation) and ` to represent a grave accent. The Java Pattern class documentation explains Java regular expressions quite well.

Text to ignore

If your texts contain markup or sections which you do not wish to be normalised, VARD will need to be told to 'ignore' these spans. In this section you can set the regular expressions which define spans for VARD not to process; these spans will be grayed out in the main text area.

Three patterns are set by default:

  • <[^>]*>: XML tags (VARD's own XML tags are processed regardless).
  • \[[^\]]*\]: Text between square brackets ([...])
  • \{[^\}]*\}: Text between curly braces ({...})

You can add additional structures to ignore by typing the pattern in the textbox and clicking the add item icon. For example, you might want to stop VARD from processing text between <header>...</header> tags with something similar to:

(?s)<header>.+?</header>

(?s) puts the regular expression into 'dotall' (or 'single-line') mode, meaning that . (any character) also matches line breaks. As a result, if the opening <header> tag and closing </header> tag are on different lines, the text between them will still be ignored. The Java Pattern class documentation explains Java regular expressions quite well.

The ordering of text to ignore patterns is important: more 'specific' patterns should be earlier in the list. For example, if the XML tag pattern was at the top of the list, followed by the <header> tag pattern, the XML tags would be ignored first, stopping the <header> tag pattern from being matched. You can change the order of patterns using the move up and move down buttons.

You can edit an existing pattern by selecting it in the list and clicking the edit item button; this will remove the item from the list and place its text in the textbox, ready for editing and adding back to the list. To remove patterns from the list, simply select them and click the remove item button.
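The effect of pattern order can be illustrated with a small Python sketch. It assumes, purely for illustration, that patterns are tried in list order and that a later match is discarded if it overlaps a span already claimed by an earlier pattern; VARD's internal matching may differ in detail. The two patterns used here behave the same in Python's `re` module as in Java.

```python
import re

XML_TAG = r"<[^>]*>"
HEADER = r"(?s)<header>.+?</header>"

def ignored_spans(text, patterns):
    """Return sorted (start, end) spans to ignore, trying patterns in order."""
    spans = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            start, end = match.span()
            # Skip matches that overlap a span claimed by an earlier pattern.
            if any(start < e and s < end for s, e in spans):
                continue
            spans.append((start, end))
    return sorted(spans)

text = "<header>Title\npage</header> Some text"

# Specific pattern first: the whole header, contents included, is ignored.
print(ignored_spans(text, [HEADER, XML_TAG]))  # [(0, 27)]

# Generic pattern first: only the two tags are ignored, so "Title\npage"
# would still be processed.
print(ignored_spans(text, [XML_TAG, HEADER]))  # [(0, 8), (18, 27)]
```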

Encoding

The encoding field allows the user to define what encoding should be used when reading files into VARD. The default option is 'detect', which means that VARD will make its best effort to detect the encoding of incoming files. Note that VARD isn't always accurate in detecting a file's encoding, so if you know the encoding of the files you are processing you should define it here. The list of encoding options contains those available in the current Java virtual machine. Popular options are 'UTF-8', 'US-ASCII' and 'ISO-8859-1' (a.k.a. 'ISO-LATIN_1').

Entity conversion

Any characters encoded in structures like &(#x)1234; will be converted into Java Unicode characters. This means they will be processed like normal characters and, provided the selected font can display the character, will be displayed normally in the main text area. Upon saving the document, all entities converted to Java Unicode characters will be converted back into their original form, including where a word has been normalised and the Unicode version appears in the replacement.

The entity conversion list provides the ability to convert other encoded characters in the same way, for example named entities. The standard XML named entities are included by default:

  • &amp; = &
  • &apos; = '
  • &gt; = >
  • &lt; = <
  • &quot; = "

To add other entities to the list, provide the encoded string and its replacement and click the add item icon. You can edit an existing entity conversion by selecting it in the list and clicking the edit item icon; this will remove the item from the list and place the encoded entity and its replacement in the textboxes, ready for editing and adding back to the list. To remove entity conversions from the list, simply select them and click the remove item icon. Note that entity replacement occurs after ignoring text, hence text between &lt; and &gt; will not be ignored by default.
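The reading half of entity conversion can be sketched in Python as follows. This is only an illustration of the idea: the function name is hypothetical, the table holds the five default named entities listed above, and the reverse step (restoring the original entities on save) is not shown. Decoding is done in a single pass so that the output of one conversion is never converted a second time.

```python
import re

DEFAULT_NAMED = {"&amp;": "&", "&apos;": "'", "&gt;": ">", "&lt;": "<", "&quot;": '"'}

def decode_entities(text, named=None):
    """Replace numeric (&#1234; / &#x4D2;) and listed named entities."""
    named = DEFAULT_NAMED if named is None else named
    parts = [r"&#[xX]([0-9A-Fa-f]+);", r"&#([0-9]+);"]
    # Longest names first, so one entity cannot shadow a longer one.
    parts += [re.escape(key) for key in sorted(named, key=len, reverse=True)]
    pattern = re.compile("|".join(parts))

    def convert(match):
        if match.group(1):                    # hexadecimal character reference
            return chr(int(match.group(1), 16))
        if match.group(2):                    # decimal character reference
            return chr(int(match.group(2)))
        return named[match.group(0)]          # entry from the conversion list

    return pattern.sub(convert, text)
```

For example, `decode_entities("caf&#xE9; &amp; t&#233;")` gives `café & té`, and a custom dictionary can be passed as `named` to mimic additional entries in the conversion list.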


Setup overrides sidebar

The overrides sidebar provides options to bypass VARD's finding of normalisation suggestions and detection of variants.

Always apply rules

These are rules which are applied prior to searching for normalisation suggestions. If, after applying the rules to the variant string, the resulting word is found in VARD's real word list, then this word is offered as a normalisation suggestion at 99% confidence. The string with the rules applied is also used instead of the variant when finding other normalisation suggestions; e.g. if the string with rules applied is in the known variants list, then the corresponding replacement is offered as a normalisation suggestion. Unlike standard letter replacement rules, these rules can be regular expression replacements, meaning you can add a regular expression to search for in the variant and give a string to replace any matches with. If the replacement string is blank, any occurrences of the search term will be removed from the variant. It is not possible to add 'always apply rules' which act like insertion letter replacement rules. Note that all rules will be applied to the variant string, potentially multiple times, and the intermediate strings created will not be considered (this differs from the letter replacement rules).

An example of when this functionality is useful is dealing with long-s's (ſ), which are often found in Early Modern English texts in place of standard s's. The replacement of long-s with s is included by default as an 'always apply rule', meaning that if a long-s is encountered in a variant, e.g. Biſhop, it will be replaced with a standard s, giving Bishop. In this case, with Bishop being present in VARD's real word list (by default), Bishop will be offered as a normalisation suggestion with a 99% confidence score. Other normalisation suggestions will be searched for as if the variant string was Bishop.

Note that if a normalisation suggestion resulting from solely applying 'always apply rules' (i.e. the resulting word is in the real word list) is chosen during training then the method weights will not be adjusted. If any other suggestion resulting from applying the rules is selected then the method weights will be adjusted as normal.
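The behaviour of 'always apply rules' can be approximated with the Python sketch below. The function name and return shape are assumptions for illustration; each rule is applied once, in list order, replacing every match, which simplifies the "potentially multiple times" behaviour described above. Case handling against the real word list is not modelled.

```python
import re

def apply_always_rules(variant, rules, real_words):
    """Apply ordered (regex, replacement) rules to a variant.

    Returns (transformed, suggestion), where suggestion is either None or a
    (word, confidence) pair offered at 99% when the transformed string is in
    the real word list.
    """
    transformed = variant
    for search, replacement in rules:
        # A blank replacement removes every occurrence of the search term.
        transformed = re.sub(search, replacement, transformed)
    suggestion = None
    if transformed != variant and transformed in real_words:
        suggestion = (transformed, 99)
    # The transformed string is what later suggestion methods would work from.
    return transformed, suggestion

rules = [("\u017f", "s")]  # the default long-s (ſ) to s rule
print(apply_always_rules("Bi\u017fhop", rules, {"Bishop"}))
# ('Bishop', ('Bishop', 99))
```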

New rules can be added to the list by providing both the (regular expression) search term and the replacement in the provided text boxes and clicking the add rule icon. You can edit an existing rule by selecting it and clicking the edit rule icon; this will remove the rule from the list and add the search term and replacement to the relevant text boxes. To remove rules, select them and click the remove rule icon. Rules are applied in the order they are presented; you should take this into account if any rules clash with other rules, e.g. if a replacement is in the search term of another rule. You can move the rules around the list with the move up and move down buttons.

Override to not variant

If a word in a text is matched by any of the regular expression patterns in this list then that word will be marked as not variant regardless of whether the word appears in the real word list. The regular expression must match the entire word for the override to occur.

An example of when this function is useful: if digits are treated as letters in the word pattern (e.g. for SMS spelling variation), any numeric ordinal, e.g. 1st, 2nd, 3rd or 4th, will normally be marked as a variant. There was previously (pre 2.5) no complete solution to this issue, as the only option was to add these to the real word list, which is both tedious and can only ever deal with a finite set of numbers. By adding the pattern [0-9]+(th|st|nd|rd), any sequence of digits followed by th, st, nd or rd will be marked as not variant without adding any numeric ordinals to the real word list.

You can add additional override patterns by typing the regular expression in the textbox and clicking the add item icon. You can edit an existing pattern by selecting it in the list and clicking the edit item button; this will remove the item from the list and place the regular expression in the textbox, ready for editing and adding back to the list. To remove patterns from the list, simply select them and click the remove item button.

Override to variant

This list works in exactly the same way as the override to not variant list except that any words matched by the patterns listed will be marked as variants regardless of whether the word appears in the real word list.

Note that the not variants list is checked first, so if a word is matched by patterns in both lists the word will be marked as not variant.
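The precedence between the two override lists and the real word list can be summarised in a short Python sketch. The function name and return strings are illustrative only. `re.fullmatch` is used because the pattern must match the entire word.

```python
import re

def classify(word, real_words, not_variant_patterns, variant_patterns):
    """Classify a word as 'variant' or 'not variant' using the overrides."""
    # 1. The not variant overrides are checked first and always win.
    if any(re.fullmatch(p, word) for p in not_variant_patterns):
        return "not variant"
    # 2. The variant overrides apply even if the word is in the real word list.
    if any(re.fullmatch(p, word) for p in variant_patterns):
        return "variant"
    # 3. Otherwise fall back to the real word list.
    return "not variant" if word in real_words else "variant"

ordinals = [r"[0-9]+(th|st|nd|rd)"]
print(classify("21st", set(), ordinals, []))    # not variant
print(classify("21stx", set(), ordinals, []))   # variant (not a full match)
```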


Setup XML sidebar

In this sidebar you can define the XML tags and attributes used both to read in previous VARD decisions and to mark up decisions made in the current session. Tags and attributes are required for all VARD decisions; the auto attribute in the normalised tag is optional. For all tags, other attributes are allowed in the tag; these will simply be ignored by VARD. The same set of tags and attributes is used for both input to and output from VARD. It is not possible to input with one set of tags/attributes and output with another; you will need to use a text editor or write a script to search/replace tags if this is your requirement.

Annotator ID attribute

If the annotator ID is set, then this attribute, with the annotator ID as its value, will be added to all tags created by VARD. This field is required whether or not the annotator ID is set; if you do not wish this attribute to appear in tags, you should set the annotator ID to empty. The default is a_id.

Normalised

This tag represents all normalisations made by VARD. An attribute (Orig.) holds the original variant form and is required for all normalised tags. The Auto attribute is optional. If present, "true" is given as the attribute's value if the normalisation was made by VARD automatically (e.g. in batch processing), and "false" otherwise. The Auto attribute is used in Train mode: if the normalisation is marked as automatic it will not count towards training. If the Auto attribute is not present, the distinction cannot be made and every normalisation is counted as non-Auto. An example of the normalised tag with default settings is as follows:

<normalised orig="haue" auto="false">have</normalised>
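For readers post-processing VARD output in a script, the sketch below shows one way the default normalised tag could be read in Python, skipping automatic normalisations in the same way Train mode is described as doing. It assumes the default tag and attribute names shown above and uses simple regular expressions rather than an XML parser, since VARD output is not guaranteed to be fully valid XML; the function name is hypothetical.

```python
import re

NORMALISED_TAG = re.compile(
    r"<normalised\b(?P<attrs>[^>]*)>(?P<text>[^<]*)</normalised>")
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')

def manual_normalisations(xml_text):
    """Return (original, normalised) pairs for non-automatic normalisations."""
    pairs = []
    for tag in NORMALISED_TAG.finditer(xml_text):
        attrs = dict(ATTRIBUTE.findall(tag.group("attrs")))
        if "orig" not in attrs:
            continue  # the orig attribute is required; skip malformed tags
        if attrs.get("auto", "false") == "true":
            continue  # automatic normalisations do not count towards training
        pairs.append((attrs["orig"], tag.group("text")))
    return pairs
```

Other attributes, such as an annotator ID, are simply ignored, and a tag without an auto attribute is treated as non-automatic.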

Variants

This tag surrounds any word which the user has manually marked as a variant. Upon reading this tag VARD will mark the word as a variant regardless of whether the word is in the real word list (and, in batch mode, normalise it if possible). No attributes are required. An example variant tag with default settings would look like:

<variant>bee</variant>

Not variants

This tag surrounds any word which the user has manually marked as not variant. Upon reading this tag VARD will mark the word as not variant regardless of whether the word is in the real word list. No attributes are required. An example not variant tag with default settings would look like:

<notvariant>Melissa</notvariant>

Joins

Any join operations will be marked up with this tag. The original words with surrounding punctuation and white space are given in the Orig. attribute, which is required. If the original text contains linebreaks, how these are represented can be set in the Linebreak field, which again is required. The join tags may surround a variant or normalised tag, but join will never be surrounded by variant or replaced tags. An example join tag with default settings might look like:

<join orig="to-[lb]morrow">tomorrow</join>

Foreign text

Any span of text marked as a language will be surrounded by the tag defined here. The language the span is marked as is a required attribute. Foreign text tags may surround any other VARD tag, but no other VARD tag can surround this tag. If your text has had language identification performed on it by another tool, or the language of sections has been marked up manually separately from VARD, you may need to edit the foreign text tag and attribute so that VARD can take advantage of these tags. An example foreign text tag with default settings is as follows:

<foreign lang="Latin">Princeps Pacis</foreign>

Partitions

Any text which has been partitioned for sampling will have partitions marked up in the text with XML tags. Each partition has an opening tag before the first word in the partition and a closing tag after the last word in the partition. Note: it is possible that these tags could make the XML invalid by crossing other tags (e.g. if a partition starts in one marked-up paragraph and ends in another). VARD doesn't rely on 100% valid XML structure, so this isn't an issue when processing partitioned texts within VARD, but other tools may rely on a fully correct structure. For this reason, partition markup should be removed after you have completed the sample if you wish to use the texts with other tools.

The partition tag contains three attributes: the sequence ('Seq.') attribute is an index within that file started at 1 and incremented for each partition, the 'Count' attribute contains the number of words in the partition, and the 'State' attribute indicates whether the partition has been normalised and marked completed ('done'), in progress ('started') or remaining for future random selection ('queued'). An example partition tag with default settings is as follows:

<vard:partition seq="46" state="queued" words="79">that Countrie. [...] manifest it self. I</vard:partition>


Rules sidebar

An important part of VARD 2's normalisation capability is a list of letter replacement rules which are used to find potential variant normalisations. Each rule has three parameters: original string, replacement string and position, e.g. replace ys with ies at the end of the word. Insertions can be achieved by leaving the original string blank and deletions by leaving the replacement string blank. 58 rules are present by default in the software; rules can be added to or removed from this list in the Rules sidebar.

A rule can be added to the list by inputting the original, replacement and position parameters and clicking the add rule icon. Leaving 'Original' blank will result in an insertion rule and leaving 'Replacement' blank will result in a deletion rule. Be careful to consider the context of a rule: surrounding letters may be provided by including them in both the original string and the replacement string. For instance, a good rule to deal with cou'd, wou'd and shou'd (could, would and should) would be to replace ou'd with ould at the end of a word. You can edit an existing rule by selecting it and clicking the edit rule icon; this will remove the rule from the list and add its fields to 'Add rule'. To remove rules, select them and click the remove rule icon.

The DICER tool may be useful for finding letter replacement rules to add to this list. DICER reads in VARD XML output and produces a list of letter replacement rules from the spelling normalisations made, along with various usage statistics. Get in touch if you're interested in using it.

As soon as the rules list is altered, the changes come into effect for finding normalisation suggestions in the current session; this applies to all modes (single text, batch and train). To save the rules for future sessions you must 'Save rules list'. You will also be prompted to save any changes to the rule list on exit and on 'Save all'.
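To make the three rule parameters concrete, here is a minimal Python sketch that applies a single letter replacement rule to a word. The position names 'start', 'end' and 'anywhere' are assumptions chosen for this example (the guide itself only mentions the end of a word), and the function is an illustration rather than VARD's implementation.

```python
def apply_rule(word, original, replacement, position):
    """Return the sorted candidate strings produced by one rule.

    A blank `original` makes an insertion rule; a blank `replacement`
    makes a deletion rule.
    """
    n = len(original)
    if position == "start":
        offsets = [0]
    elif position == "end":
        offsets = [len(word) - n]
    elif position == "anywhere":
        offsets = range(len(word) - n + 1)
    else:
        raise ValueError("position must be 'start', 'end' or 'anywhere'")
    results = set()
    for i in offsets:
        # A negative offset means the word is shorter than `original`.
        if i >= 0 and word[i:i + n] == original:
            results.add(word[:i] + replacement + word[i + n:])
    results.discard(word)  # a rule that changes nothing yields no candidate
    return sorted(results)

print(apply_rule("cou'd", "ou'd", "ould", "end"))  # ['could']
print(apply_rule("flys", "ys", "ies", "end"))      # ['flies']
```

Including the surrounding letters in both strings, as in the ou'd rule, keeps the rule from firing on unrelated words that merely contain an apostrophe.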


Command line interfaces

As well as the graphical user interface to VARD, there are also various command line interfaces that may be useful for running VARD over a network or as part of a pipeline of tools through a script.

Batch process

This command line interface allows the user to process texts as if using the batch mode of the main interface. The following command will run VARD over the defined input folder:

java -Xms256M -Xmx512M -jar clui.jar "<setup folder>" <threshold int> <f-score weight int or fraction> "<input directory>" <search subfolders> "<output directory>" <use normalisation cache>

As when running the main interface, the memory values can be changed to suit your needs but 256/512 should be adequate for most uses.

Anything in brackets (<...>) should be replaced with your own values:

<setup folder>: This is the setup folder VARD will use; it is equivalent to that chosen in the user interface. If the setup folder does not exist, it will be created and default settings used.

<threshold int>: Indicates the normalisation threshold. This should be an integer value between 0 and 100.

<f-score weight int or fraction>: Indicates the f-score weight used when calculating replacement scores. 1 indicates an equal balance between precision and recall, 1/2 indicates that precision is weighted twice as much as recall, and 2 indicates that recall is weighted twice as much as precision. 1 is generally fine for most purposes; the threshold has a bigger effect on VARD's performance. For precision bias, a score less than one should be given in the form of a fraction, whose numerator and denominator should be whole numbers between 1 and 9 (inclusive). For recall bias, a score greater than one should be given; this should be a whole number between 1 and 9 (inclusive).

<input directory>: This is the directory from which texts will be read. Ensure that the directory is surrounded by quotes ("..."). All files that are not hidden are read (or at least attempted) by VARD 2, regardless of extension or type.

<search subfolders>: Should be true or false. If true, the entire directory tree rooted at the selected folder is processed. If false, only files in the immediate directory are processed.

<output directory>: This is the directory where processed texts will be placed. Two folders are created, one with tagged texts and one without, plus a stats file. If search subfolders is set to true the original directory structure will be maintained in the two folders. Again, ensure the directory is surrounded by quotes ("...").

<use normalisation cache>: Should be true or false. If true then cached normalisations will be used. If false then normalisations will be searched for each time a variant is encountered. This is explained in more detail in the batch mode description.
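When the batch interface is called from a script, it can help to assemble the argument list programmatically. The Python sketch below builds the command described above as a list suitable for `subprocess.run`; the function name, the default jar location (the current directory) and the example paths are assumptions. Passing a list avoids manual quoting of paths that contain spaces.

```python
def build_batch_command(setup_folder, threshold, fscore_weight, input_dir,
                        search_subfolders, output_dir, use_cache,
                        jar="clui.jar", min_mb=256, max_mb=512):
    """Build the argument list for the command line batch interface.

    fscore_weight is passed through as text, e.g. "1", "2" or "1/2".
    """
    if not isinstance(threshold, int) or not 0 <= threshold <= 100:
        raise ValueError("threshold must be an integer between 0 and 100")
    return [
        "java", f"-Xms{min_mb}M", f"-Xmx{max_mb}M", "-jar", jar,
        setup_folder, str(threshold), fscore_weight, input_dir,
        "true" if search_subfolders else "false",
        output_dir,
        "true" if use_cache else "false",
    ]

# Hypothetical folders; run with subprocess.run(command, check=True).
command = build_batch_command("my setup", 50, "1", "texts in", True,
                              "texts out", True)
print(command)
```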


Auto normalise single file

This command line interface to VARD is similar to the command line batch processing interface, except that instead of processing an entire folder's worth of files, it processes a single file. Use of this interface is normally discouraged, as the initialisation cost of VARD is quite high and can take longer than processing the file itself. The following command will run this interface:

java -Xms256M -Xmx512M -jar 1clui.jar "<setup folder>" <threshold int> <f-score weight int or fraction> "<input file>" <output type> "<output file>"

Anything in brackets (<...>) should be replaced with your own values:

<setup folder>: This is the setup folder VARD will use; it is equivalent to that chosen in the user interface. If the setup folder does not exist, it will be created and default settings used.

<threshold int>: Indicates the normalisation threshold. This should be an integer value between 0 and 100.

<f-score weight int or fraction>: Indicates the f-score weight used when calculating replacement scores. 1 indicates an equal balance between precision and recall, 1/2 indicates that precision is weighted twice as much as recall, and 2 indicates that recall is weighted twice as much as precision. 1 is generally fine for most purposes; the threshold has a bigger effect on VARD's performance. For precision bias, a score less than one should be given in the form of a fraction, whose numerator and denominator should be whole numbers between 1 and 9 (inclusive). For recall bias, a score greater than one should be given; this should be a whole number between 1 and 9 (inclusive).

<input file>: This is the file that will be processed. Ensure that the file path is surrounded by quotes ("...").

<output type>: Should be XML or Plain. If XML then XML tags are included to mark changes. If Plain then changes are unmarked.

<output file>: This is the file to which the processed text will be written. Again, it's generally a good idea to surround the file path with quotes ("...").
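The f-score weight argument described above can be validated before VARD is launched, and its effect illustrated with the standard F-beta measure. In the Python sketch below, the accepted range follows the description above; that the weight corresponds to the usual beta of the F-beta formula is an assumption, and exactly how VARD combines it with its own scores is not shown here.

```python
from fractions import Fraction

def parse_fscore_weight(text):
    """Parse "2" or "1/2" style weights, enforcing the 1..9 range."""
    if "/" in text:
        numerator, denominator = (int(part) for part in text.split("/"))
    else:
        numerator, denominator = int(text), 1
    if not (1 <= numerator <= 9 and 1 <= denominator <= 9):
        raise ValueError("numerator and denominator must be between 1 and 9")
    return Fraction(numerator, denominator)

def f_beta(precision, recall, beta):
    """Standard F-beta: beta < 1 favours precision, beta > 1 favours recall."""
    b2 = float(beta) ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator

# High precision, low recall: a precision-biased weight scores it higher.
print(f_beta(1.0, 0.5, parse_fscore_weight("1/2")))
print(f_beta(1.0, 0.5, parse_fscore_weight("2")))
```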


STDIN interface

This interface allows VARD to be initialised once, after which it listens to STDIN for files to normalise. To start the interface use this command:

java -Xms256M -Xmx512M -jar vardstdin.jar "<setup folder>" <threshold int> <f-score weight int or fraction> <use normalisation cache> "<stats file>" <output type>

Anything in brackets (<...>) should be replaced with your own values:

<setup folder>: This is the setup folder VARD will use; it is equivalent to that chosen in the user interface. If the setup folder does not exist, it will be created and default settings used.

<threshold int>: Indicates the normalisation threshold. This should be an integer value between 0 and 100.

<f-score weight int or fraction>: Indicates the f-score weight used when calculating replacement scores. 1 indicates an equal balance between precision and recall, 1/2 indicates that precision is weighted twice as much as recall, and 2 indicates that recall is weighted twice as much as precision. 1 is generally fine for most purposes; the threshold has a bigger effect on VARD's performance. For precision bias, a score less than one should be given in the form of a fraction, whose numerator and denominator should be whole numbers between 1 and 9 (inclusive). For recall bias, a score greater than one should be given; this should be a whole number between 1 and 9 (inclusive).

<use normalisation cache>: Should be true or false. If true then cached normalisations will be used. If false then normalisations will be searched for each time a variant is encountered. This is explained in more detail in the batch mode description.

<stats file>: This is the path to the stats file, to which a row is added for each file normalised. If this is set to null, no stats file will be created or written to.

<output type>: Should be XML or Plain. If XML then XML tags are included to mark changes. If Plain then changes are unmarked.

Once VARD is initialised, a message will be sent to STDOUT saying "READY". To process a file, simply send [file to process] > [output file] to STDIN, replacing [file to process] with the path to the text file to be normalised and [output file] with the path to the file to write the processed text to. Once the text is processed and output, "FILECOMPLETE" will be sent to STDOUT. You can repeat this process as often as needed, even in parallel if desired. To exit, simply send the string "exit" to STDIN.
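The exchange described above can be driven from a script. The Python sketch below processes files one at a time over any pair of text streams; it assumes that each request and the exit command are sent as newline-terminated lines and that "READY" and "FILECOMPLETE" each appear on a line of STDOUT, which the guide does not state explicitly. All function names and paths are hypothetical.

```python
def request_line(input_path, output_path):
    """Format one request in the '[file to process] > [output file]' form."""
    return f"{input_path} > {output_path}\n"

def wait_for(stream, marker):
    """Read lines from the stream until one contains the marker."""
    for line in stream:
        if marker in line:
            return True
    return False  # the stream ended before the marker was seen

def normalise_files(to_vard, from_vard, pairs):
    """Send (input, output) path pairs one by one, then ask VARD to exit."""
    if not wait_for(from_vard, "READY"):
        raise RuntimeError("VARD did not report READY")
    completed = []
    for input_path, output_path in pairs:
        to_vard.write(request_line(input_path, output_path))
        to_vard.flush()
        if not wait_for(from_vard, "FILECOMPLETE"):
            raise RuntimeError(f"no FILECOMPLETE for {input_path}")
        completed.append(output_path)
    to_vard.write("exit\n")
    to_vard.flush()
    return completed

# Hypothetical wiring to a real process (arguments are placeholders):
# import subprocess
# proc = subprocess.Popen(
#     ["java", "-Xms256M", "-Xmx512M", "-jar", "vardstdin.jar",
#      "setup", "50", "1", "true", "null", "XML"],
#     stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
# normalise_files(proc.stdin, proc.stdout, [("in.txt", "out.xml")])
```

Keeping one initialised process and feeding it many files avoids paying VARD's start-up cost for every file.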


Training files

VARD saves a lot of data so that future sessions can benefit from training by a user. Four files are saved as training data; these are stored in the training folder referred to in the setup. Any of the files described below can be reset to its original version by simply deleting the file. The next time the training folder is used by VARD, the file will be re-initialised with default values.

All files are tab-delimited with one entry per line (except for the variants list). Whilst lines may appear in alphabetic order, this is not a necessity. If editing the files manually, it is important to ensure the files are saved with UTF-8 encoding.

words.txt

The real word list (or dictionary) used by VARD to classify words in a document and to find normalisations for variants is stored in words.txt. Words are added when a user opts to "add a word to the dictionary", and replacements from normalisations may also be added if this is enabled in the Train sidebar. The file is quite simple in that it contains a maximum of 3 columns:

word [tab] frequency [tab] (# if word is user added)

The frequency is either taken from the British National Corpus or based on the folder in SCOWL. At present new words are always given a frequency of 10.

If this file is deleted, next time VARD uses the training folder it will re-initialise words.txt. This is done by adding words with a range of 50 or more in the British National Corpus (hidden from the user) and all words contained within the files from the scowl folder. The scowl folder contains a subset of the word lists offered by SCOWL (Spell Checker Oriented Word Lists). Unwanted word lists can be deleted and new ones added; see the SCOWL webpage for extra word lists. The files simply contain lists of words, each on a new line. The file's ending indicates the frequency of the words, e.g. .10 indicates the top 10% most frequent words.

If creating your own words list (e.g. for another language) it may not be possible to include frequency information. The frequencies are only used when ranking normalisation suggestions if two confidence scores are equal. Due to the more complicated manner in which confidence scores are calculated, it is quite unlikely that two scores will be equal (particularly highly ranked scores) and so the same frequency (10 say) can be given to every word in the list without much detriment to VARD's performance.
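
As a rough illustration of the above, the following Python sketch builds the contents of a words.txt file from a plain list of words, giving every word the same frequency. It is not part of VARD: the function names are hypothetical, and it assumes that the third column can simply be omitted for words that are not marked as user added.

```python
def build_words_text(words, frequency=10):
    """Return words.txt content: one 'word<TAB>frequency' entry per line."""
    unique_words = sorted(set(words))  # ordering is optional, duplicates are dropped
    lines = [f"{word}\t{frequency}" for word in unique_words]
    return "\n".join(lines) + "\n"


def write_words_file(path, words, frequency=10):
    """Write the word list to path, saved with UTF-8 encoding."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(build_words_text(words, frequency))
```

Calling write_words_file("words.txt", my_words) with your own list would then produce a file in the three-column layout shown earlier, with the optional third column left out.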

variants.txt

The variant list containing variant to replacement mappings is saved in variants.txt. These are used by the known variants list method. Entries are grouped for variant strings, with a list of possible replacements for each. The file is formatted as follows:

variant
* [tab] replacement 1 [tab] count [tab] (# if mapping is user added)
* [tab] replacement 2 [tab] count [tab] (# if mapping is user added)
... etc.

All counts start at 1. If a count is reduced to 0 (by a replacement being reverted), the mapping will be removed from the variants list (and recorded in the log). Counts will be incremented each time a replacement is used, and entries will be added when a normalisation is made which is not already in the list. Note: variant replacements are not necessarily in the real word list.
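
The grouped layout of variants.txt can be read with a few lines of code. The Python sketch below is an illustration rather than VARD's own reader: the function name is hypothetical, and it assumes that each replacement line begins with a literal * followed directly by a tab, with no surrounding spaces.

```python
def parse_variants(text):
    """Return {variant: [(replacement, count, user_added), ...]}."""
    variants = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("*"):
            if current is None:
                raise ValueError("replacement line found before any variant")
            fields = line.split("\t")
            replacement = fields[1]
            count = int(fields[2])
            user_added = len(fields) > 3 and fields[3].strip() == "#"
            variants[current].append((replacement, count, user_added))
        else:
            # Any other non-blank line starts a new variant group.
            current = line.strip()
            variants[current] = []
    return variants
```

Given a hypothetical entry for the variant "vertue" with replacement lines for "virtue" (count 3, user added) and "vertu" (count 1), the function returns {"vertue": [("virtue", 3, True), ("vertu", 1, False)]}.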

rules.txt

The rules file contains a list of all letter replacement rules used by the system. Rules can be added, edited and removed with the Rules sidebar or by manually editing this file. There are four fields for each entry:

original [tab] replacement [tab] position [tab] (# if rule is user added)

The original field may be blank if the rule is an insertion; likewise, the replacement field may be blank if the rule is a deletion. The position field should be one of Start, Second, Middle, Penultimate, End, Anywhere.
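
When editing rules.txt by hand, it can help to check each line against the layout above. The following Python sketch is one way to do that, under the assumption that the fourth field is either absent or a single #. The function and constant names are hypothetical.

```python
VALID_POSITIONS = {"Start", "Second", "Middle", "Penultimate", "End", "Anywhere"}


def parse_rule(line):
    """Split one rules.txt line into (original, replacement, position, user_added)."""
    # Only strip the line ending: a leading tab means a blank original (an insertion).
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 tab-separated fields: {line!r}")
    original, replacement, position = fields[0], fields[1], fields[2]
    if position not in VALID_POSITIONS:
        raise ValueError(f"unknown position: {position!r}")
    user_added = len(fields) > 3 and fields[3] == "#"
    return original, replacement, position, user_added
```

A line with a blank first field, such as a user-added rule inserting "e" at the end of a word, parses to ("", "e", "End", True).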

weights.txt

The current confidence weights for each normalisation method are stored in weights.txt. These are used to calculate confidence scores for each suggested normalisation. It is not recommended to edit this file, but each line represents:

Method code [tab] true positives [tab] false positives [tab] false negatives

The values are used to create precision and recall scores which are combined to give F-Scores. Making normalisations in the single text mode and with the train sidebar will have an impact on these scores.
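
To illustrate how such counts translate into a score, the Python sketch below computes precision, recall and the standard weighted F-measure (F-beta). This guide does not state VARD's exact formula, so the use of the standard F-beta definition here is an assumption, as is the function name.

```python
def f_score(true_pos, false_pos, false_neg, beta=1.0):
    """Standard F-beta score from true positive, false positive and false negative counts.

    beta = 1 balances precision and recall, beta < 1 favours precision
    and beta > 1 favours recall.
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

With 8 true positives, 2 false positives and 8 false negatives, precision is 0.8 and recall is 0.5. A weight of 1 gives a score of 8/13 (about 0.615), whereas a weight of 1/2 gives 5/7 (about 0.714), because the stronger precision is weighted more heavily.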



Setup files

These files, contained in the setup folder, represent a setup defined in the setup sidebar. It is recommended to use the setup sidebar to make edits to the setup, but if you wish you can edit the files directly. Note that many of the setup options require regular expressions, in which certain characters, such as {, }, [ and ], need escaping ("breaking") with \. For more information, see the documentation on Java regular expressions.

options.txt

This file contains various options which can be set by the user to influence how VARD operates. The fields available are:

setup_name gives a name for the setup to display to the user. This field represents the setup name in the options setup sidebar.

training_folder gives the path to the training files. This can be relative to the current setup folder, or a full path pointing to a folder elsewhere. This field represents the Training folder in the options setup sidebar.

annotator_id is the string to add to all VARD XML tags to indicate which user made that decision. If left empty, the annotator id will not be added to tags. This field represents the Annotator ID in the options setup sidebar.

encoding dictates the encoding which should be used when reading texts into VARD. If this is set to detect (the default) then VARD will attempt to detect the encoding. Detection is not always possible, however, and when an incorrect encoding is chosen characters will be displayed and processed erroneously. If the encoding is consistent and known for your corpus, it should be enforced here. Different versions of Java have different encodings available, but according to the Java API the following are available as standard:

  • US-ASCII
  • ISO-8859-1
  • UTF-8
  • UTF-16BE
  • UTF-16LE
  • UTF-16
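
The effect of reading a text with the wrong encoding can be seen in a small example. The Python sketch below is purely illustrative and does not involve VARD: it encodes a word as UTF-8 and then decodes the same bytes as both UTF-8 and ISO-8859-1.

```python
original = "café"

# The bytes as they would be stored in a UTF-8 encoded file.
utf8_bytes = original.encode("utf-8")

# Decoding with the correct encoding recovers the text.
read_correctly = utf8_bytes.decode("utf-8")

# Decoding the same bytes as ISO-8859-1 turns the two-byte "é" into two characters.
read_wrongly = utf8_bytes.decode("iso-8859-1")

print(read_correctly)  # café
print(read_wrongly)    # cafÃ©
```

A word garbled in this way would no longer match the real word list, which is why enforcing a known encoding is preferable to relying on detection.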

word_regex is a regular expression describing what should constitute a word when reading text into VARD. A good knowledge of (Java) regular expressions is needed if editing this entry. The expression lists valid characters for a word. This field represents the Word pattern section of the input setup sidebar.

text_to_ignore.txt

This file contains a list of regular expressions which are used to detect text which should not be processed by VARD, i.e. ignored. This file represents the Text to ignore section of the input setup sidebar.

entities.txt

This file contains a list of regular expressions representing entities along with their desired replacement. The regular expression will be searched for in the text and replaced with the string given in the second column (separated by a [tab]). This file represents the Entity conversion section of the input setup sidebar.
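
The two-column layout of entities.txt can be sketched as follows. This Python example is an illustration only: it uses Python's regular expression engine rather than Java's, the function names are hypothetical, and it assumes that the replacement column is inserted literally rather than interpreted as a pattern. The sample entities are invented for the example.

```python
import re


def parse_entities(text):
    """Return a list of (compiled pattern, replacement) pairs, one per line."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        pattern, _, replacement = line.partition("\t")
        pairs.append((re.compile(pattern), replacement))
    return pairs


def convert_entities(text, pairs):
    """Replace every match of each pattern with its replacement string."""
    for pattern, replacement in pairs:
        # A function is used so that the replacement is taken literally.
        text = pattern.sub(lambda _match, r=replacement: r, text)
    return text
```

With the two hypothetical entries "&amp;" to "&" and "&eacute;" to "é", the text "caf&eacute; &amp; bar" would be converted to "café & bar".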

languages.txt

This file lists languages available for marking spans as a language. Each language is listed alongside either "process" or "ignore" to indicate whether to treat the text as normal, or to treat it as ignored text. This file represents the Languages section of the options setup sidebar.

alwaysrules.txt

This file lists all of the override normalisation rules. Each entry contains a regular expression search term and a replacement, separated by a [tab]. This file represents the Always apply rules section of the overrides setup sidebar.

notvariant_overrides.txt

This file lists all of the regular expressions to check against for words to override to not variant, regardless of whether the word is in the real word list. This file represents the Override to not variant section of the overrides setup sidebar.

variant_overrides.txt

This file lists all of the regular expressions to check against for words to override to variant, regardless of whether the word is in the real word list. This file represents the Override to variant section of the overrides setup sidebar.

annotations.txt

This file gives the tag and attribute names to use in VARD XML markup. The tags are described in the XML setup sidebar, which this file represents.



Log files

If an error occurs whilst using VARD a log should be created in error_log.txt in the VARD root folder, which may help explain what has gone wrong. It would be appreciated if errors could be reported in the bugs section if not already listed there.

Three log files are also stored in the training folder; one for rule changes (rules_change_log.txt), one for the real words list (words_change_log.txt) and one for the list of variant -> replacement mappings (variants_change_log.txt). Each time a change is saved to the rules list, real word list or known variants list a timestamped log entry is made detailing the change.

The log files can be very useful for the future development of letter replacement rules, the default real word list and the pre-defined variant -> replacement mappings. It would be greatly appreciated if you could send your log files for analysis, especially after training the tool for your data.



Other files

There are various files in the root folder of VARD; a description of each is given below. Some of these files are provided in the VARD download, and others are created when VARD is first run.

previous_setups.txt

This file provides a list of setup folders chosen in the setup bar. It is used to populate the drop down of previous setups, which saves searching for a setup folder. The paths are operating system specific, so, for example, you couldn't share this file between a Windows setup and a Mac OSX setup. If you delete the file, it will be recreated next time VARD is run.

run.command

Use this to run VARD on Mac OSX. It will start the VARD main interface with 256M and 512M as the system memory allocations. If you get a message like '"run.command" can't be opened because it is from an unidentified developer', hold down 'ctrl', click 'run.command' again and select 'Open'.

run.sh

Use this to run VARD on Unix/Linux. It will start the VARD main interface with 256M and 512M as the system memory allocations. Run "chmod a+x run.sh" first to make the script executable.

run.bat

Use this to run VARD on Windows machines (it may appear just as "run"). It will start the VARD main interface with 256M and 512M as the system memory allocations. You may need to set permissions to allow this to run.

gui.jar

This is the Java base for the user interface. The jar is runnable by clicking it, but it is recommended to use the appropriate run script, or to run it from the command line, so that enough system memory can be assigned to the Java virtual machine.
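
If you launch gui.jar from your own script rather than from the supplied run scripts, the memory options can be built programmatically. The Python sketch below is a hypothetical helper, not something shipped with VARD. It follows the guideline given earlier in this guide that the lower memory value should usually be half of the higher value.

```python
def build_vard_command(max_heap_mb=512, jar="gui.jar"):
    """Return the java command as an argument list, with -Xms set to half of -Xmx."""
    min_heap_mb = max_heap_mb // 2
    return ["java", f"-Xms{min_heap_mb}M", f"-Xmx{max_heap_mb}M", "-jar", jar]
```

With the default arguments this reproduces the command used by the run scripts, java -Xms256M -Xmx512M -jar gui.jar. The resulting list can be passed to subprocess.run from within the VARD directory.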

clui.jar

This is the Java base for the command line batch interface.

1clui.jar

This is the Java base for the command line single file interface.

vardstdin.jar

This is the Java base for the command line STDIN interface.

model.jar

This is VARD's main Java base, which the other jars rely upon.

ReadMe.txt

Brief instructions for running VARD.

scowl folder

Word lists used by VARD to populate words.txt.

