Page last modified: Friday 11th October 2013


This page tracks all released versions of the VARD 2 software, detailing bug fixes, changes made and functionality added. The user guide provides details of the current version's functions and how to use the software.

Current Version

2.5.4

A minor update with one bug fix:

  • Fixed an issue where two encoded entities occurring next to each other in the same word could cause the misalignment of subsequent VARD XML tags.

Please see details of version 2.5.3 and lower, below, for information about the many changes since 2.4. The User Guide has been updated to reflect the latest changes. If you have any questions please use the form on the FAQ page.

VARD is free for academic use. If you would like to download VARD, please make a request on the availability page.


Previous Versions

2.5.3

A minor update with a few bug fixes.

  • Changed the default word pattern for tokenising to include combined letter marks (diacritics).
  • Fixed bug where if text marked as a foreign language contained replaced named entities, the entities were not stored in their original state when saved as XML.
  • Fixed a bug where options.txt was re-saved unnecessarily when VARD was loaded.
  • Slight speed improvement when reading files.

2.5.2

A minor update to solve a few bugs and add some functionality to the automatic training method.

  • New feature: In the automatic training mode you can now select to "train" the dictionary at the same time by adding and removing marked variants, not variants and normalisations. See the user guide for more information.
  • Fixed bug where the variants list wasn't being added to during automatic training.
  • Fixed bug where the interactive interface would freeze when changing the setup. Also fixed other sources of (non-consequential) error messages appearing after certain operations.

2.5.1

A minor update, but one which includes a new method for manually normalising a randomly selected sample of your corpus in order to train VARD. There have also been various minor bug fixes.

  • New feature: A new process for training VARD through manual normalisation is now available through the sample sidebar interface. This allows you to provide a list of texts (or a single text) you wish to normalise and let VARD present randomly selected partitions from the text(s) for you to manually normalise. As normalisations are made, VARD is trained to better normalise the texts you've selected, allowing for more accurate automatic normalisation of the whole texts.
  • Relaxed the requirement that the foreign tag's language attribute must be present. Now, if the language attribute isn't present, the tag isn't treated as foreign, and hence can be ignored through text to ignore.
  • Fixed bug where if a variant started with a named entity (e.g. 'tis), the normalisation would include the named entity regardless of whether it was removed in the normalisation.
  • Fixed bug where if two named entities occur next to each other (e.g. '" ('")) the following highlights are misaligned.
  • Fixed bug where exception messages were shown after certain normalisation sequences.
  • Fixed bug with saving text as RTF.

2.5

A major update which adds several new functions and generally improves the user interface. VARD is entirely free for academic use; you just need to complete a very quick request form to download it. If you have used previous versions of VARD, you will still need to complete this form and I will then send you instructions for downloading the new version.

Around 50% of the code base for VARD has been completely rewritten during this update. Inevitably with such a large amount of new code, some bugs may have been missed. If you spot any bugs, no matter how minor, please let me know so they can be rectified.

I'm always happy to receive feedback (both positive and negative), so if you have any comments on the new version of VARD, or if there are any features you would like adding in future versions, please get in touch.

Some of the many changes in 2.5 are as follows:

  • All of VARD's main functions are now accessible in a single interface, with batch processing, batch training and other new functions available in the sidebar.
  • You can now change the setup VARD is using within the user interface, and without having to restart VARD.
  • The setup options can now be edited in the user interface, rather than just by editing the files directly.
  • The training files are now separate from the setup folder, allowing them to be shared between multiple setups.
  • An annotator id can be set, which, if present, will be added to all of VARD's XML tags.
  • All of VARD's XML tags are now editable, hence you can define how VARD decisions are marked up and how decisions are read into VARD, potentially from other sources than VARD.
  • You can now mark text as a specific language, allowing for text of different languages to be processed differently (via different setups) or not at all.
  • Options are available for overriding VARD's decision making in specific circumstances:
    • Always apply rules: allows specified (regular expression) search and replace operations to be tried on variants. If a valid word is created, it is offered as the top normalisation with almost certain confidence. This is particularly useful for dealing with long-s's (ſ), which are often found in Early Modern English texts in place of standard s's.
    • Not variant overrides: regular expressions which, if they match a word in the text, always mark that word as a non-variant. This is useful, for example, in dealing with 1st, 2nd, 3rd, etc. when digits are included in words.
    • Variant overrides: as above, but all matching words are marked as variants.
  • Improvement to how capitalisation is dealt with. E.g. previously "theEnglish" would be normalised to "the english", now correctly normalised to "the English". If using Normalise to..., custom capitalisation can be used and this will now be maintained in the normalisation.
  • A normalisation cache can be used when batch processing in both the user interface and the command line version; this speeds up processing significantly when processing a large number of files.
  • Two new command line interfaces have been added:
    • Single file: Normalise a single file automatically.
    • STDIN: Initialise VARD and keep running listening for instructions to normalise files via STDIN.
  • Everything is now a lot less 'clunky', especially when dealing with large texts.
  • Minor changes:
    • Rules manager moved to new sidebar.
    • Default word pattern changed.
    • In the command line version, ~/ at the start of a path will convert ~ to the home directory.
    • Added timestamps to all logs.
    • Error log moved to root of VARD.
    • Search subfolders function now definable in batch mode or in command line version rather than in Setup.
  • Bug fixes:
    • Encoding is loaded for text pasted in.
    • Fixed issue with _ used in rule list to represent space.
    • Fixed a bug where if a word contained no letters (a-z/A-Z) a normalisation would default to all uppercase.
    • Fixed copying selected text, which wasn't working on some systems.
  • There are also probably some things that I neglected to note down. If you spot any changes that require explanation, please let me know.
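
As an illustration of the override options listed above (always apply rules and not variant overrides), here is a minimal Python sketch. It is not VARD's own code (VARD itself is written in Java). The dictionary, the rule and override patterns, the order in which the checks run and the function name classify are all assumptions made for the example, and VARD's setup files may use a different syntax.

```python
import re

# Hypothetical stand-ins for a dictionary and for the override settings.
DICTIONARY = {"house", "possess", "sing", "the"}
ALWAYS_APPLY_RULES = [(re.compile("ſ"), "s")]              # long-s to standard s
NOT_VARIANT_OVERRIDES = [re.compile(r"\d+(st|nd|rd|th)")]  # 1st, 2nd, 3rd, ...


def classify(word):
    """Return (status, normalisation) for a single token."""
    # Assumed order: a not-variant override that matches the whole word wins.
    if any(p.fullmatch(word) for p in NOT_VARIANT_OVERRIDES):
        return ("not_variant", None)
    # Words already in the dictionary are not variants.
    if word.lower() in DICTIONARY:
        return ("not_variant", None)
    # Try each always-apply rule; accept the result only if it is a valid word.
    for pattern, replacement in ALWAYS_APPLY_RULES:
        candidate = pattern.sub(replacement, word)
        if candidate != word and candidate.lower() in DICTIONARY:
            return ("variant", candidate)
    # Otherwise the word is a variant with no rule-based normalisation.
    return ("variant", None)


for token in ["houſe", "1st", "poſſeſſ", "xyzzy"]:
    print(token, classify(token))
```

The key point is that a rule's output is only offered when it produces a dictionary word, so a broad rule such as ſ to s cannot introduce non-words by itself.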

2.4.2

A minor update to address three issues:

  • Fixed a bug whereby in rare circumstances the removal of a word from the dictionary would result in extra words being erroneously removed also.
  • A minor change to the ordering of the entries in the saved dictionary (words.txt) and variants list (variants.txt). These will be updated upon saving each in the interactive mode.
  • Fixed a bug that caused VARD to hang when attempting to normalise a variant with a long string of repeated characters.

2.4.1

A minor update to address two issues:

  • Fixed a bug which sometimes prevented VARD from initialising on certain systems.
  • In the interactive mode, a save prompt will no longer appear when the document is empty.

2.4

2.4 is a relatively small update, with mainly new user interface features added. Changes include:

  • Added ability to choose the setup folder (previously "saved data") at startup within the User Interface selection screen. This allows for the easier management of setup folders.
  • Added ability to choose the setup folder in command line version.
  • As part of the above, the logs folder is now stored inside the setup folder.
  • Added ability to sort the types list in interactive mode by the frequency of the type (descending). This should assist in normalising the more common spelling variants first.
  • When a type has been dealt with in "Step Complete" (Type Instances), the next type in the list will be selected if A-Z sorted, or the most frequent type if sorted by frequency.
  • Fixed issue of buttons in "Step Complete" being disabled incorrectly in rare circumstances.
  • Drag and Drop functionality added throughout software.
  • When opening a new file in the interactive mode you will be prompted to save the current file and setup.

2.3

2.3 is the largest update of VARD to date, with vast improvements in performance, especially in terms of training capability, reliability and functionality. Changes made include:

  • The training mechanism has been overhauled; the main improvements are detailed in a paper presented at Corpus Linguistics 2009. In essence, each method now has a predicted recall and precision based on previous normalisation decisions, whereas previously only recall was taken into account. This means that recall increases through training with precision either unaffected or even improved.
  • Since the CL2009 paper was published further corrections and improvements have been made, meaning precision and recall scores are both improved even further (details of the improvements will be published in the near future).
  • The variants list structure has been updated to combine normalisations of the same variant form. Each replacement has a count attached to it, which is incremented each time the replacement is chosen during training (and decremented each time a normalisation is reversed). This allows VARD to take into account ambiguity in the variants list where two or more options are offered. Any ambiguity will be reflected in the predicted precision and recall scores given for each offered replacement: replacements chosen more often will have higher scores, while if two options have similar counts both will have lower precision, which stops variants being normalised when precision priority is high.
  • Due to the above, additions and removals from the variants list are made automatically.
  • Several erroneous entries in the pre-defined variant list have been removed.
  • It is recommended to start again with the default variants list, however, if a variants list is used from a previous version of VARD it will automatically be converted to the new format upon saving.
  • To bring VARD in line with DICER, additional letter replacement rule positions are available. Possible positions are Start, Second, Middle, Penultimate, End and All.
  • The letter replacement and phonetic matching algorithms now return mapped entries from the variants list (with a reduced score).
  • The edit distance algorithm is now normalised based on the length of the variant and proposed replacement. This allows the edit distance method to have its predicted recall and precision adjusted through training like the other normalisation methods.
  • The variants list, words list and rules list are now encoded with UTF-8 as some had problems with the previous UTF-16 encoding. This may mean that lists saved with previous versions of VARD are incompatible and will need converting to UTF-8 with other software. Please get in touch if this is proving problematic.
  • An F-score is used to combine precision and recall scores; the balance between precision and recall can be adjusted in the interactive and batch processing modes. This will alter the score given to replacement suggestions, potentially changing the rankings and whether the top replacement is over the automatic replacement threshold set.
  • A training mode has been added which allows the user to train VARD with previously normalised (VARDed) texts. Each text is processed as if it was being manually processed in the interactive mode, with weights and the variants list being updated accordingly. This means that only the XML VARD output is needed to train VARD for batch processing.
  • Automatically normalised words are marked in xml tags as such. When training from VARD xml output automatic normalisations are not counted in training.
  • The XML tags have been simplified, four tags exist: normalised, variant, notvariant and join. The normalised tag contains the original form as an attribute and whether or not the normalisation is automatic, variant and notvariant have no attributes and join has the original text before the join as an attribute. VARD 2.3 will read xml output from older versions of VARD which have different tags but always output with the new tags.
  • A step processing facility has been added to the interactive mode. The main reason for this is to allow the user to go through variant instances in the text and decide what to do with each in turn. The user can also cycle through normalised instances and non-variant instances in the text. Any word can also be selected in the types list and the user can go through instances of this word - this will be useful when different decisions are required based on the context of the instance.
  • The ability to define what characters make up a word in terms of tokenization has been added. This can be defined through a regular expression in the options file.
  • For VARD 2.2 a function was added to detect the encoding of texts being normalised. VARD 2.3 still has this functionality, and will normally guess the encoding correctly. However, sometimes the wrong encoding is applied and this will result in characters being displayed and processed incorrectly. To alleviate this problem, the correct encoding can now be enforced in the options file.
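
To illustrate how an adjustable F-score can change the ranking of replacement suggestions, as described in the list above, here is a minimal Python sketch. It assumes the standard weighted F-measure, F = (1 + β²)·P·R / (β²·P + R), where β below 1 favours precision and β above 1 favours recall. The changelog does not state the exact formula VARD uses, so this weighting, the function names and the candidate scores are all assumptions made for the example.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted F-measure; beta < 1 favours precision, beta > 1 favours recall."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def rank(candidates, beta=1.0):
    """Sort (replacement, precision, recall) tuples by F-score, best first."""
    return sorted(candidates, key=lambda c: f_score(c[1], c[2], beta), reverse=True)


# Hypothetical predicted scores for two competing replacements.
candidates = [("be", 0.90, 0.40), ("bee", 0.55, 0.95)]

print(rank(candidates, beta=0.5)[0][0])  # precision-weighted ranking
print(rank(candidates, beta=2.0)[0][0])  # recall-weighted ranking
```

With these made-up scores the precision-weighted setting ranks "be" first and the recall-weighted setting ranks "bee" first, which is the kind of ranking change that can also move the top suggestion above or below an automatic replacement threshold.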

2.2

2.2 is a major upgrade to VARD, offering many improvements and new features, including:

  • An extended dictionary based upon the Spell Checker Oriented Word Lists (SCOWL) has been added. The dictionary no longer takes words from the known variants list replacements as some of these were erroneous.
  • Greatly improved XML provision. VARD now uses regular expressions to deal with all xml tags, which means that 100% valid xml files are no longer required (broken xml tags will cause problems however - this is virtually unavoidable). When exporting, xml tags are only added to reflect changes made by the system or the user.
    • Variant tags are now only added if the user indicates a word should be marked as a variant in the interactive version, this is to allow VARD to use the current dictionary to decide if a word should be a variant.
    • Correct tags are now added for words which the user indicates as being modern forms (not variants).
  • The uncommon words group has been removed as many words were being placed in this group incorrectly. In future versions of VARD it is planned to use contextual information to detect real-word errors.
  • A command line interface has been added to allow VARD to be potentially run on a network machine or as part of a sequence of tools in a script.
  • VARD now requires Java 6; please ensure you have the latest version of Java running on your machine before running VARD.
  • Many bugs have been fixed, including the four on the bugs page.
    • Text produced in any version of the software (interactive, batch or command line) can be opened and processed with any other version. VARD xml tags will be parsed appropriately and in the same fashion in each version.
    • Pasted text (with or without xml tags) will be processed in exactly the same way as if the text was opened in a fresh file, except the text is appended to the end of the currently displayed text.
    • As previously stated, xml is dealt with much more successfully.
    • The delay when switching between word lists when dealing with large texts in the interactive version has been greatly reduced.
    • Various problems when undoing and redoing edits in the interactive version have been cleaned up.
    • Previously, when processing large corpora, system memory was being depleted due to poor "garbage collection" of previously processed text. This issue has now been resolved.
    • A problem occurred occasionally when the user clicked on the text in the interactive version whilst the words were being evaluated. This has now been resolved.
  • The stats file produced during batch processing (and now in the command line interface) now includes tokens as well as types.
  • Text within XML tags (<...>) (except VARD's tags), square brackets ([...]) and curly brackets ({...}) are now ignored (not processed) by VARD and coloured grey in the interactive version. This behaviour can be edited by the user in "saved data/text_to_ignore.txt". Regular expressions are used to declare which portions of text to ignore, hence a user can state that text between certain tags should be ignored, such as headers.
  • HTML/XML/Unicode entities are now dealt with much better. Unicode entities in the form &(#x)1234; are now converted into their equivalent Java Unicode characters, so they are processed by VARD like any other character. They are converted back into their original format upon saving. XML/HTML special entities such as &amp; and &quot; are converted to their equivalent characters. This behaviour can be edited in "saved data/entities.txt". Regular expressions are used to declare what character sequences should be detected, and a replacement given.
  • A user can now search through the instances of a selected word using the previous/next instance buttons located under the word list in the interactive version of the tool.
  • When joining words in the interactive version, the newly created word is evaluated like any other word; if the new word forms a dictionary entry it will be marked as a modern form, otherwise it will be marked as a variant.
  • New icons have been added in the interactive version courtesy of famfamfam.
  • VARD now runs more efficiently, improving the overall speed when processing texts.
  • Various other minor improvements and bug fixes.
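
The "text to ignore" behaviour described in the list above can be sketched with a few regular expressions. This Python example is only an illustration: the patterns, the word pattern and the function names are assumptions, the exception for VARD's own tags is left out, and the actual syntax of text_to_ignore.txt may differ.

```python
import re

# Assumed default patterns: XML tags, square brackets and curly brackets.
IGNORE_PATTERNS = [
    re.compile(r"<[^>]*>"),
    re.compile(r"\[[^\]]*\]"),
    re.compile(r"\{[^}]*\}"),
]
WORD = re.compile(r"[A-Za-z']+")  # hypothetical word pattern for tokenising


def ignored_spans(text, patterns):
    """Collect (start, end) offsets of every stretch of text to ignore."""
    spans = []
    for pattern in patterns:
        spans.extend(m.span() for m in pattern.finditer(text))
    return sorted(spans)


def words_to_process(text, patterns=IGNORE_PATTERNS):
    """Return the words that fall outside all ignored spans."""
    spans = ignored_spans(text, patterns)
    return [
        m.group()
        for m in WORD.finditer(text)
        if not any(start <= m.start() and m.end() <= end for start, end in spans)
    ]


text = "<head>Title</head> Goe [sic] to {note} bedde"
print(words_to_process(text))

# A user-added pattern can ignore everything between certain tags, e.g. headers.
header = re.compile(r"<head>.*?</head>", re.DOTALL)
print(words_to_process(text, IGNORE_PATTERNS + [header]))
```

With the default patterns only the tags and bracketed text are skipped, so "Title" is still processed. Adding the header pattern removes it as well.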

2.1.2

A small update to fix a couple of bugs. Firstly, when the Batch Version outputs texts, the folder structure of the original files is retained. A bug has also been fixed where manually editing the output folder would cause a system error. Secondly, a bug where closing xml tags were being processed by VARD has been fixed. Reading existing xml files can still cause problems; it is hoped that this will be rectified in version 2.1.3.

2.1.1

A minor update to allow the software to be used on other platforms. Use "vard-windows-run.bat" on Windows and "vard-run.command" on other platforms. Java 1.5.0 should also now be sufficient to run VARD 2.

2.1.0

The first version publicly released (on this website) in order to gain user feedback for the evaluation section of my PhD.

This version has greatly improved processing speed and stability.

2.0.0 - 2.0.9

A series of non-publicly released versions produced during an undergraduate project and the first year of my PhD.

1

The original version of VARD was non-interactive and only used a known variants list to search for and replace variants found within a text.
