vardwrapper
Class VARD

java.lang.Object
  extended by vardwrapper.VARD

public class VARD
extends java.lang.Object

The 'interface' to VARD; allows single words and whole files to be normalised.


Constructor Summary
VARD(java.io.File setupFolder, double threshold, double fWeight, boolean useCache)
          Instantiates a new VARD session.
 
Method Summary
 java.util.List<Suggestion> getNormalisationSuggestions(java.lang.String variant, int limit)
          Gets a list of normalisation suggestions ranked by confidence score (descending) for the given variant.
 double getThreshold()
          Gets the current normalisation threshold.
 boolean isUseCache()
          Checks whether the normalisation cache is currently being used.
 boolean isVariant(java.lang.String word)
          Checks if a word is a variant, i.e. if it doesn't appear in VARD's dictionary.
 Normalisation normalise(java.lang.String word)
          Attempts a normalisation of the given word.
 NormalisationStats normaliseFile(java.io.File original, java.io.File xmlOut, java.io.File plainOut)
          Normalises a whole text in a file, using the full setup as if it were normalised in Batch mode of VARD.
 java.lang.String normaliseToString(java.lang.String word)
          Attempts to normalise the given word and returns either the normalisation made, or just the word if no normalisation was possible (no suggestions above threshold) or if the word wasn't a variant.
 void resetCache()
          Empties the current normalisation cache.
 void setThreshold(double threshold)
          Sets the current threshold to be used during normalisation.
 void setUseCache(boolean useCache)
          Sets whether or not to use the normalisation cache.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

VARD

public VARD(java.io.File setupFolder,
            double threshold,
            double fWeight,
            boolean useCache)
     throws java.io.IOException
Instantiates a new VARD session.

Parameters:
setupFolder - The folder used to initialise VARD; this is the setup folder selected when opening VARD. If the folder isn't present, a new folder will be created with default settings (not really recommended, as some training should be completed before automatic normalisation).
threshold - The normalisation threshold for use throughout processing (whether by file or word).
fWeight - The fWeight to use when calculating confidence scores.
useCache - Whether or not to use caching for normalisation. If true, each normalisation made will be stored; if the word to normalise is seen again, the same Normalisation instance is returned (for normaliseToString, the same word, whether normalisation or original form, will be returned).
Throws:
java.io.IOException - if an error occurs while loading the setup from the folder.
Method Detail

isVariant

public boolean isVariant(java.lang.String word)
Checks if a word is a variant, i.e. if it doesn't appear in VARD's dictionary.

Parameters:
word - the word to assess.
Returns:
true if the word is a variant (not in the dictionary), or false if the word is found in VARD's dictionary.

getNormalisationSuggestions

public java.util.List<Suggestion> getNormalisationSuggestions(java.lang.String variant,
                                                              int limit)
Gets a list of normalisation suggestions ranked by confidence score (descending) for the given variant. Will return a list regardless of whether the word is a variant (via isVariant) or not.

Parameters:
variant - the variant to find normalisation suggestions for.
limit - the maximum number of suggestions to find. If the total number of suggestions is less than the limit, all suggestions will be returned. If limit is 0, all suggestions will be found (can be a long list, especially for short words).
Returns:
a list of Suggestions ranked by confidence score (descending), each holding the normalisation word and its confidence score.
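
The limit rule described for getNormalisationSuggestions can be illustrated with a small self-contained sketch. This is not VARD's code: the class SuggestionLimitSketch, the Scored type and the rankAndLimit method are hypothetical stand-ins, and the confidence values are arbitrary. Rejecting negative limits is also an assumption, since the documentation only covers 0 and positive values.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SuggestionLimitSketch {

    /** Hypothetical stand-in for a suggestion: a candidate word and its confidence score. */
    static final class Scored {
        final String word;
        final double confidence;

        Scored(String word, double confidence) {
            this.word = word;
            this.confidence = confidence;
        }
    }

    /** Ranks candidates by confidence (descending) and applies the documented limit rule. */
    static List<Scored> rankAndLimit(List<Scored> candidates, int limit) {
        if (limit < 0) {
            // Assumption: negative limits are not documented, so this sketch rejects them.
            throw new IllegalArgumentException("limit must be >= 0");
        }
        List<Scored> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingDouble((Scored s) -> s.confidence).reversed());
        // A limit of 0 means "all suggestions"; so does a limit larger than the list.
        if (limit == 0 || limit >= ranked.size()) {
            return ranked;
        }
        return new ArrayList<>(ranked.subList(0, limit));
    }
}
```

With three candidates, a limit of 2 keeps the two highest-scoring entries, while a limit of 0 or 10 returns all three.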

normalise

public Normalisation normalise(java.lang.String word)
Attempts a normalisation of the given word. The word is first checked with isVariant. If it is a variant, the top normalisation is found (via getNormalisationSuggestions(word,1)). If the top normalisation's confidence score is above the threshold, Normalisation.isNormalised() will return true.

Parameters:
word - the word to be normalised.
Returns:
one of four results, depending on the word:
if(!variant): Normalisation(false,false,word,null,0.0)
if(variant & suggestion list empty): Normalisation(true,false,word,null,0.0)
if(variant & top_confidence <= threshold): Normalisation(true,false,word,top_normalisation,top_confidence)
if(variant & top_confidence > threshold): Normalisation(true,true,word,top_normalisation,top_confidence)
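
The four return cases can be read as a single decision procedure. The sketch below models that procedure in isolation and is not the library's implementation: NormaliseSketch, Candidate, Result and decide are hypothetical names, the dictionary is a plain Set, the ranked suggestion list is passed in by the caller, and the confidence scale used in any call is an assumption.

```java
import java.util.List;
import java.util.Set;

public class NormaliseSketch {

    /** Hypothetical stand-in for a suggestion: a candidate word and its confidence score. */
    static final class Candidate {
        final String word;
        final double confidence;

        Candidate(String word, double confidence) {
            this.word = word;
            this.confidence = confidence;
        }
    }

    /** Hypothetical stand-in mirroring the five values listed in the return cases. */
    static final class Result {
        final boolean variant;
        final boolean normalised;
        final String original;
        final String normalisation;
        final double confidence;

        Result(boolean variant, boolean normalised, String original,
               String normalisation, double confidence) {
            this.variant = variant;
            this.normalised = normalised;
            this.original = original;
            this.normalisation = normalisation;
            this.confidence = confidence;
        }
    }

    /** Applies the four documented cases; 'ranked' must already be sorted by confidence, descending. */
    static Result decide(String word, Set<String> dictionary,
                         List<Candidate> ranked, double threshold) {
        if (dictionary.contains(word)) {
            return new Result(false, false, word, null, 0.0); // not a variant
        }
        if (ranked.isEmpty()) {
            return new Result(true, false, word, null, 0.0); // variant, no suggestions
        }
        Candidate top = ranked.get(0);
        // Strictly above the threshold counts as normalised; equal to it does not.
        boolean normalised = top.confidence > threshold;
        return new Result(true, normalised, word, top.word, top.confidence);
    }
}
```

Note that in the third case the top suggestion and its confidence are still reported even though the word is not marked as normalised.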

normaliseToString

public java.lang.String normaliseToString(java.lang.String word)
Attempts to normalise the given word and returns either the normalisation made, or just the word if no normalisation was possible (no suggestions above threshold) or if the word wasn't a variant.

Parameters:
word - The word to attempt to normalise.
Returns:
the normalised word if a normalisation was possible, just the original word if not.

normaliseFile

public NormalisationStats normaliseFile(java.io.File original,
                                        java.io.File xmlOut,
                                        java.io.File plainOut)
                                 throws VARDException
Normalises a whole text in a file, using the full setup as if it were normalised in Batch mode of VARD. All non-null File parameters must be created (createNewFile()) prior to calling this method.

Parameters:
original - the original file to be normalised
xmlOut - the output file for the XML tagged normalised file. If null, the XML version will not be created.
plainOut - the output file for the normalised version without XML tags. If null, the plain version will not be created.
Returns:
the stats (as produced by batch mode of VARD) for the normalised file.
Throws:
VARDException - if any exceptions are thrown by VARD during normalisation or saving outputs.
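
Because the File arguments must already exist on disk, a caller may want a small helper that creates them first. The following is a minimal sketch using only java.io.File; the class OutputFileSketch and the method ensureCreated are hypothetical and not part of vardwrapper.

```java
import java.io.File;
import java.io.IOException;

public class OutputFileSketch {

    /** Ensures the given file exists on disk, creating parent folders and an empty file if needed. */
    static File ensureCreated(File file) throws IOException {
        File parent = file.getAbsoluteFile().getParentFile();
        if (parent != null && !parent.isDirectory() && !parent.mkdirs()) {
            throw new IOException("Could not create directory " + parent);
        }
        // createNewFile() returns false when the file already exists, which is fine here.
        file.createNewFile();
        if (!file.isFile()) {
            throw new IOException("Not a regular file: " + file);
        }
        return file;
    }
}
```

The files returned by such a helper could then be passed as xmlOut and plainOut, with null passed for whichever output is not wanted.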

setThreshold

public void setThreshold(double threshold)
Sets the current threshold to be used during normalisation. resetCache() is also called to ensure that correct normalisations are made based on the new threshold.

Parameters:
threshold - The normalisation threshold to change to.
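
The reason for clearing the cache when the threshold changes can be shown with a small model of a cached decision. This is a sketch under stated assumptions, not VARD's implementation: ThresholdCacheSketch and Decision are hypothetical, each word is given a fixed confidence through a map, and the confidence scale is arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

public class ThresholdCacheSketch {

    /** Hypothetical cached outcome: the word and whether it was normalised. */
    static final class Decision {
        final String word;
        final boolean normalised;

        Decision(String word, boolean normalised) {
            this.word = word;
            this.normalised = normalised;
        }
    }

    private final Map<String, Decision> cache = new HashMap<>();
    private final Map<String, Double> confidences; // assumed fixed top confidence per word
    private double threshold;
    private final boolean useCache;

    ThresholdCacheSketch(Map<String, Double> confidences, double threshold, boolean useCache) {
        this.confidences = confidences;
        this.threshold = threshold;
        this.useCache = useCache;
    }

    /** Returns the cached decision if present, otherwise computes and (optionally) stores one. */
    Decision decide(String word) {
        if (useCache) {
            Decision cached = cache.get(word);
            if (cached != null) {
                return cached; // same instance as the first time the word was seen
            }
        }
        double confidence = confidences.getOrDefault(word, 0.0);
        Decision decision = new Decision(word, confidence > threshold);
        if (useCache) {
            cache.put(word, decision);
        }
        return decision;
    }

    /** Changing the threshold invalidates earlier decisions, so the cache is emptied. */
    void setThreshold(double threshold) {
        this.threshold = threshold;
        resetCache();
    }

    void resetCache() {
        cache.clear();
    }
}
```

Without the reset, a word first seen under a lower threshold would keep returning its old decision after the threshold was raised.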

getThreshold

public double getThreshold()
Returns:
The current normalisation threshold.

setUseCache

public void setUseCache(boolean useCache)
Sets whether or not to use the normalisation cache.

Parameters:
useCache - true to use the normalisation cache, false otherwise.

isUseCache

public boolean isUseCache()
Returns:
Whether or not the normalisation cache is currently being used.

resetCache

public void resetCache()
Empties the current normalisation cache.