grace.util
Class Tokenizer

java.lang.Object
  |
  +--grace.util.Tokenizer

public class Tokenizer
extends java.lang.Object

Performs data scanning, gathering, and conversion functions on text. This class provides functions to read various canned object types such as Strings, Dates, and integers, as well as user-defined object types. Most of the functionality uses regular expressions to locate and delimit the text.

Objects of this class are stateful in that they maintain a current position. As data is parsed, this current position is moved forwards or backwards. Operations that get data from the source typically set the current position in the source at the end of the data returned.
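
The position bookkeeping described above can be pictured with a small stand-in. This is only a sketch under stated assumptions: CursorSketch and its next() method are hypothetical names, and java.util.regex is used in place of gnu.regexp so that the example is self-contained. It is not the Tokenizer implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in: a source string plus a current position. */
final class CursorSketch {
    private final String source;
    private int position = 0; // index of the current position in source

    CursorSketch(String source) {
        this.source = source;
    }

    /** Returns the first match at or after the current position and moves past it. */
    String next(String regex) {
        Matcher m = Pattern.compile(regex).matcher(source);
        if (!m.find(position)) {
            return null; // no match: the position is left unchanged
        }
        position = m.end(); // the position now sits just after the returned text
        return m.group();
    }

    int getPosition() {
        return position;
    }

    public static void main(String[] args) {
        CursorSketch cursor = new CursorSketch("some text\nmore text");
        System.out.println(cursor.next("\\w+") + " @ " + cursor.getPosition()); // prints "some @ 4"
        System.out.println(cursor.next("\\w+") + " @ " + cursor.getPosition()); // prints "text @ 9"
    }
}
```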

Synopsis:

   Tokenizer tokenizer = new Tokenizer("some text\nmore text\nnumCards=52");
   int numCards = tokenizer.getInt("numCards=(\\d+)", "$1");
 

Notes:

If this class is used for scraping screens, the screen text should contain any newlines that are meaningful to the format. In other words, newlines should not be stripped out. This makes the screen scraping code more maintainable because it keeps the parsing code independent of the length of the lines in the text: if the text contains newlines and the line length changes someday, the parsing code should not need to change.


Constructor Summary
Tokenizer(java.lang.String source)
           
 
Method Summary
 void advance(int numCharacters)
          Move the current position forward the given number of characters.
 void advance(gnu.regexp.RE expression)
          Advances the current position in the source to the start of the match of the given expression.
 void advance(gnu.regexp.RE expression, int positionAtSubExpressionNumber)
          Advances the current position in the source to the start of the match of the given numbered subexpression.
 void advance(java.lang.String regularExpression)
          Advances the current position in the source to the start of the match of the given expression.
 void advance(java.lang.String expression, int positionAtSubExpressionNumber)
          Advances the current position in the source to the start of the match of the given numbered subexpression.
 java.lang.Object clone()
          The copy allows the caller to capture the current state of this Tokenizer such that this Tokenizer can continue parsing but not affect the copy.
 java.lang.String find(gnu.regexp.RE regularExpression)
          Returns the first string in the input that matches the given regular expression or null if there are none.
 java.lang.String find(java.lang.String regularExpression)
          Returns the first string in the input that matches the given regular expression or null if there are none.
 java.lang.String findAndSubstitute(gnu.regexp.RE regularExpression, java.lang.String substituteString)
          Returns the result of substituting the result of the first match of the given regularExpression into the given substituteString.
 java.lang.String findAndSubstitute(java.lang.String regularExpression, java.lang.String substituteString)
          Returns the result of substituting the result of the first match of the given regularExpression into the given substituteString.
 java.lang.String get(gnu.regexp.RE expression)
          Returns the text that matches the given regularExpression that is assumed to start at the current position in the input source or returns null if at the end of the source.
 java.lang.String get(gnu.regexp.RE expression, int maxOffset)
          Returns the next match not more than maxOffset characters from the current position in the input stream or null if there are no more tokens.
 java.lang.String get(java.lang.String regularExpression)
          Returns the text that matches the given regularExpression that is assumed to start at the current position in the input source or returns null if at the end of the source.
 java.lang.String get(java.lang.String regularExpression, int maxOffset)
          Returns the next match not more than maxOffset characters from the current position in the input stream or null if there are no more tokens.
 java.util.Date getDate(java.text.DateFormat format)
          Parses and returns the next token as a Date parsed by the given format.
 java.util.Date getDate(java.lang.String simpleDateFormat)
          Parses and returns the next token as a Date parsed by a SimpleDateFormat object created with the given simpleDateFormat string.
 int getInt()
          Parses and returns the next token (skipping white space) as an integer.
 int getInt(int maxNumDigits)
          Parses and returns the next token (skipping white space) as an integer of the given maximum number of digits.
 int getInt(gnu.regexp.RE expression, java.lang.String substitution)
          Matches the given regular expression, substitutes the match into the given substitution string, parses the result as an integer, and returns the result.
 int getInt(java.lang.String expression, java.lang.String substitution)
          Matches the given regular expression, substitutes the match into the given substitution string, parses the result as an integer, and returns the result.
 java.lang.String getLine()
          Returns a line of text (delimited by newline) from input without the newline character in the result.
protected  java.lang.String getMatch(gnu.regexp.RE expression)
          Utility function that returns the first match in the source of the given regular expression and sets the current position to the end of the first match.
 int getNextInt()
          Parses and returns the next integer in the input (skipping non decimal digits and white space).
 int getPosition()
          Return the index of current position in the source.
 java.util.Date getPrefixedDate(gnu.regexp.RE regularExpression, java.text.DateFormat format)
          Parses and returns the token, after the given regular expression matches, as a Date parsed by the format object.
 java.util.Date getPrefixedDate(gnu.regexp.RE regularExpression, java.lang.String simpleDateFormat)
          Parses and returns the token, after the given regular expression matches, as a Date parsed by a SimpleDateFormat object created with the given simpleDateFormat string.
 java.util.Date getPrefixedDate(java.lang.String regularExpression, java.text.DateFormat format)
          Parses and returns the token, after the given regular expression matches, as a Date parsed by the format object.
 java.util.Date getPrefixedDate(java.lang.String regularExpression, java.lang.String simpleDateFormat)
          Parses and returns the token, after the given regular expression matches, as a Date parsed by a SimpleDateFormat object created with the given simpleDateFormat string.
 int getPrefixedInt(java.lang.String tagRegularExpression)
          Parses and returns the integer token after the matching regular expression.
 java.lang.String getSource()
          Returns the entire source.
protected  java.lang.String getSubstitutedMatch(gnu.regexp.RE expression, java.lang.String substitutionString)
          Utility function to find the first match of the given expression in the source, substitute the found match into the given substitution string, and return the result.
 java.util.Date getTime(java.text.DateFormat timeFormat, java.util.Date date)
          Parses and returns the next token as a Date parsed by the given time format but using the year, month, and date of the given date.
 java.util.Date getTime(java.lang.String timeFormat, java.util.Date date)
          Parses and returns the next token as a Date parsed by the given time format but using the year, month, and date of the given date.
 java.lang.String getToken()
          Returns the next white space delimited token in the input stream or null if there are no more tokens.
 java.lang.String getToken(int maxNumWhiteSpaceChars)
          Returns the next white space delimited token not more than maxNumWhiteSpaceChars characters from the current position in the input stream or null if there are no more tokens.
 void injectNewlines(int lineLength)
          Useful if the source text should contain newlines but doesn't.
 boolean isAt(gnu.regexp.RE expression)
          Indicates whether the given regular expression matches at the current position.
 boolean isAt(java.lang.String regularExpression)
          Indicates whether the given regular expression matches at the current position.
static void main(java.lang.String[] args)
          Test program not quite completed.
 void printTo(PrintWriter writer)
          Used by grace.io.PrintWriter to nicely print this.
 void retreat(int numCharacters)
          Move the current position backward the given number of characters.
 void retreat(gnu.regexp.RE expression)
          Retreats the current position in the source to the start of the given expression.
 void retreat(java.lang.String regularExpression)
          Retreats the current position in the source to the start of the given expression.
 void setPosition(int absolute)
          Set the index of current position in the source.
 void skipWhiteSpace()
          Moves the current position to the next character in the source that is not white space as determined by java.lang.Character.isWhitespace().
protected  java.lang.String toPrintable(java.lang.String notPrintable)
          Utility function that converts a string with embedded non-printable characters, such as newlines and tabs, into a string that may be cleanly printed.
 java.lang.String toString()
           
 void undoLast()
          This moves the current position to the position before the previous function was called.
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

Tokenizer

public Tokenizer(java.lang.String source)
Method Detail

injectNewlines

public void injectNewlines(int lineLength)
Useful if the source text should contain newlines but doesn't. This probably means that, at one time, the source text had newlines, but the newlines have been stripped out. Injecting newlines makes the source text easier to parse maintainably because the newlines restore contextual information that was meant for human visualization of the data.
Parameters:
lineLength - periodic position in source at which newlines should be injected
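
A minimal sketch of the idea, assuming a newline is inserted after every lineLength characters and none is appended after the final partial line (the documentation does not say how the end of the text is treated). InjectNewlinesSketch is a hypothetical name and is not part of grace.util.

```java
/** Hypothetical illustration of re-inserting newlines at a fixed line length. */
final class InjectNewlinesSketch {
    /** Returns text with a newline inserted after every lineLength characters. */
    static String inject(String text, int lineLength) {
        if (lineLength <= 0) {
            throw new IllegalArgumentException("lineLength must be positive");
        }
        StringBuilder out = new StringBuilder();
        for (int start = 0; start < text.length(); start += lineLength) {
            int end = Math.min(start + lineLength, text.length());
            out.append(text, start, end);
            if (end < text.length()) {
                out.append('\n'); // assumption: no newline after the last chunk
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // An 8-character "screen" that was originally 3 columns wide
        System.out.println(inject("ABCDEFGH", 3));
    }
}
```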

clone

public java.lang.Object clone()
The copy allows the caller to capture the current state of this Tokenizer such that this Tokenizer can continue parsing but not affect the copy. For example, this is useful to capture anchors in input or perform different or parallel parsing operations.
Overrides:
clone in class java.lang.Object

skipWhiteSpace

public void skipWhiteSpace()
Moves the current position to the next character in the source that is not white space as determined by java.lang.Character.isWhitespace().

getMatch

protected java.lang.String getMatch(gnu.regexp.RE expression)
Utility function that returns the first match in the source of the given regular expression and sets the current position to the end of the first match.
Parameters:
expression - to match, return, and position after
Returns:
first string matching expression in source

getSubstitutedMatch

protected java.lang.String getSubstitutedMatch(gnu.regexp.RE expression,
                                               java.lang.String substitutionString)
Utility function to find the first match of the given expression in the source, substitute the found match into the given substitution string, and return the result. The substitution string may contain plain text as well as the symbols $0-$9 as dictated by the GNU regular expression library. $0 refers to the entire match and $1 through $9 refer to the first through the ninth matched sub expression.

The current position is placed after the last matched character, not after the end of the sub expression used in the substitution, if any exists.

Parameters:
expression - to match
substitutionString - into which the matched expression in source is substituted.
Returns:
result of substituting the matched string found in source into the given substitutionString.

advance

public void advance(int numCharacters)
Move the current position forward the given number of characters.
Parameters:
numCharacters - to advance in the source

advance

public void advance(java.lang.String regularExpression)
             throws gnu.regexp.REException
Advances the current position in the source to the start of the match of the given expression.
Parameters:
regularExpression - to position at start of match
Throws:
gnu.regexp.REException - if the given regularExpression is invalid

advance

public void advance(gnu.regexp.RE expression)
Advances the current position in the source to the start of the match of the given expression.
Parameters:
expression - to position at start of match

advance

public void advance(java.lang.String expression,
                    int positionAtSubExpressionNumber)
             throws gnu.regexp.REException
Advances the current position in the source to the start of the match of the given numbered subexpression. If the given expression contains no subexpressions, the current position is placed at the start of the matched expression.
Parameters:
expression - to match and position at sub expression
positionAtSubExpressionNumber - number of the sub expression in expression at which to place the current position.

advance

public void advance(gnu.regexp.RE expression,
                    int positionAtSubExpressionNumber)
Advances the current position in the source to the start of the match of the given numbered subexpression. If the given expression contains no subexpressions, the current position is placed at the start of the matched expression.
Parameters:
expression - to match and position at sub expression
positionAtSubExpressionNumber - number of the sub expression in expression at which to place the current position.
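
The subexpression positioning can be sketched as follows. The example uses java.util.regex capture groups as a stand-in for gnu.regexp subexpressions, and AdvanceSketch and startOfGroup are hypothetical names; the fallback to the start of the whole match mirrors the behavior documented above for expressions without subexpressions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of positioning at the start of a numbered subexpression. */
final class AdvanceSketch {
    /**
     * Returns the index at which the numbered group of the first match at or after
     * from starts, or -1 if the expression does not match at all.
     */
    static int startOfGroup(String source, int from, String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(source);
        if (!m.find(from)) {
            return -1;
        }
        // Fall back to the start of the whole match when there is no such group
        if (group > m.groupCount() || m.start(group) < 0) {
            return m.start();
        }
        return m.start(group);
    }

    public static void main(String[] args) {
        String source = "total: 42 items";
        System.out.println(startOfGroup(source, 0, "total: (\\d+)", 1)); // prints 7
        System.out.println(startOfGroup(source, 0, "total: (\\d+)", 0)); // prints 0
    }
}
```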

find

public java.lang.String find(java.lang.String regularExpression)
                      throws gnu.regexp.REException
Returns the first string in the input that matches the given regular expression or null if there are none.
Parameters:
regularExpression - to find in the input
Returns:
first string in input that matches the regularExpression or null if there are no more tokens.
Throws:
gnu.regexp.REException - if regularExpression has syntax errors

find

public java.lang.String find(gnu.regexp.RE regularExpression)
Returns the first string in the input that matches the given regular expression or null if there are none.
Parameters:
regularExpression - to find in the input
Returns:
first string in input that matches the regularExpression or null if there are no more tokens.

findAndSubstitute

public java.lang.String findAndSubstitute(java.lang.String regularExpression,
                                          java.lang.String substituteString)
                                   throws gnu.regexp.REException
Returns the result of substituting the result of the first match of the given regularExpression into the given substituteString. Returns null if no match is made. The current position is set to next character after the end of found string in source.

This function uses the gnu.regexp.REMatch.substituteInto() function. Therefore, the substitute string can contain plain text or the special symbols $0-$9. $0 represents the entire matched string and $1-$9 represent the first through the ninth matched sub expression respectively.

Parameters:
regularExpression - to match one time
substituteString - into which match is substituted
Returns:
result of substitution of first match in input into the substituteString or null if no match is made.
Throws:
gnu.regexp.REException - if regularExpression has syntax errors
See Also:
gnu.regexp.RE.getMatch(Object), gnu.regexp.REMatch.substituteInto(String)
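
To illustrate the $0-$9 expansion, here is a hedged sketch that performs the substitution by hand on top of java.util.regex. It does not call gnu.regexp.REMatch.substituteInto(); SubstituteSketch is a hypothetical name, and the treatment of references to missing groups (expanded to nothing) is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of substituting the first match into a template. */
final class SubstituteSketch {
    /** Expands $0-$9 in template using the first match of regex in source, or returns null. */
    static String findAndSubstitute(String source, String regex, String template) {
        Matcher m = Pattern.compile(regex).matcher(source);
        if (!m.find()) {
            return null; // no match is made
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < template.length(); i++) {
            char c = template.charAt(i);
            boolean isReference = c == '$' && i + 1 < template.length()
                    && template.charAt(i + 1) >= '0' && template.charAt(i + 1) <= '9';
            if (!isReference) {
                out.append(c); // plain text is copied through unchanged
                continue;
            }
            int group = template.charAt(++i) - '0';
            // Assumption: a reference to a missing or unmatched group expands to nothing
            if (group <= m.groupCount() && m.group(group) != null) {
                out.append(m.group(group));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(findAndSubstitute("name=Ada age=36", "(\\w+)=(\\w+)", "$2 is the $1 ($0)"));
    }
}
```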

findAndSubstitute

public java.lang.String findAndSubstitute(gnu.regexp.RE regularExpression,
                                          java.lang.String substituteString)
Returns the result of substituting the result of the first match of the given regularExpression into the given substituteString. Returns null if no match is made. The current position is set to next character after the end of found string in source.

This function uses the gnu.regexp.REMatch.substituteInto() function. Therefore, the substitute string can contain plain text or the special symbols $0-$9. $0 represents the entire matched string and $1-$9 represent the first through the ninth matched sub expression respectively.

Parameters:
regularExpression - to match one time
substituteString - into which match is substituted
Returns:
result of substitution of first match in input into the substituteString or returns null if no match is made.
See Also:
gnu.regexp.RE.getMatch(Object), gnu.regexp.REMatch.substituteInto(String)

retreat

public void retreat(int numCharacters)
Move the current position backward the given number of characters.
Parameters:
numCharacters - to retreat in the source

retreat

public void retreat(java.lang.String regularExpression)
             throws gnu.regexp.REException
Retreats the current position in the source to the start of the given expression.
Parameters:
regularExpression - to position at start of match
Throws:
gnu.regexp.REException - if the given regularExpression is invalid

retreat

public void retreat(gnu.regexp.RE expression)
Retreats the current position in the source to the start of the given expression.
Parameters:
expression - to position at start of match
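
One way to picture a backward search is sketched below. It assumes that retreating means moving to the start of the nearest match that lies entirely before the current position; the documentation does not spell this out, so treat that as an assumption. RetreatSketch is a hypothetical name and java.util.regex stands in for gnu.regexp.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of searching backward from a position. */
final class RetreatSketch {
    /** Returns the start of the last match that ends at or before position, or -1 if none. */
    static int retreatTo(String source, int position, String regex) {
        Matcher m = Pattern.compile(regex).matcher(source);
        m.region(0, position); // only consider text before the current position
        int lastStart = -1;
        while (m.find()) {
            lastStart = m.start(); // keep the match closest to position
        }
        return lastStart;
    }

    public static void main(String[] args) {
        // Position 8 is the 'c'; the nearest earlier "x=" tag is "b=" at index 4
        System.out.println(retreatTo("a=1 b=2 c=3", 8, "\\w="));
    }
}
```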

get

public java.lang.String get(java.lang.String regularExpression)
                     throws gnu.regexp.REException
Returns the text that matches the given regularExpression that is assumed to start at the current position in the input source or returns null if at the end of the source. The current position is set to next character after the end of found string in source.
Parameters:
regularExpression - to match starting at current position
Returns:
immediate match of regularExpression or null if end of source

get

public java.lang.String get(gnu.regexp.RE expression)
Returns the text that matches the given regularExpression that is assumed to start at the current position in the input source or returns null if at the end of the source. The current position is set to next character after the end of found string in source.
Parameters:
expression - to match starting at current position
Returns:
immediate match of regularExpression or null if end of source

get

public java.lang.String get(java.lang.String regularExpression,
                            int maxOffset)
                     throws gnu.regexp.REException
Returns the next match not more than maxOffset characters from the current position in the input stream or null if there are no more tokens. This function is good for finding optional tokens that optionally appear in the source at or near a fixed position. The current position is set to next character after the end of found string in source.
Parameters:
regularExpression - to match starting at current position
maxOffset - maximum number of characters from current position that match will succeed
Returns:
matched string or null

get

public java.lang.String get(gnu.regexp.RE expression,
                            int maxOffset)
Returns the next match not more than maxOffset characters from the current position in the input stream or null if there are no more tokens. This function is good for finding optional tokens that optionally appear in the source at or near a fixed position. The current position is set to next character after the end of found string in source.
Parameters:
expression - to match starting at current position
maxOffset - maximum number of characters from current position that match will succeed
Returns:
matched string or null
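
The maxOffset check for optional tokens near a fixed position can be sketched like this. The sketch assumes the offset is measured from the current position to the start of the match; GetWithOffsetSketch is a hypothetical name and java.util.regex is used in place of gnu.regexp.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of accepting a match only near the current position. */
final class GetWithOffsetSketch {
    /** Returns the next match starting within maxOffset characters of position, or null. */
    static String get(String source, int position, String regex, int maxOffset) {
        Matcher m = Pattern.compile(regex).matcher(source);
        if (!m.find(position)) {
            return null;
        }
        // Assumption: the distance is counted up to the first character of the match
        return (m.start() - position <= maxOffset) ? m.group() : null;
    }

    public static void main(String[] args) {
        String source = "ID:   A17 rest";
        System.out.println(get(source, 3, "[A-Z]\\d+", 5)); // prints "A17"
        System.out.println(get(source, 3, "[A-Z]\\d+", 2)); // prints "null"
    }
}
```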

getToken

public java.lang.String getToken()
Returns the next white space delimited token in the input stream or null if there are no more tokens. Here, white space is a space, tab, or newline. The current position is set to next character after the end of token in source.
Returns:
next white space delimited token string or null if end of source

getToken

public java.lang.String getToken(int maxNumWhiteSpaceChars)
Returns the next white space delimited token not more than maxNumWhiteSpaceChars characters from the current position in the input stream or null if there are no more tokens. This function is good for finding tokens that optionally appear in the source at or near a fixed position. The current position is set to next character after the end of token in source.
Parameters:
maxNumWhiteSpaceChars - maximum number of characters between the current position and the start of the token
Returns:
next token or null if there are no more

getLine

public java.lang.String getLine()
Returns a line of text (delimited by newline) from input without the newline character in the result. If the current position is on a newline, the next full line is returned. Otherwise, if the current position is in the middle of a line, the remainder of the line is returned. If there are no lines left, null is returned. The current position is set to start of the next line in the source.
Returns:
next line of text without the newline character or null if none
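
The line semantics above can be sketched as follows. The sketch reads "on a newline" as meaning that the newline under the current position is skipped before the line is read, which is an interpretation rather than something the documentation confirms. GetLineSketch is a hypothetical name.

```java
/** Hypothetical illustration of reading lines relative to a current position. */
final class GetLineSketch {
    private final String source;
    private int position;

    GetLineSketch(String source, int position) {
        this.source = source;
        this.position = position;
    }

    /** Returns the rest of the current line without its newline, or null if none is left. */
    String getLine() {
        // Assumption: sitting on a newline means skipping it and returning the next line
        if (position < source.length() && source.charAt(position) == '\n') {
            position++;
        }
        if (position >= source.length()) {
            return null;
        }
        int end = source.indexOf('\n', position);
        if (end < 0) {
            end = source.length(); // last line has no trailing newline
        }
        String line = source.substring(position, end);
        position = Math.min(end + 1, source.length()); // start of the next line
        return line;
    }

    public static void main(String[] args) {
        GetLineSketch lines = new GetLineSketch("one\ntwo\nthree", 1);
        for (String line = lines.getLine(); line != null; line = lines.getLine()) {
            System.out.println(line);
        }
    }
}
```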

getInt

public int getInt()
           throws java.lang.NumberFormatException
Parses and returns the next token (skipping white space) as an integer. The current position is set to next character after the end of int token in source.
Returns:
result of parsing the next token as an int
Throws:
java.lang.NumberFormatException - if the next token is not an integer

getInt

public int getInt(int maxNumDigits)
           throws java.lang.NumberFormatException
Parses and returns the next token (skipping white space) as an integer of the given maximum number of digits. The current position is set to next character after the end of int token in source.
Returns:
result of parsing maxNumDigits of characters of the next token as an int
Throws:
java.lang.NumberFormatException - if the next token is not an integer

getInt

public int getInt(java.lang.String expression,
                  java.lang.String substitution)
           throws java.lang.NumberFormatException,
                  gnu.regexp.REException
Matches the given regular expression, substitutes the match into the given substitution string, parses the result as an integer, and returns the result. The current position is set to next character after the end of int token in source.

This function uses the gnu.regexp.REMatch.substituteInto() function. Therefore, the substitute string can contain plain text or the special symbols $0-$9. $0 represents the entire matched string and $1-$9 represent the first through the ninth matched sub expression respectively.

Parameters:
expression - to match in source
substitution - into which matched expression is substituted
Returns:
parsed integer result of substituted match of given expression
Throws:
java.lang.NumberFormatException - if the substituted value is not an int
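
The match, substitute, and parse sequence can be sketched with java.util.regex, whose replacement strings also understand $1-style references. GetIntSketch is a hypothetical name, gnu.regexp is not used, and throwing NumberFormatException when nothing matches is an assumption made for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of match, substitute, then parse as int. */
final class GetIntSketch {
    static int getInt(String source, String regex, String substitution) {
        Matcher m = Pattern.compile(regex).matcher(source);
        if (!m.find()) {
            // Assumption: this sketch reports "no match" as a NumberFormatException
            throw new NumberFormatException("no match for " + regex);
        }
        StringBuffer buffer = new StringBuffer();
        // Appends the text before the match, then the expanded substitution
        m.appendReplacement(buffer, substitution);
        // Drop the text that preceded the match, keeping only the substituted part
        String substituted = buffer.substring(m.start());
        return Integer.parseInt(substituted);
    }

    public static void main(String[] args) {
        String source = "some text\nmore text\nnumCards=52";
        System.out.println(getInt(source, "numCards=(\\d+)", "$1")); // prints 52
    }
}
```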

getInt

public int getInt(gnu.regexp.RE expression,
                  java.lang.String substitution)
           throws java.lang.NumberFormatException
Matches the given regular expression, substitutes the match into the given substitution string, parses the result as an integer, and returns the result.

This function uses the gnu.regexp.REMatch.substituteInto() function. Therefore, the substitute string can contain plain text or the special symbols $0-$9. $0 represents the entire matched string and $1-$9 represent the first through the ninth matched sub expression respectively.

Parameters:
expression - to match in source
substitution - into which matched expression is substituted
Returns:
parsed integer result of substituted match of given expression
Throws:
java.lang.NumberFormatException - if the substituted value is not an int

getPrefixedInt

public int getPrefixedInt(java.lang.String tagRegularExpression)
                   throws java.lang.NumberFormatException,
                          gnu.regexp.REException
Parses and returns the integer token after the matching regular expression. This is useful if the integer is prefixed by some recognizable string. On success, the current position is set to next character after the end of parsed int in source. If an error occurs, the current position is not changed.
Returns:
parsed integer prefixed by given tagRegularExpression
Throws:
java.lang.NumberFormatException - if the next token is not an integer

getNextInt

public int getNextInt()
               throws java.text.ParseException
Parses and returns the next integer in the input (skipping non decimal digits and white space). The current position is set to next character after the end of next int token in source.
Returns:
next integer in source
Throws:
java.text.ParseException - if no integer in rest of input
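
A minimal sketch of scanning ahead to the next run of digits, assuming unsigned decimal integers only (whether a leading minus sign is honored is not stated in the documentation). GetNextIntSketch is a hypothetical name.

```java
import java.text.ParseException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of skipping ahead to the next integer. */
final class GetNextIntSketch {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Returns the first run of digits at or after position, parsed as an int. */
    static int nextInt(String source, int position) throws ParseException {
        Matcher m = DIGITS.matcher(source);
        if (!m.find(position)) {
            throw new ParseException("no integer in rest of input", position);
        }
        return Integer.parseInt(m.group());
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(nextInt("abc  42 def 7", 0)); // prints 42
        System.out.println(nextInt("abc  42 def 7", 7)); // prints 7
    }
}
```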

getDate

public java.util.Date getDate(java.lang.String simpleDateFormat)
                       throws java.text.ParseException
Parses and returns the next token as a Date parsed by a SimpleDateFormat object created with the given simpleDateFormat string. The current position is set to next character after the end of date in source.
Parameters:
simpleDateFormat - to create a SimpleDateFormat to parse the date
Returns:
next token parsed as a date according to given simpleDateFormat
Throws:
java.text.ParseException - if date failed to parse
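
The standard library offers a close analogue of "parse a date at the current position and move past it": DateFormat.parse(String, ParsePosition). The sketch below uses it directly; GetDateSketch is a hypothetical name, and unlike getDate() this helper signals failure by returning null instead of throwing ParseException.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

/** Hypothetical illustration of parsing a date at a position inside a larger string. */
final class GetDateSketch {
    /** Parses a date starting at pos; on success pos is advanced past the date text. */
    static Date parseAt(String source, ParsePosition pos, String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        format.setLenient(false); // reject out-of-range fields such as month 13
        // On failure parse() returns null and leaves the index of pos unchanged
        return format.parse(source, pos);
    }

    public static void main(String[] args) {
        String source = "due 2001-02-03 ok";
        ParsePosition pos = new ParsePosition(4); // just after "due "
        Date date = parseAt(source, pos, "yyyy-MM-dd");
        System.out.println(date != null);
        System.out.println(pos.getIndex()); // index just after the date text
    }
}
```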

getDate

public java.util.Date getDate(java.text.DateFormat format)
                       throws java.text.ParseException
Parses and returns the next token as a Date parsed by the given format. The current position is set to next character after the end of date in source.
Parameters:
format - to parse the date
Returns:
parsed Date from current position in source
Throws:
java.text.ParseException - if date failed to parse

getTime

public java.util.Date getTime(java.lang.String timeFormat,
                              java.util.Date date)
                       throws java.text.ParseException
Parses and returns the next token as a Date parsed by the given time format but using the year, month, and date of the given date. Notice that because the date portion of the parsed time is overwritten by the given date, only hours, minutes, and seconds parsing sequences are useful in the given timeFormat. The current position is set to next character after the end of time in source.
Parameters:
timeFormat - to parse the time
date - to fill into result
Returns:
parsed Date from current position in source
Throws:
java.text.ParseException - if date failed to parse

getTime

public java.util.Date getTime(java.text.DateFormat timeFormat,
                              java.util.Date date)
                       throws java.text.ParseException
Parses and returns the next token as a Date parsed by the given time format but using the year, month, and date of the given date. Notice that because the date portion of the parsed time is overwritten by the given date, only hours, minutes, and seconds parsing sequences are useful in the given timeFormat. The current position is set to next character after the end of time in source.
Parameters:
timeFormat - to parse the time
date - to fill into result
Returns:
parsed Date from current position in source
Throws:
java.text.ParseException - if date failed to parse
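
Combining a parsed time of day with the year, month, and day of another date can be sketched with java.util.Calendar. This is an illustration of the idea only: GetTimeSketch and parseTimeOn are hypothetical names, and the default time zone is assumed throughout.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;

/** Hypothetical illustration of placing a parsed time onto a given date. */
final class GetTimeSketch {
    /** Parses timeText with timePattern, then moves the result onto the day of date. */
    static Date parseTimeOn(String timeText, String timePattern, Date date) throws ParseException {
        Date parsed = new SimpleDateFormat(timePattern, Locale.ROOT).parse(timeText);

        Calendar time = Calendar.getInstance();
        time.setTime(parsed);

        Calendar result = Calendar.getInstance();
        result.setTime(date);
        // Keep year, month, and day from date; copy only the time-of-day fields
        result.set(Calendar.HOUR_OF_DAY, time.get(Calendar.HOUR_OF_DAY));
        result.set(Calendar.MINUTE, time.get(Calendar.MINUTE));
        result.set(Calendar.SECOND, time.get(Calendar.SECOND));
        result.set(Calendar.MILLISECOND, time.get(Calendar.MILLISECOND));
        return result.getTime();
    }

    public static void main(String[] args) throws ParseException {
        Date day = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT).parse("2001-02-03");
        Date combined = parseTimeOn("14:30:05", "HH:mm:ss", day);
        System.out.println(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT).format(combined));
    }
}
```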

getPrefixedDate

public java.util.Date getPrefixedDate(java.lang.String regularExpression,
                                      java.lang.String simpleDateFormat)
                               throws java.text.ParseException,
                                      gnu.regexp.REException
Parses and returns the token, after the given regular expression matches, as a Date parsed by a SimpleDateFormat object created with the given simpleDateFormat string. An object representing the regular expression may or may not be created. The current position is set to next character after the end of Date in source.
Parameters:
regularExpression - to match before parsing the date
simpleDateFormat - to use to parse the date
Throws:
java.text.ParseException - from SimpleDateFormat.parse()

getPrefixedDate

public java.util.Date getPrefixedDate(gnu.regexp.RE regularExpression,
                                      java.lang.String simpleDateFormat)
                               throws java.text.ParseException
Parses and returns the token, after the given regular expression matches, as a Date parsed by a SimpleDateFormat object created with the given simpleDateFormat string. The current position is set to next character after the end of Date in source.
Parameters:
regularExpression - to match before parsing the date
simpleDateFormat - to use to parse the date
Throws:
java.text.ParseException - from SimpleDateFormat.parse()

getPrefixedDate

public java.util.Date getPrefixedDate(java.lang.String regularExpression,
                                      java.text.DateFormat format)
                               throws java.text.ParseException,
                                      gnu.regexp.REException
Parses and returns the token, after the given regular expression matches, as a Date parsed by the format object. An object representing the regular expression may or may not be created. The current position is set to next character after the end of Date in source.
Parameters:
regularExpression - to match before parsing the date
format - to use to parse the date
Throws:
java.text.ParseException - from DateFormat.parse()

getPrefixedDate

public java.util.Date getPrefixedDate(gnu.regexp.RE regularExpression,
                                      java.text.DateFormat format)
                               throws java.text.ParseException
Parses and returns the token, after the given regular expression matches, as a Date parsed by the format object. The current position is set to next character after the end of Date in source.
Parameters:
regularExpression - to match before parsing the date
format - to use to parse the date
Throws:
java.text.ParseException - from DateFormat.parse()

getPosition

public int getPosition()
Return the index of current position in the source.
Returns:
index of current position in source

setPosition

public void setPosition(int absolute)
Set the index of current position in the source.
Parameters:
absolute - index to set

getSource

public java.lang.String getSource()
Returns the entire source.
Returns:
the entire source string

undoLast

public void undoLast()
This moves the current position to the position before the previous function was called. This should work for all of the public functions. It simply moves the current position to the position prior to the previous function.

This only works once. In other words, currently only one undo operation is kept.
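
The single-level undo can be pictured as one saved position that every moving operation overwrites. The sketch below is a hypothetical stand-in (UndoSketch is not part of grace.util) that only demonstrates why a second undoLast() has no further effect.

```java
/** Hypothetical illustration of a one-level undo of the current position. */
final class UndoSketch {
    private int position = 0;
    private int previousPosition = 0; // single saved slot: only one undo level is kept

    void advance(int numCharacters) {
        previousPosition = position; // remember where we were before moving
        position += numCharacters;
    }

    void undoLast() {
        position = previousPosition; // restore; the saved slot itself is not rewound
    }

    int getPosition() {
        return position;
    }

    public static void main(String[] args) {
        UndoSketch cursor = new UndoSketch();
        cursor.advance(3);
        cursor.advance(4);
        cursor.undoLast();
        System.out.println(cursor.getPosition()); // prints 3
        cursor.undoLast();
        System.out.println(cursor.getPosition()); // prints 3 again: only one level
    }
}
```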


isAt

public boolean isAt(java.lang.String regularExpression)
             throws gnu.regexp.REException
Indicates whether the given regular expression matches at the current position. This is useful if one wants to check the existence of optional text in the source at the current position. The current position is not moved.
Parameters:
regularExpression - to match at current position
Returns:
given expression matches at current position
Throws:
gnu.regexp.REException - if regularExpression is invalid

isAt

public boolean isAt(gnu.regexp.RE expression)
Indicates whether the given regular expression matches at the current position. This is useful if one wants to check the existence of optional text in the source at the current position. The current position is not moved.
Parameters:
expression - to match at current position
Returns:
given expression matches at current position
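
With java.util.regex, an "is the pattern here?" test can be sketched using Matcher.region() and Matcher.lookingAt(), which require the match to begin exactly at the region start. IsAtSketch is a hypothetical name; the gnu.regexp-based implementation may work differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of testing for a match at a position without moving it. */
final class IsAtSketch {
    static boolean isAt(String source, int position, String regex) {
        Matcher m = Pattern.compile(regex).matcher(source);
        m.region(position, source.length()); // restrict matching to text from position onward
        return m.lookingAt();                // true only if a match begins exactly at position
    }

    public static void main(String[] args) {
        System.out.println(isAt("key: value", 5, "val")); // prints true
        System.out.println(isAt("key: value", 4, "val")); // prints false
    }
}
```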

toPrintable

protected java.lang.String toPrintable(java.lang.String notPrintable)
Utility function that converts a string with embedded non-printable characters, such as newlines and tabs, into a string that may be cleanly printed. In other words, these characters are expanded to their printable escape sequences, such as \n and \t.
Parameters:
notPrintable - string presumably containing tabs and newlines
Returns:
printable version of notPrintable with pretty tabs and newlines
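
A minimal sketch of the expansion, assuming only newline, tab, and carriage return are escaped (the documentation names newlines and tabs; carriage return is added here as an assumption). ToPrintableSketch is a hypothetical name.

```java
/** Hypothetical illustration of making control characters visible. */
final class ToPrintableSketch {
    static String toPrintable(String notPrintable) {
        StringBuilder out = new StringBuilder();
        for (char c : notPrintable.toCharArray()) {
            switch (c) {
                case '\n': out.append("\\n"); break; // backslash followed by 'n'
                case '\t': out.append("\\t"); break; // backslash followed by 't'
                case '\r': out.append("\\r"); break; // assumption: also escape carriage return
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPrintable("a\tb\nc")); // prints a\tb\nc on a single line
    }
}
```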

printTo

public void printTo(PrintWriter writer)
Used by grace.io.PrintWriter to nicely print this.

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object

main

public static void main(java.lang.String[] args)
Test program not quite completed.