Class Tokenizer
- java.lang.Object
-
- com.thaiopensource.xml.tok.Tokenizer
-
public class Tokenizer extends java.lang.Object
It provides operations on char arrays that represent all or part of a parsed XML entity.
Several methods operate on char subarrays. The subarray is specified by a char array buf and two integers, off and end; off gives the index in buf of the first char of the subarray and end gives the index in buf of the char immediately after the last char.
The main operations provided by Tokenizer are tokenizeProlog, tokenizeContent and tokenizeCdataSection; these are used to divide up an XML entity into tokens. tokenizeProlog is used for the prolog of an XML document as well as for the external subset and parameter entities (except when referenced in an EntityValue); it can also be used for parsing the Misc* that follows the document element. tokenizeContent is used for the document element and for parsed general entities that are referenced in content except for CDATA sections. tokenizeCdataSection is used for CDATA sections, following the <![CDATA[ up to and including the ]]>.
tokenizeAttributeValue and tokenizeEntityValue are used to further divide up tokens returned by tokenizeProlog and tokenizeContent; they are also used to divide up entities referenced in attribute values or entity values.
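The half-open (buf, off, end) convention described above can be illustrated without the library. This is a minimal sketch only: the class SubarrayDemo and its slice helper are hypothetical names introduced for this example and are not part of Tokenizer.

```java
// Hypothetical helper class; not part of com.thaiopensource.xml.tok.
final class SubarrayDemo {
    /** Copies the subarray [off, end) of buf into a String. */
    static String slice(char[] buf, int off, int end) {
        // end is exclusive, so the subarray holds end - off chars.
        return new String(buf, off, end - off);
    }

    public static void main(String[] args) {
        char[] buf = "<a>text</a>".toCharArray();
        System.out.println(slice(buf, 0, 3));           // <a>
        System.out.println(slice(buf, 3, 7));           // text
        System.out.println(slice(buf, 5, 5).isEmpty()); // true: off == end is an empty subarray
    }
}
```

Because end is exclusive, the end of one token can be passed directly as the off of the next call.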
-
-
Field Summary
Fields
- static int TOK_ATTRIBUTE_VALUE_S: Represents a white space character in an attribute value, excluding white space characters that are part of line boundaries.
- static int TOK_CDATA_SECT_CLOSE: Represents the end of a CDATA section ]]>.
- static int TOK_CDATA_SECT_OPEN: Represents the start of a CDATA section <![CDATA[.
- static int TOK_CHAR_PAIR_REF: Represents a numeric character reference (decimal or hexadecimal), when the referenced character is greater than 0xFFFF and so is represented by a pair of chars.
- static int TOK_CHAR_REF: Represents a numeric character reference (decimal or hexadecimal), when the referenced character is less than or equal to 0xFFFF and so is represented by a single char.
- static int TOK_CLOSE_BRACKET: Represents ] in the prolog.
- static int TOK_CLOSE_PAREN: Represents a ) in the prolog that is not followed immediately by any of *, + or ?.
- static int TOK_CLOSE_PAREN_ASTERISK: Represents )* in the prolog.
- static int TOK_CLOSE_PAREN_PLUS: Represents )+ in the prolog.
- static int TOK_CLOSE_PAREN_QUESTION: Represents )? in the prolog.
- static int TOK_COMMA: Represents , in the prolog.
- static int TOK_COMMENT: Represents a comment <!-- comment -->.
- static int TOK_COND_SECT_CLOSE: Represents ]]> in the prolog.
- static int TOK_COND_SECT_OPEN: Represents <![ in the prolog.
- static int TOK_DATA_CHARS: Represents one or more characters of data.
- static int TOK_DATA_NEWLINE: Represents a newline (CR, LF or CR followed by LF) in data.
- static int TOK_DECL_CLOSE: Represents > in the prolog.
- static int TOK_DECL_OPEN: Represents <!NAME in the prolog.
- static int TOK_EMPTY_ELEMENT_NO_ATTS: Represents an empty element tag <name/>, that doesn't have any attribute specifications.
- static int TOK_EMPTY_ELEMENT_WITH_ATTS: Represents an empty element tag <name att="val"/>, that contains one or more attribute specifications.
- static int TOK_END_TAG: Represents a complete end-tag </name>.
- static int TOK_ENTITY_REF: Represents a general entity reference.
- static int TOK_LITERAL: Represents a literal (EntityValue, AttValue, SystemLiteral or PubidLiteral).
- static int TOK_MAGIC_ENTITY_REF: Represents a general entity reference to one of the 5 predefined entities amp, lt, gt, quot, apos.
- static int TOK_NAME: Represents an unprefixed name in the prolog.
- static int TOK_NAME_ASTERISK: Represents a name followed immediately by *.
- static int TOK_NAME_PLUS: Represents a name followed immediately by +.
- static int TOK_NAME_QUESTION: Represents a name followed immediately by ?.
- static int TOK_NMTOKEN: Represents a name token in the prolog that is not a name.
- static int TOK_OPEN_BRACKET: Represents [ in the prolog.
- static int TOK_OPEN_PAREN: Represents a ( in the prolog.
- static int TOK_OR: Represents | in the prolog.
- static int TOK_PARAM_ENTITY_REF: Represents a parameter entity reference in the prolog.
- static int TOK_PERCENT: Represents a % in the prolog that does not start a parameter entity reference.
- static int TOK_PI: Represents a processing instruction.
- static int TOK_POUND_NAME: Represents #NAME in the prolog.
- static int TOK_PREFIXED_NAME: Represents a name with a prefix.
- static int TOK_PROLOG_S: Represents whitespace in the prolog.
- static int TOK_START_TAG_NO_ATTS: Represents a complete start-tag <name>, that doesn't have any attribute specifications.
- static int TOK_START_TAG_WITH_ATTS: Represents a complete start-tag <name att="val">, that contains one or more attribute specifications.
- static int TOK_XML_DECL: Represents an XML declaration or text declaration (a processing instruction whose target is xml).
-
Constructor Summary
Constructors
- Tokenizer()
-
Method Summary
Methods
- static java.lang.String getPublicId(char[] buf, int off, int end): Checks that a literal contained in the specified char subarray is a legal public identifier and returns a string with the normalized content of the public id.
- static boolean matchesXMLString(char[] buf, int off, int end, java.lang.String str): Returns true if the specified char subarray is equal to the string.
- static void movePosition(char[] buf, int off, int end, Position pos): Moves a position forward.
- static int skipIgnoreSect(char[] buf, int off, int end): Skips over an ignored conditional section.
- static int skipS(char[] buf, int off, int end): Skips over XML whitespace characters at the start of the specified subarray.
- static int tokenizeAttributeValue(char[] buf, int off, int end, Token token): Scans the first token of a char subarray that contains part of a literal attribute value.
- static int tokenizeCdataSection(char[] buf, int off, int end, Token token): Scans the first token of a char subarray that starts with the content of a CDATA section.
- static int tokenizeContent(char[] buf, int off, int end, ContentToken token): Scans the first token of a char subarray that contains content.
- static int tokenizeEntityValue(char[] buf, int off, int end, Token token): Scans the first token of a char subarray that contains part of a literal entity value.
- static int tokenizeProlog(char[] buf, int off, int end, Token token): Scans the first token of a char subarray that contains part of a prolog.
-
-
-
Field Detail
-
TOK_DATA_CHARS
public static final int TOK_DATA_CHARS
Represents one or more characters of data.
- See Also:
- Constant Field Values
-
TOK_DATA_NEWLINE
public static final int TOK_DATA_NEWLINE
Represents a newline (CR, LF or CR followed by LF) in data.
- See Also:
- Constant Field Values
-
TOK_START_TAG_NO_ATTS
public static final int TOK_START_TAG_NO_ATTS
Represents a complete start-tag <name>, that doesn't have any attribute specifications.
- See Also:
- Constant Field Values
-
TOK_START_TAG_WITH_ATTS
public static final int TOK_START_TAG_WITH_ATTS
Represents a complete start-tag <name att="val">, that contains one or more attribute specifications.
- See Also:
- Constant Field Values
-
TOK_EMPTY_ELEMENT_NO_ATTS
public static final int TOK_EMPTY_ELEMENT_NO_ATTS
Represents an empty element tag <name/>, that doesn't have any attribute specifications.
- See Also:
- Constant Field Values
-
TOK_EMPTY_ELEMENT_WITH_ATTS
public static final int TOK_EMPTY_ELEMENT_WITH_ATTS
Represents an empty element tag <name att="val"/>, that contains one or more attribute specifications.
- See Also:
- Constant Field Values
-
TOK_END_TAG
public static final int TOK_END_TAG
Represents a complete end-tag </name>.
- See Also:
- Constant Field Values
-
TOK_CDATA_SECT_OPEN
public static final int TOK_CDATA_SECT_OPEN
Represents the start of a CDATA section <![CDATA[.
- See Also:
- Constant Field Values
-
TOK_CDATA_SECT_CLOSE
public static final int TOK_CDATA_SECT_CLOSE
Represents the end of a CDATA section ]]>.
- See Also:
- Constant Field Values
-
TOK_ENTITY_REF
public static final int TOK_ENTITY_REF
Represents a general entity reference.
- See Also:
- Constant Field Values
-
TOK_MAGIC_ENTITY_REF
public static final int TOK_MAGIC_ENTITY_REF
Represents a general entity reference to one of the 5 predefined entities amp, lt, gt, quot, apos.
- See Also:
- Constant Field Values
-
TOK_CHAR_REF
public static final int TOK_CHAR_REF
Represents a numeric character reference (decimal or hexadecimal), when the referenced character is less than or equal to 0xFFFF and so is represented by a single char.
- See Also:
- Constant Field Values
-
TOK_CHAR_PAIR_REF
public static final int TOK_CHAR_PAIR_REF
Represents a numeric character reference (decimal or hexadecimal), when the referenced character is greater than 0xFFFF and so is represented by a pair of chars.
- See Also:
- Constant Field Values
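The split between TOK_CHAR_REF and TOK_CHAR_PAIR_REF follows from how Java encodes code points as UTF-16 chars. The sketch below uses only the standard Character.toChars method; the class CharRefDemo and its methods are hypothetical names for illustration and return the token names as plain strings rather than the library's constants.

```java
// Hypothetical illustration; not part of com.thaiopensource.xml.tok.
final class CharRefDemo {
    /** Names the token kind that the field descriptions associate with this code point. */
    static String tokenKind(int codePoint) {
        return codePoint > 0xFFFF ? "TOK_CHAR_PAIR_REF" : "TOK_CHAR_REF";
    }

    /** Encodes a code point as one char, or as a surrogate pair above 0xFFFF. */
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        // &#xE9; fits in a single char.
        System.out.println(tokenKind(0xE9) + " " + toUtf16(0xE9).length);       // TOK_CHAR_REF 1
        // &#x1F600; needs a high and a low surrogate.
        System.out.println(tokenKind(0x1F600) + " " + toUtf16(0x1F600).length); // TOK_CHAR_PAIR_REF 2
    }
}
```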
-
TOK_PI
public static final int TOK_PI
Represents a processing instruction.
- See Also:
- Constant Field Values
-
TOK_XML_DECL
public static final int TOK_XML_DECL
Represents an XML declaration or text declaration (a processing instruction whose target is xml).
- See Also:
- Constant Field Values
-
TOK_COMMENT
public static final int TOK_COMMENT
Represents a comment <!-- comment -->. This can occur both in the prolog and in content.
- See Also:
- Constant Field Values
-
TOK_ATTRIBUTE_VALUE_S
public static final int TOK_ATTRIBUTE_VALUE_S
Represents a white space character in an attribute value, excluding white space characters that are part of line boundaries.
- See Also:
- Constant Field Values
-
TOK_PARAM_ENTITY_REF
public static final int TOK_PARAM_ENTITY_REF
Represents a parameter entity reference in the prolog.
- See Also:
- Constant Field Values
-
TOK_PROLOG_S
public static final int TOK_PROLOG_S
Represents whitespace in the prolog. The token contains one whitespace character.
- See Also:
- Constant Field Values
-
TOK_DECL_OPEN
public static final int TOK_DECL_OPEN
Represents <!NAME in the prolog.
- See Also:
- Constant Field Values
-
TOK_DECL_CLOSE
public static final int TOK_DECL_CLOSE
Represents > in the prolog.
- See Also:
- Constant Field Values
-
TOK_NAME
public static final int TOK_NAME
Represents an unprefixed name in the prolog.
- See Also:
- Constant Field Values
-
TOK_PREFIXED_NAME
public static final int TOK_PREFIXED_NAME
Represents a name with a prefix.
- See Also:
- Constant Field Values
-
TOK_NMTOKEN
public static final int TOK_NMTOKEN
Represents a name token in the prolog that is not a name.
- See Also:
- Constant Field Values
-
TOK_POUND_NAME
public static final int TOK_POUND_NAME
Represents #NAME in the prolog.
- See Also:
- Constant Field Values
-
TOK_OR
public static final int TOK_OR
Represents | in the prolog.
- See Also:
- Constant Field Values
-
TOK_PERCENT
public static final int TOK_PERCENT
Represents a % in the prolog that does not start a parameter entity reference. This can occur in an entity declaration.
- See Also:
- Constant Field Values
-
TOK_OPEN_PAREN
public static final int TOK_OPEN_PAREN
Represents a ( in the prolog.
- See Also:
- Constant Field Values
-
TOK_CLOSE_PAREN
public static final int TOK_CLOSE_PAREN
Represents a ) in the prolog that is not followed immediately by any of *, + or ?.
- See Also:
- Constant Field Values
-
TOK_OPEN_BRACKET
public static final int TOK_OPEN_BRACKET
Represents [ in the prolog.
- See Also:
- Constant Field Values
-
TOK_CLOSE_BRACKET
public static final int TOK_CLOSE_BRACKET
Represents ] in the prolog.
- See Also:
- Constant Field Values
-
TOK_LITERAL
public static final int TOK_LITERAL
Represents a literal (EntityValue, AttValue, SystemLiteral or PubidLiteral).
- See Also:
- Constant Field Values
-
TOK_NAME_QUESTION
public static final int TOK_NAME_QUESTION
Represents a name followed immediately by ?.
- See Also:
- Constant Field Values
-
TOK_NAME_ASTERISK
public static final int TOK_NAME_ASTERISK
Represents a name followed immediately by *.
- See Also:
- Constant Field Values
-
TOK_NAME_PLUS
public static final int TOK_NAME_PLUS
Represents a name followed immediately by +.
- See Also:
- Constant Field Values
-
TOK_COND_SECT_OPEN
public static final int TOK_COND_SECT_OPEN
Represents <![ in the prolog.
- See Also:
- Constant Field Values
-
TOK_COND_SECT_CLOSE
public static final int TOK_COND_SECT_CLOSE
Represents ]]> in the prolog.
- See Also:
- Constant Field Values
-
TOK_CLOSE_PAREN_QUESTION
public static final int TOK_CLOSE_PAREN_QUESTION
Represents )? in the prolog.
- See Also:
- Constant Field Values
-
TOK_CLOSE_PAREN_ASTERISK
public static final int TOK_CLOSE_PAREN_ASTERISK
Represents )* in the prolog.
- See Also:
- Constant Field Values
-
TOK_CLOSE_PAREN_PLUS
public static final int TOK_CLOSE_PAREN_PLUS
Represents )+ in the prolog.
- See Also:
- Constant Field Values
-
TOK_COMMA
public static final int TOK_COMMA
Represents , in the prolog.
- See Also:
- Constant Field Values
-
-
Method Detail
-
movePosition
public static void movePosition(char[] buf, int off, int end, Position pos)
Moves a position forward. On entry, pos gives the position of the char at index off in buf. On exit, pos will give the position of the char at index end, which must be greater than or equal to off. The chars between off and end must encode one or more complete characters. A carriage return followed by a line feed will be treated as a single line delimiter provided that they are given to movePosition together.
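The line-delimiter rule can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the Pos class is a hypothetical stand-in for Position, the 1-based line and 0-based column numbering is assumed, and counting a surrogate pair as one column is also an assumption.

```java
// Hypothetical sketch; Pos stands in for the library's Position type.
final class MovePositionSketch {
    static final class Pos {
        int line = 1;   // assumed 1-based
        int column = 0; // assumed 0-based
    }

    /** Advances pos over the chars in buf[off, end). */
    static void advance(char[] buf, int off, int end, Pos pos) {
        int i = off;
        while (i < end) {
            char c = buf[i++];
            if (c == '\r') {
                // CR LF counts as one delimiter only when both chars are in this subarray.
                if (i < end && buf[i] == '\n') i++;
                pos.line++;
                pos.column = 0;
            } else if (c == '\n') {
                pos.line++;
                pos.column = 0;
            } else {
                // Treat a surrogate pair as a single character (assumption).
                if (Character.isHighSurrogate(c) && i < end && Character.isLowSurrogate(buf[i])) i++;
                pos.column++;
            }
        }
    }
}
```

Passing "\r" and "\n" in two separate calls advances the line count twice in this sketch, which is why the pair should be handed over together.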
-
tokenizeCdataSection
public static int tokenizeCdataSection(char[] buf, int off, int end, Token token) throws EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException
Scans the first token of a char subarray that starts with the content of a CDATA section. Returns one of the following integers according to the type of token that the subarray starts with:
- TOK_DATA_CHARS
- TOK_DATA_NEWLINE
- TOK_CDATA_SECT_CLOSE
Information about the token is stored in token.
After TOK_CDATA_SECT_CLOSE is returned, the application should use tokenizeContent.
- Throws:
- EmptyTokenException - if the subarray is empty
- PartialTokenException - if the subarray contains only part of a legal token
- InvalidTokenException - if the subarray does not start with a legal token or part of one
- ExtensibleTokenException - if the subarray encodes just a carriage return ('\r')
- See Also:
- TOK_DATA_CHARS, TOK_DATA_NEWLINE, TOK_CDATA_SECT_CLOSE, Token, EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException, tokenizeContent(char[], int, int, com.thaiopensource.xml.tok.ContentToken)
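The three token types produced inside a CDATA section can be pictured with a simplified, self-contained splitter. This is only a sketch: CdataSketch is a hypothetical class, it works on a complete String instead of a char subarray, it reports token kinds as strings, and the way it groups data characters into runs is an assumption rather than the tokenizer's documented behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the library's tokenizeCdataSection.
final class CdataSketch {
    /** Splits text that follows <![CDATA[ into token kinds, stopping at ]]>. */
    static List<String> split(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("]]>", i)) {
                out.add("CDATA_SECT_CLOSE");
                break; // the caller would switch back to content tokenizing here
            }
            char c = s.charAt(i);
            if (c == '\r') {
                // CR followed by LF is a single newline token.
                i += (i + 1 < s.length() && s.charAt(i + 1) == '\n') ? 2 : 1;
                out.add("DATA_NEWLINE");
            } else if (c == '\n') {
                i++;
                out.add("DATA_NEWLINE");
            } else {
                int j = i;
                while (j < s.length() && s.charAt(j) != '\r' && s.charAt(j) != '\n'
                        && !s.startsWith("]]>", j)) {
                    j++;
                }
                out.add("DATA_CHARS:" + s.substring(i, j));
                i = j;
            }
        }
        return out;
    }
}
```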
-
tokenizeContent
public static int tokenizeContent(char[] buf, int off, int end, ContentToken token) throws PartialTokenException, InvalidTokenException, EmptyTokenException, ExtensibleTokenException
Scans the first token of a char subarray that contains content. Returns one of the following integers according to the type of token that the subarray starts with:
- TOK_START_TAG_NO_ATTS
- TOK_START_TAG_WITH_ATTS
- TOK_EMPTY_ELEMENT_NO_ATTS
- TOK_EMPTY_ELEMENT_WITH_ATTS
- TOK_END_TAG
- TOK_DATA_CHARS
- TOK_DATA_NEWLINE
- TOK_CDATA_SECT_OPEN
- TOK_ENTITY_REF
- TOK_MAGIC_ENTITY_REF
- TOK_CHAR_REF
- TOK_CHAR_PAIR_REF
- TOK_PI
- TOK_XML_DECL
- TOK_COMMENT
Information about the token is stored in token.
When TOK_CDATA_SECT_OPEN is returned, tokenizeCdataSection should be called until it returns TOK_CDATA_SECT_CLOSE.
- Throws:
- EmptyTokenException - if the subarray is empty
- PartialTokenException - if the subarray contains only part of a legal token
- InvalidTokenException - if the subarray does not start with a legal token or part of one
- ExtensibleTokenException - if the subarray encodes just a carriage return ('\r')
- See Also:
- TOK_START_TAG_NO_ATTS, TOK_START_TAG_WITH_ATTS, TOK_EMPTY_ELEMENT_NO_ATTS, TOK_EMPTY_ELEMENT_WITH_ATTS, TOK_END_TAG, TOK_DATA_CHARS, TOK_DATA_NEWLINE, TOK_CDATA_SECT_OPEN, TOK_ENTITY_REF, TOK_MAGIC_ENTITY_REF, TOK_CHAR_REF, TOK_CHAR_PAIR_REF, TOK_PI, TOK_XML_DECL, TOK_COMMENT, ContentToken, EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException, tokenizeCdataSection(char[], int, int, com.thaiopensource.xml.tok.Token)
-
tokenizeProlog
public static int tokenizeProlog(char[] buf, int off, int end, Token token) throws PartialTokenException, InvalidTokenException, EmptyTokenException, ExtensibleTokenException, EndOfPrologException
Scans the first token of a char subarray that contains part of a prolog. Returns one of the following integers according to the type of token that the subarray starts with:
- TOK_PI
- TOK_XML_DECL
- TOK_COMMENT
- TOK_PARAM_ENTITY_REF
- TOK_PROLOG_S
- TOK_DECL_OPEN
- TOK_DECL_CLOSE
- TOK_NAME
- TOK_NMTOKEN
- TOK_POUND_NAME
- TOK_OR
- TOK_PERCENT
- TOK_OPEN_PAREN
- TOK_CLOSE_PAREN
- TOK_OPEN_BRACKET
- TOK_CLOSE_BRACKET
- TOK_LITERAL
- TOK_NAME_QUESTION
- TOK_NAME_ASTERISK
- TOK_NAME_PLUS
- TOK_COND_SECT_OPEN
- TOK_COND_SECT_CLOSE
- TOK_CLOSE_PAREN_QUESTION
- TOK_CLOSE_PAREN_ASTERISK
- TOK_CLOSE_PAREN_PLUS
- TOK_COMMA
- Throws:
- EmptyTokenException - if the subarray is empty
- PartialTokenException - if the subarray contains only part of a legal token
- InvalidTokenException - if the subarray does not start with a legal token or part of one
- EndOfPrologException - if the subarray starts with the document element; tokenizeContent should be used on the remainder of the entity
- ExtensibleTokenException - if the subarray is a legal token but subsequent chars in the same entity could be part of the token
- See Also:
- TOK_PI, TOK_XML_DECL, TOK_COMMENT, TOK_PARAM_ENTITY_REF, TOK_PROLOG_S, TOK_DECL_OPEN, TOK_DECL_CLOSE, TOK_NAME, TOK_NMTOKEN, TOK_POUND_NAME, TOK_OR, TOK_PERCENT, TOK_OPEN_PAREN, TOK_CLOSE_PAREN, TOK_OPEN_BRACKET, TOK_CLOSE_BRACKET, TOK_LITERAL, TOK_NAME_QUESTION, TOK_NAME_ASTERISK, TOK_NAME_PLUS, TOK_COND_SECT_OPEN, TOK_COND_SECT_CLOSE, TOK_CLOSE_PAREN_QUESTION, TOK_CLOSE_PAREN_ASTERISK, TOK_CLOSE_PAREN_PLUS, TOK_COMMA, ContentToken, EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException, EndOfPrologException
-
tokenizeAttributeValue
public static int tokenizeAttributeValue(char[] buf, int off, int end, Token token) throws PartialTokenException, InvalidTokenException, EmptyTokenException, ExtensibleTokenException
Scans the first token of a char subarray that contains part of a literal attribute value. The opening and closing delimiters are not included in the subarray. Returns one of the following integers according to the type of token that the subarray starts with:
- TOK_DATA_CHARS
- TOK_DATA_NEWLINE
- TOK_ATTRIBUTE_VALUE_S
- TOK_MAGIC_ENTITY_REF
- TOK_ENTITY_REF
- TOK_CHAR_REF
- TOK_CHAR_PAIR_REF
- Throws:
- EmptyTokenException - if the subarray is empty
- PartialTokenException - if the subarray contains only part of a legal token
- InvalidTokenException - if the subarray does not start with a legal token or part of one
- ExtensibleTokenException - if the subarray encodes just a carriage return ('\r')
- See Also:
- TOK_DATA_CHARS, TOK_DATA_NEWLINE, TOK_ATTRIBUTE_VALUE_S, TOK_MAGIC_ENTITY_REF, TOK_ENTITY_REF, TOK_CHAR_REF, TOK_CHAR_PAIR_REF, Token, EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException
-
tokenizeEntityValue
public static int tokenizeEntityValue(char[] buf, int off, int end, Token token) throws PartialTokenException, InvalidTokenException, EmptyTokenException, ExtensibleTokenException
Scans the first token of a char subarray that contains part of a literal entity value. The opening and closing delimiters are not included in the subarray. Returns one of the following integers according to the type of token that the subarray starts with:
- TOK_DATA_CHARS
- TOK_DATA_NEWLINE
- TOK_PARAM_ENTITY_REF
- TOK_MAGIC_ENTITY_REF
- TOK_ENTITY_REF
- TOK_CHAR_REF
- TOK_CHAR_PAIR_REF
- Throws:
- EmptyTokenException - if the subarray is empty
- PartialTokenException - if the subarray contains only part of a legal token
- InvalidTokenException - if the subarray does not start with a legal token or part of one
- ExtensibleTokenException - if the subarray encodes just a carriage return ('\r')
- See Also:
- TOK_DATA_CHARS, TOK_DATA_NEWLINE, TOK_MAGIC_ENTITY_REF, TOK_ENTITY_REF, TOK_PARAM_ENTITY_REF, TOK_CHAR_REF, TOK_CHAR_PAIR_REF, Token, EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException
-
skipIgnoreSect
public static int skipIgnoreSect(char[] buf, int off, int end) throws PartialTokenException, InvalidTokenException
Skips over an ignored conditional section. The subarray starts following the <![ IGNORE [.
- Returns:
- the index of the character following the closing ]]>
- Throws:
- PartialTokenException - if the subarray does not contain the complete ignored conditional section
- InvalidTokenException - if the ignored conditional section contains illegal characters
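Skipping an ignored section requires tracking nested <![ ... ]]> pairs, since the XML grammar allows conditional sections to nest inside ignored content. The sketch below is a hedged illustration of that idea: SkipIgnoreSketch is a hypothetical class, it throws IllegalStateException where the library would throw PartialTokenException, and it does not check for illegal characters.

```java
// Hypothetical sketch; not the library's skipIgnoreSect.
final class SkipIgnoreSketch {
    /** Returns the index just past the ]]> that closes the section starting at off. */
    static int skip(char[] buf, int off, int end) {
        int depth = 0; // number of nested <![ sections currently open
        int i = off;
        while (i + 2 < end) { // both delimiters are three chars long
            if (buf[i] == '<' && buf[i + 1] == '!' && buf[i + 2] == '[') {
                depth++;
                i += 3;
            } else if (buf[i] == ']' && buf[i + 1] == ']' && buf[i + 2] == '>') {
                if (depth == 0) return i + 3;
                depth--;
                i += 3;
            } else {
                i++;
            }
        }
        // Stand-in for PartialTokenException: the closing ]]> was not found.
        throw new IllegalStateException("ignored section is incomplete");
    }
}
```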
-
getPublicId
public static java.lang.String getPublicId(char[] buf, int off, int end) throws InvalidTokenException
Checks that a literal contained in the specified char subarray is a legal public identifier and returns a string with the normalized content of the public id. The subarray includes the opening and closing quotes.
- Throws:
- InvalidTokenException - if it is not a legal public identifier
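The XML 1.0 specification defines public identifier normalization as collapsing each run of white space to a single space and removing leading and trailing white space. A minimal sketch of that rule, assuming the PubidChar production from the specification, is shown here. PublicIdSketch is a hypothetical class, and IllegalArgumentException stands in for InvalidTokenException; whether the library performs exactly these steps is an assumption.

```java
// Hypothetical sketch; not the library's getPublicId.
final class PublicIdSketch {
    // Punctuation allowed by the PubidChar production of XML 1.0.
    private static final String PUNCT = "-'()+,./:=?;!*#@$_%";

    /** buf[off, end) includes the opening and closing quotes. */
    static String normalize(char[] buf, int off, int end) {
        StringBuilder sb = new StringBuilder();
        boolean pendingSpace = false;
        for (int i = off + 1; i < end - 1; i++) { // skip the two quotes
            char c = buf[i];
            if (c == ' ' || c == '\r' || c == '\n') {
                // Remember at most one space, and none before the first character.
                pendingSpace = sb.length() > 0;
            } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || PUNCT.indexOf(c) >= 0) {
                if (pendingSpace) {
                    sb.append(' ');
                    pendingSpace = false;
                }
                sb.append(c);
            } else {
                // Stand-in for InvalidTokenException.
                throw new IllegalArgumentException("illegal public id char at index " + i);
            }
        }
        return sb.toString(); // a trailing pending space is dropped
    }
}
```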
-
matchesXMLString
public static boolean matchesXMLString(char[] buf, int off, int end, java.lang.String str)
Returns true if the specified char subarray is equal to the string. The string must contain only XML significant characters.
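Comparing a subarray with a string needs no intermediate String allocation. The following sketch shows one way such a comparison might be written; MatchSketch is a hypothetical class and makes no claim about how matchesXMLString is implemented.

```java
// Hypothetical sketch; not the library's matchesXMLString.
final class MatchSketch {
    /** Returns true if buf[off, end) has the same chars as str. */
    static boolean matches(char[] buf, int off, int end, String str) {
        if (end - off != str.length()) return false; // lengths must agree first
        for (int i = 0; i < str.length(); i++) {
            if (buf[off + i] != str.charAt(i)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        char[] buf = "<!DOCTYPE html>".toCharArray();
        System.out.println(matches(buf, 2, 9, "DOCTYPE")); // true
        System.out.println(matches(buf, 2, 8, "DOCTYPE")); // false: subarray is one char short
    }
}
```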
-
skipS
public static int skipS(char[] buf, int off, int end)
Skips over XML whitespace characters at the start of the specified subarray.
- Returns:
- the index of the first non-whitespace character, or end if the subarray is all whitespace
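XML whitespace (the S production) consists of space, tab, carriage return and line feed. A minimal sketch of a skip routine with the return contract described above follows; SkipSSketch is a hypothetical class written for illustration, not the library's code.

```java
// Hypothetical sketch; not the library's skipS.
final class SkipSSketch {
    /** Returns the index of the first non-whitespace char in buf[off, end), or end. */
    static int skipS(char[] buf, int off, int end) {
        int i = off;
        while (i < end) {
            char c = buf[i];
            // XML 1.0 whitespace: #x20, #x9, #xD, #xA.
            if (c != ' ' && c != '\t' && c != '\r' && c != '\n') break;
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        char[] buf = " \t\nabc".toCharArray();
        System.out.println(skipS(buf, 0, buf.length)); // 3
        char[] blank = "   ".toCharArray();
        System.out.println(skipS(blank, 0, blank.length)); // 3, which equals end
    }
}
```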
-
-