Class Tokenizer


  • public class Tokenizer
    extends java.lang.Object
    Provides operations on char arrays that represent all or part of a parsed XML entity.

    Several methods operate on char subarrays. The subarray is specified by a char array buf and two integers, off and end; off gives the index in buf of the first char of the subarray and end gives the index in buf of the char immediately after the last char.
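    The off/end pair describes a half-open range, so the subarray holds end - off chars and is empty when off equals end. The following is a minimal sketch of that convention using only standard Java; the class SubarrayDemo and its methods are hypothetical helpers for illustration and are not part of Tokenizer.

```java
// Hypothetical helper, not part of Tokenizer: illustrates the off/end convention.
class SubarrayDemo {
    // Copies the chars buf[off] .. buf[end - 1] into a String.
    static String asString(char[] buf, int off, int end) {
        // The String constructor takes (array, offset, count), so count is end - off.
        return new String(buf, off, end - off);
    }

    // Number of chars in the subarray.
    static int length(int off, int end) {
        return end - off;
    }

    public static void main(String[] args) {
        char[] buf = "<doc>text</doc>".toCharArray();
        // off = 5 is the index of the first 't'; end = 9 is the index just past the last 't'.
        System.out.println(asString(buf, 5, 9)); // prints "text"
        System.out.println(length(5, 9));        // prints 4
    }
}
```

    Passing end rather than a length means a caller can advance off through a buffer while leaving end fixed.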

    The main operations provided by Tokenizer are tokenizeProlog, tokenizeContent and tokenizeCdataSection; these are used to divide up an XML entity into tokens. tokenizeProlog is used for the prolog of an XML document as well as for the external subset and parameter entities (except when referenced in an EntityValue); it can also be used for parsing the Misc* that follows the document element. tokenizeContent is used for the document element and for parsed general entities that are referenced in content except for CDATA sections. tokenizeCdataSection is used for CDATA sections, following the <![CDATA[ up to and including the ]]>.

    tokenizeAttributeValue and tokenizeEntityValue are used to further divide up tokens returned by tokenizeProlog and tokenizeContent; they are also used to divide up entities referenced in attribute values or entity values.

    • Field Summary

      Fields 
      Modifier and Type Field Description
      static int TOK_ATTRIBUTE_VALUE_S
      Represents a white space character in an attribute value, excluding white space characters that are part of line boundaries.
      static int TOK_CDATA_SECT_CLOSE
      Represents the end of a CDATA section ]]>.
      static int TOK_CDATA_SECT_OPEN
      Represents the start of a CDATA section <![CDATA[.
      static int TOK_CHAR_PAIR_REF
      Represents a numeric character reference (decimal or hexadecimal), when the referenced character is greater than 0xFFFF and so is represented by a pair of chars.
      static int TOK_CHAR_REF
      Represents a numeric character reference (decimal or hexadecimal), when the referenced character is less than or equal to 0xFFFF and so is represented by a single char.
      static int TOK_CLOSE_BRACKET
      Represents ] in the prolog.
      static int TOK_CLOSE_PAREN
      Represents a ) in the prolog that is not followed immediately by any of *, + or ?.
      static int TOK_CLOSE_PAREN_ASTERISK
      Represents )* in the prolog.
      static int TOK_CLOSE_PAREN_PLUS
      Represents )+ in the prolog.
      static int TOK_CLOSE_PAREN_QUESTION
      Represents )? in the prolog.
      static int TOK_COMMA
      Represents , in the prolog.
      static int TOK_COMMENT
      Represents a comment <!-- comment -->.
      static int TOK_COND_SECT_CLOSE
      Represents ]]> in the prolog.
      static int TOK_COND_SECT_OPEN
      Represents <![ in the prolog.
      static int TOK_DATA_CHARS
      Represents one or more characters of data.
      static int TOK_DATA_NEWLINE
      Represents a newline (CR, LF or CR followed by LF) in data.
      static int TOK_DECL_CLOSE
      Represents > in the prolog.
      static int TOK_DECL_OPEN
      Represents <!NAME in the prolog.
      static int TOK_EMPTY_ELEMENT_NO_ATTS
      Represents an empty-element tag <name/> that has no attribute specifications.
      static int TOK_EMPTY_ELEMENT_WITH_ATTS
      Represents an empty-element tag <name att="val"/> that contains one or more attribute specifications.
      static int TOK_END_TAG
      Represents a complete end-tag </name>.
      static int TOK_ENTITY_REF
      Represents a general entity reference.
      static int TOK_LITERAL
      Represents a literal (EntityValue, AttValue, SystemLiteral or PubidLiteral).
      static int TOK_MAGIC_ENTITY_REF
      Represents a general entity reference to one of the five predefined entities amp, lt, gt, quot, apos.
      static int TOK_NAME
      Represents an unprefixed name in the prolog.
      static int TOK_NAME_ASTERISK
      Represents a name followed immediately by *.
      static int TOK_NAME_PLUS
      Represents a name followed immediately by +.
      static int TOK_NAME_QUESTION
      Represents a name followed immediately by ?.
      static int TOK_NMTOKEN
      Represents a name token in the prolog that is not a name.
      static int TOK_OPEN_BRACKET
      Represents [ in the prolog.
      static int TOK_OPEN_PAREN
      Represents a ( in the prolog.
      static int TOK_OR
      Represents | in the prolog.
      static int TOK_PARAM_ENTITY_REF
      Represents a parameter entity reference in the prolog.
      static int TOK_PERCENT
      Represents a % in the prolog that does not start a parameter entity reference.
      static int TOK_PI
      Represents a processing instruction.
      static int TOK_POUND_NAME
      Represents #NAME in the prolog.
      static int TOK_PREFIXED_NAME
      Represents a name with a prefix.
      static int TOK_PROLOG_S
      Represents whitespace in the prolog.
      static int TOK_START_TAG_NO_ATTS
      Represents a complete start-tag <name> that has no attribute specifications.
      static int TOK_START_TAG_WITH_ATTS
      Represents a complete start-tag <name att="val"> that contains one or more attribute specifications.
      static int TOK_XML_DECL
      Represents an XML declaration or text declaration (a processing instruction whose target is xml).
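      The distinction between TOK_CHAR_REF and TOK_CHAR_PAIR_REF follows from Java's 16-bit char type: a code point above 0xFFFF has to be stored as a surrogate pair. The sketch below shows this with standard java.lang.Character methods only. The class CharRefDemo and the method charsNeeded are hypothetical names, and the code does not reproduce how Tokenizer itself decodes references.

```java
// Hypothetical demo, not part of Tokenizer: one char versus a surrogate pair.
class CharRefDemo {
    // Returns how many Java chars are needed to hold the given code point.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        // &#x41; refers to U+0041, which fits in a single char.
        System.out.println(charsNeeded(0x41)); // prints 1

        // &#x1F600; refers to U+1F600, which needs two chars.
        char[] pair = Character.toChars(0x1F600);
        System.out.println(pair.length); // prints 2
        System.out.println(Character.isHighSurrogate(pair[0])); // prints true
        System.out.println(Character.isLowSurrogate(pair[1]));  // prints true
    }
}
```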
    • Constructor Summary

      Constructors 
      Constructor Description
      Tokenizer()  
    • Method Summary

      Modifier and Type Method Description
      static java.lang.String getPublicId(char[] buf, int off, int end)
      Checks that a literal contained in the specified char subarray is a legal public identifier and returns a string with the normalized content of the public id.
      static boolean matchesXMLString(char[] buf, int off, int end, java.lang.String str)
      Returns true if the specified char subarray is equal to the string.
      static void movePosition(char[] buf, int off, int end, Position pos)
      Moves a position forward.
      static int skipIgnoreSect(char[] buf, int off, int end)
      Skips over an ignored conditional section.
      static int skipS(char[] buf, int off, int end)
      Skips over XML whitespace characters at the start of the specified subarray.
      static int tokenizeAttributeValue(char[] buf, int off, int end, Token token)
      Scans the first token of a char subarray that contains part of a literal attribute value.
      static int tokenizeCdataSection(char[] buf, int off, int end, Token token)
      Scans the first token of a char subarray that starts with the content of a CDATA section.
      static int tokenizeContent(char[] buf, int off, int end, ContentToken token)
      Scans the first token of a char subarray that contains content.
      static int tokenizeEntityValue(char[] buf, int off, int end, Token token)
      Scans the first token of a char subarray that contains part of a literal entity value.
      static int tokenizeProlog(char[] buf, int off, int end, Token token)
      Scans the first token of a char subarray that contains part of a prolog.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
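
      As an illustration of the kind of scan that skipS describes, the sketch below skips leading XML whitespace in a char subarray. It is a hypothetical stand-in, not the Tokenizer implementation: the class and method names are invented, the whitespace set (space, tab, CR, LF) is taken from the XML 1.0 S production, and the return value (the index of the first non-whitespace char, or end if there is none) is an assumption that this summary does not confirm for skipS.

```java
// Hypothetical stand-in, not the Tokenizer implementation.
class SkipWhitespaceDemo {
    // XML 1.0 whitespace: space, tab, carriage return, line feed.
    static boolean isXmlWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    // Assumed contract: returns the index of the first non-whitespace char
    // in buf[off] .. buf[end - 1], or end if the whole subarray is whitespace.
    static int skipWhitespace(char[] buf, int off, int end) {
        while (off < end && isXmlWhitespace(buf[off])) {
            off++;
        }
        return off;
    }

    public static void main(String[] args) {
        char[] buf = " \t\r\n<a/>  ".toCharArray();
        System.out.println(skipWhitespace(buf, 0, buf.length)); // prints 4
        System.out.println(skipWhitespace(buf, 0, 3));          // prints 3
    }
}
```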