In contrast to the Lucene standard analyzer, this one additionally filters out
tokens shorter than a minimum length (default: two characters) and tokens that
consist only of digits.
Reads text from a configurable HBase table and column, extracts the content
(or, in some cases, single items from a list separated by a separator such as
'#') and writes the counts to HDFS.
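The counting step can be sketched as follows. This is a minimal illustration only: the HBase scan and the HDFS output are omitted, and the class name, method name, and the '#' separator are assumptions based on the description above, not identifiers from the actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SeparatedValueCounter {

    // Splits a cell value on the given separator and tallies each item.
    // Empty items (e.g. from a trailing separator) are skipped.
    static Map<String, Integer> extractAndCount(String cellValue, String separator) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String item : cellValue.split(separator)) {
            if (!item.isEmpty()) {
                counts.merge(item, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A cell holding a '#'-separated list, as described above.
        Map<String, Integer> counts = extractAndCount("java#hadoop#java", "#");
        System.out.println(counts); // {java=2, hadoop=1}
    }
}
```

In the real job each resulting (item, count) pair would be emitted by the mapper and summed in the reducer before being written to HDFS.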
Reads text from a configurable HBase table and column, tokenizes it with
Lucene, and writes the resulting tokens to HDFS for further processing by the
Mahout colloc driver.
This class wraps the Lucene StandardAnalyzer, ties it to Lucene version
3.0.2, and provides a no-argument constructor so it can be instantiated by the
Mahout collocation analysis.
Delegates most of the analysis to the Lucene standard analyzer, additionally
filtering out tokens shorter than the minimum length and tokens that are
digit-only.
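The two extra filtering rules can be sketched in isolation like this. This is an illustrative sketch only: the delegation to the Lucene StandardAnalyzer is omitted, and the class name, method name, and constant are hypothetical, not taken from the source.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenFilterSketch {

    // Default minimum token length, per the description above.
    static final int MIN_LENGTH = 2;

    // Keeps only tokens that meet the minimum length and are not digit-only;
    // in the real analyzer the input tokens would come from the wrapped
    // Lucene StandardAnalyzer.
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            boolean longEnough = token.length() >= MIN_LENGTH;
            boolean digitOnly = token.chars().allMatch(Character::isDigit);
            if (longEnough && !digitOnly) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "a" is too short, "42" is digit-only; "hadoop" and "v2" survive.
        System.out.println(filter(List.of("a", "42", "hadoop", "v2")));
        // [hadoop, v2]
    }
}
```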