eu.dicodeproject.analysis.examples
Class MailArchiveToSequenceFile
java.lang.Object
eu.dicodeproject.analysis.examples.MailArchiveToSequenceFile
- All Implemented Interfaces:
- org.apache.hadoop.fs.PathFilter
- Direct Known Subclasses:
- UnquotedArchiveToSequenceFile
public class MailArchiveToSequenceFile
- extends Object
- implements org.apache.hadoop.fs.PathFilter
Converts a directory containing unzipped mail archives in mbox format into
sequence files.
The key is the original file name with an id appended that increases with each e-mail seen.
The value is the plain e-mail, including header information.
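The key scheme described above could be sketched as follows. This is a minimal illustration, not the class's actual code: the helper name SequenceKeySketch, the "-" separator, and the choice to keep one counter running across all files are assumptions, since the documentation does not state the exact key layout.

```java
// Hypothetical helper for illustration only; not part of MailArchiveToSequenceFile.
class SequenceKeySketch {
    // Incremented once for every e-mail seen.
    private int messageCount = 0;

    // Builds a key from the original file name plus the running id.
    // Assumption: "-" as separator; the real key layout is not documented.
    String nextKey(String fileName) {
        String key = fileName + "-" + messageCount;
        messageCount++;
        return key;
    }

    public static void main(String[] args) {
        SequenceKeySketch keys = new SequenceKeySketch();
        System.out.println(keys.nextKey("2011-01.mbox"));
        System.out.println(keys.nextKey("2011-01.mbox"));
    }
}
```

Because the id is part of the key, two e-mails from the same archive file never share a key.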
Method Summary
boolean accept(org.apache.hadoop.fs.Path current)
Accepts all files in a directory, splits each file into individual mails
and concatenates them, each prefixed with the mail's message id.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
MailArchiveToSequenceFile
public MailArchiveToSequenceFile(org.apache.hadoop.conf.Configuration conf,
String prefix,
org.apache.mahout.text.ChunkedWriter writer,
Charset charset)
throws IOException
- Throws:
IOException
MailArchiveToSequenceFile
public MailArchiveToSequenceFile(org.apache.hadoop.conf.Configuration conf,
String prefix,
org.apache.mahout.text.ChunkedWriter writer,
Charset charset,
MailContentHandler handler)
throws IOException
- Throws:
IOException
accept
public boolean accept(org.apache.hadoop.fs.Path current)
- Accepts all files in a directory, splits each file into individual mails
and concatenates them, each prefixed with the mail's message id.
TODO: filter for headers vs. body.
- Specified by:
accept in interface org.apache.hadoop.fs.PathFilter
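The splitting step that accept describes might look roughly like the sketch below. It uses only the Java standard library and is not the class's actual implementation: the class name MboxSplitSketch and both method names are hypothetical. It assumes the common mbox convention that each message starts at a line beginning with "From " and that the message id is carried in a Message-ID header. Folded (multi-line) headers are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and behaviour are assumptions, not the documented API.
class MboxSplitSketch {
    private static final String ID_HEADER = "Message-ID:";

    // Splits mbox text into messages; a new message starts at each "From " line.
    static List<String> splitMessages(String mbox) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : mbox.split("\r?\n")) {
            if (line.startsWith("From ") && current.length() > 0) {
                messages.add(current.toString());
                current.setLength(0);
            }
            current.append(line).append('\n');
        }
        // Keep the last message unless it contains only whitespace.
        if (!current.toString().trim().isEmpty()) {
            messages.add(current.toString());
        }
        return messages;
    }

    // Returns the Message-ID header value, or null if the headers contain none.
    static String messageId(String message) {
        for (String line : message.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // the first blank line ends the header section
            }
            if (line.regionMatches(true, 0, ID_HEADER, 0, ID_HEADER.length())) {
                return line.substring(ID_HEADER.length()).trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String mbox = "From a@example.org Mon Jan  3 10:00:00 2011\n"
                + "Message-ID: <1@example.org>\n\nHello\n"
                + "From b@example.org Mon Jan  3 11:00:00 2011\n"
                + "Message-ID: <2@example.org>\n\nWorld\n";
        for (String mail : splitMessages(mbox)) {
            System.out.println(messageId(mail));
        }
    }
}
```

The split relies on body lines that begin with "From " having been escaped by the program that wrote the archive, as mbox writers commonly do. An archive without such escaping would be split in the wrong places.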
Copyright © 2011. All Rights Reserved.