eu.dicodeproject.analysis.examples
Class MailArchiveToSequenceFile

java.lang.Object
  extended by eu.dicodeproject.analysis.examples.MailArchiveToSequenceFile
All Implemented Interfaces:
org.apache.hadoop.fs.PathFilter
Direct Known Subclasses:
UnquotedArchiveToSequenceFile

public class MailArchiveToSequenceFile
extends Object
implements org.apache.hadoop.fs.PathFilter

Converts a directory containing unzipped mail archives in mbox format into sequence files. The key is the original filename with an ID appended that increases with each e-mail seen. The value is the plain e-mail, including header information.
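
The key/value layout described above can be illustrated with a small, self-contained sketch. It does not use the Hadoop or Mahout classes and is not this class's implementation: the class name MailKeySketch, the method toRecords, the "-" separator between filename and ID, and the choice to keep one counter running across files are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of eu.dicodeproject.analysis.examples. */
public class MailKeySketch {

    // Assumption: the ID keeps increasing across all files handled by one instance.
    private int nextId = 0;

    /**
     * Builds one record per mail. Key: original filename plus an appended,
     * increasing ID. Value: the plain mail text including its headers.
     * The "-" separator is an assumption; the real format is not documented here.
     */
    public Map<String, String> toRecords(String fileName, List<String> mails) {
        Map<String, String> records = new LinkedHashMap<>();
        for (String mail : mails) {
            records.put(fileName + "-" + nextId, mail);
            nextId++;
        }
        return records;
    }

    public static void main(String[] args) {
        MailKeySketch sketch = new MailKeySketch();
        List<String> mails = Arrays.asList(
                "Subject: first\n\nHello\n",
                "Subject: second\n\nWorld\n");
        // Print only the keys to show the naming scheme.
        for (String key : sketch.toRecords("2011-01.mbox", mails).keySet()) {
            System.out.println(key);
        }
    }
}
```

An insertion-ordered map is used here only so the records come back in the order the mails were seen; the real class writes through a ChunkedWriter instead.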


Constructor Summary
MailArchiveToSequenceFile(org.apache.hadoop.conf.Configuration conf, String prefix, org.apache.mahout.text.ChunkedWriter writer, Charset charset)
           
MailArchiveToSequenceFile(org.apache.hadoop.conf.Configuration conf, String prefix, org.apache.mahout.text.ChunkedWriter writer, Charset charset, MailContentHandler handler)
           
 
Method Summary
 boolean accept(org.apache.hadoop.fs.Path current)
          Accepts all files in a directory, splits each file into individual mails, and concatenates them, each prefixed with the mail's message ID.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

MailArchiveToSequenceFile

public MailArchiveToSequenceFile(org.apache.hadoop.conf.Configuration conf,
                                 String prefix,
                                 org.apache.mahout.text.ChunkedWriter writer,
                                 Charset charset)
                          throws IOException
Throws:
IOException

MailArchiveToSequenceFile

public MailArchiveToSequenceFile(org.apache.hadoop.conf.Configuration conf,
                                 String prefix,
                                 org.apache.mahout.text.ChunkedWriter writer,
                                 Charset charset,
                                 MailContentHandler handler)
                          throws IOException
Throws:
IOException
Method Detail

accept

public boolean accept(org.apache.hadoop.fs.Path current)
Accepts all files in a directory, splits each file into individual mails, and concatenates them, each prefixed with the mail's message ID. TODO: filter for headers vs. body.

Specified by:
accept in interface org.apache.hadoop.fs.PathFilter
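
The splitting step mentioned in the description of accept can be sketched in plain Java. This is a minimal illustration, not the code used by this class: the class name MboxSplitSketch and the method splitMessages are hypothetical, and the sketch assumes that every line starting with "From " opens a new message and that this separator line is kept as part of the message. Real mbox parsers usually apply a stricter pattern to the separator line.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of eu.dicodeproject.analysis.examples. */
public class MboxSplitSketch {

    /**
     * Splits the text of one mbox file into individual mails.
     * Assumption: a line beginning with "From " marks the start of a mail.
     * Text before the first separator line is ignored.
     */
    public static List<String> splitMessages(String mbox) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        for (String line : mbox.split("\n")) {
            if (line.startsWith("From ")) {
                // Close the previous mail, if any, and start a new one.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }

    public static void main(String[] args) {
        String mbox = "From alice@example.org Mon Jan  3 10:00:00 2011\n"
                + "Subject: first\n\nHello\n"
                + "From bob@example.org Mon Jan  3 11:00:00 2011\n"
                + "Subject: second\n\nWorld\n";
        List<String> mails = splitMessages(mbox);
        System.out.println(mails.size()); // prints 2
    }
}
```

Each returned string holds the headers and body of one mail, which matches the value described for this class; filtering headers from the body, as the TODO notes, is not attempted here.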


Copyright © 2011. All Rights Reserved.