Class SparkInput


  • public final class SparkInput
    extends java.lang.Object
    Utilities for loading RDDs from elastic file systems (S3, for example) without the slowness of the SparkContext textFile and sequenceFile functions. This is inspired by
    • Method Summary

      Modifier and Type Method Description
      static org.apache.spark.api.java.JavaRDD<org.openstreetmap.atlas.geography.atlas.Atlas> atlasFiles(org.apache.spark.api.java.JavaSparkContext context, java.lang.String path)
      Load Atlas files from an input path
      static org.apache.spark.api.java.JavaPairRDD<java.lang.String,org.apache.spark.input.PortableDataStream> binaryFile(org.apache.spark.api.java.JavaSparkContext context, java.lang.String path)
      Get an RDD from a binary file input
      static <K extends org.apache.hadoop.io.Writable,V extends org.apache.hadoop.io.Writable> org.apache.spark.api.java.JavaPairRDD<K,V> sequenceFile(org.apache.spark.api.java.JavaSparkContext context, java.lang.String path, java.lang.Class<K> sequenceKeyClass, java.lang.Class<V> sequenceValueClass)
      Get an RDD from a sequence file input
      protected static void setFileSystemCreator(FileSystemCreator fileSystemCreator)
      static org.apache.spark.api.java.JavaRDD<java.lang.String> textFile(org.apache.spark.api.java.JavaSparkContext context, java.lang.String path)
      Get an RDD from a text file input
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • atlasFiles

        public static org.apache.spark.api.java.JavaRDD<org.openstreetmap.atlas.geography.atlas.Atlas> atlasFiles(org.apache.spark.api.java.JavaSparkContext context,
                java.lang.String path)
        Load Atlas files from an input path
        Parameters:
        context - The context from Spark
        path - The path to a set of atlas files
        Returns:
        An RDD of Atlas objects. One Atlas object per slice.
      • binaryFile

        public static org.apache.spark.api.java.JavaPairRDD<java.lang.String,org.apache.spark.input.PortableDataStream> binaryFile(org.apache.spark.api.java.JavaSparkContext context,
                java.lang.String path)
        Get an RDD from a binary file input
        Parameters:
        context - The context from Spark
        path - The path to a set of binary files
        Returns:
        An RDD of pairs, each mapping a name to its value, from the binary files read.
      • sequenceFile

        public static <K extends org.apache.hadoop.io.Writable,V extends org.apache.hadoop.io.Writable> org.apache.spark.api.java.JavaPairRDD<K,V> sequenceFile(org.apache.spark.api.java.JavaSparkContext context,
                java.lang.String path,
                java.lang.Class<K> sequenceKeyClass,
                java.lang.Class<V> sequenceValueClass)
        Get an RDD from a sequence file input
        Type Parameters:
        K - The key type in the sequence file
        V - The value type in the sequence file
        Parameters:
        context - The context from Spark
        path - The path to a set of sequence files
        sequenceKeyClass - The class to expect for the sequence file keys
        sequenceValueClass - The class to expect for the sequence file values
        Returns:
        An RDD of key-value pairs from the sequence files read.
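        The sequenceKeyClass and sequenceValueClass arguments follow the usual Java class-token pattern: K and V are erased at runtime, so the caller presumably passes the classes explicitly to tell the reader which types to expect. The sketch below illustrates that pattern only, using plain Java collections in place of a Spark RDD. It is not SparkInput's implementation, and ClassTokenSketch and typedPairs are hypothetical names introduced for this example.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the class-token pattern; not part of SparkInput.
final class ClassTokenSketch
{
    private ClassTokenSketch()
    {
    }

    /**
     * Checks each untyped {key, value} array against the two class tokens and
     * returns typed pairs. Class.cast throws ClassCastException when an element
     * is not an instance of the expected class.
     */
    static <K, V> List<Map.Entry<K, V>> typedPairs(final List<Object[]> rawPairs,
            final Class<K> keyClass, final Class<V> valueClass)
    {
        final List<Map.Entry<K, V>> result = new ArrayList<>();
        for (final Object[] pair : rawPairs)
        {
            final K key = keyClass.cast(pair[0]);
            final V value = valueClass.cast(pair[1]);
            result.add(new SimpleImmutableEntry<>(key, value));
        }
        return result;
    }
}
```

        With SparkInput itself, the equivalent call would pass Hadoop Writable classes, for example SparkInput.sequenceFile(context, path, Text.class, LongWritable.class), assuming the sequence files actually hold org.apache.hadoop.io.Text keys and org.apache.hadoop.io.LongWritable values.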
      • textFile

        public static org.apache.spark.api.java.JavaRDD<java.lang.String> textFile(org.apache.spark.api.java.JavaSparkContext context,
                java.lang.String path)
        Get an RDD from a text file input
        Parameters:
        context - The context from Spark
        path - The path to open
        Returns:
        An RDD in which each item is one line of the input, as a String.
      • setFileSystemCreator

        protected static void setFileSystemCreator(FileSystemCreator fileSystemCreator)