Class SparkJob

  • All Implemented Interfaces:
    java.io.Serializable
    Direct Known Subclasses:
    AtlasGenerator, EmptySparkJob, ShardedSparkJob

    public abstract class SparkJob
    extends org.openstreetmap.atlas.utilities.runtime.Command
    implements java.io.Serializable
    Skeleton for a Spark job. Subclasses supply the job name through getName() and the job logic through start(CommandMap).
    See Also:
    Serialized Form
    • Nested Class Summary

      • Nested classes/interfaces inherited from class org.openstreetmap.atlas.utilities.runtime.Command

        org.openstreetmap.atlas.utilities.runtime.Command.Flag, org.openstreetmap.atlas.utilities.runtime.Command.Optionality, org.openstreetmap.atlas.utilities.runtime.Command.Switch<T extends java.lang.Object>, org.openstreetmap.atlas.utilities.runtime.Command.SwitchList
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static org.openstreetmap.atlas.utilities.runtime.Command.Switch<java.util.Map<java.lang.String,​java.lang.String>> ADDITIONAL_SPARK_OPTIONS  
      static org.openstreetmap.atlas.utilities.runtime.Command.Switch<java.lang.String> CLUSTER  
      static org.openstreetmap.atlas.utilities.runtime.Command.Switch<java.lang.String> COMPRESS_OUTPUT  
      static java.lang.String FAILED_FILE  
      static org.openstreetmap.atlas.utilities.runtime.Command.Switch<java.lang.String> INPUT  
      static org.openstreetmap.atlas.utilities.runtime.Command.Switch<java.lang.String> OUTPUT  
      static java.lang.String SAVING_SEPARATOR  
      static org.openstreetmap.atlas.utilities.runtime.Command.Switch<SparkContextProvider> SPARK_CONTEXT_PROVIDER  
      static org.openstreetmap.atlas.utilities.runtime.Command.Switch<java.util.Map<java.lang.String,​java.lang.String>> SPARK_OPTIONS  
      static java.lang.String SUCCESS_FILE  
    • Constructor Summary

      Constructors 
      Constructor Description
      SparkJob()  
    • Method Summary

      Modifier and Type Method Description
      protected org.apache.hadoop.conf.Configuration configuration()  
      protected java.util.Map<java.lang.String,​java.lang.String> configurationMap()  
      protected void copyToOutput​(org.openstreetmap.atlas.utilities.runtime.CommandMap command, java.lang.String input, java.lang.String output)  
      protected java.lang.String getAlternateParallelFolderOutput​(java.lang.String output, java.lang.String name)
      Get an alternate output, used for monitoring, based on the main output folder.
      protected java.lang.String getAlternateSubFolderOutput​(java.lang.String output, java.lang.String name)
      Get an alternate output, used for monitoring, based on the main output folder.
      protected org.apache.spark.api.java.JavaSparkContext getContext()  
      abstract java.lang.String getName()  
      protected java.lang.String input​(org.openstreetmap.atlas.utilities.runtime.CommandMap command)  
      int onRun​(org.openstreetmap.atlas.utilities.runtime.CommandMap command)  
      protected java.lang.String output​(org.openstreetmap.atlas.utilities.runtime.CommandMap command)  
      protected java.util.List<java.lang.String> outputToClean​(org.openstreetmap.atlas.utilities.runtime.CommandMap command)
      Define all the folders to clean before a run.
      protected org.openstreetmap.atlas.streaming.resource.Resource resource​(java.lang.String path)  
      static org.openstreetmap.atlas.streaming.resource.Resource resource​(java.lang.String path, java.util.Map<java.lang.String,​java.lang.String> configurationMap)  
      protected void setContext​(org.apache.spark.api.java.JavaSparkContext context)  
      protected <T> void splitAndSaveAsHadoopFile​(org.apache.spark.api.java.JavaPairRDD<java.lang.String,​T> input, java.lang.String path, java.lang.Class<T> valueClass, java.lang.Class<? extends org.apache.hadoop.mapred.lib.MultipleOutputFormat<java.lang.String,​T>> formatterClass, java.util.function.UnaryOperator<java.lang.String> keyReducer)
      Instead of saving a full RDD(String, T) in a single folder, this function saves subsets of an RDD(String, T) in separate folders.
      abstract void start​(org.openstreetmap.atlas.utilities.runtime.CommandMap command)
      The body of the Spark job.
      protected org.openstreetmap.atlas.utilities.runtime.Command.SwitchList switches()  
      • Methods inherited from class org.openstreetmap.atlas.utilities.runtime.Command

        commandSummary, getCommandMap, lastRawCommand, run, runWithoutQuitting
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • INPUT

        public static final org.openstreetmap.atlas.utilities.runtime.Command.Switch<java.lang.String> INPUT
      • OUTPUT

        public static final org.openstreetmap.atlas.utilities.runtime.Command.Switch<java.lang.String> OUTPUT
      • CLUSTER

        public static final org.openstreetmap.atlas.utilities.runtime.Command.Switch<java.lang.String> CLUSTER
      • SPARK_OPTIONS

        public static final org.openstreetmap.atlas.utilities.runtime.Command.Switch<java.util.Map<java.lang.String,​java.lang.String>> SPARK_OPTIONS
      • ADDITIONAL_SPARK_OPTIONS

        public static final org.openstreetmap.atlas.utilities.runtime.Command.Switch<java.util.Map<java.lang.String,​java.lang.String>> ADDITIONAL_SPARK_OPTIONS
      • COMPRESS_OUTPUT

        public static final org.openstreetmap.atlas.utilities.runtime.Command.Switch<java.lang.String> COMPRESS_OUTPUT
      • SPARK_CONTEXT_PROVIDER

        public static final org.openstreetmap.atlas.utilities.runtime.Command.Switch<SparkContextProvider> SPARK_CONTEXT_PROVIDER
      • SAVING_SEPARATOR

        public static final java.lang.String SAVING_SEPARATOR
        See Also:
        Constant Field Values
    • Constructor Detail

      • SparkJob

        public SparkJob()
    • Method Detail

      • resource

        public static org.openstreetmap.atlas.streaming.resource.Resource resource​(java.lang.String path,
                                                                                   java.util.Map<java.lang.String,​java.lang.String> configurationMap)
      • getName

        public abstract java.lang.String getName()
        Returns:
        The name of the job
      • onRun

        public int onRun​(org.openstreetmap.atlas.utilities.runtime.CommandMap command)
        Specified by:
        onRun in class org.openstreetmap.atlas.utilities.runtime.Command
      • start

        public abstract void start​(org.openstreetmap.atlas.utilities.runtime.CommandMap command)
        The body of the Spark job.
        Parameters:
        command - The arguments passed to the main method
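
        The relationship between getName(), start and onRun can be pictured with the sketch below. It does not use the real classes: JobSkeleton and EchoJob are hypothetical stand-ins, a plain Map replaces CommandMap, and the flow in onRun (run start, then return 0) is an assumption rather than something this page states. The real SparkJob also manages a JavaSparkContext, which is left out here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SparkJobSkeletonSketch
{
    // Hypothetical stand-in for SparkJob, NOT the real class. The real one extends
    // Command, receives a CommandMap, and holds a JavaSparkContext.
    abstract static class JobSkeleton
    {
        public abstract String getName();

        public abstract void start(Map<String, String> command);

        // Assumed flow: run the job body, then report success with exit code 0.
        public int onRun(final Map<String, String> command)
        {
            start(command);
            return 0;
        }
    }

    // Hypothetical job that only records the input and output it was given.
    static class EchoJob extends JobSkeleton
    {
        final List<String> log = new ArrayList<>();

        @Override
        public String getName()
        {
            return "Echo Job";
        }

        @Override
        public void start(final Map<String, String> command)
        {
            this.log.add(getName() + ": " + command.get("input") + " -> "
                    + command.get("output"));
        }
    }

    public static void main(final String[] args)
    {
        final EchoJob job = new EchoJob();
        // The keys "input" and "output" are illustrative, not the real switch names.
        final int exitCode = job.onRun(Map.of("input", "/data/in", "output", "/data/out"));
        System.out.println(exitCode + " " + job.log);
    }
}
```

        A concrete job would keep the same shape: only getName() and start are written by the subclass, while the skeleton owns the run sequence.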
      • configuration

        protected org.apache.hadoop.conf.Configuration configuration()
      • configurationMap

        protected java.util.Map<java.lang.String,​java.lang.String> configurationMap()
      • copyToOutput

        protected void copyToOutput​(org.openstreetmap.atlas.utilities.runtime.CommandMap command,
                                    java.lang.String input,
                                    java.lang.String output)
      • getAlternateParallelFolderOutput

        protected java.lang.String getAlternateParallelFolderOutput​(java.lang.String output,
                                                                    java.lang.String name)
        Get an alternate output, used for monitoring, based on the main output folder. Use a parallel folder.
        Parameters:
        output - The main output folder
        name - The name of the alternate folder
        Returns:
        The alternate output
      • getAlternateSubFolderOutput

        protected java.lang.String getAlternateSubFolderOutput​(java.lang.String output,
                                                               java.lang.String name)
        Get an alternate output, used for monitoring, based on the main output folder. Use the sub-folder.
        Parameters:
        output - The main output folder
        name - The name of the alternate folder
        Returns:
        The alternate output
      • getContext

        protected org.apache.spark.api.java.JavaSparkContext getContext()
      • input

        protected java.lang.String input​(org.openstreetmap.atlas.utilities.runtime.CommandMap command)
      • output

        protected java.lang.String output​(org.openstreetmap.atlas.utilities.runtime.CommandMap command)
      • outputToClean

        protected java.util.List<java.lang.String> outputToClean​(org.openstreetmap.atlas.utilities.runtime.CommandMap command)
        Define all the folders to clean before a run.
        Parameters:
        command - The command parameters sent to the main class.
        Returns:
        All the paths to clean
      • resource

        protected org.openstreetmap.atlas.streaming.resource.Resource resource​(java.lang.String path)
        Parameters:
        path - The path to open (in URL format)
        Returns:
        The resource at this path
      • setContext

        protected void setContext​(org.apache.spark.api.java.JavaSparkContext context)
      • splitAndSaveAsHadoopFile

        protected <T> void splitAndSaveAsHadoopFile​(org.apache.spark.api.java.JavaPairRDD<java.lang.String,​T> input,
                                                    java.lang.String path,
                                                    java.lang.Class<T> valueClass,
                                                    java.lang.Class<? extends org.apache.hadoop.mapred.lib.MultipleOutputFormat<java.lang.String,​T>> formatterClass,
                                                    java.util.function.UnaryOperator<java.lang.String> keyReducer)
        Instead of saving a full RDD(String, T) in a single folder, this function saves subsets of an RDD(String, T) in separate folders. The keyReducer function maps each key to the String by which that key is grouped. For example, an RDD with keys "aaa_1", "aaa_2", and "bbb_1", combined with a function that takes the first part of the key as the grouping key, is saved in two different folders. If the path is /path/to/output, then the two folders will be /path/to-aaa/ and /path/to-bbb/.

        This function can be slow because it generates one Spark stage for each category in the RDD. The example above creates two stages. Run time grows with the number of categories.

        Type Parameters:
        T - The type of the object to save
        Parameters:
        input - The RDD to save
        path - The output path of the job
        valueClass - The type to save as Hadoop file
        formatterClass - The corresponding Hadoop formatter
        keyReducer - The key reducing function explained above.
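
        The grouping performed by keyReducer can be illustrated without Spark. The sketch below applies a reducer to the keys from the example above and shows which keys end up in the same group. The class name KeyReducerSketch, the FIRST_PART reducer, and the groupByReducedKey helper are hypothetical and exist only for this illustration; how the real method turns each group into a folder name is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class KeyReducerSketch
{
    // Hypothetical reducer: keeps the part of the key before the first underscore.
    static final UnaryOperator<String> FIRST_PART = key ->
    {
        final int index = key.indexOf('_');
        return index < 0 ? key : key.substring(0, index);
    };

    // Groups keys by their reduced key, preserving the order in which groups appear.
    static Map<String, List<String>> groupByReducedKey(final List<String> keys,
            final UnaryOperator<String> keyReducer)
    {
        final Map<String, List<String>> groups = new LinkedHashMap<>();
        for (final String key : keys)
        {
            groups.computeIfAbsent(keyReducer.apply(key), ignored -> new ArrayList<>())
                    .add(key);
        }
        return groups;
    }

    public static void main(final String[] args)
    {
        final Map<String, List<String>> groups = groupByReducedKey(
                List.of("aaa_1", "aaa_2", "bbb_1"), FIRST_PART);
        // Two groups, matching the two folders (and two stages) of the example.
        System.out.println(groups); // prints {aaa=[aaa_1, aaa_2], bbb=[bbb_1]}
    }
}
```

        A reducer passed to splitAndSaveAsHadoopFile would have the same UnaryOperator<String> shape as FIRST_PART. Because each distinct reduced key costs one stage, a reducer that yields few distinct values keeps the save cheaper.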
      • switches

        protected org.openstreetmap.atlas.utilities.runtime.Command.SwitchList switches()
        Specified by:
        switches in class org.openstreetmap.atlas.utilities.runtime.Command