Class RddRdfWriter<T>

java.lang.Object
net.sansa_stack.spark.io.rdf.output.RddRdfWriterSettings<RddRdfWriter<T>>
net.sansa_stack.spark.io.rdf.output.RddRdfWriter<T>
Type Parameters:
T - the type of record being written, e.g. Triple, Quad, Model or Dataset

public class RddRdfWriter<T> extends RddRdfWriterSettings<RddRdfWriter<T>>
A fluent API for configuring how to save an RDD of RDF data to disk. This class uniformly handles Triples, Quads, Models, Datasets, etc. using a set of lambdas for the relevant conversions. Instances of this class should be created using the appropriate createFor[Type] methods.
  • Field Details

    • dispatcher

      protected RddRdfOpsImpl<T> dispatcher
    • sparkContext

      protected org.apache.spark.api.java.JavaSparkContext sparkContext
    • rdd

      protected org.apache.spark.api.java.JavaRDD<? extends T> rdd
    • hadoopConfiguration

      protected org.apache.hadoop.conf.Configuration hadoopConfiguration
  • Constructor Details

  • Method Details

    • setRdd

      public RddRdfWriter<T> setRdd(org.apache.spark.api.java.JavaRDD<? extends T> rdd)
    • getRdd

      public org.apache.spark.api.java.JavaRDD<? extends T> getRdd()
    • self

      protected RddRdfWriter<T> self()
      Overrides:
      self in class RddRdfWriterSettings<RddRdfWriter<T>>
    • mutate

      public RddRdfWriter<T> mutate(Consumer<RddRdfWriter<T>> action)
      Pass this object to a consumer. Useful to conditionally configure this object without breaking the fluent chain:
          rdd.configureSave().mutate(self -> { if (condition) { self.setX(); }}).run();
       
      Parameters:
      action - a consumer that receives this writer
      Returns:
      this writer, for further fluent chaining
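      The mutate idiom can be illustrated with a small self-contained builder. This is a hypothetical stand-in, not part of SANSA; it only demonstrates the pattern of applying a consumer conditionally without breaking the fluent chain:

      ```java
      import java.util.function.Consumer;

      // Minimal stand-in builder (hypothetical, not part of SANSA) illustrating
      // the mutate(Consumer) idiom used by RddRdfWriter.
      class FluentConfig {
          private boolean prettyPrint = false;
          private String outputPath = null;

          FluentConfig setPrettyPrint(boolean b) { this.prettyPrint = b; return this; }
          FluentConfig setOutputPath(String p) { this.outputPath = p; return this; }

          // Pass this object to the consumer, then return it for further chaining.
          FluentConfig mutate(Consumer<FluentConfig> action) {
              action.accept(this);
              return this;
          }

          String describe() { return "path=" + outputPath + ", pretty=" + prettyPrint; }
      }
      ```

      A caller can then toggle settings inline, e.g. `new FluentConfig().setOutputPath("/tmp/out").mutate(c -> { if (verbose) { c.setPrettyPrint(true); } })`, mirroring the `rdd.configureSave().mutate(...)` example above.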
    • runUnchecked

      public void runUnchecked()
      Same as run(), but without the checked IOException.
    • run

      public void run() throws IOException
      Throws:
      IOException
    • getEffectiveRdd

      public org.apache.spark.api.java.JavaRDD<T> getEffectiveRdd(RdfPostProcessingSettings settings)
      Create the effective RDD w.r.t. the configuration (sort, unique, optimize prefixes). If prefix optimization is enabled, invoking this method immediately performs that analysis. The current behavior is that this writer's prefix map is updated to the prefixes actually used; however, this is subject to change such that a new writer instance with the used prefixes is created instead.
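      Conceptually, the post-processing settings select a chain of transformations. The sketch below is an assumption-laden analogy using plain Java streams in place of Spark RDD operations (distinct/sortBy), not the actual implementation:

      ```java
      import java.util.List;
      import java.util.stream.Collectors;
      import java.util.stream.Stream;

      // Conceptual sketch of the post-processing behind getEffectiveRdd:
      // optional de-duplication and optional sorting, applied in order.
      class EffectivePipeline {
          static List<String> effective(List<String> records, boolean unique, boolean sort) {
              Stream<String> s = records.stream();
              if (unique) { s = s.distinct(); } // analogous to RDD.distinct()
              if (sort)   { s = s.sorted(); }   // analogous to RDD.sortBy(...)
              return s.collect(Collectors.toList());
          }
      }
      ```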
    • runOutputToConsole

      protected void runOutputToConsole() throws IOException
      Throws:
      IOException
    • safeDeletePartitionFolder

      public static void safeDeletePartitionFolder(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path folderPath, org.apache.hadoop.conf.Configuration conf) throws IOException
      This method first checks that all top-level files in the partition folder belong to Hadoop. If this is the case, then a single recursive delete is performed. Note that concurrent modifications could still cause unrelated files to be deleted. Raises {@link DirectoryNotEmptyException} if not all files can be removed. If the directory does not exist, it is ignored.
      Throws:
      IOException
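      The check-then-delete contract can be sketched against the local filesystem with java.nio. The file-name patterns taken to mean "belongs to Hadoop" (part-*, underscore/dot-prefixed markers such as _SUCCESS) are assumptions for illustration; the real method works against the Hadoop FileSystem API:

      ```java
      import java.io.IOException;
      import java.io.UncheckedIOException;
      import java.nio.file.DirectoryNotEmptyException;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.util.Comparator;
      import java.util.stream.Stream;

      // Local-filesystem sketch of the safeDeletePartitionFolder logic.
      class SafeDelete {
          // Assumed heuristic for "belongs to Hadoop": part files and markers.
          static boolean belongsToHadoop(Path p) {
              String name = p.getFileName().toString();
              return name.startsWith("part-") || name.startsWith("_") || name.startsWith(".");
          }

          static void safeDeletePartitionFolder(Path folder) throws IOException {
              if (!Files.isDirectory(folder)) {
                  return; // a non-existent folder is ignored
              }
              // Check all top-level entries before deleting anything.
              try (Stream<Path> entries = Files.list(folder)) {
                  if (!entries.allMatch(SafeDelete::belongsToHadoop)) {
                      throw new DirectoryNotEmptyException(folder.toString());
                  }
              }
              // All entries look like Hadoop output: delete recursively,
              // deepest paths first so directories are empty when removed.
              try (Stream<Path> walk = Files.walk(folder)) {
                  walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                      try { Files.delete(p); } catch (IOException e) { throw new UncheckedIOException(e); }
                  });
              }
          }
      }
      ```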
    • runSpark

      public void runSpark() throws IOException
      Run the save action according to the configuration.
      Throws:
      IOException
    • validateOutFolder

      public static void validateOutFolder(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration conf, boolean deleteIfExists) throws IOException
      Throws:
      IOException
    • toString

      public static String toString(org.apache.jena.shared.PrefixMapping prefixMapping, org.apache.jena.riot.RDFFormat rdfFormat)
      Convert a prefix mapping to a string in the given RDF format.
    • mergeFolder

      public static void mergeFolder(Path outFile, Path srcFolder, String pattern, Comparator<? super Path> pathComparator) throws IOException
      Throws:
      IOException
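      A simplified local-filesystem sketch of the mergeFolder contract — select the files in srcFolder matching a glob pattern (e.g. "part*"), order them with the given comparator, and concatenate their bytes into outFile — might look like:

      ```java
      import java.io.IOException;
      import java.io.OutputStream;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.PathMatcher;
      import java.util.Comparator;
      import java.util.List;
      import java.util.stream.Collectors;
      import java.util.stream.Stream;

      // Simplified stand-in for mergeFolder: glob filter, comparator-defined
      // order, byte-level concatenation of the part files.
      class MergeFolder {
          static void mergeFolder(Path outFile, Path srcFolder, String pattern,
                                  Comparator<? super Path> pathComparator) throws IOException {
              PathMatcher matcher = srcFolder.getFileSystem().getPathMatcher("glob:" + pattern);
              List<Path> parts;
              try (Stream<Path> s = Files.list(srcFolder)) {
                  parts = s.filter(p -> matcher.matches(p.getFileName()))
                           .sorted(pathComparator)
                           .collect(Collectors.toList());
              }
              try (OutputStream out = Files.newOutputStream(outFile)) {
                  for (Path part : parts) {
                      Files.copy(part, out); // append each part file in order
                  }
              }
          }
      }
      ```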
    • partitionMapperNTriples

      public static Iterator<String> partitionMapperNTriples(Iterator<org.apache.jena.graph.Triple> it)
      Map a partition's triples to their N-Triples line serializations.

    • partitionMapperNQuads

      public static Iterator<String> partitionMapperNQuads(Iterator<org.apache.jena.sparql.core.Quad> it)
      Map a partition's quads to their N-Quads line serializations.
    • createStreamRDFFactory

      public static Function<OutputStream,org.apache.jena.riot.system.StreamRDF> createStreamRDFFactory(org.apache.jena.riot.RDFFormat rdfFormat, boolean mapQuadsToTriplesForTripleLangs, org.apache.jena.shared.PrefixMapping prefixMapping)
      Create a function that can create a StreamRDF instance that is backed by the given OutputStream.
      Parameters:
      rdfFormat - the RDF serialization format
      mapQuadsToTriplesForTripleLangs - if true, quads are written as triples when the format is a triple-based language
      prefixMapping - the prefixes to emit
      Returns:
      a function that, given an OutputStream, yields a StreamRDF writing to it
    • partitionMapperRDFStream

      public static <T> org.aksw.commons.lambda.throwing.ThrowingFunction<Iterator<T>,Iterator<String>> partitionMapperRDFStream(Function<OutputStream,org.apache.jena.riot.system.StreamRDF> streamRDFFactory, BiConsumer<? super T,org.apache.jena.riot.system.StreamRDF> sendRecordToWriter)
    • saveToFolder

      public static <T> void saveToFolder(org.apache.spark.api.java.JavaRDD<T> javaRdd, String path, org.apache.jena.riot.RDFFormat rdfFormat, boolean mapQuadsToTriplesForTripleLangs, org.apache.jena.shared.PrefixMapping globalPrefixMapping, BiConsumer<T,org.apache.jena.riot.system.StreamRDF> sendRecordToStreamRDF) throws IOException
      Save the data in TriG/Turtle or one of their sub-formats (N-Quads/N-Triples). If prefixes should be written out, then they have to be provided via the prefixMapping parameter. Prefix mappings are broadcast to and processed in a .mapPartitions operation. If the prefixMapping is non-empty, then the first part file written out contains them; no other partition writes out prefixes.
      Parameters:
      path - the folder into which the file(s) will be written
      Throws:
      IOException
    • saveUsingElephas

      public static <T> void saveUsingElephas(org.apache.spark.api.java.JavaRDD<T> rdd, org.apache.hadoop.fs.Path path, org.apache.jena.riot.Lang lang, org.aksw.commons.lambda.serializable.SerializableFunction<? super T,?> recordToWritable)
    • createForTriple

      public static RddRdfWriter<org.apache.jena.graph.Triple> createForTriple()
    • createForQuad

      public static RddRdfWriter<org.apache.jena.sparql.core.Quad> createForQuad()
    • createForGraph

      public static RddRdfWriter<org.apache.jena.graph.Graph> createForGraph()
    • createForDatasetGraph

      public static RddRdfWriter<org.aksw.jenax.arq.dataset.api.DatasetGraphOneNg> createForDatasetGraph()
    • createForModel

      public static RddRdfWriter<org.apache.jena.rdf.model.Model> createForModel()
    • createForDataset

      public static RddRdfWriter<org.aksw.jenax.arq.dataset.api.DatasetOneNg> createForDataset()
    • validate

      public static void validate(RddRdfWriterSettings<?> settings)
    • sendToStreamRDF

      public static <T> void sendToStreamRDF(org.apache.spark.api.java.JavaRDD<T> javaRdd, org.aksw.commons.lambda.serializable.SerializableBiConsumer<T,org.apache.jena.riot.system.StreamRDF> sendRecordToStreamRDF, org.aksw.commons.lambda.serializable.SerializableSupplier<org.apache.jena.riot.system.StreamRDF> streamRdfSupplier)