Class RddRdfWriter<T>
java.lang.Object
net.sansa_stack.spark.io.rdf.output.RddRdfWriterSettings<RddRdfWriter<T>>
net.sansa_stack.spark.io.rdf.output.RddRdfWriter<T>
- Type Parameters:
T - the type of record being written (e.g. Triple, Quad, Model, Dataset)
A fluent API for configuring how to save an RDD of RDF data to disk.
This class uniformly handles Triples, Quads, Models, Datasets, etc. using a set of
lambdas for the relevant conversions.
Instances of this class should be created using the appropriate createFor[Type] methods.
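A minimal usage sketch of the fluent API (assuming a running Spark context with SANSA on the classpath; the RDD variable, prefix mapping and file paths below are placeholders, not part of this class):

```java
// triplesRdd is a JavaRDD<Triple> obtained elsewhere, e.g. from RddRdfLoader
RddRdfWriter.createForTriple()
    .setRdd(triplesRdd)
    .setGlobalPrefixMapping(prefixMapping)       // optional prefixes for the output
    .setOutputFormat(RDFFormat.TURTLE_BLOCKS)    // any supported RDFFormat
    .setPartitionFolder("/tmp/output-parts")     // folder for the Spark part files
    .setTargetFile("/tmp/output.ttl")            // single file merged from the parts
    .setAllowOverwriteFiles(true)
    .run();                                      // may throw IOException
```

All setters shown are inherited from RddRdfWriterSettings; run() executes the save action according to this configuration.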
-
Field Summary
Fields
protected RddRdfOpsImpl<T> dispatcher
protected org.apache.hadoop.conf.Configuration hadoopConfiguration
protected org.apache.spark.api.java.JavaRDD<? extends T> rdd
protected org.apache.spark.api.java.JavaSparkContext sparkContext
Fields inherited from class net.sansa_stack.spark.io.rdf.output.RddRdfWriterSettings:
allowOverwriteFiles, consoleOutSupplier, deferOutputForUsedPrefixes, deletePartitionFolderAfterMerge, globalPrefixMapping, mapQuadsToTriplesForTripleLangs, outputFormat, partitionFolder, partitionsAsIndependentFiles, postProcessingSettings, targetFile, useCoalesceOne, useElephas -
Constructor Summary
Constructors -
Method Summary
static RddRdfWriter<org.aksw.jenax.arq.dataset.api.DatasetOneNg> createForDataset()
static RddRdfWriter<org.aksw.jenax.arq.dataset.api.DatasetGraphOneNg> createForDatasetGraph()
static RddRdfWriter<org.apache.jena.graph.Graph> createForGraph()
static RddRdfWriter<org.apache.jena.rdf.model.Model> createForModel()
static RddRdfWriter<org.apache.jena.sparql.core.Quad> createForQuad()
static RddRdfWriter<org.apache.jena.graph.Triple> createForTriple()
static Function<OutputStream, org.apache.jena.riot.system.StreamRDF> createStreamRDFFactory(org.apache.jena.riot.RDFFormat rdfFormat, boolean mapQuadsToTriplesForTripleLangs, org.apache.jena.shared.PrefixMapping prefixMapping) - Create a function that can create a StreamRDF instance that is backed by the given OutputStream.
org.apache.spark.api.java.JavaRDD<T> getEffectiveRdd(RdfPostProcessingSettings settings) - Create the effective RDD w.r.t. the configuration.
org.apache.spark.api.java.JavaRDD<? extends T> getRdd()
static void mergeFolder(Path outFile, Path srcFolder, String pattern, Comparator<? super Path> pathComparator)
RddRdfWriter<T> mutate(Consumer<RddRdfWriter<T>> action) - Pass this object to a consumer.
partitionMapperNQuads(Iterator<org.apache.jena.sparql.core.Quad> it)
partitionMapperNTriples(Iterator<org.apache.jena.graph.Triple> it) - Save the RDD to a single file.
static <T> org.aksw.commons.lambda.throwing.ThrowingFunction<Iterator<T>, Iterator<String>> partitionMapperRDFStream(Function<OutputStream, org.apache.jena.riot.system.StreamRDF> streamRDFFactory, BiConsumer<? super T, org.apache.jena.riot.system.StreamRDF> sendRecordToWriter)
void run()
protected void runOutputToConsole()
void runSpark() - Run the save action according to the configuration.
void runUnchecked() - Same as run(), just without the checked IOException.
static void safeDeletePartitionFolder(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path folderPath, org.apache.hadoop.conf.Configuration conf) - This method first checks that all top-level files in the partition folder belong to Hadoop.
static <T> void saveToFolder(org.apache.spark.api.java.JavaRDD<T> javaRdd, String path, org.apache.jena.riot.RDFFormat rdfFormat, boolean mapQuadsToTriplesForTripleLangs, org.apache.jena.shared.PrefixMapping globalPrefixMapping, BiConsumer<T, org.apache.jena.riot.system.StreamRDF> sendRecordToStreamRDF) - Save the data in Trig/Turtle or one of their sub-formats (n-quads/n-triples).
static <T> void saveUsingElephas(org.apache.spark.api.java.JavaRDD<T> rdd, org.apache.hadoop.fs.Path path, org.apache.jena.riot.Lang lang, org.aksw.commons.lambda.serializable.SerializableFunction<? super T, ?> recordToWritable)
protected RddRdfWriter<T> self()
static <T> void sendToStreamRDF(org.apache.spark.api.java.JavaRDD<T> javaRdd, org.aksw.commons.lambda.serializable.SerializableBiConsumer<T, org.apache.jena.riot.system.StreamRDF> sendRecordToStreamRDF, org.aksw.commons.lambda.serializable.SerializableSupplier<org.apache.jena.riot.system.StreamRDF> streamRdfSupplier)
static String toString(org.apache.jena.shared.PrefixMapping prefixMapping, org.apache.jena.riot.RDFFormat rdfFormat) - Convert a prefix mapping to a string.
static void validate(RddRdfWriterSettings<?> settings)
static void validateOutFolder(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration conf, boolean deleteIfExists)
Methods inherited from class net.sansa_stack.spark.io.rdf.output.RddRdfWriterSettings:
configureFrom, getConsoleOutSupplier, getFallbackOutputFormat, getGlobalPrefixMapping, getOutputFormat, getPartitionFolder, getPostProcessingSettings, getTargetFile, isAllowOverwriteFiles, isConsoleOutput, isDeletePartitionFolderAfterMerge, isMapQuadsToTriplesForTripleLangs, isPartitionsAsIndependentFiles, isUseCoalesceOne, isUseElephas, setAllowOverwriteFiles, setConsoleOutput, setConsoleOutSupplier, setDeferOutputForUsedPrefixes, setDeletePartitionFolderAfterMerge, setGlobalPrefixMapping, setGlobalPrefixMapping, setMapQuadsToTriplesForTripleLangs, setOutputFormat, setOutputFormat, setPartitionFolder, setPartitionFolder, setPartitionsAsIndependentFiles, setPostProcessingSettings, setTargetFile, setTargetFile, setUseCoalesceOne, setUseElephas
-
Field Details
-
dispatcher
protected RddRdfOpsImpl<T> dispatcher
-
sparkContext
protected org.apache.spark.api.java.JavaSparkContext sparkContext
-
rdd
protected org.apache.spark.api.java.JavaRDD<? extends T> rdd
-
hadoopConfiguration
protected org.apache.hadoop.conf.Configuration hadoopConfiguration
-
Constructor Details
-
RddRdfWriter
-
-
Method Details
-
setRdd
-
getRdd
-
self
- Overrides:
self in class RddRdfWriterSettings<RddRdfWriter<T>>
-
mutate
Pass this object to a consumer. Useful to conditionally configure this object without breaking the fluent chain:

rdd.configureSave()
   .mutate(self -> { if (condition) { self.setX(); } })
   .run();

- Parameters:
action - the consumer that receives this writer
- Returns:
this writer, for continued fluent chaining
-
runUnchecked
public void runUnchecked()
Same as run(), just without the checked IOException. -
run
- Throws:
IOException
-
getEffectiveRdd
Create the effective RDD w.r.t. the configuration (sort, unique, optimize prefixes). If prefix optimization is enabled, then invoking this method immediately performs that analysis. The current behavior is that this writer's prefix map will be updated to the prefixes actually used. However, this is subject to change such that a new writer instance with the used prefixes is created instead. -
runOutputToConsole
- Throws:
IOException
-
safeDeletePartitionFolder
public static void safeDeletePartitionFolder(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path folderPath, org.apache.hadoop.conf.Configuration conf) throws IOException
This method first checks that all top-level files in the partition folder belong to Hadoop. If this is the case, then a single recursive delete is made. Note that concurrent modifications could still cause those files to be deleted. Raises {@link DirectoryNotEmptyException} if not all files can be removed. If the directory does not exist, then it is ignored.
- Throws:
IOException
-
runSpark
Run the save action according to the configuration.
- Throws:
IOException
-
validateOutFolder
public static void validateOutFolder(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration conf, boolean deleteIfExists) throws IOException - Throws:
IOException
-
mergeFolder
public static void mergeFolder(Path outFile, Path srcFolder, String pattern, Comparator<? super Path> pathComparator) throws IOException - Throws:
IOException
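The merge step can be pictured with the following self-contained java.nio sketch. This is a hypothetical re-implementation for illustration, not the actual method body: it concatenates every file in srcFolder whose name matches the glob pattern into outFile, in the order given by pathComparator.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MergeFolderSketch {
    // Concatenate the matching part files of srcFolder into outFile.
    static void mergeFolder(Path outFile, Path srcFolder, String pattern,
                            Comparator<? super Path> pathComparator) throws IOException {
        // Collect the part files matching the glob pattern (e.g. "part*")
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> dir = Files.newDirectoryStream(srcFolder, pattern)) {
            dir.forEach(parts::add);
        }
        // Sort so that e.g. part-00000 is written before part-00001
        parts.sort(pathComparator);
        try (OutputStream out = Files.newOutputStream(outFile)) {
            for (Path part : parts) {
                Files.copy(part, out); // append this part file's bytes to the target
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("merge-demo");
        Files.writeString(tmp.resolve("part-00000"), "line a\n");
        Files.writeString(tmp.resolve("part-00001"), "line b\n");
        Path merged = tmp.resolve("merged.nt");
        mergeFolder(merged, tmp, "part*", Comparator.naturalOrder());
        System.out.print(Files.readString(merged)); // the two lines, in part order
    }
}
```

The real method additionally has to cope with Hadoop-managed folders (hence safeDeletePartitionFolder and the deletePartitionFolderAfterMerge setting).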
-
partitionMapperNTriples
Save the RDD to a single file. Underneath this invokes JenaDatasetWriter#saveToFolder and merges the set of files created by it. See JenaDatasetWriter#saveToFolder for supported formats. The logic corresponds to this Scala sketch from the original documentation:

def saveToFile(outFile: String, prefixMapping: PrefixMapping, rdfFormat: RDFFormat,
               outFolder: String, mode: io.SaveMode.Value = SaveMode.ErrorIfExists,
               exitOnError: Boolean = false): Unit = {
  val outFilePath = Paths.get(outFile).toAbsolutePath
  val outFileFileName = outFilePath.getFileName.toString
  val outFolderPath =
    if (outFolder == null) outFilePath.resolveSibling(outFileFileName + "-parts")
    else Paths.get(outFolder).toAbsolutePath
  saveToFolder(outFolderPath.toString, prefixMapping, rdfFormat, mode, exitOnError)
  mergeFolder(outFilePath, outFolderPath, "part*")
}
-
partitionMapperNQuads
-
partitionMapperRDFStream
public static <T> org.aksw.commons.lambda.throwing.ThrowingFunction<Iterator<T>,Iterator<String>> partitionMapperRDFStream(Function<OutputStream, org.apache.jena.riot.system.StreamRDF> streamRDFFactory, BiConsumer<? super T, org.apache.jena.riot.system.StreamRDF> sendRecordToWriter) -
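The shape of this partition mapper, reduced to plain JDK types, is sketched below. PrintWriter stands in for Jena's StreamRDF, and the result is materialized as a List rather than a lazy Iterator; names and signature are hypothetical, chosen only to mirror the pattern of streaming a partition's records through a writer backed by an OutputStream:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Function;

public class PartitionMapperSketch {
    // A factory creates a writer over an OutputStream, each record of the
    // partition is sent to that writer, and the buffered bytes are returned
    // as lines of text.
    static <T> List<String> mapPartition(Iterator<T> partition,
                                         Function<OutputStream, PrintWriter> writerFactory,
                                         BiConsumer<? super T, PrintWriter> sendRecordToWriter) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintWriter writer = writerFactory.apply(buffer);
        while (partition.hasNext()) {
            sendRecordToWriter.accept(partition.next(), writer);
        }
        writer.flush();
        return Arrays.asList(buffer.toString(StandardCharsets.UTF_8).split("\n"));
    }

    public static void main(String[] args) {
        List<String> lines = mapPartition(
            List.of("<s> <p> <o1> .", "<s> <p> <o2> .").iterator(),
            os -> new PrintWriter(new OutputStreamWriter(os, StandardCharsets.UTF_8)),
            (record, writer) -> writer.print(record + "\n"));
        lines.forEach(System.out::println); // two N-Triples-style lines
    }
}
```

In the real method the writer factory is typically produced by createStreamRDFFactory, and the returned function is handed to Spark's mapPartitions so that serialization happens once per partition rather than once per record.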
saveUsingElephas
public static <T> void saveUsingElephas(org.apache.spark.api.java.JavaRDD<T> rdd, org.apache.hadoop.fs.Path path, org.apache.jena.riot.Lang lang, org.aksw.commons.lambda.serializable.SerializableFunction<? super T, ?> recordToWritable) -
createForTriple
-
createForQuad
-
createForGraph
-
createForDatasetGraph
public static RddRdfWriter<org.aksw.jenax.arq.dataset.api.DatasetGraphOneNg> createForDatasetGraph() -
createForModel
-
createForDataset
-
validate
-
sendToStreamRDF
public static <T> void sendToStreamRDF(org.apache.spark.api.java.JavaRDD<T> javaRdd, org.aksw.commons.lambda.serializable.SerializableBiConsumer<T, org.apache.jena.riot.system.StreamRDF> sendRecordToStreamRDF, org.aksw.commons.lambda.serializable.SerializableSupplier<org.apache.jena.riot.system.StreamRDF> streamRdfSupplier)
-