public class Hfs extends cascading.tap.Tap<Configuration,RecordReader,OutputCollector> implements cascading.tap.type.FileType<Configuration>, cascading.tap.type.TapWith<Configuration,RecordReader,OutputCollector>
Hfs is a Tap for Hadoop file system resources. It is used with FlowConnector sub-classes when creating Hadoop executable Flow
instances.
Paths should typically point to a directory; all the "part" files immediately in that directory are then included. This is the practice Hadoop expects. Subdirectories are not included and typically cause a failure.
To include subdirectories, Hadoop supports "globbing". Globbing is a frustrating feature that is supported more
robustly by GlobHfs and less so by Hfs.
Or in later versions of Hadoop (2.x+), set this property to true:
"mapreduce.input.fileinputformat.input.dir.recursive"
Hfs will accept /* (wildcard) paths, but not all convenience methods, such as
getSize(Configuration), will behave properly or reliably. Nor can an Hfs instance
with a wildcard path be used as a sink to write data.
In those cases use GlobHfs, since it is a sub-class of MultiSourceTap.
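For the recursive-input property quoted above, a minimal sketch of how the flag might be placed into a properties object follows. The class and method names are hypothetical, and the assumption that such a properties object is later handed to a flow connector is not taken from this page; only the property key itself is.

```java
import java.util.Properties;

// Hypothetical helper, not part of Cascading or Hadoop.
final class RecursiveInputProps {
    // Property key quoted in the text above (Hadoop 2.x+).
    static final String RECURSIVE_KEY =
        "mapreduce.input.fileinputformat.input.dir.recursive";

    // Returns a copy of the given properties with recursive input enabled.
    static Properties withRecursiveInput(Properties base) {
        Properties props = new Properties();
        props.putAll(base);
        props.setProperty(RECURSIVE_KEY, "true");
        return props;
    }
}
```

Copying into a new object leaves the caller's properties untouched, which keeps the flag scoped to the flows that actually need it.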
Optionally use Dfs or Lfs for resources specific to the Hadoop Distributed File System or
the local file system, respectively. Using Hfs is the best practice when possible; Lfs and Dfs are conveniences.
Use the Hfs class if the 'kind' of resource is unknown at design time. To do so, prefix a scheme to the 'stringPath', where
hdfs://... denotes Dfs and file://... denotes Lfs.
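The scheme-prefix convention can be illustrated with plain java.net.URI parsing. This is only a sketch of the naming convention described above, not how Hfs resolves file systems internally; the class name, method name, and the "default file system" label for scheme-less paths are assumptions.

```java
import java.net.URI;

// Hypothetical helper illustrating the hdfs:// vs file:// convention.
final class TapKind {
    // Maps a string path to the kind of tap its URI scheme denotes.
    static String kindFor(String stringPath) {
        String scheme = URI.create(stringPath).getScheme();
        if ("hdfs".equalsIgnoreCase(scheme)) return "Dfs";
        if ("file".equalsIgnoreCase(scheme)) return "Lfs";
        // Assumption: any other or missing scheme is left to the default file system.
        return "default file system";
    }
}
```

For instance, `TapKind.kindFor("hdfs://namenode:8020/data/in")` yields `"Dfs"` and `TapKind.kindFor("file:///tmp/in")` yields `"Lfs"`; the host name and paths are placeholders.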
Call HfsProps.setTemporaryDirectory(java.util.Map, String) to use a temporary file directory path
other than the current Hadoop default path.
By default, Cascading on Hadoop assumes that any source or sink Tap using the file:// URI scheme
intends to read files from the local client filesystem (for example when using the Lfs Tap) where the Hadoop
job jar is started. Consequently, Cascading forces any MapReduce jobs reading or writing file:// resources
to run in Hadoop "standalone mode" so that the file can be read.
To change this behavior, call HfsProps.setLocalModeScheme(java.util.Map, String) to set a different scheme value,
or pass "none" to disable it entirely for the case where the file to be read is available on every Hadoop processing node
at the exact same path.
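The rule described above can be modeled as a simple scheme comparison. The sketch below is a simplified model under stated assumptions, not the actual Hfs logic: the class and method names are hypothetical, and "none" is treated as an ordinary value that no real resource scheme matches.

```java
import java.net.URI;

// Hypothetical model of the local mode scheme rule.
final class LocalModeCheck {
    // True when the resource's URI scheme equals the configured local mode scheme,
    // which is the case in which standalone mode would be forced.
    static boolean forcesStandalone(String resource, String localModeScheme) {
        String scheme = URI.create(resource).getScheme();
        return scheme != null && scheme.equalsIgnoreCase(localModeScheme);
    }
}
```

With the default value "file", a file:// resource triggers standalone mode; with "none", the same resource does not, because no resource uses "none" as its scheme.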
When using a MapReduce planner, Hfs can optionally combine multiple small files (or a series of small "blocks") into larger "splits". This reduces the number of resulting map tasks created by Hadoop and can improve application performance.
This is enabled by calling HfsProps.setUseCombinedInput(boolean) with true. By default, merging
or combining splits into larger ones is disabled.
The Apache Tez planner does not require this setting; it is supported by default and enabled by the application manager.
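To see why combining small files reduces the number of map tasks, consider a greedy packing of file sizes into splits. This is an illustrative sketch only; it is not the algorithm Hadoop or Cascading uses, and the class name, method name, and size limit are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical greedy packer illustrating the idea of combined splits.
final class SplitCombiner {
    // Packs file sizes, in order, into splits whose total stays within maxSplitSize.
    // A single file larger than maxSplitSize still gets a split of its own.
    static List<List<Long>> combine(long[] fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            // Close the current split when adding this file would exceed the limit.
            if (!current.isEmpty() && currentSize + size > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

With five files of sizes 10, 20, 30, 40 and 50 and a limit of 64, the sketch produces three splits instead of five, so three map tasks would be scheduled rather than one per file.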
If Hfs is overridden and sets multiple input paths directly against the configuration object, some methods will
attempt to honor this property. But the primary path (returned by getIdentifier()) must be a common directory
to all the 'child' input paths added directly.
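The common-directory requirement can be checked before the paths are added. The sketch below uses java.nio.file.Path as a stand-in for the Hadoop Path type purely for illustration; the class and method names are hypothetical, and URI-style paths with a scheme are not handled.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-check; java.nio.file.Path stands in for Hadoop's Path here.
final class CommonParentCheck {
    // True when every child path lies under the primary directory.
    static boolean isCommonParent(String primary, String... children) {
        Path parent = Paths.get(primary).normalize();
        for (String child : children) {
            // startsWith compares whole name elements, so /data/in does not match /data/input.
            if (!Paths.get(child).normalize().startsWith(parent)) {
                return false;
            }
        }
        return true;
    }
}
```

A call such as `CommonParentCheck.isCommonParent("/data/in", "/data/in/2021", "/data/in/2022/01")` passes, while a child under `/data/other` fails the check.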
| Modifier and Type | Field and Description |
|---|---|
| protected java.lang.String | stringPath: Field stringPath |
| Modifier | Constructor and Description |
|---|---|
| protected | Hfs() |
| protected | Hfs(cascading.scheme.Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme) |
|  | Hfs(cascading.scheme.Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme, Path path): Constructor Hfs creates a new Hfs instance. |
|  | Hfs(cascading.scheme.Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme, Path path, cascading.tap.SinkMode sinkMode): Constructor Hfs creates a new Hfs instance. |
|  | Hfs(cascading.scheme.Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme, java.lang.String stringPath): Constructor Hfs creates a new Hfs instance. |
|  | Hfs(cascading.scheme.Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme, java.lang.String stringPath, cascading.tap.SinkMode sinkMode): Constructor Hfs creates a new Hfs instance. |
| Modifier and Type | Method and Description |
|---|---|
| protected void | applySourceConfInitIdentifiers(cascading.flow.FlowProcess<? extends Configuration> process, Configuration conf, java.lang.String... fullIdentifiers) |
| protected cascading.tap.type.TapWith<Configuration,RecordReader,OutputCollector> | create(cascading.scheme.Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme, Path path, cascading.tap.SinkMode sinkMode) |
| boolean | createResource(Configuration conf) |
| boolean | deleteChildResource(Configuration conf, java.lang.String childIdentifier) |
| boolean | deleteChildResource(cascading.flow.FlowProcess<? extends Configuration> flowProcess, java.lang.String childIdentifier) |
| boolean | deleteResource(Configuration conf) |
| long | getBlockSize(Configuration conf): Method getBlockSize returns the blocksize specified by the underlying file system for this resource. |
| long | getBlockSize(cascading.flow.FlowProcess<? extends Configuration> flowProcess): Method getBlockSize returns the blocksize specified by the underlying file system for this resource. |
| java.lang.String[] | getChildIdentifiers(Configuration conf) |
| java.lang.String[] | getChildIdentifiers(Configuration conf, int depth, boolean fullyQualified) |
| java.lang.String[] | getChildIdentifiers(cascading.flow.FlowProcess<? extends Configuration> flowProcess) |
| java.lang.String[] | getChildIdentifiers(cascading.flow.FlowProcess<? extends Configuration> flowProcess, int depth, boolean fullyQualified) |
| protected static boolean | getCombinedInputSafeMode(Configuration conf) |
| protected static Path[] | getDeclaredInputPaths(Configuration conf) |
| protected FileSystem | getDefaultFileSystem(Configuration configuration) |
| java.net.URI | getDefaultFileSystemURIScheme(Configuration configuration): Method getDefaultFileSystemURIScheme returns the URI scheme for the default Hadoop FileSystem. |
| FileStatus | getFileStatus(Configuration conf) |
| protected FileSystem | getFileSystem(Configuration configuration) |
| java.lang.String | getFullIdentifier(Configuration conf) |
| java.lang.String | getIdentifier() |
| protected static java.lang.String | getLocalModeScheme(Configuration conf, java.lang.String defaultValue) |
| long | getModifiedTime(Configuration conf) |
| Path | getPath() |
| int | getReplication(Configuration conf): Method getReplication returns the replication specified by the underlying file system for this resource. |
| int | getReplication(cascading.flow.FlowProcess<? extends Configuration> flowProcess): Method getReplication returns the replication specified by the underlying file system for this resource. |
| long | getSize(Configuration conf) |
| long | getSize(cascading.flow.FlowProcess<? extends Configuration> flowProcess) |
| static Path | getTempPath(Configuration conf) |
| java.net.URI | getURIScheme(Configuration jobConf) |
| protected static boolean | getUseCombinedInput(Configuration conf) |
| boolean | isDirectory(Configuration conf) |
| boolean | isDirectory(cascading.flow.FlowProcess<? extends Configuration> flowProcess) |
| protected java.lang.String | makeTemporaryPathDirString(java.lang.String name) |
| protected java.net.URI | makeURIScheme(Configuration configuration) |
| cascading.tuple.TupleEntryIterator | openForRead(cascading.flow.FlowProcess<? extends Configuration> flowProcess, RecordReader input) |
| cascading.tuple.TupleEntryCollector | openForWrite(cascading.flow.FlowProcess<? extends Configuration> flowProcess, OutputCollector output) |
| void | resetFileStatuses(): Method resetFileStatuses removes the status cache, if any. |
| boolean | resourceExists(Configuration conf) |
| protected void | setStringPath(java.lang.String stringPath) |
| protected void | setUriScheme(java.net.URI uriScheme) |
| void | sinkConfInit(cascading.flow.FlowProcess<? extends Configuration> process, Configuration conf) |
| void | sourceConfInit(cascading.flow.FlowProcess<? extends Configuration> process, Configuration conf) |
| protected void | sourceConfInitAddInputPath(Configuration conf, Path qualifiedPath): Deprecated. |
| protected void | sourceConfInitAddInputPaths(Configuration conf, java.lang.Iterable<Path> qualifiedPaths) |
| protected void | sourceConfInitComplete(cascading.flow.FlowProcess<? extends Configuration> process, Configuration conf) |
| protected static void | verifyNoDuplicates(Configuration conf) |
| cascading.tap.type.TapWith<Configuration,RecordReader,OutputCollector> | withChildIdentifier(java.lang.String identifier) |
| cascading.tap.type.TapWith<Configuration,RecordReader,OutputCollector> | withScheme(cascading.scheme.Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme) |
| cascading.tap.type.TapWith<Configuration,RecordReader,OutputCollector> | withSinkMode(cascading.tap.SinkMode sinkMode) |
commitResource, entryStream, entryStream, entryStreamCopy, entryStreamCopy, equals, flowConfInit, getConfigDef, getNodeConfigDef, getScheme, getSinkFields, getSinkMode, getSourceFields, getStepConfigDef, getTrace, hasConfigDef, hashCode, hasNodeConfigDef, hasStepConfigDef, id, isKeep, isReplace, isSink, isSource, isTemporary, isUpdate, openForRead, openForReadUnchecked, openForWrite, outgoingScopeFor, prepareResourceForRead, prepareResourceForWrite, presentSinkFields, presentSourceFields, resolveIncomingOperationArgumentFields, resolveIncomingOperationPassThroughFields, retrieveSinkFields, retrieveSourceFields, rollbackResource, setScheme, spliterator, splititerator, taps, toString, tupleStream, tupleStream, tupleStreamCopy, tupleStreamCopy

protected java.lang.String stringPath

protected Hfs()

@ConstructorProperties(value="scheme")
protected Hfs(cascading.scheme.Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme)

@ConstructorProperties(value={"scheme","stringPath"})
public Hfs(cascading.scheme.Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme,
java.lang.String stringPath)
scheme - of type Scheme
stringPath - of type String

@ConstructorProperties(value={"scheme","stringPath","sinkMode"})
public Hfs(cascading.scheme.Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme,
java.lang.String stringPath,
cascading.tap.SinkMode sinkMode)
scheme - of type Scheme
stringPath - of type String
sinkMode - of type SinkMode

@ConstructorProperties(value={"scheme","path"})
public Hfs(cascading.scheme.Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme,
Path path)
scheme - of type Scheme
path - of type Path

@ConstructorProperties(value={"scheme","path","sinkMode"})
public Hfs(cascading.scheme.Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme,
Path path,
cascading.tap.SinkMode sinkMode)
scheme - of type Scheme
path - of type Path
sinkMode - of type SinkMode

protected static java.lang.String getLocalModeScheme(Configuration conf, java.lang.String defaultValue)

protected static boolean getUseCombinedInput(Configuration conf)

protected static boolean getCombinedInputSafeMode(Configuration conf)

public cascading.tap.type.TapWith<Configuration,RecordReader,OutputCollector> withChildIdentifier(java.lang.String identifier)
withChildIdentifier in interface cascading.tap.type.TapWith<Configuration,RecordReader,OutputCollector>

public cascading.tap.type.TapWith<Configuration,RecordReader,OutputCollector> withScheme(cascading.scheme.Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme)
withScheme in interface cascading.tap.type.TapWith<Configuration,RecordReader,OutputCollector>

public cascading.tap.type.TapWith<Configuration,RecordReader,OutputCollector> withSinkMode(cascading.tap.SinkMode sinkMode)
withSinkMode in interface cascading.tap.type.TapWith<Configuration,RecordReader,OutputCollector>

protected cascading.tap.type.TapWith<Configuration,RecordReader,OutputCollector> create(cascading.scheme.Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme, Path path, cascading.tap.SinkMode sinkMode)

protected void setStringPath(java.lang.String stringPath)

protected void setUriScheme(java.net.URI uriScheme)

public java.net.URI getURIScheme(Configuration jobConf)

protected java.net.URI makeURIScheme(Configuration configuration)

public java.net.URI getDefaultFileSystemURIScheme(Configuration configuration)
configuration - of type JobConf

protected FileSystem getDefaultFileSystem(Configuration configuration)

protected FileSystem getFileSystem(Configuration configuration)

public java.lang.String getIdentifier()
getIdentifier in class cascading.tap.Tap<Configuration,RecordReader,OutputCollector>

public Path getPath()

public java.lang.String getFullIdentifier(Configuration conf)
getFullIdentifier in class cascading.tap.Tap<Configuration,RecordReader,OutputCollector>

public void sourceConfInit(cascading.flow.FlowProcess<? extends Configuration> process, Configuration conf)
sourceConfInit in class cascading.tap.Tap<Configuration,RecordReader,OutputCollector>

protected static void verifyNoDuplicates(Configuration conf)

protected static Path[] getDeclaredInputPaths(Configuration conf)

protected void applySourceConfInitIdentifiers(cascading.flow.FlowProcess<? extends Configuration> process, Configuration conf, java.lang.String... fullIdentifiers)

protected void sourceConfInitAddInputPaths(Configuration conf, java.lang.Iterable<Path> qualifiedPaths)

@Deprecated protected void sourceConfInitAddInputPath(Configuration conf, Path qualifiedPath)

protected void sourceConfInitComplete(cascading.flow.FlowProcess<? extends Configuration> process, Configuration conf)

public void sinkConfInit(cascading.flow.FlowProcess<? extends Configuration> process, Configuration conf)
sinkConfInit in class cascading.tap.Tap<Configuration,RecordReader,OutputCollector>

public cascading.tuple.TupleEntryIterator openForRead(cascading.flow.FlowProcess<? extends Configuration> flowProcess, RecordReader input) throws java.io.IOException
openForRead in class cascading.tap.Tap<Configuration,RecordReader,OutputCollector>
java.io.IOException

public cascading.tuple.TupleEntryCollector openForWrite(cascading.flow.FlowProcess<? extends Configuration> flowProcess, OutputCollector output) throws java.io.IOException
openForWrite in class cascading.tap.Tap<Configuration,RecordReader,OutputCollector>
java.io.IOException

public boolean createResource(Configuration conf) throws java.io.IOException
createResource in class cascading.tap.Tap<Configuration,RecordReader,OutputCollector>
java.io.IOException

public boolean deleteResource(Configuration conf) throws java.io.IOException
deleteResource in class cascading.tap.Tap<Configuration,RecordReader,OutputCollector>
java.io.IOException

public boolean deleteChildResource(cascading.flow.FlowProcess<? extends Configuration> flowProcess, java.lang.String childIdentifier) throws java.io.IOException
java.io.IOException

public boolean deleteChildResource(Configuration conf, java.lang.String childIdentifier) throws java.io.IOException
java.io.IOException

public boolean resourceExists(Configuration conf) throws java.io.IOException
resourceExists in class cascading.tap.Tap<Configuration,RecordReader,OutputCollector>
java.io.IOException

public boolean isDirectory(cascading.flow.FlowProcess<? extends Configuration> flowProcess) throws java.io.IOException
isDirectory in interface cascading.tap.type.FileType<Configuration>
java.io.IOException

public boolean isDirectory(Configuration conf) throws java.io.IOException
isDirectory in interface cascading.tap.type.FileType<Configuration>
java.io.IOException

public long getSize(cascading.flow.FlowProcess<? extends Configuration> flowProcess) throws java.io.IOException
getSize in interface cascading.tap.type.FileType<Configuration>
java.io.IOException

public long getSize(Configuration conf) throws java.io.IOException
getSize in interface cascading.tap.type.FileType<Configuration>
java.io.IOException

public long getBlockSize(cascading.flow.FlowProcess<? extends Configuration> flowProcess) throws java.io.IOException
Method getBlockSize returns the blocksize specified by the underlying file system for this resource.
flowProcess -
java.io.IOException - when

public long getBlockSize(Configuration conf) throws java.io.IOException
Method getBlockSize returns the blocksize specified by the underlying file system for this resource.
conf - of JobConf
java.io.IOException - when

public int getReplication(cascading.flow.FlowProcess<? extends Configuration> flowProcess) throws java.io.IOException
Method getReplication returns the replication specified by the underlying file system for this resource.
flowProcess -
java.io.IOException - when

public int getReplication(Configuration conf) throws java.io.IOException
Method getReplication returns the replication specified by the underlying file system for this resource.
conf - of JobConf
java.io.IOException - when

public java.lang.String[] getChildIdentifiers(cascading.flow.FlowProcess<? extends Configuration> flowProcess) throws java.io.IOException
getChildIdentifiers in interface cascading.tap.type.FileType<Configuration>
java.io.IOException

public java.lang.String[] getChildIdentifiers(Configuration conf) throws java.io.IOException
getChildIdentifiers in interface cascading.tap.type.FileType<Configuration>
java.io.IOException

public java.lang.String[] getChildIdentifiers(cascading.flow.FlowProcess<? extends Configuration> flowProcess, int depth, boolean fullyQualified) throws java.io.IOException
getChildIdentifiers in interface cascading.tap.type.FileType<Configuration>
java.io.IOException

public java.lang.String[] getChildIdentifiers(Configuration conf, int depth, boolean fullyQualified) throws java.io.IOException
getChildIdentifiers in interface cascading.tap.type.FileType<Configuration>
java.io.IOException

public long getModifiedTime(Configuration conf) throws java.io.IOException
getModifiedTime in class cascading.tap.Tap<Configuration,RecordReader,OutputCollector>
java.io.IOException

public FileStatus getFileStatus(Configuration conf) throws java.io.IOException
java.io.IOException

public static Path getTempPath(Configuration conf)

protected java.lang.String makeTemporaryPathDirString(java.lang.String name)

public void resetFileStatuses()
Method resetFileStatuses removes the status cache, if any.

Copyright © 2007-2021 Cascading Maintainers. All Rights Reserved.