class MinHashModel extends GenericSimilarityEstimatorModel

Linear Supertypes
  1. GenericSimilarityEstimatorModel
  2. AnyRef
  3. Any

Instance Constructors

  1. new MinHashModel()

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. var _featuresColumnNameDfA: String
    Attributes
    protected
    Definition Classes
    GenericSimilarityEstimatorModel
  5. var _featuresColumnNameDfB: String
    Attributes
    protected
    Definition Classes
    GenericSimilarityEstimatorModel
  6. var _inputCol: String
    Attributes
    protected
    Definition Classes
    GenericSimilarityEstimatorModel
  7. var _similarityEstimationColumnName: String
    Attributes
    protected
    Definition Classes
    GenericSimilarityEstimatorModel
  8. var _uriColumnNameDfA: String
    Attributes
    protected
    Definition Classes
    GenericSimilarityEstimatorModel
  9. var _uriColumnNameDfB: String
    Attributes
    protected
    Definition Classes
    GenericSimilarityEstimatorModel
  10. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  11. def checkColumnNames(dfA: DataFrame, dfB: DataFrame): Unit
    Attributes
    protected
    Definition Classes
    GenericSimilarityEstimatorModel
  12. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native() @IntrinsicCandidate()
  13. def createCrossJoinDF(dfA: DataFrame, dfB: DataFrame): DataFrame

    This method creates a cross join dataframe containing all possible pairs of rows from the two dataframes that can be compared

    dfA

    the first dataframe, which must have a column for the URI and a column for the feature vector

    dfB

    the second dataframe

    returns

    a dataframe with four columns: two for the URIs and two for the feature vectors. Its size is approximately the product of both dataframe sizes

    Attributes
    protected
    Definition Classes
    GenericSimilarityEstimatorModel
  14. def createNnDF(df: DataFrame, key: Vector, keyUri: String = "unknown"): DataFrame

    This method creates a dataframe aligned to the createCrossJoinDF result, but in this case each row of df is paired only with the given key

    df

    the dataframe with all entities we want to compare against, including their URIs and feature vector representations

    key

    the vector representation of the key's features

    keyUri

    the URI name to assign to the key

    returns

    a dataframe with four columns holding all desired pairs to compare: uriA, uriB, featuresA, featuresB

    Attributes
    protected
    Definition Classes
    GenericSimilarityEstimatorModel
  15. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  16. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  17. val estimatorMeasureType: String
  18. val estimatorName: String
  19. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native() @IntrinsicCandidate()
  20. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native() @IntrinsicCandidate()
  21. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  22. val modelType: String
  23. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  24. def nearestNeighbors(dfA: DataFrame, key: Vector, k: Int, keyUri: String = "unknown", valueColumn: String = "distCol", keepKeyUriColumn: Boolean = false): DataFrame

    dfA

    a dataframe with two columns: one for the URI as a String and one for the feature vector as an indexed feature vector representation (such as the output of Spark MLlib CountVectorizer)

    key

    the key's vector representation, such as the output of MLlib CountVectorizer

    k

    number of nearest neighbors we want to return

    keyUri

    name of the uri which is later listed in the resulting dataframe

    valueColumn

    name of the column of resulting similarities and distances

    keepKeyUriColumn

    whether the key URI should be included as a column in the resulting dataframe

    returns

    resulting dataframe of similar uris with distances/similarities to the given key

    Definition Classes
    MinHashModel → GenericSimilarityEstimatorModel
  25. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native() @IntrinsicCandidate()
  26. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native() @IntrinsicCandidate()
  27. def reduceJoinDf(simDf: DataFrame, threshold: Double): DataFrame

    This method reduces the overall created dataframe by a set threshold

    All values that are worse than the threshold are not taken into account

    simDf

    the resulting dataframe we want to reduce

    threshold

    the threshold used for the reduction: an upper bound for distances and a lower bound for similarities

    Attributes
    protected
    Definition Classes
    GenericSimilarityEstimatorModel
  28. def reduceNnDf(simDf: DataFrame, k: Int, keepKeyUriColumn: Boolean): DataFrame

    Limits the number of presented nearest neighbors and orders the results


    simDf

    the similarity evaluated data frame

    k

    the number of nearest neighbors we search for

    keepKeyUriColumn

    whether to keep the column of the fixed URI; this decides whether the resulting dataframe has two or three columns

    returns

    a dataframe with the k nearest neighbor URIs in one column, the estimated similarity in a second column, and optionally the URI we compared against in a third column

    Attributes
    protected
    Definition Classes
    GenericSimilarityEstimatorModel
  29. def setFeaturesColumnNameDfA(features_column_name: String): MinHashModel.this.type
  30. def setFeaturesColumnNameDfB(features_column_name: String): MinHashModel.this.type
  31. def setInputCol(inputCol: String): MinHashModel.this.type
  32. def setNumHashTables(n: Int): MinHashModel.this.type
  33. def setSimilarityEstimationColumnName(valueColumnName: String): Unit

    This method sets the name of the similarity or distance column in the resulting dataframe, e.g. "jaccardSimilarity"

    valueColumnName

    the name of the resulting distance/similarity value

    Attributes
    protected
    Definition Classes
    GenericSimilarityEstimatorModel
  34. def setUriColumnNameDfA(uri_column_name: String): MinHashModel.this.type
  35. def setUriColumnNameDfB(uri_column_name: String): MinHashModel.this.type
  36. val similarityEstimation: UserDefinedFunction
    Attributes
    protected
    Definition Classes
    GenericSimilarityEstimatorModel
  37. def similarityJoin(dfA: DataFrame, dfB: DataFrame, threshold: Double = -1.0, valueColumn: String = "distCol"): DataFrame

    This method creates a dataframe that lists, for each pair of URIs, the assigned similarity/distance

    dfA

    a dataframe with two columns: one for the URI as a String and one for the feature vector as an indexed feature vector representation (such as the output of Spark MLlib CountVectorizer)

    dfB

    the second dataframe, whose entries are compared against those of dfA; it must be formatted like dfA

    threshold

    threshold on the distance or similarity by which the final dataframe is filtered

    valueColumn

    name of the resulting similarity/distance column

    returns

    dataframe with the columns for uris and the assigned similarity column

    Definition Classes
    MinHashModel → GenericSimilarityEstimatorModel
  38. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  39. def toString(): String
    Definition Classes
    AnyRef → Any
  40. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  41. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  42. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
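
The descriptions of similarityJoin and nearestNeighbors above can be illustrated with a minimal sketch. It is not this class's implementation. It assumes plain Scala collections in place of Spark DataFrames and an exact Jaccard distance in place of the MinHash estimate. It also assumes that a negative threshold (such as the -1.0 default) means "no filtering" and that smaller values are better. The names SimilaritySketch, Entity, and jaccardDistance are hypothetical and do not appear in the library.

```scala
// Illustrative only: plain collections stand in for DataFrames, and an
// exact Jaccard distance stands in for the MinHash-based estimate.
object SimilaritySketch {

  // One row: a URI plus the set of active feature indices (hypothetical type).
  final case class Entity(uri: String, features: Set[Int])

  // Jaccard distance = 1 - |A intersect B| / |A union B|.
  // Two empty sets are treated as identical (distance 0.0).
  def jaccardDistance(a: Set[Int], b: Set[Int]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 0.0
    else 1.0 - (a intersect b).size.toDouble / unionSize
  }

  // Cross join of both inputs, scored pair by pair, then optionally
  // filtered with the threshold as an upper bound on the distance.
  // Assumption: a negative threshold disables the filter.
  def similarityJoin(
      dfA: Seq[Entity],
      dfB: Seq[Entity],
      threshold: Double = -1.0
  ): Seq[(String, String, Double)] = {
    val scored =
      for (a <- dfA; b <- dfB)
        yield (a.uri, b.uri, jaccardDistance(a.features, b.features))
    if (threshold < 0) scored else scored.filter(_._3 <= threshold)
  }

  // Pair every row with one fixed key, order by distance, keep the k best.
  def nearestNeighbors(
      df: Seq[Entity],
      key: Set[Int],
      k: Int
  ): Seq[(String, Double)] =
    df.map(e => (e.uri, jaccardDistance(e.features, key)))
      .sortBy(_._2)
      .take(k)
}
```

In this sketch an unfiltered join returns one row per pair, so its size is the product of the two input sizes, which matches the size note on createCrossJoinDF. The actual class works on DataFrames and returns an estimate rather than an exact value.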

Deprecated Value Members

  1. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] ) @Deprecated
    Deprecated
