Class FuzzyHash

  • All Implemented Interfaces:
    Serializable

    public class FuzzyHash
    extends Hash
    A fuzzy hash is an aggregation of multiple hashes mapped to a single mean hash, which represents the average hash while keeping track of fractional (probability) bits. This kind of composite hash is suited to representing clustered hashes and minimizes the distance to all members added.

    To receive reasonable results a fuzzy hash should only contain hashes created by the same algorithm.

     
     Combining three hashes 
     H1:  1001
     H2:  1011
     H3:  1111
     -------
     Res: 1011 (used for hamming distances and getHashValue())
     
     
    In the above case the first and last bit have a certainty of 100%, while the majority value of each middle bit is present in only 66% of the added hashes. The weighted distance takes those certainties into account, while the original distance method inherited from Hash calculates the distance to the mode (most frequent) bit for each position.
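
    The per-bit vote in the example above can be reproduced with a small stand-alone sketch. It does not use the library: the class MajorityVoteSketch and its methods are hypothetical, hashes are written as binary strings for readability, and ties falling to 0 is an assumption. It keeps one signed counter per position, in the spirit of the bits field, and derives the result bit and the share of hashes agreeing with it.

```java
import java.util.Arrays;

/** Hypothetical stand-alone model of the per-bit vote; not part of the library. */
final class MajorityVoteSketch {

    /** One signed counter per position: +1 for every 1 bit, -1 for every 0 bit. */
    static int[] count(String... hashes) {
        int[] counters = new int[hashes[0].length()];
        for (String hash : hashes) {
            for (int i = 0; i < counters.length; i++) {
                counters[i] += hash.charAt(i) == '1' ? 1 : -1;
            }
        }
        return counters;
    }

    /** The result bit is 1 where more 1s than 0s were seen (ties fall to 0 in this sketch). */
    static String result(int[] counters) {
        StringBuilder sb = new StringBuilder();
        for (int c : counters) {
            sb.append(c > 0 ? '1' : '0');
        }
        return sb.toString();
    }

    /** Share of the added hashes that agree with the result bit at each position. */
    static double[] agreement(int[] counters, int added) {
        double[] share = new double[counters.length];
        for (int i = 0; i < counters.length; i++) {
            share[i] = (added + Math.abs(counters[i])) / (2.0 * added);
        }
        return share;
    }

    public static void main(String[] args) {
        int[] counters = count("1001", "1011", "1111");
        System.out.println(Arrays.toString(counters)); // [3, -1, 1, 3]
        System.out.println(result(counters));          // 1011
        // Roughly 1.0, 0.67, 0.67, 1.0: the middle bits are backed by 2 of 3 hashes.
        System.out.println(Arrays.toString(agreement(counters, 3)));
    }
}
```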

    Implnote: As opposed to the original Hash, equals and hashCode are not overridden, to ensure correct functionality in hash collections given the mutable fields. To check whether hashes are equal, calculate the distance between the hashes instead.

    Since:
    3.0.0
    Author:
    Kilian
    See Also:
    Serialized Form
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected int[] bits
      The difference between the number of 1 bits and 0 bits added at each position.
      boolean dirtyBits
      Requires an update of the hashValue
    • Constructor Summary

      Constructors 
      Constructor Description
      FuzzyHash()
      Create an empty fuzzy hash object with a 0 bit hash length and an undefined algorithm id.
      FuzzyHash​(Hash... hashs)
      Create a fuzzy hash by merging the supplied hashes together.
    • Field Detail

      • bits

        protected int[] bits
        The difference between the number of 1 bits and 0 bits added at each position. Positive values indicate more 1s, negative values more 0s.
      • dirtyBits

        public transient boolean dirtyBits
        Requires an update of the hashValue
    • Constructor Detail

      • FuzzyHash

        public FuzzyHash()
        Create an empty fuzzy hash object with a 0 bit hash length and an undefined algorithm id. These values will be populated as soon as the first hash is added.
      • FuzzyHash

        public FuzzyHash​(Hash... hashs)
        Create a fuzzy hash by merging the supplied hashes together. The hashes are all expected to be created by the same algorithm.

        The fuzzy hash will adopt the algorithm id and bit length of the first supplied hash.

        Parameters:
        hashs - the hashes to merge.
    • Method Detail

      • merge

        public void merge​(Hash hash)
        Merge a hash into this object and lazily update the values as required. A merge operation looks at every individual bit and increments or decrements the counter prompting a recomputation of the underlying hash value if necessary.

        Merging multiple hashes will result in a hash which has the lowest summed distance to its members.

        Creating composite hashes is only reasonable for hashes generated by the same algorithm. This method will throw an IllegalArgumentException if the hashes are not created by the same algorithm or are of different length.

        Parameters:
        hash - the hash to merge
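
        The counter update, the compatibility check and the lazy recomputation described for merge(Hash) can be modelled as follows. This is a stand-alone sketch, not the library code: the class LazyMergeSketch, its fields and the recomputations counter are hypothetical, and a hash is passed as a BigInteger plus bit length and algorithm id.

```java
import java.math.BigInteger;

/** Hypothetical stand-alone model of a validated, lazily updated merge; not the library code. */
final class LazyMergeSketch {
    private int[] counters = new int[0];   // +1 per 1 bit, -1 per 0 bit; index 0 is the rightmost bit
    private int algorithmId;
    private int added;
    private boolean dirty;                 // true while the cached value is stale
    private BigInteger cached = BigInteger.ZERO;
    int recomputations;                    // only here to make the laziness observable

    void merge(BigInteger value, int bitLength, int algoId) {
        if (added == 0) {
            // The first hash defines bit length and algorithm id.
            counters = new int[bitLength];
            algorithmId = algoId;
        } else if (bitLength != counters.length || algoId != algorithmId) {
            throw new IllegalArgumentException("incompatible hash");
        }
        for (int i = 0; i < bitLength; i++) {
            boolean wasOne = counters[i] > 0;
            counters[i] += value.testBit(i) ? 1 : -1;
            // Only a counter crossing zero can change the majority bit.
            if (wasOne != (counters[i] > 0)) {
                dirty = true;
            }
        }
        added++;
    }

    BigInteger getValue() {
        if (dirty) {
            BigInteger v = BigInteger.ZERO;
            for (int i = 0; i < counters.length; i++) {
                if (counters[i] > 0) {
                    v = v.setBit(i);
                }
            }
            cached = v;
            dirty = false;
            recomputations++;
        }
        return cached;
    }

    public static void main(String[] args) {
        LazyMergeSketch fuzzy = new LazyMergeSketch();
        fuzzy.merge(new BigInteger("1001", 2), 4, 7);
        fuzzy.merge(new BigInteger("1011", 2), 4, 7);
        fuzzy.merge(new BigInteger("1111", 2), 4, 7);
        System.out.println(fuzzy.getValue().toString(2)); // 1011
    }
}
```

        The value is rebuilt only when it is requested and at least one counter has changed sign since the last request, so a burst of merges costs a single recomputation in this model.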
      • merge

        public void merge​(Hash... hash)
        Merge multiple hashes into this object and lazily update the values as required. A merge operation looks at every individual bit and increments or decrements the counter prompting a recomputation of the underlying hash value if necessary.

        Merging multiple hashes will result in a hash which has the lowest summed distance to its members.

        Creating composite hashes is only reasonable for hashes generated by the same algorithm. This method will throw an IllegalArgumentException if the hashes are not created by the same algorithm or are of different length.

        Parameters:
        hash - the hash to merge
      • mergeFast

        public void mergeFast​(Hash hash)
        Merge a hash into this object and lazily update the values as required. A merge operation looks at every individual bit and increments or decrements the counter prompting a recomputation of the underlying hash value if necessary.

        Merging multiple hashes is a geometric averaging operation resulting in a hash with the lowest summed distance to its members.

        Unlike merge(Hash), this method does not perform any input validation. If incompatible hashes are added (especially hashes with differing lengths) the instance will be left in an undefined state and future calls to any of the methods may return unpredictable results.

        Parameters:
        hash - to merge
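
        The claim that the merged hash has the lowest summed distance to its members can be checked by brute force for short hashes. The sketch below is stand-alone and hypothetical (class MedianHashSketch, hashes modelled as small int values, ties falling to 0); it compares the per-bit majority vote against every possible 4 bit value.

```java
/** Hypothetical brute-force check; hashes are modelled as small ints, not library objects. */
final class MedianHashSketch {

    /** Per-bit majority vote over the members (ties fall to 0 in this sketch). */
    static int majority(int bitLength, int... members) {
        int result = 0;
        for (int bit = 0; bit < bitLength; bit++) {
            int ones = 0;
            for (int m : members) {
                ones += (m >>> bit) & 1;
            }
            if (2 * ones > members.length) {
                result |= 1 << bit;
            }
        }
        return result;
    }

    /** Sum of the hamming distances from a candidate to every member. */
    static int summedDistance(int candidate, int... members) {
        int sum = 0;
        for (int m : members) {
            sum += Integer.bitCount(candidate ^ m);
        }
        return sum;
    }

    /** Smallest summed distance over every possible value of the given length. */
    static int bestPossible(int bitLength, int... members) {
        int best = Integer.MAX_VALUE;
        for (int candidate = 0; candidate < (1 << bitLength); candidate++) {
            best = Math.min(best, summedDistance(candidate, members));
        }
        return best;
    }

    public static void main(String[] args) {
        int[] members = {0b1001, 0b1011, 0b1111};
        int merged = majority(4, members);
        System.out.println(Integer.toBinaryString(merged));  // 1011
        System.out.println(summedDistance(merged, members)); // 2
        System.out.println(bestPossible(4, members));        // 2
    }
}
```

        Each bit position contributes independently to the summed hamming distance, which is why choosing the more frequent value per position cannot be beaten.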
      • mergeFast

        public void mergeFast​(FuzzyHash hash)
        Merge a fuzzy hash into this object and lazily update the values as required. A merge operation looks at every individual bit and increments or decrements the counter prompting a recomputation of the underlying hash value if necessary.

        Unlike merge(Hash), this method does not perform any input validation. If incompatible hashes are added (especially hashes with differing lengths) the instance will be left in an undefined state and future calls to any of the methods may return unpredictable results.

        Parameters:
        hash - to merge
      • mergeFast

        public void mergeFast​(Hash... hash)
        Merge multiple hashes into this object and lazily update the values as required. A merge operation looks at every individual bit and increments or decrements the counter prompting a recomputation of the underlying hash value if necessary.

        Merging multiple hashes is a geometric averaging operation resulting in a hash with the lowest summed distance to its members.

        Unlike merge(Hash), this method does not perform any input validation. If incompatible hashes are added (especially hashes with differing lengths) the instance will be left in an undefined state and future calls to any of the methods may return unpredictable results.

        Parameters:
        hash - to merge
      • subtract

        public void subtract​(Hash hash)
        Subtract a hash from this object and lazily update the values as required. A subtraction operation looks at every individual bit and increments or decrements the counter prompting a recomputation of the underlying hash value if necessary.

        Creating composite hashes is only reasonable for hashes generated by the same algorithm. This method will throw an IllegalArgumentException if the hashes are not created by the same algorithm or are of different length. Only hashes that were added beforehand should be subtracted.

        Parameters:
        hash - to subtract
      • subtractFast

        public void subtractFast​(Hash hash)
        Subtract a hash from this object and lazily update the values as required. A subtraction operation looks at every individual bit and increments or decrements the counter prompting a recomputation of the underlying hash value if necessary.

        Creating composite hashes is only reasonable for hashes generated by the same algorithm. This method will throw an IllegalArgumentException if the hashes are not created by the same algorithm or are of different length. Only hashes that were added beforehand should be subtracted.

        Parameters:
        hash - to subtract
      • weightedDistance

        public double weightedDistance​(Hash h)
        Calculate the normalized weighted distance between the supplied hash and this hash. Unlike the hamming distance, the weighted distance takes partial bits into account. E.g. if this fuzzy hash's first bit has a 70% probability of being a 0, it will have a weighted distance of .7 if the supplied hash's first bit is a 1. Be aware that this method is much more expensive than calculating the simple distance between 2 ordinary hashes (1 quick xor vs. multiple calculations per bit).
        Parameters:
        h - The hash to calculate the distance to
        Returns:
        similarity value ranging between [0 - 1]
      • weightedDistance

        public double weightedDistance​(FuzzyHash h)
        Calculate the normalized weighted distance between two fuzzy hashes. Unlike the hamming distance, the weighted distance takes partial bits into account. E.g. if this fuzzy hash's first bit has a 70% probability of being a 0 and the second fuzzy hash's first bit has a 60% probability, the distance will be 10%. Be aware that this method is much more expensive than calculating the simple distance between 2 ordinary hashes (1 quick xor vs. multiple calculations per bit).
        Parameters:
        h - The hash to calculate the distance to
        Returns:
        similarity value ranging between [0 - 1]
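
        One way to read the two weightedDistance overloads is as a mean absolute difference of per-bit probabilities. The sketch below follows that reading; the formula, the class WeightedDistanceSketch and the use of binary strings are assumptions made for illustration, and the library's exact computation may differ.

```java
/** Hypothetical model of the weighted distance; the library's exact formula may differ. */
final class WeightedDistanceSketch {

    /** Fraction of 1 bits per position for the given binary strings. */
    static double[] probabilityOfOne(String... hashes) {
        double[] p = new double[hashes[0].length()];
        for (String hash : hashes) {
            for (int i = 0; i < p.length; i++) {
                if (hash.charAt(i) == '1') {
                    p[i] += 1.0 / hashes.length;
                }
            }
        }
        return p;
    }

    /** Mean absolute difference per bit, which keeps the result within [0, 1]. */
    static double weightedDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum / a.length;
    }

    public static void main(String[] args) {
        double[] cluster = probabilityOfOne("1001", "1011", "1111");
        // A plain hash is the special case where every probability is 0 or 1.
        double[] plain = probabilityOfOne("1111");
        System.out.println(weightedDistance(cluster, plain)); // about 0.25

        // Single-bit cases from the description: 70% zero vs. a 1 bit, and 70% vs. 60% zero.
        System.out.println(weightedDistance(new double[] {0.3}, new double[] {1.0})); // about 0.7
        System.out.println(weightedDistance(new double[] {0.3}, new double[] {0.4})); // about 0.1
    }
}
```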
      • squaredWeightedDistance

        public double squaredWeightedDistance​(Hash h)
        Calculate the squared normalized weighted distance between the supplied hash and this hash. Unlike the hamming distance, the weighted distance takes partial bits into account. E.g. if this fuzzy hash's first bit has a 70% probability of being a 0, it will have a weighted distance of .7^2 if the supplied hash's first bit is a 1. Be aware that this method is much more expensive than calculating the simple distance between 2 ordinary hashes (1 quick xor vs. multiple calculations per bit).
        Parameters:
        h - The hash to calculate the distance to
        Returns:
        similarity value ranging between [0 - 1]
      • squaredWeightedDistance

        public double squaredWeightedDistance​(FuzzyHash h)
        Calculate the squared normalized weighted distance between two fuzzy hashes. Unlike the hamming distance, the weighted distance takes partial bits into account. E.g. if this fuzzy hash's first bit has a 70% probability of being a 0 and the second fuzzy hash's first bit has a 60% probability, the distance will be 10%^2. Be aware that this method is much more expensive than calculating the simple distance between 2 ordinary hashes (1 quick xor vs. multiple calculations per bit).
        Parameters:
        h - The hash to calculate the distance to
        Returns:
        similarity value ranging between [0 - 1]
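
        The squared variant can be sketched in the same spirit, assuming that the per-bit difference is squared before averaging, as the ".7^2" example suggests. The class SquaredDistanceSketch and the probability arrays are hypothetical and not part of the library.

```java
/** Hypothetical model: per-bit differences are squared before they are averaged. */
final class SquaredDistanceSketch {

    /** a and b hold the probability of a 1 bit per position, each within [0, 1]. */
    static double squaredWeightedDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum / a.length;
    }

    public static void main(String[] args) {
        // 70% zero vs. a 1 bit: difference 0.7, squared about 0.49.
        System.out.println(squaredWeightedDistance(new double[] {0.3}, new double[] {1.0}));
        // 70% vs. 60% zero: difference 0.1, squared about 0.01.
        System.out.println(squaredWeightedDistance(new double[] {0.3}, new double[] {0.4}));
    }
}
```

        Because every per-bit difference lies within [0, 1], squaring shrinks small disagreements far more than large ones, so a few strongly conflicting bits dominate the result.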
      • getMaximalError

        public double getMaximalError()
      • reset

        public void reset()
        Removes all previous knowledge of the hashes added while keeping the current cluster hash in place, treating it as the only hash added.

        This operation is the in place equivalent to

         
         
         FuzzyHash fuzzy ...
         ...
         ...
         
         FuzzyHash temp = new FuzzyHash();
         temp.mergeFast(new Hash(fuzzy.getHashValue(), fuzzy.getBitResolution(), fuzzy.getAlgorithmId()));
         fuzzy = temp;
         
         
      • getCertainty

        public double getCertainty​(int position)
        Get the certainty of a bit at the given bit position.

        The returned value is in the range [-1,1]. A negative value indicates that the bit is more likely to be a 0. A positive value indicates the bit to be more likely a 1.

        The value reflects the probability that the nth bit of a hash drawn at random from the set of previously merged hashes is 0 or 1.

        Parameters:
        position - the bit position.
        Returns:
        The certainty in range of [-1,1].
      • getUncertaintyMask

        public boolean[] getUncertaintyMask​(double certainty)
        Return a mask indicating if the bits are above a certain uncertainty.

        The certainty of a bit is calculated by comparing the number of 0 bits at position n with the number of 1 bits at position n across all the added hashes. If all hashes agree, the certainty is 100%. If half of the hashes contain a 0 bit and the other half a 1 bit, the bit has a certainty of 0.

        Be aware that index 0 refers to the rightmost bit. Printing the array will show the boolean values in reverse order.

        This method only returns usable results once at least one hash has been added.

        Parameters:
        certainty - the certainty [0-1] up to which bits will be included. In other words, if a bit is more certain than specified by this argument it will not be included in the result.
        Returns:
        a boolean array indicating which bits are uncertain
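
        The relationship between the per-bit counters, the certainty in [-1,1] and the mask can be illustrated with a stand-alone sketch. It assumes that the certainty is the counter divided by the number of added hashes and that a bit is flagged when its absolute certainty does not exceed the threshold; both are assumptions, and the class CertaintySketch is hypothetical. Under this assumption a bit on which two of three hashes agree has a certainty of 1/3, which is a different figure from the share of agreeing hashes.

```java
import java.util.Arrays;

/** Hypothetical model of certainty and uncertainty mask; not the library code. */
final class CertaintySketch {

    /** counters[i] = (number of 1 bits) - (number of 0 bits) at bit i; index 0 is the rightmost bit. */
    static double certainty(int[] counters, int added, int position) {
        return counters[position] / (double) added;
    }

    /** true marks a bit whose absolute certainty does not exceed the threshold. */
    static boolean[] uncertaintyMask(int[] counters, int added, double threshold) {
        boolean[] mask = new boolean[counters.length];
        for (int i = 0; i < counters.length; i++) {
            mask[i] = Math.abs(certainty(counters, added, i)) <= threshold;
        }
        return mask;
    }

    public static void main(String[] args) {
        // Counters after adding 1001, 1011 and 1111, listed from the rightmost bit.
        int[] counters = {3, 1, -1, 3};
        System.out.println(certainty(counters, 3, 0)); // 1.0, all hashes agree on a 1
        System.out.println(certainty(counters, 3, 2)); // about -0.33, leaning towards 0
        // Index 0 is the rightmost bit, so the printed order is reversed.
        System.out.println(Arrays.toString(uncertaintyMask(counters, 3, 0.5)));
        // [false, true, true, false]
    }
}
```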
      • getUncertaintyHash

        public Hash getUncertaintyHash​(double certainty)
        Return a simple hash containing only the bits that are below the specified uncertainty. This hash can be used to further compare hashes matched to this cluster while ignoring bits that are likely to be similar.

        To obtain compatible hashes, take a look at toUncertaintyHash(Hash, double), which escapes other hashes the same way.

        Parameters:
        certainty - the certainty [0-1] up to which bits will be included. In other words, if a bit is more certain than specified by this argument it will not be included in the result.
        Returns:
        a hash with the certain bits discarded.
      • toUncertaintyHash

        public Hash toUncertaintyHash​(Hash source,
                                      double certainty)
        Escape the given hash the same way getUncertaintyHash(double) escapes this fuzzy hash.

        Return a simple hash containing only the bits that are below the specified uncertainty of the fuzzy hash. This hash can be used to further compare hashes matched to this cluster while ignoring bits that are likely to be similar.

        To get reasonable results the source hash has to be compatible with the fuzzy hash, meaning that it was generated by the same algorithm and settings as the hashes which make up the fuzzy hash.

        Parameters:
        source - the hash to convert.
        certainty - threshold to indicate which bits to discard
        Returns:
        a hash with the certain bits discarded.
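
        The idea of discarding certain bits from both the cluster hash and a candidate hash, then comparing only what remains, can be sketched as below. How the library lays out the remaining bits is not stated here; packing them densely from the lowest index is an assumption, and the class UncertaintyHashSketch is hypothetical.

```java
import java.math.BigInteger;

/** Hypothetical model of reducing a hash to its uncertain bits; not the library code. */
final class UncertaintyHashSketch {

    /** Keep only the bits flagged in the mask and pack them densely, lowest index first. */
    static BigInteger keepUncertainBits(BigInteger source, boolean[] mask) {
        BigInteger packed = BigInteger.ZERO;
        int target = 0;
        for (int i = 0; i < mask.length; i++) {
            if (mask[i]) {
                if (source.testBit(i)) {
                    packed = packed.setBit(target);
                }
                target++;
            }
        }
        return packed;
    }

    public static void main(String[] args) {
        // Index 0 is the rightmost bit; the two middle bits are treated as uncertain.
        boolean[] mask = {false, true, true, false};
        BigInteger cluster = keepUncertainBits(new BigInteger("1011", 2), mask);
        BigInteger candidate = keepUncertainBits(new BigInteger("1101", 2), mask);
        System.out.println(cluster.toString(2));               // 1
        System.out.println(candidate.toString(2));             // 10
        System.out.println(cluster.xor(candidate).bitCount()); // 2
    }
}
```

        Applying the same mask to both sides is what keeps the reduced hashes comparable, which mirrors the requirement that the source hash be compatible with the fuzzy hash.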
      • getHashValue

        public BigInteger getHashValue()
        Overrides:
        getHashValue in class Hash
        Returns:
        the base BigInteger holding the hash value
      • getBitUnsafe

        public boolean getBitUnsafe​(int position)
        Description copied from class: Hash
        Check if the bit at the given position of the hash is set. This method does not check the bounds of the supplied argument.
        Overrides:
        getBitUnsafe in class Hash
        Parameters:
        position - of the bit. An index of 0 points to the lowest (rightmost bit)
        Returns:
        true if the bit is set (1). False if it's not set (0) or the index is bigger than the hash length.
      • hammingDistance

        public int hammingDistance​(Hash h)
        Description copied from class: Hash
        Calculate the hamming distance of 2 hash values. The distance of two hashes is the difference of the individual bits found in the hash.

        The hamming distance falls within [0-bitResolution]. Lower values indicate closer similarity, and identical images must return a score of 0. On the flip side, a score of 0 does not mean images have to be identical!

        A longer hash (higher bitResolution) will increase the average hamming distance returned. While this method allows for the most accurate fine tuning of the distance, Hash.normalizedHammingDistance(Hash) is hash length independent.

        Please be aware that only hashes produced by the same algorithm with the same settings will return meaningful results and should be compared. This method will check if the hashes are compatible. If no additional check is required, see Hash.hammingDistanceFast(Hash).

        Overrides:
        hammingDistance in class Hash
        Parameters:
        h - The hash to calculate the distance to
        Returns:
        similarity value ranging between [0 - hash length]
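
        For plain hash values the hamming distance boils down to an xor followed by a bit count. The sketch below shows this on raw BigInteger values; the class HammingSketch is hypothetical and skips the compatibility check described above.

```java
import java.math.BigInteger;

/** Hypothetical helper working on raw values; performs no compatibility check. */
final class HammingSketch {

    /** Number of bit positions in which the two non-negative values differ. */
    static int hammingDistance(BigInteger a, BigInteger b) {
        return a.xor(b).bitCount();
    }

    /** Distance divided by the hash length, so the result falls within [0, 1]. */
    static double normalizedHammingDistance(BigInteger a, BigInteger b, int bitResolution) {
        return hammingDistance(a, b) / (double) bitResolution;
    }

    public static void main(String[] args) {
        BigInteger a = new BigInteger("1011", 2);
        BigInteger b = new BigInteger("1111", 2);
        System.out.println(hammingDistance(a, b));              // 1
        System.out.println(normalizedHammingDistance(a, b, 4)); // 0.25
    }
}
```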
      • hammingDistanceFast

        public int hammingDistanceFast​(Hash h)
        Description copied from class: Hash
        Calculate the hamming distance of 2 hash values. The distance of two hashes is the difference of the individual bits found in the hash.

        The hamming distance falls within [0-bitResolution]. Lower values indicate closer similarity, and identical images must return a score of 0. On the flip side, a score of 0 does not mean images have to be identical!

        A longer hash (higher bitResolution) will increase the average hamming distance returned. While this method allows for the most accurate fine tuning of the distance, Hash.normalizedHammingDistance(Hash) is hash length independent.

        Please be aware that only hashes produced by the same algorithm with the same settings will return meaningful results and should be compared. This method will NOT check if the hashes are compatible.

        Overrides:
        hammingDistanceFast in class Hash
        Parameters:
        h - The hash to calculate the distance to
        Returns:
        similarity value ranging between [0 - hash length]
        See Also:
        Hash.hammingDistance(Hash)
      • hammingDistanceFast

        public int hammingDistanceFast​(BigInteger bInt)
        Description copied from class: Hash
        Calculate the hamming distance of 2 hash values. The distance of two hashes is the difference of the individual bits found in the hash.

        The hamming distance falls within [0-bitResolution]. Lower values indicate closer similarity, and identical images must return a score of 0. On the flip side, a score of 0 does not mean images have to be identical!

        A longer hash (higher bitResolution) will increase the average hamming distance returned. While this method allows for the most accurate fine tuning of the distance, Hash.normalizedHammingDistance(Hash) is hash length independent.

        Please be aware that only hashes produced by the same algorithm with the same settings will return meaningful results and should be compared. This method will NOT check if the hashes are compatible.

        Overrides:
        hammingDistanceFast in class Hash
        Parameters:
        bInt - A big integer representing a hash
        Returns:
        similarity value ranging between [0 - hash length]
        See Also:
        Hash.hammingDistance(Hash)
      • normalizedHammingDistance

        public double normalizedHammingDistance​(Hash h)
        Description copied from class: Hash
        Calculate the hamming distance of 2 hash values. The distance of two hashes is the difference of the individual bits found in the hash.

        The normalized hamming distance falls within [0-1]. Lower values indicate closer similarity, and identical images must return a score of 0. On the flip side, a score of 0 does not mean images have to be identical!

        See Hash.hammingDistance(Hash) for a non-normalized version. Please be aware that only hashes produced by the same algorithm with the same settings will return meaningful results and should be compared. This method will check if the hashes are compatible. If no additional check is required, see Hash.normalizedHammingDistanceFast(Hash).

        Overrides:
        normalizedHammingDistance in class Hash
        Parameters:
        h - The hash to calculate the distance to
        Returns:
        similarity value ranging between [0 - 1]
      • normalizedHammingDistanceFast

        public double normalizedHammingDistanceFast​(Hash h)
        Description copied from class: Hash
        Calculate the hamming distance of 2 hash values. The distance of two hashes is the difference of the individual bits found in the hash.

        The normalized hamming distance falls within [0-1]. Lower values indicate closer similarity, and identical images must return a score of 0. On the flip side, a score of 0 does not mean images have to be identical!

        See Hash.hammingDistance(Hash) for a non-normalized version. Please be aware that only hashes produced by the same algorithm with the same settings will return meaningful results and should be compared. This method will NOT check if the hashes are compatible.

        Overrides:
        normalizedHammingDistanceFast in class Hash
        Parameters:
        h - The hash to calculate the distance to
        Returns:
        similarity value ranging between [0 - 1]
        See Also:
        Hash.hammingDistance(Hash)
      • toByteArray

        public byte[] toByteArray()
        Description copied from class: Hash
        Return the byte representation of the big integer with the leading zero byte stripped if present. The BigInteger class prepends a sign byte if necessary to indicate the signum of the number. Since our hashes are always positive we can get rid of it and reduce the space requirement in our db by 1 byte.

        To reconstruct the big integer value we can simply prepend a [0x00] byte even if it wasn't present in the first place. The constructor BigInteger(byte[]) will take care of it.

        Overrides:
        toByteArray in class Hash
        Returns:
        the byte representation of the big integer without an artificial sign byte.
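
        The sign byte handling described above can be tried out with plain BigInteger values. The class ByteArraySketch and its two helpers are hypothetical; they only mirror the stripping and the prepend-a-zero-byte reconstruction that the description mentions.

```java
import java.math.BigInteger;
import java.util.Arrays;

/** Hypothetical helpers illustrating the sign byte handling; not the library code. */
final class ByteArraySketch {

    /** Drop the leading 0x00 sign byte that BigInteger.toByteArray() may prepend. */
    static byte[] strip(BigInteger value) {
        byte[] raw = value.toByteArray();
        if (raw.length > 1 && raw[0] == 0) {
            return Arrays.copyOfRange(raw, 1, raw.length);
        }
        return raw;
    }

    /** Always prepend 0x00 so the bytes are read back as a non-negative number. */
    static BigInteger restore(byte[] stripped) {
        byte[] padded = new byte[stripped.length + 1];
        System.arraycopy(stripped, 0, padded, 1, stripped.length);
        return new BigInteger(padded);
    }

    public static void main(String[] args) {
        BigInteger hash = new BigInteger("11111111", 2); // 255, highest bit of the byte is set
        System.out.println(hash.toByteArray().length);   // 2, includes the sign byte
        System.out.println(strip(hash).length);          // 1
        System.out.println(restore(strip(hash)));        // 255
    }
}
```

        Without the prepended zero byte, the single byte 0xFF would be interpreted as -1 by BigInteger(byte[]), which is why the reconstruction always pads.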
      • hashCode

        public int hashCode()
        Overrides:
        hashCode in class Hash
      • equals

        public boolean equals​(Object obj)
        Overrides:
        equals in class Hash
      • toImage

        public BufferedImage toImage​(int blockSize)
        Creates a visual representation of the hash mapping the hash values to the section of the rescaled image used to generate the hash.
        Overrides:
        toImage in class Hash
        Parameters:
        blockSize - Stretch factor. Due to rescaling, the image was shrunk down during hash creation.
        Returns:
        A black and white image representing the individual bits of the hash
      • toImage

        public BufferedImage toImage​(int blockSize,
                                     HashingAlgorithm hashingAlgorithm)
        Description copied from class: Hash
        Creates a visual representation of the hash mapping the hash values to the section of the rescaled image used to generate the hash.

        Some hash algorithms may choose to construct their hashes in a non-default manner (e.g. DifferenceHash).

        Overrides:
        toImage in class Hash
        Parameters:
        blockSize - scaling factor of each pixel in the hash. Each bit of the hash will be represented by blockSize*blockSize pixels
        hashingAlgorithm - HashAlgorithm which created this hash.
        Returns:
        A black and white image representing the individual bits of the hash
      • getAddedCount

        public int getAddedCount()
        Return the number of hashes that currently make up this hash. After calling reset the added count will be set to 1.
        Returns:
        the count
      • getMaxUncertainty

        public double getMaxUncertainty​(int bitIndex)
        Return the maximum fuzzy distance for this bit. The maximum distance is the distance to either the 0 bit or the 1 bit, whichever is greater.
        Parameters:
        bitIndex - the index of the bit
        Returns:
        the maximum distance
      • getWeightedDistance

        public double getWeightedDistance​(int bitIndex,
                                          boolean bit)
        Gets the weighted distance of this bit in the range [0-1].
        Parameters:
        bitIndex - the position in the hash
        bit - if true, the distance to a 1 bit will be calculated; if false, the distance to a 0 bit.
        Returns:
        the distance of a fully certain bit to this hash's bit.
      • toFile

        public void toFile​(File saveLocation)
                    throws IOException
        Description copied from class: Hash
        Saves this hash to a file for persistent storage. The hash can later be recovered by calling Hash.fromFile(File);
        Overrides:
        toFile in class Hash
        Parameters:
        saveLocation - the file to save the hash to
        Throws:
        IOException - If an error occurs during file access
      • fromFile

        public static FuzzyHash fromFile​(File source)
                                  throws IOException,
                                         ClassNotFoundException
        Reads a hash from a serialization file and returns it. Only hashes that were saved by the same class using toFile(File) can be read from file.
        Parameters:
        source - The file this hash can be read from.
        Returns:
        a hash object
        Throws:
        IOException - If an error occurs during file read
        ClassNotFoundException - if the class used to serialize this hash can not be found
        Since:
        3.0.0
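
        A round trip through standard Java serialization, which is what the terms Serializable and serialization file point to, can be sketched as follows. The class SerializationSketch and its nested Counters type are hypothetical stand-ins and do not reproduce the library's file format; the sketch also shows that a transient field such as a dirty flag is not written to the file.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical round trip through Java serialization; not the library code. */
final class SerializationSketch {

    /** Stand-in for a serializable hash with one transient flag. */
    static final class Counters implements Serializable {
        private static final long serialVersionUID = 1L;
        int[] bits = {3, 1, -1, 3};
        transient boolean dirty = true; // transient fields are skipped when writing
    }

    static void toFile(Counters value, File target) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(target))) {
            out.writeObject(value);
        }
    }

    static Counters fromFile(File source) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(source))) {
            return (Counters) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("fuzzy", ".ser");
        file.deleteOnExit();
        toFile(new Counters(), file);
        Counters restored = fromFile(file);
        System.out.println(restored.bits.length); // 4
        System.out.println(restored.dirty);       // false, the default value for a boolean
    }
}
```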