Class BloomFilter


  • @Experimental("Not well tested yet")
    public class BloomFilter
    extends PlainImageMatcher
    A bloom filter is an approximate (probabilistic) data structure that checks, in constant time and with fixed space, whether an image is likely in the set or definitely not present.

    False positives are possible; false negatives are not. In other words, if the bloom filter reports that an element is not present, it is certain that the element was never added.

    Images are mapped to buckets by multiple hash functions. Unlike other matchers, the bloom filter only checks for images with the exact same hash, but it does so in constant time, which makes it a good candidate for filtering before more expensive operations are started.

    If duplicate images are to be avoided, choose different hash functions with high bit resolutions.

    A traditional bloom filter cannot check for image similarity. Using a hash function with a very low bit resolution, however, makes it more likely that images with small deviations are hashed to the same bucket, which allows for further investigation.

    The supplied hashing algorithms are only used to seed a PCG random number generator. If more hash functions are required to meet the target false positive probability than the user provided, the supplied algorithms are chained back to back.

    If more algorithms are provided than can be used, the surplus algorithms are discarded with a notice in the console/logger.
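
    The guarantee described above (false positives possible, false negatives impossible) can be illustrated with a minimal, generic bloom filter. This is a sketch and not this class's implementation: the name MiniBloomFilter, the integer mixing in bucket(...), and the use of plain int hash codes instead of images are assumptions made for illustration.

```java
import java.util.BitSet;

/** Illustrative only: a tiny bloom filter over int hash codes, not the library class. */
final class MiniBloomFilter {
    private final BitSet bits;
    private final int size;      // m: number of buckets
    private final int hashCount; // k: probes per element

    MiniBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Derive the i-th bucket from a hash code (simple integer mixing, hypothetical). */
    private int bucket(int hashCode, int i) {
        int h = hashCode * 0x9E3779B9 + i * 0x85EBCA6B;
        h ^= h >>> 16;
        return Math.floorMod(h, size);
    }

    /** Set all k buckets that belong to this hash code. */
    void add(int hashCode) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(bucket(hashCode, i));
        }
    }

    /** false: definitely never added. true: possibly added (may be a false positive). */
    boolean mightContain(int hashCode) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(bucket(hashCode, i))) {
                return false;
            }
        }
        return true;
    }
}
```

    Because add only ever sets bits and mightContain reads the same buckets, an element that was added can never be reported as absent. A true result only means that all of its buckets happen to be set, possibly by other elements.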

    Since:
    3.0.0
    Author:
    Kilian
    • Constructor Detail

      • BloomFilter

        public BloomFilter(int expectedElements,
                           int bits)
        Create a bloom filter with the given number of expected elements and bit size. Exceeding the number of expected elements quickly increases the false positive probability.
        Parameters:
        expectedElements - the maximum number of elements to be added to this set
        bits - the number of bits used to store hashes
      • BloomFilter

        public BloomFilter(int expectedElements,
                           double desiredFalsePositiveProbability)
        Create a bloom filter for the given number of expected elements and the desired false positive probability.
        Parameters:
        expectedElements - the maximum number of elements to be added to this set
        desiredFalsePositiveProbability - the probability of a false positive once the number of distinct elements added to the filter reaches expectedElements. Range: (0, 1]
    • Method Detail

      • addHashingAlgorithm

        public boolean addHashingAlgorithm(HashingAlgorithm hashingAlgorithm)
        Added hashing algorithms are queried to seed the internal RNG. If the user does not provide enough hashing algorithms, the existing algorithms are chained back to back.

        Hashing algorithms may only be added and removed as long as no image has been added to this set.

        If more algorithms are provided than are needed to meet the false positive target, the algorithms added later may be discarded.

        Append a new hashing algorithm to be used by this matcher. The same algorithm may only be added once; attempting to add the same algorithm twice is a no-op.

        For some matchers the order of added hashing algorithms is crucial. The order in which the algorithms are added is preserved.

        Overrides:
        addHashingAlgorithm in class PlainImageMatcher
        Parameters:
        hashingAlgorithm - the algorithm to be added
        Returns:
        true if the algorithm was added, false if it was already present
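
        One plausible reading of "chained back to back" and of discarding surplus algorithms is a round-robin assignment of the supplied algorithms to the k hash functions the filter needs. The sketch below shows only that reading; the class and method names are hypothetical, and the text above does not confirm that the library assigns algorithms this way.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: assigns supplied algorithms to k hash function slots. */
final class AlgorithmChaining {
    /**
     * Slot i is backed by supplied.get(i % supplied.size()).
     * With fewer algorithms than slots, the list repeats in insertion order.
     * With more algorithms than slots, those at index k and above are never used.
     */
    static <T> List<T> assign(List<T> supplied, int k) {
        if (supplied.isEmpty()) {
            throw new IllegalArgumentException("at least one algorithm is required");
        }
        List<T> assigned = new ArrayList<>(k);
        for (int i = 0; i < k; i++) {
            assigned.add(supplied.get(i % supplied.size()));
        }
        return assigned;
    }
}
```

        With two supplied algorithms a and b and k = 5, the slots become a, b, a, b, a. With three supplied algorithms and k = 2, the third one is never used, which corresponds to the "later added algorithms may be discarded" case.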
      • removeHashingAlgorithm

        public boolean removeHashingAlgorithm(HashingAlgorithm hashingAlgorithm)
        Hashing algorithms may only be added and removed as long as no image has been added to this set.

        Remove a hashing algorithm from this matcher.

        Overrides:
        removeHashingAlgorithm in class PlainImageMatcher
        Parameters:
        hashingAlgorithm - the algorithm to be removed
        Returns:
        true if the algorithm was removed, false if it was not present.
      • clearHashingAlgorithms

        public void clearHashingAlgorithms()
        Hashing algorithms may only be added and removed as long as no image has been added to this set.

        Remove all hashing algorithms used by this image matcher instance. At least one algorithm has to be supplied before images can be processed.

        Overrides:
        clearHashingAlgorithms in class PlainImageMatcher
      • isPresent

        public boolean isPresent(File file)
                          throws IOException
        Checks whether the hash created from this image by the added hashing algorithms was already added to the bloom filter. A positive result is subject to the current false positive probability as returned by getFalsePositiveProbability().
        Parameters:
        file - the image file to check
        Returns:
        false if the image is definitely not in the filter, true if the image might be in the set.
        Throws:
        IOException - if an error occurs during file reading
      • isPresent

        public boolean isPresent(BufferedImage image)
        Checks whether the hash created from this image by the added hashing algorithms was already added to the bloom filter. A positive result is subject to the current false positive probability as returned by getFalsePositiveProbability().
        Parameters:
        image - the image to check
        Returns:
        false if the image is definitely not in the filter, true if the image might be in the set.
      • imageToHashCode

        protected int imageToHashCode(HashingAlgorithm hasher,
                                      BufferedImage image)
        Convert the image to a numerical hashcode
        Parameters:
        hasher - the hashing algorithm to use
        image - the image to hash
        Returns:
        an integer hashcode
      • getBucket

        protected int getBucket(int hashCode,
                                int i)
        Map an integer to a bucket of the bloom filter
        Parameters:
        hashCode - the integer used as seed
        i - secondary seed value
        Returns:
        index of a bucket
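
        A minimal sketch of how a hash code and a secondary seed could be mapped to a bucket by seeding a random number generator. java.util.SplittableRandom stands in for the PCG generator mentioned in the class description, and the extra bucketCount parameter, the seed layout, and the class name are assumptions. The real method has only two parameters and may combine them differently.

```java
import java.util.SplittableRandom;

/** Illustrative only: deterministic mapping from (hashCode, i) to a bucket index. */
final class BucketMapping {
    /**
     * Returns an index in [0, bucketCount). The same inputs always give the same index,
     * which is what lets a later lookup probe the buckets that were set during insertion.
     */
    static int bucket(int hashCode, int i, int bucketCount) {
        // Pack both values into one 64 bit seed: hashCode in the high half, i in the low half.
        long seed = ((long) hashCode << 32) | (i & 0xFFFFFFFFL);
        // Stand-in generator; the class description names a PCG generator instead.
        return new SplittableRandom(seed).nextInt(bucketCount);
    }
}
```

        Calling bucket(hashCode, i, m) for i = 0 .. k-1 yields the k buckets of one element. Different values of i act as different hash functions derived from the same hash code.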
      • checkLockState

        protected void checkLockState()
        Check if modification of the filter is allowed.
      • lock

        protected void lock()
        Lock the bloom filter and set variables to fixed values. This will prevent hashing algorithms from being added to the filter.
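
        The interplay of checkLockState() and lock() with the rule that algorithms may only change before the first image is added can be sketched as follows. The class name, the use of String names in place of HashingAlgorithm, the addElement() method, and the IllegalStateException are assumptions; the text does not state how the library reports a violation.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative only: configuration that freezes once the first element is stored. */
final class LockableAlgorithmSet {
    // LinkedHashSet preserves insertion order and rejects duplicates.
    private final Set<String> algorithms = new LinkedHashSet<>();
    private boolean locked;

    /** Returns true if added, false if already present; fails once locked. */
    boolean addAlgorithm(String name) {
        checkLockState();
        return algorithms.add(name);
    }

    /** Storing the first element freezes the configuration. */
    void addElement() {
        if (algorithms.isEmpty()) {
            throw new IllegalStateException("add at least one algorithm first");
        }
        lock();
        // Bits for the element would be set here.
    }

    /** Hypothetical behaviour: reject modification after locking. */
    void checkLockState() {
        if (locked) {
            throw new IllegalStateException("algorithms cannot change after the first element was added");
        }
    }

    void lock() {
        locked = true;
    }
}
```

        Freezing the configuration matters because bits set under one set of hash functions cannot be found again under another; changing the algorithms later would introduce false negatives.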
      • bitsSet

        public int bitsSet()
        Returns:
        the number of buckets set
      • bucketsSet

        public double bucketsSet()
        This method will return usable results after the first image has been added
        Returns:
        the fraction of buckets set, in the range [0, 1]
      • getApproximateDistinctElementsInFilter

        public double getApproximateDistinctElementsInFilter()
        Return the approximate number of distinct elements added to this set.

        This method will return usable results after the first image has been added

        Returns:
        the approximate number of elements added
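
        A commonly used estimate for the number of distinct elements in a bloom filter with m buckets, k hash functions, and X set bits is n ≈ -(m / k) · ln(1 - X / m). The sketch below implements that formula together with the set-bit fraction; whether this class uses exactly this estimator is an assumption, and the class and method names are hypothetical.

```java
import java.util.BitSet;

/** Illustrative only: fill level and cardinality estimate for a bloom filter. */
final class FillEstimate {
    /** Fraction of buckets set, in [0, 1]. */
    static double fractionSet(BitSet bits, int m) {
        return bits.cardinality() / (double) m;
    }

    /**
     * Estimate of distinct elements: -(m / k) * ln(1 - bitsSet / m).
     * Yields 0 for an empty filter and positive infinity once every bucket is set.
     */
    static double approximateDistinct(int bitsSet, int m, int k) {
        return -((double) m / k) * Math.log(1.0 - (double) bitsSet / m);
    }
}
```

        The estimate grows faster than the number of set bits because, as the filter fills up, more and more insertions hit buckets that are already set.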
      • getFalsePositiveProbability

        public double getFalsePositiveProbability()
        Get the current false positive probability of the filter, using the estimate from getApproximateDistinctElementsInFilter() as the number of elements currently in the filter.

        This method will return usable results after the first image has been added

        Returns:
        the probability of a false positive [0-1]
      • getFalsePositiveProbability

        public double getFalsePositiveProbability(double numberOfElements)
        Get the false positive probability of the filter once it holds the specified number of distinct added items.

        This method will return usable results after the first image has been added

        Parameters:
        numberOfElements - the number of distinct elements assumed to be in the filter
        Returns:
        the probability of a false positive [0-1]
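
        The textbook approximation for the false positive probability of a bloom filter with m buckets and k hash functions after n distinct insertions is p = (1 - e^(-k·n/m))^k. The sketch below evaluates that formula; it is assumed, not confirmed by the text, that this method follows the same approximation, and the class and method names are hypothetical.

```java
/** Illustrative only: textbook false positive probability of a bloom filter. */
final class FalsePositiveRate {
    /**
     * @param m number of buckets (bits) in the filter
     * @param k number of hash functions
     * @param n number of distinct elements assumed to be in the filter
     * @return (1 - e^(-k * n / m))^k, a value in [0, 1]
     */
    static double probability(int m, int k, double n) {
        return Math.pow(1.0 - Math.exp(-k * n / m), k);
    }
}
```

        For m = 9586 and k = 7 the formula gives about 1% at n = 1000 and roughly 16% at n = 2000, which illustrates how quickly the probability rises once the expected element count is exceeded.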
      • getOptimalNumberHashFunctions

        public static int getOptimalNumberHashFunctions(int m,
                                                        int n)
        Get the optimal number of hash functions needed to minimize false positive errors.
        Parameters:
        m - bit size of the bloom filter
        n - the maximum number of elements added
        Returns:
        k, the number of hash functions that should be used to minimize false positives.
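
        For a filter with m bits and n expected elements, the false positive probability is minimized at k = (m / n) · ln 2. The sketch below rounds that value to the nearest integer and never returns less than one; the rounding mode, the lower bound, and the names are assumptions and may differ from what this method does.

```java
/** Illustrative only: textbook optimum for the number of hash functions. */
final class OptimalHashCount {
    /**
     * @param m bit size of the filter
     * @param n maximum number of elements expected
     * @return (m / n) * ln 2, rounded to the nearest integer, at least 1
     */
    static int optimalHashCount(int m, int n) {
        long rounded = Math.round((double) m / n * Math.log(2));
        return (int) Math.max(1L, rounded);
    }
}
```

        With m = 9586 and n = 1000 the unrounded value is about 6.64, so seven hash functions are used.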
      • getOptimalBitSizeOfFilter

        public static int getOptimalBitSizeOfFilter(double p,
                                                    int n)
        Get the optimal number of bits needed to achieve the target false positive probability when the set is filled to maximum capacity, assuming the optimal k value as calculated by getOptimalNumberHashFunctions(int, int) is used.
        Parameters:
        p - the desired false positive probability, in the range (0, 1]
        n - the maximum number of elements added to the set
        Returns:
        m, the optimal bit size of the bloom filter
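
        When the optimal k is used, the required bit size follows from m = -n · ln p / (ln 2)². The sketch below evaluates that formula and rounds up; the rounding, the argument check, the lower bound of one bit, and the names are assumptions rather than documented behaviour of this method.

```java
/** Illustrative only: textbook optimum for the bit size of a bloom filter. */
final class OptimalBitSize {
    /**
     * @param p desired false positive probability, in (0, 1]
     * @param n maximum number of elements expected
     * @return -n * ln(p) / (ln 2)^2, rounded up, at least 1
     */
    static int optimalBitSize(double p, int n) {
        if (!(p > 0.0 && p <= 1.0)) {
            throw new IllegalArgumentException("p must be in (0, 1]");
        }
        double ln2 = Math.log(2);
        double bits = Math.ceil(-n * Math.log(p) / (ln2 * ln2));
        return (int) Math.max(1.0, bits);
    }
}
```

        For p = 0.01 and n = 1000 this gives a little under 9600 bits, or roughly 9.6 bits per element. Each tenfold reduction of p adds about 4.8 bits per element.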