Class BloomFilter
- java.lang.Object
-
- dev.brachtendorf.jimagehash.matcher.PlainImageMatcher
-
- dev.brachtendorf.jimagehash.matcher.exotic.BloomFilter
-
@Experimental("Not well tested yet") public class BloomFilter extends PlainImageMatcher
A bloom filter is an approximate data structure with constant space and time guarantees that checks whether an image is likely in the set or definitely not present. False positives are possible, false negatives are not. In other words, if the bloom filter reports that an element is not present, it is certain that the element was never added.
Images are mapped by multiple hash functions. In contrast to other matchers, the bloom filter only checks for images with exactly the same hash, but does so in constant time, which makes it a good candidate for filtering before more expensive operations are started.
If duplicate images are to be avoided, choose different hash functions with high bit resolutions.
While a traditional bloom filter cannot check image similarity, using a hash function with a very low bit resolution makes it more likely that images with small deviations are hashed to the same bucket, which allows for further investigation.
The supplied hashing algorithms are only used to seed a PcgRandom number generator. If the user provides fewer hash functions than are required to reach the target false positive probability, the algorithms are chained back to back. If more algorithms are provided than can be used, the surplus algorithms are discarded with a notice in the console/logger.
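The seeding idea described above can be illustrated with a minimal, self-contained sketch. It is not the JImageHash implementation: `MiniBloomSketch` is a hypothetical class, `java.util.Random` stands in for the PCG generator, and a single integer hash code per element replaces the chained hashing algorithms.

```java
import java.util.BitSet;
import java.util.Random;

/** Minimal bloom filter sketch (hypothetical, not the library's code). */
class MiniBloomSketch {
    private final BitSet bits;
    private final int m; // number of buckets
    private final int k; // bucket indices derived per element

    MiniBloomSketch(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    /** Derive k bucket indices by seeding a generator with the hash code. */
    private int[] buckets(int hashCode) {
        // Assumption: java.util.Random as a stand-in for the PCG generator.
        Random rng = new Random(hashCode);
        int[] indices = new int[k];
        for (int i = 0; i < k; i++) {
            indices[i] = rng.nextInt(m);
        }
        return indices;
    }

    /** Set all buckets belonging to this hash code. */
    void add(int hashCode) {
        for (int b : buckets(hashCode)) {
            bits.set(b);
        }
    }

    /** false: definitely never added. true: possibly added. */
    boolean mightContain(int hashCode) {
        for (int b : buckets(hashCode)) {
            if (!bits.get(b)) {
                return false;
            }
        }
        return true;
    }
}
```

Because the same seed always yields the same bucket indices, every added hash code finds all of its buckets set again, which is why false negatives cannot occur in this sketch.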
- Since:
- 3.0.0
- Author:
- Kilian
-
-
Field Summary
-
Fields inherited from class dev.brachtendorf.jimagehash.matcher.PlainImageMatcher
steps
-
-
Constructor Summary
Constructors
- BloomFilter(int expectedElements, double desiredFalsePositiveProbability)
- BloomFilter(int expectedElements, int bits): Create a bloom filter with expected elements and bit size.
-
Method Summary
Modifier and Type, Method: Description
- boolean addHashingAlgorithm(HashingAlgorithm hashingAlgorithm): Added hashing algorithms will be queried to seed the internal rng.
- void addImage(BufferedImage image): Adds the hashes produced by the added hashing algorithm of this image to the filter.
- void addImage(File image): Adds the hashes produced by the added hashing algorithm of this image to the filter.
- int bitsSet()
- double bucketsSet(): This method will return usable results after the first image has been added.
- protected void checkLockState(): Check if modification of the filter is allowed.
- void clearHashingAlgorithms(): Hashing algorithms may only be added and removed as long as no image has been added to this set.
- double getApproximateDistinctElementsInFilter(): Return the approximate number of distinct elements added to this set.
- protected int getBucket(int hashCode, int i): Map an integer to a bucket of the bloom filter.
- double getFalsePositiveProbability(): Get the current false positive probability of the filter using the getApproximateDistinctElementsInFilter() guess to assume the number of elements currently in the filter.
- double getFalsePositiveProbability(double numberOfElements): Get the false positive probability of the filter once the filter reached the specified amount of distinct added items.
- static int getOptimalBitSizeOfFilter(double p, int n): Assuming the optimal k value used as calculated by getOptimalNumberHashFunctions(int, int).
- static int getOptimalNumberHashFunctions(int m, int n): Get the optimal number of hash functions needed to reduce false positive errors.
- protected int imageToHashCode(HashingAlgorithm hasher, BufferedImage image): Convert the image to a numerical hashcode.
- boolean isPresent(BufferedImage image): Checks if the hash created by this image utilizing the added hashing algorithms was already added to the bloom filter with the current false positive probability as returned by getFalsePositiveProbability().
- boolean isPresent(File file): Checks if the hash created by this image utilizing the added hashing algorithms was already added to the bloom filter with the current false positive probability as returned by getFalsePositiveProbability().
- protected void lock(): Lock the bloom filter and set variables to fixed values.
- boolean removeHashingAlgorithm(HashingAlgorithm hashingAlgorithm): Hashing algorithms may only be added and removed as long as no image has been added to this set.
-
Methods inherited from class dev.brachtendorf.jimagehash.matcher.PlainImageMatcher
equals, getAlgorithms, hashCode
-
-
-
-
Constructor Detail
-
BloomFilter
public BloomFilter(int expectedElements, int bits)
Create a bloom filter with expected elements and bit size. Exceeding the number of expected elements will quickly degrade the false positive probability.
- Parameters:
expectedElements - the maximum number of elements to be added to this set
bits - the number of bits used to store hashes
-
BloomFilter
public BloomFilter(int expectedElements, double desiredFalsePositiveProbability)
- Parameters:
expectedElements - the maximum number of elements to be added to this set
desiredFalsePositiveProbability - the probability of an element being considered a false positive once the number of distinct elements added to the filter reaches the expectedElements count. Range (0-1]
-
-
Method Detail
-
addHashingAlgorithm
public boolean addHashingAlgorithm(HashingAlgorithm hashingAlgorithm)
Added hashing algorithms will be queried to seed the internal rng. If the user does not provide enough hashing algorithms, the existing algorithms will be chained back to back. Hashing algorithms may only be added and removed as long as no image has been added to this set.
If more algorithms are provided than are needed to comply with the false positive target, the algorithms added later may be discarded.
Append a new hashing algorithm to be used by this matcher. The same algorithm may only be added once. Attempting to add the same algorithm twice is a no-op.
For some matchers the order of added hashing algorithms is crucial. The order in which the hashes are added is preserved.
- Overrides:
addHashingAlgorithm in class PlainImageMatcher
- Parameters:
hashingAlgorithm - The algorithm to be added
- Returns:
- true if the algorithm was added, false if it was already present
-
removeHashingAlgorithm
public boolean removeHashingAlgorithm(HashingAlgorithm hashingAlgorithm)
Hashing algorithms may only be added and removed as long as no image has been added to this set. Remove a hashing algorithm from this matcher.
- Overrides:
removeHashingAlgorithm in class PlainImageMatcher
- Parameters:
hashingAlgorithm - The algorithm to be removed
- Returns:
- true if the algorithm was removed, false if it was not present.
-
clearHashingAlgorithms
public void clearHashingAlgorithms()
Hashing algorithms may only be added and removed as long as no image has been added to this set. Remove all hashing algorithms used by this image matcher instance. At least one algorithm has to be supplied before images can be processed.
- Overrides:
clearHashingAlgorithms in class PlainImageMatcher
-
isPresent
public boolean isPresent(File file) throws IOException
Checks if the hash created by this image utilizing the added hashing algorithms was already added to the bloom filter with the current false positive probability as returned by getFalsePositiveProbability().
- Parameters:
file - the file of the image to check
- Returns:
- false if the image is definitely not in the filter, true if the image might be in the set.
- Throws:
IOException- if an error occurs during file reading
-
isPresent
public boolean isPresent(BufferedImage image)
Checks if the hash created by this image utilizing the added hashing algorithms was already added to the bloom filter with the current false positive probability as returned by getFalsePositiveProbability().
- Parameters:
image - the image to check
- Returns:
- false if the image is definitely not in the filter, true if the image might be in the set.
-
addImage
public void addImage(File image) throws IOException
Adds the hashes produced by the added hashing algorithm of this image to the filter. Future calls to isPresent(File) will return true for the image. Adding an image to the filter will prevent further modifications of the filter in terms of calling addHashingAlgorithm(HashingAlgorithm), removeHashingAlgorithm(HashingAlgorithm) or clearHashingAlgorithms().
- Parameters:
image - The image to add
- Throws:
IOException- if an error occurs during file reading.
-
addImage
public void addImage(BufferedImage image)
Adds the hashes produced by the added hashing algorithm of this image to the filter. Future calls to isPresent(File) will return true for the image. Adding an image to the filter will prevent further modifications of the filter in terms of calling addHashingAlgorithm(HashingAlgorithm), removeHashingAlgorithm(HashingAlgorithm) or clearHashingAlgorithms().
- Parameters:
image - The image to add
-
imageToHashCode
protected int imageToHashCode(HashingAlgorithm hasher, BufferedImage image)
Convert the image to a numerical hashcode.
- Parameters:
hasher - the hashing algorithm to use
image - the image to hash
- Returns:
- an integer hashcode
-
getBucket
protected int getBucket(int hashCode, int i)
Map an integer to a bucket of the bloom filter.
- Parameters:
hashCode - the integer used as seed
i - secondary seed value
- Returns:
- index of a bucket
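
One way to map a primary and a secondary seed to a bucket index is sketched below. This is an illustration only and makes no claim about how getBucket is implemented in the library: the class `BucketMappingSketch`, the explicit `bucketCount` parameter, and the use of the SplitMix64 finalizer as mixing function are all assumptions.

```java
/** Hypothetical sketch of a seed-to-bucket mapping. */
class BucketMappingSketch {

    /** SplitMix64 finalizer: spreads the input bits over the whole 64-bit word. */
    static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /**
     * Combine the hash code (primary seed) and i (secondary seed) into one
     * 64-bit value, mix it, and reduce it to the range [0, bucketCount).
     */
    static int getBucket(int hashCode, int i, int bucketCount) {
        long combined = ((long) hashCode << 32) | (i & 0xFFFFFFFFL);
        // floorMod keeps the result non-negative even for negative mixed values.
        return (int) Math.floorMod(mix64(combined), (long) bucketCount);
    }
}
```

Calling the function with i = 0, 1, ..., k - 1 for the same hash code yields the k bucket indices of one element.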
-
checkLockState
protected void checkLockState()
Check if modification of the filter is allowed.
-
lock
protected void lock()
Lock the bloom filter and set variables to fixed values. This will prevent hashing algorithms from being added to the filter.
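
The interplay of lock() and checkLockState() can be pictured with the following sketch. It is a guess at the general pattern, not the library's code: `LockableFilterSketch` is hypothetical, algorithms are represented by plain strings, and throwing an IllegalStateException on a locked filter is an assumption the source does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a filter that freezes its configuration on first use. */
class LockableFilterSketch {
    private final List<String> algorithms = new ArrayList<>();
    private boolean locked = false;

    /** Assumption: a locked filter rejects modification with an exception. */
    protected void checkLockState() {
        if (locked) {
            throw new IllegalStateException("Filter is locked: an element was already added");
        }
    }

    /** Freeze the configuration. */
    protected void lock() {
        locked = true;
    }

    /** Returns true if the algorithm was added, false if already present. */
    boolean addAlgorithm(String name) {
        checkLockState();
        if (algorithms.contains(name)) {
            return false;
        }
        return algorithms.add(name);
    }

    /** Adding the first element locks the filter. */
    void addElement() {
        lock();
        // ... bucket updates would happen here ...
    }
}
```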
-
bitsSet
public int bitsSet()
- Returns:
- the number of buckets set
-
bucketsSet
public double bucketsSet()
This method will return usable results after the first image has been added.
- Returns:
- the percentage of buckets set. [0-1]
-
getApproximateDistinctElementsInFilter
public double getApproximateDistinctElementsInFilter()
Return the approximate number of distinct elements added to this set. This method will return usable results after the first image has been added.
- Returns:
- the approximate number of elements added
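
A common way to estimate the number of distinct elements from the fill level of a bloom filter is n ≈ -(m / k) · ln(1 - X / m), with X set bits, m total bits and k hash functions. Whether the library uses exactly this estimator is an assumption. The class and parameter names below are hypothetical.

```java
/** Hypothetical sketch of a fill-level based cardinality estimate. */
class CardinalityEstimateSketch {

    /**
     * @param bitsSet number of buckets currently set (X)
     * @param m       total number of buckets
     * @param k       number of hash functions
     * @return estimated number of distinct elements added so far
     */
    static double approximateDistinctElements(int bitsSet, int m, int k) {
        if (bitsSet == 0) {
            return 0.0; // nothing added yet
        }
        // Diverges to infinity when every bucket is set (bitsSet == m).
        return -((double) m / k) * Math.log(1.0 - (double) bitsSet / m);
    }
}
```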
-
getFalsePositiveProbability
public double getFalsePositiveProbability()
Get the current false positive probability of the filter using the getApproximateDistinctElementsInFilter() guess to assume the number of elements currently in the filter. This method will return usable results after the first image has been added.
- Returns:
- the probability of a false positive [0-1]
-
getFalsePositiveProbability
public double getFalsePositiveProbability(double numberOfElements)
Get the false positive probability of the filter once the filter has reached the specified number of distinct added items. This method will return usable results after the first image has been added.
- Parameters:
numberOfElements - the number of distinct elements assumed to be in the filter
- Returns:
- the probability of a false positive [0-1]
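
The textbook approximation for the false positive probability of a bloom filter with m bits, k hash functions and n distinct elements is p ≈ (1 - e^(-k·n/m))^k. The sketch below evaluates it. The class name and the explicit m and k parameters are hypothetical, since the real filter keeps these values internally, and the library's exact formula is not confirmed by this page.

```java
/** Hypothetical sketch of the standard false positive approximation. */
class FalsePositiveSketch {

    /**
     * @param n number of distinct elements in the filter
     * @param m total number of bits
     * @param k number of hash functions
     * @return approximate false positive probability in [0, 1]
     */
    static double falsePositiveProbability(double n, int m, int k) {
        return Math.pow(1.0 - Math.exp(-k * n / m), k);
    }
}
```

With m = 9586 and k = 7 the value is close to 1% at 1000 elements and rises by more than a factor of ten at 2000 elements, which illustrates why exceeding the expected element count degrades the filter quickly.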
-
getOptimalNumberHashFunctions
public static int getOptimalNumberHashFunctions(int m, int n)
Get the optimal number of hash functions needed to reduce false positive errors.
- Parameters:
m - bit size of the bloom filter
n - maximum number of elements added
- Returns:
- k the number of hash functions that should be used to reduce false positives.
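
For reference, the usual closed form for the optimal number of hash functions is k = (m / n) · ln 2. The sketch below rounds to the nearest integer and enforces a minimum of one. The rounding rule and the class name are assumptions and may differ from the library.

```java
/** Hypothetical sketch of the textbook formula for k. */
class OptimalHashCountSketch {

    /**
     * @param m bit size of the filter
     * @param n maximum number of elements
     * @return number of hash functions, at least 1
     */
    static int optimalNumberHashFunctions(int m, int n) {
        long k = Math.round((double) m / n * Math.log(2.0));
        return (int) Math.max(1L, k);
    }
}
```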
-
getOptimalBitSizeOfFilter
public static int getOptimalBitSizeOfFilter(double p, int n)
Get the optimal number of bits to achieve the target probability when the set is filled to maximum capacity, assuming the optimal k value as calculated by getOptimalNumberHashFunctions(int, int) is used.
- Parameters:
p - the desired false positive probability (0-1]
n - maximum number of elements added to the set
- Returns:
- m the optimal bit size of the bloom filter
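
When the optimal k is used, the textbook bit size for a target probability p and n elements is m = -n · ln p / (ln 2)². The following sketch applies that formula and rounds up. The class name, the rounding and the argument check are assumptions rather than the library's confirmed behaviour.

```java
/** Hypothetical sketch of the textbook formula for m. */
class OptimalBitSizeSketch {

    /**
     * @param p desired false positive probability, range (0, 1]
     * @param n maximum number of elements
     * @return number of bits needed to reach p at n elements
     */
    static int optimalBitSize(double p, int n) {
        if (p <= 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must be in (0, 1]");
        }
        double ln2 = Math.log(2.0);
        return (int) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }
}
```

For p = 0.01 and n = 1000 this yields roughly 9.6 bits per element.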
-
-