Abstract
This is my second post. Here I'd like to record some key learnings from a video (linked at the bottom) about the Bloom filter, a popular technique for checking whether a collection contains a specific target by its key.
From wikipedia: A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives.
Key Learnings
Here are some of the key learnings:
- Bloom filters are a probabilistic data structure that can give false positives but not false negatives. In other words, when checking membership, a Bloom filter may say Yes while the truth is No, but it will never say No while the truth is Yes.
- A Bloom filter works by recording a key in a bit array. The input key is run through one or more hash functions, and each result is reduced modulo the array size to fit in the underlying array. Checking for the key is an O(1) operation, since the same hashing computes the exact same indexes.
- A collision may occur when a key that was never inserted hashes to an index already set by another, already inserted item; this is the source of false positives.
- A Bloom filter is space-efficient for several reasons.
- It does not need to store the entire object, only the key's hashed bits.
- Each key is mapped to an index in a fixed-size array, which is commonly much smaller than the original dataset.
- A byte buffer can optimize space consumption further. Basically, we can switch from Boolean[] to Byte[], where each byte in the array represents 8 slots; bit manipulation techniques then reduce space usage by a factor of 8.
- Using multiple hash functions reduces false positives. Multiple hash functions produce a set of bits that must all be set on insert and all be checked on lookup, which lowers the collision rate.
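The ideas above can be sketched in a minimal Python class. This is an illustrative toy, not a production implementation: the bit count, hash count, and the salted-MD5 indexing scheme are all my own assumptions, chosen only to show the bytearray packing and multi-hash lookup described in the bullets.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bytearray of num_bits bits and num_hashes hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)  # each byte packs 8 slots

    def _indexes(self, key):
        # Derive several indexes by salting the key; MD5 is used here
        # only for its even distribution, not for security.
        for salt in range(self.num_hashes):
            digest = hashlib.md5(f"{salt}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for idx in self._indexes(key):
            self.bits[idx // 8] |= 1 << (idx % 8)  # set one bit via bit manipulation

    def might_contain(self, key):
        # True means "possibly in set"; False means "definitely not in set".
        return all(self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(key))
```

Adding a key sets its bits, and a lookup only succeeds if every one of those bits is set, which is why a miss is always definitive while a hit is only probable.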
:P 50% of the above summary was generated with a GPT-backed tool. The only thing I needed to do was feed it the YouTube link.
When to use Bloom Filter
I've seen Spark leverage Bloom filters to speed up joins between tables as well as filters on a single table, which are quite common scenarios.
By using a Bloom filter, one can answer whether a key possibly exists in a collection. Assuming a 1% false positive rate, one can eliminate 99% of the candidates up front after creating and possibly transferring a very small amount of data (the Bloom filter itself). In practice, the false positive rate can be much smaller.
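The false positive rate behind claims like "1%" can be estimated with the standard Bloom filter approximation, where m is the number of bits, n the number of inserted items, and k the number of hash functions. The specific sizes below (10 bits per item, 7 hash functions) are example parameters I chose, not values from the video.

```python
import math


def false_positive_rate(m_bits, n_items, k_hashes):
    # Standard approximation: p = (1 - e^(-k*n/m))^k
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


# With roughly 10 bits per item and 7 hash functions,
# the estimated false positive rate is under 1%.
rate = false_positive_rate(m_bits=10_000_000, n_items=1_000_000, k_hashes=7)
```

This shows why the filter itself stays tiny relative to the dataset: even at 10 bits per key, a million-key filter is about 1.25 MB, cheap to build and transfer before a join.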
Credit to the YouTube video: https://www.youtube.com/watch?v=UVFnabieyzc&ab_channel=AsliEngineeringbyArpitBhayani