Bloom Filter

vishal rana
4 min readMar 4, 2023

--

“A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set” — Wiki

We’ll understand it in more depth . Let’s get started.

Bloom filter is a probabilistic data structure used for fast set membership testing. It provides a space-efficient way to represent a set of items, and can answer queries on whether an item is in the set or not with high probability.

A Bloom filter is essentially a bit array of length ‘m’ and k hash functions which map an item to one of ‘m’ bits.

To add an item, it is hashed by each of the k hash functions and the corresponding bits in the bit array are set to 1.

To query whether an item is in the set, it is hashed by each of the k hash functions and if any of the corresponding bits in the bit array are 0, then the item is definitely not in the set. If all the bits are 1, then the item is probably in the set.

Bloom filters are space efficient as they use bit array which are very light weight then array that we use in hash tables or binary tree.

Let try to understand with an example:

Let say you are keeping track of what users are using your blog, and for each user you want to keep track that user can access the blog only one time. So the approach will be to hash the user id and map that to the bit array.

When you try query the userID from the bloom filter. The bloom filter can give you two possible outcomes.

  • Either firm NO
  • Or Probably Yes

You cam implement the same in other use cases as well. Bloom filter is easy.

Bloom filters are less accurate as they are probabilistic Data structure but they consume less memory.

Bloom filter can be an ideal use when there is:

  • OK to have False positive
  • Not OK to have False Negative

False positive:False positive occurs in Bloom Filters when the filter returns “Yes, the element is in the set” for an element that is not actually present in the set.

False Negative: False negatives occur when the Bloom filter incorrectly indicates that an item is not in the set, when in fact it is.

Let see the implementation of bloom filter in Nodejs:

we are using crypto library to calculate hash.

const crypto = require('crypto');

class BloomFilter {
constructor(m, k) {
this.m = m; // size of the bit array
this.k = k; // number of hash functions
this.bits = Buffer.alloc(Math.ceil(m / 8), 0);
}

// adding a new element to the set
add(item) {
for (let i = 0; i < this.k; i++) {
const digest = crypto.createHash('sha256')
.update(i.toString())
.update(item)
.digest();
const index = digest.readUInt32BE() % this.m;
this.bits[Math.floor(index / 8)] |= 1 << (index % 8);
}
}

// query an element from the set
query(item) {
// iterating for given number of hash functions to set the bit for those index to 1
for (let i = 0; i < this.k; i++) {
const digest = crypto.createHash('sha256')
.update(i.toString())
.update(item)
.digest();
const index = digest.readUInt32BE() % this.m; // will create a index for this value using this hash function
if ((this.bits[Math.floor(index / 8)] & (1 << (index % 8))) === 0) {
return false;
}
}
return true;
}
}

In this example, we use a Buffer to represent the bit array and the crypto module to generate hash functions. The addmethod hashes the item using k different hash functions and sets the corresponding bits in the bit array to 1. The query method hashes the item using the same k hash functions and checks whether all the corresponding bits in the bit array are 1.

Integrating it

const filter = new BloomFilter(10000, 3);
filter.add('Apple');
filter.add('Oranges');
console.log(filter.query('Apple')); // true
console.log(filter.query('Oranges')); // true (with some probability)
console.log(filter.query('Kiwi')); // false

Suppose this is how apple and oranges map to the bit array

Let say we need to query Apple

Hash functions will calculate the hash and then find the value in the array. In our case it will say Element exist as all the values in the array are 1.

Let query “KIWI”

Algorithm will return False, as we have bit set to 0 at 9TH position.

Where are Bloom filters used:

  1. Google Bigtable uses bloom filters to reduce disk lookups for non-existent rows or columns.
  2. Apache HBase uses bloom filters to reduce the number of disk seeks for non-existent row or column keys.
  3. The Squid proxy server uses bloom filters to speed up access controls by avoiding expensive disk lookups.
  4. The Bitcoin network uses bloom filters to reduce the number of false positive matches when searching for unspent transaction outputs.
  5. The Cassandra database uses bloom filters to avoid hitting the disk for non-existent data.

I hope this gives you an idea, how bloom filters work and where they can be useful.

Stay tuned for more to come.

--

--

No responses yet