Friday, October 21, 2011

Bloom filter implementation in C++ - NikBloomFilter

Bloom filter can be considered as a succinct representation of a set of elements. Set of elements might be stored in files or in Database tables, etc. And a set may support operations like insertion, deletion and retrieving elements.

For this discussion, Let us assume a dictionary of passwords which is a set of huge number of strings which is stored in several files according to their starting character like A-List, B-List,etc. We assume that we don't keep all this information in Main memory. So, in order to find a string's presence in the dictionary, Program has to fetch the right file according to the starting character and keep it in main memory for a Binary search. Generally, we don't require frequent insertions and deletions from this set, but queries for the presence of an element. Only way to make the queries fast is to cache complete set of elements in Main memory. But there are some scenarios in which caching is unaffordable.

Bloom filter, as I said early, is a succinct data-structure which registers the presence of elements in a set. While we insert an element in a set, Bloom filter must be updated. Bloom filter is basically a bit array. A small subset of bits in that array would correspond to an element in the set. So, when an element is inserted, corresponding bits should be set. Now query is simply a check for the corresponding bits state; if all the bits corresponding to that element are set, then query would return true. But this true may not be a real true. Since every element maps to a subset of bits in the bit array, overlapping may occur and which could cause a false positive. Before going further, we will analyse this structure mathematically.

Bloom filter is a probabilistic data structure. When an element is inserted, it has to set certain bits in the bits array. In order to find out what are all the bits to be set, we need to hash the content of an element. Now, hashed elements could be used to construct the indices of bits to be set. For instance, MD5 of a byte array would result in 16 bytes which could be used to construct 4 indices. We will get into implementation details later in this discussion. As we discussed early, this data structure may give false positive. Luckily, we could be able to control this by trading off space.

Let us get into the analysis. Let us assume the size of the bloom filter, "M" bits and "N" elements had been inserted into the set, hence in Bloom filter also.

Now, probability of a bit B[i] to be unset or zero is (1 - 1/M) ** kN. Here ** represents "power of" operator. k is the number of bits to be set for an element in a set. The above formula could be easily derived as follows: Since N elements are inserted, setting some bit in the bloom filter would have happened for kN times. One particular bit not set in the first time is M-1 / M, for two consecutive times, is (M-1 / M) * (M-1 / M) and so on. Underlying assumption in this, is setting bits are statistically independent. Asymptotically, this is true. So, we will assume so in this discussion.

P{ B[i] == 0] } is (1 - 1/M) ** kN, approximately e ** (-kN/M).

False positive is a condition in which bloom filter tells a string present in the set which is not actually present. Cause of this condition is the bits corresponding to the string set by other independent strings and due to cumulative effect of that, as described earlier in this discussion. Probability of false positive is all the bits corresponding to the string are set which is,

(1 - e ** (-kN / M) ) ** k => P{ False Positive }
Since the false positive probability depends on the factors k, N, M, we need to minimize the function ( taking derivative and equating to zero) with respect to 'k'. Analysis shows this function becomes minimum if 'k' becomes (M / N) * ln 2. ln 2 is natural logarithm of 2.

If we assign 'k', the value (M/N) * ln2 in the False positive probability function, we will get,

P{ False Probability } = (1 - e ** -(ln 2) ) ** k = (1/2) ** k or (0.6185) ** (M/N).

The above result is the lynch-pin of the Bloom filter implementation, as it gives the way to control False positives. For instance, if we wish to have F.P of 0.2 then M/N must be around 8. "N" is the number of elements in the set. Given that, "N", bit-set size of the Bloom filter, could be easily calculated.

An implementation of Bloom filter in C++ has been done and checked in to the Git-hub.

This implementation doesn't have support for Deletion and merging with some other Bloom filter as it is the initial release. Any comments on this is welcome.

Thanks for Reading.

2 comments:

  1. Hello, What's the idea behind mapping the MD5 hash function output to bloomfilter indices?Can you please explain elaborately??

    ReplyDelete
  2. Not only MD5, any one-way hashing should work. Hashing makes Bloom filter not to worry about the input distribution. And another important point is, any amount of input(Web URL, simple string or huge text file) can be hashed to get a fixed size, so that bloom filter can be size agnostic.

    ReplyDelete