Bloom Filters for Everyone

A Bloom filter according to wikipedia, "is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set." Named for the computer scientist Burton Bloom, the Bloom filter has the following informal attributes:
  • The filter can yield false positives (told element is in the set but it isn't) but not false negatives (told element is not in set but it is)
  • Typically implemented using a bit vector
  • The more elements added to the filter, the higher the chance for false positives (assuming no resize)
  • Is space compact with respect to the set it represents (can represent all possibilities with little space)
So, what might you want to use a Bloom filter for? There are plenty of examples in search problems which illustrate their usefulness so we'll look at two of them. First (taken from Wikipedia), consider a spell checker. You can imagine a language with a large dictionary where it is expensive to do a spell check (dictionary too large to hold in memory). In this case, you map words in the dictionary to a large bit vector using a Bloom filter. When you go to spell check the document and your filter indicates that the word is _correct_, you can check an original source (dictionary file) to ensure you haven't received a false positive. The above usage isn't ideal since realistically in most cases (except mine), the majority of words will be spelled _correctly_. What you really want for a use case with a bloom filter is one where the majority of content is not found. A better example comes from PlanetPeer, a P2P network. When a peer searches for content, it looks for peers that have Bloom filters with the correct bits set to indicate that the content is available. The peer then checks all found peers to see if the content actually exists on those nodes or not. This is the ideal case for Bloom filter usage, where it reduces the search space and reduces the number of expensive operations required (in this case, network connections). The mathematics for a bloom filter are fairly trivial so I leave the majority of the mathematics to the wikipedia reference. I also don't know how to embed LaTeX into Blogger yet. Three important choices for the Bloom filter are the size of m (total number of bits in the vector), k (there are k hash functions, one for each bit set in the vector) and the hash function (which should decrease false positives). Calculating these values will often times be functions of the physical limitations of the hardware (disk space, etc) and the desire to reduce false positives. One should note that the probability of a false positive is (1/2)^k or approximately 0.6185^(m/n) where k is the number of hash functions used (and therefor bits set), m is the size of the vector and n is the number of elements inserted into the vector. This makes k with respect to m and the potential size of n the crucial elements to reducing false positives. There are a variety of references to implementations in the wikipedia article.

No comments: