Friday, April 27, 2018

bloom filter

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.The base data structure of a Bloom filter is a Bit Vector. Visit Bloom Filters by Example for an interactive visualization.

source: http://llimllib.github.io/bloomfilter-tutorial/

Operations

The basic bloom filter supports two operations: test and add.
Test is used to check whether a given element is in the set or not. If it returns:
  • false then the element is definitely not in the set.
  • true then the element is probably in the set. The false positive rate is a function of the bloom filter's size and the number and independence of the hash functions used.
Add simply adds an element to the set. Removal is impossible without introducing false negatives, but extensions to the bloom filter are possible that allow removal e.g. counting filters.

Applications

The classic example is using bloom filters to reduce expensive disk (or network) lookups for non-existent keys. If the element is not in the bloom filter, then we know for sure we don't need to perform the expensive lookup. On the other hand, if it is in the bloom filter, we perform the lookup, and we can expect it to fail some proportion of the time (the false positive rate).

source: https://www.jasondavies.com/bloomfilter/ including a bloom filter implementation in JavaScript

Wikipedia

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter); the more elements that are added to the set, the larger the probability of false positives.

The following topics are also covered:
  • Algorithm description
  • Space and time advantages
  • Probability of false positives
  • Optimal number of hash functions
  • Approximating the number of items in a Bloom filter
  • The union and intersection of sets
  • Interesting properties

Extensions and applications

  • Cache filtering
  • Counting filters
  • Decentralized aggregation
  • Data synchronization
  • Bloomier filters
  • Compact approximators
  • Stable Bloom filters
  • Scalable Bloom filters
  • Layered Bloom filters
  • Attenuated Bloom filters
  • Chemical structure searching

Examples

  • Akamai's web servers use Bloom filters to prevent "one-hit-wonders" from being stored in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit wonders from entering the disk cache.
  • Google Bigtable, Apache HBase and Apache Cassandra, and Postgresql use Bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.
  • The Google Chrome web browser used to use a Bloom filter to identify malicious URLs.
  • Medium uses Bloom filters to avoid recommending articles a user has previously read.
Link to the whole article: https://en.wikipedia.org/wiki/Bloom_filter

No comments:

Post a Comment