Techy Word of the Week: bloom filter

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.The base data structure of a Bloom filter is a Bit Vector. Visit Bloom Filters by Example for an interactive visualization.

source: http://llimllib.github.io/bloomfilter-tutorial/

Operations

The basic bloom filter supports two operations: test and add.
Test is used to check whether a given element is in the set or not. If it returns:

false then the element is definitely not in the set.
true then the element is probably in the set. The false positive rate is a function of the bloom filter's size and the number and independence of the hash functions used.

Add simply adds an element to the set. Removal is impossible without introducing false negatives, but extensions to the bloom filter are possible that allow removal e.g. counting filters.

Applications

The classic example is using bloom filters to reduce expensive disk (or network) lookups for non-existent keys. If the element is not in the bloom filter, then we know for sure we don't need to perform the expensive lookup. On the other hand, if it is in the bloom filter, we perform the lookup, and we can expect it to fail some proportion of the time (the false positive rate).

source: https://www.jasondavies.com/bloomfilter/ including a bloom filter implementation in JavaScript

Wikipedia

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter); the more elements that are added to the set, the larger the probability of false positives.

The following topics are also covered:

Algorithm description
Space and time advantages
Probability of false positives
Optimal number of hash functions
Approximating the number of items in a Bloom filter
The union and intersection of sets
Interesting properties

Extensions and applications

Cache filtering
Counting filters
Decentralized aggregation
Data synchronization
Bloomier filters
Compact approximators
Stable Bloom filters
Scalable Bloom filters
Layered Bloom filters
Attenuated Bloom filters
Chemical structure searching

Examples

Akamai's web servers use Bloom filters to prevent "one-hit-wonders" from being stored in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit wonders from entering the disk cache.
Google Bigtable, Apache HBase and Apache Cassandra, and Postgresql use Bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.
The Google Chrome web browser used to use a Bloom filter to identify malicious URLs.
Medium uses Bloom filters to avoid recommending articles a user has previously read.

Link to the whole article: https://en.wikipedia.org/wiki/Bloom_filter

Techy Word of the Week

Friday, April 27, 2018

bloom filter

Operations

Applications

Wikipedia

Extensions and applications

Examples

No comments:

Post a Comment