Bloom Filters & Stream Algorithms
Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org
Each element of the data stream is a tuple
Given a list of keys S
Determine which tuples of the stream are in S

Obvious solution: Hash table
▪ But suppose we do not have enough memory to store all of S in a hash table
▪ E.g., we might be processing millions of filters on the same stream

Example: Email spam filtering
▪ We know 1 billion "good" email addresses
▪ If an email comes from one of these, it is NOT spam

Publish-subscribe systems
▪ You are collecting lots of messages (news articles)
▪ People express interest in certain sets of keywords
▪ Determine whether each message matches a user's interest

First-cut solution: Given a set of keys S that we want to filter
▪ Create a bit array B of n bits, initially all 0s
▪ Choose a hash function h with range [0, n)
▪ Hash each member s of S to one of n buckets, and set that bit to 1, i.e., B[h(s)] = 1
▪ Hash each element a of the stream and output only those that hash to a bit that was set to 1
▪ Output a if B[h(a)] == 1
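A minimal sketch of this first-cut filter in Python. The hash function (SHA-1 based), the bit-array size n, and the sample addresses are illustrative assumptions, not part of the slides:

```python
import hashlib

n = 8_000_000                        # bit-array size (illustrative; the slides use 8 billion)
B = bytearray(n // 8)                # bit array B, initially all 0s

def h(key: str) -> int:
    # Map a key to a bucket in [0, n); SHA-1 stands in for the hash function h.
    return int.from_bytes(hashlib.sha1(key.encode()).digest(), "big") % n

def set_bit(i: int) -> None:
    B[i // 8] |= 1 << (i % 8)

def get_bit(i: int) -> int:
    return (B[i // 8] >> (i % 8)) & 1

# Build: hash each member s of S and set that bit to 1, i.e., B[h(s)] = 1.
S = ["alice@example.com", "bob@example.com"]     # hypothetical "good" addresses
for s in S:
    set_bit(h(s))

# Stream: output element a only if B[h(a)] == 1.
for a in ["alice@example.com", "mallory@spam.example"]:
    print(a, "-> pass" if get_bit(h(a)) else "-> drop")
```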
Hash function h maps each stream element into the bit array B (e.g., B = 0010001011000). If an element hashes to a bucket set to 0, drop the item: it is surely not in S.

Creates false positives but no false negatives
▪ If the item is in S we surely output it; if not, we may still output it

If the email address is in S, then it surely hashes to a bucket that has the bit set to 1, so it always gets through (no false negatives)

Approximately 1/8 of the bits are set to 1, so about 1/8th of the addresses not in S get through to the output (false positives)
▪ Actually, less than 1/8th, because more than one address might hash to the same bit

Consider: If we throw m darts into n equally likely targets, what is the probability that a target gets at least one dart?

In our case:
▪ Targets = bits/buckets
▪ Darts = hash values of items
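A quick numeric check of the darts picture, using the standard approximation that the probability a given target is hit is 1 − (1 − 1/n)^m ≈ 1 − e^(−m/n), with the slides' numbers (m = 1 billion darts, n = 8 billion targets):

```python
import math

m = 1_000_000_000          # darts = hash values of the 1 billion addresses in S
n = 8_000_000_000          # targets = bits in a 1 GB bit array

exact  = 1 - (1 - 1 / n) ** m       # P(a given bit is set to 1)
approx = 1 - math.exp(-m / n)       # the usual exponential approximation
print(exact, approx)                # both about 0.1175, i.e. ~1/8 of the bits
```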
What fraction of the bit vector B are 1s?
▪ Throwing k·m darts at n targets
▪ So the fraction of 1s is (1 − e^(−km/n))

But we have k independent hash functions, and we only let the element x through if all k hash element x to a bucket of value 1

Example: m = 1 billion, n = 8 billion
▪ k = 1: (1 − e^(−1/8)) = 0.1175
▪ k = 2: (1 − e^(−1/4))^2 = 0.0493
▪ What happens as we keep increasing k?
[Plot: false positive prob. (about 0.02–0.2) vs. number of hash functions k = 0–20]

Bloom filters guarantee no false negatives, and use limited memory
▪ Great for pre-processing before more expensive checks

Suitable for hardware implementation
▪ Hash function computations can be parallelized
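A small sketch that reproduces the curve in the plot: with k hash functions the false-positive probability is approximately (1 − e^(−km/n))^k, which first falls and then rises as k grows. The closing line uses the standard Bloom-filter fact (not derived on these slides) that the minimum is near k = (n/m)·ln 2:

```python
import math

m = 1_000_000_000          # number of keys in S
n = 8_000_000_000          # number of bits in B

for k in range(1, 21):
    fp = (1 - math.exp(-k * m / n)) ** k     # false positive probability
    print(k, round(fp, 4))                   # k=1 gives 0.1175; minimum near k=6

print("best k is about", (n / m) * math.log(2))   # ~5.5, so k = 6 in practice
```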
Problem:
▪ Data stream consists of a universe of elements chosen from a set of size N
▪ Maintain a count of the number of distinct elements seen so far

Obvious approach: Maintain the set of elements seen so far
▪ That is, keep a hash table of all the distinct elements seen so far

Applications:
▪ How many different words are found among the Web pages being crawled at a site? Unusually low or high numbers could indicate artificial pages (spam?)
▪ How many different Web pages does each customer request in a week?
▪ How many distinct products have we sold in the last week?
Real problem: What if we do not have space to maintain the set of elements seen so far?
▪ Estimate the count in an unbiased way
▪ Accept that the count may have a little error, but limit the probability that the error is large

Flajolet-Martin approach: Pick a hash function h that maps each of the N elements to at least log2 N bits

For each stream element a, let r(a) be the number of trailing 0s in h(a)
▪ r(a) = position of the first 1 counting from the right
▪ E.g., say h(a) = 12; then 12 is 1100 in binary, so r(a) = 2

Record R = the maximum r(a) seen
▪ R = max_a r(a), over all the items a seen so far

Estimated number of distinct elements = 2^R

Very rough and heuristic intuition why Flajolet-Martin works:
▪ h(a) hashes a with equal probability to any of N values
▪ Then h(a) is a sequence of log2 N bits, where a 2^(-r) fraction of all a's have a tail of r zeros
▪ About 50% of a's hash to ***0
▪ About 25% of a's hash to **00
▪ So, if we saw the longest tail of r = 2 (i.e., an item hash ending in *100), then we have probably seen about 4 distinct items so far
▪ So, it takes hashing about 2^r items before we see one with a zero-suffix of length r
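A minimal sketch of the Flajolet-Martin estimate as described above. The SHA-1-based hash and the example stream are assumptions for illustration; a real deployment combines many hash functions to smooth out the power-of-two granularity:

```python
import hashlib

def h(x) -> int:
    # Uniform-looking 64-bit hash value for the element x.
    return int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")

def r(v: int, bits: int = 64) -> int:
    # r(a) = number of trailing 0s in h(a); define r = bits for the all-zero value.
    if v == 0:
        return bits
    count = 0
    while v & 1 == 0:
        v >>= 1
        count += 1
    return count

def fm_estimate(stream) -> int:
    R = 0                          # R = max_a r(a) over all items seen so far
    for a in stream:
        R = max(R, r(h(a)))
    return 2 ** R                  # estimated number of distinct elements

stream = [i % 1000 for i in range(100_000)]   # 1000 distinct values, many repeats
print(fm_estimate(stream))                    # a rough single-hash estimate of 1000
```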
Now we show why Flajolet-Martin works

Formally, we will show that the probability of finding a tail of r zeros:
▪ Goes to 1 if m ≫ 2^r
▪ Goes to 0 if m ≪ 2^r
where m is the number of distinct elements seen so far in the stream
▪ Thus, 2^R will almost always be around m!

The probability that a given h(a) ends in at least r zeros is 2^(-r)
▪ h(a) hashes elements uniformly at random
▪ So the probability that a random value ends in at least r zeros is 2^(-r)

Then, the probability of NOT seeing a tail of length r among m elements is:
(1 − 2^(-r))^m
▪ (1 − 2^(-r)) is the prob. that a given h(a) ends in fewer than r zeros; raising it to the m gives the prob. that all m elements end in fewer than r zeros

Note: (1 − 2^(-r))^m = ((1 − 2^(-r))^(2^r))^(m·2^(-r)) ≈ e^(−m·2^(-r))

Prob. of NOT finding a tail of length r:
▪ If m ≪ 2^r, the prob. tends to 1: (1 − 2^(-r))^m ≈ e^(−m·2^(-r)) → 1 as m/2^r → 0
▪ So the probability of finding a tail of length r tends to 0
▪ If m ≫ 2^r, the prob. tends to 0: (1 − 2^(-r))^m ≈ e^(−m·2^(-r)) → 0 as m/2^r → ∞
▪ So the probability of finding a tail of length r tends to 1

Thus, 2^R will almost always be around m!
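A quick numeric check of the approximation used above, showing how (1 − 2^(-r))^m behaves on either side of m ≈ 2^r (the values of r and m are arbitrary choices for illustration):

```python
import math

r = 20                                       # tail length, so 2^r is about 1 million
for m in (1_000, 1_000_000, 100_000_000):    # m << 2^r, m ~ 2^r, m >> 2^r
    exact  = (1 - 2.0 ** -r) ** m
    approx = math.exp(-m * 2.0 ** -r)
    print(m, exact, approx)
# P(no tail of length r) is near 1 when m << 2^r and near 0 when m >> 2^r,
# so a tail of length r first appears when m reaches roughly 2^r.
```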
The kth moment of the stream is Σ_{i∈A} (m_i)^k, where A is the set of possible values and m_i is the number of times value i occurs

0th moment = number of distinct elements
▪ The problem just considered

1st moment = count of the number of elements = length of the stream
▪ Easy to compute

2nd moment = surprise number S = a measure of how uneven the distribution is

Example: Stream of length 100, 11 distinct values
▪ Item counts: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9 → Surprise S = 910
▪ Item counts: 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 → Surprise S = 8,110

AMS method works for all moments and gives an unbiased estimate
▪ We will just concentrate on the 2nd moment S

We pick and keep track of many variables X:
▪ For each variable X we store X.el and X.val
▪ X.el corresponds to the item i
▪ X.val corresponds to the count of item i
▪ Note this requires a count in main memory, so the number of Xs is limited

Our goal is to compute S = Σ_i (m_i)^2
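A two-line check of the surprise numbers quoted above, using nothing beyond the definition S = Σ_i (m_i)^2:

```python
def surprise(counts):
    # S = sum of squared item counts.
    return sum(m * m for m in counts)

print(surprise([10] + [9] * 10))   # counts 10, 9, ..., 9  ->  910
print(surprise([90] + [1] * 10))   # counts 90, 1, ..., 1  ->  8110
```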
How to set X.val and X.el?
▪ Assume the stream has length n (we relax this later)
▪ Pick some random time t (t < n) to start, so that any time is equally likely
▪ Let the stream have item i at time t. We set X.el = i
▪ Then we maintain a count c (X.val = c) of the number of i's in the stream starting from the chosen time t

Then the estimate of the 2nd moment (Σ_i (m_i)^2) is:
S = f(X) = n (2·c − 1)
▪ Note, we will keep track of multiple Xs (X1, X2, …, Xk) and our final estimate will be S = (1/k) Σ_{j=1}^{k} f(Xj)

Expectation analysis
Stream: a a b b b a b a (the occurrences of a are the 1st, 2nd, 3rd, …, m_a-th)
▪ 2nd moment is S = Σ_i (m_i)^2
▪ c_t … number of times the item at time t appears from time t onwards (c1 = m_a, c2 = m_a − 1, c3 = m_b)
▪ m_i … total count of item i in the stream (we are assuming the stream has length n)

E[f(X)] = (1/n) Σ_{t=1}^{n} n (2·c_t − 1)
▪ Group times by the value seen: for item i, the time when the last i is seen has c_t = 1, the time when the penultimate i is seen has c_t = 2, …, and the time when the first i is seen has c_t = m_i
▪ So E[f(X)] = (1/n) Σ_i n (1 + 3 + 5 + ⋯ + 2m_i − 1)

▪ Little side calculation: 1 + 3 + 5 + ⋯ + 2m_i − 1 = Σ_{j=1}^{m_i} (2j − 1) = 2·(m_i(m_i + 1)/2) − m_i = (m_i)^2

Then E[f(X)] = (1/n) Σ_i n (m_i)^2

So, E[f(X)] = Σ_i (m_i)^2 = S
We have the second moment (in expectation)!
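A minimal sketch of the single-variable estimate f(X) = n(2c − 1) on the example stream a a b b b a b a (n = 8, m_a = m_b = 4, so the true S is 32); averaging over many randomly chosen start times approximates the expectation computed above:

```python
import random

stream = list("aabbbaba")
n = len(stream)

def f_of_X(t: int) -> int:
    # X.el = item at the randomly chosen time t; X.val = c, the number of
    # occurrences of that item from time t onwards; f(X) = n(2c - 1).
    el = stream[t]
    c = stream[t:].count(el)
    return n * (2 * c - 1)

k = 10_000                                   # number of variables X_j
estimate = sum(f_of_X(random.randrange(n)) for _ in range(k)) / k
print(estimate)                              # close to the true 2nd moment, 32
```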
For estimating the kth moment we essentially use the same algorithm but change the estimate:
▪ For k = 2 we used n (2·c − 1)
▪ For k = 3 we use: n (3·c^2 − 3c + 1) (where c = X.val)

Why?
▪ For k = 2: Remember we had 1 + 3 + 5 + ⋯ + 2m_i − 1 and we showed the terms 2c − 1 (for c = 1, …, m) sum to m^2:
Σ_{c=1}^{m} (2c − 1) = Σ_{c=1}^{m} c^2 − Σ_{c=1}^{m} (c − 1)^2 = m^2
▪ So: 2c − 1 = c^2 − (c − 1)^2
▪ For k = 3: c^3 − (c − 1)^3 = 3c^2 − 3c + 1
▪ Generally: Estimate = n (c^k − (c − 1)^k)

In practice:
▪ Compute f(X) = n (2c − 1) for as many variables X as you can fit in memory
▪ Average them in groups
▪ Take the median of the averages

Problem: Streams never end
▪ We assumed there was a number n, the number of positions in the stream
▪ But real streams go on forever, so n is a variable – the number of inputs seen so far

Fix-ups:
(1) The variables X have n as a factor – keep n separately; just hold the count in X
(2) Suppose we can only store k counts. We must throw some Xs out as time goes on:
▪ Objective: Each starting time t is selected with probability k/n
▪ Solution: (fixed-size sampling!)
▪ Choose the first k times for k variables
▪ When the nth element arrives (n > k), choose it with probability k/n
▪ If you choose it, throw one of the previously stored variables X out, with equal probability
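A sketch of the fixed-size sampling fix-up, under the assumption stated above that each variable stores only its element and count while n is kept separately; the variable names and the toy stream are mine:

```python
import random

k = 3                   # number of variables we can afford to store
variables = []          # each entry is [X.el, X.val]
n = 0                   # stream positions seen so far (kept separately)

def process(item):
    global n
    n += 1
    for var in variables:               # existing variables keep counting their element
        if var[0] == item:
            var[1] += 1
    if n <= k:                          # choose the first k times for k variables
        variables.append([item, 1])
    elif random.random() < k / n:       # later times are picked with probability k/n
        # Evict one of the stored variables uniformly at random.
        variables[random.randrange(k)] = [item, 1]

for x in "aabbbaba" * 5:
    process(x)

# Estimate of the 2nd moment: average of n(2c - 1) over the stored variables
# (noisy for such a small k; the true S for this stream of length 40 is 800).
print(sum(n * (2 * c - 1) for _, c in variables) / len(variables))
```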
New Problem: Given a stream, which items appear more than s times in the window?

Possible solution: Think of the stream of baskets as one binary stream per item
▪ 1 = item present; 0 = not present
▪ Use DGIM to estimate counts of 1s for all items
[Diagram: DGIM buckets over the binary stream 010011100010100100010110110111001010110011010 within a window of length N]

In principle, you could count frequent pairs or even larger sets the same way
▪ One stream per itemset

Drawbacks:
▪ Only approximate
▪ Number of itemsets is way too big
If the stream is a1, a2, … and we are taking a sum over the stream, take the answer at time t to be:
Σ_{i=1}^{t} a_i (1 − c)^(t−i)
▪ c is a constant, presumably tiny, like 10^(-6) or 10^(-9)

When a new element a_{t+1} arrives: Multiply the current sum by (1 − c) and add a_{t+1}

Counting items: for each item x, imagine a separate binary stream (1 if x appears, 0 if x does not appear) and keep its decayed sum
▪ New item x arrives:
▪ Multiply all counts by (1 − c)
▪ Add +1 to the count for element x
Call this sum the "weight" of item x

Important property: Sum over all weights Σ_t (1 − c)^t is 1/[1 − (1 − c)] = 1/c
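A direct (unoptimized) sketch of the per-item weights just described; c is made much larger than the 10^(-6) to 10^(-9) suggested above so the decay is visible on a tiny stream:

```python
c = 0.1
weight = {}                                  # item -> exponentially decayed count

def arrive(x):
    for item in weight:                      # multiply all counts by (1 - c)
        weight[item] *= (1 - c)
    weight[x] = weight.get(x, 0.0) + 1.0     # add +1 to the count for element x

for x in ["a", "a", "b", "a", "c", "a"]:
    arrive(x)

print(weight)    # "a" has by far the largest weight; older items have decayed
```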
What are "currently" most popular movies?
▪ Suppose we want to find movies of weight > ½
▪ Important property: Sum over all weights Σ_t (1 − c)^t is 1/[1 − (1 − c)] = 1/c

Thus:
▪ There cannot be more than 2/c movies with weight of ½ or more
▪ So, 2/c is a limit on the number of movies being counted at any time

Count (some) itemsets in an E.D.W.
▪ What are currently "hot" itemsets?
▪ Problem: Too many itemsets to keep counts of all of them in memory

When a basket B comes in:
▪ Multiply all counts by (1 − c)
▪ For uncounted items in B, create a new count
▪ Add 1 to the count of any item in B and to any itemset contained in B that is already being counted
▪ Drop counts < ½
▪ Initiate new counts (next slide)

Start a count for an itemset S ⊆ B if every proper subset of S had a count prior to the arrival of basket B
▪ Intuitively: If all subsets of S are being counted, they are "frequent/hot", and thus S has the potential to be "hot"

Example:
▪ Start counting S = {i, j} iff both i and j were counted prior to seeing B
▪ Start counting S = {i, j, k} iff {i, j}, {i, k}, and {j, k} were all counted prior to seeing B
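A sketch of the single-item version of this bookkeeping (tracking "currently popular" items rather than itemsets): decay all weights, add 1 for the arriving item, and drop weights below ½, which by the 1/c bound above keeps at most 2/c items alive. The stream and the constant c are made up for illustration:

```python
import random

c = 0.05                    # so 1/c = 20 and at most 2/c = 40 items are ever tracked
weights = {}

def arrive(item):
    for key in list(weights):
        weights[key] *= (1 - c)             # multiply all counts by (1 - c)
        if weights[key] < 0.5:              # drop counts < 1/2
            del weights[key]
    weights[item] = weights.get(item, 0.0) + 1.0

stream = [random.choice("abcdefghij") for _ in range(1000)] + ["hot"] * 20
for item in stream:
    arrive(item)

print(len(weights), "items tracked (never more than 2/c =", int(2 / c), ")")
print(max(weights, key=weights.get))        # "hot" ends up with the largest weight
```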
Counts for single items < (2/c)·(avg. number of items in a basket)