Bloom Filters & Stream Algorithms
Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org
Each element of the data stream is a tuple
Given a list of keys S
Determine which tuples of the stream are in S

Obvious solution: Hash table
▪ But suppose we do not have enough memory to store all of S in a hash table
▪ E.g., we might be processing millions of filters on the same stream

Example: Email spam filtering
▪ We know 1 billion "good" email addresses
▪ If an email comes from one of these, it is NOT spam

Publish-subscribe systems
▪ You are collecting lots of messages (news articles)
▪ People express interest in certain sets of keywords
▪ Determine whether each message matches a user's interest

First-cut solution: Given a set of keys S that we want to filter
▪ Create a bit array B of n bits, initially all 0s
▪ Choose a hash function h with range [0, n)
▪ Hash each member s of S to one of n buckets, and set that bit to 1, i.e., B[h(s)] = 1
▪ Hash each element a of the stream and output only those that hash to a bit that was set to 1
▪ Output a if B[h(a)] == 1
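A minimal sketch of this first-cut filter in Python. The hash function (SHA-1 based), the bit-array size n, and the sample addresses are illustrative assumptions, not part of the slides:

```python
import hashlib

n = 8_000_000                        # bit-array size (illustrative; the slides use 8 billion)
B = bytearray(n // 8)                # bit array B, initially all 0s

def h(key: str) -> int:
    # Map a key to a bucket in [0, n); SHA-1 stands in for the hash function h.
    return int.from_bytes(hashlib.sha1(key.encode()).digest(), "big") % n

def set_bit(i: int) -> None:
    B[i // 8] |= 1 << (i % 8)

def get_bit(i: int) -> int:
    return (B[i // 8] >> (i % 8)) & 1

# Build: hash each member s of S and set that bit to 1, i.e., B[h(s)] = 1.
S = ["alice@example.com", "bob@example.com"]     # hypothetical "good" addresses
for s in S:
    set_bit(h(s))

# Stream: output element a only if B[h(a)] == 1.
for a in ["alice@example.com", "mallory@spam.example"]:
    print(a, "-> pass" if get_bit(h(a)) else "-> drop")
```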
Hash function h maps each stream element into the bit array B (e.g., B = 0010001011000). If an element hashes to a bucket set to 0, drop the item: it is surely not in S.

Creates false positives but no false negatives
▪ If the item is in S we surely output it; if not, we may still output it

If the email address is in S, then it surely hashes to a bucket that has the bit set to 1, so it always gets through (no false negatives)

Approximately 1/8 of the bits are set to 1, so about 1/8th of the addresses not in S get through to the output (false positives)
▪ Actually, less than 1/8th, because more than one address might hash to the same bit

Consider: If we throw m darts into n equally likely targets, what is the probability that a target gets at least one dart?

In our case:
▪ Targets = bits/buckets
▪ Darts = hash values of items
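A quick numeric check of the darts picture, using the standard approximation that the probability a given target is hit is 1 − (1 − 1/n)^m ≈ 1 − e^(−m/n), with the slides' numbers (m = 1 billion darts, n = 8 billion targets):

```python
import math

m = 1_000_000_000          # darts = hash values of the 1 billion addresses in S
n = 8_000_000_000          # targets = bits in a 1 GB bit array

exact  = 1 - (1 - 1 / n) ** m       # P(a given bit is set to 1)
approx = 1 - math.exp(-m / n)       # the usual exponential approximation
print(exact, approx)                # both about 0.1175, i.e. ~1/8 of the bits
```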
What fraction of the bit vector B are 1s?
▪ Throwing k·m darts at n targets
▪ So the fraction of 1s is (1 − e^(−km/n))

But we have k independent hash functions, and we only let the element x through if all k hash element x to a bucket of value 1

Example: m = 1 billion, n = 8 billion
▪ k = 1: (1 − e^(−1/8)) = 0.1175
▪ k = 2: (1 − e^(−1/4))^2 = 0.0493
▪ What happens as we keep increasing k?
[Plot: false positive prob. (about 0.02–0.2) vs. number of hash functions k = 0–20]

Bloom filters guarantee no false negatives, and use limited memory
▪ Great for pre-processing before more expensive checks

Suitable for hardware implementation
▪ Hash function computations can be parallelized
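A small sketch that reproduces the curve in the plot: with k hash functions the false-positive probability is approximately (1 − e^(−km/n))^k, which first falls and then rises as k grows. The closing line uses the standard Bloom-filter fact (not derived on these slides) that the minimum is near k = (n/m)·ln 2:

```python
import math

m = 1_000_000_000          # number of keys in S
n = 8_000_000_000          # number of bits in B

for k in range(1, 21):
    fp = (1 - math.exp(-k * m / n)) ** k     # false positive probability
    print(k, round(fp, 4))                   # k=1 gives 0.1175; minimum near k=6

print("best k is about", (n / m) * math.log(2))   # ~5.5, so k = 6 in practice
```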
Problem:
▪ Data stream consists of a universe of elements chosen from a set of size N
▪ Maintain a count of the number of distinct elements seen so far

Obvious approach: Maintain the set of elements seen so far
▪ That is, keep a hash table of all the distinct elements seen so far

Applications:
▪ How many different words are found among the Web pages being crawled at a site? Unusually low or high numbers could indicate artificial pages (spam?)
▪ How many different Web pages does each customer request in a week?
▪ How many distinct products have we sold in the last week?
Real problem: What if we do not have space to maintain the set of elements seen so far?
▪ Estimate the count in an unbiased way
▪ Accept that the count may have a little error, but limit the probability that the error is large

Flajolet-Martin approach: Pick a hash function h that maps each of the N elements to at least log2 N bits

For each stream element a, let r(a) be the number of trailing 0s in h(a)
▪ r(a) = position of the first 1 counting from the right
▪ E.g., say h(a) = 12; then 12 is 1100 in binary, so r(a) = 2

Record R = the maximum r(a) seen
▪ R = max_a r(a), over all the items a seen so far

Estimated number of distinct elements = 2^R

Very rough and heuristic intuition why Flajolet-Martin works:
▪ h(a) hashes a with equal probability to any of N values
▪ Then h(a) is a sequence of log2 N bits, where a 2^(-r) fraction of all a's have a tail of r zeros
▪ About 50% of a's hash to ***0
▪ About 25% of a's hash to **00
▪ So, if we saw the longest tail of r = 2 (i.e., an item hash ending in *100), then we have probably seen about 4 distinct items so far
▪ So, it takes hashing about 2^r items before we see one with a zero-suffix of length r
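A minimal sketch of the Flajolet-Martin estimate as described above. The SHA-1-based hash and the example stream are assumptions for illustration; a real deployment combines many hash functions to smooth out the power-of-two granularity:

```python
import hashlib

def h(x) -> int:
    # Uniform-looking 64-bit hash value for the element x.
    return int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")

def r(v: int, bits: int = 64) -> int:
    # r(a) = number of trailing 0s in h(a); define r = bits for the all-zero value.
    if v == 0:
        return bits
    count = 0
    while v & 1 == 0:
        v >>= 1
        count += 1
    return count

def fm_estimate(stream) -> int:
    R = 0                          # R = max_a r(a) over all items seen so far
    for a in stream:
        R = max(R, r(h(a)))
    return 2 ** R                  # estimated number of distinct elements

stream = [i % 1000 for i in range(100_000)]   # 1000 distinct values, many repeats
print(fm_estimate(stream))                    # a rough single-hash estimate of 1000
```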
Now we show why Flajolet-Martin works

Formally, we will show that the probability of finding a tail of r zeros:
▪ Goes to 1 if m ≫ 2^r
▪ Goes to 0 if m ≪ 2^r
where m is the number of distinct elements seen so far in the stream
▪ Thus, 2^R will almost always be around m!

The probability that a given h(a) ends in at least r zeros is 2^(-r)
▪ h(a) hashes elements uniformly at random
▪ So the probability that a random value ends in at least r zeros is 2^(-r)

Then, the probability of NOT seeing a tail of length r among m elements is:
(1 − 2^(-r))^m
▪ (1 − 2^(-r)) is the prob. that a given h(a) ends in fewer than r zeros; raising it to the m gives the prob. that all m elements end in fewer than r zeros

Note: (1 − 2^(-r))^m = ((1 − 2^(-r))^(2^r))^(m·2^(-r)) ≈ e^(−m·2^(-r))

Prob. of NOT finding a tail of length r:
▪ If m ≪ 2^r, the prob. tends to 1: (1 − 2^(-r))^m ≈ e^(−m·2^(-r)) → 1 as m/2^r → 0
▪ So the probability of finding a tail of length r tends to 0
▪ If m ≫ 2^r, the prob. tends to 0: (1 − 2^(-r))^m ≈ e^(−m·2^(-r)) → 0 as m/2^r → ∞
▪ So the probability of finding a tail of length r tends to 1

Thus, 2^R will almost always be around m!
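A quick numeric check of the approximation used above, showing how (1 − 2^(-r))^m behaves on either side of m ≈ 2^r (the values of r and m are arbitrary choices for illustration):

```python
import math

r = 20                                       # tail length, so 2^r is about 1 million
for m in (1_000, 1_000_000, 100_000_000):    # m << 2^r, m ~ 2^r, m >> 2^r
    exact  = (1 - 2.0 ** -r) ** m
    approx = math.exp(-m * 2.0 ** -r)
    print(m, exact, approx)
# P(no tail of length r) is near 1 when m << 2^r and near 0 when m >> 2^r,
# so a tail of length r first appears when m reaches roughly 2^r.
```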
The kth moment of the stream is Σ_{i∈A} (m_i)^k, where A is the set of possible values and m_i is the number of times value i occurs

0th moment = number of distinct elements
▪ The problem just considered

1st moment = count of the number of elements = length of the stream
▪ Easy to compute

2nd moment = surprise number S = a measure of how uneven the distribution is

Example: Stream of length 100, 11 distinct values
▪ Item counts: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9 → Surprise S = 910
▪ Item counts: 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 → Surprise S = 8,110

AMS method works for all moments and gives an unbiased estimate
▪ We will just concentrate on the 2nd moment S

We pick and keep track of many variables X:
▪ For each variable X we store X.el and X.val
▪ X.el corresponds to the item i
▪ X.val corresponds to the count of item i
▪ Note this requires a count in main memory, so the number of Xs is limited

Our goal is to compute S = Σ_i (m_i)^2
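A two-line check of the surprise numbers quoted above, using nothing beyond the definition S = Σ_i (m_i)^2:

```python
def surprise(counts):
    # S = sum of squared item counts.
    return sum(m * m for m in counts)

print(surprise([10] + [9] * 10))   # counts 10, 9, ..., 9  ->  910
print(surprise([90] + [1] * 10))   # counts 90, 1, ..., 1  ->  8110
```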
How to set X.val and X.el?
▪ Assume the stream has length n (we relax this later)
▪ Pick some random time t (t < n) to start, so that any time is equally likely
▪ Let the stream have item i at time t. We set X.el = i
▪ Then we maintain a count c (X.val = c) of the number of i's in the stream starting from the chosen time t

Then the estimate of the 2nd moment (Σ_i (m_i)^2) is:
S = f(X) = n (2·c − 1)
▪ Note, we will keep track of multiple Xs (X1, X2, …, Xk) and our final estimate will be S = (1/k) Σ_{j=1}^{k} f(Xj)

Expectation analysis
Stream: a a b b b a b a (the occurrences of a are the 1st, 2nd, 3rd, …, m_a-th)
▪ 2nd moment is S = Σ_i (m_i)^2
▪ c_t … number of times the item at time t appears from time t onwards (c1 = m_a, c2 = m_a − 1, c3 = m_b)
▪ m_i … total count of item i in the stream (we are assuming the stream has length n)

E[f(X)] = (1/n) Σ_{t=1}^{n} n (2·c_t − 1)
▪ Group times by the value seen: for item i, the time when the last i is seen has c_t = 1, the time when the penultimate i is seen has c_t = 2, …, and the time when the first i is seen has c_t = m_i
▪ So E[f(X)] = (1/n) Σ_i n (1 + 3 + 5 + ⋯ + 2m_i − 1)

▪ Little side calculation: 1 + 3 + 5 + ⋯ + 2m_i − 1 = Σ_{j=1}^{m_i} (2j − 1) = 2·(m_i(m_i + 1)/2) − m_i = (m_i)^2

Then E[f(X)] = (1/n) Σ_i n (m_i)^2

So, E[f(X)] = Σ_i (m_i)^2 = S
We have the second moment (in expectation)!
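A minimal sketch of the single-variable estimate f(X) = n(2c − 1) on the example stream a a b b b a b a (n = 8, m_a = m_b = 4, so the true S is 32); averaging over many randomly chosen start times approximates the expectation computed above:

```python
import random

stream = list("aabbbaba")
n = len(stream)

def f_of_X(t: int) -> int:
    # X.el = item at the randomly chosen time t; X.val = c, the number of
    # occurrences of that item from time t onwards; f(X) = n(2c - 1).
    el = stream[t]
    c = stream[t:].count(el)
    return n * (2 * c - 1)

k = 10_000                                   # number of variables X_j
estimate = sum(f_of_X(random.randrange(n)) for _ in range(k)) / k
print(estimate)                              # close to the true 2nd moment, 32
```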
For estimating the kth moment we essentially use the same algorithm but change the estimate:
▪ For k = 2 we used n (2·c − 1)
▪ For k = 3 we use: n (3·c^2 − 3c + 1) (where c = X.val)

Why?
▪ For k = 2: Remember we had 1 + 3 + 5 + ⋯ + 2m_i − 1 and we showed the terms 2c − 1 (for c = 1, …, m) sum to m^2:
Σ_{c=1}^{m} (2c − 1) = Σ_{c=1}^{m} c^2 − Σ_{c=1}^{m} (c − 1)^2 = m^2
▪ So: 2c − 1 = c^2 − (c − 1)^2
▪ For k = 3: c^3 − (c − 1)^3 = 3c^2 − 3c + 1
▪ Generally: Estimate = n (c^k − (c − 1)^k)

In practice:
▪ Compute f(X) = n (2c − 1) for as many variables X as you can fit in memory
▪ Average them in groups
▪ Take the median of the averages

Problem: Streams never end
▪ We assumed there was a number n, the number of positions in the stream
▪ But real streams go on forever, so n is a variable – the number of inputs seen so far

Fix-ups:
(1) The variables X have n as a factor – keep n separately; just hold the count in X
(2) Suppose we can only store k counts. We must throw some Xs out as time goes on:
▪ Objective: Each starting time t is selected with probability k/n
▪ Solution: (fixed-size sampling!)
▪ Choose the first k times for k variables
▪ When the nth element arrives (n > k), choose it with probability k/n
▪ If you choose it, throw one of the previously stored variables X out, with equal probability
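A sketch of the fixed-size sampling fix-up, under the assumption stated above that each variable stores only its element and count while n is kept separately; the variable names and the toy stream are mine:

```python
import random

k = 3                   # number of variables we can afford to store
variables = []          # each entry is [X.el, X.val]
n = 0                   # stream positions seen so far (kept separately)

def process(item):
    global n
    n += 1
    for var in variables:               # existing variables keep counting their element
        if var[0] == item:
            var[1] += 1
    if n <= k:                          # choose the first k times for k variables
        variables.append([item, 1])
    elif random.random() < k / n:       # later times are picked with probability k/n
        # Evict one of the stored variables uniformly at random.
        variables[random.randrange(k)] = [item, 1]

for x in "aabbbaba" * 5:
    process(x)

# Estimate of the 2nd moment: average of n(2c - 1) over the stored variables
# (noisy for such a small k; the true S for this stream of length 40 is 800).
print(sum(n * (2 * c - 1) for _, c in variables) / len(variables))
```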
New Problem: Given a stream, which items appear more than s times in the window?

Possible solution: Think of the stream of baskets as one binary stream per item
▪ 1 = item present; 0 = not present
▪ Use DGIM to estimate counts of 1s for all items
[Diagram: DGIM buckets over the binary stream 010011100010100100010110110111001010110011010 within a window of length N]

In principle, you could count frequent pairs or even larger sets the same way
▪ One stream per itemset

Drawbacks:
▪ Only approximate
▪ Number of itemsets is way too big
If the stream is a1, a2, … and we are taking a sum over the stream, take the answer at time t to be:
Σ_{i=1}^{t} a_i (1 − c)^(t−i)
▪ c is a constant, presumably tiny, like 10^(-6) or 10^(-9)

When a new element a_{t+1} arrives: Multiply the current sum by (1 − c) and add a_{t+1}

Counting items: for each item x, imagine a separate binary stream (1 if x appears, 0 if x does not appear) and keep its decayed sum
▪ New item x arrives:
▪ Multiply all counts by (1 − c)
▪ Add +1 to the count for element x
Call this sum the "weight" of item x

Important property: Sum over all weights Σ_t (1 − c)^t is 1/[1 − (1 − c)] = 1/c
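A direct (unoptimized) sketch of the per-item weights just described; c is made much larger than the 10^(-6) to 10^(-9) suggested above so the decay is visible on a tiny stream:

```python
c = 0.1
weight = {}                                  # item -> exponentially decayed count

def arrive(x):
    for item in weight:                      # multiply all counts by (1 - c)
        weight[item] *= (1 - c)
    weight[x] = weight.get(x, 0.0) + 1.0     # add +1 to the count for element x

for x in ["a", "a", "b", "a", "c", "a"]:
    arrive(x)

print(weight)    # "a" has by far the largest weight; older items have decayed
```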
What are "currently" most popular movies?
▪ Suppose we want to find movies of weight > ½
▪ Important property: Sum over all weights Σ_t (1 − c)^t is 1/[1 − (1 − c)] = 1/c

Thus:
▪ There cannot be more than 2/c movies with weight of ½ or more
▪ So, 2/c is a limit on the number of movies being counted at any time

Count (some) itemsets in an E.D.W.
▪ What are currently "hot" itemsets?
▪ Problem: Too many itemsets to keep counts of all of them in memory

When a basket B comes in:
▪ Multiply all counts by (1 − c)
▪ For uncounted items in B, create a new count
▪ Add 1 to the count of any item in B and to any itemset contained in B that is already being counted
▪ Drop counts < ½
▪ Initiate new counts (next slide)

Start a count for an itemset S ⊆ B if every proper subset of S had a count prior to the arrival of basket B
▪ Intuitively: If all subsets of S are being counted, they are "frequent/hot", and thus S has the potential to be "hot"

Example:
▪ Start counting S = {i, j} iff both i and j were counted prior to seeing B
▪ Start counting S = {i, j, k} iff {i, j}, {i, k}, and {j, k} were all counted prior to seeing B
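A sketch of the single-item version of this bookkeeping (tracking "currently popular" items rather than itemsets): decay all weights, add 1 for the arriving item, and drop weights below ½, which by the 1/c bound above keeps at most 2/c items alive. The stream and the constant c are made up for illustration:

```python
import random

c = 0.05                    # so 1/c = 20 and at most 2/c = 40 items are ever tracked
weights = {}

def arrive(item):
    for key in list(weights):
        weights[key] *= (1 - c)             # multiply all counts by (1 - c)
        if weights[key] < 0.5:              # drop counts < 1/2
            del weights[key]
    weights[item] = weights.get(item, 0.0) + 1.0

stream = [random.choice("abcdefghij") for _ in range(1000)] + ["hot"] * 20
for item in stream:
    arrive(item)

print(len(weights), "items tracked (never more than 2/c =", int(2 / c), ")")
print(max(weights, key=weights.get))        # "hot" ends up with the largest weight
```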
Counts for single items < (2/c)·(avg. number of items in a basket)