Probabilistic Data Structures
DR. HAMED ABDELHAQ
Outline
What are Probabilistic Data Structures
Examples of Probabilistic Data Structures
Bloom Filters
Locality Sensitive Hashing
Motivation
When processing large data sets, we might need to do simple tasks:
counting the number of unique items
checking whether some items exist in the data set
Using deterministic data structures with big data
can be very expensive, or even infeasible:
the data may not fit in memory
Motivation: Probabilistic Data Structures
a group of data structures that are extremely useful for big data
the data we are dealing with becomes very large, e.g., arriving from streaming applications
they employ hash functions to randomize and compactly represent a set of items
they use much less memory and have constant query time
When can they be used?
Membership: checking whether some items exist in the data set
E.g., Bloom filters
Frequency: counting the most frequent items
E.g., Count-Min Sketch
Cardinality: estimating the number of distinct elements
E.g., HyperLogLog
Searching: searching for similar items
E.g., locality sensitive hashing
Bloom filter
Solves the approximate set-membership problem
Uses the concept of hash tables:
fast insertion
fast look-ups
So, why Bloom filters?
+ve:
Much more space efficient than hash sets
-ve:
Cannot store associated objects
No deletions
Allows for errors: a non-zero false positive probability
Applications of Bloom filters
Spell checking
Keep track of a list of forbidden passwords
Network router
Limited memory, and you need to be super fast
E.g., keep track of a lot of IP addresses
Ingredients of Bloom filters
Bloom filters have two components:
1. An array A of n entries: each entry is a single bit.
Suppose we have a set to be inserted into the array:
S = {s1, s2, ..., sm}
Thus, the number of bits per element = n/m
2. A set of k hash functions: h1, ..., hk
Now, we need to answer a question like "Is x an element of S?"
If x ∈ S, we must answer yes
Operations
1. Initially, set all entries of the array to 0
2. For each s ∈ S, set A[hi(s)] = 1 for 1 ≤ i ≤ k
(an entry can be set to 1 multiple times; only the first time has an effect)
3. To check if x ∈ S,
check whether all locations A[hi(x)] for 1 ≤ i ≤ k are set to 1
If not, clearly x ∉ S
If all A[hi(x)] are set to 1, we assume x ∈ S
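As a concrete illustration of these three operations, here is a minimal Bloom filter sketch in Python. This is an illustrative sketch only, not a reference implementation from the lecture: the class name, the demo values of n and k, and the use of Python's built-in hash() salted with an index to simulate k hash functions are all assumptions made for the demo.

    class BloomFilter:
        def __init__(self, n, k):
            self.n = n                   # number of bits in the array
            self.k = k                   # number of hash functions
            self.bits = [0] * n          # step 1: initially set the array to 0

        def _positions(self, item):
            # simulate k hash functions h1..hk by salting Python's hash()
            return [hash((i, item)) % self.n for i in range(self.k)]

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = 1       # step 2: A[hi(s)] = 1 for 1 <= i <= k

        def might_contain(self, item):
            # step 3: answer "yes" only if all k locations are 1;
            # a "yes" may be a false positive, a "no" is always correct
            return all(self.bits[pos] == 1 for pos in self._positions(item))

    bf = BloomFilter(n=1000, k=3)
    for s in ["apple", "banana", "cherry"]:
        bf.add(s)
    print(bf.might_contain("banana"))   # True
    print(bf.might_contain("durian"))   # almost certainly False at this density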
Possibility of errors
x1 y x2
0 0 10 0 10 10 0 0 10 0 10 0
If only
Each
To check
element
1s ifappear,
Initialy isofwith
inSconclude
S,
is all
hashed
check
0 thethat
k times
kyhash
is in S
Each
location.
This hash
mayIfyield
location
a 0 false
appears
setpositive
to, 1y is not in S
Performance of Bloom Filters
Probability of a false positive depends on:
The density of 1s in the array
The number of hash functions
density ≈ km/n
The number of 1s is approximately the number of inserted elements times the number of hash functions (mk); collisions lower this slightly.
Estimating error probability
The probability that a given bit is still 0 after inserting all m elements:
p ≈ e^(-km/n)
Probability of a false positive:
f = (1 - p)^k ≈ (1 - e^(-km/n))^k
To find the optimal k that minimizes f, minimize g = ln(f) = k · ln(1 - e^(-km/n)):
dg/dk = ln(1 - e^(-km/n)) + (km/n) · e^(-km/n) / (1 - e^(-km/n))
⇒ k = ln(2) · (n/m)
⇒ f = (1/2)^k = (0.6185...)^(n/m)
The false positive probability falls exponentially in n/m, the number of bits used per item!
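A short sketch of these two formulas in Python; the helper names are hypothetical, but the formulas are exactly the ones derived above:

    import math

    def false_positive_rate(n, m, k):
        # f = (1 - e^(-km/n))^k
        return (1 - math.exp(-k * m / n)) ** k

    def optimal_k(n, m):
        # k = ln(2) * (n/m)
        return math.log(2) * n / m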
Example
Suppose we use an array of n = 1 billion bits, k = 5 hash functions, and m = 100 million elements.
Fraction of zeros ≈ e^(-km/n) = e^(-0.5) ≈ 0.607
Fraction of 1s = 1 - 0.607 = 0.393
Probability of false positive ≈ (0.393)^5 ≈ 0.0094
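These numbers can be checked with the helper functions sketched above:

    n, m, k = 10**9, 10**8, 5
    print(math.exp(-k * m / n))          # fraction of zeros: e^(-0.5) ≈ 0.607
    print(1 - math.exp(-k * m / n))      # fraction of 1s ≈ 0.393
    print(false_positive_rate(n, m, k))  # ≈ 0.393^5 ≈ 0.0094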
Locality Sensitive Hashing
Finding exact duplicate documents in a list may look like a simple task:
use a hash table.
Finding documents with small differences, such as typos or different words:
the problem becomes much more complex.
Jaccard Similarity
J(A, B) = |A ∩ B| / |A ∪ B|: the number of shared elements divided by the total number of distinct elements.
Ex) Our use-case example:
1. "Who was the first president of Palestine"
2. "Who was the first ruler of Palestine"
3. "Who was the first king of Jordan"
Jaccard similarity(q1, q2) = 6/8 = 0.75 (6 shared words, 8 distinct words in total)
The more common words, the bigger the Jaccard index, and the more probable it is that two questions are duplicates.
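A minimal sketch of this computation in Python, assuming word-level tokenization by whitespace:

    def jaccard(a, b):
        # Jaccard similarity over the word sets of two sentences
        A, B = set(a.split()), set(b.split())
        return len(A & B) / len(A | B)

    q1 = "Who was the first president of Palestine"
    q2 = "Who was the first ruler of Palestine"
    print(jaccard(q1, q2))  # 6/8 = 0.75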
Minhash Signatures
Jaccard can be a good string metric; however,
we need to split each question into words,
compare the two sets,
and repeat for every pair.
The number of pairs grows rapidly (quadratically in the number of questions).
Solution: create a simple fixed-size numeric fingerprint (signature) for each sentence,
called a minhash signature.
Creating Minhashes
To calculate MinHash:
create the dictionary (the set of all words) from all our questions
create a random permutation of it
Back to our use-case example, the set of words we have:
(Who, was, the, first, president, of, Palestine, ruler, king, Jordan)
Creating Minhashes
1. "Who was the first president of Palestine"
2. "Who was the first ruler of Palestine"
3. "Who was the first king of Jordan"
1st permutation:
Index  Word       Q1  Q2  Q3
1      ruler      -   1   -
2      of         2   2   2
3      the        3   3   3
4      first      4   4   4
5      president  5   -   -
6      Who        6   6   6
7      Jordan     -   -   7
8      was        8   8   8
9      king       -   -   9
10     Palestine  10  10  -
MinHash = smallest index present in each column: Q1 = 2, Q2 = 1, Q3 = 2
Creating Minhashes
1. "Who was the first president of Palestine"
2. "Who was the first ruler of Palestine"
3. "Who was the first king of Jordan"
2nd permutation:
Index  Word       Q1  Q2  Q3
1      president  1   -   -
2      king       -   -   2
3      Jordan     -   -   3
4      first      4   4   4
5      ruler      -   5   -
6      Palestine  6   6   -
7      the        7   7   7
8      was        8   8   8
9      of         9   9   9
10     Who        10  10  10
MinHash: Q1 = 1, Q2 = 4, Q3 = 2
Creating Minhashes
1. "Who was the first president of Palestine"
2. "Who was the first ruler of Palestine"
3. "Who was the first king of Jordan"
3rd permutation:
Index  Word       Q1  Q2  Q3
1      Jordan     -   -   1
2      Who        2   2   2
3      king       -   -   3
4      first      4   4   4
5      of         5   5   5
6      Palestine  6   6   -
7      president  7   -   -
8      was        8   8   8
9      the        9   9   9
10     ruler      -   10  -
MinHash: Q1 = 2, Q2 = 2, Q3 = 1
Resulting Minhash Signatures
Trying 3 more permutations (6 in total), we might end up having the following minhashes,
where the first three entries are the values computed above:
MinHash(Q1) = [2, 1, 2, 2, 1, 1]
MinHash(Q2) = [1, 4, 2, 2, 1, 1]
MinHash(Q3) = [2, 2, 1, 4, 4, 1]
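The permutation-based construction above can be sketched in Python as follows. Since the permutations are drawn at random (the seed below is an arbitrary choice to make the demo repeatable), the resulting signatures will generally differ from the ones listed above.

    import random

    questions = [
        "Who was the first president of Palestine",
        "Who was the first ruler of Palestine",
        "Who was the first king of Jordan",
    ]

    # the dictionary: the set of all words across all questions
    vocab = sorted({w for q in questions for w in q.split()})

    random.seed(0)
    def random_permutation(vocab):
        # map each word to a position 1..|vocab| in a random order
        order = vocab[:]
        random.shuffle(order)
        return {w: i + 1 for i, w in enumerate(order)}

    permutations = [random_permutation(vocab) for _ in range(6)]

    def minhash_signature(question, permutations):
        # signature entry i = smallest index, under permutation i,
        # of any word occurring in the question
        words = set(question.split())
        return [min(perm[w] for w in words) for perm in permutations]

    for q in questions:
        print(minhash_signature(q, permutations))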
Locality Sensitive Hashing (LSH) for Minhash Signatures
Problem: finding questions similar to a given question is computationally expensive,
even when using minhash signatures.
Solution: "hash" items several times, in such a way that
similar items are more likely to be hashed to the same bucket than dissimilar items are;
any pair hashed to the same bucket by any of the hashings becomes a candidate pair
false positives: dissimilar pairs in the same bucket
false negatives: similar pairs in different buckets
LSH - Minhash Signature Partitioning
Divide the signature into b bands consisting of r rows each.
This increases the chance of having bands with identical partitions;
these identical partitions will then be mapped to the same bucket.
MinHash(Q1) = [2, 1, 2, 2, 1, 1] => [2, 1, 2] [2, 1, 1]
MinHash(Q2) = [1, 4, 2, 2, 1, 1] => [1, 4, 2] [2, 1, 1]
MinHash(Q3) = [2, 2, 1, 4, 4, 1] => [2, 2, 1] [4, 4, 1]
Here b = 2 and r = 3: Q1 and Q2 share the second band [2, 1, 1], so they land in the same bucket and become a candidate pair.
LSH – Mapping elements to buckets
For each band, use a hash function that
takes vectors of r integers and
hashes them to some large number of buckets.
We can use a different hash function for each band.
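A minimal sketch of the banding step in Python, using the signatures above with b = 2 bands of r = 3 rows. Salting Python's built-in hash() with the band index plays the role of a separate hash function per band; the bucket count of 10^6 is an arbitrary demo value.

    def lsh_buckets(signatures, b, r):
        buckets = {}
        for name, sig in signatures.items():
            for band in range(b):
                chunk = tuple(sig[band * r:(band + 1) * r])
                # salt the hash with the band index: one function per band
                bucket_id = hash((band, chunk)) % 10**6
                buckets.setdefault((band, bucket_id), []).append(name)
        return buckets

    signatures = {
        "Q1": [2, 1, 2, 2, 1, 1],
        "Q2": [1, 4, 2, 2, 1, 1],
        "Q3": [2, 2, 1, 4, 4, 1],
    }
    for members in lsh_buckets(signatures, b=2, r=3).values():
        if len(members) > 1:
            print(members)  # ['Q1', 'Q2'], which share the band [2, 1, 1]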
Analysis of the banding technique
Find the probability that a pair of documents is mapped to the same bucket, i.e., becomes a candidate pair.
Assume the Jaccard similarity between them is s.
Two minhash signatures agree in any single row with probability s, so:
The probability that the signatures agree in all r rows of one particular band is s^r.
The probability that the signatures disagree in at least one row of a particular band is 1 - s^r.
The probability that the signatures disagree in at least one row of each of the b bands is (1 - s^r)^b.
The probability that the signatures agree in all the rows of at least one band, and therefore become a candidate pair, is 1 - (1 - s^r)^b.
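This expression can be evaluated directly to see its characteristic S-curve; the sketch below uses b = 20 bands of r = 5 rows, which are arbitrary demo values:

    def candidate_probability(s, b, r):
        # probability that a pair with Jaccard similarity s becomes a candidate pair
        return 1 - (1 - s ** r) ** b

    # the curve rises sharply around the threshold s ≈ (1/b)^(1/r) ≈ 0.55 here
    for s in [0.2, 0.4, 0.6, 0.8]:
        print(s, round(candidate_probability(s, b=20, r=5), 3))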