Probabilistic Data Structures
DR. HAMED ABDELHAQ
Outline
What are Probabilistic Data Structures
Examples of Probabilistic Data Structures
Bloom Filters
Locality Sensitive Hashing
Motivation
When processing large data sets, we might need to do simple tasks:
counting the number of unique items
checking whether some items exist in the data set
Using deterministic data structures with big data
can be very expensive, or even infeasible:
the data may not fit in memory
Motivation: Probabilistic Data Structures
a group of data structures that are extremely useful for big data
the data we are dealing with becomes very large, e.g., arriving from streaming applications
they employ hash functions to randomize and compactly represent a set of items
they use much less memory and have constant query time
When can they be used?
Membership: checking whether some items exist in the data set
E.g., Bloom filters
Frequency: counting the most frequent items
E.g., Count-Min Sketch
Cardinality: estimating the number of distinct elements
E.g., HyperLogLog
Searching: searching for similar items
E.g., locality sensitive hashing
Bloom filter
Solves the approximate set-membership problem
Uses the concept of hash tables:
fast insertion
fast look-ups
So, why Bloom filters?
+ve:
Much more space efficient than hash sets
-ve:
Cannot store associated objects
No deletions
Allows for errors: a non-zero false positive probability
Applications of Bloom filters
Spell checking
Keep track of a list of forbidden passwords
Network router
Limited memory, and you need to be super fast
E.g., keep track of a lot of IP addresses
Ingredients of Bloom filters
Bloom filters have two components:
1. An array A of n entries: each entry is a single bit.
Suppose we have a set to be inserted into the array:
S = {s1, s2, ..., sm}
Thus, the number of bits per element = n/m
2. A set of k hash functions: h1, ..., hk
Now, we need to answer a question like "Is x an element of S?"
If x ∈ S, we must answer yes
Operations
1. Initially, set all entries of the array to 0
2. For each s ∈ S, set A[hi(s)] = 1 for 1 ≤ i ≤ k
(an entry can be set to 1 multiple times; only the first time has an effect)
3. To check if x ∈ S,
check whether all locations A[hi(x)] for 1 ≤ i ≤ k are set to 1
If not, clearly x ∉ S
If all A[hi(x)] are set to 1, we assume x ∈ S
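As a concrete illustration of these three operations, here is a minimal Bloom filter sketch in Python. This is an illustrative sketch only, not a reference implementation from the lecture: the class name, the demo values of n and k, and the use of Python's built-in hash() salted with an index to simulate k hash functions are all assumptions made for the demo.

    class BloomFilter:
        def __init__(self, n, k):
            self.n = n                   # number of bits in the array
            self.k = k                   # number of hash functions
            self.bits = [0] * n          # step 1: initially set the array to 0

        def _positions(self, item):
            # simulate k hash functions h1..hk by salting Python's hash()
            return [hash((i, item)) % self.n for i in range(self.k)]

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = 1       # step 2: A[hi(s)] = 1 for 1 <= i <= k

        def might_contain(self, item):
            # step 3: answer "yes" only if all k locations are 1;
            # a "yes" may be a false positive, a "no" is always correct
            return all(self.bits[pos] == 1 for pos in self._positions(item))

    bf = BloomFilter(n=1000, k=3)
    for s in ["apple", "banana", "cherry"]:
        bf.add(s)
    print(bf.might_contain("banana"))   # True
    print(bf.might_contain("durian"))   # almost certainly False at this density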
Possibility of errors
x1 y x2
0 0 10 0 10 10 0 0 10 0 10 0
If only
Each
To check
element
1s ifappear,
Initialy isofwith
inSconclude
S,
is all
hashed
check
0 thethat
k times
kyhash
is in S
Each
location.
This hash
mayIfyield
location
a 0 false
appears
setpositive
to, 1y is not in S
Performance of Bloom Filters
Probability of a false positive depends on:
The density of 1s in the array
The number of hash functions
density ≈ km/n
The number of 1s is approximately the number of inserted elements times the number of hash functions (mk); collisions lower this slightly.
Estimating error probability
The probability that a given bit is still 0 after inserting all m elements:
p ≈ e^(-km/n)
Probability of a false positive:
f = (1 - p)^k ≈ (1 - e^(-km/n))^k
To find the optimal k that minimizes f, minimize g = ln(f) = k · ln(1 - e^(-km/n)):
dg/dk = ln(1 - e^(-km/n)) + (km/n) · e^(-km/n) / (1 - e^(-km/n))
⇒ k = ln(2) · (n/m)
⇒ f = (1/2)^k = (0.6185...)^(n/m)
The false positive probability falls exponentially in n/m, the number of bits used per item!
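A short sketch of these two formulas in Python; the helper names are hypothetical, but the formulas are exactly the ones derived above:

    import math

    def false_positive_rate(n, m, k):
        # f = (1 - e^(-km/n))^k
        return (1 - math.exp(-k * m / n)) ** k

    def optimal_k(n, m):
        # k = ln(2) * (n/m)
        return math.log(2) * n / m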
Example
Suppose we use an array of n = 1 billion bits, k = 5 hash functions, and m = 100 million elements.
Fraction of zeros ≈ e^(-km/n) = e^(-0.5) ≈ 0.607
Fraction of 1s = 1 - 0.607 = 0.393
Probability of false positive ≈ (0.393)^5 ≈ 0.0094
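These numbers can be checked with the helper functions sketched above:

    n, m, k = 10**9, 10**8, 5
    print(math.exp(-k * m / n))          # fraction of zeros: e^(-0.5) ≈ 0.607
    print(1 - math.exp(-k * m / n))      # fraction of 1s ≈ 0.393
    print(false_positive_rate(n, m, k))  # ≈ 0.393^5 ≈ 0.0094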
Locality Sensitive Hashing
Finding exact duplicate documents in a list may look like a simple task:
use a hash table.
Finding documents with small differences, such as typos or different words:
the problem becomes much more complex.
Jaccard Similarity
J(A, B) = |A ∩ B| / |A ∪ B|: the number of shared elements divided by the total number of distinct elements.
Ex) Our use-case example:
1. "Who was the first president of Palestine"
2. "Who was the first ruler of Palestine"
3. "Who was the first king of Jordan"
Jaccard similarity(q1, q2) = 6/8 = 0.75 (6 shared words, 8 distinct words in total)
The more common words, the bigger the Jaccard index, and the more probable it is that two questions are duplicates.
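A minimal sketch of this computation in Python, assuming word-level tokenization by whitespace:

    def jaccard(a, b):
        # Jaccard similarity over the word sets of two sentences
        A, B = set(a.split()), set(b.split())
        return len(A & B) / len(A | B)

    q1 = "Who was the first president of Palestine"
    q2 = "Who was the first ruler of Palestine"
    print(jaccard(q1, q2))  # 6/8 = 0.75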
Minhash Signatures
Jaccard can be a good string metric; however,
we need to split each question into words,
compare the two sets,
and repeat for every pair.
The number of pairs grows rapidly (quadratically in the number of questions).
Solution: create a simple fixed-size numeric fingerprint (signature) for each sentence,
called a minhash signature.
Creating Minhashes
To calculate MinHash:
create the dictionary (the set of all words) from all our questions
create a random permutation of it
Back to our use-case example, the set of words we have:
(Who, was, the, first, president, of, Palestine, ruler, king, Jordan)
Creating Minhashes
1. "Who was the first president of Palestine"
2. "Who was the first ruler of Palestine"
3. "Who was the first king of Jordan"
1st permutation:
Index  Word       Q1  Q2  Q3
1      ruler      -   1   -
2      of         2   2   2
3      the        3   3   3
4      first      4   4   4
5      president  5   -   -
6      Who        6   6   6
7      Jordan     -   -   7
8      was        8   8   8
9      king       -   -   9
10     Palestine  10  10  -
MinHash = smallest index present in each column: Q1 = 2, Q2 = 1, Q3 = 2
Creating Minhashes
1. "Who was the first president of Palestine"
2. "Who was the first ruler of Palestine"
3. "Who was the first king of Jordan"
2nd permutation:
Index  Word       Q1  Q2  Q3
1      president  1   -   -
2      king       -   -   2
3      Jordan     -   -   3
4      first      4   4   4
5      ruler      -   5   -
6      Palestine  6   6   -
7      the        7   7   7
8      was        8   8   8
9      of         9   9   9
10     Who        10  10  10
MinHash: Q1 = 1, Q2 = 4, Q3 = 2
Creating Minhashes
1. "Who was the first president of Palestine"
2. "Who was the first ruler of Palestine"
3. "Who was the first king of Jordan"
3rd permutation:
Index  Word       Q1  Q2  Q3
1      Jordan     -   -   1
2      Who        2   2   2
3      king       -   -   3
4      first      4   4   4
5      of         5   5   5
6      Palestine  6   6   -
7      president  7   -   -
8      was        8   8   8
9      the        9   9   9
10     ruler      -   10  -
MinHash: Q1 = 2, Q2 = 2, Q3 = 1
Resulting Minhash Signatures
Trying 3 more permutations (6 in total), we might end up having the following minhashes,
where the first three entries are the values computed above:
MinHash(Q1) = [2, 1, 2, 2, 1, 1]
MinHash(Q2) = [1, 4, 2, 2, 1, 1]
MinHash(Q3) = [2, 2, 1, 4, 4, 1]
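The permutation-based construction above can be sketched in Python as follows. Since the permutations are drawn at random (the seed below is an arbitrary choice to make the demo repeatable), the resulting signatures will generally differ from the ones listed above.

    import random

    questions = [
        "Who was the first president of Palestine",
        "Who was the first ruler of Palestine",
        "Who was the first king of Jordan",
    ]

    # the dictionary: the set of all words across all questions
    vocab = sorted({w for q in questions for w in q.split()})

    random.seed(0)
    def random_permutation(vocab):
        # map each word to a position 1..|vocab| in a random order
        order = vocab[:]
        random.shuffle(order)
        return {w: i + 1 for i, w in enumerate(order)}

    permutations = [random_permutation(vocab) for _ in range(6)]

    def minhash_signature(question, permutations):
        # signature entry i = smallest index, under permutation i,
        # of any word occurring in the question
        words = set(question.split())
        return [min(perm[w] for w in words) for perm in permutations]

    for q in questions:
        print(minhash_signature(q, permutations))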
Locality Sensitive Hashing (LSH) for Minhash Signatures
Problem: finding questions similar to a given question is computationally expensive,
even when using minhash signatures.
Solution: "hash" items several times, in such a way that
similar items are more likely to be hashed to the same bucket than dissimilar items are;
any pair hashed to the same bucket by any of the hashings becomes a candidate pair
false positives: dissimilar pairs in the same bucket
false negatives: similar pairs in different buckets
LSH - Minhash Signature Partitioning
Divide the signature into b bands consisting of r rows each.
This increases the chance of having bands with identical partitions;
these identical partitions will then be mapped to the same bucket.
MinHash(Q1) = [2, 1, 2, 2, 1, 1] => [2, 1, 2] [2, 1, 1]
MinHash(Q2) = [1, 4, 2, 2, 1, 1] => [1, 4, 2] [2, 1, 1]
MinHash(Q3) = [2, 2, 1, 4, 4, 1] => [2, 2, 1] [4, 4, 1]
Here b = 2 and r = 3: Q1 and Q2 share the second band [2, 1, 1], so they land in the same bucket and become a candidate pair.
LSH – Mapping elements to buckets
For each band, use a hash function that
takes vectors of r integers and
hashes them to some large number of buckets.
We can use a different hash function for each band.
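A minimal sketch of the banding step in Python, using the signatures above with b = 2 bands of r = 3 rows. Salting Python's built-in hash() with the band index plays the role of a separate hash function per band; the bucket count of 10^6 is an arbitrary demo value.

    def lsh_buckets(signatures, b, r):
        buckets = {}
        for name, sig in signatures.items():
            for band in range(b):
                chunk = tuple(sig[band * r:(band + 1) * r])
                # salt the hash with the band index: one function per band
                bucket_id = hash((band, chunk)) % 10**6
                buckets.setdefault((band, bucket_id), []).append(name)
        return buckets

    signatures = {
        "Q1": [2, 1, 2, 2, 1, 1],
        "Q2": [1, 4, 2, 2, 1, 1],
        "Q3": [2, 2, 1, 4, 4, 1],
    }
    for members in lsh_buckets(signatures, b=2, r=3).values():
        if len(members) > 1:
            print(members)  # ['Q1', 'Q2'], which share the band [2, 1, 1]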
Analysis of the banding technique
Find the probability that a pair of documents is mapped to the same bucket, i.e., becomes a candidate pair.
Assume the Jaccard similarity between them is s.
Two minhash signatures agree in any single row with probability s, so:
The probability that the signatures agree in all r rows of one particular band is s^r.
The probability that the signatures disagree in at least one row of a particular band is 1 - s^r.
The probability that the signatures disagree in at least one row of each of the b bands is (1 - s^r)^b.
The probability that the signatures agree in all the rows of at least one band, and therefore become a candidate pair, is 1 - (1 - s^r)^b.
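This expression can be evaluated directly to see its characteristic S-curve; the sketch below uses b = 20 bands of r = 5 rows, which are arbitrary demo values:

    def candidate_probability(s, b, r):
        # probability that a pair with Jaccard similarity s becomes a candidate pair
        return 1 - (1 - s ** r) ** b

    # the curve rises sharply around the threshold s ≈ (1/b)^(1/r) ≈ 0.55 here
    for s in [0.2, 0.4, 0.6, 0.8]:
        print(s, round(candidate_probability(s, b=20, r=5), 3))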