Privacy-Preserving
Databases and Data
Mining
Yücel SAYGIN
ysaygin@sabanciuniv.edu
http://people.sabanciuniv.edu/~ysaygin/
Privacy and data mining
There are two aspects of data mining when we look at it from a
privacy perspective:
Being able to mine the data without seeing the actual data
Protecting the privacy of people against the misuse of their data
How can we protect the sensitive knowledge
against data mining?
Types of sensitive knowledge that can be extracted via data
mining techniques are:
Patterns (Association rules, sequences)
Clusters that describe the data
Classification models for prediction
Association Rule Hiding
Large amounts of customer transaction data are collected by
supermarket chains to find association rules in customer buying
patterns
A lot of research has been conducted on finding association rules
efficiently, and tools have been developed.
Association rule mining is deterministic for given support and
confidence thresholds
Therefore association rules are a good starting point.
Motivating examples
Sniffing Prozac users
Association Rule Hiding
Rules: “Body → Head”
Ex1: “Diapers → Beer”
Ex2: “Internetworking with TCP/IP” → “Interconnections:
Bridges, Routers, …”
Parameters: (support, confidence)
Minimum support and confidence thresholds are
used to prune the non-significant rules
Min. support 50%, min. confidence 70%

Trans ID   Items Bought
1          A, B, D
2          B, E, F
3          A, C, D
4          A, B
5          A, B, D

Frequent Itemset   Support        Rule      Support
{A}                80%            A => B    60%
{B}                80%            A => D    60%
{D}                60%
{A,B}              60%
{A,D}              60%
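To make the numbers above concrete, here is a minimal sketch (in Python, not part of the original slides) that recomputes the supports and confidences; the transactions and thresholds are taken directly from the tables above.

    transactions = [
        {"A", "B", "D"},   # 1
        {"B", "E", "F"},   # 2
        {"A", "C", "D"},   # 3
        {"A", "B"},        # 4
        {"A", "B", "D"},   # 5
    ]

    def support(itemset):
        # Fraction of transactions that contain every item of `itemset`.
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(body, head):
        # conf(body => head) = sup(body U head) / sup(body)
        return support(body | head) / support(body)

    print(support({"A", "B"}))        # 0.6  -> rule A => B has support 60%
    print(confidence({"A"}, {"B"}))   # 0.75 -> passes the 70% confidence threshold
    print(confidence({"A"}, {"D"}))   # 0.75 -> so A => D is also significant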
Algorithms for Rule Hiding
What we try to achieve is:
Let D be the source database
Let R be the set of significant association rules that are mined from D
with certain thresholds
Let ri be a sensitive rule in R
Transform D into D’ so that all rules in R can still be mined from D’,
except ri
It has been proven that optimally hiding association rules with minimal
side effects is NP-hard
Heuristic Methods
We developed heuristics to deal with the problem.
Different techniques are implemented based on:
Modifying the database by inserting false data or by removing some data.
Inserting unknown values to fuzzify the rules
Basic Approach for Rule Hiding
Reduce the support of confidential rules
Reduce the confidence of rules
This way, mining tools are prevented from discovering these rules
The challenge is maintaining data quality
Our metric for data quality is the number of rules that can still be
mined and the number of rules that appear as a side effect
We developed heuristic algorithms to minimize the newly appearing
rules, and to minimize the accidentally hidden rules.
Basics of the heuristic algorithms
If we want to remove an item from a transaction to reduce the support
or the confidence:
Which item should we start from?
Which transaction should we choose to hide the selected item in?
We can either
Select an item and a transaction in round-robin fashion, i.e., select the next
item from the next transaction that supports that item, then move to another
item and another transaction, or
Select the item that will probably have the minimal impact on the other rules
Basics of rule hiding
conf(X=>Y) = sup(XY)/sup(X)
Decreasing the confidence of a rule can be done by:
Increasing the support of X in transactions not supporting Y
Decreasing the support of Y in transactions supporting both X and Y
Decreasing the support of a rule can be done by
Decreasing the support of the corresponding large itemset XY
Min. support 20%, min. confidence 80%

Trans ID   Items Bought
1          A, B, C
2          A, B, C
3          A, C
4          A
5          B

Rule       Confidence
AB => C    100%
BC => A    100%
Hiding AB->C by increasing support of AB
Before:                          After:
Trans ID   Items Bought          Trans ID   Items Bought
1          A, B, C               1          A, B, C
2          A, B, C               2          A, B, C
3          A, C                  3          A, C
4          A                     4          A, B
5          B                     5          B

Rule       Confidence
AB => C    66%
BC => A    100%
Hiding AB->C by decreasing support of ABC
Before:                          After:
Trans ID   Items Bought          Trans ID   Items Bought
1          A, B, C               1          A, C
2          A, B, C               2          A, B
3          A, C                  3          A, C
4          A                     4          A
5          B                     5          B

Rule       Confidence
AB => C    0%
BC => A    0%
Hiding AB->C by decreasing the support of C
Before:                          After:
Trans ID   Items Bought          Trans ID   Items Bought
1          A, B, C               1          A, B
2          A, B, C               2          A, B, C
3          A, C                  3          A, C
4          A                     4          A
5          B                     5          B

Rule       Confidence
AB => C    50%
BC => A    100%
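The last example above can be reproduced with a small sketch. The following Python fragment (an illustrative assumption, not the authors' actual algorithm) hides AB => C by removing the consequent C from transactions supporting {A,B,C}, using a naive first-match choice instead of the impact-minimizing heuristics described earlier.

    def support_count(db, itemset):
        return sum(itemset <= t for t in db)

    def hide_rule(db, body, head, min_conf):
        # Remove items of `head` from transactions supporting body+head
        # until conf(body => head) falls below min_conf.
        full = body | head
        while True:
            sup_body = support_count(db, body)
            if sup_body == 0 or support_count(db, full) / sup_body < min_conf:
                return db
            # Naive choice: first supporting transaction (the heuristics in the
            # text would instead pick the removal with minimal side effects).
            victim = next(t for t in db if full <= t)
            victim -= head

    db = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "C"}, {"A"}, {"B"}]
    hide_rule(db, {"A", "B"}, {"C"}, min_conf=0.8)
    print(db)   # C removed from the first transaction: conf(AB => C) drops to 50%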
Rule Hiding by Fuzzification
In some applications where publishing wrong data is not acceptable,
unknown values may be inserted to blur the rules.
When unknown values are inserted, support and confidence values
fall into a range instead of being a fixed value.
Similar heuristics for rule hiding can be employed to minimize the side
effects
Before:                     After (with unknowns):
TID  A  B  C  D             TID  A  B  C  D
1    1  1  0  1             1    ?  1  0  1
2    0  1  0  0             2    0  1  0  0
3    1  0  1  1             3    1  0  1  ?
4    1  1  0  0             4    1  ?  0  0
5    1  1  0  1             5    1  ?  0  1

Support and confidence become a range of values.
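A small sketch of how the range arises (illustrative Python, assuming each '?' may stand for either 0 or 1): the minimum support treats every '?' as 0, the maximum treats every '?' as 1.

    # The fuzzified table above, with None standing for '?'.
    rows = [
        {"A": None, "B": 1,    "C": 0, "D": 1},
        {"A": 0,    "B": 1,    "C": 0, "D": 0},
        {"A": 1,    "B": 0,    "C": 1, "D": None},
        {"A": 1,    "B": None, "C": 0, "D": 0},
        {"A": 1,    "B": None, "C": 0, "D": 1},
    ]

    def support_range(itemset):
        # Minimum support: every '?' counts as 0; maximum: every '?' counts as 1.
        lo = sum(all(r[i] == 1 for i in itemset) for r in rows)
        hi = sum(all(r[i] in (1, None) for i in itemset) for r in rows)
        return lo / len(rows), hi / len(rows)

    print(support_range({"A", "B"}))    # (0.0, 0.6): the support is now a range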
Classification model as a threat to privacy
Document classification for authorship identification
Main idea: based on a database of documents and authors, assign the
most probable author to a new document
This is a possible threat to privacy when a text needs to stay anonymous
Classification model as a threat to privacy
The fact that each author uses a characteristic frequency distribution
over words and phrases helps us
Feature representation used:
T: total number of tokens
V: total number of types
C: total number of characters
Classify the document by a learning algorithm and then try to perturb
the classification
Another Motivating Application
Given a set of confidential attribute values that are downgraded by
inserting unknown values in place of the actual ones before release:
Can someone build a classification model using the rest of the
attributes to predict the hidden values?
Classification Models as a threat to
privacy
How do we prevent a row from being classified as class C by perturbing
the data?
Main challenge is that (unlike association rule mining) resulting
classification models depend on the technique, selected training data,
and pruning methodology.
The purpose is to decrease the accuracy of the classification model.
Approach: inserting unknown values into selected attribute values in
the rest of the database.
Mining the data without actually seeing it
Things that we need to consider are:
Data type
Data mining technique
Data distribution
Centralized
Distributed (vertically or horizontally)
Classification on perturbed data
Reference: Rakesh Agrawal and Ramakrishnan Srikant. “Privacy-
Preserving Data Mining”. SIGMOD, 2000, Dallas, TX.
They developed a technique for constructing a classification model
on perturbed data.
The data is assumed to be stored in a centralized database
and is outsourced to a third party for mining; therefore the
confidential values need to be handled
The following slides are based on the slides by the authors of the
paper above
Reconstruction Problem
Original values x1, x2, ..., xn
from probability distribution X (unknown)
To hide these values, we use y1, y2, ..., yn
from probability distribution Y
Given:
x1+y1, x2+y2, ..., xn+yn
and the probability distribution of Y,
estimate the probability distribution of X.
Intuition (Reconstruct single point)
Use Bayes' rule for density functions
[Figure: the original distribution for Age over the range 10–90,
with the probabilistic estimate of the original value of V]
Reconstructing the Distribution
Combine the estimates of where each point came from, over all the
points: this gives an estimate of the original distribution.
[Figure: estimate of the original Age distribution over the range 10–90]
Reconstruction: Bootstrapping
f_X^0 := uniform distribution
j := 0   // iteration number
repeat
    f_X^{j+1}(a) := (1/n) Σ_{i=1..n} [ f_Y((x_i + y_i) − a) · f_X^j(a) ] / [ ∫ f_Y((x_i + y_i) − z) · f_X^j(z) dz ]   (Bayes' rule)
    j := j + 1
until (stopping criterion met)
Shown to work in experiments on large
data sets.
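A minimal, discretized sketch of the loop above (illustrative Python; the bimodal test data, Gaussian noise, and fixed iteration count are assumptions for demonstration, not the paper's experimental setup):

    import numpy as np

    rng = np.random.default_rng(0)
    n, sigma = 10_000, 10.0
    x = rng.choice([25.0, 55.0], size=n)             # hidden "true" ages (assumed bimodal)
    w = x + rng.normal(0.0, sigma, size=n)           # released values x_i + y_i

    grid = np.linspace(0.0, 100.0, 101)              # candidate values a
    f_Y = lambda d: np.exp(-d**2 / (2 * sigma**2))   # known noise density (unnormalized)

    f_X = np.full_like(grid, 1.0 / len(grid))        # f_X^0 := uniform distribution
    for _ in range(20):                              # fixed iteration count as stopping criterion
        # posterior(i, a) is proportional to f_Y(w_i - a) * f_X(a), normalized over a
        lik = f_Y(w[:, None] - grid[None, :]) * f_X[None, :]
        post = lik / lik.sum(axis=1, keepdims=True)
        f_X = post.mean(axis=0)                      # average the per-point estimates

    print(grid[f_X.argmax()])                        # peaks recover the modes near 25 and 55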
[Figure: histogram of Number of People vs. Age (ticks at 20 and 60),
comparing the Original, Randomized, and Reconstructed distributions]
Algorithms
“Global” Algorithm
Reconstruct for each attribute once at the beginning
“By Class” Algorithm
For each attribute, first split by class, then reconstruct separately
for each class.
See SIGMOD 2000 paper for details.
Experimental Methodology
Compare accuracy against
– Original: unperturbed data without randomization.
– Randomized: perturbed data but without making any corrections for
randomization.
Test data not randomized.
Synthetic data benchmark.
Training set of 100,000 records, split equally between the two
classes.
Quantifying Privacy
Add a random value between -30 and +30 to age.
If the randomized value is 60, we know with 90% confidence that the
true age is between 33 and 87.
The interval width defines the amount of privacy.
Example: (interval width: 54) / (range of Age: 100) → 54%
randomization level @ 90% confidence
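As a quick sketch (assuming uniform noise on [-alpha, +alpha], as in the example above):

    # For uniform noise in [-alpha, +alpha], the interval that contains the
    # true value with the given confidence has width 2 * alpha * confidence.
    def randomization_level(alpha, attr_range, confidence):
        interval_width = 2 * alpha * confidence   # e.g. 0.9 * 60 = 54 for alpha = 30
        return interval_width / attr_range

    print(randomization_level(alpha=30, attr_range=100, confidence=0.9))  # 0.54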
Privacy Preserving Distributed Data Mining
Consider the case where data is distributed horizontally or vertically
across multiple sites.
Each site is autonomous and does not want to share its actual data
Let's consider the following scenario:
There are multiple hospitals that have their own local databases,
and they would like to participate in a scientific study that will analyze the
results of treatments for different patients
The privacy concern here is that a hospital would not like to share its
knowledge unless the other sites also have it, to protect the privacy of itself
and its operation
Another scenario:
Two bookstores would like to learn which books are sold together so that
they can make offers to their customers (Amazon actually does that)
Case study: Association rules
How do we mine association rules from distributed sources while
preserving the privacy of the data owners?
The confidential information in this case is:
The data itself
The fact that a local site supports a rule with a certain confidence and a
certain support (no company wants to lose competitive advantage, and
would not like to reveal anything unless it benefits from the release of the
data)
Privacy preserving distributed association rule mining methods use
distributed rule mining techniques
Distributed rule mining
We know how rules are mined from centralized databases
The distributed scenario is similar
Consider that we have only two sites S1 and S2, which have
databases D1 (with 3 transactions) and D2 (with 5 transactions)
D1:                            D2:
Trans ID   Items Bought        Trans ID   Items Bought
1          A, B, C             1          A, B, C
2          A, B, C             2          A, B, C
3          A, C                3          A, C
                               4          A
                               5          B
Distributed rule mining
We would like to mine the databases as if they were parts of a single
centralized database of 8 transactions
In order to do this, we need to calculate the local supports
For example the local support of A in D1 is 100%
The local support of the itemset {A,B,C} in D1 is 66%, and the local
support of {A,B,C} in D2 is 40%.
D1:                            D2:
Trans ID   Items Bought        Trans ID   Items Bought
1          A, B, C             1          A, B, C
2          A, B, C             2          A, B, C
3          A, C                3          A, C
                               4          A
                               5          B
Distributed rule mining
Assume that the minimum support threshold is 50%; then {A,B,C} is
frequent in D1, but it is not frequent in D2.
However, when the databases are combined, the support of {A,B,C} in
D1 ∪ D2 is 50% (4 out of 8 transactions),
which means that an itemset can be locally frequent in one database,
but not frequent in another database, and still be frequent globally
In order for an itemset to be frequent globally, it must be frequent in at
least one database: if it were below the threshold at every site, its global
support (a weighted average of the local supports) would also be below
the threshold
D1:                            D2:
Trans ID   Items Bought        Trans ID   Items Bought
1          A, B, C             4          A, B, C
2          A, B, C             5          A, B, C
3          A, C                6          A, C
                               7          A
                               8          B
Distributed rule mining
The algorithm is based on Apriori, which prunes candidate itemsets by
looking at their support
Apriori also uses the fact that an itemset can be frequent only if all its
subsets are frequent
Therefore only frequent itemsets need to be used to generate larger
candidate itemsets
D1:                            D2:
Trans ID   Items Bought        Trans ID   Items Bought
1          A, B, C             4          A, B, C
2          A, B, C             5          A, B, C
3          A, C                6          A, C
                               7          A
                               8          B
Distributed rule mining
1) The local sites find their locally frequent itemsets.
2) They broadcast these frequent itemsets to each other.
3) Individual sites count the frequencies of the received itemsets in their
local databases.
4) They broadcast the resulting counts to every site.
5) Every site can now find the globally frequent itemsets (a sketch follows
the example below).
D1:                            D2:
Trans ID   Items Bought        Trans ID   Items Bought
1          A, B, C             4          A, B, C
2          A, B, C             5          A, B, C
3          A, C                6          A, C
                               7          A
                               8          B
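A minimal sketch of steps 1-5 on the running example (illustrative Python with no privacy protection yet; the candidate generation is simplified relative to full Apriori):

    from itertools import combinations

    # Running example: D1 has 3 transactions, D2 has 5 (8 in total).
    D1 = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "C"}]
    D2 = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "C"}, {"A"}, {"B"}]
    sites, min_sup = [D1, D2], 0.5

    def local_count(db, itemset):
        return sum(itemset <= t for t in db)

    # Step 1: each site finds its locally frequent itemsets
    # (for brevity, all candidates of size 1 and 2 are tested directly).
    candidates = [frozenset(c) for k in (1, 2)
                  for c in combinations("ABC", k)]
    locally_frequent = [c for c in candidates
                        if any(local_count(db, c) >= min_sup * len(db) for db in sites)]

    # Steps 2-5: broadcast the itemsets, sum the local counts,
    # and let every site test global frequency.
    total = sum(len(db) for db in sites)
    for c in locally_frequent:
        count = sum(local_count(db, c) for db in sites)
        print(set(c), f"{count}/{total}", count >= min_sup * total)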
Distributed rule mining
Ex: 50% min. supp. threshold
● We start from singletons and calculate the frequencies of the items
● In D1, A (freq 3), B (freq 2), and C (freq 3) are frequent; in D2, A (freq 4),
B (freq 3), and C (freq 3) are frequent
● The sites broadcast the results to each other, and each site updates the
counts of A, B, C by adding up the local counts
D1:                            D2:
Trans ID   Items Bought        Trans ID   Items Bought
1          A, B, C             4          A, B, C
2          A, B, C             5          A, B, C
3          A, C                6          A, C
                               7          A
                               8          B
Distributed rule mining
Ex: 50% min. supp. threshold
● Each site eliminates the items that are not globally frequent. In this
case all of A, B, C are globally frequent
● Now, using the frequent items, each site generates the candidates of size 2,
which are {A,B}, {A,C}, {B,C}
● The same steps are then applied to these candidates
D1:                            D2:
Trans ID   Items Bought        Trans ID   Items Bought
1          A, B, C             4          A, B, C
2          A, B, C             5          A, B, C
3          A, C                6          A, C
                               7          A
                               8          B
Now we would like to do the same thing but
preserve the privacy of the individual sites
The basic notions we need for that are
Commutative encryption
And Secure multi-party computation
An encryption scheme is commutative if the following holds for any
feasible encryption keys K1, K2, ..., Kn, any message M, and any two
permutations (i1, ..., in) and (j1, ..., jn) of 1..n:
E_{Ki1}( ... E_{Kin}(M) ... ) = E_{Kj1}( ... E_{Kjn}(M) ... )
For different messages M1 and M2, the probability of a collision is very low
RSA is a famous commutative encryption technique
A simple application of commutative encryption
Assume that person A has salary S1, and person B has salary S2.
How can they learn whether their salaries are equal to each other,
without revealing the salaries themselves?
Assume that A and B have their own encryption keys, say K1 and
K2. And we go from there!
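One concrete way to instantiate this (an illustrative assumption; the slides mention RSA, but any commutative scheme behaves the same way) is the exponentiation cipher E_K(M) = M^K mod p:

    # Commutative cipher: E_K(M) = M^K mod p, so E_K1(E_K2(M)) == E_K2(E_K1(M)).
    p = 2**127 - 1                      # a toy public prime modulus (assumption)

    def E(key, m):
        return pow(m, key, p)

    K1, K2 = 65537, 90001               # private keys of A and B (arbitrary examples)
    S1, S2 = 120_000, 120_000           # the salaries; never exchanged in the clear

    # A sends E_K1(S1) to B; B sends E_K2(S2) to A.
    # Each encrypts what it received with its own key, then the results are compared.
    a_side = E(K1, E(K2, S2))           # computed by A
    b_side = E(K2, E(K1, S1))           # computed by B
    print(a_side == b_side)             # True iff S1 == S2 (collisions are very unlikely)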
Distributed PP Association Rule Mining
For distributed association rule mining, each site needs to distribute
its locally frequent itemsets to the rest of the sites
Instead of circulating the actual itemsets, encrypted versions are
circulated
Example:
S1 contains A, S2 contains B, S3 contains A. Each of them has its
own key: K1, K2, K3.
At the end of this phase, all sites will have items encrypted by all sites.
The encrypted items are then passed to a common site to eliminate
the duplicates and to start decryption. This way, the sites will not know
who has sent which item.
Decryption can now start, and after everybody has finished decrypting,
they will have the actual items.
Distributed PP Association Rule Mining
Now we need to see if the global support of an itemset is larger than the
threshold.
But we do not want to reveal the supports, since the support of an itemset
is assumed to be confidential.
A secure multi-party computation technique is utilized for this
Assume that there are three sites, each of them has {A,B,C}, and its freq
in S1 is 5 (out of 100 transactions), in S2 is 6 (out of 200), and in S3 is 20
(out of 300), and the minimum support is 5%.
S1 selects a random number, say 17
S1 adds the difference 5 − 5%×100 to 17 and sends the result (17) to S2
S2 adds 6 − 5%×200 to 17 and sends the result (13) to S3.
S3 adds 20 − 5%×300 to 13 and sends the result (18) back to S1
18 > the chosen random number (17), so {A,B,C} is globally frequent.
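The round above can be sketched as follows (illustrative Python; in the real protocol the additions are done modulo a large number, so that partial sums reveal nothing):

    # Sites S1..S3 hold (local count, local DB size); min support is 5%.
    min_sup = 0.05
    sites = [(5, 100), (6, 200), (20, 300)]

    secret = 17                         # S1's random number (would be drawn at random)
    running = secret
    for count, size in sites:           # the value travels 17 -> 17 -> 13 -> 18 around the ring
        running += count - min_sup * size

    # The total returns to S1, which alone knows the secret offset.
    print(running > secret)             # True => {A,B,C} is globally frequent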
Distributed PP Association Rule Mining
This technique assumes a semi-honest model
Where each party follows the rules of the protocol using its correct input,
but it is free to later use what it sees during execution of the protocol to
compromise security.
The cost of encryption is the key issue, since encryption is heavily used
in this method.