2002 Spring CS525 Lecture 4

Privacy-Preserving Databases and Data Mining
Yücel SAYGIN
ysaygin@sabanciuniv.edu
http://people.sabanciuniv.edu/~ysaygin/
Privacy and data mining

 There are two aspects of data mining when we look at it from a privacy perspective:
  Being able to mine the data without seeing the actual data
  Protecting the privacy of people against the misuse of data

How can we protect sensitive knowledge against data mining?

 Types of sensitive knowledge that could be extracted via data mining techniques:
  Patterns (association rules, sequences)
  Clusters that describe the data
  Classification models for prediction
Association Rule Hiding

 Large amounts of customer transaction data are collected by supermarket chains to find association rules in customer buying patterns
 Lots of research has been conducted on finding association rules efficiently, and tools have been developed
 Association rule mining is deterministic for given support and confidence thresholds
 Therefore association rules are a good starting point

Motivating examples
 Sniffing Prozac users
Association Rule Hiding

 Rules: "Body → Head"
  Ex1: "Diapers → Beer"
  Ex2: "Internetworking with TCP/IP" → "Interconnections: bridges, routers, …"
 Parameters: (support, confidence)
 Minimum support and confidence thresholds are used to prune the non-significant rules
Min. support 50%, Min. confidence 70%

Trans ID   Items Bought
1          A, B, D
2          B, E, F
3          A, C, D
4          A, B
5          A, B, D

Frequent Itemset   Support        Rule      Support
{A}                80%            A => B    60%
{B}                80%            A => D    60%
{D}                60%
{A,B}              60%
{A,D}              60%
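
As a sanity check, the support and confidence values in the table above can be computed directly from transaction counts. A minimal sketch in Python (the function names are mine, for illustration only):

# Minimal sketch: counting support and confidence over the example
# transactions above; names here are illustrative, not from the lecture.
transactions = [
    {"A", "B", "D"},
    {"B", "E", "F"},
    {"A", "C", "D"},
    {"A", "B"},
    {"A", "B", "D"},
]

def support(itemset):
    # Fraction of transactions containing every item of the itemset
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(body, head):
    # conf(body => head) = sup(body U head) / sup(body)
    return support(body | head) / support(body)

print(support({"A", "B"}))          # 0.6
print(confidence({"A"}, {"B"}))     # 0.75, so A => B passes the 70% threshold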
Algorithms for Rule Hiding

 What we try to achieve is:
  Let D be the source database
  Let R be the set of significant association rules that are mined from D with certain thresholds
  Let ri be a sensitive rule in R
  Transform D into D' so that all rules in R can still be mined from D' except ri
 It was proven that optimal hiding of association rules with minimal side effects is NP-hard
Heuristic Methods
 We developed heuristics to deal with the problem.
 Different techniques are implemented based on:
 Modifying the database by inserting false data or by removing some data.
 Inserting unknown values to fuzzify the rules
Basic Approach for Rule Hiding

 Reduce the support of confidential rules
 Reduce the confidence of rules
 This way, mining tools are prevented from discovering these rules
 The challenge is data quality
 Our metric for data quality is the number of rules that can still be mined and the number of rules that appear as a side effect
 We developed heuristic algorithms to minimize the newly appearing rules and to minimize the accidentally hidden rules
Basics of the heuristic algorithms

 If we want to remove an item from a transaction to reduce the support or the confidence:
  Which item should we start from?
  Which transaction should we choose to hide the selected item?
 We can either:
  Select an item and a transaction in round-robin fashion, i.e., select the next item from the next transaction that supports that item, and move to another item and another transaction
  Select the item that will probably have the minimal impact on the other rules
Basics of rule hiding

 conf(X => Y) = sup(XY) / sup(X)
 Decreasing the confidence of a rule can be done by:
  Increasing the support of X in transactions not supporting Y
  Decreasing the support of Y in transactions supporting both X and Y
 Decreasing the support of a rule can be done by:
  Decreasing the support of the corresponding large itemset XY

Min. support 20%, Min. confidence 80%

Trans ID   Items Bought         Rule      Confidence
1          A, B, C              AB => C   100%
2          A, B, C              BC => A   100%
3          A, C
4          A
5          B
Hiding AB => C by increasing the support of AB

Before                          After (B added to transaction 4)

Trans ID   Items Bought         Trans ID   Items Bought
1          A, B, C              1          A, B, C
2          A, B, C              2          A, B, C
3          A, C                 3          A, C
4          A                    4          A, B
5          B                    5          B

Rule      Confidence
AB => C   66%
BC => A   100%
Hiding AB => C by decreasing the support of ABC

Before                          After (B removed from 1, C removed from 2)

Trans ID   Items Bought         Trans ID   Items Bought
1          A, B, C              1          A, C
2          A, B, C              2          A, B
3          A, C                 3          A, C
4          A                    4          A
5          B                    5          B

Rule      Confidence
AB => C   0%
BC => A   0%
Hiding AB => C by decreasing the support of C

Before                          After (C removed from transaction 1)

Trans ID   Items Bought         Trans ID   Items Bought
1          A, B, C              1          A, B
2          A, B, C              2          A, B, C
3          A, C                 3          A, C
4          A                    4          A
5          B                    5          B

Rule      Confidence
AB => C   50%
BC => A   100%
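
The three examples above all reduce conf(AB => C) by editing individual transactions. A minimal sketch of the underlying greedy step, as an illustration rather than the lecture's exact algorithm (the helper names are my own):

# Illustrative greedy hiding step: remove one item of the sensitive rule's
# head from supporting transactions until confidence drops below the
# threshold. A sketch, not the exact heuristic from the lecture.
def hide_rule(transactions, body, head, min_conf):
    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def conf():
        return sup(body | head) / sup(body) if sup(body) > 0 else 0.0

    item = next(iter(head))                 # head item to remove
    while conf() >= min_conf:
        # Pick any transaction supporting the full rule and drop the item;
        # a real heuristic would pick the one with minimal side effects.
        victim = next(t for t in transactions if (body | head) <= t)
        victim.discard(item)
    return transactions

db = [{"A","B","C"}, {"A","B","C"}, {"A","C"}, {"A"}, {"B"}]
hide_rule(db, {"A","B"}, {"C"}, 0.80)
print(db)   # after one removal, conf(AB => C) = 50%, below the 80% threshold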
Rule Hiding by Fuzzification

 In some applications where publishing wrong data is not acceptable, unknown values may be inserted to blur the rules
 When unknown values are inserted, support and confidence fall into a range instead of a fixed value
 Similar heuristics for rule hiding can be employed to minimize the side effects

Before                    After (support and confidence become a range of values)

TID  A  B  C  D           TID  A  B  C  D
1    1  1  0  1           1    ?  1  0  1
2    0  1  0  0           2    0  1  0  0
3    1  0  1  1           3    1  0  1  ?
4    1  1  0  0           4    1  ?  0  0
5    1  1  0  1           5    1  ?  0  1
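
With unknowns, the support of an itemset is bounded below by treating every "?" as 0 and above by treating every "?" as 1. A small sketch of that interval (my own formulation, not code from the lecture):

# Sketch: support of an itemset becomes an interval once '?' values appear.
# Lower bound treats every '?' as 0, upper bound treats every '?' as 1.
rows = [                      # fuzzified table from the slide
    {"A": "?", "B": 1,   "C": 0, "D": 1},
    {"A": 0,   "B": 1,   "C": 0, "D": 0},
    {"A": 1,   "B": 0,   "C": 1, "D": "?"},
    {"A": 1,   "B": "?", "C": 0, "D": 0},
    {"A": 1,   "B": "?", "C": 0, "D": 1},
]

def support_range(itemset):
    lo = sum(1 for r in rows if all(r[i] == 1 for i in itemset))
    hi = sum(1 for r in rows if all(r[i] in (1, "?") for i in itemset))
    return lo / len(rows), hi / len(rows)

print(support_range({"A", "B"}))   # (0.0, 0.6): was exactly 60% before hiding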
Classification model as a threat to privacy

 Document classification for authorship identification
 Main idea: based on a database of documents and authors, assign the most probable author to a new document
 It is a possible threat to privacy when the text needs to stay anonymous
Classification model as a threat to privacy

 The fact that each author uses a characteristic frequency distribution over words and phrases helps us
 Feature representation used:
  T: total number of tokens
  V: total number of types
  C: total number of characters
 Classify the document by a learning algorithm and then try to perturb the classification
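
A minimal sketch of extracting the T, V, C features from a document (the whitespace tokenizer is an assumption on my part; the lecture does not pin one down):

# Sketch of the T/V/C stylometric features; whitespace tokenization is an
# illustrative assumption, not specified in the lecture.
def tvc_features(text: str):
    tokens = text.split()           # T: total number of tokens
    types = set(tokens)             # V: total number of distinct types
    chars = len(text)               # C: total number of characters
    return len(tokens), len(types), chars

print(tvc_features("the cat sat on the mat"))   # (6, 5, 22)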
Another Motivating Application

 Given a set of attribute values that are confidential and therefore downgraded, by inserting unknown values in place of the actual ones, before being released:
 Can someone build a classification model using the rest of the attributes to predict the hidden values?
Classification Models as a threat to privacy

 How do we prevent a row from being classified as class C by perturbing the data?
 The main challenge is that (unlike association rule mining) the resulting classification models depend on the technique, the selected training data, and the pruning methodology
 The purpose is to decrease the accuracy of the classification model
 Approach: inserting unknown values into the selected attribute values in the rest of the database
Mining the data without actually seeing it
 Things that we need to consider are:
 Data type
 Data mining technique
 Data distribution
 Centralized
 Distributed (vertically or horizontally)
Classification on perturbed data

 Reference: Rakesh Agrawal and Ramakrishnan Srikant. "Privacy-Preserving Data Mining". SIGMOD 2000, Dallas, TX.
 They developed a technique for constructing a classification model on perturbed data
 The data is assumed to be stored in a centralized database
 It is outsourced to a third party for mining, therefore the confidential values need to be handled
 The following slides are based on the slides by the authors of the paper above
Reconstruction Problem

 Original values x1, x2, ..., xn
  drawn from a probability distribution X (unknown)
 To hide these values, we use y1, y2, ..., yn
  drawn from a probability distribution Y
 Given
  x1 + y1, x2 + y2, ..., xn + yn
  the probability distribution of Y
 Estimate the probability distribution of X
Intuition (Reconstruct single point)

 Use Bayes' rule for density functions

[Figure: original distribution for Age on the range 10–90, and the probabilistic estimate of the original value of a randomized point V]
Reconstructing the Distribution

 Combine the estimates of where each point came from, over all the points:
 This gives an estimate of the original distribution

[Figure: the reconstructed estimate of the original Age distribution on the range 10–90]
Reconstruction: Bootstrapping

fX^0 := uniform distribution
j := 0   // iteration number
repeat
    fX^(j+1)(a) := (1/n) Σ_{i=1..n} [ fY((xi + yi) − a) · fX^j(a) ] / [ ∫ fY((xi + yi) − z) · fX^j(z) dz ]   (Bayes' rule)
    j := j + 1
until (stopping criterion met)
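
A minimal runnable sketch of this loop, discretized over bins (function names, bin choices, and example parameters are mine, for illustration; this is not the authors' implementation):

# Sketch of the reconstruction loop above, discretized over bins.
import numpy as np

def reconstruct(w, noise_pdf, bins, iters=50):
    """Estimate the density of X on `bins` from observations w_i = x_i + y_i."""
    fx = np.full(len(bins), 1.0 / len(bins))       # fX^0 := uniform
    for _ in range(iters):
        new_fx = np.zeros_like(fx)
        for wi in w:                                # Bayes' rule per point:
            post = noise_pdf(wi - bins) * fx        # fY(w_i - a) * fX^j(a)
            new_fx += post / max(post.sum(), 1e-12) # normalized (the integral)
        fx = new_fx / len(w)                        # average over the n points
    return fx

# Example: ages perturbed by uniform noise on [-30, +30]
rng = np.random.default_rng(0)
ages = rng.normal(40, 10, size=5000)
w = ages + rng.uniform(-30, 30, size=5000)
bins = np.linspace(0, 90, 91)
uniform_pdf = lambda d: ((d >= -30) & (d <= 30)) / 60.0
fx = reconstruct(w, uniform_pdf, bins)   # approximates the original age density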
Shown to work in experiments on large data sets

[Figure: number of people (y-axis, 0–1200) vs. Age (x-axis, 20–60) for the Original, Randomized, and Reconstructed distributions]
Algorithms

 "Global" Algorithm
  Reconstruct each attribute once at the beginning
 "By Class" Algorithm
  For each attribute, first split by class, then reconstruct separately for each class
 See the SIGMOD 2000 paper for details
Experimental Methodology

 Compare accuracy against:
  Original: unperturbed data without randomization
  Randomized: perturbed data but without making any corrections for randomization
 Test data not randomized
 Synthetic data benchmark
 Training set of 100,000 records, split equally between the two classes
Quantifying Privacy

 Add a random value between -30 and +30 to age
 If the randomized value is 60
  we know with 90% confidence that the age is between 33 and 87
 Interval width → amount of privacy
 Example: (interval width: 54) / (range of age: 100) → 54% randomization level at 90% confidence
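
A quick check of that arithmetic: with noise uniform on [-30, +30], the central 90% confidence interval around a randomized value has width 0.9 × 60 = 54. A sketch (the helper is my own):

# Sanity check of the privacy-interval arithmetic for uniform noise.
def privacy_interval(randomized, spread=30, confidence=0.90):
    # True value lies in [randomized - spread, randomized + spread];
    # the central `confidence` fraction of that interval has width
    # confidence * 2 * spread.
    width = confidence * 2 * spread
    return randomized - width / 2, randomized + width / 2

print(privacy_interval(60))   # (33.0, 87.0): width 54, i.e. 54% of the age range 100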
Privacy Preserving Distributed Data Mining

 Consider the case where data is distributed horizontally or vertically across multiple sites
 Each site is autonomous and does not want to share its actual data
 Let's consider the following scenario:
  There are multiple hospitals that have their own local databases,
  and they would like to participate in a scientific study that will analyze the results of treatments for different patients
  The privacy concern here is that a hospital would not like to share the knowledge unless the other site also has it, to protect the privacy of itself and its operation
 Another scenario:
  Two bookstores would like to learn which books are sold together so that they can make some offers to their customers (Amazon actually does that)
Case study: Association rules

 How do we mine association rules from distributed sources while preserving the privacy of the data owners?
 The confidential information in this case is:
  The data itself
  The fact that a local site supports a rule with a certain confidence and a certain support (no company wants to lose its competitive advantage, and would not like to reveal anything if it will not benefit from the release of the data)
 Privacy preserving distributed association rule mining methods build on distributed rule mining techniques
Distributed rule mining

 We know how rules are mined from centralized databases
 The distributed scenario is similar
 Consider that we have only two sites, S1 and S2, which have databases D1 (with 3 transactions) and D2 (with 5 transactions)

D1                              D2

Trans ID   Items Bought         Trans ID   Items Bought
1          A, B, C              1          A, B, C
2          A, B, C              2          A, B, C
3          A, C                 3          A, C
                                4          A
                                5          B
Distributed rule mining

 We would like to mine the databases as if they were parts of a single centralized database of 8 transactions
 In order to do this, we need to calculate the local supports
 For example, the local support of A in D1 is 100%
 The local support of the itemset {A,B,C} in D1 is 66%, and the local support of {A,B,C} in D2 is 40%

D1                              D2

Trans ID   Items Bought         Trans ID   Items Bought
1          A, B, C              1          A, B, C
2          A, B, C              2          A, B, C
3          A, C                 3          A, C
                                4          A
                                5          B
Distributed rule mining

 Assume that the minimum support threshold is 50%; then {A,B,C} is frequent in D1, but it is not frequent in D2
 However, when the databases are combined, the support of {A,B,C} in D1 U D2 is 50%
 This means that an itemset could be locally frequent in one database but not frequent in another, and still be frequent globally
 In order for an itemset to be frequent globally, it should be frequent in at least one database

D1                              D2

Trans ID   Items Bought         Trans ID   Items Bought
1          A, B, C              4          A, B, C
2          A, B, C              5          A, B, C
3          A, C                 6          A, C
                                7          A
                                8          B
Distributed rule mining

 The algorithm is based on Apriori, which prunes the rules by looking at the support
 Apriori also uses the fact that an itemset is frequent only if all its subsets are frequent
 Therefore only frequent itemsets should be used to generate larger candidate itemsets

D1                              D2

Trans ID   Items Bought         Trans ID   Items Bought
1          A, B, C              4          A, B, C
2          A, B, C              5          A, B, C
3          A, C                 6          A, C
                                7          A
                                8          B
Distributed rule mining

1) The local sites find their frequent itemsets
2) They broadcast the frequent itemsets to each other
3) Individual sites count the frequencies of the itemsets in their local databases
4) They broadcast the results to every site
5) Every site can now find the globally frequent itemsets

D1                              D2

Trans ID   Items Bought         Trans ID   Items Bought
1          A, B, C              4          A, B, C
2          A, B, C              5          A, B, C
3          A, C                 6          A, C
                                7          A
                                8          B
Distributed rule mining

Ex: 50% min. support threshold

 We start from the singletons and calculate the frequencies of the items
 In D1, A (freq 3), B (freq 2), C (freq 3) are frequent; in D2, A (freq 4), B (freq 3), C (freq 3) are frequent
 The sites broadcast the results to each other, and each site updates the counts of A, B, C by adding the local counts

D1                              D2

Trans ID   Items Bought         Trans ID   Items Bought
1          A, B, C              4          A, B, C
2          A, B, C              5          A, B, C
3          A, C                 6          A, C
                                7          A
                                8          B
Distributed rule mining

Ex: 50% min. support threshold

 Each site eliminates the items that are not globally frequent; in this case all of A, B, C are globally frequent
 Now, using the frequent items, each site generates the candidates of size 2, which are {A,B}, {A,C}, {B,C}
 The same steps are then applied (a code sketch of the whole exchange follows the table below)

D1                              D2

Trans ID   Items Bought         Trans ID   Items Bought
1          A, B, C              4          A, B, C
2          A, B, C              5          A, B, C
3          A, C                 6          A, C
                                7          A
                                8          B
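
A minimal sketch of this count-exchange round for the two example sites, with both sites simulated in one process (in reality each site counts locally and broadcasts only its counts; all names are illustrative):

# Sketch of one round of distributed frequent-itemset counting for the
# two example sites above; both sites are simulated in one process.
from itertools import combinations

d1 = [{"A","B","C"}, {"A","B","C"}, {"A","C"}]
d2 = [{"A","B","C"}, {"A","B","C"}, {"A","C"}, {"A"}, {"B"}]
sites = [d1, d2]
min_sup = 0.5

def local_counts(db, candidates):
    return {c: sum(1 for t in db if c <= t) for c in candidates}

# Round 1: singletons; each site broadcasts its local counts
singles = [frozenset(x) for x in "ABC"]
counts = [local_counts(db, singles) for db in sites]
total = sum(len(db) for db in sites)               # 8 transactions globally
frequent = [c for c in singles
            if sum(cnt[c] for cnt in counts) / total >= min_sup]
print(sorted("".join(f) for f in frequent))        # ['A', 'B', 'C']

# Round 2: Apriori candidate generation from globally frequent singletons
pairs = [a | b for a, b in combinations(frequent, 2)]
counts = [local_counts(db, pairs) for db in sites]
for c in pairs:
    print(sorted(c), sum(cnt[c] for cnt in counts) / total)
    # {A,B}: 0.5, {A,C}: 0.75, {B,C}: 0.5 -- all globally frequent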
Now we would like to do the same thing but preserve the privacy of the individual sites

 The basic notions we need for that are:
  Commutative encryption
  Secure multi-party computation
 An encryption is commutative if the following holds for any feasible encryption keys K1, K2, ..., Kn, any message M, and any two orderings i1..in and j1..jn of the keys:
  EKi1(... EKin(M)) = EKj1(... EKjn(M))
 For different messages M1 and M2, the probability of a collision is very low
 RSA is a famous commutative encryption technique
A simple application of commutative encryption

 Assume that person A has salary S1, and person B has salary S2
 How can they learn whether their salaries are equal, without revealing the salaries themselves?
 Assume that A and B have their own encryption keys, say K1 and K2, and we go from there! (A sketch of the protocol follows below.)
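
A sketch of that equality test using modular exponentiation (Pohlig–Hellman style) as the commutative cipher; the modulus and keys are toy values of my own choosing, not from the lecture, and the sizes are not secure:

# Sketch: salary-equality test with a commutative cipher. Exponentiation
# modulo a prime commutes: (m^k1)^k2 = (m^k2)^k1 (mod p). Toy parameters
# for illustration only; not from the lecture.
p = 2**127 - 1                      # a Mersenne prime as the modulus
k1, k2 = 65537, 92821               # each party's private exponent
                                    # (should be coprime to p-1 to avoid collisions)

def enc(m, k):
    return pow(m, k, p)             # commutative encryption step

s1, s2 = 85000, 85000               # the two salaries

# A sends enc(s1, k1) to B; B sends enc(s2, k2) to A.
# Each party then encrypts what it received with its own key:
a_side = enc(enc(s2, k2), k1)       # computed by A
b_side = enc(enc(s1, k1), k2)       # computed by B

# By commutativity, the doubly encrypted values match iff s1 == s2,
# so equality is learned without either salary being revealed.
print(a_side == b_side)             # True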
Distributed PP Association Rule Mining

 For distributed association rule mining, each site needs to distribute its locally frequent itemsets to the rest of the sites
 Instead of circulating the actual itemsets, the encrypted versions are circulated
 Example:
  S1 contains A, S2 contains B, S3 contains A. Each of them has its own key: K1, K2, K3
  At the end of step 1, all sites will have items encrypted by all sites
 The encrypted items are then passed to a common site to eliminate the duplicates and to start decryption. This way they will not know who has sent which item
 Decryption can now start, and after everybody has finished decrypting, they will have the actual items
Distributed PP Association Rule Mining

 Now we need to see whether the global support of an itemset is larger than the threshold
 But we do not want to reveal the supports, since the support of an itemset is assumed to be confidential
 A secure multi-party computation technique is utilized for this
 Assume that there are three sites, each of them has {A,B,C}, and its frequency in S1 is 5 (out of 100 transactions), in S2 is 6 (out of 200), and in S3 is 20 (out of 300); the minimum support is 5%
  S1 selects a random number, say 17
  S1 adds the difference 5 - 5%x100 to 17 and sends the result (17) to S2
  S2 adds 6 - 5%x200 to 17 and sends the result (13) to S3
  S3 adds 20 - 5%x300 to 13 and sends the result (18) back to S1
  18 > the chosen random number (17), so {A,B,C} is globally frequent
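
A sketch of that secure comparison round, simulating the three sites in one process (the random offset hides each site's individual excess support; variable names are my own):

# Sketch of the secure-sum round above: each site adds its "excess
# support" (local count minus its share of the threshold) to a running
# total masked by S1's random offset. Simulated in one process here.
import random

sites = [(5, 100), (6, 200), (20, 300)]   # (local count, local DB size)
min_sup = 0.05

r = random.randint(0, 1000)               # S1's secret offset (17 in the slide)
running = r
for count, size in sites:
    running += count - min_sup * size     # each site adds its excess, forwards

# Only S1 knows r, so only S1 learns whether the global excess is >= 0;
# no site learns another site's individual support.
print(running >= r)                       # True: {A,B,C} is globally frequent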
Distributed PP Association Rule Mining

 This technique assumes a semi-honest model,
  where each party follows the rules of the protocol using its correct input, but is free to later use what it sees during execution of the protocol to compromise security
 The cost of encryption is the key issue, since encryption is used heavily in this method
