Privacy-Preserving
Databases and Data
Mining
Yücel SAYGIN
ysaygin@sabanciuniv.edu
http://people.sabanciuniv.edu/~ysaygin/
Privacy and data mining
There are two aspects of data mining when we look at it from a
privacy perspective:
Being able to mine the data without seeing the actual data
Protecting the privacy of people against the misuse of their data
How can we protect the sensitive knowledge
against data mining?
Types of sensitive knowledge that can be extracted via data
mining techniques are:
Patterns (Association rules, sequences)
Clusters that describe the data
Classification models for prediction
Association Rule Hiding
Large amounts of customer transaction data are collected by
supermarket chains to find association rules in customer buying
patterns
A lot of research has been conducted on finding association rules
efficiently, and tools have been developed.
Association rule mining is deterministic for given support and
confidence thresholds
Therefore association rules are a good starting point.
Motivating examples
Sniffing Prozac users
Association Rule Hiding
Rules: “Body → Head”
Ex1: “Diapers → Beer”
Ex2: “Internetworking with TCP/IP” → “Interconnections:
Bridges, Routers, …”
Parameters: (support, confidence)
Minimum support and confidence thresholds are
used to prune the non-significant rules
Min. support 50%, min. confidence 70%

Trans ID   Items Bought
1          A, B, D
2          B, E, F
3          A, C, D
4          A, B
5          A, B, D

Frequent Itemset   Support        Rule      Support
{A}                80%            A => B    60%
{B}                80%            A => D    60%
{D}                60%
{A,B}              60%
{A,D}              60%
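To make the numbers above concrete, here is a minimal sketch (in Python, not part of the original slides) that recomputes the supports and confidences; the transactions and thresholds are taken directly from the tables above.

    transactions = [
        {"A", "B", "D"},   # 1
        {"B", "E", "F"},   # 2
        {"A", "C", "D"},   # 3
        {"A", "B"},        # 4
        {"A", "B", "D"},   # 5
    ]

    def support(itemset):
        # Fraction of transactions that contain every item of `itemset`.
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(body, head):
        # conf(body => head) = sup(body U head) / sup(body)
        return support(body | head) / support(body)

    print(support({"A", "B"}))        # 0.6  -> rule A => B has support 60%
    print(confidence({"A"}, {"B"}))   # 0.75 -> passes the 70% confidence threshold
    print(confidence({"A"}, {"D"}))   # 0.75 -> so A => D is also significant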
Algorithms for Rule Hiding
What we try to achieve is:
Let D be the source database
Let R be the set of significant association rules that are mined from D
with certain thresholds
Let ri be a sensitive rule in R
Transform D into D’ so that all rules in R can still be mined from D’,
except ri
It has been proven that optimally hiding association rules with minimal
side effects is NP-hard
Heuristic Methods
We developed heuristics to deal with the problem.
Different techniques are implemented based on:
Modifying the database by inserting false data or by removing some data.
Inserting unknown values to fuzzify the rules
Basic Approach for Rule Hiding
Reduce the support of confidential rules
Reduce the confidence of rules
This way, mining tools are prevented from discovering these rules
The challenge is maintaining data quality
Our metric for data quality is the number of rules that can still be
mined and the number of rules that appear as a side effect
We developed heuristic algorithms to minimize the newly appearing
rules, and to minimize the accidentally hidden rules.
Basics of the heuristic algorithms
If we want to remove an item from a transaction to reduce the support
or the confidence:
Which item should we start from?
Which transaction should we choose to hide the selected item in?
We can either
Select an item and a transaction in round-robin fashion, i.e., select the next
item from the next transaction that supports that item, then move to another
item and another transaction, or
Select the item that will probably have the minimal impact on the other rules
Basics of rule hiding
conf(X=>Y) = sup(XY)/sup(X)
Decreasing the confidence of a rule can be done by:
Increasing the support of X in transactions not supporting Y
Decreasing the support of Y in transactions supporting both X and Y
Decreasing the support of a rule can be done by
Decreasing the support of the corresponding large itemset XY
Min. support 20%, min. confidence 80%

Trans ID   Items Bought
1          A, B, C
2          A, B, C
3          A, C
4          A
5          B

Rule       Confidence
AB => C    100%
BC => A    100%
Hiding AB->C by increasing support of AB
Before:                          After:
Trans ID   Items Bought          Trans ID   Items Bought
1          A, B, C               1          A, B, C
2          A, B, C               2          A, B, C
3          A, C                  3          A, C
4          A                     4          A, B
5          B                     5          B

Rule       Confidence
AB => C    66%
BC => A    100%
Hiding AB->C by decreasing support of ABC
Before:                          After:
Trans ID   Items Bought          Trans ID   Items Bought
1          A, B, C               1          A, C
2          A, B, C               2          A, B
3          A, C                  3          A, C
4          A                     4          A
5          B                     5          B

Rule       Confidence
AB => C    0%
BC => A    0%
Hiding AB->C by decreasing the support of C
Before:                          After:
Trans ID   Items Bought          Trans ID   Items Bought
1          A, B, C               1          A, B
2          A, B, C               2          A, B, C
3          A, C                  3          A, C
4          A                     4          A
5          B                     5          B

Rule       Confidence
AB => C    50%
BC => A    100%
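The last example above can be reproduced with a small sketch. The following Python fragment (an illustrative assumption, not the authors' actual algorithm) hides AB => C by removing the consequent C from transactions supporting {A,B,C}, using a naive first-match choice instead of the impact-minimizing heuristics described earlier.

    def support_count(db, itemset):
        return sum(itemset <= t for t in db)

    def hide_rule(db, body, head, min_conf):
        # Remove items of `head` from transactions supporting body+head
        # until conf(body => head) falls below min_conf.
        full = body | head
        while True:
            sup_body = support_count(db, body)
            if sup_body == 0 or support_count(db, full) / sup_body < min_conf:
                return db
            # Naive choice: first supporting transaction (the heuristics in the
            # text would instead pick the removal with minimal side effects).
            victim = next(t for t in db if full <= t)
            victim -= head

    db = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "C"}, {"A"}, {"B"}]
    hide_rule(db, {"A", "B"}, {"C"}, min_conf=0.8)
    print(db)   # C removed from the first transaction: conf(AB => C) drops to 50%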
Rule Hiding by Fuzzification
In some applications where publishing wrong data is not acceptable,
unknown values may be inserted to blur the rules.
When unknown values are inserted, support and confidence values
fall into a range instead of being a fixed value.
Similar heuristics for rule hiding can be employed to minimize the side
effects
Before:                     After (with unknowns):
TID  A  B  C  D             TID  A  B  C  D
1    1  1  0  1             1    ?  1  0  1
2    0  1  0  0             2    0  1  0  0
3    1  0  1  1             3    1  0  1  ?
4    1  1  0  0             4    1  ?  0  0
5    1  1  0  1             5    1  ?  0  1

Support and confidence become a range of values.
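A small sketch of how the range arises (illustrative Python, assuming each '?' may stand for either 0 or 1): the minimum support treats every '?' as 0, the maximum treats every '?' as 1.

    # The fuzzified table above, with None standing for '?'.
    rows = [
        {"A": None, "B": 1,    "C": 0, "D": 1},
        {"A": 0,    "B": 1,    "C": 0, "D": 0},
        {"A": 1,    "B": 0,    "C": 1, "D": None},
        {"A": 1,    "B": None, "C": 0, "D": 0},
        {"A": 1,    "B": None, "C": 0, "D": 1},
    ]

    def support_range(itemset):
        # Minimum support: every '?' counts as 0; maximum: every '?' counts as 1.
        lo = sum(all(r[i] == 1 for i in itemset) for r in rows)
        hi = sum(all(r[i] in (1, None) for i in itemset) for r in rows)
        return lo / len(rows), hi / len(rows)

    print(support_range({"A", "B"}))    # (0.0, 0.6): the support is now a range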
Classification model as a threat to privacy
Document classification for authorship identification
Main idea: based on a database of documents and authors, assign the
most probable author to a new document
This is a possible threat to privacy when a text needs to stay anonymous
Classification model as a threat to privacy
The fact that each author uses a characteristic frequency distribution
over words and phrases helps us
Feature representation used:
T: total number of tokens
V: total number of types
C: total number of characters
Classify the document by a learning algorithm and then try to perturb
the classification
Another Motivating Application
Given a set of confidential attribute values that are downgraded by
inserting unknown values in place of the actual ones before release:
Can someone build a classification model using the rest of the
attributes to predict the hidden values?
Classification Models as a threat to
privacy
How do we prevent a row from being classified as class C by perturbing
the data?
Main challenge is that (unlike association rule mining) resulting
classification models depend on the technique, selected training data,
and pruning methodology.
The purpose is to decrease the accuracy of the classification model.
Approach: inserting unknown values into selected attribute values in
the rest of the database.
Mining the data without actually seeing it
Things that we need to consider are:
Data type
Data mining technique
Data distribution
Centralized
Distributed (vertically or horizontally)
Classification on perturbed data
Reference: Rakesh Agrawal and Ramakrishnan Srikant. “Privacy-
Preserving Data Mining”. SIGMOD, 2000, Dallas, TX.
They developed a technique for constructing a classification model
on perturbed data.
The data is assumed to be stored in a centralized database
and is outsourced to a third party for mining; therefore the
confidential values need to be handled
The following slides are based on the slides by the authors of the
paper above
Reconstruction Problem
Original values x1, x2, ..., xn
from probability distribution X (unknown)
To hide these values, we use y1, y2, ..., yn
from probability distribution Y
Given:
x1+y1, x2+y2, ..., xn+yn
and the probability distribution of Y,
estimate the probability distribution of X.
Intuition (Reconstruct single point)
Use Bayes' rule for density functions
[Figure: the original distribution for Age over the range 10–90,
with the probabilistic estimate of the original value of V]
Reconstructing the Distribution
Combine the estimates of where each point came from, over all the
points: this gives an estimate of the original distribution.
[Figure: estimate of the original Age distribution over the range 10–90]
Reconstruction: Bootstrapping
f_X^0 := uniform distribution
j := 0   // iteration number
repeat
    f_X^{j+1}(a) := (1/n) Σ_{i=1..n} [ f_Y((x_i + y_i) − a) · f_X^j(a) ] / [ ∫ f_Y((x_i + y_i) − z) · f_X^j(z) dz ]   (Bayes' rule)
    j := j + 1
until (stopping criterion met)
Shown to work in experiments on large
data sets.
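A minimal, discretized sketch of the loop above (illustrative Python; the bimodal test data, Gaussian noise, and fixed iteration count are assumptions for demonstration, not the paper's experimental setup):

    import numpy as np

    rng = np.random.default_rng(0)
    n, sigma = 10_000, 10.0
    x = rng.choice([25.0, 55.0], size=n)             # hidden "true" ages (assumed bimodal)
    w = x + rng.normal(0.0, sigma, size=n)           # released values x_i + y_i

    grid = np.linspace(0.0, 100.0, 101)              # candidate values a
    f_Y = lambda d: np.exp(-d**2 / (2 * sigma**2))   # known noise density (unnormalized)

    f_X = np.full_like(grid, 1.0 / len(grid))        # f_X^0 := uniform distribution
    for _ in range(20):                              # fixed iteration count as stopping criterion
        # posterior(i, a) is proportional to f_Y(w_i - a) * f_X(a), normalized over a
        lik = f_Y(w[:, None] - grid[None, :]) * f_X[None, :]
        post = lik / lik.sum(axis=1, keepdims=True)
        f_X = post.mean(axis=0)                      # average the per-point estimates

    print(grid[f_X.argmax()])                        # peaks recover the modes near 25 and 55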
[Figure: histogram of Number of People vs. Age (ticks at 20 and 60),
comparing the Original, Randomized, and Reconstructed distributions]
Algorithms
“Global” Algorithm
Reconstruct for each attribute once at the beginning
“By Class” Algorithm
For each attribute, first split by class, then reconstruct separately
for each class.
See SIGMOD 2000 paper for details.
Experimental Methodology
Compare accuracy against
– Original: unperturbed data without randomization.
– Randomized: perturbed data but without making any corrections for
randomization.
Test data not randomized.
Synthetic data benchmark.
Training set of 100,000 records, split equally between the two
classes.
Quantifying Privacy
Add a random value between -30 and +30 to age.
If the randomized value is 60, we know with 90% confidence that the
true age is between 33 and 87.
The interval width defines the amount of privacy.
Example: (interval width: 54) / (range of Age: 100) → 54%
randomization level @ 90% confidence
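As a quick sketch (assuming uniform noise on [-alpha, +alpha], as in the example above):

    # For uniform noise in [-alpha, +alpha], the interval that contains the
    # true value with the given confidence has width 2 * alpha * confidence.
    def randomization_level(alpha, attr_range, confidence):
        interval_width = 2 * alpha * confidence   # e.g. 0.9 * 60 = 54 for alpha = 30
        return interval_width / attr_range

    print(randomization_level(alpha=30, attr_range=100, confidence=0.9))  # 0.54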
Privacy Preserving Distributed Data Mining
Consider the case where data is distributed horizontally or vertically
across multiple sites.
Each site is autonomous and does not want to share its actual data
Let's consider the following scenario:
There are multiple hospitals that have their own local databases,
and they would like to participate in a scientific study that will analyze the
results of treatments for different patients
The privacy concern here is that a hospital would not like to share its
knowledge unless the other sites also have it, to protect the privacy of itself
and its operation
Another scenario:
Two bookstores would like to learn which books are sold together so that
they can make offers to their customers (Amazon actually does that)
Case study: Association rules
How do we mine association rules from distributed sources while
preserving the privacy of the data owners?
The confidential information in this case is:
The data itself
The fact that a local site supports a rule with a certain confidence and a
certain support (no company wants to lose competitive advantage, and
would not like to reveal anything unless it benefits from the release of the
data)
Privacy preserving distributed association rule mining methods use
distributed rule mining techniques
Distributed rule mining
We know how rules are mined from centralized databases
The distributed scenario is similar
Consider that we have only two sites S1 and S2, which have
databases D1 (with 3 transactions) and D2 (with 5 transactions)
D1:                            D2:
Trans ID   Items Bought        Trans ID   Items Bought
1          A, B, C             1          A, B, C
2          A, B, C             2          A, B, C
3          A, C                3          A, C
                               4          A
                               5          B
Distributed rule mining
We would like to mine the databases as if they were parts of a single
centralized database of 8 transactions
In order to do this, we need to calculate the local supports
For example the local support of A in D1 is 100%
The local support of the itemset {A,B,C} in D1 is 66%, and the local
support of {A,B,C} in D2 is 40%.
D1:                            D2:
Trans ID   Items Bought        Trans ID   Items Bought
1          A, B, C             1          A, B, C
2          A, B, C             2          A, B, C
3          A, C                3          A, C
                               4          A
                               5          B
Distributed rule mining
Assume that the minimum support threshold is 50%; then {A,B,C} is
frequent in D1, but it is not frequent in D2.
However, when the databases are combined, the support of {A,B,C} in
D1 ∪ D2 is 50% (4 out of 8 transactions),
which means that an itemset can be locally frequent in one database,
but not frequent in another database, and still be frequent globally
In order for an itemset to be frequent globally, it must be frequent in at
least one database: if it were below the threshold at every site, its global
support (a weighted average of the local supports) would also be below
the threshold
D1:                            D2:
Trans ID   Items Bought        Trans ID   Items Bought
1          A, B, C             4          A, B, C
2          A, B, C             5          A, B, C
3          A, C                6          A, C
                               7          A
                               8          B
Distributed rule mining
The algorithm is based on Apriori, which prunes candidate itemsets by
looking at their support
Apriori also uses the fact that an itemset can be frequent only if all its
subsets are frequent
Therefore only frequent itemsets need to be used to generate larger
candidate itemsets
D1:                            D2:
Trans ID   Items Bought        Trans ID   Items Bought
1          A, B, C             4          A, B, C
2          A, B, C             5          A, B, C
3          A, C                6          A, C
                               7          A
                               8          B
Distributed rule mining
1) The local sites find their locally frequent itemsets.
2) They broadcast these frequent itemsets to each other.
3) Individual sites count the frequencies of the received itemsets in their
local databases.
4) They broadcast the resulting counts to every site.
5) Every site can now find the globally frequent itemsets (a sketch follows
the example below).
D1:                            D2:
Trans ID   Items Bought        Trans ID   Items Bought
1          A, B, C             4          A, B, C
2          A, B, C             5          A, B, C
3          A, C                6          A, C
                               7          A
                               8          B
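A minimal sketch of steps 1-5 on the running example (illustrative Python with no privacy protection yet; the candidate generation is simplified relative to full Apriori):

    from itertools import combinations

    # Running example: D1 has 3 transactions, D2 has 5 (8 in total).
    D1 = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "C"}]
    D2 = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "C"}, {"A"}, {"B"}]
    sites, min_sup = [D1, D2], 0.5

    def local_count(db, itemset):
        return sum(itemset <= t for t in db)

    # Step 1: each site finds its locally frequent itemsets
    # (for brevity, all candidates of size 1 and 2 are tested directly).
    candidates = [frozenset(c) for k in (1, 2)
                  for c in combinations("ABC", k)]
    locally_frequent = [c for c in candidates
                        if any(local_count(db, c) >= min_sup * len(db) for db in sites)]

    # Steps 2-5: broadcast the itemsets, sum the local counts,
    # and let every site test global frequency.
    total = sum(len(db) for db in sites)
    for c in locally_frequent:
        count = sum(local_count(db, c) for db in sites)
        print(set(c), f"{count}/{total}", count >= min_sup * total)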
Distributed rule mining
Ex: 50% min. supp. threshold
● We start from singletons and calculate the frequencies of the items
● In D1, A (freq 3), B (freq 2), and C (freq 3) are frequent; in D2, A (freq 4),
B (freq 3), and C (freq 3) are frequent
● The sites broadcast the results to each other, and each site updates the
counts of A, B, C by adding up the local counts
D1:                            D2:
Trans ID   Items Bought        Trans ID   Items Bought
1          A, B, C             4          A, B, C
2          A, B, C             5          A, B, C
3          A, C                6          A, C
                               7          A
                               8          B
Distributed rule mining
Ex: 50% min. supp. threshold
● Each site eliminates the items that are not globally frequent. In this
case all of A, B, C are globally frequent
● Now, using the frequent items, each site generates the candidates of size 2,
which are {A,B}, {A,C}, {B,C}
● The same steps are then applied to these candidates
D1:                            D2:
Trans ID   Items Bought        Trans ID   Items Bought
1          A, B, C             4          A, B, C
2          A, B, C             5          A, B, C
3          A, C                6          A, C
                               7          A
                               8          B
Now we would like to do the same thing but
preserve the privacy of the individual sites
The basic notions we need for that are
Commutative encryption
And Secure multi-party computation
An encryption scheme is commutative if the following holds for any
feasible encryption keys K1, K2, ..., Kn, any message M, and any two
permutations (i1, ..., in) and (j1, ..., jn) of 1..n:
E_{Ki1}( ... E_{Kin}(M) ... ) = E_{Kj1}( ... E_{Kjn}(M) ... )
For different messages M1 and M2, the probability of a collision is very low
RSA is a famous commutative encryption technique
A simple application of commutative encryption
Assume that person A has salary S1, and person B has salary S2.
How can they learn whether their salaries are equal to each other,
without revealing the salaries themselves?
Assume that A and B have their own encryption keys, say K1 and
K2. And we go from there!
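One concrete way to instantiate this (an illustrative assumption; the slides mention RSA, but any commutative scheme behaves the same way) is the exponentiation cipher E_K(M) = M^K mod p:

    # Commutative cipher: E_K(M) = M^K mod p, so E_K1(E_K2(M)) == E_K2(E_K1(M)).
    p = 2**127 - 1                      # a toy public prime modulus (assumption)

    def E(key, m):
        return pow(m, key, p)

    K1, K2 = 65537, 90001               # private keys of A and B (arbitrary examples)
    S1, S2 = 120_000, 120_000           # the salaries; never exchanged in the clear

    # A sends E_K1(S1) to B; B sends E_K2(S2) to A.
    # Each encrypts what it received with its own key, then the results are compared.
    a_side = E(K1, E(K2, S2))           # computed by A
    b_side = E(K2, E(K1, S1))           # computed by B
    print(a_side == b_side)             # True iff S1 == S2 (collisions are very unlikely)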
Distributed PP Association Rule Mining
For distributed association rule mining, each site needs to distribute
its locally frequent itemsets to the rest of the sites
Instead of circulating the actual itemsets, encrypted versions are
circulated
Example:
S1 contains A, S2 contains B, S3 contains A. Each of them has its
own key: K1, K2, K3.
At the end of this phase, all sites will have items encrypted by all sites.
The encrypted items are then passed to a common site to eliminate
the duplicates and to start decryption. This way, the sites will not know
who has sent which item.
Decryption can now start, and after everybody has finished decrypting,
they will have the actual items.
Distributed PP Association Rule Mining
Now we need to see if the global support of an itemset is larger than the
threshold.
But we do not want to reveal the supports, since the support of an itemset
is assumed to be confidential.
A secure multi-party computation technique is utilized for this
Assume that there are three sites, each of them has {A,B,C}, and its freq
in S1 is 5 (out of 100 transactions), in S2 is 6 (out of 200), and in S3 is 20
(out of 300), and the minimum support is 5%.
S1 selects a random number, say 17
S1 adds the difference 5 − 5%×100 to 17 and sends the result (17) to S2
S2 adds 6 − 5%×200 to 17 and sends the result (13) to S3.
S3 adds 20 − 5%×300 to 13 and sends the result (18) back to S1
18 > the chosen random number (17), so {A,B,C} is globally frequent.
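The round above can be sketched as follows (illustrative Python; in the real protocol the additions are done modulo a large number, so that partial sums reveal nothing):

    # Sites S1..S3 hold (local count, local DB size); min support is 5%.
    min_sup = 0.05
    sites = [(5, 100), (6, 200), (20, 300)]

    secret = 17                         # S1's random number (would be drawn at random)
    running = secret
    for count, size in sites:           # the value travels 17 -> 17 -> 13 -> 18 around the ring
        running += count - min_sup * size

    # The total returns to S1, which alone knows the secret offset.
    print(running > secret)             # True => {A,B,C} is globally frequent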
Distributed PP Association Rule Mining
This technique assumes a semi-honest model
Where each party follows the rules of the protocol using its correct input,
but it is free to later use what it sees during execution of the protocol to
compromise security.
The cost of encryption is the key issue, since encryption is heavily used
in this method.