[go: up one dir, main page]

0% found this document useful (0 votes)
53 views39 pages

SMCQL: Privacy-Preserving Querying For Federated Databases

SMCQL is a system for privacy-preserving querying of distributed databases. It uses secure multiparty computation (SMC) techniques like garbled circuits to execute SQL queries over private data from multiple untrusted data owners, without revealing any raw data. SMCQL aims to provide privacy, efficient query execution, and usability. It uses attribute-level security policies and query optimization techniques like sliced evaluation to minimize the use of expensive secure computation and improve performance.

Uploaded by

chaoslawful
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
53 views39 pages

SMCQL: Privacy-Preserving Querying For Federated Databases

SMCQL is a system for privacy-preserving querying of distributed databases. It uses secure multiparty computation (SMC) techniques like garbled circuits to execute SQL queries over private data from multiple untrusted data owners, without revealing any raw data. SMCQL aims to provide privacy, efficient query execution, and usability. It uses attribute-level security policies and query optimization techniques like sliced evaluation to minimize the use of expensive secure computation and improve performance.

Uploaded by

chaoslawful
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 39

SMCQL:

Privacy-Preserving
Querying for Federated
Databases
Johes Bater
With Greg Elliot, Craig Eggen, Satyender Goel, Abel Kho, and
Jennie Rogers
Data Solutions… and Issues
• The rise of cheap computing and storage allows
people to store and process enormous amounts of
data
• This data however, is often fragmented among
many different owners
• These users are hesitant to share information
with each other, often due to privacy concerns
Motivating Problem
● Data owners want to combine their data with
data from untrusted collaborators for analytics,
without revealing their raw tuples
● Each party requires a solution that provides:
○ Privacy from each other
○ Efficient execution for queries
○ Usability for their clients
(Some) Existing Solutions
● Single Database
○ Store data from all users in one location
○ Problem: Cannot effectively restrict data access
● Encrypted Query Processing
○ Execute queries over encrypted data
○ Problem: Vulnerable to side-channel attacks, limited
support for complex analytics
● Differential Privacy
○ Insert statistical noise into query results
○ Problem: No exact answers, privacy budget
Our Solution: SMCQL
A privacy-preserving, federated database where:
● All computation is carried out in-situ using semi-honest secure
multiparty computation (SMC)
● A public, unified schema specifies attribute-level privacy
guarantees
● Users can submit SQL queries that are run on all participating
database parties
● Heuristics-based optimizations automatically generate hybrid
secure/plaintext execution plans
Running Example: Medical Data
Running Example: Medical Data

patientID sex diag …..

00001 M blues …..

00002 F cdiff …..

00003 M X …..
Running Example: Electronic Health Records

public private private


patientID sex diag …..

00001 M blues …..

00002 F cdiff …..

00003 M X …..
Research Consortium

A Clinical Data Research


Network (CDRN) is a consortium
of healthcare sites that agree to
share their data for research.

For this project, we partnered


with HealthLNK, a
Chicago-based CDRN
Research Consortium (CDRN)
Sharing data among mutually distrustful
parties
Research Consortium (CDRN)
Sharing data among mutually distrustful
parties

Client
Research Consortium (CDRN)
“How many
patients are
there?”

Client
Research Consortium (CDRN)
“How many
patients are
there?”

SELECT
COUNT(DISTINCT
PATIENT ID)
FROM diagnosis;

Client
Research Consortium (CDRN)
“How many
patients are
there?”

SELECT
COUNT(DISTINCT Honest
PATIENT ID) Broker
FROM diagnosis;

Client
Research Consortium (CDRN)
“How many
patients are
there?”

SELECT
COUNT(DISTINCT Honest
PATIENT ID) Broker
FROM diagnosis;

Client
Research Consortium (CDRN)
“How many
patients have
rare disease X?”

Honest
Broker

Client
Research Consortium (CDRN)
“How many
patients have
rare disease X?”
SELECT
COUNT(DISTINCT Honest
patient id) Broker
FROM diagnosis
WHERE diag=X;

Client
Research Consortium (CDRN)
“How many I’m not telling
patients have anyone private
rare disease X?” data!

SELECT
I’m not telling
COUNT(DISTINCT Honest
anyone private
patient id) Broker
data!
FROM diagnosis
WHERE diag=X;

I’m not telling


Client anyone private
data!
Research Consortium (CDRN)
“How many Securely
compute
patients have query
CT…
rare disease X?” SEL E

SELECT
COUNT(DISTINCT Honest SELECT…
patient id) Broker
FROM diagnosis
WHERE diag=X; SE
LE
CT

Client
Research Consortium (CDRN)
secret
“How many shares
patients have
rare disease X?”
SELECT
COUNT(DISTINCT Honest
patient id) Broker
FROM diagnosis
WHERE diag=X;
ry
que lt
u
Client res
Distrustful Data Federation (DDF)
● Privacy-preserving
○ Must not leak data to other participating nodes
○ Protects from outside attackers
● Usable
○ Accepts SQL queries
○ Automatic translation into SMC
● Efficient
○ Runs in a reasonable amount of time
○ Highly optimized to minimize SMC
SMC Building Blocks
● Garbled circuits
○ Cryptographic protocol used to securely compute a function
across two parties
○ Protects a query’s program traces from snooping
● Oblivious RAM (ORAM)
○ Data structure that shuffles all data on any read/write
○ Protects memory traces from leaking information
○ Up to O(log2n) cost per I/O
● ObliVM
○ Code generator that translates C-style code into garbled
circuits and ORAM
○ Translates DB operators into garbled circuits
ObliVM Code Generator
int$dSize[m*n] join(int$lSize[m] lhs, int$rSize[n] rhs) {
int$dSize[m*n] dst;
• Convert a plaintext function into int dstIdx = 0;
an oblivious version
for(int i = 0; i < m; i=i+1) {
int$lSize l = lhs[i];
• ObliVM library
for(int j = 0; j < n; j=j+1) {
int$rSize r = rhs[j];
• C-style syntax
if($filter(l, r) == 1) {
dst[dstIdx] = $project;
• We use templates for generating dstIdx = dstIdx + 1;
secure operators }
}
}
return dst;
}

blue = constant inserted at compile time red = function generated at compile time green = constant populated at runtime
DDF Architecture
Garbled
Circuit
ObliVM Evaluation
4 Translation

This work is a 2-party prototype


Honest Broker: Query Planning
HealthLNK Reference Queries
COMORBIDITY RECURRENT C. DIFF
SELECT diag, COUNT(*) cnt WITH rcd AS (
FROM diagnoses SELECT pid, time, row_no() OVER
WHERE patient_id IN cdiff_cohort (PARTITION BY pid ORDER BY time)
GROUP BY diag FROM diagnosis
ORDER BY cnt WHERE diag=cdiff)
LIMIT 10;
SELECT DISTINCT pid
FROM rcd r1 JOIN rcd r2 ON r1.pid =
ASPIRIN COUNT
r2.pid
SELECT COUNT(DISTINCT pid)
WHERE r2.time - r1.time >= 15 DAYS
FROM diagnosis d
AND r2.time - r1.time <= 56 DAYS
JOIN medication m ON d.pid = m.pid
AND r2.row_no = r1.row_no + 1;
WHERE d.diag = hd AND m.med = aspirin
AND d.time <= m.time;
Hand-Coded SMC Performance

Test (50 tuples) Plaintext (ms) Secure (ms) Slowdown


Comorbidity 148 253,894 1,609X
Recurrent C. Diff 165 159,145 967X
Aspirin Count 193 8,195,317 43,337X

Pure SMC: TOO SLOW!


Attribute-Level Security Model
Unified schema with security annotations:
● Public attributes
○ Readable by all parties
○ E.g., Lab results, anonymized IDs
● Protected attributes
○ Conditionally available to other parties (k-anonymous)
○ E.g., Age, gender, diagnosis codes
● Private attributes
○ Only available to originating party
○ E.g., Timestamps, zip codes
Reducing Use of Secure Computation
• Trace the flow of sensitive
attributes through the operator
tree Oblivious

• Identify minimal subtree that


must be computed securely to
uphold security policy

Plaintext
COMORBIDITY
SELECT diag, COUNT(*) cnt
FROM diagnoses
WHERE patient_id IN cdiff_cohort
GROUP BY diag
ORDER BY cnt
LIMIT 10;
Optimizing Queries: Sliced Evaluation
Horizontally partition data on public attributes for oblivious evaluation

S
1 1 1 2 2
1
1
Unsliced Output
R 1
R Join S 2
2
1 1
1 1
1 ⋈pid 1
2 2 S1
2 2
1 1 1 S2
Sliced Output 1 2 2
R1 1 2
1
R2 2
Optimizing Queries: Split Operators
Precompute part of the operator locally

Partial count(*) #1

Oblivious
Plaintext

Partial count(*) #2
Optimizing Queries: Semi-Join
Find single-party slices to eliminate unnecessary SMC

Honest Broker
Encrypted Output
Encrypted Output

Local Local
Evaluation
Oblivious Evaluation Evaluation

Tuple ID ∈ Tuple ID ∈
IDA - (IDA ∩ IDB) IDB - (IDA ∩ IDB)
Tuple ID ∈ IDA ∩
IDB

Alice Bob
Optimized Plans
Experimental Setup
• HealthLNK data repository
⚫ One year of data
⚫ 500,000 patients
⚫ 42 million diagnoses
⚫ 23 million medication records
⚫ 15 GB

• Experimental Setup
⚫ 2-party prototype
⚫ 3 pairs of servers
Performance on Reference Queries

Minimizing SMC use by


reducing secure subtrees
greatly improves
performance
Full Optimization using
slicing often provides
further benefits
Operator Level Performance for
Recurrent C. Diff

Optimization is
dependent on both the
query and the input
data
System Scale Up

Minimizing the secure


subtree enables us to
scale to larger inputs.
SMCQL vs Plaintext

Secure computation
has substantial
overhead, but there is
fertile ground for
optimization in future
work.
Contributions
● Generalization of federated DBMSs for
mutually distrustful parties in a semi-honest
setting
● SQL-to-SMC translator
● Heuristics-based optimizer for managing use of
SMC that leverages fine-grained privacy
annotations for schemas

You might also like