SMCQL: Privacy-Preserving Querying For Federated Databases
SMCQL: Privacy-Preserving Querying For Federated Databases
Privacy-Preserving
Querying for Federated
Databases
Johes Bater
With Greg Elliot, Craig Eggen, Satyender Goel, Abel Kho, and
Jennie Rogers
Data Solutions… and Issues
• The rise of cheap computing and storage allows
people to store and process enormous amounts of
data
• This data however, is often fragmented among
many different owners
• These users are hesitant to share information
with each other, often due to privacy concerns
Motivating Problem
● Data owners want to combine their data with
data from untrusted collaborators for analytics,
without revealing their raw tuples
● Each party requires a solution that provides:
○ Privacy from each other
○ Efficient execution for queries
○ Usability for their clients
(Some) Existing Solutions
● Single Database
○ Store data from all users in one location
○ Problem: Cannot effectively restrict data access
● Encrypted Query Processing
○ Execute queries over encrypted data
○ Problem: Vulnerable to side-channel attacks, limited
support for complex analytics
● Differential Privacy
○ Insert statistical noise into query results
○ Problem: No exact answers, privacy budget
Our Solution: SMCQL
A privacy-preserving, federated database where:
● All computation is carried out in-situ using semi-honest secure
multiparty computation (SMC)
● A public, unified schema specifies attribute-level privacy
guarantees
● Users can submit SQL queries that are run on all participating
database parties
● Heuristics-based optimizations automatically generate hybrid
secure/plaintext execution plans
Running Example: Medical Data
Running Example: Medical Data
00003 M X …..
Running Example: Electronic Health Records
00003 M X …..
Research Consortium
Client
Research Consortium (CDRN)
“How many
patients are
there?”
Client
Research Consortium (CDRN)
“How many
patients are
there?”
SELECT
COUNT(DISTINCT
PATIENT ID)
FROM diagnosis;
Client
Research Consortium (CDRN)
“How many
patients are
there?”
SELECT
COUNT(DISTINCT Honest
PATIENT ID) Broker
FROM diagnosis;
Client
Research Consortium (CDRN)
“How many
patients are
there?”
SELECT
COUNT(DISTINCT Honest
PATIENT ID) Broker
FROM diagnosis;
Client
Research Consortium (CDRN)
“How many
patients have
rare disease X?”
Honest
Broker
Client
Research Consortium (CDRN)
“How many
patients have
rare disease X?”
SELECT
COUNT(DISTINCT Honest
patient id) Broker
FROM diagnosis
WHERE diag=X;
Client
Research Consortium (CDRN)
“How many I’m not telling
patients have anyone private
rare disease X?” data!
SELECT
I’m not telling
COUNT(DISTINCT Honest
anyone private
patient id) Broker
data!
FROM diagnosis
WHERE diag=X;
SELECT
COUNT(DISTINCT Honest SELECT…
patient id) Broker
FROM diagnosis
WHERE diag=X; SE
LE
CT
…
Client
Research Consortium (CDRN)
secret
“How many shares
patients have
rare disease X?”
SELECT
COUNT(DISTINCT Honest
patient id) Broker
FROM diagnosis
WHERE diag=X;
ry
que lt
u
Client res
Distrustful Data Federation (DDF)
● Privacy-preserving
○ Must not leak data to other participating nodes
○ Protects from outside attackers
● Usable
○ Accepts SQL queries
○ Automatic translation into SMC
● Efficient
○ Runs in a reasonable amount of time
○ Highly optimized to minimize SMC
SMC Building Blocks
● Garbled circuits
○ Cryptographic protocol used to securely compute a function
across two parties
○ Protects a query’s program traces from snooping
● Oblivious RAM (ORAM)
○ Data structure that shuffles all data on any read/write
○ Protects memory traces from leaking information
○ Up to O(log2n) cost per I/O
● ObliVM
○ Code generator that translates C-style code into garbled
circuits and ORAM
○ Translates DB operators into garbled circuits
ObliVM Code Generator
int$dSize[m*n] join(int$lSize[m] lhs, int$rSize[n] rhs) {
int$dSize[m*n] dst;
• Convert a plaintext function into int dstIdx = 0;
an oblivious version
for(int i = 0; i < m; i=i+1) {
int$lSize l = lhs[i];
• ObliVM library
for(int j = 0; j < n; j=j+1) {
int$rSize r = rhs[j];
• C-style syntax
if($filter(l, r) == 1) {
dst[dstIdx] = $project;
• We use templates for generating dstIdx = dstIdx + 1;
secure operators }
}
}
return dst;
}
blue = constant inserted at compile time red = function generated at compile time green = constant populated at runtime
DDF Architecture
Garbled
Circuit
ObliVM Evaluation
4 Translation
Plaintext
COMORBIDITY
SELECT diag, COUNT(*) cnt
FROM diagnoses
WHERE patient_id IN cdiff_cohort
GROUP BY diag
ORDER BY cnt
LIMIT 10;
Optimizing Queries: Sliced Evaluation
Horizontally partition data on public attributes for oblivious evaluation
S
1 1 1 2 2
1
1
Unsliced Output
R 1
R Join S 2
2
1 1
1 1
1 ⋈pid 1
2 2 S1
2 2
1 1 1 S2
Sliced Output 1 2 2
R1 1 2
1
R2 2
Optimizing Queries: Split Operators
Precompute part of the operator locally
Partial count(*) #1
Oblivious
Plaintext
Partial count(*) #2
Optimizing Queries: Semi-Join
Find single-party slices to eliminate unnecessary SMC
Honest Broker
Encrypted Output
Encrypted Output
Local Local
Evaluation
Oblivious Evaluation Evaluation
Tuple ID ∈ Tuple ID ∈
IDA - (IDA ∩ IDB) IDB - (IDA ∩ IDB)
Tuple ID ∈ IDA ∩
IDB
Alice Bob
Optimized Plans
Experimental Setup
• HealthLNK data repository
⚫ One year of data
⚫ 500,000 patients
⚫ 42 million diagnoses
⚫ 23 million medication records
⚫ 15 GB
• Experimental Setup
⚫ 2-party prototype
⚫ 3 pairs of servers
Performance on Reference Queries
Optimization is
dependent on both the
query and the input
data
System Scale Up
Secure computation
has substantial
overhead, but there is
fertile ground for
optimization in future
work.
Contributions
● Generalization of federated DBMSs for
mutually distrustful parties in a semi-honest
setting
● SQL-to-SMC translator
● Heuristics-based optimizer for managing use of
SMC that leverages fine-grained privacy
annotations for schemas