
GB2584910A - Assessing likelihood of re-identification

Info

Publication number
GB2584910A
Authority
GB
United Kingdom
Prior art keywords
population
model
attributes
probability
uniqueness
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1908942.4A
Other versions
GB201908942D0 (en)
Inventor
Rocher Luc
Hendrickx Julien
De Montjoye Yves-Alexandre
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Universite Catholique de Louvain UCL
Ip2ipo Innovations Ltd
Original Assignee
Universite Catholique de Louvain UCL
Imperial College Innovations Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Universite Catholique de Louvain UCL, Imperial College Innovations Ltd filed Critical Universite Catholique de Louvain UCL
Priority to GB1908942.4A priority Critical patent/GB2584910A/en
Publication of GB201908942D0 publication Critical patent/GB201908942D0/en
Priority to PCT/GB2020/051498 priority patent/WO2020254829A1/en
Publication of GB2584910A publication Critical patent/GB2584910A/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records


Abstract

Receiving a statistical generative and multivariate model of a distribution of a population of entities having a set of attributes, receiving a record of an entity x having a given set of attributes, receiving the population size n, determining a probability p(x, n) for the given entity and population size, using this probability to determine a first probability that the given entity is unique in the population and/or a second probability that the given entity is identifiable, and outputting the first and/or second probability. Also disclosed is a similar method which instead receives a first probability that the given entity is unique in the population and/or a second probability that the given entity is identifiable, and determines a value for the population size n. Further disclosed is receiving a set of marginal distributions, receiving a matrix of mutual information scores between pairs of attributes for a dataset, forming a statistical generative and multivariate model using the set of marginal distributions and the mutual information scores, and storing said model. The model used may be a Gaussian copula model. The invention aims to calculate the risk of individuals being re-identified in an anonymised set of personal records, for example medical records.

Description

Assessing likelihood of re-identification
Field
The present invention relates to a method of and a system for assessing a likelihood of an individual being correctly re-identified in anonymous data.
Background
In the last decade, the ability to collect and store personal data has exploded. With two thirds of the world population having access to the Internet, electronic medical records becoming the norm, and the rise of the Internet of Things, this is unlikely to stop anytime soon. Collected at scale from financial or medical services, when filling in online surveys or liking pages, this data has an incredible potential for good. It drives scientific advancements in medicine, social science, and AI, and promises to revolutionize the way businesses and governments function.
However, the large-scale collection and use of detailed individual-level data raise legitimate privacy concerns. The recent backlashes against the sharing of the UK National Health Service medical data with DeepMind and the collection and subsequent sale of Facebook data to Cambridge Analytica are the latest evidence that people are concerned about the confidentiality, privacy, and ethical use of their data. In a recent survey, more than 72% of U.S. citizens reported being worried about sharing personal information online. In the wrong hands, sensitive data can be exploited for blackmailing, mass surveillance, social engineering, or identity theft.
De-identification, the process of anonymizing datasets before sharing them, has been the main paradigm used in research and elsewhere to share data while preserving people's privacy. Data protection laws worldwide no longer consider anonymous data to be personal data, allowing it to be freely used, shared, and sold. Academic journals are, e.g., increasingly requiring authors to make anonymous data available to the research community. While standards for anonymous data vary, modern data protection laws, such as the European General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), consider that each and every person in a dataset has to be protected for the dataset to be considered anonymous. This new higher standard for anonymization is further made clear by the introduction in GDPR of pseudonymous data: data that does not contain obvious identifiers but might be re-identifiable and is therefore within the scope of the law.
Yet numerous supposedly anonymous datasets have recently been released and re-identified. In 2016, journalists re-identified politicians in an anonymized browsing history dataset of 3 million German citizens, uncovering their medical information and their sexual preferences. A few months before, the Australian Department of Health publicly released de-identified medical records for 10% of the population only for researchers to re-identify them six weeks later. Before that, studies had shown that de-identified hospital discharge data could be re-identified using basic demographic attributes and that diagnostic codes, year of birth, gender, and ethnicity could uniquely identify patients in genomic studies data. Finally, researchers were able to uniquely identify individuals in anonymized taxi trajectories in NYC, bike sharing trips in London, subway data in Riga and mobile phone and credit card datasets.
Statistical disclosure control researchers and some companies are disputing the validity of these re-identifications: as datasets are always incomplete, journalists and researchers can never be sure they have re-identified the right person even if they found a match. They argue that this provides strong plausible deniability to participants and reduces the risks, making such de-identified datasets anonymous, including according to GDPR. De-identified datasets can be intrinsically incomplete, e.g. because the dataset only covers patients of one of the hospital networks in a country, or because they have been subsampled as part of the de-identification process. For example, the U.S. Census Bureau releases only 1% of their decennial census, and sampling fractions for international censuses range from 0.07% in India to 10% in South American countries. Companies are adopting similar approaches with, e.g., the Netflix Prize dataset including less than 10% of their users.
Imagine a health insurance company who decides to run a contest to predict breast cancer and publishes a de-identified dataset of 1,000 people, 1% of their 100,000 insureds in California, including people's birth date, gender, ZIP code, and breast cancer diagnosis. John Doe's employer downloads the dataset and finds one (and only one) record matching Doe's information: male living in Berkeley, CA (94720), born on January 2nd, 1968, and diagnosed with breast cancer (self-disclosed by John Doe). This record also contains the details of his recent (failed) stage IV treatments. When contacted, the insurance company argues that matching does not equal re-identification: the record could belong to one of the 99,000 other people they insure or, if the employer does not know whether Doe is insured by this company or not, to anyone else of the 39.5M people living in California.
Reference is made to US 2011/0258206 A1 which discloses a re-identification risk metric which is determined for the scenario where an intruder wishes to re-identify as many records as possible in a disclosed database. The dataset can be analysed to determine equivalence classes for variables in the dataset and one or more equivalence class sizes. The re-identification risk metric associated with the dataset can be determined using a modified log-linear model by measuring a goodness of fit of the one or more equivalence class sizes.
Reference is also made to US 2016/0155061 A1 which discloses a system for determining journalist risk of a dataset using population equivalence class distribution estimation. The dataset may be a cross-sectional dataset of a longitudinal dataset. The determined risk of re-identification can be used in a de-identification process for the dataset.
Reference is also made to US 2010/77006 A1, US 2018/114037 A1 and US 2015/106944 A1.
Summary
According to a first aspect of the present invention there is provided a computer-implemented method of assessing a likelihood of re-identification. The method comprises receiving a statistical generative and multivariate model of a distribution of a population of entities having a set of attributes, receiving a record for a given entity having a given set of attributes, receiving a value of population size and determining a probability of the record for the given entity and population size. The method further comprises determining, using the probability of the record for the given entity and the population size: a first probability that the given entity is unique in the population and/or, for a given set of attributes, a second probability that the given entity is correctly identifiable, and outputting the first and/or second probability as measure(s) of likelihood. Outputting the first or second probability may comprise storing the first or second probability in memory or storage, and/or displaying the first or second probability.
Thus, the model can be used and tailored for a specific entity for assessing the likelihood of identification.
The entity may be a legal entity. The entity may be an individual (or person). The 20 entity may be a company. The entity may be an organization.
The record may be a hypothetical (or "theoretical") record or an extant (or "real") record. If the record is hypothetical, the probability of uniqueness is one for a hypothetical entity. This can be used to determine the probability of an entity existing, for example, the probability that a 98-year-old man lives in London, has been married twice and has four children. If the record is extant, the probability of uniqueness relates to a real entity and so can be used to determine the uniqueness, or the correctness of re-identification, of that entity.
The model is a population-based model and it is possible to assign a respective score to one or more entities (even to every entity) in the population after training the model on a small set of records.
The given entity may be a first entity and the record may be a first record, and the method may further comprise receiving a second record for a second entity having a respective set of attributes, determining, using the first probability, a third probability that the first and second records come from the same entity, and outputting the third probability.
Thus, the method can be used not only to estimate the correctness of a specific re-identification, but also to match records between two databases.
According to a second aspect of the present invention there is provided a computer-implemented method of determining a size of a population for which a given likelihood of re-identification occurs. The method comprises receiving a statistical generative and multivariate model of a distribution of a population of entities having a set of attributes, receiving a record for a given entity having a given set of attributes, receiving a first probability that the given entity is unique in the population or, for a given set of attributes, a second probability that the given entity is correctly identifiable. The method comprises determining a value of population size needed to obtain the first probability and/or the second probability, wherein determining the value of population size comprises determining, for each of at least one value of population size, a probability of the record for the given entity and population size. The method further comprises outputting the value of population size.
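Because the model's record probability p(x) does not itself depend on n, the population size can also be recovered directly. A minimal sketch, assuming the uniqueness model ξx = (1 − p(x))^(n−1) set out in the detailed description (the function name is hypothetical):

```python
import math

def population_size_for_uniqueness(p_x: float, target_xi: float) -> float:
    """Invert xi = (1 - p(x))^(n - 1) for n, given the modelled record
    probability p(x) and a target probability of population uniqueness."""
    # xi = (1 - p)^(n - 1)  =>  n = 1 + ln(xi) / ln(1 - p)
    return 1.0 + math.log(target_xi) / math.log(1.0 - p_x)

# Example: a record with modelled probability 1e-6 is unique with 50%
# likelihood in a population of roughly 693,000 people.
print(round(population_size_for_uniqueness(1e-6, 0.5)))
```

Where the relationship is not invertible in closed form (e.g. when targeting the second probability), the same function can be evaluated for candidate values of n and bisected, as the aspect contemplates.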
According to a third aspect of the present invention there is provided a computer-implemented method of training a model for use in assessing a likelihood of re-identification. The method comprises receiving a set of marginal distributions, receiving a matrix of mutual information scores between pairs of attributes for a dataset D, forming a statistical generative and multivariate model using the set of marginal distributions and the mutual information scores and storing the statistical generative and multivariate model.
The marginal distributions may be empirical, theoretical or fitted.
Forming the statistical generative and multivariate model may comprise, for each attribute, mapping the attribute value onto an interval using the marginal in such a way that the distribution of a mapped attribute is known, and generating a multivariate statistical model for the mapped attributes jointly in a hypercube such that, for each attribute, the marginal of the multivariate model corresponds to the known distribution on the interval and, for each pair of attributes, the mutual information between the mapped attributes in the model is consistent with the matrix of mutual information received.
The interval may be [0, 1] (i.e., between 0 and 1). Mapping may involve assigning a numerical value to a non-numerical input, such as a string or characteristic. For example, blond hair = [0, 0.2] and brown hair = [0.2, 0.6]. A known distribution of a mapped attribute may be that the distribution of the mapped attribute is uniform across the interval. Mapping has the effect that any combination of attribute values (any potential record) corresponds to a given subset of the hypercube [0, 1]^d, where d is the number of attributes. Thus, for example, brown hair and green eyes could be mapped into a box [0.2, 0.6] × [0.3, 0.4].
To use the model, for each potential record x, the probability of x is/can be computed by integrating the multivariate model on the subset of [0, 1]^d corresponding to the record.
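For illustration, a minimal sketch of this mapping, using the hypothetical hair and eye colour marginals from the example above (the interval widths are assumed values, not taken from any real dataset):

```python
def intervals_from_marginal(categories, probabilities):
    """Map each category to a sub-interval of [0, 1] whose width equals its
    marginal probability (consecutive slices of the cumulative distribution)."""
    intervals, low = {}, 0.0
    for category, p in zip(categories, probabilities):
        intervals[category] = (low, low + p)
        low += p
    return intervals

# Hypothetical marginals reproducing the example above: a record then
# corresponds to a box of the hypercube [0, 1]^2.
hair = intervals_from_marginal(["blond", "brown", "other"], [0.2, 0.4, 0.4])
eyes = intervals_from_marginal(["brown", "green", "blue"], [0.3, 0.1, 0.6])
record_box = [hair["brown"], eyes["green"]]
print(record_box)  # [(0.2, 0.6), (0.3, 0.4)] up to floating-point rounding
```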
The method may further comprise receiving a dataset of records, each record relating to a respective entity and containing attributes, extracting marginals from the dataset, estimating a respective distribution for each of said marginals to form the set of marginal distributions, extracting or computing mutual information scores between variables in the dataset and generating the mutual information matrix using the mutual information scores.
The generative model may be a copula model, for example, a Gaussian copula model or a vine copula model. The generative model may be a deep learning generative model.
There may be at least two attributes in a record. There may be between 2 and 100 attributes in a record. There may be at least 100 attributes in a record.
The population may contain at least 10^3 entities. The population may contain between 10^3 and 10^7, 10^8 or 10^9 individuals. The population may contain at least 10^7 individuals.
Even if the population could exceed 10^7, a small dataset of, for example, 10^3 or 10^4 records, can be used to train a model.
Attributes may include nominal or ordinal data, categorical data (for example, strings), or continuous data.
Extracting marginals from the dataset may comprise building a histogram and using it as an estimated distribution. Extracting marginals from the dataset may comprise fitting attribute values to a statistical distribution, such as a Poisson distribution, Zipf distribution or negative binomial distribution.
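A minimal sketch of both options on a toy attribute column (the values are illustrative, and a Poisson fit stands in here for the Poisson, Zipf, or negative binomial fits contemplated above):

```python
import numpy as np
from collections import Counter
from scipy import stats

values = np.array([0, 1, 1, 2, 2, 2, 3, 3, 5, 8])  # toy attribute column

# Option 1: histogram plug-in, used directly as the estimated marginal.
counts = Counter(values.tolist())
empirical = {v: c / len(values) for v, c in counts.items()}

# Option 2: fit a parametric count distribution and use its pmf as the
# marginal (Poisson shown; its rate MLE is the sample mean).
lam = values.mean()
fitted = {v: stats.poisson.pmf(v, lam) for v in sorted(set(values.tolist()))}
print(empirical, fitted, sep="\n")
```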
Extracting or computing mutual information scores may comprise using an empirical estimator from specified joint bin frequencies. Extracting or computing mutual information scores may comprise using a shrinkage estimator or a Chao-Shen entropy estimator.
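A minimal sketch of the naive empirical estimator computed from joint bin frequencies; a shrinkage or Chao-Shen estimator would replace the raw frequencies with regularized or coverage-adjusted estimates:

```python
import numpy as np

def plugin_mutual_information(x, y):
    """Naive plug-in estimate of I(X; Y) in nats from joint bin frequencies."""
    x, y = np.asarray(x), np.asarray(y)
    xs, ys = np.unique(x), np.unique(y)
    joint = np.array([[np.mean((x == a) & (y == b)) for b in ys] for a in xs])
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mi = 0.0
    for i, _ in enumerate(xs):
        for j, _ in enumerate(ys):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log(joint[i, j] / (px[i] * py[j]))
    return mi

print(plugin_mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # ~ln(2)
```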
According to a fifth aspect of the present invention there is provided a computer-implemented method of estimating correctness of re-identification of an entity after having matched said entity in an anonymized/de-identified dataset.
According to a sixth aspect of the present invention there is provided a computer-implemented method of estimating uniqueness of an entity in a complete population.
According to a seventh aspect of the present invention there is provided a computer-implemented method of estimating a probability that first and second data records in first and second different databases come from the same person.
According to an eighth aspect of the present invention there is provided a method of assessing probability of re-identification comprising the methods of the first and third aspects.
The method of the eighth aspect may be a method of privacy certification, plausible deniability, verifying whether a dataset complies with data protection regulations, or record matching.
The method(s) follow a generative statistical approach. The method may estimate a joint distribution of a complete population from a set of sample records, and output an estimated distribution for the data.
The method may use a multivariate copula-based approach. The method may comprise extracting all marginals and estimating a distribution for each. The method may comprise extracting all mutual information scores between pairs of variables. The method may estimate correlation scores between latent uniform marginal distributions, using the fitted marginals. The method may combine the correlation scores and the fitted marginals to form a joint distribution of the population, using a Gaussian copula.
The method may assign an entity score tailored to each entity based on their characteristics, for example, by modelling the probability to draw an entity record from the population. The method is population-based and may predict the correct re-identification, or uniqueness, or record linkage inside a population, scaling to billions of people.
The method can combine prior knowledge about the marginals or the association structure, from previous work or measurements, to improve the predictions.
The approach generally involves training a model, in particular, a joint distribution X, and then using the model to assess risk of re-identification. A generative statistical approach is followed whereby an estimate of the joint distribution of the complete population is made, from a set of sample records, and an estimated distribution for the data is output.
The process includes assigning an entity score tailored to each entity based on their characteristics (by modelling the probability to draw an entity record from the population). The approach is population-based: the process can predict the correct re-identification, or uniqueness, and/or record linkage inside a population and can scale to billions of people. The approach is flexible: the process can combine prior knowledge about marginals or association structure, from previous work or measurements, to improve the predictions.
The dataset may contain demographic data, socio-demographic data (e.g., responses to an online survey), behavioural data (e.g., online tracking data, web usage, device fingerprints and the like), financial data (e.g., transaction summaries, loan applications and the like), insurance data (e.g., consumer profile, data from data brokers), census data (using public census microdata to train the method), and medical data (e.g., hospital or clinic records; administrative, diagnostic-related, or treatment information).
According to a ninth aspect of the present invention there is provided a computer program comprising instructions for performing the method.
According to a tenth aspect of the present invention there is provided a computer program product comprising a computer readable medium (which may be non-transitory) storing the computer program.
According to an eleventh aspect of the present invention there is provided apparatus comprising at least one processor and memory configured to perform the method.
According to a twelfth aspect of the present invention there is provided a computer product comprising a computer-readable medium and a statistical generative and multivariate model of a distribution of a population of entities having attributes, the model stored on the computer-readable medium.
Brief Description of the Drawings
Certain embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 illustrates estimating the population uniqueness of the USA corpus. (a) We compare, for each population, empirical and estimated population uniqueness (boxplot with median, 25th and 75th percentiles, and maximum 1.5 interquartile range (IQR) for each population, with 100 independent trials per population). For example, date of birth, location (PUMA code), marital status, and gender uniquely identify 78.7% of the 3 million people in this population (empirical uniqueness), which our model estimates to be 78.2 ± 0.5% (boxplot in black). (b) Absolute error when estimating USA's population uniqueness when the disclosed dataset is randomly sampled from 10% down to 0.1%. The boxplots (25th, 50th, and 75th percentiles, 1.5 IQR) show the distribution of MAE for population uniqueness at one subsampling fraction across all USA populations (100 trials per population and sampling fraction). The y-axis shows both p, the sampling fraction, and n_s = p × n, the sample size. Our model estimates population uniqueness very well for all sampling fractions, with the MAE slightly increasing when only a very small number of records is available (p = 0.1%, or 3,061 records).
Figure 2 illustrates that the model predicts correct re-identifications with high confidence. (a) Receiver operating characteristic (ROC) curves for USA populations (a light ROC curve for each population and a solid line for the average ROC curve). Our method accurately predicts the (binary) individual uniqueness. (Inset) False-discovery rate (FDR) for individual records classified with ξ̂x > 0.9, > 0.95, and > 0.99. For re-identifications that the model predicts are likely to be correct (ξ̂x > 0.95), only 5.26% of them are incorrect (FDR). (b) Our model outperforms by 39% the best theoretically achievable prediction using population uniqueness across every corpus. A red point shows the Brier Score obtained by our model when trained on a 1% sample. The solid line represents the lowest Brier Score achievable when using the exact population uniqueness, while the dashed line represents the Brier Score of a random guess prediction (BS = 1/3).
Figure 3 illustrates that average individual uniqueness increases fast with the number of collected demographic attributes. (a) Distribution of predicted individual uniqueness knowing ZIP code, date of birth, and gender (resp. ZIP code, date of birth, gender, and number of children) in blue (resp. orange). The dotted blue line at ξx = 0.580 (resp. dashed orange line at ξx = 0.997) illustrates the predicted individual uniqueness of Gov. Weld knowing the same combination of attributes. (Inset) The correctness κx is solely determined by the uniqueness ξx and the population size n (here for Massachusetts). We show individual uniqueness and correctness for William Weld with 3 (in blue) and 4 (in orange) attributes. (b) The boxplots (25th, 50th, and 75th percentiles, 1.5 IQR) show the average uniqueness knowing k demographic attributes, grouped by number of attributes. The individual uniqueness scores ξx are estimated on the complete population in Massachusetts, based on the 5% PUMS files. While few attributes might not be sufficient for a re-identification to be correct, collecting a few more attributes will quickly render the re-identification very likely to be successful. For instance, 15 demographic attributes would render 99.98% of people in Massachusetts unique. (c) Uniqueness varies with the specific value of attributes. For instance, a 33-year-old is less unique than a 58-year-old person. We here either (i) randomly replace the value of one baseline attribute (ZIP code, date of birth, or gender) or (ii) add one extra attribute, in both cases sampling from its marginal distribution, starting from the uniqueness of a 58-year-old male from Cambridge, MA. The dashed baseline shows his original uniqueness ξx = 0.580, and the boxplots the distribution of individual uniqueness obtained after randomly replacing or adding one attribute. A complete description of the attributes and method is available in Supplementary Methods.
Figure 4 shows a first table ("Table 1") showing the Mean Absolute Error (mean ± s.d.) when estimating population uniqueness (100 trials per population). Our model correctly estimates population uniqueness even when only a small to very small fraction of the population is available. n denotes the population size and c the corpus size (the total number of populations considered per corpus). We do not estimate population uniqueness when the sampled dataset contains fewer than 50 records.
Figure 5 (Supplementary Figure 1) shows a comparison of empirical and estimated uniqueness for every corpus. One boxplot represents one population, for which we compare its empirical and estimated uniqueness values. For each population, we run the model 100 times and display the median and the 25th and 75th percentiles for estimated scores. Whiskers show the maximum 1.5 interquartile range (IQR). The panels (a) to (e) correspond to the corpora MERNIS, USA, ADULT, HDV, MIDUS.
Figure 6 (Supplementary Figure 2) shows the absolute error when estimating population uniqueness using 100% to 0.1% samples for every corpus. Boxplots (25th, 50th, and 75th percentiles and 1.5 IQR) show the MAE values for one subsampling fraction across all populations. The y-axis shows both p, the sampling fraction, and n_s = p × n, the sample size. The panels (a) to (e) correspond to the corpora MERNIS, USA, ADULT, HDV, MIDUS.
Figure 7 (Supplementary Figure 3) shows re-identification receiver operating characteristic (ROC) curves for every corpus. For each population, we train a copula model on a 1% sample and measure the accuracy and recall of the resulting individual uniqueness likelihoods. A solid colored line represents the average ROC curve and light curves the ROC curves for a single population. The inset graph represents the false-discovery rate for individual records classified with ξ̂x > 0.9, > 0.95, and > 0.99. Not only does the method discriminate correctly between unique and non-unique records, but it also accurately classifies records with the highest likelihood of successful re-identification. The panels (a) to (e) correspond to the corpora MERNIS, USA, ADULT, HDV, MIDUS.
Figure 8 (Supplementary Figure 4) shows a comparison of the performance of our model against frequency-based re-identification risk models from the literature. Our model obtains a significantly lower MAE on every corpus and sampling fraction but two (USA with 10% and 5% sampling, compared to the Ewens model). We report the MAE for average uniqueness precision (using 100 trials per population) and the p-value of the Wilcoxon signed-rank test comparing the performance of the copula model with other approaches (*** for P < 0.01, ** for P < 0.05, * for P < 0.1, and n.s. when not significant). The panels (a) to (e) correspond to the corpora MERNIS, USA, ADULT, HDV, MIDUS.
Figure 9 (Supplementary Figure 5) illustrates the false-discovery rate with original (left) and corrected (right) scores for all five corpora. The copula model is trained on a 1% sample, and the FDR scores are computed on 1,000 records from the original population.
Figure 10 (Supplementary Figure 6) illustrates the cumulative distribution of the deviation between average individual likelihood and predicted population uniqueness.
For each population (all corpora combined), we select 1,000 individuals from the original population and compare their average likelihood with the estimated population uniqueness (in blue). The dashed red line represents the baseline null deviation after correction. The copula model is inferred on a 1% sample from the original population.
Figure 11 (Supplementary Figure 7) shows reliability diagrams for copula methods trained on 1% samples: calibration plots of mean predicted value vs. fraction of positive outcomes (unique individuals). The solid lines are the copula estimators ξ̂x, the dashed ones the corrected copula estimators, and the diagonal lines represent an ideal predictor. We train copula methods on 1% samples for each population, and report these measures for 1,000 records per population.
Figure 12 (Supplementary Figure 8) shows re-identification receiver operating characteristic (ROC) curves for every corpus (only sample-unique records, to be compared with Fig. 3). For each population, we train a copula model on a 1% sample and measure the accuracy and recall of the resulting individual uniqueness likelihoods, only evaluated on sample-unique records. A solid colored line represents the average ROC curve and light curves the ROC curves for a single population. The inset graph represents the false-discovery rate for individual records classified with ξ̂x > 0.9, > 0.95, and > 0.99. The panels (a) to (e) correspond to the corpora MERNIS, USA, ADULT, HDV, MIDUS.
Figure 13 (Supplementary Figure 9) shows that, at very small sampling fractions, the error is mostly determined by the marginals. The boxplots show the mean absolute error (MAE) for population uniqueness estimates, grouped by sample size (from 50 to 6,400 records). Panels A and C represent the absolute error for the USA and MERNIS populations when the marginals are approximated from the sample, while panels B and D show the absolute error using the exact marginals (for USA and MERNIS respectively).
Figure 14 (Supplementary Figure 10) shows the Kullback-Leibler divergence (in nats) between the empirical p(x) and estimated q(x) distributions for each corpus. The copula method is trained on a 1% sample. Overall, the copula method achieves small prediction errors, with a K-L divergence of 1.59 ± 1.78 nats overall.
Figure 15 (Supplementary Figure 11) shows that the copula parameters converge quickly.
Left and right panels compare the covariance parameter Σ̂ estimated on 50 to 6,400 records with Σ estimated on 100% of the population. The boxplots in panel A (resp. C) show the RMSE between Σ̂ and Σ for the USA (resp. MERNIS) corpus. The boxplots in panel B (resp. D) show the bias between Σ̂ and Σ for the USA (resp. MERNIS) corpus. Both metrics are computed, for each population, on every pair (i, j) of attributes.
Figure 16 (Supplementary Figure 12) illustrates the poor calibration achieved by log-linear models. A blue point shows the Brier Score obtained by a log-linear model when trained on a 1% sample and evaluated on sample-unique records. Panel (a) uses a Poisson log-linear model and panel (b) a Negative Binomial model. The solid line represents the lowest Brier Score achievable when using the exact population uniqueness, while the dashed line represents the Brier Score of a random guess prediction (BS = 1/3). For a fraction of all studied populations, the model did not converge: we report these results with dots outside the right margin.
Figure 17 (Supplementary Figure 13) illustrates re-identification receiver operating characteristic (ROC) curves for every corpus (all sample records, to be compared with Fig. 3). For each population, we train a copula model on a 1% sample and measure the accuracy and recall of the resulting individual uniqueness likelihoods, evaluated on all sample records (with sample non-unique records assigned an individual uniqueness ξ̂x = 0). A solid colored line represents the average ROC curve and light curves the ROC curves for a single population. The inset graph represents the false-discovery rate for individual records classified with ξ̂x > 0.9, > 0.95, and > 0.99. The panels (a) to (e) correspond to the corpora MERNIS, USA, ADULT, HDV, MIDUS.
Figure 18 shows a second table ("Supplementary Table 1") of AUC and FDR for the classification of individual uniqueness. For each population, for 1,000 individuals sampled at random in the whole population, we estimate their individual uniqueness (method trained on a 1% sample) and compare the predicted likelihood to the true value. We report the AUC (mean ± s.d.) and the FDR per corpus and overall. We also report the F-scores in Supplementary Table 7.
Figure 19 shows a third table ("Supplementary Table 2") illustrating error rates for predicting population uniqueness (exact marginals): mean absolute error (MAE) [mean ± s.d., in percent] on estimated population uniqueness grouped by corpus. n denotes the population size and c the corpus size (the total number of populations considered per corpus). We do not evaluate the model when samples contain fewer than 50 records.
Compared to Table 1 (Main Text), exact marginals provide lower error rates for small sampling fractions.
Figure 20 shows a fourth table ("Supplementary Table 3") of within-population standard deviation (s.d.) for the population uniqueness, for each corpus, grouped by sampling fraction (mean ± s.d. for 100 trials). n denotes the population size and c the corpus size (the total number of populations considered per corpus).
Figure 21 shows a fifth table ("Supplementary Table 4") of REML-estimated mixed effects models for Σ and for the RMSE between Σ and Σ̂.
Figure 22 shows a sixth table ("Supplementary Table 5") of REML-estimated mixed effects models for ξ̂x (grouped by corpus and population) and for the RMSE between ξx and ξ̂x (grouped by corpus).
Figure 23 shows a seventh table ("Supplementary Table 6") of AUC and FDR for the classification of individual uniqueness (only sample-unique records, to be compared with Supplementary Table 1). For each population, for all individuals unique in the 1% training sample, we estimate their individual uniqueness (method trained on a 1% sample) and compare the predicted likelihood to the true value. We report the AUC (mean ± s.d.) and the FDR per corpus and overall.
Figure 24 shows an eighth table ("Supplementary Table 7") of F-scores for the classification of individual uniqueness. For each population, for 1,000 individuals sampled at random in the whole population, we estimate their individual uniqueness (method trained on a 1% sample) and compare the predicted likelihood ξ̂x to the true value ξx. We report the F-score per corpus and overall. Bold scores indicate the highest score per corpus. If the model must obtain a good balance between precision and recall, the optimal threshold lies near ξ̂x = 0.50. Yet, if the model must obtain a low proportion of false positives, achieved by a small false-discovery rate, Supplementary Table 1 shows that this requires a higher cutoff, above ξ̂x = 0.90.
Figure 25 is a schematic block diagram of a first computer system.
Figure 26 is a schematic block diagram of a second computer system.
Figure 27 is a process flow diagram of training a model.
Figure 28 illustrates training a model.
Figure 29 illustrates a process of estimating correctness of re-identification of an individual and/or uniqueness of an individual in a population.
Figure 30 illustrates a process of estimating a probability that two data records come from the same person.
Detailed Description
Our disclosure shows how the likelihood of a specific individual having been correctly re-identified can be estimated with high accuracy even when the anonymized dataset is heavily incomplete. We propose a generative graphical model that can be accurately and efficiently trained on incomplete data. Using sociodemographic, survey, and health datasets, we show that our model exhibits a mean absolute error (MAE) of 0.018 on average in estimating population uniqueness and an MAE of 0.041 in estimating population uniqueness when the model is trained on only a 1% population sample. Once trained, our model allows us to predict whether the re-identification of an individual is correct with an average false-discovery rate of less than 6.7% for a 95% threshold (ξ̂x > 0.95), and an error rate 39% lower than the best achievable population-level estimator. With population uniqueness increasing fast with the number of attributes available, our results show that the likelihood of a re-identification being correct, even in a heavily sampled dataset, can be accurately estimated and is often high. Our results reject the claims that, first, re-identification is not a practical risk and, second, sampling or releasing partial datasets provides plausible deniability. Moving forward, they question whether current de-identification practices satisfy the anonymization standards of modern data protection laws such as GDPR and CCPA, and emphasize the need to move, from a legal and regulatory perspective, beyond the de-identification release-and-forget model.
Results
Using Gaussian copulas to model uniqueness
We consider a dataset D, released by an organization, and containing a sample of individuals extracted at random from a population of n individuals, e.g., the US population. Each row x^(i) is an individual record, containing d nominal or ordinal attributes (e.g. demographic variables, survey responses) taking values in a discrete sample space X. We consider the rows x^(i) to be independent and identically distributed (i.i.d.), drawn from the probability distribution X with P(X = x), abbreviated p(x).
Our model quantifies, for any individual record x, the likelihood ξx for this record to be unique in the complete population, and therefore always successfully re-identified when matched. From ξx, we derive the likelihood κx for x to be correctly re-identified when matched, which we call correctness. If Doe's record x^(d) is unique in D, he will always be correctly re-identified (κ_x(d) = 1 and ξ_x(d) = 1). However, if two other people share the same attributes (x^(d) not unique, ξ_x(d) = 0), Doe would still have one chance out of three to have been successfully re-identified (κ_x(d) = 1/3). We model ξx as

ξx = (1 − p(x))^(n−1)

and κx as

κx = (1 − (1 − p(x))^n) / (n p(x))

with proofs in Methods (see hereinafter).
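A minimal sketch of these two quantities as reconstructed above (the closed form for κx follows from the standard binomial identity for E[1/(1 + B)]):

```python
def uniqueness(p_x: float, n: int) -> float:
    """xi_x: probability that none of the other n - 1 individuals shares x."""
    return (1.0 - p_x) ** (n - 1)

def correctness(p_x: float, n: int) -> float:
    """kappa_x: expected probability that a match on x is the right person,
    i.e. E[1 / (1 + B)] with B ~ Binomial(n - 1, p(x))."""
    return (1.0 - (1.0 - p_x) ** n) / (n * p_x)

# A record with modelled probability 1e-7 among 39.5M Californians:
print(uniqueness(1e-7, 39_500_000), correctness(1e-7, 39_500_000))
```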
We model the joint distribution of X_1, ..., X_d using a latent Gaussian copula. Copulas have been used to study a wide range of dependence structures in finance, geology, and biomedicine, and allow us to model the density of X by specifying separately the marginal distributions, easy to infer from limited samples, and the dependency structure. For a large sample space X and a small number of available records, Gaussian copulas provide a good approximation of the density using only d(d − 1)/2 parameters for the dependency structure and no hyperparameter.
The density of a Gaussian copula C_Σ is expressed as

c_Σ(u) = det(Σ)^(−1/2) exp(−(1/2) q^T (Σ^(−1) − I_d) q), where q = (Φ^(−1)(u_1), ..., Φ^(−1)(u_d))

with a covariance matrix Σ, u ∈ [0, 1]^d, and Φ the CDF of a standard univariate normal distribution.
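A minimal sketch evaluating this density as reconstructed above (the example correlation matrix is illustrative):

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_density(u, sigma):
    """Density c_Sigma(u) of a Gaussian copula at a point u in (0, 1)^d."""
    q = norm.ppf(np.asarray(u, dtype=float))  # map to latent standard normals
    inv = np.linalg.inv(sigma)
    d = len(q)
    return (np.linalg.det(sigma) ** -0.5
            * np.exp(-0.5 * q @ (inv - np.eye(d)) @ q))

sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
print(gaussian_copula_density([0.3, 0.7], sigma))
```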
We estimate from D the marginal distributions Ψ (marginal parameters) for X_1, ..., X_d and the copula distribution Σ (covariance matrix), such that p(x) is modeled by q(x | Σ, Ψ), the discrete distribution induced on X by the copula C_Σ and the marginal CDFs, with F_j the CDF of the discrete variable X_j. In practice, the copula distribution is a continuous distribution on the unit cube, and q(· | Σ, Ψ) its discrete counterpart on X (see Supplementary Methods hereinafter).
We select, using maximum likelihood estimation, the marginal distributions from categorical, logarithmic, and negative binomial count distributions (see Supplementary Methods). Sampling the complete set of covariance matrices to estimate the association structure of copulas is computationally expensive for large datasets. We rely instead on a fast two-step approximate inference method: we infer separately each pairwise correlation factor Σ_ij and then project the constructed matrix Σ̂ onto the set of symmetric positive definite matrices to accurately recover the copula covariance matrix (see Methods hereinafter).
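The text does not fix a particular projection algorithm; a minimal sketch assuming one common choice, eigenvalue clipping followed by rescaling to unit diagonal:

```python
import numpy as np

def nearest_correlation(sigma_hat, eps=1e-8):
    """Clip negative eigenvalues to make the matrix positive semidefinite,
    then rescale so the diagonal is exactly one."""
    s = (sigma_hat + sigma_hat.T) / 2.0            # enforce symmetry
    w, v = np.linalg.eigh(s)
    s_psd = v @ np.diag(np.clip(w, eps, None)) @ v.T
    d = np.sqrt(np.diag(s_psd))
    return s_psd / np.outer(d, d)

# A pairwise-assembled matrix that is not positive definite:
bad = np.array([[1.0, 0.9, -0.9],
                [0.9, 1.0, 0.9],
                [-0.9, 0.9, 1.0]])
print(np.linalg.eigvalsh(nearest_correlation(bad)))  # all eigenvalues >= 0
```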
We collect five corpora from publicly available sources: population census (USA and MERNIS) as well as surveys from the UCI Machine Learning repository (ADULT, MIDUS, HDV). From each corpus, we create populations by selecting subsets of attributes (columns) uniformly. The resulting 210 populations cover a large range of uniqueness values (0 to 0.96), numbers of attributes (2 to 47), and records (7,108 to 9 million individuals). For readability purposes, we report in the main text the numerical results for all five corpora but show figures only for USA. Figures for MERNIS, ADULT, MIDUS, HDV are similar and available in Supplementary Information.
Fig. 1A shows that, when trained on the entire population, our model correctly estimates the population uniqueness Ξ, i.e. the expected percentage of unique individuals in (x^(1), ..., x^(n)). The mean absolute error (MAE) between the empirical uniqueness of our population Ξ and the estimated uniqueness Ξ̂ is 0.028 ± 0.026 [mean ± s.d.] for USA and 0.018 ± 0.019 on average across every corpus (see Table 1). Fig. 1A and Supplementary Figure 1 furthermore show that our model correctly estimates uniqueness across all values of uniqueness, with low within-population s.d. (Supplementary Table 3).
Fig. 1B shows that our model estimates population uniqueness very well even when the dataset is heavily sampled (see Supplementary Figure 2 for other populations). For instance, our model achieves an MAE of 0.029 ± 0.015 when the dataset only contains 1% of the USA population and an MAE of 0.041 ± 0.053 on average across every corpus. Table 1 shows that our model reaches a similarly low MAE, usually below 0.050, across corpora and sampling fractions.
Likelihood of successful re-identification
Once trained, we can use our model to estimate the likelihood of his employer having correctly re-identified John Doe, our 50-year-old male from Berkeley with breast cancer. More specifically, given an individual record x, we can use the trained model to compute the likelihood for this record x to be unique in the population. Our model takes into account information on both marginal prevalence (e.g. breast cancer prevalence) and global attribute association (e.g. gender and breast cancer). Since the CDF of a Gaussian copula distribution has no closed-form expression, we evaluate q(x | Σ, Ψ) with a numerical integration of the latent continuous joint density f(·) inside the hyperrectangle defined by the d components (x_1, x_2, ..., x_d). We assume no prior knowledge of the order of outcomes inside marginals for nominal attributes and randomize their order.
Fig. 2A shows that, when trained on 1% of the USA populations, our model predicts individual uniqueness very well, achieving a mean AUC (area under the receiver operating characteristic curve) of 0.89. For each population, to avoid overfitting, we train the model on a single 1% sample, then select 1,000 records, independent from the training sample, to test the model. For re-identifications that the model predicts to be always correct (ξ̂x > 0.95, estimated individual uniqueness above 95%), the likelihood of them being incorrect (false-discovery rate) is 5.26% (see bottom-right inset in Fig. 2A). ROC curves for the other populations are available in Supplementary Figure 3 and have overall a mean AUC of 0.93 and a mean false-discovery rate of 6.67% for ξ̂x > 0.95 (see Supplementary Table 1).
Finally, Fig. 2B shows that our model outperforms even the best theoretically achievable prediction using only population uniqueness, i.e. assigning the score ξ̂^(i) = Ξ to every individual (ground-truth population uniqueness, see Supplementary Methods). We use the Brier score [49] to measure the calibration of probabilistic predictions:

BS = (1/N) Σ_{i=1}^{N} (ξ^(i) − ξ̂^(i))²

with, in our case, ξ^(i) the actual uniqueness of the record x^(i) (1 if x^(i) is unique and 0 if not) and ξ̂^(i) the estimated likelihood. Our model obtains scores on average 39% lower than the best theoretically achievable prediction using only population uniqueness, emphasizing the importance of modeling individuals' characteristics.
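A minimal sketch of this calibration metric, with illustrative values:

```python
import numpy as np

def brier_score(xi_true, xi_pred):
    """Mean squared difference between binary uniqueness outcomes (1 if the
    record is population-unique, 0 otherwise) and predicted likelihoods."""
    xi_true, xi_pred = np.asarray(xi_true), np.asarray(xi_pred)
    return float(np.mean((xi_true - xi_pred) ** 2))

print(brier_score([1, 0, 1, 1], [0.9, 0.2, 0.8, 0.6]))  # 0.0625
```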
Appropriateness of the de-identification model
Using our model, we revisit the (successful) re-identification of Gov. Weld (reference is made to Sweeney, L. Weaving technology and policy together to maintain confidentiality. J. Law Med. Ethics 25, 98-110 (1997)). We train our model on the 5% Public Use Microdata Sample (PUMS) files using ZIP code, date of birth, and gender, and validate it using the last national estimate (reference is made to Golle, P. Revisiting the uniqueness of simple demographics in the US population. In 5th ACM Workshop on Privacy in the Electronic Society (ACM, 2006)). We show that, as a male born on July 31, 1945 and living in Cambridge (02138), the information used by Latanya Sweeney at the time, William Weld was unique with a 58% likelihood (ξx = 0.58 and κx = 0.77), meaning that Latanya Sweeney's re-identification had a 77% chance of being correct. We show that, if his medical records had included number of children (5 for William Weld), her re-identification would have had a 99.8% chance of being correct. Fig. 3A shows that the same combination of attributes (ZIP code, date of birth, gender, and number of children) would also identify 79.4% of the population in Massachusetts with high confidence (ξx > 0.80). We finally evaluate the impact of specific attributes on William Weld's uniqueness. We either change the value of one of his baseline attributes (ZIP code, date of birth, or gender) or add one extra attribute, in both cases picking the attribute at random from its distribution (see Supplementary Methods). Fig. 3C shows for instance that individuals with 3 cars or no car are harder to re-identify than those with 2 cars. Similarly, it shows that it would not take much to re-identify people living in Harwich Port, MA, a city of fewer than 2,000 inhabitants.
Modern datasets contain a large number of points per individual. For instance, the data broker Experian sold Alteryx access to a de-identified dataset containing 248 attributes per household for 120M Americans; Cambridge University researchers shared anonymous Facebook data for 3M users collected through the myPersonality app and containing, amongst other attributes, users' age, gender, location, status updates, and results on a personality quiz. These datasets do not necessarily share all the characteristics of the ones studied here. Yet, our analysis of the re-identification of Gov. Weld by Latanya Sweeney shows that few attributes are often enough to render the likelihood of correct re-identification very high. For instance, Fig. 3B shows that the average individual uniqueness increases fast with the number of collected demographic attributes and that 15 demographic attributes would render 99.98% of people in Massachusetts unique.
Our results, first, show that few attributes are often sufficient to re-identify with high confidence individuals in heavily incomplete datasets and, second, reject the claim that sampling or releasing partial datasets, e.g. from one hospital network or a single online service, provides plausible deniability. Finally, they show that, third, even if population uniqueness is low (an argument often used to justify that data is sufficiently de-identified to be considered anonymous), many individuals are still at risk of being successfully re-identified by an attacker using our model.
As standards for anonymization are being redefined, including by national and regional data protection authorities in the EU, it is essential for them to be robust and to account for new threats like the one we present in this paper. They need to take into account the individual risk of re-identification and the lack of plausible deniability, even if the dataset is incomplete, as well as legally recognize the broad range of provable privacy-enhancing systems and security measures that would allow data to be used while effectively preserving people's privacy.
Discussion
In this disclosure, we proposed and validated a statistical model to quantify the likelihood for a re-identification attempt to be successful, even if the disclosed dataset is heavily incomplete.
Beyond the claim that the incompleteness of the dataset provides plausible deniability, our method also challenges claims that a low population uniqueness is sufficient to protect people's privacy. Indeed, an attacker can, using our model, correctly re-identify an individual with high likelihood even if the population uniqueness is low (Fig. 3A). While more advanced guarantees like k-anonymity would give every individual in the dataset some protection, they have been shown to be NP-hard to achieve, difficult to attain in modern high-dimensional datasets, and not always sufficient.
While developed to estimate the likelihood of a specific re-identification being successful, our model can also be used to estimate population uniqueness. We show in Supplementary Note 1 that, while not its primary goal, our model performs consistently better than existing methods to estimate population uniqueness on all five corpora (Supplementary Figure 4, P < 0.05 in 78 cases out of 80), and consistently better than previous attempts to estimate individual uniqueness. Existing approaches, indeed, exhibit unpredictably large over- and under-estimation errors. Finally, a recent work quantifies the correctness of individual re-identification in incomplete (10%) hospital data using complete population frequencies. Compared to this work, our approach requires neither external data nor the assumption that such external data are complete.
To study the stability and robustness of our estimations, we perform further experiments (see Supplementary Notes 2-8 hereinafter).
First, we analyze the impact of marginal and association parameters on the model error, and show how to use exogenous information to lower it. Table 1 and Supplementary Note 7 show that, at very small sampling fractions (below 0.1%), where the error is the largest, the error is mostly determined by the marginals, and converges after a few hundred records when the exact marginals are known. The copula covariance parameters exhibit no significant bias, and their error decreases fast as the sample size increases (Supplementary Note 8).
As our method separates marginal and association structure inference, exogenous information from larger data sources could also be used to estimate marginals with higher accuracy. For instance, count distributions for attributes such as date of birth or ZIP code could be directly estimated from national surveys. We replicate our analysis on the USA corpus using a subsampled dataset to infer the association structure along with the exact counts for marginal distributions. Incorporating exogenous information reduces, e.g., the mean MAE of uniqueness across all corpora by 48.6% (P < 0.01, Mann-Whitney) for a 0.1% sample. Exogenous information becomes less useful as the sampling fraction increases (Supplementary Table 2).
Second, our model assumes that D is either uniformly sampled from the population of interest X or, as several census bureaus do, released with post-stratification weights to match the overall population. We believe this to be a reasonable assumption, as biases in the data would greatly affect its usefulness and affect any application of the data, including our model. To overcome an existing sampling bias, the model can be (i) further trained on a random sample from the population X (e.g., microdata census or survey data) and then applied to a non-uniform released sample (e.g. hospital data, not uniformly sampled from the population), or (ii) trained using better, potentially unbiased, estimates for marginals or association structure coming from other sources (see hereinbefore).
Third, since D is a sample from the population X, only the records that are unique in the sample can be unique in the population. Hence, we further evaluate the performance of our model only on records that are sample-unique and show that this only marginally decreases the AUC (Supplementary Note 5). We therefore prefer not to restrict our predictions to sample-unique records, as (a) our model needs to perform well on non-sample-unique records for us to be able to estimate correctness, and (b) this keeps the method robust if oversampling or sampling with replacement were to have been used.
Methods
Inferring marginal distributions
Marginals can be either (i) unknown and estimated from the marginals of the population sample X_s (this is the assumption used in the main text), or (ii) known, with their exact distribution and cumulative density function directly available.
In the first case, we fit marginal counts to categorical (naive plug-in estimator), negative binomial, and logarithmic distributions using maximum log-likelihood. We compare the obtained distributions and select the best likelihood according to its Bayesian information criterion (BIC):

BIC = k ln(n_D) − 2 ln(L̂)

where L̂ is the maximized value of the likelihood function, n_D the number of individuals in the sample D, and k the number of parameters in the fitted marginal distribution.
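A minimal sketch of BIC-based selection on a toy marginal; a Poisson fit stands in here for the parametric candidates named above, with the categorical plug-in for contrast:

```python
import numpy as np
from scipy import stats

def bic(log_likelihood, k, n_d):
    """BIC = k * ln(n_D) - 2 * ln(L-hat); lower is better."""
    return k * np.log(n_d) - 2.0 * log_likelihood

counts = np.array([1, 2, 2, 3, 1, 1, 4, 2, 1, 6])  # toy marginal counts

# Candidate 1: Poisson (one parameter; the rate MLE is the sample mean).
ll_poisson = stats.poisson.logpmf(counts, counts.mean()).sum()
# Candidate 2: categorical plug-in (one parameter per observed outcome).
vals, freq = np.unique(counts, return_counts=True)
ll_categorical = np.sum(freq * np.log(freq / len(counts)))
print(bic(ll_poisson, 1, len(counts)),
      bic(ll_categorical, len(vals) - 1, len(counts)))
```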
Inferring the parameters of the latent copula
Each cell Σ_ij of the covariance matrix Σ of a multivariate copula distribution is the correlation parameter of a pairwise copula distribution. Hence, instead of inferring Σ from the set of all covariance matrices, we separately infer every cell Σ_ij ∈ [0, 1] from the joint sample of D_i and D_j. We first measure the mutual information I(D_i; D_j) between the two attributes and select σ = Σ_ij minimizing the Euclidean distance between the empirical mutual information and the mutual information of the inferred joint distribution.
In practice, since the c.d.f. of a Gaussian copula is not tractable, we use a bounded Nelder-Mead minimization algorithm. For a given (σ, (Ψ_i, Ψ_j)), we sample from the distribution q(· | σ, (Ψ_i, Ψ_j)) and generate a discrete bivariate sample Y from which we measure the objective:

\left( \hat{I}(D_i; D_j) - I(Y_1; Y_2) \right)^2

We then project the obtained \hat{\Sigma} matrix onto the set of semi-definite positive (SDP) matrices by solving the following optimization problem:

\min_{\Sigma \succeq 0} \; \lVert \Sigma - \hat{\Sigma} \rVert_F
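A minimal sketch of this pairwise search follows, assuming a SciPy version recent enough to support bounds with Nelder-Mead; mutual_info is a naive plug-in estimator (the permutation-adjusted variant described below would replace it in practice), and the inverse marginal c.d.f.s Fi_inv, Fj_inv are assumed available from the fitted marginals:

import numpy as np
from scipy import optimize, stats

def mutual_info(x, y):
    """Plug-in mutual information I(X;Y) in nats between two discrete samples."""
    def entropy(labels):
        _, c = np.unique(labels, return_counts=True, axis=0)
        p = c / c.sum()
        return -np.sum(p * np.log(p))
    return entropy(x) + entropy(y) - entropy(np.stack([x, y], axis=1))

def infer_sigma(mi_target, Fi_inv, Fj_inv, n_mc=20000, seed=0):
    """Find the cell Sigma_ij whose implied discrete bivariate distribution
    matches the empirical mutual information (bounded Nelder-Mead search)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_mc, 2))  # base normals, reused across evaluations

    def objective(s):
        s = float(s[0])
        # correlate the latent normals, map to U(0,1), then through the marginals
        z2 = s * z[:, 0] + np.sqrt(1.0 - s * s) * z[:, 1]
        y1 = Fi_inv(stats.norm.cdf(z[:, 0]))
        y2 = Fj_inv(stats.norm.cdf(z2))
        return (mutual_info(y1, y2) - mi_target) ** 2

    res = optimize.minimize(objective, x0=[0.5], method="Nelder-Mead",
                            bounds=[(0.0, 0.999)])
    return float(res.x[0])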
Modeling the association structure using mutual information
We use the pairwise mutual information to measure the strength of association between attributes. For a dataset D, we denote by I_D the mutual information matrix in which each cell I(D_i; D_j) is the mutual information between attributes D_i and D_j. When evaluating mutual information from small samples, the obtained scores often overestimate the strength of association. We apply a correction for randomness using a permutation model:

AMI(D_i; D_j) = \frac{I(D_i; D_j) - \mathbb{E}[I(D_i; \pi(D_j))]}{\max\{H(D_i), H(D_j)\} - \mathbb{E}[I(D_i; \pi(D_j))]}

In practice, we estimate the expected mutual information between D_i and D_j with successive permutations π of D_j. We found that the adjusted mutual information provides a significant improvement over the naive estimator for small samples and large support sizes |X|.

Theoretical and empirical population uniqueness
For n individuals x(1), x(2), …, x(n) drawn from X, the uniqueness Ξ_X is the expected percentage of unique individuals. It can be estimated either (i) by computing the mean of the individual uniqueness or (ii) by sampling a synthetic population of n individuals from the copula distribution. In the former case, we have

\Xi_X = \mathbb{E}\Big[\tfrac{1}{n} \textstyle\sum_{i=1}^{n} \mathbb{1}_{x^{(i)}}\Big]

where 𝟙_x equals one if there exists a single individual i such that x(i) = x, and zero otherwise. The number T_x of individuals sharing a record x follows a binomial distribution B(n, p(x)). Therefore

\xi_x = (1 - p(x))^{n-1} \quad \text{and} \quad \Xi_X = \sum_{x \in X} p(x)\,(1 - p(x))^{n-1}.

This requires iterating over all combinations of attributes, whose number grows exponentially as the number of attributes increases, and quickly becomes computationally intractable. The second method is therefore often more tractable, and we use it to estimate population uniqueness in the disclosure.
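As an illustration, the two estimation routes can be sketched as follows; empirical_uniqueness implements method (ii) on a synthetic population drawn with Algorithm 1 (see the sampling sketch hereinafter), and exact_uniqueness implements the closed form of method (i):

import numpy as np

def empirical_uniqueness(records):
    """Fraction of records occurring exactly once in a sampled population.

    records: (n, d) array of discrete attribute values, one row per individual."""
    _, counts = np.unique(records, axis=0, return_counts=True)
    return np.sum(counts == 1) / records.shape[0]

def exact_uniqueness(p, n):
    """Method (i): closed form over all combinations x, given their probabilities
    p(x); intractable when the number of attribute combinations is large.

    p: 1-D array with one probability per combination of attributes."""
    return np.sum(p * (1.0 - p) ** (n - 1))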
For cumulative marginal distributions F_1, …, F_d and copula correlation matrix Σ, Algorithm 1 (Supplementary Methods) samples n individuals from q(· | Σ, Ψ) using the latent copula distribution. From the n generated records (y(1), y(2), …, y(n)), we compute the empirical uniqueness.

Individual likelihood of uniqueness and correctness
The probability distribution q(· | Σ, Ψ) can be computed by integrating over the latent copula density. Note that the marginal distributions X_1 to X_d are discrete, causing the inverses F_1^{-1} to F_d^{-1} to have plateaus. When estimating p(x), we integrate the latent copula distribution over the hypercube [x_1 − 1, x_1] × … × [x_d − 1, x_d]:

p(x) = \int_{\Phi^{-1}(F_1(x_1 - 1))}^{\Phi^{-1}(F_1(x_1))} \cdots \int_{\Phi^{-1}(F_d(x_d - 1))}^{\Phi^{-1}(F_d(x_d))} \phi_\Sigma(z)\, dz

with φ_Σ the density of a zero-mean multivariate normal (MVN) with correlation matrix Σ. Several methods have been proposed in the literature to estimate MVN rectangle probabilities. Genz and Bretz proposed a randomized quasi-Monte Carlo method, which we use to estimate the discrete copula density. Reference is made to Genz, A. Numerical computation of multivariate normal probabilities. J. Comput. Graph. Stat. 1, 141-149 (1992) and Genz, A. & Bretz, F. Computation of Multivariate Normal and t Probabilities (Springer Science & Business Media, 2009).
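For illustration, recent SciPy releases expose an MVN c.d.f. with a lower integration limit whose integration follows Genz's method, which gives a direct sketch of this rectangle probability; the marginal c.d.f.s marg_cdfs are assumed available from the fitted marginals:

import numpy as np
from scipy.stats import multivariate_normal, norm

def record_probability(x, marg_cdfs, Sigma):
    """p(x) under the latent Gaussian copula, as an MVN rectangle probability.

    x: sequence of d discrete attribute values (integer-coded).
    marg_cdfs: list of d marginal c.d.f.s F_i; F_i(x_i - 1) gives the plateau's
    lower edge. Requires SciPy >= 1.10 for the lower_limit argument."""
    upper = norm.ppf([F(v) for F, v in zip(marg_cdfs, x)])
    lower = norm.ppf([F(v - 1) for F, v in zip(marg_cdfs, x)])
    return multivariate_normal.cdf(upper, mean=np.zeros(len(x)), cov=Sigma,
                                   lower_limit=lower)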
The likelihood ξ_x for an individual's record x to be unique in a population of n individuals can be derived from p_X(X = x):

\xi_x = (1 - p_X(X = x))^{n-1}

which the copula model then estimates as

\hat{\xi}_x = (1 - q(x \mid \Sigma, \Psi))^{n-1}.

Similarly, the likelihood κ_x for an individual's record x to be correctly matched in a population of n individuals can be derived from p_X(X = x). With T = \sum_{i=1}^{n} \mathbb{1}[x^{(i)} = x] - 1, the number of potential false positives in the population, we have

\kappa_x = \mathbb{E}\left[\frac{1}{T + 1}\right].

Note that, since records are independent, T follows a binomial distribution B(n − 1, p(x)). Expanding the binomial expectation, we obtain:

\kappa_x = \frac{1 - (1 - p(x))^{n}}{n\, p(x)}.
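In code, these two closed forms reduce to a few lines (a sketch consistent with the definitions above):

def uniqueness(p_x, n):
    # xi_x: none of the other n - 1 individuals draws the record x
    return (1.0 - p_x) ** (n - 1)

def correctness(p_x, n):
    # kappa_x = E[1/(T + 1)] with T ~ B(n - 1, p(x)); the closed form uses the
    # standard identity E[1/(T + 1)] = (1 - (1 - p)^n) / (n p) for binomial T
    if p_x == 0.0:
        return 1.0  # limit of the expression as p(x) tends to zero
    return (1.0 - (1.0 - p_x) ** n) / (n * p_x)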
Data availability
The USA corpus, extracted from the 1-Percent Public Use Microdata Sample (PUMS) files, is available at https://www.census.gov/main/www/pums.html. The 5% PUMS files used to estimate the correctness of Governor Weld's re-identification are also available at the same address. The ADULT corpus, extracted from the Adult Income dataset, is available at https://archive.ics.uci.edu/ml/datasets/adult. The HDV corpus, extracted from the Histoire de vie survey, is available at https://www.insee.fr/fr/statistiques/2532244. The MIDUS corpus, extracted from the Midlife in United States survey, is available at https://www.icpsr.umich.edu/icpsrweb/ICPSR/series/203.

The MERNIS corpus is extracted from a complete population database of virtually all 48 million individuals born before early 1991 in Turkey, which was made available online in April 2016 after a data leak from Turkey's Central Civil Registration System. Our use of this data was approved by Imperial College as it provides a unique opportunity to perform uniqueness estimation on a complete census survey. Due to the sensitivity of the data, we have only analyzed a copy of the dataset where every distinct value was replaced by a unique integer to obfuscate records, without loss of precision for uniqueness modeling.
A complete description of each corpus is available in the Supplementary Information (hereinafter).
Supplementary Methods

Data collection
We assembled, for this project, a rich database of 210 different datasets from five corpora, with various levels of uniqueness and socio-demographic, survey, and health attributes that would be reasonable quasi-identifiers.
The USA datasets are extracted from the 1-Percent Public Use Microdata Sample (PUMS) files, a collection of 3,061,692 individual records from the 2010 US Census of Population and Housing. The PUMS files are available online on the US Census Bureau website and contain the 11 attributes we use: state FIP, county, PUMA, number of vehicles, sex, date of birth, marital status, race, educational attainment, employment status, and occupation.
The Adult Income dataset (ADULT) is a canonical Machine Learning dataset, composed of 32,561 individuals from the 1994 US Census database. The ADULT dataset is available in the UCI Machine Learning Repository and contains the 10 nominal and ordinal attributes we use: age, workclass, education-num, marital-status, occupation, relationship, race, sex, hours-per-week, and native-country.
MERNIS is a complete population database of virtually all 48 million individuals born before early 1991 in Turkey, which was made available online in April 2016 after a data leak from Turkey's Central Civil Registration System. Our use of this data was approved by Imperial College as it provides a unique opportunity to perform uniqueness estimation on a complete census survey. Due to the sensitivity of the data, we have only analyzed a copy of the dataset where every distinct value was replaced by a unique integer to obfuscate records, without loss of precision for uniqueness modeling. We have analyzed a sample of 8,820,050 individuals (district of Istanbul) using 8 attributes: year, month, and day of birth, home city, district, address, and birthplace city and district.
The Histoire de vie (HDV) dataset is composed of 13,500 individual responses to a 2003 survey by the French National Institute of Statistics and Economic Studies (INSEE), and is available on INSEE's website. After pre-processing and removal of null responses, the dataset contains the 632 attributes we use for 8,403 individuals.
Midlife in the United States (MIDUS) is a longitudinal survey of 7,108 individuals, comprising physical health, psychological well-being, and social variables. The survey files are available on the Inter-university Consortium for Political and Social Research (ICPSR) website. After pre-processing and removal of null responses, the dataset contains 415 attributes.
We also use the 5% PUMS files from 1990 to estimate the correctness of Governor Weld's re-identification and provide population uniqueness estimates in Fig. 4 (Main Text), for which we used 15 attributes: ZIP code (inferred from the PUMA code), date of birth (inferred from age), marital status, citizenship status, class, occupation, mortgage, state of work, race, vehicle occupancy, time of departure for work, sex, school, number of vehicles, and number of own natural born/adopted children.
Experiments to validate correctness
For each surveyed population D, we first infer the marginal distributions and then the correlation matrix of the latent copula distribution. We measure the uniqueness values Ξ_X (ground truth) and Ξ̂_X. For a corpus C of |C| populations, we report the uniqueness MAE as described in Algorithm 2.

[Algorithm 1: SAMPLECOPULA(Σ, (F_i), n) — draws n records from the latent Gaussian copula using a Cholesky decomposition of Σ.]

[Algorithm 2: VALIDATEMAE(X, m) — for k = 1 to m: infers the marginals and Σ from X, estimates uniqueness from a synthetic population drawn with Algorithm 1, and accumulates the absolute error against the empirical uniqueness Ξ_X.]
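A minimal Python sketch of Algorithm 1 follows; the function and argument names are illustrative assumptions rather than the original implementation:

import numpy as np
from scipy.stats import norm

def sample_copula(Sigma, marg_ppfs, n, seed=0):
    """Algorithm 1 (sketch): draw n records from the latent Gaussian copula.

    Sigma: d x d copula correlation matrix (assumed positive definite).
    marg_ppfs: list of d inverse marginal c.d.f.s F_i^{-1} mapping (0,1) to
    attribute values."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)                       # Sigma = L L^T
    z = rng.standard_normal((n, len(marg_ppfs))) @ L.T  # correlated latent normals
    u = norm.cdf(z)                                     # latent uniform marginals
    return np.stack([ppf(u[:, i]) for i, ppf in enumerate(marg_ppfs)], axis=1)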
Subsampling experiments
For each surveyed population D and each sampling fraction p, we select m = 100 samples S_k, each containing n_s = n × p individuals. We use an algorithm similar to Algorithm 2 to estimate the MAE distribution for a corpus C (Algorithm 3).

[Algorithm 3: SUBSAMPLEMAE(X, m, n_s) — for k = 1 to m: draws a sample X_S of n_s records from X without replacement, infers the marginals and Σ from X_S, and accumulates the absolute error between empirical and estimated uniqueness.]

Brier score for a uniform method
If the same likelihood ξ_x = p is assigned to every individual, the Brier score when predicting individual uniqueness is:

BS = \mathbb{E}[(p - y)^2] = p^2 - 2p\,\Xi_X + \Xi_X

where y ∈ {0, 1} indicates population uniqueness, so that E[y] = E[y²] = Ξ_X. The lowest Brier score for a uniform likelihood is obtained when p = Ξ_X. In that case, we have

BS = \Xi_X - \Xi_X^2 = \Xi_X (1 - \Xi_X).

Brier score for a random guess method
If ξ_x ~ U(0, 1) is uniformly drawn at random, the Brier score when predicting individual uniqueness is:

BS = \mathbb{E}[(\xi_x - y)^2] = \mathbb{E}[\xi_x^2] - 2\,\mathbb{E}[\xi_x]\,\Xi_X + \Xi_X = \tfrac{1}{3} - \Xi_X + \Xi_X = \tfrac{1}{3}.
43;4 d433: io Impact of specific attributes on individual uniqueness In Fig. 3C, we evaluate the impact of specific attributes on William Weld's uniqueness.
To do so, we train our copula model on the PUMS corpus for the Massachusetts population (see Results hereinbefore).
For each baseline attribute (ZIP code, date of birth, or gender), we then perform 1,000 trials, randomly replacing the value of this attribute with a new value sampled at random from its marginal distribution. For each trial, we estimate uniqueness.
For each single additional attribute, we sample from the marginal distribution of this additional attribute, add the new value to the base characteristics (58-year-old male from Cambridge, MA), and estimate the uniqueness using the 3+1 attributes. We perform 1,000 trials for each of the eleven additional attributes.
Fig. 3C reports these scores of uniqueness, grouped by attribute: each boxplot shows the distribution of uniqueness obtained after replacing (resp. adding) a baseline (resp. additional) attribute; a sketch of one trial loop follows below. In order of appearance, the attributes used from the PUMS corpus are ZIP code (ZCTAs extrapolated from PUMA codes), gender (SEX variable in the PUMS corpus), date of birth (extrapolated from AGE with random month and day), Race, citizenship (CITIZEN), school enrollment (SCHOOL), vehicle occupancy (RIDERS), place of work - state (POWState), mortgage (MORTGAGE), marital status (MARITAL), class of worker (CLASS), number of vehicles (VEHICLES), and occupation (OCCUP).
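In the sketch below, model.sample_marginal and model.uniqueness are hypothetical helpers standing in for the fitted copula model; replacing a baseline attribute and adding a new attribute are the same operation on the record mapping:

import numpy as np

def attribute_impact(model, base_record, attr, n_pop, trials=1000):
    """Distribution of uniqueness scores when `attr` is drawn at random from
    its marginal; base_record holds e.g. the three base characteristics of a
    58-year-old male from Cambridge, MA."""
    scores = np.empty(trials)
    for t in range(trials):
        record = dict(base_record)
        record[attr] = model.sample_marginal(attr)  # random draw from the marginal
        scores[t] = model.uniqueness(record, n_pop)
    return scores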
Supplementary note 1: comparison with previous work
While developed to estimate the likelihood that a specific re-identification will be successful, our model can also be used to estimate population uniqueness. Previous approaches, based on extrapolations of the contingency table of a disclosed random sample of the dataset, have been proposed to model population uniqueness. A contingency table is here a d-dimensional table where each cell (or class) counts the number of individuals with a specific combination of the d attributes.
Using our previous notation, let |X| denote the number of potential combinations of attributes, i.e. the true number of cells in the complete contingency table. F_i (resp. f_i) denotes the size of the i-th cell in the contingency table of X (resp. X_S), with an arbitrary order on cells. We define S_k = Σ_i 1[F_i = k] (the number of cells of size k in X) and s_k = Σ_i 1[f_i = k] (the number of cells of size k in X_S), while the empirical uniqueness of the population is Ξ_X = S_1 / n. All of these methods then use parametric models, fitting a specific distribution to the unordered frequency distribution of the contingency table, and estimate the population uniqueness from the unordered estimated distribution of cell frequencies.
Following studies comparing these estimators, we selected the most promising methods to estimate uniqueness: the Ewens model, the Slide Negative Binomial (SNB) model, the Pitman model, and the Zayatz model.
The Ewens model, based on the multivariate Ewens distribution, estimates uniqueness from the number of sample uniques and the sample size. The Zayatz model estimates uniqueness through the probability P(F_i = 1 | f_i = 1) that a sample-unique cell is population unique, where F_i, conditional on f_i = 1, follows a hypergeometric distribution. The Pitman model assumes a two-parameter Poisson-Dirichlet distribution on the cell frequencies; its α and θ parameters are estimated from the log-likelihood of the sampling distribution, using Newton-Raphson.
The Slide Negative Binomial (SNB) model assumes a translated negative binomial distribution on the F_i frequencies (excluding zero counts). The parameters α, β are estimated from the sample by solving a system of two non-linear equations, and |X| is estimated separately from the sample.
We have implemented the original methods and verified the numerical results using the open-source ARX toolbox. Sometimes, these methods do not converge, or estimate uniqueness above 1. We therefore limit scores to [0, 1] and, if a method does not converge on a specific sample (as occasionally happens with the SNB inference), we do not make a decision and discard the sample. Taking a conservative approach and assuming, e.g., that every individual is unique when the method does not converge gives the same results below (P < 0.05 in 78 cases out of 80 on Fig. 4). Specifically, the Pitman method estimates uniqueness to be above 1.0 for 12.4% of all trials, while the SNB (resp. Ewens) method does not converge in 4.2% (resp. 1.1%) of all trials.
According to the literature, both the Pitman and SNB methods provide accurate estimators of population uniqueness, with the Pitman method providing the best estimator for small sampling fractions. We found that, while not its primary goal, our method performs significantly better than all other approaches (P < 0.05 in 78 cases out of 80). Fig. 4 compares the mean absolute error (MAE) between empirical and estimated uniqueness for the four proposed methods and ours. Several of the methods proposed in the literature severely under- or overestimate the risk of re-identification, especially when their parameters are fitted on a small population sample. This is likely due to the fact that (i) fitting a specific distribution to frequency counts can inherently give biased results if inappropriate distributions are selected, and (ii) fitting the frequency counts of a population requires many more samples than a multivariate model in which each marginal distribution is estimated separately.
Finally, Skinner and Holmes and Skinner and Shlomo have proposed to use log-linear models to estimate the likelihood for sample-unique records to be population unique. Using the sample contingency table, these models can smooth an estimator of population uniqueness by taking into account the main effects from the sample marginals.
A log-linear model is a generalized linear model (GLM) used to model count data and contingency tables. It assumes that the response variable, the population count F_x of an equivalence class x ∈ X (the number of individuals with the record x in the population), follows a specific count distribution (e.g. Poisson) with mean λ_x, that is F_x ~ Po(λ_x). For a sampling fraction p, the sample count f_x also follows a Poisson distribution: f_x ~ Po(p λ_x).
In order to "borrow strength" between equivalence classes, a GLM is fitted to the sample using the canonical logarithmic link:

\log(p\, \lambda_x) = \beta^\top z_x
where β is a 1 × q parameter vector and z_x a q × 1 vector of the main effects of each marginal attribute, and potentially low-order interactions between attributes. It yields an estimated individual likelihood of uniqueness:

\hat{\xi}_x = \Pr[F_x = 1 \mid f_x = 1] = \exp(-(1 - p)\,\hat{\lambda}_x)

We have implemented log-linear models on all studied corpora. The log-linear models did not converge for 27 of the 210 studied populations (when all sample records are sample unique, or share the same frequency, a GLM cannot be fitted, as there is only one possible outcome). When they converge, they obtain poor calibration, with Brier scores on average 418% higher (lower is better) than our copula-based method (and strictly worse for 210 out of 210 tested populations). Fig. 12 shows the calibration of Poisson and Negative Binomial log-linear models on a 1% sample. This is to be compared with Fig. 2B, showing the calibration of our copula-based approach.
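For reference, a minimal sketch of the Poisson variant using statsmodels follows; the design matrix Z of main effects, the rescaling by the sampling fraction, and the closing formula reflect our reading of the published method, not its original code:

import numpy as np
import statsmodels.api as sm

def loglinear_uniqueness(Z, f, p):
    """Fit a Poisson log-linear model to sample cell counts and return the
    estimated likelihood of population uniqueness for each cell.

    Z: (num_cells, q) design matrix of main effects z_x.
    f: (num_cells,) sample counts f_x.
    p: sampling fraction."""
    fit = sm.GLM(f, Z, family=sm.families.Poisson()).fit()
    lam_hat = fit.fittedvalues / p  # fitted mean of f_x is p * lambda_x
    # Under the Poisson model, P(F_x = 1 | f_x = 1) = exp(-(1 - p) * lambda_x)
    return np.exp(-(1.0 - p) * lam_hat)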
Despite being an individual-level measure of uniqueness, log-linear models do not perform better at predicting individual uniqueness than simply using population uniqueness (assigning every individual x the overall population uniqueness ξ_x = Ξ_X). For instance, a Poisson (resp. Negative Binomial) log-linear model obtains Brier scores on average 56.4% (resp. 68.1%) higher than the Zayatz method (see above), and 233% (resp. 291%) higher than the best theoretically achievable prediction using only population uniqueness (for each individual x, assigning ξ_x = Ξ_X).
Supplementary note 2: correcting individual scores using population uniqueness
In our experiments, we noticed a small numerical discrepancy between the mean of the estimated likelihoods ξ̂_x and the estimated population uniqueness. Specifically, E[ξ̂_x] > Ξ̂_X in 66% of the tested populations (see Fig. 6). This is likely due to numerical errors between (i) sampling a population of n individuals to compute population uniqueness and (ii) sampling 1,000 individuals from the original dataset and computing their estimated likelihood ξ̂_x from q(x | Σ, Ψ) (e.g. for 9M individuals, a uniqueness of ξ̂_x = 0.90 corresponds to a small probability q(x | Σ, Ψ) = 1.17×10⁻⁸). We test here whether the population-level method (i) can be used to correct (normalize) the individual likelihoods from method (ii). Recall that, from a sample S, we estimate both the average uniqueness Ξ̂ and the likelihood ξ̂_x for each individual x. We apply a correction factor

\alpha = \hat{\Xi} \, / \, \mathbb{E}[\hat{\xi}_x]

and let ξ̂′_x = α ξ̂_x be the corrected likelihood of uniqueness for the record x.
This correction reduces the false-discovery rate for certain corpora (MERNIS, ADULT, HDV) (Fig. 5) and improves calibration (Fig. 7). However, we did not find enough evidence that the correction helps in general, and do not use it in this disclosure.
Supplementary note 3: using the exact marginals to improve predictions
As mentioned in the Discussion section hereinbefore, marginal distributions can be inadequately modeled when the sample size is small. Our method can incorporate exogenous information, such as better estimates for the marginals. Table 2 compares the performance of our model when using approximate marginals (inferred from the given sample) and exact marginals (prior knowledge of the marginal distributions from the complete population). Using the exact marginals decreases the MAE when the sample is small.
Supplementary note 4: testing for bias and heteroskedasticity in uniqueness prediction
Our estimate of uniqueness might be more accurate at lower or at higher uniqueness. We therefore test for potential biases in our estimates of population uniqueness (Ξ̂_X) and for homoscedasticity of errors. We use general linear mixed-effects models with restricted maximum likelihood (REML) on the estimates given by our copula method at 100% sampling size. We control for corpus effect by grouping observations by corpus (5 groups).
Overall, we find a statistically significant bias and heteroskedasticity, albeit both with negligible effects on uniqueness. Table 4 shows the results of modeling the error of Ξ̂_X (32,000 observations across corpora and repeated trials) and the RMSE between Ξ_X and Ξ̂_X (210 observations). Notably, there is no significant difference between corpora.
We further test for potential bias and heteroskedasticity in the predicted individual likelihoods of uniqueness. Table 5 shows the results of modeling ξ̂_x (210,000 observations), grouped by corpus and population (210 groups), as well as of modeling the RMSE between ξ_x and ξ̂_x, grouped by corpus (5 groups). The results validate the homoscedasticity of errors and exhibit no significant bias.
Supplementary note 5: sample unique records and individual uniqueness
As described in the text hereinbefore, we do not take into account whether a record x is unique in D. Our modeled uniqueness ξ_x = (1 − p(x))^{n−1} depends only on the probability of drawing x in the population and on n, the number of individuals in the population.
As D is sampled at random from the population, every record x that is not unique in the sample D cannot be unique in the population (ξ_x = 0). We therefore further evaluate the performance of our model only on records that are sample unique. Table 6 shows the AUC and FDR scores when the model is evaluated only on sample-unique records, and Fig. 8 the ROC curves for the same experiment.
We nevertheless prefer not to restrict our predictions to sample-unique records. First, this keeps the method more robust, e.g. if oversampling or sampling with replacement were to have been used; in that case, we can never rule out that a sample non-unique record is unique in the population. Second, in order to accurately measure the correctness κ_x of a matched record x, we need to ensure that the model performs well for any record, sample unique or not. Indeed, if 9 other records match x in the population X, and 2 other records in the sample D (x sample non-unique), x still has a 1 in 10 chance of being correctly re-identified.
However, potential adversaries running a re-identification attack on a released sample will use this sample to train the model and then estimate the uniqueness of the sample records. Therefore, an adversary's success at predicting uniqueness can also be measured on both sample-unique and non-unique records. For this reason, we also provide the ROC curves for the five corpora when the model is trained and tested on the same 1% sample (Fig. 13). We update the definition of ξ̂_x, the likelihood for an individual x to be unique in the population:

\hat{\xi}_x = \begin{cases} (1 - q(x \mid \Sigma, \Psi))^{n-1} & \text{if } x \text{ is unique in } D \\ 0 & \text{otherwise} \end{cases}

This yields a higher accuracy, as sample non-unique records are never population unique (perfect prediction).
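In code, the updated estimator is a one-line guard on top of the previous definition (illustrative sketch):

def uniqueness_sample_aware(p_x, n, sample_count):
    """Assign likelihood zero to sample non-unique records, which can never be
    population unique when D is sampled without replacement."""
    return 0.0 if sample_count > 1 else (1.0 - p_x) ** (n - 1)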
Supplementary note 6: multivariate mutual information
Our copula method takes into account the marginals and the pairwise association structure. While our model, when trained on the complete population, already performs very well, with an average MAE across corpora of 0.018 ± 0.019, more complex models, better capturing the complete association structure between attributes, might perform even better.
To investigate this, we compute the distribution of the triplewise interaction information I(X; Y; Z). Positive interaction information values denote redundant combinations (X and Y share the same information about Z), and negative values synergistic combinations (X and Y provide more information about Z together than they do individually).
We perform 100 trials per corpus. For MERNIS, ADULT, and HDV, the interaction factors are synergistic but very small, with an average mean ± s.d. of −0.001 ± 0.035 (MERNIS), −0.017 ± 0.036 (ADULT), and −0.002 ± 0.009 (HDV). For USA and MIDUS, the interaction factors are redundant, with an average mean ± s.d. of 0.309 ± 1.050 (USA) and 0.009 ± 0.152 (MIDUS).
Across all 500 trials, the triplewise interaction factors obtained are often null or redundant, with I(X; Y; Z) > −0.1 nat in 94% of all trials. This suggests that, at least for the five corpora we study, covering a broad range of uniqueness values and association patterns, pairwise associations capture most of the information.
Supplementary note 7: estimating population uniqueness at extremely small sampling fractions for large datasets
The MERNIS and USA populations contain respectively 8,820,049 and 3,061,692 individuals. Even a 0.1% sample still contains the records of thousands of individuals.
We here study the performance of our method at extremely small sample sizes for these two datasets.
Fig. 9A and C show that our model performs well down to a sample size of approximately 6,400 records. Fig. 9B and D then show that, when using the exact marginals, our model performs well even with as few as 200 records. Indeed, at small sampling fractions, marginals with high entropy, and therefore many unique records, such as the "postal address" attribute in MERNIS, are difficult to estimate without a few thousand records. Incorrectly inferring the distribution of "postal address" with only 50 or 100 records can lead to a large variance in population uniqueness estimates.
Supplementary note 8: error when modeling the count distribution and the copula parameters
Our method to estimate individual uniqueness relies on a generative model for p(x), the probability of drawing a record x from the joint distribution X, which we call q(x). Fig. 10 shows that the Kullback-Leibler divergence, a measure of distance between the true empirical distribution p(x) and the estimated distribution q(x), is very small (1.59 nats on average).
Fig. 11 furthermore shows that the estimated covariance matrix of the latent copula distribution is not significantly biased, even at small sampling fractions (pairwise correlations are, on average, neither higher nor lower than their value when trained on 100% of the population). Similarly, the variance of the covariance error decreases quickly, showing signs of convergence after a few hundred records.
Systems
The generative statistical approach herein described can be used to assess a likelihood of re-identification. For example, it can be used to estimate the correctness of re-identification of an individual, to estimate the uniqueness of an individual in a population, to estimate the probability that two data records come from the same person, and other similar properties.
The approach generally involves training a model, in particular a joint distribution X, and then using the model to assess the risk of re-identification. A generative statistical approach is followed whereby an estimate of the joint distribution of the complete population is made from a set of sample records, and an estimated distribution for the data is output.
The process includes assigning an individual score tailored to each individual based on their characteristics (by modelling the probability of drawing an individual record from the population). The approach is population-based: the process can predict correct re-identification, uniqueness, and/or record linkage inside a population, and can scale to billions of people. The approach is flexible: the process can combine prior knowledge about marginals or association structure, from previous work or measurements, to improve the predictions.
The approach can be applied to a variety of different data, such as demographic data, socio-demographic data (e.g., responses to an online survey), behavioural data (e.g., online tracking data, web usage, device fingerprints and the like), financial data (e.g., transaction summaries, loan applications and the like), insurance data (e.g., consumer profile, data from data brokers), census data (using public census microdata to train the method), and medical data (e.g., hospital or clinic records; administrative, diagnostic-related, or treatment information).
Referring also to Figure 25, a first computer system 1 for training a model 2 is shown.
The computer system 1 includes at least one processor 3 and memory 4 which stores software code 5 for a model trainer 6 executable by the processor(s) 3. Storage 7 holds a dataset 8 for a population or part of a population which is used to train a model 2. The computer system 1 may take the form of a desktop computer or workstation. Elements such as user input device(s) (e.g., keyboard, pointing device(s), etc.), user output devices (e.g., display(s), etc.), network interfaces and other peripheral devices are not shown for clarity.
Referring to Figure 26, a second computer system 11 for using the model 2 is shown.
The computer system 11 includes at least one processor 13 and memory 14 which stores software code 15 for an estimator 16 (or "calculator") executable by the processor(s) 13. Storage 17 holds the model 2, one or more records 18 (which may, for example, take the form of the dataset 8) and a value 19 of population size. The estimator 16 is used to generate output value(s) 20, 21, 22, such as uniqueness 20, correctness 21 and/or probability 22. The type of output 20, 21, 22 depends on the application and the type of assessment being performed.
The computer system 11 includes at least one user input device 23 (such as a keyboard, pointing device(s), microphone, or touchscreen), at least one user output device 24 (such as one or more displays) and one or more network interfaces 25 (such as a wired LAN network interface, a wireless LAN network interface, a mobile network (or "cellular network") interface and/or short-range communication network interfaces). The second computer system 11 may take the form of a desktop computer or workstation, or a handheld computing device, such as a tablet computer or mobile device (e.g., a smart phone or wearable device). The same computer system can be used for the first and second computer systems 1, 11.
Training
Figure 27 is a process flow diagram of a method of training a model in the form of a joint distribution. The method follows a generative statistical approach whereby it estimates, from a set of sample records, a joint distribution of a larger (e.g., complete) population, and outputs an estimated distribution for the data. In this example, the method uses a multivariate copula-based approach. However, other types of generative probability distributions (such as other copula distributions or vine copulas) or deep-learning generative methods may be used.
Referring to Figures 25, 27 and 28, the trainer 6 extracts all marginals (e.g., age, sex, postcode) from a dataset D (steps S1 and S2) and estimates the distribution for each marginal (step S3). The trainer 6 extracts all mutual information scores between pairs of variables (step S4). The trainer 6 estimates correlation scores between latent uniform marginal distributions, using the fitted marginals (step S5). The trainer 6 combines the correlation scores and the fitted marginals to form a joint distribution of the population, using a Gaussian copula (step S6). The joint distribution is then stored (step S7).
In more detail, the trainer 6 retrieves a dataset D (step S1). The dataset D preferably takes the form of a tabular dataset with one row per record and one column per attribute. The dataset contains n_D rows and d columns. The cells can contain nominal or ordinal data, categorical data (e.g., strings), or continuous data.
The trainer 6 extracts and infers the marginal distributions X_i for each column D_i of the dataset D (steps S2 & S3). Extracting and inferring the marginal distributions X_i may include building an exact histogram and using it as an estimated distribution, or fitting the attribute values to statistical distributions (e.g., Poisson, Zipf, Negative Binomial, etc.). Optionally, the data can be sorted based on discrete frequency by increasing or decreasing counts.
The trainer 6 computes the mutual information MI(D_i; D_j) between each pair of attributes (step S4). This can be achieved by using a plug-in empirical estimator from the specified joint bin frequencies, i.e., MI(X, Y) = H(X) + H(Y) − H(X, Y), where H is the plug-in entropy estimator, or corrected mutual information using, e.g., a shrinkage estimator or a Chao-Shen entropy estimator.
The trainer 6 computes the correlation factor between latent copula distributions U_i and U_j from the mutual information MI(D_i; D_j), and outputs a d-by-d correlation matrix Σ (step S5).
The trainer 6 forms a Gaussian copula distribution by associating the marginal distributions (X_i) and the correlation matrix Σ, and returns a probability density function p(x, n) for a record x to be drawn in a population of n individuals (step S6).
The joint distribution is stored (step S7).
Steps S4 and S5 can be combined by estimating the correlation matrix directly from the tabular dataset D. Steps S4, S5 and S6 can use other types of copula distributions, e.g., vine copulas, that include more information than the marginals and the pairwise correlation matrix Σ. Steps S3, S4, S5 and S6 can use other types of generative density estimation methods, such as Bayesian Networks (BN) or Neural Autoregressive Distribution Estimation (NADE), taking a tabular dataset as input and outputting a probability density function.
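Steps S1 to S7 can be strung together as follows; this sketch reuses the mutual_info and infer_sigma helpers sketched hereinbefore, assumes D is a pandas DataFrame, and uses a naive empirical inverse c.d.f. for the marginals, so it is illustrative rather than a definitive implementation:

import numpy as np

def empirical_ppf(values):
    """Inverse c.d.f. of one attribute's empirical (plug-in) marginal."""
    vals, counts = np.unique(values, return_counts=True)
    cum = np.cumsum(counts) / counts.sum()
    return lambda u: vals[np.searchsorted(cum, u, side="left")]

def train_model(D):
    """Steps S1-S7 end to end for a tabular dataset D."""
    cols = list(D.columns)                                    # S1
    ppfs = [empirical_ppf(D[c].to_numpy()) for c in cols]     # S2-S3
    d = len(cols)
    Sigma = np.eye(d)
    for i in range(d):
        for j in range(i + 1, d):                             # S4-S5, per pair
            mi = mutual_info(D[cols[i]].to_numpy(), D[cols[j]].to_numpy())
            Sigma[i, j] = Sigma[j, i] = infer_sigma(mi, ppfs[i], ppfs[j])
    return ppfs, Sigma                                        # S6-S7: the model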
More than one computer system 1 can be used for training the model 2. In particular, one or more computers can be used to prepare data, such as extracting marginals, mutual information scores and correlation scores, and another, different computer system may be used to combine correlation scores and fitted marginals to form the joint distribution.
Model embedding
Uses of the approach herein described in privacy certification, plausible deniability in a judicial setting, and record matching will now be described.
Privacy certification
A data-holding organisation A (which may be a company) has a pseudo-anonymous (i.e., without direct identifiers) dataset D containing, for each record i, demographic information d_i (such as age range, region where the person lives, ethnic information and so on), as well as private, potentially sensitive information x_i (e.g., one or more medical conditions). The organisation A wants to release the dataset D for research purposes or to share it with third parties, but does not want to create a privacy breach. Indeed, the organisation A wants to avoid this in view of the GDPR regulation.
The data-holding organisation A transmits all the data records d_i, but not the sensitive information x_i, to a certification organisation P for certification. The certification organisation P may be a company.
The certification organisation P trains the model on the dataset D and evaluates, for each record d_i, its uniqueness ξ and/or correctness κ, i.e., the probability of the person being unique in the population and/or the probability that this person would be correctly re-identified if matched. An overview of the process is shown in Figure 31.
If this probability is sufficiently low (i.e., below a predetermined threshold level), then the certification organisation P produces a certificate of compliance attesting to the absence of risk, so that the dataset D can be released. Otherwise, the certification organisation P can do one of the following. First, it can warn the organisation A that the dataset cannot be released. Secondly, it can flag specific records that pose a privacy risk, allowing the release of the rest of the dataset. Thirdly, it can use the list of records posing a risk to decrease the precision of one or more fields, for example, by replacing a first age range, for example 85 to 90, with a second, broader age range, for example 60+, or by coarsening an address, e.g. by truncating the post code.
The certification organisation P may repeat the process of training the model and evaluating uniqueness ξ and/or correctness κ until satisfactory.
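A sketch of the certification decision follows; model.correctness is a hypothetical helper returning the correctness κ for a record, and the threshold is a policy choice:

def certify(records, model, n_pop, threshold):
    """Return the records whose estimated correctness exceeds the threshold;
    an empty list corresponds to a certificate of compliance, otherwise the
    flagged records can be withheld or have fields coarsened."""
    return [r for r in records if model.correctness(r, n_pop) > threshold]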
It should be noted that the data-holding organisation A and the certifying organisation P could be two independent departments of the same organisation. It should also be noted that this process can be applied to any sort of sensitive data, not just medical data.
The data-holding organisation A need not transmit all of the records, but just a random subset of the records, to further enhance privacy. Thus, the process can be performed on a subset of the dataset.
The step of training the model and evaluating uniqueness ξ and/or correctness κ can also be used after the dataset has been released, i.e., a posteriori, by, for example, a government agency, an advocacy group or another interested party, to show that the organisation A has endangered the privacy of individuals by releasing a dataset. Thus, the process can be seen as one of auditing.
Plausible deniability / judicial setting
The approach disclosed herein can be used to screen characteristics of a suspect in legal proceedings. The method can be used when it is known that a person who committed a crime has one or more certain physical or demographic characteristics c, and a suspect has been found having those characteristics.
An assessor (for example, a forensic scientist, a lawyer or an independent service provider) can train a model on detailed datasets about part of the population (e.g., using a portion of census data) and compute the probability that there is a unique individual with the one or more characteristics c. A prosecutor might want to use this information, since a high probability would strengthen the case against the suspect. A defence lawyer might want to use this information, since a low probability would decrease the value of the evidence based on the characteristics c, thereby providing a measurable degree of plausible deniability.
Record matching
The approach disclosed herein can be used in record matching.
Two datasets A, B do not use mutually consistent identifiers or keys, but contain some common data. Consider the following.
A first dataset A contains, for each record i, a first type of identifier, a first set of demographic information d_A,i (e.g. age, post code, etc.) and some first dataset-specific information z_i. A second dataset B contains, for each record j, a second, different type of identifier, a second set of demographic information d_B,j and some second dataset-specific information m_j. The goal is to merge the two datasets, in other words, to create a new dataset AB containing all the information in the first and second datasets A, B, in such a way that a person appearing as i in the first dataset A and as j in the second dataset B appears only once in AB, with both z_i and m_j as information. The size of the total population potentially covered by the first and second datasets A, B is also known (e.g., the population of a country or city).
To achieve the goal, a data aggregator does the following. The data aggregator trains a model on the first dataset A. Then, for every record j in the second dataset B, the data aggregator looks for a record i in the first dataset A such that d_A,i = d_B,j. If one is found, the data aggregator computes the probability that the information d_A,i is unique in the population. If the probability exceeds a threshold, the aggregator merges the records i and j. The threshold is selected depending on the relative costs of false positives and false negatives, and can be determined directly since it is a threshold on a probability. A sketch of this procedure follows below.
The output is a merged dataset AB.
It should be noted that the order of the steps of looking for matching records and computing the probability can be reversed.
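The merge can be sketched with pandas as follows; the column names, the model interface and the threshold value are illustrative assumptions:

import pandas as pd

def merge_datasets(A, B, demo_cols, model, n_pop, threshold):
    """Merge records of A and B that share demographics d_A = d_B and whose
    demographic information is likely unique in the population."""
    merged = []
    for _, row_b in B.iterrows():
        matches = A.loc[(A[demo_cols] == row_b[demo_cols].values).all(axis=1)]
        if len(matches) != 1:
            continue  # no candidate, or ambiguous demographics, in A
        row_a = matches.iloc[0]
        p = model.probability(tuple(row_a[demo_cols]))  # p(d_A) from the model
        xi = (1.0 - p) ** (n_pop - 1)                   # likelihood of uniqueness
        if xi >= threshold:
            merged.append({**row_a.to_dict(), **row_b.to_dict()})
    return pd.DataFrame(merged)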
Modifications
It will be appreciated that various modifications may be made to the embodiments hereinbefore described. Such modifications may involve equivalent and other features which are already known in the design, manufacture and use of re-identification risk assessment systems and component parts thereof and which may be used instead of or in addition to features already described herein. Features of one embodiment may be replaced or supplemented by features of another embodiment.
Although claims have been formulated in this application to particular combinations of features, it should be understood that the scope of the disclosure of the present invention also includes any novel features or any novel combination of features disclosed herein either explicitly or implicitly or any generalization thereof, whether or not it relates to the same invention as presently claimed in any claim and whether or not it mitigates any or all of the same technical problems as does the present invention.
The applicants hereby give notice that new claims may be formulated to such features and/or combinations of such features during the prosecution of the present application or of any further application derived therefrom.

Claims (16)

  1. 1. A computer-implemented method comprising: * receiving a statistical generative and multivariate model of a distribution of a population of entities having a set of attributes; * receiving a record for a given entity x having a given set of attributes; * receiving a value n of population size; * determining a probability p(x, n) of the record for the given entity and population size; * determining, using the probability p(x, n) of the record for the given entity and the population size: - a first probability ξ_x that the given entity is unique in the population; and/or - for a given set of attributes, a second probability κ_x that the given entity is correctly identifiable; and * outputting the first and/or second probability as measure(s) of likelihood.
  2. 2. The method of claim 1, wherein the given entity is a first entity and the record is a first record, wherein the method further comprises: receiving a second record for a second entity having a respective set of attributes; determining, using the first probability, a third probability that the first and second records come from the same entity; outputting the third probability.
  3. 3. A computer-implemented method comprising: * receiving a statistical generative and multivariate model of a distribution of a population of entities having a set of attributes; * receiving a record for a given entity x having a given set of attributes; * receiving a first probability ξ_x that the given entity is unique in the population or, for a given set of attributes, a second probability κ_x that the given entity is correctly identifiable; * determining a value n of population size needed to obtain the first probability ξ_x or the second probability κ_x, wherein determining the value n of population size comprises determining, for each of at least one value of population size, a probability p(x, n) of the record for the given entity and population size; * outputting the population size.
  4. 4. A computer-implemented method comprising: * receiving a set of marginal distributions; * receiving a matrix of mutual information scores between pairs of attributes for a dataset; * forming a statistical generative and multivariate model using the set of marginal distributions and the mutual information scores; and * storing the statistical generative and multivariate model.
  5. 5. The method of claim 4, wherein forming the statistical generative and multivariate model comprises: for each attribute, mapping the attribute value onto an interval using the marginal in such a way that the distribution of the mapped attribute is known; and generating a multivariate statistical model for the mapped attributes jointly in a hypercube such that: for each attribute, the marginal of the multivariate model corresponds to the known distribution on the interval; and for each pair of attributes, the mutual information between the mapped attributes in the model is consistent with the matrix of mutual information received.
  6. 6. The method of claim 4 or 5, further comprising: * receiving a dataset of records, each record relating to a respective entity and containing attributes; * extracting marginals from the dataset; * estimating a respective distribution for each of said marginals to form the set of marginal distributions; * extracting mutual information scores between variables in the dataset; and * generating the matrix of mutual information using the mutual information scores.
  7. 7. The method of any one of claims 1 to 6, wherein the statistical generative and multivariate model is a copula model.
  8. 8. The method of claim 7, wherein the copula model is a Gaussian copula model.
  9. 9. A method comprising: the method of any one of claims 1 to 3, 7 or 8; and the method of any one of claims 4 to 6.
  10. 10. A computer program comprising instructions which, when executed by at least one processor, cause the at least one processor to perform the method of any one of claims 1 to 3, 7 or 8.
  11. 11. A computer program product comprising a computer-readable medium storing the computer program of claim 10.
  12. 12. A computer program comprising instructions which, when executed by at least one processor, cause the at least one processor to perform the method of any one of claims 4 to 8.
  13. 13. A computer program product comprising a computer-readable medium storing the computer program of claim 12.
  14. 14. Apparatus comprising: at least one processor; and memory in operative communication with the at least one processor; the at least one processor configured to perform the method of any one of claims 1 to 3, 7 or 8.
  15. 15. Apparatus comprising: at least one processor; and memory in operative communication with the at least one processor; the at least one processor configured to perform the method of any one of claims 4 to 8.
  16. 16. A computer product comprising: a computer-readable medium; and a statistical generative and multivariate model of a distribution of a population of individuals having attributes, the model stored on the computer-readable medium.
GB1908942.4A 2019-06-21 2019-06-21 Assessing likelihood of re-identification Withdrawn GB2584910A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1908942.4A GB2584910A (en) 2019-06-21 2019-06-21 Assessing likelihood of re-identification
PCT/GB2020/051498 WO2020254829A1 (en) 2019-06-21 2020-06-19 Assessing likelihood of re-identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1908942.4A GB2584910A (en) 2019-06-21 2019-06-21 Assessing likelihood of re-identification

Publications (2)

Publication Number Publication Date
GB201908942D0 GB201908942D0 (en) 2019-08-07
GB2584910A true GB2584910A (en) 2020-12-23

Family

ID=67511622

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1908942.4A Withdrawn GB2584910A (en) 2019-06-21 2019-06-21 Assessing likelihood of re-identification

Country Status (2)

Country Link
GB (1) GB2584910A (en)
WO (1) WO2020254829A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110258206A1 (en) * 2010-03-19 2011-10-20 University Of Ottawa System and method for evaluating marketer re-identification risk
US20150106944A1 (en) * 2012-12-27 2015-04-16 Industrial Technology Research Institute Method and device for risk evaluation
US20160155061A1 (en) * 2014-11-27 2016-06-02 Privacy Analytics Inc. Determining Journalist Risk of a Dataset Using Population Equivalence Class Distribution Estimation
US20180114037A1 (en) * 2015-07-15 2018-04-26 Privacy Analytics Inc. Re-identification risk measurement estimation of a dataset

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2679800A1 (en) 2008-09-22 2010-03-22 University Of Ottawa Re-identification risk in de-identified databases containing personal information
US10395059B2 (en) * 2015-07-15 2019-08-27 Privacy Analytics Inc. System and method to reduce a risk of re-identification of text de-identification tools
US10242213B2 (en) * 2015-09-21 2019-03-26 Privacy Analytics Inc. Asymmetric journalist risk model of data re-identification
US10915662B2 (en) * 2017-12-15 2021-02-09 International Business Machines Corporation Data de-identification based on detection of allowable configurations for data de-identification processes

Also Published As

Publication number Publication date
GB201908942D0 (en) 2019-08-07
WO2020254829A1 (en) 2020-12-24

Similar Documents

Publication Publication Date Title
Rocher et al. Estimating the success of re-identifications in incomplete datasets using generative models
US20230359770A1 (en) Computer-implemented privacy engineering system and method
Matthews et al. Data confidentiality: A review of methods for statistical disclosure limitation and methods for assessing privacy
Gkoulalas-Divanis et al. Modern privacy-preserving record linkage techniques: An overview
Machanavajjhala et al. Big privacy: protecting confidentiality in big data
Vatsalan et al. An evaluation framework for privacy-preserving record linkage
Kuzu et al. A practical approach to achieve private medical record linkage in light of public resources
Xue et al. Sequence data matching and beyond: New privacy-preserving primitives based on bloom filters
Rodriguez-Hoyos et al. Does $ k $-Anonymous microaggregation affect machine-learned macrotrends?
Majeed et al. When AI meets information privacy: The adversarial role of AI in data sharing scenario
Vatsalan et al. Privacy risk quantification in education data using Markov model
Benschop et al. Statistical disclosure control: A practice guide
Chang et al. Privacy-Preserving Machine Learning
Nguyen et al. A survey of privacy-preserving model explanations: Privacy risks, attacks, and countermeasures
Panfilo et al. A deep learning-based pipeline for the generation of synthetic tabular data
Kopp Microsoft smartnoise differential privacy machine learning case studies
Ritchie et al. Confidentiality and linked data
GB2584910A (en) Assessing likelihood of re-identification
Papayiannis et al. On clustering uncertain and structured data with Wasserstein barycenters and a geodesic criterion for the number of clusters
Arcolezi Production of categorical data verifying differential privacy: conception and applications to machine learning
Rocher Anonymization in the 21st Century: A Critical Examination of De-identification for Modern Large-Scale Data
Croft Design and Applications of Differentially Private Mechanisms: Adherence to Query Range Constraints and Obfuscation of Facial Images
Imran Randomization Protocols for Privacy Preserving Data Mining
Radway et al. The Impact of De-Identification on Single-Year-of-Age Counts in the US Census
Díaz et al. pyCANON: A Python library to check the level of anonymity of a dataset

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)