Abstract
Microaggregation is a masking mechanism that protects confidential data in a public release. The technique produces a k-anonymous dataset by partitioning the data records into groups of at least k members. For each group, a representative centroid is computed by aggregating the group members and is published in place of the original records. Conventional microaggregation algorithms compute each centroid as the simple arithmetic mean of the group members. This naïve formulation does not consider how close the published values are to the original ones, so an intruder may be able to guess the original values. This paper proposes a disclosure-aware aggregation model in which published values are computed at a given distance from the original ones, yielding a published dataset that is both better protected and more useful. Empirical results show that the proposed method achieves a better trade-off between disclosure risk and information loss than other similar anonymization techniques.
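To make the conventional aggregation step concrete, the following is a minimal sketch for a single numerical attribute. It assumes a deliberately simple partitioning (consecutive groups of k records along the sorted order, with the last group absorbing any remainder) and is not the partitioning used by MDAV or MDAV-DA. The function name `microaggregate_1d` is hypothetical. The disclosure-aware aggregation proposed in the paper is not shown, because the abstract does not specify its formulation.

```python
from statistics import mean


def microaggregate_1d(values, k):
    """Mask a list of numbers with conventional mean-based microaggregation.

    Assumption for illustration: groups are formed from consecutive records
    in sorted order; the last group takes the leftover records so that every
    group has at least k members.
    """
    if k < 1 or len(values) < k:
        raise ValueError("need at least k records and k >= 1")

    # Indices of the records, ordered by their value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_groups = len(values) // k
    masked = [0.0] * len(values)

    for g in range(n_groups):
        start = g * k
        # The final group extends to the end of the data.
        end = (g + 1) * k if g < n_groups - 1 else len(values)
        members = order[start:end]
        # Conventional aggregation: the centroid is the arithmetic mean.
        centroid = mean([values[i] for i in members])
        for i in members:
            masked[i] = centroid

    return masked


original = [10, 12, 11, 50, 52, 51, 53]
published = microaggregate_1d(original, k=3)
print(published)
```

Every published value is shared by at least k records, which is what makes the release k-anonymous with respect to this attribute. Note that records near a centroid are published almost unchanged, which is the weakness the abstract attributes to the mean-based formulation.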
Notes
The measures are discussed in more detail in Sect. 2.3.
For simplicity, we define \(Var(X)=\sigma ^2_X=\frac{1}{n}\sum _{i=1}^{n}(x_i-\mu _{X})^2\), where X is a set of n equally likely values \(x_i\) with \(\mu _{X}=Mean(X)\).
We review some general-purpose \(\textit{DR}\) and \(\textit{IL}\) measures for continuous data only, which is the data type addressed in this paper. Variants of the measures for other data types can be found in Hundepool et al. (2012).
It is also known as identity disclosure or re-identification risk.
Interval disclosure is a special case of attribute disclosure for continuous datasets.
The heuristic can easily be extended to consider each attribute separately; however, our experiments show no significant improvement that would justify the additional cost.
These methods are described in Sect. 3.
Note that in Table 2, MDAV-DA usually performs better for \(k=5\) than for the other aggregation levels.
In fact, we select the trade-off point whose \(\textit{DR}\) is closest to, but greater than, that of MDAV-DA, to allow a larger potential decrease of \(\textit{IL}\) for the methods.
An illustrative example is presented in Fig. 1.
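As a small check on the variance definition given in the notes above, the sketch below computes \(Var(X)\) by dividing by n (the population variance), not by n - 1. The helper name `var` is hypothetical, and the comparison against `statistics.pvariance` from the Python standard library is only a sanity check.

```python
from statistics import pvariance


def var(xs):
    """Variance of n equally likely values: (1/n) * sum((x_i - mu)^2)."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n


sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(var(sample))        # mean is 5, squared deviations sum to 32, so 32 / 8
print(pvariance(sample))  # standard-library population variance
```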
References
Askari M, Safavi-Naini R, Barker K (2012) An information theoretic privacy and utility measure for data sanitization mechanisms. In: Proceedings of the second ACM conference on data and application security and privacy, ACM, New York, NY, CODASPY, pp 283–294
Batet M, Erola A, Sánchez D, Castellà-Roca J (2013) Utility preserving query log anonymization via semantic microaggregation. Inf Sci 242:49–63
Bentley JL (1975) Multidimensional binary search trees used for associative searching. Commun ACM 18(9):509–517
Brand R (2003) Microdata protection through noise addition. In: Domingo-Ferrer J (ed) Inference control in statistical databases. Lecture notes in computer science. Springer, Berlin, pp 97–116
Brand R, Domingo-Ferrer J, Mateo-Sanz J (2002) Reference data sets to test and compare SDC methods for protection of numerical microdata. European Project IST-2000-25069 CASC, http://neon.vb.cbs.nl/casc
Burridge J (2003) Information preserving statistical obfuscation. Stat Comput 13(4):321–327
Charu A, Philip S (2008) Privacy-preserving data mining: models and algorithms. ASPVU, Boston
Defays D, Nanopoulos P (1993) Panels of enterprises and confidentiality: the small aggregates method. In: Proceedings of the 1992 symposium on design and analysis of longitudinal surveys, pp 195–204
Domingo-Ferrer J, Torra V (2001a) Disclosure protection methods and information loss for microdata. Confidentiality, disclosure and data access: theory and practical applications for statistical agencies, pp 91–110
Domingo-Ferrer J, Torra V (2001b) A quantitative comparison of disclosure control methods for microdata. Confidentiality, disclosure and data access: theory and practical applications for statistical agencies, pp 111–134
Domingo-Ferrer J, Torra V (2005) Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Min Knowl Discov 11(2):195–212
Domingo-Ferrer J, Rebollo-Monedero D (2009) Measuring risk and utility of anonymized data using information theory. In: Proceedings of the EDBT/ICDT Workshops, ACM, New York, NY, EDBT/ICDT, pp 126–130
Domingo-Ferrer J, Mateo-Sanz JM, Torra V (2001) Comparing SDC methods for microdata on the basis of information loss and disclosure risk. In: Pre-proceedings of ETK-NTTS, vol 2, pp 807–826
Domingo-Ferrer J, Martínez-Ballesté A, Mateo-Sanz JM, Sebé F (2006a) Efficient multivariate data-oriented microaggregation. VLDB J 15(4):355–369
Domingo-Ferrer J, Solanas A, Martinez-Balleste A (2006b) Privacy in statistical databases: k-anonymity through microaggregation. In: Proceedings of international conference on granular computing, IEEE, pp 774–777
Domingo-Ferrer J, Sebé F, Solanas A (2008) An anonymity model achievable via microaggregation. In: Secure data management, Springer, Heidelberg, pp 209–218
Drud AS (1994) CONOPT a large-scale GRG code. ORSA J Comput 6(2):207–216
Fayyoumi E, Oommen BJ (2010) A survey on statistical disclosure control and micro-aggregation techniques for secure statistical databases. Softw Pract Exp 40(12):1161–1188
Hansen S, Mukherjee S (2003) A polynomial algorithm for optimal univariate microaggregation. IEEE Trans Knowl Data Eng 15(4):1043–1044
Heaton B (2012) New record ordering heuristics for multivariate microaggregation. PhD thesis, Nova Southeastern University
Herranz J, Matwin S, Nin J, Torra V (2010) Classifying data from protected statistical datasets. Comput Secur 29(8):875–890
Herranz J, Nin J, Solé M (2012a) Kd-trees and the real disclosure risks of large statistical databases. Inf Fusion 13(4):260–273
Herranz J, Nin J, Solé M (2012b) More hybrid and secure protection of statistical data sets. IEEE Trans Dependable Secur Comput 9(5):727–740
Hundepool A, Domingo-Ferrer J, Franconi L, Giessing S, Nordholt ES, Spicer K, De Wolf PP (2006) Cenex SDC handbook on statistical disclosure control, version 1.01
Hundepool A, Domingo-Ferrer J, Franconi L, Giessing S, Nordholt ES, Spicer K, De Wolf PP (2012) Statistical disclosure control. Wiley, Chichester
Kim JJ (1986) A method for limiting disclosure in microdata based on random noise and transformation. In: Proceedings of the ASA section on survey research methodology, pp 303–308
Laszlo M, Mukherjee S (2005) Minimum spanning tree partitioning algorithm for microaggregation. IEEE Trans Knowl Data Eng 17(7):902–911
Li Y, Zhu S, Wang L, Jajodia S (2002) A privacy-enhanced microaggregation method. In: Eiter T, Schewe KD (eds) Foundations of information and knowledge systems. Lecture notes in computer science. Springer, Berlin, pp 148–159
Lin JL, Chang PC, Liu JYC, Wen TH (2010) Comparison of microaggregation approaches on anonymized data quality. Expert Syst Appl 37(12):8161–8165
López A (2011) Effect of microaggregation on regression results: an application to Spanish innovation data. Empirical Econ Lett 10(12):1265–1272
Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M (2007) L-diversity: privacy beyond k-anonymity. ACM Trans Knowl Discov From Data (TKDD) 1(1):1–52
Mateo-Sanz J, Sebé F, Domingo-Ferrer J (2004) Outlier protection in continuous microdata masking. In: Domingo-Ferrer J, Torra V (eds) Privacy in statistical databases. Lecture notes in computer science. Springer, Berlin, pp 201–215
Mateo-Sanz J, Domingo-Ferrer J, Sebé F (2005) Probabilistic information loss measures in confidentiality protection of continuous microdata. Data Min Knowl Discov 11(2):181–193
Moore Jr RA (1996) Controlled data-swapping techniques for masking public use microdata sets. Tech. Rep. 96-04, Statistical Research Division Report Series, US Bureau of the Census, Washington D.C
Mortazavi R, Jalili S (2014) Fast data-oriented microaggregation algorithm for large numerical datasets. Knowl Based Syst 67:195–205
Mortazavi R, Jalili S (2015) Preference-based anonymization of numerical datasets by multi-objective microaggregation. Inf Fusion 25:85–104
Mortazavi R, Jalili S, Gohargazi H (2013) Multivariate microaggregation by iterative optimization. Appl Intell 39(3):529–544
Navarro-Arribas G, Torra V (2009) Towards microaggregation of log files for Web usage mining in B2C e-commerce. In: Fuzzy information processing society (NAFIPS), IEEE, pp 1–6
Navarro-Arribas G, Torra V (2012) Information fusion in data privacy: a survey. Inf Fusion 13(4):235–244
Nin J, Herranz J, Torra V (2008) On the disclosure risk of multivariate microaggregation. Data Knowl Eng 67(3):399–412
Oganian A, Domingo-Ferrer J (2001) On the complexity of optimal microaggregation for statistical disclosure control. Stat J U N Econ Com Eur 18(4):345–354
Oganian A, Karr AF (2006) Combinations of SDC methods for microdata protection. In: Privacy in Statistical Databases, Springer, Heidelberg, pp 102–113
Pagliuca D, Seri G (1999) Some results of individual ranking method on the system of enterprise accounts annual survey. Report, Esprit SDC Project, Deliverable MI-3 D
Schmid M, Schneeweiss H, Küchenhoff H (2007) Estimation of a linear regression under microaggregation with the response variable as a sorting variable. Statis Neerl 61(4):407–431
Solanas A (2008) Privacy protection with genetic algorithms. In: Yang A, Shan Y, Bui L (eds) Success in evolutionary computation, studies in computational intelligence. Springer, Berlin, pp 215–237
Solanas A, Sebé F, Domingo-Ferrer J (2008) Micro-aggregation-based heuristics for p-sensitive k-anonymity: one step beyond. In: Proceedings of the 2008 international workshop on privacy and anonymity in information society, ACM, pp 61–69
Solé M, Muntés-Mulero V, Nin J (2012) Efficient microaggregation techniques for large numerical data volumes. Int J Inf Secur 11(4):253–267
Sweeney L (2002) k-Anonymity: a model for protecting privacy. Int J Uncertain Fuzziness Knowl Based Syst 10(5):557–570
Torra V (2005) Fuzzy c-means for fuzzy hierarchical clustering. In: The 14th IEEE international conference on fuzzy systems, IEEE, pp 646–651
Truta TM, Vinay B (2006) Privacy protection: p-sensitive k-anonymity property. In: Proceedings 22nd international conference on data engineering workshops, IEEE, pp 94–94
Willenborg LC, De Waal T (2001) Elements of statistical disclosure control, vol 155. Springer, New York
Winkler WE (2004) Re-identification methods for masked microdata. In: Privacy in statistical databases, Springer, Berlin, pp 216–230
Yancey W, Winkler W, Creecy R (2002) Disclosure risk assessment in perturbative microdata protection. In: Domingo-Ferrer J (ed) Inference control in statistical databases. Lecture notes in computer science. Springer, Berlin, pp 135–152
Acknowledgments
This research is partially supported by ITRC (Iran Telecommunication Research Center) under Contract No. 12200/500.
Additional information
Responsible editor: Kristian Kersting.
Cite this article
Mortazavi, R., Jalili, S. Enhancing aggregation phase of microaggregation methods for interval disclosure risk minimization. Data Min Knowl Disc 30, 605–639 (2016). https://doi.org/10.1007/s10618-015-0432-z
DOI: https://doi.org/10.1007/s10618-015-0432-z