Abstract
Owing to recent advances in data collection and processing, organizations increasingly publish data for scientific and commercial purposes. Published data should be anonymized so that it remains useful while the privacy of data respondents is preserved. Microaggregation is a popular mechanism for data anonymization, but it naturally operates on numerical datasets. Real-world data, however, are usually mixed, i.e., they contain both numeric and categorical attributes. In this paper, we propose a novel transformation-based method for microaggregation of mixed data, called TBM. The method uses multidimensional scaling to generate a numeric equivalent of the mixed dataset. The partitioning step of microaggregation is performed on the equivalent dataset, whereas the aggregation step is performed on the original data. TBM can microaggregate large mixed datasets in a short time with low information loss. Experimental results show that the proposed method attains a better trade-off between data utility and privacy, in a shorter time, than traditional methods.
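The pipeline summarized above (transform, partition on the numeric equivalent, aggregate on the original records) can be illustrated with a minimal sketch. It is not the TBM implementation: the Gower-style dissimilarity, the one-dimensional classical MDS computed by power iteration, the sort-and-chunk partitioning, and all function names (`mixed_distance`, `mds_1d`, `partition`, `aggregate`, `microaggregate`) are assumptions chosen for brevity. It also builds the full distance matrix, so it is suitable only for small illustrative inputs.

```python
from collections import Counter
from statistics import mean

def mixed_distance(a, b, ranges):
    """Gower-style dissimilarity (an assumption, not the paper's measure):
    range-scaled difference for numeric attributes, 0/1 mismatch otherwise."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:                       # categorical attribute
            total += 0.0 if x == y else 1.0
        else:                               # numeric attribute (r == 0: constant column)
            total += abs(x - y) / r if r else 0.0
    return total / len(a)

def mds_1d(dist, iters=200):
    """One-dimensional classical MDS, up to scale: double-center the squared
    distances and find the dominant eigenvector by power iteration."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]    # deterministic start vector
    for _ in range(iters):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

def partition(coords, k):
    """Sort records by their numeric coordinate and cut into groups of k;
    a short final group is merged into its predecessor."""
    order = sorted(range(len(coords)), key=lambda i: coords[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    return groups

def aggregate(records, groups, ranges):
    """Replace each record by its group centroid computed on the ORIGINAL
    data: mean for numeric attributes, mode for categorical ones."""
    out = list(records)
    for g in groups:
        centroid = tuple(
            Counter(records[i][c] for i in g).most_common(1)[0][0] if r is None
            else mean(records[i][c] for i in g)
            for c, r in enumerate(ranges))
        for i in g:
            out[i] = centroid
    return out

def microaggregate(records, is_numeric, k):
    """Transform, partition on the numeric equivalent, aggregate on the original."""
    ranges = []
    for c, numeric in enumerate(is_numeric):
        col = [r[c] for r in records]
        ranges.append(max(col) - min(col) if numeric else None)
    dist = [[mixed_distance(a, b, ranges) for b in records] for a in records]
    return aggregate(records, partition(mds_1d(dist), k), ranges)

# Hypothetical toy data: (age, occupation)
records = [(25, "nurse"), (27, "nurse"), (26, "nurse"),
           (60, "lawyer"), (62, "lawyer"), (61, "lawyer")]
result = microaggregate(records, is_numeric=[True, False], k=3)
```

In this toy run every released record is shared by at least k = 3 respondents, which is the k-anonymity-style guarantee that microaggregation targets. Power iteration returns the eigenvector with the largest absolute eigenvalue, so the sketch assumes that this eigenvalue is positive, as is typical for clustered data.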
Notes
Microaggregation with minimum information loss.
The definition of LCS in ontology is similar to CCG in VGH.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Responsible editor: Charu Aggarwal.
About this article
Cite this article
Salari, M., Jalili, S. & Mortazavi, R. TBM, a transformation based method for microaggregation of large volume mixed data. Data Min Knowl Disc 31, 65–91 (2017). https://doi.org/10.1007/s10618-016-0457-y
DOI: https://doi.org/10.1007/s10618-016-0457-y