Abstract
Proximity-based outlier detection methods have recently received much attention. For a given object x, a proximity-based method usually measures the degree of outlierness of x by examining the nearest-neighbor structure of x, where the size of the nearest neighborhood must be specified in advance by the user. However, it is difficult for users to choose this size. To solve this problem, we present an approximation accuracy entropy-based outlier detection algorithm, called ODAAE, within the framework of rough sets. Approximation accuracy entropy is an extension of Shannon information entropy to rough sets. To quantify the degree of outlierness of any given object, we develop a measure called the approximation accuracy entropy (AAE)-based outlier factor. Experimental results on real-world data sets show that the proposed algorithm is effective for outlier detection.
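As background for the rough-set notions mentioned above, the following is a minimal sketch of Pawlak's classical approximation accuracy, the ratio of the sizes of the lower and upper approximations of a target set. It is not the AAE measure or the ODAAE algorithm, whose definitions are not given in the abstract. The toy table, the attribute names `a` and `b`, and the helper names `partition` and `approximation_accuracy` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def partition(objects, attrs):
    """Group object ids into indiscernibility classes over `attrs`."""
    blocks = defaultdict(set)
    for oid, row in objects.items():
        # Objects with identical values on `attrs` are indiscernible.
        blocks[tuple(row[a] for a in attrs)].add(oid)
    return list(blocks.values())


def approximation_accuracy(objects, attrs, target):
    """Pawlak accuracy |lower(X)| / |upper(X)| for a non-empty target set X."""
    lower, upper = set(), set()
    for block in partition(objects, attrs):
        if block <= target:   # block lies entirely inside X
            lower |= block
        if block & target:    # block overlaps X
            upper |= block
    return len(lower) / len(upper)


# Hypothetical toy table: four objects, two categorical attributes.
table = {
    1: {"a": 0, "b": 0},
    2: {"a": 0, "b": 0},
    3: {"a": 1, "b": 0},
    4: {"a": 1, "b": 1},
}
X = {1, 3}

acc_ab = approximation_accuracy(table, ["a", "b"], X)  # classes {1,2}, {3}, {4}
acc_a = approximation_accuracy(table, ["a"], X)        # classes {1,2}, {3,4}
```

With both attributes, only object 3 is certainly in X while objects 1, 2 and 3 are possibly in X, so the accuracy is 1/3. With attribute `a` alone, no class fits inside X and the accuracy drops to 0, which shows how the choice of attributes changes how well a set can be approximated.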
Acknowledgements
We would like to thank the anonymous referees for their constructive remarks, which helped to improve the clarity and completeness of the paper. This work is supported by the National Natural Science Foundation of China (Grant Nos. 61402246, 61273180) and the Natural Science Foundation of Shandong Province, China (Grant No. ZR2018MF007).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Jiang, F., Zhao, H., Du, J. et al. Outlier detection based on approximation accuracy entropy. Int. J. Mach. Learn. & Cyber. 10, 2483–2499 (2019). https://doi.org/10.1007/s13042-018-0884-8
DOI: https://doi.org/10.1007/s13042-018-0884-8