Outlier detection based on approximation accuracy entropy

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

Proximity-based outlier detection methods have recently received much attention. For a given object x, a proximity-based method usually measures the degree of outlierness of x by examining the nearest-neighbor structure of x, where the size of the nearest neighborhood must be predetermined by the user. However, it is difficult for users to determine an appropriate neighborhood size. To solve this problem, we present an approximation accuracy entropy-based outlier detection algorithm, called ODAAE, within the framework of rough sets. Approximation accuracy entropy is an extension of Shannon information entropy in rough sets. To quantify the degree of outlierness of any given object, we develop a measure called the approximation accuracy entropy (AAE)-based outlier factor. Experimental results on real-world data sets show that the proposed algorithm is effective for outlier detection.
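As background for the rough-set framework mentioned in the abstract, the following is a minimal sketch of Pawlak's classical approximation accuracy, the ratio of the sizes of the lower and upper approximations of a target set. It is not the ODAAE algorithm or the AAE-based outlier factor, whose definitions are not given in this preview. The toy table, the attribute names, and the helper names `equivalence_classes` and `approximation_accuracy` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def equivalence_classes(table, attrs):
    """Partition object ids into blocks of objects that agree on every attribute in attrs."""
    blocks = defaultdict(set)
    for obj, row in table.items():
        blocks[tuple(row[a] for a in attrs)].add(obj)
    return list(blocks.values())


def approximation_accuracy(table, attrs, target):
    """Classical Pawlak accuracy |lower| / |upper| of `target` with respect to `attrs`."""
    lower, upper = set(), set()
    for block in equivalence_classes(table, attrs):
        if block <= target:   # block lies entirely inside the target set
            lower |= block
        if block & target:    # block overlaps the target set
            upper |= block
    # Convention assumed here: an empty upper approximation gives accuracy 1.
    return len(lower) / len(upper) if upper else 1.0


# Hypothetical toy information table: object id -> attribute values.
table = {
    "x1": {"color": "red", "size": "small"},
    "x2": {"color": "red", "size": "small"},
    "x3": {"color": "blue", "size": "large"},
    "x4": {"color": "blue", "size": "small"},
}
target = {"x1", "x3"}

accuracy = approximation_accuracy(table, ["color", "size"], target)
print(accuracy)
```

In this toy table, x1 and x2 are indiscernible, so the lower approximation of the target is {x3} and the upper approximation is {x1, x2, x3}. Note that no neighborhood size has to be chosen: the granules come directly from the attribute values.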

Acknowledgements

We would like to thank the anonymous referees for their constructive remarks, which helped to improve the clarity and completeness of the paper. This work is supported by the National Natural Science Foundation of China (Grant Nos. 61402246, 61273180) and the Natural Science Foundation of Shandong Province, China (Grant No. ZR2018MF007).

Author information

Corresponding author

Correspondence to Yanjun Peng.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jiang, F., Zhao, H., Du, J. et al. Outlier detection based on approximation accuracy entropy. Int. J. Mach. Learn. & Cyber. 10, 2483–2499 (2019). https://doi.org/10.1007/s13042-018-0884-8

  • DOI: https://doi.org/10.1007/s13042-018-0884-8