Abstract
Outlier detection is one of the most important tasks in data analysis. It refers to the process of recognizing unusual characteristics which may provide useful insights in helping us to understand the behaviors of data. In the paper, an isolation-based outlier detection method, called Entropy-based Greedy Isolation Tree (EGiTree), is proposed. Unlike other tree-like detection methods, our method exploits a half-baked isolation tree, which is constructed via three entropy-based heuristics, to identify outliers. Specifically, the heuristics are used to guide the selection process of attribute and its split value when constructing the tree. Thus, the outlierness score of each data point is estimated based on the total partition cost of the isolation node in the tree, as well as the path length and complexity of partition. Experiment results on public real-world datasets show that our approach outperforms distanced-based, density-based, subspace-based as well as state-of-the-art isolation-based approaches.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Aggarwal, C.C.: Outlier Analysis. Springer, New York (2013)
Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufman Publishers, San Francisco (2000)
Zhang, J.: Advancements of outlier detection: a survey. ICST Trans. Scalable Inf. Syst. 13(1), 1–26 (2013)
Barnett, V., Lewis, T.: Outliers in Statistical Data, 3rd edn. Wiley, Chichester (1994)
Barnett, V.: The ordering of multivariate data (with discussion). J. Roy. Stat. Soc. Ser. A 139, 318–354 (1976)
Laurikkala, J., Juhola, M., Kentala, E.: Informal identification of outliers in medical data. In: Fifth International Workshop on Intelligent Data Analysis in Medicine and Pharmacology, pp. 20–24 (2000)
Solberg, H.E., Lahti, A.: Detection of outliers in reference distributions: performance of horn’s algorithm. Clin. Chem. 51(12), 2326–2332 (2005)
Endler, D.: Intrusion detection: applying machine learning to solaris audit data. In: Proceedings of the 14th Annual Computer Security Applications Conference, p. 268 (1998)
Fawcett, T., Provost, F.: Activity monitoring: noticing interesting changes in behavior. In: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 53–62 (1999)
Kruegel, C., Toth, T., Kirda, E.: Service specific anomaly detection for network intrusion detection. In: Proceedings of the 2002 ACM Symposium on Applied Computing, pp. 201–208 (2002)
Yamanishi, K., Takeuchi, J.I.: Discovering outlier filtering rules from unlabeled data: combining a supervised learner with an unsupervised learner. In: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 389–394 (2001)
Parzen, E.: On the estimation of a probability density function and mode. Ann. Math. Stat. 33, 1065–1076 (1962)
Bishop, C.: Novelty detection and neural network validation. In: Proceedings of IEEE Vision, Image and Signal Processing, vol. 141, pp. 217–222 (1994)
Tarassenko, L.: Novelty detection for the identification of masses in mammograms. In: Proceedings of the 4th IEEE International Conference on Artificial Neural Networks, vol. 4, pp. 442–447, Cambridge, UK (1995)
Knorr, E.M., Ng, R.T.: Algorithms for mining distance-based outliers in large dataset. In: Proceedings of 24th International Conference on Very Large Data Bases, VLDB 1998, pp 392–403, New York, NY (1998)
Knorr, E.M., Ng, R.T.: Finding intentional knowledge of distance-based outliers. In: Proceedings of 25th International Conference on Very Large Data Bases, VLDB 1999, pp. 211–222, Edinburgh, Scotland (1999)
Knorr, E.M., Ng, R.T., Tucakov, V.: Distance-based outliers: algorithms and applications. VLDB J. 8(3–4), 237–253 (2000)
Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD 2000, pp. 427–438. ACM Press, New York (2000)
Zhang, K., Hutter, M., Jin, H.: A new local distance-based outlier detection approach for scattered real-world data. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 813–822. Springer, Heidelberg (2009)
Sugiyama, M., Borgwardt, K.M.: Rapid distance-based outlier detection via sampling. In: Proceedings of the Twenty-Seventh Annual Conference on Neural Information Processing Systems, Harrahs and Harveys, Lake Tahoe, 5–10 Dec 2013
Berchtold, S., Keim, D.A., Kriegel, H.-P.: The X-tree: an index structure for high-dimensional data. In: Proceedings of the 22th International Conference on Very Large Data Bases, pp. 28–39 (1996)
Beckmann, N., Kriegel, H.-P., Schneider, R., Seeger, B.: The R*tree: an efficient and robust access method for points and rectangles. In: Proceedings of 1990 ACM SIGMOD International Conference on Management of Data, SIGMOD 1990, pp 322–331, Atlantic City, NJ (1990)
Tang, J., Chen, Z., Fu, AW.-c., Cheung, D.W.: Enhancing effectiveness of outlier detections for low density patterns. In: Chen, M.-S., Yu, P.S., Liu, B. (eds.) PAKDD 2002. LNCS (LNAI), vol. 2336, pp. 535–548. Springer, Heidelberg (2002)
Papadimitriou, S., Kitagawa, H., Gibbons, P., Faloutsos, C.: Loci: fast outlier detection using the local correlation integral. In: Proceedings. 19th International Conference on Data Engineering, pp. 315–326 (2003)
Fan, H., Zaïane, O.R., Foss, A., Wu, J.: A nonparametric outlier detection for effectively discovering Top-N outliers from engineering data. In: Ng, W.-K., Kitsuregawa, M., Li, J., Chang, K. (eds.) PAKDD 2006. LNCS (LNAI), vol. 3918, pp. 557–566. Springer, Heidelberg (2006)
Jin, W., Tung, A.K., Han, J., Wang, W.: Ranking outliers using symmetric neighborhood relationship. In: Ng, W.-K., Kitsuregawa, M., Li, J., Chang, K. (eds.) PAKDD 2006. LNCS (LNAI), vol. 3918, pp. 577–593. Springer, Heidelberg (2006)
Keller, F., Muller, E., Bohm, K.: HiCS: high contrast subspaces for density-based outlier ranking. In: IEEE 28th International Conference on Data Engineering, ICDE 2012, pp: 1037–1048 (2012)
Kriegel, H.-P., Kröger, P., Schubert, E., Zimek, A.: Outlier detection in axis-parallel subspaces of high dimensional data. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 831–838. Springer, Heidelberg (2009)
Liu, F.T., Ting, K.M., Zhou, Z.-H.: Isolation-based anomaly detection. ACM Trans. Knowl. Discov. Data 6(1), 3:1–3:39 (2012)
Liu, F.T., Ting, K.M., Zhou, Z.-H.: On detecting clustered anomalies using SCiForest. In: Balcázar, J.L., Bonchi, F., Gionis, A., Sebag, M. (eds.) ECML PKDD 2010, Part II. LNCS, vol. 6322, pp. 274–290. Springer, Heidelberg (2010)
Gyorfi, L., van der Meulen, E.C.: Density-free convergence properties of various estimators of entropy. Comput. Stat. Data Anal. 5, 425–436 (1987)
Lichmanm M.: UCI Machine Learning Repository, Irvine, CA: University of California, School of Information and Computer Science (2013). http://archive.ics.uci.edu/ml
Acknowledgement
This work is partially supported by the NSF of China (No. 61272468, 61272007, 61572443), the NSF of Zhejiang Province (No. LY14F020012, No. LQ13F020026), and the Opening Fund of Top Key Discipline of Computer Software and Theory in Zhejiang Provincial Colleges at Zhejiang Normal University (No. ZSDZZZZXK27).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Shen, Y., Liu, H., Wang, Y., Chen, Z., Sun, G. (2016). A Novel Isolation-Based Outlier Detection Method. In: Booth, R., Zhang, ML. (eds) PRICAI 2016: Trends in Artificial Intelligence. PRICAI 2016. Lecture Notes in Computer Science(), vol 9810. Springer, Cham. https://doi.org/10.1007/978-3-319-42911-3_37
Download citation
DOI: https://doi.org/10.1007/978-3-319-42911-3_37
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42910-6
Online ISBN: 978-3-319-42911-3
eBook Packages: Computer ScienceComputer Science (R0)