[go: up one dir, main page]

skip to main content
research-article

Isolation-Based Anomaly Detection

Published: 01 March 2012 Publication History

Abstract

Anomalies are data points that are few and different. As a result of these properties, we show that, anomalies are susceptible to a mechanism called isolation. This article proposes a method called Isolation Forest (iForest), which detects anomalies purely based on the concept of isolation without employing any distance or density measure---fundamentally different from all existing methods.
As a result, iForest is able to exploit subsampling (i) to achieve a low linear time-complexity and a small memory-requirement and (ii) to deal with the effects of swamping and masking effectively. Our empirical evaluation shows that iForest outperforms ORCA, one-class SVM, LOF and Random Forests in terms of AUC, processing time, and it is robust against masking and swamping effects. iForest also works well in high dimensional problems containing a large number of irrelevant attributes, and when anomalies are not available in training sample.

References

[1]
Abe, N., Zadrozny, B., and Langford, J. 2006. Outlier detection by active learning. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 504--509.
[2]
Angiulli, F. and Fassetti, F. 2009. Dolphin: An efficient algorithm for mining distance-based outliers in very large datasets. ACM Trans. Knowl. Discov. Data 3, 1, 1--57.
[3]
Asuncion, A. and Newman, D. 2007. UCI machine learning repository. http://mlearn.ics.ucl.edu/MLRepository.html.
[4]
Bay, S. D. and Schwabacher, M. 2003. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 29--38.
[5]
Breiman, L. 2001. Random forests. Mach. Learn. 45, 1, 5--32.
[6]
Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. 2000. LOF: Identifying density-based local outliers. ACM SIGMOD Record 29, 2, 93--104.
[7]
Caputo, B., Sim, K., Furesjo, F., and Smola, A. 2002. Appearance-based object recognition using SVMs: Which kernel should I use? In Proceedings of the NIPS Workshop on Statitsical Methods for Computational Experiments in Visual Processing and Computer Vision.
[8]
Chandola, V., Banerjee, A., and Kumar, V. 2009. Anomaly detection: A survey. ACM Comput. Surv. 41, 3, 1--58.
[9]
Chaudhary, A., Szalay, A. S., Szalay, E. S., and Moore, A. W. 2002. Very fast outlier detection in large multidimensional datasets. In Proceedings of the ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery. ACM.
[10]
Fan, H., Zaïane, O. R., Foss, A., and Wu, J. 2006. A nonparametric outlier detection for effectively discovering top-n outliers from engineering data. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Lecture Notes in Computer Science, vol. 3918, Springer, 557--566.
[11]
Ghoting, A., Otey, M. E., and Parthasarathy, S. 2004. Loaded: Link-based outlier and anomaly detection in evolving datasets. In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM’04). IEEE Computer Society Press, 387--390.
[12]
Hand, D. J. and Till, R. J. 2001. A simple generalisation of the area under the roc curve for multiple class classification problems. Mach. Learn. 45, 2, 171--186.
[13]
He, Z., Xu, X., and Deng, S. 2003. Discovering cluster-based local outliers. Patt. Recog. Lett. 24, 9--10, 1641--1650.
[14]
He, Z., Deng, S., and Xu, X. 2005. A unified subspace outlier ensemble framework for outlier detection. In Proceedings of the International Conference on Web-Age Information Management (WAIM). W. Fan, Z. Wu, and J. Yang Eds., Lecture Notes in Computer Science, vol. 3739, Springer, 632--637.
[15]
Joanes, D. N. and Gill, C. A. 1998. Comparing measures of sample skewness and kurtosis. J. Roy. Stat. Soc. (Series D): Statistician 47, 1, 183--189.
[16]
Knorr, E. M. and Ng, R. T. 1998. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB’98). Morgan Kaufmann, 392--403.
[17]
Knorr, E. M., Ng, R. T., and Tucakov, V. 2000. Distance-based outliers: Algorithms and applications. VLDB J. 8, 3--4, 237--253.
[18]
Knuth, D. E. 2006. Art of Computer Programming, Volume 4, Fascicle 4: Generating All Trees. Addison-Wesley Professional.
[19]
Lazarevic, A. and Kumar, V. 2005. Feature bagging for outlier detection. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD’05). ACM, New York, 157--166.
[20]
Liu, F. T., Ting, K. M., and Zhou, Z.-H. 2008a. Isolation forest. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM’08). IEEE Computer Society Press, 413--422.
[21]
Liu, F. T., Ting, K. M., and Zhou, Z.-H. 2008b. Spectrum of variable-random trees. J. Artif. Intel. Res. 32, 355--384.
[22]
Liu, F. T., Ting, K. M., and Zhou, Z.-H. 2010a. Can isolation-based anomaly detectors handle arbitrary multi-modal patterns in data? Tech. rep. TR2010/2, Monash University.
[23]
Liu, F. T., Ting, K. M., and Zhou, Z.-H. 2010b. On detecting clustered anomalies using SCiForest. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD’10). 274--290.
[24]
Liu, R. Y., Parelius, J. M., and Singh, K. 1999. Multivariate analysis by data depth: Descriptive statistics, graphics and inference. Ann. Stat. 27, 3, 783--840.
[25]
Murphy, R. B. 1951. On tests for outlying observations. Ph.D. dissertation, Princeton University.
[26]
Papadimitriou, S., Kitagawa, H., Gibbons, P., and Faloutsos, C. 2003. Loci: Fast outlier detection using the local correlation integral. In Proceedings of the 19th International Conference on Data Engineering. 315--326.
[27]
Preiss, B. R. 1999. Data Structures and Algorithms with Object-Oriented Design Patterns in Java. Wiley.
[28]
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning 1st Ed. (Series in Machine Learning), Morgan Kaufmann.
[29]
Ramaswamy, S., Rastogi, R., and Shim, K. 2000. Efficient algorithms for mining outliers from large datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 427--438.
[30]
Rocke, D. M. and Woodruff, D. L. 1996. Identification of outliers in multivariate data. J. Amer. Statist. Soc. 91, 435, 1047--1061.
[31]
Rousseeuw, P. J. and Leroy, A. M. 1987. Robust Regression and Outlier Detection. Wiley, New York.
[32]
Ruskey, F. 1980. On the average shape of binary trees. SIAM J. Alg. Disc. Meth. 1, 1, 43--50.
[33]
Schölkopf, B., Platt, J. C., Shawe-Taylor, J. C., Smola, A. J., and Williamson, R. C. 2001. Estimating the support of a high-dimensional distribution. Neural Comput. 13, 7, 1443--1471.
[34]
Shi, T. and Horvath, S. 2006. Unsupervised learning with random forest predictors. J. Computat. Graph. Stat. 15, 1, 118--138.
[35]
Sing, T., Sander, O., Beerenwinkel, N., and Lengauer, T. 2005. ROCR: Visualizing classifier performance in r. Bioinformatics 21, 20, 3940--3941.
[36]
Song, X., Wu, M., Jermaine, C., and Ranka, S. 2007. Conditional anomaly detection. IEEE Trans. Knowl. Data Eng. 19, 5, 631--645.
[37]
Tan, P.-N., Steinbach, M., and Kumar, V. 2005. Introduction to Data Mining. Addison-Wesley.
[38]
Tang, J., Chen, Z., Fu, A. W.-C., and Cheung, D. W.-L. 2002. Enhancing effectiveness of outlier detections for low density patterns. In Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD). Springer-Verlag, 535--548.
[39]
Tao, Y., Xiao, X., and Zhou, S. 2006. Mining distance-based outliers from large databases in any metric space. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM, New York, 394--403.
[40]
Tax, D. M. J. and Duin, R. P. W. 2004. Support vector data description. Mach. Learn. 54, 1, 45--66.
[41]
Tukey, J. W. 1977. Exploratory Data Analysis. Addison-Wesley.
[42]
Williams, G., Baxter, R., He, H., Hawkins, S., and Gu, L. 2002. A comparative study of rnn for outlier detection in data mining. In Proceedings of the IEEE International Conference on Data Mining (ICDM’02). IEEE Computer Society Press, 709--712.
[43]
Wu, M. and Jermaine, C. 2006. Outlier detection by sampling with accuracy guarantees. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 767--772.
[44]
Yamanishi, K., Takeuchi, J.-I., Williams, G., and Milne, P. 2000. On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 320--324.
[45]
Yu, X., Tang, L. A., and Han, J. 2009. Filtering and refinement: A two-stage approach for efficient and effective anomaly detection. In Proceedings of the 9th IEEE International Conference on Data Mining (ICDM). IEEE Computer Society Press, 617--626.

Cited By

View all
  • (2025)Detecting energy consumption anomalies with dynamic adaptive encoder-decoder deep learning networksRenewable and Sustainable Energy Reviews10.1016/j.rser.2024.114975207(114975)Online publication date: Jan-2025
  • (2025)Improving port state control through a transfer learning-enhanced XGBoost modelReliability Engineering & System Safety10.1016/j.ress.2024.110558253(110558)Online publication date: Jan-2025
  • (2025)Outlier detection using local density and global structurePattern Recognition10.1016/j.patcog.2024.110947157(110947)Online publication date: Jan-2025
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Knowledge Discovery from Data
ACM Transactions on Knowledge Discovery from Data  Volume 6, Issue 1
March 2012
137 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/2133360
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 March 2012
Accepted: 01 February 2011
Revised: 01 August 2010
Received: 01 March 2010
Published in TKDD Volume 6, Issue 1

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Anomaly detection
  2. binary tree
  3. ensemble methods
  4. isolation
  5. isolation forest
  6. outlier detection
  7. random tree ensemble

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1,182
  • Downloads (Last 6 weeks)125
Reflects downloads up to 14 Oct 2024

Other Metrics

Citations

Cited By

View all
  • (2025)Detecting energy consumption anomalies with dynamic adaptive encoder-decoder deep learning networksRenewable and Sustainable Energy Reviews10.1016/j.rser.2024.114975207(114975)Online publication date: Jan-2025
  • (2025)Improving port state control through a transfer learning-enhanced XGBoost modelReliability Engineering & System Safety10.1016/j.ress.2024.110558253(110558)Online publication date: Jan-2025
  • (2025)Outlier detection using local density and global structurePattern Recognition10.1016/j.patcog.2024.110947157(110947)Online publication date: Jan-2025
  • (2025)Cloud-GAN: Cloud Generation Adversarial Networks for anomaly detectionPattern Recognition10.1016/j.patcog.2024.110866157(110866)Online publication date: Jan-2025
  • (2025)Rapid identification method for on-road high-emission vehicle based on deep semi-supervised anomaly detectionMeasurement10.1016/j.measurement.2024.115430239(115430)Online publication date: Jan-2025
  • (2025)LaAeb: A comprehensive log-text analysis based approach for insider threat detectionComputers & Security10.1016/j.cose.2024.104126148(104126)Online publication date: Jan-2025
  • (2025)State estimation of a biogas plant based on spectral analysis using a combination of machine learning and metaheuristic algorithmsApplied Energy10.1016/j.apenergy.2024.124447377(124447)Online publication date: Jan-2025
  • (2024)An Investigation on the Use of Clustering Algorithms for Data Preprocessing in Breast Cancer DiagnosisTürk Doğa ve Fen Dergisi10.46810/tdfd.136439713:1(70-77)Online publication date: 26-Mar-2024
  • (2024)Impact of Enterprise Supply Chain Digitalization on Cost of Debt: A Four-Flows Perspective Analysis Using Explainable Machine Learning MethodologySustainability10.3390/su1619870216:19(8702)Online publication date: 9-Oct-2024
  • (2024)A Survey of Advanced Border Gateway Protocol Attack Detection TechniquesSensors10.3390/s2419641424:19(6414)Online publication date: 3-Oct-2024
  • Show More Cited By

View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media