Abstract
This paper proposes an approach to fault-prone module detection that uses large-scale text features and is inspired by spam filtering. The occurrences of every text feature in the source code of a module are counted and used as data for training detection models. We prepared a naive Bayes classifier and a logistic regression model as detection models. To show the effectiveness of the approach, we conducted experiments with five open source projects and compared the results with those obtained using a well-known metrics set; the text-feature models achieved higher detection results. The results imply that large-scale text features are useful in constructing practical detection models, and that measuring sophisticated metrics is not always necessary for detecting fault-prone modules.
Notes
java weka.filters.unsupervised.attribute.StringToWordVector -C -W 5000
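The command above turns each module's text into a vector of word counts restricted to a bounded vocabulary. The following is a minimal Python sketch of that idea, not Weka's implementation: the identifier-style tokenizer, the function names, and the rule "keep the most frequent words overall" are assumptions made for illustration, and Weka's own tokenization and word selection may differ in detail.

```python
import re
from collections import Counter

# Assumption: a text feature is an identifier-like token.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(source):
    """Split source text into identifier-like tokens."""
    return TOKEN.findall(source)


def build_vocabulary(modules, max_words=5000):
    """Keep the max_words most frequent tokens across all modules.

    Ties are broken alphabetically so the result is deterministic.
    """
    totals = Counter()
    for source in modules:
        totals.update(tokenize(source))
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:max_words]]


def to_count_vector(source, vocabulary):
    """Count how often each vocabulary word occurs in one module."""
    counts = Counter(tokenize(source))
    return [counts[word] for word in vocabulary]


# Two toy "modules" (hypothetical snippets, not data from the paper).
modules = [
    "int total = 0; total = total + x;",
    "if (x == null) return total;",
]
vocab = build_vocabulary(modules, max_words=3)
vectors = [to_count_vector(m, vocab) for m in modules]
print(vocab)
print(vectors)
```

Each resulting vector can then serve as one training instance for a detection model such as a naive Bayes classifier or a logistic regression model, with the module's fault-prone label as the class.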
Acknowledgements
The authors would like to express their thanks to the three anonymous reviewers and the editor for providing insightful and useful suggestions and comments. We also thank the developers of the CRM114 classifier; without CRM114, this work could not have been conducted. We thank Tatsuya Miyake, Yoshiki Higo, and Katsuro Inoue, who implemented a software metrics measurement tool. Finally, the authors also wish to thank the developers of Eclipse, who have made the Eclipse repository available for research.
Additional information
Editor: Claes Wohlin
Cite this article
Hata, H., Mizuno, O. & Kikuno, T. Fault-prone module detection using large-scale text features based on spam filtering. Empir Software Eng 15, 147–165 (2010). https://doi.org/10.1007/s10664-009-9117-9
DOI: https://doi.org/10.1007/s10664-009-9117-9