Abstract
It is common to encounter databases in which up to half of the entries are missing; this is especially true of medical databases. Most statistical and data mining techniques require complete datasets and do not give accurate results when values are missing. Several methods have been proposed to deal with missing data. A commonly used method is to delete the instances that have missing attribute values. This approach is suitable when there are few missing values; when there are many, deleting these instances discards the bulk of the information. Another way to cope with the problem is imputation (filling in the missing attribute values). We propose an efficient missing value imputation method based on clustering with weighted distance. We divide the dataset into clusters based on a user-specified value K. We then find the complete-valued neighbor that is nearest to the instance with the missing value, and compute the missing value by taking the average of the centroid value and the centroidal distance of the neighbor. This value is used as the imputed value. Our approach uses the K-means technique with weighted distance, and we show that it results in better performance.
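The steps described above can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' implementation. The abstract does not define the distance weights or how the "centroidal distance" enters the average, so the sketch assumes plain (unweighted) Euclidean distance over the observed attributes and imputes the mean of the cluster centroid's value and the nearest complete neighbor's value. The function names (`farthest_first`, `kmeans`, `impute`) and the farthest-first initialization are hypothetical choices made only to keep the example deterministic.

```python
import numpy as np


def farthest_first(points, k):
    """Deterministic seeding (an assumption): start at row 0, then keep
    adding the point farthest from the centroids chosen so far."""
    chosen = [points[0]]
    while len(chosen) < k:
        diffs = points[:, None, :] - np.array(chosen)[None, :, :]
        nearest = np.linalg.norm(diffs, axis=2).min(axis=1)
        chosen.append(points[nearest.argmax()])
    return np.array(chosen, dtype=float)


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    diffs = points[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diffs, axis=2).argmin(axis=1)


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's K-means on complete rows; returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    for _ in range(n_iter):
        labels = assign(points, centroids)
        # Recompute each centroid; keep the old one if its cluster is empty.
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, assign(points, centroids)


def impute(data, k):
    """Return a copy of `data` (2-D float array) with NaN entries filled in."""
    data = np.asarray(data, dtype=float)
    complete_mask = ~np.isnan(data).any(axis=1)
    complete = data[complete_mask]
    centroids, labels = kmeans(complete, k)
    filled = data.copy()
    for i in np.flatnonzero(~complete_mask):
        observed = ~np.isnan(data[i])
        # Nearest cluster, judged on the observed attributes only.
        c = np.linalg.norm(centroids[:, observed] - data[i, observed], axis=1).argmin()
        members = complete[labels == c]
        if len(members) == 0:  # fall back to all complete rows
            members = complete
        # Nearest complete-valued neighbor inside that cluster.
        gaps = np.linalg.norm(members[:, observed] - data[i, observed], axis=1)
        neighbor = members[gaps.argmin()]
        # ASSUMPTION: average the centroid value and the neighbor value.
        filled[i, ~observed] = (centroids[c, ~observed] + neighbor[~observed]) / 2.0
    return filled
```

Calling `impute(data, k)` on an array that uses `np.nan` for missing entries leaves complete rows unchanged and fills only the missing cells. A weighted variant would replace the Euclidean norms with a weighted distance once the weights are known.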
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Patil, B.M., Joshi, R.C., Toshniwal, D. (2010). Missing Value Imputation Based on K-Mean Clustering with Weighted Distance. In: Ranka, S., et al. Contemporary Computing. IC3 2010. Communications in Computer and Information Science, vol 94. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14834-7_56
DOI: https://doi.org/10.1007/978-3-642-14834-7_56
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14833-0
Online ISBN: 978-3-642-14834-7
eBook Packages: Computer Science (R0)