Abstract
Clustering and classification are two important techniques for mining Web information. In this paper, a new adaptive method for mining Chinese documents from the Internet is proposed. First, we give a document clustering algorithm, based on the Boolean model, that combines a Genetic Algorithm (GA) with Simulated Annealing (SA). This algorithm avoids the main disadvantage of clustering documents with pure GA, which converges prematurely and becomes trapped in a local optimum, so its results are inaccurate. Then, because classification with the traditional Vector Space Model (VSM) is unsatisfactory, as it does not reflect how important each word is, we add position factors for keywords to the VSM and build a new classifier model for Chinese Web documents. Experimental results indicate that this adaptive method makes clustering and classification more accurate and reasonable than methods that do not consider word positions.
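To make the position-factor idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes documents are already segmented into terms and grouped by where the terms occur. The names `POSITION_WEIGHTS`, `weighted_vector`, `cosine`, and `classify`, the weight values, and the nearest-centroid decision rule are all hypothetical choices for illustration; the abstract does not specify them.

```python
import math
from collections import Counter

# Hypothetical position weights; the abstract does not state actual values.
POSITION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_vector(doc):
    """Build a term vector in which each occurrence counts by its position weight.

    `doc` maps a position name (e.g. "title") to a list of segmented terms.
    Unknown positions fall back to a weight of 1.0.
    """
    vec = Counter()
    for position, terms in doc.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for term in terms:
            vec[term] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(doc, centroids):
    """Return the label whose centroid vector is most similar to the document."""
    vec = weighted_vector(doc)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Usage: a title term outweighs two body occurrences of another term.
centroids = {
    "sports": Counter({"football": 1.0}),
    "finance": Counter({"stock": 1.0}),
}
doc = {"title": ["football"], "body": ["stock", "stock"]}
label = classify(doc, centroids)
```

With plain term counts, the same document would lean toward "finance" because "stock" occurs twice. Weighting the title occurrence more heavily shifts the decision, which is the kind of effect the position factors are meant to capture.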
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bai, X., Sun, J., Che, H., Wang, J. (2007). A General Method of Mining Chinese Web Documents Based on GA&SA and Position-Factors. In: Washio, T., et al. Emerging Technologies in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4819. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77018-3_41
DOI: https://doi.org/10.1007/978-3-540-77018-3_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77016-9
Online ISBN: 978-3-540-77018-3
eBook Packages: Computer Science, Computer Science (R0)