Abstract
With the rapid growth of information available on the Internet, it has become harder to find relevant information quickly on the Web. Text classification, one of the most useful tools for processing Web information, has received increasing attention in recent years. Instead of using traditional classification models, we apply n-gram language models to classify Chinese Web text by subject. We investigate several factors that strongly affect the performance of n-gram models: the order n, the smoothing technique, and the granularity of the textual representation unit in Chinese (character or word). The experimental results indicate that the word-based bigram model and the character-based trigram model outperform the others, achieving an F1 score of approximately 90%.
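To make the approach concrete, the following is a minimal sketch of classification with per-class character n-gram language models. It is an illustration under stated assumptions, not the authors' implementation: the names `char_ngrams` and `NgramClassifier` are hypothetical, add-k smoothing stands in for whichever smoothing techniques the paper compares, class priors are assumed uniform, and the training texts are toy examples.

```python
import math
from collections import Counter, defaultdict

START = "\x02"  # padding symbol marking the start of a document (assumption)


def char_ngrams(text, n):
    """Return one character n-gram per character, padded with n-1 start symbols."""
    padded = START * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


class NgramClassifier:
    """One add-k smoothed character n-gram model per class (hypothetical sketch)."""

    def __init__(self, n=2, k=1.0):
        self.n = n  # order of the n-gram model
        self.k = k  # add-k smoothing constant
        self.ngram_counts = defaultdict(Counter)    # class -> n-gram counts
        self.context_counts = defaultdict(Counter)  # class -> (n-1)-gram context counts
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            self.vocab.update(text)
            for gram in char_ngrams(text, self.n):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1
        return self

    def log_likelihood(self, text, label):
        # Vocabulary size plus one slot reserved for unseen characters.
        v = len(self.vocab) + 1
        total = 0.0
        for gram in char_ngrams(text, self.n):
            num = self.ngram_counts[label][gram] + self.k
            den = self.context_counts[label][gram[:-1]] + self.k * v
            total += math.log(num / den)
        return total

    def predict(self, text):
        # Uniform class prior assumed: choose the class whose model
        # assigns the text the highest log-likelihood.
        return max(self.ngram_counts, key=lambda c: self.log_likelihood(text, c))


# Toy training data, invented for illustration only.
docs = ["足球比赛进球", "篮球比赛得分", "股票市场上涨", "银行利率下调"]
labels = ["sports", "sports", "finance", "finance"]

clf = NgramClassifier(n=2).fit(docs, labels)
print(clf.predict("足球比赛"))
print(clf.predict("股票上涨"))
```

Switching from character to word granularity would only require replacing `char_ngrams` with a function that segments the text into words first. Changing `n` or the smoothing rule in `log_likelihood` corresponds to the other two factors the abstract names.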
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhou, X., Wang, T., Zhou, H., Chen, H. (2004). Categorizing Web Information on Subject with Statistical Language Modeling. In: Zhou, X., Su, S., Papazoglou, M.P., Orlowska, M.E., Jeffery, K. (eds) Web Information Systems – WISE 2004. WISE 2004. Lecture Notes in Computer Science, vol 3306. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30480-7_41
DOI: https://doi.org/10.1007/978-3-540-30480-7_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23894-2
Online ISBN: 978-3-540-30480-7
eBook Packages: Springer Book Archive