Abstract
In this paper, we discuss kernels that can be applied to the classification of XML documents based on their DOM trees. DOM trees are ordered trees in which every node may be labeled by a vector of attributes, including its XML tag and textual content. We describe five new kernels suitable for such structures: a kernel based on predefined structural features; a tree kernel derived from the well-known parse tree kernel; the set tree kernel, which allows permutations of children; the string tree kernel, an extension of the so-called partial tree kernel; and the soft tree kernel, a more efficient alternative. We evaluate the kernels experimentally on a corpus containing the DOM trees of newspaper articles and on the well-known SUSANNE corpus.
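To give a concrete sense of what a kernel on ordered labeled trees computes, the following is a minimal sketch of the well-known parse tree kernel (the subtree-counting recursion of Collins and Duffy) applied to a toy DOM-like tree. It is not one of the five kernels proposed in the paper. The names Node, production, common_subtrees, and parse_tree_kernel, the decay parameter lam, and the reduction of each node to its tag alone (ignoring attributes and textual content) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical DOM-like node: a tag label plus ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def production(n: Node) -> Tuple[str, Tuple[str, ...]]:
    # A node's "production" is its own tag and the ordered tags of its children.
    return (n.label, tuple(c.label for c in n.children))


def nodes(t: Node) -> Iterator[Node]:
    # Pre-order traversal over all nodes of the tree.
    yield t
    for c in t.children:
        yield from nodes(c)


def common_subtrees(n1: Node, n2: Node, lam: float) -> float:
    # Weighted count of common subtree fragments rooted at n1 and n2.
    if not n1.children or not n2.children:
        return 0.0  # leaves do not root any fragment
    if production(n1) != production(n2):
        return 0.0  # different productions share no fragment
    result = lam
    # Equal productions imply equally many children, so zip aligns them.
    for c1, c2 in zip(n1.children, n2.children):
        result *= 1.0 + common_subtrees(c1, c2, lam)
    return result


def parse_tree_kernel(t1: Node, t2: Node, lam: float = 1.0) -> float:
    # Sum the fragment counts over all pairs of nodes (no memoization here).
    return sum(common_subtrees(a, b, lam) for a in nodes(t1) for b in nodes(t2))


# Toy documents (made up for this sketch, not taken from the paper's corpora).
doc_a = Node("article", [
    Node("title", [Node("text")]),
    Node("body", [Node("p", [Node("text")]), Node("p", [Node("text")])]),
])
doc_b = Node("article", [Node("title", [Node("text")])])

if __name__ == "__main__":
    print(parse_tree_kernel(doc_a, doc_a))
    print(parse_tree_kernel(doc_a, doc_b))
```

Because the recursion requires identical productions, two nodes whose children differ only in order or number contribute nothing, which is the kind of rigidity that kernels allowing permuted or partially matching children are meant to relax.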
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Geibel, P., Pustylnikov, O., Mehler, A., Gust, H., Kühnberger, K.-U. (2008). Classification of Documents Based on the Structure of Their DOM Trees. In: Ishikawa, M., Doya, K., Miyamoto, H., Yamakawa, T. (eds) Neural Information Processing. ICONIP 2007. Lecture Notes in Computer Science, vol 4985. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69162-4_81
DOI: https://doi.org/10.1007/978-3-540-69162-4_81
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69159-4
Online ISBN: 978-3-540-69162-4
eBook Packages: Computer Science, Computer Science (R0)