Abstract
We present an integrated framework for converting documents from legacy formats to XML. We describe the LegDoC project, which aims to automate the conversion of layout annotations in layout-oriented formats such as PDF, PS and HTML into semantic-oriented annotations. A toolkit of components covers complementary techniques for logical document analysis and semantic annotation, drawing on machine learning methods. We use a real conversion project as a running example to illustrate the techniques implemented in the project.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chanod, JP. et al. (2005). From Legacy Documents to XML: A Conversion Framework. In: Rauber, A., Christodoulakis, S., Tjoa, A.M. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2005. Lecture Notes in Computer Science, vol 3652. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551362_9
DOI: https://doi.org/10.1007/11551362_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28767-4
Online ISBN: 978-3-540-31931-3
eBook Packages: Computer Science, Computer Science (R0)