From Legacy Documents to XML: A Conversion Framework

Jean-Pierre Chanod, Boris Chidlovskii, Hervé Dejean, Olivier Fambon, Jérôme Fuselier, Thierry Jacquin & Jean-Luc Meunier

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3652)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

We present an integrated framework for converting documents from legacy formats to XML. We describe the LegDoC project, which aims to automate the conversion of layout annotations in layout-oriented formats such as PDF, PS and HTML into semantic-oriented annotations. A toolkit of components covers complementary techniques: logical document analysis and semantic annotation with machine learning methods. We use a real conversion project as a running example to illustrate the techniques implemented in the project.

Author information

Authors and Affiliations

Xerox Research Centre Europe, 6, chemin de Maupertuis, F–38240, Meylan, France
Jean-Pierre Chanod, Boris Chidlovskii, Hervé Dejean, Olivier Fambon, Jérôme Fuselier, Thierry Jacquin & Jean-Luc Meunier

Editor information

Editors and Affiliations

Vienna University of Technology, Vienna, Austria
Andreas Rauber
Laboratory of Distributed Multimedia Information Systems and Applications, Technical University of Crete (MUSIC/TUC) Chania, 73100, Crete, Greece
Stavros Christodoulakis
Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstr. 9-11/188, A-1040, Wien, Austria
A Min Tjoa

About this paper

Cite this paper

Chanod, JP. et al. (2005). From Legacy Documents to XML: A Conversion Framework. In: Rauber, A., Christodoulakis, S., Tjoa, A.M. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2005. Lecture Notes in Computer Science, vol 3652. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551362_9

DOI: https://doi.org/10.1007/11551362_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28767-4
Online ISBN: 978-3-540-31931-3
eBook Packages: Computer Science, Computer Science (R0)
