Establishing the provenance of historical manuscripts with a novel distance measure

Bing Hu¹,
Thanawin Rakthanmanon¹,
Bilson J. L. Campana¹,
Abdullah Mueen¹ &
…
Eamonn Keogh¹

Abstract

The recent digitization of more than 20 million books has been led by initiatives from countries wishing to preserve their cultural heritage and by several commercial endeavors, including the Google Print Library Project. It is expected that within a few years a significant fraction of the world’s books will be online. However, for millions of complete books and tens of millions of loose pages, the provenance of the manuscripts may be completely unknown or disputed, thus denying historians an understanding of the context in which the content was created. In a handful of cases, it may be possible for experts to regain the provenance by examining linguistic, cultural and/or stylistic clues. However, such experts are a rarity and these investigations are time-consuming and expensive. One technique used by experts to establish provenance is the examination of the ornate initial letters appearing in the questioned manuscript. By comparing the initial letters in the manuscript to annotated initial letters whose origin is known, the provenance can be determined. In this work, we show for the first time that we can reproduce this ability with a computer algorithm. We use a recently introduced technique to measure texture similarity and show that it can recognize initial letters with an accuracy that rivals or exceeds human performance. A brute force implementation of this measure would require several months to process a single large book; however, we introduce a novel lower bound that allows us to process the books in hours or minutes.
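The comparison step described in the abstract, matching a questioned initial letter against annotated letters under a compression-based distance, can be illustrated with a generic stand-in. The sketch below is not the authors' texture measure and does not implement their lower bound. It assumes the normalized compression distance of Li et al. (2004), listed in the references, approximated with zlib over raw bytes; the names `compressed_size`, `ncd` and `nearest_label` are hypothetical.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (a rough stand-in for C(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, approximated with zlib.

    Assumption: a general-purpose byte compressor replaces whatever
    compressor the paper's texture measure relies on.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def nearest_label(query: bytes, annotated: Mapping[str, bytes]) -> str:
    """Label of the annotated sample with the smallest distance to the query."""
    return min(annotated, key=lambda label: ncd(query, annotated[label]))


# Toy byte strings stand in for image data; real use would pass pixel bytes.
sample_a = b"abcabcabc" * 200
sample_b = bytes(range(256)) * 8
query = b"abcabcabc" * 150
print(nearest_label(query, {"A": sample_a, "B": sample_b}))
```

A brute-force search of this kind evaluates the distance against every annotated sample, which is the cost the paper's lower bound is meant to reduce.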

Notes

We defer a detailed discussion of our experimental philosophy until Sect. 5; however, we briefly note that all our experiments are reproducible, and all code and data are available at [11].
Just the unique initial letters were deleted.
SIFT was faster by about 30% when given unlimited main memory. If we force it to use a smaller memory footprint, it becomes significantly slower [36].

References

Antonacopoulos A, Downton AC (2007) Special issue on the analysis of historical documents. IJDAR 9(2)
Alabert A, Rangel LM (2011) Classifying the typefaces of the Gutenberg 42-line Bible. IJDAR 14(4)
Coustaty M, Pareti R, Vincent N, Ogier JM (2011) Towards historical document indexing: extraction of drop cap letters. IJDAR 14(3)
Consortium of European Research Libraries (2011) www.cerl.org/web/
Ornaments typographical. www.ornements-typo-mouriau.be/
Virtual Library Humanist Program (2011) www.bvh.univ-tours.fr/index.htm
Agam G, Argamon S, Frieder O, Grossman D, Lewis D (2006) The Complex Document Image Processing (CDIP) Test Collection Project. Illinois Institute of Technology. http://ir.iit.edu/projects/CDIP.html
Bronner E (2008) Stolen manuscripts plague Israeli archives. New York Times
Calvani S (2008) Frequency and figures of organised crime in art and antiquities. ISPAC
Victoria and Albert Museum: Woodcut Printing (video). www.youtube.com/watch?v=mgCYovlFRNY
Hu B Supporting URL for this paper. www.cs.ucr.edu/bhu002/IL/IL.html. This URL contains all data and code used in this paper
Alderman K (2009) Thieves take a page out of rare books and manuscripts. Art Cult Heritage Law Newsl I(V)
INTERPOL (2011) Stolen works of art. www.interpol.int/Public/WorkOfArt/woafaq.asp. Accessed 7 July 2011
Atran S, Henrich J (2010) The evolution of religion: how cognitive by-products, adaptive learning heuristics, ritual displays, and group competition generate deep commitments to prosocial religions. Biological theory: integrating development, evolution, and cognition, vol 5, pp 18–30
Landre J, Morain-Nicolier F (2009) Retrieval of the ornaments from the hand-press period: an overview. In: 10th ICDAR
Campana B, Keogh E (2010) A compression based distance measure for texture. SDM
Maltoni D, Maio D, Jain AK, Prabhakar S (2003) Handbook of fingerprint recognition, Springer, Berlin
Ogier JM, Tombre K (2006) Document image analysis techniques for cultural heritage documents. In: Proceedings of 1st EVA conference, pp 107–114
Basa P, Sabari PS, Nishikanta R, Ramakrishnan AG (2004) Gabor filters for document analysis in Indian bilingual documents. In: International conference on intelligent sensing and information processing, pp 123–126
Delalandre M, Ogier JM, Lladós J (2008) A fast CBIR system of old ornamental letter. In: Workshop on graphics recognition, LNCS, pp 135–144
Fauzi MFA, Lewis PH (2008) A multiscale approach to texture-based image retrieval. J Pattern Anal Appl 11(2)
Garz A, Diem M, Sablatnig R (2010) Local descriptors for document layout analysis. In: Proceedings of Addison-Wesley series in statistics, pp 29–38
Ramel JY, Leriche S, Demonet ML, Busson S (2007) User-driven page layout analysis of historical printed books. IJDAR 243–261
Su Z, Cao Z, Wang Y, Zhen X (2011) Identification of unreliable segments to improve skeletonization of handwriting images. J Pattern Anal Appl 14(1)
Tseng YH, Lee HJ (2008) Document image binarization by two-stage block extraction and background intensity determination. J Pattern Anal Appl 11(1)
Tu SF, Hsu CS (2006) A DCT-based ownership identification method with gray-level and colorful signatures. J Pattern Anal Appl 9(2–3)
Journet N, Eglin V, Ramel JY, Mullot R (2006) Dedicated texture based tools for characterization of old books. In: Proceedings of the 2nd DIAL, April 2006
Moghaddam RF, Cheriet M (2009) Low quality document image modeling and enhancement. IJDAR 11(4)
Hénault DR, Moghaddam RF, Cheriet M (2011) A local linear level set method for the binarization of degraded historical document image. IJDAR 14
Zhu Q, Keogh E (2010) Mother Fugger: mining historical manuscripts with local color patches. ICDM 699–708
Li M, Chen X, Li X, Ma B, Vitányi PMB (2004) The similarity metric. IEEE Trans Inf Theory 50(12):3250–3264
Keogh E, Lonardi S, Ratanamahatana CA, Wei L, Lee S, Handley J (2007) Compression-based data mining of sequential data. Data Min Knowl Discov 14(1):99–129
Baudrier E, Busson S, Corsini S, Delalandre M (2009) Retrieval of the ornaments from the hand-press period: an overview. In: 10th ICDAR 2009
Vedaldi A (2011) http://www.vlfeat.org/~vedaldi/index.html
Garz A, Diem M, Sablatnig R (2011) Layout analysis of ancient manuscripts using local features. In: Eikonopoiia: digital imaging of ancient textual heritage
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60:91–110
Ancient Greek Manuscripts Hit the Internet (2010) www.foxnews.com/scitech/2010/09/27/british-library-posts-greek-manuscripts-web/. Accessed 27 Sep 2010
Keogh E (2002) Exact indexing of dynamic time warping. In: VLDB, pp 406–417
Rubner Y, Tomasi C, Guibas L (1998) A metric for distributions with applications to image databases. In: Proceedings of the IEEE ICCV, pp 59–66
Tang Q, Nasiopoulos P (2010) Efficient motion re-estimation with rate-distortion optimization for MPEG-2 to H.264/AVC transcoding. IEEE Trans Circuits Syst Video Technol 20:262–274
Pigeon S, Coulombe S (2008) Very low cost algorithms for predicting the file size of JPEG images subject to changes of quality factor and scaling. In: DCC
Wang X, Ye L, Keogh EJ, Shelton CR, Annotating historical archives of images. JCDL 341–350
Hu B, Rakthanmanon T, Campana B, Mueen A, Keogh E (2012) Image mining of historical manuscripts to establish provenance. In: SIAM conference on data mining (SDM)
Justin TP (1559) Histoire Universelle de Trogues Pompée, Réduite En Abrégé par Justin
Lewis D, Agam G, Argamon S, Frieder O, Grossman D, Heard J (2006) Building a test collection for complex document information processing. In: Proceedings of the 29th annual international ACM SIGIR conference, pp 665–666
Journet N, Ramel J, Mullot R, Eglin V (2008) Document image characterization using multi-resolution analysis of the texture: application to old documents. IJDAR 11:9–18
Marinai S (2011) Text retrieval from early printed books. IJDAR 14(2):117–129
Plötz T, Fink G (2009) Markov models for offline handwriting recognition: a survey. IJDAR 12:269–298
The Legacy Tobacco Document Library (LTDL) (2007) University of California, San Francisco. http://legacy.library.ucsf.edu/
Tobacco800 Signature and Logos. http://lampsrv02.umiacs.umd.edu/projdb/project.php?id=52
Rusiñol M, Lladós J (2010) Efficient logo retrieval through hashing shape context descriptors. In: Proceedings of the ninth IAPR international workshop on document analysis systems, DAS10, pp 215–222
Zhu G, Zheng Y, Doermann D, Jaeger S (2009) Signature detection and matching for document image retrieval. IEEE Trans Pattern Anal Mach Intell 2015–2031
Zhu G, Doermann D (2007) Automatic document logo detection. IJDAR 864–868
Zhu G, Jaeger S, Doermann D (2006) A robust stamp detection framework on degraded documents. IJDAR XIII:1–9
Jouili S, Coustaty M, Tabbone S, Ogier JM (2010) NAVIDOMASS: structural-based approaches towards handling historical documents. In: ICPR, pp 946–949
Wei L, Keogh E, Van Herle H, Mafra-Neto A (2005) Atomic wedgie: efficient query filtering for streaming times series. ICDM 490–497
Fornés A, Dutta A, Gordo A, Lladós J (2011) CVC-MUSCIMA: a ground truth of handwritten music score images for writer identification and staff removal. IJDAR 14
Renou J (1626) Les Oeuvres Pharmaceutiques du Sr Jean de Renou, Conseiller & Medecin du Roy

Acknowledgments

We thank all the digital archivists who produced the vast amounts of data that made this work possible, especially the NaviDoMass group who did exceptional work preparing and annotating the data. This work was funded by NSF awards 0803410 and 0808770. We also thank the reviewers for their useful comments.

Author information

Authors and Affiliations

Riverside, USA
Bing Hu, Thanawin Rakthanmanon, Bilson J. L. Campana, Abdullah Mueen & Eamonn Keogh

Corresponding author

Correspondence to Bing Hu.

About this article

Cite this article

Hu, B., Rakthanmanon, T., Campana, B.J.L. et al. Establishing the provenance of historical manuscripts with a novel distance measure. Pattern Anal Applic 18, 313–331 (2015). https://doi.org/10.1007/s10044-013-0332-z

Received: 15 May 2012
Accepted: 01 April 2013
Published: 21 April 2013
Issue Date: May 2015
DOI: https://doi.org/10.1007/s10044-013-0332-z
