Abstract
The recent digitization of more than 20 million books has been led by initiatives from countries wishing to preserve their cultural heritage and by several commercial endeavors, including the Google Print Library Project. It is expected that within a few years a significant fraction of the world’s books will be online. However, for millions of complete books and tens of millions of loose pages, the provenance of the manuscripts may be completely unknown or disputed, denying historians an understanding of the context in which the content was created. In a handful of cases, experts may be able to recover the provenance by examining linguistic, cultural and/or stylistic clues. However, such experts are rare, and these investigations are time-consuming and expensive. One technique experts use to establish provenance is to examine the ornate initial letters appearing in the questioned manuscript: comparing them to annotated initial letters of known origin can reveal the provenance. In this work, we show for the first time that this ability can be reproduced with a computer algorithm. We use a recently introduced technique to measure texture similarity and show that it can recognize initial letters with an accuracy that rivals or exceeds human performance. A brute-force implementation of this measure would require several months to process a single large book; however, we introduce a novel lower bound that allows us to process such books in hours or minutes.
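The way a lower bound speeds up a brute-force search can be illustrated with a minimal sketch. It does not reproduce the texture measure or the bound introduced in this work: the function names, the Euclidean stand-in for the expensive distance, and the first-coordinate bound are all assumptions chosen only to keep the example self-contained. The general idea is that a candidate can be skipped without computing its full distance whenever a cheap lower bound on that distance is already no better than the best match found so far.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Tuple[float, ...]


def euclidean(a: Vector, b: Vector) -> float:
    """Stand-in for an expensive distance (hypothetical, not the paper's measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def first_coord_bound(a: Vector, b: Vector) -> float:
    """Cheap lower bound: |a0 - b0| can never exceed the Euclidean distance."""
    return abs(a[0] - b[0])


def nearest_neighbor_pruned(
    query: Vector,
    candidates: Sequence[Vector],
    distance: Callable[[Vector, Vector], float],
    lower_bound: Callable[[Vector, Vector], float],
) -> Tuple[int, float, int]:
    """Return (best index, best distance, number of full distance evaluations).

    Correct only if lower_bound(q, c) <= distance(q, c) for every pair.
    """
    best_idx, best_dist, full_evals = -1, float("inf"), 0
    for i, cand in enumerate(candidates):
        # If even the optimistic estimate cannot beat the best so far, skip.
        if lower_bound(query, cand) >= best_dist:
            continue
        d = distance(query, cand)
        full_evals += 1
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist, full_evals


if __name__ == "__main__":
    library = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0), (10.0, 0.0)]
    idx, dist, evals = nearest_neighbor_pruned(
        (1.0, 2.0), library, euclidean, first_coord_bound
    )
    print(f"best match: {idx}, distance: {dist:.3f}, full evaluations: {evals}")
```

Because the bound never overestimates the true distance, the pruned search returns the same nearest neighbor as an exhaustive scan; the savings depend entirely on how tight and how cheap the bound is.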
Notes
Just the unique initial letters were deleted.
SIFT was faster by about 30% when given unlimited main memory. If we force it to use a smaller memory footprint, it becomes significantly slower [36].
References
Antonacopoulos A, Downton AC (2007) Special issue on the analysis of historical documents. IJDAR 9(2)
Alabert A, Rangel LM (2011) Classifying the typefaces of the Gutenberg 42-line Bible. IJDAR 14(4)
Coustaty M, Pareti R, Vincent N, Ogier JM (2011) Towards historical document indexing: extraction of drop cap letters. IJDAR 14(3)
Consortium of European Research Libraries (2011) www.cerl.org/web/
Ornaments typographical. www.ornements-typo-mouriau.be/
Virtual Library Humanist Program (2011) www.bvh.univ-tours.fr/index.htm
Agam G, Argamon S, Frieder O, Grossman D, Lewis D (2006) The Complex Document Image Processing (CDIP) Test Collection Project. Illinois Institute of Technology. http://ir.iit.edu/projects/CDIP.html
Bronner E (2008) Stolen manuscripts plague Israeli archives. New York Times
Calvani S (2008) Frequency and figures of organised crime in art and antiquities. ISPAC
Victoria and Albert Museum: Woodcut Printing (video). www.youtube.com/watch?v=mgCYovlFRNY
Hu B Supporting URL for this paper. www.cs.ucr.edu/bhu002/IL/IL.html. This URL contains all data and code used in this paper
Alderman K (2009) Thieves take a page out of rare books and manuscripts. Art Cult Heritage Law Newsl I(V)
INTERPOL (2011) Stolen works of art. www.interpol.int/Public/WorkOfArt/woafaq.asp. Accessed 7 July 2011
Atran S, Henrich J (2010) The evolution of religion: how cognitive by-products, adaptive learning heuristics, ritual displays, and group competition generate deep commitments to prosocial religions. Biological theory: integrating development, evolution, and cognition, vol 5, pp 18–30
Landre J, Morain-Nicolier F (2009) Retrieval of the ornaments from the hand-press period: an overview. In: 10th ICDAR
Campana B, Keogh E (2010) A compression based distance measure for texture. SDM
Maltoni D, Maio D, Jain AK, Prabhakar S (2003) Handbook of fingerprint recognition, Springer, Berlin
Ogier JM, Tombre K (2006) Document image analysis techniques for cultural heritage documents. In: Proceedings of 1st EVA conference, pp 107–114
Basa P, Sabari PS, Nishikanta R, Ramakrishnan AG (2004) Gabor filters for document analysis in Indian bilingual documents. In: International conference on intelligent sensing and information processing, pp 123–126
Delalandre M, Ogier JM, Lladós J (2008) A fast CBIR system of old ornamental letter. In: Workshop on graphics recognition, LNCS, pp 135–144
Fauzi MFA, Lewis PH (2008) A multiscale approach to texture-based image retrieval. J Pattern Anal Appl 11(2)
Garz A, Diem M, Sablatnig R (2010) Local descriptors for document layout analysis. In: Proceedings of Addison-Wesley series in statistics, pp 29–38
Ramel JY, Leriche S, Demonet ML, Busson S (2007) User-driven page layout analysis of historical printed books. IJDAR 243–261
Su Z, Cao Z, Wang Y, Zhen X (2011) Identification of unreliable segments to improve skeletonization of handwriting images. J Pattern Anal Appl 14(1)
Tseng YH, Lee HJ (2008) Document image binarization by two-stage block extraction and background intensity determination. J Pattern Anal Appl 11(1)
Tu SF, Hsu CS (2006) A DCT-based ownership identification method with gray-level and colorful signatures. J Pattern Anal Appl 9(2–3)
Journet N, Eglin V, Ramel JY, Mullot R (2006) Dedicated texture based tools for characterization of old books. In: Proceedings of the 2nd DIAL, April 2006
Moghaddam RF, Cheriet M (2009) Low quality document image modeling and enhancement. IJDAR 11(4)
Hénault DR, Moghaddam RF, Cheriet M (2011) A local linear level set method for the binarization of degraded historical document images. IJDAR 14
Zhu Q, Keogh E (2010) Mother Fugger: mining historical manuscripts with local color patches. ICDM 699–708
Li M, Chen X, Li X, Ma B, Vitányi PMB (2004) The similarity metric. IEEE Trans Inf Theory 50(12):3250–3264
Keogh E, Lonardi S, Ratanamahatana CA, Wei L, Lee S, Handley J (2007) Compression-based data mining of sequential data. Data Min Knowl Discov 14(1):99–129
Baudrier E, Busson S, Corsini S, Delalandre M (2009) Retrieval of the ornaments from the hand-press period: an overview. In: 10th ICDAR
Vedaldi A (2011) http://www.vlfeat.org/~vedaldi/index.html
Garz A, Diem M, Sablatnig R (2011) Layout analysis of ancient manuscripts using local features. In: Eikonopoiia: digital imaging of ancient textual heritage
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60:91–110
Ancient Greek Manuscripts Hit the Internet (2010) www.foxnews.com/scitech/2010/09/27/british-library-posts-greek-manuscripts-web/. Accessed 27 Sep 2010
Keogh E (2002) Exact indexing of dynamic time warping. In: VLDB, pp 406–417
Rubner Y, Tomasi C, Guibas L (1998) A metric for distributions with applications to image databases. In: Proceedings of the IEEE ICCV, pp 59–66
Tang Q, Nasiopoulos P (2010) Efficient motion re-estimation with rate-distortion optimization for MPEG-2 to H.264/AVC transcoding. IEEE Trans Circuits Syst Video Technol 20:262–274
Pigeon S, Coulombe S (2008) Very low cost algorithms for predicting the file size of JPEG images subject to changes of quality factor and scaling. In: DCC
Wang X, Ye L, Keogh EJ, Shelton CR. Annotating historical archives of images. JCDL 341–350
Hu B, Rakthanmanon T, Campana B, Mueen A, Keogh E (2012) Image mining of historical manuscripts to establish provenance. In: SIAM conference on data mining (SDM)
Justin TP (1559) Histoire Universelle de Trogues Pompée, Réduite En Abrégé par Justin
Lewis D, Agam G, Argamon S, Frieder O, Grossman D, Heard J (2006) Building a test collection for complex document information processing. In: Proceedings of the 29th annual international ACM SIGIR conference, pp 665–666
Journet N, Ramel J, Mullot R, Eglin V (2008) Document image characterization using multi-resolution analysis of the texture: application to old documents. IJDAR 11:9–18
Marinai S (2011) Text retrieval from early printed books. IJDAR 14(2):117–129
Plötz T, Fink G (2009) Markov models for offline handwriting recognition: a survey. IJDAR 12:269–298
The Legacy Tobacco Document Library (LTDL) (2007) University of California, San Francisco. http://legacy.library.ucsf.edu/
Tobacco800 Signature and Logos. http://lampsrv02.umiacs.umd.edu/projdb/project.php?id=52
Rusiñol M, Lladós J (2010) Efficient logo retrieval through hashing shape context descriptors. In: Proceedings of the ninth IAPR international workshop on document analysis systems (DAS10), pp 215–222
Zhu G, Zheng Y, Doermann D, Jaeger S (2009) Signature detection and matching for document image retrieval. IEEE Trans Pattern Anal Mach Intell 2015–2031
Zhu G, Doermann D (2007) Automatic document logo detection. IJDAR 864–868
Zhu G, Jaeger S, Doermann D (2006) A robust stamp detection framework on degraded documents. IJDAR XIII:1–9
Jouili S, Coustaty M, Tabbone S, Ogier JM (2010) NAVIDOMASS: structural-based approaches towards handling historical documents. In: ICPR, pp 946–949
Wei L, Keogh E, Van Herle H, Mafra-Neto A (2005) Atomic wedgie: efficient query filtering for streaming time series. ICDM 490–497
Fornés A, Dutta A, Gordo A, Lladós J (2011) CVC-MUSCIMA: a ground truth of handwritten music score images for writer identification and staff removal. IJDAR 14
Renou J (1626) Les Oeuvres Pharmaceutiques du Sr Jean de Renou, Conseiller & Medecin du Roy
Acknowledgments
We thank all the digital archivists who produced the vast amounts of data that made this work possible, especially the NaviDoMass group who did exceptional work preparing and annotating the data. This work was funded by NSF awards 0803410 and 0808770. We also thank the reviewers for their useful comments.
About this article
Cite this article
Hu, B., Rakthanmanon, T., Campana, B.J.L. et al. Establishing the provenance of historical manuscripts with a novel distance measure. Pattern Anal Applic 18, 313–331 (2015). https://doi.org/10.1007/s10044-013-0332-z
DOI: https://doi.org/10.1007/s10044-013-0332-z