The 15th International Scientific Conference eLearning and Software for Education
Bucharest, April 11-12, 2019
10.12753/2066-026X-19-000

DESIGNING A DOCUMENT IMAGE ANALYSIS SYSTEM ON 3 AXIS: EDUCATION, RESEARCH AND PERFORMANCE

Giorgiana Violeta VLĂSCEANU, PhD, Costin-Anton BOIANGIU, Răzvan-Adrian DEACONESCU, Marcel PRODAN, PhD, Cristian AVATAVULUI, PhD, Răzvan RUGHINIȘ, Irina MOCANU
Faculty of Automatic Control and Computer Science, University POLITEHNICA of Bucharest, Splaiul Independenței 313, Bucharest, Romania
giorgiana.vlasceanu@cs.pub.ro, costin.boiangiu@cs.pub.ro, razvan.deaconescu@cs.pub.ro, marcoprod@gmail.com, cristianavatavului@gmail.com, razvan.rughinis@cs.pub.ro, irina.mocanu@cs.pub.ro

Abstract: Technology advances to make life easier for people. We tend to surround ourselves with devices that are as small as possible yet offer the highest computing power, and access to data from anywhere has become essential. As a consequence, digital documents have been gaining ground on printed ones, and in some sectors the latter have even been replaced. The need and the obligation to preserve the written cultural heritage, represented by books and valuable documents, some of them rare and even unique, led us to imagine a system that protects the patrimony while also making it accessible. In order to make books easily available to the public, at the lowest possible risk to the originals, we arrived at the idea of designing and creating an efficient digitization system for these records. The current article presents the proposed architecture of a Document Image Analysis System that processes the information with individual modules for each type of operation. The main goal of such a tool is to recognize the information in the documents and extract it for electronic use. The flow of operations is indicated by the user, and some steps can be eliminated depending on the user's needs.
In order to design an efficient Document Image Analysis System, we need a three-axis approach: Education - involving students, who can receive tasks for replacing modules and can validate their homework; Research - performing various tests; and Performance - testing the module interconnection and enabling the system to be extremely configurable. No matter which axis is considered, the main goal is the flexibility of the system, achieved through individual modules, either physical binaries or collections of binaries linked via scripts. Each module is designed to accomplish a certain major task by executing several sub-tasks whose results, in most cases, are subject to an intelligent voting process that produces the module's output data.

Keywords: Retroconversion; Document Image Analysis; Optical Character Recognition; OCR; Digitization; Document Export; Lib2life

I. INTRODUCTION

The main goal of this paper is to present a modular system with Optical Character Recognition functionality. This design implies a multitude of components, which offer the possibility of a dynamic runtime configuration. Existing systems usually have a monolithic architecture, so the user cannot configure the processing steps; his/her choices are limited to selecting the format of the output file (PDF, DOC or TXT) or the areas of the document which should undergo content extraction.

II. DIAS ARCHITECTURE

The proposed system has a particular structure organized in modules, as shown in Figure 1. Each functionality is realized by an independent component. Typically, a component is an executable, retrieving parameters from the command line and outputting images, XML or JSON files. Each module is designed to accomplish a specific task by performing a series of sub-tasks in order to collect a series of candidates. In most cases, a voting system is added at this step in order to choose the best candidate. The proposed flow of the system can be customized by the user at runtime.

Figure 1. System architecture

Architecture Design Principles

The aim is to have a modular, executable-centric architecture where each module gets input from another module or from the user and provides output that will be used by another module. There is an Executor component responsible for overseeing the process and commanding each modular executable, as in Figure 2.

Figure 2. Overview of Component Interconnection

The Executor receives a list of image files and an initial XML configuration file. The XML configuration file specifies the modules to be used and their order, along with parameters for the overall process or particular to a given module. Each image file is fed to the first module together with a per-module XML configuration file created by the Executor. The module outputs a processed image file that will then be passed by the Executor to another module, together with a new XML configuration file. The Executor may choose to run the modules as a pipeline, i.e. each module processes one file while another module processes a different file, to increase the processing throughput. Intermediary files may be saved if this is enabled in the XML configuration file. Moreover, the Executor will create a new process from each module executable file, for each processed file, using the standard process creation API. It will wait for a process to complete and use its output as input for a new process created from another executable module. While not part of the current design, the modular approach may allow each module to run on a separate machine, adding further throughput. However, this will require designing a protocol between the Executor and each module running on a remote machine.

III. DIAS MODULES

The modules of the proposed system are grouped in classes, in correlation with the main task they try to solve.
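The Executor's orchestration of module executables, described in Section II, can be sketched as below. The module names, executable file names, and XML attribute names are illustrative assumptions, not the project's actual configuration schema; the sketch only shows the chaining of one module's output into the next module's input.

```python
import subprocess
import xml.etree.ElementTree as ET

# Hypothetical pipeline configuration; module names, executables and
# attributes are illustrative, not the real Lib2Life schema.
CONFIG = """
<pipeline saveIntermediates="true">
    <module name="deskew" exe="deskew.exe"/>
    <module name="binarize" exe="binarize.exe" threshold="adaptive"/>
    <module name="ocr" exe="ocr.exe" lang="ron"/>
</pipeline>
"""

def build_commands(config_xml, image_path):
    """Translate the pipeline configuration into per-module command lines.

    Each module is an independent executable that reads the previous
    module's output image and writes a new one, mirroring the
    executable-centric design of the Executor."""
    root = ET.fromstring(config_xml)
    commands = []
    current_input = image_path
    for i, module in enumerate(root.findall("module")):
        output = f"stage{i}_{module.get('name')}.png"
        cmd = [module.get("exe"), "--in", current_input, "--out", output]
        # Forward any module-specific parameters from the XML.
        for key, value in module.attrib.items():
            if key not in ("name", "exe"):
                cmd += [f"--{key}", value]
        commands.append(cmd)
        current_input = output  # chain: this output feeds the next module
    return commands

def run_pipeline(config_xml, image_path):
    """Spawn one process per module and wait for each to complete."""
    for cmd in build_commands(config_xml, image_path):
        subprocess.run(cmd, check=True)
```

In a real deployment the Executor would also emit the per-module XML configuration files and could interleave several pages through the pipeline concurrently; the sketch keeps only the sequential, single-file case.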
The architecture based on individual modules (binaries) has several advantages in different fields:
- Educational: students can receive a basic task, for instance replacing one individual module, or a more complex one such as designing a full module class, while being fully able to use the system and validate their homework;
- Research: researchers may perform various tests (mostly in a trial-and-error manner) by replacing a module and examining the overall impact and system performance;
- Production: the module interconnection can be highly complex, enabling the system to be extremely configurable, to provide feedback on its own errors, and to take the necessary corrective actions fully autonomously.

2.1 Import module

The modules in the import class should be able to fully determine the document skeleton based on a collection of input image pages. The document skeleton will be an XML file with a predefined schema [1]. All the pages should contain links to their image data. The import XML should also contain the document's native data, such as author(s), title, title of the series, issue and/or volume number, years of publication (original work and reprint), publishing house, number of pages, language(s), paper format, and some document acquisition details (paper/printing support degradation phase, scanner or camera brand and type, scanning/photo settings used, illumination type, etc.). A basic import system, like the one used in the Lib2Life project [2], should only be able to select a folder with image pages and input the main bibliographical data of the imported work. Future work includes extending the Import module to support multiple physical images per logical page, re-registering the folding of pages, both extended and custom-defined document metadata, and page reordering and (re)numbering capabilities.
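A document skeleton holding the bibliographical and acquisition metadata listed above might look like the following fragment. All element and attribute names here are illustrative assumptions; the actual predefined schema [1] is set by the project, and the ellipses are placeholders, not real values.

```xml
<!-- Illustrative skeleton only; the real schema is predefined by the project -->
<document>
  <metadata>
    <title>...</title>
    <authors>
      <author>...</author>
    </authors>
    <series title="..." issue="..." volume="..."/>
    <publication originalYear="..." reprintYear="..." publisher="..."/>
    <languages>
      <language>ro</language>
    </languages>
    <format pages="..." paperFormat="..."/>
    <acquisition degradationPhase="..." scanner="..."
                 settings="..." illumination="..."/>
  </metadata>
  <pages>
    <page number="1" image="pages/page_0001.png"/>
    <page number="2" image="pages/page_0002.png"/>
  </pages>
</document>
```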
2.2 Smart Grayscale Conversion module

A grayscale image will be needed for the processing stages that require a continuous-space, single-component input rather than a multi-component (like RGB) or discontinuous (clustered) binarized one. The scientific novelty of the proposed approach is the use of a smart grayscale-like conversion that enhances the perceptual color differences. Let us imagine that we are reading medium-blue text on light-red paper. Because the grayscale component is a linear combination of the RGB data, both background and foreground may map to the same gray level; although the text is perfectly readable in color, the result is a gray, empty-looking piece of paper. This problem is well known when printing colored documents on monochrome printers. One of the contributions of the Lib2Life project [2] is a Smart Grayscale converter which solves situations like the one presented above by "faking" the grayscale and maximizing the color differences perceived by the human visual system. The differentiated perception is measured using the CIE-LAB [3] color space and the DeltaE distance metric [4].

2.3 Skew Detection and Correction (Deskew) module

The Deskew class of modules comprises several feature-based skew detectors and one smart voting mechanism. The independent skew modules take into account different features of the input document, so that, if one fails, the probability of still getting a correct result is increased. For that purpose, we are planning to test the following:
- A projection-profiling technique [5] based on the alignment of characters. The more characters there are, the better the projection variance, and thus the higher the confidence in the detected skew angle. It will be used on the results of a fast segmentation routine applied to the Smart Grayscale Conversion output.
- A generalized Hough transform [6] to detect the (near-horizontal and/or near-vertical) line segments in the document.
The longer the lines, taken as a percentage of the document size, the higher the confidence returned by the module. It will be used on the Smart Grayscale Conversion image.
- A Fast Fourier Transform (FFT) [7], in which the dominant skew will be identified in the amplitude of the result. The better the separation between classes when applying a thresholding operation on the amplitude, the higher the confidence returned by the module. It will be used on the Smart Grayscale Conversion image.
- A voting mechanism which will combine the aforementioned individual modules, taking into account their detected skew angles, their overall probability of success, their returned confidences and the characteristics of the input document. The output of this module will be the final result of the skew detection.

In the Lib2Life project, due to the fixed position of the input on-paper documents in the scanner and the text-based nature of the book collections, it is possible that this module collection will not be used at all, or that the result of the projection-profiling module could be forwarded directly as the output. Tests on the specific collections will tell which approach is the recommended choice for achieving the best balance between processing speed and accuracy.

2.4 Image Processing module

The Document Image Analysis System (DIAS) employed in [2] may use multiple image processors, with two main purposes:
- Image enhancement for automatic processing and better specific features:
  - Horizontal and Vertical Line enhancer. It will be used to better detect H/V lines and tables in subsequent processing stages, and to contribute to the Layout Analysis system. It is optional in the Lib2Life project [2], if the preliminary tests reveal that the lines are already consistent in the original images.
  - Page dewarping.
It will be used to dewarp image documents when the books cannot be flattened enough to touch the scanning area, or when pages suffer from geometrical distortions due to exposure to moisture and/or fast drying. This module is not likely to be necessary for the Lib2life project [2], due to the good preservation state of the collections to be digitized.
  - Noise reduction. The most suitable noise-suppression mechanism for image documents should be identified: at least Gaussian Blur, Median Filtering and Bilateral Filtering will be tested, and perhaps an original approach of our research group called "DifferenceGatherer".
- Image enhancement for visual impact and better readability:
  - Sharpen. A sharpening and edge enhancement technique will be employed in order to improve the readability of the texts. It may not be necessary for the Lib2Life collections, due to the overall good visual separation between text and background.
  - Tone Curve. The best tone curve should be identified in order to enhance the contrast and illumination. As a scientific advancement, a current original approach of our research group, "MaxOnMinVariance", will be finalized, tested and deployed.

2.5 Locality and Globality

This is a novel scientific development of our research group. The purpose of this module is to offer the "best locality" and the "best globality" as windows in the image space, determined for every pixel. The "best locality" is the window on which a local algorithm will be computed, while the "globality" is the window in which a global algorithm will operate. The "globality" should not be the entire space of the document, since the document may contain items that are not related to each other: for example, different pages scanned together, or items with totally different characteristics, such as columns with different fonts and formatting.
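Of the noise-suppression candidates listed in the Image Processing module above, the median filter is the simplest to sketch. The following is a minimal, generic illustration on a grayscale image stored as a list of lists of intensities; it is not the project's "DifferenceGatherer" approach, which remains unpublished.

```python
def median_filter_3x3(img):
    """Apply a 3x3 median filter to a grayscale image (list of lists of
    0-255 intensities). Border pixels are left unchanged for brevity;
    a production filter would pad or mirror the borders instead."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]  # copy so the input is untouched
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(
                img[yy][xx]
                for yy in (y - 1, y, y + 1)
                for xx in (x - 1, x, x + 1)
            )
            out[y][x] = window[4]  # median of the 9 neighborhood values
    return out
```

The median's appeal for document images is that it removes isolated salt-and-pepper noise without blurring stroke edges the way a Gaussian filter does.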
This locality-globality selection is the subject of an in-development research activity of our group, and the preliminary results are very promising.

2.6 Binarization module

This module class will perform the foreground-background separation. A scientifically complex and entirely new approach will be employed, with locality-globality weighted binarization methods as candidates and a smart voting mechanism as the final processing stage. The individual modules will be:
- A locally-globally weighted binarization, which will offer the best signal-to-noise ratio;
- A locally-globally weighted binarization, which will encourage the recovery of formatting structures like lines and tables;
- A binarization using a per-pixel machine learning approach, in which the most important individual pixel features will be obtained using the locality-globality approach.

The voting mechanism will try to offer the best compromise between a per-pixel majority vote and the individual characteristics of the individual binarization modules.

2.7 Layout Analyzer module

The Layout Analyzer (LA) [8] will operate in intra-page mode and will be driven by a voting mechanism. The input layout candidates will be obtained by employing the Tesseract OCR Engine [9] analyzer on the Binarization output, the Smart Grayscale output and the original document. The results will be mixed according to the overall accuracy obtained on the individual files, to the text confidence reported by the Tesseract engine for the individual layout elements, and to the probability of a coherent, plausible geometric layout. The Tesseract OCR engine operates mainly in binary mode, so a powerful binarization helps enormously; the expectation is that, as the Tesseract engine advances, more and more features will be computed directly in continuous spaces, both grayscale and color.

2.8 OCR module

The OCR results will be obtained with the Tesseract engine in the processing phase subsequent to the layout analyzer.
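Returning to the Binarization module above, the per-pixel majority vote that the smart voting mechanism refines can be sketched as follows. This is only the unweighted baseline; the project's actual mechanism additionally weighs the characteristics of each candidate binarization.

```python
def majority_vote(candidates):
    """Combine several binarization candidates (2D lists of 0/1 pixels,
    all the same size) by per-pixel majority voting: a pixel is marked
    foreground (1) only if more than half of the candidates agree."""
    h, w = len(candidates[0]), len(candidates[0][0])
    quorum = len(candidates) / 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            votes = sum(c[y][x] for c in candidates)
            out[y][x] = 1 if votes > quorum else 0
    return out
```

With an odd number of candidates the vote is always decisive; a weighted variant would replace the unit votes with per-module confidence scores.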
Again, after more comprehensive tests performed on the Lib2Life [2] project's book and newspaper collections, it is possible that we will conclude that the binarization-only version of the OCR-ed image document shall be used, because adding operations in native true-color or smart grayscale mode may not be worth the extra processing time, nor add significantly better quality to the text.

2.9 Hierarchy Analyzer

The Hierarchy Analyzer (HA) [10] will operate in inter-page mode. All the HOCR output files for every page of the input document will be aggregated, and the layout elements will be classified so that the document receives a "table of contents"-like structure, if the module is able to detect and "understand" one. The layout elements will be marked as Title, Subtitle, Heading 1, Heading 2, and so on, using geometric features, measurements of the fonts included in the elements, and formulation heuristics (e.g. begins with "Chapter #…", is numbered like "#.#.#…", has a page number in Roman format so it is probably included in the preface, and so on).

2.10 The Document Image Compressor

In order to efficiently store the document image pages in the output container, a document-specific compression technique will be employed, based on the Mixed Raster Content (MRC) technology [11]. MRC will split the image into three planes of different sizes, which will finally be assembled and aggregated in the container at the same resolution as the original image. These planes provide the following information: a selector mask deciding which pixels belong to the foreground and which to the background (stored in binary format at the native resolution), a foreground plane and a background plane (both stored in continuous tones and at lower resolutions).
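The three-plane split just described can be sketched as follows. The fixed threshold and the block-averaging downsampling are simplifying assumptions for illustration, not the behavior of a production MRC encoder.

```python
def split_mrc_planes(img, threshold=128, scale=2):
    """Split a grayscale page (list of lists, 0-255) into the three MRC
    planes: a full-resolution binary selector mask, plus lower-resolution
    foreground (ink) and background (paper) planes."""
    h, w = len(img), len(img[0])
    # Selector mask at native resolution: 1 = foreground, 0 = background.
    mask = [[1 if img[y][x] < threshold else 0 for x in range(w)]
            for y in range(h)]

    def downsample(plane_selector):
        """Average, per scale x scale block, only the pixels that the
        mask assigns to the requested plane; empty blocks default to 0."""
        out = []
        for y in range(0, h, scale):
            row = []
            for x in range(0, w, scale):
                vals = [img[yy][xx]
                        for yy in range(y, min(y + scale, h))
                        for xx in range(x, min(x + scale, w))
                        if mask[yy][xx] == plane_selector]
                row.append(sum(vals) // len(vals) if vals else 0)
            out.append(row)
        return out

    foreground = downsample(1)  # continuous-tone ink colors
    background = downsample(0)  # continuous-tone paper colors
    return mask, foreground, background
```

Each plane can then be handed to a codec suited to it: a bitonal codec for the mask and continuous-tone codecs for the two low-resolution planes, which is the source of MRC's compression gain.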
Together, these elements will be compressed using technology adequate to their bit depth and spatial-frequency composition, resulting in a very compact representation in the output container.

2.11 Export module

The export will be performed using the PDF format [12]. The original appearance of the document will be preserved: the aforementioned MRC component will be stored to provide the look-and-feel of the on-paper document, while an invisible layer of text will contain all the formatting and metadata obtained during DIAS processing. The PDF will look like the on-paper document, but it will have the search, structure, and text copy-and-paste features of a modern digital document.

2.12 The Operation/Correction Interfaces

The proposed processing flow is very complex and contains numerous scientific advancements. It will be capable of running fully in the background using automated scripts, so, in normal circumstances, graphical interfaces will not be necessary at all. It is also expected that the system will have very few errors and a very small memory footprint. However, after a thorough analysis of the runs on the Lib2Life [2] prototype collections, after discussing the results with the users of the system and collecting feedback from all the stakeholders of the project, it is possible that some correction stages will be designed and implemented. For this purpose, the modules will operate with clear, easy-to-edit input/output data formats (images, XML files, JSON files) that can be edited and fine-tuned using well-known, powerful and free editors like GIMP [13], XML Copy Editor [14] or a JSON editor. If, after processing the user feedback reports, a need emerges for specific correction tools dedicated to some of the processing stages, a collection of independent modules with a visually-based operating mode may be employed to correct the output of one of the following stages: Import, Deskew, Layout Analysis, Hierarchy Analysis and/or OCR.

IV. CONCLUSIONS

The universe of Document Image Analysis Systems is constantly growing. Unfortunately, the majority of the proposed applications do not offer a complete arrangement of the processing flow and of the components involved. The system presented in this paper has a three-axis design, involving education processes, research in the field and performance. Moreover, it is easily configurable and completely customizable, while offering robust daily use and good-quality results.

Acknowledgements

This work was supported by a grant of the Romanian Ministry of Research and Innovation, CCCDI - UEFISCDI, project number PN-III-P1-1.2-PCCDI-2017-0689 / "Lib2Life - Revitalizarea bibliotecilor și a patrimoniului cultural prin tehnologii avansate" ("Revitalizing Libraries and Cultural Heritage through Advanced Technologies"), within PNCDI III.

Reference Text and Citations

[1] Y. Ishitani, Document transformation system from papers to XML data based on pivot XML document method, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings., Edinburgh, UK, 2003, pp. 250-255, vol. 1. doi: 10.1109/ICDAR.2003.1227668
[2] Lib2Life - Revitalizarea bibliotecilor și a patrimoniului cultural prin tehnologii avansate, Available online: https://www.ici.ro/pn3-lib2life/, Accessed at: November 21, 2018
[3] Hannah Weller, CIELab Analyses, Available online: https://cran.r-project.org/web/packages/colordistance/vignettes/labanalyses.html, Accessed at: November 27, 2018
[4] D. Silverstein, X. Zhang, J. Farrell and B. Wandell, Color image quality metric S-CIELAB and its application on halftone texture visibility, IEEE International Computer Conference (COMPCON), San Jose, California, 1997, pp. 44. doi: 10.1109/CMPCON.1997.584669
[5] Roman Ptak, Bartosz Zygadlo, Olgierd Unold, Projection-Based Text Line Segmentation with a Variable Threshold, Int. J. Appl. Math. Comput. Sci., 2017
[6] D. H. Ballard, Generalizing the Hough Transform to Detect Arbitrary Shapes, Pattern Recognition, Vol. 13, No. 2, pp. 111-122, 1981
[7] S. Allen Broughton, Kurt Bryan, Discrete Fourier Analysis and Wavelets: Applications to Signal and Image Processing, 2nd Edition, 2018
[8] Mahmoud Soua, Alae Benchekroun, Rostom Kachouri, Mohamed Akil, Real-time text extraction based on the page layout analysis system, SPIE Conference on Real-Time Image and Video Processing, Apr. 2017
[9] Tesseract-OCR, Available online: https://github.com/tesseract-ocr, Accessed at: November 30, 2018
[10] Song Mao, Azriel Rosenfeld, Tapas Kanungo, Document Structure Analysis Algorithms: A Literature Survey, Proceedings of SPIE - The International Society for Optical Engineering, 5010:197-207. doi: 10.1117/12.476326
[11] ISO/IEC 16485:2000 Information technology -- Mixed Raster Content (MRC), Available online: https://www.iso.org/standard/32228.html, Accessed at: November 30, 2018
[12] ISO 32000-2:2017 Document management -- Portable document format -- Part 2: PDF 2.0, Available online: https://www.iso.org/standard/63534.html, Accessed at: November 30, 2018
[13] GIMP - GNU Image Manipulation Program, Available online: https://www.gimp.org, Accessed at: November 21, 2018
[14] XML Copy Editor, Available online: http://xml-copy-editor.sourceforge.net/, Accessed at: November 30, 2018