The 15th International Scientific Conference
eLearning and Software for Education
Bucharest, April 11-12, 2019
10.12753/2066-026X-19-000
DESIGNING A DOCUMENT IMAGE ANALYSIS SYSTEM ON 3 AXIS:
EDUCATION, RESEARCH AND PERFORMANCE
Giorgiana Violeta VLĂSCEANU, PhD, Costin-Anton BOIANGIU, Răzvan-Adrian DEACONESCU,
Marcel PRODAN, PhD, Cristian AVATAVULUI, PhD, Răzvan RUGHINIȘ, Irina MOCANU
Faculty of Automatic Control and Computer Science, University POLITEHNICA of Bucharest, Splaiul Independenței 313,
Bucharest, Romania
giorgiana.vlasceanu@cs.pub.ro, costin.boiangiu@cs.pub.ro, razvan.deaconescu@cs.pub.ro, marcoprod@gmail.com,
cristianavatavului@gmail.com, razvan.rughinis@cs.pub.ro, irina.mocanu@cs.pub.ro
Abstract: Technology advances to make life easier for people. We tend to surround ourselves with devices that are as small as possible while offering the highest computing power, and the need to access data from anywhere has become essential. As a consequence, digital documents have been gaining ground on printed ones, which in some sectors have even been replaced entirely. The need and the obligation to preserve the written cultural heritage, represented by books and valuable documents, some of them rare and even unique, led us to imagine a system that protects this patrimony while also making it accessible. In order to make books easily available to the public, at the lowest possible risk to the originals, we arrived at the idea of designing and building an efficient digitization system for these records. The current article presents the proposed architecture of a Document Image Analysis System that processes the information with individual modules for each type of operation. The main goal of such a tool is to recognize information in the documents and extract it for electronic use. The flow of operations is indicated by the user, and some steps can be skipped depending on the user's needs. In order to design an efficient Document Image Analysis System, we need a three-axis approach: Education - involving students who can receive tasks for replacing modules and validating their homework; Research - performing various tests; and Performance - testing the module interconnection and enabling the system to be extremely configurable. Whichever axis is considered, the main concern is the flexibility of the system, achieved through individual modules built as physical binaries, or collections of binaries linked via scripts. Each module is designed to accomplish a certain major task by executing several sub-tasks whose results, in most cases, are subject to an intelligent voting process that produces the module's output data.
Keywords: Retroconversion; Document Image Analysis; Optical Character Recognition; OCR;
Digitization; Document Export; Lib2life
I. INTRODUCTION
The main goal of this paper is to present a modular system with Optical Character Recognition functionality. The design implies a multitude of components, which offer the possibility of dynamic runtime configuration. Existing systems usually have a monolithic architecture, so the user cannot configure the processing steps; his or her choices are limited to selecting the output file format (PDF, DOC or TXT) or the areas of the documents that will undergo content extraction.
II. DIAS ARCHITECTURE
The proposed system has a particular structure organized in modules as shown in Figure 1.
Each functionality is realized by an independent component. Typically, a component is an executable,
retrieving parameters from the command line and outputting images, XML or JSON files. Each
module is designed to accomplish a specific task by doing a series of sub-tasks in order to collect a
series of candidates. In most cases, the voting system is added in this step in order to choose the best
candidate.
The proposed flow of the system can be customized by the user at runtime.
Figure 1. System architecture
Architecture Design Principles
The aim is to have a modular, executable-centric architecture where each module gets input
from another module or from the user and provides the output that will be used by another module.
There is an Executor component responsible for overseeing the process and commanding each
modular executable as in Figure 2.
Figure 2. Overview of Component Interconnection
The Executor receives a list of image files and an initial XML configuration file. The XML configuration file specifies the modules to be used and their order, together with parameters that apply to the overall process or to a particular module. Each image file is fed to the first module together with a per-module XML configuration file created by the Executor. The module outputs a processed image file that the Executor then passes to the next module, together with a new XML configuration file.
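As an illustration, an initial configuration of the kind described above might look like the following sketch; the element and attribute names are our assumptions, not a fixed schema:

```xml
<!-- Hypothetical Executor configuration: module order plus global
     and per-module parameters -->
<dias-config>
  <global save-intermediary="true" output-format="pdf"/>
  <pipeline>
    <module name="smart-grayscale"/>
    <module name="deskew">
      <param name="search-range" value="-10..10"/>
    </module>
    <module name="binarization"/>
  </pipeline>
</dias-config>
```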
The Executor may choose to run each module in a pipeline, i.e. each module processes a given
file at a time while another module processes another file, to increase the processing throughput.
Intermediary files may be saved if enabled in the XML configuration file.
Moreover, the Executor will create a new process from each module executable file and for
each processed file using the standard process creation API. It will wait for a process to complete and
use its output as input for another new process from another executable module.
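A minimal sketch of this Executor loop in Python follows. The command-line convention (input path, then output path) is our assumption; real modules would also receive the per-module XML configuration file described above:

```python
import os
import subprocess

def run_pipeline(modules, input_path, workdir):
    """Run each module executable in sequence: every module is started as a
    new process, reads the previous stage's output file and writes its own.
    `modules` is a list of argument vectors, e.g. [["./deskew"], ...]."""
    current = input_path
    for i, cmd in enumerate(modules):
        out_path = os.path.join(workdir, "stage%d.out" % i)
        # Hypothetical convention: <module> <input-file> <output-file>
        subprocess.run(cmd + [current, out_path], check=True)  # wait for exit
        current = out_path  # this stage's output feeds the next stage
    return current
```

The `check=True` mirrors the Executor waiting for each process to complete before feeding its output to the next one.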
While not part of the current design, the modular approach may allow for each module
running on separate machines, adding further throughput. However, this will require designing a
protocol between the Executor and each module running on a remote machine.
III. DIAS MODULES
Modules for the proposed system are grouped in classes, in correlation with the main task they
try to solve. The architecture based on individual modules (binaries) has several advantages in
different fields:
Educational: students can receive a basic task, for instance, to replace one individual
module, or a more complex one like designing a full module class, while being fully
able to exploit the system and validate their homework;
Research: researchers may perform various tests (mostly in a trial-and-error manner)
by replacing a module and examining the overall impact and system performance;
Production: the module interconnection can be highly complex, enabling the
system to be extremely configurable, to provide feedback on its own errors and to take
the necessary actions fully autonomously.
3.1 Import module
The modules in the import class should be able to fully determine the document skeleton
based on a collection of input image pages. The document skeleton will be an XML file with a
predefined schema [1]. All the pages should contain links to their image data.
The import XML should also contain the document native data, such as author(s), title, title of
the series, issue and/or volume number, years of publication (original work and reprint), publishing
house, the number of pages, language(s), the paper format, and some document acquisition details
(paper/printing support degradation phase, scanner or camera brand and type, scanning/photo settings
used, illumination type, etc.)
A basic import system, like the one used in the Lib2Life project [2], should only be able to select a folder of image pages and input the main bibliographical data of the imported work. Future work includes extending the Import module with support for multiple physical images per logical page, re-registering the folding of pages, extended and custom-defined document metadata, and page reordering and (re)numbering capabilities.
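To fix ideas, a document skeleton of the kind described above might look like the fragment below; the element names and sample values are illustrative assumptions, not the project's predefined schema [1]:

```xml
<!-- Hypothetical import skeleton: bibliographic data plus page links -->
<document>
  <metadata>
    <author>Ion Creangă</author>
    <title>Amintiri din copilărie</title>
    <publication-year original="1892" reprint="1959"/>
    <language>ro</language>
    <acquisition support-degradation="low" scanner="(brand and model)"
                  illumination="(type)"/>
  </metadata>
  <pages>
    <page number="1" image="pages/0001.tif"/>
    <page number="2" image="pages/0002.tif"/>
  </pages>
</document>
```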
3.2 Smart Grayscale Conversion module
An image in grayscale tones will be needed for the processing stages that require a continuous-space, single-component input rather than a multi-component (like RGB) or discontinuous (clusterized) binarized one. The scientific novelty of the proposed approach is the use of a smart grayscale-like conversion that enhances the perceptual color differences. Let us imagine that we are reading medium-blue text on light-red paper. Because the grayscale component is a linear combination of the RGB data, both background and foreground may map to the same grayscale nuance: although the text is perfectly readable in color, the result is a gray, empty-looking piece of paper. This problem is well known when printing colored documents on a monochrome printer. One of the contributions of the Lib2Life project [2] is a Smart Grayscale converter which solves situations like the one presented above by "faking" the grayscale and maximizing the color differences perceived by the human visual mechanism. The perceived difference is measured using the CIE-LAB [3] color space and the DeltaE distance metric [4].
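The blue-on-red example can be made concrete with a small sketch: converting sRGB to CIE-LAB [3] under the usual D65 white point and computing the original (CIE76) DeltaE distance [4]. This illustrates the metric only, not the project's converter:

```python
import math

def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIE-LAB (D65 white point)."""
    # Undo the sRGB gamma curve
    lin = [c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
           for c in (v / 255.0 for v in rgb)]
    r, g, b = lin
    # Linear RGB -> XYZ (sRGB primaries)
    x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = 0.0193 * r + 0.1192 * g + 0.9505 * b
    # XYZ -> LAB
    def f(t):
        return t ** (1 / 3) if t > 0.008856 else 7.787 * t + 16.0 / 116.0
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)

def delta_e76(c1, c2):
    """Euclidean distance in LAB space: the original DeltaE metric."""
    return math.dist(srgb_to_lab(c1), srgb_to_lab(c2))
```

A plain luma-weighted grayscale can collapse perceptually distinct colors whose DeltaE is large, which is exactly what the smart converter tries to avoid.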
3.3 Skew Detection and Correction (Deskew) module
The Deskew class of modules comprises several feature-based skew detectors and one smart voting mechanism. The independent skew modules take into account different features of the input document, so that, if one fails, the probability of still getting a correct result is increased. For that purpose, we are planning to test the following:
A projection-profiling technique [5] based on the alignment of characters. The more characters there are, the better the projection variance, and thus the higher the confidence in the detected skew angle. It will be used on the results of a fast segmentation routine applied to the Smart Grayscale Conversion output.
A generalized Hough transform [6] to detect the (near-horizontal and/or near-vertical) line segments in the document. The longer the lines, taken as a percentage of the document size, the better the confidence returned by the module. It will be used on the Smart Grayscale Conversion image.
A Fast Fourier Transform (FFT) [7] in which the dominant skew will be identified in the amplitude of the result. The better the separation between classes when applying a thresholding operation on the amplitude, the better the confidence returned by the module. It will be used on the Smart Grayscale Conversion grayscale image.
A voting mechanism which combines the aforementioned individual modules, taking into account their detected skew angles, their overall probability of success, their returned confidence and the characteristics of the input document. The output of this module will be the result of the skew detection.
In the Lib2Life project, due to the fixed position of the on-paper input documents in the scanner and the text-based nature of the book collections, it is possible that this collection of modules will not be used at all, or that the result of the projection-profiling module will be forwarded directly as the output. Tests on the specific collections will tell which approach achieves the best balance between processing speed and accuracy.
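As an illustration of the projection-profiling idea, the sketch below brute-forces the angle that maximizes the variance of the horizontal projection profile of a set of foreground points (a toy stand-in for segmented characters; the search range and bin size are arbitrary choices, not the project's parameters):

```python
import math
from collections import Counter

def profile_variance(points, angle_deg, bin_size=1.0):
    """Variance of the horizontal projection profile after de-rotating the
    point cloud by angle_deg. Text lines parallel to the x-axis concentrate
    points into few bins, so the variance peaks at the true skew angle."""
    a = math.radians(angle_deg)
    bins = Counter()
    for x, y in points:
        y0 = -x * math.sin(a) + y * math.cos(a)  # de-rotated y coordinate
        bins[round(y0 / bin_size)] += 1
    lo, hi = min(bins), max(bins)
    counts = [bins.get(b, 0) for b in range(lo, hi + 1)]  # include empty rows
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts) / len(counts)

def detect_skew(points, lo=-10.0, hi=10.0, step=0.5):
    """Exhaustive search over candidate angles (in degrees)."""
    candidates = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    return max(candidates, key=lambda a: profile_variance(points, a))
```

The returned variance can double as the module's confidence value for the voting stage.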
3.4 Image Processing module
The Document Image Analysis System (DIAS) employed in [2] may use multiple image processors, with two main purposes:
Image enhancement for automatic processing and better specific features:
o Horizontal and Vertical Line enhancer. It will be used to better detect H/V lines and tables in subsequent processing stages, and to contribute to the Layout Analysis system. It is optional in the Lib2Life project [2], if the preliminary tests reveal that the lines are already consistent in the original images.
o Page dewarping. It will be used to dewarp image documents when the books cannot be flattened enough to touch the scanning area, or when pages suffer from geometrical distortions due to exposure to moisture and/or fast drying. This module is not likely to be necessary for the Lib2Life project [2], due to the good preservation stage of the collections to be digitized.
o Noise reduction. The most suitable noise-suppression mechanism for image documents should be identified: at least Gaussian Blur, Median Filtering, Bilateral Filtering and, perhaps, an original approach of our research group called "DifferenceGatherer" will be tested.
Image enhancement for visual impact and better readability:
o Sharpen. A sharpening and edge enhancement technique will be employed in order to enhance the readability of the texts. It may not be necessary for the Lib2Life collections, due to the overall good visual separation between text and background.
o Tone Curve. The best tone curve for enhancing contrast and illumination should be identified. As a scientific advancement, a current original approach of our research group, the "MaxOnMinVariance" approach, will be finalized, tested and deployed.
3.5 Locality and Globality
This is a novel scientific development of our research group. The purpose of this module is to offer the "best locality" and the "best globality" as windows in the image space, determined for every pixel. The "best locality" is the window over which a local algorithm is computed, while the "globality" is the window within which a global algorithm operates. The "globality" should not be the entire space of the document, since the document may contain items that are unrelated to each other: for example, different pages scanned together, or items with totally different characteristics, such as columns with different fonts and formatting. This is the subject of an in-development research activity of our group, and the preliminary results are very promising.
3.6 Binarization module
This module class performs the foreground-background separation. A scientifically complex and totally new approach will be employed, with locality-globality weighted binarization methods as candidates and a smart voting mechanism as the final processing stage. The following individual modules will be used:
A locally-globally weighted binarization which offers the best signal-to-noise ratio;
A locally-globally weighted binarization which encourages the recovery of formatting structures like lines and tables;
A binarization using a per-pixel machine learning approach, in which the most important individual pixel features are obtained using the locality-globality approach.
The voting mechanism will try to offer the best compromise between a per-pixel majority vote and the individual characteristics of the individual binarization modules.
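The final per-pixel vote can be sketched as follows, with plain global thresholds standing in for the locality-globality weighted candidates, which are still under development:

```python
def threshold(img, t):
    """Trivial global binarization: 1 = foreground (dark pixel)."""
    return [[1 if p < t else 0 for p in row] for row in img]

def majority_vote(candidates):
    """Per-pixel majority vote over candidate binarizations
    (all given as equally-sized lists of rows of 0/1 values)."""
    h, w = len(candidates[0]), len(candidates[0][0])
    return [[1 if 2 * sum(c[y][x] for c in candidates) > len(candidates) else 0
             for x in range(w)] for y in range(h)]
```

The real mechanism would additionally weight each candidate by the characteristics of its producing module, as described above.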
3.7 Layout Analyzer module
The Layout Analyzer (LA) [8] will operate in intra-page mode and will rely on a voting mechanism. The input layout candidates will be obtained by running the Tesseract OCR Engine [9] analyzer on the binarized image, the smart grayscale image and the original document.
The results will be mixed according to the overall accuracy obtained on the individual files, to the text confidence reported by the Tesseract engine for the individual layout elements, and to the probability of a coherent, plausible geometric layout. The Tesseract OCR engine operates mainly in binary mode, so a powerful binarization helps enormously; the expectation is that, as the Tesseract engine advances, more and more features will be computed directly in continuous spaces, both grayscale and color.
3.8 OCR
The OCR result will be obtained from the Tesseract engine in the processing phase subsequent to the layout analyzer. Again, after more comprehensive tests performed on the Lib2Life [2] project's book and newspaper collections, it is possible to conclude that only the binarized version of the image document should be OCR-ed, if operating in native true-color or smart grayscale mode is not worth the extra processing time and does not add significantly better text quality.
3.9 Hierarchy Analyzer
The Hierarchy Analyzer (HA) [10] will operate in inter-page mode. The HOCR output files for every page of the input document will be aggregated, and the layout elements will be classified so that the document receives a "table of contents"-like structure, if the module is able to detect and "understand" one. The layout elements will be marked as Title, Subtitle, Heading 1, Heading 2 and so on, using geometric features, measurements of the fonts included in the elements, and formulation heuristics (the element begins with "Chapter #…", is numbered like "#.#.#…", has a page number in Roman format so it probably belongs to the preface, and so on).
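The formulation heuristics above might be prototyped with simple pattern matching; the label mapping below is illustrative, not the project's actual rule set:

```python
import re

def classify_element(text):
    """Classify a layout element's text using formulation heuristics."""
    if re.match(r'(?i)^chapter\s+\d+', text):
        return "Heading 1"
    if re.match(r'^\d+\.\d+\.\d+', text):      # numbered like "#.#.#"
        return "Heading 3"
    if re.match(r'^\d+\.\d+', text):           # numbered like "#.#"
        return "Heading 2"
    if re.match(r'^[ivxlc]+$', text.strip(), re.IGNORECASE):
        return "Front matter page number"      # Roman numeral -> preface
    return "Body"
```

In the real module such textual cues would be combined with the geometric and font measurements mentioned above.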
3.10 The Document Image Compressor
In order to efficiently store the document image pages in the output container, a document-specific compression technique will be employed, based on the Mixed Raster Content (MRC) technology [11]. MRC splits the image into three planes of different sizes, which are finally assembled and aggregated in the container at the same resolution as the original image. These planes provide the following information: a selector mask deciding which pixels belong to the foreground and which to the background (stored in binary format at the native resolution), a foreground plane and a background plane (both stored in continuous tones and at lower resolutions). These elements are then compressed using technology adequate to their bit-depth and spatial-frequency composition, resulting in a very small storage footprint in the output container.
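A simplified decomposition into the three planes might look like the sketch below. Real MRC implementations also fill the "holes" each plane leaves in the other before compression; here missing pixels are simply zeroed:

```python
def mrc_planes(img, t=128, scale=2):
    """Split a grayscale page (a list of pixel rows) into MRC-style planes:
    a native-resolution binary selector mask plus lower-resolution
    foreground and background planes."""
    h, w = len(img), len(img[0])
    # Selector mask: 1 = foreground (dark pixel), stored at full resolution
    mask = [[1 if img[y][x] < t else 0 for x in range(w)] for y in range(h)]

    def plane(select_fg):
        # Average only this plane's pixels inside each scale x scale block
        out = []
        for y in range(0, h, scale):
            row = []
            for x in range(0, w, scale):
                vals = [img[yy][xx]
                        for yy in range(y, min(y + scale, h))
                        for xx in range(x, min(x + scale, w))
                        if (mask[yy][xx] == 1) == select_fg]
                row.append(sum(vals) // len(vals) if vals else 0)
            out.append(row)
        return out

    return mask, plane(True), plane(False)
```

Each plane can then be handed to a codec suited to its bit-depth, as the paragraph above describes.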
3.11 Export module
The export will be performed using the PDF format [12]. The original appearance of the document will be preserved: the aforementioned MRC component will be stored to provide the look-and-feel of the on-paper document, and an invisible layer of text will contain all the formatting and metadata obtained during DIAS processing. The PDF will look like the on-paper document, but it will offer the search, structure, and text copy-and-paste features of a modern digital document.
3.12 The Operation/Correction Interfaces
The proposed processing flow is very complex and contains numerous scientific
advancements. It will be capable of running in full background mode using automated scripts so, in
normal circumstances, graphical interfaces will not be necessary at all. Also, it is expected that the
system will have very few errors and a very small memory footprint.
However, after a thorough analysis of the runs on the Lib2Life [2] prototype collections, and after discussing the results with the users of the system and collecting feedback from all the stakeholders of the project, it is possible that some correction stages will be designed and implemented.
For this purpose, the modules will operate with clear, easy-to-edit input/output data formats: images, XML files and JSON files, which can be edited and fine-tuned using well-known, powerful and free editors like GIMP [13], XML Copy Editor [14] or a JSON editor.
If, after processing the user feedback reports, a need emerges for specific correction tools dedicated to some of the processing stages, a collection of independent modules with a visually-based operating mode may be employed to correct the output of one of the following stages: Import, Deskew, Layout Analysis, Hierarchy Analysis and/or OCR.
IV. CONCLUSIONS
The universe of Document Image Analysis Systems keeps growing. Unfortunately, the majority of the proposed applications do not cover the entire processing flow and all the components involved. The system presented in this paper has a three-axis design, involving education processes, research in the field and performance. Moreover, it is easily configurable and completely customizable, while offering robust daily use and good-quality results.
Acknowledgements
This work was supported by a grant of the Romanian Ministry of Research and Innovation,
CCCDI - UEFISCDI, project number PN-III-P1-1.2-PCCDI-2017-0689 / „Lib2Life- Revitalizarea
bibliotecilor și a patrimoniului cultural prin tehnologii avansate” / "Revitalizing Libraries and Cultural
Heritage through Advanced Technologies", within PNCDI III.
Reference Text and Citations
[1] Y. Ishitani, Document transformation system from papers to XML data based on pivot XML document method, Seventh International Conference on Document Analysis and Recognition, Proceedings, Edinburgh, UK, 2003, pp. 250-255, vol. 1. doi: 10.1109/ICDAR.2003.1227668
[2] Lib2Life - Revitalizarea bibliotecilor și a patrimoniului cultural prin tehnologii avansate, Available online:
https://www.ici.ro/pn3-lib2life/, Accessed at: November 21, 2018.
[3] Hannah Weller, CIELab Analyses, Available online: https://cran.r-project.org/web/packages/colordistance/vignettes/lab-analyses.html, Accessed at: November 27, 2018
[4] D. Silverstein, X. Zhang, J. Farrell and B. Wandell, Color image quality metric S-CIELAB and its application on
halftone texture visibility, Computer Conference, IEEE International (COMPCON), San Jose California, 1997, pp. 44.
doi:10.1109/CMPCON.1997.584669
[5] Roman Ptak, Bartosz Zygadlo, Olgierd Unold, Projection–Based Text Line Segmentation with A Variable Threshold,
Int. J. Appl. Math. Comput. Sci., 2017
[6] D. H. Ballard, Generalizing the Hough Transform to Detect Arbitrary Shapes, Pattern Recognition, Vol. 13, No. 2, pp. 111-122, 1981
[7] S. Allen Broughton, Kurt Bryan, Discrete Fourier Analysis and Wavelets: Applications to Signal and Image Processing,
2nd Edition, 2018
[8] Mahmoud Soua, Alae Benchekroun, Rostom Kachouri, Mohamed Akil. Real-time text extraction based on the page
layout analysis system. SPIE Conference on Real-Time Image and Video Processing, Apr 2017
[9] Tesseract-OCR. Available online: https://github.com/tesseract-ocr, Accessed at: November 30, 2018
[10] Song Mao, Azriel Rosenfeld, Tapas Kanungo, Document Structure Analysis Algorithms: A Literature Survey, Proceedings of SPIE - The International Society for Optical Engineering 5010:197-207, DOI: 10.1117/12.476326
[11] ISO/IEC 16485:2000 Information technology -- Mixed Raster Content (MRC), Available online:
https://www.iso.org/standard/32228.html, Accessed at: November 30, 2018
[12] ISO 32000-2:2017 Document management -- Portable document format -- Part 2: PDF 2.0, Available online: https://www.iso.org/standard/63534.html, Accessed at: November 30, 2018
[13] GIMP - GNU IMAGE MANIPULATION PROGRAM. Available online: https://www.gimp.org, Accessed at: November
21, 2018
[14] XML Copy Editor. Available online: http://xml-copy-editor.sourceforge.net/, Accessed at: November 30, 2018