An ever-growing collection of tools for performing OCR.
Achieving good-quality OCR in one pass is not easy. Depending on the quality of the input, the process may take several iterations of improving the original image(s) to reach reasonable recognition quality, followed by some (often manual) correction of the recognised text to remove OCR errors. This is not a big problem when digitising a page or two, but processing a 500-page book is much harder. This project aims to help with complex OCR projects. Instead of providing one monolithic tool that includes every processing step a user could possibly want, it provides a number of smaller tools that perform only the obviously needed steps, such as the OCR itself, while allowing user-defined processing to be integrated into the pipeline.
The toolset wraps around a number of well-known programs that perform tasks like PDF or image processing, character recognition, etc., aiming to create an environment for iterative processing of large documents with the ability to utilise custom scripts.
For example, given a document text.pdf, the simplest OCR session may look like the following:
▶ mkdir book && cd book
▶ ocr-open ../text.pdf
ocr-open: processing file "../text.pdf"
ocr-open: extracting all pages
▶ ocr
ocr: processing page 1 [ "./page-01.pgm" ]
ocr: processing page 2 [ "./page-02.pgm" ]
{ ... }
ocr: processing page 15 [ "./page-15.pgm" ]
▶ ocr-ls --text | xargs cat
{ ... recognised text }
▶
In this simple example we first create a directory and cd into it. We then convert each page of the document text.pdf to an image using the ocr-open tool, and perform the actual character recognition with the ocr tool. The last command shows how other, custom tools can be integrated into the process with the help of the ocr-ls utility; here we use the standard Linux cat utility to display the recognised text.
Internally, the toolset operates on images in PGM format. This format was chosen as the lowest common denominator between all the tools wrapped by this toolset, and also because it is understood by the good old netpbm package, which is often a bit faster than ImageMagick for simple operations like image cropping.
All images are named using the pattern page-N.pgm, where N is the page number as in the source document, ranging from 1 to a maximum of 9999, with enough leading zeroes to ensure that an alphabetically sorted list of files gives the correct page order. The text recognised from each page is stored in a file named using the same pattern, but with the .txt extension. Most of the tools in this toolset can operate on a sub-range of pages via the -p or --pages command line option; see the help (-h or --help) of a particular tool. Generally, the toolset is designed to operate on "pages" rather than files, for convenience.
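The naming scheme can be illustrated with a small shell sketch. The helper page_file below is hypothetical and not part of the toolset. It assumes the number of digits equals the number of digits in the total page count, which matches the 15-page session above (page-01.pgm) but is otherwise a guess.

```shell
# Hypothetical helper (not part of the toolset): build an image file name
# for a page, zero-padded so that alphabetical order equals page order.
# Assumption: the padding width is the number of digits in the page count.
page_file() {
    page=$1
    total=$2
    width=${#total}                      # e.g. 15 -> 2 digits, 500 -> 3 digits
    printf "page-%0${width}d.pgm\n" "$page"
}
```

For example, page_file 1 15 prints page-01.pgm, and page_file 7 500 prints page-007.pgm.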
The tools also check all parameters and input files before passing them to the underlying programs, because the error messages from those programs are sometimes a bit cryptic.
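As an illustration of this kind of up-front checking, here is a minimal sketch of validating a page range before any external program is called. The function valid_page_range is hypothetical. The accepted forms ("5-10" and the open-ended "2-") are taken from the examples in this document, while accepting a single page such as "7" is an assumption. The real tools may check more, or differently.

```shell
# Hypothetical validator (not the toolset's code) for a page range argument.
# Accepts "7" (assumed), "5-10", or an open-ended "2-"; pages are 1..9999.
valid_page_range() {
    case "$1" in
        '' | *[!0-9-]* | *-*-* | -*) return 1 ;;   # empty, stray characters, extra dash, no start
    esac
    first=${1%%-*}                                 # text before the dash (or the whole argument)
    last=${1#*-}                                   # text after the dash; empty for "2-"
    [ "$first" -ge 1 ] && [ "$first" -le 9999 ] || return 1
    if [ -n "$last" ]; then
        [ "$last" -ge "$first" ] && [ "$last" -le 9999 ] || return 1
    fi
    return 0
}
```

A wrapper could call this first and print its own readable message on failure, instead of letting the underlying program fail later.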
For details on the command line options supported by a particular tool, simply invoke the tool with the -h or --help option.
The included tools are:
ocr-open is usually the first command to invoke when starting a new project. The tool converts each page of the specified document to a separate image. There are options to specify the range of pages to extract and the destination directory.
The input document can be in either .pdf or .djvu format. Internally the tool invokes either the ddjvu or the pdftoppm program, depending on the type of the input file.
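The choice of converter could be sketched as below. The function pick_converter is hypothetical, and it decides purely by file extension, which is an assumption. The actual tool may well inspect the file contents instead.

```shell
# Hypothetical sketch (not the toolset's code): choose the converter program
# from the file extension. Assumption: the extension reflects the real type.
pick_converter() {
    case "$1" in
        *.pdf)  echo pdftoppm ;;
        *.djvu) echo ddjvu ;;
        *)      echo "unsupported input file: $1" >&2
                return 1 ;;
    esac
}
```

With this, pick_converter ../text.pdf prints pdftoppm, while an unsupported file yields an error message and a non-zero exit status.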
The main purpose of the ocr-ls tool is to produce a list of files for bulk processing. The tool outputs a list of files, text or images, from the selected range(s) of pages, in order. A simple example is given above, where it is used to concatenate all the recognised text. For a more involved example, consider the situation where every page except the first has a page number at the bottom that we don't want to see in the recognised text, so we want to crop (for example) 6.5% from the bottom of each image from page 2 to the end of the document. This can be achieved with the following command:
ocr-ls -p 2- | xargs -I{} -n 1 crop-image -b 6.5% {} {}
(see below for the description of the crop-image command)
The ocr tool invokes the tesseract program to recognise text from the given images. There are options to specify the range of pages to process, as well as the directory where the image files are stored. For each page, the recognised text is written to the same directory, to a file with the same name but with a .txt extension.
For example, this is how to extract text in Russian and English from pages 5 to 10 only, all located in the directory book:
ocr -p 5-10 -d book/ -- -l rus+eng
Note: everything to the right of "--" is passed over to the tesseract program.
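The "--" convention can be sketched in plain shell as follows. The function split_args is hypothetical and only demonstrates the idea of separating a wrapper's own options from those forwarded to the wrapped program. It is not taken from the toolset.

```shell
# Hypothetical sketch of the "--" convention: arguments before "--" belong
# to the wrapper, everything after it is forwarded untouched.
split_args() {
    own=""
    while [ $# -gt 0 ]; do
        if [ "$1" = "--" ]; then
            shift                        # drop the separator itself
            break
        fi
        own="$own $1"
        shift
    done
    printf 'wrapper:%s\n' "$own"
    printf 'passthrough: %s\n' "$*"      # what remains would go to the wrapped program
}
```

Calling split_args -p 5-10 -d book/ -- -l rus+eng reports "-p 5-10 -d book/" as the wrapper options and "-l rus+eng" as the pass-through part.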
crop-image crops the specified image. The amount of space to crop is given as a percentage of the image's width or height, which is often more convenient than using pixels. Wraps around the pamcut utility from the netpbm toolset.
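To show what a percentage-based crop amounts to, here is a rough sketch of cropping from the bottom with netpbm. Both functions are hypothetical and are not the toolset's implementation. The sketch assumes that pamfile reports the size as "W by H" and uses the -top and -height options of pamcut.

```shell
# Hypothetical sketch, not the toolset's crop-image.
# Convert a percentage of the image height to a number of pixel rows.
percent_to_rows() {                      # usage: percent_to_rows HEIGHT PERCENT
    awk -v h="$1" -v p="${2%\%}" 'BEGIN { printf "%d\n", h * p / 100 + 0.5 }'
}

# Crop PERCENT of the height from the bottom of a PGM image.
crop_bottom() {                          # usage: crop_bottom PERCENT INPUT OUTPUT
    # Assumption: pamfile prints something like "page-02.pgm: PGM raw, 1240 by 1754  maxval 255"
    height=$(pamfile "$2" | awk '{ for (i = 1; i < NF; i++) if ($i == "by") { print $(i + 1); exit } }')
    rows=$(percent_to_rows "$height" "$1")
    tmp=$(mktemp) || return 1            # temporary file, so INPUT and OUTPUT may be the same
    pamcut -top 0 -height "$((height - rows))" "$2" > "$tmp" && mv "$tmp" "$3"
}
```

For an image 2000 rows high, percent_to_rows 2000 6.5% gives 130, so crop_bottom would keep the top 1870 rows.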
A tiny utility that crops the image to its content and then adds a 5% white border. Wraps around the ImageMagick convert tool. Rarely useful, except where poor-quality scanned images have specks of dust in the space surrounding the text that sometimes get recognised as punctuation.
norm-text is a script that normalises text by removing hyphenation and line breaks inside paragraphs. Normally, tesseract separates paragraphs by empty lines, and this is required for the tool to work correctly. The tool takes its input from stdin and writes to stdout.
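The kind of normalisation described here can be approximated with awk in paragraph mode. The function join_paragraphs below is a hypothetical sketch, not the toolset's script. It assumes that every line ending in "-" is a hyphenated word break, which will also merge genuine compounds that happen to be split at the hyphen.

```shell
# Hypothetical sketch (not the toolset's script): join the lines of each
# paragraph into one line and remove end-of-line hyphenation.
# Paragraphs must be separated by empty lines.
join_paragraphs() {
    awk '
    BEGIN { RS = ""; FS = "\n" }         # paragraph mode: one record per paragraph, one field per line
    {
        out = ""
        for (i = 1; i <= NF; i++) {
            line = $i
            if (i < NF && line ~ /-$/)
                out = out substr(line, 1, length(line) - 1)   # drop the hyphen, no space
            else if (i < NF)
                out = out line " "       # ordinary line break becomes a space
            else
                out = out line           # last line of the paragraph
        }
        if (NR > 1) print ""             # keep one empty line between paragraphs
        print out
    }'
}
```

It reads stdin and writes stdout, so it could be used at the end of a pipeline such as ocr-ls --text | xargs cat | join_paragraphs.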
norm-page ensures the correct paragraph boundary at the end of the page. Takes one or more text files as input, and writes its output to stdout. Can be used in conjunction with other tools, for example:
ocr-ls -t | xargs norm-page | norm-text
The toolset makes use of external tools that need to be installed first:
sudo apt install netpbm imagemagick tesseract-ocr djvulibre-bin poppler-utils
Optionally, install language packs for tesseract, for example:
sudo apt install tesseract-ocr-rus
The preferred way to install the toolset is to grab the ocr-*.tar.xz archive attached to the latest release on GitHub (starting from version 0.8), and extract it to a directory listed in $PATH. Alternatively, if very recent but as yet unreleased updates are required, just clone the project from GitHub
git clone --recursive https://github.com/maxim2266/ocr
then install dependencies for the build
sudo apt install build-essential libmagic-dev
and finally run make release from the root directory of the project. This will compile the toolset and create an archive with all the utilities, which can then be extracted to a directory on the $PATH.
The toolset has been tested on Linux Mint 19.3, and will probably work on other Debian-based distributions as well. The supported tesseract version is 4.0.0 or later.