Sonali
Signature
Sonali
Roll Number: M20CS061
Certificate
This is to certify that the Project Report titled A Study of Strike-Through Text Identification from
Handwritten Answer Sheets, submitted by Sonali (Roll Number: M20CS061) to the Indian Institute of
Technology Jodhpur for the award of the degree of M.Tech. in Artificial Intelligence, is a bonafide record
of the research work done by her under my supervision. To the best of my knowledge, the contents of
this report, in full or in parts, have not been submitted to any other Institute or University for the award
of any degree or diploma.
Signature
Chiranjoy Chattopadhyay, PhD
Acknowledgements
I want to express my gratitude to Professor Shantanu Chaudhury, Director, IIT Jodhpur, and Professor
Richa Singh, Head of the Department of Computer Science, for giving me the opportunity to be a part
of this institute, where I could work on this project.
My sincere thanks to the project supervisor and mentor, Dr. Chiranjoy Chattopadhyay, for his valuable
guidance and expertise. Without his constant support and motivation, this project would not have been
possible.
I heartily thank my friends, family, and all the people who have helped me through this project.
Abstract
Handwritten answer sheet evaluation is an important research area in text analysis. The performance of
deep learning classifiers decreases with variability in text style and the presence of struck-out text components.
These strike-outs are present at the character, word, paragraph, and page levels, and when such documents are
processed by OCR, they give erroneous results. This work proposes a new method that includes text
component detection using word segmentation and deep learning for classifying text components
as non-strike and strike words. For text component detection we use computer vision techniques such
as segmentation, blob analysis, and connected component analysis. We also explore deep learning architectures
for classifying strike-out components, taking text images as input to the classifier. For our exper-
imentation, we generated our own dataset using ScrabbleGAN and incorporated computer vision
techniques such as histogram matching, masking, and overlaying, as no handwritten strike-out
dataset was available at the time. Experimental results demonstrate that we are able to achieve a state-of-the-art
F1 score along with explainability through Grad-CAM, LIME, and Integrated Gradients.
Contents
Abstract vi
1 Motivation 2
2 Handwritten Answer Sheet Composition 4
3 Problem Statement 6
4 Literature survey 7
5 Dataset 9
5.1 Available datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5.2 Existing datasets and their related issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5.3 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
6 Approaches 13
6.1 Experiments that failed to produce desired results . . . . . . . . . . . . . . . . . . . . . . 13
6.1.1 Word Detection and Localisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
6.2 Experiments that produced desired results . . . . . . . . . . . . . . . . . . . . . . . . . 15
6.2.1 Word Detection and Localisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6.2.2 Classification of strike-out text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
7 Architecture 17
7.1 Project Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
8 Framework Modules 18
8.1 Pdf to Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
8.2 Preprocessing and text component segmentation . . . . . . . . . . . . . . . . . . . . . . . 18
8.3 Dealing with class imbalance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
8.4 Deep Learning Classifier for Strike Classification . . . . . . . . . . . . . . . . . . . . . 21
9 Experiments and Results 23
10 Analysis and Discussion 25
10.1 Explainability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
11 Conclusion 27
11.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
References 28
List of Figures
1.1 Illustrations of various kinds of strike-outs . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Handwritten Answer Sheet containing strike-out . . . . . . . . . . . . . . . . . . . . . . . 4
5.1 Images generated using ScrabbleGAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
5.2 Illustration of Synthetic dataset creation . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
5.3 Examples of generated strike-out images . . . . . . . . . . . . . . . . . . . . . . . . . . 12
6.1 Word detection using Tesseract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
6.2 Word detection using Laplacian and Gaussian techniques . . . . . . . . . . . . . . . . . 14
6.3 Word detection using MSER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
6.4 Word detection using contour detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6.5 Results obtained from vision transformer and its corresponding attention maps . . . . . . 16
7.1 Overall project framework diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
8.1 Results of Text Component Segmentation in Handwritten Pages . . . . . . . . . . . . . . 18
8.2 Generation of strike-out images via data augmentation . . . . . . . . . . . . . . . . . . . . 20
8.3 Architecture of Deep Learning Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
10.1 Results of GradCAM at word level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
10.2 Results of GradCAM at line level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
10.3 Results of GradCAM at paragraph level . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
10.4 Results of Integrated gradient and LIME . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
List of Tables
5.1 Strike-classification performance with Various Techniques along with Traditional Data
Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
6.1 Results obtained from pre-trained vision transformer models . . . . . . . . . . . . . . . . . 16
8.1 Text component detection quantitative results . . . . . . . . . . . . . . . . . . . . . . . . . 19
8.2 Count of various categories of strikes in augmented dataset . . . . . . . . . . . . . . . . . 21
9.1 Confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
9.2 Evaluation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
9.3 F1 scores of strike-out class at various levels . . . . . . . . . . . . . . . . . . . . . . . . . . 24
9.4 Comparison of our model with SOTA models . . . . . . . . . . . . . . . . . . . . . . . . . 24
10.1 Strike-out classifier detection performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
A Study of Strike-Through Text
Identification from Handwritten Answer
Sheets
1 Motivation
Handwriting recognition has received a lot of attention, usually under the assumption that the input is a clean
image containing text components. However, scanned handwritten answer sheets submitted by students may
include strike-out and noisy components. If such a sheet is submitted to optical handwritten character
recognition (OHCR), it may result in a slew of errors due to pupils’ differing handwriting styles, quality
degradation due to artefacts, ink spills, and poor lighting. The text segmentation and pre-processing of
these scanned documents are critical to the performance of an OHCR system. These recognizers require papers
that are correctly structured and scanned in a controlled environment.
In practice, however, there is no rigid format that can be imposed on students. As a result, automatic
evaluation of answer sheets remains a research challenge. In the absence of pre-processing, physical
damage, stains, and artefacts are also treated as text components. In addition, uneven spacing and a
variety of handwriting styles make it difficult for OCRs to process these sheets directly. Handwritten answer
sheets contain several strike-outs at the character, word, paragraph, and page levels. These
strikes come in a variety of styles, including single and multiple lines, zig-zag, wavy, and cross.
Figure 1.1 shows a range of frequent strike-out text components in the dataset. From forensics to manuscript
digitization, detecting these types of strikes has a wide range of uses. Handwritten answer sheet grading,
writer behaviour and age verification, and fraudulent and forged text component detection are some of
the most typical applications. However, obtaining clean documents for OHCR is extremely difficult.
Classifiers must properly localise and identify these struck-out text components. However, because the struck-out
text is so small in comparison to the non-strike components in natural settings, a problem of skewed
class distribution (extrinsic imbalance) arises. Due to the high prior probability, classifiers trained with this
imbalanced data show a significant bias towards the majority non-strike class. We addressed this issue by
employing our proposed targeted data augmentation technique, generating words with ScrabbleGAN [1]
and overlaying masked strikes on these generated words. To overcome class imbalance, we also investigated
other techniques such as cost-sensitive learning and focal loss, among others.
The proposed framework takes an offline handwritten answer sheet image as input, then localises text
components and classifies each one as strike or non-strike. We attempted to identify strike-outs at various
levels and dealt with class imbalance through targeted data augmentation using ScrabbleGAN and strike mask
overlaying. Our significant contributions are as follows: (i) text component detection via word segmentation;
(ii) a proposed targeted data augmentation to overcome class imbalance; (iii) an investigation of cutting-edge
vision transformers for the classification of strike-out components; and (iv) explainability through the
use of Grad-CAM [2], LIME [3], and Integrated Gradients to make model decisions transparent rather
than black boxes.
Figure 1.1: Illustrations of various kinds of strike-outs
Handwritten answer sheets contain a lot of noise, such as overwriting and strike-out text at the word, line,
and paragraph levels, with varied strike-through styles such as wavy, single, multiple, zig-zag, and diagonal.
Much research has focused on analyzing clean handwritten documents, but detecting and cleaning
struck-out words and other distortions, such as ink spills, torn pages, and overwriting, remains a
challenging task. To deal with these problems, a preprocessing module is needed that identifies
strike-outs and prevents them from entering the OCR engine. This project attempts to provide such
preprocessing for handwritten answer sheets before they enter the OCR engine. Our approach aims
at a complete pipeline for preprocessing these answer sheets for automatic evaluation, consisting of
removing the struck-out regions and presenting a clean image to OCR. The task is divided as follows:
a) text component detection in the handwritten answer sheet; b) strike-out word identification through vision
transformers; c) prohibition of these struck-out words from entering the OCR engine; and d) further enhancement
of the documents for better text detection.
2 Handwritten Answer Sheet Composition
Scanned handwritten answer sheets are a fundamental part of the current education system. A handwritten
page comprises a set of text components. These handwritten words can be of different shapes and sizes,
with different handwriting styles. Text components may be clean, overwritten, or struck-through words
that need further processing before being sent to the OCR engine. These strike-outs are of various types and
positions; the thickness of a strike may vary according to the writer’s style. Also, strike-outs can appear
at the word, line, and paragraph levels; hence, we need to perform text segmentation, as illustrated
in Figure 2.1, followed by the classification of strike-through components. The scanned documents might
be of low quality because of poor lighting conditions, artifacts, and skew issues; hence, proper
enhancements need to be applied using computer vision techniques. Also, since handwritten answer sheets are
semi-structured documents, it becomes challenging to detect all types of text components and to identify
and remove strike-through words so as to produce a clean document for automatic evaluation.
Although there has been much work on handwritten text recognition by OCR, the classification and
removal of strike-through components are far less investigated. To eliminate
the manual and tedious work of evaluation, we develop a pre-processing pipeline
that involves text component identification and the identification of strike-outs at various levels. This
step-by-step pre-processing enables the generation of an error-free, clean document that OCR can process.
We used blob detection and segmentation to extract text components using computer vision techniques.
Moreover, we applied the vision transformer for the strike-through classification at various levels.
We also composed a dataset of strike and non-strike words at the word, line, and paragraph levels
in English handwritten answer sheets, with complete ground truth annotation in XML format.
3 Problem Statement
The goal of the work proposed in this thesis is to design a preprocessing pipeline for the automatic
evaluation of handwritten answer sheets, which performs the following tasks:
The input is a handwritten answer sheet in PDF format; the framework automatically
detects and localizes text components and identifies strike-out
components at the word, line, and paragraph levels. These strike-out words are then
removed from the answer sheet, and the document is further enhanced before being given
to the OCR engine.
As proposed in this thesis, we are developing a preprocessing technique for the automation of hand-
written answer sheet evaluation, whose input would be a scanned answer sheet PDF. The
expected output would be a clean document with struck-out words removed, enhanced to be fed into
OCR. We have adapted a few deep learning techniques to segment strike-out components at the word,
line, and paragraph levels.
The rest of the thesis is organized in the following manner. Section 4 describes the state-of-the-art
literature closely related to our work. Section 5 highlights the freely and publicly
available datasets and the need for a new dataset; we also describe the dataset created as part of the
M.Tech. thesis work with the necessary details. Section 6 lists the approaches that did and did not
produce the desired results. Sections 7 and 8 briefly describe the proposed approach along with
its framework modules. Section 9 discusses the obtained results, which form our basis for improvement in
developing techniques that help us achieve the desired results. Section 10 analyzes the results and the
explainability of the model. Finally, Section 11
includes the concluding remarks and the future work plan.
4 Literature survey
The problem of classifying struck-out and non-struck-out words containing different variations of strike
style is still in its infancy, and little work has been done with deep learning techniques. The major
contributions use classical machine learning methods with handcrafted features; the use of deep
learning techniques is not yet fully explored.
In 2008, Brink et al. [4] presented a novel method of automatic identification and removal of crossed-
out words in offline settings. They also explored the influence of removing crossed-out text on writer
verification and identification. The method used connected component analysis through Otsu thresholding
[5]. They incorporated the categorization of connected components as normal, crossed-out, and others.
They experimented with classical classifiers such as K-nearest neighbor, linear support vector machine,
decision trees, and shallow neural networks. Decision trees performed the best out of those classifiers.
However, their work lacks experiments with multiple types of strike-through styles. On the contrary,
Likforman et al. [6] used hidden Markov models to identify wavy and line-trajectory strokes. These
strokes are generated through control points, spline curves, and superimposed lines from the delta-
lognormal model. A sliding window approach is used to obtain a sequence of feature vectors. The
detection model identifies the word models that maximize the estimated likelihood under a density
hidden Markov model. In contrast, Adak et al. [7] identified strike-through text using connected
component analysis. They constructed a graphical representation of the text component using intersection
points. The identified strike-throughs were mainly assumed to be straight, continuous, and horizontally
elongated, and were detected using Dijkstra’s algorithm [8]. Hindi/Devanagari characters contain stroke-like
elements such as the ‘maatra’ and ‘shirorekha’ in their patterns; hence, traditional text recognition approaches
considered them as errors. The authors addressed this problem by counting the strokes above
and below these maatras. The proposed work failed to address non-linear and longer strike-outs, and since
it focuses on finding intersection points, it failed to show accurate results on noisy and degraded
documents.
In 2015, Bhattacharya et al. [9] proposed a method for detecting overwriting, repetition, and crossing-out
in online handwritten documents. The methodology involves data pre-processing through piece-
wise histogram skew correction and extraction of the middle zone of the text component. The extracted
density-based features are then measured using their trajectory counts through pixels. In addition, they
explored a special feature to represent the similarity of strokes using temporal sequence and spatial
information. A segmentation-free HMM-based stochastic sequential classifier was used to classify text as
noisy or clean. This work was designed only for online settings, and no work was done on
restoring clean text from noise. In 2016, Chaudhuri et al. [10] focused their research on the identification,
localization, and cleaning of struck-out words. This method primarily identifies strike-through words
using an SVM-based classifier with handcrafted and automatic features. The handcrafted features
are extracted based on branch points, holes, and density, whereas the LeNet [11] model extracts the
automatic features. After this step, different methods are applied to perform text skeleton graph-
based identification. They incorporated shortest-path algorithms to identify straight strokes, whereas
zig-zag strokes are detected by finding all paths and curvature discontinuities. Other complex strokes, such
as wavy-based strike-outs are identified using a regression line in the middle of the connected component.
In order to identify multi-word and line strike-outs, they perform splitting and apply several graph-based
approaches, where each generated pixel path represents a struck-out line. However, when the strike-through is
located towards a particular region, the method fails to identify wavy, retrograde, and spiral strokes, and it
is quite prone to false-positive samples. It is also computationally expensive, as it involves many stages and
heuristics when multiple words are struck out.
In 2017, Adak et al. [12] studied the impact of struck-out words on writer identification. They
proposed a hybrid approach using CNN and SVM-based classifiers. Similar to [10], they incorporated
the LeNet-5 architecture to perform feature extraction. The extracted features are then passed to an SVM-
based classifier. They also used radial basis functions to calculate the distances between variable inputs
and their respective origins, and for model selection they utilized traditional grid search
techniques. The databases used for training contain very few struck-out words; hence, the model fails to
classify various types of strike-through inputs. The works mentioned above mostly use handcrafted
features, or a combination of deep learning and handcrafted features, for the removal of struck-out words. In
2019, Nisa et al. [13] reported the problem of class imbalance in current handwritten document corpora, i.e., very
few struck-out words are present in the databases. To address this issue, they generated a synthetic
dataset by modifying the IAM dataset [14] with cross-out strokes. This approach helped in
generating realistic struck-out words. The techniques utilized frequency-based histogram and Euclidean
distance properties of connected components. The model pipeline consists of recurrent blocks of
bidirectional LSTMs [15] with fully connected layers and convolutions. The model shows quite promising
results on recognition; however, it fails to perform well for handwriting-based applications. It also
gives misleading results when letters contain lines between them, and even punctuation marks generate
false-positive results. On the other hand, Qi et al. [16] addressed the problem of ink artifacts degrading
the performance of handwriting recognition algorithms. They utilized a fully convolutional network to
map images to binary segmentation masks. The mask maps the artifact pixels to white regions,
ultimately leading to cleansing; however, this work is confined to simple strokes.
In 2013, [17] patented a specialized method that identifies struck-out characters in hand-
written text. The model parses the scanned image into regions, where each region contains hand-
written characters. These regions are passed to structural or feature classifiers to identify the struck-out
characters. The model consists of a two-layer character recognition process using structural classifiers;
the secondary recognition stage uses different classifiers than the primary stage to improve the model’s
recognition quality. These are trained with large datasets to identify struck-out characters of different
complexity levels. This was a new benchmark in the field of text recognition.
5 Dataset
There are many datasets in the literature that are publicly available for text analysis. Some of them are
defined as follows:
1. IAM [18] - It consists of around 13k images of handwritten text lines from the Lancaster-Oslo/Bergen
Corpus, written in British English. Around 650+ writers generated 1500+ pages of text, labeled
at different levels, viz. word, line, and sentence.
2. BFL [19] - The database consists of Brazilian forensic letters written by 300+ writers contributing
more than 900 handwritten text images. Forensic experts widely use it for writer identification
tasks in criminal records. Since the dataset consists of letters written by college students, it contains
very few strike-out mistakes.
3. MLS [20] - This dataset represents images from medieval and historical manuscripts.
It is widely used for the task of line segmentation. The significance of this dataset is that lines
overlap in different regions, along with curved lines and rotated scans. This makes it
difficult to apprehend and challenging to pre-process for downstream tasks.
4. BH2M [21] - It consists of marriage register book images from the Cathedral of Barcelona,
annotated in a hierarchical XML structure. It is widely used for word spotting and handwriting
recognition, with annotations at different levels, viz. word blocks, sentences, and paragraphs.
The publicly available datasets contain very few struck-out words, which makes them biased
towards one class. There are two ways to tackle this problem: we can use a loss such as focal
loss in our model to mitigate the bias, or we can create a synthetic dataset with intentional,
manual strike-outs.
Handwritten datasets from forensics, marriage registration, and old manuscripts are publicly available;
however, answer sheets from universities containing multiple types of struck-out words are not. The
aforementioned datasets are not suitable for our automatic evaluation because the struck-out
words are very few in these corpora and lack proper annotations. The old scanned documents have
artifacts and noise, and suffer from degradation and aging. The scans also exhibit shine-through
and background handwriting, making them hard to process through OCR.
The publicly available handwritten datasets contain very few strike-outs, leading to class imbalance.
Classifiers trained on skewed datasets tend to produce predictions biased towards the majority class.
Due to class imbalance, the gradient contribution of the majority class is larger than that of the minority
class and tends to dominate the net gradient; as a result, the minority class converges more slowly.
Also, some datasets include old manuscripts; hence, these documents may suffer from degradation and
irregular scans, making them harder to train on.
We require a large, balanced training dataset to solve this problem. We have utilized the dataset
introduced in [22]. It contains 200 scanned answer sheets of university students, who wrote in
natural offline settings on A4-size plain white sheets. We assume these answer sheets were scanned by
mobile devices at 300 PPI with 256 gray levels. For each answer sheet, a corresponding XML annotation
file is also provided. These XML annotations contain the bounding-box coordinates of each text component,
labeled as strike-out or non-strike-out. The strike-outs are of various types: single, multiple,
diagonal, crossed, zig-zag, and wavy, available at the character, word, line, and paragraph levels. The highest
strike-out rate in an answer sheet was 0.75, and the average strike-out rate was 0.0883. Although this
dataset is suitable for our problem, it has a class imbalance ratio of 1 : 11 for the strike-out class.
Before moving beyond traditional data augmentation, we tried various techniques at the data and
algorithmic levels to overcome the class imbalance, including ensembles, focal loss, cost-sensitive learning,
and various data augmentation techniques. The F1 scores for both classes obtained using these techniques
are shown in Table 5.1.
Table 5.1: Strike-classification performance with various techniques along with traditional data augmentation
As these techniques were not effective enough, we proposed our own data augmentation approach, described
in Section 5.3.
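For illustration, the following is a minimal PyTorch sketch of one of the algorithmic techniques we tried, focal loss, which down-weights easy, well-classified (majority-class) examples; the gamma and alpha values are illustrative assumptions, not tuned values from our experiments.

```python
# Minimal focal loss sketch (PyTorch); gamma and alpha are illustrative.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # Standard binary cross-entropy per sample, kept unreduced.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # the model's confidence in the true class
    # Class-balancing weight: alpha for positives, 1 - alpha for negatives.
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma shrinks the loss of easy, confident examples.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```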
Traditional data augmentation techniques such as rotation, position shifting, zooming, and shearing
could not change the imbalanced class distribution. Hence, we explored traditional GANs for data
augmentation. As the ISI answer scripts dataset is imbalanced, GANs trained on it failed to
capture minority class characteristics well.
In order to solve this issue, we utilized ScrabbleGAN [1] to generate handwritten text consisting
of words, multi-words, and sentences. This semi-supervised approach uses the GAN paradigm with a
discriminator and a recognition network: the former promotes realism, and the latter focuses on overall
readability, generating handwritten text versatile in lexicon and style. By utilizing the overlapping
receptive fields of the CNNs in the generator, a synthesized word can be considered a concatenation of
identical character-conditional generators. This overlap in the generator promotes adjacent-character
interaction, and a noise vector controls the style. The convolutional discriminator concatenates
real/fake classifiers over overlapping receptive fields. The primary benefit is the ability
to use unlabelled data and to generate text components of varying length. The recognizer is trained on real
labeled data and penalizes the generator to promote readable text. The generator is optimized
with the joint loss $\ell = \ell_D + \lambda \ell_R$, combining the discriminator's adversarial loss $\ell_D$ and the recognizer loss $\ell_R$.
The images generated using ScrabbleGAN, shown in Fig. 5.1, are further added to the non-
strike-out class.
We synthetically strike out random words sampled from the non-strike-out class, containing equal
proportions from the original dataset and ScrabbleGAN. We process a stroke image and a non-strike
image to generate a strike-out image in each iteration, and each iteration consists of
several stages. The first stage includes resizing and histogram matching to unify contrast levels. In the
second stage, we apply thresholding [23] and opening to the stroke image to obtain a stroke mask with
the help of a kernel. In the final stage, the stroke mask is superimposed on the non-strike-out image using
computer vision operations to generate a synthetic strike-out image. The detailed algorithm of this
approach is discussed in Section 8.3. The complete synthetic image generation procedure is shown
in Fig. 5.2.
Using this approach, we generated around 87500 synthetic strike-out images, which addressed the
challenge of data imbalance. The benefits were two-fold: the dataset imbalance problem vanished, and
data-hungry deep learning techniques such as vision transformers obtained the larger training set they require.
Some examples of realistic handwritten strike images created by this approach are shown
in Figure 5.3.
6 Approaches
6.1 Experiments that failed to produce desired results
6.1.1 Word Detection and Localisation
We tried five different techniques for word detection and localisation; however, they did not give the
desired results. Using Tesseract [24], a lot of text components were not detected, because Tesseract is not
trained well for handwritten text detection and recognition. We concluded that a general-purpose OCR
would not be beneficial for our problem. Word detection using Tesseract is shown in Figure 6.1.
Blob detection for text segmentation remains challenging with these methods because of the large
intensity variation around blobs, blob cluster formation when components lie in close vicinity, and the
poorly defined boundaries of handwritten text. Text segmentation using scale-space theory was done with
the Laplacian of Gaussian (LoG), which highlighted text components, but the segmentation was not proper;
it worked efficiently for symmetric blobs but not for asymmetric ones. The Difference of Gaussians (DoG)
is suspected to over-detect blobs, which can be observed from the results, so a post-pruning step would be
required. The results from LoG, DoG, and DoH are shown in Figure 6.2.
Figure 6.2: Word detection using Laplacian and Gaussian techniques
When we applied MSER [25] for word detection, it failed for blurry textual images, because MSER is
sensitive to the blurring introduced by down-scaling; hence, it failed to produce results across scales. Very
fine text is also vulnerable due to MSER’s irregular behaviour with respect to scaling
and non-convex fragments. In Figure 6.3 we observe many overlapping blobs, and many
inner blobs detected inside outer blobs, which produces erroneous results.
6.2 Experiments that produced desired results
6.2.1 Word Detection and Localisation
Using contour detection [26], we were able to segment text components well; this approach formed the
basis for finalising our text segmentation method using a similar idea.
6.2.2 Classification of strike-out text
Table 6.1: Results obtained from pre-trained vision transformer models

                                    Strike-out            Non-strike-out
Model                   Accuracy    Precision  F1 score   Precision  F1 score
ViT-B/32                0.84        0.80       0.80       0.76       0.76
ViT-B/32 + RBF kernel   0.80        0.85       0.81       0.75       0.79
ViT-B/16                0.86        0.83       0.84       0.88       0.88
ViT-B/16 + RBF kernel   0.867       0.83       0.85       0.90       0.88
ViT-L/16                0.75        0.72       0.76       0.77       0.72
ViT-L/16 + RBF kernel   0.768       0.78       0.77       0.76       0.76
ViT-L/32                0.576       0.56       0.67       0.63       0.41
ViT-L/32 + RBF kernel   0.60        0.62       0.62       0.60       0.59
We used pre-trained vision transformers and fine-tuned them on our dataset. The results obtained are
shown in Table 6.1. These results show better F1 scores than our earlier experiments with classical
machine learning and deep learning models. However, being black-box models, we could not interpret the
reasons for the good F1 scores. Moreover, the pre-training data was ImageNet, while our dataset comes from
a different domain. Hence, we treated this study as a baseline and concluded that we should train our
vision transformer from scratch.
Figure 6.5: Results obtained from vision transformer and its corresponding attention maps
We also studied the attention maps to explore explainability and to know where our model focuses, as shown
in Figure 6.5. Since this vision transformer architecture was working well, we decided to train
it from scratch.
7 Architecture
7.1 Project Pipeline
This section presents the step-by-step approach we followed, shown in Fig. 7.1, to develop a classification
model for struck-out text component images; it includes text component detection, data augmentation,
and classification of strike images in handwritten answer sheets using deep learning. The pipeline
involves four main steps: (1) preprocessing, (2) text component segmentation, (3) targeted data
augmentation, and (4) classification using a vision transformer. There are thus four main modules
involved in the complete classification of strike-out images, as described by Figure 7.1; in the following
section, we describe each framework module in detail.
First, we convert the scanned answer sheet PDF into separate handwritten page images from which to
extract text components.
Text components can be called the fundamental units of a handwritten answer sheet for classification.
Usually, a handwritten word is a connected component, whether struck out or not. We perform
text segmentation using morphological operations and blob extraction. The given dataset was observed
to have an imbalance of 1:11; we therefore use data augmentation that produces strike words using
ScrabbleGAN and strike mask overlaying to generate struck-out text component images, which overcomes
the skewed distribution of the dataset, as described in Section 8.3. Lastly, these images are fed into a
deep learning classifier for strike classification. Fig. 7.1 depicts the proposed framework comprising these
modules, each of which is described in the following sub-sections.
8 Framework Modules
In this section, we describe each module and its working.
8.1 Pdf to Images
Scanned handwritten answer sheets in PDF format are converted into images using the Python module
"pdf2image" with a DPI parameter of 300.
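A minimal sketch of this conversion step, assuming the pdf2image package (which requires poppler) is installed; the file name is illustrative:

```python
from pdf2image import convert_from_path

# One PIL image per page of the scanned answer sheet.
pages = convert_from_path("answer_sheet.pdf", dpi=300)
for i, page in enumerate(pages):
    page.save(f"page_{i:03d}.png", "PNG")
```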
8.2 Preprocessing and text component segmentation
In order to detect and localize text components in handwritten pages, we performed preprocessing
and segmentation using computer vision operations. The image is binarized using Otsu's thresholding
to separate foreground and background. The preprocessing stage also involves skew detection
and correction using the horizontal gradient method. Noise removal is done by a smoothing
operation with Gaussian filtering. Due to thresholding, individual text components can be extracted
as smeared black regions. We incorporated blob extraction to extract the black connected components.
These extracted blobs were enclosed in bounding boxes using connected component analysis, similar to
the work in [27]. We discarded tiny components such as dots and noise using area filters, as shown in Fig.
8.1.
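A minimal OpenCV sketch of the segmentation just described; the smoothing kernel, dilation kernel, and area threshold are illustrative assumptions rather than the exact values used in our pipeline:

```python
import cv2

gray = cv2.imread("page_000.png", cv2.IMREAD_GRAYSCALE)
gray = cv2.GaussianBlur(gray, (5, 5), 0)  # noise removal by Gaussian smoothing
# Otsu binarization: foreground text becomes white on black.
_, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
# Dilation smears the characters of a word into one connected blob.
bw = cv2.dilate(bw, cv2.getStructuringElement(cv2.MORPH_RECT, (9, 3)))
# Connected component analysis yields one bounding box per blob.
n, _, stats, _ = cv2.connectedComponentsWithStats(bw)
boxes = [tuple(stats[i][:4]) for i in range(1, n)      # (x, y, w, h)
         if stats[i][cv2.CC_STAT_AREA] > 150]          # area filter drops dots/noise
```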
Figure 8.1 shows the text components detected in a handwritten answer sheet page. Object detection
accuracy is calculated using precision and recall.
$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{IOU} = \frac{|A \cap B|}{|A \cup B|} \tag{3}$$
Precision and recall are calculated using Intersection over Union (IOU): we define an IOU threshold,
and a prediction with IOU greater than the threshold is a true positive, otherwise a false positive. For
the accuracy calculation we took an IOU threshold of 0.5. The IOU is the ratio of the common area
between the ground-truth and predicted bounding boxes to the total area of their union.
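A small helper, written here for illustration, that computes Eq. (3) for two axis-aligned boxes given as (x, y, w, h):

```python
def iou(a, b):
    # Convert (x, y, w, h) to corner coordinates.
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))  # overlap width
    ih = max(0, min(ay2, by2) - max(ay1, by1))  # overlap height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

# A detection counts as a true positive when iou(pred, gt) > 0.5.
```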
For evaluating our model's performance we used precision, recall, F-score, and IOU. Quantitative results
for these metrics are reported in Table 8.1.
The scores in Table 8.1 confirm that we are able to detect text components well. Also, precision > recall
confirms the stability and reliability of the detection method.
Qualitative results are demonstrated in Figure 8.1, signifying that the method worked quite well for detecting
text components with very few misses. Even multiple-word strikes could be detected as desired; this
component works with all single and multiple strikes. We deliberately avoided detecting
small components such as commas, periods, or noise.
8.3 Dealing with class imbalance
The original ISI answer sheet dataset, referred to in Section 5.2, has a non-strike-out to strike-out class
ratio of 11 : 1. This is a clear example of skewed data distribution, precisely extrinsic imbalance, and these
rare occurrences of the strike-out class are exactly the ones of interest to us. For good, balanced results we
require a well-represented and non-overlapping distribution of both classes. Models trained on the original
imbalanced dataset have a high probability of biased predictions and fail to correctly
predict the minority class; they tend to over-predict the majority class because of its high prior
probability. Traditional data augmentation techniques such as zooming, cropping, and rotating were not
suitable for generating strike images, as they cannot change the imbalanced
distribution.
In addition to the above challenges, there are different categories of strike-outs at the character, word, line,
and paragraph levels, and a proper representation of adequate data is required from each category for deep
learning methods to classify properly. To overcome this class imbalance, we proposed targeted data
augmentation for the skewed class in two stages. The algorithmic technique is sketched at the end of this section.
In the first stage, we used ScrabbleGAN, a semi-supervised approach for generating real-looking
handwritten text images varied in style and lexicon. This architecture comprises a generator, a discriminator
for realistic handwriting style, and a text recognition network for readability and input similarity;
details of the architecture are discussed in Section 5.3. We generated words and sentences of varying length
and added them to the non-strike class.
In the second stage, for generating strike images, we randomly sample non-strike class images (contain-
ing both original and ScrabbleGAN-generated words) for augmentation. We created masks for various types of
stroke images that can be superimposed on non-strike images to augment the strike class. In this stage we
use computer vision operations such as resizing and histogram matching of the stroke image and non-strike
image to unify the contrast levels of both images. Further, we apply thresholding and opening
to the stroke image to obtain a mask with the help of a kernel. In the last step, we use image overlaying
to superimpose the stroke mask over the non-strike image, adding the resulting struck-out text image to our
skewed class.
for each stroke image i in the stroke list do
    for each non-strike image j in the word list do
        1. resize(i, j)
        2. histogram_matching(i, j)
        3. k = mask_creation(i)
        4. overlay(k, j)
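A minimal Python sketch of one iteration of this loop, assuming OpenCV and scikit-image; the file names and the 3 × 3 opening kernel are illustrative assumptions:

```python
import cv2
import numpy as np
from skimage.exposure import match_histograms

word = cv2.imread("non_strike_word.png", cv2.IMREAD_GRAYSCALE)
stroke = cv2.imread("stroke.png", cv2.IMREAD_GRAYSCALE)

# Stage 1: resize and histogram-match the stroke to unify contrast levels.
stroke = cv2.resize(stroke, (word.shape[1], word.shape[0]))
stroke = match_histograms(stroke, word).astype(np.uint8)

# Stage 2: thresholding + opening gives a clean binary stroke mask.
_, mask = cv2.threshold(stroke, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

# Overlay: paint dark ink wherever the stroke mask is set.
struck = word.copy()
struck[mask > 0] = 0
cv2.imwrite("synthetic_strike.png", struck)
```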
Some of the examples of generated strike-out images are shown in Figure 5.3.
8.4 Deep Learning Classifier for Strike Classification
We trained a vision transformer (ViT) [28] instead of a CNN for the automated classification of strike-out and
non-strike-out images. The state-of-the-art ViT model is competitive with CNNs on a wide range of image
classification tasks and is reported to require roughly 4× less compute for comparable accuracy. It shows
excellent results comparable to CNNs even with fewer resources, but exhibits a weaker inductive bias, so
strong data augmentation or regularization is required for smaller datasets. The dataset augmentation was
described in the previous stage (Section 8.3), and the counts per category are summarized in Table 8.2. The
strike-out images in the dataset are available at various levels of diversity: character, word, line, and paragraph.
The inputs to the ViT are images of varying dimensions that are resized to 128 × 128. After preprocessing,
the extracted text component images are fed as input to the deep learning classifier. We used the
following configuration for the ViT architecture - patch size extracted from the 128 × 128 input image: 64 × 64;
number of transformer layers and heads: 12; MLP output dimension: 3072; dropout rate in dense layers:
0.1; dense layer sizes: [2048, 1024] MLP head units. We trained our ViT model from scratch, as
shown in Fig. 8.3.
The initial patch creation layer splits the input image into patches of fixed size 6 × 6, with
num_patches = (image_size // patch_size)² serving as the effective input sequence length for the transformer.
These patches are fed to the patch encoding layer for a linear transformation to a projected vector
of latent dimension 64. In addition, a learnable positional embedding is
added to the patch embeddings obtained in the previous step to retain positional information. We
then pass the embedding vector sequence to the transformer encoder. The
transformer encoder consists of alternating layers of multi-head self-attention and MLP blocks; in addition,
it utilizes layer normalization and residual connections before and after every block, respectively. These
blocks generate a tensor of shape (batch size, number of patches, projection dimension / latent
vector size). The obtained tensor is then fed to the classifier head with softmax activation, and we get as
output the probability of strike-out and non-strike-out for each image. We used sparse
categorical cross-entropy as the loss function and the AdamW optimizer for our binary classification problem,
with a learning rate of 0.01.
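A condensed PyTorch sketch of a ViT classifier matching this description. Hyperparameters follow the text where compatible (12 layers, latent dimension 64, MLP dimension 3072, dropout 0.1); the patch size of 16 and the 8 attention heads are illustrative assumptions, since 12 heads do not evenly divide a 64-dimensional latent vector:

```python
import torch
import torch.nn as nn

class ViTClassifier(nn.Module):
    def __init__(self, image_size=128, patch_size=16, dim=64,
                 depth=12, heads=8, mlp_dim=3072, num_classes=2):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch creation + linear projection in one strided convolution.
        self.to_patches = nn.Conv2d(1, dim, kernel_size=patch_size, stride=patch_size)
        # Learnable positional embedding retains patch-order information.
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
            dropout=0.1, batch_first=True, norm_first=True)  # pre-norm, as in ViT
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                  # x: (B, 1, 128, 128)
        p = self.to_patches(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        z = self.encoder(p + self.pos_emb)
        return self.head(z.mean(dim=1))                    # mean-pooled classifier

model = ViTClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()  # PyTorch analogue of sparse categorical cross-entropy
```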
9 Experiments and Results
As mentioned earlier, overcoming extrinsic class imbalance and strike-out classification is a challenging
task. In order to balance the dataset, we obtained an augmented dataset, as discussed in Section 5.3,
from the proposed methodology. The entire augmented dataset consists of around 47503 strike-out images
and 40000 non-strike-out images, summing up to 87513 images. We split the dataset with a train-test
ratio of 75 : 15. We trained the ViT model from scratch, with implementation details as described
in Section 8.4.
The model was trained on the augmented data using the AdamW optimizer, and parameters were updated
using sparse categorical cross-entropy loss with a batch size of 128 images. We trained the model for 200
epochs while avoiding over-fitting. We also observed that data augmentation helps reduce over-fitting and
increases the variability and style of handwritten text components with strikes. The performance of
strike-out classification was evaluated using precision, recall, F1-measure, and the confusion matrix, as shown
in Table 9.2 and Table 9.1 respectively. The overall accuracy of the model was 98.256%.
Table 9.1: Confusion matrix

             Non-Strike   Strike
Non-Strike   5812         141
Strike       88           7086
The proposed methodology achieves the desired performance, obtaining an F1-score of 98.409 for our
class of interest (Table 9.2). We also interpreted the caliber of our classifier using other measurements.
We obtained a high sensitivity (recall) value, indicating a small number of misclassifications
and false negatives. Fewer false negatives reduce the possibility of misclassified strike-outs
entering OCR for text conversion, which is highly desirable. We also observed
a high precision value, indicating little possibility of classifying a non-strike-out sample as strike-out.
With high precision and recall, we obtained a high F1 score, and the model classifies both the strike-out and
non-strike-out classes well.
We analyzed the model’s performance for various strike-out categories, including zig-zag, wavy, single,
multiple, diagonal, and crossed strokes. We evaluated these stroke categories at the word, line, and
paragraph levels, as shown in Table 9.3; the word level includes both character and word strike-outs.
To the best of our knowledge, this is the first work that evaluates the performance of strike-out
classification at these levels.
We chose F1 as the evaluation metric, as it signifies our model's robustness and good generalization
ability for our positive class, i.e., the strike-out class. We got higher scores for multiple and wavy strikes than
for other categories. Also, paragraph-level performance is better than word- and line-level performance. We
analyzed this difference in performance and found that the chance of misclassification increases when less
contextual information is available.
For benchmarking and quantifying the proposed methodology against state-of-the-art work, we compute
and report the F1-scores of the various strike categories in Table 9.4. The ground truth for the various
categories of struck-out text components was generated manually from the XML annotations for evaluation.
We evaluated the performance using

$$F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$

We were able to overcome the limitations of state-of-the-art work for some categories, and we also considered
strike-outs at various levels, which were not included in previous research.
Table 9.4: Comparison of our model with SOTA models (F1-score per strike category)

Strike Category   Proposed   Poddar et al.   Chaudhuri et al.
Single            0.95       0.98            0.97
Multiple          0.95       0.97            0.94
Slanted           0.92       0.94            0.94
Crossed           0.86       0.97            0.92
ZigZag            0.83       -               0.82
Wavy              0.90       -               78.58
These results demonstrate that we were able to achieve state-of-the-art F1-scores for classification.
10 Analysis and Discussion
10.1 Explainability
In order to make our model explainable, we evaluated our results using Grad-CAM [29], integrated
gradients [30] and LIME [31].
Grad-CAM provides visual explanations using the localization of gradients. It maps the class activations
of the target class and produces a coarse map of the regions of the input image
that are essential in predicting that class. This way, we can understand whether our model focuses on
the correct features and to what extent. The Grad-CAM explanations at the word, line, and paragraph
levels are shown in Figure 10.1, Figure 10.2, and Figure 10.3 respectively.
Integrated Gradients explains predictions by attributing the results to the individual input features.
It makes the model interpretable and helps visualize the relationship between the output and the
input features. On the other hand, LIME provides explanations of an instance by sampling perturbations
and selecting only the essential features. It uses regression techniques such as Lasso and ridge
models for feature selection. LIME is widely used in the research community because of its intuitive
properties: model agnosticism and local explanations. The Integrated Gradients and LIME explanations
are shown in Figure 10.4.
Figure 10.4: Results of Integrated gradient and LIME
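As a usage illustration, Integrated Gradients attributions can be produced with the Captum library; here `model` is the trained classifier from Section 8.4, and the input tensor and target class index are illustrative assumptions:

```python
import torch
from captum.attr import IntegratedGradients

model.eval()
x = torch.rand(1, 1, 128, 128)       # placeholder word-image tensor
ig = IntegratedGradients(model)
# Attribute the strike-out logit; class index 1 is an assumption.
attr = ig.attribute(x, target=1, n_steps=50)
print(attr.shape)                    # per-pixel attribution map, same shape as x
```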
Most research on strike-out classification focuses on handcrafted features rather than features learned by
deep learning methods. In our attempt, we explored the classification performance of classical machine
learning and deep learning architectures without feeding in handcrafted features. We evaluated the
performance of classical machine learning models and observed that these classifiers are not effective at
generating features and do not perform well for our positive class, as shown in Table 10.1.
We further studied the performance of deep learning classifiers and compared their performance
against the ViT model. For evaluation, we used the F1-score as the measurement and report perfor-
mance for both classes. We observed that deep learning classifiers perform better than machine learning
classifiers due to their superior ability to extract features.
11 Conclusion
The goal of the work proposed in this thesis is to design a pre-processing pipeline for handwritten answer
sheets for automatic evaluation. The analysis of a handwritten answer sheet poses a challenge because of
variations in handwriting style, strike-throughs, and overlapping words. The documents also contain
noise, artifacts, and dark patches due to the different lighting conditions in which the scans were captured. In
our proposed pipeline, we performed text localisation with computer vision techniques, data augmentation
using ScrabbleGAN with masking of strokes on generated images, and classification of struck-out words
using a state-of-the-art vision transformer architecture, which shows promising results.
11.1 Future Work
1. As part of the work done so far, we are planning to submit a paper to the Pattern Recognition
Journal soon.
2. We would improve on the limitations of our model by focusing on the misclassifications that indicate
failures for some strike cases.
3. We would like to work further on post-processing by deleting struck-out text components and enhanc-
ing the documents further for smooth OCR processing.
References
[1] S. Fogel, H. Averbuch-Elor, S. Cohen, S. Mazor, and R. Litman, “Scrabblegan: Semi-supervised
varying length handwritten text generation,” in Proceedings of the IEEE/CVF conference on com-
puter vision and pattern recognition, 2020, pp. 4324–4333.
[2] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual
explanations from deep networks via gradient-based localization,” in ICCV. IEEE Computer
Society, 2017, pp. 618–626. [Online]. Available: http://dblp.uni-trier.de/db/conf/iccv/iccv2017.html#SelvarajuCDVPB17
[3] M. T. Ribeiro, S. Singh, and C. Guestrin, “”why should I trust you?”: Explaining the predictions of
any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, 2016, pp. 1135–1144.
[4] A. Brink, H. van der Klauw, and L. Schomaker, “Automatic removal of crossed-out handwritten text
and the effect on writer verification and identification,” in Document Recognition and Retrieval XV,
vol. 6815. International Society for Optics and Photonics, 2008, p. 68150A.
[5] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE transactions on systems,
man, and cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
[6] L. Likforman-Sulem and A. Vinciarelli, “Hmm-based offline recognition of handwritten words crossed
out with different kinds of strokes,” 2008.
[7] C. Adak and B. B. Chaudhuri, “An approach of strike-through text identification from handwritten
documents,” in 2014 14th International Conference on Frontiers in Handwriting Recognition. IEEE,
2014, pp. 643–648.
[8] E. W. Dijkstra et al., “A note on two problems in connexion with graphs,” Numerische mathematik,
vol. 1, no. 1, pp. 269–271, 1959.
[9] N. Bhattacharya, V. Frinken, U. Pal, and P. P. Roy, “Overwriting repetition and crossing-out de-
tection in online handwritten text,” in 2015 3rd IAPR Asian Conference on Pattern Recognition
(ACPR). IEEE, 2015, pp. 680–684.
[10] B. B. Chaudhuri and C. Adak, “An approach for detecting and cleaning of struck-out handwritten
text,” Pattern Recognition, vol. 61, pp. 282–294, 2017.
[12] C. Adak, B. B. Chaudhuri, and M. Blumenstein, “Impact of struck-out text on writer identification,”
in 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017, pp. 1465–1471.
[13] H. Nisa, J. A. Thom, V. Ciesielski, and R. Tennakoon, “A deep learning approach to handwritten
text recognition in the presence of struck-out text,” in 2019 International Conference on Image and
Vision Computing New Zealand (IVCNZ). IEEE, 2019, pp. 1–6.
[14] U.-V. Marti and H. Bunke, “A full english sentence database for off-line handwriting recognition,” in
Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR’99
(Cat. No. PR00318). IEEE, 1999, pp. 705–708.
[15] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8,
pp. 1735–1780, 1997.
[16] Y. Qi, W. R. Huang, Q. Li, and J. Degange, “Deeperase: Weakly supervised ink artifact removal
in document text images,” in Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision, 2020, pp. 3522–3530.
[18] U.-V. Marti and H. Bunke, “The iam-database: an english sentence database for offline handwriting
recognition,” International Journal on Document Analysis and Recognition, vol. 5, no. 1, pp. 39–46,
2002.
[19] C. Freitas, L. S. Oliveira, R. Sabourin, and F. Bortolozzi, “Brazilian forensic letter database,” in
11th International workshop on frontiers on handwriting recognition, Montreal, Canada, 2008.
[20] O. Surinta, M. Holtkamp, F. Karabaa, J.-P. Van Oosten, L. Schomaker, and M. Wiering, “A path
planning for line segmentation of handwritten documents,” in 2014 14th International Conference
on Frontiers in Handwriting Recognition. IEEE, 2014, pp. 175–180.
[21] D. Fernández-Mota, J. Almazán, N. Cirera, A. Fornés, and J. Lladós, “Bh2m: The barcelona his-
torical, handwritten marriages database,” in 2014 22nd International Conference on Pattern Recog-
nition. IEEE, 2014, pp. 256–261.
[22] P. Shivakumara, T. Jain, N. Surana, U. Pal, T. Lu, M. Blumenstein, and S. Chanda, “A connected
component-based deep learning model for multi-type struck-out component classification,” in Doc-
ument Analysis and Recognition – ICDAR 2021 Workshops, E. H. Barney Smith and U. Pal, Eds.
Cham: Springer International Publishing, 2021, pp. 158–173.
[23] M. Sezgin and B. Sankur, “Survey over image thresholding techniques and quantitative performance
evaluation,” J. Electronic Imaging, vol. 13, pp. 146–168, 2004.
[24] R. Smith, “An overview of the tesseract ocr engine,” in Ninth international conference on document
analysis and recognition (ICDAR 2007), vol. 2. IEEE, 2007, pp. 629–633.
[25] W. Huang, Y. Qiao, and X. Tang, “Robust scene text detection with convolution neural network
induced mser trees,” in European conference on computer vision. Springer, 2014, pp. 497–511.
[26] F. Kurniawan, A. R. Khan, and D. Mohamad, “Contour vs non-contour based word segmentation
from handwritten text lines: An experimental analysis,” International Journal of Digital Content
Technology and its Applications, vol. 3, no. 2, pp. 127–131, 2009.
[29] R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual expla-
nations from deep networks via gradient-based localization,” arXiv preprint arXiv:1610.02391,
2016.
[30] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” CoRR, vol.
abs/1703.01365, 2017. [Online]. Available: http://arxiv.org/abs/1703.01365
[31] M. T. Ribeiro, S. Singh, and C. Guestrin, “”why should I trust you?”: Explaining
the predictions of any classifier,” CoRR, vol. abs/1602.04938, 2016. [Online]. Available:
http://arxiv.org/abs/1602.04938