AI Possible Risks & Mitigations: Optical Character Recognition
SUPPORT POOL OF EXPERTS PROGRAMME
As part of the SPE programme, the EDPB may commission contractors to provide reports and tools
on specific topics.
The views expressed in the deliverables are those of their authors and they do not necessarily reflect
the official position of the EDPB. The EDPB does not guarantee the accuracy of the information
included in the deliverables. Neither the EDPB nor any person acting on the EDPB’s behalf may be
held responsible for any use that may be made of the information contained in the deliverables.
Some excerpts may be redacted or removed from the deliverables as their publication would
undermine the protection of legitimate interests, including, inter alia, the privacy and integrity of an
individual regarding the protection of personal data in accordance with Regulation (EU) 2018/1725
and/or the commercial interests of a natural or legal person.
AI Possible Risks & Mitigations - Optical Character Recognition (OCR)
Table of Contents
1. Background
2. Data protection and privacy risk identification
Definition of the criteria to consider when identifying risks and their categorization
Presentation of examples of risks specific to OCR
3. Data protection and privacy risk assessment
Criteria to establish the likelihood of OCR risks. How to assess likelihood.
Criteria to establish the severity of OCR risks. How to assess severity.
Examples of OCR specific risks assessments
4. Data protection and privacy risk treatment
Risk treatment criteria
Presentation of mitigation measure examples/risk treatment options
Residual risk acceptance
Reference to specific technologies, tools, methodologies, processes or strategies.
Disclaimer by the Author: the examples and mentions of companies in this report are illustrative
and do not imply that the author considers them the only or the best choice. The technology analysis
presented in this report is based on the state of the art of the technology in August 2023.
1. Background
Description of the task, main technologies used and references to some
openly accessible examples.
OCR stands for Optical Character Recognition, and it is a technology used to convert images or scanned
documents containing text into machine-readable text. OCR technology enables the extraction of text
from both physical paper documents and digital sources.
In the binarization 1 process, characters are identified by recognizing dark areas as text and light areas
as the background. The dark areas undergo processing to identify alphabetic letters or numeric digits.
Algorithms for pattern recognition and feature extraction are then used to identify and analyze these
characters.
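The binarization step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `binarize`, the fixed threshold of 128 and the 0-255 grayscale convention are not taken from this report, and real systems often choose the threshold adaptively.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 1 = text (dark), 0 = background (light).

    The fixed threshold is an assumption made for illustration only.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


# Tiny 3x4 example: dark pixels (low values) become foreground.
image = [
    [250, 30, 40, 245],
    [240, 20, 235, 250],
    [255, 25, 35, 230],
]
binary = binarize(image)
```

The resulting grid of 0s and 1s is the kind of input on which the later character recognition steps operate.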
Pattern recognition involves isolating a character image, referred to as a glyph2, and comparing it with
a stored glyph that shares a similar font and scale. For successful pattern recognition, the stored glyph
must closely match the input glyph in terms of font and scale. This approach is more effective when
working with scanned document images that have been typed using a known font.
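As a hedged sketch of the pattern recognition idea, the following Python code compares an input glyph with a small set of stored glyphs and returns the best match. It assumes that all glyphs are already binarized and scaled to the same size; the names `match_score`, `recognize` and `STORED`, and the 3x3 bitmaps, are hypothetical and chosen only for illustration.

```python
def match_score(glyph, template):
    """Fraction of pixels that agree between two equally sized binary bitmaps."""
    total = sum(len(row) for row in glyph)
    same = sum(
        1
        for g_row, t_row in zip(glyph, template)
        for g, t in zip(g_row, t_row)
        if g == t
    )
    return same / total


def recognize(glyph, stored_glyphs):
    """Return (label, score) of the stored glyph that best matches the input glyph."""
    return max(
        ((label, match_score(glyph, tpl)) for label, tpl in stored_glyphs.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical stored glyphs (1 = ink, 0 = background).
STORED = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

# An "L" with one missing pixel, as might result from a poor scan.
noisy_l = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]
label, score = recognize(noisy_l, STORED)
```

Because the comparison is pixel by pixel, the sketch also shows why this approach depends on the stored glyph sharing the font and scale of the input.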
Feature extraction involves breaking down or decomposing glyphs into various characteristics,
including lines, closed loops, line direction, and line intersections. These features are then used to
identify the most suitable match among the stored glyphs.
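One of the features mentioned above, the number of closed loops, can be computed with a flood fill over the background pixels. The sketch below is an assumption about how such a feature could be derived, not a description of any particular OCR engine; the function name `count_closed_loops` and the sample glyphs are hypothetical.

```python
from collections import deque


def count_closed_loops(glyph):
    """Count background regions fully enclosed by ink (1 = ink, 0 = background)."""
    rows, cols = len(glyph), len(glyph[0])
    # Pad with a background border so that all outside background is one region.
    padded = [[0] * (cols + 2)] + [[0] + list(r) + [0] for r in glyph] + [[0] * (cols + 2)]
    seen = [[False] * (cols + 2) for _ in range(rows + 2)]

    def flood(start_r, start_c):
        """Mark every background pixel 4-connected to the start pixel."""
        queue = deque([(start_r, start_c)])
        seen[start_r][start_c] = True
        while queue:
            r, c = queue.popleft()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if 0 <= nr < rows + 2 and 0 <= nc < cols + 2:
                    if padded[nr][nc] == 0 and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))

    flood(0, 0)  # mark the outside background first
    loops = 0
    for r in range(rows + 2):
        for c in range(cols + 2):
            if padded[r][c] == 0 and not seen[r][c]:
                loops += 1  # an unvisited background region is enclosed by ink
                flood(r, c)
    return loops


GLYPH_O = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
GLYPH_8 = [[1, 1, 1], [1, 0, 1], [1, 1, 1], [1, 0, 1], [1, 1, 1]]
GLYPH_L = [[1, 0, 0], [1, 0, 0], [1, 1, 1]]
loop_counts = {
    "O": count_closed_loops(GLYPH_O),
    "8": count_closed_loops(GLYPH_8),
    "L": count_closed_loops(GLYPH_L),
}
```

A loop count alone does not identify a character, but combined with other features such as line direction and intersections it narrows down the candidate glyphs.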
In addition to character recognition, an OCR program examines the structure of a document image by
segmenting it into elements like text blocks, tables, or images. After isolating the characters, the
program compares them with a collection of pattern images. It then processes potential matches and
presents the recognized text as the output.
1 Binarization is the step that is performed prior to performing OCR. The aim of binarization is to
separate foreground text from the background of a document.
2 In typography, a glyph is "the specific shape, design, or representation of a character". It is a
Image source: Microsoft OCR Vision Studio. Example of segmentation and pattern recognition.
OCR systems often work with an established system of templates; this means that documents need
to have the same basic page structure or the same relative positioning of elements within the
document as the template. Because the system relies on defined templates, using OCR with
documents that have a different structure will result in lower accuracy. Currently, template-based
OCR systems are available in a broad variety of languages and provide an extensive catalogue
of different templates.
Modern OCR systems also incorporate Machine Learning (ML) algorithms, particularly
those based on Deep Learning, to improve the accuracy of character recognition. Deep learning
models of this type can support documents that contain similar information but have different page structures.
This approach is called Intelligent Document Processing (IDP); it uses OCR as its foundational technology and
additionally extracts structure, relationships, key-values, entities, and other document insights.
OCR combined with Deep Learning supports structured, semi-structured, and unstructured
documents for data extraction.
These OCR systems are often offered as a Software as a Service (SaaS) solution, with the possibility
to use pre-trained models or to train your own model with your own dataset 3.
Most vendors offer OCR systems as a cloud solution via a system of APIs 4, which seems to be the
preferred option for most customers because of its ease of integration and fast productivity. Though
some providers 5 of this technology offer a general system where the models are shared by the
customers, there are also some that offer the possibility to have a custom model that the customer
can train and delete when necessary 6.
Some vendors also offer the possibility for customers to host the models on-premises making the OCR
capabilities available in the customer’s own local IT infrastructure. This can be a good alternative to
comply with strict security and data governance requirements.
It is also possible to develop and implement your own OCR solution in-house. There are different OCR
libraries and frameworks available such as Tesseract, OpenCV, Easyocr, Keras_ocr and the FineReader
engine from ABBYY.
3 Microsoft and ABBYY are examples of SaaS OCR solutions offering the two possibilities:
https://www.abbyy.com/vantage/ocr-skill/features/
4 An Application Programming Interface is a way for two or more computer programs to communicate
recog-3.0.0
OCR SaaS solution hosted in cloud:
- Ready-to-use models that are trained by the vendor
- Possibility to create your own self-trained models
OCR third party solution hosted on premises:
- Models trained by vendor or customer
OCR self-developed, hosted on premises:
- Models trained by user
1. Example of data flow diagram when using third party OCR systems hosted in the cloud
Step 1: The input data are transmitted via an API from the customer’s location where the OCR system
is located to the vendor’s location in the cloud where the data extraction process will take place.
Step 2: The input (and output) data can be temporarily stored locally at the vendor’s location in the
cloud. The most common storage options are the following:
1. The data could be stored in a buffer only during the execution of the data extraction process. The
vendor does not retain any data once it has sent the output to the customer.
2. The data could also be temporarily stored in cache 7 to be reused by other immediate processes.
The data retention period is variable and depends on the cache memory capacity and configuration.
3. Another possible scenario is the storage of data in a persistent 8 storage layer such as a database or
a cloud storage. This could be done for the analysis or processing of the data at a later stage.
The longer the data is stored in a system, the higher the risk of a data breach, unlawful repurposing or
an infringement of the storage limitation principle. In this specific case, option 1 (buffer) is the
option with the lowest risk, since data is held in memory only while the process runs. Option 2 (cache)
also carries a low risk, but data is usually stored for a longer period than in a buffer, and this can happen
outside the process and even at a different location 9. Option 3 (a storage location such as a file or
a database) carries the highest risk, since the data can be stored for a longer
period of time.
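Where data is kept in persistent storage, a retention limit can be enforced by regularly discarding records that are older than an agreed period. The following Python sketch illustrates the idea; the function name `purge_expired`, the record layout and the 30-day period are assumptions made for this example and do not describe any vendor's behaviour.

```python
from datetime import datetime, timedelta, timezone


def purge_expired(records, retention, now=None):
    """Keep only records whose age does not exceed the retention period."""
    now = now or datetime.now(timezone.utc)
    return [r for r in records if now - r["stored_at"] <= retention]


# Hypothetical stored OCR inputs with their storage timestamps.
now = datetime(2023, 8, 31, tzinfo=timezone.utc)
records = [
    {"id": "scan-1", "stored_at": now - timedelta(hours=2)},
    {"id": "scan-2", "stored_at": now - timedelta(days=45)},
]
kept = purge_expired(records, retention=timedelta(days=30), now=now)
```

In this example only `scan-1` is kept, because `scan-2` is older than the assumed 30-day retention period.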
Step 3: Once the data extraction process has finalized, the output data are sent back via an API to the
customer.
In some cases, the input and/or the output data could be used by the vendor to retrain and fine-tune
the OCR model. Though this is usually done after informing the customers and obtaining their consent,
it is important to verify it with the vendor.
2. Example of data flow diagram when using third party OCR systems hosted on premises
Step 1: All data transfers and data extraction processes take place internally at the customer’s premises
on their own servers within their data centers.
restarted.
9 The location and method of caching can depend on the specific requirements of the OCR system
and the architecture of the application. For instance, caching could happen in a different location
when using a distributed microservices architecture or a cloud-based caching.
Step 2: The input and output data can be temporarily stored locally. The data could be stored in a
buffer only during the execution of the data extraction process or could be temporarily stored in cache
to be reused by other immediate processes.
The input and/or output data could also be stored in a storage location at the customer’s premises.
This could be done for analysis or processing of the data at a later stage or for auditing purposes.
Step 3: Once the OCR process has finalized, the output data is produced.
The input and/or output data could also be used to retrain and fine-tune the OCR model that is also
stored at the user’s premises.
A self-hosted OCR system from a third party provider can be set up in different ways depending on
the architecture and design choices. The specific details can vary depending on whether the setup is completely on
premises or a hybrid one 10 in which some of the processes are still hosted at the vendor side.
For instance, some of the steps in the data extraction process phase could happen outside the
customer’s premises (see image below).
It's important to review the documentation and architecture of the third party OCR system to
understand its data flows and whether there are data transfers to the vendor or other third-parties.
Additionally, it is also important to assess the system's compatibility with the user’s infrastructure, the
security as well as any potential required change in network or firewall configurations.
10 Different steps from the data extraction phase could take place in the cloud. This could be a
decision due to various reasons such as resource-intensive tasks, redundancy, expertise, etc.
The data flow in this scenario is similar to the one of example 2. In this use case, there are no data
transfers to any OCR vendor, and all the processes are executed on premises. All processes take place
internally.
The accuracy of an OCR system measures the percentage of correctly identified characters or words
with respect to the total number of characters or words. It is typically evaluated by comparing its output
to the ‘ground truth’ 11 and calculating the proportion of characters or words that were correctly
identified. Any discrepancies between the OCR output and the ground truth are considered errors.
Higher accuracy rates indicate better performance. Accuracy is usually expressed
as a percentage between 0% (low) and 100% (high). For printed documents with clear and legible
text, accuracy rates in the range of 95% to 99% are commonly achievable. However, it is
important to note that accuracy can vary depending on the specific document types, languages,
and OCR software being used.
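One common way to count the discrepancies between OCR output and ground truth is the edit distance (insertions, deletions and substitutions of characters). The sketch below computes a character-level accuracy on that basis; it is an assumption about how the comparison could be implemented, and the function names `edit_distance` and `char_accuracy` and the sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_accuracy(ocr_output, ground_truth):
    """Character accuracy in percent, relative to the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ocr_output, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth)) * 100


# Two typical confusions: "l" read instead of "I", and "O" read instead of "0".
accuracy = char_accuracy("lnvoice 2O23", "Invoice 2023")
```

Here two of the twelve characters are wrong, so the accuracy is roughly 83%. The same approach can be applied at word level by comparing lists of words instead of strings.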
The confidence score is a measure that indicates the level of certainty that the extracted data is
correct. It is represented as a number between 0 and 1; a high
confidence score means that the OCR system believes its recognition of a particular text is likely
to be correct. Confidence scores are typically determined by the OCR software or algorithm itself,
though in some OCR systems 12 users can set confidence score thresholds to filter out characters or
results that fall below or above a certain level of confidence. This threshold can be set based on the
desired level of accuracy and tolerance for errors. The specific method for calculating confidence
scores may vary depending on the OCR system and the underlying algorithms used.
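A confidence threshold of the kind described above can be sketched as a simple filter that separates results accepted automatically from results that need review. The result format, the function name `split_by_confidence` and the threshold of 0.80 are assumptions for illustration; actual OCR systems expose confidence values in their own formats.

```python
def split_by_confidence(results, threshold=0.80):
    """Split OCR results into accepted items and items that need manual review."""
    accepted = [r for r in results if r["confidence"] >= threshold]
    review = [r for r in results if r["confidence"] < threshold]
    return accepted, review


# Hypothetical word-level OCR results with confidence scores between 0 and 1.
results = [
    {"text": "Invoice", "confidence": 0.98},
    {"text": "N0.", "confidence": 0.55},
    {"text": "2023-114", "confidence": 0.91},
]
accepted, needs_review = split_by_confidence(results)
```

Raising the threshold sends more results to review and lowers the tolerance for errors; lowering it does the opposite.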
11 The accurately known text that the images or documents being processed by the OCR system are
supposed to represent. The term "ground truth" is used in machine learning to refer to the actual
values or outcomes that a model's predictions are compared against during training and evaluation.
12 https://pyimagesearch.com/2020/05/25/tesseract-ocr-text-localization-and-detection/
Confidence scores can be calculated in different ways, but the score is often based on the clarity of
the image, character distinctness, context of surrounding text and the similarity of the identified
characters to the patterns the system has been trained on.
While confidence scores at the character level are prevalent, OCR systems may also provide
confidence scores at other levels such as word, line, or block.
The availability and presentation of confidence scores may vary among different OCR systems and
software implementations. Some systems may only provide character-level scores, while others may
offer a combination of character-level and higher-level scores.
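How a given system derives higher-level scores from character-level ones is not specified here and differs between implementations. As one simple possibility, offered only as an assumption, a word-level score could be the minimum or the mean of its character scores; the function name `word_confidence` is hypothetical.

```python
from statistics import mean


def word_confidence(char_scores, method="min"):
    """Aggregate character-level confidence scores into one word-level score."""
    if not char_scores:
        raise ValueError("at least one character score is required")
    if method == "min":
        return min(char_scores)  # pessimistic: the weakest character decides
    if method == "mean":
        return mean(char_scores)  # average certainty across the word
    raise ValueError(f"unknown method: {method}")


# Hypothetical character scores for a four-letter word with one doubtful character.
chars = [0.99, 0.97, 0.42, 0.95]
pessimistic = word_confidence(chars, "min")
average = word_confidence(chars, "mean")
```

The minimum flags the word as doubtful because of its single weak character, while the mean can hide that character behind the confident ones.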
While a high confidence score such as 0.95 suggests that the OCR system believes the output is correct,
it does not guarantee that the output is actually accurate.
If an OCR system persistently misreads a particular character due to errors in its training data, it could
still assign a high confidence score to this incorrect interpretation. Similarly, an accurate result could
receive a low confidence score if the system finds the recognition challenging due to factors like image
quality or an unconventional font style. Hence, it is critical to consider these factors when interpreting
confidence scores and accuracy in OCR systems.
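To check in practice how well confidence scores track accuracy, the output for a labelled sample can be compared with its ground truth, looking specifically for results that are confident but wrong. The following sketch assumes a hypothetical sample format and function name (`high_confidence_errors`); the threshold of 0.90 is likewise an assumption.

```python
def high_confidence_errors(samples, threshold=0.90):
    """Return samples that the system was confident about but got wrong."""
    return [
        s
        for s in samples
        if s["confidence"] >= threshold and s["predicted"] != s["truth"]
    ]


# Hypothetical character-level results checked against ground truth.
samples = [
    {"predicted": "O", "truth": "0", "confidence": 0.97},  # confident but wrong
    {"predicted": "7", "truth": "7", "confidence": 0.41},  # correct but uncertain
    {"predicted": "A", "truth": "A", "confidence": 0.99},  # confident and correct
]
suspicious = high_confidence_errors(samples)
```

A recurring pattern in such a list, for example the same character being confused repeatedly, may point to a systematic problem rather than to random noise.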
- Accounts payable 13: to speed up invoice data entry. OCR technology can be used to automate the
process of entering invoice data into a system. Instead of manually keying in data from paper or
digital invoices, the OCR software extracts key information such as supplier names, invoice dates,
amounts, and invoice numbers. This helps to reduce data entry errors and speeds up the accounts
payable process.
- Banking 14: to extract and digitize information making data easier to search, store, and manage. In
identity verification, OCR is used to read data from identity cards, passports, or driving licenses
quickly and accurately. By extracting information from identity documents, OCR facilitates the
verification of customer details and speeds up the account opening procedures.
- Digitizing 15 and/or archiving of paper documentation, converting printed paper documents into
machine-readable text documents. Once digitized, the text from these documents can be easily
searched, edited, stored, and managed, making it much more accessible and useful.
- Vehicle license plate identification. 16 OCR is a key technology behind Automatic Number Plate
Recognition (ANPR) systems. These systems use OCR to read the license plate numbers of vehicles
from digital images or video feeds for purposes like traffic enforcement, toll collection, or parking
management.
- Consumer behavior and market analysis: extracting data from retail receipts and consumer-generated
content (such as reviews or handwritten notes), which can then be analyzed for insights
and consumer behavior analysis. 17
- Transforming documents into text that can be read aloud to visually impaired or blind users. 18
OCR is used in assistive technologies to convert printed text into digital text, which can then be
read aloud using text-to-speech (TTS) systems.
- Logistics and warehouse automation. 19 OCR can be used to automate processes such as inventory
management, shipping, and receiving of goods. For example, OCR can be used to read labels,
barcodes, or other identifiers on packages, enabling automatic tracking and sorting of goods.
- Medical Documentation Transcription & Automation. 20 OCR can be used to integrate paper and
images originating from existing patient records into new electronic health records (EHR). It
extracts the data required to automatically associate the information in a patient record, such as
a medical record number, date of birth, patient name, etc., with the right electronic health record.
It can also be used to digitize medical prescriptions.
A real use case of OCR in the medical sector is the ABBYY 21 medical records management software.
13 Example: https://rossum.ai/
14 Example: https://www.klippa.com/en/blog/information/4-ways-to-perform-document-based-kyc-
checks-with-ocr-and-ai/
15 Example: https://pdf.abbyy.com/finereader-pdf-for-mac/how-to/digitize-paper/
16 Example: https://platerecognizer.com/
17 Example: https://microblink.com/commerce/receipt-ocr/
18 Example: https://www.afb.org/blindness-and-low-vision/using-technology/assistive-technology-
videos/scanning-and-specialized-ocr
19 Example: https://www.innovapptive.com/blog/using-optical-character-recognition-ocr-to-overcome-
3-supply-chain-bottlenecks
20 Example: https://nanonets.com/blog/ocr-for-healthcare/
21 https://www.abbyy.com/solutions/healthcare/capture-to-ehr/
To help identify risks associated with the use of data extraction technologies like OCR, we can make use
of a variety of risk factors.
Risk factors are conditions associated with a higher probability of undesirable outcomes. They can
help to identify, assess, and prioritize potential risks. For instance, using health data and processing
large volumes of data are risk factors with a high level of risk. Acknowledging them in your own use
case can help you identify related potential risks and their severity. In this case, an example of
an associated risk with a high severity could be ‘a risk of violation of patients’ privacy due to a data breach’.
The risk factors shown below are the result of analysing the contents of legal instruments such as the
GDPR 22, the EUDPR 23, the EU Charter 24 and other applicable guidelines related to privacy and data
protection25.
The following risk factors can help us identify data protection and privacy high level risks in data
extraction technologies like OCR:
Risk factor: Processing sensitive data
Description: The OCR system processes sensitive data such as health data, special categories of data, personal data related to convictions and criminal offences, financial data, behavioural data, unique identifiers, location data, etc. This is a cause for concern since processing this personal data inappropriately could negatively impact individuals.
Examples of applicability:
- When using OCR for digitizing medical records or legal documents by courts.
- When using OCR for digitizing invoices or other processes in the banking sector, for consumer behavior and market analysis, in data extraction from identification documents and bank cards, ANPR and the digitization of medical records.

Risk factor: Large scale processing
Description: Processing high volumes of personal data is a cause for concern, especially if these personal data are sensitive. The higher the volume, the bigger the impact in case of a data breach or any other situation that puts the individuals at risk.
Examples of applicability:
- This could apply to most OCR use cases, since data extraction technologies like OCR are usually applied to large volumes of data.

Risk factor: Processing data of vulnerable individuals
Description: This is a concern because vulnerable individuals often require special protection. Processing their personal data without proper safeguards can lead to violations of their fundamental rights. Some examples of vulnerable individuals are children, elderly people, people with mental illness, disabled people, patients, people at risk of social exclusion, asylum seekers, persons who access social services, employees, etc.
Examples of applicability:
- This could be the case when OCR solutions are used in the health sector, at schools, social services organizations, government institutions, employers, etc.

Risk factor: Low data quality
Description: Low quality of the input data and/or the training data is a concern because it brings possible risks of inaccuracies in the generated output, which could cause wrong identification of characters and have other adverse impacts depending on the use case.
Examples of applicability:
- OCR systems are not 100% accurate and quality issues in the input data are common.

Risk factor: Insufficient security measures
Description: The lack of sufficient safeguards could be the cause of a data breach. Data could also be transferred to states or organisations in other countries without an adequate level of protection.
Examples of applicability:
- This could be the case if there are not sufficient safeguards implemented to protect the input data and the results of the processing. This could be applicable to any use case.
- Data extraction technologies like OCR are often offered as SaaS solutions. Input data could be sent for processing to countries without an adequate level of protection.
Data protection and privacy risks posed by the procurement of those types of AI systems:
Data extraction solutions are frequently available as SaaS solutions from third party providers. Due to
the different types of configurations available and the required maintenance of the models used, the
use of an external supplier is usually the preferred option for users of this technology.
Some third party data extraction systems, though rarely, can also be hosted on-premises.
Data protection and privacy risks posed by the development of those types of AI systems:
The development of data extraction technologies can also face data protection and privacy risks.
Risks could arise at different phases of the development life cycle, which is why it is important to
implement an iterative process for the identification of these types of risks.
The development of an OCR system typically involves training machine learning models on large
datasets of annotated images or documents. These datasets can consist of various types of digital
and printed documents. The data used for training an OCR system typically includes:
- Training Data: this data includes a diverse set of images or documents that represent the
target domain. It encompasses a wide range of fonts, text sizes, styles, layouts, and
document types.
- Validation Data: a separate portion of the dataset that is reserved for validation purposes during
the model development process.
- Test Data: the remaining portion of the dataset, which is used to evaluate the final performance of
the trained OCR system. The test data should be representative of real-world scenarios
and provide a fair assessment of the system's accuracy and reliability.
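The division into training, validation and test data can be sketched as a shuffled split of the available documents. The 80/10/10 proportions, the fixed random seed and the function name `split_dataset` are assumptions chosen for illustration; the appropriate proportions depend on the size and nature of the dataset.

```python
import random


def split_dataset(items, train=0.8, validation=0.1, seed=42):
    """Shuffle the items and split them into training, validation and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # everything left over becomes test data
    )


# Hypothetical file names standing in for annotated document images.
docs = [f"doc_{i:03d}.png" for i in range(100)]
train_set, val_set, test_set = split_dataset(docs)
```

Keeping the three subsets disjoint matters: a document that appears in both training and test data would make the measured accuracy look better than it is.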
OCR system developers often curate or collect their own datasets, which can include publicly
available data, proprietary data, or datasets obtained through partnerships or collaborations. It is
important to mention that training data can introduce certain risks in the development of OCR
systems. Here are a few key considerations:
- Bias present in the training data, such as imbalances in document types, languages, or fonts,
can impact the OCR system's performance and introduce unfairness.
- Inaccurate or incomplete annotations in the training data can adversely affect the
performance of the OCR system. If the labeled data contains errors or inconsistencies, the
model may learn incorrect patterns or struggle to generalize well to unseen data. Ensuring
high-quality annotations is crucial for effective training.
- If the training data does not adequately cover the full range of document types, fonts, text
sizes, or languages encountered in real-world scenarios, the system may struggle to
accurately recognize text in unseen or challenging conditions.
- The training data may contain sensitive or private information, such as personal details or
confidential documents.
- Training data could be collected and used in an unethical manner, without respecting
privacy, consent, copyright and other legal obligations.
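Imbalances of the kind mentioned in the first consideration above can be made visible by counting how often each category occurs in the training data. The sketch below flags categories whose share falls under a chosen minimum; the function name `underrepresented`, the 10% minimum share and the language labels are assumptions made for this example.

```python
from collections import Counter


def underrepresented(labels, min_share=0.10):
    """Return the share of each label that makes up less than min_share of the data."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        label: n / total
        for label, n in counts.items()
        if n / total < min_share
    }


# Hypothetical language labels of 100 training documents.
languages = ["en"] * 70 + ["fr"] * 25 + ["mt"] * 5
flagged = underrepresented(languages)
```

In this example only the language with 5 documents is flagged. The same check can be applied to document types or fonts if the training data is annotated accordingly.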
The following table offers an overview of data protection and privacy risks that developers of OCR
systems should consider during the design and development phase. The idea behind this table is to
make developers conscious of privacy by design choices that can help prevent risks:
27 SDK stands for software development kit. SDK is a set of software-building tools for a specific
platform.
Risk: Breach of the data minimization principle
Risk description: Extensive processing of personal data for training the model
GDPR impact: Infringement of Art. 5 (c) Data minimisation
Example: Certain OCR systems require large amounts of data to train the models 28. Tasks that require handling more diverse fonts, styles, or languages may generally require a larger dataset to capture the necessary variability.
Data protection and privacy risks posed by the use of those types of AI systems:
Users of data extraction technologies need to consider the risks related to their specific use cases
and context. Making use of the risk factors or evaluation criteria can facilitate the identification of
those risks. For instance, the criterion ‘large-scale processing of personal data’ can already trigger the
identification of risky processing activities that could result in harm.
When using an OCR solution, users can choose between three service models: SaaS
solutions from third party providers hosted in the cloud, third party solutions hosted on-premises 29
and self-developed solutions hosted on-premises.
28 The size of a training dataset can vary depending on multiple factors such as the complexity of the
documents, the diversity of fonts and text styles, and the desired level of accuracy. For simpler
document types with limited variations in fonts and layouts, a smaller training dataset may be
sufficient to achieve reasonable accuracy. However, for more complex document types or scenarios
requiring high accuracy, a larger and more diverse training dataset is typically necessary.
29 Example: https://learn.microsoft.com/en-us/azure/applied-ai-services/form-
recognizer/faq?view=form-recog-3.0.0
Risk: ... of providing human intervention for processing that can have a legal or important effect on the data subject
Risk description: ... intervention, and/or there is a processing of special categories of personal data
GDPR impact: ... Processing of special categories of personal data
Example: ... be the case when using OCR in the banking sector for identity verification by enabling automatic comparison and validation of the provided data against reference databases or identity records. Or when data are extracted from legal contracts or financial statements and is then analyzed automatically to identify non-compliant clauses, suspicious activities, or anomalies, triggering appropriate actions or alerts.

Risk: Lack of compliance with GDPR by not granting data subjects their right to data rectification and erasure
Risk description: Data subjects’ requests to rectify or to erase personal data cannot be completed
GDPR impact: Infringement of Art. 16 and Art. 17 Right to rectification and right to erasure
Example: This could be the case if there is not a possibility to search for the data subject’s data in the output and to correct and delete it.
Service models:
• SaaS cloud
• Third party on-premises
• Self-developed

Risk: Unlawful unlimited storage of personal data
Risk description: Input data and/or output data are being stored longer than necessary
GDPR impact: Infringement of Art. 5 (e) Storage limitation
Example: In principle an OCR system doesn’t need to store the input data unless it is necessary for audit, verification, or archival purposes. The system should avoid unnecessary retention or storage of input data that is not directly relevant to the OCR process.
Service models:
• SaaS cloud
• Third party on-premises
• Self-developed

Risk: Breach of the data minimization principle
Risk description: Extensive processing of personal data for training the model
GDPR impact: Infringement of Art. 5 (c) Data minimisation
Example: There are certain OCR systems that might require the user to input samples of data until the system has learned the additional features. The input data is used to train or fine-tune the OCR model, enabling it to handle the specific complexities of the given OCR task. This can be the case in OCR systems used in medical fields that may require specialized training on medical terminology and handwritten prescriptions, also for handwritten recognition, when using low-quality documents or specialised fonts.
Service models:
• SaaS cloud
• Third party on-premises
• Self-developed

Risk: Unlawful transfer of personal data
Risk description: Data are being processed in countries without an adequate level of protection
GDPR impact: Infringement of Art. 44, General principle for transfers, Art. 45 Transfers on the basis of an adequacy decision, Art. 46 Transfers subject to appropriate safeguards
Example: OCR systems could store the data and be processing the data in countries that do not offer enough safeguards. This could also be the case with self-developed systems if we store the data in the cloud.
Service models:
• SaaS cloud
• Self-developed if using Cloud
The GDPR outlines in Recital 90 the importance of establishing the context: “taking into account the
nature, scope, context and purposes of the processing and the sources of the risk”.
This is an important process when performing a privacy risk assessment to manage risks to the rights
and freedoms of natural persons.
The subsequent processes are 30:
- assessing the likelihood and severity of the risks;
- treating the risks by mitigating the identified risks and in that way ensuring the protection of
personal data and demonstrating compliance with the GDPR and EUDPR.
There are different risk management methodologies available to classify and assess risks. It is not the
purpose of this document to define or establish a methodology to be used since this is a choice that
should be left to each organization. But for the purpose of this document, we will use the
international standards that have been previously referenced in the WP29 31 and the AEPD 32
Guidelines.
To assess the level of risk of the data protection and privacy risks identified when procuring,
developing and using data extraction technologies, we first need to estimate the likelihood and
severity of the identified risks happening.
To determine the likelihood of the risks of data extraction technologies we are using the following
four level risk classification matrix:
30 Guidelines on Data Protection Impact Assessment (DPIA) and determining whether processing is
“likely to result in a high risk” for the purposes of Regulation 2016/679, Article 29 Data Protection
Working Party, Last revision 2017
31 ISO 31000:2009, Risk management — Principles and guidelines, International Organization for
Standardization (ISO); ISO/IEC 29134, Information technology – Security techniques – Privacy impact
assessment – Guidelines, International Organization for Standardization (ISO).
32 ISO 31010:2019, Risk management — Risk Assessment Techniques, International Organization for
Standardization (ISO)
To determine the severity of risks of data extraction technologies we are using the four level risk
classification matrix 33:
The severity criteria are related to a loss of privacy that is experienced by the data subject but that
may have further related consequences impacting other individuals and/or society.
Scenario: We want to digitize legal documents containing court filings for archiving purposes. The
documents contain sensitive personal data such as criminal history, health and financial information.
We do not have the expertise to develop and host an OCR system ourselves, so we are going to
contract a third-party provider offering a SaaS solution in the cloud.
33 Pag. 77, AEPD, “Risk Management and Impact Assessment in Processing of Personal Data”, 2021
The following risk factors/ important concerns from section 2 could be applicable in our specific use
case:
Based on the identified risk factors, we are going to identify, together with other stakeholders 34, the
data protection and privacy risks that could arise with the OCR implementation.
We are going to use the risks identified in section 2 for procurement as a foundation. We are going
to assess the likelihood of each identified risk and assign it one of the 4 likelihood
classification levels from the matrix: Very high, High, Low, Unlikely.
After the likelihood assessment, we are going to assess the impact of the identified risks on
the data subjects, individuals and society. Based on that impact/severity assessment, we will assign
one of the 4 severity classification levels: Very significant, Significant, Limited, Very limited.
The assessments of likelihood and severity will offer us the basis to obtain the risk level classification.
Based on the four level classification used for likelihood and severity, we can use a matrix like the
following to obtain the resulting final risk level classification: Very High, High, Medium, Low.
Based on this matrix we can classify the risks identified in our use case as follows:
Data Protection and Privacy Risk: Insufficient protection of personal data, which can eventually cause a data breach
Risk description: Safeguards for the protection of personal data are not implemented or are insufficient
Likelihood: Low
Severity: Very significant
Risk Level: Very High

Data Protection and Privacy Risk: Possible adverse impact on data subjects that could negatively impact fundamental rights
Risk description: The output of the system could have an adverse impact on the individual if erroneous data are used for important decisions.
Likelihood: Low
Severity: Very significant
Risk Level: Very High

Data Protection and Privacy Risk: Lack of compliance with GDPR by not granting data subjects their right to data rectification and erasure
Risk description: Data subjects’ requests to rectify or to erase personal data cannot be completed
Likelihood: Low
Severity: Significant
Risk Level: High

Data Protection and Privacy Risk: Unlawful repurpose of personal data
Risk description: Personal data extracted is used for a different purpose
Likelihood: Low
Severity: Very significant
Risk Level: Very High
*This risk will be unlikely in an on-premises solution

Data Protection and Privacy Risk: Unlawful unlimited storage of personal data
Risk description: Input data and/or data extracted from images are being stored longer than necessary
Likelihood: Low
Severity: Significant
Risk Level: High

Data Protection and Privacy Risk: Unlawful transfer of personal data
Risk description: Data are being processed in countries without an adequate level of protection
Likelihood: Low
Severity: Significant
Risk Level: High
We have identified three risks with a very high level, and three with a high level. Best practices in risk
management suggest that the mitigation of very high and high level risks should be prioritized. 35 The
next step involves the implementation of a risk treatment plan.
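The lookup from likelihood and severity to a risk level can be sketched as a small table in code. The matrix below is an assumption made for illustration: only the two cells used in our use case (Low likelihood with Very significant or Significant severity) are taken from the classification above, and every other cell is a placeholder that each organization would replace with the values of its chosen methodology.

```python
# Illustrative 4x4 risk matrix. Only the two cells marked "from the use case"
# are confirmed by the classification above; all other cells are assumptions.
LIKELIHOOD = ["Unlikely", "Low", "High", "Very high"]
SEVERITY = ["Very limited", "Limited", "Significant", "Very significant"]

# Rows follow LIKELIHOOD, columns follow SEVERITY.
ASSUMED_MATRIX = [
    ["Low", "Low", "Medium", "High"],                  # Unlikely (assumed)
    ["Low", "Medium", "High", "Very High"],            # Low (last two cells from the use case)
    ["Medium", "High", "Very High", "Very High"],      # High (assumed)
    ["High", "Very High", "Very High", "Very High"],   # Very high (assumed)
]


def risk_level(likelihood: str, severity: str) -> str:
    """Look up the risk level for a likelihood/severity pair."""
    row = LIKELIHOOD.index(likelihood)
    col = SEVERITY.index(severity)
    return ASSUMED_MATRIX[row][col]


# Likelihood and severity as assessed in the use case.
use_case = {
    "Insufficient protection of personal data": ("Low", "Very significant"),
    "Possible adverse impact on data subjects": ("Low", "Very significant"),
    "Rights to rectification and erasure not granted": ("Low", "Significant"),
    "Unlawful repurpose of personal data": ("Low", "Very significant"),
    "Unlawful unlimited storage of personal data": ("Low", "Significant"),
    "Unlawful transfer of personal data": ("Low", "Significant"),
}

levels = {name: risk_level(*pair) for name, pair in use_case.items()}

# Risks to prioritize in the treatment plan: Very High and High.
prioritized = [name for name, level in levels.items() if level in ("Very High", "High")]
```

Keeping the matrix as data rather than as nested conditions makes it easy to swap in another methodology without touching the lookup.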
35 https://www.pmi.org/learning/library/high-risk-critical-path-projects-7675
Risk treatment involves developing options for mitigating the risks and preparing and implementing
action plans. The appropriate treatment option should be chosen on a contextual basis and
considering a feasibility analysis 36 like the following:
o Evaluate the type of risk and the available mitigation measures that can be implemented.
o Compare the potential benefits gained from implementing the mitigation against the costs
and efforts involved.
o Assess the impact on the purpose that is being pursued by implementing the OCR system.
o Evaluate what could be the reasonable expectations of individuals.
o Assess the impact mitigation measures could have on transparency and fairness of the
processing.
An analysis of these criteria is essential to risk mitigation and risk management planning and helps in
determining whether the risk mitigation is justifiable.
The most common risk treatment criteria are: Mitigate, Transfer, Avoid and Accept.
For each risk one of the criteria options will be selected:
Mitigate – Identify ways to reduce the likelihood or the severity of the risk.
Transfer – Make another party responsible for the risk (buy insurance, outsourcing, etc.).
Avoid – Eliminate the risk by eliminating the cause.
Accept – Nothing will be done.
Deciding whether a risk can be mitigated involves assessing the nature of the risk, understanding its
potential impact, and evaluating potential mitigation measures such as implementing controls,
adopting best practices, modifying processes, and using tools that can help reduce the likelihood or
severity of the risk.
Not all risks can be fully mitigated. Some risks may be inherent and cannot be entirely avoided. In such
cases, the goal is to reduce the risk to an acceptable level or to put in place measures that help manage
the severity of the risk effectively.
In our use case we have identified several very high and high level risks. After going through the
feasibility analysis and the treatment criteria, we have decided that we cannot transfer the risks to
any third party, we cannot avoid all the risks, and acceptance of the risks is an unacceptable option
for us. As long as there are measures that we can implement to help us mitigate the risks, resulting in
acceptable conditions to go on with the implementation, we choose the treatment option of risk
mitigation.
36 “Risk, High Risk, Risk Assessments and Data Protection Impact Assessments under the GDPR”,
Data Protection and Privacy Risks | Risk Level | Risk Mitigation measures | Feasibility Assessment | New risk Level after mitigation
Risk: Insufficient protection of personal data, which can eventually cause a data breach: safeguards for the protection of personal data are not implemented or are insufficient
Risk Level: Very High
Risk mitigation measures:
The third-party vendor chosen must have implemented security measures such as secure transmissions, strong access control measures, and data encryption at rest, as well as sufficient privacy design strategies 37 to protect the data. We will ask the vendor for its certifications 38 and the results of a pentest 39.
As controller we can also protect the specific sensitive data in the documents by applying pseudonymization or anonymization techniques after the data extraction. Depending on the different needs, we could implement default anonymization or reversible data masking techniques, for instance allowing access to the unmasked data to certain people. If the OCR SaaS solution does not offer the possibility to implement these techniques 40, we could decide to look for a vendor that can offer them or implement them ourselves.
Feasibility assessment:
1. Cost of implementation: The implementation of pseudonymization or anonymization techniques after the data extraction would imply additional cost.
2. Impact on purpose of digitization and archiving: No
3. Impact on expectations of individuals: No
4. Impact on transparency and fairness of the processing: No
New risk level after mitigation: Low
Risk: Possible adverse impact on data subjects that could negatively impact fundamental rights: the output of the system could have an adverse impact on the individual if erroneous data are used for important decisions.
Risk Level: Very High
Risk mitigation measures:
1. In this specific use case, digitized documents are going to be archived and not used for any other processing, which would already mitigate this risk.
2. But in cases where search and analysis of data is required, we would need to make sure that the OCR system we use offers a high accuracy guarantee. This is usually between 98-99%. 41 This accuracy should hold not only at page level but also at character and word level, which is often challenging. It is also important to follow best practices 42 when using OCR systems: making sure that our original documents have a good quality, considering things such as resolution, brightness, straightness, and discoloration before we scan text. Use of special fonts and low contrast could also affect accuracy. Another important aspect is that currently there are no systems offering 100% accuracy, and the only way to achieve that and avoid any error is by doing a human review and correction of the output.
Feasibility assessment:
1. Cost of implementation: The highest cost is the effort involved in doing the human review when needed and providing the human resources for that.
2. Impact on purpose of digitization and archiving: No
3. Impact on expectations of individuals: No for data subjects, but it has an impact on the employees that would be in charge of the human review
4. Impact on transparency and fairness of the processing: No if it is properly implemented and information about how the accuracy of the system works is provided to users and eventually to data subjects.
New risk level after mitigation: 1. Low 2. Medium
Risk: Unlawful repurpose of personal data: personal data extracted is used for a different purpose
Risk Level: Very High
Risk mitigation measures:
1. One of the best mitigation measures would be using a SaaS solution that offers the option to keep the data on premises. This is the case if the data processing takes place at location (where the OCR machine is located for instance), the input is automatically deleted and the output data is only stored at the user location.
2. If the data extraction takes place at the vendor’s location, then a minimum of security measures such as access control, audit trail and encryption together with proper data protection agreements need to be in place.
Feasibility assessment:
1. Cost of implementation: If the on-premise option is offered by the vendor this could have an additional cost. It could also imply that we need to make resources available for taking care of the on-premise solution.
2. Impact on purpose of digitization and archiving: No
3. Impact on expectations of individuals: No
4. Impact on transparency and fairness of the processing: No
New risk level after mitigation: 1. Low 2. Medium
Risk: Lack of compliance with GDPR by not granting data subjects their right to data rectification and erasure: data subjects’ requests to rectify or to erase personal data cannot be completed
Risk Level: High
Risk mitigation measures:
We could implement an OCR system that offers an editable output format. This will allow us to search and edit text in the output, which is not possible in other formats such as search-only output formats. This editable output format will allow us to look up the data subject’s information and update it or delete it.
Feasibility assessment:
1. Cost of implementation: If the option is offered by the vendor this could have an additional cost.
2. Impact on purpose of digitization and archiving: No
3. Impact on expectations of individuals: No
4. Impact on transparency and fairness of the processing: Yes, in a positive way
New risk level after mitigation: Low
Risk: Unlawful transfer of personal data: data are being processed in countries without an adequate level of protection
Risk Level: High
Risk mitigation measures:
We can implement an OCR system from a third-party provider that is located in a country offering an adequate level of protection.
Feasibility assessment:
1. Cost of implementation: This measure should in principle not have any additional cost, but it depends on the availability of vendors.
2. Impact on purpose of digitization and archiving: No
3. Impact on expectations of individuals: No
4. Impact on transparency and fairness of the processing: No
New risk level after mitigation: Low
Risk: Unlawful unlimited storage of personal data: input and output data are being stored longer than necessary
Risk Level: High
Feasibility assessment:
1. Cost of implementation: If the option is offered by the vendor this could have an additional cost.
2. Impact on purpose of digitization and archiving: No
New risk level after mitigation: Low
Risk mitigation measures:
We could try to implement an OCR system in which deletion of data can be configured so that input and output data are deleted from the system immediately after the data extraction or at a scheduled moment (with most vendors this is a period of 24 to 48 hours).
We could also implement a retention period for the output data that we want to
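For data stored on our own premises, a scheduled deletion of the kind described in the storage measure above could be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the folder layout and the 48-hour default are hypothetical, and removing a file this way is not the same as secure erasure from the storage medium.

```python
import time
from pathlib import Path
from typing import List, Optional


def purge_expired(folder: Path, max_age_hours: float = 48,
                  now: Optional[float] = None) -> List[Path]:
    """Delete regular files in `folder` whose last modification is older
    than `max_age_hours`, and return the paths that were deleted."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    deleted = []
    for path in folder.iterdir():
        # Only plain files are considered; subfolders are left untouched.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path)
    return deleted
```

Such a function would typically be run by a scheduler against the folders holding input images and extracted output, and the returned list can be written to a log as evidence that the retention rule was applied.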
In our use case, after the assessment, most risks have been reduced to the lowest risk level, ‘Low’.
Two risks could remain at ‘Medium’, depending on which of the mitigation measures in the example is
adopted.
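The reversible data masking mentioned among the mitigation measures for insufficient protection of personal data could be sketched as below. This is only an illustration: the e-mail pattern stands in for whatever identifiers the controller needs to protect, the token format is hypothetical, and the key and lookup table would have to be stored separately from the masked text with access limited to authorized people.

```python
import hashlib
import hmac
import re

# Hypothetical detector: e-mail addresses stand in for the identifiers
# that need protection in the extracted text.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def pseudonymize(text, key, pattern=EMAIL):
    """Replace each match with a keyed token.

    Returns the masked text and a lookup table (token -> original value)
    that allows authorized users to reverse the masking."""
    lookup = {}

    def replace(match):
        value = match.group(0)
        digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:12]
        token = "[PSEUDO-" + digest + "]"
        lookup[token] = value
        return token

    return pattern.sub(replace, text), lookup


def reveal(masked, lookup):
    """Restore the original values for users allowed to see unmasked data."""
    for token, value in lookup.items():
        masked = masked.replace(token, value)
    return masked
```

Because the token is derived from the value with a secret key, the same identifier always maps to the same token, so masked documents remain searchable by token without exposing the underlying data.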
We calculate the residual risk by evaluating the likelihood and severity of the risks that still exist
despite the implemented mitigation measures. This residual risk represents the level of risk that
remains after taking mitigation actions.
Once residual risk has been identified, we need to decide whether the residual risk is within acceptable
levels for our organization. If it is, we can decide to accept it. If it's not, we would need to consider
further mitigation strategies.
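The acceptance decision can be expressed as a simple comparison against a threshold. The sketch below assumes, purely for illustration, that the organization accepts only ‘Low’ residual risks; the residual levels are those of our use case when the second mitigation option is chosen for the two risks that can end at ‘Medium’.

```python
# Risk levels ordered from lowest to highest.
LEVELS = ["Low", "Medium", "High", "Very High"]


def needs_further_mitigation(residual, acceptable="Low"):
    """Return the risks whose residual level exceeds the acceptable level."""
    limit = LEVELS.index(acceptable)
    return [name for name, level in residual.items() if LEVELS.index(level) > limit]


# Residual levels from the use case, taking option 2 where two options exist.
residual = {
    "Insufficient protection of personal data": "Low",
    "Possible adverse impact on data subjects": "Medium",
    "Unlawful repurpose of personal data": "Medium",
    "Rights to rectification and erasure not granted": "Low",
    "Unlawful transfer of personal data": "Low",
    "Unlawful unlimited storage of personal data": "Low",
}

to_revisit = needs_further_mitigation(residual)
```

With these inputs, the two ‘Medium’ risks are returned for further mitigation, while an organization that sets `acceptable="Medium"` would accept all of them.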
Some organizations establish criteria for the acceptability of residual risks based on elements such as
social norms, benefits, harms, similar use cases, etc. 43
Organizations must be able to justify their risk mitigation and acceptance decisions as part of their
accountability obligations which also fall under the GDPR principle of accountability (Article 5.2,
Recital 74).
43 https://www.sciencedirect.com/topics/engineering/residual-risk
44 This could be done by performing a pentest and/or requesting pentest results to the vendor.
45 Elsayed et al , “Adversarial reprogramming of neural networks”, 2018
membership inference 46, inversion 47 and poisoning attacks 48. Also access and
change logs should be established to document access and changes to digitized
records.
Possible adverse impact on data subjects that could negatively impact fundamental rights:
As user, implement OCR solutions that offer a minimum 98-99% accuracy. Often the systems offer the results of this metric after every data extraction. It is important to monitor the values and make the necessary adjustments and corrections to the results. Make sure the system recognizes different conditions applicable to the input data.
The quality of input data is important, both for users of OCR systems and for developers, who need quality training data to train their models. As developer there are techniques and tools 49 you can use to compensate for low-quality input data and improve the accuracy levels, such as binarization, deskewing 50, rotation, increasing or reducing brightness, scaling the image, or removing specific objects.
Some of these techniques are also available in the configuration settings of third-party OCR solutions that are available as SaaS 51 and on-premises 52.
Possible adverse impact on data subjects and lack of compliance with GDPR requirement of providing human intervention for processing that can have a legal or important effect on the data subject:
Users could implement a human review process to verify the correctness of the personal data, especially if this is sensitive, and a process to approve high risk decisions after human verification has been done.
It is also important that the system provides an overview of the accuracy and the confidence levels achieved after the data extraction, and a dashboard or other type of interface for manual human review and correction.
In certain use cases it might be necessary to implement a redress mechanism for data subjects.
Lack of compliance with GDPR by not granting data subjects their right to data rectification and erasure:
As user, developer and procurement entity, implement searchable and editable output format functions to identify the personal data in the data extracted so that it is possible to respond to data erasure and rectification requests.
Unlawful unlimited storage of personal data:
As user and procurement entity make agreements with the third-party supplier about how long the input data and output data should be stored. This can be part of the service contract, product documentation 53 or data processing agreement.
If data are being stored on your premises, establish retention rules and/or a mechanism for the deletion of data.
Breach of the data minimization principle:
For users and developers, one possible way to mitigate this risk is by providing documents to the OCR model where personal data has been replaced by synthetic data.
As user, it is also important to compare the different OCR solutions available on the market to understand which systems require a smaller volume of data to train the models and to improve the accuracy levels.
Unlawful transfer of personal data:
As user and procurement entity, verify with the vendor where the data processing is taking place. Carry out the necessary due diligence on safeguards and, when necessary, perform a Data Transfer Impact Assessment. Make the necessary contractual agreements. Consider this risk when making a selection among different vendors.
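The human review measure listed above can be combined with the confidence levels reported by the OCR system. The following sketch routes extracted fields either to automatic acceptance or to a review queue. The field structure and the 0.98 threshold are assumptions: the threshold simply mirrors the 98-99% accuracy figure, and a per-field confidence score is not the same thing as measured accuracy.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    text: str
    confidence: float        # 0.0 to 1.0, as reported by a hypothetical OCR engine
    sensitive: bool = False  # e.g. health data or criminal history


def route_for_review(fields, threshold=0.98):
    """Split fields into an auto-accepted list and a human-review queue.

    Sensitive fields always go to review, whatever their confidence."""
    accepted, review = [], []
    for item in fields:
        if item.sensitive or item.confidence < threshold:
            review.append(item)
        else:
            accepted.append(item)
    return accepted, review
```

Sending sensitive fields to review regardless of confidence reflects the point that verification matters most where an error would have the greatest impact on the data subject.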
Once risk mitigation measures have been implemented, it is crucial to continuously monitor their
effectiveness. Implementing methodologies like threat modeling for the identification of risks,
46 Shokri et al, “Membership Inference Attacks Against Machine Learning Models”, 2017
47 Zhang et al, “Generative Model-Inversion Attacks Against Deep Neural Networks”, 2020
48 Junfeng Guo, Cong Liu, “Practical Poisoning Attacks on Neural Networks”, 2020
49 See next section under ‘Resources about how to improve OCR accuracy applying different tools
and techniques’
50 Deskewing is the process of straightening an image that has been scanned or written crookedly. It
is a process whereby skew is removed by rotating an image by the same amount as its skew but in
the opposite direction.
51 https://help.abbyy.com/en-us/flexicapture/12/distributed_administrator/template_properties/
52 https://selectec.com/on-premise-ocr/
53 In this document from Tencent Cloud, on page 6, a retention policy is indicated for the images
uploaded (input data) and the returned results (output): data is deleted upon completion of the
processing https://main.qcloudimg.com/raw/document/intl/product/pdf/tencent-cloud_1005_50443_en.pdf
maintaining a risk register and assigning risk owners are effective strategies for regularly reviewing
and reassessing the risk landscape. This ensures that the implemented risk mitigation measures
remain relevant and effective in preventing data protection and privacy risks that could adversely
impact individuals and organizations.
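A risk register with assigned owners can be kept in a very simple structure. The sketch below is illustrative only: the field names and the 90-day review interval are assumptions, not requirements from any standard.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RiskEntry:
    risk: str
    owner: str            # person or role accountable for the risk
    residual_level: str   # e.g. "Low" or "Medium"
    last_reviewed: date


def due_for_review(register, today, interval_days=90):
    """Return the entries whose last review is at least `interval_days` old."""
    cutoff = today - timedelta(days=interval_days)
    return [entry for entry in register if entry.last_reviewed <= cutoff]
```

Running such a check periodically gives risk owners a list of entries to reassess, which supports the continuous monitoring described above.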
There is no benchmark available for defining a good CER value, as it is highly dependent on the
use case, the different scenarios and complexity. Some research studies 55 propose that good OCR
accuracy corresponds to a CER of 1-2% (i.e., 98-99% accurate) 56.
54 “Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing”, Nguyen, Adam,
Coustaty, Nguyen, Doucet, 2019
55 https://www.researchgate.net/publication/281583162_Performance_Comparison_of_OCR_Tools
56 https://www.docsumo.com/blog/ocr-accuracy
operates at the word level instead. It represents the number of word substitutions, deletions, or
insertions needed to transform one sentence into another.
While CER and WER are handy, they are not bulletproof performance indicators of OCR models.
This is because the quality and condition of the original documents (e.g., handwriting legibility,
image DPI, etc.) play a very important role, not just the OCR model itself.
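Both metrics can be computed with the same edit-distance routine, applied to characters for CER and to words for WER. The sketch below uses the common definition in which the number of substitutions, deletions and insertions is divided by the length of the reference text; the function names are ours, and the reference is assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences: the minimum number of
    substitutions, deletions and insertions that turn `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate of the OCR output against the ground truth."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: the same computation on whitespace-separated words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For the reference "court filing 2021" and the OCR output "c0urt filing 2021", one character out of 17 is wrong, so the CER is about 5.9%, while one word out of three is wrong, so the WER is about 33%. This shows how a single misread character weighs much more heavily at word level.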
Guidance:
Digitization Guide for Records with Long-term Retention at NYC Agencies
https://www.nyc.gov/assets/records/pdf/Digitization%20Guide%20for%20Records%20with%20Long%20Term%20Retention%20at%20NYC%20Agencies%2020161031.pdf
On-premise or Cloud OCR: A guide to help you decide:
https://www.klippa.com/en/blog/information/on-premise-ocr/
Resources about how to improve OCR accuracy applying different tools and
techniques:
Improve OCR accuracy using advanced pre-processing techniques
https://www.nitorinfotech.com/blog/improve-ocr-accuracy-using-advanced-preprocessing-techniques/
Tesseract: Improving the quality of the output
https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
15 tools for document Deskewing and Dewarping
https://safjan.com/tools-for-doc-deskewing-and-dewarping/
Improve the quality of your OCR information extraction
https://aicha-fatrah.medium.com/improve-the-quality-of-your-ocr-information-extraction-ebc93d905ac4
Methodologies and tools for the identification of data protection and privacy
risks:
Privacy Library of Threats (PLOT4ai) is a threat modeling methodology for the identification of
risks in AI systems. It also contains a library with more than 80 risks specific to AI systems:
https://plot4.ai/
MITRE ATLAS™ (Adversarial Threat Landscape for Artificial-Intelligence Systems), is a
knowledge base of adversary tactics, techniques, and case studies for machine learning (ML)
systems: https://atlas.mitre.org/
Assessment List for Trustworthy Artificial Intelligence (ALTAI) is a checklist that guides
developers and deployers of AI systems in implementing trustworthy AI principles:
https://digital-strategy.ec.europa.eu/en/library/assessment-list-trustworthy-artificial-intelligence-altai-self-assessment