
SUPPORT POOL

OF EXPERTS PROGRAMME

AI Possible Risks & Mitigations


Optical Character Recognition
by Isabel BARBERÁ
AI Possible Risks & Mitigations - Optical Character Recognition (OCR)

As part of the SPE programme, the EDPB may commission contractors to provide reports and tools
on specific topics.

The views expressed in the deliverables are those of their authors and they do not necessarily reflect
the official position of the EDPB. The EDPB does not guarantee the accuracy of the information
included in the deliverables. Neither the EDPB nor any person acting on the EDPB’s behalf may be
held responsible for any use that may be made of the information contained in the deliverables.

Some excerpts may be redacted or removed from the deliverables as their publication would
undermine the protection of legitimate interests, including, inter alia, the privacy and integrity of an
individual regarding the protection of personal data in accordance with Regulation (EU) 2018/1725
and/or the commercial interests of a natural or legal person.


Table of Contents
1. Background
2. Data protection and privacy risk identification
Definition of the criteria to consider when identifying risks and their categorization
Presentation of examples of risks specific to OCR
3. Data protection and privacy risk assessment
Criteria to establish the likelihood of OCR risks. How to assess likelihood.
Criteria to establish the severity of OCR risks. How to assess severity.
Examples of OCR specific risks assessments
4. Data protection and privacy risk treatment
Risk treatment criteria
Presentation of mitigation measure examples/risk treatment options
Residual risk acceptance
Reference to specific technologies, tools, methodologies, processes or strategies.

Disclaimer by the Author: the examples and mentions of companies in this report are illustrative
and do not imply that the author considers them the only or the best choice. The technology analysis
presented in this report is based on the state of the art of the technology in August 2023.

Document submitted in September 2023


1. Background
Description of the task, main technologies used and references to some
openly accessible examples.

OCR stands for Optical Character Recognition, and it is a technology used to convert images or scanned
documents containing text into machine-readable text. OCR technology enables the extraction of text
from both physical paper documents and digital sources.

How do data extraction technologies like OCR work?


OCR techniques use different approaches such as rule-based methods and pattern matching
algorithms to identify characters and convert them into machine-readable text.
The process of extracting data using OCR typically involves three primary tasks: detection, localization,
and segmentation. Each of these stages can employ various algorithms.
• In the detection and localization stages, algorithms are employed to identify and locate text
within a given frame or image.
• Localization algorithms analyze frames to determine the bounding regions surrounding the
text, effectively pinpointing its location. These algorithms work together to identify and
delineate text regions.
• The segmentation task involves converting the localized text into a binary format that is
suitable for OCR processing. Segmentation algorithms apply techniques to transform the text
into a format where characters are clearly distinguished from the background, improving the
accuracy of character recognition.

In the binarization 1 process, characters are identified by recognizing dark areas as text and light areas
as the background. The dark areas undergo processing to identify alphabetic letters or numeric digits.
Algorithms for pattern recognition and feature extraction are then used to identify and analyze these
characters.
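
The binarization step described above can be illustrated with a minimal sketch. It assumes a grayscale image held as a list of rows with pixel values from 0 (black) to 255 (white) and a fixed global threshold of 128; both the sample image and the threshold are hypothetical, and production systems typically choose the threshold adaptively (for example with Otsu's method) rather than hard-coding it.

```python
def binarize(gray, threshold=128):
    """Return a binary image: 1 for dark (text) pixels, 0 for light (background)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


# Hypothetical 4x5 grayscale image containing a small dark stroke.
gray_image = [
    [250, 245,  30, 240, 250],
    [248,  25,  20, 242, 251],
    [249, 247,  22, 243, 250],
    [250, 246,  28, 241, 249],
]

binary_image = binarize(gray_image)

# Print the result with '#' for text pixels and '.' for background pixels.
for row in binary_image:
    print("".join("#" if pixel else "." for pixel in row))
```

The dark pixels that survive this step are the ones passed on to pattern recognition or feature extraction.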

Pattern recognition involves isolating a character image, referred to as a glyph2, and comparing it with
a stored glyph that shares a similar font and scale. For successful pattern recognition, the stored glyph
must closely match the input glyph in terms of font and scale. This approach is more effective when
working with scanned document images that have been typed using a known font.

Feature extraction involves breaking down or decomposing glyphs into various characteristics,
including lines, closed loops, line direction, and line intersections. These features are then used to
identify the most suitable match among the stored glyphs.
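
As a rough illustration of the pattern recognition approach described above, the following sketch compares a binarized input glyph with a small set of stored glyphs and picks the closest one. The 3x3 templates, the pixel-agreement score and the function names are assumptions made for this example only; real OCR engines use far larger glyph sets and more robust similarity measures.

```python
def match_score(glyph, template):
    """Fraction of pixels on which two equally sized binary glyphs agree."""
    total = 0
    agree = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            agree += 1 if g == t else 0
    return agree / total


def recognize(glyph, templates):
    """Return (character, score) for the stored glyph that best matches the input."""
    scored = [(char, match_score(glyph, tpl)) for char, tpl in templates.items()]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical stored glyphs (1 = text pixel, 0 = background).
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

# A "T" with one noisy pixel in the bottom-right corner.
noisy_glyph = [[1, 1, 1], [0, 1, 0], [0, 1, 1]]

best_char, score = recognize(noisy_glyph, TEMPLATES)
print(best_char, round(score, 3))
```

The sketch also shows why the stored glyph must be close in font and scale to the input: the comparison is pixel by pixel, so a glyph of a different size or typeface would agree on far fewer pixels.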

In addition to character recognition, an OCR program examines the structure of a document image by
segmenting it into elements like text blocks, tables, or images. After isolating the characters, the
program compares them with a collection of pattern images. It then processes potential matches and
presents the recognized text as the output.

1 Binarization is the step that is performed prior to performing OCR. The aim of binarization is to
separate foreground text from the background of a document.
2 In typography, a glyph is "the specific shape, design, or representation of a character". It is a
particular graphical representation, in a particular typeface, of an element of written language. Source:
Wikipedia


Image source: Microsoft OCR Vision Studio. Example of segmentation and pattern recognition.

OCR systems often work with an established system of templates; this means that documents need
to have the same basic page structure or the same relative positioning of elements within the
document as the template. Because the system relies on defined templates, using OCR with
documents that have a different structure will result in lower accuracy. Currently, OCR systems based
on templates are available in a broad variety of languages and provide an extensive catalogue
of different templates.

Modern OCR systems also incorporate Machine Learning (ML) algorithms, particularly
those based on Deep Learning, to improve the accuracy of character recognition. These deep
learning models can support documents that have similar information but different page structures.
This is called Intelligent Document Processing (IDP); it uses OCR as its foundational technology to
additionally extract structure, relationships, key-values, entities, and other document insights.
OCR combined with Deep Learning supports structured, semi-structured, and unstructured
documents for data extraction.

These OCR systems are often offered as a Software as a Service (SaaS) solution, with the option
to use pre-trained models or to train your own model with your own dataset 3.

Most vendors offer OCR systems as a cloud solution via a system of APIs 4, which seems to be the
preferred option for most customers because of the ease of integration and fast productivity. Though
some providers 5 of this technology offer a general system where the models are shared by the
customers, others offer the possibility to have a custom model that the customer
can train and delete when necessary 6.

Some vendors also offer the possibility for customers to host the models on-premises making the OCR
capabilities available in the customer’s own local IT infrastructure. This can be a good alternative to
comply with strict security and data governance requirements.

It is also possible to develop and implement your own OCR solution in-house. There are different OCR
libraries and frameworks available, such as Tesseract, OpenCV, EasyOCR, keras-ocr and the FineReader
engine from ABBYY.

3 Microsoft and ABBYY are examples of SaaS OCR solutions offering the two possibilities:
https://www.abbyy.com/vantage/ocr-skill/features/
4 An Application Programming Interface is a way for two or more computer programs to communicate
with each other (source: Wikipedia).
5 In 2023, some of the best-known OCR solution providers were ABBYY, Kofax and Microsoft.
6 https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/overview?view=form-recog-3.0.0


OCR SaaS solution hosted in cloud:
- Ready to use models that are trained by the vendor
- Possibility to create your own self-trained models

OCR third party solution hosted on premises:
- Models trained by vendor or customer

OCR self-developed, hosted on premises:
- Models trained by user

Data flow in an OCR solution:


OCR solutions are mostly used to digitize documents that are originally in paper format. Some OCR
solutions can also be used for the extraction of data in documents that are already available in digital
format. In both cases, the document that we want to digitize and analyze will be considered our Input
Data and the results of the data extraction process will be the Output Data.

In the following examples we show three possible scenarios:


1. A customer uses an OCR third party solution hosted in the cloud
2. A customer uses an OCR third party solution hosted on premises
3. A user develops their own OCR system

1. Example of data flow diagram when using third party OCR systems hosted in the cloud

Step 1: The input data are transmitted via an API from the customer’s location where the OCR system
is located to the vendor’s location in the cloud where the data extraction process will take place.

Step 2: The input (and output) data can be temporarily stored locally at the vendor’s location in the
cloud. The most common storage options are the following:


1. The data could be stored in a buffer only during the execution of the data extraction process. The
vendor does not retain any data once it has sent the output to the customer.
2. The data could also be temporarily stored in cache 7 to be reused by other immediate processes.
The data retention period is variable and depends on the cache memory capacity and configuration.
3. Another possible scenario is the storage of data in a persistent 8 storage layer such as a database or
a cloud storage. This could be done for the analysis or processing of the data at a later stage.

The longer the data is stored in a system, the higher the risk of a data breach, unlawful repurposing or
an infringement of the storage limitation principle. In this specific case, option 1 (buffer) is the
one with the fewest risks, since data is held in memory only for the duration of the process. Option 2 (cache)
also carries a low risk, but data is usually stored for longer than in a buffer, and this can happen
outside the process and even at a different location 9. Option 3 (a storage location such as a file or
a database) is the one with the highest risks, since the data can be stored for a longer
period of time.

Step 3: Once the data extraction process has finalized, the output data are sent back via an API to the
customer.
In some cases, the input and/or the output data could be used by the vendor to retrain and fine-tune
the OCR model. Though this is usually done after informing the customers and obtaining their consent,
it is important to verify it with the vendor.

2. Example of data flow diagram when using third party OCR systems hosted on premises

Step 1: All data transfers and data extraction processes take place internally at the customer’s premises
on their own servers within their data centers.

7 Caching is usually implemented at a software level to reduce the computational overhead of
reprocessing the same text or data and improve overall performance.
8 Persistent storage means that the data remains intact even when the system is powered off or
restarted.
9 The location and method of caching can depend on the specific requirements of the OCR system
and the architecture of the application. For instance, caching could happen in a different location
when using a distributed microservices architecture or a cloud-based caching.


Step 2: The input and output data can be temporarily stored locally. The data could be stored in a
buffer only during the execution of the data extraction process or could be temporarily stored in cache
to be reused by other immediate processes.
The input and/or output data could also be stored in a storage location at the customer’s premises.
This could be done for analysis or processing of the data at a later stage or for auditing purposes.

Step 3: Once the OCR process has finalized, the output data is produced.

The input and/or output data could also be used to retrain and fine-tune the OCR model that is also
stored at the user’s premises.

A self-hosted OCR system from a third party provider can be set up in different ways depending on
the architecture and design choices. The specific details vary depending on whether the set-up is completely on
premises or a hybrid one 10 in which some of the processes are still hosted at the vendor side.
For instance, some of the steps in the data extraction process phase could happen outside the
customer's premises (see image below).

It's important to review the documentation and architecture of the third party OCR system to
understand its data flows and whether there are data transfers to the vendor or other third-parties.
Additionally, it is also important to assess the system's compatibility with the user’s infrastructure, the
security as well as any potential required change in network or firewall configurations.

10 Different steps from the data extraction phase could take place in the cloud. This could be a
decision due to various reasons such as resource-intensive tasks, redundancy, expertise, etc.


3. Example of data flow diagram when using a self-developed OCR system

The data flow in this scenario is similar to that of example 2. In this use case, there are no data
transfers to any OCR vendor, and all processes are executed internally, on premises.

Performance measure of OCR systems: Accuracy and Confidence Scores


The performance of OCR systems can be measured with different metrics, accuracy and
confidence scores being the two most commonly used.

The accuracy of an OCR system measures the percentage of correctly identified characters or words
with respect to the total number of characters or words. It is typically evaluated by comparing its output
to the ‘ground truth’ 11 and calculating the proportion of characters or words that were correctly
identified. Any discrepancies between the OCR output and the ground truth are considered errors.
Higher accuracy rates indicate better performance. The accuracy value range is usually represented
as a percentage between 0% (low) and 100% (high). For printed documents with clear and legible
text, accurate OCR results in the range of 95% to 99% are commonly achievable. However, it's
important to note that the accuracy can vary depending on the specific document types, languages,
and OCR software being used.
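
A minimal sketch of how character-level accuracy could be computed against the ground truth is shown below. It assumes that errors are counted with the Levenshtein edit distance (substitutions, insertions and deletions) and that accuracy is one minus the error count divided by the length of the ground truth; this is one common convention, not necessarily the one used by any particular OCR vendor, and the sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_accuracy(ocr_output, ground_truth):
    """Share of ground-truth characters recognized correctly, between 0.0 and 1.0."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ocr_output, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


# Hypothetical example: two characters misread ("1" as "l", "8" as "B").
ground_truth = "Invoice 1081"
ocr_output = "Invoice l0B1"

accuracy = character_accuracy(ocr_output, ground_truth)
print(f"{accuracy:.1%}")
```

The same calculation can be applied at word level by splitting both strings into words and comparing the two word lists instead of the characters.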

Confidence score is a measure that provides an indication of the level of certainty of correctness
associated with the extracted data. It is represented with a number between 0 and 1 and a high
confidence score would mean that the OCR system believes its recognition of a particular text is likely
to be correct. The confidence scores are typically determined by the OCR software or algorithm itself
though in some OCR systems 12 users can set confidence score thresholds to filter out characters or
results that fall below or above a certain level of confidence. This threshold can be set based on the
desired level of accuracy and tolerance for errors. The specific method for calculating confidence
scores may vary depending on the OCR system and the underlying algorithms used.

11 The accurately known text that the images or documents being processed by the OCR system are
supposed to represent. The term "ground truth" is used in machine learning to refer to the actual
values or outcomes that a model's predictions are compared against during training and evaluation.
12 https://pyimagesearch.com/2020/05/25/tesseract-ocr-text-localization-and-detection/


Confidence scores can be calculated in different ways, but the score is often based on the clarity of
the image, character distinctness, context of surrounding text and the similarity of the identified
characters to the patterns the system has been trained on.
While confidence scores at the character level are prevalent, OCR systems may also provide
confidence scores at other levels such as word, line, or block.
The availability and presentation of confidence scores may vary among different OCR systems and
software implementations. Some systems may only provide character-level scores, while others may
offer a combination of character-level and higher-level scores.

While a high confidence score like 0.95 suggests that the OCR system believes the output is correct,
it does not guarantee that the output is actually accurate.
If an OCR system persistently misreads a particular character due to errors in its training data, it could
still assign a high confidence score to this incorrect interpretation. Similarly, an accurate result could
receive a low confidence score if the system finds the recognition challenging due to factors like image
quality or an unconventional font style. Hence, it's critical to consider these factors when interpreting
confidence scores and accuracy in OCR systems.
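
The confidence thresholds mentioned above can be sketched as a simple filter that separates results the system is fairly sure about from those that should be checked by a person. The word list, the field names and the threshold of 0.80 are hypothetical and only serve as an illustration; actual OCR systems expose confidence values in their own formats.

```python
def split_by_confidence(words, threshold=0.80):
    """Split OCR results into accepted words and words that need human review."""
    accepted = [word for word in words if word["confidence"] >= threshold]
    needs_review = [word for word in words if word["confidence"] < threshold]
    return accepted, needs_review


# Hypothetical word-level OCR output with confidence scores between 0 and 1.
ocr_words = [
    {"text": "Invoice", "confidence": 0.98},
    {"text": "No.", "confidence": 0.91},
    {"text": "l0B1", "confidence": 0.62},
]

accepted, needs_review = split_by_confidence(ocr_words)
print("accepted:", [word["text"] for word in accepted])
print("needs review:", [word["text"] for word in needs_review])
```

Since a high confidence score does not guarantee a correct result, a filter of this kind reduces the review workload but does not replace sampling and checking accepted results against the ground truth.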

Issues that can affect accuracy of the output:

• The accuracy of the models can be affected by variations in the structure of the documents. The
accuracy scores can be inconsistent when the analyzed documents differ from documents used
for training the model.
• The accuracy of the output is determined by the conditions and the quality of the input images.
For instance, the system can be susceptible to variations in the position of a document (landscape
or portrait), or to changes in fonts and formatting.
• OCR can introduce errors, such as incorrectly recognizing a character as another one. For example,
the OCR could recognize “N” and change it to “E.” This is common in texts with non-English
characters. OCR might mistake a lowercase “l” for a “1”, or a “b” for an “8”. This can cause
problems if the text is used for critical purposes, as could be the case with some legal documents.
• Punctuation marks cannot always be read by OCR because they are too small or non-contiguous,
or because they’re upside down and backwards.
• OCR may not be able to recognize text correctly if the text is in a language not supported by the
OCR engine. It is important to verify that your language is supported. An OCR system might have
difficulties properly recognizing right-to-left languages.
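
Character confusions such as “l” for “1” or “b” for “8” can be partly caught with a post-processing check on fields that are expected to contain only digits. The sketch below is an assumption about how such a check might look: the mapping contains only the two pairs mentioned above, and the function merely proposes a corrected value and flags positions for human review rather than silently changing the output.

```python
# Illustrative mapping of letters that are commonly misread in place of digits.
CONFUSABLE_TO_DIGIT = {"l": "1", "b": "8"}


def check_numeric_field(value, confusables=CONFUSABLE_TO_DIGIT):
    """Return (suggested_value, flagged_positions) for a field expected to be numeric."""
    suggestion = []
    flagged_positions = []
    for index, char in enumerate(value):
        if char in confusables:
            suggestion.append(confusables[char])
            flagged_positions.append(index)
        else:
            suggestion.append(char)
    return "".join(suggestion), flagged_positions


# Hypothetical OCR output for an invoice number that should read "1081".
suggested, flagged = check_numeric_field("l0b1")
print(suggested, flagged)
```

A check like this is only appropriate where the expected format of a field is known in advance; applied to free text, it would introduce new errors.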


Common uses of OCR technologies:


Currently, data extraction techniques like OCR are being used in different use cases. Here are examples
of some of them:

• Accounts payable 13: to speed up invoice data entry. OCR technology can be used to automate the
process of entering invoice data into a system. Instead of manually keying in data from paper or
digital invoices, the OCR software extracts key information such as supplier names, invoice dates,
amounts, and invoice numbers. This helps to reduce data entry errors and speeds up the accounts
payable process.
• Banking 14: to extract and digitize information making data easier to search, store, and manage. In
identity verification, OCR is used to read data from identity cards, passports, or driving licenses
quickly and accurately. By extracting information from identity documents, OCR facilitates the
verification of customer details and speeds up the account opening procedures.
• Digitizing 15 and/or archiving of paper documentation, converting printed paper documents into
machine-readable text documents. Once digitized, the text from these documents can be easily
searched, edited, stored, and managed, making it much more accessible and useful.
• Vehicle license plate identification. 16 OCR is a key technology behind Automatic Number Plate
Recognition (ANPR) systems. These systems use OCR to read the license plate numbers of vehicles
from digital images or video feeds for purposes like traffic enforcement, toll collection, or parking
management.
• Consumer behavior and market analysis: extracting data from retail receipts and consumer-
generated content (such as reviews or handwritten notes), which can then be analyzed for insights
and consumer behavior analysis. 17
• Transforming documents into text that can be read aloud to visually impaired or blind users. 18
OCR is used in assistive technologies to convert printed text into digital text, which can then be
read aloud using text-to-speech (TTS) systems.
• Logistics and warehouse automation. 19 OCR can be used to automate processes such as inventory
management, shipping, and receiving of goods. For example, OCR can be used to read labels,
barcodes, or other identifiers on packages, enabling automatic tracking and sorting of goods.
• Medical Documentation Transcription & Automation. 20 OCR can be used to integrate paper and
images originating from existing patient records into new electronic health records (EHR). It
extracts the data required to automatically associate the information in a patient record, such as
a medical record number, date of birth, patient name, etc., with the right electronic health record.
It can also be used to digitize medical prescriptions.
A real use case of OCR in the medical sector is the ABBYY 21 medical records management software.

13 Example: https://rossum.ai/
14 Example: https://www.klippa.com/en/blog/information/4-ways-to-perform-document-based-kyc-checks-with-ocr-and-ai/
15 Example: https://pdf.abbyy.com/finereader-pdf-for-mac/how-to/digitize-paper/
16 Example: https://platerecognizer.com/
17 Example: https://microblink.com/commerce/receipt-ocr/
18 Example: https://www.afb.org/blindness-and-low-vision/using-technology/assistive-technology-videos/scanning-and-specialized-ocr
19 Example: https://www.innovapptive.com/blog/using-optical-character-recognition-ocr-to-overcome-3-supply-chain-bottlenecks
20 Example: https://nanonets.com/blog/ocr-for-healthcare/
21 https://www.abbyy.com/solutions/healthcare/capture-to-ehr/


2. Data protection and privacy risk identification


Definition of the criteria to consider when identifying risks and their
categorization

To help identify risks associated with the use of data extraction technologies like OCR, we can make use
of a variety of risk factors.

Risk factors are conditions associated with a higher probability of undesirable outcomes. They can
help to identify, assess, and prioritize potential risks. For instance, using health data and processing
large volumes of data are risk factors with a high level of risk. Acknowledging them in your own use
case can help you identify related potential risks and their severity. In this case, an example of an
associated risk with a high severity could be ‘a risk of violation of patients’ privacy due to a data breach’.

The risk factors shown below are the result of analysing the contents of legal instruments such as the
GDPR 22, the EUDPR 23, the EU Charter 24 and other applicable guidelines related to privacy and data
protection 25.
The following risk factors can help us identify high level data protection and privacy risks in data
extraction technologies like OCR:

High level Risk / Important concerns, each followed by examples of applicability

Sensitive & impactful purpose of the processing
Using an OCR system to decide on or prevent the exercise of fundamental rights of individuals, or about
their access to a service, the execution or performance of a contract, or access to financial services is a
concern, especially if these decisions will be automated without human intervention. Wrong decisions
could have an adverse impact on individuals.
Examples of applicability:
- When documents are classified and archived automatically, and the classification could have an impact
on the data subject. This could also apply in use cases such as banking, health sector or ANPR 26.

Processing sensitive data
When the OCR system is processing sensitive data such as: health data, special categories of data,
personal data related to convictions and criminal offences, financial data, behavioural data, unique
identifiers, location data, etc. This is a reason for concern since processing this personal data
inappropriately could negatively impact individuals.
Examples of applicability:
- When using OCR for digitizing medical records or legal documents by courts.
- When using OCR for digitizing invoices or other processes in banking sector, for consumer behavior and
market analysis, in data extraction from identification documents and bank cards, ANPR and the
digitization of medical records.

Large scale processing
Processing high volumes of personal data is a reason for concern, especially if these personal data are
sensitive. The higher the volume, the bigger the impact in case of a data breach or any other situation
that puts the individuals at risk.
Examples of applicability:
This could apply to most OCR use cases since data extraction technologies like OCR are usually applied
to large volumes of data.

Processing data of vulnerable individuals
This is a concern because vulnerable individuals often require special protection. Processing their
personal data without proper safeguards can lead to violations of their fundamental rights. Some
examples of vulnerable individuals are children, elderly people, people with mental illness, disabled,
patients, people at risk of social exclusion, asylum seekers, persons who access social services,
employees, etc.
Examples of applicability:
This could be the case when OCR solutions are used in the health sector, at schools, social services
organizations, government institutions, employers, etc.

22 General Data Protection Regulation (2016/679)
23 European Union Data Protection Regulation (Reg. 2018/1725)
24 Charter of Fundamental Rights of the European Union (2012/C 326/02)
25 Pag. 79, AEPD, “Risk Management and Impact Assessment in Processing of Personal Data”, 2021
26 Automatic Number Plate Recognition


Low data quality
The low data quality of the input data and/or the training data is a concern, bringing possible risks of
inaccuracies in the generated output, which could cause wrong identification of characters and have
other adverse impacts depending on the use case.
Examples of applicability:
OCR systems are not 100% accurate and quality issues in the input data are common.

Insufficient security measures
The lack of sufficient safeguards could be the cause of a data breach. Data could also be transferred to
states or organisations in other countries without an adequate level of protection.
Examples of applicability:
- This could be the case if there are not sufficient safeguards implemented to protect the input data and
the results of the processing. This could be applicable to any use case.
- Data extraction technologies like OCR are often offered as SaaS solutions. Input data could be sent for
processing to countries without an adequate level of protection.

Presentation of examples of risks specific to OCR


Technologies for data extraction like OCR can present different types of privacy and data protection
risks. The number and type of risks will depend on the use case, the context in which the technology
is being applied as well as the different risk factors previously identified.
We are going to analyze different risks related to the procurement, development and use of this
technology.

Data protection and privacy risks posed by the procurement of those types of AI systems:
Data extraction solutions are frequently available as SaaS solutions from third party providers. Due to
the different types of configurations available and the required maintenance of the models used, the
use of an external supplier is usually the preferred option for users of this technology.
Some third party data extraction systems can also be hosted on-premises, though this is rare.

Data protection and privacy risk: Insufficient protection of personal data that eventually can be the cause of a data breach
Risk description: Safeguards for the protection of personal data are not implemented or are insufficient.
GDPR potential impact: Infringement of Art. 32 Security of processing, Art. 5(f) Integrity and confidentiality and Art. 9 Processing of special categories of personal data.
Examples: OCR systems that process text containing personal data might not be properly secured. This could be the case if, for instance, transmission of data is not secure, or data are not stored encrypted or with an adequate access control mechanism.
Risk applicable on service model provision:
• SaaS cloud
• On-premises

Data protection and privacy risk: Possible adverse impact on data subjects that could negatively impact fundamental rights
Risk description: The output of the system could have an adverse impact on the individual if erroneous data are used for important decisions.
GDPR potential impact: Infringement of Art. 5(d) Accuracy, Art. 5(a) Fairness, Art. 22 Automated individual decision-making, including profiling, Art. 25 Data protection by design and by default.
Examples: A system providing output that is not accurate and that does not provide mechanisms to amend errors. Or when vendors claim their system offers certain performance, but this is not reproduced in real cases.
Risk applicable on service model provision:
• SaaS cloud
• On-premises

Data protection and privacy risk: Lack of compliance with GDPR by not granting data subjects their right to data rectification and erasure
Risk description: Data subjects’ requests to rectify or to erase personal data cannot be completed.
GDPR potential impact: Infringement of Art. 16 and Art. 17: Right to rectification and right to erasure.
Examples: A low-quality output could prevent a controller from finding all the data of a data subject in their data storage since the data cannot be matched properly. This could also be the case if there is no possibility to search for the data subject’s data in the output and to correct and delete data.
Risk applicable on service model provision:
• SaaS cloud
• On-premises

Data protection and privacy risk: Unlawful repurpose of personal data
Risk description: Personal data extracted is used for a different purpose.
GDPR potential impact: Infringement of Art. 5(b) Purpose limitation, Art. 5(a) Lawfulness, fairness and transparency, Art. 29 Processing under the authority of the controller or processor.
Examples: This could be the case if the supplier uses the input and/or output data for training the ML models without this being formally agreed on beforehand.
Risk applicable on service model provision:
• SaaS cloud

Data protection and privacy risk: Unlawful unlimited storage of personal data
Risk description: Input data and/or data extracted from images is being stored longer than necessary.
GDPR potential impact: Infringement of Art. 5(e) Storage limitation.
Examples: The system could be unnecessarily storing input data that is not directly relevant to the OCR process. In some cases, the output could be stored by the vendor longer than necessary.
Risk applicable on service model provision:
• SaaS cloud
• On-premises

Data protection and privacy risk: Unlawful transfer of personal data
Risk description: Data are being processed in countries without an adequate level of protection.
GDPR potential impact: Infringement of Art. 44 General principle for transfers, Art. 45 Transfers on the basis of an adequacy decision, Art. 46 Transfers subject to appropriate safeguards.
Examples: Data extraction solutions could store and process the data in countries that do not offer enough safeguards.
Risk applicable on service model provision:
• SaaS cloud

Data protection and privacy risks posed by the development of those types of AI systems:
The development of data extraction technologies can also face data protection and privacy risks.
Risks could arise at different phases of the development life cycle, which is why it is important to
implement an iterative process for identifying these types of risks.

The development of an OCR system typically involves training machine learning models on large
datasets of annotated images or documents. These datasets can consist of various types of digital
and printed documents. The data used for training an OCR system typically includes:
- Training Data: this data includes a diverse set of images or documents that represent the
target domain. It encompasses a wide range of fonts, text sizes, styles, layouts, and
document types.
- Validation Data: a separate portion of the dataset is reserved for validation purposes during
the model development process.
- Test Data: the remaining portion of the dataset, which is used to evaluate the final performance of
the trained OCR system. The test data should be representative of real-world scenarios
and provide a fair assessment of the system's accuracy and reliability.
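
As an illustration of the three portions described above, the following is a minimal sketch of a reproducible split of a document collection. The function name `split_dataset`, the 80/10/10 proportions, the seed and the file names are assumptions chosen for the example and are not taken from any particular OCR toolkit.

```python
import random


def split_dataset(items, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle items reproducibly and split them into train, validation and test lists."""
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    validation = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]  # everything not held out is used for training
    return train, validation, test


# Hypothetical file names standing in for annotated document images.
documents = [f"scan_{i:03d}.png" for i in range(100)]
train, validation, test = split_dataset(documents)
print(len(train), len(validation), len(test))
```

Keeping the test portion strictly separate from the training portion is what allows it to give a fair assessment of accuracy. A purely random split like this one does not by itself guarantee that every document type, font or language is represented in each portion.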

OCR system developers often curate or collect their own datasets, which can include publicly
available data, proprietary data, or datasets obtained through partnerships or collaborations. It is
important to mention that training data can introduce certain risks in the development of OCR
systems. Here are a few key considerations:
- Bias present in the training data, such as imbalances in document types, languages, or fonts,
can impact the OCR system's performance and introduce unfairness.
- Inaccurate or incomplete annotations in the training data can adversely affect the
performance of the OCR system. If the labeled data contains errors or inconsistencies, the
model may learn incorrect patterns or struggle to generalize well to unseen data. Ensuring
high-quality annotations is crucial for effective training.
- If the training data does not adequately cover the full range of document types, fonts, text
sizes, or languages encountered in real-world scenarios, the system may struggle to
accurately recognize text in unseen or challenging conditions.
- The training data may contain sensitive or private information, such as personal details or
confidential documents.
- Training data could be collected and used in an unethical manner, without respecting
privacy, consent, copyright and other legal obligations.

The following table offers an overview of data protection and privacy risks that developers of OCR
systems should consider during the design and development phase. The idea behind this table is to
make developers conscious of privacy by design choices that can help prevent risks:

Data protection and privacy risk: Insufficient protection of personal data that eventually can be the cause of a data breach
Risk description: Safeguards for the protection of personal data that is part of the training dataset are not implemented or are insufficient
GDPR potential impact: Infringement of Art. 32 Security of processing, Art. 5 (f) Integrity and confidentiality and Art. 9 Processing of special categories of personal data
Examples: We could be using third party libraries, SDK 27 or applications for the development of the OCR system, and we could be leaking data to these third parties. The system could be integrated with other systems internally and the transmission of input data could be insecure; data could be stored unencrypted and with inadequate access control mechanisms. If using the cloud, it might not be configured according to security best practices.

Data protection and privacy risk: Possible adverse impact on data subjects that could negatively impact fundamental rights
Risk description: The output of the system could have an adverse impact on the individual if inaccurate data are used for important decisions
GDPR potential impact: Infringement of Art. 5 (d) Accuracy, Art. 5(a) Fairness, Art. 22 Automated individual decision-making, including profiling, Art. 25 Data protection by design and by default
Examples: Confidence levels could be based on validation rules and historical data, which could act as a proxy for the quality of the extraction. This could prevent errors in documents from being flagged. OCR systems might then assign high confidence scores to incorrect predictions (false positives) or low confidence scores to correct predictions (false negatives), which can lead to incorrect interpretations or misclassifications.

Data protection and privacy risk: Lack of compliance with GDPR by not granting data subjects their right to data rectification and erasure
Risk description: Data subjects’ requests to rectify or to erase personal data cannot be completed
GDPR potential impact: Infringement of Art. 16 and Art. 17 Right to rectification and right to erasure
Examples: A low-quality output could prevent a user from finding all the data of a data subject in their data storage since the data cannot be matched properly. This is more problematic if the developed application does not provide searchable output formats and high accuracy levels.

Data protection and privacy risk: Unlawful unlimited storage of personal data
Risk description: Input data and/or data extracted from the images are being stored longer than necessary
GDPR potential impact: Infringement of Art. 5 (e) Storage limitation
Examples: This could be the case if training datasets containing personal data are stored for too long. But it could also be the case if the system is developed in a way where input and output data are automatically stored without offering the user the possibility of deletion.

Data protection and privacy risk: Breach of the data minimization principle
Risk description: Extensive processing of personal data for training the model
GDPR potential impact: Infringement of Art. 5 (c) Data minimisation
Examples: Certain OCR systems require large amounts of data to train the models 28. Tasks that require handling more diverse fonts, styles, or languages may generally require a larger dataset to capture the necessary variability.

27 SDK stands for software development kit. An SDK is a set of software-building tools for a specific platform.

Data protection and privacy risks posed by the use of those types of AI systems:
Users of data extraction technologies need to consider the risks related to their specific use cases
and context. Making use of the risk factors or evaluation criteria can facilitate the identification of
those risks. For instance, the criterion ‘large-scale processing of personal data’ can already trigger the
identification of risky processing activities that could result in harm.

When using an OCR solution, users have three service model provisions available: a SaaS
solution from a third party provider hosted in the cloud, a third party solution hosted on-premises 29,
and a self-developed solution hosted on-premises.

Data protection and privacy risk: Insufficient protection of personal data that eventually can be the cause of a data breach
Risk description: Safeguards for the protection of personal data are not implemented or are insufficient
GDPR potential impact: Infringement of Art. 32 Security of processing, Art. 5 (f) Integrity and confidentiality, and Art. 9 Processing of special categories of personal data
Examples: OCR systems that process text containing personal data might not be properly secured. This could be the case if transmission of input data is not secure, or data are not stored encrypted or with an adequate access control mechanism. This is especially sensitive when we are processing special categories of personal data, for example when using OCR for digitizing medical records, criminal data and banking information.
Risk applicable on service model provision: SaaS cloud; Third party on-premises; Self-developed

Data protection and privacy risk: Possible adverse impact on data subjects that could negatively impact fundamental rights
Risk description: The output of the system could have an adverse impact on the individual if erroneous data are used for important decisions.
GDPR potential impact: Infringement of Art. 5 (d) Accuracy, Art. 5(a) Fairness, Art. 22 Automated individual decision-making, including profiling, Art. 25 Data protection by design and by default
Examples: Errors in the output could incorrectly attribute actions to an individual or group (misspelling errors in names and dates, for instance). This could have an especially big impact when using OCR for digitizing medical records, banking ID validation, legal documents with sensitive information and criminal records.
Risk applicable on service model provision: SaaS cloud; Third party on-premises; Self-developed

Data protection and privacy risk: Possible adverse impact on data subjects and lack of compliance with GDPR requirement of providing human intervention for processing that can have a legal or important effect on the data subject
Risk description: Data subjects are subjected to an automatic decision-making process without human intervention, and/or there is a processing of special categories of personal data
GDPR potential impact: Infringement of Art. 22 Automated individual decision-making, including profiling, Art. 9 Processing of special categories of personal data
Examples: The output of an OCR system could be used to make automatic decisions which produce legal effects or similarly significant effects on data subjects. This could be the case when using OCR in the banking sector for identity verification by enabling automatic comparison and validation of the provided data against reference databases or identity records. Or when data are extracted from legal contracts or financial statements and are then analyzed automatically to identify non-compliant clauses, suspicious activities, or anomalies, triggering appropriate actions or alerts.
Risk applicable on service model provision: SaaS cloud; Third party on-premises; Self-developed

Data protection and privacy risk: Lack of compliance with GDPR by not granting data subjects their right to data rectification and erasure
Risk description: Data subjects’ requests to rectify or to erase personal data cannot be completed
GDPR potential impact: Infringement of Art. 16 and Art. 17 Right to rectification and right to erasure
Examples: This could be the case if there is no possibility to search for the data subject’s data in the output and to correct and delete it.
Risk applicable on service model provision: SaaS cloud; Third party on-premises; Self-developed

Data protection and privacy risk: Unlawful unlimited storage of personal data
Risk description: Input data and/or output data are being stored longer than necessary
GDPR potential impact: Infringement of Art. 5 (e) Storage limitation
Examples: In principle an OCR system doesn’t need to store the input data unless it is necessary for audit, verification, or archival purposes. The system should avoid unnecessary retention or storage of input data that is not directly relevant to the OCR process. The storage of an OCR system's output depends on the specific application and requirements. In some cases, the output could be stored by the vendor longer than necessary. But this could also be the case on-premises if we are not applying data retention rules to our stored data.
Risk applicable on service model provision: SaaS cloud; Third party on-premises; Self-developed

Data protection and privacy risk: Breach of the data minimization principle
Risk description: Extensive processing of personal data for training the model
GDPR potential impact: Infringement of Art. 5 (c) Data minimisation
Examples: There are certain OCR systems that might require the user to input samples of data until the system has learned the additional features. The input data is used to train or fine-tune the OCR model, enabling it to handle the specific complexities of the given OCR task. This can be the case in OCR systems used in medical fields that may require specialized training on medical terminology and handwritten prescriptions, and also for handwriting recognition, or when using low-quality documents or specialised fonts.
Risk applicable on service model provision: SaaS cloud; Third party on-premises; Self-developed

Data protection and privacy risk: Unlawful transfer of personal data
Risk description: Data are being processed in countries without an adequate level of protection
GDPR potential impact: Infringement of Art. 44 General principle for transfers, Art. 45 Transfers on the basis of an adequacy decision, Art. 46 Transfers subject to appropriate safeguards
Examples: OCR systems could store and process the data in countries that do not offer enough safeguards. This could also be the case with self-developed systems if we store the data in the cloud.
Risk applicable on service model provision: SaaS cloud; Self-developed if using Cloud

28 The size of a training dataset can vary depending on multiple factors such as the complexity of the
documents, the diversity of fonts and text styles, and the desired level of accuracy. For simpler
document types with limited variations in fonts and layouts, a smaller training dataset may be
sufficient to achieve reasonable accuracy. However, for more complex document types or scenarios
requiring high accuracy, a larger and more diverse training dataset is typically necessary.
29 Example: https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/faq?view=form-recog-3.0.0

3. Data protection and privacy risk assessment


Once risks have been identified, it is time to proceed with their classification. The actual risk level or
risk classification will depend on the specific use case and context.

The GDPR outlines in Recital 90 the importance of establishing the context: “taking into account the
nature, scope, context and purposes of the processing and the sources of the risk”.
Establishing the context is an important step when performing a privacy risk assessment to manage
risks to the rights and freedoms of natural persons.
The processes that follow are 30:
- assessing the likelihood and severity of the risks;
- treating the risks by mitigating the identified risks and in that way ensuring the protection of
personal data and demonstrating compliance with the GDPR and EUDPR.

There are different risk management methodologies available to classify and assess risks. It is not the
purpose of this document to define or establish a methodology to be used since this is a choice that
should be left to each organization. But for the purpose of this document, we will use the
international standards that have been previously referenced in the WP29 31 and the AEPD 32
Guidelines.

In general risk management terms, risk can be summarized in one equation:


Risk = Likelihood x Severity
This means that risk is the probability of an event occurring, multiplied by the potential impact or
severity incurred by the event.

To assess the level of risk of the data protection and privacy risks identified when procuring,
developing and using data extraction technologies, we first need to estimate the likelihood and
severity of the identified risks happening.

Criteria to establish the likelihood of OCR risks. How to assess likelihood.

To determine the likelihood of the risks of data extraction technologies we are using the following
four level risk classification matrix:

Level of likelihood | Definition
Very High | High likelihood of an event occurring
High | Substantial probability of an event occurring
Low | Low probability of an event occurring
Unlikely | There is no evidence of such a risk materializing in any case

Likelihood can only be determined based on specific risks and use cases. We will look later at a
specific example to better understand how this process works.

30 Guidelines on Data Protection Impact Assessment (DPIA) and determining whether processing is
“likely to result in a high risk” for the purposes of Regulation 2016/679, Article 29 Data Protection
Working Party, Last revision 2017
31 ISO 31000:2009, Risk management - Principles and guidelines, International Organization for
Standardization (ISO); ISO/IEC 29134, Information technology - Security techniques - Privacy impact
assessment - Guidelines, International Organization for Standardization (ISO).
32 ISO 31010:2019, Risk management - Risk Assessment Techniques, International Organization for
Standardization (ISO)

Criteria to establish the severity of OCR risks. How to assess severity.

To determine the severity of risks of data extraction technologies we are using the four level risk
classification matrix 33:

Level of severity | Definition

Very Significant | It affects the exercise of fundamental rights and public freedoms, and its consequences are irreversible and/or the consequences are related to special categories of data or to criminal offences and are irreversible and/or it causes significant social harm, such as discrimination, and is irreversible and/or it affects particularly vulnerable data subjects, especially children, in an irreversible way and/or causes significant and irreversible moral or material losses.

Significant | The above cases when the effects are reversible and/or there is loss of control of the data subject over their personal data, where the extent of the data is high in relation to the categories of data or the number of subjects and/or identity theft of data subjects occurs or may occur and/or significant financial losses to data subjects may occur and/or loss of confidentiality of data subject or breach of the duty of confidentiality and/or there is a social detriment to data subjects or certain groups of data subjects.

Limited | Very limited loss of control of some personal data and to specific data subjects, other than special category or irreversible criminal offences or convictions and/or negligible and irreversible financial losses and/or loss of confidentiality of data subject to professional secrecy but not special categories or infringement penalties.

Very Limited | In the above case (limited), when all effects are reversible.

The severity criteria are related to a loss of privacy that is experienced by the data subject but that
may have further related consequences impacting other individuals and/or society.

Example of OCR specific risk assessment

Use case: OCR system for the digitization of legal documents

Scenario: We want to digitize legal documents containing court filings for archiving purposes. The
documents contain sensitive personal data such as criminal history, health and financial information.
We do not have the expertise to develop and host an OCR system ourselves, so we are going to
contract a third-party provider offering a SaaS solution in the cloud.

33 Page 77, AEPD, “Risk Management and Impact Assessment in Processing of Personal Data”, 2021


The following risk factors/important concerns from section 2 could be applicable in our specific use
case:

Risk factor | Use case applicability
Processing sensitive data | Personal data related to convictions and criminal offences, health and financial information
Large scale processing | The volume of data to be processed is high
Processing data of vulnerable individuals | Criminal records
Low data quality | We do not know if the dataset is of sufficient quality
Insufficient security measures | We might transfer personal data to states or organisations in other countries without an adequate level of protection. There could be a possibility of a data breach. Third party vendor solution has not been chosen yet; this must be taken into account when making a choice.
Infringement of regulatory requirements | We might store the data too long, vendor might use it for other purposes than OCR, and we might not be able to answer data subject rights requests for instance.

Based on the identified risk factors, we are going to identify, together with other stakeholders 34, the
data protection and privacy risks that could arise with the OCR implementation.

We are going to use the risks identified in section 2 for procurement as a foundation, assess the
likelihood of each identified risk, and assign to each risk one of the four likelihood
classification levels from the matrix: Very high, High, Low, Unlikely.

Data protection and privacy risk: Insufficient protection of personal data that eventually can be the cause of a data breach
Risk factors: Insufficient security measures; Processing sensitive data; Large scale processing; Processing data of vulnerable individuals
Risk description: Safeguards for the protection of personal data are not implemented or are insufficient
Likelihood: Low
Reasoning: The third-party suppliers we have reviewed have implemented security measures such as secure transmissions, strong access control measures, and data encryption at rest. We, as user/customer, also have strong security processes implemented internally.

Data protection and privacy risk: Possible adverse impact on data subjects that could negatively impact fundamental rights
Risk factors: Low data quality; Large scale processing; Processing data of vulnerable individuals
Risk description: The output of the system could have an adverse impact on the individual if erroneous data are used for important decisions
Likelihood: Low
Reasoning: Though the impact would be high, we could consider the likelihood of the risk happening low due to the nature of the processing, which is only for the purpose of archiving with no further analysis of the data. This specific assessment could also be done from the perspective of the likelihood of the output having a negative impact in general, and in that case we could consider a higher probability. It is important to assess the risks based on context.

Data protection and privacy risk: Lack of compliance with GDPR by not granting data subjects their right to data rectification and erasure
Risk factors: Processing sensitive data; Processing data of vulnerable individuals; Infringement of regulatory requirements
Risk description: Data subjects’ requests to rectify or to erase personal data cannot be completed
Likelihood: Low
Reasoning: For this exercise we assume that in the OCR SaaS solution we are going to use in our use case, the data can be easily searched once it is digitized. There is always a small possibility that not all data has been properly extracted and that some cannot be found with the search function. The system offers the possibility to delete data from the output since output is available in formats that are modifiable.

Data protection and privacy risk: Unlawful repurpose of personal data
Risk factors: Processing sensitive data; Processing data of vulnerable individuals; Insufficient security measures; Infringement of regulatory requirements
Risk description: Personal data extracted is used for a different purpose
Likelihood: Low. *This risk will be unlikely in an on-premises solution.
Reasoning: Except in a case of unlawful processing by the third party provider, the probability of the data extraction results being used to retrain and fine-tune the OCR model is in principle low. Most SaaS solutions delete the data or offer you the possibility to decide if you want to share the results for that purpose.

Data protection and privacy risk: Unlawful unlimited storage of personal data
Risk factors: Processing sensitive data; Processing data of vulnerable individuals; Insufficient security measures; Infringement of regulatory requirements
Risk description: Input data and/or data extracted from images are being stored longer than necessary
Likelihood: Low
Reasoning: For our use case we could consider, for instance, vendors that offer a 24-hour automated deletion period, which already reduces the likelihood of this risk. Vendors usually offer a 24 or max. 48-hour automated deletion period in SaaS solutions. Self-developed OCR systems can be configured in a way that input and output can be deleted immediately or at a scheduled moment.

Data protection and privacy risk: Unlawful transfer of personal data
Risk factors: Insufficient security measures; Processing sensitive data; Processing data of vulnerable individuals; Infringement of regulatory requirements
Risk description: Data are being processed in countries without an adequate level of protection
Likelihood: Low
Reasoning: In our specific use case, we have decided to work with vendors that offer an adequate level of protection, which already reduces the likelihood though not the impact.

34 Meaningful involvement of different stakeholders during the risk assessment process:
https://ecnl.org/sites/default/files/2023-03/Final%20Version%20FME%20with%20Copyright%20%282%29.pdf
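
The automated deletion period mentioned in the reasoning for the storage risk can be pictured with a small sketch. This is only an illustration under stated assumptions: the function `purge_expired`, the in-memory dictionary of storage timestamps and the record identifiers are hypothetical, and the 24-hour period is the example value from the table, not a legal requirement. A real system would also need to delete the underlying files and any backups.

```python
from datetime import datetime, timedelta, timezone

# Assumed retention period, taken from the 24-hour example in the table.
RETENTION_PERIOD = timedelta(hours=24)


def purge_expired(stored_at_by_id, now, retention=RETENTION_PERIOD):
    """Return the records still within retention and the ids that must be deleted."""
    kept = {}
    deleted_ids = []
    for record_id, stored_at in stored_at_by_id.items():
        if now - stored_at >= retention:
            deleted_ids.append(record_id)  # retention period has elapsed
        else:
            kept[record_id] = stored_at
    return kept, sorted(deleted_ids)


# Hypothetical store: OCR inputs and outputs with the time they were saved.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
store = {
    "input-001": now - timedelta(hours=30),
    "output-001": now - timedelta(hours=25),
    "input-002": now - timedelta(hours=2),
}
kept, deleted_ids = purge_expired(store, now)
print("deleted:", deleted_ids)
print("kept:", list(kept))
```

Running such a check on a schedule, for both input and output data, is one way to apply a retention rule consistently whether the system is hosted by a vendor or on-premises.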

After the likelihood assessment, we are going to assess what is the impact of the identified risks on
the data subjects, individuals and society. Based on that impact/severity assessment, we will assign
one of the 4 severity classification levels: Very significant, Significant, Limited, Very limited.


Data protection and privacy risk: Insufficient protection of personal data that eventually can be the cause of a data breach
Risk description: Safeguards for the protection of personal data are not implemented or are insufficient
Likelihood: Low
Severity: Very significant
Reasoning: The documents contain very sensitive information, and a data breach could cause significant harm to the data subjects.

Data protection and privacy risk: Possible adverse impact on data subjects that could negatively impact fundamental rights
Risk description: The output of the system could have an adverse impact on the individual if erroneous data are used for important decisions
Likelihood: Low
Severity: Very significant
Reasoning: Using the system for other purposes beyond archiving could have an adverse impact on data subjects if the output is inaccurate. This could be the case if we use the system to search, analyze, and retrieve information.

Data protection and privacy risk: Lack of compliance with GDPR by not granting data subjects their right to data rectification and erasure
Risk description: Data subjects’ requests to rectify or to erase personal data cannot be completed
Likelihood: Low
Severity: Significant
Reasoning: Not being able to rectify incorrect or out-of-date information could have a significant impact on the data subject due to the nature of the data being digitized: criminal history, medical and financial information.

Data protection and privacy risk: Unlawful repurpose of personal data
Risk description: Personal data extracted is used for a different purpose
Likelihood: Low. *This risk will be unlikely in an on-premises solution.
Severity: Very significant
Reasoning: This could have a big impact on the data subjects if, for instance, the vendor keeps a copy of the input and/or the output data and uses it afterwards for non-agreed purposes, especially due to the nature of the personal data contained in the documents.

Data protection and privacy risk: Unlawful unlimited storage of personal data
Risk description: Input data and/or data extracted from images are being stored longer than necessary
Likelihood: Low
Severity: Significant
Reasoning: An unlimited or unlawful storage of personal data would worsen any data breach affecting stored data. And although properly protecting the data while they are stored would limit the harm caused to the data subject, it would still be an infringement of the GDPR.

Data protection and privacy risk: Unlawful transfer of personal data
Risk description: Data are being processed in countries without an adequate level of protection
Likelihood: Low
Severity: Significant
Reasoning: Transferring the data to a country that doesn’t offer enough safeguards could bring significant risks to the data subjects.


4. Data protection and privacy risk treatment


Risk treatment criteria, i.e., mitigate, transfer, avoid or accept a risk.

The assessments of likelihood and severity will offer us the basis to obtain the risk level classification.
Based on the four level classification used for likelihood and severity, we can use a matrix like the
following to obtain the resulting final risk level classification: Very High, High, Medium, Low.

Likelihood (rows) / Severity (columns) | Very limited | Limited | Significant | Very significant
Very High | Medium | High | Very high | Very high
High | Low | High | Very high | Very high
Low | Low | Medium | High | Very high
Unlikely | Low | Low | Medium | Very high

Based on this matrix we can classify the risks identified in our use case as follows:

Data protection and privacy risk: Insufficient protection of personal data that eventually can be the cause of a data breach
Risk description: Safeguards for the protection of personal data are not implemented or are insufficient
Likelihood: Low
Severity: Very significant
Risk level: Very High

Data protection and privacy risk: Possible adverse impact on data subjects that could negatively impact fundamental rights
Risk description: The output of the system could have an adverse impact on the individual if erroneous data are used for important decisions.
Likelihood: Low
Severity: Very significant
Risk level: Very High

Data protection and privacy risk: Lack of compliance with GDPR by not granting data subjects their right to data rectification and erasure
Risk description: Data subjects’ requests to rectify or to erase personal data cannot be completed
Likelihood: Low
Severity: Significant
Risk level: High

Data protection and privacy risk: Unlawful repurpose of personal data
Risk description: Personal data extracted is used for a different purpose
Likelihood: Low. *This risk will be unlikely in an on-premises solution.
Severity: Very significant
Risk level: Very High

Data protection and privacy risk: Unlawful unlimited storage of personal data
Risk description: Input data and/or data extracted from images are being stored longer than necessary
Likelihood: Low
Severity: Significant
Risk level: High

Data protection and privacy risk: Unlawful transfer of personal data
Risk description: Data are being processed in countries without an adequate level of protection
Likelihood: Low
Severity: Significant
Risk level: High

We have identified three risks with a very high level, and three with a high level. Best practices in risk
management suggest that the mitigation of very high and high level risks should be prioritized. 35 The
next step involves the implementation of a risk treatment plan.
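
The classification step can be reproduced with a short sketch that encodes the likelihood/severity matrix as a lookup table and applies it to the six risks of the use case. The names `RISK_MATRIX`, `risk_level` and `USE_CASE_RISKS` and the shortened risk labels are illustrative assumptions. The cell values and the likelihood and severity ratings are copied from the matrix and tables in this section.

```python
from collections import Counter

# Column order of the matrix, from least to most severe.
SEVERITY_LEVELS = ["Very limited", "Limited", "Significant", "Very significant"]

# One row per likelihood level; cells follow the SEVERITY_LEVELS order.
RISK_MATRIX = {
    "Very high": ["Medium", "High", "Very high", "Very high"],
    "High": ["Low", "High", "Very high", "Very high"],
    "Low": ["Low", "Medium", "High", "Very high"],
    "Unlikely": ["Low", "Low", "Medium", "Very high"],
}


def risk_level(likelihood, severity):
    """Look up the final risk level for a likelihood/severity pair."""
    return RISK_MATRIX[likelihood][SEVERITY_LEVELS.index(severity)]


# Likelihood and severity as assessed for the legal document use case.
USE_CASE_RISKS = [
    ("Insufficient protection of personal data", "Low", "Very significant"),
    ("Possible adverse impact on data subjects", "Low", "Very significant"),
    ("Rectification and erasure not granted", "Low", "Significant"),
    ("Unlawful repurpose of personal data", "Low", "Very significant"),
    ("Unlawful unlimited storage of personal data", "Low", "Significant"),
    ("Unlawful transfer of personal data", "Low", "Significant"),
]

levels = {name: risk_level(lik, sev) for name, lik, sev in USE_CASE_RISKS}
counts = Counter(levels.values())
for name, level in levels.items():
    print(f"{level:<10} {name}")
```

A lookup table is used here rather than a numeric product because the matrix is not a simple multiplication: any risk with very significant severity is classified as very high, even when its likelihood is rated unlikely.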

35 https://www.pmi.org/learning/library/high-risk-critical-path-projects-7675


Risk treatment involves developing options for mitigating the risks and preparing and implementing
action plans. The appropriate treatment option should be chosen on a contextual basis and
considering a feasibility analysis 36 like the following:

o Evaluate the type of risk and the available mitigation measures that can be implemented.
o Compare the potential benefits gained from implementing the mitigation against the costs
and efforts involved.
o Assess the impact on the purpose that is being pursued by implementing the OCR system.
o Evaluate what could be the reasonable expectations of individuals.
o Assess the impact mitigation measures could have on transparency and fairness of the
processing.

An analysis of these criteria is essential to risk mitigation and risk management planning and helps in
determining whether the risk mitigation is justifiable.

The most common risk treatment criteria are: Mitigate, Transfer, Avoid and Accept.
For each risk one of the criteria options will be selected:
- Mitigate – Identify ways to reduce the likelihood or the severity of the risk.
- Transfer – Make another party responsible for the risk (buy insurance, outsourcing, etc.).
- Avoid – Eliminate the risk by eliminating the cause.
- Accept – Nothing will be done.

Deciding whether a risk can be mitigated involves assessing the nature of the risk, understanding its
potential impact, and evaluating potential mitigation measures such as implementing controls,
adopting best practices, modifying processes, and using tools that can help reduce the likelihood or
severity of the risk.
Not all risks can be fully mitigated. Some risks may be inherent and cannot be entirely avoided. In such
cases, the goal is to reduce the risk to an acceptable level or to put in place measures that help manage
the severity of the risk effectively.

Presentation of mitigation measure examples/risk treatment options including an assessment of their
practical feasibility and a definition of the criteria to define the level of mitigation obtained.

In our use case we have identified several very high and high level risks. After going through the
feasibility analysis and the treatment criteria, we have decided that we cannot transfer the risks to
any third party, that we cannot avoid all the risks, and that simply accepting the risks is not an
option for us. As long as there are measures that we can implement to mitigate the risks and reach
acceptable conditions to go on with the implementation, we choose the treatment option of risk
mitigation.

We have identified the following risk mitigation measures:

36 “Risk, High Risk, Risk Assessments and Data Protection Impact Assessments under the GDPR”,
CIPL GDPR Interpretation and Implementation Project, 2016


Data protection and privacy risk: Insufficient protection of personal data, which can eventually be
the cause of a data breach: safeguards for the protection of personal data are not implemented or are
insufficient.
Risk level: Very High
Risk mitigation measures:
The third-party vendor chosen must have implemented security measures such as secure transmissions,
strong access control measures, and data encryption at rest, as well as sufficient privacy design
strategies 37 to protect the data. We will ask the vendor for its certifications 38 and for the
results of a pentest 39.
As controller we can also protect the specific sensitive data in the documents by applying
pseudonymization or anonymization techniques after the data extraction. Depending on the different
needs, we could implement default anonymization or reversible data masking techniques, for instance
allowing certain people access to the unmasked data. If the OCR SaaS solution does not offer the
possibility to implement these techniques 40, we could decide to look for a vendor that can offer
them or implement them ourselves.
Feasibility assessment:
1. Cost of implementation: The implementation of pseudonymization or anonymization techniques after
the data extraction would imply additional cost.
2. Impact on purpose of digitization and archiving: No
3. Impact on expectations of individuals: No
4. Impact on transparency and fairness of the processing: No
New risk level after mitigation: Low

Data protection and privacy risk: Possible adverse impact on data subjects that could negatively
impact fundamental rights: the output of the system could have an adverse impact on the individual if
erroneous data are used for important decisions.
Risk level: Very High
Risk mitigation measures:
1. In this specific use case, digitized documents are going to be archived and not used for any other
processing, which would already mitigate this risk.
2. In cases where search and analysis of data is required, we would need to make sure that the OCR
system we use guarantees a high accuracy rate. This is usually between 98-99%. 41 This accuracy
should be achieved not only at page level but also at character and word level, which is often
challenging. It is also important to follow best practices 42 when using OCR systems: making sure
that our original documents have a good quality, considering things such as resolution, brightness,
straightness, and discoloration before we scan text. Use of special fonts and low contrast could also
affect accuracy.
Another important aspect is that currently there are no systems offering 100% accuracy, and the only
way to achieve that and avoid any error is by doing a human review and correction of the output.
Feasibility assessment:
1. Cost of implementation: The highest cost is the effort of doing the human review when needed and
providing the human resources for it.
2. Impact on purpose of digitization and archiving: No
3. Impact on expectations of individuals: No for data subjects, but it has an impact on the employees
who would be in charge of the human review.
4. Impact on transparency and fairness of the processing: No, if it is properly implemented and
information about how the accuracy of the system works is provided to users and eventually to data
subjects.
New risk level after mitigation: 1. Low, 2. Medium

37 “Privacy Design Strategies” Jaap-Henk Hoepman, 2022
38 Security certifications such as ISO27001, and SOC2
39 A penetration test, colloquially known as a pentest or ethical hacking, is an authorized simulated
cyberattack on a computer system, performed to evaluate the security of the system.
40 https://anonimiseren-bnas.nl/biqe-anonymization-2/
41 https://www.docsumo.com/blog/ocr-accuracy
42 https://guides.library.illinois.edu/OCR/bestpractices


Data protection and privacy risk: Unlawful repurpose of personal data: personal data extracted is
used for a different purpose.
Risk level: Very High
Risk mitigation measures:
1. One of the best mitigation measures would be using a SaaS solution that offers the option to keep
the data on premises. This is the case if the data processing takes place on location (where the OCR
machine is located, for instance), the input is automatically deleted, and the output data is only
stored at the user location.
2. If the data extraction takes place at the vendor’s location, then a minimum of security measures
such as access control, audit trail and encryption, together with proper data protection agreements,
need to be in place.
Feasibility assessment:
1. Cost of implementation: If the on-premise option is offered by the vendor this could have an
additional cost. It could also imply that we need to make resources available for taking care of the
on-premise solution.
2. Impact on purpose of digitization and archiving: No
3. Impact on expectations of individuals: No
4. Impact on transparency and fairness of the processing: No
New risk level after mitigation: 1. Low, 2. Medium

Data protection and privacy risk: Lack of compliance with GDPR by not granting data subjects their
right to data rectification and erasure: data subjects’ requests to rectify or to erase personal data
cannot be completed.
Risk level: High
Risk mitigation measures:
We could implement an OCR system that offers an editable output format. This will allow us to search
and edit text in the output, which is not possible in other formats such as search-only output
formats. The editable output format will allow us to look up the data subject’s information and
update it or delete it.
Feasibility assessment:
1. Cost of implementation: If the option is offered by the vendor this could have an additional cost.
2. Impact on purpose of digitization and archiving: No
3. Impact on expectations of individuals: No
4. Impact on transparency and fairness of the processing: Yes, in a positive way
New risk level after mitigation: Low

Data protection and privacy risk: Unlawful transfer of personal data: data are being processed in
countries without an adequate level of protection.
Risk level: High
Risk mitigation measures:
We can implement an OCR system from a third-party provider that is located in a country offering an
adequate level of protection.
Feasibility assessment:
1. Cost of implementation: This measure should in principle not have any additional cost, but it
depends on the availability of vendors.
2. Impact on purpose of digitization and archiving: No
3. Impact on expectations of individuals: No
4. Impact on transparency and fairness of the processing: No
New risk level after mitigation: Low

Data protection and privacy risk: Unlawful unlimited storage of personal data: input and output data
are being stored longer than necessary.
Risk level: High
Risk mitigation measures:
We could try to implement an OCR system in which deletion of data can be configured so that input and
output data are deleted from the system immediately after the data extraction or at a scheduled
moment (for most vendors this is a period of 24 to 48 hours).
We could also implement a retention period for the output data that we want to archive and that is
already on our own premises.
If the vendor is storing the input and/or output data, we should negotiate contractual agreements
about the retention period of the data.
Feasibility assessment:
1. Cost of implementation: If the option is offered by the vendor this could have an additional cost.
2. Impact on purpose of digitization and archiving: No
3. Impact on expectations of individuals: No
4. Impact on transparency and fairness of the processing: No
New risk level after mitigation: Low
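The reversible data masking mentioned for the first risk in the table can be sketched as follows. This is only an illustration under stated assumptions, not a vetted pseudonymization scheme: the identifier pattern, the `MaskingVault` class and the `authorized` flag are hypothetical names introduced here, and a real deployment would need secure storage for the token mapping and a proper access control check.

```python
import re
import secrets

# Assumption for illustration: identifiers look like 123-45-6789.
ID_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class MaskingVault:
    """Replaces matched identifiers with random tokens and keeps the mapping."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def mask(self, text):
        """Return text with every matched identifier replaced by a token."""
        def _replace(match):
            value = match.group(0)
            token = self._value_to_token.get(value)
            if token is None:
                # Random token, so the original value cannot be derived from it.
                token = f"[ID-{secrets.token_hex(4)}]"
                self._value_to_token[value] = token
                self._token_to_value[token] = value
            return token

        return ID_PATTERN.sub(_replace, text)

    def unmask(self, text, authorized):
        """Restore the original values, only for callers flagged as authorized."""
        if not authorized:
            raise PermissionError("caller may not see unmasked data")
        for token, value in self._token_to_value.items():
            text = text.replace(token, value)
        return text


vault = MaskingVault()
ocr_output = "Patient 123-45-6789 was seen on 12 May."
masked = vault.mask(ocr_output)
restored = vault.unmask(masked, authorized=True)
```

Keeping the mapping separate from the archived output is what makes the masking reversible for selected people only; deleting the mapping would turn it into a one-way replacement.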

Residual risk acceptance


After the feasibility assessment has been done and the mitigation measures have been identified
and implemented, we should assess again the likelihood and severity of each risk to obtain a new
risk classification level and in this way assess if there is any remaining or residual risk.

In our use case, after the assessment, most of the risks have been reduced to the lowest risk level,
‘Low’. Two risks could still be classified as ‘Medium’, depending on which of the mitigation measures
in the example is adopted.

We calculate the residual risk by evaluating the likelihood and severity of the risks that still exist
despite the implemented mitigation measures. This residual risk represents the level of risk that
remains after taking mitigation actions.
Once the residual risk has been identified, we need to decide whether it is within acceptable levels
for our organization. If it is, we can decide to accept it. If it is not, we would need to consider
further mitigation strategies.
Some organizations establish criteria for the acceptability of residual risks based on elements such as
social norms, benefits, harms, similar use cases, etc. 43
Organizations must be able to justify their risk mitigation and acceptance decisions as part of their
accountability obligations, which also fall under the GDPR principle of accountability (Article 5.2,
Recital 74).
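The acceptance decision described above can be sketched as a simple comparison of each residual level against an acceptance threshold. The residual levels below are taken from the use-case table, using ‘Medium’ where the table lists both ‘Low’ and ‘Medium’. The short risk names, the function name and the choice of ‘Low’ as the acceptable level are assumptions made for this illustration; each organization defines its own criteria.

```python
# Risk levels ordered from least to most severe, as used in this document.
LEVELS = ["Low", "Medium", "High", "Very High"]


def needs_further_mitigation(residual, acceptable="Low"):
    """Return the names of risks whose residual level exceeds the acceptable one."""
    limit = LEVELS.index(acceptable)
    return sorted(
        name for name, level in residual.items() if LEVELS.index(level) > limit
    )


# Worst-case residual levels from the use-case table (shortened risk names).
residual = {
    "Insufficient protection of personal data": "Low",
    "Adverse impact on data subjects": "Medium",
    "Unlawful repurpose of personal data": "Medium",
    "Rectification and erasure not possible": "Low",
    "Unlawful transfer of personal data": "Low",
    "Unlawful unlimited storage of personal data": "Low",
}

to_review = needs_further_mitigation(residual)
```

With these inputs, `to_review` contains the two risks that may remain at ‘Medium’; raising the threshold to ‘Medium’ would leave the list empty, which is the kind of decision the organization has to be able to justify.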

Example of general mitigation measures related to risks of OCR systems:


Choosing appropriate mitigation measures should be done on a case-by-case basis. We are going to
examine some of the possible mitigation measures that could be implemented to mitigate privacy and
data protection risks specific to data extraction technologies. These measures are general and not
related to any specific use case.

Data protection and privacy risk: Insufficient protection of personal data, which can eventually be
the cause of a data breach
Mitigation measures examples:
As user, procurement entity and developer, it is important to verify 44 that:
o APIs are securely implemented;
o transmission of data is protected with adequate encryption protocols;
o data at rest is encrypted;
o an adequate access control mechanism is implemented;
o measures are implemented for protection against and identification of insider threats;
o measures are implemented to mitigate supply chain attacks that could give access to the training
data and/or the data storage and encryption keys;
o measures are implemented to prevent risks associated with the use of deep learning, such as
adversarial reprogramming of deep neural networks 45, membership inference 46, inversion 47 and
poisoning attacks 48.
Access and change logs should also be established to document access and changes to digitized
records.

Data protection and privacy risk: Possible adverse impact on data subjects that could negatively
impact fundamental rights
Mitigation measures examples:
As user, implement OCR solutions that offer a minimum of 98-99% accuracy. Often the systems offer the
results of this metric after every data extraction. It is important to monitor the values and make the
necessary adjustments and corrections to the results. Make sure the system recognizes the different
conditions applicable to the input data.
The quality of input data is important, both for users of OCR systems and for developers, who need
quality training data to train their models. As developer there are techniques and tools 49 you can
use to compensate for low-quality input data and improve the accuracy levels, such as binarization,
deskewing 50, rotation, increasing or reducing brightness, scaling the image, or removing specific
objects.
Some of these techniques are also available in the configuration settings of third-party OCR
solutions that are available as SaaS 51 and on-premises 52 solutions.

Data protection and privacy risk: Possible adverse impact on data subjects and lack of compliance with
the GDPR requirement of providing human intervention for processing that can have a legal or important
effect on the data subject
Mitigation measures examples:
Users could implement a human review process to verify the correctness of the personal data,
especially if it is sensitive, and a process to approve high risk decisions after human verification
has been done.
It is also important that the system provides an overview of the accuracy and the confidence levels
achieved after the data extraction, and a dashboard or other type of interface for manual human review
and correction.
In certain use cases it might be necessary to implement a redress mechanism for data subjects.

Data protection and privacy risk: Lack of compliance with GDPR by not granting data subjects their
right to data rectification and erasure
Mitigation measures examples:
As user, developer and procurement entity, implement searchable and editable output format functions
to identify the personal data in the data extracted so that it is possible to respond to data erasure
and rectification requests.

Data protection and privacy risk: Unlawful unlimited storage of personal data
Mitigation measures examples:
As user and procurement entity, make agreements with the third-party supplier about how long the input
data and output data should be stored. This can be part of the service contract, product
documentation 53 or data processing agreement.
If data are being stored on your premises, establish retention rules and/or a mechanism for the
deletion of data.

Data protection and privacy risk: Breach of the data minimization principle
Mitigation measures examples:
For users and developers, one possible way to mitigate this risk is by providing documents to the OCR
model where personal data has been replaced by synthetic data.
As user, it is also important to compare the different OCR solutions available on the market to
understand which systems require a smaller volume of data to train the models and to improve the
accuracy levels.

Data protection and privacy risk: Unlawful transfer of personal data
Mitigation measures examples:
As user and procurement entity, verify with the vendor where the data processing is taking place.
Carry out the necessary due diligence on safeguards and, when necessary, perform a Data Transfer
Impact Assessment. Make the necessary contractual agreements. Consider this risk when making a
selection among different vendors.

43 https://www.sciencedirect.com/topics/engineering/residual-risk
44 This could be done by performing a pentest and/or requesting pentest results from the vendor.
45 Elsayed et al, “Adversarial reprogramming of neural networks”, 2018
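The human review measure listed above can be sketched as a routing step that sends low-confidence or sensitive fields to a reviewer. This is a minimal sketch under assumptions: `ExtractedField`, `route_for_review` and the 0.98 threshold are hypothetical and chosen for illustration, and an engine's confidence score is not the same thing as measured accuracy, so the threshold would need to be calibrated against reviewed samples.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to be in [0, 1], as reported by the OCR engine
    sensitive: bool = False


def route_for_review(fields, threshold=0.98):
    """Split fields into (auto_accepted, needs_human_review)."""
    accepted, review = [], []
    for field in fields:
        # Sensitive data is always reviewed; other data only below the threshold.
        if field.sensitive or field.confidence < threshold:
            review.append(field)
        else:
            accepted.append(field)
    return accepted, review


fields = [
    ExtractedField("name", "Jane Doe", 0.995),
    ExtractedField("diagnosis", "asthma", 0.999, sensitive=True),
    ExtractedField("date", "2O23-01-01", 0.71),
]
accepted, review = route_for_review(fields)
```

Here only the `name` field is accepted automatically; the sensitive field and the low-confidence date go to the review queue, where a person can correct the output before it is used.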

Once risk mitigation measures have been implemented, it is crucial to continuously monitor their
effectiveness. Implementing methodologies like threat modeling for the identification of risks,
maintaining a risk register and assigning risk owners are effective strategies for regularly reviewing
and reassessing the risk landscape. This ensures that the implemented risk mitigation measures
remain relevant and effective in preventing data protection and privacy risks that could adversely
impact individuals and organizations.

46 Shokri et al, “Membership Inference Attacks Against Machine Learning Models”, 2017
47 Zhang et al, “Generative Model-Inversion Attacks Against Deep Neural Networks”, 2020
48 Junfeng Guo, Cong Liu, “Practical Poisoning Attacks on Neural Networks”, 2020
49 See next section under ‘Resources about how to improve OCR accuracy applying different tools
and techniques’
50 Deskewing is the process of straightening an image that has been scanned or written crookedly. It
is a process whereby skew is removed by rotating an image by the same amount as its skew but in
the opposite direction.
51 https://help.abbyy.com/en-us/flexicapture/12/distributed_administrator/template_properties/
52 https://selectec.com/on-premise-ocr/
53 This document from Tencent Cloud indicates, on page 6, a retention policy for the images
uploaded (input data) and the returned results (output): data is deleted upon completion of the
processing https://main.qcloudimg.com/raw/document/intl/product/pdf/tencent-cloud_1005_50443_en.pdf


Reference to specific technologies, tools, methodologies, processes or strategies.
Unless standardised and freely and easily accessible, explanation on how these technologies, tools,
methodologies and processes work.

Methodologies for measuring accuracy in OCR:


When evaluating the accuracy and quality of OCR results, there are various methods and metrics to
consider. Character error rate (CER) and word error rate (WER) are quantitative metrics that measure
the percentage of characters or words that are incorrectly recognized by the OCR system. Layout error
rate (LER) is a qualitative metric that measures the degree of deviation between the OCR output and
the original image in terms of layout, structure, and formatting.
To measure the extent of errors 54 between two text sequences we can use the Levenshtein distance.
This is the minimum number of single-character (or single-word) edits (i.e., insertions, deletions, or
substitutions) required to change one word (or sentence) into another.
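The Levenshtein distance described above can be computed with a standard dynamic programming table. The following is a minimal sketch, not taken from any of the tools referenced in this document; the function name `edit_distance` is an assumption. Because it works on any sequence, the same function covers the character level (strings) and the word level (lists of words).

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j items of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion of ref_item
                current[j - 1] + 1,      # insertion of hyp_item
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


# Character level: two substitutions and one insertion.
char_edits = edit_distance("kitten", "sitting")

# Word level: one substituted word.
word_edits = edit_distance(
    "the quick brown fox".split(),
    "the quick brown box".split(),
)
```

Only two rows of the table are kept in memory at a time, which is enough when just the distance is needed and not the list of individual edits.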

Character Error Rate (CER)


CER calculation is based on the concept of Levenshtein distance, where we count the minimum
number of character-level operations required to transform the reference text (aka ground truth)
into the OCR output. It is represented with this formula:
CER=(S+D+I)/N
where:
S = Number of Substitutions
D = Number of Deletions
I = Number of Insertions
N = Number of characters in reference text
Multiplied by 100, the output of this equation represents the percentage of characters in the
reference text that were incorrectly predicted in the OCR output. The lower the CER value (with 0
being a perfect score), the better the performance of the OCR model. CER is relevant for the
extraction of particular sequences (e.g., social security number, phone number, etc.).
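As a worked example of the formula, assume a made-up reference string 123-45-6789 (N = 11 characters) and an OCR output of 123-4S-678. Turning the reference into the output takes one substitution (5 read as S) and one deletion (the final 9), with no insertions:

```latex
\[
\mathrm{CER} = \frac{S + D + I}{N} = \frac{1 + 1 + 0}{11} = \frac{2}{11} \approx 0.182 \quad (18.2\%)
\]
```

Two wrong characters out of eleven already give a CER above 10%, which illustrates why short sequences such as identification numbers are sensitive to single recognition errors.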

There is no universal benchmark for defining a good CER value, as it is highly dependent on the
use case, the different scenarios and their complexity. Some research studies 55 propose that good
OCR accuracy should be a CER of 1-2% (i.e., 98-99% accurate) 56.

Good OCR accuracy: CER 1-2% (i.e. 98–99% accurate)
Average OCR accuracy: CER 2-10%
Poor OCR accuracy: CER >10% (i.e. below 90% accurate)
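These bands can be applied to a measured CER as in the sketch below. The function name is hypothetical, and two boundary choices are assumptions because the bands above do not settle them: a CER below 1% is treated as good, and a CER of exactly 2% or 10% is assigned to the better band.

```python
def classify_cer(cer):
    """Map a CER given as a fraction (0.02 means 2%) to the bands listed above."""
    if cer < 0:
        raise ValueError("CER cannot be negative")
    if cer <= 0.02:
        return "good"      # up to 2%, i.e. at least 98% accurate
    if cer <= 0.10:
        return "average"   # above 2% and up to 10%
    return "poor"          # above 10%, i.e. below 90% accurate


band = classify_cer(2 / 11)
```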

Word Error Rate (WER)


Word Error Rate is relevant for the extraction of paragraphs and sentences of words with meaning
(e.g., pages of books, newspapers). The formula for WER is the same as that of CER, but WER
operates at the word level instead. It is the number of word substitutions, deletions, or
insertions needed to transform the reference sentence into the OCR output, divided by the number of
words in the reference.

54 “Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing”, Nguyen, Adam,
Coustaty, Nguyen, Doucet, 2019
55 https://www.researchgate.net/publication/281583162_Performance_Comparison_of_OCR_Tools
56 https://www.docsumo.com/blog/ocr-accuracy

While CER and WER are handy, they are not bulletproof performance indicators of OCR models.
This is because the quality and condition of the original documents (e.g., handwriting legibility,
image DPI, etc.) play a very important role, and not just the OCR model itself.

Available tools for measuring CER and WER:


CER metric: https://huggingface.co/spaces/evaluate-metric/cer
WER metric: https://huggingface.co/spaces/evaluate-metric/wer
WER: https://www.amberscript.com/en/wer-tool/

OCR quality standard:


The ISO standard ISO/IEC 30116:2016 can be useful to evaluate the quality of the character
recognition and data extraction output. The standardization project also defines test methods to
evaluate OCR document quality.
 ISO/IEC 30116:2016 Information technology — Automatic identification and data capture
techniques — Optical Character Recognition (OCR) quality testing
https://www.iso.org/obp/ui/#iso:std:iso-iec:30116:ed-1:v1:en

Guidance:
 Digitalization Guide for Records with Long-term Retention at NYC Agencies
https://www.nyc.gov/assets/records/pdf/Digitization%20Guide%20for%20Records%20with%20Long%20Term%20Retention%20at%20NYC%20Agencies%2020161031.pdf
 On-premise or Cloud OCR: A guide to help you decide:
https://www.klippa.com/en/blog/information/on-premise-ocr/

Resources about how to improve OCR accuracy applying different tools and
techniques:
 Improve OCR accuracy using advanced pre-processing techniques
https://www.nitorinfotech.com/blog/improve-ocr-accuracy-using-advanced-preprocessing-techniques/
 Tesseract: Improving the quality of the output
https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
 15 tools for document Deskewing and Dewarping
https://safjan.com/tools-for-doc-deskewing-and-dewarping/
 Improve the quality of your OCR information extraction
https://aicha-fatrah.medium.com/improve-the-quality-of-your-ocr-information-extraction-ebc93d905ac4

Privacy preserving OCR techniques and tools:


 Confidential Optical Character Recognition Service with Cape
https://capeprivacy.com/blog/confidential-Optical-character-recognition-service-with-cape/


 Blur out Text in Images Using OCR in Next.js
https://cloudinary.com/blog/guest_post/blur-out-text-in-images-using-ocr-in-next-js

Methodologies and tools for the identification of data protection and privacy
risks:
 Privacy Library of Threats (PLOT4ai) is a threat modeling methodology for the identification of
risks in AI systems. It also contains a library with more than 80 risks specific to AI systems:
https://plot4.ai/
 MITRE ATLAS™ (Adversarial Threat Landscape for Artificial-Intelligence Systems), is a
knowledge base of adversary tactics, techniques, and case studies for machine learning (ML)
systems: https://atlas.mitre.org/
 Assessment List for Trustworthy Artificial Intelligence (ALTAI) is a checklist that guides
developers and deployers of AI systems in implementing trustworthy AI principles:
https://digital-strategy.ec.europa.eu/en/library/assessment-list-trustworthy-artificial-intelligence-altai-self-assessment
