NTCC Report 8th Sem
Submitted to
Amity University, Ranchi (Jharkhand)
By
Md Gulam Hassnain
A35705219006
SEM-8th (2019-2023)
Ranchi
2019- 2023
DECLARATION
I, MD GULAM HASSNAIN, a student of B.TECH (CSE), hereby declare that the project
titled “Removal of Duplicate Data using AI”, which is submitted by me to the
Department of Computer Science and Technology, Amity School of Engineering
and Technology, Amity University, Ranchi, Jharkhand, in partial fulfillment of the
requirements for the award of the degree of Bachelor of Technology, has not
previously formed the basis for the award of any degree or other similar title
or recognition. The author attests that permission has been obtained for the
use of any copyrighted material appearing in the dissertation/project report
other than brief excerpts requiring only proper acknowledgement in scholarly
writing, and all such use is acknowledged.
Yours sincerely
A35705219006
CERTIFICATE
This is to certify that Mr. Md Gulam Hassnain, a student of B.TECH (CSE),
Ranchi, has worked under the able guidance and supervision of Dr. Amarnath
Singh, Faculty Guide.
This project report meets the requisite standard for partial fulfillment of the
undergraduate degree of Bachelor of Technology and, to the best of my knowledge,
its contents are based on original research.
Signature
(Faculty Guide)
ACKNOWLEDGEMENT
I express my sincere gratitude to my faculty guide, Dr. Kanika Thakur, for her able
guidance, continuous support, and cooperation throughout my research work,
without which the present work would not have been possible. My endeavor
stands incomplete without dedicating my gratitude to her; she has contributed a
lot towards the successful completion of my research work.
I would also like to express my gratitude to my family and friends for their unending
support and tireless efforts that kept me motivated throughout the completion of
this research.
Yours sincerely
B. TECH (CSE)
2019-2023
TABLE OF CONTENTS
1. ABSTRACT
2. METHODOLOGY
3. INTRODUCTION
4. DATA DUPLICATION
5. IMPORTANCE OF REMOVING DUPLICATE DATA
6. DATA DUPLICATION EXPLAINED
7. BENEFITS OF DATA DUPLICATION REMOVAL
8. REAL-LIFE EXAMPLE
9. DATA DEDUPLICATION WITH AI
10. AI-POWERED BENEFITS
11. AI ALGORITHMS FOR DUPLICATION REMOVAL
12. ROLE OF AI IN IMPROVING DATA DEDUPLICATION
13. STAGES OF DATA CLEANING
14. PROBLEMATIC ANALYSIS OF REMOVING DATA
15. LIMITATIONS AND CONSIDERATIONS
16. HIGHLIGHT FEATURES
17. SYSTEM REQUIREMENTS
18. IMPLEMENTATION
19. RESULT
20. CONCLUSION
21. REFERENCES
ABSTRACT
Getting rid of duplicate data is a crucial step in data management since it helps to
protect the data's accuracy, consistency, and integrity. Manually locating and
eliminating duplicate data can be a difficult and time-consuming operation due to
the daily increase in data volume. Here is where artificial intelligence (AI) can be
quite useful. A dataset's duplicate data can be automatically found and eliminated
using AI-powered algorithms. These algorithms often analyze data using machine
learning methods to look for trends that might identify duplicate records. They
can also gain knowledge from human feedback to gradually increase their
accuracy.
For instance, the Python module Pandas offers robust tools for data analysis and
manipulation. Duplicate entries can be eliminated from a Pandas DataFrame
object using the built-in drop_duplicates() function. The function offers more
sophisticated options for duplicate detection and removal through a variety of
arguments, including subset, keep, and inplace.
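For illustration, a minimal sketch of these options on a small invented DataFrame (the name and email columns are assumptions, not data from this project):

    import pandas as pd

    # Invented data with one exact duplicate row
    df = pd.DataFrame({
        "name":  ["Alice", "Bob", "Alice"],
        "email": ["alice@example.com", "bob@example.com", "alice@example.com"],
    })

    # Remove rows duplicated across all columns, keeping the first occurrence
    deduped = df.drop_duplicates()

    # Consider only the email column when deciding what counts as a duplicate,
    # and keep the last occurrence instead of the first
    deduped_by_email = df.drop_duplicates(subset=["email"], keep="last")

    # Modify the original DataFrame directly instead of returning a new one
    df.drop_duplicates(inplace=True)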
METHODOLOGY
Importing data: The dataset must first be imported into Jupyter Notebook, usually
with the help of the Pandas library. This phase could entail connecting to a
database or reading data from a file.
Data cleaning: After that, the dataset is cleaned and preprocessed to guarantee
accuracy and consistency. This could entail changing data types, eliminating
unneeded columns or rows, and filling in or removing missing values.
Duplicate data detection: The cleaned dataset is then examined for duplicate
records. This can be done with Pandas' built-in functions, such as duplicated(),
or with custom logic that compares records on selected key columns.
Duplicate data removal: When duplicate data is found, it can be eliminated from
the dataset. Pandas' built-in functions or specially created scripts that eliminate
duplicate data based on a set of criteria can be used to accomplish this.
Data validation: The dataset is tested to make sure it is correct and consistent
after duplicate data has been removed. This could entail assessing the dataset's
size and organizational structure as well as contrasting it with data from other
sources.
Data analysis: Finally, analysis and modelling can be done using the cleaned and
deduplicated dataset. This could entail employing data exploration tools or
machine learning algorithms to create predictive models.
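A minimal sketch of this workflow in Pandas; the small inline dataset and column names are invented for illustration, and a real notebook would import data with pd.read_csv or a database connection instead:

    import pandas as pd

    # Importing data: an inline frame stands in for a file or database read
    df = pd.DataFrame({
        "name":  ["Alice", "alice ", "Bob", None],
        "email": ["a@example.com", "A@example.com", "b@example.com", "c@example.com"],
    })

    # Data cleaning: standardize text fields and drop rows with no name
    df["name"] = df["name"].str.strip().str.title()
    df["email"] = df["email"].str.lower()
    df = df.dropna(subset=["name"])

    # Duplicate data detection: flag rows whose name and email repeat an earlier row
    print(df.duplicated(subset=["name", "email"]).sum(), "duplicate rows found")

    # Duplicate data removal: keep the first occurrence of each name/email pair
    df = df.drop_duplicates(subset=["name", "email"], keep="first")

    # Data validation: confirm that no duplicates remain before analysis
    assert not df.duplicated(subset=["name", "email"]).any()
    print(df)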
INTRODUCTION
Duplicate data can result in a variety of concerns in the field of data management
and analysis, from decreasing accuracy to performance problems. The manual
detection and elimination of duplicate data can be difficult and time-consuming
due to the daily increase in data volume. In order to solve this issue effectively
and precisely, artificial intelligence (AI) enters the picture.
Jupyter Notebook, a popular web tool that enables users to create and share
documents containing live code, visualizations, and narrative text, is one platform
where AI-powered algorithms can be utilized for duplicate data elimination.
Jupyter Notebook has a wealth of AI-powered libraries that can assist with
duplicate data reduction and is the perfect setting for data analysis and machine
learning activities.
Users of Jupyter Notebook may automatically find and eliminate duplicate data
using AI algorithms, which makes the process quicker and more effective.
Powerful tools and algorithms can automatically find and eliminate duplicate
records, and libraries like Pandas, Numpy, and Scikit-Learn also offer sophisticated
choices for more precise and effective duplication removal.
In Jupyter Notebook, removing duplicate data using AI can help maintain data
integrity, accuracy, and consistency while also enhancing overall data
management and analysis. By automating the onerous and repetitive processes of
duplicate data removal, it enables users to concentrate on the more intricate and
valuable components of data analysis. To ensure the quality and integrity of the
data, it is crucial to validate the results and bear in mind the constraints and
presumptions of the algorithms that were employed.
A pointer to the unique data copy is used to replace redundant data blocks. Data
deduplication and incremental backup, which transfers just the data that has
changed since the last backup, closely resemble each other in this regard.
The identical 1-MB file attachment may appear 100 times in a typical email
system. All 100 instances are saved if the email platform is backed up or archived,
needing 100 MB of storage space. The attachment is only stored once thanks to
data deduplication; subsequent copies are all linked to the original copy. This
example reduces the 100 MB storage demand to 1 MB.
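A minimal sketch of this idea in Python, storing each unique attachment once by its content hash and keeping only references for repeats (the file contents and the SHA-256 hashing scheme are illustrative assumptions):

    import hashlib

    store = {}       # content hash -> the single stored copy
    references = []  # one lightweight reference per message

    def save_attachment(data: bytes) -> None:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in store:
            store[digest] = data   # first time this content is seen: keep the bytes
        references.append(digest)  # every message keeps only a pointer

    attachment = b"x" * 1_000_000  # a 1-MB attachment
    for _ in range(100):           # the same file appears in 100 emails
        save_attachment(attachment)

    print(len(references), "references,", len(store), "stored copy")
    # 100 references but a single stored copy: roughly 1 MB instead of 100 MB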
Importance of removing duplicate data from datasets
Duplicate entries have the potential to contaminate the training data with the
test data, or the other way around. Outliers may undermine the training process
and cause your model to "learn" patterns that do not actually exist, while entries
with missing values will cause models to interpret features incorrectly.
Different types of data deduplication exist. In its most basic version, the method
eliminates identical files at the level of individual files. File-level deduplication and
single instance storage (SIS) are other names for this.
There are two types of block-level deduplication: fixed block boundaries, where the
majority of block-level deduplication takes place, and variable block boundaries,
where data is divided up at varying intervals. The rest of the procedure often
stays the same once the dataset has been divided into a number of small pieces of
data, known as chunks or shards. Each chunk is hashed, and the hash is checked
against a hash table or hash database to see if it has ever been seen before. If the
hash has never been seen before, the new shard is written to storage and the hash
is added to the hash table or database; otherwise, the shard is discarded, and a new
reference to the existing copy is added to the hash table or database.
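A minimal sketch of this block-level procedure, assuming a fixed 4-KB chunk size and SHA-256 hashes (both are illustrative choices, not values from this project):

    import hashlib

    CHUNK_SIZE = 4096   # fixed block boundary
    hash_table = {}     # hash -> stored chunk
    stream_refs = []    # the original stream recorded as chunk references

    def write_stream(data: bytes) -> None:
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in hash_table:
                hash_table[digest] = chunk  # unseen shard: write it to storage
            stream_refs.append(digest)      # always record a reference

    # Two backups of largely identical data end up sharing most of their chunks
    write_stream(b"A" * 20_000 + b"B" * 4096)
    write_stream(b"A" * 20_000 + b"C" * 4096)
    print(len(stream_refs), "chunk references,", len(hash_table), "unique chunks stored")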
Consider how frequently you edit documents to make minor changes. Even if you
simply altered one byte, an incremental backup would still save the entire file.
Every important business asset has a chance to include duplicate data. Up to 80%
of company data in many organizations is duplicated.
With cloud storage, source deduplication performs incredibly well and can
significantly speed up backups. Deduplication speeds up backup and recovery by
lowering the amount of data and network bandwidth that backup operations
require. When deciding whether to employ deduplication, think about if your
company could profit from these advancements.
What is a real-life deduplication example?
The email scenario described above is a common real-life example: a typical email
platform may store the same 1-MB attachment 100 times, but with deduplication only
one copy is kept and the remaining 99 instances become references to it, cutting the
storage requirement from 100 MB to 1 MB.
Why Data Deduplication?
80% of a data scientist's time is spent on data preparation, which was rated as the
least enjoyable aspect of the job by 76% of those surveyed.
For these reasons, it would be wise to invest whatever sum of money is required
to fully automate these procedures. The difficult activities that require connecting
to data sources, creating reliable pipelines, and performing other jobs pique the
interest of data professionals (engineers, scientists, IT teams, etc.) who are in
charge of preparing data. It is detrimental to make data experts perform tiresome
jobs like data preparation since it decreases their morale and diverts them from
more crucial duties.
Deduplication is a crucial step in the data cleansing process that can help to lower
this risk. For analyses to produce precise and timely findings, duplicate data must
be eliminated from databases or from data models (using a deduplication
scrubber or other tool).
The business analytics solutions from Grow can assist with data deduplication and
analysis to produce quick insights for your entire organization.
In order to find duplicate data, machine learning algorithms may analyze datasets
and spot trends. They can gain knowledge from prior data deduplication efforts
and gradually increase their accuracy. Deep learning techniques are very helpful
for complicated datasets because they can use neural networks to spot and get
rid of duplicate data.
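As a simple stand-in for such learned matching, the sketch below scores record similarity with difflib from the Python standard library; a trained model would replace the hand-written scoring rule, and every field value here is invented:

    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        # Ratio between 0 (completely different) and 1 (identical)
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def likely_duplicates(rec1: dict, rec2: dict, threshold: float = 0.9) -> bool:
        # Hand-written rule standing in for a learned model: flag the pair
        # when either the emails or the phone numbers nearly match
        score = max(similarity(rec1["email"], rec2["email"]),
                    similarity(rec1["phone"], rec2["phone"]))
        return score >= threshold

    a = {"name": "John Smith", "email": "john.smith@example.com", "phone": "555-0101"}
    b = {"name": "John Doe",   "email": "john.smith@example.com", "phone": "555 0101"}
    print(likely_duplicates(a, b))  # True: the matching email gives them away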
These techniques can also assist organizations in locating previously undiscovered
data patterns and insights, resulting in better decision-making and better
commercial results.
Let's imagine that a business has a client database that has duplicate records. To
make sure that its customer information is correct and current, the organization
wishes to get rid of duplicate records. How AI, ML, and deep learning can assist
with this work is as follows:
Datasets used as training for AI-based methods: The business must create a
training dataset before using AI-based techniques for data deduplication. To train
the AI model, the dataset should contain examples of duplicate and unique
customer entries.
Consider a sample dataset with the following customer entries:
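A hypothetical version of such a dataset, sketched in Pandas (the names follow the discussion below; every email and phone value is invented):

    import pandas as pd

    customers = pd.DataFrame({
        "name":  ["John Smith", "John Doe", "Sarah Brown", "Sarah Brown"],
        "email": ["john@example.com", "john@example.com",
                  "sarah@example.com", "sarah@example.com"],
        "phone": ["555-0101", "555-0101", "555-0202", "555-0202"],
    })

    # Matching email and phone values suggest these records describe the same people
    print(customers[customers.duplicated(subset=["email", "phone"], keep=False)])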
The dataset can be analyzed by AI to spot duplicate customer entries. Because the
email addresses and phone numbers in this example match, AI may determine that
"John Smith" and "John Doe" are the same individual. Similarly, AI may determine
that the two "Sarah Brown" entries refer to the same person based on how closely
their phone numbers and emails match.
>Parsing
Parsing breaks records down into their individual data elements, such as names,
addresses, and phone numbers, so that the fields of different records can be
identified and compared consistently.
>Data Transformation
The process of data transformation is closely related to data cleansing: data is
first mapped from one format to another, into a common scheme, and then
transformed into the desired format. Prior to mapping, the data is cleaned up by
standardizing and normalizing it.
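A minimal sketch of such standardization, assuming invented name, email, and phone columns:

    import pandas as pd

    df = pd.DataFrame({
        "name":  ["  john SMITH ", "Sarah Brown"],
        "email": ["John.Smith@EXAMPLE.com", "sarah@example.com "],
        "phone": ["(555) 010-1234", "555-010-1234"],
    })

    # Map every record into one common scheme before comparing or deduplicating
    df["name"]  = df["name"].str.strip().str.title()
    df["email"] = df["email"].str.strip().str.lower()
    df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)  # keep digits only

    print(df)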
>Duplicate Elimination
AI plays a crucial role in improving data deduplication since it helps to get beyond
some of the drawbacks of manual approaches for finding and eliminating
duplicates. The following are some ways that AI might enhance data
deduplication:
4. Consistency: Organizations that need to make sure their data is correct and
consistent across various systems and apps may find this consistency useful.
5. Learning and adaptation: AI systems are able to change their methods for
recognizing duplication by learning from new data. This implies that the AI
model can be modified as the data evolves over time to guarantee that it
correctly detects and eliminates duplicates. Companies that deal with
quickly changing data, like healthcare providers or online retailers, might
benefit greatly from this agility.
Duplicate entries are problematic for a variety of reasons. An entry that appears
more than once is given significant weight during training. Models that seem to
do well on frequent entries actually don't. Duplicate entries may destroy the
separation between the train, validation, and test sets when identical items are
not all in the same set. This could lead to erroneous performance forecasts that
let the model down in terms of actual outcomes.
Database duplicates can come from a wide range of causes, including processing
operations that were repeated along the data pipeline. Duplicate information
substantially impairs learning, but the issue can be fixed easily. One possibility is
to mandate that columns be unique whenever possible. An additional choice is
to run a script that will immediately detect and delete duplicate entries. This is
easy to do with Pandas' drop_duplicates() capability, as shown in the sample code
below:
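The snippet here is a minimal, self-contained sketch; the id and value columns are invented for illustration:

    import pandas as pd

    df = pd.DataFrame({
        "id":    [101, 102, 101],
        "value": ["a", "b", "a"],
    })

    # Count how many entries repeat an earlier row, then delete them
    n_dupes = df.duplicated().sum()
    df = df.drop_duplicates()
    print(f"Removed {n_dupes} duplicate entries")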
Since data usually comes from numerous sources, there is a good chance that
a given table or database contains entries that shouldn't be there. In some
cases, it could be required to filter out older entries. In other situations, more
complex data filtering is necessary.
The most important step in ML data cleaning is handling missing data. Missing
data can occur as a result of online forms that were only filled out with
mandatory fields or when tables and forms were updated. In some cases, it makes
sense to substitute the mean or most prevalent value for any missing data. If
more important fields are missing, it may be preferable to discard the entire data
entry.
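A minimal sketch of these two options, assuming invented email and age columns:

    import pandas as pd

    df = pd.DataFrame({
        "email": ["a@example.com", None, "c@example.com"],
        "age":   [34, 29, None],
    })

    # Substitute the mean (or the most prevalent value) for missing numeric data
    df["age"] = df["age"].fillna(df["age"].mean())

    # When a more important field such as the email is missing, discard the entry
    df = df.dropna(subset=["email"])

    print(df)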
Consider the additional expense involved in sending one person five copies of the
identical catalogue. Users must be able to locate duplicate records and stop new
duplicate entries from being added to the CRM in order to help avoid wasteful costs.
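A minimal sketch of that check, using an in-memory list as a stand-in for a CRM (the field names are assumptions):

    crm = [
        {"name": "John Smith", "email": "john@example.com"},
    ]

    def add_customer(record: dict) -> bool:
        # Block the insert if a record with the same email already exists
        if any(existing["email"] == record["email"] for existing in crm):
            return False
        crm.append(record)
        return True

    print(add_customer({"name": "J. Smith", "email": "john@example.com"}))     # False: duplicate
    print(add_customer({"name": "Sarah Brown", "email": "sarah@example.com"})) # True: new entry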
>Difficult Segmentation
Make sure your data is thorough, accurate, and duplicate-free if you intend to use
it to inform decisions on how to best position your company for future
commercial growth. Decisions based on low-quality data are nothing more than
guesses.
>Poor Business Processes
The quantity of customer records will increase as your clientele and business
expand, which will make the data more difficult to maintain and raise the
possibility that it will be lost.
AI algorithms are not error-free and can still make mistakes, despite the fact that
they can considerably increase the effectiveness and accuracy of duplicate data
removal. To guarantee the quality and integrity of the data, human oversight and
validation are crucial. Users should also evaluate the results to make sure they are
acceptable and correct and be aware of the assumptions and limits of the
algorithms being employed.
Highlight Features of Jupyter Notebook
Interactive computing: Jupyter Notebook makes it simple to examine and
experiment with data by allowing users to run and modify code interactively.
Easy visualization: Jupyter Notebook offers built-in support for data visualization
tools, such as Matplotlib and Seaborn, making it easy to generate visualizations of
data; a short example follows this list.
Large ecosystem: There is a sizable and vibrant user and developer community for
Jupyter Notebook, and there are numerous libraries and extensions available to
increase its capability.
Cloud-based: Jupyter Notebook makes it simple to access and share data analysis
projects from anywhere because it can be run locally on a user's computer or in
the cloud.
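As a small illustration of the visualization support mentioned above, a hypothetical before/after row count could be plotted like this (the numbers are invented):

    import matplotlib.pyplot as plt

    # Invented row counts before and after duplicate removal
    counts = {"Before deduplication": 1000, "After deduplication": 870}

    plt.bar(list(counts.keys()), list(counts.values()))
    plt.ylabel("Number of rows")
    plt.title("Effect of removing duplicate data")
    plt.show()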
System Requirement:
VS Code
C, C#, C++, Fortran, Go, Java, JavaScript, Node.js, Python, and Rust are just a
few of the programming languages that may be used with Visual Studio
Code, a source-code editor. It is built on the Electron framework, which is used
to develop desktop applications with Node.js.
Python
Python is one of the most widely used interpreted languages and was selected for
this project due to its straightforward syntax and broad selection of libraries and
modules. Python's syntax enables programmers to express concepts in less code
than may be needed in languages like C++ or Java because it was designed to be
extendable. The main cause of Python's strong demand among programmers is its
sizable, well-known standard library. MIME and HTTP are also supported for
internet-connected applications. Python version 3.6 (64-bit) is used in this work on
Windows 10.
Implementation:
Result:
Conclusion
To ensure the quality and integrity of the data, it is crucial to validate the results
and bear in mind the constraints and presumptions of the algorithms that were
employed. In order to guarantee that the data is correct and consistent, human
oversight and validation are necessary. Overall, the removal of duplicate data
using AI algorithms in Jupyter Notebook can considerably increase data integrity,
correctness, and consistency while also enhancing data administration and
analysis.
References:
> https://www.druva.com/glossary/what-is-deduplication-definition-andrelatedfaqs
> https://www.grow.com/blog/data-deduplication-with-ai
> https://indicodata.ai/blog/should-we-remove-duplicates-ask-slater/
> https://runneredq.com/news/problems-generated-by-having-duplicate-data-in-a-database/
> https://deepchecks.com/what-is-datacleaning/
> https://www.researchgate.net/publication/339561834_An_Effective_Duplicate_Removal_Algorithm_for_Text_Documents