NTCC Report 8th Sem
Submitted to
Amity University, Ranchi (Jharkhand)
By
Md Gulam Hassnain
A35705219006
SEM-8th (2019-2023)
Ranchi
2019- 2023
DECLARATION
I, MD GULAM HASSNAIN, a student of B.TECH (CSE), hereby declare that the project
titled “Removal of Duplicate Data using AI”, which is submitted by me to the
Department of Computer Science and Technology, Amity School of Engineering
and Technology, Amity University, Ranchi, Jharkhand, in partial fulfillment of the
requirements for the award of the degree of Bachelor of Technology, has not
previously formed the basis for the award of any degree or other similar title
or recognition. The author attests that permission has been obtained for the
use of any copyrighted material appearing in the dissertation/project report
other than brief excerpts requiring only proper acknowledgement in scholarly
writing, and all such use is acknowledged.
Yours sincerely
A35705219006
CERTIFICATE
This is to certify that Mr. Md Gulam Hassnain, a student of B.TECH (CSE),
Ranchi, has worked under the able guidance and supervision of Dr. Amarnath
Singh, Faculty Guide.
This project report meets the requisite standard for partial fulfillment of the
undergraduate degree of Bachelor of Technology and, to the best of my knowledge,
its contents are based on original research.
Signature
(Faculty Guide)
ACKNOWLEDGEMENT
I express my sincere gratitude to my faculty guide, Dr. Kanika Thakur, for her able
guidance, continuous support, and cooperation throughout my research work,
without which the present work would not have been possible. My endeavor
stands incomplete without dedicating my gratitude to her; she has contributed a
lot towards the successful completion of my research work.
I would also like to express my gratitude to my family and friends for their unending
support and tireless efforts that kept me motivated throughout the completion of
this research.
Yours sincerely
B. TECH (CSE)
2019-2023
TABLE OF CONTENTS
1. ABSTRACT
2. METHODOLOGY
3. INTRODUCTION
4. DATA DUPLICATION
5. IMPORTANCE OF REMOVING DUPLICATE DATA
6. DATA DUPLICATION EXPLAINED
7. BENEFITS OF DATA DUPLICATION REMOVAL
8. REAL-LIFE EXAMPLE
9. DATA DEDUPLICATION WITH AI
10. AI-POWERED BENEFITS
11. AI ALGORITHMS FOR DUPLICATION REMOVAL
12. ROLE OF AI IN IMPROVING DATA DEDUPLICATION
13. STAGES OF DATA CLEANING
14. PROBLEMATIC ANALYSIS OF REMOVING DATA
15. LIMITATIONS AND CONSIDERATIONS
16. HIGHLIGHT FEATURES
17. SYSTEM REQUIREMENTS
18. IMPLEMENTATION
19. RESULT
20. CONCLUSION
21. REFERENCES
ABSTRACT
Getting rid of duplicate data is a crucial step in data management since it helps to
protect the data's accuracy, consistency, and integrity. Manually locating and
eliminating duplicate data can be a difficult and time-consuming operation due to
the daily increase in data volume. Here is where artificial intelligence (AI) can be
quite useful. A dataset's duplicate data can be automatically found and eliminated
using AI-powered algorithms. These algorithms often analyze data using machine
learning methods to look for trends that might identify duplicate records. They
can also gain knowledge from human feedback to gradually increase their
accuracy.
For instance, the Python module Pandas offers robust tools for data analysis and
manipulation. Duplicate entries can be eliminated from a Pandas DataFrame
object using the built-in drop_duplicates() function. The function offers more
sophisticated options for duplicate detection and removal through a variety of
arguments, including subset, keep, and inplace.
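For illustration, a minimal sketch of these options on a small invented DataFrame (the name and email columns are assumptions, not data from this project):

    import pandas as pd

    # Invented data with one exact duplicate row
    df = pd.DataFrame({
        "name":  ["Alice", "Bob", "Alice"],
        "email": ["alice@example.com", "bob@example.com", "alice@example.com"],
    })

    # Remove rows duplicated across all columns, keeping the first occurrence
    deduped = df.drop_duplicates()

    # Consider only the email column when deciding what counts as a duplicate,
    # and keep the last occurrence instead of the first
    deduped_by_email = df.drop_duplicates(subset=["email"], keep="last")

    # Modify the original DataFrame directly instead of returning a new one
    df.drop_duplicates(inplace=True)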
METHODOLOGY
Importing data: The dataset must first be imported into Jupyter Notebook, usually
with the help of the Pandas library. This phase could entail connecting to a
database or reading data from a file.
Data cleaning: After that, the dataset is cleaned and preprocessed to guarantee
accuracy and consistency. This could entail changing data types, eliminating
unneeded columns or rows, and filling in or removing missing values.
Duplicate data detection: The cleaned dataset is then examined for duplicate
records. This can be done with Pandas' built-in functions, such as duplicated(),
or with custom logic that compares records on selected key columns.
Duplicate data removal: When duplicate data is found, it can be eliminated from
the dataset. Pandas' built-in functions or specially created scripts that eliminate
duplicate data based on a set of criteria can be used to accomplish this.
Data validation: The dataset is tested to make sure it is correct and consistent
after duplicate data has been removed. This could entail assessing the dataset's
size and organizational structure as well as contrasting it with data from other
sources.
Data analysis: Finally, analysis and modelling can be done using the cleaned and
deduplicated dataset. This could entail employing data exploration tools or
machine learning algorithms to create predictive models.
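A minimal sketch of this workflow in Pandas; the small inline dataset and column names are invented for illustration, and a real notebook would import data with pd.read_csv or a database connection instead:

    import pandas as pd

    # Importing data: an inline frame stands in for a file or database read
    df = pd.DataFrame({
        "name":  ["Alice", "alice ", "Bob", None],
        "email": ["a@example.com", "A@example.com", "b@example.com", "c@example.com"],
    })

    # Data cleaning: standardize text fields and drop rows with no name
    df["name"] = df["name"].str.strip().str.title()
    df["email"] = df["email"].str.lower()
    df = df.dropna(subset=["name"])

    # Duplicate data detection: flag rows whose name and email repeat an earlier row
    print(df.duplicated(subset=["name", "email"]).sum(), "duplicate rows found")

    # Duplicate data removal: keep the first occurrence of each name/email pair
    df = df.drop_duplicates(subset=["name", "email"], keep="first")

    # Data validation: confirm that no duplicates remain before analysis
    assert not df.duplicated(subset=["name", "email"]).any()
    print(df)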
INTRODUCTION
Duplicate data can result in a variety of concerns in the field of data management
and analysis, from decreasing accuracy to performance problems. The manual
detection and elimination of duplicate data can be difficult and time-consuming
due to the daily increase in data volume. In order to solve this issue effectively
and precisely, artificial intelligence (AI) enters the picture.
Jupyter Notebook, a popular web tool that enables users to create and share
documents containing live code, visualizations, and narrative text, is one platform
where AI-powered algorithms can be utilized for duplicate data elimination.
Jupyter Notebook has a wealth of AI-powered libraries that can assist with
duplicate data reduction and is the perfect setting for data analysis and machine
learning activities.
Users of Jupyter Notebook may automatically find and eliminate duplicate data
using AI algorithms, which makes the process quicker and more effective.
Powerful tools and algorithms can automatically find and eliminate duplicate
records, and libraries like Pandas, Numpy, and Scikit-Learn also offer sophisticated
choices for more precise and effective duplication removal.
In Jupyter Notebook, removing duplicate data using AI can help maintain data
integrity, accuracy, and consistency while also enhancing overall data
management and analysis. By automating the onerous and repetitive processes of
duplicate data removal, it enables users to concentrate on the more intricate and
valuable components of data analysis. To ensure the quality and integrity of the
data, it is crucial to validate the results and bear in mind the constraints and
presumptions of the algorithms that were employed.
A pointer to the unique data copy is used to replace redundant data blocks. Data
deduplication and incremental backup, which transfers just the data that has
changed since the last backup, closely resemble each other in this regard.
The identical 1-MB file attachment may appear 100 times in a typical email
system. All 100 instances are saved if the email platform is backed up or archived,
needing 100 MB of storage space. The attachment is only stored once thanks to
data deduplication; subsequent copies are all linked to the original copy. This
example reduces the 100 MB storage demand to 1 MB.
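A minimal sketch of this idea in Python, storing each unique attachment once by its content hash and keeping only references for repeats (the file contents and the SHA-256 hashing scheme are illustrative assumptions):

    import hashlib

    store = {}       # content hash -> the single stored copy
    references = []  # one lightweight reference per message

    def save_attachment(data: bytes) -> None:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in store:
            store[digest] = data   # first time this content is seen: keep the bytes
        references.append(digest)  # every message keeps only a pointer

    attachment = b"x" * 1_000_000  # a 1-MB attachment
    for _ in range(100):           # the same file appears in 100 emails
        save_attachment(attachment)

    print(len(references), "references,", len(store), "stored copy")
    # 100 references but a single stored copy: roughly 1 MB instead of 100 MB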
Importance of removing duplicate data from datasets
Duplicate entries have the potential to contaminate the training data with the
test data, or the other way around. Outliers may undermine the training process
and cause your model to "learn" patterns that do not actually exist, while entries
with missing values will cause models to interpret features incorrectly.
Different types of data deduplication exist. In its most basic version, the method
eliminates identical files at the level of individual files. File-level deduplication and
single instance storage (SIS) are other names for this.
There are two types of block-level deduplication: fixed block boundaries, where the
majority of block-level deduplication takes place, and variable block boundaries,
where data is divided up at varying intervals. The rest of the procedure often
stays the same once the dataset has been divided into a number of small pieces of
data, known as chunks or shards. Each chunk is hashed, and the hash is checked
against a hash table or hash database to see if it has ever been seen before. If the
hash has never been seen before, the new shard is written to storage and the hash
is added to the hash table or database; otherwise, the shard is discarded, and a new
reference to the existing copy is added to the hash table or database.
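A minimal sketch of this block-level procedure, assuming a fixed 4-KB chunk size and SHA-256 hashes (both are illustrative choices, not values from this project):

    import hashlib

    CHUNK_SIZE = 4096   # fixed block boundary
    hash_table = {}     # hash -> stored chunk
    stream_refs = []    # the original stream recorded as chunk references

    def write_stream(data: bytes) -> None:
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in hash_table:
                hash_table[digest] = chunk  # unseen shard: write it to storage
            stream_refs.append(digest)      # always record a reference

    # Two backups of largely identical data end up sharing most of their chunks
    write_stream(b"A" * 20_000 + b"B" * 4096)
    write_stream(b"A" * 20_000 + b"C" * 4096)
    print(len(stream_refs), "chunk references,", len(hash_table), "unique chunks stored")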
Consider how frequently you edit documents to make minor changes. Even if you
simply altered one byte, an incremental backup would still save the entire file.
Every important business asset has a chance to include duplicate data. Up to 80%
of company data in many organizations is duplicated.
With cloud storage, source deduplication performs incredibly well and can
significantly speed up backups. Deduplication speeds up backup and recovery by
lowering the amount of data and network bandwidth that backup operations
require. When deciding whether to employ deduplication, think about if your
company could profit from these advancements.
What is a real-life deduplication example?
The email scenario described above is a common real-life example: a typical email
platform may store the same 1-MB attachment 100 times, but with deduplication only
one copy is kept and the remaining 99 instances become references to it, cutting the
storage requirement from 100 MB to 1 MB.
Why Data Deduplication?
80% of a data scientist's time is spent on data preparation, which was rated as the
least enjoyable aspect of the job by 76% of those surveyed.
For these reasons, it would be wise to invest whatever sum of money is required
to fully automate these procedures. The difficult activities that require connecting
to data sources, creating reliable pipelines, and performing other jobs pique the
interest of data professionals (engineers, scientists, IT teams, etc.) who are in
charge of preparing data. It is detrimental to make data experts perform tiresome
jobs like data preparation since it decreases their morale and diverts them from
more crucial duties.
Deduplication is a crucial step in the data cleansing process that can help to lower
this risk. For analyses to produce precise and timely findings, duplicate data must
be eliminated from databases or from data models (using a deduplication
scrubber or other tool).
The business analytics solutions from Grow can assist with data deduplication and
analysis to produce quick insights for your entire organization.
In order to find duplicate data, machine learning algorithms may analyze datasets
and spot trends. They can gain knowledge from prior data deduplication efforts
and gradually increase their accuracy. Deep learning techniques are very helpful
for complicated datasets because they can use neural networks to spot and get
rid of duplicate data.
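As a simple stand-in for such learned matching, the sketch below scores record similarity with difflib from the Python standard library; a trained model would replace the hand-written scoring rule, and every field value here is invented:

    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        # Ratio between 0 (completely different) and 1 (identical)
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def likely_duplicates(rec1: dict, rec2: dict, threshold: float = 0.9) -> bool:
        # Hand-written rule standing in for a learned model: flag the pair
        # when either the emails or the phone numbers nearly match
        score = max(similarity(rec1["email"], rec2["email"]),
                    similarity(rec1["phone"], rec2["phone"]))
        return score >= threshold

    a = {"name": "John Smith", "email": "john.smith@example.com", "phone": "555-0101"}
    b = {"name": "John Doe",   "email": "john.smith@example.com", "phone": "555 0101"}
    print(likely_duplicates(a, b))  # True: the matching email gives them away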
These techniques can also assist organizations in locating previously undiscovered
data patterns and insights, resulting in better decision-making and better
commercial results.
Let's imagine that a business has a client database that has duplicate records. To
make sure that its customer information is correct and current, the organization
wishes to get rid of duplicate records. How AI, ML, and deep learning can assist
with this work is as follows:
Datasets used as training for AI-based methods: The business must create a
training dataset before using AI-based techniques for data deduplication. To train
the AI model, the dataset should contain examples of duplicate and unique
customer entries.
Consider a sample dataset with the following customer entries:
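A hypothetical version of such a dataset, sketched in Pandas (the names follow the discussion below; every email and phone value is invented):

    import pandas as pd

    customers = pd.DataFrame({
        "name":  ["John Smith", "John Doe", "Sarah Brown", "Sarah Brown"],
        "email": ["john@example.com", "john@example.com",
                  "sarah@example.com", "sarah@example.com"],
        "phone": ["555-0101", "555-0101", "555-0202", "555-0202"],
    })

    # Matching email and phone values suggest these records describe the same people
    print(customers[customers.duplicated(subset=["email", "phone"], keep=False)])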
The dataset can be analyzed by AI to spot duplicate customer entries. Because the
email addresses and phone numbers in this example match, AI may determine that
"John Smith" and "John Doe" are the same individual. Similarly, AI may determine
that the two "Sarah Brown" entries refer to the same person based on how closely
their phone numbers and emails match.
>Parsing
Parsing breaks records down into their individual data elements, such as names,
addresses, and phone numbers, so that the fields of different records can be
identified and compared consistently.
>Data Transformation
The process of data transformation is closely related to data cleansing: data is
first mapped from one format to another, into a common scheme, and then
transformed into the desired format. Prior to mapping, the data is cleaned up by
standardizing and normalizing it.
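A minimal sketch of such standardization, assuming invented name, email, and phone columns:

    import pandas as pd

    df = pd.DataFrame({
        "name":  ["  john SMITH ", "Sarah Brown"],
        "email": ["John.Smith@EXAMPLE.com", "sarah@example.com "],
        "phone": ["(555) 010-1234", "555-010-1234"],
    })

    # Map every record into one common scheme before comparing or deduplicating
    df["name"]  = df["name"].str.strip().str.title()
    df["email"] = df["email"].str.strip().str.lower()
    df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)  # keep digits only

    print(df)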
>Duplicate Elimination
AI plays a crucial role in improving data deduplication since it helps to get beyond
some of the drawbacks of manual approaches for finding and eliminating
duplicates. The following are some ways that AI might enhance data
deduplication:
4. Consistency: Organizations that need to make sure their data is correct and
consistent across various systems and apps may find this consistency useful.
5. Learning and adaptation: AI systems are able to change their methods for
recognizing duplication by learning from new data. This implies that the AI
model can be modified as the data evolves over time to guarantee that it
correctly detects and eliminates duplicates. Companies that deal with
quickly changing data, like healthcare providers or online retailers, might
benefit greatly from this agility.
Duplicate entries are problematic for a variety of reasons. An entry that appears
more than once is given significant weight during training. Models that seem to
do well on frequent entries actually don't. Duplicate entries may destroy the
separation between the train, validation, and test sets when identical items are
not all in the same set. This could lead to erroneous performance forecasts that
let the model down in terms of actual outcomes.
Database duplicates can come from a wide range of causes, including processing
operations that were repeated along the data pipeline. Duplicate information
substantially impairs learning, but the issue can be fixed easily. One possibility is
to mandate that columns be unique whenever possible. An additional choice is
to run a script that will immediately detect and delete duplicate entries. This is
easy to do with Pandas' drop_duplicates() capability, as shown in the sample code
below:
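The snippet here is a minimal, self-contained sketch; the id and value columns are invented for illustration:

    import pandas as pd

    df = pd.DataFrame({
        "id":    [101, 102, 101],
        "value": ["a", "b", "a"],
    })

    # Count how many entries repeat an earlier row, then delete them
    n_dupes = df.duplicated().sum()
    df = df.drop_duplicates()
    print(f"Removed {n_dupes} duplicate entries")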
Since data usually comes from numerous sources, there is a good chance that
a given table or database contains entries that shouldn't be there. In some
cases, it could be required to filter out older entries. In other situations, more
complex data filtering is necessary.
The most important step in ML data cleaning is handling missing data. Missing
data can occur as a result of online forms that were only filled out with
mandatory fields or when tables and forms were updated. In some cases, it makes
sense to substitute the mean or most prevalent value for any missing data. If
more important fields are missing, it may be preferable to discard the entire data
entry.
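A minimal sketch of these two options, assuming invented email and age columns:

    import pandas as pd

    df = pd.DataFrame({
        "email": ["a@example.com", None, "c@example.com"],
        "age":   [34, 29, None],
    })

    # Substitute the mean (or the most prevalent value) for missing numeric data
    df["age"] = df["age"].fillna(df["age"].mean())

    # When a more important field such as the email is missing, discard the entry
    df = df.dropna(subset=["email"])

    print(df)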
Consider the additional expense involved in sending one person five copies of the
identical catalogue. Users must be able to locate duplicate records and stop new
duplicate entries from being added to the CRM in order to help avoid wasteful costs.
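A minimal sketch of that check, using an in-memory list as a stand-in for a CRM (the field names are assumptions):

    crm = [
        {"name": "John Smith", "email": "john@example.com"},
    ]

    def add_customer(record: dict) -> bool:
        # Block the insert if a record with the same email already exists
        if any(existing["email"] == record["email"] for existing in crm):
            return False
        crm.append(record)
        return True

    print(add_customer({"name": "J. Smith", "email": "john@example.com"}))     # False: duplicate
    print(add_customer({"name": "Sarah Brown", "email": "sarah@example.com"})) # True: new entry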
>Difficult Segmentation
Make sure your data is thorough, accurate, and duplicate-free if you intend to use
it to inform decisions on how to best position your company for future
commercial growth. Decisions based on low-quality data are nothing more than
guesses.
>Poor Business Processes
The quantity of customer records will increase as your clientele and business
expand, which will make the data more difficult to maintain and raise the
possibility that it will be lost.
AI algorithms are not error-free and can still make mistakes, despite the fact that
they can considerably increase the effectiveness and accuracy of duplicate data
removal. To guarantee the quality and integrity of the data, human oversight and
validation are crucial. Users should also evaluate the results to make sure they are
acceptable and correct and be aware of the assumptions and limits of the
algorithms being employed.
Highlight Features of Jupyter Notebook
Interactive computing: Jupyter Notebook makes it simple to examine and
experiment with data by allowing users to run and modify code interactively.
Easy visualization: Jupyter Notebook offers built-in support for data visualization
tools, such as Matplotlib and Seaborn, making it easy to generate visualizations of
data; a short example follows this list.
Large ecosystem: There is a sizable and vibrant user and developer community for
Jupyter Notebook, and there are numerous libraries and extensions available to
increase its capability.
Cloud-based: Jupyter Notebook makes it simple to access and share data analysis
projects from anywhere because it can be run locally on a user's computer or in
the cloud.
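As a small illustration of the visualization support mentioned above, a hypothetical before/after row count could be plotted like this (the numbers are invented):

    import matplotlib.pyplot as plt

    # Invented row counts before and after duplicate removal
    counts = {"Before deduplication": 1000, "After deduplication": 870}

    plt.bar(list(counts.keys()), list(counts.values()))
    plt.ylabel("Number of rows")
    plt.title("Effect of removing duplicate data")
    plt.show()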
System Requirement:
VS Code
C, C#, C++, Fortran, Go, Java, JavaScript, Node.js, Python, and Rust are just a
few of the programming languages that may be used with Visual Studio
Code, a source-code editor. It is built on the Electron framework, which is used
to develop desktop applications with Node.js.
Python
Python is one of the most widely used interpreted languages and was selected for
this project due to its straightforward syntax and broad selection of libraries and
modules. Python's syntax enables programmers to express concepts in less code
than may be needed in languages like C++ or Java because it was designed to be
extendable. The main cause of Python's strong demand among programmers is its
sizable, well-known standard library. MIME and HTTP are also supported for
internet-connected applications. Python version 3.6 (64-bit) is used in this work on
Windows 10.
Implementation:
Result:
Conclusion
To ensure the quality and integrity of the data, it is crucial to validate the results
and bear in mind the constraints and presumptions of the algorithms that were
employed. In order to guarantee that the data is correct and consistent, human
oversight and validation are necessary. Overall, the removal of duplicate data
using AI algorithms in Jupyter Notebook can considerably increase data integrity,
correctness, and consistency while also enhancing data administration and
analysis.
References:
> https://www.druva.com/glossary/what-is-deduplication-definition-andrelatedfaqs
> https://www.grow.com/blog/data-deduplication-with-ai
> https://indicodata.ai/blog/should-we-remove-duplicates-ask-slater/
> https://runneredq.com/news/problems-generated-by-having-duplicate-data-in-a-database/
> https://deepchecks.com/what-is-datacleaning/
> https://www.researchgate.net/publication/339561834_An_Effective_Duplicate_Removal_Algorithm_for_Text_Documents