PAACDA: Comprehensive Data Corruption Detection Algorithm
ABSTRACT
With the advent of technology, data and its analysis are no longer just values and
attributes strewn across spreadsheets; they are now seen as a stepping stone to bring
about revolution in any significant field. Data corruption can be introduced by a
variety of unethical and illegal sources, making it crucial to develop a method that is
highly effective at identifying and appropriately highlighting the corrupted data
present in a dataset. Detecting corrupted data, as well as recovering data from a
corrupted dataset, is a challenging problem. It requires utmost attention and, if not
addressed at an early stage, may pose problems in later stages of data processing
with machine or deep learning algorithms. In the following work we begin by
introducing PAACDA, the Proximity based Adamic Adar Corruption Detection
Algorithm, and consolidating its results while particularly accentuating the detection
of corrupted data rather than outliers. Current state-of-the-art models, such as
Isolation Forest and DBSCAN ("Density-Based Spatial Clustering of Applications
with Noise"), rely on fine-tuned parameters to provide high accuracy and recall, but
they also carry a significant level of uncertainty when factoring in corrupted data.
In the present work, the authors examine the performance issues of several
unsupervised learning algorithms on linear and clustered corrupted datasets. A novel
PAACDA algorithm is also proposed which outperforms 15 popular unsupervised
learning baselines, including K-means clustering, Isolation Forest, and LOF (Local
Outlier Factor), with an accuracy of 96.35% for clustered data and 99.04% for linear
data. This article also conducts a thorough exploration of the relevant literature from
the previously stated perspectives.
In this research work, we pinpoint the shortcomings of present techniques and
chart directions for future work in this field.
EXISTING SYSTEM
Anomaly identification in a given dataset is a key area of research with numerous
practical applications, and as a result it has frequently been the focus of study.
Multiple approaches utilizing various aspects of a dataset have been proposed to
detect anomalies; however, only a few methodologies emphasize the detection of
corrupted data in a way that remains efficient across varying dataset sizes, higher
dimensionality, and varying degrees of corruption. Chandola et al. [2] compare
numerous anomaly detection methods for diverse applications. Contrasting the
benefits and drawbacks of various techniques, Hodge and Austin [28] conducted a
review of outlier detection methods. Overviews of cutting-edge methods for spotting
suspicious behaviour, together with detection scenarios for several real-world
settings, are presented by Patcha and Park [29] and by Jiang et al. [30].
Dimensionality reduction approaches and their underlying mathematical
foundations are categorized by Sorzano et al. [31]. The issues with anomaly
detection are further laid out by a number of other reports, including papers by Gama
et al. [32], Gupta et al. [33], Heydari et al. [34], Jindal and Liu [35], and many more.
Outliers make up the majority of anomalies that can exist in a dataset. The first
distance-based method for outlier detection was put forth by Knorr et al. [36],
and Ramaswamy et al. [37] expanded on it by suggesting that the n points with the
highest Pk values be deemed outliers, where Pk(p) denotes the distance from p to its
kth nearest neighbour. They used a clustering technique to partition a dataset into
several groups; for these groups, batch processing and pruning can improve the
effectiveness of outlier detection [38]. Deviation-based outlier detection was another
method suggested for effectively detecting outliers: objects or data points that deviate
significantly from the bulk of the data constitute outliers, which is why outliers are
frequently called deviations [39] and the approach bears that name.
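The kth-nearest-neighbour scoring of Ramaswamy et al. [37] can be sketched in a
few lines. The snippet below is a minimal illustration assuming numeric data in a
NumPy array; the function name and parameter values are ours, not taken from the
original paper.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def knn_distance_outliers(X, k=5, n_outliers=10):
        """Flag the n points with the largest distance to their kth
        nearest neighbour (the Pk score of Ramaswamy et al.)."""
        # k + 1 neighbours because each point is its own nearest neighbour
        nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
        distances, _ = nbrs.kneighbors(X)
        pk = distances[:, -1]                 # distance to the kth neighbour
        return np.argsort(pk)[-n_outliers:]   # indices of the top-n scores

    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 2))
    X[:3] += 8                                # plant three obvious outliers
    print(knn_distance_outliers(X, k=5, n_outliers=3))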
Several other methods have been developed over the years to detect anomalies;
notably, Breunig et al. [21] proposed a density-based method, the Local Outlier
Factor (LOF). Cluster-based anomaly identification methods pinpoint anomalies by
eliminating clusters from the actual dataset [40] or by classifying small clusters as
outliers [41]. Additionally, Aggarwal and Yu [42] proposed a novel strategy for
catching outliers that is remarkably effective for very high-dimensional datasets.
Their methodology focuses on finding locally sparse lower-dimensional projections,
which are otherwise difficult to identify by brute force due to the vast number of
possible combinations. However, the study is inclined towards the detection of
outliers and does not focus on the detection of corrupted or modified datasets.
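To make the density-based idea concrete, the following is a minimal sketch of LOF
[21] using scikit-learn; the toy dataset, n_neighbors, and contamination values are
illustrative assumptions, not settings from the cited work.

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)),   # one dense cluster
                   [[6.0, 6.0], [7.0, -5.0]]])   # two isolated points

    # LOF compares each point's local density with that of its neighbours;
    # fit_predict returns -1 for points it deems outliers.
    lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
    labels = lof.fit_predict(X)
    print(np.where(labels == -1)[0])             # indices flagged as outliers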
Li et al. [43] proposed a unique outlier detection approach called Empirical
Cumulative distribution-based Outlier Detection (ECOD). This method uses the
empirical cumulative distribution function of each feature to measure how extreme
each value in the dataset is. Applied extensively to 30 datasets, ECOD outperformed
existing state-of-the-art models while remaining fast and scalable. However, the
method does not handle outliers that lie in neither the left nor the right tail of a
distribution, and the authors leave open a promising route: finding a mechanism to
expand ECOD to such settings while keeping it quick and scalable.
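The core idea behind ECOD can be sketched as follows. This is a simplified
rendition of tail-probability scoring under our own assumptions (both tails treated
symmetrically, per-feature scores summed as negative log tail probabilities); it is not
the authors' implementation, which is available in the PyOD library.

    import numpy as np

    def ecod_scores(X):
        """Simplified ECOD-style scores: for each feature, estimate left- and
        right-tail probabilities with the empirical CDF, then sum the negative
        log of the more extreme tail across features."""
        n, d = X.shape
        scores = np.zeros(n)
        for j in range(d):
            col = np.sort(X[:, j])
            # empirical CDF: fraction of samples <= x (left) and >= x (right)
            left = np.searchsorted(col, X[:, j], side="right") / n
            right = 1.0 - np.searchsorted(col, X[:, j], side="left") / n
            tail = np.minimum(left, right).clip(min=1.0 / n)
            scores += -np.log(tail)      # rarer values contribute larger scores
        return scores                    # higher score = more outlying

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 3))
    X[0] = [9, -9, 9]                    # plant one extreme point
    print(np.argmax(ecod_scores(X)))     # expected: 0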
The bulk of these works, meanwhile, are primarily focused on outlier identification
without paying much attention to data that contains corrupted values. Many cutting-
edge poisoning and outlier identification techniques have been developed, and they
generally belong to one of the following categories: distribution based [43], [44],
[45], depth based [46], distance based [47], [48], [49], density based [50], [51],
cluster based [52], [53], [54], and generative models [55].
Disadvantages
The deep support vector data description (DeepSVDD) approach, a deep-learning
modification of the support vector data description model and another traditional
paradigm for anomaly detection, is not used.
Gaussian mixture models (GMMs), which represent the data as a weighted
combination of parameterized probability density functions fitted with the
expectation-maximization technique, are not used either; a minimal sketch of GMM-
based anomaly scoring is given after this list.
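For reference, this is how GMM-based anomaly scoring typically looks with scikit-
learn's expectation-maximization fit; the toy data, component count, and 2%
threshold below are illustrative assumptions.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(7)
    X = np.vstack([rng.normal(0, 1, (150, 2)),
                   rng.normal(5, 1, (150, 2)),
                   [[12.0, -8.0]]])              # one planted anomaly

    # Fit the mixture via expectation-maximization; points with low
    # log-likelihood under the fitted model are candidate anomalies.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    log_density = gmm.score_samples(X)
    threshold = np.percentile(log_density, 2)    # flag the 2% least likely
    print(np.where(log_density < threshold)[0])  # includes index 300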
PROPOSED SYSTEM
Machine learning is an important component of the growing field of data science.
Through the use of statistical methods, different types of algorithms are trained to
make classifications or predictions and to uncover key insights. These insights
subsequently drive decision making within applications and businesses, ideally
impacting key growth metrics.
Machine learning algorithms build a model from sample data, known as training
data, in order to make predictions or decisions without being explicitly programmed
to do so. They are applied to a wide variety of datasets where it is difficult or
infeasible to develop conventional algorithms to perform the needed tasks.
The novel method proposed as part of this research largely revolves around a graph-
based algorithm, Adamic Adar. The Adamic Adar index aids in predicting links,
particularly in areas like social networks, and is determined by taking into account
the common neighbours shared by two nodes: AA(x, y) is the sum of 1/log(|N(z)|)
over every node z adjacent to both x and y, where N(z) denotes the set of neighbours
of z, so common neighbours with few links contribute more to the score. The
research puts forth a modified approach to Adamic Adar called PAACDA (Proximity
based Adamic Adar Corruption Detection Algorithm) which, when applied to data
corruption detection, provides the best accuracy compared with the above-mentioned
algorithms.
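The underlying index that PAACDA builds on can be computed with NetworkX.
The toy graph below illustrates only the plain Adamic Adar index; it does not
reproduce the proximity-based modification introduced by PAACDA.

    import networkx as nx

    # Toy graph in which nodes 1 and 2 share two common neighbours (3 and 4)
    G = nx.Graph([(1, 3), (2, 3), (1, 4), (2, 4), (3, 5)])

    # AA(x, y) = sum over common neighbours z of 1 / log(degree(z));
    # low-degree common neighbours contribute more to the score.
    for u, v, score in nx.adamic_adar_index(G, [(1, 2)]):
        print(f"AA({u}, {v}) = {score:.3f}")   # 1/log(3) + 1/log(2) = 2.353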
Advantages
1) We propose scalable and efficient machine learning models for accurate
predictions.
2) We propose a novel method, PAACDA, an unsupervised model that detects
corrupted data more accurately.
SYSTEM REQUIREMENTS
➢ H/W System Configuration:
➢ Processor - Pentium IV
➢ RAM - 4 GB (min)
➢ Hard Disk - 20 GB
➢ Keyboard - Standard Windows Keyboard
➢ Mouse - Two or Three Button Mouse
➢ Monitor - SVGA
SOFTWARE REQUIREMENTS:
Operating System : Windows 7 Ultimate
Coding Language : Python
Front-End : Python
Back-End : Django-ORM
Designing : HTML, CSS, JavaScript
Database : MySQL (WAMP Server)