PAACDA: Comprehensive Data Corruption Detection Algorithm
ABSTRACT
With the advent of technology, data and its analysis are no longer just values and
attributes strewn across spreadsheets; they are now seen as a stepping stone to bring
about revolution in any significant field. Data corruption can be introduced by a
variety of unethical and illegal sources, making it crucial to develop a method that is
highly effective at identifying and appropriately highlighting the corrupted data
present in a dataset. Detecting corrupted data, as well as recovering data from a
corrupted dataset, is a challenging problem. It requires utmost attention and, if not
addressed at an early stage, may pose problems in later stages of data processing
with machine or deep learning algorithms. In the following work we begin by
introducing PAACDA, the Proximity based Adamic Adar Corruption Detection
Algorithm, and consolidating its results while particularly accentuating the detection
of corrupted data rather than outliers. Current state-of-the-art models, such as
Isolation Forest and DBSCAN ("Density-Based Spatial Clustering of Applications
with Noise"), rely on fine-tuned parameters to provide high accuracy and recall, but
they also carry a significant level of uncertainty when factoring in corrupted data.
In the present work, the authors examine the performance issues of several
unsupervised learning algorithms on linear and clustered corrupted datasets. A novel
PAACDA algorithm is also proposed which outperforms 15 popular unsupervised
learning baselines, including K-means clustering, Isolation Forest, and LOF (Local
Outlier Factor), with an accuracy of 96.35% for clustered data and 99.04% for linear
data. This article also conducts a thorough exploration of the relevant literature from
the previously stated perspectives.
In this research work, we pinpoint the shortcomings of present techniques and
chart directions for future work in this field.
EXISTING SYSTEM
Anomaly identification in a given dataset is a key area of research with numerous
practical applications, and as a result it has frequently been the focus of study.
Multiple approaches utilizing various aspects of a dataset have been proposed to
detect anomalies; however, only a few methodologies emphasize the detection of
corrupted data in a way that remains efficient across varying dataset sizes, higher
dimensionality, and varying degrees of corruption. Chandola et al. [2] compare
numerous anomaly detection methods for diverse applications. Contrasting the
benefits and drawbacks of various techniques, Hodge and Austin [28] conducted a
review of outlier detection methods. Overviews of cutting-edge methods for spotting
suspicious behaviour, together with detection scenarios for several real-world
settings, are presented by Patcha and Park [29] and by Jiang et al. [30].
Dimensionality reduction approaches and their underlying mathematical
foundations are categorized by Sorzano et al. [31]. The issues with anomaly
detection are further laid out by a number of other reports, including papers by Gama
et al. [32], Gupta et al. [33], Heydari et al. [34], Jindal and Liu [35], and many more.
Outliers make up the majority of anomalies that can exist in a dataset. The first
distance-based method for outlier detection was put forth by Knorr et al. [36],
and Ramaswamy et al. [37] expanded on it by suggesting that the n points with the
highest Pk values be deemed outliers, where Pk(p) denotes the distance from p to its
kth nearest neighbour. They used a clustering technique to partition a dataset into
several groups; for these groups, batch processing and pruning can improve the
effectiveness of outlier detection [38]. Deviation-based outlier detection was another
method suggested for effectively detecting outliers: objects or data points that deviate
significantly from the bulk of the data constitute outliers, which is why outliers are
frequently called deviations [39] and the approach bears that name.
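The kth-nearest-neighbour scoring of Ramaswamy et al. [37] can be sketched in a
few lines. The snippet below is a minimal illustration assuming numeric data in a
NumPy array; the function name and parameter values are ours, not taken from the
original paper.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def knn_distance_outliers(X, k=5, n_outliers=10):
        """Flag the n points with the largest distance to their kth
        nearest neighbour (the Pk score of Ramaswamy et al.)."""
        # k + 1 neighbours because each point is its own nearest neighbour
        nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
        distances, _ = nbrs.kneighbors(X)
        pk = distances[:, -1]                 # distance to the kth neighbour
        return np.argsort(pk)[-n_outliers:]   # indices of the top-n scores

    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 2))
    X[:3] += 8                                # plant three obvious outliers
    print(knn_distance_outliers(X, k=5, n_outliers=3))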
Several other methods have been developed over the years to detect anomalies;
notably, Breunig et al. [21] proposed a density-based method, the Local Outlier
Factor (LOF). Cluster-based anomaly identification methods pinpoint anomalies by
eliminating clusters from the actual dataset [40] or by classifying small clusters as
outliers [41]. Additionally, Aggarwal and Yu [42] proposed a novel strategy for
catching outliers that is remarkably effective for very high-dimensional datasets.
Their methodology focuses on finding locally sparse lower-dimensional projections,
which are otherwise difficult to identify by brute force due to the vast number of
possible combinations. However, the study is inclined towards the detection of
outliers and does not focus on the detection of corrupted or modified datasets.
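To make the density-based idea concrete, the following is a minimal sketch of LOF
[21] using scikit-learn; the toy dataset, n_neighbors, and contamination values are
illustrative assumptions, not settings from the cited work.

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)),   # one dense cluster
                   [[6.0, 6.0], [7.0, -5.0]]])   # two isolated points

    # LOF compares each point's local density with that of its neighbours;
    # fit_predict returns -1 for points it deems outliers.
    lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
    labels = lof.fit_predict(X)
    print(np.where(labels == -1)[0])             # indices flagged as outliers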
Li et al. [43] proposed a unique outlier detection approach called Empirical
Cumulative distribution-based Outlier Detection (ECOD). This method uses the
empirical cumulative distribution function of each feature to measure how extreme
each value in the dataset is. Applied extensively to 30 datasets, ECOD outperformed
existing state-of-the-art models while remaining fast and scalable. However, the
method does not handle outliers that lie in neither the left nor the right tail of a
distribution, and the authors leave open a promising route: finding a mechanism to
expand ECOD to such settings while keeping it quick and scalable.
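The core idea behind ECOD can be sketched as follows. This is a simplified
rendition of tail-probability scoring under our own assumptions (both tails treated
symmetrically, per-feature scores summed as negative log tail probabilities); it is not
the authors' implementation, which is available in the PyOD library.

    import numpy as np

    def ecod_scores(X):
        """Simplified ECOD-style scores: for each feature, estimate left- and
        right-tail probabilities with the empirical CDF, then sum the negative
        log of the more extreme tail across features."""
        n, d = X.shape
        scores = np.zeros(n)
        for j in range(d):
            col = np.sort(X[:, j])
            # empirical CDF: fraction of samples <= x (left) and >= x (right)
            left = np.searchsorted(col, X[:, j], side="right") / n
            right = 1.0 - np.searchsorted(col, X[:, j], side="left") / n
            tail = np.minimum(left, right).clip(min=1.0 / n)
            scores += -np.log(tail)      # rarer values contribute larger scores
        return scores                    # higher score = more outlying

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 3))
    X[0] = [9, -9, 9]                    # plant one extreme point
    print(np.argmax(ecod_scores(X)))     # expected: 0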
The bulk of these works, meanwhile, are primarily focused on outlier identification
without paying much attention to data that contains corrupted values. Many cutting-
edge poisoning and outlier identification techniques have been developed, and they
generally belong to one of the following categories: distribution based [43], [44],
[45], depth based [46], distance based [47], [48], [49], density based [50], [51],
cluster based [52], [53], [54], and generative models [55].
Disadvantages
The deep support vector data description (DeepSVDD) approach, a deep-learning
modification of the support vector data description model and another traditional
paradigm for anomaly detection, is not used.
Gaussian mixture models (GMMs), which represent the data as a weighted
combination of parameterized probability density functions fitted with the
expectation-maximization technique, are not used either; a minimal sketch of GMM-
based anomaly scoring is given after this list.
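For reference, this is how GMM-based anomaly scoring typically looks with scikit-
learn's expectation-maximization fit; the toy data, component count, and 2%
threshold below are illustrative assumptions.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(7)
    X = np.vstack([rng.normal(0, 1, (150, 2)),
                   rng.normal(5, 1, (150, 2)),
                   [[12.0, -8.0]]])              # one planted anomaly

    # Fit the mixture via expectation-maximization; points with low
    # log-likelihood under the fitted model are candidate anomalies.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    log_density = gmm.score_samples(X)
    threshold = np.percentile(log_density, 2)    # flag the 2% least likely
    print(np.where(log_density < threshold)[0])  # includes index 300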
PROPOSED SYSTEM
Machine learning is an important component of the growing field of data science.
Through the use of statistical methods, different types of algorithms are trained to
make classifications or predictions and to uncover key insights. These insights
subsequently drive decision making within applications and businesses, ideally
impacting key growth metrics.
Machine learning algorithms build a model from sample data, known as training
data, in order to make predictions or decisions without being explicitly programmed
to do so. They are applied to a wide variety of datasets where it is difficult or
infeasible to develop conventional algorithms to perform the needed tasks.
The novel method proposed as part of this research largely revolves around a graph-
based algorithm, Adamic Adar. The Adamic Adar index aids in predicting links,
particularly in areas like social networks, and is determined by taking into account
the common neighbours shared by two nodes: AA(x, y) is the sum of 1/log(|N(z)|)
over every node z adjacent to both x and y, where N(z) denotes the set of neighbours
of z, so common neighbours with few links contribute more to the score. The
research puts forth a modified approach to Adamic Adar called PAACDA (Proximity
based Adamic Adar Corruption Detection Algorithm) which, when applied to data
corruption detection, provides the best accuracy compared with the above-mentioned
algorithms.
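The underlying index that PAACDA builds on can be computed with NetworkX.
The toy graph below illustrates only the plain Adamic Adar index; it does not
reproduce the proximity-based modification introduced by PAACDA.

    import networkx as nx

    # Toy graph in which nodes 1 and 2 share two common neighbours (3 and 4)
    G = nx.Graph([(1, 3), (2, 3), (1, 4), (2, 4), (3, 5)])

    # AA(x, y) = sum over common neighbours z of 1 / log(degree(z));
    # low-degree common neighbours contribute more to the score.
    for u, v, score in nx.adamic_adar_index(G, [(1, 2)]):
        print(f"AA({u}, {v}) = {score:.3f}")   # 1/log(3) + 1/log(2) = 2.353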
Advantages
1) We propose scalable and efficient machine learning models for accurate
predictions.
2) We propose a novel method, PAACDA, an unsupervised model that detects
corrupted data more accurately.
SYSTEM REQUIREMENTS
➢ H/W System Configuration:
➢ Processor - Pentium IV
➢ RAM - 4 GB (min)
➢ Hard Disk - 20 GB
➢ Keyboard - Standard Windows Keyboard
➢ Mouse - Two or Three Button Mouse
➢ Monitor - SVGA
SOFTWARE REQUIREMENTS:
Operating System : Windows 7 Ultimate
Coding Language : Python
Front-End : Python
Back-End : Django-ORM
Designing : HTML, CSS, JavaScript
Database : MySQL (WAMP Server)