MODULE 1
DATA MINING
Introduction to Data Mining:
Data mining is the process of extracting useful information from large sets of data. It involves using various
techniques from statistics, machine learning, and database systems to identify patterns, relationships, and trends in
the data. This information can then be used to make data-driven decisions, solve business problems, and uncover
hidden insights. Applications of data mining include customer profiling and segmentation, market basket analysis,
anomaly detection, and predictive modeling. Data mining tools and technologies are widely used in various
industries, including finance, healthcare, retail, and telecommunications.
In general terms, “Mining” is the process of extraction of some valuable material from the earth e.g. coal mining,
diamond mining, etc.
It is basically the process carried out for the extraction of useful information from a bulk of data or data
warehouses. One can see that the term itself is a little confusing. In the case of coal or diamond mining, the result of
the extraction process is coal or diamond. But in the case of Data Mining, the result of the extraction process is not
data!! Instead, data mining results are the patterns and knowledge that we gain at the end of the extraction process.
In that sense, we can think of Data Mining as a step in the process of Knowledge Discovery or Knowledge
Extraction.
Data Mining Definitions:
The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously
unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection),
and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques
such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in
further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might
identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision
support system. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data
mining step, although they do belong to the overall KDD process as additional steps.
The difference between data analysis and data mining is that data analysis is used to test models and hypotheses on the
dataset, e.g., analyzing the effectiveness of a marketing campaign, regardless of the amount of data. In contrast, data
mining uses machine learning and statistical models to uncover clandestine or hidden patterns in a large volume of
data.[8]
The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample
parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the
validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the
larger data populations.
KDD: Knowledge Discovery in Databases
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously unknown,
and potentially valuable information from large datasets. KDD is an iterative process, and extracting accurate
knowledge from the data usually requires multiple passes through the following steps:
Data Cleaning:
Data cleaning is defined as the removal of noisy and irrelevant data from the collection. It includes:
Handling missing values.
Smoothing noisy data, where noise is a random or variance error.
Using data discrepancy detection and data transformation tools.
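As a small sketch of two of these cleaning steps (the values below are invented for illustration), missing entries can be filled with the attribute mean, and noisy values can be smoothed by replacing each value with the mean of its bin:

```python
def fill_missing(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, partition them into equal-size bins, and replace
    each value with its bin's mean (a classic noise-smoothing step)."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_vals = ordered[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

ages = [23, None, 25, 27, None, 24]
print(fill_missing(ages))   # the two None entries become the mean, 24.75
print(smooth_by_bin_means([4, 8, 9, 15, 21, 21, 24, 25, 26], 3))
```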
Data Integration:
Data integration is defined as the combination of heterogeneous data from multiple sources into a common store (a
data warehouse). It is carried out using data migration tools, data synchronization tools, and the ETL (Extract,
Transform, Load) process.
Data Selection:
Data selection is defined as the process where the data relevant to the analysis is decided upon and retrieved from
the data collection. Techniques such as neural networks, decision trees, naive Bayes, clustering,
and regression can be applied here.
Data Transformation:
Data transformation is defined as the process of transforming the data into the appropriate form required by the
mining procedure. Data transformation is a two-step process:
1. Data Mapping: Assigning elements from source base to destination to capture transformations.
2. Code generation: Creation of the actual transformation program.
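For example, one common transformation applied at this stage is min-max normalization, which rescales an attribute to a target range such as [0, 1] before mining (the income values below are invented for illustration):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly from [min, max] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    scale = (new_max - new_min) / (old_max - old_min)
    return [(v - old_min) * scale + new_min for v in values]

incomes = [12000, 73600, 98000, 54000]
print(min_max_normalize(incomes))  # smallest maps to 0.0, largest to 1.0
```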
Data Mining:
Data mining is defined as the application of techniques to extract potentially useful patterns. It transforms task-relevant
data into patterns and decides the purpose of the model, using classification or characterization.
Pattern Evaluation:
Pattern evaluation is defined as the identification of interesting patterns representing knowledge, based on given
interestingness measures. It finds an interestingness score for each pattern, and uses summarization and visualization
to make the data understandable to the user.
Knowledge Representation:
This involves presenting the results in a way that is meaningful and can be used to make decisions.
Differences between KDD and Data Mining:
Definition: KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable
patterns and relationships in data. Data mining refers to a process of extracting useful and valuable information or
patterns from large data sets.
Objective: KDD aims to find useful knowledge from data; data mining aims to extract useful information from data.
Techniques used: KDD comprises data cleaning, data integration, data selection, data transformation, data mining,
pattern evaluation, and knowledge representation and visualization. Data mining uses association rules,
classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.
Output: KDD produces structured information, such as rules and models, that can be used to make decisions or
predictions. Data mining produces patterns, associations, or insights that can be used to improve decision-making
or understanding.
Focus: KDD focuses on the discovery of useful knowledge, rather than simply finding patterns in data. Data mining
focuses on the discovery of patterns or relationships in data.
Role of domain expertise: Domain expertise is important in KDD, as it helps in defining the goals of the process,
choosing appropriate data, and interpreting the results. It is less critical in data mining, as the algorithms are
designed to identify patterns without relying on prior knowledge.
Differences between DBMS and Data Mining:
Focus: A DBMS (Database Management System) focuses on storing, organizing, and managing data; data mining
focuses on analyzing data to extract patterns and relationships.
Technique: A DBMS creates, modifies, and queries databases; data mining identifies patterns and interesting
relationships in the data.
Application: DBMSs are applied to relational databases, transactional databases, and data warehousing; data mining
is applied to business decision making, data analysis, and pattern recognition.
Tools: DBMS tools include MySQL, Oracle, SQL Server, and SQL; data mining tools include data mining
algorithms and machine learning techniques.
Process: A DBMS covers database design, data entry, data retrieval, and data manipulation; data mining covers data
cleaning, data preprocessing, data analysis, and data visualization.
Data Mining Techniques:
1. Association:
Association analysis is the finding of association rules showing attribute-value conditions that occur frequently
together in a given set of data. Association analysis is widely used for a market basket or transaction data analysis.
Association rule mining is a significant and exceptionally active area of data mining research. One method of
association-based classification, called associative classification, consists of two steps. In the first step, association
rules are generated using a modified version of the standard association rule mining algorithm known as
Apriori. The second step constructs a classifier based on the association rules discovered.
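A compact sketch of the Apriori idea on hypothetical basket data (item names invented for the example) might look like the following. It relies on the key Apriori property: every subset of a frequent itemset must itself be frequent, so candidate (k+1)-itemsets are grown only from frequent k-itemsets:

```python
def apriori(transactions, min_support):
    """Return every itemset whose support meets min_support, with its support."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    frequent = {}
    current = [frozenset([i]) for i in items]  # candidate 1-itemsets
    k = 1
    while current:
        survivors = [c for c in current if support(c) >= min_support]
        frequent.update({c: support(c) for c in survivors})
        # grow candidate (k+1)-itemsets only from frequent k-itemsets
        current = list({a | b for a in survivors for b in survivors
                        if len(a | b) == k + 1})
        k += 1
    return frequent

baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread", "butter"}, {"milk", "butter"}]
for itemset, s in sorted(apriori(baskets, 0.5).items(), key=lambda x: -x[1]):
    print(set(itemset), round(s, 2))
```

With a minimum support of 0.5, the three single items and the three item pairs survive, while {milk, bread, butter} (support 0.25) is pruned.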
2. Classification
Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or
concepts, for the purpose of using the model to predict the class of objects whose class label is unknown.
The derived model is based on the analysis of a set of training data (i.e., data objects whose
class label is known). The derived model may be represented in various forms, such as classification (if-then) rules,
decision trees, and neural networks. Data mining has different types of classifiers:
Decision Tree
SVM(Support Vector Machine)
Generalized Linear Models
Bayesian classification:
Classification by Back propagation
K-NN Classifier
Rule-Based Classification
Frequent-Pattern Based Classification
Rough set theory
Fuzzy Logic
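As a small illustration of one classifier from the list above (the 2-D points and labels are made up for the example), the K-NN classifier assigns a new object the majority class label of its k nearest training objects:

```python
from collections import Counter

def knn_classify(train, point, k=3):
    """train: list of ((x, y), label) pairs; point: (x, y) to classify."""
    dist = lambda p, q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2  # squared Euclidean
    nearest = sorted(train, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]  # majority class among the k neighbours

train = [((1, 1), "A"), ((2, 1), "A"), ((1, 2), "A"),
         ((8, 8), "B"), ((9, 8), "B"), ((8, 9), "B")]
print(knn_classify(train, (2, 2)))  # nearest neighbours are all "A"
print(knn_classify(train, (8, 7)))  # nearest neighbours are all "B"
```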
3. Prediction
Data prediction is a two-step process, similar to that of data classification. However, for prediction we do not use
the term "class label attribute" because the attribute whose values are being predicted is continuous-
valued (ordered) rather than categorical (discrete-valued and unordered). The attribute can be referred to simply as
the predicted attribute. Prediction can be viewed as the construction and use of a model to assess the class of an
unlabeled object, or to assess the value or value ranges of an attribute that a given object is likely to have.
4. Clustering:
Unlike classification and prediction, which analyze class-labeled data objects or attributes, clustering analyzes data
objects without consulting an identified class label. In general, the class labels do not exist in the training data
simply because they are not known to begin with. Clustering can be used to generate these labels. The objects are
clustered based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
That is, clusters of objects are created so that objects inside a cluster have high similarity to one another, but are
very dissimilar to objects in other clusters. Each cluster that is generated can be seen as a class of objects, from
which rules can be inferred. Clustering can also facilitate taxonomy formation, that is, the organization of
observations into a hierarchy of classes that group similar events together.
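A minimal sketch of this principle (toy one-dimensional data, invented for the example) is the classic k-means procedure: assign each object to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Cluster 1-D points into k groups; returns the sorted centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k random points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)  # maximize intra-cluster similarity
        # recompute each centroid as its cluster's mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.8]
print(kmeans(data, 2))  # two centroids, near 1.0 and 9.1
```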
5. Regression:
Regression can be defined as a statistical modeling method in which previously obtained data is used to predict a
continuous quantity for new observations. This classifier is also known as the continuous value classifier. There are
two types of regression models: simple linear regression and multiple linear regression.
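For example, simple linear regression fits a line y = a + b*x to previously obtained data by least squares and then predicts a continuous value for a new observation (the numbers below are invented for illustration):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (intercept a, slope b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]        # e.g. years of experience (made-up data)
ys = [30, 35, 40, 45, 50]   # e.g. salary in thousands
a, b = fit_line(xs, ys)
print(a, b)                 # intercept 25.0, slope 5.0
predict = lambda x: a + b * x
print(predict(6))           # predicted continuous value: 55.0
```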
6. Artificial Neural network (ANN) Classifier Method:
An artificial neural network (ANN), also referred to as simply a "neural network" (NN), is a computational model
inspired by biological neural networks. It consists of an interconnected collection of artificial neurons. A neural
network is a set of connected input/output units where each connection has a weight associated with it. During the
learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the
input samples.
The advantages of neural networks include their high tolerance to noisy data as well as their ability to
classify patterns on which they have not been trained. In addition, several algorithms have recently been developed
for the extraction of rules from trained neural networks. These factors contribute to the usefulness of neural networks
for classification in data mining.
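As an illustrative sketch (the training data and learning rate are invented for the example), a single artificial neuron, a perceptron, can learn the logical AND function by adjusting its connection weights after every misclassified sample:

```python
def train_perceptron(samples, epochs=20, rate=0.1):
    """Train one neuron: two weighted inputs plus a bias, step activation."""
    w = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = 1 if w[0] * x1 + w[1] * x2 + bias > 0 else 0
            error = target - output
            # adjust each connection weight in proportion to the error
            w[0] += rate * error * x1
            w[1] += rate * error * x2
            bias += rate * error
    return w, bias

and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, bias = train_perceptron(and_samples)
classify = lambda x1, x2: 1 if w[0] * x1 + w[1] * x2 + bias > 0 else 0
print([classify(x1, x2) for (x1, x2), _ in and_samples])  # [0, 0, 0, 1]
```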
7. Outlier Detection:
A database may contain data objects that do not comply with the general behavior or model of the data. These data
objects are outliers. The investigation of outlier data is known as outlier mining. An outlier may be
detected using statistical tests, which assume a distribution or probability model for the data, or using distance
measures, where objects having only a small fraction of "close" neighbors in space are considered outliers. Rather
than using statistical or distance measures, deviation-based techniques identify outliers by examining
differences in the main attributes of objects in a group.
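A minimal statistical sketch of the first approach (the sensor readings are invented for the example): flag values whose z-score, i.e. distance from the mean measured in standard deviations, exceeds a threshold:

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations
    from the mean (assumes a roughly normal distribution)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) / stdev > threshold]

readings = [10, 12, 11, 13, 12, 11, 10, 95]
print(find_outliers(readings))  # [95]
```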
8. Genetic Algorithm:
Genetic algorithms are adaptive heuristic search algorithms that belong to the larger class of evolutionary algorithms.
Genetic algorithms are based on the ideas of natural selection and genetics. They are an intelligent exploitation of
random search, guided by historical data, to direct the search into regions of better performance in the solution
space. They are commonly used to generate high-quality solutions for optimization and search problems.
Genetic algorithms simulate the process of natural selection, in which those individuals that can adapt to changes in
their environment are able to survive, reproduce, and pass on to the next generation. In simple words, they simulate
"survival of the fittest" among individuals of consecutive generations for solving a problem. Each generation consists
of a population of individuals, and each individual represents a point in the search space and a possible solution. Each
individual is represented as a string of characters/integers/floats/bits; this string is analogous to a chromosome.
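As a toy illustration of these ideas (all parameters are invented for the example), the sketch below evolves bit-string chromosomes toward the all-ones string ("OneMax"), using tournament selection, one-point crossover, and bit-flip mutation:

```python
import random

def one_max_ga(length=20, pop_size=30, generations=60, seed=1):
    """Evolve bit strings toward all ones; returns the fittest individual."""
    rng = random.Random(seed)
    fitness = sum  # fitness = number of 1 bits in the chromosome
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        def select():
            # tournament selection: the fitter of two random individuals
            return max(rng.sample(pop, 2), key=fitness)
        next_pop = []
        while len(next_pop) < pop_size:
            a, b = select(), select()
            cut = rng.randrange(1, length)      # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(length):             # bit-flip mutation
                if rng.random() < 0.02:
                    child[i] = 1 - child[i]
            next_pop.append(child)
        pop = next_pop                          # next generation
    return max(pop, key=fitness)

best = one_max_ga()
print(best, sum(best))
```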
Problems and Challenges in Data Mining:
1] Data Quality:
The quality of data used in data mining is one of the most significant challenges. The accuracy, completeness, and
consistency of the data affect the accuracy of the results obtained. The data may contain errors, omissions,
duplications, or inconsistencies, which may lead to inaccurate results. Moreover, the data may be incomplete, meaning
that some attributes or values are missing, making it challenging to obtain a complete understanding of the data.
Data quality issues can arise due to a variety of reasons, including data entry errors, data storage issues, data integration
problems, and data transmission errors. To address these challenges, data mining practitioners must apply data cleaning
and data preprocessing techniques to improve the quality of the data. Data cleaning involves detecting and correcting
errors, while data preprocessing involves transforming the data to make it suitable for data mining.
2] Data Complexity:
Data complexity refers to the vast amounts of data generated by various sources, such as sensors, social media, and the
internet of things (IOT). The complexity of the data may make it challenging to process, analyze, and understand. In
addition, the data may be in different formats, making it challenging to integrate into a single dataset.
To address this challenge, data mining practitioners use advanced techniques such as clustering, classification, and
association rule mining. These techniques help to identify patterns and relationships in the data, which can then be used
to gain insights and make predictions.
3] Data Privacy and Security:
Data privacy and security is another significant challenge in data mining. As more data is collected, stored, and
analyzed, the risk of data breaches and cyber-attacks increases. The data may contain personal, sensitive, or
confidential information that must be protected. Moreover, data privacy regulations such as GDPR, CCPA, and HIPAA
impose strict rules on how data can be collected, used, and shared.
To address this challenge, data mining practitioners must apply data anonymization and encryption techniques to
protect the privacy and security of the data. Data anonymization involves removing personally identifiable
information (PII) from the data, while data encryption involves using algorithms to encode the data so that it is
unreadable to unauthorized users.
4] Scalability:
Data mining algorithms must be scalable to handle large datasets efficiently. As the size of the dataset increases, the
time and computational resources required to perform data mining operations also increase. Moreover, the
algorithms must be able to handle streaming data, which is generated continuously and must be processed in real-
time.
To address this challenge, data mining practitioners use distributed computing frameworks such as Hadoop and
Spark. These frameworks distribute the data and processing across multiple nodes, making it possible to process
large datasets quickly and efficiently.
5] Interpretability:
Data mining algorithms can produce complex models that are difficult to interpret. This is because the algorithms
use a combination of statistical and mathematical techniques to identify patterns and relationships in the data.
Moreover, the models may not be intuitive, making it challenging to understand how the model arrived at a
particular conclusion.
To address this challenge, data mining practitioners use visualization techniques to represent the data and the models
visually. Visualization makes it easier to understand the patterns and relationships in the data and to identify the
most important variables.
6] Ethics:
Data mining raises ethical concerns related to the collection, use, and dissemination of data. The data may be used to
discriminate against certain groups, violate privacy rights, or perpetuate existing biases. Moreover, data mining
algorithms may not be transparent, making it challenging to detect biases or discrimination.
Data Mining Applications
There are many measurable benefits that have been achieved in different application areas from data mining. So,
let’s discuss different applications of Data Mining:
Scientific Analysis: Scientific simulations generate bulks of data every day. This includes data collected from
nuclear laboratories, data about human psychology, etc. Data mining techniques are capable of analyzing this
data. We can now capture and store new data faster than we can analyze the data already accumulated.
Example of scientific analysis:
Sequence analysis in bioinformatics
Classification of astronomical objects
Medical decision support.
Intrusion Detection: A network intrusion refers to any unauthorized activity on a digital network. Network
intrusions often involve stealing valuable network resources. Data mining techniques play a vital role in intrusion
detection, searching for network attacks and anomalies. These techniques help in selecting and refining useful and
relevant information from large data sets, and help classify relevant data for an Intrusion Detection System. An
Intrusion Detection System generates alarms about foreign invasions in the network traffic.
For example:
Detect security violations
Misuse Detection
Anomaly Detection
Business Transactions: Every transaction in the business industry is memorized for perpetuity. Such transactions are
usually time-related and can be inter-business deals or intra-business operations. The effective and timely use of this
data for competitive decision-making is definitely one of the most important problems for businesses that struggle
to survive in a highly competitive world. Data mining helps to analyze these business transactions, identify
marketing approaches, and support decision-making. Examples:
Direct mail targeting
Stock trading
Customer segmentation
Churn prediction (Churn prediction is one of the most popular Big Data use cases in business)
Market Basket Analysis: Market basket analysis is a technique that involves the careful study of the purchases made
by a customer in a supermarket. It identifies patterns of items frequently purchased together. This analysis can help
companies to promote deals, offers, and sales, and data mining techniques help to achieve this analysis task.
Examples:
Data mining concepts are used in sales and marketing to provide better customer service, improve cross-
selling opportunities, and increase direct mail response rates.
Customer retention, in the form of pattern identification and prediction of likely defections, is possible with data
mining.
Risk assessment and fraud detection also use data mining concepts to identify inappropriate or unusual
behavior.
Education: For analyzing the education sector, data mining uses the Educational Data Mining (EDM) method. This
method generates patterns that can be used by both learners and educators. Using EDM, we can
perform educational tasks such as:
Predicting students' admission in higher education
Profiling students
Predicting student performance
Evaluating teachers' teaching performance
Curriculum development
Predicting student placement opportunities
Research: Data mining techniques can perform prediction, classification, clustering, association, and grouping of
data with precision in the research area. Rules generated by data mining are distinctive for finding results. In most
technical research in data mining, we create a training model and a testing model. The train/test approach is a
strategy to measure the accuracy of the proposed model: we split the data set into two
sets, a training data set and a testing data set. The training data set is used to build the model, whereas the testing
data set is used to evaluate it. Examples:
Classification of uncertain data.
Information-based clustering.
Decision support system
Web Mining
Domain-driven data mining
IOT (Internet of Things)and Cyber security
Smart farming IOT(Internet of Things)
Healthcare and Insurance: A pharmaceutical company can examine its recent sales force activity and its outcomes to
improve the targeting of high-value physicians and figure out which marketing activities will have the best effect in
the upcoming months. In the insurance sector, data mining can help to predict which customers will
buy new policies, identify behavior patterns of risky customers, and identify fraudulent behavior.
Claims analysis, i.e., which medical procedures are claimed together.
Identifying successful medical therapies for different illnesses.
Characterizing patient behavior to predict office visits.
Transportation: A diversified transportation company with a large direct sales force can apply data mining to
identify the best prospects for its services. A large consumer goods organization can apply data mining
to improve its sales process to retailers.
Determine the distribution schedules among outlets.
Analyze loading patterns.
Financial/Banking Sector: A credit card company can leverage its vast warehouse of customer transaction data to
identify customers most likely to be interested in a new credit product.
Credit card fraud detection.
Identify ‘Loyal’ customers.
Extraction of information related to customers.
Determine credit card spending by customer groups.