
MALLA REDDY UNIVERSITY

MR22-1CS0148: DATA MINING

II YEAR B.TECH. CSE II - SEM

(MRU-R22)

UNIT-I
INTRODUCTION: FUNDAMENTALS OF DATA MINING,
DATA MINING FUNCTIONALITIES, CLASSIFICATION OF
DATA MINING SYSTEMS, DATA MINING TASK PRIMITIVES,
INTEGRATION OF A DATA MINING SYSTEM WITH A
DATABASE OR DATA WAREHOUSE SYSTEM, MAJOR ISSUES
IN DATA MINING.
INTRODUCTION TO DATA MINING

Data Mining
 Data Mining is defined as extracting information from huge sets of data. In other
words, we can say that data mining is the procedure of mining knowledge from
data.

 It is the computational process of discovering patterns in large data sets involving
methods at the intersection of artificial intelligence, machine learning, statistics,
and database systems.

 The overall goal of the data mining process is to extract information from a data
set and transform it into an understandable structure for further use.
The information or knowledge so extracted can be used in any of the following applications −
 Market Analysis
 Fraud Detection
 Customer Retention
 Production Control
 Science Exploration

Apart from these, data mining can also be applied in areas such as sports, astrology, and Web usage analysis (Surf-Aid).
 The key properties of data mining are:
 Prediction of likely outcomes
 Creation of actionable information
 Focus on large datasets and databases
Data Mining Functionalities
Data mining involves six common classes of tasks:

 Anomaly detection (Outlier/change/deviation detection)


 Association rule learning (Dependency modelling)
 Clustering
 Classification
 Regression
 Summarization

 Anomaly detection (Outlier/change/deviation detection) - The identification of unusual data
records that might be interesting, or data errors that require further investigation.

 Association rule learning (Dependency modelling) - Searches for relationships between variables.
For example, a supermarket might gather data on customer purchasing habits. Using association
rule learning, the supermarket can determine which products are frequently bought together and
use this information for marketing purposes. This is sometimes referred to as market basket
analysis; a small pair-counting sketch is given after this list.

 Clustering - is the task of discovering groups and structures in the data that are in some way or
another "similar", without using known structures in the data.
 Classification - is the task of generalizing known structure to apply to new data. For example, an
e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".

 Regression - attempts to find a function which models the data with the least error.

 Summarization - providing a more compact representation of the data set, including visualization
and report generation.
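
The market-basket idea above can be made concrete with a small sketch (referenced in the association rule bullet). The following Python snippet is a minimal, illustrative pair-counting routine, not a full Apriori implementation; the transactions and the minimum-support threshold are invented for illustration.

from itertools import combinations
from collections import Counter

# Hypothetical supermarket transactions (each set is one customer's basket).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

min_support = 0.4  # a pair must appear in at least 40% of baskets (assumed threshold)

# Count how often each item and each unordered pair of items appears.
item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(transactions)
for (a, b), count in pair_counts.items():
    support = count / n
    if support >= min_support:
        confidence = count / item_counts[a]  # confidence of the rule a -> b
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")

Rules such as "bread -> milk" with high support and confidence are exactly the kind of pattern a supermarket would use for shelf placement or promotions.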
Classification of Data Mining Systems

Data mining systems can be categorized according to various criteria, as follows:

 Classification according to the kinds of databases mined.


 Classification according to the kinds of knowledge mined.
 Classification according to the kinds of techniques utilized.
 Classification according to the applications adapted.
Classification according to the kinds of databases mined.
Database systems can be classified according to different criteria (such as data models, or the types
of data or applications involved), each of which may require its own data mining technique. Data
mining systems can therefore be classified accordingly.
For instance, if classifying according to data models, we may have a relational, transactional, object-
relational, or data warehouse mining system. If classifying according to the special types of data
handled, we may have a spatial, time-series, text, stream data, multimedia data mining system, or a
World Wide Web mining system.

Classification according to the kinds of knowledge mined.


Data mining systems can be categorized according to the kinds of knowledge they mine, that is,
based on data mining functionalities, such as characterization, discrimination, association and
correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis. A
comprehensive data mining system usually provides multiple and/or integrated data mining
functionalities.
Classification according to the kinds of techniques utilized
Data mining systems can be categorized according to the underlying data mining techniques employed.
These techniques can be described according to the degree of user interaction involved (e.g., autonomous
systems, interactive exploratory systems, query-driven systems) or the methods of data analysis employed
(e.g., database-oriented or data warehouse–oriented techniques, machine learning, statistics, visualization,
pattern recognition, neural networks, and so on).

Classification according to the applications adapted


Data mining systems can also be categorized according to the applications they adapt. For example, data
mining systems may be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail,
and so on. Different applications often require the integration of application-specific methods.
Data Mining Process

Data Mining is a process of discovering various models, summaries, and derived values from a given
collection of data.
The general experimental procedure adapted to data-mining problems involves the following steps:
 State the problem and formulate the hypothesis
 Collect the data
 Preprocessing the data
 Estimate the model
 Interpret the model and draw conclusions
 State the problem and formulate the hypothesis
In this step, a modeler usually specifies a set of variables for the unknown dependency and, if
possible, a general form of this dependency as an initial hypothesis. There may be several hypotheses
formulated for a single problem at this stage. The first step requires combined expertise in the
application domain and in data-mining modeling.

 Collect the data


This step is concerned with how the data are generated and collected. In general, there are two
distinct possibilities. The first is when the data-generation process is under the control of an expert
(modeler): this approach is known as a designed experiment. The second possibility is when the
expert cannot influence the data-generation process: this is known as the observational approach. An
observational setting, namely, random data generation, is assumed in most data-mining applications.
 Preprocessing the data
Data preprocessing usually includes at least two common tasks; a short sketch illustrating both is given after this list:
 Outlier detection (and removal) - Outliers are unusual data values that are not consistent with
most observations. Commonly, outliers result from measurement errors, coding and recording
errors, and, sometimes, are natural, abnormal values. Such nonrepresentative samples can
seriously affect the model produced later. There are two strategies for dealing with outliers:
a. Detect and eventually remove outliers as a part of the preprocessing phase, or
b. Develop robust modeling methods that are insensitive to outliers.

 Scaling, encoding, and selecting features - Data preprocessing includes several steps such as
variable scaling and different types of encoding. For example, one feature with the range [0, 1]
and the other with the range [-100, 1000] will not have the same weights in the applied technique;
they will also influence the final data-mining results differently.
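
The two preprocessing tasks above can be illustrated with a short Python sketch (referenced at the start of this list). It removes outliers with the interquartile-range (IQR) rule, which is only one of many possible detection strategies, and then min-max scales the cleaned feature into [0, 1]; the data values and the 1.5 * IQR cutoff are assumptions chosen for illustration.

from statistics import quantiles

# Hypothetical raw feature values; 950.0 looks like a recording error.
values = [12.0, 15.5, 14.2, 13.8, 950.0, 16.1, 12.9]

# 1) Outlier detection and removal: keep only values inside
#    [Q1 - 1.5*IQR, Q3 + 1.5*IQR], a common rule of thumb.
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [v for v in values if low <= v <= high]

# 2) Feature scaling: min-max scale the cleaned values into [0, 1] so this
#    feature does not outweigh another feature with a much smaller range.
lo, hi = min(cleaned), max(cleaned)
scaled = [(v - lo) / (hi - lo) for v in cleaned]

print(cleaned)                        # the recording error is gone
print([round(s, 2) for s in scaled])  # remaining values rescaled to [0, 1]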
 Estimate the model
The selection and implementation of the appropriate data-mining technique is the main task in this
phase; a small least-squares estimation sketch is given after the last step below.

 Interpret the model and draw conclusions


In most cases, data-mining models should help in decision making. Hence, such models need to be
interpretable in order to be useful because humans are not likely to base their decisions on complex
"black-box" models. Note that the goals of accuracy of the model and accuracy of its interpretation
are somewhat contradictory. Usually, simple models are more interpretable, but they are also less
accurate. Modern data-mining methods are expected to yield highly accurate results using high-
dimensional models.
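
To make the "Estimate the model" step (and the regression functionality mentioned earlier) concrete, here is a minimal sketch that fits a straight line y ≈ a + b·x by ordinary least squares in pure Python. The (x, y) observations are invented for illustration; in practice this phase would normally use a statistics or machine-learning library and a more careful validation procedure.

# Hypothetical observations, e.g. advertising spend (x) vs. sales (y).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.3, 6.2, 8.1, 9.9]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Ordinary least squares for y = a + b*x:
#   b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  a = y_bar - b*x_bar
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# Sum of squared errors: the quantity that least squares minimizes.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
print(f"estimated model: y = {a:.2f} + {b:.2f}*x, SSE = {sse:.3f}")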
Architecture of Data Mining
A typical data mining system may have the following major components.
 Knowledge Base
 Data Mining Engine
 Pattern Evaluation Module
 User interface
 Knowledge Base
This is the domain knowledge that is used to guide the search or evaluate the interestingness of
resulting patterns. Examples of domain knowledge are additional interestingness constraints or
thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
 Data Mining Engine
This is essential to the data mining system and ideally consists of a set of functional modules for
tasks such as characterization, association and correlation analysis, classification, prediction, cluster
analysis, outlier analysis, and evolution analysis.
 Pattern Evaluation Module
This component typically employs interestingness measures and interacts with the data mining modules
so as to focus the search toward interesting patterns.
 User interface
This module communicates between users and the data mining system, allowing the user to interact
with the system by specifying a data mining query or task, providing information to help focus the
search, and performing exploratory data mining based on intermediate data mining results.
Data integration
Data Integration: It combines data from multiple sources into a coherent data store, as in data
warehousing. These sources may include multiple databases, data cubes, or flat files.
A data integration system is formally defined as a triple <G, S, M>, where
G is the global schema,
S is the heterogeneous set of source schemas, and
M is the mapping between queries over the source schemas and the global schema.
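
The <G, S, M> formulation can be illustrated with a tiny sketch in Python. The two source schemas, the global schema, and the field mappings below are all hypothetical; a real integration system would express M declaratively over queries rather than as hard-coded field renaming.

# G: the global schema that every integrated record must follow.
GLOBAL_SCHEMA = ("customer_id", "name", "city")

# S: two heterogeneous sources, each with its own schema.
source_crm = [{"cid": 101, "full_name": "Asha Rao", "town": "Hyderabad"}]
source_web = [{"user": 202, "display_name": "Ravi K", "location": "Chennai"}]

# M: mappings from each source schema to the global schema.
MAPPINGS = {
    "crm": {"cid": "customer_id", "full_name": "name", "town": "city"},
    "web": {"user": "customer_id", "display_name": "name", "location": "city"},
}

def integrate(records, mapping):
    """Rename source fields according to M so each record conforms to G."""
    return [{mapping[field]: value for field, value in r.items()} for r in records]

integrated = integrate(source_crm, MAPPINGS["crm"]) + integrate(source_web, MAPPINGS["web"])
print(integrated)  # one coherent store under the global schema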
Issues in Data Mining
 Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available in one place; it needs to be integrated from various heterogeneous data sources.
These factors also create some issues. The major issues fall under the following headings −
 Mining Methodology and User Interaction
 Performance Issues
 Diverse Data Types Issues

Mining Methodology and User Interaction


 Mining different kinds of knowledge in databases − Different users may be interested in
different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of
knowledge discovery tasks.
 Interactive mining of knowledge at multiple levels of abstraction − The data mining process
needs to be interactive because it allows users to focus the search for patterns, providing and
refining data mining requests based on the returned results.
 Incorporation of background knowledge − To guide the discovery process and to express the
discovered patterns, background knowledge can be used. Background knowledge may be
used to express the discovered patterns not only in concise terms but at multiple levels of
abstraction.
 Data mining query languages and ad hoc data mining − A data mining query language that
allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query
language and optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once the patterns are discovered, they
need to be expressed in high-level languages and visual representations. These representations
should be easily understandable.
 Handling noisy or incomplete data − Data cleaning methods are required to handle noise and
incomplete objects while mining data regularities. If such methods are not applied, the accuracy
of the discovered patterns will be poor.
 Pattern evaluation − Not all discovered patterns are interesting; some may represent common
knowledge or lack novelty, so interestingness measures are needed to evaluate them.
 Performance Issues
There can be performance-related issues such as the following −

 Efficiency and scalability of data mining algorithms − In order to effectively extract
information from the huge amounts of data in databases, data mining algorithms must be efficient
and scalable.
 Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the data
into partitions, which are processed in parallel, and the results from the partitions are then
merged. Incremental algorithms incorporate database updates without mining the data again from
scratch.
 Diverse Data Types Issues
 Handling of relational and complex types of data − The database may contain complex data
objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system
to mine all these kinds of data.
 Mining information from heterogeneous databases and global information systems − The data is
available from different data sources on a LAN or WAN. These data sources may be structured,
semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data
mining.
