Unit-1 PPT
(MRU-R22)
UNIT-I
INTRODUCTION: FUNDAMENTALS OF DATA MINING,
DATA MINING FUNCTIONALITIES, CLASSIFICATION OF
DATA MINING SYSTEMS, DATA MINING TASK PRIMITIVES,
INTEGRATION OF A DATA MINING SYSTEM WITH A
DATABASE OR DATA WAREHOUSE SYSTEM, MAJOR ISSUES
IN DATA MINING.
INTRODUCTION TO DATA MINING
Data Mining
Data Mining is defined as extracting information from huge sets of data. In other
words, we can say that data mining is the procedure of mining knowledge from
data.
The overall goal of the data mining process is to extract information from a data
set and transform it into an understandable structure for further use.
The information or knowledge extracted in this way can be used for any of the following applications −
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Apart from these, data mining can also be used in areas such as sports, astronomy, and
Internet Web Surf-Aid.
The key properties of data mining are:
Prediction of likely outcomes
Creation of actionable information
Focus on large datasets and databases
Data Mining Functionalities
Data mining involves the following common classes of tasks:
Association rule learning (dependency modelling) - Searches for relationships between variables.
For example, a supermarket might gather data on customer purchasing habits. Using association
rule learning, the supermarket can determine which products are frequently bought together and
use this information for marketing purposes. This is sometimes referred to as market basket
analysis.
Clustering - is the task of discovering groups and structures in the data that are in some way or
another "similar", without using known structures in the data.
Classification - is the task of generalizing known structure to apply to new data. For example, an
e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
Regression - attempts to find a function which models the data with the least error.
Summarization - providing a more compact representation of the data set, including visualization
and report generation.
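The market basket example above can be sketched in a few lines of Python. This is a minimal illustration, not a full association rule algorithm such as Apriori: it only counts how often pairs of items appear together and keeps the pairs whose support (the fraction of baskets containing both items) reaches a threshold. The function name frequent_pairs, the min_support parameter, and the baskets are all assumptions made up for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs bought together often enough.

    Support is the fraction of transactions that contain both items.
    """
    pair_counts = Counter()
    for basket in transactions:
        # sorted(set(...)) ignores duplicate items in a basket and gives
        # each pair one canonical ordering, e.g. ("bread", "milk").
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1

    total = len(transactions)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


# Hypothetical baskets, invented for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "eggs"],
    ["bread", "milk", "eggs"],
]

print(frequent_pairs(baskets, min_support=0.5))
# prints {('bread', 'milk'): 0.75}
```

Here bread and milk appear together in three of the four baskets, so they are the only pair that passes a 50% support threshold. A supermarket could use such pairs as candidates for joint promotions.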
Classification of Data Mining Systems
Data Mining is a process of discovering various models, summaries, and derived values from a given
collection of data.
The general experimental procedure adapted to data-mining problems involves the following steps:
State the problem and formulate the hypothesis
Collect the data
Preprocess the data
Estimate the model
Interpret the model and draw conclusions
State the problem and formulate the hypothesis
In this step, a modeler usually specifies a set of variables for the unknown dependency and, if
possible, a general form of this dependency as an initial hypothesis. There may be several hypotheses
formulated for a single problem at this stage. The first step requires combined expertise in the
application domain and in data-mining modeling.
Scaling, encoding, and selecting features - Data preprocessing includes several steps such as
variable scaling and different types of encoding. For example, one feature with the range [0, 1]
and the other with the range [-100, 1000] will not have the same weights in the applied technique;
they will also influence the final data-mining results differently.
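The scaling problem described above can be illustrated with min-max scaling, one common way to bring features onto the same range. The sketch below is only one possible choice (standardization to zero mean and unit variance is another), and the function name min_max_scale and the sample values are assumptions for this example.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0
        # rather than dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature values matching the two ranges in the text.
feature_a = [0.0, 0.5, 1.0]          # already in [0, 1]
feature_b = [-100.0, 450.0, 1000.0]  # in [-100, 1000]

print(min_max_scale(feature_a))  # prints [0.0, 0.5, 1.0]
print(min_max_scale(feature_b))  # prints [0.0, 0.5, 1.0]
```

After scaling, both features span the same range, so a distance-based technique no longer lets the second feature dominate simply because its raw values are larger.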
Estimate the model
The selection and implementation of the appropriate data-mining technique is the main task in this
phase.
Efficiency and scalability of data mining algorithms − In order to effectively extract
information from the huge amounts of data in databases, data mining algorithms must be efficient
and scalable.
Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These algorithms
divide the data into partitions, which are then processed in parallel. The results
from the partitions are then merged. Incremental algorithms incorporate database updates without
mining the data again from scratch.
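The partition, process, and merge pattern, together with an incremental update, can be sketched as follows. This is a simplified illustration under several assumptions: the mining task is reduced to counting item occurrences, threads from the Python standard library stand in for the separate processes or machines a real system would use, and the function names (count_items, mine_partitioned, update_incrementally) are invented for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Count in how many transactions of one partition each item occurs."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))  # count each item once per transaction
    return counts


def mine_partitioned(transactions, n_partitions=2):
    """Split the data, count each partition concurrently, merge the results."""
    # Round-robin split: partition i gets transactions i, i + n, i + 2n, ...
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_items, partitions))

    merged = Counter()
    for partial in partial_counts:
        merged.update(partial)  # Counter.update adds counts together
    return merged


def update_incrementally(counts, new_transactions):
    """Fold new transactions into existing counts without a full recount."""
    counts.update(count_items(new_transactions))
    return counts


# Hypothetical transactions, invented for illustration only.
data = [["a", "b"], ["a"], ["b", "c"], ["a", "c"]]
counts = mine_partitioned(data, n_partitions=2)
counts = update_incrementally(counts, [["a", "d"]])
print(sorted(counts.items()))
# prints [('a', 4), ('b', 2), ('c', 2), ('d', 1)]
```

Counting is easy to merge because partial counts simply add up. Many real mining methods need more careful merge steps, which is part of what makes parallel and incremental algorithms a research issue.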
Diverse Data Types Issues
Handling of relational and complex types of data − The database may contain complex data
objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system
to mine all these kinds of data.
Mining information from heterogeneous databases and global information systems − The data is
available from different data sources on a LAN or WAN. These data sources may be structured,
semi-structured, or unstructured. Mining knowledge from them therefore adds challenges to data
mining.