

DM Lesson3

A database is a structured collection of data for storage and retrieval, while data mining involves analyzing that data to extract insights. Data mining can be descriptive or predictive, with various task primitives guiding the process. Major issues in data mining include handling diverse data types, ensuring efficiency and scalability, and integrating with data warehousing systems.

What is the difference between database and data mining?

A database is a collection of structured data organized for efficient storage and retrieval, while data mining is analyzing data to extract insights or patterns.

Descriptive data mining is used to summarize and describe the data, while predictive data mining is used to make predictions about future events.

Both techniques have their own advantages and applications, and the choice of technique depends on the specific problem and the nature of the data.
DATA MINING TASK PRIMITIVES:

A DM task is represented in the form of a DM query, which is defined in terms of DM task primitives.

These primitives allow the user to communicate interactively with the DM system.

There are 5 DM task primitives:

1. The set of task-relevant data to be mined
2. The kind of knowledge to be mined
3. The background knowledge to be used in the discovery process
4. The interestingness measures and thresholds for pattern evaluation
5. The expected representation for visualizing the discovered patterns
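
As a rough illustration of the five primitives above, a DM query can be pictured as a record with one field per primitive. The class name `MiningQuery`, its field names, and the sample values are all hypothetical, chosen only for this sketch; they do not correspond to any real query language.

```python
from dataclasses import dataclass, field


@dataclass
class MiningQuery:
    """Hypothetical container with one field per task primitive."""
    task_relevant_data: str                                   # 1. data to be mined
    knowledge_kind: str                                       # 2. kind of knowledge
    background_knowledge: dict = field(default_factory=dict)  # 3. e.g. concept hierarchies
    thresholds: dict = field(default_factory=dict)            # 4. interestingness thresholds
    presentation: str = "table"                               # 5. how to show the results

    def missing_primitives(self):
        """Return the names of primitives that were left empty."""
        return [name for name, value in vars(self).items() if not value]


# Illustrative values only.
query = MiningQuery(
    task_relevant_data="sales_2023",
    knowledge_kind="association",
    background_knowledge={"city": "country"},
    thresholds={"min_support": 0.05, "min_confidence": 0.6},
    presentation="rules",
)
print(query.missing_primitives())
```

A query that leaves a primitive empty can be detected with `missing_primitives()` before any mining starts, which is one simple way the system and the user could interact.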
INTERESTINGNESS OF PATTERN:

- A data mining system can generate millions of patterns every day.
- Among all the patterns generated, how many are really interesting?
- In practice, only a small fraction of the patterns generated would be of interest to any given user.

This raises 3 questions:

1. What makes a pattern interesting?
- It is easily understood by humans.
- It is valid on new/test data.
- It is potentially useful.
2. Can a DM system generate all of the interesting patterns?
- This refers to the completeness of a DM system.
- In reality it is not possible for a DM system to generate all interesting patterns.
3. Can a DM system generate only interesting patterns?
- This refers to the optimization of a DM system.
- When only interesting patterns are generated, the result is easier and more efficient for the user to work with (time is saved).
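
To make the idea of interestingness thresholds concrete, the sketch below filters candidate rules by support and confidence. Support and confidence are common measures for association rules but are an assumption here, since the notes do not name specific measures; the function names, the threshold values, and the basket data are invented for illustration.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions)


def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    n_ante = count_containing(transactions, antecedent)
    if n_ante == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return count_containing(transactions, both) / n_ante


def is_interesting(transactions, antecedent, consequent,
                   min_support=0.4, min_confidence=0.7):
    """Keep a rule only if it clears both (assumed) thresholds."""
    both = set(antecedent) | set(consequent)
    return (support(transactions, both) >= min_support
            and confidence(transactions, antecedent, consequent) >= min_confidence)


# Made-up market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
print(is_interesting(baskets, {"bread"}, {"milk"}))   # support 0.6, confidence 0.75
print(is_interesting(baskets, {"butter"}, {"milk"}))  # support 0.2, confidence 0.5
```

Raising the thresholds moves the system toward generating only interesting patterns, at the cost of possibly missing some, which is the trade-off between questions 2 and 3.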
INTEGRATING A DATA MINING SYSTEM WITH A DB/DW SYSTEM
Integration (coupling) means combining the DM system with a DB/DW system so that the two can communicate.
Without integration, the DM system has no communication with the DB.
There are 4 integration schemes in total:
1. No coupling (0/100)
- The DM system does not communicate with the DB.
- Instead, it reads its data from other storage, such as a file system.
2. Loose coupling (e.g. 15/100)
- The DM system uses some of the DB/DW functionality, but only to a limited extent, mainly to fetch data.
- Something is better than nothing: this is better than no coupling.
- Suitable for small data sets.
3. Semitight coupling (50/100)
- The DM system is linked to the DB.
- Some of the DM primitives are also implemented in the DB.
4. Tight coupling (100/100)
- The DM system is completely linked to the DB.
- The most efficient of the four schemes.
- The DM system is fully integrated into the DB/DW system, so that data mining becomes one functional component of it.
- This allows an efficient and optimized implementation of DM.
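
The difference between loose and semitight coupling can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the `sales` table, its rows, and the two function names are invented, and aggregation is used as the example of a primitive that is pushed into the database.

```python
import sqlite3
from collections import Counter

# Small in-memory database with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("pen", 2), ("ink", 1), ("pen", 5), ("pad", 3)],
)


def totals_loose(conn):
    """Loose coupling: the DB is used only to fetch rows;
    the aggregation runs in the mining program."""
    totals = Counter()
    for item, qty in conn.execute("SELECT item, qty FROM sales"):
        totals[item] += qty
    return dict(totals)


def totals_semitight(conn):
    """Semitight coupling: the aggregation primitive is
    executed inside the DB itself."""
    rows = conn.execute("SELECT item, SUM(qty) FROM sales GROUP BY item")
    return dict(rows)


print(totals_loose(conn))
print(totals_semitight(conn))
```

Both functions return the same totals. The semitight version transfers one row per item instead of one row per sale, which matters more as the data set grows.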
MAJOR ISSUES IN DATA WAREHOUSING AND MINING

1. Mining different kinds of knowledge in databases
2. Interactive mining of knowledge at multiple levels of abstraction
3. Incorporation of background knowledge
4. Presentation and visualization of data mining results
5. Handling noise and incomplete data
6. Efficiency and scalability of data mining algorithms
Issues:
Data mining is not an easy task: the algorithms used can get very complex, and the data is not always available in one place, so it needs to be integrated from various heterogeneous data sources. These factors also create some issues. This tutorial discusses the major issues regarding:

- Mining Methodology and User Interaction
- Performance Issues
- Diverse Data Types Issues

[Diagram: the major issues in data mining]
Mining Methodology and User Interaction Issues:

This category covers the following kinds of issues:

- Mining different kinds of knowledge in databases: Different users may be interested in different kinds of knowledge. Therefore data mining needs to cover a broad range of knowledge discovery tasks.
- Interactive mining of knowledge at multiple levels of abstraction: The data mining process needs to be interactive so that users can focus the search for patterns, providing and refining data mining requests based on the returned results.
- Incorporation of background knowledge: Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but also at multiple levels of abstraction.
- Data mining query languages and ad hoc data mining: A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
- Presentation and visualization of data mining results: Once patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
- Handling noisy or incomplete data: Data cleaning methods are required to handle noise and incomplete objects while mining data regularities. Without data cleaning, the accuracy of the discovered patterns will be poor.
- Pattern evaluation: The patterns discovered should be interesting. Many are not, because they either represent common knowledge or lack novelty, so they need to be evaluated and filtered.
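
The item on noisy or incomplete data above can be illustrated with a very small cleaning step. This is a minimal sketch assuming one particular policy, which the notes do not prescribe: missing readings (`None`) and readings outside a plausible range are replaced by the median of the valid readings. The function name `clean` and the sample ages are hypothetical.

```python
from statistics import median


def clean(values, low, high):
    """Replace missing (None) or out-of-range values with the
    median of the values that are present and within [low, high]."""
    valid = [v for v in values if v is not None and low <= v <= high]
    if not valid:
        raise ValueError("no valid values to impute from")
    fill = median(valid)
    return [v if v is not None and low <= v <= high else fill
            for v in values]


# Made-up ages: one missing value and two noisy values (250 and -4).
ages = [23, None, 31, 250, 27, -4, 35]
cleaned = clean(ages, low=0, high=120)
print(cleaned)
```

Other policies, such as dropping the affected records, are equally possible; the point is only that some cleaning step runs before patterns are mined.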
Performance Issues:
Performance-related issues include the following:

- Efficiency and scalability of data mining algorithms: To extract information effectively from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
- Parallel, distributed, and incremental mining algorithms: Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms incorporate database updates without mining the data again from scratch.
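
The partition, merge, and incremental-update steps described above can be sketched as follows. Item counting stands in for a real mining algorithm, and the function names are hypothetical. A thread pool is used only to show the structure; in CPython, threads would not speed up CPU-bound counting, so this is a sketch of the idea rather than a performance recipe.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Mine one partition: count how many transactions contain each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts


def mine_partitioned(transactions, n_parts=2):
    """Split the data into partitions, mine them concurrently, merge results."""
    parts = [transactions[i::n_parts] for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial_counts = list(pool.map(count_items, parts))
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def update_incrementally(total, new_transactions):
    """Fold newly arrived transactions into existing counts
    without rescanning the old data."""
    total.update(count_items(new_transactions))
    return total


# Made-up transactions.
data = [["a", "b"], ["a"], ["b", "c"], ["a", "c"]]
counts = mine_partitioned(data)
print(counts["a"], counts["b"], counts["c"])

counts = update_incrementally(counts, [["a", "d"]])
print(counts["a"], counts["d"])
```

The merge step works here because per-partition counts simply add up; algorithms whose partial results do not combine this easily need a more careful merge.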
Diverse Data Types Issues:
- Handling of relational and complex types of data: A database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
- Mining information from heterogeneous databases and global information systems: The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Mining knowledge from them therefore adds challenges to data mining.
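
One small part of the heterogeneous-sources problem above is bringing differently shaped records into a common form before mining. The sketch below assumes a structured CSV source and a semi-structured JSON source; the field names `customer` and `amount`, the function names, and the sample data are all invented for illustration.

```python
import csv
import io
import json


def from_csv(text):
    """Structured source: every row has the same columns."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"customer": row["customer"], "amount": float(row["amount"])}
            for row in reader]


def from_json(text):
    """Semi-structured source: fields may be missing or extra."""
    records = []
    for obj in json.loads(text):
        amount = obj.get("amount")
        records.append({
            "customer": obj["customer"],
            "amount": float(amount) if amount is not None else None,
        })
    return records


# Made-up sample data from two sources.
csv_text = "customer,amount\nana,10.5\nbo,3\n"
json_text = '[{"customer": "cy", "amount": 7, "note": "gift"}, {"customer": "di"}]'

unified = from_csv(csv_text) + from_json(json_text)
print(unified)
```

Missing amounts are kept as `None` rather than guessed, so that a later cleaning step can decide how to handle them.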
