Knowledge Discovery in Databases: "We Are Drowning in Information and Starving For Knowledge"
Introduction
Knowledge Discovery in Databases (KDD) is a practical application of machine learning methodologies. It is of great interest to analyze the immense amounts of data stored in databases in order to extract whatever value they may hold. The problem is that doing this manually is impossible, so methodologies are needed to automate the discovery process.
Machine Learning
2010/2011
1 / 22
The high point of KDD starts in the early 2000s
Many companies have shown interest in extracting the (possibly) valuable information stored in their databases
The goal is to obtain information that can lead to better commercial strategies and practices (better knowledge of consumers' preferences and behaviour)
Many companies are putting a lot of effort into this kind of technology (Microsoft, IBM, Daimler-Benz, VISA, consulting companies, ...)
Several buzzwords have appeared: Business Intelligence, Business Analytics, Predictive Analytics, Data Science, Big Data, ...
Business applications are not the only promoters of this area
The need to analyze scientific data has driven an important part of the methodologies developed:
Space probes
Satellites
Astronomical observations
Genome Project
Bioinformatics
KDD is an area of research at the intersection of different fields:
Statistical data analysis: classical data analysis and modelling methodologies
Machine learning and pattern recognition: methods for automatic knowledge discovery and knowledge characterization
Databases: efficient data access
Data visualization: tools that help in the discovery process and the interpretation of results
KDD definitions
It is the search for valuable information in large volumes of data
It is the exploration and analysis, by automatic or semi-automatic tools, of large volumes of data in order to discover patterns and rules
It is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data
Elements of KDD
Pattern: any representation formalism capable of describing the common characteristics of a group of instances
Valid: a pattern is valid if it is able to predict the behaviour of new data with some degree of certainty
Novel: knowledge is novel if it is not already known, with respect to both the domain knowledge and any previously discovered knowledge
Useful: new knowledge is useful if it allows actions to be performed that yield some benefit according to an established criterion
Understandable: the discovered knowledge must be analyzed by a domain expert; consequently, the interpretability of the result is important
KDD methodologies have to search the space of patterns, optimizing these characteristics
We need heuristics able to measure them (some are very difficult to assess)
The discovery step is the central part of KDD (data mining)
The process needs additional steps that complete the KDD process
Some general methodologies exist for the KDD process (CRISP-DM, SEMMA)
1. Domain study
2. Creating the dataset
3. Data preprocessing
4. Dimensionality reduction
5. Selection of the discovery goal
6. Selection of the adequate methodologies
7. Data mining
8. Result assessment and interpretation
9. Using the knowledge
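As an illustrative sketch of how these steps chain together (the function names here are hypothetical placeholders, not a standard API, and each stage is reduced to a toy rule):

```python
# Minimal sketch of a KDD pipeline; each function stands in for one
# step of the process described above.

def create_dataset(raw_rows):
    """Step 2: gather the examples; here we just drop absent records."""
    return [row for row in raw_rows if row is not None]

def preprocess(rows):
    """Step 3: toy cleaning rule -- discard examples with missing values."""
    return [row for row in rows if all(v is not None for v in row)]

def reduce_dimensionality(rows, keep):
    """Step 4: keep only the selected attribute indices."""
    return [[row[i] for i in keep] for row in rows]

def data_mine(rows):
    """Step 7: a trivial 'pattern' -- the per-attribute mean."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

raw = [[1.0, 10.0, 3.0], None, [3.0, None, 5.0], [5.0, 30.0, 7.0]]
clean = preprocess(create_dataset(raw))
reduced = reduce_dimensionality(clean, keep=[0, 2])
pattern = data_mine(reduced)
print(pattern)  # [3.0, 5.0]
```

In a real project each placeholder would be a substantial piece of work (steps 5, 6, 8 and 9 involve human judgment and are not shown).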
1. Study of the domain: Gather information about the domain: its characteristics and the goal of the discovery process (attributes, representative examples, types of pattern, sources of data)
2. Creating the dataset: Based on the information from the previous step, decide which sources of data will be used, which attributes will describe the data, and which examples are needed for the goals of the discovery process
3. Data preprocessing and cleaning: The circumstances that affect the quality of the data have to be studied:
Outliers
Noise (does it exist? does it present any pattern? can it be reduced?)
Missing values
Discretization of continuous values
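One common way to handle missing values (one option among many; not a method prescribed by these slides) is mean imputation, sketched here in plain Python:

```python
def impute_means(rows):
    """Replace missing values (None) in each column by the column mean."""
    cols = list(zip(*rows))
    means = []
    for col in cols:
        present = [v for v in col if v is not None]
        means.append(sum(present) / len(present))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

data = [[1.0, 4.0], [None, 6.0], [3.0, None]]
print(impute_means(data))  # [[1.0, 4.0], [2.0, 6.0], [3.0, 5.0]]
```

Mean imputation is simple but can distort the distribution of an attribute; whether it is appropriate depends on why the values are missing.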
Discretization allows the use of methods that only handle qualitative values
It can improve the interpretability of the results
Automatic methods for discretization:
Direct: equal-width bins, equal-frequency bins
Statistical distribution approximation: histograms, function fitting
Entropy based
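The two direct methods can be sketched in plain Python (a minimal illustration, not a library implementation):

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # avoid division by zero when all values are equal
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding ~equal numbers of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)
    return bins

data = [1.0, 2.0, 2.5, 7.0, 8.0, 9.0]
print(equal_width_bins(data, 2))      # [0, 0, 0, 1, 1, 1]
print(equal_frequency_bins(data, 3))  # [0, 0, 1, 1, 2, 2]
```

Note how the two criteria differ: equal-width bins can end up very unbalanced on skewed data, while equal-frequency bins balance the counts at the cost of uneven interval widths.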
4. Data reduction and projection: We have to study which attributes are relevant to our goal (depending on the task, some techniques can be used to measure attribute relevance) and how many examples are needed. Not all data mining algorithms are scalable
Instance selection (do we need all the examples? sampling techniques)
Attribute selection (what is really relevant?)
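Instance selection by simple random sampling can be sketched with the standard library (`sample_instances` is a hypothetical helper written for this illustration):

```python
import random

def sample_instances(rows, fraction, seed=0):
    """Simple random sampling without replacement: keep a fraction of the examples."""
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    n = max(1, int(len(rows) * fraction))
    return rng.sample(rows, n)

rows = list(range(1000))
subset = sample_instances(rows, 0.1)
print(len(subset))  # 100
```

More refined schemes (stratified sampling, for instance) preserve the class proportions of the original data, which matters when the groups are unbalanced.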
It is very important to use methods for attribute selection: they reduce the dimensionality, eliminate irrelevant and redundant information, and usually improve the result of the process (curse of dimensionality). Attribute selection techniques:
Mathematical/statistical techniques: principal component analysis (PCA), projection pursuit, multidimensional scaling
Heuristic functions for attribute relevance (ranking of attributes, search in the space of subsets of attributes)
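As one possible heuristic relevance function (an illustrative choice, not a method prescribed above), attributes can be ranked by the absolute Pearson correlation of each attribute with a target variable:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def rank_attributes(rows, target):
    """Rank attribute indices by |correlation| with the target, best first."""
    cols = list(zip(*rows))
    scores = [(abs(pearson(col, target)), i) for i, col in enumerate(cols)]
    return [i for _, i in sorted(scores, reverse=True)]

# attribute 0 tracks the target perfectly; attribute 1 is noise-like
rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
target = [1.0, 2.0, 3.0, 4.0]
print(rank_attributes(rows, target))  # [0, 1]
```

This kind of univariate ranking is cheap but blind to redundancy between attributes, which is why search-based subset selection is also used.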
5. Selecting the discovery goal: The characteristics of the data, the domain and the aim of the project determine what kinds of analysis are feasible (group partitioning, summarization, classification, discovery of attribute relations, ...)
6. Selecting the adequate methodologies: The goal and the characteristics of the data determine the most adequate methodologies
There are different goals that can be pursued as the result of the discovery process, among them:
Classification: we need models that discriminate among instances belonging to a previously known set of groups (the model may or may not be interpretable)
Clustering/Partitioning/Segmentation: we need to discover models that cluster the data into groups with common characteristics (a characterization of the groups is desirable)
Regression: we look for models that predict the behaviour of continuous variables as a function of others
Summarization: we look for a compact description that summarizes the characteristics of the data
Causal dependence: we need models that reveal the causal dependences among the variables and assess the strength of these dependences
Structure dependence: we need models that reveal patterns among the relations that describe the structure of the data
Change: we need models that discover patterns in data that have temporal or spatial dependence
Methodologies
Classifiers, Regression:
Low interpretability but good accuracy
Can be used for: classification and regression
Statistical regression, function approximation, neural networks, Support Vector Machines, k-NN, Locally Weighted Regression, ...
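Of the methods just listed, k-NN is simple enough to sketch in a few lines of plain Python (illustrative only; real applications would use an optimized implementation):

```python
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Classify a query point by majority vote among its k nearest
    neighbours (squared Euclidean distance)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(range(len(train)), key=lambda i: dist(train[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

train = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8)]
labels = ["a", "a", "b", "b"]
print(knn_predict(train, labels, (0.2, 0.1), k=3))  # a
```

k-NN illustrates the trade-off mentioned above: it can be quite accurate, but the "model" is just the stored examples, so it offers little interpretable structure.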
Applications
The number of applications is uncountable
Business: customer segmentation, customer preferences (publicity campaigns, marketing), fraud detection (credit cards), control of industrial processes
WWW: user behaviour, on-line recommendation, user profiling (web mining)
Scientific applications: pharmacology (drug discovery, screening, in-silico testing), astronomy (identification of astronomical bodies), genetics (gene identification, DNA microarrays, bioinformatics), satellite data analysis (meteorology, astronomy, geology, ...)
Spying! (Echelon, FBI Carnivore, TIA, ...)
Applications
There are many tools available for KDD
Some tools developed in universities (C5.0, CART/MARS, QUEST, ...) have become commercial products
Big fish eats little fish (C5.0, Clementine, SPSS-Clementine, IBM, DBMiner)
Data analysis companies incorporate KDD techniques into classical data analysis tools (SPSS, SAS)
Companies selling databases add KDD tools as an added value (IBM DB2 (Intelligent Miner), SQL Server, Oracle)