Data Mining Lecture 1 – summary
0. Introduction
From data to information
Data mining relates to the area of application
Passive: Observing relations
Active: Finding the underlying model
Passive: Clustering, Classification
Active: Passive + Model construction + parameter estimation
1. Example: PCB production
Normalization and standardization
Hidden parameters and missing values
Input, output, and state parameters
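As an illustration of normalization and standardization, a minimal NumPy sketch (the measurement values are made up; min-max scaling and z-scoring are the two usual variants):

```python
import numpy as np

# Hypothetical stand-in for one measured parameter of the PCB data.
x = np.array([4.1, 3.9, 5.2, 4.8, 4.4])

# Normalization (min-max): rescale the values to the interval [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): shift to zero mean, scale to unit std.
x_std = (x - x.mean()) / x.std()
```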
2. Data as Sets of Observations
Parameters and Observations/Measurements
Ordered data sets; alphanumeric and lexicographic ordering
A data set is an ordered numeric set of n observations of d parameters
Dynamic Sets and Time series
Sampling
Data Quality
3. Statistics
Uncertainty and Probability distributions: normal, uniform, Poisson, Bernoulli, …
Normal distribution
Kolmogorov-Smirnov test
Multivariate sets
Mean
Mean-centered data
Correlation and correlation coefficient
Covariance matrix
Estimation
Outliers and the linear correlation coefficient
Transforming
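A minimal sketch of these statistics in NumPy/SciPy, on synthetic data (the multivariate sample below is made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))       # 100 observations of d = 3 parameters

m = X.mean(axis=0)                  # mean vector
Xc = X - m                          # mean-centered data
C = np.cov(X, rowvar=False)         # covariance matrix (3 x 3)
R = np.corrcoef(X, rowvar=False)    # linear correlation coefficients

# Kolmogorov-Smirnov test: is the first parameter normally distributed?
ks_stat, p_value = stats.kstest(X[:, 0], "norm")
```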
4. Similarity and Distance
Measures for Similarity and Dissimilarity
How similar or dissimilar two data points are
sim(p,q) in [0,1]
sim(p,q) = sim(q,p)
sim(p,p) = 1
dissim also in [0,1] : dissim = 1 – sim
Distance measures and Metrics
Distance measured according to certain rule in the data space
dist(p,q) >= 0
dist(p,q) = dist(q,p)
if dist(p,q) = 0 then p = q
triangle-inequality: dist(p,q) <= dist(p,a) + dist(a,q)
notice that distance relates to the dissimilarity
a vector space with a distance definition is a metric space
Examples of Distances
Euclidean: dist_E(p,q)^2 = (p1 – q1)^2 + (p2 – q2)^2 + … + (pn – qn)^2
Manhattan: dist_M(p,q) = |p1 – q1| + |p2 – q2| + … + |pn – qn|
Max-norm: dist_max(p,q) = max {|p1 – q1|, |p2 – q2|, …, |pn – qn|}
Notice the graphs of dist(p,0) in IR^2 for dist = Euclidean, Manhattan, and max-norm
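The three distances side by side, as a minimal NumPy sketch (the points p and q are arbitrary examples):

```python
import numpy as np

def dist_euclidean(p, q):
    return np.sqrt(np.sum((p - q) ** 2))

def dist_manhattan(p, q):
    return np.sum(np.abs(p - q))

def dist_max(p, q):
    return np.max(np.abs(p - q))

p, q = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(dist_euclidean(p, q), dist_manhattan(p, q), dist_max(p, q))  # 5.0 7.0 4.0
```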
Generalized p-norm (exponent written as d, since p here denotes a data point):
||p||_d = ( |p1|^d + |p2|^d + … + |pn|^d )^(1/d)
notice that for: d = 2: ||p – q||_d = Euclidean distance
for: d = 1: ||p – q||_d = Manhattan distance
for: d = ∞: ||p – q||_d = Max-norm distance
normally d >= 1
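A sketch of the generalized norm as a distance, recovering the three special cases above (the function name is my own):

```python
import numpy as np

def dist_norm(p, q, d):
    """Generalized norm distance: ||p - q||_d = (sum_i |p_i - q_i|^d)^(1/d)."""
    if np.isinf(d):
        return np.max(np.abs(p - q))            # limit d -> infinity: max-norm
    return np.sum(np.abs(p - q) ** d) ** (1.0 / d)

p, q = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(dist_norm(p, q, 1))        # 7.0, Manhattan
print(dist_norm(p, q, 2))        # 5.0, Euclidean
print(dist_norm(p, q, np.inf))   # 4.0, max-norm
```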
Riemannian Metric
Let g be a non-negative definite matrix on IR^d (i.e. all eigenvalues >= 0);
then g induces a Riemannian metric on the space:
||p||_g^2 = p^T g p
Involving peculiarities of the distribution
Let ρ be a probability distribution on a data space, and let m be the mean and C the
covariance matrix associated to ρ:
C = ∫ (x – m) ρ(x) (x – m)^T dx
then the Mahalanobis distance is defined as:
dist_Mahal(x)^2 = (x – m)^T C^-1 (x – m)
It gives a measure for the distance of a data point x to the center of the distribution.
Notice that in the mean-centered space the Mahalanobis distance is a Riemannian
norm with metric g = C^-1.
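A sketch of the Mahalanobis distance estimated from a sample, with g = C^-1 as the Riemannian metric (the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.0], [1.0, 2.0]], size=500)

m = X.mean(axis=0)            # estimate of the mean
C = np.cov(X, rowvar=False)   # estimate of the covariance matrix
g = np.linalg.inv(C)          # Riemannian metric g = C^-1

def dist_mahal(x):
    # Squared distance is the quadratic form (x - m)^T C^-1 (x - m).
    diff = x - m
    return np.sqrt(diff @ g @ diff)

print(dist_mahal(np.array([2.0, 2.0])))
```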
Similarity and Distance
Notice that we can define the similarity between two data points p and q as some
function f: sim(p,q) = f(dist(p,q)). Examples are:
f(d) = 1/(1 + d/L)
f(d) = exp(-d/L)
f(d) = 1 – d/L if d < L and f(d) = 0 if d >= L
where d = dist(p,q) and L is some characteristic length for the problem.
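The three similarity functions as a sketch (the cut-off form is written as 1 – d/L, consistent with sim in [0,1]; the function names are my own):

```python
import numpy as np

def sim_inverse(d, L):
    return 1.0 / (1.0 + d / L)

def sim_exponential(d, L):
    return np.exp(-d / L)

def sim_cutoff(d, L):
    return 1.0 - d / L if d < L else 0.0

for d in (0.0, 0.5, 1.0, 2.0):
    print(sim_inverse(d, 1.0), sim_exponential(d, 1.0), sim_cutoff(d, 1.0))
```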
Correlation
5. Visualizing and Exploring Data
single variable-display:
Histogram
Cumulative distribution
Kernel estimate
Two-variable representation
Scatter plot
Contour plot
Multiple-variable representation
Scatter plot matrix
Trellis plot
Chernoff face
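A minimal matplotlib sketch of the two simplest displays, a histogram and a scatter plot, on synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=20)        # single-variable display
ax1.set_title("Histogram")
ax2.scatter(x, y, s=10)     # two-variable representation
ax2.set_title("Scatter plot")
plt.show()
```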
6. Dimension Reduction
Principal Components Analysis (PCA)
Find the principal directions in the data, and use them to reduce the number of dimensions of
the set by representing the data in linear combinations of the principal components.
Works best for multivariate data. Finds the m < d eigenvectors of the covariance matrix with
the largest eigenvalues; these eigenvectors are the principal components. Decompose the
data in these principal components and thus obtain a more concise data set.
Caution 1: depends on the normalization of the data! Ideal for data with equal units.
Caution 2: works only for linear relations between the parameters.
Caution 3: valuable information can be lost in PCA.
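A sketch of PCA as described above, via the eigenvectors of the covariance matrix (synthetic data; m = 2 kept components is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0, 0],
                            [[3.0, 1.0, 0.2],
                             [1.0, 2.0, 0.1],
                             [0.2, 0.1, 0.5]], size=300)

Xc = X - X.mean(axis=0)               # mean-center the data
C = np.cov(Xc, rowvar=False)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns eigenvalues ascending

order = np.argsort(eigvals)[::-1]     # largest eigenvalues first
m = 2                                 # keep m < d principal components
W = eigvecs[:, order[:m]]             # the principal components (d x m)
X_reduced = Xc @ W                    # the more concise data set (n x m)
```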
Factor Analysis
Represent data with fewer variables
Not invariant under transformations: multiple equivalent solutions
Widely used, especially in the humanities and social sciences (the "alpha" disciplines) and in medicine
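A sketch using scikit-learn's FactorAnalysis as one common implementation (the data and the choice of 2 factors are illustrative):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))       # synthetic data with 6 parameters

# Represent the data with 2 latent factors instead of 6 variables.
fa = FactorAnalysis(n_components=2, random_state=0)
X_factors = fa.fit_transform(X)     # scores of each observation on the factors
```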
Multidimensional scaling
Comparable to PCA, but also applicable when there are non-linear relations between the parameters.
Input: similarity- or distance-map between the data-points. Output: a 2D- (or even higher D)
map of the data points.
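A sketch of MDS on a precomputed distance map, using scikit-learn's MDS as one possible implementation:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 5))         # synthetic high-dimensional data

D = squareform(pdist(X))             # input: distance map between the points
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
X_2d = mds.fit_transform(D)          # output: a 2D map of the data points
```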
Kohonen self-organizing (SO) feature map
(Will be addressed later) Input: distance or similarity-map, Output: a 2D- (or even higher D)
map of the data points.