Chapter-2 Getting To Know Your Data
DATA WAREHOUSING
• Data Visualization
• Summary
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance
Important Characteristics of Structured Data
• Dimensionality
– It is the number of attributes that the objects in the dataset possess.
– Data with a small number of dimensions tend to be qualitatively different from
moderate- or high-dimensional data.
– The difficulties associated with high-dimensional data are sometimes referred to
as the “curse of dimensionality”. Because of this, dimensionality reduction is
often required.
• Sparsity
– In a database, sparsity and density describe the number of cells in a table that
are empty or zero (sparsity) and that contain information (density).
– For some data sets, such as those with asymmetric features, most attributes of
an object have the value 0; in many cases fewer than 1% of the entries are
non-zero.
– Sparsity is an advantage because usually only the non-zero values need to be
stored and manipulated.
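The storage advantage of sparsity can be sketched in a few lines of Python. This is a minimal illustration, not a production sparse format: the table values and the helper names `to_sparse` and `density` are assumptions made up for the example.

```python
def to_sparse(table):
    """Keep only the non-zero cells as {(row, col): value}."""
    return {
        (r, c): value
        for r, row in enumerate(table)
        for c, value in enumerate(row)
        if value != 0
    }


def density(table):
    """Fraction of cells that hold a non-zero value (sparsity is 1 - density)."""
    total_cells = sum(len(row) for row in table)
    return len(to_sparse(table)) / total_cells


# Hypothetical 3 x 4 table; the values are illustrative only.
table = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]

sparse = to_sparse(table)
print(sparse)              # only the two non-zero cells are stored
print(density(table))      # share of non-zero cells
print(1 - density(table))  # share of empty/zero cells
```

Only two of the twelve cells need to be stored or processed, which is the point made in the last bullet above.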
Important Characteristics of Structured Data (cont..)
• Resolution
– It is frequently possible to obtain data at different levels of resolution, and
often the properties of the data are different at different resolutions.
– For instance, the surface of the Earth seems very uneven at a resolution of a few
meters, but is relatively smooth at a resolution of 10 km.
– The patterns in the data also depend on the level of resolution.
• If the resolution is too fine, a pattern may not be visible or may be
buried in noise.
• If the resolution is too coarse, the pattern may disappear.
For example, variations in atmospheric pressure on a scale of hours reflect the
movement of storms and other weather systems, whereas on a scale of months
such phenomena are not detectable.
• Distribution
– Centrality and dispersion
Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
– sales database: customers, store items, sales
– medical database: patients, treatments
– university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.
Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a
collection of documents
– Sometimes, represented as integer variables
– Binary (0 or 1; Yes/No; Male/Female)
– Boolean (True/False)
– An attribute is countably infinite if the set of possible values is
infinite but the values can be put in a one-to-one
correspondence with natural numbers.
– For example, the attribute customer ID is countably infinite.
Discrete vs. Continuous Attributes (cont..)
• Continuous Attribute
Ordered Data
• Genomic sequence data
• Spatio-Temporal Data
Descriptive Data Summarization
❑ For data preprocessing to be successful, it is essential to have an
overall picture of the data.
❑ Descriptive data summarization techniques can be used to identify
the typical properties of the data and to highlight which data values should
be treated as noise or outliers.
❑ Data characteristics such as central tendency and dispersion
are used to understand the distribution of the data.
Central Tendency: measures the location of the middle or centre
of a data distribution.
e.g. mean, median, mode and midrange
Dispersion of a data set: how the data are spread out.
e.g. range, quartiles, interquartile range, five-number summary,
variance and standard deviation.
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
• The most common and effective numeric measure of the “center” of a
set of data is the (arithmetic) mean.
• Let x1, x2, ..., xN be a set of N values or observations, such as for some
numeric attribute X, like salary. The mean of this set of values is
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
Note: N is population size.
• Median:
– The median of a set of data is the middlemost number in the set.
The median is also the number that is halfway into the set.
– To find the median, the data should first be arranged in order from
least to greatest.
– Middle value if odd number of values, or average of the middle
two values otherwise
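The two rules above (sort first, then take the middle value or the average of the two middle values) can be written out directly. This is a minimal sketch; the function names and the small input lists are illustrative assumptions, not data from the slides.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)          # the data must be ordered first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average of two


# Hypothetical inputs, chosen only to exercise both branches.
print(median([5, 1, 3]))       # odd number of values
print(median([3, 1, 4, 2]))    # even number of values
print(mean([3, 1, 4, 2]))
```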
The Race: this example starts with some raw data (not a grouped frequency yet).
To find the Mean, Alex adds up all the numbers, then divides by how
many numbers there are:
Mean = (59+65+61+62+53+55+60+70+64+56+58+58+62+62+68+65
+56+59+68+61+67)/21 = 1289/21 = 61.38095...
Mean, Median and Mode from Grouped Frequencies
• To find the Median Alex places the numbers in value order and finds the
middle number.
• In this case the median is the 11th number:
53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68,
70
• Median = 61
• To find the Mode, or modal value, Alex places the numbers in value
order then counts how many of each number. The Mode is the number
which appears most often (there can be more than one mode):
53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68,
70
• 62 appears three times, more often than the other values, so Mode = 62
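Alex's three results can be checked with Python's standard `statistics` module. The list below is the same set of 21 values used in the Mean calculation above; the variable names are illustrative.

```python
import statistics

# The 21 raw values from the race example.
times = [59, 65, 61, 62, 53, 55, 60, 70, 64, 56, 58,
         58, 62, 62, 68, 65, 56, 59, 68, 61, 67]

mean_time = statistics.mean(times)      # sum divided by count
median_time = statistics.median(times)  # 11th of the 21 sorted values
modes = statistics.multimode(times)     # list, because there can be several modes

print(sorted(times))
print(mean_time, median_time, modes)
```

`multimode` (Python 3.8+) returns a list, which matches the remark that a data set can have more than one mode.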
Estimating the Median from Grouped Data
$$\text{Estimated Median} = L + \frac{(n/2) - B}{G} \times w$$
where:
L is the lower class boundary of the group
containing the median
n is the total number of values
B is the cumulative frequency of the groups before
the median group
G is the frequency of the median group
w is the group width
For our example:
L = 60.5
n = 21
B = 2 + 7 = 9
G = 8
w = 5
Estimated Median = 60.5 + ((21/2 - 9)/8) × 5 = 61.4375
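A small sketch of the grouped-median estimate, assuming the usual interpolation formula L + ((n/2 - B)/G) × w with the symbols defined above. The function name `grouped_median` is an assumption for illustration.

```python
def grouped_median(L, n, B, G, w):
    """Estimate the median by interpolating inside the median group.

    L: lower class boundary of the median group
    n: total number of values
    B: cumulative frequency of the groups before the median group
    G: frequency of the median group
    w: group width
    """
    return L + ((n / 2 - B) / G) * w


# Values from the example above.
estimate = grouped_median(L=60.5, n=21, B=2 + 7, G=8, w=5)
print(estimate)
```

The estimate is close to, but not identical with, the median of 61 found from the raw data, because grouping discards the exact values.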
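Estimating the Mode from Grouped Data
$$\text{Estimated Mode} = L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w$$
where:
L is the lower class boundary of the modal group
fm-1 is the frequency of the group before the modal group
fm is the frequency of the modal group
fm+1 is the frequency of the group after the modal group
w is the group width
For our example:
L = 60.5
fm-1 = 7
fm = 8
fm+1 = 4
w = 5
Estimated Mode = 60.5 + (8 - 7)/((8 - 7) + (8 - 4)) × 5 = 61.5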
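The grouped-mode estimate can be sketched the same way, assuming the common formula L + (fm - fm-1) / ((fm - fm-1) + (fm - fm+1)) × w. The function and parameter names are illustrative assumptions.

```python
def grouped_mode(L, f_prev, f_modal, f_next, w):
    """Estimate the mode from the modal group and its two neighbours.

    L: lower class boundary of the modal group
    f_prev, f_modal, f_next: frequencies of the group before, the modal
        group itself, and the group after
    w: group width
    """
    rise = f_modal - f_prev    # how much the modal group exceeds the one before
    fall = f_modal - f_next    # how much it exceeds the one after
    return L + rise / (rise + fall) * w


# Values from the example above.
estimate = grouped_mode(L=60.5, f_prev=7, f_modal=8, f_next=4, w=5)
print(estimate)
```

Because the group before the modal group is almost as frequent as the modal group, the estimate sits near the lower end of the modal group.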
Baby Carrots Example
Example: You grew fifty baby carrots using special soil. You dig them up and
measure their lengths (to the nearest mm) and group the results:
Age Example
The ages of the 112 people who live on a tropical island are grouped as follows:
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively
skewed data
• Data in most real applications are not symmetric (a). They may instead
be either positively skewed (b), where the mode occurs at a value that is
smaller than the median or negatively skewed (c), where the mode
occurs at a value greater than the median.
Measuring the Dispersion of Data
Range
• Let x1, x2, . . . , xN be a set of observations for some numeric attribute, X.
• The range of the set is the difference between the largest (max()) and smallest
(min()) values.
Standard Deviation
• The standard deviation (usually abbreviated SD, sd, or just s) of a set of
numbers tells you how much the individual numbers tend to differ (in either
direction) from the mean. For a sample of n values it is calculated as follows:
$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$$
But why n-1? If you knew the sample mean, and all but one of the values, you could
calculate what that last value must be. Statisticians say there are n-1 degrees of
freedom.
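A minimal sketch of the range and the sample standard deviation with the n-1 divisor discussed above. The data values are hypothetical, and `statistics.stdev` from the standard library is used only as a cross-check.

```python
import math
import statistics


def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


def sample_sd(values):
    """Sample standard deviation: squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


# Hypothetical observations, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(value_range(data))
print(sample_sd(data))           # divides by n - 1
print(statistics.stdev(data))    # library version, also n - 1
print(statistics.pstdev(data))   # population version, divides by n
```

The population version is slightly smaller because it divides by n instead of n-1.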
Measuring the Dispersion of Data (cont..)
• Several other useful measures of dispersion are related to the SD:
• The variance is just the square of the SD. For the IQ example, the
variance = 14.4² = 207.36.
Quartiles
Quartiles divide an ordered data set into four equal parts.
The values that separate the parts are called the first, second, and third quartiles;
they are denoted by Q1, Q2, and Q3, respectively.
Q1 is the middle value of the first half of the ordered data set, Q2 is the median value
in the set, and Q3 is the middle value in the second half of the ordered data set.
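The definition above can be sketched as a "median of each half" calculation. One detail is an assumption here: when the number of values is odd, the overall median is left out of both halves. Other quartile conventions exist and can give slightly different numbers. The input is the set of 21 values from the earlier race example.

```python
def median_of_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n the median itself belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)


times = [59, 65, 61, 62, 53, 55, 60, 70, 64, 56, 58,
         58, 62, 62, 68, 65, 56, 59, 68, 61, 67]

q1, q2, q3 = quartiles(times)
print(q1, q2, q3)
print(q3 - q1)   # interquartile range (IQR)
```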
– In the previous example, for the diastolic blood pressures, the lower limit
Q1 - 1.5 × IQR is 64 - 1.5(77-64) = 44.5 and the upper limit Q3 + 1.5 × IQR is
77 + 1.5(77-64) = 96.5. The diastolic blood pressures range from 62 to 81.
Therefore there are no outliers.
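The outlier limits in this example follow the 1.5 × IQR rule, which can be sketched as below. The function names are illustrative; only the quartiles (64 and 77) and the observed extremes (62 and 81) are taken from the text.

```python
def outlier_limits(q1, q3, k=1.5):
    """Lower and upper limits: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True if the value lies outside the two limits."""
    lower, upper = outlier_limits(q1, q3, k)
    return value < lower or value > upper


# Quartiles of the diastolic blood pressures quoted above.
lower, upper = outlier_limits(q1=64, q3=77)
print(lower, upper)

# The smallest and largest observed values quoted above.
print(is_outlier(62, 64, 77), is_outlier(81, 64, 77))
```

Since both extremes fall inside the limits, no value in between can be an outlier either.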
Example : The Full Framingham Cohort Data
• The Framingham Heart Study is a long-term, ongoing cardiovascular
cohort study on residents of the city of Framingham, Massachusetts.
The study began in 1948 with 5,209 adult subjects from Framingham,
and is now on its third generation of participants.
Since there are no suspected outliers in the subsample of n=10 participants, the mean and
standard deviation are the most appropriate statistics to summarize average values and
dispersion, respectively, of each of these characteristics.
• For clarity, we have so far used a very small subset of the Framingham Offspring
Cohort to illustrate calculations of summary statistics and determination of
outliers. For your interest, Table 3 displays the means, standard deviations,
medians, quartiles and interquartile ranges for each of the continuous variables
displayed in Table 1 in the full sample (n=3,539) of participants who attended the
seventh examination of the Framingham Offspring Study.
Which statistics are most appropriate to summarize the average or typical values and
the dispersion for each variable?
Observations on the example
• In the full sample, each of the characteristics has outliers on the upper
end of the distribution as the maximum values exceed the upper limits
in each case. There are also outliers on the low end for diastolic blood
pressure and total cholesterol, since the minimums are below the lower
limits.
• For some of these characteristics, the difference between the upper
limit and the maximum (or the lower limit and the minimum) is small
(e.g., height, systolic and diastolic blood pressures), while for others
(e.g., total cholesterol, weight and body mass index) the difference is
much larger. This method for determining outliers is a popular one but
not generally applied as a hard and fast rule. In this application it
would be reasonable to present means and standard deviations for
height, systolic and diastolic blood pressures and medians and
interquartile ranges for total cholesterol, weight and body mass index.
The beauty of the normal curve: the 68-95-99.7 Rule
No matter what μ and σ are (μ: mean, σ: standard deviation),
• the area between μ-σ and μ+σ is about 68%;
• the area between μ-2σ and μ+2σ is about 95%;
• the area between μ-3σ and μ+3σ is about 99.7%.
• Almost all values fall within 3 standard deviations of the mean.
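The three percentages can be reproduced numerically. For a normal distribution, the area within k standard deviations of the mean equals erf(k/√2), which does not depend on μ or σ. The sketch below uses `math.erf` from the standard library; the function name `area_within` is an illustrative assumption.

```python
import math


def area_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sd: {area_within(k):.4f}")
```

Rounded to the nearest tenth of a percent, the three areas give the 68-95-99.7 figures quoted above.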
Graphic Displays of Basic Statistical Descriptions
• Boxplot: graphic display of five-number summary
• Histogram: x-axis shows values, y-axis represents frequencies
• Quantile plot: each value xi is paired with fi, indicating that
approximately 100 fi % of the data are ≤ xi
• Quantile-quantile (q-q) plot: graphs the quantiles of one
univariate distribution against the corresponding quantiles of
another
• Scatter plot: each pair of values is a pair of coordinates and plotted
as points in the plane
Boxplot Analysis
• Boxplots are a popular way of visualizing
a distribution. A boxplot incorporates the
five-number summary as follows:
• Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and
third quartiles, i.e., the height of the box
is IQR
– The median is marked by a line within the box
– Two lines (called whiskers) outside the box extend to the minimum and maximum
Figure: Boxplot for the unit price data for items sold at four branches of
AllElectronics during a given time period.
Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, correlation, etc.
• Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Positively and Negatively Correlated Data
Similarity and Dissimilarity
Data mining applications such as clustering, outlier analysis, and
classification need a way to measure how similar or dissimilar one
data object is to another.
• Similarity
– A numerical measure of how alike two data objects are
– Value is higher when objects are more alike and lower when
objects are more dissimilar
– Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
– A numerical measure of how different two data objects are
– Value is higher when objects are more dissimilar and lower when
objects are more alike
– Minimum dissimilarity is often 0 and the upper limit varies
• Proximity refers to either a similarity or a dissimilarity
Data Matrix and Dissimilarity Matrix
• Data Matrix
– Object by attribute structure
– n data points (rows) with p
dimensions/attributes (columns) in the
form of a relational table
• Dissimilarity Matrix
– Object by Object structure
– n data points’ distance or
proximity with respect to all pairs.
– A triangular matrix
Similarity and Dissimilarity
Types of Attribute
– Nominal Attribute
– Ordinal Attribute
– Binary Attribute
– Numeric Attribute
Similarity/Dissimilarity for Simple
Attributes
The following table shows the similarity and dissimilarity between
two objects, x and y, with respect to a single, simple attribute.
Proximity Measure for Nominal Attributes
A nominal attribute can take 2 or more states.
Example: map_color is nominal with states <red, yellow, blue, green>
Dissimilarity Between Nominal Attributes
– Simple ratio of mismatches
$$d(i,j) = \frac{p - m}{p}$$
m: number of matches
p: total number of variables/attributes
Similarity Between Nominal Attributes
– Simple ratio of matches
$$sim(i,j) = 1 - d(i,j) = \frac{m}{p}$$
Proximity Measure for Nominal Attributes
(Example)
Obj_Id   Test-1 (Nominal)   Test-2 (Ordinal)   Test-3 (Numeric)
1        Code A             Excellent          45
2        Code B             Fair               22
3        Code C             Good               64
4        Code A             Excellent          28
Here, only Col-1 (Test-1) is a nominal attribute, so p = 1.
Dissimilarity Matrix: d(i,j) =
0
1   0
1   1   0
0   1   1   0
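The nominal example can be checked with a short sketch, assuming the mismatch ratio d(i, j) = (p - m) / p, where m is the number of matching attributes and p the number of nominal attributes. The function name and the way the column is stored are illustrative assumptions.

```python
def nominal_dissimilarity(a, b):
    """Ratio of mismatching attributes: (p - m) / p."""
    p = len(a)                                   # total nominal attributes
    m = sum(1 for x, y in zip(a, b) if x == y)   # number of matches
    return (p - m) / p


# Test-1 column of the table above; each object has one nominal attribute.
objects = [("Code A",), ("Code B",), ("Code C",), ("Code A",)]

# Lower-triangular dissimilarity matrix, row i holds d(i, 0..i).
matrix = [
    [nominal_dissimilarity(objects[i], objects[j]) for j in range(i + 1)]
    for i in range(len(objects))
]

for row in matrix:
    print(row)
```

Objects 1 and 4 share the same code, so their dissimilarity is 0; every other pair differs and gets 1.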
Proximity Measure for Binary Attributes
A contingency table is used to describe the distance between two
objects (say P and Q) that have only binary attributes:
                     Object Q
                1          0          sum
Object P   1    f11        f10        f11+f10
           0    f01        f00        f01+f00
         sum    f11+f01    f10+f00
Proximity Measure for Binary Attributes
[Dissimilarity]
• Distance measure of Binary Attribute
Among the three patients, Jack and Mary have similar diseases [as their distance is smaller]
Similarity between Binary Variables [Example]
Among the patients, Jack and Mary have similar diseases [as their similarity is higher]
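The counts f11, f10, f01 and f00 from the contingency table can be turned into distances as sketched below. Two common variants are shown as an assumption, since the text does not say which one it uses: the symmetric distance counts 0-0 matches, while the asymmetric distance ignores them. The two binary vectors and all function names are hypothetical and serve only as an illustration.

```python
def contingency(p, q):
    """Return (f11, f10, f01, f00) for two equal-length 0/1 vectors."""
    pairs = list(zip(p, q))
    f11 = pairs.count((1, 1))
    f10 = pairs.count((1, 0))
    f01 = pairs.count((0, 1))
    f00 = pairs.count((0, 0))
    return f11, f10, f01, f00


def symmetric_distance(p, q):
    """Mismatches over all attributes (0-0 matches count as agreement)."""
    f11, f10, f01, f00 = contingency(p, q)
    return (f10 + f01) / (f11 + f10 + f01 + f00)


def asymmetric_distance(p, q):
    """Mismatches over attributes where at least one object has a 1."""
    f11, f10, f01, _ = contingency(p, q)
    denominator = f11 + f10 + f01
    return (f10 + f01) / denominator if denominator else 0.0


# Hypothetical binary records (1 = present/positive, 0 = absent/negative).
p = [1, 0, 1, 0, 0, 0]
q = [1, 0, 1, 0, 1, 0]

print(contingency(p, q))
print(symmetric_distance(p, q))
print(asymmetric_distance(p, q))
print(1 - asymmetric_distance(p, q))   # corresponding similarity
```

The asymmetric variant is usually preferred when a 1 is much more informative than a 0, for example a positive test result.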
Proximity Measure for Numeric Attributes
[Dissimilarity]
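Commonly used distance measures for computing dissimilarity between objects
i and j with p numeric attributes are
• Euclidean Distance: $d(i,j) = \sqrt{\sum_{f=1}^{p}(x_{if} - x_{jf})^2}$
• Manhattan Distance: $d(i,j) = \sum_{f=1}^{p}|x_{if} - x_{jf}|$
Dissimilarity Matrix
(with Euclidean Distance)
      x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0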
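A minimal sketch of building a Euclidean dissimilarity matrix. The data matrix is not reproduced in the text, so the four 2-D points below are an assumption: they were chosen because they reproduce the Manhattan distances listed in the next matrix. With these points, d(x1, x3) comes out as √5 ≈ 2.24.

```python
import math

# Assumed data matrix (not given in the text): four objects, two attributes.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def euclidean(a, b):
    """Square root of the summed squared attribute differences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


names = list(points)
# Lower-triangular matrix: row i holds the distances to objects 0..i.
matrix = [
    [round(euclidean(points[names[i]], points[names[j]]), 2) for j in range(i + 1)]
    for i in range(len(names))
]

for name, row in zip(names, matrix):
    print(name, row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.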
Proximity Measure for Numeric Attributes
[Dissimilarity- Manhattan Distance]
Data Matrix
Dissimilarity Matrix
(with Manhattan Distance)
      x1     x2     x3     x4
x1    0
x2    5      0
x3    3      6      0
x4    6      1      7      0
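Proximity Measure for Numeric Attributes
[Dissimilarity- Minkowski Distance]
Data Matrix
Minkowski Distance: $d(i,j) = \left(\sum_{f=1}^{p}|x_{if} - x_{jf}|^{h}\right)^{1/h}$
With h = 1 (Manhattan distance):
      x1     x2     x3     x4
x1    0
x2    5      0
x3    3      6      0
x4    6      1      7      0
With h = 2 (Euclidean distance):
      x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0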
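The Minkowski distance generalizes both earlier measures: h = 1 gives the Manhattan distance and h = 2 the Euclidean distance. The sketch below assumes the standard definition (sum of |difference|^h, raised to 1/h). The four points are again an assumption, since the data matrix is not reproduced in the text; they reproduce the Manhattan values listed above.

```python
# Assumed data matrix (not given in the text).
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def minkowski(a, b, h):
    """Minkowski distance of order h between two equal-length points."""
    return sum(abs(u - v) ** h for u, v in zip(a, b)) ** (1 / h)


def lower_triangle(h):
    """Lower-triangular dissimilarity matrix for a given order h."""
    names = list(points)
    return [
        [round(minkowski(points[names[i]], points[names[j]], h), 2)
         for j in range(i + 1)]
        for i in range(len(names))
    ]


print(lower_triangle(1))   # h = 1: Manhattan
print(lower_triangle(2))   # h = 2: Euclidean
```

For any pair of points the h = 2 distance is never larger than the h = 1 distance, which is a quick sanity check when comparing the two matrices.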
Proximity Measure for Ordinal Attributes
■ Ordinal variables can be discrete or continuous. Their values have a
meaningful order or ranking, but the magnitude between successive ranks
is unknown.
■ Example:
■ Attribute Size: <small, medium, large>
to 30: warm>
■ For an attribute f, let Mf represent the total number of possible states.
The ordered states then have ranks <1 ... Mf>.
■ For an object i, let rif be the rank of its value for attribute f:
$r_{if} \in \{1, \ldots, M_f\}$
■ Since each ordinal attribute can have a different number of states, it
is necessary to normalize the range onto [0.0, 1.0], replacing each
rank rif by zif:
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
The dissimilarity matrix is then computed on the zif values.
Proximity Measure for Ordinal Attributes
Obj_Id   Test-1 (Nominal)   Test-2 (Ordinal)   Test-3 (Numeric)
1        Code A             Excellent          45
2        Code B             Fair               22
3        Code C             Good               64
4        Code A             Excellent          28
Only Test-2 is ordinal. Its normalized form is:
Obj_Id   Test-2 (Ordinal)   Rank (rif)   Normalized Rank (zif)
1        Excellent          3            1
2        Fair               1            0
3        Good               2            0.5
4        Excellent          3            1
Dissimilarity Matrix: d(i, j) =
0
1     0
0.5   0.5   0
0     1     0.5   0
Similarity Matrix: sim(i, j) = 1 - d(i, j) =
1
0     1
0.5   0.5   1
1     0     0.5   1
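The ordinal example can be sketched as follows, assuming the usual normalization zif = (rif - 1) / (Mf - 1) and taking the absolute difference of the normalized ranks as the dissimilarity. The state order Fair < Good < Excellent is inferred from the ranks in the table; the names `ORDER` and `normalized_rank` are illustrative.

```python
# Ordered states of Test-2, lowest rank first (ranks 1..M).
ORDER = ["Fair", "Good", "Excellent"]


def normalized_rank(state, order=ORDER):
    """Map a state to its rank r in 1..M, then to z = (r - 1) / (M - 1)."""
    r = order.index(state) + 1
    M = len(order)
    return (r - 1) / (M - 1)


# Test-2 column for objects 1..4.
test2 = ["Excellent", "Fair", "Good", "Excellent"]
z = [normalized_rank(s) for s in test2]

# Lower-triangular dissimilarity and similarity matrices.
d = [[abs(z[i] - z[j]) for j in range(i + 1)] for i in range(len(z))]
sim = [[1 - value for value in row] for row in d]

print(z)
for row in d:
    print(row)
```

After normalization every ordinal attribute lies on the same [0, 1] scale, so attributes with different numbers of states contribute comparably.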
Proximity Measure for Mixed Attributes
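$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
Where
$d_{ij}^{(f)}$ is the dissimilarity between objects i and j on attribute f
$\delta_{ij}^{(f)}$ is the indicator:
– 0 if xif or xjf is missing, or if xif = xjf = 0 for an asymmetric binary attribute
– 1 otherwise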
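A sketch of the mixed-attribute measure applied to the four-object table used in the nominal and ordinal examples. Several details are assumptions, not statements from the text: the overall dissimilarity is taken as the indicator-weighted average of the per-attribute dissimilarities, the numeric attribute is scaled by its range (max - min), and a missing value is represented by `None` so that its indicator becomes 0.

```python
ORDER = ["Fair", "Good", "Excellent"]   # ordinal states, lowest first

# (Test-1 nominal, Test-2 ordinal, Test-3 numeric) for objects 1..4.
rows = [
    ("Code A", "Excellent", 45),
    ("Code B", "Fair", 22),
    ("Code C", "Good", 64),
    ("Code A", "Excellent", 28),
]


def z(state):
    """Normalized rank of an ordinal state, in [0, 1]."""
    return ORDER.index(state) / (len(ORDER) - 1)


def mixed_dissimilarity(a, b, numeric_range):
    """Average of per-attribute dissimilarities whose indicator is 1."""
    parts = []
    if a[0] is not None and b[0] is not None:       # nominal: 0 match, 1 mismatch
        parts.append(0.0 if a[0] == b[0] else 1.0)
    if a[1] is not None and b[1] is not None:       # ordinal: |z_i - z_j|
        parts.append(abs(z(a[1]) - z(b[1])))
    if a[2] is not None and b[2] is not None:       # numeric: |x_i - x_j| / range
        parts.append(abs(a[2] - b[2]) / numeric_range)
    if not parts:
        raise ValueError("no attribute is available for both objects")
    return sum(parts) / len(parts)


numeric = [r[2] for r in rows]
numeric_range = max(numeric) - min(numeric)

matrix = [
    [round(mixed_dissimilarity(rows[i], rows[j], numeric_range), 2)
     for j in range(i + 1)]
    for i in range(len(rows))
]
for row in matrix:
    print(row)
```

Dividing by the number of usable attributes, rather than by p, is what the indicator achieves: a pair with a missing value is compared only on the attributes both objects have.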