
DATA MINING AND

DATA WAREHOUSING

Getting to Know Your Data


Chapter-2 Getting to Know Your Data

• Data Objects and Attribute Types

• Basic Statistical Descriptions of Data

• Data Visualization

• Measuring Data Similarity and Dissimilarity

• Summary
What is Data?
• A collection of data objects and their attributes
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– An attribute is also known as a variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
– An object is also known as a record, point, case, sample, entity, or instance
Important Characteristics of Structured Data
• Dimensionality
– It is the number of attributes that the objects in the data set possess.
– Data with a small number of dimensions tend to be qualitatively different from
moderate- or high-dimensional data.
– The difficulties that arise with high-dimensional data are sometimes referred to
as the "curse of dimensionality", and because of this, dimensionality reduction is
often required.
• Sparsity
– In a database, sparsity and density describe the number of cells in a table that
are empty or zero (sparsity) versus those that contain information (density).
– For some data sets, such as those with asymmetric features, most attributes of
an object have the value 0; in many cases fewer than 1% of the entries are
non-zero.
– Sparsity is an advantage because usually only the non-zero values need to be
stored and manipulated.
Important Characteristics of Structured Data
• Resolution
– It is frequently possible to obtain data at different levels of resolution, and
the properties of the data often differ at different resolutions.
– For instance, the surface of the Earth seems very uneven at a resolution of a
few meters, but is relatively smooth at a resolution of 10 km.
– The patterns in the data also depend on the level of resolution:
• If the resolution is too fine, a pattern may not be visible or may be
buried in noise.
• If the resolution is too coarse, the pattern may disappear.
For example, variations in atmospheric pressure on a scale of hours reflect the
movement of storms and other weather systems, whereas on a scale of months
such phenomena are not detectable.
• Distribution
– The centrality and dispersion of the data values.
Chapter-2 Getting to Know Your Data

• Data Objects and Attribute Types

• Basic Statistical Descriptions of Data

• Data Visualization

• Measuring Data Similarity and Dissimilarity

• Summary
Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
– sales database: customers, store items, sales
– medical database: patients, treatments
– university database: students, professors, courses
• Also called samples, examples, instances, data points,
objects, or tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.
Attributes

• Attribute (or dimension, feature, variable):
A data field representing a characteristic or feature of a data object.
– E.g., customer_ID, name, address

• Observed values for a given attribute are known as observations.

• A set of attributes used to describe a given object is called an
attribute vector (or feature vector).
It contains the set of database attributes that you have chosen to
represent (describe) uniquely each data element (tuple).

• The distribution of data involving one attribute (or variable) is
called univariate.
• A bivariate distribution involves two attributes, and so on.
Attribute Values
• Attribute values are numbers or symbols assigned to an
attribute for a particular object

• Distinction between attributes and attribute values


– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters

– Different attributes can be mapped to the same set of values


• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
• Attribute Type:
1. Categorical (Qualitative)
2. Numeric (Quantitative)
Categorical Data (Qualitative Attribute Types)
• Nominal: Nominal means "relating to names". The values of a
nominal attribute are symbols or names of things, for example,
Hair_color = {auburn, black, blond, brown, grey, red, white}
– marital status, occupation, ID numbers, zip codes
[The values do not have any meaningful order about them.]
• Binary: Nominal attribute with only 2 states (0 and 1), where 0
typically means that the attribute is absent, and 1 means that it is
present. Binary attributes are referred to as Boolean if the two
states correspond to true and false.
– Symmetric binary: both outcomes equally important
• e.g., gender
– Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g.,
HIV positive)
Categorical Data (Qualitative Attribute Types)...
• Ordinal
– Values have a meaningful order (ranking) but magnitude between
successive values is not known.
– Size = {small, medium, large}, grades, army rankings
– Other examples of ordinal attributes include Grade (e.g., A+, A,
A−, B+, and so on) and
– Professional rank. Professional ranks can be enumerated in a
sequential order, such as assistant, associate, and full for professors.

The central tendency of an ordinal attribute can be represented by its


mode and its median (the middle value in an ordered sequence), but
the mean cannot be defined.

Qualitative attributes describe a feature of an object, without giving


an actual size or quantity. The values of such qualitative attributes are
typically words representing categories.
Numeric Data (Quantitative Attribute Types)
• A numeric attribute is quantitative; that is, it is a measurable quantity, represented
in integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.
• Interval
• Measured on a scale of equal-sized units.
• The values of interval-scaled attributes have order and can be positive,
0, or negative.
– E.g., temperature in C˚ or F˚, calendar dates
• No true zero-point
• Ratio
• Inherent zero-point
• We can speak of values as being a multiple (or ratio) of another
value.
– e.g., temperature in Kelvin [Unlike Celsius and Fahrenheit, the
Kelvin scale has a true zero-point: at 0 K the particles have
zero kinetic energy].
– e.g., years of experience in object employee, no_of_words in
object Documents, length, counts, monetary quantities
Discrete vs. Continuous Attributes

• Discrete Attribute
– Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a
collection of documents
– Sometimes, represented as integer variables
– Binary (0 or 1; Yes/No; Male/Female)
– Boolean (True/False)
– An attribute is countably infinite if the set of possible values is
infinite but the values can be put in a one-to-one
correspondence with natural numbers.
– For example, the attribute customer ID is countably infinite.

Discrete vs. Continuous Attributes (cont..)

• Continuous Attribute
‒ Has real numbers as attribute values
• Numeric (e.g., salaries, ages, temperatures, rainfall, sales,
height, or weight)
‒ Practically, real values can only be measured and represented
using a finite number of digits
‒ Continuous attributes are also represented as floating-point
variables
Properties of Attribute Values
The type of an attribute depends on which of the following
properties/operations it possesses:
– Distinctness : = and ≠
– Order : <, ≤, >, and ≥
– Addition : + and - (Differences are meaningful)
– Multiplication : * and / (Ratios are meaningful)

– Nominal attribute: distinctness


– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & meaningful differences
– Ratio attribute: all 4 properties/operations
Different types of Attributes
Types of data sets
• Record Data
– Data Matrix
– Document Data (Sparse Data Matrix)
– Transaction Data
• Graph Data
– World Wide Web
– Molecular Structures
• Ordered Data
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record data set
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the
data objects can be thought of as points in a multi-dimensional
space, where each dimension represents a distinct attribute

• Such a data set can be represented by an m by n matrix, where there


are m rows, one for each object, and n columns, one for each
attribute
Document Data
• Each document becomes a ‘term’ vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the
corresponding term occurs in the document.
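
As a small illustration of this idea, the sketch below builds term-count vectors for a few documents; the documents and the resulting vocabulary are made up for the example.

```python
# A minimal sketch of building document "term" vectors as described above.
# The example documents are invented for illustration.
from collections import Counter

docs = [
    "the team played the game",
    "coach and team win the game",
    "ball game score team win",
]

# Vocabulary: every distinct term across the collection, in a fixed order.
vocab = sorted({term for doc in docs for term in doc.split()})

# Each document becomes a vector of term counts (one component per term).
term_vectors = [
    [Counter(doc.split())[term] for term in vocab]
    for doc in docs
]

print(vocab)
for vec in term_vectors:
    print(vec)
```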
Transaction Data
• A special type of record data, where

– Each record (transaction) involves a set of items.


– For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute a
transaction, while the individual products that were purchased
are the items.
Graph Data
Examples: Generic graph, a molecule, and webpages

Benzene Molecule: C6H6


Ordered Data
• Sequences of transactions
(Figure: a sequence of transactions; each element of the sequence is a set of items/events)
Ordered Data
• Genomic sequence data
Ordered Data
• Spatio-Temporal Data
(Figure: average monthly temperature of land and ocean)

Chapter-2 Getting to Know Your Data

• Data Objects and Attribute Types

• Basic Statistical Descriptions of Data

• Data Visualization

• Measuring Data Similarity and Dissimilarity

• Summary
Descriptive Data Summarization
❑ For data preprocessing to be successful, it is essential to have an
overall picture of data.
❑ Descriptive data summarization techniques can be used to identify
the typical properties of the data and highlight which data values should
be treated as noise or outliers.
❑ Data characteristics such as the central tendency and dispersion of the
data are used to understand the distribution of the data.
Central Tendency: Measure the location of the middle or centre
of data distribution.
e.g. mean, median, mode and midrange
Dispersion of Data set: How data are spread out.
e.g. range, quartiles, interquartile range, five number summary,
variance and standard deviation.
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
• The most common and effective numeric measure of the "center" of a set of
data is the (arithmetic) mean. Let x1, x2, ..., xN be a set of N values or
observations, such as for some numeric attribute X, like salary. The mean of
this set of values is

    \bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i

Note: N is the population size.
• Sometimes, each value xi in a set may be associated with a weight wi for
i = 1, ..., N. The weights reflect the significance, importance, or occurrence
frequency attached to their respective values. In this case, we can compute

    \bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}

This is called the weighted arithmetic mean or the weighted average.
Measuring the Central Tendency(Mean)...
• A trimmed mean (sometimes called a truncated mean) is similar to a mean, but
it trims any outliers. Outliers can affect the mean (especially if there are just one
or two very large values), so a trimmed mean can often be a better fit for data sets
with erratic high or low values or for extremely skewed distributions. Even a
small number of extreme values can corrupt the mean.
• For example, the mean salary at a company may be substantially pushed up by
that of a few highly paid managers. Similarly, the mean score of a class in an
exam could be pulled down quite a bit by a few very low scores.
• The trimmed mean is thus the mean obtained after chopping off values at the high
and low extremes.
• Example: Find the 20% trimmed mean for the following test scores: 60, 81, 83, 91, 99.
– Step 1: Trim the top and bottom 20% from the data. That leaves us with the middle
three values: 81, 83, 91.
– Step 2: Find the mean of the remaining values. The mean is (81 + 83 + 91) / 3 = 85.
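
The sketch below illustrates these three measures with NumPy/SciPy; the salary values and weights are invented for the example, and only the final line uses the test scores from the slide.

```python
# Illustrative sketch of the mean, weighted mean, and trimmed mean.
# The salary values and weights below are invented for demonstration.
import numpy as np
from scipy import stats

salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110], dtype=float)
weights = np.array([1, 1, 2, 2, 3, 3, 3, 2, 2, 1, 1, 1], dtype=float)

mean = salaries.mean()                                  # arithmetic mean
weighted_mean = np.average(salaries, weights=weights)   # weighted arithmetic mean
trimmed_mean = stats.trim_mean(salaries, 0.2)           # chops off 20% at each end

print(mean, weighted_mean, trimmed_mean)

# The 20% trimmed mean of the test scores from the slide:
print(stats.trim_mean([60, 81, 83, 91, 99], 0.2))       # -> 85.0
```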
Measuring the Central Tendency...

• Median:
– The median of a set of data is the middlemost number in the set.
The median is also the number that is halfway into the set.
– To find the median, the data should first be arranged in order from
least to greatest.
– Middle value if odd number of values, or average of the middle
two values otherwise

– What will be the median estimated by interpolation (for


grouped data)?
Measuring the Central Tendency...
• Mode
– The mode for a set of data is the value that occurs most
frequently in the set. Therefore, it can be determined for
qualitative and quantitative attributes.
– It is possible for the greatest frequency to correspond to several
different values, which results in more than one mode.
– Data sets with one, two, or three modes are respectively called
unimodal, bimodal, and trimodal.
– In general, a data set with two or more modes is multimodal.
– At the other extreme, if each data value occurs only once, then
there is no mode.
– Example:
• 53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68,
68, 70
• 62 appears three times, more often than the other values, so Mode = 62
Mean, Median and Mode from Grouped Frequencies

The Race: this starts with some raw data (not grouped frequencies yet).

To find the Mean Alex adds up all the numbers, then divides by how
many numbers:

Mean = (59+65+61+62+53+55+60+70+64+56+58+58+62+62+68+65
+56 +59+68+61+67)/21 = 61.38095...
Mean, Median and Mode from Grouped Frequencies

• To find the Median Alex places the numbers in value order and finds the
middle number.
• In this case the median is the 11th number:
53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68,
70
• Median = 61
• To find the Mode, or modal value, Alex places the numbers in value
order then counts how many of each number. The Mode is the number
which appears most often (there can be more than one mode):
53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68,
70
• 62 appears three times, more often than the other values, so Mode = 62
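
For a quick check of these three values, here is a small sketch using Python's statistics module on the 21 race times listed above.

```python
# Verifying mean, median, and mode for the 21 race times from the slide.
import statistics

times = [59, 65, 61, 62, 53, 55, 60, 70, 64, 56, 58, 58, 62, 62, 68,
         65, 56, 59, 68, 61, 67]

print(statistics.mean(times))    # 61.3809...
print(statistics.median(times))  # 61
print(statistics.mode(times))    # 62
```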
Mean, Median and Mode from Grouped Frequencies

• Grouped Frequency Table
• Alex then makes a Grouped Frequency Table:

      Seconds     Frequency
      51 - 55         2
      56 - 60         7
      61 - 65         8
      66 - 70         4

• So 2 runners took between 51 and 55 seconds,
7 took between 56 and 60 seconds, etc.
Estimating the Mean from Grouped Data

We can estimate the Mean by using the midpoints.

Let's now make the table using midpoints:

      Seconds     Midpoint    Frequency
      51 - 55        53           2
      56 - 60        58           7
      61 - 65        63           8
      66 - 70        68           4

Estimating the Mean from Grouped Data

• Our thinking is: "2 people took 53 sec, 7 people took 58 sec, 8 people took
63 sec and 4 took 68 sec". In other words we imagine the data looks like this:
53, 53, 58, 58, 58, 58, 58, 58, 58, 63, 63, 63, 63, 63, 63, 63, 63, 68, 68, 68, 68
• Then we add them all up and divide by 21. The quick way to do it is to multiply
each midpoint by its frequency:

      Midpoint    Frequency    Midpoint × Frequency
         53           2               106
         58           7               406
         63           8               504
         68           4               272
      Totals:        21              1288

And then our estimate of the mean time to complete the race is:

      Estimated Mean = 1288 / 21 = 61.333...
Estimating the Median from Grouped Data

• Let's look at our data again:


• The median is the middle value, which in our
case is the 11th one, which is in the 61 - 65
group:
• We can say "the median group is 61 - 65"
• But if we want an estimated Median value we
need to look more closely at the 61 - 65 group.
Estimating the Median from Grouped Data
• At 60.5 we already have 9 runners, and by the next
boundary at 65.5 we have 17 runners.
• By drawing a straight line in between we can pick
out where the median frequency of n/2 runners is:

And this handy formula does the calculation:

      \text{Estimated Median} = L + \frac{n/2 - B}{G} \times w

where:
L is the lower class boundary of the group containing the median
n is the total number of values
B is the cumulative frequency of the groups before the median group
G is the frequency of the median group
w is the group width
For our example:
L = 60.5
n = 21
B = 2 + 7 = 9
G = 8
w = 5

      \text{Estimated Median} = 60.5 + \frac{21/2 - 9}{8} \times 5 = 60.5 + 0.9375 \approx 61.4
Estimating the Mode from Grouped Data

• Again, looking at our data:


• We can easily find the modal group (the group with the
highest frequency), which is 61 - 65
• We can say "the modal group is 61 - 65"
• But the actual Mode may not even be in that group! Or
there may be more than one mode. Without the raw data
we don't really know. But, we can estimate the Mode
using the following formula:

      \text{Estimated Mode} = L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w

where:
L is the lower class boundary of the modal group
fm is the frequency of the modal group
fm-1 is the frequency of the group before the modal group
fm+1 is the frequency of the group after the modal group
w is the group width
For our example:
L = 60.5
fm-1 = 7
fm = 8
fm+1 = 4
w = 5

      \text{Estimated Mode} = 60.5 + \frac{8 - 7}{(8 - 7) + (8 - 4)} \times 5 = 60.5 + 1 = 61.5
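
A small sketch, assuming the grouped frequency table above (51-55: 2, 56-60: 7, 61-65: 8, 66-70: 4) and class boundaries at 50.5, 55.5, 60.5, 65.5, 70.5, that applies the three grouped-data formulas just described:

```python
# Sketch: estimating mean, median, and mode from the grouped race-time table above.
# Class boundaries assumed as 50.5-55.5, 55.5-60.5, 60.5-65.5, 65.5-70.5 (width w = 5).
groups = [(50.5, 55.5, 2), (55.5, 60.5, 7), (60.5, 65.5, 8), (65.5, 70.5, 4)]

n = sum(f for _, _, f in groups)

# Estimated mean: frequency-weighted average of the class midpoints.
est_mean = sum(((lo + hi) / 2) * f for lo, hi, f in groups) / n

# Locate the median group (the group containing the (n/2)-th value).
cum = 0
for lo, hi, f in groups:
    if cum + f >= n / 2:
        L, G, w, B = lo, f, hi - lo, cum
        break
    cum += f
est_median = L + (n / 2 - B) / G * w

# Modal group = the group with the highest frequency; interpolate within it.
i = max(range(len(groups)), key=lambda k: groups[k][2])
lo, hi, fm = groups[i]
fm_prev = groups[i - 1][2] if i > 0 else 0
fm_next = groups[i + 1][2] if i < len(groups) - 1 else 0
est_mode = lo + (fm - fm_prev) / ((fm - fm_prev) + (fm - fm_next)) * (hi - lo)

print(est_mean, est_median, est_mode)  # ~61.33, ~61.44, 61.5
```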
Baby Carrots Example / Age Example
• Baby Carrots Example: You grew fifty baby carrots using special soil. You dig them up
and measure their lengths (to the nearest mm) and group the results.
• Age Example: The ages of the 112 people who live on a tropical island are grouped
into a frequency table.
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively
skewed data
• Data in most real applications are not symmetric (a). They may instead
be either positively skewed (b), where the mode occurs at a value that is
smaller than the median or negatively skewed (c), where the mode
occurs at a value greater than the median.
Measuring the Dispersion of Data

• Range, Quartiles, Variance, Standard Deviation, and


Interquartile Range
• We now look at measures to assess the dispersion or spread of
numeric data. The measures include range, quartiles, percentiles,
and the interquartile range.
• The five-number summary, which can be displayed as a boxplot, is
useful in identifying outliers.
• Variance and standard deviation also indicate the spread of a
data distribution.
Measuring the Dispersion of Data (cont..)

Range
• Let x1, x2, . . . , xN be a set of observations for some numeric attribute, X.
• The range of the set is the difference between the largest (max()) and smallest
(min()) values.

Standard Deviation
• The standard deviation (usually abbreviated SD, sd, or just s) of a set of
numbers tells you how much the individual numbers tend to differ (in either
direction) from the mean. It is calculated as follows:

      s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}

This formula is saying that you calculate the standard deviation of a set of N
numbers (xi) by subtracting the mean from each value to get the deviation (di)
of each value from the mean, squaring each of these deviations, adding up the
(di)^2 terms, dividing by N - 1, and then taking the square root.
Measuring the Dispersion of Data (cont..)
Standard Deviation (cont.)
• This is almost identical to the formula for the root-mean-square deviation of the
points from the mean, except that it has N - 1 in the denominator instead of N.
• This difference occurs because the sample mean is used as an approximation of
the true population mean (which you don't know). If the true mean were available
to use, the denominator would be N.
• When talking about population distributions, the SD describes the width of the
distribution curve. The figure shows three normal distributions. They all have a
mean of zero, but they have different standard deviations and, therefore, different
widths. Each distribution curve has a total area of exactly 1.0, so the peak height
is smaller when the SD is larger.
• For an IQ example (84, 84, 89, 91, 110, 114, and 116) where the mean is 98.3,
you calculate the SD as described above; it works out to about 14.4.
• Standard deviations are very sensitive to extreme values (outliers) in the data.
For example, if the highest value in the IQ dataset had been 150 instead of 116,
the SD would have gone up from 14.4 to 23.9.
Why n-1 in Standard Deviation?
Bessel's correction
Why divide by n-1 rather than n?
You compute the difference between each value and the mean of those values. You
don't know the true mean of the population; all you know is the mean of your
sample. Except for the rare cases where the sample mean happens to equal the
population mean, the data will be closer to the sample mean than it will be to the true
population mean.
So the value you compute will probably be a bit smaller (and can't be larger) than
what it would be if you used the true population mean.
To make up for this, divide by n-1 rather than n. This is called Bessel's correction.

But why n-1? If you knew the sample mean, and all but one of the values, you could
calculate what that last value must be. Statisticians say there are n-1 degrees of
freedom.
Measuring the Dispersion of Data (cont..)
• Several other useful measures of dispersion are related to the SD:

• Variance: The variance of N observations, x1, x2, ..., xN, for a numeric
attribute X is

      s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}

• The variance is just the square of the SD. For the IQ example, the
variance = 14.4^2 = 207.36.

• Coefficient of variation: The coefficient of variation (CV) is the SD
divided by the mean. For the IQ example, CV = 14.4 / 98.3 = 0.1465, or
14.65 percent.
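
To make the N - 1 (Bessel-corrected) calculation concrete, here is a short NumPy sketch for the IQ example above:

```python
# SD, variance, and coefficient of variation for the IQ example (sample formulas, ddof=1).
import numpy as np

iq = np.array([84, 84, 89, 91, 110, 114, 116], dtype=float)

mean = iq.mean()                 # ~98.3
sd = iq.std(ddof=1)              # ~14.4  (ddof=1 gives the N - 1 denominator)
variance = iq.var(ddof=1)        # ~207.6 (= sd**2)
cv = sd / mean                   # ~0.147, i.e. about 14.7 percent

print(mean, sd, variance, cv)

# For comparison, the population formulas divide by N instead of N - 1:
print(iq.std(ddof=0), iq.var(ddof=0))
```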
Measuring the Dispersion of Data (cont..)

Quartiles
Quartiles divide an ordered data set into four equal parts.
The values that divide the parts are called the first, second, and third quartiles;
they are denoted by Q1, Q2, and Q3, respectively.
Q1 is the middle value of the first half of the ordered data set, Q2 is the median of
the set, and Q3 is the middle value of the second half of the ordered data set.

Interquartile range (IQR)


A good measure of the spread of data is the interquartile range (IQR) or the
difference between Q3 and Q1. This gives us the width of the box, as well. A small
width means more consistent data values since it indicates less variation in the data or
that data values are closer together. So, IQR = Q3 - Q1
Interquartile Range = Q3-Q1
• With an Even Sample Size:
– For the sample (n=10) the median diastolic blood pressure is 71 (50% of the
values are above 71, and 50% are below).
– The quartiles can be determined in the same way we determined the median,
except we consider each half of the data set separately.
• With an Odd Sample Size:
– When the sample size is odd, the median and quartiles are determined in the
same way.
– Suppose in the previous example the lowest value (62) were excluded, so that
the sample size is n=9; the median diastolic blood pressure is then 72 (50% of
the values are above 72, and 50% are below). The median and quartiles are
indicated below.
Outliers and Tukey Fences:
Tukey Fences
– When there are no outliers in a sample, the mean and standard
deviation are used to summarize a typical value and the variability in
the sample, respectively.
– When there are outliers in a sample, the median and interquartile range
are used to summarize a typical value and the variability in the sample,
respectively.

– Outliers are values below Q1-1.5(Q3-Q1) or above Q3+1.5(Q3-Q1)


or equivalently, values below Q1-1.5 IQR or above Q3+1.5 IQR.

– In the previous example, for the diastolic blood pressures, the lower limit
is 64 - 1.5(77-64) = 44.5 and the upper limit is 77 + 1.5(77-64) = 96.5.
The diastolic blood pressures range from 62 to 81. Therefore there are
no outliers.
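
A sketch of the quartile/fence calculation in NumPy. The ten diastolic blood pressure values are assumed for illustration (they are consistent with the summary quoted above: min 62, Q1 = 64, median 71, Q3 = 77, max 81), and the quartiles are taken as the medians of each half of the ordered data, as described earlier.

```python
# Tukey fences for outlier screening, using assumed diastolic BP values that are
# consistent with the example (min 62, Q1 64, median 71, Q3 77, max 81).
import numpy as np

bp = np.sort(np.array([62, 63, 64, 67, 70, 72, 76, 77, 81, 81], dtype=float))

n = len(bp)
lower_half, upper_half = bp[: n // 2], bp[-(n // 2):]
q1, q3 = np.median(lower_half), np.median(upper_half)  # 64.0, 77.0
iqr = q3 - q1                                           # 13.0

lower_fence = q1 - 1.5 * iqr                            # 44.5
upper_fence = q3 + 1.5 * iqr                            # 96.5

outliers = bp[(bp < lower_fence) | (bp > upper_fence)]
print(q1, q3, iqr, lower_fence, upper_fence, outliers)  # no outliers here
```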
Example : The Full Framingham Cohort Data
• The Framingham Heart Study is a long-term, ongoing cardiovascular
cohort study on residents of the city of Framingham, Massachusetts.
The study began in 1948 with 5,209 adult subjects from Framingham,
and is now on its third generation of participants.

• Table 1 displays the means, standard deviations, medians, quartiles


and interquartile ranges for each of the continuous variables in the
subsample of n=10 participants who attended the seventh examination
of the Framingham Offspring Study.
Table 1 - Summary Statistics on n=10 Participants
• Table 2 displays the observed minimum and maximum values along
with the limits to determine outliers using the quartile rule for each
of the variables in the subsample of n=10 participants.
• Are there outliers in any of the variables? Which statistics are most
appropriate to summarize the average or typical value and the
dispersion?
Table 2 - Limits for Assessing Outliers in Characteristics Measured in the n=10 Participants

Since there are no suspected outliers in the subsample of n=10 participants, the mean and
standard deviation are the most appropriate statistics to summarize average values and
dispersion, respectively, of each of these characteristics.
Continue.....
• For clarity, we have so far used a very small subset of the Framingham Offspring
Cohort to illustrate calculations of summary statistics and determination of
outliers. For your interest, Table 3 displays the means, standard deviations,
medians, quartiles and interquartile ranges for each of the continuous variable
displayed in Table 1 in the full sample (n=3,539) of participants who attended the
seventh examination of the Framingham Offspring Study.

Table 3-Summary Statistics on Sample of (n=3,539) Participants


Continue.....
• Table 4 displays the observed minimum and maximum values
along with the limits to determine outliers using the quartile rule
for each of the variables in the full sample (n=3,539).
Table 4 - Limits for Assessing Outliers in Characteristics Presented in Table 3

Are there outliers in any of the variables?

Which statistics are most appropriate to summarize the average or typical values and
the dispersion for each variable?
Observations on example......

• In the full sample, each of the characteristics has outliers on the upper
end of the distribution as the maximum values exceed the upper limits
in each case. There are also outliers on the low end for diastolic blood
pressure and total cholesterol, since the minimums are below the lower
limits.
• For some of these characteristics, the difference between the upper
limit and the maximum (or the lower limit and the minimum) is small
(e.g., height, systolic and diastolic blood pressures), while for others
(e.g., total cholesterol, weight and body mass index) the difference is
much larger. This method for determining outliers is a popular one but
not generally applied as a hard and fast rule. In this application it
would be reasonable to present means and standard deviations for
height, systolic and diastolic blood pressures and medians and
interquartile ranges for total cholesterol, weight and body mass index.
The beauty of the normal curve:

68-95-99.7 Rule
No matter what μ and σ are,
• the area between μ-σ and μ+σ is
about 68%; the area between
μ-2σ and μ+2σ is about 95%;
• and the area between
μ-3σ and μ+3σ is about 99.7%.
• Almost all values fall within 3 standard deviations. (μ: mean, σ:
standard deviation)
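
These three areas can be checked numerically with the standard normal CDF, as in this small sketch:

```python
# Checking the 68-95-99.7 rule with the standard normal distribution.
from scipy.stats import norm

for k in (1, 2, 3):
    area = norm.cdf(k) - norm.cdf(-k)   # P(mu - k*sigma < X < mu + k*sigma)
    print(f"within {k} SD: {area:.4f}")
# within 1 SD: 0.6827, within 2 SD: 0.9545, within 3 SD: 0.9973
```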
Chapter-2 Getting to Know Your Data

• Data Objects and Attribute Types

• Basic Statistical Descriptions of Data

• Data Visualization

• Measuring Data Similarity and Dissimilarity

• Summary
Graphic Displays of Basic Statistical Descriptions
• Boxplot: graphic display of five-number summary
• Histogram: x-axis are values, y-axis represent frequencies
• Quantile plot: each value xi is paired with fi indicating that
approximately 100 fi % of data are ≤ xi
• Quantile-quantile (q-q) plot: graphs the quantiles of one
univariate distribution against the corresponding quantiles of
another
• Scatter plot: each pair of values is a pair of coordinates and plotted
as points in the plane

Boxplot Analysis
• Boxplots are a popular way of visualizing
a distribution. A boxplot incorporates the
five-number summary as follows:
• Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and
third quartiles, i.e., the height of the box
is IQR
– The median is marked by a line within the box
– Two lines (called whiskers) outside the box extend to the smallest
(Minimum) and largest (Maximum) observations.

(Figure: boxplot for the unit price data for items sold at four branches of
AllElectronics during a given time period.)
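
As an illustration of how such a plot is produced, the sketch below draws boxplots for four branches using made-up unit price data (the AllElectronics figures themselves are not reproduced here).

```python
# Sketch: drawing boxplots of unit prices for four branches (data invented for illustration).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
branches = [rng.normal(loc=m, scale=s, size=200) for m, s in
            [(60, 15), (80, 20), (55, 10), (100, 30)]]  # unit prices per branch

plt.boxplot(branches)
plt.xticks([1, 2, 3, 4], ["Branch 1", "Branch 2", "Branch 3", "Branch 4"])
plt.ylabel("Unit price ($)")
plt.title("Boxplots of unit prices at four branches")
plt.show()
```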
Boxplot Analysis (cont..)

Finally, boxplots often provide information about the shape of a
data set. The examples below show some common patterns.

• Each of the boxplots illustrates a different skewness pattern.
• If most of the observations are concentrated on the low end of the
scale, the distribution is skewed right; and vice versa.
• If a distribution is symmetric, the observations will be evenly split
at the median, as shown in the middle figure.
Histogram Analysis
• Histogram: Graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value,
not the height as in bar charts, a crucial distinction when the categories are not
of uniform width
• The categories are usually specified as non-overlapping intervals of some
variable. The categories (bars) must be adjacent
Histograms Often Tell More than Boxplots

■ The two histograms shown in


the left may have the same
boxplot representation
■ The same values for: min,
Q1, median, Q3, max
■ But they have rather different
data distributions
Quantile Plot
• Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
• Plots quantile information
– For data xi sorted in increasing order, fi indicates that
approximately 100*fi% of the data are below or equal to the
value xi
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding
quantiles of another
• View: Is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each
quantile. Unit prices of items sold at Branch 1 tend to be lower than those at
Branch 2.

Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, correlation, etc.
• Each pair of values is treated as a pair of coordinates and
plotted as points in the plane

Positively and Negatively Correlated Data

• The left half fragment is positively correlated


• The right half is negatively correlated
Uncorrelated Data

Chapter-2 Getting to Know Your Data

• Data Objects and Attribute Types

• Basic Statistical Descriptions of Data

• Data Visualization

• Measuring Data Similarity and Dissimilarity

• Summary
Similarity and Dissimilarity
In data mining applications such as clustering, outlier analysis, and
classification, we often need to measure how similar or dissimilar one
data object is to another.
• Similarity
– A Numerical measure of how alike two data objects are
– Value is higher when objects are more alike and lower when
objects are more dissimilar
– Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
– A Numerical measure of how different two data objects are
– Value is higher when objects are more dissimilar and lower when
objects are more alike
– Minimum dissimilarity is often 0 and the upper limit varies
• Proximity refers to a similarity or dissimilarity
Data Matrix and Dissimilarity Matrix

• Data Matrix
– Object by attribute structure
– n data points (rows) with p
dimensions/attributes (columns) in the
form of a relational table

• Dissimilarity Matrix
– Object by Object structure
– n data points’ distance or
proximity with respect to all pairs.
– A triangular matrix

Similarity and Dissimilarity
Types of Attribute
– Nominal Attribute
– Ordinal Attribute
– Binary Attribute
– Numeric Attribute
Similarity/Dissimilarity for Simple
Attributes
The following table shows the similarity and dissimilarity between
two objects, x and y, with respect to a single, simple attribute.
Proximity Measure for Nominal Attributes
A nominal attribute can take two or more states.
Example: map_color is nominal with states {red, yellow, blue, green}.
Dissimilarity Between Nominal Attributes
– Simple ratio of mismatches:

      d(i, j) = \frac{p - m}{p}

where m is the number of matches (attributes on which objects i and j are in the
same state) and p is the total number of variables/attributes.
Similarity Between Nominal Attributes
– Simple ratio of matches:

      sim(i, j) = \frac{m}{p} = 1 - d(i, j)
Proximity Measure for Nominal Attributes
(Example)

    Obj_Id   Test-1 (Nominal)   Test-2 (Ordinal)   Test-3 (Numeric)
      1        Code A             Excellent            45
      2        Code B             Fair                 22
      3        Code C             Good                 64
      4        Code A             Excellent            28

Here, only Test-1 (column 1) is a nominal attribute, so p = 1 and d(i, j) = 0 when the
codes match and 1 when they do not.

Dissimilarity Matrix: d(i, j) =

         1     2     3     4
    1    0
    2    1     0
    3    1     1     0
    4    0     1     1     0

Similarity Matrix: sim(i, j) = 1 - d(i, j) =

         1     2     3     4
    1    1
    2    0     1
    3    0     0     1
    4    1     0     0     1
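
A quick sketch that reproduces this dissimilarity matrix from the Test-1 column:

```python
# Mismatch-ratio dissimilarity for the single nominal attribute Test-1.
import numpy as np

test1 = ["Code A", "Code B", "Code C", "Code A"]
p = 1  # number of nominal attributes considered

n = len(test1)
d = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        matches = int(test1[i] == test1[j])
        d[i, j] = (p - matches) / p   # d(i, j) = (p - m) / p

print(d)          # dissimilarity matrix
print(1 - d)      # similarity matrix sim(i, j) = 1 - d(i, j)
```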
Proximity Measure for Binary Attributes
A contingency table is used to describe the distance between two
objects (say p and q) that have only binary attributes.

Compute dissimilarity/similarities using the following quantities


f01 = the number of attributes where p was 0 and q was 1
f10 = the number of attributes where p was 1 and q was 0
f00 = the number of attributes where p was 0 and q was 0
f11 = the number of attributes where p was 1 and q was 1

                        Object q
                   1           0           sum
            1     f11         f10         f11 + f10
Object p    0     f01         f00         f01 + f00
           sum  f11 + f01   f10 + f00     f11 + f10 + f01 + f00
Proximity Measure for Binary Attributes
[Dissimilarity]
• Distance measure for binary attributes
– Symmetric binary variables:

      d(p, q) = \frac{f_{01} + f_{10}}{f_{11} + f_{10} + f_{01} + f_{00}}

– Asymmetric binary variables (negative matches f00 are not counted):

      d(p, q) = \frac{f_{01} + f_{10}}{f_{11} + f_{10} + f_{01}}

Proximity Measure for Binary Attributes
[Similarity]
• Similarity measure for binary attributes
– Symmetric binary variables [Simple Matching Coefficient]:

      SMC(p, q) = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}} = 1 - d(p, q)

– Asymmetric binary variables [Jaccard Index Similarity]:

      J(p, q) = \frac{f_{11}}{f_{11} + f_{10} + f_{01}} = 1 - d(p, q)
Dissimilarity between Binary Variables [Example]

– Gender is a symmetric attribute and the remaining attributes are asymmetric


binary
– Let the values Y and P be 1, and the value N be 0
– Dissimilarity in terms of only asymmetric attribute =

Among the three patients, Jack and Mary have similar diseases [as their distance is smaller]
Similarity between Binary Variables [Example]

– Similarity in terms of only asymmetric attribute [ Jaccard Index Similarity]

Among the patients, Jack and Mary have similar diseases [as their similarity is greater]
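
A small sketch of the four binary measures defined above, applied to two binary vectors that are made up for illustration (the patient table itself is not reproduced here):

```python
# Simple matching and Jaccard measures for two binary vectors (illustrative data).
p = [1, 0, 1, 1, 0, 0, 1, 0]
q = [1, 1, 1, 0, 0, 0, 1, 1]

f11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
f00 = sum(a == 0 and b == 0 for a, b in zip(p, q))
f10 = sum(a == 1 and b == 0 for a, b in zip(p, q))
f01 = sum(a == 0 and b == 1 for a, b in zip(p, q))

smc = (f11 + f00) / (f11 + f10 + f01 + f00)       # symmetric similarity
jaccard = f11 / (f11 + f10 + f01)                 # asymmetric similarity
d_symmetric = (f01 + f10) / (f11 + f10 + f01 + f00)
d_asymmetric = (f01 + f10) / (f11 + f10 + f01)

print(smc, jaccard, d_symmetric, d_asymmetric)
```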
Proximity Measure for Numeric Attributes
[Dissimilarity]
Commonly used distance measures for computing dissimilarity are:

• Euclidean Distance:

      d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

• Manhattan Distance:

      d(x, y) = \sum_{i=1}^{n} |x_i - y_i|

• Minkowski Distance (a generalization of Euclidean and Manhattan):

      d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

  [When p = 1 it is the Manhattan distance; when p = 2 it is the Euclidean distance.]

Here n is the number of dimensions (attributes) and xi and yi are the
ith attributes (components) of data objects x and y.
Proximity Measure for Numeric Attributes
[Dissimilarity - Euclidean Distance]

Data Matrix

         attribute 1   attribute 2
    x1        1             2
    x2        3             5
    x3        2             0
    x4        4             5

Dissimilarity Matrix (with Euclidean Distance)

         x1      x2      x3      x4
    x1   0
    x2   3.61    0
    x3   2.24    5.10    0
    x4   4.24    1.00    5.39    0
Proximity Measure for Numeric Attributes
[Dissimilarity - Manhattan Distance]

Data Matrix

         attribute 1   attribute 2
    x1        1             2
    x2        3             5
    x3        2             0
    x4        4             5

Dissimilarity Matrix (with Manhattan Distance)

         x1      x2      x3      x4
    x1   0
    x2   5       0
    x3   3       6       0
    x4   6       1       7       0
Proximity Measure for Numeric Attributes
[Dissimilarity - Minkowski Distance]

Data Matrix

         attribute 1   attribute 2
    x1        1             2
    x2        3             5
    x3        2             0
    x4        4             5

Minkowski Distance:

      d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p},  where p is a real number >= 1

Minkowski Distance with p = 1 is the Manhattan Distance:

         x1      x2      x3      x4
    x1   0
    x2   5       0
    x3   3       6       0
    x4   6       1       7       0

Minkowski Distance with p = 2 is the Euclidean Distance:

         x1      x2      x3      x4
    x1   0
    x2   3.61    0
    x3   2.24    5.10    0
    x4   4.24    1.00    5.39    0
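
The sketch below reproduces these matrices with SciPy, assuming the four 2-D points x1 = (1, 2), x2 = (3, 5), x3 = (2, 0), x4 = (4, 5) shown in the data matrix above.

```python
# Euclidean, Manhattan, and general Minkowski distances for the four example points.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]], dtype=float)  # x1..x4

euclidean = squareform(pdist(X, metric="euclidean"))
manhattan = squareform(pdist(X, metric="cityblock"))
minkowski3 = squareform(pdist(X, metric="minkowski", p=3))   # any p >= 1

print(np.round(euclidean, 2))   # e.g. d(x1, x2) = 3.61, d(x4, x2) = 1.00
print(manhattan)                # e.g. d(x1, x2) = 5,   d(x4, x3) = 7
print(np.round(minkowski3, 2))
```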
Proximity Measure for Ordinal Attributes
■ Ordinal variables are discrete or continuous with meaningful order
or ranking about them but the magnitude between successive ranks
is unknown.
■ Example:
■ Attribute Size: <small, medium, large>

■ Attribute temperature: <-30 to -10: cold, -10 to 10: Moderate, 10

to 30: warm>
■ For an attribute f, let Mf represent the total number of possible states.
Then the ordered states are ranked 1, ..., Mf.
■ For an object i, let rif be the rank of the ith object's value on attribute f.
■ Since each ordinal attribute can have a different number of states, it is
necessary to normalize the range onto [0.0, 1.0], where each rank rif is
normalized to

      z_{if} = \frac{r_{if} - 1}{M_f - 1}

and then the dissimilarity matrix is computed from the zif values.
Proximity Measure for Ordinal Attributes

    Obj_Id   Test-1 (Nominal)   Test-2 (Ordinal)   Test-3 (Numeric)
      1        Code A             Excellent            45
      2        Code B             Fair                 22
      3        Code C             Good                 64
      4        Code A             Excellent            28

Only Test-2 is ordinal. With the ranking Fair < Good < Excellent (Mf = 3), its
normalized form is:

    Obj_Id   Test-2 (Ordinal)   Rank (rif)   Normalized Rank (zif)
      1        Excellent            3               1
      2        Fair                 1               0
      3        Good                 2               0.5
      4        Excellent            3               1

Dissimilarity Matrix: d(i, j) = |zif - zjf| =

         1      2      3      4
    1    0
    2    1      0
    3    0.5    0.5    0
    4    0      1      0.5    0

Similarity Matrix: sim(i, j) = 1 - d(i, j) =

         1      2      3      4
    1    1
    2    0      1
    3    0.5    0.5    1
    4    1      0      0.5    1
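
A short sketch of the rank-normalization step and the resulting dissimilarity matrix for Test-2:

```python
# Ordinal dissimilarity for Test-2: map states to ranks, normalize to [0, 1],
# then take absolute differences (one attribute, so d(i, j) = |z_i - z_j|).
import numpy as np

rank = {"Fair": 1, "Good": 2, "Excellent": 3}   # M_f = 3 states
test2 = ["Excellent", "Fair", "Good", "Excellent"]

Mf = len(rank)
z = np.array([(rank[v] - 1) / (Mf - 1) for v in test2])  # 1, 0, 0.5, 1

d = np.abs(z[:, None] - z[None, :])   # pairwise |z_i - z_j|
print(z)
print(d)
```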
Proximity Measure for Mixed Attributes

■ Real databases mostly contain attributes of mixed types. A simple
approach is to process all attribute types together, performing a single
analysis to obtain a single dissimilarity matrix on a common scale of
[0.0, 1.0] (normalized form).
■ Let the data set contain p attributes of mixed type. The dissimilarity
d(i, j) between objects i and j is:

      d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}

where
  d_{ij}^{(f)} is the dissimilarity between objects i and j on attribute f, and
  \delta_{ij}^{(f)} is an indicator:
      \delta_{ij}^{(f)} = 0  if x_{if} or x_{jf} is missing, or if x_{if} = x_{jf} = 0 and
                             attribute f is asymmetric binary;
      \delta_{ij}^{(f)} = 1  otherwise.
Proximity Measure for Mixed Attributes

    Obj_Id   Test-1 (Nominal)   Test-2 (Ordinal)   Test-3 (Numeric)
      1        Code A             Excellent            45
      2        Code B             Fair                 22
      3        Code C             Good                 64
      4        Code A             Excellent            28

Dissimilarity Matrix of the Numeric Attribute (Test-3), using
d_{ij}^{(f)} = |x_{if} - x_{jf}| / (max(h) - min(h)), with max(h) = 64, min(h) = 22,
so max(h) - min(h) = 42:

         1      2      3      4
    1    0
    2    0.55   0
    3    0.45   1.00   0
    4    0.40   0.14   0.86   0

Dissimilarity Matrix of the Nominal Attribute (Test-1):

         1      2      3      4
    1    0
    2    1      0
    3    1      1      0
    4    0      1      1      0

Dissimilarity Matrix of the Ordinal Attribute (Test-2):

         1      2      3      4
    1    0
    2    1      0
    3    0.5    0.5    0
    4    0      1      0.5    0
Proximity Measure for Mixed Attributes

Using the generalized formula for mixed-attribute dissimilarity (here every
indicator \delta_{ij}^{(f)} equals 1, so each entry is simply the average of the
three per-attribute dissimilarities above):

         1      2      3      4
    1    0
    2    0.85   0
    3    0.65   0.83   0
    4    0.13   0.71   0.79   0

From this matrix, objects 1 and 4 are the most similar, and objects 1 and 2 are
the most dissimilar.
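
A sketch that ties the whole mixed-attribute example together, computing the three per-attribute matrices for Test-1, Test-2, and Test-3 and averaging them (all indicator terms are 1 here):

```python
# Mixed-attribute dissimilarity for the four example objects (Test-1, Test-2, Test-3).
import numpy as np

test1 = ["Code A", "Code B", "Code C", "Code A"]          # nominal
test2 = ["Excellent", "Fair", "Good", "Excellent"]        # ordinal
test3 = np.array([45, 22, 64, 28], dtype=float)           # numeric
n = len(test1)

# Nominal: 0 if equal, 1 otherwise.
d_nom = np.array([[0.0 if test1[i] == test1[j] else 1.0 for j in range(n)] for i in range(n)])

# Ordinal: normalize ranks to [0, 1], then absolute difference.
rank = {"Fair": 1, "Good": 2, "Excellent": 3}
z = np.array([(rank[v] - 1) / (len(rank) - 1) for v in test2])
d_ord = np.abs(z[:, None] - z[None, :])

# Numeric: min-max-scaled absolute difference.
d_num = np.abs(test3[:, None] - test3[None, :]) / (test3.max() - test3.min())

# Mixed: average of the per-attribute dissimilarities (all delta indicators are 1 here).
d_mixed = (d_nom + d_ord + d_num) / 3
print(np.round(d_mixed, 2))   # e.g. d(2,1) = 0.85, d(4,1) = 0.13
```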