Chapter-2 Getting To Know Your Data

The document provides an overview of data mining and data warehousing, focusing on understanding data objects and their attributes, as well as statistical descriptions and visualization techniques. It discusses the characteristics of structured data, including dimensionality, sparsity, resolution, and distribution, while also explaining different types of attributes and data sets. Additionally, it covers descriptive data summarization methods to analyze central tendency and dispersion in data.

Uploaded by

khushiraynn


DATA MINING AND

DATA WAREHOUSING

Getting to Know Your Data

Chapter-2 Getting to Know Your Data

• Data Objects and Attribute Types

• Basic Statistical Descriptions of Data

• Data Visualization

• Measuring Data Similarity and Dissimilarity

• Summary
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance
[Figure: a data table in which each row is an object and each column is an attribute]
Important Characteristics of Structured Data
• Dimensionality
– It is the number of attributes that the objects in the data set possess.
– Data with a small number of dimensions tends to be qualitatively different from
moderate- or high-dimensional data.
– The difficulties associated with high-dimensional data are sometimes referred to
as the “curse of dimensionality”. Because of this, dimensionality reduction is
often required.
• Sparsity
– In a database, sparsity and density describe the number of cells in a table that
are empty or zero (sparsity) and that contain information (density).
– For some data sets, such as those with asymmetric features, most attributes of
an object have the value 0; in many cases fewer than 1% of the entries are
non-zero.
– Sparsity is an advantage because usually only the non-zero values need to be
stored and manipulated.
Important Characteristics of Structured Data
• Resolution
– It is frequently possible to obtain data at different levels of resolution, and
the properties of the data often differ from one resolution to another.
– For instance, the surface of the Earth seems very uneven at a resolution of a few
meters, but is relatively smooth at a resolution of 10 km.
– The patterns in the data also depend on the level of resolution:
• If the resolution is too fine, a pattern may not be visible or may be
buried in noise.
• If the resolution is too coarse, the pattern may disappear.
For example, variations in atmospheric pressure on a scale of hours reflect the
movement of storms and other weather systems, whereas on a scale of months
such phenomena are not detectable.
• Distribution
– Centrality and dispersion
Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
– sales database: customers, store items, sales
– medical database: patients, treatments
– university database: students, professors, courses
• Also called samples, examples, instances, data points,
objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.
Attributes

• Attribute (or dimension, feature, variable):
A data field representing a characteristic or feature of a data object.
– E.g., customer_ID, name, address
• Observed values for a given attribute are known as observations.
• A set of attributes used to describe a given object is called an
attribute vector (or feature vector).
It contains the set of database attributes that you have chosen to
represent (describe) each data element (tuple) uniquely.
• The distribution of data involving one attribute (or variable) is
called univariate.
• A bivariate distribution involves two attributes, and so on.
Attribute Values
• Attribute values are numbers or symbols assigned to an
attribute for a particular object

• Distinction between attributes and attribute values

– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters

– Different attributes can be mapped to the same set of values

• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
• Attribute Type:
1. Categorical (Qualitative)
2. Numeric (Quantitative)
Categorical Data (Qualitative Attribute Types)
• Nominal: Nominal means “relating to names”. The values of a
nominal attribute are symbols or names of things, for example
Hair_color = {auburn, black, blond, brown, grey, red, white}.
– Other examples: marital status, occupation, ID numbers, zip codes
– The values do not have any meaningful order.
• Binary: Nominal attribute with only 2 states (0 and 1), where 0
typically means that the attribute is absent, and 1 means that it is
present. Binary attributes are referred to as Boolean if the two
states correspond to true and false.
– Symmetric binary: both outcomes equally important
• e.g., gender
– Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to the most important outcome (e.g.,
HIV positive)
Categorical Data (Qualitative Attribute Types)...
• Ordinal
– Values have a meaningful order (ranking), but the magnitude between
successive values is not known.
– Size = {small, medium, large}, grades, army rankings
– Other examples of ordinal attributes include Grade (e.g., A+, A,
A−, B+, and so on) and Professional rank.
– Professional ranks can be enumerated in a sequential order, such as
assistant, associate, and full for professors.
• The central tendency of an ordinal attribute can be represented by its
mode and its median (the middle value in an ordered sequence), but
the mean cannot be defined.
• Qualitative attributes describe a feature of an object without giving
an actual size or quantity. The values of such qualitative attributes are
typically words representing categories.
Numeric Data (Quantitative Attribute Types)
• A numeric attribute is quantitative: it is a measurable quantity, represented
in integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.
• Interval
– Measured on a scale of equal-sized units.
– The values of interval-scaled attributes have order and can be positive,
0, or negative.
– E.g., temperature in °C or °F, calendar dates
– No true zero-point
• Ratio
– Inherent zero-point
– We can speak of values as being a multiple (or ratio) of another
value.
– E.g., temperature in Kelvin [unlike Celsius and Fahrenheit, the
Kelvin scale has a true zero-point: at 0 K the particles
have zero kinetic energy].
– E.g., years of experience in an employee object, no_of_words in a
Documents object, length, counts, monetary quantities
Discrete vs. Continuous Attributes

• Discrete Attribute
– Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a
collection of documents
– Sometimes, represented as integer variables
– Binary (0 or 1; Yes/No; Male/Female)
– Boolean (True/False)
– An attribute is countably infinite if the set of possible values is
infinite but the values can be put in a one-to-one
correspondence with natural numbers.
– For example, the attribute customer ID is countably infinite.

Discrete vs. Continuous Attributes (cont..)

• Continuous Attribute
– Has real numbers as attribute values
– Numeric (e.g., salaries, ages, temperatures, rainfall, sales,
height, or weight)
– Practically, real values can only be measured and represented
using a finite number of digits
– Continuous attributes are typically represented as floating-point
variables
Properties of Attribute Values
The type of an attribute depends on which of the following
properties/operations it possesses:
– Distinctness : = and ≠
– Order : <, ≤, >, and ≥
– Addition : + and - (Differences are meaningful)
– Multiplication : * and / (Ratios are meaningful)

– Nominal attribute: distinctness

– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & meaningful differences
– Ratio attribute: all 4 properties/operations
Different types of Attributes
Types of data sets
• Record Data
– Data Matrix
– Document Data (Sparse Data Matrix)
– Transaction Data
• Graph Data
– World Wide Web
– Molecular Structures
• Ordered Data
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record data set
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the
data objects can be thought of as points in a multi-dimensional
space, where each dimension represents a distinct attribute.
• Such a data set can be represented by an m by n matrix, where there
are m rows, one for each object, and n columns, one for each
attribute.
Document Data
• Each document becomes a ‘term’ vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the
corresponding term occurs in the document.
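As a minimal sketch of the term-vector idea above, the following Python snippet counts term occurrences with the standard library. The two sample documents and the whitespace tokenization are assumptions made purely for illustration; they do not come from the slides.

```python
from collections import Counter

# Hypothetical documents, chosen only to illustrate the idea.
documents = [
    "team play ball score game",
    "ball game game win",
]

# Vocabulary: one vector component (attribute) per distinct term.
vocabulary = sorted({term for doc in documents for term in doc.split()})


def term_vector(doc, vocab):
    """Return the term-frequency vector of `doc` over `vocab`."""
    counts = Counter(doc.split())
    # Counter returns 0 for terms that do not occur in the document.
    return [counts[term] for term in vocab]


vectors = [term_vector(doc, vocabulary) for doc in documents]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Stacking the vectors row by row gives the document-term matrix; with a realistic vocabulary most entries are 0, which is why document data is described as a sparse data matrix.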
Transaction Data
• A special type of record data, where
– Each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitutes a
transaction, while the individual products that were purchased
are the items.
Graph Data
• Examples: generic graph, a molecule, and webpages
[Figure: Benzene molecule, C6H6]
Ordered Data
• Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events]
Ordered Data
• Genomic sequence data
Ordered Data
• Spatio-temporal data
[Figure: average monthly temperature of land and ocean]
Descriptive Data Summarization
❑ For data preprocessing to be successful, it is essential to have an
overall picture of the data.
❑ Descriptive data summarization techniques can be used to identify
the typical properties of the data and to highlight which data values should
be treated as noise or outliers.
❑ Data characteristics such as central tendency and dispersion are
used to understand the distribution of the data.
Central Tendency: measures the location of the middle or centre
of a data distribution.
e.g. mean, median, mode and midrange
Dispersion of a Data Set: how the data are spread out.
e.g. range, quartiles, interquartile range, five-number summary,
variance and standard deviation.
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):

• The most common and effective numeric measure of the “center” of a
set of data is the (arithmetic) mean. Let x1, x2, ..., xN be a set of N values
or observations, such as for some numeric attribute X, like salary.
The mean of this set of values is
$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$
Note: N is the population size.
• Sometimes, each value xi in a set may be associated with a weight wi for
i = 1, ..., N. The weights reflect the significance, importance, or
occurrence frequency attached to their respective values. In this case, we can
compute
$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N}$
This is called the weighted arithmetic mean or the weighted average.
Measuring the Central Tendency(Mean)...
• A trimmed mean (sometimes called a truncated mean) is similar to a mean, but
it trims any outliers. Outliers can affect the mean (especially if there are just one
or two very large values), so a trimmed mean can often be a better fit for data sets
with erratic high or low values or for extremely skewed distributions. Even a
small number of extreme values can corrupt the mean.
• For example, the mean salary at a company may be substantially pushed up by
that of a few highly paid managers. Similarly, the mean score of a class in an
exam could be pulled down quite a bit by a few very low scores.
• In other words, a trimmed mean is the mean obtained after chopping off values
at the high and low extremes.
• Example: Find the 20% trimmed mean for the following test scores: 60, 81, 83, 91, 99.
– Step 1: Trim the top and bottom 20% from the data. With five scores, that removes
one value from each end (60 and 99) and leaves the middle three values: 81, 83, 91.
– Step 2: Find the mean of the remaining values. The mean is (81 + 83 + 91) / 3 =
85.
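The two steps of the trimmed-mean example can be sketched in Python as follows. The function name `trimmed_mean` and the choice to round the number of trimmed values down are assumptions of this sketch, not something the slides prescribe.

```python
def trimmed_mean(values, proportion):
    """Mean of `values` after dropping `proportion` of them from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    # Number of values removed from each end (rounded down: an assumption).
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


scores = [60, 81, 83, 91, 99]
result = trimmed_mean(scores, 0.20)
print(result)  # 85.0
```

For comparison, the ordinary mean of the same five scores is 82.8, so the single low score of 60 pulls the untrimmed mean down by a little over two points.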
Measuring the Central Tendency...

• Median:
– The median of a set of data is the middlemost number in the set.
The median is also the number that is halfway into the set.
– To find the median, the data should first be arranged in order from
least to greatest.
– Middle value if odd number of values, or average of the middle
two values otherwise

– What will be the median estimated by interpolation (for
grouped data)?
Measuring the Central Tendency...
• Mode
– The mode for a set of data is the value that occurs most
frequently in the set. Therefore, it can be determined for
qualitative and quantitative attributes.
– It is possible for the greatest frequency to correspond to several
different values, which results in more than one mode.
– Data sets with one, two, or three modes are respectively called
unimodal, bimodal, and trimodal.
– In general, a data set with two or more modes is multimodal.
– At the other extreme, if each data value occurs only once, then
there is no mode.
– Example:
• 53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68,
68, 70
• 62 appears three times, more often than the other values, so Mode = 62
Mean, Median and Mode from Grouped Frequencies

The Race: this example starts with some raw data (not a grouped frequency table yet).

To find the Mean, Alex adds up all the numbers, then divides by how
many numbers there are:

Mean = (59+65+61+62+53+55+60+70+64+56+58+58+62+62+68+65
+56+59+68+61+67)/21 = 61.38095...
Mean, Median and Mode from Grouped Frequencies

• To find the Median Alex places the numbers in value order and finds the
middle number.
• In this case the median is the 11th number:
53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68,
70
• Median = 61
• To find the Mode, or modal value, Alex places the numbers in value
order then counts how many of each number. The Mode is the number
which appears most often (there can be more than one mode):
53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68,
70
• 62 appears three times, more often than the other values, so Mode = 62
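The three results for the race data can be checked with Python's standard `statistics` module. This is only a verification sketch; the variable names are arbitrary, and `multimode` is used because it returns every most-frequent value, which matches the remark that there can be more than one mode.

```python
import statistics

# Race times in seconds, in the order they appear in the mean calculation.
times = [59, 65, 61, 62, 53, 55, 60, 70, 64, 56, 58,
         58, 62, 62, 68, 65, 56, 59, 68, 61, 67]

mean = statistics.mean(times)        # sum of the values divided by their count
median = statistics.median(times)    # middle (11th) value of the sorted data
modes = statistics.multimode(times)  # list of all most-frequent values

print(round(mean, 5))  # 61.38095
print(median)          # 61
print(modes)           # [62]
```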
Mean, Median and Mode from Grouped Frequencies

• Grouped Frequency Table

• Alex then makes a Grouped Frequency Table:

Seconds    Frequency
51 - 55    2
56 - 60    7
61 - 65    8
66 - 70    4

• So 2 runners took between 51 and 55 seconds,
7 took between 56 and 60 seconds, etc.
Estimating the Mean from Grouped Data

We can estimate the Mean by using the midpoints.

Let's now make the table using midpoints:

Midpoint   Frequency
53         2
58         7
63         8
68         4

Estimating the Mean from Grouped Data

• Our thinking is: "2 people took 53 sec, 7 people took
58 sec, 8 people took 63 sec and 4 took 68 sec". In
other words we imagine the data looks like this:
53, 53, 58, 58, 58, 58, 58, 58, 58, 63, 63, 63, 63, 63, 63, 63, 63, 68, 68, 68, 68

• Then we add them all up and divide by 21. The quick
way to do it is to multiply each midpoint by its
frequency:

Midpoint x   Frequency f   Midpoint × Frequency fx
53           2             106
58           7             406
63           8             504
68           4             272
Totals:      21            1288

And then our estimate of the mean time to complete
the race is:

Estimated Mean = 1288 / 21 = 61.333...
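The midpoint method can be sketched in a few lines of Python. The frequencies and the first three groups are quoted in the text; the last group, 66 - 70, is an assumption inferred from its midpoint of 68 and the group width of 5.

```python
# (lower bound, upper bound) in seconds, paired with the group frequency.
# The 66-70 bounds are inferred from the midpoint 68 (an assumption).
groups = [((51, 55), 2), ((56, 60), 7), ((61, 65), 8), ((66, 70), 4)]

total = sum(freq for _, freq in groups)

# Multiply each group midpoint by its frequency and add the products up.
weighted_sum = sum((low + high) / 2 * freq for (low, high), freq in groups)

estimated_mean = weighted_sum / total

print(total)                     # 21
print(weighted_sum)              # 1288.0
print(round(estimated_mean, 3))  # 61.333
```

The estimate of about 61.33 is close to the mean of 61.38095 computed from the raw data, which suggests that little is lost by grouping in this example.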
Estimating the Median from Grouped Data

• Let's look at our data again:

• The median is the middle value, which in our
case is the 11th one, which is in the 61 - 65
group:
• We can say "the median group is 61 - 65"
• But if we want an estimated Median value we
need to look more closely at the 61 - 65 group.
Estimating the Median from Grouped Data
• At 60.5 we already have 9 runners, and by the next
boundary at 65.5 we have 17 runners.
• By drawing a straight line in between we can pick
out where the median frequency of n/2 runners is.

And this handy formula does the calculation:

$\text{Estimated Median} = L + \frac{(n/2) - B}{G} \times w$

where:
L is the lower class boundary of the group
containing the median
n is the total number of values
B is the cumulative frequency of the groups before
the median group
G is the frequency of the median group
w is the group width
For our example:

L = 60.5
n = 21
B = 2 + 7 = 9
G = 8
w = 5

$\text{Estimated Median} = 60.5 + \frac{(21/2) - 9}{8} \times 5 = 60.5 + 0.9375 = 61.4375$
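The interpolation can be written as a small Python function. The formula used here, L + ((n/2 - B) / G) * w, is the usual linear-interpolation estimate built from the quantities defined above; the function and parameter names are choices made for this sketch.

```python
def grouped_median(L, n, B, G, w):
    """Estimate the median of grouped data by linear interpolation.

    L: lower class boundary of the median group
    n: total number of values
    B: cumulative frequency of the groups before the median group
    G: frequency of the median group
    w: group width
    """
    return L + ((n / 2 - B) / G) * w


estimate = grouped_median(L=60.5, n=21, B=9, G=8, w=5)
print(estimate)  # 61.4375
```

The estimate of 61.4375 is close to the median of 61 found from the raw race data.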
Estimating the Mode from Grouped Data

• Again, looking at our data:

• We can easily find the modal group (the group with the
highest frequency), which is 61 - 65
• We can say "the modal group is 61 - 65"
• But the actual Mode may not even be in that group! Or
there may be more than one mode. Without the raw data
we don't really know. But we can estimate the Mode
using the following formula:

$\text{Estimated Mode} = L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w$

where:
L is the lower class boundary of the modal group
fm-1 is the frequency of the group before the modal group
fm is the frequency of the modal group
fm+1 is the frequency of the group after the modal group
w is the group width
For our example:
L = 60.5
fm-1 = 7
fm = 8
fm+1 = 4
w = 5

$\text{Estimated Mode} = 60.5 + \frac{8 - 7}{(8 - 7) + (8 - 4)} \times 5 = 60.5 + 1 = 61.5$
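A short Python sketch of the mode estimate follows. It assumes the common grouped-mode formula L + (fm - fm-1) / ((fm - fm-1) + (fm - fm+1)) * w, expressed with the quantities listed above; the function and parameter names are hypothetical.

```python
def grouped_mode(L, f_prev, f_modal, f_next, w):
    """Estimate the mode of grouped data.

    L: lower class boundary of the modal group
    f_prev, f_modal, f_next: frequencies of the group before the modal
        group, the modal group itself, and the group after it
    w: group width
    """
    d_prev = f_modal - f_prev   # how far the modal group rises above the previous one
    d_next = f_modal - f_next   # how far it rises above the next one
    return L + d_prev / (d_prev + d_next) * w


estimate = grouped_mode(L=60.5, f_prev=7, f_modal=8, f_next=4, w=5)
print(estimate)  # 61.5
```

The estimate of 61.5 sits near the lower end of the 61 - 65 group because the group before it (frequency 7) is almost as full as the modal group, while the group after it (frequency 4) is much emptier. The raw data gave a mode of 62.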
Baby Carrots Example
• Example: You grew fifty baby carrots using special soil. You dig them up and
measure their lengths (to the nearest mm) and group the results:
Age Example
• The ages of the 112 people who live on a tropical island are grouped as follows:
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively
skewed data
• Data in most real applications are not symmetric (a). They may instead
be either positively skewed (b), where the mode occurs at a value that is
smaller than the median or negatively skewed (c), where the mode
occurs at a value greater than the median.
Measuring the Dispersion of Data

• Range, Quartiles, Variance, Standard Deviation, and
Interquartile Range
• We now look at measures to assess the dispersion or spread of
numeric data. The measures include range, quartiles, percentiles,
and the interquartile range.
• The five-number summary, which can be displayed as a boxplot, is
useful in identifying outliers.
• Variance and standard deviation also indicate the spread of a
data distribution.
Measuring the Dispersion of Data (cont..)

Range
• Let x1, x2, . . . , xN be a set of observations for some numeric attribute, X.
• The range of the set is the difference between the largest (max()) and smallest
(min()) values.

Standard Deviation
• The standard deviation (usually abbreviated SD, sd, or just s) of a set of
numbers tells you how much the individual numbers tend to differ (in either
direction) from the mean. It’s calculated as follows:

$SD = \sqrt{\frac{\sum_{i=1}^{N} d_i^2}{N - 1}}, \quad d_i = X_i - \bar{X}$

• This formula says that you calculate the standard deviation of a set of N
numbers (Xi) by subtracting the mean from each value to get the deviation (di) of
each value from the mean, squaring each of these deviations, adding up the (di)^2
terms, dividing by N – 1, and then taking the square root.
Measuring the Dispersion of Data (cont..)
Standard Deviation (cont.)
• This is almost identical to the formula for
the root-mean-square deviation of the points
from the mean, except that it has N – 1 in the
denominator instead of N.
• This difference occurs because the sample
mean is used as an approximation of the true
population mean (which you don’t know). If
the true mean were available to use, the
denominator would be N.
• For an IQ example (84, 84, 89, 91, 110,
114, and 116) where the mean is 98.3, you
calculate the SD as follows:
$SD = \sqrt{\frac{(84 - 98.3)^2 + (84 - 98.3)^2 + \cdots + (116 - 98.3)^2}{7 - 1}} = 14.4$
• Standard deviations are very sensitive
to extreme values (outliers) in the data.
For example, if the highest value in the
IQ dataset had been 150 instead of 116,
the SD would have gone up from 14.4
to 23.9.
• When talking about population distributions,
the SD describes the width of the
distribution curve. The figure shows three
normal distributions. They all have a mean
of zero, but they have different standard
deviations and, therefore, different widths.
Each distribution curve has a total area of
exactly 1.0, so the peak height is smaller
when the SD is larger.
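The step-by-step recipe for the SD (subtract the mean, square, sum, divide by N – 1, take the square root) can be sketched directly in Python and applied to the IQ values quoted above. The function name `sample_sd` is a choice made for this sketch.

```python
import math


def sample_sd(values):
    """Standard deviation with N - 1 in the denominator."""
    n = len(values)
    mean = sum(values) / n
    # Squared deviation of each value from the mean.
    squared_devs = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_devs) / (n - 1))


iq = [84, 84, 89, 91, 110, 114, 116]
iq_with_outlier = [84, 84, 89, 91, 110, 114, 150]  # highest value replaced by 150

sd = sample_sd(iq)
sd_outlier = sample_sd(iq_with_outlier)

print(round(sd, 1))          # 14.4
print(round(sd_outlier, 1))  # 23.9
```

Changing a single value from 116 to 150 raises the SD from about 14.4 to about 23.9, which illustrates how sensitive the measure is to outliers.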
Why n-1 in Standard Deviation?
Bessel's correction
Why divide by n-1 rather than n?
You compute the difference between each value and the mean of those values. You
don't know the true mean of the population; all you know is the mean of your
sample. Except for the rare cases where the sample mean happens to equal the
population mean, the data will be closer to the sample mean than it will be to the true
population mean.
So the value you compute will probably be a bit smaller (and can't be larger) than
what it would be if you used the true population mean.
To make up for this, divide by n-1 rather than n. This is called Bessel's correction.

But why n-1? If you knew the sample mean, and all but one of the values, you could
calculate what that last value must be. Statisticians say there are n-1 degrees of
freedom.
Measuring the Dispersion of Data (cont..)
• Several other useful measures of dispersion are related to the SD:

• Variance: The variance of N observations, x1, x2, . . . , xN, for a numeric
attribute X is just the square of the SD. For the IQ example, the
variance = 14.4^2 = 207.36.

• Coefficient of variation: The coefficient of variation (CV) is the SD
divided by the mean. For the IQ example, CV = 14.4/98.3 = 0.1465, or
14.65 percent.
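Both measures can be checked with Python's `statistics` module, whose `variance` and `stdev` functions use the N – 1 denominator. This sketch works from the unrounded SD and mean, so its results differ slightly from the slide's figures, which start from the rounded values 14.4 and 98.3.

```python
import statistics

iq = [84, 84, 89, 91, 110, 114, 116]

sd = statistics.stdev(iq)            # sample SD (N - 1 denominator)
variance = statistics.variance(iq)   # sample variance, the square of the SD
cv = sd / statistics.mean(iq)        # coefficient of variation

print(round(variance, 2))  # 207.57
print(round(cv, 4))        # 0.1466
```

The unrounded variance is about 207.57 rather than 207.36, and the CV is about 14.66 percent rather than 14.65 percent; the gap comes only from rounding the SD and mean before squaring or dividing.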
Measuring the Dispersion of Data (cont..)

Quartiles
Quartiles divide an ordered data set into four equal parts.
The values which divide the parts are called the first, second, and third quartiles;
they are denoted by Q1, Q2, and Q3, respectively.
Q1 is the middle value of the first half of the ordered data set, Q2 is the median value
in the set, and Q3 is the middle value in the second half of the ordered data set.

Interquartile range (IQR)

A good measure of the spread of data is the interquartile range (IQR) or the
difference between Q3 and Q1. This gives us the width of the box, as well. A small
width means more consistent data values since it indicates less variation in the data or
that data values are closer together. So, IQR = Q3 - Q1
Interquartile Range = Q3-Q1
• With an Even Sample Size:
– For the sample (n=10) the median diastolic blood pressure is 71 (50% of the
values are above 71, and 50% are below).
– The quartiles can be determined in the same way we determined the median,
except we consider each half of the data set separately.
• With an Odd Sample Size:
– When the sample size is odd, the median and quartiles are determined in the
same way.
– Suppose in the previous example, the lowest value (62) were excluded, and the
sample size was n=9. The median and quartiles are indicated below.
– For the sample (n=9) the median diastolic blood pressure is 72 (50% of the
values are above 72, and 50% are below).
Outliers and Tukey Fences:
Tukey Fences
– When there are no outliers in a sample, the mean and standard
deviation are used to summarize a typical value and the variability in
the sample, respectively.
– When there are outliers in a sample, the median and interquartile range
are used to summarize a typical value and the variability in the sample,
respectively.

– Outliers are values below Q1-1.5(Q3-Q1) or above Q3+1.5(Q3-Q1)
or equivalently, values below Q1-1.5 IQR or above Q3+1.5 IQR.

– In previous example, for the diastolic blood pressures, the lower limit
is 64 - 1.5(77-64) = 44.5 and the upper limit is 77 + 1.5(77-64) = 96.5.
The diastolic blood pressures range from 62 to 81. Therefore there are
no outliers.
Example : The Full Framingham Cohort Data
• The Framingham Heart Study is a long-term, ongoing cardiovascular
cohort study on residents of the city of Framingham, Massachusetts.
The study began in 1948 with 5,209 adult subjects from Framingham,
and is now on its third generation of participants.

• Table 1 displays the means, standard deviations, medians, quartiles
and interquartile ranges for each of the continuous variables in the
subsample of n=10 participants who attended the seventh examination
of the Framingham Offspring Study.
Table 1 - Summary Statistics on n=10 Participants
• Table 2 displays the observed minimum and maximum values along
with the limits to determine outliers using the quartile rule for each
of the variables in the subsample of n=10 participants.
• Are there outliers in any of the variables? Which statistics are most
appropriate to summarize the average or typical value and the
dispersion?
Table 2 - Limits for Assessing Outliers in Characteristics Measured in the n=10 Participants

Since there are no suspected outliers in the subsample of n=10 participants, the mean and
standard deviation are the most appropriate statistics to summarize average values and
dispersion, respectively, of each of these characteristics.
Example : The Full Framingham Cohort Data (cont..)
• For clarity, we have so far used a very small subset of the Framingham Offspring
Cohort to illustrate calculations of summary statistics and determination of
outliers. For your interest, Table 3 displays the means, standard deviations,
medians, quartiles and interquartile ranges for each of the continuous variables
displayed in Table 1 in the full sample (n=3,539) of participants who attended the
seventh examination of the Framingham Offspring Study.

Table 3-Summary Statistics on Sample of (n=3,539) Participants

Example : The Full Framingham Cohort Data (cont..)
• Table 4 displays the observed minimum and maximum values
along with the limits to determine outliers using the quartile rule
for each of the variables in the full sample (n=3,539).
Table 4 - Limits for Assessing Outliers in Characteristics Presented in Table 3

Are there outliers in any of the variables?

Which statistics are most appropriate to summarize the average or typical values and
the dispersion for each variable?
Observations on example......

• In the full sample, each of the characteristics has outliers on the upper
end of the distribution as the maximum values exceed the upper limits
in each case. There are also outliers on the low end for diastolic blood
pressure and total cholesterol, since the minimums are below the lower
limits.
• For some of these characteristics, the difference between the upper
limit and the maximum (or the lower limit and the minimum) is small
(e.g., height, systolic and diastolic blood pressures), while for others
(e.g., total cholesterol, weight and body mass index) the difference is
much larger. This method for determining outliers is a popular one but
not generally applied as a hard and fast rule. In this application it
would be reasonable to present means and standard deviations for
height, systolic and diastolic blood pressures and medians and
interquartile ranges for total cholesterol, weight and body mass index.
The beauty of the normal curve:

68-95-99.7 Rule
No matter what μ and σ are,
• the area between μ-σ and μ+σ is
about 68%; the area between
μ-2σ and μ+2σ is about 95%;
• and the area between
μ-3σ and μ+3σ is about 99.7%.
• Almost all values fall within 3 standard deviations. (μ: mean, σ:
standard deviation)
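The three percentages of the 68-95-99.7 rule can be reproduced with the error function from Python's `math` module. For a normal variable, the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)), whatever μ and σ are; the function name `prob_within` is a choice made for this sketch.

```python
import math


def prob_within(k):
    """P(mu - k*sigma < X < mu + k*sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(100 * prob_within(k), 2))
# 1 68.27
# 2 95.45
# 3 99.73
```

The result does not depend on μ or σ because standardizing X to (X - μ) / σ turns every normal distribution into the same standard normal curve.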
Graphic Displays of Basic Statistical Descriptions
• Boxplot: graphic display of five-number summary
• Histogram: the x-axis shows values, the y-axis represents frequencies
• Quantile plot: each value xi is paired with fi indicating that
approximately 100 fi % of data are ≤ xi
• Quantile-quantile (q-q) plot: graphs the quantiles of one
univariate distribution against the corresponding quantiles of
another
• Scatter plot: each pair of values is a pair of coordinates and plotted
as points in the plane
Boxplot Analysis
• Boxplots are a popular way of visualizing
a distribution. A boxplot incorporates the
five-number summary as follows:
• Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and
third quartiles, i.e., the height of the box
is IQR
– The median is marked by a line within
the box
– Two lines (called whiskers) outside the
box extend to the smallest (Minimum)
and largest (Maximum) observations.
[Figure: Boxplot for the unit price data for items sold at four
branches of AllElectronics during a given time period.]
Boxplot Analysis (cont..)

Finally, boxplots often provide information about the shape of a
data set. The examples below show some common patterns.

Each of the boxplots illustrates a
different skewness pattern.
If most of the observations are
concentrated on the low end of the
scale, the distribution is skewed right;
and vice versa.
If a distribution is symmetric, the
observations will be evenly split at the
median, as shown in the middle
figure.
Histogram Analysis
• Histogram: Graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value,
not the height as in bar charts, a crucial distinction when the categories are not
of uniform width
• The categories are usually specified as non-overlapping intervals of some
variable. The categories (bars) must be adjacent
Histograms Often Tell More than Boxplots

■ The two histograms shown on
the left may have the same
boxplot representation
■ The same values for: min,
Q1, median, Q3, max
■ But they have rather different
data distributions
Quantile Plot
• Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
• Plots quantile information
– For data xi sorted in increasing order, fi indicates that
approximately 100*fi% of the data are below or equal to the
value xi
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding
quantiles of another
• View: Is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each
quantile. Unit prices of items sold at Branch 1 tend to be lower than those at
Branch 2.

Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, correlation, etc.
• Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Positively and Negatively Correlated Data

• The left half fragment is positively correlated

• The right half is negatively correlated
Uncorrelated Data
Similarity and Dissimilarity
In data mining applications such as clustering, outlier analysis, and
classification, we need to measure how similar or dissimilar one
data object is in comparison to another.
• Similarity
– A numerical measure of how alike two data objects are
– Value is higher when objects are more alike and lower when
objects are more dissimilar
– Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
– A numerical measure of how different two data objects are
– Value is higher when objects are more dissimilar and lower when
objects are more alike
– Minimum dissimilarity is often 0 and the upper limit varies
• Proximity refers to either a similarity or a dissimilarity
Data Matrix and Dissimilarity Matrix

• Data Matrix
– Object by attribute structure
– n data points (rows) with p
dimensions/attributes (columns) in the
form of a relational table

• Dissimilarity Matrix
– Object by object structure
– n data points’ distance or
proximity with respect to all pairs.
– A triangular matrix
Similarity and Dissimilarity
Types of Attribute
– Nominal Attribute
– Ordinal Attribute
– Binary Attribute
– Numeric Attribute
Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity between
two objects, x and y, with respect to a single, simple attribute.
Proximity Measure for Nominal Attributes
A nominal attribute can take two or more states.
Example: map_color is nominal with states <red, yellow, blue, green>
Dissimilarity Between Nominal Attributes
– Simple ratio of mismatches: d(i, j) = (p - m) / p

m: number of matches (attributes on which objects i and j are in the same state)
p: total number of variables/attributes
Similarity Between Nominal Attributes
– Simple ratio of matches: sim(i, j) = 1 - d(i, j) = m / p
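Proximity Measure for Nominal Attributes (Example)

Obj_Id   Test-1 (Nominal)   Test-2 (Ordinal)   Test-3 (Numeric)
1        Code A             Excellent          45
2        Code B             Fair               22
3        Code C             Good               64
4        Code A             Excellent          28

Here, only Test-1 (column 1) is a nominal attribute, so p = 1 and
d(i, j) is 0 when the two codes match and 1 otherwise.

Dissimilarity Matrix: d(i, j)

     1   2   3   4
1    0
2    1   0
3    1   1   0
4    0   1   1   0

Similarity Matrix: sim(i, j) = 1 - d(i, j)

     1   2   3   4
1    1
2    0   1
3    0   0   1
4    1   0   0   1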
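
The nominal example can be sketched in a few lines. The function name `nominal_dissimilarity` is an assumption for illustration; the data is the Test-1 column from the table above.

```python
def nominal_dissimilarity(a, b):
    """d(i, j) = (p - m) / p for two equal-length tuples of nominal values."""
    p = len(a)                                   # total number of attributes
    m = sum(1 for x, y in zip(a, b) if x == y)   # number of matches
    return (p - m) / p


# Test-1 values of objects 1..4 (a single nominal attribute, so p = 1).
test1 = [("Code A",), ("Code B",), ("Code C",), ("Code A",)]

# Lower-triangular dissimilarity matrix: row i holds d(i, 0..i).
d = [[nominal_dissimilarity(test1[i], test1[j]) for j in range(i + 1)]
     for i in range(len(test1))]
sim = [[1 - value for value in row] for row in d]

for row in d:
    print(row)
```

Only the lower triangle is stored, since d(i, j) = d(j, i).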
Proximity Measure for Binary Attributes
A contingency table is used to describe the distance between two
objects (say p and q) that have only binary attributes.

Compute dissimilarities/similarities using the following quantities:

f01 = the number of attributes where p was 0 and q was 1
f10 = the number of attributes where p was 1 and q was 0
f00 = the number of attributes where p was 0 and q was 0
f11 = the number of attributes where p was 1 and q was 1

                      Object q
                 1          0          sum
Object p   1     f11        f10        f11+f10
           0     f01        f00        f01+f00
           sum   f11+f01    f10+f00
Proximity Measure for Binary Attributes
[Dissimilarity]
• Distance measure for binary attributes

– Symmetric binary variables: d(p, q) = (f10 + f01) / (f11 + f10 + f01 + f00)

– Asymmetric binary variables: d(p, q) = (f10 + f01) / (f11 + f10 + f01)

Proximity Measure for Binary Attributes
[Similarity]
• Similarity measure for binary attributes

– Symmetric binary variables [Simple Matching Coefficient]: sim(p, q) = (f11 + f00) / (f11 + f10 + f01 + f00)

– Asymmetric binary variables [Jaccard Index Similarity]: sim(p, q) = f11 / (f11 + f10 + f01)
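
A minimal sketch of the binary measures, built from the four contingency counts. The function names and the two 0/1 vectors are assumptions for illustration, not data taken from the slides.

```python
def binary_counts(p, q):
    """Return (f11, f10, f01, f00) for two equal-length 0/1 vectors."""
    f11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    f10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    f00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    return f11, f10, f01, f00


def simple_matching(p, q):
    """Similarity for symmetric binary attributes (0-0 matches count)."""
    f11, f10, f01, f00 = binary_counts(p, q)
    return (f11 + f00) / (f11 + f10 + f01 + f00)


def jaccard(p, q):
    """Similarity for asymmetric binary attributes (0-0 matches ignored).

    Undefined when the vectors share no 1 at all (division by zero).
    """
    f11, f10, f01, _ = binary_counts(p, q)
    return f11 / (f11 + f10 + f01)


# Hypothetical objects described by six binary attributes.
p = [1, 0, 1, 0, 0, 0]
q = [1, 0, 1, 0, 1, 0]

counts = binary_counts(p, q)
smc = simple_matching(p, q)
jac = jaccard(p, q)
print(counts, round(smc, 2), round(jac, 2))
print("symmetric d:", round(1 - smc, 2), "asymmetric d:", round(1 - jac, 2))
```

The two dissimilarities are obtained as 1 minus the matching similarity, which agrees with the distance formulas for the symmetric and asymmetric cases.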

Dissimilarity between Binary Variables [Example]

– Gender is a symmetric attribute; the remaining attributes are asymmetric binary
– Let the values Y and P be 1, and the value N be 0
– Dissimilarity is computed over the asymmetric attributes only:
d(p, q) = (f10 + f01) / (f11 + f10 + f01)

Among the three patients, Jack and Mary are the most likely to have a similar disease, because the distance between them is the smallest.
Similarity between Binary Variables [Example]

– Similarity is computed over the asymmetric attributes only [Jaccard Index Similarity]:
sim(p, q) = f11 / (f11 + f10 + f01)

Among the three patients, Jack and Mary are the most likely to have a similar disease, because the similarity between them is the highest.
Proximity Measure for Numeric Attributes
[Dissimilarity]
Commonly used distance measures for computing dissimilarity are:
• Euclidean Distance: d(x, y) = sqrt( sum_{i=1..n} (xi - yi)^2 )

• Manhattan Distance: d(x, y) = sum_{i=1..n} |xi - yi|

• Minkowski Distance (generalization of Euclidean and Manhattan):
d(x, y) = ( sum_{i=1..n} |xi - yi|^p )^(1/p)
[When p = 1 it is the Manhattan distance; when p = 2 it is the Euclidean distance]

Here n is the number of dimensions (attributes), and xi and yi are
the ith attributes (components) of data objects x and y.
Proximity Measure for Numeric Attributes
[Dissimilarity- Euclidean Distance]
Data Matrix

Dissimilarity Matrix
(with Euclidean Distance)

      x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0
Proximity Measure for Numeric Attributes
[Dissimilarity- Manhattan Distance]
Data Matrix

Dissimilarity Matrix
(with Manhattan Distance)

      x1   x2   x3   x4
x1    0
x2    5    0
x3    3    6    0
x4    6    1    7    0
Proximity Measure for Numeric Attributes
[Dissimilarity- Minkowski Distance]
Data Matrix

Minkowski Distance: d(x, y) = ( sum_{i=1..n} |xi - yi|^p )^(1/p)

Here p is a real number >= 1

Minkowski distance with p = 1 is the Manhattan distance:

      x1   x2   x3   x4
x1    0
x2    5    0
x3    3    6    0
x4    6    1    7    0

Minkowski distance with p = 2 is the Euclidean distance:

      x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0
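
A minimal sketch of the Minkowski distance and the two matrices it produces. The data matrix is not reproduced in the text, so the four two-dimensional points below are an assumption, chosen so that the Manhattan distances agree with the matrix on the slide. `minkowski` and `lower_triangle` are assumed helper names.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p (p >= 1) between two numeric tuples."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def lower_triangle(points, p):
    """Lower-triangular dissimilarity matrix, rounded to 2 decimals."""
    names = list(points)
    return [[round(minkowski(points[a], points[b], p), 2)
             for b in names[:i + 1]]
            for i, a in enumerate(names)]


# Assumed data matrix (not given in the text): four objects, two attributes.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

manhattan = lower_triangle(points, 1)   # p = 1
euclidean = lower_triangle(points, 2)   # p = 2

for name, row in zip(points, euclidean):
    print(name, row)
```

Because both matrices come from the same function, the only difference between the Manhattan and Euclidean results is the value of p.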
Proximity Measure for Ordinal Attributes
■ Ordinal variables can be discrete or continuous. Their values have a
meaningful order or ranking, but the magnitude between successive
ranks is unknown.
■ Examples:
■ Attribute size: <small, medium, large>
■ Attribute temperature: <-30 to -10: cold, -10 to 10: moderate, 10
to 30: warm>
■ For an attribute f, let Mf be the total number of possible states.
The ordered states then define the ranking <1, ..., Mf>.
■ For an object i, let rif be the rank of its value for attribute f,
with rif in {1, ..., Mf}.
■ Since ordinal attributes can have different numbers of states, it
is necessary to normalize the range onto [0.0, 1.0]. Each rank rif
is normalized to zif = (rif - 1) / (Mf - 1), and the dissimilarity
matrix is then computed on the zif values.
Proximity Measure for Ordinal Attributes (Example)

Obj_Id   Test-1 (Nominal)   Test-2 (Ordinal)   Test-3 (Numeric)
1        Code A             Excellent          45
2        Code B             Fair               22
3        Code C             Good               64
4        Code A             Excellent          28

Only Test-2 is ordinal. Its three states are ranked Fair = 1, Good = 2,
Excellent = 3, so Mf = 3. Its normalized form is:

Obj_Id   Test-2 (Ordinal)   Rank (rif)   Normalized Rank (zif)
1        Excellent          3            1
2        Fair               1            0
3        Good               2            0.5
4        Excellent          3            1

Dissimilarity Matrix: d(i, j) = |zif - zjf|

     1     2     3     4
1    0
2    1     0
3    0.5   0.5   0
4    0     1     0.5   0

Similarity Matrix: sim(i, j) = 1 - d(i, j)

     1     2     3     4
1    1
2    0     1
3    0.5   0.5   1
4    1     0     0.5   1
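
The ordinal procedure (rank, normalize, then measure distance) can be sketched as follows. The names `order`, `rank` and `normalize` are assumptions for illustration; the data is the Test-2 column, and the distance on the normalized ranks is taken as the absolute difference, which in one dimension is the same for Manhattan and Euclidean.

```python
# States of the ordinal attribute in increasing order; ranks start at 1.
order = ["Fair", "Good", "Excellent"]
rank = {state: r for r, state in enumerate(order, start=1)}


def normalize(state):
    """Map a state to z = (r - 1) / (M - 1), which lies in [0, 1]."""
    m_f = len(order)
    return (rank[state] - 1) / (m_f - 1)


# Test-2 values of objects 1..4.
test2 = ["Excellent", "Fair", "Good", "Excellent"]
z = [normalize(state) for state in test2]

# Lower-triangular dissimilarity matrix on the normalized ranks.
d = [[abs(z[i] - z[j]) for j in range(i + 1)] for i in range(len(z))]

print(z)
for row in d:
    print(row)
```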
Proximity Measure for Mixed Attributes

■ Real databases mostly contain attributes of mixed types. A simple
approach is to process all attribute types together in a single
analysis, producing a single dissimilarity matrix on a common scale
of [0.0, 1.0] (normalized form).
■ Let the data set contain p attributes of mixed type. The dissimilarity
d(i, j) between object i and object j is:

d(i, j) = ( sum_{f=1..p} δij(f) * dij(f) ) / ( sum_{f=1..p} δij(f) )

where
dij(f) is the dissimilarity between objects i and j with respect to attribute f
δij(f) is the indicator:
δij(f) = 0 if xif or xjf is missing, or if xif = xjf = 0 and
attribute f is an asymmetric binary attribute
δij(f) = 1 otherwise
Proximity Measure for Mixed Attributes (Example)

Obj_Id   Test-1 (Nominal)   Test-2 (Ordinal)   Test-3 (Numeric)
1        Code A             Excellent          45
2        Code B             Fair               22
3        Code C             Good               64
4        Code A             Excellent          28

Dissimilarity Matrix of Numeric Attribute (Test-3)
max(h) = 64 and min(h) = 22, so max(h) - min(h) = 42
dij(f) = |xif - xjf| / (max(h) - min(h))

     1      2      3      4
1    0
2    0.55   0
3    0.45   1.00   0
4    0.40   0.14   0.86   0

Dissimilarity Matrix of Nominal Attribute (Test-1)

     1   2   3   4
1    0
2    1   0
3    1   1   0
4    0   1   1   0

Dissimilarity Matrix of Ordinal Attribute (Test-2)

     1     2     3     4
1    0
2    1     0
3    0.5   0.5   0
4    0     1     0.5   0
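Proximity Measure for Mixed Attributes

Using the generalized formula for mixed-attribute dissimilarity, every
indicator δij(f) is 1 here (no missing values and no asymmetric binary
attributes), so d(i, j) is the average of the three per-attribute
dissimilarities:

     1      2      3      4
1    0
2    0.85   0
3    0.65   0.83   0
4    0.13   0.71   0.79   0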
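
A minimal sketch of the mixed-attribute computation for the four-object table. It assumes the usual per-type contributions (0/1 mismatch for the nominal attribute, difference of normalized ranks for the ordinal attribute, range-normalized absolute difference for the numeric attribute) and that every indicator equals 1, since the table has no missing values and no asymmetric binary attributes. All function and variable names are hypothetical.

```python
# (Test-1 nominal, Test-2 ordinal, Test-3 numeric) for objects 1..4.
rows = [
    ("Code A", "Excellent", 45),
    ("Code B", "Fair", 22),
    ("Code C", "Good", 64),
    ("Code A", "Excellent", 28),
]

order = ["Fair", "Good", "Excellent"]             # ordinal states, low to high
numeric = [row[2] for row in rows]
value_range = max(numeric) - min(numeric)         # 64 - 22 = 42


def z(state):
    """Normalized rank (r - 1) / (M - 1) of an ordinal state."""
    return order.index(state) / (len(order) - 1)


def mixed_dissimilarity(a, b):
    """Average of the per-attribute dissimilarities (all indicators are 1)."""
    contributions = [
        0.0 if a[0] == b[0] else 1.0,             # nominal: mismatch
        abs(z(a[1]) - z(b[1])),                   # ordinal: normalized ranks
        abs(a[2] - b[2]) / value_range,           # numeric: range-normalized
    ]
    return sum(contributions) / len(contributions)


# Lower-triangular dissimilarity matrix, rounded to 2 decimals.
d = [[round(mixed_dissimilarity(rows[i], rows[j]), 2) for j in range(i + 1)]
     for i in range(len(rows))]

for row in d:
    print(row)
```

Objects 1 and 4 come out as the closest pair, because they agree on both the nominal and the ordinal attribute and differ only on the numeric one.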
