Data Mining - Data Objects and Attributes
Numeric Attributes
◻ The values of a numeric attribute are ordered, so they can be ranked. In addition, we can quantify the
difference between values. For example, a temperature of 20°C is five degrees higher
than a temperature of 15°C.
◻ Calendar dates are another example. For instance, the years 2012 and 2020 are eight
years apart.
2.1.6 Discrete versus Continuous Attributes
◻ There are many ways to organize attribute types, and the types are not mutually exclusive.
◻ Classification algorithms developed from the field of machine learning often talk of
attributes as being either discrete or continuous.
◻ A discrete attribute has a finite or countably infinite set of values, which may or
may not be represented as integers.
◻ Example:
◻ Finite attributes: The attributes hair color, smoker, medical test, and drink size each
have a finite number of values, and so are discrete.
◻ Discrete attributes may have numeric values, such as 0 and 1 for binary attributes
or the values 0 to 110 for the attribute age.
◻ Countably infinite attributes:
◻ The attribute customer ID is countably infinite: the set of possible values is unbounded,
but the values can be put in one-to-one correspondence with the natural numbers.
◻ Zip codes are another example.
Mean
◻ Suppose that we have some numeric attribute X, like salary, which has been recorded for a set
of objects.
◻ Let x1, x2, ..., xN be the set of N observed values or observations for X.
◻ Measures of central tendency include the mean, median, mode, and midrange.
◻ The most common and effective numeric measure of the “center” of a set of data is
the (arithmetic) mean. Writing μ for the mean, the mean of the N observed values is
\[ \mu = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N} \]
◻ Trimmed mean:
🞑 Although the mean is the single most useful quantity for describing a data set, it is
not always the best way of measuring the center of the data.
🞑 A major problem with the mean is its sensitivity to extreme (e.g., outlier)
values.
🞑 Example: The mean salary at a company may be substantially pushed up by that
of a few highly paid managers.
🞑 Similarly, the mean score of a class in an exam could be pulled down quite a bit by
a few very low scores.
🞑 To offset the effect caused by a small number of extreme values, we can instead
use the trimmed mean, which is the mean obtained after chopping off values at
the high and low extremes.
🞑 Example: remove the top and bottom 2% of the salaries before computing the mean. We
should avoid trimming too large a portion (such as 20%) at both ends, as this can
result in the loss of valuable information.
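The mean and trimmed mean described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper name `trimmed_mean` and the rule of cutting `int(N * proportion)` values from each end are assumptions, since the slides do not fix a rounding rule. The salary values (in thousands of dollars) are the ones used in the examples of this deck.

```python
from statistics import mean

def trimmed_mean(values, proportion):
    """Mean after dropping the lowest and highest `proportion` of the sorted values.

    Assumption: int(N * proportion) values are cut from each end.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    return mean(ordered[k:len(ordered) - k])

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $1000s

plain = mean(salaries)                   # pulled upward by the extreme value 110
trimmed = trimmed_mean(salaries, 0.10)   # drops 30 and 110 before averaging
print(plain, trimmed)
```

With only 12 values, a 2% trim would remove nothing under this rule, which is why the sketch uses 10%.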
Skewness & Symmetry
[Figure: three distributions: symmetric, positively skewed, and negatively skewed]
Median
◻ For skewed (asymmetric) data, a better measure of the center of data is the
median, which is the middle value in a set of ordered data values.
◻ It is the value that separates the higher half of a data set from the lower half.
◻ Suppose that a given data set of N values for an attribute X is sorted in increasing
order. If N is odd, then the median is the middle value of the ordered set.
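◻ If N is even, then the median is not unique; it is the two middlemost values and any
value in between. By convention, for a numeric attribute we take the average of the two middlemost values.
◻ Example (N = 12, even): 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
◻ The two middlemost values are 52 and 56, so the median is (52 + 56)/2 = 108/2 = 54.
◻ Example (N = 11, odd): 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70.
◻ The median is the middle (sixth) value, 52.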
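The even and odd cases above can be checked with Python's standard `statistics.median`, which averages the two middlemost values when N is even. The variable names are illustrative only.

```python
from statistics import median

even_case = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # N = 12
odd_case = even_case[:-1]                                       # same values without 110, N = 11

median_even = median(even_case)  # average of the two middlemost values, 52 and 56
median_odd = median(odd_case)    # the single middle value
print(median_even, median_odd)
```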
◻ Assume that data are grouped in intervals according to their xi data values and that
the frequency (i.e., number of data values) of each interval is known.
◻ For example, employees may be grouped according to their annual salary in
intervals such as $10–20,000, $20–30,000, and so on.
◻ Let the interval that contains the median frequency be the median interval.
◻ We can approximate the median of the entire data set (e.g., the median salary) by
interpolation using the formula
\[ \text{median} = L_1 + \left( \frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \times \text{width} \]
◻ where L_1 is the lower boundary of the median interval, N is the number of values in
the entire data set, (∑ freq)_l is the sum of the frequencies of all of the intervals that
are lower than the median interval, freq_median is the frequency of the median
interval, and width is the width of the median interval.
Mode
◻ The mode for a set of data is the value that occurs most frequently in the set.
◻ Data sets with one, two, or three modes are respectively called
unimodal, bimodal, and trimodal.
◻ In general, a data set with two or more modes is multimodal.
◻ At the other extreme, if each data value occurs only once, then there is no mode.
◻ Example (salaries in thousands of dollars): 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
◻ The values 52 and 70 each occur twice, so the data are bimodal: the two modes are $52,000 and $70,000.
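A small sketch of the mode rules above, using only the standard library. The helper name `modes` is hypothetical, and returning an empty list when every value occurs once is one way to encode the "no mode" case from the slides.

```python
from collections import Counter

def modes(values):
    """Return all values with the highest count, or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # each value occurs only once: no mode
    return sorted(v for v, c in counts.items() if c == top)

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $1000s
salary_modes = modes(salaries)
print(salary_modes, len(salary_modes))  # the length tells unimodal / bimodal / trimodal
```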
Midrange
◻ The midrange can also be used to assess the central tendency of a numeric data set.
◻ It is the average of the largest and smallest values in the set.
◻ This measure is easy to compute using the SQL aggregate functions, max() and
min().
◻ Example: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
◻ (30,000+110,000)/2 = $70,000
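The midrange needs only the smallest and largest values, so a sketch is short. The function name `midrange` is an illustrative choice; Python's built-in `min` and `max` play the role of the SQL aggregates mentioned above.

```python
def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $1000s
salary_midrange = midrange(salaries)
print(salary_midrange)
```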
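◻ Solve: Find the mean of the following 21 values (in seconds): 59, 65, 61, 62, 53, 55, 60, 70, 64, 56, 58, 58, 62, 62, 68, 65, 56, 59, 68, 61, 67
◻ Ans:
Mean = (59 + 65 + 61 + 62 + 53 + 55 + 60 + 70 + 64 + 56 + 58 + 58 + 62 + 62 + 68 + 65 + 56 + 59 + 68 + 61 + 67) / 21
     = 1289 / 21 = 61.38095...
◻ Grouping the same values into intervals gives the frequency table:
Seconds    Frequency
51 - 55    2
56 - 60    7
61 - 65    8
66 - 70    4
◻ The mean can be estimated from the grouped data by using the midpoint of each interval:
Midpoint   Frequency   Midpoint × Frequency
53         2           106
58         7           406
63         8           504
68         4           272
Totals:    21          1288
Estimated Mean = 1288 / 21 = 61.333...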
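The midpoint-based estimate of the mean from the grouped table above can be sketched as below. The tuple layout `(lower limit, upper limit, frequency)` and the function name `estimated_mean` are assumptions made for this illustration.

```python
# (lower limit, upper limit, frequency) for each interval of the exercise
groups = [(51, 55, 2), (56, 60, 7), (61, 65, 8), (66, 70, 4)]

def estimated_mean(groups):
    """Weight each interval midpoint by its frequency and divide by the total count."""
    total = sum(freq for _, _, freq in groups)
    weighted = sum((low + high) / 2 * freq for low, high, freq in groups)
    return weighted / total

est = estimated_mean(groups)
print(round(est, 3))
```

The estimate differs slightly from the exact mean of the raw values because every value in an interval is treated as if it sat at the midpoint.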
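◻ Solve: Estimate the median from the grouped data:
Seconds    Frequency
51 - 55    2
56 - 60    7
61 - 65    8
66 - 70    4
◻ Ans: N = 21, so N/2 = 10.5. The cumulative frequencies are 2, 9, 17, and 21, so the median interval is 61 - 65, with boundaries 60.5 and 65.5 (width 5). The intervals below it have a total frequency of 2 + 7 = 9, and the median interval has frequency 8.
Estimated Median = 60.5 + ((10.5 − 9) / 8) × 5 = 61.4375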
◻ Another Example:
Seconds    Frequency
50 - 55    2
55 - 60    7
60 - 65    8
65 - 70    4
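The interpolation formula for the median of grouped data can be applied to both grouped tables above with a short sketch. The function name `grouped_median` and the `(lower boundary, upper boundary, frequency)` layout are assumptions. For the first table the class boundaries are taken halfway between adjacent limits (60.5, 65.5, and so on); for the second table the intervals already share their boundaries.

```python
def grouped_median(intervals):
    """Approximate the median by interpolation within the median interval.

    intervals: (lower boundary, upper boundary, frequency) tuples in ascending order.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    below = 0  # sum of the frequencies of the intervals below the current one
    for lower, upper, freq in intervals:
        if freq > 0 and below + freq >= n / 2:
            # L1 + ((N/2 - (sum freq)_l) / freq_median) * width
            return lower + (n / 2 - below) / freq * (upper - lower)
        below += freq
    raise ValueError("median interval not found")

# First table: limits 51-55, 56-60, ... give boundaries 50.5, 55.5, 60.5, ...
first = [(50.5, 55.5, 2), (55.5, 60.5, 7), (60.5, 65.5, 8), (65.5, 70.5, 4)]
# Second table: intervals 50-55, 55-60, ... share their boundaries
second = [(50, 55, 2), (55, 60, 7), (60, 65, 8), (65, 70, 4)]

median_first = grouped_median(first)
median_second = grouped_median(second)
print(median_first, median_second)
```

The frequencies are the same in both tables, so the two estimates differ only by the half-unit shift in the lower boundary of the median interval.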
2.2.2 Measuring the Dispersion of Data
[Figure: a data distribution divided into four parts of 25% each by Q1 (25th percentile), Q2 (median), and Q3 (75th percentile)]
Range, Quartiles, and Interquartile Range
◻ The kth q-quantile for a given data distribution is the value x such that at most k/q
of the data values are less than x and at most (q − k)/q of the data values are
more than x, where k is an integer such that 0 < k < q. There are q − 1 q-quantiles.
◻ The 2-quantile is the data point dividing the lower and upper halves of the data
distribution. It corresponds to the median.
◻ The 4-quantiles are the three data points that split the data distribution into four
equal parts; each part represents one-fourth of the data distribution. They are more
commonly referred to as quartiles.
◻ The 100-quantiles are more commonly referred to as percentiles; they divide the
data distribution into 100 equal-sized consecutive sets. The median, quartiles, and
percentiles are the most widely used forms of quantiles.
◻ The first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of
the data.
◻ The third quartile, denoted by Q3, is the 75th percentile. It cuts off the lowest 75%
(or highest 25%) of the data.
◻ The second quartile is the 50th percentile. As the median, it gives the center of the
data distribution.
◻ The distance between the first and third quartiles is a simple measure of spread that
gives the range covered by the middle half of the data. This distance is called the
interquartile range (IQR) and is defined as
IQR = Q3 − Q1
◻ Ex (salaries in thousands of dollars): 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
◻ The median is Q2 = (52 + 56)/2 = 54, that is, $54,000.
◻ No single numeric measure of spread (e.g., IQR) is very useful for describing
skewed distributions.
◻ In the symmetric distribution, the median (and other measures of central
tendency) splits the data into equal-size halves. This does not occur for skewed
distributions.
◻ Therefore, it is more informative to also provide the two quartiles Q1 and Q3, along
with the median.
◻ A common rule of thumb for identifying suspected outliers is to single out values
falling at least 1.5 × IQR above the third quartile or below the first quartile.
◻ Because Q1, the median, and Q3 together contain no information about the endpoints
(e.g., tails) of the data, a fuller summary of the shape of a distribution can be
obtained by providing the lowest and highest data values as well. This is known as
the five-number summary.
◻ Five-number summary: Minimum, Q1, Median, Q3, Maximum
Five-Number Summary, Boxplots, and Outliers
◻ Example: Let the data be 199, 201, 236, 269, 271, 278, 283, 291, 301, 303, and 441.
◻ Min = 199, Q1 = 236, Q2 = 278, Q3 = 301, Max = 441
◻ IQR = Q3 − Q1 = 301 − 236 = 65
◻ Upper limit = Q3 + 1.5 × IQR = 301 + 97.5 = 398.5
◻ Lower limit = Q1 − 1.5 × IQR = 236 − 97.5 = 138.5
◻ The value 441 lies above the upper limit, so it is a suspected outlier.
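The five-number summary and the 1.5 × IQR limits for this example can be reproduced with the standard library. Quartile conventions differ between textbooks and tools; this sketch assumes the default "exclusive" method of `statistics.quantiles`, which happens to agree with the quartiles used in the example.

```python
from statistics import quantiles

data = [199, 201, 236, 269, 271, 278, 283, 291, 301, 303, 441]

q1, q2, q3 = quantiles(data, n=4)  # assumption: default "exclusive" method
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

five_number = (min(data), q1, q2, q3, max(data))
# Rule of thumb: values at least 1.5 x IQR below Q1 or above Q3 are suspected outliers
outliers = [x for x in data if x <= lower_limit or x >= upper_limit]
print(five_number, iqr, lower_limit, upper_limit, outliers)
```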
[Figure: boxplot of the example data, marking the minimum 199, Q1 = 236, the median 278, Q3 = 301, and IQR = 65]
◻ For branch 1, Q2 (the median) is $80, Q1 is $60, and Q3 is $100. Notice that two outlying observations for this
branch were plotted individually, as their values of 175 and 202 lie more than 1.5
times the IQR (here, 40) above Q3, that is, above 100 + 1.5 × 40 = 160.
[Figure: boxplot of unit price ($) for Branch 1; vertical axis from 20 to 220]
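Variance and Standard Deviation
◻ The variance of N observations x1, x2, ..., xN with mean μ is
\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \mu^2 \]
◻ The standard deviation σ is the square root of the variance.
◻ Example: for the salary data 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 (in thousands of dollars, mean 58), σ² ≈ 379.17, so
σ ≈ √379.17 ≈ 19.47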
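The value σ ≈ 19.47 can be checked with a short sketch that evaluates both forms of the variance, the definition and the shortcut that subtracts the squared mean from the mean of the squares. It assumes the salary values (in thousands of dollars) used in the earlier examples of this deck; the variable names are illustrative.

```python
from math import sqrt

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $1000s
n = len(salaries)
mu = sum(salaries) / n

# Definition: mean squared deviation from the mean
var_definition = sum((x - mu) ** 2 for x in salaries) / n
# Shortcut: mean of the squares minus the square of the mean
var_shortcut = sum(x * x for x in salaries) / n - mu ** 2

sigma = sqrt(var_definition)
print(round(var_definition, 2), round(var_shortcut, 2), round(sigma, 2))
```

Both forms are population formulas (division by N), matching the notation above rather than the sample variance that divides by N − 1.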
Covariance and correlation analysis
Example: Consider stock prices observed at five time points for AllElectronics and HighTech, a
high-tech company. If the stocks are affected by the same industry trends, will their
prices rise or fall together?
The covariance of the two price series turns out to be positive. Therefore, we can say that the stock prices of both companies rise
together.
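A minimal sketch of the covariance computation behind this example. The price table from the slide did not survive, so the five price pairs below are illustrative values assumed for this sketch, and the function name `covariance` is hypothetical. The covariance is computed as the mean product of the deviations from each series' mean.

```python
# Assumption: illustrative prices at five time points (the slide's table is not available)
all_electronics = [6, 5, 4, 3, 2]
high_tech = [20, 10, 14, 5, 5]

def covariance(a, b):
    """Population covariance: mean of (a_i - mean_a) * (b_i - mean_b)."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty series of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

cov = covariance(all_electronics, high_tech)
print(round(cov, 2), "rise together" if cov > 0 else "do not rise together")
```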
Correlation coefficient for numeric data
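The slide's formula for this heading is not available in the text, so the sketch below uses the standard Pearson product-moment coefficient: the covariance divided by the product of the two standard deviations. The function name `pearson` and the small data sets are hypothetical, chosen only to show a positive and a negative result.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient, a value between -1 and +1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co / (sx * sy)  # the 1/N factors cancel between numerator and denominator

x = [1, 2, 3, 4, 5]          # hypothetical data
rising = [2, 4, 5, 4, 6]     # tends to increase with x
falling = [6, 4, 5, 4, 2]    # tends to decrease with x

r_pos = pearson(x, rising)
r_neg = pearson(x, falling)
print(round(r_pos, 3), round(r_neg, 3))
```

A value above 0 indicates positive correlation, a value below 0 negative correlation, and a value near 0 no linear relationship.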
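For nominal attributes A and B, the χ² statistic tests the hypothesis that A and B are independent, that
is, there is no correlation between them. The test is based on a
significance level, with (r − 1) × (c − 1) degrees of freedom, where r and c are the numbers of distinct values of A and B.
If the hypothesis can be rejected, then we say that A and B are statistically correlated.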
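The χ² computation can be sketched for a contingency table of observed counts. The 2 × 2 table below is hypothetical data for two nominal attributes, and the function name `chi_square` is an assumption. Expected counts are derived from the row and column totals under the independence hypothesis.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            stat += (obs - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)    # (r - 1) x (c - 1)
    return stat, dof

# Hypothetical counts: rows are the values of attribute A, columns the values of B
table = [[250, 200],
         [50, 1000]]
stat, dof = chi_square(table)
print(round(stat, 2), dof)
```

For 1 degree of freedom, the χ² value needed to reject independence at the 0.001 significance level is 10.828, so a statistic far above that indicates strongly correlated attributes.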
2.2.3 Graphic Displays of Basic Statistical Descriptions of Data
◻ A quantile plot is a simple and effective way to have a first look at a univariate data
distribution.
🞑 First, it displays all of the data for the given attribute (allowing the user to assess
both the overall behavior and unusual occurrences).
🞑 Second, it plots quantile information.
◻ Let xi, for i = 1 to N, be the data sorted in increasing order so that x1 is the smallest
observation and xN is the largest for some ordinal or numeric attribute X.
◻ Each observation, xi, is paired with a percentage, fi, which
indicates that approximately fi × 100% of the data are below the value, xi:
\[ f_i = \frac{i - 0.5}{N} \]
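The pairing of each sorted observation with its f-value can be sketched as below; plotting the resulting pairs with f on the horizontal axis gives the quantile plot. The function name `quantile_plot_points` is hypothetical, and the salary values are reused from the earlier examples only as sample input.

```python
def quantile_plot_points(values):
    """Pair each sorted observation x_i with f_i = (i - 0.5) / N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $1000s
points = quantile_plot_points(salaries)
for f, x in points:
    print(f"{f:.3f}  {x}")
```

The 0.5 offset keeps every f-value strictly between 0 and 1, so neither the smallest nor the largest observation is placed at an extreme.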
Quantile Plot
[Figure: quantile plot; horizontal axis: f-value from 0.00 to 1.00; Q1 and the median are marked on the curve]
Quantile-Quantile Plot
[Figure: quantile-quantile plot of Branch 2 (unit price $) against Branch 1 (unit price $), both axes from 40 to 120, with Q1, the median, and Q3 marked]
Histograms
[Figure: histogram of count of items sold (0 to 6000) by unit price ($), with bins 40–59, 60–79, 80–99, 100–119, and 120–139]
Scatter Plots and Data Correlation
◻ A scatter plot is one of the most effective graphical methods for determining if there
appears to be a relationship, pattern, or trend between two numeric attributes.
◻ A scatter plot provides a first look at bivariate data to see clusters of points and outliers, or to explore the
possibility of correlation relationships.
◻ Two attributes, X and Y, are correlated if one attribute implies the other. Correlations can be positive, negative, or null (uncorrelated).
[Figure: scatter plots illustrating positive correlation, negative correlation, and uncorrelated data]