Data Mining - Data Objects and Attributes

sastra

Uploaded by

divya28032006

Unit - I

Data Mining

(Contents: Text book 2 - Chapter 2)
Outline

◻ Data Objects and Attribute Types
🞑 What Is an Attribute?
🞑 Nominal Attributes
🞑 Binary Attributes
🞑 Ordinal Attributes
🞑 Numeric Attributes
🞑 Discrete versus Continuous Attributes
2.1 Data Objects and Attribute Types

◻ Data sets are made up of data objects.
◻ A data object represents an entity:
🞑 in a sales database, the objects may be customers, store items, and sales;
🞑 in a medical database, the objects may be patients;
🞑 in a university database, the objects may be students, professors, and courses.
◻ Data objects are typically described by attributes.
◻ Data objects can also be referred to as samples, examples, instances, data
points, or objects.
◻ If the data objects are stored in a database, they are data tuples. That is, the rows of a
database correspond to the data objects, and the columns correspond to the
attributes.
2.1.1 What Is an Attribute?

◻ An attribute is a data field, representing a characteristic or feature of a data object.
◻ The nouns attribute, dimension, feature, and variable are often used interchangeably.
🞑 The term dimension is commonly used in data warehousing.
🞑 Machine learning literature tends to use the term feature.
🞑 Statisticians prefer the term variable.
🞑 Data mining and database professionals commonly use the term attribute, so we use that term here.
◻ Ex: Attributes describing a customer object can include customer ID, name, and address.
◻ Observed values for a given attribute are known as observations.
◻ A set of attributes used to describe a given object is called an attribute vector (or
feature vector).
◻ The distribution of data involving one attribute (or variable) is called univariate. A
bivariate distribution involves two attributes, and so on.
◻ Types:
🞑 The type of an attribute is determined by the set of possible values the attribute can have:
■ Nominal
■ Binary
■ Ordinal
■ Numeric (interval-scaled or ratio-scaled)
2.1.2 Nominal Attributes

◻ Nominal means “relating to names.” The values of a nominal attribute are symbols or names of things.
◻ Also referred to as categorical.
◻ Example
🞑 Suppose that hair color and marital status are two attributes describing person
objects.
🞑 In our application, possible values for hair color are black, brown, blond, red,
auburn, gray, and white.
🞑 The attribute marital status can take on the values single, married, divorced, and widowed.
🞑 Another example of a nominal attribute is occupation, with the values teacher,
dentist, programmer, farmer, and so on.
2.1.3 Binary Attributes

◻ A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present.
◻ Binary attributes are referred to as Boolean if the two states correspond to true
and false.
◻ Example:
◻ Suppose a patient undergoes a medical test that has two possible outcomes. The attribute medical test is binary, where a value of 1 means the result of the test for the patient is positive, while 0 means the result is negative.
◻ A binary attribute is symmetric if both of its states are equally valuable and carry
the same weight; that is, there is no preference on which outcome should be coded
as 0 or 1. One such example could be the attribute gender having the states male and
female.
◻ A binary attribute is asymmetric if the outcomes of the states are not equally important, such as the positive and negative outcomes of a medical test for cancer.
2.1.4 Ordinal Attributes

◻ An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.
◻ Example:
◻ Suppose that drink size corresponds to the size of drinks available at a fast-food restaurant.
◻ This ordinal attribute has three possible values: small, medium, and large. The values have a meaningful sequence, but they do not tell us how much bigger a large is than a medium.
◻ Other examples of ordinal attributes include grade (e.g., A+, A, A−, B+, and so on) and professional rank (assistant, associate, and full professor).
◻ Note:
◻ Nominal, binary, and ordinal attributes are qualitative. That is, they describe a feature of an object without giving an actual size or quantity.
◻ Numeric attributes, in contrast, provide quantitative measurements of an object.
2.1.5 Numeric Attributes

◻ A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values.
◻ Numeric attributes can be interval-scaled or ratio-scaled.
◻ Interval-Scaled Attributes are measured on a scale of equal-size units.
◻ The values of interval-scaled attributes have order and can be positive, 0, or negative.
◻ In addition to providing a ranking of values, such attributes allow us to compare and
quantify the difference between values.
◻ Example: A temperature attribute is interval-scaled.
◻ Suppose that we have the outdoor temperature value for a number of different days, where each day is an object.
◻ By ordering the values, we obtain a ranking of the objects with respect to temperature. In addition, we can quantify the difference between values. For example, a temperature of 20 °C is five degrees higher than a temperature of 15 °C.
◻ Calendar dates are another example. For instance, the years 2012 and 2020 are eight years apart.
Numeric Attributes

◻ A ratio-scaled attribute is a numeric attribute with an inherent zero-point.
◻ That is, if a measurement is ratio-scaled, we can speak of a value as being a
multiple (or ratio) of another value.
◻ In addition, the values are ordered, and we can also compute the difference
between values, as well as the mean, median, and mode.
◻ Example:
◻ Unlike temperatures in Celsius and Fahrenheit, the Kelvin (K) temperature scale has what is considered a true zero-point (0 K = −273.15 °C).
◻ It is the point at which the particles that comprise matter have zero kinetic energy.
◻ Other examples of ratio-scaled attributes include count attributes such as years of
experience (e.g., the objects are employees) and number of words (e.g., the objects are
documents).
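The difference between interval-scaled and ratio-scaled attributes can be illustrated with a small Python sketch. The temperatures and the helper name `celsius_to_kelvin` are assumptions chosen for illustration only.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert a Celsius temperature to Kelvin (0 K = -273.15 °C)."""
    return celsius + 273.15


warm_c, cool_c = 20.0, 10.0  # illustrative values, not from the slides

# Differences are meaningful on an interval scale.
difference_c = warm_c - cool_c

# This ratio depends on the arbitrary Celsius zero, so it has no physical meaning:
# 20 °C is not "twice as warm" as 10 °C.
naive_ratio = warm_c / cool_c

# Kelvin has a true zero-point, so the ratio of Kelvin values is meaningful.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(difference_c, naive_ratio, round(kelvin_ratio, 4))
```

The Celsius ratio is 2, while the Kelvin ratio is only slightly above 1, which is why ratios should be computed only for ratio-scaled attributes.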
2.1.6 Discrete versus Continuous Attributes

◻ There are many ways to organize attribute types, and the types are not mutually exclusive.
◻ Classification algorithms developed from the field of machine learning often talk of attributes as being either discrete or continuous.
◻ A discrete attribute has a finite or countably infinite set of values, which may or may not be represented as integers.
◻ Example:
◻ Finite attributes: The attributes hair color, smoker, medical test, and drink size each
have a finite number of values, and so are discrete.
◻ Discrete attributes may have numeric values, such as 0 and 1 for binary attributes
or, the values 0 to 110 for the attribute age.
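◻ Countably infinite attributes:
◻ The attribute customer ID is countably infinite: the number of customers can grow without bound, but the actual set of values can be put in one-to-one correspondence with the integers.
◻ Zip codes are another example.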
◻ If an attribute is not discrete, it is continuous.
◻ Example: Real numbers
◻ In practice, real values are represented using a finite number of digits.
◻ Continuous attributes are typically represented as floating-point variables.
2.2 Basic Statistical Descriptions of Data

◻ For data pre-processing to be successful, it is essential to have an overall picture of your data.
◻ Basic statistical descriptions can be used to identify properties of the data and
highlight which data values should be treated as noise or outliers.
🞑 Measures of central tendency measure the location of the middle or center of a data distribution. Given an attribute, where do most of its values fall? Examples are the mean, median, mode, and midrange.
🞑 Measures of dispersion describe how the data are spread out. The most common data dispersion measures are the range, quartiles, and interquartile range; the five-number summary and boxplots; and the variance and standard deviation of the data. These measures are useful for identifying outliers.
🞑 Graphical displays let us visually inspect our data. Examples are bar charts, pie charts, and line graphs. Other popular displays of data summaries and distributions include quantile plots, quantile–quantile plots, histograms, and scatter plots.
2.2.1 Measuring the Central Tendency: Mean, Median, and Mode

◻ Suppose that we have some attribute X, like salary, which has been recorded for a set of objects.
◻ Let x1,x2,...,xN be the set of N observed values or observations for X.
◻ Measures of central tendency include the mean, median, mode, and midrange.
◻ The most common and effective numeric measure of the “center” of a set of data is the (arithmetic) mean. The mean of this set of values is

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
Mean

◻ Example: Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.

$$\bar{x} = \frac{30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110}{12} = \frac{696}{12} = 58$$

◻ Thus, the mean salary is $58,000.
◻ Weighted Arithmetic Mean: Sometimes, each value xi in a set may be associated with a weight wi for i = 1,...,N.
◻ The weights reflect the significance, importance, or occurrence frequency attached to their respective values. In this case, we can compute

$$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N}$$
◻ Trimmed mean:
🞑 Although the mean is the single most useful quantity for describing a data set, it is not always the best way of measuring the center of the data.
🞑 A major problem with the mean is its sensitivity to extreme (e.g., outlier)
values.
🞑 Example: The mean salary at a company may be substantially pushed up by that
of a few highly paid managers.
🞑 Similarly, the mean score of a class in an exam could be pulled down quite a bit by
a few very low scores.
🞑 To offset the effect caused by a small number of extreme values, we can instead
use the trimmed mean, which is the mean obtained after chopping off values at
the high and low extremes.
🞑 Example: we can sort the values observed for salary and remove the top and bottom 2% before computing the mean. We should avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable information.
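The three variants of the mean discussed so far can be sketched in Python as follows. This is a minimal sketch: the function names are illustrative, and the 10% trimming fraction is an assumption made only because trimming 2% of a 12-value sample would remove nothing.

```python
import math


def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping floor(fraction * N) values from each end."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = math.floor(fraction * len(ordered))
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Salary data from the slides, in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries))                # 58.0
print(trimmed_mean(salaries, 0.10))  # drops 30 and 110
```

Dropping the lowest and highest salary lowers the result from 58 to 55.6, which shows how strongly the single value 110 pulls the ordinary mean upward.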
Skewness & Symmetry

[Figure: symmetric, positively skewed, and negatively skewed data distributions]
Median

◻ For skewed (asymmetric) data, a better measure of the center of data is the
median, which is the middle value in a set of ordered data values.
◻ It is the value that separates the higher half of a data set from the lower half.
◻ Suppose that a given data set of N values for an attribute X is sorted in increasing
order. If N is odd, then the median is the middle value of the ordered set.
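◻ If N is even, then the median is not unique; it is the two middlemost values and any value in between. By convention, for a numeric attribute we take the average of the two middlemost values.
◻ Example: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
◻ Even (N = 12): the median is (52 + 56)/2 = 108/2 = 54, that is, $54,000.
◻ Suppose we have only the first 11 values: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70
◻ Odd (N = 11): the median is the middle (sixth) value, 52, that is, $52,000.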
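The odd and even cases can be sketched in a few lines of Python. The function name `median` is illustrative; Python's standard `statistics.median` follows the same convention.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(median(salaries))       # even N: (52 + 56) / 2 = 54.0
print(median(salaries[:11]))  # odd N: the sixth value, 52
```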
◻ Assume that data are grouped in intervals according to their xi data values and that the frequency (i.e., number of data values) of each interval is known.
◻ For example, employees may be grouped according to their annual salary in intervals such as $10,000–20,000, $20,000–30,000, and so on.
◻ Let the interval that contains the median frequency be the median interval.
◻ We can approximate the median of the entire data set (e.g., the median salary) by interpolation using the formula

$$\text{median} = L_1 + \left(\frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$

◻ where L1 is the lower boundary of the median interval, N is the number of values in the entire data set, (∑ freq)l is the sum of the frequencies of all of the intervals that are lower than the median interval, freqmedian is the frequency of the median interval, and width is the width of the median interval.
Mode

◻ The mode for a set of data is the value that occurs most frequently in the set.
◻ Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
◻ In general, a data set with two or more modes is multimodal.
◻ At the other extreme, if each data value occurs only once, then there is no mode.
◻ Example: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
◻ The two modes are $52,000 and $70,000.
Midrange

◻ The midrange can also be used to assess the central tendency of a numeric data set.
◻ It is the average of the largest and smallest values in the set.
◻ This measure is easy to compute using the SQL aggregate functions, max() and
min().
◻ Example: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
◻ (30,000+110,000)/2 = $70,000
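The mode and midrange of the salary data can be checked with a short Python sketch. The function names are illustrative, and returning an empty list when every value occurs once is an assumption that mirrors the "no mode" convention stated above.

```python
from collections import Counter


def modes(values):
    """All values that occur most frequently; empty if every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(modes(salaries))     # [52, 70], so the data are bimodal
print(midrange(salaries))  # 70.0
```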
Mean

◻ Solve: 59, 65, 61, 62, 53, 55, 60, 70, 64, 56, 58, 58, 62, 62, 68, 65, 56, 59, 68, 61, 67
◻ Ans:

$$\text{Mean} = \frac{59 + 65 + 61 + 62 + 53 + 55 + 60 + 70 + 64 + 56 + 58 + 58 + 62 + 62 + 68 + 65 + 56 + 59 + 68 + 61 + 67}{21} = \frac{1289}{21} = 61.38095\ldots$$

◻ Another Example: Grouped data

Seconds    Frequency
51 - 55    2
56 - 60    7
61 - 65    8
66 - 70    4

Midpoint x    Frequency f    Midpoint × Frequency fx
53            2              106
58            7              406
63            8              504
68            4              272
Totals:       21             1288

$$\text{Estimated Mean} = \frac{1288}{21} = 61.333\ldots$$
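The grouped-data estimate of the mean can be reproduced with a short Python sketch. It assumes each interval is represented by its midpoint; the function name and the tuple layout are illustrative.

```python
def grouped_mean(intervals):
    """Estimate the mean from (lower, upper, frequency) class intervals.

    Each interval is represented by its midpoint, weighted by its frequency.
    """
    total = sum(freq for _, _, freq in intervals)
    weighted = sum((lower + upper) / 2 * freq for lower, upper, freq in intervals)
    return weighted / total


# The "Seconds" table from the slides.
seconds = [(51, 55, 2), (56, 60, 7), (61, 65, 8), (66, 70, 4)]

print(grouped_mean(seconds))  # 1288 / 21
```

The estimate differs slightly from the mean of the raw values because the individual observations inside each interval are no longer known.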
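Median

◻ Solve:

Seconds    Frequency
51 - 55    2
56 - 60    7
61 - 65    8
66 - 70    4

◻ The median interval is 61 - 65, because the cumulative frequency first reaches N/2 = 21/2 there. Its lower boundary is 60.5, the frequencies below it sum to 2 + 7 = 9, its own frequency is 8, and its width is 5.

$$\text{Estimated Median} = 60.5 + \frac{21/2 - 9}{8} \times 5 = 60.5 + 0.9375 = 61.4375$$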
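The interpolation formula for the median of grouped data can be sketched in Python. This is a sketch under one assumption: class limits are whole numbers one unit apart (55 and 56, for example), so each class boundary lies half a unit outside its limits. The function name and the `gap` parameter are illustrative.

```python
def grouped_median(intervals, gap=1.0):
    """Approximate the median from sorted (lower, upper, frequency) intervals.

    gap is the distance between one interval's upper limit and the next
    interval's lower limit, so class boundaries lie gap / 2 outside the limits.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("the intervals contain no observations")
    below = 0  # sum of frequencies of the intervals below the current one
    for lower, upper, freq in intervals:
        if below + freq >= n / 2:
            lower_boundary = lower - gap / 2
            width = (upper + gap / 2) - lower_boundary
            return lower_boundary + (n / 2 - below) / freq * width
        below += freq
    raise ValueError("median interval not found")


seconds = [(51, 55, 2), (56, 60, 7), (61, 65, 8), (66, 70, 4)]

print(grouped_median(seconds))  # 61.4375
```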
◻ Another Example:

Seconds    Frequency
50 - 55    2
55 - 60    7
60 - 65    8
65 - 70    4
2.2.2 Measuring the Dispersion of Data

◻ Measures to assess the dispersion or spread of numeric data:
🞑 Range
🞑 Quantiles (Quartiles/percentiles)
🞑 Interquartile range
🞑 Five-number summary – boxplot
🞑 Variance and standard deviation
Range, Quartiles, and Interquartile Range

◻ Let x1,x2,...,xN be a set of observations for some numeric attribute, X.
◻ Range: the difference between the largest (max()) and smallest (min()) values.
◻ Ex: Salary (in thousands of dollars): 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
◻ Range = 110 − 30 = 80, that is, $80,000
◻ Quantiles (Quartiles / Percentiles)
🞑 Suppose that the data for attribute X are sorted in increasing numeric order.
🞑 Data points that split the data distribution into equal-size consecutive sets are called quantiles.

[Figure: a data distribution divided by Q1 (25th percentile), Q2 (median), and Q3 (75th percentile), with 25% of the data in each part]
◻ The kth q-quantile for a given data distribution is the value x such that at most k/q of the data values are less than x and at most (q − k)/q of the data values are more than x, where k is an integer such that 0 < k < q. There are q − 1 q-quantiles.
◻ The 2-quantile is the data point dividing the lower and upper halves of the data
distribution. It corresponds to the median.
◻ The 4-quantiles are the three data points that split the data distribution into four
equal parts; each part represents one-fourth of the data distribution. They are more
commonly referred to as quartiles.
◻ The 100-quantiles are more commonly referred to as percentiles; they divide the
data distribution into 100 equal-sized consecutive sets. The median, quartiles, and
percentiles are the most widely used forms of quantiles.
◻ The first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of
the data.
◻ The third quartile, denoted by Q3, is the 75th percentile—it cuts off the lowest 75%
(or highest 25%) of the data.
◻ The second quartile is the 50th percentile. As the median, it gives the center of the
data distribution.
◻ The distance between the first and third quartiles is a simple measure of spread that
gives the range covered by the middle half of the data. This distance is called the
interquartile range (IQR) and is defined as
IQR = Q3 − Q1
◻ Ex: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 (in thousands of dollars)
◻ Q2 = (52 + 56)/2 = 54, the median of all 12 values, that is, $54,000.
◻ Taking the median of each half: Q1 = (47 + 50)/2 = 48.5 and Q3 = (63 + 70)/2 = 66.5, so IQR = Q3 − Q1 = 66.5 − 48.5 = 18, that is, $18,000.
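The quartile computation can be sketched in Python. This is a minimal sketch assuming the median-of-halves convention, in which Q1 and Q3 are the medians of the lower and upper halves of the sorted data. Other quartile conventions exist and can give slightly different values; the function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]  # skip the middle value when N is odd
    return median(lower), median(ordered), median(upper)


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q2, q3 = quartiles(salaries)
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 48.5 54.0 66.5 18.0
```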


Five-Number Summary, Boxplots, and Outliers

◻ No single numeric measure of spread (e.g., IQR) is very useful for describing skewed distributions.
◻ In a symmetric distribution, the median (and other measures of central tendency) splits the data into equal-size halves. This does not occur for skewed distributions.
◻ Therefore, it is more informative to also provide the two quartiles Q1 and Q3, along with the median.
◻ A common rule of thumb for identifying suspected outliers is to single out values falling at least 1.5 × IQR above the third quartile or below the first quartile.
◻ Because Q1, the median, and Q3 together contain no information about the endpoints (e.g., tails) of the data, a fuller summary of the shape of a distribution can be obtained by providing the lowest and highest data values as well. This is known as the five-number summary.
◻ Five-number summary: Minimum, Q1, Median, Q3, Maximum
◻ A boxplot is a popular way of visualizing a distribution. It incorporates the five-number summary as follows:
🞑 Typically, the ends of the box are at the quartiles so that the box length is the interquartile range.
🞑 The median is marked by a line within the box.
🞑 Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations.
◻ Lower limit = Q1 − 1.5 × IQR
◻ Upper limit = Q3 + 1.5 × IQR
◻ Observations falling beyond these limits are suspected outliers and are usually plotted individually.
◻ Example: Let the data be 199, 201, 236, 269, 271, 278, 283, 291, 301, 303, and 441.
◻ Min = 199, Q1 = 236, Q2 = 278, Q3 = 301, Max = 441
◻ IQR = Q3 − Q1 = 301 − 236 = 65
◻ Upper limit = Q3 + 1.5 × IQR = 301 + 97.5 = 398.5
◻ Lower limit = Q1 − 1.5 × IQR = 236 − 97.5 = 138.5
◻ The value 441 lies above the upper limit, so it is a suspected outlier.
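The limits in this example can be verified with a short Python sketch. The function name `outlier_limits` is illustrative, and Q1 and Q3 are passed in as the values stated in the example rather than recomputed.

```python
def outlier_limits(q1, q3, k=1.5):
    """Return (lower limit, upper limit) for the k * IQR rule of thumb."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [199, 201, 236, 269, 271, 278, 283, 291, 301, 303, 441]
lower, upper = outlier_limits(q1=236, q3=301)

# Values outside the limits are suspected outliers.
outliers = [x for x in data if x < lower or x > upper]

print(lower, upper, outliers)  # 138.5 398.5 [441]
```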

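[Figure: boxplot of the example data, with the box from Q1 = 236 to Q3 = 301 (IQR = 65), the median at 278, and the minimum at 199]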
◻ Q2 is $80, Q1 is $60, and Q3 is $100. Notice that two outlying observations for this branch were plotted individually, as their values of 175 and 202 are more than 1.5 times the IQR (here, 40) above the third quartile.

[Figure: boxplot of unit price ($) for items sold at Branch 1]
Variance and Standard Deviation

◻ Variance and standard deviation are measures of data dispersion.
◻ They indicate how spread out a data distribution is.
◻ A low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data are spread out over a large range of values.
◻ The variance of N observations, x1,x2,...,xN, for a numeric attribute X is

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \mu^2$$

◻ where μ is the mean of the observations. The standard deviation, σ, of the observations is the square root of the variance, σ².
◻ Example: for the salary data (in thousands of dollars), the mean is μ = 58 and N = 12.

$$\sigma^2 = \frac{1}{12}\left(30^2 + 36^2 + 47^2 + \cdots + 110^2\right) - 58^2 \approx 379.17$$

$$\sigma \approx \sqrt{379.17} \approx 19.47$$
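Both forms of the variance formula can be checked numerically with a short Python sketch. The function names are illustrative. The sketch divides by N, as the formula above does; the sample variance used in many statistics texts divides by N − 1 instead.

```python
import math


def population_variance(values):
    """Mean squared deviation from the mean (divides by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_variance_shortcut(values):
    """Equivalent form: mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

variance = population_variance(salaries)
std_dev = math.sqrt(variance)

print(round(variance, 2), round(std_dev, 2))  # 379.17 19.47
```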
Covariance and correlation analysis

Correlation and covariance are two similar measures for assessing how much two attributes change together.

Example

Consider stock prices observed at five time points for AllElectronics and HighTech, a high-tech company. If the stocks are affected by the same industry trends, will their prices rise or fall together?
Given the positive covariance, we can say that the stock prices for both companies rise together.
Correlation coefficient for numeric data

For numeric attributes, the correlation between two attributes can be evaluated by computing the correlation coefficient (also known as Pearson’s product moment coefficient).
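Covariance and Pearson's correlation coefficient can be sketched in Python. The sketch uses the standard population definitions, Cov(A, B) = (1/n) Σ (a_i − mean(A))(b_i − mean(B)) and r = Cov(A, B) / (σ_A σ_B). The two price lists are illustrative values assumed here, since the data table for the stock example is not present in the slide text.

```python
import math


def covariance(a, b):
    """Population covariance: mean product of deviations from the means."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def pearson(a, b):
    """Pearson correlation: covariance divided by both standard deviations."""
    std_a = math.sqrt(covariance(a, a))
    std_b = math.sqrt(covariance(b, b))
    return covariance(a, b) / (std_a * std_b)


# Illustrative prices at five time points (assumed values).
all_electronics = [6, 5, 4, 3, 2]
high_tech = [20, 10, 14, 5, 5]

cov = covariance(all_electronics, high_tech)
r = pearson(all_electronics, high_tech)

print(round(cov, 2), round(r, 2))
```

A positive covariance indicates that the two prices tend to move in the same direction. The correlation coefficient rescales it to the range from −1 to 1, so its size can be compared across attribute pairs.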
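χ2 correlation test for nominal data

For nominal data, a correlation relationship between two attributes, A and B, can be discovered by a χ2 (chi-square) test. The data tuples described by A and B can be shown as a contingency table.

The χ2 statistic tests the hypothesis that A and B are independent, that is, there is no correlation between them. The test is based on a significance level, with (r − 1) × (c − 1) degrees of freedom, where r and c are the numbers of rows and columns of the contingency table.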
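The χ2 statistic can be sketched in Python from its standard definition: the sum over all cells of (observed − expected)² / expected, where the expected count of a cell is its row total times its column total divided by the grand total. The 2 × 2 counts below are illustrative values assumed for this sketch, and the function name is hypothetical.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    observed is a list of rows, each a list of counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (count - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Illustrative 2 x 2 contingency table (assumed counts).
table = [[250, 200],
         [50, 1000]]

statistic, dof = chi_square(table)
print(round(statistic, 2), dof)
```

A statistic that exceeds the critical value for the chosen significance level and degrees of freedom leads us to reject the hypothesis that the two attributes are independent.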
2.2.3 Graphic Displays of Basic Statistical Descriptions of Data

◻ Graphic displays of basic statistical descriptions include:
🞑 quantile plots
🞑 quantile–quantile plots
🞑 histograms
🞑 scatter plots
Quantile Plot

◻ A quantile plot is a simple and effective way to have a first look at a univariate data distribution.
🞑 First, it displays all of the data for the given attribute (allowing the user to assess both the overall behavior and unusual occurrences).
🞑 Second, it plots quantile information.
◻ Let xi, for i = 1 to N, be the data sorted in increasing order so that x1 is the smallest observation and xN is the largest for some ordinal or numeric attribute X.
◻ Each observation, xi, is paired with a percentage, fi, which indicates that approximately fi × 100% of the data are below the value, xi.

$$f_i = \frac{i - 0.5}{N}$$
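The pairing of each observation with its f-value can be sketched in Python. The function name and the four sample values are illustrative; plotting the resulting pairs with f on the horizontal axis gives the quantile plot.

```python
def quantile_plot_points(values):
    """Pair each sorted observation x_i with f_i = (i - 0.5) / N, for i = 1..N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Four illustrative observations (assumed values).
points = quantile_plot_points([47, 40, 120, 75])

for f, x in points:
    print(f, x)
```

The f-values step from 0.5/N to 1 − 0.5/N in equal increments of 1/N, so no observation is assigned exactly 0% or 100%.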
[Figure: quantile plot of unit price ($) against f-value (0.00 to 1.00), with Q1, the median, and Q3 marked]

A Set of Unit Price Data for Items Sold at a Branch of AllElectronics

Unit price ($)    Count of items sold
40                275
43                300
47                250
...               ...
74                360
75                515
78                540
...               ...
115               320
117               270
120               350
Quantile-Quantile Plot

◻ A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the corresponding quantiles of another. It is a powerful visualization tool in that it allows the user to view whether there is a shift in going from one distribution to another.

[Figure: q-q plot of unit price ($) at Branch 2 against unit price ($) at Branch 1, with Q1, the median, and Q3 marked]
Histograms

◻ Plotting histograms is a graphical method for summarizing the distribution of a given attribute, X.
◻ If X is numeric, the term histogram is preferred.
◻ The range of values for X is partitioned into disjoint consecutive subranges.
◻ The subranges, referred to as buckets or bins, are disjoint subsets of the data
distribution for X. The range of a bucket is known as the width.
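Partitioning a range of values into equal-width buckets can be sketched in Python. The function name, its parameters, and the sample prices are assumptions for illustration; each value is counted once, without weighting by the number of items sold.

```python
def histogram(values, start, width, bins):
    """Count how many values fall into each of `bins` equal-width buckets.

    Bucket k covers the half-open range [start + k * width, start + (k + 1) * width).
    Values outside all buckets are ignored.
    """
    counts = [0] * bins
    for x in values:
        index = int((x - start) // width)
        if 0 <= index < bins:
            counts[index] += 1
    return counts


# Illustrative unit prices in dollars (assumed values).
prices = [40, 43, 47, 74, 75, 78, 115, 117, 120]

# Five buckets of width 20: 40-59, 60-79, 80-99, 100-119, 120-139.
counts = histogram(prices, start=40, width=20, bins=5)
print(counts)  # [3, 3, 0, 2, 1]
```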

[Figure: histogram of count of items sold against unit price ($), with buckets 40–59, 60–79, 80–99, 100–119, and 120–139]
Scatter Plots and Data Correlation

◻ A scatter plot is one of the most effective graphical methods for determining if there appears to be a relationship, pattern, or trend between two numeric attributes.
◻ It provides a first look at bivariate data to see clusters of points and outliers, or to explore the possibility of correlation relationships.
◻ Two attributes, X and Y, are correlated if one attribute implies the other.

[Figure: scatter plots showing (a) positive and (b) negative correlation between two attributes, and scatter plots of uncorrelated data]
