Data Mining - Data Objects and Attributes

sastra

Uploaded by

divya28032006

Unit - I

Data Mining

(Contents: Text book 2 - Chapter 2)
Outline

◻ Data Objects and Attribute Types

🞑 What Is an Attribute?
🞑 Nominal Attributes
🞑 Binary Attributes
🞑 Ordinal Attributes
🞑 Numeric Attributes
🞑 Discrete versus Continuous Attributes
2.1 Data Objects and Attribute Types

◻ Data sets are made up of data objects.

◻ A data object represents an entity:
🞑 in a sales database, the objects may be customers, store items, and sales;
🞑 in a medical database, the objects may be patients;
🞑 in a university database, the objects may be students, professors, and courses.
◻ Data objects are typically described by attributes.
◻ Data objects can also be referred to as samples, examples, instances, data
points, or objects.
◻ If the data objects are stored in a database, they are data tuples. That is, the rows of a
database correspond to the data objects, and the columns correspond to the
attributes.
2.1.1 What Is an Attribute?

◻ An attribute is a data field, representing a characteristic or feature of a data object.

◻ The terms attribute, dimension, feature, and variable are often used interchangeably:
🞑 The term dimension is commonly used in data warehousing.
🞑 The machine learning literature tends to use the term feature.
🞑 Statisticians prefer the term variable.
🞑 Data mining and database professionals commonly use the term attribute, which is the term used here.
◻ Ex: Attributes describing a customer object can include, for example, customer ID,
name, and address.
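As a minimal sketch of the row/column view described above, each data object can be held as one record and each attribute as one named field. The customer names and addresses below are made up purely for illustration.

```python
# Each dict is one data object (a row, or tuple, in database terms).
# Each key is an attribute (a column). All values are illustrative.
customers = [
    {"customer_id": 1, "name": "Asha", "address": "12 Lake Road"},
    {"customer_id": 2, "name": "Ravi", "address": "7 Hill Street"},
]

# The attributes shared by the objects play the role of table columns.
attributes = list(customers[0].keys())

# The values of one object, read in attribute order, form one row.
first_row = [customers[0][a] for a in attributes]
```

A list of dicts is only one possible layout; a database table or a data frame expresses the same objects-by-attributes structure.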
2.1.1 What Is an Attribute?

◻ Observed values for a given attribute are known as observations.

◻ A set of attributes used to describe a given object is called an attribute vector (or
feature vector).
◻ The distribution of data involving one attribute (or variable) is called univariate. A
bivariate distribution involves two attributes, and so on.
◻ Types:
🞑 The type of an attribute is determined by the set of possible values it can take:
■ Nominal
■ Binary
■ Ordinal
■ Numeric (interval-scaled or ratio-scaled)
2.1.2 Nominal Attributes

◻ Nominal means “relating to names”: the values of a nominal attribute are symbols or names of things.

◻ Also referred to as categorical.
◻ Example
🞑 Suppose that hair color and marital status are two attributes describing person
objects.
🞑 In our application, possible values for hair color are black, brown, blond, red,
auburn, gray, and white.
🞑 The attribute marital status can take on the values single, married, divorced, and widowed.
🞑 Another example of a nominal attribute is occupation, with the values teacher,
dentist, programmer, farmer, and so on.
2.1.3 Binary Attributes

◻ A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present.
◻ Binary attributes are referred to as Boolean if the two states correspond to true
and false.
◻ Example: Suppose a patient undergoes a medical test that has two possible outcomes. The attribute medical test is binary, where a value of 1 means the result of the test for the patient is positive, while 0 means the result is negative.
◻ A binary attribute is symmetric if both of its states are equally valuable and carry
the same weight; that is, there is no preference on which outcome should be coded
as 0 or 1. One such example could be the attribute gender having the states male and
female.
◻ A binary attribute is asymmetric if the outcomes of the states are not equally important, such as the positive and negative outcomes of a medical test for cancer.
2.1.4 Ordinal Attributes

◻ An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.
◻ Example:
◻ Suppose that drink size corresponds to the size of drinks available at a fast-food restaurant.
◻ This ordinal attribute has three possible values: small, medium, and large.
◻ Other examples of ordinal attributes include grade (e.g., A+, A, A−, B+, and so on) and professional rank (assistant, associate, and full professor).
◻ Note:
◻ Nominal, binary, and ordinal attributes are qualitative. That is, they describe a feature of an object without giving an actual size or quantity.
◻ Numeric attributes, in contrast, provide quantitative measurements of an object.
2.1.5 Numeric Attributes

◻ A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values.

◻ Numeric attributes can be interval-scaled or ratio-scaled.

◻ Interval-Scaled Attributes are measured on a scale of equal-size units.
◻ The values of interval-scaled attributes have order and can be positive, 0, or negative.
◻ In addition to providing a ranking of values, such attributes allow us to compare and
quantify the difference between values.
◻ Example: A temperature attribute is interval-scaled.
◻ Suppose that we have the outdoor temperature value for a number of different days,
where each day is an object.

◻ By ordering the values, we obtain a ranking of the objects with respect to temperature. In addition, we can quantify the difference between values. For example, a temperature of 20°C is five degrees higher than a temperature of 15°C.
◻ Calendar dates are another example. For instance, the years 2012 and 2020 are eight
years apart.
Numeric Attributes

◻ A ratio-scaled attribute is a numeric attribute with an inherent zero-point.

◻ That is, if a measurement is ratio-scaled, we can speak of a value as being a
multiple (or ratio) of another value.
◻ In addition, the values are ordered, and we can also compute the difference
between values, as well as the mean, median, and mode.
◻ Example:
◻ Unlike temperatures in Celsius and Fahrenheit, the Kelvin (K) temperature scale has what is considered a true zero-point (0 K = −273.15°C):
◻ It is the point at which the particles that comprise matter have zero kinetic energy.
◻ Other examples of ratio-scaled attributes include count attributes such as years of
experience (e.g., the objects are employees) and number of words (e.g., the objects are
documents).
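The interval versus ratio distinction can be checked with a short calculation. This is only a sketch: the two temperatures are arbitrary, and the conversion uses the 273.15 offset quoted above.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert a Celsius temperature to Kelvin (0 K = -273.15 degrees C)."""
    return celsius + 273.15

t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference_c = t2_c - t1_c
difference_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# Ratios are only meaningful on a ratio scale with a true zero-point.
naive_ratio = t2_c / t1_c  # 2.0, but 20 C is not "twice as warm" as 10 C
kelvin_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)
```

The difference is the same on both scales, while the ratio changes from 2.0 in Celsius to roughly 1.035 in Kelvin, which is why only the Kelvin ratio can be interpreted.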
2.1.6 Discrete versus Continuous Attributes

◻ There are many ways to organize attribute types, and the types are not mutually exclusive.
◻ Classification algorithms developed from the field of machine learning often talk of attributes as being either discrete or continuous.
◻ A discrete attribute has a finite or countably infinite set of values, which may or may not be represented as integers.
◻ Example:
◻ Finite attributes: The attributes hair color, smoker, medical test, and drink size each
have a finite number of values, and so are discrete.
◻ Discrete attributes may have numeric values, such as 0 and 1 for binary attributes
or, the values 0 to 110 for the attribute age.
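◻ Countably infinite attributes: an attribute is countably infinite if its set of possible values is infinite but can be put in one-to-one correspondence with the natural numbers.
◻ The attribute customer ID is countably infinite.
◻ Zip codes are another example.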
2.1.6 Discrete versus Continuous Attributes

◻ If an attribute is not discrete, it is continuous.

◻ Example: Real numbers
◻ In practice, real values are represented using a finite number of digits.
◻ Continuous attributes are typically represented as floating-point variables.
2.2 Basic Statistical Descriptions of Data

◻ For data pre-processing to be successful, it is essential to have an overall picture of your data.
◻ Basic statistical descriptions can be used to identify properties of the data and
highlight which data values should be treated as noise or outliers.
🞑 Measures of central tendency measure the location of the middle or center of a data distribution. Given an attribute, where do most of its values fall? Examples are the mean, median, mode, and midrange.
🞑 Measures of dispersion describe how the data are spread out. The most common data dispersion measures are the range, quartiles, and interquartile range; the five-number summary and boxplots; and the variance and standard deviation of the data. These measures are useful for identifying outliers.
🞑 Graphic displays let us visually inspect our data. Common examples are bar charts, pie charts, and line graphs. Other popular displays of data summaries and distributions include quantile plots, quantile–quantile plots, histograms, and scatter plots.
2.2.1 Measuring the Central Tendency: Mean, Median, and Mode

◻ Suppose that we have some attribute X, like salary, which has been recorded for a set of objects.
◻ Let x1, x2, ..., xN be the set of N observed values or observations for X.
◻ Measures of central tendency include the mean, median, mode, and midrange.
◻ The most common and effective numeric measure of the “center” of a set of data is the (arithmetic) mean. The mean of this set of values is
$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
Mean

◻ Example: Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
$$\bar{x} = \frac{30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110}{12} = \frac{696}{12} = 58$$
◻ Thus, the mean salary is $58,000.
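◻ Weighted Arithmetic Mean: Sometimes, each value xi in a set may be associated with a weight wi for i = 1,...,N.
◻ The weights reflect the significance, importance, or occurrence frequency attached to their respective values. In this case, we can compute
$$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N}$$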
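The mean and the weighted mean can be sketched in a few lines. The salary list is the one from the example above; the values and weights passed to the weighted mean are hypothetical occurrence frequencies chosen only to show the calculation.

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # thousands of dollars

def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

salary_mean = mean(salaries)  # 58.0, i.e. $58,000

# Hypothetical example: three distinct values with occurrence counts as weights.
distinct_values = [50, 52, 70]
occurrence_counts = [1, 2, 2]
example_weighted_mean = weighted_mean(distinct_values, occurrence_counts)
```

With all weights equal, the weighted mean reduces to the ordinary mean.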
Mean

◻ Trimmed mean:
🞑 Although the mean is the single most useful quantity for describing a data set, it is not always the best way of measuring the center of the data.
🞑 A major problem with the mean is its sensitivity to extreme (e.g., outlier)
values.
🞑 Example: The mean salary at a company may be substantially pushed up by that
of a few highly paid managers.
🞑 Similarly, the mean score of a class in an exam could be pulled down quite a bit by
a few very low scores.
🞑 To offset the effect caused by a small number of extreme values, we can instead
use the trimmed mean, which is the mean obtained after chopping off values at
the high and low extremes.
🞑 Example: remove the top and bottom 2% of salaries before computing the mean. We should avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable information.
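A trimmed mean can be sketched as follows. The function name and the 10% proportion are assumptions for illustration: with only 12 salary values, 2% would remove nothing, so a larger proportion is used here just to make the effect visible. The sketch assumes the proportion is below 0.5 so that some values remain.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # thousands of dollars

# Drops the lowest (30) and highest (110) value before averaging.
salary_trimmed_mean = trimmed_mean(salaries, 0.10)
```

Here the trimmed mean is 55.6, lower than the plain mean of 58, because the single large value 110 no longer pulls the average up.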
Skewness & Symmetry

[Figure: three distribution shapes: symmetric, positively skewed, and negatively skewed]
Median

◻ For skewed (asymmetric) data, a better measure of the center of data is the
median, which is the middle value in a set of ordered data values.
◻ It is the value that separates the higher half of a data set from the lower half.
◻ Suppose that a given data set of N values for an attribute X is sorted in increasing
order. If N is odd, then the median is the middle value of the ordered set.
◻ If N is even, then the median is not unique; it is the two middlemost values and any value in between. If X is a numeric attribute, by convention the median is taken as the average of the two middlemost values.
◻ Example: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
◻ Even: (52+56)/2 = 108/2 = 54.
◻ Suppose 11 values: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70
◻ Odd: Middle one: 52
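The odd and even cases above can be sketched directly. The function below follows the averaging convention for an even number of values; it is an illustration, not a replacement for a library routine.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

even_case = median([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])  # 54.0
odd_case = median([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70])        # 52
```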
Median

◻ Assume that data are grouped in intervals according to their xi data values and that the frequency (i.e., number of data values) of each interval is known.
◻ For example, employees may be grouped according to their annual salary in intervals such as $10,000–20,000, $20,000–30,000, and so on.
◻ Let the interval that contains the median frequency be the median interval.
◻ We can approximate the median of the entire data set (e.g., the median salary) by interpolation using the formula
$$\text{median} = L_1 + \left( \frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$$
◻ where L1 is the lower boundary of the median interval, N is the number of values in the entire data set, (∑ freq)l is the sum of the frequencies of all of the intervals that are lower than the median interval, freqmedian is the frequency of the median interval, and width is the width of the median interval.
Mode

◻ The mode for a set of data is the value that occurs most frequently in the set.
◻ Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
◻ In general, a data set with two or more modes is multimodal.
◻ At the other extreme, if each data value occurs only once, then there is no mode.
◻ Example: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
◻ The two modes are $52,000 and $70,000.
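A possible sketch for finding all modes, including the "no mode" case when every value occurs once. The function name is an assumption, and the sketch expects a non-empty list.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or an empty list if every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # each value occurs only once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # thousands of dollars
salary_modes = modes(salaries)  # [52, 70], so the data are bimodal
```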
Midrange

◻ The midrange can also be used to assess the central tendency of a numeric data set.
◻ It is the average of the largest and smallest values in the set.
◻ This measure is easy to compute using the SQL aggregate functions, max() and
min().
◻ Example: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
◻ (30,000+110,000)/2 = $70,000
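The same midrange calculation, sketched with Python's built-in min() and max() in place of the SQL aggregates:

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # thousands of dollars

# Midrange: average of the smallest and largest values.
midrange = (min(salaries) + max(salaries)) / 2  # 70.0, i.e. $70,000
```

Because it depends only on the two extreme values, the midrange is even more sensitive to outliers than the mean.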
Median

◻ Solve: 59, 65, 61, 62, 53, 55, 60, 70, 64, 56, 58, 58, 62, 62, 68, 65, 56, 59, 68, 61, 67
◻ Ans:
$$\text{Mean} = \frac{59 + 65 + 61 + 62 + 53 + 55 + 60 + 70 + 64 + 56 + 58 + 58 + 62 + 62 + 68 + 65 + 56 + 59 + 68 + 61 + 67}{21} = \frac{1289}{21} \approx 61.381$$

◻ Another Example: Grouped data

Seconds   Frequency
51 - 55   2
56 - 60   7
61 - 65   8
66 - 70   4
Median

Midpoint x   Frequency f   Midpoint × Frequency fx
53           2             106
58           7             406
63           8             504
68           4             272
Totals:      21            1288

$$\text{Estimated Mean} = \frac{1288}{21} \approx 61.333$$
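The midpoint-times-frequency table can be reproduced with a short sketch. The tuple layout (low, high, frequency) and the function name are assumptions made for this illustration.

```python
# Each group is (lowest value, highest value, frequency), as in the table.
groups = [(51, 55, 2), (56, 60, 7), (61, 65, 8), (66, 70, 4)]

def grouped_mean(groups):
    """Estimate the mean by treating every value as its interval midpoint."""
    total_frequency = sum(freq for _, _, freq in groups)
    weighted_sum = sum((low + high) / 2 * freq for low, high, freq in groups)
    return weighted_sum / total_frequency

estimated_mean = grouped_mean(groups)  # 1288 / 21
```

The estimate (about 61.333) is close to, but not identical with, the exact mean of the raw data (about 61.381), because grouping discards the position of each value inside its interval.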
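Median

◻ Solve:
Seconds   Frequency
51 - 55   2
56 - 60   7
61 - 65   8
66 - 70   4

◻ N = 21, so N/2 = 10.5. The intervals below 61 - 65 hold 2 + 7 = 9 values, so the median interval is 61 - 65. Treating it as covering 60.5 to 65.5 gives a lower boundary of 60.5 and a width of 5; its frequency is 8.
$$\text{Estimated Median} = 60.5 + \frac{(21/2) - 9}{8} \times 5 = 60.5 + 0.9375 = 61.4375$$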
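The interpolation above can be sketched as a small function. The tuple layout (lower boundary, width, frequency) and the use of half-unit class boundaries such as 60.5 are assumptions that mirror the worked example.

```python
def grouped_median(intervals):
    """Interpolated median for intervals given as (lower_boundary, width, frequency).

    The intervals must be sorted by lower boundary.
    """
    n = sum(freq for _, _, freq in intervals)
    cumulative = 0  # sum of frequencies of the intervals below the current one
    for lower, width, freq in intervals:
        if cumulative + freq >= n / 2:
            # This is the median interval: interpolate inside it.
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq
    raise ValueError("no intervals given")

# Class boundaries for 51-55, 56-60, 61-65, 66-70 with a width of 5.
intervals = [(50.5, 5, 2), (55.5, 5, 7), (60.5, 5, 8), (65.5, 5, 4)]
estimated_median = grouped_median(intervals)
```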
Median

◻ Another Example:
Seconds   Frequency
50 - 55   2
55 - 60   7
60 - 65   8
65 - 70   4
2.2.2 Measuring the Dispersion of Data

◻ Measures to assess the dispersion or spread of numeric data.

🞑 Range
🞑 Quantiles (Quartiles/percentiles)
🞑 Interquartile range
🞑 Five-number summary – boxplot
🞑 Variance and standard deviation
Range, Quartiles, and Interquartile Range

◻ Let x1,x2,...,xN be a set of observations for some numeric attribute, X.

◻ Range: the difference between the largest (max()) and smallest (min()) values.
◻ Ex: Salary (in thousands of dollars): 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
◻ Range = 110 − 30 = 80, that is, $80,000.
◻ Quantiles (Quartiles / Percentiles)
🞑 Suppose that the data for attribute X are sorted in increasing numeric order.
🞑 Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets.

[Figure: a data distribution divided into four 25% parts by Q1 (25th percentile), Q2 (median), and Q3 (75th percentile)]
Range, Quartiles, and Interquartile Range

◻ The kth q-quantile for a given data distribution is the value x such that at most k/q of the data values are less than x and at most (q − k)/q of the data values are more than x, where k is an integer such that 0 < k < q. There are q − 1 q-quantiles.
◻ The 2-quantile is the data point dividing the lower and upper halves of the data
distribution. It corresponds to the median.
◻ The 4-quantiles are the three data points that split the data distribution into four
equal parts; each part represents one-fourth of the data distribution. They are more
commonly referred to as quartiles.
◻ The 100-quantiles are more commonly referred to as percentiles; they divide the
data distribution into 100 equal-sized consecutive sets. The median, quartiles, and
percentiles are the most widely used forms of quantiles.
Range, Quartiles, and Interquartile Range

◻ The first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of
the data.
◻ The third quartile, denoted by Q3, is the 75th percentile—it cuts off the lowest 75%
(or highest 25%) of the data.
◻ The second quartile is the 50th percentile. As the median, it gives the center of the
data distribution.
◻ The distance between the first and third quartiles is a simple measure of spread that
gives the range covered by the middle half of the data. This distance is called the
interquartile range (IQR) and is defined as
IQR = Q3 − Q1
◻ Ex: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 (in thousands of dollars)
◻ Q2 = (52 + 56)/2 = 54, that is, $54,000.
◻ Taking the median of each half: Q1 = (47 + 50)/2 = 48.5 and Q3 = (63 + 70)/2 = 66.5, so IQR = Q3 − Q1 = 66.5 − 48.5 = 18, that is, $18,000.
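One way to compute quartiles is the "median of each half" convention sketched below. This is an assumption about method: other conventions exist (statistical libraries often interpolate between data points), and they can give slightly different quartiles for the same data.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]
    upper_half = ordered[n // 2 + n % 2 :]  # skips the middle value when n is odd
    return median(lower_half), median(ordered), median(upper_half)

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # thousands of dollars
q1, q2, q3 = quartiles(salaries)
iqr = q3 - q1
salary_range = max(salaries) - min(salaries)
```

For the salary data this gives Q1 = 48.5, Q2 = 54, Q3 = 66.5, an IQR of 18, and a range of 80 (all in thousands of dollars).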

Five-Number Summary, Boxplots, and Outliers

◻ No single numeric measure of spread (e.g., IQR) is very useful for describing
skewed distributions.
◻ In the symmetric distribution, the median (and other measures of central
tendency) splits the data into equal-size halves. This does not occur for skewed
distributions.
◻ Therefore, it is more informative to also provide the two quartiles Q1 and Q3, along
with the median.
◻ A common rule of thumb for identifying suspected outliers is to single out values
falling at least 1.5 × IQR above the third quartile or below the first quartile.
◻ Because Q1, the median, and Q3 together contain no information about the end-
points (e.g., tails) of the data, a fuller summary of the shape of a distribution can be
obtained by providing the lowest and highest data values as well. This is known as
the five-number summary.
◻ Five-number summary: Minimum, Q1, Median, Q3, Maximum
Five-Number Summary, Boxplots, and Outliers

◻ A boxplot is a popular way of visualizing a distribution. It incorporates the five-number summary as follows:
🞑 Typically, the ends of the box are at the quartiles so that the box length is the
interquartile range.
🞑 The median is marked by a line within the box.
🞑 Two lines (called whiskers) outside the box extend to the smallest (Minimum) and
largest (Maximum) observations.
◻ Lower Limit = Q1 – 1.5 IQR
◻ Upper Limit = Q3 + 1.5 IQR
Five-Number Summary, Boxplots, and Outliers

◻ Example: Let the data be 199, 201, 236, 269, 271, 278, 283, 291, 301, 303, and 441.
◻ Min = 199, Q1 = 236, Q2 = 278, Q3 = 301, Max = 441
◻ IQR = Q3 − Q1 = 301 − 236 = 65
◻ Upper limit = Q3 + 1.5 × IQR = 301 + 97.5 = 398.5
◻ Lower limit = Q1 − 1.5 × IQR = 236 − 97.5 = 138.5
◻ The value 441 lies above the upper limit, so it is a suspected outlier.
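The example can be checked with a short sketch that computes the five-number summary and applies the 1.5 × IQR rule. The function names are assumptions, and the quartiles use the "median of each half" convention, which is one of several in use.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]
    upper_half = ordered[n // 2 + n % 2 :]  # skips the middle value when n is odd
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

def outlier_limits(values):
    """Lower and upper limits: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def suspected_outliers(values):
    low, high = outlier_limits(values)
    return [v for v in values if v < low or v > high]

data = [199, 201, 236, 269, 271, 278, 283, 291, 301, 303, 441]
summary = five_number_summary(data)   # (199, 236, 278, 301, 441)
limits = outlier_limits(data)         # (138.5, 398.5)
outliers = suspected_outliers(data)   # [441]
```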

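[Figure: boxplot of the example data, with the box running from Q1 = 236 to Q3 = 301 (IQR = 65), the median marked at 278, and the minimum at 199]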

◻ In the boxplot of unit price data for Branch 1, Q2 (the median) is $80, Q1 is $60, and Q3 is $100. Notice that two outlying observations for this branch were plotted individually, as their values of 175 and 202 are more than 1.5 times the IQR (here, 40) above the third quartile.

[Figure: boxplot of unit price ($) for Branch 1]
Variance and Standard Deviation

◻ Variance and standard deviation are measures of data dispersion.

◻ They indicate how spread out a data distribution is.
◻ A low standard deviation means that the data observations tend to be very close to
the mean, while a high standard deviation indicates that the data are spread out
over a large range of values.
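◻ The variance of N observations, x1, x2, ..., xN, for a numeric attribute X is
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \left( \frac{1}{N} \sum_{i=1}^{N} x_i^2 \right) - \mu^2$$
◻ where μ is the mean value of the observations. The standard deviation, σ, of the observations is the square root of the variance, σ².
◻ Example: for the salary data, the mean is μ = 58 (that is, $58,000) and N = 12.
$$\sigma^2 = \frac{1}{12}\left(30^2 + 36^2 + 47^2 + \cdots + 110^2\right) - 58^2 \approx 379.17$$
$$\sigma \approx \sqrt{379.17} \approx 19.47$$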
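Both forms of the variance formula can be checked numerically on the salary data. This sketch divides by N, matching the definition used here; note that some libraries divide by N − 1 by default when estimating from a sample.

```python
import math

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # thousands of dollars

def variance(values):
    """Mean of the squared deviations from the mean (divides by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def variance_shortcut(values):
    """Equivalent form: mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2

sigma_squared = variance(salaries)   # about 379.17
sigma = math.sqrt(sigma_squared)     # about 19.47
```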
Covariance and correlation analysis

Correlation and covariance are two similar measures for assessing how much two attributes change together.

Example

Consider stock prices observed at five time points for AllElectronics and HighTech, a high-tech company. If the stocks are affected by the same industry trends, will their prices rise or fall together?
The covariance of the two price series is positive. Therefore, we can say that the stock prices for both companies rise together.
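Correlation coefficient for numeric data

For numeric attributes, the correlation between two attributes can be evaluated by computing the correlation coefficient (also known as Pearson’s product moment coefficient).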
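As a hedged illustration of covariance and the correlation coefficient, the sketch below uses the standard population definitions: the covariance is the mean of the products of deviations from the two means, and Pearson's coefficient divides the covariance by the product of the two standard deviations. The five price pairs are illustrative values assumed for this sketch, not data taken from the slides.

```python
import math

# Illustrative prices at five time points (assumed values).
all_electronics = [6, 5, 4, 3, 2]
high_tech = [20, 10, 14, 5, 5]

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Population covariance: mean product of deviations from the means."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

def pearson(a, b):
    """Covariance divided by the product of the standard deviations."""
    std_a = math.sqrt(covariance(a, a))  # covariance of a with itself is its variance
    std_b = math.sqrt(covariance(b, b))
    return covariance(a, b) / (std_a * std_b)

cov = covariance(all_electronics, high_tech)
r = pearson(all_electronics, high_tech)
```

For these values the covariance is 7 and r is roughly 0.87. A positive covariance says the two series tend to move in the same direction; the correlation coefficient rescales it to the range −1 to 1 so that the strength of the relationship does not depend on the units.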
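χ² correlation test for nominal data

For nominal data, a correlation relationship between two attributes, A and B, can be discovered by a χ² (chi-square) test. The data tuples described by A and B can be shown as a contingency table.

The χ² statistic tests the hypothesis that A and B are independent, that is, there is no correlation between them. The test is based on a significance level, with (r − 1) × (c − 1) degrees of freedom, where r and c are the numbers of rows and columns of the contingency table.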
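A sketch of the χ² computation on a contingency table follows. It uses the standard Pearson statistic: for each cell, the expected count is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The 2 × 2 counts are hypothetical and chosen only for illustration.

```python
def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom

# Hypothetical counts: rows are the values of A, columns are the values of B.
observed_counts = [
    [250, 200],
    [50, 1000],
]
statistic, dof = chi_square(observed_counts)
```

The statistic is then compared with the critical value of the χ² distribution for the chosen significance level and degrees of freedom (here 1); a statistic above the critical value leads to rejecting the hypothesis of independence.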
2.2.3 Graphic Displays of Basic Statistical Descriptions of Data

◻ Graphic displays of basic statistical descriptions. These include

🞑 quantile plots
🞑 quantile–quantile plots
🞑 Histograms
🞑 scatter plots
Quantile Plot

◻ A quantile plot is a simple and effective way to have a first look at a univariate data
distribution.
🞑 First, it displays all of the data for the given attribute (allowing the user to assess
both the overall behavior and unusual occurrences).
🞑 Second, it plots quantile information.
◻ Let xi, for i = 1 to N, be the data sorted in increasing order so that x1 is the smallest observation and xN is the largest for some ordinal or numeric attribute X.
◻ Each observation, xi, is paired with a percentage, fi, which indicates that approximately fi × 100% of the data are below the value, xi:
$$f_i = \frac{i - 0.5}{N}$$
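The points of a quantile plot can be generated with a short sketch. The salary values are reused from the earlier examples as stand-in data; the function returns (f-value, observation) pairs that a plotting tool could then draw.

```python
def quantile_plot_points(values):
    """Pair each sorted observation x_i with f_i = (i - 0.5) / N, for i = 1..N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # thousands of dollars
points = quantile_plot_points(salaries)
first_f, first_x = points[0]    # smallest observation, f just above 0
last_f, last_x = points[-1]     # largest observation, f just below 1
```

The f-values step from 1/(2N) up to 1 − 1/(2N) in equal increments of 1/N, so no observation is assigned exactly 0% or 100%.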
Quantile Plot

[Figure: quantile plot of the unit price data, with unit price ($) from 0 to 140 on the vertical axis and f-value from 0.00 to 1.00 on the horizontal axis; Q1, the median, and Q3 are marked]

A Set of Unit Price Data for Items Sold at a Branch of AllElectronics

Unit price ($)   Count of items sold
40               275
43               300
47               250
...              ...
74               360
75               515
78               540
...              ...
115              320
117              270
120              350
Quantile-Quantile Plot

◻ A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the corresponding quantiles of another. It is a powerful visualization tool in that it allows the user to view whether there is a shift in going from one distribution to another.

[Figure: q-q plot of unit price ($) at Branch 2 (vertical axis) against Branch 1 (horizontal axis), both from 40 to 120, with Q1, the median, and Q3 marked]
Histograms

◻ Plotting histograms is a graphical method for summarizing the distribution of a given attribute, X.
◻ If X is nominal, the resulting graph is more commonly known as a bar chart. If X is numeric, the term histogram is preferred.
◻ The range of values for X is partitioned into disjoint consecutive subranges.
◻ The subranges, referred to as buckets or bins, are disjoint subsets of the data distribution for X. The range of a bucket is known as the width.
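Partitioning values into equal-width buckets can be sketched as below. The function name, the starting value of 40, the width of 20, and the short list of prices are all assumptions for illustration; the sketch also assumes integer values so that bucket labels such as 40 to 59 are well defined.

```python
from collections import Counter

def equal_width_histogram(values, start, width):
    """Count integer values per bucket [start + b*width, start + (b+1)*width - 1]."""
    counts = Counter((v - start) // width for v in values)
    return {
        (start + b * width, start + (b + 1) * width - 1): counts[b]
        for b in sorted(counts)
    }

# Illustrative unit prices in dollars (assumed values).
unit_prices = [40, 43, 47, 74, 75, 78, 115, 117, 120]
histogram = equal_width_histogram(unit_prices, start=40, width=20)
```

Buckets that receive no values (here, 80 to 99) are simply absent from the result; a plotting routine would draw them with a height of zero.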

[Figure: histogram of count of items sold (0 to 6000) by unit price ($), with buckets 40–59, 60–79, 80–99, 100–119, and 120–139]
Scatter Plots and Data Correlation

◻ A scatter plot is one of the most effective graphical methods for determining if there appears to be a relationship, pattern, or trend between two numeric attributes.
◻ It provides a first look at bivariate data to see clusters of points and outliers, or to explore the possibility of correlation relationships.
◻ Two attributes, X and Y, are correlated if one attribute implies the other.

[Figure: scatter plots showing positively correlated, negatively correlated, and uncorrelated data]
