Know your data

sainathgunda99@gmail.com
DLZNK464L9

Week 2

This file is meant for personal use by sainathgunda99@gmail.com only.

Proprietary content. ©University of Arizona. All Rights Reserved. Unauthorized use or distribution prohibited.
Sharing or publishing the contents in part or full is liable for legal action.
Learning objectives
By the end of this module, you will be able to:
• List the different types of attributes.
• Compute basic descriptive statistics of a dataset.
• Create and read graphic plots that display descriptive statistics.
• Explain statistical hypothesis testing.
• Compute object similarity and dissimilarity.
Agenda
In this session, we will learn about:
• Types of datasets
• Data objects
• Attributes and their types
• Relationship of attributes
• Need for the absolute “0”
• Discrete vs. continuous attributes
Types of data sets
Type of dataset: examples
• Record
○ Relational records
○ Data matrix: numerical matrix, crosstabs
○ Document data: text documents, term-frequency vectors
○ Transactional data
• Graph and network
○ World Wide Web
○ Social or information networks
○ Molecular structures
• Ordered
○ Video data: sequence of images
○ Temporal data: time series
○ Sequential data: transaction sequences
○ Genetic sequence data
• Spatial, image, and multimedia
○ Spatial data: maps
○ Image data
○ Video data
Illustration - Record type datasets
Document data
Documents Team Coach Play Ball Score Game Win Lost Timeout Season
Document1 3 0 5 0 2 6 0 2 0 2
Document2 0 7 0 2 1 0 0 3 0 0
Document3 0 1 0 0 1 2 2 0 3 0
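The document-data table above stores one term-frequency vector per document. A minimal sketch of how such a vector could be built follows; the vocabulary is taken from the table's column headers, while the example sentence and the function name `term_frequency_vector` are made up for illustration.

```python
from collections import Counter

# Vocabulary taken from the column headers of the document-data table.
VOCABULARY = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_frequency_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in a text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocabulary]


# Made-up sentence, for illustration only.
example = "team play play score game game win"
vector = term_frequency_vector(example)
print(vector)  # [1, 0, 2, 0, 1, 2, 1, 0, 0, 0]
```

Splitting on whitespace is the simplest possible tokenizer; real text would normally need punctuation handling first.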
Transactional data
TID Items
1 Bread, coke, milk
2 Beer, bread
3 Beer, coke, diaper, milk
4 Beer, bread, diaper, milk
5 Coke, diaper, milk
Data objects
• Data sets are made up of data objects.
• A data object represents an entity (an observation).
• Examples:
○ sales database: customers, store items, sales
○ medical database: patients, treatments
○ university database: students, professors, courses
• Data objects are also called observations, samples, examples, instances, data points, objects, and
tuples.
• Data objects are described by their attributes.
• Database table rows -> data objects; columns -> attributes
Fitness of a dataset
Data sets need to be as complete as possible for the questions/phenomena to be studied.
• Are data objects/observations in your data set representative of the population under study?
○ Areas of the aircraft most likely to be damaged in the war.
○ Identifying tanks in the forest.
• Are relevant attributes comprehensively included in your data set?
○ Most militarized country in the world?
○ Body height and total earnings?

Attributes
• An attribute (also called a dimension, feature, or variable) is a data field representing a characteristic or feature of a data object, e.g., customer_ID, name, address.
• An attribute vector (feature vector) is the set of attribute values that describes a given data object.
• Types:
○ Categorical: qualitative
─ Nominal, binary, ordinal
○ Numeric: quantitative
─ Interval-scaled, ratio-scaled

• Nominal: categories, states, or “names of things”
○ Hair_color = {auburn, black, blond, brown, grey, red, white}
○ marital status, occupation, ID numbers, zip codes
• Binary
○ Nominal attribute with only 2 states (0 and 1)
○ Symmetric binary: both outcomes are equally important.
─ e.g., biological gender
○ Asymmetric binary: outcomes are not equally important.
─ e.g., medical test (positive vs. negative)
─ Convention: assign 1 to the most important outcome (e.g., positive)
• Ordinal
○ Values have a meaningful order (ranking), but the magnitude between successive values is unknown.
○ Size = {small, medium, large}, letter grades, army rankings
Numeric attribute types: Quantitative
• Quantity (integer or real-valued)
• Interval
○ Measured on a scale of equal-sized units
○ Values have order
─ E.g., temperature in Celsius or Fahrenheit, calendar dates
○ No true zero-point
• Ratio
○ Inherent zero-point
○ We can speak of a value as being a multiple of another value (10 kelvins is twice as high as 5 kelvins).
─ E.g., temperature in Kelvin, length, counts, monetary quantities
Relationship of attributes
• All ratio attributes are interval attributes.
• All interval attributes are ordinal attributes.
• All ordinal attributes are nominal attributes.

Ratio: absolute zero
Interval: distance is meaningful
Ordinal: attributes can be ordered
Nominal: attributes are only named (weakest)
The need for the absolute “0”
• A temperature of 20 ˚C is not twice as warm as 10 ˚C, because 0 ˚C is not the absolute 0.
• Ratios cannot be defined reliably on a scale with an arbitrary 0.
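A small sketch of why the zero point matters: the same two temperatures give a ratio of 2 in Celsius but only about 1.035 in Kelvin, which has an absolute zero. The helper name `celsius_to_kelvin` is introduced here for illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin (absolute scale)."""
    return celsius + 273.15


# Ratio on the interval scale (arbitrary zero): looks like "twice as hot".
ratio_celsius = 20 / 10

# Ratio on the ratio scale (absolute zero): the physically meaningful one.
ratio_kelvin = celsius_to_kelvin(20) / celsius_to_kelvin(10)

print(ratio_celsius)           # 2.0
print(round(ratio_kelvin, 3))  # 1.035
```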

Discrete vs continuous attribute
• Discrete attribute
○ Values are distinct and separate (unconnected values).
○ Can have integer values (e.g., age) or binary values.
○ Can be infinite but must be countable (each value in the set has a corresponding integer).
• Continuous attribute
○ Has real numbers as attribute values, e.g., temperature, height, or weight.
○ Can take on any value within a finite or infinite interval.
○ Continuous attributes are typically represented as floating-point variables.
Summary
Here is a quick recap:
• We discussed various types of datasets, such as record and ordered, along with data objects, each of which represents an entity.
• We looked at the different types of attributes and the relationships between them, e.g., all ratio attributes are interval attributes.
• We talked about the need for the absolute “0”, i.e., ratio attributes.
• We learned the difference between discrete and continuous attributes: discrete attributes can have integer values (for example, age) or binary values, whereas continuous attributes take on any value within a finite or infinite interval.
Agenda
In this session, we will learn about:
● The basic statistical description of data
● Measures of central tendency
● Weighted arithmetic mean
● Symmetric vs. Skewed data
● Properties of a normal distribution curve
Basic statistical descriptions of data
● Motivation
○ To understand the data better.
● Central tendency
○ Mode, median, mean, midrange
● Dispersion of the data
○ Range, quartiles, interquartile range, five-number summary, boxplots
○ Variance and standard deviation
● Graphs to present data summaries and distributions
○ Quantile plots, histograms, scatter plots, etc.
Measuring the central tendency
● Mean (algebraic measure) (sample vs. population):
○ Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$; population mean: $\mu = \frac{\sum x}{N}$
Note: n is the sample size, and N is the population size.
○ Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
○ Trimmed mean: the mean computed after chopping extreme values
● Median:
○ Middle value if there is an odd number of ordered values, or the average of the middle two values otherwise
○ Estimated by interpolation (for grouped data)
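The measures of central tendency listed above can be sketched with the Python standard library. The data values and weights are made up, and the helpers `weighted_mean` and `trimmed_mean` are illustrative names; the trimmed mean here drops a fixed count `k` of values at each end, which is one of several possible trimming conventions.

```python
import statistics


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Made-up data with one extreme value (110).
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(data)      # pulled upward by the extreme value
median = statistics.median(data)  # average of the two middle values
trimmed = trimmed_mean(data, 1)   # drops 30 and 110
weighted = weighted_mean([80, 90], [1, 3])

print(mean, median, trimmed, weighted)  # 58 54.0 55.6 87.5
```

The median and the trimmed mean both sit below the plain mean because they are less sensitive to the single large value.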

Measuring the central tendency (Contd.)
● Mode
○ The value that occurs most frequently for a variable.
○ Unimodal, bimodal, trimodal, or no mode if all values occur the same number of times
○ Empirical relationship between mean, mode, and median on moderately skewed data: mean − mode = 3 × (mean − median)


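Median for grouped data, estimated by interpolation:
$\text{median} = L_1 + \left(\frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$
where L1 is the lower boundary of the median interval, (Σ freq)l is the sum of the frequencies of all intervals below the median interval, freqmedian is the frequency of the median interval, and width is the width of the median interval.
Example:
N = 3194 (total observations), N/2 = 3194/2 = 1597
L1 = 21
(Σ freq)l = 200 + 450 + 300 = 950
freqmedian = 1500
width = 30
median = 21 + ((1597 − 950) / 1500) × 30 = 33.94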
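The worked numbers above can be checked with a short sketch. It assumes the usual interpolation formula for grouped data, median = L1 + ((N/2 − sum of lower frequencies) / median-interval frequency) × width; the function and parameter names are illustrative.

```python
def grouped_median(lower_boundary, n_total, freq_below, freq_median, width):
    """Approximate the median of grouped data by linear interpolation
    inside the interval that contains the middle observation."""
    return lower_boundary + (n_total / 2 - freq_below) / freq_median * width


median = grouped_median(
    lower_boundary=21,            # L1
    n_total=3194,                 # N
    freq_below=200 + 450 + 300,   # frequencies of the intervals below
    freq_median=1500,             # frequency of the median interval
    width=30,                     # width of the median interval
)
print(round(median, 2))  # 33.94
```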
Making sense of the estimation


Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data.
Measuring the dispersion of data
● Range, quartiles, outliers, and boxplots
○ Range: the difference between the largest and smallest values in a set
■ Midrange: the average of the largest and smallest values in a data set
○ Quantiles: points taken at regular intervals of a data distribution
■ Quartiles and percentiles: Q1 (25th percentile), Q2 (median), Q3 (75th percentile)
○ Interquartile range: IQR = Q3 − Q1
○ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
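A minimal sketch of the 1.5 × IQR outlier rule described above, on made-up data. Quartiles can be computed under several conventions; this example uses the default ("exclusive") method of `statistics.quantiles`, so other tools may report slightly different Q1 and Q3 values.

```python
import statistics

# Made-up data with one suspiciously large value.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

midrange = (min(data) + max(data)) / 2
value_range = max(data) - min(data)

print(q1, q2, q3, iqr)  # 47.75 54.0 68.25 20.5
print(outliers)         # [110]
```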
Measuring the dispersion of data (Contd.)

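● Variance and standard deviation (sample: s, population: σ)
○ Variance s² (or σ²): algebraic, scalable computation
─ Sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$; population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
○ Standard deviation s (or σ) is the square root of the variance s² (or σ²)
○ Mean absolute deviation (MAD)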
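The dispersion measures above can be sketched with the standard library. The data are made up, and MAD is taken here as the mean of the absolute deviations from the mean, which is an assumption about the intended definition (some texts use the median instead).

```python
import statistics

# Made-up data chosen so that the population results are round numbers.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)

pop_var = statistics.pvariance(data)    # divides by N
pop_std = statistics.pstdev(data)       # square root of pop_var
sample_var = statistics.variance(data)  # divides by n - 1
sample_std = statistics.stdev(data)     # square root of sample_var

# Mean absolute deviation around the mean.
mad = sum(abs(x - mean) for x in data) / len(data)

print(pop_var, pop_std)      # 4 2.0
print(round(sample_var, 3))  # 4.571
print(mad)                   # 1.5
```

The sample variance is larger than the population variance on the same numbers because it divides by n − 1 rather than N.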

Normal distribution curve (μ: mean, σ: standard deviation):
● From μ–σ to μ+σ: contains about 68% of the observations
● From μ–2σ to μ+2σ: contains about 95% of the observations
● From μ–3σ to μ+3σ: contains about 99.7% of the observations
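The coverage of a normal distribution within one, two, and three standard deviations of the mean (about 68%, 95%, and 99.7%) can be verified numerically. This sketch uses the identity P(|X − μ| < kσ) = erf(k/√2); the function name `coverage_within` is illustrative.

```python
import math


def coverage_within(k):
    """Probability that a normal variable lies within k standard
    deviations of its mean."""
    return math.erf(k / math.sqrt(2))


coverages = {k: round(coverage_within(k) * 100, 1) for k in (1, 2, 3)}
print(coverages)  # {1: 68.3, 2: 95.4, 3: 99.7}
```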

Summary
Here is a quick recap:
• We looked at the basic statistical description of the data.
• We discussed the various measures of central tendency, such as the mean, weighted arithmetic mean, median, and mode.
• We understood that a symmetric distribution is one in which the data is equally distributed around the mean, mode, or median, and a skewed distribution is one in which the tail of the distribution is longer on the left-hand side than on the right-hand side or vice versa.
• We discussed the various measures of dispersion of the data, such as variance and standard deviation.
• We discussed the properties of a normal distribution curve using various distribution graphs.
Agenda
In this session, we will learn about:
● Box plot
● Quantile plot
● Quantile-Quantile (Q-Q) plot
● Scatter plot
● Positive and negative correlation
● Non-correlated data

• Quantile plot: each value xi is paired with fi, indicating that approximately fi × 100% of the data are ≤ xi.
• Quantile-quantile (Q-Q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
• Bar chart: x-axis presents values, y-axis frequency/count
• Histogram: x-axis values, y-axis frequency/density
• Scatter plot: graphs bi/tri-variate data as points in a 2/3-D space
Box plot

• Five-number summary of a distribution: Minimum, Q1, Median, Q3, Maximum
• Boxplot
○ Data is represented with a box.
○ The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR.
○ The median is marked by a line within the box.
○ Whiskers: two lines outside the box, extended to the smallest and largest observations that lie within 1.5 × IQR of the quartiles.
○ Outliers: points beyond a specified outlier threshold, plotted individually.
Quantile plot
• Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences).
• Plots quantile information of a univariate distribution
○ Data are sorted in increasing order. For a value xi, fi indicates that approximately fi × 100% of the data are below or equal to xi.
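A sketch of how the (fi, xi) pairs of a quantile plot could be computed. The slide does not give a formula for fi; this example assumes the common convention fi = (i − 0.5) / N for the i-th smallest value, and the unit prices are made up.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N.

    The (i - 0.5) / N convention is an assumption; other plotting
    positions are also in use.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Made-up unit prices.
points = quantile_plot_points([47, 40, 75, 43, 74])
for f, x in points:
    print(f"{f:.1f} -> {x}")
```

Plotting the pairs with f on the horizontal axis and x on the vertical axis gives the quantile plot; here the median (f = 0.5) is 47.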

Quantile-Quantile (Q-Q) plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles of another, as pairs (xi, yi).
• Shows whether two variables follow the same distribution.
• The example shows the unit price of items sold at branch 1 vs. branch 2 for each quantile. Unit prices of items sold at branch 1 tend to be lower than those at branch 2.
Bar Chart vs Histogram
• Histogram: shows the distribution of a given variable by depicting the frequencies/densities of observations occurring in certain ranges of values.
• Differs from a bar chart:
Attribute | Histogram | Bar chart
Variable type | Continuous variables | Discrete variables
Gap between bars | Adjacent bars | Space-separated bars
Bar width | Could have varied width | Equal width
Bar order | Matters | Does not matter
• The two histograms shown may have the same boxplot representation.
• They have the same values for min, Q1, median, Q3, and max, but rather different data distributions.
Scatter plot
• Presents bivariate numerical data to reveal clusters of points, outliers, correlations, etc.
• Each pair of values is treated as a pair of coordinates and plotted as a point in the plane.
• The left half fragment is positively linearly correlated.
• The right half is negatively linearly correlated.
Summary

Here is a quick recap:

• We looked at various charts and plots, such as the box plot, histogram, quantile plot, Q-Q plot, and scatter plot.
• We learned that a positive correlation describes a relationship between two variables that change in the same direction, and a negative correlation describes a relationship between two variables that change in opposite directions.
Agenda

In this session, we will learn about:

• Hypothesis testing
• One-tailed and two-tailed tests
• Steps to perform hypothesis testing

• Hypothesis testing is a scientific process of testing whether a hypothesis is plausible or not.
• A test includes a null hypothesis and an alternative hypothesis, for example:
○ Null hypothesis: $\bar{x} = \mu$; alternative hypothesis: $\bar{x} > \mu$
○ Null hypothesis: variable A and variable B are independent; alternative hypothesis: they are correlated
• The goal is to determine which hypothesis is likely to be true at a given confidence level.
Hypothesis testing: comparison of means
● One-sample Z test
○ Degrees of freedom: not applicable
○ Application: testing the difference of a sample mean, x-bar, with a known population mean, μ
○ Assumptions: normal distribution, known population σ
○ Test statistic: $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
● One-sample t test
○ Degrees of freedom: n-1
○ Application: testing the difference of one sample mean, x-bar, with a given mean, μ
○ Assumptions: normal distribution, population standard deviation σ is unknown
○ Test statistic: $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
● Two-sample t test
○ Degrees of freedom: n1+n2-2
○ Application: testing the difference of two sample means when the population variances are unknown but considered equal
○ Assumptions: normal distribution
○ Test statistic: $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}$, where $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
Figure: rejection area.
Hypothesis testing
● Paired t test
○ Degrees of freedom: n-1
○ Application: testing two sample means when their respective population standard deviations are unknown but considered equal; data are recorded in pairs, and each pair has a difference, d
○ Assumptions: normal distribution, two dependent samples, always a two-tailed test
○ Test statistic: $t = \frac{\bar{d}}{s_d / \sqrt{n}}$, where $s_d$ is the standard deviation of the differences
● One-way ANOVA
○ Degrees of freedom: n1-1 & n2-1
○ Application: testing the difference of three or more sample means
○ Assumptions: normal distribution
○ Test statistic: $F = \frac{MS_{\text{between}}}{MS_{\text{within}}}$
Confidence level and significance level
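• Significance level: the risk we are willing to take of rejecting the null hypothesis when it is actually true.
• Typical: 5% or 1%
• Confidence level = 1 – significance level
• Typical: 95% or 99%
• Example: $\bar{x}$ = 110, μ = 100, and say our Z = 2.5. For a one-tailed test at the 5% significance level, the rejection area is Z > 1.645, because P(Z > 1.645) = 0.05. Z = 2.5 falls in the rejection area, so the null hypothesis is rejected.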
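The slide's example (Z = 2.5 against the one-tailed 5% cutoff of 1.645) can be sketched as a p-value comparison. The upper-tail probability of a standard normal is computed with `math.erfc`; the function name `upper_tail_p_value` is illustrative, and the 5% significance level is one of the typical values named above.

```python
import math


def upper_tail_p_value(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


z = 2.5       # test statistic from the example
alpha = 0.05  # significance level (confidence level = 1 - alpha = 0.95)

p_value = upper_tail_p_value(z)
reject_null = p_value < alpha

print(round(p_value, 4), reject_null)       # 0.0062 True
print(round(upper_tail_p_value(1.645), 3))  # 0.05, the one-tailed 5% cutoff
```

Comparing the p-value with alpha is equivalent to checking whether Z lies beyond the critical value 1.645.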

One-tailed and two-tailed tests
● One-tailed test
1. Null hypothesis: $\bar{x} = \mu$
2. Alternative hypothesis: $\bar{x} > \mu$, where μ is the hypothesized mean
● Two-tailed test
1. Null hypothesis: $\bar{x} = \mu$
2. Alternative hypothesis: $\bar{x} \neq \mu$, where μ is the hypothesized mean

• Steps in hypothesis testing:
1. State the hypotheses (null and alternative).
2. Identify the test statistic and its probability distribution.
3. Specify the significance level.
4. Collect the data and perform the calculations.
5. Make the statistical decision.
6. Make the business decision.
Summary

Here is a quick recap:

• We discussed hypothesis testing, a scientific process of testing whether or not a hypothesis is plausible.
• We understood that one-tailed tests allow the testing of an effect in one direction, whereas two-tailed tests allow the testing of an effect in two directions, positive and negative.
• We looked at various tests to check the null and alternative hypotheses, such as the one-sample Z test, two-sample t test, etc.
• We looked at the steps to conduct a hypothesis test.
Agenda
In this session, we will learn about:
● Measuring data similarity and dissimilarity
● Proximity measures for nominal, ordinal, and binary attributes
● Proximity measures for numerical attributes and normalization
● Compute dissimilarity with mixed type variables
● Cosine similarity
Similarity and Dissimilarity
● Similarity
○ Numerical measure of how alike two data objects are.
○ Value is higher when objects are more alike.
○ Often falls in the range [0,1].
● Dissimilarity (e.g., distance)
○ Numerical measure of how different two data objects are.
○ Lower when objects are more alike.
○ Minimum dissimilarity is often 0.
○ Upper limit varies.
● Proximity may refer to either similarity or dissimilarity.

● Distance/similarity matrix
○ For n data points, registers only the pairwise distances/similarities
○ Is often a symmetric matrix
○ Single mode: (dis)similarity
Proximity measure for nominal attributes
Nominal attributes can take two or more states/values, e.g., color can be red, yellow, blue, green, etc. (a generalization of a binary attribute).
Method 1: Simple matching

Observations (cat1 and cat2) are described by the nominal attributes color, size, and sleep time.

Objects | Color | Size | Sleep time
cat1 | yellow | small | <5 hours
cat2 | yellow | medium | 5-8 hours

$d(i, j) = \frac{p - m}{p}$
d(i, j): distance between i and j
m: number of attributes with the same values
p: total number of attributes
For cat1 and cat2, m = 1 (color) and p = 3, so d(cat1, cat2) = (3 − 1)/3 ≈ 0.67.
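A minimal sketch of simple matching for the two cats, assuming the usual definition d(i, j) = (p − m) / p, where m is the number of attributes with the same value and p is the total number of attributes. The function name is illustrative.

```python
def simple_matching_distance(obj_i, obj_j):
    """Fraction of nominal attributes on which two objects differ."""
    p = len(obj_i)                                      # total attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # matching attributes
    return (p - m) / p


# Attribute order: color, size, sleep time.
cat1 = ("yellow", "small", "<5 hours")
cat2 = ("yellow", "medium", "5-8 hours")

d = simple_matching_distance(cat1, cat2)
print(round(d, 2))  # 0.67, because only color matches (m = 1, p = 3)
```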
Proximity measure for binary attributes
Method 2:
● In this method, nominal attributes are converted to binary attributes.
● Creating a new binary attribute for each of the nominal states is called “one-hot encoding”; it yields a binary attribute table as shown below.
● A proximity measure for binary attributes is then used to measure the similarity between the objects.

Objects | color-yellow | color-… | size-small | size-medium | sleep time <5 hours | sleep time 5-8 hours
cat1 | 1 | 0 | 1 | 0 | 1 | 0
cat2 | 1 | 0 | 0 | 1 | 0 | 1
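One-hot encoding of the two cats can be sketched as follows. The function name `one_hot_encode` and the dictionary layout are illustrative; unlike the table above, this sketch only creates columns for states that actually occur in the data, so there is no column for other colors.

```python
def one_hot_encode(objects, attributes):
    """Create one binary column per observed (attribute, state) pair."""
    columns = []
    for attr in attributes:
        for obj in objects:
            column = (attr, obj[attr])
            if column not in columns:  # keep first-seen order
                columns.append(column)
    rows = [
        [1 if obj[attr] == state else 0 for attr, state in columns]
        for obj in objects
    ]
    return columns, rows


cats = [
    {"color": "yellow", "size": "small", "sleep time": "<5 hours"},
    {"color": "yellow", "size": "medium", "sleep time": "5-8 hours"},
]
columns, rows = one_hot_encode(cats, ["color", "size", "sleep time"])

print(columns)
print(rows)  # [[1, 1, 0, 1, 0], [1, 0, 1, 0, 1]]
```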


The number of attributes having the same/different values for two observations (e.g., cat1 and cat2 in the previous table) is counted using the binary attribute table, forming the contingency table shown below:

 | Object j = 1 | Object j = 0 | sum
Object i = 1 | q | r | q+r
Object i = 0 | s | t | s+t
sum | q+s | r+t | p

q is the number of attributes for which both objects have the value 1
r is the number of attributes for which object i has the value 1 and object j has the value 0
s is the number of attributes for which object i has the value 0 and object j has the value 1
t is the number of attributes for which both objects have the value 0
p is the total number of attributes, p = q + r + s + t
Proximity measure for binary attributes
By using the contingency table previously created, distance/similarity metrics are calculated as shown below:

Distance measure for symmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s + t}$

Distance measure for asymmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s}$

Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$

Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

● Gender is a nominal attribute.
● The remaining attributes are asymmetric binary.
● Let the values Y and P be 1, and the value N be 0.
● Consider only the asymmetric attributes.
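The patient example can be worked through in code. This sketch assumes the standard definitions for asymmetric binary attributes, distance d(i, j) = (r + s) / (q + r + s) and Jaccard similarity q / (q + r + s), where q, r, s, t are the contingency counts; the function names are illustrative. Vectors list Fever, Cough, Test-1 to Test-4 with Y and P coded as 1 and N as 0.

```python
def contingency_counts(i, j):
    """Return (q, r, s, t) for two equal-length binary vectors."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t


def asymmetric_binary_distance(i, j):
    """Ignores 0/0 matches (t); undefined if both vectors are all zeros."""
    q, r, s, _ = contingency_counts(i, j)
    return (r + s) / (q + r + s)


def jaccard_similarity(i, j):
    q, r, s, _ = contingency_counts(i, j)
    return q / (q + r + s)


# Gender is left out; only the asymmetric binary attributes are used.
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}

d_jack_mary = asymmetric_binary_distance(patients["Jack"], patients["Mary"])
d_jack_jim = asymmetric_binary_distance(patients["Jack"], patients["Jim"])
d_jim_mary = asymmetric_binary_distance(patients["Jim"], patients["Mary"])

print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
# 0.33 0.67 0.75
```

Under these assumptions Jack and Mary are the most similar pair, and Jim and Mary the least similar.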


Minkowski distance: $d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm).
● Properties
○ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
○ d(i, j) = d(j, i) (symmetry)
○ d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
● A distance that satisfies these properties is a metric.

Special cases of Minkowski distance
● h = 1: Manhattan (city block, L1 norm) distance
○ E.g., the Hamming distance: the number of bits that are different between two binary vectors

● h = 2: (L2 norm) Euclidean distance

Special cases of Minkowski distance (Contd.)
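● Chebyshev distance:
○ When h → ∞: “supremum” (Lmax norm, L∞ norm) distance.
■ This is the maximum difference between any component (attribute) of the vectors:

$$d(i, j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_{f} |x_{if} - x_{jf}|$$

● When h → −∞:
○ The limit is the minimum difference between any component (attribute) of the vectors. This limit is not a metric, because it can be 0 for two different objects.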
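
The limiting behaviour can be illustrated numerically. The sketch below, with hypothetical points and illustrative function names, shows the Minkowski distance approaching the supremum distance as h grows.

```python
def minkowski(i, j, h):
    """Minkowski distance of order h between two equal-length vectors."""
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1.0 / h)


def supremum_distance(i, j):
    """Chebyshev (L-infinity) distance: the largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(i, j))


# Hypothetical 2-D points, chosen only for illustration.
x1, x2 = (1.0, 2.0), (3.0, 5.0)

sup = supremum_distance(x1, x2)  # max(|1 - 3|, |2 - 5|) = 3

# The Minkowski distance shrinks toward the supremum distance as h increases.
approximations = {h: minkowski(x1, x2, h) for h in (1, 2, 10, 100)}
for h, value in approximations.items():
    print(f"h = {h:>3}: {value:.6f}")
```

Very large h values can overflow floating-point arithmetic on data with large differences, so in practice the maximum is computed directly rather than as a limit.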

Example: Manhattan (L1), Euclidean (L2), and supremum distances.
Normalization of numerical values
● Values measured on different scales cannot be compared directly.
Student A: SAT = 1800
Student B: ACT = 24
Which student performed better relative to other test-takers?
● Normalization is widely used with multidimensional datasets involving different scales: clustering, multidimensional scaling, principal component analysis, etc.

                     SAT    ACT
Mean                 1500   21
Standard deviation   300    5

Standardizing numeric data
● Z-score

$$z = \frac{x - \mu}{\sigma}$$

○ x: raw score to be standardized, μ: mean of the population, σ: standard deviation
○ the distance between the raw score and the population mean in units of the standard deviation
○ negative when the raw score is below the mean, positive when above

● An alternative way: calculate the mean absolute deviation (MAD),

$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$

where

$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$

Standardized measure:

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

● MAD is more robust to outliers than the standard deviation because, in the former, the deviations from the mean are not squared.
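
As a worked illustration, the sketch below applies the z-score to the SAT/ACT question from the normalization slide and shows MAD-based standardization on a small sample. The function names and the sample list are hypothetical; the means and standard deviations are the ones given in the table.

```python
from statistics import mean


def z_score(x, mu, sigma):
    """Distance of x from the mean, in units of the standard deviation."""
    return (x - mu) / sigma


# Student A (SAT) and Student B (ACT), using the table's mean and std. dev.
z_sat = z_score(1800, 1500, 300)  # (1800 - 1500) / 300 = 1.0
z_act = z_score(24, 21, 5)        # (24 - 21) / 5 = 0.6


def mad_standardize(values):
    """Standardize values using the mean absolute deviation instead of sigma."""
    m = mean(values)
    s = mean([abs(v - m) for v in values])
    if s == 0:
        raise ValueError("all values are identical; MAD is zero")
    return [(v - m) / s for v in values]


# Hypothetical sample of scores, for illustration only.
scores = [1200, 1500, 1800, 1500]
standardized = mad_standardize(scores)  # mean 1500, MAD 150
```

Student A is 1.0 standard deviations above the SAT mean, while Student B is 0.6 standard deviations above the ACT mean, so Student A performed better relative to other test-takers.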
Proximity measure for ordinal variables
● Order is important, e.g., rank
Math grade: A, B, C, D, E.

● Map ordinal values to values between 0 and 1 (to interval-scaled)

1. Replace each value $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$, where $M_f$ is the number of ordered states of variable f
2. Map the range of each variable onto [0, 1] by replacing the i-th value in the f-th variable by

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

3. Compute the dissimilarity using methods for numerical variables
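
The three steps above can be sketched for the math-grade example. The sketch assumes the grades are ranked in the order listed (A = 1 through E = 5) and uses the absolute difference as the numerical dissimilarity; both choices, and the function names, are assumptions made for illustration.

```python
# Ordered states of the ordinal variable; position in the list gives the rank.
GRADES = ["A", "B", "C", "D", "E"]


def to_unit_interval(value, ordered_levels):
    """Steps 1 and 2: replace a value by its rank, then map it onto [0, 1]."""
    m = len(ordered_levels)
    if m < 2:
        raise ValueError("need at least two ordered levels")
    rank = ordered_levels.index(value) + 1  # ranks run from 1 to m
    return (rank - 1) / (m - 1)


def ordinal_dissimilarity(a, b, ordered_levels):
    """Step 3: treat the mapped values as numeric and take their difference."""
    return abs(to_unit_interval(a, ordered_levels)
               - to_unit_interval(b, ordered_levels))


mapped = {g: to_unit_interval(g, GRADES) for g in GRADES}
# A -> 0.0, B -> 0.25, C -> 0.5, D -> 0.75, E -> 1.0

d_a_c = ordinal_dissimilarity("A", "C", GRADES)  # 0.5
d_b_e = ordinal_dissimilarity("B", "E", GRADES)  # 0.75
```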

Dissimilarity between objects with mixed-type attributes
● A database may contain all attribute types.
○ Nominal, symmetric binary, asymmetric binary, numeric, ordinal
● One may use a weighted formula to combine their effects.
● Distance between objects i and j over all features/attributes f:

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} \, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

● The indicator δij(f) is
○ 0 if f is missing for either object i or j, or if f for i and j are both 0 and f is an asymmetric binary attribute
○ 1 otherwise
● The contribution dij(f) depends on the type of f:
○ f is binary or nominal: dij(f) = 0 if xif = xjf, or dij(f) = 1 otherwise
○ f is numeric: use the normalized distance
○ f is ordinal: convert to numeric
Compute dissimilarity with mixed variables

Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4

Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

● Gender is a nominal attribute; others are asymmetric binary.

● Let the values Y and P be 1, and the value N be 0.


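Considering all attributes:

d(Jack, Mary) = (1×1 + 0×1 + 0×0 + 0×1 + 0×0 + 1×1 + 0×0) / (1 + 1 + 1 + 1) = 2/4 = 0.5

Each term in the numerator is d × δ for one attribute, in the order listed below; the denominator sums the δ values that equal 1.

Gender: nominal: d=1, δ=1
Fever: asymmetric binary: d=0, δ=1
Cough: asymmetric binary, both 0: d=0, δ=0
Test-1: asymmetric binary: d=0, δ=1
Test-2: asymmetric binary, both 0: d=0, δ=0
Test-3: asymmetric binary: d=1, δ=1
Test-4: asymmetric binary, both 0: d=0, δ=0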
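
The same calculation can be sketched in Python for every pair of patients. This sketch handles only the two attribute types that occur in the table (nominal and asymmetric binary) and does not model missing values; the names `SCHEMA`, `PATIENTS`, and `mixed_dissimilarity` are illustrative assumptions.

```python
NOMINAL = "nominal"
ASYM_BINARY = "asymmetric binary"

# Attribute names and types, in table order.
SCHEMA = [
    ("Gender", NOMINAL),
    ("Fever", ASYM_BINARY),
    ("Cough", ASYM_BINARY),
    ("Test-1", ASYM_BINARY),
    ("Test-2", ASYM_BINARY),
    ("Test-3", ASYM_BINARY),
    ("Test-4", ASYM_BINARY),
]

# Rows copied from the table above.
PATIENTS = {
    "Jack": ["M", "Y", "N", "P", "N", "N", "N"],
    "Mary": ["F", "Y", "N", "P", "N", "P", "N"],
    "Jim":  ["M", "Y", "P", "N", "N", "N", "N"],
}


def code(value):
    """Code Y and P as 1, and N as 0."""
    return 1 if value in ("Y", "P") else 0


def mixed_dissimilarity(a, b):
    """Weighted average of per-attribute dissimilarities d, weighted by delta."""
    numerator = 0
    denominator = 0
    for (_, kind), xa, xb in zip(SCHEMA, a, b):
        if kind == ASYM_BINARY:
            ca, cb = code(xa), code(xb)
            if ca == 0 and cb == 0:
                continue  # delta = 0: a shared 0 carries no information
            d = 0 if ca == cb else 1
        else:  # nominal
            d = 0 if xa == xb else 1
        numerator += d    # delta = 1 for every attribute that reaches here
        denominator += 1
    return numerator / denominator if denominator else 0.0


d_jack_mary = mixed_dissimilarity(PATIENTS["Jack"], PATIENTS["Mary"])  # 2/4 = 0.5
d_jack_jim = mixed_dissimilarity(PATIENTS["Jack"], PATIENTS["Jim"])    # 2/4 = 0.5
d_mary_jim = mixed_dissimilarity(PATIENTS["Mary"], PATIENTS["Jim"])    # 4/5 = 0.8
```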
Cosine similarity: similarity between vectors of numerical values
• A document can be represented by thousands of attributes, each recording the frequency of a
particular word (such as keywords) or phrase in the document.
Document   Team  Coach  Hockey  Baseball  Soccer  Penalty  Score  Win  Loss  Season
Document1  5     0      3       0         2       0        0      2    0     0
Document2  3     0      2       0         1       1        0      1    0     1
Document3  0     7      0       2         1       0        0      3    0     0
Document4  0     1      0       0         1       2        2      0    3     0

• The angle between any two vectors (documents) can be used as a measure of the similarity between the two documents.

• If d1 and d2 are two vectors (e.g., term-frequency vectors), then

cos(d1, d2) = (d1 ∙ d2) / (||d1|| × ||d2||),

where ∙ indicates the vector dot product and ||d|| is the length of vector d.

• For vectors with non-negative components, such as term-frequency vectors, cosine similarity lies in [0, 1].


• Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 ∙ d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
||d1|| = (5×5 + 0×0 + 3×3 + 0×0 + 2×2 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3×3 + 0×0 + 2×2 + 0×0 + 1×1 + 1×1 + 0×0 + 1×1 + 0×0 + 1×1)^0.5 = (17)^0.5 = 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) = 0.94
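
The worked example can be reproduced with a short Python sketch. The function name `cosine_similarity` is an assumption for illustration; the two vectors are the Document1 and Document2 rows of the term-frequency table.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)  # 25 / sqrt(42 * 17), about 0.94
```

Because the measure depends only on the angle between the vectors, scaling a document's term counts (for example, doubling every count) leaves the similarity unchanged.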


Here is a quick recap:

● We learned how to measure data similarity and dissimilarity.
● We looked into distance measures, such as the Minkowski distance, and into the standardization of numeric data.
● We learned proximity measures for nominal, ordinal, and binary attributes.
● We also learned how to compute dissimilarity with mixed variables.
● We discussed cosine similarity with an example.
Learning Outcomes
Coming to the end of this module, you should now be able to:
• Differentiate between the types of attributes: nominal, binary, ordinal, interval-scaled, and ratio-scaled.
• Evaluate basic descriptive statistics of a dataset: central tendency and dispersion.
• Illustrate and interpret graphic plots that display descriptive statistics.
• Summarize statistical hypothesis testing.
• Evaluate object similarity and dissimilarity in mixed-type datasets.
• Summarize cosine similarity using an example.