Know your data
sainathgunda99@gmail.com
DLZNK464L9
Week 2
This file is meant for personal use by sainathgunda99@gmail.com only.
Proprietary content. ©University of Arizona. All Rights Reserved. Unauthorized use or distribution prohibited.
Sharing or publishing the contents in part or full is liable for legal action.
Learning objectives
By the end of this module, you will be able to:
• List the different types of attributes.
• Compute basic descriptive statistics of a dataset.
• Create and read graphic plots that display descriptive statistics.
• Explain statistical hypothesis testing.
• Compute object similarity and dissimilarity.
Data objects and attribute types
Agenda
In this session, we will learn about:
• Types of datasets
• Data objects
• Attributes and their types
• Relationship of attributes
• Need for the absolute “0”
• Discrete vs. continuous attributes
Types of data sets
• Record
○ Relational records
○ Data matrix: numerical matrix, crosstabs
○ Document data: text documents, term-frequency vector
○ Transactional data
• Graph and network
○ World Wide Web
○ Social or information networks
○ Molecular structures
• Ordered
○ Video data: sequence of images
○ Temporal data: time series
○ Sequential data: transaction sequences
○ Genetic sequence data
• Spatial, image, and multimedia
○ Spatial data: maps
○ Image data
○ Video data
Illustration - Record type datasets
Document data
Documents Team Coach Play Ball Score Game Win Lost Timeout Season
Document1 3 0 5 0 2 6 0 2 0 2
Document2 0 7 0 2 1 0 0 3 0 0
Document3 0 1 0 0 1 2 2 0 3 0
Transactional data
TID Items
1 Bread, coke, milk
2 Beer, bread
3 Beer, coke, diaper, milk
4 Beer, bread, diaper, milk
5 Coke, diaper, milk
Data objects
• Data sets are made up of data objects.
• A data object represents an entity (an observation).
• Examples:
○ sales database: customers, store items, sales
○ medical database: patients, treatments
○ university database: students, professors, courses
• Data objects are also called observations, samples, examples, instances, data points, objects, and
tuples.
• Data objects are described by their attributes.
• In a database table, rows correspond to data objects and columns correspond to attributes.
Fitness of a dataset
Data sets need to be as complete as possible for the questions/phenomena to be studied.
• Are data objects/observations in your data set representative of the population under study?
○ Areas of the aircraft most likely to be damaged in the war.
○ Identifying tanks in the forest.
• Are relevant attributes comprehensively included in your data set?
○ Most militarized country in the world?
○ Body height and total earnings?
Attributes
• An attribute (also called a dimension, feature, or variable) is a data field that represents a characteristic or feature of a data object, e.g., customer_ID, name, address.
• The set of attributes that describes a given data object is called an attribute vector (or feature vector).
• Types:
o Categorical: qualitative
─ Nominal, binary, ordinal
o Numeric: quantitative
─ Interval-scaled, ratio-scaled
Attribute types: Qualitative
• Nominal: categories, states, or “names of things”
○ Hair_color = {auburn, black, blond, brown, grey, red, white}
○ marital status, occupation, ID numbers, zip codes
• Binary
○ Nominal attribute with only 2 states (0 and 1)
○ Symmetric binary: both outcomes are equally important.
─ e.g., biological gender
○ Asymmetric binary: outcomes not equally important.
─ e.g., medical test (positive vs. negative)
─ Convention: assign 1 to the most important outcome (e.g.,
positive)
• Ordinal
○ Values have a meaningful order (ranking), but the magnitude
between successive values is unknown.
○ Size = {small, medium, large}, letter grades, army rankings
Numeric attribute types: Quantitative
• Quantity (integer or real-valued)
• Interval
○ Measured on a scale of equal-sized units
○ Values have order
─ E.g., the temperature in Celsius or Fahrenheit units, calendar
dates
○ No true zero-point
• Ratio
○ Inherent zero-point
○ We can speak of a value as being a multiple (or ratio) of another value
(10 kelvins is twice as high as 5 kelvins).
─ E.g., the temperature in Kelvin, length, counts, monetary
quantities.
Relationship of attributes
• All ratio attributes are interval attributes.
• All interval attributes are ordinal attributes.
• All ordinal attributes are nominal attributes.
• Ratio: absolute zero
• Interval: distance is meaningful
• Ordinal: attributes can be ordered
• Nominal: attributes are only named (weakest)
The need for the absolute “0”
• A temperature of 20 ˚C is not twice as hot as 10 ˚C, because 0 ˚C is not absolute zero.
• Ratios cannot be defined reliably on a scale with an arbitrary zero.
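The point about the arbitrary zero can be checked numerically. The sketch below is illustrative only; the helper name `celsius_to_kelvin` is an assumption and not part of the course material.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert degrees Celsius to kelvins (0 K is absolute zero)."""
    return c + 273.15

# On the Celsius (interval) scale the naive ratio looks like 2,
# but that number depends on where the arbitrary zero was placed.
naive_ratio = 20 / 10

# On the Kelvin (ratio) scale, the same two temperatures differ by only a few percent.
kelvin_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)

print(naive_ratio)
print(round(kelvin_ratio, 4))
```

The same two physical temperatures give different ratios in different units, which is why ratios are only meaningful for ratio-scaled attributes.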
Discrete vs continuous attribute
• Discrete attribute
○ Values are distinct and separate (unconnected values).
○ Can have integer values, e.g., age or binary values.
○ Can be infinite but must be countable (each value in the set has a corresponding integer).
• Continuous attribute
○ Has real numbers as attribute values. E.g., temperature, height, or weight.
○ Can take on ANY value within a finite or infinite interval.
○ Continuous attributes are typically represented as floating-point variables.
Summary
Here is a quick recap:
• We discussed various types of datasets, such as record and ordered datasets, along with data
objects, each of which represents an entity.
• We looked at the different types of attributes and the relationships between them, for example,
that all ratio attributes are also interval attributes.
• We talked about the need for the absolute “0”, which is what defines a ratio attribute.
• We learned the difference between discrete and continuous attributes: discrete attributes take
distinct values such as integers (for example, age) or binary values, whereas continuous attributes
can take on any value within a finite or infinite interval.
Basic statistical descriptions of data
(Part 1)
Agenda
In this session, we will learn about:
● The basic statistical description of data
● Measures of central tendency
● Weighted arithmetic mean
● Symmetric vs. Skewed data
● Properties of a normal distribution curve
Basic statistical descriptions of data
● Motivation
○ To understand the data better.
● Central tendency
○ Mode, median, mean, midrange
● Dispersion of the data
○ Range, quartiles, interquartile range, five-number summary, boxplots
○ Variance and standard deviation
● Graphs to present data summaries and distributions
○ Quantile plots, histograms, scatter plots etc.
Measuring the central tendency
● Mean (algebraic measure) (sample vs. population):
○ Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
○ Population mean: $\mu = \frac{\sum x}{N}$
○ Note: n is the sample size, and N is the population size.
○ Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
○ Trimmed mean: the mean computed after chopping off extreme values
● Median:
○ Middle value if there is an odd number of ordered values, or the average of the
middle two values otherwise
○ Estimated by interpolation (for grouped data)
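The measures above can be sketched with Python's standard `statistics` module. This is a minimal illustration; the helper names `weighted_mean` and `trimmed_mean` and all data values are assumptions made up for the example.

```python
import statistics

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, k):
    """Mean after chopping off the k smallest and k largest values."""
    ordered = sorted(values)
    return statistics.mean(ordered[k:len(ordered) - k])

# Hypothetical salaries (in thousands); the single large value pulls the mean up.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)   # average of the two middle values
trimmed_salary = trimmed_mean(salaries, k=1)  # drops 30 and 110

# Hypothetical exam scores where the second exam counts three times as much.
course_score = weighted_mean([80, 90], weights=[1, 3])

print(mean_salary, median_salary, trimmed_salary, course_score)
```

Comparing the mean with the median and the trimmed mean gives a quick hint of how strongly extreme values influence the mean.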
Measuring the central tendency (Contd.)
● Mode
○ The value that occurs most frequently for a variable.
○ Unimodal, bimodal, trimodal, or no mode if all values occur
equally often
○ Empirical relationship between mean, median, and mode for
moderately skewed data: mean − mode ≈ 3 × (mean − median)
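A short sketch of how modes might be identified and labeled. The function names `describe_modes` and `estimated_mode` are hypothetical, and the data are made up; `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

def describe_modes(values):
    """Return the modes and a label such as 'unimodal' or 'bimodal'."""
    modes = statistics.multimode(values)
    distinct = set(values)
    # If every distinct value occurs equally often, there is no mode.
    if len(distinct) > 1 and len(modes) == len(distinct):
        return [], "no mode"
    labels = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return modes, labels.get(len(modes), "multimodal")

def estimated_mode(mean, median):
    """Rearranged empirical relation: mode is roughly 3 * median - 2 * mean."""
    return 3 * median - 2 * mean

print(describe_modes([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]))
print(describe_modes([1, 2, 3]))
print(estimated_mode(mean=58, median=54))
```

The estimate from `estimated_mode` only makes sense for unimodal, moderately skewed data, as the empirical relation itself assumes.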
Weighted arithmetic mean
Estimated median by interpolation
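● Formula: $\text{median} = L_1 + \left(\frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$
● N = 3194 total observations, so N/2 = 3194/2 = 1597
● L1 = 21 (lower boundary of the median interval)
● (Σ freq)_l = 200 + 450 + 300 = 950 (sum of the frequencies of the intervals below the median interval)
● freq_median = 1500 (frequency of the median interval)
● width = 30 (width of the median interval)
● median = 21 + ((1597 − 950) / 1500) × 30 ≈ 33.94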
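The interpolation can be reproduced in a few lines. This sketch assumes the usual grouped-data formula, median = L1 + ((N/2 − cumulative frequency below) / freq_median) × width; the function and parameter names are illustrative.

```python
def interpolated_median(lower, n_total, cum_freq_below, freq_median, width):
    """Estimate the median of grouped data by linear interpolation.

    lower          -- lower boundary of the median interval (L1)
    n_total        -- total number of observations (N)
    cum_freq_below -- sum of frequencies of all intervals below the median interval
    freq_median    -- frequency of the median interval
    width          -- width of the median interval
    """
    return lower + (n_total / 2 - cum_freq_below) / freq_median * width

# Values taken from the slide: N = 3194, L1 = 21, 950 observations below,
# 1500 observations in the median interval, interval width 30.
median = interpolated_median(21, 3194, 950, 1500, 30)
print(round(median, 2))  # 33.94
```

The estimate assumes that observations are spread evenly within the median interval, which is why it is only an approximation.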
Making sense of the estimation
Symmetric vs Skewed data
[Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data.]
Measuring the dispersion of data
● Range, quartiles, outliers, and boxplots
○ Range: the difference between the largest and smallest values
in a set
■ Midrange: the average of the largest and smallest
values in a data set.
○ Quantiles: points taken at regular intervals of a data
distribution.
■ Quartiles and percentiles: Q1 (25th percentile), Q2
(median), Q3 (75th percentile)
○ Interquartile range: IQR = Q3 − Q1
○ Outlier: usually, a value more than 1.5 × IQR above Q3 or
below Q1
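A minimal sketch of these dispersion measures using the standard library. The data are hypothetical, and `statistics.quantiles` uses its default ("exclusive") interpolation method; other tools and textbooks use slightly different quartile conventions and may give slightly different Q1 and Q3.

```python
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # hypothetical values

data_range = max(data) - min(data)
midrange = (max(data) + min(data)) / 2

# With n=4, statistics.quantiles returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Flag values more than 1.5 * IQR below Q1 or above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(data_range, midrange)
print(q1, q2, q3, iqr)
print(outliers)
```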
Measuring the dispersion of data (Contd.)
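● Variance and standard deviation (sample: s, population: σ)
○ Variance s² (or σ²): algebraic, scalable computation
■ Sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
■ Population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
○ Standard deviation s (or σ) is the square root of the variance s² (or σ²)
○ Mean absolute deviation (MAD)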
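The sample and population versions differ only in the divisor (n − 1 versus N). A small sketch with made-up data; MAD is computed here as the mean absolute deviation around the mean, which is an assumption about the intended definition.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values with mean 5

sample_var = statistics.variance(data)       # divides by n - 1
population_var = statistics.pvariance(data)  # divides by N
sample_sd = statistics.stdev(data)           # square root of the sample variance
population_sd = statistics.pstdev(data)      # square root of the population variance

# Mean absolute deviation around the mean.
mean = statistics.mean(data)
mad = sum(abs(x - mean) for x in data) / len(data)

print(sample_var, population_var)
print(sample_sd, population_sd)
print(mad)
```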
Properties of Normal Distribution Curve
● From μ − σ to μ + σ: contains about 68% of the observations
● From μ − 2σ to μ + 2σ: contains about 95% of the observations
● From μ − 3σ to μ + 3σ: contains about 99.7% of the observations
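These coverage figures can be checked against the standard normal distribution. A minimal sketch using `statistics.NormalDist` (Python 3.8 or later); the helper name `coverage` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def coverage(k):
    """Probability that a normal variable falls within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(coverage(k), 4))
```

The printed values round to about 68%, 95%, and 99.7%.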
Summary
Here is a quick recap:
• We looked at the basic statistical description of the data.
• We discussed the various measures of central tendency, such as the mean, weighted arithmetic mean,
median, and mode.
• We understood that a distribution is symmetric when the data are distributed equally around the mean,
median, and mode, and skewed when the tail of the distribution is longer on one side than on the other.
• We discussed the various measures of dispersion of the data, such as variance and standard deviation.
• We discussed the properties of a normal distribution curve.
Basic statistical descriptions of data
(Part 2)
Agenda
In this session, we will learn about:
● Box plot
● Quantile plot
● Quantile-Quantile (Q-Q) plot
● Scatter plot
● Positive and negative correlation
● Non-correlated data
Graphic displays of basic statistical descriptions
• Boxplot: the graphic display of a five-number summary.
• Quantile plot: each value x_i is paired with f_i, indicating that approximately
f_i × 100% of the data are ≤ x_i.
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate
distribution against the corresponding quantiles of another.
• Bar chart: x-axis presents values, y-axis frequency/count
• Histogram: x-axis values, y-axis frequency/density
• Scatter plot: graphs bi/tri-variate data as points in a 2/3-D plane
Box plot
• Five-number summary of a distribution: Minimum,
Q1, Median, Q3, Maximum
• Boxplot
○ Data is represented with a box.
○ The ends of the box are at the first and third
quartiles, i.e., the height of the box is the IQR.
○ The median is marked by a line within the box.
○ Whiskers: two lines outside the box that extend to the
smallest and largest observations within 1.5 × IQR of the quartiles.
○ Outliers: points beyond a specified outlier
threshold, plotted individually.
Quantile plot
• Displays all of the data (allowing the user to assess both the overall behavior and unusual
occurrences).
• Plots quantile information of a univariate distribution
○ Data are sorted in increasing order. For a value x_i, f_i indicates that approximately f_i × 100% of the
data are less than or equal to x_i.
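The (f_i, x_i) pairs of a quantile plot can be computed as follows. This sketch assumes the common convention f_i = (i − 0.5) / n, which the slides do not state; the function name and data are illustrative.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs for a quantile plot.

    Assumes the convention f_i = (i - 0.5) / n for the i-th smallest value.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

points = quantile_plot_points([40, 10, 30, 20])  # hypothetical unit prices
for f, x in points:
    print(f, x)
```

Plotting x_i against f_i with any charting library then gives the quantile plot.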
Quantile-Quantile (Q-Q) plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles of another,
giving the points (x_i, y_i).
• Shows whether two variables follow the same distribution.
• The example shows the unit price of items sold at branch 1 vs. branch 2 for each quantile. Unit prices of
items sold at branch 1 tend to be lower than those at branch 2.
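A rough sketch of how Q-Q points might be computed by pairing matching quantiles of two samples. The branch prices are made up, and `qq_points` is a hypothetical helper built on `statistics.quantiles`.

```python
import statistics

def qq_points(xs, ys, n=4):
    """Pair the n-quantile cut points of xs with those of ys."""
    return list(zip(statistics.quantiles(xs, n=n), statistics.quantiles(ys, n=n)))

# Hypothetical unit prices at two branches.
branch1 = [40, 45, 50, 55, 60, 65, 70, 75]
branch2 = [50, 55, 60, 65, 70, 75, 80, 85]

points = qq_points(branch1, branch2)
print(points)

# Every point lies above the line y = x, so branch 1 prices tend to be lower.
print(all(y > x for x, y in points))
```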
Bar Chart vs Histogram
• Histogram: shows the distribution of a given variable by depicting the frequency or density of
observations that fall in certain ranges of values.
• A histogram differs from a bar chart as follows:
Attribute          Histogram                 Bar chart
Variable type      Continuous variables      Discrete variables
Gap between bars   Adjacent bars             Space-separated bars
Bar width          Could have varied width   Equal width
Bar order          Matters                   Does not matter
Histograms often tell more than boxplots
• The two histograms shown may have the same boxplot representation.
• They share the same values for min, Q1, median, Q3, and max, yet they have rather different data
distributions.
Scatter plot
• Presents bivariate numerical data to reveal clusters of points, outliers, correlations, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane.
Positive and negative correlation
• The left half of the figure shows positively linearly correlated data.
• The right half shows negatively linearly correlated data.
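Linear correlation can be quantified with the Pearson correlation coefficient, which the slides do not name; it is used here as one common choice. The data below are made up to show a positive, a negative, and a non-correlated case.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [1, 2, 3, 4, 5]
r_positive = pearson_r(xs, [2, 4, 6, 8, 10])  # y rises with x
r_negative = pearson_r(xs, [10, 8, 6, 4, 2])  # y falls as x rises
r_none = pearson_r(xs, [2, 1, 3, 1, 2])       # no linear trend

print(round(r_positive, 3), round(r_negative, 3), round(r_none, 3))
```

Values near +1 or −1 indicate strong positive or negative linear correlation, and values near 0 indicate no linear correlation.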
Non-correlated data
Summary
Here is a quick recap:
• We understood various charts and plots, such as box plots, histograms, quantile plots, Q-Q plots, and
scatter plots.
• We learned that a positive correlation describes the relationship between two variables that change
in the same direction, and a negative correlation describes the relationship between two variables
that change in opposite directions.
Hypothesis testing
Agenda
In this session, we will learn about:
• Hypothesis testing
• One-tailed and two-tailed tests
• Steps to perform hypothesis testing
Hypothesis testing
• Hypothesis testing is a scientific process of testing whether a hypothesis is plausible or not.
• A test includes a null hypothesis and an alternative hypothesis, for example:
○ Null hypothesis: x̄ = μ; alternative hypothesis: x̄ > μ
○ Null hypothesis: variable A and variable B are independent; alternative hypothesis: they are correlated
• The goal is to determine which hypothesis is more likely to be true at a given confidence level.
Hypothesis testing: comparison of means
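• One-sample Z test
○ Degrees of freedom: not applicable
○ Application: testing the difference between a sample mean, x̄, and a known population mean, μ
○ Assumptions: normal distribution, known population standard deviation σ
○ Test statistic: $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
• One-sample t test
○ Degrees of freedom: n − 1
○ Application: testing the difference between one sample mean, x̄, and a given mean, μ
○ Assumptions: normal distribution, population standard deviation σ is unknown
○ Test statistic: $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
• Two-sample t test
○ Degrees of freedom: n1 + n2 − 2
○ Application: testing the difference between two sample means when the population variances are unknown but considered equal
○ Assumptions: normal distribution
○ Test statistic: $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}$, with pooled variance $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
[Figure: rejection area]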
Hypothesis testing
• Paired t test
○ Degrees of freedom: n − 1
○ Application: testing two sample means when their respective population standard deviations are unknown but considered equal; data are recorded in pairs, and each pair has a difference, d
○ Assumptions: normal distribution, two dependent samples, always a two-tailed test
○ Test statistic: $t = \frac{\bar{d}}{S_d / \sqrt{n}}$, where S_d is the standard deviation of the differences
• One-way ANOVA
○ Degrees of freedom: n1 − 1 and n2 − 1
○ Application: testing the difference of three or more sample means
○ Assumptions: normal distribution
○ Test statistic: F, the ratio of the between-group mean square to the within-group mean square
Confidence level and significance level
• Significance level: the risk we are willing to take of rejecting the null hypothesis when it is actually true.
○ Typical: 5% or 1%
• Confidence level = 1 − significance level
○ Typical: 95% or 99%
• Example (from the figure): with x̄ = 110, μ = 100, and, say, Z = 2.5, the statistic lies in the rejection area Z > 1.645 of a one-tailed test at the 5% significance level.
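The critical value 1.645 and the decision for Z = 2.5 can be reproduced with the standard normal distribution. A minimal sketch for a one-tailed (right-tail) test at the 5% significance level; the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05                 # significance level
confidence_level = 1 - alpha

std_normal = NormalDist()    # standard normal: mean 0, standard deviation 1

# Critical value: the point with probability alpha to its right.
z_critical = std_normal.inv_cdf(1 - alpha)

z_observed = 2.5             # value used in the slide's example
p_value = 1 - std_normal.cdf(z_observed)
reject_null = z_observed > z_critical

print(round(z_critical, 3), round(p_value, 4), reject_null)
```

Comparing the observed Z with the critical value and comparing the p-value with alpha are two equivalent ways of reaching the same decision.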
One-tailed and two-tailed tests
● One-tailed test
1. Null hypothesis: x̄ = μ
2. Alternative hypothesis: x̄ > μ,
where μ is the hypothesized mean
● Two-tailed test
1. Null hypothesis: x̄ = μ
2. Alternative hypothesis: x̄ ≠ μ,
where μ is the hypothesized mean
Hypothesis testing: steps
• Steps in Hypothesis testing:
1. State the hypotheses (null and alternative)
2. Identify the test statistic and its probability distribution.
3. Specify the significance level
4. Collect the data and perform the calculations
5. Make the statistical decision
6. Make the business decision
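The steps can be walked through with a one-sample Z test. This is a sketch under stated assumptions: the function `one_sample_z_test` is hypothetical, and σ = 20 and n = 25 are made-up values chosen so that x̄ = 110 and μ = 100 give Z = 2.5.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05, tail="right"):
    """One-sample Z test; tail is 'right' (one-tailed) or 'two' (two-tailed)."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if tail == "right":
        p_value = 1 - std_normal.cdf(z)
    elif tail == "two":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("tail must be 'right' or 'two'")
    return z, p_value, p_value < alpha

# Steps 1-3: H0: mean = 100, H1: mean > 100; Z statistic; alpha = 0.05.
# Step 4: summary of the (hypothetical) collected data.
z, p_value, reject_null = one_sample_z_test(110, 100, sigma=20, n=25)

# Step 5: statistical decision.
print(z, round(p_value, 4), reject_null)

# The same data under a two-tailed alternative (mean != 100).
z2, p_two_tailed, reject_two_tailed = one_sample_z_test(110, 100, sigma=20, n=25, tail="two")
print(round(p_two_tailed, 4), reject_two_tailed)
```

Step 6, the business decision, depends on context and is not something the code can decide.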
Summary
Here is a quick recap:
• We discussed hypothesis testing, a scientific process of testing whether or not a hypothesis is plausible.
• We understood that one-tailed tests allow the testing of an effect in one direction, whereas two-tailed
tests allow the testing of an effect in two directions, positive and negative.
• We looked at various tests to check the null and alternative hypotheses, such as the one-sample Z test,
the two-sample t test, etc.
• We looked at the steps to conduct a hypothesis test.
Measuring data similarity and dissimilarity
Agenda
In this session, we will learn about:
● Measuring data similarity and dissimilarity
● Proximity measures for nominal, ordinal, and binary attributes
● Proximity measures for numerical attributes and normalization
● Compute dissimilarity with mixed type variables
● Cosine similarity
Similarity and Dissimilarity
● Similarity
○ Numerical measure of how alike two data objects are.
○ Value is higher when objects are more alike.
○ Often falls in the range [0,1].
● Dissimilarity (e.g., distance)
○ Numerical measure of how different two data objects are.
○ Lower when objects are more alike.
○ Minimum dissimilarity is often 0.
○ Upper limit varies.
● Proximity may refer to either similarity or dissimilarity.
Data matrix and (dis)similarity matrix
● Data matrix
○ n data points with p dimensions
○ Two modes: objects and features
● Distance/similarity matrix
○ n data points, but registers only the distance/similarity
○ Is often a symmetric matrix
○ Single mode: (dis)similarity
Proximity measure for nominal attributes
Nominal attributes can take two or more states/values, e.g., color can be red, yellow, blue, green, etc.
(generalization of a binary attribute).
Method 1: Simple matching
Observations ( cat1 and cat2 ) are described by nominal values of color, size, sleep time.
Objects   Color    Size     Sleep time
cat1      yellow   small    <5 hours
cat2      yellow   medium   5-8 hours
$$d(i, j) = \frac{p - m}{p}$$
d(i, j): distance between objects i and j
m: number of attributes with the same value in both objects
p: total number of attributes
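A minimal Python sketch of simple matching for the two cats, assuming the usual simple matching ratio d(i, j) = (p − m) / p built from the m and p defined above. The function name `simple_matching_distance` and the dictionary keys are hypothetical.

```python
def simple_matching_distance(obj_i, obj_j):
    """Simple matching: share of attributes whose values differ, (p - m) / p."""
    p = len(obj_i)                                              # total attributes
    m = sum(1 for attr in obj_i if obj_i[attr] == obj_j[attr])  # matching attributes
    return (p - m) / p


cat1 = {"color": "yellow", "size": "small", "sleep_time": "<5 hours"}
cat2 = {"color": "yellow", "size": "medium", "sleep_time": "5-8 hours"}

d = simple_matching_distance(cat1, cat2)
print(round(d, 2))  # 0.67
```

Only color matches, so m = 1 and p = 3, giving d(cat1, cat2) = 2/3.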
Proximity measure for binary attributes
Method 2: One-hot encoding
● In this method, nominal attributes are converted to binary attributes.
● Creating a new binary attribute for each nominal state is called “one-hot encoding”; the result is a binary attribute table like the one shown below.
● A proximity measure for binary attributes is then used to measure the similarity between the objects.
Objects   color-yellow   color-…   size-small   size-medium   sleep time <5 hours   sleep time 5-8 hours
cat1      1              0         1            0             1                     0
cat2      1              0         0            1             0                     1
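The encoding step can be sketched in a few lines of Python. The helper `one_hot_encode` is a hypothetical name, and as a simplifying assumption it creates columns only for states that actually occur in the data, so the unused color columns from the table above do not appear.

```python
def one_hot_encode(objects):
    """Turn dicts of nominal values into binary vectors, one column per (attribute, state)."""
    columns = []
    for obj in objects:
        for attr, value in obj.items():
            if (attr, value) not in columns:
                columns.append((attr, value))
    vectors = [
        [1 if obj.get(attr) == value else 0 for attr, value in columns]
        for obj in objects
    ]
    return columns, vectors


cat1 = {"color": "yellow", "size": "small", "sleep_time": "<5 hours"}
cat2 = {"color": "yellow", "size": "medium", "sleep_time": "5-8 hours"}

columns, vectors = one_hot_encode([cat1, cat2])
print(columns)
print(vectors)  # [[1, 1, 1, 0, 0], [1, 0, 0, 1, 1]]
```

The column order follows first appearance in the data, so it differs from the table's layout, but each object still has exactly one 1 per original nominal attribute.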
Proximity measure for binary attributes
The number of attributes with the same or different values for two observations (e.g., cat1 and cat2 in the previous table) is counted from the binary attribute table, giving the contingency table shown below:
                  Object j
                  1      0      sum
Object i    1     q      r      q+r
            0     s      t      s+t
            sum   q+s    r+t    p
q is the number of attributes for which both objects have the value 1
r is the number of attributes for which object i is 1 and object j is 0
s is the number of attributes for which object i is 0 and object j is 1
t is the number of attributes for which both objects have the value 0
p is the total number of attributes, p = q + r + s + t
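As a small illustration, the four counts can be computed directly from two binary vectors. The function name `contingency_counts` is an assumption; the vectors are the cat1 and cat2 rows of the one-hot table from the earlier slide.

```python
def contingency_counts(i, j):
    """Return (q, r, s, t) for two equal-length binary vectors i and j."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # i is 1, j is 0
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # i is 0, j is 1
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)  # both 0
    return q, r, s, t


cat1 = [1, 0, 1, 0, 1, 0]
cat2 = [1, 0, 0, 1, 0, 1]

q, r, s, t = contingency_counts(cat1, cat2)
print(q, r, s, t)  # 1 2 2 1
```

The counts add up to p = 6, the number of binary attributes in the table.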
Proximity measure for binary attributes
Using the contingency table created previously, the distance and similarity measures are calculated as follows:
Distance measure for symmetric binary variables:
$$d(i, j) = \frac{r + s}{q + r + s + t}$$
Distance measure for asymmetric binary variables:
$$d(i, j) = \frac{r + s}{q + r + s}$$
Jaccard coefficient (similarity measure for asymmetric binary variables):
$$\mathrm{sim}_{\mathrm{Jaccard}}(i, j) = \frac{q}{q + r + s}$$
Compute dissimilarity using binary variables
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
● Gender is a nominal attribute.
● The remaining attributes are asymmetric binary.
● Let the values Y and P be 1, and the value N be 0.
● Consider only the asymmetric binary attributes when computing the distances.
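The pairwise distances for this table can be worked out with a short Python sketch. It assumes the usual asymmetric binary distance d(i, j) = (r + s) / (q + r + s), which ignores attributes where both objects are 0; the helper names `encode` and `asymmetric_binary_distance` are hypothetical.

```python
def encode(values):
    # Y and P map to 1, N maps to 0, as stated on the slide.
    return [1 if v in ("Y", "P") else 0 for v in values]


def asymmetric_binary_distance(i, j):
    """(r + s) / (q + r + s); assumes at least one attribute is 1 in i or j."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    mismatches = sum(1 for a, b in zip(i, j) if a != b)  # r + s
    return mismatches / (q + mismatches)


# Fever, Cough, Test-1 .. Test-4 (Gender is left out because it is not asymmetric binary).
patients = {
    "Jack": encode(["Y", "N", "P", "N", "N", "N"]),
    "Mary": encode(["Y", "N", "P", "N", "P", "N"]),
    "Jim": encode(["Y", "P", "N", "N", "N", "N"]),
}

d_jack_mary = asymmetric_binary_distance(patients["Jack"], patients["Mary"])
d_jack_jim = asymmetric_binary_distance(patients["Jack"], patients["Jim"])
d_jim_mary = asymmetric_binary_distance(patients["Jim"], patients["Mary"])

print(round(d_jack_mary, 2))  # 0.33
print(round(d_jack_jim, 2))   # 0.67
print(round(d_jim_mary, 2))   # 0.75
```

Under this measure Jack and Mary are the most similar pair, and Jim and Mary the least similar.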
Distance on numeric data: Minkowski distance
● Minkowski distance: A popular distance measure
$$d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h}$$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
● Properties
○ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
○ d(i, j) = d(j, i) (Symmetry)
○ d(i, j) ≤ d(i, k) + d(k, j) (Triangle inequality)
● A distance that satisfies these properties is a metric.
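A minimal Python sketch of the Minkowski distance, with a spot check of the three properties on a few points. It assumes the standard definition (the sum of |xif − xjf|^h over all attributes, raised to the power 1/h) and an order h ≥ 1; the function name and the sample points are hypothetical.

```python
def minkowski_distance(x, y, h):
    """Minkowski distance of order h; h >= 1 is assumed so the metric properties hold."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


# Hypothetical 2-dimensional objects.
i, j, k = (1, 2), (3, 5), (2, 0)

for h in (1, 2, 3):
    d_ij = minkowski_distance(i, j, h)
    d_ji = minkowski_distance(j, i, h)
    d_ik = minkowski_distance(i, k, h)
    d_kj = minkowski_distance(k, j, h)
    # Spot checks: positive definiteness, symmetry, triangle inequality.
    print(h, d_ij > 0, d_ij == d_ji, d_ij <= d_ik + d_kj)
```

A check on three points does not prove that a measure is a metric; it only shows what each property means numerically.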
Special cases of Minkowski distance
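● h = 1: Manhattan (city block, L1 norm) distance
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
○ E.g., the Hamming distance: the number of bits that are different between two binary vectors
● h = 2: Euclidean (L2 norm) distance
$$d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}$$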
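The two special cases can be sketched as follows, including the observation that the Manhattan distance between binary vectors is the Hamming distance. The function names, sample points, and bit vectors are illustrative assumptions.

```python
import math


def manhattan(x, y):
    # h = 1: sum of absolute differences.
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    # h = 2: square root of the sum of squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


x, y = (1, 2), (3, 5)
print(manhattan(x, y))            # 5
print(round(euclidean(x, y), 3))  # 3.606

# On binary vectors, Manhattan distance counts the differing bits (Hamming distance).
bits_a = [1, 0, 1, 1, 0]
bits_b = [1, 1, 0, 1, 0]
print(manhattan(bits_a, bits_b))  # 2
```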
Special cases of Minkowski distance (Contd.)
● Chebyshev distance:
○ When h → ∞: “supremum” (Lmax norm, L∞ norm) distance.
■ This is the maximum difference between any component (attribute) of the vectors:
$$d(i, j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_{f} |x_{if} - x_{jf}|$$
● When h → -∞:
○ This is the minimum difference between any component (attribute) of the vectors:
$$d(i, j) = \min_{f} |x_{if} - x_{jf}|$$
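To see why the supremum distance is the limiting case, the sketch below compares the largest per-attribute difference with Minkowski distances of increasing order. The function names and the two sample points are assumptions made for illustration.

```python
def chebyshev(x, y):
    # Supremum distance: the largest absolute difference over all attributes.
    return max(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, h):
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


x, y = (1, 2), (3, 5)
print(chebyshev(x, y))  # 3

# As h grows, the Minkowski distance approaches the supremum distance from above.
for h in (1, 2, 10, 50):
    print(h, minkowski(x, y, h))
```

With h = 50 the value is already within a tiny fraction of 3, because the largest difference dominates the sum once it is raised to a high power.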
Example: Minkowski distance
[Figure: example comparing the Manhattan (L1), Euclidean (L2), and supremum distances]
Normalization of numerical values
● Values measured on different scales cannot be compared directly.
Student A: SAT = 1800
Student B: ACT = 24
Which student performed better relative to other test-takers?
                     SAT    ACT
Mean                 1500   21
Standard deviation   300    5
● Normalization is widely used with multidimensional datasets involving different scales: clustering, multidimensional scaling, principal component analysis, etc.
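One way to answer the question is to express each score as its distance from the test's mean in units of that test's standard deviation (the z-score). The sketch below uses the means and standard deviations from the table; the function name `z_score` is an assumption.

```python
def z_score(x, mean, std):
    """Distance of x from the mean, in units of the standard deviation."""
    return (x - mean) / std


z_a = z_score(1800, mean=1500, std=300)  # Student A, SAT
z_b = z_score(24, mean=21, std=5)        # Student B, ACT

print(z_a)  # 1.0
print(z_b)  # 0.6
```

Student A is 1.0 standard deviations above the SAT mean and Student B is 0.6 above the ACT mean, so Student A performed better relative to other test-takers.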
Standardizing numeric data
● Z-score
$$z = \frac{x - \mu}{\sigma}$$
○ x: raw score to be standardized, μ: mean of the population, σ: standard deviation of the population
○ The z-score is the distance between the raw score and the population mean in units of the standard deviation.
○ It is negative when the raw score is below the mean and positive when it is above.
● An alternative way: calculate the mean absolute deviation (MAD),
$$s_f = \frac{1}{n}\left( |x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f| \right)$$
where
$$m_f = \frac{1}{n}\left( x_{1f} + x_{2f} + \cdots + x_{nf} \right)$$
Standardized measure:
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
● MAD is more robust to outliers than the standard deviation because, in the former, the differences from the mean are not squared.
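The effect of an outlier on the two scale measures can be checked with a short sketch. The sample values and function names are hypothetical, MAD is taken here to mean the average absolute difference from the mean, and the population standard deviation from Python's `statistics` module is used for comparison.

```python
import statistics


def mean_absolute_deviation(values):
    """Average absolute difference between each value and the mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)


# Hypothetical attribute values with one obvious outlier (100).
values = [10, 12, 11, 13, 100]

mean = sum(values) / len(values)
mad = mean_absolute_deviation(values)
std = statistics.pstdev(values)

z_outlier_mad = (100 - mean) / mad
z_outlier_std = (100 - mean) / std

print(round(mad, 2))  # 28.32
print(round(z_outlier_mad, 2), round(z_outlier_std, 2))
```

Because the squared term inflates the standard deviation more than the MAD, the outlier's standardized score stays larger under MAD, so the outlier remains easier to spot.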
Proximity measure for ordinal variables
● Order is important, e.g., rank
Math grade: A, B, C, D, E.
● Map ordinal values to values between 0 and 1 (to interval-scaled)
1. Replace each xif by its rank rif ∈ {1, …, Mf}, where Mf is the number of ordered states of variable f
2. Map the range of each variable onto [0, 1] by replacing the i-th value in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
3. Compute the dissimilarity using methods for numerical variables
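The three steps can be sketched for the math grade example. The sketch assumes the usual rank-to-[0, 1] mapping z = (rank − 1) / (M − 1), where M is the number of states, and it assumes that E is the lowest grade (rank 1) and A the highest (rank 5); the function name is hypothetical.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal value to [0, 1] using its rank among the ordered states."""
    rank = ordered_states.index(value) + 1  # step 1: rank in 1..M
    m = len(ordered_states)
    return (rank - 1) / (m - 1)             # step 2: rescale onto [0, 1]


# Assumed order, from lowest to highest grade.
grades = ["E", "D", "C", "B", "A"]

z_a = ordinal_to_unit_interval("A", grades)
z_c = ordinal_to_unit_interval("C", grades)

# Step 3: treat the mapped values as numeric, e.g., use the absolute difference.
print(z_a, z_c, abs(z_a - z_c))  # 1.0 0.5 0.5
```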
Dissimilarity between objects with mixed type attributes
● A database may contain all attribute types.
○ Nominal, symmetric binary, asymmetric binary, numeric, ordinal
● One may use a weighted formula to combine their effects.
● Distance between objects i and j over p features/attributes f:
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
● The indicator δij(f) is
○ 0 if f is missing for either object i or j, or if f for i and j are both 0 and f is an asymmetric binary attribute
○ 1 otherwise
● The per-attribute distance dij(f) depends on the type of f:
○ f is binary or nominal: dij(f) = 0 if xif = xjf, or dij(f) = 1 otherwise
○ f is numeric: use the normalized distance
○ f is ordinal: convert to numeric
Compute dissimilarity with mixed variables
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
● Gender is a nominal attribute; others are asymmetric binary.
● Let the values Y and P be 1, and the value N be 0.
Compute dissimilarity with mixed variables (Contd.)
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
Considering all attributes:
d(Jack, Mary) = (1*1 + 0*1 + 0*0 + 0*1 + 0*0 + 1*1 + 0*0) / (1+1+1+1) = 2/4 = 0.5
Each term in the numerator is d × δ for one attribute; the denominator sums the δ values that equal 1.
Gender: nominal: d=1, δ=1
Fever: asymmetric binary: d=0, δ=1
Cough: asymmetric binary, both 0: d=0, δ=0
Test-1: asymmetric binary: d=0, δ=1
Test-2: asymmetric binary, both 0: d=0, δ=0
Test-3: asymmetric binary: d=1, δ=1
Test-4: asymmetric binary, both 0: d=0, δ=0
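The same calculation can be reproduced with a small Python sketch of the weighted formula. It covers only nominal and binary attributes, which is all this example needs; numeric and ordinal attributes are left out as a simplifying assumption. The function name `mixed_dissimilarity` and the type labels are hypothetical.

```python
def mixed_dissimilarity(obj_i, obj_j, kinds):
    """Sum of delta * d over attributes, divided by the sum of delta."""
    numerator, denominator = 0, 0
    for f, kind in kinds.items():
        xi, xj = obj_i[f], obj_j[f]
        if xi is None or xj is None:
            continue  # delta = 0: value missing for either object
        if kind == "asymmetric_binary" and xi == 0 and xj == 0:
            continue  # delta = 0: both 0 on an asymmetric binary attribute
        numerator += 0 if xi == xj else 1  # d = 0 on a match, 1 otherwise
        denominator += 1                   # delta = 1
    if denominator == 0:
        raise ValueError("no comparable attributes")
    return numerator / denominator


tests = ["fever", "cough", "test1", "test2", "test3", "test4"]
kinds = {"gender": "nominal", **{name: "asymmetric_binary" for name in tests}}

# Y and P are encoded as 1, N as 0.
jack = {"gender": "M", "fever": 1, "cough": 0, "test1": 1, "test2": 0, "test3": 0, "test4": 0}
mary = {"gender": "F", "fever": 1, "cough": 0, "test1": 1, "test2": 0, "test3": 1, "test4": 0}

d_jack_mary = mixed_dissimilarity(jack, mary, kinds)
print(d_jack_mary)  # 0.5
```

Four attributes contribute to the denominator (Gender, Fever, Test-1, Test-3) and two of them differ (Gender, Test-3), which gives 2/4.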
Cosine similarity: similarity between vectors of numerical values
• A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
Document    Team   Coach   Hockey   Baseball   Soccer   Penalty   Score   Win   Loss   Season
Document1   5      0       3        0          2        0         0       2     0      0
Document2   3      0       2        0          1        1         0       1     0      1
Document3   0      7       0        2          1        0         0       3     0      0
Document4   0      1       0        0          1        2         2       0     3      0
• The angle between any two vectors (documents) can be used as a measure of the similarity between the two documents.
Similarity between vectors of numerical values (Contd.)
• If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 ∙ d2) / (||d1|| × ||d2||),
where ∙ indicates the vector dot product and ||d|| is the length of vector d.
• Cosine similarity lies in [0, 1] for vectors with non-negative components, such as term-frequency vectors; in general it lies in [-1, 1].
Example: Cosine similarity
• cos(d1, d2) = (d1 ∙ d2) / (||d1|| × ||d2||),
where ∙ indicates the vector dot product and ||d|| is the length of vector d.
• Ex: Find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 ∙ d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) = 0.94
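The worked example can be checked with a few lines of Python. The function name `cosine_similarity` is an assumption; the vectors are the term-frequency rows for documents 1 and 2.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))  # 0.94
```

A value close to 1 means the two documents use the listed terms in similar proportions, regardless of document length.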
Summary
Here is a quick recap:
● We learned how to measure data similarity and dissimilarity.
● We looked into distance measures for numeric data, such as the Minkowski distance, and into standardization.
● We learned proximity measures for nominal, ordinal, and binary attributes.
● We also learned how to compute dissimilarity with mixed variables.
● We discussed cosine similarity with an example.
Learning Outcomes
Coming to the end of this module, you should now be able to:
• Differentiate between the types of attributes: nominal, binary, ordinal, interval-scaled, and ratio-scaled.
• Evaluate basic descriptive statistics of a dataset: central tendency and dispersion.
• Illustrate and interpret graphic plots that display descriptive statistics.
• Summarize statistical hypothesis testing.
• Evaluate object similarity and dissimilarity in mixed-type datasets.
• Summarize Cosine similarity using an example.