Basic Probability and Statistics

Topic 2  Basic Probability and Statistics DSA1101 Introduction to Data Science 1 / 47


1 Introduction

2 Single Quantitative Variable Exploration
    Numerical Summaries
    Graphical Summaries

3 Association Between Two Variables
    Two Quantitative Variables
    One Categorical and One Quantitative Variable
    Two Categorical Variables



Types of Data



Descriptive Statistics

There are two main ways to describe data: numerical summaries and graphical summaries.

For one variable, both numerical and graphical summaries will be covered.

For two variables, the association between them will be covered.



Numerical and Graphical Summaries

Numerical summaries (descriptive measures): number of observations (sample size), location, variability and other measures.

Graphical summaries: histogram, boxplot, QQ plot (for checking normality of a dataset), scatter plot for bivariate data.



An Example: Yearly Sales

> sales <- read.csv("C:/Data/yearly_sales.csv")

The function head() displays the first few records in the data set.

> head(sales)
  cust_id sales_total num_of_orders gender
1  100001      800.64             3      F
2  100002      217.53             3      F
3  100003       74.58             2      M
4  100004      498.60             3      M
5  100005      723.11             4      F
6  100006       69.43             2      F
> total = sales$sales_total



Summary of the Center

A summary of the center of the data should include the mean, median and mode.

For the total sales, we have

> n = length(total); n
[1] 10000
> summary(total)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  30.02   80.29  151.65  249.46  295.50 7606.09
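
The output of summary() reports the mean and median but not the mode. As a minimal sketch, a mode can be computed by counting values. The helper name sample_mode is hypothetical (base R has no built-in sample mode function), and the vector holds only the six num_of_orders values shown by head(sales), not the full data set.

```r
# Hypothetical helper: returns the most frequent value(s) of a vector.
sample_mode <- function(x) {
  ux <- unique(x)                    # distinct values
  counts <- tabulate(match(x, ux))   # how often each distinct value occurs
  ux[counts == max(counts)]          # all values that reach the top count
}

# num_of_orders for the six rows printed by head(sales)
orders_head <- c(3, 3, 2, 3, 4, 2)
mode_orders <- sample_mode(orders_head)
mode_orders   # 3
```

For a continuous variable such as sales_total, almost every value is distinct, so a mode computed this way is rarely informative without first grouping the values into bins.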



Summary of the Variability

> range(total)
[1] 30.02 7606.09

> var(total)
[1] 101793.4

> sd(total)
[1] 319.0508

> IQR(total)
[1] 215.21
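
The measures above are related to one another. The sketch below checks those relationships on the six sales_total values shown by head(sales); the variable names are chosen for illustration only, and the numbers will differ from the full-data output above.

```r
# sales_total for the six rows printed by head(sales)
x <- c(800.64, 217.53, 74.58, 498.60, 723.11, 69.43)

r <- range(x)                      # minimum and maximum
v <- var(x)                        # sample variance (divides by n - 1)
s <- sd(x)                         # sample standard deviation
q <- quantile(x, c(0.25, 0.75))    # first and third quartiles
iqr <- IQR(x)                      # interquartile range

# Variance computed directly from its definition
v_manual <- sum((x - mean(x))^2) / (length(x) - 1)
```

Here sd(x) equals sqrt(var(x)), and IQR(x) equals the third quartile minus the first quartile.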



A Note on Numerical Summaries

For a sample, if the mean is the same or approximately the same as the median, this suggests that the sample is close to symmetric.

The mean is sensitive to outliers, while the median is not.

When the mean is much larger than the median, the sample is right skewed; when the mean is much smaller than the median, the sample is left skewed.
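
A small made-up example illustrates the sensitivity of the mean. The two vectors below are invented for illustration and are not taken from the sales data.

```r
x     <- c(2, 4, 6, 8, 10)     # symmetric sample
x_out <- c(2, 4, 6, 8, 1000)   # same sample with one large outlier

c(mean = mean(x), median = median(x))           # mean 6,   median 6
c(mean = mean(x_out), median = median(x_out))   # mean 204, median 6
```

Replacing a single value pulls the mean far to the right while the median stays put, which is the pattern seen in a right-skewed sample.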



Numerical Summaries Are Not Enough
All three samples shown on this slide have a sample mean of 0 and a sample variance of 1.

No matter how many of the summary measures we report, nothing beats a picture.
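
The point can be reproduced with simulated data. The sketch below builds three samples of different shape and standardizes each to mean 0 and variance 1. The shapes (normal, exponential, two-component mixture) are assumptions made for illustration; they are not necessarily the samples pictured on the slide.

```r
set.seed(1)

# Rescale a sample to have mean 0 and standard deviation 1
standardize <- function(x) (x - mean(x)) / sd(x)

n <- 1000
samples <- list(
  symmetric = standardize(rnorm(n)),
  skewed    = standardize(rexp(n)),
  bimodal   = standardize(c(rnorm(n / 2, mean = -3), rnorm(n / 2, mean = 3)))
)

sapply(samples, mean)   # all essentially 0
sapply(samples, var)    # all 1

# The histograms reveal what the two summaries hide
op <- par(mfrow = c(1, 3))
for (nm in names(samples)) hist(samples[[nm]], main = nm, xlab = "value")
par(op)
```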



Histogram and Density Plot

A histogram is a graph that uses bars to portray the frequencies or relative frequencies of
the possible outcomes for a quantitative variable.

Density plots can be thought of as plots of smoothed histograms.



Histogram

What do we look for in a histogram?

▶ The overall pattern. Do the data cluster together, or is there a gap such that one or more observations deviate from the rest?

▶ Do the data have a single mound? This is known as a unimodal distribution. Data with two mounds are known as bimodal, and data with many mounds are referred to as multimodal.

▶ Is the distribution symmetric or skewed? Are there any suspected outliers?



A Histogram With Suspected Outliers

This histogram is unimodal, but it has suspected outliers on the right.



Unimodal and Bimodal Histograms



Skewness of Histograms

Income is typically right-skewed.

IQ is typically symmetric.

Life-span is typically left-skewed.


Histogram and Density Plot in R
There are many ways to plot histograms in R:

The hist function in the base graphics package;

truehist in package MASS;

histogram in package lattice;

geom_histogram in package ggplot2.



Histogram and Normal Density Plot in R

> hist(total, freq = FALSE, main = "Histogram of Total Sales",
+      xlab = "total sales", ylab = "Density", col = "grey")

The histogram is highly right skewed.
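
A density curve can be drawn on top of a histogram once the bars are on the density scale (freq = FALSE). The sketch below is one way to do this, not necessarily the code behind the slide's figure. It uses simulated right-skewed data (total_sim, a hypothetical stand-in for the sales totals) so that it runs without the CSV file.

```r
set.seed(42)
# Hypothetical stand-in for the sales totals: simulated right-skewed values
total_sim <- rlnorm(1000, meanlog = 5, sdlog = 1)

# Histogram on the density scale, so that the bar areas sum to 1
h <- hist(total_sim, freq = FALSE, main = "Histogram of simulated sales",
          xlab = "total sales", ylab = "Density", col = "grey")

# Smoothed histogram (kernel density estimate)
lines(density(total_sim), col = "red")

# Normal density with the same mean and standard deviation, for comparison
curve(dnorm(x, mean = mean(total_sim), sd = sd(total_sim)),
      add = TRUE, col = "blue", lty = 2)
```

For right-skewed data the normal curve fits poorly: it puts mass below the smallest observation and misses the long right tail.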



Boxplots

Boxplots provide a skeletal representation of a distribution, and they are very well suited
for showing distributions for multiple variables.

A boxplot helps us to identify the median, the lower and upper quartiles, the IQR, and any outliers.



Boxplot



Boxplots in R
The code is:

> boxplot(total, xlab = "Total Sales", col = "blue")

The median is very low relative to the range of the data (about 152, from the summary of total).

The boxplot shows many outliers and extreme outliers.

If the sample is unimodal, then the distribution is highly right skewed.
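
The points that a boxplot draws individually can be listed with boxplot.stats(), which by default flags values more than 1.5 times the box length beyond the hinges. The sketch below uses a small invented vector; treating coef = 3 as the cutoff for "extreme" outliers is a common convention assumed here, not something stated on the slide.

```r
x <- c(1:20, 100)        # invented data with one very large value

bs <- boxplot.stats(x)   # coef = 1.5 by default
bs$stats                 # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out                   # values drawn as individual points: 100

# Assumed convention: points beyond 3 box lengths count as extreme outliers
extreme <- boxplot.stats(x, coef = 3)$out
```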



QQ Plots

The purpose of a QQ plot of a sample is to see whether the sample follows (approximately) a normal distribution.

A QQ plot plots the standardized sample quantiles against the theoretical quantiles of a N(0, 1) distribution.

From the points on the plot, we can usually tell whether the sample has longer or shorter tails than normal.



QQ plots (1)

The figure on the left shows data with both tails longer than normal.

The figure on the right shows data with both tails shorter than normal.



QQ plots (2)

The figure on the left shows data whose left tail is longer than normal but whose right tail is shorter than normal.

The figure on the right shows data with both tails close to normal.


QQ Plots in R
The code is:
> qqnorm(total, main = "QQ Plot", pch = 20)
> qqline(total, col = "red")

The QQ plot of the sample has the right tail much longer than normal, while the left tail is much shorter than normal.
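
To make the construction concrete, the sketch below builds a QQ plot by hand and compares it with qqnorm(). The sample is simulated (exponential, hence right skewed) and is an assumption made for illustration; it is not the sales data.

```r
set.seed(7)
x <- rexp(200)                          # simulated right-skewed sample

z    <- sort((x - mean(x)) / sd(x))     # standardized sample quantiles
theo <- qnorm(ppoints(length(x)))       # theoretical N(0, 1) quantiles

plot(theo, z, pch = 20, main = "QQ plot built by hand",
     xlab = "Theoretical quantiles", ylab = "Standardized sample quantiles")
abline(0, 1, col = "red")               # points fall near this line if the data are normal

# qqnorm() uses the same theoretical quantiles
q <- qqnorm(x, plot.it = FALSE)
```

qqnorm() plots the raw sample values on the vertical axis instead of standardized ones; this changes the scale of the plot but not its shape.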



Quantifying the Association: Correlation Value

Let X and Y be two features measured on a set of n points.

The correlation of the two is defined as:

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{s_X} \right) \left( \frac{Y_i - \bar{Y}}{s_Y} \right) $$

where $\bar{X}$ and $\bar{Y}$ are the sample means, and $s_X$ and $s_Y$ are the sample standard deviations of the two features.

r is always between -1 and 1.
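
The definition can be checked numerically against R's built-in cor(). The sketch below uses only the six rows shown by head(sales), so the value it produces is not the correlation for the full data set.

```r
# Six rows printed by head(sales)
x <- c(3, 3, 2, 3, 4, 2)                               # num_of_orders
y <- c(800.64, 217.53, 74.58, 498.60, 723.11, 69.43)   # sales_total
n <- length(x)

# Correlation from the definition: average product of standardized values
r_manual <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)

# Built-in Pearson correlation
r_builtin <- cor(x, y)
```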



Correlation Value

A positive value for r indicates a positive association and a negative value of r indicates a
negative association.

> order = sales$num_of_orders


> cor(total, order)
[1] 0.7508015



Visualizing the Association: Scatterplots

A scatterplot helps to visualize the association between two quantitative features.

What to look for in a scatterplot:

Is there any (possible) relationship between the 2 variables?

If yes, is the association positive or negative?

If there is association, is it linear or non-linear type?

Are some observations unusual, departing from the overall trend?



Scatterplots in R

> plot(order,total, pch = 20, col = "darkblue")



Boxplots of Multiple Groups

The categorical variable cancer has two categories: male and female. The variable Age is quantitative. We would like to check whether there is any relationship between these two variables.



Boxplots of Multiple Groups in R

> attach(sales)
> boxplot(total ~ gender)

There is no obvious difference in total sales between male and female customers. The medians of the two groups are similar, and the IQRs are about the same.
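
The comparison read off the boxplots can also be made numerically by computing the median and IQR within each group. The sketch below uses only the six rows shown by head(sales), held in a data frame named sales_head (a name chosen for illustration), and passes the data frame through the data argument instead of calling attach().

```r
# Six rows printed by head(sales)
sales_head <- data.frame(
  sales_total = c(800.64, 217.53, 74.58, 498.60, 723.11, 69.43),
  gender      = c("F", "F", "M", "M", "F", "F")
)

# Median and IQR of sales within each gender
med_by_gender <- tapply(sales_head$sales_total, sales_head$gender, median)
iqr_by_gender <- tapply(sales_head$sales_total, sales_head$gender, IQR)

# Side-by-side boxplots without attach()
boxplot(sales_total ~ gender, data = sales_head,
        xlab = "gender", ylab = "total sales")
```

With only six rows the two groups look quite different; the conclusion above refers to the full data set of 10000 customers.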



Association of 3 Variables

Can you figure out a way to visualize the association of the three features: total sales, number of orders and the gender of the customers?
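
One possible answer, offered as a sketch and not as the intended solution, is a scatterplot of the two quantitative features with the points colored by gender. The data frame d holds only the six rows shown by head(sales), and the colors are arbitrary choices.

```r
# Six rows printed by head(sales)
d <- data.frame(
  num_of_orders = c(3, 3, 2, 3, 4, 2),
  sales_total   = c(800.64, 217.53, 74.58, 498.60, 723.11, 69.43),
  gender        = c("F", "F", "M", "M", "F", "F")
)

# One color per gender, looked up by name
cols <- c(F = "red", M = "darkblue")
point_cols <- cols[d$gender]

plot(d$num_of_orders, d$sales_total, pch = 20, col = point_cols,
     xlab = "number of orders", ylab = "total sales")
legend("topleft", legend = names(cols), col = cols, pch = 20)
```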



Summary of a Categorical Variable

For a single categorical variable, we can use a frequency table (which can also give proportions or percentages) as a numerical summary.

The category with the highest frequency is reported as the modal category.

Common graphical displays for a categorical variable are the bar plot and the pie chart.



Barplot and Pie Chart
> count = table(gender)
> count # frequency table
gender
   F    M
5035 4965
> barplot(count)
> pie(count)



Two Categorical Variables

A contingency table is often used to summarize two categorical variables.

The odds ratio is useful too.



Two Categorical Variables

Categorize the number of orders into two sizes: small (at most 5 orders) and large (more than 5 orders).

> order.size = ifelse(order<=5, "small", "large")


> table(order.size)
order.size
large small
  324  9676

Contingency table of frequency

> table = table(gender,order.size);table
      order.size
gender large small
     F   142  4893
     M   182  4783



Contingency Tables

Contingency table of joint proportion

> prop.table(table)
      order.size
gender  large  small
     F 0.0142 0.4893
     M 0.0182 0.4783



Contingency Tables

Contingency table of proportion by gender

> tab = prop.table(table, "gender") # proportion by gender
> tab
      order.size
gender      large      small
     F 0.02820258 0.97179742
     M 0.03665660 0.96334340

Among female customers, 2.82% have a large number of orders, while among male customers 3.67% do.
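
The joint and by-gender proportions can be reproduced from the four counts in the contingency table alone, without the CSV file. The sketch below rebuilds the table from those counts; the object names are chosen for illustration and avoid reusing the name table, which is also the name of the base R function.

```r
# Counts taken from the contingency table of gender by order.size
counts <- matrix(c(142, 182, 4893, 4783), nrow = 2,
                 dimnames = list(gender = c("F", "M"),
                                 order.size = c("large", "small")))
tab_counts <- as.table(counts)

joint     <- prop.table(tab_counts)               # each cell divided by the grand total
by_gender <- prop.table(tab_counts, margin = 1)   # each cell divided by its row total

round(by_gender, 4)
```

Each row of by_gender sums to 1, whereas in joint all four cells together sum to 1.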



Odds of Success

For a probability of success π, the odds of success are defined as odds = π/(1 − π).

If we treat having a large order as a success, then for the female group the odds of success, i.e. the odds of a large order, are 0.029.

> tab["F", "large"]/(1 - tab["F", "large"])
[1] 0.02902105

For the male group, the odds of having a large order are 0.038.

> tab["M", "large"]/(1 - tab["M", "large"])
[1] 0.03805143
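
The same odds can be obtained directly from the counts in the contingency table. In the sketch below, odds is a hypothetical helper written for illustration; the counts are those reported above.

```r
# Hypothetical helper: odds of success for a probability p
odds <- function(p) p / (1 - p)

# Proportion of customers with large orders, by gender (from the counts)
p_large <- c(F = 142 / 5035, M = 182 / 4965)

odds_large <- odds(p_large)
round(odds_large, 3)   # F 0.029, M 0.038
```

Because π/(1 − π) is the number of successes divided by the number of failures, the odds for the female group also equal 142/4893.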



Odds Ratio

The odds ratio is the ratio of two odds of success: here, the odds of a large order in the female group (0.029) divided by the odds of a large order in the male group (0.038).

$$ \mathrm{OR} = \frac{0.029}{0.038} \approx 0.76 $$

What does this value mean?
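
As a sketch of how to compute and read this value, the odds ratio can be calculated from the four counts in the contingency table. The variable names are chosen for illustration.

```r
# Odds of a large order within each gender, from the counts
odds_f <- 142 / 4893   # female: large / small
odds_m <- 182 / 4783   # male:   large / small

or_large <- odds_f / odds_m
round(or_large, 2)     # 0.76

# Equivalent cross-product form
or_cross <- (142 * 4783) / (4893 * 182)
```

An odds ratio below 1 means the odds are lower in the first group: the odds of a large order among female customers are about 0.76 times the odds among male customers. An odds ratio of 1 would indicate no association between gender and order size.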
