Topic2 Basic Prob Stats

Uploaded by

Bui Xuan Phong

Basic Probability and Statistics

Topic 2 Basic Probability and Statistics DSA1101 Introduction to Data Science 1 / 47

1 Introduction

2 Single Quantitative Variable Exploration

Numerical Summaries
Graphical Summaries

3 Association Between Two Variables

Two Quantitative Variables
One Categorical and One Quantitative Variable
Two Categorical Variables

Types of Data

Descriptive Statistics

There are two major ways of describing data: numerical summaries and graphical summaries.

For one variable: numerical and graphical summaries will be covered.

For two variables: the association between the two variables will be covered.

Numerical and Graphical Summaries

Numerical summaries / descriptive measures: number of observations (sample size), location, variability and other measures.

Graphical summaries: histogram, boxplot, QQ plot (for checking normality of a dataset), scatter plot for bivariate data.

An Example: Yearly Sales

> sales <- read.csv("C:/Data/yearly_sales.csv")

The function head() displays the first few records in the data set:

> head(sales)
cust_id sales_total num_of_orders gender
1 100001 800.64 3 F
2 100002 217.53 3 F
3 100003 74.58 2 M
4 100004 498.60 3 M
5 100005 723.11 4 F
6 100006 69.43 2 F
> total = sales$sales_total

Summary of the Center

A summary of the center of the data should include the mean, the median and the mode.

For the total sales, we have:

> n = length(total); n
[1] 10000
> summary(total)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  30.02   80.29  151.65  249.46  295.50 7606.09
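
summary() reports the mean and the median but not the mode, and base R's mode() returns the storage type of an object rather than the most frequent value. The sketch below is one possible way to get the statistical mode; the helper name stat_mode and the vector orders are made up for illustration. For a continuous variable such as sales_total, where most values occur only once, a mode computed this way is rarely informative.

```r
# Hypothetical helper (not part of base R): most frequent value(s) of a vector.
stat_mode <- function(x) {
  counts <- table(x)                                  # frequency of each distinct value
  modes <- names(counts)[as.vector(counts) == max(counts)]
  if (is.numeric(x)) as.numeric(modes) else modes     # table() stores values as names
}

orders <- c(3, 3, 2, 3, 4, 2)   # made-up values for illustration
mean(orders)
median(orders)
stat_mode(orders)               # the value that occurs most often
```

If several values tie for the highest frequency, the helper returns all of them.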

Summary of the Variability

> range(total)
[1] 30.02 7606.09

> var(total)
[1] 101793.4

> sd(total)
[1] 319.0508

> IQR(total)
[1] 215.21
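
The measures above are related: the standard deviation is the square root of the variance, and the IQR is the third quartile minus the first. In the output above, 295.50 - 80.29 = 215.21, and the square root of 101793.4 is about 319.05. The sketch below checks the same relations on a small made-up vector x (not the sales data).

```r
x <- c(2, 4, 4, 5, 7, 9, 30)        # made-up values for illustration

range(x)                            # smallest and largest value
q <- quantile(x, c(0.25, 0.75))     # first and third quartiles (R's default method)

sd_from_var <- sqrt(var(x))         # standard deviation from the variance
iqr_from_q  <- unname(q[2] - q[1])  # IQR from the two quartiles

all.equal(sd_from_var, sd(x))       # agrees with sd()
all.equal(iqr_from_q, IQR(x))       # agrees with IQR()
```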

A Note on Numerical Summaries

For a sample, if the mean is the same or approximately the same as the median, then the sample is likely to be close to symmetric.

The mean is sensitive to outliers, while the median is not.

When the mean is much larger than the median, the sample is right skewed; when the mean is much smaller than the median, the sample is left skewed. For the total sales above, the mean (249.46) is well above the median (151.65), which suggests a right-skewed sample.
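
A small sketch of the sensitivity of the mean to outliers, using made-up values rather than the sales data: adding one very large observation moves the mean a lot and the median only a little.

```r
x     <- c(70, 75, 80, 150, 155, 300)   # made-up values for illustration
x_out <- c(x, 7600)                     # same values plus one very large observation

c(mean = mean(x), median = median(x))
c(mean = mean(x_out), median = median(x_out))
```

With the extra observation the mean ends up far above the median, the pattern described above for a right-skewed sample.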

Numerical Summaries Are Not Enough
All three samples below have a sample mean of 0 and a sample variance of 1.

No matter how many of the summary measures we report, nothing beats a picture.
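
The point can be reproduced with simulated data. The sketch below is an assumption about what such samples might look like, not the three samples from the slide: it draws a symmetric, a skewed and a bimodal sample, rescales each to mean 0 and variance 1, and plots their histograms. The helper standardize and the seed are arbitrary choices.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

standardize <- function(x) (x - mean(x)) / sd(x)   # force mean 0 and variance 1

samples <- list(
  symmetric = standardize(rnorm(500)),
  skewed    = standardize(rexp(500)),
  bimodal   = standardize(c(rnorm(250, -3), rnorm(250, 3)))
)

sapply(samples, mean)   # all 0 up to rounding error
sapply(samples, var)    # all 1 up to rounding error

par(mfrow = c(1, 3))    # three panels side by side
for (name in names(samples)) hist(samples[[name]], main = name, xlab = "value")
```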

Histogram and Density Plot

A histogram is a graph that uses bars to portray the frequencies or relative frequencies of
the possible outcomes for a quantitative variable.

Density plots can be thought of as plots of smoothed histograms.
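
A minimal sketch of the idea that a density plot is a smoothed histogram, using simulated right-skewed values (an assumption, not the sales data). density() computes a kernel density estimate, and lines() draws it over a histogram plotted on the density scale.

```r
set.seed(2)                        # arbitrary seed
x <- rexp(1000, rate = 1 / 250)    # simulated right-skewed values

h <- hist(x, freq = FALSE, main = "Histogram with density estimate", xlab = "x")
d <- density(x)                    # kernel density estimate
lines(d, col = "red", lwd = 2)     # smoothed curve on top of the bars
```

With freq = FALSE the bar areas sum to 1, which is what puts the bars and the curve on the same scale.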

Histogram

What do we look for in a histogram?

▶ The overall pattern. Do the data cluster together, or is there a gap such that one or more observations deviate from the rest?

▶ Do the data have a single mound? This is known as a unimodal distribution. Data with two mounds are known as bimodal, and data with many mounds are referred to as multimodal.

▶ Is the distribution symmetric or skewed? Are there any suspected outliers?

A Histogram With Suspected Outliers

This histogram is unimodal, but it has suspected outliers on the right.

Unimodal and Bimodal Histograms

Skewness of Histograms

Income is typically right-skewed.

IQ is typically symmetric.

Life-span is typically left-skewed.

Histogram and Density Plot in R
There are many ways to plot histograms in R:

The hist function in the base graphics package;

truehist in package MASS;

histogram in package lattice;

geom_histogram in package ggplot2.

Histogram and Normal Density Plot in R

> # freq = FALSE puts density, not probability, on the y-axis
> hist(total, freq = FALSE, main = "Histogram of Total Sales",
+      xlab = "total sales", ylab = "Density", col = "grey")

The histogram is highly right skewed.
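
The slide title mentions a normal density, which the hist() call alone does not draw. One possible way to overlay it is sketched below on simulated values (an assumption, not the sales data): evaluate dnorm() with the sample mean and standard deviation on a grid and add the curve with lines().

```r
set.seed(3)                        # arbitrary seed
x <- rexp(1000, rate = 1 / 250)    # simulated right-skewed stand-in for the sales totals

hist(x, freq = FALSE, main = "Histogram with normal density", xlab = "x")
grid <- seq(min(x), max(x), length.out = 200)            # points at which to draw the curve
normal_curve <- dnorm(grid, mean = mean(x), sd = sd(x))  # normal density with matching mean and sd
lines(grid, normal_curve, col = "red", lwd = 2)
```

For right-skewed data the bars and the curve disagree visibly, which is another way of seeing that the sample is far from normal.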

Boxplots

Boxplots provide a skeletal representation of a distribution, and they are very well suited
for showing distributions for multiple variables.

A boxplot helps us to identify the median, the lower and upper quartiles, the IQR, and any outliers.

Boxplot

Boxplots in R
The code is:

> boxplot(total, xlab = "Total Sales", col = "blue")

The median is very low, below 200 (151.65 according to summary(total)).

The boxplot shows many outliers and extreme outliers.

If the sample is unimodal, then the distribution is highly right skewed.
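
The points drawn individually by boxplot() can also be listed numerically. The sketch below uses boxplot.stats() on made-up values; treating points beyond 3 times the box length as "extreme" is a common convention assumed here, not something the slide defines.

```r
x <- c(30, 70, 80, 95, 150, 160, 290, 300, 700, 7600)   # made-up values

b <- boxplot.stats(x)     # the statistics that boxplot() draws, with the default coef = 1.5
b$stats                   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out                     # outliers: beyond 1.5 box lengths from the hinges

extreme <- boxplot.stats(x, coef = 3)$out   # assumed convention for extreme outliers
extreme
```

Here 700 and 7600 are flagged as outliers, and only 7600 is extreme under the assumed rule.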

QQ Plots

The purpose of a QQ plot of a sample is to see whether the sample (approximately) follows a normal distribution.

A QQ plot plots the standardized sample quantiles against the theoretical quantiles of a N(0, 1) distribution.

From the points on the plot, we can usually tell whether the sample has longer or shorter tails than normal.

QQ plots (1)

The figure on the left shows data with both tails longer than normal.

The figure on the right shows data with both tails shorter than normal.

QQ plots (2)

The figure on the left shows data whose left tail is longer than normal but whose right tail is shorter than normal.

The figure on the right shows data whose tails are both normal.

QQ Plots in R
The code is:
> qqnorm(total, main = "QQ Plot", pch = 20)
> qqline(total, col = "red")

The QQ plot of the sample shows that the right tail is much longer than normal, while the left tail is much shorter than normal.
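
The same reading can be checked numerically. The sketch below uses a simulated right-skewed sample (an assumption, not the sales data) and rebuilds the reference line that qqline() draws by default, through the first and third quartiles. Points above the line at the right end indicate a right tail longer than normal; points above the line at the left end indicate a left tail shorter than normal.

```r
set.seed(4)          # arbitrary seed
x <- rexp(500)       # simulated right-skewed sample

qq <- qqnorm(x, main = "QQ Plot", pch = 20)   # qq$x: theoretical quantiles, qq$y: sample values
qqline(x, col = "red")

# Rebuild the default reference line (through the first and third quartiles).
slope     <- unname(diff(quantile(x, c(0.25, 0.75))) / diff(qnorm(c(0.25, 0.75))))
intercept <- unname(quantile(x, 0.25)) - slope * qnorm(0.25)
on_line   <- intercept + slope * qq$x

max(x) > on_line[which.max(x)]   # largest point lies above the line
min(x) > on_line[which.min(x)]   # smallest point also lies above the line
```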

Quantifying the Association: Correlation Value

Let X and Y be two features measured on a set of n points.

The correlation of these two features is defined as:

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{s_X} \right) \left( \frac{Y_i - \bar{Y}}{s_Y} \right) $$

where $\bar{X}$, $\bar{Y}$ are the sample means and $s_X$, $s_Y$ are the sample standard deviations of the two features.

r is always between -1 and 1.
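
The definition can be checked against R's built-in cor(). The sketch below uses two short made-up vectors x and y: it standardizes each feature, multiplies the standardized values pairwise, sums them and divides by n - 1.

```r
x <- c(1, 2, 3, 4, 5, 6)                   # made-up values for illustration
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.7)

n <- length(x)
zx <- (x - mean(x)) / sd(x)                # standardized X values
zy <- (y - mean(y)) / sd(y)                # standardized Y values
r_manual <- sum(zx * zy) / (n - 1)

r_manual
cor(x, y)                                  # built-in Pearson correlation, should agree
```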

Correlation Value

A positive value of r indicates a positive association and a negative value of r indicates a negative association. The closer |r| is to 1, the stronger the linear association.

> order = sales$num_of_orders

> cor(total, order)
[1] 0.7508015

Visualizing the Association: Scatterplots

A scatterplot helps to visualize the association between two quantitative features.

What to look for in a scatterplot:

Is there any (possible) relationship between the 2 variables?

If yes, is the association positive or negative?

If there is an association, is it linear or non-linear?

Are some observations unusual, departing from the overall trend?

Scatterplots in R

> plot(order,total, pch = 20, col = "darkblue")

Boxplots of Multiple Groups

The categorical variable cancer has two categories: male and female. The variable Age is quantitative. One would check whether there is any relationship between these two variables.

Boxplots of Multiple Groups in R

> attach(sales)
> boxplot(total ~ gender)

There is no obvious difference in total sales between the customers' genders. The medians of the two groups are similar, and the IQRs are about the same.
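
The comparison read off the boxplots can be backed by numbers. The sketch below uses a small made-up data frame toy with the same column names as the sales data. Passing data = to boxplot() is shown as an alternative to attach(), and tapply() computes the median and IQR per group.

```r
# Made-up data frame with the same column names as the sales data.
toy <- data.frame(
  sales_total = c(120, 80, 300, 150, 95, 210, 60, 400),
  gender      = factor(c("F", "M", "F", "M", "F", "M", "F", "M"))
)

# One box per gender; the data argument avoids attaching the data frame.
boxplot(sales_total ~ gender, data = toy, xlab = "gender", ylab = "total sales")

# Numerical counterpart of the plot.
med <- tapply(toy$sales_total, toy$gender, median)
iqr <- tapply(toy$sales_total, toy$gender, IQR)
med
iqr
```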

Association of 3 Variables

Can you figure out a way to visualize the association between the three features: total sales, number of orders and the gender of the customers?
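
One possible answer, offered as a sketch rather than the intended solution, is a scatterplot of the two quantitative features with the points coloured by gender. The data frame toy and the colour choices below are made up for illustration.

```r
# Made-up data frame with the same column names as the sales data.
toy <- data.frame(
  num_of_orders = c(2, 3, 3, 4, 2, 5, 6, 3),
  sales_total   = c(70, 220, 800, 720, 75, 900, 1500, 500),
  gender        = factor(c("F", "F", "M", "F", "M", "M", "F", "M"))
)

cols <- c(F = "red", M = "darkblue")              # one colour per gender level
point_cols <- cols[as.character(toy$gender)]      # colour of each point

plot(toy$num_of_orders, toy$sales_total, pch = 20, col = point_cols,
     xlab = "number of orders", ylab = "total sales")
legend("topleft", legend = names(cols), col = cols, pch = 20)
```

Another option would be to draw one scatterplot per gender side by side.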

Summary of a Categorical Variable

For a single categorical variable, we can use a frequency table (which can also give proportions or percentages) as a numerical summary.

The category with the highest frequency is reported as the modal category.

Common graphical displays for a categorical variable are the bar plot and the pie chart.

Barplot and Pie Chart
> count = table(gender)
> count # frequency table
gender
   F    M
5035 4965
> barplot(count)
> pie(count)
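
The frequency table can be turned into proportions and a modal category. The sketch below re-enters the two counts from the table above as a named vector, so it runs without the data file.

```r
count <- c(F = 5035, M = 4965)     # frequencies copied from the table above

prop <- count / sum(count)         # proportions
round(100 * prop, 2)               # percentages

modal <- names(count)[which.max(count)]   # category with the highest frequency
modal
```

With these counts the percentages are 50.35 for F and 49.65 for M, so F is the modal category.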

Two Categorical Variables

A contingency table is often used to summarize two categorical variables.

The odds ratio is useful too.

Two Categorical Variables

Categorize the number of orders into two categories: small (at most 5 orders) and large (more than 5 orders).

> order.size = ifelse(order<=5, "small", "large")

> table(order.size)
order.size
large small
  324  9676

Contingency table of frequency

> table = table(gender,order.size);table

      order.size
gender large small
     F   142  4893
     M   182  4783

Contingency Tables

Contingency table of joint proportion

> prop.table(table)
      order.size
gender  large  small
     F 0.0142 0.4893
     M 0.0182 0.4783

Contingency Tables

Contingency table of proportion by gender

> tab = prop.table(table, "gender") # proportion by gender

> tab
      order.size
gender      large      small
     F 0.02820258 0.97179742
     M 0.03665660 0.96334340

Among female customers, 2.82% have a large number of orders, while 3.67% of male customers do.

Odds of Success

For a probability of success π, the odds of success are defined as odds = π/(1 − π).

If we consider having a large order as a success, then for the female group the odds of success, or the odds of a large order, are 0.029.

> tab[1]/(1-tab[1])
[1] 0.02902105

For the male group, the odds of having a large order are 0.038.

> tab[2]/(1-tab[2])
[1] 0.03805143
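
The two odds can be reproduced from the counts alone. The sketch below defines a small helper (the name odds is an illustration, not a base R function) and applies it to the proportions of large orders, using the counts from the contingency table.

```r
odds <- function(p) p / (1 - p)    # hypothetical helper: odds for a probability p

# Proportion of customers with a large order size, by gender (counts from the table).
p_large <- c(F = 142 / 5035, M = 182 / 4965)

odds_large <- odds(p_large)
round(odds_large, 3)
```

The odds are also the ratio of the two counts in each row: 142/4893 for females and 182/4783 for males.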

Odds Ratio

The odds ratio is the ratio of two odds of success: here, the odds of a large order in the female group (0.029) and the odds of a large order in the male group (0.038).

$$ \mathrm{OR} = \frac{0.029}{0.038} \approx 0.76 $$

What does this value mean?
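
One way to read the value, offered as an interpretation rather than the slide's own answer: the odds of a large order for female customers are about 0.76 times the odds for male customers, that is, roughly 24% lower. An odds ratio of 1 would mean equal odds in the two groups. The sketch below recomputes the odds ratio directly from the counts in the contingency table; the object name tab_counts is made up.

```r
# Counts copied from the contingency table (filled column by column).
tab_counts <- matrix(c(142, 182, 4893, 4783), nrow = 2,
                     dimnames = list(gender = c("F", "M"),
                                     order.size = c("large", "small")))

odds_F <- tab_counts["F", "large"] / tab_counts["F", "small"]   # odds of a large order, females
odds_M <- tab_counts["M", "large"] / tab_counts["M", "small"]   # odds of a large order, males

odds_ratio <- odds_F / odds_M
round(odds_ratio, 2)
```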
