Intermediate
Analytics
ALY6015
Northeastern University
By: Behzad Abdi
Meet Your Instructor
Introduction
Human Simulation in Machines
Introduction
Artificial Intelligence:
Introduction
Regression
• Goal: To predict a continuous value based on given input data.
• Type of Output: A numerical value (e.g., predicting price, temperature, or weight).
• Example: Predicting house prices based on size, location, and year of construction.
• Feature: Regression is used for problems where the output is a numerical value, and the model aims to predict a specific numerical result.
Classification
• Goal: To assign data to specific categories based on labelled data.
• Type of Output: A discrete value or category (e.g., yes/no, dog/cat).
• Example: Detecting spam emails or classifying fruit as an apple or an orange.
• Feature: In classification, the model learns from labelled data and assigns new data to a specific category.
Clustering
• Goal: To group data into similar clusters based on patterns and similarities, without predefined labels.
• Type of Output: Grouping data based on internal similarities.
• Example: Grouping customers based on purchasing behavior without knowing predefined categories.
• Feature: Clustering is used in unsupervised learning, where data is not assigned to predefined categories. The model automatically discovers the groups from the data itself.
Model Validation and Evaluation:
Independent variable, also called a predictor variable.
Dependent variable, also called a response variable.
Relationships Between Variables
• Inferential statistics help determine if relationships
exist between numerical variables.
• Examples:
• Sales volume and advertising spending
• Study hours and exam scores
• Age and blood pressure
• Techniques: Correlation and Regression Analysis.
Main Questions
1. Are two or more variables related?
2. If so, what is the strength of the relationship?
3. What type of relationship exists?
4. What kind of predictions can be made from the relationship?
Correlation vs. Regression
Correlation:
• Measures whether variables are related.
• Determines the strength of the relationship
using a correlation coefficient.
Regression:
• Describes the nature of relationships
(positive/negative, linear/nonlinear).
• Helps predict one variable based on another.
Correlation
coefficient
• Measures the strength and direction of the relationship
between two variables.
• It ranges between −1 and +1:
• +1: Perfect positive correlation.
• −1: Perfect negative correlation.
• 0: No correlation.
• Pearson Correlation:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]
Correlation Coefficient Interpretation:
• The sign of r gives the direction of the relationship (positive or negative).
• The closer |r| is to 1, the stronger the relationship; values near 0 indicate little or no linear relationship.
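For instance, a minimal sketch in Python (using hypothetical study-hours/exam-score data; scipy's pearsonr returns both r and a P-value):

```python
import numpy as np
from scipy import stats

# Hypothetical data: hours studied vs. exam score
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([52, 55, 61, 64, 70, 74, 79, 85])

r, p_value = stats.pearsonr(hours, score)   # Pearson correlation coefficient and its P-value
print(round(r, 3))                          # close to +1: strong positive correlation
```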
Visualizing Correlation
Scatter Plots illustrate the relationship between variables.
Is there a significant linear relationship between the variables, or is the value of r due to chance?
Hypothesis testing is used to determine whether r is statistically significant.
Significance Test of the Correlation Coefficient
Assumptions:
• Variables x and y are linearly related.
• Variables x and y are random variables.
• The variables have a bivariate normal distribution.
  o For any given x, the y values have a bell-shaped distribution.
  o For any given y, the x values have a bell-shaped distribution.
Hypothesis-Testing Procedure (Traditional Method):
Steps:
1. State the hypotheses:
• Null hypothesis (H0): ρ = 0 (no correlation).
• Alternative hypothesis (H1): ρ ≠ 0 (significant correlation).
2. Compute the test value: t = r·√[(n − 2) / (1 − r²)], with d.f. = n − 2.
3. Compare the test value with the critical values from the t-distribution table.
4. Make the decision (reject or fail to reject H0).
5. Summarize the results.
P-Value Method:
Steps:
1. State the hypotheses.
2. Find the test value using the t-test.
3. Compute the P-value (e.g., from Table F, the t distribution, or a calculator).
4. Compare the P-value to the α level (e.g., 0.05).
5. Summarize the results.
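A minimal sketch of both steps in Python (continuing the hypothetical hours/score data): the test value is computed from the formula above, and the two-tailed P-value follows from the t distribution.

```python
import numpy as np
from scipy import stats

# Hypothetical data: hours studied vs. exam score
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([52, 55, 61, 64, 70, 74, 79, 85])

n = len(hours)
r, _ = stats.pearsonr(hours, score)

# Test value: t = r * sqrt((n - 2) / (1 - r^2)), with d.f. = n - 2
t = r * np.sqrt((n - 2) / (1 - r**2))
p_value = 2 * stats.t.sf(abs(t), df=n - 2)   # two-tailed P-value

print(round(t, 2), round(p_value, 4))        # reject H0 if p_value < alpha (e.g., 0.05)
```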
Using the Table of Critical Values of r:
Table I provides critical r values for specific α levels and
degrees of freedom.
Steps:
1. State the hypotheses.
2. Find critical values from Table I.
3. Compare r to critical values.
4. Make a decision.
Limitations
The relationship between the variables may be caused by a third variable (lurking variable).
Correlation ≠ Causation: Correlation does not imply causation.
Effect of Outliers: Outliers can distort the correlation value.
Linear Relationships Only: Pearson correlation measures only linear relationships.
Introduction to Linear Regression
What is Linear Regression?
1. Definition: Linear regression models the relationship between a dependent variable and one or more independent variables.
2. Goal: To find the best-fitting straight line through a set of points.
Introduction to Linear Regression
1. Simple Linear Regression:
• Only one independent variable (input)
• Goal: establish a linear relationship between the input and the output variable.
• Represented by the equation: y=mx+c
Where:
y is the dependent variable (output).
x is the independent variable (input).
m is the slope of the line (the rate at which y changes with
respect to x).
c is the intercept (the point where the line crosses the y-axis).
Linear Regression
1. Simple Linear Regression:
Example: Predicting a student's exam score (y) based on their studied hours (x).
The independent variable: the number of hours studied
The dependent variable: the exam score
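A minimal sketch of fitting such a line in Python (hypothetical data; np.polyfit with degree 1 returns the slope m and intercept c):

```python
import numpy as np

# Hypothetical data: hours studied (x) and exam score (y)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([52, 55, 61, 64, 70, 74, 79, 85])

m, c = np.polyfit(hours, score, deg=1)   # least-squares fit of y = m*x + c
print(round(m, 2), round(c, 2))

predicted = m * 9 + c                    # predicted score for 9 hours of study
print(round(predicted, 1))
```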
Linear Regression
Fit a line through the points such that the error, or residual (the vertical distance of each point from the line), is as small as possible.
Linear Regression
• The error for each point can be positive or negative.
• A simple sum of all the errors will be zero, because positive and negative errors cancel out.
• So we square each error before summing (the least-squares criterion).
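The sketch below (continuing the hypothetical hours/score fit) illustrates the point: the raw residuals sum to essentially zero, while the squared residuals give a usable measure of total error.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([52, 55, 61, 64, 70, 74, 79, 85])

m, c = np.polyfit(hours, score, deg=1)
residuals = score - (m * hours + c)      # error for each point (positive or negative)

print(round(residuals.sum(), 10))        # essentially 0: positive and negative errors cancel
print(round((residuals ** 2).sum(), 2))  # sum of squared errors (SSE), the quantity minimized
```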
Improving the results of a linear regression model
1. Removing Outliers:
• Explanation: Outliers (data points that are significantly different from others) can skew the
model and reduce its accuracy.
• Why it's effective: Removing or managing outliers allows the model to align more with most
of the data and perform better.
• Example: If a student with very few study hours (e.g., 1 hour) achieves a very high score (e.g.,
95), this could be an outlier and need to be removed or examined further.
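One common way to flag such points (a sketch, not the only approach) is to look for observations whose residual from the fitted line is unusually large, e.g. more than about two standard deviations:

```python
import numpy as np

# Hypothetical data, with one suspicious point: 1 hour of study but a score of 95
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 1])
score = np.array([52, 55, 61, 64, 70, 74, 79, 85, 95])

m, c = np.polyfit(hours, score, deg=1)
residuals = score - (m * hours + c)

# Rule of thumb: flag points whose residual is more than ~2 standard deviations from zero
flags = np.abs(residuals) > 2 * residuals.std()
print(flags)   # the (1 hour, 95) point is flagged for closer examination
```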
Improving the results of a linear regression model
2. Multicollinearity
Definition:
• Multicollinearity occurs when two or more predictors in a regression model are highly
correlated.
• This makes it difficult for the model to distinguish their individual effects on the
dependent variable.
Impact on the Model:
• Unstable Coefficients: Changes in data can lead to large variations in predictor coefficients.
• Reduced Interpretability: Difficult to determine the importance of individual predictors.
• Inflated Standard Errors: Wider confidence intervals, making predictors appear insignificant.
Improving the results of a linear regression model
Detecting Multicollinearity
Common Methods:
1. Correlation Matrix:
• High correlation between predictors (commonly |r| above about 0.8 or 0.9) is a warning sign.
2. Variance Inflation Factor (VIF):
• Measures how much the variance of a predictor is inflated due to multicollinearity.
• Rule of Thumb: a VIF above 10 is commonly taken to indicate severe multicollinearity (values above 5 already merit attention).
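A minimal sketch of computing VIF with statsmodels (hypothetical predictor data with deliberately correlated columns):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; size_sqft and num_rooms are deliberately correlated
X = pd.DataFrame({
    "size_sqft": [1500, 1700, 1600, 2100, 1400, 1800],
    "num_rooms": [3, 4, 3, 5, 2, 4],
    "age_years": [10, 5, 12, 3, 20, 8],
})

X_const = sm.add_constant(X)   # VIF is computed on the design matrix including the intercept
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif)   # values above ~10 are commonly read as severe multicollinearity
```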
Resolving Multicollinearity
Strategies to Handle Multicollinearity:
1. Remove One of the Correlated Predictors:
• Identify redundant predictors and exclude them.
2. Combine Predictors:
• Create a new variable (e.g., principal component analysis or a
combined index).
3. Regularization Methods:
• Use techniques like Ridge Regression or Lasso to reduce the impact
of correlated predictors.
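As a rough sketch of the regularization route (hypothetical house-price data; scikit-learn's Ridge and Lasso, with standardized predictors):

```python
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical training data: rows are houses, columns are correlated predictors
X_train = [[1500, 3, 10], [1700, 4, 5], [1600, 3, 12], [2100, 5, 3], [1400, 2, 20]]
y_train = [300_000, 340_000, 310_000, 420_000, 260_000]

# Ridge shrinks correlated coefficients toward each other; Lasso can drop some entirely
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_train, y_train)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1, max_iter=10_000)).fit(X_train, y_train)

print(ridge.named_steps["ridge"].coef_)
print(lasso.named_steps["lasso"].coef_)
```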
Linear Regression
Multivariate Regression
• In real-life use cases, there is usually more than one independent variable.
• Using several independent variables at once is called multivariate (multiple) regression.
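A minimal sketch with two predictors (hypothetical data; one coefficient is estimated per independent variable):

```python
from sklearn.linear_model import LinearRegression

# Hypothetical data: exam score explained by study hours and hours of sleep
X = [[2, 6], [4, 7], [6, 5], [8, 8], [10, 6]]   # columns: study hours, sleep hours
y = [55, 65, 70, 88, 90]

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)            # one coefficient per independent variable
print(model.predict([[5, 7]]))                  # predicted score for 5 study hours, 7 hours of sleep
```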
Model Selection and
Comparison
Techniques:
• R Square
• Adjusted R Square
• AIC (Akaike information criterion)
• BIC (Bayesian information criterion)
• Mallow’s Cp
What is R-Squared?
Definition:
• Measures how well a regression model explains the variability in the dependent variable based on the
independent variable(s).
Range: values range from 0 to 1:
• High: Large proportion of variability explained by the model.
• Low: Model does not explain much of the variability.
Example: Predicting house prices based on square footage:
• R² = 0.85: 85% of the variation in house prices is explained by square footage.
• R² = 0.20: Only 20% of the variation is explained.
Key Points:
• Higher values indicate a better fit.
• Used to evaluate the performance of regression models.
R-Squared Formula:
R² = explained variation / total variation = 1 − (SS_residual / SS_total)
Linear Regression
R-Squared for Goodness of fit
R-squared = 1510.01 / 1547.55 ≈ 0.98 (explained variation ÷ total variation)
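A minimal sketch of the same calculation in Python (hypothetical observed values and predictions; sklearn's r2_score uses the 1 − SS_res/SS_tot form):

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observed values and model predictions
y_true = np.array([52, 55, 61, 64, 70, 74, 79, 85])
y_pred = np.array([51, 57, 60, 65, 69, 75, 80, 84])

ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
print(1 - ss_res / ss_tot)                       # manual R-squared
print(r2_score(y_true, y_pred))                  # same value from sklearn
```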
Advantages and Limitations of R-Squared:
Advantages of R-Squared:
• Easy to Interpret: Provides a simple measure of how well the model fits the data.
• Useful for Model Comparison: Higher values suggest better models (on the same
dataset).
Limitations of R-Squared:
• Not Always Predictive: A high R² does not guarantee good predictions for new data.
• Sensitive to Overfitting: Adding more variables can artificially increase R².
• Correlation, Not Causation: R² only measures association, not causality.
Adjusted R-Square:
• To address the limitation of overfitting, it penalizes the model for including unnecessary variables.
• Adjusted R² = 1 − [(1 − R²)(n − 1)] / (n − p − 1), where n is the number of observations and p is the number of predictors.
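A quick check of the formula (assuming, for illustration, R² = 0.85 from a model with n = 50 observations and p = 3 predictors):

```python
# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2, n, p = 0.85, 50, 3
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))   # slightly below 0.85: the penalty for the extra predictors
```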
Akaike Information Criterion (AIC):
Definition: AIC estimates in-sample prediction error and compares model quality within
the same dataset.
Key Points:
• Lower AIC indicates a better model.
• Only valid for comparing models from the same dataset.
• Does not measure absolute model quality.
Bayesian Information Criterion (BIC):
Definition: A penalized-likelihood criterion derived from Bayesian probability, closely
related to AIC.
Key Points:
• Heavily penalizes complexity to favor simpler models.
• Lower BIC indicates a better model.
Formula: BIC = k·ln(n) − 2·ln(L̂), where k is the number of parameters, n is the sample size, and L̂ is the maximized likelihood.
Mallow’s Cp:
Definition: Compares precision and bias of the full model to models with subsets of
predictors.
Key Points:
• Calculates Cp for all variable combinations.
• Best model has Cp value closest to (number of predictors + 1).
Formula: Cp = SSE_p / MSE_full − n + 2(p + 1), where SSE_p is the error sum of squares of the candidate model with p predictors and MSE_full is the mean squared error of the full model.
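A minimal sketch of comparing two models on the same (hypothetical, simulated) dataset with statsmodels: lower AIC/BIC and higher adjusted R-squared favor a model.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: exam score explained by study hours and hours of sleep
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, 50)
sleep = rng.uniform(4, 9, 50)
score = 40 + 5 * hours + 2 * sleep + rng.normal(0, 5, 50)

X1 = sm.add_constant(np.column_stack([hours]))          # smaller model: hours only
X2 = sm.add_constant(np.column_stack([hours, sleep]))   # larger model: hours + sleep

m1 = sm.OLS(score, X1).fit()
m2 = sm.OLS(score, X2).fit()

# Comparisons are only meaningful between models fit to the same dataset
for name, m in [("hours only", m1), ("hours + sleep", m2)]:
    print(name, round(m.rsquared_adj, 3), round(m.aic, 1), round(m.bic, 1))
```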
Comparison of Adjusted R-Square, AIC, BIC, and Mallow's Cp:
Detailed Use Cases:
1. Adjusted R-Square:
• When to Use: When comparing multiple linear regression models with different
numbers of predictors.
• When Not to Use: When working with nonlinear models or datasets with many
predictors.
2. AIC (Akaike Information Criterion):
• When to Use: When working with more complex models, such as GLMs or time-series
models.
• When Not to Use: When simpler models are preferred, or the dataset is very large.
3. BIC (Bayesian Information Criterion):
• When to Use: When datasets are large and simpler models are desirable.
• When Not to Use: When high accuracy is needed, even if the model is more complex.
4. Mallow’s Cp:
• When to Use: When selecting the best subset of predictors in linear regression.
• When Not to Use: For nonlinear models or when there are too many predictors.
How to Decide Which Method to Use:
• Linear Models: Use Adjusted R-Square and Mallow’s Cp.
• Complex Models or Small Datasets: Use AIC.
• Large Datasets or Simplicity Preferred: Use BIC.
Stepwise Regression:
Definition:
• Stepwise Regression is a modeling technique that automatically decides
which independent variables (features) to include or remove from the
model.
• Goal: Create a simpler and more effective model by reducing unnecessary
predictors.
Stepwise Regression Methods:
1. Forward Selection:
• Start with no variables.
• Add predictors one by one based on their significance in the model.
2. Backward Elimination:
• Start with all predictors.
• Remove the least significant predictor one by one.
3. Stepwise Selection:
• Combine Forward Selection and Backward Elimination.
• Variables can be added or removed at each step.
Example: Predicting House Prices with Stepwise Regression
Predictors:
• House Size (X1)
• Number of Rooms (X2)
• House Age (X3)
• Wall Color (X4)
1. Forward Selection:
• Start with no predictors.
• Add House Size (X1) because it has the most significant impact.
• Add Number of Rooms (X2) if it improves model performance.
• Continue until no additional predictors improve the model.
2. Backward Elimination:
• Start with all predictors: X1, X2, X3, X4.
• Remove Wall Color (X4) if it has the least significance.
• Repeat until all remaining predictors are significant.
3. Stepwise Selection:
• Combine both methods. Variables can be added or removed based on
their impact on the model.
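A minimal forward-selection sketch in Python (hypothetical; it uses p-values as the inclusion criterion, whereas real implementations often use AIC or another criterion):

```python
import statsmodels.api as sm

def forward_selection(X, y, alpha=0.05):
    """Greedily add the predictor with the smallest p-value until none is significant."""
    remaining, selected = list(X.columns), []
    while remaining:
        pvals = {}
        for cand in remaining:
            design = sm.add_constant(X[selected + [cand]])
            pvals[cand] = sm.OLS(y, design).fit().pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:      # stop when the best remaining predictor is not significant
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Usage with hypothetical house-price data (column names follow the example above):
# forward_selection(df[["X1", "X2", "X3", "X4"]], df["price"])
```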
Advantages and Disadvantages of Stepwise
Regression:
•Advantages:
• Simplifies the model.
• Removes unnecessary or insignificant predictors.
• Fast and automatic.
•Disadvantages:
• Risk of Omitting Important Predictors:
• Can exclude significant variables, especially if predictors are highly correlated.
• Sensitive to Data:
• Results may vary with different datasets.
Linear Regression
Improving the results of a linear regression model
Adding More Independent Variables:
• Explanation: Considering only study hours as the independent variable might be insufficient.
You can add other variables, such as the amount of sleep, the number of tutoring classes, or
even the level of student stress.
• Why it's effective: Adding more independent variables can help the model consider more
factors that affect the score, thereby creating a more accurate model.
• Example: Suppose you add the amount of sleep to the model. Now your model might look like this:
Score = b0 + b1·(study hours) + b2·(hours of sleep)
Tools for Validation:
•Popular Techniques:
• Train-Validation Split: Directly splitting data into training and validation sets.
• K-Fold Cross-Validation: Dividing data into K subsets for repeated training and validation.
• Stratified K-Fold: Preserves class distribution during splitting.
K-Fold Example:
Imagine your dataset consists of the following 10 samples:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Using 5-Fold Cross-Validation, the data is split into 5 folds:
• Fold 1: [1, 2]
• Fold 2: [3, 4]
• Fold 3: [5, 6]
• Fold 4: [7, 8]
• Fold 5: [9, 10]
Iteration 1:
• Training Set: [3, 4, 5, 6, 7, 8, 9, 10]
• Validation Set: [1, 2]
Iteration 2:
• Training Set: [1, 2, 5, 6, 7, 8, 9, 10]
• Validation Set: [3, 4]
Iteration 3:
• Training Set: [1, 2, 3, 4, 7, 8, 9, 10]
• Validation Set: [5, 6]
Iteration 4:
• Training Set: [1, 2, 3, 4, 5, 6, 9, 10]
• Validation Set: [7, 8]
Iteration 5:
• Training Set: [1, 2, 3, 4, 5, 6, 7, 8]
• Validation Set: [9, 10]
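The same splits can be reproduced with scikit-learn's KFold (a minimal sketch; shuffle=False keeps the folds in order, matching the example above):

```python
from sklearn.model_selection import KFold

samples = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

kf = KFold(n_splits=5, shuffle=False)
for i, (train_idx, val_idx) in enumerate(kf.split(samples), start=1):
    train = [samples[j] for j in train_idx]
    val = [samples[j] for j in val_idx]
    print(f"Iteration {i}: train={train}, validation={val}")
```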
Polynomial Regression