Intermediate
Analytics
ALY6015
Northeastern University
By: Behzad Abdi
Meet Your Instructor
Introduction
Human Simulation in Machines
Introduction
Artificial Intelligence:
Introduction
Regression
• Goal: To predict a continuous value based on given input data.
• Type of Output: A numerical value (e.g., predicting price, temperature, or weight).
• Example: Predicting house prices based on size, location, and year of construction.
• Feature: Regression is used for problems where the output is a numerical value, and the model aims to predict a specific numerical result.
Classification
• Goal: To assign data to specific categories based on labelled data.
• Type of Output: A discrete value or category (e.g., yes/no, dog/cat).
• Example: Detecting spam emails or classifying fruit as an apple or an orange.
• Feature: In classification, the model learns from labelled data and assigns new data to a specific category.
Clustering
• Goal: To group data into similar clusters based on patterns and similarities, without predefined labels.
• Type of Output: Grouping data based on internal similarities.
• Example: Grouping customers based on purchasing behavior without knowing predefined categories.
• Feature: Clustering is used in unsupervised learning, where data is not assigned to predefined categories. The model automatically discovers the groups from the data itself.
Model Validation and Evaluation:
Independent variable, also called a predictor variable.
Dependent variable, also called a response variable.
Relationships Between Variables
• Inferential statistics help determine if relationships
exist between numerical variables.
• Examples:
• Sales volume and advertising spending
• Study hours and exam scores
• Age and blood pressure
• Techniques: Correlation and Regression Analysis.
Main Questions
1. Are two or more variables related?
2. If so, what is the strength of the relationship?
3. What type of relationship exists?
4. What kind of predictions can be made from the relationship?
Correlation vs. Regression
Correlation:
• Measures whether variables are related.
• Determines the strength of the relationship
using a correlation coefficient.
Regression:
• Describes the nature of relationships
(positive/negative, linear/nonlinear).
• Helps predict one variable based on another.
Correlation
coefficient
• Measures the strength and direction of the relationship
between two variables.
• It ranges between −1 and +1:
• +1: Perfect positive correlation.
• −1: Perfect negative correlation.
• 0: No correlation.
• Pearson Correlation:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]
Correlation Coefficient Interpretation:
• The sign of r gives the direction of the relationship (positive or negative).
• The closer |r| is to 1, the stronger the relationship; values near 0 indicate little or no linear relationship.
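For instance, a minimal sketch in Python (using hypothetical study-hours/exam-score data; scipy's pearsonr returns both r and a P-value):

```python
import numpy as np
from scipy import stats

# Hypothetical data: hours studied vs. exam score
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([52, 55, 61, 64, 70, 74, 79, 85])

r, p_value = stats.pearsonr(hours, score)   # Pearson correlation coefficient and its P-value
print(round(r, 3))                          # close to +1: strong positive correlation
```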
Visualizing Correlation
Scatter Plots illustrate the relationship between variables.
Is there a significant linear relationship between the variables, or is the value of r due to chance?
Hypothesis testing is used to determine whether r is statistically significant.
Significance Test of the Correlation Coefficient
Assumptions:
• Variables x and y are linearly related.
• Variables x and y are random variables.
• The variables have a bivariate normal distribution.
  o For any given x, the y values have a bell-shaped distribution.
  o For any given y, the x values have a bell-shaped distribution.
Hypothesis-Testing Procedure (Traditional Method):
Steps:
1. State the hypotheses:
• Null hypothesis (H0): ρ = 0 (no correlation).
• Alternative hypothesis (H1): ρ ≠ 0 (significant correlation).
2. Compute the test value: t = r·√[(n − 2) / (1 − r²)], with d.f. = n − 2.
3. Compare the test value with the critical values from the t-distribution table.
4. Make the decision (reject or fail to reject H0).
5. Summarize the results.
P-Value Method:
Steps:
1. State the hypotheses.
2. Find the test value using the t-test.
3. Compute the P-value (e.g., from Table F, the t distribution, or a calculator).
4. Compare the P-value to the α level (e.g., 0.05).
5. Summarize the results.
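A minimal sketch of both steps in Python (continuing the hypothetical hours/score data): the test value is computed from the formula above, and the two-tailed P-value follows from the t distribution.

```python
import numpy as np
from scipy import stats

# Hypothetical data: hours studied vs. exam score
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([52, 55, 61, 64, 70, 74, 79, 85])

n = len(hours)
r, _ = stats.pearsonr(hours, score)

# Test value: t = r * sqrt((n - 2) / (1 - r^2)), with d.f. = n - 2
t = r * np.sqrt((n - 2) / (1 - r**2))
p_value = 2 * stats.t.sf(abs(t), df=n - 2)   # two-tailed P-value

print(round(t, 2), round(p_value, 4))        # reject H0 if p_value < alpha (e.g., 0.05)
```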
Using the Table of Critical Values of r:
Table I provides critical r values for specific α levels and
degrees of freedom.
Steps:
1. State the hypotheses.
2. Find critical values from Table I.
3. Compare r to critical values.
4. Make a decision.
Limitations
The relationship between the variables may be caused by a third variable (lurking variable).
Correlation ≠ Causation: Correlation does not imply causation.
Effect of Outliers: Outliers can distort the correlation value.
Linear Relationships Only: Pearson correlation measures only linear relationships.
Introduction to Linear Regression
What is Linear Regression?
1. Definition: Linear regression models the relationship between a dependent variable and one or more independent variables.
2. Goal: To find the best-fitting straight line through a set of points.
Introduction to Linear Regression
1. Simple Linear Regression:
• Only one independent variable (input)
• Goal: establish a linear relationship between the input and the output variable.
• Represented by the equation: y=mx+c
Where:
y is the dependent variable (output).
x is the independent variable (input).
m is the slope of the line (the rate at which y changes with
respect to x).
c is the intercept (the point where the line crosses the y-axis).
Linear Regression
1. Simple Linear Regression:
Example: Predicting a student's exam score (y) based on their studied hours (x).
The independent variable: the number of hours studied
The dependent variable: the exam score
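A minimal sketch of fitting such a line in Python (hypothetical data; np.polyfit with degree 1 returns the slope m and intercept c):

```python
import numpy as np

# Hypothetical data: hours studied (x) and exam score (y)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([52, 55, 61, 64, 70, 74, 79, 85])

m, c = np.polyfit(hours, score, deg=1)   # least-squares fit of y = m*x + c
print(round(m, 2), round(c, 2))

predicted = m * 9 + c                    # predicted score for 9 hours of study
print(round(predicted, 1))
```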
Linear Regression
Fit a line through the points such that the error, or residual (the vertical distance of each point from the line), is as small as possible.
Linear Regression
• The error for each point can be positive or negative.
• A simple sum of all the errors will be zero, because positive and negative errors cancel out.
• So we square each error before summing (the least-squares criterion).
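The sketch below (continuing the hypothetical hours/score fit) illustrates the point: the raw residuals sum to essentially zero, while the squared residuals give a usable measure of total error.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([52, 55, 61, 64, 70, 74, 79, 85])

m, c = np.polyfit(hours, score, deg=1)
residuals = score - (m * hours + c)      # error for each point (positive or negative)

print(round(residuals.sum(), 10))        # essentially 0: positive and negative errors cancel
print(round((residuals ** 2).sum(), 2))  # sum of squared errors (SSE), the quantity minimized
```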
Improving the results of a linear regression model
1. Removing Outliers:
• Explanation: Outliers (data points that are significantly different from others) can skew the
model and reduce its accuracy.
• Why it's effective: Removing or managing outliers allows the model to align more with most
of the data and perform better.
• Example: If a student with very few study hours (e.g., 1 hour) achieves a very high score (e.g.,
95), this could be an outlier and need to be removed or examined further.
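One common way to flag such points (a sketch, not the only approach) is to look for observations whose residual from the fitted line is unusually large, e.g. more than about two standard deviations:

```python
import numpy as np

# Hypothetical data, with one suspicious point: 1 hour of study but a score of 95
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 1])
score = np.array([52, 55, 61, 64, 70, 74, 79, 85, 95])

m, c = np.polyfit(hours, score, deg=1)
residuals = score - (m * hours + c)

# Rule of thumb: flag points whose residual is more than ~2 standard deviations from zero
flags = np.abs(residuals) > 2 * residuals.std()
print(flags)   # the (1 hour, 95) point is flagged for closer examination
```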
Improving the results of a linear regression model
2. Multicollinearity
Definition:
• Multicollinearity occurs when two or more predictors in a regression model are highly
correlated.
• This makes it difficult for the model to distinguish their individual effects on the
dependent variable.
Impact on the Model:
• Unstable Coefficients: Changes in data can lead to large variations in predictor coefficients.
• Reduced Interpretability: Difficult to determine the importance of individual predictors.
• Inflated Standard Errors: Wider confidence intervals, making predictors appear insignificant.
Improving the results of a linear regression model
Detecting Multicollinearity
Common Methods:
1. Correlation Matrix:
• High correlation between predictors (commonly |r| above about 0.8 or 0.9) is a warning sign.
2. Variance Inflation Factor (VIF):
• Measures how much the variance of a predictor is inflated due to multicollinearity.
• Rule of Thumb: a VIF above 10 is commonly taken to indicate severe multicollinearity (values above 5 already merit attention).
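A minimal sketch of computing VIF with statsmodels (hypothetical predictor data with deliberately correlated columns):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; size_sqft and num_rooms are deliberately correlated
X = pd.DataFrame({
    "size_sqft": [1500, 1700, 1600, 2100, 1400, 1800],
    "num_rooms": [3, 4, 3, 5, 2, 4],
    "age_years": [10, 5, 12, 3, 20, 8],
})

X_const = sm.add_constant(X)   # VIF is computed on the design matrix including the intercept
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif)   # values above ~10 are commonly read as severe multicollinearity
```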
Resolving Multicollinearity
Strategies to Handle Multicollinearity:
1. Remove One of the Correlated Predictors:
• Identify redundant predictors and exclude them.
2. Combine Predictors:
• Create a new variable (e.g., principal component analysis or a
combined index).
3. Regularization Methods:
• Use techniques like Ridge Regression or Lasso to reduce the impact
of correlated predictors.
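As a rough sketch of the regularization route (hypothetical house-price data; scikit-learn's Ridge and Lasso, with standardized predictors):

```python
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical training data: rows are houses, columns are correlated predictors
X_train = [[1500, 3, 10], [1700, 4, 5], [1600, 3, 12], [2100, 5, 3], [1400, 2, 20]]
y_train = [300_000, 340_000, 310_000, 420_000, 260_000]

# Ridge shrinks correlated coefficients toward each other; Lasso can drop some entirely
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_train, y_train)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1, max_iter=10_000)).fit(X_train, y_train)

print(ridge.named_steps["ridge"].coef_)
print(lasso.named_steps["lasso"].coef_)
```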
Linear Regression
Multivariate Regression
• In real-life use cases, there is usually more than one independent variable.
• Using several independent variables at once is called multivariate (multiple) regression.
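A minimal sketch with two predictors (hypothetical data; one coefficient is estimated per independent variable):

```python
from sklearn.linear_model import LinearRegression

# Hypothetical data: exam score explained by study hours and hours of sleep
X = [[2, 6], [4, 7], [6, 5], [8, 8], [10, 6]]   # columns: study hours, sleep hours
y = [55, 65, 70, 88, 90]

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)            # one coefficient per independent variable
print(model.predict([[5, 7]]))                  # predicted score for 5 study hours, 7 hours of sleep
```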
Model Selection and
Comparison
Techniques:
• R Square
• Adjusted R Square
• AIC (Akaike information criterion)
• BIC (Bayesian information criterion)
• Mallow’s Cp
What is R-Squared?
Definition:
• Measures how well a regression model explains the variability in the dependent variable based on the
independent variable(s).
Range: values range from 0 to 1:
• High: Large proportion of variability explained by the model.
• Low: Model does not explain much of the variability.
Example: Predicting house prices based on square footage:
• R² = 0.85: 85% of the variation in house prices is explained by square footage.
• R² = 0.20: Only 20% of the variation is explained.
Key Points:
• Higher values indicate a better fit.
• Used to evaluate the performance of regression models.
R-Squared Formula:
R² = explained variation / total variation = 1 − (SS_residual / SS_total)
Linear Regression
R-Squared for Goodness of fit
R-squared = 1510.01 / 1547.55 ≈ 0.98 (explained variation ÷ total variation)
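A minimal sketch of the same calculation in Python (hypothetical observed values and predictions; sklearn's r2_score uses the 1 − SS_res/SS_tot form):

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observed values and model predictions
y_true = np.array([52, 55, 61, 64, 70, 74, 79, 85])
y_pred = np.array([51, 57, 60, 65, 69, 75, 80, 84])

ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
print(1 - ss_res / ss_tot)                       # manual R-squared
print(r2_score(y_true, y_pred))                  # same value from sklearn
```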
Advantages and Limitations of R-Squared:
Advantages of R-Squared:
• Easy to Interpret: Provides a simple measure of how well the model fits the data.
• Useful for Model Comparison: Higher values suggest better models (on the same
dataset).
Limitations of R-Squared:
• Not Always Predictive: A high R² does not guarantee good predictions for new data.
• Sensitive to Overfitting: Adding more variables can artificially increase R².
• Correlation, Not Causation: R² only measures association, not causality.
Adjusted R-Square:
• To address the limitation of overfitting, it penalizes the model for including unnecessary variables.
• Adjusted R² = 1 − [(1 − R²)(n − 1)] / (n − p − 1), where n is the number of observations and p is the number of predictors.
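A quick check of the formula (assuming, for illustration, R² = 0.85 from a model with n = 50 observations and p = 3 predictors):

```python
# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2, n, p = 0.85, 50, 3
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))   # slightly below 0.85: the penalty for the extra predictors
```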
Akaike Information Criterion (AIC):
Definition: AIC estimates in-sample prediction error and compares model quality within
the same dataset.
Key Points:
• Lower AIC indicates a better model.
• Only valid for comparing models from the same dataset.
• Does not measure absolute model quality.
Bayesian Information Criterion (BIC):
Definition: A penalized-likelihood criterion derived from Bayesian probability, closely
related to AIC.
Key Points:
• Heavily penalizes complexity to favor simpler models.
• Lower BIC indicates a better model.
Formula: BIC = k·ln(n) − 2·ln(L̂), where k is the number of parameters, n is the sample size, and L̂ is the maximized likelihood.
Mallow’s Cp:
Definition: Compares precision and bias of the full model to models with subsets of
predictors.
Key Points:
• Calculates Cp for all variable combinations.
• Best model has Cp value closest to (number of predictors + 1).
Formula: Cp = SSE_p / MSE_full − n + 2(p + 1), where SSE_p is the error sum of squares of the candidate model with p predictors and MSE_full is the mean squared error of the full model.
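A minimal sketch of comparing two models on the same (hypothetical, simulated) dataset with statsmodels: lower AIC/BIC and higher adjusted R-squared favor a model.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: exam score explained by study hours and hours of sleep
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, 50)
sleep = rng.uniform(4, 9, 50)
score = 40 + 5 * hours + 2 * sleep + rng.normal(0, 5, 50)

X1 = sm.add_constant(np.column_stack([hours]))          # smaller model: hours only
X2 = sm.add_constant(np.column_stack([hours, sleep]))   # larger model: hours + sleep

m1 = sm.OLS(score, X1).fit()
m2 = sm.OLS(score, X2).fit()

# Comparisons are only meaningful between models fit to the same dataset
for name, m in [("hours only", m1), ("hours + sleep", m2)]:
    print(name, round(m.rsquared_adj, 3), round(m.aic, 1), round(m.bic, 1))
```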
Comparison of Adjusted R-Square, AIC, BIC, and Mallow's Cp:
Detailed Use Cases:
1. Adjusted R-Square:
• When to Use: When comparing multiple linear regression models with different
numbers of predictors.
• When Not to Use: When working with nonlinear models or datasets with many
predictors.
2. AIC (Akaike Information Criterion):
• When to Use: When working with more complex models, such as GLMs or time-series
models.
• When Not to Use: When simpler models are preferred, or the dataset is very large.
3. BIC (Bayesian Information Criterion):
• When to Use: When datasets are large and simpler models are desirable.
• When Not to Use: When high accuracy is needed, even if the model is more complex.
4. Mallow’s Cp:
• When to Use: When selecting the best subset of predictors in linear regression.
• When Not to Use: For nonlinear models or when there are too many predictors.
How to Decide Which Method to Use:
• Linear Models: Use Adjusted R-Square and Mallow’s Cp.
• Complex Models or Small Datasets: Use AIC.
• Large Datasets or Simplicity Preferred: Use BIC.
Stepwise Regression:
Definition:
• Stepwise Regression is a modeling technique that automatically decides
which independent variables (features) to include or remove from the
model.
• Goal: Create a simpler and more effective model by reducing unnecessary
predictors.
Stepwise Regression Methods:
1. Forward Selection:
• Start with no variables.
• Add predictors one by one based on their significance in the model.
2. Backward Elimination:
• Start with all predictors.
• Remove the least significant predictor one by one.
3. Stepwise Selection:
• Combine Forward Selection and Backward Elimination.
• Variables can be added or removed at each step.
Example: Predicting House Prices with Stepwise Regression
Predictors:
• House Size (X1)
• Number of Rooms (X2)
• House Age (X3)
• Wall Color (X4)
1. Forward Selection:
• Start with no predictors.
• Add House Size (X1) because it has the most significant impact.
• Add Number of Rooms (X2) if it improves model performance.
• Continue until no additional predictors improve the model.
2. Backward Elimination:
• Start with all predictors: X1, X2, X3, X4.
• Remove Wall Color (X4) if it has the least significance.
• Repeat until all remaining predictors are significant.
3. Stepwise Selection:
• Combine both methods. Variables can be added or removed based on
their impact on the model.
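A minimal forward-selection sketch in Python (hypothetical; it uses p-values as the inclusion criterion, whereas real implementations often use AIC or another criterion):

```python
import statsmodels.api as sm

def forward_selection(X, y, alpha=0.05):
    """Greedily add the predictor with the smallest p-value until none is significant."""
    remaining, selected = list(X.columns), []
    while remaining:
        pvals = {}
        for cand in remaining:
            design = sm.add_constant(X[selected + [cand]])
            pvals[cand] = sm.OLS(y, design).fit().pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:      # stop when the best remaining predictor is not significant
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Usage with hypothetical house-price data (column names follow the example above):
# forward_selection(df[["X1", "X2", "X3", "X4"]], df["price"])
```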
Advantages and Disadvantages of Stepwise
Regression:
•Advantages:
• Simplifies the model.
• Removes unnecessary or insignificant predictors.
• Fast and automatic.
•Disadvantages:
• Risk of Omitting Important Predictors:
• Can exclude significant variables, especially if predictors are highly correlated.
• Sensitive to Data:
• Results may vary with different datasets.
Linear Regression
Improving the results of a linear regression model
Adding More Independent Variables:
• Explanation: Considering only study hours as the independent variable might be insufficient.
You can add other variables, such as the amount of sleep, the number of tutoring classes, or
even the level of student stress.
• Why it's effective: Adding more independent variables can help the model consider more
factors that affect the score, thereby creating a more accurate model.
• Example: Suppose you add the amount of sleep to the model. Now your model might look like this:
Score = b0 + b1·(study hours) + b2·(hours of sleep)
Tools for Validation:
•Popular Techniques:
• Train-Validation Split: Directly splitting data into training and validation sets.
• K-Fold Cross-Validation: Dividing data into K subsets for repeated training and validation.
• Stratified K-Fold: Preserves class distribution during splitting.
K-Fold Example:
Imagine your dataset consists of the following 10 samples:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Using 5-Fold Cross-Validation, the data is split into 5 folds:
• Fold 1: [1, 2]
• Fold 2: [3, 4]
• Fold 3: [5, 6]
• Fold 4: [7, 8]
• Fold 5: [9, 10]
Iteration 1:
• Training Set: [3, 4, 5, 6, 7, 8, 9, 10]
• Validation Set: [1, 2]
Iteration 2:
• Training Set: [1, 2, 5, 6, 7, 8, 9, 10]
• Validation Set: [3, 4]
Iteration 3:
• Training Set: [1, 2, 3, 4, 7, 8, 9, 10]
• Validation Set: [5, 6]
Iteration 4:
• Training Set: [1, 2, 3, 4, 5, 6, 9, 10]
• Validation Set: [7, 8]
Iteration 5:
• Training Set: [1, 2, 3, 4, 5, 6, 7, 8]
• Validation Set: [9, 10]
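The same splits can be reproduced with scikit-learn's KFold (a minimal sketch; shuffle=False keeps the folds in order, matching the example above):

```python
from sklearn.model_selection import KFold

samples = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

kf = KFold(n_splits=5, shuffle=False)
for i, (train_idx, val_idx) in enumerate(kf.split(samples), start=1):
    train = [samples[j] for j in train_idx]
    val = [samples[j] for j in val_idx]
    print(f"Iteration {i}: train={train}, validation={val}")
```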
Polynomial Regression