University of Mumbai
Program – Bachelor of Engineering in
Computer Science and Engineering (Artificial Intelligence
and Machine Learning)
Class - T.E.
Course Code – CSDLO5011
Course Name – Statistics for Artificial
Intelligence and Data Science
By
Prof. A.V.Phanse
Correlation & Regression
Correlation is a statistical measure that describes the strength and direction of a
relationship between two variables.
It helps to understand how changes in one variable are associated with changes
in another.
Correlation can be positive, negative, or zero.
Positive Correlation: When one variable increases, the other also increases.
Negative Correlation: When one variable increases, the other decreases.
Zero Correlation: No consistent relationship exists between the variables.
The correlation is usually measured by a value called the correlation coefficient
(denoted as r), which ranges from -1 to +1:
r = +1: Perfect positive correlation.
r = -1: Perfect negative correlation.
r = 0: No correlation.
Example:
Consider two variables:
Hours studied (X)
Exam score (Y)
If we observe that as the number of hours a student studies increases, their exam
score tends to increase, this indicates a positive correlation.
For example:
2 hours studied → score of 60
4 hours studied → score of 75
6 hours studied → score of 90
In this case, more studying is associated with higher scores, indicating a positive
relationship.
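As a minimal Python sketch (assuming NumPy is installed; the variable names hours and scores are illustrative), Pearson's correlation coefficient r for the data above can be computed as:

import numpy as np

# Data from the example above: hours studied (X) vs. exam score (Y)
hours = np.array([2, 4, 6])
scores = np.array([60, 75, 90])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(hours, scores)[0, 1]
print(r)  # 1.0 here, since these toy points lie exactly on a straight line

The value r = 1.0 reflects that this three-point example is perfectly linear; real data would typically give a value between 0 and 1.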
Regression is a statistical method used to model and analyze the relationships
between a dependent variable (the outcome) and one or more independent
variables (predictors).
The primary goal of regression analysis is to predict the value of the dependent
variable based on the values of the independent variables and to understand how
the independent variables are associated with the dependent variable.
Types of Regression:
Linear Regression: The relationship between the dependent and independent
variables is modeled as a straight line (linear).
Multiple Regression: Similar to linear regression, but with more than one
independent variable.
Non-linear Regression: The relationship between the dependent and independent
variables is non-linear.
Logistic Regression: Used when the dependent variable is binary (e.g., yes/no,
0/1).
Key Differences Between Regression and Correlation:
Correlation quantifies the strength and direction of a relationship between two
variables but doesn't provide a model for predicting values.
Regression not only quantifies the relationship but also creates a predictive model,
allowing us to estimate outcomes based on input values.
Simple Linear Regression is a type of linear regression where we model the
relationship between two variables: one independent variable (predictor, X) and
one dependent variable (outcome, Y).
The goal is to fit a straight line through the data that best describes the relationship
between the two variables.
The Equation:
The simple linear regression model is represented by the equation:
Y = a + bX
Where:
Y is the predicted value of the dependent variable.
X is the independent variable.
a is the intercept (the predicted value of Y when X = 0).
b is the slope (the change in Y for each one-unit increase in X).
Example:
Let’s say we are interested in predicting a student's exam score (Y) based on the
number of hours studied (X). Here, X is the independent variable (hours studied),
and Y is the dependent variable (exam score).
Suppose after collecting data, we run a simple linear regression analysis and get the
following equation:
Y = 50 + 5X
This equation can be interpreted as follows:
Intercept (50): If a student does not study at all (X = 0), we predict that their score
will be 50.
Slope (5): For each additional hour of studying, the student’s score is expected to
increase by 5 points.
Predicting an Outcome:
If a student studies for 6 hours, their predicted exam score can be calculated by
plugging the value of X into the regression equation:
Y = 50 + 5(6) = 50 + 30 = 80
Thus, if a student studies for 6 hours, their expected exam score would be 80.
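As a sketch of how such an analysis could be run in Python (assuming SciPy is available; the data points are hypothetical, chosen to lie exactly on the line Y = 50 + 5X from the example):

from scipy.stats import linregress

# Hypothetical data consistent with the fitted line Y = 50 + 5X
hours  = [2, 4, 8, 10]
scores = [60, 70, 90, 100]

result = linregress(hours, scores)
print(result.intercept)  # a = 50.0
print(result.slope)      # b = 5.0

# Predicting the exam score for a student who studies 6 hours
print(result.intercept + result.slope * 6)  # 80.0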
Assumptions of Simple Linear Regression:
1. Linearity: The relationship between the independent and dependent variable is
linear.
2. Independence: The observations are independent of each other.
3. Homoscedasticity: The variance of the residuals (the differences between
observed and predicted values) is constant across all values of X.
4. Normality: The residuals are normally distributed.
Key Applications:
Predicting future outcomes (e.g., predicting sales based on advertising spend).
Understanding relationships (e.g., how much weight depends on caloric intake).
Building simple predictive models.
Method of least squares
The method of least squares is a fundamental technique used in regression
analysis to find the best-fitting line (regression line) through a set of data points.
It minimizes the sum of the squared differences between the observed values
and the values predicted by the regression line. These differences are called
residuals.
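A minimal Python sketch of the method (assuming NumPy; the data are the same hypothetical points used above), applying the closed-form least-squares formulas directly:

import numpy as np

# Hypothetical data points
x = np.array([2.0, 4.0, 8.0, 10.0])
y = np.array([60.0, 70.0, 90.0, 100.0])

# Least-squares estimates:
#   b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   a = y_bar - b * x_bar
x_bar, y_bar = x.mean(), y.mean()
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
a = y_bar - b * x_bar

residuals = y - (a + b * x)      # observed minus predicted values
print(a, b)                      # 50.0 5.0
print(np.sum(residuals ** 2))    # sum of squared residuals (0.0 for this exact fit)

These are the same a and b reported by linregress above, since both minimize the sum of squared residuals.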
Coefficient of Determination
The coefficient of determination (R²) measures the proportion of the variance in
the dependent variable that is explained by the regression model.
Since the R² value is 0.98, 98% of the variance in the test score can be explained
by the number of hours studied, indicating an excellent fit of the regression model.
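A short Python sketch (assuming NumPy; the observed and predicted values are hypothetical) showing how R² is computed from the residuals:

import numpy as np

# Hypothetical observed scores and the model's predictions for them
y_obs  = np.array([62.0, 69.0, 91.0, 98.0])
y_pred = np.array([60.0, 70.0, 90.0, 100.0])

# R² = 1 - SS_res / SS_tot
ss_res = np.sum((y_obs - y_pred) ** 2)        # residual sum of squares
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)                    # about 0.99 for these numbers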
Example from University Exam for Practice
Multiple linear regression
Multiple linear regression is an extension of simple linear regression: instead of
modeling the relationship between a single independent variable x and a
dependent variable y, we model the relationship between multiple independent
variables x1, x2, …, xk and the dependent variable y.
The model is represented by the equation:
y = a + b1x1 + b2x2 + … + bkxk
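A minimal Python sketch (assuming NumPy; the two predictors and the outcome values are hypothetical) that fits a multiple linear regression by least squares:

import numpy as np

# Hypothetical data: 4 observations, 2 predictors (x1, x2)
# The leading column of ones produces the intercept a
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 1.0],
              [1.0, 6.0, 5.0],
              [1.0, 8.0, 2.0]])
y = np.array([65.0, 68.0, 92.0, 88.0])

# Least-squares solution of X @ coeffs ≈ y; coeffs = [a, b1, b2]
coeffs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)

# Predicting y for a new observation with x1 = 5, x2 = 3
print(np.array([1.0, 5.0, 3.0]) @ coeffs)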
Example from University Exam for Practice
Thank You…