UNIT-3: Regression and Logistic Regression (Detailed Notes)
📘 1. Regression – Concepts
📌 Definition:
Regression is a fundamental statistical technique used in data science and machine learning to explore the
relationship between a dependent variable (target) and one or more independent variables (predictors or
features). The objective of regression is not only to describe how the variables are related but also to
predict the value of the dependent variable when the values of the independent variables are known.
Regression falls under supervised learning, where the algorithm is trained on a labeled dataset.
📌 Objectives of Regression:
• To predict numerical (continuous) values.
• To identify and quantify relationships between variables.
• To support business decision-making with data-driven insights.
📌 Types of Regression:
1. Linear Regression (Simple and Multiple)
2. Polynomial Regression
3. Logistic Regression (used for classification)
4. Regularized Regression (Lasso, Ridge, Elastic Net)
📌 Real-life Applications:
• Predicting house prices
• Forecasting sales or stock prices
• Estimating the effect of study hours on exam scores
📘 2. Simple Linear Regression
📌 Definition:
Simple linear regression is the most basic form of regression analysis where the relationship between one
independent variable (X) and one dependent variable (Y) is modeled using a straight line. It assumes that
this relationship is linear and continuous.
📌 Mathematical Equation:
Y = β0 + β1 X + ϵ
Where:
• Y : Dependent variable (output)
• X : Independent variable (input)
• β0 : Intercept
• β1 : Slope of the line
• ϵ : Error term (residual)
📌 Example:
Suppose we want to predict the marks of a student based on the number of hours studied. We collect data
from multiple students, fit a line through the data points, and use that line for prediction.
📌 Graphical Representation:
Imagine a 2D scatter plot where:
• X-axis = Hours studied
• Y-axis = Marks obtained
• A straight line passes through the data points (best fit line)
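📌 Code Sketch:
A minimal sketch of simple linear regression in Python using scikit-learn; the hours/marks values below are made-up illustration data, not from a real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up illustration data: hours studied (X) and marks obtained (Y)
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
marks = np.array([35, 42, 50, 55, 61, 68, 74, 80])

model = LinearRegression()
model.fit(hours, marks)  # estimates β0 (intercept) and β1 (slope)

print("Intercept (β0):", model.intercept_)
print("Slope (β1):", model.coef_[0])
print("Predicted marks for 6.5 hours:", model.predict([[6.5]])[0])
```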
📘 3. Multiple Linear Regression
📌 Definition:
Multiple linear regression is an extension of simple linear regression. It models the relationship between a
dependent variable and two or more independent variables. It is used when the dependent variable is
influenced by multiple factors.
📌 Mathematical Model:
Y = β0 + β1 X1 + β2 X2 + ... + βn Xn + ϵ
📌 Example:
Predicting a house price based on:
• Area (in sq. ft.)
• Number of rooms
• Distance from city center
• Age of the property
All of these become independent variables contributing to the price (Y).
📌 Graph:
With two features, the model can be visualized in 3D: each axis represents one variable and the fit is a
plane (not a line). With more features, the fit is a hyperplane that cannot be drawn directly.
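📌 Code Sketch:
A minimal sketch of the house-price example with scikit-learn; the areas, room counts, distances, ages, and prices are all hypothetical illustration values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: [area sq.ft., rooms, distance from center (km), age (yrs)]
X = np.array([
    [1200, 3,  5, 10],
    [1500, 4,  8,  5],
    [ 800, 2,  3, 20],
    [2000, 5, 12,  2],
    [1000, 3,  6, 15],
])
y = np.array([250000, 320000, 180000, 410000, 210000])  # illustrative prices

model = LinearRegression().fit(X, y)
print("Intercept β0:", model.intercept_)
print("Coefficients β1..β4:", model.coef_)
print("Predicted price:", model.predict([[1300, 3, 7, 8]])[0])
```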
📘 4. Polynomial Regression
📌 Definition:
Polynomial regression models the relationship between the dependent and independent variables as an
nth degree polynomial. It is useful when the data shows a curvilinear trend that linear models cannot
capture.
📌 Mathematical Model:
Y = β0 + β1 X + β2 X² + β3 X³ + ... + βn Xⁿ + ϵ
📌 Example:
Predicting plant growth over time where the growth rate increases with time, slows down, and then stops –
a non-linear pattern best modeled with a quadratic or cubic polynomial.
📌 Graph:
Shows a U-shaped or S-shaped curve fitted through the data points, depending on the polynomial degree.
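📌 Code Sketch:
A minimal sketch of the plant-growth example: a degree-3 polynomial fitted with scikit-learn's PolynomialFeatures in a pipeline. The day/height values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Illustrative plant-growth data: growth speeds up, then levels off
days   = np.array([[1], [5], [10], [15], [20], [25], [30]])
height = np.array([2.0, 6.5, 14.0, 20.5, 24.0, 25.5, 26.0])  # cm

# Degree-3 polynomial: Y = β0 + β1X + β2X² + β3X³
poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(days, height)

print("Predicted height at day 18:", poly_model.predict([[18]])[0])
```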
📘 5. BLUE Assumptions (Best Linear Unbiased Estimator)
BLUE comes from the Gauss–Markov theorem, which states that when the assumptions below hold, the
ordinary least squares (OLS) estimator is the Best Linear Unbiased Estimator of the regression coefficients.
📌 Assumptions:
1. Linearity – The relationship between X and Y is linear.
2. Independence – Residuals (errors) are independent.
3. Homoscedasticity – Residuals have constant variance.
4. No Multicollinearity – Independent variables are not highly correlated.
5. Normality – Errors are normally distributed (not required by the Gauss–Markov theorem itself, but important for hypothesis testing and confidence intervals).
📌 Violations and Solutions:
• Multicollinearity ➝ Remove redundant variables, use PCA
• Heteroscedasticity ➝ Apply log or square root transformations
• Autocorrelation ➝ Use time series models
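📌 Code Sketch:
A short sketch of one diagnostic: checking multicollinearity with Variance Inflation Factors (VIF) via statsmodels. The predictor values are hypothetical; a VIF above roughly 5–10 is a common rule of thumb for flagging a problem.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; area and rooms are deliberately correlated
df = pd.DataFrame({
    "area":  [1200, 1500, 800, 2000, 1000, 1700],
    "rooms": [3, 4, 2, 5, 3, 4],
    "age":   [10, 5, 20, 2, 15, 7],
})

X = sm.add_constant(df)  # VIF is usually computed with an intercept included
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, "VIF:", variance_inflation_factor(X.values, i))
```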
📘 6. Least Squares Estimation (LSE)
📌 Definition:
Least Squares Estimation is a mathematical method used to determine the best-fit line in regression by
minimizing the sum of the squares of the differences between observed and predicted values.
📌 Mathematical Objective:
SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
Minimize this to find the optimal β values.
📌 Steps:
1. Assume a linear model: Y = β0 + β1 X
2. Compute residuals (difference between actual and predicted)
3. Square and sum the residuals (SSE)
4. Find values of β0 and β1 that minimize SSE
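📌 Code Sketch:
A minimal sketch of the closed-form LSE solution for simple linear regression in plain NumPy, following the four steps above; the data are illustrative.

```python
import numpy as np

# Illustrative data (same hours/marks idea as earlier)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([35, 42, 50, 55, 61, 68, 74, 80], dtype=float)

# Closed-form LSE for simple linear regression:
# β1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²,  β0 = ȳ − β1·x̄
x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

residuals = y - (beta0 + beta1 * x)
sse = np.sum(residuals ** 2)  # the quantity LSE minimizes
print(f"β0={beta0:.2f}, β1={beta1:.2f}, SSE={sse:.2f}")
```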
📌 Applications:
• Forecasting sales
• Estimating trends in financial data
📘 7. Variable Rationalization
📌 Definition:
Variable rationalization is the process of selecting the most relevant variables (features), transforming them
appropriately, and engineering new features to enhance model performance.
📌 Steps:
1. Feature Selection – Identify and keep the most predictive variables
2. Feature Transformation – Normalize or log-transform variables
3. Feature Engineering – Create new features (e.g., BMI from height and weight)
4. Dimensionality Reduction – Use PCA to reduce the number of variables
📌 Example:
In a student performance dataset, instead of using raw attendance and study hours, we create a composite
score like: Effort Index = Attendance × Study Hours
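📌 Code Sketch:
A short sketch of the Effort Index example with pandas, combining feature engineering and a simple min-max transformation; the attendance and study-hour figures are hypothetical.

```python
import pandas as pd

# Hypothetical student-performance data
df = pd.DataFrame({
    "attendance_pct": [90, 75, 60, 95, 80],
    "study_hours":    [10, 6, 4, 12, 8],
})

# Feature engineering: composite score from the example above
df["effort_index"] = df["attendance_pct"] * df["study_hours"]

# Feature transformation: min-max normalization of the new feature
rng = df["effort_index"].max() - df["effort_index"].min()
df["effort_index_norm"] = (df["effort_index"] - df["effort_index"].min()) / rng
print(df)
```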
📘 8. Model Building & Evaluation
📌 Steps in Building a Regression Model:
1. Data Collection – Gather relevant, clean data
2. Data Preprocessing – Handle missing values, outliers
3. Feature Selection & Engineering
4. Splitting Data – Train-test split (e.g., 80/20)
5. Model Training – Apply regression algorithm
6. Evaluation – Assess with performance metrics
📌 Evaluation Metrics:
• R² (Coefficient of Determination): Proportion of variance explained
• Adjusted R²: Adjusted for number of predictors
• MSE/MAE/RMSE: Error-based metrics (lower is better)
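📌 Code Sketch:
A minimal end-to-end sketch of steps 4–6 using scikit-learn on synthetic data; the true coefficients (3 and 2) and noise level are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Synthetic data standing in for a cleaned, preprocessed dataset
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=100)

# Step 4: 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 5: model training
model = LinearRegression().fit(X_train, y_train)

# Step 6: evaluation
y_pred = model.predict(X_test)
print("R²:  ", r2_score(y_test, y_pred))
print("MAE: ", mean_absolute_error(y_test, y_pred))
print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)
```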
📘 9. Logistic Regression
📌 Definition:
Logistic Regression is a classification algorithm used when the dependent variable is binary (Yes/No, 1/0). It
estimates the probability that a given input belongs to a certain category using the logistic (sigmoid)
function.
📌 Logistic Function:
P(Y = 1) = 1 / (1 + e^−(β0 + β1 X))
📌 Example:
Predicting whether a customer will buy a product (1) or not (0) based on age and income.
📌 Output:
A probability between 0 and 1. A threshold (e.g., 0.5) is then applied to convert the probability into a class label.
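📌 Code Sketch:
A minimal sketch of the customer-purchase example with scikit-learn's LogisticRegression; the ages, incomes, and buy/no-buy labels are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical customers: [age, income in $1000s]; 1 = bought, 0 = did not
X = np.array([[22, 25], [35, 60], [45, 80], [28, 40], [52, 95],
              [30, 30], [41, 70], [25, 28], [48, 85], [33, 50]])
y = np.array([0, 1, 1, 0, 1, 0, 1, 0, 1, 1])

clf = LogisticRegression().fit(X, y)

# predict_proba returns [P(Y=0), P(Y=1)]; apply a 0.5 threshold
proba = clf.predict_proba([[38, 65]])[0, 1]
print("P(buy):", proba, "→ class:", int(proba >= 0.5))
```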
📘 10. Logistic Model Evaluation Metrics
📌 Key Metrics:
• Confusion Matrix – TP, TN, FP, FN
• Accuracy – Correct predictions / Total predictions
• Precision – TP / (TP + FP)
• Recall – TP / (TP + FN)
• F1 Score – Harmonic mean of precision and recall
• ROC Curve & AUC – Model’s ability to distinguish between classes
• Pseudo R² – McFadden’s R² for logistic models
• AIC/BIC – Lower values indicate better fit (penalizes complexity)
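📌 Code Sketch:
A short sketch computing the main metrics above with sklearn.metrics on illustrative labels and probabilities. Pseudo R² and AIC/BIC are typically obtained from statsmodels instead and are not shown here.

```python
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Illustrative true labels, predicted labels, and predicted probabilities
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3, 0.95, 0.25]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_prob))  # uses probabilities
```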
📘 11. Business Applications of Regression Models
📌 Domain-wise Use Cases:
• Finance – Credit scoring, fraud detection, stock price forecasting
• Marketing – Customer segmentation, churn prediction, campaign effectiveness
• Healthcare – Disease diagnosis (e.g., diabetes prediction)
• Retail – Product recommendation, demand forecasting
• HR – Employee attrition modeling, recruitment analytics
• Manufacturing – Predictive maintenance, defect detection
• Transportation – Route optimization, delivery forecasting
• Education – Student dropout prediction, adaptive learning systems
📌 Tools Used:
Python, R, SQL, Excel, Power BI, Tableau