
Comprehensive Machine Learning Tutorial: Regression, Classification, and Evaluation


Polynomial Regression

Polynomial regression extends linear regression to capture non-linear relationships between variables by introducing polynomial terms of the predictor variables[1]. Unlike linear regression which fits a straight line, polynomial regression fits a curved line to better represent complex patterns in data[2][3].

Mathematical Foundation:
Polynomial regression uses the relationship between variables x and y to find the best
way to draw a curve through data points[2]. The general form is:
$ y = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + ... + \beta_nx^n + \epsilon $

When to Use:

 When data points clearly won't fit a linear regression line

 When you suspect non-linear relationships between variables

 For modeling growth rates, decay processes, or curved phenomena[3]

Implementation Example:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Generate sample data with a cubic trend plus Gaussian noise (scale chosen for illustration)
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
noise = rng.normal(0, 1, 100)
y = 0.5 * X.ravel() ** 3 + 0.2 * X.ravel() ** 2 + 0.1 * X.ravel() + noise

# Create polynomial features (degree 3)
poly_features = PolynomialFeatures(degree=3)
X_poly = poly_features.fit_transform(X)

# Fit the model
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y)

Key Considerations:
 Higher degree polynomials can lead to overfitting

 Polynomial regression is still linear in the coefficients

 Cross-validation is essential for selecting optimal degree[4]
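
As a quick sketch of that last point, one way to choose the degree is to cross-validate a small pipeline over several candidate degrees. This reuses the X, y, and imports from the example above; the degree range and variable names are illustrative.

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Evaluate candidate degrees with 5-fold cross-validation (R^2 score by default)
best_degree, best_score = None, -np.inf
for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5).mean()
    if score > best_score:
        best_degree, best_score = degree, score

print("Selected degree:", best_degree)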

Ridge Regression

Ridge regression addresses limitations of ordinary least squares by adding L2 regularization to prevent overfitting and handle multicollinearity[5][6]. It adds a penalty term proportional to the square of the magnitude of coefficients.

Mathematical Formulation:
Ridge regression minimizes:
$ RSS + \alpha \sum_{i=1}^{p} \beta_i^2 $

Where α is the regularization parameter controlling the strength of the penalty[5]. The
solution is:
$ \hat{\beta}_{ridge} = (X^TX + \alpha I)^{-1}X^Ty $
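
As a quick illustration of this closed form (a sketch only; in practice scikit-learn's Ridge, shown below, is the better choice, and the intercept is ignored here for simplicity):

import numpy as np

def ridge_closed_form(X, y, alpha):
    """Ridge coefficients via (X^T X + alpha * I)^(-1) X^T y."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    # Solve the linear system rather than forming an explicit inverse
    return np.linalg.solve(A, X.T @ y)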

Five Key Points of Ridge Regression:

1. Shrinks coefficients: Reduces coefficient magnitudes toward zero but never exactly zero

2. Handles multicollinearity: Effective when predictors are highly correlated

3. Bias-variance tradeoff: Introduces small bias to significantly reduce variance[7]

4. Always has a solution: The matrix $ X^TX + \alpha I $ is always invertible

5. Tuning parameter α: Controls regularization strength - larger α means more shrinkage[6]

Implementation:

from sklearn.linear_model import Ridge

# Ridge regression with α = 1.0
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X_train, y_train)

Geometry Intuition in Regression


The geometric interpretation of regularized regression provides insight into why these
methods work[7]. In the parameter space:

 Ridge regression: The constraint region is a sphere (L2 norm), leading to smooth
shrinkage

 LASSO: The constraint region has corners (L1 norm), promoting sparsity

 Elastic Net: Combines both geometric properties for balanced regularization[8]

This geometric perspective explains why LASSO can set coefficients to exactly zero while Ridge cannot - the corner points of the L1 constraint region lie on the coordinate axes[9][10].
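
A small, hypothetical experiment makes this concrete: fit Ridge and LASSO on the same synthetic data with mostly irrelevant features and count the coefficients that end up exactly zero (the dataset and hyperparameters here are illustrative).

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic data where only 5 of 20 features are informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically 0
print("LASSO zero coefficients:", np.sum(lasso.coef_ == 0))  # typically several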

Regularized Linear Models

Regularized linear models add penalty terms to prevent overfitting and improve
generalization[7][11]. The three main types are:

Ridge (L2): $ RSS + \alpha_2 \sum \beta_i^2 $
LASSO (L1): $ RSS + \alpha_1 \sum |\beta_i| $
Elastic Net: $ RSS + \alpha_1 \sum |\beta_i| + \alpha_2 \sum \beta_i^2 $

Each method addresses different aspects of the bias-variance tradeoff and feature
selection[7].

LASSO Regression

LASSO (Least Absolute Shrinkage and Selection Operator) performs both variable
selection and regularization by using L1 penalty[12][9]. It was originally introduced in
geophysics and later popularized by Robert Tibshirani[9].

Why LASSO Creates Sparsity:


LASSO creates sparsity due to the geometric properties of the L1 penalty[12][10]:

 The L1 constraint region has sharp corners at coordinate axes

 The optimal solution often occurs at these corners where coefficients are exactly
zero

 Unlike Ridge's smooth sphere, L1's diamond shape promotes exact zeros[9]

Mathematical Explanation:
The LASSO optimization problem:
$ \min_{\beta} \frac{1}{2n} ||y - X\beta||_2^2 + \alpha ||\beta||_1 $

The absolute value penalty creates non-differentiable points that force coefficients to
zero[13].

Benefits:

 Automatic feature selection

 Interpretable models with fewer variables

 Better performance when many features are irrelevant[12][14]

Implementation:

from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X_train, y_train)

# Check sparsity
non_zero_coeffs = np.sum(lasso_reg.coef_ != 0)

Elastic Net Regression

Elastic Net combines Ridge and LASSO penalties to overcome limitations of each method
alone[8]. It's particularly effective when dealing with groups of correlated features.

Mathematical Form:
$ \hat{\beta} = \arg\min_{\beta} (||y - X\beta||^2 + \lambda_2||\beta||^2 + \lambda_1||\beta||_1) $

Advantages:

 Selects groups of correlated variables (unlike LASSO which picks one)

 More stable than LASSO when features are highly correlated

 Maintains sparsity while handling multicollinearity[8]

Implementation:

from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5) # Equal L1 and L2 weight
elastic_net.fit(X_train, y_train)

The l1_ratio parameter controls the balance: 0=Ridge, 1=LASSO, 0.5=equal weight[11].

Logistic Regression

Logistic regression is a statistical method for binary classification that uses the logistic
function to model the probability of class membership[15][16]. Despite its name, it's a
classification algorithm, not regression[17].

Geometric Intuition:
Logistic regression finds a hyperplane that separates classes by modeling the probability
boundary[17]. In 2D, this appears as a line; in higher dimensions, it's a hyperplane. The
goal is to find the optimal decision boundary that maximizes the likelihood of the
observed data.

Mathematical Foundation:
Logistic regression models the probability:
$ P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + ... + \beta_px_p)}} $

Implementation:

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
predictions = log_reg.predict(X_test)
probabilities = log_reg.predict_proba(X_test)

Perceptron Trick

The perceptron trick is a foundational algorithm for understanding logistic regression[18]. It provides geometric intuition for how linear classifiers learn decision boundaries.

Algorithm:

1. Initialize weights randomly

2. For each misclassified point: $ w \leftarrow w + \eta \cdot y \cdot x $

3. Repeat until convergence
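
A bare-bones sketch of this loop (assuming labels y ∈ {-1, +1} and NumPy arrays; the learning rate, epoch count, and names are illustrative):

import numpy as np

def perceptron(X, y, eta=0.1, epochs=100):
    """Perceptron trick: nudge w toward each misclassified point (y in {-1, +1})."""
    # Append a constant column so the bias is learned as part of w
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    rng = np.random.default_rng(0)
    w = rng.normal(size=Xb.shape[1])        # 1. initialize weights randomly
    for _ in range(epochs):                 # 3. repeat (fixed number of passes here)
        for xi, yi in zip(Xb, y):
            if yi * np.dot(w, xi) <= 0:     # misclassified point
                w += eta * yi * xi          # 2. w <- w + eta * y * x
    return w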

Connection to Logistic Regression:


 Perceptron: Hard classification (0/1 decisions)

 Logistic: Soft classification (probabilities)

 Both find linear decision boundaries

 Logistic regression generalizes perceptron with probabilistic output[18]

Sigmoid Function

The sigmoid function is crucial for logistic regression, mapping any real number to the
range (0,1)[19][15].

Mathematical Properties:
$ \sigma(z) = \frac{1}{1 + e^{-z}} $

Key Characteristics:

 S-shaped curve

 σ(0) = 0.5

 σ(∞) = 1, σ(-∞) = 0

 Derivative: σ'(z) = σ(z)(1-σ(z))

 Smooth and differentiable everywhere[19]
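
The derivative listed above follows directly from differentiating the quotient:

$ \sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)(1 - \sigma(z)) $

This identity is what keeps the gradient of the log-loss so simple in the optimization section below.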

Why Sigmoid for Classification:

 Maps linear combination to probability

 Smooth decision boundary

 Mathematically convenient for optimization[19]

Gradient Descent for Logistic Regression

Gradient descent optimizes logistic regression by minimizing the log-loss (cross-entropy) function[20].

Cost Function:
$ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)}))] $

Gradient Calculation:
$ \frac{\partial J}{\partial \theta} = \frac{1}{m} X^T (h - y) $

Update Rule:
$ \theta \leftarrow \theta - \alpha \frac{\partial J}{\partial \theta} $

Implementation:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent_logistic(X, y, learning_rate=0.01, iterations=1000):
    theta = np.zeros(X.shape[1])
    for i in range(iterations):
        h = sigmoid(X.dot(theta))              # predicted probabilities
        gradient = X.T.dot(h - y) / y.size     # (1/m) * X^T (h - y)
        theta -= learning_rate * gradient      # gradient step
    return theta

Accuracy and Confusion Matrix

The confusion matrix is a fundamental tool for evaluating classification models[21][22]. It provides detailed insights beyond simple accuracy.

Confusion Matrix Structure:

              Predicted 0    Predicted 1
Actual 0      TN             FP
Actual 1      FN             TP

Components:

 True Positives (TP): Correctly predicted positive cases

 True Negatives (TN): Correctly predicted negative cases

 False Positives (FP): Incorrectly predicted as positive

 False Negatives (FN): Incorrectly predicted as negative[21]

Accuracy Calculation:
$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $

Implementation:

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)

Precision, Recall, and F1-Score

These metrics provide deeper insights into model performance, especially for imbalanced
datasets[21][23][24].

Precision:
$ Precision = \frac{TP}{TP + FP} $
Measures the quality of positive predictions - "Of all positive predictions, how many were
correct?"[25]

Recall (Sensitivity):
$ Recall = \frac{TP}{TP + FN} $
Measures coverage of actual positives - "Of all actual positives, how many were
found?"[25]

F1-Score:
$ F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} $

The F1-score is the harmonic mean of precision and recall, providing a single metric that
balances both[23][24][25].
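
As a quick worked example with hypothetical counts: if TP = 80, FP = 20, and FN = 40, then precision = 80/100 = 0.80, recall = 80/120 ≈ 0.67, and F1 = 2 × (0.80 × 0.67)/(0.80 + 0.67) ≈ 0.73. The harmonic mean pulls the score toward the weaker of the two metrics.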

When to Use Each:

 Precision: When false positives are costly (e.g., spam detection)

 Recall: When false negatives are costly (e.g., disease diagnosis)

 F1-Score: When you need balance, especially with imbalanced data[26][27]

Implementation:

from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

Rigid Registration (Computer Vision Context)

Rigid registration aligns images while preserving shape and size, using only translations
and rotations[28][29]. This technique has five key points:

1. Preserves distances: Maintains relative distances between all points

2. Six degrees of freedom: Three translations and three rotations in 3D[29]

3. Global transformation: Applies same transformation to entire image

4. Shape preservation: No deformation or scaling of objects

5. Applications: Medical imaging, satellite imagery, computer vision[28][30]

Mathematical Representation:

$ T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} $

Where R is the rotation matrix and t is the translation vector[31].
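
As a minimal sketch (2D case for brevity; the rotation angle and translation values are made up), the homogeneous matrix can be built and applied with NumPy:

import numpy as np

def rigid_transform_2d(theta, tx, ty):
    """Homogeneous matrix T = [[R, t], [0, 1]] for a 2D rotation plus translation."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0,  1]])

# Rotate a point by 30 degrees and shift it by (2, 1)
T = rigid_transform_2d(np.deg2rad(30), 2.0, 1.0)
point = np.array([1.0, 0.0, 1.0])   # homogeneous coordinates
print(T @ point)                    # transformed point, shape and size preserved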

Summary

This comprehensive tutorial covered essential machine learning concepts from polynomial regression through classification metrics. Key takeaways include:

 Polynomial regression extends linear models for non-linear relationships

 Regularization techniques (Ridge, LASSO, Elastic Net) prevent overfitting and enable feature selection

 LASSO creates sparsity through L1 penalty geometry

 Logistic regression uses sigmoid function for probabilistic classification

 Evaluation metrics (precision, recall, F1) provide nuanced performance assessment

 Gradient descent optimizes model parameters iteratively

Understanding these concepts provides a solid foundation for machine learning applications, from basic regression to advanced classification problems. The geometric intuitions and mathematical foundations help build deeper understanding beyond just applying algorithms.
