Comprehensive Machine Learning Tutorial:
Regression, Classification, and Evaluation
Polynomial Regression
Polynomial regression extends linear regression to capture non-linear relationships
between variables by introducing polynomial terms of the predictor variables[1]. Unlike
linear regression which fits a straight line, polynomial regression fits a curved line to
better represent complex patterns in data[2][3].
Mathematical Foundation:
Polynomial regression models the relationship between x and y with polynomial terms, fitting a curve through the data points[2]. The general form is:
$ y = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + ... + \beta_nx^n + \epsilon $
When to Use:
When the data points clearly cannot be fit well by a straight line
When you suspect non-linear relationships between variables
For modeling growth rates, decay processes, or curved phenomena[3]
Implementation Example:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Generate sample data: a cubic trend plus Gaussian noise
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
noise = rng.normal(0, 1, 100)
y = 0.5 * X.ravel() ** 3 + 0.2 * X.ravel() ** 2 + 0.1 * X.ravel() + noise
# Create polynomial features (degree 3)
poly_features = PolynomialFeatures(degree=3)
X_poly = poly_features.fit_transform(X)
# Fit the model
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y)
Key Considerations:
Higher degree polynomials can lead to overfitting
Polynomial regression is still linear in the coefficients
Cross-validation is essential for selecting optimal degree[4]
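To make the degree-selection point concrete, here is a minimal sketch using 5-fold cross-validation; the candidate degree range, random seed, and regenerated toy data are illustrative assumptions, not recommendations:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Regenerate the toy data so this snippet runs on its own
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * X.ravel() ** 3 + 0.2 * X.ravel() ** 2 + 0.1 * X.ravel() + rng.normal(0, 1, 100)

# Score each candidate degree with 5-fold cross-validation
for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"degree={degree}: mean CV MSE = {-scores.mean():.3f}")
The degree with the lowest cross-validated error is a reasonable choice; very high degrees will typically show the overfitting mentioned above.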
Ridge Regression
Ridge regression addresses limitations of ordinary least squares by adding L2
regularization to prevent overfitting and handle multicollinearity[5][6]. It adds a penalty
term proportional to the square of the magnitude of coefficients.
Mathematical Formulation:
Ridge regression minimizes:
$ RSS + \alpha \sum_{i=1}^{p} \beta_i^2 $
Where α is the regularization parameter controlling the strength of the penalty[5]. The
solution is:
$ \hat{\beta}_{ridge} = (X^TX + \alpha I)^{-1}X^Ty $
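As a quick sanity check of this closed form, the following sketch compares the direct matrix solution against scikit-learn's Ridge on made-up data; fit_intercept=False is assumed so both solve exactly the same problem:
import numpy as np
from sklearn.linear_model import Ridge

# Made-up data purely to illustrate the formula
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
alpha = 1.0

# Closed form: (X^T X + alpha * I)^(-1) X^T y
beta_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# scikit-learn solves the same problem when no intercept is fit
beta_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_closed, beta_sklearn))   # True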
Five Key Points of Ridge Regression:
1. Shrinks coefficients: Reduces coefficient magnitudes toward zero but never
exactly zero
2. Handles multicollinearity: Effective when predictors are highly correlated
3. Bias-variance tradeoff: Introduces small bias to significantly reduce variance[7]
4. Always has a solution: The matrix $ X^TX + \alpha I $ is always invertible for α > 0
5. Tuning parameter α: Controls regularization strength - larger α means more
shrinkage[6]
Implementation:
from sklearn.linear_model import Ridge
# Ridge regression with α = 1.0
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X_train, y_train)
Geometry Intuition in Regression
The geometric interpretation of regularized regression provides insight into why these
methods work[7]. In the parameter space:
Ridge regression: The constraint region is a sphere (L2 norm), leading to smooth
shrinkage
LASSO: The constraint region has corners (L1 norm), promoting sparsity
Elastic Net: Combines both geometric properties for balanced regularization[8]
This geometric perspective explains why LASSO can set coefficients to exactly zero while
Ridge cannot - the corner points of the L1 constraint region lie on the coordinate axes[9]
[10].
Regularized Linear Models
Regularized linear models add penalty terms to prevent overfitting and improve
generalization[7][11]. The three main types are:
Ridge (L2): $ RSS + \alpha_2 \sum \beta_i^2 $
LASSO (L1): $ RSS + \alpha_1 \sum |\beta_i| $
Elastic Net: $ RSS + \alpha_1 \sum |\beta_i| + \alpha_2 \sum \beta_i^2 $
Each method addresses different aspects of the bias-variance tradeoff and feature
selection[7].
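A small illustrative comparison (the synthetic data and penalty strengths are assumptions, not tuned values) shows how the three penalties differ in the sparsity they produce:
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Synthetic data: only the first three of ten features carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=200)

for name, model in [("Ridge", Ridge(alpha=1.0)),
                    ("LASSO", Lasso(alpha=0.1)),
                    ("Elastic Net", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    model.fit(X, y)
    print(f"{name}: {np.sum(model.coef_ == 0)} coefficients exactly zero")
Ridge typically keeps all coefficients nonzero, while the L1-based methods zero out some of the irrelevant features.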
LASSO Regression
LASSO (Least Absolute Shrinkage and Selection Operator) performs both variable
selection and regularization by using L1 penalty[12][9]. It was originally introduced in
geophysics and later popularized by Robert Tibshirani[9].
Why LASSO Creates Sparsity:
LASSO creates sparsity due to the geometric properties of the L1 penalty[12][10]:
The L1 constraint region has sharp corners at coordinate axes
The optimal solution often occurs at these corners where coefficients are exactly
zero
Unlike Ridge's smooth sphere, L1's diamond shape promotes exact zeros[9]
Mathematical Explanation:
The LASSO optimization problem:
$ \min_{\beta} \frac{1}{2n} ||y - X\beta||_2^2 + \alpha ||\beta||_1 $
The absolute value penalty creates non-differentiable points that force coefficients to
zero[13].
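One way to make this concrete is the soft-thresholding operator that appears in coordinate-descent solutions of this problem. The sketch below is a simplified illustration of that operator, not the exact solver scikit-learn uses:
import numpy as np

def soft_threshold(z, alpha):
    # Any coordinate whose unpenalized value lies in [-alpha, alpha] is set exactly to zero;
    # larger values are shrunk toward zero by alpha
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

print(soft_threshold(np.array([-1.5, -0.05, 0.02, 2.0]), alpha=0.1))
# the small values (-0.05, 0.02) become exactly zero; the large ones are shrunk by 0.1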
Benefits:
Automatic feature selection
Interpretable models with fewer variables
Better performance with irrelevant features[12][14]
Implementation:
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X_train, y_train)
# Check sparsity
non_zero_coeffs = np.sum(lasso_reg.coef_ != 0)
Elastic Net Regression
Elastic Net combines Ridge and LASSO penalties to overcome limitations of each method
alone[8]. It's particularly effective when dealing with groups of correlated features.
Mathematical Form:
$ \hat{\beta} = \arg\min_{\beta} \left( ||y - X\beta||^2 + \lambda_2||\beta||^2 + \lambda_1||\beta||_1 \right) $
Advantages:
Selects groups of correlated variables (unlike LASSO, which tends to pick only one from a group)
More stable than LASSO when features are highly correlated
Maintains sparsity while handling multicollinearity[8]
Implementation:
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5) # Equal L1 and L2 weight
elastic_net.fit(X_train, y_train)
The l1_ratio parameter controls the balance: 0=Ridge, 1=LASSO, 0.5=equal weight[11].
Logistic Regression
Logistic regression is a statistical method for binary classification that uses the logistic
function to model the probability of class membership[15][16]. Despite its name, it's a
classification algorithm, not regression[17].
Geometric Intuition:
Logistic regression finds a hyperplane that separates classes by modeling the probability
boundary[17]. In 2D, this appears as a line; in higher dimensions, it's a hyperplane. The
goal is to find the optimal decision boundary that maximizes the likelihood of the
observed data.
Mathematical Foundation:
Logistic regression models the probability:
$ P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + ... + \beta_px_p)}} $
Implementation:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
predictions = log_reg.predict(X_test)
probabilities = log_reg.predict_proba(X_test)
Perceptron Trick
The perceptron trick is a foundational algorithm that builds intuition for logistic regression[18].
It provides geometric intuition for how linear classifiers learn decision boundaries.
Algorithm:
1. Initialize weights randomly
2. For each misclassified point: $ w \leftarrow w + \eta \cdot y \cdot x $
3. Repeat until convergence
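A minimal sketch of this update rule, assuming labels in {-1, +1} and an appended bias column; the learning rate and epoch count are arbitrary:
import numpy as np

def perceptron_train(X, y, eta=0.1, epochs=100):
    # y is expected to contain labels in {-1, +1}; a bias column is appended to X
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            if yi * xi.dot(w) <= 0:        # misclassified (or on the boundary)
                w += eta * yi * xi         # w <- w + eta * y * x
                errors += 1
        if errors == 0:                    # all points classified correctly
            break
    return w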
Connection to Logistic Regression:
Perceptron: Hard classification (0/1 decisions)
Logistic: Soft classification (probabilities)
Both find linear decision boundaries
Logistic regression generalizes perceptron with probabilistic output[18]
Sigmoid Function
The sigmoid function is crucial for logistic regression, mapping any real number to the
range (0,1)[19][15].
Mathematical Properties:
$ \sigma(z) = \frac{1}{1 + e^{-z}} $
Key Characteristics:
S-shaped curve
σ(0) = 0.5
σ(z) → 1 as z → ∞ and σ(z) → 0 as z → -∞
Derivative: σ'(z) = σ(z)(1-σ(z))
Smooth and differentiable everywhere[19]
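A small numeric check of these properties (the test points are arbitrary):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # sigma'(z) = sigma(z) * (1 - sigma(z))

z = np.array([-10.0, 0.0, 10.0])
print(sigmoid(z))                  # approximately [0.00005, 0.5, 0.99995]
print(sigmoid_derivative(0.0))     # 0.25, the sigmoid's maximum slope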
Why Sigmoid for Classification:
Maps linear combination to probability
Smooth decision boundary
Mathematically convenient for optimization[19]
Gradient Descent for Logistic Regression
Gradient descent optimizes logistic regression by minimizing the log-loss (cross-entropy)
function[20].
Cost Function:
$ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \right] $
Gradient Calculation:
$ \frac{\partial J}{\partial \theta} = \frac{1}{m} X^T (h - y) $
Update Rule:
$ \theta \leftarrow \theta - \alpha \frac{\partial J}{\partial \theta} $
Implementation:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_logistic(X, y, learning_rate=0.01, iterations=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        h = sigmoid(X.dot(theta))              # predicted probabilities
        gradient = X.T.dot(h - y) / y.size     # gradient of the log-loss
        theta -= learning_rate * gradient      # descent step
    return theta
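A hypothetical usage of the function above on synthetic data; the data, seed, and manually added bias column are illustrative assumptions:
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(float)

# Prepend a column of ones so theta learns an intercept
X_bias = np.hstack([np.ones((X_demo.shape[0], 1)), X_demo])
theta = gradient_descent_logistic(X_bias, y_demo, learning_rate=0.1, iterations=5000)
preds = (sigmoid(X_bias.dot(theta)) >= 0.5).astype(float)
print("training accuracy:", (preds == y_demo).mean())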
Accuracy and Confusion Matrix
The confusion matrix is a fundamental tool for evaluating classification models[21][22]. It
provides detailed insights beyond simple accuracy.
Confusion Matrix Structure:
             Predicted 0    Predicted 1
Actual 0         TN             FP
Actual 1         FN             TP
Components:
True Positives (TP): Correctly predicted positive cases
True Negatives (TN): Correctly predicted negative cases
False Positives (FP): Incorrectly predicted as positive
False Negatives (FN): Incorrectly predicted as negative[21]
Accuracy Calculation:
$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $
Implementation:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
Precision, Recall, and F1-Score
These metrics provide deeper insights into model performance, especially for imbalanced
datasets[21][23][24].
Precision:
$ Precision = \frac{TP}{TP + FP} $
Measures the quality of positive predictions - "Of all positive predictions, how many were
correct?"[25]
Recall (Sensitivity):
$ Recall = \frac{TP}{TP + FN} $
Measures coverage of actual positives - "Of all actual positives, how many were
found?"[25]
F1-Score:
$ F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} $
The F1-score is the harmonic mean of precision and recall, providing a single metric that
balances both[23][24][25].
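As a quick worked example with made-up counts, suppose TP = 8, FP = 2, FN = 4. Then $ Precision = \frac{8}{10} = 0.80 $, $ Recall = \frac{8}{12} \approx 0.67 $, and $ F1 = 2 \times \frac{0.80 \times 0.67}{0.80 + 0.67} \approx 0.73 $.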
When to Use Each:
Precision: When false positives are costly (e.g., spam detection)
Recall: When false negatives are costly (e.g., disease diagnosis)
F1-Score: When you need balance, especially with imbalanced data[26][27]
Implementation:
from sklearn.metrics import precision_score, recall_score, f1_score
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
Rigid Registration (Computer Vision Context)
Rigid registration aligns images while preserving shape and size, using only translations
and rotations[28][29]. This technique has five key points:
1. Preserves distances: Maintains relative distances between all points
2. Six degrees of freedom: Three translations and three rotations in 3D[29]
3. Global transformation: Applies same transformation to entire image
4. Shape preservation: No deformation or scaling of objects
5. Applications: Medical imaging, satellite imagery, computer vision[28][30]
Mathematical Representation:
$ T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} $
Where R is rotation matrix and t is translation vector[31].
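A minimal 2D sketch (the rotation angle, translation, and points are arbitrary) showing that such a transform preserves pairwise distances:
import numpy as np

# Arbitrary 2D rigid transform: 30 degree rotation plus a translation
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
t = np.array([2.0, -1.0])

# Homogeneous form T = [[R, t], [0, 1]]
T = np.eye(3)
T[:2, :2] = R
T[:2, 2] = t

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
points_h = np.hstack([points, np.ones((3, 1))])        # homogeneous coordinates
transformed = (T @ points_h.T).T[:, :2]

# Rigid transforms preserve pairwise distances
d_before = np.linalg.norm(points[0] - points[1])
d_after = np.linalg.norm(transformed[0] - transformed[1])
print(np.allclose(d_before, d_after))                  # True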
Summary
This comprehensive tutorial covered essential machine learning concepts from
polynomial regression through classification metrics. Key takeaways include:
Polynomial regression extends linear models for non-linear relationships
Regularization techniques (Ridge, LASSO, Elastic Net) prevent overfitting and
enable feature selection
LASSO creates sparsity through L1 penalty geometry
Logistic regression uses sigmoid function for probabilistic classification
Evaluation metrics (precision, recall, F1) provide nuanced performance
assessment
Gradient descent optimizes model parameters iteratively
Understanding these concepts provides a solid foundation for machine learning
applications, from basic regression to advanced classification problems. The geometric
intuitions and mathematical foundations help build deeper understanding beyond just
applying algorithms.