Vector Space V
A set with two operations (+, scalar mult ·) s.t.:
• (V, +) is an abelian group (closed, assoc., identity, inverse, comm.).
• Scalar mult: α(βv) = (αβ)v; 1v = v.
• Distributivity: (α + β)v = αv + βv; α(v + w) = αv + αw.
Examples: R^d (d-dim vectors), R^{n×d} (matrices).

Matrix Properties & Operations
For A, B ∈ R^{n×d}, α ∈ R:
• Addition: (A + B)_{ij} = A_{ij} + B_{ij}.
• Scalar Mult: (αA)_{ij} = αA_{ij}.
• Transpose (A⊤): Rows become cols. (A⊤)_{ij} = A_{ji}. Property: (A⊤)⊤ = A.
• Symmetric: A⊤ = A (must be square).
• Diagonal: Non-zeros only on the diagonal (A_{ij} = 0 if i ≠ j).

Vector & Matrix Products
For v ∈ R^d, w ∈ R^n:
• Inner Product (R^d): v⊤w = Σ_{i=1}^d v_i w_i ∈ R.
• Outer Product: vw⊤ ∈ R^{d×n}, where (vw⊤)_{ij} = v_i w_j.
For A ∈ R^{n×r}, B ∈ R^{r×d} → C = AB ∈ R^{n×d}:
• Matrix Product Defs:
1. Inner prod. (row-col): C_{ij} = A_{i·} B_{·j} = Σ_{s=1}^r A_{is} B_{sj}.
2. Outer prod. (col-row): C = Σ_{s=1}^r A_{·s} B_{s·}.
• Properties: Generally not commutative (AB ≠ BA). Associative: (AB)C = A(BC). Transpose: (AB)⊤ = B⊤A⊤.

Identity and Inverse
• Identity Matrix I: Diagonal matrix with ones. IA = AI = A.
• Inverse Matrix A⁻¹: For a square matrix A, if it exists, AA⁻¹ = A⁻¹A = I.
• Diagonal Inverse: For D = diag(d_1, ..., d_n), D⁻¹ = diag(1/d_1, ..., 1/d_n) if all d_i ≠ 0.

Norms and Trace
A norm ||·|| : V → R₊ measures length.
• Properties: ||v + w|| ≤ ||v|| + ||w|| (triangle ineq.); ||αv|| = |α|·||v|| (homogeneity); ||v|| = 0 ⟺ v = 0.
• Lp-norm (vector): ||v||_p = (Σ_{i=1}^d |v_i|^p)^{1/p}.
– p = 1: Manhattan norm, ||v||_1 = Σ_i |v_i|.
– p = 2: Euclidean norm, ||v||_2 = √(Σ_i v_i²).
• Frobenius Norm (matrix): ||A||_F = √(Σ_{i,j} A_{ij}²) = √(tr(A⊤A)).
The trace is the sum of diagonal elements: tr(A) = Σ_{i=1}^n A_{ii}.
• Properties: Linearity tr(cA + B) = c·tr(A) + tr(B); Transpose tr(A⊤) = tr(A); Cycling tr(ABCD) = tr(BCDA).

Orthogonality
• Inner Product Geometry: v⊤w = ||v||_2 · ||w||_2 · cos ∢(v, w).
• Orthogonal Vectors: v⊤w = 0.
• Orthonormal Vectors: Orthogonal and unit norm (||v||_2 = 1).
• Orthogonal Matrix: A square matrix A with orthonormal columns. This implies A⊤A = AA⊤ = I, so A⁻¹ = A⊤.
• Orthogonal Invariance: For an orthogonal matrix X, the L2 norm is invariant: ||Xv||_2 = ||v||_2 and ||XV||_F = ||V||_F.

Singular Value Decomposition (SVD)
For any matrix X ∈ R^{n×p} of rank k:
X = UΣV⊤
• U ∈ R^{n×n} is orthogonal. Its columns are the left singular vectors.
• V ∈ R^{p×p} is orthogonal. Its columns are the right singular vectors.
• Σ ∈ R^{n×p} is a rectangular diagonal matrix with k non-zero singular values σ_1 ≥ ··· ≥ σ_k > 0.
Geometric View: Any matrix transformation Ax can be seen as a rotation (V⊤), a scaling (Σ), and another rotation (U). Inverse via SVD: If A is square and invertible, A⁻¹ = VΣ⁻¹U⊤.

Derivatives
• Gradient ∇f(x): Vector of partial derivatives ∂f/∂x_i. Points in the direction of steepest ascent.
• Jacobian ∂f/∂x: Matrix of partials for a vector function f : R^d → R^c. (∂f/∂x)_{ij} = ∂f_i/∂x_j.
• Hessian ∇²f(x): Matrix of 2nd partial derivatives.
• Chain Rule: ∂f(g(x))/∂x = (∂f/∂g)(∂g/∂x).
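A minimal NumPy check of the SVD, orthogonal invariance, and trace facts above; the matrix sizes and random seed are illustrative, not from the sheet:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))          # arbitrary 5x3 matrix

# Full SVD: X = U @ Sigma @ Vt, with U and Vt orthogonal
U, S, Vt = np.linalg.svd(X, full_matrices=True)
Sigma = np.zeros_like(X)                 # rectangular diagonal Sigma
Sigma[:len(S), :len(S)] = np.diag(S)
assert np.allclose(X, U @ Sigma @ Vt)    # reconstruction

# Orthogonality of U: U^T U = I
assert np.allclose(U.T @ U, np.eye(5))

# Orthogonal invariance of the L2 norm: ||Uv||_2 = ||v||_2
v = rng.standard_normal(5)
assert np.isclose(np.linalg.norm(U @ v), np.linalg.norm(v))

# Trace cycling property: tr(AB) = tr(BA)
A, B = rng.standard_normal((3, 4)), rng.standard_normal((4, 3))
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
```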
Optimization Fundamentals
• Unconstrained: min_{x∈R^d} f(x).
• Constrained: min_{x∈R^d} f(x) subject to c_i(x) = 0 (equality) and g_k(x) ≥ 0 (inequality).
• Feasible Set C: All x satisfying the constraints.
• Minimizer: A point x∗ is a global minimizer if f(x∗) ≤ f(x) for all x. It is a local minimizer if this holds for all x in a neighborhood of x∗.

Analytic Solutions
• FONC (First-Order Necessary): At a local min x∗, ∇f(x∗) = 0.
• SONC (Second-Order Necessary): At a local min x∗, ∇²f(x∗) is positive semidefinite (z⊤Hz ≥ 0 for all z).

Convexity
• Convex Set X: For any x, y ∈ X, the line segment between them is in X: αx + (1 − α)y ∈ X for all α ∈ [0, 1].
• Convex Function f: The function's graph lies below its chords: f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y).
• Convex Problem: Minimizing a convex function over a convex set.
• Key Property: For convex problems, any local minimum is a global minimum.
• Convex Functions: Norms (||·||), the squared L2 norm (||·||_2²), linear/affine functions (Ax + b), non-negative sums of convex functions, and the composition of a convex function with an affine map.

Lagrangian Duality
For a constrained problem, the Lagrangian is:
L(x, λ, µ) = f(x) − Σ_i λ_i c_i(x) − Σ_k µ_k g_k(x)
with Lagrange multipliers λ_i, µ_k ≥ 0.
• The primal problem is the original problem.
• The dual function is g(λ, µ) = inf_x L(x, λ, µ).
• The dual problem is max_{λ, µ≥0} g(λ, µ).
• Weak Duality: p∗ ≥ d∗ (primal optimum ≥ dual optimum).
• Strong Duality: p∗ = d∗. Often holds for convex problems (e.g., under Slater's condition).

Gradient-Based Optimization
Iterative methods to find minima. General form: x_{k+1} = x_k + η_k p_k.
• Gradient Descent (GD): p_k = −∇f(x_k).
x_{t+1} ← x_t − η∇f(x_t)
• GD with Momentum: Adds inertia to accelerate convergence.
v_t = γv_{t−1} + η∇f(x_t)
x_{t+1} = x_t − v_t
• Coordinate Descent: Minimize along one coordinate at a time, fixing the others:
x_i^{(t+1)} ← arg min_{x_i} f(..., x_i, ...).
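A minimal sketch of the GD and momentum updates above on a small convex quadratic; the matrix A, η = 0.1, γ = 0.9, and iteration counts are illustrative choices:

```python
import numpy as np

# Minimize f(x) = 0.5 x^T A x - b^T x (convex quadratic), so grad f(x) = A x - b
A = np.array([[3.0, 0.2], [0.2, 1.0]])
b = np.array([1.0, -2.0])
grad = lambda x: A @ x - b

# Plain gradient descent: x_{t+1} = x_t - eta * grad(x_t)
x, eta = np.zeros(2), 0.1
for _ in range(500):
    x = x - eta * grad(x)

# GD with momentum: v_t = gamma v_{t-1} + eta grad(x_t); x_{t+1} = x_t - v_t
xm, v, gamma = np.zeros(2), np.zeros(2), 0.9
for _ in range(500):
    v = gamma * v + eta * grad(xm)
    xm = xm - v

x_star = np.linalg.solve(A, b)       # analytic minimizer for comparison
print(x, xm, x_star)                 # all three agree closely
```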
Regression Task Model
Goal: Find f : R^d → R to predict a continuous target y. Model: y = f∗(x) + ϵ, where ϵ is zero-mean noise. Linear Model: Use basis functions ϕ(x) for non-linearity: f(x) = β⊤ϕ(x).
• Affine: ϕ(x) = (1, x_1, ..., x_d)⊤.
• Polynomial (deg k): ϕ(x) includes terms like x_i, x_i x_j, x_i², ... up to degree k.
• Gaussian RBF: ϕ(x)_i = exp(−γ||x − µ_i||²).

Ordinary Least Squares (OLS)
Minimize the Residual Sum of Squares (RSS). The design matrix X has rows ϕ(x_i)⊤.
min_β ||y − Xβ||_2²
This is a convex problem.
• Solution (Normal Equations): X⊤Xβ = X⊤y.
• If X⊤X is invertible: β = (X⊤X)⁻¹X⊤y.
• If it is not invertible (e.g., p > n), infinitely many solutions exist.

Regularized Regression
Add a penalty on the weights β to prevent overfitting.
• Ridge Regression (L2 Penalty):
min_β ||y − Xβ||_2² + λ||β||_2²
Solution: β = (X⊤X + λI)⁻¹X⊤y. This is always unique for λ > 0. Shrinks weights towards zero.
• Lasso (L1 Penalty):
min_β ||y − Xβ||_2² + λ||β||_1
Promotes sparsity (sets some weights to exactly zero). Solved with coordinate descent; the update for β_k is a soft-thresholding operation.
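A sketch of ridge via the closed form and lasso via coordinate descent with soft-thresholding, assuming the unscaled RSS objectives written above (scaling conventions for λ vary); the data, λ, and iteration counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 100, 5, 10.0
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Ridge: beta = (X^T X + lam I)^{-1} X^T y (unique for lam > 0)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Lasso via coordinate descent: each coordinate update is a soft-thresholding step
def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

beta = np.zeros(p)
for _ in range(200):                                  # full sweeps over coordinates
    for k in range(p):
        r_k = y - X @ beta + X[:, k] * beta[k]        # partial residual excluding feature k
        z = X[:, k] @ r_k
        beta[k] = soft_threshold(z, lam / 2) / (X[:, k] @ X[:, k])

print(beta_ridge)       # all weights shrunk, none exactly zero
print(beta)             # coefficients below the threshold are set exactly to zero
```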
Bias-Variance Tradeoff (Regression)
Decomposition of the Expected Prediction Error (EPE):
E[(y − f_D(x))²] = E[(y − f∗(x))²] + (f∗(x) − E_D[f_D(x)])² + E_D[(f_D(x) − E_D[f_D(x)])²]
= Noise (σ²) + Bias² + Variance.
• Bias: Error from wrong assumptions. High bias → underfitting.
• Variance: Error from sensitivity to small training set fluctuations. High variance → overfitting.
• Evaluation: Use a test set or Cross-Validation (CV) to estimate the generalization error.

Classification Fundamentals
Goal: Predict a discrete class label ŷ ∈ {1, ..., c} for input x. Model: A function f : R^d → R^c gives scores (logits). Prediction: ŷ = arg max_l f(x)_l. Probabilistic View: Assume the data comes from p∗(x, y). The best possible classifier is the Bayes Optimal Classifier:
y∗(x) = arg max_{l∈{1,...,c}} p∗(y = l | x)
Its error is the Bayes error rate, the irreducible error.

Evaluation Metrics
• 0-1 Loss: L_{01}(y, ŷ) = 1 if y ≠ ŷ, else 0.
• Accuracy: 1 − (mean 0-1 loss) = #Correct / #Total. Can be misleading for imbalanced classes.
• Confusion Matrix: Tabulates TP, FP, TN, FN.
• Precision (positive predictive value): TP / (TP + FP).
• Recall (sensitivity, true positive rate): TP / (TP + FN).
• F1-Score: Harmonic mean of precision and recall: 2 · (Prec · Recall) / (Prec + Recall).
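A minimal sketch computing the metrics above from raw predictions; the toy binary labels are illustrative:

```python
import numpy as np

# Toy binary labels (1 = positive) and predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tn = np.sum((y_pred == 0) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

accuracy  = (tp + tn) / len(y_true)            # = 1 - mean 0-1 loss
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```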
Algorithms Overview
• k-Nearest Neighbors (k-NN): Non-parametric. Predicts the majority class of the k closest training points (see the sketch after this list). Simple, but can be slow at test time.
• Naive Bayes: Probabilistic. Uses Bayes' theorem with the "naive" assumption of conditional independence of the features: p(x|y) = Π_k p(x_k|y). Very fast.
• Decision Trees (CART): Tree-based model. Learns a hierarchy of if/else questions. Interpretable but prone to overfitting. Splits are chosen to maximize Information Gain (reduce impurity).
• Random Forests: Ensemble of many deep, decorrelated decision trees. Trained on bootstrap samples of the data and random subsets of the features. Robust and powerful.
• Support Vector Machines (SVM): Finds the hyperplane that maximizes the margin between classes. The kernel trick (k(x_i, x_j) = ϕ(x_i)⊤ϕ(x_j)) allows it to learn non-linear boundaries.
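A minimal k-NN sketch matching the description above (Euclidean distance, majority vote); the toy data and k = 3 are illustrative:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the majority class among the k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                      # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                     # majority vote

# Tiny illustrative dataset: two well-separated groups
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9])))   # -> 1
```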
SVM Objective Duality
Primal (Soft-Margin):
min_{w,b,ξ} (1/2)||w||² + C Σ_{i=1}^n ξ_i
s.t. y_i(w⊤x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0.
Dual (with Kernel Trick):
max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j)
s.t. Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C.

Architecture: MLP
• Neuron: A single computational unit: output = ϕ(w⊤x + b).
• Layer: A vector of neurons. Output of layer l: h^(l) = ϕ(W^(l) h^(l−1) + b^(l)).
• Multi-Layer Perceptron (MLP): A stack of layers.
• Activation Fns (ϕ): Introduce non-linearity.
– ReLU: ϕ(z) = max(0, z). Standard choice.
– Sigmoid: ϕ(z) = 1/(1 + e^{−z}). Output in [0, 1].
– Tanh: ϕ(z) = tanh(z). Output in [−1, 1].
• Universal Approx. Thm.: An MLP with one hidden layer can approximate any continuous function.

Training NNs
• Loss Function: For classification, typically the Cross-Entropy Loss: L(y, p̂) = −log(p̂_y), where p̂ = softmax(logits).
• Backpropagation: Algorithm to efficiently compute the gradient of the loss w.r.t. all network weights (θ) by applying the chain rule layer-by-layer backwards.
• Optimizers:
– SGD: Update the weights using the gradient from a small mini-batch of data, not the full dataset: θ ← θ − η∇_θ L_batch.
– Often combined with Momentum or more advanced methods like Adam.
• Regularization:
– Weight Decay (L2): Add λ||θ||_2² to the loss.
– Dropout: During training, randomly set a fraction of the neuron activations to zero for each forward pass.
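A minimal sketch of an MLP forward pass, the cross-entropy loss, and one SGD step; for brevity only the output-layer part of backpropagation is written out, and all shapes, seeds, and the step size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# One-hidden-layer MLP: h = relu(X W1 + b1), logits = h W2 + b2
d, hdim, c, n = 4, 8, 3, 16
W1, b1 = 0.1 * rng.standard_normal((d, hdim)), np.zeros(hdim)
W2, b2 = 0.1 * rng.standard_normal((hdim, c)), np.zeros(c)

X = rng.standard_normal((n, d))                 # a mini-batch of inputs
y = rng.integers(0, c, size=n)                  # integer class labels

h = relu(X @ W1 + b1)                           # hidden layer
p = softmax(h @ W2 + b2)                        # class probabilities
loss = -np.mean(np.log(p[np.arange(n), y]))     # cross-entropy L = -log p_y

# Chain rule at the output layer: dL/dlogits = (p - onehot(y)) / n
dlogits = (p - np.eye(c)[y]) / n
gW2, gb2 = h.T @ dlogits, dlogits.sum(axis=0)

eta = 0.1                                        # one SGD step on this mini-batch
W2 -= eta * gW2
b2 -= eta * gb2
print(loss)
```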
Convolutional NNs (CNNs)
Specialized for grid-like data (e.g., images).
• Convolution Operation: Slides a small, learnable filter (kernel) over the input, computing dot products to create a feature map: (X ⋆ K)_{ij} = Σ_{a,b} X_{i+a, j+b} K_{a,b}.
• Inductive Biases:
– Locality: Filters process local patches.
– Parameter Sharing: The same filter is used across the entire input.
– Translation Equivariance: The output shifts with the input.
• Pooling Layer: Downsamples feature maps; e.g., Max Pooling takes the maximum value in a window.
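A direct implementation of the (valid, no-padding) convolution formula above; the toy image and 2×2 filter are illustrative:

```python
import numpy as np

def conv2d(X, K):
    """Valid convolution as written above: out[i, j] = sum_{a,b} X[i+a, j+b] * K[a, b]."""
    H, W = X.shape
    kh, kw = K.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + kh, j:j + kw] * K)   # dot product with the patch
    return out

X = np.arange(25, dtype=float).reshape(5, 5)    # toy 5x5 "image"
K = np.array([[1.0, 0.0], [0.0, -1.0]])         # 2x2 filter, shared across all positions
print(conv2d(X, K))                              # 4x4 feature map
```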
k-Means Clustering
Goal: Partition n data points into r clusters.
• Objective: Minimize the within-cluster sum of squares (WCSS):
min_{C_1,...,C_r} Σ_{s=1}^r Σ_{x_i∈C_s} ||x_i − µ_s||², where µ_s = (1/|C_s|) Σ_{x∈C_s} x.
• Lloyd's Algorithm: Iterative approach:
1. Assign: Assign each point to its nearest centroid µ_s.
2. Update: Recalculate each centroid µ_s as the mean of the points assigned to it.
• Matrix Factorization View: Equivalent to min_{Y,X} ||D − YX⊤||_F² where Y ∈ {0,1}^{n×r} is a cluster assignment matrix and X contains the centroids.
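A minimal sketch of Lloyd's algorithm as described above; initializing centroids from random data points, the toy blobs, and the iteration count are illustrative choices:

```python
import numpy as np

def kmeans(D, r, n_iter=50, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    mu = D[rng.choice(len(D), size=r, replace=False)]          # init centroids from data
    for _ in range(n_iter):
        # Assign step: nearest centroid for every point
        dists = np.linalg.norm(D[:, None, :] - mu[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        for s in range(r):
            if np.any(z == s):
                mu[s] = D[z == s].mean(axis=0)
    return z, mu

rng = np.random.default_rng(1)
D = np.vstack([rng.standard_normal((50, 2)), rng.standard_normal((50, 2)) + 5.0])
labels, centroids = kmeans(D, r=2)
print(centroids)        # approximately (0, 0) and (5, 5)
```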
Spectral Clustering
Graph-based clustering.
1. Build Similarity Graph: Construct a weighted adjacency matrix W from the data (e.g., using k-NN or an RBF kernel).
2. Compute Graph Laplacian: L = D_W − W, where D_W is the diagonal degree matrix with (D_W)_{ii} = Σ_j W_{ij}.
3. Embedding: Compute the first r eigenvectors of L (corresponding to the smallest eigenvalues) to form an embedding matrix A ∈ R^{n×r}.
4. Cluster: Run k-means on the rows of the embedding matrix A to get the final cluster assignments.
This method can find non-convex clusters.
Principal Component Analysis (PCA)
Goal: Find a low-dimensional (r-dim) linear projection of the data that maximizes variance. Objective: Find a matrix Z ∈ R^{d×r} with orthonormal columns that solves:
max_Z tr(Z⊤C⊤CZ) s.t. Z⊤Z = I
where C is the centered data matrix (D − 1µ⊤). Solution: The columns of Z (the principal components) are the first r right singular vectors of C (or eigenvectors of the covariance matrix C⊤C). The data is projected via DZ.
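A minimal PCA-via-SVD sketch following the solution above (components = top right singular vectors of the centered data); the toy data and r = 2 are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # toy data, n x d

# Center the data: C = D - 1 mu^T
mu = D.mean(axis=0)
C = D - mu

# Principal components = first r right singular vectors of C
r = 2
U, S, Vt = np.linalg.svd(C, full_matrices=False)
Z = Vt[:r].T                      # d x r with orthonormal columns (Z^T Z = I)
assert np.allclose(Z.T @ Z, np.eye(r))

# Projection (the sheet writes DZ; using the centered C only shifts the scores)
scores = C @ Z                    # n x r low-dimensional representation
print(Z.shape, scores.shape)
```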
Low-Rank Matrix Factorization (for RecSys)
Goal: Approximate a data matrix D ∈ R^{n×d} (e.g., user-item ratings) with a low-rank product.
• Objective (Full data):
min_{Y∈R^{n×r}, X∈R^{d×r}} ||D − YX⊤||_F²
The solution is given by the Truncated SVD of D: D_r = U_r Σ_r V_r⊤.
• Objective (Missing Data): Let O be a binary mask of the observed entries.
min_{Y,X} ||O ∘ (D − YX⊤)||_F² + λ(||Y||_F² + ||X||_F²)
Solved using Alternating Least Squares (ALS), a form of block coordinate descent.
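A sketch of ALS for the missing-data objective above: with X fixed, each row of Y solves a small ridge problem over its observed entries, and symmetrically for X. The synthetic ratings, mask density, λ, and iteration counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r, lam = 30, 20, 3, 0.1

# Synthetic low-rank "ratings" with a random observation mask O
D_full = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
O = rng.random((n, d)) < 0.5                      # True where an entry is observed

Y = rng.standard_normal((n, r))
X = rng.standard_normal((d, r))

for _ in range(20):                               # ALS sweeps (block coordinate descent)
    for i in range(n):                            # update user factors, X fixed
        obs = O[i]
        if not obs.any():
            continue
        Xi = X[obs]                               # item factors of observed items for user i
        Y[i] = np.linalg.solve(Xi.T @ Xi + lam * np.eye(r), Xi.T @ D_full[i, obs])
    for j in range(d):                            # update item factors, Y fixed
        obs = O[:, j]
        if not obs.any():
            continue
        Yj = Y[obs]
        X[j] = np.linalg.solve(Yj.T @ Yj + lam * np.eye(r), Yj.T @ D_full[obs, j])

err = np.sqrt(np.mean((D_full - Y @ X.T)[O] ** 2))
print(err)                                        # small error on the observed entries
```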