Cheatsheet 2

The document discusses vector spaces, norms, and matrix properties, emphasizing concepts like convexity, optimization, and regression techniques. It covers various mathematical operations, including inner and outer products, singular value decomposition, and optimization methods such as gradient descent and regularization. Additionally, it addresses classification fundamentals and the bias-variance tradeoff in machine learning models.

Vector Space V
A set with two operations (+, scalar mult ·) s.t.:
• (V, +) is an abelian group (closed, assoc., identity, inverse, comm.).
• Scalar mult: α(βv) = (αβ)v; 1v = v.
• Distributivity: (α + β)v = αv + βv; α(v + w) = αv + αw.
Examples: R^d (d-dim vectors), R^{n×d} (matrices).

Matrix Properties & Operations
For A, B ∈ R^{n×d}, α ∈ R:
• Addition: (A + B)_ij = A_ij + B_ij.
• Scalar Mult: (αA)_ij = α A_ij.
• Transpose (A⊤): rows become cols, (A⊤)_ij = A_ji. Property: (A⊤)⊤ = A.
• Symmetric: A⊤ = A (must be square).
• Diagonal: non-zero entries only on the diagonal (A_ij = 0 if i ≠ j).
Vector & Matrix Products
For v ∈ R^d, w ∈ R^n:
• Inner Product (R^d): v⊤w = Σ_{i=1}^d v_i w_i ∈ R.
• Outer Product: vw⊤ ∈ R^{d×n}, where (vw⊤)_ij = v_i w_j.
For A ∈ R^{n×r}, B ∈ R^{r×d} → C = AB ∈ R^{n×d}:
• Matrix Product Defs:
  1. Inner prod. (row-col): C_ij = A_{i·} B_{·j} = Σ_{s=1}^r A_is B_sj.
  2. Outer prod. (col-row): C = Σ_{s=1}^r A_{·s} B_{s·}.
• Properties: Generally not commutative (AB ≠ BA). Associative: (AB)C = A(BC). Transpose: (AB)⊤ = B⊤A⊤.

Identity and Inverse
• Identity Matrix I: diagonal matrix with ones, IA = AI = A.
• Inverse Matrix A^{-1}: for a square matrix A, if it exists, AA^{-1} = A^{-1}A = I.
• Diagonal Inverse: for D = diag(d_1, . . . , d_n), D^{-1} = diag(1/d_1, . . . , 1/d_n) if all d_i ≠ 0.
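Below is a minimal NumPy sketch (array names, sizes, and data are illustrative, not from the sheet) that checks these definitions numerically: the two views of the matrix product and the transpose rule.

import numpy as np

rng = np.random.default_rng(0)
v, w = rng.normal(size=4), rng.normal(size=4)
A, B = rng.normal(size=(3, 5)), rng.normal(size=(5, 2))

inner = v @ w                      # sum_i v_i w_i, a scalar
outer = np.outer(v, w)             # matrix with entries v_i w_j
C = A @ B                          # row-col definition of the matrix product

# Outer-product (col-row) view of the same product: C = sum_s A[:, s] B[s, :]
C_outer = sum(np.outer(A[:, s], B[s, :]) for s in range(A.shape[1]))
assert np.allclose(C, C_outer)
assert np.allclose((A @ B).T, B.T @ A.T)   # transpose rule (AB)^T = B^T A^T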
Norms and Trace
A norm ||·|| : V → R_+ measures length.
• Properties: ||v + w|| ≤ ||v|| + ||w|| (triangle ineq.); ||αv|| = |α| · ||v|| (homogeneity); ||v|| = 0 ⟺ v = 0.
• L_p-norm (vector): ||v||_p = (Σ_{i=1}^d |v_i|^p)^{1/p}.
  – p = 1: Manhattan norm, ||v||_1 = Σ_i |v_i|.
  – p = 2: Euclidean norm, ||v||_2 = sqrt(Σ_i v_i^2).
• Frobenius norm (matrix): ||A||_F = sqrt(Σ_{i,j} A_ij^2) = sqrt(tr(A⊤A)).
The trace is the sum of the diagonal elements: tr(A) = Σ_{i=1}^n A_ii.
• Properties: Linearity tr(cA + B) = c tr(A) + tr(B); Transpose tr(A⊤) = tr(A); Cycling tr(ABCD) = tr(BCDA).
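A quick NumPy check of the norm definitions and trace identities above (values are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
v = rng.normal(size=5)
A, B = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

l1 = np.sum(np.abs(v))                  # Manhattan norm ||v||_1
l2 = np.sqrt(np.sum(v ** 2))            # Euclidean norm ||v||_2
fro = np.sqrt(np.trace(A.T @ A))        # Frobenius norm via the trace
assert np.isclose(l1, np.linalg.norm(v, 1))
assert np.isclose(l2, np.linalg.norm(v, 2))
assert np.isclose(fro, np.linalg.norm(A, "fro"))

# Trace properties: linearity, transpose invariance, cyclic permutation
c = 2.5
assert np.isclose(np.trace(c * A + B), c * np.trace(A) + np.trace(B))
assert np.isclose(np.trace(A.T), np.trace(A))
assert np.isclose(np.trace(A @ B), np.trace(B @ A))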
Orthogonality
• Inner Product Geometry: v⊤w = ||v||_2 · ||w||_2 cos ∢(v, w).
• Orthogonal Vectors: v⊤w = 0.
• Orthonormal Vectors: orthogonal and unit norm (||v||_2 = 1).
• Orthogonal Matrix: a square matrix A with orthonormal columns. This implies A⊤A = AA⊤ = I, so A^{-1} = A⊤.
• Orthogonal Invariance: for an orthogonal matrix X, the L2 norm is invariant: ||Xv||_2 = ||v||_2 and ||XV||_F = ||V||_F.

Singular Value Decomposition (SVD)
For any matrix X ∈ R^{n×p} of rank k:
  X = UΣV⊤
• U ∈ R^{n×n} is orthogonal. Its columns are the left singular vectors.
• V ∈ R^{p×p} is orthogonal. Its columns are the right singular vectors.
• Σ ∈ R^{n×p} is a rectangular diagonal matrix with k non-zero singular values σ_1 ≥ · · · ≥ σ_k > 0.
Geometric View: any matrix transformation Ax can be seen as a rotation (V⊤), a scaling (Σ), and another rotation (U).
Inverse via SVD: if A is square and invertible, A^{-1} = VΣ^{-1}U⊤.
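An illustrative NumPy sketch of the decomposition and of the SVD-based inverse; the matrices are random and the variable names are my own:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))

U, s, Vt = np.linalg.svd(X, full_matrices=True)   # s holds σ_1 ≥ ... ≥ σ_k
Sigma = np.zeros(X.shape)
np.fill_diagonal(Sigma, s)

assert np.allclose(U @ U.T, np.eye(5))            # U orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(3))          # V orthogonal
assert np.allclose(U @ Sigma @ Vt, X)             # X = U Σ V^T

A = rng.normal(size=(4, 4))                       # square, (almost surely) invertible
Ua, sa, Vta = np.linalg.svd(A)
A_inv = Vta.T @ np.diag(1.0 / sa) @ Ua.T          # A^{-1} = V Σ^{-1} U^T
assert np.allclose(A_inv, np.linalg.inv(A))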
Derivatives
• Gradient ∇f(x): vector of partial derivatives ∂f/∂x_i. Points in the direction of steepest ascent.
• Jacobian ∂f/∂x: matrix of partials for a vector-valued function f : R^d → R^c, (∂f/∂x)_ij = ∂f_i/∂x_j.
• Hessian ∇²f(x): matrix of 2nd partial derivatives.
• Chain Rule: ∂f(g(x))/∂x = (∂f/∂g)(∂g/∂x).
Optimization Fundamentals
• Unconstrained: min_{x∈R^d} f(x).
• Constrained: min_{x∈R^d} f(x) subject to c_i(x) = 0 (equality) and g_k(x) ≥ 0 (inequality).
• Feasible Set C: all x satisfying the constraints.
• Minimizer: a point x* is a global minimizer if f(x*) ≤ f(x) for all x. It is a local minimizer if this holds for all x in a neighborhood of x*.

Convexity
• Convex Set X: for any x, y ∈ X, the line segment between them is in X: αx + (1 − α)y ∈ X for all α ∈ [0, 1].
• Convex Function f: the function's graph lies below its chords: f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y).
• Convex Problem: minimizing a convex function over a convex set.
• Key Property: for convex problems, any local minimum is a global minimum.
• Convex Functions: norms (||·||), squared L2 norm (||·||_2^2), linear/affine functions (Ax + b), non-negative sums of convex functions, composition of a convex function with an affine map.

Lagrangian Duality
For a constrained problem, the Lagrangian is
  L(x, λ, µ) = f(x) − Σ_i λ_i c_i(x) − Σ_k µ_k g_k(x)
with Lagrange multipliers λ_i, µ_k ≥ 0.
• The primal problem is the original problem.
• The dual function is g(λ, µ) = inf_x L(x, λ, µ).
• The dual problem is max_{λ, µ≥0} g(λ, µ).
• Weak Duality: p* ≥ d* (primal optimum ≥ dual optimum).
• Strong Duality: p* = d*. Often holds for convex problems (e.g., under Slater's condition).

Analytic Solutions
• FONC (First-Order Necessary): at a local min x*, ∇f(x*) = 0.
• SONC (Second-Order Necessary): at a local min x*, ∇²f(x*) is positive semidefinite (z⊤Hz ≥ 0).
Gradient-Based Optimization
Iterative methods to find minima. General form: x_{k+1} = x_k + η_k p_k.
• Gradient Descent (GD): p_k = −∇f(x_k), i.e.
  x_{t+1} ← x_t − η∇f(x_t)
• GD with Momentum: adds inertia to accelerate convergence.
  v_t = γv_{t−1} + η∇f(x_t)
  x_{t+1} = x_t − v_t
• Coordinate Descent: minimize along one coordinate at a time, fixing the others:
  x_i^{(t+1)} ← arg min_{x_i} f(. . . , x_i , . . . ).
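A sketch of plain GD and GD with momentum on a small quadratic f(x) = (1/2)x⊤Qx − b⊤x; Q, b, the step size, and the iteration counts are illustrative choices:

import numpy as np

Q = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad(x):
    return Q @ x - b                # ∇f(x) for f(x) = 1/2 x^T Q x - b^T x

eta, gamma = 0.1, 0.9

# Plain gradient descent: x_{t+1} = x_t - η ∇f(x_t)
x = np.zeros(2)
for _ in range(500):
    x = x - eta * grad(x)

# GD with momentum: v_t = γ v_{t-1} + η ∇f(x_t); x_{t+1} = x_t - v_t
x_m, v = np.zeros(2), np.zeros(2)
for _ in range(500):
    v = gamma * v + eta * grad(x_m)
    x_m = x_m - v

x_star = np.linalg.solve(Q, b)      # exact minimizer for comparison
print(x, x_m, x_star)               # both iterates approach x_star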
Regression Task Model
Goal: find f : R^d → R to predict a continuous target y. Model: y = f*(x) + ϵ, where ϵ is zero-mean noise. Linear Model: use basis functions ϕ(x) for non-linearity: f(x) = β⊤ϕ(x).
• Affine: ϕ(x) = (1, x_1, . . . , x_d)⊤.
• Polynomial (deg k): ϕ(x) includes terms like x_i, x_i x_j, x_i^2, . . . up to degree k.
• Gaussian RBF: ϕ(x)_i = exp(−γ||x − µ_i||^2).
Ordinary Least Squares (OLS)
Minimize the Residual Sum of Squares (RSS). The design matrix X has rows ϕ(x_i)⊤:
  min_β ||y − Xβ||_2^2
This is a convex problem.
• Solution (Normal Equations): X⊤Xβ = X⊤y.
• If X⊤X is invertible: β = (X⊤X)^{-1}X⊤y.
• If not invertible (e.g., p > n), infinitely many solutions exist.
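A hedged sketch of OLS with an affine basis ϕ(x) = (1, x)⊤ on synthetic data (the data and variable names are made up for illustration):

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=100)
y = 2.0 - 1.5 * x + rng.normal(scale=0.3, size=100)   # y = f*(x) + ε

X = np.column_stack([np.ones_like(x), x])             # design matrix, rows ϕ(x_i)^T
beta = np.linalg.solve(X.T @ X, X.T @ y)              # normal equations X^T X β = X^T y
rss = np.sum((y - X @ beta) ** 2)                     # residual sum of squares
print(beta, rss)                                      # beta roughly [2.0, -1.5]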
Regularized Regression
Add a penalty on the weights β to prevent overfitting.
• Ridge Regression (L2 Penalty):
  min_β ||y − Xβ||_2^2 + λ||β||_2^2
  Solution: β = (X⊤X + λI)^{-1}X⊤y. This is always unique for λ > 0. Shrinks weights towards zero.
• Lasso (L1 Penalty):
  min_β ||y − Xβ||_2^2 + λ||β||_1
  Promotes sparsity (sets some weights to exactly zero). Solved with coordinate descent; the update for β_k is a soft-thresholding operation.
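An illustrative sketch of both penalized estimators: ridge in closed form, and lasso by coordinate descent with soft-thresholding (a standard formulation for the objective as written above; the λ/2 in the threshold comes from that scaling, and the data below is synthetic):

import numpy as np

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iters=200):
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iters):
        for k in range(d):
            r_k = y - X @ beta + X[:, k] * beta[k]          # residual ignoring feature k
            z = X[:, k] @ r_k
            beta[k] = soft_threshold(z, lam / 2.0) / (X[:, k] @ X[:, k])
    return beta

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
beta_true = np.array([3.0, -2.0] + [0.0] * 8)
y = X @ beta_true + rng.normal(scale=0.1, size=100)
print(ridge(X, y, lam=1.0).round(2))
print(lasso_cd(X, y, lam=20.0).round(2))   # most coefficients end up exactly zero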
Bias-Variance Tradeoff (Regression)
Decomposition of the Expected Prediction Error (EPE):
  E[(y − f_D(x))^2] = E[(y − f*(x))^2] + (f*(x) − E_D[f_D(x)])^2 + E_D[(f_D(x) − E_D[f_D(x)])^2]
                    =     Noise σ^2    +         Bias^2          +          Variance
• Bias: error from wrong assumptions. High bias → underfitting.
• Variance: error from sensitivity to small training-set fluctuations. High variance → overfitting.
• Evaluation: use a test set or Cross-Validation (CV) to estimate the generalization error.

Classification Fundamentals
Goal: predict a discrete class label ŷ ∈ {1, . . . , c} for input x. Model: a function f : R^d → R^c gives scores (logits). Prediction: ŷ = arg max_l f(x)_l. Probabilistic View: assume the data come from p*(x, y). The best possible classifier is the Bayes Optimal Classifier:
  y*(x) = arg max_{l∈{1,...,c}} p*(y = l | x)
Its error is the Bayes error rate, the irreducible error.
Evaluation Metrics
• 0-1 Loss: L_01(y, ŷ) = 1 if y ≠ ŷ, else 0.
• Accuracy: 1 − (mean 0-1 loss) = #Correct / #Total. Can be misleading for imbalanced classes.
• Confusion Matrix: tabulates TP, FP, TN, FN.
• Precision (positive predictive value): TP / (TP + FP).
• Recall (sensitivity, true positive rate): TP / (TP + FN).
• F1-Score: harmonic mean of precision and recall: 2 · Prec · Recall / (Prec + Recall).
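A small NumPy sketch computing these metrics from binary predictions (the label vectors are made up):

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))

accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)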
Algorithms Overview
• k-Nearest Neighbors (k-NN): non-parametric. Predicts the majority class of the k closest training points. Simple, but can be slow at test time.
• Naive Bayes: probabilistic. Uses Bayes' theorem with the "naive" assumption of conditional independence of the features: p(x|y) = Π_k p(x_k|y). Very fast.
• Decision Trees (CART): tree-based model. Learns a hierarchy of if/else questions. Interpretable but prone to overfitting. Splits are chosen to maximize Information Gain (reduce impurity).
• Random Forests: ensemble of many deep, decorrelated decision trees, trained on bootstrap samples of the data and random subsets of the features. Robust and powerful.
• Support Vector Machines (SVM): finds the hyperplane that maximizes the margin between the classes. The kernel trick (k(x_i, x_j) = ϕ(x_i)⊤ϕ(x_j)) allows it to learn non-linear boundaries.
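A minimal k-NN sketch with plain Euclidean distances and majority voting (toy data, illustrative names):

import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    # pairwise squared Euclidean distances between test and training points
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]           # indices of the k nearest neighbours
    votes = y_train[nn]                          # their labels
    return np.array([np.bincount(v).argmax() for v in votes])   # majority vote

X_tr = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_tr = np.array([0, 0, 1, 1])
print(knn_predict(X_tr, y_tr, np.array([[0.05, 0.1], [1.0, 0.9]]), k=3))   # -> [0 1]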
SVM Objective Duality
Primal (Soft-Margin):
  min_{w,b,ξ} (1/2)||w||^2 + C Σ_{i=1}^n ξ_i
  s.t. y_i(w⊤x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0.
Dual (with Kernel Trick):
  max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j)
  s.t. Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C.
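This dual is what off-the-shelf solvers optimize; as a usage sketch (assuming scikit-learn is available) rather than a hand-written QP:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)), rng.normal(loc=1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # soft-margin C, RBF kernel k(x_i, x_j)
clf.fit(X, y)
print(clf.n_support_)                           # number of support vectors per class
print(clf.predict([[-1.0, -1.0], [1.0, 1.0]]))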
Architecture: MLP
• Neuron: a single computational unit: output = ϕ(w⊤x + b).
• Layer: a vector of neurons. Output of layer l: h^(l) = ϕ(W^(l) h^(l−1) + b^(l)).
• Multi-Layer Perceptron (MLP): a stack of layers.
• Activation Fns (ϕ): introduce non-linearity.
  – ReLU: ϕ(z) = max(0, z). Standard choice.
  – Sigmoid: ϕ(z) = 1/(1 + e^{−z}). Output in [0, 1].
  – Tanh: ϕ(z) = tanh(z). Output in [−1, 1].
• Universal Approx. Thm.: an MLP with one hidden layer can approximate any continuous function (arbitrarily well on a compact set, given enough hidden units).
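A forward pass through a two-layer MLP written out in NumPy, mirroring h^(l) = ϕ(W^(l)h^(l−1) + b^(l)); the shapes and initialization are illustrative:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # subtract the max for stability
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(6)
d, hidden, c = 4, 16, 3                      # input dim, hidden width, number of classes
W1, b1 = rng.normal(scale=0.1, size=(hidden, d)), np.zeros(hidden)
W2, b2 = rng.normal(scale=0.1, size=(c, hidden)), np.zeros(c)

x = rng.normal(size=d)
h1 = relu(W1 @ x + b1)                       # hidden layer
logits = W2 @ h1 + b2                        # output scores f(x)
p_hat = softmax(logits)                      # class probabilities
print(p_hat, p_hat.argmax())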
Training NNs
• Loss Function: for classification, typically the Cross-Entropy Loss: L(y, p̂) = −log(p̂_y), where p̂ = softmax(logits).
• Backpropagation: algorithm to efficiently compute the gradient of the loss w.r.t. all network weights (θ) by applying the chain rule layer-by-layer backwards.
• Optimizers:
  – SGD: update the weights using the gradient from a small mini-batch of data, not the full dataset: θ ← θ − η∇_θ L_batch.
  – Often combined with Momentum or more advanced methods like Adam.
• Regularization:
  – Weight Decay (L2): add λ||θ||_2^2 to the loss.
  – Dropout: during training, randomly set a fraction of the neuron activations to zero for each forward pass.
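A toy sketch of mini-batch SGD with L2 weight decay, plus an inverted-dropout mask, on a single linear layer with squared loss (standing in for a full network; all sizes, rates, and data are illustrative):

import numpy as np

rng = np.random.default_rng(7)
n, d = 256, 4
X, y = rng.normal(size=(n, d)), rng.normal(size=n)

theta = np.zeros(d)
eta, lam, batch = 0.05, 1e-3, 32

for step in range(100):
    idx = rng.choice(n, size=batch, replace=False)      # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ theta - yb) / batch          # gradient of the mean squared loss
    grad += 2 * lam * theta                              # weight decay: gradient of λ||θ||^2
    theta -= eta * grad                                  # SGD update
print(np.mean((X @ theta - y) ** 2))

# Inverted dropout applied to an activation vector during training:
h = rng.normal(size=16)
keep = (rng.random(h.shape) > 0.5) / 0.5                 # drop ~50% of units, rescale the rest
h_train = h * keep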
Convolutional NNs (CNNs)
Specialized for grid-like data (e.g., images).
• Convolution Operation: slides a small, learnable filter (kernel) over the input, computing dot products to create a feature map: (X ⋆ K)_ij = Σ_{a,b} X_{i+a, j+b} K_{a,b}.
• Inductive Biases:
  – Locality: filters process local patches.
  – Parameter Sharing: the same filter is used across the entire input.
  – Translation Equivariance: the output shifts with the input.
• Pooling Layer: downsamples feature maps, e.g., Max Pooling takes the maximum value in a window.
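A direct (valid, no padding) implementation of the formula above as written (technically cross-correlation, the usual deep-learning convention), plus a simple max-pooling helper, purely for illustration:

import numpy as np

def conv2d_valid(X, K):
    H, W = X.shape
    kh, kw = K.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + kh, j:j + kw] * K)   # dot product with the patch
    return out

def max_pool(X, size=2):
    H, W = X.shape
    Xc = X[:H - H % size, :W - W % size]                    # crop to a multiple of the window
    return Xc.reshape(H // size, size, W // size, size).max(axis=(1, 3))

X = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv2d_valid(X, K))        # 3x3 feature map
print(max_pool(X, 2))            # 2x2 downsampled map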
k-Means Clustering
Goal: partition n data points into r clusters.
• Objective: minimize the within-cluster sum of squares (WCSS):
  min_{C_s} Σ_{s=1}^r Σ_{x_i∈C_s} ||x_i − µ_s||^2, where µ_s = (1/|C_s|) Σ_{x_i∈C_s} x_i
• Lloyd's Algorithm: iterative approach:
  1. Assign: assign each point to its nearest centroid µ_s.
  2. Update: recalculate each centroid µ_s as the mean of the points assigned to it.
• Matrix Factorization View: equivalent to min_{Y,X} ||D − YX⊤||_F^2, where Y ∈ {0, 1}^{n×r} is a cluster assignment matrix and X contains the centroids.
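Lloyd's algorithm in plain NumPy; the initialization (random data points) and the stopping rule are simple illustrative choices:

import numpy as np

def kmeans(X, r, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=r, replace=False)]   # init from data points
    for _ in range(n_iters):
        # Assign step: nearest centroid for every point
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Update step: centroid = mean of its assigned points
        new_centroids = np.array([X[labels == s].mean(axis=0) if np.any(labels == s)
                                  else centroids[s] for s in range(r)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(loc=-2.0, size=(50, 2)), rng.normal(loc=2.0, size=(50, 2))])
labels, centroids = kmeans(X, r=2)
print(centroids)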
Spectral Clustering
Graph-based clustering.
1. Build Similarity Graph: construct a weighted adjacency matrix W from the data (e.g., using k-NN or an RBF kernel).
2. Compute Graph Laplacian: L = D_W − W, where D_W is the diagonal degree matrix with (D_W)_ii = Σ_j W_ij.
3. Embedding: compute the first r eigenvectors of L (corresponding to the smallest eigenvalues) to form an embedding matrix A ∈ R^{n×r}.
4. Cluster: run k-means on the rows of the embedding matrix A to get the final cluster assignments.
This method can find non-convex clusters.
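A sketch of the four steps with an RBF similarity graph; it assumes scikit-learn is available for the final k-means step, and the kernel width and data are illustrative:

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, r, gamma=1.0):
    # 1. Similarity graph: RBF weights W_ij = exp(-γ ||x_i - x_j||^2)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-gamma * d2)
    np.fill_diagonal(W, 0.0)
    # 2. Unnormalized graph Laplacian L = D_W - W
    L = np.diag(W.sum(axis=1)) - W
    # 3. Embedding: eigenvectors for the r smallest eigenvalues of L
    eigvals, eigvecs = np.linalg.eigh(L)
    A = eigvecs[:, :r]
    # 4. Cluster the rows of the embedding with k-means
    return KMeans(n_clusters=r, n_init=10, random_state=0).fit_predict(A)

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(loc=-2.0, size=(30, 2)), rng.normal(loc=2.0, size=(30, 2))])
print(spectral_clustering(X, r=2))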
Principal Component Analysis (PCA)
Goal: find a low-dimensional (r-dim) linear projection of the data that maximizes variance. Objective: find a matrix Z ∈ R^{d×r} with orthonormal columns that solves
  max_Z tr(Z⊤C⊤CZ) s.t. Z⊤Z = I
where C is the centered data matrix (D − 1µ⊤). Solution: the columns of Z (the principal components) are the first r right singular vectors of C (or eigenvectors of the covariance matrix C⊤C). The data is projected via DZ.
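PCA via the SVD of the centered data matrix, as stated above (random correlated data for illustration):

import numpy as np

rng = np.random.default_rng(10)
D = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # some correlated data

mu = D.mean(axis=0)
C = D - mu                          # centered data matrix C = D - 1 μ^T
U, s, Vt = np.linalg.svd(C, full_matrices=False)

r = 2
Z = Vt[:r].T                        # first r right singular vectors = principal components
projected = D @ Z                   # r-dimensional representation of the data
explained_var = s[:r] ** 2 / np.sum(s ** 2)   # fraction of variance per component
print(projected.shape, explained_var)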

Low-Rank Matrix Factorization (for RecSys)
Goal: approximate a data matrix D ∈ R^{n×d} (e.g., user-item ratings) with a low-rank product.
• Objective (Full Data):
  min_{Y∈R^{n×r}, X∈R^{d×r}} ||D − YX⊤||_F^2
  The solution is given by the truncated SVD of D: D_r = U_r Σ_r V_r⊤.
• Objective (Missing Data): let O be a binary mask of the observed entries.
  min_{Y,X} ||O ∘ (D − YX⊤)||_F^2 + λ(||Y||_F^2 + ||X||_F^2)
  Solved using Alternating Least Squares (ALS), a form of block coordinate descent.
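A compact ALS sketch for the masked objective: alternately solve a small ridge problem for each row of Y and each row of X (dimensions, regularization, and the toy data are illustrative):

import numpy as np

def als(D, O, r, lam=0.1, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = D.shape
    Y = rng.normal(scale=0.1, size=(n, r))
    X = rng.normal(scale=0.1, size=(d, r))
    for _ in range(n_iters):
        # Fix X, solve a ridge-regularized least-squares problem for each row of Y
        for i in range(n):
            obs = O[i] > 0
            Xi = X[obs]
            Y[i] = np.linalg.solve(Xi.T @ Xi + lam * np.eye(r), Xi.T @ D[i, obs])
        # Fix Y, solve the symmetric problem for each row of X
        for j in range(d):
            obs = O[:, j] > 0
            Yj = Y[obs]
            X[j] = np.linalg.solve(Yj.T @ Yj + lam * np.eye(r), Yj.T @ D[obs, j])
    return Y, X

# toy usage: a random low-rank matrix with roughly half the entries observed
rng = np.random.default_rng(11)
D = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 20))
O = (rng.random(D.shape) < 0.5).astype(float)
Y, X = als(D, O, r=3)
print(np.sqrt(np.sum((O * (D - Y @ X.T)) ** 2) / O.sum()))   # RMSE on observed entries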