Machine Learning Cheatsheet
© 2024 Robins Yadav

Machine Learning General

Definition
We want to learn a target function f that maps input variables X to output variable y, with an error e:
y = f(X) + e

Linear, Non-linear
Different algorithms make different assumptions about the shape and structure of f. Any algorithm can be either:
• Parametric (or Linear): simplify the mapping to a known linear combination form and learn its coefficients.
• Non-parametric (or Non-linear): free to learn any functional form from the training data, while maintaining some ability to generalize.
Note: Linear algorithms are usually simpler, faster and require less data, while nonlinear ones are more flexible, more powerful and more performant.

Discriminative model | Generative model
Learns the decision boundary between classes | Learns the input distribution
Directly estimates P(y|x) | Estimates P(x|y) to find P(y|x) using Bayes' rule
Specifically meant for classification tasks | Used for generating new content or data
Logistic Regression, Random Forests, SVM, Neural Networks, Decision Tree, kNN | Hidden Markov Models, Naive Bayes, Gaussian Mixture Models, Gaussian Discriminant Analysis, LDA, Bayesian Networks

→ The training loss goes down over time, achieving low error values.
→ The validation loss goes down until a turning point is found, and then it starts going up again. That point represents the beginning of overfitting. Therefore, the training process should be stopped when the validation error trend changes from descending to ascending.

Bias-Variance trade-off, Underfitting, Overfitting
In supervised learning, the prediction (expected) error e is composed of the bias, the variance and the irreducible part.
$$\mathrm{Error}(x) = \left(E[\hat{f}(x)] - f(x)\right)^2 + E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right] + \sigma_e^2$$
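The split of the error into a squared bias term and a variance term can be checked numerically. The sketch below is only an illustration under assumptions that are not in the cheatsheet: the "model" is a deliberately shrunken sample mean (factor 0.8), the true value is 2.0, and the names `simulate_estimates` and `decompose` are hypothetical. Because the estimates are compared with the true f(x) rather than with noisy observations, the irreducible term σ_e² does not appear.

```python
import random
import statistics

def simulate_estimates(true_value=2.0, noise_sd=1.0, n=20, shrink=0.8,
                       trials=2000, seed=0):
    """Return one shrunken-mean estimate per simulated training set."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        sample = [true_value + rng.gauss(0.0, noise_sd) for _ in range(n)]
        estimates.append(shrink * statistics.fmean(sample))
    return estimates

def decompose(estimates, true_value):
    """Split the mean squared estimation error into bias^2 and variance."""
    mean_est = statistics.fmean(estimates)
    bias_sq = (mean_est - true_value) ** 2
    variance = statistics.fmean([(e - mean_est) ** 2 for e in estimates])
    mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])
    return bias_sq, variance, mse

estimates = simulate_estimates()
bias_sq, variance, mse = decompose(estimates, true_value=2.0)
print(f"bias^2={bias_sq:.4f}  variance={variance:.4f}  mse={mse:.4f}")
```

Raising `shrink` towards 1.0 lowers the bias and raises the variance, which mirrors the trade-off described in this section.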
→ Bias refers to erroneous assumptions made by the model about the data to make the target function f easier to learn. Mathematically, how much do the expected predicted values differ from the true values?
→ Variance refers to how much the predictions of the model (i.e. the estimate of the target function f) will vary when trained on different training data (or possibly with different random seeds). It is also known as Variance Error or Error due to Variance. Mathematically, variance is the expected squared deviation of the model's prediction from its mean prediction, across different training sets.
Note: As the complexity of the model (e.g. flexibility of the decision boundary) increases, the variance will increase and the bias will decrease.
• The goal of parameterization is to achieve a trade-off with low bias and low variance through methods such as:
→ Cross-validation, which can be used to tune models so as to optimize the trade-off between bias and variance
→ Dimension reduction and feature selection
→ Mixture models (probabilistic models) and ensemble learning
→ Regularization - Dropout: during training, randomly set some activations to 0. This forces the network to not rely on any one node.
• How would you identify if your model is overfitting? By analyzing the learning curves, you should be able to spot whether the model is underfitting or overfitting. The y-axis is some metric of learning (e.g. classification accuracy) and the x-axis is experience (time or number of iterations).

Supervised, Unsupervised
• Supervised learning methods learn to predict outcomes y (y^(1), ..., y^(m)) from data points X (x^(1), ..., x^(m)) given that the data is labeled.
→ Type of prediction
 | Regression | Classification
Outcome | Continuous | Class
Examples | Linear Regression | Logistic Regression, SVM, Naive Bayes
→ Conditional Estimates
Regression → conditional expectation: E[y|X = x]
Classification → conditional probability: P(Y = y|X = x)
• Unsupervised learning methods learn to find the inherent structure or hidden patterns from unlabeled data X (x^(1), ..., x^(m)).

Type of models
→ Discriminative Model: It focuses on predicting the data's outputs (for classification or regression) by training a model. It learns parameters by maximizing the conditional probability P(Y|X).
→ Generative Model: It focuses on learning a probability distribution for the dataset; it can reference this probability distribution to generate new data instances. It learns parameters by maximizing the joint probability P(X, Y) = P(X ∩ Y).

• Training loss vs. Validation loss:
→ Epochs: One epoch is when the ENTIRE dataset is passed forward and backward through the neural network exactly ONCE.
→ Batch: You can't pass the entire dataset into the neural net at once, so you divide the dataset into a number of batches (sets or parts).
→ Iterations: the number of batches needed to complete one epoch.
Question: If the total number of samples in a dataset is 1000 and the batch size is 10, how many iterations will there be in one epoch? Ans: 100
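The epoch, batch and iteration definitions reduce to a single division, rounded up when the last batch is only partly full. The sketch below reproduces the 1000-sample, batch-size-10 question; the helper names `iterations_per_epoch` and `iter_batches` are hypothetical and chosen only for this illustration.

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

def iter_batches(data, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

print(iterations_per_epoch(1000, 10))        # 100, as in the question above

# One epoch is one full pass over all batches.
batches = list(iter_batches(list(range(1000)), 10))
print(len(batches), len(batches[0]))
```

With a batch size that does not divide the dataset evenly (for example 32), the final batch is smaller than the rest, which is why the count is rounded up.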
• Underfitting or High bias means that the model is not able to capture (or learn) the trend (or pattern) in the data.
• Overfitting or High variance means that the model learns too much from the available data but does not generalize well enough to predict on new data.

Inference vs. Classification
→ Inference: Given two groups, what are the differences between these groups? (t-tests, paired t-tests, etc.)
→ Classification: Given a new animal, decide whether it is a cat or a dog.

Prediction and Inference
→ Prediction uses a model to predict future observations.
- Model does not need to be valid
- Evaluation does need to be valid
- Quality & strength: accuracy of predicting unseen data
→ Inference uses the model's structure and parameters to learn or understand an underlying phenomenon.
- Validity depends on assumptions
- Quality & strength: R², p-value, assumption checks, coefficients

• What is cross-validation? Why is it important?
Cross-validation evaluates a model's performance.
→ The idea is to divide the dataset into k subsets or "folds", train the model on k − 1 of these folds, and test on the remaining fold to ensure that the model generalizes well to unseen data.
→ After evaluating on all k folds, performance metrics are averaged for a robust estimate of the model's effectiveness.
→ K-Fold Cross-Validation: The data is divided into k equally-sized folds.
→ Stratified K-Fold: Similar to K-Fold, but it maintains the proportion of classes in each fold, making it ideal for imbalanced datasets.
→ Leave-One-Out Cross-Validation (LOOCV): A special case where k equals the number of data points, so each fold contains just one data point.

Unrepresentative Training Dataset
When the data available during training is not enough to capture the model, relative to the validation dataset.
The train and validation curves are improving, but there's a big gap between them, which means they operate like datasets from different distributions.

Unrepresentative Validation Dataset
The training curve looks ok, but the validation curve moves noisily around the training curve. It could be the case that validation data is scarce and not very representative of the training data, so the model struggles to model these examples.
Here, the validation loss is much better than the training one, which reflects that the validation dataset is easier to predict than the training dataset. An explanation could be that the validation data is scarce but widely represented by the training dataset, so the model performs extremely well on these few examples.

Data Science General

ETL - Extract Transform Load
An ETL workflow is crucial for consolidating and preparing data for analysis and reporting, ensuring accuracy, consistency, and availability for better decision-making.
1. Extract Phase: Retrieve raw data from sources like databases, files, or APIs using SQL queries, scripts, or connectors. Extract only the new or updated data since the last extraction to optimize performance and reduce load.
2. Transform Phase: Clean, integrate, and format data, applying business rules and calculations. Validate transformations and ensure data quality using tools like Python, SQL, or Apache Spark.
3. Load Phase: Insert transformed data into the target system, ensuring correct schema, bulk loading for efficiency, and optimizing with indexing and partitioning.

ETL Workflow Automation
Automation Tools:
→ Apache Airflow: For workflow scheduling and orchestration.
→ AWS Glue: Managed ETL service on AWS.
→ Apache NiFi: For data flow automation.

Preventing Data Leakage
• Proper Data Splitting: Split data into training, validation, and test sets before performing any data preprocessing. For time series, split data chronologically.
• Transformation fit on Training Data Only: Ensure that transformations (e.g., scaling, encoding) are fit only on the training data and then applied to both training and test data.
• Feature Selection: Ensure that features used for training are available at the time of prediction. For time series data, create lag features that use only past information.
• Data Augmentation: Apply augmentation techniques only to the training set.

Handle missing or corrupted data in a dataset
• Remove Missing Data: Drop rows if only a small number of rows have missing values, or drop a column if a large percentage of its values are missing.
• Impute Missing Data: Mean/Median/Mode imputation, or imputation using algorithms like forward/backward fill or interpolation for time series data.
• Use Algorithms That Handle Missing Data: Decision trees, Random forests
• Replace Corrupted Data: Identify outliers using statistical methods (like z-scores or IQR); apply manual inspection and correction of corrupted values based on external sources or domain expertise.
• Data Augmentation: To create synthetic data based on existing patterns

Notes
→ Ablation: An ablation study is turning off components of a model (e.g. features or sub-models) one at a time, to see how much each contributes to the model's performance.
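Two ideas from this page, k-fold cross-validation and the rule that transformations are fit on training data only, can be combined in one small sketch. It is an illustration in plain Python rather than any particular library's API: the helper names `k_fold_indices` and `standardize_fold` and the toy single-feature dataset are assumptions made here.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices once and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def standardize_fold(values, train_idx, test_idx):
    """Fit mean/std on the training part only, then apply to both parts."""
    train = [values[i] for i in train_idx]
    mean = statistics.fmean(train)
    std = statistics.pstdev(train) or 1.0   # guard against zero spread
    train_scaled = [(values[i] - mean) / std for i in train_idx]
    test_scaled = [(values[i] - mean) / std for i in test_idx]
    return train_scaled, test_scaled

values = [float(v) for v in range(20)]      # toy single-feature dataset
folds = k_fold_indices(len(values), k=5)
for fold_number, test_idx in enumerate(folds):
    # Train on the other k - 1 folds, hold this fold out for testing.
    train_idx = [i for fold in folds if fold is not test_idx for i in fold]
    train_scaled, test_scaled = standardize_fold(values, train_idx, test_idx)
    print(fold_number, len(train_scaled), len(test_scaled))
```

The held-out fold never contributes to the mean or standard deviation, so no information from the test data leaks into the preprocessing step.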
Model Evaluation

Classification Problems
Confusion Matrix
Type I error: The null hypothesis H0 is rejected when it is true.
Type II error: The null hypothesis H0 is not rejected when it is false.
→ False positive (Type I error): incorrectly decide yes
→ False negative (Type II error): incorrectly decide no
Ex: We assume the null hypothesis H0 is true.
→ H0: Person is not guilty
→ H1: Person is guilty

1. Accuracy: $$\frac{TP + TN}{TP + TN + FP + FN}$$
→ Ratio of correct predictions over total predictions.
Estimate of P[D = Y], the probability that the decision is equal to the outcome.

2. Recall or Sensitivity or True positive rate: $$\frac{TP}{TP + FN}$$
Completeness of model → Out of total actual positive (1) values, how often the classifier is correct.
Probability: P[D = 1|Y = 1]
Example: "Fraudulent transaction detector" or "Person Cancer" → +ve (1) is "fraud": Optimize for sensitivity because false positives (FP: normal transactions that are flagged as possible fraud) are more acceptable than false negatives (FN: fraudulent transactions that are not detected).

3. Precision: $$\frac{TP}{TP + FP}$$
Exactness of model → Out of total predicted positive (1) values, how often the classifier is correct.
Probability: P[Y = 1|D = 1]. If our model says positive, how likely is it to be correct in that judgement?
Example: "Spam Filter" → +ve (1) class is spam: Optimize for precision or specificity because false negatives (FN: spam goes to the inbox) are more acceptable than false positives (FP: non-spam is caught by the spam filter).
Example: "Hotel booking canceled" → +ve (1) class is isCanceled: Optimize for precision or specificity because false negatives (FN: isCanceled labeled as "not canceled" 0) are more acceptable than false positives (FP: isnotCanceled labeled as "canceled" 1).

4. F1-Score: $$F_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
False positive (FP) and False negative (FN) are equally important.

5. False Positive Rate: $$\frac{FP}{TN + FP}$$
Fraction of negatives wrongly classified positive.
Probability: P[D = 1|Y = 0]

6. False Negative Rate: $$\frac{FN}{TP + FN} = 1 - \mathrm{Recall}$$
Fraction of positives wrongly classified negative.

7. Specificity: $$\frac{TN}{TN + FP} = 1 - \mathrm{FPR}$$
Fraction of negatives rightly classified negative.

• ROC-curve: The curve illustrates the trade-off between the true positive rate (TPR, sensitivity or recall) and the false positive rate (FPR) using classification thresholds α.
→ Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
Note: We can think of the plot as the fraction of correct predictions for the positive class (y-axis) versus the fraction of errors for the negative class (x-axis).
How to choose the threshold for logistic regression? The choice of a threshold depends on the importance of TPR and FPR for the classification problem. If there is no external concern about low TPR or high FPR, one option is to weight them equally by choosing the threshold that maximizes TPR − FPR.

    import numpy as np
    from sklearn.metrics import roc_curve

    # Get predicted probabilities for the positive class
    # (model, X_test and y_test are assumed to exist already)
    y_prob = model.predict_proba(X_test)[:, 1]
    # Calculate ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_prob)
    # Optimal threshold using Youden's J statistic
    youden_index = tpr - fpr
    optimal = thresholds[np.argmax(youden_index)]

• Area Under the ROC Curve (AUC): AUC summarizes the ROC curve in a single number and can be computed with an efficient, sorting-based algorithm. AUC ranges in value from 0 to 1. It measures how well the model differentiates positives and negatives (perfect AUC = 1, baseline = 0.5).
• Precision-Recall curve: Focuses on the correct prediction of the minority class, useful when data is imbalanced. Plot precision against recall at different thresholds.

Regression Problems
1. Mean Squared Error: $$\mathrm{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$$
2. Root Mean Squared Error: $$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}}$$
3. Mean Absolute Error: $$\mathrm{MAE} = \frac{1}{n}\sum_i |y_i - \hat{y}_i|$$
4. Sum of Squared Error: $$\mathrm{SSE} = \sum_i (y_i - \hat{y}_i)^2$$
5. Total Sum of Squares: $$\mathrm{SST} = \sum_i (y_i - \bar{y})^2$$
6. R²: $$R^2 = 1 - \frac{\mathrm{MSE(model)}}{\mathrm{MSE(baseline)}} \qquad R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}$$
7. Adjusted R²: $$R_a^2 = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}$$

Variance, R² and the Sum of Squares
• The total sum of squares: $$SS_{total} = \sum_i (y_i - \bar{y})^2$$
• This scales with variance: $$\mathrm{var}(Y) = \frac{1}{n}\sum_i (y_i - \bar{y})^2$$
• The regression sum of squares: $$SS_{regression} = \sum_i (\hat{y}_i - \bar{y})^2$$ → n Var(predictions)
• The residual sum of squares (squared error): $$SS_{residual} = \sum_i (y_i - \hat{y}_i)^2$$ → n Var(ϵ)
Note: ϵ̄ = 0, E[ŷ] = ȳ
$$SS_{total} = SS_{regression} + SS_{residual}$$
$$R^2 = 1 - \frac{SS_{residual}}{SS_{total}} = \frac{SS_{regression}}{SS_{total}} = \frac{n\,\mathrm{Var}(preds)}{n\,\mathrm{Var}(Y)} = \frac{\mathrm{Var}(preds)}{\mathrm{Var}(Y)}$$
→ Explained Variance: R² quantifies how much of the variability in the outcome (dependent) variable can be explained by the predictor (independent) variables.
→ Goodness of Fit: A higher R² value generally suggests a better fit of the model to the data, meaning the model's predictions are closer to the actual values.
→ An R² of 1 indicates a perfect fit, where the model explains all the variability, while an R² of 0 indicates that the model explains none of the variability.
→ R² is not valid for nonlinear models as SS_regression + SS_residual ≠ SS_total
→ Drawback: R² never decreases when a new predictor variable is added to the regression model, which is why adjusted R² is used.

Optimization
Almost every machine learning method has an optimization algorithm at its core.
→ Hypothesis: The hypothesis hθ is the model that we choose. For a given input data x^(i) the model prediction output is hθ(x^(i)).
→ Loss function: L : (ŷ, y) ∈ R × Y ↦ L(ŷ, y) ∈ R takes as inputs the predicted ŷ and the actual y, and outputs how different they are. In other words, the loss function computes the distance or difference between the predicted output ŷ of the algorithm and the actual output y.
The common loss functions are summed up in the table below:
Least squared error | Logistic loss | Hinge loss
(1/2)(y − ŷ)² | log(1 + exp(−yŷ)) | max(0, 1 − yŷ)
Linear Regression | Logistic Regression | SVM

→ Cost function: The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:
$$J(\theta) = \sum_{i=1}^{m} L(h_\theta(x^{(i)}), y^{(i)})$$
• Cost function for regression: Mean Squared Error (MSE), Mean Absolute Error (MAE), Huber Loss, Log-Cosh Loss
• Cost function for classification: Binary Cross-Entropy Loss, Categorical Cross-Entropy Loss, Sparse Categorical Cross-Entropy Loss, Hinge Loss, Squared Hinge Loss.

Convex & Non-convex
A convex function is one where a line drawn between any two points on the graph lies on or above the graph. It has one minimum. A non-convex function is one where a line drawn between any two points on the graph may intersect other points on the graph. It is characterized as "wavy".
→ When a cost function is non-convex, there is a likelihood that the optimizer may find local minima instead of the global minimum, which is typically undesired in machine learning models from an optimization perspective.

General Optimization Steps
1. Understand data (features and outcome variables) → 2. Define loss (or gain/utility) function → 3. Define predictive model → 4. Search for parameters that minimize the loss function

Gradient Descent
Gradient Descent is used to find the coefficients of f that minimize a cost function.
→ Time Complexity: O(n · m) per iteration, where n is the no. of data points and m the no. of features. If you run for k iterations the total complexity becomes O(k · n · m).
→ It minimizes the average loss by moving iteratively in the direction of steepest descent, controlled by the learning rate α (step size).
Procedure:
1. Initialization θ = 0 (coefficients to 0 or random)
2. Calculate cost J(θ) = evaluate f(coefficients)
3. Gradient of cost ∂J(θ)/∂θj: we know the uphill direction
4. Update coefficients θj = θj − α ∂J(θ)/∂θj: we go downhill
The cost updating process is repeated until convergence (minimum found).

Tips: For effective gradient descent, select an optimal learning rate, scale features, initialize parameters wisely, utilize mini-batch processing, monitor convergence, experiment with various optimizers, apply regularization techniques, avoid local minima, visualize loss trends, and tune hyperparameters diligently.
• Stochastic Gradient Descent uses a single point to compute gradients, leading to noisier convergence but faster compute speeds per iteration.
→ Time Complexity: O(k · m), where m is the no. of features. In each iteration, SGD computes the gradient using only one data point, leading to O(k · m) for k iterations.
• Mini-batch Gradient Descent trains on small subsets of the data, striking a balance between the approaches. → Time Complexity: O(k · b · m), where b is the batch size.

Ordinary Least Squares
General Linear Regression Model: $$y = \beta_0 + \sum_j \beta_j x_j + \epsilon$$
Here, βj is the j-th coefficient and xj is the j-th feature.
Ordinary Least Squares - find β that minimizes squared error:
$$\arg\min_{\vec{\beta}} \sum_i (y_i - \hat{y}_i)^2$$
Goal: least-squares solution to: $$X\vec{\beta} = \vec{y}$$
Solution: solve the normal equations:
$$X^T X \vec{\beta} = X^T \vec{y} \;\rightarrow\; \vec{\beta} = (X^T X)^{-1} X^T \vec{y}$$
→ Least squares generalizes into minimizing loss functions L(β | X, y).
→ This is the heart of machine learning, particularly supervised learning.

Likelihood and Posterior
$$P(\theta|y) = \frac{P(y|\theta)\,P(\theta)}{P(y)}$$
• P(θ) is the prior
• P(y|θ) is the likelihood: how likely is the data given params θ
• P(y) = ∫ P(y|θ)P(θ)dθ is a scaling factor (constant for fixed y)
• P(θ|y) is the posterior.
• We're maximizing likelihood (ML estimator)
• Can also maximize posterior (MAP estimator)
- When prior is constant, they're the same
- With lots of data, they're almost the same
→ Logistic regression is trained by maximizing the log likelihood of the training data given the model
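The two remarks on the ML and MAP estimators (identical under a constant prior, nearly identical with lots of data) can be illustrated with a coin-flip model. This is a sketch under assumptions not made in the cheatsheet: the data are Bernoulli trials, the prior is Beta(alpha, beta), and the function names `bernoulli_mle` and `bernoulli_map` are hypothetical.

```python
def bernoulli_mle(successes, trials):
    """Maximum likelihood estimate of the success probability."""
    return successes / trials

def bernoulli_map(successes, trials, alpha=1.0, beta=1.0):
    """Posterior mode under a Beta(alpha, beta) prior (alpha, beta >= 1)."""
    return (successes + alpha - 1) / (trials + alpha + beta - 2)

# Constant (uniform) prior Beta(1, 1): MAP coincides with MLE.
print(bernoulli_mle(7, 10), bernoulli_map(7, 10))

# An informative prior Beta(5, 5) pulls a small sample towards 0.5 ...
print(bernoulli_map(7, 10, alpha=5, beta=5))

# ... but has little effect once there is a lot of data.
print(bernoulli_map(7000, 10000, alpha=5, beta=5))
```

The closed form used for the MAP estimate is the mode of the Beta posterior, which is why no numerical optimization is needed in this particular case.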
Maximum Likelihood Estimation (MLE)
In MLE, the goal is to estimate the parameters (or coefficients) of a probability distribution by finding the values that maximize the likelihood of the observed data.
• Likelihood Function: The likelihood function measures the probability of the observed data given the parameters. It helps to evaluate how well different parameters explain the observed data. For a dataset X = {x1, x2, ..., xn}, assumed to be i.i.d., and a parameter θ, the likelihood L(θ) is:
$$L(\theta|X) = \prod_{i=1}^{n} f(x_i|\theta) \;\rightarrow\; L(\theta|X) = P(X|\theta) = \prod_{i=1}^{n} P(x_i|\theta)$$
• Log-Likelihood: The natural log of L(θ) is taken before calculating the maximum because multiplying probabilities can result in very small values. Since log is a monotonically increasing function, maximizing the log-likelihood log L(θ) is equivalent to maximizing the likelihood:
$$\log L(\theta|X) = \log P(X|\theta) = \log \prod_{i=1}^{n} P(x_i|\theta) = \sum_{i=1}^{n} \log P(x_i|\theta)$$
→ MLE is used to find the estimators that maximize the likelihood function L(θ|x) = fθ(x), the density function of the data distribution.
In case of Logistic Regression:
$$P(Y = 1|X = x) = \hat{y} = \mathrm{logistic}\left(\beta_0 + \sum_j \beta_j x_j\right)$$
The model computes the probability of yes.
→ What if we want P(Y = yi), regardless of whether yi is 1 or 0?
$$P(Y = y_i|X = x_i) = \hat{y}_i^{\,y_i}(1 - \hat{y}_i)^{1 - y_i}$$
• ŷi is the model's estimate of P(Y = 1|X = xi)
• yi ∈ {0, 1} is the outcome
• ŷi^yi is ŷi if yi = 1, and 1 if yi = 0

Conditioning on Parameters
Fuller definition - condition on parameters β and write function:
$$P(Y = 1|x, \vec{\beta}) = \hat{y} = m(x, \vec{\beta}) = \mathrm{logistic}(...)$$
Likelihood Function
Given data X = ⟨x1, ..., xn⟩, y = ⟨y1, ..., yn⟩ and parameters β:
$$\mathrm{Likelihood}(y, X, \vec{\beta}) = P(y, X|\vec{\beta}) = \prod_i P(y_i|x_i, \vec{\beta})$$
By joint conditional probability,
$$P(y, X|\vec{\beta}) = P(X|\vec{\beta}) \cdot P(y|X, \vec{\beta})$$
But X is independent of params, so P(X|β) = P(X). And X is fixed, so P(X) is an (unknown) constant.
$$\log \mathrm{Likelihood}(y, X, \vec{\beta}) = \log\left[P(X)\prod_i P(y_i|x_i, \vec{\beta})\right] = \log P(X) + \sum_i \log P(y_i|x_i, \vec{\beta})$$
Maximum Likelihood Estimator
$$\arg\max_{\vec{\beta}} \sum_i \log P(y_i|x_i, \vec{\beta})$$
$$P(Y = y_i|X = x_i) = \hat{y}_i^{\,y_i}(1 - \hat{y}_i)^{1 - y_i}$$
$$\log P(Y = y_i|X = x_i) = y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)$$
Model log likelihood is the sum over training data. Applicable to any model where ŷ = P(Y = 1|x).

Bayesian Estimation - Maximum a Posteriori (MAP)
MAP estimation seeks to find the parameters θ that maximize the posterior distribution, which assumes a "prior distribution P(θ)":
$$\hat{\theta}_{MAP} = \arg\max_\theta P(\theta|X) = \arg\max_\theta \frac{P(X|\theta)P(\theta)}{P(X)}$$
Since P(X) does not depend on θ,
$$\hat{\theta}_{MAP} = \arg\max_\theta P(X|\theta)P(\theta)$$
In logistic regression, MAP can be applied by introducing a prior on the model parameters P(θ).

Linear Algorithms

Regression
→ Regression predicts (or estimates) a continuous variable
Dependent variable Y, Independent variable(s) X
→ compute estimate ŷ ≈ y
ŷi = β0 + β1 xi
yi = ŷi + ϵi
Here, β0 is the intercept, β1 is the slope and ϵ is the residual. The goal is to learn β0, β1 to minimize Σ ϵi² (least squares).
Linearity: A linear equation of k + 1 variables is of the form:
ŷ = β0 + β1 x1 + ··· + βk xk
It is the sum of scalar multiples of the individual variables - a line!
→ Linear models are remarkably capable of transforming many non-linear problems into linear.

Linear Regression
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon$$
$$\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}$$
Here, n is the total no. of observations, ŷi is the dependent variable, xij is the explanatory variable for the j-th feature of the i-th observation. β0 is the intercept, usually called the bias coefficient.
Assumptions:
→ Linear models make four key assumptions necessary for inferential validity.
• Linearity: outcome y and predictor X have a linear relationship.
• Independence: observations are independent of each other
- Independent variables (features) are not highly correlated with each other → Low multicollinearity
• Normal errors: residuals are normally distributed - check with Q-Q plots. Violation means the line still fits but p-values and CIs are unreliable
• Equal variance: residuals have constant variance (called homoskedasticity; violation is heteroskedasticity) - check a scatterplot or regplot of residuals vs. fitted values. Violation means the model is failing to capture a systematic effect.
→ These violations are a problem only for inference, not for prediction.

Variance Inflation Factor: Measures the severity of multicollinearity → $$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$ where R_i² is found by regressing X_i against all other variables (a common VIF cutoff is 10).
Learning: Estimating the coefficients β from the training data using the optimization algorithm Gradient Descent or Ordinary Least Squares.
Ordinary Least Squares - where we find β that minimizes squared error:
$$\arg\min_{\vec{\beta}} \sum_i (y_i - \hat{y}_i)^2$$
→ The dimension of the hyperplane of the regression is its complexity.
Variations: There are extensions of Linear Regression training called regularization methods, that aim to reduce the complexity of the models or to address over-fitting in ML. The regularizer is not dependent on the data.
→ In relation to the bias-variance trade-off, regularization aims to decrease complexity in a way that significantly reduces variance while only slightly increasing bias.
→ Standardize numeric variables when using regularization, to ensure that 0 is a neutral value, so a low coefficient means "little effect when deviating from average". Values, and therefore coefficients, are then on the same scale (# of standard deviations), which distributes weight properly between them.
→ Multicollinearity → correlated predictors. Problem: Which coefficient gets the common effect? To solve: this is where loss and regularization come in.
• Ridge Regression (L2 regularization): where OLS is modified to also minimize the squared sum of the coefficients
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = RSS + \lambda\sum_{j=1}^{p}\beta_j^2$$
→ Prevents the weights from getting too large (L2 norm). If lambda is very large then it will add too much weight and it will lead to under-fit.
$$\lambda \propto \frac{1}{\text{model variance}}$$
• Lasso Regression (L1 regularization): where OLS is modified to also minimize the sum of the absolute values of the coefficients
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = RSS + \lambda\sum_{j=1}^{p}|\beta_j|$$
where p is the no. of features (or dimensions), λ ≥ 0 is a tuning parameter to be determined.
→ Lasso shrinks the less important features' coefficients to zero, thus removing some features altogether. A very large lambda will make the coefficients zero, hence it will under-fit.
→ L2 is less likely to shrink coefficients to 0. Therefore L1 regularization leads to sparser models.

Logistic Regression

Log-Odds and Logistics
• Odds
The probability of success P(S): 0 ≤ p ≤ 1
→ The odds of success are defined as the ratio of the probability of success over the probability of failure.
The odds of success: $$\mathrm{Odds}(S) = \frac{P(S)}{P(S^c)} = \frac{P(S)}{1 - P(S)}$$
→ Ex: Odds(failure) = x → means x:1 against success
• Log Odds or logit →
$$\log \mathrm{Odds}(A) = \log\frac{P(A)}{1 - P(A)} = \log P(A) - \log(1 - P(A))$$
• Logistic: The inverse of the logit (logit⁻¹):
$$\mathrm{logistic}(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1}$$
→ Odds are another way of representing probabilities.
→ The logistic and logit functions convert between probabilities and log-odds.
• Generalized Linear Models (GLMs):
$$\hat{y}_i = g^{-1}(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip})$$
$$\hat{y}_i = g^{-1}\left(\beta_0 + \sum_{j=1}^{p}\beta_j x_{ij}\right)$$
Here, g is a link function
• Counts: Poisson regression, log link func
• Binary: Logistic regression, logit link func and g⁻¹ is logistic func
→ In logistic regression, a linear output is converted into a probability between 0 and 1 using the sigmoid or logistic function.
$$P(y_i = 1|X) = \hat{y}_i = \mathrm{logistic}\left(\beta_0 + \sum_j \beta_j x_{ij}\right)$$
The representation below is an equation with binary output, which actually models the probability of the default class:
$$p(X) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_i x_i}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_i x_i}} = p(y = 1 \mid X)$$
(Figure: sigmoid or logistic curve.)
Note: Coefficients are linearly related to log-odds, such that a one unit increase in x1 multiplies the odds by e^β1.
Note: The coefficients in logistic regression are interpreted in terms of their effect on the log-odds of the outcome, and the exponentiated coefficients (odds ratios) provide a clearer understanding of the change in odds associated with each predictor.
Assumptions:
- Linear relationship between X and log-odds of Y
- Observations must be independent of each other
- Low multicollinearity
Learning: Learning the logistic regression coefficients is done by:
→ Minimizing the logistic loss function (with labels yi ∈ {−1, +1})
$$\arg\min_{\vec{\beta}} \sum_i \log\left(1 + \exp(-y_i\,\vec{\beta}\cdot x_i)\right)$$
→ Maximizing the log likelihood of the training data given the model
$$\arg\max_{\vec{\beta}} \sum_i \log P(y_i|x_i, \vec{\beta})$$

Linear Discriminant Analysis
For multiclass classification, LDA is the preferred linear technique.
Representation: LDA representation consists of statistical properties calculated for each class: means and the covariance matrix:
$$\mu_k = \frac{1}{n_k}\sum_{i=1}^{n} x_i \qquad \sigma^2 = \frac{1}{n - k}\sum_{i=1}^{n}(x_i - \mu_k)^2$$
LDA assumes Gaussian data and attributes of same σ². Predictions are made using Bayes Theorem:
$$P(y = k \mid X = x) = \frac{P(k) \times P(x|k)}{\sum_{l=1}^{K} P(l) \times P(x|l)}$$
to obtain a discriminant function (latent variable) for each class k, estimating P(x|k) with a Gaussian distribution:
$$D_k(x) = x \times \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \ln(P(k))$$
The class with the largest discriminant value is the output class.
Variations:
1. Quadratic DA: Each class uses its own variance estimate
2. Regularized DA: Regularization into the variance estimate.

Data preparation for Linear Algorithm
1. Data Transformation: Linear algorithms require data to have a linear relationship between features and the target variable. Often, transformations like log or polynomial are applied to make the data fit a linear pattern.
2. Feature Engineering: Feature engineering is crucial. Polynomial features or interaction terms may need to be added manually to capture relationships in the data.
3. Handling Outliers: Linear models are sensitive to outliers. Detecting and either removing or transforming outliers is important, as they can significantly influence the results.
4. Rescaling: Rescaling features (standardization or normalization) is required to ensure that all features contribute equally to the model. Algorithms such as linear regression perform better with rescaled data.
5. Assumptions: Linear algorithms assume a linear relationship between the input features and the output variable. Violating this assumption can lead to suboptimal performance.

Advantages of Linear Algorithms
1. Simplicity and Interpretability: Easy to understand and interpret results (e.g., coefficients indicate feature importance).
2. Computational Efficiency: Faster to train and predict, especially with large datasets.
3. Less Prone to Overfitting: With proper regularization (like Lasso or Ridge), linear models can generalize well on unseen data.
4. Strong Theoretical Foundation: Well-established statistical properties and a solid theoretical framework.
5. Works Well with Linearly Separable Data: Performs well when the relationship between features and target variable is linear.
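To tie the logistic-regression pieces together (the logistic and logit functions, the log likelihood, and a gradient descent update of the form θj = θj − α ∂J/∂θj), here is a minimal single-feature sketch in plain Python. The toy data, the learning rate, the iteration count and the names `logistic`, `logit` and `fit_logistic` are all assumptions made for illustration; in practice a library implementation would normally be used.

```python
import math

def logistic(z):
    """Inverse of the logit: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def fit_logistic(xs, ys, learning_rate=0.1, iterations=5000):
    """Batch gradient descent on the mean negative log-likelihood."""
    b0, b1 = 0.0, 0.0                      # initialise coefficients at 0
    n = len(xs)
    for _ in range(iterations):
        preds = [logistic(b0 + b1 * x) for x in xs]
        # Gradient of the mean log-loss with respect to b0 and b1.
        grad_b0 = sum(p - y for p, y in zip(preds, ys)) / n
        grad_b1 = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        # Step downhill, scaled by the learning rate.
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]              # overlapping classes keep the fit finite
b0, b1 = fit_logistic(xs, ys)
print("intercept:", b0, "slope:", b1)
print("odds multiplier per unit of x:", math.exp(b1))
```

The last line reflects the interpretation of coefficients given above: a one unit increase in x multiplies the odds by e^β1.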
Nonlinear Algorithms
All Nonlinear Algorithms are non-parametric and more flexible. They are not sensitive to outliers and do not require any shape of distribution.

Naive Bayes Classifier
Naive Bayes is a classification algorithm interested in selecting the best hypothesis h given data X, assuming that the features of each data point are all independent.
Representation: The representation is based on Bayes' Theorem:
$P(Y \mid X) = \frac{P(X \mid Y) \times P(Y)}{P(X)}$
With the naive hypothesis,
$P(X \mid Y) = P(x_1, x_2, \cdots, x_n \mid Y) = P(x_1 \mid Y) \times P(x_2 \mid Y) \times \cdots \times P(x_n \mid Y) = \prod_{i=1}^{n} P(x_i \mid Y)$
The prediction is the maximum a posteriori (MAP) hypothesis:
$\max(P(Y \mid X)) = \max(P(X \mid Y) \times P(Y))$
Here, the denominator is not kept as it is only for normalization.
Learning: Training is fast because only probabilities need to be calculated:
$P(Y) = \frac{\text{instances}_Y}{\text{all instances}}$, $P(x \mid Y) = \frac{\text{count}(x \wedge Y)}{\text{instances}_Y}$
Variations: Gaussian Naive Bayes can extend to numerical attributes by assuming a Gaussian distribution. Instead of frequency counts, the mean and standard deviation of each attribute are estimated per class during learning (together with the class prior), and the MAP prediction is calculated using the Gaussian PDF:
$f(x \mid \mu(x), \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu(x))^2}{2\sigma^2}}$
$\mu(x) = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu(x))^2}$

Support Vector Machines
SVM is a go-to for high performance with little tuning. Compares extreme values in your dataset.
In SVM, a hyperplane (or decision boundary: $w^T x - b = 0$) is selected to separate the points in the input variable space by their class, with the largest margin. The closest datapoints (defining the margin) are called the support vectors.
→ The goal of a support vector machine is to find the optimal separating hyperplane which maximizes the margin of the training data.
→ Kernel: A kernel is a way of computing the dot product of two vectors x and y in some (possibly very high dimensional) feature space, which is why kernel functions are sometimes called "generalized dot product". The kernel trick is a method of using a linear classifier to solve a non-linear problem by transforming linearly inseparable data to linearly separable ones in a higher dimension.
Given a feature mapping ϕ, we define the kernel K as follows:
$K(x, z) = \phi(x)^T \phi(z)$
In practice, the kernel K defined by $K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$ is called the Gaussian kernel and is commonly used.
Note: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x, z) are needed.
The prediction function is the signed distance of the new input x to the separating hyperplane w, with b the bias:
$f(x) = \langle w, x \rangle - b = w^T x - b$
→ Optimal margin classifier: The optimal margin classifier h is such that:
$h(x) = \mathrm{sign}(w^T x - b)$
where $(w, b) \in \mathbb{R}^n \times \mathbb{R}$ is the solution of the following optimization problem:
$\min \frac{1}{2}\|w\|^2$ such that $y^{(i)}(w^T x^{(i)} - b) \ge 1$
Learning:
→ Hinge loss: The hinge loss is used in the setting of SVMs and is defined as follows:
$L(\hat{y}, y) = [1 - y\hat{y}]_+ = \max(0, 1 - y\hat{y})$
→ Lagrangian: We define the Lagrangian $\mathcal{L}(w, b)$ as follows:
$\mathcal{L}(w, b) = f(w) + \sum_{i=1}^{l} \beta_i h_i(w)$
The Lagrange method is required to convert a constrained optimization problem into an unconstrained optimization problem. The goal of the above equation is to get the optimal value for w and b.
$\left[\lambda\|\vec{w}\|^2 + \frac{1}{n}\sum_{i=1}^{n} \max\left(0, 1 - y_i(\vec{w} \cdot \vec{x}_i - b)\right)\right]$
The first term is the regularization term, which is a technique to avoid overfitting by penalizing large coefficients in the solution vector. The second term, hinge loss, is to penalize misclassifications. It measures the error due to misclassification (or data points being closer to the classification boundary than the margin). The λ is the regularization coefficient, and its major role is to determine the trade-off between increasing the margin size and ensuring that each $x_i$ lies on the correct side of the margin.
Variations:
SVM is implemented using various kernels, which define the similarity measure between new data and support vectors:
1. Linear (dot-product): $K(x, x_i) = \sum (x \times x_i)$
2. Polynomial: $K(x, x_i) = \left(1 + \sum (x \times x_i)\right)^d$
3. Radial: $K(x, x_i) = e^{-\gamma \sum (x - x_i)^2}$
Hyperparameters: regularization parameter (C) and the kernel parameters (such as gamma for the RBF kernel).

K-Nearest Neighbors
If you are similar to your neighbors, you are one of them. KNN uses the entire training data, no training is required.
Note: Higher k → higher bias, lower k → higher variance.
• Choice of k is very critical → A small value of k means that noise will have a higher influence on the result. → A large value of k makes everything classified as the most probable class and is also computationally expensive.
→ A simple approach to select k is to set $k = \sqrt{n}$, or to cross-validate on a small subset of the training data (validation data) by varying values of k and observing training and validation error.
→ Minkowski Distance = $\left(\sum |a_i - b_i|^p\right)^{1/p}$
- p=1 gives Manhattan distance $\sum |a_i - b_i|$
- p=2 gives Euclidean distance $\sqrt{\sum (a_i - b_i)^2}$
→ Hamming Distance - count of the differences between two vectors, often used to compare categorical variables.
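The distance measures and the KNN vote described above can be sketched in a few lines of standard-library Python. The function names and the toy data are assumptions made for illustration; ties in the vote are broken by whichever label was seen first, which is one arbitrary choice among several.

```python
from collections import Counter

def minkowski(a, b, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k=3, p=2):
    """Majority vote among the k training points closest to `query`."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: minkowski(pair[0], query, p))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data, for illustration only
train_X = [(0, 0), (0, 1), (5, 5), (6, 5)]
train_y = ["a", "a", "b", "b"]
prediction = knn_predict(train_X, train_y, query=(0.2, 0.1), k=3)
```

Because KNN has no training step, every prediction scans the full training set, which is why large k and large datasets make it expensive.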
Classification and Regression Trees (CART)
Decision Tree is a Supervised learning technique that can be used for both Classification and Regression problems.
Here, each node represents a question about the data, and the branches from each node represent the possible answers.
→ Root Node: It is the very first node (parent node), and denotes the whole population, and gets split into two or more Decision nodes based on the feature values.
→ Decision Node: At each decision node, the algorithm chooses the best feature and threshold to split the data, aiming to create the most homogeneous subsets. They have multiple branches.
→ This process continues until a stopping condition is met (like maximum depth or pure leaves).
→ Leaf Node: The final predictions are made at the leaf nodes, which represent the outcome of those decisions.
→ Sub-Tree: A branch is a subdivision of a complete tree.
At each leaf node, CART predicts the most frequent category, assuming false negative and false positive costs are the same.
→ The splitting process handles multicollinearity and outliers.
→ Trees are prone to high variance, so tune through CV.
Note: In decision trees, the depth of the tree determines the variance. Decision trees are commonly pruned to control variance.
• CART for regression minimizes SSE by splitting data into sub-regions and predicting the average value at leaf nodes. The complexity parameter cp only keeps splits that reduce loss by at least cp (small cp → deep tree).
• CART for classification minimizes the sum of region impurity, where $p_i$ is the probability of a sample being in category i. Possible measures (for two classes, Gini impurity peaks at 0.5 and cross entropy at 1):
- Gini Impurity / Gini Index / Gini Coefficient = $1 - \sum (p_i)^2$
- Cross Entropy = $-\sum (p_i) \log_2 (p_i)$
Procedure:
1. Calculate entropy of the outcome classes (c)
$E(T) = \sum_{i=1}^{c} -p_i \log_2 p_i$
2. The dataset is split on the different attributes. The entropy of each branch is calculated. Then it is added proportionally to get total entropy for the split. The resulting entropy is subtracted from the entropy before the split.
$Gain(T, X) = Entropy(T) - Entropy(T, X)$
3. Choose the attribute with the largest information gain as the decision node, divide the dataset by its branches and repeat the same process on every branch.
4. A branch with entropy of 0 is a leaf node.
5. A branch with entropy more than 0 needs further splitting.
6. ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
Hyperparameters: The most common stopping criteria for splitting are a minimum number of training observations per node and a maximum depth of the tree.

Ensemble Algorithms
Ensemble methods combine multiple, simpler algorithms (weak learners) to obtain a better-performing algorithm.
- Bagging: Random Forest
- Boosting: AdaBoost, Gradient Boosting, XGBoost
• Bagging Algorithm:
→ It involves parallel training of multiple models independently on different subsets of the data. These subsets of data are drawn using the bootstrap technique.
→ Then averaging their predictions (for regression) or majority voting (for classification).
→ It can reduce the variance and prevent overfitting by averaging out the errors of individual models.
→ Bootstrapping is drawing random sub-samples (sampling with replacement) from a large sample (available data) to estimate a quantity (parameter) of an unknown population by averaging the estimates from these sub-samples.
Random Forest
→ Bagged Decision Trees: Each DT may contain a different number of rows and a different number of features.
→ Individual DTs may overfit, i.e. have low bias (complex model) but high variance. By ensembling a lot of DTs we reduce the variance while not increasing the bias.
Hyperparameters: number of trees, maximum depth of the trees
• Boosting
→ It involves sequential training of multiple models, where each model tries to correct the errors of the previous ones.
AdaBoost
• Uses the same training samples at each stage
• "Weakness" = Misclassified data points
• Learning Focus: Primarily reduces bias (increases variance) by focusing on misclassified instances
Algorithm:
1. Initialize Weights: Assign equal weight to each of the training data points.
2. Train weak model and Evaluate: Provide this as input to the weak model and identify the wrongly classified data points.
3. Adjust Weights: Increase the weight of wrongly classified data points.
4. Combined Models: Combine the weak models using a weighted sum, where weights are based on the accuracy of each learner.
5. Repeat steps 2-4 for a predefined number of iterations or until the error is minimized.
• Limitations: Sensitive to noisy data and outliers since misclassified points are given more focus.
Hyperparameters: number of estimators, learning rate.
Gradient Boosting
• Uses different training samples at each stage
• "Weakness" = Residuals or Errors
• Learning Focus: Instead of adjusting weights, it optimizes the model by minimizing a loss function (e.g., mean squared error for regression).
1. Initialize Model: Start with an initial model (e.g., a constant value such as the average of the targets).
2. Compute Residuals: Calculate the residuals (errors) of the current model.
3. Train Weak Learner: Train a weak learner on the residuals.
4. Update Model: Add the weak learner to the model with a certain learning rate.
5. Repeat steps 2-4 for a fixed number of iterations or until the model converges.
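The gradient boosting steps above can be sketched for squared-error regression on a single feature. This is a toy illustration under stated assumptions: the weak learner is a one-split decision stump, the function names are hypothetical, and the data is made up. It is not how any particular library implements boosting.

```python
def fit_stump(xs, residuals):
    """Weak learner: best single-threshold split on a 1-D feature (min SSE)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds=100, learning_rate=0.1):
    base = sum(ys) / len(ys)                  # step 1: constant initial model
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):                 # step 5: fixed number of rounds
        residuals = [y - p for y, p in zip(ys, preds)]   # step 2
        stump = fit_stump(xs, residuals)                 # step 3
        stumps.append(stump)
        preds = [p + learning_rate * stump(x)            # step 4
                 for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Toy data, for illustration only: a step function
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = gradient_boost(xs, ys)
```

For squared error the residuals are the negative gradient of the loss, which is why fitting each learner to residuals amounts to a gradient step.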
• Limitations: Slower to train and more prone to overfitting without careful tuning.
Hyperparameters: learning rate, number of boosting stages, maximum depth of individual trees.
XGBoost
• Enhances gradient boosting by making it faster, more efficient, and more accurate.
→ Execution speed: Parallelization (it will use all cores of the CPU), cache optimization, out-of-core computation (data size bigger than memory)
→ Model performance:
- Adds regularization to balance the trade-off between fitting the training data and maintaining model simplicity.
- Auto pruning: Prevents trees from growing too large, improving generalization and reducing the risk of overfitting.
- During training, the model learns the optimal way to split data with missing values, learns from the patterns of missing data, and adjusts the decision boundaries accordingly.
- Efficient handling of sparse data
- Flexible: Supports a variety of loss functions and custom objective functions
Hyperparameters: Learning Rate, Number of Trees, Maximum Depth, Min Child Weight, Subsample, Booster Type, Early Stopping Rounds

Data Preparation for Non-Linear Algorithms
1. Data Transformation: Non-linear algorithms can naturally model non-linear relationships without the need for data transformations, as they are capable of capturing complex patterns.
2. Feature Engineering: Non-linear models are less reliant on manual feature engineering. Algorithms like decision trees or neural networks can automatically capture feature interactions and non-linear relationships.
3. Handling Outliers: Non-linear algorithms, such as decision trees and support vector machines, are generally more robust to outliers compared to linear models.
4. Rescaling: Some non-linear models (e.g., neural networks, support vector machines) benefit from rescaling, while others (e.g., decision trees, random forests) do not require rescaling.
5. Assumptions: Non-linear algorithms do not assume a linear relationship between inputs and outputs, allowing them to model more complex relationships in the data without predefined structures.

Advantages of Non-Linear Algorithms
1. Captures Complex Relationships: Can model intricate patterns and interactions in the data that linear algorithms might miss.
2. Higher Flexibility: Adaptable to various data types and structures (e.g., images, text).
3. Improved Accuracy: Often yields better performance on non-linear datasets by fitting the data more closely.
4. Handles High Dimensionality: Suitable for high-dimensional spaces, especially with techniques like kernel methods or neural networks.
5. No Assumption of Linearity: Does not require prior knowledge of the relationship between input features and output, making it versatile.
6. Feature Engineering Not Always Necessary: Some non-linear algorithms, like tree-based models, automatically capture interactions and non-linearities without explicit feature engineering.

Unsupervised Machine Learning
1. Clustering
2. Dimension Reduction
3. Association Rule Mining
4. Graphical Modelling and Network Analysis

Clustering
Grouping objects into meaningful subsets, or clusters. → Objects within each cluster are similar.
Clustering Algorithms:
1. Partition-based methods
(a) K-means clustering
(b) Fuzzy C-Means
2. Hierarchical methods
(a) Agglomerative Clustering
(b) Divisive Clustering
3. Density-based methods
(a) Density-Based methods (DBSCAN)

K-means clustering
The objective of K-means clustering is to minimize the total intra-cluster variance, or the squared error function.
Objective function → $J = \sum_{j=1}^{K} \sum_{i=1}^{n} \left\| X_i^{(j)} - C_j \right\|^2$
Here, K is the number of clusters, n is the number of cases, and $C_j$ is the centroid for cluster j.
1. Divide data into K clusters or groups.
2. Randomly select a centroid for each of these K clusters.
3. Assign data points to their closest cluster centroid according to Euclidean / Squared Euclidean / Manhattan / Cosine distance.
4. Calculate the centroids of the newly formed clusters.
5. Repeat steps 3 and 4 until the same centroids are assigned to each cluster (convergence).
→ K-means always converges (mostly to a local minimum, not to the global minimum)
• How to choose the number of clusters K in the K-Means algorithm?
→ The maximum possible number of clusters will be equal to the number of observations in the dataset.

Hierarchical Clustering
Agglomerative method: "Bottom-up"
1. Compute the distance, or proximity, matrix
2. Initialization: Each observation is a cluster
3. Iteration: Merge the two clusters which are most similar, until all observations are merged into a single cluster.
Divisive method: "Top-down"
1. Compute the distance, or proximity, matrix
2. Initialization: All objects stay in one cluster
3. Iteration: Select a cluster and split it into two sub-clusters, until each leaf cluster contains only one observation.
Proximity (distance) matrix
→ Single linkage: Shortest distance between two points in each cluster (Ward linkage instead minimizes within-cluster variance)
$L(C_1, C_2) = \min D\left(X_i^{C_1}, X_j^{C_2}\right)$
→ Complete linkage: Longest distance between two points in each cluster. Minimize the maximum distance between cluster pairs
$L(C_1, C_2) = \max D\left(X_i^{C_1}, X_j^{C_2}\right)$
→ Average linkage: Minimize the average distance between cluster pairs
$L(C_1, C_2) = \frac{1}{n_{C_1} n_{C_2}} \sum_{i=1}^{n_{C_1}} \sum_{j=1}^{n_{C_2}} D\left(X_i^{C_1}, X_j^{C_2}\right)$

DBSCAN
→ Two parameters: ε-distance, minimum points
→ Three classifications of points:
• Core: has at least the minimum number of points within ε-distance, including itself
• Border: has fewer than the minimum number of points within ε-distance but can be reached by a cluster
• Outlier: point that cannot be reached by any cluster
Procedure:
1. Pick a random point that has not been assigned to a cluster or designated as an Outlier. Determine if it is a Core Point. If not, label the point as Outlier.
2. Once a Core Point has been found, add all directly reachable points to its cluster. Then do neighbor jumps to each reachable point and add them to the cluster. If an Outlier has been added, label it as a Border Point.
3. Repeat these steps until all points are assigned to a cluster or labelled as Outlier.

Dimensionality Reduction Methods
Reduce the number of input variables (attributes or features) in a dataset.
Principal Component Analysis (PCA)
PCA combines highly correlated variables into a new, smaller set of constructs called principal components, which capture most of the variance present in the data.
• Dimensionality reduction
• Feature extraction
• Data visualization
Procedure:
1. Standardize the data: $Z = \frac{X - \text{mean}}{SD}$
2. Calculate the covariance matrix of the standardized data: $V = \mathrm{cov}(Z^T)$
3. Find eigenvalues and eigenvectors from the covariance matrix: values, vectors = eig(V)
4. Feature vectors: the matrix whose columns are the eigenvectors of the components that we decide to keep.
5. Project data → $Z_{new} = \text{vectors}^T \cdot Z^T$

Association Rule Mining
"Market Basket Analysis" → It uses Machine Learning models to analyze data for patterns, or co-occurrence, in a database.

Graphical Modelling and Network Analysis
"Bayesian Networks"

Neural Network
A neural network is a type of machine learning model that mimics the structure and function of the human brain to recognize patterns, make decisions, and learn from data.
→ Input Layer: The first layer that receives the input data. Each neuron in this layer corresponds to a feature of the input data.
→ Hidden Layers: Layers between the input and output layers where the network learns complex patterns.
→ Output Layer: The final layer that produces the network's output, such as a prediction or classification.
• Perceptron - the foundation of a neural network, and it is a single-layer neural network. An Artificial Neuron is a basic building block of a neural network.
• Neural Network - a multi-layer perceptron
→ Weights: are the real values that are attached to each input/feature and they convey the importance of that feature in predicting the final output.
→ Bias: is used for shifting the activation function towards left or right.
→ Summation Function: used to bind the weights and inputs together and calculate their sum.
→ Activation Function: decides whether a neuron should be activated or not, and it introduces non-linearities into the network, which makes the network capable of learning and performing more complex tasks.
- Sigmoid: $\frac{1}{1 + e^{-z}}$
- ReLU: $\max(0, z)$
- Tanh: $\frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
→ Softmax - used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes. These probabilities sum to 1 → $\frac{e^{z_i}}{\sum_j e^{z_j}}$
→ If there is more than one 'correct' label, the sigmoid function provides probabilities for all, some, or none of the labels.

How Neural Networks Work?
• Forward Propagation: The input data is passed through the network, layer by layer, with each neuron applying its weights and bias to the input and passing the result through the activation function. The final layer produces the output.
• Backpropagation: Backpropagation is an algorithm used in neural networks to adjust the internal weights and biases to minimize the error calculated by the loss function.
– Regression Loss: Mean Squared Error / Squared loss / L2 loss, Mean Absolute Error / L1 loss, Huber Loss
– Classification Loss: Binary Cross Entropy / log loss, Categorical Cross Entropy
The common loss functions are summed up in the list below:
- Least squared error: $\frac{1}{2}(y - \hat{y})^2$ (Linear Regression)
- Logistic loss: $\log(1 + \exp(-y\hat{y}))$ (Logistic Regression)
- Hinge loss: $\max(0, 1 - y\hat{y})$ (SVM)
→ During training, the network uses a supervised learning method where the difference (error) between the network's predicted output and the known expected output is calculated. This error is then propagated back through the network via backpropagation to compute the gradient of the loss function with respect to each weight. An optimization algorithm, such as gradient descent, uses these gradients to update the weights and biases, reducing the error and improving the model's accuracy over time.
• Training: The process of forward propagation, loss calculation, and backpropagation is repeated over many iterations, allowing the network to learn from the data and improve its accuracy.
• To prevent overfitting, regularization can be applied by:
– Stopping training when validation performance drops
– Dropout - randomly drop some nodes during training to prevent over-reliance on a single node
– Embedding weight penalties into the objective function
– Batch Normalization - stabilizes learning by normalizing inputs to a layer
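The activation functions and the three loss functions listed in this section can be written directly from their formulas. The sketch below uses only the standard library; the function names are illustrative, and the logistic and hinge losses assume labels y in {-1, +1} with ŷ the raw model score.

```python
import math

def sigmoid(z):
    """1 / (1 + e^-z), written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def relu(z):
    return max(0.0, z)

def tanh(z):
    return math.tanh(z)  # (e^z - e^-z) / (e^z + e^-z)

def softmax(zs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(zs)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def squared_loss(y, y_hat):
    return 0.5 * (y - y_hat) ** 2

def logistic_loss(y, y_hat):
    return math.log(1.0 + math.exp(-y * y_hat))

def hinge_loss(y, y_hat):
    return max(0.0, 1.0 - y * y_hat)

probs = softmax([1.0, 2.0, 3.0])
```

Hinge loss is exactly zero once a point is on the correct side with margin at least 1, whereas logistic loss only approaches zero.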
Convolutional Neural Network
CNN is a neural network architecture that is well-suited for image classification and object recognition tasks. The general CNN architectures are as shown below:
→ A convolutional neural network starts by taking an input image, represented as a matrix of pixel values.
→ This input image is passed through convolutional layers. Here, a set of filters is applied to the input image to detect features like edges, textures, and patterns. Each filter produces a feature map that highlights a specific aspect of the input image.
→ After each convolution, an activation function (like ReLU) is applied to introduce non-linearity, enabling the network to learn more complex patterns.
→ This produces feature maps. Different weights lead to different feature maps.
→ The feature maps are then passed through pooling layers, which downsample the spatial dimensions by taking the maximum or average value in small regions. This reduces the size of the feature maps and retains essential information, making the network more efficient and less sensitive to slight changes in the input.
→ Again, the feature maps produced by the convolutional layer and pooling layer are then passed through multiple additional convolutional and pooling layers, each layer learning increasingly complex features of the input image.
→ Now, the output obtained from above is fed into a fully connected layer for classification, object detection, or other structural analyses. The final output of the network is a predicted class label or probability score for each class, depending on the task.
Question: Describe the difference between batch normalization and layer normalization.

Recurrent Neural Network
Recurrent Neural Networks (RNNs) are designed to process sequences of data such as time series data, voice, natural language, and other activities.
→ An RNN memorizes information from previous data with feedback loops inside it, which helps to keep data information over time.
→ It has an arrow pointing to itself, indicating that the data inside block "A" will be recursively used. Once expanded, its structure is equivalent to a chain-like structure.
→ Learning to store information over long time intervals via recurrent backpropagation takes a very long time, because the gradients gradually vanish as they propagate to earlier time steps. These downstream gradients rely on parameter (weight) sharing for efficiency, and repeatedly multiplying values greater than or less than 1 leads to:
– Exploding gradients - model instability and overflows
– Vanishing gradients - loss of learning ability
→ This can be solved using:
– Gradient clipping - cap the maximum value of gradients
– ReLU - its derivative prevents gradient shrinkage for x > 0
– Gated cells - regulate the flow of information
Also, because the problem is non-convex, RNN training can confuse a local minimum with the global minimum. To overcome these problems, LSTM has been introduced as an RNN architecture for language modelling.
• Figure: Vanishing gradient problem for RNNs. The sensitivity decays over time as the network backpropagates through time. The darker the shade, the greater the sensitivity.
• Figure: Preservation of gradient information by LSTM. The sensitivity of the output layer can be switched on and off.
→ LSTM memorizes information for a long period of time. The difference between RNN and LSTM is: an RNN cell has only one tanh layer, while an LSTM cell has four layers: forget gate layer, store gate layer, new cell state layer, output layer, and previous cell state as shown in Figure below.

Time Series
Time Series is generally data which is collected over time and is dependent on it.
It is a random sequence $\{X_t\}$ of real values recorded at successive equally spaced points in time.
→ Not every data collected with respect to time represents a time series.
→ Time Series Modeling: methods of prediction and forecasting for time-based data
• Examples of time series: Stock Market Price, Passenger Count of airlines, Temperature over time, Monthly Sales Data, Quarterly/Annual Revenue, Hourly Weather Data/Wind Speed, IoT sensors in Industries and Smart Devices, Energy Forecasting
→ Difference between Time Series and Regression
• Time Series is time dependent. However, the basic assumption of a linear regression model is that the observations are independent.
• Along with an increasing or decreasing trend, most Time Series have some form of seasonality trends
Note:
→ Predicting a time series using regression techniques is not a good approach.
→ Time series forecasting is the use of a model to predict future values based on previously observed values.
→ A stochastic process is defined as a collection of random variables $X = \{X_t : t \in T\}$ defined on a common probability space, taking values in a common set S (the state space), and indexed by a set T, often either $\mathbb{N}$ or $[0, \infty)$ and thought of as time (discrete or continuous respectively) (Oliver, 2009).

Time Series Statistical Models
A time series model specifies the joint distribution of the sequence $\{X_t\}$ of random variables; e.g.,
$P(X_1 \le x_1, \ldots, X_t \le x_t)$ for all t and $x_1, \ldots, x_t$
Typically, a time series model can be described as
$X_t = m_t + s_t + Y_t$
where $m_t$: trend component; $s_t$: seasonal component; $Y_t$: zero-mean error
Note: The following are some zero-mean models
→ iid noise: The simplest time series model is the one with no trend or seasonal component, where the observations $X_t$ are simply independent and identically distributed random variables with zero mean. Such a sequence of random variables $\{X_t\}$ is referred to as iid noise.
$P(X_1 \le x_1, \ldots, X_t \le x_t) = \prod_t P(X_t \le x_t) = \prod_t F(x_t)$
where $F(\cdot)$ is the cdf of each $X_t$. Further, $E(X_t) = 0$ for all t. We denote such a sequence as $X_t \sim \mathrm{IID}(0, \sigma^2)$. IID noise is not interesting for forecasting, since the distribution of $X_t$ given $X_1, \ldots, X_{t-1}$ is the same as that of $X_t$.
→ iid noise example: A binary (discrete) process $\{X_t\}$ is a sequence of iid random variables $X_t$ with
$P(X_t = 1) = 0.5$, $P(X_t = -1) = 0.5$
→ Gaussian Noise example: A continuous process: Gaussian noise $\{X_t\}$ is a sequence of iid normal random variables with zero mean and variance $\sigma^2$; i.e., $X_t \sim N(0, \sigma^2)$
→ Random walk: The random walk $\{S_t, t = 0, 1, 2, \ldots\}$ (starting at zero, $S_0 = 0$) is obtained by cumulatively summing (or "integrating") random variables; i.e., $S_0 = 0$ and
$S_t = X_1 + \cdots + X_t$, for $t = 1, 2, \ldots$
where $\{X_t\}$ is iid noise with zero mean and variance $\sigma^2$. Note that by differencing, we can recover $X_t$; i.e.,
$\nabla S_t = S_t - S_{t-1} = X_t$
Further, we have
$E(S_t) = E\left(\sum_{i=1}^{t} X_i\right) = \sum_{i=1}^{t} E(X_i) = \sum_{i=1}^{t} 0 = 0$
$\mathrm{Var}(S_t) = \mathrm{Var}\left(\sum_{i=1}^{t} X_i\right) = \sum_{i=1}^{t} \mathrm{Var}(X_i) = t\sigma^2$
→ White Noise: We say $\{X_t\}$ is a white noise; i.e., $X_t \sim \mathrm{WN}(0, \sigma^2)$, if $\{X_t\}$ is uncorrelated, i.e., $\mathrm{Cov}(X_{t_1}, X_{t_2}) = 0$ for any $t_1 \ne t_2$, with $E[X_t] = 0$ and $\mathrm{Var}(X_t) = \sigma^2$.
Note: Every $\mathrm{IID}(0, \sigma^2)$ sequence is $\mathrm{WN}(0, \sigma^2)$ but not conversely.
• Moving Average Smoother: This is an essentially non-parametric method for trend estimation. It takes averages of observations around t; i.e., it smooths the series. For example, let
$X_t = \frac{1}{3}(W_{t-1} + W_t + W_{t+1})$
which is a three-point moving average of the white noise series $W_t$.
→ AR(1) model (Autoregression of order 1): Let
$X_t = 0.6 X_{t-1} + W_t$
where $W_t$ is a white noise series. It represents a regression or prediction of the current value $X_t$ of a time series as a function of the previous value of the series.

Stationary Process
Extracts characteristics from time-sequenced data, which may exhibit the following characteristics:
– Stationarity - statistical properties such as mean, variance, and auto-correlation are constant over time, with an autocovariance that does not depend on time, and no trend or seasonality
– Non-Stationary - There are 2 major reasons behind the non-stationarity of a Time Series:
• Trend - varying mean over time (mean is not constant)
• Seasonality - variations at specific time-frames (standard deviation is not constant)
– Trend - Trend is a general direction in which something is developing or changing.
– Seasonality - Any predictable change or pattern in a time series that recurs or repeats over a specific time period (calendar times), occurring at regular intervals less than a year
– Cyclicality - variations without a fixed time length, occurring in periods of greater or less than one year
– Autocorrelation - degree of linear similarity between current and lagged values
• CV must account for the time aspect, such as for each fold $F_x$:
– Sliding Window - train $F_1$, test $F_2$, then train $F_2$, test $F_3$
– Forward Chain - train $F_1$, test $F_2$, then train $F_1, F_2$, test $F_3$
• Exponential Smoothing - applies exponentially decreasing weights to observations over time, and takes a moving average. The time t output is $s_t = \alpha x_t + (1 - \alpha) s_{t-1}$, where $0 < \alpha < 1$.
• Double Exponential Smoothing - applies a recursive exponential filter to capture trends within a time series
$s_t = \alpha x_t + (1 - \alpha)(s_{t-1} + b_{t-1})$
$b_t = \beta(s_t - s_{t-1}) + (1 - \beta) b_{t-1}$
Triple exponential smoothing adds a third variable γ that accounts for seasonality.
• ARIMA - models time series using three parameters (p, d, q):
– Autoregressive - the past p values affect the next value
– Integrated - values are replaced with the difference between current and previous values, using the difference degree d (0 for stationary data, and 1 for non-stationary)
– Moving Average - the number of lagged forecast errors and the size of the moving average window q
• SARIMA - models seasonality through four additional seasonality-specific parameters: P, D, Q, and the season length s
• Prophet - additive model that uses non-linear trends to account for multiple seasonalities such as yearly, weekly, and daily.
→ Robust to missing data and handles outliers well.
→ Can be represented as: $y(t) = g(t) + s(t) + h(t) + \epsilon(t)$, with four distinct components for the growth over time, seasonality, holiday effects, and error. This specification is similar to a generalized additive model.
• Generalized Additive Model - combines predictive methods while preserving additivity across variables, in a form such as $y = \beta_0 + f_1(x_1) + \cdots + f_m(x_m)$, where functions can be non-linear.
→ GAMs also provide regularized and interpretable solutions for regression and classification problems.
Tutorial: Complete Guide on Time Series Analysis in Python

Natural Language Processing
NLP is the discipline of building machines that can manipulate human language, or data that resembles human language, in the way that it is written, spoken, and organized. It evolved from computational linguistics.
NLP Applications
Evolution of NLP
Challenges in NLP
• The 3 stages of an NLP pipeline are: Text Processing → Feature Extraction → Modeling.

Text Processing
Take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction.
Libraries: nltk, spacy
– Lower casing
– Removing other elements such as punctuation, tags, and URLs, depending on the problem
– Convert chat words used in social media to normal words
– Spelling correction using libraries like TextBlob
– Stop words - removes common and irrelevant words (the, is)
Note: Do not remove stop words when using POS Tagging in text processing.
– Tokenization - splits text into individual words (tokens) and word fragments.
• Sentence-level tokenization involves splitting a text into individual sentences.
• Word-level tokenization involves splitting each sentence into individual words or tokens.
– Lemmatization - reduces words to their base form based on dictionary definition (am, are, is → be)
– Stemming - reduces words to their base form without context (ended → end)
– Language Detection

Advanced Text Processing
– POS Tagging
– Parse Tree
– Coreference Resolution
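A few of the basic text-processing steps in this section (lower casing, removing URLs, tags and punctuation, tokenization, stop-word removal) can be sketched with the standard library alone. The text names nltk and spacy for real work; the function names and the tiny stop-word list here are assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to"}

def preprocess(text):
    """Lower-case, strip URLs/tags/punctuation, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"<[^>]+>", " ", text)        # HTML-like tags
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # punctuation and symbols
    tokens = text.split()                       # word-level tokenization
    return [t for t in tokens if t not in STOP_WORDS]

def split_sentences(text):
    """Naive sentence-level tokenization on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

tokens = preprocess("The cat is <b>HERE</b>: see https://example.com now!")
sentences = split_sentences("Hello there. How are you? Fine!")
```

The regular-expression sentence splitter breaks on abbreviations such as "e.g.", which is one reason dedicated tokenizers are usually preferred.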
Feature Extraction Note: A word is important if it occurs many times in a document. similar contexts, tend to have related meanings. It builds a global
But that creates a problem. Words like “a” and “the” appear co-occurrence matrix that captures the frequency of word
→ Feature Extraction = Text Representation = Text Vectorization often. And as such, their TF score will always be high. We resolve co-occurrences within a context window across the entire corpus
Common Terms: this issue by using Inverse Document Frequency, which is high if → Based on transformer architecture
• Corpus • Vocabulary • Document • Word the word is rare and low if the word is common across the corpus. • BERT - accounts for word order and trains on subwords, and unlike
The TF-IDF score of a term is the product of TF and IDF. word2vec and GloVe, BERT outputs different vectors for different
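uses of words (cell phone vs. blood cell)
Cosine Similarity - measures similarity between vectors, calculated as $\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$, which ranges from 0 to 1 for non-negative vectors such as word counts or TF-IDF scores (and from −1 to 1 in general)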
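The TF-IDF weighting and cosine similarity described above can be sketched as follows. This assumes one common unsmoothed variant, TF = count / document length and IDF = log(N / document frequency); libraries often use smoothed formulas, so their numbers will differ. Function names and the toy corpus are illustrative, and documents are assumed to be non-empty token lists.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists -> list of {term: tf-idf score} dicts."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Toy corpus, for illustration only
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird", "flew"]]
vecs = tf_idf(docs)
```

In this corpus "the" appears in every document, so its IDF and therefore its TF-IDF score is zero, which is the down-weighting of common words the note describes.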
Sentiment Analysis
Extracts the attitudes and emotions from text
• Polarity - measures positive, negative, or neutral opinions
– Valence shifters - capture amplifiers or negators such as ’really
fun’ or ’hardly fun’
• Sentiment - measures emotional states such as happy or sad
• Subject-Object Identification - classifies sentences as either
subjective or objective
→ Most conventional machine learning techniques work on the
features – generally numbers that describe a document in relation to Topic Modelling
the corpus that contains it – created by either Bag-of-Words, TF-IDF,
Captures the underlying themes that appear in documents
or generic (custom) feature engineerings such as document length,
• Latent Dirichlet Allocation (LDA) - generates k topics by first
word polarity, and metadata (for instance, if the text has associated
assigning each word to a random topic, then iteratively updating
tags or scores). → CountVectorizer - Bag of Words assignments based on parameters α, the mix of topics per document,
Note: Deep learning does not require to do feature engineering → TfidfTransformer - TF-IDF values and β, the distribution of words per topic
→ TfidfVectorizer - Bag of Words AND TF-IDF values • Latent Semantic Analysis (LSA) - identifies patterns using tf-idf
– Bag-of-words - counts the number of times each word or n-gram
Word Embedding scores and reduces data to k dimensions through SVD
(combination of n words) appears in a document.
Word embeddings are often based on neural network models in deep NLP Tutorial
learning. Duplicate Question Pairs - Quora Questions Pairs: NLP Pipeline
→ Based on CBOW, Skip gram: Word2vec, GloVe, fastText
– Continuous bag-of-words (CBOW) - predicts the word given its
context
– skip-gram - predicts the context given a word
• word2vec - trains iteratively over a corpus of text to learn the
association between the words, and preserve the semantic information
as well as contextual meanings of words within a given corpus of text.
→ They are numerical representations of words and phrases allowing
similar words to have similar vector representations.
– n-gram - predicts the next term in a sequence of n terms based → It uses the cosine similarity metric to measure semantic similarity.
on Markov chains If the cosine angle is one, it means that the words are overlapping. ,
→ Markov Chain - stochastic and memoryless process that such that king − man + woman ≈ queen
predicts future events based only on the current state Note: According to research CBOW is used when small dataset is
available.
– tf-idf - In contrast, with TF-IDF, we weight each word by its

importance. To evaluate a word’s significance, we consider two
things:
1. Term Frequency: How important is the word in the
document? TF(word in a document) =
Number of occurrences of that word in document
Number of words in document
2. Inverse Document Frequency: How important is the word
in the whole corpus (a collection of documents)?
IDF(word in a corpus) = • GloVe (Global Vectors for Word Representation) - GloVe operates
number of documents in the corpus
log( number of documents that include the word
) on the idea that words that frequently co-occur together, sharing
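To make the TF, IDF, and cosine similarity definitions in the Feature Extraction section concrete, here is a minimal sketch in plain Python, applied to a toy duplicate-question setting. The three example questions, whitespace tokenization, and the function names (`tf`, `idf`, `tfidf_vector`, `cosine_similarity`) are assumptions made for illustration. The IDF here is the unsmoothed log ratio given in this cheatsheet, so the numbers will differ from library implementations that smooth or normalize.

```python
import math


def tf(term, doc_tokens):
    """Term frequency: occurrences of the term divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus_tokens):
    """Inverse document frequency: log(total documents / documents containing the term)."""
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)


def tfidf_vector(doc_tokens, corpus_tokens, vocabulary):
    """TF-IDF weight (TF times IDF) for every vocabulary term, in vocabulary order."""
    return [tf(t, doc_tokens) * idf(t, corpus_tokens) for t in vocabulary]


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| ||B||); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus (hypothetical questions); the vocabulary is built from the corpus,
# so every term appears in at least one document and idf() never divides by zero.
corpus = [
    "how do i learn python",
    "how do i learn python quickly",
    "what is the capital of france",
]
tokens = [doc.split() for doc in corpus]
vocabulary = sorted({t for doc in tokens for t in doc})
vectors = [tfidf_vector(doc, tokens, vocabulary) for doc in tokens]

sim_similar = cosine_similarity(vectors[0], vectors[1])    # near-duplicate questions
sim_unrelated = cosine_similarity(vectors[0], vectors[2])  # no shared words
```

The two near-duplicate questions score well above the unrelated pair, which shares no terms and therefore has similarity zero. A threshold on this score is one simple baseline for flagging duplicate questions.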
Recommender System
The Recommender System utilizes machine learning and data analysis to provide personalized suggestions to users.
→ It operates by collecting and analyzing user behavior, user preferences, and historical user-item interactions.
• Two main types of recommender system:
– Collaborative Filtering - recommends what similar users like
– Content Filtering - recommends similar items
Collaborative filtering is more common and includes methods such as:
• Memory-based Approaches - finds neighborhoods by using rating data to compute user and item similarity, measured using correlation or cosine similarity
– User-User - similar users also liked...
  – Leads to more diverse recommendations, as opposed to just recommending popular items
  – Suffers from sparsity, as the number of users who rate items is often low
– Item-Item - similar users who liked this item also liked...
  – Efficient when there are more users than items, since the item neighborhoods update less frequently than users
  – Similarity between items is often more reliable than similarity between users
• Model-based Approaches - predict ratings of unrated items, through methods such as Bayesian networks, SVD, and clustering. Handles sparse data better than memory-based approaches.
– Matrix Factorization - decomposes the user-item rating matrix into two lower-dimensional matrices representing the users and items, each with k latent factors
→ Recommender systems can also be combined through ensemble methods to improve performance.

References
[1] Rahul Beakta. “Big data and hadoop: A review paper”. In: International Journal of Computer Science & Information Technology 2.2 (2015), pp. 13–15.
[2] M. Sundermeyer, H. Ney, and R. Schlüter. “From Feedforward to Recurrent LSTM Neural Networks for Language Modeling”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 23.3 (Mar. 2015), pp. 517–529. issn: 2329-9290. doi: 10.1109/TASLP.2015.2400218.
[3] Varsha B Bobade. “Survey paper on big data and Hadoop”. In: Int. Res. J. Eng. Technol 3.1 (2016), pp. 861–863.
[4] D. Dong, Z. Sheng, and T. Yang. “Wind Power Prediction Based on Recurrent Neural Network with Long Short-Term Memory Units”. In: 2018 International Conference on Renewable Energy and Power Engineering (REPE). Nov. 2018, pp. 34–38. doi: 10.1109/REPE.2018.8657666.
[5] Analog Devices. Training Convolutional Neural Networks: What is Machine Learning? Part 2. Analog Dialogue. url: https://www.analog.com/en/analog-dialogue/articles/training-convolutional-neural-networks-what-is-machine-learning-part-2.html.
[6] deeplearning.ai. Natural Language Processing Resources. deeplearning.ai. url: https://www.deeplearning.ai/resources/natural-language-processing/.
[7] Edureka. MapReduce Tutorial. Edureka. url: https://www.edureka.co/blog/mapreduce-tutorial/.
[8] Edureka. Top 50 Hadoop Interview Questions (2016). Edureka. url: https://www.edureka.co/blog/interview-questions/top-50-hadoop-interview-questions-2016/.
[9] Nilay Chauhan. Getting Started with NLP Pipelines. Kaggle. url: https://www.kaggle.com/code/nilaychauhan/getting-started-with-nlp-pipelines.
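Returning to the memory-based, user-user collaborative filtering described in the Recommender System section, the sketch below shows one simple way it could work: compute cosine similarity between users' rating vectors, then predict a missing rating as a similarity-weighted average of the neighbors' ratings. The users, items, ratings, and the function names `cosine` and `predict` are all made up for illustration, and unrated items are treated as zeros when computing the norms, which is only one of several common conventions.

```python
import math

# Hypothetical user -> {item: rating} data; missing items are simply absent.
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "D": 5},
    "carol": {"A": 1, "C": 5, "D": 2},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (missing ratings count as 0)."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def predict(user, item, ratings):
    """Similarity-weighted average of other users' ratings for the item, or None if nobody rated it."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    return num / den if den else None


# Alice has not rated item D; Bob (similar tastes) rated it 5, Carol (dissimilar) rated it 2.
predicted_d = predict("alice", "D", ratings)
```

Because Alice's ratings resemble Bob's far more than Carol's, the prediction for item D is pulled toward Bob's rating. The sparsity problem noted in the section shows up directly here: with few shared items, the similarities rest on very little evidence.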
Last Updated September 30, 2024
