Unit - 1
Machine Learning
Machine Learning (ML) is a branch of artificial intelligence (AI) that enables computers to learn
from data and improve their performance on tasks over time, without being explicitly
programmed. The core idea is to make data-driven predictions or decisions by building
mathematical models based on sample data, known as "training data."

ML involves several important concepts, techniques, and classifications. Below are detailed
notes covering key concepts.

Types of Machine Learning


1. Supervised Learning

Supervised learning is a type of ML where the model is trained on labeled data. The
algorithm learns the mapping function from the input (features) to the output (target labels).

Input: Training data with known labels.

Output: A function that can predict the label for new, unseen data.

Examples:

Classification: Identifying whether an email is spam or not.

Regression: Predicting housing prices.

Popular Algorithms:

Linear Regression

Logistic Regression

Support Vector Machines (SVM)

Decision Trees

Random Forests

2. Unsupervised Learning

In unsupervised learning, the model is trained on data without any labels. The goal is to
identify underlying structures in the data. This type of learning is used for clustering,
association, and dimensionality reduction.

Input: Unlabeled data.

Output: Patterns or groupings in the data.

Examples:

Clustering: Grouping customers based on purchasing behavior (K-means, Hierarchical Clustering).

Dimensionality Reduction: Reducing the complexity of data while preserving its key
characteristics (Principal Component Analysis - PCA).

Popular Algorithms:

K-Means Clustering

DBSCAN (Density-Based Spatial Clustering)

Principal Component Analysis (PCA)
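
As a concrete illustration of the clustering idea above, here is a minimal K-Means sketch, assuming scikit-learn and NumPy are available; the two-group data is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic "customer" groups in a 2-D feature space
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.labels_[:5])        # cluster assignment of the first 5 points
print(kmeans.cluster_centers_)   # learned cluster centroids
```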

3. Semi-Supervised Learning

Semi-supervised learning is a hybrid approach, where a small amount of labeled data is used alongside a large amount of unlabeled data. This can improve the accuracy of models when labeling data is expensive or time-consuming.
Examples:

Semi-supervised learning is used in scenarios where obtaining labeled data is expensive (e.g., medical image classification).

Popular Algorithms:

Self-training

Co-training

4. Reinforcement Learning

In reinforcement learning, an agent learns by interacting with its environment and receiving
rewards or penalties based on its actions. The goal is to maximize the cumulative reward
over time.

Input: Environment states.

Output: Actions that maximize rewards.

Examples:

Game-playing AI (AlphaGo)

Autonomous driving.

Popular Algorithms:

Q-Learning

Deep Q-Networks (DQN)

Policy Gradient Methods
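
To make the reward-driven learning loop concrete, here is a toy tabular Q-learning sketch on a hypothetical 5-state chain, where the agent moves left or right and is rewarded only on reaching the last state. The environment, hyperparameters, and reward are all illustrative assumptions.

```python
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

def choose_action(s):
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))        # explore
    best = np.flatnonzero(Q[s] == Q[s].max())      # greedy, random tie-break
    return int(rng.choice(best))

for _ in range(500):                  # episodes
    s = 0
    while s != n_states - 1:
        a = choose_action(s)
        s_next, r = step(s, a)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.round(2))  # learned action values; "right" should dominate
```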

Key Concepts in Machine Learning


1. Training and Testing
Machine learning models learn patterns from training data and are evaluated on testing
data. The typical process includes:

Splitting the dataset into training and testing sets (e.g., 80% training, 20% testing).

Training the model on the training set.

Testing and evaluating performance using metrics like accuracy, precision, recall, etc.
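
A minimal sketch of this split-train-evaluate workflow, assuming scikit-learn is available; the dataset and model choice are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
# 80% training, 20% testing, as in the split described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```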

2. Overfitting and Underfitting

Overfitting: When a model performs well on the training data but poorly on new,
unseen data. This happens when the model learns the noise in the training data.

Underfitting: When a model is too simple and cannot capture the underlying trends in
the data, leading to poor performance on both the training and testing sets.

Preventive Measures:

Cross-validation: A technique where the dataset is split into multiple parts, and the
model is trained and tested on different splits to avoid overfitting.

Regularization: Techniques like L1 and L2 regularization add a penalty for larger coefficients in models, preventing overfitting.
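
A minimal sketch of both preventive measures together, assuming scikit-learn: Ridge is linear regression with an L2 penalty (alpha controls its strength), evaluated with 5-fold cross-validation.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
# 5-fold CV: train on 4 folds, validate on the 5th, rotate, average
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```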

3. Bias-Variance Tradeoff

The bias-variance tradeoff is a fundamental concept in ML:

Bias: Error due to overly simplistic models that fail to capture complex patterns (leading
to underfitting).

Variance: Error due to a model that is too complex and captures noise in the training
data (leading to overfitting).
The goal is to find a balance between bias and variance.

4. Model Evaluation Metrics

Several metrics are used to evaluate the performance of ML models:

Accuracy: The percentage of correct predictions.

Precision: The proportion of true positive predictions out of all positive predictions.

Recall (Sensitivity): The proportion of true positive predictions out of all actual
positives.

F1 Score: The harmonic mean of precision and recall, useful when you need to balance
them.

Confusion Matrix: A matrix that shows true positives, true negatives, false positives,
and false negatives.
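
A worked sketch of the metrics above, computed from hypothetical confusion-matrix counts (TP = 40, FP = 10, FN = 5, TN = 45; the numbers are illustrative).

```python
tp, fp, fn, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + tn + fp + fn)       # 0.85
precision = tp / (tp + fp)                        # 0.80
recall    = tp / (tp + fn)                        # ~0.889
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, round(recall, 3), round(f1, 3))
```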

5. Feature Selection and Engineering

Feature Selection: The process of selecting the most important features from the
dataset to improve model performance and reduce overfitting.

Feature Engineering: Creating new features or transforming existing ones to improve model performance. Examples include scaling, encoding categorical variables, or creating interaction terms.

Common Machine Learning Algorithms

2. Logistic Regression
Logistic regression is a classification algorithm that models the probability of a class label using the logistic (sigmoid) function.

Use Cases:

Spam detection.

Disease diagnosis (predicting whether a patient has a disease).

3. Decision Trees

Decision trees are non-parametric models used for both classification and regression
tasks. They recursively split the data based on feature values to form a tree-like structure,
where each internal node represents a decision based on a feature, and each leaf node
represents an outcome.
Key Concepts:

Gini Impurity: Used to measure the quality of a split in classification tasks (see the worked sketch after this list).

Entropy: A measure of uncertainty, used in information gain calculations.

Advantages:

Easy to interpret.

Handles both numerical and categorical data.
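
A worked sketch of Gini impurity and entropy for a hypothetical node containing 6 samples of class A and 4 of class B (the class counts are illustrative).

```python
import math

p = [6 / 10, 4 / 10]                            # class proportions at the node

gini = 1 - sum(pi ** 2 for pi in p)             # 1 - (0.36 + 0.16) = 0.48
entropy = -sum(pi * math.log2(pi) for pi in p)  # ~0.971 bits

print(gini, round(entropy, 3))
```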

4. Support Vector Machines (SVM)


SVMs are used for classification tasks and work by finding the hyperplane that best
separates the data into different classes. SVMs are effective in high-dimensional spaces
and with complex decision boundaries.

Key Concepts:

Margin: The distance between the hyperplane and the nearest data point from each
class.

Kernel Trick: Transforms data into higher dimensions to make it linearly separable.
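
A minimal sketch of the kernel trick in practice, assuming scikit-learn: an SVM with the RBF kernel on data that is not linearly separable in two dimensions. The dataset and hyperparameters are illustrative.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(noise=0.2, random_state=0)   # not linearly separable in 2-D
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.score(X, y))                         # training accuracy
```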

5. K-Nearest Neighbors (KNN)


KNN is a simple algorithm that classifies data points based on their proximity to other data
points in the training set. It assigns the majority class of the k-nearest points to a new data
point.
Advantages:

Simple to understand and implement.

No explicit training phase; the stored data itself is used at prediction time (see the sketch below).
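
A minimal from-scratch k-NN sketch (k = 3, Euclidean distance, majority vote); the training points are illustrative.

```python
import numpy as np

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5], [2, 1], [6, 6]])
y_train = np.array([0, 0, 1, 1, 0, 1])

def knn_predict(x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every stored point
    nearest = y_train[np.argsort(dists)[:k]]      # labels of the k closest
    return np.bincount(nearest).argmax()          # majority vote

print(knn_predict(np.array([2, 2])))  # -> 0
print(knn_predict(np.array([5, 6])))  # -> 1
```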

6. Neural Networks
Neural networks are inspired by the structure of the human brain and are used for both
classification and regression tasks. They consist of layers of neurons, where each neuron
performs a weighted sum of its inputs and passes the result through a non-linear activation
function.

Key Concepts:

Activation Function: Determines whether a neuron should be activated. Common functions include ReLU (Rectified Linear Unit), Sigmoid, and Tanh.

Backpropagation: The process of updating weights in a neural network by calculating the error and propagating it back through the layers.
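
A minimal sketch of one forward pass through a single hidden layer: each layer computes a weighted sum of its inputs and applies a non-linear activation, as described above. The layer sizes and random weights are illustrative.

```python
import numpy as np

def relu(z):    return np.maximum(0, z)
def sigmoid(z): return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x  = rng.normal(size=(3,))          # 3 input features
W1 = rng.normal(size=(4, 3))        # hidden layer: 4 neurons
b1 = np.zeros(4)
W2 = rng.normal(size=(1, 4))        # output layer: 1 neuron
b2 = np.zeros(1)

h = relu(W1 @ x + b1)               # hidden activations
y_hat = sigmoid(W2 @ h + b2)        # output in (0, 1), e.g. a class probability
print(y_hat)
```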

Applications of Machine Learning


1. Healthcare
ML is used in diagnosing diseases, personalized medicine, and drug discovery. Models can
predict diseases based on patient data, enabling early diagnosis.

2. Finance
Machine learning is used for stock price prediction, fraud detection, and algorithmic
trading.

3. E-commerce
Recommendation engines use ML to suggest products to users based on their browsing
and purchasing history.

4. Autonomous Vehicles
Self-driving cars use ML algorithms for tasks like object detection, path planning, and
decision-making.

5. Natural Language Processing (NLP)
ML underpins NLP applications such as machine translation, sentiment analysis, chatbots, and speech recognition.

Perspectives and Issues in Machine Learning


As machine learning (ML) becomes increasingly integrated into a wide range of fields, several
perspectives and issues have emerged that influence its development, application, and ethics.
These perspectives include the technological, societal, and ethical implications of ML, while
issues range from technical challenges to concerns about fairness and bias. Below is a
comprehensive overview of these perspectives and issues:

Perspectives in Machine Learning


1. Technological Perspective
Machine learning is seen as a transformative technology with the potential to revolutionize
industries, automate tasks, and create intelligent systems. From this perspective, ML is:

Driving innovation in areas like healthcare, finance, and transportation through automation and decision-making.

Enabling AI-driven applications such as natural language processing (NLP), computer vision, and speech recognition.

Accelerating scientific discoveries by processing large datasets and identifying patterns that may not be obvious to human researchers.

Current Trends:

Deep Learning: A subset of ML that uses artificial neural networks for complex tasks
like image and speech recognition.

Edge Computing in ML: Running ML models locally on devices, reducing latency and
dependence on cloud computing.

2. Societal Perspective
Machine learning has significant societal implications. It has the potential to improve the
quality of life, automate routine tasks, and offer personalized services. However, it also
raises concerns about job displacement, privacy, and decision-making fairness.

Key Aspects:

Personalization: From tailored shopping experiences to targeted healthcare, ML offers personalized recommendations and services.

Automation: ML is replacing human labor in sectors like manufacturing, transportation, and customer service, which raises concerns about job displacement and the need for reskilling the workforce.

Data-Driven Society: As more organizations rely on data, machine learning plays a critical role in analyzing large volumes of data to support informed decisions, especially in healthcare, marketing, and finance.

3. Business and Economic Perspective

In business, machine learning offers competitive advantages, helping companies improve efficiency, optimize operations, and better understand customer behavior.

Key Benefits:

Cost Reduction: By automating processes and reducing human error, businesses can
lower costs.

Increased Productivity: Machine learning models can process large datasets faster
than humans, improving decision-making speed.

Predictive Analytics: ML enables companies to predict future trends, demand, and risks, helping them stay competitive in rapidly changing markets.

4. Ethical and Philosophical Perspective


As machine learning continues to impact society, ethical issues come into play. The
question arises as to how much control we give to machines and algorithms. Some ethical
concerns include:

Algorithmic Bias: Algorithms may perpetuate or even exacerbate biases if the training
data used is biased.

Transparency: Many ML models, especially deep learning models, are seen as "black boxes" due to their complexity, making it difficult to explain their decisions.

Autonomy: How much decision-making power should we give to machines, particularly in critical areas like healthcare, criminal justice, or warfare?

Issues in Machine Learning


1. Data Quality and Quantity
Machine learning models heavily rely on data, and the quality of data determines the
performance of the model. However, obtaining clean, labeled, and representative data is a
significant challenge.

Data Scarcity: Many domains lack sufficient labeled data, which hinders model training.

Noisy Data: Data can be incomplete, inconsistent, or contain errors, leading to poor
model performance.

Bias in Data: If the data used for training is biased or unbalanced, the model may
produce biased outcomes, particularly in sensitive applications like hiring, lending, or
criminal justice.

Potential Solutions:

Data Augmentation: Generating synthetic data to improve the dataset size and variety.

Transfer Learning: Using pre-trained models on related tasks to address data scarcity.

Active Learning: Involving human experts to label the most informative samples
selectively.

2. Model Interpretability

One of the biggest challenges in machine learning, especially with complex models like
deep learning, is interpretability.

Black Box Problem: Many machine learning models, particularly deep neural networks,
are not interpretable by humans. This makes it difficult to understand how a decision is
made, which can be problematic in areas like healthcare or finance where transparency
is essential.

Trust: For ML to be accepted in critical applications (e.g., medical diagnoses, autonomous vehicles), models need to be explainable and interpretable.

Potential Solutions:

Explainable AI (XAI): A field focused on creating models that are inherently interpretable or providing tools that explain the decisions of "black-box" models.

Model Simplification: Choosing simpler models like decision trees or linear models when interpretability is more important than performance.

3. Overfitting and Generalization

Overfitting occurs when a model performs well on training data but fails to generalize to
unseen data. This is one of the most common issues in machine learning.

Overfitting: This happens when the model learns noise or irrelevant patterns in the
training data, resulting in poor performance on new data.

Underfitting: On the contrary, underfitting occurs when the model is too simple to
capture the underlying structure of the data, leading to both poor training and test
performance.

Potential Solutions:

Regularization Techniques: L1 and L2 regularization, dropout, and early stopping are methods to prevent overfitting.

Cross-Validation: Ensuring the model’s generalization by testing on multiple subsets of data.

Data Augmentation: Increasing the size of the training data to expose the model to
more variation.

4. Algorithmic Fairness and Bias

Machine learning models can sometimes make unfair decisions due to biases present in
the data. This can result in discrimination, particularly in applications like hiring, lending,
law enforcement, and healthcare.

Bias Amplification: If a training dataset contains biased data, the model can perpetuate
or even amplify these biases.

Unintended Consequences: ML systems trained on historical data can reinforce negative stereotypes and marginalize vulnerable groups.

Potential Solutions:

Fairness Constraints: Including fairness constraints when training models to ensure that sensitive attributes like race, gender, or age do not unfairly influence predictions.

Bias Mitigation Algorithms: Algorithms like re-sampling or adversarial debiasing that aim to reduce bias in the dataset or model.

5. Scalability and Computation

Training large-scale machine learning models, especially deep neural networks, requires
enormous computational resources, both in terms of processing power and storage.

High Computational Cost: Training large models, like GPT (generative pre-trained
transformers), requires massive amounts of GPU/TPU resources, which is a challenge
for many organizations.

Energy Consumption: As the size of machine learning models grows, so does their
energy consumption, raising concerns about sustainability and environmental impact.

Potential Solutions:

Distributed Computing: Leveraging cloud infrastructure and distributed frameworks like Apache Spark to handle large datasets and models.

Model Compression: Techniques like pruning, quantization, and knowledge distillation to reduce model size and computational requirements without sacrificing performance.

6. Security and Privacy


Machine learning systems, especially those deployed in the real world, are vulnerable to
several security and privacy concerns.

Adversarial Attacks: Small, imperceptible changes to input data can cause machine
learning models (especially deep neural networks) to make incorrect predictions. This
can be problematic in applications like facial recognition or autonomous driving.

Data Privacy: Many ML models require large amounts of personal data, raising
concerns about user privacy. In particular, the use of data in healthcare or finance
needs to comply with strict privacy regulations like GDPR.

Potential Solutions:

Adversarial Training: Training models to be robust against adversarial attacks.

Federated Learning: A technique where the model is trained across multiple decentralized devices without requiring data to leave the devices, improving privacy.

7. Ethical Concerns and Accountability

Machine learning raises ethical concerns, particularly when deployed in sensitive areas
such as healthcare, law enforcement, or autonomous decision-making.

Accountability: When a machine learning system makes an incorrect or harmful decision, determining who is responsible (the developer, the company, or the user) can be difficult.

Ethics in AI: Ensuring that machine learning models are built and deployed in a way that
respects human rights, dignity, and fairness.

Potential Solutions:

Ethical AI Frameworks: Developing ethical guidelines and frameworks to ensure fairness, accountability, and transparency in machine learning.

Human-in-the-loop Systems: Keeping a human supervisor in critical decision-making processes to ensure accountability.

Conclusion

The perspectives and issues in machine learning are diverse and complex. While ML offers
great potential in improving technology and society, it also presents challenges related to data
quality, interpretability, bias, security, and ethics. It is crucial to develop machine learning
systems responsibly, with a strong emphasis on fairness, transparency, and accountability to
ensure that these systems benefit society equitably and without harm.

Review of Probability in Machine Learning


Probability theory is fundamental to machine learning because many algorithms rely on
probabilistic models to make decisions or predictions. Understanding probability helps in
reasoning about uncertainty, which is intrinsic to real-world data and machine learning models.

Key Concepts in Probability

3. Hidden Markov Models (HMMs)
HMMs are used in sequential data, where the system being modeled is assumed to be a
Markov process with hidden states. Probabilistic transitions between states are modeled,
and observations depend on the underlying states.
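
To make the two probabilistic ingredients concrete, here is a minimal NumPy sketch of sampling from a 2-state hidden Markov model; the transition and emission probabilities are illustrative assumptions.

```python
import numpy as np

A = np.array([[0.9, 0.1],        # P(next state | current state)
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],        # P(observation | state)
              [0.1, 0.9]])
rng = np.random.default_rng(0)

state, states, obs = 0, [], []
for _ in range(10):
    states.append(state)
    obs.append(int(rng.choice(2, p=B[state])))   # observation depends on the hidden state
    state = int(rng.choice(2, p=A[state]))       # probabilistic state transition

print(states)  # hidden states (normally unobserved)
print(obs)     # what the model actually sees
```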

4. Expectation-Maximization (EM) Algorithm


The EM algorithm is an iterative approach used to estimate parameters of models with
latent variables, such as Gaussian Mixture Models (GMMs).
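
A minimal sketch of fitting a two-component GMM, assuming scikit-learn is available; GaussianMixture runs EM internally, and the data here is synthetic and illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 200),
                       rng.normal(3, 1.0, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)  # EM under the hood
print(gmm.means_.ravel())    # estimated component means (~ -2 and 3)
print(gmm.weights_)          # estimated mixing proportions
```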

Basic Linear Algebra in Machine Learning Techniques
Linear algebra is the backbone of many machine learning algorithms, especially those dealing
with large datasets and complex models. It is essential for understanding how data is
represented and manipulated in high-dimensional spaces.

7. Matrix Inversion and Pseudo-Inverse
In machine learning, finding the inverse of a matrix can be essential for solving systems of
linear equations (as in linear regression). However, not all matrices are invertible, so we
often compute the Moore-Penrose pseudo-inverse.

Linear Algebra in Machine Learning Techniques


1. Linear Regression
Linear regression models predict a continuous target \( y \) from a set of input features \( \mathbf{X} \). The goal is to find the best-fitting line by solving for the weight vector \( \mathbf{w} \) using the normal equation:

\( \mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \)
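
A minimal NumPy sketch of the normal equation; np.linalg.pinv (the Moore-Penrose pseudo-inverse mentioned earlier) is used so the solution also works when \( \mathbf{X}^T \mathbf{X} \) is not invertible. The synthetic data is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])  # bias column + one feature
y = 2.0 + 3.0 * X[:, 1] + rng.normal(0, 1, 50)              # true w = (2, 3), plus noise

w = np.linalg.pinv(X.T @ X) @ X.T @ y
print(w)   # ~ [2, 3]
```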

2. Principal Component Analysis (PCA)


PCA is a dimensionality reduction technique that projects data onto a lower-dimensional
space by identifying the directions (principal components) where the data varies the most.
It uses the eigenvectors of the covariance matrix of the data.
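
A minimal PCA sketch via eigendecomposition of the covariance matrix, exactly as described above; the 3-D data is synthetic, with one deliberately redundant direction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)   # make one direction redundant

Xc = X - X.mean(axis=0)                          # center the data
cov = np.cov(Xc, rowvar=False)                   # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)           # eigh: for symmetric matrices
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]] # 2 largest-variance directions
X_reduced = Xc @ top2                            # project onto 2 components
print(X_reduced.shape)                           # (100, 2)
```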

3. Support Vector Machines (SVMs)

In SVMs, the goal is to find a hyperplane that maximizes the margin between different
classes. The problem is formulated as a convex optimization problem, often solved using
quadratic programming. SVMs make extensive use of matrix operations to compute
distances and margins.

4. Neural Networks

Neural networks rely heavily on matrix multiplication to compute activations and backpropagate errors through layers of the network. The weights of the network are updated using gradient-based optimization algorithms, which involve computing the gradient of the loss function with respect to the weight matrices.

Conclusion
Probability helps to model uncertainty, make predictions, and estimate parameters in
various machine learning techniques like Naive Bayes, Hidden Markov Models, and
probabilistic models.

Linear Algebra provides the computational foundation for most machine learning
algorithms, enabling data representation, optimization, dimensionality reduction, and neural
network computations.

Understanding these two areas is critical for both theoretical and practical machine learning
applications.

Dataset and Its Types
In machine learning, a dataset refers to a structured collection of data that is used to train and
evaluate models. A dataset typically consists of input data (features) and corresponding
outputs (labels or target values) that the model learns to predict. The quality and structure of
the dataset significantly impact the performance of machine learning algorithms.

Components of a Dataset
1. Features (Input Variables): The independent variables used as input to a machine learning
model. In a dataset, features are the attributes or columns that describe the data points
(observations).

Example: In a dataset of houses, features could be size, number of bedrooms, and location.

2. Labels (Target/Output Variables): The dependent variable, representing what we want to predict. Labels are provided during training and are what the model must predict for new, unseen data.

Example: In a house price prediction task, the label could be price.

3. Samples (Data Points): Each row in the dataset corresponds to a sample, which represents
a single observation or instance in the dataset.

Example: Each house in the housing dataset is a sample with specific features.

4. Metadata: Information about the dataset, such as feature descriptions, units, or additional
details about how the data was collected.

Types of Datasets
There are several types of datasets based on how they are used in machine learning, the type
of data they contain, and the tasks they address.

1. Based on Machine Learning Task


1. Training Dataset

The dataset used to train the machine learning model. The model learns the
relationships between input features and target labels from the training data.

Typically, this dataset contains both input features and the corresponding labels
(supervised learning).

Example: A dataset of house features and prices used to train a regression model.

2. Validation Dataset

Used to tune model parameters and evaluate model performance during training. This
dataset helps in selecting the best model and hyperparameters.

The validation set helps prevent overfitting by evaluating the model on data it hasn't
seen during training.

Example: After training a model on the training dataset, the validation set is used to
measure how well the model generalizes.

3. Test Dataset

The dataset used to assess the model’s final performance. It contains data points that
the model has never seen before, ensuring that the model generalizes well to unseen
data.

Example: After training and tuning, the test dataset evaluates how the model performs
on real-world or unseen data.

2. Based on Data Structure


1. Structured Data

Data that is organized into tables (rows and columns), often found in databases and
spreadsheets. Each row represents a data point (sample), and each column represents
a feature or attribute.

Example: Financial records, customer data (with attributes like name, age, income), and
sensor readings.

2. Unstructured Data

Data that does not follow a predefined format or structure. This type of data is typically
text, images, audio, or video.

Example: Social media posts, emails, images (e.g., in facial recognition), or videos (e.g.,
for object detection).

3. Semi-Structured Data

Data that does not fit neatly into structured formats but contains some organizational
properties (tags, labels, etc.). Examples include data in JSON or XML formats.

Example: Web data, log files, and emails with specific headers.

3. Based on Label Availability


1. Labeled Data

In a labeled dataset, each sample has a corresponding output label. This is common in
supervised learning tasks, where the model is trained to predict the label.

Example: A dataset of images with labels indicating whether they contain a cat or a
dog.

2. Unlabeled Data

In an unlabeled dataset, only the input features are provided, and no labels are
available. Unsupervised learning techniques, such as clustering, are used to extract
patterns or structures from such data.

Example: A dataset of customer transaction records without knowing which transactions belong to which customer segment.

4. Based on Data Type


1. Numerical Data

Data that represents numbers and can be continuous (e.g., weight, height, price) or
discrete (e.g., count of items).

2. Categorical Data

Data that represents categories or discrete groups. Categorical features can be nominal (no inherent order, like color) or ordinal (with order, like education level).

3. Time-Series Data

Data points that are ordered and collected over time intervals, commonly used in
forecasting and temporal analysis.

Example: Stock prices recorded daily, temperature readings.

4. Text Data

Data in the form of words and sentences. Natural Language Processing (NLP)
techniques are used to process this type of data.

Example: Movie reviews, news articles.

Data Preprocessing in Machine Learning


Data preprocessing is a crucial step in any machine learning pipeline. Raw data often contains
noise, missing values, or irrelevant features. Preprocessing ensures that the data is in the best
possible shape before feeding it into a model. The process typically involves cleaning,
transforming, and standardizing the data.

Key Steps in Data Preprocessing


1. Handling Missing Data
Missing data can significantly affect the performance of a model. There are several
methods to handle missing values:

Removal: Remove rows or columns with missing data. This is feasible if the proportion of missing data is small.

Imputation: Replace missing values with a statistic (mean, median, or mode) or use algorithms like K-Nearest Neighbors (KNN) to impute missing values.

Interpolation: For time-series data, missing values can be filled by interpolating between known data points.
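
A minimal pandas sketch of the three strategies above; the small DataFrame is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"size": [70.0, np.nan, 85.0, 90.0],
                   "price": [100.0, 120.0, np.nan, 150.0]})

dropped = df.dropna()            # removal: drop rows with any missing value
imputed = df.fillna(df.mean())   # imputation: fill with the column mean
interp  = df.interpolate()       # interpolation: between neighboring points
print(imputed)
```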

2. Data Normalization and Standardization
Normalization (min-max scaling) rescales features to a fixed range such as [0, 1], while standardization rescales them to zero mean and unit variance; both are demonstrated under Feature Scaling below.

3. Encoding Categorical Data


Machine learning models require numerical input, so categorical variables need to be
encoded. Common methods include:

Label Encoding: Assigns a unique integer to each category. This works well when the
categories have an ordinal relationship.

Example: Low = 0, Medium = 1, High = 2.

One-Hot Encoding: Creates binary columns for each category. Each column represents
whether a category is present (1) or not (0). This is commonly used for nominal
categorical features.

Example: One binary column for each category in Color: Red, Blue, Green.
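
A minimal sketch of both encodings, assuming pandas; the column values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"level": ["Low", "High", "Medium", "Low"],
                   "color": ["Red", "Blue", "Green", "Red"]})

# Label encoding for an ordinal feature
df["level_enc"] = df["level"].map({"Low": 0, "Medium": 1, "High": 2})

# One-hot encoding for a nominal feature
df = pd.get_dummies(df, columns=["color"])
print(df)
```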

4. Feature Scaling
Feature scaling ensures that all input features contribute equally to the model’s
performance. This step is important for algorithms that are sensitive to feature magnitude,
such as:

Gradient Descent-Based Models (e.g., neural networks, logistic regression).

Distance-Based Models (e.g., KNN, SVM).
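
A minimal NumPy sketch of the two common scaling schemes: standardization (zero mean, unit variance) and min-max normalization to [0, 1]. The feature values are illustrative.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 100.0])

standardized = (x - x.mean()) / x.std()            # z-scores
normalized = (x - x.min()) / (x.max() - x.min())   # rescaled to [0, 1]
print(standardized.round(2), normalized.round(2))
```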

5. Feature Engineering

Feature Extraction: Creating new features from existing ones that better represent the underlying patterns in the data.

Example: From date-time data, extract day, month, year, weekday, etc.

Feature Selection: Selecting a subset of the most relevant features to reduce dimensionality and improve model performance.

Techniques: Correlation analysis, mutual information, and model-based feature selection (e.g., using random forests).

6. Handling Imbalanced Data


Imbalanced datasets occur when one class is underrepresented compared to others. This
can skew the performance of machine learning models, particularly in classification tasks.
Techniques for handling imbalanced data include:

Resampling:

Oversampling the minority class (e.g., SMOTE: Synthetic Minority Oversampling Technique).

Undersampling the majority class.

Class Weighting: Assign higher penalties to misclassifications of the minority class during training.
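
A minimal class-weighting sketch on a synthetic imbalanced problem, assuming scikit-learn; class_weight="balanced" reweights classes inversely to their frequency, and minority-class recall typically improves.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# 95% / 5% class imbalance, purely illustrative
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

plain    = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

print(recall_score(y_te, plain.predict(X_te)),      # minority-class recall, unweighted
      recall_score(y_te, weighted.predict(X_te)))   # recall with class weighting
```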

7. Dimensionality Reduction

Reducing the number of features in the dataset can help reduce overfitting and improve
computation time.

Principal Component Analysis (PCA): A linear technique that projects data onto fewer
dimensions while preserving variance.

t-SNE: A non-linear dimensionality reduction technique used for visualizing high-dimensional data.

8. Outlier Detection and Removal


Outliers can distort the learning process, especially in regression tasks. Common
techniques include:

Z-Score Method: A sample is considered an outlier if the absolute value of its z-score exceeds a threshold (e.g., 3).

IQR Method: Values beyond 1.5 times the interquartile range (IQR) are considered
outliers.
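
A minimal NumPy sketch of both rules on a 1-D feature; the values are illustrative, and a z-score threshold of 2 is used here because the sample is tiny.

```python
import numpy as np

x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])   # 95 is the outlier

z = (x - x.mean()) / x.std()
z_outliers = x[np.abs(z) > 2]                 # z-score rule (threshold 2 here)

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]  # IQR rule
print(z_outliers, iqr_outliers)
```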

Conclusion
Datasets form the backbone of machine learning, providing the data required for model training and evaluation. Understanding the types of datasets and their structure is crucial for selecting the right algorithms and methods.

Data preprocessing is an essential step to prepare raw data for analysis. Proper handling
of missing data, scaling, encoding categorical variables, and dealing with imbalances and
outliers can significantly improve the performance and generalizability of machine learning
models.

By investing time in preprocessing, you ensure that your models are trained on clean, well-
structured data, leading to better results.

Machine Learning: Bias, Variance, Function Approximation, and Overfitting

1. Function Approximation
In machine learning, function approximation is the task of learning or constructing a function
that generates appropriate outputs given inputs, based on a set of training examples.

Mathematical Formulation:
Given: A set of training examples {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}

Goal: Find a function f(x) that approximates the true underlying function f*(x)

Minimize: Loss function L(f(x), y) over the training set

Common Approaches:
1. Parametric: Assume a specific form of f(x) with parameters θ, e.g., f(x) = θ₀ + θ₁x for linear
regression

2. Non-parametric: Make fewer assumptions about f(x), e.g., k-Nearest Neighbors
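
A minimal sketch contrasting the two approaches on the same data: a parametric straight-line fit versus a non-parametric k-nearest-neighbor average. The data, true function, and k are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 30)
y = 2 * x + 1 + rng.normal(0, 0.5, 30)     # noisy samples of f*(x) = 2x + 1

# Parametric: assume f(x) = theta0 + theta1 * x, fit by least squares
theta1, theta0 = np.polyfit(x, y, 1)

# Non-parametric: average the k nearest training targets, no assumed form
def knn_f(x0, k=5):
    return y[np.argsort(np.abs(x - x0))[:k]].mean()

print(theta0 + theta1 * 2.0, knn_f(2.0))   # both approximate f*(2) = 5
```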

2. Bias and Variance


Bias and variance are two sources of error in machine learning models that contribute to the
overall prediction error.

Bias
Definition: The error introduced by approximating a real-world problem with a simplified
model

High bias → Underfitting

Indicators: High training error, similar performance on training and validation sets

Variance
Definition: The error introduced due to the model's sensitivity to small fluctuations in the
training set

High variance → Overfitting

Indicators: Low training error, high validation error

Bias-Variance Decomposition
For a given point x, the expected mean squared error can be decomposed as:
E[(y − f̂(x))²] = (Bias[f̂(x)])² + Var[f̂(x)] + σ²
Where:
f̂(x) is the learned function

Bias[f̂(x)] = E[f̂(x)] − f*(x), the expected difference between the learned function and the true function

Var[f̂(x)] = E[(f̂(x) − E[f̂(x)])²], the variability of predictions

σ² is the irreducible error

3. Overfitting
Overfitting occurs when a model learns the training data too well, capturing noise and random
fluctuations as if they were part of the underlying pattern.

Characteristics:

Low training error

High validation/test error

Poor generalization to new, unseen data

Causes:
1. High model complexity relative to the amount of training data

2. Insufficient regularization

3. Training for too many epochs

Detection Methods:
1. Learning curves: Plot training and validation errors against the training set size

2. Cross-validation: Assess model performance on multiple splits of the data

Prevention Techniques:
1. Regularization (e.g., L1, L2 norms)

2. Early stopping

3. Ensemble methods (e.g., bagging, boosting)

4. Data augmentation

5. Dropout (for neural networks)
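
A minimal sketch of detecting under- and overfitting by comparing training and validation error across model complexity (polynomial degree); the noisy data and degrees are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 40)
y = np.sin(x) + rng.normal(0, 0.2, 40)        # noisy samples of sin(x)
x_tr, x_va, y_tr, y_va = train_test_split(x, y, random_state=0)

for degree in (1, 3, 9):
    coefs = np.polyfit(x_tr, y_tr, degree)    # fit a polynomial of this degree
    mse = lambda xs, ys: np.mean((np.polyval(coefs, xs) - ys) ** 2)
    # degree 1 underfits (both errors high); degree 9 tends to overfit
    # (low training error, high validation error)
    print(degree, round(mse(x_tr, y_tr), 3), round(mse(x_va, y_va), 3))
```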

Relationships and Trade-offs


1. Bias-Variance Trade-off:

As model complexity increases: Bias ↓, Variance ↑

Goal: Find the sweet spot that minimizes total error

2. Model Complexity Spectrum:


Underfitting ← Low Complexity --- Optimal --- High Complexity → Overfitting

3. Impact on Learning:

High Bias: Model fails to capture important patterns (underfitting)

High Variance: Model captures noise, leading to poor generalization (overfitting)

4. Strategies for Improvement:

High Bias: Increase model complexity, add features

High Variance: Simplify model, add more training data, apply regularization
