Unit - 1
Machine Learning
Machine Learning (ML) is a branch of artificial intelligence (AI) that enables computers to learn
from data and improve their performance on tasks over time, without being explicitly
programmed. The core idea is to make data-driven predictions or decisions by building
mathematical models based on sample data, known as "training data."
ML involves several important concepts, techniques, and classifications. Below are detailed
notes covering key concepts.
1. Supervised Learning
Supervised learning is a type of ML where the model is trained on labeled data. The algorithm learns a mapping function from the input (features) to the output (target labels). A minimal code sketch follows the algorithm list below.
Output: A function that can predict the label for new, unseen data.
Examples:
Predicting house prices from features such as size and location (regression).
Classifying emails as spam or not spam (classification).
Popular Algorithms:
Linear Regression
Logistic Regression
Decision Trees
Random Forests
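As a minimal sketch of the supervised workflow (assuming scikit-learn is available; the feature values and labels below are toy data invented for illustration):

```python
# Supervised learning: fit a model on labeled data, then predict unseen inputs.
from sklearn.linear_model import LogisticRegression

X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]  # input features
y = [0, 0, 1, 1]                                      # target labels

model = LogisticRegression()
model.fit(X, y)                      # learn the mapping from features to labels
print(model.predict([[3.5, 3.5]]))   # predict the label of a new, unseen point
```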
2. Unsupervised Learning
In unsupervised learning, the model is trained on data without any labels. The goal is to
identify underlying structures in the data. This type of learning is used for clustering,
association, and dimensionality reduction.
Examples:
Clustering: Grouping similar data points together without predefined labels (e.g., customer segmentation).
Association: Discovering rules that describe relationships between variables (e.g., market basket analysis).
Dimensionality Reduction: Reducing the complexity of data while preserving its key characteristics (Principal Component Analysis - PCA).
Popular Algorithms:
K-Means Clustering
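A minimal clustering sketch (scikit-learn assumed; the points and the choice of k = 2 are illustrative):

```python
# Unsupervised learning: K-Means groups unlabeled points by proximity.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])  # no labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster assignments discovered from structure alone
```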
3. Semi-Supervised Learning
Semi-supervised learning trains a model on a small amount of labeled data combined with a large amount of unlabeled data, which is useful when labeling is expensive or time-consuming.
Popular Algorithms:
Self-training
Co-training
4. Reinforcement Learning
In reinforcement learning, an agent learns by interacting with its environment and receiving
rewards or penalties based on its actions. The goal is to maximize the cumulative reward
over time.
Examples:
Game-playing AI (AlphaGo)
Autonomous driving.
Popular Algorithms:
Q-Learning
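A minimal sketch of the tabular Q-learning update rule (the state/action sizes, reward, and transition below are hypothetical placeholders):

```python
# Q-learning: update the action-value table from observed transitions.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9   # learning rate and discount factor

def q_update(s, a, reward, s_next):
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])

q_update(s=0, a=1, reward=1.0, s_next=2)  # one hypothetical transition
print(Q[0])
```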
1. Training and Testing
Splitting the dataset into training and testing sets (e.g., 80% training, 20% testing).
Testing and evaluating performance using metrics like accuracy, precision, recall, etc.
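A minimal sketch of the 80/20 split described above (scikit-learn assumed; the built-in Iris dataset is used for convenience):

```python
# Hold out 20% of the data for testing and evaluate accuracy on it.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)   # 80% train, 20% test

clf = DecisionTreeClassifier().fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```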
2. Overfitting and Underfitting
Overfitting: When a model performs well on the training data but poorly on new, unseen data. This happens when the model learns the noise in the training data.
Underfitting: When a model is too simple and cannot capture the underlying trends in
the data, leading to poor performance on both the training and testing sets.
Preventive Measures:
Cross-validation: A technique where the dataset is split into multiple parts, and the
model is trained and tested on different splits to avoid overfitting.
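A minimal cross-validation sketch (scikit-learn assumed; 5 folds and the Iris data are illustrative choices):

```python
# 5-fold cross-validation: train and score on 5 different splits, then average.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())   # average score and its spread across folds
```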
3. Bias-Variance Tradeoff
Bias: Error due to overly simplistic models that fail to capture complex patterns (leading
to underfitting).
Variance: Error due to a model that is too complex and captures noise in the training
data (leading to overfitting).
The goal is to find a balance between bias and variance.
4. Evaluation Metrics
Accuracy: The proportion of correct predictions out of all predictions.
Precision: The proportion of true positive predictions out of all positive predictions.
Recall (Sensitivity): The proportion of true positive predictions out of all actual
positives.
F1 Score: The harmonic mean of precision and recall, useful when you need to balance
them.
Confusion Matrix: A matrix that shows true positives, true negatives, false positives,
and false negatives.
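A minimal sketch computing the metrics above (scikit-learn assumed; the true and predicted labels are made up):

```python
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))  # [[TN, FP], [FN, TP]]
```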
Feature Selection: The process of selecting the most important features from the
dataset to improve model performance and reduce overfitting.
Use Cases:
Spam detection.
3. Decision Trees
Decision trees are non-parametric models used for both classification and regression
tasks. They recursively split the data based on feature values to form a tree-like structure,
where each internal node represents a decision based on a feature, and each leaf node
represents an outcome.
Key Concepts:
Splitting Criteria: Measures such as Gini impurity and information gain (entropy) are used to choose the best feature to split on.
Pruning: Removing branches that add little predictive power, to reduce overfitting.
Advantages:
Easy to interpret.
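A minimal decision-tree sketch illustrating interpretability (scikit-learn assumed; max_depth=2 is an arbitrary choice to keep the tree small):

```python
# Fit a shallow tree and print its learned splitting rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree))   # human-readable view of the decision rules
```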
5. Support Vector Machines (SVMs)
SVMs are supervised models for classification and regression that find the hyperplane separating the classes with the largest margin.
Key Concepts:
Margin: The distance between the hyperplane and the nearest data point from each
class.
Kernel Trick: Transforms data into higher dimensions to make it linearly separable.
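A minimal kernel-SVM sketch (scikit-learn assumed; the XOR-style toy data is chosen because it is not linearly separable in the original space):

```python
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]   # XOR labels: no straight line separates the classes

clf = SVC(kernel="rbf", gamma=2.0).fit(X, y)   # RBF kernel trick
print(clf.predict([[0.9, 0.1]]))   # separable after the implicit mapping
```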
6. Neural Networks
Neural networks are inspired by the structure of the human brain and are used for both
classification and regression tasks. They consist of layers of neurons, where each neuron
performs a weighted sum of its inputs and passes the result through a non-linear activation
function.
Key Concepts:
Backpropagation: The process of updating weights in a neural network by calculating
the error and propagating it back through the layers.
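A minimal NumPy sketch of backpropagation through one hidden layer (the architecture, learning rate, and random data are illustrative assumptions, not a production recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                    # 8 samples, 3 features
y = rng.integers(0, 2, size=(8, 1)).astype(float)

W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(200):
    h = sigmoid(X @ W1)                        # forward pass: hidden layer
    out = sigmoid(h @ W2)                      # forward pass: output
    d_out = (out - y) * out * (1 - out)        # error gradient at the output
    d_h = (d_out @ W2.T) * h * (1 - h)         # propagate the error backwards
    W2 -= 0.5 * h.T @ d_out                    # gradient-descent weight updates
    W1 -= 0.5 * X.T @ d_h
print(float(np.mean((out - y) ** 2)))          # squared error after training
```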
2. Finance
Machine learning is used for stock price prediction, fraud detection, and algorithmic
trading.
3. E-commerce
Recommendation engines use ML to suggest products to users based on their browsing
and purchasing history.
4. Autonomous Vehicles
Self-driving cars use ML algorithms for tasks like object detection, path planning, and
decision-making.
Current Trends:
Deep Learning: A subset of ML that uses artificial neural networks for complex tasks
like image and speech recognition.
Edge Computing in ML: Running ML models locally on devices, reducing latency and
dependence on cloud computing.
2. Societal Perspective
Machine learning has significant societal implications. It has the potential to improve the
quality of life, automate routine tasks, and offer personalized services. However, it also
raises concerns about job displacement, privacy, and decision-making fairness.
Key Aspects:
Key Benefits:
Cost Reduction: By automating processes and reducing human error, businesses can
lower costs.
Increased Productivity: Machine learning models can process large datasets faster
than humans, improving decision-making speed.
Key Concerns:
Algorithmic Bias: Algorithms may perpetuate or even exacerbate biases if the training data used is biased.
Transparency: Many ML models, especially deep learning models, are seen as "black
boxes" due to their complexity, making it difficult to explain their decisions.
1. Data Quality and Availability
Data Scarcity: Many domains lack sufficient labeled data, which hinders model training.
Noisy Data: Data can be incomplete, inconsistent, or contain errors, leading to poor
model performance.
Bias in Data: If the data used for training is biased or unbalanced, the model may
produce biased outcomes, particularly in sensitive applications like hiring, lending, or
criminal justice.
Potential Solutions:
Data Augmentation: Generating synthetic data to improve the dataset size and variety.
Transfer Learning: Using pre-trained models on related tasks to address data scarcity.
Active Learning: Involving human experts to label the most informative samples
selectively.
2. Model Interpretability
One of the biggest challenges in machine learning, especially with complex models like
deep learning, is interpretability.
Black Box Problem: Many machine learning models, particularly deep neural networks,
are not interpretable by humans. This makes it difficult to understand how a decision is
made, which can be problematic in areas like healthcare or finance where transparency
is essential.
Potential Solutions:
Model Simplification: Choosing simpler models like decision trees or linear models
when interpretability is more important than performance.
3. Overfitting and Generalization
Overfitting occurs when a model performs well on training data but fails to generalize to
unseen data. This is one of the most common issues in machine learning.
Overfitting: This happens when the model learns noise or irrelevant patterns in the
training data, resulting in poor performance on new data.
Underfitting: On the contrary, underfitting occurs when the model is too simple to
capture the underlying structure of the data, leading to both poor training and test
performance.
Potential Solutions:
Data Augmentation: Increasing the size of the training data to expose the model to
more variation.
4. Fairness and Bias
Machine learning models can sometimes make unfair decisions due to biases present in
the data. This can result in discrimination, particularly in applications like hiring, lending,
law enforcement, and healthcare.
Bias Amplification: If a training dataset contains biased data, the model can perpetuate
or even amplify these biases.
Potential Solutions:
5. Computational Resources
Training large-scale machine learning models, especially deep neural networks, requires
enormous computational resources, both in terms of processing power and storage.
High Computational Cost: Training large models, like GPT (Generative Pre-trained Transformer) models, requires massive amounts of GPU/TPU resources, which is a challenge for many organizations.
Energy Consumption: As the size of machine learning models grows, so does their
energy consumption, raising concerns about sustainability and environmental impact.
Potential Solutions:
6. Security and Privacy
Adversarial Attacks: Small, imperceptible changes to input data can cause machine
learning models (especially deep neural networks) to make incorrect predictions. This
can be problematic in applications like facial recognition or autonomous driving.
Data Privacy: Many ML models require large amounts of personal data, raising
concerns about user privacy. In particular, the use of data in healthcare or finance
needs to comply with strict privacy regulations like GDPR.
Potential Solutions:
7. Ethical Concerns
Machine learning raises ethical concerns, particularly when deployed in sensitive areas
such as healthcare, law enforcement, or autonomous decision-making.
Ethics in AI: Ensuring that machine learning models are built and deployed in a way that
respects human rights, dignity, and fairness.
Potential Solutions:
Human-in-the-loop Systems: Keeping a human supervisor in critical decision-making
processes to ensure accountability.
Conclusion
The perspectives and issues in machine learning are diverse and complex. While ML offers
great potential in improving technology and society, it also presents challenges related to data
quality, interpretability, bias, security, and ethics. It is crucial to develop machine learning
systems responsibly, with a strong emphasis on fairness, transparency, and accountability to
ensure that these systems benefit society equitably and without harm.
3. Hidden Markov Models (HMMs)
HMMs are used in sequential data, where the system being modeled is assumed to be a
Markov process with hidden states. Probabilistic transitions between states are modeled,
and observations depend on the underlying states.
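A minimal NumPy sketch of the HMM forward algorithm, which computes the likelihood of an observation sequence; the transition matrix A, emission matrix B, start distribution pi, and observations are toy values assumed for illustration:

```python
import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])   # hidden-state transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # P(observation | hidden state)
pi = np.array([0.5, 0.5])                # initial state distribution
obs = [0, 1, 0]                          # observed symbol sequence

alpha = pi * B[:, obs[0]]                # initialize with the first observation
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]        # propagate belief one step forward
print(alpha.sum())                       # likelihood P(obs | model)
```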
Basic Linear Algebra in Machine Learning Techniques
Linear algebra is the backbone of many machine learning algorithms, especially those dealing
with large datasets and complex models. It is essential for understanding how data is
represented and manipulated in high-dimensional spaces.
7. Matrix Inversion and Pseudo-Inverse
In machine learning, finding the inverse of a matrix can be essential for solving systems of
linear equations (as in linear regression). However, not all matrices are invertible, so we
often compute the Moore-Penrose pseudo-inverse.
For example, the normal-equation solution for linear regression is w = (XᵀX)⁻¹Xᵀy; when XᵀX is singular, w is computed with the pseudo-inverse instead.
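A minimal NumPy sketch of both routes (the design matrix and targets are toy values; the first column of ones models the intercept):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

w = np.linalg.inv(X.T @ X) @ X.T @ y   # normal equation: w = (X^T X)^-1 X^T y
w_safe = np.linalg.pinv(X) @ y         # Moore-Penrose pseudo-inverse: works
print(w, w_safe)                       # even when X^T X is singular
```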
3. Support Vector Machines (SVMs)
In SVMs, the goal is to find a hyperplane that maximizes the margin between different
classes. The problem is formulated as a convex optimization problem, often solved using
quadratic programming. SVMs make extensive use of matrix operations to compute
distances and margins.
4. Neural Networks
Neural networks are built almost entirely on linear algebra: each layer's forward pass is a matrix multiplication of inputs and weights, and backpropagation applies matrix calculus to compute gradients.
Conclusion
Probability helps to model uncertainty, make predictions, and estimate parameters in
various machine learning techniques like Naive Bayes, Hidden Markov Models, and
probabilistic models.
Linear Algebra provides the computational foundation for most machine learning
algorithms, enabling data representation, optimization, dimensionality reduction, and neural
network computations.
Understanding these two areas is critical for both theoretical and practical machine learning
applications.
Dataset and Its Types
In machine learning, a dataset refers to a structured collection of data that is used to train and
evaluate models. A dataset typically consists of input data (features) and corresponding
outputs (labels or target values) that the model learns to predict. The quality and structure of
the dataset significantly impact the performance of machine learning algorithms.
Components of a Dataset
1. Features (Input Variables): The independent variables used as input to a machine learning
model. In a dataset, features are the attributes or columns that describe the data points
(observations).
2. Labels (Target Variables): The outputs that the model learns to predict.
Example: The price of each house in a housing dataset.
3. Samples (Data Points): Each row in the dataset corresponds to a sample, which represents
a single observation or instance in the dataset.
Example: Each house in the housing dataset is a sample with specific features.
4. Metadata: Information about the dataset, such as feature descriptions, units, or additional
details about how the data was collected.
Types of Datasets
There are several types of datasets based on how they are used in machine learning, the type
of data they contain, and the tasks they address.
1. Training Dataset
The dataset used to train the machine learning model. The model learns the
relationships between input features and target labels from the training data.
Typically, this dataset contains both input features and the corresponding labels
(supervised learning).
Example: A dataset of house features and prices used to train a regression model.
2. Validation Dataset
Used to tune model parameters and evaluate model performance during training. This
dataset helps in selecting the best model and hyperparameters.
The validation set helps prevent overfitting by evaluating the model on data it hasn't
seen during training.
Example: After training a model on the training dataset, the validation set is used to
measure how well the model generalizes.
3. Test Dataset
The dataset used to assess the model’s final performance. It contains data points that
the model has never seen before, ensuring that the model generalizes well to unseen
data.
Example: After training and tuning, the test dataset evaluates how the model performs
on real-world or unseen data.
1. Structured Data
Data that is organized into tables (rows and columns), often found in databases and
spreadsheets. Each row represents a data point (sample), and each column represents
a feature or attribute.
Example: Financial records, customer data (with attributes like name, age, income), and
sensor readings.
2. Unstructured Data
Data that does not follow a predefined format or structure. This type of data is typically
text, images, audio, or video.
Example: Social media posts, emails, images (e.g., in facial recognition), or videos (e.g.,
for object detection).
3. Semi-Structured Data
Data that does not fit neatly into structured formats but contains some organizational
properties (tags, labels, etc.). Examples include data in JSON or XML formats.
Example: Web data, log files, and emails with specific headers.
1. Labeled Data
In a labeled dataset, each sample has a corresponding output label. This is common in
supervised learning tasks, where the model is trained to predict the label.
Example: A dataset of images with labels indicating whether they contain a cat or a
dog.
2. Unlabeled Data
In an unlabeled dataset, only the input features are provided, and no labels are
available. Unsupervised learning techniques, such as clustering, are used to extract
patterns or structures from such data.
1. Numerical Data
Data that represents numbers and can be continuous (e.g., weight, height, price) or
discrete (e.g., count of items).
2. Categorical Data
Data that represents categories or discrete groups. Categorical features can be nominal (no inherent order, like color) or ordinal (with order, like education level).
3. Time-Series Data
Data points that are ordered and collected over time intervals, commonly used in
forecasting and temporal analysis.
4. Text Data
Data in the form of words and sentences. Natural Language Processing (NLP)
techniques are used to process this type of data.
2. Handling Missing Data
Removal: Remove rows or columns with missing data. This is feasible if the proportion
of missing data is small.
Imputation: Replace missing values with a statistic (mean, median, or mode) or use
algorithms like K-Nearest Neighbors (KNN) to impute missing values.
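A minimal imputation sketch (scikit-learn assumed; the NaN positions are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

print(SimpleImputer(strategy="mean").fit_transform(X))  # column-mean imputation
print(KNNImputer(n_neighbors=2).fit_transform(X))       # impute from nearest rows
```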
3. Encoding Categorical Data
Label Encoding: Assigns a unique integer to each category. This works well when the
categories have an ordinal relationship.
One-Hot Encoding: Creates binary columns for each category. Each column represents
whether a category is present (1) or not (0). This is commonly used for nominal
categorical features.
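A minimal pandas sketch of both encodings (the color column is a toy example):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

df["color_label"] = df["color"].astype("category").cat.codes  # label encoding
one_hot = pd.get_dummies(df["color"], prefix="color")         # one-hot encoding
print(df.join(one_hot))
```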
4. Feature Scaling
Feature scaling ensures that all input features contribute equally to the model's performance. This step is important for algorithms that are sensitive to feature magnitude, such as K-Nearest Neighbors, SVMs, K-Means, and gradient-descent-based models.
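A minimal scaling sketch (scikit-learn assumed; the feature values are deliberately on very different scales):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

print(StandardScaler().fit_transform(X))  # (x - mean) / std per feature
print(MinMaxScaler().fit_transform(X))    # (x - min) / (max - min), into [0, 1]
```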
5. Feature Engineering
Feature Extraction: Creating new features from existing ones that better represent the
underlying patterns in the data.
Example: From date-time data, extract day, month, year, weekday, etc.
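A minimal pandas sketch of this extraction (the timestamps are made up):

```python
import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(["2024-01-15", "2024-06-03"])})
df["day"] = df["ts"].dt.day
df["month"] = df["ts"].dt.month
df["year"] = df["ts"].dt.year
df["weekday"] = df["ts"].dt.day_name()
print(df)
```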
6. Handling Imbalanced Data
Resampling: Oversample the minority class or undersample the majority class so that class frequencies are balanced.
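A minimal oversampling sketch (scikit-learn's resample utility assumed; the 8:2 class ratio is illustrative):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"x": range(10), "label": [0] * 8 + [1] * 2})  # imbalanced 8:2
minority = df[df["label"] == 1]
upsampled = resample(minority, replace=True, n_samples=8, random_state=0)
balanced = pd.concat([df[df["label"] == 0], upsampled])
print(balanced["label"].value_counts())   # now 8:8
```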
7. Dimensionality Reduction
Reducing the number of features in the dataset can help reduce overfitting and improve
computation time.
Principal Component Analysis (PCA): A linear technique that projects data onto fewer
dimensions while preserving variance.
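A minimal PCA sketch (scikit-learn assumed; the random 3-D data and the choice of 2 components are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(20, 3))
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)          # project 3-D data onto 2 components
print(X_reduced.shape)                    # (20, 2)
print(pca.explained_variance_ratio_)      # variance preserved per component
```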
8. Handling Outliers
IQR Method: Values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR, where IQR is the interquartile range, are considered outliers.
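A minimal NumPy sketch of the IQR rule (the data values are made up, with one obvious outlier):

```python
import numpy as np

x = np.array([5, 6, 7, 7, 8, 9, 10, 42])            # 42 is an obvious outlier
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
mask = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)  # IQR fences
print(x[mask])                                      # -> [42]
```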
Conclusion
Datasets form the backbone of machine learning, providing the data required for model training and evaluation.
Understanding the types of datasets and their structure is crucial for selecting the right
algorithms and methods.
Data preprocessing is an essential step to prepare raw data for analysis. Proper handling
of missing data, scaling, encoding categorical variables, and dealing with imbalances and
outliers can significantly improve the performance and generalizability of machine learning
models.
By investing time in preprocessing, you ensure that your models are trained on clean, well-
structured data, leading to better results.
1. Function Approximation
In machine learning, function approximation is the task of learning or constructing a function
that generates appropriate outputs given inputs, based on a set of training examples.
Mathematical Formulation:
Given: A set of training examples {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}
Goal: Find a function f(x) that approximates the true underlying function f*(x)
Common Approaches:
1. Parametric: Assume a specific form of f(x) with parameters θ, e.g., f(x) = θ₀ + θ₁x for linear
regression
2. Non-parametric: Make minimal assumptions about the form of f(x), letting the data determine its shape (e.g., k-nearest neighbors, decision trees)
2. Bias and Variance
Bias
Definition: The error introduced by approximating a real-world problem with a simplified
model
Indicators: High training error, similar performance on training and validation sets
Variance
Definition: The error introduced due to the model's sensitivity to small fluctuations in the
training set
Bias-Variance Decomposition
For a given point x, the expected mean squared error can be decomposed as:
E[(y − f̂(x))²] = (Bias[f̂(x)])² + Var[f̂(x)] + σ²
Where:
f̂(x) is the learned function
Bias[f̂(x)] = E[f̂(x)] − f*(x), the expected difference between the learned function and the true function
Var[f̂(x)] = E[(f̂(x) − E[f̂(x)])²], the variability of the learned function across different training sets
σ² is the irreducible error due to noise in the data
3. Overfitting
Overfitting occurs when a model learns the training data too well, capturing noise and random
fluctuations as if they were part of the underlying pattern.
Characteristics:
Low training error
High error on validation/test data
Causes:
1. High model complexity relative to the amount of training data
2. Insufficient regularization
Detection Methods:
1. Learning curves: Plot training and validation errors against the training set size
Prevention Techniques:
1. Regularization (e.g., L1, L2 norms)
2. Early stopping
4. Data augmentation
3. Impact on Learning:
High Bias: Increase model complexity, add informative features, reduce regularization
High Variance: Simplify the model, add more training data, apply regularization