II Sem Material ML
UNIT-I
1. Explain the concept of structured data and unstructured data in machine learning.
2. Explain the concept of well-posed problems in the context of machine learning.
3. Discuss the role of basic linear algebra in machine learning techniques.
4. List and briefly describe four forms of learning in Machine learning.
5. Provide examples of how machine learning and data mining are applied in practice.
6. Illustrate how feature engineering plays a role in productive machine learning.
UNIT-2
1. Discuss in detail about Occam’s Razor.
2. Explain the overfitting and computational complexity issues associated with dimensionality
problems.
3. Explain how classification metrics like precision, recall, and F1-score are evaluated.
4. Describe how supervised learning differs from unsupervised learning.
5. Discuss heuristic search in inductive learning, focusing on strategies to avoid overfitting.
6. Explain the concept of bias-variance trade-off in detail.
UNIT-3
1. Differentiate between Linear regression and Logistic regression using an appropriate example.
2. Describe how the K-Nearest Neighbor algorithm works with an example.
3. What is Fisher’s Linear discriminant used for in machine learning?
4. How does Bayesian reasoning support probabilistic inference?
5. What role does inferential statistical analysis play in machine learning?
6. Explain the concept of Logistic regression in Classification tasks.
UNIT-4
1. Explain different types of activation functions used for neural network training.
2. What is the motivation behind using neural networks for learning the concept? Explain
briefly.
3. Define SVM and further explain the maximum margin linear separator concept.
4. What is the Architecture of a simple neural network?
5. Explain about perceptron in detail.
6. Explain the concept of linear and non-linear support vector machine.
UNIT-5
1. Explain the multilayer perceptron model in detail with neat diagram.
2. Discuss the process of training an RBF network.
3. How are decision trees constructed for classification tasks?
4. How does backpropagation work in training neural networks?
5. Discuss the strengths of the decision tree learning approach
6. How does the CART algorithm differ from ID3 and C4.5?
MACHINE LEARNING
UNIT-I
1. Explain the concept of structured data and unstructured data in machine learning.
Structured Data
Structured data is highly organized, factual data arranged in a predefined format, which makes it
quantitative and simple to search through and manipulate. This kind of data mainly consists of
quantifiable values such as numbers, dates, and times, often held in tables with rows and columns,
similar to an Excel or Google Sheets spreadsheet. SQL, developed at IBM in the 1970s, is the
language most commonly used to manage structured data in relational databases and data warehouses.
Structured data applications include booking and flight details in airline sales transactions and the
management of stock in a business.
Uses of structured data
Financial Transactions: Financial systems rely on structured data for transactions, accounts, and
report generation. For instance, a bank stores customer details, transaction logs, and balance
sheets as structured records in personal-account systems.
Inventory Management: Retail and supply-chain systems use structured data to keep records of shelf
stock, stock flow, and other aspects of supply chain management. These databases support accurate
record-keeping and administration of products, quantities, and locations.
Customer Relationship Management (CRM): Companies use structured data within CRM
systems to organize communications with consumers and document sales processes and consumer
preferences. Client details, purchasing patterns, and interaction records are stored in organized
databases to facilitate marketing and client service.
Human Resources (HR) Management: HR departments use structured data to manage employee
details, monitor attendance, and process payments. Human capital management databases hold
employee records, performance appraisal information, and benefits administration data.
E-commerce Transactions: E-commerce platforms use structured data to process payments, manage
inventory, and handle order delivery for goods and services sold online. Structured databases make
it straightforward to update inventory status and to track payments and customers’ shopping
preferences.
Unstructured Data
Unstructured data comes in many file formats (log files, audio, images, and other raw data) with no
structural pattern to hold to. This form of data poses a major challenge to organizations because it
is difficult to extract value from it while it remains unorganized. Managing such data means that a
great deal of storage space is occupied, and security is always a major concern. It cannot be
described by a data model or schema the way most database contents can, which makes it hard to
manage, analyze, or search. Whereas structured data carries quantitative information and is usually
processed into organized formats such as databases, unstructured data includes information in
textual, image, audio, and video formats and is generally qualitative. It is typically saved in
NoSQL databases or non-relational data stores.
Human-generated unstructured data includes text files, emails, social media posts, mobile
communication data, and business application content. Machine-generated unstructured data includes
satellite images, data captured by scientific instruments and sensors, video surveillance footage, etc.
Uses of Unstructured Data
Social Media Analysis: Platforms such as Twitter, Facebook, and Instagram generate unstructured
data in the form of tweets, posts, and comments; organizations convert this social media data into
structured form to understand consumer sentiment, trends, and brand perception.
Image Recognition: Image data is unstructured data applied in face identification, object
recognition, and computer-aided medical imaging. Sophisticated pattern-recognition methods work at
the pixel level, analyzing raw image data to solve problems as simple as face recognition or as
complex as object identification and disease diagnosis.
Text Mining: Structured, semi-structured, and unstructured text from documents, emails, and
web pages is mined to extract key pieces of information, categorize text as positive, negative, or
neutral, and discover topics. Natural language processing (NLP) methodologies are used for
identifying patterns, extracting keywords, and summarizing content.
Sensor Data Analytics: Real-time information from sensors, IoT devices, and industrial equipment
is collected in unstructured form and subsequently analyzed for performance, bottlenecks, or other
issues. Time-series information gathered from sensors helps in understanding the state of the
environment, the condition of operational tools, and the overall manufacturing process.
Video Surveillance: Raw video data is used in security monitoring to observe behaviour patterns
and identify incidents. Video-analytics and facial-recognition algorithms run on video feeds and
can detect motion, identify objects or threats, and notify security staff.
Difference between structured data and unstructured data
Format: Structured data is organized in a predefined schema of rows and columns; unstructured data has no fixed data model or schema.
Nature: Structured data is mostly quantitative; unstructured data is mostly qualitative (text, images, audio, video).
Storage: Structured data is stored in relational (SQL) databases and data warehouses; unstructured data is stored in NoSQL or other non-relational data stores.
Examples: Structured - financial transactions, inventory and CRM records; Unstructured - emails, social media posts, sensor data, surveillance video.
2) Explain the concept of well-posed problems in the context of machine learning.
Well-Posed Learning Problem - A computer program is said to learn from experience E with respect to
some task T and some performance measure P, if its performance on T, as measured by P, improves
with experience E.
A problem can be classified as a well-posed learning problem if it has three traits -
Task
Performance Measure
Experience
Some examples that illustrate well-posed learning problems are -
1. To better filter emails as spam or not
Task - Classifying emails as spam or not
Performance Measure - The fraction of emails accurately classified as spam or not spam
Experience - Observing you label emails as spam or not spam
2. A checkers learning problem
Task - Playing checkers game
Performance Measure - percent of games won against opponents
Experience - playing practice games against itself
3. Handwriting Recognition Problem
Task - recognizing handwritten words within images
Performance Measure - percent of words accurately classified
Experience - a database of handwritten words with given classifications
4. A Robot Driving Problem
Task - driving on public four-lane highways using vision sensors
Performance Measure - average distance traveled before an error
Experience - a sequence of images and steering commands recorded while observing a human
driver
5. Fruit Prediction Problem
Task - recognizing and classifying different fruits
Performance Measure - the fraction of fruits correctly identified
Experience - training the machine with a large dataset of fruit images
6. Face Recognition Problem
Task - recognizing different types of faces
Performance Measure - the fraction of faces correctly recognized
Experience - training the machine with a large dataset of face images
7. Automatic Translation of documents
Task - translating a document from one language to another
Performance Measure - the accuracy and fluency of the translations produced
Experience - training the machine with a large dataset of documents in different languages
4) List and briefly describe four forms of learning in Machine learning.
Supervised Learning - the model learns from labeled examples (input-output pairs) and predicts outputs for new inputs.
Unsupervised Learning - the model finds hidden patterns or groupings in unlabeled data, without output labels.
Semi-Supervised Learning - the model learns from a small amount of labeled data combined with a large amount of unlabeled data.
Reinforcement Learning - an agent learns by interacting with an environment, receiving rewards or penalties for its actions.
5) Provide examples of how machine learning and data mining are applied in practice.
Machine learning and data mining are widely used across industries to analyze data, identify patterns,
and make intelligent decisions. Below are several practical examples:
1. Fraud Detection
Machine Learning models analyze transaction patterns to detect unusual activities (e.g.,
sudden large withdrawals).
Data mining uncovers hidden patterns in historical transaction data to improve fraud rules.
2. Healthcare and Medical Diagnosis
ML algorithms assist doctors by predicting diseases from symptoms, lab results, or medical
images.
Data mining helps discover relationships between patient attributes and diseases in large
medical datasets.
3. Image and Speech Recognition
ML enables systems like Google Photos to recognize faces, and virtual assistants to
understand voice commands.
Data mining techniques help label and categorize vast image/audio datasets.
4. Predictive Maintenance
Sensors collect real-time data, and ML models predict when a machine is likely to fail.
Data mining helps discover trends and frequent failure causes from historical logs.
5. Cybersecurity
ML models are trained to detect spam emails or malicious files based on known threats.
Data mining identifies new attack patterns from network traffic and logs.
6. Financial Forecasting
ML models forecast prices and demand from historical data, while data mining uncovers trends and
seasonal patterns in past financial records.
Conclusion
Machine learning and data mining are crucial for turning raw data into actionable insights. Their
practical applications span multiple industries, improving decision-making, efficiency, and user
experience.
6) Illustrate how feature engineering plays a role in productive machine learning.
Feature engineering is the process of turning raw data into useful features that help improve the
performance of machine learning models. It includes choosing, creating, and adjusting data attributes
to make the model’s predictions more accurate. The goal is to make the model better by providing
relevant and easy-to-understand information.
A feature or attribute is a measurable property of data that is used as input for machine learning
algorithms. Features can be numerical, categorical or text-based representing essential data aspects
which are relevant to the problem. For example, in housing price prediction, features might include
the number of bedrooms, location and property age.
Feature Engineering Architecture
1. Feature Creation: Feature creation involves generating new features from domain knowledge or
by observing patterns in the data. It can be:
Domain-specific: Created based on industry knowledge, like business rules.
Data-driven: Derived by recognizing patterns in data.
Synthetic: Formed by combining existing features.
2. Feature Transformation: Transformation adjusts features to improve model learning:
Normalization & Scaling: Adjust the range of features for consistency.
Encoding: Converts categorical data to numerical form, e.g., one-hot encoding.
Mathematical transformations: Like logarithmic transformations for skewed data.
3. Feature Extraction: Extracting meaningful features can reduce dimensionality and improve
model accuracy:
Dimensionality reduction: Techniques like PCA reduce features while preserving important
information.
Aggregation & Combination: Summing or averaging features to simplify the model.
4. Feature Selection: Feature selection involves choosing a subset of relevant features to use:
Filter methods: Based on statistical measures like correlation.
Wrapper methods: Select based on model performance.
Embedded methods: Feature selection integrated within model training.
5. Feature Scaling: Scaling ensures that all features contribute equally to the model:
Min-Max scaling: Rescales values to a fixed range like 0 to 1.
Standard scaling: Normalizes to have a mean of 0 and variance of 1.
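These scalers are one or two lines in practice. A minimal sketch using scikit-learn (assumed available; the feature values below are made up for illustration):
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

# Hypothetical feature column with very different magnitudes
X = np.array([[1.0], [5.0], [10.0], [100.0]])

# Min-Max scaling: rescales values into the range [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())

# Standard scaling: rescales to mean 0 and variance 1
print(StandardScaler().fit_transform(X).ravel())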
Steps in Feature Engineering
Feature engineering can vary depending on the specific problem but the general steps are:
1. Data Cleansing: Identify and correct errors or inconsistencies in the dataset to ensure data
quality and reliability.
2. Data Transformation: Transform raw data into a format suitable for modeling including scaling,
normalization and encoding.
3. Feature Extraction: Create new features by combining or deriving information from existing
ones to provide more meaningful input to the model.
4. Feature Selection: Choose the most relevant features for the model using techniques like
correlation analysis, mutual information and stepwise regression.
5. Feature Iteration: Continuously refine features based on model performance by adding,
removing or modifying features for improvement.
Common Techniques in Feature Engineering
1. One-Hot Encoding: One-Hot Encoding converts categorical variables into binary indicators,
allowing them to be used by machine learning models.
import pandas as pd

# Example categorical feature (the original sample data was not preserved; values are illustrative)
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})

# Convert the categorical column into binary indicator columns
df_encoded = pd.get_dummies(df, columns=['Color'])
print(df_encoded)
2. Binning: Binning converts a continuous feature into categorical ranges. A sketch using pd.cut
(bin edges inferred from the labels in the output below):
import pandas as pd

df = pd.DataFrame({'Age': [23, 45, 18, 34, 67, 50, 21]})

# Bin ages into labeled ranges
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 20, 40, 60, 120],
                         labels=['0-20', '21-40', '41-60', '61+'])
print(df)
Output
Age Age_Group
0 23 21-40
1 45 41-60
2 18 0-20
3 34 21-40
4 67 61+
5 50 41-60
6 21 21-40
3. Text Data Preprocessing: Involves removing stop-words, stemming and vectorizing text data to
prepare it for machine learning models.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('stopwords')  # ensure the stop-word list is available

texts = ["This is a sample sentence.", "Text data preprocessing is important."]
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
vectorizer = CountVectorizer()

def preprocess_text(text):
    # Drop stop-words, then stem each remaining word
    words = text.split()
    words = [stemmer.stem(word)
             for word in words if word.lower() not in stop_words]
    return " ".join(words)

cleaned_texts = [preprocess_text(text) for text in texts]
X = vectorizer.fit_transform(cleaned_texts)  # bag-of-words feature matrix
4. Feature Splitting: Divides a single feature into multiple sub-features, uncovering valuable
insights and improving model performance.
import pandas as pd

data = {'Full_Address': [
    '123 Elm St, Springfield, 12345', '456 Oak Rd, Shelbyville, 67890']}
df = pd.DataFrame(data)

# Split the single address field into street, city, and postal-code sub-features
df[['Street', 'City', 'Zip']] = df['Full_Address'].str.split(', ', expand=True)
print(df)
UNIT-2
1) Discuss in detail about Occam’s Razor.
Definition:
Occam’s Razor (also spelled Ockham’s Razor) is a philosophical and problem-solving principle
attributed to the English Franciscan friar and scholastic philosopher William of Ockham (1287-
1347). The principle states: "Entities should not be multiplied beyond necessity."
In simple terms, the simplest explanation is usually the correct one, or when faced with competing
hypotheses that make the same predictions, the one with the fewest assumptions should be selected.
Key Points:
When two explanations account for the facts equally well, prefer the one with fewer assumptions.
Simplicity is a guide for choosing between hypotheses, not a guarantee of truth.
In machine learning, the principle motivates preferring simpler models, which tend to generalize better.
Example:
Suppose a patient has a headache and a fever. Hypothesis A: the patient has the flu. Hypothesis B:
the patient has two unrelated rare diseases, one causing the headache and the other causing the fever.
According to Occam’s Razor, Hypothesis A is preferred because it's simpler and requires fewer
assumptions.
Applications:
Machine learning: model selection favors simpler models to reduce overfitting.
Science: choosing between competing theories that explain the same observations equally well.
Everyday reasoning and debugging: check the simple causes first.
Limitations:
The simplest explanation is not always the correct one; reality can be genuinely complex.
Simplicity itself can be hard to define or measure objectively.
Conclusion:
Occam’s Razor is a powerful tool for critical thinking and decision-making. It encourages clarity,
logical simplicity, and efficient reasoning. While not always correct, it provides a valuable starting
point for evaluating hypotheses in both academic and real-life scenarios.
2) Explain the overfitting and computational complexity issues associated with dimensionality problems.
Introduction to Dimensionality
In machine learning and data science, dimensionality refers to the number of input variables
(features) in a dataset. When the number of features increases, the dataset is said to have high
dimensionality.
1. Overfitting
2. Increased Computational Complexity
Definition:
Overfitting occurs when a model learns the noise or random fluctuations in the training data instead
of the underlying pattern, resulting in poor performance on unseen (test) data.
With more features, the model has more "freedom" to fit the training data.
It may capture patterns that do not generalize.
As the number of features increases, the risk of false correlations also increases.
Example:
If a dataset has only 100 data points but 1,000 features, the model might memorize the data instead of
learning.
Visualization: on an error-versus-model-complexity plot, training error keeps decreasing as more
features are added, while test error decreases at first and then starts increasing again once the
model begins to overfit.
2. Computational Complexity
Definition:
As dimensionality increases, the computational cost in terms of time and memory also increases,
often exponentially. This is known as the curse of dimensionality.
Issues Include:
Time and memory requirements grow rapidly with the number of features.
Distance measures become less informative, since points become nearly equidistant in high dimensions.
Far more training data is needed to cover the feature space adequately.
Example:
In a 2D space, it’s easy to compute the distance between two points. In a 1000D space, the same
distance calculation becomes complex and less informative.
Solutions:
Dimensionality reduction techniques such as PCA or feature selection.
Regularization to constrain the model's freedom to fit noise.
Collecting more training data where feasible.
Conclusion:
High dimensionality can lead to overfitting due to excessive freedom in model fitting and increased
computational complexity due to the exponential growth in resource requirements. Proper
dimensionality reduction and regularization techniques are essential to manage these challenges and
ensure efficient and accurate models.
3. Explain how classification metrics like precision, recall, and F1-score are evaluated.
Precision
Recall
F1-Score
1. Precision
Definition:
Precision tells us how many of the predicted positive results are actually correct.
Precision = TP / (TP + FP)
Example:
If the model makes 100 positive predictions and 80 of them are actually positive:
Precision = 80 / 100 = 0.80 (80%)
Interpretation:
High precision means that when the model predicts positive, it is usually correct (few false positives).
2. Recall
Definition:
Recall measures how many actual positives were correctly predicted by the model.
Recall = TP / (TP + FN)
Example:
If there are 100 actual positive cases and the model correctly identifies 70:
Recall = 70 / 100 = 0.70 (70%)
Interpretation:
High recall means the model misses few actual positives (few false negatives).
3. F1-Score
Definition:
F1-Score is the harmonic mean of Precision and Recall. It balances the two metrics.
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Example:
Using the precision (0.80) and recall (0.70) from above:
F1 = 2 × (0.80 × 0.70) / (0.80 + 0.70) ≈ 0.747
Interpretation:
F1-score is useful when we need to balance both precision and recall, especially on imbalanced
datasets.
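In practice these metrics are computed with library functions rather than by hand. A minimal sketch using scikit-learn's metrics module (the label arrays are hypothetical):
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("F1-score:", f1_score(y_true, y_pred))          # harmonic mean of the two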
Conclusion:
Precision, Recall, and F1-score are essential metrics for evaluating classification models. They
provide a more complete picture than accuracy alone, especially for imbalanced datasets. Choosing
the right metric depends on the context of the problem and the cost of errors.
4) Describe how supervised learning differs from unsupervised learning.
The difference between supervised and unsupervised learning lies in how they use data and in their
goals. Supervised learning relies on labeled datasets, where each input is paired with a
corresponding output label. The goal is to learn the relationship between inputs and outputs so the
model can predict outcomes for new data, such as classifying emails as spam or not spam. In
contrast, unsupervised learning works with unlabeled data, aiming to uncover hidden patterns or
structures within the dataset, such as grouping customers based on their shopping habits or
detecting anomalies in a dataset.
Overall, supervised learning excels in predictive tasks with known outcomes, while unsupervised
learning is ideal for discovering relationships and trends in raw data.
Supervised learning
Supervised learning is a machine learning approach in which the model is trained on labeled data.
Labeled data means that each example in the dataset comes with a correct answer or output.
In supervised learning process:
Machine is given a dataset with input features (like age, salary, or temperature) and
corresponding labels (like "yes/no," "high/low," or "rainy/sunny").
The machine then learns from the dataset by finding patterns in the data. For example, it might learn that if the
temperature is high, it’s likely to be sunny.
Once trained, the machine can predict the label for new input data. For instance, if you give it a
new temperature value, it can predict whether it will be sunny or rainy.
Supervised Learning Analogies
1. Supervised learning is like a teacher guiding a student. The teacher provides examples (labeled
data) and explains the correct answers (output labels). For instance:
A teacher shows a child pictures of animals and labels them as "cat" or "dog."
The child learns to recognize the features that distinguish cats from dogs.
If the child makes a mistake, the teacher corrects them, helping them improve over time.
This analogy emphasizes the role of labeled data in supervised learning, where the algorithm learns
from examples with known outputs.
2. Think of sorting mail into categories like "bills," "ads," or "personal letters":
You are given labeled examples of each type of mail (e.g., envelopes marked as "bill" or "ad").
By examining these examples, you learn patterns such as bills often having company logos or ads
being colorful.
Once trained, you can sort new mail into categories even without explicit labels.
This analogy mirrors how supervised learning uses labeled data to classify new inputs into predefined
categories.
Unsupervised Learning
Unsupervised learning is like letting a child explore and learn on their own, without a teacher: the
machine finds hidden patterns or groupings in the data by itself. Here, the machine is given a
dataset with only input features (like customer purchase history or website click patterns) but no
labels. The machine then tries to find structure in the data; it might group similar data points
together or identify trends. Finally, it provides insights, such as clusters of similar data or
patterns that were not obvious before.
Unsupervised Learning Analogies
1. Sorting Books Without Labels : Imagine you are given a box of books with no labels or
categories. Your task is to organize them:
You notice that some books are mystery novels, so you group them together.
Others are textbooks, which you set aside in a separate pile.
Comic books form another group because of their distinct style.
Here, you create groups based on the books' characteristics (e.g., genre, content) without any prior
guidance. This reflects how unsupervised learning clusters data based on similarities.
This analogy reflects customer segmentation in marketing. Businesses use unsupervised learning to
group customers based on purchasing behavior, preferences, or demographics, enabling targeted
marketing strategies.
2. Exploring a New City: Imagine visiting a new city without a map or guide. You explore and start
grouping landmarks:
Buildings with tall spires might be grouped as churches.
Open spaces with greenery might be categorized as parks.
Streets with lots of shops could be grouped as markets.
You’re identifying patterns and organizing your observations independently, much like how
unsupervised learning identifies patterns in data.
This analogy mirrors anomaly detection in cybersecurity. For example, unsupervised learning
algorithms analyze network traffic and identify unusual patterns that could indicate potential
cyberattacks.
Difference between Supervised and Unsupervised Learning
Input Data: Supervised learning uses labeled data (input features + corresponding outputs); unsupervised learning uses unlabeled data (only input features, no outputs).
Testing the Model: A supervised model can be tested and evaluated using labeled test data; an unsupervised model cannot be tested in the traditional sense, as there are no labels.
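The contrast can be seen in a few lines of code. A hedged sketch using scikit-learn (the four data points and labels are invented for illustration):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [8, 8], [9, 10]])

# Supervised: labels are provided, and the model learns the input-output mapping
y = np.array([0, 0, 1, 1])
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[2, 3]]))  # predicts class 0 for a new point

# Unsupervised: no labels; the model groups similar points on its own
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignments discovered from X alone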
5) Discuss heuristic search in inductive learning, focusing on strategies to avoid overfitting.
Heuristic search techniques are used for problem-solving in AI systems. These techniques help find
the most efficient path from a starting point to a goal, making them essential for applications such as
navigation systems, game playing, and optimization problems.
Heuristic search operates within the search space of a problem to find the best or near-optimal
solution using systematic algorithms.
Unlike brute-force methods, which exhaustively evaluate all possible solutions, heuristic search
leverages heuristic information to guide the search toward more promising paths.
In this context, heuristics refer to a set of criteria or rules of thumb that provide an estimate of the
most viable solution. By balancing exploration (searching new possibilities)
and exploitation (refining known solutions), heuristic algorithms efficiently solve complex problems
that would otherwise be computationally expensive.
Significance of Heuristic Search in AI
The advantage of heuristic search techniques in AI is their ability to efficiently navigate large search
spaces. By prioritizing the most promising paths, heuristics significantly reduce the number of
possibilities that need to be explored. This not only accelerates the search process but also enables AI
systems to solve complex problems that would be impractical for exact algorithms.
Components of Heuristic Search
Heuristic search algorithms typically comprise several essential components:
1. State Space: The set of all possible states or configurations among which a solution to the
given problem may be found.
2. Initial State: The starting configuration of the problem, forming the root of the search tree.
3. Goal Test: A check that determines whether the current state is a goal state, i.e., one in
which the problem is solved.
4. Successor Function: Generates the states reachable from the current state, representing the
possible moves or steps in the problem space.
5. Heuristic Function: Estimates the value or distance from a given state to the target state,
helping focus the search on regions or states that have good prospects of reaching the goal.
Types of Heuristic Search Techniques
Over the history of heuristic search, many techniques have been created to improve performance and
to address different problem domains. Some prominent techniques include:
1. A* Search Algorithm
A* Search Algorithm is perhaps the most well-known heuristic search algorithm. It uses a best-first
search strategy and finds the least-cost path from a given initial node to a target node. Its
evaluation function is f(n) = g(n) + h(n), where g(n) is the cost from the start node to n, and h(n)
is a heuristic that estimates the cost of the cheapest path from n to the goal. A* is
widely used in pathfinding and graph traversal.
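A minimal A* sketch in Python, using the standard-library heapq module as the priority queue (the graph, edge costs, and heuristic values below are hypothetical):
import heapq

def a_star(graph, h, start, goal):
    # graph: {node: [(neighbor, edge_cost), ...]}; h: heuristic estimate to the goal
    open_heap = [(h[start], 0, start, [start])]  # entries are (f = g + h, g, node, path)
    visited = set()
    while open_heap:
        f, g, node, path = heapq.heappop(open_heap)
        if node == goal:
            return path, g
        if node in visited:
            continue
        visited.add(node)
        for nbr, cost in graph.get(node, []):
            if nbr not in visited:
                heapq.heappush(open_heap, (g + cost + h[nbr], g + cost, nbr, path + [nbr]))
    return None, float('inf')

# Hypothetical graph with an admissible heuristic
graph = {'A': [('B', 1), ('C', 4)], 'B': [('C', 2), ('D', 5)], 'C': [('D', 1)], 'D': []}
h = {'A': 3, 'B': 2, 'C': 1, 'D': 0}
print(a_star(graph, h, 'A', 'D'))  # -> (['A', 'B', 'C', 'D'], 4)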
2. Greedy Best-First Search
Greedy best-first search expands the node that is closest to the goal, as estimated by a heuristic
function. Unlike A*, which considers both path cost and estimated remaining cost, greedy best-first
search only prioritizes the estimated cost to the goal. While this makes it faster, it is not
guaranteed to find the best path and often leads to suboptimal solutions.
3. Hill Climbing
Hill climbing is a heuristic search used for mathematical optimization problems. It is a variant of the
gradient ascent method. It starts from a random initial point and iteratively moves toward higher
values (local maxima) by choosing the best neighboring state. However, it can get stuck in local
maxima, failing to find the global optimum.
4. Simulated Annealing
Inspired by annealing in metallurgy, simulated annealing is a probabilistic technique for finding the
global optimum. Unlike hill climbing, it allows the search to accept worse solutions temporarily to
escape local optima. This probabilistic acceptance decreases over time, allowing it to converge
toward the best solution.
5. Beam Search
Beam search is a graph-based search technique that explores only a limited number of promising
nodes (a beam). The beam width, which limits the number of nodes stored in memory, plays a
crucial role in the performance and accuracy of the search.
Applications of Heuristic Search
Heuristic search techniques are widely used in various real-world scenarios, including:
Pathfinding: Whether it's navigating a city or plotting a route in a game, heuristic search helps
find the shortest or most efficient path between two points.
Optimization: From resource allocation to scheduling, heuristic methods help make the most of
available resources while maximizing efficiency.
Game Playing: In strategy games like chess and Go, AI relies on heuristic search to evaluate
possible moves and plan ahead.
Robotics: Autonomous robots use heuristic search to determine their movements, avoid
obstacles, and complete tasks efficiently.
Natural Language Processing (NLP): Search algorithms play a key role in language processing
tasks like parsing, semantic analysis, and text generation, helping AI understand and generate
human language.
Advantages of Heuristic Search Techniques
Heuristic search techniques offer several advantages:
Efficiency: By focusing on the most promising paths, heuristic search significantly reduces the
number of possibilities explored, saving both time and computational resources.
Optimality: When using admissible heuristics, certain algorithms like A* can guarantee an
optimal solution, ensuring the best possible outcome.
Versatility: Heuristic methods are adaptable and can be applied to a wide range of problems,
from pathfinding and optimization to game AI and robotics.
Limitations of Heuristic Search Techniques
Despite their advantages, heuristic search techniques also have some limitations:
Heuristic Quality: The effectiveness of heuristic search heavily depends on the quality of the
heuristic function. Poorly designed heuristics can lead to inefficient or suboptimal solutions.
Space Complexity: Some heuristic algorithms require large amounts of memory, especially
when dealing with extensive search spaces, making them less practical for resource-limited
environments.
Domain-Specificity: Designing effective heuristics often requires domain-specific knowledge,
which can make it difficult to create general-purpose heuristic approaches.
6) Bias-Variance Trade Off - Machine Learning
It is important to understand prediction errors (bias and variance) when it comes to accuracy in any
machine-learning algorithm. There is a tradeoff between a model’s ability to minimize bias and its
ability to minimize variance; striking the right balance (for example, when selecting the value of a
regularization constant) is what gives the best solution. A proper understanding of these errors
helps to avoid the overfitting and underfitting of a data set while training the algorithm.
What is Bias?
Bias is the difference between the predictions of the machine learning model and the correct
values. High bias gives a large error on training as well as testing data. An algorithm should
always be low-biased to avoid the problem of underfitting. With high bias, the predictions follow an
overly simple (e.g., straight-line) form and do not fit the data set accurately; such fitting is
known as underfitting of data. This happens when the hypothesis is too simple or linear in nature.
What is Variance?
Variance is the variability of the model's predictions for a given data point. A high-variance model
pays too much attention to the training data, fitting even its noise, and therefore performs well on
training data but poorly on unseen data; such fitting is known as overfitting.
We try to optimize the value of the total error for the model by using the Bias-Variance Tradeoff.
Total Error = Bias² + Variance + Irreducible Error
The best fit is given by the hypothesis at the tradeoff point: on an error-versus-complexity graph,
this is the point where the combined (total) error is lowest.
This is referred to as the best point chosen for the training of the algorithm which gives low error
in training as well as testing data.
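The tradeoff can be observed numerically. A small sketch using numpy polynomial fits (degrees, noise level, and sample size are arbitrary illustrative choices): a degree-1 fit underfits (high bias), while a degree-15 fit typically drives training error near zero but raises test error (high variance).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)       # noisy training targets
y_test = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)  # fresh noise, same inputs

for degree in (1, 4, 15):
    coeffs = np.polyfit(x, y, degree)  # may warn about conditioning at high degree
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")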
UNIT-3
1) Linear Regression vs Logistic Regression
Linear Regression is a supervised machine learning algorithm for regression. Regression models a
target prediction value based on independent variables; it is mostly used for finding relationships
between variables and for forecasting. Different regression models differ in the kind of
relationship they assume between the dependent and independent variables and in the number of
independent variables used. Logistic regression is basically a supervised classification algorithm:
in a classification problem, the target variable (or output), y, can take only discrete values for a
given set of features (or inputs), X.
1. Model type: Linear Regression is a supervised regression model; Logistic Regression is a supervised classification model.
2. Equation: Linear regression: y = a0 + a1x1 + a2x2 + ... + aixi. Logistic regression: y(x) = e^(a0 + a1x1 + ... + aixi) / (1 + e^(a0 + a1x1 + ... + aixi)). Here, y = response variable, xi = ith predictor variable, and ai = average effect on y as xi increases by 1.
3. Output: In Linear Regression, we predict the value as a continuous number; in Logistic Regression, we predict the value as 1 or 0.
4. Dependent variable: In Linear Regression, the dependent variable is numeric and the response is continuous; in Logistic Regression, the dependent variable has only two categories, and the model estimates the odds of the outcome given a set of quantitative or categorical independent variables.
5. Distribution: Linear regression assumes a normal (Gaussian) distribution of the dependent variable; logistic regression assumes a binomial distribution of the dependent variable.
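To see the difference concretely, a minimal sketch fitting both models with scikit-learn (the one-dimensional data below is made up for illustration):
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])

# Linear regression: continuous target, prediction is a real number
y_cont = np.array([1.2, 1.9, 3.1, 4.2, 4.8, 6.1])
lin = LinearRegression().fit(X, y_cont)
print(lin.predict([[7]]))  # a continuous value, roughly 7

# Logistic regression: binary target, prediction is a class plus a probability
y_bin = np.array([0, 0, 0, 1, 1, 1])
log = LogisticRegression().fit(X, y_bin)
print(log.predict([[7]]), log.predict_proba([[7]]))  # class 1 with high probability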
2) Describe how the K-Nearest Neighbor algorithm works with an example.
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm generally used for
classification, but it can also be used for regression tasks. It works by finding the "k" closest data points
(neighbors) to a given input and making a prediction based on the majority class (for classification) or
the average value (for regression). Since KNN makes no assumptions about the underlying data
distribution, it is a non-parametric and instance-based learning method.
K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead it stores the dataset and performs the computation only at
classification time.
For example, consider the following table of data points containing two features:
training_data = [[1, 2], [2, 3], [3, 4], [6, 7], [7, 8]]
training_labels = ['A', 'A', 'A', 'B', 'B']
test_point = [4, 5]
k=3
Prediction: For the test point [4, 5] with k = 3, the Euclidean distances to the training points are
approximately 4.24, 2.83, 1.41, 2.83 and 4.24. The three nearest neighbors are therefore [3, 4]
(label 'A'), [2, 3] (label 'A') and [6, 7] (label 'B'), so the majority vote predicts class 'A'.
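The same example can be checked with scikit-learn's KNeighborsClassifier; a minimal sketch using the data above:
from sklearn.neighbors import KNeighborsClassifier

training_data = [[1, 2], [2, 3], [3, 4], [6, 7], [7, 8]]
training_labels = ['A', 'A', 'A', 'B', 'B']

# k = 3: predict the majority label among the 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(training_data, training_labels)
print(knn.predict([[4, 5]]))  # -> ['A']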
3) What is Fisher’s Linear Discriminant used for in machine learning?
Linear Discriminant Analysis (LDA), also known as Normal Discriminant Analysis or Fisher's Linear
Discriminant, is a supervised classification technique that helps separate two or more classes by
converting a higher-dimensional data space into a lower-dimensional space. It is used to identify a
linear combination of features that best separates the classes within a dataset.
(Figure: two overlapping classes)
For example, suppose we have two classes that need to be separated efficiently. Each class may have
multiple features, and using a single feature to classify them may result in overlapping. To solve
this, LDA is used, as it combines multiple features to improve classification accuracy. LDA relies
on certain assumptions, which are worth understanding before examining how it works.
Key Assumptions of LDA
For LDA to perform effectively, certain assumptions are made:
Gaussian Distribution: The data in each class should follow a normal bell-shaped distribution.
Equal Covariance Matrices: All classes should have the same covariance structure.
Linear Separability: The data should be separable using a straight line or plane.
If these assumptions are met LDA can produce very good results. For example when data points
belonging to two classes are plotted if they are not linearly separable LDA will attempt to find a
projection that maximizes class separability.
(Figure: two classes with the LDA projection axis shown as a red dashed line)
The figure shows an example where the classes (black and green circles) are not linearly separable;
LDA attempts to separate them using a red dashed line. It uses both axes (X and Y) to generate a new
axis in such a way that it maximizes the distance between the means of the two classes while
minimizing the variation within each class. This transforms the dataset into a space where the
classes are better separated. After transforming the data points along the new axis, LDA maximizes
the class separation. This new axis allows for clearer classification by projecting the data along a
line that enhances the distance between the means of the two classes.
Perpendicular distance between the decision boundary and the data points helps us to visualize how
LDA works by reducing class variation and increasing separability. After generating this new axis
using the above-mentioned criteria all the data points of the classes are plotted on this new axis and
are shown in the figure given below.
(Figure: data points projected onto the new LDA axis)
It shows how LDA creates a new axis to project the data and separate the two classes effectively
along a linear path. But it fails when the distributions share the same mean, as it then becomes
impossible for LDA to find a new axis that makes both classes linearly separable. In such cases we
use non-linear discriminant analysis.
How does LDA work
LDA works by finding directions in the feature space that best separate the classes. It does this by
maximizing the difference between the class means while minimizing the spread within each class.
Let’s assume we have two classes with d-dimensional samples x1, x2, ..., xn, where:
n1 samples belong to class c1
n2 samples belong to class c2.
If xi represents a data point, its projection onto the line represented by the unit vector v is vᵀxi.
Let the means of class c1 and class c2 before projection be μ1 and μ2 respectively. After projection,
the new means are μ̂1 = vᵀμ1 and μ̂2 = vᵀμ2.
Our aim is to maximize the normalized difference |μ̂1 − μ̂2| to maximize the class separation. The
scatter for samples of class c1 is calculated as:
s1² = Σ (xi − μ1)² over all xi in c1
Similarly, for class c2:
s2² = Σ (xi − μ2)² over all xi in c2
The goal is to maximize the ratio of the between-class scatter to the within-class scatter, which
leads us to the following criterion:
J(v) = |μ̂1 − μ̂2|² / (s1² + s2²)
For the best separation, we calculate the eigenvector corresponding to the highest eigenvalue of the
scatter-matrix product S_W⁻¹ S_B.
Extensions to LDA
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance) allowing it to handle more complex relationships.
2. Flexible Discriminant Analysis (FDA): Uses non-linear combinations of inputs such as splines
to handle non-linear separability.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the covariance
estimate to prevent overfitting.
Implementation of LDA using Python
In this implementation we will perform linear discriminant analysis using Scikit-learn library on
the Iris dataset.
StandardScaler(): Standardizes the features to ensure they have a mean of 0 and a standard
deviation of 1 removing the influence of different scales.
fit_transform(): Standardizes the feature data by applying the transformation learned from the
training data ensuring each feature contributes equally.
LabelEncoder(): Converts categorical labels into numerical values that machine learning models
can process.
fit_transform() on y: Transforms the target labels into numerical values for use in classification
models.
LinearDiscriminantAnalysis(): Reduces the dimensionality of the data by projecting it into a
lower-dimensional space while maximizing the separation between classes.
transform() on X_test: Applies the learned LDA transformation to the test data to maintain
consistency with the training data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the Iris dataset into a DataFrame
iris = load_iris()
dataset = pd.DataFrame(columns=iris.feature_names, data=iris.data)
dataset['target'] = iris.target

X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values

# Standardize the features and encode the labels
sc = StandardScaler()
X = sc.fit_transform(X)
le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Project the data onto 2 linear discriminants
lda = LinearDiscriminantAnalysis(n_components=2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

# Visualize the training data in the 2D LDA space
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
            cmap='rainbow', alpha=0.7, edgecolors='b')
plt.show()

# Classify in the reduced space and evaluate
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
(Figure: scatter plot of the Iris data mapped into 2D by LDA)
This scatter plot shows three distinct groups of data points, represented by different colors. The
group on the right (dark blue) is clearly separated from the others, indicating it is very
different. The other two groups (red and light blue) are positioned closer together with some
overlap, suggesting they are more similar and harder to separate.
Advantages of LDA
Simple and computationally efficient.
Works well even when the number of features is much larger than the number of training
samples.
Can handle multicollinearity.
Disadvantages of LDA
Assumes Gaussian distribution of data which may not always be the case.
Assumes equal covariance matrices for different classes which may not hold in all datasets.
Assumes linear separability which is not always true.
May not always perform well in high-dimensional feature spaces.
Applications of LDA
1. Face Recognition: It is used to reduce the high-dimensional feature space of pixel values in face
recognition applications helping to identify faces more efficiently.
2. Medical Diagnosis: It classifies disease severity in mild, moderate or severe based on patient
parameters helping in decision-making for treatment.
3. Customer Identification: It can help identify customer segments most likely to purchase a
specific product based on survey data.
4) Bayes' Theorem in AI
In probability theory, Bayes' theorem talks about the relation of the conditional probability of two
random events and their marginal probability. In short, it provides a way to calculate the value of
P(B|A) by using the knowledge of P(A|B).
Bayes' theorem is the name given to the formula used to calculate conditional probability. The
formula is as follows:
P(A|B) = P(A∩B) / P(B) = (P(A) · P(B|A)) / P(B)
where,
P(A) is the probability that event A occurs.
P(B) defines the probability that event B occurs.
P(A|B) is the probability of the occurrence of event A given that event B has already occurred.
P(B∣A) can now be read as: Probability of event B occurring given that event A occurred.
p(A∩B) is the probability events A and B will happen together.
Key terms in Bayes' Theorem
The Bayes' Theorem is a basic concept in probability and statistics. It gives a model of updating
beliefs or probabilities when the new evidence is presented. This theorem was named after
Reverend Thomas Bayes and has been applied in many fields, ranging from artificial intelligence
and machine learning to data analysis.
The Bayes' Theorem encompasses four major elements:
1. Prior Probability (P(A)): The probability or belief in an event A prior to considering any
additional evidence, it represents what we know or believe about A based on previous
knowledge.
2. Likelihood P(B|A): the probability of evidence B given the occurrence of event A. It
determines how strongly the evidence points toward the event.
3. Evidence (P(B)): Evidence is the probability of observing evidence B regardless of whether A
is true. It serves to normalize the distribution so that the posterior probability is a valid
probability distribution.
4. Posterior Probability P(A|B): The posterior probability is a revised belief regarding event A,
informed by some new evidence B. It answers the question, "What is the probability that A is
true given evidence B observed?"
Using these components, Bayes' Theorem computes the posterior probability P(A|B), which
represents our updated belief in A after considering the new evidence.
In artificial intelligence, probability and the Bayes Theorem are especially useful when making
decisions or inferences based on uncertain or incomplete data. It enables us to rationally update our
beliefs as new evidence becomes available, making it an indispensable tool in AI, machine
learning, and decision-making processes.
How Bayes theorem is relevant in AI?
Bayes' theorem is highly relevant in AI due to its ability to handle uncertainty and make decisions
based on probabilities. Here's why it's crucial:
1. Probabilistic Reasoning: In many real-world scenarios, AI systems must reason under
uncertainty. Bayes' theorem allows AI systems to update their beliefs based on new evidence.
This is essential for applications like autonomous vehicles, where the environment is constantly
changing and sensors provide noisy information.
2. Machine Learning: Bayes' theorem serves as the foundation for Bayesian machine learning
approaches. These methods allow AI models to incorporate prior knowledge and update their
beliefs as they see more data. This is particularly useful in scenarios with limited data or when
dealing with complex relationships between variables.
3. Classification and Prediction: In classification tasks, such as spam email detection or medical
diagnosis, Bayes' theorem can be used to calculate the probability that a given input belongs to
a particular class. This allows AI systems to make more informed decisions based on the
available evidence.
4. Anomaly Detection: Bayes' theorem is used in anomaly detection, where AI systems identify
unusual patterns in data. By modeling the normal behavior of a system, Bayes' theorem can
help detect deviations from this norm, signaling potential anomalies or security threats.
Overall, Bayes' theorem provides a powerful framework for reasoning under uncertainty and is
essential for many AI applications, from decision-making to pattern recognition.
Mathematical Derivation of Bayes' Rule
Bayes' Rule is derived from the definition of conditional probability. Let's start with the definition:
P(A|B) = P(A∩B) / P(B)
This equation states that the probability of event A given event B is equal to the probability of
both events happening (the intersection of A and B) divided by the probability of event B.
Similarly, we can write the conditional probability of event B given event A:
P(B|A) = P(A∩B) / P(A)
By rearranging this equation, we get:
P(A∩B) = P(B|A) · P(A)
We now have two expressions equal to P(A∩B), so we can set them equal to each other:
P(A|B) · P(B) = P(B|A) · P(A)
To get P(A|B), we divide both sides by P(B):
P(A|B) = (P(B|A) · P(A)) / P(B)
Importance of Bayes' Theorem in AI
Bayes' Theorem is extremely important in artificial intelligence (AI) and related fields.
Probabilistic Reasoning: In AI, many problems involve uncertainty, so probabilistic reasoning
is an important technique. Bayes' Theorem enables artificial intelligence systems to model and
reason about uncertainty by updating beliefs in response to new evidence. This is important for
decision-making, pattern recognition, and predictive modeling.
Machine Learning: Bayes' Theorem is a fundamental concept in machine learning,
specifically Bayesian machine learning. Bayesian methods are used to model complex
relationships, estimate model parameters, and predict outcomes. Bayesian models enable the
principled handling of uncertainty in tasks such as classification, regression, and clustering.
Data Science: Bayes' Theorem is used extensively in Bayesian statistics. It is used to estimate
and update probabilities in a variety of settings, including hypothesis testing, Bayesian
inference, and Bayesian optimization. It offers a consistent framework for modeling and
comprehending data.
Example of Bayes' Rule Application in AI
One of the good old example of Bayes' Rule in AI is its application in spam email classification.
This example demonstrates how Bayes' Theorem is used to classify emails as spam or non-spam
based on the presence of certain keywords.
Consider an email filtering system that needs to determine whether an incoming email is spam or
not based on the presence of the word "win" in the email. We are given the following probabilities:
P(S): The prior probability that any given email is spam.
P(H): The prior probability that any given email is not spam (ham).
P(W∣S): The probability that the word "win" appears in a spam email.
P(W∣H): The probability that the word "win" appears in a non-spam email.
P(W): The probability that the word "win" appears in any email.
Given Data
P(S)=0.2 (20% of emails are spam)
P(H)=0.8 (80% of emails are not spam)
P(W∣S)=0.6 (60% of spam emails contain the word "win")
P(W|H) = 0.1 (10% of non-spam emails contain the word "win")
We want to find P(S∣W), the probability that an email is spam given that it contains the word
"win".
Applying Bayes' rule, we get:
P(S|W) = (P(W|S) · P(S)) / P(W)
First, we need to calculate P(W), the probability that any email contains the word "win". Using the
law of total probability:
P(W) = P(W|S) · P(S) + P(W|H) · P(H)
Substituting the given values:
P(W) = (0.6 × 0.2) + (0.1 × 0.8) = 0.12 + 0.08 = 0.2
Now we can use Bayes' Rule to find P(S|W):
P(S|W) = (P(W|S) · P(S)) / P(W)
Substituting the values:
P(S|W) = (0.6 × 0.2) / 0.2 = 0.6
Thus we can conclude that the probability that an email is spam given that it contains the word
"win" is 0.6, or 60%. This means that if an email contains the word "win," there is a 60% chance
that it is spam.
In a real-world AI system, such as an email spam filter, this calculation would be part of a larger
model that considers multiple features (words) within an email. The filter uses these probabilities,
along with other algorithms, to classify emails accurately and efficiently. By continuously updating
the probabilities based on incoming data, the spam filter can adapt to new types of spam and
improve its accuracy over time.
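The arithmetic above is easy to reproduce; a tiny Python sketch using exactly the numbers given in the example:
# Given probabilities from the example above
p_spam, p_ham = 0.2, 0.8
p_win_given_spam, p_win_given_ham = 0.6, 0.1

# Law of total probability: P(W)
p_win = p_win_given_spam * p_spam + p_win_given_ham * p_ham  # 0.12 + 0.08 = 0.2

# Bayes' rule: P(S | W)
p_spam_given_win = p_win_given_spam * p_spam / p_win
print(p_spam_given_win)  # -> 0.6, i.e. 60%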
Uses of Bayes Rule in Artificial Intelligence
Bayes' theorem in AI is used to draw probabilistic conclusions, update beliefs, and make decisions
based on available information. Here are some important applications of Bayes' rule in AI.
1. Bayesian Inference: In Bayesian statistics, the Bayes' rule is used to update the probability
distribution over a set of parameters or hypotheses using observed data. This is especially
important for machine learning tasks like parameter estimation in Bayesian networks, hidden
Markov models, and probabilistic graphical models.
2. Naive Bayes Classification: In the field of natural language processing and text classification,
the Naive Bayes classifier is widely used. It uses Bayes' theorem to calculate the likelihood that
a document belongs to a specific category based on the words it contains. Despite its "naive"
assumption of feature independence, it works surprisingly well in practice.
3. Bayesian Networks: Bayesian networks are graphical models that use Bayes' theorem to
represent and predict probabilistic relationships between variables. They are used in a variety of
AI applications, such as medical diagnosis, fault detection, and decision support systems.
4. Spam Email Filtering: In email filtering systems, Bayes' theorem is used to determine whether
an incoming email is spam or not. The model calculates the likelihood of seeing specific words
or features in spam or non-spam emails and adjusts the probabilities accordingly.
5. Reinforcement Learning: Bayes' rule can be used to model the environment in a probabilistic
manner. Bayesian reinforcement learning methods can help agents estimate and update their
beliefs about state transitions and rewards, allowing them to make more informed decisions.
6. Bayesian Optimization: In optimization tasks, Bayes' theorem can be used to represent the
objective function as a probabilistic surrogate. Bayesian optimization techniques make use of
this model to iteratively explore and exploit the search space in order to efficiently find the
optimal solution. This is commonly used for hyperparameter tuning and algorithm parameter
optimization.
7. Anomaly Detection: The Bayes theorem can be used to identify anomalies or outliers in
datasets. Deviations from the normal distribution can be quantified by modeling it, which aids
in anomaly detection for a variety of applications, including fraud detection and network
security.
8. Personalization: In recommendation systems, Bayes' theorem can be used to update user
preferences and provide personalized recommendations. By constantly updating a user's
preferences based on their interactions, the system can recommend more relevant content.
9. Robotics and Sensor Fusion: In robotics, the Bayes' rule is used to combine sensors. It uses
data from multiple sensors to estimate the state of a robot or its environment. This is necessary
for tasks like localization and mapping.
10. Medical Diagnosis: In healthcare, Bayes' theorem is used in medical decision support systems
to update the likelihood of various diagnoses based on patient symptoms, test results, and
medical history.
5) What is Inferential Statistics?
Inferential statistics is an important tool that allows us to make predictions and conclusions about a
population based on sample data. Unlike descriptive statistics, which only summarizes data,
inferential statistics lets us test hypotheses, make estimates and measure the uncertainty about our
predictions. These tools are essential for evaluating models, testing assumptions and supporting
data-driven decision-making.
For example, instead of surveying every voter in a country, we can survey a few thousand and still
make reliable conclusions about the entire population’s opinion. Inferential statistics provides the
tools to do this in a systematic and mathematical way.
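As a concrete illustration, a hedged sketch using scipy's one-sample t-test (the sample is simulated here; in reality it would come from a survey):
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=52, scale=10, size=200)  # hypothetical sample from a large population

# Test the null hypothesis that the population mean is 50, using only the sample
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the population mean differs from 50,
# a conclusion about the whole population drawn from a sample alone.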
6) Explain the concept of Logistic regression in Classification tasks.
Logistic regression is a supervised learning classification algorithm used to predict the probability of
a target variable. The nature of target or dependent variable is dichotomous, which means there
would be only two possible classes.
In simple words, the dependent variable is binary in nature having data coded as either 1 (stands for
success/yes) or 0 (stands for failure/no).
Generally, logistic regression means binary logistic regression having binary target variables, but
there can be two more categories of target variables that can be predicted by it. Based on those
number of categories, Logistic regression can be divided into following types −
Binary or Binomial
In such a kind of classification, a dependent variable will have only two possible types either 1 and 0.
For example, these variables may represent success or failure, yes or no, win or loss etc.
Multinomial
In such a kind of classification, dependent variable can have 3 or more possible unordered types or
the types having no quantitative significance. For example, these variables may represent "Type A"
or "Type B" or "Type C".
Ordinal
In such a kind of classification, dependent variable can have 3 or more possible ordered types or the
types having a quantitative significance. For example, these variables may represent "poor" or
"good", "very good", "Excellent" and each category can have the scores like 0,1,2,3.
Before diving into the implementation of logistic regression, we must be aware of the following
assumptions about the same −
In case of binary logistic regression, the target variables must be binary always and the desired
outcome is represented by the factor level 1.
There should not be any multi-collinearity in the model, which means the independent variables must
be independent of each other.
We must include meaningful variables in our model.
We should choose a large sample size for logistic regression.
The simplest form of logistic regression is binary or binomial logistic regression in which the target
or dependent variable can have only 2 possible types either 1 or 0. It allows us to model a
relationship between multiple predictor variables and a binary/binomial target variable. In case of
logistic regression, the linear function is basically used as an input to another function such as in the
following relation −
hθ(x) = g(θᵀx), where 0 ≤ hθ(x) ≤ 1
g(z) = 1 / (1 + e^(−z)), with z = θᵀx
The sigmoid curve can be represented by the following graph: its values on the y-axis lie between 0
and 1, and it crosses the y-axis at 0.5.
The output of the hypothesis function gives the probability of the positive class, lying between 0 and 1. For our implementation, we interpret an example as positive if the output of the hypothesis function is ≥ 0.5, otherwise negative.
We also need to define a loss function to measure how well the algorithm performs using the weights, represented by theta, as follows −
J(θ) = (1/m) · (−yᵀ log(h) − (1 − y)ᵀ log(1 − h))
Now, after defining the loss function, our prime goal is to minimize it. This is done by fitting the weights, i.e., by increasing or decreasing them. The derivatives of the loss function with respect to each weight tell us which parameters should have high weights and which should have smaller ones.
The following gradient descent equation tells us how loss would change if we modified the
parameters −
∂J(θ)/∂θⱼ = (1/m) Xᵀ (g(Xθ) − y)
Now we will implement the above concept of binomial logistic regression in Python. For this purpose, we are using a multivariate flower dataset named iris, which has 3 classes of 50 instances each; we will be using only the first two feature columns. Every class represents a type of iris flower.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2]
y = (iris.target != 0) * 1
plt.figure(figsize=(6, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='g', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='y', label='1')
plt.legend();
Next, we will define the sigmoid function, loss function and gradient descent inside a class as follows −
class LogisticRegression:
    def __init__(self, lr=0.01, num_iter=100000, fit_intercept=True, verbose=False):
        self.lr = lr
        self.num_iter = num_iter
        self.fit_intercept = fit_intercept
        self.verbose = verbose

    def __add_intercept(self, X):
        # Prepend a column of ones so the first theta acts as the intercept
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)

    def __sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def __loss(self, h, y):
        return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

    def fit(self, X, y):
        if self.fit_intercept:
            X = self.__add_intercept(X)
        self.theta = np.zeros(X.shape[1])
        for i in range(self.num_iter):
            z = np.dot(X, self.theta)
            h = self.__sigmoid(z)
            gradient = np.dot(X.T, (h - y)) / y.size   # (1/m)·Xᵀ(h − y)
            self.theta -= self.lr * gradient
            if self.verbose == True and i % 10000 == 0:
                loss = self.__loss(self.__sigmoid(np.dot(X, self.theta)), y)
                print(f'loss: {loss} \t')

    def predict_prob(self, X):
        # Probability of the positive class for each sample
        if self.fit_intercept:
            X = self.__add_intercept(X)
        return self.__sigmoid(np.dot(X, self.theta))

    def predict(self, X, threshold=0.5):
        return (self.predict_prob(X) >= threshold) * 1
Next, we train the classifier on the data; with the help of the following script, we can then predict the output probabilities and plot the decision boundary −
model = LogisticRegression(lr=0.1, num_iter=300000)
model.fit(X, y)
plt.figure(figsize=(10, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='g', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='y', label='1')
plt.legend()
x1_min, x1_max = X[:, 0].min(), X[:, 0].max()
x2_min, x2_max = X[:, 1].min(), X[:, 1].max()
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max), np.linspace(x2_min, x2_max))
grid = np.c_[xx1.ravel(), xx2.ravel()]
probs = model.predict_prob(grid).reshape(xx1.shape)
plt.contour(xx1, xx2, probs, [0.5], linewidths=1, colors='red');
Now we will implement the above concept of multinomial logistic regression in Python. For this purpose, we are using a dataset from sklearn named digits.
import sklearn
from sklearn import datasets
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split
digits = datasets.load_digits()
X = digits.data
y = digits.target
With the help of the next lines of code, we can split X and y into training and testing sets and build the model −
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
digreg = linear_model.LogisticRegression(max_iter=10000)
Now, we need to train the model using the training set and evaluate it on the test set as follows −
digreg.fit(X_train, y_train)
y_pred = digreg.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Output
UNIT-4
1)Types Of Activation Function in ANN
The biological neural network has been modeled in the form of Artificial Neural Networks, with artificial neurons simulating the function of a biological neuron: each neuron computes a weighted sum of its inputs and passes the result through an activation function. Commonly used activation functions include the following:
A. Identity (Linear) Function: f(x) = x. The output is simply proportional to the input; it is typically used in the output layer for regression tasks, since a network built only from linear units cannot model non-linear relationships.
B. Binary Step (Threshold) Function: f(x) = 1 if x ≥ θ, otherwise 0. It produces a hard binary decision, as in simple threshold units, but it is not differentiable and therefore cannot be trained with gradient-based methods.
C. ReLU (Rectified Linear Unit) Function: It is the most popularly used activation function in the
areas of convolutional neural networks and deep learning. It is of the form:
f(x) = x for x ≥ 0, and f(x) = 0 for x < 0
This means that f(x) is zero when x is less than zero and f(x) is equal to x when x is greater than or equal to zero. The function is differentiable everywhere except at the single point x = 0; in that sense, the derivative of ReLU is actually a sub-derivative (sub-gradient).
D. Sigmoid Function: It has historically been one of the most commonly used activation functions in neural networks.
The need for sigmoid function stems from the fact that many learning algorithms require the
activation function to be differentiable and hence continuous. There are two types of sigmoid
function:
1. Binary Sigmoid Function
A binary sigmoid function is of the form:
y_out = f(x) = 1 / (1 + e⁻ᵏˣ)
where k is the steepness or slope parameter. By varying the value of k, sigmoid functions with different slopes can be obtained. It has a range of (0, 1), and the slope at the origin is k/4. As the value of k becomes very large, the sigmoid function approaches a threshold function.
2. Bipolar Sigmoid Function
y_out = f(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
This is the hyperbolic tangent (tanh) function; it behaves like the binary sigmoid but has a range of (−1, 1).
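To visualize the shapes of these functions, here is a small NumPy sketch (the input range and the default k = 1 are arbitrary choices for illustration) that evaluates ReLU, the binary sigmoid and the bipolar sigmoid:
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    return np.maximum(0, x)                  # 0 for x < 0, x otherwise

def binary_sigmoid(x, k=1.0):
    return 1 / (1 + np.exp(-k * x))          # range (0, 1); slope k/4 at origin

def bipolar_sigmoid(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))  # tanh, range (-1, 1)

x = np.linspace(-5, 5, 200)
for fn in (relu, binary_sigmoid, bipolar_sigmoid):
    plt.plot(x, fn(x), label=fn.__name__)
plt.legend(); plt.title("Common activation functions"); plt.show()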
2)What is the motivation behind using neural networks for learning the concept? Explain
briefly.
The primary motivation behind using neural networks is their ability to mimic the human brain in
learning patterns and making intelligent decisions from data. Neural networks consist of layers of
interconnected nodes (neurons) that can process information in a non-linear and adaptive manner.
They are particularly useful for learning concepts where explicit programming is difficult or
impossible due to the complexity of the data.
Key Motivations:
1. Pattern Recognition:
Neural networks excel at recognizing patterns in complex and high-dimensional data like
images, speech, and text.
2. Non-linear Processing:
They can model non-linear relationships between inputs and outputs, which many
traditional algorithms cannot.
3. Learning from Experience:
Neural networks learn from examples, making them effective for problems where rules are
not clearly defined.
4. Generalization Ability:
After training, neural networks can generalize knowledge to handle new, unseen data
effectively.
5. Adaptability:
They are highly flexible and can be adapted to various tasks, including classification,
regression, prediction, and control systems.
6. Noise Tolerance:
Neural networks can perform well even when the input data contains noise or incomplete
information.
7. Feature Extraction:
Deep neural networks automatically learn important features from raw data, reducing the
need for manual feature engineering.
8. Parallel Processing:
They support parallel processing, which improves training efficiency, especially on GPUs.
9. Real-World Success:
Neural networks have achieved remarkable success in areas like image recognition, natural
language processing, and autonomous vehicles.
10. Continuous Improvement:
With more data and computation, neural networks continue to improve their performance
over time.
Conclusion:
Neural networks provide a powerful, flexible, and data-driven approach to learning concepts,
especially when the data is complex and traditional methods are inadequate.
3)SVM and the Maximum Margin Linear Separator
The key idea behind the SVM algorithm is to find the hyperplane that best separates two classes by
maximizing the margin between them. This margin is the distance from the hyperplane to the
nearest data points (support vectors) on each side.
The best hyperplane, also known as the "hard margin", is the one that maximizes the distance between the hyperplane and the nearest data points from both classes, ensuring a clear separation between the classes. Now consider a scenario in which one blue ball lies inside the boundary of the red balls.
How does SVM classify the data?
The blue ball lying inside the boundary of the red ones is an outlier of the blue class. The SVM algorithm has the characteristic of ignoring such outliers and still finding the hyperplane that maximizes the margin; in this sense, SVM is robust to outliers.
(Figure: the optimal hyperplane, ignoring the outlier)
A soft margin allows for some misclassifications or violations of the margin to improve
generalization. The SVM optimizes the following equation to balance margin maximization and
penalty minimization:
Objective Function=(1/margin)+λ∑penalty
The penalty used for violations is often hinge loss which has the following behavior:
If a data point is correctly classified and lies outside the margin, there is no penalty (loss = 0).
If a point is incorrectly classified or violates the margin the hinge loss increases proportionally
to the distance of the violation.
Till now we were talking about linearly separable data, where a straight line separates the group of blue balls from the red balls.
When data is not linearly separable, i.e., it cannot be divided by a straight line, SVM uses a technique called kernels to map the data into a higher-dimensional space where it becomes separable. This transformation helps SVM find a decision boundary even for non-linear data.
(Figure: original 1D dataset for classification)
A kernel is a function that maps data points into a higher-dimensional space without explicitly
computing the coordinates in that space. This allows SVM to work efficiently with non-linear data
by implicitly performing the mapping. For example consider data points that are not linearly
separable. By applying a kernel function SVM transforms the data points into a higher-dimensional
space where they become linearly separable.
Linear Kernel: For linear separability.
Polynomial Kernel: Maps data into a polynomial space.
Radial Basis Function (RBF) Kernel: Transforms data into a space based on distances
between data points.
(Figure: mapping the 1D data to 2D makes the two classes separable)
In this case the new variable y is created as a function of distance from the origin.
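Here is a minimal sketch of this idea (the data points are made up for illustration): 1D points that cannot be separated on the line become separable after adding y = x² as a second coordinate:
import numpy as np
import matplotlib.pyplot as plt

# 1D points: class 0 sits near the origin, class 1 sits farther away
x = np.array([-4, -3, 3, 4, -1, -0.5, 0.5, 1])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Kernel-style feature map: lift each point to 2D with y = x^2
y = x ** 2

plt.scatter(x[labels == 0], y[labels == 0], color='g', label='class 0')
plt.scatter(x[labels == 1], y[labels == 1], color='y', label='class 1')
plt.axhline(5, color='red', linewidth=1)   # a horizontal line now separates the classes
plt.legend(); plt.show()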
Consider a binary classification problem with two classes, labeled as +1 and -1. We have a training
dataset consisting of input feature vectors X and their corresponding class labels Y. The equation
for the linear hyperplane can be written as:
wᵀx + b = 0
Where:
w is the normal vector to the hyperplane (the direction perpendicular to it).
b is the offset or bias term, representing the distance of the hyperplane from the origin along the normal vector w.
Distance from a Data Point to the Hyperplane
The distance between a data point xᵢ and the decision boundary can be calculated as:
dᵢ = (wᵀxᵢ + b) / ‖w‖
where ‖w‖ represents the Euclidean norm of the weight (normal) vector w.
Linear SVM Classifier
For a linear SVM classifier, the prediction rule is:
ŷ = 1 if wᵀx + b ≥ 0
ŷ = 0 if wᵀx + b < 0
Where ŷ is the predicted label of a data point.
Optimization Problem for SVM
For a linearly separable dataset the goal is to find the hyperplane that maximizes the margin
between the two classes while ensuring that all data points are correctly classified. This leads to the
following optimization problem:
Minimize over w, b: (1/2)‖w‖²
Subject to the constraint:
yᵢ(wᵀxᵢ + b) ≥ 1 for i = 1, 2, 3, ⋯, m
Where:
yᵢ is the class label (+1 or −1) for each training instance.
xᵢ is the feature vector for the i-th training instance.
m is the total number of training instances.
The condition yᵢ(wᵀxᵢ + b) ≥ 1 ensures that each data point is correctly classified and lies outside the margin.
Soft Margin in Linear SVM Classifier
In the presence of outliers or non-separable data the SVM allows some misclassification by
introducing slack variables ζi. The optimization problem is modified as:
Minimize over w, b: (1/2)‖w‖² + C ∑ᵢ₌₁ᵐ ζᵢ
Subject to the constraints:
yᵢ(wᵀxᵢ + b) ≥ 1 − ζᵢ and ζᵢ ≥ 0 for i = 1, 2, …, m
Where:
C is a regularization parameter that controls the trade-off between margin maximization and the penalty for misclassifications.
ζᵢ are slack variables that represent the degree of violation of the margin by each data point.
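As a quick illustration of the role of C, here is a minimal sketch using scikit-learn's SVC on synthetic data (the dataset parameters and the C values are arbitrary choices for illustration):
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping clusters
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

# Small C -> soft margin, more violations tolerated (typically more support vectors);
# large C -> hard-margin-like behavior, violations penalized heavily.
for C in (0.01, 1, 100):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors")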
Dual Problem for SVM
The dual problem involves maximizing the Lagrange multipliers associated with the support
vectors. This transformation allows solving the SVM optimization using kernel functions for non-
linear classification.
The dual objective function is given by:
maximize over α: ∑ᵢ₌₁ᵐ αᵢ − (1/2) ∑ᵢ₌₁ᵐ ∑ⱼ₌₁ᵐ αᵢαⱼtᵢtⱼK(xᵢ, xⱼ)
Where:
αᵢ are the Lagrange multipliers associated with the i-th training sample.
tᵢ is the class label for the i-th training sample.
K(xᵢ, xⱼ) is the kernel function that computes the similarity between data points xᵢ and xⱼ. The kernel allows SVM to handle non-linear classification problems by mapping data into a higher-dimensional space.
The dual formulation optimizes the Lagrange multipliers αᵢ, and the support vectors are those training samples where αᵢ > 0.
SVM Decision Boundary
Once the dual problem is solved, the decision function for a test point x is given by:
f(x) = ∑ᵢ₌₁ᵐ αᵢtᵢK(xᵢ, x) + b
Where x is the test data point and b is the bias term. The bias term b is determined from the support vectors, which satisfy:
tᵢ(wᵀxᵢ − b) = 1 ⇒ b = wᵀxᵢ − tᵢ
Where xᵢ is any support vector.
This completes the mathematical framework of the Support Vector Machine algorithm which
allows for both linear and non-linear classification using the dual problem and kernel trick.
Based on the nature of the decision boundary, Support Vector Machines (SVM) can be divided into
two main parts:
Linear SVM: Linear SVMs use a linear decision boundary to separate the data points of
different classes. When the data can be precisely linearly separated, linear SVMs are very
suitable. This means that a single straight line (in 2D) or a hyperplane (in higher dimensions)
can entirely divide the data points into their respective classes. A hyperplane that maximizes
the margin between the classes is the decision boundary.
Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be separated
into two classes by a straight line (in the case of 2D). By using kernel functions, nonlinear
SVMs can handle nonlinearly separable data. The original input data is transformed by these
kernel functions into a higher-dimensional feature space where the data points can be linearly
separated. A linear SVM is used to locate a nonlinear decision boundary in this modified
space.
Example: predict whether a cancer is benign or malignant. Using historical data about patients diagnosed with cancer, doctors can differentiate malignant cases from benign ones given the independent attributes. The steps are:
Load the breast cancer dataset from sklearn.datasets.
Separate the input features and target variable.
Build and train the SVM classifier using the RBF kernel.
Plot a scatter plot of the input features.
# Load the important packages
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.svm import SVC

# Load the dataset and keep the first two features so we can plot in 2D
cancer = load_breast_cancer()
X = cancer.data[:, :2]
y = cancer.target

# Build and train the SVM classifier using the RBF kernel
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# Plot the decision regions and a scatter plot of the input features
DecisionBoundaryDisplay.from_estimator(clf, X, response_method="predict", alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolors="k")
plt.show()
Output:
(Figure: breast cancer classification with the SVM RBF kernel)
Limitations of SVM:
1. Slow Training: SVM can be slow to train on large datasets, which affects its performance in data mining tasks.
2. Parameter Tuning Difficulty: Selecting the right kernel and adjusting parameters like C
requires careful tuning, impacting SVM algorithms.
3. Noise Sensitivity: SVM struggles with noisy datasets and overlapping classes, limiting
effectiveness in real-world scenarios.
4. Limited Interpretability: The complexity of the hyperplane in higher dimensions makes SVM
less interpretable than other models.
5. Feature Scaling Sensitivity: Proper feature scaling is essential, otherwise SVM models may
perform poorly.
4)Architecture of a Simple Neural Network
Just like the human brain processes information through interconnected neurons, ANNs use layers of
artificial neurons to learn patterns and make predictions.
The architecture explains how data flows through the network, how neurons (units) are connected,
and how the network learns and makes predictions.
Layers: Neural networks consist of layers of neurons, which include the input layers, hidden
layers, and output layers.
Neurons (Nodes): Neurons are the basic computational units that perform a weighted sum of
their inputs, apply a bias, and pass the result through an activation function.
Weights and biases: Weights represent the strength of the connections between neurons, and
biases allow neurons to make predictions even when all inputs are zero.
Activation function: Non-linear functions (like ReLU and Sigmoid) are used to introduce
non-linearity into the network, enabling it to model complex relationships.
The Artificial Neural Network (ANN) architecture refers to the structured arrangement of nodes
(neurons) and layers that define how an artificial neural network processes and learns from data. The
design of ANN influences its ability to learn complex patterns and perform tasks efficiently.
Task-specific design
The architecture is chosen based on the task and the type of data. For example, Convolutional Neural
Networks (CNNs) are suitable for image data, while Recurrent Neural Networks (RNNs) or
transformers are preferred for sequential functions like speech and text analysis. CNNs, in particular,
leverage convolutional layers to detect spatial hierarchies in images, making the architecture of
CNN an essential factor in tasks like image classification.
Model capacity
A network's depth (number of hidden layers) and width (number of neurons per layer) affect its
capacity to capture complex relationships in the data. Deep architectures are effective for complex
tasks like image recognition and speech processing.
Efficiency
The architecture affects the computational efficiency of the network. For example, CNNs use of
shared weights in convolutional layers reduces the number of parameters and computational cost.
Optimization
The structure of the architecture affects how easily the network can be optimized. For instance, deeper networks face challenges like the vanishing gradient problem.
Model generalization
The architecture influences the model's ability to generalize to unseen data. Complex architectures
with too many parameters can lead to overfitting, while simpler architectures may not capture enough
data complexity.
5)What is Perceptron?
Perceptron is a type of neural network that performs binary classification that maps input features
to an output decision, usually classifying data into one of two categories, such as 0 or 1.
Perceptron consists of a single layer of input nodes that are fully connected to a layer of output
nodes. It is particularly good at learning linearly separable patterns. It utilizes a variation of
artificial neurons called Threshold Logic Units (TLU), which were first introduced by Warren McCulloch and Walter Pitts in the 1940s. This foundational model has played a crucial role in the development
of more advanced neural networks and machine learning algorithms.
Types of Perceptron
Single-Layer Perceptron: a single layer of weights connects the inputs directly to the output; it can learn only linearly separable patterns.
Multi-Layer Perceptron: one or more hidden layers sit between the input and output layers, allowing it to learn more complex, non-linearly separable patterns.
In a fully connected layer, also known as a dense layer, all neurons in one layer are connected to every neuron in the previous layer.
The output of the fully connected layer is computed as:
fW,b(X) = h(XW + b)
where X is the input, W is the weight matrix connecting the input neurons to the output neurons, b is the bias, and h is the step activation function.
During training, the Perceptron's weights are adjusted to minimize the difference between the
predicted output and the actual output. This is achieved using supervised learning algorithms like
the delta rule or the Perceptron learning rule.
The weight update formula is:
wᵢ,ⱼ = wᵢ,ⱼ + η(yⱼ − ŷⱼ)xᵢ
Where:
wᵢ,ⱼ is the weight between the i-th input and the j-th output neuron,
xᵢ is the i-th input value,
yⱼ is the actual value and ŷⱼ is the predicted value,
η is the learning rate, controlling how much the weights are adjusted.
This process enables the perceptron to learn from data and improve its prediction accuracy over
time.
Let’s take a simple example of classifying whether a given fruit is an apple or not based on two
inputs: its weight (in grams) and its color (on a scale of 0 to 1, where 1 means red). The perceptron
receives these inputs, multiplies them by their weights, adds a bias, and applies the activation
function to decide whether the fruit is an apple or not.
Input 1 (Weight): 150 grams
Input 2 (Color): 0.9 (since the fruit is mostly red)
Weights: [0.5, 1.0]
Bias: 1.5
The perceptron's weighted sum would be:
(150 × 0.5) + (0.9 × 1.0) + 1.5 = 77.4
Let's assume the activation function uses a threshold of 75. Since 77.4 > 75, the perceptron classifies the fruit as an apple (output = 1).
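The same computation, written as a minimal Python sketch (using the illustrative weights, bias and threshold from the example above):
import numpy as np

def perceptron_predict(x, w, b, threshold=75.0):
    # Classify as apple (1) if the weighted sum plus bias exceeds the threshold
    weighted_sum = np.dot(x, w) + b
    return 1 if weighted_sum > threshold else 0

x = np.array([150.0, 0.9])   # [weight in grams, color score]
w = np.array([0.5, 1.0])     # one weight per input
b = 1.5                      # bias

print(perceptron_predict(x, w, b))   # 75 + 0.9 + 1.5 = 77.4 > 75, so output 1 (apple)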
Q6. Explain the concept of Linear and Non-linear Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised machine learning algorithm used for
classification and regression tasks. It is particularly powerful for binary classification problems.
Linear SVM:
Definition:
A Linear SVM is used when the data is linearly separable, meaning that the two classes can be
separated by a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions).
Key Concepts:
It finds the optimal hyperplane that best separates the two classes.
The goal is to maximize the margin between the two classes. The margin is the distance
between the hyperplane and the closest data points (called support vectors).
Equation of Hyperplane:
w⋅x+b=0
Where:
w = weight vector
x = input vector
b = bias
When to Use:
Use a Linear SVM when the data is (approximately) linearly separable, or when the number of features is very large relative to the number of samples, where a linear boundary usually suffices.
Non-linear SVM:
Definition:
A Non-linear SVM is used when the data is not linearly separable. In such cases, a linear
hyperplane cannot effectively separate the data.
Key Concepts:
It applies the kernel trick: kernel functions (such as the polynomial and RBF kernels) implicitly map the input data into a higher-dimensional space where a linear separator exists.
Example:
For example, XOR data is not linearly separable. A non-linear SVM with an RBF kernel can
successfully classify it.
Comparison Table:
Feature | Linear SVM | Non-linear SVM
Decision boundary | Straight line / hyperplane | Curved boundary via kernels
Kernel function | Not required (linear kernel) | Required (polynomial, RBF, etc.)
Best suited for | Linearly separable data | Non-linearly separable data
Applications:
Spam filtering and text classification, image classification, and medical diagnosis such as the benign/malignant tumor prediction shown earlier.
Conclusion:
SVM is a powerful tool for classification. When data is linearly separable, a Linear SVM works
efficiently. However, for more complex patterns, a Non-linear SVM with kernel functions
transforms the data to achieve accurate classification.
UNIT-5
1)Multi-Layer Perceptron (MLP) consists of fully connected dense layers that transform input
data from one dimension to another. It is called multi-layer because it contains an input layer, one
or more hidden layers and an output layer. The purpose of an MLP is to model complex
relationships between inputs and outputs.
Components of Multi-Layer Perceptron (MLP)
Input Layer: Each neuron or node in this layer corresponds to an input feature. For instance, if
you have three input features the input layer will have three neurons.
Hidden Layers: MLP can have any number of hidden layers with each layer containing any
number of nodes. These layers process the information received from the input layer.
Output Layer: The output layer generates the final prediction or result. If there are multiple
outputs, the output layer will have a corresponding number of neurons.
In the usual MLP diagram, every connection represents the fully connected nature of the network: every node in one layer connects to every node in the next layer. As the data moves through the network, each layer transforms it until the final output is generated in the output layer.
Working of Multi-Layer Perceptron
Let's see the working of the multi-layer perceptron through its key mechanisms: forward propagation, the loss function, backpropagation and optimization.
1. Forward Propagation
In forward propagation the data flows from the input layer to the output layer, passing through
any hidden layers. Each neuron in the hidden layers processes the input as follows:
1. Weighted Sum: The neuron computes the weighted sum of the inputs:
z=∑iwixi+b
Where:
xi is the input feature.
wi is the corresponding weight.
b is the bias term.
2. Activation Function: The weighted sum z is passed through an activation function to introduce non-linearity. Common activation functions include:
Sigmoid: σ(z) = 1 / (1 + e⁻ᶻ)
ReLU (Rectified Linear Unit): f(z) = max(0, z)
Tanh (Hyperbolic Tangent): tanh(z) = 2 / (1 + e⁻²ᶻ) − 1
2. Loss Function
Once the network generates an output the next step is to calculate the loss using a loss function. In
supervised learning this compares the predicted output to the actual label.
For a classification problem the commonly used binary cross-entropy loss function is:
L = −(1/N) ∑ᵢ₌₁ᴺ [yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ)]
Where:
yᵢ is the actual label.
ŷᵢ is the predicted label.
N is the number of samples.
For regression problems the mean squared error (MSE) is often used:
MSE = (1/N) ∑ᵢ₌₁ᴺ (yᵢ − ŷᵢ)²
3. Backpropagation
The goal of training an MLP is to minimize the loss function by adjusting the network's weights
and biases. This is achieved through backpropagation:
1. Gradient Calculation: The gradients of the loss function with respect to each weight and bias
are calculated using the chain rule of calculus.
2. Error Propagation: The error is propagated back through the network, layer by layer.
3. Gradient Descent: The network updates the weights and biases by moving in the opposite
direction of the gradient to reduce the loss: w=w−η⋅ ∂L/∂w
Where:
w is the weight.
η is the learning rate.
∂L/∂w is the gradient of the loss function with respect to the weight.
4. Optimization
MLPs rely on optimization algorithms to iteratively refine the weights and biases during training.
Popular optimization methods include:
Stochastic Gradient Descent (SGD): Updates the weights based on a single sample or a small
batch of data: w=w−η⋅∂L/∂w
Adam Optimizer: An extension of SGD that incorporates momentum and adaptive learning
rates for more efficient training:
mₜ = β₁·mₜ₋₁ + (1 − β₁)·gₜ
vₜ = β₂·vₜ₋₁ + (1 − β₂)·gₜ²
Here gₜ represents the gradient at time t, and β₁, β₂ are decay rates.
Now that we are done with the theory part of the multi-layer perceptron, let's go ahead and implement the code in Python using the TensorFlow library.
Implementing Multi Layer Perceptron
In this section, we will walk through building a neural network using TensorFlow.
First we import the necessary libraries such as TensorFlow, NumPy and Matplotlib for visualizing the data. We also load the MNIST dataset.
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
Next we normalize the image data by dividing by 255 (since pixel values range from 0 to 255), which helps in faster convergence during training.
gray_scale = 255
x_train = x_train / gray_scale
x_test = x_test / gray_scale
3. Visualizing Data
To understand the data better we plot the first 100 training samples each representing a digit.
fig, ax = plt.subplots(10, 10)
k=0
for i in range(10):
for j in range(10):
ax[i][j].imshow(x_train[k].reshape(28, 28), aspect='auto')
k += 1
plt.show()
Output:
(Figure: grid of the first 100 MNIST training digits)
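To complete the example, here is a minimal sketch of building and training the MLP itself (the layer sizes, epoch count and batch size are illustrative choices):
# Sequential MLP: flatten the 28x28 image, two hidden ReLU layers, softmax output
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(256, activation='relu'),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax'),   # one output neuron per digit class
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10, batch_size=2000, validation_split=0.2)
print("Test accuracy:", model.evaluate(x_test, y_test)[1])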
Q2. Discuss the process of training an RBF (Radial Basis Function) network
A Radial Basis Function (RBF) network is a type of artificial neural network used for function
approximation, pattern recognition, and classification. It consists of three layers: input layer,
hidden layer with RBF neurons, and output layer.
Training an RBF network involves determining the parameters of the hidden and output layers.
The process is typically divided into two main stages.
Structure of an RBF Network:
1. Input Layer:
Passes input features to the next layer without any computation.
2. Hidden Layer:
Contains neurons with RBF activation functions (typically Gaussian). Each neuron
computes the distance between the input and a center point.
3. Output Layer:
Performs a weighted sum of the hidden layer outputs (usually linear).
Stage 1 – Hidden-layer parameters: select the centers cᵢ of the radial basis functions (for example by k-means clustering or by sampling from the training data) and their spreads (also called widths or standard deviations). A common heuristic for the spread is:
σ = d_max / √(2K)
where d_max is the maximum distance between the chosen centers and K is the number of centers. Each hidden neuron then computes:
hᵢ(x) = exp(−‖x − cᵢ‖² / (2σ²))
Stage 2 – Output weights: with the hidden activations fixed, the output weights are found by solving the linear system
H · W = Y
where H is the matrix of hidden-layer outputs for all training samples, W is the weight matrix of the output layer, and Y is the matrix of target outputs; this can be solved directly by least squares (for example with the pseudo-inverse, W = H⁺Y).
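A minimal NumPy sketch of this two-stage procedure (the toy 1D dataset, the number of centers and the random center selection are illustrative assumptions):
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))          # toy 1D inputs
Y = np.sin(X)                                   # target function to approximate

# Stage 1: pick K centers from the data and a shared spread
K = 10
centers = X[rng.choice(len(X), K, replace=False)]
d_max = np.max(np.abs(centers - centers.T))
sigma = d_max / np.sqrt(2 * K)

def hidden(X):
    # Gaussian RBF activations: h_i(x) = exp(-||x - c_i||^2 / (2 sigma^2))
    d2 = (X - centers.T) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

# Stage 2: solve H W = Y for the output weights by least squares
H = hidden(X)
W, *_ = np.linalg.lstsq(H, Y, rcond=None)

print("training MSE:", np.mean((H @ W - Y) ** 2))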
Applications:
Function approximation
Time-series prediction
Medical diagnosis
Image and speech recognition
Conclusion:
Training an RBF network involves two key steps: selecting centers and spreads for the hidden layer,
and computing the output weights using linear regression. This two-stage process makes RBF
networks efficient and effective for a variety of learning tasks.
3)Decision Trees: Construction and Strengths
A Decision Tree is a popular supervised learning algorithm used for classification and regression
tasks. It models decisions and their possible consequences in the form of a tree structure. Each
internal node represents a feature test, each branch represents an outcome, and each leaf node
represents a class label or value.
Decision trees can be used for classification (CART) and regression (Regression Trees)
problems.
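As a concrete illustration of construction, here is a minimal sketch using scikit-learn's CART-based DecisionTreeClassifier on the iris dataset (the depth limit is an arbitrary choice):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# CART-style tree: at each node, choose the feature/threshold split
# that most reduces Gini impurity
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned feature tests at each internal node
print(export_text(tree, feature_names=iris.feature_names))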
Strengths of the decision tree learning approach include:
Non-linear Relationships: capable of modeling non-linear relationships between input variables and target outcomes.
Automatic Feature Selection: the algorithm automatically selects the most informative features at each node during training.
Handling Missing Data: some decision tree implementations can handle missing data by assigning probabilities to possible outcomes.
Low Computational Cost: decision trees have low computational complexity, making them efficient for both training and inference.
Ensemble Compatibility: they can be used in ensemble methods like Random Forests and Gradient Boosted Trees to improve accuracy and reduce overfitting.
No Assumption of Data Distribution: decision trees are non-parametric models, meaning they do not assume any prior distribution of the data.
Conclusion:
Decision tree learning is a powerful and versatile approach that offers interpretability, fast
training, and strong performance on a wide range of problems.
6)How the CART Algorithm Differs from ID3 and C4.5
Decision trees are a popular machine-learning technique used for both classification and regression
tasks. Several algorithms are available for building decision trees, each with its unique approach to
splitting nodes and managing complexity. The most commonly used algorithms include CART
(Classification and Regression Trees), ID3 (Iterative Dichotomiser 3), C4.5, and C5.0. These vary
primarily in how they choose where to split the data and how they handle different data types.
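To make the splitting criteria concrete before comparing the algorithms, here is a small sketch (the class counts are made up) computing the Gini impurity used by CART and the entropy that underlies ID3's information gain:
import numpy as np

def gini(counts):
    # Gini impurity: 1 - sum(p_k^2), used by CART
    p = np.asarray(counts) / np.sum(counts)
    return 1 - np.sum(p ** 2)

def entropy(counts):
    # Entropy: -sum(p_k log2 p_k), the basis of ID3's information gain
    p = np.asarray(counts) / np.sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A node with 40 samples of class A and 10 of class B
print(gini([40, 10]))     # 0.32
print(entropy([40, 10]))  # ~0.722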
CART (Classification and Regression Trees)
Overview
Type of Tree: CART produces binary trees, meaning each node splits into two child nodes. It
can handle both classification and regression tasks.
Splitting Criterion: Uses Gini impurity for classification and mean squared error for regression
to choose the best split.
Complexity and Performance
Handling of Data: Capable of handling both numerical and categorical data but converts
categorical features into binary splits.
Performance: Generally provides a good balance between accuracy and computational efficiency, making it suitable for various applications.
ID3 (Iterative Dichotomiser 3)
Overview
Type of Tree: Generates a tree where each node can have two or more child nodes. It is designed
primarily for classification tasks.
Splitting Criterion: Uses information gain, based on entropy, to select the optimal split.
Complexity and Performance
Handling of Data: Primarily handles categorical data and does not inherently support numerical
features without binning.
Performance: While simple and intuitive, it is prone to overfitting, especially with many
categorical features.
C4.5 and C5.0
C4.5 Overview
Improvement Over ID3: Extends ID3 by handling both discrete and continuous features,
dealing with missing values, and pruning the tree after building to avoid overfitting.
Splitting Criterion: Uses gain ratio, which normalizes the information gain, to choose splits,
attempting to solve the bias toward attributes with a large number of values present in ID3.
C4.5 Complexity and Performance
Handling of Data: Efficiently handles both types of data and missing values.
Performance: More complex than ID3 but generally provides better accuracy and less
susceptibility to overfitting due to its pruning stage.
C5.0 Overview
Type of Tree: An extension of C4.5, proprietary, optimized for speed and memory use, and
includes enhancements like boosting.
Splitting Criterion: Similar to C4.5 but includes mechanisms to boost weak classifiers.
C5.0 Complexity and Performance
Handling of Data: Handles large datasets efficiently and supports both categorical and numerical
data.
Performance: Typically outperforms C4.5 in terms of both speed and memory usage, often
producing more accurate models due to the incorporation of boosting techniques.
Conclusion
Each decision tree algorithm has its strengths and weaknesses, often tailored to specific types of data
or applications. CART is widely used due to its simplicity and effectiveness for diverse tasks, while
C4.5 and C5.0 offer advanced features that handle complexity better and reduce overfitting. ID3,
while less commonly used today, laid the groundwork for more advanced tree algorithms. The choice
of algorithm often depends on the specific needs of the task, including the nature of the data and the
computational resources available.
4)How Backpropagation Works in Training Neural Networks
Backpropagation works iteratively to adjust weights and biases to minimize the cost function. In each epoch the
model adapts these parameters by reducing loss by following the error gradient. It often uses
optimization algorithms like gradient descent or stochastic gradient descent. The algorithm
computes the gradient using the chain rule from calculus allowing it to effectively navigate
complex layers in the neural network to minimize the cost function.
Back Propagation plays a critical role in how neural networks improve over time. Here's why:
1. Efficient Weight Update: It computes the gradient of the loss function with respect to
each weight using the chain rule making it possible to update weights efficiently.
2. Scalability: The Back Propagation algorithm scales well to networks with multiple layers
and complex architectures making deep learning feasible.
3. Automated Learning: With Back Propagation the learning process becomes automated
and the model can adjust itself to optimize its performance.
Working of Back Propagation Algorithm
The Back Propagation algorithm involves two main steps: the Forward Pass and the Backward
Pass.
In forward pass the input data is fed into the input layer. These inputs combined with their
respective weights are passed to hidden layers. For example in a network with two hidden
layers (h1 and h2) the output from h1 serves as the input to h2. Before applying an activation
function, a bias is added to the weighted inputs.
Each hidden layer computes the weighted sum (`a`) of the inputs then applies an activation
function like ReLU ( Rectified Linear Unit) to obtain the output (`o`). The output is passed to
the next layer where an activation function such as softmax converts the weighted outputs into
probabilities for classification.
In the backward pass, the error (the difference between the predicted and actual output) is propagated back through the network to adjust the weights and biases. One common method for error calculation is the Mean Squared Error (MSE), given by:
MSE = (1/N) ∑ᵢ₌₁ᴺ (yᵢ − ŷᵢ)²
Once the error is calculated the network adjusts weights using gradients which are computed
with the chain rule. These gradients indicate how much each weight and bias should be adjusted to
minimize the error in the next iteration. The backward pass continues layer by layer ensuring that
the network learns and improves its performance.
The activation function through its derivative plays a crucial role in computing these
gradients during Back Propagation.
Forward Propagation
1. Initial Calculation
The weighted sum at each node is computed as:
aⱼ = ∑ᵢ (wᵢ,ⱼ · xᵢ)
Where:
aⱼ is the weighted sum of all the inputs and weights at each node,
wᵢ,ⱼ represents the weight between the i-th input and the j-th neuron,
xᵢ represents the value of the i-th input.
O (output): after applying the activation function to aⱼ, we get the output of the neuron:
oⱼ = activation function(aⱼ)
2. Sigmoid Function
The sigmoid function returns a value between 0 and 1, introducing non-linearity into the model:
yⱼ = 1 / (1 + e^(−aⱼ))
3. Computing Outputs
At node h1:
a1 = (w₁,₁ · x₁) + (w₂,₁ · x₂) = 0.21
Once we have calculated a1, we can find the output y3:
y3 = F(a1) = 1 / (1 + e^(−0.21)) ≈ 0.56
Similarly, the weighted sum at node h2 works out to a2 = 0.315, so:
y4 = F(0.315) = 1 / (1 + e^(−0.315)) ≈ 0.58
At the output node, the weighted sum is a3 = 0.702, so:
y5 = F(0.702) = 1 / (1 + e^(−0.702)) ≈ 0.67
4. Error Calculation
Our target output is 0.5 but we obtained 0.67. The error is:
Errorⱼ = y_target − y5 = 0.5 − 0.67 = −0.17
Back Propagation
1. Calculating Gradients
The weight changes are computed as:
Δwᵢⱼ = η × δⱼ × Oⱼ
Where:
η is the learning rate,
δⱼ is the error term of neuron j, obtained from the propagated error and the derivative of the activation function,
Oⱼ is the output of the neuron feeding into the weight being updated.
Computing δ first for the output neuron (O3) and then for the hidden neurons (h1, h2) and applying the updates gives the new weights:
w₁,₂(new) = 0.273225
w₁,₃(new) = 0.086615
w₂,₁(new) = 0.269445
w₂,₂(new) = 0.18534
2. Repeating the Forward Pass
Running the forward pass again with the updated weights gives:
y4 = 0.56
y5 = 0.61
Since y5 = 0.61 is still not the target output, the error is calculated again:
Error = y_target − y5
and the process of backpropagating continues until the desired output is reached. This demonstrates how backpropagation iteratively updates the weights, minimizing the error until the network accurately predicts the output.
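The same loop, expressed as a minimal NumPy sketch for a tiny 2-2-1 sigmoid network (the inputs, target, initial weights and learning rate are illustrative values, not those of the worked example above):
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

rng = np.random.default_rng(42)
x = np.array([0.35, 0.70])        # two inputs
target = 0.5                      # desired output
W1 = rng.uniform(size=(2, 2))     # input -> hidden weights
W2 = rng.uniform(size=(2,))       # hidden -> output weights
eta = 0.5                         # learning rate

for epoch in range(5000):
    # Forward pass
    h = sigmoid(x @ W1)           # hidden activations (y3, y4)
    y5 = sigmoid(h @ W2)          # network output

    # Backward pass: delta = error term * derivative of sigmoid
    delta_out = (y5 - target) * y5 * (1 - y5)
    delta_hidden = delta_out * W2 * h * (1 - h)

    # Gradient-descent weight updates: w -= eta * delta * input to the weight
    W2 -= eta * delta_out * h
    W1 -= eta * np.outer(x, delta_hidden)

print(round(float(y5), 3))        # converges toward the target 0.5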