II Sem Material ML

Uploaded by

Magam Vijitha
Copyright
© © All Rights Reserved

MACHINE LEARNING

UNIT-I
1. Explain the concept of structured data and unstructured data in machine learning.
2. Explain the concept of well-posed problems in the context of machine learning.
3. Discuss the role of basic linear algebra in machine learning techniques.
4. List and briefly describe four forms of learning in Machine learning.
5. Provide examples of how machine learning and data mining are applied in practice.
6. Illustrate how feature engineering plays a role in productive machine learning.
UNIT-2
1. Discuss in detail about Occam’s Razor.
2. Explain the overfitting and computational complexity issues associated with dimensionality
problems.
3. Explain how classification metrics like precision, recall, and F1-score are evaluated.
4. Describe how supervised learning differs from unsupervised learning.
5. Discuss heuristic search in inductive learning, focussing on strategies to avoid overfitting.
6. Explain the concept of bias-variance trade-off in detail.
UNIT-3
1. Differentiate between Linear regression and Logistic regression using appropriate example.
2. Describe how the K-Nearest Neighbor algorithm works with an example.
3. What is Fisher’s Linear discriminant used for in machine learning?
4. How does Bayesian reasoning support probabilistic inference?
5. What role does inferential statistical analysis play in machine learning?
6. Explain the concept of Logistic regression in Classification tasks.
UNIT-4
1. Explain different types of activation functions used for neural network training.
2. What is the motivation behind using neural networks for learning the concept? Explain
briefly.
3. Define SVM and further explain the maximum margin linear separator concept.
4. What is the Architecture of a simple neural network?
5. Explain about perceptron in detail.
6. Explain the concept of linear and non-linear support vector machine.
UNIT-5
1. Explain the multilayer perceptron model in detail with neat diagram.
2. Discuss the process of training an RBF network.
3. How are decision trees constructed for classification tasks?
4. How does backpropagation work in training neural networks?
5. Discuss the strengths of the decision tree learning approach.
6. How does the CART algorithm differ from ID3 and C4.5?

MACHINE LEARNING

UNIT-I

1. Explain the concept of structured data and unstructured data in machine learning.

Structured Data
Structured data is factual data organized in a predefined format, which makes it quantitative and
easy to search and manipulate. It mainly consists of quantifiable values such as numbers, dates and
times, typically stored in tables with rows and columns, much like an Excel file or a Google Sheets
spreadsheet. SQL, developed at IBM in the 1970s, is the language most commonly used to manage
the data in structured databases and data warehouses. Typical applications of structured data
include booking and flight details in airline sales transactions and the management of stock in
a business.
Uses of structured data
 Financial Transactions: Financial systems rely on structured data for transactions, accounts
and report generation. For instance, for a personal bank account, structured data is used to
store customer details, transaction logs and balance sheets.
 Inventory Management: Businesses use structured data to keep records of shelf stock,
stock flow, and other aspects of supply chain management. These databases help maintain
accurate records of products, quantities, and locations.
 Customer Relationship Management (CRM): Companies use structured data within CRM
systems to organize communications with customers and to document sales processes and
customer preferences. Client details, purchasing patterns, and interaction records are stored
in an organized database to support marketing and client service.
 Human Resources (HR) Management: HR departments use structured data to manage employee
details, monitor attendance, and process payroll. Human capital management databases hold
employee records, performance appraisal information, and benefits administration data.
 E-commerce Transactions: Online businesses use structured data to process payments, manage
inventory, and deliver orders for goods and services sold online. Structured databases make
it straightforward to update inventory status and to track payments and customers’ shopping
preferences.
Unstructured Data
Unstructured data comes in many file formats, such as log files, audio, images, and other raw
data that follow no predefined structural pattern. It poses a major challenge to organizations
because its lack of structure makes it difficult to extract value from it. Managing such data
also demands large amounts of storage, and security is a constant concern. Unlike the contents of
most databases, it cannot be described by a data model or schema that would allow it to be
managed, analyzed or searched directly. Whereas structured data carries quantitative information
and is usually held in organized formats such as databases, unstructured data includes
information in textual, image, audio, and video formats, and it is generally qualitative. It is
typically saved in NoSQL databases or non-relational data stores.
Examples of human-generated unstructured data are text files, emails, social media posts, mobile
communication data, and content from business applications. Examples of machine-generated
unstructured data are satellite images, data captured by scientific instruments and sensors,
video surveillance footage, etc.
Uses of Unstructured Data
 Social Media Analysis: Tweets, Facebook posts, Instagram comments, and similar social media
content are unstructured data. Organizations convert this data into structured form to
understand consumer sentiment, trends, and brand perception.
 Image Recognition: Image data is unstructured data used in person identification, object
recognition, and computer-aided medical imaging. Pattern recognition methods analyze images
at the pixel level to solve problems ranging from face recognition to object identification
and disease diagnosis.
 Text Mining: Text from documents, emails, and web pages, whether structured, semi-structured,
or unstructured, is mined to extract key pieces of information, categorize text as positive,
negative, or neutral, and identify topics. Natural language processing (NLP) methods are used
to find patterns, identify keywords, and summarize content.
 Sensor Data Analytics: Real-time data from sensors, IoT devices, and industrial equipment is
collected in unstructured form and later analyzed for performance, bottlenecks, or other
issues. Time-series readings from sensors help in understanding the state of the environment,
the condition of operational equipment, and the overall manufacturing process.
 Video Surveillance: Raw video data is used in video monitoring for security, observing
behaviour patterns and identifying incidents. Video analytics algorithms, including facial
recognition, run on video feeds to detect motion, identify objects or threats, and notify
security staff.
Difference between structured data and unstructured data
Parameters | Structured data | Unstructured data
Format | Organized, typically in tables or databases | No predefined format, lacks structure
Schema | Follows a predefined schema | No fixed schema, flexible
Storage | Easily stored in databases or spreadsheets | Requires specialized storage solutions
Retrieval | Simple and straightforward | Often requires advanced search algorithms
Analysis | Well-suited for quantitative analysis | Requires specialized techniques (NLP, etc.)
Processing | Easily processed using traditional methods | Requires advanced processing techniques
Complexity | Low complexity due to structured format | High complexity due to lack of structure
Size | Typically smaller in size | Can be larger due to multimedia content
Insights Extraction | Straightforward extraction of insights | Requires sophisticated analysis methods
Database Management | Easily managed with traditional DBMS | May require NoSQL or other specialized DBMS
Searchability | Highly searchable using SQL queries | Less searchable, often relies on metadata
Examples | Databases, spreadsheets, CSV files | Text documents, emails, images
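
The contrast above can be sketched in a few lines of Python. This is only an illustration: the
`bookings` table, its columns, and the email text are made-up assumptions, not data from these
notes. The structured rows are queried directly with SQL, while the free-text email has to be
parsed before any value can be pulled out of it.

```python
import re
import sqlite3

# Structured data: rows and columns with a fixed schema, queried with SQL.
# The table name, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (passenger TEXT, flight TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO bookings VALUES (?, ?, ?)",
    [("Asha", "AI101", 120.0), ("Ravi", "AI101", 95.5), ("Mina", "BA202", 300.0)],
)
total_fare = conn.execute(
    "SELECT SUM(fare) FROM bookings WHERE flight = ?", ("AI101",)
).fetchone()[0]

# Unstructured data: free text with no schema, so values must be extracted.
email = "Hi, I paid 120 for flight AI101 but was charged 135 at the gate."
# Match standalone numbers only; the digits inside "AI101" are skipped.
amounts = [float(x) for x in re.findall(r"\b\d+(?:\.\d+)?\b", email)]

print(total_fare)  # sum of fares on flight AI101
print(amounts)     # amounts mentioned in the email text
```

The SQL query works because the schema says where the fare lives; the regular expression is a
guess about how amounts appear in text and would need rework for other phrasing.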

2) Explain the concept of well-posed problems in the context of machine learning.

Well-Posed Learning Problem - A computer program is said to learn from experience E with
respect to some task T and some performance measure P if its performance on T, as measured
by P, improves with experience E.
A problem can be classified as a well-posed learning problem if it has three traits -
 Task
 Performance Measure
 Experience
Some examples that illustrate well-posed learning problems are -
1. An Email Spam Filtering Problem
 Task - classifying emails as spam or not spam
 Performance Measure - the fraction of emails accurately classified as spam or not spam
 Experience - observing you label emails as spam or not spam
2. A Checkers Learning Problem
 Task - playing the game of checkers
 Performance Measure - percent of games won against opponents
 Experience - playing practice games against itself
3. Handwriting Recognition Problem
 Task - recognizing and classifying handwritten words within images
 Performance Measure - percent of words accurately classified
 Experience - a database of handwritten words with given classifications
4. A Robot Driving Problem
 Task - driving on public four-lane highways using vision sensors
 Performance Measure - average distance traveled before an error
 Experience - a sequence of images and steering commands recorded while observing a human
driver
5. Fruit Prediction Problem
 Task - recognizing different kinds of fruit in images
 Performance Measure - percent of fruit images correctly identified
 Experience - training the machine on a large dataset of fruit images
6. Face Recognition Problem
 Task - recognizing different faces in images
 Performance Measure - percent of faces correctly recognized
 Experience - training the machine on a large dataset of different face images
7. Automatic Translation of Documents
 Task - translating a document from one language into another
 Performance Measure - how accurately the text is converted from one language to the other
 Experience - training the machine on a large dataset of text in different languages
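
As a rough illustration of the T, P, E definition, the toy spam filter below measures P (accuracy
on a fixed test set) after different amounts of E (labeled messages). The messages, the
word-scoring learner, and the function names are all assumptions made up for this sketch; real
spam filters are far more elaborate.

```python
from collections import Counter

# Experience E: labeled messages (1 = spam, 0 = not spam). All data is made up.
EXPERIENCE = [
    ("win cash now", 1),
    ("meeting at noon", 0),
    ("free prize win", 1),
    ("lunch at noon", 0),
    ("free cash offer", 1),
    ("project meeting notes", 0),
]

# Fixed test set used by the performance measure P.
TEST_SET = [
    ("win free cash", 1),
    ("noon meeting", 0),
    ("free offer", 1),
    ("project notes", 0),
    ("cash prize", 1),
    ("lunch meeting", 0),
]

def train(examples):
    """Learn a score per word: +1 per spam occurrence, -1 per non-spam occurrence."""
    scores = Counter()
    for text, label in examples:
        for word in text.split():
            scores[word] += 1 if label == 1 else -1
    return scores

def classify(scores, text):
    """Task T: label a message as spam (1) when its total word score is positive."""
    return 1 if sum(scores[w] for w in text.split()) > 0 else 0

def accuracy_after(n):
    """Performance P: fraction of test messages classified correctly after n examples."""
    scores = train(EXPERIENCE[:n])
    correct = sum(classify(scores, text) == label for text, label in TEST_SET)
    return correct / len(TEST_SET)

for n in (0, 2, 6):
    print(n, accuracy_after(n))
```

On this toy data the accuracy rises as more labeled messages are seen, which is exactly what the
definition requires of a program that "learns".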

4) List and briefly describe four forms of learning in Machine learning.

Types of Machine Learning


There are several types of machine learning, each with special characteristics and applications.
Some of the main types of machine learning algorithms are as follows:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Reinforcement Learning
Additionally, there is a more specific category called semi-supervised learning, which combines
elements of both supervised and unsupervised learning.
1. Supervised Machine Learning
In supervised learning, a model is trained on a "labelled dataset", in which every example has
both input features and the corresponding output. The algorithm learns a mapping from inputs to
the correct outputs. Both the training and the validation datasets are labelled.

[Figure: Supervised Learning]

Let's understand it with the help of an example.


Example: Consider a scenario where you have to build an image classifier to differentiate between
cats and dogs. If you feed labelled images of dogs and cats to the algorithm, the machine learns
to tell a dog from a cat using these labelled images. When we input new dog or cat images that it
has never seen before, it uses the learned model to predict whether each one is a dog or a cat.
This is how supervised learning works; this particular task is an example of image
classification.
There are two main categories of supervised learning that are mentioned below:
 Classification
 Regression
Classification
Classification deals with predicting categorical target variables, which represent discrete classes or
labels. For instance, classifying emails as spam or not spam, or predicting whether a patient has a
high risk of heart disease. Classification algorithms learn to map the input features to one of the
predefined classes.
Here are some classification algorithms:
 Logistic Regression
 Support Vector Machine
 Random Forest
 Decision Tree
 K-Nearest Neighbors (KNN)
 Naive Bayes
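
To make classification concrete, here is a minimal K-Nearest Neighbors sketch in plain Python.
The two features (number of links and number of exclamation marks per email) and the data points
are hypothetical; the point is only that the output is one of a fixed set of class labels.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(p, query), label) for p, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical features: (number of links, number of exclamation marks) per email.
X = [(0, 0), (1, 0), (0, 1), (6, 5), (7, 4), (5, 6)]
y = ["not spam", "not spam", "not spam", "spam", "spam", "spam"]

prediction_a = knn_predict(X, y, (6, 6))  # close to the spam examples
prediction_b = knn_predict(X, y, (1, 1))  # close to the non-spam examples
print(prediction_a, prediction_b)
```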
Regression
Regression, on the other hand, deals with predicting continuous target variables, which represent
numerical values. For example, predicting the price of a house based on its size, location, and
amenities, or forecasting the sales of a product. Regression algorithms learn to map the input
features to a continuous numerical value.
Here are some regression algorithms:
 Linear Regression
 Polynomial Regression
 Ridge Regression
 Lasso Regression
 Decision tree
 Random Forest
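
For comparison with classification, the sketch below fits a simple linear regression by ordinary
least squares in plain Python. The house sizes and prices are invented so that price = 3 * size +
10 holds exactly; the output is a continuous number rather than a class label.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: house size vs price in arbitrary units (price = 3 * size + 10).
sizes = [5, 10, 15, 20]
prices = [25, 40, 55, 70]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 12 + intercept  # prediction for an unseen size of 12
print(slope, intercept, predicted_price)
```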
Advantages of Supervised Machine Learning
 Supervised learning models can achieve high accuracy because they are trained on labelled data.
 The decision-making process of simpler supervised models, such as decision trees and linear
models, is often interpretable.
 Pre-trained supervised models can often be reused, which saves time and resources compared
with developing new models from scratch.
Disadvantages of Supervised Machine Learning
 It is limited to the patterns present in the training data and may struggle with unseen or
unexpected patterns.
 It can be time-consuming and costly because it relies on labeled data.
 It may generalize poorly to new data, for example when the model overfits the training set.
Applications of Supervised Learning
Supervised learning is used in a wide variety of applications, including:
 Image classification: Identify objects, faces, and other features in images.
 Natural language processing: Extract information from text, such as sentiment, entities, and
relationships.
 Speech recognition: Convert spoken language into text.
 Recommendation systems: Make personalized recommendations to users.
 Predictive analytics: Predict outcomes, such as sales, customer churn, and stock prices.
 Medical diagnosis: Detect diseases and other medical conditions.
 Fraud detection: Identify fraudulent transactions.
 Autonomous vehicles: Recognize and respond to objects in the environment.
 Email spam detection: Classify emails as spam or not spam.
 Quality control in manufacturing: Inspect products for defects.
 Credit scoring: Assess the risk of a borrower defaulting on a loan.
 Gaming: Recognize characters, analyze player behavior, and create NPCs.
 Customer support: Automate customer support tasks.
 Weather forecasting: Make predictions for temperature, precipitation, and other
meteorological parameters.
 Sports analytics: Analyze player performance, make game predictions, and optimize
strategies.
2. Unsupervised Machine Learning
Unsupervised learning is a type of machine learning technique in which an
algorithm discovers patterns and relationships using unlabeled data. Unlike supervised learning,
unsupervised learning doesn't involve providing the algorithm with labeled target outputs. The
primary goal of Unsupervised learning is often to discover hidden patterns, similarities, or clusters
within the data, which can then be used for various purposes, such as data exploration,
visualization, dimensionality reduction, and more.

[Figure: Unsupervised Learning]

Let's understand it with the help of an example.


Example: Consider a dataset that contains information about the purchases customers have made
at a shop. Through clustering, the algorithm can group customers with similar purchasing
behavior, which reveals customer segments without any predefined labels. This type of
information can help businesses target customers as well as identify outliers.
There are two main categories of unsupervised learning that are mentioned below:
 Clustering
 Association
Clustering
Clustering is the process of grouping data points into clusters based on their similarity. This
technique is useful for identifying patterns and relationships in data without the need for labeled
examples.
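Here are some clustering algorithms:
 K-Means Clustering algorithm
 Mean-shift algorithm
 DBSCAN Algorithm
Related unsupervised techniques used for dimensionality reduction and feature extraction rather
than clustering:
 Principal Component Analysis
 Independent Component Analysis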
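
As a sketch of how K-Means groups unlabeled points, the code below alternates an assignment step
and a centroid update step. The customer data (visits per month, average spend) is made up, and
using the first k points as initial centroids is a simplification chosen to keep the example
deterministic; practical implementations usually initialize more carefully.

```python
import math

def kmeans(points, k, iterations=10):
    """Alternate assigning points to the nearest centroid and recomputing centroids."""
    centroids = list(points[:k])  # simplification: first k points as initial centroids
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (visits per month, average spend).
customers = [(1, 2), (9, 8), (2, 1), (8, 9), (1, 1), (9, 9)]
centroids, clusters = kmeans(customers, k=2)
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the two groups emerge purely from distances between points.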
Association
Association rule learning is a technique for discovering relationships between items in a dataset. It
identifies rules stating that the presence of one item implies the presence of another item with a
specific probability.
Here are some association rule learning algorithms:
 Apriori Algorithm
 Eclat
 FP-growth Algorithm
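
The "specific probability" attached to a rule is usually expressed through support and
confidence. The sketch below computes both for a rule such as {bread} -> {milk} on a handful of
invented shopping baskets; it does not implement Apriori, Eclat, or FP-growth themselves, which
are methods for searching the space of such rules efficiently.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent): support(both) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"bread", "milk"})          # 3 of 5 baskets
rule_confidence = confidence({"bread"}, {"milk"})  # 3 of the 4 bread baskets
print(rule_support, rule_confidence)
```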
Advantages of Unsupervised Machine Learning
 It helps to discover hidden patterns and relationships in the data.
 It is used for tasks such as customer segmentation, anomaly detection, and data exploration.
 It does not require labeled data, which reduces the effort of data labeling.
 Techniques such as autoencoders and dimensionality reduction can be used to extract
meaningful features from raw data.
Disadvantages of Unsupervised Machine Learning
 Without labels, it may be difficult to assess the quality of the model's output.
 Clusters may be hard to interpret and may not correspond to meaningful groupings.
Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
 Clustering: Group similar data points into clusters.
 Anomaly detection: Identify outliers or anomalies in data.
 Dimensionality reduction: Reduce the dimensionality of data while preserving its essential
information.
 Recommendation systems: Suggest products, movies, or content to users based on their
historical behavior or preferences.
 Topic modeling: Discover latent topics within a collection of documents.
 Density estimation: Estimate the probability density function of data.
 Image and video compression: Reduce the amount of storage required for multimedia content.
 Data preprocessing: Help with data preprocessing tasks such as data cleaning, imputation of
missing values, and data scaling.
 Market basket analysis: Discover associations between products.
 Genomic data analysis: Identify patterns or group genes with similar expression profiles.
 Image segmentation: Segment images into meaningful regions.
 Community detection in social networks: Identify communities or groups of individuals with
similar interests or connections.
 Customer behavior analysis: Uncover patterns and insights for better marketing and product
recommendations.
 Content recommendation: Classify and tag content to make it easier to recommend similar
items to users.
 Exploratory data analysis (EDA): Explore data and gain insights before defining specific
tasks.
3. Reinforcement Machine Learning
A reinforcement learning algorithm learns by interacting with an environment: it takes actions
and observes the resulting rewards or errors. Trial-and-error search and delayed reward are the
most relevant characteristics of reinforcement learning. In this technique, the model keeps
improving its performance by using reward feedback to learn the behavior or pattern. These
algorithms are tailored to a particular problem, e.g. Google's self-driving car, or AlphaGo,
where a bot competes against humans and even against itself to become a better and better Go
player. Each interaction adds to the agent's experience, which serves as its training data, so
the more it interacts, the better trained and more experienced it becomes.
Here are some of the most common reinforcement learning algorithms:
 Q-learning: Q-learning is a model-free RL algorithm that learns a Q-function, which maps
state-action pairs to values. The Q-function estimates the expected cumulative reward of taking
a particular action in a given state.
 SARSA (State-Action-Reward-State-Action): SARSA is another model-free RL algorithm
that learns a Q-function. However, unlike Q-learning, SARSA updates the Q-function using the
action actually taken in the next state, rather than the highest-valued action.
 Deep Q-learning: Deep Q-learning is a combination of Q-learning and deep learning. Deep
Q-learning uses a neural network to represent the Q-function, which allows it to learn complex
relationships between states and actions.
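
The difference between the Q-learning and SARSA updates is easiest to see side by side. The
sketch below applies one tabular update of each kind; the state names, action names, learning
rate `alpha`, and discount factor `gamma` are illustrative assumptions. Both follow the usual
form Q(s, a) <- Q(s, a) + alpha * (target - Q(s, a)) and differ only in the target.

```python
def q_learning_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Off-policy update: bootstrap from the best action available in next_state."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

def sarsa_update(q, state, action, reward, next_state, next_action, alpha=0.5, gamma=0.9):
    """On-policy update: bootstrap from the action the agent actually takes next."""
    next_value = q.get((next_state, next_action), 0.0)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * next_value - old)

actions = ["left", "right"]
# Hypothetical starting Q-values for state "s1"; unseen pairs default to 0.
q_table_a = {("s1", "left"): 1.0, ("s1", "right"): 2.0}
q_table_b = dict(q_table_a)

# Same transition for both: from "s0" take "right", receive reward 1, land in "s1".
q_learning_update(q_table_a, "s0", "right", 1.0, "s1", actions)
sarsa_update(q_table_b, "s0", "right", 1.0, "s1", next_action="left")

print(q_table_a[("s0", "right")])  # uses max over s1: 0.5 * (1 + 0.9 * 2.0)
print(q_table_b[("s0", "right")])  # uses the chosen "left": 0.5 * (1 + 0.9 * 1.0)
```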

[Figure: Reinforcement Machine Learning]

Let's understand it with the help of examples.


Example: Consider that you are training an AI agent to play a game like chess. The agent explores
different moves and receives positive or negative feedback based on the outcome. Reinforcement
learning is also applied in settings such as robotics, where agents learn to perform tasks by
interacting with their surroundings.
Types of Reinforcement Machine Learning
There are two main types of reinforcement learning:
Positive reinforcement
 Rewards the agent for taking a desired action.
 Encourages the agent to repeat the behavior.
 Examples: Giving a treat to a dog for sitting, providing a point in a game for a correct answer.
Negative reinforcement
 Removes an undesirable stimulus to encourage a desired behavior.
 Encourages the agent to repeat the behavior that removed the stimulus.
 Examples: Turning off a loud buzzer when a lever is pressed, avoiding a penalty by completing
a task.
Advantages of Reinforcement Machine Learning
 It supports autonomous decision-making and is well suited to tasks that require learning a
sequence of decisions, such as robotics and game-playing.
 It is preferred for achieving long-term results that are otherwise very difficult to achieve.
 It can solve complex problems that cannot be solved by conventional techniques.
Disadvantages of Reinforcement Machine Learning
 Training reinforcement learning agents can be computationally expensive and
time-consuming.
 Reinforcement learning is not well suited to simple problems.
 It needs a lot of data and a lot of computation, which can make it impractical and costly.
Applications of Reinforcement Machine Learning
Here are some applications of reinforcement learning:
 Game Playing: RL can teach agents to play games, even complex ones.
 Robotics: RL can teach robots to perform tasks autonomously.
 Autonomous Vehicles: RL can help self-driving cars navigate and make decisions.
 Recommendation Systems: RL can enhance recommendation algorithms by learning user
preferences.
 Healthcare: RL can be used to optimize treatment plans and drug discovery.
 Natural Language Processing (NLP): RL can be used in dialogue systems and chatbots.
 Finance and Trading: RL can be used for algorithmic trading.
 Supply Chain and Inventory Management: RL can be used to optimize supply chain
operations.
 Energy Management: RL can be used to optimize energy consumption.
 Game AI: RL can be used to create more intelligent and adaptive NPCs in video games.
 Adaptive Personal Assistants: RL can be used to improve personal assistants.
 Virtual Reality (VR) and Augmented Reality (AR): RL can be used to create immersive and
interactive experiences.
 Industrial Control: RL can be used to optimize industrial processes.
 Education: RL can be used to create adaptive learning systems.
 Agriculture: RL can be used to optimize agricultural operations.
Semi-Supervised Learning: Supervised + Unsupervised Learning
Semi-supervised learning is a machine learning approach that sits between supervised and
unsupervised learning, so it uses both labelled and unlabelled data. It is particularly useful
when obtaining labeled data is costly, time-consuming, or resource-intensive, for example when
labeling requires specialist skills.
We use these techniques when a small part of the data is labeled and the remaining large
portion is unlabeled. We can use unsupervised techniques to predict labels and then feed
these labels to supervised techniques. This approach is common with image data sets, where
usually not all images are labeled.

[Figure: Semi-Supervised Learning]

Let's understand it with the help of an example.


Example: Consider building a language translation model. Obtaining labeled translations for
every sentence pair can be resource-intensive. Semi-supervised learning allows the model to
learn from both labeled sentence pairs and unlabeled text, which can make it more accurate. This
technique has led to significant improvements in the quality of machine translation services.
Types of Semi-Supervised Learning Methods
There are a number of different semi-supervised learning methods each with its own
characteristics. Some of the most common ones include:
 Graph-based semi-supervised learning: This approach uses a graph to represent the
relationships between the data points. The graph is then used to propagate labels from the
labeled data points to the unlabeled data points.
 Label propagation: This approach iteratively propagates labels from the labeled data points to
the unlabeled data points, based on the similarities between the data points.
 Co-training: This approach trains two different machine learning models on different views
(feature subsets) of the labeled data. Each model then labels unlabeled examples for the other
model to train on.
 Self-training: This approach trains a machine learning model on the labeled data and then uses
the model to predict labels for the unlabeled data. The model is then retrained on the labeled
data and the predicted labels for the unlabeled data.
 Generative adversarial networks (GANs): GANs are a type of deep learning algorithm that
can be used to generate synthetic data. GANs can be used to generate unlabeled data for semi-
supervised learning by training two neural networks, a generator and a discriminator.
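
Self-training, described in the list above, can be sketched with a very simple learner. The
example below uses a one-dimensional nearest-centroid classifier and made-up measurements; it
performs a single round of pseudo-labeling and accepts every predicted label, whereas practical
self-training usually keeps only confident predictions and repeats the loop.

```python
def centroids(labeled):
    """Mean feature value per class from (value, label) pairs."""
    sums, counts = {}, {}
    for x, label in labeled:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(model, x):
    """Assign x to the class whose centroid is nearest."""
    return min(model, key=lambda label: abs(x - model[label]))

def self_train(labeled, unlabeled):
    """Train on labeled data, pseudo-label the unlabeled data, then retrain on both."""
    model = centroids(labeled)
    pseudo_labeled = [(x, predict(model, x)) for x in unlabeled]
    return centroids(labeled + pseudo_labeled)

# Hypothetical data: two labeled measurements and four unlabeled ones.
labeled = [(1.0, "small"), (9.0, "large")]
unlabeled = [2.0, 3.0, 7.0, 8.0]

model = self_train(labeled, unlabeled)
print(model)
print(predict(model, 4.5))
```

After retraining, the class centroids move toward the unlabeled points, which is how the
unlabeled data influences the final decision boundary.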
Advantages of Semi-Supervised Machine Learning
 It can lead to better generalization than supervised learning alone, as it uses both labeled
and unlabeled data.
 It can be applied to a wide range of data.
Disadvantages of Semi-Supervised Machine Learning
 Semi-supervised methods can be more complex to implement compared to other approaches.
 It still requires some labeled data that might not always be available or easy to obtain.
 Noisy or unrepresentative unlabeled data can degrade model performance.
Applications of Semi-Supervised Learning
Here are some common applications of semi-supervised learning:
 Image Classification and Object Recognition: Improve the accuracy of models by combining
a small set of labeled images with a larger set of unlabeled images.
 Natural Language Processing (NLP): Enhance the performance of language models and
classifiers by combining a small set of labeled text data with a vast amount of unlabeled text.
 Speech Recognition: Improve the accuracy of speech recognition by leveraging a limited
amount of transcribed speech data and a more extensive set of unlabeled audio.
 Recommendation Systems: Improve the accuracy of personalized recommendations by
supplementing a sparse set of user-item interactions (labeled data) with a wealth of unlabeled
user behavior data.
 Healthcare and Medical Imaging: Enhance medical image analysis by utilizing a small set of
labeled medical images alongside a larger set of unlabeled images.
_____________________******_____________________________
5. Provide examples of how machine learning and data mining are applied in practice.

Machine Learning and Data Mining Applications

Machine learning and data mining are widely used across industries to analyze data, identify patterns,
and make intelligent decisions. Below are several practical examples:

1. Fraud Detection (Banking & Finance)

 Machine Learning models analyze transaction patterns to detect unusual activities (e.g.,
sudden large withdrawals).
 Data mining uncovers hidden patterns in historical transaction data to improve fraud rules.

2. Recommendation Systems (E-commerce & Streaming)


 Platforms like Amazon and Netflix use ML to recommend products or movies based on
users’ past behavior.
 Data mining identifies associations between different items (e.g., customers who bought X
also bought Y).

3. Medical Diagnosis (Healthcare)

 ML algorithms assist doctors by predicting diseases from symptoms, lab results, or medical
images.
 Data mining helps discover relationships between patient attributes and diseases in large
medical datasets.

4. Customer Segmentation (Marketing)

 ML clusters customers based on purchasing behavior, demographics, or website interactions.


 Data mining is used to identify patterns in customer data for targeted marketing campaigns.

5. Image and Speech Recognition (AI Applications)

 ML enables systems like Google Photos to recognize faces, and virtual assistants to
understand voice commands.
 Data mining techniques help label and categorize vast image/audio datasets.

6. Predictive Maintenance (Manufacturing)

 Sensors collect real-time data, and ML models predict when a machine is likely to fail.
 Data mining helps discover trends and frequent failure causes from historical logs.

7. Spam and Malware Detection (Cybersecurity)

 ML models are trained to detect spam emails or malicious files based on known threats.
 Data mining identifies new attack patterns from network traffic and logs.

8. Financial Forecasting

 ML is used to predict stock prices or market trends based on historical data.


 Data mining reveals hidden correlations and seasonality in financial datasets.

Conclusion

Machine learning and data mining are crucial for turning raw data into actionable insights. Their
practical applications span multiple industries, improving decision-making, efficiency, and user
experience.

6) What is Feature Engineering?

Feature engineering is the process of turning raw data into useful features that help improve the
performance of machine learning models. It includes choosing, creating and adjusting data attributes
to make the model’s predictions more accurate. The goal is to make the model better by providing
relevant and easy-to-understand information.
A feature or attribute is a measurable property of data that is used as input for machine learning
algorithms. Features can be numerical, categorical or text-based, and they represent the aspects
of the data that are relevant to the problem. For example, in housing price prediction, features
might include the number of bedrooms, location and property age.
[Figure: Feature Engineering Architecture]

Importance of Feature Engineering


Feature engineering can significantly influence model performance. By refining features, we can:
 Improve accuracy: Choosing the right features helps the model learn better, leading to more
accurate predictions.
 Reduce overfitting: Using fewer, more important features helps the model avoid memorizing the
data and perform better on new data.
 Boost interpretability: Well-chosen features make it easier to understand how the model makes
its predictions.
 Enhance efficiency: Focusing on key features speeds up the model’s training and prediction
process, saving time and resources.
Processes Involved in Feature Engineering
Let's look at the processes involved in feature engineering:
Processes involved in Feature Engineering

1. Feature Creation: Feature creation involves generating new features from domain knowledge or
by observing patterns in the data. It can be:
 Domain-specific: Created based on industry knowledge like business rules.
 Data-driven: Derived by recognizing patterns in data.
 Synthetic: Formed by combining existing features.
2. Feature Transformation: Transformation adjusts features to improve model learning:
 Normalization & Scaling: Adjust the range of features for consistency.
 Encoding: Converts categorical data to numerical form, e.g., one-hot encoding.
 Mathematical transformations: Like logarithmic transformations for skewed data.
3. Feature Extraction: Extracting meaningful features can reduce dimensionality and improve
model accuracy:
 Dimensionality reduction: Techniques like PCA reduce features while preserving important
information.
 Aggregation & Combination: Summing or averaging features to simplify the model.
4. Feature Selection: Feature selection involves choosing a subset of relevant features to use:
 Filter methods: Based on statistical measures like correlation.
 Wrapper methods: Select based on model performance.
 Embedded methods: Feature selection integrated within model training.
5. Feature Scaling: Scaling ensures that all features contribute equally to the model:
 Min-Max scaling: Rescales values to a fixed range like 0 to 1.
 Standard scaling: Normalizes to have a mean of 0 and variance of 1.
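Illustration (a minimal sketch, not part of the original notes): the two scaling methods listed above can be written in a few lines of plain Python. The function names min_max_scale and standard_scale and the sample ages are assumptions chosen for this example; libraries such as scikit-learn provide equivalent ready-made scalers.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant feature: nothing to spread out
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standard_scale(values):
    """Shift and scale values to mean 0 and (population) variance 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

ages = [18, 21, 23, 34, 45, 50, 67]   # illustrative values only
scaled_minmax = min_max_scale(ages)
scaled_standard = standard_scale(ages)
print(scaled_minmax)
print(scaled_standard)
```

Min-max scaling keeps the shape of the distribution but is sensitive to outliers, because a single extreme value stretches the range; standard scaling is usually the safer default when outliers are present.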
Steps in Feature Engineering
Feature engineering can vary depending on the specific problem but the general steps are:
1. Data Cleansing: Identify and correct errors or inconsistencies in the dataset to ensure data
quality and reliability.
2. Data Transformation: Transform raw data into a format suitable for modeling including scaling,
normalization and encoding.
3. Feature Extraction: Create new features by combining or deriving information from existing
ones to provide more meaningful input to the model.
4. Feature Selection: Choose the most relevant features for the model using techniques like
correlation analysis, mutual information and stepwise regression.
5. Feature Iteration: Continuously refine features based on model performance by adding,
removing or modifying features for improvement.
Common Techniques in Feature Engineering
1. One-Hot Encoding: One-Hot Encoding converts categorical variables into binary indicators,
allowing them to be used by machine learning models.
import pandas as pd

data = {'Color': ['Red', 'Blue', 'Green', 'Blue']}


df = pd.DataFrame(data)

df_encoded = pd.get_dummies(df, columns=['Color'], prefix='Color')

print(df_encoded)

Output

   Color_Blue  Color_Green  Color_Red
0       False        False       True
1        True        False      False
2       False         True      False
3        True        False      False
2. Binning: Binning transforms continuous variables into discrete bins, making them categorical for
easier analysis.
import pandas as pd

data = {'Age': [23, 45, 18, 34, 67, 50, 21]}


df = pd.DataFrame(data)

bins = [0, 20, 40, 60, 100]


labels = ['0-20', '21-40', '41-60', '61+']

# Bins are closed on the right, (0, 20], (20, 40], ..., so that they match the labels.
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, include_lowest=True)

print(df)

Output

Age Age_Group
0 23 21-40
1 45 41-60
2 18 0-20
3 34 21-40
4 67 61+
5 50 41-60
6 21 21-40
3. Text Data Preprocessing: Involves removing stop-words, stemming and vectorizing text data to
prepare it for machine learning models.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# The stop-word list must be downloaded once before it can be loaded.
nltk.download('stopwords', quiet=True)

texts = ["This is a sample sentence.", "Text data preprocessing is important."]

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
vectorizer = CountVectorizer()

def preprocess_text(text):
    # Split on whitespace, drop stop-words, and stem the remaining words.
    words = text.split()
    words = [stemmer.stem(word)
             for word in words if word.lower() not in stop_words]
    return " ".join(words)

cleaned_texts = [preprocess_text(text) for text in texts]

X = vectorizer.fit_transform(cleaned_texts)

print("Cleaned Texts:", cleaned_texts)


print("Vectorized Text:", X.toarray())
Output:

4. Feature Splitting: Divides a single feature into multiple sub-features, uncovering valuable
insights and improving model performance.
import pandas as pd

data = {'Full_Address': [
    '123 Elm St, Springfield, 12345', '456 Oak Rd, Shelbyville, 67890']}
df = pd.DataFrame(data)

# One capture group per new column: street, city, zip code.
df[['Street', 'City', 'Zipcode']] = df['Full_Address'].str.extract(
    r'([0-9]+\s[\w\s]+),\s([\w\s]+),\s(\d+)')

print(df)

Output

                     Full_Address      Street         City Zipcode
0  123 Elm St, Springfield, 12345  123 Elm St  Springfield   12345
1  456 Oak Rd, Shelbyville, 67890  456 Oak Rd  Shelbyville   67890...
Tools for Feature Engineering
There are several tools available for feature engineering. Here are some popular ones:
 Featuretools: Automates feature engineering by extracting and transforming features from
structured data. It integrates well with libraries like pandas and scikit-learn making it easy to
create complex features without extensive coding.
 TPOT: Uses genetic algorithms to optimize machine learning pipelines, automating feature
selection and model optimization. It visualizes the entire process, helping you identify the best
combination of features and algorithms.
 DataRobot: Automates machine learning workflows including feature engineering, model
selection and optimization. It supports time-dependent and text data and offers collaborative tools
for teams to efficiently work on projects.
 Alteryx: Offers a visual interface for building data workflows, simplifying feature extraction,
transformation and cleaning. It integrates with popular data sources and its drag-and-drop
interface makes it accessible for non-programmers.
 H2O.ai: Provides both automated and manual feature engineering tools for a variety of data
types. It includes features for scaling, imputation and encoding and offers interactive
visualizations to better understand model results.
UNIT-2

Occam’s Razor – 10 Marks Answer

Definition:

Occam’s Razor (also spelled Ockham’s Razor) is a philosophical and problem-solving principle
attributed to the English Franciscan friar and scholastic philosopher William of Ockham (1287–
1347). The principle states:

"Entities should not be multiplied beyond necessity."

In simple terms, the simplest explanation is usually the correct one, or when faced with competing
hypotheses that make the same predictions, the one with the fewest assumptions should be selected.

Key Points:

 Not a rule, but a heuristic: Occam’s Razor is a guiding principle or a methodological
approach, not a law of logic or a scientific theory.
 Focus on simplicity: It promotes simplicity in explanation and theory selection.
 Avoid overcomplication: More variables or assumptions introduce complexity and
uncertainty.
 Used in science, philosophy, and AI: Particularly useful in hypothesis testing, theory
selection, and model optimization (e.g., machine learning).

Example:

Imagine you wake up and see the ground is wet.

 Hypothesis A: It rained last night.


 Hypothesis B: It rained, and a neighbor turned on the sprinkler, and a water balloon fight
happened.

According to Occam’s Razor, Hypothesis A is preferred because it's simpler and requires fewer
assumptions.

Diagram:

                 Problem: Wet Ground
                          |
           -------------------------------
           |                             |
      Hypothesis A                  Hypothesis B
      (Rain only)          (Rain + Sprinkler + Balloons)
           |                             |
   Fewer assumptions              Many assumptions
           |                             |
 Occam’s Razor prefers              Not preferred
       this one

Applications:

1. Science – To choose between competing scientific theories.


2. Medicine – Diagnostic principle: “When you hear hoofbeats, think horses, not zebras.”
3. Artificial Intelligence – Model selection prefers simpler models to avoid overfitting.
4. Philosophy – In metaphysics, to limit unnecessary theoretical entities.

Limitations:

 Simplicity doesn’t guarantee truth.


 Sometimes the more complex explanation is actually correct.
 Must be used alongside empirical evidence.

Conclusion:

Occam’s Razor is a powerful tool for critical thinking and decision-making. It encourages clarity,
logical simplicity, and efficient reasoning. While not always correct, it provides a valuable starting
point for evaluating hypotheses in both academic and real-life scenarios.

2. Overfitting and Computational Complexity Issues in Dimensionality Problems (10 Marks)

Introduction to Dimensionality

In machine learning and data science, dimensionality refers to the number of input variables
(features) in a dataset. When the number of features increases, the dataset is said to have high
dimensionality.

High-dimensional data often causes two major issues:

1. Overfitting
2. Increased Computational Complexity

1. Overfitting in High Dimensionality

Definition:

Overfitting occurs when a model learns the noise or random fluctuations in the training data instead
of the underlying pattern, resulting in poor performance on unseen (test) data.

Why It Happens in High Dimensions:

 With more features, the model has more "freedom" to fit the training data.
 It may capture patterns that do not generalize.
 As the number of features increases, the risk of false correlations also increases.

Example:

If a dataset has only 100 data points but 1,000 features, the model might memorize the data instead of
learning.
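Illustration (a hedged sketch, not from the original notes): the point about false correlations can be checked with purely random data. Below, the target and every feature are independent noise, so any correlation found is spurious. The sample size, feature counts, seed and function names are assumptions made for this example.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(0)
n_samples = 20
# Target and features are independent random noise: no real relationship exists.
target = [rng.gauss(0, 1) for _ in range(n_samples)]
features = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(1000)]

def strongest_spurious_correlation(num_features):
    """Largest |correlation| with the target among the first num_features features."""
    return max(abs(pearson(f, target)) for f in features[:num_features])

for p in (10, 100, 1000):
    print(p, round(strongest_spurious_correlation(p), 3))
```

Because each larger feature set contains the smaller one, the strongest spurious correlation can only stay the same or grow as features are added. A flexible model will happily use such a feature, which is one way overfitting arises.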

Visualization:
Training Error ↓ (keeps decreasing as complexity increases)
Test Error ↓ then ↑ (decreases at first, then rises again due to overfitting as too many features are added)

2. Computational Complexity

Definition:

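As dimensionality increases, the computational cost in terms of time and memory also increases.
For methods that must cover or search the whole feature space, the amount of data and computation
needed can grow exponentially with the number of dimensions. This is known as the curse of
dimensionality.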

Issues Include:

 Increased training time: More features mean more calculations.


 Memory usage: High-dimensional data consumes more RAM and storage.
 Model complexity: Algorithms like KNN, SVM, and decision trees become slower and less
efficient.
 Distance-based algorithms suffer: In high dimensions, Euclidean distance becomes less
meaningful because all points tend to become equidistant.

Example:

In a 2D space, it’s easy to compute the distance between two points. In a 1000D space, the same
distance calculation needs one term per dimension and is less informative, because the nearest and
farthest points end up at nearly the same distance.
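Illustration (a minimal sketch under stated assumptions): the loss of contrast between near and far points can be measured directly. The code draws random points in a unit hypercube and reports (farthest - nearest) / nearest for a random query point. The function name relative_contrast, the number of points and the seed are assumptions for this example.

```python
import random
from math import dist

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest Euclidean distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))
```

A large value means the nearest neighbor is much closer than the farthest point, so "nearest" is meaningful. As the dimension grows the value shrinks toward 0, which is why distance-based methods such as KNN degrade in high dimensions.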

Solutions:

1. Feature Selection – Remove irrelevant or redundant features.


2. Dimensionality Reduction – Use techniques like:
o PCA (Principal Component Analysis)
o t-SNE
o LDA (Linear Discriminant Analysis)
3. Regularization – Techniques like L1 (Lasso) and L2 (Ridge) help prevent overfitting.
4. Cross-validation – Helps to ensure that models generalize well.
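Illustration (a simplified sketch, not a full implementation): the effect of L2 (Ridge) regularization can be seen in the simplest possible case, one feature and no intercept, where the slope has the closed form w = Σxy / (Σx² + λ). The data values and the name ridge_slope are assumptions for this example.

```python
def ridge_slope(xs, ys, lam):
    """Slope w that minimises sum((y - w*x)^2) + lam * w^2 (one feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]          # roughly y = 2x with a little noise

for lam in (0.0, 1.0, 10.0):
    print(lam, round(ridge_slope(xs, ys, lam), 3))
```

With lam = 0 this is ordinary least squares. Increasing lam shrinks the weight toward zero, which limits how strongly the model can react to any single feature and so reduces the freedom to overfit.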

Conclusion:

High dimensionality can lead to overfitting due to excessive freedom in model fitting and increased
computational complexity due to the exponential growth in resource requirements. Proper
dimensionality reduction and regularization techniques are essential to manage these challenges and
ensure efficient and accurate models.

3. Explain how classification metrics like precision, recall, and F1-score are evaluated.

3. Classification Metrics: Precision, Recall, and F1-Score

In supervised machine learning, especially classification tasks, we evaluate model performance
using various metrics. Among the most important are:

 Precision
 Recall
 F1-Score

These are derived from the confusion matrix.


Confusion Matrix

                  Predicted Positive     Predicted Negative
Actual Positive   True Positive (TP)     False Negative (FN)
Actual Negative   False Positive (FP)    True Negative (TN)

1. Precision

Definition:

Precision tells us how many of the predicted positive results are actually correct.

Precision = TP / (TP + FP)

Example:

If a model predicts 100 positive cases and 80 are correct:

Precision=80/100=0.80 (80%)

Interpretation:

High precision = few false positives


Useful in situations where false positives are costly, e.g., spam filters.

2. Recall

Definition:

Recall measures how many actual positives were correctly predicted by the model.

Recall = TP / (TP + FN)

Example:

If there are 100 actual positive cases and the model correctly identifies 70:

Recall=70/100=0.70 (70%)

Interpretation:

High recall = few false negatives


Important where missing positive cases is risky, e.g., disease detection.

3. F1-Score
Definition:

F1-Score is the harmonic mean of Precision and Recall. It balances the two metrics.

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)

Example:

If Precision = 0.80 and Recall = 0.70

F1 = 2 × (0.80 × 0.70) / (0.80 + 0.70) = 1.12 / 1.50 ≈ 0.747

Interpretation:

F1-score is useful when we need to balance both precision and recall, especially on imbalanced
datasets.
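Illustration (a minimal sketch): the three metrics can be computed by first counting the four confusion-matrix cells and then applying precision = TP / (TP + FP), recall = TP / (TP + FN) and the harmonic mean for F1. The function names and the small label lists are assumptions made for this example; returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1; each is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]   # actual labels (illustrative)
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # model predictions (illustrative)

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)
print(precision, recall, round(f1, 3))
```

Here TP = 3, FP = 1, FN = 2 and TN = 4, so precision is 3/4, recall is 3/5 and F1 is 2/3. Note that TN does not appear in any of the three formulas, which is why these metrics stay informative when negatives vastly outnumber positives.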

Use Case Summary

Metric Best Used When

Precision False positives are costly (e.g., spam filters)

Recall False negatives are costly (e.g., medical diagnosis)

F1-Score Need a balance between precision and recall

Conclusion:

Precision, Recall, and F1-score are essential metrics for evaluating classification models. They
provide a more complete picture than accuracy alone, especially for imbalanced datasets. Choosing
the right metric depends on the context of the problem and the cost of errors.

4)Difference between Supervised and Unsupervised Learning


The difference between supervised and unsupervised learning lies in how they use data and their
goals. Supervised learning relies on labeled datasets, where each input is paired with a
corresponding output label. The goal is to learn the relationship between inputs and outputs so the
model can predict outcomes for new data, such as classifying emails as spam or not spam. In
contrast, unsupervised learning works with unlabeled data aiming to uncover hidden patterns or
structures within the dataset such as grouping customers based on their shopping habits or
detecting anomalies in a dataset.
Overall, supervised learning excels in predictive tasks with known outcomes, while unsupervised
learning is ideal for discovering relationships and trends in raw data.
Supervised learning
Labeled data means that each example in the dataset comes with a correct answer or output.
In supervised learning process:
 The machine is given a dataset with input features (like age, salary, or temperature) and
corresponding labels (like "yes/no," "high/low," or "rainy/sunny").
 The machine then learns from the dataset by finding patterns in the data. For example, it might
learn that if the temperature is high, it’s likely to be sunny.
 Once trained, the machine can predict the label for new input data. For instance, if you give it a
new temperature value, it can predict whether it will be sunny or rainy.
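Illustration (a toy sketch, not a production method): the temperature example can be turned into the smallest possible supervised learner, a single threshold chosen from labeled data. The temperatures, labels and function names are made up for this example.

```python
def fit_threshold(temperatures, labels):
    """Pick the cut-off that classifies the most training examples correctly.

    The rule is: predict 'sunny' at or above the cut-off, 'rainy' below it.
    """
    def correct(cut):
        return sum(
            ("sunny" if t >= cut else "rainy") == y
            for t, y in zip(temperatures, labels)
        )
    return max(sorted(set(temperatures)), key=correct)

# Labeled training data: each input comes with its correct answer.
train_temps = [12, 15, 18, 21, 26, 29, 31, 34]
train_labels = ["rainy", "rainy", "rainy", "rainy",
                "sunny", "sunny", "sunny", "sunny"]

cutoff = fit_threshold(train_temps, train_labels)

def predict(temperature):
    """Apply the learned rule to a new, unlabeled input."""
    return "sunny" if temperature >= cutoff else "rainy"

print(cutoff, predict(30), predict(10))
```

The labels are what make this supervised: the cut-off is chosen by comparing predictions against known answers, and the learned rule is then applied to inputs the model has not seen.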
Supervised Learning Analogies
1. Supervised learning is like a teacher guiding a student. The teacher provides examples (labeled
data) and explains the correct answers (output labels). For instance:
 A teacher shows a child pictures of animals and labels them as "cat" or "dog."
 The child learns to recognize the features that distinguish cats from dogs.
 If the child makes a mistake, the teacher corrects them, helping them improve over time.
This analogy emphasizes the role of labeled data in supervised learning, where the algorithm learns
from examples with known outputs.
2. Think of sorting mail into categories like "bills," "ads," or "personal letters":
 You are given labeled examples of each type of mail (e.g., envelopes marked as "bill" or "ad").
 By examining these examples, you learn patterns such as bills often having company logos or ads
being colorful.
 Once trained, you can sort new mail into categories even without explicit labels.
This analogy mirrors how supervised learning uses labeled data to classify new inputs into predefined
categories.
Unsupervised Learning
Unsupervised learning is like letting a child explore and learn on their own, without a teacher: the
machine must find hidden patterns or groupings in the data by itself. Here, the machine is given a
dataset with only input features (like customer purchase history or website click patterns) but no labels.
The machine then tries to find structure in the data. It might group similar data points together or identify
trends. Finally, it provides insights, such as clusters of similar data or patterns that were not
obvious before.
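Illustration (a deliberately tiny sketch): grouping without labels can be shown with k-means on a single feature. The spending values, the evenly spaced starting centroids and the function names are assumptions made for this example; real k-means implementations usually use randomized initialization and several features.

```python
def assign(values, centroids):
    """Group each value with its nearest centroid."""
    clusters = [[] for _ in centroids]
    for v in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        clusters[nearest].append(v)
    return clusters

def kmeans_1d(values, k=2, iterations=10):
    """Very small k-means for one feature (requires k >= 2)."""
    lo, hi = min(values), max(values)
    # Assumption: start with centroids spread evenly between min and max.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        clusters = assign(values, centroids)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, assign(values, centroids)

monthly_spend = [12, 15, 14, 90, 95, 88, 13, 92]   # no labels are provided
centroids, clusters = kmeans_1d(monthly_spend, k=2)
print(centroids)
print(clusters)
```

The algorithm is never told which customers are "low spenders" or "high spenders"; the two groups emerge only from how close the values are to each other.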
Unsupervised Learning Analogies
1. Sorting Books Without Labels : Imagine you are given a box of books with no labels or
categories. Your task is to organize them:
 You notice that some books are mystery novels, so you group them together.
 Others are textbooks, which you set aside in a separate pile.
 Comic books form another group because of their distinct style.
Here, you create groups based on the books' characteristics (e.g., genre, content) without any prior
guidance. This reflects how unsupervised learning clusters data based on similarities.
This analogy reflects customer segmentation in marketing. Businesses use unsupervised learning to
group customers based on purchasing behavior, preferences, or demographics, enabling targeted
marketing strategies.
2. Exploring a New City: Imagine visiting a new city without a map or guide. You explore and start
grouping landmarks:
 Buildings with tall spires might be grouped as churches.
 Open spaces with greenery might be categorized as parks.
 Streets with lots of shops could be grouped as markets.
You’re identifying patterns and organizing your observations independently, much like how
unsupervised learning identifies patterns in data.
This analogy mirrors anomaly detection in cybersecurity. For example, unsupervised learning
algorithms analyze network traffic and identify unusual patterns that could indicate potential
cyberattacks.
Difference between Supervised and Unsupervised Learning
Aspect | Supervised Learning | Unsupervised Learning
Input Data | Uses labeled data (input features + corresponding outputs). | Uses unlabeled data (only input features, no outputs).
Goal | Predicts outcomes or classifies data based on known labels. | Discovers hidden patterns, structures, or groupings in data.
Computational Complexity | Less complex, as the model learns from labeled data with clear guidance. | More complex, as the model must find patterns without any guidance.
Types | Two types: classification (for discrete outputs) or regression (for continuous outputs). | Clustering and association.
Testing the Model | Model can be tested and evaluated using labeled test data. | Cannot be tested in the traditional sense, as there are no labels.

5)Discuss heuristic search in inductive learning, focussing on strategies to avoid overfitting.

Heuristic Search Techniques in AI


Heuristic search techniques are used for problem-solving in AI systems. These techniques help find
the most efficient path from a starting point to a goal, making them essential for applications such as
navigation systems, game playing, and optimization problems.
 Heuristic search operates within the search space of a problem to find the best or near-optimal
solution using systematic algorithms.
 Unlike brute-force methods, which exhaustively evaluate all possible solutions, heuristic search
leverages heuristic information to guide the search toward more promising paths.
In this context, heuristics refer to a set of criteria or rules of thumb that provide an estimate of the
most viable solution. By balancing exploration (searching new possibilities)
and exploitation (refining known solutions), heuristic algorithms efficiently solve complex problems
that would otherwise be computationally expensive.
Significance of Heuristic Search in AI
The advantage of heuristic search techniques in AI is their ability to efficiently navigate large search
spaces. By prioritizing the most promising paths, heuristics significantly reduce the number of
possibilities that need to be explored. This not only accelerates the search process but also enables AI
systems to solve complex problems that would be impractical for exact algorithms.
Components of Heuristic Search
Heuristic search algorithms typically comprise several essential components:
1. State Space: The set of all possible states or configurations that the problem can be in; the
solution is found somewhere within it.
2. Initial State: The state from which the search begins; it sits at the top (root) of the search
tree.
3. Goal Test: A check that determines whether the current state is a terminal or goal state, that
is, a state in which the problem is solved.
4. Successor Function: Generates the states that can replace the current state; these represent
the possible moves or solutions in the problem space.
5. Heuristic Function: Estimates the value of a given state or its distance to the target state. It
helps to focus the search on regions or states that have a good prospect of reaching the goal.
Types of Heuristic Search Techniques
Over the history of heuristic search, many techniques have been developed to improve the
algorithms and to address different problem domains. Some prominent techniques include:
1. A* Search Algorithm
A* Search Algorithm is perhaps the most well-known heuristic search algorithm. It uses a best-first
search and finds the least-cost path from a given initial node to a target node. It ranks each node n
with an evaluation function f(n) = g(n) + h(n), where g(n) is the cost from the start node
to n, and h(n) is a heuristic that estimates the cost of the cheapest path from n to the goal. A* is
widely used in pathfinding and graph traversal.
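Illustration (a compact sketch under stated assumptions): A* on a small grid, using f(n) = g(n) + h(n) with the Manhattan distance as h(n). The grid, the 4-directional unit-cost moves and the function name a_star are assumptions chosen for this example; with such moves the Manhattan distance never overestimates the remaining cost.

```python
import heapq

def a_star(grid, start, goal):
    """Length of the shortest 4-connected path on a grid (0 = free, 1 = wall), or None."""
    def h(cell):
        # Manhattan distance: an admissible estimate for unit-cost 4-directional moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    rows, cols = len(grid), len(grid[0])
    best_g = {start: 0}
    frontier = [(h(start), 0, start)]          # entries are (f, g, cell)
    while frontier:
        f, g, cell = heapq.heappop(frontier)
        if cell == goal:
            return g
        if g > best_g.get(cell, float("inf")):
            continue                           # outdated queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_g = g + 1
                if new_g < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = new_g
                    heapq.heappush(frontier, (new_g + h((nr, nc)), new_g, (nr, nc)))
    return None

grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(a_star(grid, (0, 0), (3, 3)))
print(a_star(grid, (0, 0), (3, 0)))
```

The priority queue always expands the node with the smallest f(n), so nodes that are both cheap to reach and estimated to be close to the goal are explored first.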
2. Greedy Best-First Search
Greedy best-first search expands the node that is closest to the goal, as estimated by a heuristic
function. Unlike A*, which considers both path cost and estimated remaining cost, greedy best-first
search only prioritizes the estimated cost to the goal. While this often makes it faster, it does not
guarantee the least-cost path and can return suboptimal solutions.
3. Hill Climbing
Hill climbing is a heuristic search used for mathematical optimization problems. It is a local search
method that can be seen as a discrete counterpart of gradient ascent. It starts from an initial point
(often chosen at random) and iteratively moves toward higher values by choosing the best neighboring
state, stopping when no neighbor is better. However, it can get stuck in local
maxima, failing to find the global optimum.
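Illustration (a minimal sketch): hill climbing over a one-dimensional "landscape" stored as a list, where the neighbors of a position are the positions directly left and right of it. The landscape values and the function name hill_climb are made up for this example.

```python
def hill_climb(values, start):
    """Move to the best neighboring index until no neighbor is higher."""
    current = start
    while True:
        neighbors = [i for i in (current - 1, current + 1) if 0 <= i < len(values)]
        best = max(neighbors, key=lambda i: values[i])
        if values[best] <= values[current]:
            return current                 # no uphill move left: a (local) maximum
        current = best

landscape = [1, 3, 5, 4, 2, 6, 9, 7]       # two peaks: 5 (local) and 9 (global)
print(hill_climb(landscape, start=0))
print(hill_climb(landscape, start=4))
```

Starting at index 0 the search stops at index 2 (value 5), a local maximum; starting at index 4 it reaches index 6 (value 9), the global maximum. The result depends entirely on the starting point, which is the weakness described above.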
4. Simulated Annealing
Inspired by annealing in metallurgy, simulated annealing is a probabilistic technique for finding the
global optimum. Unlike hill climbing, it allows the search to accept worse solutions temporarily to
escape local optima. This probabilistic acceptance decreases over time, allowing it to converge
toward the best solution.
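Illustration (a hedged sketch): simulated annealing on a small one-dimensional landscape. A worse neighbor is accepted with probability exp(delta / temperature), and the temperature is lowered a little at every step. The landscape, step count, starting temperature, cooling rate and seed are all assumptions chosen for this example.

```python
import math
import random

def simulated_annealing(values, start, steps=2000, t_start=5.0, cooling=0.995, seed=0):
    """Return the index of the best value visited during the annealing run."""
    rng = random.Random(seed)
    current = best = start
    temperature = t_start
    for _ in range(steps):
        candidate = current + rng.choice((-1, 1))       # random neighbor
        if 0 <= candidate < len(values):
            delta = values[candidate] - values[current]
            # Always accept improvements; accept worse moves with a probability
            # that shrinks as the temperature drops.
            if delta >= 0 or rng.random() < math.exp(delta / temperature):
                current = candidate
                if values[current] > values[best]:
                    best = current
        temperature = max(temperature * cooling, 1e-6)  # cool down, avoid division by zero
    return best

landscape = [1, 3, 5, 4, 2, 6, 9, 7]
best_index = simulated_annealing(landscape, start=0)
print(best_index, landscape[best_index])
```

Early on, the high temperature lets the search step downhill and so leave a local peak; late in the run the acceptance probability for worse moves is close to zero and the search behaves like hill climbing. The method is probabilistic, so it gives no guarantee of reaching the global optimum in a finite run.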
5. Beam Search
Beam search is a graph-based search technique that explores only a limited number of promising
nodes (a beam). The beam width, which limits the number of nodes stored in memory, plays a
crucial role in the performance and accuracy of the search.
Applications of Heuristic Search
Heuristic search techniques are widely used in various real-world scenarios, including:
 Pathfinding: Whether it's navigating a city or plotting a route in a game, heuristic search helps
find the shortest or most efficient path between two points.
 Optimization: From resource allocation to scheduling, heuristic methods help make the most of
available resources while maximizing efficiency.
 Game Playing: In strategy games like chess and Go, AI relies on heuristic search to evaluate
possible moves and plan ahead.
 Robotics: Autonomous robots use heuristic search to determine their movements, avoid
obstacles, and complete tasks efficiently.
 Natural Language Processing (NLP): Search algorithms play a key role in language processing
tasks like parsing, semantic analysis, and text generation, helping AI understand and generate
human language.
Advantages of Heuristic Search Techniques
Heuristic search techniques offer several advantages:
 Efficiency: By focusing on the most promising paths, heuristic search significantly reduces the
number of possibilities explored, saving both time and computational resources.
 Optimality: When using admissible heuristics, certain algorithms like A* can guarantee an
optimal solution, ensuring the best possible outcome.
 Versatility: Heuristic methods are adaptable and can be applied to a wide range of problems,
from pathfinding and optimization to game AI and robotics.
Limitations of Heuristic Search Techniques
Despite their advantages, heuristic search techniques also have some limitations:
 Heuristic Quality: The effectiveness of heuristic search heavily depends on the quality of the
heuristic function. Poorly designed heuristics can lead to inefficient or suboptimal solutions.
 Space Complexity: Some heuristic algorithms require large amounts of memory, especially
when dealing with extensive search spaces, making them less practical for resource-limited
environments.
 Domain-Specificity: Designing effective heuristics often requires domain-specific knowledge,
which can make it difficult to create general-purpose heuristic approaches.
6)Bias-Variance Trade Off - Machine Learning
To judge the accuracy of any machine-learning algorithm, it is important to understand its prediction
errors: bias and variance. There is a tradeoff between a model’s ability to minimize bias and its
ability to minimize variance, and understanding it helps in selecting a good value for the
regularization constant. A proper understanding of these errors also helps to avoid overfitting and
underfitting a data set while training the algorithm.
What is Bias?
Bias is the difference between the model’s average prediction and the correct value. High bias gives
a large error on training as well as testing data. It is recommended that an algorithm have low bias
to avoid the problem of underfitting. With high bias, the predictions follow a straight line and so do
not fit the data in the data set accurately. Such fitting is known as underfitting of data. This
happens when the hypothesis is too simple or linear in nature. The figure below shows an example
of such a situation.

High Bias in the Model

In such a problem, the hypothesis looks as follows.

$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$
What is Variance?
Variance is the variability of the model’s prediction for a given data point; it tells us how much
the predictions spread when the model is trained on different samples of the data. A model with
high variance fits the training data in a very complex way and is therefore unable to fit accurately
data it has not seen before. As a result, such models perform very well on training data but have
high error rates on test data. A model with high variance is said to be overfitting the data.
Overfitting means fitting the training set accurately with a complex curve and a high-order
hypothesis, but it is not a solution because the error on unseen data is high. While training a
model, variance should be kept low. A high-variance fit looks as follows.

High Variance in the Model

In such a problem, the hypothesis looks as follows.

$h_\theta(x) = g(\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4)$
Bias Variance Tradeoff
If the algorithm is too simple (a hypothesis with a linear equation), it is likely to have high bias
and low variance and is therefore error-prone. If the algorithm is too complex (a hypothesis with a
high-degree equation), it is likely to have high variance and low bias. In the latter condition, the
model will not perform well on new entries. There is a middle ground between these two conditions,
known as the Trade-off or Bias-Variance Trade-off. This tradeoff in complexity is why there is a
tradeoff between bias and variance: an algorithm can’t be more complex and less complex at the same
time. For the graph, the perfect tradeoff will be like this.

We try to minimize the total error of the model by using the Bias-Variance Tradeoff.
$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$
The best fit is given by the hypothesis at the tradeoff point. The graph of error against model
complexity that shows the trade-off is given below.

Region for the Least Value of Total Error

This is referred to as the best point chosen for the training of the algorithm which gives low error
in training as well as testing data.
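Illustration (a hedged sketch that swaps in a different model family): instead of the polynomial hypotheses above, the code below uses k-nearest-neighbor averaging, purely because it needs no linear algebra. A small k plays the role of the overly complex model (high variance) and k equal to the training-set size plays the role of the overly simple model (high bias). The noisy sine data, sample sizes and seed are assumptions for this example.

```python
import math
import random

def make_data(n, rng):
    """Noisy samples of sin(x) on [0, 2*pi]."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

def knn_predict(x, train_x, train_y, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(xs, ys, train_x, train_y, k):
    """Mean squared error of the k-NN predictor on the points (xs, ys)."""
    return sum((knn_predict(x, train_x, train_y, k) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
train_x, train_y = make_data(40, rng)
test_x, test_y = make_data(200, rng)

for k in (1, 5, 40):
    train_err = mse(train_x, train_y, train_x, train_y, k)
    test_err = mse(test_x, test_y, train_x, train_y, k)
    print(k, round(train_err, 3), round(test_err, 3))
```

With k = 1 the model memorizes the training set, so its training error is exactly zero while its test error is not. With k = 40 every prediction is the overall mean, so the model cannot follow the sine shape at all. An intermediate k corresponds to the tradeoff region of the error-versus-complexity curve.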
UNIT-3
1)ML | Linear Regression vs Logistic Regression


Linear regression is a supervised machine learning algorithm for regression. Regression models a
target prediction value based on independent variables. It is mostly used for finding out the
relationship between variables and for forecasting. Regression models differ in the kind of
relationship they assume between the dependent and independent variables and in the number of
independent variables used. Logistic regression is a supervised classification algorithm. In a
classification problem, the target variable (or output), y, can take only discrete values for a
given set of features (or inputs), X.
Sl.No. | Linear Regression | Logistic Regression
1. | Linear Regression is a supervised regression model. | Logistic Regression is a supervised classification model.
2. | Equation of linear regression: $y = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_i x_i$. Here, y = response variable, x_i = ith predictor variable, a_i = average effect on y as x_i increases by 1. | Equation of logistic regression: $y(x) = \dfrac{e^{a_0 + a_1 x_1 + a_2 x_2 + \dots + a_i x_i}}{1 + e^{a_0 + a_1 x_1 + a_2 x_2 + \dots + a_i x_i}}$. Here, y = response variable, x_i = ith predictor variable, a_i = change in the log-odds of y = 1 as x_i increases by 1.
3. | In Linear Regression, we predict a continuous (real-valued) number. | In Logistic Regression, we predict a probability, which is then mapped to 1 or 0.
4. | Here no activation function is used. | Here an activation function (the sigmoid) is used to convert a linear regression equation to the logistic regression equation.
5. | Here no threshold value is needed. | Here a threshold value is added.
6. | Here Root Mean Square Error (RMSE) is calculated to evaluate the fit and guide the next weight values. | Here classification metrics such as precision are used to evaluate the fit, while the weights are updated by minimizing the log loss.
7. | Here the dependent variable should be numeric and continuous. | Here the dependent variable consists of only two categories. Logistic regression estimates the odds of the outcome of the dependent variable given a set of quantitative or categorical independent variables.
8. | It is based on least squares estimation. | It is based on maximum likelihood estimation.
9. | Here when we plot the training data, a straight line can be drawn that passes as close as possible to the points. | Any change in the coefficient leads to a change in both the direction and the steepness of the logistic function. It means positive slopes result in an S-shaped curve and negative slopes result in a Z-shaped curve.
10. | Linear regression is used to estimate the dependent variable in case of a change in independent variables. For example, predict the price of houses. | Logistic regression is used to calculate the probability of an event. For example, classify if tissue is benign or malignant.
11. | Linear regression assumes the normal or Gaussian distribution of the dependent variable. | Logistic regression assumes the binomial distribution of the dependent variable.
12. | Applications of linear regression: financial risk assessment, business insights, market analysis. | Applications of logistic regression: medicine, credit scoring, hotel booking, gaming, text editing.
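Illustration (a minimal sketch with made-up coefficients): the same linear combination a0 + a1·x gives a real number in linear regression, and is passed through the sigmoid and a threshold in logistic regression. The coefficient values, the 0.5 threshold and the function names are assumptions chosen for this example.

```python
import math

def linear_predict(x, a0, a1):
    """Linear regression output: any real number."""
    return a0 + a1 * x

def logistic_predict(x, a0, a1):
    """Logistic regression output: a probability between 0 and 1.

    1 / (1 + e^(-z)) is algebraically the same as e^z / (1 + e^z).
    """
    z = a0 + a1 * x
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, a0, a1, threshold=0.5):
    """Turn the probability into a class label using a threshold."""
    return 1 if logistic_predict(x, a0, a1) >= threshold else 0

a0, a1 = -1.0, 1.0                      # hypothetical fitted coefficients
for x in (0.0, 1.0, 2.0):
    print(x, linear_predict(x, a0, a1),
          round(logistic_predict(x, a0, a1), 3), classify(x, a0, a1))
```

At x = 1 the linear part is 0, so the probability is exactly 0.5 and sits on the decision boundary; inputs with a positive linear part are assigned class 1 and those with a negative one class 0.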

2)K-Nearest Neighbor(KNN) Algorithm


K-Nearest Neighbors (KNN) is a supervised machine learning algorithm generally used for
classification but can also be used for regression tasks. It works by finding the "k" closest data points
(neighbors) to a given input and makes predictions based on the majority class (for classification) or
the average value (for regression). Since KNN makes no assumptions about the underlying data
distribution, it is a non-parametric and instance-based learning method.

K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead, it stores the dataset and performs the computation on it only at
the time of classification.

For example, consider the following table of data points containing two features:

KNN Algorithm working visualization

The image shows how KNN predicts the category of a new data point based on its closest
neighbours. The new point is classified as Category 2 because most of its closest neighbors are
blue squares. KNN assigns the category based on the majority of nearby points.
 The red diamonds represent Category 1 and the
blue squares represent Category 2.
 The new data point checks its closest neighbors (circled points).
 Since the majority of its closest neighbors are blue squares (Category 2) KNN predicts the new
data point belongs to Category 2.
KNN works by using proximity and majority voting to make predictions.
What is 'K' in K Nearest Neighbour?
In the k-Nearest Neighbours algorithm k is just a number that tells the algorithm how many nearby
points or neighbors to look at when it makes a decision.
Example: Imagine you're deciding which fruit it is based on its shape and size. You compare it to
fruits you already know.
 If k = 3, the algorithm looks at the 3 closest fruits to the new one.
 If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit is an apple
because most of its neighbors are apples.
How to choose the value of k for KNN Algorithm?
 The value of k in KNN decides how many neighbors the algorithm looks at when making a
prediction.
 Choosing the right k is important for good results.
 If the data has lots of noise or outliers, using a larger k can make the predictions more stable.
 But if k is too large the model may become too simple and miss important patterns and this is
called underfitting.
 So k should be picked carefully based on the data.
Statistical Methods for Selecting k
 Cross-Validation: A good way to find the best value of k is k-fold cross-validation. Note that
the number of folds is a separate setting from the number of neighbors k, even though both are
conventionally written with the letter k. The dataset is divided into several equal parts (folds). For
each candidate k, the model is trained on all folds but one and tested on the remaining fold, and this
is repeated so that every fold serves once as the test set. The k value that gives the highest average
accuracy across these tests is usually the best one to use.
 Elbow Method: In Elbow Method we draw a graph showing the error rate or accuracy for
different k values. As k increases the error usually drops at first. But after a certain point error
stops decreasing quickly. The point where the curve changes direction and looks like an "elbow"
is usually the best choice for k.
 Odd Values for k: It’s a good idea to use an odd number for k especially in classification
problems. This helps avoid ties when deciding which class is the most common among the
neighbors.
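The cross-validation idea can be sketched in a few lines. This is a minimal illustration, not a definitive procedure: the dataset is made up, the helper names are hypothetical, and it uses leave-one-out cross-validation (the special case where the number of folds equals the number of samples) to keep the code short.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, point, k):
    """Majority vote among the k training points closest to `point`."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: math.dist(pair[0], point))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: each point is predicted from all the others."""
    correct = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) == y[i]:
            correct += 1
    return correct / len(X)

# Made-up dataset with two features and two classes
X = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 3),
     (6, 6), (6, 7), (7, 6), (7, 7), (5, 5)]
y = ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B']

# Try only odd candidate values of k to reduce the chance of tied votes
scores = {k: loo_accuracy(X, y, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)   # on ties, the first (smallest) k wins
print(scores, best_k)
```

Plotting the values in scores against k gives the curve used by the elbow method.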
Distance Metrics Used in KNN Algorithm
KNN uses distance metrics to identify the nearest neighbors, and these neighbors are then used for
classification and regression tasks. To identify the nearest neighbors we use the distance metrics below:
1. Euclidean Distance
Euclidean distance is defined as the straight-line distance between two points in a plane or space.
You can think of it like the shortest path you would walk if you were to go directly from one point to
another.
$$\text{distance}(x, X_i) = \sqrt{\sum_{j=1}^{d} (x_j - X_{ij})^2}$$
2. Manhattan Distance
This is the total distance you would travel if you could only move along horizontal and vertical lines
like a grid or city streets. It’s also called "taxicab distance" because a taxi can only drive along the
grid-like streets of a city.
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
3. Minkowski Distance
Minkowski distance is like a family of distances, which includes both Euclidean and Manhattan
distances as special cases.
$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
From the formula above, when p=2, it becomes the same as the Euclidean distance formula and when
p=1, it turns into the Manhattan distance formula. Minkowski distance is essentially a flexible
formula that can represent either Euclidean or Manhattan distance depending on the value of p.
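A short sketch shows how one Minkowski function covers both special cases. The function name and the two sample points are illustrative assumptions. Absolute differences are used so that odd values of p also give a valid distance.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

a = (1, 2)
b = (4, 6)

manhattan = minkowski_distance(a, b, p=1)   # |1-4| + |2-6| = 7
euclidean = minkowski_distance(a, b, p=2)   # sqrt(3^2 + 4^2) = 5
print(manhattan, euclidean)
```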
Working of KNN algorithm
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity where it predicts
the label or value of a new data point by considering the labels or values of its K nearest neighbors in
the training dataset.

Step 1: Selecting the optimal value of K


 K represents the number of nearest neighbors that needs to be considered while making
prediction.
Step 2: Calculating distance
 To measure the similarity between target and training data points Euclidean distance is used.
Distance is calculated between data points in the dataset and target point.
Step 3: Finding Nearest Neighbors
 The k data points with the smallest distances to the target point are nearest neighbors.
Step 4: Voting for Classification or Taking Average for Regression
 When you want to classify a data point into a category like spam or not spam, the KNN algorithm
looks at the K closest points in the dataset. These closest points are called neighbors. The
algorithm then looks at which category the neighbors belong to and picks the one that appears the
most. This is called majority voting.
 In regression, the algorithm still looks for the K closest points. But instead of voting for a class in
classification, it takes the average of the values of those K neighbors. This average is the
predicted value for the new point.
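The regression variant in Step 4 can be sketched as follows. The function name knn_regress and the data are hypothetical; the only change from classification is that the neighbors' values are averaged instead of counted.

```python
import math

def knn_regress(train_X, train_values, point, k):
    """Predict a numeric value as the mean of the k nearest neighbours' values."""
    ranked = sorted(zip(train_X, train_values),
                    key=lambda pair: math.dist(pair[0], point))
    nearest_values = [value for _, value in ranked[:k]]
    return sum(nearest_values) / k

# Made-up data: two features per point and a numeric target
train_X = [(1, 1), (2, 2), (3, 3), (8, 8), (9, 9)]
train_values = [10.0, 20.0, 30.0, 80.0, 90.0]

# The three nearest points to (2, 3) are (2, 2), (3, 3) and (1, 1)
prediction = knn_regress(train_X, train_values, (2, 3), k=3)
print(prediction)   # mean of 20, 30 and 10
```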
The accompanying figure shows how a test point is classified based on its nearest neighbors. As the
test point moves, the algorithm identifies the closest 'k' data points (5 in this case) and assigns the
test point the majority class label, which is the grey class here.
Python Implementation of KNN Algorithm
1. Importing Libraries
Counter is used to count the occurrences of elements in a list or iterable. In KNN after finding the k
nearest neighbor labels Counter helps count how many times each label appears.
import numpy as np
from collections import Counter
2. Defining the Euclidean Distance Function
euclidean_distance calculates the Euclidean distance between two points.

def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((np.array(point1) - np.array(point2))**2))
3. KNN Prediction Function
 distances.append saves how far each training point is from the test point, along with its label.
 distances.sort sorts the list so the nearest points come first.
 k_nearest_labels picks the labels of the k closest points.
 Counter finds which label appears most often among those k labels, and that label becomes the prediction.

def knn_predict(training_data, training_labels, test_point, k):
    distances = []
    # Store the distance from the test point to every training point with its label
    for i in range(len(training_data)):
        dist = euclidean_distance(test_point, training_data[i])
        distances.append((dist, training_labels[i]))
    # Nearest points first
    distances.sort(key=lambda x: x[0])
    k_nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the k nearest labels
    return Counter(k_nearest_labels).most_common(1)[0][0]
4. Training Data, Labels and Test Point

training_data = [[1, 2], [2, 3], [3, 4], [6, 7], [7, 8]]
training_labels = ['A', 'A', 'A', 'B', 'B']
test_point = [4, 5]
k=3
5. Prediction

prediction = knn_predict(training_data, training_labels, test_point, k)


print(prediction)
Output:
A
The algorithm calculates the distances from the test point [4, 5] to all training points, selects the 3
closest points (since k = 3) and determines their labels. Since the majority of the closest points are
labelled 'A', the test point is classified as 'A'.
In practice we can also use the Scikit-learn Python library, which has built-in functions for KNN
models (see "Implementation of KNN classifier using Sklearn").
Applications of KNN
 Recommendation Systems: Suggests items like movies or products by finding users with similar
preferences.
 Spam Detection: Identifies spam emails by comparing new emails to known spam and non-spam
examples.
 Customer Segmentation: Groups customers by comparing their shopping behavior to others.
 Speech Recognition: Matches spoken words to known patterns to convert them into text.
Advantages of KNN
 Simple to use: Easy to understand and implement.
 No training step: No need to train as it just stores the data and uses it during prediction.
 Few parameters: Only needs to set the number of neighbors (k) and a distance method.
 Versatile: Works for both classification and regression problems.
Disadvantages of KNN
 Slow with large data: Needs to compare every point during prediction.
 Struggles with many features: Accuracy drops when data has too many features.
 Can Overfit: It can overfit especially when the data is high-dimensional or not clean.

3)Linear Discriminant Analysis in Machine Learning


Linear Discriminant Analysis (LDA), also known as Normal Discriminant Analysis, is a supervised
classification technique that helps separate two or more classes by converting a higher-dimensional
data space into a lower-dimensional space. It is used to identify a linear combination
of features that best separates the classes within a dataset.

2 Classes
overlapping

For example we have two classes that need to be separated efficiently. Each class may have multiple
features and using a single feature to classify them may result in overlapping. To solve this LDA is
used as it uses multiple features to improve classification accuracy. LDA works by some
assumptions and we are required to understand them so that we have a better understanding of its
working.
Key Assumptions of LDA
For LDA to perform effectively, certain assumptions are made:
 Gaussian Distribution: The data in each class should follow a normal bell-shaped distribution.
 Equal Covariance Matrices: All classes should have the same covariance structure.
 Linear Separability: The data should be separable using a straight line or plane.
If these assumptions are met, LDA can produce very good results. When data points belonging to
two classes are plotted and they are not linearly separable on the original features, LDA will attempt
to find a projection that maximizes class separability.
Linearly Separable Dataset

The image shows an example where the classes (black and green circles) are not linearly separable.
LDA attempts to separate them using the red dashed line. It uses both axes (X and Y) to generate a
new axis in such a way that it maximizes the distance between the means of the two classes while
minimizing the variation within each class. Projecting the data points onto this new axis transforms
the dataset into a space where the classes are better separated and can be classified more clearly.

The perpendicular distance between the line and points

Perpendicular distance between the decision boundary and the data points helps us to visualize how
LDA works by reducing class variation and increasing separability. After generating this new axis
using the above-mentioned criteria all the data points of the classes are plotted on this new axis and
are shown in the figure given below.

LDA

The figure shows how LDA creates a new axis to project the data and separate the two classes
effectively along a linear path. But it fails when the classes share the same mean, as it becomes
impossible for LDA to find a new axis that makes both classes linearly separable. In such cases we
use non-linear discriminant analysis.
How does LDA work
LDA works by finding directions in the feature space that best separate the classes. It does this by
maximizing the difference between the class means while minimizing the spread within each class.
Let's assume we have two classes with d-dimensional samples $x_1, x_2, \dots, x_n$ where:
 $n_1$ samples belong to class $c_1$
 $n_2$ samples belong to class $c_2$.
If $x_i$ represents a data point, its projection onto the line represented by the unit vector $v$ is
$v^T x_i$. Let the means of class $c_1$ and class $c_2$ before projection be $\mu_1$ and $\mu_2$
respectively. After projection the new means are $\hat{\mu}_1 = v^T \mu_1$ and $\hat{\mu}_2 = v^T \mu_2$.
Our aim is to make the difference $|\hat{\mu}_1 - \hat{\mu}_2|$ large relative to the spread of each
class, so that the classes are well separated. The scatter of the projected samples of class $c_1$ is
calculated as:
$$s_1^2 = \sum_{x_i \in c_1} (v^T x_i - \hat{\mu}_1)^2$$
Similarly for class $c_2$:
$$s_2^2 = \sum_{x_i \in c_2} (v^T x_i - \hat{\mu}_2)^2$$
The goal is to maximize the ratio of the between-class scatter to the within-class scatter, which leads
us to the following criterion:
$$J(v) = \frac{(\hat{\mu}_1 - \hat{\mu}_2)^2}{s_1^2 + s_2^2}$$
For the best separation we calculate the eigenvector corresponding to the highest eigenvalue of the
matrix $S_w^{-1} S_b$, where $S_w$ is the within-class scatter matrix and $S_b$ is the between-class
scatter matrix.
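For two classes the optimal direction can be computed directly, which the following sketch illustrates. It assumes NumPy is available; the function names and sample points are made up for illustration. It relies on the standard two-class result that the top eigenvector of the within-class inverse times the between-class scatter is proportional to the within-class inverse applied to the difference of the class means.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Unit vector v that maximizes Fisher's criterion for two classes."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter matrix: sum of the two class scatter matrices
    S_w = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    # Solve S_w v = (mu1 - mu2) instead of forming the inverse explicitly
    v = np.linalg.solve(S_w, mu1 - mu2)
    return v / np.linalg.norm(v)

def fisher_criterion(v, X1, X2):
    """J(v): squared gap between projected means over total projected scatter."""
    p1, p2 = X1 @ v, X2 @ v
    scatter = ((p1 - p1.mean()) ** 2).sum() + ((p2 - p2.mean()) ** 2).sum()
    return (p1.mean() - p2.mean()) ** 2 / scatter

# Made-up 2-D samples for two classes
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [4.0, 5.0]])
X2 = np.array([[4.0, 1.0], [5.0, 2.0], [6.0, 2.0], [7.0, 4.0]])

v = fisher_direction(X1, X2)
j_best = fisher_criterion(v, X1, X2)
j_x_axis = fisher_criterion(np.array([1.0, 0.0]), X1, X2)
j_y_axis = fisher_criterion(np.array([0.0, 1.0]), X1, X2)
print(v, j_best, j_x_axis, j_y_axis)
```

Projecting onto v should give a criterion value at least as large as projecting onto either original axis, which is the sense in which LDA combines features to improve separation.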
Extensions to LDA
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance) allowing it to handle more complex relationships.
2. Flexible Discriminant Analysis (FDA): Uses non-linear combinations of inputs such as splines
to handle non-linear separability.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the covariance
estimate to prevent overfitting.
Implementation of LDA using Python
In this implementation we will perform linear discriminant analysis using Scikit-learn library on
the Iris dataset.
 StandardScaler(): Standardizes the features to ensure they have a mean of 0 and a standard
deviation of 1 removing the influence of different scales.
 fit_transform(): Standardizes the feature data by applying the transformation learned from the
training data ensuring each feature contributes equally.
 LabelEncoder(): Converts categorical labels into numerical values that machine learning models
can process.
 fit_transform() on y: Transforms the target labels into numerical values for use in classification
models.
 LinearDiscriminantAnalysis(): Reduces the dimensionality of the data by projecting it into a
lower-dimensional space while maximizing the separation between classes.
 transform() on X_test: Applies the learned LDA transformation to the test data to maintain
consistency with the training data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the Iris data into a DataFrame
iris = load_iris()
dataset = pd.DataFrame(columns=iris.feature_names,
                       data=iris.data)
dataset['target'] = iris.target

# Features are the first four columns, the label is the fifth
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values

# Standardize the features and encode the labels
sc = StandardScaler()
X = sc.fit_transform(X)
le = LabelEncoder()
y = le.fit_transform(y)
# No random_state is set, so the split (and the accuracy) varies between runs
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2)

# Project the data onto two linear discriminants
lda = LinearDiscriminantAnalysis(n_components=2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

plt.scatter(
    X_train[:, 0], X_train[:, 1],
    c=y_train,
    cmap='rainbow',
    alpha=0.7, edgecolors='b'
)
plt.show()

# Classify in the reduced 2D space
classifier = RandomForestClassifier(max_depth=2,
                                    random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print('Accuracy : ' + str(accuracy_score(y_test, y_pred)))

conf_m = confusion_matrix(y_test, y_pred)
print(conf_m)
Output:
Accuracy : 0.9
[[ 8 0 0]
[ 0 8 2]
[ 0 1 11]]

Scatter plot of the iris data mapped into 2D

This scatter plot shows three distinct groups of data points, represented by different colors. The group
on the right (dark blue) is clearly separated from the others, indicating that it is very different. The
other two groups (red and light blue) are positioned closer together with some overlap, suggesting
that they are more similar and harder to separate.
Advantages of LDA
 Simple and computationally efficient.
 With a regularized (shrinkage) covariance estimate, it can still work when the number of features
is larger than the number of training samples.
 Can handle multicollinearity.
Disadvantages of LDA
 Assumes Gaussian distribution of data which may not always be the case.
 Assumes equal covariance matrices for different classes which may not hold in all datasets.
 Assumes linear separability which is not always true.
 May not always perform well in high-dimensional feature spaces.
Applications of LDA
1. Face Recognition: It is used to reduce the high-dimensional feature space of pixel values in face
recognition applications helping to identify faces more efficiently.
2. Medical Diagnosis: It classifies disease severity in mild, moderate or severe based on patient
parameters helping in decision-making for treatment.
3. Customer Identification: It can help identify customer segments most likely to purchase a
specific product based on survey data.

4)Bayes' Theorem in AI
In probability theory, Bayes' theorem talks about the relation of the conditional probability of two
random events and their marginal probability. In short, it provides a way to calculate the value of
P(B|A) by using the knowledge of P(A|B).
Bayes' theorem is the name given to the formula used to calculate conditional probability. The
formula is as follows:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A) \cdot P(B \mid A)}{P(B)}$$
where,
 P(A) is the probability that event A occurs.
 P(B) defines the probability that event B occurs.
 P(A|B) is the probability of the occurrence of event A given that event B has already occurred.
 P(B|A) is the probability of the occurrence of event B given that event A has already occurred.
 P(A∩B) is the probability that events A and B both occur.
Key terms in Bayes' Theorem
The Bayes' Theorem is a basic concept in probability and statistics. It gives a model of updating
beliefs or probabilities when the new evidence is presented. This theorem was named after
Reverend Thomas Bayes and has been applied in many fields, ranging from artificial intelligence
and machine learning to data analysis.
The Bayes' Theorem encompasses four major elements:
1. Prior Probability (P(A)): The probability or belief in an event A prior to considering any
additional evidence, it represents what we know or believe about A based on previous
knowledge.
2. Likelihood (P(B|A)): The probability of evidence B given the occurrence of event A. It
determines how strongly the evidence points toward the event.
3. Evidence (P(B)): Evidence is the probability of observing evidence B regardless of whether A
is true. It serves to normalize the distribution so that the posterior probability is a valid
probability distribution.
4. Posterior Probability P(A|B): The posterior probability is a revised belief regarding event A,
informed by some new evidence B. It answers the question, "What is the probability that A is
true given evidence B observed?"
Using these components, Bayes' Theorem computes the posterior probability P(A|B), which
represents our updated belief in A after considering the new evidence.
In artificial intelligence, probability and the Bayes Theorem are especially useful when making
decisions or inferences based on uncertain or incomplete data. It enables us to rationally update our
beliefs as new evidence becomes available, making it an indispensable tool in AI, machine
learning, and decision-making processes.
How Bayes theorem is relevant in AI?
Bayes' theorem is highly relevant in AI due to its ability to handle uncertainty and make decisions
based on probabilities. Here's why it's crucial:
1. Probabilistic Reasoning: In many real-world scenarios, AI systems must reason under
uncertainty. Bayes' theorem allows AI systems to update their beliefs based on new evidence.
This is essential for applications like autonomous vehicles, where the environment is constantly
changing and sensors provide noisy information.
2. Machine Learning: Bayes' theorem serves as the foundation for Bayesian machine learning
approaches. These methods allow AI models to incorporate prior knowledge and update their
beliefs as they see more data. This is particularly useful in scenarios with limited data or when
dealing with complex relationships between variables.
3. Classification and Prediction: In classification tasks, such as spam email detection or medical
diagnosis, Bayes' theorem can be used to calculate the probability that a given input belongs to
a particular class. This allows AI systems to make more informed decisions based on the
available evidence.
4. Anomaly Detection: Bayes' theorem is used in anomaly detection, where AI systems identify
unusual patterns in data. By modeling the normal behavior of a system, Bayes' theorem can
help detect deviations from this norm, signaling potential anomalies or security threats.
Overall, Bayes' theorem provides a powerful framework for reasoning under uncertainty and is
essential for many AI applications, from decision-making to pattern recognition.
Mathematical Derivation of Bayes' Rule
Bayes' Rule is derived from the definition of conditional probability. Let's start with the definition:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
This equation states that the probability of event A given event B is equal to the probability of
both events happening (the intersection of A and B) divided by the probability of event B.
Similarly, we can write the conditional probability of event B given event A:
$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$
By rearranging this equation, we get:
$$P(A \cap B) = P(B \mid A) \cdot P(A)$$
Rearranging the first equation in the same way gives $P(A \cap B) = P(A \mid B) \cdot P(B)$. Since
both expressions are equal to $P(A \cap B)$, we can set them equal to each other:
$$P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)$$
To get $P(A \mid B)$, we divide both sides by $P(B)$:
$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$
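The derivation can be checked numerically. The sketch below starts from a made-up joint distribution over two binary events (all four numbers are assumptions chosen to sum to 1) and confirms that Bayes' rule gives the same value as the direct definition of conditional probability.

```python
# Made-up joint probabilities for two binary events A and B (they sum to 1)
p_a_and_b = 0.12
p_a_and_not_b = 0.08
p_not_a_and_b = 0.18
p_not_a_and_not_b = 0.62

# Marginal probabilities
p_a = p_a_and_b + p_a_and_not_b
p_b = p_a_and_b + p_not_a_and_b

# Conditional probabilities from the definition
p_a_given_b = p_a_and_b / p_b
p_b_given_a = p_a_and_b / p_a

# Bayes' rule: recover P(A|B) from P(B|A), P(A) and P(B)
p_a_given_b_bayes = p_b_given_a * p_a / p_b

print(p_a_given_b, p_a_given_b_bayes)
```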
Importance of Bayes' Theorem in AI
Bayes' Theorem is extremely important in artificial intelligence (AI) and related fields.
 Probabilistic Reasoning: In AI, many problems involve uncertainty, so probabilistic reasoning
is an important technique. Bayes' Theorem enables artificial intelligence systems to model and
reason about uncertainty by updating beliefs in response to new evidence. This is important for
decision-making, pattern recognition, and predictive modeling.
 Machine Learning: Bayes' Theorem is a fundamental concept in machine learning,
specifically Bayesian machine learning. Bayesian methods are used to model complex
relationships, estimate model parameters, and predict outcomes. Bayesian models enable the
principled handling of uncertainty in tasks such as classification, regression, and clustering.
 Data Science: Bayes' Theorem is used extensively in Bayesian statistics. It is used to estimate
and update probabilities in a variety of settings, including hypothesis testing, Bayesian
inference, and Bayesian optimization. It offers a consistent framework for modeling and
comprehending data.
Example of Bayes' Rule Application in AI
A classic example of Bayes' Rule in AI is its application in spam email classification.
This example demonstrates how Bayes' Theorem is used to classify emails as spam or non-spam
based on the presence of certain keywords.
Consider an email filtering system that needs to determine whether an incoming email is spam or
not based on the presence of the word "win" in the email. We are given the following probabilities:
 P(S): The prior probability that any given email is spam.
 P(H): The prior probability that any given email is not spam (ham).
 P(W∣S): The probability that the word "win" appears in a spam email.
 P(W∣H): The probability that the word "win" appears in a non-spam email.
 P(W): The probability that the word "win" appears in any email.
Given Data
 P(S)=0.2 (20% of emails are spam)
 P(H)=0.8 (80% of emails are not spam)
 P(W∣S)=0.6 (60% of spam emails contain the word "win")
 P(W∣H)=0.1 (10% of non-spam emails contain the word "win")
We want to find P(S∣W), the probability that an email is spam given that it contains the word
"win".
Applying Bayes rule we get:
$$P(S \mid W) = \frac{P(W \mid S) \cdot P(S)}{P(W)}$$
First, we need to calculate P(W), the probability that any email contains the word "win". Using the
law of total probability:
$$P(W) = P(W \mid S) \cdot P(S) + P(W \mid H) \cdot P(H)$$
Substituting the given values:
$$P(W) = (0.6 \cdot 0.2) + (0.1 \cdot 0.8) = 0.2$$
Now, we can use Bayes' Rule to find P(S∣W):
$$P(S \mid W) = \frac{P(W \mid S) \cdot P(S)}{P(W)}$$
substituting the values:
$$P(S \mid W) = \frac{0.6 \cdot 0.2}{0.2} = 0.6$$
Thus we can conclude that the probability that an email is spam given that it contains the word
"win" is 0.6, or 60%. This means that if an email contains the word "win," there is a 60% chance
that it is spam.
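The same calculation can be written as a small function. The name posterior and its parameter names are illustrative; the numbers are the ones given in the example.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(H|E) from P(H), P(E|H) and P(E|not H), using the law of total probability."""
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence

# P(S) = 0.2, P(W|S) = 0.6, P(W|H) = 0.1
p_spam_given_win = posterior(prior=0.2, likelihood=0.6, likelihood_given_not=0.1)
print(round(p_spam_given_win, 4))   # 0.6
```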
In a real-world AI system, such as an email spam filter, this calculation would be part of a larger
model that considers multiple features (words) within an email. The filter uses these probabilities,
along with other algorithms, to classify emails accurately and efficiently. By continuously updating
the probabilities based on incoming data, the spam filter can adapt to new types of spam and
improve its accuracy over time.
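One simple way to combine several words is to apply the update once per word, using each posterior as the next prior. This is a sketch under the "naive" assumption that words occur independently given the class. Only the probabilities for "win" come from the example; the other words and their probabilities are made up.

```python
def update(prior_spam, p_word_given_spam, p_word_given_ham):
    """One Bayes update of P(spam) after observing a single word."""
    numerator = p_word_given_spam * prior_spam
    evidence = numerator + p_word_given_ham * (1.0 - prior_spam)
    return numerator / evidence

# word -> (P(word | spam), P(word | ham)); values other than "win" are hypothetical
word_probs = {
    "win": (0.6, 0.1),
    "free": (0.5, 0.2),
    "meeting": (0.05, 0.3),
}

def spam_probability(words, prior_spam=0.2):
    """Posterior P(spam) after seeing each known word in turn."""
    p = prior_spam
    for word in words:
        if word in word_probs:   # words without estimates are ignored in this sketch
            p = update(p, *word_probs[word])
    return p

p_win = spam_probability(["win"])
p_win_free = spam_probability(["win", "free"])
p_win_meeting = spam_probability(["win", "meeting"])
print(p_win, p_win_free, p_win_meeting)
```

A word that is more common in spam raises the probability, while a word that is more common in normal mail lowers it.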
Uses of Bayes Rule in Artificial Intelligence
Bayes' theorem in AI is used to draw probabilistic conclusions, update beliefs, and make decisions
based on available information. Here are some important applications of Bayes' rule in AI.
1. Bayesian Inference: In Bayesian statistics, the Bayes' rule is used to update the probability
distribution over a set of parameters or hypotheses using observed data. This is especially
important for machine learning tasks like parameter estimation in Bayesian networks, hidden
Markov models, and probabilistic graphical models.
2. Naive Bayes Classification: In the field of natural language processing and text classification,
the Naive Bayes classifier is widely used. It uses Bayes' theorem to calculate the likelihood that
a document belongs to a specific category based on the words it contains. Despite its "naive"
assumption of feature independence, it works surprisingly well in practice.
3. Bayesian Networks: Bayesian networks are graphical models that use Bayes' theorem to
represent and predict probabilistic relationships between variables. They are used in a variety of
AI applications, such as medical diagnosis, fault detection, and decision support systems.
4. Spam Email Filtering: In email filtering systems, Bayes' theorem is used to determine whether
an incoming email is spam or not. The model calculates the likelihood of seeing specific words
or features in spam or non-spam emails and adjusts the probabilities accordingly.
5. Reinforcement Learning: Bayes' rule can be used to model the environment in a probabilistic
manner. Bayesian reinforcement learning methods can help agents estimate and update their
beliefs about state transitions and rewards, allowing them to make more informed decisions.
6. Bayesian Optimization: In optimization tasks, Bayes' theorem can be used to represent the
objective function as a probabilistic surrogate. Bayesian optimization techniques make use of
this model to iteratively explore and exploit the search space in order to efficiently find the
optimal solution. This is commonly used for hyperparameter tuning and algorithm parameter
optimization.
7. Anomaly Detection: Bayes' theorem can be used to identify anomalies or outliers in
datasets. By modeling the normal behavior of the data as a probability distribution, deviations
from it can be quantified, which aids anomaly detection in a variety of applications, including
fraud detection and network security.
8. Personalization: In recommendation systems, Bayes' theorem can be used to update user
preferences and provide personalized recommendations. By constantly updating a user's
preferences based on their interactions, the system can recommend more relevant content.
9. Robotics and Sensor Fusion: In robotics, Bayes' rule is used for sensor fusion. It combines
data from multiple sensors to estimate the state of a robot or its environment. This is necessary
for tasks like localization and mapping.
10. Medical Diagnosis: In healthcare, Bayes' theorem is used in medical decision support systems
to update the likelihood of various diagnoses based on patient symptoms, test results, and
medical history.
5)What is Inferential Statistics?
Inferential statistics is an important tool that allows us to make predictions and conclusions about a
population based on sample data. Unlike descriptive statistics, which only summarizes data,
inferential statistics lets us test hypotheses, make estimates and measure the uncertainty about our
predictions. These tools are essential for evaluating models, testing assumptions and supporting
data-driven decision-making.
For example, instead of surveying every voter in a country, we can survey a few thousand and still
make reliable conclusions about the entire population’s opinion. Inferential statistics provides the
tools to do this in a systematic and mathematical way.
Descriptive and Inferential Statistics

Why Do We Need Inferential Statistics?


In real-world scenarios, analyzing an entire population is often impossible. Instead, we collect data
from a sample and use inferential statistics to:
 Draw conclusions about the whole population.
 Test claims or hypotheses.
 Calculate confidence intervals and p-values to measure uncertainty.
 Make predictions with statistical models.
Techniques in Inferential Statistics
Inferential statistics offers several key methods for testing hypotheses, estimating population
parameters and making predictions. Here are the major techniques:
1. Confidence Intervals: It gives us a range of values that likely includes the true population
parameter. It helps quantify the uncertainty of an estimate. The formula for calculating a
confidence interval for the mean is:
$$CI = \bar{x} \pm Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$$
Where:
 $\bar{x}$ is the sample mean
 $Z_{\alpha/2}$ is the Z-value from the standard normal distribution (e.g., 1.96 for a 95% confidence
interval)
 $\sigma$ is the population standard deviation
 $n$ is the sample size
For example, if we measure the average height of 100 people, a 95% confidence interval gives us a
range where the true population mean height is likely to fall. This helps gauge the precision of our
estimate and compare models (like in A/B testing).
2. Hypothesis Testing: Hypothesis testing is a formal procedure for testing claims or assumptions
about data. It involves the following steps:
 Null Hypothesis (H₀): The default assumption, such as “there’s no difference between two
models.”
 Alternative Hypothesis (H₁): The claim you aim to prove, such as “Model A performs better
than Model B.”
We collect data and compute a test statistic (such as Z for a Z-test or t for a T-test):
$$Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
Where:
 $\bar{x}$ is the sample mean
 $\mu_0$ is the hypothesized population mean
 $\sigma$ is the population standard deviation
 $n$ is the sample size
After calculating the test statistic, we compare it with a critical value or use a p-value to decide
whether to reject or fail to reject the null hypothesis. If the p-value is smaller than the significance
level $\alpha$ (usually 0.05), we reject the null hypothesis. For a two-sided test:
$$p\text{-value} = 2 \cdot P(Z > |z_{obs}|)$$
Where $z_{obs}$ is the observed test statistic. A small p-value suggests strong evidence against the
null hypothesis.
3. Central Limit Theorem: It states that the distribution of the sample mean will approximate a
normal distribution as the sample size increases, regardless of the original population distribution.
This is crucial because many statistical methods assume that data is normally distributed. The CLT
can be mathematically expressed as:
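$$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
Here the second argument is the standard deviation of the sample mean (the standard error).
Where:
 $\mu$ is the population mean
 $\sigma$ is the population standard deviation
 $n$ is the sample size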
This theorem allows us to apply normal distribution-based methods even when the original data is
not normally distributed, such as in cases with skewed income or shopping behavior data.
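The confidence interval and Z-test formulas above can be sketched with Python's standard library. The function names and the sample numbers (100 people, mean height 170 cm, known standard deviation 10 cm, hypothesized mean 168 cm) are assumptions for illustration, and the test shown is two-sided.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

def z_test_two_sided(sample_mean, mu0, sigma, n):
    """Z statistic and two-sided p-value for H0: population mean equals mu0."""
    z_obs = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z_obs)))
    return z_obs, p_value

low, high = z_confidence_interval(170.0, 10.0, 100)
z_obs, p_value = z_test_two_sided(170.0, 168.0, 10.0, 100)
reject_h0 = p_value < 0.05

print("95% CI:", low, high)
print("z:", z_obs, "p-value:", p_value, "reject H0:", reject_h0)
```

With these numbers the interval is roughly 168.04 to 171.96 and the hypothesized mean of 168 falls just outside it, which agrees with the p-value being slightly below 0.05.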
Errors in Inferential Statistics
In hypothesis testing Type I Error and Type II Error are key concepts:
 Type I Error occurs when we wrongly reject a true null hypothesis. The probability of making
a Type I error is denoted by $\alpha$ (the significance level).
 Type II Error occurs when we fail to reject a false null hypothesis. The probability of making
a Type II error is denoted by $\beta$ and the power of the test is given by $1 - \beta$.
The goal is to minimize these errors by carefully selecting sample sizes and significance levels.
Parametric and Non-Parametric Tests
Statistical tests help decide if data supports a hypothesis. They calculate a test statistic that shows
how much the data differs from the assumption (null hypothesis). This is compared to a critical
value, or converted to a p-value, to decide whether to reject the null hypothesis.
1. Parametric Tests: These tests assume that the data follows a specific distribution (often
normal) and has consistent variance. They are typically used for continuous data. Examples
include the Z-test, T-test and ANOVA. These tests are effective for comparing models or
measuring performance when the assumptions are met.
2. Non-Parametric Tests: Non-Parametric tests do not assume a specific distribution for the data,
making them ideal for small samples or non-normal data, including categorical or ranked data.
Examples include the Chi-Square test, Mann-Whitney U test and Kruskal-Wallis test. They are
useful when data is skewed or categorical, such as customer ratings or behaviors.
Example: Evaluating a New Delivery Algorithm Using Inferential Statistics
A quick commerce company wants to check if a new delivery algorithm reduces delivery times
compared to the current system.
Experiment Setup:
 100 orders split into two groups: 50 with the new algorithm, 50 with the current system.
 Delivery times for both groups are recorded.
Steps
Hypotheses:
 Null (H0): New algorithm does not reduce delivery time.
 Alternative (H1): New algorithm reduces delivery time.
Significance Level:
Set at 0.05 (5% risk of wrongly rejecting H0).
 Type I error: Thinking the new system is better when it isn’t.
 Type II error: Missing a real improvement.
Test Statistic: Compare average delivery times between the two groups
Analysis:
 Calculate means and differences.
 Check if data is roughly normal.
Perform a t-test or z-test
If p-value < 0.05, reject H0 and conclude the new algorithm is better. Otherwise, no clear
improvement.
Confidence Interval: For example, a range of -5 to -2 minutes means deliveries are 2 to 5 minutes
faster with the new system.
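The steps of this example can be sketched as follows. The delivery times are simulated (not real data), the function name `one_sided_two_sample_z` is hypothetical, and a normal approximation is used in place of an exact t-test, which is a reasonable assumption only because each group has 50 observations.

```python
import math
import random
import statistics

def one_sided_two_sample_z(new, current):
    """Test H1: mean(new) < mean(current) using a normal approximation."""
    diff = statistics.mean(new) - statistics.mean(current)
    se = math.sqrt(statistics.variance(new) / len(new)
                   + statistics.variance(current) / len(current))
    z = diff / se
    p_value = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))  # P(Z <= z), lower tail
    ci = (diff - 1.96 * se, diff + 1.96 * se)            # 95% CI for the difference
    return diff, z, p_value, ci

random.seed(0)
# Simulated delivery times in minutes, 50 orders per group
new_times = [random.gauss(25.0, 3.0) for _ in range(50)]
current_times = [random.gauss(30.0, 3.0) for _ in range(50)]

diff, z, p_value, ci = one_sided_two_sample_z(new_times, current_times)
print(f"mean difference = {diff:.2f} min, z = {z:.2f}, p = {p_value:.4g}")
print(f"95% CI for difference: ({ci[0]:.2f}, {ci[1]:.2f})")
```

A confidence interval that lies entirely below zero corresponds to the "deliveries are faster" conclusion described above.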
6) Explain the concept of Logistic regression in Classification tasks.

Introduction to Logistic Regression

Logistic regression is a supervised learning classification algorithm used to predict the probability of
a target variable. The nature of target or dependent variable is dichotomous, which means there
would be only two possible classes.

In simple words, the dependent variable is binary in nature having data coded as either 1 (stands for
success/yes) or 0 (stands for failure/no).

Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the
simplest ML algorithms that can be used for various classification problems such as spam detection,
diabetes prediction, cancer detection etc.

Types of Logistic Regression

Generally, logistic regression means binary logistic regression with a binary target variable, but
there are two more categories of target variables that it can predict. Based on the number of
categories, logistic regression can be divided into the following types −

Binary or Binomial

In such a kind of classification, a dependent variable will have only two possible types, either 1 or 0.
For example, these variables may represent success or failure, yes or no, win or loss etc.

Multinomial

In such a kind of classification, dependent variable can have 3 or more possible unordered types or
the types having no quantitative significance. For example, these variables may represent "Type A"
or "Type B" or "Type C".

Ordinal
In such a kind of classification, the dependent variable can have 3 or more possible ordered types or
types having a quantitative significance. For example, these variables may represent "poor", "good",
"very good" or "excellent", and each category can have a score like 0, 1, 2, 3.

Logistic Regression Assumptions

Before diving into the implementation of logistic regression, we must be aware of the following
assumptions −

 In case of binary logistic regression, the target variable must always be binary and the desired
outcome is represented by the factor level 1.
 There should not be any multi-collinearity in the model, which means the independent variables must
not be highly correlated with each other.
 We must include meaningful variables in our model.
 We should choose a large sample size for logistic regression.

Binary Logistic Regression Model

The simplest form of logistic regression is binary or binomial logistic regression in which the target
or dependent variable can have only 2 possible types, either 1 or 0. It allows us to model a
relationship between multiple predictor variables and a binary/binomial target variable. In logistic
regression, the linear function is used as the input to another function g, as in the
following relation −

$$h_\theta(x) = g(\theta^T x), \quad 0 \le h_\theta(x) \le 1$$

Here, g is the logistic or sigmoid function which can be given as follows −

$$g(z) = \frac{1}{1 + e^{-z}}, \quad \text{where } z = \theta^T x$$

The sigmoid curve can be represented with the help of the following graph. We can see that the values
on the y-axis lie between 0 and 1 and that the curve crosses the y-axis at 0.5.

The classes can be divided into positive or negative. The output is interpreted as the probability of
the positive class, which lies between 0 and 1. For our implementation, we interpret the output of the
hypothesis function as positive if it is ≥ 0.5, otherwise negative.

We also need to define a loss function to measure how well the algorithm performs using the weights,
represented by theta, as follows −

$$h = g(X\theta)$$
$$J(\theta) = \frac{1}{m} \left( -y^T \log(h) - (1 - y)^T \log(1 - h) \right)$$
Now, after defining the loss function our prime goal is to minimize the loss function. It can be done
with the help of fitting the weights which means by increasing or decreasing the weights. With the
help of derivatives of the loss function w.r.t each weight, we would be able to know what parameters
should have high weight and what should have smaller weight.

The following gradient descent equation tells us how loss would change if we modified the
parameters −

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} X^T \left( g(X\theta) - y \right)$$
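One way to gain confidence in the gradient formula is to compare it with a finite-difference approximation of the loss. The sketch below assumes NumPy and uses randomly generated (hypothetical) data; the function names are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / y.size

def gradient(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / y.size

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # intercept column + 2 features
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3) * 0.1

# Central finite differences: dJ/dtheta_j ~ (J(theta + eps*e_j) - J(theta - eps*e_j)) / (2*eps)
eps = 1e-6
numeric = np.array([
    (loss(theta + eps * e, X, y) - loss(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
max_diff = np.max(np.abs(numeric - gradient(theta, X, y)))
print("max difference:", max_diff)
```

A very small difference indicates that the analytical gradient and the loss function agree with each other.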

Implementation of Binary Logistic Regression Model in Python

Now we will implement the above concept of binomial logistic regression in Python. For this
purpose, we are using a multivariate flower dataset named iris, which has 3 classes of 50 instances
each. Every class represents a type of iris flower. We will use only the first two feature columns
and turn the task into a binary one: class 0 versus the other two classes.

First, we need to import the necessary libraries as follows −

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

Next, load the iris dataset as follows −

iris = datasets.load_iris()
X = iris.data[:, :2]
y = (iris.target != 0) * 1

We can plot our training data as follows −

plt.figure(figsize=(6, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='g', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='y', label='1')
plt.legend();
Next, we will define the sigmoid function, loss function and gradient descent as follows −

class LogisticRegression:
    def __init__(self, lr=0.01, num_iter=100000, fit_intercept=True, verbose=False):
        self.lr = lr                      # learning rate
        self.num_iter = num_iter          # number of gradient descent steps
        self.fit_intercept = fit_intercept
        self.verbose = verbose
    def __add_intercept(self, X):
        # Prepend a column of ones so that theta[0] acts as the intercept
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)
    def __sigmoid(self, z):
        return 1 / (1 + np.exp(-z))
    def __loss(self, h, y):
        return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
    def fit(self, X, y):
        if self.fit_intercept:
            X = self.__add_intercept(X)

Now, initialize the weights as follows −

        self.theta = np.zeros(X.shape[1])
        for i in range(self.num_iter):
            z = np.dot(X, self.theta)
            h = self.__sigmoid(z)
            gradient = np.dot(X.T, (h - y)) / y.size
            self.theta -= self.lr * gradient
            if self.verbose and i % 10000 == 0:
                # Recompute the loss with the updated weights for reporting
                z = np.dot(X, self.theta)
                h = self.__sigmoid(z)
                print(f'loss: {self.__loss(h, y)} \t')

With the help of the following script, we can predict the output probabilities −

    def predict_prob(self, X):
        if self.fit_intercept:
            X = self.__add_intercept(X)
        return self.__sigmoid(np.dot(X, self.theta))
    def predict(self, X):
        return self.predict_prob(X).round()

Next, we can evaluate the model and plot it as follows −

model = LogisticRegression(lr=0.1, num_iter=300000)
# Train the model before predicting; without this call, theta is undefined
model.fit(X, y)

preds = model.predict(X)
(preds == y).mean()

plt.figure(figsize=(10, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='g', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='y', label='1')
plt.legend()
x1_min, x1_max = X[:, 0].min(), X[:, 0].max()
x2_min, x2_max = X[:, 1].min(), X[:, 1].max()
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max), np.linspace(x2_min, x2_max))
grid = np.c_[xx1.ravel(), xx2.ravel()]
probs = model.predict_prob(grid).reshape(xx1.shape)
plt.contour(xx1, xx2, probs, [0.5], linewidths=1, colors='red');

Multinomial Logistic Regression Model


Another useful form of logistic regression is multinomial logistic regression in which the target or
dependent variable can have 3 or more possible unordered types i.e. the types having no quantitative
significance.
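The source does not spell out how probabilities are produced when there are more than two classes. One common formulation, assumed here for illustration, replaces the sigmoid with the softmax function, which turns one linear score per class into probabilities that sum to 1. The scores below are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Convert a vector of class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical linear scores theta_k^T x for three classes ("Type A", "Type B", "Type C")
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
predicted_class = int(np.argmax(probs))
print(probs, predicted_class)
```

The predicted class is simply the one with the highest probability, which here is the class with the highest score.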

Implementation of Multinomial Logistic Regression Model in Python

Now we will implement the above concept of multinomial logistic regression in Python. For this
purpose, we are using a dataset from sklearn named digits.

First, we need to import the necessary libraries as follows −

import sklearn
from sklearn import datasets
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split

Next, we need to load the digits dataset −

digits = datasets.load_digits()

Now, define the feature matrix (X) and response vector (y) as follows −

X = digits.data
y = digits.target

With the help of next line of code, we can split X and y into training and testing sets −

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

Now create an object of logistic regression as follows −

digreg = linear_model.LogisticRegression()

Now, we need to train the model by using the training sets as follows −

digreg.fit(X_train, y_train)

Next, make the predictions on testing set as follows −

y_pred = digreg.predict(X_test)

Next print the accuracy of the model as follows −

print("Accuracy of Logistic Regression model is:",
      metrics.accuracy_score(y_test, y_pred)*100)

Output

Accuracy of Logistic Regression model is: 95.6884561891516


From the above output we can see the accuracy of our model is around 96 percent.

Unit-4
1)Types Of Activation Function in ANN


The biological neural network has been modeled in the form of Artificial Neural Networks with
artificial neurons simulating the function of a biological neuron. The artificial neuron is depicted in
the below picture:

Structure of an Artificial Neuron

Each neuron consists of three major components:


1. A set of n synapses, each having a weight $w_i$. A signal $x_i$ forms the input to the i-th synapse
having weight $w_i$. The value of any weight may be positive or negative. A positive weight has an
excitatory effect, while a negative weight has an inhibitory effect on the output of the
summation junction.
2. A summation junction for the input signals, each weighted by the respective synaptic weight.
Because it is a linear combiner or adder of the weighted input signals, the output of the
summation junction can be expressed as follows:
$$y_{sum} = \sum_{i=1}^{n} w_i x_i$$
3. A threshold activation function (or simply the activation function, also known as squashing
function) results in an output signal only when an input signal exceeding a specific threshold
value comes as an input. It is similar in behaviour to the biological neuron which transmits the
signal only when the total input signal meets the firing threshold.
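The three components above can be sketched as a single function. The weights, inputs and threshold below are hypothetical values chosen only to show one firing and one non-firing case.

```python
def neuron_output(inputs, weights, threshold=0.0):
    """Weighted sum followed by a threshold activation."""
    y_sum = sum(w * x for w, x in zip(weights, inputs))   # summation junction
    return 1 if y_sum >= threshold else 0                 # fires only at or above the threshold

# Hypothetical values: two excitatory (positive) weights and one inhibitory (negative) weight
weights = [0.6, 0.4, -0.5]
fires = neuron_output([1, 1, 0], weights, threshold=0.5)      # y_sum = 1.0
stays_off = neuron_output([1, 0, 1], weights, threshold=0.5)  # y_sum is about 0.1
print(fires, stays_off)
```

In the second call the inhibitory weight pulls the weighted sum below the threshold, so the neuron does not fire.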
Types of Activation Function :
There are different types of activation functions. The most commonly used activation functions are
listed below:
A. Identity Function: Identity function is used as an activation function for the input layer. It is a
linear function having the form
$$y_{out} = f(x) = x, \quad \forall x$$
As is obvious, the output remains the same as the input.
B. Threshold/step Function: It is a commonly used activation function. It gives 1 as output if the
input is either 0 or positive. If the input is negative, it gives 0 as output.
Expressing it mathematically,
$$y_{out} = f(y_{sum}) = \begin{cases} 1, & y_{sum} \ge 0 \\ 0, & y_{sum} < 0 \end{cases}$$
The threshold function is almost like the step function, with the only difference being that
$\theta$ is used as the threshold value instead of 0. Expressing it mathematically,
$$y_{out} = f(y_{sum}) = \begin{cases} 1, & y_{sum} \ge \theta \\ 0, & y_{sum} < \theta \end{cases}$$

C. ReLU (Rectified Linear Unit) Function: It is the most popularly used activation function in the
areas of convolutional neural networks and deep learning. It is of the form:
$$f(x) = \begin{cases} x, & x \ge 0 \\ 0, & x < 0 \end{cases}$$

This means that f(x) is zero when x is less than zero and f(x) is equal to x when x is above or equal to
zero. This function is differentiable, except at a single point x = 0. In that sense, the derivative of a
ReLU is actually a sub-derivative.
D. Sigmoid Function: It is one of the most commonly used activation functions in neural networks.
The need for the sigmoid function stems from the fact that many learning algorithms require the
activation function to be differentiable and hence continuous. There are two types of sigmoid
function:
1. Binary Sigmoid Function
A binary sigmoid function is of the form:
$$y_{out} = f(x) = \frac{1}{1 + e^{-kx}}$$
where k is the steepness or slope parameter. By varying the value of k, sigmoid functions with
different slopes can be obtained. It has a range of (0, 1). The slope at the origin is k/4. As the
value of k becomes very large, the sigmoid function approaches a threshold function.
2. Bipolar Sigmoid Function

A bipolar sigmoid function is of the form
$$y_{out} = f(x) = \frac{1 - e^{-kx}}{1 + e^{-kx}}$$

The range of values of sigmoid functions can be varied depending on the application. However, the
range of (-1, +1) is most commonly adopted.
E. Hyperbolic Tangent Function: It is bipolar in nature. It is a widely adopted activation function
for a special type of neural network known as the Backpropagation Network. The hyperbolic tangent
function is of the form

$$y_{out} = f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
This function is similar to the bipolar sigmoid function.
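The activation functions listed above can be written out directly, which also makes the relationship between tanh and the bipolar sigmoid concrete: algebraically, tanh(x) equals the bipolar sigmoid with k = 2. The sketch below uses only the standard library; the function names are illustrative.

```python
import math

def identity(x):
    return x

def step(x, theta=0.0):
    return 1.0 if x >= theta else 0.0

def relu(x):
    return x if x >= 0 else 0.0

def binary_sigmoid(x, k=1.0):
    return 1.0 / (1.0 + math.exp(-k * x))

def bipolar_sigmoid(x, k=1.0):
    return (1.0 - math.exp(-k * x)) / (1.0 + math.exp(-k * x))

for x in (-2.0, 0.0, 2.0):
    print(x, step(x), relu(x), round(binary_sigmoid(x), 4),
          round(bipolar_sigmoid(x), 4), round(math.tanh(x), 4))

# tanh coincides with the bipolar sigmoid when k = 2
same = all(abs(bipolar_sigmoid(x, k=2.0) - math.tanh(x)) < 1e-12
           for x in (-2.0, -0.5, 0.0, 1.5))
print(same)
```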

2)What is the motivation behind using neural networks for learning the concept? Explain
briefly.

The primary motivation behind using neural networks is their ability to mimic the human brain in
learning patterns and making intelligent decisions from data. Neural networks consist of layers of
interconnected nodes (neurons) that can process information in a non-linear and adaptive manner.
They are particularly useful for learning concepts where explicit programming is difficult or
impossible due to the complexity of the data.
Key Motivations:

1. Pattern Recognition:
Neural networks excel at recognizing patterns in complex and high-dimensional data like
images, speech, and text.
2. Non-linear Processing:
They can model non-linear relationships between inputs and outputs, which many
traditional algorithms cannot.
3. Learning from Experience:
Neural networks learn from examples, making them effective for problems where rules are
not clearly defined.
4. Generalization Ability:
After training, neural networks can generalize knowledge to handle new, unseen data
effectively.
5. Adaptability:
They are highly flexible and can be adapted to various tasks, including classification,
regression, prediction, and control systems.
6. Noise Tolerance:
Neural networks can perform well even when the input data contains noise or incomplete
information.
7. Feature Extraction:
Deep neural networks automatically learn important features from raw data, reducing the
need for manual feature engineering.
8. Parallel Processing:
They support parallel processing, which improves training efficiency, especially on GPUs.
9. Real-World Success:
Neural networks have achieved remarkable success in areas like image recognition, natural
language processing, and autonomous vehicles.
10. Continuous Improvement:
With more data and computation, neural networks continue to improve their performance
over time.

Conclusion:

Neural networks provide a powerful, flexible, and data-driven approach to learning concepts,
especially when the data is complex and traditional methods are inadequate.

3)Support Vector Machine (SVM) Algorithm


Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification
and regression tasks. It tries to find the best boundary known as hyperplane that separates different
classes in the data. It is useful when you want to do binary classification like spam vs. not spam or
cat vs. dog.
The main goal of SVM is to maximize the margin between the two classes. The larger the margin
the better the model performs on new and unseen data.

Key Concepts of Support Vector Machine

 Hyperplane: A decision boundary separating different classes in feature space, represented by
the equation $w^T x + b = 0$ in linear classification.
 Support Vectors: The closest data points to the hyperplane, crucial for determining the
hyperplane and margin in SVM.
 Margin: The distance between the hyperplane and the support vectors. SVM aims to maximize
this margin for better classification performance.
 Kernel: A function that maps data to a higher-dimensional space enabling SVM to handle non-
linearly separable data.
 Hard Margin: A maximum-margin hyperplane that perfectly separates the data without
misclassifications.
 Soft Margin: Allows some misclassifications by introducing slack variables, balancing margin
maximization and misclassification penalties when data is not perfectly separable.
 C: A regularization term balancing margin maximization and misclassification penalties. A
higher C value forces stricter penalty for misclassifications.
 Hinge Loss: A loss function penalizing misclassified points or margin violations and is
combined with regularization in SVM.
 Dual Problem: Involves solving for Lagrange multipliers associated with support vectors,
facilitating the kernel trick and efficient computation.

How does Support Vector Machine Algorithm Work?

The key idea behind the SVM algorithm is to find the hyperplane that best separates two classes by
maximizing the margin between them. This margin is the distance from the hyperplane to the
nearest data points (support vectors) on each side.

Multiple hyperplanes separate the data from two classes

The best hyperplane, also known as the hard-margin hyperplane, is the one that maximizes the distance
between the hyperplane and the nearest data points from both classes. This ensures a clear
separation between the classes. So from the above figure, we choose L2 as the hard-margin
hyperplane. Let's consider a scenario like the one shown below:

Selecting hyperplane for data with outlier

Here, we have one blue ball inside the region of the red balls.
How does SVM classify the data?
The blue ball among the red ones is an outlier of the blue class. With a soft margin, the SVM
algorithm can ignore such an outlier and still find the hyperplane that maximizes the margin, which
makes it reasonably robust to isolated outliers.
Hyperplane which is the most optimized one

A soft margin allows for some misclassifications or violations of the margin to improve
generalization. The SVM optimizes the following equation to balance margin maximization and
penalty minimization:
$$\text{Objective Function} = \frac{1}{\text{margin}} + \lambda \sum \text{penalty}$$
The penalty used for violations is often hinge loss which has the following behavior:
 If a data point is correctly classified and lies on or outside the margin, there is no penalty
(loss = 0).
 If a point is incorrectly classified or violates the margin, the hinge loss increases proportionally
to the distance of the violation.
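A minimal sketch of this penalty, assuming the standard hinge loss definition max(0, 1 - y f(x)) with labels of +1 or -1, where f(x) is the raw decision value:

```python
def hinge_loss(label, score):
    """label is +1 or -1; score is the raw decision value w^T x + b."""
    return max(0.0, 1.0 - label * score)

outside_margin = hinge_loss(+1, 2.0)    # correctly classified, beyond the margin
inside_margin = hinge_loss(+1, 0.5)     # correctly classified, but inside the margin
misclassified = hinge_loss(+1, -1.0)    # on the wrong side of the hyperplane
print(outside_margin, inside_margin, misclassified)
```

The loss is zero only once a point is on the correct side with a decision value of at least 1, and it grows linearly as the point moves further in the wrong direction.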
Till now we were talking about linearly separable data, where a straight line separates the group of
blue balls from the red balls.

What to do if data are not linearly separable?

When data is not linearly separable, i.e. it can't be divided by a straight line, SVM uses a technique
called kernels to map the data into a higher-dimensional space where it becomes separable. This
transformation helps SVM find a decision boundary even for non-linear data.

Original 1D dataset for classification

A kernel is a function that maps data points into a higher-dimensional space without explicitly
computing the coordinates in that space. This allows SVM to work efficiently with non-linear data
by implicitly performing the mapping. For example consider data points that are not linearly
separable. By applying a kernel function SVM transforms the data points into a higher-dimensional
space where they become linearly separable.
 Linear Kernel: For linear separability.
 Polynomial Kernel: Maps data into a polynomial space.
 Radial Basis Function (RBF) Kernel: Transforms data into a space based on distances
between data points.
Mapping 1D data to 2D to become able to separate the two classes

In this case the new variable y is created as a function of distance from the origin.
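The 1D-to-2D mapping can be illustrated with an explicit feature map. The data points below are hypothetical, and the map x -> (x, x^2) is one simple choice in which the new coordinate depends on the distance from the origin; a kernel SVM would perform such a mapping implicitly.

```python
# Hypothetical 1D points: class 1 lies near the origin, class 0 lies far from it
points = [-4.0, -3.0, -1.0, 0.0, 1.0, 3.0, 4.0]
labels = [0, 0, 1, 1, 1, 0, 0]

# Explicit feature map x -> (x, x**2); the new coordinate grows with distance from the origin
mapped = [(x, x ** 2) for x in points]

# In the mapped space, the horizontal line y = 4 separates the two classes
predictions = [1 if y_new < 4.0 else 0 for _, y_new in mapped]
print(predictions == labels)
```

No single threshold on x alone separates these classes, but a straight line in the (x, x^2) plane does.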

Mathematical Computation of SVM

Consider a binary classification problem with two classes, labeled as +1 and -1. We have a training
dataset consisting of input feature vectors X and their corresponding class labels Y. The equation
for the linear hyperplane can be written as:
$$w^T x + b = 0$$
Where:
 $w$ is the normal vector to the hyperplane (the direction perpendicular to it).
 $b$ is the offset or bias term; the hyperplane lies at a distance $|b| / \|w\|$ from the origin
along the normal vector $w$.
Distance from a Data Point to the Hyperplane
The distance between a data point $x_i$ and the decision boundary can be calculated as:
$$d_i = \frac{w^T x_i + b}{\|w\|}$$
where $\|w\|$ represents the Euclidean norm of the weight vector $w$.
Linear SVM Classifier
The predicted label is obtained from the sign of the decision function:
$$\hat{y} = \begin{cases} 1, & w^T x + b \ge 0 \\ 0, & w^T x + b < 0 \end{cases}$$
Where $\hat{y}$ is the predicted label of a data point (here 0 corresponds to the class labeled -1).
Optimization Problem for SVM
For a linearly separable dataset the goal is to find the hyperplane that maximizes the margin
between the two classes while ensuring that all data points are correctly classified. This leads to the
following optimization problem:
$$\min_{w, b} \; \frac{1}{2} \|w\|^2$$
Subject to the constraint:
$$y_i (w^T x_i + b) \ge 1 \quad \text{for } i = 1, 2, 3, \ldots, m$$
Where:
 $y_i$ is the class label (+1 or -1) for each training instance.
 $x_i$ is the feature vector for the i-th training instance.
 $m$ is the total number of training instances.
The condition $y_i (w^T x_i + b) \ge 1$ ensures that each data point is correctly classified and lies
on or outside the margin.
Soft Margin in Linear SVM Classifier
In the presence of outliers or non-separable data the SVM allows some misclassification by
introducing slack variables $\zeta_i$. The optimization problem is modified as:
$$\min_{w, b} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \zeta_i$$
Subject to the constraints:
$$y_i (w^T x_i + b) \ge 1 - \zeta_i \quad \text{and} \quad \zeta_i \ge 0 \quad \text{for } i = 1, 2, \ldots, m$$
Where:
 C is a regularization parameter that controls the trade-off between margin maximization and
penalty for misclassifications.
 ζi are slack variables that represent the degree of violation of the margin by each data point.
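To make the soft-margin objective concrete, the sketch below evaluates it for a hand-picked hyperplane on a tiny hypothetical dataset. It assumes NumPy and uses the fact that, for fixed w and b, the smallest slack satisfying the constraints is max(0, 1 - y_i (w^T x_i + b)). The hyperplane is not optimized; the example only shows how the terms are computed.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for labels y in {+1, -1}."""
    margins = y * (X @ w + b)
    slacks = np.maximum(0.0, 1.0 - margins)      # smallest zeta_i satisfying the constraints
    return 0.5 * np.dot(w, w) + C * slacks.sum(), slacks

# Hypothetical 2D data and a hand-picked (not optimized) hyperplane
X = np.array([[2.0, 2.0], [3.0, 1.0], [0.0, 0.0], [1.5, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 1.0])
b = -3.0

objective, slacks = soft_margin_objective(w, b, X, y, C=1.0)
distances = (X @ w + b) / np.linalg.norm(w)      # signed distance of each point to the hyperplane
print(objective, slacks, distances)
```

Only the last point, which sits exactly on the hyperplane, needs a non-zero slack; increasing C would make that violation more expensive relative to the margin term.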
Dual Problem for SVM
The dual problem involves optimizing over the Lagrange multipliers associated with the training
samples. This transformation allows solving the SVM optimization using kernel functions for
non-linear classification.
The dual objective function is given by:
$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j t_i t_j K(x_i, x_j)$$
Where:
 $\alpha_i$ is the Lagrange multiplier associated with the i-th training sample.
 $t_i$ is the class label for the i-th training sample (written $y_i$ above).
 $K(x_i, x_j)$ is the kernel function that computes the similarity between data points $x_i$ and
$x_j$. The kernel allows SVM to handle non-linear classification problems by mapping data into a
higher-dimensional space.
The dual formulation optimizes the Lagrange multipliers $\alpha_i$, and the support vectors are those
training samples where $\alpha_i > 0$.
SVM Decision Boundary
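Once the dual problem is solved, the decision function is given by:
$$f(x) = \sum_{i=1}^{m} \alpha_i t_i K(x_i, x) + b$$
Where $x$ is the test data point and $b$ is the bias term; in the linear case the weight vector is
$w = \sum_{i=1}^{m} \alpha_i t_i x_i$. The decision boundary is the set of points where $f(x) = 0$.
Finally the bias term $b$ is determined by the support vectors on the margin, which satisfy:
$$t_i (w^T x_i + b) = 1 \;\Rightarrow\; b = t_i - w^T x_i$$
Where $x_i$ is any support vector lying exactly on the margin.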
This completes the mathematical framework of the Support Vector Machine algorithm which
allows for both linear and non-linear classification using the dual problem and kernel trick.
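The kernelized decision function can be sketched as follows, assuming NumPy. The support vectors, multipliers and bias are hypothetical hand-picked values, not the output of an actual dual solver, and the RBF kernel with gamma = 0.5 is just one possible choice.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def decision_function(x, support_vectors, alphas, labels, b, gamma=0.5):
    """f(x) = sum_i alpha_i * t_i * K(x_i, x) + b."""
    return sum(a * t * rbf_kernel(sv, x, gamma)
               for sv, a, t in zip(support_vectors, alphas, labels)) + b

# Hypothetical support vectors and multipliers (not the result of actual training)
support_vectors = np.array([[0.0, 0.0], [2.0, 2.0]])
alphas = np.array([1.0, 1.0])
labels = np.array([-1.0, 1.0])
b = 0.0

near_negative = decision_function(np.array([0.1, 0.0]), support_vectors, alphas, labels, b)
near_positive = decision_function(np.array([1.9, 2.1]), support_vectors, alphas, labels, b)
print(near_negative, near_positive)
```

Each test point takes the sign of the support vector it is most similar to. With scikit-learn, a fitted `SVC` exposes the corresponding quantities as `support_vectors_`, `dual_coef_` (the products of multiplier and label) and `intercept_`.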

Types of Support Vector Machine

Based on the nature of the decision boundary, Support Vector Machines (SVM) can be divided into
two main types:
 Linear SVM: Linear SVMs use a linear decision boundary to separate the data points of
different classes. When the data can be precisely linearly separated, linear SVMs are very
suitable. This means that a single straight line (in 2D) or a hyperplane (in higher dimensions)
can entirely divide the data points into their respective classes. A hyperplane that maximizes
the margin between the classes is the decision boundary.
 Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be separated
into two classes by a straight line (in the case of 2D). By using kernel functions, nonlinear
SVMs can handle nonlinearly separable data. The original input data is transformed by these
kernel functions into a higher-dimensional feature space where the data points can be linearly
separated. A linear SVM is used to locate a nonlinear decision boundary in this modified
space.

Implementing SVM Algorithm in Python

Predict whether a tumor is benign or malignant. Historical data about patients diagnosed with cancer,
described by independent attributes, enables doctors to differentiate malignant cases from benign ones.
 Load the breast cancer dataset from sklearn.datasets
 Separate input features and target variables.
 Build and train the SVM classifier using the RBF kernel.
 Plot the scatter plot of the input features.
# Load the important packages
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.svm import SVC

# Load the dataset (only the first two features, so the result can be plotted in 2D)
cancer = load_breast_cancer()
X = cancer.data[:, :2]
y = cancer.target

# Build the model
svm = SVC(kernel="rbf", gamma=0.5, C=1.0)
# Train the model
svm.fit(X, y)

# Plot the decision boundary
DecisionBoundaryDisplay.from_estimator(
    svm,
    X,
    response_method="predict",
    cmap=plt.cm.Spectral,
    alpha=0.8,
    xlabel=cancer.feature_names[0],
    ylabel=cancer.feature_names[1],
)

# Scatter plot of the training points
plt.scatter(X[:, 0], X[:, 1],
            c=y,
            s=20, edgecolors="k")
plt.show()
Output:
Figure: Breast cancer classification with SVM RBF kernel

Advantages of Support Vector Machine (SVM)

1. High-Dimensional Performance: SVM excels in high-dimensional spaces, making it suitable
for image classification and gene expression analysis.
2. Nonlinear Capability: Utilizing kernel functions like RBF and polynomial, SVM effectively
handles nonlinear relationships.
3. Outlier Resilience: The soft margin feature allows SVM to tolerate some outliers, enhancing
robustness in spam detection and anomaly detection.
4. Binary and Multiclass Support: SVM is effective for both binary classification and multiclass
classification suitable for applications in text classification.
5. Memory Efficiency: It focuses on support vectors making it memory efficient compared to
other algorithms.

Disadvantages of Support Vector Machine (SVM)

1. Slow Training: SVM can be slow to train on large datasets, which limits its use in large-scale
data mining tasks.
2. Parameter Tuning Difficulty: Selecting the right kernel and adjusting parameters like C
requires careful tuning, and poor choices can noticeably degrade accuracy.
3. Noise Sensitivity: SVM struggles with noisy datasets and overlapping classes, limiting
effectiveness in real-world scenarios.
4. Limited Interpretability: The complexity of the hyperplane in higher dimensions makes SVM
less interpretable than other models.
5. Feature Scaling Sensitivity: Proper feature scaling is essential, otherwise SVM models may
perform poorly.
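The feature scaling point can be checked empirically. The sketch below assumes scikit-learn is available and compares the same RBF classifier with and without standardization on the breast cancer dataset, using all features and a held-out test set. The split parameters are arbitrary choices, and the size of the gap, if any, depends on the split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Same RBF classifier, with and without standardizing the features first
unscaled = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)).fit(X_train, y_train)

acc_unscaled = unscaled.score(X_test, y_test)
acc_scaled = scaled.score(X_test, y_test)
print(f"unscaled: {acc_unscaled:.3f}, scaled: {acc_scaled:.3f}")
```

Putting the scaler inside a pipeline means it is fitted on the training data only, which avoids leaking test-set statistics into the model.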

4)What Defines Neural Network Architecture?


A neural network architecture represents the structure and organization of an artificial neural network
(ANN), which is a computational model inspired by the workings of a biological neural network.

Just like the human brain processes information through interconnected neurons, ANNs use layers of
artificial neurons to learn patterns and make predictions.

The architecture explains how data flows through the network, how neurons (units) are connected,
and how the network learns and makes predictions.

Here are the key components of a neural network architecture.

 Layers: Neural networks consist of layers of neurons, which include the input layers, hidden
layers, and output layers.

 Neurons (Nodes): Neurons are the basic computational units that perform a weighted sum of
their inputs, apply a bias, and pass the result through an activation function.

 Weights and biases: Weights represent the strength of the connections between neurons, and
biases allow neurons to make predictions even when all inputs are zero.

 Activation function: Non-linear functions (like ReLU and Sigmoid) are used to introduce
non-linearity into the network, enabling it to model complex relationships.
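These components can be seen together in a single forward pass. The sketch below assumes NumPy; the layer sizes, random weights and input values are hypothetical and the network is untrained.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Hypothetical sizes: 3 inputs -> 4 hidden neurons -> 1 output
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

x = np.array([0.5, -1.0, 2.0])          # input layer
hidden = relu(W1 @ x + b1)              # hidden layer: weighted sum + bias, then ReLU
output = sigmoid(W2 @ hidden + b2)      # output layer: squashed into the range 0 to 1
print(hidden.shape, output.shape, output)
```

Each row of a weight matrix holds the incoming connection strengths of one neuron, so adding neurons adds rows and adding layers adds matrices.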

Here’s how the architecture of neural networks defines its capabilities.

 Model’s capacity

A model with high depth (number of layers) and width (number of neurons in each layer) can handle
complex relationships. A network with only a few layers may struggle with tasks like image or
speech recognition.
 Efficiency

The neural network’s architecture affects the efficiency of the model. For instance, convolutional
neural networks (CNNs) have lower computational costs compared to fully connected networks.

 Optimization

The network’s structure affects its optimization. For instance, deeper networks may face issues like
vanishing gradients, where the gradients become too small for effective learning in early layers.

 Task-specific design

The architecture of networks can be tailored to specific tasks, such as CNNs for image
classification and RNNs for sequence prediction.
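The vanishing gradient issue mentioned under Optimization can be illustrated with simple arithmetic. This sketch assumes sigmoid activations and ignores the weights entirely, so it only shows an upper bound on the contribution of the activation derivatives.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# The derivative peaks at z = 0 with value 0.25, so each sigmoid layer scales
# the backpropagated gradient by at most 0.25 (weights are ignored in this sketch)
peak = sigmoid_derivative(0.0)
bounds = {depth: peak ** depth for depth in (1, 5, 10, 20)}
for depth, bound in bounds.items():
    print(f"{depth:2d} layers: gradient factor <= {bound:.2e}")
```

The bound shrinks geometrically with depth, which is why early layers of deep sigmoid networks tend to learn slowly.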

Let’s explore the ANN architecture briefly before moving ahead with neural networks.

What is ANN Architecture and Its Role in Neural Networks?

The Artificial Neural Network (ANN) architecture refers to the structured arrangement of nodes
(neurons) and layers that define how an artificial neural network processes and learns from data. The
design of ANN influences its ability to learn complex patterns and perform tasks efficiently.

Here’s the role of ANN architecture in neural networks.

 Task-specific design

The architecture is chosen based on the task and the type of data. For example, Convolutional Neural
Networks (CNNs) are suitable for image data, while Recurrent Neural Networks (RNNs) or
transformers are preferred for sequential tasks like speech and text analysis. CNNs, in particular,
leverage convolutional layers to detect spatial hierarchies in images, making the architecture of
CNN an essential factor in tasks like image classification.

 Capacity to learn complex patterns

A network's depth (number of hidden layers) and width (number of neurons per layer) affect its
capacity to capture complex relationships in the data. Deep architectures are effective for complex
tasks like image recognition and speech processing.

 Efficiency

The architecture affects the computational efficiency of the network. For example, a CNN's use of
shared weights in convolutional layers reduces the number of parameters and the computational cost.

 Optimization

The structure of the architecture affects how easily the network can be trained. For instance, deeper
networks face challenges like the vanishing gradient problem, where gradients become too small for
the early layers to learn effectively.

 Model generalization

The architecture influences the model's ability to generalize to unseen data. Complex architectures
with too many parameters can lead to overfitting, while simpler architectures may not capture enough
data complexity.
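
To make the efficiency point about shared weights concrete, the sketch below counts the trainable parameters of one fully connected layer and one convolutional layer. The layer sizes (a 28 x 28 grayscale image, 32 units, 32 filters of size 3 x 3) are hypothetical values chosen only for illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


def conv2d_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus biases of a 2-D convolutional layer.

    The same kernel is reused at every image position, so the count
    does not depend on the image size.
    """
    return kernel_h * kernel_w * in_channels * n_filters + n_filters


# Hypothetical sizes: a 28x28 grayscale image feeding 32 units / 32 filters.
dense_count = dense_params(28 * 28, 32)
conv_count = conv2d_params(3, 3, 1, 32)
print("dense layer parameters:", dense_count)
print("conv layer parameters: ", conv_count)
```

Under these assumed sizes the convolutional layer needs far fewer parameters than the dense layer, which is the effect of weight sharing described above.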

5) What is a Perceptron?
A perceptron is a type of neural network that performs binary classification: it maps input features
to an output decision, usually classifying data into one of two categories, such as 0 or 1.
A perceptron consists of a single layer of input nodes that are fully connected to a layer of output
nodes. It is particularly good at learning linearly separable patterns. It uses a variation of
artificial neurons called Threshold Logic Units (TLUs), which were first introduced by Warren McCulloch
and Walter Pitts in the 1940s. This foundational model has played a crucial role in the development
of more advanced neural networks and machine learning algorithms.

Types of Perceptron

1. Single-Layer Perceptron: this type of perceptron is limited to learning linearly separable
patterns. It is effective for tasks where the data can be divided into distinct categories by a
straight line (or, in higher dimensions, a hyperplane). While powerful in its simplicity, it
struggles with more complex problems where the relationship between inputs and outputs is non-linear.
2. Multi-Layer Perceptron: these networks have enhanced processing capabilities because they consist
of two or more layers, which lets them handle more complex patterns and relationships within the data.
Basic Components of Perceptron
A Perceptron is composed of key components that work together to process information and make
predictions.
 Input Features: The perceptron takes multiple input features, each representing a
characteristic of the input data.
 Weights: Each input feature is assigned a weight that determines its influence on the output.
These weights are adjusted during training to find the optimal values.
 Summation Function: The perceptron calculates the weighted sum of its inputs, combining
them with their respective weights.
 Activation Function: The weighted sum is passed through the Heaviside step function,
comparing it to a threshold to produce a binary output (0 or 1).
 Output: The final output is determined by the activation function, often used for binary
classification tasks.
 Bias: The bias term helps the perceptron make adjustments independent of the input, improving
its flexibility in learning.
 Learning Algorithm: The perceptron adjusts its weights and bias using a learning algorithm,
such as the Perceptron Learning Rule, to minimize prediction errors.
These components enable the perceptron to learn from data and make predictions. While a single
perceptron can handle simple binary classification, complex tasks require multiple perceptrons
organized into layers, forming a neural network.
How does Perceptron work?
A weight is assigned to each input node of a perceptron, indicating the importance of that input in
determining the output. The Perceptron’s output is calculated as a weighted sum of the inputs,
which is then passed through an activation function to decide whether the Perceptron will fire.
The weighted sum is computed as:
$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n = X^T W$
The step function compares this weighted sum to a threshold. If the weighted sum is greater than or
equal to the threshold value, the output is 1; otherwise, it is 0. The most common activation
function used in perceptrons is the Heaviside step function:
$h(z) = \begin{cases} 0 & \text{if } z < \text{Threshold} \\ 1 & \text{if } z \ge \text{Threshold} \end{cases}$
A perceptron consists of a single layer of Threshold Logic Units (TLU), with each TLU fully
connected to all input nodes.
[Figure: Threshold Logic Units]

In a fully connected layer, also known as a dense layer, all neurons in one layer are connected to
every neuron in the previous layer.
The output of the fully connected layer is computed as:
$f_{W,b}(X) = h(XW + b)$
where $X$ is the input, $W$ is the matrix of weights connecting the inputs to the neurons, $b$ is the
bias and $h$ is the step function.
During training, the Perceptron's weights are adjusted to minimize the difference between the
predicted output and the actual output. This is achieved using supervised learning algorithms like
the delta rule or the Perceptron learning rule.
The weight update formula is:
$w_{i,j} \leftarrow w_{i,j} + \eta\,(y_j - \hat{y}_j)\,x_i$
Where:
• $w_{i,j}$ is the weight between the $i$-th input and the $j$-th output neuron,
• $x_i$ is the $i$-th input value,
• $y_j$ is the actual value, and $\hat{y}_j$ is the predicted value,
• $\eta$ is the learning rate, controlling how much the weights are adjusted.
This process enables the perceptron to learn from data and improve its prediction accuracy over
time.
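
The step activation and the weight update rule above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the AND-gate data, the threshold of 0, the learning rate of 1.0 and the function names are all assumptions made for the example.

```python
def step(z, threshold=0.0):
    """Heaviside step: 1 if z reaches the threshold, otherwise 0."""
    return 1 if z >= threshold else 0


def predict(weights, bias, x):
    """Weighted sum of the inputs plus bias, passed through the step function."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return step(z)


def train_perceptron(samples, labels, lr=1.0, epochs=20):
    """Perceptron learning rule: w_i <- w_i + lr * (y - y_hat) * x_i."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error  # the bias behaves like a weight on a constant input of 1
    return weights, bias


# Assumed toy data: the logical AND function, which is linearly separable.
and_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]

and_weights, and_bias = train_perceptron(and_inputs, and_labels)
and_predictions = [predict(and_weights, and_bias, x) for x in and_inputs]
print(and_weights, and_bias, and_predictions)
```

Because AND is linearly separable, the rule settles on weights that classify all four inputs correctly; the same loop would not converge on a non-separable problem such as XOR.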

Example: Perceptron in Action

Let’s take a simple example of classifying whether a given fruit is an apple or not based on two
inputs: its weight (in grams) and its color (on a scale of 0 to 1, where 1 means red). The perceptron
receives these inputs, multiplies them by their weights, adds a bias, and applies the activation
function to decide whether the fruit is an apple or not.
 Input 1 (Weight): 150 grams
 Input 2 (Color): 0.9 (since the fruit is mostly red)
 Weights: [0.5, 1.0]
 Bias: 1.5
The perceptron’s weighted sum would be:
(150 ∗ 0.5) + (0.9 ∗ 1.0) + 1.5 = 75 + 0.9 + 1.5 = 77.4
Let’s assume the activation function uses a threshold of 75. Since 77.4 > 75, the perceptron
classifies the fruit as an apple (output = 1).
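
The arithmetic of this example can be checked with a short script. The function name `perceptron_output` is hypothetical; the inputs, weights, bias and threshold are the ones stated in the example. Evaluated term by term, 150 × 0.5 + 0.9 × 1.0 + 1.5 comes to 77.4.

```python
def perceptron_output(inputs, weights, bias, threshold):
    """Return the weighted sum and the binary decision of a single perceptron."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return z, (1 if z >= threshold else 0)


# Values from the fruit example: weight in grams and redness on a 0-1 scale.
fruit_sum, fruit_label = perceptron_output(
    inputs=[150, 0.9], weights=[0.5, 1.0], bias=1.5, threshold=75
)
print("weighted sum:", fruit_sum)
print("is apple:", fruit_label)
```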

Q6. Explain the concept of Linear and Non-linear Support Vector Machine (SVM)

Support Vector Machine (SVM) is a supervised machine learning algorithm used for
classification and regression tasks. It is particularly powerful for binary classification problems.

Linear SVM:
Definition:

A Linear SVM is used when the data is linearly separable, meaning that the two classes can be
separated by a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions).

Key Concepts:

 It finds the optimal hyperplane that best separates the two classes.
 The goal is to maximize the margin between the two classes. The margin is the distance
between the hyperplane and the closest data points (called support vectors).

Equation of Hyperplane:

w⋅x+b=0

Where:

 w = weight vector
 x = input vector
 b = bias
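
As a small illustration of the hyperplane equation, the sketch below classifies a point by the sign of w⋅x + b and computes its distance to the hyperplane, |w⋅x + b| / ‖w‖, which is the quantity the margin is built from. The weight vector, bias and test point are hypothetical values, not the result of training an SVM.

```python
import math


def decision_value(w, x, b):
    """Signed value of w.x + b; zero exactly on the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, x, b):
    """Assign class +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, x, b) >= 0 else -1


def distance_to_hyperplane(w, x, b):
    """Geometric distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, x, b)) / norm


# Hypothetical hyperplane x1 + x2 - 3 = 0 and a test point.
w_example, b_example = [1.0, 1.0], -3.0
point = [3.0, 2.0]
point_class = classify(w_example, point, b_example)
point_distance = distance_to_hyperplane(w_example, point, b_example)
print(point_class, point_distance)
```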

When to Use:

• Data is clearly linearly separable
• No significant overlap between classes

Non-linear SVM:

Definition:

A Non-linear SVM is used when the data is not linearly separable. In such cases, a linear
hyperplane cannot effectively separate the data.

Key Concepts:

• To solve this, SVM uses a technique called the kernel trick.
• A kernel function lets the SVM work as if the input data had been mapped into a
higher-dimensional space, where the classes become linearly separable, without computing that
mapping explicitly.
• Common kernels:
o Polynomial Kernel
o Radial Basis Function (RBF) Kernel
o Sigmoid Kernel

Example:

For example, XOR data is not linearly separable. A non-linear SVM with an RBF kernel can
successfully classify it.
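
The idea behind the higher-dimensional mapping can be sketched with XOR itself. The example below is not a trained SVM: it uses an explicit, hand-picked feature map (adding the product x1·x2 as a third feature) and hand-picked weights, both assumptions for illustration, to show that XOR becomes linearly separable after the mapping. The `rbf_kernel` function shows the standard form of an RBF kernel, with `gamma` as an assumed parameter name.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel: similarity that decays with squared Euclidean distance."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def feature_map(x):
    """Explicit map from 2-D to 3-D: (x1, x2) -> (x1, x2, x1 * x2)."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def linear_decision(phi):
    """A linear rule in the mapped space (weights chosen by hand, not learned)."""
    w, b = (1.0, 1.0, -2.0), -0.5
    score = sum(wi * pi for wi, pi in zip(w, phi)) + b
    return 1 if score > 0 else 0


xor_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_predictions = [linear_decision(feature_map(x)) for x in xor_inputs]
print(xor_predictions)
print(rbf_kernel((0, 0), (1, 1)))
```

A kernel SVM reaches the same effect without building the mapped features: it only evaluates the kernel between pairs of points.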

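Comparison Table:

| Feature           | Linear SVM                    | Non-linear SVM                                                  |
|-------------------|-------------------------------|-----------------------------------------------------------------|
| Data Separation   | Linearly separable            | Not linearly separable                                          |
| Decision Boundary | Straight line/plane/hyperplane | Curved in the input space (linear in the kernel's feature space) |
| Kernel Function   | Not used (or linear kernel)   | Used (e.g., RBF, Polynomial)                                    |
| Computation Time  | Faster                        | Slower (due to kernel computations)                             |
| Complexity        | Simple                        | More complex                                                    |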

Applications:

• Linear SVM: Text classification, spam filtering
• Non-linear SVM: Image classification, speech recognition, handwriting detection

Conclusion:

SVM is a powerful tool for classification. When data is linearly separable, a Linear SVM works
efficiently. However, for more complex patterns, a Non-linear SVM with kernel functions
transforms the data to achieve accurate classification.

UNIT-5

1) A Multi-Layer Perceptron (MLP) consists of fully connected dense layers that transform input
data from one dimension to another. It is called multi-layer because it contains an input layer, one
or more hidden layers and an output layer. The purpose of an MLP is to model complex
relationships between inputs and outputs.
Components of Multi-Layer Perceptron (MLP)
 Input Layer: Each neuron or node in this layer corresponds to an input feature. For instance, if
you have three input features the input layer will have three neurons.
 Hidden Layers: MLP can have any number of hidden layers with each layer containing any
number of nodes. These layers process the information received from the input layer.
 Output Layer: The output layer generates the final prediction or result. If there are multiple
outputs, the output layer will have a corresponding number of neurons.
An MLP is fully connected: every node in one layer connects to every node in the next layer. As the
data moves through the network, each layer transforms it until the final output is generated in the
output layer.
Working of Multi-Layer Perceptron
The key mechanisms in the working of a multi-layer perceptron are forward propagation, the loss
function, backpropagation and optimization.

1. Forward Propagation

In forward propagation the data flows from the input layer to the output layer, passing through
any hidden layers. Each neuron in the hidden layers processes the input as follows:
1. Weighted Sum: The neuron computes the weighted sum of the inputs:
$z = \sum_i w_i x_i + b$
Where:
• $x_i$ is the input feature.
• $w_i$ is the corresponding weight.
• $b$ is the bias term.
2. Activation Function: The weighted sum z is passed through an activation function to introduce
non-linearity. Common activation functions include:
• Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
• ReLU (Rectified Linear Unit): $f(z) = \max(0, z)$
• Tanh (Hyperbolic Tangent): $\tanh(z) = \frac{2}{1 + e^{-2z}} - 1$
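
The weighted sum and the three activation functions listed above can be written directly in Python. This is a minimal sketch; the helper `neuron_output` and the sample inputs, weights and bias are hypothetical.

```python
import math


def sigmoid(z):
    """Sigmoid: squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """ReLU: passes positive values through and clips negatives to 0."""
    return max(0.0, z)


def tanh(z):
    """Tanh written in the form 2 / (1 + e^(-2z)) - 1; output in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-2.0 * z)) - 1.0


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum z = sum(w_i * x_i) + b followed by an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical neuron with two inputs.
example_inputs, example_weights, example_bias = [1.0, 2.0], [0.5, -0.25], 0.1
for fn in (sigmoid, relu, tanh):
    print(fn.__name__, neuron_output(example_inputs, example_weights, example_bias, fn))
```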

2. Loss Function

Once the network generates an output the next step is to calculate the loss using a loss function. In
supervised learning this compares the predicted output to the actual label.
For a binary classification problem the commonly used binary cross-entropy loss function is:
$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$
Where:
• $y_i$ is the actual label (0 or 1).
• $\hat{y}_i$ is the predicted probability of class 1.
• $N$ is the number of samples.
For regression problems the mean squared error (MSE) is often used:
$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
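
Both loss functions can be sketched in plain Python. The clipping constant `eps`, which keeps the logarithm away from log(0), is an implementation detail assumed here and is not part of the formulas above.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average binary cross-entropy over N samples.

    y_pred holds predicted probabilities; they are clipped to
    [eps, 1 - eps] so that log(0) is never evaluated.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


bce_example = binary_cross_entropy([1, 0], [0.5, 0.5])
mse_example = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(bce_example, mse_example)
```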

3. Backpropagation

The goal of training an MLP is to minimize the loss function by adjusting the network's weights
and biases. This is achieved through backpropagation:
1. Gradient Calculation: The gradients of the loss function with respect to each weight and bias
are calculated using the chain rule of calculus.
2. Error Propagation: The error is propagated back through the network, layer by layer.
3. Gradient Descent: The network updates the weights and biases by moving in the opposite
direction of the gradient to reduce the loss: $w = w - \eta \cdot \frac{\partial L}{\partial w}$
Where:
• $w$ is the weight.
• $\eta$ is the learning rate.
• $\frac{\partial L}{\partial w}$ is the gradient of the loss function with respect to the weight.
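
The update rule w = w − η · ∂L/∂w can be seen in isolation on a one-parameter problem. The loss L(w) = (w − 3)², its gradient 2(w − 3), the starting point and the step count below are assumptions chosen so that the minimum (w = 3) is known in advance.

```python
def gradient_descent(grad_fn, w0, lr=0.1, steps=50):
    """Repeatedly move w against the gradient: w <- w - lr * dL/dw."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w


# Assumed toy loss L(w) = (w - 3)^2 with gradient dL/dw = 2 * (w - 3).
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)
```

In an MLP the same step is applied to every weight and bias, with the gradients supplied by backpropagation.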

4. Optimization

MLPs rely on optimization algorithms to iteratively refine the weights and biases during training.
Popular optimization methods include:
• Stochastic Gradient Descent (SGD): Updates the weights based on a single sample or a small
batch of data: $w = w - \eta \cdot \frac{\partial L}{\partial w}$
• Adam Optimizer: An extension of SGD that incorporates momentum and adaptive learning
rates for more efficient training:
o $m_t = \beta_1 m_{t-1} + (1 - \beta_1) \cdot g_t$
o $v_t = \beta_2 v_{t-1} + (1 - \beta_2) \cdot g_t^2$
Here $g_t$ represents the gradient at time step $t$, and $\beta_1$, $\beta_2$ are decay rates.
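
The two moving averages above are only part of the Adam update. The sketch below adds the bias-correction terms and the final parameter update from the standard formulation of Adam, which the text does not spell out; the toy loss (w − 3)², the hyperparameter values and the function name `adam_minimize` are assumptions for illustration.

```python
import math


def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimise a one-parameter loss with Adam and return the final w and its history."""
    w, m, v = w0, 0.0, 0.0
    history = []
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (squared gradients)
        m_hat = m / (1 - beta1 ** t)           # bias correction (standard Adam, not in the text)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
        history.append(w)
    return w, history


# Assumed toy loss L(w) = (w - 3)^2 with gradient 2 * (w - 3).
adam_w, adam_history = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(adam_history[0], adam_w)
```

One design point worth noting: because the step is scaled by the square root of the second moment, the very first update has a size close to the learning rate regardless of how large the gradient is.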
Now that we are done with the theory of the multi-layer perceptron, let's go ahead and implement
it in Python using the TensorFlow library.
Implementing Multi Layer Perceptron
In this section, we walk through building a neural network using TensorFlow.

1. Importing Modules and Loading Dataset

First we import necessary libraries such as TensorFlow, NumPy and Matplotlib for visualizing
the data. We also load the MNIST dataset.
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

2. Loading and Normalizing Image Data

Next we normalize the image data by dividing by 255 (since pixel values range from 0 to 255)
which helps in faster convergence during training.

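gray_scale = 255

# Scale pixel values from [0, 255] to [0, 1]
x_train = x_train.astype('float32') / gray_scale
x_test = x_test.astype('float32') / gray_scale

print("Feature matrix (x_train):", x_train.shape)
print("Target matrix (y_train):", y_train.shape)
print("Feature matrix (x_test):", x_test.shape)
print("Target matrix (y_test):", y_test.shape)
Output (for the standard MNIST split of 60,000 training and 10,000 test images):
Feature matrix (x_train): (60000, 28, 28)
Target matrix (y_train): (60000,)
Feature matrix (x_test): (10000, 28, 28)
Target matrix (y_test): (10000,)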

3. Visualizing Data

To understand the data better we plot the first 100 training samples each representing a digit.
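fig, ax = plt.subplots(10, 10)
k = 0
for i in range(10):
    for j in range(10):
        # Show the k-th training image in row i, column j of the grid
        ax[i][j].imshow(x_train[k].reshape(28, 28), aspect='auto')
        k += 1
plt.show()
Output:
[Figure: 10 × 10 grid showing the first 100 MNIST training digits]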

Q2. Discuss the process of training an RBF (Radial Basis Function) network

A Radial Basis Function (RBF) network is a type of artificial neural network used for function
approximation, pattern recognition, and classification. It consists of three layers: input layer,
hidden layer with RBF neurons, and output layer.

Training an RBF network involves determining the parameters of the hidden and output layers.
The process is typically divided into two main stages.
Structure of an RBF Network:

1. Input Layer:
Passes input features to the next layer without any computation.
2. Hidden Layer:
Contains neurons with RBF activation functions (typically Gaussian). Each neuron
computes the distance between the input and a center point.
3. Output Layer:
Performs a weighted sum of the hidden layer outputs (usually linear).

Training Process of RBF Network:

Stage 1: Determine Parameters of Hidden Layer

This involves selecting the centers and spreads (also called widths or standard deviations) of the
radial basis functions.

1. Select Centers (Centroids):
o Choose K centers from the input data.
o Methods:
• Random selection
• K-Means Clustering (commonly used)
• Orthogonal Least Squares
2. Compute Spread (σ) for each RBF neuron:
o Spread controls the width of the Gaussian function.
o Can be a fixed value or computed using:

$\sigma = \frac{d_{max}}{\sqrt{2K}}$

where $d_{max}$ is the maximum distance between any two centers.

3. Compute Hidden Layer Outputs:
Each hidden neuron applies a radial basis function:

$h_i(x) = \exp\left(-\frac{\lVert x - c_i \rVert^2}{2\sigma^2}\right)$

where $c_i$ is the center of the $i$-th RBF neuron.

Stage 2: Train Output Layer (Linear Weights)

1. Form Hidden Layer Output Matrix (H):
Rows = training examples
Columns = outputs from RBF neurons
2. Compute Output Weights (W):
Solve the equation:

$H \cdot W = Y$

where
o $H$ = matrix of RBF outputs
o $W$ = output weights
o $Y$ = target outputs
3. Use Least Squares Method to compute weights:
$W = (H^T H)^{-1} H^T Y$
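
The two stages can be sketched end to end in plain Python. Several simplifications are assumed here and are not prescribed by the text: the centers are simply set to the four XOR training points instead of being found by K-means, there is no output bias, and the normal equations (HᵀH)W = HᵀY are solved with a small hand-written Gaussian elimination instead of forming the inverse. All function names are hypothetical.

```python
import math


def rbf(x, center, sigma):
    """Gaussian radial basis function h_i(x)."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def solve_linear_system(a, b):
    """Solve a.w = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (aug[r][n] - known) / aug[r][r]
    return w


def train_rbf(samples, targets, centers):
    """Stage 1: spread from the centers. Stage 2: least-squares output weights."""
    k = len(centers)
    d_max = max(math.dist(a, b) for a in centers for b in centers)
    sigma = d_max / math.sqrt(2.0 * k)
    h = [[rbf(x, c, sigma) for c in centers] for x in samples]
    rows = range(len(h))
    hth = [[sum(h[n][i] * h[n][j] for n in rows) for j in range(k)] for i in range(k)]
    hty = [sum(h[n][i] * targets[n] for n in rows) for i in range(k)]
    return sigma, solve_linear_system(hth, hty)


def predict_rbf(x, centers, sigma, weights):
    """Output layer: weighted sum of the hidden RBF activations."""
    return sum(w * rbf(x, c, sigma) for w, c in zip(weights, centers))


xor_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
xor_y = [0.0, 1.0, 1.0, 0.0]
rbf_sigma, rbf_weights = train_rbf(xor_x, xor_y, centers=xor_x)
rbf_outputs = [predict_rbf(x, xor_x, rbf_sigma, rbf_weights) for x in xor_x]
print(rbf_sigma, [round(o, 3) for o in rbf_outputs])
```

With one center per training point the network interpolates the targets; with fewer centers (the usual case) the same least-squares step gives the best linear fit on top of the hidden layer.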

Summary of Training Steps:

1. Input data preprocessing
2. Select centers (e.g., using K-means)
3. Compute spreads (σ)
4. Calculate outputs of hidden layer using RBFs
5. Use linear regression to compute output weights

Advantages of RBF Networks:

• Fast training (due to linear output layer)
• Good approximation capability
• Handles non-linear relationships well

Applications:

 Function approximation
 Time-series prediction
 Medical diagnosis
 Image and speech recognition

Conclusion:

Training an RBF network involves two key steps: selecting centers and spreads for the hidden layer,
and computing the output weights using linear regression. This two-stage process makes RBF
networks efficient and effective for a variety of learning tasks.

Q5. Discuss the strengths of the Decision Tree learning approach

A Decision Tree is a popular supervised learning algorithm used for classification and regression
tasks. It models decisions and their possible consequences in the form of a tree structure. Each
internal node represents a feature test, each branch represents an outcome, and each leaf node
represents a class label or value.

Strengths of the Decision Tree Learning Approach:

1. Easy to Understand and Interpret

• Decision trees are highly intuitive and resemble human decision-making.
• The tree structure can be visualized, making it easier to explain the model to non-technical
users.

2. Requires Little Data Preprocessing

• No need for feature scaling (like normalization or standardization).
• Can handle both numerical and categorical data without transformation.

3. Handles Both Classification and Regression

• Decision trees can be used for both classification (classification trees) and regression
(regression trees) problems; CART, for example, supports both.

4. Non-linear Relationships

 Capable of modeling non-linear relationships between input variables and target outcomes.

5. Automatic Feature Selection

 The algorithm automatically selects the most informative features at each node during
training.

6. Works Well with Missing Values

 Some decision tree implementations can handle missing data by assigning probabilities to
possible outcomes.

7. Fast Training and Prediction

• Decision trees are comparatively cheap to train, and a prediction only requires following one
path from the root to a leaf, making them efficient for both training and inference.

8. Robust to Irrelevant Features

 Unimportant features tend to be ignored in the tree-building process.

9. Flexible and Versatile

 Can be used in ensemble methods like Random Forests and Gradient Boosted Trees to
improve accuracy and reduce overfitting.
10. No Assumption of Data Distribution

 Decision trees are non-parametric models, meaning they do not assume any prior
distribution of the data.

Conclusion:

Decision tree learning is a powerful and versatile approach that offers interpretability, fast
training, and strong performance on a wide range of problems.

6)Different Decision Tree Algorithms: Comparison of Complexity and Performance


Decision trees are a popular machine-learning technique used for both classification and regression
tasks. Several algorithms are available for building decision trees, each with its unique approach to
splitting nodes and managing complexity. The most commonly used algorithms include CART
(Classification and Regression Trees), ID3 (Iterative Dichotomiser 3), C4.5, and C5.0. These vary
primarily in how they choose where to split the data and how they handle different data types.
CART (Classification and Regression Trees)
Overview
 Type of Tree: CART produces binary trees, meaning each node splits into two child nodes. It
can handle both classification and regression tasks.
 Splitting Criterion: Uses Gini impurity for classification and mean squared error for regression
to choose the best split.
Complexity and Performance
 Handling of Data: Capable of handling both numerical and categorical data but converts
categorical features into binary splits.
 Performance: Generally, provides a good balance between accuracy and computational
efficiency, making it suitable for various applications.
ID3 (Iterative Dichotomiser 3)
Overview
 Type of Tree: Generates a tree where each node can have two or more child nodes. It is designed
primarily for classification tasks.
 Splitting Criterion: Uses information gain, based on entropy, to select the optimal split.
Complexity and Performance
 Handling of Data: Primarily handles categorical data and does not inherently support numerical
features without binning.
 Performance: While simple and intuitive, it is prone to overfitting, especially with many
categorical features.
C4.5 and C5.0
C4.5 Overview
 Improvement Over ID3: Extends ID3 by handling both discrete and continuous features,
dealing with missing values, and pruning the tree after building to avoid overfitting.
 Splitting Criterion: Uses gain ratio, which normalizes the information gain, to choose splits,
attempting to solve the bias toward attributes with a large number of values present in ID3.
C4.5 Complexity and Performance
 Handling of Data: Efficiently handles both types of data and missing values.
 Performance: More complex than ID3 but generally provides better accuracy and less
susceptibility to overfitting due to its pruning stage.
C5.0 Overview
 Type of Tree: An extension of C4.5, proprietary, optimized for speed and memory use, and
includes enhancements like boosting.
 Splitting Criterion: Similar to C4.5 but includes mechanisms to boost weak classifiers.
C5.0 Complexity and Performance
 Handling of Data: Handles large datasets efficiently and supports both categorical and numerical
data.
 Performance: Typically outperforms C4.5 in terms of both speed and memory usage, often
producing more accurate models due to the incorporation of boosting techniques.
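
The splitting criteria named above (Gini impurity for CART, entropy-based information gain for ID3, gain ratio for C4.5) can be sketched as follows. This is a minimal illustration on a hypothetical four-example label set; it is not the code of any of the algorithms' actual implementations.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (used by CART)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy reduction achieved by splitting labels into groups (used by ID3)."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder


def gain_ratio(labels, groups):
    """Information gain normalised by the split information (used by C4.5)."""
    n = len(labels)
    split_info = -sum((len(g) / n) * math.log2(len(g) / n) for g in groups if g)
    return information_gain(labels, groups) / split_info if split_info > 0 else 0.0


# Hypothetical labels and a split that separates the classes perfectly.
toy_labels = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
print(entropy(toy_labels), gini(toy_labels))
print(information_gain(toy_labels, perfect_split), gain_ratio(toy_labels, perfect_split))
```

The division by split information is what penalises attributes with many distinct values, which is the bias of plain information gain mentioned for ID3.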
Conclusion
Each decision tree algorithm has its strengths and weaknesses, often tailored to specific types of data
or applications. CART is widely used due to its simplicity and effectiveness for diverse tasks, while
C4.5 and C5.0 offer advanced features that handle complexity better and reduce overfitting. ID3,
while less commonly used today, laid the groundwork for more advanced tree algorithms. The choice
of algorithm often depends on the specific needs of the task, including the nature of the data and the
computational resources available.

Backpropagation in Neural Network


Back Propagation, also known as "Backward Propagation of Errors", is a method used to train
neural networks. Its goal is to reduce the difference between the model’s predicted output and
the actual output by adjusting the weights and biases in the network.

It works iteratively, adjusting the weights and biases to minimize the cost function. In each epoch
the model updates these parameters by following the error gradient in the direction that reduces the
loss. It is typically combined with optimization algorithms like gradient descent or stochastic
gradient descent. The algorithm computes the gradient using the chain rule from calculus, which
allows it to propagate the error through the layers of the network efficiently.

Fig(a) A simple illustration of how the backpropagation works by adjustments of weights

Back Propagation plays a critical role in how neural networks improve over time. Here's why:

1. Efficient Weight Update: It computes the gradient of the loss function with respect to
each weight using the chain rule making it possible to update weights efficiently.

2. Scalability: The Back Propagation algorithm scales well to networks with multiple layers
and complex architectures making deep learning feasible.
3. Automated Learning: With Back Propagation the learning process becomes automated
and the model can adjust itself to optimize its performance.
Working of Back Propagation Algorithm
The Back Propagation algorithm involves two main steps: the Forward Pass and the Backward
Pass.

1. Forward Pass

In forward pass the input data is fed into the input layer. These inputs combined with their
respective weights are passed to hidden layers. For example in a network with two hidden
layers (h1 and h2) the output from h1 serves as the input to h2. Before applying an activation
function, a bias is added to the weighted inputs.

Each hidden layer computes the weighted sum (`a`) of the inputs and then applies an activation
function like ReLU (Rectified Linear Unit) to obtain the output (`o`). The output is passed to
the output layer, where an activation function such as softmax converts the weighted outputs into
probabilities for classification.

The forward pass using weights and biases


2. Backward Pass

In the backward pass the error (the difference between the predicted and actual output) is
propagated back through the network to adjust the weights and biases. One common method for
error calculation is the Mean Squared Error (MSE), which for a single output is given by:

$\text{MSE} = (\text{Predicted Output} - \text{Actual Output})^2$

Over several samples, this squared error is averaged.

Once the error is calculated the network adjusts weights using gradients which are computed
with the chain rule. These gradients indicate how much each weight and bias should be adjusted to
minimize the error in the next iteration. The backward pass continues layer by layer ensuring that
the network learns and improves its performance.
The activation function through its derivative plays a crucial role in computing these
gradients during Back Propagation.

Example of Back Propagation in Machine Learning


Let’s walk through an example of Back Propagation in machine learning. Assume the neurons
use the sigmoid activation function for the forward and backward pass. The target output is 0.5
and the learning rate is 1.

Example (1) of backpropagation sum

Forward Propagation

1. Initial Calculation

The weighted sum at each node is calculated using:

$a_j = \sum_i (w_{i,j} \times x_i)$

Where,

$a_j$ is the weighted sum of all the inputs and weights at each node
$w_{i,j}$ represents the weight between the $i$-th input and the $j$-th neuron
$x_i$ represents the value of the $i$-th input

O (output): After applying the activation function to $a_j$, we get the output of the neuron:

$o_j = \text{activation function}(a_j)$

2. Sigmoid Function

The sigmoid function returns a value between 0 and 1, introducing non-linearity into the model.

$y_j = \frac{1}{1 + e^{-a_j}}$

We use it to find the outputs y3, y4 and y5.

3. Computing Outputs

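The inputs and initial weights used in the calculations below are x1 = 0.35, x2 = 0.7,
w1,1 = w2,1 = 0.2, w1,2 = w2,2 = 0.3, w1,3 = 0.3 and w2,3 = 0.9.

At h1 node:

$a_1 = (w_{1,1} x_1) + (w_{2,1} x_2) = (0.2 \times 0.35) + (0.2 \times 0.7) = 0.21$

Once we have calculated the a1 value, we can proceed to find the y3 value:

$y_3 = F(a_1) = \frac{1}{1 + e^{-0.21}} \approx 0.55$

Similarly, find the values of y4 at h2 and y5 at O3:

$a_2 = (w_{1,2} x_1) + (w_{2,2} x_2) = (0.3 \times 0.35) + (0.3 \times 0.7) = 0.315$

$y_4 = F(0.315) = \frac{1}{1 + e^{-0.315}} \approx 0.58$

$a_3 = (w_{1,3} y_3) + (w_{2,3} y_4) = (0.3 \times 0.55) + (0.9 \times 0.58) = 0.687$

$y_5 = F(0.687) = \frac{1}{1 + e^{-0.687}} \approx 0.67$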

Values of y3, y4 and y5

4. Error Calculation

Our target output is 0.5 but we obtained 0.67. To calculate the error we can use the formula below:

$\text{Error} = y_{target} - y_5 = 0.5 - 0.67 = -0.17$

Using this error value we will be backpropagating.

Back Propagation

1. Calculating Gradients

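The change in each weight is calculated as:

$\Delta w_{i,j} = \eta \times \delta_j \times O_i$

Where:

$\delta_j$ is the error term of unit j (the unit the weight feeds into),
$O_i$ is the output of unit i (the unit or input the weight comes from),
$\eta$ is the learning rate.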

2. Output Unit Error

For O3:

$\delta_5 = y_5 (1 - y_5)(y_{target} - y_5)$

$= 0.67 (1 - 0.67)(-0.17) = -0.0376$

3. Hidden Unit Error

The hidden outputs used here are y3 = F(0.21) ≈ 0.55 and y4 = F(0.315) ≈ 0.58, rounded to two
decimals.

For h1:

$\delta_3 = y_3 (1 - y_3)(w_{1,3} \times \delta_5)$

$= 0.55 (1 - 0.55)(0.3 \times -0.0376) = -0.0028$

For h2:

$\delta_4 = y_4 (1 - y_4)(w_{2,3} \times \delta_5)$

$= 0.58 (1 - 0.58)(0.9 \times -0.0376) = -0.0082$

4. Weight Updates

For the weights from hidden to output layer:

$\Delta w_{2,3} = 1 \times (-0.0376) \times 0.58 = -0.021808$

New weight:

$w_{2,3}(new) = 0.9 + (-0.021808) = 0.878192$

For weights from input to hidden layer:

$\Delta w_{1,1} = 1 \times (-0.0028) \times 0.35 = -0.00098$

New weight:

$w_{1,1}(new) = 0.2 + (-0.00098) = 0.19902$

Similarly, the other weights are updated with the same rule:

$w_{1,2}(new) = 0.3 + (-0.0082 \times 0.35) = 0.29713$

$w_{1,3}(new) = 0.3 + (-0.0376 \times 0.55) = 0.27932$

$w_{2,1}(new) = 0.2 + (-0.0028 \times 0.7) = 0.19804$

$w_{2,2}(new) = 0.3 + (-0.0082 \times 0.7) = 0.29426$

[Figure: the network with its weights updated through the backward pass]

After updating the weights the forward pass is repeated, yielding:

y3 ≈ 0.55

y4 ≈ 0.58

y5 ≈ 0.66

The new error is:

$\text{Error} = y_{target} - y_5 = 0.5 - 0.66 = -0.16$

Since y5 = 0.66 is still not the target output, the process of calculating the error and
backpropagating continues until the output is sufficiently close to the target.

This process demonstrates how Back Propagation iteratively updates weights by minimizing
errors until the network accurately predicts the output.
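
The worked example can be replayed in code. The sketch below assumes the network implied by the calculations (inputs 0.35 and 0.7, two sigmoid hidden units, one sigmoid output unit, no biases, target 0.5, learning rate 1); the dictionary keys and function names are hypothetical. It keeps full floating-point precision, so its printed values can differ slightly from figures rounded by hand to two decimals.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, w):
    """Forward pass: two hidden units (y3, y4) feeding one output unit (y5)."""
    y3 = sigmoid(w["w11"] * x[0] + w["w21"] * x[1])
    y4 = sigmoid(w["w12"] * x[0] + w["w22"] * x[1])
    y5 = sigmoid(w["w13"] * y3 + w["w23"] * y4)
    return y3, y4, y5


def backprop_step(x, target, w, lr=1.0):
    """One backward pass: error terms, then delta_w = lr * delta_j * O_i."""
    y3, y4, y5 = forward(x, w)
    d5 = y5 * (1 - y5) * (target - y5)      # output unit error term
    d3 = y3 * (1 - y3) * w["w13"] * d5      # hidden unit h1
    d4 = y4 * (1 - y4) * w["w23"] * d5      # hidden unit h2
    return {
        "w11": w["w11"] + lr * d3 * x[0],
        "w21": w["w21"] + lr * d3 * x[1],
        "w12": w["w12"] + lr * d4 * x[0],
        "w22": w["w22"] + lr * d4 * x[1],
        "w13": w["w13"] + lr * d5 * y3,
        "w23": w["w23"] + lr * d5 * y4,
    }


inputs = (0.35, 0.7)
weights = {"w11": 0.2, "w21": 0.2, "w12": 0.3, "w22": 0.3, "w13": 0.3, "w23": 0.9}

y5_before = forward(inputs, weights)[2]
weights_after = backprop_step(inputs, target=0.5, w=weights)
y5_after = forward(inputs, weights_after)[2]
print("output before update:", round(y5_before, 4))
print("output after update: ", round(y5_after, 4))
```

A single update moves the output only a small step toward the target, which is why the forward and backward passes are repeated many times during training.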

Back Propagation Implementation in Python for XOR Problem


This code demonstrates how Back Propagation is used in a neural
network to solve the XOR problem. The neural network consists of:
