JNTUK ML Record
Experiment-1
Aim: To Implement and demonstrate the FIND-S algorithm for finding the most specific
hypothesis based on a given set of training data samples. Read the training data from a .CSV
file.
Description:
Introduction :
The Find-S algorithm is a basic concept learning algorithm in machine learning. It finds the most
specific hypothesis that fits all the positive examples; note that the algorithm considers only the
positive training examples. Find-S starts with the most specific hypothesis and generalizes it each
time it fails to cover an observed positive training example. Hence, the Find-S algorithm moves
from the most specific hypothesis toward a more general one.
Important Representation:
1. '?' indicates that any value is acceptable for the attribute.
2. A single specific value constrains the attribute to exactly that value.
3. 'ϕ' indicates that no value is acceptable for the attribute.
4. The most general hypothesis is represented by {?, ?, ?, …, ?}.
5. The most specific hypothesis is represented by {ϕ, ϕ, ϕ, …, ϕ}.
Algorithm:
1. Initialize h to the most specific hypothesis in H.
2. For each positive training instance x:
   For each attribute constraint aᵢ in h:
      if the constraint aᵢ is satisfied by x, do nothing;
      otherwise replace aᵢ in h by the next more general constraint that is satisfied by x.
3. Output the hypothesis h.
PROGRAM:
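A minimal Python sketch of the Find-S program is given below. It assumes a training file named enjoysport.csv whose first row is a header, whose last column is the class label ("yes"/"no"), and whose remaining columns are attribute values; the file name and label values are illustrative.

import csv

def find_s(path):
    with open(path) as f:
        rows = list(csv.reader(f))[1:]          # skip the header row
    hypothesis = None
    for *attrs, label in rows:
        if label.strip().lower() != "yes":      # FIND-S ignores negative examples
            continue
        if hypothesis is None:                  # initialise with the first positive example
            hypothesis = list(attrs)
        else:                                   # generalise attributes that disagree to '?'
            hypothesis = ["?" if h != a else h for h, a in zip(hypothesis, attrs)]
    return hypothesis

if __name__ == "__main__":
    print("Most specific hypothesis:", find_s("enjoysport.csv"))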
OUTPUT:
Experiment-2
Aim: For a given set of training data examples stored in a .CSV file, to implement and
demonstrate the Candidate-Elimination algorithm to output a description of the set of all
hypotheses consistent with the training examples.
Description:
The candidate elimination algorithm incrementally builds the version space given a
hypothesis space H and a set E of examples. The examples are added one by one; each
example possibly shrinks the version space by removing the hypotheses that are inconsistent
with the example. The candidate elimination algorithm does this by updating the general and
specific boundary for each new example.
You can consider this as an extended form of the Find-S algorithm. It considers both positive and
negative examples:
Positive examples are used as in the Find-S algorithm: they generalize the specific boundary S.
Negative examples are used to specialize the general boundary G.
Terms Used:
Concept learning: the task of inferring a Boolean-valued concept from labelled training data
(learning from the training data).
General Hypothesis: a hypothesis that places no constraint on any attribute;
G = {'?', '?', '?', …}, with one '?' per attribute.
Specific Hypothesis: a hypothesis that constrains the attributes to specific values;
S = {'pi', 'pi', 'pi', …}, where the number of pi values equals the number of attributes.
Version Space: the intermediate region between the general and the specific hypothesis. It
contains not just one hypothesis but the set of all hypotheses consistent with the training
data set.
Algorithm:
1. Initialize G to the set of maximally general hypotheses in H and S to the set of maximally
   specific hypotheses in H.
2. For each training example d:
   If d is a positive example: remove from G any hypothesis inconsistent with d, and minimally
   generalize the hypotheses in S until they are consistent with d.
   If d is a negative example: remove from S any hypothesis inconsistent with d, and minimally
   specialize the hypotheses in G until they are consistent with d.
3. Output the version space bounded by S and G.
PROGRAM:
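A simplified Python sketch of the Candidate-Elimination program is given below. As in the previous experiment it assumes a CSV file named enjoysport.csv with a header row, attribute columns, and a final "yes"/"no" label column, and it assumes that at least one positive example precedes the negative examples; the file name and label values are illustrative.

import csv

def candidate_elimination(path):
    with open(path) as f:
        data = list(csv.reader(f))[1:]           # skip the header row
    n = len(data[0]) - 1                         # number of attributes
    S = ["0"] * n                                # most specific boundary
    G = [["?"] * n]                              # most general boundary
    for *attrs, label in data:
        if label.strip().lower() == "yes":       # positive example: generalise S
            S = list(attrs) if S == ["0"] * n else \
                ["?" if s != a else s for s, a in zip(S, attrs)]
            # drop general hypotheses inconsistent with the positive example
            G = [g for g in G
                 if all(gi == "?" or gi == ai for gi, ai in zip(g, attrs))]
        else:                                    # negative example: specialise G
            new_G = []
            for g in G:
                covers = all(gi == "?" or gi == ai for gi, ai in zip(g, attrs))
                if not covers:                   # already excludes the example, keep as is
                    new_G.append(g)
                    continue
                for i in range(n):               # minimal specialisations consistent with S
                    if g[i] == "?" and S[i] not in ("?", attrs[i]):
                        spec = list(g)
                        spec[i] = S[i]
                        new_G.append(spec)
            G = new_G
    return S, G

if __name__ == "__main__":
    S, G = candidate_elimination("enjoysport.csv")
    print("Specific boundary S:", S)
    print("General boundary G:", G)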
OUTPUT:
Experiment-3
Aim: To Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Use an appropriate data set for building the decision tree and apply this knowledge to classify
a new sample.
Description:
Decision Trees:
In simple words, a decision tree is a structure that contains nodes (rectangular boxes) and
edges (arrows) and is built from a dataset (a table whose columns represent features/attributes
and whose rows correspond to records). Each node is either used to make a decision (a decision
node) or to represent an outcome (a leaf node).
Consider, for example, a decision tree that is used to classify whether a person is Fit or Unfit.
The decision nodes are questions like 'Is the person less than 30 years of age?' and 'Does the
person eat junk food?', and the leaves are one of the two possible outcomes, viz. Fit and Unfit.
Looking at such a decision tree we can make the following decisions:
if a person is less than 30 years of age and does not eat junk food then he is Fit; if a person
is less than 30 years of age and eats junk food then he is Unfit; and so on. The initial node is
called the root node, the final nodes are called the leaf nodes, and the rest of the nodes are
called intermediate or internal nodes.
The root and intermediate nodes represent the decisions while the leaf nodes represent the
outcomes.
ID3
ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively
(repeatedly) dichotomizes(divides) features into two or more groups at each step. Invented
by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision tree. In simple
words, the top-down approach means that we start building the tree from the top and
the greedy approach means that at each iteration we select the best feature at the present
moment to create a node. ID3 is generally used only for classification problems
with nominal features.
Metrics in ID3
As mentioned previously, the ID3 algorithm selects the best feature at each step while
building a Decision tree. Before you ask, the answer to the question: ‘How does ID3 select
the best feature?’ is that ID3 uses Information Gain or just Gain to find the best feature.
Information Gain calculates the reduction in the entropy and measures how well a given
feature separates or classifies the target classes. The feature with the highest Information
Gain is selected as the best one. In simple words, Entropy is the measure of disorder and the
Entropy of a dataset is the measure of disorder in the target feature of the dataset.
In the case of binary classification (where the target column has only two classes), entropy is
0 if all values in the target column are homogeneous (identical) and 1 if the target column has
an equal number of values for the two classes.
Denoting our dataset as S, its entropy is calculated as:
Entropy(S) = - ∑ pᵢ * log₂(pᵢ) ; i = 1 to n
where,
n is the total number of classes in the target column (in our case n = 2 i.e YES and NO)
pᵢ is the probability of class ‘i’ or the ratio of “number of rows with class i in the target
column” to the “total number of rows” in the dataset.
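To make these formulas concrete, the short sketch below computes the entropy of a target column and the information gain of one attribute; the attribute rows and labels used here are hypothetical lists.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    # entropy of the whole set minus the weighted entropy of the subsets
    # produced by splitting on one attribute
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# e.g. a target column with 9 "yes" and 5 "no" values has entropy of about 0.940
print(entropy(["yes"] * 9 + ["no"] * 5))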
PROGRAM:
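One way to carry out this experiment is with scikit-learn's DecisionTreeClassifier using the entropy criterion, which approximates ID3's information-gain splitting (scikit-learn builds binary, CART-style trees rather than multi-way ID3 trees). The sketch below assumes a hypothetical playtennis.csv with nominal columns Outlook, Temperature, Humidity, Wind and a target column PlayTennis.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.read_csv("playtennis.csv")                     # assumed file and column names
X = pd.get_dummies(data.drop(columns=["PlayTennis"]))    # one-hot encode nominal features
y = data["PlayTennis"]

tree = DecisionTreeClassifier(criterion="entropy")       # entropy = information-gain splits
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # print the learned tree

# classify a new sample (it must be encoded with the same dummy columns)
new_sample = pd.DataFrame([{"Outlook": "Sunny", "Temperature": "Cool",
                            "Humidity": "High", "Wind": "Strong"}])
new_encoded = pd.get_dummies(new_sample).reindex(columns=X.columns, fill_value=0)
print("Prediction:", tree.predict(new_encoded)[0])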
OUTPUT:
EXPERIMENT-4
Aim: Exercises to solve the real-world problems using the following machine learning
methods: a) Linear Regression b) Logistic Regression
Description:
LINEAR REGRESSION: Linear regression is one of the easiest and most popular Machine
Learning algorithms. It is a statistical method that is used for predictive analysis. Linear
regression makes predictions for continuous/real or numeric variables such as sales, salary,
age, product price, etc.
The linear regression algorithm models a linear relationship between a dependent variable (y) and
one or more independent variables (x), hence the name linear regression. Since the relationship is
linear, the model describes how the value of the dependent variable changes with the value of the
independent variable.
The linear regression model provides a sloped straight line representing the relationship between
the variables:
y = a0 + a1x + ε
where a0 is the intercept, a1 is the slope (regression coefficient), and ε is the random error term.
LOGISTIC REGRESSION: Logistic regression is one of the most popular Machine Learning
algorithms, which comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
o Logistic Regression is similar to Linear Regression except in how the two are used: Linear
Regression is used for solving regression problems, whereas Logistic Regression is used for
solving classification problems.
o In Logistic Regression, instead of fitting a straight regression line, we fit an "S"-shaped
logistic function, whose output always lies between the two extreme values 0 and 1.
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
o Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
The logistic (sigmoid) function is given by:
σ(z) = 1 / (1 + e^(-z))
PROGRAM:
Linear Regression:
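A short scikit-learn sketch for simple linear regression is shown below; it assumes a hypothetical salary_data.csv with columns YearsExperience and Salary.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("salary_data.csv")        # assumed file and column names
X = data[["YearsExperience"]]
y = data["Salary"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

print("Intercept (a0):", model.intercept_)
print("Slope (a1):", model.coef_[0])
print("R^2 on test data:", model.score(X_test, y_test))

plt.scatter(X, y, label="data")
plt.plot(X, model.predict(X), color="red", label="fitted line")
plt.xlabel("Years of experience"); plt.ylabel("Salary"); plt.legend(); plt.show()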
Logistic Regression:
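A short scikit-learn sketch for logistic regression is shown below, using the breast cancer dataset bundled with scikit-learn as an illustrative real-world binary classification problem.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=5000)       # larger max_iter so the solver converges
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))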
EXPERIMENT-5
Aim: Develop a program for Bias, Variance, Remove duplicates, Cross Validation
Description: The bias error is an error from erroneous assumptions in the learning algorithm.
High bias can cause an algorithm to miss the relevant relations between features and target
outputs (underfitting). The variance is an error from sensitivity to small fluctuations in the
training set. High variance may result from an algorithm modelling the random noise in the
training data (overfitting). The bias–variance tradeoff is a central problem in supervised
learning. Ideally, one wants to choose a model that both accurately captures the regularities in
its training data, but also generalizes well to unseen data. Unfortunately, it is typically
impossible to do both simultaneously. High-variance learning methods may be able to represent
their training set well but are at risk of overfitting to noisy or unrepresentative training data. In
contrast, algorithms with high bias typically produce simpler models that may fail to capture
important regularities (i.e. underfit) in the data. The bias–variance decomposition is a way of
analysing a learning algorithm's expected generalization error with respect to a particular
model.
Preparing a dataset before designing a machine learning model is an important task for the data
scientist. When you gather a dataset for modelling a machine learning model, you may find
some instances repeated several times. It is very important for you to remove duplicates from
the dataset to maintain accuracy and to avoid misleading statistics. Cross-validation is a
technique for evaluating a machine learning model and testing its performance. CV is
commonly used in applied ML tasks. It can be used to estimate the test error associated with a
given statistical learning method in order to evaluate its performance, or to select the
appropriate level of flexibility. In this experiment, students need to take a learning model and
an appropriate data set, remove duplicates in the data set, fit a model, measure bias and variance
components of the error rate, and fine-tune the parameters using cross validation. They may
use built-in APIs if needed.
PROGRAM:
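A possible Python sketch for this experiment is shown below. It assumes a hypothetical data.csv with numeric feature columns and a numeric target column named target; it removes duplicates with pandas, estimates the bias and variance components of a linear model's test error with a simple bootstrap, and evaluates the model with 5-fold cross-validation.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

data = pd.read_csv("data.csv").drop_duplicates()          # remove duplicate rows
X = data.drop(columns=["target"]).values                  # assumed column name
y = data["target"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# bootstrap estimate of bias and variance of the model's test predictions
rng = np.random.default_rng(0)
preds = []
for _ in range(100):
    idx = rng.integers(0, len(X_train), len(X_train))     # resample the training data
    model = LinearRegression().fit(X_train[idx], y_train[idx])
    preds.append(model.predict(X_test))
preds = np.array(preds)
bias_sq = np.mean((preds.mean(axis=0) - y_test) ** 2)     # average squared bias
variance = np.mean(preds.var(axis=0))                     # average prediction variance

print("Bias^2  :", bias_sq)
print("Variance:", variance)

# k-fold cross-validation to evaluate / fine-tune the model
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("5-fold CV R^2 scores:", scores, "mean:", scores.mean())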
Experiment-6
Aim: To Write a program to implement Categorical Encoding, One-hot Encoding.
Description:
One Hot Encoding:
One hot encoding is a technique that we use to represent categorical variables as numerical
values in a machine learning model.
The advantages of using one hot encoding include:
1. It allows the use of categorical variables in models that require numerical input.
2. It can improve model performance by providing more information to the model
about the categorical variable.
3. It can help to avoid the problem of ordinality, i.e. integer codes implying an ordering
(e.g. “small”, “medium”, “large”) that the categorical variable may not actually have.
Examples:
Fruit      Categorical value of fruit      Price
apple      1                               5
mango      2                               10
apple      1                               15
orange     3                               20
The output after applying one-hot encoding on the data is given as follows
(columns: apple, mango, orange, Price):
1 0 0 5
0 1 0 10
1 0 0 15
0 0 1 20
Categorical Encoding:
Encoding categorical data is a process of converting categorical data into integer format so
that the data with converted categorical values can be provided to the different models.
Categorical data can be considered as gathered information that is divided into groups. For
example, a list of many people with their blood group: A+, A-, B+, B-, AB+, AB-,O+, O- etc.
in which each of the blood types is a categorical value.
Categorical data is of two types:
• Nominal data
• Ordinal data
Nominal data: This type of categorical data consists of name variables without any inherent
numerical order. For example, in any organization, the names of the different departments, such
as the research and development department, the human resource department, and the accounts and
billing department.
Ordinal data: This type of categorical data has a natural order or scale. For example, a list of
patients with the level of sugar present in the body, which can be divided into high, medium and
low classes.
PROGRAM:
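A short pandas/scikit-learn sketch of label (integer) encoding and one-hot encoding is shown below, using the fruit example from the description above.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Fruit": ["apple", "mango", "apple", "orange"],
                   "Price": [5, 10, 15, 20]})

# categorical (integer) encoding of the Fruit column
df["Fruit_code"] = LabelEncoder().fit_transform(df["Fruit"])
print(df)

# one-hot encoding with pandas
one_hot = pd.get_dummies(df["Fruit"], prefix="Fruit")
print(pd.concat([one_hot, df["Price"]], axis=1))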
EXPERIMENT-7
Aim: Build an Artificial Neural Network by implementing the Back propagation algorithm
and test the same using appropriate data sets.
Description:
1. Artificial Neural Network (ANN):
An ANN is composed of interconnected artificial neurons or nodes organized
into layers: input layer, hidden layer(s), and output layer.
Each neuron receives inputs, performs a weighted sum of those inputs, applies
an activation function, and produces an output.
The connections between neurons are associated with weights that determine
the strength of the connection.
The activation function introduces non-linearity into the network, allowing it to
learn complex patterns and make predictions.
2. Backpropagation Algorithm:
Backpropagation is a supervised learning algorithm used to train an
ANN by adjusting its weights and biases.
It utilizes the gradient descent optimization technique to minimize the
network's error or loss function.
The algorithm consists of two main phases: forward propagation and
backward propagation.
Forward Propagation:
1. During forward propagation, the input data is fed into the
network, and the outputs of each neuron are calculated
successively through the layers.
2. The output of the network is compared to the desired output,
and the error or loss is calculated.
Backward Propagation:
1. Backward propagation involves propagating the error from the
output layer back to the previous layers.
2. The error is used to calculate the gradients of the weights and
biases, which indicate the direction and magnitude of
adjustments required.
3. The weights and biases are updated in the opposite direction of
the gradients, effectively reducing the error.
4. This process is repeated iteratively for a defined number of
epochs or until the network converges to a satisfactory level of
accuracy.
PROGRAM:
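A minimal NumPy sketch of a two-layer network trained with backpropagation is shown below; the tiny (hours slept, hours studied) → test score dataset is purely illustrative, and any small numeric dataset scaled to [0, 1] would work.

import numpy as np

X = np.array([[2, 9], [1, 5], [3, 6]], dtype=float)   # illustrative toy inputs
y = np.array([[92], [86], [89]], dtype=float)         # illustrative toy targets
X = X / X.max(axis=0)          # scale inputs to [0, 1]
y = y / 100.0                  # scale outputs to [0, 1]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
w_hidden = rng.uniform(size=(2, 3))   # input -> hidden weights
b_hidden = rng.uniform(size=(1, 3))
w_out = rng.uniform(size=(3, 1))      # hidden -> output weights
b_out = rng.uniform(size=(1, 1))
lr = 0.1

for epoch in range(5000):
    # forward propagation
    hidden = sigmoid(X @ w_hidden + b_hidden)
    output = sigmoid(hidden @ w_out + b_out)

    # backward propagation of the error
    error = y - output
    d_output = error * output * (1 - output)                 # sigmoid derivative
    d_hidden = (d_output @ w_out.T) * hidden * (1 - hidden)

    # update weights and biases in the direction that reduces the error
    w_out += lr * hidden.T @ d_output
    b_out += lr * d_output.sum(axis=0, keepdims=True)
    w_hidden += lr * X.T @ d_hidden
    b_hidden += lr * d_hidden.sum(axis=0, keepdims=True)

print("Predicted:\n", output)
print("Actual:\n", y)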
Experiment-8
Aim: To Write a program to implement k-Nearest Neighbour algorithm to classify the iris data
set. Print both correct and wrong predictions.
Description: The k-nearest neighbours algorithm, also known as KNN or k-NN, is a non-
parametric, supervised learning classifier, which uses proximity to make classifications or
predictions about the grouping of an individual data point. While it can be used for either
regression or classification problems, it is typically used as a classification algorithm, working
off the assumption that similar points can be found near one another.
Applications:
- Data preprocessing: Datasets frequently have missing values, and the KNN algorithm can
estimate those values in a process known as missing data imputation.
- Recommendation Engines: Using clickstream data from websites, the KNN algorithm has been used
to provide automatic recommendations to users for additional content. A user is assigned to a
particular group, and based on that group's user behaviour, they are given a recommendation.
However, given the scaling issues with KNN, this approach may not be optimal for larger datasets.
- Finance: KNN has also been used in a variety of finance and economic use cases. For example,
it has been applied to credit data to help banks assess the risk of a loan to an organization or
individual and to determine the credit-worthiness of a loan applicant. It has also been used in
stock market forecasting, currency exchange rates, trading futures, and money laundering analyses.
- Healthcare: KNN has also had application within the healthcare industry, making predictions
on the risk of heart attacks and prostate cancer. The algorithm works by calculating the most
likely gene expressions.
Advantages:
- Easy to implement: Given the algorithm’s simplicity and accuracy, it is one of the first
classifiers that a new data scientist will learn.
- Adapts easily: As new training samples are added, the algorithm adjusts to account for the new
data, since all training data is stored in memory.
- Few hyperparameters: KNN only requires a k value and a distance metric, which is low
when compared to other machine learning algorithms.
Cons:
o Large datasets take longer to process.
o Requires feature scaling; failing to scale the features can result in inaccurate predictions.
o Noisy data can result in over-fitting or under-fitting of data.
PROGRAM:
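A scikit-learn sketch of the k-NN classifier on the iris dataset, printing correct and wrong predictions, is shown below; k = 3 is an illustrative choice.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# print each test sample with its actual and predicted class
for features, actual, predicted in zip(X_test, y_test, y_pred):
    status = "Correct" if actual == predicted else "Wrong"
    print(f"{status}: sample={features}, actual={iris.target_names[actual]}, "
          f"predicted={iris.target_names[predicted]}")
print("Accuracy:", accuracy_score(y_test, y_pred))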
OUTPUT:
Experiment-9
Aim: To Implement the non-parametric Locally Weighted Regression algorithm in order to fit
data points. Select appropriate data set for your experiment and draw graphs.
Description:
Locally Weighted Regression algorithm:
Locally weighted linear regression is a supervised learning algorithm.
It is a non-parametric algorithm.
There is no separate training phase; all the work is done at prediction (query) time.
The dataset must therefore always be available when making predictions.
Locally weighted regression methods are a generalization of k-Nearest Neighbour.
In locally weighted regression, an explicit local approximation to the target function is
constructed for each query instance.
The local approximation uses a simple form for the target function, such as a constant,
linear, or quadratic function, weighted by a localized kernel function.
PROGRAM:
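A NumPy sketch of locally weighted regression with a Gaussian kernel is shown below; a noisy sine curve is used as an illustrative dataset for the graph.

import numpy as np
import matplotlib.pyplot as plt

def lwr_predict(x_query, X, y, tau):
    # weight each training point by its distance to the query point
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))
    A = np.c_[np.ones_like(X), X]                    # design matrix [1, x]
    W = np.diag(w)
    # weighted least squares: theta = (A^T W A)^-1 A^T W y
    theta = np.linalg.pinv(A.T @ W @ A) @ A.T @ W @ y
    return theta[0] + theta[1] * x_query

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 100))          # illustrative noisy sine data
y = np.sin(X) + rng.normal(scale=0.2, size=X.size)

x_grid = np.linspace(X.min(), X.max(), 200)
y_fit = [lwr_predict(x, X, y, tau=0.5) for x in x_grid]

plt.scatter(X, y, s=10, label="data")
plt.plot(x_grid, y_fit, color="red", label="locally weighted fit")
plt.legend(); plt.show()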
OUTPUT:
EXPERIMENT-10
Aim: Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier
model to perform this task. Built-in Java classes/API can be used to write the program.
Calculate the accuracy, precision, and recall for your data set.
Description:
The Naive Bayes algorithm is a supervised machine learning algorithm based on the Bayes’
theorem. It is a probabilistic classifier that is often used in NLP tasks like sentiment analysis
(identifying a text corpus’ emotional or sentimental tone or opinion). The Bayes’ theorem is
used to determine the probability of a hypothesis when prior knowledge is available. It depends
on conditional probabilities.
The formula is given below:
P(A|B) = P(B|A) · P(A) / P(B)
where P(A|B) is posterior probability i.e. the probability of a hypothesis A given the event B
occurs. P(B|A) is likelihood probability i.e. the probability of the evidence given that
hypothesis A is true. P(A) is prior probability i.e. the probability of the hypothesis before
observing the evidence and P(B) is marginal probability i.e. the probability of the evidence.
There are 5 types of Naive Bayes classifiers available in scikit-learn – namely Bernoulli Naive
Bayes, Categorical NB, Complement NB, Gaussian NB, and Multinomial NB.
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes
that these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially
distributed. It is primarily used for document classification problems, i.e. determining which
category a particular document belongs to, such as Sports, Politics, Education, etc. The
classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the
predictor variables are independent Boolean variables, such as whether a particular word is
present or not in a document. This model is also well known for document classification tasks.
PROGRAM:
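Although the Aim mentions built-in Java classes, an equivalent Python/scikit-learn sketch is shown below; it assumes a hypothetical documents.csv with text and label columns and reports accuracy, precision, and recall.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

data = pd.read_csv("documents.csv")            # assumed file with "text" and "label" columns
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["label"], test_size=0.3, random_state=0)

vectorizer = CountVectorizer()                 # word-frequency features
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

clf = MultinomialNB().fit(X_train_counts, y_train)
y_pred = clf.predict(X_test_counts)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="macro"))
print("Recall   :", recall_score(y_test, y_pred, average="macro"))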
EXPERIMENT-11
Aim: Apply EM algorithm to cluster a Heart Disease Data Set. Use the same data set for
clustering using k-Means algorithm. Compare the results of these two algorithms and comment
on the quality of clustering. You can add Java/Python ML library classes/API in the program.
Description:
Clustering is an unsupervised learning technique that separates data of similar nature. It aims
to find a structure (intrinsic grouping) in a collection of unlabelled data. A cluster is therefore
a collection of objects which are ‘similar’ between each other and are ‘dissimilar’ to the objects
belonging to other clusters. Two representatives of the clustering algorithms are the K-means
algorithm and the expectation maximization (EM) algorithm. The K-means algorithm uses
Euclidean distance while EM uses statistical methods.
K-means clustering:
Input: The number of k and a database containing n objects.
Output: A set of k-clusters that minimize the squared-error criterion.
1. arbitrarily choose k objects as the initial cluster centres;
2. repeat;
a. (re)assign each object to the cluster to which the object is the most similar based on
the mean value of the objects in the cluster;
b. update the cluster mean, i.e. calculate the mean value of the objects in each cluster;
until no change.
EM clustering:
Input: Cluster number k, a database, stopping tolerance.
Output: A set of k clusters with weights that maximize the log-likelihood function.
1. Expectation step: For each database record x, compute the membership probability of x in
each cluster h = 1,…, k.
2. Maximization step: Update mixture model parameter (probability weight).
3. Stopping criteria: If stopping criteria are satisfied stop, else set j = j +1 and go to (1).
Steps in EM Algorithm
PROGRAM:
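A scikit-learn sketch comparing k-Means and EM (Gaussian mixture) clustering is shown below; it assumes a hypothetical heart.csv whose features are numeric and whose class column is named target. The silhouette score is used as one simple way to comment on clustering quality.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

data = pd.read_csv("heart.csv")                           # assumed file and column name
X = StandardScaler().fit_transform(data.drop(columns=["target"]))

# k-Means clustering (Euclidean distance)
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# EM clustering via a Gaussian mixture model
em_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

print("k-Means silhouette :", silhouette_score(X, kmeans_labels))
print("EM (GMM) silhouette:", silhouette_score(X, em_labels))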
Experiment-12
Aim: Exploratory Data Analysis for Classification using Pandas or Matplotlib
Description:
Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore data
and possibly formulate hypotheses that might lead to new data collection and experiments. EDA
focuses on checking the assumptions required for model fitting and hypothesis testing, on
handling missing values, and on making transformations of variables as needed.
EDA builds a robust understanding of the data and of the issues associated with either the data
or the process. It is a systematic approach to getting the story of the data.
TYPES OF EXPLORATORY DATA ANALYSIS:
1. Univariate Non-graphical
2. Multivariate Non-graphical
3. Univariate graphical
4. Multivariate graphical
TOOLS REQUIRED FOR EXPLORATORY DATA ANALYSIS:
Some of the most common tools used to create an EDA are:
1. R: An open-source programming language and free software environment for statistical
computing and graphics supported by the R foundation for statistical computing. The R
language is widely used among statisticians in developing statistical observations and data
analysis.
2. Python: An interpreted, object-oriented programming language with dynamic semantics.
Its high-level built-in data structures, combined with dynamic typing and dynamic binding,
make it very attractive for rapid application development, as well as for use as a scripting
or glue language to connect existing components. Python and EDA are often used together to
spot missing values in a data set, which is vital for deciding how to handle missing values
for machine learning.
PROGRAM:
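A pandas/Matplotlib sketch of the main EDA steps is shown below, using the iris dataset bundled with scikit-learn for illustration; it covers univariate and multivariate, non-graphical and graphical analysis.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame                                   # feature columns plus a "target" column

print(df.head())                                  # first rows
df.info()                                         # column types and missing values
print(df.describe())                              # univariate summary statistics
print(df["target"].value_counts())                # class balance

df.hist(figsize=(8, 6))                           # univariate graphical EDA
plt.tight_layout(); plt.show()

pd.plotting.scatter_matrix(df.drop(columns=["target"]),
                           c=df["target"], figsize=(8, 8))   # multivariate graphical view
plt.show()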
Experiment-13
Aim: To Write a Python program to construct a Bayesian network considering medical data.
Use this model to demonstrate the diagnosis of heart patients using standard Heart Disease
Data Set.
Description:
Bayesian networks are a widely used class of probabilistic graphical models. They consist of
two parts: a structure and parameters. The structure is a directed acyclic graph (DAG) that
expresses conditional independencies and dependencies among random variables associated with
nodes. The parameters consist of conditional probability distributions associated with each
node. A Bayesian network is a compact, flexible and interpretable representation of a joint
probability distribution. It is also a useful tool in knowledge discovery, as directed acyclic
graphs allow representing causal relations between variables. Typically, a Bayesian network is
learned from data.
In this experiment we study Bayesian networks, in particular specifying their structure and
learning their parameters from data.
Example:
Given a directed acyclic graph over the variables of interest, the goal is to calculate the
posterior conditional probability distribution of each of the possible unobserved causes given
the observed evidence.
PROGRAM:
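A possible sketch using the pgmpy library is shown below. The file name heart.csv, the chosen attributes, the hand-picked network structure, and the evidence value are illustrative and should be adjusted to the actual Heart Disease data set; note that older pgmpy versions name the model class BayesianModel instead of BayesianNetwork.

import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

# keep only a few discrete attributes (illustrative column names)
data = pd.read_csv("heart.csv")[["sex", "cp", "exang", "target"]]

# a hand-chosen DAG over these attributes
model = BayesianNetwork([("sex", "target"), ("cp", "target"), ("exang", "target")])
model.fit(data, estimator=MaximumLikelihoodEstimator)   # learn the CPDs from data

infer = VariableElimination(model)
# probability of heart disease given chest-pain type 2 (assumed valid state)
result = infer.query(variables=["target"], evidence={"cp": 2})
print(result)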
OUTPUT:
Experiment-14
Aim: To Write a program to Implement Support Vector Machines and Principal Component
Analysis.
Description:
Support Vector Machine (SVM) is a very powerful and versatile Machine Learning model,
capable of performing linear or nonlinear classification, regression, and even outlier detection.
SVM (Support Vector Machine) can be used for both regression and classification; however, it is
most widely applied to classification tasks. The objective of the support vector machine
algorithm is to find a hyperplane in N-dimensional space (N being the number of features) that
distinctly classifies the data points.
There are numerous hyperplanes that could separate the two kinds of data points. Our goal is to
find the plane with the greatest margin, i.e. the greatest distance to the nearest data points
of both classes. Maximizing the margin provides some reinforcement so that future data points
can be classified with more confidence.
We can use PCA to find the first two principal components, and visualize the data in this new,
two-dimensional space, with a single scatter-plot.
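PROGRAM:
A scikit-learn sketch combining an SVM classifier with a two-dimensional PCA visualisation is shown below, using the iris dataset for illustration; the RBF kernel and C value are illustrative choices.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
X_train, X_test, y_train, y_test = train_test_split(X, iris.target,
                                                    test_size=0.3, random_state=0)

svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test)))

# project onto the first two principal components and visualise the classes
X_2d = PCA(n_components=2).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=iris.target)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()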
EXPERIMENT-15
Aim: Write a program to Implement Principal Component Analysis
Description:
Principal Component Analysis
Principal Component Analysis is an unsupervised learning algorithm that is used for the
dimensionality reduction in machine learning. It is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated features with the help of
orthogonal transformation. These new transformed features are called the Principal
Components. It is one of the popular tools that is used for exploratory data analysis and
predictive modelling. It is a technique for drawing out strong patterns from the given dataset
by reducing the number of dimensions while retaining as much of the variance as possible.
PCA tries to find a lower-dimensional surface onto which to project the high-dimensional data.
PCA works by considering the variance of each attribute, because high variance indicates a good
separation between classes, and in this way it reduces the dimensionality. Some real-world
applications of PCA are image processing, movie recommendation systems, and optimizing the power
allocation in various communication channels. It is a feature extraction technique, so it keeps
the important variables and drops the least important ones.
PROGRAM:
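A scikit-learn sketch of PCA is shown below, reducing the 30 features of the breast cancer dataset to two principal components and reporting the variance retained; the dataset choice is illustrative.

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=data.target)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()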
OUTPUT: