
IT377 MACHINE LEARNING 20DIT073

PRACTICAL – 1

AIM: Perform the following using Python Pandas, numpy and Matplotlib library on
given dataset:
i) Deal with missing values in the data either by deleting records or using mean/median/mode imputation.
ii) Detect if outliers exist and plot the data distribution using Box Plots, Scatter Plots and Histograms of the matplotlib library.
iii) Create and display the correlation matrix of all features of the data.
iv) Perform Data Standardization and Normalization.
v) Select the 10 best features of the data using different statistical scoring methods. (Hint: the Chi-Squared statistical test is a good scoring method)
vi) Split the data into training and testing sets in a ratio of 80:20.
Record and analyse observations.

DESCRIPTION:
 Data standardization
Data standardization is the process of converting data to a common format to enable users to
process and analyze it. Most organizations utilize data from a number of sources; this can
include data warehouses, lakes, cloud storage, and databases.

 Data normalization
Data normalization refers to rescaling the values of your data so that they fall between 0 and 1 (min-max scaling). Data standardization, by contrast, is a scaling technique that rescales each feature to a mean of 0 and a standard deviation of 1.
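A minimal sketch of both scalings with scikit-learn (the toy feature matrix below is purely illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[25.0, 50000.0], [32.0, 60000.0], [47.0, 90000.0]])  # toy feature matrix

X_std = StandardScaler().fit_transform(X)    # standardization: each column -> mean 0, std 1
X_norm = MinMaxScaler().fit_transform(X)     # normalization (min-max): each column -> range [0, 1]
print(X_std)
print(X_norm)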

 Training Data

Training Data (or a training dataset) is the initial data used to train machine learning models.
Training datasets are fed to machine learning algorithms to teach them how to make predictions
or perform a desired task.

 Testing Data

Testing Data is a set of observations used to evaluate the performance of the model using some
performance metric. It is important that no observations from the training set are included in
the test set. If the test set does contain examples from the training set, it will be difficult to
assess whether the algorithm has learned to generalize from the training set or has simply
memorized it.


CODE:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from google.colab import files
uploaded = files.upload()
import io
df = pd.read_csv('/content/Data.csv')
print(df)
X = df.iloc[:, :-1].values   # feature matrix (all columns except the last)
y = df.iloc[:, -1].values    # target vector (last column)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # mean imputation for missing values
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
print(X_train)
print(X_test)
print(y_train)
print(y_test)
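The code above covers imputation and the train/test split only. A minimal sketch of the remaining items of the aim (outlier plots, correlation matrix and chi-squared feature selection), assuming a dataframe df with at least ten numeric feature columns and a numeric target column named 'target' (both assumptions made for illustration):

# Box plots and histograms to inspect outliers and data distributions
df.boxplot()
plt.show()
df.hist()
plt.show()

# Correlation matrix of all numeric features
print(df.select_dtypes('number').corr())

# Select the 10 best features with the chi-squared scoring method
# (chi2 requires non-negative values, so the features are min-max scaled first)
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
features = df.select_dtypes('number').drop(columns=['target'])
scaled = MinMaxScaler().fit_transform(features)
selector = SelectKBest(score_func=chi2, k=10)
X_best = selector.fit_transform(scaled, df['target'])
print(features.columns[selector.get_support()])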

OUTPUT:

Figure 1.1: Missing Data


Figure 1.2: Taking care of missing data

Figure 1.3: Splitting the data into training and testing set

CONCLUSION:
In this practical I have learned about splitting data into training and testing sets and handling missing values using mean, median and mode imputation.


PRACTICAL – 2

AIM: i) Implement linear regression and calculate the different evaluation measures (MAE, RMSE, etc.) for the same. Also implement gradient descent and observe the cost with linear regression using gradient descent. Do not use any Python library for linear regression. (Hint: the linear regression formula is Y = mX + b, where Y is the target variable and X is the independent variable)
ii) Implement non-linear regression in Python.

DESCRIPTION:
The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
Linear Regression is an algorithm that belongs to supervised Machine Learning. It tries to
apply relations that will predict the outcome of an event based on the independent variable
data points. The relation is usually a straight line that best fits the different data points as
close as possible. The output is of a continuous form, i.e., numerical value. For example, the
output could be revenue or sales in currency, the number of products sold, etc. In the above
example, the independent variable can be single or multiple.

CODE:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
print(X)
print(y)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()        # fit the simple linear regression model used below
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
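The code above fits the model with scikit-learn; the aim also asks for the evaluation measures and for a from-scratch gradient-descent fit of Y = mX + b. A minimal sketch, assuming the one-dimensional X_train/X_test (years of experience) and y_train/y_test (salary) arrays created above:

x = X_train.ravel().astype(float)
t = y_train.astype(float)
x_scaled = (x - x.mean()) / x.std()     # scale the feature so gradient descent converges quickly

m, b = 0.0, 0.0                         # initial slope and intercept
lr = 0.01                               # learning rate
for epoch in range(1000):
    error = m * x_scaled + b - t
    cost = (error ** 2).mean()          # mean squared error cost, shrinks every epoch
    m -= lr * 2 * (error * x_scaled).mean()
    b -= lr * 2 * error.mean()

# Evaluation measures on the test set
x_test_scaled = (X_test.ravel().astype(float) - x.mean()) / x.std()
pred = m * x_test_scaled + b
mae = np.abs(pred - y_test).mean()
rmse = np.sqrt(((pred - y_test) ** 2).mean())
print('final cost:', cost, 'MAE:', mae, 'RMSE:', rmse)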


OUTPUT:

Figure 2.1: Dataset imported

Figure 2.2: Visualizing the training set results


Figure 2.3: Visualizing the test set results

CONCLUSION:
In this practical I have learned about the linear regression model, visualizing its results, and fitting it with gradient descent.


PART-2

DESCRIPTION:
Non-Linear regression is a type of polynomial regression. It is a method to model a non-linear
relationship between the dependent and independent variables. It is used in place when the data
shows a curvy trend, and linear regression would not produce very accurate results when
compared to non-linear regression. This is because in linear regression it is pre-assumed that
the data is linear.
Nonlinear regression is a mathematical model that fits an equation to the data using a generated curve. Whereas linear regression uses a straight-line equation (such as y = c + mx), nonlinear regression describes the association with a curve, making the model nonlinear in its parameters.

CODE:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
print(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from scipy.stats import wilcoxon
# compare samples
stat, p = wilcoxon(X_train[0], X_test[0])
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
print('Same distribution (fail to reject H0)')
else:
print('Different distribution (reject H0)')
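The Wilcoxon test above only checks that the train and test splits come from a similar distribution; it does not fit a non-linear model itself. A minimal polynomial-regression sketch, assuming column 0 of X is the numeric R&D Spend column of 50_Startups.csv and y is the profit:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x_rd = X[:, 0].reshape(-1, 1).astype(float)   # single feature: R&D Spend
poly = PolynomialFeatures(degree=2)           # add the squared term
x_poly = poly.fit_transform(x_rd)
reg = LinearRegression().fit(x_poly, y)

order = x_rd.ravel().argsort()                # sort so the fitted curve plots cleanly
plt.scatter(x_rd, y, color='red')
plt.plot(x_rd[order], reg.predict(x_poly)[order], color='blue')
plt.title('Polynomial Regression (degree 2)')
plt.xlabel('R&D Spend')
plt.ylabel('Profit')
plt.show()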


OUTPUT:

Figure 2.2.1: Dataset imported


Figure 2.2.2: Splitting the data into training and testing sets

CONCLUSION:
In this practical I have learned about the non-linear (polynomial) regression model and about splitting the dataset into training and testing sets.


PRACTICAL – 3

AIM: Implement logistic regression and calculate the different evaluation measure (F-
measures, Confusion Matrix etc.) for the same. Also implement gradient descent and
observe the cost with logistic regression using gradient descent. (Hint: Confusion Matrix
and F-measures involve use of True Negatives, True Positives, False Negatives and False
Positives). Also implement Cross-Validation.

DESCRIPTION:
o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc.; but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1 (through the sigmoid function p = 1 / (1 + e^-z)).

Confusion matrix:
A much better way to evaluate the performance of a classifier is to look at the confusion
matrix. The general idea is to count the number of times instances of class A are classified
as B.
For example, to know the number of times the classifier confused images of 5s with 3s,
you would look in the 5th row and 3rd column of the confusion matrix.
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F-measure (F1 score) = 2 × precision × recall / (precision + recall)

CODE:
Logistic regression:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)


print(X_train)
print(X_test)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
print(classifier.predict(sc.transform([[45,87000]])))
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
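The aim also asks to observe the cost when logistic regression is trained with gradient descent, and to run cross-validation. A minimal sketch, assuming the scaled X_train and the 0/1 labels y_train prepared above:

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.zeros(X_train.shape[1])          # weights
b = 0.0                                 # bias
lr = 0.1                                # learning rate
for epoch in range(1000):
    p = sigmoid(X_train.dot(w) + b)
    cost = -np.mean(y_train * np.log(p + 1e-9) + (1 - y_train) * np.log(1 - p + 1e-9))  # log loss
    w -= lr * X_train.T.dot(p - y_train) / len(y_train)
    b -= lr * np.mean(p - y_train)
    if epoch % 100 == 0:
        print('epoch', epoch, 'cost', cost)   # the cost should decrease over the epochs

# 10-fold cross-validation of the scikit-learn classifier trained above
from sklearn.model_selection import cross_val_score
scores = cross_val_score(classifier, X_train, y_train, cv=10)
print('Cross-validation accuracy: %.3f (+/- %.3f)' % (scores.mean(), scores.std()))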
OUTPUT:

Figure 3.1: Importing the data set


Figure 3.2: Splitting the dataset into training and testing set

Figure 3.3: Training the logistic regression on training set

Figure 3.4: Predicting a new result

Figure 3.5:Predicting the test results


Figure 3.6: Making the confusion matrix

Figure 3.7: Confusion matrix

Figure 3.8: F-Score Measure, precision and recall

CONCLUSION:
In this practical I have learned how to implement logistic regression and calculate the different evaluation measures such as the F-measure and the confusion matrix.


PRACTICAL – 4

AIM: Implement Multi-class Classification in python. Visualize and Analyze the results.

DESCRIPTION:

Multiclass classification is a machine learning classification task with more than two classes, or outputs. For example, using a model to identify animal types in images from an encyclopedia is multiclass classification because each image can be assigned one of many different animal classes. Multiclass classification also requires that a sample has exactly one class (i.e. an elephant is only an elephant; it is not also a lemur).

Outside of regression, multiclass classification is probably the most common machine learning task. In classification, we are presented with a number of training examples divided into K separate classes, and we build a machine learning model to predict which of those classes previously unseen data belongs to (i.e. the animal types from the previous example). From the training dataset, the model learns patterns specific to each class and uses those patterns to predict the membership of future data.

OUTPUT:
Decision Tree Classification
 Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

 Importing the dataset


dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

 Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
print(X_test)

[[ 30 87000]
[ 38 50000]
[ 35 75000]
[ 30 79000]
[ 35 50000]
[ 27 20000]
[ 31 15000]
[ 36 144000]
[ 18 68000]
[ 47 43000]
[ 30 49000]


[ 28 55000]
[ 37 55000]
[ 39 77000]
[ 20 86000]
[ 32 117000]
[ 37 77000]
[ 19 85000]
[ 55 130000]
[ 35 22000]
[ 48 90000]
[ 42 104000]]

print(y_test)

[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0
 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 1 0 0 1
0 0 0 0 1 1 1 0 0 0 1 1 0 1 1 0 0 1 0 0 0 1 0 1 1 1]

 Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print(X_train)

[[ 0.58164944 -0.88670699]
[-0.60673761 1.46173768]
[-0.01254409 -0.5677824 ]
[-0.60673761 1.89663484]
[ 1.37390747 -1.40858358]
[ 1.47293972 0.99784738]
[ 0.08648817 -0.79972756]
[-0.01254409 -0.24885782]
[-0.21060859 -0.5677824 ]
[-0.21060859 -0.19087153]
[-0.30964085 -1.29261101]
[-0.90383437 -0.77073441]
[-0.21060859 -0.50979612]
[-1.10189888 -0.45180983]
[-1.20093113 1.40375139]]

print(X_test)

[[-0.80480212 0.50496393]
[-0.01254409 -0.5677824 ]
[-0.30964085 0.1570462 ]
[-0.80480212 0.27301877]
[-0.30964085 -0.5677824 ]
[-1.10189888 -1.43757673]
[-0.70576986 -1.58254245]


[-0.21060859 2.15757314]
[-1.99318916 -0.04590581]
[ 0.8787462 -0.77073441]
[-0.80480212 -0.59677555]
[-1.00286662 -0.42281668]
[-0.11157634 -0.42281668]
[ 0.08648817 0.21503249]
[-1.79512465 0.47597078]
[-0.60673761 1.37475825]
[-0.11157634 0.21503249]
[-1.89415691 0.44697764]
[ 1.67100423 1.75166912]
[-0.30964085 -1.37959044]
[-0.30964085 -0.65476184]
[ 0.8787462 2.15757314]
[ 0.97777845 -1.06066585]
[ 0.97777845 0.59194336]
[ 0.38358493 0.99784738]]

 Training the Decision Tree Classification model on the Training set


from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=0, splitter='best')

 Predicting a new result


print(classifier.predict(sc.transform([[30,87000]])))

[0]

 Predicting the Test set results


y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1)
)

[[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[1 1]
[0 0]
[0 0]


[0 0]
[0 0]
[0 0]
[1 0]
[0 0]
[1 0]
[1 0]
[0 0]
[1 1]
[0 0]
[1 1]
[1 1]]

 Making the Confusion Matrix


from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)

accuracy_score(y_test, y_pred)

[[62 6]
[ 3 29]]

0.91

 Visualising the Training set results


from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_train), y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Decision Tree Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()


 Visualising the Test set results


from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_test), y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Decision Tree Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()


Random Forest Classification


 Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

 Importing the dataset


dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

 Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state
= 0)

print(X_test)

[[ 30 87000]
[ 38 50000]
[ 35 75000]
[ 30 79000]
[ 35 50000]
[ 27 20000]
[ 31 15000]
[ 36 144000]
[ 18 68000]
[ 47 43000]
[ 30 49000]
[ 28 55000]
[ 37 55000]
[ 39 77000]
[ 20 86000]
[ 32 117000]
[ 37 77000]
[ 19 85000]
[ 55 130000]
[ 35 22000]
[ 35 47000]
[ 47 144000]
[ 41 51000]
[ 47 105000]
[ 23 28000]
[ 49 141000]
[ 28 87000]
[ 29 80000]
[ 37 62000]
[ 32 86000]
[ 48 33000]


[ 48 90000]
[ 42 104000]]

print(y_test)

[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0
 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 1 0 0 1
0 0 0 0 1 1 1 0 0 0 1 1 0 1 1 0 0 1 0 0 0 1 0 1 1 1]

 Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print(X_test)

[[-0.80480212 0.50496393]
[-0.01254409 -0.5677824 ]
[-0.30964085 0.1570462 ]
[-0.80480212 0.27301877]
[-0.30964085 -0.5677824 ]
[-1.10189888 -1.43757673]
[-0.70576986 -1.58254245]
[-0.21060859 2.15757314]
[-1.99318916 -0.04590581]
[ 0.8787462 -0.77073441]
[-0.80480212 -0.59677555]
[-1.00286662 -0.42281668]
[-0.11157634 -0.42281668]
[ 0.08648817 0.21503249]
[-1.79512465 0.47597078]
[-0.60673761 1.37475825]
[-0.11157634 0.21503249]
[-1.89415691 0.44697764]
[ 1.67100423 1.75166912]
[-0.30964085 -1.37959044]
[-0.30964085 -0.65476184]
[ 0.8787462 2.15757314]
[ 0.28455268 -0.53878926]
[ 0.8787462 1.02684052]
[-1.49802789 -1.20563157]
[ 1.07681071 2.07059371]
[-1.00286662 0.50496393]
[-0.90383437 0.30201192]
[-0.11157634 -0.21986468]
[ 0.97777845 0.59194336]
[ 0.38358493 0.99784738]]


 Training the Random Forest Classification model on the Training set


from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_
state = 0)
classifier.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,


criterion='entropy', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10,
n_jobs=None, oob_score=False, random_state=0, verbose=0,
warm_start=False)

 Predicting a new result


print(classifier.predict(sc.transform([[30,87000]])))

[0]

 Predicting the Test set results


y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1)
)

[[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[1 1]
[0 0]
[1 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[1 0]
[1 0]
[0 0]
[1 1]
[1 1]]


 Making the Confusion Matrix


from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[63 5]
[ 4 28]]

0.91

 Visualizing the Training set results


from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_train), y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Random Forest Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

 Visualising the Test set results


from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_test), y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Random Forest Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Support Vector Machine (SVM)


 Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

 Importing the dataset


dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

 Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state
= 0)
print(X_test)


[[ 30 87000]
[ 38 50000]
[ 35 75000]
[ 30 79000]
[ 35 50000]
[ 27 20000]
[ 31 15000]
[ 36 144000]
[ 18 68000]
[ 47 43000]
[ 30 49000]
[ 28 55000]
[ 37 55000]
[ 39 77000]
[ 20 86000]
[ 32 117000]
[ 37 77000]
[ 19 85000]
[ 55 130000]
[ 35 22000]
[ 35 47000]
[ 47 144000]
[ 41 51000]
[ 47 105000]
[ 23 28000]
[ 49 141000]
[ 28 87000]
[ 29 80000]
[ 37 62000]
[ 32 86000]
[ 21 88000]
[ 37 79000]
[ 57 60000]
[ 37 53000]
[ 24 58000]
[ 48 90000]
[ 42 104000]]

print(y_test)

[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0
 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 1 0 0 1
0 0 0 0 1 1 1 0 0 0 1 1 0 1 1 0 0 1 0 0 0 1 0 1 1 1]

 Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()


X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

print(X_test)

[[-0.80480212 0.50496393]
[-0.01254409 -0.5677824 ]
[-0.30964085 0.1570462 ]
[-0.80480212 0.27301877]
[-0.30964085 -0.5677824 ]
[-1.10189888 -1.43757673]
[-0.70576986 -1.58254245]
[-0.21060859 2.15757314]
[-1.99318916 -0.04590581]
[ 0.8787462 -0.77073441]
[-0.80480212 -0.59677555]
[-1.00286662 -0.42281668]
[-0.11157634 -0.42281668]
[ 0.08648817 0.21503249]
[-1.79512465 0.47597078]
[-0.60673761 1.37475825]
[-0.11157634 0.21503249]
[-1.89415691 0.44697764]
[ 1.67100423 1.75166912]
[-0.30964085 -1.37959044]
[-0.30964085 -0.65476184]
[ 0.8787462 2.15757314]
[ 0.28455268 -0.53878926]
[ 0.8787462 1.02684052]
[-1.49802789 -1.20563157]
[ 1.07681071 2.07059371]
[-1.00286662 0.50496393]
[-0.90383437 0.30201192]
[-0.11157634 -0.21986468]
[-0.60673761 0.47597078]
[-1.6960924 0.53395707]
[-0.11157634 0.27301877]
[ 1.86906873 -0.27785096]
[-0.11157634 -0.48080297]
[-1.39899564 -0.33583725]
[-1.99318916 -0.50979612]
[-1.59706014 0.33100506]
[-0.4086731 -0.77073441]
[-0.70576986 -1.03167271]
[ 1.07681071 -0.97368642]
[-1.10189888 0.53395707]
[ 0.28455268 -0.50979612]
[-1.10189888 0.41798449]
[-0.30964085 -1.43757673]
[ 0.97777845 -1.06066585]


[ 0.97777845 0.59194336]
[ 0.38358493 0.99784738]]

 Training the SVM model on the Training set


from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
verbose=False)

 Predicting a new result


print(classifier.predict(sc.transform([[30,87000]])))

[0]

 Predicting the Test set results


y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1)
)

[[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[1 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[1 1]
[0 0]
[0 0]
[1 1]
[0 0]
[1 1]
[0 0]
[1 1]


[0 0]
[0 0]
[0 0]
[1 1]
[1 1]]

 Making the Confusion Matrix


from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[66 2]
[ 8 24]]

0.9

 Visualising the Training set results


from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_train), y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('SVM (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()


 Visualising the Test set results


from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_test), y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('SVM (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

CONCLUSION:
In this practical I have learned how to perform Random Forest Classification, Decision Tree
Classification and Support Vector Classification.


PRACTICAL – 5

AIM: Implement K-Nearest Neighbours and Naïve Bayes Classifier with python’s
Scikit-Learn on different datasets. Compare the classifiers based on their evaluation
measures.

DESCRIPTION:
 K-Nearest Neighbor(KNN)
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o K-NN is a non-parametric algorithm, which means it does not make any assumption
on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an
action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets new data,
then it classifies that data into a category that is much similar to the new data.
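A minimal from-scratch sketch of the idea, predicting the class of a single query point (numeric numpy arrays are assumed):

import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    # Euclidean distance from the query point to every training sample
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # indices of the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # majority vote among the neighbours' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]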

 Naïve Bayes Classifier Algorithm


o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simple and most effective Classification algorithms
which helps in building the fast machine learning models that can make quick
predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular examples of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
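Since the aim asks to compare the classifiers on their evaluation measures, a minimal sketch (assuming the same scaled Social_Network_Ads split used in the code below) could train both models side by side:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score

models = {'K-NN': KNeighborsClassifier(n_neighbors=5), 'Naive Bayes': GaussianNB()}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name, 'accuracy: %.3f' % accuracy_score(y_test, pred), 'F1: %.3f' % f1_score(y_test, pred))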


OUTPUT:
K-Nearest Neighbors (K-NN)
 Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

 Importing the dataset


dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

 Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

print(X_test)

[[ 30 87000]
[ 38 50000]
[ 35 75000]
[ 30 79000]
[ 35 50000]
[ 27 20000]
[ 31 15000]
[ 36 144000]
[ 18 68000]
[ 47 43000]
[ 30 49000]
[ 28 55000]
[ 37 55000]
[ 37 62000]
[ 32 86000]
[ 21 88000]
[ 37 79000]
[ 57 60000]
[ 45 32000]
[ 29 83000]
[ 26 80000]
[ 49 28000]
[ 23 20000]
[ 32 18000]
[ 60 42000]
[ 19 76000]
[ 36 99000]
[ 19 26000]
[ 60 83000]
[ 24 89000]
[ 27 58000]


[ 40 47000]
[ 42 70000]
[ 32 150000]
[ 35 77000]
[ 22 63000]
[ 45 22000]
[ 27 89000]
[ 18 82000]
[ 42 79000]
[ 40 60000]
[ 53 34000]
[ 47 107000]
[ 58 144000]
[ 59 83000]
[ 24 55000]
[ 26 35000]
[ 58 38000]
[ 42 80000]
[ 40 75000]
[ 59 130000]
[ 46 41000]
[ 41 60000]
|
|
|
|

[ 27 96000]
[ 23 63000]
[ 48 33000]
[ 48 90000]
[ 42 104000]]

print(y_test)

[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0
 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 1 0 0 1
0 0 0 0 1 1 1 0 0 0 1 1 0 1 1 0 0 1 0 0 0 1 0 1 1 1]

 Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print(X_test)

[[-0.80480212 0.50496393]
[-0.01254409 -0.5677824 ]


[-0.30964085 0.1570462 ]
[-0.80480212 0.27301877]
[-0.30964085 -0.5677824 ]
[-1.10189888 -1.43757673]
[-0.70576986 -1.58254245]
[-0.21060859 2.15757314]
[-0.11157634 0.27301877]
[ 1.86906873 -0.27785096]
[-0.11157634 -0.48080297]
[-1.39899564 -0.33583725]
[-1.99318916 -0.50979612]
[-1.59706014 0.33100506]
[-0.4086731 -0.77073441]
[-0.70576986 -1.03167271]
[ 1.07681071 -0.97368642]
[-1.10189888 0.53395707]
[-1.10189888 -0.33583725]
[ 0.18552042 -0.65476184]
[ 0.38358493 0.01208048]
[-0.60673761 2.331532 ]
[-0.30964085 0.21503249]
[-1.59706014 -0.19087153]
[ 0.68068169 -1.37959044]
[-1.10189888 0.56295021]
[-1.99318916 0.35999821]
[ 0.38358493 0.27301877]
|
|
|
|

[ 0.38358493 0.27301877]
[ 0.18552042 -0.27785096]
[ 1.47293972 -1.03167271]
[ 0.8787462 1.08482681]
[ 1.96810099 2.15757314]
[-1.10189888 0.76590222]
[-1.49802789 -0.19087153]
[ 0.97777845 -1.06066585]
[ 0.97777845 0.59194336]
[ 0.38358493 0.99784738]]

 Training the K-NN model on the Training set


from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',


metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')

 Predicting a new result


print(classifier.predict(sc.transform([[30,87000]])))

[0]

 Predicting the Test set results


y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[1 1]
[0 0]
[1 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[1 0]
[0 0]
[0 0]
[1 1]
[0 0]
[0 0]
[1 1]
[0 0]
[1 1]
[0 0]
[1 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 1]
[1 1]
[0 0]
[0 0]
[0 0]
[0 0]


[0 0]
[0 0]
[1 1]
[0 0]
[0 0]
[0 0]
[0 0]
[1 1]
[0 0]
[0 0]
[1 1]
[0 0]
[1 1]
[1 1]
[1 1]
[1 0]
[0 0]
[0 0]
[1 1]
[0 1]
[0 0]
[1 1]
[1 1]
[0 0]
[0 0]
[1 1]
[0 0]
[0 0]
[0 0]

|
|
|
|
[0 1]
[0 0]
[1 1]
[1 1]
[1 1]]

 Making the Confusion Matrix


from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[64 4]
[ 3 29]]
0.93


 Visualising the Training set results


from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_train), y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10,
stop = X_set[:, 0].max() + 10, step = 1),
np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 1))

plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('K-NN (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
'c' argument looks like a single numeric RGB or RGBA sequence, which should be
avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.
Please use a 2-D array with a single row if you really want to specify the same RGB or
RGBA value for all points.
'c' argument looks like a single numeric RGB or RGBA sequence, which should be
avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.
Please use a 2-D array with a single row if you really want to specify the same RGB or
RGBA value for all points.


 Visualising the Test set results


from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_test), y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10,
stop = X_set[:, 0].max() + 10, step = 1),
np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 1))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('K-NN (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
'c' argument looks like a single numeric RGB or RGBA sequence, which should be
avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.
Please use a 2-D array with a single row if you really want to specify the same RGB or
RGBA value for all points.

Naive Bayes

 Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


 Importing the dataset


dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

 Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,
random_state = 0)
print(X_test)

[[ 30 87000]
[ 38 50000]
[ 35 75000]
[ 30 79000]
[ 35 50000]
[ 49 36000]
[ 27 88000]
[ 41 52000]
[ 27 84000]
[ 35 20000]
[ 43 112000]
[ 27 58000]
[ 37 80000]
[ 52 90000]
[ 26 30000]
[ 49 86000]
[ 57 122000]
[ 34 25000]
[ 35 57000]
[ 34 115000]
[ 59 88000]
[ 45 32000]
[ 29 83000]
[ 24 55000]
[ 26 35000]
[ 58 38000]
[ 42 80000]
[ 40 75000]
[ 59 130000]
|
|
|
|

[ 37 146000]
[ 23 48000]
[ 25 33000]
[ 24 84000]


[ 27 96000]
[ 23 63000]
[ 48 33000]
[ 48 90000]
[ 42 104000]]

print(y_test)

[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0
 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 1 0 0 1
0 0 0 0 1 1 1 0 0 0 1 1 0 1 1 0 0 1 0 0 0 1 0 1 1 1]

 Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print(X_test)

[[-0.80480212 0.50496393]
[-0.01254409 -0.5677824 ]
[-0.30964085 0.1570462 ]
[-0.80480212 0.27301877]
[-0.30964085 -0.5677824 ]
[-1.10189888 -1.43757673]
[-0.70576986 -1.58254245]
[-0.21060859 2.15757314]
[-1.99318916 -0.04590581]
[ 0.8787462 -0.77073441]
[-0.80480212 -0.59677555]
[-1.00286662 -0.42281668]
[ 0.28455268 -0.53878926]
[ 0.8787462 1.02684052]
[-1.49802789 -1.20563157]
[ 1.07681071 2.07059371]
[-1.00286662 0.50496393]
[-0.90383437 0.30201192]
[-0.11157634 -0.21986468]
[-0.60673761 0.47597078]
[-1.6960924 0.53395707]
[-0.11157634 0.27301877]
[ 1.86906873 -0.27785096]
[-0.11157634 -0.48080297]
[-1.39899564 -0.33583725]
[-1.99318916 -0.50979612]
[-1.59706014 0.33100506]
[-0.4086731 -0.77073441]
[-0.70576986 -1.03167271]


[ 1.07681071 -0.97368642]
[-1.10189888 0.53395707]
[ 0.28455268 -0.50979612]
[-1.10189888 0.41798449]
[-0.30964085 -1.43757673]
[ 0.48261718 1.22979253]
|
|
|
|

[-1.10189888 0.76590222]
[-1.49802789 -0.19087153]
[ 0.97777845 -1.06066585]
[ 0.97777845 0.59194336]
[ 0.38358493 0.99784738]]

 Training the Naive Bayes model on the Training set


from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
GaussianNB(priors=None, var_smoothing=1e-09)

 Predicting a new result


print(classifier.predict(sc.transform([[30,87000]])))

[0]

 Predicting the Test set results


y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[1 1]
[0 0]
[1 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]


[0 0]
[0 0]
[1 1]
[0 0]
[0 0]
[0 1]
[0 0]
[0 0]
[0 0]
[0 0]
[1 1]
[1 1]
[1 1]
[1 0]
[0 0]
[0 0]
[1 1]
[0 1]
[0 0]
[1 1]
[0 1]
[0 0]
[0 0]
|
|
|
|

[0 1]
[0 0]
[1 1]
[1 1]
[1 1]]

 Making the Confusion Matrix


from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[65 3]
[ 7 25]]

0.9

 Visualising the Training set results


from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_train), y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Naive Bayes (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'. Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.

 Visualising the Test set results


from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_test), y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Naive Bayes (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'. Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.
'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'. Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.

CONCLUSION:
In this practical I have learned how to use K – Nearest Neighbours and Naive Bayes
Classifier.


PRACTICAL – 6

AIM: Use K-Means Clustering and Hierarchical Clustering algorithm for following
datasets.

OUTPUT:

K-Means Clustering
 Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

 Importing the dataset


dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values

 Using the elbow method to find the optimal number of clusters


from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
kmeans.fit(X)
wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()


 Training the K-Means model on the dataset


kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)

 Visualising the clusters


plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()


Hierarchical Clustering
 Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

 Importing the dataset


dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values

 Using the dendrogram to find the optimal number of clusters


import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()


 Training the Hierarchical Clustering model on the dataset


from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters = 5, affinity = 'euclidean', linkage = 'ward')
y_hc = hc.fit_predict(X)

 Visualising the clusters


plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

CONCLUSION:
In this practical, we’ve learnt to perform K – Means Clustering and Hierarchical
Clustering on datasets.


PRACTICAL – 7

AIM: Build an Artificial Neural Network by implementing the Backpropagation


Algorithm and test the same using appropriate data sets.

OUTPUT:

Artificial Neural Network


 Importing the libraries
import numpy as np
import pandas as pd
import tensorflow as tf
tf.__version__

'2.2.0'

Part 1 - Data Preprocessing


 Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:-1].values
y = dataset.iloc[:, -1].values
print(X)

[[619 'France' 'Female' ... 1 1 101348.88]


[608 'Spain' 'Female' ... 0 1 112542.58]
[502 'France' 'Female' ... 1 0 113931.57]
...
[709 'France' 'Female' ... 0 1 42085.58]
[772 'Germany' 'Male' ... 1 0 92888.52]
[792 'France' 'Female' ... 1 0 38190.78]]

print(y)

[1 0 1 ... 1 1 0]

 Encoding categorical data


Label Encoding the "Gender" column
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X[:, 2] = le.fit_transform(X[:, 2])

print(X)

[[619 'France' 0 ... 1 1 101348.88]


[608 'Spain' 0 ... 0 1 112542.58]


[502 'France' 0 ... 1 0 113931.57]


...
[709 'France' 0 ... 0 1 42085.58]
[772 'Germany' 1 ... 1 0 92888.52]
[792 'France' 0 ... 1 0 38190.78]]

 One Hot Encoding the "Geography" column


from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])],
remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)

[[1.0 0.0 0.0 ... 1 1 101348.88]


[0.0 0.0 1.0 ... 0 1 112542.58]
[1.0 0.0 0.0 ... 1 0 113931.57]
...
[1.0 0.0 0.0 ... 0 1 42085.58]
[0.0 1.0 0.0 ... 1 0 92888.52]
[1.0 0.0 0.0 ... 1 0 38190.78]]

 Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state =
0)

 Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Part 2 - Building the ANN


 Initializing the ANN
ann = tf.keras.models.Sequential()

 Adding the input layer and the first hidden layer


ann.add(tf.keras.layers.Dense(units=6, activation='relu'))

 Adding the second hidden layer


ann.add(tf.keras.layers.Dense(units=6, activation='relu'))

 Adding the output layer


ann.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))

Part 3 - Training the ANN


 Compiling the ANN
ann.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])


 Training the ANN on the Training set


ann.fit(X_train, y_train, batch_size = 32, epochs = 100)

Epoch 1/100
250/250 [==============================] - 0s 1ms/step - loss: 0.8037 - accuracy:
0.5185
Epoch 2/100
250/250 [==============================] - 0s 1ms/step - loss: 0.5291 - accuracy:
0.7901
Epoch 3/100
250/250 [==============================] - 0s 1ms/step - loss: 0.4888 - accuracy:
0.7952
Epoch 4/100
250/250 [==============================] - 0s 1ms/step - loss: 0.4668 - accuracy:
0.7979
Epoch 5/100
250/250 [==============================] - 0s 1ms/step - loss: 0.4478 - accuracy:
0.7994
Epoch 6/100
250/250 [==============================] - 0s 1ms/step - loss: 0.4302 - accuracy:
0.8049
Epoch 7/100
250/250 [==============================] - 0s 1ms/step - loss: 0.4123 - accuracy:
0.8119
Epoch 8/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3947 - accuracy:
0.8238
Epoch 9/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3807 - accuracy:
0.8355
Epoch 10/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3720 - accuracy:
0.8385
Epoch 11/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3664 - accuracy:
0.8425
Epoch 12/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3629 - accuracy:
0.8416
Epoch 13/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3599 - accuracy:
0.8471
Epoch 14/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3580 - accuracy:
0.8443
Epoch 15/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3564 - accuracy:
0.8456


|
|
|
Epoch 88/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3327 - accuracy:
0.8645
Epoch 89/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3325 - accuracy:
0.8674
Epoch 90/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3322 - accuracy:
0.8655
Epoch 91/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3327 - accuracy:
0.8650
Epoch 92/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3318 - accuracy:
0.8650
Epoch 93/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3322 - accuracy:
0.8635
Epoch 94/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3325 - accuracy:
0.8650
Epoch 95/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3317 - accuracy:
0.8662
Epoch 96/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3318 - accuracy:
0.8646
Epoch 97/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3319 - accuracy:
0.8649
Epoch 98/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3320 - accuracy:
0.8641
Epoch 99/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3314 - accuracy:
0.8648
Epoch 100/100
250/250 [==============================] - 0s 1ms/step - loss: 0.3314 - accuracy: 0.8660
<tensorflow.python.keras.callbacks.History at 0x7f8d3ce23978>

Part 4 - Making the predictions and evaluating the model


 Predicting the result of a single observation
Predict whether the customer with the following information will leave the bank:
Geography: France
Credit Score: 600
Gender: Male


Age: 40 years old


Tenure: 3 years
Balance: $ 60000
Number of Products: 2
Does this customer have a credit card ? Yes
Is this customer an Active Member: Yes

Estimated Salary: $ 50000


So, should we say goodbye to that customer ?

 Solution
print(ann.predict(sc.transform([[1, 0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]])) > 0.5)

[[False]]

Therefore, our ANN model predicts that this customer stays in the bank!
 Predicting the Test set results
y_pred = ann.predict(X_test)
y_pred = (y_pred > 0.5)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1)
)

[[0 0]
[0 1]
[0 0]
...
[0 0]
[0 0]
[0 0]]

 Making the Confusion Matrix


from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[1516 79]
[ 200 205]]

0.8605

CONCLUSION:
In this practical, we’ve learnt to build an Artificial Neural Network by implementing the
Backpropagation Algorithm


PRACTICAL – 8

AIM: Implement the Multi-Layer Perceptron from scratch with at least 3 layers for a
classification or a regression problem of your choice, implement Backpropogation and
observe Underfitting, Overfitting and Regularization.

CODE:

import pandas as pd
import numpy as np
df = pd.read_csv('WineQT.csv')
df.head(5)
df.drop('Id', axis=1, inplace=True)
df.head()
from sklearn.neural_network import MLPClassifier
y = df['quality']
y.head()
X = df.drop('quality', axis=1, inplace=False)
X.head()
X.describe()

model = MLPClassifier(solver='sgd', hidden_layer_sizes=(16, 8), random_state=1,
                      learning_rate_init=0.005, learning_rate='adaptive', verbose=True,
                      validation_fraction=0.1, early_stopping=True)
model.fit(X, y)
model2 = MLPClassifier(solver='sgd', hidden_layer_sizes=(24, 8), random_state=1,
learning_rate_init=0.005, learning_rate='adaptive', verbose=True, validation_fraction=0.1,
early_stopping=True)
model2.fit(X,y)
import matplotlib.pyplot as plt
plt.plot(model2.loss_curve_)
model3 = MLPClassifier(solver='sgd', hidden_layer_sizes=(16, 12), random_state=1,
learning_rate_init=0.001, learning_rate='invscaling', verbose=True, validation_fraction=0.1,
early_stopping=True)
model3.fit(X,y)
plt.plot(model3.loss_curve_)
model3.score(X,y)

import pickle
filename = 'finalized_sklearn_classification_model.sav'
pickle.dump(model3, open(filename, 'wb'))
loaded_model = pickle.load(open(filename, 'rb'))
loaded_model.score(X,y)
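The aim also asks to observe underfitting, overfitting and regularization. A minimal sketch, assuming the X and y built above, that varies the L2 penalty alpha of MLPClassifier and compares training and validation accuracy:

from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=1)
for alpha in [0.0001, 0.01, 1.0, 10.0]:      # larger alpha = stronger L2 regularization
    mlp = MLPClassifier(hidden_layer_sizes=(16, 8), alpha=alpha, max_iter=1000, random_state=1)
    mlp.fit(X_tr, y_tr)
    # a large gap between the two scores suggests overfitting; low scores on both suggest underfitting
    print('alpha=%-7s train=%.3f validation=%.3f' % (alpha, mlp.score(X_tr, y_tr), mlp.score(X_val, y_val)))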


OUTPUT:-

df.head

Model Fitting:-

Model 1

Model 2


Model 3
Curve Loss:-

Model 1

Model 2


Model 3

Model Score
CONCLUSION:
From this practical I have learnt to implement the Multi-Layer Perceptron from scratch with at least 3 layers for a classification or a regression problem.


PRACTICAL – 9

AIM: Implement Boosting Algorithms on the same dataset and analyse which one provides the best prediction.

CODE:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import style
style.use('fivethirtyeight')
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_validate
import scipy.stats as sps

# Load in the data and define the column labels

dataset = pd.read_csv('data\mushroom.csv',header=None)
dataset = dataset.sample(frac=1)
dataset.columns = ['target','cap-shape','cap-surface','cap-color','bruises','odor','gill-attachment','gill-spacing',
                   'gill-size','gill-color','stalk-shape','stalk-root','stalk-surface-above-ring','stalk-surface-below-ring',
                   'stalk-color-above-ring','stalk-color-below-ring','veil-type','veil-color','ring-number','ring-type',
                   'spore-print-color','population','habitat']

# Encode the feature values from strings to integers, since the sklearn
# DecisionTreeClassifier only takes numerical values
for label in dataset.columns:
    dataset[label] = LabelEncoder().fit(dataset[label]).transform(dataset[label])

Tree_model = DecisionTreeClassifier(criterion="entropy",max_depth=1)

X = dataset.drop('target',axis=1)
Y = dataset['target'].where(dataset['target']==1,-1)

predictions = np.mean(cross_validate(Tree_model,X,Y,cv=100)['test_score'])

print('The accuracy is: ',predictions*100,'%')


OUTPUT:-

CODE (continued):-

class Boosting:

    def __init__(self, dataset, T, test_dataset):
        self.dataset = dataset
        self.T = T
        self.test_dataset = test_dataset
        self.alphas = None
        self.models = None
        self.accuracy = []
        self.predictions = None

    def fit(self):
        # Set the descriptive features and the target feature
        X = self.dataset.drop(['target'], axis=1)
        Y = self.dataset['target'].where(self.dataset['target'] == 1, -1)

        # Initialize the weights of each sample with wi = 1/N and create a dataframe
        # in which the evaluation is computed
        Evaluation = pd.DataFrame(Y.copy())
        Evaluation['weights'] = 1 / len(self.dataset)  # Set the initial weights w = 1/N

        # Run the boosting algorithm by creating T "weighted models"
        alphas = []
        models = []

        for t in range(self.T):
            # Train the Decision Stump(s) -- mind the depth of one --> Decision Stump
            Tree_model = DecisionTreeClassifier(criterion="entropy", max_depth=1)

            # We must train our decision stumps on weighted datasets where the weights
            # depend on the results of the previous decision stumps. To accomplish that,
            # we use the 'weights' column of the Evaluation dataframe together with the
            # sample_weight parameter of the fit method. The documentation for
            # sample_weight says: "[...] If None, then samples are equally weighted."
            # Consequently, if NOT None, the samples are NOT equally weighted and we
            # thereby create a WEIGHTED dataset, which is exactly what we want to have.
            model = Tree_model.fit(X, Y, sample_weight=np.array(Evaluation['weights']))

            # Append the single weak classifiers to a list which is later on used to
            # make the weighted decision
            models.append(model)
            predictions = model.predict(X)
            score = model.score(X, Y)

            # Add values to the Evaluation DataFrame
            Evaluation['predictions'] = predictions
            Evaluation['evaluation'] = np.where(Evaluation['predictions'] == Evaluation['target'], 1, 0)
            Evaluation['misclassified'] = np.where(Evaluation['predictions'] != Evaluation['target'], 1, 0)

            # Calculate the misclassification rate and accuracy
            accuracy = sum(Evaluation['evaluation']) / len(Evaluation['evaluation'])
            misclassification = sum(Evaluation['misclassified']) / len(Evaluation['misclassified'])

            # Calculate the weighted error
            err = np.sum(Evaluation['weights'] * Evaluation['misclassified']) / np.sum(Evaluation['weights'])

            # Calculate the alpha values
            alpha = np.log((1 - err) / err)
            alphas.append(alpha)

            # Update the weights wi --> these updated weights are used in the
            # sample_weight parameter for the training of the next decision stump.
            Evaluation['weights'] *= np.exp(alpha * Evaluation['misclassified'])
            # print('The Accuracy of the {0}. model is : '.format(t+1), accuracy*100, '%')
            # print('The misclassification rate is: ', misclassification*100, '%')

        self.alphas = alphas
        self.models = models
    def predict(self):
        X_test = self.test_dataset.drop(['target'], axis=1).reindex(range(len(self.test_dataset)))
        Y_test = self.test_dataset['target'].reindex(range(len(self.test_dataset))).where(self.test_dataset['target'] == 1, -1)

        # With each model in the self.models list, make a prediction
        accuracy = []
        predictions = []

        for alpha, model in zip(self.alphas, self.models):
            # We use the predict method of the single DecisionTreeClassifier models in the list
            prediction = alpha * model.predict(X_test)
            predictions.append(prediction)
            self.accuracy.append(np.sum(np.sign(np.sum(np.array(predictions), axis=0)) == Y_test.values)
                                 / len(predictions[0]))
            # How the line above works, step by step. Goal: build a list of accuracies
            # that can be plotted against the number of base learners used in the model.
            # 1. np.array(predictions) stacks the weighted predictions of the single
            #    models; with n models so far it has n rows of 8124 values, one weighted
            #    prediction per instance. Each model is influenced by its predecessors,
            #    since the results of model n-1 are used to alter the weights with which
            #    model n is trained.
            # 2. np.sum(..., axis=0) sums the weighted predictions of all models for each
            #    instance, as the boosting prediction formula requires.
            # 3. np.sign(...) maps each combined prediction back to {-1, 1}, the format
            #    of the target values. The larger the absolute value of the combined
            #    prediction, the more likely it is that the ensemble returns the correct
            #    prediction.
            # 4. Comparing with == Y_test.values yields an array of {True, False}.
            #    Summing it counts the correct predictions (True == 1, False == 0) and
            #    dividing by the number of instances (len(predictions[0]) = 8124) gives
            #    the fraction classified correctly: 0 if nothing is correct, 1 if
            #    everything is.
            # 5. The result is appended to self.accuracy, so the t-th entry is the
            #    accuracy of the boosted model that combines the first t base learners.
            #    This is the list that is plotted below.

        self.predictions = np.sign(np.sum(np.array(predictions), axis=0))
# Plot the accuracy of the model against the number of stump-models used
number_of_base_learners = 50
fig = plt.figure(figsize=(10, 10))
ax0 = fig.add_subplot(111)

for i in range(number_of_base_learners):
    model = Boosting(dataset, i, dataset)
    model.fit()
    model.predict()

ax0.plot(range(len(model.accuracy)), model.accuracy, '-b')
ax0.set_xlabel('# models used for Boosting ')
ax0.set_ylabel('accuracy')
print('With a number of ', number_of_base_learners, 'base models we receive an accuracy of ',
      model.accuracy[-1] * 100, '%')
plt.show()
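
To analyse which boosting algorithm gives the best prediction on the same dataset, the scratch AdaBoost above can be compared against library boosters. A hedged sketch using scikit-learn's AdaBoostClassifier and GradientBoostingClassifier on the same encoded X and Y; the hyperparameter values are illustrative assumptions:

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Both boosters are built from shallow trees, like the decision stumps above
ada = AdaBoostClassifier(n_estimators=50, random_state=1)
gbm = GradientBoostingClassifier(n_estimators=50, max_depth=1, random_state=1)

for name, clf in [('AdaBoost', ada), ('Gradient Boosting', gbm)]:
    scores = cross_val_score(clf, X, Y, cv=10)
    print('%s mean accuracy: %.3f' % (name, scores.mean()))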


OUTPUT:

CONCLUSION:
From this practical, I have learnt the implementation of a Boosting algorithm.


PRACTICAL – 10

AIM: Train a Reinforcement Learning Agent for the Multi-Armed Bandit Problem and
visualize the results using matplotlib or seaborn libraries in Python. Consider at least 15
arms (n=15).

CODE:

!pip install tf-agents


import abc
import numpy as np
import tensorflow as tf
from tf_agents.agents import tf_agent
from tf_agents.drivers import driver
from tf_agents.environments import py_environment
from tf_agents.environments import tf_environment
from tf_agents.environments import tf_py_environment
from tf_agents.policies import tf_policy
from tf_agents.specs import array_spec
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts
from tf_agents.trajectories import trajectory
from tf_agents.trajectories import policy_step
nest = tf.nest
class BanditPyEnvironment(py_environment.PyEnvironment):

    def __init__(self, observation_spec, action_spec):
        self._observation_spec = observation_spec
        self._action_spec = action_spec
        super(BanditPyEnvironment, self).__init__()

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec

    def _empty_observation(self):
        return tf.nest.map_structure(lambda x: np.zeros(x.shape, x.dtype),
                                     self.observation_spec())

    def _reset(self):
        return ts.restart(self._observe(), batch_size=self.batch_size)

    def _step(self, action):
        reward = self._apply_action(action)
        return ts.termination(self._observe(), reward)

    @abc.abstractmethod
    def _observe(self):
        """Returns an observation."""

    @abc.abstractmethod
    def _apply_action(self, action):
        """Applies `action` to the environment and returns the corresponding reward."""


class SimplePyEnvironment(BanditPyEnvironment):

    def __init__(self):
        action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0, maximum=2, name='action')
        observation_spec = array_spec.BoundedArraySpec(
            shape=(1,), dtype=np.int32, minimum=-2, maximum=2, name='observation')
        super(SimplePyEnvironment, self).__init__(observation_spec, action_spec)

    def _observe(self):
        self._observation = np.random.randint(-2, 3, (1,), dtype='int32')
        return self._observation

    def _apply_action(self, action):
        return action * self._observation
environment = SimplePyEnvironment()
observation = environment.reset().observation
print("observation: %d" % observation)
action = 2 #@param
print("action: %d" % action)
reward = environment.step(action).reward
print("reward: %f" % reward)
tf_environment = tf_py_environment.TFPyEnvironment(environment)
class SignPolicy(tf_policy.TFPolicy):

    def __init__(self):
        observation_spec = tensor_spec.BoundedTensorSpec(
            shape=(1,), dtype=tf.int32, minimum=-2, maximum=2)
        time_step_spec = ts.time_step_spec(observation_spec)
        action_spec = tensor_spec.BoundedTensorSpec(
            shape=(), dtype=tf.int32, minimum=0, maximum=2)
        super(SignPolicy, self).__init__(time_step_spec=time_step_spec,
                                         action_spec=action_spec)

    def _distribution(self, time_step):
        pass

    def _variables(self):
        return ()

    def _action(self, time_step, policy_state, seed):
        observation_sign = tf.cast(tf.sign(time_step.observation[0]), dtype=tf.int32)
        action = observation_sign + 1
        return policy_step.PolicyStep(action, policy_state)
sign_policy = SignPolicy()
current_time_step = tf_environment.reset()
print('Observation:')
print (current_time_step.observation)
action = sign_policy.action(current_time_step).action
print('Action:')
print (action)
reward = tf_environment.step(action).reward
print('Reward:')
print(reward)
step = tf_environment.reset()
action = 1
next_step = tf_environment.step(action)
reward = next_step.reward
next_observation = next_step.observation
print("Reward: ")
print(reward)


print("Next observation:")
print(next_observation)
class TwoWayPyEnvironment(BanditPyEnvironment):

    def __init__(self):
        action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0, maximum=2, name='action')
        observation_spec = array_spec.BoundedArraySpec(
            shape=(1,), dtype=np.int32, minimum=-2, maximum=2, name='observation')
        self._reward_sign = 2 * np.random.randint(2) - 1
        print("reward sign:")
        print(self._reward_sign)
        super(TwoWayPyEnvironment, self).__init__(observation_spec, action_spec)

    def _observe(self):
        self._observation = np.random.randint(-2, 3, (1,), dtype='int32')
        return self._observation

    def _apply_action(self, action):
        return self._reward_sign * action * self._observation[0]
two_way_tf_environment = tf_py_environment.TFPyEnvironment(TwoWayPyEnvironment())
class TwoWaySignPolicy(tf_policy.TFPolicy):

    def __init__(self, situation):
        observation_spec = tensor_spec.BoundedTensorSpec(
            shape=(1,), dtype=tf.int32, minimum=-2, maximum=2)
        action_spec = tensor_spec.BoundedTensorSpec(
            shape=(), dtype=tf.int32, minimum=0, maximum=2)
        time_step_spec = ts.time_step_spec(observation_spec)
        self._situation = situation
        super(TwoWaySignPolicy, self).__init__(time_step_spec=time_step_spec,
                                               action_spec=action_spec)

    def _distribution(self, time_step):
        pass

    def _variables(self):
        return [self._situation]

    def _action(self, time_step, policy_state, seed):
        sign = tf.cast(tf.sign(time_step.observation[0, 0]), dtype=tf.int32)

        def case_unknown_fn():
            return tf.constant(1, shape=(1,))

        def case_normal_fn():
            return tf.constant(sign + 1, shape=(1,))

        def case_flipped_fn():
            return tf.constant(1 - sign, shape=(1,))

        cases = [(tf.equal(self._situation, 0), case_unknown_fn),
                 (tf.equal(self._situation, 1), case_normal_fn),
                 (tf.equal(self._situation, 2), case_flipped_fn)]
        action = tf.case(cases, exclusive=True)
        return policy_step.PolicyStep(action, policy_state)
class SignAgent(tf_agent.TFAgent):

    def __init__(self):
        self._situation = tf.Variable(0, dtype=tf.int32)
        policy = TwoWaySignPolicy(self._situation)
        time_step_spec = policy.time_step_spec
        action_spec = policy.action_spec
        super(SignAgent, self).__init__(time_step_spec=time_step_spec,
                                        action_spec=action_spec,
                                        policy=policy,
                                        collect_policy=policy,
                                        train_sequence_length=None)

    def _initialize(self):
        return tf.compat.v1.variables_initializer(self.variables)

    def _train(self, experience, weights=None):
        observation = experience.observation
        action = experience.action
        reward = experience.reward
        needs_action = tf.logical_and(tf.equal(self._situation, 0),
                                      tf.not_equal(reward, 0))

        def new_situation_fn():
            """This returns either 1 or 2, depending on the signs."""
            return (3 - tf.sign(tf.cast(observation[0, 0, 0], dtype=tf.int32) *
                                tf.cast(action[0, 0], dtype=tf.int32) *
                                tf.cast(reward[0, 0], dtype=tf.int32))) / 2

        new_situation = tf.cond(needs_action,
                                new_situation_fn,
                                lambda: self._situation)
        new_situation = tf.cast(new_situation, tf.int32)
        tf.compat.v1.assign(self._situation, new_situation)
        return tf_agent.LossInfo((), ())
sign_agent = SignAgent()
def trajectory_for_bandit(initial_step, action_step, final_step):
    return trajectory.Trajectory(observation=tf.expand_dims(initial_step.observation, 0),
                                 action=tf.expand_dims(action_step.action, 0),
                                 policy_info=action_step.info,
                                 reward=tf.expand_dims(final_step.reward, 0),
                                 discount=tf.expand_dims(final_step.discount, 0),
                                 step_type=tf.expand_dims(initial_step.step_type, 0),
                                 next_step_type=tf.expand_dims(final_step.step_type, 0))

step = two_way_tf_environment.reset()
for _ in range(10):
    action_step = sign_agent.collect_policy.action(step)
    next_step = two_way_tf_environment.step(action_step.action)
    experience = trajectory_for_bandit(step, action_step, next_step)
    print(experience)
    sign_agent.train(experience)
    step = next_step
from tf_agents.bandits.agents import lin_ucb_agent
from tf_agents.bandits.environments import stationary_stochastic_py_environment as sspe
from tf_agents.bandits.metrics import tf_metrics
from tf_agents.drivers import dynamic_step_driver
from tf_agents.replay_buffers import tf_uniform_replay_buffer
import matplotlib.pyplot as plt
batch_size = 2 # @param


arm0_param = [-3, 0, 1, -2] # @param


arm1_param = [1, -2, 3, 0] # @param
arm2_param = [0, 0, 1, 1] # @param
def context_sampling_fn(batch_size):
    """Contexts from [-10, 10]^4."""
    def _context_sampling_fn():
        return np.random.randint(-10, 10, [batch_size, 4]).astype(np.float32)
    return _context_sampling_fn

class LinearNormalReward(object):
    """A class that acts as linear reward function when called."""
    def __init__(self, theta, sigma):
        self.theta = theta
        self.sigma = sigma
    def __call__(self, x):
        mu = np.dot(x, self.theta)
        return np.random.normal(mu, self.sigma)
arm0_reward_fn = LinearNormalReward(arm0_param, 1)
arm1_reward_fn = LinearNormalReward(arm1_param, 1)
arm2_reward_fn = LinearNormalReward(arm2_param, 1)
environment = tf_py_environment.TFPyEnvironment(
    sspe.StationaryStochasticPyEnvironment(
        context_sampling_fn(batch_size),
        [arm0_reward_fn, arm1_reward_fn, arm2_reward_fn],
        batch_size=batch_size))

observation_spec = tensor_spec.TensorSpec([4], tf.float32)
time_step_spec = ts.time_step_spec(observation_spec)
action_spec = tensor_spec.BoundedTensorSpec(
    dtype=tf.int32, shape=(), minimum=0, maximum=2)

agent = lin_ucb_agent.LinearUCBAgent(time_step_spec=time_step_spec,
                                     action_spec=action_spec)

def compute_optimal_reward(observation):
    expected_reward_for_arms = [
        tf.linalg.matvec(observation, tf.cast(arm0_param, dtype=tf.float32)),
        tf.linalg.matvec(observation, tf.cast(arm1_param, dtype=tf.float32)),
        tf.linalg.matvec(observation, tf.cast(arm2_param, dtype=tf.float32))]
    optimal_action_reward = tf.reduce_max(expected_reward_for_arms, axis=0)
    return optimal_action_reward

regret_metric = tf_metrics.RegretMetric(compute_optimal_reward)
num_iterations = 90  # @param
steps_per_loop = 1  # @param

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.policy.trajectory_spec,
    batch_size=batch_size,
    max_length=steps_per_loop)

observers = [replay_buffer.add_batch, regret_metric]
driver = dynamic_step_driver.DynamicStepDriver(
    env=environment,
    policy=agent.collect_policy,
    num_steps=steps_per_loop * batch_size,
    observers=observers)


regret_values = []
for _ in range(num_iterations):
    driver.run()
    loss_info = agent.train(replay_buffer.gather_all())
    replay_buffer.clear()
    regret_values.append(regret_metric.result())

plt.plot(regret_values)
plt.ylabel('Average Regret')
plt.xlabel('Number of Iterations')
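
The LinUCB example above uses three arms, while the aim asks for at least 15. Below is a minimal, self-contained sketch of an epsilon-greedy agent on a 15-armed bandit using only numpy and matplotlib; the reward means, epsilon and number of steps are assumed values chosen for illustration:

import numpy as np
import matplotlib.pyplot as plt

n_arms, n_steps, epsilon = 15, 2000, 0.1
rng = np.random.default_rng(0)
true_means = rng.normal(0, 1, n_arms)   # hidden mean reward of each arm

q_estimates = np.zeros(n_arms)          # running estimate of each arm's value
pull_counts = np.zeros(n_arms)
rewards = np.zeros(n_steps)

for t in range(n_steps):
    # Explore with probability epsilon, otherwise exploit the best estimate
    if rng.random() < epsilon:
        arm = int(rng.integers(n_arms))
    else:
        arm = int(np.argmax(q_estimates))
    reward = rng.normal(true_means[arm], 1.0)
    pull_counts[arm] += 1
    # Incremental sample-average update of the chosen arm's estimate
    q_estimates[arm] += (reward - q_estimates[arm]) / pull_counts[arm]
    rewards[t] = reward

plt.plot(np.cumsum(rewards) / (np.arange(n_steps) + 1))
plt.xlabel('Step')
plt.ylabel('Average reward so far')
plt.show()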

OUTPUT:-

Observation

Plotting Graph of Regret Values

CONCLUSION:
From this practical, I have learnt to train a Reinforcement Learning Agent for the
Multi-Armed Bandit Problem and visualize the results using the matplotlib or seaborn
libraries in Python.
