
GOVERNMENT ENGINEERING COLLEGE

HASSAN-573201

MACHINE LEARNING LABORATORY


(BCSL606)
As per VTU Syllabus/scheme for 6th Semester

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Prepared by:
Dr. VASANTHA KUMARA M
Associate Professor
Programme Educational Objectives (PEOs)

PEO1: Graduates of the program will have the ability to understand, analyse and design
Artificial Intelligence and Machine Learning solutions to real-world challenges.

PEO2: Graduates of this program will have the ability to gain employment and excel in their
professional careers and research to achieve higher goals.

PEO3: Graduates of the program will excel as socially committed engineers with high ethical and
moral values.

Program Outcomes (POs)

Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals,
and an engineering specialization to the solution of complex engineering problems.
Problem analysis: Identify, formulate, review research literature, and analyze complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences,
and engineering sciences.
Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the
public health and safety, and the cultural, societal, and environmental considerations.
Conduct investigations of complex problems: Use research-based knowledge and research methods
including design of experiments, analysis and interpretation of data, and synthesis of the information to
provide valid conclusions.
Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities with an
understanding of the limitations.
The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal,
health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional
engineering practice.
Environment and sustainability: Understand the impact of the professional engineering solutions in
societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable
development.
Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of
the engineering practice.
Individual and teamwork: Function effectively as an individual, and as a member or leader in diverse
teams, and in multidisciplinary settings.
Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports and
design documentation, make effective presentations, and give and receive clear instructions.
Project management and finance: Demonstrate knowledge and understanding of the engineering and
management principles and apply these to one’s own work, as a member and leader in a team, to
manage projects and in multidisciplinary environments.
Lifelong learning: Recognize the need for and have the preparation & ability to engage in
independent & lifelong learning in the broadest context of technological change.

Programme Specific Outcomes (PSOs)

PSO1: An ability to apply concepts of Artificial Intelligence and Machine Learning to design, develop
and implement solutions to solve technical problems.

PSO2: An ability to use Artificial Intelligence and Machine Learning knowledge for successful career
as an employee and an engineering professional.

Course Outcomes
CO1 Illustrate the principles of multivariate data and apply dimensionality reduction techniques
CO2 Demonstrate similarity-based learning methods and perform regression analysis.
CO3 Develop decision trees for classification and regression problems, and Bayesian models for
probabilistic learning.
CO4 Implement clustering algorithms to group unlabelled data and visualize the results.

Syllabus
Subject: Machine Learning Laboratory Subject Code: BCSL606

Programming Experiments

1. Develop a program to create histograms for all numerical features and analyze the distribution of
each feature. Generate box plots for all numerical features and identify any outliers. Use
California Housing dataset.
2. Develop a program to Compute the correlation matrix to understand the relationships between
pairs of features. Visualize the correlation matrix using a heatmap to know which variables have
strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between
features. Use California Housing dataset.
3. Develop a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.
4. For a given set of training data examples stored in a .CSV file, implement and demonstrate the
Find-S algorithm to output a description of the set of all hypotheses consistent with the training
examples
5. Develop a program to implement k-Nearest Neighbour algorithm to classify the randomly
generated 100 values of x in the range of [0,1]. Perform the following based on dataset generated.
a. Label the first 50 points {x1,……,x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1, else xi ∊ Class2
b. Classify the remaining points, x51,……,x100 using KNN. Perform this for k=1,2,3,4,5,20,30
6. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points.
Select appropriate data set for your experiment and draw graphs
7. Develop a program to demonstrate the working of Linear Regression and Polynomial Regression.
Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for vehicle fuel
efficiency prediction) for Polynomial Regression
8. Develop a program to demonstrate the working of the decision tree algorithm. Use Breast Cancer
Data set for building the decision tree and apply this knowledge to classify a new sample.
9. Develop a program to implement the Naive Bayesian classifier considering Olivetti Face Data set
for training. Compute the accuracy of the classifier, considering a few test data sets.
10. Develop a program to implement k-means clustering using Wisconsin Breast Cancer data set
and visualize the clustering result.

Program 1
AIM:
To visualize univariate data distribution using histograms and identify outliers using box plots on the
California Housing dataset.
Objectives:
1. Generate histograms for all numerical features.
2. Analyze the distribution of each feature.
3. Create box plots for all numerical features.
4. Identify outliers using the Interquartile Range (IQR) method.

Algorithm:
1. Load the California Housing dataset.
2. Extract numerical features from the dataset.
3. Plot histograms for each numerical feature.
4. Plot box plots to detect outliers.
5. Use IQR (Interquartile Range) method to find outliers:
o Compute Q1 (25th percentile) and Q3 (75th percentile).
o Compute IQR = Q3 - Q1.
o Identify outliers as values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
6. Display the summary statistics of the dataset.
1. Develop a program to create histograms for all numerical features and analyze the distribution
of each feature. Generate box plots for all numerical features and identify any outliers. Use
California Housing dataset.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Step 1: Load the California Housing dataset

data = fetch_california_housing(as_frame=True)
housing_df = data.frame

# Step 2: Create histograms for numerical features

numerical_features = housing_df.select_dtypes(include=[np.number]).columns

# Plot histograms
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.histplot(housing_df[feature], kde=True, bins=30, color='blue')
    plt.title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()

# Step 3: Generate box plots for numerical features

plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x=housing_df[feature], color='orange')
    plt.title(f'Box Plot of {feature}')
plt.tight_layout()
plt.show()

# Step 4: Identify outliers using the IQR method

print("Outliers Detection:")
outliers_summary = {}
for feature in numerical_features:
    Q1 = housing_df[feature].quantile(0.25)
    Q3 = housing_df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = housing_df[(housing_df[feature] < lower_bound) |
                          (housing_df[feature] > upper_bound)]
    outliers_summary[feature] = len(outliers)
    print(f"{feature}: {len(outliers)} outliers")

# Optional: Print a summary of the dataset


print("\nDataset Summary:")
print(housing_df.describe())

Explanation
 Histograms show the distribution of numerical features.
 Box plots help detect outliers using the IQR method.
 Used for data preprocessing and feature engineering.
 Helps in understanding skewness and spread of data.
 Essential for identifying anomalies before model training.
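The skewness mentioned above can also be quantified numerically. The short sketch below is illustrative (it reloads the dataset so it can run on its own) and uses pandas' skew() method:

from sklearn.datasets import fetch_california_housing

housing_df = fetch_california_housing(as_frame=True).frame

# Positive skew -> long right tail; negative skew -> long left tail.
skewness = housing_df.skew(numeric_only=True).sort_values(ascending=False)
print("Skewness of each numerical feature:")
print(skewness)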
Program 2

AIM:
To analyze feature relationships using a correlation matrix and visualize feature relationships using a
heatmap and pair plot on the California Housing dataset.

Objectives:

1. Compute correlation matrix for all numerical features.


2. Visualize correlation matrix using a heatmap.
3. Identify strong positive/negative correlations.
4. Generate pair plots to examine feature relationships.

Algorithm:

1. Load the California Housing dataset.


2. Compute the correlation matrix between features.
3. Visualize the correlation matrix using a heatmap.
4. Generate a pair plot for feature relationships.
5. Identify highly correlated features.

2. Develop a program to Compute the correlation matrix to understand the relationships between pairs of
features. Visualize the correlation matrix using a heatmap to know which variables have strong
positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use
California Housing dataset.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Step 1: Load the California Housing Dataset
california_data = fetch_california_housing(as_frame=True)
data = california_data.frame

# Step 2: Compute the correlation matrix
correlation_matrix = data.corr()

# Step 3: Visualize the correlation matrix using a heatmap


plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f',
linewidths=0.5)
plt.title('Correlation Matrix of California Housing Features')
plt.show()

# Step 4: Create a pair plot to visualize pairwise relationships

sns.pairplot(data, diag_kind='kde', plot_kws={'alpha': 0.5})
plt.suptitle('Pair Plot of California Housing Features', y=1.02)
plt.show()
Explanation:

1. Correlation matrix helps find relationships between features.


2. Heatmaps visualize strong positive or negative correlations.
3. Used for feature selection to remove redundant attributes.
4. Pair plots provide insights into feature relationships and distributions.
5. Reduces the risk of multicollinearity in regression models
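One common follow-up to points 3 and 5, shown here only as an illustrative sketch (the 0.85 threshold is an assumption, not part of the syllabus), is to drop one feature from every pair whose absolute correlation exceeds a chosen threshold:

import numpy as np
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True).frame
corr = data.corr().abs()

# Keep only the upper triangle so each feature pair is examined once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.85  # illustrative choice
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print("Candidate columns to drop:", to_drop)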
Program 3
AIM:
To reduce the dimensionality of the Iris dataset from 4 features to 2 using Principal Component Analysis
(PCA).
Objectives:
1. Reduce the number of dimensions while preserving variance.
2. Visualize data in 2D space using PCA.
3. Identify the contribution of each principal component.

Algorithm:
1. Load the Iris dataset.
2. Extract features (X) and labels (Y).
3. Apply PCA transformation to reduce 4D data to 2D.
4. Create a scatter plot to visualize transformed data.
5. Display the explained variance ratio.

Explanation:
 PCA reduces dimensions while preserving maximum variance.
 Helps avoid overfitting and speeds up computation.
 Projects data onto a new set of axes (principal components).
 Used for visualizing high-dimensional data in 2D/3D.
 Commonly used in image compression and facial recognition.

3. Develop a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the Iris dataset


iris = load_iris()
data = iris.data
labels = iris.target
label_names = iris.target_names

# Convert to a DataFrame for better visualization


iris_df = pd.DataFrame(data, columns=iris.feature_names)

# Perform PCA to reduce dimensionality to 2


pca = PCA(n_components=2)
data_reduced = pca.fit_transform(data)

# Create a DataFrame for the reduced data


reduced_df = pd.DataFrame(data_reduced, columns=['Principal Component 1',
'Principal Component 2'])
reduced_df['Label'] = labels

# Plot the reduced data


plt.figure(figsize=(8, 6))
colors = ['r', 'g', 'b']
for i, label in enumerate(np.unique(labels)):
    plt.scatter(
        reduced_df[reduced_df['Label'] == label]['Principal Component 1'],
        reduced_df[reduced_df['Label'] == label]['Principal Component 2'],
        label=label_names[label], color=colors[i]
    )

plt.title('PCA on Iris Dataset')


plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid()
plt.show()
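Step 5 of the algorithm asks for the explained variance ratio, which the program above does not print; a minimal, self-contained sketch of that step is given below (the exact numbers depend on the dataset):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

pca = PCA(n_components=2).fit(load_iris().data)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance preserved:", pca.explained_variance_ratio_.sum())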

Output:

Program 4: Find-S Algorithm

AIM:
To implement and demonstrate the Find-S algorithm to find the most specific hypothesis consistent with the
training examples stored in a CSV file.

Objectives:

1. Read training data from a CSV file.


2. Implement Find-S algorithm for learning a hypothesis.
3. Output the most specific hypothesis.

Algorithm:

1. Read the training dataset.


2. Initialize the hypothesis as the first positive example.
3. For each positive training example:
o Update the hypothesis by retaining common attributes.
4. Return the final specific hypothesis.

4. For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-
S algorithm to output a description of the set of all hypotheses consistent with the training
examples.

import pandas as pd

def find_s_algorithm(file_path):
    data = pd.read_csv(file_path)

    print("Training data:")
    print(data)

    attributes = data.columns[:-1]
    class_label = data.columns[-1]

    hypothesis = None
    for index, row in data.iterrows():
        if row[class_label] == 'Yes':
            if hypothesis is None:
                # Initialise with the first positive example (most specific hypothesis).
                hypothesis = list(row[attributes])
            else:
                # Generalise attributes that disagree with the current hypothesis.
                for i, value in enumerate(row[attributes]):
                    if hypothesis[i] != value:
                        hypothesis[i] = '?'

    return hypothesis

file_path = 'C:\\Users\\Admin\\Downloads\\training_data.csv'
hypothesis = find_s_algorithm(file_path)
print("\nThe final hypothesis is:", hypothesis)

Output:
Explanation:

 Finds the most specific hypothesis consistent with positive examples.


 Works only for binary classification with positive labels.
 Sensitive to noise, as it assumes no contradictions.
 Cannot handle negative examples or missing values.
 Used for concept learning but is not practical for real-world data.
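The program expects a CSV file whose last column is the class label ('Yes'/'No'). The attribute names and rows below are only an illustrative assumption (the classic EnjoySport-style example), written out so the program can be run end to end after pointing file_path at the generated file:

import pandas as pd

# Hypothetical training data; the last column is the class label.
rows = [
    ["Sunny", "Warm", "Normal", "Strong", "Warm", "Same",   "Yes"],
    ["Sunny", "Warm", "High",   "Strong", "Warm", "Same",   "Yes"],
    ["Rainy", "Cold", "High",   "Strong", "Warm", "Change", "No"],
    ["Sunny", "Warm", "High",   "Strong", "Cool", "Change", "Yes"],
]
cols = ["Sky", "AirTemp", "Humidity", "Wind", "Water", "Forecast", "EnjoySport"]
pd.DataFrame(rows, columns=cols).to_csv("training_data.csv", index=False)

With this file, the Find-S implementation above should report the familiar hypothesis ['Sunny', 'Warm', '?', 'Strong', '?', '?'].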

Program 5: k-Nearest Neighbors (KNN) Algorithm


AIM:
To implement KNN classification on 100 randomly generated values in the range [0,1].

Objectives:

1. Generate 100 random values between 0 and 1.


2. Label the first 50 points:
o Class1 if x ≤ 0.5, else Class2.
3. Classify remaining 50 points using KNN (k=1,2,3,4,5,20,30).

Algorithm:

1. Generate 100 random values between 0 and 1.


2. Assign labels for first 50 points based on the condition (x ≤ 0.5 → Class1, x > 0.5 → Class2).
3. Compute Euclidean distance for each test point.
4. Find the k nearest neighbors and assign the majority class.
5. Repeat classification for k=1,2,3,4,5,20,30.

5. Develop a program to implement k-Nearest Neighbour algorithm to classify the randomly


generated 100 values of x in the range of [0,1]. Perform the following based on dataset generated.
a. Label the first 50 points {x1,……,x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1, else xi ∊ Class2
b. Classify the remaining points, x51,……,x100 using KNN. Perform this for k=1,2,3,4,5,20,30

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

data = np.random.rand(100)

labels = ["Class1" if x <= 0.5 else "Class2" for x in data[:50]]

def euclidean_distance(x1, x2):
    return abs(x1 - x2)

def knn_classifier(train_data, train_labels, test_point, k):
    distances = [(euclidean_distance(test_point, train_data[i]), train_labels[i])
                 for i in range(len(train_data))]
    distances.sort(key=lambda x: x[0])
    k_nearest_neighbors = distances[:k]
    k_nearest_labels = [label for _, label in k_nearest_neighbors]
    return Counter(k_nearest_labels).most_common(1)[0][0]

train_data = data[:50]
train_labels = labels

test_data = data[50:]

k_values = [1, 2, 3, 4, 5, 20, 30]

print("--- k-Nearest Neighbors Classification ---")


print("Training dataset: First 50 points labeled based on the
rule (x <= 0.5 -> Class1, x > 0.5 -> Class2)")
print("Testing dataset: Remaining 50 points to be classified\n")

results = {}

for k in k_values:
    print(f"Results for k = {k}:")
    classified_labels = [knn_classifier(train_data, train_labels, test_point, k)
                         for test_point in test_data]
    results[k] = classified_labels

    for i, label in enumerate(classified_labels, start=51):
        print(f"Point x{i} (value: {test_data[i - 51]:.4f}) is classified as {label}")
    print("\n")

print("Classification complete.\n")

for k in k_values:
    classified_labels = results[k]
    class1_points = [test_data[i] for i in range(len(test_data)) if classified_labels[i] == "Class1"]
    class2_points = [test_data[i] for i in range(len(test_data)) if classified_labels[i] == "Class2"]

    plt.figure(figsize=(10, 6))
    plt.scatter(train_data, [0] * len(train_data),
                c=["blue" if label == "Class1" else "red" for label in train_labels],
                label="Training Data", marker="o")
    plt.scatter(class1_points, [1] * len(class1_points), c="blue", label="Class1 (Test)", marker="x")
    plt.scatter(class2_points, [1] * len(class2_points), c="red", label="Class2 (Test)", marker="x")

    plt.title(f"k-NN Classification Results for k = {k}")
    plt.xlabel("Data Points")
    plt.ylabel("Classification Level")
    plt.legend()
    plt.grid(True)
    plt.show()

Output:
Training dataset: First 50 points labeled based on the rule (x <= 0.5 -> Class1, x > 0.5 -> Class2)
Testing dataset: Remaining 50 points to be classified

Results for k = 1:
Point x51 (value: 0.5702) is classified as Class2
Point x52 (value: 0.4654) is classified as Class1
Point x53 (value: 0.7016) is classified as Class2
Point x54 (value: 0.5964) is classified as Class2
Point x55 (value: 0.0643) is classified as Class1
Point x56 (value: 0.2698) is classified as Class1
Point x57 (value: 0.7124) is classified as Class2
Point x58 (value: 0.3219) is classified as Class1
Point x59 (value: 0.2637) is classified as Class1
Point x60 (value: 0.5483) is classified as Class2
Point x61 (value: 0.1561) is classified as Class1
Point x62 (value: 0.1592) is classified as Class1
Point x63 (value: 0.3752) is classified as Class1
Point x64 (value: 0.1299) is classified as Class1
Point x65 (value: 0.6934) is classified as Class2
Point x66 (value: 0.5240) is classified as Class2
Point x67 (value: 0.0203) is classified as Class1
Point x68 (value: 0.3789) is classified as Class1
Point x69 (value: 0.6866) is classified as Class2
Point x70 (value: 0.1834) is classified as Class1
Point x71 (value: 0.4197) is classified as Class1
Point x72 (value: 0.3608) is classified as Class1
Point x73 (value: 0.7579) is classified as Class2
Point x74 (value: 0.1624) is classified as Class1
Point x75 (value: 0.5943) is classified as Class2
Point x76 (value: 0.4097) is classified as Class1
Point x77 (value: 0.6124) is classified as Class2
Point x78 (value: 0.2794) is classified as Class1
Point x79 (value: 0.3193) is classified as Class1
Point x80 (value: 0.0503) is classified as Class1
Point x81 (value: 0.8038) is classified as Class2
Point x83 (value: 0.4230) is classified as Class1
Point x84 (value: 0.7250) is classified as Class2
Point x85 (value: 0.7162) is classified as Class2
Point x86 (value: 0.0725) is classified as Class1
Point x87 (value: 0.0752) is classified as Class1
Point x88 (value: 0.4676) is classified as Class1
Point x89 (value: 0.2256) is classified as Class1
Point x90 (value: 0.4552) is classified as Class1
Point x91 (value: 0.4787) is classified as Class1
Point x92 (value: 0.7390) is classified as Class2
Point x93 (value: 0.0649) is classified as Class1
Point x94 (value: 0.3373) is classified as Class1
Point x95 (value: 0.7719) is classified as Class2
Point x96 (value: 0.0512) is classified as Class1
Point x97 (value: 0.3012) is classified as Class1
Point x98 (value: 0.5966) is classified as Class2
Point x99 (value: 0.2897) is classified as Class1
Point x100 (value: 0.2176) is classified as Class1
Results for k = 2:
Point x51 (value: 0.5702) is classified as Class2
Point x52 (value: 0.4654) is classified as Class1
Point x53 (value: 0.7016) is classified as Class2
Point x54 (value: 0.5964) is classified as Class2
Point x55 (value: 0.0643) is classified as Class1
Point x56 (value: 0.2698) is classified as Class1
Point x57 (value: 0.7124) is classified as Class2
Point x58 (value: 0.3219) is classified as Class1
Point x59 (value: 0.2637) is classified as Class1
Point x60 (value: 0.5483) is classified as Class2
Point x61 (value: 0.1561) is classified as Class1
Point x62 (value: 0.1592) is classified as Class1
Point x63 (value: 0.3752) is classified as Class1
Point x64 (value: 0.1299) is classified as Class1
Point x65 (value: 0.6934) is classified as Class2
Point x66 (value: 0.5240) is classified as Class2
Point x67 (value: 0.0203) is classified as Class1
Point x68 (value: 0.3789) is classified as Class1
Point x69 (value: 0.6866) is classified as Class2
Point x70 (value: 0.1834) is classified as Class1
Point x71 (value: 0.4197) is classified as Class1
Point x72 (value: 0.3608) is classified as Class1
Point x73 (value: 0.7579) is classified as Class2
Point x74 (value: 0.1624) is classified as Class1
Point x75 (value: 0.5943) is classified as Class2
Point x76 (value: 0.4097) is classified as Class1
Point x77 (value: 0.6124) is classified as Class2
Point x78 (value: 0.2794) is classified as Class1
Point x79 (value: 0.3193) is classified as Class1
Point x80 (value: 0.0503) is classified as Class1
Point x81 (value: 0.8038) is classified as Class2
Point x82 (value: 0.0792) is classified as Class1
Point x83 (value: 0.4230) is classified as Class1
Point x84 (value: 0.7250) is classified as Class2
Point x85 (value: 0.7162) is classified as Class2
Point x86 (value: 0.0725) is classified as Class1
Point x87 (value: 0.0752) is classified as Class1
Point x88 (value: 0.4676) is classified as Class1
Point x89 (value: 0.2256) is classified as Class1
Point x90 (value: 0.4552) is classified as Class1
Point x91 (value: 0.4787) is classified as Class1
Point x92 (value: 0.7390) is classified as Class2
Point x93 (value: 0.0649) is classified as Class1
Point x94 (value: 0.3373) is classified as Class1
Point x95 (value: 0.7719) is classified as Class2
Point x96 (value: 0.0512) is classified as Class1
Point x97 (value: 0.3012) is classified as Class1
Point x98 (value: 0.5966) is classified as Class2
Point x99 (value: 0.2897) is classified as Class1
Point x100 (value: 0.2176) is classified as Class1

Results for k = 3:
Point x51 (value: 0.5702) is classified as Class2
Point x53 (value: 0.7016) is classified as Class2
Point x54 (value: 0.5964) is classified as Class2
Point x55 (value: 0.0643) is classified as Class1
Point x56 (value: 0.2698) is classified as Class1
Point x57 (value: 0.7124) is classified as Class2
Point x58 (value: 0.3219) is classified as Class1
Point x59 (value: 0.2637) is classified as Class1
Point x60 (value: 0.5483) is classified as Class2
Point x61 (value: 0.1561) is classified as Class1
Point x62 (value: 0.1592) is classified as Class1
Point x63 (value: 0.3752) is classified as Class1
Point x64 (value: 0.1299) is classified as Class1
Point x65 (value: 0.6934) is classified as Class2
Point x66 (value: 0.5240) is classified as Class2
Point x67 (value: 0.0203) is classified as Class1
Point x68 (value: 0.3789) is classified as Class1
Point x69 (value: 0.6866) is classified as Class2
Point x70 (value: 0.1834) is classified as Class1
Point x71 (value: 0.4197) is classified as Class1
Point x72 (value: 0.3608) is classified as Class1
Point x73 (value: 0.7579) is classified as Class2
Point x74 (value: 0.1624) is classified as Class1
Point x75 (value: 0.5943) is classified as Class2
Point x76 (value: 0.4097) is classified as Class1
Point x77 (value: 0.6124) is classified as Class2
Point x78 (value: 0.2794) is classified as Class1
Point x79 (value: 0.3193) is classified as Class1
Point x80 (value: 0.0503) is classified as Class1
Point x81 (value: 0.8038) is classified as Class2
Point x82 (value: 0.0792) is classified as Class1
Point x83 (value: 0.4230) is classified as Class1
Point x84 (value: 0.7250) is classified as Class2
Point x85 (value: 0.7162) is classified as Class2
Point x86 (value: 0.0725) is classified as Class1
Point x87 (value: 0.0752) is classified as Class1
Point x88 (value: 0.4676) is classified as Class1
Point x89 (value: 0.2256) is classified as Class1
Point x90 (value: 0.4552) is classified as Class1
Point x91 (value: 0.4787) is classified as Class1
Point x92 (value: 0.7390) is classified as Class2
Point x93 (value: 0.0649) is classified as Class1
Point x94 (value: 0.3373) is classified as Class1
Point x95 (value: 0.7719) is classified as Class2
Point x96 (value: 0.0512) is classified as Class1
Point x97 (value: 0.3012) is classified as Class1
Point x98 (value: 0.5966) is classified as Class2
Point x99 (value: 0.2897) is classified as Class1
Point x100 (value: 0.2176) is classified as Class1
Results for k = 4:
Point x51 (value: 0.5702) is classified as Class2
Point x52 (value: 0.4654) is classified as Class1
Point x53 (value: 0.7016) is classified as Class2
Point x54 (value: 0.5964) is classified as Class2
Point x55 (value: 0.0643) is classified as Class1
Point x56 (value: 0.2698) is classified as Class1
Point x57 (value: 0.7124) is classified as Class2
Point x58 (value: 0.3219) is classified as Class1
Point x59 (value: 0.2637) is classified as Class1
Point x60 (value: 0.5483) is classified as Class2
Point x61 (value: 0.1561) is classified as Class1
Point x62 (value: 0.1592) is classified as Class1
Point x63 (value: 0.3752) is classified as Class1
Point x64 (value: 0.1299) is classified as Class1
Point x65 (value: 0.6934) is classified as Class2
Point x66 (value: 0.5240) is classified as Class2
Point x67 (value: 0.0203) is classified as Class1
Point x68 (value: 0.3789) is classified as Class1
Point x69 (value: 0.6866) is classified as Class2
Point x70 (value: 0.1834) is classified as Class1
Point x71 (value: 0.4197) is classified as Class1
Point x72 (value: 0.3608) is classified as Class1
Point x73 (value: 0.7579) is classified as Class2
Point x75 (value: 0.5943) is classified as Class2
Point x76 (value: 0.4097) is classified as Class1
Point x77 (value: 0.6124) is classified as Class2
Point x78 (value: 0.2794) is classified as Class1
Point x79 (value: 0.3193) is classified as Class1
Point x80 (value: 0.0503) is classified as Class1
Point x81 (value: 0.8038) is classified as Class2
Point x82 (value: 0.0792) is classified as Class1
Point x83 (value: 0.4230) is classified as Class1
Point x84 (value: 0.7250) is classified as Class2
Point x85 (value: 0.7162) is classified as Class2
Point x86 (value: 0.0725) is classified as Class1
Point x87 (value: 0.0752) is classified as Class1
Point x88 (value: 0.4676) is classified as Class1
Point x89 (value: 0.2256) is classified as Class1
Point x90 (value: 0.4552) is classified as Class1
Point x91 (value: 0.4787) is classified as Class1
Point x92 (value: 0.7390) is classified as Class2
Point x93 (value: 0.0649) is classified as Class1
Point x94 (value: 0.3373) is classified as Class1
Point x95 (value: 0.7719) is classified as Class2
Point x96 (value: 0.0512) is classified as Class1
Point x97 (value: 0.3012) is classified as Class1
Point x98 (value: 0.5966) is classified as Class2
Point x99 (value: 0.2897) is classified as Class1
Point x100 (value: 0.2176) is classified as Class1
Results for k = 5:
Point x51 (value: 0.5702) is classified as Class2
Point x52 (value: 0.4654) is classified as Class1
Point x53 (value: 0.7016) is classified as Class2
Point x54 (value: 0.5964) is classified as Class2
Point x55 (value: 0.0643) is classified as Class1
Point x56 (value: 0.2698) is classified as Class1
Point x57 (value: 0.7124) is classified as Class2
Point x58 (value: 0.3219) is classified as Class1
Point x59 (value: 0.2637) is classified as Class1
Point x60 (value: 0.5483) is classified as Class2
Point x61 (value: 0.1561) is classified as Class1
Point x62 (value: 0.1592) is classified as Class1
Point x63 (value: 0.3752) is classified as Class1
Point x64 (value: 0.1299) is classified as Class1
Point x65 (value: 0.6934) is classified as Class2
Point x66 (value: 0.5240) is classified as Class2
Point x67 (value: 0.0203) is classified as Class1
Point x68 (value: 0.3789) is classified as Class1
Point x69 (value: 0.6866) is classified as Class2
Point x70 (value: 0.1834) is classified as Class1
Point x71 (value: 0.4197) is classified as Class1
Point x72 (value: 0.3608) is classified as Class1
Point x73 (value: 0.7579) is classified as Class2
Point x74 (value: 0.1624) is classified as Class1
Point x75 (value: 0.5943) is classified as Class2
Point x76 (value: 0.4097) is classified as Class1
Point x77 (value: 0.6124) is classified as Class2
Point x78 (value: 0.2794) is classified as Class1
Point x79 (value: 0.3193) is classified as Class1
Point x80 (value: 0.0503) is classified as Class1
Point x81 (value: 0.8038) is classified as Class2
Point x82 (value: 0.0792) is classified as Class1
Point x83 (value: 0.4230) is classified as Class1
Point x84 (value: 0.7250) is classified as Class2
Point x85 (value: 0.7162) is classified as Class2
Point x86 (value: 0.0725) is classified as Class1
Point x87 (value: 0.0752) is classified as Class1
Point x88 (value: 0.4676) is classified as Class1
Point x89 (value: 0.2256) is classified as Class1
Point x90 (value: 0.4552) is classified as Class1
Point x91 (value: 0.4787) is classified as Class2
Point x92 (value: 0.7390) is classified as Class2
Point x93 (value: 0.0649) is classified as Class1
Point x94 (value: 0.3373) is classified as Class1
Point x95 (value: 0.7719) is classified as Class2
Point x96 (value: 0.0512) is classified as Class1
Point x97 (value: 0.3012) is classified as Class1
Point x98 (value: 0.5966) is classified as Class2
Point x99 (value: 0.2897) is classified as Class1
Point x100 (value: 0.2176) is classified as Class1
Results for k = 20:
Point x51 (value: 0.5702) is classified as Class2
Point x52 (value: 0.4654) is classified as Class1
Point x53 (value: 0.7016) is classified as Class2
Point x54 (value: 0.5964) is classified as Class2
Point x55 (value: 0.0643) is classified as Class1
Point x56 (value: 0.2698) is classified as Class1
Point x57 (value: 0.7124) is classified as Class2
Point x58 (value: 0.3219) is classified as Class1
Point x59 (value: 0.2637) is classified as Class1
Point x60 (value: 0.5483) is classified as Class2
Point x61 (value: 0.1561) is classified as Class1
Point x62 (value: 0.1592) is classified as Class1
Point x63 (value: 0.3752) is classified as Class1
Point x64 (value: 0.1299) is classified as Class1
Point x65 (value: 0.6934) is classified as Class2
Point x66 (value: 0.5240) is classified as Class2
Point x67 (value: 0.0203) is classified as Class1
Point x68 (value: 0.3789) is classified as Class1
Point x69 (value: 0.6866) is classified as Class2
Point x70 (value: 0.1834) is classified as Class1
Point x71 (value: 0.4197) is classified as Class1
Point x72 (value: 0.3608) is classified as Class1
Point x73 (value: 0.7579) is classified as Class2
Point x74 (value: 0.1624) is classified as Class1
Point x75 (value: 0.5943) is classified as Class2
Point x76 (value: 0.4097) is classified as Class1
Point x77 (value: 0.6124) is classified as Class2
Point x78 (value: 0.2794) is classified as Class1
Point x79 (value: 0.3193) is classified as Class1
Point x80 (value: 0.0503) is classified as Class1
Point x81 (value: 0.8038) is classified as Class2
Point x82 (value: 0.0792) is classified as Class1
Point x83 (value: 0.4230) is classified as Class1
Point x84 (value: 0.7250) is classified as Class2
Point x85 (value: 0.7162) is classified as Class2
Point x86 (value: 0.0725) is classified as Class1
Point x87 (value: 0.0752) is classified as Class1
Point x88 (value: 0.4676) is classified as Class1
Point x89 (value: 0.2256) is classified as Class1
Point x90 (value: 0.4552) is classified as Class1
Point x91 (value: 0.4787) is classified as Class1
Point x92 (value: 0.7390) is classified as Class2
Point x93 (value: 0.0649) is classified as Class1
Point x94 (value: 0.3373) is classified as Class1
Point x95 (value: 0.7719) is classified as Class2
Point x96 (value: 0.0512) is classified as Class1
Point x97 (value: 0.3012) is classified as Class1
Point x98 (value: 0.5966) is classified as Class2
Point x99 (value: 0.2897) is classified as Class1
Point x100 (value: 0.2176) is classified as Class1

Results for k = 30:


Point x51 (value: 0.5702) is classified as Class2
Point x52 (value: 0.4654) is classified as Class1
Point x53 (value: 0.7016) is classified as Class2
Point x54 (value: 0.5964) is classified as Class2
Point x55 (value: 0.0643) is classified as Class1
Point x56 (value: 0.2698) is classified as Class1
Point x57 (value: 0.7124) is classified as Class2
Point x58 (value: 0.3219) is classified as Class1
Point x59 (value: 0.2637) is classified as Class1
Point x60 (value: 0.5483) is classified as Class2
Point x61 (value: 0.1561) is classified as Class1
Point x62 (value: 0.1592) is classified as Class1
Point x63 (value: 0.3752) is classified as Class1
Point x64 (value: 0.1299) is classified as Class1
Point x65 (value: 0.6934) is classified as Class2
Point x66 (value: 0.5240) is classified as Class2
Point x67 (value: 0.0203) is classified as Class1
Point x68 (value: 0.3789) is classified as Class1
Point x69 (value: 0.6866) is classified as Class2
Point x70 (value: 0.1834) is classified as Class1
Point x71 (value: 0.4197) is classified as Class1
Point x72 (value: 0.3608) is classified as Class1
Point x73 (value: 0.7579) is classified as Class2
Point x74 (value: 0.1624) is classified as Class1
Point x75 (value: 0.5943) is classified as Class2
Point x76 (value: 0.4097) is classified as Class1
Point x77 (value: 0.6124) is classified as Class2
Point x78 (value: 0.2794) is classified as Class1
Point x79 (value: 0.3193) is classified as Class1
Point x80 (value: 0.0503) is classified as Class1
Point x81 (value: 0.8038) is classified as Class2
Point x82 (value: 0.0792) is classified as Class1
Point x83 (value: 0.4230) is classified as Class1
Point x84 (value: 0.7250) is classified as Class2
Point x85 (value: 0.7162) is classified as Class2
Point x86 (value: 0.0725) is classified as Class1
Point x87 (value: 0.0752) is classified as Class1
Point x88 (value: 0.4676) is classified as Class1
Point x89 (value: 0.2256) is classified as Class1
Point x90 (value: 0.4552) is classified as Class1
Point x91 (value: 0.4787) is classified as Class1
Point x92 (value: 0.7390) is classified as Class2
Point x93 (value: 0.0649) is classified as Class1
Point x94 (value: 0.3373) is classified as Class1
Point x95 (value: 0.7719) is classified as Class2
Point x96 (value: 0.0512) is classified as Class1
Point x97 (value: 0.3012) is classified as Class1
Point x98 (value: 0.5966) is classified as Class2
Point x99 (value: 0.2897) is classified as Class1
Point x100 (value: 0.2176) is classified as Class1

Classification complete.

Explanation:

1. KNN is a lazy learner that stores all training instances.


2. Classifies new points based on the majority vote of k-nearest neighbors.
3. Works well for small datasets but is slow for large datasets.
4. The choice of k affects accuracy (small k = overfitting, large k = underfitting).
5. Used in recommendation systems and handwriting recognition.
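The same experiment can be cross-checked against scikit-learn's built-in classifier. The sketch below is only an illustrative equivalent (it generates its own data with a fixed seed, so the exact predictions will differ from the output shown above):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)               # fixed seed, illustrative
data = rng.random(100)
labels = np.where(data[:50] <= 0.5, "Class1", "Class2")

X_train = data[:50].reshape(-1, 1)
X_test = data[50:].reshape(-1, 1)

for k in [1, 2, 3, 4, 5, 20, 30]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, labels)
    pred = knn.predict(X_test)
    print(f"k={k}: first five test predictions -> {list(pred[:5])}")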
Program 6: Locally Weighted Regression (LWR)
AIM:
To implement Locally Weighted Regression (LWR) to fit data points with non-parametric regression.

Objectives:

1. Implement LWR algorithm for smooth curve fitting.


2. Assign higher weights to closer points.
3. Draw prediction curves.

Algorithm:

1. Load the dataset.


2. Compute weights using the Gaussian kernel function.
3. Compute weighted regression coefficients.
4. Make predictions for new points.
5. Plot the LWR regression curve

6. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points.
Select appropriate data set for your experiment and draw graphs

import numpy as np
import matplotlib.pyplot as plt

def gaussian_kernel(x, xi, tau):
    return np.exp(-np.sum((x - xi) ** 2) / (2 * tau ** 2))

def locally_weighted_regression(x, X, y, tau):
    m = X.shape[0]
    weights = np.array([gaussian_kernel(x, X[i], tau) for i in range(m)])
    W = np.diag(weights)
    X_transpose_W = X.T @ W
    theta = np.linalg.inv(X_transpose_W @ X) @ X_transpose_W @ y
    return x @ theta

np.random.seed(42)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * np.random.randn(100)
X_bias = np.c_[np.ones(X.shape), X]
x_test = np.linspace(0, 2 * np.pi, 200)
x_test_bias = np.c_[np.ones(x_test.shape), x_test]
tau = 0.5
y_pred = np.array([locally_weighted_regression(xi, X_bias, y, tau) for xi in x_test_bias])

plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='red', label='Training Data', alpha=0.7)
plt.plot(x_test, y_pred, color='blue', label=f'LWR Fit (tau={tau})',
linewidth=2)
plt.xlabel('X',fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Locally Weighted Regression', fontsize=14)
plt.legend(fontsize=10)
plt.grid(alpha=0.3)
plt.show()

Output:

Explanation:

 Non-parametric regression that assigns higher weight to nearby points.


 Useful for non-linear relationships in data.
 Avoids a global model, making predictions locally relevant.
 Sensitive to parameter tau, which controls locality strength.
 Used in stock price prediction and dynamic pricing models.
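Because tau controls how local the fit is, it is instructive to overlay fits for several bandwidths. The sketch below is self-contained and the tau values are illustrative choices only:

import numpy as np
import matplotlib.pyplot as plt

def lwr_predict(x_query, X, y, tau):
    # Gaussian weights centred on the query point (bias term already included in X).
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y
    return x_query @ theta

np.random.seed(42)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * np.random.randn(100)
X_bias = np.c_[np.ones_like(X), X]
x_query = np.c_[np.ones(200), np.linspace(0, 2 * np.pi, 200)]

plt.scatter(X, y, color='red', alpha=0.5, label='Training Data')
for tau in [0.1, 0.5, 2.0]:                  # small tau = wiggly fit, large tau = smooth fit
    y_hat = np.array([lwr_predict(q, X_bias, y, tau) for q in x_query])
    plt.plot(x_query[:, 1], y_hat, label=f'tau={tau}')
plt.legend()
plt.title('Effect of tau on Locally Weighted Regression')
plt.show()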
Program 7: Linear & Polynomial Regression

AIM:
To demonstrate Linear Regression on the California Housing dataset (used here in place of the Boston
Housing dataset, which has been removed from recent scikit-learn releases) and Polynomial Regression
on the Auto MPG dataset.

Objectives:

1. Fit a Linear Regression model for housing price prediction.


2. Fit a Polynomial Regression model for fuel efficiency prediction.
3. Evaluate model performance using MSE & R² score.

Algorithm:

1. Load the dataset (Boston Housing / Auto MPG).


2. Preprocess the data (handle missing values).
3. Split data into training and testing sets.
4. Train a Linear/Polynomial regression model.
5. Evaluate using Mean Squared Error (MSE) and R² score.
6. Plot the predicted vs actual values.

Explanation:
 Linear regression fits a straight line to data.
 Polynomial regression fits a curved line for non-linear patterns.
 Evaluated using Mean Squared Error (MSE) & R² score.
 Feature scaling improves polynomial regression performance.
 Used in housing price prediction and sales forecasting.
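For reference, both metrics mentioned above can be computed directly from their definitions; the short sketch below uses illustrative values only and mirrors what mean_squared_error and r2_score return:

import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])      # illustrative target values
y_pred = np.array([2.5, 5.5, 7.0, 8.0])      # illustrative predictions

mse = np.mean((y_true - y_pred) ** 2)                   # Mean Squared Error
ss_res = np.sum((y_true - y_pred) ** 2)                 # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)          # total sum of squares
r2 = 1 - ss_res / ss_tot                                # R^2 score

print(f"MSE = {mse:.3f}, R^2 = {r2:.3f}")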

7. Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use
Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for vehicle fuel efficiency
prediction) for Polynomial Regression.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score

def linear_regression_california():
    housing = fetch_california_housing(as_frame=True)
    X = housing.data[["AveRooms"]]
    y = housing.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    plt.scatter(X_test, y_test, color="blue", label="Actual")
    plt.plot(X_test, y_pred, color="red", label="Predicted")
    plt.xlabel("Average number of rooms (AveRooms)")
    plt.ylabel("Median value of homes ($100,000)")
    plt.title("Linear Regression - California Housing Dataset")
    plt.legend()
    plt.show()

    print("Linear Regression - California Housing Dataset")
    print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
    print("R^2 Score:", r2_score(y_test, y_pred))

def polynomial_regression_auto_mpg():
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
    column_names = ["mpg", "cylinders", "displacement", "horsepower", "weight",
                    "acceleration", "model_year", "origin"]
    # comment='\t' skips the quoted car-name field at the end of each line.
    data = pd.read_csv(url, sep='\s+', names=column_names, na_values="?", comment='\t')
    data = data.dropna()

    X = data["displacement"].values.reshape(-1, 1)
    y = data["mpg"].values

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    poly_model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LinearRegression())
    poly_model.fit(X_train, y_train)

    y_pred = poly_model.predict(X_test)

    plt.scatter(X_test, y_test, color="blue", label="Actual")
    plt.scatter(X_test, y_pred, color="red", label="Predicted")
    plt.xlabel("Displacement")
    plt.ylabel("Miles per gallon (mpg)")
    plt.title("Polynomial Regression - Auto MPG Dataset")
    plt.legend()
    plt.show()

    print("Polynomial Regression - Auto MPG Dataset")
    print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
    print("R^2 Score:", r2_score(y_test, y_pred))

if __name__ == "__main__":
    print("Demonstrating Linear Regression and Polynomial Regression\n")
    linear_regression_california()
    polynomial_regression_auto_mpg()
Output:
Program 8: Decision Tree Algorithm

AIM:
To build a Decision Tree classifier using the Breast Cancer dataset and apply it to classify a new sample.

Objectives:

1. Train a Decision Tree classifier.


2. Visualize the decision tree.
3. Predict the class of a new sample.

Algorithm:

1. Load the Breast Cancer dataset.


2. Split data into training & testing sets.
3. Train a Decision Tree Classifier.
4. Predict and evaluate accuracy.
5. Visualize the decision tree structure

8. Develop a program to demonstrate the working of the decision tree algorithm. Use Breast
Cancer Data set for building the decision tree and apply this knowledge to classify a new sample.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

new_sample = np.array([X_test[0]])
prediction = clf.predict(new_sample)

prediction_class = "Benign" if prediction == 1 else "Malignant"
print(f"Predicted Class for the new sample: {prediction_class}")

plt.figure(figsize=(12,8))
tree.plot_tree(clf, filled=True, feature_names=data.feature_names,
class_names=data.target_names)
plt.title("Decision Tree - Breast Cancer Dataset")
plt.show()

Output:

Explanation:

 Decision trees split the data at each node using threshold tests on feature values.
 The learned tree is easy to interpret and can be visualized as a set of if/else rules.
 Deep trees tend to overfit; pruning or limiting the depth improves generalization.
 Handles both classification and regression tasks without feature scaling.
 Used in medical diagnosis and credit risk assessment.
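Besides the graphical plot, scikit-learn can also print the learned rules as plain text, which is often easier to read for deep trees. A minimal sketch (the depth limit of 3 is an illustrative choice for readability):

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
clf = DecisionTreeClassifier(max_depth=3, random_state=42)   # depth limited for readability
clf.fit(data.data, data.target)

# Print the tree as nested if/else rules with feature names.
print(export_text(clf, feature_names=list(data.feature_names)))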
Program 9: Naïve Bayes Classifier
AIM:
To implement Naïve Bayes Classification using the Olivetti Face dataset and evaluate
accuracy.

Objectives:

1. Train a Naïve Bayes classifier.


2. Compute classification accuracy.
3. Visualize predicted vs actual labels.

Algorithm:

1. Load the Olivetti Face dataset.


2. Split into training & testing sets.
3. Train a Gaussian Naïve Bayes classifier.
4. Predict and compute accuracy, confusion matrix.
5. Display sample predictions.

9. Develop a program to implement the Naive Bayesian classifier considering Olivetti


Face Data set for training. Compute the accuracy of the classifier, considering a few
test data sets.

import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report,
confusion_matrix
import matplotlib.pyplot as plt

data = fetch_olivetti_faces(shuffle=True,random_state=42)
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)


print(f'Accuracy: {accuracy * 100:.2f}%')
print("\nClassification Report:")
print(classification_report(y_test,y_pred, zero_division=1))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

cross_val_accuracy = cross_val_score(gnb, X, y, cv=5, scoring='accuracy')
print(f'\nCross-validation accuracy: {cross_val_accuracy.mean() * 100:.2f}%')

fig, axes = plt.subplots(3, 5, figsize=(12, 8))
for ax, image, label, prediction in zip(axes.ravel(), X_test, y_test, y_pred):
    ax.imshow(image.reshape(64, 64), cmap=plt.cm.gray)
    ax.set_title(f"True: {label}, Pred: {prediction}")
    ax.axis('off')

plt.show()

Output:

Explanation:

 Naïve Bayes applies Bayes' theorem with the assumption that features are independent given the class.
 Gaussian Naïve Bayes models each continuous pixel feature with a normal distribution per class.
 Fast to train and works reasonably well even on high-dimensional data such as face images.
 Accuracy is reported on held-out test data and verified with cross-validation.
 Used in spam filtering, text classification, and face recognition tasks.
Accuracy = 80.83%
Classification Report:
precision recall f1-score support

0 0.67 1.00 0.80 2


1 1.00 1.00 1.00 2
2 0.33 0.67 0.44 3
3 1.00 0.00 0.00 5
4 1.00 0.50 0.67 4
5 1.00 1.00 1.00 2
7 1.00 0.75 0.86 4
8 1.00 0.67 0.80 3
9 1.00 0.75 0.86 4
10 1.00 1.00 1.00 3
11 1.00 1.00 1.00 1
12 0.40 1.00 0.57 4
13 1.00 0.80 0.89 5
14 1.00 0.40 0.57 5
15 0.67 1.00 0.80 2
16 1.00 0.67 0.80 3
17 1.00 1.00 1.00 3
18 1.00 1.00 1.00 3
19 0.67 1.00 0.80 2
20 1.00 1.00 1.00 3
21 1.00 0.67 0.80 3
22 1.00 0.60 0.75 5
23 1.00 0.75 0.86 4
24 1.00 1.00 1.00 3
25 1.00 0.75 0.86 4
26 1.00 1.00 1.00 2
27 1.00 1.00 1.00 5
28 0.50 1.00 0.67 2
29 1.00 1.00 1.00 2
30 1.00 1.00 1.00 2
31 1.00 0.75 0.86 4
32 1.00 1.00 1.00 2
34 0.25 1.00 0.40 1
35 1.00 1.00 1.00 5
36 1.00 1.00 1.00 3
37 1.00 1.00 1.00 1
38 1.00 0.75 0.86 4
39 0.50 1.00 0.67 5

accuracy 0.81 120


macro avg 0.89 0.85 0.83 120
weighted avg 0.91 0.81 0.81 120

Confusion Matrix:
[[2 0 0 ... 0 0 0]
[0 2 0 ... 0 0 0]
[0 0 2 ... 0 0 1]
...
[0 0 0 ... 1 0 0]
[0 0 0 ... 0 3 0]
[0 0 0 ... 0 0 5]]
Cross-validation accuracy: 87.25%
Program 10: k-Means Clustering
AIM:
To implement k-Means clustering using the Wisconsin Breast Cancer dataset and
visualize clustering results.

Objectives:

1. Perform k-Means clustering on the dataset.


2. Compare clusters with actual labels.
3. Visualize cluster centers and class distribution.

Algorithm:

1. Load the Breast Cancer dataset.


2. Scale the data using StandardScaler.
3. Apply k-Means clustering (k=2).
4. Evaluate clustering results using a confusion matrix.
5. Visualize clusters and centroids

10. Develop a program to implement k-means clustering using Wisconsin Breast Cancer data
set and visualize the clustering result.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report

data = load_breast_cancer()
X = data.data
y = data.target

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=2, random_state=42)


y_kmeans = kmeans.fit_predict(X_scaled)

print("Confusion Matrix:")
print(confusion_matrix(y, y_kmeans))
print("\nClassification Report:")
print(classification_report(y, y_kmeans))

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df['Cluster'] = y_kmeans
df['True Label'] = y

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1',
s=100, edgecolor='black', alpha=0.7)
plt.title('K-Means Clustering of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='True Label',
palette='coolwarm', s=100, edgecolor='black', alpha=0.7)
plt.title('True Labels of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="True Label")
plt.show()

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1',
s=100, edgecolor='black', alpha=0.7)
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], s=200, c='red', marker='X',
label='Centroids')

plt.title('K-Means Clustering with Centroids')


plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()
Output:
Explanation:

1. Unsupervised learning algorithm for grouping similar data points.


2. Each cluster is represented by a centroid.
3. K value selection is crucial (Elbow method helps).
4. Sensitive to initialization and may get stuck in local optima.
5. Used in customer segmentation and image compression.
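The elbow method mentioned in point 3 can be sketched as follows: fit k-means for several values of k and plot the inertia (within-cluster sum of squares); the bend in the curve suggests a reasonable k. The range of k values below is an illustrative choice:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

inertias = []
k_range = range(1, 11)
for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)      # within-cluster sum of squares

plt.plot(list(k_range), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.title('Elbow Method for Choosing k')
plt.show()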
Viva Questions:

1. How does machine learning differ from traditional programming?


In traditional programming, rules and logic are explicitly coded by developers. In machine
learning, the system learns patterns and rules from data, allowing it to adapt to new inputs without
being reprogrammed.

2. What is machine learning?


Machine learning is a subset of artificial intelligence that involves the development of
algorithms and statistical models that allow computers to learn and make predictions or decisions
based on data.

3. Explain the concept of "training" in machine learning?


Training in machine learning refers to the process of feeding data to a model so it can learn
patterns and relationships in the data to make accurate predictions or decisions.

4. How does machine learning differ from artificial intelligence?


Machine learning is a subset of artificial intelligence. AI encompasses a broader scope,
including rule-based systems, robotics, and natural language processing, while machine
learning specifically focuses on learning from data.

5. What are the main types of machine learning?


Supervised Learning: Learning from labeled data.
Unsupervised Learning: Learning from unlabeled data.
Reinforcement Learning: Learning through interaction with an environment to maximize
rewards.

6. Give an example of supervised learning.


Predicting house prices based on features like size, location, and number of bedrooms
using historical labeled data.

7. What is the difference between supervised and unsupervised learning?


Supervised learning uses labeled data to train the model, while unsupervised learning
works with unlabeled data to find hidden patterns or groupings.

8. How do you handle over-fitting in a machine learning model?


Using regularization techniques.
Pruning decision trees.
Increasing training data.
Using cross-validation.
9. What are the steps involved in the machine learning process?
Problem definition.
Data collection.
Data preprocessing (cleaning, normalization).
Feature selection and engineering.
Model selection and training.
Model evaluation.
Deployment and monitoring.

10. Why is data preprocessing important in machine learning?


Data preprocessing ensures that the data is clean, consistent, and suitable for analysis,
improving the model's accuracy and performance.

11. What are some real-world applications of machine learning?

Healthcare: Disease diagnosis and drug discovery.


Finance: Fraud detection and algorithmic trading.
Retail: Recommendation systems and inventory management.
Autonomous vehicles: Object detection and navigation.
Natural language processing: Chatbots and language translation.

12. What are structured and unstructured data?


Structured Data: Organized in a fixed format, like tables (e.g., relational databases).
Unstructured Data: Lacks a predefined format (e.g., text, images, videos).

13. What is Big Data?


Big Data refers to extremely large and complex datasets that cannot be processed using
traditional methods due to their volume, velocity, and variety.

14. What are the 5 V's of Big Data?


Volume: The amount of data.
Velocity: The speed at which data is generated.
Variety: Different types of data (structured, unstructured, semi-structured).
Veracity: The quality or accuracy of the data.
Value: The usefulness of the data.
15. What are the key components of a Big Data framework?
Data Sources: Origin of data (e.g., sensors, social media).
Data Storage: Systems like Hadoop HDFS, cloud storage.
Data Processing: Tools like Spark, MapReduce.
Data Analysis: Techniques like machine learning, statistical analysis.
Visualization: Presenting results through tools like Tableau or Power BI.
16. What is Hadoop, and why is it important?
Hadoop is an open-source framework for distributed storage and processing of Big
Data. It allows scalability and fault tolerance.

17. What are descriptive statistics?

Descriptive statistics summarizes and organizes data to describe its main features using
measures like mean, median, mode, and standard deviation.

18. What are the measures of central tendency?


Mean: The average value.
Median: The middle value in a sorted dataset.
Mode: The most frequently occurring value.

19. What are measures of dispersion?


Range: Difference between the highest and lowest values.
Variance: The average squared deviation from the mean.
Standard Deviation: The square root of variance.

20. What is univariate data analysis?


Univariate data analysis involves analyzing a single variable to understand its
distribution, central tendency, and variability.

21. What are some common methods of univariate data visualization?


Bar Charts: For categorical data.
Histograms: For continuous data.
Box Plots: To show distribution and detect outliers.
Pie Charts: To represent proportions.

22. How do you identify outliers in univariate data?


Box Plot: Data points outside the whiskers.
Z-Score: Data points with an absolute Z-score greater than 3.
IQR Method: Values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.

23. What is bivariate data?


Data involving two variables, often analyzed to find relationships or correlations
between them.

24. What is multivariate data?


Data involving three or more variables, analyzed to study complex relationships.
25. Give an example of bivariate and multivariate data.
Bivariate: Height vs. weight of individuals.
Multivariate: Height, weight, age, and income of individuals.

26. What are common techniques in multivariate statistics?


PCA (Principal Component Analysis), Factor Analysis, MANOVA (Multivariate Analysis of
Variance), and Cluster Analysis.

27. What is feature engineering?


The process of selecting, modifying, or creating features to improve the performance
of a model.

28. What is dimensionality reduction?


Reducing the number of features in a dataset while preserving as much information as
possible.

29. Name common dimensionality reduction techniques.


PCA, t-SNE, LDA (Linear Discriminant Analysis), and autoencoders.

30. What is nearest-neighbor learning?


A non-parametric method where predictions are based on the closest data points in the
feature space.

31. What is the K-Nearest-Neighbor algorithm?


A classification or regression algorithm that considers the 'k' closest data points to
make predictions.

32. What is weighted KNN?


A variant of KNN where closer neighbors are given more weight in predictions.
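In scikit-learn this corresponds to passing weights='distance' to KNeighborsClassifier; the toy data below is purely illustrative:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.1], [0.2], [0.4], [0.6], [0.8]])
y = ["Class1", "Class1", "Class1", "Class2", "Class2"]

# Closer neighbours get larger weight (inverse-distance weighting).
weighted_knn = KNeighborsClassifier(n_neighbors=3, weights='distance')
weighted_knn.fit(X, y)
print(weighted_knn.predict([[0.55]]))    # -> ['Class2']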

33. What is a nearest centroid classifier?


A classification method where each class is represented by the mean of its data points.

34. What is Locally Weighted Regression (LWR)?


A regression method that assigns weights to data points based on their distance from
the query point.

35. What is regression?


A statistical method for modeling the relationship between a dependent variable and
one or more independent variables.
36. Differentiate between linear and logistic regression.
Linear regression predicts continuous values, while logistic regression predicts
probabilities for classification.

37. What is multiple linear regression?


A regression model with multiple independent variables.

38. What is polynomial regression?


A regression model that fits a polynomial equation to the data.

39. What is a decision tree?


A tree-like model used for classification and regression tasks, where decisions are
made at nodes based on feature values.

40. What is decision tree induction?


The process of building a decision tree from training data.

41. What is probability-based learning?


A method of learning based on probabilistic models, such as Bayes' theorem.

42. What is Bayes' theorem?


A formula that describes the probability of an event based on prior knowledge of
related events.
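A short numerical illustration with hypothetical numbers: suppose 1% of emails are spam, and a filter flags 90% of spam and 5% of non-spam; the probability that a flagged email is really spam follows directly from Bayes' theorem:

# All probabilities below are hypothetical, for illustration only.
p_spam = 0.01                 # prior P(spam)
p_flag_given_spam = 0.90      # likelihood P(flagged | spam)
p_flag_given_ham = 0.05       # likelihood P(flagged | not spam)

p_flag = p_flag_given_spam * p_spam + p_flag_given_ham * (1 - p_spam)
p_spam_given_flag = (p_flag_given_spam * p_spam) / p_flag    # Bayes' theorem
print(f"P(spam | flagged) = {p_spam_given_flag:.3f}")        # about 0.154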

43. What is the Naïve Bayes algorithm?


A probabilistic classifier based on Bayes' theorem, assuming independence between
features.

44. What is an artificial neuron?


A computational model inspired by biological neurons, consisting of weights, bias,
and an activation function.

45. What are the types of ANNs?


Feedforward, convolutional, recurrent, and generative adversarial networks.

46. What is clustering?


An unsupervised learning method to group similar data points.
47. What is hierarchical clustering?
A method that builds a hierarchy of clusters using agglomerative or divisive approaches.

48. What is density-based clustering?


A method that groups data points based on density, such as DBSCAN.

49. What is reinforcement learning?


A learning paradigm where agents learn by interacting with an environment to maximize
cumulative rewards.

50. What is a Markov Decision Process (MDP)?


A mathematical framework for modeling decision-making in reinforcement learning.

51. What is Q-Learning?


A model-free RL algorithm that learns a value function to estimate the quality of actions.
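The core of Q-Learning is the tabular update Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',·) − Q(s,a)]. A minimal sketch of a single update step, with all values purely illustrative:

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # Q-table initialised to zero
alpha, gamma = 0.1, 0.9               # learning rate and discount factor (illustrative)

# One hypothetical transition: in state 0, action 1 earns reward 1.0 and leads to state 2.
s, a, r, s_next = 0, 1, 1.0, 2
Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
print(Q[s, a])                        # 0.1 after this single update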

52. What is SARSA?


An on-policy RL algorithm that updates its Q-values based on the current state-action pair
and the next state-action pair.
