This document outlines various machine learning tasks implemented in Python, including data visualization techniques using the California Housing dataset, Principal Component Analysis (PCA) on the Iris dataset, and the Find-S algorithm for hypothesis generation. It also covers the k-Nearest Neighbors (k-NN) algorithm for classification, Locally Weighted Regression for fitting data points, and both Linear and Polynomial Regression using the Boston Housing and Auto MPG datasets. Each section includes code snippets and explanations of the methods used to analyze and visualize the data.
Machine Learning Lab (BCSL606)

1. Develop a program to create histograms for all numerical features and analyze the distribution of each feature. Generate box plots for all numerical features and identify any outliers. Use the California Housing dataset.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Step 1: Load the California Housing dataset
data = fetch_california_housing(as_frame=True)
housing_df = data.frame

# Step 2: Create histograms for numerical features
numerical_features = housing_df.select_dtypes(include=[np.number]).columns

# Plot histograms
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.histplot(housing_df[feature], kde=True, bins=30, color='blue')
    plt.title(f"Distribution of {feature}")
plt.tight_layout()
plt.show()

# Step 3: Generate box plots for numerical features
plt.figure(figsize=(15, 20))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x=housing_df[feature], color='orange')
    plt.title(f"Box Plot of {feature}")
plt.tight_layout()
plt.show()

# Step 4: Identify outliers using the IQR method
print("Outliers Detection:")
outliers_summary = {}
for feature in numerical_features:
    Q1 = housing_df[feature].quantile(0.25)
    Q3 = housing_df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = housing_df[(housing_df[feature] < lower_bound) |
                          (housing_df[feature] > upper_bound)]
    outliers_summary[feature] = len(outliers)
    print(f"{feature}: {len(outliers)} outliers")

# Optional: Print a summary of the dataset
print("\nDataset Summary:")
print(housing_df.describe())
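For clarity on Step 4: a value is flagged as an outlier when it falls outside the fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. The following self-contained sketch illustrates the rule on made-up numbers (not taken from the dataset):

import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])   # 40 is an obvious outlier
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(values[(values < lower) | (values > upper)])    # prints [40]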
2. Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use the California Housing dataset.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Step 1: Load the California Housing dataset
california_data = fetch_california_housing(as_frame=True)
data = california_data.frame

# Step 2: Compute the correlation matrix
correlation_matrix = data.corr()

# Step 3: Visualize the correlation matrix using a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix of California Housing Features')
plt.show()

# Step 4: Create a pair plot to visualize pairwise relationships
sns.pairplot(data, diag_kind='kde', plot_kws={'alpha': 0.5})
plt.suptitle("Pair Plot of California Housing Features", y=1.02)
plt.show()
Explanation:
This script performs an exploratory data analysis on the California Housing dataset using Pandas, Seaborn, and Matplotlib. First, it loads the dataset, which contains various housing-related features, and converts it into a DataFrame. Then, it calculates the correlation matrix to measure relationships between numerical features. To visualize these relationships, a heatmap is created using Seaborn, displaying correlations with color gradients and numerical values for better interpretability. Additionally, a pair plot is generated to illustrate pairwise relationships between features, using scatter plots for bivariate distributions and KDE plots for univariate distributions. These visualizations help in understanding feature dependencies and potential patterns in the dataset.
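The heatmap shows the correlations visually; if a ranked list is also wanted, a small follow-up (not part of the lab listing, shown here only as a sketch) can sort the feature pairs by absolute correlation:

import numpy as np
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True).frame
corr = data.corr()

# Keep each pair once (upper triangle, diagonal excluded), then rank by |r|
mask = ~np.tril(np.ones(corr.shape, dtype=bool))
pairs = corr.where(mask).stack()
print(pairs.reindex(pairs.abs().sort_values(ascending=False).index).head(10))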
3. Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = load_iris()
data = iris.data
labels = iris.target
label_names = iris.target_names

# Convert to a DataFrame for better visualization
iris_df = pd.DataFrame(data, columns=iris.feature_names)

# Perform PCA to reduce dimensionality to 2
pca = PCA(n_components=2)
data_reduced = pca.fit_transform(data)

# Create a DataFrame for the reduced data
reduced_df = pd.DataFrame(data_reduced, columns=['Principal Component 1',
                                                 'Principal Component 2'])
reduced_df['Label'] = labels

# Plot the reduced data
plt.figure(figsize=(8, 6))
colors = ['r', 'g', 'b']
for i, label in enumerate(np.unique(labels)):
    plt.scatter(
        reduced_df[reduced_df['Label'] == label]['Principal Component 1'],
        reduced_df[reduced_df['Label'] == label]['Principal Component 2'],
        label=label_names[label],
        color=colors[i]
    )
plt.title('PCA on Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid()
plt.show()
Explanation:
This Python script demonstrates how to apply Principal Component Analysis (PCA) to the Iris dataset to reduce its dimensionality from four features to two, making it easier to visualize. The script first loads the Iris dataset using sklearn.datasets.load_iris(), extracts the feature data and labels, and converts them to a Pandas DataFrame for better readability. It then applies PCA with n_components=2 to transform the dataset into a two-dimensional space. The transformed data is stored in another DataFrame along with the corresponding class labels. Finally, the script uses Matplotlib to create a scatter plot, where the different species of flowers are drawn in distinct colors. This visualization helps in understanding how well PCA separates the three species in the lower-dimensional space.
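As a quick check of how much information the two components retain, the fitted PCA object exposes the variance explained by each component. A short addition (a sketch meant to be appended after the listing above, where pca is already fitted):

# Proportion of the total variance captured by each principal component;
# for the raw Iris features the two-component total is typically above 0.95.
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance captured:", pca.explained_variance_ratio_.sum())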
4. For a given set of training data examples stored in a CSV file, implement and demonstrate the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples.
import pandas as pd

def find_s_algorithm(file_path):
    data = pd.read_csv(file_path)
    print("Training data:")
    print(data)

    attributes = data.columns[:-1]
    class_label = data.columns[-1]

    # Start with the most general hypothesis: '?' for every attribute
    hypothesis = ['?' for _ in attributes]

    for index, row in data.iterrows():
        # Only positive examples ("Yes") are used to update the hypothesis
        if row[class_label] == 'Yes':
            for i, value in enumerate(row[attributes]):
                if hypothesis[i] == '?' or hypothesis[i] == value:
                    hypothesis[i] = value
                else:
                    hypothesis[i] = '?'
    return hypothesis

file_path = 'C:\\Users\\Admin\\Downloads\\training_data.csv'
hypothesis = find_s_algorithm(file_path)
print("\nThe final hypothesis is:", hypothesis)
Explanation:
The given Python script implements the Find-S algorithm using the pandas library for reading and processing a CSV dataset. The Find-S algorithm is a simple machine learning approach used to find the most specific hypothesis that fits all positive examples in a dataset. The script reads a CSV file containing training data, where the last column represents the class label (e.g., "Yes" or "No"). It initializes the hypothesis with the most general value ("?") for each attribute and then iterates through the dataset, updating the hypothesis whenever a positive example ("Yes" in the class label) is encountered. If an attribute in a positive example matches the current hypothesis, it remains unchanged; otherwise, it is generalized to "?". The final hypothesis is returned and printed, representing the most specific description that fits all positive examples.
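The CSV file at the Windows path above is not reproduced in this manual. Purely as an illustration, a file laid out like the classic EnjoySport data, with the class label in the last column, would work with the code as written:

Sky,AirTemp,Humidity,Wind,Water,Forecast,EnjoySport
Sunny,Warm,Normal,Strong,Warm,Same,Yes
Sunny,Warm,High,Strong,Warm,Same,Yes
Rainy,Cold,High,Strong,Warm,Change,No

For this hypothetical file the program would print the final hypothesis ['Sunny', 'Warm', '?', 'Strong', 'Warm', 'Same']: the two positive rows agree on every attribute except Humidity, which is generalized to '?', and the negative row is ignored.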
5. Develop a program to implement the k-Nearest Neighbour algorithm to classify the randomly generated 100 values of x in the range of [0, 1]. Perform the following based on the dataset generated.
a. Label the first 50 points {x1, ..., x50} as follows: if (xi <= 0.5), then xi ∈ Class1, else xi ∈ Class2.
b. Classify the remaining points x51, ..., x100 using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30.
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

# Generate 100 random values in [0, 1]; label only the first 50
data = np.random.rand(100)
labels = ["Class1" if x <= 0.5 else "Class2" for x in data[:50]]

def euclidean_distance(x1, x2):
    # In one dimension the Euclidean distance is just the absolute difference
    return abs(x1 - x2)

def knn_classifier(train_data, train_labels, test_point, k):
    distances = [(euclidean_distance(test_point, train_data[i]), train_labels[i])
                 for i in range(len(train_data))]
    distances.sort(key=lambda x: x[0])
    k_nearest_neighbors = distances[:k]
    k_nearest_labels = [label for _, label in k_nearest_neighbors]
    # Majority vote among the k nearest neighbours
    return Counter(k_nearest_labels).most_common(1)[0][0]

train_data = data[:50]
train_labels = labels
test_data = data[50:]

k_values = [1, 2, 3, 4, 5, 20, 30]

print("--- k-Nearest Neighbors Classification ---")
print("Training dataset: First 50 points labeled based on the rule (x <= 0.5 -> Class1, x > 0.5 -> Class2)")
print("Testing dataset: Remaining 50 points to be classified\n")

results = {}
for k in k_values:
    print(f"Results for k = {k}:")
    classified_labels = [knn_classifier(train_data, train_labels, test_point, k)
                         for test_point in test_data]
    results[k] = classified_labels
    for i, label in enumerate(classified_labels, start=51):
        print(f"Point x{i} (value: {test_data[i - 51]:.4f}) is classified as {label}")
    print("\n")
print ("Classification complete. \n")
for k in k_values:
classified labels = results (k]
classi_points = [test_data[i] for i in range(len(test_data)) if
classified tabels|i)] == "Classi")
class2 points = [test data[i] for 4 in range(len(test_data)) if
classified_tabe1s[i] == "Class2"]
plt. figure (Figsize=(10, 61)
plt-scatter(train_data, [0] * len(train data), c=["blue" if label -- "Classi"
else "red" for label Tn train labels],
label="Training Cata", marker="o"
plt.scatter(classl_points, [1] * len(class!_points), c-"blue", iabel="Class!
(test) ", marker="x")
plt scatter (class?_points, [1] * len(class?_points), co"red", label="Class2
(test) ", marker-"«")
pit. title (£"k-NN Classification Results for k = (k}")
plt-xlabel ("Data Points")
pit. ylabel ("Classification Level")
ple. legend ()
plt.grid(rrue)
pits how()
Explanation:
This Python script implements a simple k-Nearest Neighbors (k-NN) classifier using Euclidean distance for a 1D dataset generated randomly. It starts by creating 100 random values between 0 and 1, labeling the first 50 points as "Class1" if x <= 0.5 and "Class2" otherwise. The script defines a function to calculate the Euclidean distance and another to classify test points based on their nearest neighbors. It then iterates over different k values (1, 2, 3, 4, 5, 20, 30), classifying the remaining 50 data points. The classification results are printed and visualized using Matplotlib, where training points are plotted as circles and test points as Class1 (blue crosses) or Class2 (red crosses). The visualizations help analyze how different k values affect classification performance.
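Because the test points come from the same distribution as the labeled training points, the predictions can also be checked against the labeling rule directly. A short follow-up (not part of the lab listing; a sketch meant to be appended after the program above, where test_data, k_values and results are already defined):

# Compare each k's predictions with the labels the generating rule would assign
true_test_labels = ["Class1" if x <= 0.5 else "Class2" for x in test_data]
for k in k_values:
    matches = sum(p == t for p, t in zip(results[k], true_test_labels))
    print(f"k = {k}: {matches}/{len(test_data)} test points match the labeling rule")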
6. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs.
import numpy as np
import matplotlib.pyplot as plt

def gaussian_kernel(x, xi, tau):
    # Weight decays with squared distance from the query point, scaled by tau
    return np.exp(-np.sum((x - xi) ** 2) / (2 * tau ** 2))

def locally_weighted_regression(x, X, y, tau):
    m = X.shape[0]
    weights = np.array([gaussian_kernel(x, X[i], tau) for i in range(m)])
    W = np.diag(weights)
    X_transpose_W = X.T @ W
    theta = np.linalg.inv(X_transpose_W @ X) @ X_transpose_W @ y
    return x @ theta

np.random.seed(42)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * np.random.randn(100)

X_bias = np.c_[np.ones(X.shape), X]
x_test = np.linspace(0, 2 * np.pi, 200)
x_test_bias = np.c_[np.ones(x_test.shape), x_test]

tau = 0.5
y_pred = np.array([locally_weighted_regression(xi, X_bias, y, tau) for xi in x_test_bias])

plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='red', label='Training Data', alpha=0.7)
plt.plot(x_test, y_pred, color='blue', label=f'LWR Fit (tau={tau})', linewidth=2)
plt.xlabel('X', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Locally Weighted Regression', fontsize=14)
plt.legend(fontsize=10)
plt.grid(alpha=0.3)
plt.show()
Explanation:
This program implements Locally Weighted Regression (LWR) using a Gaussian kernel to assign weights to training points based on their distance from a given query point. The function gaussian_kernel computes the weight of each training point relative to the query point using a Gaussian function with bandwidth tau. The locally_weighted_regression function then performs weighted linear regression by computing the weighted least-squares estimate of the regression coefficients. The dataset consists of noisy sine-wave samples, and LWR is applied to estimate the underlying trend. The predictions are visualized, showing the fitted curve along with the training data and demonstrating how LWR captures local patterns in the data.
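For reference, the quantity locally_weighted_regression computes at each query point x (with a bias term prepended, as in the code) is the standard weighted least-squares solution:

    w_i = exp( -||x - x_i||^2 / (2 * tau^2) ),    W = diag(w_1, ..., w_m)
    theta(x) = (X^T W X)^(-1) X^T W y
    y_hat(x) = x^T theta(x)

A small tau makes the weights decay quickly, so the fit follows local structure closely; a large tau spreads the weights out and the fit approaches ordinary linear regression.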
DEPARTMENT OF CS&E, RIT, HASSAN‘Machine Learning Lab (BCSL606)
7. Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use the Boston Housing Dataset for Linear Regression and the Auto MPG Dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score

def linear_regression_california():
    # Linear regression: predict median home value from the average number of rooms
    housing = fetch_california_housing(as_frame=True)
    X = housing.data[["AveRooms"]]
    y = housing.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    plt.scatter(X_test, y_test, color="blue", label="Actual")
    plt.plot(X_test, y_pred, color="red", label="Predicted")
    plt.xlabel("Average number of rooms (AveRooms)")
    plt.ylabel("Median value of homes ($100,000)")
    plt.title("Linear Regression - California Housing Dataset")
    plt.legend()
    plt.show()

    print("Linear Regression - California Housing Dataset")
    print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
    print("R^2 Score:", r2_score(y_test, y_pred))

def polynomial_regression_auto_mpg():
    # Polynomial regression: predict fuel efficiency (mpg) from engine displacement
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
    column_names = ["mpg", "cylinders", "displacement", "horsepower", "weight",
                    "acceleration", "model_year", "origin"]
    # comment='\t' drops the trailing quoted car-name field, which is tab-separated
    data = pd.read_csv(url, sep=r'\s+', names=column_names, na_values="?", comment='\t')
    data = data.dropna()

    X = data["displacement"].values.reshape(-1, 1)
    y = data["mpg"].values

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    poly_model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LinearRegression())
    poly_model.fit(X_train, y_train)
DEPARTMENT OF CS&E, RIT, HASSANMachine Learning Lab (BCSL606)
    y_pred = poly_model.predict(X_test)

    plt.scatter(X_test, y_test, color="blue", label="Actual")
    plt.scatter(X_test, y_pred, color="red", label="Predicted")
    plt.xlabel("Displacement")
    plt.ylabel("Miles per gallon (mpg)")
    plt.title("Polynomial Regression - Auto MPG Dataset")
    plt.legend()
    plt.show()

    print("Polynomial Regression - Auto MPG Dataset")
    print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
    print("R^2 Score:", r2_score(y_test, y_pred))

if __name__ == "__main__":
    print("Demonstrating Linear Regression and Polynomial Regression\n")
    linear_regression_california()
    polynomial_regression_auto_mpg()
Explanation:
The provided code demonstrates linear and polynomial regression using two datasets. First, it performs linear regression on the California Housing dataset, predicting the median home value from the average number of rooms. The data is split into training and test sets, and a LinearRegression model is trained and evaluated, displaying a scatter plot comparing actual and predicted values along with the Mean Squared Error (MSE) and R² score. In the second part, polynomial regression is applied to the Auto MPG dataset, predicting fuel efficiency (mpg) from engine displacement. A pipeline is used, consisting of polynomial feature transformation, standard scaling, and linear regression. The model's performance is visualized similarly and its metrics are printed. Both visualizations help in understanding model fit and performance.
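As a reminder of what r2_score reports, R² compares the model's squared error with that of a constant predictor that always outputs the mean. A tiny self-contained check (the numbers here are made up for illustration):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 2.5, 4.0, 5.5])
y_pred = np.array([2.8, 2.7, 4.2, 5.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
print(1 - ss_res / ss_tot, r2_score(y_true, y_pred))  # both print the same value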
8. Develop a program to demonstrate the working of the decision tree algorithm. Use the Breast Cancer data set for building the decision tree and apply this knowledge to classify a new sample.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Classify a new sample (here, the first sample from the test set)
new_sample = np.array([X_test[0]])
prediction = clf.predict(new_sample)
prediction_class = "Benign" if prediction == 1 else "Malignant"
print(f"Predicted Class for the new sample: {prediction_class}")

plt.figure(figsize=(12, 8))
tree.plot_tree(clf, filled=True, feature_names=data.feature_names,
               class_names=data.target_names)
plt.title("Decision Tree - Breast Cancer Dataset")
plt.show()
Explanation:
This Python script implements a Decision Tree Classifier to classify breast cancer tumors as malignant or benign using the Breast Cancer dataset from sklearn.datasets. It begins by importing the necessary libraries, then loads the dataset and separates it into features (X) and target labels (y). The dataset is split into training (80%) and testing (20%) subsets using train_test_split. A DecisionTreeClassifier model is created and trained on the training data. After training, predictions are made on the test set, and the model's accuracy is evaluated using accuracy_score. The script then predicts the class of a new sample (the first test sample) and prints whether it is malignant or benign. Finally, it visualizes the decision tree using matplotlib and tree.plot_tree, displaying how the classifier makes decisions based on feature splits.
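The listing classifies the first test sample; if a hand-constructed sample is wanted instead, a small sketch like the following also works (meant to be appended after the listing above, where X, data and clf are already defined; the overridden value is entirely hypothetical):

# Hypothetical new sample: start from the dataset's average feature values and
# override one named feature, then classify it with the trained tree.
new_sample = X.mean(axis=0).copy()
idx = list(data.feature_names).index("mean radius")
new_sample[idx] = 14.0                      # made-up measurement, for illustration only
pred = clf.predict(new_sample.reshape(1, -1))[0]
print("Benign" if pred == 1 else "Malignant")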
DEPARTMENT OF CS&E, RIT, HASSAN‘Machine Learning Lab (BCSL606)
9. Develop a program to implement the Naive Bayesian classifier considering the Olivetti Face data set for training. Compute the accuracy of the classifier, considering a few test data sets.
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt

data = fetch_olivetti_faces(shuffle=True, random_state=42)
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

print("\nClassification Report:")
print(classification_report(y_test, y_pred, zero_division=1))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

cross_val_accuracy = cross_val_score(gnb, X, y, cv=5, scoring='accuracy')
print(f'\nCross-validation accuracy: {cross_val_accuracy.mean() * 100:.2f}%')

# Show a few test faces with their true and predicted person IDs
fig, axes = plt.subplots(3, 5, figsize=(12, 8))
for ax, image, label, prediction in zip(axes.ravel(), X_test, y_test, y_pred):
    ax.imshow(image.reshape(64, 64), cmap=plt.cm.gray)
    ax.set_title(f"True: {label}, Pred: {prediction}")
    ax.axis('off')
plt.show()
Explanation:
This Python script uses machine learning to classify human face images from the Olivetti Faces dataset. It begins by importing the necessary libraries such as NumPy, scikit-learn, and Matplotlib. The dataset is loaded using fetch_olivetti_faces(), which provides grayscale 64x64 face images and corresponding labels. The data is split into training and testing sets using train_test_split(). A Gaussian Naive Bayes (GNB) classifier is then trained on the training set and used to predict labels for the test set. The model's performance is evaluated using accuracy_score, classification_report, and confusion_matrix. Additionally, cross-validation is performed to assess generalization. Finally, a visualization is created using Matplotlib, displaying a subset of test images with their true and predicted labels.
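For reference, Gaussian Naive Bayes treats each of the 4096 pixel intensities as an independent feature and models it with a class-conditional normal distribution; a face is assigned to the class c that maximizes

    log P(y = c) + sum_j log N(x_j; mu_{c,j}, sigma_{c,j}^2)

where mu_{c,j} and sigma_{c,j}^2 are the per-class mean and variance of pixel j estimated from the training images. The independence assumption is a strong simplification for neighbouring pixels, but it keeps training and prediction very cheap.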
DEPARTMENT OF CS&E, RIT, HASSAN‘Machine Learning Lab (BCSL606)
10. Develop a program to implement k-means clustering using the Wisconsin Breast Cancer data set and
visualize the clustering result.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report

data = load_breast_cancer()
X = data.data
y = data.target

# Standardize features so each contributes equally to the distance computation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Cluster into two groups (the dataset has two classes: malignant and benign)
kmeans = KMeans(n_clusters=2, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)

print("Confusion Matrix:")
print(confusion_matrix(y, y_kmeans))
print("\nClassification Report:")
print(classification_report(y, y_kmeans))

# Project to two principal components for plotting
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df['Cluster'] = y_kmeans
df['True Label'] = y

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1', s=100,
                edgecolor='black', alpha=0.7)
plt.title('K-Means Clustering of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='True Label', palette='coolwarm', s=100,
                edgecolor='black', alpha=0.7)
plt.title('True Labels of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="True Label")
plt.show()

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1', s=100,
                edgecolor='black', alpha=0.7)
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], s=200, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering with Centroids')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()
Explanation:
This code performs K-Means clustering on the Breast Cancer dataset from sklearn. It first loads the dataset and standardizes the features using StandardScaler so that all features carry equal weight. Then, it applies K-Means clustering with two clusters, since the dataset has two target classes (malignant and benign). The clustering results are evaluated using a confusion matrix and a classification report. To visualize the data, Principal Component Analysis (PCA) is used to reduce the dimensions to two principal components, making them easier to plot. The results are displayed using scatter plots comparing predicted clusters and true labels. Finally, the cluster centroids are plotted on the PCA-transformed data, helping to understand the separation of the clusters.
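One caveat when reading the confusion matrix and classification report above: K-Means assigns arbitrary cluster IDs (0 or 1), so the clusters may come out swapped relative to the true labels. A small follow-up (not part of the lab listing; a sketch meant to be appended after the program above, where y, y_kmeans and data are already defined) relabels the clusters when overall agreement falls below 50% and prints the report again:

from sklearn.metrics import accuracy_score

# Flip cluster IDs if they disagree with the true labels more often than not
y_aligned = y_kmeans if accuracy_score(y, y_kmeans) >= 0.5 else 1 - y_kmeans
print(classification_report(y, y_aligned, target_names=list(data.target_names)))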
DEPARTMENT OF CS&E, RIT, HASSAN