
UEC2604

MACHINE LEARNING

Mini Project Report

VI Semester
UG-Electronics and Communication Engineering
(2024 – 2025)

Name JAYAPRAKASH K
Register Number 3122223002049
Sri Sivasubramaniya Nadar College of Engineering
(An Autonomous Institution, Affiliated to Anna University, Chennai)
Rajiv Gandhi Salai (OMR), Kalavakkam – 603 110

BONAFIDE CERTIFICATE

Date: .....................

Certified that this is the bonafide record of work done by,


Name: .............................................................................................
Register No: ............................................ of Sixth Semester B.E.
Electronics and Communication Engineering during the academic year
2024 - 2025 for the subject UEC2604 – Machine Learning.

Submitted for the Continuous Assessment Test 2 held on .......................


Faculty In-charge

Exp. No TITLE

1 Data Visualization Techniques

2 Linear and Logistic Regression

3 K-Nearest Neighbour

4 Support Vector Machine

5 Principal Component Analysis

6 K-Means Clustering
Exp. No: 1 Data Visualization Techniques Date:

AIM:
To perform data visualization techniques on retail sales data to analyze trends,
distribution, top-selling products, sales by country, and correlations using Python libraries.

Algorithm:
1. Start
2. Read the Excel file containing the retail sales data.
3. Convert the InvoiceDate column to datetime format.
4. To analyse the sales trend, resample the data monthly and visualize the total quantity
sold over time.
5. Plot a histogram to analyze the distribution of unit prices.
6. Aggregate sales data by product descriptions and visualize the top-selling products
using a horizontal bar chart.
7. Group data by country and visualize total sales for each country using a bar chart.
8. Compute and visualize the correlation between Quantity and UnitPrice using a
heatmap.
9. Identify the top 5 countries by total sales and visualize their contribution using a pie
chart.
10. End.
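
Before the full program, a brief sketch of steps 4, 6 and 9 above, assuming the workbook contains the usual Online Retail columns (InvoiceDate, Description, Quantity, UnitPrice, Country); the file name is taken from the program and may need adjusting:

import pandas as pd
import matplotlib.pyplot as plt

retail = pd.read_excel("Online Retail_dta_visualisation.xlsx")  # assumed path/filename
retail['InvoiceDate'] = pd.to_datetime(retail['InvoiceDate'])

# Step 4: monthly sales trend (total quantity sold per month)
retail.resample('M', on='InvoiceDate')['Quantity'].sum().plot(title='Monthly Quantity Sold')
plt.show()

# Step 6: top 10 selling products by quantity
retail.groupby('Description')['Quantity'].sum().nlargest(10).plot(kind='barh', title='Top-Selling Products')
plt.show()

# Step 9: top 5 countries by total sales value
retail['Sales'] = retail['Quantity'] * retail['UnitPrice']
retail.groupby('Country')['Sales'].sum().nlargest(5).plot(kind='pie', autopct='%1.1f%%', title='Top 5 Countries by Sales')
plt.show()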

Program:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data


file_path = "D:\College\SEMESTER 6\Machine Learning\Lab\lab\Online
Retail_dta_visualisation.xlsx" # Replace with your file path if needed
df = pd.read_csv(file_path)

# 1. Distribution of Above-Ground Living Area (sqft_above)


plt.figure(figsize=(10, 6))
sns.histplot(df['sqft_above'], bins=50, kde=True, color='skyblue')
plt.title('Distribution of Above-Ground Living Area')
plt.xlabel('Square Footage (sqft_above)')
plt.ylabel('Frequency')
plt.show()

# 2. Relationship between Square Footage and Price


plt.figure(figsize=(10, 6))
sns.scatterplot(x='sqft_living', y='price', data=df, alpha=0.5)
plt.title('Price vs Square Footage')
plt.xlabel('Square Footage')
plt.ylabel('Price')
plt.show()

# 3. Boxplot: Price by Number of Bedrooms


plt.figure(figsize=(10, 6))
sns.boxplot(x='bedrooms', y='price', data=df)
plt.title('Price Distribution by Number of Bedrooms')
plt.xlabel('Bedrooms')
plt.ylabel('Price')
plt.show()

# 4. Heatmap: Feature Correlations


plt.figure(figsize=(12, 8))
numeric_df = df.select_dtypes(include=['float64', 'int64'])  # Exclude non-numeric columns
correlation_matrix = numeric_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Feature Correlation Heatmap')
plt.show()

# 5. Line Plot: Average Price Over Time


df['date'] = pd.to_datetime(df['date'].str[:8], format='%Y%m%d')
price_trend = df.groupby('date')['price'].mean()

plt.figure(figsize=(12, 6))
price_trend.plot(color='green')
plt.title('Average House Price Over Time')
plt.xlabel('Date')
plt.ylabel('Average Price')
plt.show()
Output:
Figure 1.1: Plot

Figure 1.2: Histogram
Figure 1.3: Horizontal bar chart

Figure 1.4: Bar chart


Figure 1.5: Heat map

Figure 1.6: Pie chart


Results:
Thus, the Python program to perform data visualization techniques on retail sales data to
analyze trends, distribution, top-selling products, sales by country, and correlations using
Python libraries has been executed and the outputs have been verified.

Exp. No: 2 Linear and Logistic Regression Date:

AIM:
To implement and evaluate Linear Regression for salary prediction and Logistic
Regression for diabetes classification using Python libraries.

Algorithm:
Start
Linear Regression (Salary Prediction)
1. Load the dataset (Salary_Data_linear_regression.csv).
2. Extract Years of Experience as the independent variable (X) and Salary as the
dependent variable (y).
3. Split the data into training and testing sets (80% train, 20% test).
4. Train a Linear Regression model using the training data.
5. Predict salary values on the test set.
6. Evaluate the model using Mean Squared Error (MSE).
7. Plot the actual vs. predicted salaries.
Logistic Regression (Diabetes Classification)
1. Load the dataset (diabetes_logistic_regression.csv).
2. Extract the input features (X) and the target outcome (y).
3. Split the data into training and testing sets (80% train, 20% test).
4. Train a Logistic Regression model using the training data.
5. Predict diabetes outcomes on the test set.
6. Evaluate the model using Accuracy Score and Classification Report (precision,
recall, F1-score).

End.

Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the data
data = pd.read_csv(r'D:\College\SEMESTER 6\Machine Learning\Lab\lab\Salary_Data_linear_regression.csv')
# Prepare the features (X) and target variable (y)
X = data['YearsExperience'].values.reshape(-1, 1)
y = data['Salary'].values
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Print model details
print("Linear Regression Analysis Results:")
print(f"Slope (Coefficient): {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")

# Visualize the results


plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', label='Training Data')
plt.scatter(X_test, y_test, color='red', label='Testing Data')
plt.plot(X, model.predict(X), color='green', label='Regression Line')
plt.title('Salary vs Years of Experience')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend()
plt.show()
# Prediction example
years_of_experience = 5
predicted_salary = model.predict([[years_of_experience]])
print(f"\nPredicted Salary for {years_of_experience} years of experience: $
{predicted_salary[0]:.2f}")
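
# Sanity check (sketch, reusing the model fitted above): the prediction should equal slope * years + intercept
manual_salary = model.coef_[0] * years_of_experience + model.intercept_
manual_mse = np.mean((y_test - y_pred) ** 2)  # same quantity as mean_squared_error above
print(f"Manual prediction: ${manual_salary:.2f}, manual MSE: {manual_mse:.2f}")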

Output:
Figure 2.1: Linear Regression

Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc

# Load dataset
df = pd.read_csv("D:/College/SEMESTER 6/Machine
Learning/Lab/lab/diabetes_logistic_regression.csv")

# Display basic info


print(df.head())
print(df.info())
# Assuming the last column is the target variable (adjust if needed)
y = df.iloc[:, -1]
X = df.iloc[:, :-1]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train logistic regression model


model = LogisticRegression()
model.fit(X_train, y_train)

# Predict on test data


y_pred = model.predict(X_test)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_report(y_test, y_pred))
# Plot confusion matrix
plt.figure(figsize=(6,4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# ROC Curve
probs = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, color='blue', label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.show()
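
# Sketch: the probabilities in probs can also be thresholded directly instead of using the default 0.5 cut-off
threshold = 0.3  # arbitrary example value
y_pred_custom = (probs >= threshold).astype(int)
print(f"Accuracy at threshold {threshold}:", accuracy_score(y_test, y_pred_custom))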
Output:
Figure 2.2: Logistic regression

Figure 2.3: ROC curve


Figure 2.4: Terminal output
Results:
Thus, the Python program to implement and evaluate Linear Regression for salary
prediction and Logistic Regression for diabetes classification using Python libraries has been
executed and the outputs have been verified.
Exp. No: 3 K-Nearest Neighbour Date:

AIM:
To implement and compare Linear Regression & K-Nearest Neighbours (KNN)
Regression for salary prediction and Logistic Regression & K-Nearest Neighbours (KNN)
Classification for diabetes prediction

Algorithm:
Start
Regression (Salary Prediction using Linear Regression & KNN Regression)
1. Load the Salary Dataset (Salary_Data_linear_regression.csv).
2. Select Years of Experience as the independent variable (X) and Salary as the
dependent variable (y).
3. Split the dataset into training (80%) and testing (20%) sets.
4. Train a Linear Regression model on the training data.
5. Predict salaries using the trained model and calculate Mean Squared Error (MSE).
6. Train a KNN Regressor (k=5) on the same data.
7. Predict salaries using KNN Regressor and calculate MSE.
8. Plot Actual vs. Predicted salaries for both models.
Classification (Diabetes Prediction using Logistic Regression & KNN Classification)
1. Load the Diabetes Dataset (diabetes_logistic_regression.csv).
2. Extract all input features (X) and target labels (y).
3. Split the dataset into training (80%) and testing (20%) sets.
4. Train a Logistic Regression model on the training data.
5. Predict diabetes outcomes on the test set and calculate accuracy.
6. Train a KNN Classifier (k=5) on the same data.
7. Predict outcomes using KNN and calculate accuracy.
8. Generate Classification Reports for both models.
End.
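
A minimal sketch of the comparison described above, assuming the same salary and diabetes files and column layout used in Experiment 2:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import mean_squared_error, accuracy_score

# Regression half: salary prediction
salary = pd.read_csv("Salary_Data_linear_regression.csv")  # assumed local path
X = salary[['YearsExperience']]
y = salary['Salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lin_reg = LinearRegression().fit(X_train, y_train)
knn_reg = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
print("Linear Regression MSE:", mean_squared_error(y_test, lin_reg.predict(X_test)))
print("KNN Regression MSE:", mean_squared_error(y_test, knn_reg.predict(X_test)))

# Classification half: diabetes prediction (last column assumed to be the target)
diabetes = pd.read_csv("diabetes_logistic_regression.csv")  # assumed local path
Xc, yc = diabetes.iloc[:, :-1], diabetes.iloc[:, -1]
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, test_size=0.2, random_state=42)
log_reg = LogisticRegression(max_iter=1000).fit(Xc_train, yc_train)
knn_clf = KNeighborsClassifier(n_neighbors=5).fit(Xc_train, yc_train)
print("Logistic Regression accuracy:", accuracy_score(yc_test, log_reg.predict(Xc_test)))
print("KNN Classification accuracy:", accuracy_score(yc_test, knn_clf.predict(Xc_test)))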

Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load dataset
file_path = "D:\College\SEMESTER 6\Machine Learning\Lab\lab\kc_house_data.csv"
if not os.path.exists(file_path):
raise FileNotFoundError(f"Dataset not found: {file_path}")

df = pd.read_csv(file_path)

# Handle missing values


df.dropna(inplace=True)

# Display basic info


print(df.head())
print(df.info())

# Drop non-useful columns if necessary (e.g., ID, address, date, etc.)


if 'id' in df.columns:
    df = df.drop(columns=['id'])
if 'date' in df.columns:
    df = df.drop(columns=['date'])
# Selecting features (X) and target variable (y)
# Assuming 'price' is the target variable (modify if needed)
X = df.drop(columns=['price'])
y = df['price'] > df['price'].median() # Binary classification (High/Low price)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train KNN model (with k=5)


knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

# Predict on test data


y_pred = knn_model.predict(X_test)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_report(y_test, y_pred))

# --- Confusion Matrix Plot ---


plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Low Price', 'High Price'], yticklabels=['Low Price', 'High Price'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# --- Scatter Plot (Visualizing sqft_living vs. price category) ---


if 'sqft_living' in df.columns:
    plt.figure(figsize=(8, 5))
    sns.scatterplot(x=df['sqft_living'], y=df['price'], hue=y, palette={True: 'red', False: 'blue'}, alpha=0.5)
    plt.xlabel('Living Area (sqft)')
    plt.ylabel('Price')
    plt.title('House Prices: High vs Low Price Categories')
    plt.legend(title='Price Category', labels=['Low Price', 'High Price'])
    plt.show()

Output:

Figure 3.1: KNN vs Linear Regression


Figure 3.2: Terminal output of K-Nearest Neighbour

Results:
Thus, the Python program to implement and compare Linear Regression and K-Nearest
Neighbours (KNN) Regression for salary prediction, and Logistic Regression and K-Nearest
Neighbours (KNN) Classification for diabetes prediction, has been executed and the outputs
have been verified.
Exp. No: 4 Support Vector Machine Date:

AIM:
To implement Support Vector Machine (SVM) classification on the Social Network
Ads dataset and evaluate its performance

Algorithm:
1. Start
2. Read the dataset Social_Network_Ads_SVM.csv using pandas.
3. Select the feature columns (Age & Estimated Salary) and the target column
(Purchased).
4. Split the dataset into training (80%) and testing (20%) sets.
5. Standardize the feature values using StandardScaler.
6. Use SVC(kernel='linear') to train a linear Support Vector Machine on the training
set.
7. Predict the target values on the test set.
8. Calculate accuracy using accuracy_score.
9. Generate a classification report showing precision, recall, and F1-score.
10. Compute and visualize a confusion matrix using seaborn.heatmap().
11. End.
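
A minimal sketch of steps 3–8 with the linear kernel named in step 6, assuming the dataset contains Age, EstimatedSalary and Purchased columns:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

ads = pd.read_csv("Social_Network_Ads_SVM.csv")  # assumed local path
X = ads[['Age', 'EstimatedSalary']]
y = ads['Purchased']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

linear_svm = SVC(kernel='linear').fit(X_train, y_train)
print("Linear-kernel SVM accuracy:", accuracy_score(y_test, linear_svm.predict(X_test)))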

Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load dataset
df = pd.read_csv("D:\College\SEMESTER 6\Machine Learning\Lab\lab\
Social_Network_Ads_SVM.csv")

# Display basic info


print(df.head())
print(df.info())

# Drop User ID column as it's not useful


if 'User ID' in df.columns:
    df = df.drop(columns=['User ID'])

# Encode categorical variable (Gender)


if 'Gender' in df.columns:
    label_encoder = LabelEncoder()
    df['Gender'] = label_encoder.fit_transform(df['Gender'])  # Female -> 0, Male -> 1

# Assuming the last column is the target variable (adjust if needed)


y = df.iloc[:, -1]
X = df.iloc[:, :-1]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train SVM model


svm_model = SVC(kernel='rbf', probability=True)
svm_model.fit(X_train, y_train)

# Predict on test data


y_pred = svm_model.predict(X_test)
# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_report(y_test, y_pred))

# --- Confusion Matrix Plot ---


plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative',
'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show() # Show the first plot before proceeding

# --- Scatter Plot (First Two Features) ---


plt.figure(figsize=(6, 4))
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='coolwarm', edgecolors='k',
alpha=0.7)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Scatter Plot of Training Data')
plt.show()

# --- Decision Boundary Visualization (Only for 2 Features) ---


def plot_decision_boundary(X, y, model):
    X1, X2 = np.meshgrid(np.arange(start=X[:, 0].min() - 1, stop=X[:, 0].max() + 1, step=0.01),
                         np.arange(start=X[:, 1].min() - 1, stop=X[:, 1].max() + 1, step=0.01))
    plt.figure(figsize=(6, 4))
    plt.contourf(X1, X2, model.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape), alpha=0.3, cmap='coolwarm')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', edgecolors='k')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('SVM Decision Boundary')
    plt.show()

# Visualize decision boundary (only works if dataset has 2 features)


if X_train.shape[1] == 2:
    plot_decision_boundary(X_train, y_train, svm_model)

Output:

Figure 4.1: Confusion matrix of Support Vector Machine


Figure 4.2: Terminal output of Support Vector Machine

Results:
Thus, the Python program to implement Support Vector Machine (SVM) classification on
the Social Network Ads dataset and evaluate its performance has been executed and the
outputs have been verified.
Exp. No: 5 Principal Component Analysis Date:

AIM:
To perform Principal Component Analysis (PCA) on the Social Network Ads dataset
to reduce dimensionality and visualize the data in a lower-dimensional space

Algorithm:
1. Start
2. Read the dataset Social_Network_Ads_SVM.csv using pandas.
3. Convert categorical columns (if any) into numerical values using LabelEncoder().
4. Extract features (X) and target variable (y).
5. Standardize the dataset using StandardScaler() to normalize feature values.
6. Use PCA(n_components=2) to reduce the dataset to two principal components for
visualization.
7. Transform the standardized feature matrix into the new PCA space.
8. Create a scatter plot of the PCA-transformed data, coloured by the target variable.
9. Display the explained variance ratio to understand how much variance each
principal component captures.
10. End.
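
Step 9 can also be read cumulatively: the running sum of the explained variance ratios shows how many components are needed to retain a given share of the variance. A short sketch, assuming X_scaled is a standardized feature matrix as produced in step 5:

import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA().fit(X_scaled)  # X_scaled: standardized features (assumed from step 5)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print("Explained variance ratio:", pca_full.explained_variance_ratio_)
print("Components for 95% of variance:", int(np.argmax(cumulative >= 0.95)) + 1)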

Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.decomposition import PCA

# Check if file exists


file_path = "D:\College\SEMESTER 6\Machine Learning\Lab\lab\
Social_Network_Ads_SVM.csv"
if not os.path.exists(file_path):
raise FileNotFoundError(f"Dataset not found: {file_path}")

# Load dataset
df = pd.read_csv(file_path)

# Display basic info


print(df.head())
print(df.info())
# Drop User ID column as it's not useful
if 'User ID' in df.columns:
    df = df.drop(columns=['User ID'])

# Encode categorical variable (Gender)


if 'Gender' in df.columns:
    label_encoder = LabelEncoder()
    df['Gender'] = label_encoder.fit_transform(df['Gender'])  # Female -> 0, Male -> 1

# Ensure all features are numerical


df = df.apply(pd.to_numeric, errors='coerce')
df = df.dropna()

# Assuming the last column is the target variable (adjust if needed)


y = df.iloc[:, -1]
X = df.iloc[:, :-1]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Apply PCA
pca = PCA(n_components=2) # Reduce to 2 principal components
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Explained variance ratio


print("Explained variance ratio:", pca.explained_variance_ratio_)

# Train SVM model


svm_model = SVC(kernel='rbf', probability=True)
svm_model.fit(X_train_pca, y_train)

# Predict on test data


y_pred = svm_model.predict(X_test_pca)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_report(y_test, y_pred))

# Plot confusion matrix


plt.figure(figsize=(6,4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative',
'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Visualize PCA components


def plot_pca(X_pca, y):
    plt.figure(figsize=(8, 6))
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='coolwarm', edgecolors='k')
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.title('PCA - Data Projection')
    plt.show()

plot_pca(X_train_pca, y_train)

# Visualize decision boundary


def plot_decision_boundary(X, y, model):
    X1, X2 = np.meshgrid(np.arange(start=X[:, 0].min() - 1, stop=X[:, 0].max() + 1, step=0.01),
                         np.arange(start=X[:, 1].min() - 1, stop=X[:, 1].max() + 1, step=0.01))
    plt.contourf(X1, X2, model.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape), alpha=0.3, cmap='coolwarm')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', edgecolors='k')
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.title('SVM Decision Boundary with PCA')
    plt.show()

plot_decision_boundary(X_train_pca, y_train, svm_model)


Output:
Figure 5.1: Principal Component Analysis visualization of the dataset

Figure 5.2: Terminal output of Principal Component Analysis

Results:
Thus, the Python program to perform Principal Component Analysis (PCA) on the Social
Network Ads dataset to reduce dimensionality and visualize the data in a lower-dimensional
space has been executed and the outputs have been verified.

Exp. No: 6 K-Means Clustering Date:

AIM:
To perform K-Means Clustering on the King County House Sales dataset to group
similar houses based on numerical features and visualize the clusters.
Algorithm:
1. Start
2. Read the dataset kc_house_data.csv using pandas.
3. Select only numerical columns for clustering.
4. Remove missing values (NaN) from the dataset.
5. Standardize the features using StandardScaler() to normalize the data.
6. Choose K = 3 clusters.
7. Fit the K-Means algorithm on the standardized dataset.
8. Predict the cluster for each data point and store it in a new column.
9. Plot a scatter plot of the first two principal features with color-coded clusters.
10. Display the cluster centers of the fitted model.
11. End.

Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load dataset
file_path = "D:\College\SEMESTER 6\Machine Learning\Lab\lab\kc_house_data.csv"
df = pd.read_csv(file_path)

# Display basic info


print(df.head())
print(df.info())

# Selecting relevant numerical features for clustering


selected_features = ['sqft_living', 'price']
data = df[selected_features]

# Standardizing the data


scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Determine optimal number of clusters using the Elbow Method


wcss = [] # Within-cluster sum of squares
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(data_scaled)
    wcss.append(kmeans.inertia_)

# Plot the Elbow Method graph


plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--', color='b')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('Elbow Method for Optimal Clusters')
plt.show()

# Applying K-Means Clustering with optimal clusters (e.g., k=3)


kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
df['Cluster'] = kmeans.fit_predict(data_scaled)

# Visualizing Clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x=data_scaled[:, 0], y=data_scaled[:, 1], hue=df['Cluster'], palette='viridis')
plt.xlabel('Scaled sqft_living')
plt.ylabel('Scaled price')
plt.title('K-Means Clustering')
plt.legend(title='Cluster')
plt.show()

# Display cluster centers


print("Cluster Centers (scaled values):", kmeans.cluster_centers_)

Output:
Figure 6.1: Scatter plot of K-Means clustering

Figure 6.2: Terminal output of K-Means clustering

Results:
Thus, the Python program to perform K-Means Clustering on the King County House
Sales dataset to group similar houses based on numerical features and visualize the clusters
has been executed and the outputs have been verified.
