ML Recordjp
MACHINE LEARNING
VI Semester
UG-Electronics and Communication Engineering
(2024 – 2025)
Name JAYAPRAKASH K
Register Number 3122223002049
Sri Sivasubramaiya Nadar College of Engineering
(An Autonomous Institution, Affiliated to Anna University, Chennai)
Rajiv Gandhi Salai (OMR), Kalavakkam – 603 110
BONAFIDE CERTIFICATE
Date: .....................
3 K Nearest Neighbour
6 K Means Clustering
Exp: No : 1 Data Visualization Techniques Date:
AIM:
To perform data visualization techniques on retail sales data to analyze trends,
distribution, top-selling products, sales by country, and correlations using Python libraries.
Algorithm:
1. Start
2. Read the Excel file containing the retail sales data.
3. Convert the InvoiceDate column to datetime format.
4. To analyze the sales trend, resample the data monthly and visualize the total quantity sold
over time.
5. Plot a histogram to analyze the distribution of unit prices.
6. Aggregate sales data by product descriptions and visualize the top-selling products
using a horizontal bar chart.
7. Group data by country and visualize total sales for each country using a bar chart.
8. Compute and visualize the correlation between Quantity and UnitPrice using a
heatmap.
9. Identify the top 5 countries by total sales and visualize their contribution using a pie
chart.
10. End.
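The aggregation steps above can be sketched on a small made-up table. Only InvoiceDate, Quantity and UnitPrice are named in the algorithm; the column names Description and Country, the derived TotalSales column (Quantity times UnitPrice) and every value in the frame are assumptions for illustration. Plotting is left out so that the sketch runs without a display.

```python
import pandas as pd

# Hypothetical retail records; all values are invented for illustration.
sales = pd.DataFrame({
    "InvoiceDate": ["2011-01-05", "2011-01-20", "2011-02-03", "2011-02-15", "2011-03-01"],
    "Description": ["MUG", "LAMP", "MUG", "CLOCK", "LAMP"],
    "Quantity": [10, 4, 6, 3, 5],
    "UnitPrice": [2.5, 12.0, 2.5, 8.0, 12.0],
    "Country": ["United Kingdom", "France", "United Kingdom", "Germany", "France"],
})

# Step 3: convert InvoiceDate to datetime.
sales["InvoiceDate"] = pd.to_datetime(sales["InvoiceDate"])

# Step 4: total quantity sold per month ("MS" labels each bin by month start).
monthly_quantity = sales.set_index("InvoiceDate")["Quantity"].resample("MS").sum()

# Step 6: products ranked by total quantity sold.
top_products = sales.groupby("Description")["Quantity"].sum().sort_values(ascending=False)

# Steps 7 and 9: total sales value per country (assumed to be Quantity * UnitPrice).
sales["TotalSales"] = sales["Quantity"] * sales["UnitPrice"]
sales_by_country = sales.groupby("Country")["TotalSales"].sum().sort_values(ascending=False)

# Step 8: correlation between Quantity and UnitPrice.
correlation = sales[["Quantity", "UnitPrice"]].corr()

print(monthly_quantity)
print(top_products.head(10))
print(sales_by_country.head(5))
print(correlation)
```

Each resulting Series can be passed directly to its `.plot()` method (for example `top_products.plot(kind="barh")`) to produce the charts listed in the algorithm.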
Program:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 8))
numeric_df = df.select_dtypes(include=['float64', 'int64'])  # Exclude non-numeric columns
correlation_matrix = numeric_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm',
linewidths=0.5)
plt.title('Feature Correlation Heatmap')
plt.show()
plt.figure(figsize=(12, 6))
price_trend.plot(color='green')
plt.title('Average House Price Over Time')
plt.xlabel('Date')
plt.ylabel('Average Price')
plt.show()
Output:
Figure 1.1: Plot
Figure 1.2: Histogram
Figure 1.3: Horizontal bar chart
AIM:
To implement and evaluate Linear Regression for salary prediction and Logistic
Regression for diabetes classification using Python libraries.
Algorithm:
Start
Linear Regression (Salary Prediction)
1. Load the dataset (Salary_Data_linear_regression.csv).
2. Extract Years of Experience as the independent variable (X) and Salary as the
dependent variable (y).
3. Split the data into training and testing sets (80% train, 20% test).
4. Train a Linear Regression model using the training data.
5. Predict salary values on the test set.
6. Evaluate the model using Mean Squared Error (MSE).
7. Plot the actual vs. predicted salaries.
Logistic Regression (Diabetes Classification)
1. Load the dataset (diabetes_logistic_regression.csv).
2. Extract the input features (X) and the target outcome (y).
3. Split the data into training and testing sets (80% train, 20% test).
4. Train a Logistic Regression model using the training data.
5. Predict diabetes outcomes on the test set.
6. Evaluate the model using Accuracy Score and Classification Report (precision,
recall, F1-score).
End.
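For a single feature, the line fitted by Linear Regression has a closed-form least-squares solution, and the MSE is the mean of the squared residuals. The minimal sketch below works both out with NumPy. The experience and salary pairs are invented (they are not taken from Salary_Data_linear_regression.csv), and the train/test split is skipped for brevity.

```python
import numpy as np

# Invented data for illustration: years of experience and salary.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([35000.0, 41000.0, 44000.0, 51000.0, 54000.0])

# Closed-form least-squares estimates for y = intercept + slope * x.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Predictions and Mean Squared Error (mean of the squared residuals).
y_hat = intercept + slope * x
mse = np.mean((y - y_hat) ** 2)

print(f"slope={slope:.1f}, intercept={intercept:.1f}, mse={mse:.1f}")
```

On these five points the slope is 4800, the intercept is 30600 and the MSE is 720000; sklearn's LinearRegression fitted on the same points should report the same coefficients.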
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the data
data = pd.read_csv(r'D:\College\SEMESTER 6\Machine Learning\Lab\lab\Salary_Data_linear_regression.csv')
# Prepare the features (X) and target variable (y)
X = data['YearsExperience'].values.reshape(-1, 1)
y = data['Salary'].values
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Print model details
print("Linear Regression Analysis Results:")
print(f"Slope (Coefficient): {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")
# Plot actual vs. predicted salaries on the test set (step 7 of the algorithm)
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Linear Regression: Actual vs. Predicted Salary')
plt.legend()
plt.show()
Output:
Figure 2.1: Linear Regression
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_curve, auc)
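# Load dataset
df = pd.read_csv("D:/College/SEMESTER 6/Machine Learning/Lab/lab/diabetes_logistic_regression.csv")
# Separate input features (X) and target outcome (y)
# Assumption: the target outcome is the last column of the file
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)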
# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
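# Train the Logistic Regression model and predict on the test set
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Evaluate model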
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_report(y_test, y_pred))
# Plot confusion matrix
plt.figure(figsize=(6,4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
# ROC Curve
probs = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, color='blue', label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.show()
Output:
Figure 2.2: Logistic Regression
Exp: No : 3 K Nearest Neighbour Date:
AIM:
To implement and compare Linear Regression & K-Nearest Neighbours (KNN)
Regression for salary prediction and Logistic Regression & K-Nearest Neighbours (KNN)
Classification for diabetes prediction.
Algorithm:
Start
Regression (Salary Prediction using Linear Regression & KNN Regression)
1. Load the Salary Dataset (Salary_Data_linear_regression.csv).
2. Select Years of Experience as the independent variable (X) and Salary as the
dependent variable (y).
3. Split the dataset into training (80%) and testing (20%) sets.
4. Train a Linear Regression model on the training data.
5. Predict salaries using the trained model and calculate Mean Squared Error (MSE).
6. Train a KNN Regressor (k=5) on the same data.
7. Predict salaries using KNN Regressor and calculate MSE.
8. Plot Actual vs. Predicted salaries for both models.
Classification (Diabetes Prediction using Logistic Regression & KNN Classification)
1. Load the Diabetes Dataset (diabetes_logistic_regression.csv).
2. Extract all input features (X) and target labels (y).
3. Split the dataset into training (80%) and testing (20%) sets.
4. Train a Logistic Regression model on the training data.
5. Predict diabetes outcomes on the test set and calculate accuracy.
6. Train a KNN Classifier (k=5) on the same data.
7. Predict outcomes using KNN and calculate accuracy.
8. Generate Classification Reports for both models.
End.
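The regression half of this comparison can be sketched as follows. The data are synthetic stand-ins generated with NumPy (an assumption, not the salary file used in the lab), so the printed MSE values only illustrate the workflow and say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the salary data (assumption, not the lab dataset).
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(60, 1))                       # years of experience
y = 30000 + 9000 * X[:, 0] + rng.normal(0, 3000, size=60)  # salary with noise

# 80% train, 20% test, as in the algorithm.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "KNN Regressor (k=5)": KNeighborsRegressor(n_neighbors=5),
}

# Fit each model on the same split and record its test-set MSE.
mse_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse_scores[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: MSE = {mse_scores[name]:.2f}")
```

Using one shared split for both models keeps the comparison fair, since both are scored on exactly the same held-out points.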
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Load dataset
file_path = r"D:\College\SEMESTER 6\Machine Learning\Lab\lab\kc_house_data.csv"
if not os.path.exists(file_path):
    raise FileNotFoundError(f"Dataset not found: {file_path}")
df = pd.read_csv(file_path)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
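# Train the KNN classifier (k=5) and predict on the test set
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Evaluate model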
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_report(y_test, y_pred))
Output:
Results:
Thus, a Python program to implement and compare Linear Regression & K-Nearest
Neighbours (KNN) Regression for salary prediction and Logistic Regression & K-Nearest
Neighbours (KNN) Classification for diabetes prediction has been executed and the outputs
have been verified.
Exp: No : 4 Support Vector Machine Date:
AIM:
To implement Support Vector Machine (SVM) classification on the Social Network
Ads dataset and evaluate its performance.
Algorithm:
1. Start
2. Read the dataset Social_Network_Ads_SVM.csv using pandas.
3. Select the feature columns (Age & Estimated Salary) and the target column
(Purchased).
4. Split the dataset into training (80%) and testing (20%) sets.
5. Standardize the feature values using StandardScaler.
6. Use SVC(kernel='linear') to train a linear Support Vector Machine on the training
set.
7. Predict the target values on the test set.
8. Calculate accuracy using accuracy_score.
9. Generate a classification report showing precision, recall, and F1-score.
10. Compute and visualize a confusion matrix using seaborn.heatmap().
11. End.
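Step 5 matters because an SVM is sensitive to feature scale, and Age and Estimated Salary differ by several orders of magnitude. The sketch below shows what StandardScaler computes: z = (x - mean) / std, with the mean and standard deviation taken from the training split only. The three training rows and the single test row are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented [Age, Estimated Salary] rows (assumption for illustration).
X_train = np.array([[20.0, 20000.0], [30.0, 40000.0], [40.0, 60000.0]])
X_test = np.array([[30.0, 80000.0]])

# Fit on the training split only, then reuse the same statistics on the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The same transformation by hand, using the training mean and
# population standard deviation (ddof=0), which is what StandardScaler uses.
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print("Scaled training data:\n", X_train_scaled)
print("Scaled test row:", X_test_scaled)
print("Manual test row:", manual)
```

After scaling, each training column has mean 0 and unit variance, so neither feature dominates the margin purely because of its units.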
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Load dataset
df = pd.read_csv(r"D:\College\SEMESTER 6\Machine Learning\Lab\lab\Social_Network_Ads_SVM.csv")
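# Select features and target
# Assumption: the columns are named 'Age', 'EstimatedSalary' and 'Purchased'
X = df[['Age', 'EstimatedSalary']]
y = df['Purchased']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)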
# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train a linear SVM and predict on the test set
model = SVC(kernel='linear')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_report(y_test, y_pred))
Output:
Results:
Thus, a Python program to implement Support Vector Machine (SVM) classification on
the Social Network Ads dataset and evaluate its performance has been executed and the outputs
have been verified.
Exp: No : 5 Principal Component Analysis Date:
AIM:
To perform Principal Component Analysis (PCA) on the Social Network Ads dataset
to reduce dimensionality and visualize the data in a lower-dimensional space.
Algorithm:
1. Start
2. Read the dataset Social_Network_Ads_SVM.csv using pandas.
3. Convert categorical columns (if any) into numerical values using LabelEncoder().
4. Extract features (X) and target variable (y).
5. Standardize the dataset using StandardScaler() to normalize feature values.
6. Use PCA(n_components=2) to reduce the dataset to two principal components for
visualization.
7. Transform the standardized feature matrix into the new PCA space.
8. Create a scatter plot of the PCA-transformed data, coloured by the target variable.
9. Display the explained variance ratio to understand how much variance each
principal component captures.
10. End.
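Steps 5 to 9 can be sketched on synthetic data. The three feature columns below are generated with NumPy (an assumption, not the Social Network Ads data); the third column nearly duplicates the first, so two principal components should capture almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic features (assumption): the third column nearly duplicates the first.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + 0.1 * rng.normal(size=200)])

# Standardize, then project onto the first two principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Shape after PCA:", X_pca.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())
```

The explained variance ratio is reported per component in decreasing order; its sum shows how much of the standardized data's variance survives the reduction to two dimensions.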
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.decomposition import PCA
# Load dataset
file_path = r"D:\College\SEMESTER 6\Machine Learning\Lab\lab\Social_Network_Ads_SVM.csv"
df = pd.read_csv(file_path)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Apply PCA
pca = PCA(n_components=2) # Reduce to 2 principal components
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
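# Train a classifier on the principal components and predict on the test set
# Assumption: an SVM is used here, as the SVC import suggests
model = SVC()
model.fit(X_train_pca, y_train)
y_pred = model.predict(X_test_pca)
# Evaluate model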
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_report(y_test, y_pred))
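# Explained variance captured by each principal component
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
# Minimal plot_pca helper (assumed implementation): scatter plot of the
# PCA-transformed data, coloured by the target variable
def plot_pca(X_pca, y):
    plt.figure(figsize=(8, 6))
    sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=y)
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.title('PCA-Transformed Data')
    plt.show()
plot_pca(X_train_pca, y_train)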
Results:
Thus, a Python program to perform Principal Component Analysis (PCA) on the Social
Network Ads dataset to reduce dimensionality and visualize the data in a lower-dimensional
space has been executed and the outputs have been verified.
Exp: No : 6 K Means Clustering Date:
AIM:
To perform K-Means Clustering on the King County House Sales dataset to group
similar houses based on numerical features and visualize the clusters.
Algorithm:
1. Start
2. Read the dataset kc_house_data.csv using pandas.
3. Select only numerical columns for clustering.
4. Remove missing values (NaN) from the dataset.
5. Standardize the features using StandardScaler() to normalize the data.
6. Choose K = 3 clusters.
7. Fit the K-Means algorithm on the standardized dataset.
8. Predict the cluster for each data point and store it in a new column.
9. Plot a scatter plot of the first two principal features with color-coded clusters.
10. Display the cluster centers of the fitted model.
11. End.
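The clustering steps above can be sketched on synthetic data. The three point groups below are generated with NumPy (an assumption, not the King County house data), and the `n_init` and `random_state` values are illustrative choices made for repeatability.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Three well-separated synthetic groups of 50 points each (assumption).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Standardize, then fit K-Means with K = 3 and assign a cluster to every point.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

print("Cluster sizes:", np.bincount(labels))
print("Cluster centers (scaled units):\n", kmeans.cluster_centers_)
```

The cluster centers are expressed in standardized units; keeping a reference to the fitted scaler and calling its `inverse_transform` on `kmeans.cluster_centers_` would map them back to the original units.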
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
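# Load dataset
file_path = r"D:\College\SEMESTER 6\Machine Learning\Lab\lab\kc_house_data.csv"
df = pd.read_csv(file_path)
# Assumption: cluster on the two numerical columns named on the plot axes
data = df[['sqft_living', 'price']].dropna()
df = df.loc[data.index].copy()
# Standardize the features
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
# Fit K-Means with K = 3 and store the cluster of each data point
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(data_scaled)
print("Cluster centers:\n", kmeans.cluster_centers_)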
# Visualizing Clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x=data_scaled[:, 0], y=data_scaled[:, 1], hue=df['Cluster'], palette='viridis')
plt.xlabel('Scaled sqft_living')
plt.ylabel('Scaled price')
plt.title('K-Means Clustering')
plt.legend(title='Cluster')
plt.show()
Output:
Figure 6.1: Scatter plot of K-Means clustering
Results:
Thus, a Python program to perform K-Means Clustering on the King County House
Sales dataset to group similar houses based on numerical features and visualize the clusters
has been executed and the outputs have been verified.