
Machine Learning Laboratory BCSL606

By: Prof. M.Saritha, Dept. of AI&DS, SGBIT.

1: Develop a program to create histograms for all numerical features and analyze the distribution of each feature. Generate box plots for all numerical features and identify any outliers. Use California Housing dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

"""# Importing the Dataset"""

california = fetch_california_housing()
df = pd.DataFrame(california.data, columns=california.feature_names)

df.head()

# Display basic information
print(df.info())
print(df.describe())

"""# Create histograms for all numerical features"""

df.hist(figsize=(12, 8), bins=30, edgecolor='black')
plt.suptitle("Histograms of Numerical Features", fontsize=16)
plt.show()

"""# Create box plots for all numerical features"""

plt.figure(figsize=(12, 8))
for i, col in enumerate(df.columns):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(y=df[col])
    plt.title(f'Box Plot of {col}')
    plt.xlabel("")
plt.tight_layout()
plt.show()
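
Box plots flag outliers visually; as a companion, here is a minimal sketch (not part of the original listing) that counts outliers per feature using the same 1.5*IQR rule the box plots are based on:

# Count outliers per feature with the 1.5*IQR rule used by box plots
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
outlier_counts = ((df < Q1 - 1.5 * IQR) | (df > Q3 + 1.5 * IQR)).sum()
print("Outliers per feature:\n", outlier_counts)
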
Data Set:
File name: California Housing dataset from sklearn

Output


2: Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to identify which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use California Housing dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

"""# Load California Housing dataset"""

california = fetch_california_housing()
df = pd.DataFrame(california.data, columns=california.feature_names)

"""# Compute the correlation matrix

"""

correlation_matrix = df.corr()

"""# Plot the heatmap of the correlation matrix"""

plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix of California Housing Features")
plt.show()

"""# Pair plot for pairwise feature relationships"""

sns.pairplot(df, diag_kind="kde", corner=True)
plt.show()
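
To read the heatmap programmatically, a hedged sketch follows (it assumes pandas >= 1.1 for the key argument, and the 0.5 threshold is an arbitrary choice) that lists the most strongly correlated feature pairs:

# List feature pairs whose absolute correlation exceeds a chosen threshold
pairs = correlation_matrix.unstack()
pairs = pairs[pairs.index.get_level_values(0) < pairs.index.get_level_values(1)]  # drop self/duplicate pairs
strong_pairs = pairs[pairs.abs() > 0.5].sort_values(key=abs, ascending=False)
print("Strongly correlated pairs:\n", strong_pairs)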

Data Set:
File name: California_housing dataset from sklearn


Output:


3: Develop a program to implement Principal Component Analysis (PCA) to reduce the dimensionality of the Iris dataset from 4 features to 2.

# Importing important libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

"""# Load the Iris dataset"""

iris = datasets.load_iris()
X = iris.data # Features (4-dimensional)
y = iris.target # Target labels (Setosa, Versicolor, Virginica)

"""# Standardize the features (PCA works better with standardized data)"""

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

"""# Apply PCA to reduce dimensions from 4 to 2"""

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

"""# Convert to DataFrame for visualization"""

df_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])


df_pca['Target'] = y

"""# Visualize the transformed data"""

plt.figure(figsize=(8, 6))
sns.scatterplot(x='PC1', y='PC2', hue=df_pca['Target'], palette='viridis', data=df_pca, alpha=0.8)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA: Iris Dataset (4D → 2D)")
plt.legend(title="Species", labels=iris.target_names)
plt.show()

# Explained variance ratio
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance explained:", sum(pca.explained_variance_ratio_))


Data Set:
File name: Iris dataset from sklearn

Output:


4: For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples.

import pandas as pd

# Quick look at the dataset
data = pd.read_csv('playing_tenis_dataset.csv')
data.head()
data.columns

def find_s_algorithm(file_path):
    # Load the dataset
    data = pd.read_csv(file_path)

    # Remove the 'day' column since it's not a feature
    data = data.drop(columns=['day'])

    # Extract features and target class
    attributes = data.iloc[:, :-1]  # All columns except the last one (features)
    target = data.iloc[:, -1]       # Last column (class label)

    # Initialize the most specific hypothesis
    hypothesis = ['ϕ'] * len(attributes.columns)  # Start with empty hypothesis

    # Find the first positive example
    for i in range(len(target)):
        if target[i] == 'Yes':  # Considering 'Yes' as a positive example
            hypothesis = attributes.iloc[i].tolist()
            break

    # Iterate over all positive examples to generalize the hypothesis
    for i in range(len(target)):
        if target[i] == 'Yes':  # Process only positive examples
            for j in range(len(hypothesis)):
                if hypothesis[j] != attributes.iloc[i, j]:  # If values differ
                    hypothesis[j] = '?'  # Generalize

    return hypothesis

# Example usage:
file_path = "playing_tenis_dataset.csv"
final_hypothesis = find_s_algorithm(file_path)
print("Final Hypothesis:", final_hypothesis)


Data Set:
File name: playing_tenis_dataset.csv

Output:


5. Develop a program to implement k-Nearest Neighbour algorithm to classify the randomly generated 100 values of x in the range of [0,1]. Perform the following based on the dataset generated.

a. Label the first 50 points {x1,…,x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1, else xi ∊ Class2
b. Classify the remaining points, x51,…,x100 using KNN. Perform this for k=1,2,3,4,5,20,30

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Step 1: Generate 100 random values in the range [0,1]
np.random.seed(42)  # For reproducibility
x = np.random.rand(100, 1)  # 100 values between 0 and 1

# Step 2: Assign labels to the first 50 values
labels = np.array([1 if xi <= 0.5 else 2 for xi in x[:50]]).reshape(-1, 1)

# Step 3: Train KNN classifiers with different k values
k_values = [1, 2, 3, 4, 5, 20, 30]

# Initialize the results dictionary
classification_results = {}

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x[:50], labels.ravel())  # Train using first 50 points
    predictions = knn.predict(x[50:])  # Predict the remaining 50 points
    classification_results[k] = predictions

# Step 4: Visualize the classification results
plt.figure(figsize=(10, 6))
plt.scatter(x[:50], labels, color='blue', label='Training Data (Class 1, 2)')
for k in k_values:
    plt.scatter(x[50:], classification_results[k], label=f'k={k}', alpha=0.6)
plt.xlabel("x values")
plt.ylabel("Class")
plt.legend()
plt.title("k-NN Classification Results for Different k Values")
plt.show()

# Step 5: Print classification results
for k in k_values:
    print(f"k={k}: {classification_results[k]}")


Data Set:
Filename: 100 random values in the range [0, 1] generated with NumPy

Output


6. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler

def locally_weighted_regression(x_train, y_train, x_test, tau):
    """
    Locally Weighted Regression (LWR) implementation.
    x_train: Training feature values.
    y_train: Training target values.
    x_test: Query point(s) for prediction.
    tau: Bandwidth parameter for weighting.
    """
    m = x_train.shape[0]
    x_train_aug = np.c_[np.ones(m), x_train]  # Add bias term
    x_test_aug = np.c_[np.ones(len(x_test)), x_test]  # Add bias term

    y_pred = np.zeros(len(x_test))

    for i in range(len(x_test)):
        x_i = x_test_aug[i]

        # Compute weights using a Gaussian kernel
        W = np.exp(-np.square(x_train - x_test[i]) / (2 * tau ** 2))
        W = np.diag(W.flatten())

        # Compute theta: (X'WX)^(-1) X'W y
        theta = np.linalg.pinv(x_train_aug.T @ W @ x_train_aug) @ x_train_aug.T @ W @ y_train
        y_pred[i] = x_i @ theta

    return y_pred

# Load dataset
df = pd.read_csv("diabetes.csv") # Ensure the file is in the same directory or provide the correct path

# Select BMI as feature and Outcome as target
x = df["BMI"].values.reshape(-1, 1)
y = df["Outcome"].values

# Standardize the dataset
scaler = StandardScaler()
x = scaler.fit_transform(x)

# Split into training and test sets
train_size = int(0.7 * len(x))
x_train, y_train = x[:train_size], y[:train_size]
x_test, y_test = x[train_size:], y[train_size:]

# Apply LWR with different tau values
tau_values = [0.1, 0.5, 1, 5]
plt.figure(figsize=(10, 6))
plt.scatter(x_train, y_train, label='Training Data', color='blue', alpha=0.5)


# Sort the query points so each prediction curve plots cleanly
order = np.argsort(x_test[:, 0])
x_test_sorted = x_test[order]

for tau in tau_values:
    y_pred = locally_weighted_regression(x_train, y_train, x_test_sorted, tau)
    plt.plot(x_test_sorted, y_pred, label=f'tau={tau}')

plt.xlabel("Standardized BMI")
plt.ylabel("Diabetes Outcome (0 or 1)")
plt.legend()
plt.title("Locally Weighted Regression on Pima Diabetes Dataset")
plt.show()
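
Because the diabetes outcome is binary, the fitted curves above are hard to judge by eye. As a hedged sanity check on synthetic data (assumed, not part of the lab exercise), LWR should closely track a noisy sine curve for a moderate tau:

# Sanity check: fit a noisy sine curve with the same LWR routine
x_syn = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y_syn = np.sin(x_syn).ravel() + np.random.normal(0, 0.1, 100)
y_fit = locally_weighted_regression(x_syn, y_syn, x_syn, tau=0.5)

plt.figure(figsize=(8, 5))
plt.scatter(x_syn, y_syn, alpha=0.4, label='Noisy sine data')
plt.plot(x_syn, y_fit, color='red', label='LWR fit (tau=0.5)')
plt.legend()
plt.show()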

Data Set:
Filename: diabetes.csv (Pima Indians Diabetes dataset)

Output


7a. Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use the Boston Housing Dataset for Linear Regression and the Auto MPG Dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv("/content/drive/MyDrive/MachineLearningLab_2025/BostonHousing.csv")

df.head()

df.columns

df.info()

df = df.dropna()

# Select RM (number of rooms) as the independent variable
X_boston = df[['rm']]
y_boston = df['medv']

# Split into training and test sets
X_train_boston, X_test_boston, y_train_boston, y_test_boston = train_test_split(
    X_boston, y_boston, test_size=0.2, random_state=42)

# Apply Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X_train_boston, y_train_boston)
y_pred_boston = lin_reg.predict(X_test_boston)

# Plot Linear Regression results
plt.figure(figsize=(10, 5))
plt.scatter(X_test_boston, y_test_boston, color='blue', label='Actual')
plt.plot(X_test_boston, y_pred_boston, color='red', linewidth=2, label='Predicted')
plt.xlabel('Number of Rooms')
plt.ylabel('House Price (MEDV)')
plt.title('Linear Regression on Boston Housing Dataset')
plt.legend()
plt.show()
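
mean_squared_error is imported above but never used; a small hedged addition reports test-set metrics for the linear model:

from sklearn.metrics import r2_score

print("Linear Regression MSE (Boston):", mean_squared_error(y_test_boston, y_pred_boston))
print("Linear Regression R2 Score (Boston):", r2_score(y_test_boston, y_pred_boston))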


Dataset:
File name: BostonHousing.csv

Output:


7b. Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use the Boston Housing Dataset for Linear Regression and the Auto MPG Dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load Auto MPG Dataset for Polynomial Regression
auto_mpg_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', 'origin', 'car_name']

auto_mpg = pd.read_csv(auto_mpg_url, sep=r'\s+', names=columns, na_values='?')

auto_mpg.head()

auto_mpg.info()

auto_mpg = auto_mpg.dropna()

# Select horsepower as independent variable and mpg as dependent variable
X_auto = auto_mpg[['horsepower']].astype(float)
y_auto = auto_mpg['mpg']

# Split into training and test sets
X_train_auto, X_test_auto, y_train_auto, y_test_auto = train_test_split(
    X_auto, y_auto, test_size=0.2, random_state=42)

# Apply Polynomial Regression (degree 2)
poly = PolynomialFeatures(degree=2)
X_train_auto_poly = poly.fit_transform(X_train_auto)
X_test_auto_poly = poly.transform(X_test_auto)

poly_reg = LinearRegression()
poly_reg.fit(X_train_auto_poly, y_train_auto)
y_pred_auto = poly_reg.predict(X_test_auto_poly)

# Plot Polynomial Regression results
plt.figure(figsize=(10, 5))
plt.scatter(X_test_auto, y_test_auto, color='blue', label='Actual')
plt.scatter(X_test_auto, y_pred_auto, color='red', label='Predicted', alpha=0.5)
plt.xlabel('Horsepower')
plt.ylabel('MPG')
plt.title('Polynomial Regression on Auto MPG Dataset')
plt.legend()
plt.show()

print("Polynomial Regression MSE (Auto MPG):", mean_squared_error(y_test_auto, y_pred_auto))

print("Polynomial Regression R2 Score (Auto MPG):", r2_score(y_test_auto, y_pred_auto))


Dataset:
File name: Auto MPG dataset (auto-mpg.data) from the UCI repository

Output:


8: Develop a program to demonstrate the working of the decision tree algorithm. Use Breast Cancer Data
set for building the decision tree and apply this knowledge to classify a new sample.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

"""# Load the Breast Cancer dataset"""

df = datasets.load_breast_cancer()

df

data =pd.DataFrame(df.data,columns=df.feature_names)

data.head()

X = data # Features
y = df.target # Target (Malignant: 0, Benign: 1)

"""# Split the dataset into Training and Testing sets (80% train, 20% test)"""

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

"""# Train a Decision Tree Classifier"""

clf = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=42)


clf.fit(X_train, y_train)

"""# Predict on test data"""

y_pred = clf.predict(X_test)

"""# Evaluate the model"""

accuracy = accuracy_score(y_test, y_pred)


print(f"Model Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Visualizing the Decision Tree
plt.figure(figsize=(15, 8))
plot_tree(clf, feature_names=cancer.feature_names, class_names=cancer.target_names, filled=True)
plt.title("Decision Tree Visualization")
plt.show()

"""# Classify a new sample"""


new_sample = np.array([[15.0, 14.5, 85.0, 600.0, 0.09, 0.08, 0.05, 0.05, 0.17, 0.06,
0.35, 1.2, 2.5, 30.0, 0.007, 0.02, 0.02, 0.01, 0.02, 0.003,
16.0, 18.0, 110.0, 800.0, 0.14, 0.20, 0.20, 0.12, 0.30, 0.08]])

prediction = clf.predict(new_sample)
print(f"New Sample Prediction: {'Malignant' if prediction[0] == 0 else 'Benign'}")

Data Set:
File name: Breast Cancer dataset from sklearn

Output:


9.Develop a program to implement the Naive Bayesian classifier considering Olivetti Face Data set for
training. Compute the accuracy of the classifier, considering a few test data sets.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load Olivetti Faces Dataset
faces = fetch_olivetti_faces(shuffle=True, random_state=42)
X = faces.data  # Features (flattened images)
y = faces.target  # Labels (person ID)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Naive Bayes classifier
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

# Predict on test data
y_pred = nb_classifier.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Naive Bayes Classifier Accuracy: {accuracy:.2f}')

# Visualize some sample predictions
fig, axes = plt.subplots(2, 5, figsize=(10, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_test[i].reshape(64, 64), cmap='gray')
    ax.set_title(f'Pred: {y_pred[i]}, Actual: {y_test[i]}')
    ax.axis('off')
plt.tight_layout()
plt.show()
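
Beyond overall accuracy, a hedged addition (not in the original listing) prints per-class precision and recall; with 40 subjects the report is long, so this is purely diagnostic:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, zero_division=0))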


Dataset:
File name: Olivetti Faces dataset from sklearn

Output:


10. Develop a program to implement k-means clustering using the Wisconsin Breast Cancer data set and visualize the clustering result.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
columns = ['ID', 'Diagnosis'] + [f'Feature_{i}' for i in range(1, 31)]
data = pd.read_csv(url, header=None, names=columns)
data.drop('ID', axis=1, inplace=True)
data['Diagnosis'] = data['Diagnosis'].map({'M': 1, 'B': 0})  # Malignant = 1, Benign = 0
X = data.iloc[:, 1:].values  # Extract features
y = data.iloc[:, 0].values  # Extract true labels

# Standardizing the dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Applying PCA to reduce dimensionality
pca = PCA(n_components=2)


X_pca = pca.fit_transform(X_scaled)

# Implementing K-Means with optimized parameters
kmeans = KMeans(n_clusters=2, init='k-means++', random_state=42)
kmeans_labels = kmeans.fit_predict(X_pca)

# Evaluate clustering performance
silhouette_kmeans = silhouette_score(X_pca, kmeans_labels)

# Implementing Gaussian Mixture Model (GMM)
gmm = GaussianMixture(n_components=2, random_state=42)
gmm_labels = gmm.fit_predict(X_pca)

# Evaluate GMM performance
silhouette_gmm = silhouette_score(X_pca, gmm_labels)

# Plot clustering results
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
ax[0].scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans_labels, cmap='coolwarm', alpha=0.6)
ax[0].set_title(f'K-Means Clustering (Silhouette Score: {silhouette_kmeans:.2f})')
ax[1].scatter(X_pca[:, 0], X_pca[:, 1], c=gmm_labels, cmap='coolwarm', alpha=0.6)
ax[1].set_title(f'GMM Clustering (Silhouette Score: {silhouette_gmm:.2f})')
plt.show()

print(f'K-Means Silhouette Score: {silhouette_kmeans:.2f}')
print(f'Gaussian Mixture Model (GMM) Silhouette Score: {silhouette_gmm:.2f}')
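
Silhouette scores judge cluster shape only. Since the true diagnoses are available here, a hedged addition compares the clusters to them with the Adjusted Rand Index, which is invariant to label permutation:

from sklearn.metrics import adjusted_rand_score

print("K-Means ARI vs. diagnosis:", adjusted_rand_score(y, kmeans_labels))
print("GMM ARI vs. diagnosis:", adjusted_rand_score(y, gmm_labels))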


Dataset:
File name: Wisconsin Diagnostic Breast Cancer dataset (wdbc.data) from the UCI repository

Output :
