Contents:
1. Practice Programs
2. Develop a program to create histograms for all numerical features and analyze the distribution of each feature. Generate box plots for all numerical features and identify any outliers. Use the California Housing dataset.
3. Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use the California Housing dataset.
4. Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2.
5. For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples.
6. Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0,1]. Perform the following based on the generated dataset:
   a. Label the first 50 points {x1, …, x50} as follows: if (xi ≤ 0.5), then xi ∈ Class1, else xi ∈ Class2
   b. Classify the remaining points x51, …, x100 using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30
7. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate dataset for your experiment and draw graphs.
8. Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use the Boston Housing dataset for Linear Regression and the Auto MPG dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.
9. Develop a program to demonstrate the working of the decision tree algorithm. Use the Breast Cancer dataset for building the decision tree and apply this knowledge to classify a new sample.
10. Develop a program to implement the Naive Bayesian classifier considering the Olivetti Face dataset for training. Compute the accuracy of the classifier, considering a few test data sets.
11. Develop a program to implement k-means clustering using the Wisconsin Breast Cancer dataset and visualize the clustering result.
12. Viva Questions
Practice Programs:
1. Write a Python Script to create a DataFrame.
import pandas as pd
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
# Display the DataFrame
print(df)
2. Write a Python Script to Read and Write CSV Files
# Save the DataFrame from Script 1 to a CSV file
df.to_csv('data.csv', index=False)
# Read DataFrame from a CSV file
df_read = pd.read_csv('data.csv')
print(df_read)
3. Write a Python Script to perform Basic DataFrame Operations
# Show first 2 rows
print(df.head(2))
# Show last 2 rows
print(df.tail(2))
# Get summary statistics
print(df.describe())
# Get column names
print(df.columns)
# Get DataFrame shape (rows, columns)
print(df.shape)
# Get data types of each column
print(df.dtypes)
4. Write a Python Script for Selecting and Filtering Data
# Select a single column
print(df['Name'])
# Select multiple columns
print(df[['Name', 'Age']])
# Filter rows based on a condition
print(df[df['Age'] > 30])
5. Write a Python Script for Adding and Modifying Columns
# Add a new column
df['Salary'] = [50000, 60000, 70000, 80000]
# Modify an existing column
df['Age'] = df['Age'] + 1 # Increase age by 1
print(df)
6. Write a Python Script for Sorting and Grouping Data
# Sort DataFrame by Age in ascending order
print(df.sort_values(by='Age'))
# Group data by City and find the mean Age
print(df.groupby('City')['Age'].mean())
7. Write a Python Script for Handling Missing Values
import numpy as np
# Introduce missing values
df.loc[1, 'Age'] = np.nan
# Check for missing values
print(df.isnull().sum())
# Fill missing values with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
8. Write a Python Script for Applying Functions to DataFrame
# Apply a function to a column
df['Age_Category'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Old')
print(df)
9. Write a Python Script to Plot a Line Plot (Trends over Time)
import matplotlib.pyplot as plt
import numpy as np
# Sample Data
x = np.arange(1, 11)
y = np.sin(x)
# Line Plot
plt.plot(x, y, marker='o', linestyle='-', color='b', label='Sine Wave')
plt.xlabel("X Values")
plt.ylabel("Y Values")
plt.title("Simple Line Plot")
plt.legend()
plt.grid(True)
plt.show()
10. Write a Python Script to Plot a Bar Chart (Category Comparison)
import matplotlib.pyplot as plt
# Sample Data
categories = ['A', 'B', 'C', 'D', 'E']
values = [10, 25, 15, 30, 20]
# Bar Plot
plt.bar(categories, values, color=['red', 'blue', 'green', 'purple', 'orange'])
plt.xlabel("Categories")
plt.ylabel("Values")
plt.title("Bar Chart Example")
plt.show()
11. Write a Python Script to Plot a Histogram (Distribution of Data)
import numpy as np
import matplotlib.pyplot as plt
# Generate Random Data
data = np.random.randn(1000)
# Histogram
plt.hist(data, bins=30, color='skyblue', edgecolor='black')
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of Random Data")
plt.show()
12. Write a Python Script to Plot a Scatter Plot (Relationship between Two Variables)
import numpy as np
import matplotlib.pyplot as plt
# Generate Data
x = np.random.rand(100)
y = np.random.rand(100)
# Scatter Plot
plt.scatter(x, y, c='red', alpha=0.6)
plt.xlabel("X Values")
plt.ylabel("Y Values")
plt.title("Scatter Plot Example")
plt.show()
13. Write a Python Script to Plot a Box Plot (Detecting Outliers)
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Generate Random Data
data = np.random.randn(100)
# Box Plot
sns.boxplot(data=data, color='lightblue')
plt.title("Box Plot Example")
plt.show()
14. Write a Python Script to Plot a Pair Plot (Multiple Feature Relationships - Iris
Dataset)
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Load Iris Dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target
# Pair Plot
sns.pairplot(df, hue='species', palette='coolwarm')
plt.show()
15. Write a Python Script to Plot a Heatmap (Correlation Matrix - Titanic Dataset)
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# Load Sample Dataset
df = sns.load_dataset("titanic").dropna()
# Compute Correlation (numeric columns only)
corr_matrix = df.corr(numeric_only=True)
# Heatmap
plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
Program 1: Develop a program to create histograms for all numerical features and analyze
the distribution of each feature. Generate box plots for all numerical features and identify
any outliers. Use California Housing dataset.
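The outlier check at the end of the program uses the standard interquartile-range (IQR) rule: with $Q_1$ and $Q_3$ the 25th and 75th percentiles of a feature and $\mathrm{IQR} = Q_3 - Q_1$, a value $x$ is flagged as an outlier when

$$x < Q_1 - 1.5\,\mathrm{IQR} \quad \text{or} \quad x > Q_3 + 1.5\,\mathrm{IQR}.$$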
Source Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
# Load California Housing dataset
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
# Display basic dataset information
print("Dataset Overview:")
print(df.info())
print("\nSummary Statistics:")
print(df.describe())
# Set plot style
sns.set_style("whitegrid")
# Create histograms for all numerical features
df.hist(bins=30, figsize=(12, 8), edgecolor='black')
plt.suptitle("Histograms of Numerical Features in California Housing Dataset", fontsize=14)
plt.show()
# Create box plots for all numerical features to identify outliers
plt.figure(figsize=(14, 8))
for i, col in enumerate(df.columns):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x=df[col], color="skyblue", width=0.6, fliersize=3)
    plt.title(col, fontsize=12)
plt.tight_layout()
plt.suptitle("Box Plots of Numerical Features", fontsize=14, y=1.02)
plt.show()
# Identify outliers using IQR method
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
outliers = ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))
print("\nOutlier Detection:")
print(outliers.sum())
Output :
Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MedInc 20640 non-null float64
1 HouseAge 20640 non-null float64
2 AveRooms 20640 non-null float64
3 AveBedrms 20640 non-null float64
4 Population 20640 non-null float64
5 AveOccup 20640 non-null float64
6 Latitude 20640 non-null float64
7 Longitude 20640 non-null float64
dtypes: float64(8)
Summary Statistics:
MedInc HouseAge AveRooms AveBedrms Population \
count 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean 3.870671 28.639486 5.429000 1.096675 1425.476744
std 1.899822 12.585558 2.474173 0.473911 1132.462122
min 0.499900 1.000000 0.846154 0.333333 3.000000
25% 2.563400 18.000000 4.440716 1.006079 787.000000
50% 3.534800 29.000000 5.229129 1.048780 1166.000000
75% 4.743250 37.000000 6.052381 1.099526 1725.000000
max 15.000100 52.000000 141.909091 34.066667 35682.000000
AveOccup Latitude Longitude
count 20640.000000 20640.000000 20640.000000
mean 3.070655 35.631861 -119.569704
std 10.386050 2.135952 2.003532
min 0.692308 32.540000 -124.350000
25% 2.429741 33.930000 -121.800000
50% 2.818116 34.260000 -118.490000
75% 3.282261 37.710000 -118.010000
max 1243.333333 41.950000 -114.310000
Outlier Detection:
MedInc 681
HouseAge 0
AveRooms 511
AveBedrms 1424
Population 1196
AveOccup 711
Latitude 0
Longitude 0
Program 2: Develop a program to compute the correlation matrix to understand the
relationships between pairs of features. Visualize the correlation matrix using a heatmap to
know which variables have strong positive/negative correlations. Create a pair plot to
visualize pairwise relationships between features. Use California Housing dataset.
Source Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
# Load California Housing dataset
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
# Set plot style
sns.set_style("whitegrid")
# Compute and visualize the correlation matrix
plt.figure(figsize=(10, 6))
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title("Feature Correlation Heatmap", fontsize=14)
plt.show()
# Create pair plot to visualize pairwise relationships between features
sns.pairplot(df, diag_kind='kde', plot_kws={'alpha':0.5})
plt.suptitle("Pair Plot of Features", fontsize=14, y=1.02)
plt.show()
# Identify skewness of numerical features
skew_values = df.skew()
print("\nSkewness of Features:")
print(skew_values)
Output:
Skewness of Features:
MedInc 1.646657
HouseAge 0.060331
AveRooms 20.697869
AveBedrms 31.316956
Population 4.935858
AveOccup 97.639561
Latitude 0.465953
Longitude -0.297801
Program 3: Develop a program to implement Principal Component Analysis (PCA) for
reducing the dimensionality of the Iris dataset from 4 features to 2.
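PCA projects the (standardized) data onto the orthogonal directions of maximum variance, which are the eigenvectors of the covariance matrix. The explained variance ratio printed at the end of the program is, for the $i$-th component with eigenvalue $\lambda_i$,

$$\frac{\lambda_i}{\sum_j \lambda_j},$$

so the two printed values show how much of the total variance the 2-D projection retains.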
Source Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
# Load Iris dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
# Standardize the data
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
# Apply PCA to reduce dimensionality from 4 to 2
pca = PCA(n_components=2)
principal_components = pca.fit_transform(df_scaled)
# Create a new DataFrame with principal components
pca_df = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])
pca_df['Target'] = data.target
# Visualize the PCA results
plt.figure(figsize=(8, 6))
for target, label in enumerate(data.target_names):
    subset = pca_df[pca_df['Target'] == target]
    plt.scatter(subset['PC1'], subset['PC2'], label=label, alpha=0.7)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Iris Dataset')
plt.legend()
plt.grid(True)
plt.show()
# Print explained variance ratio
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
Output:
Explained Variance Ratio: [0.72962445 0.22850762]
Program 4: For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Find-S algorithm to output a description of the set of all hypotheses
consistent with the training examples.
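The program reads its training examples from a file named data.csv in the working directory. Judging from the output shown below, the file holds the classic EnjoySport examples, one instance per line with the class label in the last column:

sunny,warm,normal,strong,warm,same,yes
sunny,warm,high,strong,warm,same,yes
rainy,cold,high,strong,warm,change,no
sunny,warm,high,strong,cool,change,yes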
Source Code:
import csv

num_attributes = 6
a = []

print("\n The Given Training Data Set \n")
with open('data.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        a.append(row)
        print(row)

print("\n The initial value of hypothesis: ")
hypothesis = ['0'] * num_attributes
print(hypothesis)

# Initialize the hypothesis with the attribute values of the first instance
for j in range(0, num_attributes):
    hypothesis[j] = a[0][j]

print("\n Find S: Finding a Maximally Specific Hypothesis\n")
for i in range(0, len(a)):
    # Generalize the hypothesis only on positive ('yes') examples
    if a[i][num_attributes] == 'yes':
        for j in range(0, num_attributes):
            if a[i][j] != hypothesis[j]:
                hypothesis[j] = '?'
            else:
                hypothesis[j] = a[i][j]
print(" For Training instance No:{0} the hypothesis is ".format(i), hypothesis)

print("\n The Maximally Specific Hypothesis for a given Training Examples :\n")
print(hypothesis)
Output:
The Given Training Data Set
['sunny', 'warm', 'normal', 'strong', 'warm', 'same', 'yes']
['sunny', 'warm', 'high', 'strong', 'warm', 'same', 'yes']
['rainy', 'cold', 'high', 'strong', 'warm', 'change', 'no']
['sunny', 'warm', 'high', 'strong', 'cool', 'change', 'yes']
The initial value of hypothesis:
['0', '0', '0', '0', '0', '0']
Find S: Finding a Maximally Specific Hypothesis
For Training instance No:3 the hypothesis is ['sunny', 'warm', '?', 'strong', '?', '?']
The Maximally Specific Hypothesis for a given Training Examples :
['sunny', 'warm', '?', 'strong', '?', '?']
Program 5: Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0,1]. Perform the following based on the generated dataset.
a. Label the first 50 points {x1, …, x50} as follows: if (xi ≤ 0.5), then xi ∈ Class1, else xi ∈ Class2
b. Classify the remaining points x51, …, x100 using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30
Source Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
# Generate 100 random values in the range [0,1]
x = np.random.rand(100, 1)
# Label the first 50 points based on the given condition
labels = np.array([1 if xi <= 0.5 else 2 for xi in x[:50]])
# Prepare training and test sets
X_train, y_train = x[:50], labels # First 50 for training
X_test = x[50:] # Remaining 50 for classification
# Test for different values of k
k_values = [1, 2, 3, 4, 5, 20, 30]
plt.figure(figsize=(10, 6))
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    # Visualization of classification results
    plt.scatter(X_test, y_pred, label=f'k={k}', alpha=0.7)
# Mark training points for reference
plt.scatter(X_train, y_train, color='red', marker='x', label='Training Data')
plt.xlabel('X values')
plt.ylabel('Predicted Class')
plt.title('KNN Classification for Different k-values')
plt.legend()
plt.show()
# Print classification results for each k
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print(f'Predictions for k={k}:', y_pred)
Output:
Predictions for k=1: [1 2 2 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 2 1 1 1 2 1 2 2 1 2 1 2 1
1 1 1 1 1 2 1 1 1 1 2 2 1]
Predictions for k=2: [1 2 2 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 2 1 1 1 2 1 2 2 1 2 1 2 1
1 1 1 1 1 2 1 1 1 1 2 2 1]
Predictions for k=3: [1 2 2 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 2 1 1 1 2 1 2 2 1 2 1 2 1
1 1 1 2 1 2 1 1 1 1 2 2 1]
Predictions for k=4: [1 2 2 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 2 1 1 1 2 1 2 2 1 2 1 2 1
1 1 1 1 1 2 1 1 1 1 2 2 1]
Predictions for k=5: [1 2 2 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 2 1 1 1 2 1 2 2 1 2 1 2 1
1 1 1 1 1 2 1 1 1 1 2 2 1]
Predictions for k=20: [1 2 2 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 2 1 1 1 2 1 1 2 1 2 1 2 1
1 1 1 1 1 1 1 1 1 1 2 2 1]
Predictions for k=30: [1 2 2 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 2 1 1 1 2 1 1 2 1 2 1 2 1
1 1 1 1 1 1 1 1 1 1 2 2 1]
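Note: because the labelling rule (xi ≤ 0.5 → Class1) applies to the test points as well, their true classes are known and the predictions can optionally be scored. A minimal sketch, reusing the variables from the program above:

# True labels for the test points, derived from the same rule used for training
y_true = np.array([1 if xi <= 0.5 else 2 for xi in X_test])
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracy = np.mean(knn.predict(X_test) == y_true)  # fraction classified correctly
    print(f'Accuracy for k={k}: {accuracy:.2f}')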
Program 6: Implement the non-parametric Locally Weighted Regression algorithm in order
to fit data points. Select appropriate data set for your experiment and draw graphs.
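For a query point $x$, Locally Weighted Regression gives each training point $x^{(i)}$ the weight

$$w^{(i)} = \exp\!\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right),$$

collects the weights in a diagonal matrix $W$, and solves the weighted least-squares problem

$$\theta = (X^\top W X)^{-1} X^\top W y,$$

exactly as computed in the function below. The bandwidth $\tau$ controls how quickly the weights decay with distance: a small $\tau$ gives a flexible, wiggly fit, while a large $\tau$ approaches ordinary linear regression.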
Source Code:
import numpy as np
import matplotlib.pyplot as plt
# Generate synthetic dataset
np.random.seed(42)
X = np.linspace(0, 10, 100)
y = np.sin(X) + np.random.normal(0, 0.1, 100) # Sinusoidal data with noise
# Define Locally Weighted Regression function
def locally_weighted_regression(x_query, X, y, tau):
    m = X.shape[0]
    W = np.diag(np.exp(-((X[:, 1] - x_query[1]) ** 2) / (2 * tau ** 2)))  # Diagonal weight matrix
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y  # Compute theta
    return x_query @ theta
# Fit Locally Weighted Regression for different values of tau
tau_values = [0.1, 0.5, 1, 5]
X_ones = np.c_[np.ones(X.shape[0]), X] # Add bias term
plt.figure(figsize=(10, 6))
plt.scatter(X, y, label='Data', color='blue', alpha=0.5)
for tau in tau_values:
    y_pred = np.array([locally_weighted_regression(np.array([1, x_i]), X_ones, y, tau) for x_i in X])
    plt.plot(X, y_pred, label=f'tau={tau}')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Locally Weighted Regression with Different Bandwidths')
plt.legend()
plt.show()
Output: a plot of the noisy sinusoidal data overlaid with one fitted curve per tau value.
Program 7: Develop a program to demonstrate the working of Linear Regression and
Polynomial Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG
Dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.
Source Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
# The original Boston Housing dataset was removed from scikit-learn (v1.2),
# so the California Housing dataset is used as a stand-in for Linear Regression
housing = fetch_california_housing()
X_housing = housing.data[:, :2]  # Selecting first two features for simplicity
y_housing = housing.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X_housing, y_housing, test_size=0.2, random_state=42)
# Train Linear Regression Model
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
y_pred = linear_reg.predict(X_test)
# Evaluate Model
mse = mean_squared_error(y_test, y_pred)
print(f'Linear Regression MSE: {mse}')
# Plot Predictions vs Actual
plt.scatter(y_test, y_pred, color='blue', alpha=0.5)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Linear Regression: Actual vs Predicted Prices')
plt.show()
# Load Auto MPG Dataset for Polynomial Regression
auto_mpg = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv").dropna()
X_auto = auto_mpg[['horsepower']].values
y_auto = auto_mpg['mpg'].values
# Split data
X_train, X_test, y_train, y_test = train_test_split(X_auto, y_auto, test_size=0.2, random_state=42)
# Train Polynomial Regression Model
degree = 3 # Choosing a cubic polynomial model
poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
poly_model.fit(X_train, y_train)
y_pred_poly = poly_model.predict(X_test)
# Evaluate Model
mse_poly = mean_squared_error(y_test, y_pred_poly)
print(f'Polynomial Regression MSE: {mse_poly}')
# Plot Polynomial Regression Results
X_sorted = np.sort(X_test, axis=0)
y_sorted = poly_model.predict(X_sorted)
plt.scatter(X_test, y_test, color='blue', alpha=0.5, label='Actual')
plt.plot(X_sorted, y_sorted, color='red', label=f'Polynomial Degree {degree}')
plt.xlabel('Horsepower')
plt.ylabel('MPG')
plt.title('Polynomial Regression: Horsepower vs MPG')
plt.legend()
plt.show()
Output:
Linear Regression MSE: 0.6629874283048177
Polynomial Regression MSE: 18.460267222145088
Program 8: Develop a program to demonstrate the working of the decision tree algorithm.
Use Breast Cancer Data set for building the decision tree and apply this knowledge to
classify a new sample.
Source Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report
# Load Breast Cancer Dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Decision Tree Model
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)
# Predict on test data
y_pred = decision_tree.predict(X_test)
# Evaluate Model
accuracy = accuracy_score(y_test, y_pred)
print(f'Decision Tree Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))
# Classify a new sample
new_sample = np.array([X_test[0]]) # Using first test sample as an example
predicted_class = decision_tree.predict(new_sample)
print(f'Predicted class for new sample: {cancer.target_names[predicted_class[0]]}')
Output:
Decision Tree Accuracy: 0.9473684210526315
precision recall f1-score support
0 0.93 0.93 0.93 43
1 0.96 0.96 0.96 71
accuracy 0.95 114
macro avg 0.94 0.94 0.94 114
weighted avg 0.95 0.95 0.95 114
Predicted class for new sample: benign
Program 9: Develop a program to implement the Naive Bayesian classifier considering
Olivetti Face Data set for training. Compute the accuracy of the classifier, considering a few
test data sets.
Source Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import fetch_olivetti_faces
from sklearn.metrics import accuracy_score, classification_report
# Load Olivetti Faces Dataset
faces = fetch_olivetti_faces(shuffle=True, random_state=42)
X = faces.data
y = faces.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Naive Bayes Classifier
naive_bayes = GaussianNB()
naive_bayes.fit(X_train, y_train)
# Predict on test data
y_pred = naive_bayes.predict(X_test)
# Evaluate Model
accuracy = accuracy_score(y_test, y_pred)
print(f'Naive Bayes Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))
# Classify a new sample
new_sample = np.array([X_test[0]]) # Using first test sample as an example
predicted_class = naive_bayes.predict(new_sample)
print(f'Predicted class for new sample: {predicted_class[0]}')
Output:
Naive Bayes Accuracy: 0.775
precision recall f1-score support
0 1.00 1.00 1.00 2
1 1.00 1.00 1.00 1
2 0.33 1.00 0.50 1
3 0.00 0.00 0.00 3
4 1.00 0.50 0.67 4
5 1.00 1.00 1.00 2
7 1.00 1.00 1.00 3
8 1.00 0.67 0.80 3
9 0.50 1.00 0.67 2
10 1.00 1.00 1.00 1
11 1.00 1.00 1.00 1
12 0.50 0.67 0.57 3
13 1.00 0.50 0.67 2
14 0.00 0.00 0.00 4
15 1.00 1.00 1.00 1
16 0.67 1.00 0.80 2
17 1.00 1.00 1.00 2
18 1.00 1.00 1.00 3
19 0.40 1.00 0.57 2
20 1.00 1.00 1.00 3
21 1.00 0.50 0.67 2
22 1.00 0.40 0.57 5
23 1.00 0.50 0.67 2
24 1.00 1.00 1.00 1
25 0.67 1.00 0.80 2
26 1.00 1.00 1.00 1
27 1.00 1.00 1.00 4
28 0.00 0.00 0.00 0
29 1.00 1.00 1.00 2
30 1.00 1.00 1.00 1
31 1.00 0.67 0.80 3
32 1.00 1.00 1.00 1
34 0.00 0.00 0.00 0
35 1.00 1.00 1.00 2
36 1.00 1.00 1.00 2
38 1.00 1.00 1.00 3
39 0.57 1.00 0.73 4
accuracy 0.78 80
macro avg 0.80 0.79 0.77 80
weighted avg 0.82 0.78 0.76 80
Predicted class for new sample: 18
Program 10: Develop a program to implement k-means clustering using Wisconsin Breast
Cancer data set and visualize the clustering result.
Source Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
# Load Breast Cancer Dataset
cancer = load_breast_cancer()
X = cancer.data
# Apply K-Means Clustering
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_
# Reduce dimensions for visualization using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Scatter plot of the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', alpha=0.5)
plt.title('K-Means Clustering on Breast Cancer Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Cluster Label')
plt.show()
Output: a scatter plot of the two K-Means clusters projected onto the first two principal components.
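Since the dataset also ships with the true benign/malignant labels (cancer.target), the agreement between the discovered clusters and the actual classes can optionally be quantified. A minimal sketch, reusing the variables from the program above:

from sklearn.metrics import adjusted_rand_score
# The Adjusted Rand Index is invariant to how the cluster labels are numbered
print('Adjusted Rand Index:', adjusted_rand_score(cancer.target, labels))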
Viva Questions:
1. What is the difference between supervised and unsupervised learning?
2. What are the key assumptions of the Naive Bayes classifier?
3. How does the k-Nearest Neighbors (k-NN) algorithm work?
4. What is the curse of dimensionality, and how does PCA help mitigate it?
5. What is the significance of the correlation matrix in data analysis?
6. How does the Find-S algorithm work for hypothesis learning?
7. What is the difference between parametric and non-parametric regression?
8. Why is feature scaling important in machine learning?
9. How do you evaluate the performance of a clustering algorithm?
10. What is the difference between K-Means clustering and hierarchical clustering?
11. How does Locally Weighted Regression differ from traditional regression models?
12. How does k-NN classify a new data point?
13. What are the advantages and disadvantages of Decision Trees?
14. How does the Naive Bayes classifier handle continuous data?
15. What is the role of the Gaussian assumption in Naive Bayes?
16. What are the hyperparameters in K-Means clustering, and how do they affect results?
17. What is the role of eigenvalues and eigenvectors in PCA?
18. How does polynomial regression differ from linear regression?
19. Why do we use test-train splits in machine learning models?
20. What are some real-world applications of K-Means clustering?