ML Data Science Manual
1. Develop a program to create histograms for all numerical features and analyze the distribution of each feature. Generate box plots for all numerical features and identify any outliers. Use the California Housing dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset as a DataFrame
data = fetch_california_housing(as_frame=True)
housing_df = data.frame
numerical_features = housing_df.select_dtypes(include=[np.number]).columns

# Plot histograms
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.histplot(housing_df[feature], kde=True, bins=30)
    plt.title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()

# Plot box plots
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x=housing_df[feature], color='orange')
    plt.title(f'Box Plot of {feature}')
plt.tight_layout()
plt.show()

# Flag outliers with the 1.5 * IQR rule
print("Outliers Detection:")
outliers_summary = {}
for feature in numerical_features:
    Q1 = housing_df[feature].quantile(0.25)
    Q3 = housing_df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = housing_df[(housing_df[feature] < lower) | (housing_df[feature] > upper)]
    outliers_summary[feature] = len(outliers)
    print(f"{feature}: {len(outliers)} outliers")

print("\nDataset Summary:")
print(housing_df.describe())
OUTPUT:
Outliers Detection:
HouseAge: 0 outliers
Latitude: 0 outliers
Longitude: 0 outliers
Dataset Summary:
[8 rows x 9 columns]
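Note: the 1.5 * IQR fence used above is easiest to see on a toy example: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged. The array below is made up purely for illustration.

import numpy as np

vals = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 99])
q1, q3 = np.percentile(vals, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(lower, upper)                            # -0.375 6.625
print(vals[(vals < lower) | (vals > upper)])   # [99]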
2. Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to see which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use the California Housing dataset.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Step 1: Load the California Housing dataset
california_data = fetch_california_housing(as_frame=True)
df = california_data.frame
plt.figure(figsize=(10, 8))
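The listing breaks off after the figure is created. Below is a minimal sketch of the remaining heatmap and pair plot steps, assuming seaborn; the 500-row sample and the figure title are illustrative choices, not from the original.

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

df = fetch_california_housing(as_frame=True).frame

# Correlation matrix as an annotated heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of California Housing Features')
plt.show()

# Pair plot on a random sample (all 20,640 rows would be slow to draw)
sns.pairplot(df.sample(500, random_state=42), diag_kind='kde')
plt.show()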
OUTPUT: (correlation heatmap and pair plot figures)
3. Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality
of the Iris dataset from 4 features to 2.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the Iris dataset (150 samples, 4 features)
iris = load_iris()
data = iris.data
labels = iris.target
label_names = iris.target_names

# Reduce the 4 features to 2 principal components
pca = PCA(n_components=2)
data_reduced = pca.fit_transform(data)

# Convert to a DataFrame for better visualization
reduced_df = pd.DataFrame(data_reduced, columns=['Principal Component 1', 'Principal Component 2'])
reduced_df['Label'] = labels

plt.figure(figsize=(8, 6))
colors = ['r', 'g', 'b']
for i, label in enumerate(np.unique(labels)):
    subset = reduced_df[reduced_df['Label'] == label]
    plt.scatter(
        subset['Principal Component 1'],
        subset['Principal Component 2'],
        label=label_names[label],
        color=colors[i]
    )
plt.title('PCA on Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid()
plt.show()
OUTPUT: (2-D scatter plot of the three Iris classes in the principal-component plane)
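Note: one number worth printing alongside the plot is how much of the original variance the two components retain. A short check follows; the quoted figures are approximate, being the well-known result for unscaled Iris data.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2).fit(iris.data)
# Per-component share of the total variance
print(pca.explained_variance_ratio_)        # approx [0.925, 0.053]
print(pca.explained_variance_ratio_.sum())  # approx 0.978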
4. For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S
algorithm to output a description of the set of all hypotheses consistent with the training examples.
import pandas as pd

def find_s_algorithm(file_path):
    data = pd.read_csv(file_path)
    print("Training data:")
    print(data)

    attributes = data.columns[:-1]
    class_label = data.columns[-1]

    # Find-S: start from the first positive example (the most specific
    # hypothesis) and generalize an attribute to '?' only where a later
    # positive example disagrees
    hypothesis = None
    for _, row in data.iterrows():
        if row[class_label] == 'Yes':
            if hypothesis is None:
                hypothesis = list(row[attributes])
            else:
                for i, value in enumerate(row[attributes]):
                    if hypothesis[i] != value:
                        hypothesis[i] = '?'
    return hypothesis

file_path = 'training_data.csv'
hypothesis = find_s_algorithm(file_path)
print("\nThe final hypothesis is:", hypothesis)
OUTPUT:
Training data:
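The training table itself did not survive extraction. Find-S expects a CSV whose last column is the Yes/No class label; a hypothetical training_data.csv in the classic EnjoySport format would be:

Sky,AirTemp,Humidity,Wind,Water,Forecast,EnjoySport
Sunny,Warm,Normal,Strong,Warm,Same,Yes
Sunny,Warm,High,Strong,Warm,Same,Yes
Rainy,Cold,High,Strong,Warm,Change,No
Sunny,Warm,High,Strong,Cool,Change,Yes

For that file the program returns ['Sunny', 'Warm', '?', 'Strong', '?', '?'], since the three positive examples agree only on Sky, AirTemp, and Wind.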
5. Implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0, 1]. Perform the following:
a) Label the first 50 points {x1, ..., x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1, else xi ∊ Class2.
b) Classify the remaining points, x51, ..., x100, using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30.
PROGRAM:
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

# Generate 100 random values in [0, 1]
np.random.seed(42)
data = np.random.rand(100)

# Label the first 50 points: x <= 0.5 -> Class1, else Class2
labels = ["Class1" if x <= 0.5 else "Class2" for x in data[:50]]

def knn_classifier(train_data, train_labels, test_point, k):
    # 1-D Euclidean distance to every training point
    distances = [(abs(test_point - x), label) for x, label in zip(train_data, train_labels)]
    distances.sort(key=lambda x: x[0])
    k_nearest_neighbors = distances[:k]
    k_nearest_labels = [label for _, label in k_nearest_neighbors]
    # Majority vote among the k nearest neighbours
    return Counter(k_nearest_labels).most_common(1)[0][0]

train_data = data[:50]
train_labels = labels
test_data = data[50:]

print("--- k-Nearest Neighbors Classification ---")
print("Training dataset: First 50 points labeled based on the rule (x <= 0.5 -> Class1, x > 0.5 -> Class2)")

k_values = [1, 2, 3, 4, 5, 20, 30]
results = {}
for k in k_values:
    print(f"Results for k = {k}:")
    classified_labels = [knn_classifier(train_data, train_labels, x, k) for x in test_data]
    results[k] = classified_labels
    for i, label in enumerate(classified_labels, start=51):
        print(f"  Point x{i} (value: {data[i - 1]:.4f}) -> {label}")
    print("\n")
print("Classification complete.\n")

# One plot per k: training points at y = 0, classified test points at y = 1
for k in k_values:
    classified_labels = results[k]
    plt.figure(figsize=(10, 6))
    plt.scatter(train_data, [0] * 50, c=["blue" if l == "Class1" else "red" for l in train_labels], marker="o", label="Training data")
    plt.scatter(test_data, [1] * 50, c=["blue" if l == "Class1" else "red" for l in classified_labels], marker="x", label="Classified test data")
    plt.title(f"KNN Classification Results for k = {k}")
    plt.xlabel("Data Points")
    plt.ylabel("Classification Level")
    plt.legend()
    plt.grid(True)
    plt.show()
OUTPUT:
--- k-Nearest Neighbors Classification ---
Training dataset: First 50 points labeled based on the rule (x <= 0.5 -> Class1, x > 0.5 -> Class2)
Results for k = 1:
Results for k = 2:
Results for k = 3:
Results for k = 4:
Results for k = 5:
Results for k = 20:
Results for k = 30:
Classification complete.
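As a sanity check, scikit-learn's KNeighborsClassifier can be run on the same data and should agree with the hand-rolled classifier. A sketch, with the reshape that scikit-learn needs because it expects 2-D feature arrays:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

np.random.seed(42)
data = np.random.rand(100)
labels = ["Class1" if x <= 0.5 else "Class2" for x in data[:50]]

# scikit-learn wants shape (n_samples, n_features)
X_train = data[:50].reshape(-1, 1)
X_test = data[50:].reshape(-1, 1)
for k in [1, 2, 3, 4, 5, 20, 30]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, labels)
    print(k, knn.predict(X_test)[:5])  # first five predictions for each k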
6. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs.
PROGRAM:
import numpy as np
import matplotlib.pyplot as plt

def locally_weighted_regression(x_query, X, y, tau):
    # Gaussian kernel: points near the query get the largest weights
    weights = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(weights)
    X_transpose_W = X.T @ W
    # Closed-form weighted least squares: theta = (X^T W X)^-1 X^T W y
    theta = np.linalg.inv(X_transpose_W @ X) @ (X_transpose_W @ y)
    return x_query @ theta

# Synthetic data set: noisy sine curve
np.random.seed(42)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * np.random.randn(100)
X_bias = np.c_[np.ones(X.shape), X]

tau = 0.5
y_pred = np.array([locally_weighted_regression(x, X_bias, y, tau) for x in X_bias])

plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='red', alpha=0.7, label='Data')
plt.plot(X, y_pred, color='blue', linewidth=2, label=f'LWR fit (tau = {tau})')
plt.xlabel('X', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.legend()
plt.grid(alpha=0.3)
plt.show()
OUTPUT: (noisy sine data with the locally weighted fit overlaid)
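Note: the bandwidth tau is the knob that matters here; small values chase the noise, large values flatten the fit toward ordinary linear regression. A sketch comparing a few settings on the same synthetic data (the tau values are illustrative):

import numpy as np
import matplotlib.pyplot as plt

def lwr_predict(x_query, X, y, tau):
    # Same weighted least squares as above, for a scalar query point
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))
    Xb = np.c_[np.ones_like(X), X]
    theta = np.linalg.pinv(Xb.T @ np.diag(w) @ Xb) @ (Xb.T @ np.diag(w) @ y)
    return theta[0] + theta[1] * x_query

np.random.seed(42)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * np.random.randn(100)

plt.scatter(X, y, s=10, color='gray', alpha=0.5)
for tau in [0.1, 0.5, 2.0]:  # under-, well-, and over-smoothed
    plt.plot(X, [lwr_predict(x, X, y, tau) for x in X], label=f'tau = {tau}')
plt.legend()
plt.show()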
7. Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use the Boston Housing dataset for Linear Regression and the Auto MPG dataset (for vehicle fuel-efficiency prediction) for Polynomial Regression.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Linear Regression (the Boston dataset was removed from scikit-learn,
# so the California Housing data stands in for it here)
housing = fetch_california_housing(as_frame=True)
X = housing.data[["AveRooms"]]
y = housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lin_model = LinearRegression().fit(X_train, y_train)
y_pred = lin_model.predict(X_test)

plt.scatter(X_test, y_test, color="blue", label="Actual")
plt.scatter(X_test, y_pred, color="red", label="Predicted")
plt.xlabel("AveRooms")
plt.ylabel("Median house value")
plt.legend()
plt.show()
# The Polynomial Regression half (Auto MPG) is sketched below.
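The Auto MPG half of the original listing survives only as fused fragments (poly_model.fit(...), a scatter call, and an xlabel of "Displacement"). A minimal reconstruction follows, assuming seaborn's copy of the Auto MPG data (sns.load_dataset("mpg")) and a degree-2 pipeline; both are stand-ins, not the manual's original choices.

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

def polynomial_regression_auto_mpg(degree=2):
    # Auto MPG data; drop the rows with missing horsepower
    mpg = sns.load_dataset("mpg").dropna()
    X = mpg[["displacement"]]
    y = mpg["mpg"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Polynomial feature expansion followed by ordinary least squares
    poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    poly_model.fit(X_train, y_train)
    y_pred = poly_model.predict(X_test)

    plt.scatter(X_test, y_test, color="blue", label="Actual")
    plt.scatter(X_test, y_pred, color="red", label="Predicted")
    plt.xlabel("Displacement")
    plt.ylabel("MPG")
    plt.legend()
    plt.show()

polynomial_regression_auto_mpg()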
OUTPUT: (actual vs. predicted scatter plots for both regressions)
8. Develop a program to demonstrate the working of the decision tree algorithm. Use the Breast Cancer data set for building the decision tree and apply this knowledge to classify a new sample.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")

# Classify a new sample (here, the first test row)
new_sample = X_test[0].reshape(1, -1)
prediction = clf.predict(new_sample)
print("Predicted class:", data.target_names[prediction][0])

plt.figure(figsize=(12, 8))
plot_tree(clf, filled=True, feature_names=data.feature_names, class_names=data.target_names)
plt.show()
OUTPUT: (accuracy, predicted class for the new sample, and the plotted tree)
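Note: the plotted tree is hard to read at full depth; sklearn.tree.export_text prints the same rules as indented text. A short sketch (the max_depth=3 cap is an illustrative choice, not from the original):

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
clf = DecisionTreeClassifier(random_state=42, max_depth=3).fit(data.data, data.target)
# Human-readable dump of the learned decision rules
print(export_text(clf, feature_names=list(data.feature_names)))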
9. Develop a program to implement the Naive Bayesian classifier, considering the Olivetti Face data set for training. Compute the accuracy of the classifier, considering a few test data sets.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

data = fetch_olivetti_faces(shuffle=True, random_state=42)
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, zero_division=0))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Show some test faces with their predicted person ids
fig, axes = plt.subplots(3, 5, figsize=(12, 8))
for ax, image, pred in zip(axes.ravel(), X_test, y_pred):
    ax.imshow(image.reshape(64, 64), cmap='gray')
    ax.set_title(f"Pred: {pred}")
    ax.axis('off')
plt.show()
OUTPUT:
Accuracy: 80.83%
Confusion Matrix:
[[2 0 0 ... 0 0 0]
[0 2 0 ... 0 0 0]
[0 0 2 ... 0 0 1]
...
[0 0 0 ... 1 0 0]
[0 0 0 ... 0 3 0]
[0 0 0 ... 0 0 5]]
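With only ten images per person, accuracy from a single 70/30 split is a noisy estimate; cross-validation averages over several splits. A sketch using the same model (cv=5 is an illustrative choice):

from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

data = fetch_olivetti_faces(shuffle=True, random_state=42)
# Mean and spread of accuracy across 5 folds
scores = cross_val_score(GaussianNB(), data.data, data.target, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")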
10. Develop a program to implement k-means clustering using the Wisconsin Breast Cancer data set and visualize the clustering result.
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report

data = load_breast_cancer()
X = data.data
y = data.target

# Standardize features so no single scale dominates the distances
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# k-means with 2 clusters (malignant vs. benign)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(X_scaled)

print("Confusion Matrix:")
print(confusion_matrix(y, y_kmeans))
print("\nClassification Report:")
print(classification_report(y, y_kmeans))

# Project to 2-D with PCA for plotting
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df['Cluster'] = y_kmeans
df['True Label'] = y

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1', s=50)
plt.title('K-Means Clustering (PCA projection)')
plt.legend(title="Cluster")
plt.show()

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='True Label', palette='coolwarm', s=50)
plt.title('True Labels (PCA projection)')
plt.legend(title="True Label")
plt.show()

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1', s=50)
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, marker='X', label='Centroids')
plt.title('K-Means with Cluster Centroids')
plt.legend(title="Cluster")
plt.show()
OUTPUT:
Confusion Matrix:
[[175 37]
[ 13 344]]
Classification Report: (per-class precision/recall table not reproduced)
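One caveat when reading the matrix above: k-means assigns cluster ids arbitrarily, so cluster 0 is not guaranteed to correspond to label 0. A quick way to score the clustering regardless of id order (a sketch; the flip trick applies only to two clusters):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
y = data.target

y_km = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
# Try both possible id assignments and keep the better match
acc = max(np.mean(y_km == y), np.mean(1 - y_km == y))
print(f"Best-aligned accuracy: {acc:.3f}")

For the confusion matrix shown above this works out to (175 + 344) / 569, roughly 0.91.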