This document outlines various machine learning tasks implemented in Python, including data visualization techniques using the California Housing dataset, Principal Component Analysis (PCA) on the Iris dataset, and the Find-S algorithm for hypothesis generation. It also covers the k-Nearest Neighbors (k-NN) algorithm for classification, Locally Weighted Regression for fitting data points, and both Linear and Polynomial Regression using the Boston Housing and Auto MPG datasets. Each section includes code snippets and explanations of the methods used to analyze and visualize the data.
Machine Learning Lab (BCSL606)

1. Develop a program to create histograms for all numerical features and analyze the distribution of each feature. Generate box plots for all numerical features and identify any outliers. Use the California Housing dataset.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Step 1: Load the California Housing dataset
data = fetch_california_housing(as_frame=True)
housing_df = data.frame

# Step 2: Create histograms for numerical features
numerical_features = housing_df.select_dtypes(include=[np.number]).columns

# Plot histograms
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.histplot(housing_df[feature], kde=True, bins=30, color='blue')
    plt.title(f"Distribution of {feature}")
plt.tight_layout()
plt.show()

# Step 3: Generate box plots for numerical features
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x=housing_df[feature], color='orange')
    plt.title(f"Box Plot of {feature}")
plt.tight_layout()
plt.show()

# Step 4: Identify outliers using the IQR method
print("Outliers Detection:")
outliers_summary = {}
for feature in numerical_features:
    Q1 = housing_df[feature].quantile(0.25)
    Q3 = housing_df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = housing_df[(housing_df[feature] < lower_bound) | (housing_df[feature] > upper_bound)]
    outliers_summary[feature] = len(outliers)
    print(f"{feature}: {len(outliers)} outliers")

# Optional: Print a summary of the dataset
print("\nDataset Summary:")
print(housing_df.describe())

2. Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use the California Housing dataset.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Step 1: Load the California Housing dataset
california_data = fetch_california_housing(as_frame=True)
data = california_data.frame

# Step 2: Compute the correlation matrix
correlation_matrix = data.corr()

# Step 3: Visualize the correlation matrix using a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix of California Housing Features')
plt.show()

# Step 4: Create a pair plot to visualize pairwise relationships
sns.pairplot(data, diag_kind='kde', plot_kws={'alpha': 0.5})
plt.suptitle('Pair Plot of California Housing Features', y=1.02)
plt.show()

Explanation:
This script performs an exploratory data analysis on the California Housing dataset using Pandas, Seaborn, and Matplotlib. First, it loads the dataset, which contains various housing-related features, and converts it into a DataFrame. Then, it calculates the correlation matrix to measure relationships between numerical features. To visualize these relationships, a heatmap is created using Seaborn, displaying correlations with color gradients and numerical values for better interpretability. Additionally, a pair plot is generated to illustrate pairwise relationships between features, using scatter plots for bivariate distributions and KDE plots for univariate distributions. These visualizations help in understanding feature dependencies and potential patterns in the dataset.
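For reference, the values shown in the heatmap are Pearson correlation coefficients, which is what DataFrame.corr() computes by default. A minimal statement of the quantity being plotted (standard definition, not taken from the lab sheet):

\[
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]

Values close to +1 indicate a strong positive linear relationship, values close to -1 a strong negative one, and values near 0 little linear association.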
3. Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = load_iris()
data = iris.data
labels = iris.target
label_names = iris.target_names

# Convert to a DataFrame for better visualization
iris_df = pd.DataFrame(data, columns=iris.feature_names)

# Perform PCA to reduce dimensionality to 2
pca = PCA(n_components=2)
data_reduced = pca.fit_transform(data)

# Create a DataFrame for the reduced data
reduced_df = pd.DataFrame(data_reduced, columns=['Principal Component 1', 'Principal Component 2'])
reduced_df['Label'] = labels

# Plot the reduced data
plt.figure(figsize=(8, 6))
colors = ['r', 'g', 'b']
for i, label in enumerate(np.unique(labels)):
    plt.scatter(
        reduced_df[reduced_df['Label'] == label]['Principal Component 1'],
        reduced_df[reduced_df['Label'] == label]['Principal Component 2'],
        label=label_names[label],
        color=colors[i]
    )
plt.title('PCA on Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid()
plt.show()

Explanation:
This Python script demonstrates how to apply Principal Component Analysis (PCA) on the Iris dataset to reduce its dimensionality from four features to two, making it easier to visualize. The script first loads the Iris dataset using sklearn.datasets.load_iris(), extracts the feature data and labels, and converts them to a Pandas DataFrame for better readability. It then applies PCA with n_components=2 to transform the dataset into a two-dimensional space. The transformed data is stored in another DataFrame along with the corresponding class labels. Finally, the script uses Matplotlib to create a scatter plot in which the three flower species are shown in distinct colors. This visualization helps in understanding how well PCA separates the species in the lower-dimensional space.

4. For a given set of training data examples stored in a CSV file, implement and demonstrate the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples.

import pandas as pd

def find_s_algorithm(file_path):
    data = pd.read_csv(file_path)
    print("Training data:")
    print(data)

    attributes = data.columns[:-1]
    class_label = data.columns[-1]

    # Most specific starting point: no positive example seen yet
    hypothesis = None

    for _, row in data.iterrows():
        if row[class_label] == 'Yes':
            if hypothesis is None:
                # First positive example: copy its attribute values directly
                hypothesis = list(row[attributes])
            else:
                # Later positive examples: generalize every attribute that disagrees
                for i, value in enumerate(row[attributes]):
                    if hypothesis[i] != value:
                        hypothesis[i] = '?'
    return hypothesis

file_path = 'C:\\Users\\Admin\\Downloads\\training_data.csv'
hypothesis = find_s_algorithm(file_path)
print("\nThe final hypothesis is:", hypothesis)

Explanation:
The given Python script implements the Find-S algorithm using the pandas library for reading and processing a CSV dataset. Find-S is a simple machine learning approach used to find the most specific hypothesis that fits all positive examples in a dataset. The script reads a CSV file containing training data, where the last column represents the class label (e.g., "Yes" or "No"). The hypothesis is seeded from the first positive example ('Yes' in the class label) by copying its attribute values; every later positive example then generalizes to '?' any attribute whose value disagrees with the current hypothesis, while matching attributes are left unchanged and negative examples are ignored. The final hypothesis is returned and printed, representing the most specific description that fits all positive examples.
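As a concrete illustration of how the hypothesis evolves, the snippet below writes a small hypothetical dataset to a CSV file and runs the same function on it. The rows are invented for illustration only (they are not the contents of training_data.csv), and the snippet assumes find_s_algorithm from the program above is already defined.

import pandas as pd

# Hypothetical training rows (not the contents of training_data.csv)
sample = pd.DataFrame({
    'Sky':        ['Sunny',  'Sunny',  'Rainy',  'Sunny'],
    'AirTemp':    ['Warm',   'Warm',   'Cold',   'Warm'],
    'Humidity':   ['Normal', 'High',   'High',   'High'],
    'Wind':       ['Strong', 'Strong', 'Strong', 'Strong'],
    'PlayTennis': ['Yes',    'Yes',    'No',     'Yes'],
})
sample.to_csv('sample_training_data.csv', index=False)

# First positive row  -> ['Sunny', 'Warm', 'Normal', 'Strong']
# Second positive row differs only in Humidity -> ['Sunny', 'Warm', '?', 'Strong']
# Negative row is ignored; the last positive row changes nothing further.
print(find_s_algorithm('sample_training_data.csv'))   # ['Sunny', 'Warm', '?', 'Strong']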
5. Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0, 1]. Perform the following based on the dataset generated:
a. Label the first 50 points {x1, ..., x50} as follows: if (xi <= 0.5), then xi belongs to Class1, else xi belongs to Class2.
b. Classify the remaining points x51, ..., x100 using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30.

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

data = np.random.rand(100)
labels = ["Class1" if x <= 0.5 else "Class2" for x in data[:50]]

def euclidean_distance(x1, x2):
    return abs(x1 - x2)

def knn_classifier(train_data, train_labels, test_point, k):
    distances = [(euclidean_distance(test_point, train_data[i]), train_labels[i])
                 for i in range(len(train_data))]
    distances.sort(key=lambda x: x[0])
    k_nearest_neighbors = distances[:k]
    k_nearest_labels = [label for _, label in k_nearest_neighbors]
    return Counter(k_nearest_labels).most_common(1)[0][0]

train_data = data[:50]
train_labels = labels
test_data = data[50:]

k_values = [1, 2, 3, 4, 5, 20, 30]

print("--- k-Nearest Neighbors Classification ---")
print("Training dataset: First 50 points labeled based on the rule (x <= 0.5 -> Class1, x > 0.5 -> Class2)")
print("Testing dataset: Remaining 50 points to be classified\n")

results = {}

for k in k_values:
    print(f"Results for k = {k}:")
    classified_labels = [knn_classifier(train_data, train_labels, test_point, k)
                         for test_point in test_data]
    results[k] = classified_labels
    for i, label in enumerate(classified_labels, start=51):
        print(f"Point x{i} (value: {test_data[i - 51]:.4f}) is classified as {label}")
    print("\n")

print("Classification complete.\n")

for k in k_values:
    classified_labels = results[k]
    class1_points = [test_data[i] for i in range(len(test_data)) if classified_labels[i] == "Class1"]
    class2_points = [test_data[i] for i in range(len(test_data)) if classified_labels[i] == "Class2"]

    plt.figure(figsize=(10, 6))
    plt.scatter(train_data, [0] * len(train_data),
                c=["blue" if label == "Class1" else "red" for label in train_labels],
                label="Training Data", marker="o")
    plt.scatter(class1_points, [1] * len(class1_points), c="blue", label="Class1 (test)", marker="x")
    plt.scatter(class2_points, [1] * len(class2_points), c="red", label="Class2 (test)", marker="x")
    plt.title(f"k-NN Classification Results for k = {k}")
    plt.xlabel("Data Points")
    plt.ylabel("Classification Level")
    plt.legend()
    plt.grid(True)
    plt.show()

Explanation:
This Python script implements a simple k-Nearest Neighbors (k-NN) classifier using Euclidean distance on a randomly generated one-dimensional dataset. It starts by creating 100 random values between 0 and 1, labeling the first 50 points as "Class1" if x <= 0.5 and "Class2" otherwise. The script defines a function to calculate the Euclidean distance and another to classify test points based on their nearest neighbors. It then iterates over the different k-values (1, 2, 3, 4, 5, 20, 30), classifying the remaining 50 data points. The classification results are printed and visualized using Matplotlib, where training points are plotted as circles and test points as Class1 (blue crosses) or Class2 (red crosses). The visualizations help analyze how different k-values affect classification performance.
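If scikit-learn is available, the handwritten classifier can be sanity-checked against its built-in KNeighborsClassifier on the same one-dimensional data. This is an optional cross-check sketch, not part of the lab sheet; it assumes train_data, train_labels, test_data, and knn_classifier from the program above are still in scope.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

for k in [1, 3, 5]:
    sk_knn = KNeighborsClassifier(n_neighbors=k)
    # scikit-learn expects 2-D feature arrays, hence the reshape
    sk_knn.fit(train_data.reshape(-1, 1), train_labels)
    sk_pred = sk_knn.predict(test_data.reshape(-1, 1))
    own_pred = np.array([knn_classifier(train_data, train_labels, p, k) for p in test_data])
    print(f"k={k}: agreement with sklearn = {np.mean(sk_pred == own_pred):.2%}")

Odd values of k are used here so that majority voting has no ties; with ties (even k) the two implementations may legitimately disagree on a few points.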
6. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs.

import numpy as np
import matplotlib.pyplot as plt

def gaussian_kernel(x, xi, tau):
    return np.exp(-np.sum((x - xi) ** 2) / (2 * tau ** 2))

def locally_weighted_regression(x, X, y, tau):
    m = X.shape[0]
    weights = np.array([gaussian_kernel(x, X[i], tau) for i in range(m)])
    W = np.diag(weights)
    X_transpose_W = X.T @ W
    theta = np.linalg.inv(X_transpose_W @ X) @ X_transpose_W @ y
    return x @ theta

np.random.seed(42)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * np.random.randn(100)
X_bias = np.c_[np.ones(X.shape), X]

x_test = np.linspace(0, 2 * np.pi, 200)
x_test_bias = np.c_[np.ones(x_test.shape), x_test]

tau = 0.5
y_pred = np.array([locally_weighted_regression(xi, X_bias, y, tau) for xi in x_test_bias])

plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='red', label='Training Data', alpha=0.7)
plt.plot(x_test, y_pred, color='blue', label=f'LWR Fit (tau={tau})', linewidth=2)
plt.xlabel('X', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Locally Weighted Regression', fontsize=14)
plt.legend(fontsize=10)
plt.grid(alpha=0.3)
plt.show()

Explanation:
This program implements Locally Weighted Regression (LWR) using a Gaussian kernel to assign weights to training points based on their distance from a given query point. The function gaussian_kernel computes the weight of each training point relative to the query point using a Gaussian function with bandwidth tau. The locally_weighted_regression function then performs weighted linear regression by computing the weighted least squares estimate of the regression coefficients. The dataset consists of noisy sine wave samples, and LWR is applied to estimate the function's trend. The predictions are visualized, showing the fitted curve along with the training data and demonstrating how LWR captures local patterns in the data.
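In symbols, the two functions above implement the standard LWR equations (stated here for reference, with the bias column included in X): for a query point x, each training point x_i receives the weight

\[
w_i = \exp\!\left(-\frac{\lVert x - x_i \rVert^{2}}{2\tau^{2}}\right),
\qquad
\hat{\theta}(x) = \left(X^{\top} W X\right)^{-1} X^{\top} W y,
\qquad
\hat{y}(x) = x^{\top}\hat{\theta}(x),
\]

where W = diag(w_1, ..., w_m). Smaller values of the bandwidth tau make the fit more local (and wigglier), while larger values approach ordinary linear regression.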
7. Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use the Boston Housing Dataset for Linear Regression and the Auto MPG Dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score

def linear_regression_california():
    housing = fetch_california_housing(as_frame=True)
    X = housing.data[["AveRooms"]]
    y = housing.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    plt.scatter(X_test, y_test, color="blue", label="Actual")
    plt.plot(X_test, y_pred, color="red", label="Predicted")
    plt.xlabel("Average number of rooms (AveRooms)")
    plt.ylabel("Median value of homes ($100,000)")
    plt.title("Linear Regression - California Housing Dataset")
    plt.legend()
    plt.show()

    print("Linear Regression - California Housing Dataset")
    print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
    print("R^2 Score:", r2_score(y_test, y_pred))

def polynomial_regression_auto_mpg():
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
    column_names = ["mpg", "cylinders", "displacement", "horsepower", "weight",
                    "acceleration", "model_year", "origin"]
    data = pd.read_csv(url, sep=r'\s+', names=column_names, na_values="?")
    data = data.dropna()

    X = data["displacement"].values.reshape(-1, 1)
    y = data["mpg"].values

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    poly_model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LinearRegression())
    poly_model.fit(X_train, y_train)
    y_pred = poly_model.predict(X_test)

    plt.scatter(X_test, y_test, color="blue", label="Actual")
    plt.scatter(X_test, y_pred, color="red", label="Predicted")
    plt.xlabel("Displacement")
    plt.ylabel("Miles per gallon (mpg)")
    plt.title("Polynomial Regression - Auto MPG Dataset")
    plt.legend()
    plt.show()

    print("Polynomial Regression - Auto MPG Dataset")
    print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
    print("R^2 Score:", r2_score(y_test, y_pred))

if __name__ == "__main__":
    print("Demonstrating Linear Regression and Polynomial Regression\n")
    linear_regression_california()
    polynomial_regression_auto_mpg()

Explanation:
The provided code demonstrates linear and polynomial regression using two datasets. First, it performs linear regression on the California Housing dataset, predicting the median home value from the average number of rooms. The data is split into training and test sets, a LinearRegression model is trained and evaluated, and a scatter plot compares actual and predicted values along with the Mean Squared Error (MSE) and R^2 score. In the second part, polynomial regression is applied to the Auto MPG dataset, predicting fuel efficiency (mpg) from engine displacement. A pipeline is used, consisting of polynomial feature transformation, standard scaling, and linear regression. The model's performance is visualized similarly, and the same metrics are printed. Both visualizations help in understanding model fit and performance.
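For reference, the two metrics reported by mean_squared_error and r2_score are (standard definitions, not part of the lab sheet):

\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2,
\qquad
R^{2} = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}.
\]

Lower MSE is better; an R^2 close to 1 means the model explains most of the variance in the target, while values near 0 (or negative on test data) indicate a poor fit.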
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

new_sample = np.array([X_test[0]])
prediction = clf.predict(new_sample)
prediction_class = "Benign" if prediction[0] == 1 else "Malignant"
print(f"Predicted Class for the new sample: {prediction_class}")

plt.figure(figsize=(12, 8))
tree.plot_tree(clf, filled=True, feature_names=data.feature_names, class_names=data.target_names)
plt.title("Decision Tree - Breast Cancer Dataset")
plt.show()

Explanation:
This Python script implements a Decision Tree Classifier to classify breast cancer tumors as malignant or benign using the Breast Cancer dataset from sklearn.datasets. It begins by importing the necessary libraries, then loads the dataset and separates it into features (X) and target labels (y). The dataset is split into training (80%) and testing (20%) subsets using train_test_split. A DecisionTreeClassifier model is created and trained on the training data. After training, predictions are made on the test set, and the model's accuracy is evaluated using accuracy_score. The script then predicts the class of a new sample (the first test sample) and prints whether it is malignant or benign. Finally, it visualizes the decision tree using matplotlib and tree.plot_tree, showing how the classifier makes decisions based on feature splits.

9. Develop a program to implement the Naive Bayesian classifier, considering the Olivetti Face data set for training. Compute the accuracy of the classifier, considering a few test data sets.

import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt

data = fetch_olivetti_faces(shuffle=True, random_state=42)
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, zero_division=1))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

cross_val_accuracy = cross_val_score(gnb, X, y, cv=5, scoring='accuracy')
print(f"\nCross-validation accuracy: {cross_val_accuracy.mean() * 100:.2f}%")

fig, axes = plt.subplots(3, 5, figsize=(12, 8))
for ax, image, label, prediction in zip(axes.ravel(), X_test, y_test, y_pred):
    ax.imshow(image.reshape(64, 64), cmap=plt.cm.gray)
    ax.set_title(f"True: {label}, Pred: {prediction}")
    ax.axis('off')
plt.show()

Explanation:
This Python script uses machine learning to classify human face images from the Olivetti Faces dataset. It begins by importing the necessary libraries (NumPy, scikit-learn, and Matplotlib). The dataset is loaded using fetch_olivetti_faces(), which provides grayscale 64x64 face images and the corresponding labels. The data is split into training and testing sets using train_test_split(). A Gaussian Naive Bayes (GNB) classifier is then trained on the training set and used to predict labels for the test set. The model's performance is evaluated using accuracy_score, classification_report, and confusion_matrix. Additionally, cross-validation is performed to assess generalization. Finally, a visualization is created using Matplotlib, displaying a subset of test images with their true and predicted labels.
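For reference, GaussianNB applies Bayes' rule with a Gaussian likelihood for each pixel feature (a standard description of the model, not part of the lab sheet). It predicts the class

\[
\hat{c} = \arg\max_{c}\; P(c)\,\prod_{j=1}^{d} \mathcal{N}\!\left(x_j \mid \mu_{c,j},\, \sigma_{c,j}^{2}\right),
\]

where the per-class means and variances are estimated from the training images, and the product over features reflects the "naive" assumption that pixel values are conditionally independent given the class.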
10. Develop a program to implement k-means clustering using the Wisconsin Breast Cancer data set and visualize the clustering result.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report

data = load_breast_cancer()
X = data.data
y = data.target

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=2, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)

print("Confusion Matrix:")
print(confusion_matrix(y, y_kmeans))
print("\nClassification Report:")
print(classification_report(y, y_kmeans))

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df['Cluster'] = y_kmeans
df['True Label'] = y

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1', s=100,
                edgecolor='black', alpha=0.7)
plt.title('K-Means Clustering of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='True Label', palette='coolwarm', s=100,
                edgecolor='black', alpha=0.7)
plt.title('True Labels of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="True Label")
plt.show()

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1', s=100,
                edgecolor='black', alpha=0.7)
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], s=200, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering with Centroids')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()

Explanation:
This code performs K-Means clustering on the Breast Cancer dataset from sklearn. It first loads the dataset and standardizes the features using StandardScaler so that all features carry equal weight. Then it applies K-Means clustering with two clusters, since the dataset has two target classes (malignant and benign). The clustering results are evaluated against the true labels using a confusion matrix and a classification report. To visualize the data, Principal Component Analysis (PCA) is used to reduce the dimensions to two principal components, making them easier to plot. The results are displayed using scatter plots comparing predicted clusters and true labels. Finally, the cluster centroids are plotted on the PCA-transformed data, helping to show the separation of the clusters.
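In symbols, KMeans(n_clusters=2) searches for centroids that minimize the within-cluster sum of squared distances (the standard objective, stated here for reference, not part of the lab sheet):

\[
\min_{C_1,\dots,C_k,\;\mu_1,\dots,\mu_k}\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\]

where \mu_j is the centroid of cluster C_j. Note that the cluster indices 0/1 returned by fit_predict are arbitrary, so low scores in the classification report may simply mean the cluster labels are swapped relative to the true labels.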
