Machine Learning
Submitted in
Department of Computer Science & Engineering
Enroll. No.: 01827202721
Class: B.Tech (CSE-A)
Aim: Introduction to the Jupyter IDE and the Python libraries Pandas and NumPy
Introduction to Jupyter IDE:
Jupyter Notebook is an open-source, web-based interactive computing platform that allows users
to create and share documents containing live code, equations, visualizations, and narrative text.
It supports a variety of programming languages, including Python, R, and Julia.
Introduction to Pandas:
Pandas is a Python library used for data manipulation and analysis. It offers data structures like
Series (1D data) and DataFrame (2D data) to handle labeled data efficiently. Pandas is commonly
used for tasks such as:
• Data cleaning
• Data transformation
• Aggregating and grouping data
• Merging and joining datasets
• Handling missing data
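A small illustrative sketch of a few of these tasks (the DataFrame contents here are made-up example values):
import pandas as pd
# Create a small DataFrame with one missing value
df = pd.DataFrame({
    'name': ['Asha', 'Ravi', 'Meena', 'Kiran'],
    'dept': ['CSE', 'IT', 'CSE', 'IT'],
    'marks': [85, 78, None, 90]
})
# Data cleaning: fill the missing mark with the column mean
df['marks'] = df['marks'].fillna(df['marks'].mean())
# Aggregating and grouping: average marks per department
print(df.groupby('dept')['marks'].mean())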
Introduction to NumPy:
NumPy (Numerical Python) is a fundamental package for scientific computing in Python. It
provides support for arrays (multidimensional data), mathematical functions, and linear algebra
operations.
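A brief illustrative sketch of these capabilities:
import numpy as np
# Multidimensional array and an element-wise mathematical function
a = np.array([[1, 2], [3, 4]])
print(np.sqrt(a))
# Linear algebra: matrix product and matrix inverse
b = np.array([[5, 6], [7, 8]])
print(a @ b)
print(np.linalg.inv(a))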
Simple Linear Regression is a fundamental type of regression used to predict the value of a
dependent variable (y) based on the value of an independent variable (x). The relationship
between the two variables is assumed to be linear, meaning the change in y is proportional to
the change in x.
The goal of simple linear regression is to find the best-fitting straight line through the data points,
known as the regression line, that minimizes the differences (errors) between the observed values
and the predicted values.
1. Linear Relationship:
o Simple linear regression assumes a linear relationship between the dependent and
independent variables. The model predicts y based on a straight-line equation:
y = mx + c, where:
▪ y is the predicted dependent variable,
▪ x is the independent variable,
▪ m (slope) represents how much y changes when x increases by 1 unit,
▪ c (intercept) is the value of y when x is 0.
2. Best Fit Line:
o The best fit line, or regression line, is the line that minimizes the sum of squared
differences between the observed data points and the predicted points (errors). This
technique is known as Ordinary Least Squares (OLS).
3. Slope and Intercept:
o The slope (m) represents the rate of change of the dependent variable with respect
to the independent variable. A positive slope means y increases as x increases,
while a negative slope means y decreases as x increases.
o The intercept (c) is the point where the regression line crosses the y-axis (i.e., the
value of y when x is 0).
4. Residuals (Errors):
o Residuals are the differences between the actual values of y and the predicted
values of y by the model. These are the vertical distances from the data points to
the regression line.
o The goal of linear regression is to minimize these residuals to ensure that the
predicted values are as close as possible to the actual values.
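For reference, the OLS slope and intercept minimized in this way have closed-form solutions: m = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and c = ȳ − m·x̄, where x̄ and ȳ are the means of x and y; the code below computes exactly these quantities.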
Code:
import numpy as np
import matplotlib.pyplot as plt

def simple_linear_regression(X, y):
    # Ordinary least squares estimates of the slope (m) and intercept (b)
    x_mean, y_mean = np.mean(X), np.mean(y)
    m = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
    b = y_mean - m * x_mean
    return m, b

X = np.array([1, 2, 3, 4, 5])
y = np.array([1, 2, 1.5, 3.5, 2.5])
m, b = simple_linear_regression(X, y)
y_pred = m * X + b
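Since matplotlib is imported above, a short plotting sketch (using the X, y and y_pred computed above) can visualize the fitted line:
# Plot the observed points and the fitted regression line
plt.scatter(X, y, label='Observed data')
plt.plot(X, y_pred, color='red', label='Regression line')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()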
Output
Experiment-3
1. Data Preprocessing:
o Load a dataset that contains the features and target variable (labels) for
classification.
o Handle missing data, if any, and encode categorical variables into numeric form if
necessary.
o Split the data into training and testing sets to evaluate the model performance.
2. Model Training:
o Import and use the logistic regression algorithm from a machine learning library
(like Scikit-learn).
o Fit the logistic regression model to the training data, where it will learn the
relationship between the features and the target variable.
3. Prediction:
o Use the trained model to predict the target values (classes) for the testing data.
o The model will output probabilities for each data point, and those probabilities will
be used to classify the points into one of the two classes.
4. Model Evaluation:
o Evaluate the model’s performance on the testing data using appropriate metrics like
accuracy, precision, recall, F1-score, and confusion matrix.
o Optionally, plot the ROC curve and calculate the AUC to assess the quality of the
probability predictions.
5. Interpreting Results:
o Analyze the performance of the model to determine how well it classifies the data.
You can also interpret the learned coefficients to understand the relationship
between features and the outcome.
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Iris dataset and build a binary target (Setosa vs. not Setosa)
iris = datasets.load_iris()
X = iris.data[:, :2]
y = (iris.target == 0).astype(int)

# Split the data, then train and evaluate the logistic regression model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
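To cover the other evaluation metrics listed in step 4, a short follow-up sketch (assuming the y_test and y_pred arrays from the code above) can use scikit-learn's confusion_matrix and classification_report:
from sklearn.metrics import confusion_matrix, classification_report
# Confusion matrix and per-class precision, recall and F1-score on the test set
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))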
Decision Tree is a supervised learning algorithm used for both classification and regression tasks.
The ID3 (Iterative Dichotomiser 3) algorithm is a popular decision tree algorithm used for
classification problems. It builds a decision tree based on the concept of information gain and
selects the attribute that maximizes the gain to split the data at each step.
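The sketch below trains such a tree, assuming scikit-learn's DecisionTreeClassifier with criterion='entropy' (an information-gain splitting rule in the spirit of ID3, not an exact ID3 implementation) applied to the Iris dataset; the variable names clf and accuracy are illustrative.
Code:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Load the Iris dataset and split it into training and testing sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=1)
# Train a decision tree that splits on information gain (entropy) and compute test accuracy
clf = DecisionTreeClassifier(criterion='entropy', random_state=1)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)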
# Print accuracy
print(f"Accuracy: {accuracy * 100:.2f}%")
The K-Nearest Neighbors (KNN) algorithm is a simple and effective classification method that
can be used to classify flowers (or any data) based on their features. For example, we can classify
flowers into species based on features like petal length, petal width, sepal length, and sepal width.
Algorithm:
Example:
1. Training Data:
o Features: Sepal Length, Sepal Width, Petal Length, Petal Width.
o Labels: Flower species like Iris Setosa, Iris Versicolor, and Iris Virginica.
2. New Flower Data:
o Given features: Sepal Length = 5.1, Sepal Width = 3.5, Petal Length = 1.4, Petal Width =
0.2.
o Predict its species using KNN.
3. Steps:
o Compute the Euclidean distance between the new flower and all flowers in the training
dataset.
o Choose k = 3 (for example) and find the 3 nearest neighbors.
o Perform majority voting on the species of the 3 nearest neighbors.
o Assign the species with the majority vote to the new flower.
Code:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Load the Iris dataset and split it into training and testing sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=1)
# Train the KNN classifier with k = 3 and compute test accuracy
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
# Print accuracy
print(f"Accuracy: {accuracy * 100:.2f}%")
Naive Assumption:
• Naive Bayes assumes that all features are independent of each other. This means that the
presence of one feature does not influence the presence of any other feature, which
simplifies the calculation of the likelihood P(X|C).
• Despite this strong assumption, Naive Bayes performs well in practice, especially for high-
dimensional datasets (e.g., text data).
• There are several variations of the Naive Bayes classifier depending on the type of data:
o Gaussian Naive Bayes: Assumes that the continuous features follow a normal
(Gaussian) distribution.
o Multinomial Naive Bayes: Used for discrete features, typically for text data where
the features represent word counts or term frequencies.
o Bernoulli Naive Bayes: Used for binary/boolean data, where features represent
binary outcomes (e.g., presence or absence of a word in a document).
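Under the naive assumption, the class posterior factorizes as P(C|X) ∝ P(C) · P(x1|C) · P(x2|C) · … · P(xn|C). As a small illustration of the Multinomial variant on text data, the sketch below (using made-up example messages; all names are illustrative) classifies short messages as spam or not spam with scikit-learn's CountVectorizer and MultinomialNB:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Toy training messages and labels (1 = spam, 0 = not spam)
messages = ["win a free prize now", "meeting at noon tomorrow",
            "free prize waiting for you", "project report attached"]
labels = [1, 0, 1, 0]
# Convert the text to word-count features and train the classifier
vectorizer = CountVectorizer()
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(messages), labels)
# Classify a new message
print(clf.predict(vectorizer.transform(["claim your free prize"])))
Applications: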
1. Spam Filtering:
o Naive Bayes is commonly used in email spam filters to classify emails as spam or
not spam based on the frequency of words in the email body.
2. Text Classification:
o Used in sentiment analysis, document classification, and news article
categorization by analyzing the frequency of words in a document.
3. Medical Diagnosis:
o Applied in medical diagnosis to predict diseases based on symptoms, using the
probabilities of various conditions given the symptoms.
4. Sentiment Analysis:
o In social media analysis, Naive Bayes is used to classify reviews, tweets, or
comments as positive, negative, or neutral based on word occurrences.
Code:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
import pandas as pd
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)
y_pred = nb_classifier.predict(X_test)
accuracy = nb_classifier.score(X_test, y_test)
summary_df = pd.DataFrame({
    'Sample Index': range(len(y_test)),
    'Actual Class': y_test,
    'Predicted Class': y_pred
})
# Filter predictions for classes 0 and 1
filtered_summary = summary_df[summary_df['Predicted Class'].isin([0, 1])]
# Print accuracy
print(f"Accuracy: {accuracy * 100:.2f}%")
# Display the summary DataFrame with predictions
print("\nSummary of Predictions (Actual vs Predicted):")
print(filtered_summary)
Output:
Experiment-7
Both PCA and LDA are dimensionality reduction techniques but are used for different purposes.
Let’s go through their concepts briefly and then see how to apply both on the Iris dataset.
• PCA is an unsupervised technique that projects the data onto new axes (principal components) capturing the maximum variance, without using class labels.
• LDA is a supervised technique that finds a projection maximizing the separation between multiple classes.
• It focuses on maximizing the difference between class means while minimizing the
variance within each class.
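A minimal sketch of applying both techniques to the Iris dataset (assuming scikit-learn's PCA and LinearDiscriminantAnalysis; the variable names are illustrative):
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# PCA: unsupervised projection onto the 2 directions of maximum variance
X_pca = PCA(n_components=2).fit_transform(X)
# LDA: supervised projection maximizing class separation (at most 2 components for 3 classes)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print("PCA-reduced shape:", X_pca.shape)
print("LDA-reduced shape:", X_lda.shape)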
Algorithm:
Points to Remember (DBSCAN, Density-Based Spatial Clustering of Applications with Noise):
1. Core Point: A point is a core point if it has at least a minimum number (minPts) of
neighboring points within a given radius (epsilon, or eps).
2. Border Point: A point that is not a core point but is within the epsilon distance of a core
point. It lies on the boundary of a cluster.
3. Noise Point: A point that is neither a core point nor a border point, i.e., it doesn't have
enough neighbors within epsilon and is not reachable from any core point.
Code:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset and standardize the features
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)
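To relate the result to the core, border and noise points described above, a short follow-up (assuming the dbscan_labels array from the code above; DBSCAN marks noise points with the label -1):
# Count the points assigned to each cluster; label -1 corresponds to noise
unique_labels, counts = np.unique(dbscan_labels, return_counts=True)
for label, count in zip(unique_labels, counts):
    name = 'Noise' if label == -1 else f'Cluster {label}'
    print(f'{name}: {count} points')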
Output
Experiment-9
K-Medoids is more robust to outliers and noise because it uses actual data points as cluster centers, unlike K-Means, whose centroids can be skewed by extreme values.
Algorithm Steps:
1. Initialization:
o Randomly select k data points from the dataset as the initial medoids
(representative points of clusters).
2. Assign each point to the nearest medoid:
o Calculate the dissimilarity (distance, typically Manhattan or Euclidean) of each
point to every medoid.
o Assign each point to the cluster with the nearest medoid.
3. Update Medoids:
o For each cluster, calculate the total dissimilarity of all points in the cluster to each
point within the cluster.
o Select the point that minimizes the total dissimilarity as the new medoid for that
cluster.
4. Repeat:
o Repeat the assignment and update steps until the medoids no longer change or the
total cost (sum of dissimilarities) stabilizes.
5. Termination:
o The algorithm converges when no further changes occur in the medoids or the
cluster assignments.
Applications:
• Customer segmentation.
• Gene expression clustering.
• Image segmentation.
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn_extra.cluster import KMedoids
from sklearn.preprocessing import StandardScaler
iris = load_iris()
X = iris.data
y = iris.target
X_scaled = StandardScaler().fit_transform(X)
kmedoids = KMedoids(n_clusters=3, random_state=1)
kmedoids_labels = kmedoids.fit_predict(X_scaled)
medoids = kmedoids.cluster_centers_
print("Medoids of each cluster:")
print(medoids)
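A short follow-up sketch (assuming the X_scaled, kmedoids_labels and medoids variables from the code above) visualizes the clusters on the first two standardized features:
# Plot the clusters on the first two standardized features and mark the medoids
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmedoids_labels, cmap='viridis', s=30)
plt.scatter(medoids[:, 0], medoids[:, 1], c='red', marker='X', s=200, label='Medoids')
plt.xlabel('Sepal length (standardized)')
plt.ylabel('Sepal width (standardized)')
plt.legend()
plt.show()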
Output:
Experiment-10
K-Means is a popular clustering algorithm that aims to partition data into k clusters, where each
data point belongs to the cluster with the nearest mean (centroid). For a handwritten digits
dataset, like the MNIST dataset, K-Means can be used to group similar digits based on their pixel
values. Although K-Means is an unsupervised algorithm (meaning it doesn’t rely on labels), we
can still apply it to see if it clusters similar digits together.
Algorithm Steps:
1. Initialization:
o Choose k random points from the dataset as initial centroids (representative of
clusters).
2. Assignment Step:
o Assign each data point (handwritten digit) to the nearest centroid based on a
distance metric (usually Euclidean distance).
3. Update Step:
o After all points are assigned to clusters, compute the new centroids by taking the
mean of the points in each cluster.
4. Repeat:
o Repeat the assignment and update steps until the centroids no longer change
(convergence).
5. Termination:
o The algorithm stops when the centroids stabilize or the maximum number of
iterations is reached.
1. Dataset:
o You can use a dataset like MNIST (28x28 pixel images of handwritten digits) or
any similar dataset of handwritten digits.
2. Preprocessing:
o Flatten each image into a 1D vector. For example, a 28x28 pixel image becomes a
784-dimensional vector.
o Normalize pixel values to range between 0 and 1 (optional but recommended for
better convergence).
3. Apply K-Means:
o Choose k=10 (since there are 10 digit classes: 0-9).
o Run the K-Means algorithm to cluster the digits.
4. Evaluate:
o Since K-Means is unsupervised, there is no direct label information. However, you
can evaluate the clusters by seeing how well they align with the actual digit labels
(using metrics like cluster purity or adjusted Rand index).
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_openml
# Load the MNIST dataset
mnist = fetch_openml('mnist_784', version=1)
X = mnist.data
y = mnist.target.astype(int)
# Reduce dataset size for faster processing (optional)
X_sample = X.sample(n=1000, random_state=1)
# Apply K-Means
n_clusters = 10 # Number of digits
kmeans = KMeans(n_clusters=n_clusters, random_state=1)
kmeans.fit(X_sample)
# Get cluster labels
labels = kmeans.labels_
# Visualize the clusters
plt.figure(figsize=(10, 8))
for i in range(n_clusters):
    plt.subplot(2, 5, i + 1)
    # Get the cluster center (centroid) and reshape it into a 28x28 image
    cluster_center = kmeans.cluster_centers_[i].reshape(28, 28)
    plt.imshow(cluster_center, cmap='gray')
    plt.title(f'Cluster {i}')
    plt.axis('off')
plt.tight_layout()
plt.show()
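To carry out the evaluation step described above, a short sketch (assuming the X_sample and labels from the code above, and that y is the pandas Series loaded above; y_sample is an illustrative name for the true digits of the sampled rows) compares the clusters with the true labels using the adjusted Rand index:
from sklearn.metrics import adjusted_rand_score
# True digit labels for the sampled rows, matched by index
y_sample = y.loc[X_sample.index]
# Adjusted Rand index: 1.0 means perfect agreement with the true digit labels
print(f"Adjusted Rand Index: {adjusted_rand_score(y_sample, labels):.3f}")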
Output: