Naïve Bayesian Classifier and K-Means Clustering
Student's Name
Institutional Affiliation
Professor's Name
Course
Date
Naïve Bayesian Classifier and K-Means Clustering
Part 1: Naïve Bayesian Classifier
1. Concept Explanation
The Naïve Bayesian Classifier is a probabilistic machine learning algorithm used for classification tasks. It is based on Bayes' theorem together with the assumption that features are conditionally independent of one another once the class label is known. Despite this simplifying assumption, the Naïve Bayesian Classifier performs effectively in applications such as spam detection, sentiment analysis, and medical diagnosis.
Assumptions:
1. Conditional Independence – Features are assumed to be independent of one another given the class label.
2. Equal Importance of Features – Each feature contributes equally to the classification.
3. Prior Probabilities Are Used – The model relies on prior knowledge (base rates of
classes).
Mathematically, Bayes' theorem is given by:

P(C|X) = P(X|C) P(C) / P(X)

Where:
P(C|X) is the posterior probability of class C given feature set X.
P(X|C) is the likelihood of feature set X given class C.
P(C) is the prior probability of class C.
P(X) is the marginal probability of feature set X.
For multiple features X = (X1, X2, ..., Xn), the Naïve Bayes assumption simplifies the posterior to:

P(C|X) = P(C) · P(X1|C) · P(X2|C) · … · P(Xn|C) / P(X)
This allows for efficient computation in classification problems.
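As a minimal sketch of how this factorization is used in practice, the unnormalized posterior for each class can be computed as the prior times the per-feature likelihoods. The probability values below are illustrative and anticipate the spam example worked out later in this paper; since P(X) is identical for every class, comparing these unnormalized scores is enough to classify.

# Minimal sketch: unnormalized Naive Bayes posterior for one class.
# prior = P(C); likelihoods = [P(X1|C), ..., P(Xn|C)] (illustrative values below).
def unnormalized_posterior(prior, likelihoods):
    score = prior
    for p in likelihoods:
        score *= p
    return score

spam_score = unnormalized_posterior(0.6, [0.8, 0.6, 0.6, 0.2])    # ≈ 0.0346
ham_score = unnormalized_posterior(0.4, [0.25, 0.5, 0.75, 0.75])  # ≈ 0.0281
print("Spam" if spam_score > ham_score else "Not Spam")           # Spam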
2. Example with Explanation
Application: Spam Email Detection
Spam detection involves categorizing emails into spam and valid messages (ham). The
primary purpose is to develop a predictive model for identifying spam emails based on word
frequency patterns and additional characteristics.
Classification Objective
The goal is to determine the probability of an email being spam given a set of observed
words. This is achieved using the Naïve Bayes classifier, which assumes that the presence of
each word in the email is independent of the others, given the class label.
3. Sample Problem & Solution
Dataset
Consider a small dataset of emails with the presence (1) or absence (0) of specific keywords:
Email ID   "Free"   "Win"   "Money"   "Offer"   Spam (1 = Yes, 0 = No)
1          1        1       0         1         1
2          0        1       1         0         0
3          1        1       1         1         1
4          0        0       1         0         0
5          1        0       1         1         1
We classify a new email with: ("Free"=1, "Win"=1, "Money"=1, "Offer"=0).
Step-by-Step Calculation using Bayes' Theorem

Calculate Priors:
P(Spam) = 3/5 = 0.6
P(Not Spam) = 2/5 = 0.4

Calculate Likelihoods:
Counting directly from the table gives several zero likelihoods (for example, every spam email has "Offer" = 1, so P(Offer=0|Spam) = 0/3, and no legitimate email contains "Free", so P(Free=1|Not Spam) = 0/2), which would force both posteriors to zero. Laplace (add-one) smoothing avoids this: add 1 to each count and 2 to each class total. This is also the default behaviour of the BernoulliNB classifier used below.

For Spam (3 emails):
P(Free=1|Spam) = (3+1)/(3+2) = 0.80
P(Win=1|Spam) = (2+1)/(3+2) = 0.60
P(Money=1|Spam) = (2+1)/(3+2) = 0.60
P(Offer=0|Spam) = (0+1)/(3+2) = 0.20

For Not Spam (2 emails):
P(Free=1|Not Spam) = (0+1)/(2+2) = 0.25
P(Win=1|Not Spam) = (1+1)/(2+2) = 0.50
P(Money=1|Not Spam) = (2+1)/(2+2) = 0.75
P(Offer=0|Not Spam) = (2+1)/(2+2) = 0.75

Compute Posteriors (P(X) is common to both classes, so it can be dropped):
P(Spam|X) ∝ 0.6 × (0.80 × 0.60 × 0.60 × 0.20) ≈ 0.0346
P(Not Spam|X) ∝ 0.4 × (0.25 × 0.50 × 0.75 × 0.75) ≈ 0.0281

Since P(Spam|X) > P(Not Spam|X), the email is classified as Spam.
Python Code for Naïve Bayes Implementation
from sklearn.naive_bayes import BernoulliNB
import numpy as np
# Training dataset: rows are emails, columns are the binary features
# ("Free", "Win", "Money", "Offer")
X_train = np.array([[1, 1, 0, 1], [0, 1, 1, 0], [1, 1, 1, 1], [0, 0, 1, 0], [1, 0, 1, 1]])
y_train = np.array([1, 0, 1, 0, 1])  # 1 = Spam, 0 = Not Spam

# New email sample: "Free" = 1, "Win" = 1, "Money" = 1, "Offer" = 0
X_test = np.array([[1, 1, 1, 0]])

# Model training (BernoulliNB applies Laplace smoothing with alpha=1 by default)
nb_model = BernoulliNB()
nb_model.fit(X_train, y_train)

# Prediction
prediction = nb_model.predict(X_test)
print("Prediction:", "Spam" if prediction[0] == 1 else "Not Spam")
Output:
Prediction: Spam
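As a quick sanity check, the model's predict_proba output should agree with the hand-computed smoothed posteriors once they are normalized to sum to one:

# Class probabilities for the test email, ordered [Not Spam, Spam];
# 0.0281 and 0.0346 normalized give roughly [0.45, 0.55]
print(nb_model.predict_proba(X_test))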
Part 2: K-Means Clustering
1. Concept Explanation
Clustering is an unsupervised machine learning technique that groups data points according to their shared features. Because clustering algorithms work without predefined categories, they detect natural groupings in the data. Clustering serves multiple purposes, including market segmentation, anomaly detection, image processing, and biological data analysis.
Definition of K-Means Clustering
K-Means Clustering partitions a dataset into K clusters, assigning each data point to the cluster whose centroid is nearest. It is commonly used in marketing to segment customers based on spending behavior and income levels, which allows businesses to target specific customer groups with personalized promotions.
K-Means follows three main steps:
1. Centroid Selection:
o Randomly select K initial centroids from the dataset.
2. Cluster Assignment:
o Assign each data point to the nearest centroid based on the Euclidean distance.
3. Centroid Updating:
o Compute the new centroid of each cluster as the mean of all points assigned to it.
o Repeat the assignment and update steps until the centroids no longer change significantly (convergence); a short NumPy sketch of one such iteration follows the formulas below.
The centroid of a cluster is mathematically represented as:
Ck = (1/n) Σ xi  (sum over the n points xi in cluster k)

where:
Ck is the centroid of cluster k,
xi represents the data points in cluster k,
n is the number of points in the cluster.
The Euclidean distance used for assigning points to clusters is:

d(x, Ck) = √( Σ (xj − Ckj)² )  (sum over features j = 1, …, m)

where:
x is a data point,
Ck is the cluster centroid,
m is the number of features.
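The following NumPy sketch puts the two formulas together for a single iteration. The three points and centroids are taken from the customer example below; the code is an illustration of one assignment-and-update step, not a full K-Means implementation.

import numpy as np

# Three sample points and the three initial centroids from the example below
X = np.array([[15.0, 39.0], [16.0, 81.0], [17.0, 6.0]])
centroids = np.array([[15.0, 39.0], [24.0, 94.0], [40.0, 8.0]])

# Cluster assignment: Euclidean distance from every point to every centroid
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = np.argmin(distances, axis=1)  # index of the nearest centroid per point

# Centroid update: mean of the points assigned to each cluster
# (a cluster that received no points keeps its previous centroid)
new_centroids = np.array([
    X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
    for k in range(len(centroids))
])
print(labels)          # [0 1 2]
print(new_centroids)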
2. Customer Segmentation
The marketing industry uses K-Means Clustering to divide customers by their purchasing behavior and financial capability. Businesses use the resulting segments to deliver advertisements tailored to distinct customer groups.
3. Sample Problem & Solution
Customer ID   Annual Income ($1000s)   Spending Score (1-100)
1             15                       39
2             16                       81
3             17                       6
4             18                       77
5             20                       40
6             24                       94
7             25                       3
8             30                       73
9             35                       92
10            40                       8
Step 1: Initial Centroid Selection
Randomly selecting K=3 centroids:
C1 (Low Income, Moderate Spending): (15, 39)
C2 (Middle Income, High Spending): (24, 94)
C3 (High Income, Low Spending): (40, 8)
Step 2: Cluster Assignment (Iteration 1)
Compute the Euclidean distance from each customer to the three centroids, and assign the customer to the nearest one.

Example Calculation for Customer 1 (15, 39)
Distance to C1 (15, 39):
d1 = √((15 − 15)² + (39 − 39)²) = 0
Distance to C2 (24, 94):
d2 = √((15 − 24)² + (39 − 94)²) ≈ 55.7
Distance to C3 (40, 8):
d3 = √((15 − 40)² + (39 − 8)²) ≈ 39.8
Since d1 is the smallest, Customer 1 is assigned to Cluster 1.

Repeating this for all customers gives Cluster 1 = {1, 5}, Cluster 2 = {2, 4, 6, 8, 9}, and Cluster 3 = {3, 7, 10}.
Step 3: Centroid Update
Example for Cluster 1 (Customers 1 and 5):
New centroid:
C1 = ((15 + 20)/2, (39 + 40)/2) = (17.5, 39.5)
After updating all three centroids in this way, reassigning the points leaves every cluster membership unchanged, so the algorithm has converged.
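These hand computations are easy to double-check with NumPy; the two print statements below reproduce the distance and centroid values above:

import numpy as np

# Distances from Customer 1 (15, 39) to the three initial centroids
centroids = np.array([[15, 39], [24, 94], [40, 8]])
print(np.linalg.norm(np.array([15, 39]) - centroids, axis=1))  # ≈ [ 0.  55.7  39.8]

# Updated centroid of Cluster 1 (Customers 1 and 5)
print(np.array([[15, 39], [20, 40]]).mean(axis=0))  # [17.5 39.5]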
Final Cluster Assignments

Customer ID   Annual Income ($1000s)   Spending Score (1-100)   Final Cluster
1             15                       39                       1
2             16                       81                       2
3             17                       6                        3
4             18                       77                       2
5             20                       40                       1
6             24                       94                       2
7             25                       3                        3
8             30                       73                       2
9             35                       92                       2
10            40                       8                        3
Python Code for K-Means Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Dataset: Customer Income & Spending Score
X = np.array([
    [15, 39], [16, 81], [17, 6], [18, 77], [20, 40],
    [24, 94], [25, 3], [30, 73], [35, 92], [40, 8]
])

# Apply K-Means Clustering (n_init restarts guard against poor initial centroids)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)
# Cluster assignments
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
# Plot the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o', edgecolor='k')
plt.scatter(centroids[:, 0], centroids[:, 1], s=300, c='red', marker='X', label='Centroids')
plt.xlabel('Annual Income ($1000s)')
plt.ylabel('Spending Score (1-100)')
plt.title('K-Means Customer Segmentation')
plt.legend()
plt.show()
# Print cluster assignments
print("Final Cluster Assignments:", labels)
Output
[Figure: scatter plot of the ten customers colored by cluster, with the three centroids marked by red X symbols, titled "K-Means Customer Segmentation". The script also prints the final cluster assignments.]