Naive Bayes
Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem. It is not a single algorithm but a family of algorithms that all share a common principle: every pair of features being classified is independent of each other, given the class. One of the simplest and most effective classification algorithms, the Naive Bayes classifier aids in the rapid development of machine learning models with fast prediction capabilities.
The Naive Bayes algorithm is used for classification problems and is especially popular in text classification, where the data is high-dimensional (each word represents one feature). Typical applications include spam filtering, sentiment detection and rating classification. The main advantage of Naive Bayes is its speed: it is fast to train, and making predictions is easy even with high-dimensional data. The model predicts the probability that an instance belongs to a class, given a set of feature values. It is a probabilistic classifier, and it is called "naive" because it assumes that the presence of one feature in the model is independent of the presence of any other feature. In other words, each feature contributes to the prediction with no relation to the others. In the real world, this condition is rarely satisfied. The algorithm uses Bayes' theorem for both training and prediction.
The “Naive” part of the name indicates the simplifying assumption made by the Naïve Bayes
classifier. The classifier assumes that the features used to describe an observation are conditionally
independent, given the class label. The “Bayes” part of the name refers to Reverend Thomas Bayes,
an 18th-century statistician and theologian who formulated Bayes’ theorem.
Consider a fictional dataset that describes the weather conditions for playing a game of golf. Given the weather conditions, each tuple classifies the conditions as fit ("Yes") or unfit ("No") for playing golf. Here is a tabular representation of our dataset.
The dataset is divided into two parts, namely, feature matrix and the response vector.
Feature matrix contains all the vectors(rows) of dataset in which each vector consists of the
value of dependent features. In above dataset, features are ‘Outlook’, ‘Temperature’,
‘Humidity’ and ‘Windy’.
Response vector contains the value of class variable(prediction or output) for each row of
feature matrix. In above dataset, the class variable name is ‘Play golf’.
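This split can be sketched in a few lines of Python. The rows below are illustrative only (the first one, Rainy/Hot/High/no wind → No, is the row referenced later in the text; the other two are placeholder examples, not the full dataset):

```python
# Each row: (Outlook, Temperature, Humidity, Windy) -> Play golf
# Only a few illustrative rows, not the full 14-row dataset.
dataset = [
    ("Rainy",    "Hot",  "High",   False, "No"),
    ("Overcast", "Hot",  "High",   False, "Yes"),
    ("Sunny",    "Cool", "Normal", False, "Yes"),
]

# Feature matrix: every column except the last.
X = [row[:-1] for row in dataset]
# Response vector: the class variable 'Play golf'.
y = [row[-1] for row in dataset]

print(X[0])  # ('Rainy', 'Hot', 'High', False)
print(y[0])  # No
```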
The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. More specifically:
Feature independence: The features of the data are conditionally independent of each
other, given the class label.
Features are equally important: All features are assumed to contribute equally to the
prediction of the class label.
No missing data: The data should not contain any missing values.
The assumptions made by Naive Bayes are not generally correct in real-world situations. In fact, the independence assumption is never exactly correct, but it often works well in practice. Now, before moving to the formula for Naive Bayes, it is important to know Bayes' theorem.
Bayes’ Theorem
Bayes’ Theorem finds the probability of an event occurring given the probability of another event
that has already occurred. Bayes’ theorem is stated mathematically as the following equation:
P(A∣B) = P(B∣A) P(A) / P(B)
Basically, we are trying to find the probability of event A, given that event B is true. Event B is also termed the evidence.
P(A) is the prior probability of A, i.e. the probability of the event before the evidence is seen. The evidence is an attribute value of an unknown instance (here, event B).
P(A∣B) is the posterior probability of A, i.e. the probability of the event after the evidence is seen.
P(B∣A) is the likelihood, i.e. the probability of the evidence given that the hypothesis A is true.
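As a quick numeric illustration of the theorem (the probability values below are made up for this example; they are not taken from the golf dataset):

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Illustrative numbers only, not from the golf dataset.
p_a = 0.01         # prior P(A)
p_b_given_a = 0.9  # likelihood P(B|A)
p_b = 0.05         # evidence P(B)

p_a_given_b = p_b_given_a * p_a / p_b  # posterior P(A|B)
print(p_a_given_b)  # ≈ 0.18
```

Note how a strong likelihood (0.9) still yields a modest posterior (0.18) when the prior is small, a direct consequence of the formula.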
Now, with regard to our dataset, we can apply Bayes' theorem in the following way:
P(y∣X) = P(X∣y) P(y) / P(X)
where y is the class variable and X is a dependent feature vector (of size n):
X = (x1, x2, x3, …, xn)
Just to be clear, an example of a feature vector and corresponding class variable is the first row of the dataset: X = (Rainy, Hot, High, False) and y = No.
So basically, P(y∣X) here means the probability of "Not playing golf" given that the weather conditions are "Rainy outlook", "Hot temperature", "High humidity" and "No wind".
We assume that no pair of features are dependent. For example, the temperature being
‘Hot’ has nothing to do with the humidity or the outlook being ‘Rainy’ has no effect on the
winds. Hence, the features are assumed to be independent.
Secondly, each feature is given the same weight(or importance). For example, knowing only
temperature and humidity alone can’t predict the outcome accurately. None of the
attributes is irrelevant and assumed to be contributing equally to the outcome.
Now, it's time to put the naive assumption into Bayes' theorem: independence among the features. So now, we split the evidence into its independent parts. If two events A and B are independent, then:
P(A, B) = P(A) P(B)
Hence, we reach the result:
P(y∣x1, …, xn) = P(x1∣y) P(x2∣y) … P(xn∣y) P(y) / (P(x1) P(x2) … P(xn))
Now, as the denominator remains constant for a given input, we can remove that term:
P(y∣x1, …, xn) ∝ P(y) ∏i=1 to n P(xi∣y)
Now, we need to create a classifier model. For this, we find the probability of given set of inputs for
all possible values of the class variable y and pick up the output with maximum probability. This can
be expressed mathematically as:
y = argmax_y P(y) ∏i=1 to n P(xi∣y)
So, finally, we are left with the task of calculating P(y) and P(xi∣y).
Please note that P(y) is also called the class probability and P(xi∣y) is called the conditional probability.
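The argmax rule above can be sketched as a small function. The probability tables are passed in as plain dictionaries; the values used in the usage example are toy numbers for a single "Outlook" feature, not estimates from the golf dataset:

```python
# y_hat = argmax_y P(y) * prod_i P(x_i | y)
def predict(x, priors, likelihoods):
    """x: list of feature values; priors: {class: P(y)};
    likelihoods: {class: [dict mapping value -> P(x_i | y), one per feature]}."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            # Unseen values get probability 0 in this simple sketch.
            score *= likelihoods[c][i].get(value, 0.0)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy tables with a single feature, for illustration only.
priors = {"Yes": 0.6, "No": 0.4}
likelihoods = {
    "Yes": [{"Sunny": 0.3, "Rainy": 0.7}],
    "No":  [{"Sunny": 0.6, "Rainy": 0.4}],
}
print(predict(["Sunny"], priors, likelihoods))  # No  (0.4*0.6 > 0.6*0.3)
```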
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the
distribution of P(xi∣y).
Let us try to apply the above formula manually on our weather dataset. For this, we need to do some
precomputations on our dataset.
We need to find P(xi∣yj) for each xi in X and yj in y. These calculations are demonstrated in tables 1-4, which give the frequency and likelihood of each feature value within each class. For example, the probability of playing golf given that the temperature is cool, i.e. P(temp. = cool | play golf = Yes) = 3/9.
Also, we need to find the class probabilities P(y), which have been calculated in table 5. For example, P(play golf = Yes) = 9/14.
So now, we are done with our pre-computations and the classifier is ready!
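The pre-computations can be sketched end to end in Python. The 14 rows below are a commonly used version of the play-golf dataset; since the table itself is not reproduced above, treat the exact rows as an assumption. The sketch counts class and feature-value frequencies, then classifies a new instance:

```python
from collections import Counter, defaultdict

# A commonly used version of the 14-row play-golf dataset (assumed rows).
data = [
    ("Rainy","Hot","High","False","No"),      ("Rainy","Hot","High","True","No"),
    ("Overcast","Hot","High","False","Yes"),  ("Sunny","Mild","High","False","Yes"),
    ("Sunny","Cool","Normal","False","Yes"),  ("Sunny","Cool","Normal","True","No"),
    ("Overcast","Cool","Normal","True","Yes"),("Rainy","Mild","High","False","No"),
    ("Rainy","Cool","Normal","False","Yes"),  ("Sunny","Mild","Normal","False","Yes"),
    ("Rainy","Mild","Normal","True","Yes"),   ("Overcast","Mild","High","True","Yes"),
    ("Overcast","Hot","Normal","False","Yes"),("Sunny","Mild","High","True","No"),
]

# Class probabilities P(y): table 5.
class_counts = Counter(row[-1] for row in data)
priors = {c: class_counts[c] / len(data) for c in class_counts}

# Conditional counts for P(x_i | y): tables 1-4.
cond = defaultdict(Counter)  # key (feature_index, class) -> Counter of values
for row in data:
    for i, value in enumerate(row[:-1]):
        cond[(i, row[-1])][value] += 1

def p(value, i, c):
    """P(x_i = value | y = c), estimated by counting."""
    return cond[(i, c)][value] / class_counts[c]

def predict(x):
    """argmax_y P(y) * prod_i P(x_i | y)."""
    scores = {c: priors[c] for c in priors}
    for c in priors:
        for i, value in enumerate(x):
            scores[c] *= p(value, i, c)
    return max(scores, key=scores.get)

print(priors["Yes"])        # 9/14 ≈ 0.643, matching table 5
print(p("Cool", 1, "Yes"))  # 3/9  ≈ 0.333, matching the example above
print(predict(("Sunny", "Hot", "Normal", "False")))  # Yes
```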
Now, suppose we want to classify a new instance: today = (Sunny outlook, Hot temperature, Normal humidity, No wind). Then:
P(Yes∣today) = P(Sunny Outlook∣Yes) P(Hot Temperature∣Yes) P(Normal Humidity∣Yes) P(No Wind∣Yes) P(Yes) / P(today)
P(No∣today) = P(Sunny Outlook∣No) P(Hot Temperature∣No) P(Normal Humidity∣No) P(No Wind∣No) P(No) / P(today)
Since P(today) is common to both probabilities, we can ignore it and find proportional probabilities as:
P(Yes∣today) ∝ P(Sunny Outlook∣Yes) P(Hot Temperature∣Yes) P(Normal Humidity∣Yes) P(No Wind∣Yes) P(Yes) ≈ 0.02116
and
P(No∣today) ∝ P(Sunny Outlook∣No) P(Hot Temperature∣No) P(Normal Humidity∣No) P(No Wind∣No) P(No) ≈ 0.0068
Now, since
P(Yes∣today) + P(No∣today) = 1
these numbers can be converted into probabilities by making their sum equal to 1 (normalization):
P(Yes∣today) = 0.02116 / (0.02116 + 0.0068) ≈ 0.76
and
P(No∣today) = 0.0068 / (0.02116 + 0.0068) ≈ 0.24
Since
P(Yes∣today) > P(No∣today)
the prediction is that golf would be played today, i.e. "Yes".
The method that we discussed above is applicable to discrete data. In the case of continuous data, we need to make some assumptions regarding the distribution of values of each feature; for example, Gaussian Naive Bayes assumes that each continuous feature follows a normal distribution within each class. As noted earlier, the different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(xi∣y).
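A minimal sketch of the Gaussian variant for a single continuous feature, using made-up training values (the temperatures below are illustrative, not from the golf dataset):

```python
import math

# Gaussian Naive Bayes sketch for one continuous feature.
# Training values are made up for illustration.
samples = {"Yes": [20.0, 22.0, 24.0], "No": [30.0, 32.0, 34.0]}
priors = {"Yes": 0.5, "No": 0.5}

def gaussian_pdf(x, mean, var):
    """Normal density used as the likelihood P(x | y)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(values):
    """Estimate per-class mean and variance from the training values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

params = {c: fit(v) for c, v in samples.items()}

def predict(x):
    """argmax_y P(y) * N(x; mean_y, var_y)."""
    scores = {c: priors[c] * gaussian_pdf(x, *params[c]) for c in priors}
    return max(scores, key=scores.get)

print(predict(21.0))  # Yes  (close to the "Yes" class mean of 22)
print(predict(33.0))  # No
```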