Naive Bayes Classifiers

Naive Bayes classifiers are a family of algorithms based on Bayes’ Theorem. Despite the “naive” assumption of feature independence, these classifiers are widely used in machine learning for their simplicity and efficiency. This article covers their theory, implementation, and applications, and shows why they remain practically useful despite their oversimplified assumptions.

What are Naive Bayes Classifiers?

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. Naive Bayes is not a single algorithm but a family of algorithms that share a common principle: every pair of features being classified is independent of each other, given the class.

One of the simplest and most effective classification algorithms, the Naïve Bayes classifier helps build machine learning models quickly and makes fast predictions.

The Naïve Bayes algorithm is used for classification problems, and especially for text classification. In text classification tasks the data is high-dimensional, as each word represents one feature. Typical uses include spam filtering, sentiment detection and rating classification. The main advantage of naïve Bayes is its speed: it is fast, and making predictions is easy even with high-dimensional data.

This model predicts the probability that an instance belongs to a class, given a set of feature values. It is a probabilistic classifier. It is called naive because it assumes that the presence of one feature is independent of the presence of any other feature. In other words, each feature contributes to the prediction with no relation to the others. In the real world, this condition is rarely satisfied. The algorithm uses Bayes’ theorem for both training and prediction.

Why is it Called Naive Bayes?

The “Naive” part of the name indicates the simplifying assumption made by the Naïve Bayes classifier. The classifier assumes that the features used to describe an observation are conditionally independent, given the class label. The “Bayes” part of the name refers to Reverend Thomas Bayes, an 18th-century statistician and theologian who formulated Bayes’ theorem.

Consider a fictional dataset that describes the weather conditions for playing a game of golf. Given the weather conditions, each tuple classifies the conditions as fit (“Yes”) or unfit (“No”) for playing golf. Here is a tabular representation of our dataset.

      Outlook    Temperature   Humidity   Windy   Play Golf
0     Rainy      Hot           High       False   No
1     Rainy      Hot           High       True    No
2     Overcast   Hot           High       False   Yes
3     Sunny      Mild          High       False   Yes
4     Sunny      Cool          Normal     False   Yes
5     Sunny      Cool          Normal     True    No
6     Overcast   Cool          Normal     True    Yes
7     Rainy      Mild          High       False   No
8     Rainy      Cool          Normal     False   Yes
9     Sunny      Mild          Normal     False   Yes
10    Rainy      Mild          Normal     True    Yes
11    Overcast   Mild          High       True    Yes
12    Overcast   Hot           Normal     False   Yes
13    Sunny      Mild          High       True    No

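The dataset is divided into two parts, namely the feature matrix and the response vector. The feature matrix contains all the rows of the dataset restricted to the feature columns (Outlook, Temperature, Humidity and Windy), while the response vector contains the value of the class variable (Play Golf) for each row.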
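As a minimal sketch, the split into a feature matrix and a response vector could look like this in plain Python. The names `rows`, `X` and `y` are illustrative choices rather than anything prescribed by the article; the values are the 14 rows of the table above.

```python
# The 14 rows of the golf dataset, in the column order
# (Outlook, Temperature, Humidity, Windy, Play Golf).
rows = [
    ("Rainy", "Hot", "High", False, "No"),
    ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Sunny", "Mild", "High", True, "No"),
]

# Feature matrix: every column except the last one.
X = [row[:-1] for row in rows]
# Response vector: the last column (the class label).
y = [row[-1] for row in rows]

print(X[0], y[0])
```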

Assumption of Naive Bayes

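The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome.

With relation to our dataset, this concept can be understood as follows. We assume that no pair of features is dependent: for example, the temperature being “Hot” tells us nothing about the humidity, and the outlook being “Rainy” has no effect on whether it is windy. We also give each feature the same weight: no single attribute is treated as irrelevant or as more important than the others when predicting the outcome.

The assumptions made by Naive Bayes are not generally correct in real-world situations. In fact, the independence assumption is almost never exactly true, but it often works well in practice. Now, before moving to the formula for Naive Bayes, it is important to know about Bayes’ theorem.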

Bayes’ Theorem

Bayes’ Theorem finds the probability of an event occurring given that another event has already occurred. Bayes’ theorem is stated mathematically as the following equation:

[Tex] P(A|B) = \frac{P(B|A)P(A)}{P(B)} [/Tex]

where A and B are events and P(B) ≠ 0.
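As a quick sanity check of the theorem, the sketch below takes A = "Play Golf is Yes" and B = "Outlook is Sunny" and compares the Bayes' theorem result with a direct count. The counts are read off the golf dataset shown earlier; the variable names are illustrative.

```python
from fractions import Fraction

# Counts read off the golf dataset (14 rows in total).
n_total = 14
n_yes = 9            # rows with Play Golf = Yes
n_sunny = 5          # rows with Outlook = Sunny
n_sunny_and_yes = 3  # rows with Outlook = Sunny and Play Golf = Yes

p_yes = Fraction(n_yes, n_total)                      # P(A)
p_sunny = Fraction(n_sunny, n_total)                  # P(B)
p_sunny_given_yes = Fraction(n_sunny_and_yes, n_yes)  # P(B|A)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny

# Direct count for comparison: 3 of the 5 Sunny rows are "Yes".
direct = Fraction(n_sunny_and_yes, n_sunny)
print(p_yes_given_sunny, direct)  # 3/5 3/5
```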

Now, with regard to our dataset, we can apply Bayes’ theorem in the following way:

[Tex] P(y|X) = \frac{P(X|y)P(y)}{P(X)} [/Tex]

where y is the class variable and X is a feature vector (of size n):

[Tex] X = (x_1,x_2,x_3,\ldots,x_n) [/Tex]

To be clear, an example of a feature vector and the corresponding class variable is (refer to the 1st row of the dataset):

X = (Rainy, Hot, High, False)
y = No

So [Tex]P(y|X) [/Tex] here means the probability of “Not playing golf” given that the weather conditions are “Rainy outlook”, “Temperature is hot”, “high humidity” and “no wind”.

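Now, it is time to apply the naive assumption to Bayes’ theorem, namely independence among the features given the class. This lets us split the likelihood [Tex]P(X|y) [/Tex] into independent parts.

If any two events A and B are independent, then

[Tex] P(A,B) = P(A)P(B) [/Tex]

Hence, we reach the result:

[Tex] P(y|x_1,\ldots,x_n) = \frac{P(x_1|y)P(x_2|y)\cdots P(x_n|y)P(y)}{P(x_1,\ldots,x_n)} [/Tex]

which can be expressed as:

[Tex] P(y|x_1,\ldots,x_n) = \frac{P(y)\prod_{i=1}^{n}P(x_i|y)}{P(x_1,\ldots,x_n)} [/Tex]

Now, as the denominator remains constant for a given input, we can remove that term:

[Tex] P(y|x_1,\ldots,x_n)\propto P(y)\prod_{i=1}^{n}P(x_i|y) [/Tex]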

Now, we need to create a classifier model. For this, we compute the probability of the given set of inputs for all possible values of the class variable y and pick the output with the maximum probability. This can be expressed mathematically as:

[Tex] y = \arg\max_{y} P(y)\prod_{i=1}^{n}P(x_i|y) [/Tex]
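The decision rule above can be sketched as a small function. This is only an illustration: the function name `naive_bayes_predict`, the dictionary layout of `priors` and `likelihoods`, and the numbers in the usage example are all assumptions made for this sketch, not values from the golf dataset.

```python
from math import prod


def naive_bayes_predict(priors, likelihoods, features):
    """Return the class maximising P(y) * prod_i P(x_i | y), plus all scores.

    priors:      {label: P(y)}
    likelihoods: {label: [ {value: P(x_i = value | y)} for each feature i ]}
    features:    observed feature values (x_1, ..., x_n)
    """
    scores = {}
    for label, prior in priors.items():
        # One conditional probability per feature; unseen values count as 0.
        conditionals = [
            likelihoods[label][i].get(value, 0.0)
            for i, value in enumerate(features)
        ]
        scores[label] = prior * prod(conditionals)
    # argmax over the class labels
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical two-class, two-feature example (made-up probabilities).
priors = {"A": 0.6, "B": 0.4}
likelihoods = {
    "A": [{"u": 0.5, "v": 0.5}, {"p": 0.2, "q": 0.8}],
    "B": [{"u": 0.9, "v": 0.1}, {"p": 0.7, "q": 0.3}],
}
label, scores = naive_bayes_predict(priors, likelihoods, ("u", "p"))
print(label)  # B
```

In practice the product is often replaced by a sum of logarithms to avoid numerical underflow when there are many features; the plain product is kept here to mirror the formula.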

So, finally, we are left with the task of calculating [Tex] P(y) [/Tex] and [Tex]P(x_i | y) [/Tex] .

Please note that [Tex]P(y) [/Tex] is also called class probability and [Tex]P(x_i | y) [/Tex] is called conditional probability.

The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of [Tex]P(x_i | y). [/Tex]

Let us try to apply the above formula manually on our weather dataset. For this, we need to do some precomputations on our dataset.

We need to find [Tex] P(x_i | y_j) [/Tex] for each [Tex]x_i [/Tex] in X and [Tex]y_j [/Tex] in y. All these calculations have been demonstrated in the tables below:

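[Figure: Naive Bayes Classifiers. Tables 1-4 list the conditional probabilities for each feature; table 5 lists the class probabilities.]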

So, in the figure above, we have calculated [Tex]P(x_i | y_j) [/Tex] for each [Tex]x_i [/Tex] in X and [Tex]y_j [/Tex] in y manually in tables 1-4. For example, the probability that the temperature is cool given that golf is played is P(temp. = cool | play golf = Yes) = 3/9.

Also, we need to find the class probabilities [Tex]P(y) [/Tex], which have been calculated in table 5. For example, P(play golf = Yes) = 9/14.
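The same pre-computations can be reproduced by counting rows. The sketch below is one possible way to do it with the standard library; the helper `likelihood` and the names `FEATURES`, `class_counts` and `joint_counts` are assumptions of this example, and the probabilities are plain relative frequencies with no smoothing.

```python
from collections import Counter
from fractions import Fraction

FEATURES = ("Outlook", "Temperature", "Humidity", "Windy")
rows = [
    ("Rainy", "Hot", "High", False, "No"),
    ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Sunny", "Mild", "High", True, "No"),
]

# Class counts and class probabilities P(y).
class_counts = Counter(row[-1] for row in rows)
priors = {label: Fraction(n, len(rows)) for label, n in class_counts.items()}

# Joint counts keyed by (feature index, feature value, class label).
joint_counts = Counter(
    (i, row[i], row[-1]) for row in rows for i in range(len(FEATURES))
)


def likelihood(feature, value, label):
    """P(feature = value | class = label), estimated by relative frequency."""
    i = FEATURES.index(feature)
    return Fraction(joint_counts[(i, value, label)], class_counts[label])


print(likelihood("Temperature", "Cool", "Yes"))  # 1/3, i.e. 3/9
print(priors["Yes"])                             # 9/14
```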

So now, we are done with our pre-computations and the classifier is ready!

Let us test it on a new set of features (let us call it today):

today = (Sunny, Hot, Normal, False)

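So, the probability of playing golf is given by:

[Tex] P(Yes | today) = \frac{P(Sunny | Yes)P(Hot | Yes)P(Normal | Yes)P(NoWind | Yes)P(Yes)}{P(today)} [/Tex]

and the probability of not playing golf is given by:

[Tex] P(No | today) = \frac{P(Sunny | No)P(Hot | No)P(Normal | No)P(NoWind | No)P(No)}{P(today)} [/Tex]

Since P(today) is common to both probabilities, we can ignore P(today) and find proportional probabilities as: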

[Tex] P(Yes | today) \propto \frac{3}{9}\cdot\frac{2}{9}\cdot\frac{6}{9}\cdot\frac{6}{9}\cdot\frac{9}{14} \approx 0.02116 [/Tex]

[Tex] P(No | today) \propto \frac{2}{5}\cdot\frac{2}{5}\cdot\frac{1}{5}\cdot\frac{2}{5}\cdot\frac{5}{14} \approx 0.0046 [/Tex]

Now, since

[Tex] P(Yes | today) + P(No | today) = 1 [/Tex]

these numbers can be converted into probabilities by making their sum equal to 1 (normalization):

[Tex] P(Yes | today) = \frac{0.02116}{0.02116 + 0.0046} \approx 0.82 [/Tex]

[Tex] P(No | today) = \frac{0.0046}{0.02116 + 0.0046} \approx 0.18 [/Tex]

Since

[Tex] P(Yes | today) > P(No | today) [/Tex]

the prediction for whether golf would be played is ‘Yes’.
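The arithmetic for today = (Sunny, Hot, Normal, False) can be checked with exact fractions. In this sketch the conditional probabilities are counted directly from the 14 rows of the dataset table (9 "Yes" rows, 5 "No" rows); the variable names are illustrative.

```python
from fractions import Fraction as F

# P(Sunny|y) * P(Hot|y) * P(Normal|y) * P(Windy=False|y) * P(y),
# with every factor counted from the dataset table.
score_yes = F(3, 9) * F(2, 9) * F(6, 9) * F(6, 9) * F(9, 14)
score_no = F(2, 5) * F(2, 5) * F(1, 5) * F(2, 5) * F(5, 14)

# Normalise so that the two posteriors sum to 1.
total = score_yes + score_no
p_yes = score_yes / total
p_no = score_no / total

print(round(float(score_yes), 5), round(float(score_no), 5))
print(round(float(p_yes), 2), round(float(p_no), 2))
print("Yes" if p_yes > p_no else "No")  # Yes
```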

The method discussed above applies to discrete data. For continuous data, we need to make some assumptions about the distribution of the values of each feature.

Types of Naive Bayes Model

There are three types of Naive Bayes Model:

Gaussian Naive Bayes classifier

In Gaussian Naive Bayes, the continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. A Gaussian distribution is also called a Normal distribution. When plotted, it gives a bell-shaped curve which is symmetric about the mean of the feature values.

The updated table of prior probabilities for the outlook feature is as follows:

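The likelihood of the features is assumed to be Gaussian; hence, the conditional probability is given by:

[Tex] P(x_i | y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i-\mu_y)^2}{2\sigma_y^2}\right) [/Tex]

where [Tex]\mu_y [/Tex] and [Tex]\sigma_y^2 [/Tex] are the mean and variance of feature [Tex]x_i [/Tex] among the training samples of class y.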
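As a minimal sketch, the Gaussian conditional probability is the normal density evaluated at the feature value, using the mean and variance of that feature within the class. The function name `gaussian_likelihood` and the numbers in the usage line are assumptions of this example.

```python
import math


def gaussian_likelihood(x, mean, var):
    """Normal density N(x; mean, var), used as P(x_i | y) for one feature."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * var)
    exponent = -((x - mean) ** 2) / (2.0 * var)
    return coeff * math.exp(exponent)


# Hypothetical per-class statistics for a single continuous feature.
density_at_mean = gaussian_likelihood(x=0.0, mean=0.0, var=1.0)
print(round(density_at_mean, 4))  # 0.3989
```

Note that this value is a density rather than a probability, so it can exceed 1 for small variances; only the comparison between classes matters for the prediction.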

Now, we look at an implementation of Gaussian Naive Bayes classifier using scikit-learn.