Naive Bayes
Two types of Naive Bayes
1. Multinomial Naive Bayes: Here the features are assumed to follow a multinomial distribution, i.e. we model the data (typically discrete counts such as word frequencies) with the best-fit multinomial distribution.
Let us see this through an example, keeping in mind the simple formula of Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B).
Example: You have the words and their frequency of usage in each mail, and each mail is labelled as Normal or Spam. The end goal is to classify a new mail as Normal or Spam based on its content (the words and their frequencies).
Suppose the training data (100 mails) has 67 Normal and 33 Spam mails.
Across all Normal mails taken together, the word frequencies are as follows:
Dear (47), Friend (29), Lunch (18), Money (6), out of 100 words in total.
Across all Spam mails taken together, the word frequencies are as follows:
Dear (29), Friend (14), Lunch (0), Money (57), out of 100 words in total.
First we calculate the prior probability of each class, i.e. P(Normal) = 67/100 = 0.67 and likewise P(Spam) = 0.33. In a similar manner we calculate the conditional probabilities of individual words given each class label, i.e. P(Dear|Normal) = 0.47, P(Friend|Normal) = 0.29, P(Lunch|Normal) = 0.18, P(Money|Normal) = 0.06.
And similarly for the Spam label we have P(Dear|Spam) = 0.29, P(Friend|Spam) = 0.14, P(Lunch|Spam) = 0.00, P(Money|Spam) = 0.57.
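These quantities are simple to compute by hand, but here is a minimal Python sketch of the same calculation, assuming the toy counts above (the variable names are my own):

```python
# Toy word counts from the example above.
word_counts = {
    "Normal": {"Dear": 47, "Friend": 29, "Lunch": 18, "Money": 6},
    "Spam":   {"Dear": 29, "Friend": 14, "Lunch": 0,  "Money": 57},
}
class_counts = {"Normal": 67, "Spam": 33}

# Priors: fraction of training mails in each class.
total_mails = sum(class_counts.values())
priors = {c: n / total_mails for c, n in class_counts.items()}

# Conditional probability of each word given the class.
likelihoods = {}
for c, counts in word_counts.items():
    total_words = sum(counts.values())
    likelihoods[c] = {w: n / total_words for w, n in counts.items()}

print(priors)                         # {'Normal': 0.67, 'Spam': 0.33}
print(likelihoods["Normal"]["Dear"])  # 0.47
```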
Let the new mail be 'Dear Friend'; we have to decide whether it is Normal or Spam.
Now using the above probabilities we are interested in P(Normal|Dear, Friend).
According to Bayes' Theorem we have
P[Normal|Dear, Friend] proportional to P[Normal] * P[Dear, Friend|Normal]
The denominator P[Dear, Friend] is common to both P[Normal|Dear, Friend] and P[Spam|Dear, Friend], so we can compare the proportional scores directly, and whichever class has the higher score decides whether the new mail is Normal or Spam.
Also, since Naive Bayes assumes the features are independent given the class, we have P[Dear, Friend|Normal] = P[Dear|Normal] * P[Friend|Normal]. So P[Normal|Dear, Friend] is proportional to P[Normal] * P[Dear|Normal] * P[Friend|Normal] = 0.67 * 0.47 * 0.29 ≈ 0.09, and likewise P[Spam|Dear, Friend] is proportional to 0.33 * 0.29 * 0.14 ≈ 0.01. Therefore the above mail is classified as NORMAL.
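Continuing the sketch above (it reuses the priors and likelihoods dictionaries from the previous snippet; the function name is my own), the scoring step looks like this:

```python
def score(message_words, label):
    """Unnormalized posterior: prior times the product of word likelihoods."""
    s = priors[label]
    for w in message_words:
        s *= likelihoods[label][w]
    return s

mail = ["Dear", "Friend"]
scores = {c: score(mail, c) for c in priors}
print(scores)                       # {'Normal': ~0.091, 'Spam': ~0.013}
print(max(scores, key=scores.get))  # 'Normal'
```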
But if the mail to be classified is, e.g., 'Lunch Money Money Money Money', we are stuck with a problem: our training data has no Spam mail containing the word Lunch, so P[Lunch|Spam] = 0, which makes P[Spam|Lunch, Money, Money, Money, Money] equal to 0 and turns Normal into the default class no matter how often Money appears. To rectify this we add a pseudocount alpha (here alpha = 1) to every word count, so each class now has 100 + 4 = 104 words in all.
As before, the priors are unchanged: P(Normal) = 0.67 and P(Spam) = 0.33. With the smoothed counts, the conditional probabilities of the individual words become P(Dear|Normal) = 48/104 ≈ 0.46, P(Friend|Normal) = 30/104 ≈ 0.29, P(Lunch|Normal) = 19/104 ≈ 0.18, P(Money|Normal) = 7/104 ≈ 0.07.
And similarly for the Spam label we have P(Dear|Spam) = 30/104 ≈ 0.29, P(Friend|Spam) = 15/104 ≈ 0.14, P(Lunch|Spam) = 1/104 ≈ 0.01, P(Money|Spam) = 58/104 ≈ 0.56.
and P[Normal|Lunch, Money(4 times)] ∝ P[Normal]*P[Lunch|Normal]*P[Money|Normal]⁴ = 0.67 * (19/104) * (7/104)⁴ ≈ 0.0000025
P[Spam|Lunch, Money(4 times)] ∝ 0.33 * (1/104) * (58/104)⁴ ≈ 0.0003
so the mail is now classified as SPAM: with smoothing, the four occurrences of Money outweigh the single Lunch.
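A sketch of the same calculation with Laplace smoothing, continuing from the word_counts and priors defined earlier (alpha is the pseudocount):

```python
alpha = 1  # pseudocount added to every word count so nothing is exactly 0

smoothed = {}
for c, counts in word_counts.items():
    total = sum(counts.values()) + alpha * len(counts)  # 100 + 4 = 104
    smoothed[c] = {w: (n + alpha) / total for w, n in counts.items()}

def smoothed_score(message_words, label):
    s = priors[label]
    for w in message_words:
        s *= smoothed[label][w]
    return s

mail = ["Lunch"] + ["Money"] * 4
print({c: smoothed_score(mail, c) for c in priors})
# Spam (~0.0003) now beats Normal (~0.0000025)
```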
This example also shows why the algorithm is called NAIVE: it does not account for the order of words in a sentence, so 'Dear Friend' and 'Friend Dear' are treated as equivalent by the classifier, even though in real life word order can matter when classifying a mail as Spam or Normal.
2. Gaussian Naive Bayes: Here the features are assumed to follow a Gaussian distribution, i.e. we model each feature with the best-fit Gaussian distribution.
Let's start with an example where we have Group A with X1 = [24.3, 28.2, …], X2 = [750.7, 533.2, …], X3 = [0.2, 50.5, …] and Group B with X1 = [2.1, 4.8, …], X2 = [120.5, 110.9, …], X3 = [90.7, 102.3, …]
Here the features are continuous variables assumed to be Gaussian, so for each group we estimate each feature's [mean, variance]: e.g. Group A has X1 with [24, 4] and Group B has X1 with [4, 2], and likewise for X2 and X3 (values omitted here). Now, for some new test data point (X1 = 20, X2 = 500, X3 = 25), we need to decide whether it belongs to Group A or Group B.
For this we follow the same steps as in Multinomial Naive Bayes; however, since the data is continuous, the likelihood of each feature value comes from the pdf of the fitted Normal distribution, and we work with its log (the log-likelihood) to avoid multiplying many tiny numbers. This gives an expression like
score(Group A|20, 500, 25) = log(P[Group A] * P[X1 = 20|Group A] * P[X2 = 500|Group A] * P[X3 = 25|Group A])
= log(P[Group A]) + log(P[X1 = 20|Group A]) + log(P[X2 = 500|Group A]) + log(P[X3 = 25|Group A])
where each P[Xi = x|Group A] is the Gaussian pdf evaluated at x with that group's mean and variance for Xi,
and the final decision is similarly based on the higher value among score(Group A|20, 500, 25) and score(Group B|20, 500, 25).
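Here is a minimal sketch of that scoring, assuming the per-group means and variances have already been estimated; X1's [mean, variance] pairs come from the example above, while the X2 and X3 values (and the equal priors) are made-up placeholders, since the text omits them:

```python
import math

# Per-group (mean, variance) for each feature. X2/X3 values are placeholders.
params = {
    "Group A": {"X1": (24, 4), "X2": (600, 10000), "X3": (30, 400)},
    "Group B": {"X1": (4, 2),  "X2": (115, 25),    "X3": (95, 36)},
}
priors = {"Group A": 0.5, "Group B": 0.5}  # assumed equal for illustration

def gaussian_logpdf(x, mean, var):
    """Log of the normal density evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def log_score(point, group):
    """log prior + sum of per-feature Gaussian log-likelihoods."""
    s = math.log(priors[group])
    for feature, x in point.items():
        mean, var = params[group][feature]
        s += gaussian_logpdf(x, mean, var)
    return s

test = {"X1": 20, "X2": 500, "X3": 25}
scores = {g: log_score(test, g) for g in params}
print(max(scores, key=scores.get))  # group with the higher log score
```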
When to use Naive Bayes
- Extremely fast for both training and prediction
- Provides straightforward probabilistic predictions
- Easily interpretable, with few (if any) tunable parameters (see the short scikit-learn sketch below)
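That last point is easy to see in practice: with scikit-learn, fitting and predicting takes a couple of lines (a sketch assuming scikit-learn is installed; the training rows reuse the Group A / Group B values from the example above):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy training data: rows are (X1, X2, X3); labels are the group names.
X = np.array([[24.3, 750.7, 0.2], [28.2, 533.2, 50.5],
              [2.1, 120.5, 90.7], [4.8, 110.9, 102.3]])
y = np.array(["Group A", "Group A", "Group B", "Group B"])

model = GaussianNB().fit(X, y)               # essentially nothing to tune
print(model.predict([[20, 500, 25]]))        # predicted group
print(model.predict_proba([[20, 500, 25]]))  # probabilistic prediction
```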
Resources: GitHub code, StatQuest, CS Cornell