# Naive Bayes Theorem

Logistic regression and multinomial regression are called *discriminative learning algorithms*, which learn p(y|x) directly.

Naive Bayes and linear/quadratic discriminant analysis are called *generative learning algorithms*, which model p(x|y) and p(y), then use Bayes' rule to derive p(y|x).

The only prerequisite here is Bayes' theorem, which we covered in the first section:

P(A|B) = P(B|A) * P(A) / P(B)

Let's follow an example.

**Q. Given the following statistics, what is the probability that a woman has cancer if she has a positive mammogram (MM) result?**

- One percent of women over 50 have breast cancer.
- Ninety percent of women who have breast cancer test positive on MM.
- Eight percent of women without breast cancer will nonetheless test positive on MM (false positives).

**Step 1**: Assign events to A and B. You want to know what a woman's probability of having cancer is, given a positive MM. For this problem, actually having cancer is A and a positive test result is B.

**Step 2**: List out the parts of the equation (this makes it easier to work the actual equation):

P(A) = 0.01

Probability of not A: P(~A) = 0.99

P(B|A)=0.9

P(B|~A)=0.08

**Step 3**: Insert the parts into the equation and solve. Note that P(B) = P(B|A) * P(A) + P(B|~A) * P(~A), so

P(A|B) = (0.9 * 0.01) / ((0.9 * 0.01) + (0.08 * 0.99)) ≈ 0.10

The probability of a woman having cancer, given a positive test result, is 10%.
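The three steps above can be sketched directly in code, using the numbers from the problem statement:

```python
# Bayes' rule for the mammogram example
p_cancer = 0.01             # P(A): prior probability of cancer
p_pos_given_cancer = 0.90   # P(B|A): true positive rate
p_pos_given_healthy = 0.08  # P(B|~A): false positive rate

# Total probability of a positive test: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)

# Posterior: P(A|B) = P(B|A)P(A) / P(B)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 2))  # → 0.1
```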

This gives us an elegant way to classify examples: spam/not spam, tumor/no tumor, and so on.

# Naive Bayes

Naive Bayes is a classification technique that uses Bayesian statistics. It makes the assumption that all features Xi are conditionally independent of each other given the class Y. That is, P(Xi|Xj,Y) = P(Xi|Y) where i ≠ j. The goal is to find the value of Y that is most likely given the Xi.

By Bayes' rule, P(y|x1,…,xn) ∝ P(y) * P(x1|y) * … * P(xn|y). Since the denominator P(x1,…,xn) is constant across all values of y, it can be ignored. Thus the goal is to find (the discriminant function)

ŷ = argmax over y of P(y) * ∏i P(xi|y)

To prevent underflow when multiplying many small values, use logs:

ŷ = argmax over y of [ log P(y) + Σi log P(xi|y) ]
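A quick sketch of why the log trick matters, with made-up likelihood values for one class:

```python
import math

# Hypothetical per-feature likelihoods for one class (illustrative values only)
likelihoods = [1e-5] * 100
prior = 0.5

# The direct product underflows to 0.0 in double precision
product = prior
for p in likelihoods:
    product *= p
print(product)  # → 0.0

# Summing logs keeps the score finite and usable for comparing classes
log_score = math.log(prior) + sum(math.log(p) for p in likelihoods)
print(log_score)
```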

# Numeric Features

If the original input feature is continuous, convert it to a discrete value by binning, allowing for the computation of probabilities.
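A minimal sketch of such binning, with arbitrary bin edges chosen for illustration:

```python
# Map a continuous value to the index of the bin it falls into.
def bin_value(x, edges):
    """Return the bin index of x, given sorted bin edges."""
    for i, edge in enumerate(edges):
        if x < edge:
            return i
    return len(edges)

edges = [10.0, 20.0, 30.0]  # 4 bins: (-inf,10), [10,20), [20,30), [30,inf)
print([bin_value(v, edges) for v in [5.0, 15.0, 25.0, 42.0]])  # → [0, 1, 2, 3]
```

Once every feature is discrete, the per-bin conditional probabilities can be estimated from counts as usual.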

# Laplace Smoothing

The probability estimates are vulnerable to rare events: if a feature value never occurs with a class in the training data, its estimated probability is 0, which cancels out the entire product. This is especially a problem when the data set is small.

The solution is to use Laplace estimates: add 1 to all counts. This smoothing step has a large effect when data sets are small and a negligible effect when data sets are large.

This changes the probability calculations to

P(xi|y) = (count(xi, y) + 1) / (count(y) + K)

where K is the number of possible values of xi.

Now let's take an example.

From this data, compute the conditional and marginal probabilities.

However, there are 0s in the data. This is concerning because multiplying anything by zero cancels the whole product. Laplace estimates can be used to adjust the counts (add 1 to all counts), accounting for rare events that happen not to occur in this training set.

Given information of a fruit that is long, sweet and yellow, predict what class this fruit belongs to.

Note that, as above, terms like P(long|apple), P(sweet|apple), and P(yellow|apple) can be written separately because we assume that being long, sweet, and yellow are all conditionally independent given the class.

Based off the evidence, this sweet, long, and yellow fruit is a banana.
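The whole calculation can be sketched in code. The original data table is not reproduced here, so the counts below are made up purely for illustration; the mechanics (priors, smoothed likelihoods, argmax) are what matter:

```python
# Hypothetical counts per class: total fruits, and how many are long/sweet/yellow
counts = {
    "banana": {"total": 500, "long": 400, "sweet": 350, "yellow": 450},
    "orange": {"total": 300, "long": 0,   "sweet": 150, "yellow": 300},
    "other":  {"total": 200, "long": 100, "sweet": 150, "yellow": 50},
}
n_total = sum(c["total"] for c in counts.values())

def posterior_score(cls, features):
    """Unnormalized P(y) * prod_i P(x_i|y), with add-1 Laplace smoothing.

    Each feature is binary (present/absent), so the smoothed
    denominator is count(y) + 2.
    """
    c = counts[cls]
    score = c["total"] / n_total  # prior P(y)
    for f in features:
        score *= (c[f] + 1) / (c["total"] + 2)
    return score

features = ["long", "sweet", "yellow"]
best = max(counts, key=lambda cls: posterior_score(cls, features))
print(best)  # → banana
```

Note that the "orange" class survives the long = 0 count only because of the +1 smoothing; without it, its score would be exactly zero.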

# Gaussian Naive Bayes

Gaussian naive Bayes assumes that each Σk (the covariance matrix for the kth class) is diagonal, i.e., the features are independent Gaussians within each class. Thus

P(x|y=k) = ∏i N(xi; μik, σik²)

where μik and σik² are the mean and variance of feature i within class k.
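A minimal from-scratch sketch: fit a per-class, per-feature mean and variance, then classify by the highest log-posterior. The tiny two-class data set here is made up for illustration:

```python
import math

# Toy training data: class label -> list of 2-feature points (illustrative only)
train = {
    0: [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]],
    1: [[4.0, 5.0], [4.2, 4.8], [3.8, 5.2]],
}

def fit(data):
    """Estimate prior, per-feature mean, and per-feature variance per class."""
    params = {}
    n = sum(len(rows) for rows in data.values())
    for k, rows in data.items():
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / len(rows) + 1e-9  # variance floor
            for col, m in zip(zip(*rows), means)
        ]
        params[k] = (len(rows) / n, means, variances)
    return params

def log_gaussian(x, mu, var):
    """Log density of N(x; mu, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(params, x):
    def score(k):
        prior, means, variances = params[k]
        return math.log(prior) + sum(
            log_gaussian(xi, m, v) for xi, m, v in zip(x, means, variances)
        )
    return max(params, key=score)

params = fit(train)
print(predict(params, [1.1, 2.1]))  # → 0
print(predict(params, [4.1, 4.9]))  # → 1
```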

Now, what if we have missing labels? In that case we use the EM algorithm; let's leave that for another post.

Resources:

- http://cs229.stanford.edu/notes-spring2019/cs229-notes2.pdf
- http://jennguyen1.github.io/nhuyhoa/statistics/Discriminant-Analysis-Naive-Bayes.html