Now Logistic Regression and Multinomial Regression are called Discriminant learning algorithms which learn p(y|x) directly.

Naive bayes and linear/quadratic discriminant analysis are called Generative learning algorithms that try to model p(x|y) and p(y). They use Bayes rule to derive p(y|x).

Only thing you need to know here is Bayes Theorem, which we talk about in first section.

Remember the Bayes Theorem

Follow the example

Q. Given the following statistics, what is the probability that a woman has cancer if she has a positive mammogram(MM) result?

  • One percent of women over 50 have breast cancer.
  • Ninety percent of women who have breast cancer test positive on MM.
  • Eight percent of women will have false positives.

Step 1: Assign events to A and B. You want to know what a woman’s probability of having cancer is, given a positive MM. For this problem, actually having cancer is A and a positive test result is B.
Step 2: List out the parts of the equation (this makes it easier to work the actual equation):
Probability of Not A = P(~A)=0.99
Step 3: Insert the parts into the equation and solve. Note
P(B) = P(B|A)*P(A)+ P(B|~A)*P(~A), So
P(A|B) = (0.9 * 0.01) / ((0.9 * 0.01) + (0.08 * 0.99) = 0.10.

The probability of a woman having cancer, given a positive test result, is 10%.

Now this gives us beautiful way of classification examples like mail is spam/not spam, tumor/no tumor etc.

Naive Bayes

Since the denominator is constant across all values of y, it can be ignored. Thus the goal is to find (the discriminant function)

To prevent underflow of multiplying small values, use logs

Numeric Features

Laplace Smoothing

So the solution to this is to do Laplace estimates by adding 1 to all counts. This smoothing step’s effect is great when data sets are small and doesn’t have an effect on calculations when data sets are large.

This effect changes the probability calculations to

Now shall we take an example,

From this data, compute the conditional and marginal probabilities.

However, there are 00s in the data. This is concerning because multiplying anything by zero automatically cancels it out. Laplace estimates can be used to adjust the counts (add 1 to all counts). This accounts for very rare occurrences that may not occur in this training data set.

Given information of a fruit that is long, sweet and yellow, predict what class this fruit belongs to.

Please Note that as in above P(long|apple), P(sweet|apple), P(yellow|apple) can be written in silos because we are assuming that being long or sweet or yellow are all conditionally independent.

Based off the evidence, this sweet, long, and yellow fruit is a banana.

Gaussian Naive Bayes

Now what if we have missing labels, in which case we use EM Algorithm. Let’s delay it to another post.


Problem Solver, Data Science, Actuarial Science, Knowledge Sharer, Hardcore Googler

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store