SoftMax Regression

Shaily jain
3 min read · Sep 20, 2020


This is the first kind of multiclass classification that I studied. Jotting down what I learnt about it.

There is literally a reason for calling it softmax: softmax is the activation function we choose for our logistic regression case here.

Just like we used the sigmoid function

σ(z) = 1 / (1 + e^(−z))

as our activation function in vanilla logistic regression, in softmax regression we use

softmax(z)_j = e^(z_j) / Σ_i e^(z_i),  for j = 1, …, k

called the softmax activation function. This function has the actual softmax function as part of it, which is

softmax(x_1, …, x_n) = log(e^(x_1) + … + e^(x_n))

This means that the actual softmax is an approximation of the max function, but a differentiable version of it, as shown below.

Blue: SoftMax(0, x), Red: Max(0, x)
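A quick way to see this numerically (my own illustration, not from the original post): log(e^0 + e^x) stays close to max(0, x) for any x, but it is smooth everywhere instead of having a kink at 0.

import numpy as np

for x in [-5.0, 0.0, 1.0, 5.0, 10.0]:
    smooth = np.log(np.exp(0.0) + np.exp(x))   # SoftMax(0, x), the differentiable version
    hard = max(0.0, x)                          # Max(0, x)
    print(f"x = {x:5.1f}   softmax = {smooth:7.4f}   max = {hard:7.4f}")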

So we have seen that the softmax activation function is built on the softmax function, hence the name. To read more about the name, go through this article.

Differentiability is a must for any activation function, so that we can maximise the log-likelihood and thereby arrive at an update rule for the parameters/weights of the final prediction expression. To see the actual math behind the differentiability, go through this video.

Dataset used: the Iris dataset from sklearn.datasets, which has three classes.
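Before the individual functions, here is a minimal setup sketch. The names X, y, m, k, theta and alpha are my assumptions based on how they are used in the snippets below; the actual repository may set things up differently (for example with a bias column or a train/test split).

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data                # (150, 4) feature matrix
y = iris.target              # integer class labels 0, 1, 2
m, n = X.shape               # m = number of observations, n = number of features
k = len(np.unique(y))        # k = number of classes (3)

theta = np.zeros((n, k))     # one weight column per class
alpha = 0.1                  # learning rate (assumed value)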

Process followed:

  • Functions created: softmax activation function, cost function, one-hot encoding, and gradient calculator.

1. Softmax Activation function
def softmax(z):
    """
    Softmax activation function. Subtracting the maximum first gets rid of
    very large values inside np.exp (it avoids overflow without changing
    the result); the normalisation is done row by row, so each observation
    gets its own probability distribution.
    """
    z = z - np.max(z, axis=-1, keepdims=True)
    return np.exp(z) / np.sum(np.exp(z), axis=-1, keepdims=True)

2. Cost Function: the cross-entropy function, which has to be minimised

def J(preds, y):
    # Cross-entropy cost: sum of -log(predicted probability of the true class)
    m = len(y)
    return np.sum(-np.log(preds[np.arange(m), y]))
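To see what the indexing preds[np.arange(m), y] does, here is a small made-up example (mine, not from the post). It picks out, for each observation, the predicted probability of its true class, so the cost is small only when those probabilities are close to 1.

preds_demo = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.2, 0.3, 0.5]])
y_demo = np.array([0, 1, 2])        # true class of each observation
print(J(preds_demo, y_demo))        # -log(0.7) - log(0.8) - log(0.5) ≈ 1.27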

3. One Hot Encoding

def T(y, k):
    """One hot encoding"""
    one_hot = np.zeros((len(y), k))
    one_hot[np.arange(len(y)), y] = 1
    return one_hot
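For instance (an illustrative check of my own):

print(T(np.array([0, 2, 1]), 3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]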

4. Gradient Calculator

def compute_gradient(theta, X, y):
    # Average gradient of the cross-entropy cost w.r.t. theta:
    # (1/m) * X^T (predicted probabilities - one-hot true labels)
    preds = h(X, theta)
    gradient = 1/m * X.T @ (preds - T(y, k))
    return gradient
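compute_gradient calls a hypothesis function h that is not defined in the snippets above. A minimal sketch of it, assuming the class scores are simply the linear combination X @ theta (the actual repository may differ, for example by adding a bias column):

def h(X, theta):
    # Predicted probabilities: softmax of the linear scores, one row per observation
    return softmax(X @ theta)

With h in place, the gradient-descent loop below runs as written.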
# Gradient descent: take 1500 steps against the gradient
for i in range(1500):
    gradient = compute_gradient(theta, X, y)
    theta -= alpha * gradient

where preds is the matrix of predicted class probabilities, with one row per observation.
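Once the loop has run, a quick check of the fit (my own addition, not from the post) is to pick the most probable class for each observation and compare it with the true labels:

preds = h(X, theta)
predicted_classes = np.argmax(preds, axis=1)   # most probable class per observation
print("training accuracy:", np.mean(predicted_classes == y))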

Softmax turns out to be an important function, so we may well meet it again in neural nets.

Till then, if you would like to follow my code from scratch, check out my repository. You can also check out all the references mentioned in it.

Check out my other articles as well.

I would really appreciate it if you left a note after following till the end. Any recommendations and thoughts are welcome.

Cheers to learning!!! 💪
Shaily Jain
