Understanding Stochastic Gradient Descent in Linear Regression

Shaily Jain
4 min read · Sep 12, 2020

Every algorithm is, at heart, an optimization problem associated with a loss function, and how we minimize that loss is the question at hand.

Luckily, we have quite a few ways to get there.

Here we consider one such method, Stochastic Gradient Descent, and we build it from scratch, because that is how I remember the loops in the code and what the built-in Python functions actually do.

I hope this repository keeps things clear; I am putting down a short and crisp process and the key elements to remember.

Dataset used: california_housing from sklearn.datasets

Data Cleaning performed:

  • The target variable (house price) is capped at 5 in the raw data; removing these capped entries drops 992 rows.
  • The variables were scaled using Min-Max scaling (both cleaning steps are sketched in the code just below).
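For reference, here is a minimal sketch of those two cleaning steps, assuming the data is loaded with sklearn's fetch_california_housing (the variable names are illustrative, not taken from my repository):

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import MinMaxScaler

# Load features and target together as one DataFrame
housing = fetch_california_housing(as_frame=True)
df = housing.frame  # the target column is 'MedHouseVal'

# Drop the rows where the target sits at the cap of 5 (the capped entries noted above)
df = df[df["MedHouseVal"] < 5].copy()

# Min-Max scale the feature columns to the [0, 1] range
df[housing.feature_names] = MinMaxScaler().fit_transform(df[housing.feature_names])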

Process followed: We aim to minimize the squared error loss, i.e. the mean of the squared residuals, L = (1/N) Σᵢ (yᵢ − ŷᵢ)².

  • After removing the outliers and scaling the variables, we build a Simple Linear Regression class with one independent and one dependent variable, having methods fit (calculate the coefficients), coeffs (display the calculated coefficients), predict (get predictions for new test data) and r_squared (display the R-squared value). This uses Ordinary Least Squares, which is a non-iterative technique, as opposed to Gradient Descent.
    For the equation y = m·x + b, the closed-form solution is
def fit(self, X, y):
    self.X = X
    self.y = y
    # Slope: m = (E[xy] - E[x]E[y]) / (E[x^2] - E[x]^2)
    self.m = (np.mean(X * y) - np.mean(X) * np.mean(y)) / (np.mean(X ** 2) - np.mean(X) ** 2)
    # Intercept: b = E[y] - m * E[x]
    self.b = np.mean(y) - self.m * np.mean(X)
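The remaining methods are straightforward once m and b are stored; a rough sketch of how predict and r_squared might look (not the exact code from my repository):

def predict(self, X_new):
    # y_hat = m * x + b for each new observation
    return self.m * np.array(X_new) + self.b

def r_squared(self):
    # R^2 = 1 - SS_residual / SS_total, computed on the training data
    y_hat = self.m * self.X + self.b
    ss_res = np.sum((self.y - y_hat) ** 2)
    ss_tot = np.sum((self.y - np.mean(self.y)) ** 2)
    return 1 - ss_res / ss_tot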
  • Next we plot the best-fit line between MedInc, the independent variable most highly correlated with price, and the target prices (a plotting sketch follows below).
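A rough illustration of that plot, assuming slr is a fitted instance of the class above and df is the cleaned DataFrame from earlier (both names are just for illustration):

import matplotlib.pyplot as plt

# Scatter the data and overlay the fitted line y = m*x + b
plt.scatter(df["MedInc"], df["MedHouseVal"], s=2, alpha=0.3, label="data")
x_line = np.linspace(df["MedInc"].min(), df["MedInc"].max(), 100)
plt.plot(x_line, slr.m * x_line + slr.b, color="red", label="best fit line")
plt.xlabel("MedInc (scaled)")
plt.ylabel("Median house value")
plt.legend()
plt.show()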
  • We then build a Multiple Linear Regression class for all the variables plus a constant. Here linear algebra yields the normal-equation solution β = (XᵀX)⁻¹Xᵀy:
def fit(self, X, y):
    X = np.array(X)
    y = np.array(y)
    # Normal equation: beta = (X^T X)^(-1) X^T y
    self.coeffs = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
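The constant is handled by prepending a column of ones to X before calling fit, so the first coefficient becomes the intercept; for example (the class and variable names here are illustrative):

# Prepend a column of ones so the first coefficient is the intercept
X = np.column_stack([np.ones(len(df)), df[housing.feature_names].values])
y = df["MedHouseVal"].values

mlr = MultipleLinearRegression()
mlr.fit(X, y)
print(mlr.coeffs)  # intercept first, then one coefficient per feature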
  • Other ways of computing the Linear Regression solution can be found here, namely via QR decomposition and Singular Value Decomposition (SVD); a quick sketch of both follows below.
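For comparison, a quick sketch of both alternatives with numpy (lstsq uses an SVD-based solver internally):

# QR decomposition: X = QR, then solve the triangular system R @ beta = Q^T y
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T.dot(y))

# SVD-based least squares via numpy's lstsq
beta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)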
  • Next we move on to iterative methods of minimizing our squared error loss, starting with Batch Gradient Descent. This method starts from an initial value that we specify and updates the coefficients using the entire dataset at every step.
    Common terminology: an Epoch is one full pass over the data, and the Learning rate (an input parameter, alpha below) controls the size of each step towards the minimum. For the loss above the gradients are ∂L/∂m = −(2/N) Σ xᵢ(yᵢ − ŷᵢ) and ∂L/∂b = −(2/N) Σ (yᵢ − ŷᵢ), which gives the following update rule:
# m and b start from the initial values we specify, N = len(X), alpha = learning rate
for _ in range(epoch):
    f = y - (m * X + b)                    # residuals on the full dataset
    # Updating m and b by stepping against the gradient
    m -= alpha * (-2 * X.dot(f).sum() / N)
    b -= alpha * (-2 * f.sum() / N)
  • But Batch Gradient Descent is slow, since every iteration requires computing y_pred for all observations before the next update can be made. This is where Stochastic Gradient Descent comes into play: it updates the coefficients using only a small random sample of observations at a time (a single one in the pure stochastic case), so each update is cheap. The update rule is as follows:
for _ in range(epoch):
    # Random sample of batch_size observations
    indexes = np.random.randint(0, len(X), batch_size)
    Xs = np.take(X, indexes)
    ys = np.take(y, indexes)   # note: sample y here, not X
    N = len(Xs)

    f = ys - (m * Xs + b)      # residuals on the sampled batch only

    # Updating parameters m and b
    m -= alpha * (-2 * Xs.dot(f).sum() / N)
    b -= alpha * (-2 * f.sum() / N)

The batch size here decides whether this is pure SGD (batch_size = 1) or Mini-Batch Gradient Descent (batch_size > 1).
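To make that concrete, here is a self-contained sketch that wraps the loop above in a function; the function name, its signature and the example arrays (reusing df from the cleaning sketch above) are illustrative, not the repository's exact API:

import numpy as np

def sgd_fit(X, y, alpha=0.01, epoch=1000, batch_size=1, seed=0):
    # One gradient update per epoch, computed on a random batch of observations
    rng = np.random.default_rng(seed)
    m, b = 0.0, 0.0
    for _ in range(epoch):
        indexes = rng.integers(0, len(X), batch_size)  # random sample of rows
        Xs, ys = X[indexes], y[indexes]
        f = ys - (m * Xs + b)                          # residuals on the batch only
        m -= alpha * (-2 * Xs.dot(f) / batch_size)
        b -= alpha * (-2 * f.sum() / batch_size)
    return m, b

# batch_size = 1 gives pure SGD, batch_size > 1 gives mini-batch gradient descent
X_medinc = df["MedInc"].values
y_prices = df["MedHouseVal"].values
m1, b1 = sgd_fit(X_medinc, y_prices, batch_size=1)
m32, b32 = sgd_fit(X_medinc, y_prices, batch_size=32)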

Lastly, you can replicate all of the above code from scratch from my repository.
I would really appreciate it if you leave a note once you have followed till the end. Any recommendations and thoughts are welcome.

Cheers to learning!!! 💪
Shaily Jain

Be a part of my Instagram community 👇
