Every Machine Learning Problem Is an Optimization Problem.

Now bring out your toolkit and let’s get things rolling.

Characteristics of convex functions that make them so special and dear to data scientists:

  1. Every local minimum of a convex function is a global minimum, so there is no problem of fake local minima disturbing your results.
  2. If the function is strictly convex, this global minimum is also unique, so bingo: you can recover the ground-truth argmin of your cost/loss function.
  3. Convexity is preserved under addition and non-negative scaling, so you can add regularization terms and weights to your loss function without losing convexity.
  4. There are tons of techniques…
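To make point 3 concrete, here is a minimal sketch in plain Python (illustrative numbers only): a convex squared loss plus a non-negatively weighted convex penalty is still convex, so vanilla gradient descent lands on the unique global minimum no matter where it starts.

```python
# Two convex functions: a squared loss and an L2-style penalty (regularizer).
loss = lambda w: (w - 3.0) ** 2        # convex, minimized at w = 3
penalty = lambda w: 0.1 * w ** 2       # convex, non-negative weight 0.1

# Their sum is convex too, so gradient descent cannot get stuck in a fake
# local minimum: it converges to the unique global argmin.
grad = lambda w: 2.0 * (w - 3.0) + 0.2 * w   # derivative of loss + penalty

w = 10.0                               # arbitrary starting point
for _ in range(500):
    w -= 0.1 * grad(w)

# Analytic global minimum: 2(w - 3) + 0.2w = 0  ->  w = 6/2.2 ≈ 2.727
print(round(w, 3))                     # -> 2.727
```

Restarting from any other initial `w` gives the same answer, which is exactly the guarantee non-convex losses lack.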

Hi! Isn’t PCA a hyped method for dimensionality reduction? All for good reasons. Let’s look at what the maths has to say about this non-parametric method.

  • Principal component analysis (PCA) orthogonally transforms the original n numeric dimensions of a dataset into a new set of m dimensions called principal components. As a result of the transformation, the first principal component has the largest possible variance; each succeeding principal component has the highest possible variance under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding principal components. Keeping only the first m < n principal components reduces the dimensionality of the data while retaining most of its variance.
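The transform above can be sketched in a few lines of NumPy (a hand-rolled illustration on made-up data, not production code; in practice you would reach for scikit-learn's PCA): centre the data, eigendecompose its covariance matrix, and project onto the top m eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples in n = 3 correlated numeric dimensions,
# but with essentially rank-2 structure plus a little noise.
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.5, 1.0],
                                          [0.0, 1.0, -0.5]])
X += 0.05 * rng.normal(size=X.shape)

# Centre, then eigendecompose the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]          # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep only the first m < n principal components.
m = 2
Z = Xc @ eigvecs[:, :m]                    # the reduced representation

explained = eigvals[:m].sum() / eigvals.sum()
print(Z.shape, round(explained, 3))        # almost all variance survives
```

Because the toy data is nearly rank-2, the first two components capture essentially all of the variance.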

Hey, I know enough about PCA already. But why not a quick refresher?

It is a great algorithm with certain watch-outs, which can mostly be tackled with adjustments to vanilla PCA.


  • Check whether some assumptions hold, such as the presence of linear correlations (e.g., Bartlett’s test of sphericity) or sampling adequacy (e.g., the Kaiser-Meyer-Olkin test). Distinctness of the eigenvalues is a fundamental assumption of PCA; it ensures that the principal components are unique.
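As an illustration, here is a minimal sketch of Bartlett's test of sphericity written out from its textbook formula (packages such as factor_analyzer ship a ready-made version; the data below is synthetic). It tests whether the correlation matrix is an identity, i.e., whether there are any linear correlations for PCA to exploit.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """H0: the correlation matrix is the identity (no linear correlations)."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # Test statistic: -(n - 1 - (2p + 5)/6) * ln|R|, chi-square, p(p-1)/2 df.
    stat = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    return stat, chi2.sf(stat, p * (p - 1) / 2)

rng = np.random.default_rng(1)
base = rng.normal(size=(300, 1))
# Four strongly correlated columns: sphericity should be clearly rejected.
X = np.hstack([base + 0.1 * rng.normal(size=(300, 1)) for _ in range(4)])

stat, pval = bartlett_sphericity(X)
print(pval < 0.05)   # True: PCA has correlations to work with
```

If the p-value were large, the variables would be essentially uncorrelated and PCA would have little to compress.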

Turning scary-looking, unwieldy data into a manageable, small number of dimensions while retaining the properties of the original data.

Yes, this is what we are doing here.

The essence is to look for columns that add little or no new information to what the data set already says. Dimensionality reduction is usually performed after data cleaning and scaling and before training a predictive model, although it is often also done after modelling, purely for visualization.

We start with the methods: the most common techniques and uses of dimensionality reduction.


  1. Feature Selection
    - Missing-value ratio (drop variables with a large proportion of missing values)
    - Low-variance filter (within-column variance is low…

Feature selection is a simple way to reduce the dimensions of your data, and it is easy to understand as well.

The most common techniques are listed below:

  • Ratio of missing values: Data columns with a ratio of missing values greater than a given threshold can be removed.
  • Low variance in the column values: Columns with little change in the data carry little information, so data columns with a variance lower than a given threshold can be removed. Notice that the variance depends on the column range, so normalization is required before applying this technique. It can be used only for numerical data.
  • High correlation between two columns: Data…
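All three filters can be sketched with pandas on a toy frame (the column names and thresholds below are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
signal = rng.uniform(size=100)
df = pd.DataFrame({
    "mostly_missing": [np.nan] * 80 + list(rng.uniform(size=20)),  # 80% missing
    "nearly_constant": [5.0] * 99 + [5.1],                         # almost no spread
    "signal": signal,
    "duplicate_signal": 2 * signal + 0.01 * rng.normal(size=100),  # ~same info
})

# 1. Ratio of missing values: drop columns missing more than 50% of entries.
df = df.loc[:, df.isna().mean() <= 0.5]

# 2. Low variance: normalize first (variance depends on the column range).
scaled = (df - df.min()) / (df.max() - df.min())
df = df.loc[:, scaled.var() > 0.05]

# 3. High correlation: for each highly correlated pair, drop one column.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
df = df.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])

print(list(df.columns))   # -> ['signal']
```

Only the informative column survives: the others carried missing values, no variance, or redundant information.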

One step closer to our goal of knowing everything.

When we talk about the mean, median, mode, range, variance, and skewness of the data, we are essentially talking about descriptive statistics. It is undoubtedly the first and best thing you can do when your mailbox is hit with data from your manager.
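As a quick refresher before we leave descriptive statistics behind, all of those summaries are one-liners in pandas (toy numbers for illustration):

```python
import pandas as pd

data = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])
summary = {
    "mean": data.mean(),
    "median": data.median(),
    "mode": data.mode()[0],            # most frequent value
    "range": data.max() - data.min(),
    "variance": data.var(),            # sample variance (ddof=1)
    "skewness": data.skew(),           # positive: tail to the right
}
print(summary)
```

For this series the mean is 5.0, the median 4.5, the mode 4, and the range 7, with a positive skew from the long right tail.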

But today’s post is going to be all about inferential statistics.

Must-have books for your library:

  1. John E. Freund’s Mathematical Statistics with Applications
    Irwin Miller and Marylees Miller
  2. Basic Econometrics
    Damodar Gujarati, McGraw-Hill

We are about to discuss pure beauty, so stay with me for this.

Bagging algorithms:

  • Bagging meta-estimator
  • Random forest

Boosting algorithms:

  • AdaBoost
  • GBM (Gradient Boosting Machine)
  • XGBoost
  • LightGBM
  • CatBoost

Bagging meta-estimator

  1. Random subsets are created from the original dataset (Bootstrapping).
  2. Each subset includes all of the features.
  3. A user-specified base estimator is fitted on each of these smaller sets.
  4. Predictions from each model are combined to get the final result.
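The four steps above map directly onto scikit-learn's BaggingClassifier (a minimal sketch on synthetic data; by default the base estimator is a decision tree, but any user-specified estimator can be passed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Steps 1-4: draw bootstrap subsets (keeping all features), fit the base
# estimator on each, then combine the 50 predictions by voting.
bag = BaggingClassifier(n_estimators=50, random_state=0)
bag.fit(X_tr, y_tr)

acc = bag.score(X_te, y_te)
print(round(acc, 3))
```

The averaging over bootstrap replicas is what reduces the variance of the individual base estimators.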

Random Forest

  1. Random subsets are created from the original dataset (bootstrapping).
  2. At each node in the decision tree, only a random subset of the features is considered when deciding the best split.
  3. A decision tree model…
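A minimal scikit-learn sketch on synthetic data: `max_features` is the random-forest twist on plain bagging, restricting each split to a random subset of the features (the square root of the feature count here).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bootstrapped trees, but with only sqrt(n_features) candidate features
# considered at each split, which decorrelates the trees.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0)
rf.fit(X_tr, y_tr)

acc = rf.score(X_te, y_te)
print(round(acc, 3))
```

Restricting the features per split is what keeps the trees diverse; averaging diverse trees is what makes the forest strong.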

As the name says, it is Extreme Gradient Boosting.

It has the following properties:

  • Gradient Boosting
  • Regularization
  • A unique regression tree (discussed below)
  • Approximate greedy algorithm, weighted quantile sketch, parallel learning: it chooses the absolute best choice at each step. Instead of checking every threshold for numerical data, it checks quantiles (about 33 of them, found with a sketch algorithm). These are not ordinary quantiles, where each band has a similar number of observations; instead, each band has a similar sum of weights. These weights are nothing but the cover (e.g., p(1-p) for classification) discussed under Gradient Boosting. …

Extreme Gradient Boosting: Nasty Math

XGBoost optimizes a loss function that must be prespecified, as below.
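The equation referenced here did not survive extraction. As a reconstruction from the original XGBoost paper (Chen & Guestrin), the regularized objective is, with $l$ the prespecified differentiable loss, $f_k$ the $k$-th regression tree, $T$ its number of leaves, and $w$ its vector of leaf weights:

```latex
\mathcal{L}(\phi) = \sum_{i} l\left(\hat{y}_i, y_i\right) + \sum_{k} \Omega(f_k),
\qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}
```

The second term is the regularization from the properties list above: $\gamma$ penalizes the number of leaves and $\lambda$ shrinks the leaf weights.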

Shaily Jain

Problem Solver, Data Science, Actuarial Science, Knowledge Sharer, Hardcore Googler
