XGBoost: Ensemble Technique

Shaily jain
4 min read · May 11, 2021


As the name says, XGBoost stands for Extreme Gradient Boosting.

It has the following properties:

  • Gradient Boosting
  • Regularization
  • A unique regression tree (discussed below)
  • Approximate Greedy Algorithm, Weighted Quantile Sketch, Parallel Learning: The greedy algorithm chooses the absolute best split at each step. Instead of checking every possible threshold for a numerical feature, XGBoost checks only quantiles (about 33 of them), found with a sketch algorithm. These are weighted quantiles: rather than each band containing a similar number of observations, each band carries a similar sum of weights, where the weights are the cover (e.g. p(1−p) for classification) discussed in Gradient Boosting.
    This ensures that more weight is given to observations that are classified with low confidence.
  • Sparsity-Aware Split Finding: Handles missing data. After calculating the residuals (which we can do because the initial prediction is 0.5), a missing value is sent to the left or the right leaf, whichever gives the higher gain. The same default direction is used when predicting for a new observation with a missing feature value.
  • Cache-Aware Access: The CPU has cache, main memory, and hard-drive storage. Cache memory is the fastest, so XGBoost optimizes by storing the gradients and Hessians in cache, so that similarity scores and gains can be calculated quickly.
  • Blocks for Out-of-Core Computation: The data is kept on disk in compressed blocks and the CPU decompresses them, so we don’t have to read from the hard drive repeatedly, which saves time. If there is more than one hard drive, XGBoost can also use sharding, i.e. dividing the data across drives and reading them in parallel.
  • XGBoost can also save computational time by building each tree on a random subset of observations or a random subset of features from the sample (several of these options appear in the sketch after this list).
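To make these options concrete, here is a minimal sketch (my own, not from the article) of how several of the properties above surface as parameters of the xgboost Python package; the data is random and purely illustrative.

```python
import numpy as np
import xgboost as xgb

# Toy data; every tenth value of one feature is missing, to exercise
# sparsity-aware split finding.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
X[::10, 2] = np.nan
y = rng.random(200)

model = xgb.XGBRegressor(
    n_estimators=50,
    learning_rate=0.3,       # eta
    tree_method="approx",    # approximate greedy algorithm + weighted quantile sketch
    subsample=0.8,           # random subset of observations per tree
    colsample_bytree=0.8,    # random subset of features per tree
    reg_lambda=1.0,          # lambda in the similarity score
    gamma=0.0,               # minimum gain required to keep a split (pruning)
    min_child_weight=1.0,    # minimum cover per leaf
)
model.fit(X, y)
```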

Before diving in, make sure you have fully understood Gradient Boosting.

Let's start. Here we build XGBoost trees for the residuals, unlike the normal off-the-shelf trees used in Gradient Boosting.

Construction of XGBoost Tree

Start by taking the average of all observations as the initial prediction and find the corresponding residuals.

Start with a single node containing all the residuals and calculate its similarity score.

Since the residuals are added before being squared, residuals of opposite sign cancel out, so a leaf with mixed residuals gets a low similarity score.

Next, split this node over the feature set (trying different thresholds for numerical features as well) and again calculate the similarity scores for the left and right leaves.

Now calculate the gain due to each split: Gain = Left similarity + Right similarity − Root similarity.

The split with the highest gain is selected. The tree is grown further in the same way.

Next we prune the tree, subject to a minimum gain gamma: a branch is removed if its gain minus gamma is negative. Note that the tree can also get pruned when gamma = 0, which happens when the gain itself is negative.

Note that when lambda > 0, the similarity scores are lower, and the relative decrease is larger for leaves with fewer residuals. The regularization parameter also makes it easier to prune leaves, since gain minus gamma turns negative sooner. Overall, lambda reduces the sensitivity of the model to individual observations.
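As a quick illustration (toy numbers of my own, not from the article), the snippet below computes the regression similarity scores and the gain for one candidate split, and checks whether the split survives pruning against gamma.

```python
# Residuals of the observations reaching the node being split.
# Note the mixed signs: they partly cancel, so the root similarity is small.
residuals = [-10.5, 6.5, 7.5, -7.5]
lam = 1.0      # lambda, the regularization parameter
gamma = 0.0    # minimum gain required to keep the split

def similarity(res, lam):
    # (sum of residuals)^2 / (number of residuals + lambda)
    return sum(res) ** 2 / (len(res) + lam)

root = similarity(residuals, lam)        # 3.2
left = similarity(residuals[:1], lam)    # first observation goes left: 55.125
right = similarity(residuals[1:], lam)   # the rest go right: 10.5625

gain = left + right - root               # 62.4875
print("keep this split:", gain - gamma > 0)
```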

As in Gradient Boosting, we calculate the new prediction after this XGBoost tree by taking the previous prediction and adding the learning rate (eta) times the output value of the leaf, where the output value is the sum of the leaf's residuals divided by (number of residuals + lambda).
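A small sketch of that update for regression (again with toy numbers of my own): the leaf's output value is the sum of its residuals divided by (number of residuals + lambda), and the new prediction adds eta times that value to the previous prediction.

```python
prev_prediction = 0.5   # prediction coming out of the previous round
eta = 0.3               # learning rate
lam = 1.0               # lambda

leaf_residuals = [6.5, 7.5]   # residuals that ended up in this leaf
output_value = sum(leaf_residuals) / (len(leaf_residuals) + lam)   # 14 / 3 ≈ 4.67

new_prediction = prev_prediction + eta * output_value
print(new_prediction)   # ≈ 1.9
```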

This was about regression. Note that things are similar for classification too, with the obvious changes we saw in Gradient Boosting,

where the similarity score for classification is calculated as (sum of residuals)² / (sum of pᵢ(1 − pᵢ) + lambda), compared with (sum of residuals)² / (number of residuals + lambda) for regression.

A term to remember is Cover (min_child_weight), which equals the sum of pᵢ(1 − pᵢ) over the leaf for classification and the number of residuals for regression, i.e. basically the denominator of the similarity score minus lambda. Its default value is 1, so leaves with cover < 1 are pruned away.
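For example (toy numbers of my own), a classification leaf whose previous predicted probabilities are all 0.5 contributes 0.25 of cover per observation, so with the default min_child_weight = 1 it needs at least four observations to be allowed.

```python
p_prev = [0.5, 0.5, 0.5]                   # previous predicted probabilities in the leaf
cover = sum(p * (1 - p) for p in p_prev)   # 0.75
min_child_weight = 1.0                     # xgboost default
print("leaf allowed:", cover >= min_child_weight)   # False -> leaf gets pruned away
```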

So, starting with 0.5 as the initial probability (we could also choose the proportion of observations with class label 1), we take its log(odds): log(0.5/(1 − 0.5)) = 0. We make the new prediction by adding, to this log(odds), the learning rate times the output value of the corresponding leaf. Finally, to convert it back into a probability, we apply the sigmoid function (aka the logistic function).
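Here is that classification update as a short sketch (toy numbers of my own): start from the log(odds) of 0.5, add eta times the leaf's output value, and squash back to a probability with the sigmoid.

```python
import math

p0 = 0.5
log_odds = math.log(p0 / (1 - p0))   # = 0

eta = 0.3
lam = 1.0
# Output value of a classification leaf:
# sum of residuals / (sum of p*(1-p) + lambda)
leaf_residuals = [0.5, 0.5, -0.5]
p_prev = [0.5, 0.5, 0.5]
output_value = sum(leaf_residuals) / (sum(p * (1 - p) for p in p_prev) + lam)

new_log_odds = log_odds + eta * output_value
new_probability = 1 / (1 + math.exp(-new_log_odds))   # sigmoid / logistic function
print(new_probability)   # ≈ 0.52
```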

So, basically, the loss function is ½(y − predicted value)² for regression versus the sum of the (negative) log-likelihoods for classification.

MY SOCIAL SPACE

Instagram https://www.instagram.com/codatalicious/

LinkedIn https://www.linkedin.com/in/shaily-jain-6a991a143/

Medium https://codatalicious.medium.com/

YouTube https://www.youtube.com/channel/UCKowKGUpXPxarbEHAA7G4MA
