Ensemble Methods

Shaily Jain
5 min read · May 6, 2021

Long story short: in this post we are dealing with predictions made by combining multiple models.

Common Ensemble Methods:

Voting or Averaging:

Each base model can be built from a different split of the same training dataset with the same algorithm, from the same dataset with different algorithms, or by any other method.

Majority Voting:

  • Classify each test observation with every base model.
  • Pick the class predicted by the majority of the models. If no class wins a majority, the ensemble fails to give a stable prediction.
  • This can be made adaptive by giving certain models more weight, effectively counting their votes multiple times. This is called Weighted Voting (see the sketch below).
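A minimal sketch of majority and weighted voting with scikit-learn's VotingClassifier; the base models, weights, and toy dataset below are illustrative assumptions, not choices made in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)  # toy data for illustration

# Hard (majority) voting: each base model casts one vote per test observation.
hard_vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("knn", KNeighborsClassifier()),
                ("tree", DecisionTreeClassifier())],
    voting="hard",
)

# Weighted voting: the logistic regression's vote counts twice (assumed weights).
weighted_vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("knn", KNeighborsClassifier()),
                ("tree", DecisionTreeClassifier())],
    voting="hard",
    weights=[2, 1, 1],
)

hard_vote.fit(X, y)
weighted_vote.fit(X, y)
print(hard_vote.predict(X[:5]), weighted_vote.predict(X[:5]))
```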

Averaging: The voting technique above can be adapted to regression by taking the average of the predictions made by all base models. This can also be extended by giving heavier weights to particular models.
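The regression counterpart is a short sketch with VotingRegressor; the estimators and data are again illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, noise=10, random_state=42)  # toy data

# Plain averaging of the base models' predictions; pass `weights=[...]`
# to turn this into a weighted average instead.
avg = VotingRegressor(
    estimators=[("lin", LinearRegression()),
                ("knn", KNeighborsRegressor()),
                ("tree", DecisionTreeRegressor())],
)
avg.fit(X, y)
print(avg.predict(X[:5]))
```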

Stacking

  • Also known as stacked generalization, this is an ensemble method in which the base models are combined using another machine learning algorithm.
  • The basic idea is to train several machine learning algorithms on the training dataset and then use their predictions to generate a new dataset. This new dataset is used as input for the combiner (meta) machine learning algorithm.
  • Note that the base models can be of different types, hence stacking is heterogeneous.

We can understand the process through the following steps:

1. We split the data into two parts, viz. a training set and a test set. The training set is further split into K folds, just like in K-fold cross-validation.
2. A base model (e.g. k-NN) is fitted on K-1 folds and predictions are made for the Kth fold.
3. This process is repeated until every fold has been predicted.
4. The base model is then fitted on the whole training set and its predictions are computed for the test set.
5. We repeat steps 2-4 for the other base models (e.g. SVM, decision tree, neural network, etc.).
6. The out-of-fold predictions on the training set are used as features for the second-level model.
7. The second-level model is used to make predictions on the test set.

The outputs from the base models used as input to the meta-model may be real values in the case of regression, and probability values, probability-like values, or class labels in the case of classification.
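A minimal sketch of the out-of-fold procedure above, using cross_val_predict to build the second-level features; the base models, the meta-model, and the toy data are illustrative assumptions rather than prescriptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)  # step 1

base_models = [KNeighborsClassifier(), SVC(probability=True), DecisionTreeClassifier()]

# Steps 2-5: out-of-fold predictions on the training set via K-fold,
# then refit each base model on the full training set to predict the test set.
train_meta, test_meta = [], []
for model in base_models:
    oof = cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    train_meta.append(oof)
    model.fit(X_train, y_train)
    test_meta.append(model.predict_proba(X_test)[:, 1])

# Steps 6-7: the second-level model is trained on the out-of-fold predictions
# and evaluated on the base models' test-set predictions.
meta = LogisticRegression()
meta.fit(np.column_stack(train_meta), y_train)
print(meta.score(np.column_stack(test_meta), y_test))
```

If you prefer not to write the loop yourself, scikit-learn also packages this pattern as StackingClassifier and StackingRegressor.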

Blending

Unlike stacking, where out-of-fold predictions are made over the entire training set, blending carves a holdout set out of the training set, and the base models make predictions only on that holdout. The holdout data and these predictions are then used to build the meta-model, which is run on the test set. The meta-model works much like the voting techniques discussed earlier: for regression it could simply be a weighted average, and for classification it could combine predicted probabilities, so-called soft voting (in contrast to hard voting, which picks the class that receives the most votes).
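A rough sketch of blending under the same assumptions (the split sizes, base models, and meta-model are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Carve a holdout set out of the training set; the base models never see it while fitting.
X_fit, X_hold, y_fit, y_hold = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

base_models = [KNeighborsClassifier(), DecisionTreeClassifier()]

hold_meta, test_meta = [], []
for model in base_models:
    model.fit(X_fit, y_fit)
    hold_meta.append(model.predict_proba(X_hold)[:, 1])   # predictions on the holdout only
    test_meta.append(model.predict_proba(X_test)[:, 1])

# The meta-model is trained on the holdout predictions and then run on the test set.
meta = LogisticRegression()
meta.fit(np.column_stack(hold_meta), y_hold)
print(meta.score(np.column_stack(test_meta), y_test))
```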

Bagging (Bootstrap Aggregating)

  • Create multiple models with the same algorithm, each trained on a random sub-sample drawn from the original dataset using the bootstrap sampling method (sampling with replacement).
  • The second step in bagging is aggregating the generated models. Well-known methods such as voting and averaging are used for this purpose.
  • In bagging, the sub-samples can be generated independently of each other, so generation and training can be done in parallel.
  • Example: the Random Forest algorithm uses the bagging technique with some differences; it adds random feature selection, and its base learner is a decision tree (see the sketch below).
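A minimal sketch of bagging with scikit-learn's BaggingClassifier (the base estimator and parameters are illustrative); setting n_jobs=-1 exploits the fact that the bootstrap samples can be trained in parallel.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # toy data

# 100 decision trees, each fitted on a bootstrap sample drawn with replacement;
# the trees are aggregated by voting and can be trained in parallel (n_jobs=-1).
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        bootstrap=True, n_jobs=-1, random_state=0)
bag.fit(X, y)

# Random Forest: bagging of decision trees plus random feature selection at each split.
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(X, y)
print(bag.score(X, y), rf.score(X, y))
```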

Boosting

  • Convert weak models (substantial error rate, but better than random guessing) into a strong model.
  • Boosting incrementally builds an ensemble by training each model on the same dataset, but with the instance weights adjusted according to the errors of the previous model. The main idea is to force the models to focus on the instances that are hard to predict.
  • Unlike bagging, boosting is a sequential method, so the base models cannot be trained in parallel.
  • Example: AdaBoost is an incremental method that uses decision trees as base models (see the sketch below). Its "adjust the dataset" step works by reweighting the training instances, and its "combine models" step uses weighted voting.
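A minimal sketch of AdaBoost with scikit-learn (the parameters are illustrative assumptions): shallow decision trees are trained sequentially on reweighted data, and the final prediction is made by weighted voting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # toy data

# Each round fits a depth-1 tree (a "stump") on the reweighted training data;
# rounds cannot run in parallel because each one depends on the previous model's errors.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))
```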

I cover some commonly used bagging and boosting techniques in another post. Watch out for it.

Advantages:

  • Higher Predictive Accuracy
  • Lower variance and lower bias (bagging mainly reduces variance; boosting mainly reduces bias)
  • Deeper understanding of the data
  • Stable and more robust in predictions
  • Statistical: when the training data is small relative to the size of the hypothesis space we want to search, averaging reduces the risk of choosing the wrong classifier.
  • Computational: individual models can get stuck in local optima; averaging them can lead to a better overall prediction.
  • Representational: individual learners may not span a large enough space to contain the real solution; combining them enables search in an expanded space.

Disadvantages:

  • Lower interpretability
  • Ensembles cost more to create, train, and deploy
  • Selecting the ensemble model and its parameters is an art
  • There is a risk of overfitting due to the large number of parameters
