Long story short: in this post we are dealing with predictions made by a combination of multiple models.

Common Ensemble Methods:

Voting or Averaging:

Each base model can be built from different splits of the same training dataset using the same algorithm, from the same dataset using different algorithms, or by any other method.

Majority Voting:

  • Classify each test observation with each base model.
  • Pick the prediction made by the majority of models. If no class wins a majority, the ensemble fails to give a stable prediction.
  • This can be made adaptive by giving weights to specific models, effectively counting them multiple times. This is called Weighted Voting.
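The voting rules above can be sketched in a few lines. This is a minimal illustration (the function name and tie-handling choice are mine, not from any particular library); plain majority voting is just the special case where every weight is 1.

```python
from collections import Counter

def weighted_vote(predictions, weights=None):
    """Combine class predictions from several base models.

    predictions: list of predicted labels, one per model.
    weights: optional per-model weights (defaults to 1 each,
             which gives plain majority voting).
    Returns the winning label, or None when no label has a
    strict majority of the total weight standing alone on top.
    """
    if weights is None:
        weights = [1] * len(predictions)
    tally = Counter()
    for label, w in zip(predictions, weights):
        tally[label] += w
    winner, top = tally.most_common(1)[0]
    # A tie for the top weight means the ensemble gives no stable prediction.
    if sum(1 for v in tally.values() if v == top) > 1:
        return None
    return winner
```

For example, `weighted_vote(["cat", "dog", "dog"], [3, 1, 1])` picks "cat" because its single vote carries more weight than the two "dog" votes combined.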

There are various algorithms available, named ID3, C4.5, CART, CHAID, QUEST, GUIDE, CRUISE, and CTREE. Here we look at the three most commonly used ones.

Following up from our previous article on Decision Trees, here are a few commonly used techniques.

Starting with ID3

  • Independent variables are categorical.
  • Class labels of the predicted class are given.
  • Start by finding the root node: the variable that gives maximum Information Gain. This partitions the data into, say, m parts.
  • Next, leaving out the variable already used at the root node, split each subtree on the variable giving maximum information gain.
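The split selection above can be sketched as follows. The entropy-based information gain is the standard ID3 criterion; the toy weather-style dataset and function names are illustrative, not from the article.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting on a categorical attribute."""
    n = len(labels)
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[attr], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - remainder

# Root node = attribute with maximum information gain (toy data):
rows = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "rain",  "windy": "yes"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rain",  "windy": "no"}]
labels = ["no", "yes", "no", "yes"]
root = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
```

Here "outlook" separates the classes perfectly (gain 1 bit), while "windy" tells us nothing (gain 0), so ID3 would put "outlook" at the root and then repeat the same calculation inside each subtree on the remaining variables.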

The C4.5 algorithm addresses…

Here’s a list of people I met during my course of study. Say hi to them, and don’t forget to thank them. ❤❤

Background: Clergyman, amateur scientist, mathematician
Achievement: Solution for probabilities in gambling and insurance in the paper ‘An Essay towards solving a Problem in the Doctrine of Chances’; inverse probability

The Statement

The Central Limit Theorem states that if you have a population with mean mu and standard deviation sigma, then taking sufficiently large random samples from the population with replacement gives a distribution of sample means that is approximately Normal.


  1. The distribution of the population does not matter.
  2. The mean of X bar is mu, and its variance is sigma squared divided by n (the sample size).
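A quick simulation makes the statement concrete. This sketch (my own, using only the standard library) draws repeated samples from a uniform population, which is decidedly non-Normal, and checks that the sample means behave as the theorem predicts.

```python
import random
import statistics

random.seed(0)

# Population: Uniform(0, 1), with mean mu = 0.5 and variance sigma^2 = 1/12.
n = 30        # size of each sample
reps = 5000   # number of samples drawn (with replacement)

sample_means = [statistics.mean(random.random() for _ in range(n))
                for _ in range(reps)]

# By the CLT, the sample means are approximately Normal with
# mean mu = 0.5 and variance sigma^2 / n = (1/12) / 30.
mean_of_means = statistics.mean(sample_means)
var_of_means = statistics.variance(sample_means)
```

Plotting a histogram of `sample_means` would show the familiar bell shape even though the underlying population is flat.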

Tree-based models are non-parametric algorithms that are good at dealing with non-linear relationships. They are good for both Classification and Regression.

Decision Trees are a Supervised Algorithm. They can be used for both categorical and continuous input and output variables, partitioning the feature space into a number of smaller (non-overlapping) regions with similar response values using a set of splitting rules. …

Pruning is needed to reduce overfitting of the Decision Tree and make it a happy place for test data. Let’s see how we can do this.

Pruning can be done in two ways:

Pre Pruning

(Early Stopping Rule)

  • Minimum number of samples present in a node
  • Maximum depth
  • Maximum number of nodes
  • A preset threshold on the Gini Index or Information Gain which, if violated, stops the tree from splitting further
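The early-stopping rules above boil down to a single check made before each split. This is an illustrative sketch; the function name and default thresholds are mine, not from any library.

```python
def should_stop(depth, n_samples, n_nodes, best_gain,
                max_depth=5, min_samples=20, max_nodes=50, min_gain=0.01):
    """Pre-pruning (early stopping) rule for growing a decision tree.

    Returns True (do not split further) when any threshold is violated:
    the node is too deep, too small, the tree is too big, or the best
    available split does not improve impurity enough.
    """
    if depth >= max_depth:
        return True
    if n_samples < min_samples:
        return True
    if n_nodes >= max_nodes:
        return True
    if best_gain < min_gain:
        return True
    return False
```

For comparison, scikit-learn's `DecisionTreeClassifier` exposes the same ideas through parameters such as `max_depth`, `min_samples_split`, `max_leaf_nodes`, and `min_impurity_decrease`.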

Post Pruning

(Grow the tree and then trim it, replace subtree by leaf node)

  • Reduced Error Pruning: 1. Hold out some instances from the training data. 2. Calculate the misclassification on the holdout set using the decision tree…
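The per-node decision in Reduced Error Pruning can be sketched as follows; this is my own minimal illustration of the idea (names and the tie-breaking choice are assumptions), comparing a subtree against the majority-class leaf that would replace it, using only the holdout instances that reach that node.

```python
from collections import Counter

def prune_if_better(subtree_preds, holdout_labels):
    """Reduced Error Pruning decision for a single node (sketch).

    subtree_preds: predictions the subtree makes on the holdout
    instances reaching this node; holdout_labels: their true labels.
    Replace the subtree with a majority-class leaf when the leaf
    misclassifies no more holdout instances than the subtree does.
    """
    subtree_errors = sum(p != y for p, y in zip(subtree_preds, holdout_labels))
    leaf_class = Counter(holdout_labels).most_common(1)[0][0]
    leaf_errors = sum(y != leaf_class for y in holdout_labels)
    if leaf_errors <= subtree_errors:
        return ("prune", leaf_class)
    return ("keep", None)
```

Applied bottom-up over the grown tree, this trims exactly the subtrees that were fitting noise in the training data rather than signal the holdout set can confirm.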

Let’s start with what they do and why we need them.

Impurity measures play the same role in Decision Trees as the squared loss function does in linear regression: we try to reach the lowest impurity possible with the algorithm of our choice.

Impurity is the presence of more than one class in a subset of the data.

So all the measures mentioned below differ in formula but align in goal. Watch till the end for the secret highlights of this topic.
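The two most common measures, Gini index and entropy, can be computed in a few lines. This is a standard-formula sketch (function names and the toy label lists are mine); both are zero for a pure node and largest when classes are evenly mixed.

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum(p_k^2) over class proportions p_k."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy: -sum(p_k * log2(p_k)) over class proportions p_k."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

pure = ["a", "a", "a", "a"]    # one class only -> impurity 0
mixed = ["a", "a", "b", "b"]   # 50/50 split -> maximum impurity
```

For the 50/50 split, Gini gives 0.5 and entropy gives 1 bit; different numbers, same ranking of candidate splits in most cases, which is why the measures "differ in formula but align in goal".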


Hasn’t specifying the number of clusters in K-means and K-medoids been a pain? No worries, because now we have Hierarchical Clustering to save us from the mess. Another added advantage is the ability to visualize the construction of clusters diagrammatically.

Let us first introduce the tree-based representation, the Dendrogram, which makes things beautiful for Hierarchical Clustering.

  1. Conclusions about the proximity of two observations should not be drawn from their positions on the horizontal axis, nor from the vertical connections. Rather, the height of the branch between an observation and the clusters of observations below it indicates the distance between the…
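The bottom-up construction a dendrogram records can be sketched in a few lines. This is an illustrative single-linkage agglomerative sketch on 1-D points (one of several linkage choices; function name and data are mine): each merge height is exactly what the dendrogram would plot on its vertical axis.

```python
def single_linkage(points, k):
    """Agglomerative clustering with single linkage on 1-D points.

    Starts with every point in its own cluster and repeatedly merges
    the two clusters whose closest members are nearest, until k
    clusters remain. Returns the clusters and the merge heights.
    """
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        heights.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, heights

clusters, heights = single_linkage([1, 2, 9, 10], 2)
```

In practice one would use something like `scipy.cluster.hierarchy.linkage` and `dendrogram` for the real thing; this sketch just shows where the branch heights come from.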

Our start with unsupervised learning was through K-means, the most popular of them all. You can read about it here.

Here’s a refinement of it; below is the algorithm for PAM (Partitioning Around Medoids).


Build phase:

  1. Select k objects to become the medoids (or, if initial medoids were provided, use them);
  2. Calculate the dissimilarity matrix;
  3. Assign every observation to its closest medoid.

Swap phase: 4. For each cluster, check whether any object of the cluster decreases the average dissimilarity coefficient; if one does, select the entity that decreases this coefficient the most as the…
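The two phases can be sketched in miniature on 1-D data. This is an illustrative simplification (names and data are mine): the build phase here just seeds with the first k points rather than PAM's greedy cost-minimizing build, while the swap phase tries every medoid/non-medoid exchange and keeps any that lowers the total dissimilarity.

```python
def pam(points, k, dist=lambda a, b: abs(a - b)):
    """Tiny PAM sketch on 1-D points (illustrative, not optimized)."""

    def cost(medoids):
        # Total dissimilarity: each point to its closest medoid.
        return sum(min(dist(p, m) for m in medoids) for p in points)

    medoids = list(points[:k])  # build phase (simplified seeding)
    improved = True
    while improved:             # swap phase
        improved = False
        for i in range(k):
            for p in points:
                if p in medoids:
                    continue
                trial = medoids[:i] + [p] + medoids[i + 1:]
                if cost(trial) < cost(medoids):
                    medoids = trial
                    improved = True
    return sorted(medoids)

medoids = pam([1, 2, 3, 10, 11, 12], 2)
```

Unlike K-means centroids, the medoids are always actual observations, which is what makes PAM usable with any dissimilarity measure, not just Euclidean distance.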

This is the first type of algorithm we cover for unsupervised data.

It is similar to KNN in that both are distance-based algorithms resting on the assumption that similar observations lie close together. The difference is that in KNN we check the labels of the K nearest neighbors and assign the corresponding label to our point, while in K-means we group observations into K clusters so that the points within a cluster are similar to one another…

Steps for Algorithm

  1. Specify the desired number of clusters K.
  2. Randomly assign each data point to a cluster.
  3. Compute the cluster centroids (the mean of all points within each cluster).
  4. Re-assign each point to the…
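The loop above (Lloyd's algorithm) can be sketched on 1-D data. This is my own minimal illustration: it initializes random centroids rather than random assignments (a common variant of step 2), and the function name and toy data are assumptions.

```python
import random
import statistics

def kmeans_1d(points, k, iters=20, seed=0):
    """Minimal K-means sketch on 1-D data (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # pick k initial centroids
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its points
        # (keep the old centroid if a cluster went empty).
        centroids = [statistics.mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

centroids = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], 2)
```

On these two obvious groups the centroids settle at the group means, 2.0 and 11.0, regardless of which points are picked initially.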

Shaily Jain

Problem Solver, Data Science, Actuarial Science, Knowledge Sharer, Hardcore Googler
