Support Vector Machine

Shaily jain
Feb 26, 2021

Maximal Margin Classifier

We consider binary classification in this post, where the two classes are encoded as +1 and -1.
The idea is to find a linear boundary that separates the observations of the two classes. Since there are infinitely many such lines (planes in the case of 3 independent variables, hyperplanes in general), we need to find the one that is least sensitive to the particular training observations and can classify new data without the misclassification that results from overfitting.
For that purpose, we take the decision boundary to be optimal when it is farthest from the observations.

Building our understanding step by step:

Construct a decision boundary and build a margin around it by extending it on each side to the nearest point of each class. The perpendicular distance from each observation to the hyperplane is computed, and the smallest of these distances measures how close the hyperplane is to the group of observations.

The nearest observations of each class are called support vectors, because they support the boundaries of the maximal margin: a slight shift in them changes the maximal margin.
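
As an illustration of the distance computation described above, here is a minimal NumPy sketch. The weight vector w, intercept b and the four observations are made up for the example; the perpendicular distance of a point X from the hyperplane w·X + b = 0 is |w·X + b| / ||w||, and the smallest such distance identifies the closest points, i.e. the candidate support vectors.

```python
import numpy as np

# Hypothetical hyperplane w.X + b = 0 (values chosen only for illustration)
w = np.array([1.0, -1.0])
b = 0.0

# Made-up 2-D observations
X = np.array([[2.0, 0.5],
              [3.0, 1.0],
              [0.5, 2.0],
              [1.0, 3.5]])

# Perpendicular distance of each observation from the hyperplane
distances = np.abs(X @ w + b) / np.linalg.norm(w)

closest = np.argmin(distances)  # index of the nearest observation
print("distances:", distances)
print("smallest distance:", distances[closest], "at point", X[closest])
```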

Mathematically, the decision boundary (separating hyperplane) is given by

w·X + b = 0

Hence any point on one side of (above) the hyperplane satisfies

w·X + b ≥ 1 (for class y = +1)

and on the other side, it satisfies

w·X + b ≤ −1 (for class y = −1)

In its entirety, the set of constraints is

y(i)·(w·X(i) + b) ≥ 1 for every observation i
If the weight vector of coefficients is w, then the size of the maximal margin is 2/||w||. A point to note here is that the weight vector is perpendicular to the decision boundary. Take two points on the decision hyperplane, say X(a) and X(b), and substitute them into its equation: w·X(a) + b = 0 and w·X(b) + b = 0. Subtracting the two gives w·[X(a) − X(b)] = 0. Since X(a) − X(b) is parallel to the decision hyperplane (both points lie on it) and its dot product with w is 0, it follows that w is perpendicular to the decision hyperplane.
Finding the maximal margin hyperplane and the support vectors is a convex quadratic optimization problem. It is important to note that the complexity of an SVM is characterized by the number of support vectors rather than by the dimension of the feature space, which is why SVMs have comparatively little tendency to overfit. If all data points other than the support vectors were removed from the training set, the training algorithm would still produce the same separating hyperplane. The number of support vectors provides an upper bound on the expected error rate of the SVM classifier, a bound that is independent of the data dimensionality.
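
A small sketch of the above using scikit-learn (not mentioned in the post, used here purely for illustration): fitting a linear SVM with a very large C approximates the hard maximal-margin classifier on a separable toy dataset, after which the fitted w gives the margin width 2/||w|| and the support vectors can be read off directly.

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data
X = np.array([[1, 1], [2, 1], [1, 2],      # class -1
              [4, 4], [5, 4], [4, 5]])     # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard (maximal) margin
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
print("w:", w, "b:", b)
print("margin width 2/||w||:", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```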

Soft Margin

Some relaxation is allowed so that certain points may cross the margins or even the separating hyperplane. An additional set of slack variables is introduced, one per training observation, which gives the margin some wiggle room. Because the margin shows this leniency, it is called a soft margin. The optimization problem is therefore modified to

minimize (1/2)||w||² + C·Σ ε(i), subject to y(i)·(w·X(i) + b) ≥ 1 − ε(i) and ε(i) ≥ 0 for all i

Here ε(i) is the slack corresponding to the ith observation and C is a regularization parameter set by the user; a larger value of C imposes a larger penalty for margin violations.

  • The larger the value of C, the more sensitive the algorithm is to the training data (higher variance and lower bias).
  • The smaller the value of C, the less sensitive the algorithm is to the training data (lower variance and higher bias); the sketch after this list illustrates the effect.
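
A small sketch of this effect (scikit-learn on synthetic, partly overlapping data; both the dataset and the C values are arbitrary choices for illustration): as C grows, slack is penalised more heavily, so the margin typically narrows and fewer points end up as support vectors.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two partly overlapping blobs
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2 / np.linalg.norm(clf.coef_[0])   # soft-margin width
    print(f"C={C:6}: {clf.n_support_.sum()} support vectors, margin width {width:.3f}")
```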

Points on Mathematics

Subject to the above constraints, the optimization objective is obtained by maximizing the geometric margin.

Since we aim to maximize the margin with respect to both classes (the crosses and the circles in the illustration), we can equivalently maximize the width of the band from the support vectors of one class to the support vectors of the other.

Geometric margin of the hyperplane w.r.t. the entire dataset: γ = min over i of γ(i)
Geometric margin w.r.t. the ith training example: γ(i) = y(i)·(w·X(i) + b) / ||w||
Hyperplane passing through the positive support vectors: w·X + b = +1
Hyperplane passing through the negative support vectors: w·X + b = −1
Reformulated optimization objective as a bandwidth maximization: maximize the perpendicular distance between these two hyperplanes
Reformulated optimization, equivalent to maximizing the margin (= 2/||w||): maximize 2/||w|| subject to y(i)·(w·X(i) + b) ≥ 1 for all i

But 2/||w|| turns out to be a non-convex function of w, because of which gradient descent could get stuck at a local minimum. Squaring the norm of w in the multiplicative inverse of the objective gives a convex, differentiable function (the inverse of the objective alone is also convex but not differentiable at w = 0). Therefore, we get the final optimization problem

minimize (1/2)||w||², subject to y(i)·(w·X(i) + b) ≥ 1 for all i
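
The constrained problem above is normally handed to a quadratic-programming solver, but a quick way to see the objective in action is to minimize the closely related unconstrained soft-margin form, (1/2)||w||² + C·Σ max(0, 1 − y(i)·(w·X(i) + b)), by sub-gradient descent. A minimal NumPy sketch follows; the toy data, learning rate, epoch count and C are all arbitrary choices made for illustration, not part of the original post.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=2000):
    """Sub-gradient descent on (1/2)||w||^2 + C * sum of hinge losses."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1                                   # margin violators
        grad_w = w - C * (y[mask, None] * X[mask]).sum(axis=0)
        grad_b = -C * y[mask].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage on separable data
X = np.array([[1., 1.], [2., 1.], [4., 4.], [5., 4.]])
y = np.array([-1., -1., 1., 1.])
w, b = train_linear_svm(X, y)
print("w:", w, "b:", b, "predictions:", np.sign(X @ w + b))
```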

Multiclass classification
When there are more than two classes, we can use one of the below methods:
One-versus-All: If the number of classes is K > 2, then K different two-class SVM classifiers are fitted, each comparing one class with the rest of the classes combined. A new observation is assigned to the class whose classifier value is largest.

One-versus-One: All K(K−1)/2 pairwise classifiers are fitted, and a test observation is assigned to the class that wins the majority of the pairwise comparisons.

The latter method is usually preferable, but if K is too large the former is used; a brief sketch of both follows.
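
A brief sketch of both strategies using scikit-learn's built-in wrappers on the iris dataset (an illustrative choice, not from the post): OneVsRestClassifier fits K one-versus-all SVMs, while OneVsOneClassifier fits K(K−1)/2 pairwise SVMs and classifies by majority vote.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One-versus-All: K binary SVMs, one per class
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X_train, y_train)
# One-versus-One: K(K-1)/2 pairwise SVMs, majority vote
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X_train, y_train)

print("one-vs-all accuracy:", ova.score(X_test, y_test))
print("one-vs-one accuracy:", ovo.score(X_test, y_test))
```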

Resources

MIT OpenCourseWare, scratch-code implementation, Machine Learning Mastery, online.stat.psu.edu, StatQuest, maximal margin notes, solved numerical examples, Medium article
