Kernels
Kernels are a widely used tool in Machine Learning. Here we try to understand kernels in the context of Support Vector Machines for classification.
Kernels have the explicit duty of transforming the data into a higher-dimensional space. With Support Vector Machines, when the data is not linearly separable, we project the observations into a higher-dimensional space and then try to find a linear decision boundary there.
Things are fairly straightforward when visualised as above. Here we have transformed a 2-dimensional plane into a 3-dimensional bowl-like structure. Although we were initially unable to find a linear decision boundary, after the transformation we can locate an optimal hyperplane that separates the red and green dots.
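To make the picture above concrete, here is a minimal sketch (an assumed example, not from the original article) of one mapping that produces such a bowl: lifting each 2-dimensional point (x1, x2) to (x1, x2, x1² + x2²), so that an inner cluster and an outer ring, which no line can separate in the plane, become separable by a flat plane in three dimensions.

```python
import numpy as np

# Illustrative sketch (assumed example, not from the article): lift 2-D points
# onto a "bowl" with phi(x1, x2) = (x1, x2, x1^2 + x2^2). Points near the
# origin get a small third coordinate, points far away get a large one, so a
# horizontal plane in 3-D can separate an inner cluster from an outer ring.

rng = np.random.default_rng(0)

inner = rng.normal(scale=0.4, size=(50, 2))        # class 1: blob around the origin
theta = rng.uniform(0.0, 2.0 * np.pi, size=50)
outer = np.column_stack([2.0 * np.cos(theta),      # class 2: ring of radius 2
                         2.0 * np.sin(theta)])

def phi(points):
    """Map each 2-D point to (x1, x2, x1^2 + x2^2)."""
    return np.column_stack([points, (points ** 2).sum(axis=1)])

# In the lifted space the third coordinate alone tells the classes apart:
print("inner max height:", phi(inner)[:, 2].max())  # well below 4
print("outer min height:", phi(outer)[:, 2].min())  # exactly 4
```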
Before jumping into the types of kernel functions, we should understand that instead of explicitly applying the transformation phi(x) and moving the data into a higher-dimensional space, the data is represented through a set of pairwise similarity comparisons between the original observations x (with their original coordinates in the lower-dimensional space); the intuition, however, is the same. This is famously called the Kernel Trick. Because of it, there is hardly ever a need to actually compute the mapping function.
There are no constraints on the dimension of the space onto which the mapping can project the data. Kernel functions provide a simple bridge from linearity to non-linearity for any algorithm that can be expressed in terms of dot products: thanks to the kernel trick, wherever a dot product appears it is replaced with a kernel function. The kernel function denotes an inner product in feature space and is usually written as
K(x,y) = <φ(x),φ(y)>
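As a quick numerical illustration of the kernel trick (the feature map below is an assumed example, not part of the article): for the degree-2 homogeneous polynomial kernel K(x, y) = (<x, y>)² on 2-dimensional inputs, evaluating the kernel directly on the original points gives exactly the inner product of the explicitly mapped 3-dimensional vectors, without ever forming the mapping.

```python
import numpy as np

# For K(x, y) = (x . y)^2 in 2-D, an explicit feature map is
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2). The kernel evaluated on the
# original points equals the dot product of the mapped vectors.

def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

explicit = phi(x) @ phi(y)   # inner product in the 3-D feature space
via_kernel = (x @ y) ** 2    # kernel on the original 2-D points

print(explicit, via_kernel)  # both print 121.0
```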
Properties of kernel functions, although out of the scope of this article, can be checked out here.
Frequently used Kernel Functions are:
- Linear Kernel: given by the inner product <x, y> plus an optional constant c.
- Polynomial Kernel: a non-stationary kernel, given by (<x, y> + c)^d for degree d, well suited to problems where the training data is normalised.
- Gaussian Kernel: maps to an infinite-dimensional space. It behaves like a weighted nearest-neighbour model, classifying an observation according to the observations closest to it. The parameter gamma must be set carefully, typically via cross validation: if gamma is set too low, the exponential behaves almost linearly and the high-dimensional projection loses its non-linear power; if it is set too high, the kernel lacks regularization and the decision boundary becomes highly sensitive to noise in the training data. (A short sketch of these kernels, and of tuning gamma by cross validation, follows this list.)
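Below is a hedged sketch of the three kernels listed above written with NumPy, followed by a cross-validation search for gamma using scikit-learn's SVC (whose "rbf" kernel is the same Gaussian kernel, K(x, y) = exp(-gamma ||x - y||²)). The parameter values and the make_circles toy data are illustrative choices, not prescriptions from this article.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# The three kernels from the list above; c, d and gamma are illustrative defaults.
def linear_kernel(x, y, c=0.0):
    return x @ y + c

def polynomial_kernel(x, y, c=1.0, d=3):
    return (x @ y + c) ** d

def gaussian_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

# Choosing gamma by cross validation on a toy non-linearly-separable dataset.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
search = GridSearchCV(SVC(kernel="rbf"),
                      {"gamma": [0.01, 0.1, 1, 10, 100]},
                      cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```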
For more, refer to the first link in the resources.
Resources: