Linear Methods of Classification
Below I have collected a few linear methods of classification. If the discriminant function, or the posterior P(y=K|X=x), is linear in x, the result is a linear decision boundary, hence a linear classification method.
There are two broad classes of methods for determining the parameters of a linear classifier, namely generative and discriminative models. All of the linear classifiers below can be converted into non-linear algorithms operating on a transformed input space phi(x) using the kernel trick.
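As a concrete illustration, here is a minimal NumPy sketch (the weights, bias, and feature map are invented for demonstration): a linear classifier predicts with sign(wTx + b), and a feature map phi(x) makes the same linear rule trace a non-linear boundary in the original input space.

```python
import numpy as np

# A linear classifier: the decision boundary w^T x + b = 0 is a hyperplane.
w = np.array([1.0, -2.0])   # hypothetical learned weights
b = 0.5                     # hypothetical learned bias

def predict_linear(x):
    return np.sign(w @ x + b)

# Feature map phi(x): the same linear rule applied in phi-space yields a
# boundary that is non-linear in the original input space.
def phi(x):
    return np.array([x[0], x[1], x[0] * x[1], x[0] ** 2, x[1] ** 2])

w_phi = np.array([0.0, 0.0, 1.0, 1.0, -1.0])  # hypothetical weights in phi-space

def predict_in_phi_space(x):
    return np.sign(w_phi @ phi(x))

print(predict_linear(np.array([2.0, 1.0])))       # linear boundary in x
print(predict_in_phi_space(np.array([2.0, 1.0]))) # linear in phi(x), non-linear in x
```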
Class 1: Generative models (need P(x|K) and P(K) to determine the posterior distribution); good for handling missing data. A sketch follows this list.
- Linear Discriminant Analysis (LDA) — assumes Gaussian class-conditional density models with equal covariance (if the covariances are unequal, this becomes Quadratic Discriminant Analysis)
- Naive Bayes Classifier with multinomial or multivariate Bernoulli event models
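To make the generative class concrete, here is a minimal scikit-learn sketch (the toy data is invented for illustration) fitting LDA and, for these continuous features, the Gaussian event-model variant of Naive Bayes:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

# Toy 2-class data: two Gaussian blobs with (roughly) shared covariance.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=3.0, scale=1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# LDA: fits class means and a shared covariance, then classifies via the
# posterior P(K|x); the resulting decision boundary is linear in x.
lda = LinearDiscriminantAnalysis().fit(X, y)

# Naive Bayes: assumes the features are conditionally independent given the class.
nb = GaussianNB().fit(X, y)

x_new = np.array([[1.5, 1.5]])
print(lda.predict_proba(x_new))
print(nb.predict_proba(x_new))
```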
Class 2: Discriminative models (attempt to maximise the quality of the output on the training set); typically yield higher accuracy. A sketch follows this list.
- Logistic Regression (Softmax Regression for multiclass classification) — maximum-likelihood estimation of the regression parameters, assuming that the observed training set was generated by a binomial model that depends on the output of the classifier.
- Perceptron — an algorithm that attempts to fix all errors encountered in the training set
- Fisher’s Linear Discriminant Analysis — an algorithm (distinct from the ‘LDA’ above) that maximises the ratio of between-class scatter to within-class scatter, without any other assumptions. It is in essence a method of dimensionality reduction for binary classification
- Support Vector Machine — an algorithm that maximises the margin between the decision hyperplane and the examples in the training set
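Here is a minimal scikit-learn sketch (toy data again invented for illustration) fitting three of the discriminative classifiers above; each one learns a linear decision boundary wTx + b = 0, differing only in how it sets w and b:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

models = {
    "logistic regression": LogisticRegression(),  # MLE under a binomial model
    "perceptron": Perceptron(),                   # mistake-driven updates
    "linear SVM": LinearSVC(),                    # maximises the margin
}
for name, model in models.items():
    model.fit(X, y)
    # Each model exposes its learned hyperplane via coef_ (w) and intercept_ (b).
    print(name, model.coef_, model.intercept_)
```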
Training a linear classifier generally reduces to an optimisation problem of the form

arg min_w  R(w) + C Σi L(yi, wTxi)

which a learning algorithm solves using techniques like stochastic gradient descent, L-BFGS, coordinate descent, and Newton methods (a toy sketch follows the definitions below), where
- w is a vector of classifier parameters,
- L(yi, wTxi) is a loss function that measures the discrepancy between the classifier’s prediction and the true output yi for the i’th training example,
- R(w) is a regularisation function that prevents the parameters from getting too large (which would cause overfitting); if R(w) and the loss are convex, the above is a convex optimisation problem,
- C is a scalar constant (set by the user of the learning algorithm) that controls the balance between the regularisation and the loss function.
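As a sketch of how such an objective can be minimised, here is a toy NumPy implementation of stochastic gradient descent on the L2-regularised logistic loss (the data, step size, and value of C are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)   # labels in {-1, +1}

C = 1.0      # balance between the loss term and the regularisation
lr = 0.1     # step size
w = np.zeros(2)

# Objective: R(w) + C * sum_i L(yi, w^T xi), with
# R(w) = 0.5 * ||w||^2 and logistic loss L(y, s) = log(1 + exp(-y * s)).
for epoch in range(100):
    for i in rng.permutation(len(y)):
        s = w @ X[i]
        # Per-example gradient: the regulariser's gradient w is spread
        # across the examples; dL/dw = -y * x / (1 + exp(y * s)).
        grad = w / len(y) + C * (-y[i] * X[i]) / (1.0 + np.exp(y[i] * s))
        w -= lr * grad

print("learned w:", w)
```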