Multicollinearity
Apr 18, 2021
You have probably heard the term many times without being quite sure what it means.
Let's go over it once more:
Multicollinearity is the existence of strong correlation among the independent variables of a regression model.
Problems:
- It can make the estimated regression coefficients unstable and imprecise.
- It inflates the standard errors of the regression coefficients, which reduces the power of the t-tests on them.
- It can produce misleading results and p-values, and it adds redundancy to the model, which makes the model harder to interpret and its predictions less reliable.
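To make the standard-error point concrete, here is a minimal simulation sketch in plain Python. Everything in it is an illustrative assumption rather than part of the original discussion: the helper names (fit_two_predictors, slope_spread), the sample size, the number of repetitions, and the true model y = x1 + x2 + noise. It fits the same two-predictor regression on many simulated datasets and compares how much the fitted slope of x1 varies when the predictors are uncorrelated versus strongly correlated.

```python
import random
import statistics


def fit_two_predictors(x1, x2, y):
    """OLS slopes for y ~ x1 + x2 (with intercept) via the 2x2 normal equations."""
    m1, m2, my = statistics.fmean(x1), statistics.fmean(x2), statistics.fmean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # close to zero when x1 and x2 are collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def slope_spread(rho, n=50, reps=300, seed=0):
    """Standard deviation of the fitted x1 slope across simulated datasets."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        # x2 is built to have population correlation rho with x1
        x2 = [rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for a in x1]
        # assumed true model: both slopes equal 1, unit-variance noise
        y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
        slopes.append(fit_two_predictors(x1, x2, y)[0])
    return statistics.stdev(slopes)


spread_uncorrelated = slope_spread(rho=0.0)
spread_correlated = slope_spread(rho=0.95)
print(f"sd of x1 slope, rho=0.00: {spread_uncorrelated:.3f}")
print(f"sd of x1 slope, rho=0.95: {spread_correlated:.3f}")
```

The true slopes are the same in both runs; only the correlation between the predictors changes, so any extra spread in the fitted slope comes from the collinearity alone.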
Sources:
- It can result from the way the data were collected, for example when the sample covers only a narrow subpopulation.
- It can result from an over-defined model or from model specification choices: a model is over-defined when it has more variables than observations, and specification choices such as including an interaction term alongside its main effects can also introduce collinearity.
- It can be driven by outliers: removing extreme values before fitting the regression can reduce multicollinearity.
Detection:
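- Inspecting pairwise scatter plots of the independent variables for correlation.
- Variance Inflation Factor (VIF): for each variable, VIF = 1 / (1 - R^2), where R^2 comes from regressing that variable on all the other independent variables. As a rule of thumb, a VIF of 10 or more indicates high collinearity.
- Eigenvalues of the correlation matrix that are close to zero: rather than judging the eigenvalues by their raw size, use the condition number, which compares the largest eigenvalue with the smallest. The larger the condition number, the stronger the multicollinearity.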
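The VIF and eigenvalue checks above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not a reference implementation: the function name collinearity_diagnostics and the simulated data are made up for the example, the VIFs are read off the diagonal of the inverse correlation matrix (which equals 1 / (1 - R^2) for each variable), and the condition number is taken as the square root of the largest eigenvalue divided by the smallest, which is one common convention.

```python
import numpy as np


def collinearity_diagnostics(X):
    """Return (VIFs, eigenvalues, condition number) for the columns of X."""
    corr = np.corrcoef(X, rowvar=False)        # correlation matrix of the predictors
    vifs = np.diag(np.linalg.inv(corr))        # VIF_j = 1 / (1 - R_j^2)
    eigvals = np.linalg.eigvalsh(corr)         # eigenvalues in ascending order
    cond = np.sqrt(eigvals[-1] / eigvals[0])   # sqrt(largest / smallest)
    return vifs, eigvals, cond


# Simulated data (an assumption for illustration): x2 is nearly a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)                      # unrelated predictor
X = np.column_stack([x1, x2, x3])

vifs, eigvals, cond = collinearity_diagnostics(X)
print("VIF:", np.round(vifs, 1))
print("eigenvalues:", np.round(eigvals, 3))
print("condition number:", round(float(cond), 1))
```

In this setup the two near-duplicate columns should show VIFs well above 10 while the unrelated column stays close to 1, and the smallest eigenvalue should sit near zero.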
Correction:
- Collecting data from an appropriate subpopulation.
- Proper variable selection or coefficient shrinkage through regularization methods, such as ridge regression.
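As a rough sketch of the regularization idea, the snippet below uses ridge regression, the method covered in the linked resource. Note that ridge shrinks coefficients rather than dropping variables. The function name ridge_fit, the penalty value alpha=10.0, and the simulated data are assumptions made for illustration; the fit uses the closed-form solution (X'X + alpha I)^-1 X'y on standardized predictors.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients on standardized predictors and centered y."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    p = Xs.shape[1]
    # alpha = 0 reduces to ordinary least squares
    return np.linalg.solve(Xs.T @ Xs + alpha * np.eye(p), Xs.T @ yc)


# Simulated data (an assumption for illustration): two nearly identical predictors.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 0.05 * rng.normal(size=100)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=100)

ols = ridge_fit(X, y, alpha=0.0)
ridge = ridge_fit(X, y, alpha=10.0)
print("OLS coefficients:  ", np.round(ols, 2))
print("ridge coefficients:", np.round(ridge, 2))
```

With nearly identical predictors, the least-squares fit can split the shared effect between them almost arbitrarily; the penalty pulls the two coefficients toward each other and toward zero, trading a little bias for a more stable estimate.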
Resources:
https://corporatefinanceinstitute.com/resources/knowledge/other/ridge/