I have heard the term multiple times, but was never really sure what it means.

Let’s hear it once again:

Multicollinearity is the existence of correlation between the independent variables in modelled data.

Problems:

  • It can make the estimated regression coefficients inaccurate and unstable.
  • It magnifies the standard errors of the regression coefficients, which reduces the power of the t-tests on them.
  • It can produce misleading results and p-values, and it adds redundancy to the model, making its predictions less efficient and less reliable.
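The effect on standard errors can be seen in a small simulation. The sketch below is only an illustration under assumed conditions: the data are synthetic, the true coefficients (1, 2, 3) and the noise levels are arbitrary choices, and `ols_standard_errors` is a hypothetical helper written for this note, not a library function. It fits the same model twice, once with an independent second predictor and once with a second predictor that is almost a copy of the first.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Return OLS coefficients and their standard errors (intercept first)."""
    A = np.column_stack([np.ones(len(y)), X])       # add an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # least-squares fit
    resid = y - A @ beta
    dof = A.shape[0] - A.shape[1]                   # residual degrees of freedom
    sigma2 = resid @ resid / dof                    # estimated noise variance
    cov = sigma2 * np.linalg.inv(A.T @ A)           # covariance of the estimates
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
noise = rng.normal(size=n)
x2_indep = rng.normal(size=n)                 # unrelated to x1
x2_coll = x1 + 0.05 * rng.normal(size=n)      # almost a copy of x1

def simulate(x2):
    # Assumed true model: y = 1 + 2*x1 + 3*x2 + noise
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + noise
    return ols_standard_errors(np.column_stack([x1, x2]), y)

beta_indep, se_indep = simulate(x2_indep)
beta_coll, se_coll = simulate(x2_coll)

print("independent predictors: coef", beta_indep.round(2), "se", se_indep.round(3))
print("collinear predictors:   coef", beta_coll.round(2), "se", se_coll.round(3))
```

In the collinear run the standard errors on the two slopes should come out many times larger than in the independent run, even though the noise and the sample size are the same.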

Sources:

  • It can be the result of errors in data collection.
  • It can occur as a result of an over-defined model or of model specification/choice: an over-defined model has more variables than observations, and a poorly specified one may, for example, include unnecessary interaction terms alongside the main effects of the same variables.
  • It can be due to outliers: removing extreme values before regression can reduce multicollinearity.

Detection:

  • Inspect pairwise scatter plots of the independent variables for correlation.
  • Variance Inflation Factor (VIF): a score of 10 or more is a common rule of thumb for high collinearity.
  • Eigenvalues of the correlation matrix that are close to zero: rather than judging the eigenvalues by their raw numerical size, use the condition number. The larger the condition number, the stronger the multicollinearity.
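The VIF and condition-number checks above can be sketched with NumPy alone. This is a minimal illustration, not a reference implementation: `vif` and `condition_number` are hypothetical helpers, the data are synthetic, and the condition number is taken here as the square root of the ratio of the largest to the smallest eigenvalue of the correlation matrix (some texts use the plain ratio instead, so thresholds differ between sources).

```python
import numpy as np

def vif(X):
    """VIF per column: 1 / (1 - R^2) from regressing that column on the others."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(target)), others])  # intercept + other columns
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

def condition_number(X):
    """sqrt(largest / smallest eigenvalue) of the correlation matrix of X."""
    eig = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))   # ascending eigenvalues
    return float(np.sqrt(eig[-1] / eig[0]))

# Synthetic example: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
cond = condition_number(X)
print("VIF per column:", vifs.round(1))
print("condition number:", round(cond, 1))
```

Under these assumptions the first two columns should score well above the rule-of-thumb VIF of 10, while the unrelated third column should stay close to 1.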

Correction:

  • Collect data from an appropriate subpopulation.
  • Select variables properly, for example with regularization methods such as ridge regression.
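As one example of a regularization method, ridge regression adds a penalty on the size of the coefficients, which stabilises them when predictors are collinear. The sketch below uses the closed-form ridge solution on standardized predictors. It is an illustration under assumptions: `ridge_fit` is a hypothetical helper, the data are synthetic, and the penalty `lam=10.0` is an arbitrary value that would normally be chosen by cross-validation.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients on standardized predictors and centered y."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # put predictors on the same scale
    yc = y - y.mean()                           # centering removes the intercept
    p = Xs.shape[1]
    # Solve (Xs'Xs + lam * I) beta = Xs'yc; lam = 0 gives ordinary least squares
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)

# Synthetic collinear data: x2 is almost a copy of x1.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

beta_ols = ridge_fit(X, y, lam=0.0)      # no penalty
beta_ridge = ridge_fit(X, y, lam=10.0)   # assumed penalty, not tuned

print("OLS coefficients:  ", beta_ols.round(2))
print("ridge coefficients:", beta_ridge.round(2))
```

Ridge shrinks coefficients but does not set any of them exactly to zero, so it stabilises the estimates rather than dropping variables; a lasso penalty would be the usual choice when actual variable selection is wanted.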

Resources:

https://corporatefinanceinstitute.com/resources/knowledge/other/ridge/