Understanding these assumptions is important.
Linear models are parametric, which means they rest on assumptions that we need to validate before relying on them. To follow these assumptions more easily, it helps to be familiar with the concepts of linear regression, which we covered in detail, with code, in the previous blog. In this one, we focus on the assumptions of the linear model. Validating them is what makes a linear model robust: when they are violated, the model tends to give poor results and its coefficients can be misinterpreted. This is also the most overlooked part of linear regression analysis. So let's go through these assumptions in detail.
There should be a linear relationship between dependent and independent variables.
This is the first assumption for any linear model, and it needs to hold for the model to fit well. The target (dependent) variable should be a straight-line (linear) function of each independent variable when the other variables are held fixed, and the slope for one independent variable should not depend on the value of another independent variable. If we have only a few columns, we can validate this with a scatter plot of each input variable against the target variable. For more columns, we can check the correlation between each input variable and the target variable. This is easy to do in Python.
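As a minimal sketch of the correlation check, the snippet below uses synthetic data as a stand-in for a housing dataset. The column names `area`, `rooms` and `price`, and all the numbers, are assumptions made up for illustration.

```python
import numpy as np

# Synthetic stand-in data: every value here is invented for illustration.
rng = np.random.default_rng(0)
n = 200
area = rng.uniform(500, 3500, size=n)               # hypothetical input column
rooms = rng.integers(1, 6, size=n).astype(float)    # hypothetical input column
noise = rng.normal(0, 150_000, size=n)
price = 300_000 + 100 * area + 20_000 * rooms + noise  # hypothetical target

def pearson_corr(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

# Correlation of each input with the target: values near +1 or -1 suggest
# a strong linear relationship, values near 0 suggest a weak one.
for name, column in {"area": area, "rooms": rooms}.items():
    print(f"corr({name}, price) = {pearson_corr(column, price):.2f}")
```

With matplotlib installed, `plt.scatter(area, price)` would give the matching scatter plot for a single input.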
It is a good idea to draw a scatter plot of a few important independent variables against the target variable. Above we can see that the area alone is not a very strong indicator of Price, and the correlation is also not very high. We can also draw a 3D scatter plot to visualize the relationship between multiple variables, but it is not easy to read. For correlations among many variables, a heatmap makes much more sense and is easier to interpret. A larger absolute value in the correlation matrix indicates a stronger linear relationship between two variables, and a value near zero indicates a weak one. A visual comparison is shown below.
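The correlation matrix behind such a heatmap can be computed as in the sketch below. The data is synthetic and the column names (`area`, `bedrooms`, `age`, `price`) are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
# Hypothetical columns built from synthetic data.
area = rng.normal(1500, 400, size=n)
bedrooms = 1 + area / 600 + rng.normal(0, 0.5, size=n)   # related to area
age = rng.uniform(0, 50, size=n)                         # unrelated to area
price = (100 * area + 10_000 * bedrooms - 500 * age
         + rng.normal(0, 20_000, size=n))

names = ["area", "bedrooms", "age", "price"]
data = np.column_stack([area, bedrooms, age, price])

# rowvar=False treats each column as one variable.
corr = np.corrcoef(data, rowvar=False)

# Print the matrix as a small table; a heatmap is a coloured version of it.
print(" " * 10 + "".join(f"{name:>10}" for name in names))
for name, row in zip(names, corr):
    print(f"{name:>10}" + "".join(f"{value:>10.2f}" for value in row))
```

If seaborn is available, passing this matrix to `sns.heatmap(corr, annot=True)` is one common way to draw it.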
We can apply a transformation to the dependent or the independent variable to make the relationship linear. We need to be careful when applying these transformations, because a poorly chosen one can decrease the performance even more. We cannot apply an arbitrary transformation to the data. For example, if a variable is strictly positive, we can apply a log transformation to it.
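To illustrate the log transformation, here is a small sketch on synthetic data where the target grows exponentially with the input. The data-generating formula is an assumption picked so that the effect is easy to see.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=300)
# Strictly positive synthetic target that grows exponentially with x.
y = np.exp(1.0 + 0.8 * x + rng.normal(0, 0.2, size=300))

# Linear correlation before and after taking the log of the target.
corr_raw = np.corrcoef(x, y)[0, 1]
corr_log = np.corrcoef(x, np.log(y))[0, 1]
print(f"correlation with y:      {corr_raw:.3f}")
print(f"correlation with log(y): {corr_log:.3f}")
```

The correlation with `log(y)` comes out higher because the log turns the exponential curve into a straight line. On real data the right transformation is rarely this obvious, so it is worth comparing plots before and after.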
Error terms are normally distributed
This is an important assumption for building interpretable linear models. If the residuals are not normally distributed, hypothesis tests are affected: it becomes harder to determine whether model coefficients are significantly different from zero and to calculate confidence intervals. A bell-shaped distribution of residuals centred on zero shows that most of the errors are small and close to each other. If the distribution is not bell-shaped, there are some anomalies in the best-fit line and we need to revisit the regression analysis. At the same time, it is important to note that there is no assumption on the distribution of the input variables or the target variable. If the residuals are not normally distributed, it may be because the relationship between the dependent and independent variables is not linear, or because some outliers are present. To fix it, we need to check the first assumption and, if any nonlinearity is present, consider applying a nonlinear transformation to the input or the target variable.
A Python snippet is shown below to demonstrate how to check this assumption.
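A minimal sketch of such a check follows, run on synthetic data rather than the housing data discussed in this post. Besides looking at a histogram of the residuals, we can summarize their shape with the Jarque-Bera statistic, which is an addition here and not part of the original analysis. The helper names `fit_ols` and `jarque_bera` are made up for this example.

```python
import numpy as np

def fit_ols(x, y):
    """Least-squares fit with an intercept; returns fitted values and residuals."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    return fitted, y - fitted

def jarque_bera(residuals):
    """Jarque-Bera statistic built from sample skewness and excess kurtosis."""
    r = residuals - residuals.mean()
    n = len(r)
    m2 = np.mean(r**2)
    skew = np.mean(r**3) / m2**1.5
    excess_kurt = np.mean(r**4) / m2**2 - 3.0
    stat = n / 6.0 * (skew**2 + excess_kurt**2 / 4.0)
    return float(stat), float(skew), float(excess_kurt)

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(0, 10, size=n)

# Case 1: normally distributed errors (assumption holds).
y_normal = 3.0 + 2.0 * x + rng.normal(0, 1, size=n)
# Case 2: strongly right-skewed errors (assumption violated).
y_skewed = 3.0 + 2.0 * x + rng.exponential(1.0, size=n)

for label, y in [("normal errors", y_normal), ("skewed errors", y_skewed)]:
    _, residuals = fit_ols(x, y)
    stat, skew, kurt = jarque_bera(residuals)
    print(f"{label}: JB={stat:.1f}, skew={skew:.2f}, excess kurtosis={kurt:.2f}")
```

Under normality the statistic roughly follows a chi-square distribution with 2 degrees of freedom, so values above about 5.99 are unusual at the 5% level. A histogram (`plt.hist(residuals, bins=30)`) or a Q-Q plot of the same residuals gives the visual version of this check.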
We can see above that the error terms are almost normally distributed around zero, which is good. At the same time, there are some clearly visible values in the right tail, which are extreme values. This means some outliers are affecting our regression analysis. The R² score, which we discussed in the previous blog, was also not very high, and the two may be related.
No Autocorrelation- Residuals should be independent of each other
The error terms should be independent of each other, which means the value of one error term does not depend on any other error term. We verify this with the residual vs prediction graph. If we do not find any pattern and the points on the graph are scattered randomly, we can say the residuals are independent. But if the points show some pattern, the residuals are somehow dependent on each other and the regression model has failed to capture some important pattern in the data; we may need to recheck the model, or the data may not be suited to linear models. Another reason a model may violate this assumption is that the data is a time series, in which the values of the target variable depend on its earlier values. We treat time series data differently. If a DateTime column is present, we can plot the residuals against it to check for any time-series-related problem. We can apply a model such as ARIMA to this type of data. Below is the Python snippet to check the same.
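As a numeric complement to the residual plot, the sketch below computes the Durbin-Watson statistic, a standard measure of first-order autocorrelation that is not used elsewhere in this post. The residuals are simulated: one set is independent and the other follows an AR(1) process with a coefficient of 0.8, which is an arbitrary choice for the example.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: about 2 for uncorrelated residuals,
    towards 0 for positive and towards 4 for negative autocorrelation."""
    diff = np.diff(residuals)
    return float(np.sum(diff**2) / np.sum(residuals**2))

rng = np.random.default_rng(4)
n = 500

# Independent residuals.
independent = rng.normal(size=n)

# AR(1) residuals: each value carries over 0.8 of the previous one.
autocorrelated = np.empty(n)
autocorrelated[0] = rng.normal()
for t in range(1, n):
    autocorrelated[t] = 0.8 * autocorrelated[t - 1] + rng.normal()

dw_independent = durbin_watson(independent)
dw_autocorrelated = durbin_watson(autocorrelated)
print(f"independent residuals:    DW = {dw_independent:.2f}")
print(f"autocorrelated residuals: DW = {dw_autocorrelated:.2f}")
```

The statistic only looks at neighbouring residuals, so the rows must be in a meaningful order (for example sorted by time) for it to say anything useful.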
In the above plot, we see that the error terms are independent of each other: we cannot draw any single line or pattern that explains the data points on the graph. Again we see some higher values on the upper side, which also appeared in the distribution plot, so it is now clear that our regression analysis is skewed by the presence of outliers. To improve it, we can apply a transformation to the target variable.
Homoscedasticity- Error terms have constant variance
This means the error terms should have a constant variance across the data points. If the variance of the errors changes across the data, a common cause is that the target variable is not a linear function of the independent variables, and the model is not robust. Its performance is not reliable: the spread of the errors depends on which part of the data we look at, so choosing a representative sample to test on is difficult. The predictions of such a model can be much less accurate in some ranges than in others. This may happen when the model is fitted to nonlinear data. Another very serious consequence is that the estimated standard errors, and therefore the confidence intervals, become unreliable, which leads to wrong statistical inferences from the model. In time-series data this problem is likely to occur due to time-dependent factors such as inflation, compounding and seasonality. We can try to reduce the heteroscedasticity by choosing some regularized models and also by applying some nonlinear transformations to the independent variables or the target. Below, Python code shows how to check this assumption.
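One way to put a number on this, sketched below on synthetic data, is the Breusch-Pagan statistic in Koenker's form: regress the squared residuals on the inputs and multiply the resulting R² by the number of rows. This test is an addition to the plot-based check described here, and the helper names `fit_ols` and `breusch_pagan_lm` are made up for the example.

```python
import numpy as np

def fit_ols(x, y):
    """Least-squares fit with an intercept; returns fitted values and residuals."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    return fitted, y - fitted

def breusch_pagan_lm(x, residuals):
    """Koenker's form of the Breusch-Pagan statistic: n * R^2 from
    regressing the squared residuals on the regressors."""
    squared = residuals**2
    _, aux_resid = fit_ols(x, squared)
    r2 = 1.0 - np.sum(aux_resid**2) / np.sum((squared - squared.mean())**2)
    return float(len(squared) * r2)

rng = np.random.default_rng(5)
n = 500
x = rng.uniform(1, 10, size=n)

# Error spread is the same everywhere (homoscedastic).
y_constant = 1.0 + 2.0 * x + rng.normal(0, 1, size=n)
# Error spread grows with x (heteroscedastic).
y_growing = 1.0 + 2.0 * x + rng.normal(0, 1, size=n) * x

for label, y in [("constant variance", y_constant), ("growing variance", y_growing)]:
    _, residuals = fit_ols(x, y)
    print(f"{label}: LM = {breusch_pagan_lm(x, residuals):.1f}")
```

With a single input the statistic is compared against a chi-square distribution with 1 degree of freedom, so values above about 3.84 point to non-constant variance at the 5% level.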
In the above plot, we can see that the presence of some extreme outliers introduces some heteroscedasticity, as the variance is not constant. We can revise our regression analysis to improve it. We have not removed any outliers, so that we can see how our analysis is affected by them. For linear models, it is always better to treat these outliers.
No Multicollinearity- Independent variables should not be highly correlated
Multicollinearity happens when the independent variables are correlated with each other. Violating this assumption does not necessarily degrade the predictive performance of the model, but it makes the model difficult to analyze and draw inferences from. If we have two highly correlated features, how will we decide which variable explains the variation in the target? We check this assumption to make our analysis clearer. To verify it more accurately, we compute the VIF (variance inflation factor) of each variable. A high VIF means the variable is strongly collinear with the other variables. Generally, we try to keep the VIF below 10, but there is no strict rule for this; it also depends on the nature of the data. To fix multicollinearity, we can combine the related variables into a new feature. Another way is to drop the related variables and keep only the one that is most correlated with the target variable.
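The VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on all the other independent variables. A minimal sketch with NumPy on synthetic data is shown below; the function name `vif` and the columns `x1`, `x2`, `x3` are assumptions made for the example.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array X.
    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(6)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.1, size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)                # unrelated to the others
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"VIF({name}) = {value:.1f}")
```

Here `x1` and `x2` get very large values while `x3` stays close to 1. If statsmodels is installed, `variance_inflation_factor` from `statsmodels.stats.outliers_influence` computes the same quantity; note that it expects the intercept column to be included in the input matrix.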
In the heatmap, we can see that there is no very high correlation between the independent variables, so we are good on this point. We can also draw a pair plot of the variables to check multicollinearity, but it is difficult to read if we have a large number of variables.
In this article, we saw the assumptions behind linear models and why it is important to check them. We also saw Python code to check these assumptions. You can find the complete notebook for the code used in this blog in the GitHub repository. Hope this will make our regression analysis more robust and insightful. 😄
There are some regularized linear models, such as Ridge and Lasso regression, which can help to create a more robust linear model. The idea behind these models is to add a penalty on the size of the coefficients to the error being minimized, which helps avoid overfitting and makes the fit less sensitive to noise in the data. It's worth looking at them.
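To give a feel for the penalty, here is a small sketch of ridge regression in its closed form, written with NumPy on synthetic data with two nearly identical inputs. The function name `ridge_fit` and the value `alpha=10.0` are assumptions chosen for illustration. Lasso has no closed-form solution, so it is not shown.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression on centred data.
    The intercept is not penalized; alpha=0 gives ordinary least squares."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    k = X.shape[1]
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(k), Xc.T @ yc)
    intercept = y_mean - X_mean @ coef
    return float(intercept), coef

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.1, size=n)   # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(0, 1, size=n)

_, coef_ols = ridge_fit(X, y, alpha=0.0)
_, coef_ridge = ridge_fit(X, y, alpha=10.0)
print("OLS coefficients:  ", np.round(coef_ols, 2))
print("Ridge coefficients:", np.round(coef_ridge, 2))
```

The ridge coefficients always have a smaller overall size than the unpenalized ones, and with correlated inputs like these they tend to be spread more evenly across the two columns. In practice, `Ridge` and `Lasso` from `sklearn.linear_model` are the usual starting point, with the penalty strength chosen by cross-validation.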