Introduction to Linear Regression
Linear regression is one of the first models we meet when we start our machine learning journey. It is a simple yet elegant model for predicting a continuous target variable. Before jumping into the algorithm, let's understand what the term regression means. Regression is a statistical method for establishing a relationship between independent variables (inputs) and a dependent variable (target). It is widely used in forecasting and hypothesis testing.
Linear regression models the relationship between the dependent variable and the independent variables as linear: it assumes the target can be written as a weighted sum of the inputs plus a constant.
There are two types of regression we will discuss:
- Simple linear regression.
- Multiple linear regression.
In simple linear regression, we try to establish a relationship between one input and a target variable.
In multiple linear regression, we try to establish a relation between many inputs and a target variable.
In simple linear regression, we fit a line through the data points that minimizes the residual sum of squares. In multiple linear regression, we fit a hyperplane that minimizes the residual sum of squares. The equations for linear regression are:
y = β0 + β1.X1                          # Simple linear regression
y = β0 + β1.X1 + β2.X2 + ... + βn.Xn    # Multiple linear regression
β0 is the intercept and β1 to βn are the slopes (coefficients). The intercept is the value of the target variable when all independent variables are 0. A slope tells us how much the dependent variable changes for a unit change in the corresponding independent variable, holding the other independent variables constant.
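To make the two equations and the meaning of intercept and slope concrete, here is a minimal sketch in plain Python. The function names and the coefficient values are made up for illustration; they are not taken from any fitted model.

```python
def predict_simple(x1, b0, b1):
    """Simple linear regression: y = b0 + b1 * x1."""
    return b0 + b1 * x1


def predict_multiple(xs, b0, betas):
    """Multiple linear regression: y = b0 + b1*x1 + ... + bn*xn."""
    if len(xs) != len(betas):
        raise ValueError("need exactly one coefficient per input")
    return b0 + sum(b * x for b, x in zip(betas, xs))


# Hypothetical coefficients, chosen only for illustration.
b0, b1 = 2.0, 0.5

# Intercept: the prediction when the input is 0.
print(predict_simple(0, b0, b1))

# Slope: the change in the prediction for a one-unit change in the input.
print(predict_simple(4, b0, b1) - predict_simple(3, b0, b1))

# Two inputs with hypothetical slopes 0.5 and -1.0.
print(predict_multiple([1.0, 2.0], b0, [0.5, -1.0]))
```
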
The best fit line.
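The best fit line is the one for which the residual sum of squares is minimum. A residual is the difference between the actual and the predicted value of an observation, and RSS (Residual Sum of Squares) is the sum of the squared residuals over all observations.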
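As a rough illustration, the sketch below computes RSS for two candidate lines on a tiny made-up dataset. The data and the helper names are assumptions for the example only; the line with the lower RSS is the better fit of the two.

```python
def residual_sum_of_squares(y_true, y_pred):
    """Sum of squared differences between actual and predicted values."""
    return sum((actual - predicted) ** 2 for actual, predicted in zip(y_true, y_pred))


# Toy data, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 10.0]


def line_predictions(b0, b1):
    """Predictions of the line y = b0 + b1 * x on the toy inputs."""
    return [b0 + b1 * xi for xi in x]


rss_a = residual_sum_of_squares(y, line_predictions(1.0, 2.0))  # candidate y = 1 + 2x
rss_b = residual_sum_of_squares(y, line_predictions(0.0, 2.0))  # candidate y = 2x

print("RSS for y = 1 + 2x:", rss_a)
print("RSS for y = 2x:", rss_b)
```
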
We choose the line that gives the lowest RSS value. This approach is called the Ordinary Least Squares (OLS) method. OLS has a closed-form solution, but the intercept and slopes can also be found iteratively with the Gradient Descent method. The idea behind gradient descent is to move step by step in the opposite direction of the gradient of the error until we reach the point with the lowest error value. We will cover this in detail in a later article. For now, I have given a link to a great video by Joshua Starmer.
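Here is a minimal sketch of gradient descent for simple linear regression, written in plain Python. It minimizes the mean of the squared residuals (RSS divided by the number of observations, which has the same minimizer as RSS). The learning rate, step count, toy data, and function names are illustrative assumptions, and the closed-form least squares fit is included only as a cross-check.

```python
def fit_gradient_descent(x, y, learning_rate=0.05, n_steps=5000):
    """Fit y = b0 + b1 * x by gradient descent on the mean squared residual."""
    n = len(x)
    b0, b1 = 0.0, 0.0
    for _ in range(n_steps):
        residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        # Partial derivatives of mean((y - b0 - b1*x)^2) w.r.t. b0 and b1.
        grad_b0 = -2.0 / n * sum(residuals)
        grad_b1 = -2.0 / n * sum(r * xi for r, xi in zip(residuals, x))
        # Step in the direction opposite to the gradient.
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


def fit_closed_form(x, y):
    """Least squares solution for one input: slope = cov(x, y) / var(x)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data lying exactly on y = 1 + 2x, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

gd_b0, gd_b1 = fit_gradient_descent(x, y)
cf_b0, cf_b1 = fit_closed_form(x, y)
print("gradient descent:", gd_b0, gd_b1)
print("closed form:     ", cf_b0, cf_b1)
```

If the learning rate is too large, the updates overshoot and diverge; the value used here is small enough for this particular toy dataset and would need tuning for other data.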
Is our model performing well?
For now, we will talk about the Coefficient of Determination to keep this discussion simple. We will cover all evaluation metrics in detail very soon. Stay tuned for it. 😜
The R2 score, or Coefficient of Determination, is a metric that tells us what portion of the variation in the target is explained by the model. It measures the goodness of fit of the model, and a higher score is better. Ideally, the R2 score lies between 0 and 1. If the model performs worse than a baseline that always predicts the mean, the value goes below 0, which usually points to a problem with the model fit. The R2 score is computed as R2 = 1 - RSS/TSS, where TSS is the Total Sum of Squares. Here RSS is the sum of squared differences between the actual and predicted values. TSS is the sum of squared differences between the actual values and the mean of the target variable, so it is the RSS of the baseline model. In other words, we are comparing the squared residuals of our model with those of the baseline model.
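The definition can be sketched in a few lines of plain Python. The function name `r_squared` and the toy numbers are assumptions for illustration; the three prediction sets show a good fit, the mean baseline (R2 of 0), and a fit that is worse than the baseline (negative R2).

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((actual - predicted) ** 2 for actual, predicted in zip(y_true, y_pred))
    tss = sum((actual - mean_y) ** 2 for actual in y_true)
    return 1.0 - rss / tss


# Toy target values, made up for illustration.
y = [3.0, 5.0, 7.0, 10.0]
mean_y = sum(y) / len(y)

good_fit = r_squared(y, [3.0, 5.0, 7.0, 9.0])    # close to the actual values
baseline = r_squared(y, [mean_y] * len(y))       # always predicts the mean
bad_fit = r_squared(y, [10.0, 7.0, 5.0, 3.0])    # worse than predicting the mean

print("good fit:", good_fit)
print("baseline:", baseline)
print("bad fit: ", bad_fit)
```
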
Let us see this algorithm in action
We will be using the Python programming language for the implementation.
Steps we will follow to code:
1. Import the required libraries and data
2. Prepare the data
3. Split the data and perform scaling
4. Train, predict, and evaluate using the statsmodels OLS method
5. Train, predict, and evaluate using the sklearn LinearRegression method
6. Model interpretation
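The steps above can be sketched roughly as follows for the scikit-learn path only. This sketch uses synthetic data rather than the article's dataset, so its coefficients will not match the article's results. The choice of `MinMaxScaler`, the split ratio, and the random seeds are also assumptions, since the steps do not specify them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Steps 1-2: create and prepare the data (synthetic, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

# Step 3: split first, then fit the scaler on the training data only
# so that no information from the test set leaks into training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 5: train, predict, and evaluate with sklearn's LinearRegression.
model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
test_r2 = r2_score(y_test, y_pred)

# Step 6: the intercept and coefficients are what we interpret.
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("test R2:", test_r2)
```
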
The final equation of the fitted linear regression model is:
y = 3295125.11*area + 506109.44*bedrooms + 1874073.79*bathrooms + 1385770.25*stories + 421368.74*mainroad + 253819.75*guestroom + 314233.13*basement + 1021073.02*hotwaterheating + 794828.15*airconditioning + 833078.62*parking + 617166.58*prefarea - 47533.58*furnishingstatus_semi-furnished - 431147.75*furnishingstatus_unfurnished + 2105949.005
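To read this equation, it can help to evaluate it directly. The sketch below copies the coefficients and intercept from the equation into a dictionary; the `predict` helper is hypothetical, and it assumes the inputs have already been preprocessed (scaled and encoded) exactly as they were for training, which is not detailed here. It shows that the prediction equals the intercept when every input is 0, and that each coefficient is the change in the prediction per unit change of its input.

```python
# Coefficients and intercept copied from the final equation.
COEFFICIENTS = {
    "area": 3295125.11,
    "bedrooms": 506109.44,
    "bathrooms": 1874073.79,
    "stories": 1385770.25,
    "mainroad": 421368.74,
    "guestroom": 253819.75,
    "basement": 314233.13,
    "hotwaterheating": 1021073.02,
    "airconditioning": 794828.15,
    "parking": 833078.62,
    "prefarea": 617166.58,
    "furnishingstatus_semi-furnished": -47533.58,
    "furnishingstatus_unfurnished": -431147.75,
}
INTERCEPT = 2105949.005


def predict(features):
    """Evaluate the equation; inputs that are not supplied are treated as 0.

    Assumes the values are already preprocessed the same way as the
    training data (an assumption; the preprocessing is not shown here).
    """
    unknown = set(features) - set(COEFFICIENTS)
    if unknown:
        raise KeyError(f"unknown feature(s): {sorted(unknown)}")
    return INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in features.items())


# With every input at 0, the prediction is the intercept.
print(predict({}))

# Changing one input by one unit changes the prediction by its coefficient.
print(predict({"airconditioning": 1}) - predict({"airconditioning": 0}))
```
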
Linear regression is a simple yet powerful technique for predicting a continuous variable. Because of its linear nature, it is easy for business people to understand, which is why it is widely used in industry. You can code it with statsmodels, which gives detailed insight into the model, or use the sklearn package to get started quickly. Both fit essentially the same model. Here is the link for the GitHub repository having the required files to get started with.
Where to go next?
The next important thing to learn about linear models is their assumptions. I have covered the assumptions in detail in the next blog, and I strongly suggest going through it. Checking these assumptions is important for building a robust linear model.
I hope you liked the explanation. I tried to keep things simple. 😋