Moving from a Linear to a Logistic function.
In the previous blog, we learned about Linear Regression, which we used to predict a continuous target variable. But what if the target is discrete? In that case, we use a model called Logistic Regression. It is a modelling algorithm used for classification problems.
Let us clarify what continuous and discrete mean with the representation below.
The main point to note here is that we look at the target variable only: if it is continuous, use Linear Regression; if it is discrete, use Logistic Regression. The nature of the input data does not matter for this choice.
Example: if we want to predict how many millimetres of rainfall there will be tomorrow, we use a regression model, since the output can be 1 mm, 1.2 mm, 1.5 mm or 2 mm. But if we ask the same question as "is it going to rain tomorrow?", we use a classification model, since the output is either Yes or No.
In Linear Regression we fit the line using "least squares", while in Logistic Regression we use the concept of maximum likelihood. Now that we are clear that Logistic Regression solves classification problems, why is the term "Regression" in its name? We will try to answer this in this blog as well.
Why can we not use a straight line as a decision boundary? The answer is very intuitive: it is prone to outliers.
As we can see, the blue line was a better decision boundary, but once a few outliers were introduced into the data, the boundary shifted, leading to a higher miss rate. Hence a simple linear boundary is not an ideal choice. Moreover, the range of the y-axis is negative infinity to positive infinity, so we need a way to limit this value to a certain range, and one such elegant function is the Sigmoid function.
When we have two output categories, we call it binary Logistic Regression. Example: Yes-No, High-Low.
If we have more than two output categories, we call it multinomial Logistic Regression. Example: Yes-No-Maybe, High-Medium-Low, 1–2–3–4–5.
It maps values from the infinite range (negative infinity to positive infinity) into the range 0 to 1, which helps us to interpret them as probabilities.
Pass any large positive value to the Sigmoid equation and it pushes the output towards 1; pass any large negative value and it pushes the output towards 0. This is a perfect function for getting a probability value out of a continuous variable. Now we can set a decision rule: if the value is greater than, say, 0.5, it is the positive class; otherwise it is the negative class. In the graph below we can see that the output of the logistic function lies between 0 and 1, representing the probability value of y.
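A minimal sketch of the Sigmoid function and the 0.5 decision rule in Python (the input values here are only illustrative):

```python
import numpy as np

def sigmoid(z):
    """Map any real value into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# Large positive inputs approach 1, large negative inputs approach 0.
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0

# Decision rule: probability above 0.5 -> positive class, else negative.
probs = sigmoid(np.array([-2.0, 0.3, 4.0]))
labels = (probs > 0.5).astype(int)
print(labels)  # [0 1 1]
```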
In the Sigmoid equation, we plug in values of β0 and β1 to get the curve, but which values should we use? We choose them in such a way that the likelihood is maximised.
The likelihood is defined as the product of the probabilities of all the observations.
Now, for different values of β0 and β1, we calculate the likelihood and choose the pair that maximises it.
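A toy illustration of this idea, using made-up data and a crude grid over candidate coefficients (the data, the grid and its ranges are assumptions for this sketch; real implementations use gradient-based optimisation instead of a grid search):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def likelihood(b0, b1, x, y):
    """Product of the predicted probabilities of the observed labels."""
    p = sigmoid(b0 + b1 * x)
    return np.prod(np.where(y == 1, p, 1 - p))

# Toy data: small x values tend to be class 0, large x values class 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0, 0, 0, 1, 1])

# Try many (b0, b1) pairs and keep the one with the highest likelihood.
best = max(
    ((b0, b1) for b0 in np.linspace(-10, 10, 41)
              for b1 in np.linspace(-5, 5, 21)),
    key=lambda b: likelihood(b[0], b[1], x, y),
)
print(best)
```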
In practice, instead of maximising the likelihood directly, we minimise an equivalent cost function using the gradient descent method. This cost function is the negative log-likelihood (also called log-loss) and can be written as:
J(β) = -(1/n) Σ [ yᵢ ln(pᵢ) + (1 - yᵢ) ln(1 - pᵢ) ]
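In code, the cross-entropy cost and a plain gradient-descent loop look roughly like this (the toy data, learning rate and iteration count are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(b0, b1, x, y):
    """Average negative log-likelihood (cross-entropy) over observations."""
    p = sigmoid(b0 + b1 * x)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Same toy data as before.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

b0, b1, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    p = sigmoid(b0 + b1 * x)
    # Gradients of the average cross-entropy with respect to b0 and b1.
    b0 -= lr * np.mean(p - y)
    b1 -= lr * np.mean((p - y) * x)

print(cost(b0, b1, x, y))  # lower than the starting cost
```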
One thing to notice in Logistic Regression is that it is difficult to establish a direct relationship between the input and output variables. To simplify this, we use a quantity called the odds, defined as P/(1-P). The useful fact is that the log base e (ln) of the odds is a linear function of the input variable:
ln(P/(1-P)) = β0 + β1x, where P is the output of the Sigmoid equation
Let us derive it very quickly.
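Writing the Sigmoid output as P and rearranging:

P = 1 / (1 + e^(-(β0 + β1x)))
1/P = 1 + e^(-(β0 + β1x))
(1 - P)/P = e^(-(β0 + β1x))
P/(1 - P) = e^(β0 + β1x)
ln(P/(1 - P)) = β0 + β1x

So the log-odds is linear in x, which is why the name carries the term "Regression".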
Let us see this algorithm in action.
We will be using Python programming language for code implementation.
1. Import the required libraries and data
2. Split the data and perform Scaling
3. Train, predict and evaluate using the statsmodels GLM method.
4. Train, predict and evaluate using the sklearn LogisticRegression method.
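The steps above can be sketched with sklearn as follows (the dataset here is synthetic, since the notebook's actual data is not shown; the statsmodels step is analogous, fitting `sm.GLM` with a Binomial family):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Data (synthetic stand-in for the notebook's dataset)
X, y = make_classification(n_samples=500, n_features=4, random_state=42)

# 2. Split the data and perform scaling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4. Train, predict, evaluate with sklearn's LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)
print(accuracy_score(y_test, pred))
```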
You can find the complete notebook on my GitHub repository.
In this blog, we covered the Logistic Regression model, which is used for classification problems. We also saw how it works internally and what its cost function looks like. I am sure we can now answer why it is called Logistic Regression 😉. We also saw how to fit the best curve and why a linear model is not a good option for classification problems. One part missing here is model evaluation, which we will cover in the next blog.