(**Note:** This is a post attempting to explain the intuition behind Logistic Regression to readers NOT well acquainted with statistics. Therefore, you may not find any rigorous mathematical work in here.)

*Logistic Regression is a type of classification algorithm involving a linear discriminant.* What do I mean by that?

1. Unlike actual regression, logistic regression does not try to predict the value of a numeric variable given a set of inputs. Instead, the output is a **probability** that the given input point belongs to a certain class. For simplicity, let’s assume that we have only two classes (for multiclass problems, you can look at Multinomial Logistic Regression), and the probability in question is $P_+$ -> the probability that a certain data point belongs to the ‘$+$’ class. Of course, $P_- = 1 - P_+$. Thus, the output of Logistic Regression always lies in [0, 1].

2. The central premise of Logistic Regression is the assumption that your input space can be separated into two nice ‘regions’, one for each class, by a **linear** (read: straight) **boundary**. So what does a ‘linear’ boundary mean? For two dimensions, it’s a straight line, no curving. For three dimensions, it’s a plane. And so on. This boundary will of course be decided by your input data and the learning algorithm. But for this to make sense, it is clear that the data points MUST be separable into the two aforementioned regions by a linear boundary. If your data points do satisfy this constraint, they are said to be *linearly separable*. Look at the image below.

This dividing plane is called a **linear discriminant**, because 1. it’s linear in terms of its function, and 2. it helps the model ‘discriminate’ between points belonging to different classes.

(Note: If your points aren’t linearly separable in the original concept space, you could consider converting the feature vectors into a higher dimensional space by adding dimensions of interaction terms, higher degree terms, etc. Such usage of a linear algorithm in a higher dimensional space gives you some benefits of non-linear function learning, since the boundary would be non-linear if plotted back in the original input space.)
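To make the note above concrete, here is a minimal sketch of the idea. Everything in it is an assumption made for illustration: the toy points, the helper names `expand` and `boundary`, and the hand-picked coefficients (nothing is learned here). Points inside a circle of radius 2 cannot be split from points outside it by a straight line in the original space, but after adding squared terms a linear function of the new features does the job.

```python
# Hypothetical example: a circular boundary x1^2 + x2^2 = 4 in the original space
# becomes a straight line z1 + z2 = 4 in the expanded space (z1 = x1^2, z2 = x2^2).

def expand(x1, x2):
    """Map an original point (x1, x2) to the higher-degree features (x1^2, x2^2)."""
    return (x1 * x1, x2 * x2)

def boundary(z1, z2, beta0=-4.0, beta1=1.0, beta2=1.0):
    """A function that is linear in the expanded features (coefficients picked by hand)."""
    return beta0 + beta1 * z1 + beta2 * z2

points = [(0.0, 0.0), (1.0, 1.0), (3.0, 0.0), (-2.0, 2.5)]
for x1, x2 in points:
    value = boundary(*expand(x1, x2))
    side = "+" if value > 0 else "-"
    print((x1, x2), value, side)
```

The boundary is still linear as far as the algorithm is concerned; it only looks curved when drawn back in the original two input variables.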

==========X===========

*But how does Logistic Regression use this linear boundary to quantify the probability of a data point belonging to a certain class?*

First, let’s try to understand the geometric implication of this ‘division’ of the input space into two distinct regions. Assuming two input variables for simplicity (unlike the 3-dimensional figure shown above), $x_1$ and $x_2$, the function corresponding to the boundary will be something like

$\beta_0 + \beta_1 x_1 + \beta_2 x_2$.

The boundary itself is the set of points where this function equals zero. (It is crucial to note that $x_1$ and $x_2$ are BOTH input variables, and the output variable isn’t a part of the conceptual space, unlike a technique like linear regression.)

Consider a point $(a, b)$. Plugging the values $x_1 = a$ and $x_2 = b$ into the boundary function, we will get its output $\beta_0 + \beta_1 a + \beta_2 b$. Now depending on the *location* of $(a, b)$, there are three possibilities to consider-

1. $(a, b)$ lies in the region defined by points of the $+$ class. As a result, $\beta_0 + \beta_1 a + \beta_2 b$ will be positive, lying somewhere in $(0, \infty)$. Mathematically, the higher the magnitude of this value, the greater is the distance between the point and the boundary. Intuitively speaking, the greater is the probability that $(a, b)$ belongs to the $+$ class. Therefore, $P_+$ will lie in (0.5, 1].

2. $(a, b)$ lies in the region defined by points of the $-$ class. Now, $\beta_0 + \beta_1 a + \beta_2 b$ will be negative, lying in $(-\infty, 0)$. But like in the positive case, the higher the absolute value of the function output, the greater the probability that $(a, b)$ belongs to the $-$ class. $P_+$ will now lie in [0, 0.5).

3. $(a, b)$ lies ON the linear boundary. In this case, $\beta_0 + \beta_1 a + \beta_2 b = 0$. This means that the model cannot really say whether $(a, b)$ belongs to the $+$ or $-$ class. As a result, $P_+$ will be exactly 0.5.
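The three cases above can be sketched in a few lines. This is only an illustration under assumed values: the coefficients (-1, 1, 1) are made up, so the boundary here is the line x1 + x2 = 1, and the function names are hypothetical.

```python
def boundary_value(a, b, beta0=-1.0, beta1=1.0, beta2=1.0):
    """Plug the point (a, b) into the (assumed) boundary function."""
    return beta0 + beta1 * a + beta2 * b

def describe(value):
    """Translate the sign of the boundary function into the three cases."""
    if value > 0:
        return "+ region: probability of the + class is above 0.5"
    if value < 0:
        return "- region: probability of the + class is below 0.5"
    return "on the boundary: probability of the + class is exactly 0.5"

for point in [(2.0, 2.0), (0.0, 0.0), (0.5, 0.5)]:
    value = boundary_value(*point)
    print(point, value, describe(value))
```

Note that the sign alone only tells you which side of the line the point is on; turning the magnitude into an actual probability needs one more ingredient.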

Great. So now we have a function that outputs a value in $(-\infty, \infty)$ given an input data point. But how do we map this to the probability $P_+$, which goes from 0 to 1? The answer is in the **odds** function.

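Let $P(X)$ denote the probability of an event $X$ occurring. In that case, the odds ratio ($OR(X)$) is defined by $\frac{P(X)}{1 - P(X)}$, which is essentially the ratio of the probability of the event happening, vs. it not happening. (Strictly speaking, this quantity is just called the *odds*; the term ‘odds ratio’ usually refers to a ratio of two odds.) It is clear that probability and odds convey the exact same information. But as $P(X)$ goes from 0 to 1, $OR(X)$ goes from 0 to $\infty$.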

However, we are still not quite there yet, since our boundary function gives a value from $-\infty$ to $\infty$. So what we do is take the **logarithm** of $OR(X)$, called the **log-odds function**. Mathematically, as $OR(X)$ goes from 0 to $\infty$, $\log(OR(X))$ goes from $-\infty$ to $\infty$!
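If you want to see these ranges for yourself, here is a small sketch. The function names `odds` and `log_odds` are just labels chosen for this example, and the probabilities in the loop are arbitrary.

```python
import math

def odds(p):
    """Odds of an event with probability p (defined for 0 <= p < 1)."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural logarithm of the odds (defined for 0 < p < 1)."""
    return math.log(odds(p))

# As p moves from near 0 to near 1, the odds sweep over (0, infinity)
# and the log-odds sweep over (-infinity, infinity).
for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(p, odds(p), log_odds(p))
```

A probability of 0.5 gives odds of 1 and log-odds of 0, which is exactly the ‘sitting on the boundary’ case.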

So we finally have a way to interpret the result of plugging in the attributes of an input into the boundary function. The boundary function actually defines the log-odds of the $+$ class, in our model. So essentially, in our two-dimensional example, given a point $(a, b)$, this is what Logistic Regression would do-

**Step 1**. Compute the boundary function (alternatively, the log-odds function) value, $\beta_0 + \beta_1 a + \beta_2 b$. Let’s call this value $t$ for short.

**Step 2**. Compute the Odds Ratio, by doing $OR_+ = e^t$. (Since $t$ is the logarithm of $OR_+$).

**Step 3**. Knowing $OR_+$, it would compute $P_+$ using the simple mathematical relation

$$P_+ = \frac{OR_+}{1 + OR_+}.$$

There you go! In fact, once you know $t$ from step 1, you can combine steps 2 and 3 to give you

$$P_+ = \frac{e^t}{1 + e^t}$$

The RHS of the above equation is called the **logistic function**. Hence the name given to this model of learning :-).
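Here is how those three steps might look in code. It is a minimal sketch, not how any particular library does it: the coefficient values passed in below are made up, and `predict_proba` and `logistic` are names picked for this example.

```python
import math

def predict_proba(a, b, beta0, beta1, beta2):
    """Probability of the + class for the point (a, b), following the three steps."""
    t = beta0 + beta1 * a + beta2 * b      # Step 1: boundary function = log-odds
    odds_plus = math.exp(t)                # Step 2: odds = e^t
    return odds_plus / (1.0 + odds_plus)   # Step 3: probability from odds

def logistic(t):
    """Steps 2 and 3 combined; algebraically equal to e^t / (1 + e^t)."""
    return 1.0 / (1.0 + math.exp(-t))

# Assumed coefficients (-1, 1, 1): the boundary is the line x1 + x2 = 1.
for point in [(2.0, 2.0), (0.5, 0.5), (0.0, 0.0)]:
    print(point, predict_proba(*point, -1.0, 1.0, 1.0))
```

One practical caveat: `math.exp` overflows for very large arguments, so a careful implementation would guard against extreme values of `t`.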

==========X===========

We have now understood the intuition behind Logistic Regression, but the question remains- how does it learn the boundary function $\beta_0 + \beta_1 x_1 + \beta_2 x_2$? The mathematical working behind this is beyond the scope of this post, but here’s a rough idea:

Consider a function $g(x)$, where $x$ is a data point in the training dataset. $g(x)$ can be defined in simple terms as:

If $x$ is a part of the $+$ class, $g(x) = P_+$ (Here, $P_+$ is the output given by your Logistic Regression model). If $x$ is a part of the $-$ class, $g(x) = 1 - P_+$.

Intuitively, $g(x)$ quantifies the probability that a training point was classified *correctly* by your model. Therefore, if you average $g(x)$ over your entire training data, you would get the **likelihood** that a random data point would be classified correctly by your system, irrespective of the class it belongs to. Simplifying things a little, it is this ‘average’ that a Logistic Regression learner tries to maximize. (Strictly, what gets maximized is the product of the $g(x)$ values over the training data, or equivalently the sum of their logarithms.) The method adopted for the same is called **maximum likelihood estimation** (for obvious reasons). Unless you are a mathematician, you can do without learning *how* the optimization happens, as long as you have a good idea of *what* is being optimized – mostly because most statistics or ML libraries have inbuilt methods to get it done.
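For the curious, here is a rough sketch of *what* gets optimized. Everything in it is an assumption for illustration: the eight toy points, the label coding (1 for the + class, 0 for the - class), and the use of plain gradient ascent with a hand-picked step size. Real libraries typically use more sophisticated optimizers.

```python
import math

# Hypothetical toy training set: ((x1, x2), label); label 1 = '+' class, 0 = '-' class.
DATA = [
    ((1.0, 1.0), 0), ((1.5, 2.0), 0), ((2.0, 1.0), 0), ((3.0, 3.0), 0),
    ((2.5, 2.5), 1), ((3.0, 3.5), 1), ((3.5, 3.0), 1), ((4.0, 4.0), 1),
]

def p_plus(point, betas):
    """Model output for one point: logistic function of the boundary value."""
    t = betas[0] + betas[1] * point[0] + betas[2] * point[1]
    return 1.0 / (1.0 + math.exp(-t))

def g(point, label, betas):
    """Probability the model assigns to the point's true class."""
    p = p_plus(point, betas)
    return p if label == 1 else 1.0 - p

def average_g(betas):
    """The simplified 'average' described in the text."""
    return sum(g(x, y, betas) for x, y in DATA) / len(DATA)

def log_likelihood(betas):
    """Sum of log g(x): the quantity maximum likelihood estimation maximizes."""
    return sum(math.log(g(x, y, betas)) for x, y in DATA)

def fit(steps=500, learning_rate=0.1):
    """Plain gradient ascent on the average log-likelihood."""
    betas = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for (x1, x2), y in DATA:
            error = y - p_plus((x1, x2), betas)
            # Gradient of the log-likelihood w.r.t. each coefficient is error * feature.
            for j, feature in enumerate((1.0, x1, x2)):
                grad[j] += error * feature / len(DATA)
        betas = [b + learning_rate * d for b, d in zip(betas, grad)]
    return betas

start = [0.0, 0.0, 0.0]
fitted = fit()
print("before:", average_g(start), log_likelihood(start))
print("after: ", average_g(fitted), log_likelihood(fitted))
```

With all coefficients at zero the model says 0.5 for every point; each step then nudges the boundary so that the training points get a higher probability of landing in their own class.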

==========X===========

That’s all for now! And like all my blog posts, I hope this one helps some guy trying to Google up and learn some stuff on his own to understand the misunderstood technique of Logistic Regression. Cheers!

great way to intuitively understand. thanks

Thanks!

It is a very good explanation. But I have a question: why is log(Odds Ratio) compared or equated to the decision boundary? Is it because these two functions have the same domain and range?

Hi, very useful post, thanks for putting so much information on one page. Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more independent variables.

Nicely explained. Thank you!!

great thank you!

this is the most lucid explanation I have found, so far. keep it up!

Amazing post!

Can I use this post in my class content?

Haha sure 🙂

Unable to get anything after Odds function. Still very heavy for a weak person like me in maths.

The function you give for the boundary in the case of classification in a 2-dimensional space has one degree of freedom too many: in that case it should be a line. You can correct it by giving the boundary either as x_2 = beta_0 + beta_1 x_1, or alternatively by writing it as beta_0 + beta_1 x_1 + beta_2 x_2 = 0 (in which the coefficients will be different, of course).

https://sincrenete.blogspot.am/2017/07/logistic-regression-explained-and.html I also put here an example of Logistic regression done by R. In a case of any question I am ready. Thanks ))

Thank you! This has been very helpful in preparation for a job interview. I look forward to following future posts!

Nice Explanation. Very much useful

Great article! But I still have a question:

I can understand the hypothesis of linear expression(beta0 + beta1*x1 + beta2*x2 )

I can also understand the way to infer log-logit expression

but why can they be equal, just for the reason that they have the same range (–infty to infty)?

They aren’t exactly equivalent. It’s a modelling choice: we just ‘interpret’ the value of the boundary function as the log-odds, mainly since both functions follow the same trends with respect to the input.

Excellent explanation and very intuitive! Thank you

Thank you!

One of the very few articles that explained LogReg this intuitively! Excellent article.

Woooooooooow!!!! As clear as daylight! Thanks!

Can someone help me out on the following please

I am trying to simulate a circular decision boundary, generating a 21×21 grid of data with an x-range of -10 to 10 and a y-range of -10 to 10:

```r
seq = seq(-10, 10)
x = seq
y = seq
data = expand.grid(x = x, y = y)

# setting the probabilities to 1 where x^2 + y^2 > 25 and 0 otherwise
data$z = ifelse((data$x^2 + data$y^2) > 25, 1, 0)

# flipping two values to avoid perfect separation
data[158,]$z = 1
data[441,]$z = 0

# generating the quadratic model
m = glm(data = data, z ~ I(data$x^2) + I(data$y^2), binomial(link = "logit"), maxit = 100)
```

Here are the coefficients:

```
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.02540    0.59213  -6.798 1.06e-11 ***
I(data$x^2)  0.15568    0.02353   6.617 3.66e-11 ***
I(data$y^2)  0.15795    0.02387   6.617 3.66e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 420.64  on 440  degrees of freedom
Residual deviance: 127.54  on 438  degrees of freedom
AIC: 133.54

Number of Fisher Scoring iterations: 100
```

If I use the predict function of R on the training data itself, the probabilities match.

```r
data_new = data.frame(data$x, data$y)
data$p = round(predict(m, data_new, type = "response"))
```

But when I try to calculate the probabilities manually using the above coefficients, the results seem to be incorrect. We expect the coefficients to give the equation of a circle, so the coefficients seem to be incorrect, but how come predict works correctly?

```r
tx = -(0.15568 * x * x + 0.15795 * y * y - 4.02540)
data$p1 = 1 / (1 + exp(tx))
```