Logistic Regression (for dummies)

(Note: This is a post attempting to explain the intuition behind Logistic Regression to readers NOT well acquainted with statistics. Therefore, you may not find any rigorous mathematical work in here.)

Logistic Regression is a type of classification algorithm involving a linear discriminant. What do I mean by that?

1. Unlike actual regression, logistic regression does not try to predict the value of a numeric variable given a set of inputs. Instead, the output is a probability that the given input point belongs to a certain class. For simplicity, let's assume that we have only two classes (for multiclass problems, you can look at Multinomial Logistic Regression), and the probability in question is P_+, the probability that a certain data point belongs to the '+' class. Of course, P_- = 1 - P_+. Thus, the output of Logistic Regression always lies in [0, 1].

2. The central premise of Logistic Regression is the assumption that your input space can be separated into two nice 'regions', one for each class, by a linear (read: straight) boundary. So what does a 'linear' boundary mean? For two dimensions, it's a straight line, with no curving. For three dimensions, it's a plane. And so on. This boundary will of course be decided by your input data and the learning algorithm. But for this to make sense, the data points MUST be separable into the two aforementioned regions by a linear boundary. If your data points satisfy this constraint, they are said to be linearly separable. Look at the image below.


This dividing plane is called a linear discriminant, because 1. it is linear in terms of its function, and 2. it helps the model 'discriminate' between points belonging to different classes.

(Note: If your points aren't linearly separable in the original feature space, you could consider converting the feature vectors into a higher-dimensional space by adding dimensions for interaction terms, higher-degree terms, etc. Such usage of a linear algorithm in a higher-dimensional space gives you some of the benefits of non-linear function learning, since the boundary would be non-linear if plotted back in the original input space.)
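To make the note above concrete, here's a minimal Python sketch with made-up weights: points inside a circle of radius 5 cannot be cut off from points outside it by any straight line in (x_1, x_2), but after adding the squared terms as extra dimensions, a function that is linear in the lifted features does the job.

```python
def lift(x1, x2):
    """Map a 2-D point into a 4-D feature space with quadratic terms."""
    return (x1, x2, x1 ** 2, x2 ** 2)

def boundary(features):
    """A function that is linear in the *lifted* features.
    Its zero set is a plane in the 4-D space, but maps back to the
    circle x1^2 + x2^2 = 25 in the original 2-D input space."""
    x1, x2, x1_sq, x2_sq = features
    return 0.0 * x1 + 0.0 * x2 + 1.0 * x1_sq + 1.0 * x2_sq - 25.0

print(boundary(lift(1, 2)))  # -20.0 (inside the circle, '-' side)
print(boundary(lift(4, 4)))  # 7.0 (outside the circle, '+' side)
print(boundary(lift(3, 4)))  # 0.0 (exactly on the boundary)
```

The weights (0, 0, 1, 1) and the radius are chosen by hand here purely for illustration; in practice they would be learned from data.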


But how does Logistic Regression use this linear boundary to quantify the probability of a data point belonging to a certain class?

First, let's try to understand the geometric implication of this 'division' of the input space into two distinct regions. Assuming two input variables for simplicity (unlike the 3-dimensional figure shown above), x_1 and x_2, the function corresponding to the boundary will be something like

\beta_0 + \beta_1 x_1 + \beta_2 x_2,

with the boundary itself being the set of points where this function equals zero.

(It is crucial to note that x_1 and x_2 are BOTH input variables, and the output variable isn't a part of the conceptual space, unlike in a technique like linear regression.)

Consider a point (a, b). Plugging these values in for x_1 and x_2 in the boundary function, we get the output \beta_0 + \beta_1 a + \beta_2 b. Now, depending on the location of (a, b), there are three possibilities to consider:

1. (a, b) lies in the region defined by points of the + class. As a result, \beta_0 + \beta_1 a + \beta_2 b will be positive, lying somewhere in (0, \infty). Mathematically, the higher the magnitude of this value, the greater the distance between the point and the boundary. Intuitively speaking, the greater the probability that (a, b) belongs to the + class. Therefore, P_+ will lie in (0.5, 1].

2. (a, b) lies in the region defined by points of the - class. Now, \beta_0 + \beta_1 a + \beta_2 b will be negative, lying in (-\infty, 0). But as in the positive case, the higher the absolute value of the function output, the greater the probability that (a, b) belongs to the - class. P_+ will now lie in [0, 0.5).

3. (a, b) lies ON the linear boundary. In this case, \beta_0 + \beta_1 a + \beta_2 b = 0. This means the model cannot really say whether (a, b) belongs to the + or the - class. As a result, P_+ will be exactly 0.5.
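The three cases above can be sketched with a toy boundary function (the coefficients below are made up purely for illustration):

```python
def boundary_value(x1, x2, b0=-4.0, b1=1.0, b2=2.0):
    """Value of the boundary function beta_0 + beta_1*x1 + beta_2*x2
    at the point (x1, x2), for hand-picked illustrative coefficients."""
    return b0 + b1 * x1 + b2 * x2

print(boundary_value(4, 2))  # 4.0  -> '+' side of the boundary, P_+ > 0.5
print(boundary_value(0, 0))  # -4.0 -> '-' side of the boundary, P_+ < 0.5
print(boundary_value(2, 1))  # 0.0  -> exactly on the boundary, P_+ = 0.5
```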

Great. So now we have a function that outputs a value in (-\infty, \infty) given an input data point. But how do we map this to the probability P_+, which lies in [0, 1]? The answer lies in the odds function.

Let P(X) denote the probability of an event X occurring. The odds of X, which we will write OR(X), are then defined by \frac{P(X)}{1-P(X)}, which is essentially the ratio of the probability of the event happening to the probability of it not happening. (Strictly speaking, this quantity is the odds, not the odds ratio; an odds ratio is a ratio of two such odds.) It is clear that probability and odds convey the exact same information. But as P(X) goes from 0 to 1, OR(X) goes from 0 to \infty.

However, we are still not quite there, since our boundary function gives a value from -\infty to \infty. So what we do is take the logarithm of OR(X), called the log-odds function. Mathematically, as OR(X) goes from 0 to \infty, log(OR(X)) goes from -\infty to \infty!
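A quick numeric sketch of the odds and log-odds mappings, using nothing beyond the definitions above:

```python
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

# Probability 0.5 maps to odds 1, and log-odds 0 (the boundary case).
print(odds(0.5))             # 1.0
print(math.log(odds(0.5)))   # 0.0

# Probabilities above 0.5 give positive log-odds; below 0.5, negative.
print(math.log(odds(0.9)) > 0, math.log(odds(0.1)) < 0)  # True True
```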

So we finally have a way to interpret the result of plugging the attributes of an input into the boundary function. The boundary function actually defines the log-odds of the + class in our model. So essentially, in our two-dimensional example, given a point (a, b), this is what Logistic Regression would do:

Step 1. Compute the boundary function (alternatively, the log-odds function) value, \beta_0 + \beta_1 a + \beta_2 b. Let's call this value t for short.

Step 2. Compute the odds, OR_+ = e^t (since t is the logarithm of OR_+).

Step 3. Knowing OR_+, compute P_+ using the simple mathematical relation

P_+ = \frac{OR_+}{1 + OR_+}.

There you go! In fact, once you know t from step 1, you can combine steps 2 and 3 to give you

P_+ = \frac{e^t}{1 + e^t}

The RHS of the above equation is called the logistic function. Hence the name given to this model of learning :-).
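Steps 1-3 can be combined into a short sketch; the coefficients and the input point below are made up for illustration:

```python
import math

def boundary(x1, x2, b0, b1, b2):
    """Step 1: the boundary function value t, i.e. the log-odds of '+'."""
    return b0 + b1 * x1 + b2 * x2

def p_plus(t):
    """Steps 2 and 3: odds from log-odds, then probability from odds."""
    odds = math.exp(t)          # Step 2: OR_+ = e^t
    return odds / (1 + odds)    # Step 3: P_+ = OR_+ / (1 + OR_+)

t = boundary(1.0, 2.0, b0=-4.0, b1=1.0, b2=2.0)
print(t)          # 1.0
print(p_plus(t))  # ~0.731, the logistic function evaluated at t = 1
```

Note that `p_plus` is exactly the logistic function e^t / (1 + e^t), just written in two steps to mirror the derivation.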


We have now understood the intuition behind Logistic Regression, but the question remains: how does it learn the boundary function \beta_0 + \beta_1 x_1 + \beta_2 x_2? The mathematical working behind this is beyond the scope of this post, but here's a rough idea:

Consider a function g(x), where x is a data point in the training dataset. g(x) can be defined in simple terms as:

If x is a part of the + class, g(x) = P_+ (Here, P_+ is the output given by your Logistic Regression model). If x is a part of the - class, g(x) = 1 - P_+.

Intuitively, g(x) quantifies the probability that a training point was classified correctly by your model. Therefore, if you average g(x) over your entire training data, you get the likelihood that a random data point would be classified correctly by your system, irrespective of the class it belongs to. Simplifying things a little, it is this 'average' g(x) that a Logistic Regression learner tries to maximize. The method adopted is called maximum likelihood estimation (for obvious reasons). Unless you are a mathematician, you can do without learning how the optimization happens, as long as you have a good idea of what is being optimized, mostly because most statistics or ML libraries have built-in methods to get it done.
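For the curious, here's a rough sketch of that optimization on a tiny made-up dataset: gradient ascent on the summed logarithm of g(x), which is the standard route to the maximum-likelihood estimate. This is an illustration of the idea, not production code.

```python
import math

# Tiny made-up 1-feature dataset: label 1 ('+') tends to go with larger x.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

def p_plus(b0, b1, x):
    """P_+ for input x: the logistic function of the log-odds b0 + b1*x."""
    t = b0 + b1 * x
    return 1.0 / (1.0 + math.exp(-t))

# g(x) is P_+ for '+' points and 1 - P_+ for '-' points; we ascend the
# gradient of sum(log g(x)), which for the logistic model works out to
# the simple residual form (y - P_+) used below.
b0, b1 = 0.0, 0.0
lr = 0.1
for _ in range(2000):
    g0 = sum(y - p_plus(b0, b1, x) for x, y in data)        # d/db0
    g1 = sum((y - p_plus(b0, b1, x)) * x for x, y in data)  # d/db1
    b0 += lr * g0
    b1 += lr * g1

# The learned slope is positive: larger x means higher P_+.
print(b1 > 0)
print(p_plus(b0, b1, 2.0) > 0.5, p_plus(b0, b1, -2.0) < 0.5)
```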


That's all for now! And like all my blog posts, I hope this one helps someone trying to Google up and learn some stuff on their own, understand the misunderstood technique of Logistic Regression. Cheers!

29 thoughts on “Logistic Regression (for dummies)”

  1. Hi, very useful post, thanks for putting so much information on one page. Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary); it is a predictive analysis, used to describe data and to explain the relationship involving one dependent binary variable.

  2. Unable to get anything after the odds function. Still very heavy for someone like me who is weak at maths.

  3. The function you give for the boundary in the case of classification in a 2-dimensional space has one degree of freedom too many: in that case it should be a line. You can correct it by giving the boundary either as x_2 = beta_0 + beta_1 x_1, or alternatively by writing it as beta_0 + beta_1 x_1 + beta_2 x_2 = 0 (in which the coefficients will be different, of course).

  4. Great article! But I still have a question:

    I can understand the hypothesis of the linear expression (beta0 + beta1*x1 + beta2*x2).
    I can also understand the way to infer the log-odds expression.
    But why can they be equated, just because they have the same range (-infty to infty)?

    1. They aren’t exactly equivalent. It's just 'interpreting' the value as the log-odds, mainly since both functions follow the same trends w.r.t. the input.


  6. One of the very few articles that explained LogReg this intuitively! Excellent article.

  7. It is a very good explanation, but I have a question. Why is log(odds) compared or equated to the decision boundary? Is it because these two functions have the same domain and range?

  8. Can someone help me out on the following please

    I am trying to simulate Circular Decision Boundary.

    generating a 21×21 grid of data from an x-range of -10 to 10 and a y-range of -10 to 10

    seq = seq(-10, 10)

    x = seq
    y = seq

    data = expand.grid(x = x, y = y)

    # setting the probabilities to 1 where x^2 + y^2 > 25 and 0 otherwise

    data$z = ifelse((data$x^2+data$y^2) > 25, 1, 0 )

    # flipping two values to avoid perfect separation

    data[158,]$z = 1
    data[441,]$z = 0

    # generating the quadratic model

    m = glm(data = data, z ~ I(data$x^2) + I(data$y^2), binomial(link = "logit"), maxit = 100)

    # Here are the coefficients

    Estimate Std. Error z value Pr(>|z|)
    (Intercept) -4.02540 0.59213 -6.798 1.06e-11 ***
    I(data$x^2) 0.15568 0.02353 6.617 3.66e-11 ***
    I(data$y^2) 0.15795 0.02387 6.617 3.66e-11 ***

    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance: 420.64 on 440 degrees of freedom
    Residual deviance: 127.54 on 438 degrees of freedom
    AIC: 133.54

    Number of Fisher Scoring iterations: 100

    If I use the predict function of R on training data itself, the probabilities are matching.

    data_new = data.frame(data$x, data$y)
    data$p = round(predict(m, data_new, type = "response"))

    But when I try to calculate the probabilities manually using the above coefficients, the results seem to be incorrect.
    We expect the coefficients to give the equation of a circle, so the coefficients seem to be wrong. But then how is
    predict working correctly?

    tx = -(0.15568*x*x + 0.15795*y*y - 4.02540)

    data$p1 = 1/(1+exp(tx))

  9. You have a major error in your explanation. You say that “the odds ratio (OR(X)) is defined by \frac{P(X)}{1-P(X)}”
    That is NOT the odds ratio, that is the ODDS. The rest of the paragraph describes the odds correctly, but it’s misnamed as the odds ratio.
    And then you never actually define the odds ratio, which is the odds of X given particular values of the predictors, divided by the odds of X given the reference values of the predictors.
    I am concerned by this, because your page appears very high up in a Google search for “logistic regression” and so it is important that the information should not be misleading.
    Otherwise, I think it’s a helpful page, but please correctly define odds, and then define odds ratio. Thank you.
