# Logistic Regression

Sanjiv R. Das

## Limited Dependent Variables

• The dependent variable may be discrete, taking two values (binomial) or several (multinomial). That is, the dependent variable is limited. In such cases, ordinary linear regression is no longer appropriate and we need a different approach.

• Discrete dependent variables are a special case of limited dependent variables. The Logit model we look at here is a discrete dependent variable model. Such models are also often called qualitative response (QR) models.

## The Logistic Function

$$y = \frac{1}{1+e^{-f(x_1,x_2,...,x_n)}} \in (0,1)$$

where

$$f(x_1,x_2,...,x_n) = a_0 + a_1 x_1 + ... + a_n x_n \in (-\infty,+\infty)$$
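The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a fitted model: the helper names `linear_index` and `logistic`, and the coefficient and input values, are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def linear_index(x, a0, a):
    """f(x_1,...,x_n) = a_0 + a_1*x_1 + ... + a_n*x_n, any real number."""
    return a0 + sum(ai * xi for ai, xi in zip(a, x))

def logistic(z):
    """Logistic function 1 / (1 + e^{-z}), written to avoid overflow in exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)

# Hypothetical coefficients and inputs, for illustration only.
a0, a = -1.0, [0.5, 2.0]
x = [2.0, 0.0]

f = linear_index(x, a0, a)   # -1 + 0.5*2 + 2*0 = 0
y = logistic(f)
print(f, y)                  # 0.0 0.5
```

A linear index of zero maps to a probability of one half; positive values of $f$ push $y$ toward 1 and negative values push it toward 0.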

## Odds Ratio

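What are odds ratios? An odds ratio (OR), as the term is used here, is the ratio of the probability of success to the probability of failure. (Strictly, this quantity is the odds of success; the term odds ratio is often reserved for the ratio of two such odds.) If the probability of success is $p$, then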

$$OR = \frac{p}{1-p}; \quad \quad p = \frac{OR}{1+OR}$$
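The conversion between a probability and the quantity $p/(1-p)$ runs in both directions, as the following sketch shows. The function names `odds_from_prob` and `prob_from_odds` are hypothetical, and the value of `p` is made up for illustration.

```python
def odds_from_prob(p):
    """Return p / (1 - p) for a probability p in [0, 1)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return p / (1.0 - p)

def prob_from_odds(odds):
    """Invert the mapping: p = odds / (1 + odds) for odds >= 0."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)

p = 0.75                      # illustrative probability of success
odds = odds_from_prob(p)
print(odds)                   # 3.0, i.e. "3 to 1"
print(prob_from_odds(odds))   # 0.75
```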

## Odds Ratio Coefficients

• In a linear regression, it is easy to see how the dependent variable changes when any right hand side variable changes. Not so with nonlinear models. A little bit of pencil pushing is required (add some calculus too).

• The coefficient of an independent variable in a logit regression tells us by how much the log odds of the dependent variable change with a one-unit change in the independent variable. If you want the odds ratio, simply exponentiate the coefficient: $e^{a_i}$ is the factor by which the odds are multiplied when $x_i$ increases by one unit.
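The pencil pushing can be sketched as follows, assuming $p$ denotes the logistic output $y$ and $f$ is the linear index $a_0 + a_1 x_1 + ... + a_n x_n$ from the logistic function. Since $p = 1/(1+e^{-f})$ and $1-p = e^{-f}/(1+e^{-f})$, the odds are $e^{f}$, so

$$\ln\left(\frac{p}{1-p}\right) = a_0 + a_1 x_1 + ... + a_n x_n; \quad \quad \frac{\partial}{\partial x_i} \ln\left(\frac{p}{1-p}\right) = a_i$$

A one-unit increase in $x_i$ therefore multiplies the odds by $e^{a_i}$. The effect on the probability itself is not constant, because it depends on where $p$ currently sits:

$$\frac{\partial p}{\partial x_i} = a_i \, p \, (1-p)$$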

## Metrics

1. Accuracy: the fraction of observations whose class is predicted correctly, (TP+TN)/(TP+TN+FP+FN).

2. ROC and AUC: The Receiver Operating Characteristic (ROC) curve is a plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) for different levels of the cut-off posterior probability. Lowering the cut-off raises both TPR and FPR, which is the essential trade-off in all classification systems. The AUC is the area under the ROC curve.

3. TPR = sensitivity or recall = TP/(TP+FN)

4. FPR = (1 − specificity) = FP/(FP+TN)
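These metrics can be computed directly from the confusion-matrix counts. The sketch below is a plain-Python illustration rather than a library implementation: the function names are hypothetical, the labels and fitted probabilities are made up, accuracy is taken as the share of correct predictions, and the AUC is approximated with the trapezoid rule over the ROC points.

```python
def confusion_counts(y_true, scores, cutoff):
    """Count TP, FP, TN, FN when predicting class 1 if score >= cutoff."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= cutoff else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def rates(tp, fp, tn, fn):
    """Accuracy, TPR and FPR; assumes both classes appear in the data."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)          # sensitivity / recall
    fpr = fp / (fp + tn)          # 1 - specificity
    return accuracy, tpr, fpr

def roc_points(y_true, scores):
    """(FPR, TPR) pairs, one per cut-off, from the strictest to the loosest."""
    cutoffs = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for c in cutoffs:
        _, tpr, fpr = rates(*confusion_counts(y_true, scores, c))
        points.append((fpr, tpr))
    return points

def auc(points):
    """Trapezoid-rule area under the piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up labels and fitted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(rates(*confusion_counts(y_true, scores, 0.5)))  # (0.75, 0.5, 0.0)
print(auc(roc_points(y_true, scores)))                # 0.75
```

Each cut-off yields one point on the ROC curve; sweeping it from high to low traces the curve from (0, 0) to (1, 1).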

## More Metrics

1. Precision = $\frac{TP}{TP+FP}$

2. Recall = $\frac{TP}{TP+FN}$

3. F1 score = $\frac{2}{\frac{1}{Precision} + \frac{1}{Recall}}$

(F1 is the harmonic mean of precision and recall.)
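As a quick check on the three formulas, here is a small sketch that computes them from raw counts. The function name `precision_recall_f1` and the counts are hypothetical, and returning an F1 of zero when there are no true positives is an assumed convention rather than part of the definitions above.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("precision or recall is undefined for these counts")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if tp == 0:
        # Assumed convention: harmonic mean is taken as 0 when both are 0.
        return precision, recall, 0.0
    f1 = 2.0 / (1.0 / precision + 1.0 / recall)   # harmonic mean
    return precision, recall, f1

# Made-up counts, for illustration only.
tp, fp, fn = 8, 2, 4
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(precision, recall, f1)
```

With these counts precision is 8/10 and recall is 8/12, and the harmonic mean agrees with the equivalent form $2TP/(2TP+FP+FN) = 16/22$.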

## Using R

• We can also use the R programming language, as it is often better suited to econometrics.

• We will use a basketball data set this time for a change of pace.

• The rpy2 package allows us to call R from a Python notebook. https://rpy2.bitbucket.io/

## Multinomial Logit

The probability of each class $(0,1,...,k)$ for $(k+1)$ classes, with class 0 as the baseline, is as follows:

$$Pr[y=j] = \frac{e^{a_j^\top x}}{1+\sum_{i=1}^k e^{a_i^\top x}}, \quad j = 1,...,k$$

and

$$Pr[y=0] = \frac{1}{1+\sum_{i=1}^k e^{a_i^\top x}}$$

Note that $\sum_{i=0}^k Pr[y=i] = 1$.
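The class probabilities can be sketched numerically as below. This assumes the usual baseline normalization in which class 0 has a score of zero, so the denominator is $1 + \sum_{i=1}^k e^{a_i^\top x}$ and the $(k+1)$ probabilities sum to one. The function name `multinomial_logit_probs` and all coefficient values are hypothetical.

```python
import math

def multinomial_logit_probs(x, coefs):
    """Return [Pr(y=0), Pr(y=1), ..., Pr(y=k)] with class 0 as the baseline.

    coefs[j-1] holds the coefficient vector a_j for class j = 1..k.
    """
    scores = [sum(aj * xj for aj, xj in zip(a, x)) for a in coefs]
    all_scores = [0.0] + scores            # baseline class 0 has score 0
    m = max(all_scores)                    # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in all_scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up inputs: two features, three classes (k = 2).
x = [1.0, 2.0]
coefs = [
    [0.0, 0.0],            # a_1: same score as the baseline
    [math.log(2.0), 0.0],  # a_2: odds of 2 to 1 against the baseline
]
probs = multinomial_logit_probs(x, coefs)
print(probs)       # approximately [0.25, 0.25, 0.5]
print(sum(probs))  # approximately 1.0
```

Subtracting the largest score before exponentiating leaves the ratios unchanged, so the result matches the formulas while staying numerically safe for large scores.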