Truncate and Estimate: Limited Dependent Variables {#LimitedDependentVariables}

Maximum-Likelihood Estimation (MLE)

Suppose we wish to fit data to a given distribution, then we may use this technique to do so. Many of the data fitting procedures need to use MLE. MLE is a general technique, and applies widely. It is also a fundamental approach to many estimation tools in econometrics. Here we recap this.

Let's say we have a series of data $x$, with $T$ observations. If $x \sim N(\mu,\sigma^2)$, then \begin{equation} \mbox{density function:} \quad f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2} \right] \end{equation}

\begin{equation} N(x) = 1 - N(-x) \end{equation}\begin{equation} F(x) = \int_{-\infty}^x f(u) du \end{equation}

The standard normal distribution is $x \sim N(0,1)$. For the standard normal distribution: $F(0) = \frac{1}{2}$.


The likelihood of the entire series is

\begin{equation} \prod_{t=1}^T f[R(t)] \end{equation}

It is easier (computationally) to maximize \begin{equation} \max_{\mu,\sigma} \; {\cal L} \equiv \sum_{t=1}^T \ln f[R(t)] \end{equation} known as the log-likelihood.


This is easily done in R. First we create the log-likelihood function, so you can see how functions are defined in R. Second, we optimize the log-likelihood, i.e., we find the maximum value, hence it is known as maximum log-likelihood estimation (MLE).

We can see that the result was a fitted normal distribution with mean close to 5, and variance close to 9, the square root of which is roughly the same as the distribution from which the data was originally generated. Further, notice that the gradient is zero for both parameters, as it should be when the maximum is reached.

Logit and Probit Models

Usually we run regressions using continuous variables for the dependent ($y$) variables, such as, for example, when we regress income on education. Sometimes however, the dependent variable may be discrete, and could be binomial or multinomial. That is, the dependent variable is limited. In such cases, we need a different approach.

Discrete dependent variables are a special case of limited dependent variables. The Logit and Probit models we look at here are examples of discrete dependent variable models. Such models are also often called qualitative response (QR) models. These are common usage and do not need to be capitalized, so we will use lower case subsequently.

In particular, when the variable is binary, i.e., takes values of $\{0,1\}$, then we get a probability model. If we just regressed the left hand side variables of ones and zeros on a suite of right hand side variables we could of course fit a linear regression. Then if we took another observation with values for the right hand side, i.e., $x = \{x_1,x_2,\ldots,x_k\}$, we could compute the value of the $y$ variable using the fitted coefficients. But of course, this value will not be exactly 0 or 1, except by unlikely coincidence. Nor will this value lie in the range $(0,1)$. However, this is still done and is known as a linear probability model.

Relation to classifiers

There is also a relationship to classifier models. In classifier models, we are interested in allocating observations to categories. In limited dependent models we also want to explain the reasons (i.e., find explanatory variables) for the allocation across categories.

Some examples of such models are to explain whether a person is employed or not, whether a firm is syndicated or not, whether a firm is solvent or not, which field of work is chosen by graduates, where consumers shop, whether they choose Coke versus Pepsi, etc.

These fitted values might not even lie between 0 and 1 with a linear regression. However, if we used a carefully chosen nonlinear regression function, then we could ensure that the fitted values of $y$ are restricted to the range $(0,1)$, and then we would get a model where we fitted a probability. There are two such model forms that are widely used: (a) Logit, also known as a logistic regression, and (b) Probit models. We look at each one in turn.


A logit model takes the following form:

\begin{equation} y = \frac{e^{f(x)}}{1+e^{f(x)}}, \quad f(x) = \beta_0 + \beta_1 x_1 + \ldots \beta_k x_k \end{equation}

We are interested in fitting the coefficients $\{\beta_0,\beta_1, \ldots, \beta_k\}$. Note that, irrespective of the coefficients, $f(x) \in (-\infty,+\infty)$, but $y \in (0,1)$. When $f(x) \rightarrow -\infty$, $y \rightarrow 0$, and when $f(x) \rightarrow +\infty$, $y \rightarrow 1$. We also write this model as

\begin{equation} y = \frac{e^{\beta' x}}{1+e^{\beta' x}} \equiv \Lambda(\beta' x) \end{equation}

where $\Lambda$ (lambda) is for logit.

The model generates a $S$-shaped curve for $y$, and we can plot it as follows. The fitted value of $y$ is nothing but the probability that $y=1$.


For the NCAA data, take the top 32 teams and make their dependent variable 1, and that of the bottom 32 teams zero. Therefore, the teams that have $y=1$ are those that did not lose in the first round of the playoffs, and the teams that have $y=0$ are those that did. Estimation is done by maximizing the log-likelihood.


Probit has essentially the same idea as the logit except that the probability function is replaced by the normal distribution. The nonlinear regression equation is as follows:

\begin{equation} y = \Phi[f(x)], \quad f(x) = \beta_0 + \beta_1 x_1 + \ldots \beta_k x_k \end{equation}

where $\Phi(.)$ is the cumulative normal probability function. Again, irrespective of the coefficients, $f(x) \in (-\infty,+\infty)$, but $y \in (0,1)$. When $f(x) \rightarrow -\infty$, $y \rightarrow 0$, and when $f(x) \rightarrow +\infty$, $y \rightarrow 1$.

We can redo the same previous logit model using a probit instead:


Both these models are just settings in which we are computing binomial (binary) probabilities, i.e.

\begin{equation} \mbox{Pr}[y=1] = F(\beta' x) \end{equation}

where $\beta$ is a vector of coefficients, and $x$ is a vector of explanatory variables. $F$ is the logit/probit function.

\begin{equation} {\hat y} = F(\beta' x) \end{equation}

where ${\hat y}$ is the fitted value of $y$ for a given $x$, and now $\beta$ is the fitted model's coefficients.

In each case the function takes the logit or probit form that we provided earlier. Of course,

\begin{equation} \mbox{Pr}[y=0] = 1 - F(\beta' x) \end{equation}

Note that the model may also be expressed in conditional expectation form, i.e. \begin{equation} E[y | x] = F(\beta' x) (y=1) + [1-F(\beta' x)] (y=0) = F(\beta' x) \end{equation}

Odds Ratio and Slopes (Coefficients) in a Logit

In a linear regression, it is easy to see how the dependent variable changes when any right hand side variable changes. Not so with nonlinear models. A little bit of pencil pushing is required (and some calculus too).

The coefficient of an independent variable in a logit regression tell us by how much the log odds of the dependent variable change with a one unit change in the independent variable. If you want the odds ratio, then simply take the exponentiation of the log odds. The odds ratio says that when the independent variable increases by one, then the odds of the dependent outcome occurring increase by a factor of the odds ratio.

What are odds ratios? An odds ratio is the ratio of probability of success to the probability of failure. If the probability of success is $p$, then we have

$$ \mbox{Odds Ratio (OR)} = \frac{p}{1-p}, \quad p = \frac{OR}{1+OR} $$

For example, if $p=0.3$, then the odds ratio will be $OR=0.3/0.7 = 0.4285714$.

If the coefficient $\beta$ (log odds) of an independent variable in the logit is (say) 2, then it meands the odds ratio is $\exp(2) = 7.38$. This is the factor by which the variable impacts the odds ratio when the variable increases by 1.

Suppose the independent variable increases by 1. Then the odds ratio and probabilities change as follows.

So we see that the probability of the dependent outcome occurring has increased from $0.3$ to $0.76$. Why?

$$ y = \frac{e^{\beta' x}}{1-e^{\beta' x}}; \quad 1-y = \frac{1}{1-e^{\beta' x}} $$


$$ OR = \frac{y}{1-y} = e^{\beta' x} = \prod_{i} e^{\beta_i x_i} $$

Now let's do the same example with the NCAA data.

Now, let's see what happens if the rebounds increase by 1.

So, the probability increases as expected. We can check that the new odds ratio will give the new probability as well.

And we see that this is exactly as required.

Calculus of the logit coefficients

Remember that $y$ lies in the range $(0,1)$. Hence, we may be interested in how $E(y|x)$ changes as any of the explanatory variables changes in value, so we can take the derivative:

\begin{equation} \frac{\partial E(y|x)}{\partial x} = F'(\beta' x) \beta \equiv f(\beta' x) \beta \end{equation}

For each model we may compute this at the means of the regressors:

\begin{equation} \frac{\partial E(y|x)}{\partial x} = \beta\left( \frac{e^{\beta' x}}{1+e^{\beta' x}} \right) \left( 1 - \frac{e^{\beta' x}}{1+e^{\beta' x}} \right) = \beta y (1-y) \end{equation}

which may be re-written as

\begin{equation} \frac{\partial E(y|x)}{\partial x} = \beta \cdot \Lambda(\beta' x) \cdot [1-\Lambda(\beta'x)] \end{equation}

How about the Probit model?

In the probit model this is

\begin{equation} \frac{\partial E(y|x)}{\partial x} = \phi(\beta' x) \beta \end{equation}

where $\phi(.)$ is the normal density function (not the cumulative probability).

Maximum-Likelihood Estimation (MLE) of these Choice Models

Estimation in the models above, using the glm function is done by R using MLE. Lets write this out a little formally. Since we have say $n$ observations, and each LHS variable is $y = \{0,1\}$, we have the likelihood function as follows:

\begin{equation} L = \prod_{i=1}^n F(\beta'x)^{y_i} [1-F(\beta'x)]^{1-y_i} \end{equation}

The log-likelihood will be

\begin{equation} \ln L = \sum_{i=1}^n \left[ y_i \ln F(\beta'x) + (1-y_i) \ln [1-F(\beta'x)] \right] \end{equation}

To maximize the log-likelihood we take the derivative:

\begin{equation} \frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^n \left[ y_i \frac{f(\beta'x)}{F(\beta'x)} - (1-y_i) \frac{f(\beta'x)}{1-F(\beta'x)} \right]x = 0 \end{equation}

which gives a system of equations to be solved for $\beta$. This is what the software is doing. The system of first-order conditions are collectively called the likelihood equation.

You may well ask, how do we get the t-statistics of the parameter estimates $\beta$? The formal derivation is beyond the scope of this class, as it requires probability limit theorems, but let's just do this a little heuristically, so you have some idea of what lies behind it.

The t-stat for a coefficient is its value divided by its standard deviation. We get some idea of the standard deviation by asking the question: how does the coefficient set $\beta$ change when the log-likelihood changes? That is, we are interested in $\partial \beta / \partial \ln L$. Above we have computed the reciprocal of this, as you can see. Lets define

\begin{equation} g = \frac{\partial \ln L}{\partial \beta} \end{equation}

We also define the second derivative (also known as the Hessian matrix)

\begin{equation} H = \frac{\partial^2 \ln L}{\partial \beta \partial \beta'} \end{equation}

Note that the following are valid:

\begin{eqnarray*} E(g) &=& 0 \quad \mbox{(this is a vector)} \\ Var(g) &=& E(g g') - E(g)^2 = E(g g') \\ &=& -E(H) \quad \mbox{(this is a non-trivial proof)} \end{eqnarray*}

We call

\begin{equation} I(\beta) = -E(H) \end{equation}

the information matrix. Since (heuristically) the variation in log-likelihood with changes in beta is given by $Var(g)=-E(H)=I(\beta)$, the inverse gives the variance of $\beta$. Therefore, we have

\begin{equation} Var(\beta) \rightarrow I(\beta)^{-1} \end{equation}

We take the square root of the diagonal of this matrix and divide the values of $\beta$ by that to get the t-statistics.

Multinomial Logit

You will need the nnet package for this. This model takes the following form:

\begin{equation} \mbox{Prob}[y = j] = p_j= \frac{\exp(\beta_j' x)}{1+\sum_{j=1}^{J} \exp(\beta_j' x)} \end{equation}

We usually set

\begin{equation} \mbox{Prob}[y = 0] = p_0 = \frac{1}{1+\sum_{j=1}^{J} \exp(\beta_j' x)} \end{equation}

You can see from the results that the probability for category 1 is the same as $p_0$. What this means is that we compute the other three probabilities, and the remaining is for the first category. We check that the probabilities across each row for all four categories add up to 1:

When OLS fails

The standard linear regression model often does not apply, and we need to be careful to not overuse it. Peter Kennedy in his excellent book "A Guide to Econometrics" states five cases where violations of critical assumptions for OLS occur, and we should then be warned against its use.

  1. The OLS model is in error when (a) the RHS variables are incorrect (inapproprate regressors) for use to explain the LHS variable. This is just the presence of a poor model. Hopefully, the F-statistic from such a regression will warn against use of the model. (b) The relationship between the LHS and RHS is nonlinear, and this makes use of a linear regression inaccurate. (c) the model is non-stationary, i.e., the data spans a period where the coefficients cannot be reasonably expected to remain the same.
  1. Non-zero mean regression residuals. This occurs with truncated residuals (see discussion below) and in sample selection problems, where the fitted model to a selected subsample would result in non-zero mean errors for the full sample. This is also known as the biased intercept problem. The errors may also be correlated with the regressors, i.e., endogeneity (see below).

  2. Residuals are not iid. This occurs in two ways. (a) Heterosledasticity, i.e., the variances of the residuals for all observations is not the same, i.e., violation of the identically distributed assumption. (b) Autocorrelation, where the residuals are correlated with each other, i.e., violation of the independence assumption.

  1. Endogeneity. Here the observations of regressors $x$ cannot be assumed to be fixed in repeated samples. This occurs in several ways. (a) Errors in variables, i.e., measurement of $x$ in error. (b) Omitted variables, which is a form of errors in variables. (c) Autoregression, i.e., using a lagged value of the dependent variable as an independent variable, as in VARs. (d) Simultaneous equation systems, where all variables are endogenous, and this is also known as reverse causality. For example, changes in tax rates changes economic behavior, and hence income, which may result in further policy changes in tax rates, and so on. Because the $x$ variables are correlated with the errors $\epsilon$, they are no longer exogenous, and hence we term this situation one of "endogeneity".

  2. $n > p$. The number of observations ($n$) is greater than the number of independent variables ($p$), i.e., the dimension of $x$. This can also occur when two regressors are highly correlated with each other, i.e., known as multicollinearity.

Truncated Variables and Sample Selection

Sample selection problems arise because the sample is truncated based on some selection criterion, and the regression that is run is biased because the sample is biased and does not reflect the true/full population. For example, wage data is only available for people who decided to work, i.e., the wage was worth their while, and above their reservation wage. If we are interested in finding out the determinants of wages, we need to take this fact into account, i.e., the sample only contains people who were willing to work at the wage levels that were in turn determined by demand and supply of labor. The sample becomes non-random. It explains the curious case that women with more children tend to have lower wages (because they need the money and hence, their reservation wage is lower).

Usually we handle sample selection issues using a two-equation regression approach. The first equation determines if an observation enters the sample. The second equation then assesses the model of interest, e.g., what determines wages. We will look at an example later.

But first, we provide some basic mathematical results that we need later. And of course, we need to revisit our Bayesian ideas again!

Given a probability density $f(x)$,

\begin{equation} f(x | x > a) = \frac{f(x)}{Pr(x>a)} \end{equation}

If we are using the normal distribution then this is:

\begin{equation} f(x | x > a) = \frac{\phi(x)}{1-\Phi(a)} \end{equation}

If $x \sim N(\mu, \sigma^2)$, then

\begin{equation} E(x | x>a) = \mu + \sigma\; \frac{\phi(c)}{1-\Phi(c)}, \quad c = \frac{a-\mu}{\sigma} \end{equation}

Note that this expectation is provided without proof, as are the next few ones. For example if we let $x$ be standard normal and we want $E([x | x > -1])$, we have

For the same distribution

\begin{equation} E(x | x < a) = \mu + \sigma\; \frac{-\phi(c)}{\Phi(c)}, \quad c = \frac{a-\mu}{\sigma} \end{equation}

For example, $E[x | x < 1]$ is

Inverse Mills Ratio

The values $\frac{\phi(c)}{1-\Phi(c)}$ or $\frac{-\phi(c)}{\Phi(c)}$ as the case may be is often shortened to the variable $\lambda(c)$, which is also known as the Inverse Mills Ratio.

If $y$ and $x$ are correlated (with correlation $\rho$), and $y \sim N(\mu_y,\sigma_y^2)$, then

\begin{eqnarray*} Pr(y,x | x>a) &=& \frac{f(y,x)}{Pr(x>a)} \\ E(y | x>a) &=& \mu_y + \sigma_y \rho \lambda(c), \quad c = \frac{a-\mu}{\sigma} \end{eqnarray*}

This leads naturally to the truncated regression model. Suppose we have the usual regression model where

\begin{equation} y = \beta'x + e, \quad e \sim N(0,\sigma^2) \end{equation}

But suppose we restrict attention in our model to values of $y$ that are greater than a cut off $a$. We can then write down by inspection the following correct model (no longer is the simple linear regression valid)

\begin{equation} E(y | y > a) = \beta' x + \sigma \; \frac{\phi[(a-\beta'x)/\sigma]}{1-\Phi[(a-\beta'x)/\sigma]} \end{equation}

Therefore, when the sample is truncated, then we need to run the regression above, i.e., the usual right-hand side $\beta' x$ with an additional variable, i.e., the Inverse Mill's ratio. We look at this in a real-world example.

Example: Limited Dependent Variables in VC Syndications

Not all venture-backed firms end up making a successful exit, either via an IPO, through a buyout, or by means of another exit route. By examining a large sample of startup firms, we can measure the probability of a firm making a successful exit. By designating successful exits as $S=1$, and setting $S=0$ otherwise, we use matrix $X$ of explanatory variables and fit a Probit model to the data. We define $S$ to be based on a latent threshold variable $S^*$ such that

\begin{equation} S = \left\{ \begin{array}{ll} 1 & \mbox{if } S^* > 0\\ 0 & \mbox{if } S^* \leq 0. \end{array} \right. \end{equation}

where the latent variable is modeled as

\begin{equation} S^* = \gamma' X + u, \quad u \sim N(0,\sigma_u^2) \end{equation}

The fitted model provides us the probability of exit, i.e., $E(S)$, for all financing rounds.

\begin{equation} E(S) = E(S^* > 0) = E(u > -\gamma' X) = 1 - \Phi(-\gamma' X) = \Phi(\gamma' X), \end{equation}

where $\gamma$ is the vector of coefficients fitted in the Probit model, using standard likelihood methods. The last expression in the equation above follows from the use of normality in the Probit specification. $\Phi(.)$ denotes the cumulative normal distribution.

Sample Selection Problems (and endogeneity)

Suppose we want to examine the role of syndication in venture success. Success in a syndicated venture comes from two broad sources of VC expertise. First, VCs are experienced in picking good projects to invest in, and syndicates are efficient vehicles for picking good firms; this is the selection hypothesis put forth by Lerner (1994). Amongst two projects that appear a-priori similar in prospects, the fact that one of them is selected by a syndicate is evidence that the project is of better quality (ex-post to being vetted by the syndicate, but ex-ante to effort added by the VCs), since the process of syndication effectively entails getting a second opinion by the lead VC. Second, syndicates may provide better monitoring as they bring a wide range of skills to the venture, and this is suggested in the value-added hypothesis of Brander, Amit, and Antweiler (2002).

A regression of venture returns on various firm characteristics and a dummy variable for syndication allows a first pass estimate of whether syndication impacts performance. However, it may be that syndicated firms are simply of higher quality and deliver better performance, whether or not they chose to syndicate. Better firms are more likely to syndicate because VCs tend to prefer such firms and can identify them. In this case, the coefficient on the dummy variable might reveal a value-add from syndication, when indeed, there is none. Hence, we correct the specification for endogeneity, and then examine whether the dummy variable remains significant.

Greene, in his classic book "Econometric Analysis" provides the correction for endogeneity required here. We briefly summarize the model required. The performance regression is of the form:

\begin{equation} Y = \beta' X + \delta Q + \epsilon, \quad \epsilon \sim N(0,\sigma_{\epsilon}^2) \end{equation}

where $Y$ is the performance variable; $Q$ is the dummy variable taking a value of 1 if the firm is syndicated, and zero otherwise, and $\delta$ is a coefficient that determines whether performance is different on account of syndication. If it is not, then it implies that the variables $X$ are sufficient to explain the differential performance across firms, or that there is no differential performance across the two types of firms.

However, since these same variables determine also, whether the firm syndicates or not, we have an endogeneity issue which is resolved by adding a correction to the model above. The error term $\epsilon$ is affected by censoring bias in the subsamples of syndicated and non-syndicated firms. When $Q=1$, i.e. when the firm's financing is syndicated, then the residual $\epsilon$ has the following expectation

\begin{equation} E(\epsilon | Q=1) = E(\epsilon | S^* >0) = E(\epsilon | u > -\gamma' X) = \rho \sigma_{\epsilon} \left[ \frac{\phi(\gamma' X)}{\Phi(\gamma' X)} \right]. \end{equation}

where $\rho = Corr(\epsilon,u)$, and $\sigma_{\epsilon}$ is the standard deviation of $\epsilon$. This implies that

\begin{equation} E(Y | Q=1) = \beta'X + \delta + \rho \sigma_{\epsilon} \left[ \frac{\phi(\gamma' X)}{\Phi(\gamma' X)} \right]. \end{equation}

Note that $\phi(-\gamma'X)=\phi(\gamma'X)$, and $1-\Phi(-\gamma'X)=\Phi(\gamma'X)$. For estimation purposes, we write this as the following regression equation:

EQN1 \begin{equation} Y = \delta + \beta' X + \beta_m m(\gamma' X) \end{equation}

where $m(\gamma' X) = \frac{\phi(\gamma' X)}{\Phi(\gamma' X)}$ and $\beta_m = \rho \sigma_{\epsilon}$. Thus, $\{\delta,\beta,\beta_m\}$ are the coefficients estimated in the regression. (Note here that $m(\gamma' X)$ is also known as the inverse Mill's ratio.)

Likewise, for firms that are not syndicated, we have the following result

\begin{equation} E(Y | Q=0) = \beta'X + \rho \sigma_{\epsilon} \left[ \frac{-\phi(\gamma' X)}{1-\Phi(\gamma' X)} \right]. \end{equation}

This may also be estimated by linear cross-sectional regression.

EQN0 \begin{equation} Y = \beta' X + \beta_m \cdot m'(\gamma' X) \end{equation}

where $m' = \frac{-\phi(\gamma' X)}{1-\Phi(\gamma' X)}$ and $\beta_m = \rho \sigma_{\epsilon}$.

The estimation model will take the form of a stacked linear regression comprising both equations (EQN1) and (EQN0). This forces $\beta$ to be the same across all firms without necessitating additional constraints, and allows the specification to remain within the simple OLS form. If $\delta$ is significant after this endogeneity correction, then the empirical evidence supports the hypothesis that syndication is a driver of differential performance. If the coefficients $\{\delta, \beta_m\}$ are significant, then the expected difference in performance for each syndicated financing round $(i,j)$ is

\begin{equation} \delta + \beta_m \left[ m(\gamma_{ij}' X_{ij}) - m'(\gamma_{ij}' X_{ij}) \right], \;\;\; \forall i,j. \end{equation}

The method above forms one possible approach to addressing treatment effects. Another approach is to estimate a Probit model first, and then to set $m(\gamma' X) = \Phi(\gamma' X)$. This is known as the instrumental variables approach.

Some References: @JEMS:JEMS423; @10.2307/3665618

The correct regression may be run using the sampleSelection package in R. Sample selection models correct for the fact that two subsamples may be different because of treatment effects. Let's take an example with data from the wage market.

Example: Women in the Labor Market

This is an example from the package in R itself. The data used is also within the package.

After loading in the package sampleSelection we can use the data set called Mroz87. This contains labour market participation data for women as well as wage levels for women. If we are explaining what drives women's wages we can simply run the following regression. See:

The original paper may be downloaded at:

So, education matters. But since education also determines labor force participation (variable lfp) it may just be that we can use lfp instead. Let's try that.

In fact, it seems like both matter, but we should use the selection equation approach of Heckman, in two stages.

Note that even after using education to explain lfp in the selection equation, it still matters in the wage equation. So education does really impact wages.

Endogeity: Some Theory to Wrap Up

Endogeneity may be technically expressed as arising from a correlation of the independent variables and the error term in a regression. This can be stated as:

\begin{equation} Y = \beta' X + u, \quad E(X\cdot u) \neq 0 \end{equation}

This can happen in many ways:

\begin{equation} Y = \beta_0 + \beta_1 ({\tilde X} - e) + u = \beta_0 + \beta_1 {\tilde X} + (u - \beta_1 e) = \beta_0 + \beta_1 {\tilde X} + v \end{equation}

We see that

\begin{equation} E({\tilde X} \cdot v) = E[(X+e)(u - \beta_1 e)] = -\beta_1 E(e^2) = -\beta_1 Var(e) \neq 0 \end{equation}
\begin{equation} Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u \end{equation}

but we do not have $X_2$, which happens to be correlated with $X_1$, then it will be subsumed in the error term and no longer will $E(X_i \cdot u) = 0, \forall i$.

\begin{equation} Y = \beta_0 + \beta_1 X + u, \quad \quad X = \alpha_0 + \alpha_1 Y + v \end{equation}

The solution to these equations gives the {\it reduced-form} version of the model. \begin{equation} Y = \frac{\beta_0 + \beta_1 \alpha_0}{1 - \alpha_1 \beta_1} + \frac{\beta v + u}{1 - \alpha_1 \beta_1}, \quad \quad X = \frac{\alpha_0 +\alpha_1 \beta_0}{1 - \alpha_1 \beta_1} + \frac{v + \alpha_1 u}{1 - \alpha_1 \beta_1} \end{equation}

From which we can compute the endogeneity result. \begin{equation} Cov(X, u) = Cov\left(\frac{v + \alpha_1 u}{1 - \alpha_1 \beta_1}, u \right) = \frac{\alpha_1}{1 - \alpha_1 \beta_1}\cdot Var(u) \end{equation}

To summarize, if $x$ is correlated with $u$ then $x$ is said to be "endogenous". Endogeneity biases parameter estimates. The solution is to find an instrumental variable (denoted $x'$) that is highly correlated with $x$, but not correlated with $u$. That is

But since $x'$ is not really $x$, it adds (uncorrelated )variance to the residuals, because $x' = x + \eta$.

Cox Proportional Hazards Model

This is a model used to estimate the expected time to an event. We may be interested in estimating mortality, failure time of equipment, time to successful IPO of a startup, etc. If we define "stopping" time of an event as $\tau$, then we are interested in the cumulative probability of an event occurring in time $t$ as

$$ F(t) = Pr(\tau \leq t ) $$

and the corresponding density function $f(t) = F'(t)$. The hazard rate is defined as the probability that the event occurs at time $t$, conditional on it not having occurred until time $t$, i.e.,

$$ \lambda(t) = \frac{f(t)}{1-F(t)} \in (0,\infty) $$

Correspondingly, the probability of survival is

$$ s(t) = \exp\left( -\int_0^t \lambda(u)\; du \right) $$

with the probability of failure up to time $t$ then given by

$$ F(t) = 1 - s(t) = 1 -\exp\left( -\int_0^t \lambda(u)\; du \right) $$

Empirically, we estimate the hazard rate as follows, for individual $i$:

$$ \lambda_i(t) = \lambda_0(t) \exp[\beta^\top x_i] \geq 0 $$

where $\beta$ is a vector of coefficients, and $x_i$ is a vector of characteristics of individual $i$. The function $\lambda_0(t) \geq 0$ is known as the "baseline hazard function". The hazard ratio is defined as $\lambda_i(t)/\lambda_0(t)$. When greater than 1, individual $i$ has a greater hazard than baseline. The log hazard ratio is linear in $x_i$.

$$ \ln \left[ \frac{\lambda_i(t)}{\lambda_0(t)} \right] = \beta^\top x_i $$

In order to get some intuition for the hazard rate, suppose we have three friends who just graduated from college, and they all have an equal chance of getting married. Then at any time $t$, the probability that any one gets married, given no one has been married so far is $\lambda_i(t) = \lambda_0(t) = 1/3, \forall t$. Now, if anyone gets married, then the hazard rate will jump to $1/2$.

But what if all the three friends are of different ages, and the propensity to get married is proportional to age? Then

$$ \lambda_i(t) = \frac{\mbox{Age}_i(t)}{\sum_{j=1}^3 \mbox{Age}_j(t)} $$

This model may also be extended to include gender and other variables. Given we have data on $M$ individuals, we can order the data by discrete times $t_1 < t_2 < ... t_i < ... < t_M$. Some of these times are times to the event, and some are times of existence without the event, the latter is also known as "censoring" times. The values $\delta_1, \delta_2, ..., \delta_i, ..., \delta_M$ take values 1 if the individual has experienced the event and zero otherwise. The likelihood of an individual experiencing the event is

$$ L_i(\beta) = \frac{\lambda_i(t_i)}{\sum_{j=i}^M \lambda_j(t_i)} = \frac{\lambda_0(t_i) e^{\beta^\top x_i}}{\sum_{j=i}^M \lambda_0(t_i) e^{\beta^\top x_j}} = \frac{ e^{\beta^\top x_i}}{\sum_{j=i}^M e^{\beta^\top x_j}} $$

This accounts for all remaining individuals in the population at time $t_i$. We see that the likelihood does not depend on $t$ as the baseline hazard function cancels out. The parameters $\beta$ are obtained by maximizing the likelihood function:

$$ L(\beta) = \prod_{i=1}^M L_i(\beta)^{\delta_i} $$

which uses the subset of data where $\delta_i = 1$.

We use the survival package in R.

Here is a very small data set. Note the columns that correspond to time to event, and the indictor variable "death" ($\delta$). The $x$ variables are "age" and "female".

We can of course run a linear regression just to see how age and gender affect death, by merely looking at the sign, and we see that being older means on average a greater chance of dying, and being female reduces risk.

Instead of a linear regression, estimate the Cox PH model for the survival time. Here the coefficients are reversed in sign because we are estimating survival and not death.

Note that the exp(coef) is the hazard ratio. When it is greater than 1, there is an increase in hazard, and when it is less than 1, there is a decrease in the hazard. We can do a test for proportional hazards as follows, and examine the p-values.

Finally, we are interested in obtaining the baseline hazard function $\lambda_0(t)$ which as we know has dropped out of the estimation. So how do we recover it? In fact, without it, where do we even get $\lambda_i(t)$ from? We would also like to get the cumulative baseline hazard, i.e., $\Lambda_0(t) = \int_0^t \lambda_0(u) du$. Sadly, this is a major deficiency of the Cox PH model. However, one may make a distributional assumption about the form of $\lambda_0(t)$ and then fit it to maximize the likelihood of survival times, after the coefficients $\beta$ have been fit already. For example, one function might be $\lambda_0(t) = e^{\alpha t}$, and it would only need the estimation of $\alpha$. We can then obtain the estimated survival probabilities over time.

The "survival" column gives the survival probabilty for various time horizons shown in the first column.

For a useful guide, see

To sum up, see that the Cox PH model estimates the hazard rate function $\lambda(t)$:

$$ \lambda(t) = \lambda_0(t) \exp[\beta^\top x] $$

The "exp(coef)" is the baseline hazard rate multiplier effect. If exp(coef)>1, then an increase in the variable $x$ increases the hazard rate by that factor, and if exp(coef)<1, then it reduces the hazard rate $\lambda(t)$ by that factor. Note that the hazard rate is NOT the probability of survival, and in fact $\lambda(t) \in (0,\infty)$. Note that the probability of survival over time t, if we assume a constant hazard rate $\lambda$ is $s(t) = e^{-\lambda t}$. Of course $s(t) \in (0,1)$.

So for example, if the current (assumed constant) hazard rate is $\lambda = 0.02$, then the (for example) 3-year survival probability is

$$ s(t) = e^{-0.02 \times 3} = 0.9418 $$

This implies an Odds ratio of $OR = s/(1-s) = 16.17167$. If the person is female, then the new odds ratio is adjusted by the exponentiated coefficient, i.e., $OR \times 4.686 = 75.78043$. So the new survival probability is

$$ s(t=3) = \frac{OR}{1+OR} = 0.9870 $$

If Age increases by one year then the odds ratio will be $16.17167 \times 0.3886 = 6.28431$. And the new survival probability will be

$$ s(t=3) = \frac{OR}{1+OR} = 0.8627 $$

Note that the hazard rate and the probability of survival go in opposite directions.

GLMNET: Lasso and Ridge Regressions

The glmnet package is from Stanford, and you can get all the details and examples here:

The package fits generalized linear models and also penalizes the size of the model, with various standard models as special cases.

The function equation for minimization is

$$ \min_{\beta} \frac{1}{n}\sum_{i=1}^n w_i L(y_i,\beta^\top x_i) + \lambda \left[(1-\alpha) \frac{1}{2}\| \beta \|_2^2 + \alpha \|\beta \|_1\right] $$

where $\|\beta\|_1$ and $\|\beta\|_2$ are the $L_1$ and $L_2$ norms for the vector $\beta$. The idea is to take any loss function and penalize it. For example, if the loss function is just the sum of squared residuals $y_i-\beta^\top x_i$, and $w_i=1, \lambda=0$, then we get an ordinary least squares regression model. The function $L$ is usually set to be the log-likelihood function.

If the $L_1$ norm is applied only, i.e., $\alpha=1$, then we get the Lasso model. If the $L_2$ norm is solely applied, i.e., $\alpha=0$, then we get a ridge regression. As is obvious from the equation, $\lambda$ is the size of the penalty applied, and increasing this parameter forces a more parsimonious model.

Ridge regression suppresses the size of the coefficients, whereas Lasso also minimizes the number of non-zero coefficients.

Here is an example of lasso ($\alpha=1$):

We may also run glmnet to get coefficients.

Prediction on test data

ROC Curves

ROC stands for Receiver Operating Characteristic. The acronym comes from signal theory, where the users are interested in the number of true positive signals that are identified. The idea is simple, and best explained with an example. Let's say you have an algorithm that detects customers probability $p \in (0,1)$ of buying a product. Take a tagged set of training data and sort the customers by this probability in a line with the highest propensity to buy on the left and moving to the right the probabilty declines monotonically. (Tagged means you know whether they bought the product or not.) Now, starting from the left, plot a line that jumps vertically by a unit if the customer buys the product as you move across else remains flat. If the algorithm is a good one, the line will quickly move up at first and then flatten out.

Let's take the train and test data here and plot the ROC curve by writing our own code.

We can do the same with the pROC package. Here is the code.

We see that "specificity" equals the true negative rate, and is also denoted as "recall". And that the true positive rate is also labeled as "sensitivity". The AUC or "area under the curve"" is the area between the curve and the diagonal divided by the area in the top right triangle of the diagram. This is also reported and is the same number as obtained when we fitted the model using the glmnet function before. For nice graphics that explain all these measures and more, see

Glmnet Cox Models

As we did before, we may fit a Cox PH model using GLMNET with the additional feature that we include a penalty when we maximize the likelihood function.

With cross validation, we get the usual plot for the fit.

Note that the signs of the coefficients are the same as we had earlier, i.e., survival is lower with age and higher for females.