%pylab inline
import pandas as pd
import os
from ipypublish import nb_setup
%load_ext rpy2.ipython
#%load_ext RWinOut
Populating the interactive namespace from numpy and matplotlib
Suppose we wish to fit data to a given distribution. Many such data-fitting procedures rely on maximum likelihood estimation (MLE). MLE is a general technique that applies widely, and it is a fundamental approach underlying many estimation tools in econometrics. Here we recap it briefly.
Let's say we have a series of data $x$, with $T$ observations. If $x \sim N(\mu,\sigma^2)$, then \begin{equation} \mbox{density function:} \quad f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2} \right] \end{equation}
\begin{equation} \mbox{distribution function:} \quad F(x) = \int_{-\infty}^x f(u)\, du \end{equation}The standard normal distribution is $x \sim N(0,1)$. Writing $N(\cdot)$ for its distribution function, we have the symmetry property \begin{equation} N(x) = 1 - N(-x) \end{equation}and, in particular, $F(0) = \frac{1}{2}$ for the standard normal.
The likelihood of the entire series is
\begin{equation} \prod_{t=1}^T f[x(t)] \end{equation}It is easier (computationally) to maximize \begin{equation} \max_{\mu,\sigma} \; {\cal L} \equiv \sum_{t=1}^T \ln f[x(t)] \end{equation} which is known as the log-likelihood.
This is easily done in R. First we create the log-likelihood function, so you can see how functions are defined in R. Second, we optimize the log-likelihood, i.e., we find the parameter values at which it is maximized; hence the name maximum likelihood estimation (MLE). (In the code we actually minimize the negative log-likelihood, since R's nlm function is a minimizer.)
%%R
#NEGATIVE LOG-LIKELIHOOD FUNCTION (nlm is a minimizer, so we return -ln L)
LL = function(params,x) {
mu = params[1]; sigsq = params[2]   #parameters: mean and variance
f = (1/sqrt(2*pi*sigsq))*exp(-0.5*(x-mu)^2/sigsq)   #normal density at each observation
LL = -sum(log(f))   #negative log-likelihood
}
%%R
#GENERATE DATA FROM A NORMAL DISTRIBUTION
x = rnorm(10000, mean=5, sd=3)
#MAXIMIZE LOG-LIKELIHOOD (nlm minimizes the negative log-likelihood)
params = c(4,2) #Create starting guess for parameters
res = nlm(LL,params,x)
print(res)
$minimum
[1] 25177.91

$estimate
[1] 5.040839 9.004325

$gradient
[1]  5.051907e-06 -4.444283e-06

$code
[1] 1

$iterations
[1] 11
We can see that the result is a fitted normal distribution with mean close to 5 and variance close to 9 (i.e., standard deviation close to 3), matching the distribution from which the data was generated. Further, notice that the gradient is essentially zero for both parameters, as it should be at the maximum.
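Since the normal MLE has a closed form (the sample mean and the biased sample variance), a quick sanity check is to compare the nlm estimates with those sample moments. This is a minimal check, assuming the simulated x from the cell above is still in the R session.
%%R
#SANITY CHECK: the normal MLE equals the sample mean and the (biased) sample variance
print(mean(x))
print(var(x)*(length(x)-1)/length(x))   #rescale var(), which divides by n-1, to the MLE of sigma^2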
Usually we run regressions with a continuous dependent ($y$) variable, as, for example, when we regress income on education. Sometimes, however, the dependent variable may be discrete, and could be binomial or multinomial. That is, the dependent variable is limited. In such cases, we need a different approach.
Discrete dependent variables are a special case of limited dependent variables. The logit and probit models we look at here are examples of discrete dependent variable models. Such models are also often called qualitative response (QR) models. These terms are in common usage and do not need to be capitalized, so we use lower case subsequently.
In particular, when the variable is binary, i.e., takes values in $\{0,1\}$, then we get a probability model. If we just regressed the left hand side variable of ones and zeros on a suite of right hand side variables we could of course fit a linear regression. Then if we took another observation with values for the right hand side, i.e., $x = \{x_1,x_2,\ldots,x_k\}$, we could compute the value of the $y$ variable using the fitted coefficients. But of course, this value will not be exactly 0 or 1, except by unlikely coincidence; nor is it even guaranteed to lie in the range $(0,1)$. However, this is still done, and is known as a linear probability model.
There is also a relationship to classifier models. In classifier models, we are interested in allocating observations to categories. In limited dependent models we also want to explain the reasons (i.e., find explanatory variables) for the allocation across categories.
Some examples of such models are to explain whether a person is employed or not, whether a firm is syndicated or not, whether a firm is solvent or not, which field of work is chosen by graduates, where consumers shop, whether they choose Coke versus Pepsi, etc.
As noted, the fitted values from a linear regression are not guaranteed to lie between 0 and 1. However, if we use a carefully chosen nonlinear regression function, then we can ensure that the fitted values of $y$ are restricted to the range $(0,1)$, so that the model fits a probability. Two such model forms are widely used: (a) the logit, also known as logistic regression, and (b) the probit. We look at each one in turn.
A logit model takes the following form:
\begin{equation} y = \frac{e^{f(x)}}{1+e^{f(x)}}, \quad f(x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k \end{equation}We are interested in fitting the coefficients $\{\beta_0,\beta_1, \ldots, \beta_k\}$. Note that, irrespective of the coefficients, $f(x) \in (-\infty,+\infty)$, but $y \in (0,1)$. When $f(x) \rightarrow -\infty$, $y \rightarrow 0$, and when $f(x) \rightarrow +\infty$, $y \rightarrow 1$. We also write this model as
\begin{equation} y = \frac{e^{\beta' x}}{1+e^{\beta' x}} \equiv \Lambda(\beta' x) \end{equation}where $\Lambda$ (lambda) is for logit.
The model generates an S-shaped curve for $y$, and we can plot it as follows. The fitted value of $y$ is nothing but the probability that $y=1$.
%%R
logit = function(fx) { exp(fx)/(1+exp(fx)) }   #logistic function
fx = seq(-4,4,0.01)   #a range of values of the linear predictor f(x)
y = logit(fx)
plot(fx,y,type="l",xlab="f(x)",ylab="y",col="blue",lwd=3)
For the NCAA data, take the top 32 teams and make their dependent variable 1, and that of the bottom 32 teams zero. Therefore, the teams that have $y=1$ are those that did not lose in the first round of the playoffs, and the teams that have $y=0$ are those that did. Estimation is done by maximizing the log-likelihood.
%%R
ncaa = read.table("DSTMAA_data/ncaa.txt",header=TRUE)
y1 = 1:32
y1 = y1*0+1
y2 = y1*0
y = c(y1,y2)
print(y)
x = as.matrix(ncaa[4:14])
h = glm(y~x, family=binomial(link="logit"))
names(h)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 [39] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 [1] "coefficients" "residuals" "fitted.values" [4] "effects" "R" "rank" [7] "qr" "family" "linear.predictors" [10] "deviance" "aic" "null.deviance" [13] "iter" "weights" "prior.weights" [16] "df.residual" "df.null" "y" [19] "converged" "boundary" "model" [22] "call" "formula" "terms" [25] "data" "offset" "control" [28] "method" "contrasts" "xlevels"
%%R
print(logLik(h))
summary(h)
'log Lik.' -21.44779 (df=12) Call: glm(formula = y ~ x, family = binomial(link = "logit")) Deviance Residuals: Min 1Q Median 3Q Max -1.80174 -0.40502 -0.00238 0.37584 2.31767 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -45.83315 14.97564 -3.061 0.00221 ** xPTS -0.06127 0.09549 -0.642 0.52108 xREB 0.49037 0.18089 2.711 0.00671 ** xAST 0.16422 0.26804 0.613 0.54010 xTO -0.38405 0.23434 -1.639 0.10124 xA.T 1.56351 3.17091 0.493 0.62196 xSTL 0.78360 0.32605 2.403 0.01625 * xBLK 0.07867 0.23482 0.335 0.73761 xPF 0.02602 0.13644 0.191 0.84874 xFG 46.21374 17.33685 2.666 0.00768 ** xFT 10.72992 4.47729 2.397 0.01655 * xX3P 5.41985 5.77966 0.938 0.34838 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 88.723 on 63 degrees of freedom Residual deviance: 42.896 on 52 degrees of freedom AIC: 66.896 Number of Fisher Scoring iterations: 6
%%R
h$fitted.values
1 2 3 4 5 6 0.9998267965 0.9983229192 0.9686755530 0.9909359265 0.9977011039 0.9639506326 7 8 9 10 11 12 0.5381841865 0.9505255187 0.4329829232 0.7413280575 0.9793554057 0.7273235463 13 14 15 16 17 18 0.2309261473 0.9905414749 0.7344407215 0.9936312074 0.2269619354 0.8779507370 19 20 21 22 23 24 0.2572796426 0.9335376447 0.9765843274 0.7836742557 0.9967552281 0.9966486903 25 26 27 28 29 30 0.9715110760 0.0681674628 0.4984153630 0.9607522159 0.8624544140 0.6988578200 31 32 33 34 35 36 0.9265057217 0.7472357037 0.5589318497 0.2552381741 0.0051790298 0.4394307950 37 38 39 40 41 42 0.0205919396 0.0545333361 0.0100662111 0.0995262051 0.1219394290 0.0025416737 43 44 45 46 47 48 0.3191888357 0.0149772804 0.0685930622 0.3457439539 0.0034943441 0.5767386617 49 50 51 52 53 54 0.5489544863 0.4637012227 0.2354894587 0.0487342700 0.6359622098 0.8027221707 55 56 57 58 59 60 0.0003240393 0.0479116454 0.3422867567 0.4649889328 0.0547385409 0.0722894447 61 62 63 64 0.0228629774 0.0002730981 0.0570387301 0.2830628760
The probit is essentially the same idea as the logit, except that the logistic function is replaced by the cumulative normal distribution. The nonlinear regression equation is as follows:
\begin{equation} y = \Phi[f(x)], \quad f(x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k \end{equation}where $\Phi(\cdot)$ is the cumulative normal probability function. Again, irrespective of the coefficients, $f(x) \in (-\infty,+\infty)$, but $y \in (0,1)$. When $f(x) \rightarrow -\infty$, $y \rightarrow 0$, and when $f(x) \rightarrow +\infty$, $y \rightarrow 1$.
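As a quick visual comparison, we can overlay the logistic function and the cumulative normal over the same range of $f(x)$; both produce similar S-shaped curves, with the (unscaled) probit rising somewhat more steeply in the middle. This is just an illustrative plot.
%%R
#COMPARE THE LOGIT AND PROBIT LINK FUNCTIONS
fx = seq(-4,4,0.01)
plot(fx, exp(fx)/(1+exp(fx)), type="l", col="blue", lwd=2, xlab="f(x)", ylab="y")   #logit
lines(fx, pnorm(fx), col="red", lwd=2)   #probit
legend("topleft", legend=c("logit","probit"), col=c("blue","red"), lwd=2)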
We can redo the same previous logit model using a probit instead:
%%R
h = glm(y~x, family=binomial(link="probit"))
print(logLik(h))
summary(h)
'log Lik.' -21.27924 (df=12) Call: glm(formula = y ~ x, family = binomial(link = "probit")) Deviance Residuals: Min 1Q Median 3Q Max -1.76353 -0.41212 -0.00031 0.34996 2.24568 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -26.28219 8.09608 -3.246 0.00117 ** xPTS -0.03463 0.05385 -0.643 0.52020 xREB 0.28493 0.09939 2.867 0.00415 ** xAST 0.10894 0.15735 0.692 0.48874 xTO -0.23742 0.13642 -1.740 0.08180 . xA.T 0.71485 1.86701 0.383 0.70181 xSTL 0.45963 0.18414 2.496 0.01256 * xBLK 0.03029 0.13631 0.222 0.82415 xPF 0.01041 0.07907 0.132 0.89529 xFG 26.58461 9.38711 2.832 0.00463 ** xFT 6.28278 2.51452 2.499 0.01247 * xX3P 3.15824 3.37841 0.935 0.34988 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 88.723 on 63 degrees of freedom Residual deviance: 42.558 on 52 degrees of freedom AIC: 66.558 Number of Fisher Scoring iterations: 8
%%R
h$fitted.values
1 2 3 4 5 6 9.999998e-01 9.999048e-01 9.769711e-01 9.972812e-01 9.997756e-01 9.721166e-01 7 8 9 10 11 12 5.590209e-01 9.584564e-01 4.367808e-01 7.362946e-01 9.898112e-01 7.262200e-01 13 14 15 16 17 18 2.444006e-01 9.968605e-01 7.292286e-01 9.985910e-01 2.528807e-01 8.751178e-01 19 20 21 22 23 24 2.544738e-01 9.435318e-01 9.850437e-01 7.841357e-01 9.995601e-01 9.996077e-01 25 26 27 28 29 30 9.825306e-01 8.033540e-02 5.101626e-01 9.666841e-01 8.564489e-01 6.657773e-01 31 32 33 34 35 36 9.314164e-01 7.481401e-01 5.810465e-01 2.488875e-01 1.279599e-03 4.391782e-01 37 38 39 40 41 42 1.020269e-02 5.461190e-02 4.267754e-03 1.067584e-01 1.234915e-01 2.665101e-04 43 44 45 46 47 48 3.212605e-01 6.434112e-03 7.362892e-02 3.673105e-01 4.875193e-04 6.020993e-01 49 50 51 52 53 54 5.605770e-01 4.786576e-01 2.731573e-01 4.485079e-02 6.194202e-01 7.888145e-01 55 56 57 58 59 60 1.630556e-06 4.325189e-02 3.899566e-01 4.809365e-01 5.043005e-02 7.330590e-02 61 62 63 64 1.498018e-02 8.425836e-07 5.515960e-02 3.218696e-01
Both these models are just settings in which we are computing binomial (binary) probabilities, i.e.
\begin{equation} \mbox{Pr}[y=1] = F(\beta' x) \end{equation}where $\beta$ is a vector of coefficients, and $x$ is a vector of explanatory variables. $F$ is the logit/probit function.
\begin{equation} {\hat y} = F(\beta' x) \end{equation}where ${\hat y}$ is the fitted value of $y$ for a given $x$, and now $\beta$ denotes the vector of fitted coefficients.
In each case the function takes the logit or probit form that we provided earlier. Of course,
\begin{equation} \mbox{Pr}[y=0] = 1 - F(\beta' x) \end{equation}Note that the model may also be expressed in conditional expectation form, i.e. \begin{equation} E[y | x] = F(\beta' x) (y=1) + [1-F(\beta' x)] (y=0) = F(\beta' x) \end{equation}
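As a concrete check, the fitted values returned by glm are exactly $F(\beta'x)$ evaluated at the estimated coefficients. The short cell below (assuming the NCAA y and x from above are still in the session) refits the logit and compares the two computations; the names hh and eta are just illustrative.
%%R
#CHECK: glm fitted values equal Lambda(beta'x) computed directly
hh = glm(y~x, family=binomial(link="logit"))
eta = cbind(1,x) %*% hh$coefficients   #linear predictor beta'x (with intercept)
print(max(abs(1/(1+exp(-eta)) - hh$fitted.values)))   #difference should be ~0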
In a linear regression, it is easy to see how the dependent variable changes when any right hand side variable changes. Not so with nonlinear models. A little bit of pencil pushing is required (and some calculus too).
The coefficient of an independent variable in a logit regression tells us by how much the log odds of the dependent variable change with a one unit change in that independent variable. If you want the odds ratio, then simply exponentiate the coefficient (the log odds). The odds ratio says that when the independent variable increases by one, then the odds of the dependent outcome occurring increase by a factor of the odds ratio.
What are odds ratios? An odds ratio is the ratio of the probability of success to the probability of failure (strictly speaking this quantity is the odds; we follow common loose usage here). If the probability of success is $p$, then we have
$$ \mbox{Odds Ratio (OR)} = \frac{p}{1-p}, \quad p = \frac{OR}{1+OR} $$For example, if $p=0.3$, then the odds ratio will be $OR=0.3/0.7 = 0.4285714$.
If the coefficient $\beta$ (log odds) of an independent variable in the logit is (say) 2, then it means the odds ratio is multiplied by $\exp(2) = 7.38$ when that variable increases by 1.
Suppose the independent variable increases by 1. Then the odds ratio and probabilities change as follows.
%%R
p = 0.3
OR = p/(1-p); print(OR)
beta = 2
OR_new = OR * exp(beta); print(OR_new)
p_new = OR_new/(1+OR_new); print(p_new)
[1] 0.4285714
[1] 3.166738
[1] 0.7600041
So we see that the probability of the dependent outcome occurring has increased from $0.3$ to $0.76$. Why?
$$ y = \frac{e^{\beta' x}}{1+e^{\beta' x}}; \quad 1-y = \frac{1}{1+e^{\beta' x}} $$So
$$ OR = \frac{y}{1-y} = e^{\beta' x} = \prod_{i} e^{\beta_i x_i} $$Now let's do the same example with the NCAA data.
%%R
h = glm(y~x, family=binomial(link="logit"))
b = h$coefficients
#Odds ratio is the exponentiated coefficients
print(exp(b))
x1 = c(1,as.numeric(x[18,])) #Take row 18 and create the RHS variables array
p1 = 1/(1+exp(-sum(b*x1)))
print(p1)
OR1 = p1/(1-p1)
cat("Odds Ratio = "); print(OR1)
(Intercept) xPTS xREB xAST xTO xA.T 1.244270e-20 9.405653e-01 1.632927e+00 1.178470e+00 6.810995e-01 4.775577e+00 xSTL xBLK xPF xFG xFT xX3P 2.189332e+00 1.081849e+00 1.026364e+00 1.175903e+20 4.570325e+04 2.258450e+02 [1] 0.8779507 Odds Ratio = [1] 7.193413
Now, let's see what happens if the rebounds increase by 1.
%%R
x2 = x1
x2[3] = x2[3] + 1
p2 = 1/(1+exp(-sum(b*x2)))
print(p2)
[1] 0.921546
So, the probability increases as expected. We can check that the new odds ratio will give the new probability as well.
%%R
OR2 = OR1 * exp(b[3])
print(OR2/(1+OR2))
xREB 0.921546
And we see that this is exactly as required.
Remember that $y$ lies in the range $(0,1)$. Hence, we may be interested in how $E(y|x)$ changes as any of the explanatory variables changes in value, so we can take the derivative:
\begin{equation} \frac{\partial E(y|x)}{\partial x} = F'(\beta' x) \beta \equiv f(\beta' x) \beta \end{equation}For each model we may compute this at the means of the regressors:
\begin{equation} \frac{\partial E(y|x)}{\partial x} = \beta\left( \frac{e^{\beta' x}}{1+e^{\beta' x}} \right) \left( 1 - \frac{e^{\beta' x}}{1+e^{\beta' x}} \right) = \beta y (1-y) \end{equation}which may be re-written as
\begin{equation} \frac{\partial E(y|x)}{\partial x} = \beta \cdot \Lambda(\beta' x) \cdot [1-\Lambda(\beta'x)] \end{equation}
%%R
h = glm(y~x, family=binomial(link="logit"))
beta = h$coefficients
print(beta)
print(dim(x))
beta = as.matrix(beta)
print(dim(beta))
wuns = matrix(1,64,1)
x = cbind(wuns,x)
xbar = as.matrix(colMeans(x))
xbar
(Intercept) xPTS xREB xAST xTO xA.T -45.83315262 -0.06127422 0.49037435 0.16421685 -0.38404689 1.56351478 xSTL xBLK xPF xFG xFT xX3P 0.78359670 0.07867125 0.02602243 46.21373793 10.72992472 5.41984900 [1] 64 11 [1] 12 1 [,1] 1.0000000 PTS 67.1015625 REB 34.4671875 AST 12.7484375 TO 13.9578125 A.T 0.9778125 STL 6.8234375 BLK 2.7500000 PF 18.6562500 FG 0.4232969 FT 0.6914687 X3P 0.3333750
%%R
logitfunction = exp(t(beta) %*% xbar)/(1+exp(t(beta) %*% xbar))
print(logitfunction)
slopes = beta * logitfunction[1] * (1-logitfunction[1])
slopes
[,1] [1,] 0.5139925 [,1] (Intercept) -11.449314459 xPTS -0.015306558 xREB 0.122497576 xAST 0.041022062 xTO -0.095936529 xA.T 0.390572574 xSTL 0.195745753 xBLK 0.019652410 xPF 0.006500512 xFG 11.544386272 xFT 2.680380362 xX3P 1.353901094
In the probit model this is
\begin{equation} \frac{\partial E(y|x)}{\partial x} = \phi(\beta' x) \beta \end{equation}where $\phi(.)$ is the normal density function (not the cumulative probability).
%%R
ncaa = read.table("DSTMAA_data/ncaa.txt",header=TRUE)
y1 = 1:32
y1 = y1*0+1
y2 = y1*0
y = c(y1,y2)
print(y)
x = as.matrix(ncaa[4:14])
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 [39] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
%%R
h = glm(y~x, family=binomial(link="probit"))
beta = h$coefficients
print(beta)
print(dim(x))
beta = as.matrix(beta)
print(dim(beta))
(Intercept) xPTS xREB xAST xTO xA.T -26.28219202 -0.03462510 0.28493498 0.10893727 -0.23742076 0.71484863 xSTL xBLK xPF xFG xFT xX3P 0.45963279 0.03029006 0.01040612 26.58460638 6.28277680 3.15823537 [1] 64 11 [1] 12 1
%%R
wuns = matrix(1,64,1)
x = cbind(wuns,x)
xbar = as.matrix(colMeans(x))
print(xbar)
probitfunction = t(beta) %*% xbar   #linear predictor beta'xbar
slopes = dnorm(probitfunction[1]) * beta   #phi(beta'xbar) * beta, per the formula above
slopes
[,1] 1.0000000 PTS 67.1015625 REB 34.4671875 AST 12.7484375 TO 13.9578125 A.T 0.9778125 STL 6.8234375 BLK 2.7500000 PF 18.6562500 FG 0.4232969 FT 0.6914687 X3P 0.3333750 [,1] (Intercept) -1.401478911 xPTS -0.001846358 xREB 0.015193952 xAST 0.005809001 xTO -0.012660291 xA.T 0.038118787 xSTL 0.024509587 xBLK 0.001615196 xPF 0.000554899 xFG 1.417604938 xFT 0.335024536 xX3P 0.168410621
Estimation in the models above, using the glm function, is done by R using MLE. Let's write this out a little formally. Since we have, say, $n$ observations, and each LHS variable is $y \in \{0,1\}$, we have the likelihood function as follows:
\begin{equation} L = \prod_{i=1}^n F(\beta'x)^{y_i} [1-F(\beta'x)]^{1-y_i} \end{equation}The log-likelihood will be
\begin{equation} \ln L = \sum_{i=1}^n \left[ y_i \ln F(\beta'x) + (1-y_i) \ln [1-F(\beta'x)] \right] \end{equation}To maximize the log-likelihood we take the derivative:
\begin{equation} \frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^n \left[ y_i \frac{f(\beta'x)}{F(\beta'x)} - (1-y_i) \frac{f(\beta'x)}{1-F(\beta'x)} \right]x = 0 \end{equation}which gives a system of equations to be solved for $\beta$. This is what the software is doing. The system of first-order conditions is collectively called the likelihood equation.
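To make this concrete, here is a small simulated sketch (two parameters only, with made-up data; the names xsim, ysim, negLL2 are illustrative) that maximizes the binary log-likelihood directly via optim and compares the answer with glm.
%%R
#SKETCH: direct MLE of a logit by minimizing -ln L, compared with glm
set.seed(1)
nsim = 5000
xsim = rnorm(nsim)
psim = 1/(1+exp(-(-0.5 + 1.5*xsim)))   #true model: beta0=-0.5, beta1=1.5
ysim = rbinom(nsim, 1, psim)
negLL2 = function(b) {
    eta = b[1] + b[2]*xsim
    -sum(ysim*log(plogis(eta)) + (1-ysim)*log(1-plogis(eta)))
}
print(optim(c(0,0), negLL2)$par)   #direct maximization of the likelihood
print(coef(glm(ysim ~ xsim, family=binomial)))   #glm's MLE should be very close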
You may well ask, how do we get the t-statistics of the parameter estimates $\beta$? The formal derivation is beyond the scope of this class, as it requires probability limit theorems, but let's just do this a little heuristically, so you have some idea of what lies behind it.
The t-stat for a coefficient is its value divided by its standard deviation. We get some idea of the standard deviation by asking the question: how does the coefficient set $\beta$ change when the log-likelihood changes? That is, we are interested in $\partial \beta / \partial \ln L$. Above we have computed the reciprocal of this, i.e., $\partial \ln L / \partial \beta$. Let's define
\begin{equation} g = \frac{\partial \ln L}{\partial \beta} \end{equation}We also define the second derivative (also known as the Hessian matrix)
\begin{equation} H = \frac{\partial^2 \ln L}{\partial \beta \partial \beta'} \end{equation}Note that the following are valid:
\begin{eqnarray*} E(g) &=& 0 \quad \mbox{(this is a vector)} \\ Var(g) &=& E(g g') - E(g)^2 = E(g g') \\ &=& -E(H) \quad \mbox{(this is a non-trivial proof)} \end{eqnarray*}We call
\begin{equation} I(\beta) = -E(H) \end{equation}the information matrix. Since (heuristically) the variation in log-likelihood with changes in beta is given by $Var(g)=-E(H)=I(\beta)$, the inverse gives the variance of $\beta$. Therefore, we have
\begin{equation} Var(\beta) \rightarrow I(\beta)^{-1} \end{equation}We take the square root of the diagonal of this matrix and divide the values of $\beta$ by that to get the t-statistics.
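We can see this mechanically with the normal MLE example from earlier. The following is a self-contained sketch with freshly simulated data (the names negLLn, z, and fit are illustrative; x has since been reassigned to the NCAA matrix, so we simulate anew): nlm is asked for the Hessian of the negative log-likelihood, which is inverted to approximate $I(\beta)^{-1}$, and t-statistics follow.
%%R
#SKETCH: standard errors and t-statistics from the inverse Hessian of -ln L
negLLn = function(params,z) {
    mu = params[1]; sigsq = params[2]
    f = (1/sqrt(2*pi*sigsq))*exp(-0.5*(z-mu)^2/sigsq)
    -sum(log(f))
}
z = rnorm(10000, mean=5, sd=3)
fit = nlm(negLLn, c(4,2), z=z, hessian=TRUE)
se = sqrt(diag(solve(fit$hessian)))   #square root of the diagonal of I(beta)^{-1}
print(fit$estimate)   #MLE of (mu, sigsq)
print(se)   #standard errors
print(fit$estimate/se)   #t-statistics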
When the dependent variable takes more than two categorical values, we use the multinomial logit model; you will need the nnet package for this. The model takes the following form:
\begin{equation} \mbox{Prob}[y = j] = p_j= \frac{\exp(\beta_j' x)}{1+\sum_{j=1}^{J} \exp(\beta_j' x)} \end{equation}We usually set
\begin{equation} \mbox{Prob}[y = 0] = p_0 = \frac{1}{1+\sum_{j=1}^{J} \exp(\beta_j' x)} \end{equation}
%%R
ncaa = read.table("DSTMAA_data/ncaa.txt",header=TRUE)
x = as.matrix(ncaa[4:14])
w1 = (1:16)*0 + 1
w0 = (1:16)*0
y1 = c(w1,w0,w0,w0)
y2 = c(w0,w1,w0,w0)
y3 = c(w0,w0,w1,w0)
y4 = c(w0,w0,w0,w1)
y = cbind(y1,y2,y3,y4)
library(nnet)
res = multinom(y~x)
res
# weights: 52 (36 variable) initial value 88.722839 iter 10 value 71.177975 iter 20 value 60.076921 iter 30 value 51.167439 iter 40 value 47.005269 iter 50 value 45.196280 iter 60 value 44.305029 iter 70 value 43.341689 iter 80 value 43.260097 iter 90 value 43.247324 iter 100 value 43.141297 final value 43.141297 stopped after 100 iterations Call: multinom(formula = y ~ x) Coefficients: (Intercept) xPTS xREB xAST xTO xA.T xSTL y2 -8.847514 -0.1595873 0.3134622 0.6198001 -0.2629260 -2.1647350 -0.813519 y3 65.688912 0.2983748 -0.7309783 -0.6059289 0.9284964 -0.5720152 -1.310701 y4 31.513342 -0.1382873 -0.2432960 0.2887910 0.2204605 -2.6409780 -1.470406 xBLK xPF xFG xFT xX3P y2 0.01472506 0.6521056 -13.77579 10.374888 -3.436073 y3 0.63038878 -0.1788238 -86.37410 -24.769245 -4.897203 y4 -0.31863373 0.5392835 -45.18077 6.701026 -7.841990 Residual Deviance: 86.28259 AIC: 158.2826
%%R
print(names(res))
res$fitted.values
[1] "n" "nunits" "nconn" "conn" [5] "nsunits" "decay" "entropy" "softmax" [9] "censored" "value" "wts" "convergence" [13] "fitted.values" "residuals" "call" "terms" [17] "weights" "deviance" "rank" "lab" [21] "coefnames" "vcoefnames" "xlevels" "edf" [25] "AIC" y1 y2 y3 y4 1 6.785454e-01 3.214178e-01 7.032345e-06 2.972107e-05 2 6.168467e-01 3.817718e-01 2.797313e-06 1.378715e-03 3 7.784836e-01 1.990510e-01 1.688098e-02 5.584445e-03 4 5.962949e-01 3.988588e-01 5.018346e-04 4.344392e-03 5 9.815286e-01 1.694721e-02 1.442350e-03 8.179230e-05 6 9.271150e-01 6.330104e-02 4.916966e-03 4.666964e-03 7 4.515721e-01 9.303667e-02 3.488898e-02 4.205023e-01 8 8.210631e-01 1.530721e-01 7.631770e-03 1.823302e-02 9 1.567804e-01 9.375075e-02 6.413693e-01 1.080996e-01 10 8.403357e-01 9.793135e-03 1.396393e-01 1.023186e-02 11 9.163789e-01 6.747946e-02 7.847380e-05 1.606316e-02 12 2.448850e-01 4.256001e-01 2.880803e-01 4.143463e-02 13 1.040352e-01 1.534272e-01 1.369554e-01 6.055822e-01 14 8.468755e-01 1.506311e-01 5.083480e-04 1.985036e-03 15 7.136048e-01 1.294146e-01 7.385294e-02 8.312770e-02 16 9.885439e-01 1.114547e-02 2.187311e-05 2.887256e-04 17 6.478074e-02 3.547072e-01 1.988993e-01 3.816127e-01 18 4.414721e-01 4.497228e-01 4.716550e-02 6.163956e-02 19 6.024508e-03 3.608270e-01 7.837087e-02 5.547777e-01 20 4.553205e-01 4.270499e-01 3.614863e-04 1.172681e-01 21 1.342122e-01 8.627911e-01 1.759865e-03 1.236845e-03 22 1.877123e-02 6.423037e-01 5.456372e-05 3.388705e-01 23 5.620528e-01 4.359459e-01 5.606424e-04 1.440645e-03 24 2.837494e-01 7.154506e-01 2.190456e-04 5.809815e-04 25 1.787749e-01 8.037335e-01 3.361806e-04 1.715541e-02 26 3.274874e-02 3.484005e-02 1.307795e-01 8.016317e-01 27 1.635480e-01 3.471676e-01 1.131599e-01 3.761245e-01 28 2.360922e-01 7.235497e-01 3.375018e-02 6.607966e-03 29 1.618602e-02 7.233098e-01 5.762083e-06 2.604984e-01 30 3.037741e-02 8.550873e-01 7.487804e-02 3.965729e-02 31 1.122897e-01 8.648388e-01 3.935657e-03 1.893584e-02 32 2.312231e-01 6.607587e-01 4.770775e-02 6.031045e-02 33 6.743125e-01 2.028181e-02 2.612683e-01 4.413746e-02 34 1.407693e-01 4.089518e-02 7.007541e-01 1.175815e-01 35 6.919547e-04 4.194577e-05 9.950322e-01 4.233924e-03 36 8.051225e-02 4.213965e-03 9.151287e-01 1.450423e-04 37 5.691220e-05 7.480549e-02 5.171594e-01 4.079782e-01 38 2.709867e-02 3.808987e-02 6.193969e-01 3.154145e-01 39 4.531001e-05 2.248580e-08 9.999542e-01 4.626258e-07 40 1.021976e-01 4.597678e-03 5.133839e-01 3.798208e-01 41 2.005837e-02 2.063200e-01 5.925050e-01 1.811166e-01 42 1.829028e-04 1.378795e-03 6.182839e-01 3.801544e-01 43 1.734296e-01 9.025284e-04 7.758862e-01 4.978171e-02 44 4.314938e-05 3.131390e-06 9.997892e-01 1.645004e-04 45 1.516231e-02 2.060325e-03 9.792594e-01 3.517926e-03 46 2.917597e-01 6.351166e-02 4.943818e-01 1.503468e-01 47 1.278933e-04 1.773509e-03 1.209486e-01 8.771500e-01 48 1.320000e-01 2.064338e-01 6.324904e-01 2.907578e-02 49 1.683221e-02 4.007848e-01 1.628981e-03 5.807540e-01 50 9.670085e-02 4.314765e-01 7.669035e-03 4.641536e-01 51 4.953577e-02 1.370037e-01 9.882004e-02 7.146405e-01 52 1.787927e-02 9.825660e-02 2.203037e-01 6.635604e-01 53 1.174053e-02 4.723628e-01 2.430072e-03 5.134666e-01 54 2.053871e-01 6.721356e-01 4.169640e-02 8.078090e-02 55 3.060369e-06 1.418623e-03 1.072549e-02 9.878528e-01 56 1.122164e-02 6.566169e-02 3.080641e-01 6.150525e-01 57 8.873716e-03 4.996907e-01 8.222034e-03 4.832136e-01 58 2.164962e-02 2.874313e-01 1.136455e-03 6.897826e-01 59 5.230443e-03 6.430174e-04 9.816825e-01 1.244406e-02 60 8.743368e-02 6.710327e-02 4.260116e-01 
4.194514e-01 61 1.913578e-01 6.458463e-04 3.307553e-01 4.772410e-01 62 6.450967e-07 5.035697e-05 7.448285e-01 2.551205e-01 63 2.400365e-04 4.651537e-03 8.183390e-06 9.951002e-01 64 1.515894e-04 2.631451e-01 1.002332e-05 7.366933e-01
You can see from the results that the probability for category 1 is the base probability $p_0$. What this means is that we compute the other three probabilities, and the remainder is the probability of the first category. We check that the probabilities across each row for all four categories add up to 1:
%%R
rowSums(res$fitted.values)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 53 54 55 56 57 58 59 60 61 62 63 64 1 1 1 1 1 1 1 1 1 1 1 1
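We can also map these fitted probabilities into a predicted category for each team; calling predict on the multinom fit with type="class" returns the most likely category. This is a quick illustration, assuming res from the multinom cell above is still in the session.
%%R
#PREDICTED CATEGORY = the category with the highest fitted probability
print(predict(res, type="class"))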
The standard linear regression model often does not apply, and we need to be careful not to overuse it. Peter Kennedy, in his excellent book "A Guide to Econometrics", states five cases in which critical assumptions for OLS are violated, warning us against its use in those situations.
Non-zero mean regression residuals. This occurs with truncated residuals (see discussion below) and in sample selection problems, where the fitted model to a selected subsample would result in non-zero mean errors for the full sample. This is also known as the biased intercept problem. The errors may also be correlated with the regressors, i.e., endogeneity (see below).
Residuals are not iid. This occurs in two ways. (a) Heteroskedasticity, i.e., the variances of the residuals are not the same across observations, a violation of the identically distributed assumption. (b) Autocorrelation, where the residuals are correlated with each other, a violation of the independence assumption.
Endogeneity. Here the observations of regressors $x$ cannot be assumed to be fixed in repeated samples. This occurs in several ways. (a) Errors in variables, i.e., measurement of $x$ in error. (b) Omitted variables, which is a form of errors in variables. (c) Autoregression, i.e., using a lagged value of the dependent variable as an independent variable, as in VARs. (d) Simultaneous equation systems, where all variables are endogenous, and this is also known as reverse causality. For example, changes in tax rates changes economic behavior, and hence income, which may result in further policy changes in tax rates, and so on. Because the $x$ variables are correlated with the errors $\epsilon$, they are no longer exogenous, and hence we term this situation one of "endogeneity".
Violation of the requirement that $n > p$, i.e., that the number of observations ($n$) exceed the number of independent variables ($p$), the dimension of $x$. An effective violation also occurs when two regressors are highly correlated with each other, which is known as multicollinearity.
Sample selection problems arise because the sample is truncated based on some selection criterion, and the regression that is run is biased because the sample is biased and does not reflect the true/full population. For example, wage data is only available for people who decided to work, i.e., the wage was worth their while, and above their reservation wage. If we are interested in finding out the determinants of wages, we need to take this fact into account, i.e., the sample only contains people who were willing to work at the wage levels that were in turn determined by demand and supply of labor. The sample becomes non-random. It explains the curious case that women with more children tend to have lower wages (because they need the money and hence, their reservation wage is lower).
Usually we handle sample selection issues using a two-equation regression approach. The first equation determines if an observation enters the sample. The second equation then assesses the model of interest, e.g., what determines wages. We will look at an example later.
But first, we provide some basic mathematical results that we need later. And of course, we need to revisit our Bayesian ideas again!
Given a probability density $f(x)$,
\begin{equation} f(x | x > a) = \frac{f(x)}{Pr(x>a)} \end{equation}If we are using the normal distribution then this is:
\begin{equation} f(x | x > a) = \frac{\phi(x)}{1-\Phi(a)} \end{equation}If $x \sim N(\mu, \sigma^2)$, then
\begin{equation} E(x | x>a) = \mu + \sigma\; \frac{\phi(c)}{1-\Phi(c)}, \quad c = \frac{a-\mu}{\sigma} \end{equation}Note that this expectation is provided without proof, as are the next few ones. For example, if we let $x$ be standard normal and we want $E[x \mid x > -1]$, we have
%%R
dnorm(-1)/(1-pnorm(-1))
[1] 0.2876
For the same distribution
\begin{equation} E(x | x < a) = \mu + \sigma\; \frac{-\phi(c)}{\Phi(c)}, \quad c = \frac{a-\mu}{\sigma} \end{equation}For example, $E[x | x < 1]$ is
%%R
-dnorm(1)/pnorm(1)
[1] -0.2876
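These closed-form truncated means are easy to verify by simulation. The cell below is an illustrative check with a large standard normal sample (zsim is just a throwaway name), comparing the empirical means of the truncated samples with the formulas above.
%%R
#SIMULATION CHECK of the truncated-mean formulas
zsim = rnorm(1000000)
print(mean(zsim[zsim > -1]))   #should be close to  dnorm(-1)/(1-pnorm(-1)) =  0.2876
print(mean(zsim[zsim < 1]))    #should be close to -dnorm(1)/pnorm(1)       = -0.2876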
The value $\frac{\phi(c)}{1-\Phi(c)}$ or $\frac{-\phi(c)}{\Phi(c)}$, as the case may be, is often denoted $\lambda(c)$, which is also known as the inverse Mills ratio.
If $y$ and $x$ are correlated (with correlation $\rho$), and $y \sim N(\mu_y,\sigma_y^2)$, then
\begin{eqnarray*} Pr(y,x | x>a) &=& \frac{f(y,x)}{Pr(x>a)} \\ E(y | x>a) &=& \mu_y + \sigma_y \rho \lambda(c), \quad c = \frac{a-\mu}{\sigma} \end{eqnarray*}This leads naturally to the truncated regression model. Suppose we have the usual regression model where
\begin{equation} y = \beta'x + e, \quad e \sim N(0,\sigma^2) \end{equation}But suppose we restrict attention in our model to values of $y$ that are greater than a cut off $a$. We can then write down by inspection the following correct model (no longer is the simple linear regression valid)
\begin{equation} E(y | y > a) = \beta' x + \sigma \; \frac{\phi[(a-\beta'x)/\sigma]}{1-\Phi[(a-\beta'x)/\sigma]} \end{equation}Therefore, when the sample is truncated, then we need to run the regression above, i.e., the usual right-hand side $\beta' x$ with an additional variable, i.e., the Inverse Mill's ratio. We look at this in a real-world example.
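Before turning to that example, here is a minimal simulated sketch of the point (all parameter values and names below, such as nt and imr, are invented for illustration): OLS on a truncated sample gives biased coefficients, while adding the inverse Mills ratio term restores them. In practice the ratio would be built from first-stage estimates rather than the true parameters used here.
%%R
#SKETCH: truncation bias and the inverse Mills ratio correction (simulated data)
set.seed(42)
nt = 20000; b0 = 1; b1 = 2; sig = 2
xt = rnorm(nt)
yt = b0 + b1*xt + sig*rnorm(nt)
a = 2   #we only observe y > a
idx = which(yt > a)
print(coef(lm(yt[idx] ~ xt[idx])))   #biased: slope well below 2
cc = (a - (b0 + b1*xt[idx]))/sig
imr = dnorm(cc)/(1 - pnorm(cc))   #inverse Mills ratio (using the true parameters)
print(coef(lm(yt[idx] ~ xt[idx] + imr)))   #slope near 2, imr coefficient near sig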
Not all venture-backed firms end up making a successful exit, either via an IPO, through a buyout, or by means of another exit route. By examining a large sample of startup firms, we can measure the probability of a firm making a successful exit. By designating successful exits as $S=1$, and setting $S=0$ otherwise, we use matrix $X$ of explanatory variables and fit a Probit model to the data. We define $S$ to be based on a latent threshold variable $S^*$ such that
\begin{equation} S = \left\{ \begin{array}{ll} 1 & \mbox{if } S^* > 0\\ 0 & \mbox{if } S^* \leq 0. \end{array} \right. \end{equation}where the latent variable is modeled as
\begin{equation} S^* = \gamma' X + u, \quad u \sim N(0,\sigma_u^2) \end{equation}The fitted model provides us the probability of exit, i.e., $E(S)$, for all financing rounds.
\begin{equation} E(S) = E(S^* > 0) = E(u > -\gamma' X) = 1 - \Phi(-\gamma' X) = \Phi(\gamma' X), \end{equation}where $\gamma$ is the vector of coefficients fitted in the Probit model, using standard likelihood methods. The last expression in the equation above follows from the use of normality in the Probit specification. $\Phi(.)$ denotes the cumulative normal distribution.
Suppose we want to examine the role of syndication in venture success. Success in a syndicated venture comes from two broad sources of VC expertise. First, VCs are experienced in picking good projects to invest in, and syndicates are efficient vehicles for picking good firms; this is the selection hypothesis put forth by Lerner (1994). Amongst two projects that appear a-priori similar in prospects, the fact that one of them is selected by a syndicate is evidence that the project is of better quality (ex-post to being vetted by the syndicate, but ex-ante to effort added by the VCs), since the process of syndication effectively entails getting a second opinion by the lead VC. Second, syndicates may provide better monitoring as they bring a wide range of skills to the venture, and this is suggested in the value-added hypothesis of Brander, Amit, and Antweiler (2002).
A regression of venture returns on various firm characteristics and a dummy variable for syndication allows a first pass estimate of whether syndication impacts performance. However, it may be that syndicated firms are simply of higher quality and deliver better performance, whether or not they chose to syndicate. Better firms are more likely to syndicate because VCs tend to prefer such firms and can identify them. In this case, the coefficient on the dummy variable might reveal a value-add from syndication, when indeed, there is none. Hence, we correct the specification for endogeneity, and then examine whether the dummy variable remains significant.
Greene, in his classic book "Econometric Analysis" provides the correction for endogeneity required here. We briefly summarize the model required. The performance regression is of the form:
\begin{equation} Y = \beta' X + \delta Q + \epsilon, \quad \epsilon \sim N(0,\sigma_{\epsilon}^2) \end{equation}where $Y$ is the performance variable; $Q$ is the dummy variable taking a value of 1 if the firm is syndicated, and zero otherwise, and $\delta$ is a coefficient that determines whether performance is different on account of syndication. If it is not, then it implies that the variables $X$ are sufficient to explain the differential performance across firms, or that there is no differential performance across the two types of firms.
However, since these same variables also determine whether the firm syndicates or not, we have an endogeneity issue, which is resolved by adding a correction to the model above. The error term $\epsilon$ is affected by censoring bias in the subsamples of syndicated and non-syndicated firms. When $Q=1$, i.e., when the firm's financing is syndicated, the residual $\epsilon$ has the following expectation
\begin{equation} E(\epsilon | Q=1) = E(\epsilon | S^* >0) = E(\epsilon | u > -\gamma' X) = \rho \sigma_{\epsilon} \left[ \frac{\phi(\gamma' X)}{\Phi(\gamma' X)} \right]. \end{equation}where $\rho = Corr(\epsilon,u)$, and $\sigma_{\epsilon}$ is the standard deviation of $\epsilon$. This implies that
\begin{equation} E(Y | Q=1) = \beta'X + \delta + \rho \sigma_{\epsilon} \left[ \frac{\phi(\gamma' X)}{\Phi(\gamma' X)} \right]. \end{equation}Note that $\phi(-\gamma'X)=\phi(\gamma'X)$, and $1-\Phi(-\gamma'X)=\Phi(\gamma'X)$. For estimation purposes, we write this as the following regression equation:
EQN1 \begin{equation} Y = \delta + \beta' X + \beta_m m(\gamma' X) \end{equation}
where $m(\gamma' X) = \frac{\phi(\gamma' X)}{\Phi(\gamma' X)}$ and $\beta_m = \rho \sigma_{\epsilon}$. Thus, $\{\delta,\beta,\beta_m\}$ are the coefficients estimated in the regression. (Note here that $m(\gamma' X)$ is also known as the inverse Mill's ratio.)
Likewise, for firms that are not syndicated, we have the following result
\begin{equation} E(Y | Q=0) = \beta'X + \rho \sigma_{\epsilon} \left[ \frac{-\phi(\gamma' X)}{1-\Phi(\gamma' X)} \right]. \end{equation}This may also be estimated by linear cross-sectional regression.
EQN0 \begin{equation} Y = \beta' X + \beta_m \cdot m'(\gamma' X) \end{equation}
where $m' = \frac{-\phi(\gamma' X)}{1-\Phi(\gamma' X)}$ and $\beta_m = \rho \sigma_{\epsilon}$.
The estimation model will take the form of a stacked linear regression comprising both equations (EQN1) and (EQN0). This forces $\beta$ to be the same across all firms without necessitating additional constraints, and allows the specification to remain within the simple OLS form. If $\delta$ is significant after this endogeneity correction, then the empirical evidence supports the hypothesis that syndication is a driver of differential performance. If the coefficients $\{\delta, \beta_m\}$ are significant, then the expected difference in performance for each syndicated financing round $(i,j)$ is
\begin{equation} \delta + \beta_m \left[ m(\gamma_{ij}' X_{ij}) - m'(\gamma_{ij}' X_{ij}) \right], \;\;\; \forall i,j. \end{equation}The method above forms one possible approach to addressing treatment effects. Another approach is to estimate a Probit model first, and then to set $m(\gamma' X) = \Phi(\gamma' X)$. This is known as the instrumental variables approach.
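A small simulated sketch of this stacked regression may help fix ideas. All parameter values and names below (ns, gam0, mills, etc.) are invented for illustration, and $\gamma$ is taken as known, whereas in practice it would come from a first-stage probit. We generate correlated errors $(u,\epsilon)$, a selection rule $Q$, and verify that regressing $Y$ on $X$, $Q$, and the stacked Mills-ratio term recovers $\beta$, $\delta$, and $\beta_m = \rho\sigma_{\epsilon}$.
%%R
#SKETCH: stacked regression with Mills-ratio terms (simulated selection model)
set.seed(7)
ns = 50000; rho = 0.5; sig_e = 2
beta1 = 2; delta = 1; gam0 = 0.5; gam1 = 1
X = rnorm(ns)
u = rnorm(ns)   #selection equation error
eps = sig_e*(rho*u + sqrt(1-rho^2)*rnorm(ns))   #outcome error, Corr(eps,u) = rho
Q = as.numeric(gam0 + gam1*X + u > 0)   #syndication indicator
Y = beta1*X + delta*Q + eps
gX = gam0 + gam1*X
mills = ifelse(Q==1, dnorm(gX)/pnorm(gX), -dnorm(gX)/(1-pnorm(gX)))
print(coef(lm(Y ~ X + Q + mills)))   #expect X ~ 2, Q ~ 1, mills ~ rho*sig_e = 1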
Some References: @JEMS:JEMS423; @10.2307/3665618
The correct regression may be run using the sampleSelection package in R. Sample selection models correct for the fact that two subsamples may be different because of treatment effects. Let's take an example with data from the wage market.
This is an example from the package in R itself. The data used is also within the package.
After loading in the package sampleSelection we can use the data set called Mroz87. This contains labour market participation data for women as well as wage levels for women. If we are explaining what drives women's wages we can simply run the following regression. See: http://www.inside-r.org/packages/cran/sampleSelection/docs/Mroz87
The original paper may be downloaded at: http://eml.berkeley.edu/~cle/e250a_f13/mroz-paper.pdf
%%R
library(sampleSelection)
data(Mroz87)
Mroz87$kids = (Mroz87$kids5 + Mroz87$kids618 > 0)
Mroz87$numkids = Mroz87$kids5 + Mroz87$kids618
summary(Mroz87)
lfp hours kids5 kids618 Min. :0.0000 Min. : 0.0 Min. :0.0000 Min. :0.000 1st Qu.:0.0000 1st Qu.: 0.0 1st Qu.:0.0000 1st Qu.:0.000 Median :1.0000 Median : 288.0 Median :0.0000 Median :1.000 Mean :0.5684 Mean : 740.6 Mean :0.2377 Mean :1.353 3rd Qu.:1.0000 3rd Qu.:1516.0 3rd Qu.:0.0000 3rd Qu.:2.000 Max. :1.0000 Max. :4950.0 Max. :3.0000 Max. :8.000 age educ wage repwage hushrs Min. :30.00 Min. : 5.00 Min. : 0.000 Min. :0.00 Min. : 175 1st Qu.:36.00 1st Qu.:12.00 1st Qu.: 0.000 1st Qu.:0.00 1st Qu.:1928 Median :43.00 Median :12.00 Median : 1.625 Median :0.00 Median :2164 Mean :42.54 Mean :12.29 Mean : 2.375 Mean :1.85 Mean :2267 3rd Qu.:49.00 3rd Qu.:13.00 3rd Qu.: 3.788 3rd Qu.:3.58 3rd Qu.:2553 Max. :60.00 Max. :17.00 Max. :25.000 Max. :9.98 Max. :5010 husage huseduc huswage faminc Min. :30.00 Min. : 3.00 Min. : 0.4121 Min. : 1500 1st Qu.:38.00 1st Qu.:11.00 1st Qu.: 4.7883 1st Qu.:15428 Median :46.00 Median :12.00 Median : 6.9758 Median :20880 Mean :45.12 Mean :12.49 Mean : 7.4822 Mean :23081 3rd Qu.:52.00 3rd Qu.:15.00 3rd Qu.: 9.1667 3rd Qu.:28200 Max. :60.00 Max. :17.00 Max. :40.5090 Max. :96000 mtr motheduc fatheduc unem Min. :0.4415 Min. : 0.000 Min. : 0.000 Min. : 3.000 1st Qu.:0.6215 1st Qu.: 7.000 1st Qu.: 7.000 1st Qu.: 7.500 Median :0.6915 Median :10.000 Median : 7.000 Median : 7.500 Mean :0.6789 Mean : 9.251 Mean : 8.809 Mean : 8.624 3rd Qu.:0.7215 3rd Qu.:12.000 3rd Qu.:12.000 3rd Qu.:11.000 Max. :0.9415 Max. :17.000 Max. :17.000 Max. :14.000 city exper nwifeinc wifecoll huscoll Min. :0.0000 Min. : 0.00 Min. :-0.02906 TRUE:212 TRUE:295 1st Qu.:0.0000 1st Qu.: 4.00 1st Qu.:13.02504 FALSE:541 FALSE:458 Median :1.0000 Median : 9.00 Median :17.70000 Mean :0.6428 Mean :10.63 Mean :20.12896 3rd Qu.:1.0000 3rd Qu.:15.00 3rd Qu.:24.46600 Max. :1.0000 Max. :45.00 Max. :96.00000 kids numkids Mode :logical Min. :0.000 FALSE:229 1st Qu.:0.000 TRUE :524 Median :1.000 Mean :1.591 3rd Qu.:3.000 Max. :8.000
%%R
#NOTE: in an R formula, age^2 without I() reduces to age, so no quadratic term is included here; later cells use I(age^2)
res = lm(wage ~ age + age^2 + educ + city, data=Mroz87)
summary(res)
Call: lm(formula = wage ~ age + age^2 + educ + city, data = Mroz87) Residuals: Min 1Q Median 3Q Max -4.5331 -2.2710 -0.4765 1.3975 22.7241 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -3.2490882 0.9094210 -3.573 0.000376 *** age 0.0008193 0.0141084 0.058 0.953708 educ 0.4496393 0.0503591 8.929 < 2e-16 *** city 0.0998064 0.2388551 0.418 0.676174 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.079 on 749 degrees of freedom Multiple R-squared: 0.1016, Adjusted R-squared: 0.09799 F-statistic: 28.23 on 3 and 749 DF, p-value: < 2.2e-16
So, education matters. But since education also determines labor force participation (variable lfp) it may just be that we can use lfp instead. Let's try that.
%%R
res = lm(wage ~ age + age^2 + lfp + city, data=Mroz87)
summary(res)
Call: lm(formula = wage ~ age + age^2 + lfp + city, data = Mroz87) Residuals: Min 1Q Median 3Q Max -4.1815 -0.9869 -0.1624 0.3081 20.6809 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.478793 0.513001 -0.933 0.3510 age 0.004163 0.011333 0.367 0.7135 lfp 4.185897 0.183727 22.783 <2e-16 *** city 0.462158 0.190176 2.430 0.0153 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.489 on 749 degrees of freedom Multiple R-squared: 0.4129, Adjusted R-squared: 0.4105 F-statistic: 175.6 on 3 and 749 DF, p-value: < 2.2e-16
%%R
#LET'S TRY BOTH VARIABLES
Mroz87$educlfp = Mroz87$educ*Mroz87$lfp
res = lm(wage ~ age + age^2 + lfp + educ + city + educlfp , data=Mroz87)
summary(res)
Call: lm(formula = wage ~ age + age^2 + lfp + educ + city + educlfp, data = Mroz87) Residuals: Min 1Q Median 3Q Max -5.8139 -0.7307 -0.0712 0.2261 21.1120 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.528196 0.904949 -0.584 0.5596 age 0.009299 0.010801 0.861 0.3895 lfp -2.028354 0.963841 -2.104 0.0357 * educ -0.002723 0.060710 -0.045 0.9642 city 0.244245 0.182220 1.340 0.1805 educlfp 0.491515 0.077942 6.306 4.89e-10 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.347 on 747 degrees of freedom Multiple R-squared: 0.4792, Adjusted R-squared: 0.4757 F-statistic: 137.4 on 5 and 747 DF, p-value: < 2.2e-16
%%R
#LET'S DROP INTERACTION
res = lm(wage ~ age + age^2 + lfp + educ + city, data=Mroz87)
summary(res)
Call: lm(formula = wage ~ age + age^2 + lfp + educ + city, data = Mroz87) Residuals: Min 1Q Median 3Q Max -4.9849 -1.1053 -0.1626 0.4762 21.0179 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -4.18595 0.71239 -5.876 6.33e-09 *** age 0.01421 0.01105 1.286 0.199 lfp 3.94731 0.18073 21.841 < 2e-16 *** educ 0.29043 0.04005 7.252 1.03e-12 *** city 0.22401 0.18685 1.199 0.231 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.407 on 748 degrees of freedom Multiple R-squared: 0.4514, Adjusted R-squared: 0.4485 F-statistic: 153.9 on 4 and 748 DF, p-value: < 2.2e-16
In fact, it seems like both matter, but we should use the selection equation approach of Heckman, in two stages.
%%R
res = selection(lfp ~ age + age^2 + faminc + kids5 + educ,
wage ~ exper + exper^2 + educ + city, data=Mroz87, method = "2step" )
summary(res)
-------------------------------------------- Tobit 2 model (sample selection model) 2-step Heckman / heckit estimation 753 observations (325 censored and 428 observed) 12 free parameters (df = 742) Probit selection equation: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.394e-01 4.119e-01 0.824 0.410 age -3.424e-02 6.728e-03 -5.090 4.55e-07 *** faminc 3.390e-06 4.267e-06 0.795 0.427 kids5 -8.624e-01 1.111e-01 -7.762 2.78e-14 *** educ 1.162e-01 2.361e-02 4.923 1.05e-06 *** Outcome equation: Estimate Std. Error t value Pr(>|t|) (Intercept) -2.66736 1.30192 -2.049 0.0408 * exper 0.02370 0.01886 1.256 0.2093 educ 0.48816 0.07946 6.144 1.31e-09 *** city 0.44936 0.31585 1.423 0.1553 Multiple R-Squared:0.1248, Adjusted R-Squared:0.1165 Error terms: Estimate Std. Error t value Pr(>|t|) invMillsRatio 0.11082 0.73108 0.152 0.88 sigma 3.09434 NA NA NA rho 0.03581 NA NA NA --------------------------------------------
Note that even after using education to explain lfp in the selection equation, it still matters in the wage equation. So education does really impact wages.
%%R
## Example using binary outcome for selection model.
## We estimate the effect of women's education on their
## chances to get a high wage (> $5/hr in 1975 USD), using PSID data
## We use education as explanatory variable
## and add age, kids, and non-work income as exclusion restrictions.
library(mvtnorm)
data(Mroz87)
m <- selection(lfp ~ educ + age + kids5 + kids618 + nwifeinc,
wage >= 5 ~ educ, data = Mroz87 )
summary(m)
-------------------------------------------- Tobit 2 model with binary outcome (sample selection model) Maximum Likelihood estimation BHHH maximisation, 8 iterations Return code 2: successive function values within tolerance limit Log-Likelihood: -653.2037 753 observations (325 censored and 428 observed) 9 free parameters (df = 744) Probit selection equation: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.430362 0.475966 0.904 0.366 educ 0.156223 0.023811 6.561 1.00e-10 *** age -0.034713 0.007649 -4.538 6.61e-06 *** kids5 -0.890560 0.112663 -7.905 9.69e-15 *** kids618 -0.038167 0.039320 -0.971 0.332 nwifeinc -0.020948 0.004318 -4.851 1.49e-06 *** Outcome equation: Estimate Std. Error t value Pr(>|t|) (Intercept) -4.5213 0.5611 -8.058 3.08e-15 *** educ 0.2879 0.0369 7.800 2.09e-14 *** Error terms: Estimate Std. Error t value Pr(>|t|) rho 0.1164 0.2706 0.43 0.667 --------------------------------------------
%%R
#CHECK THAT THE NUMBER OF KIDS MATTERS OR NOT
Mroz87$numkids = Mroz87$kids5 + Mroz87$kids618
summary(lm(wage ~ numkids, data=Mroz87))
Call: lm(formula = wage ~ numkids, data = Mroz87) Residuals: Min 1Q Median 3Q Max -2.6814 -2.2957 -0.8125 1.3186 23.0900 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.68138 0.17421 15.39 <2e-16 *** numkids -0.19285 0.08069 -2.39 0.0171 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.232 on 751 degrees of freedom Multiple R-squared: 0.007548, Adjusted R-squared: 0.006227 F-statistic: 5.712 on 1 and 751 DF, p-value: 0.0171
%%R
res = selection(lfp ~ age + I(age^2) + faminc + numkids + educ,
wage ~ exper + I(exper^2) + educ + city + numkids, data=Mroz87, method = "2step" )
summary(res)
-------------------------------------------- Tobit 2 model (sample selection model) 2-step Heckman / heckit estimation 753 observations (325 censored and 428 observed) 15 free parameters (df = 739) Probit selection equation: Estimate Std. Error t value Pr(>|t|) (Intercept) -3.725e+00 1.398e+00 -2.664 0.00789 ** age 1.656e-01 6.482e-02 2.554 0.01084 * I(age^2) -2.198e-03 7.537e-04 -2.917 0.00365 ** faminc 4.001e-06 4.204e-06 0.952 0.34161 numkids -1.513e-01 3.827e-02 -3.955 8.39e-05 *** educ 9.224e-02 2.302e-02 4.007 6.77e-05 *** Outcome equation: Estimate Std. Error t value Pr(>|t|) (Intercept) -2.2476932 2.0702572 -1.086 0.278 exper 0.0271253 0.0635033 0.427 0.669 I(exper^2) -0.0001957 0.0019429 -0.101 0.920 educ 0.4726828 0.1037086 4.558 6.05e-06 *** city 0.4389577 0.3166504 1.386 0.166 numkids -0.0471181 0.1420580 -0.332 0.740 Multiple R-Squared:0.1252, Adjusted R-Squared:0.1128 Error terms: Estimate Std. Error t value Pr(>|t|) invMillsRatio -0.11737 1.38036 -0.085 0.932 sigma 3.09374 NA NA NA rho -0.03794 NA NA NA --------------------------------------------
Endogeneity may be technically expressed as arising from a correlation of the independent variables and the error term in a regression. This can be stated as:
\begin{equation} Y = \beta' X + u, \quad E(X\cdot u) \neq 0 \end{equation}This can happen in many ways:

Errors in variables (measurement error). Suppose the true model is $Y = \beta_0 + \beta_1 X + u$, but we only observe ${\tilde X} = X + e$, where the measurement noise $e$ is independent of $X$ and $u$. The regression we actually run is $Y = \beta_0 + \beta_1 {\tilde X} + v$, with $v = u - \beta_1 e$. We see that

\begin{equation} E({\tilde X} \cdot v) = E[(X+e)(u - \beta_1 e)] = -\beta_1 E(e^2) = -\beta_1 Var(e) \neq 0 \end{equation}

Omitted variables. Suppose the true model is $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u$, but we do not have $X_2$, which happens to be correlated with $X_1$; then it will be subsumed in the error term and no longer will $E(X_i \cdot u) = 0, \forall i$.

Simultaneity (reverse causality). Suppose $Y = \beta_0 + \beta_1 X + u$ and, at the same time, $X = \alpha_0 + \alpha_1 Y + v$. The solution to these equations gives the {\it reduced-form} version of the model. \begin{equation} Y = \frac{\beta_0 + \beta_1 \alpha_0}{1 - \alpha_1 \beta_1} + \frac{\beta_1 v + u}{1 - \alpha_1 \beta_1}, \quad \quad X = \frac{\alpha_0 +\alpha_1 \beta_0}{1 - \alpha_1 \beta_1} + \frac{v + \alpha_1 u}{1 - \alpha_1 \beta_1} \end{equation}
From which we can compute the endogeneity result. \begin{equation} Cov(X, u) = Cov\left(\frac{v + \alpha_1 u}{1 - \alpha_1 \beta_1}, u \right) = \frac{\alpha_1}{1 - \alpha_1 \beta_1}\cdot Var(u) \end{equation}
To summarize, if $x$ is correlated with $u$ then $x$ is said to be "endogenous". Endogeneity biases parameter estimates. The solution is to find an instrumental variable (denoted $x'$) that is highly correlated with $x$, but not correlated with $u$. That is, $Cov(x',x) \neq 0$ while $Cov(x',u) = 0$.
But since $x'$ is not really $x$, it adds (uncorrelated) variance to the residuals, because $x' = x + \eta$.
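The standard way to exploit such an instrument is two-stage least squares (2SLS): regress $x$ on the instrument, then regress $y$ on the fitted values from that first stage. Below is a minimal simulated sketch (all names and parameter values, such as zins and xend, are invented). Plain OLS is biased because $x$ is constructed to be correlated with the error, while the two-stage procedure recovers the true slope. Note that the second-stage standard errors printed by lm are not the correct 2SLS standard errors; dedicated routines (e.g., ivreg in the AER package) adjust for this.
%%R
#SKETCH: endogeneity bias and two-stage least squares (simulated data)
set.seed(99)
niv = 50000
zins = rnorm(niv)   #instrument: correlated with x, uncorrelated with the error
uerr = rnorm(niv)   #structural error
xend = 0.5*zins + uerr + rnorm(niv)   #x is endogenous: it contains uerr
yiv = 1 + 2*xend + uerr   #true slope is 2
print(coef(lm(yiv ~ xend)))   #OLS slope is biased upward
xhat = fitted(lm(xend ~ zins))   #first stage
print(coef(lm(yiv ~ xhat)))   #second stage: slope close to 2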
Survival analysis is used to estimate the expected time to an event. We may be interested in estimating mortality, the failure time of equipment, the time to a successful IPO of a startup, etc. If we define the "stopping" time of an event as $\tau$, then we are interested in the cumulative probability of the event occurring by time $t$, i.e.,
$$ F(t) = Pr(\tau \leq t ) $$and the corresponding density function $f(t) = F'(t)$. The hazard rate is defined as the probability that the event occurs at time $t$, conditional on it not having occurred until time $t$, i.e.,
$$ \lambda(t) = \frac{f(t)}{1-F(t)} \in (0,\infty) $$Correspondingly, the probability of survival is
$$ s(t) = \exp\left( -\int_0^t \lambda(u)\; du \right) $$with the probability of failure up to time $t$ then given by
$$ F(t) = 1 - s(t) = 1 -\exp\left( -\int_0^t \lambda(u)\; du \right) $$Empirically, we estimate the hazard rate as follows, for individual $i$:
$$ \lambda_i(t) = \lambda_0(t) \exp[\beta^\top x_i] \geq 0 $$where $\beta$ is a vector of coefficients, and $x_i$ is a vector of characteristics of individual $i$. The function $\lambda_0(t) \geq 0$ is known as the "baseline hazard function". The hazard ratio is defined as $\lambda_i(t)/\lambda_0(t)$. When greater than 1, individual $i$ has a greater hazard than baseline. The log hazard ratio is linear in $x_i$.
$$ \ln \left[ \frac{\lambda_i(t)}{\lambda_0(t)} \right] = \beta^\top x_i $$In order to get some intuition for the hazard rate, suppose we have three friends who just graduated from college, and they all have an equal chance of getting married. Then at any time $t$, the probability that any one gets married, given no one has been married so far is $\lambda_i(t) = \lambda_0(t) = 1/3, \forall t$. Now, if anyone gets married, then the hazard rate will jump to $1/2$.
But what if all the three friends are of different ages, and the propensity to get married is proportional to age? Then
$$ \lambda_i(t) = \frac{\mbox{Age}_i(t)}{\sum_{j=1}^3 \mbox{Age}_j(t)} $$This model may also be extended to include gender and other variables. Given we have data on $M$ individuals, we can order the data by discrete times $t_1 < t_2 < \ldots < t_i < \ldots < t_M$. Some of these times are times to the event, and some are times of existence without the event; the latter are also known as "censoring" times. The values $\delta_1, \delta_2, \ldots, \delta_i, \ldots, \delta_M$ take the value 1 if the individual has experienced the event and zero otherwise. The likelihood of an individual experiencing the event is
$$ L_i(\beta) = \frac{\lambda_i(t_i)}{\sum_{j=i}^M \lambda_j(t_i)} = \frac{\lambda_0(t_i) e^{\beta^\top x_i}}{\sum_{j=i}^M \lambda_0(t_i) e^{\beta^\top x_j}} = \frac{ e^{\beta^\top x_i}}{\sum_{j=i}^M e^{\beta^\top x_j}} $$This accounts for all remaining individuals in the population at time $t_i$. We see that the likelihood does not depend on $t$ as the baseline hazard function cancels out. The parameters $\beta$ are obtained by maximizing the likelihood function:
$$ L(\beta) = \prod_{i=1}^M L_i(\beta)^{\delta_i} $$which uses the subset of data where $\delta_i = 1$.
We use the survival package in R.
%%R
library(survival)
Here is a very small data set. Note the columns that correspond to time to event, and the indicator variable "death" ($\delta$). The $x$ variables are "age" and "female".
%%R
SURV = read.table("DSTMAA_data/survival_data.txt",header=TRUE)
SURV
  id time death age female
1  1    1     1  20      0
2  2    4     0  21      1
3  3    7     1  19      0
4  4   10     1  22      1
5  5   12     0  20      0
6  6   13     1  24      1
We can of course run a linear regression just to see how age and gender affect death, by merely looking at the signs of the coefficients: being older means on average a greater chance of dying, and being female reduces risk.
%%R
#SIMPLE REGRESSION APPROACH
summary(lm(death ~ age+female, SURV))
Call: lm(formula = death ~ age + female, data = SURV) Residuals: 1 2 3 4 5 6 0.27083 -0.41667 0.45833 0.39583 -0.72917 0.02083 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -3.0208 5.2751 -0.573 0.607 age 0.1875 0.2676 0.701 0.534 female -0.5000 0.8740 -0.572 0.607 Residual standard error: 0.618 on 3 degrees of freedom Multiple R-squared: 0.1406, Adjusted R-squared: -0.4323 F-statistic: 0.2455 on 2 and 3 DF, p-value: 0.7967
Instead of a linear regression, estimate the Cox PH model for the survival time. Here the coefficients are reversed in sign because we are estimating survival and not death.
%%R
#COX APPROACH
res = coxph(Surv(time, death) ~ female + age, data = SURV)
summary(res)
plot(survfit(res)) #Plot the baseline survival function
Note that the exp(coef) is the hazard ratio. When it is greater than 1, there is an increase in hazard, and when it is less than 1, there is a decrease in the hazard. We can do a test for proportional hazards as follows, and examine the p-values.
%%R
cox.zph(res)
          rho chisq     p
female  0.563 1.504 0.220
age    -0.472 0.743 0.389
GLOBAL     NA 1.762 0.414
Finally, we are interested in obtaining the baseline hazard function $\lambda_0(t)$ which as we know has dropped out of the estimation. So how do we recover it? In fact, without it, where do we even get $\lambda_i(t)$ from? We would also like to get the cumulative baseline hazard, i.e., $\Lambda_0(t) = \int_0^t \lambda_0(u) du$. Sadly, this is a major deficiency of the Cox PH model. However, one may make a distributional assumption about the form of $\lambda_0(t)$ and then fit it to maximize the likelihood of survival times, after the coefficients $\beta$ have been fit already. For example, one function might be $\lambda_0(t) = e^{\alpha t}$, and it would only need the estimation of $\alpha$. We can then obtain the estimated survival probabilities over time.
%%R
covs <- data.frame(age = 21, female = 0)
summary(survfit(res, newdata = covs, type = "aalen"))
Call: survfit(formula = res, newdata = covs, type = "aalen")

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1      6       1   0.9475   0.108     7.58e-01            1
    7      4       1   0.8672   0.236     5.08e-01            1
   10      3       1   0.7000   0.394     2.32e-01            1
   13      1       1   0.0184   0.117     7.14e-08            1
The "survival" column gives the survival probabilty for various time horizons shown in the first column.
For a useful guide, see https://rpubs.com/daspringate/survival
To sum up, see that the Cox PH model estimates the hazard rate function $\lambda(t)$:
$$ \lambda(t) = \lambda_0(t) \exp[\beta^\top x] $$The "exp(coef)" is the baseline hazard rate multiplier effect. If exp(coef)>1, then an increase in the variable $x$ increases the hazard rate by that factor, and if exp(coef)<1, then it reduces the hazard rate $\lambda(t)$ by that factor. Note that the hazard rate is NOT the probability of survival, and in fact $\lambda(t) \in (0,\infty)$. Note that the probability of survival over time t, if we assume a constant hazard rate $\lambda$ is $s(t) = e^{-\lambda t}$. Of course $s(t) \in (0,1)$.
So, for example, if the current (assumed constant) hazard rate is $\lambda = 0.02$, then the 3-year survival probability is
$$ s(t) = e^{-0.02 \times 3} = 0.9418 $$

This implies an odds ratio of $OR = s/(1-s) = 16.17167$. If the person is female, the odds ratio is adjusted by the exponentiated coefficient, i.e., $OR \times 4.686 = 75.78043$, so the new survival probability is
$$ s(t=3) = \frac{OR}{1+OR} = 0.9870 $$

If age increases by one year, the odds ratio becomes $16.17167 \times 0.3886 = 6.28431$, and the new survival probability is
$$ s(t=3) = \frac{OR}{1+OR} = 0.8627 $$

Note that the hazard rate and the probability of survival go in opposite directions.
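The arithmetic above is easy to check in R; the constant hazard rate of 0.02 and the exponentiated coefficients 4.686 (female) and 0.3886 (age) are the same numbers used in the example.
%%R
#CHECK THE ODDS-RATIO ARITHMETIC ABOVE
lambda = 0.02; t = 3
s = exp(-lambda*t); print(s)          #3-year survival probability = 0.9418
OR = s/(1-s); print(OR)               #odds ratio = 16.17167
OR_female = OR*4.686                  #adjust by exp(coef) for female
print(OR_female/(1+OR_female))        #new survival probability = 0.9870
OR_age = OR*0.3886                    #adjust by exp(coef) for one more year of age
print(OR_age/(1+OR_age))              #new survival probability = 0.8627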
The glmnet package is from Stanford, and you can get all the details and examples here: https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
The package fits generalized linear models while penalizing the number and size of the coefficients, and includes various standard models as special cases.
The objective function minimized is

$$ \min_{\beta} \; \frac{1}{n}\sum_{i=1}^n w_i L(y_i,\beta^\top x_i) + \lambda \left[(1-\alpha) \frac{1}{2}\| \beta \|_2^2 + \alpha \|\beta \|_1\right] $$

where $\|\beta\|_1$ and $\|\beta\|_2$ are the $L_1$ and $L_2$ norms of the vector $\beta$. The idea is to take any loss function and add a penalty to it. For example, if the loss function is the sum of squared residuals $y_i-\beta^\top x_i$, with $w_i=1$ and $\lambda=0$, we recover an ordinary least squares regression model. More generally, $L$ is taken to be the negative log-likelihood of the model.
If only the $L_1$ norm is applied, i.e., $\alpha=1$, we get the Lasso model. If only the $L_2$ norm is applied, i.e., $\alpha=0$, we get ridge regression. As is clear from the equation, $\lambda$ controls the size of the penalty, and increasing this parameter forces a more parsimonious model.
Ridge regression shrinks the size of the coefficients but does not set them exactly to zero, whereas the Lasso shrinks many coefficients all the way to zero, thereby reducing the number of non-zero coefficients in the model.
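Before turning to real data, here is a small simulated sketch of this shrinkage behavior; the data-generating coefficients and the penalty value $\lambda = 0.5$ are made up purely for illustration.
%%R
#SHRINKAGE BEHAVIOR OF LASSO VS RIDGE (simulated illustration)
suppressMessages(library(glmnet))
set.seed(42)
n = 100
X = matrix(rnorm(n*5), n, 5)
yy = drop(X %*% c(1, -2, 0, 0, 3)) + rnorm(n)        #only three variables truly matter
print(coef(glmnet(X, yy, alpha = 1, lambda = 0.5)))  #lasso: some coefficients exactly zero
print(coef(glmnet(X, yy, alpha = 0, lambda = 0.5)))  #ridge: all coefficients shrunk, none zero
print(coef(glmnet(X, yy, alpha = 1, lambda = 0)))    #no penalty: close to OLS
print(coef(lm(yy ~ X)))                              #ordinary least squares for comparison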
Here is an example of lasso ($\alpha=1$):
%%R
library(glmnet)
ncaa = read.table("DSTMAA_data/ncaa.txt",header=TRUE)
y = c(rep(1,32),rep(0,32))   #Binary dependent variable: first 32 observations coded 1, last 32 coded 0
x = as.matrix(ncaa[4:14])    #Predictor matrix: columns 4 through 14 of the ncaa data
res = cv.glmnet(x = x, y = y, family = 'binomial', alpha = 1, type.measure = "auc")  #Cross-validated lasso logit scored by AUC
plot(res)
We may also run glmnet directly (without cross-validation) to obtain the coefficients along the entire path of $\lambda$ values.
%%R
res = glmnet(x = x, y = y, family = 'binomial', alpha = 1)
print(names(res))
print(res)
[1] "a0" "beta" "df" "dim" "lambda" [6] "dev.ratio" "nulldev" "npasses" "jerr" "offset" [11] "classnames" "call" "nobs" Call: glmnet(x = x, y = y, family = "binomial", alpha = 1) Df %Dev Lambda [1,] 0 1.602e-16 0.2615000 [2,] 1 3.357e-02 0.2383000 [3,] 1 6.172e-02 0.2171000 [4,] 1 8.554e-02 0.1978000 [5,] 1 1.058e-01 0.1803000 [6,] 1 1.231e-01 0.1642000 [7,] 1 1.380e-01 0.1496000 [8,] 1 1.508e-01 0.1364000 [9,] 1 1.618e-01 0.1242000 [10,] 2 1.721e-01 0.1132000 [11,] 4 1.851e-01 0.1031000 [12,] 5 1.990e-01 0.0939800 [13,] 4 2.153e-01 0.0856300 [14,] 4 2.293e-01 0.0780300 [15,] 4 2.415e-01 0.0711000 [16,] 5 2.540e-01 0.0647800 [17,] 8 2.730e-01 0.0590200 [18,] 8 2.994e-01 0.0537800 [19,] 8 3.225e-01 0.0490000 [20,] 8 3.428e-01 0.0446500 [21,] 8 3.608e-01 0.0406800 [22,] 8 3.766e-01 0.0370700 [23,] 8 3.908e-01 0.0337800 [24,] 8 4.033e-01 0.0307800 [25,] 8 4.145e-01 0.0280400 [26,] 9 4.252e-01 0.0255500 [27,] 10 4.356e-01 0.0232800 [28,] 10 4.450e-01 0.0212100 [29,] 10 4.534e-01 0.0193300 [30,] 10 4.609e-01 0.0176100 [31,] 10 4.676e-01 0.0160500 [32,] 10 4.735e-01 0.0146200 [33,] 10 4.789e-01 0.0133200 [34,] 10 4.836e-01 0.0121400 [35,] 10 4.878e-01 0.0110600 [36,] 9 4.912e-01 0.0100800 [37,] 9 4.938e-01 0.0091820 [38,] 9 4.963e-01 0.0083670 [39,] 9 4.984e-01 0.0076230 [40,] 9 5.002e-01 0.0069460 [41,] 9 5.018e-01 0.0063290 [42,] 9 5.032e-01 0.0057670 [43,] 9 5.044e-01 0.0052540 [44,] 9 5.055e-01 0.0047880 [45,] 9 5.064e-01 0.0043620 [46,] 9 5.071e-01 0.0039750 [47,] 10 5.084e-01 0.0036220 [48,] 10 5.095e-01 0.0033000 [49,] 10 5.105e-01 0.0030070 [50,] 10 5.114e-01 0.0027400 [51,] 10 5.121e-01 0.0024960 [52,] 10 5.127e-01 0.0022750 [53,] 11 5.133e-01 0.0020720 [54,] 11 5.138e-01 0.0018880 [55,] 11 5.142e-01 0.0017210 [56,] 11 5.146e-01 0.0015680 [57,] 11 5.149e-01 0.0014280 [58,] 11 5.152e-01 0.0013020 [59,] 11 5.154e-01 0.0011860 [60,] 11 5.156e-01 0.0010810 [61,] 11 5.157e-01 0.0009846 [62,] 11 5.158e-01 0.0008971 [63,] 11 5.160e-01 0.0008174 [64,] 11 5.160e-01 0.0007448 [65,] 11 5.161e-01 0.0006786 [66,] 11 5.162e-01 0.0006183 [67,] 11 5.162e-01 0.0005634 [68,] 11 5.163e-01 0.0005134 [69,] 11 5.163e-01 0.0004678 [70,] 11 5.164e-01 0.0004262 [71,] 11 5.164e-01 0.0003883 [72,] 11 5.164e-01 0.0003538 [73,] 11 5.164e-01 0.0003224 [74,] 11 5.164e-01 0.0002938 [75,] 11 5.165e-01 0.0002677 [76,] 11 5.165e-01 0.0002439 [77,] 11 5.165e-01 0.0002222
%%R
b = coef(res)[,25] #Choose the best case with 8 coefficients
print(b)
x1 = c(1,as.numeric(x[18,]))   #Prepend a 1 for the intercept; take the 18th observation
p = 1/(1+exp(-sum(b*x1)))      #Logistic probability for this observation
print(p)
 (Intercept)          PTS          REB          AST           TO          A.T 
-17.30807196   0.04224762   0.13304541   0.00000000  -0.13440922   0.63059336 
         STL          BLK           PF           FG           FT          X3P 
  0.21867734   0.11635708   0.00000000  17.14864198   3.00069901   0.00000000 
[1] 0.7696481
%%R
preds = predict(res, x, type = 'response')   #Predicted probabilities, one column per lambda
print(dim(preds))
preds = preds[,25] #Take the 25th case
print(preds)
print(glmnet:::auc(y, preds))
print(table(y,round(preds,0))) #rounding needed to make 0,1
[1] 64 77
 [1] 0.97443940 0.90157397 0.87711437 0.89911656 0.95684199 0.82949042
 [7] 0.53186622 0.83745812 0.45979765 0.58355756 0.78726183 0.55050365
[13] 0.30633472 0.93605170 0.70646742 0.85811465 0.42394178 0.76964806
[19] 0.40172414 0.66137964 0.69620096 0.61569705 0.88800581 0.92834645
[25] 0.82719624 0.17209046 0.66881541 0.84149477 0.58937886 0.64674446
[31] 0.79368965 0.51186217 0.58500925 0.61275721 0.17532362 0.47406867
[37] 0.24314471 0.11843924 0.26787937 0.24296988 0.21129918 0.05041436
[43] 0.30109650 0.14989973 0.17976216 0.57119150 0.05514704 0.46220128
[49] 0.63788393 0.32605605 0.35544396 0.12647374 0.61772958 0.63883954
[55] 0.02306762 0.21285032 0.36455131 0.53953727 0.18563868 0.23598354
[61] 0.11821886 0.04258418 0.19603015 0.24630145
[1] 0.9072266
   
y    0  1
  0 25  7
  1  5 27
ROC stands for Receiver Operating Characteristic. The acronym comes from signal detection theory, where the interest is in the number of true positive signals identified. The idea is simple and best explained with an example. Suppose you have an algorithm that estimates a customer's probability $p \in (0,1)$ of buying a product. Take a tagged set of training data (tagged means you know whether each customer bought the product or not) and sort the customers by this probability, with the highest propensity to buy on the left, so that the probability declines monotonically as you move to the right. Now, starting from the left, plot a line that jumps up by one unit whenever the customer bought the product and stays flat otherwise. If the algorithm is a good one, the line will rise quickly at first and then flatten out.
Let's take the data and fitted probabilities from above and plot the ROC curve by writing our own code.
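A minimal hand-rolled version might look like the sketch below; it reuses the y and preds objects computed above and approximates the area under the curve with the trapezoidal rule (the object names thresholds, tpr, and fpr are ours).
%%R
#HAND-ROLLED ROC CURVE (a sketch using y and preds from above)
thresholds = sort(unique(preds), decreasing = TRUE)
tpr = sapply(thresholds, function(h) sum(preds >= h & y == 1)/sum(y == 1))  #true positive rate
fpr = sapply(thresholds, function(h) sum(preds >= h & y == 0)/sum(y == 0))  #false positive rate
xs = c(0, fpr); ys = c(0, tpr)
plot(xs, ys, type = "l", xlab = "False positive rate", ylab = "True positive rate"); grid()
abline(0, 1, lty = 2)                                   #diagonal = random classifier
print(sum(diff(xs)*(head(ys, -1) + tail(ys, -1))/2))    #trapezoidal approximation to the AUC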
We can do the same with the pROC package. Here is the code.
%%R
suppressMessages(library(pROC))
res = roc(response=y,predictor=preds)
print(res)
plot(res); grid()
Call:
roc.default(response = y, predictor = preds)

Data: preds in 32 controls (y 0) < 32 cases (y 1).
Area under the curve: 0.9072
We see that "specificity" equals the true negative rate, and is also denoted as "recall". And that the true positive rate is also labeled as "sensitivity". The AUC or "area under the curve"" is the area between the curve and the diagonal divided by the area in the top right triangle of the diagram. This is also reported and is the same number as obtained when we fitted the model using the glmnet function before. For nice graphics that explain all these measures and more, see https://en.wikipedia.org/wiki/Precision_and_recall
As we did before, we may fit a Cox PH model using glmnet, with the additional feature that a penalty is included when we maximize the likelihood function.
%%R
SURV = read.table("DSTMAA_data/survival_data.txt",header=TRUE)
print(SURV)
names(SURV)[3] = "status"    #glmnet's Cox family expects response columns named "time" and "status"
y = as.matrix(SURV[,2:3])    #Response matrix: (time, status)
x = as.matrix(SURV[,4:5])    #Predictors: age, female
  id time death age female
1  1    1     1  20      0
2  2    4     0  21      1
3  3    7     1  19      0
4  4   10     1  22      1
5  5   12     0  20      0
6  6   13     1  24      1
%%R
res = glmnet(x, y, family = "cox")
print(res)
Call:  glmnet(x = x, y = y, family = "cox") 

      Df    %Dev   Lambda
 [1,]  0 0.00000 0.331700
 [2,]  1 0.02347 0.302200
 [3,]  1 0.04337 0.275400
 [4,]  1 0.06027 0.250900
 [5,]  1 0.07466 0.228600
 [6,]  1 0.08690 0.208300
 [7,]  1 0.09734 0.189800
 [8,]  1 0.10620 0.172900
 [9,]  1 0.11380 0.157600
[10,]  1 0.12020 0.143600
[11,]  1 0.12570 0.130800
[12,]  1 0.13040 0.119200
[13,]  1 0.13430 0.108600
[14,]  1 0.13770 0.098970
[15,]  1 0.14050 0.090180
[16,]  1 0.14300 0.082170
[17,]  1 0.14500 0.074870
[18,]  1 0.14670 0.068210
[19,]  1 0.14820 0.062150
[20,]  1 0.14940 0.056630
[21,]  1 0.15040 0.051600
[22,]  1 0.15130 0.047020
[23,]  1 0.15200 0.042840
[24,]  1 0.15260 0.039040
[25,]  1 0.15310 0.035570
[26,]  2 0.15930 0.032410
[27,]  2 0.16480 0.029530
[28,]  2 0.16930 0.026910
[29,]  2 0.17320 0.024520
[30,]  2 0.17640 0.022340
[31,]  2 0.17910 0.020350
[32,]  2 0.18140 0.018540
[33,]  2 0.18330 0.016900
[34,]  2 0.18490 0.015400
[35,]  2 0.18630 0.014030
[36,]  2 0.18740 0.012780
[37,]  2 0.18830 0.011650
[38,]  2 0.18910 0.010610
[39,]  2 0.18980 0.009669
[40,]  2 0.19030 0.008810
[41,]  2 0.19080 0.008028
[42,]  2 0.19120 0.007314
[43,]  2 0.19150 0.006665
[44,]  2 0.19180 0.006073
[45,]  2 0.19200 0.005533
[46,]  2 0.19220 0.005042
[47,]  2 0.19240 0.004594
[48,]  2 0.19250 0.004186
[49,]  2 0.19260 0.003814
[50,]  2 0.19270 0.003475
[51,]  2 0.19280 0.003166
[52,]  2 0.19280 0.002885
[53,]  2 0.19290 0.002629
[54,]  2 0.19290 0.002395
[55,]  2 0.19300 0.002182
[56,]  2 0.19300 0.001988
%%R
plot(res)
print(coef(res))
2 x 56 sparse Matrix of class "dgCMatrix"
age    .          -0.03232796 -0.06240328 -0.09044971 -0.1166396 -0.1411157 -0.1639991
female .           .           .           .           .          .          .        
age    -0.185342  -0.2053471  -0.2240373  -0.2414872  -0.2577658 -0.272938 
female  .          .           .           .           .          .        
age    -0.2870651 -0.3002053  -0.3124148  -0.3237473  -0.3342545 -0.3440275
female  .          .           .           .           .          .        
age    -0.3530249 -0.3613422  -0.3690231  -0.3761098  -0.3826423 -0.3886591
female  .          .           .           .           .          .        
age    -0.4300447 -0.4704889  -0.5078614  -0.5424838  -0.5745449 -0.6042077
female  0.1232263  0.2429576   0.3522138   0.4522592   0.5439278  0.6279337
age    -0.6316057 -0.6569988  -0.6804703  -0.7022042  -0.7222141 -0.7407295
female  0.7048655  0.7754539   0.8403575   0.9000510   0.9546989  1.0049765
age    -0.7577467 -0.773467   -0.7878944  -0.8012225  -0.8133071 -0.8246563
female  1.0509715  1.093264    1.1319284   1.1675026   1.1999905  1.2297716
age    -0.8349496 -0.8442393  -0.8528942  -0.860838   -0.8680639 -0.874736 
female  1.2570025  1.2817654   1.3045389   1.325398    1.3443458  1.361801 
age    -0.8808466 -0.8863844  -0.8915045  -0.8961894  -0.9004172 -0.9043351
female  1.3777603  1.3922138   1.4055495   1.4177359   1.4287319  1.4389022
age    -0.9079181
female  1.4481934
With cross-validation, we get the usual plot of the cross-validated error against $\lambda$, and can extract the coefficients at the best $\lambda$ (lambda.min).
%%R
cvfit = cv.glmnet(x, y, family = "cox")
plot(cvfit)
print(cvfit$lambda.min)
print(coef(cvfit,s=cvfit$lambda.min))
[1] 0.0989681
2 x 1 sparse Matrix of class "dgCMatrix"
                1
age    -0.2870651
female          .
Note that the signs of the coefficients along the path are the same as we had earlier, i.e., survival is lower with age and higher for females, although at the optimal $\lambda$ the female coefficient has been shrunk all the way to zero.
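As a small illustrative sketch, the cross-validated fit can also be used to score a new subject at the optimal $\lambda$, reusing the covariate profile from earlier (age 21, male); for a Cox fit, type = "response" returns the relative risk $\exp(\beta^\top x)$.
%%R
#RELATIVE RISK FOR A NEW SUBJECT AT lambda.min (illustrative)
newx = matrix(c(21, 0), nrow = 1, dimnames = list(NULL, c("age", "female")))
print(predict(cvfit, newx = newx, s = "lambda.min", type = "link"))      #linear predictor beta'x
print(predict(cvfit, newx = newx, s = "lambda.min", type = "response"))  #relative risk exp(beta'x)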