%pylab inline
import os
from ipypublish import nb_setup
Populating the interactive namespace from numpy and matplotlib
%load_ext rpy2.ipython
#%load_ext RWinOut
"Walking on water and developing software from a specification are easy if both are frozen" -- Edward V. Berard
In this chapter, we develop some expertise in using the R statistical package. See the manual https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf on the R web site. Work through Appendix A, at least the first page. Also see Grant Farnsworth's document "Econometrics in R": https://cran.r-project.org/doc/contrib/Farnsworth-EconometricsInR.pdf.
There is also a book that I personally find to be of very high quality: "The Art of R Programming" by Norman Matloff.
You can easily install the R programming language, which is a very useful tool for Machine Learning. See: http://en.wikipedia.org/wiki/Machine_learning
Get R from: http://www.r-project.org/ (download and install it).
If you want to use R in IDE mode, download RStudio: http://www.rstudio.com.
Here is a quick test to make sure your installation of R is working, along with its graphics capabilities.
%%R
#PLOT HISTOGRAM FROM STANDARD NORMAL RANDOM NUMBERS
x = rnorm(1000000)
hist(x,50)
grid(col="blue",lwd=2)
If you want to directly access the system you can issue system commands as follows:
%%R
#SYSTEM COMMANDS
#The following command lists the files in the current directory.
print(system("ls -lt")) #The listing goes to the terminal, so only the exit status (0) appears in the notebook.
[1] 0
To get started, we need to grab some data. Go to Yahoo! Finance and download some historical data in an Excel spreadsheet, re-sort it into chronological order, then save it as a CSV file. Read the file into R as follows.
%%R
#READ IN DATA FROM CSV FILE
data = read.csv("DSTMAA_data/goog.csv",header=TRUE)
print(head(data))
m = length(data)
n = length(data[,1])
print(c("Number of columns = ",m))
print(c("Length of data series = ",n))
        Date     Open     High      Low    Close Adj.Close   Volume
1 2004-08-19 49.67690 51.69378 47.66995 49.84580  49.84580 44994500
2 2004-08-20 50.17863 54.18756 49.92529 53.80505  53.80505 23005800
3 2004-08-23 55.01717 56.37334 54.17266 54.34653  54.34653 18393200
4 2004-08-24 55.26058 55.43942 51.45036 52.09616  52.09616 15361800
5 2004-08-25 52.14087 53.65105 51.60436 52.65751  52.65751  9257400
6 2004-08-26 52.13591 53.62621 51.99184 53.60634  53.60634  7148200
[1] "Number of columns = "     "7"
[1] "Length of data series = " "3607"
%%R
#REVERSE ORDER THE DATA (Also get some practice with a for loop)
for (j in 1:m) {
data[,j] = rev(data[,j])
}
print(head(data))
stkp = as.matrix(data$Adj.Close)  #adjusted close price (column 6); column 7 is Volume
plot(stkp,type="l",col="blue")
grid(lwd=2)
        Date    Open    High     Low   Close Adj.Close  Volume
1 2018-12-14 1049.98 1062.60 1040.79 1042.10   1042.10 1685900
2 2018-12-13 1068.07 1079.76 1053.93 1061.90   1061.90 1329800
3 2018-12-12 1068.00 1081.65 1062.79 1063.68   1063.68 1523800
4 2018-12-11 1056.49 1060.60 1039.84 1051.75   1051.75 1394700
5 2018-12-10 1035.05 1048.45 1023.29 1039.55   1039.55 1807700
6 2018-12-07 1060.01 1075.26 1028.50 1036.58   1036.58 2101200
We can do the same data set-up exercise for financial data using the quantmod package.
Note: to install a package you can use the drop-down menus on Windows and Mac operating systems inside RStudio, and use a package installer on Linux.
You can also install R packages from the console using conda: conda install r-<packagename> (for example, conda install r-quantmod).
Or issue the following command from the notebook (when asked for a CRAN mirror "selection", enter 60 for California):
%%R
#install.packages("quantmod")
NULL
Now we move on to using this package for one stock.
%%R
#USE THE QUANTMOD PACKAGE TO GET STOCK DATA
library(quantmod)
getSymbols("IBM")
(Loading quantmod also loads its dependencies xts, zoo, and TTR, and prints a note that getSymbols currently uses auto.assign=TRUE by default; rpy2 surfaces these startup messages as warnings, which may be ignored.)
[1] "IBM"
%%R
chartSeries(IBM)
Let's take a quick look at the data.
%%R
head(IBM)
           IBM.Open IBM.High IBM.Low IBM.Close IBM.Volume IBM.Adjusted
2007-01-03    97.18    98.40   96.26     97.27    9196800     68.48550
2007-01-04    97.25    98.79   96.88     98.31   10524500     69.21778
2007-01-05    97.60    97.95   96.91     97.42    7221300     68.59115
2007-01-08    98.50    99.50   98.35     98.90   10340000     69.63318
2007-01-09    99.08   100.33   99.07    100.07   11108200     70.45693
2007-01-10    98.50    99.05   97.93     98.89    8744800     69.62614
Extract the dates using pipes (we will see this in more detail later).
%%R
library(magrittr)
dts = IBM %>% as.data.frame %>% row.names
dts %>% head %>% print
dts %>% length %>% print
[1] "2007-01-03" "2007-01-04" "2007-01-05" "2007-01-08" "2007-01-09"
[6] "2007-01-10"
[1] 3205
Plot the data.
%%R
stkp = as.matrix(IBM$IBM.Adjusted)
rets = diff(log(stkp))
dts = as.Date(dts)
plot(dts,stkp,type="l",col="blue",xlab="Years",ylab="Stock Price of IBM")
grid(lwd=2)
Summarize the data.
%%R
#DESCRIPTIVE STATS
summary(IBM)
     Index               IBM.Open         IBM.High         IBM.Low     
 Min.   :2007-01-03   Min.   : 72.74   Min.   : 76.98   Min.   : 69.5  
 1st Qu.:2010-03-10   1st Qu.:125.92   1st Qu.:126.99   1st Qu.:124.7  
 Median :2013-05-15   Median :149.60   Median :150.54   Median :148.3  
 Mean   :2013-05-14   Mean   :149.81   Mean   :151.00   Mean   :148.7  
 3rd Qu.:2016-07-20   3rd Qu.:175.65   3rd Qu.:177.27   3rd Qu.:174.8  
 Max.   :2019-09-25   Max.   :215.38   Max.   :215.90   Max.   :214.3  
   IBM.Close        IBM.Volume         IBM.Adjusted   
 Min.   : 71.74   Min.   : 1027500   Min.   : 52.12  
 1st Qu.:126.08   1st Qu.: 3489400   1st Qu.: 94.93  
 Median :149.25   Median : 4699100   Median :132.22  
 Mean   :149.89   Mean   : 5631818   Mean   :122.00  
 3rd Qu.:175.77   3rd Qu.: 6799200   3rd Qu.:144.85  
 Max.   :215.80   Max.   :30770700   Max.   :169.22  
Compute risk (volatility).
%%R
#STOCK VOLATILITY
sigma_daily = sd(rets)
sigma_annual = sigma_daily*sqrt(252)
print(sigma_annual)
print(c("Sharpe ratio = ",mean(rets)*252/sigma_annual))
[1] 0.2217987
[1] "Sharpe ratio = "  "0.26146725711648"
We may also use the package to get data for more than one stock.
%%R
library(quantmod)
getSymbols(c("GOOG","AAPL","CSCO","IBM"))
[1] "GOOG" "AAPL" "CSCO" "IBM"
We now go ahead and concatenate columns of data into one stock data set.
%%R
goog = as.numeric(GOOG[,6])
aapl = as.numeric(AAPL[,6])
csco = as.numeric(CSCO[,6])
ibm = as.numeric(IBM[,6])
stkdata = cbind(goog,aapl,csco,ibm)
dim(stkdata)
[1] 3205 4
Now, compute daily returns. This time, we compute continuously compounded (log) returns. The mean returns are:
%%R
n = dim(stkdata)[1]
rets = log(stkdata[2:n,]/stkdata[1:(n-1),])
colMeans(rets)
        goog         aapl         csco          ibm 
0.0005235354 0.0009525354 0.0002568174 0.0002301313 
We can also compute the covariance matrix and correlation matrix:
%%R
cv = cov(rets)
print(cv,2)
cr = cor(rets)
print(cr,4)
        goog    aapl    csco     ibm
goog 0.00032 0.00019 0.00017 0.00011
aapl 0.00019 0.00039 0.00018 0.00013
csco 0.00017 0.00018 0.00033 0.00014
ibm  0.00011 0.00013 0.00014 0.00020
       goog   aapl   csco    ibm
goog 1.0000 0.5462 0.5063 0.4557
aapl 0.5462 1.0000 0.4945 0.4616
csco 0.5063 0.4945 1.0000 0.5589
ibm  0.4557 0.4616 0.5589 1.0000
Notice that the print command allows you to choose the number of significant digits (here 2 for the covariance matrix and 4 for the correlation matrix). Also, as expected, the four return time series are positively correlated with each other.
Data frames are the most essential data structure in the R programming language. One may think of a data frame as simply a spreadsheet. In fact you can view it as such with the following command.
%%R
#Only works in RStudio
#View(data)
NULL
However, data frames in R are much more than mere spreadsheets, which is why Excel will never trump R in the handling and analysis of data, except for very small applications on small spreadsheets. One may also think of data frames as databases, and there are many commands that we may use that are database-like, such as joins, merges, filters, selections, etc. Indeed, packages such as dplyr and data.table are designed to make these operations seamless, and to operate efficiently on big data, where the number of observations (rows) is of the order of hundreds of millions. A small example with dplyr follows.
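As an illustration, here is a hedged sketch (not part of the original text) of dplyr's database-style verbs, assuming the dplyr package is installed: filter the Google data frame down to high-volume days and keep just two columns.
%%R
#A sketch of dplyr's database-like operations (assumes dplyr is installed)
library(dplyr)
data %>%
    filter(Volume > 10000000) %>%   #keep only days with volume above 10 million shares
    select(Date, Adj.Close) %>%     #keep just two columns
    head %>%
    print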
Data frames can be addressed by column names, so that we do not need to remember column numbers specifically. If you want to find the names of all columns in a data frame, the names function does the trick. To address a chosen column, append the column name to the data frame using the "$" connector, as shown below.
%%R
#THIS IS A DATA FRAME AND CAN BE REFERENCED BY COLUMN NAMES
print(names(data))
print(head(data$Close))
[1] "Date"      "Open"      "High"      "Low"       "Close"     "Adj.Close"
[7] "Volume"   
[1] 1042.10 1061.90 1063.68 1051.75 1039.55 1036.58
The command printed out the first few observations in the column "Close". All variables and functions in R are "objects", and you are well-served to know the object type, because objects have properties and methods apply differently to objects of various types. Therefore, to check an object type, use the class function.
%%R
class(data)
[1] "data.frame"
To obtain descriptive statistics on the data variables in a data frame, the summary function is very handy.
%%R
#DESCRIPTIVE STATISTICS
summary(data)
         Date           Open              High              Low         
 2004-08-19:   1   Min.   :  49.27   Min.   :  50.54   Min.   :  47.67  
 2004-08-20:   1   1st Qu.: 232.20   1st Qu.: 234.77   1st Qu.: 229.58  
 2004-08-23:   1   Median : 303.78   Median : 306.46   Median : 301.22  
 2004-08-24:   1   Mean   : 438.93   Mean   : 442.87   Mean   : 434.65  
 2004-08-25:   1   3rd Qu.: 584.82   3rd Qu.: 587.96   3rd Qu.: 579.83  
 2004-08-26:   1   Max.   :1271.00   Max.   :1273.89   Max.   :1249.02  
 (Other)   :3601                                                        
     Close           Adj.Close           Volume        
 Min.   :  49.68   Min.   :  49.68   Min.   :    7900  
 1st Qu.: 232.10   1st Qu.: 232.10   1st Qu.: 2050450  
 Median : 303.94   Median : 303.94   Median : 4791200  
 Mean   : 438.82   Mean   : 438.82   Mean   : 7518149  
 3rd Qu.: 583.90   3rd Qu.: 583.90   3rd Qu.: 9869450  
 Max.   :1268.33   Max.   :1268.33   Max.   :82768100  
Let's take a given column of data and perform some transformations on it. We can also plot the data, with some arguments for look and feel, using the plot function.
%%R
#USING A PARTICULAR COLUMN
stkp = data$Adj.Close
dt = data$Date
print(c("Length of stock series = ",length(stkp)))
#First difference of log stock prices gives continuously compounded returns
rets = diff(log(stkp)) #diff() takes first differences
print(c("Length of return series = ",length(rets)))
print(head(rets))
plot(rets,type="l",col="blue")
[1] "Length of stock series = "  "3607"
[1] "Length of return series = " "3606"
[1]  0.018821894  0.001674866 -0.011279201 -0.011667469 -0.002861184
[6]  0.030544219
In case you want more descriptive statistics than provided by the summary function, then use an appropriate package. We may be interested in the higher-order moments, and we use the moments package for this.
%%R
print(summary(rets))
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-0.1822511 -0.0098321 -0.0005632 -0.0008431  0.0075836  0.1234015 
Compute the daily and annualized standard deviation of returns.
%%R
r_sd = sd(rets)
r_sd_annual = r_sd*sqrt(252)
print(c(r_sd,r_sd_annual))
#What if we take the stdev of annualized returns?
print(sd(rets*252))
#Huh?
print(sd(rets*252))/252
print(sd(rets*252))/sqrt(252)
[1] 0.01896794 0.30110676
[1] 4.779922
[1] 4.779922
[1] 4.779922
[1] 0.3011068
Notice the interesting use of the print function here. Note also that sd(rets*252) scales the daily standard deviation by 252 rather than by sqrt(252), so it overstates the annualized volatility; dividing that number by sqrt(252) recovers the correct value. The variance is easy as well.
%%R
#Variance
r_var = var(rets)
r_var_annual = var(rets)*252
print(c(r_var,r_var_annual))
[1] 0.0003597829 0.0906652783
Skewness and kurtosis are key moments that arise in all return distributions. We need a different library in R for these. We use the moments library.
\begin{equation}
\mbox{Skewness} = \frac{E[(X-\mu)^3]}{\sigma^{3}}
\end{equation}
Skewness means one tail is fatter than the other (asymmetry). Fatter right (left) tail implies positive (negative) skewness.
\begin{equation}
\mbox{Kurtosis} = \frac{E[(X-\mu)^4]}{\sigma^{4}}
\end{equation}
Kurtosis measures the fatness of both tails; values higher than that of the normal distribution mean both tails are fatter than under normality.
%%R
#HIGHER-ORDER MOMENTS
library(moments)
hist(rets,50)
print(c("Skewness=",skewness(rets)))
print(c("Kurtosis=",kurtosis(rets)))
[1] "Skewness="          "-0.57512743445236"
[1] "Kurtosis="          "12.6866762557976"
For the normal distribution, skewness is zero, and kurtosis is 3. Kurtosis minus three is denoted "excess kurtosis".
%%R
print(skewness(rnorm(1000000)))
print(kurtosis(rnorm(1000000)))
[1] 0.00078868
[1] 2.994231
What is the skewness and kurtosis of the stock index (S&P 500)?
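Here is a hedged sketch (not from the original text) of one way to answer this question, pulling the S&P 500 index from Yahoo! Finance via quantmod; the ticker "^GSPC" and the use of the adjusted close (column 6) are assumptions that mirror the earlier examples.
%%R
#A sketch: moments of S&P 500 index returns (assumes internet access and quantmod)
library(quantmod)
library(moments)
getSymbols("^GSPC")                  #creates an object named GSPC
sp = as.numeric(GSPC[,6])            #adjusted close
sp_rets = diff(log(sp))              #continuously compounded returns
print(c("Skewness=",skewness(sp_rets)))
print(c("Kurtosis=",kurtosis(sp_rets)))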
Often the original data is in a space delimited file, not a comma separated one, in which case the read.table function is appropriate.
%%R
#READ IN MORE DATA USING SPACE DELIMITED FILE
data = read.table("DSTMAA_data/markowitzdata.txt",header=TRUE)
print(head(data))
print(c("Length of data series = ",length(data$X.DATE)))
    X.DATE         SUNW         MSFT          IBM         CSCO        AMZN
1 20010102 -0.087443948  0.000000000 -0.002205882 -0.129084975 -0.10843374
2 20010103  0.297297299  0.105187319  0.115696386  0.240150094  0.26576576
3 20010104 -0.060606062  0.010430248 -0.015191546  0.013615734 -0.11743772
4 20010105 -0.096774191  0.014193549  0.008718981 -0.125373140 -0.06048387
5 20010108  0.006696429 -0.003816794 -0.004654255 -0.002133106  0.02575107
6 20010109  0.044345897  0.058748405 -0.010688043  0.015818726  0.09623431
    mktrf     smb     hml      rf
1 -0.0345 -0.0037  0.0209 0.00026
2  0.0527  0.0097 -0.0493 0.00026
3 -0.0121  0.0083 -0.0015 0.00026
4 -0.0291  0.0027  0.0242 0.00026
5 -0.0037 -0.0053  0.0129 0.00026
6  0.0046  0.0044 -0.0026 0.00026
[1] "Length of data series = " "1507"
We compute covariance and correlation in the data frame.
%%R
#COMPUTE COVARIANCE AND CORRELATION
rets = as.data.frame(cbind(data$SUNW,data$MSFT,data$IBM,data$CSCO,data$AMZN))
names(rets) = c("SUNW","MSFT","IBM","CSCO","AMZN")
print(cov(rets))
print(cor(rets))
             SUNW         MSFT          IBM         CSCO         AMZN
SUNW 0.0014380649 0.0003241903 0.0003104236 0.0007174466 0.0004594254
MSFT 0.0003241903 0.0003646160 0.0001968077 0.0003301491 0.0002678712
IBM  0.0003104236 0.0001968077 0.0002991120 0.0002827622 0.0002056656
CSCO 0.0007174466 0.0003301491 0.0002827622 0.0009502685 0.0005041975
AMZN 0.0004594254 0.0002678712 0.0002056656 0.0005041975 0.0016479809
          SUNW      MSFT       IBM      CSCO      AMZN
SUNW 1.0000000 0.4477060 0.4733132 0.6137298 0.2984349
MSFT 0.4477060 1.0000000 0.5959466 0.5608788 0.3455669
IBM  0.4733132 0.5959466 1.0000000 0.5303729 0.2929333
CSCO 0.6137298 0.5608788 0.5303729 1.0000000 0.4029038
AMZN 0.2984349 0.3455669 0.2929333 0.4029038 1.0000000
We may redo the example above using a very useful package called magrittr which mimics pipes in the Unix operating system. In the code below, we pipe the returns data into the correlation function and then "pipe" the output of that into the print function. This is analogous to issuing the command print(cor(rets)).
%%R
#Repeat the same process using pipes
library(magrittr)
rets %>% cor %>% print
          SUNW      MSFT       IBM      CSCO      AMZN
SUNW 1.0000000 0.4477060 0.4733132 0.6137298 0.2984349
MSFT 0.4477060 1.0000000 0.5959466 0.5608788 0.3455669
IBM  0.4733132 0.5959466 1.0000000 0.5303729 0.2929333
CSCO 0.6137298 0.5608788 0.5303729 1.0000000 0.4029038
AMZN 0.2984349 0.3455669 0.2929333 0.4029038 1.0000000
Question: What do you get if you cross a mountain-climber with a mosquito? Answer: Can't be done. You'll be crossing a scalar with a vector.
We will use matrices extensively in modeling, and here we examine the basic commands needed to create and manipulate matrices in R. We create a $4 \times 3$ matrix with random numbers as follows:
%%R
x = matrix(rnorm(12),4,3)
print(x)
           [,1]       [,2]        [,3]
[1,]  0.0518732 -0.6717193 -0.92050937
[2,] -0.2082150  0.4202647 -0.47640145
[3,] -0.1571675 -0.2385928 -0.88844408
[4,] -0.5663517  0.1489406  0.09808534
When we transpose the matrix, notice that the dimensions are reversed.
%%R
print(t(x),3)
        [,1]   [,2]   [,3]    [,4]
[1,]  0.0519 -0.208 -0.157 -0.5664
[2,] -0.6717  0.420 -0.239  0.1489
[3,] -0.9205 -0.476 -0.888  0.0981
Of course, it is easy to multiply matrices as long as they conform. By "conform" we mean that when multiplying one matrix by another, the number of columns of the matrix on the left must equal the number of rows of the matrix on the right. The resulting matrix has the number of rows of the matrix on the left and the number of columns of the matrix on the right. See the examples below:
%%R
print(t(x) %*% x,3)
print(x %*% t(x),3)
       [,1]   [,2]  [,3]
[1,]  0.392 -0.169 0.136
[2,] -0.169  0.707 0.645
[3,]  0.136  0.645 1.873
       [,1]  [,2]    [,3]    [,4]
[1,]  1.301 0.145  0.9699 -0.2197
[2,]  0.145 0.447  0.3557  0.1338
[3,]  0.970 0.356  0.8710 -0.0337
[4,] -0.220 0.134 -0.0337  0.3526
Here is an example of element-wise operations on a matrix, and of what happens with non-conforming matrices (the offending multiplication is commented out, since it throws an error).
%%R
#CREATE A RANDOM MATRIX
x = matrix(runif(12),4,3)
print(x)
print(x*2)
print(x+x)
print(t(x) %*% x) #THIS SHOULD BE 3x3
#print(x %*% x) #SHOULD GIVE AN ERROR
           [,1]      [,2]        [,3]
[1,] 0.11745159 0.7604409 0.616825059
[2,] 0.57022200 0.1588325 0.660023327
[3,] 0.05157777 0.3268232 0.314505190
[4,] 0.40146040 0.8728381 0.005924898
          [,1]      [,2]      [,3]
[1,] 0.2349032 1.5208819 1.2336501
[2,] 1.1404440 0.3176650 1.3200467
[3,] 0.1031555 0.6536465 0.6290104
[4,] 0.8029208 1.7456762 0.0118498
          [,1]      [,2]      [,3]
[1,] 0.2349032 1.5208819 1.2336501
[2,] 1.1404440 0.3176650 1.3200467
[3,] 0.1031555 0.6536465 0.6290104
[4,] 0.8029208 1.7456762 0.0118498
          [,1]      [,2]      [,3]
[1,] 0.5027787 0.5471515 0.4674070
[2,] 0.5471515 1.4721579 0.6818513
[3,] 0.4674070 0.6818513 0.9150526
Taking the inverse of the covariance matrix cv (computed earlier for the four stocks), we get:
%%R
cv_inv = solve(cv)
print(cv_inv,3)
      goog  aapl  csco   ibm
goog  5052 -1619 -1222 -1014
aapl -1619  4121  -961 -1053
csco -1222  -961  5207 -2439
ibm  -1014 -1053 -2439  8180
Check that this really is the inverse!
%%R
print(cv_inv %*% cv,3)
          goog      aapl     csco       ibm
goog  1.00e+00 -6.97e-17 2.02e-16  4.33e-17
aapl  2.07e-16  1.00e+00 1.97e-16  9.96e-17
csco -1.16e-16 -1.70e-16 1.00e+00 -1.48e-16
ibm   2.95e-17  6.03e-18 1.39e-16  1.00e+00
It is: multiplying the inverse by the original matrix gives the identity matrix (up to rounding error).
A covariance matrix should be positive definite. Why? What happens if it is not? Checking for this property is easy.
%%R
library(corpcor)
is.positive.definite(cv)
[1] TRUE
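Why does positive definiteness matter? For any vector of portfolio weights $w$, the portfolio variance is $w' \Sigma w$, which must be positive; a non-positive-definite "covariance" matrix would admit portfolios with zero or negative variance. Here is a hedged sketch (not from the original text) of this calculation using the four-stock covariance matrix cv from above.
%%R
#A sketch: portfolio variance w' cv w should be positive for any weights w
w = matrix(runif(4),4,1)
w = w/sum(w)                 #normalize random weights to sum to 1
print(t(w) %*% cv %*% w)     #portfolio variance (a 1x1 matrix), positive here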
What happens if you compute pairwise covariances from differing lengths of data for each pair?
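Here is a hedged sketch (not from the original text) of one way to explore that question: blank out different stretches of history for different stocks and recompute the covariance matrix using only pairwise-complete observations. With enough mismatched missing data, the resulting matrix need not be positive definite.
%%R
#A sketch: pairwise-complete covariances from unequal samples
rets_na = rets                       #rets is the five-stock return data frame from above
rets_na[1:700,1] = NA                #drop early history for the first stock
rets_na[800:1500,2] = NA             #drop a different stretch for the second stock
cv_pairwise = cov(rets_na, use="pairwise.complete.obs")
print(round(cv_pairwise,6))
library(corpcor)
print(is.positive.definite(cv_pairwise))  #may turn out FALSE for unlucky missing-data patterns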
Let's take the returns data we have and find the inverse.
%%R
cv = cov(rets)
print(round(cv,6))
cv_inv = solve(cv) #TAKE THE INVERSE
print(round(cv_inv %*% cv,2)) #CHECK THAT WE GET IDENTITY MATRIX
         SUNW     MSFT      IBM     CSCO     AMZN
SUNW 0.001438 0.000324 0.000310 0.000717 0.000459
MSFT 0.000324 0.000365 0.000197 0.000330 0.000268
IBM  0.000310 0.000197 0.000299 0.000283 0.000206
CSCO 0.000717 0.000330 0.000283 0.000950 0.000504
AMZN 0.000459 0.000268 0.000206 0.000504 0.001648
     SUNW MSFT IBM CSCO AMZN
SUNW    1    0   0    0    0
MSFT    0    1   0    0    0
IBM     0    0   1    0    0
CSCO    0    0   0    1    0
AMZN    0    0   0    0    1
%%R
#CHECK IF MATRIX IS POSITIVE DEFINITE (why do we check this?)
library(corpcor)
is.positive.definite(cv)
[1] TRUE
Finding roots of nonlinear equations is often required, and R has several packages for this purpose. Here we examine a few examples. Suppose we are given the function
$$
(x^2 + y^2 - 1)^3 - x^2 y^3 = 0
$$
and for various values of $y$ we wish to solve for the values of $x$. The function we use is called multiroot, from the rootSolve package, and its use is shown below.
%%R
#ROOT SOLVING IN R
library(rootSolve)
fn = function(x,y) {
result = (x^2+y^2-1)^3 - x^2*y^3
}
yy = 1
sol = multiroot(f=fn,start=1,maxiter=10000,rtol=0.000001,atol=0.000001,ctol=0.00001,y=yy)
print(c("solution=",sol$root))
check = fn(sol$root,yy)
print(check)
[1] "solution=" "1"
[1] 0
Here we demonstrate the use of another function, uniroot.all (also from the rootSolve package), which finds all the roots of a function within a given interval.
%%R
fn = function(x) {
result = 0.065*(x*(1-x))^0.5- 0.05 +0.05*x
}
sol = uniroot.all(f=fn,c(0,1))
print(sol)
check = fn(sol)
print(check)
[1] 1.0000000 0.3717627
[1] 0.000000e+00 1.041576e-06
In a multivariate linear regression, we have
\begin{equation}
Y = X \cdot \beta + e
\end{equation}
where $Y \in R^{t \times 1}$, $X \in R^{t \times n}$, and $\beta \in R^{n \times 1}$, and the regression solution is simply equal to $\beta = (X'X)^{-1}(X'Y) \in R^{n \times 1}$.
To get this result we minimize the sum of squared errors.
\begin{eqnarray*}
\min_{\beta} e'e &=& (Y - X \cdot \beta)' (Y-X \cdot \beta) \\
&=& Y'(Y-X \cdot \beta) - (X \beta)'\cdot (Y-X \cdot \beta) \\
&=& Y'Y - Y' X \beta - (\beta' X') Y + \beta' X'X \beta \\
&=& Y'Y - Y' X \beta - Y' X \beta + \beta' X'X \beta \\
&=& Y'Y - 2Y' X \beta + \beta' X'X \beta
\end{eqnarray*}
Note that this expression is a scalar.
Differentiating w.r.t. $\beta'$ gives the following first-order condition (f.o.c.):
\begin{eqnarray*}
- 2 X'Y + 2 X'X \beta &=& {\bf 0} \\
& \Longrightarrow & \\
\beta &=& (X'X)^{-1} (X'Y)
\end{eqnarray*}
There is another useful expression for the coefficient in a univariate regression: $\beta = \frac{Cov(X,Y)}{Var(X)}$. You should compute this and check that the slope coefficient in a univariate regression is indeed equal to this ratio of covariance to variance.
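A minimal sketch (not from the original text) verifying the univariate claim on simulated data: the slope reported by lm matches the ratio of the sample covariance to the sample variance.
%%R
#A sketch: for a univariate regression, slope = Cov(x,y)/Var(x)
set.seed(7)
x = rnorm(500)
y = 3 + 2*x + rnorm(500)
print(coef(lm(y ~ x))[2])    #slope estimated by lm
print(cov(x,y)/var(x))       #same number from the covariance formula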
Example: We run a stock return regression to exemplify the algebra above.
%%R
data = read.table("DSTMAA_data/markowitzdata.txt",header=TRUE) #THESE DATA ARE RETURNS
print(names(data)) #THIS IS A DATA FRAME (important construct in R)
head(data)
 [1] "X.DATE" "SUNW"   "MSFT"   "IBM"    "CSCO"   "AMZN"   "mktrf"  "smb"   
 [9] "hml"    "rf"    
    X.DATE         SUNW         MSFT          IBM         CSCO        AMZN
1 20010102 -0.087443948  0.000000000 -0.002205882 -0.129084975 -0.10843374
2 20010103  0.297297299  0.105187319  0.115696386  0.240150094  0.26576576
3 20010104 -0.060606062  0.010430248 -0.015191546  0.013615734 -0.11743772
4 20010105 -0.096774191  0.014193549  0.008718981 -0.125373140 -0.06048387
5 20010108  0.006696429 -0.003816794 -0.004654255 -0.002133106  0.02575107
6 20010109  0.044345897  0.058748405 -0.010688043  0.015818726  0.09623431
    mktrf     smb     hml      rf
1 -0.0345 -0.0037  0.0209 0.00026
2  0.0527  0.0097 -0.0493 0.00026
3 -0.0121  0.0083 -0.0015 0.00026
4 -0.0291  0.0027  0.0242 0.00026
5 -0.0037 -0.0053  0.0129 0.00026
6  0.0046  0.0044 -0.0026 0.00026
%%R
#RUN A MULTIVARIATE REGRESSION ON STOCK DATA
Y = as.matrix(data$SUNW)
X = as.matrix(data[,3:6])
res = lm(Y~X)
summary(res)
Call:
lm(formula = Y ~ X)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.233758 -0.014921 -0.000711  0.014214  0.178859 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.0007256  0.0007512  -0.966  0.33422    
XMSFT        0.1382312  0.0529045   2.613  0.00907 ** 
XIBM         0.3791500  0.0566232   6.696 3.02e-11 ***
XCSCO        0.5769097  0.0317799  18.153  < 2e-16 ***
XAMZN        0.0324899  0.0204802   1.586  0.11286    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.02914 on 1502 degrees of freedom
Multiple R-squared:  0.4112, Adjusted R-squared:  0.4096 
F-statistic: 262.2 on 4 and 1502 DF,  p-value: < 2.2e-16
Now we can cross-check the regression using the algebraic solution for the regression coefficients.
%%R
#CHECK THE REGRESSION
n = length(Y)
X = cbind(matrix(1,n,1),X)
b = solve(t(X) %*% X) %*% (t(X) %*% Y)
print(b)
              [,1]
     -0.0007256342
MSFT  0.1382312148
IBM   0.3791500328
CSCO  0.5769097262
AMZN  0.0324898716
Example: As a second example, we take data on basketball teams in a cross-section, and try to explain their performance using team statistics. Here is a simple regression run on some data from the 2005-06 NCAA basketball season for the March Madness stats. The data is stored in a space-delimited file called ncaa.txt. We use the number of games played as the metric of performance, since more successful teams play more playoff games, and then try to see which variables explain it best. We apply a simple linear regression that uses the R command lm, which stands for "linear model".
%%R
#REGRESSION ON NCAA BASKETBALL PLAYOFF DATA
ncaa = read.table("DSTMAA_data/ncaa.txt",header=TRUE)
print(head(ncaa))
y = ncaa[3]
y = as.matrix(y)
x = ncaa[4:14]
x = as.matrix(x)
  No          NAME GMS  PTS  REB  AST   TO  A.T STL BLK   PF    FG    FT   X3P
1  1 NorthCarolina   6 84.2 41.5 17.8 12.8 1.39 6.7 3.8 16.7 0.514 0.664 0.417
2  2      Illinois   6 74.5 34.0 19.0 10.2 1.87 8.0 1.7 16.5 0.457 0.753 0.361
3  3    Louisville   5 77.4 35.4 13.6 11.0 1.24 5.4 4.2 16.6 0.479 0.702 0.376
4  4 MichiganState   5 80.8 37.8 13.0 12.6 1.03 8.4 2.4 19.8 0.445 0.783 0.329
5  5       Arizona   4 79.8 35.0 15.8 14.5 1.09 6.0 6.5 13.3 0.542 0.759 0.397
6  6      Kentucky   4 72.8 32.3 12.8 13.5 0.94 7.3 3.5 19.5 0.510 0.663 0.400
%%R
fm = lm(y~x)
res = summary(fm)
res
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5074 -0.5527 -0.2454  0.6705  2.2344 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -10.194804   2.892203  -3.525 0.000893 ***
xPTS         -0.010442   0.025276  -0.413 0.681218    
xREB          0.105048   0.036951   2.843 0.006375 ** 
xAST         -0.060798   0.091102  -0.667 0.507492    
xTO          -0.034545   0.071393  -0.484 0.630513    
xA.T          1.325402   1.110184   1.194 0.237951    
xSTL          0.181015   0.068999   2.623 0.011397 *  
xBLK          0.007185   0.075054   0.096 0.924106    
xPF          -0.031705   0.044469  -0.713 0.479050    
xFG          13.823190   3.981191   3.472 0.001048 ** 
xFT           2.694716   1.118595   2.409 0.019573 *  
xX3P          2.526831   1.754038   1.441 0.155698    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9619 on 52 degrees of freedom
Multiple R-squared:  0.5418, Adjusted R-squared:  0.4448 
F-statistic: 5.589 on 11 and 52 DF,  p-value: 7.889e-06
An alternative specification of regression using data frames is somewhat easier to implement.
%%R
#CREATING DATA FRAMES
ncaa_data_frame = data.frame(y=as.matrix(ncaa[3]),x=as.matrix(ncaa[4:14]))
fm = lm(y~x,data=ncaa_data_frame)
summary(fm)
Call:
lm(formula = y ~ x, data = ncaa_data_frame)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5074 -0.5527 -0.2454  0.6705  2.2344 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -10.194804   2.892203  -3.525 0.000893 ***
xPTS         -0.010442   0.025276  -0.413 0.681218    
xREB          0.105048   0.036951   2.843 0.006375 ** 
xAST         -0.060798   0.091102  -0.667 0.507492    
xTO          -0.034545   0.071393  -0.484 0.630513    
xA.T          1.325402   1.110184   1.194 0.237951    
xSTL          0.181015   0.068999   2.623 0.011397 *  
xBLK          0.007185   0.075054   0.096 0.924106    
xPF          -0.031705   0.044469  -0.713 0.479050    
xFG          13.823190   3.981191   3.472 0.001048 ** 
xFT           2.694716   1.118595   2.409 0.019573 *  
xX3P          2.526831   1.754038   1.441 0.155698    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9619 on 52 degrees of freedom
Multiple R-squared:  0.5418, Adjusted R-squared:  0.4448 
F-statistic: 5.589 on 11 and 52 DF,  p-value: 7.889e-06
In a regression, we estimate the coefficients on each explanatory variable (these are the Estimates in the regression above). In addition, we also estimate the standard deviation of each coefficient estimate, which measures the uncertainty around its value. This is called the standard error of the coefficient. We are interested in making sure that the coefficient value $b$ is not zero in the statistical sense, usually taken to mean that it is at least 2 standard errors away from zero. That is, we want the coefficient value $b$ divided by its standard error $\sigma_b$ to be at least 2 in absolute value. This ratio is called the t-statistic or t-value, shown in the regression above.
The t-statistic is the number of standard errors the coefficient lies away from zero. It implies a p-value, which is the probability of seeing a coefficient this large (in absolute value) if the true coefficient were zero. So, we want the p-values to be small. We see in the above regression [Pr(>|t|)] that the coefficients that are statistically significant have small p-values and large absolute values of t-statistics. Intuitively, when the t-statistic is large (negative or positive), the coefficient is far away from zero, and using the t-distribution (approximately normal in large samples) we can calculate the probability left in the tails. So if the t-statistic is (say) 2.843, there is only 0.006375 probability in the two tails beyond $\pm 2.843$, as the check below confirms.
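As a quick check (a sketch, not from the original text), we can recover the reported p-value for the REB coefficient from its t-statistic and the 52 residual degrees of freedom using the two-sided tail probability of the t-distribution.
%%R
#A sketch: two-sided p-value implied by a t-statistic of 2.843 with 52 df
print(2*pt(-abs(2.843), df=52))   #close to the reported 0.006375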
For a more detailed discussion, see the excellent article in Scientific American (2019).
The linear regression is fit by minimizing the sum of squared errors, but the same concept may also be applied to a nonlinear regression. So we might have:
$$
y_i = f(x_{i1},x_{i2},...,x_{ip}) + \epsilon_i, \quad i=1,2,...,n
$$
which describes a data set with $n$ rows (observations) and $p$ columns (explanatory variables). Note that the error term (residual) is $\epsilon_i$.
The regression will have $(p+1)$ coefficients, i.e., ${\bf b} = \{b_0,b_1,b_2,...,b_p\}$, and ${\bf x}_i = \{x_{i1},x_{i2},...,x_{ip}\}$. The model is fit by minimizing the sum of squared residuals, i.e.,
$$
\min_{\bf b} \sum_{i=1}^n \epsilon_i^2
$$
We define the following sums of squares, where $\hat{y}_i$ denotes the fitted value and $\bar{y}$ the mean of $y$: the total sum of squares $SST = \sum_{i=1}^n (y_i - \bar{y})^2$ with $DFT = n-1$ degrees of freedom; the model (explained) sum of squares $SSM = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$ with $DFM = p$ degrees of freedom; and the sum of squared errors $SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ with $DFE = n-p-1$ degrees of freedom. Note that $SST = SSM + SSE$.
The $R$-squared of the regression is
$$
R^2 = \left( 1 - \frac{SSE}{SST} \right) \quad \in (0,1)
$$
The $F$-statistic in the regression is what tells us if the RHS variables comprise a model that explains the LHS variable sufficiently. Do the RHS variables offer more of an explanation than simply assuming that the mean value of $y$ is the best prediction? The null hypothesis we care about is that all the slope coefficients are zero, i.e., $b_1 = b_2 = \cdots = b_p = 0$.
To test this, the $F$-statistic is computed as the following ratio:
$$
F = \frac{\mbox{Explained variance}}{\mbox{Unexplained variance}} = \frac{SSM/DFM}{SSE/DFE} = \frac{MSM}{MSE}
$$
where $MSM$ is the mean squared model (explained) variation and $MSE$ is the mean squared error.
Now let's relate this to $R^2$. First, we find an approximation for the $R^2$.
$$
R^2 = 1 - \frac{SSE}{SST} \\
= 1 - \frac{SSE/n}{SST/n} \\
\approx 1 - \frac{MSE}{MST} \\
= \frac{MST-MSE}{MST} \\
= \frac{MSM}{MST}
$$
The $R^2$ of a regression that has no RHS variables is zero, and of course $MSM=0$. In such a regression $MST = MSE$. So the expression above becomes:
$$
R^2_{p=0} = \frac{MSM}{MST} = 0
$$
We can also see, with some manipulation, that $R^2$ is related to $F$ (approximately, assuming large $n$):
$$
R^2 + \frac{1}{F+1}=1 \quad \mbox{or} \quad 1+F = \frac{1}{1-R^2}
$$
Check to see that when $R^2=0$, then $F=0$.
We can further check the formulae with a numerical example, by creating some sample data.
%%R
x = matrix(runif(300),100,3)
y = 5 + 4*x[,1] + 3*x[,2] + 2*x[,3] + rnorm(100)
y = as.matrix(y)
res = lm(y~x)
print(summary(res))
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3499 -0.7816  0.1548  0.8725  2.4822 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.4788     0.3996  13.711  < 2e-16 ***
x1            4.2712     0.3748  11.396  < 2e-16 ***
x2            2.3911     0.4029   5.935 4.66e-08 ***
x3            1.5251     0.3949   3.862 0.000204 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.093 on 96 degrees of freedom
Multiple R-squared:  0.6193, Adjusted R-squared:  0.6074 
F-statistic: 52.05 on 3 and 96 DF,  p-value: < 2.2e-16
%%R
e = res$residuals
SSE = sum(e^2)
SST = sum((y-mean(y))^2)
SSM = SST - SSE
print(c(SSE,SSM,SST))
R2 = 1 - SSE/SST
print(R2)
n = dim(x)[1]
p = dim(x)[2]+1
MSE = SSE/(n-p)
MSM = SSM/(p-1)
MST = SST/(n-1)
print(c(n,p,MSE,MSM,MST))
Fstat = MSM/MSE
print(Fstat)
[1] 114.7558 186.6638 301.4196
[1] 0.6192822
[1] 100.000000   4.000000   1.195373  62.221270   3.044643
[1] 52.05176
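As a check (not in the original text) of the approximate identity above, we can reuse SSE, SSM, R2, n, p, and Fstat from the previous cell. The identity $1+F = 1/(1-R^2)$ holds exactly when $F$ is taken as $SSM/SSE$, i.e., before the degrees-of-freedom scaling used in the reported $F$-statistic.
%%R
#A sketch: relate R-squared to the (unscaled) F ratio
print(1/(1-R2))
print(1 + SSM/SSE)
print(1 + Fstat*(p-1)/(n-p))   #undoing the df scaling of the reported F gives the same number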
We can also compare two regressions, say one with five RHS variables against one that uses only three of those five, to see whether the two additional variables have any extra explanatory value. The resulting test statistic, a ratio of the mean squared improvement in fit to the mean squared error, is also $F$-distributed and may be tested for being large enough, as the sketch below shows.
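Here is a hedged sketch (not from the original text) of such a comparison on the NCAA data, using R's anova function to perform the partial F-test of a five-variable model against a nested three-variable model.
%%R
#A sketch: partial F-test for two nested regressions on the NCAA data
small = lm(GMS ~ REB + STL + FT, data=ncaa)             #three RHS variables
large = lm(GMS ~ PTS + REB + STL + FG + FT, data=ncaa)  #five RHS variables (a superset)
print(anova(small, large))                              #F-test of the two extra variables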
Note that if the residuals $\epsilon$ are assumed to be normally distributed, then the squared residuals are distributed chi-square ($\chi^2$). Further, the sum of the residuals is normally distributed and the sum of squared residuals is distributed $\chi^2$. And finally, the ratio of two $\chi^2$ variables is $F$-distributed, which is why we call it the $F$-statistic: it is a ratio of two sums of squared errors.
Underlying the analyses of the regression model above is an assumption that the error term $\epsilon$ is independent of the $x$ variables. This assumption ensures that the regression coefficient $\beta$ is unbiased. To see this in the simplest way, consider the univariate regression
$$
y = \beta x + \epsilon
$$
We have seen earlier that the coefficient $\beta$ is given by
$$
\frac{Cov(x,y)}{Var(x)} = \frac{Cov(x,\beta x + \epsilon)}{Cov(x,x)} = \beta + \frac{Cov(x,\epsilon)}{Cov(x,x)}
$$
This little piece of statistical math shows that the estimated coefficient is biased if there is correlation between $x$ and $\epsilon$.
One way in which the coefficient is biased is if there is a missing variable in the regression that has an effect on both $x$ and $y$, which then injects correlation between $x$ and $\epsilon$. If there is a missing variable that impacts $y$ and not $x$, then it is just fine, after all, every regression has missing variables, else there would be no residual (error) term. Hopefully, there is some idea of how the missing variable impacts both $x$ and $y$ (direction, and if possible sign). Then at least one might have a sense of the direction of bias in the regression coefficient.
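A hedged sketch (not from the original text): simulate a missing variable $z$ that drives both $x$ and $y$. Omitting $z$ biases the coefficient on $x$ away from its true value of 1; including it recovers the truth.
%%R
#A sketch: omitted-variable bias in simulated data
set.seed(1)
nobs = 10000
z = rnorm(nobs)
x = 0.8*z + rnorm(nobs)          #the missing variable z affects x ...
y = 1*x + 2*z + rnorm(nobs)      #... and also affects y (true coefficient on x is 1)
print(coef(lm(y ~ x)))           #biased upward: x picks up part of the effect of z
print(coef(lm(y ~ x + z)))       #including z recovers a coefficient near 1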
Simple linear regression assumes that the standard error of the residuals is the same for all observations. Many regressions suffer from the failure of this condition. The word for this is "heteroskedastic" errors. "Hetero" means different, and "skedastic" refers to the dispersion (variance) of the errors.
We can first test for the presence of heteroskedasticity using the standard Breusch-Pagan test available in R. This resides in the lmtest package, which must be loaded before running the test.
%%R
ncaa = read.table("DSTMAA_data/ncaa.txt",header=TRUE)
y = as.matrix(ncaa[3])
x = as.matrix(ncaa[4:14])
result = lm(y~x)
library(lmtest)
bptest(result)
    studentized Breusch-Pagan test

data:  result
BP = 15.538, df = 11, p-value = 0.1592
We can see that there is very little evidence of heteroskedasticity in the standard errors, as the $p$-value is not small. However, let's go ahead and correct the t-statistics for heteroskedasticity anyway, using the hccm function, where hccm stands for heteroskedasticity-corrected covariance matrix.
%%R
wuns = matrix(1,64,1)
z = cbind(wuns,x)
b = solve(t(z) %*% z) %*% (t(z) %*% y)
result = lm(y~x)
library(car)
vb = hccm(result)
stdb = sqrt(diag(vb))
tstats = b/stdb
print(tstats)
            GMS
    -2.68006069
PTS -0.38212818
REB  2.38342637
AST -0.40848721
TO  -0.28709450
A.T  0.65632053
STL  2.13627108
BLK  0.09548606
PF  -0.68036944
FG   3.52193532
FT   2.35677255
X3P  1.23897636
Compare these to the t-statistics in the original model
%%R
summary(result)
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5074 -0.5527 -0.2454  0.6705  2.2344 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -10.194804   2.892203  -3.525 0.000893 ***
xPTS         -0.010442   0.025276  -0.413 0.681218    
xREB          0.105048   0.036951   2.843 0.006375 ** 
xAST         -0.060798   0.091102  -0.667 0.507492    
xTO          -0.034545   0.071393  -0.484 0.630513    
xA.T          1.325402   1.110184   1.194 0.237951    
xSTL          0.181015   0.068999   2.623 0.011397 *  
xBLK          0.007185   0.075054   0.096 0.924106    
xPF          -0.031705   0.044469  -0.713 0.479050    
xFG          13.823190   3.981191   3.472 0.001048 ** 
xFT           2.694716   1.118595   2.409 0.019573 *  
xX3P          2.526831   1.754038   1.441 0.155698    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9619 on 52 degrees of freedom
Multiple R-squared:  0.5418, Adjusted R-squared:  0.4448 
F-statistic: 5.589 on 11 and 52 DF,  p-value: 7.889e-06
It is apparent that when corrected for heteroskedasticity, the t-statistics in the regression are mostly lower, weakening the statistical significance of some of the coefficients.
When data is autocorrelated, i.e., has dependence in time, not accounting for this issue results in unjustifiably high statistical significance (in terms of inflated t-statistics). Intuitively, this is because observations are treated as independent when they are actually correlated in time, so the effective number of independent observations is smaller than the sample size. The sketch below illustrates the problem.
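Here is a hedged sketch (not from the original text) of the problem: regress one simulated AR(1) series on another, independent, AR(1) series. Because both series are highly persistent, naive OLS t-statistics often look "significant" even though the true relationship is zero; HAC (Newey-West) standard errors from the sandwich package are one standard correction.
%%R
#A sketch: inflated t-statistics when regressing one autocorrelated series on another
set.seed(42)
nsim = 1000
xa = arima.sim(model=list(ar=0.95), n=nsim)
ya = arima.sim(model=list(ar=0.95), n=nsim)   #independent of xa by construction
res_ac = lm(ya ~ xa)
print(summary(res_ac)$coefficients)           #naive t-statistics, often spuriously large
#One correction (assumes the sandwich and lmtest packages are installed):
#library(sandwich); library(lmtest)
#coeftest(res_ac, vcov = NeweyWest(res_ac))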
Consider a finance application. In efficient markets, the correlation of stock returns from one period to the next should be close to zero. We use the returns on Google stock as an example. First, read in the data.
%%R
data = read.csv("DSTMAA_data/goog.csv",header=TRUE)
head(data)
        Date     Open     High      Low    Close Adj.Close   Volume
1 2004-08-19 49.67690 51.69378 47.66995 49.84580  49.84580 44994500
2 2004-08-20 50.17863 54.18756 49.92529 53.80505  53.80505 23005800
3 2004-08-23 55.01717 56.37334 54.17266 54.34653  54.34653 18393200
4 2004-08-24 55.26058 55.43942 51.45036 52.09616  52.09616 15361800
5 2004-08-25 52.14087 53.65105 51.60436 52.65751  52.65751  9257400
6 2004-08-26 52.13591 53.62621 51.99184 53.60634  53.60634  7148200
Next, create the returns time series.
%%R
n = length(data$Close)
stkp = rev(data$Adj.Close)
rets = as.matrix(log(stkp[2:n]/stkp[1:(n-1)]))
n = length(rets)
plot(rets,type="l",col="blue")
print(n)
[1] 3606