%pylab inline
import os
from ipypublish import nb_setup
%load_ext rpy2.ipython
Populating the interactive namespace from numpy and matplotlib
#Run if on Windows machine
%load_ext RWinOut
The rpy2.ipython extension is already loaded. To reload it, use: %reload_ext rpy2.ipython
To get started, we need to grab some data. Go to Yahoo! Finance and download some historical data in a spreadsheet or csv file. Read the file into R as follows:
%%R
data = read.csv("DSTMAA_data/goog.csv", header=TRUE)
print(head(data))
m = length(data)
n = length(data[,1])
print(c('Number of columns:',m,' Number of rows:',n))
Date Open High Low Close Adj.Close Volume 1 2004-08-19 49.67690 51.69378 47.66995 49.84580 49.84580 44994500 2 2004-08-20 50.17863 54.18756 49.92529 53.80505 53.80505 23005800 3 2004-08-23 55.01717 56.37334 54.17266 54.34653 54.34653 18393200 4 2004-08-24 55.26058 55.43942 51.45036 52.09616 52.09616 15361800 5 2004-08-25 52.14087 53.65105 51.60436 52.65751 52.65751 9257400 6 2004-08-26 52.13591 53.62621 51.99184 53.60634 53.60634 7148200 [1] "Number of columns:" "7" " Number of rows:" [4] "3607"
%%R
stkp = as.matrix(data[,6])
plot(stkp,type='l',col="blue")
grid(lwd=2)
We can use pipes with the "magrittr" package to do the same thing as follows:
%%R
library(magrittr)
data[,6] %>% plot(type='l',col='blue') %>% grid(lwd=2)
In this chapter, we will revisit some of the topics considered in the previous chapters, and demonstrate alternate programming approaches in R. There are some extremely powerful packages in R that allow sql-like operations on data sets, making for advanced data handling. One of the most time-consuming activities in data analytics is cleaning and arranging data, and here we will show examples of many tools available for that purpose. Let's assume we have a good working knowledge of R by now. Here we revisit some more packages, functions, and data structures.
We have seen the package already in the previous chapter. Now, we proceed to use it to get some initial data.
%%R
library(quantmod)
tickers = c("AAPL","MSFT","IBM","CSCO","C")
getSymbols(tickers)
getSymbols("^GSPC")
tickers = c(tickers,"GSPC")
%%R
print(head(AAPL))
length(tickers)
AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted 2007-01-03 12.32714 12.36857 11.70000 11.97143 309579900 10.41635 2007-01-04 12.00714 12.27857 11.97429 12.23714 211815100 10.64755 2007-01-05 12.25286 12.31428 12.05714 12.15000 208685400 10.57173 2007-01-08 12.28000 12.36143 12.18286 12.21000 199276700 10.62393 2007-01-09 12.35000 13.28286 12.16429 13.22429 837324600 11.50647 2007-01-10 13.53571 13.97143 13.35000 13.85714 738220000 12.05711 [1] 6
Are they all the same? Here we need to extract the ticker symbol without quotes.
Now we can examine the number of observations in each ticker.
%%R
for (t in tickers) {
a = get(noquote(t))[,1]
print(c(t,length(a)))
}
[1] "AAPL" "3268" [1] "MSFT" "3268" [1] "IBM" "3268" [1] "CSCO" "3268" [1] "C" "3268" [1] "GSPC" "3268"
We see that they are all the same.
First, we create a list of data.frames. This will also illustrate how useful lists are because we store data.frames in lists. Notice how we also add a new column to each data.frame so that the dates column may later be used as an index to join the individual stock data.frames into one composite data.frame.
%%R
df = list()
j = 0
for (t in tickers) {
j = j + 1
a = noquote(t)
b = data.frame(get(a)[,6])
b$dt = row.names(b)
df[[j]] = b
}
Data frames a very much like spreadsheets or tables, but they are also a lot like databases. Some sort of happy medium. If you want to join two dataframes, it is the same a joining two databases. For this R has the merge function. It is best illustrated with an example.
Second, we combine all the stocks adjusted closing prices into a single data.frame using a join, excluding all dates for which all stocks do not have data. The main function used here is merge which could be an intersect join or a union join. The default is the intersect join.
%%R
stock_table = df[[1]]
for (j in 2:length(df)) {
stock_table = merge(stock_table,df[[j]],by="dt")
}
print(dim(stock_table))
class(stock_table)
[1] 3268 7 [1] "data.frame"
Note that the stock table contains the number of rows of the stock index, which had fewer observations than the individual stocks. So since this is an intersect join, some rows have been dropped.
Plot all stocks in a single data.frame using ggplot2, which is more advanced than the basic plot function. We use the basic plot function first.
%%R
par(mfrow=c(3,2)) #Set the plot area to six plots
for (j in 1:length(tickers)) {
plot(as.Date(stock_table[,1]),stock_table[,j+1], type="l",
ylab=tickers[j],xlab="date")
}
par(mfrow=c(1,1)) #Set the plot figure back to a single plot
These are continuously compounded returns, or log returns.
%%R
n = length(stock_table[,1])
rets = stock_table[,2:(length(tickers)+1)]
for (j in 1:length(tickers)) {
rets[2:n,j] = diff(log(rets[,j]))
}
rets$dt = stock_table$dt
rets = rets[2:n,] #lose the first row when converting to returns
print(head(rets))
class(rets)
AAPL.Adjusted MSFT.Adjusted IBM.Adjusted CSCO.Adjusted C.Adjusted 2 0.021952752 -0.001675780 0.010635618 0.0259846938 -0.0034450807 3 -0.007146552 -0.005719328 -0.009094258 0.0003512162 -0.0052807781 4 0.004926396 0.009736840 0.015077467 0.0056044941 0.0050991365 5 0.079799664 0.001001738 0.011760737 -0.0056044941 -0.0087575193 6 0.046745721 -0.010063596 -0.011861730 0.0073491998 -0.0080957555 7 -0.012448531 0.034463050 -0.002429803 0.0003488354 0.0007391458 GSPC.Adjusted dt 2 0.0012275323 2007-01-04 3 -0.0061031679 2007-01-05 4 0.0022178572 2007-01-08 5 -0.0005168099 2007-01-09 6 0.0019384723 2007-01-10 7 0.0063198611 2007-01-11 [1] "data.frame"
The data.frame of returns can be used to present the descriptive statistics of returns.
%%R
summary(rets)
AAPL.Adjusted MSFT.Adjusted IBM.Adjusted Min. :-0.197470 Min. :-0.1245786 Min. :-0.0864189 1st Qu.:-0.007762 1st Qu.:-0.0073024 1st Qu.:-0.0062800 Median : 0.001008 Median : 0.0004589 Median : 0.0004686 Mean : 0.001012 Mean : 0.0005989 Mean : 0.0002113 3rd Qu.: 0.011061 3rd Qu.: 0.0086850 3rd Qu.: 0.0073029 Max. : 0.130194 Max. : 0.1706256 Max. : 0.1089891 CSCO.Adjusted C.Adjusted GSPC.Adjusted Min. :-0.1768644 Min. :-0.4946963 Min. :-0.0946951 1st Qu.:-0.0071493 1st Qu.:-0.0106260 1st Qu.:-0.0038841 Median : 0.0005723 Median : 0.0000000 Median : 0.0006632 Mean : 0.0002427 Mean :-0.0005693 Mean : 0.0002517 3rd Qu.: 0.0086450 3rd Qu.: 0.0104800 3rd Qu.: 0.0055557 Max. : 0.1479931 Max. : 0.4563160 Max. : 0.1095720 dt Length:3267 Class :character Mode :character
Now we compute the correlation matrix of returns.
%%R
cor(rets[,1:length(tickers)])
AAPL.Adjusted MSFT.Adjusted IBM.Adjusted CSCO.Adjusted C.Adjusted AAPL.Adjusted 1.0000000 0.4890636 0.4598145 0.4922452 0.3734263 MSFT.Adjusted 0.4890636 1.0000000 0.5344394 0.5937782 0.4110750 IBM.Adjusted 0.4598145 0.5344394 1.0000000 0.5554172 0.4202308 CSCO.Adjusted 0.4922452 0.5937782 0.5554172 1.0000000 0.4563835 C.Adjusted 0.3734263 0.4110750 0.4202308 0.4563835 1.0000000 GSPC.Adjusted 0.6196366 0.7141322 0.6854599 0.7186963 0.6660740 GSPC.Adjusted AAPL.Adjusted 0.6196366 MSFT.Adjusted 0.7141322 IBM.Adjusted 0.6854599 CSCO.Adjusted 0.7186963 C.Adjusted 0.6660740 GSPC.Adjusted 1.0000000
Show the correlogram for the six return series. This is a useful way to visualize the relationship between all variables in the data set.
%%R
library(corrgram)
corrgram(rets[,1:length(tickers)], order=TRUE, lower.panel=panel.ellipse,
upper.panel=panel.pts, text.panel=panel.txt)
To see the relation between the stocks and the index, run a regression of each of the five stocks on the index returns.
%%R
betas = NULL
for (j in 1:(length(tickers)-1)) {
res = lm(rets[,j]~rets[,6])
betas[j] = res$coefficients[2]
}
print(betas)
[1] 0.9996354 0.9899433 0.7809676 1.0658362 1.8969453
The $\beta$s indicate the level of systematic risk for each stock. We notice that all the betas are positive, and highly significant. But they are not close to unity, in fact all are lower. This is evidence of misspecification that may arise from the fact that the stocks are in the tech sector and better explanatory power would come from an index that was more relevant to the technology sector.
In order to assess whether in the cross-section, there is a relation between average returns and the systematic risk or $\beta$ of a stock, run a regression of the five average returns on the five betas from the regression.
%%R
betas = matrix(betas)
avgrets = colMeans(rets[,1:(length(tickers)-1)])
res = lm(avgrets~betas)
print(summary(res))
plot(betas,avgrets)
abline(res,col="red")
Call: lm(formula = avgrets ~ betas) Residuals: AAPL.Adjusted MSFT.Adjusted IBM.Adjusted CSCO.Adjusted C.Adjusted 0.0005626 0.0001394 -0.0004620 -0.0001392 -0.0001008 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.0014723 0.0006141 2.397 0.0961 . betas -0.0010231 0.0005074 -2.016 0.1372 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.0004393 on 3 degrees of freedom Multiple R-squared: 0.5754, Adjusted R-squared: 0.4338 F-statistic: 4.065 on 1 and 3 DF, p-value: 0.1372
We see indeed, that there is an unexpected negative relation between $\beta$ and the return levels. This may be on account of the particular small sample we used for illustration here, however, we note that the CAPM (Capital Asset Pricing Model) dictate that we see a positive relation between stock returns and a firm's systematic risk level.
Suppose we have a list of ticker symbols and we want to generate a dataframe with more details on these tickers, especially their sector and the full name of the company. Let's look at the input list of tickers. Suppose I have them in a file called tickers.csv where the delimiter is the colon sign. We read this in as follows.
%%R
tickers = read.table("DSTMAA_data/tickers.csv",header=FALSE,sep=":")
The line of code reads in the file and this gives us two columns of data. We can look at the top of the file (first 6 rows).
%%R
head(tickers)
V1 V2 1 NasdaqGS ACOR 2 NasdaqGS AKAM 3 NYSE ARE 4 NasdaqGS AMZN 5 NasdaqGS AAPL 6 NasdaqGS AREX
Note that the ticker symbols relate to stocks from different exchanges, in this case Nasdaq and NYSE. The file may also contain AMEX listed stocks.
The second line of code below counts the number of input tickers, and the third line of code renames the columns of the dataframe. We need to call the column of ticker symbols as ``Symbol'' because we will see that the dataframe with which we will merge this one also has a column with the same name. This column becomes the index on which the two dataframes are matched and joined.
%%R
n = dim(tickers)[1]
print(n)
names(tickers) = c("Exchange","Symbol")
head(tickers)
[1] 98 Exchange Symbol 1 NasdaqGS ACOR 2 NasdaqGS AKAM 3 NYSE ARE 4 NasdaqGS AMZN 5 NasdaqGS AAPL 6 NasdaqGS AREX
Next, we read in lists of all stocks on Nasdaq, NYSE, and AMEX as follows:
%%R
nasdaq_names = stockSymbols(exchange="NASDAQ")
nyse_names = stockSymbols(exchange="NYSE")
amex_names = stockSymbols(exchange="AMEX")
#If it does not work use the file below
#load("DSTMAA_data/stock_exchange.Rdata")
/Users/sanjivda/anaconda3/lib/python3.7/site-packages/rpy2/rinterface/__init__.py:146: RRuntimeWarning: Fetching NASDAQ symbols... warnings.warn(x, RRuntimeWarning)
Error in names(x) <- value : 'names' attribute [8] must be the same length as the vector [1]
/Users/sanjivda/anaconda3/lib/python3.7/site-packages/rpy2/rinterface/__init__.py:146: RRuntimeWarning: Error in names(x) <- value : 'names' attribute [8] must be the same length as the vector [1] warnings.warn(x, RRuntimeWarning)
If this does not work, use the following URL download of CSV files
Then read in these csv files, after renaming them to nyse_names.csv, amex_names.csv, nasdaq_names.csv
%%R
nyse_names = read.csv("DSTMAA_data/nyse_names.csv", header=TRUE)
amex_names = read.csv("DSTMAA_data/amex_names.csv", header=TRUE)
nasdaq_names = read.csv("DSTMAA_data/nasdaq_names.csv", header=TRUE)
We can look at the top of the Nasdaq file.
%%R
head(nasdaq_names)
Symbol Name LastSale MarketCap IPOyear 1 TXG 10x Genomics, Inc. 79.44 $7.64B 2019 2 YI 111, Inc. 5.16 $425.43M 2018 3 PIH 1347 Property Insurance Holdings, Inc. 5.47 $33.37M 2014 4 PIHPP 1347 Property Insurance Holdings, Inc. 26.51 $18.56M n/a 5 TURN 180 Degree Capital Corp. 2.1399 $66.6M n/a 6 FLWS 1-800 FLOWERS.COM, Inc. 14.22 $918.48M 1999 Sector industry 1 Capital Goods Biotechnology: Laboratory Analytical Instruments 2 Health Care Medical/Nursing Services 3 Finance Property-Casualty Insurers 4 Finance Property-Casualty Insurers 5 Finance Finance/Investors Services 6 Consumer Services Other Specialty Stores Summary.Quote X 1 https://old.nasdaq.com/symbol/txg NA 2 https://old.nasdaq.com/symbol/yi NA 3 https://old.nasdaq.com/symbol/pih NA 4 https://old.nasdaq.com/symbol/pihpp NA 5 https://old.nasdaq.com/symbol/turn NA 6 https://old.nasdaq.com/symbol/flws NA
Next we merge all three dataframes for each of the exchanges into one data frame.
%%R
co_names = rbind(nyse_names,nasdaq_names,amex_names)
To see how many rows are there in this merged file, we check dimensions.
%%R
dim(co_names)
[1] 6996 9
Finally, use the merge function to combine the ticker symbols file with the exchanges data to extend the tickers file to include the information from the exchanges file.
%%R
result = merge(tickers,co_names,by="Symbol")
head(result)
Symbol Exchange Name LastSale MarketCap 1 AAPL NasdaqGS Apple Inc. 284.27 $1263.09B 2 ACOR NasdaqGS Acorda Therapeutics, Inc. 1.92 $92.22M 3 AKAM NasdaqGS Akamai Technologies, Inc. 85.73 $13.85B 4 AMZN NasdaqGS Amazon.com, Inc. 1789.21 $887.09B 5 ARE NYSE Alexandria Real Estate Equities, Inc. 160.1 $18.44B 6 BLK NYSE BlackRock, Inc. 499.64 $77.54B IPOyear Sector 1 1980 Technology 2 2006 Health Care 3 1999 Miscellaneous 4 1997 Consumer Services 5 n/a Consumer Services 6 1999 Finance industry 1 Computer Manufacturing 2 Biotechnology: Biological Products (No Diagnostic Substances) 3 Business Services 4 Catalog/Specialty Distribution 5 Real Estate Investment Trusts 6 Investment Bankers/Brokers/Service Summary.Quote X 1 https://old.nasdaq.com/symbol/aapl NA 2 https://old.nasdaq.com/symbol/acor NA 3 https://old.nasdaq.com/symbol/akam NA 4 https://old.nasdaq.com/symbol/amzn NA 5 https://old.nasdaq.com/symbol/are NA 6 https://old.nasdaq.com/symbol/blk NA
An alternate package to download stock tickers en masse is BatchGetSymbols.
The Data Table package is a very good way to examine tabular data through an R-driven user interface.
%%R
library(DT)
datatable(co_names, options = list(pageLength = 25),filter="top")
Now suppose we want to find the CEOs of these 98 companies. There is no one file with compay CEO listings freely available for download. However, sites like Google Finance have a page for each stock and mention the CEOs name on the page. By writing R code to scrape the data off these pages one by one, we can extract these CEO names and augment the tickers dataframe. The code for this is simple in R.
%%R
library(stringr)
#READ IN THE LIST OF TICKERS
tickers = read.table("DSTMAA_data/tickers.csv",header=FALSE,sep=":")
n = dim(tickers)[1]
names(tickers) = c("Exchange","Symbol")
tickers$ceo = NA
#PULL CEO NAMES FROM YAHOO/GOOGLE FINANCE (take random 10 firms)
library(rvest)
library(magrittr)
for (j in 1:5) {
url = paste("https://finance.yahoo.com/quote/",tickers[j,2],"/profile?p=",tickers[j,2],sep="")
#url = paste("https://finance.google.com/finance?q=",tickers[j,2],sep="")
#text = readLines(url)
#idx = grep("Chief Executive",text)
#if (length(idx)>0) {
# tickers[j,3] = str_split(text[idx-2],">")[[1]][2]
#}
#else {
# tickers[j,3] = NA
#}
doc = read_html(url)
tabel = doc %>% html_nodes("table") %>% html_table()
if (length(tabel)>0) {
tickers[j,3] = tabel[[1]]$Name[1]
}
else {
tickers[j,3] = NA
}
print(tickers[j,])
}
#WRITE CEO_NAMES TO CSV
write.table(tickers,file="DSTMAA_data/ceo_names.csv",
row.names=FALSE,sep=",")
/Users/sanjivda/anaconda3/lib/python3.7/site-packages/rpy2/rinterface/__init__.py:146: RRuntimeWarning: Loading required package: xml2 warnings.warn(x, RRuntimeWarning) /Users/sanjivda/anaconda3/lib/python3.7/site-packages/rpy2/rinterface/__init__.py:146: RRuntimeWarning: Registered S3 method overwritten by 'rvest': method from read_xml.response xml2 warnings.warn(x, RRuntimeWarning)
Exchange Symbol ceo 1 NasdaqGS ACOR Dr. Ron Cohen Exchange Symbol ceo 2 NasdaqGS AKAM Dr. F. Thomson Leighton Exchange Symbol ceo 3 NYSE ARE Mr. Joel S. Marcus CPA, J.D. Exchange Symbol ceo 4 NasdaqGS AMZN Mr. Jeffrey P. Bezos Exchange Symbol ceo 5 NasdaqGS AAPL Mr. Timothy D. Cook
The code uses the stringr package so that string handling is simplified. Scraping is done with the rvest package. The lines of code using rvest are above. After extracting the page, we search for the line in which the words "Chief Executive Officer" pr "CEO" show up, and we note that the name of the CEO appears in a table in the html page. A sample web page for Apple Inc is shown here:
nb_setup.images_hconcat(["DSTMAA_images/yahoofinance_AAPL.png"], width=1000)