MoRe: Data Handling and Other Useful Things {#MoreDataHandling}

Getting started

To get started, we need to grab some data. Go to Yahoo! Finance and download some historical data in a spreadsheet or csv file. Read the file into R as follows:
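A minimal sketch, assuming the download was saved in the working directory under the hypothetical name goog.csv:

```r
# Yahoo! Finance CSVs come with a header row
data = read.csv("goog.csv", header = TRUE)
head(data)
```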

We can use pipes with the "magrittr" package to do the same thing as follows:
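For example, with the same hypothetical file:

```r
library(magrittr)
# Pipe the file name into read.csv, then pipe the result into head
data = "goog.csv" %>% read.csv(header = TRUE)
data %>% head
```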

Data Extraction of stocks using the quantmod package

In this chapter, we will revisit some of the topics considered in the previous chapters, and demonstrate alternate programming approaches in R. There are some extremely powerful packages in R that allow SQL-like operations on data sets, making for advanced data handling. One of the most time-consuming activities in data analytics is cleaning and arranging data, and here we will show examples of many tools available for that purpose. Let's assume we have a good working knowledge of R by now. Here we revisit some more packages, functions, and data structures.

We have already seen the quantmod package in the previous chapter. Now we proceed to use it to get some initial data.
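A minimal sketch, assuming five tech stocks plus the S&P 500 index (the ticker list is purely illustrative):

```r
library(quantmod)
# Download daily price histories; "^GSPC" is the S&P 500 index
tickers = c("AAPL", "MSFT", "IBM", "CSCO", "INTC", "^GSPC")
getSymbols(tickers)
```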

Are the series all of the same length? Here we need to extract each ticker symbol without quotes, so that it matches the name of the object that getSymbols created.

Now we can examine the number of observations in each ticker.
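One way to do this with the illustrative tickers above (getSymbols stores "^GSPC" as an object named GSPC, so we strip the caret):

```r
# Normalize symbols to match the object names created by getSymbols
tickers = gsub("\\^", "", tickers)
# Count the rows (observations) in each ticker's xts object
for (t in tickers) {
  print(c(t, nrow(get(t))))
}
```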

We see that they are all the same.

Convert adjusted closing prices of all stocks into individual data.frames.

First, we create a list of data.frames. This also illustrates how useful lists are, since we can store data.frames inside a list. Notice how we add a new column to each data.frame so that the dates column may later be used as an index to join the individual stock data.frames into one composite data.frame.
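A sketch under the assumptions above (column 6 of each quantmod object holds the adjusted close):

```r
# Build a named list of data.frames, one per ticker
df_list = list()
for (t in tickers) {
  x = get(t)                                          # the xts object for this ticker
  df = data.frame(Date = index(x), coredata(x)[, 6])  # dates plus adjusted close
  names(df)[2] = t
  df_list[[t]] = df
}
```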

Using the merge function

Data frames are very much like spreadsheets or tables, but they are also a lot like databases, a sort of happy medium. If you want to join two dataframes, it is the same as joining two database tables. For this, R has the merge function. It is best illustrated with an example.
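A toy example with two small data.frames sharing a key column:

```r
a = data.frame(key = c(1, 2, 3), x = c("a", "b", "c"))
b = data.frame(key = c(2, 3, 4), y = c("B", "C", "D"))
merge(a, b)              # intersect (inner) join on the common column "key"
merge(a, b, all = TRUE)  # union (outer) join keeps unmatched rows, with NAs
```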

Make a single data frame

Second, we combine all the stocks' adjusted closing prices into a single data.frame using a join, excluding all dates for which all stocks do not have data. The main function used here is merge, which can perform either an intersect join or a union join. The default is the intersect join.
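One way to do this, folding merge over the list of data.frames built above:

```r
# Successively inner-join the individual data.frames on the Date column
stocks = Reduce(function(a, b) merge(a, b, by = "Date"), df_list)
dim(stocks)
```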

Note that the merged table contains the number of rows of the stock index, which had fewer observations than the individual stocks. Since this is an intersect join, some rows have been dropped.

Plot the stock series

We can plot all stocks in a single data.frame using ggplot2, which is more advanced than the basic plot function, but here we use the basic plot function first.
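A basic-graphics sketch, assuming the merged stocks data.frame above:

```r
par(mfrow = c(2, 3))   # one panel per series
for (t in tickers) {
  plot(stocks$Date, stocks[[t]], type = "l", xlab = "Date", ylab = t)
}
par(mfrow = c(1, 1))
```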

Convert the data into returns

These are continuously compounded returns, or log returns.
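They can be computed as $r_t = \ln P_t - \ln P_{t-1}$, for example:

```r
prices = as.matrix(stocks[, -1])                # drop the Date column
rets = data.frame(apply(log(prices), 2, diff))  # column-wise differences of logs
head(rets)
```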

Descriptive statistics

The data.frame of returns can be used to present the descriptive statistics of returns.
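For instance:

```r
summary(rets)       # location and spread of each return series
apply(rets, 2, sd)  # standard deviations
```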

Correlation matrix

Now we compute the correlation matrix of returns.
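This is a one-liner:

```r
cor(rets)   # pairwise correlations of the six return series
```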

Correlogram

Show the correlogram for the six return series. This is a useful way to visualize the relationship between all variables in the data set.
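A sketch using the corrgram package (one of several options for drawing correlograms):

```r
library(corrgram)
corrgram(rets, lower.panel = panel.shade, upper.panel = panel.pie)
```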

Market regression

To see the relation between the stocks and the index, run a regression of each of the five stocks on the index returns.
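A sketch, looping over the five stocks and regressing each on the index column (named GSPC under the assumptions above):

```r
betas = NULL
for (t in head(tickers, -1)) {     # the five stocks, excluding the index
  res = lm(rets[[t]] ~ rets$GSPC)
  print(summary(res))
  betas = c(betas, coef(res)[2])   # the slope coefficient is the stock's beta
}
print(betas)
```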

The $\beta$s indicate the level of systematic risk for each stock. We notice that all the betas are positive and highly significant, but none is close to unity; in fact, all are lower. This is evidence of misspecification, which may arise from the fact that the stocks are in the tech sector; better explanatory power would come from an index more relevant to the technology sector.

Return versus systematic risk

In order to assess whether, in the cross-section, there is a relation between average returns and the systematic risk or $\beta$ of a stock, run a regression of the five average returns on the five betas from the regressions above.
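Continuing with the objects from the previous step:

```r
avg_rets = colMeans(rets[, head(tickers, -1)])  # mean return of each stock
summary(lm(avg_rets ~ betas))                   # cross-sectional regression
```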

We see, indeed, that there is an unexpected negative relation between $\beta$ and the return levels. This may be on account of the particular small sample we used for illustration here; however, we note that the CAPM (Capital Asset Pricing Model) dictates that we should see a positive relation between stock returns and a firm's systematic risk level.

Extracting online corporate data

Suppose we have a list of ticker symbols and we want to generate a dataframe with more details on these tickers, especially their sector and the full name of the company. Let's look at the input list of tickers. Suppose I have them in a file called tickers.csv where the delimiter is the colon sign. We read this in as follows.
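A sketch, assuming tickers.csv sits in the working directory:

```r
tickers = read.csv("tickers.csv", header = FALSE, sep = ":")
```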

This line of code reads in the file and gives us two columns of data. We can look at the top of the file (first 6 rows).
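For example:

```r
head(tickers)
```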

Note that the ticker symbols relate to stocks from different exchanges, in this case Nasdaq and NYSE. The file may also contain AMEX listed stocks.

The next lines of code count the number of input tickers and rename the columns of the dataframe. We need to call the column of ticker symbols ``Symbol'' because we will see that the dataframe with which we will merge this one also has a column with the same name. This column becomes the index on which the two dataframes are matched and joined.
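For example (the column names here are assumptions):

```r
n = dim(tickers)[1]                        # number of input tickers
print(n)
names(tickers) = c("Exchange", "Symbol")   # rename the two columns
```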

Get all stock symbols from exchanges

Next, we read in lists of all stocks on Nasdaq, NYSE, and AMEX as follows:
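One approach uses the stockSymbols function from the TTR package (which loads along with quantmod):

```r
library(TTR)
nasdaq_names = stockSymbols(exchange = "NASDAQ")
nyse_names   = stockSymbols(exchange = "NYSE")
amex_names   = stockSymbols(exchange = "AMEX")
```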

If this does not work, use a direct URL download of the CSV files.

Then read in these csv files, after renaming them to nyse_names.csv, amex_names.csv, nasdaq_names.csv
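Assuming the renamed files are in the working directory:

```r
nyse_names   = read.csv("nyse_names.csv", header = TRUE)
amex_names   = read.csv("amex_names.csv", header = TRUE)
nasdaq_names = read.csv("nasdaq_names.csv", header = TRUE)
```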

We can look at the top of the Nasdaq file.
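For example:

```r
head(nasdaq_names)
```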

Next we merge all three dataframes for each of the exchanges into one data frame.
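Since the three tables share the same columns, they can be stacked with rbind:

```r
co_names = rbind(nyse_names, amex_names, nasdaq_names)
```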

To see how many rows there are in this merged file, we check its dimensions.
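A quick check:

```r
dim(co_names)
```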

Finally, use the merge function to combine the ticker symbols file with the exchanges data to extend the tickers file to include the information from the exchanges file.
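A sketch of the join on the shared Symbol column:

```r
result = merge(tickers, co_names, by = "Symbol")
head(result)
```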

An alternate package to download stock tickers en masse is BatchGetSymbols.

Using the DT package

The DT package is a very good way to examine tabular data through an R-driven user interface.
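A minimal sketch, assuming the merged result dataframe from above:

```r
library(DT)
datatable(result)   # interactive, searchable table in the viewer/browser
```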

Web scraping

Now suppose we want to find the CEOs of these 98 companies. There is no one file with company CEO listings freely available for download. However, sites like Google Finance have a page for each stock and mention the CEO's name on the page. By writing R code to scrape the data off these pages one by one, we can extract these CEO names and augment the tickers dataframe. The code for this is simple in R.
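A hedged sketch of the scraping loop; the URL pattern and page structure are assumptions for illustration only, and the page layout will have changed since this was written:

```r
library(stringr)
library(rvest)
ceo_list = NULL
for (s in result$Symbol) {
  url = paste0("https://www.google.com/finance?q=", s)  # hypothetical URL form
  page = read_html(url)
  tabs = page %>% html_nodes("table") %>% html_text()   # text of each table
  idx = which(str_detect(tabs, "Chief Executive Officer|CEO"))[1]
  ceo_list = c(ceo_list, tabs[idx])   # crude: keep the table mentioning the CEO
}
```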

The code uses the stringr package so that string handling is simplified. Scraping is done with the rvest package; the lines of code using rvest are above. After extracting the page, we search for the line in which the words "Chief Executive Officer" or "CEO" show up, and we note that the name of the CEO appears in a table in the html page. A sample web page for Apple Inc is shown here:

The final dataframe with CEO names is shown here (the top 6 lines):

Using the apply class of functions

Sometimes we need to apply a function to many cases, and these case parameters may be supplied in a vector, matrix, or list. This is analogous to looping through a set of values to repeat evaluations of a function using different sets of parameters. We illustrate here by computing the mean returns of all stocks in our sample using the apply function. The first argument of the function is the data.frame to which it is being applied, the second argument is either 1 (by rows) or 2 (by columns). The third argument is the function being evaluated.
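For the column means of the returns data:

```r
apply(rets, 2, mean)   # MARGIN = 2 applies mean() to each column
```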

We see that the function returns the column means of the data set. The variants of the function pertain to what the loop is being applied to: lapply applies a function to each element of a list, sapply does the same but simplifies the result into a vector or matrix where possible, and mapply vectorizes over multiple argument lists.

To cross check, we can simply use the colMeans function:
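A quick cross-check:

```r
colMeans(rets)
```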

As we see, this result is verified.

Getting interest rate data from FRED

In finance, data on interest rates is widely used. An authoritative source of data on interest rates is FRED (Federal Reserve Economic Data), maintained by the St. Louis Federal Reserve Bank and warehoused at the following web site: https://research.stlouisfed.org/fred2/. Let's assume that we want to download the data from FRED directly using R. To do this we need to write some custom code. There used to be a package for this, but since the web site changed it has not kept up and no longer works properly. Still, it is easy to roll your own download code in R.
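A hedged sketch of such a function; the CSV endpoint shown is an assumption based on FRED's download links and may change:

```r
getFRED = function(id) {
  url = paste0("https://fred.stlouisfed.org/graph/fredgraph.csv?id=", id)
  df = read.csv(url, na.strings = ".")   # FRED marks missing observations with "."
  names(df) = c("DATE", id)
  df
}
```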

Using the custom function

Now, we provide a list of economic time series and download data accordingly using the function above. Note that we also join these individual series using the date as index. We download constant maturity interest rates (yields) starting from a maturity of one month (DGS1MO) to a maturity of thirty years (DGS30).
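Using the hypothetical getFRED function above:

```r
# Constant-maturity Treasury yields, one month out to thirty years
ids = c("DGS1MO", "DGS3MO", "DGS6MO", "DGS1", "DGS2", "DGS3",
        "DGS5", "DGS7", "DGS10", "DGS20", "DGS30")
rates = Reduce(function(a, b) merge(a, b, by = "DATE"), lapply(ids, getFRED))
```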

Organize the data by date

Having done this, we now have a data.frame called rates containing all the time series we are interested in. We now convert the dates into numeric strings and sort the data.frame by date.
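For example:

```r
rates$DATE = as.numeric(gsub("-", "", rates$DATE))  # "2001-05-01" -> 20010501
rates = rates[order(rates$DATE), ]
head(rates)
```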

Because the function above may not work in Jupyter, we can load in the data from a file as below.
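A sketch, with a hypothetical local file name:

```r
rates = read.csv("rates.csv", header = TRUE)
```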

Handling missing values

Note that there are missing values, denoted by NA. There are also rows with "-99" values; we could clean those out too, but they represent periods when there was no yield available for that maturity, so we leave them in.

Cross-Sectional Data (an example)

  1. A great resource for data sets in corporate finance is on Aswath Damodaran's web site, see: http://people.stern.nyu.edu/adamodar/New_Home_Page/data.html
  2. Financial statement data sets are available at: http://www.sec.gov/dera/data/financial-statement-data-sets.html
  3. And another comprehensive data source: http://fisher.osu.edu/fin/fdf/osudata.htm
  4. Open government data: https://www.data.gov/finance/

Let's read in the list of failed banks: http://www.fdic.gov/bank/individual/failed/banklist.csv

(This does not work, and has been an issue for a while.)

Access file from the web using the readLines function

You can also read in the data using readLines, which works well for downloading, though further work is then required to clean it up.
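For instance:

```r
url = "http://www.fdic.gov/bank/individual/failed/banklist.csv"
raw = readLines(url)   # one string per line; parsing the CSV text is then up to us
head(raw, 3)
```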

Or, read the file from CSV

It may be simpler to just download the data and read it in from the csv file:
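Assuming the file was saved locally as banklist.csv:

```r
data = read.csv("banklist.csv", header = TRUE)
print(names(data))
```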

This gives a data.frame which is easy to work with. We will illustrate some interesting ways in which to manipulate this data.

Failed banks by State

Suppose we want to get subtotals of how many banks failed by state. First add a column of ones to the data.frame.
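For example:

```r
data$count = 1   # sums of this column will give counts of failures
```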

Check for missing data

It's good to check that there is no missing data.
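A one-line check:

```r
any(is.na(data))   # TRUE would flag missing entries
```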

Sort by State

Now we sort the data by state to see how many failures occurred in each state.
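For example (ST is the state column in the FDIC file):

```r
sorted = data[order(data$ST), ]
head(sorted)
```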

Use the aggregate function (for subtotals)

We can directly use the aggregate function to get subtotals by state.
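For example:

```r
aggregate(count ~ ST, data = data, FUN = sum)   # failures subtotaled by state
```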

Data by acquiring bank

And another example, subtotal by acquiring bank. Note how we take the subtotals into another data.frame, which is then sorted and returned in order using the index of the sort.
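A sketch (the column name Acquiring.Institution follows read.csv's renaming of the FDIC header):

```r
acq = aggregate(count ~ Acquiring.Institution, data = data, FUN = sum)
idx = order(acq$count, decreasing = TRUE)   # index of the sort
head(acq[idx, ], 10)
```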

Handling dates with lubridate

Suppose we want to take the preceding data.frame of failed banks and aggregate the data by year, or month, etc. In this case, it is useful to use a dates package. Another useful tool developed by Hadley Wickham is the lubridate package.
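A sketch of yearly subtotals; the date format in the file is an assumption, so switch between dmy and mdy as needed:

```r
library(lubridate)
data$Cdate = dmy(data$Closing.Date)   # e.g., "14-Aug-15"; use mdy() if needed
data$Cyear = year(data$Cdate)
aggregate(count ~ Cyear, data = data, FUN = sum)
```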

By Month

Let's do the same thing by month to see if there is seasonality.
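Same pattern:

```r
data$Cmonth = month(data$Cdate)
aggregate(count ~ Cmonth, data = data, FUN = sum)
```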

There does appear to be seasonality. Why?

By Day

What about day?
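Once more:

```r
data$Cday = day(data$Cdate)
aggregate(count ~ Cday, data = data, FUN = sum)
```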

Definitely, counts are lower at the start and end of the month!

Using the data.table package

This is an incredibly useful package that was written by Matt Dowle. It essentially allows your data.frame to operate as a database. It enables very fast handling of massive quantities of data, and much of this technology is now embedded in the IP of the company called h2o: http://h2o.ai/

The data.table cheat sheet is here: https://s3.amazonaws.com/assets.datacamp.com/img/blog/data+table+cheat+sheet.pdf

California Crime Statistics

We start with some freely downloadable crime statistics for California. We placed the data in a csv file, which is then easy to read into R.
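A sketch with a hypothetical file name:

```r
data = read.csv("CA_crimes.csv", header = TRUE)
```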

It is easy to convert this into a data.table.
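For example:

```r
library(data.table)
D_T = as.data.table(data)
class(D_T)   # both "data.table" and "data.frame"
```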

Note that it is still a data.frame as well; hence, it inherits its properties from the data.frame class.

Examine the data.table

Let's see how it works, noting that the syntax is kept as close as possible to that for data.frames. We print only part of the names list rather than going through every column.
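For example:

```r
print(dim(D_T))
print(head(names(D_T), 10))   # just the first few column names
```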

Indexing the data.table

A nice feature of the data.table is that it can be indexed, i.e., resorted on the fly by making any column in the database the key. Once that is done, then it becomes easy to compute subtotals, and generate plots from these subtotals as well.

The data table can be used like a database, and you can directly apply summarization functions to it. Essentially, it is governed by a format that is summarized as ($i$,$j$,by), i.e., apply some rule to rows $i$, then to some columns $j$, and one may also group by some columns. We can see how this works with the following example.
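A sketch, assuming Year and ForRape_sum are columns of the crime table:

```r
setkey(D_T, Year)                            # index the table by Year
crime = D_T[, sum(ForRape_sum), by = Year]   # all rows i, column j summed by Year
print(crime)
class(crime)
```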

The data.table was operated on over all rows, i.e., all $i$, and the $j$ column we are interested in is ForRape_sum, which we total by Year. This returns a summary containing only the Year and the total number of rapes per year. Note that the output is also of class data.table, which includes the class data.frame.

Plotting from the data.table

Next, we plot the results from the data.table in the same way as we would for a data.frame.
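For example:

```r
plot(crime$Year, crime$V1, type = "b",
     xlab = "Year", ylab = "Rapes per year")  # V1 is data.table's default sum name
```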

By County

Repeat the process looking at crime (Rape) totals by county.
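Assuming a County column:

```r
setkey(D_T, County)
crime_county = D_T[, sum(ForRape_sum), by = County]
print(head(crime_county))
```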

Barplot of crime

Now, we can go ahead and plot it using a different kind of plot, a horizontal barplot.
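For example:

```r
barplot(crime_county$V1, names.arg = crime_county$County,
        horiz = TRUE, las = 2, cex.names = 0.4)   # horizontal bars, small labels
```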

Bay Area Bike Share data

We show some other features using a different data set, the bike information on Silicon Valley routes for the Bike Share program. This is a much larger data set.
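A sketch using data.table's fast reader; the file name is a placeholder:

```r
library(data.table)
trips = fread("201408_trip_data.csv")   # fread handles large files quickly
dim(trips)
```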

Summarize Trips Data

Next we print some descriptive statistics.
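For example:

```r
summary(trips)
```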

Start and End Bike Stations

Now, we quickly check how many start and end stations there are.
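Assuming the column names below (they may differ in your download):

```r
length(unique(trips$`Start Station`))
length(unique(trips$`End Station`))
trips$Duration = trips$Duration / 60   # stored in seconds; convert to minutes
```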

As we can see, there are quite a few stations in the bike share program where riders can pick up and drop off bikes. The trip duration information is stored in seconds, so has been converted to minutes in the code above.

The plyr package family

This package by Hadley Wickham is useful for applying functions to tables of data, i.e., data.frames. Since we may want to write custom functions, this is a highly useful package. R users often select either the data.table or the plyr class of packages for handling data.frames as databases. The latest incarnation is the dplyr package, which focuses only on data.frames.

Filter the data

One useful function is filter, which subsets the dataset to the rows you want to select for further analysis.
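A sketch; the station name is illustrative:

```r
library(dplyr)
jch = filter(trips, `Start Station` == "San Jose City Hall")
dim(jch)
```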

Sorting using the arrange function

The arrange function is useful for sorting by any number of columns as needed. Here we sort by the start and end stations.
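For example:

```r
sorted_trips = arrange(trips, `Start Station`, `End Station`)
head(sorted_trips)
```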

Reverse order sort

The sort can also be done in reverse order as follows.
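For example, using desc():

```r
sorted_trips = arrange(trips, desc(`Start Station`), desc(`End Station`))
head(sorted_trips)
```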

Descriptive statistics

The dplyr package also offers a fantastic way to do descriptive statistics! First, group the data by start point, and then produce statistics by this group, choosing to count the number of trips starting from each station and the average duration of each trip.
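A sketch with dplyr's grouped summary (column names as assumed above):

```r
trips %>%
  group_by(`Start Station`) %>%
  summarise(count = n(), avg_duration = mean(Duration))
```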

Other functions in dplyr

Try also the select(), extract(), mutate(), summarise(), sample_n(), sample_frac() functions.

The group_by() function is particularly useful as we have seen.

Application to IPO Data

Let's revisit all the stock exchange data from before, where we download the table of firms listed on the NYSE, NASDAQ, and AMEX using the quantmod package.

We then clean off the rows with incomplete data, using the very useful complete.cases function.
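A sketch of both steps, using TTR's stockSymbols (its default covers the three exchanges):

```r
library(TTR)
co_names = stockSymbols()
co_names = co_names[complete.cases(co_names), ]   # drop rows with any NA
dim(co_names)
```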

We create a table of the frequency of IPOs by year to see hot and cold IPO markets; a sketch of all three steps follows the list below.

  1. First, remove all rows with missing IPO data.
  2. Plot IPO Activity with a bar plot. We make sure to label the axes properly.
  3. Plot IPO Activity using the rbokeh package to make a pretty line plot. See: https://hafen.github.io/rbokeh/
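A hedged sketch of the three steps; the IPOyear column name follows TTR's stockSymbols output, and the rbokeh calls are illustrative:

```r
ipos = table(co_names$IPOyear[!is.na(co_names$IPOyear)])        # 1: drop missing years
barplot(ipos, xlab = "Year", ylab = "Number of IPOs", las = 2)  # 2: bar plot

library(rbokeh)                                                 # 3: interactive plot
df = data.frame(year = as.numeric(names(ipos)), count = as.numeric(ipos))
figure(width = 700, height = 400) %>%
  ly_points(year, count, data = df, hover = c(year, count)) %>%
  ly_lines(year, count, data = df)
```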

Bokeh plots

These plots are really nice looking and require only simple code. The "hover" feature makes these plots especially appealing.