The Art of Data Science

"All models are wrong, but some are useful." --- George E. P. Box and N.R. Draper in "Empirical Model Building and Response Surfaces," John Wiley & Sons, New York, 1987.

So you want to be a "data scientist"? There is no widely accepted definition of who a data scientist is. The term "data scientist" was coined by D.J. Patil, who was the Chief Scientist for LinkedIn. In 2011 Forbes placed him second in their Data Scientist List, just behind Larry Page of Google. Several books now attempt to define what data science is and who a data scientist may be; see Patil (2011), Patil (2012), and Loukides (2012). This book's viewpoint is that a data scientist is someone who asks unique, interesting questions of data based on formal or informal theory, to generate rigorous and useful insights. To quote Georg Cantor - "In mathematics the art of proposing a question must be held of higher value than solving it." A data scientist is likely to be an individual with multi-disciplinary training in computer science, business, economics, and statistics, armed with the necessary quantity of domain knowledge relevant to the question at hand. The potential of the field is enormous, for just a few well-trained data scientists armed with big data have the potential to transform organizations and societies. In the narrower domain of business life, the role of the data scientist is to generate applicable business intelligence.

Among all the new buzzwords in business -- and there are many -- "Big Data" is one of the most often heard. The burgeoning social web, and the growing role of the internet as the primary information channel of business, have generated more data than we might imagine. Users upload an hour of video data to YouTube every second. [See Mayer-Schonberger and Cukier, p8. They report that USC's Martin Hilbert calculated that more than 300 exabytes of data storage was being used in 2007, an exabyte being one billion gigabytes, i.e., $10^{18}$ bytes, or $2^{60}$ bytes in binary usage.] 87% of the U.S. population has heard of Twitter, and 7% use it. In contrast, 88% of the population has heard of Facebook, and 41% use it. [See "7 Surprising Statistics about Twitter in America." Half of Twitter users are white, and of the remaining half, half are black.] Forty-nine percent of Twitter users follow some brand or the other, hence the reach is enormous, and, as of 2014, there are more than 500 million tweets a day. But data is not information, and until we add analytics, it is just noise. More, and bigger, data may mean more noise and does not mean better data.

In many cases, less is more, and we need models as well. That is what this book is about: theories and models, with or without data, big or small. It's about analytics and applications, and a scientific approach to using data based on well-founded theory and sound business judgment. This book is about the science and art of data analytics.

Data science is transforming business. Companies are using medical data and claims data to offer incentivized health programs to employees. Caesar's Entertainment Corp. analyzed data for 65,000 employees and found substantial cost savings. Zynga Inc, famous for its game Farmville, accumulates 25 terabytes of data every day and analyzes it to make choices about new game features. UPS installed sensors to collect data on speed and location of its vans, which combined with GPS information, reduced fuel usage in 2011 by 8.4 million gallons, and shaved 85 million miles off its routes. ["How Big Data is Changing the Whole Equation for Business," Wall Street Journal March 8, 2013.] McKinsey argues that a successful data analytics plan contains three elements: interlinked data inputs, analytics models, and decision-support tools. ["Big Data: What's Your Plan?" McKinsey Quarterly, March 2013.] In a seminal paper, Halevy, Norvig, and Pereira (2009) argue that even simple theories and models, with big data, have the potential to do better than complex models with less data.

In a recent talk at the h2o world conference in the Bay Area, on 11th November 2015, well-regarded data scientist Hilary Mason emphasized that the creation of "data products" requires three components: data (of course) plus technical expertise (machine-learning) plus people and process (talent). Google Maps is a great example of a data product that epitomizes all these three qualities. She mentioned three skills that good data scientists need to cultivate: (a) in math and stats, (b) coding, (c) communication. I would add that preceding all these is the ability to ask relevant questions, the answers to which unlock value for companies, consumers, and society. Everything in data analytics begins with a clear problem statement, and needs to be judged with clear metrics.

Being a data scientist is inherently interdisciplinary. Good questions come from many disciplines, and the best answers are likely to come from people who are interested in multiple fields, or at least from teams that co-mingle varied skill sets. Josh Wills of Cloudera stated it well - "A data scientist is a person who is better at statistics than any software engineer and better at software engineering than any statistician." In contrast, complementing data scientists are business analytics people, who are more familiar with business models and paradigms and can ask good questions of the data.

Analytics may be broken down into various types, and here is a useful taxonomy. [See the taxonomy by Alain Louchez, Georgia State.] There are five stages or types of analytics: (i) Descriptive analytics, which describes past phenomena using historically generated data; (ii) Diagnostic analytics, which looks at why a phenomenon occurred; (iii) Predictive analytics, a very popular business area, which is about forecasting what may happen in the future based on past (big) data; and (iv) Prescriptive analytics, which states what should happen in the future. Finally, one may also consider (v) Normative analytics, or the theoretical derivation or postulation of what is ideal.


Volume, Velocity, Variety

There are several "V"s of big data: three of these are volume, velocity, variety. [This nomenclature was originated by the Gartner group in 2001, and has been in place for more than a decade.] Big data exceeds the storage capacity of conventional databases. This is its volume aspect. The scale of data generation is mind-boggling. Google's Eric Schmidt pointed out that until 2003, all of humankind had generated just 5 exabytes of data (an exabyte is $1000^6$ bytes or a billion-billion bytes). Today we generate 5 exabytes of data every two days. The main reason for this is the explosion of "interaction" data, a new phenomenon in contrast to mere "transaction" data. Interaction data comes from recording activities in our day-to-day ever more digital lives, such as browser activity, geo-location data, RFID data, sensors, personal digital recorders such as the fitbit and phones, satellites, etc. We now live in the "internet of things" (or IoT), and it's producing a wild quantity of data, all of which we seem to have an endless need to analyze. In some quarters it is better to speak of 4 Vs of big data; see below.

A good data scientist will be adept at managing volume not just technically in a database sense, but by building algorithms to make intelligent use of the size of the data as efficiently as possible. Things change when you have gargantuan data because almost all correlations become significant, and one might be tempted to draw spurious conclusions about causality. For many modern business applications today extraction of correlation is sufficient, but good data science involves techniques that extract causality from these correlations as well.
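The claim that almost all correlations become significant in very large samples can be checked with a little arithmetic. The sketch below uses the standard t-statistic for a sample correlation, $t = r\sqrt{n-2}/\sqrt{1-r^2}$; the correlation value and sample sizes are illustrative assumptions, not figures from this chapter.

```python
import math

def correlation_t_stat(r, n):
    """t-statistic for testing that a correlation r from n observations is zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

r = 0.01  # an economically negligible correlation (illustrative value)
for n in (100, 10_000, 1_000_000):
    t = correlation_t_stat(r, n)
    # 1.96 is the usual two-sided 5% critical value for large samples.
    print(f"n={n:>9,d}  t={t:6.2f}  significant at 5%: {abs(t) > 1.96}")
```

The same tiny correlation is nowhere near significant with a hundred observations, yet is strongly significant with a million, which is why significance alone says little about importance or causality in big data.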

In many cases, detecting correlations is useful as is. For example, consider the classic case of Google Flu Trends, see the figure below. The figure shows the high correlation between flu incidence and searches about "flu" on Google, see Culotta (2010). Obviously searches on the key word "flu" do not result in the flu itself! Of course, the incidence of searches on this key word is influenced by flu outbreaks. The interesting point here is that even though searches about flu do not cause flu, they correlate with it, and may at times even be predictive of it, simply because searches lead the actual reported levels of flu, as those may occur concurrently but take time to be reported. And whereas searches may be predictive, the cause of searches is the flu itself, one variable feeding on the other, in a repeat cycle. [Interwoven time series such as these may be modeled using Vector Auto-Regressions, a technique we will encounter later in this book.] Hence, prediction is a major outcome of correlation, and has led to the recent buzz around the subfield of "predictive analytics." There are entire conventions devoted to this facet of correlation, such as the wildly popular PAW (Predictive Analytics World). [May be a futile collection of people, with non-working crystal balls, as William Gibson said - "The future is not google-able."] Pattern recognition is in, passe causality is out.

Flu and searches for "flu":

Data velocity is accelerating. Streams of tweets, Facebook entries, financial information, etc., are being generated by more users at an ever increasing pace. Whereas velocity increases data volume, often exponentially, it might shorten the window of data retention or application. For example, high-frequency trading relies on micro-second information and streams of data, but the relevance of the data rapidly decays.

Finally, data variety is much greater than ever before. Models that relied on just a handful of variables can now avail of hundreds of variables, as computing power has increased. The scale of change in volume, velocity, and variety of the data that is now available calls for new econometrics, and a range of tools for even single questions. This book aims to introduce the reader to a variety of modeling concepts and econometric techniques that are essential for a well-rounded data scientist.

Data science is more than the mere analysis of large data sets. It is also about the creation of data. The field of "text-mining" expands available data enormously, since there is so much more text being generated than numbers. The creation of data from varied sources, and its quantification into information is known as "datafication."

Machine Learning

Data science is also more than "machine learning," which is about how systems learn from data. Systems may be trained on data to make decisions, and training is a continuous process, where the system updates its learning and (hopefully) improves its decision-making ability with more data. A spam filter is a good example of machine learning. As we feed it more data it keeps changing its decision rules, using a Bayesian filter, thereby remaining ahead of the spammers. It is this ability to adaptively learn that prevents spammers from gaming the filter, as highlighted in Paul Graham's interesting essay titled "A Plan for Spam". Credit card approvals are also based on neural-nets, another popular machine learning technique. However, machine-learning techniques favor data over judgment, and good data science requires a healthy mix of both. Judgment is needed to accurately contextualize the setting for analysis and to construct effective models. A case in point is Vinny Bruzzese, known as the "mad scientist of Hollywood" who uses machine learning to predict movie revenues. ["Solving Equation of a Hit Film Script, With Data," New York Times, May 5, 2013.] He asserts that mere machine learning would be insufficient to generate accurate predictions. He complements machine learning with judgment generated from interviews with screenwriters, surveys, etc., "to hear and understand the creative vision, so our analysis can be contextualized."

Machine intelligence is re-emerging as the new incarnation of AI (a field that many feel has not lived up to its promise). Machine learning promises and has delivered on many questions of interest, and is also proving to be quite a game-changer, as we will see later on in this chapter, and also as discussed in many preceding examples. What makes it so appealing? Hilary Mason suggests four characteristics of machine intelligence that make it interesting: (i) It is usually based on a theoretical breakthrough and is therefore well grounded in science. (ii) It changes the existing economic paradigm. (iii) The result is commoditization (e.g. Hadoop), and (iv) it makes available new data that leads to further data science.

Machine Learning (a.k.a. "ML") has diverged and is now defined separately from traditional statistics. ML is more about learning and matching inputs with outputs, whereas statistics has always been interested more in analyzing data under a given problem statement or hypothesis. ML tends to be more heuristic, whereas econometrics and statistical analyses tend to be theory-driven, with tight assumptions. ML tends to focus more on prediction, econometrics on causality, which is a stronger outcome than prediction (or correlation). ML techniques work well with big data, whereas econometrics techniques tend toward too much significance with too much data. Hence, the latter is better served with dimension reduction, though ML may not in fact be implementable with small data. Under ML techniques, even when they work very well, it is hard to explain why, and also which variables in the feature set seem to work best. Under traditional econometrics and statistics, tracing the effects in the model is clear and feasible, making understanding of the model better. Deciding which approach fits a given problem best is a matter of taste, but experience often helps in deciding which one of the two methods applies better.

Let's examine a definition of Machine Learning. Tom Mitchell, one of the founders of the field, stated a formal definition thus:

"A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ if its performance at tasks in $T$, as measured by $P$, improves with experience $E$." -- Mitchell (1997)

Domingos (2012) offers an excellent introduction to machine learning. He defines learning as the sum of Representation, Evaluation, and Optimization. Machine learning representation requires specifying the problem in a formal language that a computer can handle. These representations will differ for different machine learning techniques. For example, in a classification problem, there may be a choice of many classifiers, each of which will be formally represented. Next, a scoring function or a loss function is specified in order to complete the evaluation step. Finally, best evaluation is attained through optimization.
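The three components can be made concrete with a deliberately small sketch. The choices here are assumptions for illustration only and are not taken from Domingos: the representation is a one-variable linear model, the evaluation is mean squared error, and the optimization is plain gradient descent on made-up data.

```python
# Representation: a one-variable linear model y_hat = a + b * x.
def predict(params, x):
    a, b = params
    return a + b * x

# Evaluation: mean squared error of the model on a data set.
def mse(params, xs, ys):
    return sum((predict(params, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Optimization: plain gradient descent on the mean squared error.
def fit(xs, ys, lr=0.05, steps=2000):
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errs = [predict((a, b), x) - y for x, y in zip(xs, ys)]
        grad_a = 2 * sum(errs) / n
        grad_b = 2 * sum(e * x for e, x in zip(errs, xs)) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # made-up data lying exactly on y = 1 + 2x
params = fit(xs, ys)
print(params, mse(params, xs, ys))
```

Swapping any one of the three pieces (a different model class, a different loss, a different optimizer) gives a different learning algorithm, which is the point of the decomposition.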

Once these steps have been undertaken and the best ML algorithm is chosen on the training data, we may validate the model on out-of-sample data, or the test data set. One may randomly choose a fraction of the data sample to hold out for validation. Repeating this process by holding out different parts of the data for testing, and training on the remainder, is a process known as cross-validation and is strongly recommended.
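A minimal sketch of the hold-out-and-repeat procedure follows. The helper names, the synthetic data, and the trivial "model" (predicting the training mean) are all assumptions chosen to keep the example short; any fitting routine could take its place.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(train_y):
    """A deliberately trivial 'model': predict the training mean."""
    return sum(train_y) / len(train_y)

def cross_validate(y, k=5, seed=0):
    """Average out-of-sample squared error across the k held-out folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [y[i] for i in range(len(y)) if i not in held]
        model = fit_mean(train)  # train on the remainder
        err = sum((y[i] - model) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(err)  # test on the held-out part
    return sum(fold_errors) / k

rng = random.Random(42)
y = [rng.gauss(10.0, 2.0) for _ in range(100)]  # synthetic data
cv_error = cross_validate(y, k=5)
print(cv_error)
```

Each observation is held out exactly once, so the averaged error uses all of the data for testing without ever testing a model on points it was trained on.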

If it turns out that repeated cross-validation results in poor results, even though in-sample testing does very well, then it is possible evidence of over-fitting. Over-fitting usually occurs when the model is over-parameterized in-sample, so that it fits very well, but then it becomes less useful on new data. This is akin to driving by looking in the rear-view mirror, which does not work well when the road does not remain straight going forward. Therefore, many times, simpler and less parameterized models tend to work better in forecasting and prediction settings. If the model performs pretty much the same in-sample and out-of-sample, it is very unlikely to be overfit. The argument that simpler models overfit less is often made with Occam's Razor in mind, but is not always an accurate underpinning, so simpler may not always be better. [See the excellent paper on this by Domingos (1999).]
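The gap between in-sample and out-of-sample fit can be seen in a small experiment. This sketch assumes a nearest-neighbor regression on synthetic data, which is not a method discussed above; with $k=1$ the model memorizes the training sample, while a larger $k$ smooths over more points.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the y-values of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error on data of a k-NN model built from train."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(1)

def sample(n):
    # Synthetic data: a smooth signal plus noise.
    pts = []
    for _ in range(n):
        x = rng.uniform(0.0, 6.0)
        pts.append((x, math.sin(x) + rng.gauss(0.0, 0.5)))
    return pts

train, test = sample(200), sample(200)
for k in (1, 15):
    print(f"k={k:2d}  in-sample MSE={mse(train, train, k):.3f}  "
          f"out-of-sample MSE={mse(train, test, k):.3f}")
```

With $k=1$ the in-sample error is zero because every training point is its own nearest neighbor; a perfect in-sample fit of this kind says nothing about how the model performs on new data.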

Supervised and Unsupervised Learning

Systems may learn in two broad ways, through "supervised" and "unsupervised" learning. In supervised learning, a system produces decisions (outputs) based on input data. Both spam filters and automated credit card approval systems are examples of this type of learning. So is linear discriminant analysis (LDA). The system is given a historical data sample of inputs and known outputs, and it "learns" the relationship between the two using machine learning techniques, of which there are several. Judgment is needed to decide which technique is most appropriate for the task at hand.

Unsupervised learning is a process of reorganizing and enhancing the inputs in order to place structure on unlabeled data. A good example is cluster analysis, which takes a collection of entities, each with a number of attributes, and partitions the entity space into sets or groups based on closeness of the attributes of all entities. This reorganizes the data, but it also enhances the data through a process of labeling the data with additional tags (in this case a cluster number/name). Factor analysis is also an unsupervised learning technique. The origin of this terminology is unclear, but it presumably arises from the fact that there is no clear objective function that is maximized or minimized in unsupervised learning, so that no "supervision" to reach an optimum is called for. However, this is not necessarily true in general, and we will see examples of unsupervised learning (such as community detection in the social web) where the outcome depends on measurable objective criteria.
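Cluster analysis can be sketched in a few lines. The example below assumes a one-dimensional k-means (Lloyd's algorithm) with made-up attribute values and starting centers; it tags each entity with a cluster number, and it also illustrates that an unsupervised method can have an objective, since each step lowers the within-cluster squared distance.

```python
def kmeans_1d(values, centers, iterations=20):
    """Lloyd's algorithm in one dimension, starting from the given centers."""
    centers = list(centers)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: tag each point with its nearest center.
        labels = [min(range(len(centers)), key=lambda j: abs(v - centers[j]))
                  for v in values]
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels, centers

values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]  # made-up attribute values
labels, centers = kmeans_1d(values, centers=[0.0, 10.0])
print(labels, centers)
```

The returned labels are exactly the additional tags that the clustering attaches to the data; no output variable was supplied at any point.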

Supervised learning might include broad topics such as regression, classification, forecasting, and importance attribution. All these analyses are supported by the fact that the feature set ($X$ variables) is accompanied by tags ($Y$ variables). Unsupervised learning includes analyses such as clustering, and association models, e.g., recommendation engines, market baskets, etc.

Feature Selection

When faced with a machine learning problem, having the right data is paramount. Sometimes, especially in this age of Big Data, we may have too much data; abundance comes with a curse. Too much data also might mean featureless data, which is not useful to the data scientist. Hence, we might want to extract those data variables that are useful, through a process called "feature selection". Dimension reduction is also a useful by-product of feature selection, and pruning data might also mean that ML algorithms will run faster, and converge better.

Wikipedia defines feature selection as -- "In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction."

Feature selection subsets the variable space. If there are $p$ columns of data, then we choose $q \ll p$ variates. Feature extraction on the other hand refers to transformation of the original variables to create new variables, i.e., functionals of $p$, such as $g(p)$. We will encounter these topics later on, as we work through various ML techniques.
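The distinction between selection and extraction can be illustrated with a small sketch. The ranking rule used here (absolute correlation with the target) is only one simple filter chosen for illustration, and the column names and values are made up.

```python
import math

def correlation(u, v):
    """Pearson correlation between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def select_features(columns, y, q):
    """Keep the q column names with the largest |correlation| with y."""
    ranked = sorted(columns,
                    key=lambda name: abs(correlation(columns[name], y)),
                    reverse=True)
    return ranked[:q]

# Made-up data: p = 3 candidate features, of which we keep q = 1.
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 1.0, 2.0, 1.0, 2.0],
    "x3": [5.0, 3.0, 4.0, 1.0, 2.0],
}
y = [1.1, 2.0, 2.9, 4.2, 5.0]
selected = select_features(columns, y, q=1)

# Feature extraction, by contrast, builds a new variable from the originals.
extracted = [a - b for a, b in zip(columns["x1"], columns["x3"])]
print(selected, extracted)
```

Selection returns a subset of the existing column names, whereas extraction produces a column that did not exist in the original data.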

Ensemble Learning

Ensemble models are simply combinations of many ML models. There are, of course, many ways in which models may be combined to generate better ML models. It is astonishing how powerful this "model democracy" turns out to be, where various models vote, for example, on a classification problem. In Das and Chen (2007), five different classifiers vote on classifying stock bulletin board messages into three categories of signals: Buy, Hold, Sell. In this early work, ensemble methods were able to improve the signal-to-noise ratio in classification.

Different classification models are not always necessary. One may instead calibrate the same model to different subsamples of the training data, delivering multiple similar, but different models. Each of these models is then used to classify out-of-sample, and the decision is made by voting across models. This method is known as bagging. One of the most popular examples of bagging algorithms is the random forest model, which we will encounter later when we examine classifiers in more detail.
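A minimal sketch of bagging, under stated assumptions: the base model is a simple threshold classifier (a "stump"), the subsamples are bootstrap resamples drawn with replacement, and the training data are made up. Random forests are more elaborate than this, but the resample-fit-vote pattern is the same.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Fit a threshold classifier: predict 1 if x > threshold, else 0."""
    best = None
    for t, _ in sample:
        errors = sum((1 if x > t else 0) != y for x, y in sample)
        if best is None or errors < best[0]:
            best = (errors, t)
    return best[1]

def bagging(train, n_models=25, seed=0):
    """Fit one stump per bootstrap resample of the training data."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(train) for _ in train])
            for _ in range(n_models)]

def vote(thresholds, x):
    """Classify x by majority vote across the bagged stumps."""
    ballots = Counter(1 if x > t else 0 for t in thresholds)
    return ballots.most_common(1)[0][0]

# Made-up training set: the label is 1 when x exceeds 5.
train = [(float(x), 1 if x > 5 else 0) for x in range(11)]
models = bagging(train)
print(sorted(set(models)), vote(models, 8.0), vote(models, 2.0))
```

The fitted thresholds differ from one resample to the next, and the vote averages out that variation. An odd number of models avoids tied votes in a two-class problem.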

In another technique, boosting, the loss function that is being optimized does not weight all examples in the training data set equally. After one pass of calibration, training examples are reweighted such that the cases where the ML algorithm made errors (as in a classification problem) are given higher weight in the loss function. By penalizing these observations, the algorithm learns to prevent those mistakes as they are more costly.
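The reweighting step can be written out for one round. The update below follows the AdaBoost scheme, which is an assumption on our part since the passage does not name a specific boosting algorithm; the five training examples and the single mistake are hypothetical.

```python
import math

def reweight(weights, correct):
    """One AdaBoost-style round: raise the weight of misclassified examples.

    Assumes the weighted error rate is strictly between 0 and 1.
    """
    # Weighted error rate of the current classifier.
    err = sum(w for w, ok in zip(weights, correct) if not ok) / sum(weights)
    # The classifier's say in the ensemble; larger when err is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # Shrink weights on correct cases, inflate them on errors, then normalize.
    new = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new], alpha

weights = [0.2] * 5                        # start with equal weights
correct = [True, True, True, True, False]  # hypothetical: one mistake
new_weights, alpha = reweight(weights, correct)
print(new_weights, alpha)
```

After the update the single misclassified example carries half of the total weight, so the next pass of calibration is pushed hard toward getting it right.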

Another approach to ensemble learning is called stacking, where models are chained to each other, so that the output of low-level models becomes the input of another higher-level model. Here models are vertically integrated, in contrast to bagging, where models are horizontally integrated.

Predictions and Forecasts

Data science is about making predictions and forecasts. There is a difference between the two. The statistician-economist Paul Saffo has suggested that predictions aim to identify one outcome, whereas forecasts encompass a range of outcomes. To say that "it will rain tomorrow" is to make a prediction, but to say that "the chance of rain is 40%" (implying that the chance of no rain is 60%) is to make a forecast, as it lays out the range of possible outcomes with probabilities. We make weather forecasts, not predictions. Predictions are statements of great certainty, whereas forecasts exemplify the range of uncertainty. In the context of these definitions, the term predictive analytics is a misnomer, for its goal is to make forecasts, not mere predictions.
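The distinction is easy to express in code. In this sketch a forecast is the full set of outcome probabilities, estimated here from a made-up record of ten comparable days, while a prediction collapses that to the single most likely outcome; the function names are hypothetical.

```python
from collections import Counter

def forecast(history):
    """Empirical probability of each outcome: a range of outcomes."""
    counts = Counter(history)
    n = len(history)
    return {outcome: c / n for outcome, c in counts.items()}

def prediction(history):
    """The single most frequent outcome: one outcome only."""
    return Counter(history).most_common(1)[0][0]

history = ["rain"] * 4 + ["dry"] * 6  # made-up record of ten similar days
print(forecast(history))
print(prediction(history))
```

The prediction discards the 40% chance of rain that the forecast retains, which is exactly the information a decision-maker with an umbrella would want.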

Innovation and Experimentation

Data science is about new ideas and approaches. It merges new concepts with fresh algorithms. Take for example the A/B test, which is nothing but the online implementation of a real-time focus group. Different subsets of users are exposed to A and B stimuli respectively, and responses are measured and analyzed. It is widely used for web site design. This approach has been in place for more than a decade, and in 2011 Google ran more than 7,000 A/B tests. Facebook, Amazon, Netflix, and several other firms use A/B testing widely. ["The A/B Test: Inside the Technology that's Changing the Rules of Business," by Brian Christian, Wired, April 2012.] The social web has become a teeming ecosystem for running social science experiments. The potential to learn about human behavior using innovative methods is much greater now than ever before.
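One common way to analyze the measured responses is a two-proportion z-test on conversion rates. The sketch below is an illustration under that assumption; the firms mentioned above may use quite different methods, and the user counts are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants convert equally."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, written with erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 200 of 5,000 users convert on A, 260 of 5,000 on B.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference in response between the two stimuli is unlikely to be due to chance alone, though with very large user bases even trivial differences will clear this bar.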

The Dark Side: Big Errors

The good data scientist will take care to not over-reach in drawing conclusions from big data. Because there are so many variables available, and plentiful observations, correlations are often statistically significant, but devoid of basis. In the immortal words of the bard, empirical results from big data may be - "A tale told by an idiot, full of sound and fury, signifying nothing." [William Shakespeare in Macbeth, Act V, Scene V.] One must be careful not to read too much into the data. More data does not guarantee less noise, and signal extraction may be no easier than with less data.

Adding more columns (variables in the cross section) to the data set, but not more rows (time dimension), is also fraught with danger. As the number of variables increases, more characteristics are likely to be related statistically. Overfitting models in-sample is much more likely with big data, leading to poor performance out-of-sample.

Researchers also have to be careful to explore the data fully, and not terminate their research the moment a viable result, especially one that the researcher is looking for, is attained. With big data, the chances of stopping at a suboptimal, or worse, intuitively appealing albeit wrong result become very high. It is like asking a question of a class of students. In a very large college class, the chance that someone will provide a plausible yet off-base answer quickly is very high, which often short-circuits the opportunity for others in class to think more deeply about the question and provide a much better answer.

Nassim Taleb ["Beware the Big Errors of Big Data" Wired, February 2013.] describes these issues elegantly - "I am not saying there is no information in big data. There is plenty of information. The problem -- the central issue -- is that the needle comes in an increasingly larger haystack." The fact is, one is not always looking for needles or Taleb's black swans, and there are plenty of normal phenomena about which robust forecasts are made possible by the presence of big data.

The Dark Side: Privacy

The emergence of big data coincides with a gigantic erosion of privacy. Humankind has always been torn between the need for social interaction, and the urge for solitude and privacy. One trades off against the other. Technology has simply sharpened the divide and made the slope of this trade-off steeper. It has provided tools of social interaction that steal privacy much faster than in the days before the social web.

Rumors and gossip are now old world. They required bilateral transmission. The social web provides multilateral revelation, where privacy no longer capitulates a battle at a time, but the entire war is lost at one go. And data science is the tool that enables firms, governments, individuals, benefactors and predators, et al, en masse, to feed on privacy's carcass. The cartoon below parodies the kind of information specialization that comes with the loss of privacy!