Text Analytics (an Introduction)

Sanjiv R. Das

Reading references

News Analysis

In Finance, for example, text has become a major source of trading information, leading to a new field known as News Metrics.

News analysis is defined as “the measurement of the various qualitative and quantitative attributes of textual news stories. Some of these attributes are: sentiment, relevance, and novelty. Expressing news stories as numbers permits the manipulation of everyday information in a mathematical and statistical way.” (Wikipedia). In this chapter, I provide a framework for text analytics techniques that are in widespread use. I will discuss various text analytic methods and software, and then provide a set of metrics that may be used to assess the performance of analytics. Various directions for this field are discussed throughout the exposition. The techniques herein can aid in the valuation and trading of securities, facilitate investment decision making, meet regulatory requirements, provide marketing insights, or manage risk.

News Analytics

See: https://www.amazon.com/Handbook-News-Analytics-Finance/dp/047066679X

“News analytics are used in financial modeling, particularly in quantitative and algorithmic trading. Further, news analytics can be used to plot and characterize firm behaviors over time and thus yield important strategic insights about rival firms. News analytics are usually derived through automated text analysis and applied to digital texts using elements from natural language processing and machine learning such as latent semantic analysis, support vector machines, “bag of words”, among other techniques.” (Wikipedia)

Text as Data

Text clearly has business value, but that is a narrow view. Textual data provides a means of understanding all human behavior through a data-driven, analytical approach. Let’s enumerate some reasons for this.

In a talk at the 17th ACM Conference on Information Knowledge and Management (CIKM ’08), Google’s director of research Peter Norvig stated his unequivocal preference for data over algorithms—“data is more agile than code.” Yet, it is well-understood that too much data can lead to overfitting so that an algorithm becomes mostly useless out-of-sample.

Chris Anderson: “Data is the New Theory.”

Definition: Text-Mining

Algorithm Complexity

The Response to News

Das, Martinez-Jerez, and Tufano (FM 2005)

Breakdown of news flow

Frequency of posting

Weekly posting

Intraday posting

Number of characters per posting

Examples: Basic Text Handling

But this returns words with commas and periods included, which is not desired. So what we need is the regular expressions package, i.e., re.
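To see the difference, here is a small sketch with a made-up sentence: naive str.split keeps trailing punctuation, while re.findall with a word pattern drops it.

```python
import re

text = "The market rallied today, and tech stocks led the gains."

# Naive splitting keeps punctuation attached to the words.
print(text.split())   # ['The', 'market', 'rallied', 'today,', ...]

# re.findall with a word pattern strips out the punctuation.
words = re.findall(r"[A-Za-z]+", text)
print(words)          # ['The', 'market', 'rallied', 'today', ...]
```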

Using List Comprehensions to find specific words

Or, use regular expressions to help us with more complex parsing.

For example, '@[A-Za-z0-9_]+' will return all tokens that begin with '@' followed by one or more letters, digits, or underscores (such as Twitter handles).
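A short sketch applying this pattern, both with re.findall and inside a list comprehension (the tweet is made up):

```python
import re

tweet = "Thanks @data_guru and @quant42 for the #NLP pointers!"

# The pattern matches an '@' followed by letters, digits, or underscores.
handles = re.findall(r"@[A-Za-z0-9_]+", tweet)
print(handles)   # ['@data_guru', '@quant42']

# The same test can be applied inside a list comprehension over tokens.
handles_lc = [w for w in tweet.split() if re.match(r"@[A-Za-z0-9_]+", w)]
print(handles_lc)
```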

String operations

Read in a URL

Use Beautiful Soup to clean up all the html stuff
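A minimal sketch using the requests and beautifulsoup4 packages; the Wikipedia URL is only an example page, substitute whatever page you want to scrape.

```python
import requests
from bs4 import BeautifulSoup   # pip install beautifulsoup4

# Example URL; any public page will do.
url = "https://en.wikipedia.org/wiki/Text_mining"
html = requests.get(url).text

# get_text() drops the HTML tags and keeps only the visible text.
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator=" ")
print(text[:300])
```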

Dictionaries

Webster’s defines a “dictionary” as “…a reference source in print or electronic form containing words usually alphabetically arranged along with information about their forms, pronunciations, functions, etymologies, meanings, and syntactical and idiomatic uses.”

  1. The Harvard General Inquirer: http://www.wjh.harvard.edu/~inquirer/
  2. Standard Dictionaries: www.dictionary.com, and www.merriam-webster.com.
  3. Computer dictionary: http://www.hyperdictionary.com/computer that contains about 14,000 computer related words, such as “byte” or “hyperlink”.
  4. Math dictionary, such as http://www.amathsdictionaryforkids.com/dictionary.html.
  5. Medical dictionary, see http://www.hyperdictionary.com/medical.
  6. Internet lingo dictionaries may be used to complement standard dictionaries with words that are not usually found in standard language, for example, see http://www.netlingo.com/dictionary/all.php for words such as “2BZ4UQT” which stands for “too busy for you cutey” (LOL). When extracting text messages, postings on Facebook, or stock message board discussions, internet lingo does need to be parsed and such a dictionary is very useful.
  7. Associative dictionaries are also useful when trying to find context, as the word may be related to a concept, identified using a dictionary such as http://www.visuwords.com/. This dictionary doubles up as a thesaurus, as it provides alternative words and phrases that mean the same thing, and also related concepts.
  8. Value dictionaries deal with values and may be useful when affect (positive or negative) alone is insufficient for scoring text. The Lasswell Value Dictionary http://www.wjh.harvard.edu/~inquirer/lasswell.htm may be used to score the loading of text on the eight basic value categories: Wealth, Power, Respect, Rectitude, Skill, Enlightenment, Affection, and Well-being.

Lexicons

Constructing a lexicon

Lexicons as Word Lists

Negation Tagging

The Grammarly Handbook provides the following negation words (see https://www.grammarly.com/handbook/); a simple tagging scheme based on them is sketched after the list:

  1. Negative words: No, Not, None, No one, Nobody, Nothing, Neither, Nowhere, Never.
  2. Negative Adverbs: Hardly, Scarcely, Barely.
  3. Negative verbs: Doesn’t, Isn’t, Wasn’t, Shouldn’t, Wouldn’t, Couldn’t, Won’t, Can’t, Don’t.
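A minimal negation-tagging sketch using the words above: every word following a negation term is prefixed with "n_" until the next punctuation mark. The "n_" prefix and the punctuation cutoff are illustrative choices, not a fixed standard.

```python
import re

# A subset of the negation terms listed above.
NEGATORS = {"no", "not", "none", "never", "neither", "nobody", "nothing",
            "hardly", "scarcely", "barely", "doesn't", "isn't", "wasn't",
            "shouldn't", "wouldn't", "couldn't", "won't", "can't", "don't"}

def negation_tag(text):
    tagged, negate = [], False
    for token in re.findall(r"[\w']+|[.,!?;]", text.lower()):
        if token in NEGATORS:
            negate = True            # start negation scope
            tagged.append(token)
        elif token in ".,!?;":
            negate = False           # punctuation ends the scope
            tagged.append(token)
        else:
            tagged.append("n_" + token if negate else token)
    return " ".join(tagged)

print(negation_tag("The earnings were not good, but guidance improved."))
# the earnings were not n_good , but guidance improved .
```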

Scoring Text

Read in a dictionary

Sentiment Score the Text using this Dictionary from Harvard Inquirer
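A hedged sketch of reading such a dictionary and scoring text with it. The file name inquirerbasic.csv and the column layout (Entry, Positiv, Negativ) are assumptions; adjust them to the file you actually download from the Inquirer site.

```python
import pandas as pd

# Assumed local copy of the Harvard General Inquirer word list.
inq = pd.read_csv("inquirerbasic.csv")
pos_words = set(inq.loc[inq["Positiv"].notna(), "Entry"].str.lower())
neg_words = set(inq.loc[inq["Negativ"].notna(), "Entry"].str.lower())

def sentiment_score(text):
    # Net sentiment: (positive hits - negative hits) / number of words.
    words = text.lower().split()
    pos = sum(w in pos_words for w in words)
    neg = sum(w in neg_words for w in words)
    return (pos - neg) / max(len(words), 1)

print(sentiment_score("the outlook is strong despite weak margins"))
```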

General Function to Pull Financial Text and score it

The results are different, depending on the source.

Parts of Speech (POS) Tagging

https://www.cs.toronto.edu/~frank/csc2501/Tutorials/cs485_nltk_krish_tutorial1.pdf

Please install the nltk package:

pip install nltk

NLTK also relies on resource packages that may need to be downloaded from within it, so use nltk.download() to do so in case you get the following error when using NLTK.

LookupError:

Resource 'tokenizers/punkt/PY3/english.pickle' not found. Please use the NLTK Downloader to obtain the resource: >>> nltk.download() Searched in:

- '/Users/srdas/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
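Once the resources are in place, tagging is a two-call affair; a small sketch with a made-up sentence (run nltk.download("punkt") and nltk.download("averaged_perceptron_tagger") once if they are missing):

```python
import nltk

sentence = "Markets rallied after the central bank cut rates."
tokens = nltk.word_tokenize(sentence)        # split into word tokens
print(nltk.pos_tag(tokens))                  # tag each token with its part of speech
# [('Markets', 'NNS'), ('rallied', 'VBD'), ('after', 'IN'), ...]
```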

Twitter API

We explore using the Twitter API here.

We can set up keys and tokens at: https://apps.twitter.com/

Using FastText to analyze Tweets

See Malafosse (2019): FastText sentiment analysis for tweets: A straightforward guide; pdf.

JSON

JSON stands for JavaScript Object Notation. It is a lightweight, text-based data-interchange format.

https://www.json.org/
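Python's standard json module converts between JSON strings and dictionaries; a tiny made-up example:

```python
import json

# A small JSON document (e.g., a simplified tweet) parsed into a Python dict.
doc = '{"user": "quant42", "text": "Earnings beat estimates", "retweets": 12}'
tweet = json.loads(doc)
print(tweet["text"], tweet["retweets"])

# And back from a dict to a JSON string.
print(json.dumps(tweet))
```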

Using the NLTK package to conduct sentiment analysis without a dictionary

When using tweets, it may be a good idea to install the Twython library: "pip3 install -U nltk[twitter]" (You can also simply use pip instead of pip3.)
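A minimal sketch using NLTK's bundled VADER analyzer, which ships with its own lexicon so no user-supplied word list is needed (one option among several; it requires a one-time nltk.download("vader_lexicon")):

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The stock surged after surprisingly good results"))
# {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}
```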

Extracting tweets with a hashtag

News Extractor: Reading in parts of a URL

Let's read in the top news from the ET main page.

You also want to get SelectorGadget: http://selectorgadget.com/

Remove punctuation from headlines

Remove Numbers
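Both steps can be done with simple regular-expression substitutions; the headlines below are made up:

```python
import re

headlines = ["Sensex ends 320 points higher!", "Rupee slips 0.4% against the dollar"]

# Drop punctuation, then drop digits.
no_punct = [re.sub(r"[^\w\s]", "", h) for h in headlines]
no_nums = [re.sub(r"\d+", "", h) for h in no_punct]
print(no_nums)   # ['Sensex ends  points higher', 'Rupee slips  against the dollar']
```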

Stemming

https://pythonprogramming.net/stemming-nltk-tutorial/
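A short sketch using NLTK's PorterStemmer on a few made-up tokens:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["trading", "traded", "trades", "trader"]
# Each word is reduced to its stem, e.g., 'trading' -> 'trade'.
print([stemmer.stem(w) for w in words])
```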

Remove Stopwords

Reference: https://pythonprogramming.net/stop-words-nltk-tutorial/
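A sketch using NLTK's English stopword list (requires one-time downloads of the stopwords and punkt resources); the headline is made up:

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stops = set(stopwords.words("english"))
headline = "The firm is expected to report a loss for the quarter"

# Keep only tokens that are not stopwords.
filtered = [w for w in word_tokenize(headline.lower()) if w not in stops]
print(filtered)   # ['firm', 'expected', 'report', 'loss', 'quarter']
```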

Write all docs to separate text files

Create a Corpus

Term Document Matrix
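A sketch using scikit-learn's CountVectorizer on a made-up three-document corpus. Note that the vectorizer returns the document-term matrix (documents in rows); the term-document matrix is its transpose. get_feature_names_out assumes a recent scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["rates rose and stocks fell",
          "stocks rose on strong earnings",
          "earnings fell as rates rose"]

vec = CountVectorizer()
dtm = vec.fit_transform(corpus)          # documents x terms (sparse)
print(vec.get_feature_names_out())       # the terms (columns)
print(dtm.toarray())                     # raw counts
print(dtm.T.toarray())                   # the term-document matrix
```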

Term Frequency - Inverse Document Frequency (TF-IDF)

This is a weighting scheme that sharpens the importance of rare words in a document, relative to the frequency of these words in the corpus. It is based on simple calculations and, even though it does not have strong theoretical foundations, it is still very useful in practice. TF-IDF measures the importance of a word $w$ in a document $d$ in a corpus $C$. It is therefore a function of all three, written as $TFIDF(w,d,C)$, and is the product of term frequency (TF) and inverse document frequency (IDF).

The frequency of a word in a document is defined as

$$ f(w,d)=\frac{\#w \in d}{|d|} $$

where $|d|$ is the number of words in the document. We usually normalize word frequency so that

$$ TF(w,d)=\ln[f(w,d)] $$

This is log normalization. Another form of normalization is known as double normalization and is as follows:

$$ TF(w,d)=\frac{1}{2} + \frac{1}{2} \cdot \frac{f(w,d)}{\max_{w \in d} f(w,d)} $$

Note that normalization is not necessary, but it tends to help shrink the difference between counts of words.

Inverse document frequency is as follows:

$$ IDF(w,C)=\ln\left[\frac{|C|}{|\{d \in C: w \in d\}|}\right] $$

That is, we take the log of the ratio of the number of documents in the corpus $C$ to the number of documents in which word $w$ appears.

Finally, we have the weighting score for a given word $w$ in document $d$ in corpus $C$:

$$ TFIDF(w,d,C)=TF(w,d) \times IDF(w,C) $$
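In practice this weighting is rarely coded by hand; scikit-learn's TfidfVectorizer (whose normalization differs slightly from the formulas above, but captures the same idea) gives the weighted matrix in a few lines. The toy corpus is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["rates rose and stocks fell",
          "stocks rose on strong earnings",
          "earnings fell as rates rose"]

vec = TfidfVectorizer()
X = vec.fit_transform(corpus)            # documents x terms, TF-IDF weighted
print(vec.get_feature_names_out())
print(X.toarray().round(2))              # rare-in-corpus terms get higher weights
```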

WordClouds

Cosine Similarity in the Text Domain

In this segment we will learn some popular functions on text that are used in practice. One of the first things we often want to do is find similar texts or sentences (think of web search as one application). Since documents are vectors in the TDM, we may want to find the closest vectors or compute the distance between vectors.

$$ \cos(\theta) = \frac{A \cdot B}{||A|| \cdot ||B||} $$

where $||A|| = \sqrt{A \cdot A}$ is the square root of the dot product of $A$ with itself, also known as the norm of $A$. This gives the cosine of the angle between the two vectors, which is zero for orthogonal vectors and 1 for vectors pointing in the same direction.
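A quick numerical sketch with two made-up document vectors, computed directly from the formula above:

```python
import numpy as np

A = np.array([1, 0, 2, 1, 0])
B = np.array([0, 1, 2, 1, 1])

# Cosine similarity = dot product divided by the product of the norms.
cos = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))
print(round(cos, 4))
```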

Readability of Text

Or, how to grade text!

In recent years, the SAT exams added a new essay section. While the test aimed at assessing original writing, it also introduced automated grading. A goal of the test is to assess the writing level of the student. This is associated with the notion of readability.

“Readability” is a metric of how easy it is to comprehend text. Given a goal of efficient markets, regulators want to foster transparency by making sure financial documents that are disseminated to the investing public are readable. Hence, metrics for readability are very important and are recently gaining traction.

Gunning-Fog Index

Gunning (1952) developed the Fog index. The index estimates the years of formal education needed to understand text on a first reading. A fog index of 12 requires the reading level of a U.S. high school senior (around 18 years old). The index is based on the idea that poor readability is associated with longer sentences and complex words. Complex words are those that have more than two syllables. The formula for the Fog index is

$$ 0.4 \left[\frac{\#words}{\#sentences} + 100 \cdot \frac{\#complex words}{\#words} \right] $$
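The Fog formula is easy to compute directly; below is a rough sketch in which the syllable counter is a crude vowel-group heuristic (adequate for illustration only) and the sample text is made up.

```python
import re

def syllables(word):
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_index(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    complex_words = [w for w in words if syllables(w) > 2]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

sample = ("The company reported extraordinary quarterly performance. "
          "Management anticipates continued profitability.")
print(round(fog_index(sample), 2))
```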

Flesch Score

Alternative readability scores use similar ideas. The Flesch Reading Ease Score and the Flesch-Kincaid Grade Level also use counts of words, syllables, and sentences. See http://en.wikipedia.org/wiki/Flesch-Kincaid_readability_tests. The Flesch Reading Ease Score is defined as

$$ 206.835 - 1.015 \cdot \frac{\#words}{\#sentences} - 84.6 \cdot \frac{\#syllables}{\#words} $$

Scores in the range 90-100 are easily understood by an 11-year-old, scores of 60-70 are easy to understand for 13- to 15-year-olds, and scores of 0-30 are best suited to university graduates.

The Flesch-Kincaid Grade Level

This is defined as

$$ 0.39 \cdot \frac{\#words}{\#sentences} + 11.8 \cdot \frac{\#syllables}{\#words} - 15.59 $$

which gives a number that corresponds to the grade level. As expected these two measures are negatively correlated. Various other measures of readability use the same ideas as in the Fog index. For example the Coleman and Liau (1975) index does not even require a count of syllables, as follows:

$$ CLI = 0.0588 L - 0.296 S - 15.8 $$

where $L$ is the average number of letters per hundred words and $S$ is the average number of sentences per hundred words.

Standard readability metrics may not work well for financial text. Loughran and McDonald (2014) find that the Fog index is inferior to simply looking at 10-K file size.

References

koRpus package

The R package koRpus provides readability scoring; see http://www.inside-r.org/packages/cran/koRpus/docs/readability

First, let’s grab some text from my web site.

Text Summarization

A document $D$ comprises $m$ sentences $s_i,i=1,2,...,m$, where each $s_i$ is a set of words. We compute the pairwise overlap between sentences using the Jaccard similarity index:

$$ J_{ij} = J(s_i,s_j)=\frac{|s_i \cap s_j|}{|s_i \cup s_j|} = J_{ji} $$

The overlap is the ratio of the size of the intersection of the two word sets in sentences $s_i$ and $s_j$ to the size of the union of the two sets. The similarity score of each sentence is its row sum in the Jaccard similarity matrix.

$$ S_i=\sum_{j=1}^m J_{ij} $$

Generating the summary

Once the row sums are obtained, they are sorted in descending order and the summary consists of the top $n$ sentences by $S_i$ value.
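A compact sketch of this summarizer; the sentence splitter is a simple regular expression and the sample text is made up.

```python
import re

def jaccard(a, b):
    # Jaccard index between two word lists.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def summarize(text, n=2):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_sets = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    # Row sums of the Jaccard matrix are the sentence scores S_i.
    scores = [sum(jaccard(w, v) for v in word_sets) for w in word_sets]
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:n]
    return " ".join(sentences[i] for i in sorted(top))

text = ("Stocks rallied on strong earnings. Bond yields were little changed. "
        "Strong earnings lifted technology stocks. The dollar weakened slightly.")
print(summarize(text, n=2))
```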

One Function to Rule All Text in R

Example: Summarization

We will use a sample of text that I took from Bloomberg news. It is about the need for data scientists.

Modern Methods

Big Data: Reuters News Corpus

https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/reuters.zip

The Reuters-21578 benchmark corpus, ApteMod

id: reuters; size: 6378691; author: ; copyright: ; license: The copyright for the text of newswire articles and Reuters annotations in the Reuters-21578 collection resides with Reuters Ltd. Reuters Ltd. and Carnegie Group, Inc. have agreed to allow the free distribution of this data for research purposes only. If you publish results based on this data set, please acknowledge its use, refer to the data set by the name 'Reuters-21578, Distribution 1.0', and inform your readers of the current location of the data set.;

https://pynlp.wordpress.com/2013/12/10/unit-5-part-ii-working-with-files-ii-the-plain-text-corpus-reader-of-nltk/
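Assuming the corpus has been fetched via a one-time nltk.download("reuters"), it can be browsed directly through NLTK's corpus reader; a small sketch:

```python
from nltk.corpus import reuters

print(len(reuters.fileids()))                     # number of documents
print(reuters.categories()[:10])                  # a few topic categories
print(reuters.raw(reuters.fileids()[0])[:200])    # start of the first article
```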

Topic Modeling using LDA

This is a nice article that has most of what is needed: https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

LDA Explained (Briefly)

Latent Dirichlet Allocation (LDA) was created by David Blei, Andrew Ng, and Michael Jordan in 2003, see their paper titled "Latent Dirichlet Allocation" in the Journal of Machine Learning Research, pp 993--1022.

The simplest way to think about LDA is as a probability model that connects documents with words and topics. The basic objects are the documents in the corpus and the words they contain.

Next, we connect the above objects to $K$ topics, indexed by $l$, i.e., $t_l$. We will see that LDA is encapsulated in two matrices: Matrix $A$ and Matrix $B$.

Matrix $A$: Connecting Documents with Topics

Matrix $B$: Connecting Words with Topics

Distribution of Topics in a Document

$$ p(\theta | \alpha) = \frac{\Gamma(\sum_{l=1}^K \alpha_l)}{\prod_{l=1}^K \Gamma(\alpha_l)} \; \prod_{l=1}^K \theta_l^{\alpha_l - 1} $$

where $\Gamma(\cdot)$ is the Gamma function.

Note: https://en.wikipedia.org/wiki/Dirichlet_distribution

Distribution of Words and Topics for a Document

$$ p(\theta, {\bf t}, {\bf w}) = p(\theta | \alpha) \prod_{l=1}^K p(t_l | \theta) p(w_l | t_l) $$

$$ p({\bf w}) = \int p(\theta | \alpha) \left(\prod_{l=1}^K \sum_{t_l} p(t_l | \theta) p(w_l | t_l)\; \right) d\theta $$

Likelihood of the entire Corpus

$$ p(D) = \prod_{j=1}^M \int p(\theta_j | \alpha) \left(\prod_{l=1}^K \sum_{t_{jl}} p(t_l | \theta_j) p(w_l | t_l)\; \right) d\theta_j $$
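A minimal LDA sketch with the gensim package on a made-up corpus; real applications would first remove stopwords, stem, and tune the number of topics.

```python
from gensim import corpora, models

docs = [["stocks", "rose", "earnings", "strong"],
        ["rates", "fell", "bond", "yields"],
        ["earnings", "beat", "stocks", "rallied"],
        ["bond", "rates", "rose", "yields"]]

dictionary = corpora.Dictionary(docs)             # word <-> id mapping
bow = [dictionary.doc2bow(d) for d in docs]       # bag-of-words corpus

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
for topic in lda.print_topics():
    print(topic)                                  # top words per topic
```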

Word2Vec: Word Embeddings

In a number of Natural Language Processing (NLP) applications, classic language-modeling methods that represent words as high-dimensional, sparse vectors have been replaced by neural language models that learn word embeddings, i.e., low-dimensional representations of words, often through the use of neural networks.

Words have multiple degrees of similarity, such as syntactic similarity and semantic similarity. Word embeddings have been found to pick up both types of similarity. These similarities have been found to support algebraic operations on words, as in the famous word2vec example where vector("King") - vector("Man") + vector("Woman") is closest to vector("Queen").

The output of word2vec is an input to many natural language models using deep learning, such as sentence completion, parsing, information retrieval, document classification, question answering, and named entity recognition.

There are two approaches to word2vec, discussed below.

Skip-gram

Given a sequence of words $w_1,...,w_T$ (i.e., $T$ terms), the quantity of interest in the skip-gram model is the conditional probability:

$$ p(w_{t+j} | w_t), \quad -c \leq j \leq c, \; j \neq 0 $$

Here $c$ is a window around the current word $w_t$ in the text, and $c$ may also be a function of $w_t$. This is what is depicted in the graphic above. The objective is to maximize the log conditional probability:

$$ L = \frac{1}{T} \sum_{t=1}^T \left[\sum_{-c \leq j \leq c, \; j \neq 0} \log p(w_{t+j} | w_t) \right] $$

Assume a vocabulary of $W$ words. Let each word $w$ be represented by a word vector $v(w)$ of dimension $N$. Then the skip-gram model assumes that the conditional probabilities come from a softmax function as follows:

$$ p(w_j | w_i) = \frac{\exp(v(w_j)^\top v(w_i))}{\sum_{w=1}^W \exp(v(w)^\top v(w_i))} $$

Computing this softmax requires gradients for every element of the vectors $v(w)$, which is onerous and of order $O(W)$. Instead of the full softmax, hierarchical softmax is used, which is of order $O(\log W)$. An alternative is negative sampling, a simplification of Noise Contrastive Estimation (NCE), which approximates the softmax with logistic functions. Explanation of these approximations and speedups is beyond the scope of these notes, but the reader is referred to Mikolov et al (2013), or see this simpler exposition.

Training is done with a neural net containing a single hidden layer of dimension $N$. The input and output layers are of the same size, so this is essentially an autoencoder.
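A short sketch of training a skip-gram model with gensim (gensim 4.x argument names assumed; sg=1 selects skip-gram, vector_size is the embedding dimension $N$, window is the context size $c$). The toy sentences are made up.

```python
from gensim.models import Word2Vec

sentences = [["stocks", "rose", "on", "strong", "earnings"],
             ["bond", "yields", "fell", "as", "rates", "dropped"],
             ["earnings", "growth", "lifted", "stocks"]]

model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)
print(model.wv["stocks"][:5])            # first few embedding dimensions
print(model.wv.most_similar("stocks"))   # nearest words in embedding space
```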

GloVe (Global Vectors)

A matrix factorization representation of word2vec from Stanford is called GloVe. This is unsupervised learning. I highly recommend reading this page; it is one of the most beautiful and succinct presentations of an algorithmic idea I have encountered.

This is not the first matrix factorization idea; Latent Semantic Analysis (LSA) has been around for some time. LSA factorizes a term-document matrix into a lower-dimensional representation. But it has its drawbacks and does poorly on word analogy tasks, for which word2vec does much better. GloVe is an approach that marries the ideas in LSA and word2vec in a computationally efficient manner.

As usual, the output from a word embedding model is an embedding matrix $E$ of size $V \times N$, where $V$ is the size of the vocabulary (number of words) and $N$ is the dimension of the embedding.

GloVe is based on the co-occurrence matrix of words $X$ of size $V \times V$. This matrix depends on the "window" chosen for co-occurrence. The matrix values are also scaled depending on the closeness within the window, resulting in all values in the matrix in $(0,1)$. This matrix is then factorized to get the embedding matrix $E$. This is an extremely high-level sketch of the GloVe algorithm. See Jeffrey Pennington, Richard Socher, and Christopher D. Manning (2014); pdf.

Technically, GloVe is faster than word2vec, but requires more memory. Also, once the word co-occurrence matrix has been prepared, then $E$ can be quickly generated for any chosen $N$, whereas in word2vec, an entirely fresh neural net has to be estimated, because the hidden layer of the autoencoder has changed in dimension.

For large text corpora, one can intuitively imagine that word embeddings should be roughly similar if the texts are from the same domain. This suggests that pre-trained embeddings $E$ might be a good way to go for NLP applications.

Given the word co-occurrence matrix $X$, let $X_i = \sum_k X_{ik}$ be the number of times any word occurs in the context of word $i$. We can then define the conditional probability, also known as co-occurrence probabilities.

$$ P_{ij} = P(j | i) = \frac{X_{ij}}{X_i} $$

What's the difference between a word co-occurring with another word and a word appearing "in the context of" another word? "In the context of" is represented by a conditional probability, whereas co-occurrence is an unconditional count.

For a sample word $k$, the ratio $P_{ik}/P_{jk}$ will be large if word $i$ occurs more in the context of $k$ than does word $j$. If both words $i$ and $j$ are not related to word $k$, then we'd expect this ratio to be close to 1. This suggests that the variable we should model is the ratio of co-occurrence probabilities rather than the probabilities themselves. Since these ratios are functions of three words, we may write $$ F(w_i,w_j,w_k) = \frac{P_{ik}}{P_{jk}} $$ where $w_i, w_j, w_k \in {\cal R}^d$ are word vectors.

This function may depend on a parameter set. It is desired to have the following properties:

The embeddings idea may be extended to many other cases where co-occurrences exist. For example, user search histories over AirBnB in a search session may be converted into embeddings, see Grbovic and Cheng (2018); pdf.
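On the practical point above about pre-trained embeddings: gensim's downloader can fetch standard pre-trained GloVe vectors. The model name below is one of its pre-packaged options and the download is sizable; this is a sketch, not the only way to obtain GloVe vectors.

```python
import gensim.downloader as api

# Downloads (once) and loads 100-dimensional GloVe vectors trained on Wikipedia.
glove = api.load("glove-wiki-gigaword-100")

print(glove.most_similar("bank")[:3])
# The classic analogy: King - Man + Woman is near Queen.
print(glove.most_similar(positive=["king", "woman"], negative=["man"])[:3])
```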

word2vec fitting with neural nets

We are now ready to discuss the actual fitting of the word2vec model with neural nets. The neural net is simple. As before, assume that the vocabulary is of size $V$ and the embedding is of size $N$. To make things more concrete, let $V=10,000$ and $N=100$. The neural net will have $V$ nodes in the input and output layers, and $N$ nodes in the single hidden layer. If you are familiar with autoencoders, then this is a common NN of that type. The number of parameters (ignoring bias terms) is $VN = 1,000,000$ for the hidden layer and another $NV = 1,000,000$ for the output layer, i.e., about 2 million parameters to be fit. This is a fairly large NN.

The inputs to the model are based on a window of text around the "target" word, $w$. Suppose the window is of size $c=2$, then $w$ may have up to 4 possible co-occurrence words---2 ahead (denoted $w_1, w_2$) and 2 before (denoted $w_{-1},w_{-2}$) in the window. This leads to 4 rows of input data. How? The input $X$ in all 4 rows is a one-hot vector of $V-1$ zeros and a single 1 in the position indexed by $w$. The label $Y$ is a one-hot vector with $V-1$ zeros and a 1 in the position where the leading or lagging word appears. Because a large corpus will have several words, each with up to $2c$ co-occurrence words, the size of the data may also run into the millions or even billions.

The coefficient matrix for the hidden layer is of dimension $V \times N$---that is, for every word in the vocabulary, we have an $N$-vector representing it. This, indeed, is the entire matrix of word embeddings. Just as an autoencoder compresses the original input, in this case the neural net projects all the words onto an $N$-dimensional space. We are interested here in the weights matrix of the hidden layer, not the predicted output itself.

However, fitting this NN is no easy task, with millions of parameters, and possibly, billions of observations of data. To reduce the computational load, two simple additional techniques (hacks) are applied.

  1. Subsampling: We get rid of words that occur too frequently. So we only keep a subsample of words that occur less often. There is a formula for this. Let $\gamma(w)$ be the percentage of word count for $w$ among all words. This is likely to be a small number. We then sort the words based on a function of $\gamma(w)$ and put in a cutoff, where only words with smaller $\gamma(w)$ are retained. This eliminates common words like "the" and "this" and reduces computation time without much impact on the final word embeddings.

  2. Negative sampling: NNs are usually fitted in batches. In each batch of data all the weights (parameters) are updated. This can be quite costly in computation, since it would mean updating all $VN=1$ million weights in the hidden layer. Instead, we only update the weights for the target word $w$ and for 5-10 words that do not co-occur with $w$. We call these words "negatives", hence the terminology of negative sampling. Negative words are sampled with a probability that is higher if they occur more frequently in the sample.

Both these approaches work well and have resulted in a great speedup in fitting the word2vec model. pdf

Doc2Vec

This algorithm is analogous to word2vec, but instead of word embeddings it generates document embeddings. Documents that are semantically similar will be closer to each other in the embedding space.

Let's use the Reuters news corpus as an example.
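A minimal gensim Doc2Vec sketch (gensim 4.x API assumed); a made-up toy corpus stands in here for the Reuters text.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document is wrapped in a TaggedDocument and receives its own embedding.
docs = [TaggedDocument(words=["stocks", "rose", "on", "earnings"], tags=[0]),
        TaggedDocument(words=["bond", "yields", "fell", "sharply"], tags=[1]),
        TaggedDocument(words=["earnings", "lifted", "stocks", "higher"], tags=[2])]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)
print(model.dv[0][:5])                              # embedding of document 0
print(model.dv.most_similar(positive=[0], topn=2))  # most similar documents
```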

RegTech

https://app.box.com/s/ar58s7z253wdgy9ceiq9w0rm7h52z4p3

Research in Finance

https://srdas.github.io/MLBook/TextAnalytics.html#research-in-finance