Text Analytics (Advanced Topics)

Sanjiv R. Das

Basic Textual Data

We use my bio page again as a test bed for basic analysis.

Basic Text Cleanup

We repeat the functions we had developed earlier.

Lemmatization

Additional reading: https://www.datacamp.com/community/tutorials/stemming-lemmatization-python

Non-negative Matrix Factorization (NMF)

NMF is used to break a matrix $X$ into a product of two matrices, $A$ and $B$, such that the resulting matrices have components that are all non-negative. Hence, the nomenclature is taken literally.

$$ X = A \cdot B $$

where $X$ is of dimension $m \times n$, $A$ is $m \times k$, and $B$ is $k \times n$.

We may not be able to determine $A$ and $B$ precisely, so we obtain the "best fit" to the following problem:

$$ \min_{A,B} \parallel X-A \cdot B \parallel_F^2 $$

subject to $A,B \geq 0$ element-wise. Here $\parallel Y \parallel_F^2$ is the Frobenius norm, i.e., the sum of squared elements of matrix $Y$.

Iterative Solution for NMF

How do we solve for the elements of $A$ and $B$? There is an obvious and intuitive element by element update that can be done as follows:

$$ A_{[i,j]} \leftarrow A_{[i,j]} \odot \frac{[XB^\top]_{[i,j]}}{[ABB^\top]_{[i,j]}} $$$$ B_{[i,j]} \leftarrow B_{[i,j]} \odot \frac{[A^\top]X_{[i,j]}}{[A^\top AB]_{[i,j]}} $$

If $X_{[i,j]} > (AB)_{[i,j]}$ then update weight on the right-side of the Hadamard product ($\odot$, element-wise multiplication) will be greater than 1, else less than 1. This update scheme is guaranteed to keep reducing the loss function and also make sure $A_{[i,j]}, B_{[i,j]} \geq 0$ as long as the initial guesses for $A,B$ are non-negative.

NMF by gradient descent

We can use gradient descent to implement the previous scheme with the following iterative rule, which does not need element by element updating:

$$ A \leftarrow A - \eta_A [A B B^\top - X B^\top] \in {\cal R}^{m \times k} $$

and

$$ B \leftarrow B - \eta_B [A^\top A B - A^\top X] \in {\cal R}^{k \times n} $$

The former update equation is applied to update $A$, holding $B$ constant and the latter equation is applied to $B$, holding $A$ constant. As usual $\eta_i, i=\{A,B\}$ are the learning rates used in gradient descent.

Singular Value Decomposition (SVD)

SVD is a generalization of eigenvalue decomposition. We can obtain decompositions of non-square matrices. Consider the decomposition of the Term-Document Matrix $M$ of size $m \times n$. The canonical decomposition is as follows:

$$ M = T \cdot S \cdot D^\top $$

where $T$ is $m \times n$, $S$ is $n \times n$, and $D^\top$ is $n \times n$. $T$ and $D$ are orthonormal to each other. $S$ is the “singular values” matrix, i.e., a diagonal matrix with singular values on the diagonal. These values denote the relative importance of the terms in the TDM.

Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA) is an approach for reducing the dimension of the Term-Document Matrix (TDM), or the corresponding Document-Term Matrix (DTM), in general used interchangeably, unless a specific one is invoked. Dimension reduction of the TDM offers two benefits:

  1. The DTM is usually a sparse matrix, and sparseness means that our algorithms have to work harder on missing data, which is clearly wasteful. Some of this sparseness is attenuated by applying LSA to the TDM.

  2. The problem of synonymy also exists in the TDM, which usually contains thousands of terms (words). Synonymy arises becauses many words have similar meanings, i.e., redundancy exists in the list of terms. LSA mitigates this redundancy, as we shall see through the ensuing anaysis of LSA. See: http://www.oxfordbibliographies.com/view/document/obo-9780199772810/obo-9780199772810-0220.xml

While not precisely the same thing, think of LSA in the text domain as analogous to PCA in the data domain.

How is LSA implemented using SVD?

LSA is the application of Singular Value Decomposition (SVD) to the TDM, extracted from a text corpus. Define the TDM to be a matrix $M \in R^{m \times n}$, where $m$ is the number of terms and $n$ is the number of documents.

The SVD of matrix M is given by

$$ M=T \cdot S \cdot D^\top $$

where $T \in R^{m \times n}$ and $D \in R^{n \times n}$ are orthonormal to each other, and $S ∈\in R^{n \times n}$ is the “singluar values” matrix, i.e., a diagonal matrix with singular values on the diagonal. These values denote the relative importance of the terms in the TDM.

Example in R

LSA and Singular Value Decomposition (SVD)

SVD tries to connect the correlation matrix of terms ($M \cdot M^\top$) with the correlation matrix of documents ($M^\top \cdot M$) through the singular matrix.

To see this connection, note that matrix $T$ contains the eigenvectors of the correlation matrix of terms. Likewise, the matrix $D$ contains the eigenvectors of the correlation matrix of documents. To see this, let’s compute

Dimension reduction of the TDM via LSA

If we wish to reduce the dimension of the latent semantic space to $k<n$ then we use only the first $k$ eigenvectors. The lsa function does this automatically.

We call LSA and ask it to automatically reduce the dimension of the TDM using a built-in function dimcalc_share.

We can see that the dimension has been reduced from $n=4$ to $n=2$. The output is shown for both the term matrix and the document matrix, both of which have only two columns. Think of these as the two “principal semantic components” of the TDM.

Compare the output of the LSA to the eigenvectors above to see that it is exactly that. The singular values in the ouput are connected to SVD as follows.

LSA and SVD: the connection?

First of all we see that the lsa function is nothing but the svd function in base R.

The output here is the same as that of LSA except it is provided for $n=4$. So we have four columns in $T$ and $D$ rather than two. Compare the results here to the previous two slides to see the connection.

What is the rank of the TDM?

We may reconstruct the TDM using the result of the LSA.

We see the new TDM after the LSA operation, it has non-integer frequency counts, but it may be treated in the same way as the original TDM. The document vectors populate a slightly different hyperspace.

LSA reduces the rank of the correlation matrix of terms $M \cdot M^\top$ to $n=2$. Here we see the rank before and after LSA.

Classification and Word Embeddings using text2vec in R

Text2vec is an excellent implementation in R of much of the functionality we have seen so far. It is written by Dmitriy Selivanov in C++ and is extremely fast. See: http://text2vec.org/

https://srdas.github.io/MLBook/Text2Vec.html

The example below is taken from the sample code here: http://text2vec.org/vectorization.html

The processing steps are:

  1. Lower case the documents and then tokenize them.
  2. Create an iterator. (Step 1 can also be done while making the iterator, as the itoken function supports this, see below.)
  3. Use the iterator to create the vocabulary, which is nothing but the list of unique words across all documents.
  4. Vectorize the vocabulary, i.e., create a data structure of words that can be used later for matrix factorizations needed for various text analytics.
  5. Using the iterator and vectorized vocabulary, form text matrices, such as the Document-Term Matrix (DTM) or the Term Co-occurrence Matrix (TCM).
  6. Use the TCM or DTM to undertake various text analytics such as classification, word2vec, topic modeling using LDA (Latent Dirichlet Allocation), and LSA (Latent Semantic Analysis).

Preprocessing and tokenization

Iterate and Vectorize

An iterator is an object that traverses a container. A list is iterable. See: https://www.r-bloggers.com/iterators-in-r/

Vectorize creates words vectors

Document Term Matrix (DTM)

N-Grams

n-grams are phrases made by coupling words that co-occur. For example, a bi-gram is a set of two consecutive words.

This creates a vocabulary of both single words and bi-grams. Notice how large it is compared to the unigram vocabulary from earlier. Because of this we go ahead and prune the vocabulary first, as this will speed up computation. We redo the classification with n-grams.

TF-IDF

We have seen the TF-IDF discussion earlier, and here we see how to implement it using the text2vec package.

Word Embeddings

From: http://stackoverflow.com/questions/39514941/preparing-word-embeddings-in-text2vec-r-package

Create the TCM (Term Co-occurrence Matrix).

Word Embeddings (in particular, the word2vec algorithm, has been used to mine a large number of research abstracts to uncover new knowledge. See "With little training, machine-learning algorithms can uncover hidden scientific knowledge" in Nature (2019), which describes Tshitoyan et al (2019).

GloVe

Now fit the word embeddings using GloVe See: http://nlp.stanford.edu/projects/glove/

Example: Show the ten words close to the word “man” and “woman”.

This is a very useful feature of word embeddings, as it is often argued that in the embedded space, words that are close to each other, also tend to have semantic similarities, even though the closeness is computed simply by using their co-occurence frequencies.

word2vec explained

For more details, see: https://www.quora.com/How-does-word2vec-work

A geometrical interpretation: word2vec is a shallow word embedding model. This means that the model learns to map each discrete word id (0 through the number of words in the vocabulary) into a low-dimensional continuous vector-space from their distributional properties observed in some raw text corpus. Geometrically, one may interpret these vectors as tracing out points on the outside surface of a manifold in the “embedded space”. If we initialize these vectors from a spherical gaussian distribution, then you can imagine this manifold to look something like a hypersphere initially.

Let us focus on the CBOW for now. CBOW is trained to predict the target word t from the contextual words that surround it, c, i.e. the goal is to maximize P(t | c) over the training set. I am simplifying somewhat, but you can show that this probability is roughly inversely proportional to the distance between the current vectors assigned to t and to c. Since this model is trained in an online setting (one example at a time), at time T the goal is therefore to take a small step (mediated by the “learning rate”) in order to minimize the distance between the current vectors for t and c (and thereby increase the probability P(t |c)). By repeating this process over the entire training set, we have that vectors for words that habitually co-occur tend to be nudged closer together, and by gradually lowering the learning rate, this process converges towards some final state of the vectors.

By the Distributional Hypothesis (Firth, 1957; see also the Wikipedia page on Distributional semantics), words with similar distributional properties (i.e. that co-occur regularly) tend to share some aspect of semantic meaning. For example, we may find several sentences in the training set such as “citizens of X protested today” where X (the target word t) may be names of cities or countries that are semantically related.

You can therefore interpret each training step as deforming or morphing the initial manifold by nudging the vectors for some words somewhat closer together, and the result, after projecting down to two dimensions, is the familiar t-SNE visualizations where related words cluster together (e.g. Word representations for NLP).

For the skipgram, the direction of the prediction is simply inverted, i.e. now we try to predict P(citizens | X), P(of | X), etc. This turns out to learn finer-grained vectors when one trains over more data. The main reason is that the CBOW smooths over a lot of the distributional statistics by averaging over all context words while the skipgram does not. With little data, this “regularizing” effect of the CBOW turns out to be helpful, but since data is the ultimate regularizer the skipgram is able to extract more information when more data is available.

There’s a bit more going on behind the scenes, but hopefully this helps to give a useful geometrical intuition as to how these models work.

Topic Analysis using text2vec

Uses Latent Dirichlet Allocation.

Entity Extraction

Entities are elements of the data that fall into specific pre-defined categories. They are part of the data model. For instance, when extracting data from the SEC, we get various entity types: companies, directors, loans, securities, products, accounting line items, etc. We need to identify these when they appear in financial text.

EE is often also called Named Entity Recognition (NER).

Using spaCy

https://spacy.io/

https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da

Stochastic Network Embeddings (t-SNE)

Original paper: https://lvdmaaten.github.io/tsne/

From: https://github.com/oreillymedia/t-SNE-tutorial

A popular dimensonality reduction algorithm: t-distributed stochastic neighbor embedding (t-SNE). Developed by Laurens van der Maaten and Geoffrey Hinton (see the original paper here: http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf), this algorithm has been successfully applied to many real-world datasets.

A nice article on visualizing high-dimensional data sets: https://medium.com/@luckylwk/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b; https://drive.google.com/file/d/1Cucl4HYtYBgS12-BoslSuaOvFTCghqpe/view?usp=sharing

The good thing is that we can use the same model as we used in word2vec and feed it into t-SNE in the following.

Reuter's news corpus for t-SNE