Sanjiv R. Das
%pylab inline
import pandas as pd
import os
from ipypublish import nb_setup
%load_ext rpy2.ipython
%load_ext RWinOut #if using windows
Populating the interactive namespace from numpy and matplotlib The rpy2.ipython extension is already loaded. To reload it, use: %reload_ext rpy2.ipython
We use my bio page again as a test bed for basic analysis.
import requests
url = 'http://srdas.github.io/bio-candid.html'
f = requests.get(url)
text = f.text
f.close()
#Create a set of small docs = one line each in case needed for later examples
lines = text.splitlines()
print('Number of lines =',len(lines))
print(lines[3])
Number of lines = 78 Sanjiv Das is the William and Janice Terry Professor of Finance and
#Clean up text
from bs4 import BeautifulSoup
bio = BeautifulSoup(text,'lxml').get_text()
print(bio)
Sanjiv Das is the William and Janice Terry Professor of Finance and Data Science at Santa Clara University's Leavey School of Business. He previously held faculty appointments as Professor at Harvard Business School and UC Berkeley. He holds post-graduate degrees in Finance (M.Phil and Ph.D. from New York University), Computer Science (M.S. from UC Berkeley), an MBA from the Indian Institute of Management, Ahmedabad, B.Com in Accounting and Economics (University of Bombay, Sydenham College), and is also a qualified Cost and Works Accountant (AICWA). He is a senior editor of The Journal of Investment Management and Associate Editor of Management Science and other academic journals. Prior to being an academic, he worked in the derivatives business in the Asia-Pacific region as a Vice-President at Citibank. His current research interests include: machine learning, social networks, derivatives pricing models, portfolio theory, the modeling of default risk, systemic risk, and venture capital. He has published over a hundred articles in academic journals, and has won numerous awards for research and teaching. His recent book "Derivatives: Principles and Practice" was published in May 2010 (second edition 2016). Sanjiv Das: A Short Academic Life History After loafing and working in many parts of Asia, but never really growing up, Sanjiv moved to New York to change the world, hopefully through research. He graduated in 1994 with a Ph.D. from NYU, and since then spent five years in Boston, and now lives in San Jose, California. Sanjiv loves animals, places in the world where the mountains meet the sea, riding sport motorbikes, reading, gadgets, science fiction movies, and writing cool software code. When there is time available from the excitement of daily life, Sanjiv writes academic papers, which helps him relax. Always the contrarian, Sanjiv thinks that New York City is the most calming place in the world, after California of course. Sanjiv is now a Professor of Finance at Santa Clara University. He came to SCU from Harvard Business School and spent a year at UC Berkeley. In his past life in the unreal world, Sanjiv worked at Citibank, N.A. in the Asia-Pacific region. He takes great pleasure in merging his many previous lives into his current existence, which is incredibly confused and diverse. Sanjiv's research style is instilled with a distinct "New York state of mind" - it is chaotic, diverse, with minimal method to the madness. He has published articles on derivatives, term-structure models, mutual funds, the internet, portfolio choice, banking models, credit risk, and has unpublished articles in many other areas. Some years ago, he took time off to get another degree in computer science at Berkeley, confirming that an unchecked hobby can quickly become an obsession. There he learnt about the fascinating field of Randomized Algorithms, skills he now applies earnestly to his editorial work, and other pursuits, many of which stem from being in the epicenter of Silicon Valley. Coastal living did a lot to mold Sanjiv, who needs to live near the ocean. The many walks in Greenwich village convinced him that there is no such thing as a representative investor, yet added many unique features to his personal utility function. He learnt that it is important to open the academic door to the ivory tower and let the world in. Academia is a real challenge, given that he has to reconcile many more opinions than ideas. 
He has been known to have turned down many offers from Mad magazine to publish his academic work. As he often explains, you never really finish your education - "you can check out any time you like, but you can never leave." Which is why he is doomed to a lifetime in Hotel California. And he believes that, if this is as bad as it gets, life is really pretty good.
We repeat the functions we had developed earlier.
import string
def removePuncStr(s):
    for c in string.punctuation:
        s = s.replace(c, " ")
    return s

def removePunc(text_array):
    return [removePuncStr(h) for h in text_array]

def removeNumbersStr(s):
    for c in range(10):
        n = str(c)
        s = s.replace(n, " ")
    return s

def removeNumbers(text_array):
    return [removeNumbersStr(h) for h in text_array]
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
def stopText(text_array):
    stop_words = set(stopwords.words('english'))
    stopped_text = []
    for h in text_array:
        words = word_tokenize(h)
        h2 = ''
        for w in words:
            if w not in stop_words:
                h2 = h2 + ' ' + w
        stopped_text.append(h2)
    return stopped_text
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
def stemText(text_array):
    stemmed_text = []
    for h in text_array:
        words = word_tokenize(h)
        h2 = ''
        for w in words:
            h2 = h2 + ' ' + PorterStemmer().stem(w)
        stemmed_text.append(h2)
    return stemmed_text
bio2 = bio.splitlines()
bio2[:10]
['', '', '', 'Sanjiv Das is the William and Janice Terry Professor of Finance and', "Data Science at Santa Clara University's Leavey School of Business. He", 'previously held faculty appointments as Professor at Harvard Business', 'School and UC Berkeley. He holds post-graduate degrees in Finance', '(M.Phil and Ph.D. from New York University), Computer Science', '(M.S. from UC Berkeley), an MBA from the Indian Institute of', 'Management, Ahmedabad, B.Com in Accounting and Economics (University']
#Clean up all lines in one set of nested functions
bio2 = stemText(stopText(removeNumbers(removePunc(bio2))))
bio2 = [j for j in bio2 if len(j)>0]
bio2
[' sanjiv da william janic terri professor financ', ' data scienc santa clara univers leavey school busi He', ' previous held faculti appoint professor harvard busi', ' school UC berkeley He hold post graduat degre financ', ' M phil Ph D new york univers comput scienc', ' M S UC berkeley mba indian institut', ' manag ahmedabad B com account econom univers', ' bombay sydenham colleg also qualifi cost work', ' account aicwa He senior editor the journal invest', ' manag associ editor manag scienc', ' academ journal prior academ work', ' deriv busi asia pacif region vice presid', ' citibank hi current research interest includ machin learn', ' social network deriv price model portfolio theori', ' model default risk system risk ventur capit He', ' publish hundr articl academ journal', ' numer award research teach hi recent book', ' deriv principl practic publish may', ' second edit', ' sanjiv da A short academ life histori', ' after loaf work mani part asia never realli', ' grow sanjiv move new york chang world hope', ' research He graduat Ph D nyu', ' sinc spent five year boston live san jose', ' california sanjiv love anim place world', ' mountain meet sea ride sport motorbik read gadget', ' scienc fiction movi write cool softwar code when', ' time avail excit daili life sanjiv write', ' academ paper help relax alway contrarian sanjiv', ' think new york citi calm place world', ' california cours', ' sanjiv professor financ santa clara univers He came', ' scu harvard busi school spent year UC berkeley In', ' past life unreal world sanjiv work citibank N A', ' asia pacif region He take great pleasur merg mani', ' previou live current exist incred confus', ' divers', ' sanjiv research style instil distinct new york state', ' mind chaotic divers minim method mad He', ' publish articl deriv term structur model mutual', ' fund internet portfolio choic bank model credit risk', ' unpublish articl mani area some year ago took', ' time get anoth degre comput scienc berkeley', ' confirm uncheck hobbi quickli becom obsess', ' there learnt fascin field random algorithm', ' skill appli earnestli editori work', ' pursuit mani stem epicent silicon', ' valley', ' coastal live lot mold sanjiv need live near', ' ocean the mani walk greenwich villag convinc', ' thing repres investor yet ad mani uniqu', ' featur person util function He learnt', ' import open academ door ivori tower let world', ' academia real challeng given reconcil mani', ' opinion idea He known turn mani', ' offer mad magazin publish academ work As often', ' explain never realli finish educ check', ' time like never leav which doom', ' lifetim hotel california and believ', ' bad get life realli pretti good']
Stemming reduces words to their root form. The root form may not be an actual word in the language being processed. The goal of stemming is to reduce the many surface forms of a word to a single form, so that when the term-document matrix is constructed the same word does not appear as several different terms, which would muddy the textual analysis being undertaken.
Stemming is a hard problem, and a long-standing solution was developed by Porter (1980) that has stood the test of time: https://tartarus.org/martin/PorterStemmer/. The Lancaster stemmer, developed in 1990, is more aggressive; its source code is quite economical and you can see it here: https://www.nltk.org/_modules/nltk/stem/lancaster.html
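As a quick illustration of the difference in aggressiveness, the short sketch below (my addition, using NLTK's PorterStemmer and LancasterStemmer) stems a few words with both algorithms.

#A quick comparison of Porter vs. Lancaster stemming (illustrative sketch)
from nltk.stem import PorterStemmer, LancasterStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()
for w in ['university', 'derivatives', 'obsession', 'maximum']:
    print(w, '->', porter.stem(w), '|', lancaster.stem(w))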
Lemmatization is the same as stemming with the additional constraint that the root word is present in the language's dictionary. NLTK uses the WordNet lemmatizer. (WordNet is a widely used word corpus also known as a "lexical database".) See: https://wordnet.princeton.edu/
Additional reading: https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
def lemmText(text_array):
    WNlemmatizer = WordNetLemmatizer()
    lemmatized_text = []
    for h in text_array:
        words = word_tokenize(h)
        h2 = ''
        for w in words:
            h2 = h2 + ' ' + WNlemmatizer.lemmatize(w)
        lemmatized_text.append(h2)
    return lemmatized_text
#Example
temp = stopText(removeNumbers(removePunc(bio.splitlines()[15:22])))
print('Original: ',temp)
bio_lemm = lemmText(temp)
print('Lemmatized: ',bio_lemm)
bio_stem = stemText(temp)
print('Stemmed: ',bio_stem)
Original: [' Citibank His current research interests include machine learning', ' social networks derivatives pricing models portfolio theory', ' modeling default risk systemic risk venture capital He', ' published hundred articles academic journals', ' numerous awards research teaching His recent book', ' Derivatives Principles Practice published May', ' second edition'] Lemmatized: [' Citibank His current research interest include machine learning', ' social network derivative pricing model portfolio theory', ' modeling default risk systemic risk venture capital He', ' published hundred article academic journal', ' numerous award research teaching His recent book', ' Derivatives Principles Practice published May', ' second edition'] Stemmed: [' citibank hi current research interest includ machin learn', ' social network deriv price model portfolio theori', ' model default risk system risk ventur capit He', ' publish hundr articl academ journal', ' numer award research teach hi recent book', ' deriv principl practic publish may', ' second edit']
Non-negative matrix factorization (NMF) breaks a matrix $X$ into a product of two matrices, $A$ and $B$, whose elements are all non-negative. Hence, the nomenclature is to be taken literally.
$$ X = A \cdot B $$

where $X$ is of dimension $m \times n$, $A$ is $m \times k$, and $B$ is $k \times n$.
We may not be able to determine $A$ and $B$ precisely, so we obtain the "best fit" to the following problem:
$$ \min_{A,B} \parallel X - A \cdot B \parallel_F^2 $$

subject to $A,B \geq 0$ element-wise. Here $\parallel Y \parallel_F^2$ is the squared Frobenius norm, i.e., the sum of squared elements of matrix $Y$.
How do we solve for the elements of $A$ and $B$? There is an obvious and intuitive element by element update that can be done as follows:
$$ A_{[i,j]} \leftarrow A_{[i,j]} \odot \frac{[XB^\top]_{[i,j]}}{[ABB^\top]_{[i,j]}} $$

$$ B_{[i,j]} \leftarrow B_{[i,j]} \odot \frac{[A^\top X]_{[i,j]}}{[A^\top AB]_{[i,j]}} $$

If $X_{[i,j]} > (AB)_{[i,j]}$, then the update weight on the right side of the Hadamard product ($\odot$, element-wise multiplication) will be greater than 1, else less than 1. This update scheme is guaranteed to keep reducing the loss function, and it keeps $A_{[i,j]}, B_{[i,j]} \geq 0$ as long as the initial guesses for $A,B$ are non-negative.
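A minimal NumPy sketch of these multiplicative updates is shown below (my illustration on small random matrices, not the sklearn implementation used later; the small constant in the denominators guards against division by zero).

#Sketch: NMF via multiplicative updates on a small random matrix
import numpy as np
m, n, k = 6, 4, 2
rng = np.random.RandomState(0)
X = rng.rand(m, n)
A = rng.rand(m, k)   #non-negative initial guesses
B = rng.rand(k, n)
for _ in range(200):
    A = A * (X @ B.T) / (A @ B @ B.T + 1e-9)   #update A element-wise
    B = B * (A.T @ X) / (A.T @ A @ B + 1e-9)   #update B element-wise
print(np.linalg.norm(X - A @ B))   #Frobenius norm of the residual keeps falling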
We can use gradient descent to implement the previous scheme with the following iterative rule, which does not need element by element updating:
$$ A \leftarrow A - \eta_A [A B B^\top - X B^\top] \in {\cal R}^{m \times k} $$

and

$$ B \leftarrow B - \eta_B [A^\top A B - A^\top X] \in {\cal R}^{k \times n} $$

The former equation updates $A$, holding $B$ constant, and the latter updates $B$, holding $A$ constant. As usual, $\eta_i, i \in \{A,B\}$ are the learning rates used in gradient descent.
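The same toy example can be run with the gradient-descent updates; the sketch below is my illustration, and the clipping at zero is an addition to preserve non-negativity, which the raw gradient steps do not guarantee on their own.

#Sketch: NMF via (projected) gradient descent on a small random matrix
import numpy as np
m, n, k = 6, 4, 2
rng = np.random.RandomState(1)
X = rng.rand(m, n)
A = rng.rand(m, k)
B = rng.rand(k, n)
eta_A = eta_B = 0.01   #learning rates
for _ in range(2000):
    A = np.maximum(A - eta_A * (A @ B @ B.T - X @ B.T), 0)   #update A, holding B fixed
    B = np.maximum(B - eta_B * (A.T @ A @ B - A.T @ X), 0)   #update B, holding A fixed
print(np.linalg.norm(X - A @ B))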
#Example
bio2 = bio.splitlines()
bio2 = [j for j in bio2 if len(j)>0 ]
#Get TFIDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()  #documents are passed to fit_transform below, not to the constructor
tfs = tfidf.fit_transform(bio2)
# Make TDM
tdm_mat = tfs.toarray().T
print(tdm_mat.shape)
tdm_mat
(326, 60)
array([[0. , 0. , 0. , ..., 0. , 0. , 0. ], [0. , 0. , 0. , ..., 0. , 0. , 0. ], [0. , 0. , 0. , ..., 0. , 0. , 0. ], ..., [0. , 0. , 0. , ..., 0. , 0. , 0. ], [0. , 0. , 0. , ..., 0.49908804, 0. , 0. ], [0. , 0. , 0. , ..., 0. , 0. , 0. ]])
from sklearn.decomposition import NMF
nmf = NMF(n_components=10, solver="mu", max_iter=1000)
print(nmf)
A = nmf.fit_transform(tdm_mat)
B = nmf.components_
print(A.shape)
print(B.shape)
print(A.min(),B.min())
print((tdm_mat - A.dot(B)).max())
NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=1000, n_components=10, random_state=None, shuffle=False, solver='mu', tol=0.0001, verbose=0) (326, 10) (10, 60) 0.0 0.0 1.0
SVD is a generalization of eigenvalue decomposition: it allows us to decompose non-square matrices. Consider the decomposition of the Term-Document Matrix $M$ of size $m \times n$. The canonical decomposition is as follows:
$$ M = T \cdot S \cdot D^\top $$

where $T$ is $m \times n$, $S$ is $n \times n$, and $D^\top$ is $n \times n$. $T$ and $D$ are orthonormal to each other. $S$ is the “singular values” matrix, i.e., a diagonal matrix with singular values on the diagonal. These values denote the relative importance of the terms in the TDM.
#Example
docs = bio.splitlines()[:10]
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(docs)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
tdm = df.T
print(tdm.shape)
print(tdm)
(49, 10) 0 1 2 3 4 5 6 7 8 9 accounting 0 0 0 0 0 0 0 0 0 1 ahmedabad 0 0 0 0 0 0 0 0 0 1 an 0 0 0 0 0 0 0 0 1 0 and 0 0 0 2 0 0 1 1 0 1 appointments 0 0 0 0 0 1 0 0 0 0 as 0 0 0 0 0 1 0 0 0 0 at 0 0 0 0 1 1 0 0 0 0 berkeley 0 0 0 0 0 0 1 0 1 0 business 0 0 0 0 1 1 0 0 0 0 clara 0 0 0 0 1 0 0 0 0 0 com 0 0 0 0 0 0 0 0 0 1 computer 0 0 0 0 0 0 0 1 0 0 das 0 0 0 1 0 0 0 0 0 0 data 0 0 0 0 1 0 0 0 0 0 degrees 0 0 0 0 0 0 1 0 0 0 economics 0 0 0 0 0 0 0 0 0 1 faculty 0 0 0 0 0 1 0 0 0 0 finance 0 0 0 1 0 0 1 0 0 0 from 0 0 0 0 0 0 0 1 2 0 graduate 0 0 0 0 0 0 1 0 0 0 harvard 0 0 0 0 0 1 0 0 0 0 he 0 0 0 0 1 0 1 0 0 0 held 0 0 0 0 0 1 0 0 0 0 holds 0 0 0 0 0 0 1 0 0 0 in 0 0 0 0 0 0 1 0 0 1 indian 0 0 0 0 0 0 0 0 1 0 institute 0 0 0 0 0 0 0 0 1 0 is 0 0 0 1 0 0 0 0 0 0 janice 0 0 0 1 0 0 0 0 0 0 leavey 0 0 0 0 1 0 0 0 0 0 management 0 0 0 0 0 0 0 0 0 1 mba 0 0 0 0 0 0 0 0 1 0 new 0 0 0 0 0 0 0 1 0 0 of 0 0 0 1 1 0 0 0 1 0 ph 0 0 0 0 0 0 0 1 0 0 phil 0 0 0 0 0 0 0 1 0 0 post 0 0 0 0 0 0 1 0 0 0 previously 0 0 0 0 0 1 0 0 0 0 professor 0 0 0 1 0 1 0 0 0 0 sanjiv 0 0 0 1 0 0 0 0 0 0 santa 0 0 0 0 1 0 0 0 0 0 school 0 0 0 0 1 0 1 0 0 0 science 0 0 0 0 1 0 0 1 0 0 terry 0 0 0 1 0 0 0 0 0 0 the 0 0 0 1 0 0 0 0 1 0 uc 0 0 0 0 0 0 1 0 1 0 university 0 0 0 0 1 0 0 1 0 1 william 0 0 0 1 0 0 0 0 0 0 york 0 0 0 0 0 0 0 1 0 0
#Using SciPy
from scipy.linalg import svd
T,S,Dt = svd(tdm)
print(T.shape, S.shape, Dt.shape)
print(S)
(49, 49) (10,) (10, 10) [ 4.48419389e+00 3.40404159e+00 3.34728247e+00 3.10699004e+00 2.98154402e+00 2.62943757e+00 2.37555726e+00 8.76143088e-17 0.00000000e+00 -0.00000000e+00]
#Using SkLearn
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=5, n_iter=100, random_state=42)
svd.fit(tdm)
TruncatedSVD(algorithm='randomized', n_components=5, n_iter=100, random_state=42, tol=0.0)
print(svd.explained_variance_ratio_)
print(svd.explained_variance_ratio_.sum())
print(svd.singular_values_)
[0.11395279 0.18194832 0.18837593 0.16128021 0.14944096] 0.7949982025652943 [4.48419389 3.40404159 3.34728247 3.10699004 2.98154402]
from sklearn.utils.extmath import randomized_svd
T, S, Dt = randomized_svd(tdm.values, n_components=5, n_iter=100, random_state=42)
print(T.shape, S.shape, Dt.shape)
print(S)
(49, 5) (5,) (5, 10) [4.48419389 3.40404159 3.34728247 3.10699004 2.98154402]
Latent Semantic Analysis (LSA) is an approach for reducing the dimension of the Term-Document Matrix (TDM), or equivalently of the Document-Term Matrix (DTM); the two are generally used interchangeably unless a specific one is invoked. Dimension reduction of the TDM offers two benefits:
The DTM is usually a sparse matrix, and sparseness means our algorithms expend effort on a great many zero entries, which is clearly wasteful. Some of this sparseness is attenuated by applying LSA to the TDM.
The problem of synonymy also exists in the TDM, which usually contains thousands of terms (words). Synonymy arises because many words have similar meanings, i.e., redundancy exists in the list of terms. LSA mitigates this redundancy, as we shall see through the ensuing analysis. See: http://www.oxfordbibliographies.com/view/document/obo-9780199772810/obo-9780199772810-0220.xml
While not precisely the same thing, think of LSA in the text domain as analogous to PCA in the data domain.
LSA is the application of Singular Value Decomposition (SVD) to the TDM, extracted from a text corpus. Define the TDM to be a matrix $M \in R^{m \times n}$, where $m$ is the number of terms and $n$ is the number of documents.
The SVD of matrix M is given by
$$ M = T \cdot S \cdot D^\top $$

where $T \in R^{m \times n}$ and $D \in R^{n \times n}$ are orthonormal to each other, and $S \in R^{n \times n}$ is the “singular values” matrix, i.e., a diagonal matrix with singular values on the diagonal. These values denote the relative importance of the terms in the TDM.
%%R
system("mkdir D")
write( c("blue", "red", "green"), file=paste("D", "D1.txt", sep="/"))
write( c("black", "blue", "red"), file=paste("D", "D2.txt", sep="/"))
write( c("yellow", "black", "green"), file=paste("D", "D3.txt", sep="/"))
write( c("yellow", "red", "black"), file=paste("D", "D4.txt", sep="/"))
%%R
library(lsa)
tdm = textmatrix("D",minWordLength=1)
print(tdm)
system("rm -rf D")
RRuntimeWarning: Loading required package: SnowballC
        docs
terms    D1.txt D2.txt D3.txt D4.txt
  blue        1      1      0      0
  green       1      0      1      0
  red         1      1      0      1
  black       0      1      1      1
  yellow      0      0      1      1
SVD tries to connect the correlation matrix of terms ($M \cdot M^\top$) with the correlation matrix of documents ($M^\top \cdot M$) through the singular matrix.
To see this connection, note that matrix $T$ contains the eigenvectors of the correlation matrix of terms. Likewise, the matrix $D$ contains the eigenvectors of the correlation matrix of documents. To see this, let’s compute
%%R
et = eigen(tdm %*% t(tdm))$vectors
print(et)
ed = eigen(t(tdm) %*% tdm)$vectors
print(ed)
[,1] [,2] [,3] [,4] [,5] [1,] 0.3629044 -6.015010e-01 -0.06829369 -3.717480e-01 0.6030227 [2,] 0.3328695 1.387779e-16 -0.89347008 3.053113e-16 -0.3015113 [3,] 0.5593741 -3.717480e-01 0.31014767 6.015010e-01 -0.3015113 [4,] 0.5593741 3.717480e-01 0.31014767 -6.015010e-01 -0.3015113 [5,] 0.3629044 6.015010e-01 -0.06829369 3.717480e-01 0.6030227 [,1] [,2] [,3] [,4] [1,] 0.4570561 0.601501 -0.5395366 -0.371748 [2,] 0.5395366 0.371748 0.4570561 0.601501 [3,] 0.4570561 -0.601501 -0.5395366 0.371748 [4,] 0.5395366 -0.371748 0.4570561 -0.601501
If we wish to reduce the dimension of the latent semantic space to $k<n$ then we use only the first $k$ eigenvectors. The lsa function does this automatically.
We call LSA and ask it to automatically reduce the dimension of the TDM using a built-in function dimcalc_share.
%%R
res = lsa(tdm,dims=dimcalc_share())
print(res)
$tk [,1] [,2] blue -0.3629044 -6.015010e-01 green -0.3328695 -5.551115e-17 red -0.5593741 -3.717480e-01 black -0.5593741 3.717480e-01 yellow -0.3629044 6.015010e-01 $dk [,1] [,2] D1.txt -0.4570561 -0.601501 D2.txt -0.5395366 -0.371748 D3.txt -0.4570561 0.601501 D4.txt -0.5395366 0.371748 $sk [1] 2.746158 1.618034 attr(,"class") [1] "LSAspace"
We can see that the dimension has been reduced from $n=4$ to $n=2$. The output is shown for both the term matrix and the document matrix, both of which have only two columns. Think of these as the two “principal semantic components” of the TDM.
Compare the output of the LSA to the eigenvectors above to see that it is exactly the same (up to sign). The singular values in the output are connected to the SVD as follows.
First of all, we see that the lsa function is essentially the svd function in base R.
%%R
res2 = svd(tdm)
print(res2)
$d [1] 2.746158 1.618034 1.207733 0.618034 $u [,1] [,2] [,3] [,4] [1,] -0.3629044 -6.015010e-01 0.06829369 3.717480e-01 [2,] -0.3328695 -5.551115e-17 0.89347008 -3.441691e-15 [3,] -0.5593741 -3.717480e-01 -0.31014767 -6.015010e-01 [4,] -0.5593741 3.717480e-01 -0.31014767 6.015010e-01 [5,] -0.3629044 6.015010e-01 0.06829369 -3.717480e-01 $v [,1] [,2] [,3] [,4] [1,] -0.4570561 -0.601501 0.5395366 -0.371748 [2,] -0.5395366 -0.371748 -0.4570561 0.601501 [3,] -0.4570561 0.601501 0.5395366 0.371748 [4,] -0.5395366 0.371748 -0.4570561 -0.601501
The output here is the same as that of the LSA, except that it is provided for all $n=4$ dimensions, so we have four columns in $T$ and $D$ rather than two. Compare these results to the previous two outputs to see the connection.
We may reconstruct the TDM using the result of the LSA.
%%R
tdm_lsa = res$tk %*% diag(res$sk) %*% t(res$dk)
print(tdm_lsa)
D1.txt D2.txt D3.txt D4.txt blue 1.0409089 0.8995016 -0.1299115 0.1758948 green 0.4178005 0.4931970 0.4178005 0.4931970 red 1.0639006 1.0524048 0.3402938 0.6051912 black 0.3402938 0.6051912 1.0639006 1.0524048 yellow -0.1299115 0.1758948 1.0409089 0.8995016
We see that the new TDM after the LSA operation has non-integer frequency counts, but it may be treated in the same way as the original TDM. The document vectors now populate a slightly different hyperspace.
LSA reduces the rank of the correlation matrix of terms $M \cdot M^\top$ to $n=2$. Here we see the rank before and after LSA.
%%R
library(Matrix)
print(rankMatrix(tdm))
[1] 4 attr(,"method") [1] "tolNorm2" attr(,"useGrad") [1] FALSE attr(,"tol") [1] 1.110223e-15
%%R
print(rankMatrix(tdm_lsa))
Text2vec is an excellent R implementation of much of the functionality we have seen so far. It was written by Dmitriy Selivanov, with its core in C++, and is extremely fast. See: http://text2vec.org/
https://srdas.github.io/MLBook/Text2Vec.html
The example below is taken from the sample code here: http://text2vec.org/vectorization.html
%%R
suppressMessages(library(text2vec))
%%R
suppressMessages(library(data.table))
data("movie_review")
setDT(movie_review)
setkey(movie_review, id)
set.seed(2016L)
all_ids = movie_review$id
train_ids = sample(all_ids, 4000)
test_ids = setdiff(all_ids, train_ids)
train = movie_review[J(train_ids)]
test = movie_review[J(test_ids)]
print(head(train))
id sentiment 1: 11912_2 0 2: 11507_10 1 3: 8194_9 1 4: 11426_10 1 5: 4043_3 0 6: 11287_3 0 review 1: The story behind this movie is very interesting, and in general the plot is not so bad... but the details: writing, directing, continuity, pacing, action sequences, stunts, and use of CG all cheapen and spoil the film.<br /><br />First off, action sequences. They are all quite unexciting. Most consist of someone standing up and getting shot, making no attempt to run, fight, dodge, or whatever, even though they have all the time in the world. The sequences just seem bland for something made in 2004.<br /><br />The CG features very nicely rendered and animated effects, but they come off looking cheap because of how they are used.<br /><br />Pacing: everything happens too quickly. For example, \\"Elle\\" is trained to fight in a couple of hours, and from the start can do back-flips, etc. Why is she so acrobatic? None of this is explained in the movie. As Lilith, she wouldn't have needed to be able to do back flips - maybe she couldn't, since she had wings.<br /><br />Also, we have sequences like a woman getting run over by a car, and getting up and just wandering off into a deserted room with a sink and mirror, and then stabbing herself in the throat, all for no apparent reason, and without any of the spectators really caring that she just got hit by a car (and then felt the secondary effects of another, exploding car)... \\"Are you okay?\\" asks the driver \\"yes, I'm fine\\" she says, bloodied and disheveled.<br /><br />I watched it all, though, because the introduction promised me that it would be interesting... but in the end, the poor execution made me wish for anything else: Blade, Vampire Hunter D, even that movie with vampires where Jackie Chan was comic relief, because they managed to suspend my disbelief, but this just made me want to shake the director awake, and give the writer a good talking to. 2: I remember the original series vividly mostly due to it's unique blend of wry humor and macabre subject matter. Kolchak was hard-bitten newsman from the Ben Hecht school of big-city reporting, and his gritty determination and wise-ass demeanor made even the most mundane episode eminently watchable. My personal fave was \\"The Spanish Moss Murders\\" due to it's totally original storyline. A poor,troubled Cajun youth from Louisiana bayou country, takes part in a sleep research experiment, for the purpose of dream analysis. Something goes inexplicably wrong, and he literally dreams to life a swamp creature inhabiting the dark folk tales of his youth. This malevolent manifestation seeks out all persons who have wronged the dreamer in his conscious state, and brutally suffocates them to death. Kolchak investigates and uncovers this horrible truth, much to the chagrin of police captain Joe \\"Mad Dog\\" Siska(wonderfully essayed by a grumpy Keenan Wynn)and the head sleep researcher played by Second City improv founder, Severn Darden, to droll, understated perfection. The wickedly funny, harrowing finale takes place in the Chicago sewer system, and is a series highlight. Kolchak never got any better. Timeless. 3: Despite the other comments listed here, this is probably the best Dirty Harry movie made; a film that reflects -- for better or worse -- the country's socio-political feelings during the Reagan glory years of the early '80's. 
It's also a kickass action movie.<br /><br />Opening with a liberal, female judge overturning a murder case due to lack of tangible evidence and then going straight into the coffee shop encounter with several unfortunate hoodlums (the scene which prompts the famous, \\"Go ahead, make my day\\" line), \\"Sudden Impact\\" is one non-stop roller coaster of an action film. The first time you get to catch your breath is when the troublesome Inspector Callahan is sent away to a nearby city to investigate the background of a murdered hood. It gets only better from there with an over-the-top group of grotesque thugs for Callahan to deal with along with a sherriff with a mysterious past. Superb direction and photography and a at-times hilarious script help make this film one of the best of the '80's. 4: I think this movie would be more enjoyable if everyone thought of it as a picture of colonial Africa in the 50's and 60's rather than as a story. Because there is no real story here. Just one vignette on top of another like little points of light that don't mean much until you have enough to paint a picture. The first time I saw Chocolat I didn't really \\"get it\\" until having thought about it for a few days. Then I realized there were lots of things to \\"get\\", including the end of colonialism which was but around the corner, just no plot. Anyway, it's one of my all-time favorite movies. The scene at the airport with the brief shower and beautiful music was sheer poetry. If you like \\"exciting\\" movies, don't watch this--you'll be bored to tears. But, for some of you..., you can thank me later for recommending it to you. 5: The film begins with promise, but lingers too long in a sepia world of distance and alienation. We are left hanging, but with nothing much else save languid shots of grave and pensive male faces to savour. Certainly no rope up the wall to help us climb over. It's a shame, because the concept is not without merit.<br /><br />We are left wondering why a loving couple - a father and son no less - should be so estranged from the real world that their own world is preferable when claustrophobic beyond all imagining. This loss of presence in the real world is, rather too obviously and unnecessarily, contrasted with the son having enlisted in the armed forces. Why not the circus, so we can at least appreciate some colour? We are left with a gnawing sense of loss, but sadly no enlightenment, which is bewildering given the film is apparently about some form of attainment not available to us all. 6: This is a film that had a lot to live down to . on the year of its release legendary film critic Barry Norman considered it the worst film of the year and I'd heard nothing but bad things about it especially a plot that was criticised for being too complicated <br /><br />To be honest the plot is something of a red herring and the film suffers even more when the word \\" plot \\" is used because as far as I can see there is no plot as such . There's something involving Russian gangsters , a character called Pete Thompson who's trying to get his wife Sarah pregnant , and an Irish bloke called Sean . How they all fit into something called a \\" plot \\" I'm not sure . It's difficult to explain the plots of Guy Ritchie films but if you watch any of his films I'm sure we can all agree that they all posses one no matter how complicated they may seem on first viewing . Likewise a James Bond film though the plots are stretched out with action scenes . 
You will have a serious problem believing RANCID ALUMINIUM has any type of central plot that can be cogently explained <br /><br />Taking a look at the cast list will ring enough warning bells as to what sort of film you'll be watching . Sadie Frost has appeared in some of the worst British films made in the last 15 years and she's doing nothing to become inconsistent . Steven Berkoff gives acting a bad name ( and he plays a character called Kant which sums up the wit of this movie ) while one of the supporting characters is played by a TV presenter presumably because no serious actress would be seen dead in this <br /><br />The only good thing I can say about this movie is that it's utterly forgettable . I saw it a few days ago and immediately after watching I was going to write a very long a critical review warning people what they are letting themselves in for by watching , but by now I've mainly forgotten why . But this doesn't alter the fact that I remember disliking this piece of crap immensely
The processing steps are: lowercase the text, tokenize it, create an iterator over the documents, and build a vocabulary.
%%R
prep_fun = tolower
tok_fun = word_tokenizer
#Create an iterator to pass to the create_vocabulary function
it_train = itoken(train$review,
preprocessor = prep_fun,
tokenizer = tok_fun,
ids = train$id,
progressbar = FALSE)
#Now create a vocabulary
vocab = create_vocabulary(it_train)
print(vocab)
Number of docs: 4000 0 stopwords: ... ngram_min = 1; ngram_max = 1 Vocabulary: term term_count doc_count 1: ufo 1 1 2: rader 1 1 3: bouchet 1 1 4: atherton 1 1 5: cyhper 1 1 --- 38302: to 22095 3805 38303: of 23653 3792 38304: a 26614 3878 38305: and 27069 3877 38306: the 54362 3969
An iterator is an object that traverses a container. A list is iterable. See: https://www.r-bloggers.com/iterators-in-r/
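The same idea in Python, as a tiny sketch: a list is iterable, and an iterator hands back one element at a time.

#Sketch: iterators in Python
colors = ['blue', 'red', 'green']   #a list is iterable
it = iter(colors)                   #an iterator traverses the container
print(next(it))                     #'blue'
print(next(it))                     #'red'
for c in colors:                    #a for-loop does the same traversal implicitly
    print(c)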
The vectorizer maps the vocabulary into vector space: vocab_vectorizer creates a vectorizer object that is then used to construct the document-term matrix.
%%R
vectorizer = vocab_vectorizer(vocab)
%%R
dtm_train = create_dtm(it_train, vectorizer)
print(dim(as.matrix(dtm_train)))
[1] 4000 38306
n-grams are phrases formed from words that co-occur consecutively. For example, a bi-gram is a sequence of two consecutive words.
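For example (a small Python sketch, not part of the text2vec pipeline below), sklearn's CountVectorizer builds a combined unigram and bigram vocabulary when given ngram_range=(1, 2).

#Sketch: unigrams + bigrams with sklearn, analogous to ngram = c(1, 2) in text2vec
from sklearn.feature_extraction.text import CountVectorizer
docs = ['the movie was very good', 'the movie was not good at all']
vec = CountVectorizer(ngram_range=(1, 2))   #unigrams and bigrams
X = vec.fit_transform(docs)
print(vec.get_feature_names())   #includes bigrams such as 'not good' and 'movie was'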
%%R
vocab = create_vocabulary(it_train, ngram = c(1, 2))
print(vocab)
Number of docs: 4000 0 stopwords: ... ngram_min = 1; ngram_max = 2 Vocabulary: term term_count doc_count 1: old_used 1 1 2: corey_savier's 1 1 3: key_monosyllabic 1 1 4: rural_england 1 1 5: apartment_house 1 1 --- 406813: to 22095 3805 406814: of 23653 3792 406815: a 26614 3878 406816: and 27069 3877 406817: the 54362 3969
This creates a vocabulary of both single words and bi-grams. Notice how much larger it is than the unigram vocabulary from earlier. Because of this, we prune the vocabulary first, which speeds up computation, and then run the classification with n-grams.
%%R
library(glmnet)
library(magrittr)
NFOLDS = 5
vocab = vocab %>% prune_vocabulary(term_count_min = 10,
doc_proportion_max = 0.5)
print(vocab)
RRuntimeWarning: Loading required package: foreach
RRuntimeWarning: Loaded glmnet 2.0-16
Number of docs: 4000 0 stopwords: ... ngram_min = 1; ngram_max = 2 Vocabulary: term term_count doc_count 1: and_loved 10 10 2: screenplay_and 10 9 3: you_saw 10 10 4: was_her 10 10 5: feel_bad 10 9 --- 17663: from 3369 1914 17664: they 3426 1646 17665: by 3727 1919 17666: he 4293 1588 17667: his 4808 1732
%%R
bigram_vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, bigram_vectorizer)
res = cv.glmnet(x = dtm_train, y = train[['sentiment']],
family = 'binomial',
alpha = 1,
type.measure = "auc",
nfolds = NFOLDS,
thresh = 1e-3,
maxit = 1e3)
plot(res)
%%R
print(names(res))
cat("AUC (area under curve):")
print(max(res$cvm))
[1] "lambda" "cvm" "cvsd" "cvup" "cvlo" [6] "nzero" "name" "glmnet.fit" "lambda.min" "lambda.1se" AUC (area under curve):[1] 0.9251195
%%R
#Out-of-sample test
it_test = test$review %>%
prep_fun %>%
tok_fun %>%
itoken(ids = test$id,
# turn off progressbar because it won't look nice in rmd
progressbar = FALSE)
dtm_test = create_dtm(it_test, bigram_vectorizer)
preds = predict(res, dtm_test, type = 'response')[,1]
glmnet:::auc(test$sentiment, preds)
[1] 0.9316295
We have seen the TF-IDF discussion earlier, and here we see how to implement it using the text2vec package.
%%R
vocab = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, vectorizer)
tfidf = TfIdf$new()
dtm_train_tfidf = fit_transform(dtm_train, tfidf)
dtm_test_tfidf = create_dtm(it_test, vectorizer) %>% transform(tfidf)
%%R
## Refit classifier
## Now we take the TF-IDF adjusted DTM and run the classifier.
res = cv.glmnet(x = dtm_train_tfidf, y = train[['sentiment']],
family = 'binomial',
alpha = 1,
type.measure = "auc",
nfolds = NFOLDS,
thresh = 1e-3,
maxit = 1e3)
print(paste("max AUC =", round(max(res$cvm), 4)))
[1] "max AUC = 0.9113"
%%R
#Test on hold-out sample
preds = predict(res, dtm_test_tfidf, type = 'response')[,1]
glmnet:::auc(test$sentiment, preds)
[1] 0.9063606
From: http://stackoverflow.com/questions/39514941/preparing-word-embeddings-in-text2vec-r-package
Create the TCM (Term Co-occurrence Matrix).
Word embeddings (in particular, the word2vec algorithm) have been used to mine a large number of research abstracts to uncover new knowledge. See "With little training, machine-learning algorithms can uncover hidden scientific knowledge" in Nature (2019), which describes Tshitoyan et al. (2019).
%%R
data("movie_review")
tokens = movie_review$review %>% tolower %>% word_tokenizer()
it = itoken(tokens)
v = create_vocabulary(it) %>% prune_vocabulary(term_count_min=10)
vectorizer = vocab_vectorizer(v) #, grow_dtm = FALSE, skip_grams_window = 5)
tcm = create_tcm(it, vectorizer, skip_grams_window=5)
print(dim(tcm))
[1] 7805 7805
Now fit the word embeddings using GloVe. See: http://nlp.stanford.edu/projects/glove/
%%R
model = GlobalVectors$new(word_vectors_size = 50, vocabulary = v, x_max = 10)
wv_main = model$fit_transform(tcm, n_iter = 25, convergence_tol = 0.01)
INFO [2019-05-19 17:49:29] 2019-05-19 17:49:29 - epoch 1, expected cost 0.0814 INFO [2019-05-19 17:49:30] 2019-05-19 17:49:30 - epoch 2, expected cost 0.0525 INFO [2019-05-19 17:49:30] 2019-05-19 17:49:30 - epoch 3, expected cost 0.0458 INFO [2019-05-19 17:49:31] 2019-05-19 17:49:31 - epoch 4, expected cost 0.0415 INFO [2019-05-19 17:49:31] 2019-05-19 17:49:31 - epoch 5, expected cost 0.0384 INFO [2019-05-19 17:49:32] 2019-05-19 17:49:32 - epoch 6, expected cost 0.0360 INFO [2019-05-19 17:49:32] 2019-05-19 17:49:32 - epoch 7, expected cost 0.0342 INFO [2019-05-19 17:49:33] 2019-05-19 17:49:33 - epoch 8, expected cost 0.0328 INFO [2019-05-19 17:49:33] 2019-05-19 17:49:33 - epoch 9, expected cost 0.0316 INFO [2019-05-19 17:49:34] 2019-05-19 17:49:34 - epoch 10, expected cost 0.0306 INFO [2019-05-19 17:49:34] 2019-05-19 17:49:34 - epoch 11, expected cost 0.0298 INFO [2019-05-19 17:49:35] 2019-05-19 17:49:35 - epoch 12, expected cost 0.0291 INFO [2019-05-19 17:49:35] 2019-05-19 17:49:35 - epoch 13, expected cost 0.0285 INFO [2019-05-19 17:49:36] 2019-05-19 17:49:36 - epoch 14, expected cost 0.0279 INFO [2019-05-19 17:49:36] 2019-05-19 17:49:36 - epoch 15, expected cost 0.0275 INFO [2019-05-19 17:49:37] 2019-05-19 17:49:37 - epoch 16, expected cost 0.0270 INFO [2019-05-19 17:49:37] 2019-05-19 17:49:37 - epoch 17, expected cost 0.0267 INFO [2019-05-19 17:49:38] 2019-05-19 17:49:38 - epoch 18, expected cost 0.0263 INFO [2019-05-19 17:49:38] 2019-05-19 17:49:38 - epoch 19, expected cost 0.0260 INFO [2019-05-19 17:49:39] 2019-05-19 17:49:39 - epoch 20, expected cost 0.0258 INFO [2019-05-19 17:49:39] 2019-05-19 17:49:39 - epoch 21, expected cost 0.0255 INFO [2019-05-19 17:49:39] Success: early stopping. Improvement at iterartion 21 is less then convergence_tol
%%R
print(dim(wv_main))
wv_context = model$components
print(dim(wv_context))
wv = wv_main + t(wv_context)
#wv = model$get_word_vectors() #Dimension words x wvec_size
#Make distance matrix
d = dist2(wv, method="cosine") #Smaller values means closer
print(dim(d))
[1] 7805 50 [1] 50 7805 [1] 7805 7805
%%R
#Pass: w=word, d=dist matrix, n=number of close words
findCloseWords = function(w,d,n) {
  words = rownames(d)
  i = which(words==w)
  if (length(i) > 0) {
    res = sort(d[i,])
    print(as.matrix(res[2:(n+1)]))
  } else {
    print("Word not in corpus.")
  }
}
Example: Show the ten words closest to the words “man” and “woman”.
%%R
findCloseWords("man",d,10)
[,1] woman 0.1759534 girl 0.2667730 who 0.2977268 guy 0.2996575 young 0.2998047 plays 0.3508409 boy 0.3725276 old 0.3939009 he 0.3969646 kid 0.3970353
%%R
findCloseWords("woman",d,10)
[,1] young 0.1739555 man 0.1759534 girl 0.1913266 guy 0.2662980 who 0.2982760 kid 0.3122388 boy 0.3182666 named 0.3425043 old 0.3647606 plays 0.3782488
This is a very useful feature of word embeddings: it is often argued that words that are close to each other in the embedded space also tend to be semantically similar, even though the closeness is computed simply from their co-occurrence frequencies.
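As a reminder of what “close” means here, the sketch below (my illustration, with made-up vectors rather than the fitted GloVe vectors) computes the cosine distance used by dist2(..., method="cosine") above.

#Sketch: cosine distance between word vectors (smaller = closer)
import numpy as np
def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
man = np.array([0.8, 0.1, 0.3])      #illustrative vectors only
woman = np.array([0.7, 0.2, 0.35])
table = np.array([-0.5, 0.9, 0.0])
print(cosine_distance(man, woman))   #small: semantically close
print(cosine_distance(man, table))   #larger: less related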
For more details, see: https://www.quora.com/How-does-word2vec-work
A geometrical interpretation: word2vec is a shallow word embedding model. This means that the model learns to map each discrete word id (0 through the number of words in the vocabulary) into a low-dimensional continuous vector-space from their distributional properties observed in some raw text corpus. Geometrically, one may interpret these vectors as tracing out points on the outside surface of a manifold in the “embedded space”. If we initialize these vectors from a spherical gaussian distribution, then you can imagine this manifold to look something like a hypersphere initially.
Let us focus on the CBOW for now. CBOW is trained to predict the target word t from the contextual words that surround it, c, i.e. the goal is to maximize P(t | c) over the training set. I am simplifying somewhat, but you can show that this probability is roughly inversely proportional to the distance between the current vectors assigned to t and to c. Since this model is trained in an online setting (one example at a time), at time T the goal is therefore to take a small step (mediated by the “learning rate”) in order to minimize the distance between the current vectors for t and c (and thereby increase the probability P(t |c)). By repeating this process over the entire training set, we have that vectors for words that habitually co-occur tend to be nudged closer together, and by gradually lowering the learning rate, this process converges towards some final state of the vectors.
By the Distributional Hypothesis (Firth, 1957; see also the Wikipedia page on Distributional semantics), words with similar distributional properties (i.e. that co-occur regularly) tend to share some aspect of semantic meaning. For example, we may find several sentences in the training set such as “citizens of X protested today” where X (the target word t) may be names of cities or countries that are semantically related.
You can therefore interpret each training step as deforming or morphing the initial manifold by nudging the vectors for some words somewhat closer together, and the result, after projecting down to two dimensions, is the familiar t-SNE visualizations where related words cluster together (e.g. Word representations for NLP).
For the skipgram, the direction of the prediction is simply inverted, i.e. now we try to predict P(citizens | X), P(of | X), etc. This turns out to learn finer-grained vectors when one trains over more data. The main reason is that the CBOW smooths over a lot of the distributional statistics by averaging over all context words, while the skipgram does not. With little data, this “regularizing” effect of the CBOW turns out to be helpful, but since data is the ultimate regularizer, the skipgram is able to extract more information when more data is available.
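In gensim (used later in this notebook), the switch between the two architectures is the sg flag; the toy corpus below is purely illustrative.

#Sketch: CBOW (sg=0) vs. skip-gram (sg=1) in gensim on a toy corpus
import gensim
toy_sentences = [['citizens', 'of', 'paris', 'protested', 'today'],
                 ['citizens', 'of', 'rome', 'protested', 'today']]
cbow_model = gensim.models.Word2Vec(toy_sentences, sg=0, min_count=1)       #predict target from context
skipgram_model = gensim.models.Word2Vec(toy_sentences, sg=1, min_count=1)   #predict context from target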
There’s a bit more going on behind the scenes, but hopefully this helps to give a useful geometrical intuition as to how these models work.
Topic modeling below uses Latent Dirichlet Allocation (LDA).
%%R
suppressMessages(library(tm))
suppressMessages(library(text2vec))
stopw = stopwords('en')
stopw = c(stopw,"br","t","s","m","ve","2","d","1")
%%R
#Make DTM
data("movie_review")
tokens = movie_review$review %>% tolower %>% word_tokenizer()
it = itoken(tokens)
v = create_vocabulary(it, stopwords = stopw) %>% prune_vocabulary(term_count_min=5)
vectrzr = vocab_vectorizer(v)
dtm = create_dtm(it, vectrzr, skip_grams_window = 5)
print(dim(dtm))
[1] 5000 12803
%%R
#Do LDA
dtm = create_dtm(it, vectrzr, type = "dgTMatrix")
lda = LDA$new(n_topics = 5, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topics = lda$fit_transform(x = dtm, n_iter = 1000,
convergence_tol = 0.001, n_check_convergence = 25,
progressbar = FALSE)
INFO [2019-05-19 17:50:37] iter 25 loglikelihood = -4236738.001
INFO [2019-05-19 17:50:38] iter 50 loglikelihood = -4175059.756
INFO [2019-05-19 17:50:39] iter 75 loglikelihood = -4145685.344
INFO [2019-05-19 17:50:40] iter 100 loglikelihood = -4127472.096
INFO [2019-05-19 17:50:41] iter 125 loglikelihood = -4116825.926
INFO [2019-05-19 17:50:42] iter 150 loglikelihood = -4111737.651
INFO [2019-05-19 17:50:43] iter 175 loglikelihood = -4105210.327
INFO [2019-05-19 17:50:44] iter 200 loglikelihood = -4103461.840
INFO [2019-05-19 17:50:44] early stopping at 200 iteration
%%R
barplot(doc_topics[1, ], xlab = "topic",
ylab = "proportion", ylim = c(0, 1),
names.arg = 1:ncol(doc_topics))
%%R
#Get top words by topic
lda$get_top_words(n = 10, topic_number = seq(1,5), lambda = 1)
[,1] [,2] [,3] [,4] [,5] [1,] "film" "man" "film" "movie" "film" [2,] "movie" "one" "people" "like" "one" [3,] "good" "also" "life" "just" "story" [4,] "like" "gets" "one" "one" "great" [5,] "one" "action" "way" "really" "time" [6,] "just" "john" "can" "bad" "well" [7,] "films" "two" "us" "good" "movie" [8,] "story" "get" "young" "even" "love" [9,] "plot" "new" "will" "see" "also" [10,] "really" "night" "like" "movies" "best"
%%R
#Get top words by topic, sorted by relevance (set lambda between 0.2 and 0.4)
lda$get_top_words(n = 10, topic_number = seq(1,5), lambda = 0.2)
[,1] [,2] [,3] [,4] [,5] [1,] "film" "match" "war" "movie" "wonderful" [2,] "films" "stewart" "sister" "bad" "novel" [3,] "plot" "won" "french" "show" "brilliant" [4,] "dialogue" "team" "lives" "stupid" "york" [5,] "director" "dr" "society" "worst" "excellent" [6,] "average" "doctor" "woman" "just" "jane" [7,] "forward" "island" "white" "movies" "william" [8,] "script" "ring" "becomes" "kids" "performances" [9,] "characters" "attack" "military" "guy" "mary" [10,] "actors" "rock" "tells" "like" "adaptation"
%%R
#Plot LDA
suppressMessages(library(LDAvis))
lda$plot()
RRuntimeWarning: Loading required namespace: servr
RRuntimeWarning: Failed with error: ‘there is no package called ‘servr’’
RRuntimeWarning: If the visualization doesn't render, install the servr package and re-run serVis: install.packages('servr') Alternatively, you could configure your default browser to allow access to local files as some browsers block this by default
Entities are elements of the data that fall into specific pre-defined categories. They are part of the data model. For instance, when extracting data from the SEC, we get various entity types: companies, directors, loans, securities, products, accounting line items, etc. We need to identify these when they appear in financial text.
Entity extraction (EE) is often also called Named Entity Recognition (NER).
text = 'A new statement from Boeing indicates that the aerospace manufacturer knew about a problem with the 737 Max aircraft well before the deadly October 2018 Lion Air crash, but decided not to do anything about it.'
print(text)
A new statement from Boeing indicates that the aerospace manufacturer knew about a problem with the 737 Max aircraft well before the deadly October 2018 Lion Air crash, but decided not to do anything about it.
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
txt = nltk.word_tokenize(text)
txt = nltk.pos_tag(txt)
txt
[('A', 'DT'), ('new', 'JJ'), ('statement', 'NN'), ('from', 'IN'), ('Boeing', 'NNP'), ('indicates', 'VBZ'), ('that', 'IN'), ('the', 'DT'), ('aerospace', 'NN'), ('manufacturer', 'NN'), ('knew', 'VBD'), ('about', 'IN'), ('a', 'DT'), ('problem', 'NN'), ('with', 'IN'), ('the', 'DT'), ('737', 'CD'), ('Max', 'NNP'), ('aircraft', 'NN'), ('well', 'RB'), ('before', 'IN'), ('the', 'DT'), ('deadly', 'JJ'), ('October', 'NNP'), ('2018', 'CD'), ('Lion', 'NNP'), ('Air', 'NNP'), ('crash', 'NN'), (',', ','), ('but', 'CC'), ('decided', 'VBD'), ('not', 'RB'), ('to', 'TO'), ('do', 'VB'), ('anything', 'NN'), ('about', 'IN'), ('it', 'PRP'), ('.', '.')]
pattern = 'NP: {<DT>?<JJ>*<NN>}' #noun phrase = an optional determiner (DT), followed by any number of adjectives (JJ), ending in a noun (NN)
cp = nltk.RegexpParser(pattern)
cs = cp.parse(txt)
print(cs)
(S (NP A/DT new/JJ statement/NN) from/IN Boeing/NNP indicates/VBZ that/IN (NP the/DT aerospace/NN) (NP manufacturer/NN) knew/VBD about/IN (NP a/DT problem/NN) with/IN the/DT 737/CD Max/NNP (NP aircraft/NN) well/RB before/IN the/DT deadly/JJ October/NNP 2018/CD Lion/NNP Air/NNP (NP crash/NN) ,/, but/CC decided/VBD not/RB to/TO do/VB (NP anything/NN) about/IN it/PRP ./.)
#Code from the spaCy web site
import spacy
# Load English tokenizer, tagger, parser, NER and word vectors
#If not working: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
# Process the text above
doc = nlp(text)
# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])
# Find named entities, phrases and concepts
print('----Named Entities----')
for entity in doc.ents:
print(entity.text, entity.label_)
Noun phrases: ['A new statement', 'Boeing', 'the aerospace manufacturer', 'a problem', 'the 737 Max aircraft', 'the deadly October 2018 Lion Air crash', 'anything', 'it'] Verbs: ['indicate', 'know', 'decide', 'do'] ----Named Entities---- Boeing ORG 737 Max PRODUCT October 2018 DATE Lion Air ORG
Original paper: https://lvdmaaten.github.io/tsne/
From: https://github.com/oreillymedia/t-SNE-tutorial
A popular dimensionality reduction algorithm: t-distributed stochastic neighbor embedding (t-SNE). Developed by Laurens van der Maaten and Geoffrey Hinton (see the original paper here: http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf), this algorithm has been successfully applied to many real-world datasets.
A nice article on visualizing high-dimensional data sets: https://medium.com/@luckylwk/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b; https://drive.google.com/file/d/1Cucl4HYtYBgS12-BoslSuaOvFTCghqpe/view?usp=sharing
The good thing is that we can use the same model we trained with word2vec and feed it into t-SNE, as follows.
#Read in the corpus
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'reuters/training/'
ctext = PlaintextCorpusReader(corpus_root, '.*')
#Convert corpus to text array with a full string for each doc
def merge_arrays(word_lists):
    wordlist = []
    for wl in word_lists:
        wordlist = wordlist + wl
    doc = ' '.join(wordlist)
    return doc
#Run this through the corpus to get a word array for each doc
text_array = []
for p in ctext.paras():
    doc = merge_arrays(p)
    text_array.append(doc)
#Clean up the docs using the previous functions
news = text_array
news = removePunc(news)
news = removeNumbers(news)
news = stopText(news)
#news = stemText(news)
news = [j.lower() for j in news]
#Select a few random news items
import random
news_sample = random.sample(news,25)
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
#Tokenize each document
def textTokenize(text_array):
    textTokens = []
    for h in text_array:
        textTokens.append(h.split(' '))
    return textTokens
sentences = textTokenize(news_sample)
print(len(sentences))
type(sentences)
25
list
#Train the model on Word2Vec
model = gensim.models.Word2Vec(sentences, min_count=1)
type(model)
2019-05-19 17:51:21,951 : INFO : collecting all words and their counts 2019-05-19 17:51:21,952 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types 2019-05-19 17:51:21,953 : INFO : collected 1085 word types from a corpus of 2481 raw words and 25 sentences 2019-05-19 17:51:21,954 : INFO : Loading a fresh vocabulary 2019-05-19 17:51:21,957 : INFO : min_count=1 retains 1085 unique words (100% of original 1085, drops 0) 2019-05-19 17:51:21,958 : INFO : min_count=1 leaves 2481 word corpus (100% of original 2481, drops 0) 2019-05-19 17:51:21,965 : INFO : deleting the raw counts dictionary of 1085 items 2019-05-19 17:51:21,966 : INFO : sample=0.001 downsamples 55 most-common words 2019-05-19 17:51:21,967 : INFO : downsampling leaves estimated 2150 word corpus (86.7% of prior 2481) 2019-05-19 17:51:21,969 : INFO : estimated required memory for 1085 words and 100 dimensions: 1410500 bytes 2019-05-19 17:51:21,969 : INFO : resetting layer weights 2019-05-19 17:51:21,990 : INFO : training model with 3 workers on 1085 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 2019-05-19 17:51:22,006 : INFO : worker thread finished; awaiting finish of 2 more threads 2019-05-19 17:51:22,007 : INFO : worker thread finished; awaiting finish of 1 more threads 2019-05-19 17:51:22,009 : INFO : worker thread finished; awaiting finish of 0 more threads 2019-05-19 17:51:22,010 : INFO : EPOCH - 1 : training on 2481 raw words (2133 effective words) took 0.0s, 507706 effective words/s 2019-05-19 17:51:22,014 : INFO : worker thread finished; awaiting finish of 2 more threads 2019-05-19 17:51:22,014 : INFO : worker thread finished; awaiting finish of 1 more threads 2019-05-19 17:51:22,017 : INFO : worker thread finished; awaiting finish of 0 more threads 2019-05-19 17:51:22,018 : INFO : EPOCH - 2 : training on 2481 raw words (2142 effective words) took 0.0s, 377271 effective words/s 2019-05-19 17:51:22,021 : INFO : worker thread finished; awaiting finish of 2 more threads 2019-05-19 17:51:22,022 : INFO : worker thread finished; awaiting finish of 1 more threads 2019-05-19 17:51:22,024 : INFO : worker thread finished; awaiting finish of 0 more threads 2019-05-19 17:51:22,025 : INFO : EPOCH - 3 : training on 2481 raw words (2152 effective words) took 0.0s, 457254 effective words/s 2019-05-19 17:51:22,027 : INFO : worker thread finished; awaiting finish of 2 more threads 2019-05-19 17:51:22,027 : INFO : worker thread finished; awaiting finish of 1 more threads 2019-05-19 17:51:22,030 : INFO : worker thread finished; awaiting finish of 0 more threads 2019-05-19 17:51:22,030 : INFO : EPOCH - 4 : training on 2481 raw words (2142 effective words) took 0.0s, 596262 effective words/s 2019-05-19 17:51:22,032 : INFO : worker thread finished; awaiting finish of 2 more threads 2019-05-19 17:51:22,033 : INFO : worker thread finished; awaiting finish of 1 more threads 2019-05-19 17:51:22,036 : INFO : worker thread finished; awaiting finish of 0 more threads 2019-05-19 17:51:22,036 : INFO : EPOCH - 5 : training on 2481 raw words (2160 effective words) took 0.0s, 549381 effective words/s 2019-05-19 17:51:22,036 : INFO : training on a 12405 raw words (10729 effective words) took 0.0s, 234067 effective words/s 2019-05-19 17:51:22,037 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
gensim.models.word2vec.Word2Vec
#t-SNE uses vocabulary from word2vec
from sklearn.manifold import TSNE
def tsne_plot(model):
    "Creates a TSNE model and plots it"
    labels = []
    tokens = []
    for word in model.wv.vocab:
        tokens.append(model[word])
        labels.append(word)
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)
    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
    plt.figure(figsize=(16, 16))
    for i in range(len(x)):
        pyplot.scatter(x[i], y[i])
        pyplot.annotate(labels[i],
                        xy=(x[i], y[i]),
                        xytext=(5, 2),
                        textcoords='offset points',
                        ha='right',
                        va='bottom')
    pyplot.show()
%%time
figure(figsize=(20,10))
tsne_plot(model)
DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
<Figure size 1440x720 with 0 Axes>
CPU times: user 21.7 s, sys: 188 ms, total: 21.9 s Wall time: 21.1 s
A knowledge graph (KG) represents knowledge in graphical form. The graph contains entities, their attributes, and the relationships between them.
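A minimal sketch of this idea (my illustration, assuming the networkx package is available) stores (entity) -[relationship]-> (entity) triples, with attributes attached to the nodes, reusing the entities from the Boeing example above.

#Sketch: a tiny knowledge graph of entities, attributes, and relationships
import networkx as nx
kg = nx.DiGraph()
kg.add_node('Boeing', type='company')        #entities with attributes
kg.add_node('737 Max', type='product')
kg.add_node('Lion Air', type='company')
kg.add_edge('Boeing', '737 Max', relation='manufactures')   #relationships
kg.add_edge('Lion Air', '737 Max', relation='operates')
for u, v, d in kg.edges(data=True):
    print(u, '-[', d['relation'], ']->', v)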
References:
Here is an interesting application of text classification using neural nets. The original blog post (https://realpython.com/python-keras-text-classification/) is very well written and is worth a read. Probably best read after studying neural networks and deep learning.
https://drive.google.com/file/d/1FE7oFZd5fPhq8GQZ45k84_hbjil24kdP/view?usp=sharing
Harking back to traditional rule-based NLP, linguistic markers play an important role. Linguistic markers are sets of words that relate to a concept. The phrase "I swear to God" is an example of a marker of an "overzealous expression", while "I, my, mine" are examples of "self-reference". The counts of these markers by category may be used as features in the NLP analysis of a conversation (a small sketch of such feature counts follows the references below). VoicePrint is a system being used by banks to detect fraud in customer phone calls. The following references on linguistic markers may be worth a further read.
A System of Deception and Fraud Detection using Reliable Linguistic Cues (Humphreys 2010)
Fraud detection in finance using Linguistic Features; pdf.
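As promised above, here is a small sketch of marker counts as features; the categories and word lists are illustrative only, not taken from the cited papers.

#Sketch: counts of linguistic markers by category as NLP features
marker_categories = {
    'self_reference': {'i', 'my', 'mine', 'me'},
    'overzealous': {'swear', 'god', 'honestly', 'truly'},
}
def marker_features(text):
    words = text.lower().split()
    return {cat: sum(w in lexicon for w in words)
            for cat, lexicon in marker_categories.items()}
print(marker_features('I swear to God my account details are mine alone'))
#{'self_reference': 3, 'overzealous': 2}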