%pylab inline
import pandas as pd
import os
from ipypublish import nb_setup
%load_ext rpy2.ipython
#Only if using Windows
%load_ext RWinOut

Populating the interactive namespace from numpy and matplotlib
The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


import requests
url = 'http://srdas.github.io/bio-candid.html'
f = requests.get(url)
text = f.text
f.close()

#Split the text into lines; each line serves as a small document in later examples
lines = text.splitlines()
print('Number of lines =',len(lines))
print(lines[3])

Number of lines = 78
Sanjiv Das is the William and Janice Terry Professor of Finance and


#Clean up text
from bs4 import BeautifulSoup
bio = BeautifulSoup(text,'lxml').get_text()
print(bio)



Sanjiv Das is the William and Janice Terry Professor of Finance and
Data Science at Santa Clara University's Leavey School of Business. He
previously held faculty appointments as Professor at Harvard Business
School and UC Berkeley. He holds post-graduate degrees in Finance
(M.Phil and Ph.D. from New York University), Computer Science
(M.S. from UC Berkeley), an MBA from the Indian Institute of
Management, Ahmedabad, B.Com in Accounting and Economics (University
of Bombay, Sydenham College), and is also a qualified Cost and Works
Accountant (AICWA). He is a senior editor of The Journal of Investment
Management and Associate Editor of Management Science and other
academic journals. Prior to being an academic, he worked in the
derivatives business in the Asia-Pacific region as a Vice-President at
Citibank. His current research interests include: machine learning,
social networks, derivatives pricing models, portfolio theory, the
modeling of default risk, systemic risk, and venture capital.  He has
published over a hundred articles in academic journals, and has won
numerous awards for research and teaching. His recent book
"Derivatives: Principles and Practice" was published in May 2010
(second edition 2016).  


 Sanjiv Das: A Short Academic Life History 

After loafing and working in many parts of Asia, but never really
growing up, Sanjiv moved to New York to change the world, hopefully
through research.  He graduated in 1994 with a Ph.D. from NYU, and
since then spent five years in Boston, and now lives in San Jose,
California.  Sanjiv loves animals, places in the world where the
mountains meet the sea, riding sport motorbikes, reading, gadgets,
science fiction movies, and writing cool software code. When there is
time available from the excitement of daily life, Sanjiv writes
academic papers, which helps him relax. Always the contrarian, Sanjiv
thinks that New York City is the most calming place in the world,
after California of course.


Sanjiv is now a Professor of Finance at Santa Clara University. He came
to SCU from Harvard Business School and spent a year at UC Berkeley. In
his past life in the unreal world, Sanjiv worked at Citibank, N.A. in
the Asia-Pacific region. He takes great pleasure in merging his many
previous lives into his current existence, which is incredibly confused
and diverse.


Sanjiv's research style is instilled with a distinct "New York state of
mind" - it is chaotic, diverse, with minimal method to the madness. He
has published articles on derivatives, term-structure models, mutual
funds, the internet, portfolio choice, banking models, credit risk, and
has unpublished articles in many other areas. Some years ago, he took
time off to get another degree in computer science at Berkeley,
confirming that an unchecked hobby can quickly become an obsession.
There he learnt about the fascinating field of Randomized Algorithms,
skills he now applies earnestly to his editorial work, and other
pursuits, many of which stem from being in the epicenter of Silicon
Valley.


Coastal living did a lot to mold Sanjiv, who needs to live near the
ocean.  The many walks in Greenwich village convinced him that there is
no such thing as a representative investor, yet added many unique
features to his personal utility function. He learnt that it is
important to open the academic door to the ivory tower and let the world
in. Academia is a real challenge, given that he has to reconcile many
more opinions than ideas. He has been known to have turned down many
offers from Mad magazine to publish his academic work. As he often
explains, you never really finish your education - "you can check out
any time you like, but you can never leave." Which is why he is doomed
to a lifetime in Hotel California. And he believes that, if this is as
bad as it gets, life is really pretty good.


import string
def removePuncStr(s):
    for c in string.punctuation:
        s = s.replace(c," ")
    return s

def removePunc(text_array):
    return [removePuncStr(h) for h in text_array]


def removeNumbersStr(s):
    for c in range(10):
        n = str(c)
        s = s.replace(n," ")
    return s

def removeNumbers(text_array):
    return [removeNumbersStr(h) for h in text_array]
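
The two helpers above replace punctuation and digits one character at a time. As a minimal sketch of an alternative (the names `_STRIP_TABLE` and `strip_punc_and_numbers` are illustrative and not part of the original notebook), a single translation table can do both replacements in one pass:

```python
import string

# Map every punctuation character and every digit to a single space
_STRIP_TABLE = str.maketrans({c: " " for c in string.punctuation + string.digits})

def strip_punc_and_numbers(s):
    # One pass over the string instead of one str.replace call per character
    return s.translate(_STRIP_TABLE)

sample = "(M.S. from UC Berkeley), published in May 2010"
cleaned = strip_punc_and_numbers(sample)
print(cleaned.split())
```

Each removed character becomes a space, so the string length is unchanged and the word boundaries are the same as with removePuncStr followed by removeNumbersStr.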


from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def stopText(text_array):
    stop_words = set(stopwords.words('english'))
    stopped_text = []
    for h in text_array:
        words = word_tokenize(h)
        h2 = ''
        for w in words:
            if w not in stop_words:
                h2 = h2 + ' ' + w
        stopped_text.append(h2)
    return stopped_text
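
The NLTK English stop-word list is all lower case, so stopText keeps capitalized tokens such as "He" (they are visible in the cleaned output further down). The sketch below shows one way to make the comparison case-insensitive. It uses a tiny hand-written stop-word set (`STOP_WORDS`, an assumption standing in for `stopwords.words('english')`) and a whitespace tokenizer so that it runs without the NLTK data files:

```python
# Hypothetical miniature stop-word list; the notebook uses stopwords.words('english')
STOP_WORDS = {"he", "is", "the", "a", "of", "and"}

def stop_line(line, stop_words=STOP_WORDS):
    # Compare the lower-cased token, but keep the original spelling in the output
    return " ".join(w for w in line.split() if w.lower() not in stop_words)

result = stop_line("He is a senior editor of The Journal")
print(result)
```

Whether to lower-case before filtering is a design choice: it removes more noise, but it also drops capitalized words that happen to match a stop word.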


from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def stemText(text_array):
    stemmer = PorterStemmer()  #Create the stemmer once and reuse it for every word
    stemmed_text = []
    for h in text_array:
        words = word_tokenize(h)
        h2 = ''
        for w in words:
            h2 = h2 + ' ' + stemmer.stem(w)
        stemmed_text.append(h2)
    return stemmed_text


bio2 = bio.splitlines()
bio2[:10]

['',
 '',
 '',
 'Sanjiv Das is the William and Janice Terry Professor of Finance and',
 "Data Science at Santa Clara University's Leavey School of Business. He",
 'previously held faculty appointments as Professor at Harvard Business',
 'School and UC Berkeley. He holds post-graduate degrees in Finance',
 '(M.Phil and Ph.D. from New York University), Computer Science',
 '(M.S. from UC Berkeley), an MBA from the Indian Institute of',
 'Management, Ahmedabad, B.Com in Accounting and Economics (University']


#Clean up all lines with one chain of nested function calls
bio2 = stemText(stopText(removeNumbers(removePunc(bio2))))
bio2 = [j for j in bio2 if len(j)>0]
bio2

[' sanjiv da william janic terri professor financ',
 ' data scienc santa clara univers leavey school busi He',
 ' previous held faculti appoint professor harvard busi',
 ' school UC berkeley He hold post graduat degre financ',
 ' M phil Ph D new york univers comput scienc',
 ' M S UC berkeley mba indian institut',
 ' manag ahmedabad B com account econom univers',
 ' bombay sydenham colleg also qualifi cost work',
 ' account aicwa He senior editor the journal invest',
 ' manag associ editor manag scienc',
 ' academ journal prior academ work',
 ' deriv busi asia pacif region vice presid',
 ' citibank hi current research interest includ machin learn',
 ' social network deriv price model portfolio theori',
 ' model default risk system risk ventur capit He',
 ' publish hundr articl academ journal',
 ' numer award research teach hi recent book',
 ' deriv principl practic publish may',
 ' second edit',
 ' sanjiv da A short academ life histori',
 ' after loaf work mani part asia never realli',
 ' grow sanjiv move new york chang world hope',
 ' research He graduat Ph D nyu',
 ' sinc spent five year boston live san jose',
 ' california sanjiv love anim place world',
 ' mountain meet sea ride sport motorbik read gadget',
 ' scienc fiction movi write cool softwar code when',
 ' time avail excit daili life sanjiv write',
 ' academ paper help relax alway contrarian sanjiv',
 ' think new york citi calm place world',
 ' california cours',
 ' sanjiv professor financ santa clara univers He came',
 ' scu harvard busi school spent year UC berkeley In',
 ' past life unreal world sanjiv work citibank N A',
 ' asia pacif region He take great pleasur merg mani',
 ' previou live current exist incred confus',
 ' divers',
 ' sanjiv research style instil distinct new york state',
 ' mind chaotic divers minim method mad He',
 ' publish articl deriv term structur model mutual',
 ' fund internet portfolio choic bank model credit risk',
 ' unpublish articl mani area some year ago took',
 ' time get anoth degre comput scienc berkeley',
 ' confirm uncheck hobbi quickli becom obsess',
 ' there learnt fascin field random algorithm',
 ' skill appli earnestli editori work',
 ' pursuit mani stem epicent silicon',
 ' valley',
 ' coastal live lot mold sanjiv need live near',
 ' ocean the mani walk greenwich villag convinc',
 ' thing repres investor yet ad mani uniqu',
 ' featur person util function He learnt',
 ' import open academ door ivori tower let world',
 ' academia real challeng given reconcil mani',
 ' opinion idea He known turn mani',
 ' offer mad magazin publish academ work As often',
 ' explain never realli finish educ check',
 ' time like never leav which doom',
 ' lifetim hotel california and believ',
 ' bad get life realli pretti good']


from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def lemmText(text_array):
    WNlemmatizer = WordNetLemmatizer()
    lemmatized_text = []
    for h in text_array:
        words = word_tokenize(h) 
        h2 = ''
        for w in words:
            h2 = h2 + ' ' + WNlemmatizer.lemmatize(w)
        lemmatized_text.append(h2)
    return lemmatized_text


#Example
temp = stopText(removeNumbers(removePunc(bio.splitlines()[15:22])))
print('Original: ',temp)
bio_lemm = lemmText(temp)
print('Lemmatized: ',bio_lemm)
bio_stem = stemText(temp)
print('Stemmed: ',bio_stem)

Original:  [' Citibank His current research interests include machine learning', ' social networks derivatives pricing models portfolio theory', ' modeling default risk systemic risk venture capital He', ' published hundred articles academic journals', ' numerous awards research teaching His recent book', ' Derivatives Principles Practice published May', ' second edition']
Lemmatized:  [' Citibank His current research interest include machine learning', ' social network derivative pricing model portfolio theory', ' modeling default risk systemic risk venture capital He', ' published hundred article academic journal', ' numerous award research teaching His recent book', ' Derivatives Principles Practice published May', ' second edition']
Stemmed:  [' citibank hi current research interest includ machin learn', ' social network deriv price model portfolio theori', ' model default risk system risk ventur capit He', ' publish hundr articl academ journal', ' numer award research teach hi recent book', ' deriv principl practic publish may', ' second edit']


#Example
bio2 = bio.splitlines()
bio2 = [j for j in bio2 if len(j)>0 ]

#Get TFIDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()  #The documents go to fit_transform, not to the constructor
tfs = tfidf.fit_transform(bio2)
# Make TDM
tdm_mat = tfs.toarray().T
print(tdm_mat.shape)
tdm_mat

(326, 60)

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.49908804, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
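
To make the numbers in this matrix less opaque, here is a hand computation of TF-IDF on a toy count matrix. It is a sketch that assumes scikit-learn's default settings (smoothed idf, L2 normalization of each document, raw term counts); the matrix `counts` is made up for illustration:

```python
import numpy as np

# Hypothetical toy corpus: rows are documents, columns are raw term counts
counts = np.array([[1, 1, 0],
                   [0, 1, 1],
                   [0, 1, 0]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed inverse document frequency
tfidf = counts * idf                         # weight the counts by idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each document row
print(tfidf.round(3))
```

The middle term occurs in every document, so its idf is exactly 1, the smallest possible weight; the rarer terms get larger weights within the same document.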


from sklearn.decomposition import NMF
nmf = NMF(n_components=10, solver="mu", max_iter=1000)
print(nmf)
A = nmf.fit_transform(tdm_mat)
B = nmf.components_
print(A.shape)
print(B.shape)
print(A.min(),B.min())
print((tdm_mat - A.dot(B)).max())

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=1000,
  n_components=10, random_state=None, shuffle=False, solver='mu',
  tol=0.0001, verbose=0)
(326, 10)
(10, 60)
0.0 0.0
1.0
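
The option solver="mu" selects multiplicative updates. The sketch below is not scikit-learn's implementation; it is a bare-bones version of the classic multiplicative update rules for the Frobenius loss, run on a random non-negative matrix `V` (an assumption for illustration), to show why both factors stay non-negative while the reconstruction error falls:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 4))      # hypothetical non-negative data matrix
k, eps = 2, 1e-10           # number of components; eps guards against division by zero
W = rng.random((6, k))
H = rng.random((k, 4))

err_start = np.linalg.norm(V - W @ H)
for _ in range(200):
    # Each update multiplies by a ratio of non-negative terms, so signs never flip
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
err_end = np.linalg.norm(V - W @ H)

print(err_start, err_end)
print(W.min() >= 0, H.min() >= 0)
```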


#Example
docs = bio.splitlines()[:10]

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(docs)
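#get_feature_names_out() needs scikit-learn 1.0 or later; older releases use get_feature_names()
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())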
tdm = df.T 
print(tdm.shape)
print(tdm)

(49, 10)
              0  1  2  3  4  5  6  7  8  9
accounting    0  0  0  0  0  0  0  0  0  1
ahmedabad     0  0  0  0  0  0  0  0  0  1
an            0  0  0  0  0  0  0  0  1  0
and           0  0  0  2  0  0  1  1  0  1
appointments  0  0  0  0  0  1  0  0  0  0
as            0  0  0  0  0  1  0  0  0  0
at            0  0  0  0  1  1  0  0  0  0
berkeley      0  0  0  0  0  0  1  0  1  0
business      0  0  0  0  1  1  0  0  0  0
clara         0  0  0  0  1  0  0  0  0  0
com           0  0  0  0  0  0  0  0  0  1
computer      0  0  0  0  0  0  0  1  0  0
das           0  0  0  1  0  0  0  0  0  0
data          0  0  0  0  1  0  0  0  0  0
degrees       0  0  0  0  0  0  1  0  0  0
economics     0  0  0  0  0  0  0  0  0  1
faculty       0  0  0  0  0  1  0  0  0  0
finance       0  0  0  1  0  0  1  0  0  0
from          0  0  0  0  0  0  0  1  2  0
graduate      0  0  0  0  0  0  1  0  0  0
harvard       0  0  0  0  0  1  0  0  0  0
he            0  0  0  0  1  0  1  0  0  0
held          0  0  0  0  0  1  0  0  0  0
holds         0  0  0  0  0  0  1  0  0  0
in            0  0  0  0  0  0  1  0  0  1
indian        0  0  0  0  0  0  0  0  1  0
institute     0  0  0  0  0  0  0  0  1  0
is            0  0  0  1  0  0  0  0  0  0
janice        0  0  0  1  0  0  0  0  0  0
leavey        0  0  0  0  1  0  0  0  0  0
management    0  0  0  0  0  0  0  0  0  1
mba           0  0  0  0  0  0  0  0  1  0
new           0  0  0  0  0  0  0  1  0  0
of            0  0  0  1  1  0  0  0  1  0
ph            0  0  0  0  0  0  0  1  0  0
phil          0  0  0  0  0  0  0  1  0  0
post          0  0  0  0  0  0  1  0  0  0
previously    0  0  0  0  0  1  0  0  0  0
professor     0  0  0  1  0  1  0  0  0  0
sanjiv        0  0  0  1  0  0  0  0  0  0
santa         0  0  0  0  1  0  0  0  0  0
school        0  0  0  0  1  0  1  0  0  0
science       0  0  0  0  1  0  0  1  0  0
terry         0  0  0  1  0  0  0  0  0  0
the           0  0  0  1  0  0  0  0  1  0
uc            0  0  0  0  0  0  1  0  1  0
university    0  0  0  0  1  0  0  1  0  1
william       0  0  0  1  0  0  0  0  0  0
york          0  0  0  0  0  0  0  1  0  0


#Using SciPy
from scipy.linalg import svd
T,S,Dt = svd(tdm)
print(T.shape, S.shape, Dt.shape)
print(S)

(49, 49) (10,) (10, 10)
[ 4.48419389e+00  3.40404159e+00  3.34728247e+00  3.10699004e+00
  2.98154402e+00  2.62943757e+00  2.37555726e+00  8.76143088e-17
  0.00000000e+00 -0.00000000e+00]
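
Only seven singular values are materially different from zero because the first three documents are empty lines (columns 0 to 2 of the matrix are all zeros). The sketch below, using a made-up matrix `M` and NumPy rather than SciPy, shows how a numerical rank can be read off the singular values with a tolerance:

```python
import numpy as np

# Hypothetical 4x3 count matrix whose last column is all zeros (an empty document)
M = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 2, 0],
              [0, 1, 0]], dtype=float)

S = np.linalg.svd(M, compute_uv=False)
# Same style of tolerance that np.linalg.matrix_rank uses by default
tol = S.max() * max(M.shape) * np.finfo(float).eps
numerical_rank = int((S > tol).sum())
print(S, numerical_rank)
```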


#Using SkLearn
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=5, n_iter=100, random_state=42)
svd.fit(tdm)

TruncatedSVD(algorithm='randomized', n_components=5, n_iter=100,
       random_state=42, tol=0.0)


print(svd.explained_variance_ratio_)  
print(svd.explained_variance_ratio_.sum())  
print(svd.singular_values_)

[0.11395279 0.18194832 0.18837593 0.16128021 0.14944096]
0.7949982025652943
[4.48419389 3.40404159 3.34728247 3.10699004 2.98154402]


from sklearn.utils.extmath import randomized_svd
T, S, Dt = randomized_svd(tdm.values, n_components=5, n_iter=100, random_state=42)
print(T.shape, S.shape, Dt.shape)
print(S)

(49, 5) (5,) (5, 10)
[4.48419389 3.40404159 3.34728247 3.10699004 2.98154402]


%%R
dir.create("D", showWarnings = FALSE)  #portable alternative to system("mkdir D")
write( c("blue", "red", "green"), file=paste("D", "D1.txt", sep="/"))
write( c("black", "blue", "red"), file=paste("D", "D2.txt", sep="/"))
write( c("yellow", "black", "green"), file=paste("D", "D3.txt", sep="/"))
write( c("yellow", "red", "black"), file=paste("D", "D4.txt", sep="/"))


%%R
library(lsa)
tdm = textmatrix("D",minWordLength=1)
print(tdm)
unlink("D", recursive = TRUE)  #portable alternative to system("rm -rf D")

        docs
terms    D1.txt D2.txt D3.txt D4.txt
  blue        1      1      0      0
  green       1      0      1      0
  red         1      1      0      1
  black       0      1      1      1
  yellow      0      0      1      1


%%R
et = eigen(tdm %*% t(tdm))$vectors
print(et)

ed = eigen(t(tdm) %*% tdm)$vectors
print(ed)

          [,1]          [,2]        [,3]          [,4]       [,5]
[1,] 0.3629044 -6.015010e-01 -0.06829369 -3.717480e-01  0.6030227
[2,] 0.3328695  1.387779e-16 -0.89347008  3.053113e-16 -0.3015113
[3,] 0.5593741 -3.717480e-01  0.31014767  6.015010e-01 -0.3015113
[4,] 0.5593741  3.717480e-01  0.31014767 -6.015010e-01 -0.3015113
[5,] 0.3629044  6.015010e-01 -0.06829369  3.717480e-01  0.6030227
          [,1]      [,2]       [,3]      [,4]
[1,] 0.4570561  0.601501 -0.5395366 -0.371748
[2,] 0.5395366  0.371748  0.4570561  0.601501
[3,] 0.4570561 -0.601501 -0.5395366  0.371748
[4,] 0.5395366 -0.371748  0.4570561 -0.601501
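
The eigenvectors printed above are the singular vectors of the term-document matrix (up to sign), and the corresponding eigenvalues are the squared singular values. A NumPy sketch of that relationship, with the color matrix typed in by hand from the printed TDM:

```python
import numpy as np

# Term-document matrix of the color example (rows: blue, green, red, black, yellow)
tdm = np.array([[1, 1, 0, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 1, 1, 1],
                [0, 0, 1, 1]], dtype=float)

eigvals_docs = np.linalg.eigvalsh(tdm.T @ tdm)[::-1]    # 4 eigenvalues, descending
eigvals_terms = np.linalg.eigvalsh(tdm @ tdm.T)[::-1]   # 5 eigenvalues, descending
sing_vals = np.linalg.svd(tdm, compute_uv=False)

print(np.sqrt(eigvals_docs))
print(sing_vals)
```

The 5x5 term-side product has one extra eigenvalue, which is zero up to rounding, since the matrix has only four columns.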


%%R
res = lsa(tdm,dims=dimcalc_share())
print(res)

$tk
             [,1]          [,2]
blue   -0.3629044 -6.015010e-01
green  -0.3328695 -5.551115e-17
red    -0.5593741 -3.717480e-01
black  -0.5593741  3.717480e-01
yellow -0.3629044  6.015010e-01

$dk
             [,1]      [,2]
D1.txt -0.4570561 -0.601501
D2.txt -0.5395366 -0.371748
D3.txt -0.4570561  0.601501
D4.txt -0.5395366  0.371748

$sk
[1] 2.746158 1.618034

attr(,"class")
[1] "LSAspace"


%%R
res2 = svd(tdm)
print(res2)

$d
[1] 2.746158 1.618034 1.207733 0.618034

$u
           [,1]          [,2]        [,3]          [,4]
[1,] -0.3629044 -6.015010e-01  0.06829369  3.717480e-01
[2,] -0.3328695 -5.551115e-17  0.89347008 -3.441691e-15
[3,] -0.5593741 -3.717480e-01 -0.31014767 -6.015010e-01
[4,] -0.5593741  3.717480e-01 -0.31014767  6.015010e-01
[5,] -0.3629044  6.015010e-01  0.06829369 -3.717480e-01

$v
           [,1]      [,2]       [,3]      [,4]
[1,] -0.4570561 -0.601501  0.5395366 -0.371748
[2,] -0.5395366 -0.371748 -0.4570561  0.601501
[3,] -0.4570561  0.601501  0.5395366  0.371748
[4,] -0.5395366  0.371748 -0.4570561 -0.601501


%%R
tdm_lsa = res$tk %*% diag(res$sk) %*% t(res$dk)
print(tdm_lsa)

           D1.txt    D2.txt     D3.txt    D4.txt
blue    1.0409089 0.8995016 -0.1299115 0.1758948
green   0.4178005 0.4931970  0.4178005 0.4931970
red     1.0639006 1.0524048  0.3402938 0.6051912
black   0.3402938 0.6051912  1.0639006 1.0524048
yellow -0.1299115 0.1758948  1.0409089 0.8995016


%%R
library(Matrix)
print(rankMatrix(tdm))

[1] 4
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 1.110223e-15
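
As a cross-check in Python (a sketch using NumPy, not the lsa package), keeping the two largest singular values reproduces the LSA reconstruction printed above and lowers the rank from 4 to 2:

```python
import numpy as np

# Term-document matrix of the color example (rows: blue, green, red, black, yellow)
tdm = np.array([[1, 1, 0, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 1, 1, 1],
                [0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(tdm, full_matrices=False)
k = 2                                            # number of dimensions to keep
tdm_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k reconstruction

print(tdm_k.round(7))
print(np.linalg.matrix_rank(tdm), np.linalg.matrix_rank(tdm_k))
```

The sign of individual singular vectors may differ between libraries, but the product of the three factors does not depend on it.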


%%R
print(rankMatrix(tdm_lsa))



%%R
suppressMessages(library(text2vec))


%%R
suppressMessages(library(data.table))
data("movie_review")
setDT(movie_review)
setkey(movie_review, id)
set.seed(2016L)
all_ids = movie_review$id
train_ids = sample(all_ids, 4000)
test_ids = setdiff(all_ids, train_ids)
train = movie_review[J(train_ids)]
test = movie_review[J(test_ids)]

print(head(train))

         id sentiment
1:  11912_2         0
2: 11507_10         1
3:   8194_9         1
4: 11426_10         1
5:   4043_3         0
6:  11287_3         0
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   review
1:                                                                                                                                           The story behind this movie is very interesting, and in general the plot is not so bad... but the details: writing, directing, continuity, pacing, action sequences, stunts, and use of CG all cheapen and spoil the film.<br /><br />First off, action sequences. They are all quite unexciting. Most consist of someone standing up and getting shot, making no attempt to run, fight, dodge, or whatever, even though they have all the time in the world. The sequences just seem bland for something made in 2004.<br /><br />The CG features very nicely rendered and animated effects, but they come off looking cheap because of how they are used.<br /><br />Pacing: everything happens too quickly. For example, \\"Elle\\" is trained to fight in a couple of hours, and from the start can do back-flips, etc. Why is she so acrobatic? None of this is explained in the movie. As Lilith, she wouldn't have needed to be able to do back flips - maybe she couldn't, since she had wings.<br /><br />Also, we have sequences like a woman getting run over by a car, and getting up and just wandering off into a deserted room with a sink and mirror, and then stabbing herself in the throat, all for no apparent reason, and without any of the spectators really caring that she just got hit by a car (and then felt the secondary effects of another, exploding car)... \\"Are you okay?\\" asks the driver \\"yes, I'm fine\\" she says, bloodied and disheveled.<br /><br />I watched it all, though, because the introduction promised me that it would be interesting... but in the end, the poor execution made me wish for anything else: Blade, Vampire Hunter D, even that movie with vampires where Jackie Chan was comic relief, because they managed to suspend my disbelief, but this just made me want to shake the director awake, and give the writer a good talking to.
2:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                I remember the original series vividly mostly due to it's unique blend of wry humor and macabre subject matter. Kolchak was hard-bitten newsman from the Ben Hecht school of big-city reporting, and his gritty determination and wise-ass demeanor made even the most mundane episode eminently watchable. My personal fave was \\"The Spanish Moss Murders\\" due to it's totally original storyline. A poor,troubled Cajun youth from Louisiana bayou country, takes part in a sleep research experiment, for the purpose of dream analysis. Something goes inexplicably wrong, and he literally dreams to life a swamp creature inhabiting the dark folk tales of his youth. This malevolent manifestation seeks out all persons who have wronged the dreamer in his conscious state, and brutally suffocates them to death. Kolchak investigates and uncovers this horrible truth, much to the chagrin of police captain Joe \\"Mad Dog\\" Siska(wonderfully essayed by a grumpy Keenan Wynn)and the head sleep researcher played by Second City improv founder, Severn Darden, to droll, understated perfection. The wickedly funny, harrowing finale takes place in the Chicago sewer system, and is a series highlight. Kolchak never got any better. Timeless.
3:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Despite the other comments listed here, this is probably the best Dirty Harry movie made; a film that reflects -- for better or worse -- the country's socio-political feelings during the Reagan glory years of the early '80's. It's also a kickass action movie.<br /><br />Opening with a liberal, female judge overturning a murder case due to lack of tangible evidence and then going straight into the coffee shop encounter with several unfortunate hoodlums (the scene which prompts the famous, \\"Go ahead, make my day\\" line), \\"Sudden Impact\\" is one non-stop roller coaster of an action film. The first time you get to catch your breath is when the troublesome Inspector Callahan is sent away to a nearby city to investigate the background of a murdered hood. It gets only better from there with an over-the-top group of grotesque thugs for Callahan to deal with along with a sherriff with a mysterious past. Superb direction and photography and a at-times hilarious script help make this film one of the best of the '80's.
4:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                I think this movie would be more enjoyable if everyone thought of it as a picture of colonial Africa in the 50's and 60's rather than as a story. Because there is no real story here. Just one vignette on top of another like little points of light that don't mean much until you have enough to paint a picture. The first time I saw Chocolat I didn't really \\"get it\\" until having thought about it for a few days. Then I realized there were lots of things to \\"get\\", including the end of colonialism which was but around the corner, just no plot. Anyway, it's one of my all-time favorite movies. The scene at the airport with the brief shower and beautiful music was sheer poetry. If you like \\"exciting\\" movies, don't watch this--you'll be bored to tears. But, for some of you..., you can thank me later for recommending it to you.
5:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The film begins with promise, but lingers too long in a sepia world of distance and alienation. We are left hanging, but with nothing much else save languid shots of grave and pensive male faces to savour. Certainly no rope up the wall to help us climb over. It's a shame, because the concept is not without merit.<br /><br />We are left wondering why a loving couple - a father and son no less - should be so estranged from the real world that their own world is preferable when claustrophobic beyond all imagining. This loss of presence in the real world is, rather too obviously and unnecessarily, contrasted with the son having enlisted in the armed forces. Why not the circus, so we can at least appreciate some colour? We are left with a gnawing sense of loss, but sadly no enlightenment, which is bewildering given the film is apparently about some form of attainment not available to us all.
6: This is a film that had a lot to live down to . on the year of its release legendary film critic Barry Norman considered it the worst film of the year and I'd heard nothing but bad things about it especially a plot that was criticised for being too complicated <br /><br />To be honest the plot is something of a red herring and the film suffers even more when the word \\" plot \\" is used because as far as I can see there is no plot as such . There's something involving Russian gangsters , a character called Pete Thompson who's trying to get his wife Sarah pregnant , and an Irish bloke called Sean . How they all fit into something called a \\" plot \\" I'm not sure . It's difficult to explain the plots of Guy Ritchie films but if you watch any of his films I'm sure we can all agree that they all posses one no matter how complicated they may seem on first viewing . Likewise a James Bond film though the plots are stretched out with action scenes . You will have a serious problem believing RANCID ALUMINIUM has any type of central plot that can be cogently explained <br /><br />Taking a look at the cast list will ring enough warning bells as to what sort of film you'll be watching . Sadie Frost has appeared in some of the worst British films made in the last 15 years and she's doing nothing to become inconsistent . Steven Berkoff gives acting a bad name ( and he plays a character called Kant which sums up the wit of this movie ) while one of the supporting characters is played by a TV presenter presumably because no serious actress would be seen dead in this <br /><br />The only good thing I can say about this movie is that it's utterly forgettable . I saw it a few days ago and immediately after watching I was going to write a very long a critical review warning people what they are letting themselves in for by watching , but by now I've mainly forgotten why . But this doesn't alter the fact that I remember disliking this piece of crap immensely


%%R
prep_fun = tolower
tok_fun = word_tokenizer

#Create an iterator to pass to the create_vocabulary function
it_train = itoken(train$review, 
             preprocessor = prep_fun, 
             tokenizer = tok_fun, 
             ids = train$id, 
             progressbar = FALSE)

#Now create a vocabulary
vocab = create_vocabulary(it_train)
print(vocab)

Number of docs: 4000 
0 stopwords:  ... 
ngram_min = 1; ngram_max = 1 
Vocabulary: 
           term term_count doc_count
    1:      ufo          1         1
    2:    rader          1         1
    3:  bouchet          1         1
    4: atherton          1         1
    5:   cyhper          1         1
   ---                              
38302:       to      22095      3805
38303:       of      23653      3792
38304:        a      26614      3878
38305:      and      27069      3877
38306:      the      54362      3969
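
The vocabulary printed above has two counts per term: how often the term occurs overall (term_count) and in how many documents it occurs (doc_count). A minimal Python sketch of the same bookkeeping follows. It assumes lowercasing and whitespace tokenization, which is simpler than text2vec's word_tokenizer, and the names `build_vocabulary` and `toy_docs` are illustrative only.

```python
from collections import Counter

def build_vocabulary(docs):
    """Return {term: (term_count, doc_count)} for lowercased, whitespace-split docs."""
    term_count = Counter()
    doc_count = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        term_count.update(tokens)        # every occurrence counts
        doc_count.update(set(tokens))    # each document counts at most once per term
    return {t: (term_count[t], doc_count[t]) for t in term_count}

# Toy documents (made up for illustration, not taken from the review data)
toy_docs = ["The film was good", "the plot was thin", "Good film good cast"]
vocab = build_vocabulary(toy_docs)
print(vocab["good"])
```

Here "good" occurs three times but in only two documents, which is the same distinction the term_count and doc_count columns draw.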


%%R
vectorizer = vocab_vectorizer(vocab)


%%R
dtm_train = create_dtm(it_train, vectorizer)
print(dim(as.matrix(dtm_train)))

[1]  4000 38306


%%R
vocab = create_vocabulary(it_train, ngram = c(1, 2))
print(vocab)

Number of docs: 4000 
0 stopwords:  ... 
ngram_min = 1; ngram_max = 2 
Vocabulary: 
                    term term_count doc_count
     1:         old_used          1         1
     2:   corey_savier's          1         1
     3: key_monosyllabic          1         1
     4:    rural_england          1         1
     5:  apartment_house          1         1
    ---                                      
406813:               to      22095      3805
406814:               of      23653      3792
406815:                a      26614      3878
406816:              and      27069      3877
406817:              the      54362      3969
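
With ngram = c(1, 2) the vocabulary holds both single words and adjacent word pairs joined by an underscore, which is why it grows from 38,306 to 406,817 terms. A minimal sketch of how such terms can be generated from one token list is shown below. The function name `ngrams` and the sample sentence are assumptions made for illustration.

```python
def ngrams(tokens, n_min=1, n_max=2, sep="_"):
    """Return all n-grams of length n_min..n_max, joined with `sep`."""
    out = []
    for n in range(n_min, n_max + 1):
        # Slide a window of length n over the token list
        for i in range(len(tokens) - n + 1):
            out.append(sep.join(tokens[i:i + n]))
    return out

terms = ngrams("rural england was quiet".split())
print(terms)
```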


%%R
library(glmnet)
library(magrittr)
NFOLDS = 5

vocab = vocab %>% prune_vocabulary(term_count_min = 10, 
                   doc_proportion_max = 0.5) 
print(vocab)

Loading required package: foreach
Loaded glmnet 2.0-16

Number of docs: 4000 
0 stopwords:  ... 
ngram_min = 1; ngram_max = 2 
Vocabulary: 
                 term term_count doc_count
    1:      and_loved         10        10
    2: screenplay_and         10         9
    3:        you_saw         10        10
    4:        was_her         10        10
    5:       feel_bad         10         9
   ---                                    
17663:           from       3369      1914
17664:           they       3426      1646
17665:             by       3727      1919
17666:             he       4293      1588
17667:            his       4808      1732
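
Pruning keeps a term only if it is frequent enough to be informative and not so widespread that it appears in most documents. The sketch below mimics the two filters used in the call above (term_count_min = 10, doc_proportion_max = 0.5) on a few rows copied from the printed vocabularies. It is an illustration of the filtering rule, not text2vec's implementation.

```python
def prune_vocabulary(vocab, n_docs, term_count_min=1, doc_proportion_max=1.0):
    """Keep terms with enough occurrences that appear in a small enough share of docs."""
    return {
        term: (tc, dc)
        for term, (tc, dc) in vocab.items()
        if tc >= term_count_min and dc / n_docs <= doc_proportion_max
    }

# (term_count, doc_count) pairs taken from the outputs shown in this notebook
sample_vocab = {
    "the": (54362, 3969),
    "his": (4808, 1732),
    "and_loved": (10, 10),
    "ufo": (1, 1),
}
pruned = prune_vocabulary(sample_vocab, n_docs=4000,
                          term_count_min=10, doc_proportion_max=0.5)
print(sorted(pruned))
```

"the" is dropped because it appears in nearly every document, and "ufo" is dropped because it occurs only once.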


%%R
bigram_vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, bigram_vectorizer)
res = cv.glmnet(x = dtm_train, y = train[['sentiment']], 
                 family = 'binomial', 
                 alpha = 1,
                 type.measure = "auc",
                 nfolds = NFOLDS,
                 thresh = 1e-3,
                 maxit = 1e3) 
plot(res)


%%R
print(names(res))
cat("AUC (area under curve):")
print(max(res$cvm))

 [1] "lambda"     "cvm"        "cvsd"       "cvup"       "cvlo"      
 [6] "nzero"      "name"       "glmnet.fit" "lambda.min" "lambda.1se"
AUC (area under curve):[1] 0.9251195


%%R
#Out-of-sample test
it_test = test$review %>% 
  prep_fun %>% 
  tok_fun %>% 
  itoken(ids = test$id, 
         # turn off progressbar because it won't look nice in rmd
         progressbar = FALSE)

dtm_test = create_dtm(it_test, bigram_vectorizer)
preds = predict(res, dtm_test, type = 'response')[,1]
glmnet:::auc(test$sentiment, preds)

[1] 0.9316295
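
The AUC can be read as the probability that a randomly chosen positive review gets a higher predicted score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by direct pair counting, which is fine for small samples. The labels and scores are made up for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

toy_auc = auc([1, 0, 1, 0, 1], [0.9, 0.4, 0.6, 0.7, 0.8])
print(round(toy_auc, 4))
```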


%%R
vocab = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, vectorizer)

tfidf = TfIdf$new()
dtm_train_tfidf = fit_transform(dtm_train, tfidf)
dtm_test_tfidf  = create_dtm(it_test, vectorizer) %>% transform(tfidf)
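
Note the asymmetry in the cell above: the TF-IDF model is fitted on the training DTM and then only applied to the test DTM, so the test data never influences the IDF weights. The sketch below shows that pattern with one textbook variant of TF-IDF (term frequency normalized by document length, IDF = log(N/df)). The exact smoothing and normalization used by text2vec's TfIdf may differ, and all names here are illustrative.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn IDF weights from tokenized training documents only."""
    n = len(train_docs)
    df = Counter()
    for tokens in train_docs:
        df.update(set(tokens))
    return {term: math.log(n / d) for term, d in df.items()}

def transform(tokens, idf):
    """Apply previously fitted IDF weights; terms unseen in training are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items()}

train = [["good", "film"], ["bad", "film"], ["good", "cast"]]
idf = fit_idf(train)
weights = transform(["good", "film", "film", "unseen"], idf)
print(weights)
```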


%%R
## Refit classifier
## Now we take the TF-IDF adjusted DTM and run the classifier.

res = cv.glmnet(x = dtm_train_tfidf, y = train[['sentiment']], 
                              family = 'binomial', 
                              alpha = 1,
                              type.measure = "auc",
                              nfolds = NFOLDS,
                              thresh = 1e-3,
                              maxit = 1e3)
print(paste("max AUC =", round(max(res$cvm), 4)))

[1] "max AUC = 0.9113"


%%R
#Test on hold-out sample
preds = predict(res, dtm_test_tfidf, type = 'response')[,1]
glmnet:::auc(test$sentiment, preds)

[1] 0.9063606


%%R
data("movie_review")
tokens = movie_review$review %>% tolower %>% word_tokenizer()
it = itoken(tokens)
v = create_vocabulary(it) %>% prune_vocabulary(term_count_min=10)
vectorizer = vocab_vectorizer(v)  #, grow_dtm = FALSE, skip_grams_window = 5)
tcm = create_tcm(it, vectorizer, skip_grams_window=5)
print(dim(tcm))

[1] 7805 7805
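
The term co-occurrence matrix (TCM) is square, one row and one column per vocabulary term, and records how often two terms appear within a few positions of each other. A minimal sketch with plain, unweighted counts is given below. It assumes a symmetric look-ahead window and does not attempt to reproduce any distance weighting that text2vec may apply, and the function name is hypothetical.

```python
from collections import defaultdict

def cooccurrence_counts(docs, window=5):
    """Count unordered word pairs that occur within `window` positions of each other."""
    counts = defaultdict(float)
    for tokens in docs:
        for i, word in enumerate(tokens):
            # Look ahead at most `window` tokens; each pair is counted once per occurrence
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                pair = tuple(sorted((word, tokens[j])))
                counts[pair] += 1.0
    return counts

toy_counts = cooccurrence_counts([["a", "b", "c", "a"]], window=2)
print(dict(toy_counts))
```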


%%R
model = GlobalVectors$new(word_vectors_size = 50, vocabulary = v, x_max = 10)
wv_main = model$fit_transform(tcm, n_iter = 25, convergence_tol = 0.01)

INFO [2019-05-19 17:49:29] 2019-05-19 17:49:29 - epoch 1, expected cost 0.0814
INFO [2019-05-19 17:49:30] 2019-05-19 17:49:30 - epoch 2, expected cost 0.0525
INFO [2019-05-19 17:49:30] 2019-05-19 17:49:30 - epoch 3, expected cost 0.0458
INFO [2019-05-19 17:49:31] 2019-05-19 17:49:31 - epoch 4, expected cost 0.0415
INFO [2019-05-19 17:49:31] 2019-05-19 17:49:31 - epoch 5, expected cost 0.0384
INFO [2019-05-19 17:49:32] 2019-05-19 17:49:32 - epoch 6, expected cost 0.0360
INFO [2019-05-19 17:49:32] 2019-05-19 17:49:32 - epoch 7, expected cost 0.0342
INFO [2019-05-19 17:49:33] 2019-05-19 17:49:33 - epoch 8, expected cost 0.0328
INFO [2019-05-19 17:49:33] 2019-05-19 17:49:33 - epoch 9, expected cost 0.0316
INFO [2019-05-19 17:49:34] 2019-05-19 17:49:34 - epoch 10, expected cost 0.0306
INFO [2019-05-19 17:49:34] 2019-05-19 17:49:34 - epoch 11, expected cost 0.0298
INFO [2019-05-19 17:49:35] 2019-05-19 17:49:35 - epoch 12, expected cost 0.0291
INFO [2019-05-19 17:49:35] 2019-05-19 17:49:35 - epoch 13, expected cost 0.0285
INFO [2019-05-19 17:49:36] 2019-05-19 17:49:36 - epoch 14, expected cost 0.0279
INFO [2019-05-19 17:49:36] 2019-05-19 17:49:36 - epoch 15, expected cost 0.0275
INFO [2019-05-19 17:49:37] 2019-05-19 17:49:37 - epoch 16, expected cost 0.0270
INFO [2019-05-19 17:49:37] 2019-05-19 17:49:37 - epoch 17, expected cost 0.0267
INFO [2019-05-19 17:49:38] 2019-05-19 17:49:38 - epoch 18, expected cost 0.0263
INFO [2019-05-19 17:49:38] 2019-05-19 17:49:38 - epoch 19, expected cost 0.0260
INFO [2019-05-19 17:49:39] 2019-05-19 17:49:39 - epoch 20, expected cost 0.0258
INFO [2019-05-19 17:49:39] 2019-05-19 17:49:39 - epoch 21, expected cost 0.0255
INFO [2019-05-19 17:49:39] Success: early stopping. Improvement at iterartion 21 is less then convergence_tol


%%R

print(dim(wv_main))
wv_context = model$components
print(dim(wv_context))

wv = wv_main + t(wv_context)

#wv = model$get_word_vectors()  #Dimension words x wvec_size

#Make distance matrix
d = dist2(wv, method="cosine")  #Smaller values mean closer
print(dim(d))

[1] 7805   50
[1]   50 7805
[1] 7805 7805


%%R
#Pass: w=word, d=dist matrix, n=number of close words
findCloseWords = function(w,d,n) {
  words = rownames(d)
  i = which(words==w)
  if (length(i) > 0) {
    res = sort(d[i,])
    print(as.matrix(res[2:(n+1)]))
  } 
  else {
    print("Word not in corpus.")
  }
}


%%R
findCloseWords("man",d,10)

           [,1]
woman 0.1759534
girl  0.2667730
who   0.2977268
guy   0.2996575
young 0.2998047
plays 0.3508409
boy   0.3725276
old   0.3939009
he    0.3969646
kid   0.3970353


%%R
findCloseWords("woman",d,10)

           [,1]
young 0.1739555
man   0.1759534
girl  0.1913266
guy   0.2662980
who   0.2982760
kid   0.3122388
boy   0.3182666
named 0.3425043
old   0.3647606
plays 0.3782488
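
The lookups above rank words by cosine distance, that is, one minus the cosine of the angle between two word vectors, so smaller values mean closer words. A small Python analogue of findCloseWords is sketched below. The two-dimensional vectors are invented for illustration and are not GloVe output.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def find_close_words(word, vectors, n):
    """Return the n nearest words to `word`, or None if the word is unknown."""
    if word not in vectors:
        return None
    dists = [(other, cosine_distance(vectors[word], vec))
             for other, vec in vectors.items() if other != word]
    return sorted(dists, key=lambda pair: pair[1])[:n]

# Hypothetical toy vectors
toy_vectors = {"man": [1.0, 0.0], "woman": [0.9, 0.1], "film": [0.0, 1.0]}
closest = find_close_words("man", toy_vectors, 2)
print(closest)
```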


%%R
suppressMessages(library(tm))
suppressMessages(library(text2vec))
stopw = stopwords('en')
stopw = c(stopw,"br","t","s","m","ve","2","d","1")


%%R
#Make DTM
data("movie_review")
tokens = movie_review$review %>% tolower %>% word_tokenizer()
it = itoken(tokens)
v = create_vocabulary(it, stopwords = stopw) %>% prune_vocabulary(term_count_min=5)
vectrzr = vocab_vectorizer(v)
dtm = create_dtm(it, vectrzr, skip_grams_window = 5)
print(dim(dtm))

[1]  5000 12803


%%R
#Do LDA
dtm = create_dtm(it, vectrzr, type = "dgTMatrix")

lda = LDA$new(n_topics = 5, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topics = lda$fit_transform(x = dtm, n_iter = 1000, 
                          convergence_tol = 0.001, n_check_convergence = 25, 
                          progressbar = FALSE)

INFO [2019-05-19 17:50:37] iter 25 loglikelihood = -4236738.001
INFO [2019-05-19 17:50:38] iter 50 loglikelihood = -4175059.756
INFO [2019-05-19 17:50:39] iter 75 loglikelihood = -4145685.344
INFO [2019-05-19 17:50:40] iter 100 loglikelihood = -4127472.096
INFO [2019-05-19 17:50:41] iter 125 loglikelihood = -4116825.926
INFO [2019-05-19 17:50:42] iter 150 loglikelihood = -4111737.651
INFO [2019-05-19 17:50:43] iter 175 loglikelihood = -4105210.327
INFO [2019-05-19 17:50:44] iter 200 loglikelihood = -4103461.840
INFO [2019-05-19 17:50:44] early stopping at 200 iteration


%%R
barplot(doc_topics[1, ], xlab = "topic", 
        ylab = "proportion", ylim = c(0, 1), 
        names.arg = 1:ncol(doc_topics))


%%R
#Get top words by topic
lda$get_top_words(n = 10, topic_number = seq(1,5), lambda = 1)

      [,1]     [,2]     [,3]     [,4]     [,5]   
 [1,] "film"   "man"    "film"   "movie"  "film" 
 [2,] "movie"  "one"    "people" "like"   "one"  
 [3,] "good"   "also"   "life"   "just"   "story"
 [4,] "like"   "gets"   "one"    "one"    "great"
 [5,] "one"    "action" "way"    "really" "time" 
 [6,] "just"   "john"   "can"    "bad"    "well" 
 [7,] "films"  "two"    "us"     "good"   "movie"
 [8,] "story"  "get"    "young"  "even"   "love" 
 [9,] "plot"   "new"    "will"   "see"    "also" 
[10,] "really" "night"  "like"   "movies" "best"


%%R
#Get top words by topic, sorted by relevance (set lambda between 0.2 and 0.4)
lda$get_top_words(n = 10, topic_number = seq(1,5), lambda = 0.2)

      [,1]         [,2]      [,3]       [,4]     [,5]          
 [1,] "film"       "match"   "war"      "movie"  "wonderful"   
 [2,] "films"      "stewart" "sister"   "bad"    "novel"       
 [3,] "plot"       "won"     "french"   "show"   "brilliant"   
 [4,] "dialogue"   "team"    "lives"    "stupid" "york"        
 [5,] "director"   "dr"      "society"  "worst"  "excellent"   
 [6,] "average"    "doctor"  "woman"    "just"   "jane"        
 [7,] "forward"    "island"  "white"    "movies" "william"     
 [8,] "script"     "ring"    "becomes"  "kids"   "performances"
 [9,] "characters" "attack"  "military" "guy"    "mary"        
[10,] "actors"     "rock"    "tells"    "like"   "adaptation"
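
Lowering lambda changes the top words because the ranking shifts from raw within-topic probability toward how distinctive a word is for the topic. The sketch below assumes the relevance measure popularized by LDAvis, lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)), which appears to be what the lambda argument controls here. The probabilities are invented for illustration.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Relevance of a word for a topic: blend of probability and lift."""
    return lam * math.log(p_w_given_t) + (1.0 - lam) * math.log(p_w_given_t / p_w)

# Hypothetical probabilities: "film" is common everywhere, "war" is rare but topic-specific
topic_probs = {"film": 0.05, "war": 0.01}     # p(w | topic)
corpus_probs = {"film": 0.04, "war": 0.001}   # p(w) over the whole corpus

def rank(lam):
    return sorted(topic_probs,
                  key=lambda w: relevance(topic_probs[w], corpus_probs[w], lam),
                  reverse=True)

print(rank(1.0))
print(rank(0.2))
```

With lambda = 1 the generic word "film" ranks first, while with lambda = 0.2 the topic-specific word "war" overtakes it, which matches the pattern in the two tables above.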


%%R
#Plot LDA
suppressMessages(library(LDAvis))
lda$plot()

Loading required namespace: servr
Failed with error: ‘there is no package called ‘servr’’
If the visualization doesn't render, install the servr package
and re-run serVis: 
 install.packages('servr') 
Alternatively, you could configure your default browser to allow
access to local files as some browsers block this by default


text = 'A new statement from Boeing indicates that the aerospace manufacturer knew about a problem with the 737 Max aircraft well before the deadly October 2018 Lion Air crash, but decided not to do anything about it.'
print(text)

A new statement from Boeing indicates that the aerospace manufacturer knew about a problem with the 737 Max aircraft well before the deadly October 2018 Lion Air crash, but decided not to do anything about it.


import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

txt = nltk.word_tokenize(text)
txt = nltk.pos_tag(txt)
txt

[('A', 'DT'),
 ('new', 'JJ'),
 ('statement', 'NN'),
 ('from', 'IN'),
 ('Boeing', 'NNP'),
 ('indicates', 'VBZ'),
 ('that', 'IN'),
 ('the', 'DT'),
 ('aerospace', 'NN'),
 ('manufacturer', 'NN'),
 ('knew', 'VBD'),
 ('about', 'IN'),
 ('a', 'DT'),
 ('problem', 'NN'),
 ('with', 'IN'),
 ('the', 'DT'),
 ('737', 'CD'),
 ('Max', 'NNP'),
 ('aircraft', 'NN'),
 ('well', 'RB'),
 ('before', 'IN'),
 ('the', 'DT'),
 ('deadly', 'JJ'),
 ('October', 'NNP'),
 ('2018', 'CD'),
 ('Lion', 'NNP'),
 ('Air', 'NNP'),
 ('crash', 'NN'),
 (',', ','),
 ('but', 'CC'),
 ('decided', 'VBD'),
 ('not', 'RB'),
 ('to', 'TO'),
 ('do', 'VB'),
 ('anything', 'NN'),
 ('about', 'IN'),
 ('it', 'PRP'),
 ('.', '.')]


pattern = 'NP: {<DT>?<JJ>*<NN>}'  #noun phrase = optional determiner (DT), followed by any number of adjectives (JJ), and ending in a noun (NN). 
cp = nltk.RegexpParser(pattern)
cs = cp.parse(txt)
print(cs)

(S
  (NP A/DT new/JJ statement/NN)
  from/IN
  Boeing/NNP
  indicates/VBZ
  that/IN
  (NP the/DT aerospace/NN)
  (NP manufacturer/NN)
  knew/VBD
  about/IN
  (NP a/DT problem/NN)
  with/IN
  the/DT
  737/CD
  Max/NNP
  (NP aircraft/NN)
  well/RB
  before/IN
  the/DT
  deadly/JJ
  October/NNP
  2018/CD
  Lion/NNP
  Air/NNP
  (NP crash/NN)
  ,/,
  but/CC
  decided/VBD
  not/RB
  to/TO
  do/VB
  (NP anything/NN)
  about/IN
  it/PRP
  ./.)
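
In the parse above, "the aerospace manufacturer" is split into two chunks because the grammar ends in exactly one NN. The sketch below reproduces that behaviour with a plain regular expression over the tag sequence, and then shows that allowing one or more nouns keeps the phrase together. It is a simplified stand-in for a chunker rather than NLTK's implementation, the function name is hypothetical, and the `<NN>+` variant is a suggested change, not part of the original grammar.

```python
import re

def chunk_noun_phrases(tagged, tag_pattern):
    """Return the word spans whose tag sequence matches `tag_pattern`."""
    # Encode tags as one string such as "<DT><NN><NN>" and remember where each tag starts
    starts, pos = [], 0
    for _, tag in tagged:
        starts.append(pos)
        pos += len(tag) + 2
    tag_string = "".join("<%s>" % tag for _, tag in tagged)

    phrases = []
    for m in re.finditer(tag_pattern, tag_string):
        first = starts.index(m.start())
        last = first
        while last + 1 < len(starts) and starts[last + 1] < m.end():
            last += 1
        phrases.append(" ".join(word for word, _ in tagged[first:last + 1]))
    return phrases

tagged = [("the", "DT"), ("aerospace", "NN"), ("manufacturer", "NN"),
          ("knew", "VBD"), ("a", "DT"), ("problem", "NN")]

single_noun = chunk_noun_phrases(tagged, r"(<DT>)?(<JJ>)*(<NN>)")
multi_noun = chunk_noun_phrases(tagged, r"(<DT>)?(<JJ>)*(<NN>)+")
print(single_noun)
print(multi_noun)
```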


#Code from the spaCy web site
import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
#If not working: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Process the text above
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
print('----Named Entities----')
for entity in doc.ents:
    print(entity.text, entity.label_)

Noun phrases: ['A new statement', 'Boeing', 'the aerospace manufacturer', 'a problem', 'the 737 Max aircraft', 'the deadly October 2018 Lion Air crash', 'anything', 'it']
Verbs: ['indicate', 'know', 'decide', 'do']
----Named Entities----
Boeing ORG
737 Max PRODUCT
October 2018 DATE
Lion Air ORG


#Read in the corpus
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'reuters/training/'
ctext = PlaintextCorpusReader(corpus_root, '.*')


#Convert corpus to text array with a full string for each doc
def merge_arrays(word_lists):
    wordlist = []
    for wl in word_lists:
        wordlist = wordlist + wl
    doc = ' '.join(wordlist)
    return doc

#Run this through the corpus to get a word array for each doc
text_array = []
for p in ctext.paras():
    doc = merge_arrays(p)
    text_array.append(doc)


#Clean up the docs using the previous functions
news = text_array
news = removePunc(news)
news = removeNumbers(news)
news = stopText(news)
#news = stemText(news)
news = [j.lower() for j in news]


#Select a few random news items
import random
news_sample = random.sample(news,25)


import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


#Tokenize each document
def textTokenize(text_array):
    textTokens = []
    for h in text_array:
        textTokens.append(h.split(' '))
    return textTokens

sentences = textTokenize(news_sample)
print(len(sentences))
type(sentences)

25

list


#Train the model on Word2Vec
model = gensim.models.Word2Vec(sentences, min_count=1)
type(model)

2019-05-19 17:51:21,951 : INFO : collecting all words and their counts
2019-05-19 17:51:21,952 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-05-19 17:51:21,953 : INFO : collected 1085 word types from a corpus of 2481 raw words and 25 sentences
2019-05-19 17:51:21,954 : INFO : Loading a fresh vocabulary
2019-05-19 17:51:21,957 : INFO : min_count=1 retains 1085 unique words (100% of original 1085, drops 0)
2019-05-19 17:51:21,958 : INFO : min_count=1 leaves 2481 word corpus (100% of original 2481, drops 0)
2019-05-19 17:51:21,965 : INFO : deleting the raw counts dictionary of 1085 items
2019-05-19 17:51:21,966 : INFO : sample=0.001 downsamples 55 most-common words
2019-05-19 17:51:21,967 : INFO : downsampling leaves estimated 2150 word corpus (86.7% of prior 2481)
2019-05-19 17:51:21,969 : INFO : estimated required memory for 1085 words and 100 dimensions: 1410500 bytes
2019-05-19 17:51:21,969 : INFO : resetting layer weights
2019-05-19 17:51:21,990 : INFO : training model with 3 workers on 1085 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2019-05-19 17:51:22,006 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-05-19 17:51:22,007 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-05-19 17:51:22,009 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-05-19 17:51:22,010 : INFO : EPOCH - 1 : training on 2481 raw words (2133 effective words) took 0.0s, 507706 effective words/s
2019-05-19 17:51:22,014 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-05-19 17:51:22,014 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-05-19 17:51:22,017 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-05-19 17:51:22,018 : INFO : EPOCH - 2 : training on 2481 raw words (2142 effective words) took 0.0s, 377271 effective words/s
2019-05-19 17:51:22,021 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-05-19 17:51:22,022 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-05-19 17:51:22,024 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-05-19 17:51:22,025 : INFO : EPOCH - 3 : training on 2481 raw words (2152 effective words) took 0.0s, 457254 effective words/s
2019-05-19 17:51:22,027 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-05-19 17:51:22,027 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-05-19 17:51:22,030 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-05-19 17:51:22,030 : INFO : EPOCH - 4 : training on 2481 raw words (2142 effective words) took 0.0s, 596262 effective words/s
2019-05-19 17:51:22,032 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-05-19 17:51:22,033 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-05-19 17:51:22,036 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-05-19 17:51:22,036 : INFO : EPOCH - 5 : training on 2481 raw words (2160 effective words) took 0.0s, 549381 effective words/s
2019-05-19 17:51:22,036 : INFO : training on a 12405 raw words (10729 effective words) took 0.0s, 234067 effective words/s
2019-05-19 17:51:22,037 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay

gensim.models.word2vec.Word2Vec


#t-SNE uses vocabulary from word2vec
from sklearn.manifold import TSNE

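def tsne_plot(model):
    "Creates a t-SNE model from the word2vec vectors and plots it"
    labels = []
    tokens = []

    #Collect each word and its vector; use model.wv[word] to avoid the deprecated model[word]
    for word in model.wv.vocab:
        tokens.append(model.wv[word])
        labels.append(word)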
    
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)

    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
        
    plt.figure(figsize=(16, 16)) 
    for i in range(len(x)):
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()


%%time
figure(figsize=(20,10))
tsne_plot(model)

/home/srdas/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:10: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
  # Remove the CWD from sys.path while we load stuff.

<Figure size 1440x720 with 0 Axes>

CPU times: user 21.7 s, sys: 188 ms, total: 21.9 s
Wall time: 21.1 s

Text Analytics (Advanced Topics)
Basic Textual Data
Basic Text Cleanup
Lemmatization
Non-negative Matrix Factorization (NMF)
Iterative Solution for NMF
NMF by gradient descent
Singular Value Decomposition (SVD)
Latent Semantic Analysis (LSA)
How is LSA implemented using SVD?
Example in R
LSA and Singular Value Decomposition (SVD)
Dimension reduction of the TDM via LSA
LSA and SVD: the connection?
What is the rank of the TDM?
Classification and Word Embeddings using text2vec in R
Preprocessing and tokenization
Iterate and Vectorize
Document Term Matrix (DTM)
N-Grams
TF-IDF
Word Embeddings
GloVe
word2vec explained
Topic Analysis using text2vec
Entity Extraction
Using spaCy
Stochastic Network Embeddings (t-SNE)
Reuters news corpus for t-SNE
Knowledge Graphs
Neural Text Generation
Text Classification with Neural Nets
Linguistic Markers