29. Dimension Reduction (and Topics)#

Squeezing the data down so that it can be digested easily for modeling and analysis.

from google.colab import drive
drive.mount('/content/drive')  # Add My Drive/<>

import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/NLPBook/')
Mounted at /content/drive
%%capture
%pylab inline
import pandas as pd
import os
%load_ext rpy2.ipython
from IPython.display import Image

29.1. Dimension Reduction of Text#

  • We show here that documents end up in a space of a few components (analogous to principal components, PCs).

  • LDA (Latent Dirichlet Allocation) is one way of reducing documents to a smaller topic space.

We can use the wine reviews dataset from Kaggle.

https://www.kaggle.com/thebrownviking20/topic-modelling-with-spacy-and-scikit-learn

Dataset: https://www.kaggle.com/thebrownviking20/topic-modelling-with-spacy-and-scikit-learn/data?select=winemag-data-130k-v2.csv

%%time
os.system('wget https://github.com/morisasy/kaggle/raw/master/data/winemag-data-130k-v2.csv')
os.system('mv winemag-data-130k-v2.csv NLP_data/')
CPU times: user 25 ms, sys: 3.85 ms, total: 28.8 ms
Wall time: 4.84 s
0
wineDF = pd.read_csv("NLP_data/winemag-data-130k-v2.csv")
print(wineDF.shape)
wineDF.head()
(129971, 14)
Unnamed: 0 country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
from tqdm import tqdm
import string
%%capture
!pip install spacy
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
!python -m spacy download en_core_web_sm
# Load in the spaCy object, i.e., the language model
nlp = spacy.load('en_core_web_sm')
wineDF.description
description
0 Aromas include tropical fruit, broom, brimston...
1 This is ripe and fruity, a wine that is smooth...
2 Tart and snappy, the flavors of lime flesh and...
3 Pineapple rind, lemon pith and orange blossom ...
4 Much like the regular bottling from 2012, this...
... ...
129966 Notes of honeysuckle and cantaloupe sweeten th...
129967 Citation is given as much as a decade of bottl...
129968 Well-drained gravel soil gives this wine its c...
129969 A dry style of Pinot Gris, this is crisp with ...
129970 Big, rich and off-dry, this is powered by inte...

129971 rows × 1 columns


# Function to parse the reviews
punctuations = string.punctuation
stopwords = list(STOP_WORDS)

parser = English()
def spacy_tokenizer(sentence):
    mytokens = list(parser(sentence))                # tokenize with spaCy's English parser
#     mytokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    mytokens = [word.lower_ for word in mytokens]    # lowercase each token
    mytokens = [word for word in mytokens if word not in stopwords and word not in punctuations]  # drop stopwords and punctuation
    mytokens = " ".join(mytokens)                    # rejoin the tokens into a single string
    return mytokens

Let’s use the TQDM progress bar. The abbreviation tqdm has interesting naming origins: https://tqdm.github.io

%%time
tqdm.pandas()
wineDF["processed_description"] = wineDF["description"].progress_apply(spacy_tokenizer)
100%|██████████| 129971/129971 [01:12<00:00, 1794.82it/s]
CPU times: user 55.3 s, sys: 586 ms, total: 55.9 s
Wall time: 1min 12s

for j in range(5):
    print(wineDF.processed_description[j])
aromas include tropical fruit broom brimstone dried herb palate overly expressive offering unripened apple citrus dried sage alongside brisk acidity
ripe fruity wine smooth structured firm tannins filled juicy red berry fruits freshened acidity   drinkable certainly better 2016
tart snappy flavors lime flesh rind dominate green pineapple pokes crisp acidity underscoring flavors wine stainless steel fermented
pineapple rind lemon pith orange blossom start aromas palate bit opulent notes honey drizzled guava mango giving way slightly astringent semidry finish
like regular bottling 2012 comes rough tannic rustic earthy herbal characteristics nonetheless think pleasantly unfussy country wine good companion hearty winter stew
# Create the document-term matrix
# min_df=5, max_df=0.9: keep terms appearing in at least 5 and at most 90% of reviews; the (raw-string) token_pattern keeps words of 3+ letters/hyphens
vectorizer = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
DTM = vectorizer.fit_transform(wineDF["processed_description"])
DTM
<129971x11917 sparse matrix of type '<class 'numpy.int64'>'
	with 2953416 stored elements in Compressed Sparse Row format>
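As a quick sanity check (a small sketch using only the objects defined above), we can look at the size and sparsity of the DTM:

# Vocabulary size and sparsity of the document-term matrix
n_docs, n_terms = DTM.shape
print("Documents:", n_docs, "Terms:", n_terms)
print("Fraction of nonzero entries:", DTM.nnz / (n_docs * n_terms))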

29.2. Topics from LDA using scikit-learn#

Next we use scikit-learn to fit the LDA model. This is slow, so we run only a few iterations; the best results come after many more iterations.

%%time
# ~<1 minute per iteration
# Fit LDA with scikit-learn
n_topics = 3
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5, learning_method='online',verbose=True)
data_lda = lda.fit_transform(DTM)
iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
CPU times: user 3min 2s, sys: 1.39 s, total: 3min 3s
Wall time: 3min 6s
# Function for printing the top keywords for each topic
def selected_topics(model, vectorizer, top_n=10):
    terms = vectorizer.get_feature_names_out()       # the vocabulary
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        print([(terms[i], topic[i]) for i in topic.argsort()[:-top_n - 1:-1]])
selected_topics(lda, vectorizer)
Topic 0:
[('flavors', 20950.67018477242), ('wine', 16635.82545843346), ('palate', 16354.406381362307), ('acidity', 15506.05593080573), ('finish', 14769.111553817538), ('fruit', 13941.560939821009), ('apple', 13566.236570759675), ('aromas', 13122.03875126582), ('crisp', 12684.586526258352), ('citrus', 11662.251281882267)]
Topic 1:
[('wine', 55565.60958002671), ('flavors', 27153.00418436964), ('drink', 21554.66699850035), ('fruit', 20633.16021614574), ('ripe', 16809.443714251152), ('acidity', 16485.57645214745), ('tannins', 15305.53765344409), ('rich', 14543.343787563213), ('fruits', 13144.847668129269), ('red', 9631.062687215997)]
Topic 2:
[('aromas', 26374.091755743342), ('cherry', 25480.441094742815), ('palate', 22273.87156641139), ('black', 20613.840965534495), ('tannins', 15636.192367999198), ('finish', 15518.815192377546), ('fruit', 15467.455651759687), ('flavors', 14852.499696966233), ('spice', 12117.49492696586), ('red', 11960.038937625673)]
new_review = "This wine terrific aromatic flavor hint berries almonds lingering aftertaste. Both winter and spring varietals."
text = spacy_tokenizer(new_review)
vec = lda.transform(vectorizer.transform([text]))[0]
print("Topic vector score=",vec)
print("Topic ordering:",argsort(vec)[::-1])
Topic vector score= [0.60990689 0.35577218 0.03432093]
Topic ordering: [0 1 2]
vocab = vectorizer.get_feature_names_out()
vocab
array(['-georges', 'aaron', 'abacela', ..., 'zonin', 'zucchini',
       'zweigelt'], dtype=object)

So, we are now in a position to see how to reduce the dimension of the corpus.

  • Note that the document-term matrix \(M\) can be reduced to a decomposition using matrix factorization. \(M\) is of dimension \(d \times t\), where \(d\) is the number of documents and \(t\) is the number of terms.

  • We set \(M = D \cdot T\), where \(D \in {R}^{d \times k}\), where \(k\) is the number of topics.

  • \(T \in {R}^{k \times t}\).

  • We can think of this as projecting the documents onto a smaller set of topics instead of terms, i.e., \(k \ll t\).

  • We can think of this as projecting the terms onto a smaller set of topics instead of documents, i.e., \(k \ll d\).

Image("NLP_images/DTM_reduction.png", width=500)
# Check for dimension reduction
data_lda.shape
(129971, 3)
lda.perplexity(DTM)
962.4460528706168
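Concretely, in scikit-learn terms, \(D\) corresponds to data_lda (documents × topics) and \(T\) corresponds to lda.components_ (topics × terms). A quick shape check (a minimal sketch using the objects already fitted above):

# D: document-topic matrix, T: topic-term matrix
print("D shape:", data_lda.shape)          # expect (129971, 3)
print("T shape:", lda.components_.shape)   # expect (3, 11917)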

29.3. Non-Negative Matrix Factorization (NMF)#

  • As noted above we are looking to decompose the DTM \(M\) into \(D \cdot T\), with common dimension of size \(k\).

  • We apply the Frobenius norm:

\[ f = ||M - D \cdot T ||_F^2 = \sum_d \sum_t \left(M_{dt} - (D T)_{dt}\right)^2, \quad \text{where} \;\; (D T)_{dt} = \sum_k D_{dk} T_{kt}. \]
  • Let \(\nabla_D f\) denote the gradient of \(f\) with respect to \(D\) (and \(\nabla_T f\) the gradient with respect to \(T\)).

  • We minimize \(f\) by iteration. First, hold \(T\) fixed and find the best \(D\), then hold \(D\) fixed and find the best \(T\).

  • If we hold \(T\) fixed and differentiate, expanding the square gives \(f = \sum_d \sum_t \left(M_{dt}^2 - 2 M_{dt} (DT)_{dt} + (DT)_{dt}^2\right)\), so that \(\nabla_D f = -2\left(MT^\top - DTT^\top\right)\).

  • Setting \(\nabla_D f = 0\) implies that

\[ MT^\top = DTT^\top \]
  • We want to choose the elements of \(D\) such that

    1. The LHS is as close to the RHS as possible

    2. The elements of \(D\) are non-negative, else we are not doing NMF

  • Using gradients, we set up the algorithm as follows:

\[ D \leftarrow D + \eta_D [MT^\top - DTT^\top] \]
  • We can choose the learning rate \(\eta_D\) in a clever way to make sure that elements of \(D \geq 0\), i.e., \(\eta_D = \frac{D}{DTT^\top}\), the fraction taken elementwise. So, we now get

\[ D \leftarrow D + \frac{D}{DTT^\top} [MT^\top - DTT^\top] \]
  • or, elementwise (with \(d\) indexing documents and \(k\) indexing topics)

\[ D_{dk} \leftarrow D_{dk} \cdot \frac{(M T^\top)_{dk}}{(D T T^\top)_{dk}} \]
  • This update rule keeps \(D_{dk} \geq 0, \forall d, k\) (given non-negative starting values for \(D\) and \(T\)). Correspondingly, the update rule for the elements of \(T\) is as follows (a small NumPy sketch of both updates appears after this derivation):

\[ T_{kt} \leftarrow T_{kt} \cdot \frac{(D^\top M)_{kt}}{(D^\top D T)_{kt}} \]
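To make these multiplicative updates concrete, here is a minimal NumPy sketch on a small synthetic non-negative matrix (the matrix sizes, iteration count, and the small epsilon added to the denominators to avoid division by zero are illustrative choices, not from the derivation above):

import numpy as np

# Minimal sketch of the NMF multiplicative updates on synthetic data
rng = np.random.default_rng(0)
d, t, k = 20, 30, 3                          # documents, terms, topics
M = rng.random((d, t))                       # non-negative "document-term" matrix
D = rng.random((d, k))                       # initial document-topic factor
T = rng.random((k, t))                       # initial topic-term factor
eps = 1e-10                                  # avoid division by zero

for _ in range(200):
    D *= (M @ T.T) / (D @ T @ T.T + eps)     # update D, holding T fixed
    T *= (D.T @ M) / (D.T @ D @ T + eps)     # update T, holding D fixed

print(np.linalg.norm(M - D @ T))             # Frobenius norm of the residual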
n_topics = 3
# Non-Negative Matrix Factorization Model
nmf = NMF(n_components=n_topics)
data_nmf = nmf.fit_transform(DTM)
data_nmf.shape
(129971, 3)
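Here, data_nmf plays the role of \(D\) and nmf.components_ plays the role of \(T\), so we can reuse the selected_topics helper from above to inspect the NMF topics (a quick sketch):

# Top keywords per NMF topic, reusing the helper defined earlier
selected_topics(nmf, vectorizer)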

We can get more details here: https://srdas.github.io/MLBook2/24_TextAnaytics_Advanced.html#Non-negative-Matrix-Factorization-(NMF)

You can use this reduced-form matrix for document similarity (for example).
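For instance, here is a minimal sketch of document similarity in the reduced topic space, using cosine similarity on the NMF document-topic matrix (the choice of the first review as the query is arbitrary):

from sklearn.metrics.pairwise import cosine_similarity

# Similarity of the first review to all others, computed in topic space
sims = cosine_similarity(data_nmf[0:1], data_nmf)[0]
print(sims.argsort()[::-1][:5])   # indices of the most similar reviews (the first is the review itself)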

# Latent Semantic Indexing Model using Truncated SVD
lsi = TruncatedSVD(n_components=n_topics)
data_lsi = lsi.fit_transform(DTM)
data_lsi.shape
(129971, 3)
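TruncatedSVD also exposes the proportion of variance in the DTM captured by each component, which gives a rough sense of how much information the three-dimensional reduction retains (a quick check):

# Variance explained by each LSI component, and in total
print(lsi.explained_variance_ratio_)
print(lsi.explained_variance_ratio_.sum())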

More details here: https://srdas.github.io/MLBook2/24_TextAnaytics_Advanced.html#Singular-Value-Decomposition-(SVD)