20. Term Document Matrix#

How do we represent a document as a vector? One simple approach is to represent each document by the counts (or weights) of the terms it contains, over the vocabulary of the corpus. A collection of such vectors, one column per document and one row per term, is known as a “Term Document Matrix” (TDM).

from google.colab import drive
drive.mount('/content/drive')  # Add My Drive/<>

import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/NLPBook/')
Mounted at /content/drive
%%capture
%pylab inline
import pandas as pd
import os
# !pip install ipypublish
# from ipypublish import nb_setup
%load_ext rpy2.ipython
from IPython.display import Image

20.1. Term Document Matrix#

Image("NLP_images/tdm.png", width=500)

20.2. Term Frequency - Inverse Document Frequency (TF-IDF)#

This is a weighting scheme designed to sharpen the importance of rare words in a document, relative to the frequency of those words in the corpus. It is based on simple calculations and, even though it does not have strong theoretical foundations, it is very useful in practice. TF-IDF measures the importance of a word \(w\) in a document \(d\) in a corpus \(C\). It is therefore a function of all three, written \(TFIDF(w,d,C)\), and is the product of term frequency (TF) and inverse document frequency (IDF).

The frequency of a word in a document is defined as

\[ f(w,d)=\frac{\#w \in d}{|d|} \]

where \(|d|\) is the number of words in the document. We may normalize word frequency; one common scheme is log normalization,

\[ TF(w,d)=\ln[1 + f(w,d)] \]

where the added 1 keeps the weight non-negative (since \(f(w,d) \leq 1\), the plain \(\ln[f(w,d)]\) would be negative). Another form of normalization is known as double normalization and is as follows:

\[ TF(w,d)=\frac{1}{2} + \frac{1}{2} \cdot \frac{f(w,d)}{\max_{w \in d} f(w,d)} \]

Note that normalization is not strictly necessary, but it tends to shrink the differences between word counts, so that very frequent words do not overwhelm the rest.
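As a small illustration of these term-frequency variants (a sketch on a toy token list, not the scraped data used later):

import math
from collections import Counter

def tf_raw(w, d):
    """f(w,d): count of w in document d (a token list), divided by |d|."""
    return d.count(w) / len(d)

def tf_log(w, d):
    """Log normalization: ln(1 + f(w,d))."""
    return math.log(1 + tf_raw(w, d))

def tf_double(w, d):
    """Double normalization: 0.5 + 0.5*f(w,d)/max_w f(w,d); the |d| cancels."""
    return 0.5 + 0.5 * d.count(w) / max(Counter(d).values())

d = "the cat sat on the mat".split()
for w in ["the", "cat"]:
    print(w, round(tf_raw(w, d), 3), round(tf_log(w, d), 3), round(tf_double(w, d), 3))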

Inverse document frequency is as follows:

\[ IDF(w,C)=\ln\left[\frac{|C|}{|\{d \in C: w \in d\}|}\right] \]

That is, we take the number of documents in the corpus \(C\), divide it by the number of documents in the corpus that contain word \(w\), and take the natural log of the ratio.

Finally, we have the weighting score for a given word \(w\) in document \(d\) in corpus \(C\):

\[ TFIDF(w,d,C)=TF(w,d) \times IDF(w,C) \]
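Putting the pieces together, here is a from-scratch sketch of \(TFIDF(w,d,C)\), using the log-normalized TF from above on toy documents:

import math

def tfidf_score(w, d, C):
    tf = math.log(1 + d.count(w) / len(d))            # log-normalized TF(w,d)
    df = sum(1 for doc in C if w in doc)              # number of docs containing w
    idf = math.log(len(C) / df) if df > 0 else 0.0    # IDF(w,C)
    return tf * idf

C = [s.split() for s in ["the cat sat", "the dog sat", "the cat ran"]]
print(tfidf_score("dog", C[1], C))   # rare word: positive weight
print(tfidf_score("the", C[1], C))   # in every document: IDF = ln(1) = 0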
# Collect some text data by scraping headlines from a news site
!pip install cssselect
import requests
from lxml.html import fromstring

# Fetch the page and parse the HTML into an element tree
url = 'https://economictimes.indiatimes.com'
html = requests.get(url, timeout=10).text

# See: http://infohost.nmt.edu/~shipman/soft/pylxml/web/etree-fromstring.html
doc = fromstring(html)

# CSS selectors: http://lxml.de/cssselect.html#the-cssselect-method
x = doc.cssselect(".active li")    # Try a, h2, section if you like
headlines = [item.text_content() for item in x]
headlines = headlines[:20]   # Keep only the first 20 items; the rest was not needed
print(headlines)
Collecting cssselect
  Downloading cssselect-1.3.0-py3-none-any.whl.metadata (2.6 kB)
Downloading cssselect-1.3.0-py3-none-any.whl (18 kB)
Installing collected packages: cssselect
Successfully installed cssselect-1.3.0
['Institutions home in on mixed-use realty', 'US revokes over 500 student visas', 'RBI warns of global shockwaves hitting India ', 'US VP Vance set to visit India this month', 'L&T bets on capex by big companies', 'Trump just bought a visa that doesn’t exist', 'T20 WC a huge boost for USA, West Indies', 'China warns citizens against US travel', '26% US tariff on India kicks in amid turmoil', 'Inflation expectations lowest since pandemic', 'Vi debt at Rs 2.17 lakh cr in Dec 2024 quarter', 'India economy to grow at 6.7% in FY26: ADB', 'China slaps restrictions on 18 US firms', 'What you need to be aware about opinion trading platforms', 'HC gives key ruling on railway luggage theft', "Kolkata set to get India's largest EV charging hub", 'RBI draft eases credit enhancement norms', 'CBDT sets April 30 deadline for VSV scheme', 'Unemployment rate declines to 4.9% in 2024', 'RBI proposes new, stricter gold loan rules']

In the next code block, we see the use of Scikit-Learn’s TF-IDF vectorizer. We have 20 headlines (documents), so the matrix has 20 rows and one column for each unique word (term) across all the documents. The dimensions are shown in the output below; because rows are documents and columns are terms, this is a document-term matrix (DTM), and its transpose is a TDM.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
tfidf = TfidfVectorizer()
# tfidf = CountVectorizer()
tfs = tfidf.fit_transform(headlines)
tfs
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 150 stored elements and shape (20, 122)>
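To see which terms the 122 columns correspond to, we can inspect the fitted vocabulary and label the matrix (get_feature_names_out is the name in scikit-learn 1.0 and later; older releases call it get_feature_names):

# Label the DTM with its terms for easier inspection
terms = tfidf.get_feature_names_out()
print(len(terms), terms[:8])
dtm = pd.DataFrame(tfs.toarray(), columns=terms)
dtm.iloc[:5, :8]   # first 5 headlines against the first 8 terms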
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
res = [word_tokenize(j) for j in headlines]
res[:2]
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[['Institutions', 'home', 'in', 'on', 'mixed-use', 'realty'],
 ['US', 'revokes', 'over', '500', 'student', 'visas']]
# Make TDM
tdm_mat = tfs.toarray().T
print(tdm_mat.shape)
tdm_mat
(122, 20)
array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.4077704 , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.36447019, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.42828064]])

20.3. The BM25 Algorithm#

Both the TF and IDF parts are modified, as follows:

\[ TF_{BM25}(w,d) = \frac{f(w,d)}{f(w,d) + k \left(1-b+b \cdot \frac{|d|}{avg(|d|)}\right)} \]

where \(f(w,d)\) is the term frequency defined earlier, \(k\) modulates term-frequency saturation, \(b\) modulates the strength of document-length normalization, \(|d|\) is the length of document \(d\), and \(avg(|d|)\) is the average document length in the corpus.

\[ IDF(w) = \ln\left[\frac{N-DF(w)+0.5}{DF(w)+0.5} \right] \]

where \(N=|C|\) is the number of documents in the corpus and \(DF(w)\) is the document frequency for word \(w\), i.e., the number of documents that contain \(w\). The 0.5 terms smooth the estimate. Note that this expression turns negative once \(w\) appears in more than half the documents, so implementations typically floor it at zero or fall back to the plain

\[ IDF(w) = \ln \left[\frac{N}{DF(w)} \right] \]

20.4. Implementation of BM25#
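Below is a minimal sketch of a BM25 scorer assembled from the formulas above. It follows the chapter’s relative-frequency definition of \(f(w,d)\) (conventional BM25 implementations use raw counts), takes the commonly used defaults \(k=1.2\) and \(b=0.75\), and floors the IDF at zero; a production system would use a tuned library such as rank_bm25 instead.

import math

def bm25(query, d, C, k=1.2, b=0.75):
    """Score document d (a token list) against query (a token list), given corpus C."""
    N = len(C)
    avg_len = sum(len(doc) for doc in C) / N
    score = 0.0
    for w in query:
        DF = sum(1 for doc in C if w in doc)               # document frequency DF(w)
        if DF == 0:
            continue                                       # word absent from corpus
        idf = max(0.0, math.log((N - DF + 0.5) / (DF + 0.5)))
        f = d.count(w) / len(d)                            # f(w,d)
        tf = f / (f + k * (1 - b + b * len(d) / avg_len))  # saturating TF
        score += tf * idf
    return score

# Rank the scraped headlines against a one-word query
corpus = [h.lower().split() for h in headlines]
ranked = sorted(zip((bm25(["rbi"], doc, corpus) for doc in corpus), headlines),
                reverse=True)
print(ranked[:3])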

20.5. WordClouds#

We can take the text in the headlines and generate a wordcloud.

!pip install wordcloud   # use pip; conda is not available on Colab
# Concatenate all the headlines into a single string
text = ' '.join(headlines)
print(text)
 Rising Bharat may need to take center stage for India’s game-changing plans Will Indian Railways accelerate to global standards? Clearer skies for plane lessors with new bill EVs to drive 30% of volumes for TVS Tapping nobility in all professions How Trump changed lives of legal immigrants India's ancient tag sport seeks Olympic push Leadership holds key to AI adoption success Trump says 'H-1B lets the best people in'  US school shooting: Teen kills female student Musk behind Ramaswamy's ouster from DOGE? India to form consortium for global port ops Modi, Trump likely to meet in Feb India sets target to register 10k GI products Domestic air traffic grows to 16.13 cr in 2024 Knife attacker in Germany kills two Expect demand moderation to continue: HUL 6 Trump executive orders to watch Star Air offers 66,666 tickets from Rs 1950 Indians aren't that worried Trump’s back
from wordcloud import WordCloud
wordcloud = WordCloud().generate(text)

# Use pyplot from matplotlib (figure and pyplot are in scope via %pylab inline)
figure(figsize=(20,10))
pyplot.imshow(wordcloud, interpolation='bilinear')
pyplot.axis("off")
(-0.5, 399.5, 199.5, -0.5)
[Output: a wordcloud image rendered from the headline text.]