19. Text Transformations, Numbers, Punctuation, Stopwords, Stemming, Corpus#

Where we process text in various ways to clean it up and make it more amenable to analysis.

from google.colab import drive
drive.mount('/content/drive')  # Add My Drive/<>

import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/NLPBook/')
Mounted at /content/drive
%%capture
!pip install ipypublish
!pip install cssselect
%pylab inline
import pandas as pd
import os
from ipypublish import nb_setup
%load_ext rpy2.ipython
Populating the interactive namespace from numpy and matplotlib

19.1. Collect some text from the news#

Here we use web scraping to collect news headlines from the Economic Times, an Indian newspaper.

It is useful to navigate to the URL for the newspaper to take a look at the page: https://economictimes.indiatimes.com

import requests
from lxml.html import fromstring
#Copy the URL from the web site
url = 'https://economictimes.indiatimes.com'
html = requests.get(url, timeout=10).text

#See: http://infohost.nmt.edu/~shipman/soft/pylxml/web/etree-fromstring.html
doc = fromstring(html)

#http://lxml.de/cssselect.html#the-cssselect-method
doc.cssselect(".active")
[<Element li at 0x7d009a7feee0>,
 <Element li at 0x7d005bd04f50>,
 <Element li at 0x7d005bcc9b30>,
 <Element li at 0x7d005bb7cb90>,
 <Element li at 0x7d005bb7cc30>,
 <Element li at 0x7d005bb7cc80>,
 <Element li at 0x7d005bb7ccd0>,
 <Element li at 0x7d005bb7cd20>,
 <Element li at 0x7d005bb7cd70>,
 <Element li at 0x7d005bb7cdc0>,
 <Element li at 0x7d005bb7ce10>,
 <Element li at 0x7d005bb7ce60>,
 <Element li at 0x7d005bb7ceb0>,
 <Element li at 0x7d005bb7cf00>,
 <Element li at 0x7d005bb7cf50>,
 <Element li at 0x7d005bb7cfa0>]
x = doc.cssselect(".active li")    #Try a, h2, section if you like
headlines = [x[j].text_content() for j in range(len(x))]
headlines = headlines[:20]   # Keep only the first 20 items; the selector also matches unrelated elements
for h in headlines:
    print(h)
Middle class tax pain to be finally alleviated this time?
Modi govt has a key task in Budget 2025: Unlocking the PLI goldmine
Coldplay live hits 83L views on Hotstar
New Zealand to let visitors to work remotely
Trump urges 'fair' India-US trade in Modi call
What is Deepseek that freaked out AI world 
Dubai's boom is putting strains on residents
Trump vows to build 'Iron Dome' missile shield
Google Maps' plan for the 'Gulf of America'
Justice Dept fires Trump case prosecutors
Hamas says 300K displaced return
17 battles may shape Delhi's 2025 polls
RBI dissolves Aviom Housing board
Ujjivan & others lower lending rate from Jan
Body Shop to begin manufacturing in India
SC spurns plea to expedite Sebi probe
Building collapses in Burari, many trapped
NCLAT dismisses insolvency plea against HUL
India, China to resume flights after 5 yrs
PM Modi speaks to US Prez Trump over phone 

19.2. Remove punctuation from headlines#

import string
print(string.punctuation)

# For comparison: the punctuation set without the period (not used below)
punc_set = string.punctuation.replace('.','')
print(punc_set)

def removePuncStr(s):
    # Replace every punctuation character with a space
    for c in string.punctuation:
        s = s.replace(c, " ")
    return s

def removePunc(text_array):
    return [removePuncStr(h) for h in text_array]
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
!"#$%&'()*+,-/:;<=>?@[\]^_`{|}~
headlines = removePunc(headlines)
headlines
['Middle class tax pain to be finally alleviated this time ',
 'Modi govt has a key task in Budget 2025  Unlocking the PLI goldmine',
 'Coldplay live hits 83L views on Hotstar',
 'New Zealand to let visitors to work remotely',
 'Trump urges  fair  India US trade in Modi call',
 'What is Deepseek that freaked out AI world ',
 'Dubai s boom is putting strains on residents',
 'Trump vows to build  Iron Dome  missile shield',
 'Google Maps  plan for the  Gulf of America ',
 'Justice Dept fires Trump case prosecutors',
 'Hamas says 300K displaced return',
 '17 battles may shape Delhi s 2025 polls',
 'RBI dissolves Aviom Housing board',
 'Ujjivan   others lower lending rate from Jan',
 'Body Shop to begin manufacturing in India',
 'SC spurns plea to expedite Sebi probe',
 'Building collapses in Burari  many trapped',
 'NCLAT dismisses insolvency plea against HUL',
 'India  China to resume flights after 5 yrs',
 'PM Modi speaks to US Prez Trump over phone ']
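The character-by-character loop above can also be expressed in a single pass with `str.translate`; this sketch maps every punctuation character to a space, matching the behavior of `removePunc`:

```python
import string

# Translation table: every punctuation character -> space
punc_table = str.maketrans(string.punctuation, " " * len(string.punctuation))

def removePuncFast(text_array):
    return [s.translate(punc_table) for s in text_array]

print(removePuncFast(["Trump urges 'fair' India-US trade in Modi call"]))
# ['Trump urges  fair  India US trade in Modi call']
```

`str.translate` scans each string once instead of once per punctuation character, which matters for large corpora.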

19.3. Remove Numbers#

def removeNumbersStr(s):
    for c in range(10):
        n = str(c)
        s = s.replace(n," ")
    return s

def removeNumbers(text_array):
    return [removeNumbersStr(h) for h in text_array]
headlines = removeNumbers(headlines)
headlines
['Middle class tax pain to be finally alleviated this time ',
 'Modi govt has a key task in Budget       Unlocking the PLI goldmine',
 'Coldplay live hits   L views on Hotstar',
 'New Zealand to let visitors to work remotely',
 'Trump urges  fair  India US trade in Modi call',
 'What is Deepseek that freaked out AI world ',
 'Dubai s boom is putting strains on residents',
 'Trump vows to build  Iron Dome  missile shield',
 'Google Maps  plan for the  Gulf of America ',
 'Justice Dept fires Trump case prosecutors',
 'Hamas says    K displaced return',
 '   battles may shape Delhi s      polls',
 'RBI dissolves Aviom Housing board',
 'Ujjivan   others lower lending rate from Jan',
 'Body Shop to begin manufacturing in India',
 'SC spurns plea to expedite Sebi probe',
 'Building collapses in Burari  many trapped',
 'NCLAT dismisses insolvency plea against HUL',
 'India  China to resume flights after   yrs',
 'PM Modi speaks to US Prez Trump over phone ']
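The same digit removal can be written with a regular expression; a minimal sketch equivalent to `removeNumbers`:

```python
import re

def removeNumbersFast(text_array):
    # \d matches any single digit; each is replaced by a space
    return [re.sub(r"\d", " ", s) for s in text_array]

print(removeNumbersFast(["Hamas says 300K displaced return"]))
# ['Hamas says    K displaced return']
```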

19.4. Remove Stopwords#

Reference: https://pythonprogramming.net/stop-words-nltk-tutorial/

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def stopText(text_array):
    stop_words = set(stopwords.words('english'))
    stopped_text = []
    for h in text_array:
        words = word_tokenize(h)
        h2 = ''
        for w in words:
            if w.lower() not in stop_words:
                h2 = h2 + ' ' + w
        stopped_text.append(h2)
    return stopped_text
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
stopped_headlines = stopText(headlines)
stopped_headlines
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[' Middle class tax pain finally alleviated time',
 ' Modi govt key task Budget Unlocking PLI goldmine',
 ' Coldplay live hits L views Hotstar',
 ' New Zealand let visitors work remotely',
 ' Trump urges fair India US trade Modi call',
 ' Deepseek freaked AI world',
 ' Dubai boom putting strains residents',
 ' Trump vows build Iron Dome missile shield',
 ' Google Maps plan Gulf America',
 ' Justice Dept fires Trump case prosecutors',
 ' Hamas says K displaced return',
 ' battles may shape Delhi polls',
 ' RBI dissolves Aviom Housing board',
 ' Ujjivan others lower lending rate Jan',
 ' Body Shop begin manufacturing India',
 ' SC spurns plea expedite Sebi probe',
 ' Building collapses Burari many trapped',
 ' NCLAT dismisses insolvency plea HUL',
 ' India China resume flights yrs',
 ' PM Modi speaks US Prez Trump phone']

19.5. Stemming#

https://pythonprogramming.net/stemming-nltk-tutorial/

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

def stemText(text_array):
    stemmer = PorterStemmer()   # create the stemmer once, not once per word
    stemmed_text = []
    for h in text_array:
        words = word_tokenize(h)
        h2 = ''
        for w in words:
            h2 = h2 + ' ' + stemmer.stem(w)
        stemmed_text.append(h2)
    return stemmed_text
import nltk
nltk.download('punkt')
stemmed_headlines = stemText(headlines)
stemmed_headlines
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[' middl class tax pain to be final allevi thi time',
 ' modi govt ha a key task in budget unlock the pli goldmin',
 ' coldplay live hit l view on hotstar',
 ' new zealand to let visitor to work remot',
 ' trump urg fair india us trade in modi call',
 ' what is deepseek that freak out ai world',
 ' dubai s boom is put strain on resid',
 ' trump vow to build iron dome missil shield',
 ' googl map plan for the gulf of america',
 ' justic dept fire trump case prosecutor',
 ' hama say k displac return',
 ' battl may shape delhi s poll',
 ' rbi dissolv aviom hous board',
 ' ujjivan other lower lend rate from jan',
 ' bodi shop to begin manufactur in india',
 ' sc spurn plea to expedit sebi probe',
 ' build collaps in burari mani trap',
 ' nclat dismiss insolv plea against hul',
 ' india china to resum flight after yr',
 ' pm modi speak to us prez trump over phone']
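Note that Porter stems are often not dictionary words ('middl', 'allevi' above); a quick check of a few stems, reusing a single stemmer instance:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["middle", "alleviated", "manufacturing", "flights"]:
    print(w, "->", stemmer.stem(w))
# middle -> middl
# alleviated -> allevi
# manufacturing -> manufactur
# flights -> flight
```

Stemming is a crude, rule-based truncation; it trades readability for collapsing inflected forms onto a common token.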

19.6. Write all docs to separate text files#

This is a typical approach in the text mining community. When creating a repository of plain text documents, each document is written as a separate text file to a folder. We do this here so that we can see how to ingest a folder of such documents into a corpus, which is defined below.

def write2textfile(s, filename):
    # The context manager closes the file automatically
    with open(filename, "w") as text_file:
        text_file.write(s)
import os
os.makedirs('CTEXT', exist_ok=True)   # no error if the folder already exists

j = 0
for h in headlines:
    j = j + 1
    fname = "CTEXT/" + str(j) + ".ctxt"  #using "ctxt" to denote a corpus related file
    write2textfile(h,fname)
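An equivalent sketch using pathlib, which handles directory creation and file writing in fewer lines (the folder name `CTEXT2` and the sample strings are placeholders, chosen so as not to clobber the CTEXT folder above):

```python
from pathlib import Path

outdir = Path("CTEXT2")          # hypothetical folder, distinct from CTEXT
outdir.mkdir(exist_ok=True)
for j, h in enumerate(["first headline", "second headline"], start=1):
    (outdir / f"{j}.ctxt").write_text(h)

print(sorted(p.name for p in outdir.glob("*.ctxt")))
# ['1.ctxt', '2.ctxt']
```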

19.7. Create a Corpus#

A corpus is a data structure that contains multiple documents.

Functions may be written at the corpus level itself.

We only need to point NLTK's PlaintextCorpusReader function at the folder, and it constructs the corpus in one line of code.

#Read in the corpus
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'CTEXT/'
ctext = PlaintextCorpusReader(corpus_root, '.*')
ctext
<PlaintextCorpusReader in '/content/drive/MyDrive/Books_Writings/NLPBook/CTEXT'>
ctext.fileids()
['1.ctxt',
 '10.ctxt',
 '11.ctxt',
 '12.ctxt',
 '13.ctxt',
 '14.ctxt',
 '15.ctxt',
 '16.ctxt',
 '17.ctxt',
 '18.ctxt',
 '19.ctxt',
 '2.ctxt',
 '20.ctxt',
 '3.ctxt',
 '4.ctxt',
 '5.ctxt',
 '6.ctxt',
 '7.ctxt',
 '8.ctxt',
 '9.ctxt']
# We now have functions that apply to the entire corpus
print(ctext.words(), len(ctext.words()))
print(len(set(ctext.words()))) # gives the vocabulary
print(ctext.words('1.ctxt'), len(ctext.words('1.ctxt')))
['Middle', 'class', 'tax', 'pain', 'to', 'be', ...] 149
126
['Middle', 'class', 'tax', 'pain', 'to', 'be', ...] 10
ctext.words('2.ctxt')
['Modi', 'govt', 'has', 'a', 'key', 'task', 'in', ...]
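With corpus-level word access in hand, NLTK's FreqDist gives term counts in one line. A self-contained sketch on a toy word list (in the notebook you would pass ctext.words() instead):

```python
from nltk import FreqDist

# FreqDist counts token frequencies over any iterable of words
words = ['Modi', 'govt', 'Modi', 'India', 'India', 'India']
fdist = FreqDist(w.lower() for w in words)
print(fdist.most_common(2))
# [('india', 3), ('modi', 2)]
```

Lowercasing before counting merges case variants ('India', 'india') into a single vocabulary entry, consistent with the vocabulary count `len(set(ctext.words()))` above only up to case.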