19. Text Transformations, Numbers, Punctuation, Stopwords, Stemming, Corpus
Where we process text in different ways in order to clean it up and make it more amenable to analysis.
from google.colab import drive
drive.mount('/content/drive') # Add My Drive/<>
import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/NLPBook/')
Mounted at /content/drive
%%capture
!pip install ipypublish
!pip install cssselect
%pylab inline
import pandas as pd
import os
from ipypublish import nb_setup
%load_ext rpy2.ipython
Populating the interactive namespace from numpy and matplotlib
19.1. Collect some text from the news
Here we use web scraping to collect news headlines from the Economic Times, an Indian newspaper.
It is useful to navigate to the URL for the newspaper to take a look at the page: https://economictimes.indiatimes.com
import requests
from lxml.html import fromstring
# Collect some text data
!pip install cssselect
# Copy the URL from the web site
url = 'https://economictimes.indiatimes.com'
html = requests.get(url, timeout=10).text
# Parse the HTML string into an element tree
# See: http://infohost.nmt.edu/~shipman/soft/pylxml/web/etree-fromstring.html
doc = fromstring(html)
# Select elements by CSS class (the class name is site-specific and may change)
# http://lxml.de/cssselect.html#the-cssselect-method
x = doc.cssselect(".jsx-48c379259a10063f")
print(len(x))
headlines = [j.text_content() for j in x]
headlines = [j for j in headlines if len(j)>20]  # drop short fragments
headlines = unique(headlines)  # numpy's unique (via %pylab): de-duplicates and sorts
headlines = headlines[:30]  # Needed to exclude any other stuff that was not needed.
for h in headlines:
    print(h)
Requirement already satisfied: cssselect in /usr/local/lib/python3.12/dist-packages (1.3.0)
56
A clear pattern emerges in UPI vs cards battle
Addressing Indiaâs cognitive time bomb
Air India partners with STARLUX Airlines
Ambani 'less well-off' this year: Forbes
Ambani 'less well-off' this year: ForbesSept third-hottest globally on recordWhat we know about the new Gaza dealHow Donald Trump pulled off his Gaza dealEarthquake of magnitude 3.1 strikes BhutanNo visas on the table with India: StarmerJSW MG wants to top India's luxe EV marketNational Employment Policy coming soonGovt plans pension law for coal workersDGCA slaps â¹20 lakh penalty on IndiGoPM hails Mumbai Metro Line-3 Phase 2BTCS Q2 earnings press conference called offA clear pattern emerges in UPI vs cards battleSkoda planning to launch EV in IndiaAddressing Indiaâs cognitive time bombAir India partners with STARLUX AirlinesArunachal Pradesh bans Coldrif cough syrupInternational students face fewer risks abroadBira plans $132 million fundraising: ReportIMC: India's telecom road map beyond 5GModi inaugurates India Mobile Congress 2025Indian seafood exporters in troubled water
Ambani 'less well-off' this year: ForbesSept third-hottest globally on recordWhat we know about the new Gaza dealHow Donald Trump pulled off his Gaza dealEarthquake of magnitude 3.1 strikes BhutanNo visas on the table with India: StarmerJSW MG wants to top India's luxe EV marketNational Employment Policy coming soonGovt plans pension law for coal workersDGCA slaps â¹20 lakh penalty on IndiGoPM hails Mumbai Metro Line-3 Phase 2BTCS Q2 earnings press conference called offA clear pattern emerges in UPI vs cards battleSkoda planning to launch EV in IndiaAddressing Indiaâs cognitive time bombAir India partners with STARLUX AirlinesArunachal Pradesh bans Coldrif cough syrupInternational students face fewer risks abroadBira plans $132 million fundraising: ReportIMC: India's telecom road map beyond 5GModi inaugurates India Mobile Congress 2025Indian seafood exporters in troubled water More from Top News »
Arunachal Pradesh bans Coldrif cough syrup
Bira plans $132 million fundraising: Report
DGCA slaps â¹20 lakh penalty on IndiGo
Earthquake of magnitude 3.1 strikes Bhutan
Govt plans pension law for coal workers
How Donald Trump pulled off his Gaza deal
IMC: India's telecom road map beyond 5G
Indian seafood exporters in troubled water
International students face fewer risks abroad
JSW MG wants to top India's luxe EV market
Modi inaugurates India Mobile Congress 2025
National Employment Policy coming soon
No visas on the table with India: Starmer
PM hails Mumbai Metro Line-3 Phase 2B
Sept third-hottest globally on record
Skoda planning to launch EV in India
TCS Q2 earnings press conference called off
What we know about the new Gaza deal
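Some of the headlines above contain garbled characters, such as `Indiaâs`. One plausible cause (an assumption, not verified against the site) is that UTF-8 bytes were decoded as Latin-1. The sketch below reproduces the effect on a sample string and then reverses it. `fix_mojibake` is a hypothetical helper and is not part of the original code.

```python
def fix_mojibake(s):
    """Undo UTF-8 text that was wrongly decoded as Latin-1 (hypothetical helper)."""
    try:
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The string was not this kind of mojibake; return it unchanged.
        return s

# Reproduce the garbling on a sample headline, then repair it.
original = "Addressing India\u2019s cognitive time bomb"
garbled = original.encode("utf-8").decode("latin-1")
repaired = fix_mojibake(garbled)
print(repr(garbled))
print(repaired)
```

If this is indeed the cause, setting `response.encoding = 'utf-8'` on the requests response before reading `response.text` would avoid the problem at the source rather than repairing it afterwards.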
19.2. Remove punctuation from headlines
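import string
print(string.punctuation)
# Punctuation set without the period (printed for comparison; not used below)
punc_set = string.punctuation.replace('.','')
print(punc_set)
def removePuncStr(s):
    # Replace every punctuation character, including '.', with a space
    for c in string.punctuation:
        s = s.replace(c," ")
    # No effect after the loop above, since all periods are already gone
    s = s.replace('. ','')
    return s
def removePunc(text_array):
    return [removePuncStr(h) for h in text_array]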
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
!"#$%&'()*+,-/:;<=>?@[\]^_`{|}~
headlines = removePunc(headlines)
headlines
['A clear pattern emerges in UPI vs cards battle',
'Addressing Indiaâ\x80\x99s cognitive time bomb',
'Air India partners with STARLUX Airlines',
'Ambani less well off this year Forbes',
'Ambani less well off this year ForbesSept third hottest globally on recordWhat we know about the new Gaza dealHow Donald Trump pulled off his Gaza dealEarthquake of magnitude 3 1 strikes BhutanNo visas on the table with India StarmerJSW MG wants to top India s luxe EV marketNational Employment Policy coming soonGovt plans pension law for coal workersDGCA slaps â\x82¹20 lakh penalty on IndiGoPM hails Mumbai Metro Line 3 Phase 2BTCS Q2 earnings press conference called offA clear pattern emerges in UPI vs cards battleSkoda planning to launch EV in IndiaAddressing Indiaâ\x80\x99s cognitive time bombAir India partners with STARLUX AirlinesArunachal Pradesh bans Coldrif cough syrupInternational students face fewer risks abroadBira plans 132 million fundraising ReportIMC India s telecom road map beyond 5GModi inaugurates India Mobile Congress 2025Indian seafood exporters in troubled water',
'Ambani less well off this year ForbesSept third hottest globally on recordWhat we know about the new Gaza dealHow Donald Trump pulled off his Gaza dealEarthquake of magnitude 3 1 strikes BhutanNo visas on the table with India StarmerJSW MG wants to top India s luxe EV marketNational Employment Policy coming soonGovt plans pension law for coal workersDGCA slaps â\x82¹20 lakh penalty on IndiGoPM hails Mumbai Metro Line 3 Phase 2BTCS Q2 earnings press conference called offA clear pattern emerges in UPI vs cards battleSkoda planning to launch EV in IndiaAddressing Indiaâ\x80\x99s cognitive time bombAir India partners with STARLUX AirlinesArunachal Pradesh bans Coldrif cough syrupInternational students face fewer risks abroadBira plans 132 million fundraising ReportIMC India s telecom road map beyond 5GModi inaugurates India Mobile Congress 2025Indian seafood exporters in troubled water More from Top News »',
'Arunachal Pradesh bans Coldrif cough syrup',
'Bira plans 132 million fundraising Report',
'DGCA slaps â\x82¹20 lakh penalty on IndiGo',
'Earthquake of magnitude 3 1 strikes Bhutan',
'Govt plans pension law for coal workers',
'How Donald Trump pulled off his Gaza deal',
'IMC India s telecom road map beyond 5G',
'Indian seafood exporters in troubled water',
'International students face fewer risks abroad',
'JSW MG wants to top India s luxe EV market',
'Modi inaugurates India Mobile Congress 2025',
'National Employment Policy coming soon',
'No visas on the table with India Starmer',
'PM hails Mumbai Metro Line 3 Phase 2B',
'Sept third hottest globally on record',
'Skoda planning to launch EV in India',
'TCS Q2 earnings press conference called off',
'What we know about the new Gaza deal']
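The output above shows that the decimal point in `3.1` is also replaced, giving `3 1`. As a hedged alternative, the sketch below uses `str.translate` to replace punctuation in a single pass and accepts an optional set of characters to keep. The name `remove_punct` and its `keep` parameter are illustrative assumptions, not part of the original code.

```python
import string

def remove_punct(s, keep=""):
    """Replace punctuation with spaces, except characters listed in `keep`."""
    drop = "".join(c for c in string.punctuation if c not in keep)
    table = str.maketrans(drop, " " * len(drop))
    # Collapse the runs of spaces left behind by the replacements.
    return " ".join(s.translate(table).split())

print(remove_punct("Ambani 'less well-off' this year: Forbes"))
print(remove_punct("Earthquake of magnitude 3.1 strikes Bhutan", keep="."))
```

Keeping `.` preserves decimals, but it also preserves sentence-final periods, so whether to keep it depends on the text being cleaned.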
19.3. Remove Numbers
def removeNumbersStr(s):
    # Replace each digit 0-9 with a space
    for c in range(10):
        n = str(c)
        s = s.replace(n," ")
    return s
def removeNumbers(text_array):
    return [removeNumbersStr(h) for h in text_array]
headlines = removeNumbers(headlines)
headlines
['A clear pattern emerges in UPI vs cards battle',
'Addressing Indiaâ\x80\x99s cognitive time bomb',
'Air India partners with STARLUX Airlines',
'Ambani less well off this year Forbes',
'Ambani less well off this year ForbesSept third hottest globally on recordWhat we know about the new Gaza dealHow Donald Trump pulled off his Gaza dealEarthquake of magnitude strikes BhutanNo visas on the table with India StarmerJSW MG wants to top India s luxe EV marketNational Employment Policy coming soonGovt plans pension law for coal workersDGCA slaps â\x82¹ lakh penalty on IndiGoPM hails Mumbai Metro Line Phase BTCS Q earnings press conference called offA clear pattern emerges in UPI vs cards battleSkoda planning to launch EV in IndiaAddressing Indiaâ\x80\x99s cognitive time bombAir India partners with STARLUX AirlinesArunachal Pradesh bans Coldrif cough syrupInternational students face fewer risks abroadBira plans million fundraising ReportIMC India s telecom road map beyond GModi inaugurates India Mobile Congress Indian seafood exporters in troubled water',
'Ambani less well off this year ForbesSept third hottest globally on recordWhat we know about the new Gaza dealHow Donald Trump pulled off his Gaza dealEarthquake of magnitude strikes BhutanNo visas on the table with India StarmerJSW MG wants to top India s luxe EV marketNational Employment Policy coming soonGovt plans pension law for coal workersDGCA slaps â\x82¹ lakh penalty on IndiGoPM hails Mumbai Metro Line Phase BTCS Q earnings press conference called offA clear pattern emerges in UPI vs cards battleSkoda planning to launch EV in IndiaAddressing Indiaâ\x80\x99s cognitive time bombAir India partners with STARLUX AirlinesArunachal Pradesh bans Coldrif cough syrupInternational students face fewer risks abroadBira plans million fundraising ReportIMC India s telecom road map beyond GModi inaugurates India Mobile Congress Indian seafood exporters in troubled water More from Top News »',
'Arunachal Pradesh bans Coldrif cough syrup',
'Bira plans million fundraising Report',
'DGCA slaps â\x82¹ lakh penalty on IndiGo',
'Earthquake of magnitude strikes Bhutan',
'Govt plans pension law for coal workers',
'How Donald Trump pulled off his Gaza deal',
'IMC India s telecom road map beyond G',
'Indian seafood exporters in troubled water',
'International students face fewer risks abroad',
'JSW MG wants to top India s luxe EV market',
'Modi inaugurates India Mobile Congress ',
'National Employment Policy coming soon',
'No visas on the table with India Starmer',
'PM hails Mumbai Metro Line Phase B',
'Sept third hottest globally on record',
'Skoda planning to launch EV in India',
'TCS Q earnings press conference called off',
'What we know about the new Gaza deal']
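The digit-by-digit loop above can also be written with a regular expression. The following is a minimal sketch under that assumption; `remove_numbers` is a hypothetical name. Unlike the loop, it replaces a whole run of digits with one space and then collapses any leftover whitespace.

```python
import re

def remove_numbers(s):
    """Replace each run of digits with a space, then collapse extra spaces."""
    # Note: in Python 3, \d also matches non-ASCII Unicode digits.
    return " ".join(re.sub(r"\d+", " ", s).split())

print(remove_numbers("Bira plans 132 million fundraising Report"))
print(remove_numbers("TCS Q2 earnings press conference called off"))
```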
19.4. Remove Stopwords
Reference: https://pythonprogramming.net/stop-words-nltk-tutorial/
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
def stopText(text_array):
    stop_words = set(stopwords.words('english'))
    stopped_text = []
    for h in text_array:
        words = word_tokenize(h)
        h2 = ''
        for w in words:
            # Compare in lowercase so that capitalized stopwords are also dropped
            if w.lower() not in stop_words:
                h2 = h2 + ' ' + w
        stopped_text.append(h2)
    return stopped_text
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
stopped_headlines = stopText(headlines)
stopped_headlines
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt_tab.zip.
[' clear pattern emerges UPI vs cards battle',
' Addressing Indiaâ\x80\x99s cognitive time bomb',
' Air India partners STARLUX Airlines',
' Ambani less well year Forbes',
' Ambani less well year ForbesSept third hottest globally recordWhat know new Gaza dealHow Donald Trump pulled Gaza dealEarthquake magnitude strikes BhutanNo visas table India StarmerJSW MG wants top India luxe EV marketNational Employment Policy coming soonGovt plans pension law coal workersDGCA slaps â\x82¹ lakh penalty IndiGoPM hails Mumbai Metro Line Phase BTCS Q earnings press conference called offA clear pattern emerges UPI vs cards battleSkoda planning launch EV IndiaAddressing Indiaâ\x80\x99s cognitive time bombAir India partners STARLUX AirlinesArunachal Pradesh bans Coldrif cough syrupInternational students face fewer risks abroadBira plans million fundraising ReportIMC India telecom road map beyond GModi inaugurates India Mobile Congress Indian seafood exporters troubled water',
' Ambani less well year ForbesSept third hottest globally recordWhat know new Gaza dealHow Donald Trump pulled Gaza dealEarthquake magnitude strikes BhutanNo visas table India StarmerJSW MG wants top India luxe EV marketNational Employment Policy coming soonGovt plans pension law coal workersDGCA slaps â\x82¹ lakh penalty IndiGoPM hails Mumbai Metro Line Phase BTCS Q earnings press conference called offA clear pattern emerges UPI vs cards battleSkoda planning launch EV IndiaAddressing Indiaâ\x80\x99s cognitive time bombAir India partners STARLUX AirlinesArunachal Pradesh bans Coldrif cough syrupInternational students face fewer risks abroadBira plans million fundraising ReportIMC India telecom road map beyond GModi inaugurates India Mobile Congress Indian seafood exporters troubled water Top News  »',
' Arunachal Pradesh bans Coldrif cough syrup',
' Bira plans million fundraising Report',
' DGCA slaps â\x82¹ lakh penalty IndiGo',
' Earthquake magnitude strikes Bhutan',
' Govt plans pension law coal workers',
' Donald Trump pulled Gaza deal',
' IMC India telecom road map beyond G',
' Indian seafood exporters troubled water',
' International students face fewer risks abroad',
' JSW MG wants top India luxe EV market',
' Modi inaugurates India Mobile Congress',
' National Employment Policy coming soon',
' visas table India Starmer',
' PM hails Mumbai Metro Line Phase B',
' Sept third hottest globally record',
' Skoda planning launch EV India',
' TCS Q earnings press conference called',
' know new Gaza deal']
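Each string in the output above starts with a space because the words are concatenated one at a time. The sketch below shows the same filtering idea with `str.join`, which avoids the leading space. It is only an illustration: it uses plain whitespace splitting instead of `word_tokenize`, and `STOP_WORDS` is a tiny made-up list rather than NLTK's English stopword list.

```python
# A tiny, made-up stopword list for illustration; NLTK's English list is much longer.
STOP_WORDS = {"a", "about", "in", "on", "the", "we", "what", "with"}

def stop_text(text_array, stop_words=STOP_WORDS):
    """Drop stopwords (case-insensitively) from each whitespace-tokenized string."""
    return [" ".join(w for w in h.split() if w.lower() not in stop_words)
            for h in text_array]

result = stop_text(["What we know about the new Gaza deal",
                    "A clear pattern emerges in UPI vs cards battle"])
print(result)
```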
19.5. Stemming
Reference: https://pythonprogramming.net/stemming-nltk-tutorial/
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
def stemText(text_array):
    stemmer = PorterStemmer()  # create the stemmer once and reuse it
    stemmed_text = []
    for h in text_array:
        words = word_tokenize(h)
        h2 = ''
        for w in words:
            h2 = h2 + ' ' + stemmer.stem(w)
        stemmed_text.append(h2)
    return stemmed_text
import nltk
nltk.download('punkt')
stemmed_headlines = stemText(headlines)
stemmed_headlines
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[' a clear pattern emerg in upi vs card battl',
' address indiaâ\x80\x99 cognit time bomb',
' air india partner with starlux airlin',
' ambani less well off thi year forb',
' ambani less well off thi year forbessept third hottest global on recordwhat we know about the new gaza dealhow donald trump pull off hi gaza dealearthquak of magnitud strike bhutanno visa on the tabl with india starmerjsw mg want to top india s lux ev marketn employ polici come soongovt plan pension law for coal workersdgca slap â\x82¹ lakh penalti on indigopm hail mumbai metro line phase btc q earn press confer call offa clear pattern emerg in upi vs card battleskoda plan to launch ev in indiaaddress indiaâ\x80\x99 cognit time bombair india partner with starlux airlinesarunach pradesh ban coldrif cough syrupintern student face fewer risk abroadbira plan million fundrais reportimc india s telecom road map beyond gmodi inaugur india mobil congress indian seafood export in troubl water',
' ambani less well off thi year forbessept third hottest global on recordwhat we know about the new gaza dealhow donald trump pull off hi gaza dealearthquak of magnitud strike bhutanno visa on the tabl with india starmerjsw mg want to top india s lux ev marketn employ polici come soongovt plan pension law for coal workersdgca slap â\x82¹ lakh penalti on indigopm hail mumbai metro line phase btc q earn press confer call offa clear pattern emerg in upi vs card battleskoda plan to launch ev in indiaaddress indiaâ\x80\x99 cognit time bombair india partner with starlux airlinesarunach pradesh ban coldrif cough syrupintern student face fewer risk abroadbira plan million fundrais reportimc india s telecom road map beyond gmodi inaugur india mobil congress indian seafood export in troubl water more from top new â »',
' arunach pradesh ban coldrif cough syrup',
' bira plan million fundrais report',
' dgca slap â\x82¹ lakh penalti on indigo',
' earthquak of magnitud strike bhutan',
' govt plan pension law for coal worker',
' how donald trump pull off hi gaza deal',
' imc india s telecom road map beyond g',
' indian seafood export in troubl water',
' intern student face fewer risk abroad',
' jsw mg want to top india s lux ev market',
' modi inaugur india mobil congress',
' nation employ polici come soon',
' no visa on the tabl with india starmer',
' pm hail mumbai metro line phase b',
' sept third hottest global on record',
' skoda plan to launch ev in india',
' tc q earn press confer call off',
' what we know about the new gaza deal']
19.6. Write all docs to separate text files
This is a typical approach in the text mining community. When creating a repository of plain text documents, each document is written as a separate text file in a folder. We do this here so that we can see how to ingest a folder of such documents into a corpus, which is defined below.
def write2textfile(s,filename):
    # The with-block closes the file even if the write fails
    with open(filename, "w") as text_file:
        text_file.write(s)
import os
os.makedirs('CTEXT', exist_ok=True)  # create the folder if it does not already exist
j = 0
for h in headlines:
    j = j + 1
    fname = "CTEXT/" + str(j) + ".ctxt" #using "ctxt" to denote a corpus related file
    write2textfile(h,fname)
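File names such as `1.ctxt`, `2.ctxt`, ..., `10.ctxt` sort lexically as `1`, `10`, `11`, ..., `2`, as the file listing later in this chapter shows. The sketch below is one possible variation that zero-pads the index so that lexical and numeric order agree. `write_docs` is a hypothetical helper, and it writes to a temporary folder so that it can be run without touching the `CTEXT` directory.

```python
import tempfile
from pathlib import Path

def write_docs(docs, folder, ext=".ctxt"):
    """Write each document to its own UTF-8 file with a zero-padded name."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    width = len(str(len(docs)))  # e.g. 2 digits for 24 documents
    paths = []
    for j, doc in enumerate(docs, start=1):
        path = folder / f"{j:0{width}d}{ext}"
        path.write_text(doc, encoding="utf-8")
        paths.append(path)
    return paths

with tempfile.TemporaryDirectory() as tmp:
    docs = [f"headline {i}" for i in range(1, 13)]
    paths = write_docs(docs, Path(tmp) / "CTEXT")
    names = sorted(p.name for p in paths)
    first = paths[0].read_text(encoding="utf-8")

print(names[:3], names[-1])
```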
19.7. Create a Corpus
A corpus is a data structure that contains multiple documents.
Functions can then be applied at the level of the whole corpus.
We only need to point to the folder, and the function PlaintextCorpusReader in NLTK constructs the corpus in one line of code.
#Read in the corpus
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'CTEXT/'
ctext = PlaintextCorpusReader(corpus_root, '.*')
ctext
<PlaintextCorpusReader in '/content/drive/MyDrive/Books_Writings/NLPBook/CTEXT'>
ctext.fileids()
['1.ctxt',
'10.ctxt',
'11.ctxt',
'12.ctxt',
'13.ctxt',
'14.ctxt',
'15.ctxt',
'16.ctxt',
'17.ctxt',
'18.ctxt',
'19.ctxt',
'2.ctxt',
'20.ctxt',
'21.ctxt',
'22.ctxt',
'23.ctxt',
'24.ctxt',
'3.ctxt',
'4.ctxt',
'5.ctxt',
'6.ctxt',
'7.ctxt',
'8.ctxt',
'9.ctxt']
# We now have functions that apply to the entire corpus
print(ctext.words(), len(ctext.words()))  # all tokens and the total token count
print(len(set(ctext.words()))) # gives the vocabulary size (number of distinct tokens)
print(ctext.words('1.ctxt'), len(ctext.words('1.ctxt')))  # tokens in a single document
['A', 'clear', 'pattern', 'emerges', 'in', 'UPI', 'vs', ...] 422
158
['A', 'clear', 'pattern', 'emerges', 'in', 'UPI', 'vs', ...] 9
ctext.words('2.ctxt')
['Addressing', 'Indiaâ', '\x80\x99', 's', 'cognitive', ...]
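To illustrate what a corpus-level function looks like without NLTK, the sketch below counts tokens across every file in a folder using only the standard library. It is a stand-in under stated assumptions, not a reimplementation of `PlaintextCorpusReader`: `corpus_word_counts` is a hypothetical helper, it splits on whitespace rather than using NLTK's tokenizer, and it runs on two small sample files in a temporary folder.

```python
import tempfile
from collections import Counter
from pathlib import Path

def corpus_word_counts(folder, pattern="*.ctxt"):
    """Count whitespace-separated tokens across every matching file in `folder`."""
    counts = Counter()
    for path in sorted(Path(folder).glob(pattern)):
        counts.update(path.read_text(encoding="utf-8").split())
    return counts

with tempfile.TemporaryDirectory() as tmp:
    # Two small stand-in documents; the real corpus would be the CTEXT folder.
    Path(tmp, "1.ctxt").write_text("Air India partners with STARLUX Airlines", encoding="utf-8")
    Path(tmp, "2.ctxt").write_text("Skoda planning to launch EV in India", encoding="utf-8")
    counts = corpus_word_counts(tmp)

total_tokens = sum(counts.values())  # analogous to len(ctext.words())
vocab_size = len(counts)             # analogous to len(set(ctext.words()))
print(total_tokens, vocab_size)
print(counts.most_common(1))
```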