# News Scraping with Python

In [1]:
%%capture
#INCLUDING SCIENTIFIC AND NUMERICAL COMPUTING LIBRARIES
#Run this code to make sure that you have all the libraries at one go.
%pylab inline
import os
import pandas as pd
%load_ext rpy2.ipython

In [2]:
# Basic lines of code needed to import a data file with permissions from Google Drive
from google.colab import drive
# drive.mount("/content/drive", force_remount=True)
drive.mount('/content/drive')
os.chdir("drive/My Drive/Books_Writings/NLPBook/")

Mounted at /content/drive


## News Extractor: Reading in parts of a URL

Let's read in the top news from the ET main page.

You also want to get SelectorGadget: http://selectorgadget.com/

In [3]:
!pip install cssselect

Collecting cssselect
  Downloading cssselect-1.2.0-py2.py3-none-any.whl.metadata (2.2 kB)
Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: cssselect
Successfully installed cssselect-1.2.0


In [4]:
import requests
from lxml.html import fromstring

In [5]:
#Copy the URL from the web site
url = 'https://economictimes.indiatimes.com'
html = requests.get(url, timeout=10).text

#See: http://infohost.nmt.edu/~shipman/soft/pylxml/web/etree-fromstring.html
doc = fromstring(html)

#http://lxml.de/cssselect.html#the-cssselect-method
doc.cssselect(".active")

[<Element li at 0x7f96a754b930>,
 <Element li at 0x7f96a7721f90>,
 <Element li at 0x7f96fdd1b700>,
 <Element li at 0x7f96a754ec60>,
 <Element li at 0x7f96a754ed00>,
 <Element li at 0x7f96a754ed50>,
 <Element li at 0x7f96a754eda0>,
 <Element li at 0x7f96a754edf0>,
 <Element li at 0x7f96a754ee40>,
 <Element li at 0x7f96a754ee90>,
 <Element li at 0x7f96a754eee0>,
 <Element li at 0x7f96a754ef30>,
 <Element li at 0x7f96a754ef80>,
 <Element li at 0x7f96a754efd0>,
 <Element li at 0x7f96a754f020>,
 <Element li at 0x7f96a754f070>]

In [6]:
x = doc.cssselect(".active li")    #Try a, h2, section if you like
headlines = [x[j].text_content() for j in range(len(x))]
# headlines = headlines[:20]   #Needed to exclude any other stuff that was not needed.
for h in headlines:
    print(h)


Middle class tax pain to be finally alleviated this time?
Modi govt has a key task in Budget 2025: Unlocking the PLI goldmine
Coldplay live hits 83L views on Hotstar
New Zealand to let visitors to work remotely
Trump urges 'fair' India-US trade in Modi call
What is Deepseek that freaked out AI world 
Dubai's boom is putting strains on residents
Trump vows to build 'Iron Dome' missile shield
Google Maps' plan for the 'Gulf of America'
Justice Dept fires Trump case prosecutors
Hamas says 300K displaced return
17 battles may shape Delhi's 2025 polls
RBI dissolves Aviom Housing board
Ujjivan & others lower lending rate from Jan
Body Shop to begin manufacturing in India
SC spurns plea to expedite Sebi probe
Building collapses in Burari, many trapped
NCLAT dismisses insolvency plea against HUL
India, China to resume flights after 5 yrs
PM Modi speaks to US Prez Trump over phone 
Where are women in India Inc's C-suite roles?
DeepSeek, Masa Son have lessons for Stargate
Bumrah named Test Crick

In [7]:
#Sentiment scoring
## Here we will read in an entire dictionary from Harvard Inquirer
f = open('NLP_data/inqdict.txt')
HIDict = f.read()
HIDict = HIDict.splitlines()
HIDict = HIDict[1:]
print(HIDict[:5])
print(len(HIDict))

#Extract all the lines that contain the Pos tag
poswords = [j for j in HIDict if "Pos" in j]  #using a list comprehension
poswords = [j.split()[0] for j in poswords]
poswords = [j.split("#")[0] for j in poswords]
poswords = unique(poswords)
poswords = [j.lower() for j in poswords]
print(poswords[:20])
print(len(poswords))

#Extract all the lines that contain the Neg tag
negwords = [j for j in HIDict if "Neg" in j]  #using a list comprehension
negwords = [j.split()[0] for j in negwords]
negwords = [j.split("#")[0] for j in negwords]
negwords = unique(negwords)
negwords = [j.lower() for j in negwords]
print(negwords[:20])
print(len(negwords))

['A H4Lvd DET ART  | article: Indefinite singular article--some or any one', 'ABANDON H4Lvd Neg Ngtv Weak Fail IAV AFFLOSS AFFTOT SUPV  |', 'ABANDONMENT H4 Neg Weak Fail Noun  |', 'ABATE H4Lvd Neg Psv Decr IAV TRANS SUPV  |', 'ABATEMENT Lvd Noun  ']
11895
['abide', 'able', 'abound', 'absolve', 'absorbent', 'absorption', 'abundance', 'abundant', 'accede', 'accentuate', 'accept', 'acceptable', 'acceptance', 'accessible', 'accession', 'acclaim', 'acclamation', 'accolade', 'accommodate', 'accommodation']
1646
['abandon', 'abandonment', 'abate', 'abdicate', 'abhor', 'abject', 'abnormal', 'abolish', 'abominable', 'abrasive', 'abrupt', 'abscond', 'absence', 'absent', 'absent-minded', 'absentee', 'absurd', 'absurdity', 'abuse', 'abyss']
2120


In [8]:
#Create a sentiment scoring function
def textSentiment(text,poswords,negwords):
    text.lower(); print(text)
    text = text.split(' ')
    posmatches = set(text).intersection(set(poswords)); print(posmatches)
    negmatches = set(text).intersection(set(negwords)); print(negmatches)
    return [len(posmatches),len(negmatches)]

In [9]:
for h in headlines:
    s = textSentiment(h,poswords,negwords)
    print(s)

Middle class tax pain to be finally alleviated this time?
set()
{'pain', 'tax'}
[0, 2]
Modi govt has a key task in Budget 2025: Unlocking the PLI goldmine
set()
set()
[0, 0]
Coldplay live hits 83L views on Hotstar
{'live'}
set()
[1, 0]
New Zealand to let visitors to work remotely
set()
{'let'}
[0, 1]
Trump urges 'fair' India-US trade in Modi call
{'call'}
set()
[1, 0]
What is Deepseek that freaked out AI world 
set()
set()
[0, 0]
Dubai's boom is putting strains on residents
{'boom'}
set()
[1, 0]
Trump vows to build 'Iron Dome' missile shield
{'shield'}
set()
[1, 0]
Google Maps' plan for the 'Gulf of America'
set()
set()
[0, 0]
Justice Dept fires Trump case prosecutors
set()
set()
[0, 0]
Hamas says 300K displaced return
{'return'}
set()
[1, 0]
17 battles may shape Delhi's 2025 polls
set()
set()
[0, 0]
RBI dissolves Aviom Housing board
{'board'}
{'board'}
[1, 1]
Ujjivan & others lower lending rate from Jan
set()
{'lower'}
[0, 1]
Body Shop to begin manufacturing in India
set()
set()
[0, 0

## Using R for extraction with rvest

There are various options to run R code in Jupyter:
1. Run this in a new notebook with the R kernel.
2. Install rpy2 with pip : pip install rpy2
3. Install the anaconda package using conda : search for 'anaconda r-package-name'
4. Using a R code block, use "install.packages("r-package-name")

In [None]:
# !pip install -U rpy2   # run this in the R kernel with: system('pip install rpy2')
# %reload_ext rpy2.ipython
# ! conda install -c conda-forge r-rvest -y

In [10]:
%%R
install.packages(c("magrittr","stringr","rvest"))

(as ‘lib’ is unspecified)

















	‘/tmp/RtmpRXXdK8/downloaded_packages’



In [11]:
%%R
library(rvest)
library(magrittr)
library(stringr)

In [12]:
%%R
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
print(url)
doc = read_html(url)
# res = doc %>% html_nodes("table") %>% html_table()
res = doc %>% html_element("table") %>% html_table()

[1] "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"


In [13]:
%%R
res

# A tibble: 503 × 8
   Symbol Security      `GICS Sector` `GICS Sub-Industry` Headquarters Locatio…¹
   <chr>  <chr>         <chr>         <chr>               <chr>                 
 1 MMM    3M            Industrials   Industrial Conglom… Saint Paul, Minnesota 
 2 AOS    A. O. Smith   Industrials   Building Products   Milwaukee, Wisconsin  
 3 ABT    Abbott Labor… Health Care   Health Care Equipm… North Chicago, Illino…
 4 ABBV   AbbVie        Health Care   Biotechnology       North Chicago, Illino…
 5 ACN    Accenture     Information … IT Consulting & Ot… Dublin, Ireland       
 6 ADBE   Adobe Inc.    Information … Application Softwa… San Jose, California  
 7 AMD    Advanced Mic… Information … Semiconductors      Santa Clara, Californ…
 8 AES    AES Corporat… Utilities     Independent Power … Arlington, Virginia   
 9 AFL    Aflac         Financials    Life & Health Insu… Columbus, Georgia     
10 A      Agilent Tech… Health Care   Life Sciences Tool… Santa Clara, Californ…
# ℹ 493 

In [16]:
%%R
symbols = res[1]$Symbol
symbols

  [1] "MMM"   "AOS"   "ABT"   "ABBV"  "ACN"   "ADBE"  "AMD"   "AES"   "AFL"  
 [10] "A"     "APD"   "ABNB"  "AKAM"  "ALB"   "ARE"   "ALGN"  "ALLE"  "LNT"  
 [19] "ALL"   "GOOGL" "GOOG"  "MO"    "AMZN"  "AMCR"  "AEE"   "AEP"   "AXP"  
 [28] "AIG"   "AMT"   "AWK"   "AMP"   "AME"   "AMGN"  "APH"   "ADI"   "ANSS" 
 [37] "AON"   "APA"   "APO"   "AAPL"  "AMAT"  "APTV"  "ACGL"  "ADM"   "ANET" 
 [46] "AJG"   "AIZ"   "T"     "ATO"   "ADSK"  "ADP"   "AZO"   "AVB"   "AVY"  
 [55] "AXON"  "BKR"   "BALL"  "BAC"   "BAX"   "BDX"   "BRK.B" "BBY"   "TECH" 
 [64] "BIIB"  "BLK"   "BX"    "BK"    "BA"    "BKNG"  "BWA"   "BSX"   "BMY"  
 [73] "AVGO"  "BR"    "BRO"   "BF.B"  "BLDR"  "BG"    "BXP"   "CHRW"  "CDNS" 
 [82] "CZR"   "CPT"   "CPB"   "COF"   "CAH"   "KMX"   "CCL"   "CARR"  "CAT"  
 [91] "CBOE"  "CBRE"  "CDW"   "CE"    "COR"   "CNC"   "CNP"   "CF"    "CRL"  
[100] "SCHW"  "CHTR"  "CVX"   "CMG"   "CB"    "CHD"   "CI"    "CINF"  "CTAS" 
[109] "CSCO"  "C"     "CFG"   "CLX"   "CME"   "CMS"   "KO"    "C

In [None]:
%%R
res = doc %>% html_nodes("p") %>% html_text()
print(res)

[1] "\nThe S&P 500 is a stock market index maintained by S&P Dow Jones Indices. It comprises 503 common stocks which are issued by 500 large-cap companies traded on the American stock exchanges (including the 30 companies that compose the Dow Jones Industrial Average). The index includes about 80 percent of the American market by capitalization. It is weighted by free-float market capitalization, so more valuable companies account for relatively more weight in the index. The index constituents and the constituent weights are updated regularly using rules published by S&P Dow Jones Indices. Although called the S&P 500, the index contains 503 stocks because it includes two share classes of stock from 3 of its component companies.[1][2]"
[2] "S&P Dow Jones Indices updates the components of the S&P 500 periodically, typically in response to acquisitions, or to keep the index up to date as various companies grow or shrink in value.[3] Between January 1, 1963, and December 31, 2014, 1,186 in

In [19]:
syms = %Rget symbols

In [20]:
syms

0,1,2,3,4,5,6
'MMM','AOS','ABT',...,'ZBRA','ZBH','ZTS'


In [21]:
x = 3


In [22]:
%Rpush x

In [23]:
%%R
x

[1] 3
