12. News Scraping with Python
%%capture
#INCLUDING SCIENTIFIC AND NUMERICAL COMPUTING LIBRARIES
#Run this code to make sure that you have all the libraries at one go.
%pylab inline
import os
import pandas as pd
%load_ext rpy2.ipython
# Basic lines of code needed to import a data file with permissions from Google Drive
from google.colab import drive
# drive.mount("/content/drive", force_remount=True)
drive.mount('/content/drive')
os.chdir("drive/My Drive/Books_Writings/NLPBook/")
Mounted at /content/drive
12.1. News Extractor: Reading in parts of a URL
Let’s read the top headlines from the Economic Times (ET) home page.
SelectorGadget (http://selectorgadget.com/) is a handy browser tool for finding the CSS selector of the page elements you want to extract.
!pip install cssselect
Collecting cssselect
Downloading cssselect-1.2.0-py2.py3-none-any.whl.metadata (2.2 kB)
Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: cssselect
Successfully installed cssselect-1.2.0
import requests
from lxml.html import fromstring
#Copy the URL from the web site
url = 'https://economictimes.indiatimes.com'
html = requests.get(url, timeout=10).text
#See: http://infohost.nmt.edu/~shipman/soft/pylxml/web/etree-fromstring.html
doc = fromstring(html)
#http://lxml.de/cssselect.html#the-cssselect-method
doc.cssselect(".active")
[<Element li at 0x7f96a754b930>,
<Element li at 0x7f96a7721f90>,
<Element li at 0x7f96fdd1b700>,
<Element li at 0x7f96a754ec60>,
<Element li at 0x7f96a754ed00>,
<Element li at 0x7f96a754ed50>,
<Element li at 0x7f96a754eda0>,
<Element li at 0x7f96a754edf0>,
<Element li at 0x7f96a754ee40>,
<Element li at 0x7f96a754ee90>,
<Element li at 0x7f96a754eee0>,
<Element li at 0x7f96a754ef30>,
<Element li at 0x7f96a754ef80>,
<Element li at 0x7f96a754efd0>,
<Element li at 0x7f96a754f020>,
<Element li at 0x7f96a754f070>]
x = doc.cssselect(".active li") #Try a, h2, section if you like
headlines = [el.text_content() for el in x]
# headlines = headlines[:20] #Needed to exclude any other stuff that was not needed.
for h in headlines:
    print(h)
Middle class tax pain to be finally alleviated this time?
Modi govt has a key task in Budget 2025: Unlocking the PLI goldmine
Coldplay live hits 83L views on Hotstar
New Zealand to let visitors to work remotely
Trump urges 'fair' India-US trade in Modi call
What is Deepseek that freaked out AI world
Dubai's boom is putting strains on residents
Trump vows to build 'Iron Dome' missile shield
Google Maps' plan for the 'Gulf of America'
Justice Dept fires Trump case prosecutors
Hamas says 300K displaced return
17 battles may shape Delhi's 2025 polls
RBI dissolves Aviom Housing board
Ujjivan & others lower lending rate from Jan
Body Shop to begin manufacturing in India
SC spurns plea to expedite Sebi probe
Building collapses in Burari, many trapped
NCLAT dismisses insolvency plea against HUL
India, China to resume flights after 5 yrs
PM Modi speaks to US Prez Trump over phone
Where are women in India Inc's C-suite roles?
DeepSeek, Masa Son have lessons for Stargate
Bumrah named Test Cricketer of the Year
DOGE's conflict-of-interest clash for Musk
JSW to invest Rs 2,600 cr in Jharkhand project
JP Morgan's investment banking head quits
Is Spain really banning tourists?
Kejriwal gives Delhi 15 'guarantees'
Advent set to acquire Orra at Rs 1,750 cr
China's dam project is alarming everyone
Meesho closes $550 mn round, moves NCLT
Airlines warned over sky-high Prayagraj airfare
Budget: SBI suggests new funding for sectors
US intensifies crackdown on immigration
China's industrial profits fall 3.3% in 2024
Indian government is on a hiring spree
Hamas to free 3 Israeli hostages before Friday
Anand Mahindra gets emotional for young entrepreneur who took his father to dine at five-star hotel where he worked as a guard
Veer Pahariya takes flight with his delightful debut in Sky Force
Biggest stock market crash coming in February: Rich Dad Poor Dad's author Robert Kiyosaki
Wildlife expert Forrest Galante visits Anant Ambani’s Vantara, the world’s largest wildlife rehabilitation and rescue sanctuary
Revenge quitting is the newest workplace trend
How did the Jennifer Aniston and Barack Obama affair rumor begin? Here's the breakdown
Is Planetary Parade visible tonight as part of a rare event that won’t repeat for 400 years?
#Sentiment scoring
## Here we will read in an entire dictionary from Harvard Inquirer
with open('NLP_data/inqdict.txt') as f: #The with block closes the file automatically
    HIDict = f.read().splitlines()
HIDict = HIDict[1:] #Skip the first line of the file
print(HIDict[:5])
print(len(HIDict))
#Extract all the lines that contain the Pos tag
poswords = [j for j in HIDict if "Pos" in j] #using a list comprehension
poswords = [j.split()[0] for j in poswords]
poswords = [j.split("#")[0] for j in poswords]
poswords = unique(poswords) #unique() is numpy.unique, made available by %pylab inline
poswords = [j.lower() for j in poswords]
print(poswords[:20])
print(len(poswords))
#Extract all the lines that contain the Neg tag
negwords = [j for j in HIDict if "Neg" in j] #using a list comprehension
negwords = [j.split()[0] for j in negwords]
negwords = [j.split("#")[0] for j in negwords]
negwords = unique(negwords)
negwords = [j.lower() for j in negwords]
print(negwords[:20])
print(len(negwords))
['A H4Lvd DET ART | article: Indefinite singular article--some or any one', 'ABANDON H4Lvd Neg Ngtv Weak Fail IAV AFFLOSS AFFTOT SUPV |', 'ABANDONMENT H4 Neg Weak Fail Noun |', 'ABATE H4Lvd Neg Psv Decr IAV TRANS SUPV |', 'ABATEMENT Lvd Noun ']
11895
['abide', 'able', 'abound', 'absolve', 'absorbent', 'absorption', 'abundance', 'abundant', 'accede', 'accentuate', 'accept', 'acceptable', 'acceptance', 'accessible', 'accession', 'acclaim', 'acclamation', 'accolade', 'accommodate', 'accommodation']
1646
['abandon', 'abandonment', 'abate', 'abdicate', 'abhor', 'abject', 'abnormal', 'abolish', 'abominable', 'abrasive', 'abrupt', 'abscond', 'absence', 'absent', 'absent-minded', 'absentee', 'absurd', 'absurdity', 'abuse', 'abyss']
2120
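The extraction step can also be written without relying on pylab's unique. The sketch below is one possible variant, not the code used to produce the counts above: it matches the tag as a whole field rather than as a substring of the line, which is a slightly stricter rule and may give different counts on the full file. The function name words_with_tag is hypothetical, and the sample lines are copied from the printed output above.

```python
# Sample lines copied from the printed output above.
sample_lines = [
    "A H4Lvd DET ART | article: Indefinite singular article--some or any one",
    "ABANDON H4Lvd Neg Ngtv Weak Fail IAV AFFLOSS AFFTOT SUPV |",
    "ABANDONMENT H4 Neg Weak Fail Noun |",
    "ABATE H4Lvd Neg Psv Decr IAV TRANS SUPV |",
    "ABATEMENT Lvd Noun ",
]

def words_with_tag(lines, tag):
    """Return sorted, lowercased, de-duplicated head words tagged with `tag`."""
    found = set()
    for line in lines:
        # Keep only the fields before "|" so that definition text is ignored.
        fields = line.split("|")[0].split()
        if not fields:
            continue
        # Entries such as WORD#1 and WORD#2 are senses of the same word.
        head = fields[0].split("#")[0].lower()
        if tag in fields[1:]:
            found.add(head)
    return sorted(found)

print(words_with_tag(sample_lines, "Neg"))
print(words_with_tag(sample_lines, "Pos"))
```

On these five lines the first call returns the three negative words and the second returns an empty list, since none of the sample entries carries the Pos tag.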
#Create a sentiment scoring function
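def textSentiment(text,poswords,negwords):
    #Note: str.lower() returns a new string and the result is not assigned here,
    #so the matching below is case-sensitive. Use text = text.lower() to change that.
    text.lower(); print(text)
    text = text.split(' ')
    posmatches = set(text).intersection(set(poswords)); print(posmatches)
    negmatches = set(text).intersection(set(negwords)); print(negmatches)
    return [len(posmatches),len(negmatches)]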
for h in headlines:
    s = textSentiment(h,poswords,negwords)
    print(s)
Middle class tax pain to be finally alleviated this time?
set()
{'pain', 'tax'}
[0, 2]
Modi govt has a key task in Budget 2025: Unlocking the PLI goldmine
set()
set()
[0, 0]
Coldplay live hits 83L views on Hotstar
{'live'}
set()
[1, 0]
New Zealand to let visitors to work remotely
set()
{'let'}
[0, 1]
Trump urges 'fair' India-US trade in Modi call
{'call'}
set()
[1, 0]
What is Deepseek that freaked out AI world
set()
set()
[0, 0]
Dubai's boom is putting strains on residents
{'boom'}
set()
[1, 0]
Trump vows to build 'Iron Dome' missile shield
{'shield'}
set()
[1, 0]
Google Maps' plan for the 'Gulf of America'
set()
set()
[0, 0]
Justice Dept fires Trump case prosecutors
set()
set()
[0, 0]
Hamas says 300K displaced return
{'return'}
set()
[1, 0]
17 battles may shape Delhi's 2025 polls
set()
set()
[0, 0]
RBI dissolves Aviom Housing board
{'board'}
{'board'}
[1, 1]
Ujjivan & others lower lending rate from Jan
set()
{'lower'}
[0, 1]
Body Shop to begin manufacturing in India
set()
set()
[0, 0]
SC spurns plea to expedite Sebi probe
set()
set()
[0, 0]
Building collapses in Burari, many trapped
set()
set()
[0, 0]
NCLAT dismisses insolvency plea against HUL
set()
{'against'}
[0, 1]
India, China to resume flights after 5 yrs
set()
set()
[0, 0]
PM Modi speaks to US Prez Trump over phone
set()
set()
[0, 0]
Where are women in India Inc's C-suite roles?
set()
set()
[0, 0]
DeepSeek, Masa Son have lessons for Stargate
{'have'}
set()
[1, 0]
Bumrah named Test Cricketer of the Year
set()
set()
[0, 0]
DOGE's conflict-of-interest clash for Musk
set()
{'clash'}
[0, 1]
JSW to invest Rs 2,600 cr in Jharkhand project
set()
set()
[0, 0]
JP Morgan's investment banking head quits
set()
set()
[0, 0]
Is Spain really banning tourists?
set()
set()
[0, 0]
Kejriwal gives Delhi 15 'guarantees'
set()
set()
[0, 0]
Advent set to acquire Orra at Rs 1,750 cr
set()
set()
[0, 0]
China's dam project is alarming everyone
set()
{'alarming'}
[0, 1]
Meesho closes $550 mn round, moves NCLT
set()
set()
[0, 0]
Airlines warned over sky-high Prayagraj airfare
set()
set()
[0, 0]
Budget: SBI suggests new funding for sectors
set()
set()
[0, 0]
US intensifies crackdown on immigration
set()
set()
[0, 0]
China's industrial profits fall 3.3% in 2024
set()
{'fall'}
[0, 1]
Indian government is on a hiring spree
set()
set()
[0, 0]
Hamas to free 3 Israeli hostages before Friday
{'free'}
set()
[1, 0]
Anand Mahindra gets emotional for young entrepreneur who took his father to dine at five-star hotel where he worked as a guard
{'his'}
set()
[1, 0]
Veer Pahariya takes flight with his delightful debut in Sky Force
{'his', 'delightful'}
set()
[2, 0]
Biggest stock market crash coming in February: Rich Dad Poor Dad's author Robert Kiyosaki
set()
set()
[0, 0]
Wildlife expert Forrest Galante visits Anant Ambani’s Vantara, the world’s largest wildlife rehabilitation and rescue sanctuary
{'sanctuary', 'rescue', 'expert', 'rehabilitation'}
set()
[4, 0]
Revenge quitting is the newest workplace trend
set()
set()
[0, 0]
How did the Jennifer Aniston and Barack Obama affair rumor begin? Here's the breakdown
set()
{'breakdown', 'rumor'}
[0, 2]
Is Planetary Parade visible tonight as part of a rare event that won’t repeat for 400 years?
set()
set()
[0, 0]
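Several headlines above score [0, 0] even though they contain words a reader would call negative. Splitting on spaces leaves capitalized words and words with attached punctuation (such as "trapped" in "many trapped" versus "Burari,") in a form that cannot match lowercase dictionary entries. A minimal sketch of a variant that lowercases the text and tokenizes with a regular expression follows; the function name text_sentiment and the tiny word lists are hypothetical, and the results will differ from the output shown above.

```python
import re

def text_sentiment(text, poswords, negwords):
    """Count distinct positive and negative dictionary words in `text`."""
    # Lowercase first, then keep runs of letters as tokens; internal
    # apostrophes and hyphens are kept so "sky-high" stays one token.
    tokens = set(re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower()))
    return [len(tokens & set(poswords)), len(tokens & set(negwords))]

# Tiny hypothetical word lists, for illustration only.
demo_pos = ["free", "delightful"]
demo_neg = ["pain", "tax", "crash"]

print(text_sentiment("Middle class tax pain to be finally alleviated this time?",
                     demo_pos, demo_neg))
print(text_sentiment("Crash! Markets are not free of pain.", demo_pos, demo_neg))
```

The first call prints [0, 2] and the second prints [1, 2]: "Crash!" and "pain." are now matched despite the capital letter and the punctuation. Note that this is still a bag-of-words count, so negation ("not free") is not handled.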
12.2. Using R for extraction with rvest
There are several ways to run R code in Jupyter:
Run the code in a new notebook with the R kernel.
Install rpy2 with pip (pip install rpy2) and use the %%R cell magic, as done here.
Install R packages with conda: search for ‘anaconda r-package-name’ to find the conda package.
In an R code block, call install.packages("r-package-name").
# !pip install -U rpy2 # run this in the R kernel with: system('pip install rpy2')
# %reload_ext rpy2.ipython
# ! conda install -c conda-forge r-rvest -y
%%R
install.packages(c("magrittr","stringr","rvest"))
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/magrittr_2.0.3.tar.gz'
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Content type 'application/x-gzip'
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: length 267074 bytes (260 KB)
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: downloaded 260 KB
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/stringr_1.5.1.tar.gz'
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Content type 'application/x-gzip'
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: length 176599 bytes (172 KB)
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: downloaded 172 KB
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/rvest_1.0.4.tar.gz'
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Content type 'application/x-gzip'
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: length 115876 bytes (113 KB)
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: downloaded 113 KB
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: The downloaded source packages are in
‘/tmp/RtmpRXXdK8/downloaded_packages’
%%R
library(rvest)
library(magrittr)
library(stringr)
%%R
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
print(url)
doc = read_html(url)
# res = doc %>% html_nodes("table") %>% html_table()
res = doc %>% html_element("table") %>% html_table()
[1] "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
%%R
res
# A tibble: 503 × 8
Symbol Security `GICS Sector` `GICS Sub-Industry` Headquarters Locatio…¹
<chr> <chr> <chr> <chr> <chr>
1 MMM 3M Industrials Industrial Conglom… Saint Paul, Minnesota
2 AOS A. O. Smith Industrials Building Products Milwaukee, Wisconsin
3 ABT Abbott Labor… Health Care Health Care Equipm… North Chicago, Illino…
4 ABBV AbbVie Health Care Biotechnology North Chicago, Illino…
5 ACN Accenture Information … IT Consulting & Ot… Dublin, Ireland
6 ADBE Adobe Inc. Information … Application Softwa… San Jose, California
7 AMD Advanced Mic… Information … Semiconductors Santa Clara, Californ…
8 AES AES Corporat… Utilities Independent Power … Arlington, Virginia
9 AFL Aflac Financials Life & Health Insu… Columbus, Georgia
10 A Agilent Tech… Health Care Life Sciences Tool… Santa Clara, Californ…
# ℹ 493 more rows
# ℹ abbreviated name: ¹`Headquarters Location`
# ℹ 3 more variables: `Date added` <chr>, CIK <int>, Founded <chr>
# ℹ Use `print(n = ...)` to see more rows
%%R
symbols = res[1]$Symbol
symbols
[1] "MMM" "AOS" "ABT" "ABBV" "ACN" "ADBE" "AMD" "AES" "AFL"
[10] "A" "APD" "ABNB" "AKAM" "ALB" "ARE" "ALGN" "ALLE" "LNT"
[19] "ALL" "GOOGL" "GOOG" "MO" "AMZN" "AMCR" "AEE" "AEP" "AXP"
[28] "AIG" "AMT" "AWK" "AMP" "AME" "AMGN" "APH" "ADI" "ANSS"
[37] "AON" "APA" "APO" "AAPL" "AMAT" "APTV" "ACGL" "ADM" "ANET"
[46] "AJG" "AIZ" "T" "ATO" "ADSK" "ADP" "AZO" "AVB" "AVY"
[55] "AXON" "BKR" "BALL" "BAC" "BAX" "BDX" "BRK.B" "BBY" "TECH"
[64] "BIIB" "BLK" "BX" "BK" "BA" "BKNG" "BWA" "BSX" "BMY"
[73] "AVGO" "BR" "BRO" "BF.B" "BLDR" "BG" "BXP" "CHRW" "CDNS"
[82] "CZR" "CPT" "CPB" "COF" "CAH" "KMX" "CCL" "CARR" "CAT"
[91] "CBOE" "CBRE" "CDW" "CE" "COR" "CNC" "CNP" "CF" "CRL"
[100] "SCHW" "CHTR" "CVX" "CMG" "CB" "CHD" "CI" "CINF" "CTAS"
[109] "CSCO" "C" "CFG" "CLX" "CME" "CMS" "KO" "CTSH" "CL"
[118] "CMCSA" "CAG" "COP" "ED" "STZ" "CEG" "COO" "CPRT" "GLW"
[127] "CPAY" "CTVA" "CSGP" "COST" "CTRA" "CRWD" "CCI" "CSX" "CMI"
[136] "CVS" "DHR" "DRI" "DVA" "DAY" "DECK" "DE" "DELL" "DAL"
[145] "DVN" "DXCM" "FANG" "DLR" "DFS" "DG" "DLTR" "D" "DPZ"
[154] "DOV" "DOW" "DHI" "DTE" "DUK" "DD" "EMN" "ETN" "EBAY"
[163] "ECL" "EIX" "EW" "EA" "ELV" "EMR" "ENPH" "ETR" "EOG"
[172] "EPAM" "EQT" "EFX" "EQIX" "EQR" "ERIE" "ESS" "EL" "EG"
[181] "EVRG" "ES" "EXC" "EXPE" "EXPD" "EXR" "XOM" "FFIV" "FDS"
[190] "FICO" "FAST" "FRT" "FDX" "FIS" "FITB" "FSLR" "FE" "FI"
[199] "FMC" "F" "FTNT" "FTV" "FOXA" "FOX" "BEN" "FCX" "GRMN"
[208] "IT" "GE" "GEHC" "GEV" "GEN" "GNRC" "GD" "GIS" "GM"
[217] "GPC" "GILD" "GPN" "GL" "GDDY" "GS" "HAL" "HIG" "HAS"
[226] "HCA" "DOC" "HSIC" "HSY" "HES" "HPE" "HLT" "HOLX" "HD"
[235] "HON" "HRL" "HST" "HWM" "HPQ" "HUBB" "HUM" "HBAN" "HII"
[244] "IBM" "IEX" "IDXX" "ITW" "INCY" "IR" "PODD" "INTC" "ICE"
[253] "IFF" "IP" "IPG" "INTU" "ISRG" "IVZ" "INVH" "IQV" "IRM"
[262] "JBHT" "JBL" "JKHY" "J" "JNJ" "JCI" "JPM" "JNPR" "K"
[271] "KVUE" "KDP" "KEY" "KEYS" "KMB" "KIM" "KMI" "KKR" "KLAC"
[280] "KHC" "KR" "LHX" "LH" "LRCX" "LW" "LVS" "LDOS" "LEN"
[289] "LII" "LLY" "LIN" "LYV" "LKQ" "LMT" "L" "LOW" "LULU"
[298] "LYB" "MTB" "MPC" "MKTX" "MAR" "MMC" "MLM" "MAS" "MA"
[307] "MTCH" "MKC" "MCD" "MCK" "MDT" "MRK" "META" "MET" "MTD"
[316] "MGM" "MCHP" "MU" "MSFT" "MAA" "MRNA" "MHK" "MOH" "TAP"
[325] "MDLZ" "MPWR" "MNST" "MCO" "MS" "MOS" "MSI" "MSCI" "NDAQ"
[334] "NTAP" "NFLX" "NEM" "NWSA" "NWS" "NEE" "NKE" "NI" "NDSN"
[343] "NSC" "NTRS" "NOC" "NCLH" "NRG" "NUE" "NVDA" "NVR" "NXPI"
[352] "ORLY" "OXY" "ODFL" "OMC" "ON" "OKE" "ORCL" "OTIS" "PCAR"
[361] "PKG" "PLTR" "PANW" "PARA" "PH" "PAYX" "PAYC" "PYPL" "PNR"
[370] "PEP" "PFE" "PCG" "PM" "PSX" "PNW" "PNC" "POOL" "PPG"
[379] "PPL" "PFG" "PG" "PGR" "PLD" "PRU" "PEG" "PTC" "PSA"
[388] "PHM" "PWR" "QCOM" "DGX" "RL" "RJF" "RTX" "O" "REG"
[397] "REGN" "RF" "RSG" "RMD" "RVTY" "ROK" "ROL" "ROP" "ROST"
[406] "RCL" "SPGI" "CRM" "SBAC" "SLB" "STX" "SRE" "NOW" "SHW"
[415] "SPG" "SWKS" "SJM" "SW" "SNA" "SOLV" "SO" "LUV" "SWK"
[424] "SBUX" "STT" "STLD" "STE" "SYK" "SMCI" "SYF" "SNPS" "SYY"
[433] "TMUS" "TROW" "TTWO" "TPR" "TRGP" "TGT" "TEL" "TDY" "TFX"
[442] "TER" "TSLA" "TXN" "TPL" "TXT" "TMO" "TJX" "TSCO" "TT"
[451] "TDG" "TRV" "TRMB" "TFC" "TYL" "TSN" "USB" "UBER" "UDR"
[460] "ULTA" "UNP" "UAL" "UPS" "URI" "UNH" "UHS" "VLO" "VTR"
[469] "VLTO" "VRSN" "VRSK" "VZ" "VRTX" "VTRS" "VICI" "V" "VST"
[478] "VMC" "WRB" "GWW" "WAB" "WBA" "WMT" "DIS" "WBD" "WM"
[487] "WAT" "WEC" "WFC" "WELL" "WST" "WDC" "WY" "WMB" "WTW"
[496] "WDAY" "WYNN" "XEL" "XYL" "YUM" "ZBRA" "ZBH" "ZTS"
%%R
res = doc %>% html_nodes("p") %>% html_text()
print(res)
[1] "\nThe S&P 500 is a stock market index maintained by S&P Dow Jones Indices. It comprises 503 common stocks which are issued by 500 large-cap companies traded on the American stock exchanges (including the 30 companies that compose the Dow Jones Industrial Average). The index includes about 80 percent of the American market by capitalization. It is weighted by free-float market capitalization, so more valuable companies account for relatively more weight in the index. The index constituents and the constituent weights are updated regularly using rules published by S&P Dow Jones Indices. Although called the S&P 500, the index contains 503 stocks because it includes two share classes of stock from 3 of its component companies.[1][2]"
[2] "S&P Dow Jones Indices updates the components of the S&P 500 periodically, typically in response to acquisitions, or to keep the index up to date as various companies grow or shrink in value.[3] Between January 1, 1963, and December 31, 2014, 1,186 index components were replaced by other components.\n"
syms = %Rget symbols
syms
StrVector with 503 elements.
'MMM' | 'AOS' | 'ABT' | ... | 'ZBRA' | 'ZBH' | 'ZTS' |
x = 3
%Rpush x
%%R
x
[1] 3