11. Web-Scraping, Dictionaries, Sentiment#

Note: This notebook may need to be run with the PyTorch kernel.

Reading references

We will be using a mix of R and Python in this notebook.

%%capture
# Scientific and numerical computing libraries.
# Run this cell first so that all the libraries are loaded in one go.
# Note: %pylab inline puts NumPy names (such as unique, used later) into the namespace.
%pylab inline
import os
import pandas as pd
%load_ext rpy2.ipython
# Basic lines of code needed to import a data file with permissions from Google Drive
from google.colab import drive
# drive.mount("/content/drive", force_remount=True)
drive.mount('/content/drive')
os.chdir("drive/My Drive/Books_Writings/NLPBook/")
Mounted at /content/drive

11.1. Read in a URL#

## Reading in a URL
import requests

url = 'http://srdas.github.io/bio-candid.html'
f = requests.get(url)
text = f.text
print(text)
f.close()
<HTML>
<style>
    body {
        font-family:'Segoe UI', Tahoma, Geneva, Verdana, sans-serif
    }
</style>
      
<BODY background="http://algo.scu.edu/~sanjivdas/graphics/back2.gif">

<h2>Bio</h2>


Sanjiv Das is the William and Janice Terry Professor of Finance and Data Science at Santa Clara University's Leavey School of Business, and Amazon Scholar at AWS. He previously held faculty appointments at Harvard Business School and UC Berkeley. He has post-graduate degrees in Finance (M.Phil and Ph.D. from New York University), Computer Science (M.S. from UC Berkeley), an MBA from the Indian Institute of Management Ahmedabad (IIMA), B.Com in Accounting and Economics (University of Bombay, Sydenham College), and is also a qualified Cost and Works Accountant (AICWA). He is a senior founding editor of The Journal of Investment Management, is on the Advisory Board of the Journal of Financial Data Science, and holds editorial positions at other journals. Prior to being an academic, he worked in the financial derivatives business in the Asia-Pacific region as a Vice-President at Citibank. His current research interests include: AI and machine learning, FinTech, portfolio theory and wealth management, financial networks, derivatives pricing models, the modeling of default risk, systemic risk, and venture capital. He has published over a hundred and thirty articles in academic journals, and has won numerous awards for research and teaching (CV: https://srdas.github.io/srdvita.pdf).  Sanjiv's scholarship may be accessed at https://srdas.github.io/research.htm




<p> <B>Sanjiv Das: A Short Academic Life History</B> <p>

After loafing and working in many parts of Asia, but never really
growing up, Sanjiv moved to New York to change the world, hopefully
through research.  He graduated in 1994 with a Ph.D. from NYU, and
since then spent five years in Boston, and now lives in San Jose,
California.  Sanjiv loves animals, places in the world where the
mountains meet the sea, riding sport motorbikes, reading, gadgets,
science fiction movies, and writing cool software code. When there is
time available from the excitement of daily life, Sanjiv writes
academic papers, which helps him relax. Always the contrarian, Sanjiv
thinks that New York City is the most calming place in the world,
after California of course.

<p>

Sanjiv is now a Professor of Finance at Santa Clara University. He came
to SCU from Harvard Business School and spent a year at UC Berkeley. In
his past life in the unreal world, Sanjiv worked at Citibank, N.A. in
the Asia-Pacific region. He takes great pleasure in merging his many
previous lives into his current existence, which is incredibly confused
and diverse.

<p>

Sanjiv's research style is instilled with a distinct "New York state of
mind" - it is chaotic, diverse, with minimal method to the madness. He
has published articles on derivatives, term-structure models, mutual
funds, the internet, portfolio choice, banking models, credit risk, and
has unpublished articles in many other areas. Some years ago, he took
time off to get another degree in computer science at Berkeley,
confirming that an unchecked hobby can quickly become an obsession.
There he learnt about the fascinating field of Randomized Algorithms,
skills he now applies earnestly to his editorial work, and other
pursuits, many of which stem from being in the epicenter of Silicon
Valley.

<p>

Coastal living did a lot to mold Sanjiv, who needs to live near the
ocean.  The many walks in Greenwich village convinced him that there is
no such thing as a representative investor, yet added many unique
features to his personal utility function. He learnt that it is
important to open the academic door to the ivory tower and let the world
in. Academia is a real challenge, given that he has to reconcile many
more opinions than ideas. He has been known to have turned down many
offers from Mad magazine to publish his academic work. As he often
explains, you never really finish your education - "you can check out
any time you like, but you can never leave." Which is why he is doomed
to a lifetime in Hotel California. And he believes that, if this is as
bad as it gets, life is really pretty good.
len(text)
4224
lines = text.splitlines()
print(len(lines))
print(lines[3])
68
        font-family:'Segoe UI', Tahoma, Geneva, Verdana, sans-serif

11.2. Use Beautiful Soup to clean up all the html stuff#

# Use BS to get tagged portions of the text
# !pip install beautifulsoup4
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'html.parser')
print(soup.prettify())
<html>
 <style>
  body {
        font-family:'Segoe UI', Tahoma, Geneva, Verdana, sans-serif
    }
 </style>
 <body background="http://algo.scu.edu/~sanjivdas/graphics/back2.gif">
  <h2>
   Bio
  </h2>
  Sanjiv Das is the William and Janice Terry Professor of Finance and Data Science at Santa Clara University's Leavey School of Business, and Amazon Scholar at AWS. He previously held faculty appointments at Harvard Business School and UC Berkeley. He has post-graduate degrees in Finance (M.Phil and Ph.D. from New York University), Computer Science (M.S. from UC Berkeley), an MBA from the Indian Institute of Management Ahmedabad (IIMA), B.Com in Accounting and Economics (University of Bombay, Sydenham College), and is also a qualified Cost and Works Accountant (AICWA). He is a senior founding editor of The Journal of Investment Management, is on the Advisory Board of the Journal of Financial Data Science, and holds editorial positions at other journals. Prior to being an academic, he worked in the financial derivatives business in the Asia-Pacific region as a Vice-President at Citibank. His current research interests include: AI and machine learning, FinTech, portfolio theory and wealth management, financial networks, derivatives pricing models, the modeling of default risk, systemic risk, and venture capital. He has published over a hundred and thirty articles in academic journals, and has won numerous awards for research and teaching (CV: https://srdas.github.io/srdvita.pdf).  Sanjiv's scholarship may be accessed at https://srdas.github.io/research.htm
  <p>
   <b>
    Sanjiv Das: A Short Academic Life History
   </b>
   <p>
    After loafing and working in many parts of Asia, but never really
growing up, Sanjiv moved to New York to change the world, hopefully
through research.  He graduated in 1994 with a Ph.D. from NYU, and
since then spent five years in Boston, and now lives in San Jose,
California.  Sanjiv loves animals, places in the world where the
mountains meet the sea, riding sport motorbikes, reading, gadgets,
science fiction movies, and writing cool software code. When there is
time available from the excitement of daily life, Sanjiv writes
academic papers, which helps him relax. Always the contrarian, Sanjiv
thinks that New York City is the most calming place in the world,
after California of course.
    <p>
     Sanjiv is now a Professor of Finance at Santa Clara University. He came
to SCU from Harvard Business School and spent a year at UC Berkeley. In
his past life in the unreal world, Sanjiv worked at Citibank, N.A. in
the Asia-Pacific region. He takes great pleasure in merging his many
previous lives into his current existence, which is incredibly confused
and diverse.
     <p>
      Sanjiv's research style is instilled with a distinct "New York state of
mind" - it is chaotic, diverse, with minimal method to the madness. He
has published articles on derivatives, term-structure models, mutual
funds, the internet, portfolio choice, banking models, credit risk, and
has unpublished articles in many other areas. Some years ago, he took
time off to get another degree in computer science at Berkeley,
confirming that an unchecked hobby can quickly become an obsession.
There he learnt about the fascinating field of Randomized Algorithms,
skills he now applies earnestly to his editorial work, and other
pursuits, many of which stem from being in the epicenter of Silicon
Valley.
      <p>
       Coastal living did a lot to mold Sanjiv, who needs to live near the
ocean.  The many walks in Greenwich village convinced him that there is
no such thing as a representative investor, yet added many unique
features to his personal utility function. He learnt that it is
important to open the academic door to the ivory tower and let the world
in. Academia is a real challenge, given that he has to reconcile many
more opinions than ideas. He has been known to have turned down many
offers from Mad magazine to publish his academic work. As he often
explains, you never really finish your education - "you can check out
any time you like, but you can never leave." Which is why he is doomed
to a lifetime in Hotel California. And he believes that, if this is as
bad as it gets, life is really pretty good.
      </p>
     </p>
    </p>
   </p>
  </p>
 </body>
</html>
print(soup.title)
print(len(soup.p))
print(type(soup.p))
print(soup.b)
print(soup.body)
None
4
<class 'bs4.element.Tag'>
<b>Sanjiv Das: A Short Academic Life History</b>
<body background="http://algo.scu.edu/~sanjivdas/graphics/back2.gif">
<h2>Bio</h2>


Sanjiv Das is the William and Janice Terry Professor of Finance and Data Science at Santa Clara University's Leavey School of Business, and Amazon Scholar at AWS. He previously held faculty appointments at Harvard Business School and UC Berkeley. He has post-graduate degrees in Finance (M.Phil and Ph.D. from New York University), Computer Science (M.S. from UC Berkeley), an MBA from the Indian Institute of Management Ahmedabad (IIMA), B.Com in Accounting and Economics (University of Bombay, Sydenham College), and is also a qualified Cost and Works Accountant (AICWA). He is a senior founding editor of The Journal of Investment Management, is on the Advisory Board of the Journal of Financial Data Science, and holds editorial positions at other journals. Prior to being an academic, he worked in the financial derivatives business in the Asia-Pacific region as a Vice-President at Citibank. His current research interests include: AI and machine learning, FinTech, portfolio theory and wealth management, financial networks, derivatives pricing models, the modeling of default risk, systemic risk, and venture capital. He has published over a hundred and thirty articles in academic journals, and has won numerous awards for research and teaching (CV: https://srdas.github.io/srdvita.pdf).  Sanjiv's scholarship may be accessed at https://srdas.github.io/research.htm




<p> <b>Sanjiv Das: A Short Academic Life History</b> <p>

After loafing and working in many parts of Asia, but never really
growing up, Sanjiv moved to New York to change the world, hopefully
through research.  He graduated in 1994 with a Ph.D. from NYU, and
since then spent five years in Boston, and now lives in San Jose,
California.  Sanjiv loves animals, places in the world where the
mountains meet the sea, riding sport motorbikes, reading, gadgets,
science fiction movies, and writing cool software code. When there is
time available from the excitement of daily life, Sanjiv writes
academic papers, which helps him relax. Always the contrarian, Sanjiv
thinks that New York City is the most calming place in the world,
after California of course.

<p>

Sanjiv is now a Professor of Finance at Santa Clara University. He came
to SCU from Harvard Business School and spent a year at UC Berkeley. In
his past life in the unreal world, Sanjiv worked at Citibank, N.A. in
the Asia-Pacific region. He takes great pleasure in merging his many
previous lives into his current existence, which is incredibly confused
and diverse.

<p>

Sanjiv's research style is instilled with a distinct "New York state of
mind" - it is chaotic, diverse, with minimal method to the madness. He
has published articles on derivatives, term-structure models, mutual
funds, the internet, portfolio choice, banking models, credit risk, and
has unpublished articles in many other areas. Some years ago, he took
time off to get another degree in computer science at Berkeley,
confirming that an unchecked hobby can quickly become an obsession.
There he learnt about the fascinating field of Randomized Algorithms,
skills he now applies earnestly to his editorial work, and other
pursuits, many of which stem from being in the epicenter of Silicon
Valley.

<p>

Coastal living did a lot to mold Sanjiv, who needs to live near the
ocean.  The many walks in Greenwich village convinced him that there is
no such thing as a representative investor, yet added many unique
features to his personal utility function. He learnt that it is
important to open the academic door to the ivory tower and let the world
in. Academia is a real challenge, given that he has to reconcile many
more opinions than ideas. He has been known to have turned down many
offers from Mad magazine to publish his academic work. As he often
explains, you never really finish your education - "you can check out
any time you like, but you can never leave." Which is why he is doomed
to a lifetime in Hotel California. And he believes that, if this is as
bad as it gets, life is really pretty good.
</p></p></p></p></p></body>
sanjivbio = BeautifulSoup(text,'lxml').get_text()
print(sanjivbio)
Bio


Sanjiv Das is the William and Janice Terry Professor of Finance and Data Science at Santa Clara University's Leavey School of Business, and Amazon Scholar at AWS. He previously held faculty appointments at Harvard Business School and UC Berkeley. He has post-graduate degrees in Finance (M.Phil and Ph.D. from New York University), Computer Science (M.S. from UC Berkeley), an MBA from the Indian Institute of Management Ahmedabad (IIMA), B.Com in Accounting and Economics (University of Bombay, Sydenham College), and is also a qualified Cost and Works Accountant (AICWA). He is a senior founding editor of The Journal of Investment Management, is on the Advisory Board of the Journal of Financial Data Science, and holds editorial positions at other journals. Prior to being an academic, he worked in the financial derivatives business in the Asia-Pacific region as a Vice-President at Citibank. His current research interests include: AI and machine learning, FinTech, portfolio theory and wealth management, financial networks, derivatives pricing models, the modeling of default risk, systemic risk, and venture capital. He has published over a hundred and thirty articles in academic journals, and has won numerous awards for research and teaching (CV: https://srdas.github.io/srdvita.pdf).  Sanjiv's scholarship may be accessed at https://srdas.github.io/research.htm




 Sanjiv Das: A Short Academic Life History 

After loafing and working in many parts of Asia, but never really
growing up, Sanjiv moved to New York to change the world, hopefully
through research.  He graduated in 1994 with a Ph.D. from NYU, and
since then spent five years in Boston, and now lives in San Jose,
California.  Sanjiv loves animals, places in the world where the
mountains meet the sea, riding sport motorbikes, reading, gadgets,
science fiction movies, and writing cool software code. When there is
time available from the excitement of daily life, Sanjiv writes
academic papers, which helps him relax. Always the contrarian, Sanjiv
thinks that New York City is the most calming place in the world,
after California of course.



Sanjiv is now a Professor of Finance at Santa Clara University. He came
to SCU from Harvard Business School and spent a year at UC Berkeley. In
his past life in the unreal world, Sanjiv worked at Citibank, N.A. in
the Asia-Pacific region. He takes great pleasure in merging his many
previous lives into his current existence, which is incredibly confused
and diverse.



Sanjiv's research style is instilled with a distinct "New York state of
mind" - it is chaotic, diverse, with minimal method to the madness. He
has published articles on derivatives, term-structure models, mutual
funds, the internet, portfolio choice, banking models, credit risk, and
has unpublished articles in many other areas. Some years ago, he took
time off to get another degree in computer science at Berkeley,
confirming that an unchecked hobby can quickly become an obsession.
There he learnt about the fascinating field of Randomized Algorithms,
skills he now applies earnestly to his editorial work, and other
pursuits, many of which stem from being in the epicenter of Silicon
Valley.



Coastal living did a lot to mold Sanjiv, who needs to live near the
ocean.  The many walks in Greenwich village convinced him that there is
no such thing as a representative investor, yet added many unique
features to his personal utility function. He learnt that it is
important to open the academic door to the ivory tower and let the world
in. Academia is a real challenge, given that he has to reconcile many
more opinions than ideas. He has been known to have turned down many
offers from Mad magazine to publish his academic work. As he often
explains, you never really finish your education - "you can check out
any time you like, but you can never leave." Which is why he is doomed
to a lifetime in Hotel California. And he believes that, if this is as
bad as it gets, life is really pretty good.
print(len(sanjivbio))
type(sanjivbio)
4009
str
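
For readers without Beautiful Soup installed, the same idea can be sketched with Python's standard library alone. This is only an illustration of what tag stripping involves, not how Beautiful Soup works internally; the class name `TextExtractor`, the helper `html_to_text`, and the sample HTML string are assumptions made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <style> and <script>."""

    SKIP = {"style", "script"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <STYLE> and <style> are treated alike
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Small made-up page in the style of the bio above
sample = ("<HTML><style>body {color: red}</style><BODY><h2>Bio</h2>"
          "<p>Sanjiv <B>Das</B></BODY></HTML>")
plain = html_to_text(sample)
print(plain)
```

In practice Beautiful Soup is the more robust choice, since it also repairs malformed markup such as the unclosed `<p>` tags seen above.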

11.3. Dictionaries#

Webster’s defines a “dictionary” as “…a reference source in print or electronic form containing words usually alphabetically arranged along with information about their forms, pronunciations, functions, etymologies, meanings, and syntactical and idiomatic uses.”

  1. The Harvard General Inquirer: http://www.mariapinto.es/ciberabstracts/Articulos/Inquirer.htm

  2. Standard Dictionaries: www.dictionary.com and www.merriam-webster.com.

  3. Computer dictionary: http://www.hyperdictionary.com/computer, which contains about 14,000 computer-related words, such as “byte” or “hyperlink”.

  4. Math dictionary, such as http://www.amathsdictionaryforkids.com/dictionary.html.

  5. Medical dictionary, see http://www.hyperdictionary.com/medical.

  6. Internet lingo dictionaries may be used to complement standard dictionaries with words that are not usually found in standard language, for example, see http://www.netlingo.com/dictionary/all.php for words such as “2BZ4UQT” which stands for “too busy for you cutey” (LOL). When extracting text messages, postings on Facebook, or stock message board discussions, internet lingo does need to be parsed and such a dictionary is very useful.

  7. Associative dictionaries are also useful when trying to find context, since a word may be related to a concept that can be identified using a dictionary such as http://www.visuwords.com/. This dictionary doubles as a thesaurus, as it provides alternative words and phrases that mean the same thing, as well as related concepts.

  8. Value dictionaries deal with values and may be useful when affect alone (positive or negative) is insufficient for scoring text. The Lasswell Value Dictionary http://www.wjh.harvard.edu/~inquirer/lasswell.htm may be used to score the loading of text on the eight basic value categories: Wealth, Power, Respect, Rectitude, Skill, Enlightenment, Affection, and Well-being.

11.4. Lexicons#

  • A lexicon is defined by Webster’s as “a book containing an alphabetical arrangement of the words in a language and their definitions; the vocabulary of a language, an individual speaker or group of speakers, or a subject; the total stock of morphemes in a language.” This suggests it is not that different from a dictionary.

  • A “morpheme” is defined as “a word or a part of a word that has a meaning and that contains no smaller part that has a meaning.”

  • In the text analytics realm, we will take a lexicon to be a smaller, special-purpose dictionary, containing words that are relevant to the domain of interest.

  • The benefit of a lexicon is that it enables focusing only on words that are relevant to the analytics and discards words that are not.

  • Another benefit is that since it is a smaller dictionary, the computational effort required by text analytics algorithms is drastically reduced.

11.5. Constructing a lexicon#

  • By hand. This is an effective technique and the simplest. It calls for a human reader who scans a representative sample of text documents and culls important words that lend interpretive meaning.

  • Examine the term-document matrix for the most frequent words, and pick those that have strong connotations for the classification task at hand.

  • Use pre-classified documents in a text corpus. We analyze the separate groups of documents to find words whose difference in frequency between groups is highest. Such words are likely to be better at discriminating between groups.
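
The third approach can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and relative word frequencies; the function names `relative_freq` and `discriminating_words` and the two tiny document groups are made up for this example and are not taken from any cited paper.

```python
from collections import Counter


def relative_freq(docs):
    """Relative frequency of each word across a list of documents."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def discriminating_words(group_a, group_b, top_n=5):
    """Rank words by the absolute difference in relative frequency."""
    fa, fb = relative_freq(group_a), relative_freq(group_b)
    vocab = set(fa) | set(fb)
    # Positive difference: more typical of group_a; negative: of group_b
    diffs = {w: fa.get(w, 0.0) - fb.get(w, 0.0) for w in vocab}
    # Sort by size of the difference, breaking ties alphabetically
    ranked = sorted(vocab, key=lambda w: (-abs(diffs[w]), w))
    return [(w, diffs[w]) for w in ranked[:top_n]]


# Hypothetical pre-classified message-board posts
bullish = ["strong buy on earnings", "buy the breakout", "strong growth ahead"]
bearish = ["sell before earnings", "weak guidance sell now", "weak demand ahead"]

top = discriminating_words(bullish, bearish, top_n=4)
for word, diff in top:
    print(f"{word:10s} {diff:+.3f}")
```

Words that appear equally often in both groups (such as "earnings" here) get a difference of zero and drop to the bottom of the ranking, which is the behavior one wants from a discriminating lexicon.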

11.6. Lexicons as Word Lists#

  • Das and Chen (2007) constructed a lexicon of about 375 words that are useful in parsing sentiment from stock message boards.

  • Loughran and McDonald (2011): Taking a sample of 50,115 firm-year 10-Ks from 1994 to 2008, they found that almost three-fourths of the words identified as negative by the Harvard Inquirer dictionary are not typically negative words in a financial context.

  • Therefore, they created separate word lists for each of the following attributes: negative, positive, uncertainty, litigious, strong modal, and weak modal. Modal words are based on Jordan’s categories of strong and weak modal words. These word lists may be downloaded from http://www3.nd.edu/~mcdonald/Word_Lists.html.

11.7. Negation Tagging#

  • Das and Chen (2007) introduced the notion of “negation tagging” into the literature. Negation tags create additional words in the word list using some rule. In this case, the rule used was to take any sentence, and if a negation word occurred, then tag all remaining positive words in the sentence as negative. For example, take a sentence - “This is not a good book.” Here the positive words after “not” are candidates for negation tagging. So, we would replace the sentence with “This is not a n__good book.”

  • Sometimes this can be more nuanced. Consider the sentence “There is nothing better than sliced bread.” Here the negation word “nothing” is used in conjunction with “better”, and the sentence remains positive, so it is an exception to the rule. Such exceptions may need to be coded into the rules for parsing textual content.

The Grammarly Handbook provides the following negation words (see https://www.grammarly.com/handbook/):

  1. Negative words: No, Not, None, No one, Nobody, Nothing, Neither, Nowhere, Never.

  2. Negative Adverbs: Hardly, Scarcely, Barely.

  3. Negative verbs: Doesn’t, Isn’t, Wasn’t, Shouldn’t, Wouldn’t, Couldn’t, Won’t, Can’t, Don’t.
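
The basic negation-tagging rule described above can be sketched as follows. This is a simplified illustration, not the original Das and Chen (2007) implementation: the function name `negation_tag`, the regular-expression tokenizer, and the tiny positive word list are assumptions made for this example, and the nuanced exceptions mentioned above are not handled.

```python
import re

# Negation words from the list above, lowercased and with straight apostrophes.
# The two-word entry "No one" is covered by "no".
NEGATION_WORDS = {
    "no", "not", "none", "nobody", "nothing", "neither", "nowhere", "never",
    "hardly", "scarcely", "barely",
    "doesn't", "isn't", "wasn't", "shouldn't", "wouldn't", "couldn't",
    "won't", "can't", "don't",
}


def negation_tag(sentence, positive_words, prefix="n__"):
    """Prefix every positive word that follows a negation word in a sentence."""
    # Lowercase and keep letters and straight apostrophes only;
    # curly apostrophes would need to be normalized first
    tokens = re.findall(r"[a-z']+", sentence.lower())
    tagged = []
    negated = False
    for tok in tokens:
        if negated and tok in positive_words:
            tagged.append(prefix + tok)
        else:
            tagged.append(tok)
        if tok in NEGATION_WORDS:
            negated = True  # stays on for the rest of the sentence
    return " ".join(tagged)


print(negation_tag("This is not a good book.", {"good", "great"}))
```

The tagged token `n__good` can then be added to the negative word list, so that a scorer counts it as negative rather than positive.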

11.8. Scoring Text#

  • Text can be scored using dictionaries and word lists. Here is an example of mood scoring. We use a psychological dictionary from Harvard. There is also WordNet.

  • WordNet is a large database of words in English, i.e., a lexicon. The repository is at http://wordnet.princeton.edu. WordNet groups words together based on their meanings (synonyms) and hence may be used as a thesaurus. WordNet is also useful for natural language processing as it provides word lists by language category, such as noun, verb, adjective, etc.

11.9. Read in a dictionary#

## Read in a file
## Here we will read in an entire dictionary from Harvard Inquirer

# Use a context manager so that the file is closed after reading
with open('NLP_data/inqdict.txt') as f:
    HIDict = f.read().splitlines()
HIDict[:20]
['Entryword Source Pos Neg Pstv Affil Ngtv Hostile Strng Power Weak Subm Actv Psv Pleasure Pain Arousal EMOT Feel Virtue Vice Ovrst Undrst Acad Doctr Econ* Exch ECON Exprs Legal Milit Polit* POLIT Relig Role COLL Work Ritual Intrel Race Kin* MALE Female Nonadlt HU ANI PLACE Social Region Route Aquatic Land Sky Object Tool Food Vehicle Bldgpt Natobj Bodypt Comnobj Comform COM Say Need Goal Try Means Ach Persist Complt Fail Natpro Begin Vary Change Incr Decr Finish Stay Rise Move Exert Fetch Travel Fall Think Know Causal Ought Percv Comp Eval EVAL Solve Abs* ABS Qual Quan NUMB ORD CARD FREQ DIST Time* TIME Space POS DIM Dimn Rel COLOR Self Our You Name Yes No Negate Intrj IAV DAV SV IPadj IndAdj POWGAIN POWLOSS POWENDS POWAREN POWCON POWCOOP POWAPT POWPT POWDOCT POWAUTH POWOTH POWTOT RCTETH RCTREL RCTGAIN RCTLOSS RCTENDS RCTTOT RSPGAIN RSPLOSS RSPOTH RSPTOT AFFGAIN AFFLOSS AFFPT AFFOTH AFFTOT WLTPT WLTTRAN WLTOTH WLTTOT WLBGAIN WLBLOSS WLBPHYS WLBPSYC WLBPT WLBTOT ENLGAIN ENLLOSS ENLENDS ENLPT ENLOTH ENLTOT SKLAS SKLPT SKLOTH SKLTOT TRNGAIN TRNLOSS TRANS MEANS ENDS ARENAS PARTIC NATIONS AUD ANOMIE NEGAFF POSAFF SURE IF NOT TIMESP FOOD FORM Othertags Definition ',
 'A H4Lvd DET ART  | article: Indefinite singular article--some or any one',
 'ABANDON H4Lvd Neg Ngtv Weak Fail IAV AFFLOSS AFFTOT SUPV  |',
 'ABANDONMENT H4 Neg Weak Fail Noun  |',
 'ABATE H4Lvd Neg Psv Decr IAV TRANS SUPV  |',
 'ABATEMENT Lvd Noun  ',
 'ABDICATE H4 Neg Weak Subm Psv Finish IAV SUPV  |',
 'ABHOR H4 Neg Hostile Psv Arousal SV SUPV  |',
 'ABIDE H4 Pos Affil Actv Doctr IAV SUPV  |',
 'ABIDE#1 Lvd Modif  ',
 'ABIDE#2 Lvd SUPV  ',
 'ABILITY Lvd MEANS Noun ABS ABS*  ',
 'ABJECT H4 Neg Weak Subm Psv Vice IPadj Modif  |',
 'ABLE H4Lvd Pos Pstv Strng Virtue EVAL MEANS Modif  | adjective: Having necessary power, skill, resources, etc.',
 'ABNORMAL H4Lvd Neg Ngtv Vice NEGAFF Modif  |',
 'ABOARD H4Lvd Space PREP LY  |',
 'ABOLISH H4Lvd Neg Ngtv Hostile Strng Power Actv Intrel IAV POWOTH POWTOT SUPV  |',
 'ABOLITION Lvd TRANS Noun  ',
 'ABOMINABLE H4 Neg Strng Vice Ovrst Eval IndAdj Modif  |',
 'ABORTIVE Lvd POWOTH POWTOT Modif POLIT  ']
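
Judging from the lines printed above, each entry appears to consist of an entry word, a source code, a run of category tags, and an optional definition after a `|`. Under that assumption, a line can be split into its word and an exact set of tags. The helper name `parse_inquirer_line` is hypothetical, and the sample lines are copied from the output above.

```python
def parse_inquirer_line(line):
    """Split one dictionary line into (entry word, set of category tags)."""
    # Assumption: anything after '|' is a free-text definition, not a tag
    fields = line.split("|")[0].split()
    entry = fields[0].split("#")[0].lower()   # 'ABIDE#1' -> 'abide'
    tags = set(fields[2:])                    # fields[1] is the source column
    return entry, tags


sample_lines = [
    'ABANDON H4Lvd Neg Ngtv Weak Fail IAV AFFLOSS AFFTOT SUPV  |',
    'ABIDE H4 Pos Affil Actv Doctr IAV SUPV  |',
    'ABIDE#1 Lvd Modif  ',
    'ABLE H4Lvd Pos Pstv Strng Virtue EVAL MEANS Modif  | adjective: '
    'Having necessary power, skill, resources, etc.',
]

# Merge the tags of all senses of a word into one set per word
tags_by_word = {}
for line in sample_lines:
    entry, tags = parse_inquirer_line(line)
    tags_by_word.setdefault(entry, set()).update(tags)

positive = sorted(w for w, t in tags_by_word.items() if "Pos" in t)
negative = sorted(w for w, t in tags_by_word.items() if "Neg" in t)
print(positive)
print(negative)
```

Because `"Pos" in t` here tests membership in a set of tags rather than a substring of the whole line, a tag such as `Negate` (listed in the header row) is not mistaken for `Neg`. Word counts obtained this way may therefore differ slightly from those of a substring search.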

11.10. Sentiment Score the Text using this Dictionary from Harvard Inquirer#

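# Drop the header row, then extract all the lines that contain the Pos tag
HIDict = HIDict[1:]
print(HIDict[:5])
print(len(HIDict))
# Note: "Pos" in j is a substring test on the whole line, not an exact tag match
poswords = [j for j in HIDict if "Pos" in j]  #using a list comprehension
poswords = [j.split()[0] for j in poswords]  # keep only the entry word
poswords = [j.split("#")[0] for j in poswords]  # strip sense suffixes such as ABIDE#1
poswords = unique(poswords)  # NumPy's unique (loaded by %pylab): sorted, de-duplicated
poswords = [j.lower() for j in poswords]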
print(poswords[:20])
print(len(poswords))
['A H4Lvd DET ART  | article: Indefinite singular article--some or any one', 'ABANDON H4Lvd Neg Ngtv Weak Fail IAV AFFLOSS AFFTOT SUPV  |', 'ABANDONMENT H4 Neg Weak Fail Noun  |', 'ABATE H4Lvd Neg Psv Decr IAV TRANS SUPV  |', 'ABATEMENT Lvd Noun  ']
11895
['abide', 'able', 'abound', 'absolve', 'absorbent', 'absorption', 'abundance', 'abundant', 'accede', 'accentuate', 'accept', 'acceptable', 'acceptance', 'accessible', 'accession', 'acclaim', 'acclamation', 'accolade', 'accommodate', 'accommodation']
1646
#Extract all the lines that contain the Neg tag
# Note: the substring test "Neg" in j would also match any line containing the Negate tag
negwords = [j for j in HIDict if "Neg" in j]  #using a list comprehension
negwords = [j.split()[0] for j in negwords]
negwords = [j.split("#")[0] for j in negwords]
negwords = unique(negwords)
negwords = [j.lower() for j in negwords]
print(negwords[:20])
print(len(negwords))
['abandon', 'abandonment', 'abate', 'abdicate', 'abhor', 'abject', 'abnormal', 'abolish', 'abominable', 'abrasive', 'abrupt', 'abscond', 'absence', 'absent', 'absent-minded', 'absentee', 'absurd', 'absurdity', 'abuse', 'abyss']
2120
#Pull clean lowercase version of bio as one long string
text = sanjivbio.replace('\n',' ').lower()
text
'   bio   sanjiv das is the william and janice terry professor of finance and data science at santa clara university\'s leavey school of business, and amazon scholar at aws. he previously held faculty appointments at harvard business school and uc berkeley. he has post-graduate degrees in finance (m.phil and ph.d. from new york university), computer science (m.s. from uc berkeley), an mba from the indian institute of management ahmedabad (iima), b.com in accounting and economics (university of bombay, sydenham college), and is also a qualified cost and works accountant (aicwa). he is a senior founding editor of the journal of investment management, is on the advisory board of the journal of financial data science, and holds editorial positions at other journals. prior to being an academic, he worked in the financial derivatives business in the asia-pacific region as a vice-president at citibank. his current research interests include: ai and machine learning, fintech, portfolio theory and wealth management, financial networks, derivatives pricing models, the modeling of default risk, systemic risk, and venture capital. he has published over a hundred and thirty articles in academic journals, and has won numerous awards for research and teaching (cv: https://srdas.github.io/srdvita.pdf).  sanjiv\'s scholarship may be accessed at https://srdas.github.io/research.htm      sanjiv das: a short academic life history   after loafing and working in many parts of asia, but never really growing up, sanjiv moved to new york to change the world, hopefully through research.  he graduated in 1994 with a ph.d. from nyu, and since then spent five years in boston, and now lives in san jose, california.  sanjiv loves animals, places in the world where the mountains meet the sea, riding sport motorbikes, reading, gadgets, science fiction movies, and writing cool software code. 
when there is time available from the excitement of daily life, sanjiv writes academic papers, which helps him relax. always the contrarian, sanjiv thinks that new york city is the most calming place in the world, after california of course.    sanjiv is now a professor of finance at santa clara university. he came to scu from harvard business school and spent a year at uc berkeley. in his past life in the unreal world, sanjiv worked at citibank, n.a. in the asia-pacific region. he takes great pleasure in merging his many previous lives into his current existence, which is incredibly confused and diverse.    sanjiv\'s research style is instilled with a distinct "new york state of mind" - it is chaotic, diverse, with minimal method to the madness. he has published articles on derivatives, term-structure models, mutual funds, the internet, portfolio choice, banking models, credit risk, and has unpublished articles in many other areas. some years ago, he took time off to get another degree in computer science at berkeley, confirming that an unchecked hobby can quickly become an obsession. there he learnt about the fascinating field of randomized algorithms, skills he now applies earnestly to his editorial work, and other pursuits, many of which stem from being in the epicenter of silicon valley.    coastal living did a lot to mold sanjiv, who needs to live near the ocean.  the many walks in greenwich village convinced him that there is no such thing as a representative investor, yet added many unique features to his personal utility function. he learnt that it is important to open the academic door to the ivory tower and let the world in. academia is a real challenge, given that he has to reconcile many more opinions than ideas. he has been known to have turned down many offers from mad magazine to publish his academic work. as he often explains, you never really finish your education - "you can check out any time you like, but you can never leave." 
which is why he is doomed to a lifetime in hotel california. and he believes that, if this is as bad as it gets, life is really pretty good. '
text = text.split(' ')
text = [j for j in text if len(j)>0]
print(len(text))
text[:25]
647
['bio',
 'sanjiv',
 'das',
 'is',
 'the',
 'william',
 'and',
 'janice',
 'terry',
 'professor',
 'of',
 'finance',
 'and',
 'data',
 'science',
 'at',
 'santa',
 'clara',
 "university's",
 'leavey',
 'school',
 'of',
 'business,',
 'and',
 'amazon']
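Note that splitting on spaces leaves punctuation attached to tokens (for example `'business,'` above), so such tokens cannot match a dictionary entry. As a minimal sketch of one possible alternative, the hypothetical helper `tokenize` below (not part of the original notebook) extracts word tokens with a regular expression. It assumes only alphabetic words matter for sentiment and therefore drops digits.

```python
import re

def tokenize(raw):
    """Lowercase the text and return alphabetic word tokens without surrounding punctuation."""
    # The optional group keeps an internal apostrophe, so "university's" stays one token
    return re.findall(r"[a-z]+(?:'[a-z]+)?", raw.lower())

sample = "Leavey School of Business, and Amazon Scholar at AWS."
print(tokenize(sample))
```

With this tokenizer the trailing comma is removed, so `business` becomes eligible to match a dictionary entry.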
# Match text against poswords and negwords using set intersection
# Warning: this returns only the unique matching words, not every occurrence
# posmatches = [j for j in text if j in poswords]  # use this instead to keep all occurrences
posmatches = set(text).intersection(set(poswords))
print(posmatches)
print(len(posmatches))
negmatches = set(text).intersection(set(negwords))
print(negmatches)
print(len(negmatches))
{'pretty', 'meet', 'open', 'board', 'live', 'education', 'his', 'credit', 'mutual', 'have', 'important', 'great', 'distinct', 'excitement', 'reconcile', 'your', 'real', 'unique', 'pleasure'}
19
{'cool', 'cost', 'no', 'unreal', 'unchecked', 'mad', 'get', 'default', 'short', 'never', 'board', 'let', 'bad'}
13
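The set intersection above reports each matched word once, and a word such as `'board'` appears in both the positive and the negative matches. The sketch below shows one way to count every occurrence and to list words that sit in both dictionaries. The helper name `count_matches` and the toy word lists are assumptions made for illustration; they are not the `poswords` and `negwords` used in this notebook.

```python
from collections import Counter

def count_matches(tokens, lexicon):
    """Count every occurrence of each token that appears in the lexicon."""
    lexicon = set(lexicon)  # set membership tests are fast
    return Counter(t for t in tokens if t in lexicon)

# Toy data for illustration only
toy_tokens = ["great", "great", "bad", "board", "open"]
toy_pos = ["great", "open", "board"]
toy_neg = ["bad", "board"]

pos_counts = count_matches(toy_tokens, toy_pos)
neg_counts = count_matches(toy_tokens, toy_neg)
print(sum(pos_counts.values()), sum(neg_counts.values()))  # 4 2

# Words listed in both dictionaries contribute to both counts
ambiguous = set(toy_pos) & set(toy_neg)
print(ambiguous)  # {'board'}
```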

11.11. General Function to Pull Financial Text and Score It#

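def finScore(url,poswords,negwords):
    """Fetch a web page, strip the HTML, and print the unique positive and negative word matches."""
    # requests and BeautifulSoup are assumed to have been imported in earlier cells
    f = requests.get(url)
    text = f.text
    f.close()
    # Remove HTML tags, flatten newlines, and convert to lowercase
    text = BeautifulSoup(text,'lxml').get_text()
    text = text.replace('\n',' ').lower()
    text = text.split(' ')
    # Unique words found in the positive dictionary, and their number
    posmatches = set(text).intersection(set(poswords))
    print(posmatches)
    print(len(posmatches))
    # Unique words found in the negative dictionary, and their number
    negmatches = set(text).intersection(set(negwords))
    print(negmatches)
    print(len(negmatches))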
#Try this on the same data as before
url = 'http://srdas.github.io/bio-candid.html'
finScore(url,poswords,negwords)
{'pretty', 'meet', 'open', 'board', 'live', 'education', 'his', 'credit', 'mutual', 'have', 'important', 'great', 'distinct', 'excitement', 'reconcile', 'your', 'real', 'unique', 'pleasure'}
19
{'cool', 'cost', 'no', 'unreal', 'unchecked', 'mad', 'get', 'default', 'short', 'never', 'board', 'let', 'bad'}
13
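The function prints its matches but does not reduce them to a single number. As a hedged sketch, one common convention is a net polarity score, (positive - negative) / (positive + negative), which lies between -1 and 1. The helper `polarity` below is an illustrative assumption and not a function defined in this notebook.

```python
def polarity(n_pos, n_neg):
    """Net polarity in [-1, 1]; returns 0.0 when there are no matches at all."""
    total = n_pos + n_neg
    return (n_pos - n_neg) / total if total else 0.0

# Unique-match counts printed above for the bio page: 19 positive, 13 negative
print(polarity(19, 13))  # 0.1875
```

Because the counts here are unique matches, the score ignores how often each word occurs; counts over all occurrences can be passed in the same way.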
# Let's get Apple's SEC 10-K filing
# https://www.sec.gov/edgar/searchedgar/companysearch.html
# https://www.sec.gov/edgar/searchedgar/cik.htm
url = "https://d18rn0p25nwr6d.cloudfront.net/CIK-0000320193/8b4913e8-22f8-4935-af3b-8b1492e528e1.html" # (2020)
# url = 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0000320193/71ac2994-85af-426b-982a-8fcc71d6fe52.html' #(2018)
#url = 'http://d18rn0p25nwr6d.cloudfront.net/CIK-0000320193/bc9269c5-539b-4a69-9054-abe7849c4242.html' #(2017)
finScore(url,poswords,negwords)
{'valid', 'aggregate', 'well', 'improvement', 'justice', 'consistent', 'adjust', 'protect', 'optional', 'appropriate', 'redemption', 'useful', 'company', 'fulfill', 'compliance', 'adjustment', 'good', 'commission', 'readily', 'content', 'arisen', 'basic', 'appeal', 'comprehensive', 'office', 'minister', 'return', 'responsibility', 'reconcile', 'offset', 'major', 'provide', 'health', 'capacity', 'board', 'resolve', 'satisfaction', 'significant', 'award', 'reasonable', 'hand', 'contribution', 'upgrade', 'value', 'particular', 'important', 'primarily', 'distinct', 'maturity', 'able', 'consider', 'rational', 'approach', 'satisfy', 'assurance', 'regard', 'validity', 'bonus', 'open', 'effective', 'principle', 'guarantee', 'agreement', 'back', 'renewal', 'outstanding', 'obtain', 'approval', 'principal', 'availability', 'share', 'practical', 'right', 'true', 'authority', 'promise', 'allowance', 'have', 'best', 'reconciliation', 'equity', 'hope', 'adequate', 'make', 'clarify', 'their', 'civil', 'voluntary', 'normal', 'effectiveness', 'knowledge', 'unique', 'objective', 'interest', 'even', 'protection', 'meet', 'call', 'essential', 'exact', 'home', 'fine', 'legal', 'compensation', 'qualify', 'free', 'consideration', 'partnership', 'resolved', 'premium', 'asset', 'profit', 'upward', 'better', 'common', 'kind', 'fair', 'credit', 'productive', 'improve', 'actual', 'relevant', 'forward', 'utilize', 'utilization', 'accordance', 'establish', 'benefit', 'acceptable', 'gain', 'appreciation', 'respect', 'pay', 'support', 'commencement', 'intellectual', 'aid', 'settle', 'contribute', 'security'}
140
{'tax', 'even', 'depreciation', 'intangible', 'unconditional', 'hedge', 'insignificant', 'intrusion', 'lower', 'lag', 'indirect', 'liquidation', 'fine', 'board', 'competition', 'decrease', 'withheld', 'excess', 'liability', 'capital', 'show', 'no', 'discount', 'point', 'violation', 'short', 'expedient', 'expose', 'foreign', 'expense', 'bankruptcy', 'loss', 'hand', 'cost', 'differ', 'unspecified', 'prohibitive', 'false', 'account', 'cannot', 'unfavorable', 'shortage', 'fail', 'particular', 'not', 'split', 'uncertainty', 'shell', 'unemployment', 'make', 'negative', 'violate', 'limit', 'disposal', 'least', 'service', 'against'}
57

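The results differ depending on the source from which the filing is retrieved. The same function applied to a filing URL on sec.gov returns far fewer matches: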

#Repeat with a different URL from the SEC
url = "https://www.sec.gov/Archives/edgar/data/320193/000032019319000119/a10-k20199282019.htm" # 2019
# url = 'https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/a10-k20189292018.htm' # 2018
#url = 'https://www.sec.gov/Archives/edgar/data/320193/000032019317000070/a10-k20179302017.htm' # 2017
finScore(url,poswords,negwords)
{'interest', 'contact', 'protection', 'well', 'open', 'thank', 'privacy', 'please', 'company', 'ensure', 'right', 'commission', 'equitable', 'content', 'allow', 'best', 'offer', 'make', 'acceptable', 'efficient', 'our', 'support', 'your', 'security'}
24
{'deny', 'make', 'no', 'regardless', 'fraud', 'excessive', 'prosecution', 'block', 'limit', 'service', 'not', 'abuse'}
12
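The matches for the sec.gov URL include words such as `'please'`, `'thank'`, `'privacy'`, `'block'`, and `'abuse'`, which are unusual for a 10-K. This may indicate that the request returned an access notice rather than the filing, since the SEC asks automated clients to identify themselves in the `User-Agent` header. The sketch below builds such a header. The helper name `sec_headers` and the sample name and email are hypothetical, and the exact header format expected by the SEC is an assumption that should be checked against its current guidance.

```python
def sec_headers(name, email):
    """Build request headers that identify the client by name and contact email."""
    return {
        "User-Agent": f"{name} {email}",      # assumed format: name followed by contact email
        "Accept-Encoding": "gzip, deflate",
    }

headers = sec_headers("Jane Doe", "jane@example.com")  # hypothetical identity
print(headers["User-Agent"])
# Possible usage with the requests library (not executed here):
# f = requests.get(url, headers=headers)
```

It is also worth printing the first few hundred characters of the returned text to confirm that the page is the filing before scoring it.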

11.12. Financial Tabular Data#

You will likely need to complement your financial text data with tabular data. Some useful links for obtaining and handling this data are here:

finScore('https://www.scu.edu/is/', poswords, negwords)
{'contact', 'well', 'community', 'home', 'outstanding', 'responsible', 'main', 'content', 'collaboration', 'accomplish', 'reliable', 'office', 'their', 'help', 'our', 'support', 'order', 'enhance', 'security'}
19
{'emergency', 'help', 'service', 'order', 'study'}
5