Sanjiv R. Das
Reading references
%pylab inline
import pandas as pd
import os
from ipypublish import nb_setup
%load_ext rpy2.ipython
Populating the interactive namespace from numpy and matplotlib
#Load if needed on Windows
# !curl -O "https://raw.githubusercontent.com/vitorcurtis/RWinOut/master/RWinOut.py"
# %load_ext RWinOut
In Finance, for example, text has become a major source of trading information, leading to a new field known as News Metrics.
News analysis is defined as “the measurement of the various qualitative and quantitative attributes of textual news stories. Some of these attributes are: sentiment, relevance, and novelty. Expressing news stories as numbers permits the manipulation of everyday information in a mathematical and statistical way.” (Wikipedia). In this chapter, I provide a framework for text analytics techniques that are in widespread use. I will discuss various text analytic methods and software, and then provide a set of metrics that may be used to assess the performance of analytics. Various directions for this field are discussed throughout the exposition. The techniques herein can aid in the valuation and trading of securities, facilitate investment decision making, meet regulatory requirements, provide marketing insights, or manage risk.
“News analytics are used in financial modeling, particularly in quantitative and algorithmic trading. Further, news analytics can be used to plot and characterize firm behaviors over time and thus yield important strategic insights about rival firms. News analytics are usually derived through automated text analysis and applied to digital texts using elements from natural language processing and machine learning such as latent semantic analysis, support vector machines, ‘bag of words’, among other techniques.” (Wikipedia)
There are many reasons why text has business value. But business value alone is a narrow view: textual data provides a means of understanding all human behavior through a data-driven, analytical approach. Let’s enumerate some of these reasons.
In a talk at the 17th ACM Conference on Information Knowledge and Management (CIKM ’08), Google’s director of research Peter Norvig stated his unequivocal preference for data over algorithms—“data is more agile than code.” Yet, it is well-understood that too much data can lead to overfitting so that an algorithm becomes mostly useless out-of-sample.
Chris Anderson: “Data is the New Theory.”
nb_setup.images_hconcat(["DSTMAA_images/algo_complexity.jpg"], width=400)
Das, Martinez-Jerez, and Tufano (FM 2005)
nb_setup.images_hconcat(["DSTMAA_images/news_cycle.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/breakdown_newsflow.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/freq_postings.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/weekly_posting.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/intraday_posting.png"], width=600)
nb_setup.images_hconcat(["DSTMAA_images/characters_posting.png"], width=600)
text = "We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America."
#How many characters including blanks?
len(text)
327
#Tokenize the words by splitting on spaces only
x = text.split(" ")
print(x)
['We', 'the', 'People', 'of', 'the', 'United', 'States,', 'in', 'Order', 'to', 'form', 'a', 'more', 'perfect', 'Union,', 'establish', 'Justice,', 'insure', 'domestic', 'Tranquility,', 'provide', 'for', 'the', 'common', 'defence,', 'promote', 'the', 'general', 'Welfare,', 'and', 'secure', 'the', 'Blessings', 'of', 'Liberty', 'to', 'ourselves', 'and', 'our', 'Posterity,', 'do', 'ordain', 'and', 'establish', 'this', 'Constitution', 'for', 'the', 'United', 'States', 'of', 'America.']
#How many words?
len(x)
52
But this returns words with commas and periods included, which is not desired. So what we need is the regular expressions package, i.e., re.
import re
x = re.split('[ ,.]',text)
print(x)
['We', 'the', 'People', 'of', 'the', 'United', 'States', '', 'in', 'Order', 'to', 'form', 'a', 'more', 'perfect', 'Union', '', 'establish', 'Justice', '', 'insure', 'domestic', 'Tranquility', '', 'provide', 'for', 'the', 'common', 'defence', '', 'promote', 'the', 'general', 'Welfare', '', 'and', 'secure', 'the', 'Blessings', 'of', 'Liberty', 'to', 'ourselves', 'and', 'our', 'Posterity', '', 'do', 'ordain', 'and', 'establish', 'this', 'Constitution', 'for', 'the', 'United', 'States', 'of', 'America', '']
#Use a list comprehension to remove spaces
x = [j for j in x if len(j)>0]
print(x)
['We', 'the', 'People', 'of', 'the', 'United', 'States', 'in', 'Order', 'to', 'form', 'a', 'more', 'perfect', 'Union', 'establish', 'Justice', 'insure', 'domestic', 'Tranquility', 'provide', 'for', 'the', 'common', 'defence', 'promote', 'the', 'general', 'Welfare', 'and', 'secure', 'the', 'Blessings', 'of', 'Liberty', 'to', 'ourselves', 'and', 'our', 'Posterity', 'do', 'ordain', 'and', 'establish', 'this', 'Constitution', 'for', 'the', 'United', 'States', 'of', 'America']
len(x)
52
#Unique words
y = [j.lower() for j in x]
z = unique(y)  #numpy's unique, in the namespace via %pylab
print(z)
['a' 'america' 'and' 'blessings' 'common' 'constitution' 'defence' 'do' 'domestic' 'establish' 'for' 'form' 'general' 'in' 'insure' 'justice' 'liberty' 'more' 'of' 'ordain' 'order' 'our' 'ourselves' 'people' 'perfect' 'posterity' 'promote' 'provide' 'secure' 'states' 'the' 'this' 'to' 'tranquility' 'union' 'united' 'we' 'welfare']
len(z)
38
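Beyond counting unique words, token frequencies are often the next step after tokenization. As a sketch, Python's built-in collections.Counter can tally the tokens produced above (the text here is a truncated stand-in for the full preamble):

```python
import re
from collections import Counter

# A truncated stand-in for the preamble text used above
text = ("We the People of the United States, in Order to form a more "
        "perfect Union, establish Justice, insure domestic Tranquility")

# Tokenize as before: split on spaces, commas, periods; drop empties; lowercase
tokens = [t.lower() for t in re.split('[ ,.]', text) if t]

# Counter tallies token frequencies in one pass
freq = Counter(tokens)
print(freq.most_common(3))   # 'the' appears twice; all other tokens once
```

Counter is a dict subclass, so freq['the'] returns a count directly, and most_common(n) gives the n highest-frequency tokens.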
#Find words greater than 3 characters
[j for j in x if len(j)>3]
['People', 'United', 'States', 'Order', 'form', 'more', 'perfect', 'Union', 'establish', 'Justice', 'insure', 'domestic', 'Tranquility', 'provide', 'common', 'defence', 'promote', 'general', 'Welfare', 'secure', 'Blessings', 'Liberty', 'ourselves', 'Posterity', 'ordain', 'establish', 'this', 'Constitution', 'United', 'States', 'America']
#Find capitalized words
[j for j in x if j.istitle()]
['We', 'People', 'United', 'States', 'Order', 'Union', 'Justice', 'Tranquility', 'Welfare', 'Blessings', 'Liberty', 'Posterity', 'Constitution', 'United', 'States', 'America']
#Find words that begin with c
[j for j in x if j.startswith('c')]
['common']
#Find words that end in t
[j for j in x if j.endswith('t')]
['perfect']
#Find words that contain a
[j for j in x if "a" in set(j.lower())]
['States', 'a', 'establish', 'Tranquility', 'general', 'Welfare', 'and', 'and', 'ordain', 'and', 'establish', 'States', 'America']
Or, use regular expressions to help us with more complex parsing. For example, the pattern '@[A-Za-z0-9_]+' will return all tokens that:
- start with '@'
- and are followed by at least one of: 'A-Z', 'a-z', '0-9', '_'
#Find words that contain 'a' using RE
[j for j in x if re.search('[Aa]',j)]
['States', 'a', 'establish', 'Tranquility', 'general', 'Welfare', 'and', 'and', 'ordain', 'and', 'establish', 'States', 'America']
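The handle pattern described above can be applied with re.findall, which returns every non-overlapping match. A minimal sketch, using a made-up tweet for illustration:

```python
import re

# A made-up tweet for illustration
tweet = "Thanks @srdas and @data_sci101 for the #textmining pointers!"

# '@' followed by one or more letters, digits, or underscores
handles = re.findall('@[A-Za-z0-9_]+', tweet)
print(handles)   # ['@srdas', '@data_sci101']
```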
#Test type of tokens
print(x)
[j for j in x if j.islower()]
['We', 'the', 'People', 'of', 'the', 'United', 'States', 'in', 'Order', 'to', 'form', 'a', 'more', 'perfect', 'Union', 'establish', 'Justice', 'insure', 'domestic', 'Tranquility', 'provide', 'for', 'the', 'common', 'defence', 'promote', 'the', 'general', 'Welfare', 'and', 'secure', 'the', 'Blessings', 'of', 'Liberty', 'to', 'ourselves', 'and', 'our', 'Posterity', 'do', 'ordain', 'and', 'establish', 'this', 'Constitution', 'for', 'the', 'United', 'States', 'of', 'America']
['the', 'of', 'the', 'in', 'to', 'form', 'a', 'more', 'perfect', 'establish', 'insure', 'domestic', 'provide', 'for', 'the', 'common', 'defence', 'promote', 'the', 'general', 'and', 'secure', 'the', 'of', 'to', 'ourselves', 'and', 'our', 'do', 'ordain', 'and', 'establish', 'this', 'for', 'the', 'of']
print(x)
[j for j in x if j.isdigit()]
['We', 'the', 'People', 'of', 'the', 'United', 'States', 'in', 'Order', 'to', 'form', 'a', 'more', 'perfect', 'Union', 'establish', 'Justice', 'insure', 'domestic', 'Tranquility', 'provide', 'for', 'the', 'common', 'defence', 'promote', 'the', 'general', 'Welfare', 'and', 'secure', 'the', 'Blessings', 'of', 'Liberty', 'to', 'ourselves', 'and', 'our', 'Posterity', 'do', 'ordain', 'and', 'establish', 'this', 'Constitution', 'for', 'the', 'United', 'States', 'of', 'America']
[]
[j for j in x if j.isalnum()]
['We', 'the', 'People', 'of', 'the', 'United', 'States', 'in', 'Order', 'to', 'form', 'a', 'more', 'perfect', 'Union', 'establish', 'Justice', 'insure', 'domestic', 'Tranquility', 'provide', 'for', 'the', 'common', 'defence', 'promote', 'the', 'general', 'Welfare', 'and', 'secure', 'the', 'Blessings', 'of', 'Liberty', 'to', 'ourselves', 'and', 'our', 'Posterity', 'do', 'ordain', 'and', 'establish', 'this', 'Constitution', 'for', 'the', 'United', 'States', 'of', 'America']
y = ' To be or not to be. '
print(y.strip())
print(y.rstrip())
print(y.lstrip())
print(y.lower())
print(y.upper())
To be or not to be.
 To be or not to be.
To be or not to be. 
 to be or not to be. 
 TO BE OR NOT TO BE. 
#Return the starting position of the string
print(y.find('be'))
print(y.rfind('be'))
4
17
print(y.replace('be','do'))
To do or not to do.
y = 'Supercalifragilisticexpialidocious'
ytok = y.split('i')
print(ytok)
['Supercal', 'frag', 'l', 'st', 'cexp', 'al', 'doc', 'ous']
print('i'.join(ytok))
print(list(y))
Supercalifragilisticexpialidocious
['S', 'u', 'p', 'e', 'r', 'c', 'a', 'l', 'i', 'f', 'r', 'a', 'g', 'i', 'l', 'i', 's', 't', 'i', 'c', 'e', 'x', 'p', 'i', 'a', 'l', 'i', 'd', 'o', 'c', 'i', 'o', 'u', 's']
## Reading in a URL
import requests
url = 'http://srdas.github.io/bio-candid.html'
f = requests.get(url)
text = f.text
print(text)
f.close()
<HTML> <BODY background="http://algo.scu.edu/~sanjivdas/graphics/back2.gif"> Sanjiv Das is the William and Janice Terry Professor of Finance and Data Science at Santa Clara University's Leavey School of Business. He previously held faculty appointments as Professor at Harvard Business School and UC Berkeley. He holds post-graduate degrees in Finance (M.Phil and Ph.D. from New York University), Computer Science (M.S. from UC Berkeley), an MBA from the Indian Institute of Management, Ahmedabad, B.Com in Accounting and Economics (University of Bombay, Sydenham College), and is also a qualified Cost and Works Accountant (AICWA). He is a senior editor of The Journal of Investment Management, Associate Editor of Management Science and other academic journals, and is on the Advisory Board of the Journal of Financial Data Science. Prior to being an academic, he worked in the derivatives business in the Asia-Pacific region as a Vice-President at Citibank. His current research interests include: portfolio theory and wealth management ,machine learning, financial networks, derivatives pricing models, the modeling of default risk, systemic risk, and venture capital. He has published over a hundred articles in academic journals, and has won numerous awards for research and teaching. His recent book "Derivatives: Principles and Practice" was published in May 2010 (second edition 2016). <p> <B>Sanjiv Das: A Short Academic Life History</B> <p> After loafing and working in many parts of Asia, but never really growing up, Sanjiv moved to New York to change the world, hopefully through research. He graduated in 1994 with a Ph.D. from NYU, and since then spent five years in Boston, and now lives in San Jose, California. Sanjiv loves animals, places in the world where the mountains meet the sea, riding sport motorbikes, reading, gadgets, science fiction movies, and writing cool software code. 
When there is time available from the excitement of daily life, Sanjiv writes academic papers, which helps him relax. Always the contrarian, Sanjiv thinks that New York City is the most calming place in the world, after California of course. <p> Sanjiv is now a Professor of Finance at Santa Clara University. He came to SCU from Harvard Business School and spent a year at UC Berkeley. In his past life in the unreal world, Sanjiv worked at Citibank, N.A. in the Asia-Pacific region. He takes great pleasure in merging his many previous lives into his current existence, which is incredibly confused and diverse. <p> Sanjiv's research style is instilled with a distinct "New York state of mind" - it is chaotic, diverse, with minimal method to the madness. He has published articles on derivatives, term-structure models, mutual funds, the internet, portfolio choice, banking models, credit risk, and has unpublished articles in many other areas. Some years ago, he took time off to get another degree in computer science at Berkeley, confirming that an unchecked hobby can quickly become an obsession. There he learnt about the fascinating field of Randomized Algorithms, skills he now applies earnestly to his editorial work, and other pursuits, many of which stem from being in the epicenter of Silicon Valley. <p> Coastal living did a lot to mold Sanjiv, who needs to live near the ocean. The many walks in Greenwich village convinced him that there is no such thing as a representative investor, yet added many unique features to his personal utility function. He learnt that it is important to open the academic door to the ivory tower and let the world in. Academia is a real challenge, given that he has to reconcile many more opinions than ideas. He has been known to have turned down many offers from Mad magazine to publish his academic work. As he often explains, you never really finish your education - "you can check out any time you like, but you can never leave." 
Which is why he is doomed to a lifetime in Hotel California. And he believes that, if this is as bad as it gets, life is really pretty good.
len(text)
4044
lines = text.splitlines()
print(len(lines))
print(lines[3])
75
Sanjiv Das is the William and Janice Terry Professor of Finance and
from bs4 import BeautifulSoup
sanjivbio = BeautifulSoup(text,'lxml').get_text()
print(sanjivbio)
Sanjiv Das is the William and Janice Terry Professor of Finance and Data Science at Santa Clara University's Leavey School of Business. He previously held faculty appointments as Professor at Harvard Business School and UC Berkeley. He holds post-graduate degrees in Finance (M.Phil and Ph.D. from New York University), Computer Science (M.S. from UC Berkeley), an MBA from the Indian Institute of Management, Ahmedabad, B.Com in Accounting and Economics (University of Bombay, Sydenham College), and is also a qualified Cost and Works Accountant (AICWA). He is a senior editor of The Journal of Investment Management, Associate Editor of Management Science and other academic journals, and is on the Advisory Board of the Journal of Financial Data Science. Prior to being an academic, he worked in the derivatives business in the Asia-Pacific region as a Vice-President at Citibank. His current research interests include: portfolio theory and wealth management ,machine learning, financial networks, derivatives pricing models, the modeling of default risk, systemic risk, and venture capital. He has published over a hundred articles in academic journals, and has won numerous awards for research and teaching. His recent book "Derivatives: Principles and Practice" was published in May 2010 (second edition 2016). Sanjiv Das: A Short Academic Life History After loafing and working in many parts of Asia, but never really growing up, Sanjiv moved to New York to change the world, hopefully through research. He graduated in 1994 with a Ph.D. from NYU, and since then spent five years in Boston, and now lives in San Jose, California. Sanjiv loves animals, places in the world where the mountains meet the sea, riding sport motorbikes, reading, gadgets, science fiction movies, and writing cool software code. When there is time available from the excitement of daily life, Sanjiv writes academic papers, which helps him relax. 
Always the contrarian, Sanjiv thinks that New York City is the most calming place in the world, after California of course. Sanjiv is now a Professor of Finance at Santa Clara University. He came to SCU from Harvard Business School and spent a year at UC Berkeley. In his past life in the unreal world, Sanjiv worked at Citibank, N.A. in the Asia-Pacific region. He takes great pleasure in merging his many previous lives into his current existence, which is incredibly confused and diverse. Sanjiv's research style is instilled with a distinct "New York state of mind" - it is chaotic, diverse, with minimal method to the madness. He has published articles on derivatives, term-structure models, mutual funds, the internet, portfolio choice, banking models, credit risk, and has unpublished articles in many other areas. Some years ago, he took time off to get another degree in computer science at Berkeley, confirming that an unchecked hobby can quickly become an obsession. There he learnt about the fascinating field of Randomized Algorithms, skills he now applies earnestly to his editorial work, and other pursuits, many of which stem from being in the epicenter of Silicon Valley. Coastal living did a lot to mold Sanjiv, who needs to live near the ocean. The many walks in Greenwich village convinced him that there is no such thing as a representative investor, yet added many unique features to his personal utility function. He learnt that it is important to open the academic door to the ivory tower and let the world in. Academia is a real challenge, given that he has to reconcile many more opinions than ideas. He has been known to have turned down many offers from Mad magazine to publish his academic work. As he often explains, you never really finish your education - "you can check out any time you like, but you can never leave." Which is why he is doomed to a lifetime in Hotel California. And he believes that, if this is as bad as it gets, life is really pretty good.
print(len(sanjivbio))
type(sanjivbio)
3947
str
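Beyond extracting clean text, BeautifulSoup can also pull out structured elements such as hyperlinks. A minimal sketch using Python's built-in html.parser backend ('lxml', used above, also works if installed); the HTML snippet here is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet for illustration
html = ('<html><body><p>See <a href="http://srdas.github.io">home</a> '
        'and <a href="http://wordnet.princeton.edu">WordNet</a>.</p></body></html>')

# Parse with the stdlib backend, then collect the href of every anchor tag
soup = BeautifulSoup(html, 'html.parser')
links = [a['href'] for a in soup.find_all('a')]
print(links)   # ['http://srdas.github.io', 'http://wordnet.princeton.edu']
```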
Webster’s defines a “dictionary” as “…a reference source in print or electronic form containing words usually alphabetically arranged along with information about their forms, pronunciations, functions, etymologies, meanings, and syntactical and idiomatic uses.”
The Grammarly Handbook provides the following negation words (see https://www.grammarly.com/handbook/):
Text can be scored using dictionaries and word lists. Below is an example of mood scoring, using a psychological dictionary from Harvard (the Harvard Inquirer). Another useful lexical resource is WordNet.
WordNet is a large database of words in English, i.e., a lexicon. The repository is at http://wordnet.princeton.edu. WordNet groups words together based on their meanings (synonyms) and hence may be used as a thesaurus. WordNet is also useful for natural language processing as it provides word lists by language category, such as noun, verb, adjective, etc.
## Read in a file
## Here we will read in an entire dictionary from Harvard Inquirer
f = open('DSTMAA_data/inqdict.txt')
HIDict = f.read()
f.close()
HIDict = HIDict.splitlines()
HIDict[:20]
['Entryword Source Pos Neg Pstv Affil Ngtv Hostile Strng Power Weak Subm Actv Psv Pleasure Pain Arousal EMOT Feel Virtue Vice Ovrst Undrst Acad Doctr Econ* Exch ECON Exprs Legal Milit Polit* POLIT Relig Role COLL Work Ritual Intrel Race Kin* MALE Female Nonadlt HU ANI PLACE Social Region Route Aquatic Land Sky Object Tool Food Vehicle Bldgpt Natobj Bodypt Comnobj Comform COM Say Need Goal Try Means Ach Persist Complt Fail Natpro Begin Vary Change Incr Decr Finish Stay Rise Move Exert Fetch Travel Fall Think Know Causal Ought Percv Comp Eval EVAL Solve Abs* ABS Qual Quan NUMB ORD CARD FREQ DIST Time* TIME Space POS DIM Dimn Rel COLOR Self Our You Name Yes No Negate Intrj IAV DAV SV IPadj IndAdj POWGAIN POWLOSS POWENDS POWAREN POWCON POWCOOP POWAPT POWPT POWDOCT POWAUTH POWOTH POWTOT RCTETH RCTREL RCTGAIN RCTLOSS RCTENDS RCTTOT RSPGAIN RSPLOSS RSPOTH RSPTOT AFFGAIN AFFLOSS AFFPT AFFOTH AFFTOT WLTPT WLTTRAN WLTOTH WLTTOT WLBGAIN WLBLOSS WLBPHYS WLBPSYC WLBPT WLBTOT ENLGAIN ENLLOSS ENLENDS ENLPT ENLOTH ENLTOT SKLAS SKLPT SKLOTH SKLTOT TRNGAIN TRNLOSS TRANS MEANS ENDS ARENAS PARTIC NATIONS AUD ANOMIE NEGAFF POSAFF SURE IF NOT TIMESP FOOD FORM Othertags Definition ', 'A H4Lvd DET ART | article: Indefinite singular article--some or any one', 'ABANDON H4Lvd Neg Ngtv Weak Fail IAV AFFLOSS AFFTOT SUPV |', 'ABANDONMENT H4 Neg Weak Fail Noun |', 'ABATE H4Lvd Neg Psv Decr IAV TRANS SUPV |', 'ABATEMENT Lvd Noun ', 'ABDICATE H4 Neg Weak Subm Psv Finish IAV SUPV |', 'ABHOR H4 Neg Hostile Psv Arousal SV SUPV |', 'ABIDE H4 Pos Affil Actv Doctr IAV SUPV |', 'ABIDE#1 Lvd Modif ', 'ABIDE#2 Lvd SUPV ', 'ABILITY Lvd MEANS Noun ABS ABS* ', 'ABJECT H4 Neg Weak Subm Psv Vice IPadj Modif |', 'ABLE H4Lvd Pos Pstv Strng Virtue EVAL MEANS Modif | adjective: Having necessary power, skill, resources, etc.', 'ABNORMAL H4Lvd Neg Ngtv Vice NEGAFF Modif |', 'ABOARD H4Lvd Space PREP LY |', 'ABOLISH H4Lvd Neg Ngtv Hostile Strng Power Actv Intrel IAV POWOTH POWTOT SUPV |', 'ABOLITION Lvd TRANS Noun ', 
'ABOMINABLE H4 Neg Strng Vice Ovrst Eval IndAdj Modif |', 'ABORTIVE Lvd POWOTH POWTOT Modif POLIT ']
#Extract all the lines that contain the Pos tag
HIDict = HIDict[1:]
print(HIDict[:5])
print(len(HIDict))
poswords = [j for j in HIDict if "Pos" in j] #using a list comprehension
poswords = [j.split()[0] for j in poswords]
poswords = [j.split("#")[0] for j in poswords]
poswords = unique(poswords)
poswords = [j.lower() for j in poswords]
print(poswords[:20])
print(len(poswords))
['A H4Lvd DET ART | article: Indefinite singular article--some or any one', 'ABANDON H4Lvd Neg Ngtv Weak Fail IAV AFFLOSS AFFTOT SUPV |', 'ABANDONMENT H4 Neg Weak Fail Noun |', 'ABATE H4Lvd Neg Psv Decr IAV TRANS SUPV |', 'ABATEMENT Lvd Noun ']
11895
['abide', 'able', 'abound', 'absolve', 'absorbent', 'absorption', 'abundance', 'abundant', 'accede', 'accentuate', 'accept', 'acceptable', 'acceptance', 'accessible', 'accession', 'acclaim', 'acclamation', 'accolade', 'accommodate', 'accommodation']
1646
#Extract all the lines that contain the Neg tag
negwords = [j for j in HIDict if "Neg" in j] #using a list comprehension
negwords = [j.split()[0] for j in negwords]
negwords = [j.split("#")[0] for j in negwords]
negwords = unique(negwords)
negwords = [j.lower() for j in negwords]
print(negwords[:20])
print(len(negwords))
['abandon', 'abandonment', 'abate', 'abdicate', 'abhor', 'abject', 'abnormal', 'abolish', 'abominable', 'abrasive', 'abrupt', 'abscond', 'absence', 'absent', 'absent-minded', 'absentee', 'absurd', 'absurdity', 'abuse', 'abyss']
2120
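With positive and negative word lists in hand, a document can be mood-scored by counting matches of each type. A minimal sketch, using tiny made-up word lists in place of the full poswords and negwords lists built above:

```python
import re

# Tiny illustrative word lists; in practice, use the poswords and
# negwords lists extracted from the Harvard Inquirer above
poswords = {'good', 'great', 'abundant', 'able'}
negwords = {'bad', 'abandon', 'fail', 'absurd'}

def mood_score(doc):
    """Net sentiment: (#positive - #negative) / #tokens."""
    tokens = [t.lower() for t in re.split('[ ,.]', doc) if t]
    npos = sum(t in poswords for t in tokens)
    nneg = sum(t in negwords for t in tokens)
    return (npos - nneg) / max(len(tokens), 1)

print(mood_score("The results were good, not bad, and the team was able."))
```

Here the score is (2 - 1)/11, roughly 0.09, i.e., mildly positive. Normalizing by document length makes scores comparable across documents of different sizes.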