Text Analytics (an Introduction)

Sanjiv R. Das

Reading references

News Analysis

In finance, text has become a major source of trading information, leading to a new field known as News Metrics.

News analysis is defined as “the measurement of the various qualitative and quantitative attributes of textual news stories. Some of these attributes are: sentiment, relevance, and novelty. Expressing news stories as numbers permits the manipulation of everyday information in a mathematical and statistical way.” (Wikipedia). In this chapter, I provide a framework for text analytics techniques that are in widespread use. I will discuss various text analytic methods and software, and then provide a set of metrics that may be used to assess the performance of analytics. Various directions for this field are discussed through the exposition. The techniques herein can aid in the valuation and trading of securities, facilitate investment decision making, meet regulatory requirements, provide marketing insights, or manage risk.

News Analytics

See: https://www.amazon.com/Handbook-News-Analytics-Finance/dp/047066679X/ref=sr_1_1?ie=UTF8&qid=1466897817&sr=8-1&keywords=handbook+of+news+analytics

“News analytics are used in financial modeling, particularly in quantitative and algorithmic trading. Further, news analytics can be used to plot and characterize firm behaviors over time and thus yield important strategic insights about rival firms. News analytics are usually derived through automated text analysis and applied to digital texts using elements from natural language processing and machine learning such as latent semantic analysis, support vector machines, ‘bag of words’, among other techniques.” (Wikipedia)

Text as Data

There are many reasons why text has business value. But this is a narrow view. Textual data provides a means of understanding all human behavior through a data-driven, analytical approach. Let’s enumerate some reasons for this.

In a talk at the 17th ACM Conference on Information and Knowledge Management (CIKM ’08), Google’s director of research Peter Norvig stated his unequivocal preference for data over algorithms: “data is more agile than code.” Yet it is well understood that too much data can lead to overfitting, so that an algorithm becomes mostly useless out-of-sample.

Chris Anderson: “Data is the New Theory.”

Definition: Text-Mining

Algorithm Complexity

The Response to News

Das, Martinez-Jerez, and Tufano (FM 2005)

Breakdown of news flow

Frequency of posting

Weekly posting

Intraday posting

Number of characters per posting

Examples: Basic Text Handling

But this returns words with commas and periods included, which is not desired. So what we need is the regular expressions package, i.e., re.
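A minimal sketch of the problem and the fix, assuming a made-up sample sentence (the text used in the original example is not shown here). Splitting on whitespace leaves punctuation attached to words, while a regular expression can pull out the words alone:

```python
import re

# Hypothetical sample sentence, not taken from the original notes.
text = "Stocks rallied today, but bonds fell. Traders were surprised."

# Plain whitespace splitting keeps commas and periods attached to words.
raw_words = text.split()

# re.findall with \w+ keeps only runs of letters, digits, and underscores.
clean_words = re.findall(r"\w+", text)

print(raw_words)
print(clean_words)
```

The pattern `\w+` is only one possible choice; a stricter pattern such as `[A-Za-z]+` would also drop digits and underscores.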

Using List Comprehensions to find specific words
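As a small illustration, assuming a hypothetical word list, a list comprehension can filter words by any condition that can be written as an expression:

```python
# Hypothetical word list for illustration only.
words = ["profit", "loss", "gain", "profitable", "risk", "gains"]

# Keep only the words that start with a chosen prefix.
profit_words = [w for w in words if w.startswith("profit")]

# Keep only the words longer than four characters.
long_words = [w for w in words if len(w) > 4]

print(profit_words)
print(long_words)
```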

Or, use regular expressions to help us with more complex parsing.

For example, '@[A-Za-z0-9_]+' will return all words that start with '@' followed by one or more letters, digits, or underscores.
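A short sketch of this pattern in use, assuming a made-up message string. The pattern picks out handles such as those found in social media posts:

```python
import re

# Hypothetical message text for illustration only.
message = "Thanks @alice_01 and @Bob for the tip on tech stocks!"

# Match '@' followed by one or more letters, digits, or underscores.
handles = re.findall(r"@[A-Za-z0-9_]+", message)

print(handles)
```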

String operations

Read in a URL

Use Beautiful Soup to clean up all the html stuff
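Beautiful Soup is the tool named here. The sketch below is only a stand-in for the same idea: it uses the standard library's `html.parser` so that it runs without extra installs, and it works on a hypothetical inline HTML string rather than a page fetched from a URL. The class name `TextExtractor` is an illustrative assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


# Hypothetical page source; in practice this would be the HTML read from a URL.
html = (
    "<html><head><title>News</title><style>p {color: red;}</style></head>"
    "<body><h1>Markets</h1><p>Stocks <b>rose</b> today.</p></body></html>"
)

parser = TextExtractor()
parser.feed(html)
parser.close()
page_text = " ".join(parser.parts)

print(page_text)
```

With Beautiful Soup installed, parsing the page and calling `get_text()` on the result plays the same role in far fewer lines.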


Webster’s defines a “dictionary” as “…a reference source in print or electronic form containing words usually alphabetically arranged along with information about their forms, pronunciations, functions, etymologies, meanings, and syntactical and idiomatic uses.”

  1. The Harvard General Inquirer: http://www.wjh.harvard.edu/~inquirer/
  2. Standard Dictionaries: www.dictionary.com, and www.merriam-webster.com.
  3. Computer dictionary: http://www.hyperdictionary.com/computer that contains about 14,000 computer related words, such as “byte” or “hyperlink”.
  4. Math dictionary, such as http://www.amathsdictionaryforkids.com/dictionary.html.
  5. Medical dictionary, see http://www.hyperdictionary.com/medical.
  6. Internet lingo dictionaries may be used to complement standard dictionaries with words that are not usually found in standard language, for example, see http://www.netlingo.com/dictionary/all.php for words such as “2BZ4UQT” which stands for “too busy for you cutey” (LOL). When extracting text messages, postings on Facebook, or stock message board discussions, internet lingo does need to be parsed and such a dictionary is very useful.
  7. Associative dictionaries are also useful when trying to find context, as the word may be related to a concept, identified using a dictionary such as http://www.visuwords.com/. This dictionary doubles up as a thesaurus, as it provides alternative words and phrases that mean the same thing, and also related concepts.
  8. Value dictionaries deal with values and may be useful when affect alone (positive or negative) is insufficient for scoring text. The Lasswell Value Dictionary http://www.wjh.harvard.edu/~inquirer/lasswell.htm may be used to score the loading of text on the eight basic value categories: Wealth, Power, Respect, Rectitude, Skill, Enlightenment, Affection, and Well-being.


Constructing a lexicon

Lexicons as Word Lists

Negation Tagging

The Grammarly Handbook provides the following negation words (see https://www.grammarly.com/handbook/):

  1. Negative words: No, Not, None, No one, Nobody, Nothing, Neither, Nowhere, Never.
  2. Negative Adverbs: Hardly, Scarcely, Barely.
  3. Negative verbs: Doesn’t, Isn’t, Wasn’t, Shouldn’t, Wouldn’t, Couldn’t, Won’t, Can’t, Don’t.
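One common way to use such a list is to mark every word that follows a negation cue, up to the next punctuation mark, so that a later scoring step can treat it differently. The sketch below is an assumption about how this might be done, not necessarily the method used in these notes. The `NOT_` prefix and the function name `tag_negation` are illustrative choices, and the code expects straight apostrophes in contractions.

```python
import re

# Negation cues from the lists above, lowercased ("no one" is covered by "no").
NEGATIONS = {
    "no", "not", "none", "nobody", "nothing", "neither", "nowhere", "never",
    "hardly", "scarcely", "barely",
    "doesn't", "isn't", "wasn't", "shouldn't", "wouldn't", "couldn't",
    "won't", "can't", "don't",
}
PUNCTUATION = set(".,;:!?")


def tag_negation(text):
    """Prefix NOT_ to each word after a negation cue, until punctuation."""
    # Tokens are words (apostrophes allowed) or single punctuation marks.
    tokens = re.findall(r"[A-Za-z']+|[.,;:!?]", text)
    tagged = []
    negating = False
    for tok in tokens:
        if tok in PUNCTUATION:
            negating = False          # punctuation ends the negation scope
            tagged.append(tok)
        elif tok.lower() in NEGATIONS:
            negating = True           # start tagging the words that follow
            tagged.append(tok)
        elif negating:
            tagged.append("NOT_" + tok)
        else:
            tagged.append(tok)
    return tagged


print(tag_negation("This is not a good stock, but earnings were strong."))
```

Ending the scope at the next punctuation mark is a simplification; a sentence parser would give more accurate scopes.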

Scoring Text

Read in a dictionary

Sentiment Score the Text using this Dictionary from Harvard Inquirer
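A minimal sketch of dictionary-based scoring, assuming two tiny made-up word lists in place of the positive and negative categories of the Harvard Inquirer (the real dictionary would be read from a file). The function name `sentiment_score` and the normalization by matched words are illustrative assumptions.

```python
import re

# Tiny hypothetical word lists standing in for a real sentiment dictionary.
POSITIVE = {"gain", "good", "strong", "profit", "improve"}
NEGATIVE = {"loss", "bad", "weak", "risk", "decline"}


def sentiment_score(text):
    """Return (positive count, negative count, net score) for the text."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    # Net score scaled by the number of matched words; 0.0 when none match.
    total = pos + neg
    net = (pos - neg) / total if total else 0.0
    return pos, neg, net


print(sentiment_score("Strong profit this quarter, but the risk of a decline is bad."))
```

The net score lies between -1 and 1, which makes texts of different lengths easier to compare than raw counts.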