Text Analytics (an Introduction)¶

Sanjiv R. Das

News Analysis¶

In Finance, for example, text has become a major source of trading information, leading to a new field known as News Metrics.

News analysis is defined as “the measurement of the various qualitative and quantitative attributes of textual news stories. Some of these attributes are: sentiment, relevance, and novelty. Expressing news stories as numbers permits the manipulation of everyday information in a mathematical and statistical way.” (Wikipedia). In this chapter, I provide a framework for text analytics techniques that are in widespread use. I will discuss various text analytic methods and software, and then provide a set of metrics that may be used to assess the performance of analytics. Various directions for this field are discussed through the exposition. The techniques herein can aid in the valuation and trading of securities, facilitate investment decision making, meet regulatory requirements, provide marketing insights, or manage risk.

News Analytics¶

“News analytics are used in financial modeling, particularly in quantitative and algorithmic trading. Further, news analytics can be used to plot and characterize firm behaviors over time and thus yield important strategic insights about rival firms. News analytics are usually derived through automated text analysis and applied to digital texts using elements from natural language processing and machine learning, such as latent semantic analysis, support vector machines, and ‘bag of words’, among other techniques.” (Wikipedia)

Text as Data¶

There are many reasons why text has business value, but that is a narrow view. Textual data provides a means of understanding all human behavior through a data-driven, analytical approach. Let’s enumerate some reasons for this.

• Big Text: there is more textual data than numerical data.
• Text is versatile. Nuances and behavioral expressions are not conveyed with numbers, so analyzing text allows us to explore these aspects of human interaction.
• Text contains emotive content. This has led to the ubiquity of “Sentiment analysis”. See for example: Admati-Pfleiderer 2001; DeMarzo et al 2003; Antweiler-Frank 2004, 2005; Das-Chen 2007; Tetlock 2007; Tetlock et al 2008; Mitra et al 2008; Leinweber-Sisk 2010.
• Text contains opinions and connections. See: Das et al 2005; Das and Sisk 2005; Godes et al 2005; Li 2006; Hochberg et al 2007.
• Numbers aggregate; text disaggregates. Text allows us to drill down into underlying behavior when understanding human interaction.

In a talk at the 17th ACM Conference on Information Knowledge and Management (CIKM ’08), Google’s director of research Peter Norvig stated his unequivocal preference for data over algorithms—“data is more agile than code.” Yet, it is well-understood that too much data can lead to overfitting so that an algorithm becomes mostly useless out-of-sample.

Chris Anderson: “Data is the New Theory.”

Definition: Text-Mining¶

• Text mining is the large-scale, automated processing of plain text language in digital form to extract data that is converted into useful quantitative or qualitative information.
• Text mining is automated on big data that is not amenable to human processing within reasonable time frames. It entails extracting data that is converted into information of many types.
• Simple: Text mining may be as simple as keyword searches and counts.
• Complicated: It may require language parsing and complex rules for information extraction.
• It may involve structured text, such as the information in forms and some kinds of web pages.
• Applying it to unstructured text is a much harder endeavor.
• Text mining is also aimed at unearthing unseen relationships in unstructured text, as in meta-analyses of research papers; see Van Noorden 2012.

The Response to News¶

Das, Martinez-Jerez, and Tufano (FM 2005)

Examples: Basic Text Handling¶

Splitting text on whitespace returns words with commas and periods included, which is not desired. So what we need is the regular expressions package, i.e., re.
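As a minimal sketch of the point above (the sample sentence is made up), compare naive whitespace splitting with a simple re pattern:

```python
import re

# A sample sentence; note the punctuation attached to words.
text = "Text analytics, as a field, has grown rapidly."

# Naive splitting keeps punctuation attached to the words.
naive = text.split()
# → ['Text', 'analytics,', 'as', 'a', 'field,', 'has', 'grown', 'rapidly.']

# re.findall with a word pattern extracts clean word tokens.
words = re.findall(r"[A-Za-z]+", text)
# → ['Text', 'analytics', 'as', 'a', 'field', 'has', 'grown', 'rapidly']
```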

Using List Comprehensions to find specific words¶

Or, use regular expressions to help us with more complex parsing.

For example '@[A-Za-z0-9_]+' will return all words that:

• start with '@' and are followed by at least one:
• capital letter ('A-Z')
• lowercase letter ('a-z')
• number ('0-9')
• or underscore ('_')
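Both approaches from this section can be sketched on a made-up tweet (the variable names are mine):

```python
import re

tweet = "Thanks @data_guru and @NLP4all for the pointers! email me @ home"

# List comprehension: keep tokens that start with '@' and have more text.
handles = [w for w in tweet.split() if w.startswith("@") and len(w) > 1]
# → ['@data_guru', '@NLP4all']

# Regular expression: the pattern above does the same, and also works
# when a mention is glued to punctuation. Note that a bare '@' (as in
# "@ home") is not matched, since at least one character must follow.
mentions = re.findall(r"@[A-Za-z0-9_]+", tweet)
# → ['@data_guru', '@NLP4all']
```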

Dictionaries¶

Webster’s defines a “dictionary” as “…a reference source in print or electronic form containing words usually alphabetically arranged along with information about their forms, pronunciations, functions, etymologies, meanings, and syntactical and idiomatic uses.”

1. The Harvard General Inquirer: http://www.wjh.harvard.edu/~inquirer/
2. Standard Dictionaries: www.dictionary.com, and www.merriam-webster.com.
3. Computer dictionary: http://www.hyperdictionary.com/computer that contains about 14,000 computer related words, such as “byte” or “hyperlink”.
4. Math dictionary, such as http://www.amathsdictionaryforkids.com/dictionary.html.
5. Medical dictionary, see http://www.hyperdictionary.com/medical.
6. Internet lingo dictionaries may be used to complement standard dictionaries with words that are not usually found in standard language, for example, see http://www.netlingo.com/dictionary/all.php for words such as “2BZ4UQT” which stands for “too busy for you cutey” (LOL). When extracting text messages, postings on Facebook, or stock message board discussions, internet lingo does need to be parsed and such a dictionary is very useful.
7. Associative dictionaries are also useful when trying to find context, as the word may be related to a concept, identified using a dictionary such as http://www.visuwords.com/. This dictionary doubles up as a thesaurus, as it provides alternative words and phrases that mean the same thing, and also related concepts.
8. Value dictionaries deal with values and may be useful when only affect (positive or negative) is insufficient for scoring text. The Lasswell Value Dictionary http://www.wjh.harvard.edu/~inquirer/lasswell.htm may be used to score the loading of text on the eight basic value categories: Wealth, Power, Respect, Rectitude, Skill, Enlightenment, Affection, and Well-Being.

Lexicons¶

• A lexicon is defined by Webster’s as “a book containing an alphabetical arrangement of the words in a language and their definitions; the vocabulary of a language, an individual speaker or group of speakers, or a subject; the total stock of morphemes in a language.” This suggests it is not that different from a dictionary.
• A “morpheme” is defined as “a word or a part of a word that has a meaning and that contains no smaller part that has a meaning.”
• In the text analytics realm, we will take a lexicon to be a smaller, special purpose dictionary, containing words that are relevant to the domain of interest.
• The benefit of a lexicon is that it enables focusing only on words that are relevant to the analytics and discards words that are not.
• Another benefit is that since it is a smaller dictionary, the computational effort required by text analytics algorithms is drastically reduced.

Constructing a lexicon¶

• By hand. This is an effective technique and the simplest. It calls for a human reader who scans a representative sample of text documents and culls important words that lend interpretive meaning.
• Examine the term document matrix for most frequent words, and pick the ones that have high connotation for the classification task at hand.
• Use pre-classified documents in a text corpus. We analyze the separate groups of documents to find words whose difference in frequency between groups is highest. Such words are likely to be better in discriminating between groups.
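The third approach can be sketched in a few lines. The mini-corpora and the rel_freq helper below are illustrative assumptions, not part of any standard library; real corpora would contain many documents per class:

```python
from collections import Counter

# Hypothetical pre-classified mini-corpora of financial phrases.
pos_docs = ["strong earnings growth", "profit beats estimates", "strong buy"]
neg_docs = ["weak earnings miss", "loss widens", "weak outlook"]

def rel_freq(docs):
    """Relative frequency of each word across a group of documents."""
    counts = Counter(w for d in docs for w in d.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

fp, fn = rel_freq(pos_docs), rel_freq(neg_docs)
vocab = set(fp) | set(fn)

# Rank words by the absolute difference in frequency between groups;
# the top words are the best candidates for the lexicon.
ranked = sorted(vocab, key=lambda w: abs(fp.get(w, 0) - fn.get(w, 0)),
                reverse=True)
```

Here the words “weak” and “strong” rank highest, since each is frequent in one group and absent from the other, while a word like “earnings” appears in both groups and discriminates poorly.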

Lexicons as Word Lists¶

• Das and Chen (2007) constructed a lexicon of about 375 words that are useful in parsing sentiment from stock message boards.
• Loughran and McDonald (2011): Taking a sample of 50,115 firm-year 10-Ks from 1994 to 2008, they found that almost three-fourths of the words identified as negative by the Harvard Inquirer dictionary are not typically negative words in a financial context.
• Therefore, they specifically created separate lists of words by the following attributes of words: negative, positive, uncertainty, litigious, strong modal, and weak modal. Modal words are based on Jordan’s categories of strong and weak modal words. These word lists may be downloaded from http://www3.nd.edu/~mcdonald/Word_Lists.html.

Negation Tagging¶

• Das and Chen (2007) introduced the notion of “negation tagging” into the literature. Negation tags create additional words in the word list using some rule. In this case, the rule used was to take any sentence and, if a negation word occurred, tag all remaining positive words in the sentence as negative. For example, take the sentence “This is not a good book.” Here the positive words after “not” are candidates for negation tagging, so we would replace the sentence with “This is not a n__good book.”
• Sometimes this can be more nuanced. For example, in the sentence “There is nothing better than sliced bread,” the negation word “nothing” is used in conjunction with “better,” so it is an exception to the rule. Such exceptions may need to be coded into the rules for parsing textual content.

The Grammarly Handbook provides the following negation words (see https://www.grammarly.com/handbook/):

1. Negative words: No, Not, None, No one, Nobody, Nothing, Neither, Nowhere, Never.
2. Negative Adverbs: Hardly, Scarcely, Barely.
3. Negative verbs: Doesn’t, Isn’t, Wasn’t, Shouldn’t, Wouldn’t, Couldn’t, Won’t, Can’t, Don’t.
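The negation-tagging rule described above can be sketched as follows. The NEGATORS set is a subset of the word lists just given, the POSITIVE set is a made-up stand-in for a real positive word list, and the sketch lowercases the text for simplicity:

```python
# Subset of the negation words listed above.
NEGATORS = {"no", "not", "none", "nothing", "neither", "nowhere", "never",
            "hardly", "scarcely", "barely"}
# Hypothetical positive word list.
POSITIVE = {"good", "great", "better"}

def negation_tag(sentence):
    """Prefix positive words occurring after a negation word with 'n__'."""
    out, negated = [], False
    for w in sentence.lower().split():
        if w in NEGATORS:
            negated = True
        elif negated and w in POSITIVE:
            w = "n__" + w
        out.append(w)
    return " ".join(out)

negation_tag("This is not a good book")
# → 'this is not a n__good book'
```

Handling contracted negative verbs such as “isn’t” or “can’t”, and exceptions like “nothing better”, would require additional rules.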

Scoring Text¶

• Text can be scored using dictionaries and word lists. Here is an example of mood scoring. We use a psychological dictionary from Harvard. There is also WordNet.

• WordNet is a large database of words in English, i.e., a lexicon. The repository is at http://wordnet.princeton.edu. WordNet groups words together based on their meanings (synonyms) and hence may be used as a thesaurus. WordNet is also useful for natural language processing as it provides word lists by language category, such as noun, verb, adjective, etc.
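As a minimal sketch of mood scoring with word lists: the tiny POSITIVE and NEGATIVE sets below are illustrative stand-ins for a real psychological dictionary such as the Harvard General Inquirer, and mood_score is a hypothetical helper:

```python
import re

# Stand-in word lists; in practice these would come from the
# Harvard General Inquirer or a similar dictionary.
POSITIVE = {"good", "great", "happy", "gain"}
NEGATIVE = {"bad", "sad", "loss", "poor"}

def mood_score(text):
    """Net sentiment: (#positive - #negative) / #words."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / max(len(words), 1)

mood_score("The good news led to a great gain, not a loss.")
# → 0.1818… (3 positive, 1 negative, 11 words)
```

The score lies in [-1, 1], with positive values indicating an optimistic mood. Negation tagging, as discussed earlier, can be layered on so that a phrase like “not a good book” does not count as positive.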