2. Introduction to Text Analytics

Sanjiv R. Das

Reading references

%%capture
#INCLUDING SCIENTIFIC AND NUMERICAL COMPUTING LIBRARIES
#Run this code to make sure that you have all the libraries at one go.
%pylab inline
import os
import pandas as pd
from IPython.display import Image
%load_ext rpy2.ipython
# Basic lines of code needed to import a data file with permissions from Google Drive
from google.colab import drive
# drive.mount("/content/drive", force_remount=True)
drive.mount('/content/drive')
os.chdir("drive/My Drive/Books_Writings/NLPBook/")
Mounted at /content/drive

Question: Why has Covid made text analysis in finance more relevant?

2.1. News Analysis for Finance

In finance, text has become a major source of trading information, leading to a new field known as News Metrics.

News analysis is defined as “the measurement of the various qualitative and quantitative attributes of textual news stories. Some of these attributes are: sentiment, relevance, and novelty. Expressing news stories as numbers permits the manipulation of everyday information in a mathematical and statistical way.” (Wikipedia).

In this class, we will study frameworks for text analytics techniques that are in widespread use. We will discuss various text analytic methods and software, and then provide a set of metrics that may be used to assess the performance of analytics. Various directions for this field are discussed through the exposition. The techniques herein can aid in the valuation and trading of securities, facilitate investment decision making, meet regulatory requirements, provide marketing insights, or manage risk.

2.2. News Analytics

See: https://www.amazon.com/Handbook-News-Analytics-Finance/dp/047066679X/ref=sr_1_1?ie=UTF8&qid=1466897817&sr=8-1&keywords=handbook+of+news+analytics

“News analytics are used in financial modeling, particularly in quantitative and algorithmic trading. Further, news analytics can be used to plot and characterize firm behaviors over time and thus yield important strategic insights about rival firms. News analytics are usually derived through automated text analysis and applied to digital texts using elements from natural language processing and machine learning such as latent semantic analysis, support vector machines, ‘bag of words’, among other techniques.” (Wikipedia)
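As a rough illustration of the "bag of words" idea mentioned in the quote, the sketch below turns two short texts into a document-term matrix of word counts using only the Python standard library. The headlines and the `tokenize` helper are hypothetical, made up for this example; in practice a library vectorizer would more likely be used.

```python
from collections import Counter
import re

# Hypothetical headlines, invented purely for illustration.
headlines = [
    "Bank profits rise as rates rise",
    "Bank shares fall on weak profits",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# One word-count dictionary per document; word order is discarded.
counts = [Counter(tokenize(h)) for h in headlines]

# Shared, sorted vocabulary across all documents.
vocab = sorted(set().union(*counts))

# Document-term matrix: one row per headline, one column per vocabulary word.
dtm = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in dtm:
    print(row)
```

Each headline becomes a vector of counts over the shared vocabulary, which is the sense in which news stories are "expressed as numbers".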

2.3. Text as Data

There are many reasons why text has business value. But this is a narrow view. Textual data provides a means of understanding all human behavior through a data-driven, analytical approach. Let’s enumerate some reasons for this.

  • Big Text: there is more textual data than numerical data.

  • Text is versatile. Nuances and behavioral expressions are not conveyed with numbers, so analyzing text allows us to explore these aspects of human interaction.

  • Text contains emotive content. This has led to the ubiquity of “Sentiment analysis”. See for example: Admati-Pfleiderer 2001; DeMarzo et al 2003; Antweiler-Frank 2004, 2005; Das-Chen 2007; Tetlock 2007; Tetlock et al 2008; Mitra et al 2008; Leinweber-Sisk 2010.

  • Text contains opinions and connections. See: Das et al 2005; Das and Sisk 2005; Godes et al 2005; Li 2006; Hochberg et al 2007.

  • Numbers aggregate; text disaggregates. Text allows us to drill down into underlying behavior when understanding human interaction.

  • Text is forward looking, much more than tabular data. Recent structural shifts in the economy (pandemics, trade wars, etc.) have made unstructured text data relatively more important than structured tabular data.
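To make the point about emotive content concrete, here is a minimal sketch of lexicon-based sentiment scoring. The word lists, the example posts, and the `sentiment_score` function are all assumptions made for illustration; they are not taken from the papers cited above, which use more careful dictionaries and classifiers.

```python
import re

# Hypothetical mini-lexicon, for illustration only; real work would use
# an established dictionary or a trained model.
POSITIVE = {"gain", "gains", "beat", "strong", "upgrade"}
NEGATIVE = {"loss", "losses", "miss", "weak", "downgrade"}

def sentiment_score(text):
    """Return (pos - neg) / (pos + neg), or 0.0 if no lexicon word appears."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return (pos - neg) / total if total else 0.0

# Hypothetical message-board posts.
posts = [
    "Strong quarter, earnings beat estimates",
    "Analyst downgrade after weak sales and a big loss",
    "Company schedules annual meeting",
]
scores = [sentiment_score(p) for p in posts]
print(scores)
```

The score lies between -1 and +1, so posts can be averaged by day or by ticker like any other numerical series.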

2.4. Chris Anderson: “Data is the New Theory.”

In a talk at the 17th ACM Conference on Information and Knowledge Management (CIKM ’08), Google’s director of research Peter Norvig stated his unequivocal preference for data over algorithms: “data is more agile than code.” Yet it is well understood that relying too heavily on the data at hand can lead to overfitting, so that an algorithm becomes mostly useless out-of-sample.

2.5. Definition: Text-Mining

  • Text mining is the large-scale, automated processing of plain text language in digital form to extract data that is converted into useful quantitative or qualitative information.

  • Text mining is automated on big data that is not amenable to human processing within reasonable time frames. It entails extracting data that is converted into information of many types.

  • Simple: Text mining may be as simple as keyword searches and counts.

  • Complicated: It may require language parsing and complex rules for information extraction.

  • It may involve structured text, such as the information in forms and some kinds of web pages.

  • It may also be applied to unstructured text, which is a much harder endeavor.

  • Text mining is also aimed at unearthing unseen relationships in unstructured text, as in meta-analyses of research papers; see Van Noorden 2012.
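The simple and the more complicated ends of this range can be sketched in a few lines. The example below first counts keywords and then applies rule-based extraction with regular expressions. The sample text, the keyword list, and the patterns are hypothetical, chosen only to illustrate the idea.

```python
import re
from collections import Counter

# Hypothetical snippet of news text, invented for illustration.
text = (
    "Acme Corp reported revenue of $12.5 million on 2021-03-15. "
    "Revenue growth slowed, and Acme warned of merger risk. "
    "The merger is expected to close by 2021-09-30."
)

# Simple text mining: count occurrences of chosen keywords.
keywords = ["revenue", "merger", "risk"]
tokens = re.findall(r"[a-z]+", text.lower())
token_counts = Counter(tokens)
keyword_counts = {k: token_counts[k] for k in keywords}

# Slightly more complex: rule-based extraction with regular expressions.
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)
dollar_amounts = re.findall(r"\$\d+(?:\.\d+)?(?: million| billion)?", text)

print(keyword_counts)
print(dates)
print(dollar_amounts)
```

Even here the extraction rules are tied to one date format and one way of writing amounts, which hints at why unstructured text is the harder case.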

Image("NLP_images/algo_complexity.jpg", width=500)

2.6. The Response to News

Das, Martinez-Jerez, and Tufano (FM 2005)

Image("NLP_images/news_cycle.png", width=600)

2.7. Breakdown of news flow

Image("NLP_images/breakdown_newsflow.png", width=600)

2.8. Frequency of posting

Image("NLP_images/freq_postings.png", width=600)

2.9. Weekly posting

Image("NLP_images/weekly_posting.png", width=600)

2.10. Intraday posting

Image("NLP_images/intraday_posting.png", width=600)

2.11. Number of characters per posting

Image("NLP_images/characters_posting.png", width=600)

As we move forward in our NLP journey, here is a very nice graphic designed by Fabio Chiusano; see his blog post, linked below the figure.

Image("NLP_images/NLP_33Tasks.png", width=600)

https://medium.com/nlplanet/two-minutes-nlp-33-important-nlp-tasks-explained-31e2caad2b1b

2.12. NLP = NLU + NLG

  • Natural language understanding is the discriminative aspect of NLP, for example, text classification.

  • Natural language generation is the generative aspect of NLP, for example, sentence completion.

  • These have been scaled using large language models (LLMs).
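The distinction can be illustrated with a toy example far removed from LLM scale. In the sketch below, a word-vote classifier stands in for NLU (text in, label out) and a bigram next-word model stands in for NLG (prompt in, text out). The corpus, labels, and the `classify` and `complete` functions are hypothetical and meant only to show the two directions.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus with labels, invented for illustration.
corpus = [
    ("stocks rally on strong earnings", "good"),
    ("stocks rally on rate cut", "good"),
    ("stocks fall on weak earnings", "bad"),
]

# NLU (discriminative): count how often each word appears under each label.
word_label_counts = defaultdict(Counter)
for text, label in corpus:
    for word in text.split():
        word_label_counts[word][label] += 1

def classify(text):
    """Vote over the labels of known words; return the most common label."""
    votes = Counter()
    for word in text.split():
        votes.update(word_label_counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None

# NLG (generative): count which word follows which (a bigram model).
bigrams = defaultdict(Counter)
for text, _ in corpus:
    words = text.split()
    for current, following in zip(words, words[1:]):
        bigrams[current][following] += 1

def complete(prompt, max_words=3):
    """Extend the prompt by repeatedly appending the most frequent next word."""
    words = prompt.split()
    if not words:
        return prompt
    for _ in range(max_words):
        candidates = bigrams.get(words[-1])
        if not candidates:
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

print(classify("weak earnings fall"))
print(complete("stocks", max_words=2))
```

Both functions are driven by simple counts from the same corpus; large language models replace those counts with learned parameters.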