22. Text Comprehension#

An important attribute of written text: is it easy to comprehend?

from google.colab import drive
drive.mount('/content/drive')  # Add My Drive/<>

import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/NLPBook/')
Mounted at /content/drive
%%capture
%pylab inline
import pandas as pd
import os
%load_ext rpy2.ipython

22.1. Readability of Text#

Or, how to grade text!

In recent years, the SAT added an essay section. While the test aims to assess original writing, it also introduced automated grading. One goal of the test is to gauge the writing level of the student, which is closely tied to the notion of readability.

“Readability” is a metric of how easy it is to comprehend text. Given the goal of efficient markets, regulators want to foster transparency by ensuring that financial documents disseminated to the investing public are readable. Hence, metrics for readability are important and have been gaining traction.

22.2. Gunning-Fog Index#

Gunning (1952) developed the Fog index. The index estimates the years of formal education needed to understand text on a first reading. A Fog index of 12 requires the reading level of a U.S. high school senior (around 18 years old). The index is based on the idea that poor readability is associated with longer sentences and complex words, where complex words are those with more than two syllables. The formula for the Fog index is

\[ 0.4 \left[\frac{\#\text{words}}{\#\text{sentences}} + 100 \cdot \frac{\#\text{complex words}}{\#\text{words}} \right] \]
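As a quick sanity check, here is a minimal sketch in R (using the same %%R cell magic loaded above) that computes the Fog index from hypothetical counts for a short passage; the counts and variable names are illustrative, not drawn from any document used in this chapter.

%%R
# Gunning-Fog index from hypothetical counts (illustrative values)
n_words     = 120   # total number of words
n_sentences = 6     # total number of sentences
n_complex   = 18    # words with more than two syllables
fog = 0.4 * (n_words / n_sentences + 100 * n_complex / n_words)
print(fog)          # 0.4 * (20 + 15) = 14, i.e., roughly 14 years of schooling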

22.3. Flesch Score#

Alternative readability scores use similar ideas. The Flesch Reading Ease Score and the Flesch-Kincaid Grade Level also use counts of words, syllables, and sentences. See http://en.wikipedia.org/wiki/Flesch-Kincaid_readability_tests. The Flesch Reading Ease Score is defined as

\[ 206.835 - 1.015 \cdot \frac{\#\text{words}}{\#\text{sentences}} - 84.6 \cdot \frac{\#\text{syllables}}{\#\text{words}} \]

Scores of 90-100 indicate text easily understood by an 11-year-old, 60-70 text that is easy for 13- to 15-year-olds to understand, and 0-30 text best suited to university graduates.
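A minimal sketch of the same computation for the Flesch Reading Ease Score, again with hypothetical counts:

%%R
# Flesch Reading Ease Score from hypothetical counts (illustrative values)
n_words     = 120
n_sentences = 6
n_syllables = 180
fre = 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)
print(fre)          # 206.835 - 20.3 - 126.9 = 59.635, i.e., roughly high-school level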

22.4. The Flesch-Kincaid Grade Level#

This is defined as

\[ 0.39 \cdot \frac{\#\text{words}}{\#\text{sentences}} + 11.8 \cdot \frac{\#\text{syllables}}{\#\text{words}} - 15.59 \]

which gives a number that corresponds to the grade level. As expected, the grade level is negatively correlated with the Flesch Reading Ease Score. Various other measures of readability use the same ideas as in the Fog index. For example, the Coleman and Liau (1975) index does not even require a count of syllables:

\[ CLI = 0.0588 L - 0.296 S - 15.8 \]

where \(L\) is the average number of letters per hundred words and \(S\) is the average number of sentences per hundred words.
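As a minimal sketch, the Flesch-Kincaid Grade Level and the CLI can be computed in the same way; the letter and syllable counts below are hypothetical.

%%R
# Flesch-Kincaid Grade Level and Coleman-Liau Index (illustrative counts)
n_words     = 120
n_sentences = 6
n_syllables = 180
n_letters   = 540
fk   = 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59
L100 = n_letters / n_words * 100     # L: average letters per 100 words
S100 = n_sentences / n_words * 100   # S: average sentences per 100 words
cli  = 0.0588 * L100 - 0.296 * S100 - 15.8
print(c(fk, cli))   # roughly 9.91 and 9.18, i.e., around a 9th-10th grade level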

Standard readability metrics may not work well for financial text. Loughran and McDonald (2014) find that the Fog index is inferior to simply looking at 10-K file size.

References

  • M. Coleman and T. L. Liau (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology 60, 283-284.

  • T. Loughran and W. McDonald (2014). Measuring readability in financial disclosures. The Journal of Finance 69, 1643-1671.

22.5. koRpus package#

The R package koRpus provides readability scoring; documentation is at http://www.inside-r.org/packages/cran/koRpus/docs/readability

First, let’s grab some text from my web site.

%%R
library(rvest)
url = "http://srdas.github.io/bio-candid.html"

doc.html = read_html(url)
text = doc.html %>% html_elements("p") %>% html_text()

text = gsub("[\t\n]"," ",text)
text = gsub('"'," ",text)   #removes single backslash
text = paste(text, collapse=" ")
print(text)
[1] " Sanjiv Das: A Short Academic Life History    After loafing and working in many parts of Asia, but never really growing up, Sanjiv moved to New York to change the world, hopefully through research.  He graduated in 1994 with a Ph.D. from NYU, and since then spent five years in Boston, and now lives in San Jose, California.  Sanjiv loves animals, places in the world where the mountains meet the sea, riding sport motorbikes, reading, gadgets, science fiction movies, and writing cool software code. When there is time available from the excitement of daily life, Sanjiv writes academic papers, which helps him relax. Always the contrarian, Sanjiv thinks that New York City is the most calming place in the world, after California of course.     Sanjiv is now a Professor of Finance at Santa Clara University. He came to SCU from Harvard Business School and spent a year at UC Berkeley. In his past life in the unreal world, Sanjiv worked at Citibank, N.A. in the Asia-Pacific region. He takes great pleasure in merging his many previous lives into his current existence, which is incredibly confused and diverse.     Sanjiv's research style is instilled with a distinct  New York state of mind  - it is chaotic, diverse, with minimal method to the madness. He has published articles on derivatives, term-structure models, mutual funds, the internet, portfolio choice, banking models, credit risk, and has unpublished articles in many other areas. Some years ago, he took time off to get another degree in computer science at Berkeley, confirming that an unchecked hobby can quickly become an obsession. There he learnt about the fascinating field of Randomized Algorithms, skills he now applies earnestly to his editorial work, and other pursuits, many of which stem from being in the epicenter of Silicon Valley.     Coastal living did a lot to mold Sanjiv, who needs to live near the ocean.  The many walks in Greenwich village convinced him that there is no such thing as a representative investor, yet added many unique features to his personal utility function. He learnt that it is important to open the academic door to the ivory tower and let the world in. Academia is a real challenge, given that he has to reconcile many more opinions than ideas. He has been known to have turned down many offers from Mad magazine to publish his academic work. As he often explains, you never really finish your education -  you can check out any time you like, but you can never leave.  Which is why he is doomed to a lifetime in Hotel California. And he believes that, if this is as bad as it gets, life is really pretty good. "
%%R
install.packages(c('koRpus', 'koRpus.lang.en'), quiet=TRUE)
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: also installing the dependencies ‘sylly’, ‘sylly.en’
%%R
# install the language support package for the first time
# conda install -c conda-forge r-korpus.lang.en
#install.koRpus.lang("en")
# load the package
library(koRpus.lang.en)
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Loading required package: koRpus

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Loading required package: sylly

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: For information on available language packages for 'koRpus', run

  available.koRpus.lang()

and see ?install.koRpus.lang()
%%R
library(koRpus)
write(text,file="textvec.txt")
#text_tokens = tokenize("textvec.txt",tag=FALSE)
text_tokens = tokenize("textvec.txt",lang="en")
print(text_tokens)
print(c("Number of sentences: ",text_tokens@desc$sentences))
         doc_id    token      tag lemma lttr   wclass desc stop stem idx sntc
1   textvec.txt   Sanjiv word.kRp          6     word <NA> <NA> <NA>   1    1
2   textvec.txt      Das word.kRp          3     word <NA> <NA> <NA>   2    1
3   textvec.txt        :     .kRp          1 fullstop <NA> <NA> <NA>   3    1
4   textvec.txt        A word.kRp          1     word <NA> <NA> <NA>   4    2
5   textvec.txt    Short word.kRp          5     word <NA> <NA> <NA>   5    2
6   textvec.txt Academic word.kRp          8     word <NA> <NA> <NA>   6    2
                                                [...]                        
508 textvec.txt     life word.kRp          4     word <NA> <NA> <NA> 508   24
509 textvec.txt       is word.kRp          2     word <NA> <NA> <NA> 509   24
510 textvec.txt   really word.kRp          6     word <NA> <NA> <NA> 510   24
511 textvec.txt   pretty word.kRp          6     word <NA> <NA> <NA> 511   24
512 textvec.txt     good word.kRp          4     word <NA> <NA> <NA> 512   24
513 textvec.txt        .     .kRp          1 fullstop <NA> <NA> <NA> 513   24
[1] "Number of sentences: "
%%R
print(readability(text_tokens))
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Hyphenation (language: en)
  |======================================================================| 100%

Automated Readability Index (ARI)
  Parameters: default 
       Grade: 9.88 


Coleman-Liau
  Parameters: default 
         ECP: 47% (estimated cloze percentage)
       Grade: 10.09 
       Grade: 10.1 (short formula)


Danielson-Bryan
  Parameters: default 
         DB1: 7.63 
         DB2: 48.67 
       Grade: 9-12 


Dickes-Steiwer's Handformel
  Parameters: default 
         TTR: 0.58 
       Score: 42.76 


Easy Listening Formula
  Parameters: default 
      Exsyls: 149 
       Score: 6.21 


Farr-Jenkins-Paterson
  Parameters: default 
          RE: 56.1 
       Grade: >= 10 (high school) 


Flesch Reading Ease
  Parameters: en (Flesch) 
          RE: 59.75 
       Grade: >= 10 (high school) 


Flesch-Kincaid Grade Level
  Parameters: default 
       Grade: 9.54 
         Age: 14.54 


Gunning Frequency of Gobbledygook (FOG)
  Parameters: default 
       Grade: 12.55 


FORCAST
  Parameters: default 
       Grade: 10.01 
         Age: 15.01 


Fucks' Stilcharakteristik
       Score: 28.17 
       Grade: 5.31 


Gutiérrez Fórmula de Comprensibilidad
       Score: 43.35 


Linsear Write
  Parameters: default 
  Easy words: 87 
  Hard words: 13 
       Grade: 11.71 


Läsbarhetsindex (LIX)
  Parameters: default 
       Index: 40.56 
      Rating: standard 
       Grade: 6 


Neue Wiener Sachtextformeln
  Parameters: default 
       nWS 1: 5.42 
       nWS 2: 5.97 
       nWS 3: 6.28 
       nWS 4: 6.81 


Readability Index (RIX)
  Parameters: default 
       Index: 4.08 
       Grade: 9 


Simple Measure of Gobbledygook (SMOG)
  Parameters: default 
       Grade: 12.01 
         Age: 17.01 


Strain Index
  Parameters: default 
       Index: 8.45 


Kuntzsch's Text-Redundanz-Index
  Parameters: default 
 Short words: 297 
 Punctuation: 71 
     Foreign: 0 
       Score: -56.22 


Tuldava's Text Difficulty Formula
  Parameters: default 
       Index: 4.43 


Wheeler-Smith
  Parameters: default 
       Score: 62.08 
       Grade: > 4 

Text language: en 

See also: “How to Talk When a Machine Is Listening” (NBER Working Paper): https://www.nber.org/papers/w27950