22. Text Comprehension

An important attribute of written text is how easy it is to comprehend.

22.1. Readability of Text

Or, how to grade text!

In recent years, the SAT exams added a new essay section. While the test was aimed at assessing original writing, it also introduced automated grading. A goal of the test is to assess the writing level of the student, which is closely tied to the notion of readability.

“Readability” is a measure of how easy it is to comprehend text. Given a goal of efficient markets, regulators want to foster transparency by making sure that financial documents disseminated to the investing public are readable. Hence, metrics for readability are very important and have recently been gaining traction.

22.2. Gunning-Fog Index

Gunning (1952) developed the Fog index. The index estimates the years of formal education needed to understand text on a first reading. A Fog index of 12 corresponds to the reading level of a U.S. high school senior (around 18 years old). The index is based on the idea that poor readability is associated with longer sentences and complex words. Complex words are those that have more than two syllables. The formula for the Fog index is

\[ 0.4 \left[\frac{\#words}{\#sentences} + 100 \cdot \frac{\#complex\ words}{\#words} \right] \]
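The Fog formula can be sketched in a few lines of base R. This is only a rough illustration: the function names `count_syllables` and `fog_index` are hypothetical, syllables are approximated by counting vowel groups rather than looked up in a dictionary, and the refinements usually applied when identifying complex words are ignored.

```r
# Rough syllable count: the number of vowel groups in a word.
# This is a heuristic assumption, not a dictionary-based count.
count_syllables <- function(word) {
  groups <- gregexpr("[aeiouy]+", tolower(word))[[1]]
  if (groups[1] == -1) 1L else length(groups)
}

fog_index <- function(text) {
  # Split into sentences at ., ! or ? and drop empty pieces.
  sentences <- unlist(strsplit(text, "[.!?]+"))
  sentences <- sentences[nzchar(trimws(sentences))]
  # Words are runs of letters (apostrophes allowed).
  words <- unlist(regmatches(text, gregexpr("[A-Za-z']+", text)))
  syllables <- vapply(words, count_syllables, integer(1))
  # Complex words have more than two syllables.
  n_complex <- sum(syllables > 2)
  0.4 * (length(words) / length(sentences) + 100 * n_complex / length(words))
}

sample_text <- "The cat sat on the mat. Regulators want financial documents to be readable."
print(fog_index(sample_text))
```

The sample has 13 words in 2 sentences, and the heuristic flags 4 of the words as complex, so the index is 0.4 × (13/2 + 100 × 4/13). Longer sentences or a larger share of complex words both push the index up.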

22.3. Flesch Score

Alternative readability scores use similar ideas. The Flesch Reading Ease Score and the Flesch-Kincaid Grade Level also use counts of words, syllables, and sentences. See http://en.wikipedia.org/wiki/Flesch-Kincaid_readability_tests. The Flesch Reading Ease Score is defined as

\[ 206.835 - 1.015 \cdot \frac{\#words}{\#sentences} - 84.6 \cdot \frac{\#syllables}{\#words} \]

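Higher scores indicate easier text. A score of 90-100 means the text is easily understood by an average 11-year-old, a score of 60-70 means it is easily understood by 13- to 15-year-olds, and a score of 0-30 means it is best understood by university graduates.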
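As a small worked example, the Reading Ease formula can be evaluated directly from the three counts. The function name `flesch_reading_ease` and the passage counts below are hypothetical, chosen only to illustrate the arithmetic.

```r
# Flesch Reading Ease from raw counts (illustrative helper, not a library function).
flesch_reading_ease <- function(n_words, n_sentences, n_syllables) {
  206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)
}

# Hypothetical passage A: 100 words, 5 sentences, 150 syllables.
# 206.835 - 1.015 * 20 - 84.6 * 1.5 = 206.835 - 20.3 - 126.9 = 59.635
score_a <- flesch_reading_ease(100, 5, 150)

# Hypothetical passage B: shorter sentences and shorter words.
# 206.835 - 1.015 * 10 - 84.6 * 1.3 = 206.835 - 10.15 - 109.98 = 86.705
score_b <- flesch_reading_ease(100, 10, 130)

print(c(score_a, score_b))
```

Passage B has shorter sentences and fewer syllables per word, so it scores higher, that is, it is rated easier to read.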

22.4. The Flesch-Kincaid Grade Level

This is defined as

\[ 0.39 \cdot \frac{\#words}{\#sentences} + 11.8 \cdot \frac{\#syllables}{\#words} - 15.59 \]

which gives a number that corresponds to the U.S. school grade level. As expected, the Reading Ease Score and the Grade Level are negatively correlated, since easier text scores higher on the former and lower on the latter. Various other measures of readability use the same ideas as the Fog index. For example, the Coleman and Liau (1975) index does not even require a count of syllables:

\[ CLI = 0.0588 L - 0.296 S - 15.8 \]

where \(L\) is the average number of letters per hundred words and \(S\) is the average number of sentences per hundred words.
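The two grade-level formulas in this section can be sketched in base R as follows. The function names `fk_grade` and `coleman_liau`, the sample counts, and the sample sentence are all hypothetical, and the sentence and word splitting is deliberately naive.

```r
# Flesch-Kincaid Grade Level from raw counts (illustrative helper).
fk_grade <- function(n_words, n_sentences, n_syllables) {
  0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59
}

# Coleman-Liau index from raw text: needs only letters, words, and sentences.
coleman_liau <- function(text) {
  # Words are runs of letters, so a contraction such as "don't" is split in two.
  words <- unlist(regmatches(text, gregexpr("[A-Za-z]+", text)))
  # Split into sentences at ., ! or ? and drop empty pieces.
  sentences <- unlist(strsplit(text, "[.!?]+"))
  sentences <- sentences[nzchar(trimws(sentences))]
  L <- 100 * sum(nchar(words)) / length(words)   # letters per hundred words
  S <- 100 * length(sentences) / length(words)   # sentences per hundred words
  0.0588 * L - 0.296 * S - 15.8
}

# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
# 0.39 * 20 + 11.8 * 1.5 - 15.59 = 7.8 + 17.7 - 15.59 = 9.91
grade <- fk_grade(100, 5, 150)

# Very simple text: 9 words, 26 letters, 2 sentences.
cli <- coleman_liau("The cat sat on the mat. The dog ran.")

print(c(grade, cli))
```

For the simple sample sentence the Coleman-Liau value is negative, which reflects that the formula was calibrated on longer, more typical passages rather than on toy text.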

Standard readability metrics may not work well for financial text. Loughran and McDonald (2014) find that the Fog index is inferior to simply looking at 10-K file size.


22.5. koRpus package

The R package koRpus provides readability scoring. See http://www.inside-r.org/packages/cran/koRpus/docs/readability

First, let’s grab some text from my web site.

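library(rvest)            # read_html(), html_elements(), html_text(), and the %>% pipe
library(koRpus)           # tokenize() and readability()
library(koRpus.lang.en)   # English language support (assumed to be needed by recent koRpus versions)

url = "http://srdas.github.io/bio-candid.html"

doc.html = read_html(url)
text = doc.html %>% html_elements("p") %>% html_text()

text = gsub("[\t\n]"," ",text)   # replace tabs and newlines with spaces
text = gsub('"'," ",text)        # replace double quotes with spaces
text = paste(text, collapse=" ")

write(text, file="textvec.txt")  # save the text so that tokenize() can read it from a file
text_tokens = tokenize("textvec.txt",lang="en")
print(c("Number of sentences: ",text_tokens@desc$sentences))
print(readability(text_tokens))  # compute and print the available readability indices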
Automated Readability Index (ARI)
  Parameters: default 
       Grade: 9.88 

Coleman-Liau
  Parameters: default 
         ECP: 47% (estimted cloze percentage)
       Grade: 10.09 
       Grade: 10.1 (short formula)

Danielson-Bryan
  Parameters: default 
         DB1: 7.63 
         DB2: 48.67 
       Grade: 9-12 

Dickes-Steiwer's Handformel
  Parameters: default 
         TTR: 0.58 
       Score: 42.76 

Easy Listening Formula
  Parameters: default 
      Exsyls: 149 
       Score: 6.21 

Farr-Jenkins-Paterson
  Parameters: default 
          RE: 56.1 
       Grade: >= 10 (high school) 

Flesch Reading Ease
  Parameters: en (Flesch) 
          RE: 59.75 
       Grade: >= 10 (high school) 

Flesch-Kincaid Grade Level
  Parameters: default 
       Grade: 9.54 
         Age: 14.54 

Gunning Frequency of Gobbledygook (FOG)
  Parameters: default 
       Grade: 12.55 

FORCAST
  Parameters: default 
       Grade: 10.01 
         Age: 15.01 

Fucks' Stilcharakteristik
       Score: 28.17 
       Grade: 5.31 

Gutiérrez Fórmula de Comprensibilidad
       Score: 43.35 

Linsear Write
  Parameters: default 
  Easy words: 87 
  Hard words: 13 
       Grade: 11.71 

Läsbarhetsindex (LIX)
  Parameters: default 
       Index: 40.56 
      Rating: standard 
       Grade: 6 

Neue Wiener Sachtextformeln
  Parameters: default 
       nWS 1: 5.42 
       nWS 2: 5.97 
       nWS 3: 6.28 
       nWS 4: 6.81 

Readability Index (RIX)
  Parameters: default 
       Index: 4.08 
       Grade: 9 

Simple Measure of Gobbledygook (SMOG)
  Parameters: default 
       Grade: 12.01 
         Age: 17.01 

Strain Index
  Parameters: default 
       Index: 8.45 

Kuntzsch's Text-Redundanz-Index
  Parameters: default 
 Short words: 297 
 Punctuation: 71 
     Foreign: 0 
       Score: -56.22 

Tuldava's Text Difficulty Formula
  Parameters: default 
       Index: 4.43 

Wheeler-Smith
  Parameters: default 
       Score: 62.08 
       Grade: > 4 

Text language: en 

How to talk when a machine is listening: https://www.nber.org/papers/w27950