# Summarization

Can a machine summarize a document?

In [None]:
from google.colab import drive
drive.mount('/content/drive')  # Add My Drive/<>

import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/NLPBook/')

Mounted at /content/drive


In [None]:
%%capture
%pylab inline
import pandas as pd
import os
%load_ext rpy2.ipython
import textwrap

## Types of Summarization

There are two broad types of text summarization:

1. Extractive: provide the most meaningful extracted subsample from the text.
2. Abstractive: generate new language that explains the document more briefly.

Several metrics exist for assessing the quality of a summary; see: http://nlpprogress.com/english/summarization.html
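One widely used family of such metrics is ROUGE, which scores the n-gram overlap between a candidate summary and a reference summary. The following is a minimal sketch of ROUGE-1 (unigram overlap), assuming lowercasing and whitespace tokenization; the function name `rouge_1` and the example strings are made up for illustration, and standard implementations add options such as stemming.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall, and F1 (simplified ROUGE-1 sketch)."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((c & r).values())
    precision = overlap / sum(c.values()) if c else 0.0
    recall = overlap / sum(r.values()) if r else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Five of the six words overlap in this toy pair, so all three scores equal 5/6.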

We now also have "generative" summarization using LLMs. Ask yourself when this is better than the two approaches above, and when it is worse.

## Jaccard Summarizer

Here we present a simple approach to extractive summarization.

A document $D$ comprises $m$ sentences $s_i, i=1,2,...,m$, where each $s_i$ is a set of words. We compute the pairwise overlap between sentences using the Jaccard similarity index:

$$
J_{ij} = J(s_i,s_j)=\frac{|s_i \cap s_j|}{|s_i \cup s_j|} = J_{ji}
$$

The overlap is the size of the intersection of the two word sets for sentences $s_i$ and $s_j$, divided by the size of their union. The similarity score of each sentence is the corresponding row sum of the Jaccard similarity matrix.

$$
S_i=\sum_{j=1}^m J_{ij}
$$
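As a small worked example of the two formulas above, the sketch below builds the Jaccard matrix and its row sums for three toy sentences. The sentences and the helper name `jaccard` are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard index of two word sets: |a ∩ b| / |a ∪ b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy sentences (illustrative only), each reduced to a set of words
sentences = [
    "big data is big",
    "data scientists unlock big data",
    "the weather is nice",
]
word_sets = [set(s.split()) for s in sentences]

# Symmetric similarity matrix J and its row sums S_i
m = len(word_sets)
J = [[jaccard(word_sets[i], word_sets[j]) for j in range(m)] for i in range(m)]
S = [sum(row) for row in J]

for s, score in zip(sentences, S):
    print(f"{score:.3f}  {s}")
```

The first two sentences share the words `big` and `data` out of five distinct words, so $J_{12} = 2/5$. The third sentence shares only `is` with the first and nothing with the second, so it receives the lowest score.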

### Generating the summary

Once the row sums are obtained, they are sorted in decreasing order, and the summary consists of the $n$ sentences with the highest $S_i$ values.
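The selection step can be sketched in a few lines of Python. The function name `top_n_sentences` and the scores below are hypothetical, chosen only to illustrate the ranking.

```python
def top_n_sentences(sentences, scores, n):
    """Return the n sentences with the largest scores, highest first."""
    # Sort sentence indices by score in decreasing order (ties keep document order)
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in order[:n]]

# Hypothetical similarity scores S_i for four sentences
sentences = ["s1", "s2", "s3", "s4"]
scores = [1.2, 2.5, 0.9, 2.1]
print(top_n_sentences(sentences, scores, 2))
```

The sentences come back in score order. Sorting the selected indices before returning would instead present the summary in document order, which can read more naturally.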

In [1]:
%%R
# FUNCTION TO RETURN n SENTENCE SUMMARY
# Input: array of sentences (text)
# Output: n most common intersecting sentences
text_summary = function(text, n) {
  m = length(text)  # No of sentences in input
  jaccard = matrix(0,m,m)  #Store match index
  for (i in 1:m) {
    for (j in i:m) {
      a = text[i]; aa = unlist(strsplit(a," "))
      b = text[j]; bb = unlist(strsplit(b," "))
      jaccard[i,j] = length(intersect(aa,bb))/
                          length(union(aa,bb))
      jaccard[j,i] = jaccard[i,j]
    }
  }
  similarity_score = rowSums(jaccard)
  res = sort(similarity_score, index.return=TRUE,
          decreasing=TRUE)
  idx = res$ix[1:n]
  summary = text[idx]
}


## One Function to Rule All Text in R

Also, a quick introduction to the tm package in R: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

Install (if needed from the command line): `conda install -c r r-tm` or install it as shown below.

In [None]:
%%R
install.packages("tm", quiet=TRUE)
# ! conda install -c conda-forge r-tm -y
# ! conda install -c r r-tm -y





In [None]:
%%R
library(tm)
library(stringr)
#READ IN TEXT FOR ANALYSIS, PUT IT IN A CORPUS, OR ARRAY, OR FLAT STRING
#cstem=1, if stemming needed
#cstop=1, if stopwords to be removed
#ccase=1 for lower case, ccase=2 for upper case
#cpunc=1, if punctuation to be removed
#cflat=1 for flat text wanted, cflat=2 if text array, else returns corpus
read_web_page = function(url,cstem=0,cstop=0,ccase=0,cpunc=0,cflat=0) {
    text = readLines(url)
    text = text[setdiff(seq(1,length(text)),grep("<",text))]
    text = text[setdiff(seq(1,length(text)),grep(">",text))]
    text = text[setdiff(seq(1,length(text)),grep("]",text))]
    text = text[setdiff(seq(1,length(text)),grep("}",text))]
    text = text[setdiff(seq(1,length(text)),grep("_",text))]
    text = text[setdiff(seq(1,length(text)),grep("\\/",text))]
    ctext = Corpus(VectorSource(text))
    if (cstem==1) { ctext = tm_map(ctext, stemDocument) }
    if (cstop==1) { ctext = tm_map(ctext, removeWords, stopwords("english"))}
    if (cpunc==1) { ctext = tm_map(ctext, removePunctuation) }
    if (ccase==1) { ctext = tm_map(ctext, content_transformer(tolower)) }
    if (ccase==2) { ctext = tm_map(ctext, content_transformer(toupper)) }
    text = ctext
    #CONVERT FROM CORPUS IF NEEDED
    if (cflat>0) {
        text = NULL
        for (j in 1:length(ctext)) {
            temp = ctext[[j]]$content
            if (temp!="") { text = c(text,temp) }
        }
        text = as.array(text)
    }
    if (cflat==1) {
        text = paste(text,collapse="\n")
        text = str_replace_all(text, "[\r\n]" , " ")
    }
    result = text
}




## Example: Summarization

We will use a sample of text that I took from Bloomberg news. It is about the need for data scientists.

In [None]:
%%R
url = "NLP_data/dstext_sample.txt"   #You can put any text file or URL here
text = read_web_page(url,cstem=0,cstop=0,ccase=0,cpunc=0,cflat=1)
print(length(text[[1]]))

[1] 1


In [None]:
text = %Rget text
text = text[0]
print(textwrap.fill(text, width=80))

THERE HAVE BEEN murmurings that we are now in the “trough of disillusionment” of
big data, the hype around it having surpassed the reality of what it can
deliver.  Gartner suggested that the “gravitational pull of big data is now so
strong that even people who haven’t a clue as to what it’s all about report that
they’re running big data projects.”  Indeed, their research with business
decision makers suggests that organisations are struggling to get value from big
data. Data scientists were meant to be the answer to this issue. Indeed, Hal
Varian, Chief Economist at Google famously joked that “The sexy job in the next
10 years will be statisticians.” He was clearly right as we are now used to
hearing that data scientists are the key to unlocking the value of big data.
This has created a huge market for people with these skills. US recruitment
agency, Glassdoor, report that the average salary for a data scientist is
$118,709 versus $64,537 for a skilled programmer. And a McKinsey study 

In [None]:
%%R
text2 = strsplit(text,". ",fixed=TRUE)  #Special handling of the period.
text2 = text2[[1]]
print(text2)

 [1] "THERE HAVE BEEN murmurings that we are now in the “trough of disillusionment” of big data, the hype around it having surpassed the reality of what it can deliver"                                                                                                                                                     
 [2] " Gartner suggested that the “gravitational pull of big data is now so strong that even people who haven’t a clue as to what it’s all about report that they’re running big data projects.”  Indeed, their research with business decision makers suggests that organisations are struggling to get value from big data"
 [3] "Data scientists were meant to be the answer to this issue"                                                                                                                                                                                                                                                             
 [4] "Indeed, Hal Varian, Chief Economist at G

In [None]:
%%R
res = text_summary(text2,5)
print(res)

[1] " Gartner suggested that the “gravitational pull of big data is now so strong that even people who haven’t a clue as to what it’s all about report that they’re running big data projects.”  Indeed, their research with business decision makers suggests that organisations are struggling to get value from big data"
[2] "The focus on the data scientist often implies a centralized approach to analytics and decision making; we implicitly assume that a small team of highly skilled individuals can meet the needs of the organisation as a whole"                                                                                         
[3] "May be we are investing too much in a relatively small number of individuals rather than thinking about how we can design organisations to help us get the most from data assets"                                                                                                                                      
[4] "The problem with a centralized ‘IT-style’ ap

## Text Summarization with Python

This approach distills a document down to its most important sentences. The idea is simple: keep the sentences that carry the essence of the document. The use case is a reader who faces too much text and would benefit from a shorter, pithier version.

When no clean article text is at hand, the examples below download an article, select its body with CSS selectors (found with a tool such as SelectorGadget), and extract the text with lxml and Beautiful Soup. The summarizer itself assumes that the article is clean, flat text.

https://www.dataquest.io/blog/web-scraping-tutorial-python/

Install these if needed:

In [None]:
!pip install lxml
!pip install cssselect
!pip install nltk

Collecting cssselect
  Downloading cssselect-1.2.0-py2.py3-none-any.whl.metadata (2.2 kB)
Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: cssselect
Successfully installed cssselect-1.2.0


In [None]:
# Read in the news article from the URL and extract only the title and text of the article.
# Some examples provided below.

import requests
from lxml.html import fromstring
url = "https://www.theverge.com/2023/10/4/23903986/sam-bankman-fried-opening-statements-trial-fraud"
# url = "https://www.nytimes.com/2023/10/03/us/politics/kevin-mccarthy-speaker.html"
# url = "https://www.theatlantic.com/technology/archive/2022/04/doxxing-meaning-libs-of-tiktok/629643/"
# url = 'https://economictimes.indiatimes.com/news/economy/policy/a-tax-cut-for-you-in-budget-wont-give-india-the-boost-it-needs/articleshow/73476138.cms?utm_source=Colombia&utm_medium=C1&utm_campaign=CTN_ET_hp&utm_content=18'


In [None]:
html = requests.get(url, timeout=10).text

#See: http://infohost.nmt.edu/~shipman/soft/pylxml/web/etree-fromstring.html
doc = fromstring(html)

#http://lxml.de/cssselect.html#the-cssselect-method
doc.cssselect(r".lg\:max-w-none")  # raw string: the backslash escapes the colon for the CSS selector
# doc.cssselect(".evys1bk0") # nytimes
# doc.cssselect(".Normal")  #economic times
# doc.cssselect(".ArticleParagraph_root__wy3UI")   #Atlantic

[<Element div at 0x7d5b68370b40>, <Element div at 0x7d5b68370cd0>]

In [None]:
#economic times
# x = doc.cssselect(".Normal")
# news = x[0].text_content()
# print(news)

# Verge
x = doc.cssselect(r".lg\:max-w-none")

#nytimes
# x = doc.cssselect(".StoryBodyCompanionColumn")

# Atlantic
# x = doc.cssselect(".ArticleParagraph_root__wy3UI")
news = " ".join([x[j].text_content() for j in range(len(x))])

Make sure the text you extracted is a plain string; BeautifulSoup's `get_text()` strips any remaining markup. Then split the article into individual sentences and collect them in a list, here with NLTK's `sent_tokenize`.

In [None]:
from bs4 import BeautifulSoup
news = BeautifulSoup(news,'lxml').get_text()
print(textwrap.fill(news, width=80))
type(news)

TechIs Sam Bankman-Fried’s defense even trying to win?The prosecution came out
swinging. Oddly, Bankman-Fried’s defense didn’t.By  Elizabeth Lopatto, a
reporter who writes about tech, money, and human behavior. She joined The Verge
in 2014 as science editor. Previously, she was a reporter at Bloomberg.  Oct 4,
2023, 11:02 PM UTCShare this storyThreadsEven the defense’s opening statement
was a bad look for Sam Bankman-Fried Photo Illustration by Cath Virginia / The
Verge I have never seen Sam Bankman-Fried so still as he was during the
prosecution’s opening statement. The characteristic leg-jiggling was absent. He
barely moved as the prosecutor listed the evidence against him: internal company
files, what customers were told, the testimony of his co-conspirators and his
own words.His hair was shorn, the result of a haircut from a fellow prisoner,
the Wall Street Journal reported. He wore a suit bought at a discount at Macy’s,
per the Journal; it hung on him. He appeared to have lost som

str

In [None]:
import nltk
nltk.download("punkt")
nltk.download("punkt_tab")
from nltk.tokenize import sent_tokenize   # To get separate sentences
sentences = sent_tokenize(news)
print("Number of sentences =", len(sentences))
for s in sentences:
    print(textwrap.fill(s, width=80), end="\n\n")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Number of sentences = 63
TechIs Sam Bankman-Fried’s defense even trying to win?The prosecution came out
swinging.

Oddly, Bankman-Fried’s defense didn’t.By  Elizabeth Lopatto, a reporter who
writes about tech, money, and human behavior.

She joined The Verge in 2014 as science editor.

Previously, she was a reporter at Bloomberg.

Oct 4, 2023, 11:02 PM UTCShare this storyThreadsEven the defense’s opening
statement was a bad look for Sam Bankman-Fried Photo Illustration by Cath
Virginia / The Verge I have never seen Sam Bankman-Fried so still as he was
during the prosecution’s opening statement.

The characteristic leg-jiggling was absent.

He barely moved as the prosecutor listed the evidence against him: internal
company files, what customers were told, the testimony of his co-conspirators
and his own words.His hair was shorn, the result of a haircut from a fellow
prisoner, the Wall Street Journal reported.

He wore a suit bought at a discount at Macy’s, per the Journal; it hung on hi

In [None]:
# Python Summarizer
import re
import numpy as np

# Pass in a list of sentences; returns an n_summary-sentence summary and the anomalies
def text_summarizer(sentences, n_summary):
    n = len(sentences)
    # Represent each sentence as a set of tokens split on spaces, commas, and periods
    x = [set(re.split('[ ,.]', s)) for s in sentences]
    jaccsim = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            jaccsim[i, j] = len(x[i] & x[j]) / len(x[i] | x[j])
            jaccsim[j, i] = jaccsim[i, j]
    # Rank sentences by their total similarity to all sentences (ascending order)
    order = np.argsort(jaccsim.sum(axis=0))
    # Summary: the most similar sentences, highest score first
    summary = [sentences[j] for j in order[::-1][:n_summary]]
    # Anomalies: the least similar sentences
    anomalies = [sentences[j] for j in order[:n_summary]]
    return summary, anomalies
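The split pattern `'[ ,.]'` used in `text_summarizer` keeps empty strings as tokens and treats `Big` and `big` as different words, both of which distort the Jaccard scores. A possible refinement, not part of the original function, is a normalizing tokenizer; the helper name `tokenize` below is hypothetical.

```python
import re

def tokenize(sentence):
    """Set of lowercased word tokens in a sentence (hypothetical helper)."""
    # \w+ matches runs of letters, digits, and underscores, so punctuation
    # and empty strings never become tokens
    return set(re.findall(r"\w+", sentence.lower()))

example = "Big data, big value."
print(sorted(set(re.split('[ ,.]', example))))  # original split pattern
print(sorted(tokenize(example)))                # normalized tokens
```

With this helper, the sentence sets could be built as `[tokenize(s) for s in sentences]`. A guard against an empty union would then be needed for sentences that contain no word characters.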

In [None]:
# Get the summary and the anomaly sentences
summary, anomalies = text_summarizer(sentences, int(len(sentences)/4))
summ = "  ".join(summary)
print(textwrap.fill(summ, width=80))

Juilliard, who was born in Paris and lives in London, testified that he trusted
FTX because Bankman-Fried came across as a leading figure of the industry.  He
noted that Julliard was a licensed commodities broker, who was trading in crypto
because he didn’t have to disclose it; that Julliard knew that crypto was new
and risky, and that Julliard didn’t review the terms of service agreement he’d
assented to when making his FTX account.  “All of that was built on lies,” Rehn
said.In his opening statement, Rehn dodged explaining cryptocurrency to the
jury.  He appeared to have lost some weight.“All of that was built on
lies.”RelatedFTX’s Sam Bankman-Fried is on trial for fraud and
conspiracyBankman-Fried, at this time last year, had a luxury lifestyle as the
CEO of crypto exchange FTX, said the assistant US attorney, Thane Rehn, in the
cadence of a high schooler delivering his lines in a student play.  I was very
curious, having learned yesterday that Bankman-Fried had never been offered a

In [None]:
for a in anomalies:
    print(a)

Was he worried about what one might find?
That’s the whole point!
That’s why you hire a risk officer and delegate!
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.From our sponsorAdvertiser Content From
)He also noted FTX’s glossy ads — featuring Gisele Bündchen, for instance —  suggested a very high budget.
Given what I saw today, setting up an appeal seems wise.
Well, sure, but so what?
Assets are fine” tweets, along with “FTX has enough to cover all client holdings.
Man, it’s no good when your defense lawyer has just made you sound worse than the prosecution already did.
It is, at minimum, risk management.Most PopularMost PopularNFL teams can’t use BlueskyGoogle’s Gemini is already winning the next-gen assistant warsStar Trek: Section 31 is firing on all cylindersYouTube Premium gets more experimental features that can now be tested all at onceNvidia GeForce RTX 5090 review: a new king of 4K is here Verge Deals / Sign up for Verge Deals t

## Modern Methods

- Extractive Summarization vs Abstractive Summarization

- Summarization with pointer networks: https://drive.google.com/file/d/1fAgr85WAQU8OXYkwifuF4Ep-LXfrwinv/view?usp=sharing

- Use Hugging Face Transformers as shown next: https://huggingface.co/transformers/main_classes/pipelines.html

In [None]:
!pip install transformers



In [None]:
from transformers import pipeline
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cpu


In [None]:
# All in one example
html = requests.get(url, timeout=10).text
doc = fromstring(html)
# x = doc.cssselect(".ArticleParagraph_root__wy3UI")
x = doc.cssselect(r".lg\:max-w-none")
news = " ".join([x[j].text_content() for j in range(len(x))])
news = BeautifulSoup(news,'lxml').get_text()
print(len(news))
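# The model accepts at most 1024 tokens; cutting at 1024 characters is a
# conservative proxy that stays well under that limit
if len(news)>1024:
    news = news[:1024]
# Note: max_length and min_length are counted in tokens, not characters
summ = summarizer(news, max_length=int(len(news)/4), min_length=25)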
print(summ)

Your max_length is set to 256, but your input_length is only 254. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=127)


9131
[{'summary_text': " Sam Bankman-Fried's hair was shorn, the result of a haircut from a fellow prisoner . He wore a suit bought at a discount at Macy's, it hung on him ."}]
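The warning above arises because `max_length` is counted in tokens, whereas `len(news)/4` is computed from characters: 1024 characters became only 254 tokens, so a `max_length` of 256 exceeded the input length. A minimal sketch of one way to derive the bounds from the token count follows; the helper name `summary_length_bounds` and the default ratio of one half are assumptions, chosen to mirror the suggestion in the warning message.

```python
def summary_length_bounds(input_tokens, ratio=0.5, min_floor=25):
    """Pick (min_length, max_length) in tokens so the summary is shorter than the input."""
    # Cap the summary at a fraction of the input, but keep max above the floor
    max_length = max(min_floor + 1, int(input_tokens * ratio))
    # min_length must stay strictly below max_length
    min_length = min(min_floor, max_length - 1)
    return min_length, max_length

print(summary_length_bounds(254))
```

For the 254-token input reported in the warning this gives a `max_length` of 127, matching the value the warning proposes. In the notebook, the token count could be obtained from the pipeline's own tokenizer, for example `len(summarizer.tokenizer(news)["input_ids"])`.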


Try this additional blog post for more on the T5 (text to text transfer transformer) summarizer.

https://towardsdatascience.com/simple-abstractive-text-summarization-with-pretrained-t5-text-to-text-transfer-transformer-10f6d602c426

This is a nice web site explaining Hugging Face transformers: https://zenodo.org/record/3733180#.X40RxEJKjlx

And the T5 paper: https://arxiv.org/pdf/1910.10683.pdf

And here is a nice application of the same: https://towardsdatascience.com/summarization-has-gotten-commoditized-thanks-to-bert-9bb73f2d6922

## Long document summarization

The model accepts only a limited input length, so a long document cannot be summarized in one pass. Instead, we break the text into chunks of at most the maximum size and summarize it piecemeal.

In [None]:
html = requests.get(url, timeout=10).text
doc = fromstring(html)
# x = doc.cssselect(".ArticleParagraph_root__wy3UI")
x = doc.cssselect(r".lg\:max-w-none")
news = " ".join([x[j].text_content() for j in range(len(x))])
news = BeautifulSoup(news,'lxml').get_text()
print("Size of article =",len(news)," | #Chunks =",int(len(news)/1024))
for j in range(0,len(news),1024):
    print(summarizer(news[j:j+1024], max_length=int(len(news)/4), min_length=25))

Your max_length is set to 2282, but your input_length is only 254. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=127)


Size of article = 9131  | #Chunks = 8


Your max_length is set to 2282, but your input_length is only 254. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=127)


[{'summary_text': " Sam Bankman-Fried's hair was shorn, the result of a haircut from a fellow prisoner . He wore a suit bought at a discount at Macy's, it hung on him ."}]


Your max_length is set to 2282, but your input_length is only 249. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=124)


[{'summary_text': ' FTX CEO Sam Bankman-Fried is on trial for fraud and conspiracy . Assistant US attorney Thane Rehn told the jury the FTX exec sold stock in FTX and borrowed millions from lenders by lying . Rehn dodged explaining cryptocurrency to the jury .'}]


Your max_length is set to 2282, but your input_length is only 246. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=123)


[{'summary_text': ' Rehn: FTX "didn’t have a chief risk officer, which became an issue when the storm hit" Bankman-Fried tweeted, "FTX is fine.  customer money to repay loans"'}]


Your max_length is set to 2282, but your input_length is only 247. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=123)


[{'summary_text': ' FTX was named such as it was because it was a futures exchange, which sits between the winners and losers of bets . That means FTX can’t pay out what it owes the winners unless the losers pay up . Risk officers exist to identify business’ potential risks, monitor, and mitigate them .'}]


Your max_length is set to 2282, but your input_length is only 246. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=123)


[{'summary_text': " Bankman-Fried was a math nerd who didn’t party, Cohen said . He said the defense brought up the missing risk officer, but the prosecution hadn't mentioned it . If he had been a party-hardy trainwreck, he could see overlooking a risk officer to do another line ."}]


Your max_length is set to 2282, but your input_length is only 250. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=125)


[{'summary_text': ' The prosecution called its first witness, Marc-Antoine Julliard, whose money got stuck on FTX . Juilliard testified that he trusted FTX because Bankman-Fried came across as a leading figure of the industry .'}]


Your max_length is set to 2282, but your input_length is only 261. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=130)


[{'summary_text': ' Julliard followed Bankman-Fried on Twitter, and read aloud the “FTX is fine” tweets, along with ‘FTX has enough to cover all client holdings” and “We don’t invest client assets” In November 2022, things went bad .'}]


Your max_length is set to 2282, but your input_length is only 210. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=105)


[{'summary_text': ' Bankman-Fried’s lawyers told the judge that he wasn’t getting  his money back . The jury was dismissed . The next witness called was the former college (and FTX) roommate Adam Yedidia .'}]
[{'summary_text': ' The defense appeared to be setting up the grounds for an appeal . Given what I saw today, setting up an appeal seems wise . It is, at minimum, risk management .'}]
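Two details of the loop above are worth noting. Slicing at fixed character offsets can cut a sentence in half, and `max_length` is derived from the length of the whole article (hence the repeated warnings about 2282). Also, a 9131-character article yields nine slices rather than eight, since the last partial slice counts too. The following is a minimal sketch of sentence-aware chunking, assuming a naive regular-expression sentence split; the function name `chunk_text` and its parameters are hypothetical.

```python
import math
import re

def chunk_text(text, max_chars=1024):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Naive split after ., !, or ? followed by whitespace (sent_tokenize would also work)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        # A single sentence longer than max_chars is hard-split
        while len(s) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(s[:max_chars])
            s = s[max_chars:]
        if current and len(current) + 1 + len(s) > max_chars:
            # Adding this sentence would overflow: close the current chunk
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

# Number of fixed-width slices for a 9131-character article
print(math.ceil(9131 / 1024))

# Toy text (illustrative only)
text = " ".join(f"Sentence number {i} is here." for i in range(100))
chunks = chunk_text(text, max_chars=200)
print(len(chunks), max(len(c) for c in chunks))
```

Each chunk could then be passed to `summarizer` with a `max_length` based on that chunk's own size rather than the size of the full article.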
