23. Summarization#

Can a machine summarize a document?

from google.colab import drive
drive.mount('/content/drive')  # Add My Drive/<>

import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/NLPBook/')
Mounted at /content/drive
%%capture
%pylab inline
import pandas as pd
import os
%load_ext rpy2.ipython
import textwrap

23.1. Types of Summarization#

There are two broad types of text summarization:

  1. Extractive: select the most meaningful subset of sentences directly from the text.

  2. Abstractive: generate new text that conveys the document’s content more briefly.

There are standard metrics for the quality of a summarization, such as ROUGE; see: http://nlpprogress.com/english/summarization.html

We now also have “generative” summarization using LLMs. Ask yourself when this is better than the two approaches above and when it is worse.
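For extractive output, quality is usually scored against a human reference summary with ROUGE. As a minimal sketch (not a full ROUGE implementation, which also handles stemming, n-grams, and multiple references), ROUGE-1 recall is just the fraction of reference unigrams that the candidate summary recovers:

```python
from collections import Counter

def rouge1_recall(reference, candidate):
    """ROUGE-1 recall: fraction of reference unigrams recovered by the candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # clipped overlap: each reference word counts at most as often as it appears there
    overlap = sum(min(ref_counts[w], cand_counts[w]) for w in ref_counts)
    return overlap / sum(ref_counts.values())

score = rouge1_recall("the cat sat on the mat", "the cat is on a mat")
print(round(score, 3))  # 4 of the 6 reference unigrams are recovered -> 0.667
```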

23.2. Jaccard Summarizer#

Here we present a simple approach to extractive summarization.

A document \(D\) comprises \(m\) sentences \(s_i, i=1,2,...,m\), where each \(s_i\) is a set of words. We compute the pairwise overlap between sentences using the Jaccard similarity index:

\[ J_{ij} = J(s_i,s_j)=\frac{|s_i \cap s_j|}{|s_i \cup s_j|} = J_{ji} \]

The overlap is the ratio of the size of the intersection of the two word sets in sentences \(s_i\) and \(s_j\) to the size of their union. The similarity score of each sentence is then the corresponding row sum of the Jaccard similarity matrix:

\[ S_i=\sum_{j=1}^m J_{ij} \]
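In Python, the Jaccard index of a sentence pair can be sketched as follows, treating each sentence as a set of lowercased, whitespace-split words:

```python
def jaccard(s1, s2):
    """Jaccard similarity between two sentences treated as word sets."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b)

print(jaccard("big data is here", "big data is hype"))  # 3 shared words / 5 total -> 0.6
```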

23.3. Generating the summary#

Once the row sums are obtained, they are sorted in decreasing order and the summary is the \(n\) sentences with the largest \(S_i\) values.

%%R
# FUNCTION TO RETURN n SENTENCE SUMMARY
# Input: array of sentences (text)
# Output: n most common intersecting sentences
text_summary = function(text, n) {
  m = length(text)            # number of sentences in the input
  jaccard = matrix(0, m, m)   # pairwise Jaccard similarities
  for (i in 1:m) {
    for (j in i:m) {
      aa = unlist(strsplit(text[i], " "))
      bb = unlist(strsplit(text[j], " "))
      jaccard[i,j] = length(intersect(aa, bb)) / length(union(aa, bb))
      jaccard[j,i] = jaccard[i,j]
    }
  }
  similarity_score = rowSums(jaccard)
  res = sort(similarity_score, index.return=TRUE, decreasing=TRUE)
  idx = res$ix[1:n]
  return(text[idx])
}
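For readers who prefer Python, the R function above translates directly. This is a sketch; note that the diagonal terms \(J_{ii}=1\) are included in each row sum, exactly as in the R loop:

```python
def text_summary(sentences, n):
    """Score each sentence by its Jaccard row sum and return the top n."""
    def jaccard(a, b):
        wa, wb = set(a.split()), set(b.split())
        return len(wa & wb) / len(wa | wb)
    # row sums of the (implicit) m x m Jaccard matrix, diagonal included
    scores = [sum(jaccard(s, t) for t in sentences) for s in sentences]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in top]
```

As in the R version, sentences that share vocabulary with many other sentences score highest and are taken to be the most representative.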

23.4. One Function to Rule All Text in R#

Also, a quick introduction to the tm package in R: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

Install it if needed from the command line with conda install -c r r-tm, or install it as shown below.

%%R
install.packages("tm", quiet=TRUE)
# ! conda install -c conda-forge r-tm -y
# ! conda install -c r r-tm -y
also installing the dependencies ‘NLP’, ‘slam’, ‘BH’
%%R
library(tm)
library(stringr)
# READ IN TEXT FOR ANALYSIS; RETURN A CORPUS, A TEXT ARRAY, OR A FLAT STRING
# cstem=1 if stemming is needed
# cstop=1 if stopwords are to be removed
# ccase=1 for lower case, ccase=2 for upper case
# cpunc=1 if punctuation is to be removed
# cflat=1 for flat text, cflat=2 for a text array; otherwise a corpus is returned
read_web_page = function(url,cstem=0,cstop=0,ccase=0,cpunc=0,cflat=0) {
    text = readLines(url)
    text = text[setdiff(seq(1,length(text)),grep("<",text))]
    text = text[setdiff(seq(1,length(text)),grep(">",text))]
    text = text[setdiff(seq(1,length(text)),grep("]",text))]
    text = text[setdiff(seq(1,length(text)),grep("}",text))]
    text = text[setdiff(seq(1,length(text)),grep("_",text))]
    text = text[setdiff(seq(1,length(text)),grep("\\/",text))]
    ctext = Corpus(VectorSource(text))
    if (cstem==1) { ctext = tm_map(ctext, stemDocument) }
    if (cstop==1) { ctext = tm_map(ctext, removeWords, stopwords("english")) }
    if (cpunc==1) { ctext = tm_map(ctext, removePunctuation) }
    # tolower/toupper are not tm transformations, so wrap them in content_transformer
    if (ccase==1) { ctext = tm_map(ctext, content_transformer(tolower)) }
    if (ccase==2) { ctext = tm_map(ctext, content_transformer(toupper)) }
    text = ctext
    #CONVERT FROM CORPUS IF NEEDED
    if (cflat>0) {
        text = NULL
        for (j in 1:length(ctext)) {
            temp = ctext[[j]]$content
            if (temp!="") { text = c(text,temp) }
        }
        text = as.array(text)
    }
    if (cflat==1) {
        text = paste(text,collapse="\n")
        text = str_replace_all(text, "[\r\n]" , " ")
    }
    result = text
}
Loading required package: NLP
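The same crude line-filtering idea can be sketched in Python. This is a minimal sketch, not the book's code: it drops any line containing markup-like characters, as read_web_page does, with hypothetical lower/strip_punct flags standing in for ccase/cpunc:

```python
import re

def clean_lines(lines, lower=False, strip_punct=False):
    """Keep only lines free of markup-like characters, mirroring read_web_page."""
    # drop lines containing any of < > ] } _ / (crude markup detector)
    kept = [ln for ln in lines if not re.search(r'[<>\]\}_/]', ln)]
    if strip_punct:
        kept = [re.sub(r'[^\w\s]', '', ln) for ln in kept]
    if lower:
        kept = [ln.lower() for ln in kept]
    return [ln for ln in kept if ln.strip()]
```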

23.5. Example: Summarization#

We will use a sample of text that I took from Bloomberg news. It is about the need for data scientists.

%%R
url = "NLP_data/dstext_sample.txt"   #You can put any text file or URL here
text = read_web_page(url,cstem=0,cstop=0,ccase=0,cpunc=0,cflat=1)
print(length(text[[1]]))
[1] 1
text = %Rget text
text = text[0]
print(textwrap.fill(text, width=80))
THERE HAVE BEEN murmurings that we are now in the “trough of disillusionment” of
big data, the hype around it having surpassed the reality of what it can
deliver.  Gartner suggested that the “gravitational pull of big data is now so
strong that even people who haven’t a clue as to what it’s all about report that
they’re running big data projects.”  Indeed, their research with business
decision makers suggests that organisations are struggling to get value from big
data. Data scientists were meant to be the answer to this issue. Indeed, Hal
Varian, Chief Economist at Google famously joked that “The sexy job in the next
10 years will be statisticians.” He was clearly right as we are now used to
hearing that data scientists are the key to unlocking the value of big data.
This has created a huge market for people with these skills. US recruitment
agency, Glassdoor, report that the average salary for a data scientist is
$118,709 versus $64,537 for a skilled programmer. And a McKinsey study predicts
that by 2018, the United States alone faces a shortage of 140,000 to 190,000
people with analytical expertise and a 1.5 million shortage of managers with the
skills to understand and make decisions based on analysis of big data.  It’s no
wonder that companies are keen to employ data scientists when, for example, a
retailer using big data can reportedly increase their margin by more than 60%.
However, is it really this simple? Can data scientists actually justify earning
their salaries when brands seem to be struggling to realize the promise of big
data? Perhaps we are expecting too much of data scientists. May be we are
investing too much in a relatively small number of individuals rather than
thinking about how we can design organisations to help us get the most from data
assets. The focus on the data scientist often implies a centralized approach to
analytics and decision making; we implicitly assume that a small team of highly
skilled individuals can meet the needs of the organisation as a whole. This
theme of centralized vs. decentralized decision-making is one that has long been
debated in the management literature.  For many organisations a centralized
structure helps maintain control over a vast international operation, plus
ensures consistency of customer experience. Others, meanwhile, may give managers
at a local level decision-making power particularly when it comes to tactical
needs.   But the issue urgently needs revisiting in the context of big data as
the way in which organisations manage themselves around data may well be a key
factor for brands in realizing the value of their data assets. Economist and
philosopher Friedrich Hayek took the view that organisations should consider the
purpose of the information itself. Centralized decision-making can be more cost-
effective and co-ordinated, he believed, but decentralization can add speed and
local information that proves more valuable, even if the bigger picture is less
clear.  He argued that organisations thought too highly of centralized
knowledge, while ignoring ‘knowledge of the particular circumstances of time and
place’. But it is only relatively recently that economists are starting to
accumulate data that allows them to gauge how successful organisations organize
themselves. One such exercise reported by Tim Harford was carried out by Harvard
Professor Julie Wulf and the former chief economist of the International
Monetary Fund, Raghuram Rajan. They reviewed the workings of large US
organisations over fifteen years from the mid-80s. What they found was
successful companies were often associated with a move towards decentralisation,
often driven by globalisation and the need to react promptly to a diverse and
swiftly-moving range of markets, particularly at a local level. Their research
indicated that decentralisation pays. And technological advancement often goes
hand-in-hand with decentralization. Data analytics is starting to filter down to
the department layer, where executives are increasingly eager to trawl through
the mass of information on offer. Cloud computing, meanwhile, means that line
managers no longer rely on IT teams to deploy computer resources. They can do it
themselves, in just minutes.  The decentralization trend is now impacting on
technology spending. According to Gartner, chief marketing officers have been
given the same purchasing power in this area as IT managers and, as their
spending rises, so that of data centre managers is falling. Tim Harford makes a
strong case for the way in which this decentralization is important given that
the environment in which we operate is so unpredictable. Innovation typically
comes, he argues from a “swirling mix of ideas not from isolated minds.” And he
cites Jane Jacobs, writer on urban planning– who suggested we find innovation in
cities rather than on the Pacific islands. But this approach is not necessarily
always adopted. For example, research by academics Donald Marchand and Joe
Peppard discovered that there was still a tendency for brands to approach big
data projects the same way they would existing IT projects: i.e. using
centralized IT specialists with a focus on building and deploying technology on
time, to plan, and within budget. The problem with a centralized ‘IT-style’
approach is that it ignores the human side of the process of considering how
people create and use information i.e. how do people actually deliver value from
data assets. Marchand and Peppard suggest (among other recommendations) that
those who need to be able to create meaning from data should be at the heart of
any initiative. As ever then, the real value from data comes from asking the
right questions of the data. And the right questions to ask only emerge if you
are close enough to the business to see them. Are data scientists earning their
salary? In my view they are a necessary but not sufficient part of the solution;
brands need to be making greater investment in working with a greater range of
users to help them ask questions of the data. Which probably means that data
scientists’ salaries will need to take a hit in the process.
%%R
text2 = strsplit(text, ". ", fixed=TRUE)  # split on ". "; fixed=TRUE treats the period literally
text2 = text2[[1]]
print(text2)
 [1] "THERE HAVE BEEN murmurings that we are now in the “trough of disillusionment” of big data, the hype around it having surpassed the reality of what it can deliver"                                                                                                                                                     
 [2] " Gartner suggested that the “gravitational pull of big data is now so strong that even people who haven’t a clue as to what it’s all about report that they’re running big data projects.”  Indeed, their research with business decision makers suggests that organisations are struggling to get value from big data"
 [3] "Data scientists were meant to be the answer to this issue"                                                                                                                                                                                                                                                             
 [4] "Indeed, Hal Varian, Chief Economist at Google famously joked that “The sexy job in the next 10 years will be statisticians.” He was clearly right as we are now used to hearing that data scientists are the key to unlocking the value of big data"                                                                   
 [5] "This has created a huge market for people with these skills"                                                                                                                                                                                                                                                           
 [6] "US recruitment agency, Glassdoor, report that the average salary for a data scientist is $118,709 versus $64,537 for a skilled programmer"                                                                                                                                                                             
 [7] "And a McKinsey study predicts that by 2018, the United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and a 1.5 million shortage of managers with the skills to understand and make decisions based on analysis of big data"                                                     
 [8] " It’s no wonder that companies are keen to employ data scientists when, for example, a retailer using big data can reportedly increase their margin by more than 60%"                                                                                                                                                  
 [9] " However, is it really this simple? Can data scientists actually justify earning their salaries when brands seem to be struggling to realize the promise of big data? Perhaps we are expecting too much of data scientists"                                                                                            
[10] "May be we are investing too much in a relatively small number of individuals rather than thinking about how we can design organisations to help us get the most from data assets"                                                                                                                                      
[11] "The focus on the data scientist often implies a centralized approach to analytics and decision making; we implicitly assume that a small team of highly skilled individuals can meet the needs of the organisation as a whole"                                                                                         
[12] "This theme of centralized vs"                                                                                                                                                                                                                                                                                          
[13] "decentralized decision-making is one that has long been debated in the management literature"                                                                                                                                                                                                                          
[14] " For many organisations a centralized structure helps maintain control over a vast international operation, plus ensures consistency of customer experience"                                                                                                                                                           
[15] "Others, meanwhile, may give managers at a local level decision-making power particularly when it comes to tactical needs"                                                                                                                                                                                              
[16] "  But the issue urgently needs revisiting in the context of big data as the way in which organisations manage themselves around data may well be a key factor for brands in realizing the value of their data assets"                                                                                                  
[17] "Economist and philosopher Friedrich Hayek took the view that organisations should consider the purpose of the information itself"                                                                                                                                                                                      
[18] "Centralized decision-making can be more cost-effective and co-ordinated, he believed, but decentralization can add speed and local information that proves more valuable, even if the bigger picture is less clear"                                                                                                    
[19] " He argued that organisations thought too highly of centralized knowledge, while ignoring ‘knowledge of the particular circumstances of time and place’"                                                                                                                                                               
[20] "But it is only relatively recently that economists are starting to accumulate data that allows them to gauge how successful organisations organize themselves"                                                                                                                                                         
[21] "One such exercise reported by Tim Harford was carried out by Harvard Professor Julie Wulf and the former chief economist of the International Monetary Fund, Raghuram Rajan"                                                                                                                                           
[22] "They reviewed the workings of large US organisations over fifteen years from the mid-80s"                                                                                                                                                                                                                              
[23] "What they found was successful companies were often associated with a move towards decentralisation, often driven by globalisation and the need to react promptly to a diverse and swiftly-moving range of markets, particularly at a local level"                                                                     
[24] "Their research indicated that decentralisation pays"                                                                                                                                                                                                                                                                   
[25] "And technological advancement often goes hand-in-hand with decentralization"                                                                                                                                                                                                                                           
[26] "Data analytics is starting to filter down to the department layer, where executives are increasingly eager to trawl through the mass of information on offer"                                                                                                                                                          
[27] "Cloud computing, meanwhile, means that line managers no longer rely on IT teams to deploy computer resources"                                                                                                                                                                                                          
[28] "They can do it themselves, in just minutes"                                                                                                                                                                                                                                                                            
[29] " The decentralization trend is now impacting on technology spending"                                                                                                                                                                                                                                                   
[30] "According to Gartner, chief marketing officers have been given the same purchasing power in this area as IT managers and, as their spending rises, so that of data centre managers is falling"                                                                                                                         
[31] "Tim Harford makes a strong case for the way in which this decentralization is important given that the environment in which we operate is so unpredictable"                                                                                                                                                            
[32] "Innovation typically comes, he argues from a “swirling mix of ideas not from isolated minds.” And he cites Jane Jacobs, writer on urban planning– who suggested we find innovation in cities rather than on the Pacific islands"                                                                                       
[33] "But this approach is not necessarily always adopted"                                                                                                                                                                                                                                                                   
[34] "For example, research by academics Donald Marchand and Joe Peppard discovered that there was still a tendency for brands to approach big data projects the same way they would existing IT projects: i.e"                                                                                                              
[35] "using centralized IT specialists with a focus on building and deploying technology on time, to plan, and within budget"                                                                                                                                                                                                
[36] "The problem with a centralized ‘IT-style’ approach is that it ignores the human side of the process of considering how people create and use information i.e"                                                                                                                                                          
[37] "how do people actually deliver value from data assets"                                                                                                                                                                                                                                                                 
[38] "Marchand and Peppard suggest (among other recommendations) that those who need to be able to create meaning from data should be at the heart of any initiative"                                                                                                                                                        
[39] "As ever then, the real value from data comes from asking the right questions of the data"                                                                                                                                                                                                                              
[40] "And the right questions to ask only emerge if you are close enough to the business to see them"                                                                                                                                                                                                                        
[41] "Are data scientists earning their salary? In my view they are a necessary but not sufficient part of the solution; brands need to be making greater investment in working with a greater range of users to help them ask questions of the data"                                                                        
[42] "Which probably means that data scientists’ salaries will need to take a hit in the process."                                                                                                                                                                                                                           
%%R
res = text_summary(text2,5)
print(res)
[1] " Gartner suggested that the “gravitational pull of big data is now so strong that even people who haven’t a clue as to what it’s all about report that they’re running big data projects.”  Indeed, their research with business decision makers suggests that organisations are struggling to get value from big data"
[2] "The focus on the data scientist often implies a centralized approach to analytics and decision making; we implicitly assume that a small team of highly skilled individuals can meet the needs of the organisation as a whole"                                                                                         
[3] "May be we are investing too much in a relatively small number of individuals rather than thinking about how we can design organisations to help us get the most from data assets"                                                                                                                                      
[4] "The problem with a centralized ‘IT-style’ approach is that it ignores the human side of the process of considering how people create and use information i.e"                                                                                                                                                          
[5] "Which probably means that data scientists’ salaries will need to take a hit in the process."                                                                                                                                                                                                                           

23.6. Text Summarization with Python#

This is an approach that distills a document down to its most important sentences. The idea is simple: keep only the essence of the document. The customer use case is a reader facing too much text, for whom a short, pithy version is valuable.

If you do not already have a clean document in hand, the examples below download an article using SelectorGadget and Beautiful Soup and extract its text. The summarizer itself assumes the article is clean, flat text.

https://www.dataquest.io/blog/web-scraping-tutorial-python/

Install these if needed:

!pip install lxml
!pip install cssselect
!pip install nltk
Requirement already satisfied: lxml in /usr/local/lib/python3.12/dist-packages (5.4.0)
Collecting cssselect
  Downloading cssselect-1.3.0-py3-none-any.whl.metadata (2.6 kB)
Downloading cssselect-1.3.0-py3-none-any.whl (18 kB)
Installing collected packages: cssselect
Successfully installed cssselect-1.3.0
Requirement already satisfied: nltk in /usr/local/lib/python3.12/dist-packages (3.9.1)
Requirement already satisfied: click in /usr/local/lib/python3.12/dist-packages (from nltk) (8.2.1)
Requirement already satisfied: joblib in /usr/local/lib/python3.12/dist-packages (from nltk) (1.5.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.12/dist-packages (from nltk) (2024.11.6)
Requirement already satisfied: tqdm in /usr/local/lib/python3.12/dist-packages (from nltk) (4.67.1)
# Read in the news article from the URL and extract only the title and text of the article.
# Some examples provided below.

import requests
from lxml.html import fromstring
url = "https://www.allthingsdistributed.com/2025/10/better-with-age.html"
# url = "https://www.theverge.com/2023/10/4/23903986/sam-bankman-fried-opening-statements-trial-fraud"
# url = "https://www.nytimes.com/2023/10/03/us/politics/kevin-mccarthy-speaker.html"
# url = "https://www.theatlantic.com/technology/archive/2022/04/doxxing-meaning-libs-of-tiktok/629643/"
# url = 'https://economictimes.indiatimes.com/news/economy/policy/a-tax-cut-for-you-in-budget-wont-give-india-the-boost-it-needs/articleshow/73476138.cms?utm_source=Colombia&utm_medium=C1&utm_campaign=CTN_ET_hp&utm_content=18'
resp = requests.get(url, timeout=10)
resp.encoding = resp.apparent_encoding  # guard against mojibake from a mis-declared charset
html = resp.text

#See: http://infohost.nmt.edu/~shipman/soft/pylxml/web/etree-fromstring.html
doc = fromstring(html)

#http://lxml.de/cssselect.html#the-cssselect-method
doc.cssselect("section") # all things
# doc.cssselect(".lg\:max-w-none") # verge
# doc.cssselect(".evys1bk0") # nytimes
# doc.cssselect(".Normal")  #economic times
# doc.cssselect(".ArticleParagraph_root__wy3UI")   #Atlantic
[<Element section at 0x7906c8f83ac0>]
#economic times
# x = doc.cssselect(".Normal")
# news = x[0].text_content()
# print(news)

# Verge
# x = doc.cssselect(".lg\:max-w-none")

#nytimes
# x = doc.cssselect(".StoryBodyCompanionColumn")

# Atlantic
# x = doc.cssselect(".ArticleParagraph_root__wy3UI")

# All things
x = doc.cssselect("section")

news = " ".join([x[j].text_content() for j in range(len(x))])
news
'Development gets better with AgeOctober 01, 2025 • 856 wordsHe has heard the whispers, “he is getting older, who will replace him?” People asking him with a straight face, “when will you retire?” After close to 25 years at Amazon, where each year has been different and amazing, He feels as young as the day he decided to leave academia and join Amazon.The thing about getting older as a developer, is that you have seen a lot and encountered many of the problems younger developers are facing (even if they look a little different on first glance). If you’ve been around the block as many times as some of us have, you’ll have earned battle scars along the way. There are days in war rooms you will never forget. You have experimented a lot, and you have failed more times than you care to remember. You have half-a-head full of what is practical and works. And a quarter of that space has been trained to look for red flags, scanning for things that you know will go wrong.What’s left in your head is used for creativity. Taking in all sorts of signals, building mental models, and coming up with new unique solutions. It’s the best part of our job. As developers, every day we get to create something new. Let that sink in for a second. Who else gets to do that? And that’s why I never take it for granted.As an older developer, you’ve also seen patterns repeat themselves… constantly. Companies promising the moon but only delivering a package of Swiss cheese.And along comes AI. Not the AI you’ve been using for the last 15-20 years: NLP, voice-to-text, text-to-speech, translation, image recognition, recommendations, fraud detection, all the things that Amazon.com was built on. No, we’re talking about generative AI, which even as an older developer, I’ll admit is really exciting. The speed of experimentation has dramatically increased. 
In the hands of a seasoned builder with a healthy dose of scepticism, it is powerful. But it’s also been challenging, because it wasn’t released like other technologies. No one educated users before release. The magic was just let out of the bottle, and because it was so unexpected, the hype absolutely exploded. And this feels strange to us, because we’ve been used to seeing our software evolve with minor version bumps that take a year or more to come out. It took 2 years for Windows 3 to reach Windows 3.1. And Mac OS X made minor version bumps from 2001 to 2019 before it started doing major version bumps each year. But it seems like every week models swap places on the leaderboard with each new version they release.AWS has always been a B2B company. We’ve always provided the building blocks that allow other companies to innovate for their customers (S3, EC2, DynamoDB, Lambda, DSQL). Yet amidst the hype, we were suddenly being compared to B2C companies. It was frustrating. But experience had taught us what to do. We went back to our roots, democratizing access to technology (models in this case), giving customers choice, keeping privacy and security as our top priorities, providing the guardrails companies need for safety and compliance, and leveraging automated reasoning to reduce potential model errors. That’s the value of having seen patterns repeat over decades - you know which ones work.The older developer isn’t worried about the barrage of new model announcements and feature releases that come out every week. He’s seen that before. New tech, same patterns.After all, over the past decades the older developer has probably learned more than 10 programming languages, tons of OSS libraries, and more platforms than he cares to remember. He was always keeping track of technology trends, reading papers, studying new directions, because that was the fun part of the job (you know, learning things). 
The older developer made sure he was fully prepared when his company was ready to start attacking problems where generative AI is uniquely suited. He’s also read Marc Brooker’s fantastic article about LLM-driven development, and will probably follow his advice.Almost every customer I speak with asks: “What should we be doing with gen AI?” The best response I’ve seen so far is from Byron Cook, one of our brilliant scientists: “Sorry for not answering your question immediately, but why did you ask me this question?”You’ll find that 90% of the answers you get back are not because they think generative AI will solve a specific problem that their business is encountering, but because they’re anxious. That they have very strong feelings of FOMO (the fear of missing out).And the older developer knows that this is exactly the time to press the pause button. To take a beat. He motivates juniors to get educated on the pros and cons, and that board & C-Suite read books like Jeff Lawson “Ask Your developer”.Then you do exactly what you’ve always done. Have an in-depth conversation with your customer, listen, dive deep into their challenges, suggest architectures, migrations, and tools. And sometimes, the solution will be generative AI.But as an older developer, you already knew this.Now, go build!'

Make sure the extracted text is a string. Use BeautifulSoup to strip any residual markup, then split the article into a list of individual sentences with NLTK's sentence tokenizer.

from bs4 import BeautifulSoup
news = BeautifulSoup(news,'lxml').get_text()
print(textwrap.fill(news, width=80))
type(news)
Development gets better with AgeOctober 01, 2025 • 856 wordsHe has heard the
whispers, “he is getting older, who will replace him?” People asking him
with a straight face, “when will you retire?” After close to 25 years at
Amazon, where each year has been different and amazing, He feels as young as the
day he decided to leave academia and join Amazon.The thing about getting older
as a developer, is that you have seen a lot and encountered many of the problems
younger developers are facing (even if they look a little different on first
glance). If you’ve been around the block as many times as some of us have,
you’ll have earned battle scars along the way. There are days in war rooms you
will never forget. You have experimented a lot, and you have failed more times
than you care to remember. You have half-a-head full of what is practical and
works. And a quarter of that space has been trained to look for red flags,
scanning for things that you know will go wrong.What’s left in your head is
used for creativity. Taking in all sorts of signals, building mental models, and
coming up with new unique solutions. It’s the best part of our job. As
developers, every day we get to create something new. Let that sink in for a
second. Who else gets to do that? And that’s why I never take it for
granted.As an older developer, you’ve also seen patterns repeat themselves…
constantly. Companies promising the moon but only delivering a package of Swiss
cheese.And along comes AI. Not the AI you’ve been using for the last 15-20
years: NLP, voice-to-text, text-to-speech, translation, image recognition,
recommendations, fraud detection, all the things that Amazon.com was built on.
No, we’re talking about generative AI, which even as an older developer,
I’ll admit is really exciting. The speed of experimentation has dramatically
increased. In the hands of a seasoned builder with a healthy dose of scepticism,
it is powerful. But it’s also been challenging, because it wasn’t released
like other technologies. No one educated users before release. The magic was
just let out of the bottle, and because it was so unexpected, the hype
absolutely exploded. And this feels strange to us, because we’ve been used to
seeing our software evolve with minor version bumps that take a year or more to
come out. It took 2 years for Windows 3 to reach Windows 3.1. And Mac OS X made
minor version bumps from 2001 to 2019 before it started doing major version
bumps each year. But it seems like every week models swap places on the
leaderboard with each new version they release.AWS has always been a B2B
company. We’ve always provided the building blocks that allow other companies
to innovate for their customers (S3, EC2, DynamoDB, Lambda, DSQL). Yet amidst
the hype, we were suddenly being compared to B2C companies. It was frustrating.
But experience had taught us what to do. We went back to our roots,
democratizing access to technology (models in this case), giving customers
choice, keeping privacy and security as our top priorities, providing the
guardrails companies need for safety and compliance, and leveraging automated
reasoning to reduce potential model errors. That’s the value of having seen
patterns repeat over decades - you know which ones work.The older developer
isn’t worried about the barrage of new model announcements and feature
releases that come out every week. He’s seen that before. New tech, same
patterns.After all, over the past decades the older developer has probably
learned more than 10 programming languages, tons of OSS libraries, and more
platforms than he cares to remember. He was always keeping track of technology
trends, reading papers, studying new directions, because that was the fun part
of the job (you know, learning things). The older developer made sure he was
fully prepared when his company was ready to start attacking problems where
generative AI is uniquely suited. He’s also read Marc Brooker’s fantastic
article about LLM-driven development, and will probably follow his advice.Almost
every customer I speak with asks: “What should we be doing with gen AI?” The
best response I’ve seen so far is from Byron Cook, one of our brilliant
scientists: “Sorry for not answering your question immediately, but why did
you ask me this question?”You’ll find that 90% of the answers you get back
are not because they think generative AI will solve a specific problem that
their business is encountering, but because they’re anxious. That they have
very strong feelings of FOMO (the fear of missing out).And the older developer
knows that this is exactly the time to press the pause button. To take a beat.
He motivates juniors to get educated on the pros and cons, and that board &
C-Suite read books like Jeff Lawson “Ask Your developer”.Then you do exactly
what you’ve always done. Have an in-depth conversation with your customer,
listen, dive deep into their challenges, suggest architectures, migrations, and
tools. And sometimes, the solution will be generative AI.But as an older
developer, you already knew this.Now, go build!
str
import nltk
nltk.download("punkt")
nltk.download("punkt_tab")
from nltk.tokenize import sent_tokenize   # To get separate sentences
sentences = sent_tokenize(news)
print("Number of sentences =", len(sentences))
for s in sentences:
    print(textwrap.fill(s, width=80), end="\n\n")
Number of sentences = 40
Development gets better with AgeOctober 01, 2025 • 856 wordsHe has heard the
whispers, “he is getting older, who will replace him?” People asking him
with a straight face, “when will you retire?” After close to 25 years at
Amazon, where each year has been different and amazing, He feels as young as the
day he decided to leave academia and join Amazon.The thing about getting older
as a developer, is that you have seen a lot and encountered many of the problems
younger developers are facing (even if they look a little different on first
glance).

If you’ve been around the block as many times as some of us have, you’ll
have earned battle scars along the way.

There are days in war rooms you will never forget.

You have experimented a lot, and you have failed more times than you care to
remember.

You have half-a-head full of what is practical and works.

And a quarter of that space has been trained to look for red flags, scanning for
things that you know will go wrong.What’s left in your head is used for
creativity.

Taking in all sorts of signals, building mental models, and coming up with new
unique solutions.

It’s the best part of our job.

As developers, every day we get to create something new.

Let that sink in for a second.

Who else gets to do that?

And that’s why I never take it for granted.As an older developer, you’ve
also seen patterns repeat themselves… constantly.

Companies promising the moon but only delivering a package of Swiss cheese.And
along comes AI.

Not the AI you’ve been using for the last 15-20 years: NLP, voice-to-text,
text-to-speech, translation, image recognition, recommendations, fraud
detection, all the things that Amazon.com was built on.

No, we’re talking about generative AI, which even as an older developer,
I’ll admit is really exciting.

The speed of experimentation has dramatically increased.

In the hands of a seasoned builder with a healthy dose of scepticism, it is
powerful.

But it’s also been challenging, because it wasn’t released like other
technologies.

No one educated users before release.

The magic was just let out of the bottle, and because it was so unexpected, the
hype absolutely exploded.

And this feels strange to us, because we’ve been used to seeing our software
evolve with minor version bumps that take a year or more to come out.

It took 2 years for Windows 3 to reach Windows 3.1.

And Mac OS X made minor version bumps from 2001 to 2019 before it started doing
major version bumps each year.

But it seems like every week models swap places on the leaderboard with each new
version they release.AWS has always been a B2B company.

We’ve always provided the building blocks that allow other companies to
innovate for their customers (S3, EC2, DynamoDB, Lambda, DSQL).

Yet amidst the hype, we were suddenly being compared to B2C companies.

It was frustrating.

But experience had taught us what to do.

We went back to our roots, democratizing access to technology (models in this
case), giving customers choice, keeping privacy and security as our top
priorities, providing the guardrails companies need for safety and compliance,
and leveraging automated reasoning to reduce potential model errors.

That’s the value of having seen patterns repeat over decades - you know which
ones work.The older developer isn’t worried about the barrage of new model
announcements and feature releases that come out every week.

He’s seen that before.

New tech, same patterns.After all, over the past decades the older developer has
probably learned more than 10 programming languages, tons of OSS libraries, and
more platforms than he cares to remember.

He was always keeping track of technology trends, reading papers, studying new
directions, because that was the fun part of the job (you know, learning
things).

The older developer made sure he was fully prepared when his company was ready
to start attacking problems where generative AI is uniquely suited.

He’s also read Marc Brooker’s fantastic article about LLM-driven
development, and will probably follow his advice.Almost every customer I speak
with asks: “What should we be doing with gen AI?” The best response I’ve
seen so far is from Byron Cook, one of our brilliant scientists: “Sorry for
not answering your question immediately, but why did you ask me this
question?”You’ll find that 90% of the answers you get back are not because
they think generative AI will solve a specific problem that their business is
encountering, but because they’re anxious.

That they have very strong feelings of FOMO (the fear of missing out).And the
older developer knows that this is exactly the time to press the pause button.

To take a beat.

He motivates juniors to get educated on the pros and cons, and that board &
C-Suite read books like Jeff Lawson “Ask Your developer”.Then you do exactly
what you’ve always done.

Have an in-depth conversation with your customer, listen, dive deep into their
challenges, suggest architectures, migrations, and tools.

And sometimes, the solution will be generative AI.But as an older developer, you
already knew this.Now, go build!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
# Python Summarizer
import re
import numpy as np

# Pass in a list of sentences; returns an n-sentence summary and the n most anomalous sentences
def text_summarizer(sentences, n_summary):
    n = len(sentences)
    words = [set(re.split('[ ,.]', s)) for s in sentences]   # word set per sentence
    jaccsim = np.zeros((n, n))   # pairwise Jaccard similarity matrix
    for i in range(n):
        for j in range(i, n):
            jaccsim[i, j] = len(words[i] & words[j]) / len(words[i] | words[j])
            jaccsim[j, i] = jaccsim[i, j]
    # Summary: sentences with the largest row sums (most overlap with the rest)
    idx = np.argsort(jaccsim.sum(axis=0))[::-1][:n_summary]  # reverse sort
    summary = [sentences[j] for j in idx]
    # Anomalies: sentences with the smallest row sums
    idx = np.argsort(jaccsim.sum(axis=0))[:n_summary]
    anomalies = [sentences[j] for j in idx]
    return summary, anomalies
# Get the summary and the anomaly sentences
summary, anomalies = text_summarizer(sentences, int(len(sentences)/4))
summ = "  ".join(summary)
print(textwrap.fill(summ, width=80))
It’s the best part of our job.  In the hands of a seasoned builder with a
healthy dose of scepticism, it is powerful.  That they have very strong feelings
of FOMO (the fear of missing out).And the older developer knows that this is
exactly the time to press the pause button.  And a quarter of that space has
been trained to look for red flags, scanning for things that you know will go
wrong.What’s left in your head is used for creativity.  The magic was just let
out of the bottle, and because it was so unexpected, the hype absolutely
exploded.  Let that sink in for a second.  You have experimented a lot, and you
have failed more times than you care to remember.  And sometimes, the solution
will be generative AI.But as an older developer, you already knew this.Now, go
build!  You have half-a-head full of what is practical and works.  That’s the
value of having seen patterns repeat over decades - you know which ones work.The
older developer isn’t worried about the barrage of new model announcements and
feature releases that come out every week.
for a in anomalies:
    print(a)
Who else gets to do that?
Have an in-depth conversation with your customer, listen, dive deep into their challenges, suggest architectures, migrations, and tools.
We went back to our roots, democratizing access to technology (models in this case), giving customers choice, keeping privacy and security as our top priorities, providing the guardrails companies need for safety and compliance, and leveraging automated reasoning to reduce potential model errors.
There are days in war rooms you will never forget.
He’s also read Marc Brooker’s fantastic article about LLM-driven development, and will probably follow his advice.Almost every customer I speak with asks: “What should we be doing with gen AI?” The best response I’ve seen so far is from Byron Cook, one of our brilliant scientists: “Sorry for not answering your question immediately, but why did you ask me this question?”You’ll find that 90% of the answers you get back are not because they think generative AI will solve a specific problem that their business is encountering, but because they’re anxious.
No one educated users before release.
But it’s also been challenging, because it wasn’t released like other technologies.
And Mac OS X made minor version bumps from 2001 to 2019 before it started doing major version bumps each year.
Not the AI you’ve been using for the last 15-20 years: NLP, voice-to-text, text-to-speech, translation, image recognition, recommendations, fraud detection, all the things that Amazon.com was built on.
No, we’re talking about generative AI, which even as an older developer, I’ll admit is really exciting.
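
As a quick sanity check of the similarity measure above, the Jaccard index of two toy sentences (invented here purely for illustration) can be computed directly:

```python
import re

def jaccard(s1, s2):
    # Jaccard index: |intersection| / |union| of the two word sets
    a = set(re.split('[ ,.]', s1))
    b = set(re.split('[ ,.]', s2))
    return len(a & b) / len(a | b)

# "the", "sat", "on" are shared: 3 common words out of 7 distinct
print(jaccard("the cat sat on the mat", "the dog sat on the log"))
```

A sentence that shares many words with the rest of the document accumulates a high row sum in the Jaccard matrix, which is why the highest-scoring sentences serve as the summary and the lowest-scoring ones as the anomalies.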

23.7. Modern Methods#

!pip install transformers
from transformers import pipeline
summarizer = pipeline("summarization")
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
# All in one example
import requests
from lxml.html import fromstring
html = requests.get(url, timeout=10).text
doc = fromstring(html)
# Choose the CSS selector that matches the article body on the target site:
# x = doc.cssselect(".ArticleParagraph_root__wy3UI")
# x = doc.cssselect(".lg\:max-w-none")
x = doc.cssselect("p")  # generic fallback (assumed here); the right selector is site-specific
news = " ".join([x[j].text_content() for j in range(len(x))])
news = BeautifulSoup(news,'lxml').get_text()
print(len(news))
if len(news) > 1024:   # crude truncation: the model limit is 1024 tokens, not characters
    news = news[:1024]
summ = summarizer(news, max_length=int(len(news)/4), min_length=25)
print(summ)
5145
[{'summary_text': ' Development gets better with age, says senior Amazon developer . He has heard the whispers, ” he is getting older, who will replace him?'}]

Try this additional blog post for more on the T5 (Text-to-Text Transfer Transformer) summarizer.

https://towardsdatascience.com/simple-abstractive-text-summarization-with-pretrained-t5-text-to-text-transfer-transformer-10f6d602c426

This is a nice web site explaining Hugging Face transformers: https://zenodo.org/record/3733180#.X40RxEJKjlx

And the paper: https://arxiv.org/pdf/1910.10683.pdf

And here is a nice application of the same: https://towardsdatascience.com/summarization-has-gotten-commoditized-thanks-to-bert-9bb73f2d6922

23.8. Long document summarization#

Summarizing a long document in one pass is not feasible because the model accepts only a limited input length, so we break the text into maximal chunk sizes and summarize it piecemeal.

import requests
from lxml.html import fromstring
html = requests.get(url, timeout=10).text
doc = fromstring(html)
# Choose the CSS selector that matches the article body on the target site:
# x = doc.cssselect(".ArticleParagraph_root__wy3UI")
# x = doc.cssselect(".lg\:max-w-none")
x = doc.cssselect("p")  # generic fallback (assumed here); the right selector is site-specific
news = " ".join([x[j].text_content() for j in range(len(x))])
news = BeautifulSoup(news,'lxml').get_text()
print("Size of article =",len(news)," | #Chunks =",int(len(news)/1024))
for j in range(0,len(news),1024):
    print(summarizer(news[j:j+1024], max_length=int(len(news)/4), min_length=25))
Your max_length is set to 1286, but your input_length is only 256. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=128)
Size of article = 5145  | #Chunks = 5
Your max_length is set to 1286, but your input_length is only 266. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=133)
[{'summary_text': ' Development gets better with age, says senior Amazon developer . He has heard the whispers, ” he is getting older, who will replace him?'}]
Your max_length is set to 1286, but your input_length is only 231. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=115)
[{'summary_text': ' As developers, every day we get to create something new. Who else gets to do that? And that’s why I never take it for granted. Generative AI is the best part of our job. It is powerful. In the hands of a seasoned builder, it is powerful .'}]
Your max_length is set to 1286, but your input_length is only 219. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=109)
[{'summary_text': ' The magic was just let out of the bottle, and because it was so unexpected, the hype exploded . This feels strange to us, because we’ve been used to seeing our software evolve with minor version bumps that take a year or more to come out .'}]
Your max_length is set to 1286, but your input_length is only 270. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=135)
[{'summary_text': ' The older developer isn’t worried about the barrage of new model announcements and feature releases that come out every week . The value of having seen patterns repeat over decades - you know which ones work .'}]
Your max_length is set to 1286, but your input_length is only 11. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)
[{'summary_text': ' An older developer knows this is exactly the time to press the pause button . 90% of the answers you get back are not because they think generative AI will solve a specific problem that their business is encountering, but because they’re anxious .'}]
[{'summary_text': " Now, go build!  knew this. Now, you're going to be able to build your dream home ."}]
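
One caveat: slicing at fixed character offsets can cut a sentence mid-word, which degrades the per-chunk summaries. A sentence-aware chunker is a simple improvement. The sketch below is not part of the notebook above; it greedily packs whole sentences (for example, the `sentences` list produced earlier) into chunks under the length limit:

```python
def chunk_sentences(sentences, max_chars=1024):
    # Greedily pack whole sentences into chunks of at most max_chars characters
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = (current + " " + s) if current else s
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be passed to the summarizer in turn, e.g.:
# for chunk in chunk_sentences(sentences):
#     print(summarizer(chunk, max_length=120, min_length=25))
```

A sentence longer than `max_chars` is kept whole rather than split, so it would still need truncation before being fed to the model.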