35. Natural Language Generation (NLG)#
NLP comprises NLU (natural language understanding) plus NLG (natural language generation). Whereas NLU has been around for quite some time, NLG has recently made huge strides with the creation of ultra-large models. The sizes of these models run into the trillions of parameters!
NLG with Transformers: https://huggingface.co/tftransformers/gpt2-large
Sanjiv: I have adapted the notebook to run in our Colab accounts. To run the notebook you will also need the file `ascii_bible.txt`, placed in the `NLP_data` folder.

For the leaderboard of the latest large language models, see https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard. Note, however, that this tracks only open LLMs; there are also several closed LLMs from OpenAI, Cohere, AI21, and Anthropic.
In this notebook we borrow code by Max Woolf to train a GPT-2 text-generating model using `gpt-2-simple`. (See the license in the next section.)
35.1. LICENSE#
MIT License
Copyright (c) 2019 Max Woolf
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
35.2. Recap of Transformers and LLMs#
This is an excellent video by Grant Sanderson (3Blue1Brown) that visualizes transformers and succinctly explains their inner workings, and those of LLMs more generally: https://www.youtube.com/watch?v=KJtZARuO3JY
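For readers who prefer code to video, here is a minimal NumPy sketch of the scaled dot-product attention step that the video visualizes (our own illustration for intuition, not part of the `gpt-2-simple` workflow below):

```python
# Minimal sketch of scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # similarity of each query to each key
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # weighted mix of the value vectors

# Toy example: 4 tokens, embedding dimension 8 (self-attention, so Q = K = V = X)
np.random.seed(0)
X = np.random.randn(4, 8)
print(scaled_dot_product_attention(X, X, X).shape)  # (4, 8)
```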
from google.colab import drive
drive.mount('/content/drive') # Add My Drive/<>
import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/NLPBook/')
Mounted at /content/drive
%%capture
%pylab inline
import pandas as pd
import os
from IPython.display import Image
35.3. gpt-2-simple#
Here is the repository: minimaxir/gpt-2-simple
# %tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files
Preparing metadata (setup.py) ... done
Building wheel for gpt-2-simple (setup.py) ... done
35.4. Check GPU#
You can verify which GPU is active by running the cell below.
!nvidia-smi
Thu Feb 27 04:05:02 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 35C P8 9W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
35.5. Getting GPT-2#
There are four sizes of GPT-2:

- `124M` (default): the “small” model, 500MB on disk.
- `355M`: the “medium” model, 1.5GB on disk.
- `774M`: the “large” model; it cannot currently be finetuned with Colaboratory but can be used to generate text from the pretrained model (see later in the notebook).
- `1558M`: the “extra large”, true model; it will not work if a K80/P4 GPU is attached to the notebook and, like `774M`, it cannot be finetuned.
Larger models have more knowledge, but take longer to finetune and longer to generate text. Specify which base model to use in the code block below.

The next cell downloads the model from Google Cloud Storage and saves it in the Colaboratory VM at `/models/<model_name>`. This model isn’t permanently saved in the Colaboratory VM; you’ll have to redownload it if you want to retrain it at a later time.
We use the smallest GPT-2 model to do a quick implementation.
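If you rerun this notebook in the same VM, you can skip the download when the model files are already present. A small sketch, assuming `gpt-2-simple`’s default `models/<model_name>` directory layout:

```python
# Only download the 124M model if it is not already in the VM's models/ folder.
import os

model_name = "124M"
if not os.path.isdir(os.path.join("models", model_name)):
    gpt2.download_gpt2(model_name=model_name)
```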
%%time
gpt2.download_gpt2(model_name="124M")
Fetching checkpoint: 1.05Mit [00:00, 4.48Git/s]
Fetching encoder.json: 1.05Mit [00:00, 2.55Mit/s]
Fetching hparams.json: 1.05Mit [00:00, 2.22Git/s]
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:43, 11.4Mit/s]
Fetching model.ckpt.index: 1.05Mit [00:00, 3.36Git/s]
Fetching model.ckpt.meta: 1.05Mit [00:00, 3.22Mit/s]
Fetching vocab.bpe: 1.05Mit [00:00, 3.46Mit/s]
CPU times: user 1.28 s, sys: 556 ms, total: 1.83 s
Wall time: 1min 2s
I have provided different examples of text on which to fine-tune below:
## EXAMPLE FILES
file_name = "NLP_data/ascii_bible.txt" # Generates biblical text
# file_name = "NLP_data/canterbury_tales_chaucer.txt" # Generates poetry like text
# file_name = "NLP_data/history_indian_philosophy.txt" # generates plain text
35.6. Finetune GPT-2#
The next cell will start the actual finetuning of GPT-2. It creates a persistent TensorFlow session which stores the training config, then runs the training for the specified number of `steps`. (To have the finetuning run indefinitely, set `steps = -1`.)
The model checkpoints will be saved in `/checkpoint/run1` by default. The checkpoints are saved every 500 steps (this can be changed) and when the cell is stopped.
The training might time out after 4ish hours; make sure you end training and save the results so you don’t lose them!
IMPORTANT NOTE: If you want to rerun this cell, restart the VM first (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.
Other optional-but-helpful parameters for `gpt2.finetune`:

- `restore_from`: Set to `fresh` to start training from the base GPT-2, or set to `latest` to restart training from an existing checkpoint.
- `sample_every`: Number of steps between printing example output.
- `print_every`: Number of steps between printing training progress.
- `learning_rate`: Learning rate for the training (default `1e-4`; can be lowered to `1e-5` if you have <1MB of input data).
- `run_name`: Subfolder within `checkpoint` in which to save the model. This is useful if you want to work with multiple models (you will also need to specify `run_name` when loading the model).
- `overwrite`: Set to `True` if you want to continue finetuning an existing model (with `restore_from='latest'`) without creating duplicate copies; a short sketch of resuming a run this way follows this list.
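For illustration only, here is a hedged sketch of how several of these options combine when you want to continue an earlier run rather than start fresh; the cell after it is the fresh run actually used in this notebook.

```python
# Sketch: resume finetuning an existing run named 'run1' (assumes checkpoint/run1 exists).
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset=file_name,
              model_name='124M',
              steps=50,                  # a few more steps on top of the earlier run
              restore_from='latest',     # pick up from the existing checkpoint
              run_name='run1',
              overwrite=True,            # keep a single copy of the checkpoint
              learning_rate=1e-5,        # lower rate, useful for small (<1MB) datasets
              print_every=10,
              save_every=25)
```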
%%time
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
dataset=file_name,
model_name='124M',
steps=150,
restore_from='fresh',
run_name='run1',
print_every=10,
sample_every=200,
save_every=50,
# reuse=True
)
Loading checkpoint models/124M/model.ckpt
Loading dataset...
100%|██████████| 1/1 [00:04<00:00, 4.75s/it]
dataset has 1564935 tokens
Training...
[10 | 26.05] loss=1.94 avg=1.94
[20 | 47.08] loss=2.02 avg=1.98
[30 | 68.42] loss=1.91 avg=1.96
[40 | 90.03] loss=1.98 avg=1.96
[50 | 111.90] loss=1.91 avg=1.95
Saving checkpoint/run1/model-50
[60 | 136.82] loss=1.83 avg=1.93
[70 | 159.45] loss=1.82 avg=1.91
[80 | 182.31] loss=2.03 avg=1.93
[90 | 205.16] loss=1.92 avg=1.93
[100 | 228.24] loss=1.84 avg=1.92
Saving checkpoint/run1/model-100
WARNING:tensorflow:From /usr/local/lib/python3.11/dist-packages/tensorflow/python/training/saver.py:1068: remove_checkpoint (from tensorflow.python.checkpoint.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.
[110 | 256.33] loss=1.83 avg=1.91
[120 | 279.74] loss=1.81 avg=1.90
[130 | 303.29] loss=1.88 avg=1.90
[140 | 327.03] loss=1.74 avg=1.89
[150 | 350.85] loss=1.94 avg=1.89
Saving checkpoint/run1/model-150
CPU times: user 3min 44s, sys: 37.3 s, total: 4min 21s
Wall time: 6min 15s
35.7. Save the model#
After the model is trained, you can copy the checkpoint folder to your own Google Drive. (Look for a folder called `checkpoints`.)

If you want to download it to your personal computer, it is strongly recommended that you copy it to Google Drive first, then download from Google Drive. The checkpoint folder is copied as a `.rar` compressed file; you can download it and uncompress it locally.
%%time
gpt2.copy_checkpoint_to_gdrive(run_name='run1')
CPU times: user 259 ms, sys: 1.21 s, total: 1.47 s
Wall time: 23 s
You’re done! Feel free to go to the Generate Text From The Trained Model section to generate text based on your retrained model.
35.8. Load a Trained Model Checkpoint#
Running the next cell will copy the `.rar` checkpoint file from your Google Drive into the Colaboratory VM.
gpt2.copy_checkpoint_from_gdrive(run_name='run1')
The next cell will allow you to load the retrained model checkpoint + metadata necessary to generate text.
IMPORTANT NOTE: If you want to rerun this cell, restart the VM first (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.
# sess = gpt2.start_tf_sess()
# gpt2.load_gpt2(sess, run_name='run1')
35.9. Generate Text From The Trained Model#
After you have trained the model or loaded a retrained model from a checkpoint, you can now generate text. `generate` generates a single text from the loaded model.
%%time
gpt2.generate(sess, run_name='run1')
that is good in the sight of the LORD.
008:002 But he is an austere man in the sight of the LORD.
008:003 He is a man of unrighteousness, and a foolish man in heart.
008:004 But the LORD is merciful in his sight; and the wicked in his
heart.
008:005 For the LORD make to the wicked an upright heart, so that he
may know that he is in need.
008:006 He is an upright man in the sight of the LORD.
008:007 He is a man of ignorance, in the sight of the LORD; and a
foolish man in heart.
008:008 He is a man of torments, even in the sight of the LORD.
008:009 He is a man of deceit, a rash man in the sight of the
LORD; and a false prophet in the sight of the LORD.
008:010 The LORD shall be merciful to the wicked, and he shall
be faithful to the righteous.
008:011 He will be an upright man in the sight of the LORD; and he
shall know that he is in need.
008:012 He is a man of good knowledge, in the sight of the LORD; and he
shall know that he is in need.
008:013 He is an upright man in the sight of the LORD; and he shall
know that he is in need.
008:014 He is an upright man in the sight of the LORD; and he shall know
that he is in need.
008:015 He is a man of good knowledge, in the sight of the LORD; and he
shall know that he is in need.
008:016 He is an upright man in the sight of the LORD; and he shall know
that he is in need.
008:017 He is a man of good knowledge, in the sight of the LORD; and he
shall know that he is in need.
008:018 He is an upright man in the sight of the LORD; and he shall know
that he is in need.
008:019 He is a man of knowledge, in the sight of the LORD; and he
shall know that he is in need.
008:020 He is a man of wickedness, in the sight of the LORD; and he
shall know that he is in need.
008:021 He is a man of wickedness, in the sight of the LORD; and he
shall know that he is in need.
008:022 He is a man of first righteousness, in the sight of the LORD; and
he shall know that he is in need.
008:023 He is a man of corruption, in the sight of the LORD; and he
shall know that he is in need.
008:024 He is a man of deceit, in the sight of the LORD; and he
shall know that he is in need.
008:025 He is a man of folly, in the sight of the LORD; and he
shall know that he is in need.
008:026 He is a man of deceit, in the sight of the LORD; and he
shall know that he is in need.
008:027 He is a man of first righteousness, in the sight of the
LORD; and he shall know that he is in need.
008:028 He is a man of first knowledge, in the sight of the
LORD; and he shall know that he is in need.
008:029 He is a man of first knowledge, in the sight of the LORD; and he
shall know that he is in need.
008:030 He is a man of first knowledge, in the sight of the LORD; and he
shall know that he is in need.
008:031 He is a man of first understanding, in the
CPU times: user 15.2 s, sys: 1.1 s, total: 16.3 s
Wall time: 19.8 s
If you are creating an API based on your model and need to pass the generated text elsewhere, you can do `text = gpt2.generate(sess, return_as_list=True)[0]`.

You can also pass in a `prefix` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `nsamples`. Unique to GPT-2, you can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 20 for `batch_size`).
Other optional-but-helpful parameters for `gpt2.generate` and friends:

- `length`: Number of tokens to generate (default 1023, the maximum).
- `temperature`: The higher the temperature, the crazier the text (default 0.7; recommended to keep between 0.7 and 1.0).
- `top_k`: Limits the generated guesses to the top k guesses (default 0, which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`).
- `top_p`: Nucleus sampling: limits the generated guesses to a cumulative probability (gets good results on a dataset with `top_p=0.9`).
- `truncate`: Truncates the input text until a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the input texts are short.
- `include_prefix`: If using `truncate` and `include_prefix=False`, the specified `prefix` will not be included in the returned text.

A short sketch combining several of these options follows; the cell after it is the run actually used in this notebook.
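As a hedged illustration (not the run used below), here is how a few of these options might be combined to capture the generated text in a variable rather than printing it:

```python
# Sketch: generate three samples in parallel and keep them as Python strings.
texts = gpt2.generate(sess,
                      run_name='run1',
                      length=100,
                      temperature=0.7,
                      top_k=40,                # tame the output if it gets too wild
                      prefix="In the beginning",
                      nsamples=3,
                      batch_size=3,
                      return_as_list=True)     # return strings instead of printing
print(texts[0])
```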
%%time
# Choose the prefix you want and then let it rip!
gpt2.generate(sess,
length=250,
temperature=0.7,
prefix="Destiny is",
nsamples=5,
batch_size=5
)
Destiny is the root of all nations.
Thou shalt have light, and light shall not fail:
and all nations shall come together in peace.
002:002 Also every tribe shall have dominion over the poor, and
shall reign over the needy.
002:003 This is the end of all of the kingdoms which have been
broken up.
002:004 The Jews shall rule over the few: they shall be
rulers over all nations, and over all nations will be
the gates of the world.
002:005 The sons of the LORD shall be as the handmaids of the
children of Israel, which shall inhabit the earth, and shall
serve the princes of the children of Israel.
002:006 Because of the faith of the LORD, and the great work of
the LORD, how is it that the children of Israel the LORD
====================
Destiny is for us a record, as a city;
the same also is for us as a wall: for we are a house of the
inhabitants of the Gentiles.
024:008 And the same is for us as a house of the Jews: for we
are a house of the Jews, a house that are all the
Gentiles: they have nothing to do with us.
024:009 For the remnant of the Gentiles have a house in the
Gentiles; for they are a house of the Jews; they have no
inheritance in us; for they have laid down their burden for
us; they have given their inheritance to us.
024:010 For we have a covenant with the Gentiles, that we will not
give unto them any thing; but we will give, that they may be
rich in the land.
024:011 Also, the same
====================
Destiny is a new law unto thee, O LORD, and all thy judgments are upon thee:
013:018 But the LORD will put them to shame.
013:019 Therefore do saith the LORD GOD GOD;
013:020 That thou mayest trust in me, that I may be
exalted, and that thou mayest be glorified in the
sight of the LORD.
013:021 But I will stand before thee with utmost respect, and
with great zeal; for I will make thee a mighty city, and
a mighty city with a mighty army.
013:022 And I will make a great city like unto your city, and
shall make it the heap of the dust, and the heap of the
world.
013:023 And I will make thee ruler over all the earth; and
thou shalt send thy armies against me; and I will make thy
city the heap of the
====================
Destiny is to be reckoned with, as the soul which is in peace.
019:021 For the whole world will be at hand, and the
world will be like the day that I rejoice.
019:022 And the house which beareth the earth shall be glad, and the
house which is in it shall be glad.
019:023 But the house which is not in it shall not rejoice:
neither shall the house that is in it rejoice in me.
019:024 And I will make a present of it, that my mother may have
peace, and I will give her thanks.
019:025 But I will not make a present of it, because I have
not yet called it, nor have I yet called it, when they
shall gather together all the tribes of the earth, and shall
gather together all the tribes of the earth, to bring
forth new generations.
019
====================
Destiny is made to come
into the dominion of the heathen;
024:011 That the children of heathen may not be hid from their
shallower than their own; that the child may not be
hid from their own.
024:012 That the children of heathen may not be deceived
into the truth of the truth, which is amiss
in the kingdom of God, and in the kingdom of the
Jew: that the children of heathen may not be hid
from their own.
024:013 That the children of heathen may not be deceived into
the truth of the truth, which is amiss in the kingdom
of God, and in the kingdom of the Jew, and in the
kingdom of the heathen, and in the kingdom of the
heathen.
024:014
====================
CPU times: user 7.19 s, sys: 201 ms, total: 7.39 s
Wall time: 8.62 s
But, will AI truly learn to write well? Probably. Look at this letter written by John Steinbeck, and ask, can an AI write in this way?
In 2022, we have seen AIs write incredibly well-informed text. Models such as Google’s LaMDA (https://blog.google/technology/ai/lamda/) are astonishingly literate. The paper is here: https://arxiv.org/abs/2201.08239
Take a look at how well it performs using the interface from https://beta.character.ai.
35.10. Large Language Models#
Training large language models (LLMs) is extremely expensive. A generative model such as GPT-3 (https://en.wikipedia.org/wiki/GPT-3) is estimated to have cost $12M to train.
The pre-training dataset for LaMDA consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T words. The model was pre-trained on 1024 TPU-v3 chips for about 57.7 days, with 256K tokens per batch. The carbon footprint of that training run is approximately equivalent to 22 passengers taking a round trip between San Francisco and New York (1.2 tCO2e per passenger).
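As a quick sanity check on the scale of that compute, the figures above translate into roughly 59,000 TPU-v3 chip-days (illustrative arithmetic only):

```python
# Back-of-the-envelope: total TPU-v3 compute quoted for LaMDA pre-training.
chips, days = 1024, 57.7
chip_days = chips * days
print(f"{chip_days:,.0f} chip-days ≈ {chip_days * 24 / 1e6:.2f} million chip-hours")
# -> 59,085 chip-days ≈ 1.42 million chip-hours
```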
BLOOM (176B parameters) is another large language model: https://bigscience.huggingface.co/blog/bloom. As stated by the site: “With its 176 billion parameters, BLOOM is able to generate text in 46 natural languages and 13 programming languages. For almost all of them, such as Spanish, French and Arabic, BLOOM will be the first language model with over 100B parameters ever created. This is the culmination of a year of work involving over 1000 researchers from 70+ countries and 250+ institutions, leading to a final run of 117 days (March 11 - July 6) training the BLOOM model on the Jean Zay supercomputer in the south of Paris, France thanks to a compute grant worth an estimated €3M from French research agencies CNRS and GENCI.” BLOOM is an example of open LLM modeling, in contrast to other models that are built by large tech companies.
BLOOM also advocates Responsible AI via its new license, RAIL: https://bigscience.huggingface.co/blog/the-bigscience-rail-license. This connects to the discussion of ML Explainability, one aspect of Responsible AI.
Stable Diffusion provides open-source text-to-image models: https://stability.ai/blog/stable-diffusion-public-release

These models are also deployed in SageMaker: https://aws.amazon.com/about-aws/whats-new/2022/11/sagemaker-jumpstart-stable-diffusion-bloom-models/
A collection of links to LLMs on Github: https://gist.github.com/rain-1/eebd5e5eb2784feecf450324e3341c8d
Five Years of GPTs: https://finbarr.ca/five-years-of-gpt-progress/
35.11. Using BLOOM#
This is an excellent source for example code: https://amazon.awsapps.com/workdocs/index.html#/document/0beeca78eaca3b53f2a3beb37b3d515848ccb8965c9241c846bf44b37b21203a
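For a quick hands-on experiment, here is a minimal sketch of generating text from BLOOM with the Hugging Face `transformers` library; it assumes the much smaller `bigscience/bloom-560m` checkpoint, since the full 176B-parameter model will not fit on a Colab GPU.

```python
# Sketch: text generation with a small BLOOM checkpoint via transformers.
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom-560m")
out = generator("Natural language generation is",
                max_new_tokens=50,   # length of the continuation
                do_sample=True,      # sample rather than greedy-decode
                top_p=0.9)           # nucleus sampling
print(out[0]["generated_text"])
```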
35.12. Building LLMs from scratch#
This is a nice 3-hour presentation by Sebastian Raschka for those who want to dive deeper into how LLMs are built. See the Substack post: https://magazine.sebastianraschka.com/p/building-llms-from-the-ground-up