18. Hugging Face for Sentiment, Q&A#

%%capture
# Import scientific and numerical computing libraries
# Run this cell to load all the required libraries in one go.
%pylab inline
import os
import pandas as pd
# %load_ext rpy2.ipython
from IPython.display import Image
# Basic lines of code needed to import a data file with permissions from Google Drive
from google.colab import drive
# drive.mount("/content/drive", force_remount=True)
drive.mount('/content/drive')
os.chdir("drive/My Drive/Books_Writings/NLPBook/")
Mounted at /content/drive

18.1. Using Hugging Face Transformers for Sentiment Scoring#

See: https://huggingface.co/

Hugging Face provides an incredible array of NLP tools that are very simple to use, and it has gone a long way toward democratizing NLP. We will use their Transformers package.

This is a bert-base-multilingual-uncased model fine-tuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish, and Italian. It predicts the sentiment of a review as a number of stars (between 1 and 5).

https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment

See for example: https://www.curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/

# %%capture
# !pip install pytorch
!pip install transformers[torch] --quiet
!pip install accelerate -U --quiet
import torch
from transformers import pipeline
scoreSentiment = pipeline("sentiment-analysis")
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
Device set to use cuda:0
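The HF_TOKEN warning above is harmless when using public models. If you do want to authenticate with the Hub, a minimal sketch using huggingface_hub is shown below (the token string is a placeholder for your own access token):

# Optional: authenticate with the Hugging Face Hub (not required for public models)
# In Colab you can instead store the token under the HF_TOKEN secret
from huggingface_hub import login
login(token="hf_...")  # placeholder; substitute your own access token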
text = 'We’re on a journey to advance and democratize NLP for everyone. Along the way, we contribute to the development of technology for the better.'
print(text)
scoreSentiment(text)
We’re on a journey to advance and democratize NLP for everyone. Along the way, we contribute to the development of technology for the better.
[{'label': 'POSITIVE', 'score': 0.9995071887969971}]
print(scoreSentiment('Weeks of violent clashes between federal agents and protesters in Portland, Ore., galvanized thousands of people to march through the streets of American cities on Saturday, injecting new life into protests that had largely waned in recent weeks.'))
print(scoreSentiment('The police responded by firing flash grenades, showering protesters with pepper spray and abruptly rushing into crowds, knocking people to the ground. After a flash grenade left one woman with bloody injuries, police officers shoved people who had stopped to help her.'))
[{'label': 'POSITIVE', 'score': 0.9797046184539795}]
[{'label': 'NEGATIVE', 'score': 0.9931089878082275}]
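The pipeline above fell back to the default English SST-2 model. To score sentiment with the multilingual star-rating model described earlier, pass its name explicitly; a minimal sketch (the example sentences are illustrative):

# Load the multilingual 1-5 star sentiment model explicitly
starSentiment = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")
print(starSentiment("The product was excellent and arrived early."))  # expect a high star rating
print(starSentiment("Das Produkt war leider defekt."))                # German review; expect a low star rating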

18.2. Question and Answering#

Transformers can also be used for question answering (Q&A), and the pipeline is very easy to set up.

This is a nice blog that gives you more details: https://towardsdatascience.com/explainbertqa-687abe3b2fcc

QnA = pipeline("question-answering") # This is a pre-trained model
No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0
text = 'The federal courthouse in Portland has been the scene of nightly, chaotic demonstrations for weeks, which continued again into Sunday morning, as thousands participated in marches around the city, the 59th consecutive day of protests there. Earlier, a group of nurses in scrubs had joined an organized group of mothers in helmets and fathers in hard hats, all assembled against the fence of a federal courthouse where federal agents — a deployment that has been a key focus of the recent demonstrations — have been assembled.'
print(text)

ans = QnA({'question': 'What is happening in Portland?', 'context': text})
print(ans)
The federal courthouse in Portland has been the scene of nightly, chaotic demonstrations for weeks, which continued again into Sunday morning, as thousands participated in marches around the city, the 59th consecutive day of protests there. Earlier, a group of nurses in scrubs had joined an organized group of mothers in helmets and fathers in hard hats, all assembled against the fence of a federal courthouse where federal agents — a deployment that has been a key focus of the recent demonstrations — have been assembled.
{'score': 0.3924693465232849, 'start': 74, 'end': 88, 'answer': 'demonstrations'}
ans = QnA({'question': 'Who are the demonstrators?', 'context': text})
print(ans)
{'score': 0.5871760845184326, 'start': 418, 'end': 432, 'answer': 'federal agents'}
ans = QnA({'question': 'When did the demonstrations happen?', 'context': text})
print(ans)
{'score': 0.9344968795776367, 'start': 127, 'end': 141, 'answer': 'Sunday morning'}
ans = QnA({'question': 'How long did the demonstrations last?', 'context': text})
print(ans)
{'score': 0.23806874454021454, 'start': 127, 'end': 141, 'answer': 'Sunday morning'}
ans = QnA({'question': 'Why on earth did I take this boring course?', 'context': text}) # Unrelated question
print(ans)
{'score': 0.11029750108718872, 'start': 66, 'end': 88, 'answer': 'chaotic demonstrations'}
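Note that the pipeline always extracts some span from the context, even for the unrelated question above, so in practice it is useful to reject answers whose confidence score is too low. A minimal sketch, with an arbitrary threshold:

def answer_if_confident(question, context, threshold=0.3):
    # Return the answer only when the model's score clears the threshold
    ans = QnA({'question': question, 'context': context})
    return ans['answer'] if ans['score'] >= threshold else "No confident answer found"

print(answer_if_confident('When did the demonstrations happen?', text))
print(answer_if_confident('Why on earth did I take this boring course?', text))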

Much of this functionality has since been subsumed by LLMs!

18.3. Distillation#

Ref: https://towardsdatascience.com/distillation-of-bert-like-models-the-theory-32e19a02641f

The idea behind distillation is to reduce the number of parameters of the original model (also known as the “teacher” model) to a smaller model (denoted as the “student” model). This is usually done by having fewer hidden layers in the student model and training these layers to generate the same output as the teacher model. The reference above depicts this architecture in the following graphic. You can read about the details in the article, which is excellent.

Image("NLP_images/teacher_student.png", width=800)

18.4. Using Hugging Face for Text Classification#

We will download the AG News dataset: http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html

See the same on Hugging Face: https://huggingface.co/datasets/viewer/?dataset=ag_news

https://huggingface.co/datasets/wangrongsheng/ag_news

!pip install datasets --quiet

import datasets
datasets
<module 'datasets' from '/usr/local/lib/python3.11/dist-packages/datasets/__init__.py'>
# # https://hackernoon.com/nlp-datasets-from-huggingface-how-to-access-and-train-them-i22u35t9
# from datasets import list_datasets, load_dataset
# from pprint import pprint

# datasets_list = list_datasets()
# print("Number of datasets =", len(datasets_list))
# # pprint(datasets_list,compact=True)
from datasets import load_dataset
dataset = load_dataset('ag_news')
print(dataset)
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})
df_train = load_dataset('ag_news','default',split='train').to_pandas()
df_test = load_dataset('ag_news','default',split='test').to_pandas()
df_train.head()
                                                text  label
0  Wall St. Bears Claw Back Into the Black (Reute...      2
1  Carlyle Looks Toward Commercial Aerospace (Reu...      2
2  Oil and Economy Cloud Stocks' Outlook (Reuters...      2
3  Iraq Halts Oil Exports from Main Southern Pipe...      2
4  Oil prices soar to all-time record, posing new...      2
df_train.label.groupby(df_train.label).count()
label
0    30000
1    30000
2    30000
3    30000
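The integer labels correspond to the four AG News topics. The mapping is stored in the dataset's ClassLabel feature; a quick check, assuming the `dataset` object loaded above:

# Recover the human-readable class names from the ClassLabel feature
print(dataset['train'].features['label'].names)  # expected: ['World', 'Sports', 'Business', 'Sci/Tech']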

# Subset data for quick run
ntrain = 5000
ntest = 1000
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
inputs = tokenizer("Hello new World!", return_tensors="pt")
outputs = model(**inputs)

x = array(outputs.last_hidden_state.detach()) # To convert to a numpy array, detach first from torch
print(x.shape)
x[0][:,:5]
(1, 6, 768)
array([[-0.12557712, -0.13725324,  0.14662963, -0.18246058, -0.22283383],
       [-0.3664784 ,  0.41534883,  0.7468105 , -0.6168315 ,  0.27368066],
       [-0.232606  , -0.08210962,  0.6729307 , -0.09339743,  0.4407267 ],
       [ 0.01433105,  0.1426486 ,  0.5805187 , -0.13335617,  0.65808153],
       [-0.23666444, -0.26574713,  0.2277732 , -0.5975928 ,  0.07788546],
       [ 0.9137555 ,  0.1831059 , -0.24395643,  0.3854501 , -0.46605077]],
      dtype=float32)

We get 6 tokens: the word pieces for the input text plus the special [CLS] and [SEP] tokens that the tokenizer adds. For more on word-piece tokenization, see: https://aclanthology.org/2021.emnlp-main.160.pdf
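To see where the six tokens come from, convert the input ids back to tokens using the tokenizer above:

# Inspect the word pieces and special tokens for "Hello new World!"
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
# e.g. ['[CLS]', 'hello', 'new', 'world', '!', '[SEP]']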

18.5. Text Classification using PyTorch#

https://www.thepythoncode.com/article/finetuning-bert-using-huggingface-transformers-python

import torch
# from transformers.file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
import random
from sklearn.model_selection import train_test_split
# Tokenizer and max number of tokens per example
max_seq_length = 512
model_name = 'bert-base-uncased'
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)

# Create train and test data as lists
train_texts = list(df_train.text)[:ntrain]
train_labels = list(df_train.label)[:ntrain]
test_texts = list(df_test.text)[:ntest]
test_labels = list(df_test.label)[:ntest]
%%time
# Tokenize train and test data
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_seq_length)
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=max_seq_length)
CPU times: user 1.98 s, sys: 125 ms, total: 2.1 s
Wall time: 3.85 s
class AGNewsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

# convert our tokenized data into a torch Dataset
train_dataset = AGNewsDataset(train_encodings, train_labels)
test_dataset = AGNewsDataset(test_encodings, test_labels)
print(train_texts[:2])
train_dataset[1]
["Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.']
{'input_ids': tensor([  101, 18431,  2571,  3504,  2646,  3293, 13395,  1006, 26665,  1007,
         26665,  1011,  2797,  5211,  3813, 18431,  2571,  2177,  1010,  1032,
          2029,  2038,  1037,  5891,  2005,  2437,  2092,  1011, 22313,  1998,
          5681,  1032,  6801,  3248,  1999,  1996,  3639,  3068,  1010,  2038,
          5168,  2872,  1032,  2049, 29475,  2006,  2178,  2112,  1997,  1996,
          3006,  1012,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0]),
 'labels': tensor([2])}
# load the model and pass to CUDA (if GPU available)
# NOTE: num_labels must equal the number of classes (4 for AG News), not the number of training examples
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=4).to("cuda")
# model = BertForSequenceClassification.from_pretrained(model_name, num_labels=4)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
from sklearn.metrics import accuracy_score

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  # calculate accuracy using sklearn's function
  acc = accuracy_score(labels, preds)
  return {
      'accuracy': acc,
  }
# Hyperparameters
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
    logging_steps=10,               # log & save weights each logging_steps
    save_steps=100,
    evaluation_strategy="steps",     # evaluate each `logging_steps`
)
/usr/local/lib/python3.11/dist-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
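If wandb is installed (as it is in Colab), the Trainer will try to log the run to Weights & Biases and prompt for an API key; to avoid the interactive login, reporting can be disabled with the report_to argument. A minimal sketch of the relevant change (all other arguments unchanged):

# Optional: disable external experiment tracking (e.g., Weights & Biases)
training_args = TrainingArguments(
    output_dir='./results',
    report_to="none",   # do not log the run to wandb/tensorboard
    # ... remaining arguments as in the cell above
)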
trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset,           # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)
%%time
trainer.train()
[625/625 21:38, Epoch 1/1]
Step Training Loss Validation Loss Accuracy
10 8.515900 8.566512 0.000000
20 8.478000 8.509310 0.000000
30 8.409000 8.396599 0.000000
40 8.319000 8.230279 0.000000
50 8.170300 8.031937 0.000000
60 7.945400 7.810511 0.004000
70 7.703600 7.565921 0.157000
80 7.403400 7.278660 0.412000
90 7.156700 6.944510 0.527000
100 6.761000 6.599313 0.510000
110 6.456400 6.228070 0.503000
120 6.092600 5.865695 0.542000
130 5.711900 5.466675 0.590000
140 5.250800 5.074615 0.617000
150 4.950700 4.673220 0.489000
160 4.490500 4.240108 0.497000
170 4.017000 3.801923 0.514000
180 3.774900 3.385123 0.513000
190 3.462000 2.962988 0.545000
200 2.848700 2.585617 0.785000
210 2.530500 2.173271 0.837000
220 2.190000 1.809053 0.829000
230 1.635200 1.504657 0.870000
240 1.480000 1.239617 0.863000
250 1.246700 1.058543 0.860000
260 1.025800 0.895702 0.857000
270 0.914000 0.762751 0.875000
280 0.808600 0.670900 0.881000
290 0.810700 0.613794 0.887000
300 0.599700 0.577927 0.888000
310 0.498600 0.632159 0.849000
320 0.724900 0.525783 0.887000
330 0.442000 0.543760 0.868000
340 0.669600 0.542330 0.856000
350 0.667500 0.515317 0.860000
360 0.425800 0.556679 0.862000
370 0.458100 0.492545 0.880000
380 0.639300 0.520489 0.872000
390 0.266000 0.477289 0.884000
400 0.483400 0.463034 0.883000
410 0.635700 0.426393 0.887000
420 0.371200 0.562870 0.864000
430 0.708600 0.460499 0.878000
440 0.439500 0.408264 0.882000
450 0.272100 0.402953 0.895000
460 0.462700 0.544710 0.886000
470 0.358400 0.495952 0.889000
480 0.653800 0.551614 0.876000
490 0.445900 0.525518 0.877000
500 0.549300 0.493981 0.872000
510 0.798700 0.524189 0.875000
520 0.375100 0.452902 0.879000
530 0.395400 0.385472 0.895000
540 0.307000 0.441526 0.883000
550 0.419000 0.409718 0.883000
560 0.334600 0.434326 0.893000
570 0.383400 0.467434 0.888000
580 0.297500 0.539987 0.865000
590 0.604300 0.522813 0.872000
600 0.502800 0.478167 0.883000
610 0.446800 0.451406 0.890000
620 0.336800 0.448145 0.892000

Could not locate the best model at ./results/checkpoint-530/pytorch_model.bin, if you are running a distributed training on multiple nodes, you should activate `--save_on_each_node`.
CPU times: user 15min 58s, sys: 18.6 s, total: 16min 16s
Wall time: 27min 36s
TrainOutput(global_step=625, training_loss=2.4775263092041016, metrics={'train_runtime': 1655.6869, 'train_samples_per_second': 3.02, 'train_steps_per_second': 0.377, 'total_flos': 918183733920000.0, 'train_loss': 2.4775263092041016, 'epoch': 1.0})
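After fine-tuning, the model can be used directly for inference. A minimal sketch, assuming the `model` and `tokenizer` defined above and the standard AG News label order:

import torch

class_names = ['World', 'Sports', 'Business', 'Sci/Tech']  # AG News label order

def predict_topic(text):
    # Tokenize, run a forward pass without gradients, and take the highest-scoring class
    enc = tokenizer(text, truncation=True, max_length=max_seq_length, return_tensors="pt").to("cuda")
    with torch.no_grad():
        logits = model(**enc).logits
    return class_names[int(logits.argmax(dim=-1))]

print(predict_topic("The Federal Reserve raised interest rates again this quarter."))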

18.6. Text Classification with TensorFlow#

This has been made very easy by the ktrain library: https://github.com/amaiya/ktrain

For text classification using ktrain, see:

  1. https://nbviewer.org/github/amaiya/ktrain/blob/master/tutorials/tutorial-04-text-classification.ipynb

  2. https://towardsdatascience.com/text-classification-with-hugging-face-transformers-in-tensorflow-2-without-tears-ee50e4f3e7ed

!pip install tf_keras==2.18.0 --quiet
import os
os.environ['TF_USE_LEGACY_KERAS'] = 'True'
!pip install -U ktrain --quiet
# Set up datasets as lists
x_train = list(df_train.text)[:ntrain]
y_train = list(df_train.label)[:ntrain]
x_test = list(df_test.text)[:ntest]
y_test = list(df_test.label)[:ntest]
print("Classes :", unique(y_train))

model_name = 'distilbert-base-uncased'
max_seq_length = 512

import ktrain
from ktrain import text
t = text.Transformer(model_name, maxlen=max_seq_length, classes=unique(y_train))
Classes : [0 1 2 3]
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-48-827b31200153> in <cell line: 0>()
      9 max_seq_length = 512
     10 
---> 11 import ktrain
     12 from ktrain import text
     13 t = text.Transformer(model_name, maxlen=max_seq_length, classes=unique(y_train))

/usr/local/lib/python3.11/dist-packages/ktrain/__init__.py in <module>
      1 from . import imports as I
----> 2 from . import utils as U
      3 from .core import (
      4     ArrayLearner,
      5     GenLearner,

/usr/local/lib/python3.11/dist-packages/ktrain/utils.py in <module>
     29         "adam"
     30         if version.parse(tf.__version__) < version.parse("2.11")
---> 31         else tf.keras.optimizers.legacy.Adam()
     32     )
     33 else:

/usr/local/lib/python3.11/dist-packages/keras/src/optimizers/__init__.py in __init__(self, *args, **kwargs)
    113 class LegacyOptimizerWarning:
    114     def __init__(self, *args, **kwargs):
--> 115         raise ImportError(
    116             "`keras.optimizers.legacy` is not supported in Keras 3. When using "
    117             "`tf.keras`, to continue using a `tf.keras.optimizers.legacy` "

ImportError: `keras.optimizers.legacy` is not supported in Keras 3. When using `tf.keras`, to continue using a `tf.keras.optimizers.legacy` optimizer, you can install the `tf_keras` package (Keras 2) and set the environment variable `TF_USE_LEGACY_KERAS=True` to configure TensorFlow to use `tf_keras` when accessing `tf.keras`.

# Preprocess the datasets into a format required by the pre-trained model
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
# Create a model and training module
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=16)
# Estimate the learning rate
learner.lr_find(show_plot=True, max_epochs=1)
# TRAIN
init_lr = 5e-4 # This is about where the LR starts to decline
num_epochs = 1
learner.fit_onecycle(init_lr, num_epochs)
# PREDICT
predictor = ktrain.get_predictor(learner.model, preproc=t)
z = randint(len(y_test))
print(x_test[z])
print(y_test[z])
print(predictor.predict(x_test[0]))
# EXPLAIN
predictor.explain(x_test[z])

For demo purposes, we trained for just one epoch above; a few more epochs would be needed to reach good accuracy.
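If the run completes, the predictor can be saved to disk and reloaded later without retraining; a minimal sketch (the directory name is arbitrary):

# SAVE and RELOAD the trained predictor
predictor.save('ag_news_predictor')
reloaded_predictor = ktrain.load_predictor('ag_news_predictor')
print(reloaded_predictor.predict(x_test[z]))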