32. Generalized Language Models#
Good reference: https://lilianweng.github.io/lil-log/2019/01/31/generalized-language-models.html
Slides: https://docs.google.com/presentation/d/1xhOocjNJ-6YU_jXPJb_Yi0Vloen65QCyM51S7-Dadfk/edit?usp=sharing
Hands On LLMs (book) GitHub: HandsOnLLM/Hands-On-Large-Language-Models
Interviews on the history of the transformer: https://www.quantamagazine.org/when-chatgpt-broke-an-entire-field-an-oral-history-20250430/
# # Use pytorch kernel and install TF, as needed
# !pip install --upgrade torch
# !pip install --upgrade tensorflow
32.1. Classification with Embeddings and BERT#
We can use many approaches as seen earlier. A good summary of classification approaches in various NLP libraries is discussed here: https://towardsdatascience.com/which-is-the-best-nlp-d7965c71ec5f
from google.colab import drive
drive.mount('/content/drive') # Add My Drive/<>
import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/NLPBook/')
Mounted at /content/drive
%%capture
%pylab inline
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Image
32.2. Sequence of Classification Approaches#
Various forms of input are possible:
Use a single vector from the TDM for classification of each document. Easy to construct, but lacking context. Input size is fixed. Vocab size is large.
Use a single TFIDF vector. Same as TDM vectors.
Word2Vec. Convert each word in the document into a fixed-length vector. Combine the vectors into a matrix (or average them) for the document; this is the input into the classifier (see the sketch after this list). Requires a package like gensim to build the word embeddings, so some compute effort is needed. Input size is fixed and small (typically 100-300), not huge as in TDM, TFIDF.
Doc2Vec. Each document is converted into a vector, which is input into the classifier (also needs gensim). Input size is fixed.
MPN (multilayer perceptron network). Use a standard feed-forward NN over a fixed window of tokens, truncating the document/sentence. Enlarging the window results in an explosion in parameters. No context.
RNN. Keeps track of word sequences and generates one embedding for a sequence of words. Can take any sequence length. Same weight matrix for all inputs. Keeps context. But slow, and loses track of words further back in the sequence, so it may give greater weight to words at the end. Suffers from vanishing gradients.
LSTMs. Same as RNNs, but designed to fix the vanishing-gradient problem. Goes in only one direction, so full context is missed; for example, in translation, words before and after the current word matter.
CNN. Faster than RNNs as they do not have to process tokens sequentially, so parallelization is possible. Input length is not fixed, so padding is required.
Attention. These models are bidirectional, so they work better across tasks as they capture greater context. Also computationally more efficient than LSTMs. Input length is not fixed, though there is a limited maximum sequence length.
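To make the Word2Vec item above concrete, here is a minimal sketch (not used later in this notebook) that trains gensim word vectors on a tiny made-up corpus and averages them into a fixed-length document vector for a classifier; the corpus, labels, and vector size are all hypothetical.
# Hedged sketch: averaged Word2Vec vectors as classifier input (toy corpus, hypothetical labels)
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

toy_docs = ["operating profit rose sharply", "net sales decreased from last year",
            "the company reported a loss", "profit totalled eur 21 mn up from last year"]
toy_labels = [1, 0, 0, 1]  # hypothetical sentiment labels

tokens = [d.split() for d in toy_docs]
w2v = Word2Vec(sentences=tokens, vector_size=50, window=3, min_count=1, epochs=50)

def doc_vector(words, model):
    # average the word vectors found in the vocabulary (zero vector if none)
    vecs = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

X_toy = np.array([doc_vector(d, w2v) for d in tokens])
clf = LogisticRegression().fit(X_toy, toy_labels)
print(clf.predict(X_toy))
Note that the averaging step discards word order, which is exactly the limitation the sequence models further down the list address.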
This historical sequence is also presented in these slides from NVIDIA:
32.3. Read in the data#
Datasets:
Reddit news with Dow sign, https://www.kaggle.com/aaron7sun/stocknews
Movie reviews, https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
Financial Phrase Bank, https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10
# Read data
# df = pd.read_csv('NLP_data/Combined_News_DJIA.csv') # Reddit News vs Dow data
# df = pd.read_csv('NLP_data/movie_review.csv', parse_dates=True, index_col=0) # Movie Reviews data
df = pd.read_csv('NLP_data/Sentences_AllAgree.txt', sep=".@", header=None, encoding = "ISO-8859-1") # Finbert data
print(df.shape)
# df.columns = ["Label","Text"] # for movie reviews
df.columns = ["Text","Label"]
df.head()
/tmp/ipython-input-2819765382.py:4: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
df = pd.read_csv('NLP_data/Sentences_AllAgree.txt', sep=".@", header=None, encoding = "ISO-8859-1") # Finbert data
(2264, 2)
|   | Text | Label |
|---|---|---|
| 0 | According to Gran , the company has no plans t... | neutral |
| 1 | For the last quarter of 2010 , Componenta 's n... | positive |
| 2 | In the third quarter of 2010 , net sales incre... | positive |
| 3 | Operating profit rose to EUR 13.1 mn from EUR ... | positive |
| 4 | Operating profit totalled EUR 21.1 mn , up fro... | positive |
# # Remove all the b-prefixes (for DJIA dataset)
# for k in range(1,26):
# colname = "Top"+str(k)
# df[colname] = df[colname].str[2:]
# # Prepare the data
# columns = ['Top' + str(i+1) for i in range(25)]
# df['Text'] = df[columns].apply(lambda x: ' '.join(x.astype(str)), axis=1)
# df = df[['Label', 'Text']]
# df.head()
# Plot class distribution
import seaborn as sns
sns.countplot(x='Label', data=df)
<Axes: xlabel='Label', ylabel='count'>
32.4. Now install raw text tools#
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve,auc
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# See Transformers from Hugging Face: https://huggingface.co/transformers/
# Simple Transformers: https://github.com/ThilinaRajapakse/simpletransformers
!pip install gensim
# !pip install transformers
Collecting gensim
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Requirement already satisfied: numpy>=1.18.5 in /usr/local/lib/python3.12/dist-packages (from gensim) (2.0.2)
Requirement already satisfied: scipy>=1.7.0 in /usr/local/lib/python3.12/dist-packages (from gensim) (1.16.3)
Requirement already satisfied: smart_open>=1.8.1 in /usr/local/lib/python3.12/dist-packages (from gensim) (7.4.1)
Requirement already satisfied: wrapt in /usr/local/lib/python3.12/dist-packages (from smart_open>=1.8.1->gensim) (2.0.0)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 44.2 MB/s eta 0:00:00
Installing collected packages: gensim
Successfully installed gensim-4.4.0
import json
from sklearn import feature_extraction, feature_selection, metrics
from sklearn import model_selection, naive_bayes, pipeline, manifold, preprocessing
import gensim
import gensim.downloader as gensim_api
from tensorflow.keras import models, layers, preprocessing as kprocessing
from tensorflow.keras import backend as K
from tensorflow.keras.utils import plot_model
import transformers
import nltk
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download('omw-1.4')
stopwords = nltk.corpus.stopwords.words("english")
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
# Use texthero as an alternative text cleaner, instead of the code below
import re # regex
def removeNumbersStr(s):
for c in range(10):
n = str(c)
s = s.replace(n," ")
return s
def cleanText(text, stem=False, lemm=True, stop=True):
text = re.sub(r'[^\w\s]', '', str(text).lower().strip()) # remove stuff
text = removeNumbersStr(text)
text = text.split() # tokenize
if stop: # remove stopwords
text = [word for word in text if word not in stopwords]
if stem == True: # stemming
ps = nltk.stem.porter.PorterStemmer()
text = [ps.stem(word) for word in text]
if lemm == True:
lem = nltk.stem.wordnet.WordNetLemmatizer()
text = [lem.lemmatize(word) for word in text]
text = " ".join(text)
return text
df["cleanTxt"] = [cleanText(df.Text[j]) for j in range(len(df.Label))]
print(df.shape)
df.head()
(2264, 3)
|   | Text | Label | cleanTxt |
|---|---|---|---|
| 0 | According to Gran , the company has no plans t... | neutral | according gran company plan move production ru... |
| 1 | For the last quarter of 2010 , Componenta 's n... | positive | last quarter componenta net sale doubled eur e... |
| 2 | In the third quarter of 2010 , net sales incre... | positive | third quarter net sale increased eur mn operat... |
| 3 | Operating profit rose to EUR 13.1 mn from EUR ... | positive | operating profit rose eur mn eur mn correspond... |
| 4 | Operating profit totalled EUR 21.1 mn , up fro... | positive | operating profit totalled eur mn eur mn repres... |
df_train, df_test = model_selection.train_test_split(df, test_size=0.2)
y_train = df_train["Label"].values
y_test = df_test["Label"].values
# Choose BOW or TFIDF vectorizer in sklearn
# vectorizer = feature_extraction.text.CountVectorizer(max_features=10000, ngram_range=(1,2)) # BOW
vectorizer = feature_extraction.text.TfidfVectorizer(max_features=10000, ngram_range=(1,2)) # TFIDF
corpus = df_train["cleanTxt"]
vectorizer.fit(corpus)
X_train = vectorizer.transform(corpus)
X_train
<Compressed Sparse Row sparse matrix of dtype 'float64'
with 29553 stored elements and shape (1811, 10000)>
vocab = vectorizer.vocabulary_ # is a dict
list(vocab.keys())[:30]
['said',
'board',
'involve',
'lot',
'work',
'people',
'get',
'paid',
'time',
'involve lot',
'lot work',
'work people',
'get paid',
'company',
'generates',
'net',
'sale',
'mln',
'euro',
'annually',
'employ',
'company generates',
'generates net',
'net sale',
'sale mln',
'mln euro',
'euro mln',
'mln annually',
'reporting',
'period']
X_train.shape
(1811, 10000)
32.5. Visualize the DTM#
figure(figsize=(15,7))
sns.heatmap(X_train.todense() [:,np.random.randint(0,X_train.shape[1],2000)]==0,
vmin=0, vmax=1, cbar=False).set_title('Document Term Matrix (DTM)')
xlabel('Terms'); ylabel('Documents')
Text(158.22222222222223, 0.5, 'Documents')
32.6. Reduce the dimension of the vocabulary#
# Feature reduction using feature selection in sklearn
# This can also be done using TextHero
y = df_train["Label"]
X_names = vectorizer.get_feature_names_out()
p_value_limit = 0.75
df_features = pd.DataFrame()
for cat in np.unique(y):
chi2, p = feature_selection.chi2(X_train, y==cat)
df_features = pd.concat([df_features, pd.DataFrame({"feature":X_names, "score":1-p, "y":cat})])
df_features = df_features.sort_values(["y","score"],ascending=[True,False])
df_features = df_features[df_features["score"]>p_value_limit]
X_names = df_features["feature"].unique().tolist()
print(type(X_names)); print(X_names[:10])
print("# features =",len(X_names))
<class 'list'>
['decreased', 'decreased eur', 'fell', 'eur mn', 'mn', 'operating loss', 'compared profit', 'sale decreased', 'eur', 'dropped']
# features = 1293
# !conda install -c conda-forge xgboost -y
32.7. TFIDF Transform Classification#
# Define Vectorizer
vectorizer = feature_extraction.text.TfidfVectorizer(vocabulary=X_names)
vectorizer.fit(corpus)
X_train = vectorizer.transform(corpus)
vocab = vectorizer.vocabulary_
print("Check vocab length:", len(vocab))
tmp = zeros(X_train.shape[0])
for j in range(len(tmp)):
if y_train[j]=='negative':
tmp[j] = 1
elif y_train[j]=='positive':
tmp[j] = 2
y_train = tmp
# Define Classifier
import xgboost as xgb
# classifier = xgb.XGBClassifier(objective="binary:logistic") # for 2 classes
classifier = xgb.XGBClassifier(objective="multi:softmax") # for multiclass
Check vocab length: 1293
# Pipeline using sklearn
model = pipeline.Pipeline([("vectorizer", vectorizer),
("classifier", classifier)])
model["classifier"].fit(X_train, y_train)
X_test = df_test["cleanTxt"].values
# Accessing the classifier directly when making prediction
predicted = model["classifier"].predict(vectorizer.transform(X_test)) # Use transform here
predicted_prob = model["classifier"].predict_proba(vectorizer.transform(X_test)) # Use transform here
tmp = zeros(X_test.shape[0])
for j in range(len(tmp)):
if y_test[j]=='negative':
tmp[j] = 1
elif y_test[j]=='positive':
tmp[j] = 2
y_test = tmp
accuracy = metrics.accuracy_score(y_test, predicted)
# auc = metrics.roc_auc_score(y_test, predicted_prob[:,1]) # only for binary classification
print("Accuracy:", round(accuracy,2))
# print("Auc:", round(auc,2))
print("Detail:")
print(metrics.classification_report(y_test, predicted))
cm = metrics.confusion_matrix(y_test, predicted)
print(cm)
Accuracy: 0.85
Detail:
precision recall f1-score support
0.0 0.89 0.96 0.92 271
1.0 0.81 0.58 0.67 73
2.0 0.75 0.75 0.75 109
accuracy 0.85 453
macro avg 0.82 0.76 0.78 453
weighted avg 0.84 0.85 0.84 453
[[259 3 9]
[ 13 42 18]
[ 20 7 82]]
32.8. Using Embeddings#
Ideally, we want an embedding model that gives us the smallest embedding vector while still working well for the task. The smaller the embedding size, the less compute is required for training as well as inference.
Instead of TFIDF representations of text, we can use embeddings based on Word2Vec, Doc2Vec, etc. Many of these approaches have now been superseded by Transformer-generated embeddings in the class of BERT models. (A Doc2Vec sketch follows below.)
Slides: https://docs.google.com/presentation/d/1xhOocjNJ-6YU_jXPJb_Yi0Vloen65QCyM51S7-Dadfk/edit?usp=sharing
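Since Doc2Vec is mentioned above as a way to get one vector per document, here is a minimal gensim sketch on a made-up two-sentence corpus; the parameters are illustrative only.
# Hedged sketch: Doc2Vec document embeddings with gensim (toy corpus, hypothetical parameters)
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

toy_docs = ["operating profit rose sharply", "net sales decreased from last year"]
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(toy_docs)]

d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
vec = d2v.infer_vector("profit rose from last year".split())  # embedding for a new document
print(vec.shape)  # (50,)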
32.9. Transformers#
Read this amazing short book for a complete introduction to transformers.
# Image("NLP_images/BERT (8).png", width=900)
The main idea of Attention is to modify word vectors so that they have more context.
There is some controversy about when the term “attention” first entered the literature, going way back to the 1990s. For a full and very balanced discussion, see: https://www.turingpost.com/p/attention
Ref: https://jalammar.github.io/illustrated-transformer
See the BertViz library for visualizing attention.
32.10. Transformers - Toy Example#
See also Layer Normalization: https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html
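Each encoder block applies layer normalization after the attention and feed-forward sub-layers; a bare-bones NumPy version (without the learnable scale and shift parameters) is sketched below purely for illustration and is not used in the toy attention example that follows.
# Hedged sketch: layer normalization over the last (embedding) dimension, without learnable gain/bias
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each row (token) to zero mean and unit variance across its embedding dimension
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

x_demo = np.random.rand(2, 4)   # 2 tokens, embedding dimension of 4
print(layer_norm(x_demo))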
# Code to exemplify transformers
# 2 tokens, embedding dimension of 4
# Embedding matrix
x = rand(2,4)
print(f"Input Embeddings:\n {x}")
# Set up weight matrices for query (Q), key (K), and value (V) vectors
wQ = rand(4,3)
wK = rand(4,3)
wV = rand(4,3)
# Generate the queries, keys, and value vectors
q = x.dot(wQ)
k = x.dot(wK)
v = x.dot(wV)
print(f"Query:\n {q}")
print(f"Key:\n {k}")
print(f"Value:\n {v}")
# Self attention for each word
score = q.dot(k.T)
print(f"Score:\n {score}") # each row is self attention for a word
# Divide by Sqrt of query dimension
sqrt_dk = sqrt(len(q[0]))
score = score/sqrt_dk
print(f"Sqrt dk: {sqrt_dk}\nRevised score:\n {score}")
# Softmax
softmax = np.exp(score) / np.sum(np.exp(score), axis=1, keepdims=True)
print(f"Softmax:\n {softmax}") # This gives the self-attention values
# z = Softmax x value
z = softmax.dot(v)
print(f"Z:\n {z}")
# Rescale back to original dimension using multiple heads (in this example we have just one head)
w0 = np.random.rand(3,4)
z = z.dot(w0)
print(f"Output:\n {z}") # Each row to be fed into the separate FFNNs to complete the encoder
Input Embeddings:
[[0.71670422 0.32275089 0.46274821 0.95907814]
[0.62480937 0.57130487 0.63743802 0.578689 ]]
Query:
[[1.18875197 1.33293464 0.75381484]
[1.18690097 1.43200107 0.94781579]]
Key:
[[1.17981314 1.46183186 1.66963792]
[1.11561407 1.32845185 1.40901822]]
Value:
[[1.50858031 0.36449115 1.21013845]
[1.31975521 0.47932063 1.2347729 ]]
Score:
[[4.60962937 4.15906677]
[5.07617534 4.56195762]]
Sqrt dk: 1.7320508075688772
Revised score:
[[2.66137076 2.40123832]
[2.9307312 2.63384746]]
Softmax:
[[0.56466885 0.43533115]
[0.57368054 0.42631946]]
Z:
[[1.42637886 0.41448 1.22086259]
[1.4280805 0.41344519 1.2206406 ]]
Output:
[[1.67894233 1.56187961 2.60523998 2.15525082]
[1.67945849 1.56272558 2.60564054 2.15486878]]
# Multiheads for Attention
# Suppose we have 6 of these Attention results and we put them side by side
# Then we get a matrix of size 2 x 24; let's create a dummy one here:
z = rand(2,24)
# We want to bring this down to a matrix of 2 x 4 in size
# Which means we need to multiply it by a matrix of dimension 24 x 4
wO = rand(24,4)
z = z.dot(wO)
print(f"Output:\n {z}") # Each row to be fed into the separate FFNNs to complete the encoder
Output:
[[6.57705716 6.73567573 5.38936271 5.73943873]
[5.85650206 5.50326831 4.16194226 4.85487406]]
Both schemes shown above are examples of “Self-supervised Learning”.
32.11. Recap of Transformers: https://drive.google.com/file/d/1LrzHTZoXP-PQ2Gzv1IOd4wJOfhtkKkzW/view?usp=sharing#
32.12. Transformers – Other Useful Links#
Video on transformers: https://www.youtube.com/watch?v=bCz4OMemCcA
3Blue1Brown: https://www.youtube.com/watch?v=wjZofJX0v4M (Transformers); https://www.youtube.com/watch?v=eMlx5fFNoYc (Attention)
Build a transformer from scratch: https://medium.com/towards-data-science/build-your-own-transformer-from-scratch-using-pytorch-84c850470dcb
Detailed view of transformers: https://e2eml.school/transformers.html
Understanding and coding attention: https://drive.google.com/file/d/1wzQ8q0gno23v8zoyP6_lCsa-DXg5xbqS/view?usp=sharing
Attention in Transformers: Concepts and Code in PyTorch (from DeepLearning.ai): https://learn.deeplearning.ai/courses/attention-in-transformers-concepts-and-code-in-pytorch/
32.13. BERT (Summary)#
BERT piggybacks on Transformer models; for a technical overview, see: https://drive.google.com/file/d/1G4tEu0SQrYglVIvgRqY17Khlbhvuf-4s/view?usp=sharing
BERT handles context better than static word embeddings. It therefore takes care of polysemy, i.e., the same word meaning different things in different contexts.
BERT is trained using a denoising objective (masked language modeling), where it learns to reconstruct the original sentence from a corrupted (masked) version. The concept is similar to autoencoders. (A quick fill-mask demo is sketched at the end of this section.)
The original BERT also uses a next-sentence prediction objective, but the RoBERTa paper showed that this objective does not help much. BERT is trained on gigabytes of data from various sources (much of it Wikipedia) in an unsupervised fashion.
Google Research and Toyota Technological Institute jointly released a much smaller/smarter Lite Bert called ALBERT. (“ALBERT: A Lite BERT for Self-supervised Learning of Language Representations”). BERT x-large has 1.27 Billion parameters, vs ALBERT x-large with 59 Million parameters! The core architecture of ALBERT is BERT-like in that it uses a transformer encoder architecture, along with GELU activation. It also uses the identical vocabulary size of 30K as used in the original BERT. (V=30,000).
The downside of BERT is compute: you definitely need a GPU.
Will Transformers take over everything in NLP and computer vision? https://www.quantamagazine.org/will-transformers-take-over-artificial-intelligence-20220310/
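To see the masked-language-modeling (denoising) objective described above in action, here is a quick sketch using the Hugging Face fill-mask pipeline; it downloads bert-base-uncased, and the example sentence is made up.
# Hedged sketch: BERT's masked-language-model head filling in a masked token
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The company reported a [MASK] profit for the quarter."):
    print(round(pred["score"], 3), pred["token_str"])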
32.14. BERT with transfer learning#
BERT input has a special structure. Take the sequence of words “The paycheck protection program”; it gets encoded as
CLS | The | paycheck | protection | program | SEP | PAD | PAD | PAD
There are 3 vectors that are generated by this:
(1) Token IDs: these are integers that refer to the vocab index. Some have fixed IDs such as
CLS = 101 (start id)
UNK = 100 (unknown id)
SEP = 102 (end/separator id)
PAD = 0 (padding slots id)
The actual words get their ids from the vocab.
(2) Mask = 1 from CLS through SEP, 0 thereafter. It delineates the text from its padding.
(3) Segment IDs delineate sentences in sentence-pair tasks: tokens of the first sentence get segment 0 and tokens of the second sentence get segment 1. (The code below simply increments the segment ID after each [SEP], so with a single sentence the padding ends up in segment 1.)
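A quick way to inspect all three vectors at once is to call the Hugging Face tokenizer directly, as sketched below; the example pads the illustrative sentence above to length 9, and for a single sentence the tokenizer returns segment (token type) IDs of all zeros.
# Hedged sketch: token IDs, attention mask, and segment (token type) IDs from the BERT tokenizer
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tok("The paycheck protection program", padding="max_length", max_length=9,
          truncation=True, return_token_type_ids=True)
print(enc["input_ids"])       # e.g. [101, ..., 102, 0, 0, 0] -- CLS ... SEP PAD PAD PAD
print(enc["attention_mask"])  # 1 over the real tokens, 0 over the padding
print(enc["token_type_ids"])  # segment ids (all 0 for a single sentence)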
Steps:
tokenize + transform + create embedding
See the first element of the embedding that is generated; it is what is passed to the NN.
import transformers
# Just trying out original BERT
txt = "love the show"
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
input_ids = array(tokenizer.encode(txt))[None,:]
## Use language model to return hidden layer with embeddings
# Changed TFBertModel to TFAutoModel and added from_pt=True
nlp = transformers.TFAutoModel.from_pretrained('bert-base-uncased', from_pt=True)
embedding = nlp(input_ids)
/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
print("Structure of BERT input (size of text + 2):", input_ids)
print("Length of embedding structure:", len(embedding))
print("Shape of first element of embedding:", embedding[0][0].shape) # size of input ids, BERT input vector size
print("Shape of second element of embedding:", embedding[1][0].shape) #
Structure of BERT input (size of text + 2): [[ 101 2293 1996 2265 102]]
Length of embedding structure: 2
Shape of first element of embedding: (5, 768)
Shape of second element of embedding: (768,)
32.15. Use distill BERT from Hugging Face#
https://huggingface.co/docs/transformers/index
tokenizer = transformers.AutoTokenizer.from_pretrained('distilbert-base-uncased')
## Use language model to return hidden layer with embeddings
# Use TFDistilBertModel for DistilBERT
nlp = transformers.TFAutoModel.from_pretrained('distilbert-base-uncased', from_pt=True)
embedding = nlp(input_ids)
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.
print("Structure of BERT input (size of text + 2):", input_ids)
print("Embedding:", embedding)
print("Embedding:", embedding[0])
Structure of BERT input (size of text + 2): [[ 101 2293 1996 2265 102]]
Embedding: TFBaseModelOutput(last_hidden_state=<tf.Tensor: shape=(1, 5, 768), dtype=float32, numpy=
array([[[-0.04657122, -0.13871671, 0.13348138, ..., -0.1272782 ,
0.17813185, 0.24944863],
[ 1.1028556 , 0.26482472, 0.6136154 , ..., -0.33415368,
0.8009101 , 0.11689214],
[-0.16672158, -1.0389844 , 0.2690747 , ..., -0.0717033 ,
0.6851587 , -0.24214913],
[ 0.08075567, -0.7811562 , 0.43119788, ..., 0.17319538,
0.45068687, -0.5403498 ],
[ 0.94134337, 0.20700857, -0.30040848, ..., 0.19120099,
-0.58041805, -0.1728739 ]]], dtype=float32)>, hidden_states=None, attentions=None)
Embedding: tf.Tensor(
[[[-0.04657122 -0.13871671 0.13348138 ... -0.1272782 0.17813185
0.24944863]
[ 1.1028556 0.26482472 0.6136154 ... -0.33415368 0.8009101
0.11689214]
[-0.16672158 -1.0389844 0.2690747 ... -0.0717033 0.6851587
-0.24214913]
[ 0.08075567 -0.7811562 0.43119788 ... 0.17319538 0.45068687
-0.5403498 ]
[ 0.94134337 0.20700857 -0.30040848 ... 0.19120099 -0.58041805
-0.1728739 ]]], shape=(1, 5, 768), dtype=float32)
# Reuse the data from the TFIDF classification problem
corpus = df_train["Text"] # use Text not cleanTxt as we want context and need to keep the sentences as they are
len(corpus)
1811
## Prepare BERT input for the training dataset
max_seq_length = 160 # The longest sentences in the dataset are less than this, adjust as needed
# Create a string of BERT and word tokens
corpus_tokenized = ["[CLS] "+" ".join(tokenizer.tokenize(re.sub(r'[^\w\s]+|\n', '',
str(txt).lower().strip()))[:max_seq_length])+" [SEP] " for txt in corpus]
## 1. Generate index values for the tokens and add padding, then make sure tokens are at max sequence length
txt2seq = [txt + " [PAD]"*(10+max_seq_length-len(txt.split(" "))) for txt in corpus_tokenized] # added 10 for extra padding (hack)
idx = [tokenizer.encode(seq)[1:-1][:max_seq_length] for seq in txt2seq] # Need to drop the first and last element, and set no of token to max_seq_len
## 2. Generate masks
masks = [[1]*len(txt.split(" ")) + [0]*(max_seq_length - len( txt.split(" "))) for txt in corpus_tokenized]
## 3. Generate segments
segments = []
for seq in txt2seq:
temp, i = [], 0
for token in seq.split(" "):
temp.append(i)
if token == "[SEP]":
i += 1
segments.append(temp)
# Finally, put all 3 elements into a feature matrix
X_train = [asarray(idx, dtype='int32'),
asarray(masks, dtype='int32'),
asarray(segments, dtype='int32')]
# X_train is a 3 dimension tensor
print(len(X_train)) # one each for Token ID, Mask, Segment arrays
print(len(X_train[0])) # Size of the training set
print(len(X_train[0][0])) # max sequence length + 2 (for CLS and SEP)
3
1811
160
df_train
|   | Text | Label | cleanTxt |
|---|---|---|---|
| 1030 | Sullivan said some of the boards `` really inv... | neutral | sullivan said board really involve lot work pe... |
| 284 | The company generates net sales of about 600 m... | neutral | company generates net sale mln euro mln annual... |
| 781 | After the reporting period , BioTie North Amer... | positive | reporting period biotie north american licensi... |
| 1935 | The purchase price will be paid in cash upon t... | neutral | purchase price paid cash upon closure transact... |
| 2102 | In Finland , the Bank of +àland reports its op... | negative | finland bank àland report operating profit fel... |
| ... | ... | ... | ... |
| 2117 | Finnish GeoSentric 's net sales decreased to E... | negative | finnish geosentric net sale decreased eur janu... |
| 1798 | The ongoing project where Tekla Structures is ... | neutral | ongoing project tekla structure used vashi exh... |
| 1800 | The order also covers design services , hardwa... | neutral | order also cover design service hardware softw... |
| 913 | 128,538 shares can still be subscribed for wit... | neutral | share still subscribed series e share option max |
| 1892 | The company reported today an operating loss o... | negative | company reported today operating loss eur net ... |
1811 rows × 3 columns
k = randint(len(X_train[0])) # Pick a random sentence, try 302
# print(df_train["Text"][k])
print(len(X_train[0][k]))
print(X_train[0][k]) # Token ids
print(X_train[1][k]) # mask
print(X_train[2][k]) # segment
160
[ 101 1996 2194 2036 2805 26947 1001 1001 1042 21701 2132 1997
1996 2569 5127 3131 2029 2950 1996 24209 1001 1001 2396 1001
1001 1051 5127 3197 1999 2139 1001 1001 16216 2099 1001 1001
2005 1001 1001 1055 4701 1998 2047 3317 3915 1996 3131 1999
2097 1001 1001 22564 2762 2004 2092 2004 2811 1001 1001 5127
1998 3653 1001 1001 6904 1001 1001 1038 1999 13642 1001 1001
2358 2050 1998 5127 2326 2803 13649 1999 2139 1001 1001 16216
2099 1001 1001 2005 1001 1001 1055 102 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
32.16. Fine-Tuning#
We are using DistilBERT below, which only requires the token IDs and masks (not segments).
# Build up model
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
import transformers
import numpy as np
## inputs
idx = layers.Input((max_seq_length,), dtype="int32", name="input_idx")
masks = layers.Input((max_seq_length,), dtype="int32", name="input_masks")
segments = layers.Input((max_seq_length,), dtype="int32", name="input_segments")
## pre-trained bert with config
config = transformers.DistilBertConfig(dropout=0.2, attention_dropout=0.2)
config.output_hidden_states = False
nlp = transformers.TFAutoModel.from_pretrained('distilbert-base-uncased', config=config, from_pt=True)
# Wrap the DistilBERT call in a Lambda layer and specify output_shape
bert_out = layers.Lambda(
lambda x: nlp(x[0], attention_mask=x[1])[0],
output_shape=(max_seq_length, nlp.config.hidden_size) # Add output_shape
)([idx, masks])
## fine-tuning
x = layers.GlobalAveragePooling1D()(bert_out)
x = layers.Dense(64, activation="relu")(x)
y_out = layers.Dense(len(np.unique(y_train)),activation='softmax')(x)
## compile
model = models.Model([idx, masks], y_out)
for layer in model.layers[:3]:
layer.trainable = False
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.
Model: "functional_1"
| Layer (type) | Output Shape | Param # | Connected to |
|---|---|---|---|
| input_idx (InputLayer) | (None, 160) | 0 | - |
| input_masks (InputLayer) | (None, 160) | 0 | - |
| lambda_1 (Lambda) | (None, 160, 768) | 0 | input_idx[0][0], input_masks[0][0] |
| global_average_poo… (GlobalAveragePool…) | (None, 768) | 0 | lambda_1[0][0] |
| dense_2 (Dense) | (None, 64) | 49,216 | global_average_p… |
| dense_3 (Dense) | (None, 3) | 195 | dense_2[0][0] |
Total params: 49,411 (193.01 KB)
Trainable params: 49,411 (193.01 KB)
Non-trainable params: 0 (0.00 B)
plot_model(model) # option to_file=
# Create label y
dic_y_mapping = {0:'neutral', 1:'negative', 2:'positive'}
# Create inverse_dic mapping string labels to numerical labels
inverse_dic = {'neutral': 0, 'negative': 1, 'positive': 2}
print(len(X_train[1]))
print(unique(y_train))
1811
[0. 1. 2.]
%%time
## train
# Convert y_train to numerical labels just before training
y_train_numerical = np.array([inverse_dic[y] for y in df_train["Label"].values])
training = model.fit(x=X_train[:2], y=y_train_numerical, batch_size=32, epochs=10, shuffle=True, verbose=1, validation_split=0.3)
Epoch 1/10
40/40 ━━━━━━━━━━━━━━━━━━━━ 23s 330ms/step - accuracy: 0.5850 - loss: 0.8931 - val_accuracy: 0.7702 - val_loss: 0.5590
Epoch 2/10
40/40 ━━━━━━━━━━━━━━━━━━━━ 7s 178ms/step - accuracy: 0.7570 - loss: 0.5566 - val_accuracy: 0.7831 - val_loss: 0.4982
Epoch 3/10
40/40 ━━━━━━━━━━━━━━━━━━━━ 7s 179ms/step - accuracy: 0.7850 - loss: 0.4885 - val_accuracy: 0.8364 - val_loss: 0.4197
Epoch 4/10
40/40 ━━━━━━━━━━━━━━━━━━━━ 7s 182ms/step - accuracy: 0.8473 - loss: 0.3879 - val_accuracy: 0.8327 - val_loss: 0.4295
Epoch 5/10
40/40 ━━━━━━━━━━━━━━━━━━━━ 7s 183ms/step - accuracy: 0.8439 - loss: 0.3855 - val_accuracy: 0.8529 - val_loss: 0.3704
Epoch 6/10
40/40 ━━━━━━━━━━━━━━━━━━━━ 7s 186ms/step - accuracy: 0.8590 - loss: 0.3527 - val_accuracy: 0.8511 - val_loss: 0.3633
Epoch 7/10
40/40 ━━━━━━━━━━━━━━━━━━━━ 7s 184ms/step - accuracy: 0.8893 - loss: 0.3118 - val_accuracy: 0.8493 - val_loss: 0.3623
Epoch 8/10
40/40 ━━━━━━━━━━━━━━━━━━━━ 7s 186ms/step - accuracy: 0.8782 - loss: 0.3017 - val_accuracy: 0.8585 - val_loss: 0.3579
Epoch 9/10
40/40 ━━━━━━━━━━━━━━━━━━━━ 7s 187ms/step - accuracy: 0.8748 - loss: 0.3115 - val_accuracy: 0.8364 - val_loss: 0.3569
Epoch 10/10
40/40 ━━━━━━━━━━━━━━━━━━━━ 7s 188ms/step - accuracy: 0.8854 - loss: 0.2980 - val_accuracy: 0.8676 - val_loss: 0.3510
CPU times: user 23.2 s, sys: 2.64 s, total: 25.9 s
Wall time: 1min 28s
# train labels
predicted_prob = model.predict(X_train[:2])
dic_y_mapping = {0:'neutral', 1:'negative', 2:'positive'} # for the financial phrase bank dataset
predicted = [dic_y_mapping[argmax(pred)] for pred in predicted_prob]
57/57 ━━━━━━━━━━━━━━━━━━━━ 12s 178ms/step
# y_train was already converted to numeric labels (0 = neutral, 1 = negative, 2 = positive)
# in the TFIDF section above, so it can be compared directly with the numeric predictions below.
# Convert predicted labels from strings to numerical labels
predicted_numerical = [inverse_dic[pred] for pred in predicted]
accuracy = metrics.accuracy_score(y_train, predicted_numerical)
# auc = metrics.roc_auc_score(y_test, predicted_prob[:,1]) # only for binary classification
print("Accuracy:", round(accuracy,2))
# print("Auc:", round(auc,2))
print("Detail:")
print(metrics.classification_report(y_train, predicted_numerical))
cm = metrics.confusion_matrix(y_train, predicted_numerical)
print(cm)
Accuracy: 0.89
Detail:
precision recall f1-score support
0.0 0.95 0.95 0.95 1120
1.0 0.69 0.92 0.79 230
2.0 0.88 0.73 0.80 461
accuracy 0.89 1811
macro avg 0.84 0.87 0.85 1811
weighted avg 0.90 0.89 0.89 1811
[[1066 19 35]
[ 8 211 11]
[ 48 75 338]]
TFIDF accuracy = 80-85%
Word2Vec accuracy = 70-75%
BERT accuracy = 82-89%
32.17. REFERENCES#
Using BERT for the first time (by J Alammar): http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/; code: https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb
Stanford Sentiment Treebank: https://nlp.stanford.edu/sentiment/index.html
32.18. From the reference above, using the SST dataset (movie reviews)#
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')
df_large = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)
print(df_large.shape)
df_large.head()
(6920, 2)
|   | 0 | 1 |
|---|---|---|
| 0 | a stirring , funny and finally transporting re... | 1 |
| 1 | apparently reassembled from the cutting room f... | 0 |
| 2 | they presume their audience wo n't sit still f... | 0 |
| 3 | this is a visually stunning rumination on love... | 1 |
| 4 | jonathan parker 's bartleby should have been t... | 1 |
# Is the dataset balanced in labels?
df = df_large[:1500] # take a small subset
df[1].value_counts()
| 1 (label) | count |
|---|---|
| 1 | 782 |
| 0 | 718 |
32.19. Get the pre-trained model#
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')
## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)
# Get tokenized version
tokenized = df[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
print(tokenized.shape)
tokenized[0]
(1500,)
[101,
1037,
18385,
1010,
6057,
1998,
2633,
18276,
2128,
16603,
1997,
5053,
1998,
1996,
6841,
1998,
5687,
5469,
3152,
102]
32.20. Construct token IDs and masks#
# Add padding and set max len to the longest entry in the dataset
max_len = 0
for i in tokenized.values:
if len(i) > max_len:
max_len = len(i)
padded = array([i + [0]*(max_len-len(i)) for i in tokenized.values])
print(array(padded).shape)
padded[:3]
(1500, 59)
array([[ 101, 1037, 18385, 1010, 6057, 1998, 2633, 18276, 2128,
16603, 1997, 5053, 1998, 1996, 6841, 1998, 5687, 5469,
3152, 102, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0],
[ 101, 4593, 2128, 27241, 23931, 2013, 1996, 6276, 2282,
2723, 1997, 2151, 2445, 12217, 7815, 102, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0],
[ 101, 2027, 3653, 23545, 2037, 4378, 24185, 1050, 1005,
1056, 4133, 2145, 2005, 1037, 11507, 10800, 1010, 2174,
14036, 2135, 3591, 1010, 2061, 2027, 19817, 4140, 2041,
1996, 7511, 2671, 4349, 3787, 1997, 11829, 7168, 9219,
1998, 28971, 2308, 1999, 8301, 8737, 2100, 4253, 102,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0]])
# Add a mask to let BERT know where the real tokens are and not the padding
# Essentially we can use zero for the padding mask so those tokens do not compute
attention_mask = where(padded != 0, 1, 0)
print(attention_mask.shape)
attention_mask[:3]
(1500, 59)
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
32.21. Run token IDs and attention masks through BERT to get embeddings#
%%time
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)
with torch.no_grad():
last_hidden_states = model(input_ids, attention_mask=attention_mask) #embeddings
CPU times: user 2min 3s, sys: 15.1 s, total: 2min 18s
Wall time: 2min 47s
# Embeddings from BERT
print(len(last_hidden_states[0][:,0,:]))
last_hidden_states[0][:,0,:]
1500
tensor([[-0.2159, -0.1403, 0.0083, ..., -0.1369, 0.5867, 0.2011],
[-0.1726, -0.1448, 0.0022, ..., -0.1744, 0.2139, 0.3720],
[-0.0506, 0.0720, -0.0296, ..., -0.0715, 0.7185, 0.2623],
...,
[ 0.0062, 0.0426, -0.1080, ..., -0.0417, 0.6836, 0.3451],
[ 0.0087, 0.0605, -0.3309, ..., -0.2005, 0.6268, 0.1546],
[-0.2395, -0.1362, 0.0463, ..., -0.0285, 0.2219, 0.3242]])
# Collect the CLS embedding and labels to set up the classification task
features = last_hidden_states[0][:,0,:].numpy()
labels = df[1]
print(features.shape)
(1500, 768)
32.22. Use the BERT transformed dataset for machine learning as usual#
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)
lr_clf.score(test_features, test_labels)
0.8186666666666667
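Since GridSearchCV was imported above, a natural next step is to tune the logistic regression's regularization strength on the BERT features; the grid below is a hypothetical illustration.
# Hedged sketch: tune LogisticRegression's C on the BERT features (hypothetical parameter grid)
grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": np.linspace(0.01, 100, 20)}, cv=5)
grid.fit(train_features, train_labels)
print("best C:", grid.best_params_, "best CV accuracy:", round(grid.best_score_, 3))
print("test accuracy:", round(grid.score(test_features, test_labels), 3))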
32.23. Speed#
As you can see, BERT runs slowly, so it is good to use a machine with GPUs. For more on computation speed, see: https://blog.inten.to/speeding-up-bert-5528e18bb4ea
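If a GPU is available, moving the PyTorch model and the input tensors onto it speeds up the embedding step considerably; below is a minimal sketch reusing the model, input_ids, and attention_mask from above.
# Hedged sketch: run the DistilBERT forward pass on a GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_gpu = model.to(device)
with torch.no_grad():
    out = model_gpu(input_ids.to(device), attention_mask=attention_mask.to(device))
features_gpu = out[0][:, 0, :].cpu().numpy()   # CLS embeddings moved back to the CPU
print(device, features_gpu.shape)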