17. PyTorch Basics and Neural Nets#
PyTorch is Meta’s deep learning framework; it is also useful for general tensor mathematics and automatic differentiation (taking gradients). It is extremely versatile and a useful alternative to TensorFlow and Keras.
In this notebook, we review a few basics of PyTorch. Detailed examples are available from the site: https://pytorch.org
Installation: https://pytorch.org/get-started/locally/
Quick tutorial: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html (this is a great and quick read)
from google.colab import drive
drive.mount('/content/drive')  # Add My Drive/<>
import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/NLPBook/')
Mounted at /content/drive
%%capture
# %pylab inline loads numpy and matplotlib into the interactive namespace
%pylab inline
import pandas as pd
import os
%%capture
# If torch is not installed by default, uncomment the line below to install it:
# !pip install torch torchvision
import torch
17.1. Tensors#
Just as NumPy has arrays, PyTorch has its own array type. Since it can have any number of dimensions, it is better to call it a tensor.
Here are some basic operations on torch tensors just to get an idea of the syntax. The parallels with NumPy make this easy.
a = torch.tensor(4)
print(type(a))
print(a+a)
<class 'torch.Tensor'>
tensor(8)
# Casting from ndarray to torch tensor
b = array(8) # numpy array
a = torch.tensor(b)
print(type(a))
print(a+a)
a = torch.from_numpy(b)
print(type(a))
print(a+a)
<class 'torch.Tensor'>
tensor(16)
<class 'torch.Tensor'>
tensor(16)
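The two casts above differ in one respect that is easy to miss. The following is a minimal sketch, with illustrative variable names, assuming the usual behavior that `torch.tensor` copies the data while `torch.from_numpy` shares memory with the NumPy array.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])
copied = torch.tensor(arr)       # makes an independent copy of the data
shared = torch.from_numpy(arr)   # shares memory with arr

arr[0] = 100.0                   # modify the NumPy array in place
print(copied)  # first element is still 1.0
print(shared)  # first element is now 100.0
```

If the NumPy array will be modified later, the copy is the safer choice; the shared version avoids duplicating large arrays.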
# Higher dimensions
b = rand(30).reshape((5,2,3))
print(type(b))
print(b.shape)
a = torch.tensor(b)
print(type(a))
print(a)
print(a.shape)
<class 'numpy.ndarray'>
(5, 2, 3)
<class 'torch.Tensor'>
tensor([[[0.2615, 0.3925, 0.0420],
         [0.1419, 0.2679, 0.8675]],
        [[0.6362, 0.7969, 0.3467],
         [0.2474, 0.8218, 0.2374]],
        [[0.0526, 0.6614, 0.9680],
         [0.1531, 0.6672, 0.3134]],
        [[0.7777, 0.7695, 0.8015],
         [0.4038, 0.1718, 0.1386]],
        [[0.0048, 0.8072, 0.9854],
         [0.7540, 0.0512, 0.2410]]], dtype=torch.float64)
torch.Size([5, 2, 3])
# Using random number generators
a = torch.rand((5,2,3))
print(a)
tensor([[[0.6842, 0.1975, 0.2352],
         [0.5127, 0.5446, 0.8785]],
        [[0.7669, 0.1932, 0.4483],
         [0.2681, 0.3545, 0.1566]],
        [[0.0888, 0.0166, 0.2399],
         [0.0748, 0.5597, 0.2346]],
        [[0.8537, 0.6162, 0.2169],
         [0.7025, 0.0897, 0.4324]],
        [[0.6458, 0.4631, 0.8967],
         [0.0569, 0.5493, 0.2178]]])
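Note that the tensor built from a NumPy array above reports `dtype=torch.float64`, while the output of `torch.rand` shows no dtype because it uses PyTorch's default, float32. The sketch below (variable names are illustrative, and it assumes the default dtype has not been changed) shows the difference and one way to bring both tensors to a common dtype before a matrix product.

```python
import numpy as np
import torch

from_numpy = torch.tensor(np.random.rand(2, 3))  # NumPy default is float64
from_torch = torch.rand(2, 3)                    # PyTorch default is float32
print(from_numpy.dtype, from_torch.dtype)

# Convert to a common dtype before combining the two in a matrix product
product = torch.mm(from_numpy.float(), from_torch.t())
print(product.shape)
```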
17.2. Matrix operations#
a = [[2.0,3],[1,4]]
b = [[5.0,8]]
c = torch.rand((2,2))
a = torch.tensor(a)
b = torch.tensor(b)
b = torch.t(b)
print(a)
print(b)
print(c)
print(torch.transpose(c,0,1)) # more useful for tensors with more than 2 dimensions
tensor([[2., 3.],
        [1., 4.]])
tensor([[5.],
        [8.]])
tensor([[0.5315, 0.6161],
        [0.6049, 0.9154]])
tensor([[0.5315, 0.6049],
        [0.6161, 0.9154]])
print(torch.transpose(c,1,0))
tensor([[0.5315, 0.6049],
        [0.6161, 0.9154]])
# Addition
print(a+c)
print(torch.add(a,c))
tensor([[2.5315, 3.6161],
        [1.6049, 4.9154]])
tensor([[2.5315, 3.6161],
        [1.6049, 4.9154]])
# Subtraction
print(a-c)
print(torch.sub(a,c))
tensor([[1.4685, 2.3839],
        [0.3951, 3.0846]])
tensor([[1.4685, 2.3839],
        [0.3951, 3.0846]])
# Multiplication
print(a*c) # elementwise
print(torch.mm(a,c)) # matrix multiply
print(torch.mm(a,b)) # b was transposed to shape (2,1) above; torch.mm needs 2-D operands with matching inner dimensions
print(torch.mm(torch.mm(a,c),b))
tensor([[1.0631, 1.8482],
        [0.6049, 3.6615]])
tensor([[2.8777, 3.9782],
        [2.9511, 4.2775]])
tensor([[34.],
        [37.]])
tensor([[46.2146],
        [48.9756]])
# Much easier
a.mm(c).mm(b)
tensor([[46.2146],
        [48.9756]])
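As an alternative to `torch.mm`, the `@` operator (equivalent to `torch.matmul`) also accepts 1-D vectors, so no explicit transpose is needed for a matrix-vector product. A small sketch using the same numbers as above; the name `v` is illustrative.

```python
import torch

a = torch.tensor([[2.0, 3.0], [1.0, 4.0]])
v = torch.tensor([5.0, 8.0])        # 1-D vector, shape (2,)

print(a @ v)                        # matrix-vector product, result has shape (2,)
print(torch.matmul(a, v))           # same as a @ v
print(a @ v.reshape(2, 1))          # explicit column vector, result has shape (2, 1)
```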
17.3. Reshaping and joining tensors#
a = torch.rand(4,3)
b = torch.randn(3,4)
print(torch.cat((a,torch.t(b))))
print(torch.cat((a,torch.t(b)),dim=1))
tensor([[ 0.8530,  0.8253,  0.3404],
        [ 0.4164,  0.9495,  0.0584],
        [ 0.9934,  0.4807,  0.6694],
        [ 0.3504,  0.3575,  0.3691],
        [-0.8873, -1.3509, -0.2681],
        [ 0.6545, -0.3850,  1.2115],
        [-0.0502,  0.8647, -0.0428],
        [ 0.1123, -1.3075, -0.8425]])
tensor([[ 0.8530,  0.8253,  0.3404, -0.8873, -1.3509, -0.2681],
        [ 0.4164,  0.9495,  0.0584,  0.6545, -0.3850,  1.2115],
        [ 0.9934,  0.4807,  0.6694, -0.0502,  0.8647, -0.0428],
        [ 0.3504,  0.3575,  0.3691,  0.1123, -1.3075, -0.8425]])
torch.reshape(b,(4,3))
tensor([[-0.8873,  0.6545, -0.0502],
        [ 0.1123, -1.3509, -0.3850],
        [ 0.8647, -1.3075, -0.2681],
        [ 1.2115, -0.0428, -0.8425]])
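Reshaping is not the same as transposing, even though both can turn a 3 x 4 tensor into a 4 x 3 one. A short sketch with an integer tensor (chosen here only to make the element order easy to read):

```python
import torch

b = torch.arange(12).reshape(3, 4)
print(b.reshape(4, 3))  # refills a 4x3 layout in row-major order
print(b.t())            # swaps rows and columns
```

The first row of the reshaped tensor is `[0, 1, 2]`, whereas the first row of the transpose is `[0, 4, 8]`.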
17.4. Vector/Matrix Calculus#
A look at taking derivatives (gradients).
PyTorch uses automatic differentiation: if we have a function \(f(x)\) where \(x\) is a tensor and we set requires_grad=True on \(x\), then PyTorch records every operation applied to \(x\) (in a computation graph) and uses that record to compute the derivatives of those functions w.r.t. \(x\).
Note that if \(x\) is a vector of dimension \(d\), and \(f(x)\) is a scalar function of \(x\), then the derivative \(\frac{\partial f}{\partial x}\) will be of dimension \(d\).
See: https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
See also: Jason Brownlee, Calculating Derivatives in PyTorch
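Before the matrix examples, here is a minimal sketch of the idea for a simple scalar function, \(f(x) = \sum_i x_i^2\), whose gradient is \(2x\). The function is chosen only for illustration.

```python
import torch

x = torch.tensor([2.0, 3.0, 4.0], requires_grad=True)
f = (x ** 2).sum()   # scalar function of a 3-vector
f.backward()          # populates x.grad with df/dx
print(x.grad)         # analytically, df/dx = 2x
```

The gradient has the same dimension as \(x\), as noted above.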
%%time
x = torch.tensor([[2.,3.,4.]], requires_grad=True) # make sure this is a matrix [[]]
a = torch.rand((5,3)) # this is automatically a matrix
print("a =", a)
f = torch.mm(a,torch.t(x)).sum() # mm gives a 5x1 column; sum() reduces it to a scalar
print("f =", f)
# external_grad = torch.tensor([1., 1., 1.])
# f.backward(gradient=external_grad)
# f.backward() # sum f to get a scalar and then differentiate wrt x, gradients are stored in object x
# print("Gradient =", x.grad) # since there are 3 values in the x vector, we will get 3 derivatives
a = tensor([[0.1934, 0.6820, 0.1840],
        [0.8891, 0.4797, 0.5202],
        [0.7871, 0.7764, 0.4398],
        [0.5652, 0.0170, 0.7704],
        [0.1034, 0.2405, 0.5148]])
f = tensor(21.3798, grad_fn=<SumBackward0>)
CPU times: user 3.56 ms, sys: 0 ns, total: 3.56 ms
Wall time: 7.32 ms
Next, we call backward() to obtain the gradient from autograd, and then check it against a manual finite-difference approximation.
%%time
f.backward() # f is already a scalar; the gradients w.r.t. x are stored in x.grad
print("Gradient =", x.grad) # since there are 3 values in the x vector, we will get 3 derivatives
Gradient = tensor([[2.5382, 2.1956, 2.4291]])
CPU times: user 1.59 ms, sys: 1.68 ms, total: 3.27 ms
Wall time: 6.62 ms
%%time
a1 = a.numpy().copy()
dx = 0.1
for j in range(3):
    x1 = x.detach().numpy().copy() # need to detach gradient vector, and also make a copy, else x will be updated
    x1[0,j] = x1[0,j] + dx
    f1 = a1.dot(x1.T).sum()
    print((f1-f)/dx)
tensor(2.5382, grad_fn=<DivBackward0>)
tensor(2.1956, grad_fn=<DivBackward0>)
tensor(2.4292, grad_fn=<DivBackward0>)
CPU times: user 3.33 ms, sys: 76 µs, total: 3.41 ms
Wall time: 4.1 ms
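Because this \(f\) is linear in \(x\), the gradient can also be written down directly: \(\partial f/\partial x_j = \sum_i a_{ij}\), the column sums of \(a\). A sketch of that check follows; the seed and the name `analytic` are illustrative, so the numbers will differ from those printed above.

```python
import torch

torch.manual_seed(0)
a = torch.rand((5, 3))
x = torch.tensor([[2.0, 3.0, 4.0]], requires_grad=True)

f = torch.mm(a, x.t()).sum()
f.backward()

analytic = a.sum(dim=0, keepdim=True)  # d/dx_j of sum_i sum_j a_ij x_j = sum_i a_ij
print(x.grad)
print(analytic)
```

Linearity is also why the finite-difference estimate agrees so closely even with a step as large as 0.1.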
x = torch.tensor([[2.,3.,4.]], requires_grad=True) # make sure this is a matrix [[]]
a = torch.rand((3,3)) # this is automatically a matrix
print("a =", a)
f = torch.mm(x,torch.mm(a,torch.t(x))) # quadratic form x a x^T; f is a 1x1 matrix
print("f =", f)
# external_grad = torch.tensor([1., 1., 1.])
# f.backward(gradient=external_grad)
f.backward() # f has a single element, so backward() needs no argument; gradients are stored in x.grad
print("Gradient =", x.grad) # since there are 3 values in the x vector, we will get 3 derivatives
a = tensor([[0.4253, 0.2944, 0.8337],
        [0.5360, 0.6568, 0.0366],
        [0.8111, 0.0515, 0.2004]])
f = tensor([[30.0181]], grad_fn=<MmBackward0>)
Gradient = tensor([[10.7720,  5.9540,  5.1576]])
f = torch.mm(torch.mm(x,torch.mm(a,torch.t(x))),torch.mm(x,torch.t(x)))
# Note: x.grad was not reset, so this backward() adds to the gradient computed above
f.backward()
print("Gradient =", x.grad)
Gradient = tensor([[443.2314, 358.7285, 394.8722]])
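PyTorch accumulates gradients in `x.grad` across calls to `backward()`, so the value printed above includes the gradient of the earlier quadratic form. The sketch below, with an illustrative seed and variable names, resets the gradient between the two calls to isolate the gradient of the second function.

```python
import torch

torch.manual_seed(0)
a = torch.rand((3, 3))
x = torch.tensor([[2.0, 3.0, 4.0]], requires_grad=True)

f1 = torch.mm(x, torch.mm(a, x.t()))
f1.backward()
first = x.grad.clone()

x.grad.zero_()  # reset, otherwise the next backward() adds to x.grad
f2 = torch.mm(torch.mm(x, torch.mm(a, x.t())), torch.mm(x, x.t()))
f2.backward()
second = x.grad.clone()

print("Gradient of the second function alone =", second)
print("Value x.grad would hold without the reset =", first + second)
```

This is the same reason the training loop later calls `optimizer.zero_grad()` on every step.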
# Repeat the example, but instead of summing, keep each element of f separate and pass external gradients,
# so f need not be a scalar
x = torch.tensor([[2.,3.,4.]], requires_grad=True)
f = torch.mm(a,torch.t(x))
print("f =", f) # f is a vector
external_grad = torch.t(torch.tensor([[1., 1., 1.]]))
f.backward(gradient=external_grad)
# f.sum().backward()
print("Gradient =", x.grad)
f = tensor([[5.0686],
        [3.1887],
        [2.5787]], grad_fn=<MmBackward0>)
Gradient = tensor([[1.7725, 1.0027, 1.0707]])
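The external gradient acts as a set of weights on the elements of \(f\): calling `f.backward(gradient=v)` gives the gradient of \(\sum_i v_i f_i\). With all weights equal to one, this matches `f.sum().backward()`. A sketch with non-unit weights follows; the weights and seed are illustrative.

```python
import torch

torch.manual_seed(0)
a = torch.rand((3, 3))
x = torch.tensor([[2.0, 3.0, 4.0]], requires_grad=True)

f = torch.mm(a, x.t())                     # 3x1 column, not a scalar
v = torch.tensor([[1.0], [0.5], [2.0]])    # illustrative weights, same shape as f
f.backward(gradient=v)                     # computes the gradient of (v * f).sum()

print(x.grad)
print(torch.mm(v.t(), a))                  # analytic result: v^T a
```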
# Example 1 from PyTorch documentation
a = torch.tensor([2., 3., 5.], requires_grad=True)
b = torch.tensor([6., 4., 2.], requires_grad=True)
Q = 3*a**3 - b**2
print("Q =", Q)
external_grad = torch.tensor([1., 1., 1.])
Q.backward(gradient=external_grad)
# check if collected gradients are correct
print(a.grad, b.grad)
print(9*a**2 == a.grad)
print(-2*b == b.grad)
Q = tensor([-12.,  65., 371.], grad_fn=<SubBackward0>)
tensor([ 36.,  81., 225.]) tensor([-12.,  -8.,  -4.])
tensor([True, True, True])
tensor([True, True, True])
# Example 2 from PyTorch documentation
a = torch.tensor([2., 3., 5.], requires_grad=True)
b = torch.tensor([6., 4., 2.], requires_grad=True)
Q = 3*a**3 - b**2
print(Q)
Q.sum().backward()
# check if collected gradients are correct
print(a.grad, b.grad)
print(9*a**2 == a.grad)
print(-2*b == b.grad)
tensor([-12.,  65., 371.], grad_fn=<SubBackward0>)
tensor([ 36.,  81., 225.]) tensor([-12.,  -8.,  -4.])
tensor([True, True, True])
tensor([True, True, True])
17.5. Reprise of the MNIST digit classification problem#
Recall that each MNIST digit occupies a pixel frame of size 28 x 28 = 784 pixels. This is the input size. The labels are the digits 0,1,2,…,9, so this is a 10-class classification problem.
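Since the network takes a vector of 784 values per image, each 28 x 28 frame is flattened before it is fed in. A small sketch with a random tensor standing in for a batch of images (the batch size of 100 is an assumption for illustration):

```python
import torch

batch = torch.rand(100, 1, 28, 28)     # stand-in for a batch of MNIST images
flat = batch.reshape(-1, 28 * 28)      # one row of 784 pixel values per image
print(flat.shape)                      # torch.Size([100, 784])
```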
# importing the libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
# Sigmoid function
def sigmoid(x):
    return 1/(1 + torch.exp(-x))
# Derivative of the sigmoid function
def derivatives_sigmoid(x):
    return sigmoid(x) * (1 - sigmoid(x))
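As a quick sanity check, the hand-written functions can be compared with PyTorch's built-in `torch.sigmoid` and with the derivative obtained from autograd. This is a sketch; the test points in `z` are arbitrary.

```python
import torch

def sigmoid(x):
    return 1 / (1 + torch.exp(-x))

def derivatives_sigmoid(x):
    return sigmoid(x) * (1 - sigmoid(x))

z = torch.linspace(-3, 3, 7, requires_grad=True)
sigmoid(z).sum().backward()    # autograd derivative of each sigmoid(z_i) w.r.t. z_i

zd = z.detach()                # plain values, without gradient tracking
print(torch.allclose(sigmoid(zd), torch.sigmoid(zd)))
print(torch.allclose(z.grad, derivatives_sigmoid(zd)))
```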
# Import all the required PyTorch libraries
import torch
import torchvision
import torch.nn as nn
import torchvision.transforms as transforms
# Use a GPU if available, else use a CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Device =", device)
# Hyper-parameters
input_size = 784
hidden_size = 500 # number of neurons in the hidden layer
num_classes = 10
num_epochs = 5
batch_size = 100
learning_rate = 0.001
# Define a fully connected neural network with one hidden layer
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        # The super() function is used to give access to methods and properties of a parent or sibling class.
        # The super() function returns an object that represents the parent class.
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size) # fully connected input and hidden layer 1
        self.relu = nn.ReLU() # type is ReLU
        self.fc2 = nn.Linear(hidden_size, num_classes) # fully connected hidden layer and output layer
    # forward pass that takes input and produces output
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out
# Define the model
model = NeuralNet(input_size, hidden_size, num_classes).to(device) # push it to cpu/gpu (using related RAM)
# Loss and optimizer (Adam)
criterion = nn.CrossEntropyLoss() # This will do both softmax and cross-entropy loss in one
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
Device = cpu
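The comment on `nn.CrossEntropyLoss` explains why the network's forward pass returns raw scores (logits) with no softmax layer. The sketch below checks this on random logits by comparing against an explicit log-softmax followed by the negative log-likelihood loss; the logits and targets are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)             # raw network outputs for 4 samples, 10 classes
targets = torch.tensor([3, 0, 9, 1])    # illustrative class labels

loss_ce = nn.CrossEntropyLoss()(logits, targets)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(loss_ce.item(), loss_manual.item())
```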
# MNIST dataset (images and labels)
train_dataset = torchvision.datasets.MNIST(root='../../data',
                                           train=True,
                                           transform=transforms.ToTensor(),
                                           download=True)
test_dataset = torchvision.datasets.MNIST(root='../../data',
                                          train=False,
                                          transform=transforms.ToTensor())
# Data loader (input pipeline)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=10000,
                                          shuffle=False)
print(train_dataset.data.shape)
print(test_dataset.data.shape)
torch.Size([60000, 28, 28])
torch.Size([10000, 28, 28])
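To see what a data loader yields without downloading MNIST, here is a sketch using a small synthetic dataset. The `TensorDataset` of zeros and its size of 600 are stand-ins chosen for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

images = torch.zeros(600, 1, 28, 28)          # synthetic stand-in for MNIST images
labels = torch.zeros(600, dtype=torch.long)   # synthetic labels
loader = DataLoader(TensorDataset(images, labels), batch_size=100, shuffle=True)

print(len(loader))                 # number of batches per epoch: 600 / 100 = 6
batch_images, batch_labels = next(iter(loader))
print(batch_images.shape, batch_labels.shape)
```

By the same arithmetic, 60000 training images with a batch size of 100 give 600 steps per epoch.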
%%time
# Train the model
total_step = len(train_loader)
print("Total steps =", total_step)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Move tensors to the configured device
        images = images.reshape(-1, 28*28).to(device)
        labels = labels.to(device)
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        # Backward and optimize
        optimizer.zero_grad() # zero out all gradients
        loss.backward() # calculate gradients
        optimizer.step() # update weights
        if (i+1) % 100 == 0: # log every 100 steps (an interval independent of batch_size)
            print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))
Total steps = 600
Epoch [1/5], Step [100/600], Loss: 0.4732
Epoch [1/5], Step [200/600], Loss: 0.5229
Epoch [1/5], Step [300/600], Loss: 0.2170
Epoch [1/5], Step [400/600], Loss: 0.1897
Epoch [1/5], Step [500/600], Loss: 0.1795
Epoch [1/5], Step [600/600], Loss: 0.1855
Epoch [2/5], Step [100/600], Loss: 0.2252
Epoch [2/5], Step [200/600], Loss: 0.0933
Epoch [2/5], Step [300/600], Loss: 0.0508
Epoch [2/5], Step [400/600], Loss: 0.2727
Epoch [2/5], Step [500/600], Loss: 0.0973
Epoch [2/5], Step [600/600], Loss: 0.1390
Epoch [3/5], Step [100/600], Loss: 0.0574
Epoch [3/5], Step [200/600], Loss: 0.1147
Epoch [3/5], Step [300/600], Loss: 0.0984
Epoch [3/5], Step [400/600], Loss: 0.0585
Epoch [3/5], Step [500/600], Loss: 0.0536
Epoch [3/5], Step [600/600], Loss: 0.0882
Epoch [4/5], Step [100/600], Loss: 0.0211
Epoch [4/5], Step [200/600], Loss: 0.0223
Epoch [4/5], Step [300/600], Loss: 0.0185
Epoch [4/5], Step [400/600], Loss: 0.0679
Epoch [4/5], Step [500/600], Loss: 0.0287
Epoch [4/5], Step [600/600], Loss: 0.0356
Epoch [5/5], Step [100/600], Loss: 0.0887
Epoch [5/5], Step [200/600], Loss: 0.0248
Epoch [5/5], Step [300/600], Loss: 0.0591
Epoch [5/5], Step [400/600], Loss: 0.0236
Epoch [5/5], Step [500/600], Loss: 0.0438
Epoch [5/5], Step [600/600], Loss: 0.0327
CPU times: user 1min 2s, sys: 52.2 ms, total: 1min 2s
Wall time: 1min 4s
%%time
# Test the model
# In test phase, we don't need to compute gradients (for memory efficiency)
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, 28*28).to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1) # index of the largest score is the predicted class
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    print('Accuracy of the network on the 10000 test images: {} %'.format(100 * correct / total))
# Save the model checkpoint
torch.save(model.state_dict(), 'model.ckpt')
Accuracy of the network on the 10000 test images: 97.52 %
CPU times: user 1.64 s, sys: 88.8 ms, total: 1.73 s
Wall time: 2.18 s
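A saved `state_dict` holds only the parameters, so to reuse it one first rebuilds a model with the same architecture and then loads the weights into it. The sketch below uses a small `nn.Linear` layer and an in-memory buffer as stand-ins for `NeuralNet` and `model.ckpt`.

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                      # small stand-in for NeuralNet
buffer = io.BytesIO()                        # in-memory stand-in for 'model.ckpt'
torch.save(model.state_dict(), buffer)

buffer.seek(0)
restored = nn.Linear(4, 2)                   # must have the same architecture
restored.load_state_dict(torch.load(buffer))
restored.eval()                              # switch to evaluation mode before inference
```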
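# sklearn's signature is (y_true, y_pred); as called here, rows are predicted digits and columns are true digits
confusion_matrix(predicted, labels)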
array([[ 968,    0,    3,    0,    1,    2,    2,    1,    3,    2],
       [   1, 1129,    4,    1,    1,    0,    3,   14,    2,    8],
       [   1,    3, 1002,    6,    1,    0,    1,    9,    1,    0],
       [   0,    0,    3,  989,    0,   20,    1,    6,   13,   14],
       [   2,    0,    3,    0,  971,    2,    6,    1,    5,   11],
       [   0,    1,    0,    3,    0,  856,    1,    0,    3,    2],
       [   4,    1,    3,    0,    3,    3,  943,    0,    3,    1],
       [   1,    0,    6,    3,    1,    1,    0,  988,    4,    2],
       [   2,    1,    7,    3,    0,    6,    1,    1,  938,    1],
       [   1,    0,    1,    5,    4,    2,    0,    8,    2,  968]])
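# As called here (predictions passed as y_true), the precision and recall columns are interchanged and support counts predictions
print(classification_report(predicted, labels))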
              precision    recall  f1-score   support
           0       0.99      0.99      0.99       982
           1       0.99      0.97      0.98      1163
           2       0.97      0.98      0.97      1024
           3       0.98      0.95      0.96      1046
           4       0.99      0.97      0.98      1001
           5       0.96      0.99      0.97       866
           6       0.98      0.98      0.98       961
           7       0.96      0.98      0.97      1006
           8       0.96      0.98      0.97       960
           9       0.96      0.98      0.97       991
    accuracy                           0.98     10000
   macro avg       0.97      0.98      0.98     10000
weighted avg       0.98      0.98      0.98     10000
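scikit-learn's metric functions take the true labels first and the predictions second, and they expect CPU data. A small sketch of that convention on made-up labels:

```python
import torch
from sklearn.metrics import confusion_matrix

y_true = torch.tensor([0, 1, 1, 2])   # illustrative true labels
y_pred = torch.tensor([0, 1, 2, 2])   # illustrative predictions

# Move to CPU and convert to NumPy before passing to scikit-learn
cm = confusion_matrix(y_true.cpu().numpy(), y_pred.cpu().numpy())
print(cm)  # rows are true classes, columns are predicted classes
```

With this ordering, the one true 1 that was predicted as 2 appears in row 1, column 2.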
How will you use the code above for text classification?
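One possible starting point, offered only as a sketch: represent each document as a bag-of-words count vector, so that the vocabulary size plays the role of `input_size` and the number of text categories plays the role of `num_classes`. The toy documents, the hidden size of 16, and the two classes below are all assumptions for illustration.

```python
import torch
import torch.nn as nn

docs = ["good movie", "bad movie", "good good plot"]   # toy documents (illustrative)
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Bag-of-words counts: one row per document, one column per vocabulary word
features = torch.zeros(len(docs), len(vocab))
for row, d in enumerate(docs):
    for w in d.split():
        features[row, index[w]] += 1

# Same architecture as above, with input_size = vocabulary size and 2 classes
model = nn.Sequential(nn.Linear(len(vocab), 16), nn.ReLU(), nn.Linear(16, 2))
logits = model(features)
print(logits.shape)
```

The training and test loops above would then apply unchanged, apart from dropping the reshape of images to 784 values.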
