17. PyTorch Basics and Neural Nets#
PyTorch is Meta’s deep learning framework. Beyond building neural networks, it is used for tensor mathematics, taking gradients, and more; it is extremely versatile and a useful alternative to TensorFlow and Keras.
In this notebook, we review a few basics of PyTorch; detailed examples are available from the official site: https://pytorch.org
Installation: https://pytorch.org/get-started/locally/
Quick tutorial: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html (this is a great and quick read)
from google.colab import drive
drive.mount('/content/drive') # Add My Drive/<>
import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/NLPBook/')
Mounted at /content/drive
%%capture
# %pylab pulls numpy and matplotlib into the interactive namespace (array, rand, etc.)
%pylab inline
import pandas as pd
import os
%%capture
# Install if required: torch and torchvision are usually pre-installed (e.g. on Colab); if not, uncomment the line below:
# !pip install torch torchvision
import torch
17.1. Tensors#
Just as we have NumPy arrays, we can initialize a torch array with any number of dimensions, which is why it is better called a tensor.
Here are some basic operations on torch tensors, just to get a feel for the syntax. The parallels with NumPy make this easy.
a = torch.tensor(4)
print(type(a))
print(a+a)
<class 'torch.Tensor'>
tensor(8)
# Casting from a numpy ndarray to a torch tensor
b = array(8) # numpy array
a = torch.tensor(b)
print(type(a))
print(a+a)
a = torch.from_numpy(b)
print(type(a))
print(a+a)
<class 'torch.Tensor'>
tensor(16)
<class 'torch.Tensor'>
tensor(16)
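The reverse conversion is just as direct (a brief aside, not in the original cells): .numpy() returns the tensor’s data as a NumPy ndarray, and for CPU tensors the two share memory.
c = a.numpy()         # back from a torch tensor to a numpy ndarray
print(type(c), c + c) # <class 'numpy.ndarray'> 16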
# Higher dimensions
b = rand(30).reshape((5,2,3))
print(type(b))
print(b.shape)
a = torch.tensor(b)
print(type(a))
print(a)
print(a.shape)
<class 'numpy.ndarray'>
(5, 2, 3)
<class 'torch.Tensor'>
tensor([[[0.5348, 0.5363, 0.2537],
[0.4955, 0.5561, 0.7541]],
[[0.7039, 0.7626, 0.3896],
[0.4855, 0.3901, 0.4626]],
[[0.2073, 0.2634, 0.5219],
[0.7576, 0.2488, 0.7974]],
[[0.2249, 0.2695, 0.2217],
[0.9772, 0.9335, 0.3637]],
[[0.7285, 0.3694, 0.4674],
[0.3440, 0.1306, 0.5499]]], dtype=torch.float64)
torch.Size([5, 2, 3])
# Using random number generators
a = torch.rand((5,2,3))
print(a)
tensor([[[0.0889, 0.1380, 0.5658],
[0.8028, 0.6986, 0.0965]],
[[0.0837, 0.9840, 0.5174],
[0.2920, 0.5077, 0.4636]],
[[0.5629, 0.0727, 0.0981],
[0.7366, 0.0018, 0.0012]],
[[0.3837, 0.8692, 0.6788],
[0.8077, 0.0251, 0.1417]],
[[0.5395, 0.1575, 0.9190],
[0.3999, 0.6020, 0.0949]]])
17.2. Matrix operations#
a = [[2.0,3],[1,4]]
b = [[5.0,8]]
c = torch.rand((2,2))
a = torch.tensor(a)
b = torch.tensor(b)
b = torch.t(b)
print(a)
print(b)
print(c)
print(torch.transpose(c,0,1)) # more useful for tensors with more than 2 dimensions
tensor([[2., 3.],
[1., 4.]])
tensor([[5.],
[8.]])
tensor([[0.8037, 0.0560],
[0.6180, 0.3415]])
tensor([[0.8037, 0.6180],
[0.0560, 0.3415]])
print(torch.transpose(c,1,0))
tensor([[0.8037, 0.6180],
[0.0560, 0.3415]])
# Addition
print(a+c)
print(torch.add(a,c))
tensor([[2.8037, 3.0560],
[1.6180, 4.3415]])
tensor([[2.8037, 3.0560],
[1.6180, 4.3415]])
# Subtraction
print(a-c)
print(torch.sub(a,c))
tensor([[1.1963, 2.9440],
[0.3820, 3.6585]])
tensor([[1.1963, 2.9440],
[0.3820, 3.6585]])
# Multiplication
print(a*c) # elementwise
print(torch.mm(a,c)) # matrix multiply
print(torch.mm(a,b)) # notice the transpose required here (not so in numpy)
print(torch.mm(torch.mm(a,c),b))
tensor([[1.6073, 0.1679],
[0.6180, 1.3659]])
tensor([[3.4613, 1.1363],
[3.2756, 1.4218]])
tensor([[34.],
[37.]])
tensor([[26.3969],
[27.7524]])
# Much easier
a.mm(c).mm(b)
tensor([[26.3969],
[27.7524]])
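Equivalently (a small aside), Python’s @ operator performs the same matrix multiplication and reads even more cleanly:
print(a @ c @ b)   # same result as a.mm(c).mm(b) above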
17.3. Reshaping and joining tensors#
a = torch.rand(4,3)
b = torch.randn(3,4)
print(torch.cat((a,torch.t(b))))
print(torch.cat((a,torch.t(b)),dim=1))
tensor([[ 0.9288, 0.7381, 0.2384],
[ 0.5969, 0.4394, 0.2242],
[ 0.9905, 0.0317, 0.5225],
[ 0.2020, 0.6733, 0.3272],
[-0.8985, -0.8288, -0.7510],
[-0.9929, -1.3209, -1.6572],
[ 0.3097, 1.1610, 1.0873],
[ 0.1304, 1.1366, 0.6809]])
tensor([[ 0.9288, 0.7381, 0.2384, -0.8985, -0.8288, -0.7510],
[ 0.5969, 0.4394, 0.2242, -0.9929, -1.3209, -1.6572],
[ 0.9905, 0.0317, 0.5225, 0.3097, 1.1610, 1.0873],
[ 0.2020, 0.6733, 0.3272, 0.1304, 1.1366, 0.6809]])
torch.reshape(b,(4,3))
tensor([[-0.8985, -0.9929, 0.3097],
[ 0.1304, -0.8288, -1.3209],
[ 1.1610, 1.1366, -0.7510],
[-1.6572, 1.0873, 0.6809]])
17.4. Vector/Matrix Calculus#
A look at taking derivatives (gradients).
PyTorch uses automatic differentiation: if we have a function \(f(x)\) where \(x\) is a tensor and we flag \(x\) for automatic differentiation (with requires_grad=True), then PyTorch keeps track of all functions of \(x\) through the computation graph, along with the derivatives of those functions w.r.t. \(x\).
Note that if \(x\) is a vector of dimension \(d\), and \(f(x)\) is a scalar function of \(x\), then the derivative \(\frac{\partial f}{\partial x}\) will be of dimension \(d\).
See: https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
Jason Brownlee, Calculating Derivatives in PyTorch
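A minimal sketch of the dimension claim above (w and x are illustrative 4-vectors, not part of the examples below): for the scalar \(f(x) = w \cdot x\), the gradient autograd returns is w itself, with the same dimension as x.
x = torch.tensor([1., 2., 3., 4.], requires_grad=True)
w = torch.tensor([0.5, -1., 2., 0.])
f = (w * x).sum()    # a scalar function of the 4-dimensional x
f.backward()
print(x.grad)        # tensor([ 0.5000, -1.0000,  2.0000,  0.0000]) -- same dimension as x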
%%time
x = torch.tensor([[2.,3.,4.]], requires_grad=True) # make sure this is a matrix [[]]
a = torch.rand((5,3)) # this is automatically a matrix
print("a =", a)
f = torch.mm(a,torch.t(x)).sum() # summing the 5x1 result makes f a scalar
print("f =", f)
# external_grad = torch.tensor([1., 1., 1.])
# f.backward(gradient=external_grad)
# f.backward() # sum f to get a scalar and then differentiate wrt x, gradients are stored in object x
# print("Gradient =", x.grad) # since there are 3 values in the x vector, we will get 3 derivatives
a = tensor([[0.1527, 0.4320, 0.4375],
[0.1844, 0.7497, 0.2634],
[0.7359, 0.4054, 0.2788],
[0.1239, 0.5019, 0.9134],
[0.0547, 0.8669, 0.9395]])
f = tensor(22.7014, grad_fn=<SumBackward0>)
CPU times: user 6.67 ms, sys: 16 µs, total: 6.69 ms
Wall time: 13.4 ms
We first call backward() on f to compute the gradient via autograd, and then approximate the same gradient manually with finite differences to check that the two match. (Because f is linear in \(x\), even a coarse forward difference with step dx = 0.1 reproduces the gradient exactly.)
%%time
f.backward() # differentiate the scalar f w.r.t. x; the gradients are stored in x.grad
print("Gradient =", x.grad) # x has 3 elements, so we get 3 partial derivatives
Gradient = tensor([[1.2515, 2.9559, 2.8326]])
CPU times: user 12.6 ms, sys: 1.93 ms, total: 14.6 ms
Wall time: 20.5 ms
%%time
a1 = a.numpy().copy()
dx = 0.1
for j in range(3):
x1 = x.detach().numpy().copy() # need to detach gradient vector, and also make a copy, else x will be updated
x1[0,j] = x1[0,j] + dx
f1 = a1.dot(x1.T).sum()
print((f1-f)/dx)
tensor(1.2515, grad_fn=<DivBackward0>)
tensor(2.9559, grad_fn=<DivBackward0>)
tensor(2.8326, grad_fn=<DivBackward0>)
CPU times: user 10.6 ms, sys: 924 µs, total: 11.5 ms
Wall time: 12.1 ms
x = torch.tensor([[2.,3.,4.]], requires_grad=True) # make sure this is a matrix [[]]
a = torch.rand((3,3)) # this is automatically a matrix
print("a =", a)
f = torch.mm(x,torch.mm(a,torch.t(x))) # f = x A x^T is a 1x1 tensor (effectively a scalar)
print("f =", f)
# external_grad = torch.tensor([1., 1., 1.])
# f.backward(gradient=external_grad)
f.backward() # f is already effectively a scalar (1x1), so we can differentiate w.r.t. x directly
print("Gradient =", x.grad) # 3 partial derivatives, one per element of x
a = tensor([[0.9857, 0.7094, 0.1037],
[0.5984, 0.3621, 0.8753],
[0.5341, 0.7135, 0.9308]])
f = tensor([[54.1095]], grad_fn=<MmBackward0>)
Gradient = tensor([[10.4177, 11.1436, 13.4882]])
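We can verify autograd against the analytic gradient of this quadratic form: for \(f(x) = x A x^\top\) with \(x\) a 1x3 row vector, calculus gives \(\partial f / \partial x = x(A + A^\top)\). A small sketch of the check follows; it uses fresh tensors x2 and a2 (different random values) so that the notebook’s x and a are left untouched.
x2 = torch.tensor([[2., 3., 4.]], requires_grad=True)
a2 = torch.rand((3, 3))
f2 = torch.mm(x2, torch.mm(a2, torch.t(x2)))
f2.backward()
print("Autograd gradient =", x2.grad)
print("Analytic gradient =", torch.mm(x2.detach(), a2 + torch.t(a2)))  # should match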
f = torch.mm(torch.mm(x,torch.mm(a,torch.t(x))),torch.mm(x,torch.t(x)))
f.backward()
print("Gradient =", x.grad)
Gradient = tensor([[528.9687, 658.9653, 837.5213]])
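Note that the gradient printed above also includes the gradient left over from the previous cell: backward() accumulates into x.grad rather than overwriting it. A small sketch of how to reset the accumulator before calling backward() again:
x.grad.zero_()   # clear the accumulated gradient in place
f = torch.mm(torch.mm(x, torch.mm(a, torch.t(x))), torch.mm(x, torch.t(x)))
f.backward()
print("Gradient =", x.grad)   # now only the gradient of this f, without the earlier accumulation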
# Repeat the example, but instead of summing f into a scalar, keep each element separate
# and pass an external gradient vector to backward(), so f does not need to be a scalar
x = torch.tensor([[2.,3.,4.]], requires_grad=True)
f = torch.mm(a,torch.t(x))
print("f =", f) # f is a vector
external_grad = torch.t(torch.tensor([[1., 1., 1.]]))
f.backward(gradient=external_grad)
# f.sum().backward()
print("Gradient =", x.grad)
f = tensor([[4.5147],
[5.7845],
[6.9316]], grad_fn=<MmBackward0>)
Gradient = tensor([[2.1182, 1.7850, 1.9098]])
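What backward(gradient=v) actually computes is the vector–Jacobian product \(v^\top J\). Here is a small sketch of that with a non-trivial weighting v: since \(f = A x^\top\), the Jacobian \(J\) equals \(A\), so x.grad should equal \(v^\top A\). Fresh tensors (x3, a3) are used so the notebook state is not disturbed.
x3 = torch.tensor([[2., 3., 4.]], requires_grad=True)
a3 = torch.rand((3, 3))
f3 = torch.mm(a3, torch.t(x3))
v = torch.tensor([[1.], [2.], [3.]])   # an arbitrary weighting of the three outputs
f3.backward(gradient=v)
print("Autograd :", x3.grad)
print("v^T J    :", torch.mm(torch.t(v), a3))  # should match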
# Example 1 from PyTorch documentation
a = torch.tensor([2., 3., 5.], requires_grad=True)
b = torch.tensor([6., 4., 2.], requires_grad=True)
Q = 3*a**3 - b**2
print("Q =", Q)
external_grad = torch.tensor([1., 1., 1.])
Q.backward(gradient=external_grad)
# check if collected gradients are correct
print(a.grad, b.grad)
print(9*a**2 == a.grad)
print(-2*b == b.grad)
Q = tensor([-12., 65., 371.], grad_fn=<SubBackward0>)
tensor([ 36., 81., 225.]) tensor([-12., -8., -4.])
tensor([True, True, True])
tensor([True, True, True])
# Example 2 from PyTorch documentation
a = torch.tensor([2., 3., 5.], requires_grad=True)
b = torch.tensor([6., 4., 2.], requires_grad=True)
Q = 3*a**3 - b**2
print(Q)
Q.sum().backward()
# check if collected gradients are correct
print(a.grad, b.grad)
print(9*a**2 == a.grad)
print(-2*b == b.grad)
tensor([-12., 65., 371.], grad_fn=<SubBackward0>)
tensor([ 36., 81., 225.]) tensor([-12., -8., -4.])
tensor([True, True, True])
tensor([True, True, True])
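As an aside (not part of the PyTorch documentation examples above), torch.autograd.grad returns the gradients directly instead of accumulating them into .grad, which is sometimes more convenient:
a = torch.tensor([2., 3., 5.], requires_grad=True)
b = torch.tensor([6., 4., 2.], requires_grad=True)
Q = 3*a**3 - b**2
grad_a, grad_b = torch.autograd.grad(Q.sum(), [a, b])
print(grad_a, grad_b)   # same values as a.grad and b.grad above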
17.5. Reprise of the MNIST digit classification problem#
Recall that each MNIST digit occupies a pixel frame of size 28 x 28 = 784 pixels; this is the input size. The labels are the digits 0, 1, 2, …, 9, so this is a 10-class classification problem.
# importing the libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
# Sigmoid function (defined here for reference; the network below uses ReLU instead)
def sigmoid(x):
return 1/(1 + torch.exp(-x))
# Derivative of the sigmoid function
def derivatives_sigmoid(x):
return sigmoid(x) * (1 - sigmoid(x))
# Import all the required PyTorch libraries
import torch
import torchvision
import torch.nn as nn
import torchvision.transforms as transforms
# Use a GPU if available, else use a CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Device =", device)
# Hyper-parameters
input_size = 784
hidden_size = 500 # number of neurons in the hidden layer
num_classes = 10
num_epochs = 5
batch_size = 100
learning_rate = 0.001
# Define a fully connected neural network with one hidden layer
class NeuralNet(nn.Module):
def __init__(self, input_size, hidden_size, num_classes):
        # super() gives access to the methods of the parent class (nn.Module);
        # its __init__ must be called before layers are registered on self
        super(NeuralNet, self).__init__()
self.fc1 = nn.Linear(input_size, hidden_size) # fully connected input and hidden layer 1
self.relu = nn.ReLU() # type is ReLU
self.fc2 = nn.Linear(hidden_size, num_classes) # fully connected hidden layer and output layer
# forward pass that takes input and produces output
def forward(self, x):
out = self.fc1(x)
out = self.relu(out)
out = self.fc2(out)
return out
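As a quick sanity check (a sketch, not part of the original notebook), a dummy batch of 8 flattened images pushed through an untrained NeuralNet should produce one logit per class:
net = NeuralNet(input_size, hidden_size, num_classes)
dummy = torch.rand(8, input_size)   # 8 fake flattened 28x28 images
print(net(dummy).shape)             # expect torch.Size([8, 10])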
# Define the model
model = NeuralNet(input_size, hidden_size, num_classes).to(device) # push it to cpu/gpu (using related RAM)
# Loss and optimizer (Adam)
criterion = nn.CrossEntropyLoss() # This will do both softmax and cross-entropy loss in one
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
Device = cpu
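To see what “softmax and cross-entropy loss in one” means, here is a small sketch with made-up logits and targets: nn.CrossEntropyLoss on raw logits should agree with nn.NLLLoss applied to log-softmaxed logits.
import torch.nn.functional as F
logits = torch.randn(4, num_classes)   # made-up scores for a batch of 4
targets = torch.tensor([1, 0, 9, 3])   # made-up class labels
ce  = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
print(ce.item(), nll.item())           # the two values should agree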
# MNIST dataset (images and labels)
train_dataset = torchvision.datasets.MNIST(root='../../data',
train=True,
transform=transforms.ToTensor(),
download=True)
test_dataset = torchvision.datasets.MNIST(root='../../data',
train=False,
transform=transforms.ToTensor())
# Data loader (input pipeline)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
batch_size=batch_size,
shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
batch_size=10000,
shuffle=False)
print(train_dataset.data.shape)
print(test_dataset.data.shape)
torch.Size([60000, 28, 28])
torch.Size([10000, 28, 28])
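To see what each training step receives, a small sketch (assuming the loaders defined above): pull one batch and check its shape before and after the flattening used in the training loop below.
images, labels = next(iter(train_loader))
print(images.shape)                      # torch.Size([100, 1, 28, 28])
print(labels.shape)                      # torch.Size([100])
print(images.reshape(-1, 28*28).shape)   # torch.Size([100, 784]) -- what the model sees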
%%time
# Train the model
total_step = len(train_loader)
print("Total steps =", total_step)
for epoch in range(num_epochs):
for i, (images, labels) in enumerate(train_loader):
# Move tensors to the configured device
images = images.reshape(-1, 28*28).to(device)
labels = labels.to(device)
# Forward pass
outputs = model(images)
loss = criterion(outputs, labels)
# Backward and optimize
optimizer.zero_grad() # zero out all gradients
loss.backward() # calculate gradients
optimizer.step() # update weights
        if (i+1) % batch_size == 0:   # print progress every 100 steps (batch_size happens to equal the reporting interval)
print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
.format(epoch+1, num_epochs, i+1, total_step, loss.item()))
Total steps = 600
Epoch [1/5], Step [100/600], Loss: 0.3240
Epoch [1/5], Step [200/600], Loss: 0.2925
Epoch [1/5], Step [300/600], Loss: 0.1980
Epoch [1/5], Step [400/600], Loss: 0.1234
Epoch [1/5], Step [500/600], Loss: 0.2307
Epoch [1/5], Step [600/600], Loss: 0.1781
Epoch [2/5], Step [100/600], Loss: 0.0454
Epoch [2/5], Step [200/600], Loss: 0.0583
Epoch [2/5], Step [300/600], Loss: 0.0968
Epoch [2/5], Step [400/600], Loss: 0.1097
Epoch [2/5], Step [500/600], Loss: 0.0687
Epoch [2/5], Step [600/600], Loss: 0.1479
Epoch [3/5], Step [100/600], Loss: 0.0590
Epoch [3/5], Step [200/600], Loss: 0.1444
Epoch [3/5], Step [300/600], Loss: 0.0565
Epoch [3/5], Step [400/600], Loss: 0.0485
Epoch [3/5], Step [500/600], Loss: 0.0680
Epoch [3/5], Step [600/600], Loss: 0.0427
Epoch [4/5], Step [100/600], Loss: 0.0550
Epoch [4/5], Step [200/600], Loss: 0.0286
Epoch [4/5], Step [300/600], Loss: 0.0431
Epoch [4/5], Step [400/600], Loss: 0.0283
Epoch [4/5], Step [500/600], Loss: 0.1147
Epoch [4/5], Step [600/600], Loss: 0.0926
Epoch [5/5], Step [100/600], Loss: 0.0553
Epoch [5/5], Step [200/600], Loss: 0.0253
Epoch [5/5], Step [300/600], Loss: 0.0327
Epoch [5/5], Step [400/600], Loss: 0.1568
Epoch [5/5], Step [500/600], Loss: 0.0278
Epoch [5/5], Step [600/600], Loss: 0.0356
CPU times: user 1min 5s, sys: 56.3 ms, total: 1min 5s
Wall time: 1min 6s
%%time
# Test the model
# In test phase, we don't need to compute gradients (for memory efficiency)
with torch.no_grad():
correct = 0
total = 0
for images, labels in test_loader:
images = images.reshape(-1, 28*28).to(device)
labels = labels.to(device)
outputs = model(images)
_, predicted = torch.max(outputs.data, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
print('Accuracy of the network on the 10000 test images: {} %'.format(100 * correct / total))
# Save the model checkpoint
torch.save(model.state_dict(), 'model.ckpt')
Accuracy of the network on the 10000 test images: 97.83 %
CPU times: user 1.31 s, sys: 127 ms, total: 1.44 s
Wall time: 1.88 s
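If you want to reuse the trained weights later, here is a sketch of reloading the checkpoint saved above (it assumes the same NeuralNet class definition is available):
model2 = NeuralNet(input_size, hidden_size, num_classes).to(device)
model2.load_state_dict(torch.load('model.ckpt'))
model2.eval()   # switch to evaluation mode before running inference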
confusion_matrix(predicted, labels) # note the argument order: rows are predicted digits, columns are true labels
array([[ 970, 0, 5, 0, 1, 3, 5, 0, 5, 1],
[ 0, 1121, 0, 0, 0, 0, 3, 2, 0, 2],
[ 1, 3, 1004, 2, 2, 0, 0, 10, 7, 0],
[ 0, 1, 6, 989, 1, 5, 1, 1, 8, 5],
[ 1, 0, 1, 0, 955, 1, 3, 0, 4, 3],
[ 2, 1, 0, 4, 0, 876, 5, 0, 7, 1],
[ 3, 3, 2, 0, 6, 3, 941, 0, 4, 1],
[ 0, 2, 7, 4, 3, 0, 0, 1006, 4, 6],
[ 2, 4, 7, 4, 0, 2, 0, 1, 931, 0],
[ 1, 0, 0, 7, 14, 2, 0, 8, 4, 990]])
print(classification_report(predicted, labels)) # with this argument order, the support column counts predicted (not true) labels per class
precision recall f1-score support
0 0.99 0.98 0.98 990
1 0.99 0.99 0.99 1128
2 0.97 0.98 0.97 1029
3 0.98 0.97 0.98 1017
4 0.97 0.99 0.98 968
5 0.98 0.98 0.98 896
6 0.98 0.98 0.98 963
7 0.98 0.97 0.98 1032
8 0.96 0.98 0.97 951
9 0.98 0.96 0.97 1026
accuracy 0.98 10000
macro avg 0.98 0.98 0.98 10000
weighted avg 0.98 0.98 0.98 10000
How would you use the code above for text classification?