17. PyTorch Basics and Neural Nets#
PyTorch is Meta’s deep learning framework. Beyond building neural networks, it is used for tensor mathematics, taking gradients, and more; it is extremely versatile and a useful alternative to TensorFlow and Keras.
In this notebook, we review a few basics of PyTorch; detailed examples are available from the official site: https://pytorch.org
Installation: https://pytorch.org/get-started/locally/
Quick tutorial: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html (this is a great and quick read)
from google.colab import drive
drive.mount('/content/drive') # Add My Drive/<>
import os
os.chdir('drive/My Drive')
os.chdir('Books_Writings/NLPBook/')
Mounted at /content/drive
%%capture
# %pylab pulls numpy and matplotlib into the interactive namespace (array, rand, etc.)
%pylab inline
import pandas as pd
import os
%%capture
# Install if required: torch and torchvision are usually pre-installed (e.g. on Colab); if not, uncomment the line below:
# !pip install torch torchvision
import torch
17.1. Tensors#
Just as we have NumPy arrays, we can initialize a torch array with any number of dimensions, which is why it is better called a tensor.
Here are some basic operations on torch tensors, just to get a feel for the syntax. The parallels with NumPy make this easy.
a = torch.tensor(4)
print(type(a))
print(a+a)
<class 'torch.Tensor'>
tensor(8)
# Casting from a numpy ndarray to a torch tensor
b = array(8) # numpy array
a = torch.tensor(b)
print(type(a))
print(a+a)
a = torch.from_numpy(b)
print(type(a))
print(a+a)
<class 'torch.Tensor'>
tensor(16)
<class 'torch.Tensor'>
tensor(16)
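The reverse conversion is just as direct (a brief aside, not in the original cells): .numpy() returns the tensor’s data as a NumPy ndarray, and for CPU tensors the two share memory.
c = a.numpy()         # back from a torch tensor to a numpy ndarray
print(type(c), c + c) # <class 'numpy.ndarray'> 16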
# Higher dimensions
b = rand(30).reshape((5,2,3))
print(type(b))
print(b.shape)
a = torch.tensor(b)
print(type(a))
print(a)
print(a.shape)
<class 'numpy.ndarray'>
(5, 2, 3)
<class 'torch.Tensor'>
tensor([[[0.5348, 0.5363, 0.2537],
[0.4955, 0.5561, 0.7541]],
[[0.7039, 0.7626, 0.3896],
[0.4855, 0.3901, 0.4626]],
[[0.2073, 0.2634, 0.5219],
[0.7576, 0.2488, 0.7974]],
[[0.2249, 0.2695, 0.2217],
[0.9772, 0.9335, 0.3637]],
[[0.7285, 0.3694, 0.4674],
[0.3440, 0.1306, 0.5499]]], dtype=torch.float64)
torch.Size([5, 2, 3])
# Using random number generators
a = torch.rand((5,2,3))
print(a)
tensor([[[0.0889, 0.1380, 0.5658],
[0.8028, 0.6986, 0.0965]],
[[0.0837, 0.9840, 0.5174],
[0.2920, 0.5077, 0.4636]],
[[0.5629, 0.0727, 0.0981],
[0.7366, 0.0018, 0.0012]],
[[0.3837, 0.8692, 0.6788],
[0.8077, 0.0251, 0.1417]],
[[0.5395, 0.1575, 0.9190],
[0.3999, 0.6020, 0.0949]]])
17.2. Matrix operations#
a = [[2.0,3],[1,4]]
b = [[5.0,8]]
c = torch.rand((2,2))
a = torch.tensor(a)
b = torch.tensor(b)
b = torch.t(b)
print(a)
print(b)
print(c)
print(torch.transpose(c,0,1)) # more useful for tensors with more than 2 dimensions
tensor([[2., 3.],
[1., 4.]])
tensor([[5.],
[8.]])
tensor([[0.8037, 0.0560],
[0.6180, 0.3415]])
tensor([[0.8037, 0.6180],
[0.0560, 0.3415]])
print(torch.transpose(c,1,0))
tensor([[0.8037, 0.6180],
[0.0560, 0.3415]])
# Addition
print(a+c)
print(torch.add(a,c))
tensor([[2.8037, 3.0560],
[1.6180, 4.3415]])
tensor([[2.8037, 3.0560],
[1.6180, 4.3415]])
# Subtraction
print(a-c)
print(torch.sub(a,c))
tensor([[1.1963, 2.9440],
[0.3820, 3.6585]])
tensor([[1.1963, 2.9440],
[0.3820, 3.6585]])
# Multiplication
print(a*c) # elementwise
print(torch.mm(a,c)) # matrix multiply
print(torch.mm(a,b)) # notice the transpose required here (not so in numpy)
print(torch.mm(torch.mm(a,c),b))
tensor([[1.6073, 0.1679],
[0.6180, 1.3659]])
tensor([[3.4613, 1.1363],
[3.2756, 1.4218]])
tensor([[34.],
[37.]])
tensor([[26.3969],
[27.7524]])
# Much easier
a.mm(c).mm(b)
tensor([[26.3969],
[27.7524]])
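Equivalently (a small aside), Python’s @ operator performs the same matrix multiplication and reads even more cleanly:
print(a @ c @ b)   # same result as a.mm(c).mm(b) above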
17.3. Reshaping and joining tensors#
a = torch.rand(4,3)
b = torch.randn(3,4)
print(torch.cat((a,torch.t(b))))
print(torch.cat((a,torch.t(b)),dim=1))
tensor([[ 0.9288, 0.7381, 0.2384],
[ 0.5969, 0.4394, 0.2242],
[ 0.9905, 0.0317, 0.5225],
[ 0.2020, 0.6733, 0.3272],
[-0.8985, -0.8288, -0.7510],
[-0.9929, -1.3209, -1.6572],
[ 0.3097, 1.1610, 1.0873],
[ 0.1304, 1.1366, 0.6809]])
tensor([[ 0.9288, 0.7381, 0.2384, -0.8985, -0.8288, -0.7510],
[ 0.5969, 0.4394, 0.2242, -0.9929, -1.3209, -1.6572],
[ 0.9905, 0.0317, 0.5225, 0.3097, 1.1610, 1.0873],
[ 0.2020, 0.6733, 0.3272, 0.1304, 1.1366, 0.6809]])
torch.reshape(b,(4,3))
tensor([[-0.8985, -0.9929, 0.3097],
[ 0.1304, -0.8288, -1.3209],
[ 1.1610, 1.1366, -0.7510],
[-1.6572, 1.0873, 0.6809]])
17.4. Vector/Matrix Calculus#
A look at taking derivatives (gradients).
PyTorch uses automatic differentiation: if we have a function \(f(x)\) where \(x\) is a tensor and we flag \(x\) for automatic differentiation (with requires_grad=True), then PyTorch keeps track of all functions of \(x\) through the computation graph, along with the derivatives of those functions w.r.t. \(x\).
Note that if \(x\) is a vector of dimension \(d\), and \(f(x)\) is a scalar function of \(x\), then the derivative \(\frac{\partial f}{\partial x}\) will be of dimension \(d\).
See: https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
Jason Brownlee, Calculating Derivatives in PyTorch
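A minimal sketch of the dimension claim above (w and x are illustrative 4-vectors, not part of the examples below): for the scalar \(f(x) = w \cdot x\), the gradient autograd returns is w itself, with the same dimension as x.
x = torch.tensor([1., 2., 3., 4.], requires_grad=True)
w = torch.tensor([0.5, -1., 2., 0.])
f = (w * x).sum()    # a scalar function of the 4-dimensional x
f.backward()
print(x.grad)        # tensor([ 0.5000, -1.0000,  2.0000,  0.0000]) -- same dimension as x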
%%time
x = torch.tensor([[2.,3.,4.]], requires_grad=True) # make sure this is a matrix [[]]
a = torch.rand((5,3)) # this is automatically a matrix
print("a =", a)
f = torch.mm(a,torch.t(x)).sum() # summing the 5x1 result makes f a scalar
print("f =", f)
# external_grad = torch.tensor([1., 1., 1.])
# f.backward(gradient=external_grad)
# f.backward() # sum f to get a scalar and then differentiate wrt x, gradients are stored in object x
# print("Gradient =", x.grad) # since there are 3 values in the x vector, we will get 3 derivatives
a = tensor([[0.1527, 0.4320, 0.4375],
[0.1844, 0.7497, 0.2634],
[0.7359, 0.4054, 0.2788],
[0.1239, 0.5019, 0.9134],
[0.0547, 0.8669, 0.9395]])
f = tensor(22.7014, grad_fn=<SumBackward0>)
CPU times: user 6.67 ms, sys: 16 µs, total: 6.69 ms
Wall time: 13.4 ms
We first call backward() on f to compute the gradient via autograd, and then approximate the same gradient manually with finite differences to check that the two match. (Because f is linear in \(x\), even a coarse forward difference with step dx = 0.1 reproduces the gradient exactly.)
%%time
f.backward() # differentiate the scalar f w.r.t. x; the gradients are stored in x.grad
print("Gradient =", x.grad) # x has 3 elements, so we get 3 partial derivatives
Gradient = tensor([[1.2515, 2.9559, 2.8326]])
CPU times: user 12.6 ms, sys: 1.93 ms, total: 14.6 ms
Wall time: 20.5 ms
%%time
a1 = a.numpy().copy()
dx = 0.1
for j in range(3):
x1 = x.detach().numpy().copy() # need to detach gradient vector, and also make a copy, else x will be updated
x1[0,j] = x1[0,j] + dx
f1 = a1.dot(x1.T).sum()
print((f1-f)/dx)
tensor(1.2515, grad_fn=<DivBackward0>)
tensor(2.9559, grad_fn=<DivBackward0>)
tensor(2.8326, grad_fn=<DivBackward0>)
CPU times: user 10.6 ms, sys: 924 µs, total: 11.5 ms
Wall time: 12.1 ms
x = torch.tensor([[2.,3.,4.]], requires_grad=True) # make sure this is a matrix [[]]
a = torch.rand((3,3)) # this is automatically a matrix
print("a =", a)
f = torch.mm(x,torch.mm(a,torch.t(x))) # f = x A x^T is a 1x1 tensor (effectively a scalar)
print("f =", f)
# external_grad = torch.tensor([1., 1., 1.])
# f.backward(gradient=external_grad)
f.backward() # f is already effectively a scalar (1x1), so we can differentiate w.r.t. x directly
print("Gradient =", x.grad) # 3 partial derivatives, one per element of x
a = tensor([[0.9857, 0.7094, 0.1037],
[0.5984, 0.3621, 0.8753],
[0.5341, 0.7135, 0.9308]])
f = tensor([[54.1095]], grad_fn=<MmBackward0>)
Gradient = tensor([[10.4177, 11.1436, 13.4882]])
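We can verify autograd against the analytic gradient of this quadratic form: for \(f(x) = x A x^\top\) with \(x\) a 1x3 row vector, calculus gives \(\partial f / \partial x = x(A + A^\top)\). A small sketch of the check follows; it uses fresh tensors x2 and a2 (different random values) so that the notebook’s x and a are left untouched.
x2 = torch.tensor([[2., 3., 4.]], requires_grad=True)
a2 = torch.rand((3, 3))
f2 = torch.mm(x2, torch.mm(a2, torch.t(x2)))
f2.backward()
print("Autograd gradient =", x2.grad)
print("Analytic gradient =", torch.mm(x2.detach(), a2 + torch.t(a2)))  # should match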
f = torch.mm(torch.mm(x,torch.mm(a,torch.t(x))),torch.mm(x,torch.t(x)))
f.backward()
print("Gradient =", x.grad)
Gradient = tensor([[528.9687, 658.9653, 837.5213]])
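Note that the gradient printed above also includes the gradient left over from the previous cell: backward() accumulates into x.grad rather than overwriting it. A small sketch of how to reset the accumulator before calling backward() again:
x.grad.zero_()   # clear the accumulated gradient in place
f = torch.mm(torch.mm(x, torch.mm(a, torch.t(x))), torch.mm(x, torch.t(x)))
f.backward()
print("Gradient =", x.grad)   # now only the gradient of this f, without the earlier accumulation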
# Repeat the example, but instead of summing f into a scalar, keep each element separate
# and pass an external gradient vector to backward(), so f does not need to be a scalar
x = torch.tensor([[2.,3.,4.]], requires_grad=True)
f = torch.mm(a,torch.t(x))
print("f =", f) # f is a vector
external_grad = torch.t(torch.tensor([[1., 1., 1.]]))
f.backward(gradient=external_grad)
# f.sum().backward()
print("Gradient =", x.grad)
f = tensor([[4.5147],
[5.7845],
[6.9316]], grad_fn=<MmBackward0>)
Gradient = tensor([[2.1182, 1.7850, 1.9098]])
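What backward(gradient=v) actually computes is the vector–Jacobian product \(v^\top J\). Here is a small sketch of that with a non-trivial weighting v: since \(f = A x^\top\), the Jacobian \(J\) equals \(A\), so x.grad should equal \(v^\top A\). Fresh tensors (x3, a3) are used so the notebook state is not disturbed.
x3 = torch.tensor([[2., 3., 4.]], requires_grad=True)
a3 = torch.rand((3, 3))
f3 = torch.mm(a3, torch.t(x3))
v = torch.tensor([[1.], [2.], [3.]])   # an arbitrary weighting of the three outputs
f3.backward(gradient=v)
print("Autograd :", x3.grad)
print("v^T J    :", torch.mm(torch.t(v), a3))  # should match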
# Example 1 from PyTorch documentation
a = torch.tensor([2., 3., 5.], requires_grad=True)
b = torch.tensor([6., 4., 2.], requires_grad=True)
Q = 3*a**3 - b**2
print("Q =", Q)
external_grad = torch.tensor([1., 1., 1.])
Q.backward(gradient=external_grad)
# check if collected gradients are correct
print(a.grad, b.grad)
print(9*a**2 == a.grad)
print(-2*b == b.grad)
Q = tensor([-12., 65., 371.], grad_fn=<SubBackward0>)
tensor([ 36., 81., 225.]) tensor([-12., -8., -4.])
tensor([True, True, True])
tensor([True, True, True])
# Example 2 from PyTorch documentation
a = torch.tensor([2., 3., 5.], requires_grad=True)
b = torch.tensor([6., 4., 2.], requires_grad=True)
Q = 3*a**3 - b**2
print(Q)
Q.sum().backward()
# check if collected gradients are correct
print(a.grad, b.grad)
print(9*a**2 == a.grad)
print(-2*b == b.grad)
tensor([-12., 65., 371.], grad_fn=<SubBackward0>)
tensor([ 36., 81., 225.]) tensor([-12., -8., -4.])
tensor([True, True, True])
tensor([True, True, True])
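As an aside (not part of the PyTorch documentation examples above), torch.autograd.grad returns the gradients directly instead of accumulating them into .grad, which is sometimes more convenient:
a = torch.tensor([2., 3., 5.], requires_grad=True)
b = torch.tensor([6., 4., 2.], requires_grad=True)
Q = 3*a**3 - b**2
grad_a, grad_b = torch.autograd.grad(Q.sum(), [a, b])
print(grad_a, grad_b)   # same values as a.grad and b.grad above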
17.5. Reprise of the MNIST digit classification problem#
Recall that each MNIST digit occupies a pixel frame of size 28 x 28 = 784 pixels; this is the input size. The labels are the digits 0, 1, 2, …, 9, so this is a 10-class classification problem.
# importing the libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
# Sigmoid function (defined here for reference; the network below uses ReLU instead)
def sigmoid(x):
return 1/(1 + torch.exp(-x))
# Derivative of the sigmoid function
def derivatives_sigmoid(x):
return sigmoid(x) * (1 - sigmoid(x))
# Import all the required PyTorch libraries
import torch
import torchvision
import torch.nn as nn
import torchvision.transforms as transforms
# Use a GPU if available, else use a CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Device =", device)
# Hyper-parameters
input_size = 784
hidden_size = 500 # number of neurons in the hidden layer
num_classes = 10
num_epochs = 5
batch_size = 100
learning_rate = 0.001
# Define a fully connected neural network with one hidden layer
class NeuralNet(nn.Module):
def __init__(self, input_size, hidden_size, num_classes):
        # super() gives access to the methods of the parent class (nn.Module);
        # its __init__ must be called before layers are registered on self
        super(NeuralNet, self).__init__()
self.fc1 = nn.Linear(input_size, hidden_size) # fully connected input and hidden layer 1
self.relu = nn.ReLU() # type is ReLU
self.fc2 = nn.Linear(hidden_size, num_classes) # fully connected hidden layer and output layer
# forward pass that takes input and produces output
def forward(self, x):
out = self.fc1(x)
out = self.relu(out)
out = self.fc2(out)
return out
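As a quick sanity check (a sketch, not part of the original notebook), a dummy batch of 8 flattened images pushed through an untrained NeuralNet should produce one logit per class:
net = NeuralNet(input_size, hidden_size, num_classes)
dummy = torch.rand(8, input_size)   # 8 fake flattened 28x28 images
print(net(dummy).shape)             # expect torch.Size([8, 10])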
# Define the model
model = NeuralNet(input_size, hidden_size, num_classes).to(device) # push it to cpu/gpu (using related RAM)
# Loss and optimizer (Adam)
criterion = nn.CrossEntropyLoss() # This will do both softmax and cross-entropy loss in one
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
Device = cpu
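To see what “softmax and cross-entropy loss in one” means, here is a small sketch with made-up logits and targets: nn.CrossEntropyLoss on raw logits should agree with nn.NLLLoss applied to log-softmaxed logits.
import torch.nn.functional as F
logits = torch.randn(4, num_classes)   # made-up scores for a batch of 4
targets = torch.tensor([1, 0, 9, 3])   # made-up class labels
ce  = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
print(ce.item(), nll.item())           # the two values should agree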
# MNIST dataset (images and labels)
train_dataset = torchvision.datasets.MNIST(root='../../data',
train=True,
transform=transforms.ToTensor(),
download=True)
test_dataset = torchvision.datasets.MNIST(root='../../data',
train=False,
transform=transforms.ToTensor())
# Data loader (input pipeline)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
batch_size=batch_size,
shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
batch_size=10000,
shuffle=False)
print(train_dataset.data.shape)
print(test_dataset.data.shape)
torch.Size([60000, 28, 28])
torch.Size([10000, 28, 28])
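To see what each training step receives, a small sketch (assuming the loaders defined above): pull one batch and check its shape before and after the flattening used in the training loop below.
images, labels = next(iter(train_loader))
print(images.shape)                      # torch.Size([100, 1, 28, 28])
print(labels.shape)                      # torch.Size([100])
print(images.reshape(-1, 28*28).shape)   # torch.Size([100, 784]) -- what the model sees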
%%time
# Train the model
total_step = len(train_loader)
print("Total steps =", total_step)
for epoch in range(num_epochs):
for i, (images, labels) in enumerate(train_loader):
# Move tensors to the configured device
images = images.reshape(-1, 28*28).to(device)
labels = labels.to(device)
# Forward pass
outputs = model(images)
loss = criterion(outputs, labels)
# Backward and optimize
optimizer.zero_grad() # zero out all gradients
loss.backward() # calculate gradients
optimizer.step() # update weights
        if (i+1) % batch_size == 0:   # print progress every 100 steps (batch_size happens to equal the reporting interval)
print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
.format(epoch+1, num_epochs, i+1, total_step, loss.item()))
Total steps = 600
Epoch [1/5], Step [100/600], Loss: 0.3240
Epoch [1/5], Step [200/600], Loss: 0.2925
Epoch [1/5], Step [300/600], Loss: 0.1980
Epoch [1/5], Step [400/600], Loss: 0.1234
Epoch [1/5], Step [500/600], Loss: 0.2307
Epoch [1/5], Step [600/600], Loss: 0.1781
Epoch [2/5], Step [100/600], Loss: 0.0454
Epoch [2/5], Step [200/600], Loss: 0.0583
Epoch [2/5], Step [300/600], Loss: 0.0968
Epoch [2/5], Step [400/600], Loss: 0.1097
Epoch [2/5], Step [500/600], Loss: 0.0687
Epoch [2/5], Step [600/600], Loss: 0.1479
Epoch [3/5], Step [100/600], Loss: 0.0590
Epoch [3/5], Step [200/600], Loss: 0.1444
Epoch [3/5], Step [300/600], Loss: 0.0565
Epoch [3/5], Step [400/600], Loss: 0.0485
Epoch [3/5], Step [500/600], Loss: 0.0680
Epoch [3/5], Step [600/600], Loss: 0.0427
Epoch [4/5], Step [100/600], Loss: 0.0550
Epoch [4/5], Step [200/600], Loss: 0.0286
Epoch [4/5], Step [300/600], Loss: 0.0431
Epoch [4/5], Step [400/600], Loss: 0.0283
Epoch [4/5], Step [500/600], Loss: 0.1147
Epoch [4/5], Step [600/600], Loss: 0.0926
Epoch [5/5], Step [100/600], Loss: 0.0553
Epoch [5/5], Step [200/600], Loss: 0.0253
Epoch [5/5], Step [300/600], Loss: 0.0327
Epoch [5/5], Step [400/600], Loss: 0.1568
Epoch [5/5], Step [500/600], Loss: 0.0278
Epoch [5/5], Step [600/600], Loss: 0.0356
CPU times: user 1min 5s, sys: 56.3 ms, total: 1min 5s
Wall time: 1min 6s
%%time
# Test the model
# In test phase, we don't need to compute gradients (for memory efficiency)
with torch.no_grad():
correct = 0
total = 0
for images, labels in test_loader:
images = images.reshape(-1, 28*28).to(device)
labels = labels.to(device)
outputs = model(images)
_, predicted = torch.max(outputs.data, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
print('Accuracy of the network on the 10000 test images: {} %'.format(100 * correct / total))
# Save the model checkpoint
torch.save(model.state_dict(), 'model.ckpt')
Accuracy of the network on the 10000 test images: 97.83 %
CPU times: user 1.31 s, sys: 127 ms, total: 1.44 s
Wall time: 1.88 s
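If you want to reuse the trained weights later, here is a sketch of reloading the checkpoint saved above (it assumes the same NeuralNet class definition is available):
model2 = NeuralNet(input_size, hidden_size, num_classes).to(device)
model2.load_state_dict(torch.load('model.ckpt'))
model2.eval()   # switch to evaluation mode before running inference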
confusion_matrix(predicted, labels) # note the argument order: rows are predicted digits, columns are true labels
array([[ 970, 0, 5, 0, 1, 3, 5, 0, 5, 1],
[ 0, 1121, 0, 0, 0, 0, 3, 2, 0, 2],
[ 1, 3, 1004, 2, 2, 0, 0, 10, 7, 0],
[ 0, 1, 6, 989, 1, 5, 1, 1, 8, 5],
[ 1, 0, 1, 0, 955, 1, 3, 0, 4, 3],
[ 2, 1, 0, 4, 0, 876, 5, 0, 7, 1],
[ 3, 3, 2, 0, 6, 3, 941, 0, 4, 1],
[ 0, 2, 7, 4, 3, 0, 0, 1006, 4, 6],
[ 2, 4, 7, 4, 0, 2, 0, 1, 931, 0],
[ 1, 0, 0, 7, 14, 2, 0, 8, 4, 990]])
print(classification_report(predicted, labels)) # with this argument order, the support column counts predicted (not true) labels per class
precision recall f1-score support
0 0.99 0.98 0.98 990
1 0.99 0.99 0.99 1128
2 0.97 0.98 0.97 1029
3 0.98 0.97 0.98 1017
4 0.97 0.99 0.98 968
5 0.98 0.98 0.98 896
6 0.98 0.98 0.98 963
7 0.98 0.97 0.98 1032
8 0.96 0.98 0.97 951
9 0.98 0.96 0.97 1026
accuracy 0.98 10000
macro avg 0.98 0.98 0.98 10000
weighted avg 0.98 0.98 0.98 10000
How would you use the code above for text classification?