Decoder-Only Transformer (LLM) For Question Asking
"FROM SCRATCH"
Notebook Structure
Data
Data source
Tokenization
Features and Target
Test data
Model Design
Positional encoding
Multi-head attention
Transformer Decoder
Final Architecture
Training script
Simplistic Inference Script
Issues and mistakes
Pre-training with a downstream task
Not masking padding tokens
Context window
In [1]: #necessary imports
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plot
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
import random
In [2]: if torch.cuda.is_available():
            device = torch.device("cuda")
            print("GPU is available")
        else:
            device = torch.device("cpu")
            print("GPU is not available, using CPU")
GPU is available
DATA
Data Source
The data I used for this project is the Stanford Question Answering Dataset (SQuAD).
SQuAD was prepared such that a question and a context map to an answer (Q + C --> A).
I modified the data so that a context maps to a question (C --> Q).
Find my modified data here: link to dataset
In [3]: data = pd.read_json('{fill with path to your data}').to_dict(orient='list')
Tokenization
In [ ]: #bert tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
In [5]: #Example
data['conversation'][0]
Out[5]: [{'from': 'human',
          'value': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'},
         {'from': 'gpt', 'value': 'When did Beyonce start becoming popular?'}]
In [6]: def tokenize_input(qa):
            # 1. Tokenize with a max sequence length of 300 and pad with zeros
            # 2. Add <sos> and <eos> tokens to the target; here BERT's [CLS] and [SEP] play those roles
            seq_length = 300
            q_tokens = tokenizer(qa[0]['value'], add_special_tokens=False)['input_ids']  # context (the 'human' turn)
            a_tokens = tokenizer(qa[1]['value'], padding=True)['input_ids']              # question (the 'gpt' turn), wrapped in [CLS] ... [SEP]
            x_tokens = q_tokens + a_tokens[:-1]   # input: context followed by the question without its last token
            y_tokens = q_tokens[1:] + a_tokens    # target: the same sequence shifted left by one token
            x_pad = [0 for i in range(seq_length - len(x_tokens))]
            y_pad = [0 for i in range(seq_length - len(y_tokens))]
            final_x = x_tokens + x_pad
            final_y = y_tokens + y_pad
            return final_x, final_y
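As a quick sanity check of this teacher-forcing setup: the input and target are the same token stream offset by one position, so for every non-padding position i, final_y[i] is the token the model should predict after reading final_x[:i+1]. A minimal check on the first record (illustrative, assuming the cells above have been run):

# Minimal sanity check of the shifted input/target pair (illustrative)
x, y = tokenize_input(data['conversation'][0])
print(len(x), len(y))              # both padded to 300
print(tokenizer.decode(x[:8]))     # begins with the context passage
print(tokenizer.decode(y[:8]))     # the same text shifted left by one token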
In [7]: # Tokenize the whole (shuffled) dataset, keeping only pairs that fit in the 300-token window
        tokens = []
        targets = []
        for i in random.sample(data['conversation'], len(data['conversation'])):
            try:
                x, y = tokenize_input(i)
                if len(x) == 300:   # samples longer than 300 tokens are dropped
                    tokens.append(x)
                    targets.append(y)
            except:
                pass
        X = torch.IntTensor(tokens)
        Y = torch.LongTensor(targets)
Token indices sequence length is longer than the specified maximum sequence length for this model (718 > 512). Running this sequence through the model will result in indexing errors
In [8]: X.shape, Y.shape
Out[8]: (torch.Size([124975, 300]), torch.Size([124975, 300]))
Test Data
Create your test data here
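A minimal way to hold out a test split from the X/Y tensors built above (the 5% ratio and variable names here are illustrative, not part of the original notebook):

# Hold out ~5% of the tokenized pairs as a test set (illustrative split, not from the original notebook)
n_test = int(0.05 * len(X))
perm = torch.randperm(len(X))
X_test, Y_test = X[perm[:n_test]], Y[perm[:n_test]]
X_train, Y_train = X[perm[n_test:]], Y[perm[n_test:]]
print(X_train.shape, X_test.shape)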
Model Design
Important Notes
Embedding layer: BERT (bert-base-uncased) is used, frozen under torch.no_grad(), to embed the input tokens
Positional encoding: sinusoidal encoding from "Attention Is All You Need"
Attention: multi-head (4 heads)
Linear projection: the 768-dim embeddings are projected to 224 before entering the decoder stack
Number of decoder layers: 8
Embedding layer
In [9]: class embed(torch.nn.Module):
            def __init__(self):
                super().__init__()
                # Pre-trained BERT used purely as a feature extractor
                self.embedder = AutoModel.from_pretrained('bert-base-uncased')
            def forward(self, x_tokens):
                inputs = {'input_ids': x_tokens}
                with torch.no_grad():   # BERT weights are not updated during training
                    attention_mask = (inputs['input_ids'] != 0).int()   # 0 is the [PAD] id
                    outputs = self.embedder(**inputs, attention_mask=attention_mask)
                    embeddings = outputs.last_hidden_state * attention_mask.unsqueeze(-1)   # zero out padding positions
                return embeddings
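A quick shape check of the embedding layer (illustrative; reuses the X tensor from the tokenization step and the device set at the top of the notebook):

# Illustrative shape check: BERT produces one 768-dim vector per token position
emb = embed().to(device)
with torch.no_grad():
    e = emb(X[:2].long().to(device))
print(e.shape)   # expected: torch.Size([2, 300, 768])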
Positional Encoding
In [10]: ## Sinusoidal positional encoding, as in "Attention Is All You Need"
         import math

         class pos_enc(torch.nn.Module):
             def __init__(self) -> None:
                 super().__init__()
             def forward(self, x):
                 batch_size, max_seq_length, dmodel = x.shape
                 pe = torch.zeros_like(x)   # positional encoding matrix
                 # Compute the positional encoding values
                 for pos in range(max_seq_length):
                     for i in range(0, dmodel):
                         if i % 2 == 0:
                             pe[:, pos, i] = math.sin(pos / (10000 ** (2 * i / dmodel)))
                         else:
                             pe[:, pos, i] = math.cos(pos / (10000 ** (2 * i / dmodel)))
                 x = x + pe
                 return x
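The double Python loop above is re-run on every forward pass, which is slow for a 300x224 input. A vectorized sketch of the same computation, keeping this notebook's 2*i/dmodel exponent per dimension, could look like this:

# Vectorized sketch of the same sinusoidal encoding (keeps the 2*i/dmodel exponent used above)
class pos_enc_vectorized(torch.nn.Module):
    def forward(self, x):
        batch_size, max_seq_length, dmodel = x.shape
        pos = torch.arange(max_seq_length, device=x.device, dtype=torch.float32).unsqueeze(1)  # (seq, 1)
        i = torch.arange(dmodel, device=x.device, dtype=torch.float32).unsqueeze(0)            # (1, dmodel)
        angle = pos / (10000 ** (2 * i / dmodel))                                              # (seq, dmodel)
        pe = torch.where(i.long() % 2 == 0, torch.sin(angle), torch.cos(angle))                # even dims sin, odd dims cos
        return x + pe.unsqueeze(0)   # broadcast over the batch dimension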
Self-attention mechanism
In [11]: class self_attention(torch.nn.Module):
             def __init__(self, no_of_heads: int, shape: tuple, mask: bool = False, QKV: list = []):
                 '''
                 Initializes a Self-Attention module as described in the "Attention Is All You Need" paper.
                 This module splits the input into multiple heads to allow the model to jointly attend to information
                 from different representation subspaces at different positions. After attention is computed
                 on each head, the module concatenates and linearly transforms the results.
                 ## Parameters:
                 * no_of_heads (int): Number of attention heads. To implement single-head attention, set this to 1.
                 * shape (tuple): A tuple (seq_length, dmodel) where `seq_length` is the length of the input sequence
                   and `dmodel` is the dimensionality of the input feature space.
                 * mask (bool, optional): If True, a causal mask is applied to prevent attention to future positions.
                 * QKV (list, optional): A list containing pre-computed Query (Q), Key (K), and Value (V) inputs to use instead of `x`.
                 The forward pass computes the multi-head attention for input `x` and returns the result plus a residual connection.
                 '''
                 super().__init__()
                 self.h = no_of_heads
                 self.seq_length, self.dmodel = shape
                 self.dk = self.dmodel // self.h
                 self.softmax = torch.nn.Softmax(dim=-1)
                 self.mQW = torch.nn.ModuleList([torch.nn.Linear(self.dmodel, self.dk) for _ in range(self.h)])
                 self.mKW = torch.nn.ModuleList([torch.nn.Linear(self.dmodel, self.dk) for _ in range(self.h)])
                 self.mVW = torch.nn.ModuleList([torch.nn.Linear(self.dmodel, self.dk) for _ in range(self.h)])
                 self.output_linear = torch.nn.Linear(self.dmodel, self.dmodel)
                 self.mask = mask
                 self.QKV = QKV
             def __add_mask(self, atten_values):
                 # Causal mask: positions above the diagonal get a large negative value before the softmax
                 mask_value = -1e9
                 mask = torch.triu(torch.ones(atten_values.shape) * mask_value, diagonal=1)
                 masked = atten_values + mask.to(device)
                 return masked
             def forward(self, x):
                 heads = []
                 for i in range(self.h):
                     # Apply the linear projections from dmodel => d_k for this head
                     if self.QKV:
                         q = self.mQW[i](self.QKV[0])
                         k = self.mKW[i](self.QKV[1])
                         v = self.mVW[i](self.QKV[2])
                     else:
                         q = self.mQW[i](x)
                         k = self.mKW[i](x)
                         v = self.mVW[i](x)
                     # Scaled dot-product attention using the projected vectors q, k, and v
                     self.scores = torch.matmul(q, k.transpose(-1, -2)) / torch.sqrt(torch.tensor(self.dk, dtype=torch.float32))
                     if self.mask:
                         self.scores = self.__add_mask(self.scores)
                     attn = self.softmax(self.scores)
                     head_i = torch.matmul(attn, v)
                     heads.append(head_i)
                 # Concatenate all the heads together
                 multi_head = torch.cat(heads, dim=-1)
                 # Final linear layer
                 output = self.output_linear(multi_head)
                 return output + x  # residual connection
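A quick shape check of the attention block on random data (illustrative; the shapes match the 224-dim projected inputs used later in the notebook):

# Illustrative shape check: input and output shapes are identical thanks to the residual connection
attn = self_attention(no_of_heads=4, shape=(300, 224), mask=True).to(device)
dummy = torch.randn(2, 300, 224).to(device)
print(attn(dummy).shape)   # expected: torch.Size([2, 300, 224])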
Decoder
In [12]: class decoder_layer(torch.nn.Module):
             def __init__(self, shape: tuple, no_of_heads: int = 1):
                 '''
                 Implementation of a Transformer decoder layer.
                 Parameters:
                     shape (tuple): The shape (seq_length, dmodel) of the input tensor
                     no_of_heads (int): Number of heads in the attention mechanism. Set this to 1 for single-head attention.
                 Returns:
                     Tensor: The output of the decoder layer after applying attention, the feed-forward network and layer normalization.
                 '''
                 super().__init__()
                 self.max_seq_length, self.dmodel = shape
                 def ff_weights():
                     layer1 = torch.nn.Linear(self.dmodel, 600)
                     layer2 = torch.nn.Linear(600, 600)
                     layer3 = torch.nn.Linear(600, self.dmodel)
                     return layer1, layer2, layer3
                 self.no_of_heads = no_of_heads
                 self.multi_head = self_attention(no_of_heads=no_of_heads, mask=True,
                                                  shape=(self.max_seq_length, self.dmodel))
                 self.layer1, self.layer2, self.layer3 = ff_weights()
                 self.softmax = torch.nn.Softmax(dim=-1)
                 self.layerNorm = torch.nn.LayerNorm(shape)
                 self.relu1 = torch.nn.ReLU()
                 self.relu2 = torch.nn.ReLU()
             def feed_forward(self, x):
                 f = self.layer1(x)
                 f = self.relu1(f)
                 f = self.layer2(f)
                 f = self.relu2(f)
                 f = self.layer3(f)
                 return self.layerNorm(f + x)   # residual connection
             def forward(self, x):
                 x = self.multi_head(x)     # masked multi-head self-attention (includes its own residual)
                 x = self.layerNorm(x)
                 x = self.feed_forward(x)   # position-wise feed-forward with residual and layer norm
                 x = self.layerNorm(x)
                 return x
Full Model Architecture
In [13]: class architecture(torch.nn.Module):
             def __init__(self, n_classes, shape) -> None:
                 super().__init__()
                 self.max_seq_length, self.dmodel = shape
                 self.projected_dmodel = 224
                 self.embedding_layer = embed()
                 self.proj_to_224 = torch.nn.Linear(self.dmodel, self.projected_dmodel)
                 self.positional = pos_enc()
                 self.decoder1 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
                 self.decoder2 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
                 self.decoder3 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
                 self.decoder4 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
                 self.decoder5 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
                 self.decoder6 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
                 self.decoder7 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
                 self.decoder8 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
                 self.final_MLP = torch.nn.Linear(self.projected_dmodel, n_classes)
                 self.softmax = torch.nn.Softmax(dim=2)
             def forward(self, x, temperature=1.0):
                 x = self.embedding_layer(x)   # frozen BERT embeddings: (batch, 300, 768)
                 x = self.proj_to_224(x)       # project to the decoder width: (batch, 300, 224)
                 x = self.positional(x)        # add sinusoidal positional encoding
                 x = self.decoder1(x)
                 x = self.decoder2(x)
                 x = self.decoder3(x)
                 x = self.decoder4(x)
                 x = self.decoder5(x)
                 x = self.decoder6(x)
                 x = self.decoder7(x)
                 x = self.decoder8(x)
                 x = self.final_MLP(x)         # per-token scores over the vocabulary
                 logits = x / temperature
                 x = self.softmax(logits)
                 return x
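Note that torch.nn.CrossEntropyLoss, used in the training script below, applies log-softmax internally and expects raw logits, so feeding it the softmax output above applies a softmax twice. A sketch of a logits-returning variant (not the notebook's original design, shown only to illustrate the alternative):

# Sketch of a logits-returning forward pass (a variant, not the original design)
class architecture_logits(architecture):
    def forward(self, x, temperature=1.0):
        x = self.embedding_layer(x)
        x = self.proj_to_224(x)
        x = self.positional(x)
        for dec in [self.decoder1, self.decoder2, self.decoder3, self.decoder4,
                    self.decoder5, self.decoder6, self.decoder7, self.decoder8]:
            x = dec(x)
        return self.final_MLP(x) / temperature   # raw logits; apply a softmax outside only when sampling

With this variant the training loss would be computed directly on the logits, and the temperature/softmax would only matter at inference time.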
Training Script
In [ ]: !pip install torchmetrics
In [14]: from torchmetrics import Accuracy
from tqdm import tqdm
In [15]: dataset = torch.utils.data.TensorDataset(X,Y)
loader = torch.utils.data.DataLoader(dataset,batch_size=20,num_workers=0,shuffle=False)
In [ ]: vocab_size = tokenizer.vocab_size
model = architecture(n_classes = vocab_size, shape = (300,768))
model = model.to(device)
model.load_state_dict(torch.load("{fill with path to model weights if any}"))  # optional: skip this line when training from scratch
In [76]: metric = Accuracy(num_classes=vocab_size,task='multiclass').to(device)
optimizer = torch.optim.Adam(model.parameters(),lr=0.0001)
criterion = torch.nn.CrossEntropyLoss(ignore_index=0,label_smoothing=0.01)
In [ ]: from tqdm import tqdm
        print('Training Started')
        NUM_EPOCHS = 1
        for epoch in range(NUM_EPOCHS):
            model.train()   # set the model to training mode
            running_loss = 0.0
            epoch_accuracy = 0.0
            num_batches = len(loader)
            # Initialize tqdm progress bar
            with tqdm(total=num_batches, desc=f"Epoch {epoch + 1}", leave=True) as pbar:
                for i, (x_batch, y_batch) in enumerate(loader):
                    x_batch, y_batch = x_batch.to(device), y_batch.to(device)
                    # Zero the parameter gradients
                    optimizer.zero_grad()
                    # Forward pass
                    outputs = model(x_batch)
                    # Flatten the outputs and targets to (batch * seq_length, ...) for the loss
                    outputs = outputs.view(-1, outputs.shape[-1])
                    y_batch = y_batch.view(-1)
                    # Loss calculation
                    loss = criterion(outputs, y_batch).to(device)
                    # Backward pass and optimize
                    loss.backward()
                    optimizer.step()
                    # Metrics
                    argmax_pred = outputs.argmax(axis=1)
                    metric.update(argmax_pred, y_batch)
                    # Print statistics
                    running_loss += loss.item()
                    if i % 10 == 9:   # update every 10 mini-batches
                        accuracy = metric.compute().item()
                        epoch_accuracy += accuracy
                        pbar.set_postfix({'Loss': running_loss / (i + 1), 'Accuracy': accuracy})
                    # Update the progress bar
                    pbar.update(1)
                    # Save model weights periodically
                    if i % 10 == 9:
                        torch.save(model.state_dict(), '/kaggle/working/model_weights.pth')
            # Compute and print average loss and accuracy for the epoch
            avg_loss = running_loss / num_batches
            avg_accuracy = epoch_accuracy / (num_batches // 10)   # accuracy is accumulated once every 10 batches
            print(f'Epoch {epoch + 1} - Loss: {avg_loss:.4f}, Accuracy: {avg_accuracy:.4f}')
        print('Training Completed')
Training Started
Epoch 1:  66%|██████    | 2770/4166 [1:58:07<1:00:08, 2.59s/it, Loss=9.89, Accuracy=0.234]
In [ ]: #Link to download model
from IPython.display import FileLink
FileLink(r'model_weights.pth')
Simplistic Inference Script
In [17]: text = '''Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'''
In [6]: def model_pred(tokens, temp):
            model.eval()
            with torch.no_grad():
                pred = model(tokens, temp)
            pred = pred.view(-1, pred.shape[-1]).argmax(axis=1)   # greedy decoding: most likely token per position
            return pred

        def tokenize_text(text):
            seq_length = 300
            q_tokens = tokenizer(text, add_special_tokens=False)['input_ids']
            pad = [0 for i in range(seq_length - len(q_tokens))]
            final_tokens = [q_tokens + pad]
            last_index = len(q_tokens) - 1
            return torch.tensor(final_tokens), last_index

        def inference(text, starter='', temperature=1.0):
            curr = 0
            pred_list = []
            # [CLS] marks the start of the question to be generated
            t, last_token = tokenize_text(text + '[CLS]' + starter)
            t = t.to(device)
            while curr != 102:   # 102 is BERT's [SEP] id, used here as the <eos> token
                print('\n', "Generating...")
                all_pred = model_pred(t, temperature)
                pred = all_pred[last_token].item()
                pred_list.append(pred)
                t[0][last_token + 1] = pred   # feed the prediction back in at the next position
                last_token += 1
                curr = pred
                if len(pred_list) > 50:       # safety cap to avoid runaway generation
                    break
            print("Question from the model: ".upper(), starter + ' ' + tokenizer.decode(pred_list))
            return starter + ' ' + tokenizer.decode(pred_list)

        inference(text, '')
QUESTION FROM THE MODEL: what is the name of the singer? [SEP]
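The forward pass accepts a temperature, but greedy argmax decoding is unaffected by it (scaling the logits never changes the argmax). A sketch of sampling from the temperature-scaled distribution instead, assuming the model returns per-position softmax probabilities as above:

# Illustrative: sample the next token from the model's distribution instead of taking the argmax
def model_sample(tokens, temp=1.0):
    model.eval()
    with torch.no_grad():
        probs = model(tokens, temp)              # (1, 300, vocab_size) probabilities
    probs = probs.view(-1, probs.shape[-1])      # (300, vocab_size)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)   # one sampled id per position

Swapping model_pred for model_sample inside inference would make the temperature argument actually change the generated questions.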
Issues
Not enough data: I trained on only ~125K samples, which is too small.
Pre-training on a downstream task: pre-training is supposed to be self-supervised, but here the model was trained directly on the context-to-question task.