Decoder-Only Transformer (LLM) For Question Asking
"FROM SCRATCH"
Notebook Structure
Data
Data source
Tokenization
Features and Target
Test data
Model Design
Positional encoding
Multi-head attention
Transformer Decoder
Final Architecture
Training script
Simplistic Inference Script
Issues and mistakes
Pre-training with a downstream task
Not masking padding tokens
Context window
In [1]: #necessary imports
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plot
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
import random
In [2]: if torch.cuda.is_available():
            device = torch.device("cuda")
            print("GPU is available")
        else:
            device = torch.device("cpu")
            print("GPU is not available, using CPU")
GPU is available
DATA
Data Source
The data I used for this project is the Stanford Question Answering Dataset (SQuAD).
SQuAD was prepared such that a question and a context map to an answer (Q + C --> A).
I modified the data so that a context maps to a question (C --> Q).
Find my modified data here: link to dataset
In [3]: data = pd.read_json('{fill with path to your data}').to_dict(orient='list')
Tokenization
In [ ]: #bert tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
In [5]: #Example
data['conversation'][0]
Out[5]: [{'from': 'human',
          'value': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'},
         {'from': 'gpt', 'value': 'When did Beyonce start becoming popular?'}]
In [6]: def tokenize_input(qa):
            # 1. Tokenize with a max sequence length of 300 and pad with zeros
            # 2. Add <sos> and <eos> tokens to the target; here BERT's [CLS] and [SEP] play those roles
            seq_length = 300
            q_tokens = tokenizer(qa[0]['value'], add_special_tokens=False)['input_ids']  # context (the 'human' turn)
            a_tokens = tokenizer(qa[1]['value'], padding=True)['input_ids']              # question (the 'gpt' turn), wrapped in [CLS] ... [SEP]
            x_tokens = q_tokens + a_tokens[:-1]   # input: context followed by the question without its last token
            y_tokens = q_tokens[1:] + a_tokens    # target: the same sequence shifted left by one token
            x_pad = [0 for i in range(seq_length - len(x_tokens))]
            y_pad = [0 for i in range(seq_length - len(y_tokens))]
            final_x = x_tokens + x_pad
            final_y = y_tokens + y_pad
            return final_x, final_y
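As a quick sanity check of this teacher-forcing setup: the input and target are the same token stream offset by one position, so for every non-padding position i, final_y[i] is the token the model should predict after reading final_x[:i+1]. A minimal check on the first record (illustrative, assuming the cells above have been run):

# Minimal sanity check of the shifted input/target pair (illustrative)
x, y = tokenize_input(data['conversation'][0])
print(len(x), len(y))              # both padded to 300
print(tokenizer.decode(x[:8]))     # begins with the context passage
print(tokenizer.decode(y[:8]))     # the same text shifted left by one token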
In [7]: # Tokenize the whole (shuffled) dataset, keeping only pairs that fit in the 300-token window
        tokens = []
        targets = []
        for i in random.sample(data['conversation'], len(data['conversation'])):
            try:
                x, y = tokenize_input(i)
                if len(x) == 300:   # samples longer than 300 tokens are dropped
                    tokens.append(x)
                    targets.append(y)
            except:
                pass
        X = torch.IntTensor(tokens)
        Y = torch.LongTensor(targets)
Token indices sequence length is longer than the specified maximum sequence length for this model (718 > 512). Running this sequence through the model will result in indexing errors
In [8]: X.shape, Y.shape
Out[8]: (torch.Size([124975, 300]), torch.Size([124975, 300]))
Test Data
Create your test data here
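A minimal way to hold out a test split from the X/Y tensors built above (the 5% ratio and variable names here are illustrative, not part of the original notebook):

# Hold out ~5% of the tokenized pairs as a test set (illustrative split, not from the original notebook)
n_test = int(0.05 * len(X))
perm = torch.randperm(len(X))
X_test, Y_test = X[perm[:n_test]], Y[perm[:n_test]]
X_train, Y_train = X[perm[n_test:]], Y[perm[n_test:]]
print(X_train.shape, X_test.shape)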
Model Design
Important Notes
Embedding layer: BERT (bert-base-uncased) is used, frozen under torch.no_grad(), to embed the input tokens
Positional encoding: sinusoidal encoding from "Attention Is All You Need"
Attention: multi-head (4 heads)
Linear projection: the 768-dim embeddings are projected to 224 before entering the decoder stack
Number of decoder layers: 8
Embedding layer
In [9]: class embed(torch.nn.Module):
            def __init__(self):
                super().__init__()
                # Pre-trained BERT used purely as a feature extractor
                self.embedder = AutoModel.from_pretrained('bert-base-uncased')
            def forward(self, x_tokens):
                inputs = {'input_ids': x_tokens}
                with torch.no_grad():   # BERT weights are not updated during training
                    attention_mask = (inputs['input_ids'] != 0).int()   # 0 is the [PAD] id
                    outputs = self.embedder(**inputs, attention_mask=attention_mask)
                    embeddings = outputs.last_hidden_state * attention_mask.unsqueeze(-1)   # zero out padding positions
                return embeddings
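A quick shape check of the embedding layer (illustrative; reuses the X tensor from the tokenization step and the device set at the top of the notebook):

# Illustrative shape check: BERT produces one 768-dim vector per token position
emb = embed().to(device)
with torch.no_grad():
    e = emb(X[:2].long().to(device))
print(e.shape)   # expected: torch.Size([2, 300, 768])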
Positional Encoding
In [10]: ## Sinusoidal positional encoding, as in "Attention Is All You Need"
         import math

         class pos_enc(torch.nn.Module):
             def __init__(self) -> None:
                 super().__init__()
             def forward(self, x):
                 batch_size, max_seq_length, dmodel = x.shape
                 pe = torch.zeros_like(x)   # positional encoding matrix
                 # Compute the positional encoding values
                 for pos in range(max_seq_length):
                     for i in range(0, dmodel):
                         if i % 2 == 0:
                             pe[:, pos, i] = math.sin(pos / (10000 ** (2 * i / dmodel)))
                         else:
                             pe[:, pos, i] = math.cos(pos / (10000 ** (2 * i / dmodel)))
                 x = x + pe
                 return x
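The double Python loop above is re-run on every forward pass, which is slow for a 300x224 input. A vectorized sketch of the same computation, keeping this notebook's 2*i/dmodel exponent per dimension, could look like this:

# Vectorized sketch of the same sinusoidal encoding (keeps the 2*i/dmodel exponent used above)
class pos_enc_vectorized(torch.nn.Module):
    def forward(self, x):
        batch_size, max_seq_length, dmodel = x.shape
        pos = torch.arange(max_seq_length, device=x.device, dtype=torch.float32).unsqueeze(1)  # (seq, 1)
        i = torch.arange(dmodel, device=x.device, dtype=torch.float32).unsqueeze(0)            # (1, dmodel)
        angle = pos / (10000 ** (2 * i / dmodel))                                              # (seq, dmodel)
        pe = torch.where(i.long() % 2 == 0, torch.sin(angle), torch.cos(angle))                # even dims sin, odd dims cos
        return x + pe.unsqueeze(0)   # broadcast over the batch dimension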
Self-attention mechanism
In [11]: class self_attention(torch.nn.Module):
             def __init__(self, no_of_heads: int, shape: tuple, mask: bool = False, QKV: list = []):
                 '''
                 Initializes a Self-Attention module as described in the "Attention Is All You Need" paper.
                 This module splits the input into multiple heads to allow the model to jointly attend to information
                 from different representation subspaces at different positions. After attention is computed
                 on each head, the module concatenates and linearly transforms the results.
                 ## Parameters:
                 * no_of_heads (int): Number of attention heads. To implement single-head attention, set this to 1.
                 * shape (tuple): A tuple (seq_length, dmodel) where `seq_length` is the length of the input sequence
                   and `dmodel` is the dimensionality of the input feature space.
                 * mask (bool, optional): If True, a causal mask is applied to prevent attention to future positions.
                 * QKV (list, optional): A list containing pre-computed Query (Q), Key (K), and Value (V) inputs to use instead of `x`.
                 The forward pass computes the multi-head attention for input `x` and returns the result plus a residual connection.
                 '''
                 super().__init__()
                 self.h = no_of_heads
                 self.seq_length, self.dmodel = shape
                 self.dk = self.dmodel // self.h
                 self.softmax = torch.nn.Softmax(dim=-1)
                 self.mQW = torch.nn.ModuleList([torch.nn.Linear(self.dmodel, self.dk) for _ in range(self.h)])
                 self.mKW = torch.nn.ModuleList([torch.nn.Linear(self.dmodel, self.dk) for _ in range(self.h)])
                 self.mVW = torch.nn.ModuleList([torch.nn.Linear(self.dmodel, self.dk) for _ in range(self.h)])
                 self.output_linear = torch.nn.Linear(self.dmodel, self.dmodel)
                 self.mask = mask
                 self.QKV = QKV
             def __add_mask(self, atten_values):
                 # Causal mask: positions above the diagonal get a large negative value before the softmax
                 mask_value = -1e9
                 mask = torch.triu(torch.ones(atten_values.shape) * mask_value, diagonal=1)
                 masked = atten_values + mask.to(device)
                 return masked
             def forward(self, x):
                 heads = []
                 for i in range(self.h):
                     # Apply the linear projections from dmodel => d_k for this head
                     if self.QKV:
                         q = self.mQW[i](self.QKV[0])
                         k = self.mKW[i](self.QKV[1])
                         v = self.mVW[i](self.QKV[2])
                     else:
                         q = self.mQW[i](x)
                         k = self.mKW[i](x)
                         v = self.mVW[i](x)
                     # Scaled dot-product attention using the projected vectors q, k, and v
                     self.scores = torch.matmul(q, k.transpose(-1, -2)) / torch.sqrt(torch.tensor(self.dk, dtype=torch.float32))
                     if self.mask:
                         self.scores = self.__add_mask(self.scores)
                     attn = self.softmax(self.scores)
                     head_i = torch.matmul(attn, v)
                     heads.append(head_i)
                 # Concatenate all the heads together
                 multi_head = torch.cat(heads, dim=-1)
                 # Final linear layer
                 output = self.output_linear(multi_head)
                 return output + x  # residual connection
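A quick shape check of the attention block on random data (illustrative; the shapes match the 224-dim projected inputs used later in the notebook):

# Illustrative shape check: input and output shapes are identical thanks to the residual connection
attn = self_attention(no_of_heads=4, shape=(300, 224), mask=True).to(device)
dummy = torch.randn(2, 300, 224).to(device)
print(attn(dummy).shape)   # expected: torch.Size([2, 300, 224])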
Decoder
In [12]: class decoder_layer(torch.nn.Module):
             def __init__(self, shape: tuple, no_of_heads: int = 1):
                 '''
                 Implementation of a Transformer decoder layer.
                 Parameters:
                     shape (tuple): The shape (seq_length, dmodel) of the input tensor
                     no_of_heads (int): Number of heads in the attention mechanism. Set this to 1 for single-head attention.
                 Returns:
                     Tensor: The output of the decoder layer after applying attention, the feed-forward network and layer normalization.
                 '''
                 super().__init__()
                 self.max_seq_length, self.dmodel = shape
                 def ff_weights():
                     layer1 = torch.nn.Linear(self.dmodel, 600)
                     layer2 = torch.nn.Linear(600, 600)
                     layer3 = torch.nn.Linear(600, self.dmodel)
                     return layer1, layer2, layer3
                 self.no_of_heads = no_of_heads
                 self.multi_head = self_attention(no_of_heads=no_of_heads, mask=True,
                                                  shape=(self.max_seq_length, self.dmodel))
                 self.layer1, self.layer2, self.layer3 = ff_weights()
                 self.softmax = torch.nn.Softmax(dim=-1)
                 self.layerNorm = torch.nn.LayerNorm(shape)
                 self.relu1 = torch.nn.ReLU()
                 self.relu2 = torch.nn.ReLU()
             def feed_forward(self, x):
                 f = self.layer1(x)
                 f = self.relu1(f)
                 f = self.layer2(f)
                 f = self.relu2(f)
                 f = self.layer3(f)
                 return self.layerNorm(f + x)   # residual connection
             def forward(self, x):
                 x = self.multi_head(x)     # masked multi-head self-attention (includes its own residual)
                 x = self.layerNorm(x)
                 x = self.feed_forward(x)   # position-wise feed-forward with residual and layer norm
                 x = self.layerNorm(x)
                 return x
Full Model Architecture
In [13]: class architecture(torch.nn.Module):
             def __init__(self, n_classes, shape) -> None:
                 super().__init__()
                 self.max_seq_length, self.dmodel = shape
                 self.projected_dmodel = 224
                 self.embedding_layer = embed()
                 self.proj_to_224 = torch.nn.Linear(self.dmodel, self.projected_dmodel)
                 self.positional = pos_enc()
                 self.decoder1 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
                 self.decoder2 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
                 self.decoder3 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
                 self.decoder4 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
                 self.decoder5 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
                 self.decoder6 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
                 self.decoder7 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
                 self.decoder8 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
                 self.final_MLP = torch.nn.Linear(self.projected_dmodel, n_classes)
                 self.softmax = torch.nn.Softmax(dim=2)
             def forward(self, x, temperature=1.0):
                 x = self.embedding_layer(x)   # frozen BERT embeddings: (batch, 300, 768)
                 x = self.proj_to_224(x)       # project to the decoder width: (batch, 300, 224)
                 x = self.positional(x)        # add sinusoidal positional encoding
                 x = self.decoder1(x)
                 x = self.decoder2(x)
                 x = self.decoder3(x)
                 x = self.decoder4(x)
                 x = self.decoder5(x)
                 x = self.decoder6(x)
                 x = self.decoder7(x)
                 x = self.decoder8(x)
                 x = self.final_MLP(x)         # per-token scores over the vocabulary
                 logits = x / temperature
                 x = self.softmax(logits)
                 return x
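Note that torch.nn.CrossEntropyLoss, used in the training script below, applies log-softmax internally and expects raw logits, so feeding it the softmax output above applies a softmax twice. A sketch of a logits-returning variant (not the notebook's original design, shown only to illustrate the alternative):

# Sketch of a logits-returning forward pass (a variant, not the original design)
class architecture_logits(architecture):
    def forward(self, x, temperature=1.0):
        x = self.embedding_layer(x)
        x = self.proj_to_224(x)
        x = self.positional(x)
        for dec in [self.decoder1, self.decoder2, self.decoder3, self.decoder4,
                    self.decoder5, self.decoder6, self.decoder7, self.decoder8]:
            x = dec(x)
        return self.final_MLP(x) / temperature   # raw logits; apply a softmax outside only when sampling

With this variant the training loss would be computed directly on the logits, and the temperature/softmax would only matter at inference time.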
Training Script
In [ ]: !pip install torchmetrics
In [14]: from torchmetrics import Accuracy
from tqdm import tqdm
In [15]: dataset = torch.utils.data.TensorDataset(X,Y)
loader = torch.utils.data.DataLoader(dataset,batch_size=20,num_workers=0,shuffle=False)
In [ ]: vocab_size = tokenizer.vocab_size
model = architecture(n_classes = vocab_size, shape = (300,768))
model = model.to(device)
model.load_state_dict(torch.load("{fill with path to model weights if any}"))  # optional: skip this line when training from scratch
In [76]: metric = Accuracy(num_classes=vocab_size,task='multiclass').to(device)
optimizer = torch.optim.Adam(model.parameters(),lr=0.0001)
criterion = torch.nn.CrossEntropyLoss(ignore_index=0,label_smoothing=0.01)
In [ ]: from tqdm import tqdm
        print('Training Started')
        NUM_EPOCHS = 1
        for epoch in range(NUM_EPOCHS):
            model.train()   # set the model to training mode
            running_loss = 0.0
            epoch_accuracy = 0.0
            num_batches = len(loader)
            # Initialize tqdm progress bar
            with tqdm(total=num_batches, desc=f"Epoch {epoch + 1}", leave=True) as pbar:
                for i, (x_batch, y_batch) in enumerate(loader):
                    x_batch, y_batch = x_batch.to(device), y_batch.to(device)
                    # Zero the parameter gradients
                    optimizer.zero_grad()
                    # Forward pass
                    outputs = model(x_batch)
                    # Flatten the outputs and targets to (batch * seq_length, ...) for the loss
                    outputs = outputs.view(-1, outputs.shape[-1])
                    y_batch = y_batch.view(-1)
                    # Loss calculation
                    loss = criterion(outputs, y_batch).to(device)
                    # Backward pass and optimize
                    loss.backward()
                    optimizer.step()
                    # Metrics
                    argmax_pred = outputs.argmax(axis=1)
                    metric.update(argmax_pred, y_batch)
                    # Print statistics
                    running_loss += loss.item()
                    if i % 10 == 9:   # update every 10 mini-batches
                        accuracy = metric.compute().item()
                        epoch_accuracy += accuracy
                        pbar.set_postfix({'Loss': running_loss / (i + 1), 'Accuracy': accuracy})
                    # Update the progress bar
                    pbar.update(1)
                    # Save model weights periodically
                    if i % 10 == 9:
                        torch.save(model.state_dict(), '/kaggle/working/model_weights.pth')
            # Compute and print average loss and accuracy for the epoch
            avg_loss = running_loss / num_batches
            avg_accuracy = epoch_accuracy / (num_batches // 10)   # accuracy is accumulated once every 10 batches
            print(f'Epoch {epoch + 1} - Loss: {avg_loss:.4f}, Accuracy: {avg_accuracy:.4f}')
        print('Training Completed')
Training Started
Epoch 1:  66%|██████    | 2770/4166 [1:58:07<1:00:08, 2.59s/it, Loss=9.89, Accuracy=0.234]
In [ ]: #Link to download model
from IPython.display import FileLink
FileLink(r'model_weights.pth')
Simplistic Inference Script
In [17]: text = '''Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'''
In [6]: def model_pred(tokens, temp):
            model.eval()
            with torch.no_grad():
                pred = model(tokens, temp)
            pred = pred.view(-1, pred.shape[-1]).argmax(axis=1)   # greedy decoding: most likely token per position
            return pred

        def tokenize_text(text):
            seq_length = 300
            q_tokens = tokenizer(text, add_special_tokens=False)['input_ids']
            pad = [0 for i in range(seq_length - len(q_tokens))]
            final_tokens = [q_tokens + pad]
            last_index = len(q_tokens) - 1
            return torch.tensor(final_tokens), last_index

        def inference(text, starter='', temperature=1.0):
            curr = 0
            pred_list = []
            # [CLS] marks the start of the question to be generated
            t, last_token = tokenize_text(text + '[CLS]' + starter)
            t = t.to(device)
            while curr != 102:   # 102 is BERT's [SEP] id, used here as the <eos> token
                print('\n', "Generating...")
                all_pred = model_pred(t, temperature)
                pred = all_pred[last_token].item()
                pred_list.append(pred)
                t[0][last_token + 1] = pred   # feed the prediction back in at the next position
                last_token += 1
                curr = pred
                if len(pred_list) > 50:       # safety cap to avoid runaway generation
                    break
            print("Question from the model: ".upper(), starter + ' ' + tokenizer.decode(pred_list))
            return starter + ' ' + tokenizer.decode(pred_list)

        inference(text, '')
QUESTION FROM THE MODEL: what is the name of the singer? [SEP]
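The forward pass accepts a temperature, but greedy argmax decoding is unaffected by it (scaling the logits never changes the argmax). A sketch of sampling from the temperature-scaled distribution instead, assuming the model returns per-position softmax probabilities as above:

# Illustrative: sample the next token from the model's distribution instead of taking the argmax
def model_sample(tokens, temp=1.0):
    model.eval()
    with torch.no_grad():
        probs = model(tokens, temp)              # (1, 300, vocab_size) probabilities
    probs = probs.view(-1, probs.shape[-1])      # (300, vocab_size)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)   # one sampled id per position

Swapping model_pred for model_sample inside inference would make the temperature argument actually change the generated questions.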
Issues
Not enough data: I trained on only ~125K samples, which is too small.
Pre-training on a downstream task: pre-training is supposed to be self-supervised, but here the model was trained directly on the context-to-question task.