Transformers Laid Out - Pramod's Blog
I have encountered mainly three types of blogs/videos/tutorials talking about transformers:
      Explaining how a transformer works (one of the best is Jay Alammar's blog)
      Explaining the "Attention is all you need" paper (The Annotated Transformer)
      Coding transformers in PyTorch (Coding a ChatGPT Like Transformer From Scratch in PyTorch)
Each follows an amazing pedagogy, helping one understand a single concept from multiple points of view. (This blog has been highly influenced by the above works.)
Once we have the baseline context set up, we will dive into the code itself.
I will mention the section from the paper and the part of the transformer that we will be coding, and along with that I will give you a sample code block with hints and links to documentation, like the following:
class TransformerLRScheduler:
    def __init__(self, optimizer, d_model, warmup_steps):
        """
        Args:
            optimizer: Optimizer to adjust learning rate for
            d_model: Model dimensionality
            warmup_steps: Number of warmup steps
        """
        #YOUR CODE HERE
I will add helpful links after the code block, but I recommend you do your own research first. That is the first step to becoming a cracked engineer.
I recommend you copy these code blocks and try to implement them by yourself. To make it easier, before we start coding I will explain each part in detail. If you are still unable to solve it, come back and see my implementation.
The original transformer was made for the machine translation task, and that is what we shall do as well. We will try to translate "I like Pizza" from English to Hindi.
But before that, let's have a brief look into the black box that is our Transformer. We can see that it consists of Encoders and Decoders.
Before being passed to the Encoder, the sentence "I like Pizza" is broken down into its respective words* and each word is embedded using an embedding matrix (which is trained along with the transformer).
After that, these embeddings are passed to the encoder block, which essentially does two things
The decoder block takes the output from the encoder, runs it through itself, produces an output, and feeds that output back to itself to create the next word.
Think of it like this: the encoder understands your language (let's call it X) and another language called Y. The decoder understands Y and the language you are trying to translate X to, let's call it Z.
So Y acts as the common language that both the encoder and decoder speak to produce the final output.
*We are using words for easier understanding; most modern LLMs do not work with words but rather with "tokens".
       Understanding Self-attention
We have all heard of the famous trio, "Query, Key and Values". I absolutely lost my head trying to understand how these terms came about. Were Q, K, V related to dictionaries (or maps in traditional CS)? Were they inspired by a previous paper? If so, how did the authors come up with them?
Question:
You can come up with as many questions (the queries) for the sentence as you want. Now, for each query, you will have one specific piece of information (the key) that will give you the desired answer (the value).
Query:
This is an oversimplification really, but it helps to understand that the queries, keys and values can all be created using only the sentence.
Let us first understand how self-attention is applied, and subsequently understand why it is even done. Also, for the rest of the explanation, treat Q, K, V purely as matrices and nothing else.
First, the words "Delicious Pizza" are converted into embeddings. Then they are multiplied with the weights W_Q, W_K, W_V to produce the Q, K, V vectors.
These weights W_Q, W_K, W_V are trained alongside the transformer. Notice how the Q, K, V vectors are smaller than x1, x2. Namely, x1 and x2 are vectors of size 512, whereas Q, K, V are of size 64. This is an architectural choice to make the computation smaller and faster.
To calculate the attention score for the first word "Delicious", we take the query (q1) and the key (k1) of the word and take their dot product. (Dot products are great for finding similarity between things.)
Then we divide that by the square root of the dimension of the key vector. This is done to stabilize training.
The same process is done with the query of word one (q1) and all the keys of the different words, in this case k1 & k2.
These scaled scores are passed through a softmax so that they sum to 1. Then these weights are multiplied with the value of each word (v1, v2), intuitively to weigh the importance of each word with respect to the selected word. Less important words are drowned out by, let's say, multiplying with 0.001.
And finally everything is summed up to get the z vector.
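To make this concrete, here is a minimal sketch of that per-word computation in PyTorch. The tiny 4-dimensional vectors and the names q1, k1, k2, v1, v2 are made up purely for illustration; the real model uses 64-dimensional vectors produced by the learned weight matrices.

import math
import torch

d_k = 4  # toy dimension; the paper uses 64
q1 = torch.randn(d_k)                          # query for "Delicious"
k1, k2 = torch.randn(d_k), torch.randn(d_k)    # keys for "Delicious", "Pizza"
v1, v2 = torch.randn(d_k), torch.randn(d_k)    # values for "Delicious", "Pizza"

# dot products measure similarity, scaled by sqrt(d_k) to stabilize training
scores = torch.stack([q1 @ k1, q1 @ k2]) / math.sqrt(d_k)
weights = torch.softmax(scores, dim=-1)        # how much each word matters to "Delicious"
z1 = weights[0] * v1 + weights[1] * v2         # weighted sum of values = output for "Delicious"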
The thing that made transformers special was that this computation could be parallelized, so we do not deal with individual vectors but rather with matrices.
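In matrix form, this is exactly the scaled dot-product attention formula from the paper:

\text{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V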
This is how the output from each attention head will look.
Finally, we join the outputs from all the attention heads and multiply the result with a matrix W_O (which is trained along with the model) to get the final attention output.
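A minimal sketch of that last step, assuming 8 heads that each produced a (seq_len, 64) output; the tensor names are made up for illustration:

import torch
import torch.nn as nn

num_heads, seq_len, d_head, d_model = 8, 3, 64, 512
head_outputs = [torch.randn(seq_len, d_head) for _ in range(num_heads)]  # one Z per head

W_O = nn.Linear(num_heads * d_head, d_model, bias=False)   # trained with the model
concatenated = torch.cat(head_outputs, dim=-1)              # (seq_len, 512)
final_output = W_O(concatenated)                            # (seq_len, 512)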
But you will never know the best path till you have tried a lot of them; the more, the better.
Hence a single matrix multiplication does not get you the best representation of query and key. Multiple queries can be made, and multiple keys can be made for each of these queries. That is the reason we do so many matrix multiplications: to try and get the best key for a query that is relevant to the question asked by the user.
To visualize how self-attention creates different representations, let's have a look at three different representations of the words "apple", "market" & "cellphone".
Representation 2 will be the best to choose for this question, and it gives us the answer "cellphone", as that is the one closest to it.
In this case Representation 3 will be the best option, and we will get the answer "market".
(These are linear transformations and can be applied to any matrix; the third one is called a shear operation.)
First, an input sentence, e.g. "Pramod likes to eat pizza with his friends", will be broken down into its respective words*:
"Pramod", "likes", "to", "eat", "pizza", "with", "his", "friends"
Now, without positional encoding, the model has no information about the relative positioning of the words (as everything is taken in at once, in parallel).
So the sentence is no different from "Pramod likes to eat friends with his pizza" or any other permutation of the words.
Hence the reason we need PE (Positional Encoding): to tell the model about the positions of different words relative to each other. A good positional encoding should satisfy the following conditions:
      Unique encoding for each position: Because otherwise it will keep changing for different lengths of sentences. Position 2 for a 10-word sentence will be different than for a 100-word sentence. This will hamper training, as there is no predictable pattern that can be followed.
      Linear relation between two encoded positions: If I know the position p of one word, it should be easy to calculate the position p+k of another word. This will make it easier for the model to learn the pattern.
      Generalizes to longer sequences than those encountered in training: If the model is limited by the length of the sentences used in training, it will never work in the real world.
      Generated by a deterministic process the model can learn: It should be a simple formula or an easily calculable algorithm, to help our model generalize better.
      Extensible to multiple dimensions: Different scenarios can have different dimensions; we want it to work in all cases.
Integer Encoding
Reading the above conditions, the first thought that will come to anyone's mind will be: "Why not just add the position of the word?" This naive solution will work for small sentences, but for longer sentences, let's say some essay with 2000 words, adding position 2000 can lead to exploding or vanishing gradients.
There are other alternatives as well, such as normalizing the integer encoding or binary encoding, but each has its own problems. To read about them in detail, go here.
Sinusoidal Encoding
One of the encoding methods that satisfies all our conditions is using sinusoidal functions, as done in the paper.
But why alternate with cosine if sine satisfies all the conditions?
Well, sine does not satisfy all, but most conditions. Our need for a linear relation is not satisfied by sine alone, and hence we need cosine as well. Here, let me present a simple proof, which has been taken from here.
Consider a sequence of sine and cosine pairs, each associated with a frequency ω_i. Our goal is to find a linear transformation matrix M that can shift these sinusoidal functions by a fixed offset k:

M_k \begin{bmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{bmatrix} = \begin{bmatrix} \sin(\omega_i (p+k)) \\ \cos(\omega_i (p+k)) \end{bmatrix}

The frequencies ω_i follow a geometric progression that decreases with dimension index i, defined as:

\omega_i = \frac{1}{10000^{2i/d}}
To find this transformation matrix, we can express it as a general 2×2 matrix with unknown coefficients u1, v1, u2, and v2:

\begin{bmatrix} u_1 & v_1 \\ u_2 & v_2 \end{bmatrix} \begin{bmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{bmatrix} = \begin{bmatrix} \sin(\omega_i (p+k)) \\ \cos(\omega_i (p+k)) \end{bmatrix}

By applying the trigonometric addition theorem to the right-hand side, we can expand this into:

\begin{bmatrix} u_1 \sin(\omega_i p) + v_1 \cos(\omega_i p) \\ u_2 \sin(\omega_i p) + v_2 \cos(\omega_i p) \end{bmatrix} = \begin{bmatrix} \cos(\omega_i k)\sin(\omega_i p) + \sin(\omega_i k)\cos(\omega_i p) \\ \cos(\omega_i k)\cos(\omega_i p) - \sin(\omega_i k)\sin(\omega_i p) \end{bmatrix}
By comparing terms with sin(ω_i p) and cos(ω_i p) on both sides, we can solve for the unknown coefficients:

u_1 = \cos(\omega_i k), \quad v_1 = \sin(\omega_i k)
u_2 = -\sin(\omega_i k), \quad v_2 = \cos(\omega_i k)

This gives us the transformation matrix:

M_k = \begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix}
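As a quick sanity check, here is a small sketch (my own, not from the post) that verifies this rotation-matrix property numerically for an arbitrary frequency, position and offset:

import math
import torch

omega, p, k = 0.3, 5.0, 2.0   # arbitrary frequency, position and offset

M_k = torch.tensor([[math.cos(omega * k),  math.sin(omega * k)],
                    [-math.sin(omega * k), math.cos(omega * k)]])
current = torch.tensor([math.sin(omega * p), math.cos(omega * p)])
shifted = torch.tensor([math.sin(omega * (p + k)), math.cos(omega * (p + k))])

# M_k only depends on k, yet it moves the encoding from position p to p + k
assert torch.allclose(M_k @ current, shifted, atol=1e-6)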
Now that we understand what PE is and why we use sine and cosine, let us understand how it works.

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
pos = position of the word in the sentence (in "Pramod likes pizza", Pramod is at position 0, likes at 1, and so on)
i = the index pair of the embedding: sine is used for even column numbers and cosine for odd column numbers ("Pramod" is converted into a vector of embeddings, which has different indexes)
d_model = dimension of the model (in our case it is 512)
10,000 (n) = a constant determined experimentally
As you can see, using this we can calculate the PE value for each position and all the indexes for that position. Here is a simple illustration showing how it's done.
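If you want to see those numbers for yourself, here is a small sketch (my own, not from the post) that fills a positional-encoding table directly from the formula above:

import math
import torch

d_model, seq_len, n = 8, 4, 10000.0   # tiny values so the table is easy to inspect

pe = torch.zeros(seq_len, d_model)
for pos in range(seq_len):
    for i in range(0, d_model, 2):
        angle = pos / (n ** (i / d_model))
        pe[pos, i] = math.sin(angle)      # even columns use sine
        pe[pos, i + 1] = math.cos(angle)  # odd columns use cosine

print(pe)   # row = position of the word, columns = its positional encoding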
Now, expanding on the above, this is how it looks for the function.
This is how it looks for the original with n = 10,000, d_model = 10,000 and sequence length = 100. Code to generate it here.
Imagine it as such: each index on the y-axis represents a word, and everything corresponding to that index on the x-axis is its positional encoding.
Encoder
It consists of multiple encoder blocks, and each encoder block consists of the following parts:
             Multi-head Attention
             Residual connection
             Layer Normalization
             Feed Forward network
Residual connection
We have already talked about Multi-head attention in great detail, so let's talk about the remaining three.
Residual connections, also known as skip connections, work as the name implies: they take the input, skip it over a block, and carry it to the next block, where it is added to that block's output.
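In code a residual connection is just an addition around a sub-layer; a minimal sketch (the names are illustrative):

import torch
import torch.nn as nn

sublayer = nn.Linear(512, 512)      # stand-in for attention or the feed-forward network
x = torch.randn(2, 10, 512)         # (batch, seq_len, d_model)

output = x + sublayer(x)            # the input "skips over" the block and is added back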
Layer Normalization
Layer normalization was a development that came after batch normalization. Before we talk about either of these, we have to understand what normalization is.
Normalization is a method to bring different features onto the same scale. This is done to stabilize training, because when models try to learn from features with drastically different scales, training can slow down and gradients can explode. (Read more here.)
Batch normalization is the method where each feature is normalized using the mean and standard deviation computed over the entire batch.
In layer normalization, instead of looking across the entire batch, we normalize across all the features of a single instance.
Think of it like this: we take each word from a sentence and normalize that word's features.
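A small sketch of the difference, using PyTorch's built-in modules (the tensor shapes are illustrative):

import torch
import torch.nn as nn

x = torch.randn(32, 10, 512)        # (batch, seq_len, d_model)

layer_norm = nn.LayerNorm(512)      # normalizes each word vector over its 512 features
batch_norm = nn.BatchNorm1d(512)    # normalizes each feature over the whole batch

out_ln = layer_norm(x)                                   # works directly on (batch, seq, features)
out_bn = batch_norm(x.transpose(1, 2)).transpose(1, 2)   # BatchNorm1d expects (batch, features, seq)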
The Feed-Forward Network (FFN) is added to introduce non-linearity and complexity to the model. While the attention mechanism is great at capturing relationships between different positions in the sequence, it is inherently still a linear operation (as mentioned earlier).
       The FFN adds non-linearity through its activation functions (typically ReLU), allowing the
       model to learn more complex patterns and transformations that pure attention alone cannot
       capture.
       Imagine it this way: if the attention mechanism is like having a conversation where everyone
       can talk to everyone else (global interaction), the FFN is like giving each person time to think
       deeply about what they’ve heard and process it independently (local processing). Both are
       necessary for effective understanding and transformation of the input. Without the FFN,
       transformers would be severely limited in their ability to learn complex functions and would
       essentially be restricted to weighted averaging operations through attention mechanisms
       alone.
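Concretely, the position-wise FFN from the paper is just two linear layers with a ReLU in between, applied to every position independently:

\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2

where x has dimension d_model = 512 and the inner layer has dimension d_ff = 2048.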
Decoder Block
The output of the encoder is fed to each decoder block, where it is used as the Key and Value matrices. The decoder block is auto-regressive, meaning it outputs one token after the other and takes its own output as an input.
   1. The decoder block takes the Keys and Values from the encoder and creates its own queries from the previous output.
   2. Using the output of step one, it moves to step 2, where the output from the previous decoder block is taken as the query, and the key and value are taken from the encoder.
   3. This repeats till we get an output from the decoder, which it takes as the input for creating the next token.
   4. This repeats till we reach the end-of-sentence token.
There is also a slight variation in the decoder block: in it we apply a mask to let the self-attention mechanism attend only to earlier positions in the output sequence, as sketched below.
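A minimal sketch of what such a mask looks like (this mirrors the create_future_mask utility we will write later):

import torch

size = 4  # target sequence length
future_mask = torch.triu(torch.ones(size, size), diagonal=1) == 0
print(future_mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
# Each row may only attend to positions at or before itself.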
That is all the high-level understanding you need to be able to write a transformer of your own. Now let us look at the paper as well as the code.
The decoder gives out a vector of numbers (floating point, generally), which is sent to a linear layer.
The linear layer outputs a score for each word in the vocabulary (the set of unique words in the training dataset).
These scores are then sent to the softmax layer, which converts them into probabilities, and the word with the highest probability is given out. (This is usually the case; sometimes we can set it so that we get the 2nd most probable word, or the 3rd, and so on.)
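A small sketch of that final step, with made-up sizes for the vocabulary and the decoder output:

import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000             # illustrative sizes
decoder_output = torch.randn(1, 1, d_model)  # (batch, seq_len, d_model) for the latest position

linear = nn.Linear(d_model, vocab_size)      # one score (logit) per word in the vocabulary
logits = linear(decoder_output)
probs = torch.softmax(logits, dim=-1)        # scores -> probabilities
next_word_id = probs.argmax(dim=-1)          # greedily pick the most probable word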
This section brings you up to speed about what the paper is about and why it was written in the first place.
There are some concepts that can help you learn new things: RNNs, convolutional neural networks, and BLEU.
It is also important to know that transformers were originally created for text-to-text translation, i.e. from one language to another.
Hence they have an encoder section and a decoder section. They pass information between each other, and that is known as cross-attention (more on the difference between self-attention and cross-attention later).
Background
This section usually talks about the work done previously in the field, known issues, and what people have used to fix them. One very important thing for us to keep in mind is "keeping track of distant information". Transformers are amazing for a multitude of reasons, but one key one is that they can remember distant relations.
Solutions like RNNs and LSTMs lose the contextual meaning as the sentence gets longer, but transformers do not run into that problem. (One problem, though, hopefully nonexistent by the time you read this, is the context window length, which fixes how much information the transformer can see.)
Model Architecture
The section all of us have been waiting for. I will divert a bit from the paper here, because I find it easier to follow the data. Still, if you read the paper, each word of it should make sense to you.
We will first start with the Multi-Head Attention, then the feed-forward network, followed by the positional encoding. Using these we will finish the Encoder Layer; subsequently we will move to the Decoder Layer, after which we will write the Encoder & Decoder blocks, and finally end with writing the training loop for an entire Transformer on real-world data.
[Figure: annotated encoder layer. The attention mechanism is where the magic happens; residual connections make it easier for the network to preserve important information from earlier layers; and the feed-forward block is a network with a sequence of linear layers, ReLU activation and dropout.]
Necessary imports
import math
import torch
import torch.nn as nn
from torch.nn.functional import softmax
Multi-Head Attention
By now you should have a good grasp of how attention works, so let us first start with coding the scaled dot-product attention (as MHA is basically multiple scaled dot-product attentions stacked together). The reference section is 3.2.1 Scaled Dot-Product Attention.
def scaled_dot_product_attention(query, key, value, mask=None):
    """
    Args:
        query: (batch_size, num_heads, seq_len_q, d_k)
        key: (batch_size, num_heads, seq_len_k, d_k)
        value: (batch_size, num_heads, seq_len_v, d_v)
        mask: Optional mask to prevent attention to certain positions
    """
    # add -inf where a mask is given; this is used for the decoder layer.
    #YOUR CODE HERE
             Tensor size
             Matrix multiplication
             Masked fill
# my implementation
def scaled_dot_product_attention(query, key, value, mask=None):
    """
    Args:
        query: (batch_size, num_heads, seq_len_q, d_k)
        key: (batch_size, num_heads, seq_len_k, d_k)
        value: (batch_size, num_heads, seq_len_v, d_v)
        mask: Optional mask to prevent attention to certain positions
    """
    # Shape checks
    assert query.dim() == 4, f"Query should be 4-dim but got {query.dim()}-dim"
    assert key.size(-1) == query.size(-1), "Key and query depth must be equal"
    assert key.size(-2) == value.size(-2), "Key and value sequence lengths must be equal"

    d_k = query.size(-1)

    # Attention scores
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    # Mask out disallowed positions (used in the decoder)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Softmax over the key dimension, then weight the values
    attention_weights = softmax(scores, dim=-1)
    return torch.matmul(attention_weights, value)
class MultiHeadAttention(nn.Module):
    # Let me write the initializer just for this class, so you get an idea of how to do it
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # Note: use integer division //

        #YOUR CODE HERE (the projection layers)

    @staticmethod
    def scaled_dot_product_attention(query, key, value, mask=None):
        #YOUR IMPLEMENTATION HERE

    def forward(self, query, key, value, mask=None):
        # 1. Linear projections
        #YOUR CODE HERE

        # 2. Split into heads
        #YOUR CODE HERE

        # 3. Apply attention
        #YOUR CODE HERE

        # 4. Concatenate heads
        #YOUR CODE HERE

        # 5. Final projection
        #YOUR CODE HERE
      I had a hard time understanding the difference between view and transpose. These two links should help you out: When to use view, transpose & permute and Difference between view & transpose.
      Contiguous and view still eluded me, till I read these: PyTorch Internals and Contiguous & Non-Contiguous Tensor.
      Linear
      I also have a post talking about how the internal memory management of tensors works; read it if you are interested.
#my implementation
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # Note: use integer division //

        # Projection matrices, trained along with the model
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    @staticmethod
    def scaled_dot_product_attention(query, key, value, mask=None):
        """
        Args:
            query: (batch_size, num_heads, seq_len_q, d_k)
            key: (batch_size, num_heads, seq_len_k, d_k)
            value: (batch_size, num_heads, seq_len_v, d_v)
            mask: Optional mask to prevent attention to certain positions
        """
        d_k = query.size(-1)

        # Attention scores
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attention_weights = softmax(scores, dim=-1)
        return torch.matmul(attention_weights, value)

    def forward(self, query, key, value, mask=None):
        batch_size, seq_len, _ = query.size()

        # 1. Linear projections
        Q = self.W_q(query)  # (batch_size, seq_len, d_model)
        K = self.W_k(key)
        V = self.W_v(value)

        # 2. Split into heads: (batch_size, num_heads, seq_len, d_k)
        Q = Q.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # 3. Apply attention
        output = self.scaled_dot_product_attention(Q, K, V, mask)

        # 4. Concatenate heads
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)

        # 5. Final projection
        return self.W_o(output)
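A quick sanity check of the shapes (my own, not from the post):

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)           # (batch, seq_len, d_model)
out = mha(x, x, x)                    # self-attention: query, key and value are all x
print(out.shape)                      # torch.Size([2, 10, 512])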
The paper also notes: "Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is d_model = 512, and the inner-layer has dimensionality d_ff = 2048." In other words, because the feed-forward network is applied to each position independently, its two linear layers are equivalent to 1×1 convolutions along the sequence.
Section 3.3
class FeedForwardNetwork(nn.Module):
    """Position-wise Feed-Forward Network

    Args:
        d_model: input/output dimension
        d_ff: hidden dimension
        dropout: dropout rate (default=0.1)
    """
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        #create a sequential ff model as mentioned in section 3.3
        #YOUR CODE HERE
             Dropout
             Where to put Dropout
             ReLU
#my implementation
class FeedForwardNetwork(nn.Module):
    """Position-wise Feed-Forward Network

    Args:
        d_model: input/output dimension
        d_ff: hidden dimension
        dropout: dropout rate (default=0.1)
    """
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.model(x)
Positional Encoding
Section 3.5
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length=5000):
        super().__init__()
        # Register buffer
        #YOUR CODE HERE

#my implementation
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length=5000):
        super().__init__()
        position = torch.arange(max_seq_length).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_seq_length, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # Register buffer
        self.register_buffer('pe', pe.unsqueeze(0))  # Shape: (1, max_seq_length, d_model)

    def forward(self, x):
        # Add the positional encoding for the first seq_len positions
        return x + self.pe[:, :x.size(1)]
Encoder Layer

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # 1. Multi-head attention
        #YOUR CODE HERE

        # 2. Layer normalization
        #YOUR CODE HERE

        # 3. Feed forward
        #YOUR CODE HERE

        # 4. Second layer normalization
        #YOUR CODE HERE

        # 5. Dropout
        #YOUR CODE HERE
#my implementation
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # 1. Multi-head attention
        self.mha = MultiHeadAttention(d_model, num_heads)
        # 2. Layer normalization
        self.layer_norm_1 = nn.LayerNorm(d_model)
        # 3. Feed forward
        self.ff = FeedForwardNetwork(d_model, d_ff)
        # 4. Second layer normalization
        self.layer_norm_2 = nn.LayerNorm(d_model)
        # 5. Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Attention sub-layer with residual connection and layer norm
        attn_output = self.mha(x, x, x, mask)
        x = self.dropout(x + attn_output)
        x = self.layer_norm_1(x)

        # Feed-forward sub-layer with residual connection and layer norm
        ff_output = self.ff(x)
        x = self.dropout(x + ff_output)  # Apply dropout after residual
        x = self.layer_norm_2(x)
        return x
Decoder Layer

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # 1-6. Masked self-attention, cross-attention, feed forward and their layer norms
        #YOUR CODE HERE

        # 7. Dropout
        #YOUR CODE HERE
#my implementation
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_mha = MultiHeadAttention(d_model, num_heads)   # masked self-attention
        self.cross_mha = MultiHeadAttention(d_model, num_heads)  # attention over the encoder output
        self.ff = FeedForwardNetwork(d_model, d_ff)
        self.layer_norm_1 = nn.LayerNorm(d_model)
        self.layer_norm_2 = nn.LayerNorm(d_model)
        self.layer_norm_3 = nn.LayerNorm(d_model)
        # 7. Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
        # Masked self-attention with residual connection and layer norm
        self_attn_output = self.self_mha(x, x, x, tgt_mask)
        x = self.layer_norm_1(self.dropout(x + self_attn_output))

        # Cross-attention: queries from the decoder, keys/values from the encoder
        cross_attn_output = self.cross_mha(x, encoder_output, encoder_output, src_mask)
        x = self.layer_norm_2(self.dropout(x + cross_attn_output))

        # Feed forward with residual connection and layer norm
        ff_output = self.ff(x)
        x = self.dropout(x + ff_output)
        x = self.layer_norm_3(x)
        return x
Encoder
class Encoder(nn.Module):
    def __init__(self,
                 vocab_size,
                 d_model,
                 num_layers=6,
                 num_heads=8,
                 d_ff=2048,
                 dropout=0.1,
                 max_seq_length=5000):
        super().__init__()
        # 1. Input embedding
        #YOUR CODE HERE

        # 2. Positional encoding
        #YOUR CODE HERE

        # 3. Dropout
        #YOUR CODE HERE

        # 4. Stack of encoder layers
        #YOUR CODE HERE
#my implementation
class Encoder(nn.Module):
    def __init__(self,
                 vocab_size,
                 d_model,
                 num_layers=6,
                 num_heads=8,
                 d_ff=2048,
                 dropout=0.1,
                 max_seq_length=5000):
        super().__init__()
        # 1. Input embedding
        self.embeddings = nn.Embedding(vocab_size, d_model)
        self.scale = math.sqrt(d_model)

        # 2. Positional encoding
        self.pe = PositionalEncoding(d_model, max_seq_length)

        # 3. Dropout
        self.dropout = nn.Dropout(dropout)

        # 4. Stack of encoder layers
        self.layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)
        ])

    def forward(self, x, mask=None):
        # Embed, scale, add positional encoding and apply dropout
        x = self.dropout(self.pe(self.embeddings(x) * self.scale))
        # Pass through each encoder layer
        for layer in self.layers:
            x = layer(x, mask)
        return x
Decoder
class Decoder(nn.Module):
    def __init__(self,
                 vocab_size,
                 d_model,
                 num_layers=6,
                 num_heads=8,
                 d_ff=2048,
                 dropout=0.1,
                 max_seq_length=5000):
        super().__init__()
        # 1. Output embedding
        #YOUR CODE HERE

        # 2. Positional encoding
        #YOUR CODE HERE

        # 3. Dropout
        #YOUR CODE HERE

        # 4. Stack of decoder layers
        #YOUR CODE HERE
#my implementation
class Decoder(nn.Module):
    def __init__(self,
                 vocab_size,
                 d_model,
                 num_layers=6,
                 num_heads=8,
                 d_ff=2048,
                 dropout=0.1,
                 max_seq_length=5000):
        super().__init__()
        # 1. Output embedding
        self.embeddings = nn.Embedding(vocab_size, d_model)
        self.scale = math.sqrt(d_model)

        # 2. Positional encoding
        self.pe = PositionalEncoding(d_model, max_seq_length)

        # 3. Dropout
        self.dropout = nn.Dropout(dropout)

        # 4. Stack of decoder layers
        self.layers = nn.ModuleList([
            DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)
        ])

    def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
        """
        Returns:
            decoder_output: (batch_size, target_seq_len, d_model)
        """
        # 1. Pass through embedding layer and scale
        x = self.embeddings(x) * self.scale
        # 2. Add positional encoding and apply dropout
        x = self.dropout(self.pe(x))
        # 3. Pass through each decoder layer
        for layer in self.layers:
            x = layer(x, encoder_output, src_mask, tgt_mask)
        return x
Utility Code
def create_padding_mask(seq):
    """
    Create mask for padding tokens (0s)
    Args:
        seq: Input sequence tensor (batch_size, seq_len)
    Returns:
        mask: Padding mask (batch_size, 1, 1, seq_len)
    """
    #YOUR CODE HERE

def create_future_mask(size):
    """
    Create mask to prevent attention to future positions
    Args:
        size: Size of square mask (target_seq_len)
    Returns:
        mask: Future mask (1, 1, size, size)
    """
    # Create upper triangular matrix and invert it
    #YOUR CODE HERE
def create_masks(src, tgt):
    """
    Create all masks needed for training
    Args:
        src: Source sequence (batch_size, src_len)
        tgt: Target sequence (batch_size, tgt_len)
    Returns:
        src_mask: Padding mask for encoder
        tgt_mask: Combined padding and future mask for decoder
    """
    # 1. Create padding masks
    #YOUR CODE HERE

    # 2. Create and combine the future mask for the target
    #YOUR CODE HERE
def create_padding_mask(seq):
    """
    Create mask for padding tokens (0s)
    Args:
        seq: Input sequence tensor (batch_size, seq_len)
    Returns:
        mask: Padding mask (batch_size, 1, 1, seq_len), True for real tokens
    """
    batch_size, seq_len = seq.shape
    output = (seq != 0)  # True for real tokens, False for padding
    return output.view(batch_size, 1, 1, seq_len)

def create_future_mask(size):
    """
    Create mask to prevent attention to future positions
    Args:
        size: Size of square mask (target_seq_len)
    Returns:
        mask: Future mask (1, 1, size, size)
    """
    # Create upper triangular matrix and invert it
    mask = torch.triu(torch.ones((1, 1, size, size)), diagonal=1) == 0
    return mask
                   """
                   Create all masks needed for training
                   Args:
                         src: Source sequence (batch_size, src_len)
                         tgt: Target sequence (batch_size, tgt_len)
                   Returns:
                       src_mask: Padding mask for encoder
                         tgt_mask: Combined padding and future mask for decoder
                   """
                   # 1. Create padding masks
                   src_padding_mask = create_padding_mask(src)
                   tgt_padding_mask = create_padding_mask(tgt)
Transformer
class Transformer(nn.Module):
    def __init__(self,
                 src_vocab_size,
                 tgt_vocab_size,
                 d_model,
                 num_layers=6,
                 num_heads=8,
                 d_ff=2048,
                 dropout=0.1,
                 max_seq_length=5000):
        super().__init__()
        #YOUR CODE HERE
#my implementation
class Transformer(nn.Module):
    def __init__(self,
                 src_vocab_size,
                 tgt_vocab_size,
                 d_model,
                 num_layers=6,
                 num_heads=8,
                 d_ff=2048,
                 dropout=0.1,
                 max_seq_length=5000):
        super().__init__()
        self.encoder = Encoder(
            src_vocab_size,
            d_model,
            num_layers,
            num_heads,
            d_ff,
            dropout,
            max_seq_length
        )
        self.decoder = Decoder(
            tgt_vocab_size,
            d_model,
            num_layers,
            num_heads,
            d_ff,
            dropout,
            max_seq_length
        )
        # Final projection from d_model to the target vocabulary
        self.final_layer = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        encoder_output = self.encoder(src, src_mask)
        decoder_output = self.decoder(tgt, encoder_output, src_mask, tgt_mask)
        return self.final_layer(decoder_output)
class TransformerLRScheduler:
    def __init__(self, optimizer, d_model, warmup_steps):
        """
        Args:
            optimizer: Optimizer to adjust learning rate for
            d_model: Model dimensionality
            warmup_steps: Number of warmup steps
        """
        self.optimizer = optimizer
        self.d_model = d_model
        self.warmup_steps = warmup_steps
class LabelSmoothing(nn.Module):
    def __init__(self, smoothing=0.1):
        super().__init__()
        self.smoothing = smoothing
        self.confidence = 1.0 - smoothing
class TransformerLRScheduler:
    def __init__(self, optimizer, d_model, warmup_steps):
        """
        Args:
            optimizer: Optimizer to adjust learning rate for
            d_model: Model dimensionality
            warmup_steps: Number of warmup steps
        """
        # lrate = d_model^(-0.5) * min(step_num^(-0.5), step_num * warmup_steps^(-1.5))
        self.optimizer = optimizer
        self.d_model = d_model
        self.warmup_steps = warmup_steps
        self.step_num = 0

    def step(self):
        # Update the learning rate of every parameter group following the formula above
        self.step_num += 1
        lrate = self.d_model ** (-0.5) * min(self.step_num ** (-0.5),
                                             self.step_num * self.warmup_steps ** (-1.5))
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lrate
class LabelSmoothing(nn.Module):
    def __init__(self, smoothing=0.1):
        super().__init__()
        self.smoothing = smoothing
        self.confidence = 1.0 - smoothing
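The LabelSmoothing class above stops at the constructor. Here is one hedged sketch of how a forward pass could be completed, using a KL-divergence loss in the spirit of The Annotated Transformer; it is named LabelSmoothingLoss to avoid clashing with the class above, and the padding index of 2 for "<blank>" is an assumption based on the vocabulary built later:

class LabelSmoothingLoss(nn.Module):
    """Sketch: spread `smoothing` mass over non-target words, keep `confidence` on the target."""
    def __init__(self, vocab_size, padding_idx=2, smoothing=0.1):
        super().__init__()
        self.criterion = nn.KLDivLoss(reduction='sum')
        self.vocab_size = vocab_size
        self.padding_idx = padding_idx
        self.smoothing = smoothing
        self.confidence = 1.0 - smoothing

    def forward(self, logits, target):
        # logits: (N, vocab_size), target: (N,)
        log_probs = torch.log_softmax(logits, dim=-1)
        true_dist = torch.full_like(log_probs, self.smoothing / (self.vocab_size - 2))
        true_dist.scatter_(1, target.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        mask = target == self.padding_idx        # do not compute loss for padding targets
        true_dist[mask] = 0
        return self.criterion(log_probs, true_dist)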
Training transformers

def train_transformer(model, train_dataloader, criterion, optimizer, scheduler, num_epochs, device):
    """
    Args:
        model: Transformer model
        train_dataloader: DataLoader for training data
        criterion: Loss function (with label smoothing)
        optimizer: Optimizer
        scheduler: Learning rate scheduler
        num_epochs: Number of training epochs
    """
    # 1. Setup
    #YOUR CODE HERE

    # 2. Training loop
    #YOUR CODE HERE

#my implementation
def train_transformer(model, train_dataloader, criterion, optimizer, scheduler, num_epochs, device):
    """
    Args:
        model: Transformer model
        train_dataloader: DataLoader for training data
        criterion: Loss function (with label smoothing)
        optimizer: Optimizer
        scheduler: Learning rate scheduler
        num_epochs: Number of training epochs
    """
    # 1. Setup
    model = model.to(device)
    model.train()
    all_losses = []

    # 2. Training loop
    for epoch in range(num_epochs):
        print(f"Epoch {epoch + 1}/{num_epochs}")
        epoch_loss = 0

        for batch in train_dataloader:
            src = batch['src'].to(device)
            tgt = batch['tgt'].to(device)

            # Shift the target: the decoder sees everything except the last token,
            # and is trained to predict everything except the first token
            tgt_input = tgt[:, :-1]
            tgt_output = tgt[:, 1:]

            # Create masks
            src_mask, tgt_mask = create_masks(src, tgt_input)

            # Zero gradients
            optimizer.zero_grad()

            # Forward pass
            outputs = model(src, tgt_input, src_mask, tgt_mask)

            # Calculate loss
            loss = criterion(outputs, tgt_output)

            # Backward pass
            loss.backward()

            # Clip gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

            # Update weights
            optimizer.step()
            scheduler.step()

            epoch_loss += loss.item()

        avg_epoch_loss = epoch_loss / len(train_dataloader)
        all_losses.append(avg_epoch_loss)
        print(f"Average loss: {avg_epoch_loss:.4f}")

        # Save checkpoint
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': avg_epoch_loss,
        }, f'checkpoint_epoch_{epoch+1}.pt')

    return all_losses
import os
import torch
import spacy
import urllib.request
import zipfile
from torch.utils.data import Dataset, DataLoader

def download_multi30k():
    """Download Multi30k dataset if not present"""
    # Create data directory
    if not os.path.exists('data'):
        os.makedirs('data')

    base_url = "https://raw.githubusercontent.com/multi30k/dataset/master/"
    files = {
        "train.de": "train.de.gz",
        "train.en": "train.en.gz",
        "val.de": "val.de.gz",
        "val.en": "val.en.gz",
        "test.de": "test_2016_flickr.de.gz",
        "test.en": "test_2016_flickr.en.gz"
    }
def load_data(filename):
    """Load data from file"""
    with open(filename, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

def create_dataset():
    """Create dataset from files"""
    # Download data if needed
    download_multi30k()

    # Load data
    train_de = load_data('data/train.de')
    train_en = load_data('data/train.en')
    val_de = load_data('data/val.de')
    val_en = load_data('data/val.en')

    return (train_de, train_en), (val_de, val_en)
class TranslationDataset(Dataset):
    def __init__(self, src_texts, tgt_texts, src_vocab, tgt_vocab, src_tokenizer, tgt_tokenizer):
        self.src_texts = src_texts
        self.tgt_texts = tgt_texts
        self.src_vocab = src_vocab
        self.tgt_vocab = tgt_vocab
        self.src_tokenizer = src_tokenizer
        self.tgt_tokenizer = tgt_tokenizer

    def __len__(self):
        return len(self.src_texts)

    def __getitem__(self, idx):
        src_text = self.src_texts[idx]
        tgt_text = self.tgt_texts[idx]

        # Tokenize
        src_tokens = [tok.text for tok in self.src_tokenizer(src_text)]
        tgt_tokens = [tok.text for tok in self.tgt_tokenizer(tgt_text)]

        # Convert to indices, adding start/end tokens and falling back to <unk>
        src_indices = [self.src_vocab["<s>"]] + [self.src_vocab.get(token, self.src_vocab["<unk>"]) for token in src_tokens] + [self.src_vocab["</s>"]]
        tgt_indices = [self.tgt_vocab["<s>"]] + [self.tgt_vocab.get(token, self.tgt_vocab["<unk>"]) for token in tgt_tokens] + [self.tgt_vocab["</s>"]]

        return {
            'src': torch.tensor(src_indices),
            'tgt': torch.tensor(tgt_indices)
        }
from collections import Counter

def build_vocab_from_texts(texts, tokenizer, min_freq=2):
    """Build vocabulary from raw texts"""
    counter = Counter()
    for text in texts:
        counter.update(tok.text for tok in tokenizer(text))

    # Create vocabulary
    vocab = {"<s>": 0, "</s>": 1, "<blank>": 2, "<unk>": 3}
    idx = 4
    for word, freq in counter.items():
        if freq >= min_freq:
            vocab[word] = idx
            idx += 1
    return vocab
def create_dataloaders(batch_size=32):
    # Load tokenizers
    spacy_de = spacy.load("de_core_news_sm")
    spacy_en = spacy.load("en_core_web_sm")

    # Get data
    (train_de, train_en), (val_de, val_en) = create_dataset()
    # Build vocabularies
    vocab_src = build_vocab_from_texts(train_de, spacy_de)
    vocab_tgt = build_vocab_from_texts(train_en, spacy_en)

    # Create datasets
    train_dataset = TranslationDataset(
        train_de, train_en,
        vocab_src, vocab_tgt,
        spacy_de, spacy_en
    )
    val_dataset = TranslationDataset(
        val_de, val_en,
        vocab_src, vocab_tgt,
        spacy_de, spacy_en
    )

    # Create dataloaders
    train_dataloader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        collate_fn=collate_batch
    )
    val_dataloader = DataLoader(
        val_dataset,
        batch_size=batch_size,
        shuffle=False,
        collate_fn=collate_batch
    )

    return train_dataloader, val_dataloader, vocab_src, vocab_tgt
def collate_batch(batch):
    src_tensors = [item['src'] for item in batch]
    tgt_tensors = [item['tgt'] for item in batch]

    # Pad sequences
    src_padded = torch.nn.utils.rnn.pad_sequence(src_tensors, batch_first=True)
    tgt_padded = torch.nn.utils.rnn.pad_sequence(tgt_tensors, batch_first=True)

    return {
        'src': src_padded,
        'tgt': tgt_padded
    }
Starting the training loop and Some Analysis (with tips for good convergence)
Congratulations for completing this tutorial/lesson/blog, however you see it. It is in the nature of human curiosity that you must have a few questions now. Feel free to create issues on GitHub for those questions, and I will add any questions that I feel most beginners would have here.
Cheers, Pramod
P.S. All the code as well as assets can be accessed from my GitHub and are free to use and distribute; consider citing this work though :)
P.P.S. I know there is a bit of an issue with a few code samples; I will fix them within the week. This has been a work in progress for almost 3 months now, so I thought it's better to publish something which is 95% done than to keep waiting for the perfect end product.