GenAI Module2
• Recurrent neural networks (RNNs) are state-of-the-art algorithms
for sequential data and are used by Apple's Siri and Google's voice
search.
• What is sequential data?
• When the points in a dataset depend on other points in the dataset,
the data is said to be sequential.
• Ex: time-series data, stock market price data, words in a sentence, gene
sequence data, etc.
• Why can't an ANN be used for sequential data?
• It does not consider the dependencies within sequential data.
• Ex: Given time-series data, develop a DNN to predict the outlook of a
day as sunny/rainy/windy.
• The traditional NN makes the prediction for each observation independent of the other observations.
• This violates the fact that weather on a particular day is strongly correlated with the weather of the previous
day and the following day.
• a traditional neural network assumes the data is non-sequential, and that each data point is independent of
other data points.
• Hence, the inputs are analyzed in isolation, which can cause problems in case there are dependencies in the
data.
• In traditional neural networks, all the inputs and outputs are independent of each other. But when the task is to
predict the next word of a sentence, the previous words are required, and hence there is a need to
remember them.
• An RNN is a type of neural network where the output from the previous step is fed
as input to the current step.
• The most important feature of an RNN is its hidden state, which remembers some information about a
sequence.
• The RNN thus has a “memory” that retains information about what has been calculated in the
previous steps.
• It uses the same parameters for each input, as it performs the same task on all the inputs or
hidden layers to produce the output.
• This reduces the number of parameters, unlike other neural networks.
Some Applications of RNN
Why not ANN?
• 1. An issue with using an ANN for language translation is that we cannot
fix the number of neurons in a layer; it depends on the number of words in the
input sentence.
Why not ANN?
2. Too many computations.
• Input words have to be converted to vectors (e.g., one-hot encoding or word2vec embeddings).
• Hence a correspondingly large number of neurons and parameters have to be learnt by the model.
Why not ANN?
• 3. Doesn’t preserve the sequence relationship in the input data
• a traditional neural network assumes the data is non-sequential, and that each data point is independent of
other data points.
• Hence, the inputs are analyzed in isolation, which can cause problems in case there are dependencies in the
data.
• Since each hidden layer has its own weights, bias and activations, they behave independently.
• When the input is a sequence data, the model should be also able to identify the relationship between
successive inputs
• If the task is to predict the next word in a sentence using an MLP,
this will not help: all hidden layers, with their different
weights and biases, work independently.
• To make the hidden layers preserve the sequence
relationship in the input, all hidden layers have to be
combined.
• To combine them use same weights and activation
functions
All these hidden layers can be rolled in
together in a single recurrent layer
How does an RNN work?
• Neurons in recurrent layer are called recurrent neurons
• At all the time steps weights of the recurrent neurons would be the same
• So a recurrent neuron stores the state of a previous input and combines with the
current input thereby preserving some relationship of the current input with the
previous input.
• An RNN converts the independent activations into dependent activations by
providing the same weights and biases to all the layers, thus reducing the
number of parameters, and it memorizes each previous output by
giving it as input to the next hidden layer.
• Entire RNN computation involves – computations to update the cell
state at that time step and computations to predict the output at that
time step.
• During forward pass, we calculate the outputs at each time step, to
calculate the individual loss at each time step.
• The individual losses are combined to form the total loss.
• This total loss is used to train the neural network
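A minimal NumPy sketch (not the lecture's exact model) of this forward pass: the same weights are reused at every time step, the hidden state carries information forward, and the per-step losses are summed into the total loss used for training. All names and sizes are illustrative.

```python
# Minimal RNN forward pass: shared weights across time steps, summed per-step losses.
import numpy as np

def rnn_forward(xs, ys, Wxh, Whh, Why, bh, by):
    """xs: list of input vectors, ys: list of integer target classes (one per time step)."""
    h = np.zeros(Whh.shape[0])                    # initial hidden state (the "memory")
    total_loss = 0.0
    for x_t, y_t in zip(xs, ys):
        # the same Wxh, Whh, Why are reused at every time step
        h = np.tanh(Wxh @ x_t + Whh @ h + bh)     # update hidden state using current input + previous state
        logits = Why @ h + by                     # output at this time step
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                      # softmax over output classes
        total_loss += -np.log(probs[y_t])         # individual loss at this time step
    return total_loss                             # combined loss used to train the network

# Example usage with random data: a sequence of 5 inputs of size 3, 4 output classes
rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(5)]
ys = [int(rng.integers(0, 4)) for _ in range(5)]
Wxh, Whh, Why = rng.normal(size=(8, 3)), rng.normal(size=(8, 8)), rng.normal(size=(4, 8))
bh, by = np.zeros(8), np.zeros(4)
print(rnn_forward(xs, ys, Wxh, Whh, Why, bh, by))
```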
Desirable Characteristics of RNN for
Sequence Modeling
One to Many
• One to Many is a kind of RNN architecture is
applied in situations that give multiple output for
a single input.
• These “gates” control which information from the “distant past” should be
passed through the network to update the current cell state.
• The most commonly used variants of RNN that are capable of remembering
long-term dependencies using “gated cells” are the
LSTM (Long Short-Term Memory) and the GRU (Gated Recurrent Unit).
Forget Gate:
• This gate decides what information should be thrown away or kept.
• Information from the previous hidden state and information from the
current input is passed through the sigmoid function.
• Values come out between 0 and 1.
• The closer to 0 means to forget, and the closer to 1 means to keep.
Input gate:
• The input gate has 2 layers:
• A “tanh layer” generates a vector of new candidate information that could be
written to the cell state.
• A “sigmoid layer” decides which information should be kept from the
output of the tanh layer.
Cell State :
• Now we should have enough information to calculate the cell state.
• First, the cell state gets pointwise multiplied by the forget vector.
• This has a possibility of dropping values in the cell state if it gets
multiplied by values near 0.
• Then we take the output from the input gate and do a pointwise
addition which updates the cell state to new values that the neural
network finds relevant.
• That gives us our new cell state.
To summarize the LSTM:
• The “forget gate”, with a sigmoid function, decides which information is to
be forgotten from previous steps.
• The “input gate”, with a tanh and a sigmoid function, decides
what new information is added to the cell state.
• The cell state is updated using the outputs from the
previous two gates.
• The “output gate”, with its sigmoid and tanh layers, decides which parts of the cell state are to
be output to the hidden state.
• To review,
• the Forget gate decides what is relevant to keep from prior steps.
• The input gate decides what information is relevant to add from the
current step.
• The output gate determines what the next hidden state should be.
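For reference, the standard LSTM update equations corresponding to the gates above (x_t is the current input, h_{t-1} the previous hidden state, c_{t-1} the previous cell state, σ the sigmoid, ⊙ element-wise multiplication; the W and b are learned parameters):

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate, sigmoid layer)}\\
\tilde{c}_t &= \tanh\!\left(W_c [h_{t-1}, x_t] + b_c\right) && \text{(candidate values, tanh layer)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)}\\
o_t &= \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(new hidden state)}
\end{aligned}
```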
1
Building blocks of LLMs
• Transformers :
• transformers are a type of machine learning model utilizing attention as the
primary learning mechanism. [1] “Attention is all you need”
https://dl.acm.org/doi/10.5555/3295222.3295349
2
• “An Image is Worth 16x16 Words” modified the transformer put forth
in [1] to solve image classification tasks, creating
the Vision Transformer (ViT)
• https://www.semanticscholar.org/reader/268d347e8a55b5eb82fb5e7d2f800e33c75ab18a
• https://medium.com/@weiwen21/an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale-957f88e53726
3
1. TOKENISATION
4
Tokenisation
• This initial step breaks down input data into smaller units or tokens.
• For text, tokens could be words, subwords, or characters.
• In image processing, tokens are pixel groups,
• in video processing, tokens represent frames or segments.
• Sentence : "I love apples."
• It breaks down into three tokens: "I," "love," and "apples."
• Tokenization is a crucial process because it converts raw data into a format that
can be processed by the model.
• Different tokenization methods exist for various data types. For instance, Byte-
Pair Encoding (BPE) is commonly used for text, while Vision Transformers (ViT)
use specific methods for patch tokenization.
5
Tokenisation methods for text
• Word Tokenization:
• breaks text into individual words based on a delimiter.
• can struggle with a large vocabulary size.
• Character Tokenization:
• breaks text down into individual characters.
• it drastically reduces the vocabulary size, but fails to capture the semantic meaning of longer
word sequences.
• Subword Tokenization:
• strikes a balance between word and character tokenization.
• It breaks text down into subwords, which are larger than individual characters but smaller
than whole words.
• It keeps frequently used words in their whole form but breaks rare words down into more
meaningful subwords.
6
Tokenisation…
• Byte-Pair Encoding (BPE) is a subword tokenization algorithm
https://medium.com/@hsinhungw/understanding-byte-pair-encoding-
fd196ebfe93f
• It is a compression algorithm used in Natural Language Processing (NLP) to
represent a large vocabulary with a small set of subword units.
• widely used in various NLP tasks such as machine translation, text classification,
and text generation.
• It iteratively merges the most frequent pair of consecutive bytes or characters in
a text corpus until a predefined vocabulary size is reached.
• The resulting subword units can be used to represent the original text in a more
compact and efficient way.
7
Ex of Byte Pair Encoding
• Given text : “low low low low low lower lower newest newest
newest newest newest newest widest widest widest”
• Get the list of unique words with frequency:
(low_: 5, lower_: 2, newest_: 6, widest_: 3)
• Construct the base vocabulary:
vocabs = (l, o, w, e, r, n, s, t, i, d, _)
• Represent the Words with Base Vocabs
((l, o, w, _): 5, (l, o, w, e, r, _): 2, (n, e, w, e, s, t, _): 6, (w, i, d, e, s, t, _):
3)
8
• Vocabulary Merging
Iteratively merge the most frequent pairs of symbols:
• Merge 1:
• Merge the most frequent pair (e, s), which occurs 6 + 3 = 9 times, to form the newly merged
symbol ‘es’.
• Update the vocabulary and replace every occurrence of (e, s) with ‘es’:
• vocabs = (l, o, w, e, r, n, s, t, i, d, _, es)
• ((l, o, w, _): 5, (l, o, w, e, r, _): 2, (n, e, w, es, t, _): 6, (w, i, d, es, t, _): 3)
• Merge 2:
• Merge the most frequent pair (es, t), which occurs 6 + 3 = 9 times, to form the newly merged
symbol ‘est’.
• Update the vocabulary and replace every occurrence of (es, t) with ‘est’
• vocabs = (l, o, w, e, r, n, s, t, i, d, _, es, est)
• ((l, o, w, _): 5, (l, o, w, e, r, _): 2, (n, e, w, est, _): 6, (w, i, d, est, _): 3)
9
• Merge 3:
• Merge the most frequent pair (est, _), which occurs 6 + 3 = 9 times, to form
the newly merged symbol ‘est_’.
• Update the vocabulary and replace every occurrence of (est, _) with ‘est_’:
• vocabs = (l, o, w, e, r, n, s, t, i, d, _, es, est, est_)
• ((l, o, w, _): 5, (l, o, w, e, r, _): 2, (n, e, w, est_): 6, (w, i, d, est_): 3)
• Merge 4:
• Merge the most frequent pair (l, o), which occurs 5 + 2 = 7 times, to form the
newly merged symbol ‘lo’.
• Update the vocabulary and replace every occurrence of (l, o) with ‘lo’
• vocabs = (l, o, w, e, r, n, s, t, i, d, _, es, est, est_, lo)
• ((lo, w, _): 5, (lo, w, e, r, _): 2, (n, e, w, est_): 6, (w, i, d, est_): 3)
10
• Merge 5:
• Merge the most frequent pair (lo, w), which occurs 5 + 2 = 7 times, to form
the newly merged symbol ‘low’.
• Update the vocabulary and replace every occurrence of (lo, w) with ‘low’
• vocabs = (l, o, w, e, r, n, s, t, i, d, _, es, est, est_, lo, low)
• ((low, _): 5, (low, e, r, _): 2, (n, e, w, est_): 6, (w, i, d, est_): 3)
11
• Using the constructed vocabulary and learned rules tokenise the text
“newest binded lowers”
• Pre-tokenise and append the end of word symbol:
• (newest_, binded_, lowers_)
• Apply Merge Rules
• Break down the pre-tokenised text into characters and apply the merge rules in
the learned order
• ((n, e, w, e, s, t, _), (b, i, n, d, e, d, _), (l, o, w, e, r, s, _))
• Rules : (e, s) → es, (es, t) → est, (est, _) → est_, (l, o) → lo, (lo, w) → low
12
13
• Any token not in the vocabulary will be replaced by an unknown
token “[UNK]”
• vocabs = (l, o, w, e, r, n, s, t, i, d, _, es, est, est_, lo, low)
• ((n, e, w, est_), ([UNK], i, n, d, e, d, _), (low, e, r, s, _))
• Result of tokenisation:
• The new text is tokenized into the following sequence:
• “newest binded lowers” =
• [n, e, w, est_, [UNK], i, n, d, e, d, _, low, e, r, s, _]
14
Steps of BPE
1. Initialize the vocabulary with all the bytes or characters in the text corpus.
2. Repeat the following steps until the desired vocabulary size is reached:
   a. Find the most frequent pair of consecutive bytes or characters in the text
   corpus.
   b. Merge the pair to create a new subword unit.
   c. Update the frequency counts of all the words that contain the
   merged pair.
   d. Add the new subword unit to the vocabulary.
3. Represent the text corpus using the subword units in the vocabulary.
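A minimal Python sketch of this training loop on the toy corpus from the earlier example; the helper names are illustrative, and ties between equally frequent pairs are broken arbitrarily (here by insertion order, which happens to reproduce the merges shown above).

```python
# Minimal BPE vocabulary learning, following the steps above. Words end with "_" as in the slides.
from collections import Counter

def get_pair_counts(words):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of the chosen pair with its merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus from the example: (low_: 5, lower_: 2, newest_: 6, widest_: 3)
words = {tuple("low_"): 5, tuple("lower_"): 2, tuple("newest_"): 6, tuple("widest_"): 3}
merges = []
for _ in range(5):                               # stop after 5 merges (or when vocab size is reached)
    pair_counts = get_pair_counts(words)
    best = max(pair_counts, key=pair_counts.get) # most frequent pair; ties broken by insertion order
    merges.append(best)
    words = merge_pair(words, best)

print(merges)   # expected: ('e','s'), ('es','t'), ('est','_'), ('l','o'), ('lo','w')
print(words)    # e.g. {('low','_'):5, ('low','e','r','_'):2, ('n','e','w','est_'):6, ('w','i','d','est_'):3}
```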
15
Ex of BPE
• Suppose the corpus has 4 words : “ab”, “bc”, “bcd”, and “cde”.
• Initial vocabulary has : {“a”, “b”, “c”, “d”, “e”}.
16
17
18
AutoTokenizer
Automatically chooses the tokeniser for the LLM.
1. Install the “transformers” library.
2. Import AutoTokenizer.
3. The tokenizer function breaks the text into tokens and converts them into
numbers (token IDs).
19
4. Decode the token IDs back into text:
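A minimal sketch of these steps with the Hugging Face transformers library; the checkpoint name below is just an example.

```python
# pip install transformers                       # step 1: install the library
from transformers import AutoTokenizer           # step 2: import AutoTokenizer

# AutoTokenizer picks the tokenizer matching the chosen model (model name is an example)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# step 3: break the text into tokens and convert them into numbers (input IDs)
encoded = tokenizer("I love apples.")
print(encoded["input_ids"])                                   # a list of integer token IDs
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # the corresponding token strings

# step 4: decode the IDs back into text
print(tokenizer.decode(encoded["input_ids"]))
```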
20
Image Tokenisation
• Transformers operate on a sequence of tokens; in NLP, this is
commonly a sentence of words.
• The ViT converts an image to tokens such that each token represents
a local area — or patch — of the image.
21
Image Tokenisation..
• The input image is divided into fixed-size, non-overlapping patches.
• An image of height H, width W, and C channels is split
into N tokens with patch size P:
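The relation implied above (standard for ViT, assuming H and W are divisible by P) is:

```latex
N = \frac{H \times W}{P^{2}}, \qquad \text{each token is a flattened patch vector of length } P^{2} \cdot C
```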
22
Patch Tokenisation : Example
• Ref : https://towardsdatascience.com/vision-transformers-explained-
a9d07147e4c8
• Original image size: 60 x 100 x 3
• Colormap (cmap) rendering of its grayscale image:
23
Patch Tokenisation : Example
• Image after Patch tokenisation with P = 20:
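A minimal NumPy sketch of this patch tokenisation (it assumes H and W are divisible by P); with the 60 x 100 x 3 image and P = 20 from the example it yields 15 tokens of dimension 1200.

```python
# Split an image of shape (H, W, C) into non-overlapping P x P patches and flatten
# each patch into a token vector of length P*P*C.
import numpy as np

def patchify(image: np.ndarray, P: int) -> np.ndarray:
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0, "image dimensions must be divisible by the patch size"
    # reshape into (H/P, P, W/P, P, C), bring the two patch-grid axes together, then flatten each patch
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)        # shape (N, P*P*C) with N = H*W / P^2

image = np.random.rand(60, 100, 3)               # stand-in for the 60 x 100 x 3 example image
tokens = patchify(image, P=20)
print(tokens.shape)                              # (15, 1200): 15 tokens of dimension 20*20*3
```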
24
2. EMBEDDINGS
25
Embeddings
• These are high-dimensional vectors representing tokens in a way that
captures their semantic meaning and relationships.
• Embeddings enable LLMs to understand context and nuances in data,
whether it’s text, images, or videos.
• The quality of embeddings significantly impacts the performance of
LLMs.
26
• Embeddings allow models to understand not just the identity of a
token but also its relationships with other tokens.
• For example, there can be a significant distance between
the embeddings of two words with the same surface form
but different meanings
(apple, the fruit vs. Apple, the company).
27
Types of embeddings
1.Uni-modal Embeddings:
Generated from a single type of input data (e.g., text), capturing the
semantic context within that modality.
• Uni-modal embeddings are used in tasks specific to one type of data.
• Ex: text embeddings are used in NLP tasks like text classification,
sentiment analysis, and machine translation.
• Ex:image embeddings are used in tasks like object detection and
image classification.
28
2. Multi-modal Embeddings:
Generated from multiple types of input data (e.g., text and images),
• capturing the relationships and interactions across different
modalities.
• Multi-modal embeddings are crucial for tasks that require
understanding the interplay between different types of data.
• Ex: in a video with subtitles, multi-modal embeddings can help the
model understand the relationship between the visual content and
the accompanying text.
• This capability is essential for tasks like video captioning and cross-
modal retrieval.
29
Embedding Techniques/Approaches
• Early techniques: count-based/frequency-based methods – one-hot encoding,
Bag-of-Words (BoW) and TF-IDF for text data
• Limitations
1. Unable to capture semantic relationships.
• Ex: tokens with the same semantics get entirely different embeddings
30
2. Sparse representations:
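A small illustrative sketch of both limitations: one-hot vectors are sparse (mostly zeros), and any two distinct words, even synonyms, have zero cosine similarity. The vocabulary below is made up.

```python
# One-hot encodings: sparse, and unable to capture semantic similarity.
import numpy as np

vocab = ["good", "great", "excellent", "car", "apple"]     # illustrative vocabulary
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["great"])                                    # [0. 1. 0. 0. 0.] -> mostly zeros (sparse)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "great" and "excellent" are semantically similar, yet their one-hot similarity is 0,
# exactly the same as for unrelated words such as "great" and "car".
print(cosine(one_hot["great"], one_hot["excellent"]))      # 0.0
print(cosine(one_hot["great"], one_hot["car"]))            # 0.0
```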
31
32
• Modern techniques:
1. Word2Vec:
• It represents each distinct word with a dense vector of decimal values/numbers
• It uses a neural network model to learn word associations from a large corpus of
text.
• Once trained, the network can detect similar words and also predict the
surrounding words in a sentence.
• The network learns to associate words that are semantically similar with similar
vector representations.
• Allows vector arithmetic on word embeddings
• Ex: King – man + woman = queen
• USA – Washington D C + Delhi = India
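A minimal sketch using the gensim library (not the from-scratch implementation referenced later in these slides); the toy corpus and hyperparameters are illustrative, and analogies like king - man + woman ≈ queen only emerge reliably with large training corpora.

```python
# Minimal Word2Vec sketch with gensim (toy corpus; results are meaningful only on large corpora).
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "the", "dog"],
    ["the", "woman", "walks", "the", "dog"],
]

# sg=0 -> CBOW, sg=1 -> skip-gram; vector_size is the embedding dimension
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=200)

print(model.wv["king"].shape)                      # a dense vector of 50 decimal values
print(model.wv.similarity("king", "queen"))        # cosine similarity of the two embeddings

# Vector arithmetic (king - man + woman ~ queen) -- only reliable with a large training corpus
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```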
33
Ex:
• Columns are vocabulary of a story about a king and queen
• Rows are few words from the story
34
35
• There are two main variants of Word2Vec:
• Continuous Bag-of-Words (CBOW):
• The CBOW model predicts the current word based on the surrounding
words in a sentence.
• Ex: the model might be trained to predict the word “cat” given the words
“the” and “dog”.
• Skip-gram:
• The skip-gram model predicts the surrounding words based on the current
word in a sentence.
• Ex: the model might be trained to predict the words “the” and “dog” given
the word “cat”.
36
37
Applications: Find whether two given words are similar/dissimilar from
their vector representations
https://www.youtube.com/watch?v=Q95SIG4g7SA
38
39
CBOW
Ref : https://www.youtube.com/watch?v=Q95SIG4g7SA
40
One hot encoded values of
dimension 7 – size of the
vocabulary
41
42
• In this example, the size of the embedding vector for each word in the vocabulary is
taken to be equal to the window size (in general, the embedding dimension is an
independent hyperparameter).
• The following shows the embedding for “iNeuron”.
43
• 2. GloVe:
• GloVe works by creating a co-occurrence matrix.
• The co-occurrence matrix is a table that shows how often two words appear
together in a corpus of text.
• Ex: the co-occurrence matrix for the words “cat” and “dog” would show how
often the words “cat” and “dog” appear together in a corpus of text.
• GloVe then uses a machine learning algorithm that learns to associate words
that appear together frequently with similar vector representations.
44
3. FastText:
• An extension of Word2Vec that captures the meaning of shorter
words and affixes.
• FastText represents words as bags of character n-grams.
• This approach helps the model understand the meanings of words by
considering their subword information, making it effective in handling
rare and out-of-vocabulary words.
45
46
47
3. ATTENTION MECHANISM
1
Seq-to-seq models without attention
• https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
2
• In neural machine translation, a sequence is a series of words,
processed one after another.
• The output is, likewise, a series of words:
3
• The neural machine translation model is composed of an encoder and a decoder.
• The encoder processes each item in the input sequence and compiles the information it captures
into a vector (called the context).
• After processing the entire input sequence, the encoder sends the context over to the decoder.
• The decoder then begins producing the output sequence item by item.
• The context is a vector (an array of numbers).
• The encoder and decoder are both RNNs. The size of the context vector equals the number of hidden units in the
encoder.
• In real-world applications the context vector size is typically 256, 512 or 1024.
4
• Words are converted to vectors using embedding algorithms
• Recap of RNN:
5
• As both the encoder and decoder are RNNs, at each time step one of the RNNs does
some processing: it updates its hidden state based on its current input and the previous
inputs it has seen.
• The last hidden state of the encoder is the context sent to the decoder
• The decoder also maintains a hidden state that it passes from one time step to
the next.
6
• Seq-to-seq models using RNNs in the encoder and decoder
components, are unable to handle longer sequences.
• A solution to this problem is to use “Attention” which highly improved
the quality of machine translation systems
• Attention allows the model to focus on the relevant parts of the input
sequence as needed.
7
Attention Mechanism
• an attention mechanism allows a model to focus on different parts of the input
data with varying degrees of importance
• These mechanisms assign different weights to the embeddings of tokens based
on their relevance to the context
• This allows the model to focus on important elements and improves its
understanding and generation capabilities.
• The attention mechanism enables models to handle long-range dependencies in
data.
• In sequences where certain tokens are more relevant than others, the attention
mechanism helps the model focus on these critical tokens, thereby enhancing the
overall performance.
8
Attention in text
9
Attention in video interpretation
• in a video of a bustling cityscape,
• the attention mechanism might assign higher weights to the tokens
representing the main subjects of the video,
• such as a prominent building, a moving car, or a person interacting with
others.
• At the same time, it might assign lower weights to the tokens
representing the background or less significant elements,
• the sky, stationary objects, or the general crowd.
• allows the model to understand the continuity and relationship
between different parts of the video,
• such as the movement of the car from one frame to another or the
interaction of the person throughout the video.
10
Neural Machine Translation with attention
• At time step 7, the attention mechanism enables the decoder to focus
on the word "étudiant" ("student" in French) before it generates the
English translation.
• This ability to amplify the signal from the relevant part of the input
sequence makes attention models produce better results than models
without attention.
11
• An attention model differs from a classic sequence-to-sequence
model in two main ways:
1. the encoder passes a lot more data to the decoder. Instead of
passing the last hidden state of the encoding stage,
the encoder passes all the hidden states to the decoder:
12
2. An attention decoder does an extra step before producing its output. At a
decoding time step t, the attention decoder focuses on the parts of the input that
are relevant to this decoding time step.
• The decoder does the following:
• Look at the set of encoder hidden states it received – each encoder hidden
state is most associated with a certain word in the input sentence
• Give each hidden state a score (against the decoder's current hidden state)
• Multiply each hidden state by its softmaxed score, thus amplifying hidden
states with high scores, and drowning out hidden states with low scores
• This scoring exercise is done at each time step on the decoder side.
13
14
Putting it all together:
1. The attention decoder RNN takes in the embedding of the <END> token, and
an initial decoder hidden state.
2. The RNN processes its inputs, producing an output and a new hidden
state vector (h4). The output is discarded.
3. Attention Step: calculate a context vector (C4) for this time step, using
the encoder hidden states and the h4 vector.
4. concatenate h4 and C4 into one vector.
5. pass this vector through a feedforward neural network (one trained jointly with
the model).
6. The output of the feedforward neural networks indicates the output word of
this time step.
7. Repeat for the next time steps
15
16
• At a particular decoding time step t, which part of the input
sequence does the decoder pay attention to?
17
• The model doesn't blindly align the first input word with the first output word.
• It learns from the training phase how to align word pairs in French-English.
• The model paid attention correctly when outputting "European Economic Area".
• In French, the order of these
words is reversed
("européenne économique
zone") as compared to English.
• Every other word in both
sentences is in a similar order.
18
Building blocks of LLMs ..contn…
• Transformers :
• transformers are a type of machine learning model utilizing attention as the
primary learning mechanism. [1] “Attention is all you need”
• Transformers quickly became the state of the art for sequence-to-sequence
tasks such as language translation.
• Applications: Language Translation:
19
20
• The encoding and decoding components are each a stack of 6 layers.
• The output of the last (top) encoder is the input to each of the decoders.
21
• The encoders are all identical in structure (yet they do not share weights).
• Each one is broken down into two sub-layers
• The encoder’s inputs first flow through a
self-attention layer – a layer that helps the
encoder look at other words in the input sequence as it
encodes a specific word.
• The outputs of the self-attention layer are fed
to a feed-forward neural network.
• The exact same feed-forward network is
independently applied to each position.
22
• The decoder has both those layers, but between them is an attention
layer that helps the decoder focus on relevant parts of the input
sentence (similar to what attention does in seq2seq models).
23
• The input to each encoder is a list of vectors of size 512.
• The input to the bottom-most encoder is the embedding vector of each word.
• The input to every other encoder is the output of the encoder directly below
it.
• Each word embedding flows through its own path through the two layers of the
encoder (the inputs are processed in parallel, unlike in an RNN).
24
• Specific property of a transformer:
• the word in each position flows through its own path in the encoder.
• There are dependencies between these paths in the self-attention layer.
• The feed-forward layer does not have those dependencies, and thus the various paths can be
executed in parallel while flowing through the feed-forward layer.
• Ex: a smaller sentence:
25
Self Attention for a single word
How to find the attention weight for a word in a particular position?
• Assume input is “Thinking Machines”
26
Self Attention for a single word
How to find the attention weight for a word in a particular position?
• Assume input is “Thinking Machines”
• The model learns three weight matrices W^Q, W^K and W^V during the training phase.
STEP 1:
1. The input to the encoder is the embedding vectors of the words, X1 and X2, each of dimension 512.
2. For each embedding vector, the model calculates three vectors called query, key and value, each of
dimension 64:
1. q1 = X1 * W^Q
2. k1 = X1 * W^K
3. v1 = X1 * W^V
The query, key and value vectors are abstractions used to compute the attention
weights/scores.
27
STEP 2: Calculate the attention weight of all words.
• For the first word “Thinking”, find the score of each word against this word.
• The score determines how much focus to place on other parts of the input sentence as the
model encodes a word at a certain position.
• Take the dot product of the query vector of this word, q1, with the key vectors of all the words,
k1 and k2.
• The first score is the dot product of q1 and k1.
• The second score is the dot product of q1 and k2.
28
STEP 3 and 4:
• Divide the scores by 8 (the square root of the dimension of the key vectors used in the paper – 64). This
leads to more stable gradients.
• Pass the result through a softmax operation. Softmax normalizes the scores so they're all positive and add
up to 1.
• This softmax score determines how much each
word will be expressed at this position.
• Clearly the word at this
position will have the highest
softmax score, but sometimes it’s useful to
attend to another word that is relevant to
the current word.
29
STEP 5:
• multiply each value vector by the softmax score (in preparation to
sum them up).
• The intuition here is to keep intact the values of the word(s) the
model has to focus on, and drown-out irrelevant words (by
multiplying them by tiny numbers like 0.001, for example).
STEP 6:
• sum up the weighted value vectors.
• This produces the output of the self-attention layer at this position
(for the first word).
30
• The resulting vector is
is sent to the
feed-forward neural
network.
• In the actual
implementation,
however, this
calculation is done in
matrix form for faster
processing.
31
Self attention for all words
• The self-attention weights of all words in the input sequence are calculated using matrices.
• Create a matrix X, where each row of X is an embedding vector of the input sequence.
• Get the Q, K and V matrices using X and the weight matrices W^Q, W^K and W^V.
32
• Steps 2 through 6 for word level attention can be condensed into one
step using matrices.
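In matrix form this is Z = softmax(QK^T / sqrt(d_k)) V. Below is a minimal NumPy sketch with scaled-down dimensions and random weights (the real encoder uses d_model = 512 and d_k = 64).

```python
# Scaled dot-product self-attention in matrix form: Z = softmax(Q K^T / sqrt(d_k)) V.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # step 1: queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # steps 2-3: dot-product scores, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)          # step 4: softmax over each row
    return weights @ V                          # steps 5-6: weighted sum of the value vectors

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                             # scaled down from 512 / 64
X = rng.normal(size=(2, d_model))               # embeddings of "Thinking" and "Machines"
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Z = self_attention(X, Wq, Wk, Wv)
print(Z.shape)                                  # (2, 4): one output vector z_i per input word
```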
33
Multi-headed attention
• Transformers refined the self-attention layer by adding a mechanism
called “multi-headed” attention.
• Original paper uses 8-headed attention mechanism
• Advantages of multi-headed attention:
• It expands the model’s ability to focus on different positions.
• In the previous example, z1 contains a little bit of every other encoding, but it could be
dominated by the actual word itself
• It gives the attention layer multiple “representation subspaces”
• with multi-headed attention we have multiple sets of Query/Key/Value weight
matrices (the Transformer uses eight attention heads, so we end up with eight
sets for each encoder/decoder).
34
• Each of these sets is randomly initialized.
• after training, each set is used to project the input embeddings (or vectors
from lower encoders/decoders) into a different representation subspace.
35
• Repeating this eight times with different weight matrices ends up
with eight different Z matrices for each word/embedding vector.
36
• As the feed-forward layer expects only one vector per word, the z vectors of
each word have to be combined.
• Hence, concatenate the Z matrices of all the heads and multiply the result by an additional
weight matrix WO.
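A minimal NumPy sketch (illustrative sizes, random weights) of the multi-headed computation: eight heads, each with its own W^Q, W^K, W^V, whose outputs Z_i are concatenated and projected by W^O before going to the feed-forward layer.

```python
# Multi-headed attention sketch: 8 heads, per-head Z_i concatenated and projected with W^O.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
n_heads, d_model = 8, 512
d_k = d_model // n_heads                              # 64, as in the paper
X = rng.normal(size=(2, d_model))                     # two word embeddings

heads = []
for _ in range(n_heads):
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))   # per-head weight matrices
    heads.append(attention(X @ Wq, X @ Wk, X @ Wv))   # Z_i for this head, shape (2, 64)

Z = np.concatenate(heads, axis=-1)                    # concatenate the eight Z matrices
Wo = rng.normal(size=(n_heads * d_k, d_model))        # W^O projection
out = Z @ Wo                                          # one vector per word, fed to the FFN
print(out.shape)                                      # (2, 512)
```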
37
38
• When ‘it’ is encoded, attention heads focus on “the animal” and
“tired”.
39
Adding positional embeddings
• Model learns the order of the words using positional encodings
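For reference, the sinusoidal positional encodings from the original paper, which are added to the input embeddings (pos is the token position, i the dimension index, d_model the embedding size):

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```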
40
Encoding layers with normalization
41
• Normalised outputs go to the decoder also
42
Decoding side
• The output of the top encoder is then transformed into a set of attention vectors K and V.
• These are to be used by each decoder in its “encoder-decoder attention” layer which helps the
decoder focus on appropriate places in the input sequence
• https://jalammar.github.io/illustrated-transformer/
43
• repeat the following process until a special symbol is reached indicating the transformer decoder
has completed its output.
• The output of each step is fed to the bottom decoder in the next time step
• The decoders bubble up their decoding results just like the encoders did.
• As on the encoder side, the decoder inputs are word embeddings with positional encodings added
to indicate the position of each word.
44
45
• In the decoder, the self-attention layer is only allowed to attend to
earlier positions in the output sequence.
• This is done by masking future positions (setting them to -inf)
before the softmax step in the self-attention calculation.
• The “Encoder-Decoder Attention” layer works just like multiheaded
self-attention, except it creates its Queries matrix from the layer
below it, and takes the Keys and Values matrix from the output of the
encoder stack.
• The decoder stack outputs a vector of floats.
• The Linear layer is a simple fully connected neural network that
projects the vector produced by the stack of decoders, into a much,
much larger vector called a logits vector.
46
• If the model learns 10,000 unique words from the training data, then the logit vector has 10,000
cells.
• Each cell is the logit score of a unique word
• The softmax layer then turns those scores into probabilities
• The cell with the highest probability is chosen, and the word associated is the decoder output for
this time step.
47
• In the first phase of training, the weight matrices are initialised to random values.
• Assume we want the output to be a probability distribution indicating the word “thanks”. But since this
model is not yet trained, that’s unlikely to happen just yet.
• The untrained model produces an output :
48
• Find the loss as the difference between the two probability distributions.
• Adjust the weights (via backpropagation) and continue training.
49
• After training the model for enough time on a large enough dataset,
the produced probability distributions would look like this:
50
VARIANTS OF TRANSFORMER ARCHITECTURE – BERT AND GPT
1
Variants of the transformer architecture – BERT and GPT
• The original transformer architecture has both encoder and decoder.
o the encoder encodes an input sequence in the form of a vector of numbers —
a low-level format that is understood by machines
o the decoder takes the encoded sequence and by applying a language
modeling task, it generates a new sequence.
• Encoders and decoders can be used individually for specific tasks.
• The two famous models deriving their parts from the original Transformer are
o BERT (Bidirectional Encoder Representations from Transformer) consisting of encoder blocks
o GPT (Generative Pre-Trained Transformer) composed of decoder blocks.
2
3
BERT and GPT
• Encoder models (e.g. BERT) predict tokens based on the context
from both sides
o Example: “The bank is situated on the _______ of the river.”
4
BERT - Introduction
• BERT, an acronym for Bidirectional Encoder Representations from
Transformers, is an open-source machine learning
framework designed for tasks in NLP.
• Designed by researchers from Google AI in 2018.
• It is a transformer-based neural network to understand and generate
human-like language.
• It employs an encoder-only architecture.
• This is to emphasise understanding the input sequence rather than
generating an output sequence.
• In the original Transformer architecture, there are both encoder and
decoder modules.
5
Special feature of BERT
• Traditional language models process text sequentially, either from left
to right or right to left.
• This method limits the model’s awareness to the immediate context
preceding the target word.
• BERT uses a bi-directional approach considering both the left and
right context of words in a sentence,
• Instead of analyzing the text sequentially, BERT looks at all the words
in a sentence simultaneously (using the attention mechanism of the
transformer encoder).
6
Example: “The bank is situated on the _______ of the river.”
• In a unidirectional model, the prediction of the target word depends heavily on the
preceding words alone.
• The model struggles to understand whether "bank" refers to a financial institution
or the side of the river.
7
BERT Pre-training and Fine Tuning
• The BERT model undergoes a two-step process:
1. Pre-training on Large amounts of unlabeled text to learn contextual
embeddings.
2. Fine-tuning on labeled data for specific NLP tasks.
8
1. Pre-training
• BERT is pre-trained on large amount of unlabeled text data.
• The model learns contextual embeddings,
• Contextual embeddings - the representations of words that take into
account their surrounding context in a sentence.
• BERT is trained with unsupervised pre-training tasks:
1. Masked Language Model (MLM) task – learn to predict missing (masked) words in a
sentence.
2. Next Sentence Prediction (NSP) task – understand the relationship between two sentences
and predict whether the second sentence follows the first.
9
Fine-tuning
• After the pre-training phase, the BERT model uses the contextual
embeddings it has learned and is fine-tuned for specific NLP tasks.
• It adapts its general language understanding to complete more
targeted applications.
• Fine tuning is done using labeled data specific to the downstream
tasks of interest.
• These tasks are :
• sentiment analysis, question-answering, named entity recognition, or any
other NLP application.
• The model’s parameters are adjusted to optimize its performance for the
particular task.
10
How does BERT work?
• BERT is designed to build a language model, so only the encoder
mechanism is used.
• A sequence of tokens is fed to the Transformer encoder.
• These tokens are first embedded into vectors and then processed in the
neural network.
• The output is a sequence of vectors, each corresponding to an input token,
providing contextualized representations.
• When training language models, defining a prediction goal is a challenge.
• Many models predict the next word in a sequence, which is a directional approach
and may limit context learning.
• BERT addresses this challenge with two innovative training strategies:
Masked Language Model (MLM)
Next Sentence Prediction (NSP)
11
Masked Language Model (MLM)
• 15% of the words in each input sequence are replaced with a [MASK] token.
• The model then attempts to predict the original value of the masked words,
based on the context provided by the other, non-masked, words in the sequence.
12
Architecture of MLM:
1. Adding a classification layer on top of the encoder output.
• This layer is crucial for predicting the masked words.
2. Multiplying the output vectors by the embedding matrix,
transforming them into the vocabulary dimension.
• This step helps align the predicted representations with the vocabulary space.
3. Calculating the probability of each word in the vocabulary with
softmax.
• For every masked position, this step generates a probability distribution over
the entire vocabulary
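A small sketch of MLM inference using the Hugging Face fill-mask pipeline; the model name and the sentence are example choices. The pipeline applies exactly the classification-plus-softmax head described above and returns the most probable words for the masked position.

```python
# Masked Language Model inference: predict the word hidden behind [MASK].
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The classification layer + softmax produce a probability for every vocabulary word
# at the masked position; the pipeline returns the top candidates.
for prediction in fill_mask("The bank is situated on the [MASK] of the river."):
    print(prediction["token_str"], round(prediction["score"], 3))
```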
13
• The loss function used during training considers only the prediction of
the masked values.
• The model is penalized for the deviation between its predictions and
the actual values of the masked words.
• The model converges slower than directional models.
• This is because, during training, BERT is only concerned with predicting the
masked values, ignoring the prediction of the non-masked words.
• The increased context awareness achieved through this strategy compensates
for the slower convergence.
14
Next sentence prediction (NSP)
• Preparing the Training data :
• In 50% of the training data, the second sentence is the actual subsequent sentence of the first
sentence.
• In the other 50%, the first and second sentences are not related.
• To distinguish two sentences in the training set, input is processed by:
1. A [CLS] token is inserted at the beginning of the first sentence and a [SEP]
token is inserted at the end of each sentence.
2. A sentence embedding indicating Sentence A or Sentence B is added to each
token. Sentence embeddings are similar in concept to token embeddings with a
vocabulary of 2.
3. A positional embedding is added to each token to indicate its position in the
sequence.
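A small sketch showing how a BERT tokenizer prepares such a sentence pair: [CLS]/[SEP] tokens are inserted and segment (Sentence A / Sentence B) IDs are produced; the positional embeddings are added inside the model. The sentences are illustrative.

```python
# Preparing a sentence pair for BERT: [CLS]/[SEP] insertion and sentence (segment) IDs.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The man went to the store.", "He bought a gallon of milk.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'man', ..., '[SEP]', 'he', 'bought', ..., '[SEP]']
print(encoded["token_type_ids"])   # 0s for Sentence A tokens, 1s for Sentence B tokens
```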
15
16
• To predict if the second sentence is connected to the first:
1. The entire input sequence goes through the Transformer model.
2. The output of the [CLS] token is transformed into a 2×1 shaped vector, using
a simple classification layer (learned matrices of weights and biases).
3. The probability of whether the second sentence follows the first is calculated
using softmax.
17
BERT architecture
• The architecture of BERT is a multilayer bidirectional transformer
encoder which is quite similar to the transformer model.
• A transformer architecture is an encoder-decoder network that
uses self-attention on the encoder side and attention on the decoder
side.
• Two architectures: BERT-BASE and BERT-LARGE.
18
Differences between BERT and the original Transformer architecture
1. Encoder stack:
• BERT-BASE has 12 layers in the encoder stack, while BERT-LARGE has 24 layers.
• These are more than in the Transformer architecture described in the original paper (6 encoder
layers).
2. Network width:
• The BERT architectures (BASE and LARGE) also have wider layers – hidden sizes of 768 and 1024
respectively.
• The original Transformer model has 512 hidden units.
3. Attention heads:
• The BERT architectures (BASE and LARGE) have more attention heads (12 and 16 respectively).
• The original Transformer model has 8 attention heads.
4. Number of parameters:
• BERT-BASE contains 110M parameters while BERT-LARGE has 340M parameters.
• Transformer – 65M parameters.
19
GPT framework has 2 phases : pre-training and fine tuning
• Pre-training:
o This stage teaches the model to anticipate the word that comes next in a
sentence.
o It is trained on a wide variety of internet text.
• Fine tuning :
o training it on data specific to a given domain or task.
20
21
GPT – Pre-training
• u – input sequence of tokens
• k – context window size
• n – number of substrings of u with length k
• At each step,
• the model outputs the probability distribution over all possible tokens being the
next token, given the sequence consisting of the last k context tokens.
• Then, the logarithm of the probability assigned to the real next token is calculated and used as
one of several terms in the sum that forms the loss.
• The loss function is a log-likelihood function:
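The pre-training objective from the GPT paper (Θ denotes the model parameters):

```latex
L_{1}(\mathcal{U}) \;=\; \sum_{i} \log P\!\left(u_{i} \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
```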
22
• Ex: u = "Mixing red and blue produces purple"
• k = 3
o For each of these substrings, the model predicts a probability distribution over every token in
the vocabulary (the language modeling task).
o The probability corresponding to the true next token in the sequence
is used for the loss calculation.
o The final loss equals the sum of the logarithms of the true-token probabilities.
23
24
• The log-likelihood loss function in GPT maximizes the logarithm of the probability
of correctly predicting all the tokens in the input sequence.
• Once GPT is pre-trained, it can already be used for text generation.
• GPT is an autoregressive model - it uses previously predicted tokens as input for
prediction of next tokens.
• the process lasts until the [end] token is predicted or the maximum input size is
reached.
25
GPT – fine tuning
• After pre-training, GPT has captured general linguistic knowledge.
• It is then fine-tuned for a specific task (e.g., text classification) on a supervised,
task-specific dataset.
• In the labelled dataset, each example has an input sequence x with a
corresponding label y which needs to be predicted.
• Every example is passed through the model, which outputs its hidden
representation h at the last layer.
• The resulting vector is then passed to an added linear layer with learnable
parameters W and then through a softmax layer.
26
27
• The loss function used for fine-tuning is very similar to that of the pre-training phase,
but this time it evaluates the probability of observing the target label y instead
of predicting the next token:
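In the GPT paper this is written as follows, where h_l^m is the final transformer block's activation for the last token and W_y is the added linear layer; an auxiliary language-modeling term is usually added with weight λ:

```latex
P(y \mid x^{1}, \ldots, x^{m}) = \mathrm{softmax}\!\left(h_{l}^{m} W_{y}\right), \qquad
L_{2}(\mathcal{C}) = \sum_{(x,\,y)} \log P\!\left(y \mid x^{1}, \ldots, x^{m}\right), \qquad
L_{3}(\mathcal{C}) = L_{2}(\mathcal{C}) + \lambda\, L_{1}(\mathcal{C})
```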
28
Limitations of LLMs
1
Limitations of LLMs
1. Outdated knowledge:
• LLMs are entirely dependent on their training data.
• Without integration with external data sources, they cannot provide recent real-
world information.
2
Ex:
3
2. Inability to take action:
• LLMs cannot perform interactive actions like searches, calculations or look-
ups.
• They cannot interface with APIs or take practical actions based on new
prompts.
• This limits their functionality.
• (Screenshot example: the LLM returns a wrong answer to an arithmetic prompt;
the correct answer is 6528025.)
4
3. Lack of context:
• LLMs struggle to incorporate relevant context, such as previous conversations.
4. Biases and discrimination:
• Depending on the training data, LLMs can generate biased responses
targeting a particular religion, ideology or political leaning.
5. Lack of transparency:
• The behaviour of LLMs is opaque and difficult to interpret.
• This makes it challenging to align them with human values.
5
6. Hallucination risks:
• Due to insufficient knowledge of certain topics and a lack of the most recent information,
LLMs can generate incorrect responses.
• Ex:
6
Techniques to mitigate these limitations
1. Retrieval augmentation:
• Uses external knowledge bases to supplement an LLM’s outdated training
data, providing external context and reducing hallucination risk.
2. Chaining:
• This technique integrates actions like searches and calculations.
3. Prompt engineering:
• This involves the careful crafting of prompts by providing critical context that
guides appropriate responses
4. Monitoring, filtering, and reviews:
• Constitutional principles can monitor and filter unethical or inappropriate
content.
• Human reviews provide insight into model behavior and output.
7
5. Memory:
• Retains conversation context by persisting conversation data and context
across interactions.
6. Fine-tuning:
• Training and tuning the LLM on more appropriate data for the application
domain and principles.
• This adapts the model’s behavior for its specific purpose.
• To summarize – prompting supplies context, chaining enables
inference, and retrieval incorporates facts.
• Frameworks like LangChain provide these features for responsible
use of LLMs.
8
LangChain
• LangChain is a framework that simplifies the process of creating generative AI applications.
• Foundation models (e.g., GPT) are generally trained only on data available up to their release to the public.
• ChatGPT was released to the public near the end of 2022, but its knowledge base was limited to
data from 2021 and before.
• LangChain can connect AI models to external data sources to give them knowledge of recent data
without those limitations.
A simple LLM application
• an LLM app is an application that utilizes an LLM to understand
natural language prompts and generate responsive text outputs.
10
• Optionally to augment the LLMs capabilities :
• integration with external services via function APIs, knowledge bases, and
reasoning algorithms.
11
RAG – Retrieval Augmented Generation
• LLMs are trained on vast volumes of data to generate original output
for tasks like answering questions, translating languages, and
completing sentences
• Known challenges of LLMs include:
• Present false information when it does not have the answer.
• Present out-of-date or generic information when the user expects a specific,
current response.
• Create a response from non-authoritative sources.
• Create inaccurate responses due to terminology confusion, wherein different
training sources use the same terminology to talk about different things.
12
• It's a cost-effective approach to improving an LLM's output so it remains
relevant, accurate, and useful in various contexts.
13
• RAG is a fusion of information retrieval models and generative
models.
14
Components of a RAG in LangChain
1. Indexing:
• a pipeline for ingesting data from a source and indexing it. This usually
happens offline.
15
1. Indexing
16
1. Indexing..
1. Load:
• Load our data with Document Loaders.
2. Split:
• Text splitters split large documents into smaller chunks.
• This is useful both for indexing the data and for passing it in to a model.
• Large chunks are harder to search over and may exceed the model's finite
context window.
3. Store:
• Store and index the splits, in order to search them later.
• Chunks are stored using a VectorStore and an Embeddings model.
17
2. Retrieval and generation
4. Retrieve:
• Given a user input, relevant splits are retrieved from storage using a Retriever.
5. Generate:
• An LLM produces an answer using a prompt that includes the question and the
retrieved data (a condensed sketch of all five steps follows).
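A condensed, hedged sketch of the five steps with LangChain: exact import paths differ across LangChain versions, the file name and model name are placeholders, and it assumes the relevant packages (including a vector-store backend such as FAISS) plus an OpenAI API key are available.

```python
# Condensed RAG sketch with LangChain (import paths vary across LangChain versions).
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

# 1. Load: ingest documents with a Document Loader ("my_notes.txt" is a placeholder)
docs = TextLoader("my_notes.txt").load()

# 2. Split: break large documents into smaller, searchable chunks
splits = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)

# 3. Store: embed the chunks and index them in a vector store
vectorstore = FAISS.from_documents(splits, OpenAIEmbeddings())

# 4. Retrieve: fetch the splits most relevant to the user's question
question = "What does the document say about attention?"
retrieved = vectorstore.as_retriever().invoke(question)

# 5. Generate: answer using a prompt that contains both the question and the retrieved context
context = "\n\n".join(d.page_content for d in retrieved)
answer = ChatOpenAI(model="gpt-4o-mini").invoke(
    f"Answer using only this context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)
```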
18