GenAI Module2

Recurrent Neural Networks (RNNs) are advanced algorithms designed for sequential data, effectively capturing dependencies between data points, unlike traditional Artificial Neural Networks (ANNs). RNNs utilize a hidden state to remember information from previous inputs, making them suitable for tasks like language translation and sentiment analysis. However, RNNs face challenges such as short-term memory and the vanishing gradient problem, which can be mitigated by using Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs).


Recurrent Neural Network (RNN)

• Recurrent neural networks (RNNs) are state-of-the-art algorithms for sequential data and are used by Apple's Siri and Google's voice search.
• What is sequential data?
• When the points in a dataset depend on other points in the dataset, the data is said to be sequential.
• Examples: time-series data, stock market price data, words in a sentence, gene sequence data, etc.
• Why can't an ANN be used for sequential data?
• It does not consider the dependencies within sequential data.
• Ex: Given time-series data, develop a DNN to predict the outlook of a day as sunny/rainy/windy.
• A traditional NN makes the prediction for each observation independently of the other observations.
• This violates the fact that the weather on a particular day is strongly correlated with the weather of the previous day and the following day.
• A traditional neural network assumes the data is non-sequential and that each data point is independent of the other data points.
• Hence, the inputs are analyzed in isolation, which can cause problems when there are dependencies in the data.
• In traditional neural networks, all the inputs and outputs are independent of each other; but when it is required to predict the next word of a sentence, the previous words are required, and hence there is a need to remember the previous words.
• RNNs are a type of neural network where the output from the previous step is fed as input to the current step.

• The most important feature of an RNN is the hidden state, which remembers some information about a sequence.
• An RNN has a "memory" that remembers the information computed at the previous steps.
• It uses the same parameters for each input, since it performs the same task on all the inputs or hidden layers to produce the output.
• This reduces the number of parameters, unlike other neural networks.
Some Applications of RNN
Why not ANN?
• 1. An issue with using an ANN for language translation is that we cannot fix the number of neurons in a layer; it depends on the number of words in the input sentence.
Why not ANN?
• 2. Too many computations.
• Input words have to be converted to vectors (e.g., one-hot encoding or word2vec).
• Hence, that many neurons and parameters have to be learnt by the model.
Why not ANN?
• 3. It doesn't preserve the sequence relationship in the input data.
• A traditional neural network assumes the data is non-sequential and that each data point is independent of the other data points.
• Hence, the inputs are analyzed in isolation, which can cause problems when there are dependencies in the data.
• Since each hidden layer has its own weights, biases and activations, the layers behave independently.
• When the input is sequential data, the model should also be able to identify the relationship between successive inputs.
• If the task is to predict the next word in a sentence using an MLP, this will not help: all hidden layers with different weights and biases work independently.
• To make the hidden layers preserve the sequence relationship in the input, all hidden layers have to be combined.
• To combine them, use the same weights and activation functions.
All these hidden layers can be rolled together into a single recurrent layer.
How RNN works?
• Neurons in the recurrent layer are called recurrent neurons.
• At all time steps, the weights of the recurrent neurons are the same.
• So a recurrent neuron stores the state of a previous input and combines it with the current input, thereby preserving some relationship of the current input with the previous input.
• An RNN converts independent activations into dependent activations by providing the same weights and biases to all the layers, thus reducing the number of parameters, and memorizes each previous output by giving each output as input to the next hidden layer.
• The entire RNN computation involves computations to update the cell state at each time step and computations to predict the output at that time step.
• During the forward pass, we calculate the outputs at each time step in order to calculate the individual loss at each time step.
• The individual losses are combined to form the total loss.
• This total loss is used to train the neural network.
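As an illustrative sketch of the forward pass just described (not taken from the slides; the weight names and the NumPy implementation are assumptions), a vanilla RNN cell can be written as:

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over a sequence of input vectors.

    The same weights are reused at every time step, and each hidden
    state depends on the current input and the previous hidden state."""
    h = np.zeros(W_hh.shape[0])           # initial hidden state
    outputs, hidden_states = [], []
    for x_t in inputs:
        # combine the current input with the previous hidden state
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        y_t = W_hy @ h + b_y              # per-time-step output (logits)
        hidden_states.append(h)
        outputs.append(y_t)
    return outputs, hidden_states

# toy usage: 5 time steps, input dim 4, hidden dim 3, output dim 2
rng = np.random.default_rng(0)
seq = [rng.random(4) for _ in range(5)]
params = [rng.random(s) for s in [(3, 4), (3, 3), (2, 3), (3,), (2,)]]
ys, hs = rnn_forward(seq, *params)
print(len(ys), hs[-1])
```

In training, each per-time-step output would be compared to its target, and the individual losses summed into the total loss that is backpropagated through time.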
Desirable Characteristics of RNN for Sequence Modeling

• Ability to handle sequences of variable lengths.
• Information about the next word to be predicted in the sequence might be present much earlier, at the beginning of the sequence.

• Ability to capture and model long-term dependencies.
• This is possible since RNNs keep updating the information collected from the past by updating their recurrent/hidden cell state at each time step.
Desirable Characteristics of RNN for Sequence Modeling…

• Ability to capture differences in sequence order.
• Two sentences with the same words can have different meanings.
• RNNs capture this difference, since they use the same weight matrices at each time step to update their hidden state and remember past information.
• In FFNN, the gradient of
the loss function is back
propagated through one
feed forward network in
one time step/input
• But in RNN, the gradient
of the total error is
propagated to the
individual time steps and
also across the time steps
from the most recent
time step to the very
beginning of the
sequence.
• Hence the name
“Backpropagation
through time”
Types of RNN
• One to One RNN
• One to Many RNN
• Many to One RNN
• Many to Many RNN
One to One RNN
• One-to-One RNN is the most basic and traditional type of neural network, giving a single output for a single input.
• It is also known as a Vanilla Neural Network. It is used to solve regular machine learning problems.
Ex: image classification

One to Many
• One-to-Many is a kind of RNN architecture applied in situations that give multiple outputs for a single input.
• Image captioning – here, let's say we have an image for which we need a textual description. So we have a single input – the image – and a series or sequence of words as output. The image might be of a fixed size, but the output is a description of varying length.
Many to One
• It takes a sequence of information as input and produces a fixed-size output.
• The Many-to-One RNN architecture is commonly seen in sentiment analysis models. As the name suggests, this kind of model is used when multiple inputs are required to give a single output.
• Take, for example, a Twitter sentiment analysis model. In that model, a text input (words as multiple inputs) gives its fixed sentiment (single output).
• Another example is a movie rating model that takes review texts as input to provide a rating to a movie that may range from 1 to 5.
Many-to-Many
• The Many-to-Many RNN architecture takes multiple inputs and gives multiple outputs.
• Ex: language translation
• The input is a sentence with many words -> the output is a sentence with many words.
Problem of Long-Term Dependencies
• Problem of vanishing gradients:
• Multiplying two small numbers (gradients) results in an even smaller number (gradient).
• It becomes harder and harder for the neurons to propagate the error to the earlier stages.
• Hence the parameters will be biased to capture only short-term dependencies.
• RNNs predict the next word in a sequence based on the relevant information in the distant past.
• If the distance between the distant past and the current time step is small, RNNs predict the next word correctly.
• As the sequence length increases, RNNs won't be able to remember the relevant information in the distant past and predict the next word.
• This is common in real-life use cases with long sequences.
• This is due to the vanishing gradient problem.
A solution to the vanishing gradient problem of RNNs:

• Keep track of the long-term dependencies by using "gates".
• These "gates" control which information from the "distant past" should be passed through the network to update the current cell state.
• The most commonly used variants of RNN which are capable of remembering long-term dependencies using "gated cells" are the LSTM (Long Short-Term Memory) and the GRU (Gated Recurrent Unit).
• The "gates" perform different tensor operations to decide which information can be removed from or added to the current hidden state.
Problems of RNN
• Recurrent neural networks suffer from short-term memory.
• If a sequence is long enough, they'll have a hard time carrying information from earlier time steps to later ones. So when processing a paragraph of text to make predictions, RNNs may leave out important information from the beginning.
• During backpropagation, recurrent neural networks suffer from the vanishing gradient problem.
• Layers that get a small gradient update stop learning.
• Those are usually the earlier layers.
• Because these layers don't learn, RNNs can forget what they have seen in longer sequences, thus having a short-term memory.
LSTMs and GRUs as a solution
• LSTMs and GRUs were created as a solution to short-term memory. They have internal mechanisms called gates that can regulate the flow of information.
• These gates can learn which data in a sequence is important to keep or throw away.
• By doing that, they can pass relevant information down the long chain of sequences to make predictions.
• LSTMs and GRUs can be found in speech recognition, speech synthesis, and text generation. You can even use them to generate captions for videos.
Recap of RNN
https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
• How does a cell in an RNN calculate the hidden state? (Short-term memory)
• It combines the current input and the previous hidden state into a vector.
• This vector has information on the current input and the previous inputs.
• The vector goes through the tanh activation, and the output is the new hidden state, or the memory of the network.
Recap of RNN
• The tanh activation is used to help regulate the values flowing through the network.
• It squishes values to always be between -1 and 1.
• RNNs require fewer computational resources,
• but they work well only for shorter sequences.
LSTM
• An LSTM has a similar control flow to a recurrent neural network.
• It processes data, passing on information as it propagates forward.
• The differences are the operations within the LSTM's cells.
• LSTMs keep short-term memory in "hidden states" and long-term memory in "cell states".
• The core concepts of the LSTM are the cell state and its various gates.
• The cell state transfers relevant information all the way down the sequence chain.
• It helps to preserve the "long-term memory" of the network.
• The cell state helps information from the earlier time steps make its way to later time steps, reducing the effects of short-term memory.
• As the cell state goes on its journey, information gets added to or removed from the cell state via gates.
• The gates are different neural networks that decide which information is allowed on the cell state.
• The gates can learn what information is relevant to keep or forget during training.
LSTM…
• Gates use the sigmoid activation function to update or forget data.
• Sigmoid squishes its input values to between 0 and 1.
• If sigmoid squishes its input X closer to 0, then X is forgotten.
• If sigmoid squishes its input X closer to 1, then X is kept.
• Using sigmoid, the network can learn which data is unimportant and can therefore be forgotten, and which data is important to keep.
• Three different gates regulate information flow in an LSTM cell: a forget gate, an input gate, and an output gate.
LSTM

Forget Gate:
• This gate decides what information should be thrown away or kept.
• Information from the previous hidden state and information from the
current input is passed through the sigmoid function.
• Values come out between 0 and 1.
• The closer to 0 means to forget, and the closer to 1 means to keep.
Input gate:
• The input gate has 2 layers:
• a "tanh layer" that generates a vector of new candidate information that could be written to the cell state, and
• a "sigmoid layer" that decides which parts of the tanh output should be kept.
Cell State :
• Now we should have enough information to calculate the cell state.
• First, the cell state gets pointwise multiplied by the forget vector.
• This has a possibility of dropping values in the cell state if it gets
multiplied by values near 0.
• Then we take the output from the input gate and do a pointwise
addition which updates the cell state to new values that the neural
network finds relevant.
• That gives us our new cell state.
To summarize the LSTM:
• The "forget gate", with a sigmoid function, decides which information from previous steps should be forgotten.
• The "input gate", along with a tanh and a sigmoid function, decides what new information is added to the cell state.
• The cell state is updated using the outputs of the previous two gates.
• The "output gate" then uses a sigmoid layer together with a tanh of the updated cell state to decide which parts of the cell state are output to the hidden state.
• To review,
• the forget gate decides what is relevant to keep from prior steps,
• the input gate decides what information is relevant to add from the current step, and
• the output gate determines what the next hidden state should be.
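The gate interactions summarized above can be condensed into a few lines; this is an illustrative single-step NumPy sketch (the weight/bias names and dictionary layout are choices made for readability, not taken from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b are dicts keyed by 'f', 'i', 'g', 'o'."""
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate values (tanh layer)
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    c_t = f * c_prev + i * g          # update long-term memory (cell state)
    h_t = o * np.tanh(c_t)            # new short-term memory (hidden state)
    return h_t, c_t

# toy usage: input dim 4, hidden dim 3
n_x, n_h = 4, 3
W = {k: np.random.rand(n_h, n_x) for k in "figo"}
U = {k: np.random.rand(n_h, n_h) for k in "figo"}
b = {k: np.zeros(n_h) for k in "figo"}
h, c = lstm_step(np.random.rand(n_x), np.zeros(n_h), np.zeros(n_h), W, U, b)
print(h, c)
```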
Building blocks of LLMs
• Transformers:
• Transformers are a type of machine learning model utilizing attention as the primary learning mechanism. [1] "Attention is all you need" https://dl.acm.org/doi/10.5555/3295222.3295349
• Transformers quickly became the state of the art for sequence-to-sequence tasks such as language translation.
• Applications: language translation
• "An Image is Worth 16x16 Words" modified the transformer put forth in [1] to solve image classification tasks, creating the Vision Transformer (ViT).
• https://www.semanticscholar.org/reader/268d347e8a55b5eb82fb5e7d2f800e33c75ab18a
• https://medium.com/@weiwen21/an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale-957f88e53726

• Building blocks of an LLM:
• Tokenisation
• Embeddings
• Attention Mechanism
1. TOKENISATION

Tokenisation
• This initial step breaks down input data into smaller units or tokens.
• For text, tokens could be words, subwords, or characters.
• In image processing, tokens are pixel groups,
• in video processing, tokens represent frames or segments.
• Sentence : "I love apples."
• It breaks down into three tokens: "I," "love," and "apples."
• Tokenization is a crucial process because it converts raw data into a format that
can be processed by the model.
• Different tokenization methods exist for various data types. For instance, Byte-
Pair Encoding (BPE) is commonly used for text, while Vision Transformers (ViT)
use specific methods for patch tokenization.

Tokenisation methods for text
• Word Tokenization:
• breaks text into individual words based on a delimiter.
• can struggle with a large vocabulary size.
• Character Tokenization:
• breaks text down into individual characters.
• it drastically reduces the vocabulary size, but fails to capture the semantic meaning of longer
word sequences.
• Subword Tokenization:
• strikes a balance between word and character tokenization.
• It breaks text down into subwords, which are larger than individual characters but smaller
than whole words.
• It keeps frequently used words in their whole form but breaks rare words down into more
meaningful subwords.

Tokenisation…
• Byte-Pair Encoding (BPE) is a subword tokenization algorithm.
  https://medium.com/@hsinhungw/understanding-byte-pair-encoding-fd196ebfe93f
• It is a compression algorithm used in Natural Language Processing (NLP) to
represent a large vocabulary with a small set of subword units.
• widely used in various NLP tasks such as machine translation, text classification,
and text generation.
• It iteratively merges the most frequent pair of consecutive bytes or characters in
a text corpus until a predefined vocabulary size is reached.
• The resulting subword units can be used to represent the original text in a more
compact and efficient way.

Ex of Byte Pair Encoding
• Given text : “low low low low low lower lower newest newest
newest newest newest newest widest widest widest”
• Get the list of unique words with frequency:
(low_: 5, lower_: 2, newest_: 6, widest_: 3)
• Construct the base vocabulary:
vocabs = (l, o, w, e, r, n, s, t, i, d, _)
• Represent the Words with Base Vocabs
((l, o, w, _): 5, (l, o, w, e, r, _): 2, (n, e, w, e, s, t, _): 6, (w, i, d, e, s, t, _):
3)

• Vocabulary Merging
Iteratively merge the most frequent pairs of symbols:
• Merge 1:
• Merge the most frequent pair (e, s), which occurs 6 + 3 = 9 times, to form the newly merged
symbol ‘es’.
• Update the vocabulary and replace every occurrence of (e, s) with ‘es’:
• vocabs = (l, o, w, e, r, n, s, t, i, d, _, es)
• ((l, o, w, _): 5, (l, o, w, e, r, _): 2, (n, e, w, es, t, _): 6, (w, i, d, es, t, _): 3)
• Merge 2:
• Merge the most frequent pair (es, t), which occurs 6 + 3 = 9 times, to form the newly merged
symbol ‘est’.
• Update the vocabulary and replace every occurrence of (es, t) with ‘est’
• vocabs = (l, o, w, e, r, n, s, t, i, d, _, es, est)
• ((l, o, w, _): 5, (l, o, w, e, r, _): 2, (n, e, w, est, _): 6, (w, i, d, est, _): 3)

• Merge 3:
• Merge the most frequent pair (est, _), which occurs 6 + 3 = 9 times, to form
the newly merged symbol ‘est_’.
• Update the vocabulary and replace every occurrence of (est, _) with ‘est_’:
• vocabs = (l, o, w, e, r, n, s, t, i, d, _, es, est, est_)
• ((l, o, w, _): 5, (l, o, w, e, r, _): 2, (n, e, w, est_): 6, (w, i, d, est_): 3)
• Merge 4:
• Merge the most frequent pair (l, o), which occurs 5 + 2 = 7 times, to form the
newly merged symbol ‘lo’.
• Update the vocabulary and replace every occurrence of (l, o) with ‘lo’
• vocabs = (l, o, w, e, r, n, s, t, i, d, _, es, est, est_, lo)
• ((lo, w, _): 5, (lo, w, e, r, _): 2, (n, e, w, est_): 6, (w, i, d, est_): 3)

• Merge 5:
• Merge the most frequent pair (lo, w), which occurs 5 + 2 = 7 times, to form
the newly merged symbol ‘low’.
• Update the vocabulary and replace every occurrence of (lo, w) with ‘low’
• vocabs = (l, o, w, e, r, n, s, t, i, d, _, es, est, est_, lo, low)
• ((low, _): 5, (low, e, r, _): 2, (n, e, w, est_): 6, (w, i, d, est_): 3)

• Final Vocabulary & Merge Rules
• Continue merging until reaching the desired vocabulary size.
• The final vocabulary and merge rules after our five merges would be:
• vocabs = (l, o, w, e, r, n, s, t, i, d, _, es, est, est_, lo, low)
• (e, s) → es, (es, t) → est, (est, _) → est_, (l, o) → lo, (lo, w) → low
• Using the constructed vocabulary and learned rules, tokenise the text "newest binded lowers".
• Pre-tokenise and append the end-of-word symbol:
• (newest_, binded_, lowers_)
• Apply the merge rules:
• Break down the pre-tokenised text into characters and apply the merge rules in that order.
• ((n, e, w, e, s, t, _), (b, i, n, d, e, d, _), (l, o, w, e, r, s, _))
• Rules: (e, s) → es, (es, t) → est, (est, _) → est_, (l, o) → lo, (lo, w) → low
• Any token not in the vocabulary will be replaced by an unknown
token “[UNK]”
• vocabs = (l, o, w, e, r, n, s, t, i, d, _, es, est, est_, lo, low)
• ((n, e, w, est_), ([UNK], i, n, d, e, d, _), (low, e, r, s, _))

• Result of tokenisation:
• The new text is tokenized into the following sequence:
• “newest binded lowers” =
• [n, e, w, est_, [UNK], i, n, d, e, d, _, low, e, r, s, _]

Steps of BPE
1. Initialize the vocabulary with all the bytes or characters in the text corpus.
2. Calculate the frequency of each byte or character in the text corpus.
3. Repeat the following steps until the desired vocabulary size is reached:
• Find the most frequent pair of consecutive bytes or characters in the text corpus.
• Merge the pair to create a new subword unit.
• Update the frequency counts of all the bytes or characters that contain the merged pair.
• Add the new subword unit to the vocabulary.
4. Represent the text corpus using the subword units in the vocabulary.
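These steps can be sketched in Python. The following is an illustrative implementation (the data structures and tie-breaking order are implementation choices, not prescribed by the slides), run on the word frequencies from the earlier worked example:

```python
from collections import Counter

def byte_pair_encoding(word_freqs, num_merges):
    """Learn BPE merge rules from {word (as a tuple of symbols): frequency}."""
    merges = []
    for _ in range(num_merges):
        # count frequencies of adjacent symbol pairs across all words
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)   # most frequent pair
        merges.append(best)
        # replace every occurrence of the best pair with the merged symbol
        merged = {}
        for word, freq in word_freqs.items():
            new_word, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    new_word.append(word[i] + word[i + 1])
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            merged[tuple(new_word)] = freq
        word_freqs = merged
    return merges, word_freqs

# corpus from the worked example: low x5, lower x2, newest x6, widest x3 ("_" marks end of word)
corpus = {tuple("low_"): 5, tuple("lower_"): 2, tuple("newest_"): 6, tuple("widest_"): 3}
rules, vocab_words = byte_pair_encoding(corpus, num_merges=5)
print(rules)   # reproduces the merge rules above: (e,s), (es,t), (est,_), (l,o), (lo,w)
```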
Ex of BPE
• Suppose the corpus has 4 words : “ab”, “bc”, “bcd”, and “cde”.
• Initial vocabulary has : {“a”, “b”, “c”, “d”, “e”}.

AutoTokenizer
Automatically chooses the tokeniser for the LLM.
1. Install “Transformers” library
2. Import AutoTokeniser

3. Use AutoTokenizer to load the tokenizer for GPT-2.

4. Pass the text to be tokenized :

5. The tokenizer function breaks the text into tokens and converts them into
numbers.

6. Decode text:

Output : I love reading books
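Put together, the enumerated steps might look like the following sketch (the GPT-2 checkpoint name "gpt2" and the sample sentence follow the slides; the exact token IDs printed depend on the tokenizer):

```python
# pip install transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # step 3: load GPT-2's tokenizer

text = "I love reading books"                       # step 4: text to be tokenized
encoded = tokenizer(text)                           # step 5: break into tokens and convert to numbers
print(encoded["input_ids"])                         # numeric IDs for each token
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # the tokens themselves

decoded = tokenizer.decode(encoded["input_ids"])    # step 6: decode back to text
print(decoded)                                      # -> "I love reading books"
```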

Image Tokenisation
• Transformers operate on a sequence of tokens; in NLP, this is
commonly a sentence of words.
• The ViT converts an image to tokens such that each token represents
a local area — or patch — of the image.

Image Tokenisation..
• The input image is divided into fixed-size, non-overlapping patches.
• An image of height H, width W, and C channels is split into N = HW/P² tokens with patch size P.
• Each token is a flattened patch of length P²∗C.
Patch Tokenisation : Example
• Ref : https://towardsdatascience.com/vision-transformers-explained-
a9d07147e4c8
• Original image size : 60 x 100x 3
• cmap of its grayscale image :

Patch Tokenisation : Example
• Image after Patch tokenisation with P = 20:
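A minimal NumPy sketch of this patch tokenisation (the reshape-based implementation is an assumption; only the input size 60 x 100 x 3 and P = 20 come from the example):

```python
import numpy as np

def patchify(image, patch_size):
    """Split an H x W x C image into non-overlapping patches and flatten each
    patch into a token of length patch_size**2 * C, as in ViT tokenisation."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image dimensions must be divisible by P"
    # reshape into a grid of patches, then flatten each patch into one row
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)

image = np.random.rand(60, 100, 3)        # same size as the example image
tokens = patchify(image, patch_size=20)
print(tokens.shape)                        # (15, 1200): N = (60/20)*(100/20) = 15 tokens of length 20*20*3
```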

2. EMBEDDINGS

Embeddings
• These are high-dimensional vectors representing tokens in a way that
captures their semantic meaning and relationships.
• Embeddings enable LLMs to understand context and nuances in data,
whether it’s text, images, or videos.
• The quality of embeddings significantly impacts the performance of
LLMs.

• Embeddings allow models to understand not just the identity of a token but also its relationships with other tokens.
• For example, there should be a significant distance between the embeddings of two words with the same name but different meanings (apple, the fruit vs. Apple, the company).
Types of embeddings
1.Uni-modal Embeddings:
Generated from a single type of input data (e.g., text), capturing the
semantic context within that modality.
• Uni-modal embeddings are used in tasks specific to one type of data.
• Ex: text embeddings are used in NLP tasks like text classification,
sentiment analysis, and machine translation.
• Ex:image embeddings are used in tasks like object detection and
image classification.

2. Multi-modal Embeddings:
Generated from multiple types of input data (e.g., text and images),
• capturing the relationships and interactions across different
modalities.
• Multi-modal embeddings are crucial for tasks that require
understanding the interplay between different types of data.
• Ex: in a video with subtitles, multi-modal embeddings can help the
model understand the relationship between the visual content and
the accompanying text.
• This capability is essential for tasks like video captioning and cross-
modal retrieval.

Embedding Techniques/Approaches
• Early techniques: count-based/frequency-based – one-hot encoding, Bag of Words (BoW) and TF-IDF for text data.
• Limitations:
1. Unable to capture semantic relationships.
• Ex: tokens with the same semantics end up with unrelated embeddings.
2. Sparse representations.

• Modern techniques of word embedding address these issues.
• Modern techniques:

1. Word2Vec:
• It represents each distinct word with a dense vector of decimal values/numbers
• It uses a neural network model to learn word associations from a large corpus of
text.
• Once trained, the network can detect similar words and also predict the
surrounding words in a sentence.
• The network learns to associate words that are semantically similar with similar
vector representations.
• Allows vector arithmetic on word embeddings
• Ex: King – man + woman = queen
• USA – Washington D C + Delhi = India

Ex:
• Columns are vocabulary of a story about a king and queen
• Rows are few words from the story

• There are two main variants of Word2Vec:
• Continuous Bag-of-Words (CBOW):
• The CBOW model predicts the current (target) word based on the surrounding words in a sentence.
• Ex: the model might be trained to predict the word "cat" given the surrounding words "the" and "dog".
• Skip-gram:
• The skip-gram model predicts the surrounding words based on the current word.
• Ex: the model might be trained to predict the words "the" and "dog" given the word "cat".
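For reference, a small Word2Vec model can be trained with the gensim library (gensim, the toy sentences, and the hyperparameter values are assumptions, not from the slides); the `sg` flag switches between CBOW and skip-gram:

```python
# pip install gensim
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "ruled", "the", "kingdom"],
    ["the", "queen", "ruled", "the", "kingdom"],
    ["the", "dog", "chased", "the", "cat"],
]

# sg=0 selects CBOW, sg=1 selects skip-gram
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["king"].shape)                 # dense 50-dimensional vector for "king"
print(model.wv.similarity("king", "queen"))   # cosine similarity between two word vectors
```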
Applications: Find whether two given words are similar/dissimilar from
their vector representations
https://www.youtube.com/watch?v=Q95SIG4g7SA

• To find the cosine similarity between two vectors A and B
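Cosine similarity is cos(A, B) = (A · B) / (‖A‖ ‖B‖); a minimal sketch (the example vectors are arbitrary):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(A, B) = (A . B) / (||A|| * ||B||); close to 1 for similar directions."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.2, 0.8, 0.5])
b = np.array([0.1, 0.9, 0.4])
print(cosine_similarity(a, b))
```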

CBOW
Ref : https://www.youtube.com/watch?v=Q95SIG4g7SA

Creates word embeddings using the steps:

With window size of 5, target is the centre word

One hot encoded values of
dimension 7 – size of the
vocabulary

• The size of the embedding vector for each word is a chosen hyperparameter; in this example it is set equal to the window size.
• The following shows the embedding for "iNeuron".
• 2. GloVe:
• GloVe works by creating a co-occurrence matrix.
• The co-occurrence matrix is a table that shows how often two words appear
together in a corpus of text.
• Ex: the co-occurrence matrix for the words “cat” and “dog” would show how
often the words “cat” and “dog” appear together in a corpus of text.
• GloVe then uses machine learning algorithm which learns to associate words
that appear together frequently with similar vector representations.

3. FastText:
• An extension of Word2Vec that captures the meaning of shorter
words and affixes.
• FastText represents words as bags of character n-grams.
• This approach helps the model understand the meanings of words by
considering their subword information, making it effective in handling
rare and out-of-vocabulary words.

3. ATTENTION MECHANISM

Seq-to-seq models without attention
• https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

• A sequence-to-sequence model is a model that takes a sequence of items (words, letters, features of an image, etc.) and outputs another sequence of items.
• A trained model would work like this:
• In neural machine translation, a sequence is a series of words,
processed one after another.
• The output is, likewise, a series of words:

• The neural machine translation model is composed of an encoder and a decoder.
• The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context).
• After processing the entire input sequence, the encoder sends the context over to the decoder.
• The decoder then begins producing the output sequence item by item.
• The context is a vector (an array of numbers).
• The encoder and the decoder are each an RNN. The size of the context vector is the number of hidden units in the encoder.
• In real-world applications, the vector size is 256, 512 or 1024.
• Words are converted to vectors using embedding algorithms

• Recap of RNN:

• As both encoder and decoder are RNN, at each time step one of the RNNs does
some processing, it updates its hidden state based on its inputs and previous
inputs it has seen.
• The last hidden state of the encoder is the context sent to the decoder

• The decoder also maintains a hidden state that it passes from one time step to
the next.

• Seq-to-seq models using RNNs in the encoder and decoder components are unable to handle longer sequences.
• A solution to this problem is to use "attention", which greatly improved the quality of machine translation systems.
• Attention allows the model to focus on the relevant parts of the input sequence as needed.
Attention Mechanism
• an attention mechanism allows a model to focus on different parts of the input
data with varying degrees of importance
• These mechanisms assign different weights to the embeddings of tokens based
on their relevance to the context
• This allows the model to focus on important elements and improves its
understanding and generation capabilities.
• The attention mechanism enables models to handle long-range dependencies in
data.
• In sequences where certain tokens are more relevant than others, the attention
mechanism helps the model focus on these critical tokens, thereby enhancing the
overall performance.

Attention in text
• Ex: in a sentence like "The captain, against the suggestions of his crew, chose to save the pirate because he was touched by his tale,"
• the words "captain," "save," and "pirate" are key to understanding the meaning.
• The attention mechanism would allocate higher weights to these words, enhancing the model's comprehension.
Attention in video interpretation
• in a video of a bustling cityscape,
• the attention mechanism might assign higher weights to the tokens
representing the main subjects of the video,
• such as a prominent building, a moving car, or a person interacting with
others.
• At the same time, it might assign lower weights to the tokens
representing the background or less significant elements,
• the sky, stationary objects, or the general crowd.
• allows the model to understand the continuity and relationship
between different parts of the video,
• such as the movement of the car from one frame to another or the
interaction of the person throughout the video.

Neural Machine Translation with attention
• At time step 7, the attention mechanism enables the decoder to focus on the word "étudiant" ("student" in French) before it generates the English translation.
• This ability to amplify the signal from the relevant part of the input sequence makes attention models produce better results than models without attention.
• An attention model differs from a classic sequence-to-sequence
model in two main ways:
1. the encoder passes a lot more data to the decoder. Instead of
passing the last hidden state of the encoding stage,
the encoder passes all the hidden states to the decoder:

2. An attention decoder does an extra step before producing its output. At a decoding time step 't', the attention decoder focuses on the parts of the input that are relevant to this decoding time step.
• The decoder does the following:
• Looks at the set of encoder hidden states it received – each encoder hidden state is most associated with a certain word in the input sentence.
• Assigns each hidden state a score (called the attention weight).
• Multiplies each hidden state by its softmaxed score, thus amplifying hidden states with high scores and drowning out hidden states with low scores.
• This scoring exercise is done at each time step on the decoder side.
Putting all together:
1. The attention decoder RNN takes in the embedding of the <END> token, and
an initial decoder hidden state.
2. The RNN processes its inputs, producing an output and a new hidden
state vector (h4). The output is discarded.
3. Attention Step: calculate a context vector (C4) for this time step, using
the encoder hidden states and the h4 vector.
4. concatenate h4 and C4 into one vector.
5. pass this vector through a feedforward neural network (one trained jointly with
the model).
6. The output of the feedforward neural networks indicates the output word of
this time step.
7. Repeat for the next time steps
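A minimal sketch of the attention step (step 3 above). Dot-product scoring is assumed here; other scoring functions (additive, general) are also used in practice:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_hidden, encoder_hiddens):
    """Score each encoder hidden state against the current decoder hidden state,
    softmax the scores, and return the weighted sum as the context vector."""
    scores = encoder_hiddens @ decoder_hidden        # one score per input position
    weights = softmax(scores)                        # attention weights sum to 1
    context = weights @ encoder_hiddens              # weighted sum of encoder states
    return context, weights

encoder_hiddens = np.random.rand(5, 8)   # 5 input positions, hidden size 8
h4 = np.random.rand(8)                   # current decoder hidden state (h4 in the steps)
C4, attn = attention_context(h4, encoder_hiddens)
print(C4.shape, attn.round(2))           # (8,) context vector and 5 attention weights
```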

• in a particular decoding time step ‘t’, at which part of the input
sequence the decoder pays attention to?

• The model doesn't blindly align the first input word with the first output word.
• It learns from the training phase how to align the word pairs in French-English.
• The model paid attention correctly when outputting "European Economic Area".
• In French, the order of these words is reversed ("européenne économique zone") as compared to English.
• Every other word in both sentences is in a similar order.
Building blocks of LLMs ..contn…
• Transformers :
• transformers are a type of machine learning model utilizing attention as the
primary learning mechanism. [1] “Attention is all you need”
• Transformers quickly became the state of the art for sequence-to-sequence
tasks such as language translation.
• Applications: Language Translation:

• The encoding and decoding components are each a stack of 6 layers.
• The output of the last encoder is the input to the decoders.
• The encoders are all identical in structure (yet they do not share weights).
• Each one is broken down into two sub-layers
• The encoder’s inputs first flow through a
self-attention layer – a layer that helps the
encoder look at other words in the input sequence as it
encodes a specific word.
• The outputs of the self-attention layer are fed
to a feed-forward neural network.
• The exact same feed-forward network is
independently applied to each position.

• The decoder has both of those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).
• The input to each encoder is a vector of size 512 for each word.
• The input to the bottom-most encoder is the embedding vector of each word.
• The input to every other encoder is the output vector of the encoder directly below it.
• Each word embedding flows through an independent path through the two layers of the encoder (the input is processed in parallel, unlike in an RNN).
• A specific property of the transformer:
• The word in each position flows through its own path in the encoder.
• There are dependencies between these paths in the self-attention layer.
• The feed-forward layer does not have those dependencies, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
• Ex: a smaller sentence:
Self Attention for a single word
How to find the attention weight for a word in a particular position?
• Assume input is “Thinking Machines”
• The model learns three weight matrices W^Q, W^K and W^V during the training phase.
STEP 1:
1. The input to the encoder is the embedding vectors of the words, X1 and X2, of dimension 512.
2. For each embedding vector, the model calculates three vectors called query, key and value, of dimension 64:
• q1 = X1 * W^Q
• k1 = X1 * W^K
• v1 = X1 * W^V
The query, key and value vectors are abstractions used to think about and calculate the attention weights/scores.
STEP 2: Calculate the attention weight of all words.
• For the first word “Thinking”, find the score of each word against this word.
• The score determines how much focus to place on other parts of the input sentence as the
model encodes a word at a certain position.
• Multiply the query vector of this word q1, with the key vector of all other words k1 and k2
using dot product.
• the first score would be the dot product of q1 and k1.
• The second score would be the dot product of q1 and k2.

STEPS 3 and 4:

• Divide the scores by 8 (the square root of the dimension of the key vectors used in the paper – 64). This leads to more stable gradients.
• Pass the result through a softmax operation. Softmax normalizes the scores so they're all positive and add up to 1.
• This softmax score determines how much each word will be expressed at this position.
• Clearly the word at this position will have the highest softmax score, but sometimes it's useful to attend to another word that is relevant to the current word.
STEP 5:
• multiply each value vector by the softmax score (in preparation to
sum them up).
• The intuition here is to keep intact the values of the word(s) the
model has to focus on, and drown-out irrelevant words (by
multiplying them by tiny numbers like 0.001, for example).
STEP 6:
• sum up the weighted value vectors.
• This produces the output of the self-attention layer at this position
(for the first word).

• The resulting vector is sent to the feed-forward neural network.
• In the actual implementation, however, this calculation is done in matrix form for faster processing.
Self-attention for all words
• The self-attention weights of all words in the input sequence are calculated using matrices.
• Create a matrix X, where each row of X is the embedding vector of one word of the input sequence.
• Get the Q, K and V matrices using X and the weight matrices W^Q, W^K and W^V.
• Steps 2 through 6 for word-level attention can then be condensed into one step using matrices:
• Z = softmax(QKᵀ / √d_k) V
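A NumPy sketch of this matrix form (toy sizes; the random matrices stand in for learned weights):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention: steps 1-6 condensed into matrix form.
    X has one row per token; W_Q, W_K, W_V are the learned projection matrices."""
    Q = X @ W_Q                      # queries
    K = X @ W_K                      # keys
    V = X @ W_V                      # values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # steps 2-3: dot products, scaled
    weights = softmax(scores)        # step 4: softmax over each row
    return weights @ V               # steps 5-6: weighted sum of value vectors

# toy sizes: 2 tokens ("Thinking", "Machines"), embedding dim 512, q/k/v dim 64
X = np.random.rand(2, 512)
W_Q, W_K, W_V = (np.random.rand(512, 64) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)                       # (2, 64): one output vector per token
```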
Multi-headed attention
• Transformers refined the self-attention layer by adding a mechanism
called “multi-headed” attention.
• Original paper uses 8-headed attention mechanism
• Advantages of multi-headed attention:
• It expands the model’s ability to focus on different positions.
• In the previous example, z1 contains a little bit of every other encoding, but it could be
dominated by the actual word itself
• It gives the attention layer multiple “representation subspaces”
• with multi-headed attention we have multiple sets of Query/Key/Value weight
matrices (the Transformer uses eight attention heads, so we end up with eight
sets for each encoder/decoder).

• Each of these sets is randomly initialized.
• after training, each set is used to project the input embeddings (or vectors
from lower encoders/decoders) into a different representation subspace.

• Repeating this just eight different times with different weight matrices, ends up
with eight different Z matrices for a word/embedding vector.

• As the feed forward layer takes only one vector for a word,
concatenate the z vectors of a word.
• Hence concatenate all z matrices of all words and multiply them by a
weight matrix WO

• When ‘it’ is encoded, attention heads focus on “the animal” and
“tired”.

Adding positional embeddings
• Model learns the order of the words using positional encodings

Encoding layers with normalization

• Normalised outputs go to the decoder also

Decoding side
• The output of the top encoder is then transformed into a set of attention vectors K and V.
• These are to be used by each decoder in its “encoder-decoder attention” layer which helps the
decoder focus on appropriate places in the input sequence
• https://jalammar.github.io/illustrated-transformer/

• repeat the following process until a special symbol is reached indicating the transformer decoder
has completed its output.
• The output of each step is fed to the bottom decoder in the next time step
• The decoders bubble up their decoding results just like the encoders did.
• Word embeddings and positional encodings are the decoder inputs to indicate the position of
each word.

• In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence.
• This is done by masking future positions (setting them to −inf) before the softmax step in the self-attention calculation.
• The "Encoder-Decoder Attention" layer works just like multi-headed self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrices from the output of the encoder stack.
• The decoder stack outputs a vector of floats.
• The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders into a much, much larger vector called a logits vector.
• If the model learns 10,000 unique words from the training data, then the logit vector has 10,000
cells.
• Each cell is the logit score of a unique word
• The softmax layer then turns those scores into probabilities
• The cell with the highest probability is chosen, and the word associated is the decoder output for
this time step.

• In the first phase of training, the weight matrices are initialised to random values.
• Assume we want the output to be a probability distribution indicating the word “thanks”. But since this
model is not yet trained, that’s unlikely to happen just yet.
• The untrained model produces an output :

• Find the loss as the difference between the two probability distributions (e.g., using cross-entropy).
• Adjust the weights and retrain.
• After training the model for enough time on a large enough dataset,
the produced probability distributions would look like this:

VARIANTS OF TRANSFORMER ARCHITECTURE – BERT AND GPT

Variants of the transformer architecture – BERT and GPT
• The original transformer architecture has both encoder and decoder.
o the encoder encodes an input sequence in the form of a vector of numbers —
a low-level format that is understood by machines
o the decoder takes the encoded sequence and by applying a language
modeling task, it generates a new sequence.
• Encoders and decoders can be used individually for specific tasks.
• The two famous models deriving their parts from the original Transformer are
o BERT (Bidirectional Encoder Representations from Transformer) consisting of encoder blocks
o GPT (Generative Pre-Trained Transformer) composed of decoder blocks.

BERT and GPT
• Encoder models (e.g. BERT) predict tokens based on the context
from both sides
o Example: “The bank is situated on the _______ of the river.”

• Decoder models (e.g. GPT) only use the previous context to


generate text
o Example : Mixing red and blue produces ___________

BERT - Introduction
• BERT, an acronym for Bidirectional Encoder Representations from Transformers, is an open-source machine learning framework designed for tasks in NLP.
• Designed by researchers from Google AI in 2018.
• It is a transformer-based neural network used to understand and generate human-like language.
• It employs an encoder-only architecture.
• This is to emphasise understanding the input sequence rather than generating an output sequence.
• In the original Transformer architecture, there are both encoder and decoder modules.
Special feature of BERT
• Traditional language models process text sequentially, either from left
to right or right to left.
• This method limits the model’s awareness to the immediate context
preceding the target word.
• BERT uses a bi-directional approach considering both the left and
right context of words in a sentence,
• instead of analyzing the text sequentially, BERT looks at all the words
in a sentence simultaneously(using the attention mechanism from the
transformer encoder)

Example: "The bank is situated on the _______ of the river."
• In a unidirectional model, the target word is heavily dependent on the preceding words,
• so the model struggles to understand whether "bank" refers to a financial institution or the side of the river.
• BERT, being bidirectional, simultaneously considers both the left context ("The bank is situated on the") and the right context ("of the river").
• This enables a more nuanced understanding.
• It comprehends that the missing word is likely related to the geographical location of the bank,
• which demonstrates the contextual richness of bidirectional models.
BERT Pre-training and Fine Tuning
• The BERT model undergoes a two-step process:
1. Pre-training on Large amounts of unlabeled text to learn contextual
embeddings.
2. Fine-tuning on labeled data for specific NLP tasks.

1. Pre-training
• BERT is pre-trained on a large amount of unlabeled text data.
• The model learns contextual embeddings.
• Contextual embeddings are representations of words that take into account their surrounding context in a sentence.
• BERT is trained with several unsupervised pre-training tasks:
1. the Masked Language Model (MLM) task – learn to predict missing words in a sentence;
2. understand the relationship between two sentences;
3. predict whether the second sentence in a pair follows the first (Next Sentence Prediction).
Fine-tuning
• After the pre-training phase, the BERT model, uses the contextual
embeddings, to get fine-tuned for specific NLP tasks.
• It adapts its general language understanding to complete more
targeted applications.
• Fine tuning is done using labeled data specific to the downstream
tasks of interest.
• These tasks are :
• sentiment analysis, question-answering, named entity recognition, or any
other NLP application.
• The model’s parameters are adjusted to optimize its performance for the
particular task.

How BERT works?
• BERT is designed to generate a language model so, only the encoder
mechanism is used.
• Sequence of tokens are fed to the Transformer encoder.
• These tokens are first embedded into vectors and then processed in the
neural network.
• The output is a sequence of vectors, each corresponding to an input token,
providing contextualized representations.
• When training language models, defining a prediction goal is a challenge.
• Many models predict the next word in a sequence, which is a directional approach
and may limit context learning.
• BERT addresses this challenge with two innovative training strategies:
 Masked Language Model (MLM)
 Next Sentence Prediction (NSP)

Masked Language Model (MLM)
• 15% of the words in each input sequence are replaced with a [MASK] token.
• The model then attempts to predict the original value of the masked words,
based on the context provided by the other, non-masked, words in the sequence.
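As an illustrative sketch (the checkpoint name "bert-base-uncased" is an assumption, not from the slides), a pre-trained BERT can be queried for masked-word predictions with the Hugging Face fill-mask pipeline:

```python
# pip install transformers torch
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
predictions = unmasker("The bank is situated on the [MASK] of the river.")

for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))   # top candidate words with their probabilities
```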

Architecture of MLM:
1. Adding a classification layer on top of the encoder output.
• This layer is crucial for predicting the masked words.
2. Multiplying the output vectors by the embedding matrix,
transforming them into the vocabulary dimension.
• This step helps align the predicted representations with the vocabulary space.
3. Calculating the probability of each word in the vocabulary with
softmax.
• For every masked position, this step generates a probability distribution over
the entire vocabulary

• The loss function used during training considers only the prediction of
the masked values.
• The model is penalized for the deviation between its predictions and
the actual values of the masked words.
• The model converges slower than directional models.
• This is because, during training, BERT is only concerned with predicting the
masked values, ignoring the prediction of the non-masked words.
• The increased context awareness achieved through this strategy compensates
for the slower convergence.

Next sentence prediction (NSP)
• Preparing the training data:
• For 50% of the training data, the second sentence is the actual subsequent sentence of the first.
• For the other 50% of the training data, the first and second sentences are not related.
• To distinguish the two sentences in the training set, the input is processed as follows:
1. A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
2. A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings, with a vocabulary of 2.
3. A positional embedding is added to each token to indicate its position in the sequence.
• To predict whether the second sentence is connected to the first:
1. The entire input sequence goes through the Transformer model.
2. The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases).
3. The probability of whether the second sentence follows the first is calculated using softmax.

• Both MLM and NSP can also be trained together.
BERT architecture
• The architecture of BERT is a multilayer bidirectional transformer
encoder which is quite similar to the transformer model.
• A transformer architecture is an encoder-decoder network that
uses self-attention on the encoder side and attention on the decoder
side.
• Two architectures:

Differences between BERT and transformer architecture
1. Encoder Stack :
• BERTBASE has 12 layers in the Encoder stack while BERTLARGE has 24 layers in the Encoder stack.
• These are more than the Transformer architecture described in the original paper (6 encoder
layers).
2. Hidden size :
• BERT architectures (BASE and LARGE) have larger hidden sizes – 768 and 1024 hidden units respectively.
• The original Transformer model has 512 hidden units.
3. Attention heads :
• BERT architectures (BASE and LARGE) have more attention heads (12 and 16 respectively).
• Original Transformer models have 8 attention heads.
4. No. of parameters :
• BERTBASE contains 110M parameters while BERTLARGE has 340M parameters.
• Transformer – 65M parameters

The GPT framework has 2 phases: pre-training and fine-tuning.
• Pre-training:
o This stage teaches the model to anticipate the word that will come next in a sentence.
o It is trained on a variety of internet material.
• Fine-tuning:
o Training the model on data specific to a given domain or task.
GPT – Pre-training
• u – input sequence
• k – context window size
• n – no. of substrings of u with length k
• at each step,
• the model outputs the probability distribution of all possible tokens being the
next token i for the sequence consisting of the last k context tokens.
• Then, the logarithm of the probability for the real token is calculated and used as
one of several values in the sum for the loss function.
• Loss function is a log likelihood function
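Written out, the objective the bullets describe is the standard language-modelling log-likelihood (reconstructed here from the description; it is not copied from the slides):

L_1(u) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)

where u is the input sequence, k the context window size, and Θ the model parameters; training maximizes this sum.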

• Ex : u = "Mixing red and blue produces purple"
• K=3
o For each of these substrings, the model predicts a probability distribution for every token in
the vocabulary for the language modeling task.
o the probability corresponding to the true token in the sequence is taken
(highlighted in yellow) is used for loss calculation.
o The final loss equals the sum of logarithms of true token probabilities.

• The log-likelihood loss function in GPT maximizes the logarithm of the probability
of correctly predicting all the tokens in the input sequence.
• Once GPT is pre-trained, it can already be used for text generation.
• GPT is an autoregressive model - it uses previously predicted tokens as input for
prediction of next tokens.
• the process lasts until the [end] token is predicted or the maximum input size is
reached.
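An illustrative sketch of this autoregressive loop using the Hugging Face transformers API (the "gpt2" checkpoint and the prompt are assumptions used for demonstration):

```python
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Mixing red and blue produces"
inputs = tokenizer(prompt, return_tensors="pt")

# Each new token is predicted from the previously generated tokens,
# until max_new_tokens is reached (or an end-of-sequence token is produced).
output_ids = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```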

GPT – fine-tuning
• After pre-training, GPT has captured general linguistic knowledge.
• It is fine-tuned for a specific task (e.g., text classification) on a supervised, task-specific dataset.
• The labelled dataset has, for each example, an input sequence x with a corresponding label y which needs to be predicted.
• Every example is passed through the model, which outputs its hidden representation h at the last layer.
• The resulting vectors are then passed to an added linear layer with learnable parameters W and then through a softmax layer.
• The loss function used for fine-tuning is very similar to that of pre-training phase
but this time, it evaluates the probability of observing the target value y instead
of predicting the next token

Limitations of LLMs

1. Outdated knowledge:
• LLMs are totally dependent on their training data.
• They have no inherent connection to external data sources.
• They have zero awareness of current events that occurred after their training.
• They cannot respond to breaking news or the latest societal developments.
• Without integration with external data sources, they fail to provide recent real-world information.
2. Inability to take action:
• LLMs cannot perform interactive actions like searches, calculations or look-ups.
• They cannot interface with APIs or take practical actions based on new prompts.
• This limits their functionality.
• Ex: asked to multiply two numbers, the model returns a wrong answer; the correct one is 6528025.
3. Lack of context:
• LLMs struggle to incorporate relevant context like previous conversations.
4. Biases and discrimination:
• Depending on the training data, LLMs can generate biased responses targeting a particular religion or ideology, or that are political in nature.
5. Lack of transparency:
• The behaviour of LLMs is opaque and difficult to interpret.
• It is challenging to align them with human values.
6. Hallucination risks:
• Due to insufficient knowledge of certain topics and most recent information,
LLMs generate incorrect responses.
• Ex:

Techniques to mitigate these limitations
1. Retrieval augmentation:
• Uses external knowledge bases to supplement an LLM's outdated training data, providing external context and reducing hallucination risk.
2. Chaining:
• This technique integrates actions like searches and calculations.
3. Prompt engineering:
• This involves the careful crafting of prompts, providing critical context that guides appropriate responses.
4. Monitoring, filtering, and reviews:
• Constitutional principles are used to monitor and filter unethical or inappropriate content.
• Human reviews provide insight into model behavior and output.
5. Memory:
• Retains conversation context by persisting conversation data and context across interactions.
6. Fine-tuning:
• Training and tuning the LLM on data more appropriate for the application domain and principles.
• This adapts the model's behavior for its specific purpose.
• To summarize – prompting supplies context, chaining enables inference, and retrieval incorporates facts.
• Frameworks like LangChain provide these features for responsible use of LLMs.
LangChain
• LangChain is a framework that simplifies the process of creating generative AI applications.
• Foundation models (e.g., GPT) are generally trained on data only up to their release to the public.
• ChatGPT was released to the public near the end of 2022, but its knowledge base was limited to data from 2021 and before.
• LangChain can connect AI models to external data sources to give them knowledge of recent data without these limitations.
A simple LLM application
• An LLM app is an application that utilizes an LLM to understand natural language prompts and generate responsive text outputs.
• It has the following components:
• A client layer to collect user input as text queries or decisions.
• A prompt engineering layer to construct prompts that guide the LLM.
• An LLM backend to analyze prompts and produce relevant text responses.
• An output parsing layer to interpret LLM responses for the application interface.
• Optionally to augment the LLMs capabilities :
• integration with external services via function APIs, knowledge bases, and
reasoning algorithms.

RAG – Retrieval Augmented Generation
• LLMs are trained on vast volumes of data to generate original output
for tasks like answering questions, translating languages, and
completing sentences
• Known challenges of LLMs include:
• Present false information when it does not have the answer.
• Present out-of-date or generic information when the user expects a specific,
current response.
• Create a response from non-authoritative sources.
• Create inaccurate responses due to terminology confusion, wherein different
training sources use the same terminology to talk about different things.

• RAG is a cost-effective approach to improve an LLM's output so it remains relevant, accurate, and useful in various contexts.
• RAG extends the powerful capabilities of LLMs to specific domains or an organization's internal knowledge base, without the need to retrain the model.
• RAG helps the LLM refer to an authoritative knowledge base outside of its training data sources before generating a response.
• RAG is a fusion of information retrieval models and generative
models.

Components of a RAG pipeline in LangChain
1. Indexing:
• A pipeline for ingesting data from a source and indexing it. This usually happens offline.
2. Retrieval and generation:
• The actual RAG chain, which takes the user query at run time, retrieves the relevant data from the index, and then passes that to the model.
1. Indexing

1. Indexing..
1. Load:
• Load the data with Document Loaders.
2. Split:
• Text splitters split large documents into smaller chunks.
• This is useful both for indexing data and for passing it to a model.
• Large chunks are harder to search over and will exceed the model's finite context window.
3. Store:
• Store and index the splits, in order to search them later.
• Chunks are stored using a VectorStore and an Embeddings model.
2. Retrieval and generation
4. Retrieve:
• Given a user input, relevant splits are retrieved from storage using a Retriever.
5. Generate :
• produces an answer using a prompt that includes the question and the
retrieved data
