Building LLM Applications: Sentence Transformers (Part 3)
By Vipra Singh
Learn Large Language Models (LLM) through the lens of a Retrieval-Augmented Generation (RAG) application.
Posts in this series:
2. Data Preparation
4. Vector Database
6. LLM
7. Open-Source RAG
8. Evaluation
9. Serving LLMs
Table of Contents
· 1. Embedding Models
∘ 1.1. Context-independent Embeddings
∘ 1.2. Context-Dependent Embeddings
· 2. BERT
∘ 2.1. Input representations
∘ 2.2. Why Sentence BERT (S-BERT) Over BERT?
· 3. Sentence Transformers
∘ 3.1. Siamese BERT Pre-Training
· 4. SBERT Objective Functions
∘ 4.1. Classification
∘ 4.2. Regression
∘ 4.3. Triplet Loss
· 5. Hands-On with Sentence Transformers
· 6. Which Embedding Model to Choose?
· Conclusion
· Credits
Greetings!
In the previous blogs, we learned about Data Preparation for RAG, which involved data ingestion, preparation, and chunking.
As we need to search for relevant contextual chunks during RAG, we have to convert the data from text into vector embeddings.
Thus, we will explore how to convert text efficiently via Sentence Transformers.
Image by Jina AI
Let’s get started with some of the most commonly used embedding models.
1. Embedding Models
Embeddings are a type of word representation (with numerical vectors) that allows
words with similar meanings to have a similar representation.
The process of creating vector embeddings from different types of data: Audio, Text, Video
Several word embedding methods have been proposed in the past decade; here are some of them.
Image by Author
Bag of Words
Bag of Words creates a vocabulary of the most common words across all the sentences and then encodes each sentence as a vector of word counts, as shown below.
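Below is a minimal sketch of this idea using scikit-learn's CountVectorizer on a toy corpus of my own (not from the article):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Learn the vocabulary and encode each sentence as a vector of word counts.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # one count vector per sentence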
Bag of Words
TF-IDF
Term Frequency (TF): how often the term appears in a given document.
Inverse Document Frequency (IDF): how important the term is in the whole corpus (rarer terms get higher weight). A term's TF-IDF score is the product of the two.
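A minimal sketch with scikit-learn's TfidfVectorizer (again on a toy corpus of my own) shows how the two factors combine into one weighted vector per sentence:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Words that appear in many documents ("the", "sat") are down-weighted,
# while words that are distinctive for a document ("pets") are up-weighted.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))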
TF-IDF
Word2Vec:
Word embeddings in Word2Vec are learned through a shallow two-layer neural network that captures linguistic context as a byproduct of its training objective; the embeddings are a side effect of the algorithm's primary task, which is what makes the approach so efficient. Word2Vec offers two distinct model architectures: CBOW and continuous skip-gram (see the sketch after this list).
Continuous Bag of Words (CBOW):
Uses the surrounding window of context words to predict the current (target) word.
Continuous Skip-Gram:
Uses the current word to predict the surrounding window of context words.
Focuses on the predictive power of the target word in generating the context words.
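Here is a minimal, illustrative sketch using Gensim's Word2Vec on a toy corpus (hyperparameters chosen only for demonstration):

from gensim.models import Word2Vec

# Tiny toy corpus; in practice Word2Vec is trained on millions of sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=1 selects the skip-gram architecture; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"].shape)         # a 50-dimensional vector for "cat"
print(model.wv.most_similar("cat"))  # nearest neighbours in the embedding space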
Word-2-Vec
GloVe (Global Vectors for Word Representation): GloVe’s strength lies in its
utilization of aggregated global word-word co-occurrence statistics from a
corpus during training. The resulting representations not only encapsulate
semantic relationships but also unveil intriguing linear substructures within the
word vector space, adding depth to the understanding of word embeddings.
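As a small illustration of those linear substructures, pre-trained GloVe vectors can be loaded through Gensim's downloader (the model name below is assumed to be available in the gensim-data catalogue) and probed with simple vector arithmetic:

import gensim.downloader as api

# Downloads pre-trained 50-dimensional GloVe vectors on first use.
glove = api.load("glove-wiki-gigaword-50")

# The classic analogy king - man + woman ≈ queen is one example of the
# linear structure present in the GloVe vector space.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
print(glove.similarity("ice", "steam"))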
RNN-based (e.g., ELMo)
Transformer-based (e.g., BERT)
2. BERT
BERT (Bidirectional Encoder Representations from Transformers), a powerhouse in
natural language processing developed by Google AI, has reshaped the landscape of
language models. This exploration delves into the pre-training methodology and the
intricacies of its bi-directional architecture.
In addition to the masked language model, BERT uses a next sentence prediction (NSP) task that jointly pre-trains text-pair representations. Many important downstream tasks, such as Question Answering (QA) and Natural Language Inference (NLI), are based on understanding the relationship between two sentences, which is not directly captured by language modeling.
The architecture of BERT is structured with multiple encoder layers, each applying self-attention to the input and passing it to the subsequent layer. Even the smallest variant, BERT BASE, has 12 encoder layers, a hidden size of 768, and 12 attention heads.
BERT Structures
2.1. Input representations
Input sequences are prepared before being fed to the model using the WordPiece tokenizer, which has a vocabulary of roughly 30k tokens. It works by splitting a word into one or more subwords (tokens).
[CLS] is used as the first token of each sequence. The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.
[PAD] is used to represent padding in the input sentences (empty tokens). The model expects fixed-length sequences as input, so a maximum length is fixed depending on the dataset: shorter sentences are padded, whereas longer sentences are truncated. To explicitly differentiate between real tokens and [PAD] tokens, we use an attention mask, as shown in the sketch below.
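A minimal sketch with the Hugging Face transformers tokenizer (assuming the standard bert-base-uncased checkpoint) makes the special tokens and the attention mask visible:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pad or truncate both sentences to a fixed length of 12 tokens.
enc = tokenizer(
    ["he embraced his new life as an eggplant", "she swore she just saw her sushi move"],
    padding="max_length", truncation=True, max_length=12,
)

for ids, mask in zip(enc["input_ids"], enc["attention_mask"]):
    # [CLS] opens each sequence, [SEP] closes it, and [PAD] fills the rest;
    # the attention mask is 1 for real tokens and 0 for padding.
    print(tokenizer.convert_ids_to_tokens(ids))
    print(mask)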
BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation
embeddings and the position embeddings.
To get the token embedding, an embedding lookup table is used at the embedding
layer (as illustrated in Figure above), where rows represent all possible token IDs in
the vocabulary (30k rows for instance) and columns represent the token embedding
size.
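Conceptually, this lookup table is just an embedding matrix indexed by token ID. A minimal PyTorch sketch (sizes chosen to match BERT BASE; the token IDs are only illustrative):

import torch
import torch.nn as nn

vocab_size, hidden_size = 30522, 768  # BERT BASE vocabulary and embedding size

# One row per token ID, one column per embedding dimension.
token_embedding = nn.Embedding(vocab_size, hidden_size)

token_ids = torch.tensor([[101, 7592, 2088, 102]])  # e.g. [CLS] hello world [SEP]
embeddings = token_embedding(token_ids)

print(embeddings.shape)  # torch.Size([1, 4, 768])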
2.2. Why Sentence BERT (S-BERT) Over BERT?
So far, so good, but these transformer models had one issue when building sentence
vectors: Transformers work using word or token-level embeddings, not sentence-
level embeddings.
Siamese (bi-encoder) architecture is shown on the left, and the non-siamese (cross-encoder) architecture is on the right. The principal difference is that the cross-encoder accepts both inputs at the same time, concatenated into a single sequence, whereas the bi-encoder processes the two inputs independently, so its two outputs do not depend on each other.
The cross-encoder network does produce very accurate similarity scores (better than SBERT), but it is not scalable. If we wanted to perform a similarity search over a small 100K-sentence dataset, we would need to run the cross-encoder inference 100K times for every query.
To cluster sentences, we would need to compare all sentences in our 100K dataset against each other, resulting in n(n-1)/2, roughly five billion, comparisons; this is simply not realistic.
Ideally, we need to pre-compute sentence vectors that can be stored and then used
whenever required. If these vector representations are good, all we need to do is
calculate the cosine similarity between each. With the original BERT (and other
transformers), we can build a sentence embedding by averaging the values across all
token embeddings output by BERT (if we input 512 tokens, we output 512
embeddings). [Approach — 1]
Alternatively, we can use the output of the first [CLS] token (a BERT-specific token
whose output embedding is used in classification tasks). [Approach — 2]
Using either of these two approaches gives us sentence embeddings that can be stored and compared much faster, shifting search times from 65 hours to around 5 seconds. However, the accuracy is poor, worse even than using averaged GloVe embeddings (which were developed in 2014).
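A hedged sketch of those two approaches with the transformers library (note that mean pooling has to ignore padding tokens via the attention mask):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

enc = tokenizer(["she swore she just saw her sushi move"], padding=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**enc).last_hidden_state   # (batch, seq_len, 768)

# Approach 1: mean pooling over the real (non-padding) tokens.
mask = enc["attention_mask"].unsqueeze(-1)               # (batch, seq_len, 1)
mean_pooled = (token_embeddings * mask).sum(1) / mask.sum(1)

# Approach 2: take the output of the first [CLS] token.
cls_pooled = token_embeddings[:, 0]

print(mean_pooled.shape, cls_pooled.shape)               # both (1, 768)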
Thus, finding the most similar sentence pair among 10K sentences took 65 hours with BERT. With SBERT, the embeddings are created in around 5 seconds and compared with cosine similarity in around 0.01 seconds.
3. Sentence Transformers
Since the SBERT paper, many more sentence transformer models have been built using concepts similar to those that went into training the original SBERT. They are all trained on many similar and dissimilar sentence pairs.
Using a loss function such as softmax loss, multiple negatives ranking loss, or MSE margin loss, these models are optimized to produce similar embeddings for similar sentences, and dissimilar embeddings otherwise.
Sentence Transformer
SBERT is fine-tuned on sentence pairs using a siamese architecture: we can think of it as two identical BERT models running in parallel that share the exact same network weights.
An SBERT model applied to a sentence pair sentence A and sentence B. Note that the BERT model outputs
token embeddings (consisting of 512 768-dimensional vectors). We then compress that data into a single
768-dimensional sentence vector using a pooling function.
In reality, we are using a single BERT model. However, because we process sentence
A followed by sentence B as pairs during training, it is easier to think of this as two
models with tied weights.
3.1. Siamese BERT Pre-Training
Siamese Architecture
SNLI contains 570K sentence pairs, and MNLI contains 430K. The pairs in both
corpora include a premise and a hypothesis. Each pair is assigned one of three
labels:
0 — entailment, the premise suggests the hypothesis.
1 — neutral, the premise and hypothesis could both be true, but they are not necessarily related.
2 — contradiction, the premise and hypothesis contradict each other.
Given this data, we feed sentence A (let’s say the premise) into siamese BERT A and
sentence B (hypothesis) into siamese BERT B.
The siamese BERT outputs our pooled sentence embeddings. The SBERT paper reports results for three different pooling methods: mean, max, and [CLS]-pooling. Mean pooling performed best on both the NLI and STSb datasets.
There are now two sentence embeddings. We will call embeddings A as u and
embeddings B as v. The next step is to concatenate u and v. Again, several
concatenation approaches were tested, but the highest performing was a (u, v, |u-v|)
operation:
|u-v| is calculated to give us the element-wise difference between the two vectors.
Alongside the original two embeddings (u and v), these are all fed into a
feedforward neural net (FFNN) that has three outputs.
These three outputs correspond to our NLI labels 0, 1, and 2. We compute the softmax over the FFNN outputs, which happens inside the cross-entropy loss function; the softmax and labels are then used to optimize this ‘softmax loss’, as sketched below.
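A minimal PyTorch sketch of this classification (softmax-loss) objective, with u and v standing in for the two pooled sentence embeddings (the random tensors are placeholders, not real model outputs):

import torch
import torch.nn as nn

dim, num_labels = 768, 3                      # n = 768 for BERT base, k = 3 NLI labels
classifier = nn.Linear(3 * dim, num_labels)   # FFNN over the concatenated features
loss_fn = nn.CrossEntropyLoss()               # applies softmax internally ("softmax loss")

# In the real model, u and v come from the pooled outputs of the shared BERT encoder.
u = torch.randn(16, dim)                      # batch of 16 premise embeddings
v = torch.randn(16, dim)                      # batch of 16 hypothesis embeddings
labels = torch.randint(0, num_labels, (16,))  # 0 = entailment, 1 = neutral, 2 = contradiction

features = torch.cat([u, v, torch.abs(u - v)], dim=-1)  # the (u, v, |u - v|) concatenation
logits = classifier(features)
loss = loss_fn(logits, labels)
loss.backward()  # here only the linear layer gets gradients; in SBERT they also flow into the shared encoder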
The operations performed during training on the two sentence embeddings, u and v. Note that softmax-loss refers to cross-entropy loss (which contains a softmax function by default).
This results in our pooled sentence embeddings for similar sentences (label 0)
becoming more similar, and embeddings for dissimilar sentences (label 2) becoming
less similar.
Remember that we are using a siamese BERT, not a dual BERT, meaning we don't use two independent BERT models but a single BERT that processes sentence A followed by sentence B.
This means that when we optimize the model weights, they are pushed in a
direction that allows the model to output more similar vectors where we see an
entailment label and more dissimilar vectors where we see a contradiction label.
4. SBERT Objective Functions
4.1. Classification
SBERT architecture for the classification objective. Parameter n stands for the dimensionality of embeddings (768 by default for BERT base), while k designates the number of labels.
4.2. Regression
In this formulation, after getting vectors u and v, the similarity score between them
is directly computed by a chosen similarity metric. The predicted similarity score is
compared with the true value and the model is updated by using the MSE loss
function.
SBERT architecture for regression objective. Parameter n stands for the dimensionality of embeddings (768
by default for BERT base).
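A minimal PyTorch sketch of this regression objective (cosine similarity between u and v optimized against a gold score with MSE; this mirrors what the CosineSimilarityLoss in sentence-transformers does, but the tensors below are placeholders):

import torch
import torch.nn as nn
import torch.nn.functional as F

mse = nn.MSELoss()

# In the real model, u and v are the pooled sentence embeddings from the shared encoder.
u = torch.randn(16, 768, requires_grad=True)
v = torch.randn(16, 768, requires_grad=True)
gold_scores = torch.rand(16)                   # e.g. STSb similarity scores scaled to [0, 1]

predicted = F.cosine_similarity(u, v, dim=-1)  # the chosen similarity metric
loss = mse(predicted, gold_scores)
loss.backward()                                # in the real setup this updates the encoder weights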
5. Hands-On with Sentence Transformers
For now, let’s look at how we can initialize and use these sentence-transformer models.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')
model
Output:
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
The output we can see here is the SentenceTransformer object, which contains two components:
The transformer itself, here we can see the max sequence length of 128 tokens
and whether to lowercase any input (in this case, the model does not). We can
also see the model class, BertModel.
The pooling operation, here we can see that we are producing a 768-
dimensional sentence embedding. We are doing this using the mean pooling
method.
Once we have the model, building sentence embeddings is quickly done using the
encode method.
sentences = [
    "the fifty mannequin heads floating in the pool kind of freaked them out",
    "she swore she just saw her sushi move",
    "he embraced his new life as an eggplant",
    "my dentist tells me that chewing bricks is very bad for your teeth",
    "the dental specialist recommended an immediate stop to flossing with construction materials",
]
embeddings = model.encode(sentences)
embeddings.shape
Output:
(5, 768)
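From here, comparing the sentences is just a matter of cosine similarity between the stored vectors. A short sketch using the library's util helper, continuing from the embeddings computed above:

from sentence_transformers import util

# Pairwise cosine similarities between all five sentence embeddings.
scores = util.cos_sim(embeddings, embeddings)

# The two dental-themed sentences should score highest against each other.
print(scores[3][4])
print(scores)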
6. Which Embedding Model to Choose?
A good starting point is the Hugging Face Massive Text Embedding Benchmark (MTEB) Leaderboard, which ranks embedding models across a wide range of tasks. There are also embedding models that excel at handling multilingual datasets.
Alternatively, you can conduct testing with various embedding models and compile a final evaluation table to pinpoint the most suitable one for your specific use case. I highly recommend incorporating a re-ranker into this process, as it can significantly enhance retriever performance and ultimately yield the best results (a small sketch of the retrieve-then-re-rank pattern follows below).
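As a hedged sketch of that retrieve-then-re-rank pattern (the model names are common public checkpoints, chosen here only for illustration):

from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "is flossing bad for your teeth"
docs = [
    "my dentist tells me that chewing bricks is very bad for your teeth",
    "he embraced his new life as an eggplant",
    "the dental specialist recommended an immediate stop to flossing with construction materials",
]

# Stage 1: fast bi-encoder retrieval over pre-computed document embeddings.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = bi_encoder.encode(docs)
query_emb = bi_encoder.encode(query)
top_ids = util.cos_sim(query_emb, doc_emb)[0].argsort(descending=True)[:2]

# Stage 2: slower but more accurate cross-encoder re-ranking of the shortlist.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, docs[int(i)]) for i in top_ids]
scores = reranker.predict(pairs)
print(sorted(zip(scores, [d for _, d in pairs]), reverse=True))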
HuggingFace MTEB
If you opt for the second approach, there is an excellent Medium blog post available
that shows how to utilise the Retrieval Evaluation module from LlamaIndex. This
resource can help you efficiently assess and identify the optimal combination of
embedding and reranker from an initial list of models.
I trust that you now feel better equipped to make an informed decision when
selecting the most suitable embedding and reranking models for your RAG
architecture!
Conclusion
The blog explores various embedding models for generating vector representations
of text, including Bag of Words (BoW), TF-IDF, Word2Vec, GloVe, FastText, ELMO,
BERT, and more. It delves into the architecture and pre-training of BERT, introduces
Sentence BERT (SBERT) for efficient sentence embeddings, and provides a hands-on
example using the sentence-transformers library. The conclusion emphasizes the
challenge of choosing the right embedding model and suggests leveraging resources
like the Hugging Face Massive Text Embedding Benchmark (MTEB) Leaderboard for
evaluation.
Credits
In this blog post, we have compiled information from various sources, including research papers, technical blogs, official documentation, YouTube videos, and more. Each source has been credited beneath the corresponding image, with source links provided.
1. https://jina.ai/news/the-1950-2024-text-embeddings-evolution-poster/
2. https://partee.io/2022/08/11/vector-embeddings/
3. https://www.nlplanet.org/course-practical-nlp/01-intro-to-nlp/11-text-as-vectors-embeddings
4. https://www.deeplearning.ai/resources/natural-language-processing/
5. https://www.mygreatlearning.com/blog/word-embedding/#sh4
6. https://mlwhiz.com/blog/2019/02/08/deeplearning_nlp_conventional_methods/
7. https://vitalflux.com/bert-vs-gpt-differences-real-life-examples/
8. https://d3mlabs.de/?p=1169
9. https://www.linkedin.com/pulse/why-does-bert-stand-out-sea-sentence-embedding-models-bhaskar-t-bi6wc%3FtrackingId=RKK3MNdP8pugx6iyhwJ2hw%253D%253D/?trackingId=RKK3MNdP8pugx6iyhwJ2hw%3D%3D
10. https://www.amazon.science/blog/improving-unsupervised-sentence-pair-comparison
11. https://www.researchgate.net/figure/Sentence-BERT-model_fig3_360530243
12. https://www.youtube.com/watch?app=desktop&v=O3xbVmpdJwU
13. https://www.pinecone.io/learn/series/nlp/sentence-embeddings/
14. https://towardsdatascience.com/sbert-deb3d4aef8a4
15. https://huggingface.co/spaces/mteb/leaderboard
Your claps help me create more valuable content for our vibrant Python or ML
community.