Embeddings & Vector Stores
Acknowledgements
Grace Mollison
Ruiqi Guo
Designer
Jinhyuk Lee
Alan Li
Patricia Florissi
Andrew Brook
Omid Fatemieh
Zhuyun Dai
Lee Boonstra
Per Jacobsson
Xi Cheng
Raphael Hoffmann
Antonio Gulli
Anant Nawalgaria
Grace Mollison
September 2024
Table of contents
Introduction
Types of embeddings
Text embeddings
Word embeddings
Document embeddings
Image & multimodal embeddings
Structured data embeddings
Graph embeddings
Training Embeddings
Vector search
ScaNN
Vector databases
Operational considerations
Applications
Summary
Endnotes
Introduction
Modern machine learning thrives on diverse data: images, text, audio, and more. This
whitepaper explores the power of embeddings, which transform this heterogeneous data into
a unified vector representation for seamless use in various applications. It covers three key areas:
• Understanding Embeddings: Why they are essential for handling multimodal data and
their diverse applications.
• Embedding Techniques: Methods for mapping different data types into a common
vector space.
• Efficient Management: Techniques for storing, retrieving, and searching vast collections
of embeddings.
Throughout the whitepaper, code snippets provide hands-on illustrations of key concepts.
One of the key applications for embeddings is retrieval and recommendation, where results
are usually drawn from a massive search space. For example, Google Search is a retrieval
system whose search space is the whole internet. The success of today's retrieval and
recommendation systems depends, among other factors, on efficiently computing and
retrieving the nearest neighbors of the query embedding in the search space.
Embeddings also shine in the world of multimodality. Most applications work with large
amounts of data of various modalities: text, speech, images, and video, to name a few.
Because every entity or object is represented in its own unique format, it’s very difficult
to project these objects into the same vector space that is both compact and informative.
Ideally, such a representation would capture as much of the original object’s characteristics
as possible. An embedding refers to the projected vector of an object from an input space to
a relatively low-dimensional vector space. Each vector is a list of floating point numbers.
Figure 1. Projecting objects/content into a joint vector space with semantic meaning
Ideally the embeddings are created so they place objects with similar semantic properties
closer in the embedding space (a low-dimensional vector space where items can be
projected). The embeddings can then be used as a condensed, meaningful input in
downstream applications. For example, you can use them as features for ML models,
recommender systems, search engines, and many more. So your data not only gets a
compact numerical representation, but this representation also preserves the semantic
meanings for a specific task or across a variety of tasks. The fact that these representations
are task-specific means you can generate different embeddings for the same object,
optimized for the task at hand.
Types of embeddings
Embeddings aim to obtain a low-dimensional representation of the original data while
preserving most of its 'essential information'. Embeddings can represent many different
forms of data. Below you'll see some standard techniques used for different types of data,
including text and images.
Text embeddings
Text embeddings are used extensively as part of natural language processing (NLP). They
are often used to embed the meaning of natural language in machine learning for processing
in various downstream applications such as text generation, classification, sentiment
analysis, and more. These embeddings broadly fall into two categories: token/word and
document embeddings.
Before diving deeper into these categories, it’s important to understand the entire lifecycle
of text: from its input by the user to its conversion to embeddings.
It all starts with the input string which is split into smaller meaningful pieces called tokens.
This process is called tokenization. Commonly, these tokens are wordpieces, characters,
words, numbers, or punctuation marks, produced using one of the many existing tokenization techniques.1
After the string is tokenized, each of these tokens is then assigned a unique integer value
usually in the range [0, vocabulary size), where the vocabulary size is the number of unique
tokens in the corpus. For example, for a 16-token vocabulary the IDs would range from 0 to 15.
This value is also referred to as the token ID. These token IDs can be used to represent each
string as a numerical vector for downstream tasks, either directly or after one-hot encoding.
One-hot encoding is a binary representation of categorical values in which the presence of a
token is represented by 1 and its absence by 0. This ensures that the token IDs are treated as
categorical values, but it results in a very large, sparse vector the size of the corpus
vocabulary. Snippet 1 and Figure 3 show an example of how this can be done using TensorFlow.
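To make this concrete, here is a minimal sketch of tokenizing a toy corpus and one-hot encoding the resulting token IDs; the use of the Keras TextVectorization layer is an illustrative assumption, not necessarily what Snippet 1 used.

import tensorflow as tf

# Toy corpus and a TextVectorization layer that tokenizes strings into integer token IDs.
corpus = ["the earth is spherical", "the earth is a planet"]
vectorizer = tf.keras.layers.TextVectorization(output_mode="int")
vectorizer.adapt(corpus)

token_ids = vectorizer(["the earth is a planet"])
print(vectorizer.get_vocabulary())  # vocabulary, ordered by token frequency
print(token_ids)                    # integer token IDs for the input string

# One-hot encode the IDs: each token becomes a vector the size of the vocabulary.
one_hot = tf.one_hot(token_ids, depth=vectorizer.vocabulary_size())
print(one_hot.shape)                # (1, number_of_tokens, vocabulary_size)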
However, since these integer IDs (or their corresponding one-hot encoded vectors) are
assigned randomly to words, they lack any inherent semantic meaning. This is where
embeddings are much more useful. Although it’s possible to embed character and sub-word
level tokens as well, let us look at word and document embeddings to understand some of
the methods behind them.
Word embeddings
In this section, you’ll see a few word embedding techniques and algorithms to both train
and use word embeddings. While there are many ML driven algorithms developed over
time optimized for different objectives, the most common ones are GloVe,2 SWIVEL,3 and
Word2Vec.4 Word embeddings or sub-word embeddings can also be directly obtained from
hidden layers of language models. However, the embeddings will be different for the same
word in different contexts of the text. This section focuses on lightweight, context-free
word embedding and leaves the context-aware document embeddings for the document
embeddings section. Word embeddings can be directly applied to downstream tasks like
named entity extraction and topic modeling.
Word2Vec is a family of model architectures that operates on the principle of “the semantic
meaning of a word is defined by its neighbors”, or words that frequently appear close to each
other in the training corpus. This method can be both used to train your own embeddings
from large datasets and to quickly integrate one of the many pre-trained embeddings readily
available online.5 The embeddings for each word, which are essentially fixed-length vectors,
are randomly initialized to kick off the process, resulting in a matrix of shape
(size_of_vocabulary, size_of_each_embedding). This matrix can be used as a lookup table
after the training process is completed using one of the following methods (see Figure 4 and
the sketch after the list below).
• The Continuous bag of words (CBOW) approach: Tries to predict the middle word, using
the embeddings of the surrounding words as input. This method is agnostic to the order
of the surrounding words in the context. This approach is fast to train and is slightly more
accurate for frequent words.
• The skip-gram approach: The setup is inverse of that of CBOW, with the middle word
being used to predict the surrounding words within a certain range. This approach is
slower to train but works well with small data and is more accurate for rare words.
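For illustration, here is a minimal sketch of training both variants on a toy corpus with Gensim's Word2Vec implementation; the library choice and all hyperparameters are illustrative assumptions rather than one of the whitepaper's own snippets.

from gensim.models import Word2Vec

# Tiny toy corpus: each document is a list of tokens.
corpus = [
    ["the", "earth", "is", "a", "planet"],
    ["the", "earth", "is", "spherical"],
    ["i", "like", "to", "eat", "at", "a", "restaurant"],
]

# sg=0 trains CBOW (predict the middle word from its context);
# sg=1 would train skip-gram (predict the surrounding words from the middle word).
model = Word2Vec(sentences=corpus, vector_size=32, window=2, min_count=1, sg=0, epochs=50)

print(model.wv["earth"].shape)          # (32,) fixed-length vector for a word
print(model.wv.most_similar("earth"))   # neighbors ranked by cosine similarity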
The Word2Vec algorithms can also be extended to the sub-word level, which has been the
inspiration for algorithms such as FastText.6 However, one major caveat of Word2Vec is that
although it accounts well for the local statistics of words within a sliding window, it does not
capture global statistics across the whole corpus. This shortcoming is what methods like the
GloVe algorithm address.
GloVe is a word embedding technique that leverages both global and local statistics of words.
It does this by first creating a co-occurrence matrix, which represents the relationships
between words. GloVe then uses a factorization technique to learn word representations
from the co-occurrence matrix. The resulting word representations are able to capture both
global and local information about words, and they are useful for a variety of NLP tasks.
Word embeddings can be directly used in some downstream tasks like Named Entity
Recognition (NER).
import pprint
import gensim.downloader as api
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def tsne_plot(models, words):
    # Project the embeddings of the given words to 2D with t-SNE and draw one
    # subplot per model, so the two embedding spaces can be compared visually.
    plt.figure(figsize=(len(models)*30, len(models)*30))
    model_ix = 0
    for model in models:
        labels = []
        tokens = []
        # (Reconstructed step: collect each word's vector; the t-SNE settings are illustrative.)
        for word in words:
            tokens.append(model[word])
            labels.append(word)
        model_ix += 1
        coords = TSNE(n_components=2, random_state=0).fit_transform(np.array(tokens))
        x, y = coords[:, 0], coords[:, 1]
        plt.subplot(10, 10, model_ix)
        for i in range(len(x)):
            plt.scatter(x[i], y[i])
            plt.annotate(labels[i],
                         xy=(x[i], y[i]),
                         xytext=(5, 2),
                         textcoords='offset points',
                         ha='right',
                         va='bottom')
    plt.tight_layout()
    plt.show()

# Load two sets of pre-trained word embeddings from the Gensim model zoo.
v2w_model = api.load('word2vec-google-news-300')
glove_model = api.load('glove-twitter-25')

print("words most similar to 'computer' with word2vec and glove respectively:")
pprint.pprint(v2w_model.most_similar("computer")[:3])
pprint.pprint(glove_model.most_similar("computer")[:3])

pprint.pprint("2d projection of some common words of both models")
sample_common_words = list(set(v2w_model.index_to_key[100:10000])
                           & set(glove_model.index_to_key[100:10000]))[:100]
tsne_plot([v2w_model, glove_model], sample_common_words)

Snippet 2. Comparing pre-trained Word2Vec and GloVe embeddings and visualizing them with t-SNE
Figure 5. Semantically similar words are clustered differently by the two algorithms
Document embeddings
Early document embeddings were based on the bag-of-words (BoW) paradigm, which
represents a document simply by the counts of the words it contains. This paradigm has two
major weaknesses: word ordering and semantic meaning are both ignored, so BoW models
fail to capture the sequential relationships between words, which are crucial for
understanding meaning and context.
Inspired by Word2Vec, Doc2Vec10 was proposed in 2014 for generating document
embeddings using (shallow) neural networks. The Doc2Vec model adds an additional
'paragraph' (that is, document) embedding to the Word2Vec model, as illustrated in Figure 6.
The paragraph embedding is concatenated or averaged with the other
word embeddings to predict a random word in the paragraph. After training, for existing
paragraphs or documents, the learned embeddings can be directly used in downstream
tasks. For a new paragraph or document, extra inference steps need to be performed to
generate the paragraph or document embedding.
Snippet 3 below shows how you can train your own Doc2Vec models on a custom corpus:
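A minimal sketch of this workflow, using Gensim's Doc2Vec implementation; the library choice and all hyperparameters are illustrative assumptions.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Custom corpus: each document is tokenized and tagged with a unique ID.
corpus = [
    "the earth is spherical",
    "the earth is a planet",
    "i like to eat at a restaurant",
]
tagged_docs = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(corpus)]

# Train a small Doc2Vec model on the tagged documents.
model = Doc2Vec(tagged_docs, vector_size=32, window=2, min_count=1, epochs=100)

# Embedding learned for an existing (training) document.
print(model.dv[0])

# For a new document, an extra inference step is required.
print(model.infer_vector("is the earth round".split()))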
The success of applying neural networks in the embedding world inspired an increasing
interest in using deep neural networks to generate embeddings.
Motivated by the development of deep neural networks, different embedding models and
techniques were proposed, and the state-of-the-art models are refreshed frequently. One of
the main changes has been the use of more complex learning models, especially
bi-directional deep neural network models.
In 2018, BERT11, which stands for Bidirectional Encoder Representations from Transformers,
was proposed with groundbreaking results on 11 NLP tasks. The transformer, the architecture
BERT is based on, has remained the mainstream model paradigm ever since. Besides using a
transformer as the model backbone, another key to BERT's success is pre-training on a
massive unlabeled corpus. In pre-training, BERT used masked language modeling (MLM) as
the objective: some tokens of the input are randomly masked, and the IDs of the masked
tokens are the prediction targets. This allows the model to use both the left and right context
to pretrain a deep bidirectional transformer. BERT also uses a next-sentence prediction task
in pre-training. BERT outputs a contextualized embedding for every token in the input;
typically, the embedding of the first token (a special token named [CLS]) is used as the
embedding for the whole input.
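As an illustration, here is a minimal sketch of extracting token-level and [CLS] embeddings from a pre-trained BERT model using the Hugging Face Transformers library; the library choice and model name are assumptions for illustration, not one of the whitepaper's own snippets.

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

# Load a pre-trained BERT encoder and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = TFAutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer(["The earth is a planet."], return_tensors="tf")
outputs = encoder(**inputs)

# Contextualized embeddings for every token: shape (batch, num_tokens, hidden_size).
token_embeddings = outputs.last_hidden_state
# The embedding of the first ([CLS]) token is often used for the whole input.
cls_embedding = token_embeddings[:, 0, :]
print(cls_embedding.shape)  # (1, 768)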
BERT became the base model for multiple embedding models, including Sentence-
BERT,12 SimCSE,13 and E5.14 Meanwhile, the evolution of language models - especially large
language models - never stops. T5 was proposed in 2019 with up to 11B parameters, and PaLM
was proposed in 2022, pushing large language models to a remarkable 540B parameters.
Models like Gemini from Google, GPT models from OpenAI and Llama models from Meta are
also evolving to newer generations at astonishing speed. Please refer to the whitepaper on
Foundational models for more information about some common LLMs.
New embedding models based on large language models have been proposed. For example,
GTR and Sentence-T5 show better performance on retrieval and sentence similarity
(respectively) than BERT family models.
Although deep neural network models require a lot more data and compute time to train,
they perform much better than models based on the bag-of-words paradigm because they
produce contextualized representations: the same word gets different embeddings in
different contexts.
Snippet 4 demonstrates how pre-trained document embedding models from TensorFlow
Hub17 (for example, Sentence-T5)A and Vertex AIA can be used for training models with Keras
and TF datasets. Vertex Generative AI text embeddings can be used with the Vertex AI SDK,
LangChain, and Google's BigQuery (Snippet 5) for embedding and advanced workflows.18
A. Note: not all models on https://tfhub.dev/ can be used commercially. Please check the licenses of the models
and the training datasets and consult your legal team before commercial usage.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import vertexai
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

# vertexai.init(project=..., location=...) must have been called beforehand.
# The TF Hub model URL below is an illustrative stand-in for a text-embedding model.
hub_layer = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim50/2",
                           input_shape=[], dtype=tf.string, trainable=True)
embeddings_vx = TextEmbeddingModel.from_pretrained("textembedding-gecko@004")

def LLM_embed(text):
    # Wrap the Vertex AI embedding call so it can run inside a tf.data pipeline.
    def embed_text(text):
        text_inp = TextEmbeddingInput(task_type="CLASSIFICATION", text=text.numpy())
        return np.array(embeddings_vx.get_embeddings([text_inp])[0].values)
    output = tf.py_function(func=embed_text, inp=[text], Tout=tf.float32)
    output.set_shape((768,))
    return output

# Train model (train_data: a tf.data.Dataset of (text, label) pairs).
model = tf.keras.Sequential()
model.add(hub_layer)  # omit this layer if using Vertex LLM embeddings
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
history = model.fit(train_data.shuffle(100).batch(8))
Snippet 4. Creating & integrating text embeddings (Vertex, Tfhub) into keras text classification models
Snippet 5. Creating LLM based text embeddings in BigQuery for selected columns in a table
Image & multimodal embeddings

Much like text, it's also possible to create both image and multimodal embeddings.
Unimodal image embeddings can be derived in many ways: one of them is to train a
CNN or Vision Transformer model on a large-scale image classification task (for example,
ImageNet) and then use the penultimate layer as the image embedding. This layer has
learned feature maps that are discriminative for the training task and that often extend
well to other tasks.
To obtain multimodal embeddings,19 you take the individual unimodal text and image
embeddings and learn their semantic relationships via another training process. This
gives you a fixed-size semantic representation in the same latent space. The snippet below
(Snippet 6) can be used to compute image and multimodal embeddings for images and text,
which can then be used with a Keras model directly (much like the text embedding example).
import base64
import tensorflow as tf
import tensorflow_hub as hub
from google.cloud import aiplatform
from google.protobuf import struct_pb2

# Fine-tunable layer for image embeddings, usable in a downstream Keras model.
image_embed = hub.KerasLayer(
    "https://tfhub.dev/google/imagenet/efficientnet_v2_imagenet21k_ft1k_s/feature_vector/2",
    trainable=False)

class EmbeddingPredictionClient:
    """Wrapper around Prediction Service Client."""
    def __init__(self, project: str,
                 location: str = "us-central1",
                 api_regional_endpoint: str = "us-central1-aiplatform.googleapis.com"):
        client_options = {"api_endpoint": api_regional_endpoint}
        self.client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)
        self.location = location
        self.project = project

    def get_embedding(self, text: str = None, image_bytes: bytes = None):
        # (Reconstructed request body; follows the public Vertex AI multimodal embedding sample.)
        instance = struct_pb2.Struct()
        if text:
            instance.fields['text'].string_value = text
        if image_bytes:
            encoded_content = base64.b64encode(image_bytes).decode("utf-8")
            image_struct = instance.fields['image'].struct_value
            image_struct.fields['bytesBase64Encoded'].string_value = encoded_content
        endpoint = (f"projects/{self.project}/locations/{self.location}"
                    "/publishers/google/models/multimodalembedding@001")
        response = self.client.predict(endpoint=endpoint, instances=[instance])

        text_embedding = None
        if text:
            text_emb_value = response.predictions[0]['textEmbedding']
            text_embedding = [v for v in text_emb_value]
        image_embedding = None
        if image_bytes:
            image_emb_value = response.predictions[0]['imageEmbedding']
            image_embedding = [v for v in image_emb_value]
        return text_embedding, image_embedding

Snippet 6. Computing image and multimodal embeddings with TF Hub and the Vertex AI multimodal embedding model
Structured data embeddings

There are two common ways to generate embeddings for structured data: one is more
general, while the other is tailored to recommendation applications.

Given a general structured data table, we can create an embedding for each row. This can be
done with ML models from the dimensionality-reduction family, such as PCA. One use case
for these embeddings is anomaly detection: for example, we can create embeddings from
large datasets of labeled sensor readings to identify anomalous occurrences.20 Another use
case is to feed these embeddings to downstream ML tasks such as classification. Compared
to using the original high-dimensional data, training a supervised model on embeddings
requires less data, which is particularly important when training data is scarce.
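A minimal sketch of this idea using scikit-learn's PCA; the library choice, the toy data, and the reconstruction-error anomaly score are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA

# Toy structured data: 1000 rows of 20 numeric sensor readings.
rng = np.random.default_rng(0)
rows = rng.normal(size=(1000, 20))

# Learn a low-dimensional embedding of each row with PCA.
pca = PCA(n_components=4)
row_embeddings = pca.fit_transform(rows)          # shape (1000, 4)

# Simple anomaly score: reconstruction error of each row from its embedding.
reconstructed = pca.inverse_transform(row_embeddings)
anomaly_score = np.linalg.norm(rows - reconstructed, axis=1)
print(row_embeddings.shape, anomaly_score[:5])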
For the recommendation case, the input is no longer a general structured data table.
Instead, it includes user data, item/product data, and data describing the interactions
between users and items, such as rating scores. This approach maps the two sets of data
(the user dataset and the item/product dataset) into the same embedding space. For
recommender systems, we can create embeddings out of structured data corresponding to
different entities such as products, articles, and so on. Again, we have to create our own
embedding model. Sometimes this can be combined with unstructured embedding methods
when images or text descriptions are available.
Graph embeddings
Graph embeddings are another embedding technique that lets you represent not
only information about a specific object but also its neighbors (namely, their graph
representation). Take an example of a social network where each person is a node, and the
connections between people are defined as edges. Using graph embedding you can model
each node as an embedding, such that the embedding captures not only the semantic
information about the person itself, but also its relations and associations, hence enriching
the embedding. For example, if two nodes are connected by an edge, the vectors for those
nodes would be similar. You might then be able to predict who the person is most similar
to and recommend new connections. Graph embeddings can also be used for a variety of
tasks, including node classification, graph classification, link prediction, clustering, search,
recommendation systems, and more. Popular algorithms21,22 for graph embedding include
DeepWalk, Node2vec, LINE, and GraphSAGE.23
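A minimal DeepWalk-style sketch is shown below; it is an illustrative assumption that uses networkx and Gensim rather than a dedicated graph-embedding library, with all hyperparameters chosen for demonstration only.

import random
import networkx as nx
from gensim.models import Word2Vec

# Small social graph: nodes are people, edges are connections.
graph = nx.karate_club_graph()

def random_walks(g, num_walks=10, walk_length=8):
    # DeepWalk-style uniform random walks; each walk is treated as a "sentence".
    walks = []
    for _ in range(num_walks):
        for node in g.nodes():
            walk = [node]
            while len(walk) < walk_length:
                walk.append(random.choice(list(g.neighbors(walk[-1]))))
            walks.append([str(n) for n in walk])
    return walks

# Train Word2Vec on the walks: connected nodes end up with similar vectors.
model = Word2Vec(random_walks(graph), vector_size=16, window=3, min_count=1, sg=1, epochs=20)
print(model.wv.most_similar("0")[:3])   # nodes most similar to node 0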
Training Embeddings
Current embedding models usually use dual encoder (two tower) architecture. For example,
for the text embedding model used in question-answering, one tower is used to encode
the queries and the other tower is used to encode the documents. For the image and text
embedding model, one tower is used to encode the images and the other tower is used
to encode the text. The model can have various sub architectures, depending on how the
model components are shared between the two towers. The following figure shows some
architectures of the dual encoders.24
The loss used to train embedding models is usually a variation of contrastive loss, which
takes tuples of <input, positive target, [optional] negative targets> as its inputs. Training
with contrastive loss pulls positive pairs closer together and pushes negative pairs farther apart.
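As a minimal sketch of one common variant, the following implements an in-batch softmax contrastive loss in which the other documents in the batch serve as implicit negatives; the exact loss used by any particular embedding model may differ, and the temperature value is an illustrative assumption.

import tensorflow as tf

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    # query_emb, doc_emb: (batch, dim) L2-normalized embeddings from the two towers.
    # Each query's positive is the document at the same batch index;
    # all other documents in the batch act as implicit negatives.
    logits = tf.matmul(query_emb, doc_emb, transpose_b=True) / temperature
    labels = tf.range(tf.shape(logits)[0])
    return tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True))

# Example with random stand-in "embeddings" from two towers.
q = tf.math.l2_normalize(tf.random.normal((4, 8)), axis=-1)
d = tf.math.l2_normalize(tf.random.normal((4, 8)), axis=-1)
print(in_batch_contrastive_loss(q, d))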
Similar to foundation model training, training of an embedding model from scratch usually
includes two stages: pretraining (unsupervised learning) and fine tuning (supervised
learning). Nowadays, the embedding models are usually directly initialized from foundation
models such as BERT, T5, GPT, Gemini, CoCa. You can use these base models to leverage the
massive knowledge that has been learned from the large-scale pretraining of the foundation
models. The fine-tuning of the embedding models can have one or more phases, and the
fine-tuning datasets can be created in various ways, including human labeling, synthetic
dataset generation, model distillation, and hard negative mining.
To use embeddings for downstream tasks like classification or named entity recognition,
extra layers (for example, softmax classification layer) can be added on top of the embedding
models. The embedding model can either be frozen (especially when the training dataset is
small), trained from scratch, or fine-tuned together with the downstream tasks.
Vertex AI provides the ability to customize the Vertex AI text embedding models.25 Users can
also choose to fine-tune the models directly; see endnote 26 for an example of fine-tuning a
BERT model using the TensorFlow Model Garden. You can also load embedding models
directly from TF Hub and fine-tune on top of them. Snippet 7 shows an example of how to
build a classifier based on TF Hub models.
class Classifier(tf.keras.Model):
    def __init__(self, num_classes):
        super(Classifier, self).__init__(name="prediction")
        # tfhub_link: URL of a BERT-style TF Hub encoder that returns a 'pooled_output' entry.
        self.encoder = hub.KerasLayer(tfhub_link, trainable=True)
        self.dropout = tf.keras.layers.Dropout(0.1)
        self.dense = tf.keras.layers.Dense(num_classes)

    def call(self, inputs):
        # Encode the inputs, apply dropout, and project to class logits.
        x = self.encoder(inputs)["pooled_output"]
        return self.dense(self.dropout(x))

Snippet 7. Building a classifier on top of a TF Hub embedding model
So far you’ve seen the various types of embeddings, techniques and best practices to train
them for various data modalities, and some of their applications. The next section discusses
how to persist and search the embeddings that have been created in a fast and scalable way
for production workloads.
Vector search
Full-text keyword search has been the lynchpin of modern IT systems for years. Full-text
search engines and databases (relational and non-relational) often rely on explicit keyword
matching. For example, if you search for ‘cappuccino’ the search engine or database returns
all documents that mention the exact query in the tags or text description. However, if the
keyword is misspelled or the concept is described with different wording, a traditional
keyword search returns incorrect or no results. There are traditional approaches that tolerate
misspellings and other typographical errors; however, they are still unable to find the results
whose underlying semantic meaning is closest to the query. This is where vector search is
very powerful: it uses the vector or embedded semantic representation of documents.
Vector search lets you go beyond searching for exact query literals and allows you to
search for meaning across various data modalities. This provides more nuanced
results. After you have a function that can compute embeddings of various items, you
compute the embedding of the items of interest and store this embedding in a database.
You then embed the incoming query in the same vector space as the items. Next, you have
to find the best matches to the query. This process is analogous to finding the most ‘similar’
matches across the entire collection of searchable vectors: similarity between vectors can be
computed using a metric such as euclidean distance, cosine similarity, or dot product.
Euclidean distance (L2 distance) is a geometric measure of the distance between two points
in a vector space; it tends to work well in lower dimensions. Cosine similarity measures the
angle between two vectors, and the inner (dot) product is the projection of one vector onto
another; the two are equivalent when the vectors have a norm of 1. Dot-product similarity
often works better for higher-dimensional data. Vector databases store embeddings and help
manage and operationalize the complexity of vector search at scale, while also addressing
common database needs.
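For example, the three measures can be computed directly with NumPy; the vectors below are arbitrary toy values.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 3.5])

euclidean = np.linalg.norm(a - b)                       # L2 distance
dot = np.dot(a, b)                                      # inner / dot product
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine similarity

# With unit-normalized vectors, cosine similarity and dot product coincide.
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(euclidean, dot, cosine, np.dot(a_n, b_n))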
The most straightforward way to find the most similar match is to run a traditional linear
search by comparing the query vector with each document vector and return the one with
the highest similarity. However, the runtime of this approach scales linearly (O(N)) with the
number of documents or items to search, which is unacceptably slow for most use cases
involving several million documents or more. Using approximate nearest neighbor
(ANN) search for that purpose is more practical. ANN is a technique for finding the closest
points to a given point in a dataset with a small margin of error - but with a tremendous boost
in performance. There are many approaches with varying trade-offs across scale, indexing
time, performance, simplicity and more.27 They use one or more implementations of the
following techniques: quantization, hashing, clustering and trees, among others. Some of the
most popular approaches are discussed below.
Locality sensitive hashing (LSH) 28 is a technique for finding similar items in a large dataset.
It does this by creating one or more hash functions that map similar items to the same hash
bucket with high probability. This means that you can quickly find all of the similar items to
a given item by only looking at the candidate items in the same hash bucket (or adjacent
buckets) and do a linear search amongst those candidate pairs. This allows for significantly
faster lookups within a specific radius. The number of hash functions/tables and buckets
determine the search recall/speed tradeoff, as well as the false positive / true positive one.
Having too many hash functions might cause similar items to different buckets, while too few
might result in too many items falsely being hashed to the same bucket and the number of
linear searches to increase.
Another intuitive way to think about LSH is grouping residences by their postal code or
neighborhood name. Then based on where someone chooses to move you look at the
residences for only that neighborhood and find the closest match.
Figure 11. Visualization of how LSH uses random hyperplanes to partition the vector space
Tree-based algorithms work similarly. For example, the Kd-tree approach works by creating
the decision boundaries by computing the median of the values of the first dimension, then
that of the second dimension and so on. This approach is very much like a decision tree.
Naturally, this can be ineffective if the searchable vectors are high-dimensional. In that case,
the Ball-tree algorithm is better suited. It is similar in functionality, except that instead of
using dimension-wise medians it creates buckets based on the radial distance of the data points
from the center. Here is an example of the implementation of these three approaches:
import numpy as np
from lshashing import LSHRandom
from vertexai.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@004")
test_items = [
    "The earth is spherical.",
    "The earth is a planet.",
    "I like to eat at a restaurant."]
query = "the shape of earth"
embedded_test_items = np.array([embedding.values for embedding in model.get_embeddings(test_items)])
embedded_query = np.array(model.get_embeddings([query])[0].values)

# LSH
n_neighbors = 2
lsh_random_parallel = LSHRandom(embedded_test_items, 4, parallel=True)
lsh_random_parallel.knn_search(embedded_test_items, embedded_query, n_neighbors, 3, parallel=True)
# output for all 3 indices = [0, 1], distances [0.66840428, 0.71048843] for the first 2 neighbours
# ANN retrieved the same ranking of items as brute force in a much more scalable manner
Snippet 8. Using scikit-learn29 and lshashing30 for ANN with LSH, KD/Ball-tree and linear search
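For comparison, the KD-tree, Ball-tree, and linear (brute-force) searches mentioned in the caption can be run on the same embeddings with scikit-learn's NearestNeighbors; the parameter choices below are illustrative assumptions.

from sklearn.neighbors import NearestNeighbors

# Search with three different algorithms over the embeddings from Snippet 8.
for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=2, algorithm=algorithm, metric="euclidean")
    nn.fit(embedded_test_items)
    distances, indices = nn.kneighbors(embedded_query.reshape(1, -1))
    print(algorithm, indices, distances)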
Hashing and tree-based approaches can also be combined and extended upon to obtain
the optimal tradeoff between recall and latency for search algorithms. FAISS with HNSW and
ScaNN are good examples.
Figure 12. Diagram showing how HNSW ‘zooms in’ to perform ANN
One of the FAISS (Facebook AI similarity search) implementations leverages the concept
of hierarchical navigable small world (HNSW)31 graphs to perform vector similarity search in
sub-linear (O(log n)) runtime with a good degree of accuracy. An HNSW is a proximity graph
with a hierarchical structure where the graph links are spread across different layers. The top
layer has the longest links and the bottom layer has the shortest ones. As shown in Figure 12, the
search starts at the topmost layer where the algorithm greedily traverses the graph to find
the vertex most semantically similar to the query. Once the local minimum for that layer is
found, it then switches to the graph for the closest vertex on the layer below. This process
continues iteratively until the local minimum for the lowest layer is found, with the algorithm
keeping track of all the vertices traversed to return the K-nearest neighbors. This algorithm
can be optionally augmented with quantization and vector indexing to boost speed and
memory efficiency.
import faiss
import numpy as np

M = 32   # creating a high-degree graph: higher recall at the cost of larger index & search time
d = 768  # dimensionality of the vectors/embeddings
index = faiss.IndexHNSWFlat(d, M)
# Build the index using the embeddings from Snippet 8 (FAISS expects float32).
index.add(embedded_test_items.astype('float32'))
# Execute the ANN search for the 2 nearest neighbors of the query.
index.search(np.expand_dims(embedded_query, axis=0).astype('float32'), k=2)
Snippet 9. Indexing and executing ANN search with the FAISS library using HNSW
ScaNN
Google developed the scalable approximate nearest neighbor (ScaNN)32,33 approach, which is
used across many of its products and services and is available externally to all Google Cloud
customers through Vertex AI Vector Search. ScaNN performs efficient vector search through a
sequence of steps, described below, each with its own set of parameters.
The first step is an optional partitioning step at training time: one of several available
algorithms partitions the vector store into logical partitions/clusters in which semantically
related vectors are grouped together. Partitioning is optional for small datasets, but for larger
datasets with more than roughly 100k embedding vectors it is crucial: pruning the search
space to a few partitions reduces it by orders of magnitude and therefore significantly speeds
up queries. The pruning is configured through the number of partitions and the number of
partitions to search; larger values lead to better recall but longer partition-creation time.
A good heuristic is to set the number of partitions to the square root of the number of vectors.
Figure 13. Search space partitioning & pruning (left) and approximate scoring (right)
At query time, ScaNN uses the user-specified distance measure to select the top partitions
(the number of which is also specified by the user), and then executes the scoring step. In this
step ScaNN compares the query with all the points in the selected partitions and selects the
top K'. The distance computation can be configured as exact or approximate; approximate
distance computation leverages either standard product quantization or anisotropic
quantization, the latter being a method developed for ScaNN that gives better speed and
accuracy tradeoffs.
Finally, as a last step, the user can optionally choose to rescore the user-specified top K
results more accurately. The result is the industry-leading speed/accuracy tradeoff ScaNN is
known for, as can be seen in Figure 14. Snippet 10 shows a code example.
Figure 14. Accuracy/speed tradeoffs for various SOTA ANN search algorithms
import tensorflow as tf
import tensorflow_recommenders as tfrs
from vertexai.language_models import TextEmbeddingModel, TextEmbeddingInput

# Embed documents & query (from Snippet 9) and convert them to tensors and tf.datasets.
# LM_embed: a text-embedding helper, a task-type-aware variant of LLM_embed from Snippet 4.
embedded_query = tf.constant([LM_embed(query, "RETRIEVAL_QUERY")])
embedded_docs = [LM_embed(doc, "RETRIEVAL_DOCUMENT") for doc in searchable_docs]
embedded_docs = tf.data.Dataset.from_tensor_slices(embedded_docs).enumerate().batch(1)

# Build index from the tensorflow dataset and execute ANN search with the dot-product metric.
scann = tfrs.layers.factorized_top_k.ScaNN(
    distance_measure='dot_product',
    num_leaves=4,            # increase for more partitions: higher recall, higher indexing time
    num_leaves_to_search=2)  # increase for higher recall but increased latency
scann = scann.index_from_dataset(embedded_docs)
scann(embedded_query, k=2)
Snippet 10. Using Tensorflow Recommenders34 to perform ANN search using the ScaNN algorithm
So far we have seen both state-of-the-art (SOTA) and traditional ANN search algorithms
(ScaNN, FAISS with HNSW, LSH, KD-tree, and Ball-tree) and examined the speed/accuracy
tradeoffs they provide. However, to use these algorithms they need to be deployed in a
scalable, secure, and production-ready manner. For that we need vector databases.
Vector databases
Vector embeddings embody the semantic meaning of data, while vector search algorithms
provide a means of efficiently querying them. Historically, traditional databases lacked the
means to combine the two, so that the most relevant embeddings could be stored, queried,
and retrieved in a secure, scalable, and flexible manner for complex analysis and real-time,
enterprise-grade applications. This is what gave rise to vector databases, which are built from
the ground up to manage these embeddings for production scenarios. Due to the recent
popularity of generative AI, an increasing number of traditional databases are also starting to
incorporate vector search functionality in addition to traditional search ('hybrid search').
Let's look at the workflow of a simple vector database with hybrid search capabilities.
Each vector database differs in its implementation, but the general flow is shown in Figure 15:
1. An appropriately trained embedding model is used to embed the relevant data points as
vectors with fixed dimensions.
2. The vectors are then augmented with appropriate metadata and complementary
information (such as tags) and indexed using the specified algorithm for efficient search.
3. An incoming query gets embedded with the same model and is used to query and return a
specified number of the most semantically similar items along with their associated
unembedded content/metadata. Some databases also provide caching, pre-filtering (based
on tags), and post-filtering capabilities (reranking using another, more accurate model) to
further enhance query speed and quality.
There are quite a few vector databases available today, each tailored to different business
needs and considerations. A few good examples of commercially managed vector databases
include Google Cloud's Vertex AI Vector Search,35 Google Cloud's AlloyDB and Cloud SQL for
PostgreSQL, Elasticsearch,36 and Pinecone,37 to name a few. Vertex AI Vector Search is a vector
database built by Google that uses the ScaNN algorithm for fast vector search, while still
maintaining all the security and access guarantees of Google Cloud. AlloyDB and Cloud SQL
for PostgreSQL support vector search through the open source pgvector38 extension, which
allows SQL queries to combine ANN search with traditional predicates, with the usual
transactional semantics applying to the ANN search index. AlloyDB also has a ScaNN index
extension that is a native implementation of ScaNN and is pgvector-compatible. Similarly,
many other traditional databases have started to add plugins to enable vector search.
Pinecone and Weaviate leverage HNSW for their fast vector search in addition to the ability to filter data using
traditional search. Among the open source options, Weaviate39 and ChromaDB40 provide a
full suite of functionality upon deployment and can also be run in memory during the
prototyping phase.
Operational considerations
Vector Databases are critical to managing the majority of technical challenges that arise
with storing and querying embeddings at scale. Some of these challenges are specific to the
nature of vector stores, while others overlap with that of traditional databases. These include
horizontal and vertical scalability, availability, data consistency, real time updates, backups,
access control, compliance, and much more. However, there are also many more challenges
and considerations you need to take into account while using embedding and vector stores.
Firstly, embeddings, unlike traditional content, can mutate over time. This means that the
same text, image, video or other content could and should be embedded using different
embedding models to optimize for the performance of the downstream applications. This is
especially true for embeddings of supervised models after the model is retrained to account
for various drifts or changing objectives. The same applies to unsupervised models when
they are updated to newer versions. However, frequently updating the embeddings
- especially those trained on large amounts of data - can be prohibitively expensive.
Consequently, a balance needs to be struck. This necessitates a well-defined automated
process to store, manage, and possibly purge embeddings from the vector databases taking
the budget into consideration.
Secondly, while embeddings are great at representing semantic information, sometimes they
can be suboptimal at representing literal or syntactic information. This is especially true for
domain-specific words or IDs. These values are potentially missing or underrepresented
in the data the embedding models were trained on. For example, if a user enters a query
that contains a specific ID along with a lot of text, the model might find semantically similar
neighbors that match the meaning of the text closely but not the ID, which is the most
important component in this context. You can overcome this challenge by
using a combination of full-text search to pre-filter or post-filter the search space before
passing it onto the semantic search module.
Another important point to consider is that depending on the nature of the workload in which
the semantic query occurs, it might be worth relying on different vector databases. For
example, for OLTP workloads that require frequent read/write operations, an operational
database like Postgres or Cloud SQL is the best choice. For large-scale OLAP analytical
workloads and batch use cases, using BigQuery's vector search is preferable.
Applications
Embeddings models are one of the fundamental machine learning models that power a
variety of applications. We summarize some popular applications in the following table.
Task: Retrieval
Description: Given a query and a set of objects (for example, documents, images, and videos),
retrieve the most relevant objects. Based on the definition of relevant objects, the subtasks
include question answering and recommendations.
Embeddings, together with vector stores providing ANN search, are powerful tools that can
be used for a variety of applications. These include retrieval augmented generation (RAG) for
LLMs, search, recommendation systems, anomaly detection, few-shot classification, and
much more.

For ranking problems like search and recommendations, embeddings are normally used
at the first stage of the process: they retrieve potentially good candidates that are
semantically similar and consequently improve the relevance of the results. Since the
amount of information to sort through can be quite large (in some cases millions or even
billions of items), ANN techniques like ScaNN greatly aid in scalably narrowing the search space.
Let’s look at an application which combines both LLMs and RAG to help answer questions.
Retrieval augmented generation (RAG) for Q&A is a technique that combines the best of both
worlds from retrieval and generation. It first retrieves relevant documents from a knowledge
base and then uses prompt expansion to generate an answer from those documents. Prompt
expansion is a technique that when combined with database search can be very powerful.
With prompt expansion the model retrieves relevant information from the database (mostly
using a combination of semantic search and business rules), and augments the original
prompt with it. The model uses this augmented prompt to generate much more interesting,
factual, and informative content than with retrieval or generation alone.
RAG can help with a common problem of LLMs: their tendency to 'hallucinate' and
generate factually incorrect but plausible-sounding responses. Although RAG can reduce
hallucinations, it does not completely eliminate them. What can help mitigate this problem
further is to also return the sources from the retrieval and do a quick coherence check either
by a human or an LLM. This ensures the LLM response is consistent with the semantically
relevant sources. Let’s look at an example (Snippet 11) of RAG with sources, which can be
scalably implemented using Vertex AI LLM text embeddings and Vertex AI Vector Search in
conjunction with libraries like LangChain.41
# Index Constants
DISPLAY_NAME = "<my_matching_engine_index_id>"
DEPLOYED_INDEX_ID = "yourname01" # you set this. Start with a letter.
# Create an endpoint
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
display_name=f"{DISPLAY_NAME}-endpoint", public_endpoint_enabled=True
)
# retrieve the id of the most recently deployed index, or manually look up the index deployed above
index_id=my_index_endpoint.deployed_indexes[-1].index.split("/")[-1]
endpoint_id= my_index_endpoint.name
# Input texts
texts= [
"The earth is spherical.",
"The earth is a planet.",
"I like to eat at a restaurant.",
]
retriever=vector_store.as_retriever(search_kwargs={'k':1 })
messages = [
SystemMessagePromptTemplate.from_template(prompt_template),
HumanMessagePromptTemplate.from_template("{question}")
]
prompt = ChatPromptTemplate.from_messages(messages)
Snippet 11. Build/deploy ANN Index for Vertex AI Vector Search and use RAG with LLM prompts to generate
grounded results/sources.
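One possible way to finish wiring the retrieval chain is sketched below; the LangChain module paths, chain class, and model name are assumptions that vary across library versions, and the custom chat prompt defined above is omitted for brevity.

from langchain.chains import RetrievalQAWithSourcesChain
from langchain.llms import VertexAI  # module path differs across LangChain versions

# LLM used for the generation step; the model name is illustrative.
llm = VertexAI(model_name="text-bison@002", temperature=0.0)

# Combine the retriever backed by Vertex AI Vector Search with the LLM.
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever)

result = chain({"question": "What is the shape of the earth?"}, return_only_outputs=True)
print(result["answer"], result["sources"])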
Figure 16. Model responses along with sources demonstrating the LLM being grounded in the database
As Figure 16 shows, the output grounds the LLM in the semantically similar results retrieved
from the database (and hence the model refuses to answer when no context can be found in
the database). This not only significantly reduces hallucination, but also provides sources
that can be verified, either by a human or by another LLM.
Summary
In this whitepaper we have discussed various methods to create, manage, store, and retrieve
embeddings of various data modalities effectively in the context of production-grade
applications. Creating, maintaining and using embeddings for downstream applications can
be a complex task that involves several roles in the organization. However, by thoroughly
operationalizing and automating their usage, you can safely leverage the benefits they offer
across some of the most important applications. Some key takeaways from this
whitepaper include:
1. Choose your embedding model wisely for your data and use case. Ensure the data used in
inference is consistent with the data used in training. The distribution shift from training to
inference can come from various areas, including domain distribution shift or downstream
task distribution shift. If no existing embedding model fits the current inference data
distribution, fine-tuning an existing model can significantly improve performance.
Another tradeoff comes from model size: larger deep neural network based models
(including large multimodal models) usually have better performance, but at the cost of
longer serving latency. Cloud-based embedding services can mitigate this issue by providing
both high-quality and low-latency embedding serving. For most business applications, a
pre-trained embedding model provides a good baseline, which can be further fine-tuned or
integrated into downstream models. In cases where the data has an inherent graph structure,
graph embeddings can provide superior performance.
2. Once your embedding strategy is defined, it's important to choose an appropriate vector
database that suits your budget and business needs. It might seem quicker to prototype with
the available open source alternatives, but opting for a more secure, scalable, and
battle-tested managed vector database is usually the better choice in the long term. There
are various open source alternatives using one of the many powerful ANN vector search
algorithms, with ScaNN and HNSW having proven to provide some of the best accuracy and
performance tradeoffs, in that order.
Endnotes
1. Rai, A., 2020, Study of various methods for tokenization. In Advances in Natural Language Processing.
Available at: https://doi.org/10.1007/978-981-15-6198-6_18
2. Pennington, J., Socher, R. & Manning, C., 2014, GloVe: Global Vectors for Word Representation. [online]
Available at: https://nlp.stanford.edu/pubs/glove.pdf.
3. Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q. V. & Hinton, G., 2016, Swivel: Improving embeddings
by noticing what's missing. ArXiv, abs/1602.02215. Available at: https://arxiv.org/abs/1602.02215.
4. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J., 2013, Efficient estimation of word representations
in vector space. ArXiv, abs/1301.3781. Available at: https://arxiv.org/pdf/1301.3781.pdf.
5. Rehurek, R., 2021, Gensim: open source python library for word and document embeddings. Available
at: https://radimrehurek.com/gensim/intro.html.
6. Bojanowski, P., Grave, E., Joulin, A. & Mikolov, T., 2016, Enriching word vectors with subword information.
ArXiv, abs/1607.04606. Available at: https://arxiv.org/abs/1607.04606.
7. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R., 1990, Indexing by latent
semantic analysis. Journal of the American Society for Information Science, 41(6), pp. 391-407.
8. Blei, D. M., Ng, A. Y., & Jordan, M. I., 2001, Latent Dirichlet allocation. In T. G. Dietterich, S. Becker, & Z.
Ghahramani (Eds.), Advances in Neural Information Processing Systems 14. MIT Press, pp. 601-608. Available
at: https://proceedings.neurips.cc/paper/2001/hash/296472c9542ad4d4788d543508116cbc-Abstract.html.
9. Muennighoff, N., Tazi, N., Magne, L., & Reimers, N., 2022, Mteb: Massive text embedding benchmark. ArXiv,
abs/2210.07316. Available at: https://arxiv.org/abs/2210.07316.
10. Le, Q. V., Mikolov, T., 2014, Distributed representations of sentences and documents. ArXiv, abs/1405.4053.
Available at: https://arxiv.org/abs/1405.4053.
11. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K., 2019, BERT: Pre-training deep Bidirectional Transformers
for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
pp. 4171-4186. Available at: https://www.aclweb.org/anthology/N19-1423/.
12. Reimers, N. & Gurevych, I., 2020, Making monolingual sentence embeddings multilingual using knowledge
distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pp. 254-265. Available at: https://www.aclweb.org/anthology/2020.emnlp-main.21/.
13. Gao, T., Yao, X. & Chen, D., 2021, Simcse: Simple contrastive learning of sentence embeddings. ArXiv,
abs/2104.08821. Available at: https://arxiv.org/abs/2104.08821.
14. Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R. & Wei, F., 2022, Text embeddings by
weakly supervised contrastive pre-training. ArXiv. Available at: https://arxiv.org/abs/2201.01279.
15. Khattab, O. & Zaharia, M., 2020, colBERT: Efficient and effective passage search via contextualized late
interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 39-48. Available at: https://dl.acm.org/doi/10.1145/3397271.3401025.
16. Lee, J., Dai, Z., Duddu, S. M. K., Lei, T., Naim, I., Chang, M. W. & Zhao, V. Y., 2023, Rethinking the role of token
retrieval in multi-vector retrieval. ArXiv, abs/2304.01982. Available at: https://arxiv.org/abs/2304.01982.
17. TensorFlow, 2021, TensorFlow hub, a model zoo with several easy to use pre-trained models. Available
at: https://tfhub.dev/.
18. Zhang, W., Xiong, C., & Zhao, H., 2023, Introducing BigQuery text embeddings for NLP tasks.
Google Cloud Blog. Available at: https://cloud.google.com/blog/products/data-analytics/introducing
-bigquery-text-embeddings.
21. Cai, H., Zheng, V. W., & Chang, K. C., 2020, A survey of algorithms and applications related with graph
embedding. In Proceedings of the 29th ACM International Conference on Information & Knowledge
Management. Available at: https://dl.acm.org/doi/10.1145/3444370.3444568.
22. Cai, H., Zheng, V. W., & Chang, K. C., 2017, A comprehensive survey of graph embedding: problems,
techniques and applications. ArXiv, abs/1709.07604. Available at: https://arxiv.org/pdf/1709.07604.pdf.
23. Hamilton, W. L., Ying, R. & Leskovec, J., 2017, Inductive representation learning on large graphs.
In Advances in Neural Information Processing Systems 30. Available at:
https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf.
24. Dong, Z., Ni, J., Bikel, D. M., Alfonseca, E., Wang, Y., Qu, C. & Zitouni, I., 2022, Exploring dual encoder
architectures for question answering. ArXiv, abs/2204.07120. Available at: https://arxiv.org/abs/2204.07120.
25. Google Cloud, 2021, Vertex AI Generative AI: Tune Embeddings. Available at:
https://cloud.google.com/vertex-ai/docs/generative-ai/models/tune-embeddings.
26. TensorFlow, 2021, TensorFlow Models: NLP, Fine-tune BERT. Available at:
https://www.tensorflow.org/tfmodels/nlp/fine_tune_bert.
27. Matsui, Y., 2020, Survey on approximate nearest neighbor methods. ACM Computing Surveys (CSUR), 53(6),
Article 123. Available at: https://wangzwhu.github.io/home/file/acmmm-t-part3-ann.pdf.
28. Friedman, J. H., Bentley, J. L. & Finkel, R. A., 1977, An algorithm for finding best matches in logarithmic
expected time. ACM Transactions on Mathematical Software (TOMS), 3(3), pp. 209-226. Available at:
https://dl.acm.org/doi/pdf/10.1145/355744.355745.
29. Scikit-learn, 2021, Scikit-learn, a library for unsupervised and supervised neighbors-based learning methods.
Available at: https://scikit-learn.org/.
30. lshashing, 2021, An open source python library to perform locality sensitive hashing. Available at:
https://pypi.org/project/lshashing/.
31. Malkov, Y. A., Yashunin, D. A., 2016, Efficient and robust approximate nearest neighbor search using
hierarchical navigable small world graphs. ArXiv, abs/1603.09320. Available at:
https://arxiv.org/pdf/1603.09320.pdf.
32. Google Research, 2021, A library for fast ANN by Google using the ScaNN algorithm. Available at:
https://github.com/google-research/google-research/tree/master/scann.
33. Guo, R., Zhang, L., Hinton, G. & Zoph, B., 2020, Accelerating large-scale inference with anisotropic vector
quantization. ArXiv, abs/1908.10396. Available at: https://arxiv.org/pdf/1908.10396.pdf.
34. TensorFlow, 2021, TensorFlow Recommenders, an open source library for building ranking & recommender
system models. Available at: https://www.tensorflow.org/recommenders.
35. Google Cloud, 2021, Vertex AI Vector Search, Google Cloud’s high-scale low latency vector database.
Available at: https://cloud.google.com/vertex-ai/docs/vector-search/overview.
36. Elasticsearch, 2021, Elasticsearch: a RESTful search and analytics engine. Available at:
https://www.elastic.co/elasticsearch/.
37. Pinecone, 2021, Pinecone, a commercial fully managed vector database. Available at:
https://www.pinecone.io.
38. pgvector, 2021, Open Source vector similarity search for Postgres. Available at:
https://github.com/pgvector/pgvector.
39. Weaviate, 2021, Weaviate, an open source vector database. Available at: https://weaviate.io/.
40. ChromaDB, 2021, ChromaDB, an open source vector database. Available at: https://www.trychroma.com/.
41. LangChain, 2021, LangChain, an open source framework for developing applications powered by language
models. Available at: https://langchain.com.