
Efficient and Governed Generative AI With Dataiku

THE LLM MESH
A common backbone for GenAI applications, enabling choice and flexibility among the growing number of models and providers.

AI-POWERED ASSISTANTS
Go faster and farther with AI Prepare, AI Code Assistant, and AI Explain, all of which improve efficiency and the overall product.

DATAIKU ANSWERS
A packaged, scalable web application to democratize enterprise-ready LLM chat and retrieval-augmented generation (RAG).

LLM-POWERED DATA
No-code text recipes enhanced with pre-trained Hugging Face models and LLMs for text summarization, classification, and other common language tasks.

PROMPT STUDIOS
Iteratively design and evaluate LLM prompts, compare performance and cost across models, and operationalize GenAI in your data projects.

GENAI SOLUTIONS
Pre-built Generative AI use cases and applications for even faster time to value.

LEARN MORE ABOUT THE LLM MESH


The LLM Mesh
A Practical Guide to Using Generative
AI in the Enterprise

With Early Release ebooks, you get books in their earliest form—the author’s raw and unedited content as they write—so you can take advantage of these technologies long before the official release of these titles.

Kurt Muehmel

Beijing Boston Farnham Sebastopol Tokyo


The LLM Mesh
by Kurt Muehmel
Copyright © 2025 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional
use. Online editions are also available for most titles (http://oreilly.com). For more
information, contact our corporate/institutional sales department: 800-998-9938 or
corporate@oreilly.com.

Editors: Jeff Bleiel and Aaron Black
Production Editor: Kristen Brown
Interior Designer: David Futato
Cover Designer: Susan Brown
Illustrator: Kate Dullea

January 2025: First Edition

Revision History for the Early Release


2024-08-02: First Release
2024-10-01: Second Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. The LLM Mesh,
the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the author and do not represent the
publisher’s views. While the publisher and the author have used good faith efforts
to ensure that the information and instructions contained in this work are accurate,
the publisher and the author disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this
work is at your own risk. If any code samples or other technology this work contains
or describes is subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof complies with such
licenses and/or rights.
This work is part of a collaboration between O’Reilly and Dataiku. See our statement
of editorial independence.

978-1-098-17661-7
[LSI]
Table of Contents

Brief Table of Contents (Not Yet Final). . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

1. Using LLMs in the Enterprise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1


What Is an LLM Mesh? 3
The Right Model for the Right Application 6
Bottom Line: Why the LLM Mesh? 19

2. Objects for Building LLM-Powered Applications. . . . . . . . . . . . . . . . . 21


The Potential of New LLM-powered applications 22
LLM Mesh-Related Objects: An Overview 27
The Objects of an LLM Mesh in Detail 30
Cataloging LLM-Related Objects 45
Conclusion 47

Brief Table of Contents (Not Yet Final)

Chapter 1: Using LLMs in the Enterprise (available)


Chapter 2: Objects for Building LLM-Powered Applications (available)
Chapter 3: Cost Reporting and Management (unavailable)
Chapter 4: PII Detection and Content Moderation (unavailable)
Chapter 5: Audit Trail, Security, and Permissions (unavailable)
Chapter 6: Retrieval Augmentation (unavailable)
Chapter 7: Conclusion (unavailable)

CHAPTER 1
Using LLMs in the Enterprise

A Note for Early Release Readers


With Early Release ebooks, you get books in their earliest form—
the author’s raw and unedited content as they write—so you can
take advantage of these technologies long before the official release
of these titles.
This will be the 1st chapter of the final book.
If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out to the editor at jbleiel@oreilly.com.

“May you live in times of rapid technological progress.” This is the
blessing and the curse of our current moment. Recent advances in
AI and a growing interest in technology, thanks to the release of
wildly popular consumer products, have led to a frenzy of interest
in, and use of, AI, and Large Language Models (LLMs) in particular,
in the enterprise.
However, AI and LLMs remain nascent in the enterprise, meaning
that best practices for their use are being defined. At the same time,
the core technologies — the models themselves, technologies to host
and serve the models, etc. — are evolving rapidly.
Table 1-1 provides a brief timeline of the release of various mod‐
els and technologies that could be relevant for enterprise use. The

diversity and speed of release create both opportunities and chal‐
lenges when you are looking to use these technologies in production
use cases.

Table 1-1. A (Non-Exhaustive) Timeline of Enterprise-Relevant Model and Product Releases

• OpenAI, GPT-3 (May 2020): 175 billion parameter LLM with a 2048-token context window
• OpenAI, ChatGPT (November 2022): Consumer chatbot application, powered by GPT-3.5 Turbo
• Microsoft, Azure OpenAI Service (January 2023): Managed service offering LLMs from OpenAI
• Amazon Web Services, Bedrock (September 2023): Managed service offering LLMs from various developers
• Dataiku, LLM Mesh (September 2023): Commercial LLM Mesh offering for connecting to LLMs and building LLM-powered applications in the enterprise
• Databricks, DBRX (March 2024): Open-weights mixture of experts model with 132B total parameters and a 32k-token input context window, licensed for commercial use
• Meta, LLaMA 3 (8B, 70B) (April 2024): Updated LLM with 4096-token input context window, with updated license allowing certain commercial uses
• Mistral, Mixtral 8x22B (April 2024): Open-weights mixture of experts model with up to 141B parameters and a 64k-token input context window, licensed for commercial use
• OpenAI, GPT-4o (May 2024): Multimodal LLM supporting voice-to-voice generation and a 128k-token input context window
• Google, Gemini 1.5 Pro (May 2024): Multimodal LLM with a 1M-token input context window

Today, you can build entirely new capabilities that would not have
been possible previously, to improve the lives of your employees and
better serve your customers. But you also have to keep up with rapid
changes in the core technologies and use techniques that have not
been fully proven. We are all now at the cutting edge.
This diversity of options among the technologies and techniques is
truly a great thing. In fact, we are just scratching the surface for
the potential uses of LLMs in the enterprise. It’s easy to imagine
a future where these technologies are generating massive amounts
of value for the enterprise, automating mundane tasks, and making
new products and services possible.
In this chapter, we will briefly introduce what an LLM Mesh is, and
then take an in-depth look at the many different types of LLMs that
can be appropriate for use in the enterprise. We’ll discuss different
characteristics of models, and how models are built, published, run,
and perform.
After reading this chapter, you should be able to think about how
you would want to use different models for different applications in
your business. Given this multitude of models, you will see why an
LLM Mesh architecture is going to be a key part of your AI strategy
going forward.

What Is an LLM Mesh?


An LLM Mesh is an architecture paradigm for building LLM-
powered applications in the enterprise. There are three principles
regarding what an LLM Mesh should accomplish. An LLM Mesh
should enable you to:

1. Access various LLM-related services through an abstraction layer.
2. Provide federated services for control and analysis.
3. Provide central discovery and documentation for LLM-related
objects.

These principles allow for LLM-powered applications to be built in a
modular manner, simplifying their development and maintenance.
Figure 1-1 illustrates an LLM Mesh architecture being used to
develop two applications. Various objects, referenced in the Catalog
and accessed via the Gateway, are combined to build the logic of
the applications. Federated services provide control and analysis
throughout the lifecycle of the application.



Figure 1-1. An LLM Mesh architecture
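To make the first principle concrete, the sketch below shows what a minimal abstraction layer might look like. It is illustrative only: the class names are invented, and EchoProvider stands in for real provider clients (OpenAI, Bedrock, a self-hosted server, and so on).

```python
class EchoProvider:
    """Stand-in for a real provider client (OpenAI, Bedrock, a self-hosted server...)."""
    def __init__(self, model: str):
        self.model = model

    def complete(self, prompt: str, **params) -> str:
        return f"[{self.model}] response to: {prompt[:40]}..."


class LLMGateway:
    """Abstraction layer: applications ask for a logical model name and the
    gateway routes the call to whichever provider is registered under it."""
    def __init__(self):
        self.connections: dict[str, EchoProvider] = {}

    def register(self, name: str, client: EchoProvider) -> None:
        self.connections[name] = client

    def complete(self, name: str, prompt: str, **params) -> str:
        # A real gateway would also apply access control, logging, content
        # moderation, and cost reporting here (the federated services).
        return self.connections[name].complete(prompt, **params)


gateway = LLMGateway()
gateway.register("general-chat", EchoProvider("gpt-4o"))
gateway.register("biomed-qa", EchoProvider("BioMedLM"))
print(gateway.complete("general-chat", "Summarize this contract clause: ..."))
```

An application only refers to the logical name ("general-chat"); swapping the underlying model or provider is a change to the gateway configuration, not to the application.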

It is certainly possible to build LLM-powered applications in the
enterprise without an LLM Mesh. Many of the initial applications
that organizations have built since the release of ChatGPT do not
use an LLM Mesh. In these cases, the logic for connecting the vari‐
ous objects of the application (LLM services, retrieval services, etc.)
is built directly into the application, as are any additional capabilities
such as access controls or logging. This approach is perfectly appro‐
priate for building a first proof of concept, or a single application.
An LLM Mesh, however, becomes valuable when:

1. The total number of LLM-powered applications being developed begins to grow,
2. More teams start building and using the applications,
3. More complex LLM-powered applications are being designed and built.

In this context, the LLM Mesh will accelerate the development of
the applications, simplify their maintenance, and help to ensure that
the applications meet enterprise standards for safety, security, and
performance.

Why LLMs and Not Generative AI?


An LLM Mesh architecture focuses on LLMs and not
Generative AI more broadly because LLMs are the core
building blocks of the AI applications that will be built in
the enterprise.
LLMs are large neural networks trained on text data. They
possess a variety of natural language processing capabili‐
ties. Many, but not all, LLMs can generate text. Generative
AI is a broader category of AI that includes models that
can generate text, audio, images, and videos.
Beyond simply generating text, LLMs are also used to rea‐
son through a problem, to give instructions to various
tools, and to write the code to connect to various tools.
While image-generating models, for example, can be useful
in the enterprise, they are not relevant in the context of
building sophisticated AI applications that are the focus of
the LLM Mesh.

An LLM Mesh provides a gateway not only to LLMs, but also to
the full range of objects that are needed to build fully-featured,
LLM-powered applications. These include the LLMs themselves and
the services to host them, but also agents, tools, retriever services,
and applications such as chatbots.
These objects are, for most organizations, new types of assets that
will need to be developed and used. The skills to develop and use
these kinds of objects are not yet commonplace in organizations,
and best practices for their development and use are still being
defined. Amid this rapid innovation, the LLM Mesh architecture
paradigm aims to simplify the management and use of these objects
to accelerate and standardize the development of LLM-powered
applications. Chapter 2 will explore in depth these different types of
objects and how an LLM Mesh can simplify their use.



The Right Model for the Right Application
The challenge for the use of LLMs in the enterprise is not a lack
of availability of models. As of June 2024, the popular model repository
Hugging Face lists 727,354 [1] models, of which 114,796 [2] are text-
generation models. More models are being developed and released
every day.
In fact, the abundance can actually be a hindrance, as you have to
sort through the many different options to choose the ones that are
best for your applications.
A large general model that can do most things pretty well is a good
place to start. But as an enterprise’s use of LLMs matures and it seeks
higher levels of performance and optimized budgets, it will need to
use a growing number of models across different applications.
The following sections explore the different characteristics of mod‐
els and how these characteristics may make a model more or less
appropriate for the many different, specific uses in the enterprise.

Model Size: The Upside and Downside of More Parameters
The word “large” in large language model refers to the number
of parameters in the model. Alternatively, “large” may refer to the
number of tokens in the training data that the model is trained on.
More training tokens lead to more parameters.
LLMs often have hundreds of billions to trillions of parameters. For
example, GPT-3, released in May 2020 [3] and the immediate precur‐
sor to the model behind the first version of ChatGPT, has 175 billion
parameters. The first version of LLaMA from Meta AI in February
2023 has 65 billion parameters. [4] Increasingly, the makers of propri‐
etary models are no longer making the number of parameters in
their models public.

1 https://huggingface.co/models, accessed June 20, 2024


2 https://huggingface.co/models?pipeline_tag=text-generation&sort=trending, accessed
June 20, 2024
3 https://arxiv.org/abs/2005.14165
4 https://ai.meta.com/blog/large-language-model-llama-meta-ai/



These parameters are the numerical values (sometimes they will be
called weights and biases) that make up the simple mathematical
formulae of each neuron in the neural network. Usually, they are
32-bit floating point numbers. A process called quantization can
simplify these numbers to 4- or 8-bit integers. This process can
often dramatically improve the efficiency of a model while having
only a modest impact on model performance.
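As a rough illustration of the idea (not how production quantization libraries actually work), the following sketch maps one layer's 32-bit weights to 8-bit integers plus a single scale factor:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric int8 quantization: map float32 weights onto [-127, 127]
    with a single scale factor for the whole tensor."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(4096, 4096).astype(np.float32)   # one layer's weight matrix
q, scale = quantize_int8(weights)
error = np.abs(weights - dequantize(q, scale)).mean()
print(f"Storage shrinks 4x (32-bit -> 8-bit); mean absolute error: {error:.6f}")
```

Real quantization schemes work per-channel or per-block and often quantize activations as well, but the tradeoff is the same: less memory and faster inference in exchange for a small loss of precision.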
Figure 1-2 illustrates a simple neural network architecture, showing
the input layer, two hidden layers, and the output layer. The circles
represent the nodes in the network, the values under the nodes
are the biases, while the values on the lines connecting the nodes
represent the weights. Larger neural networks, like LLMs, are built
on the same basic architecture but are billions of times larger with
more than one hundred hidden layers.

Figure 1-2. Simplified example of a neural network showing the input, hidden, and output layers and the weights connecting each node and the biases of each node

Generally speaking, larger models perform better: They can do more
tasks and they can do those tasks better. Thus, it could be easy
to conclude that you should choose the largest model your budget
allows and use it for everything. But that would be like using your
large, comfortable, powerful grand touring car for every trip. While
it would be the right choice for a cross-country roadtrip, it would be
overkill for a quick trip to the grocery store or the bakery around
the corner. A bicycle or your own two feet would be better for such
errands.



The following subsections explain the tradeoffs related to the size of
a model.

Inference Costs
The most direct impact of a larger model size will be on inference
cost. Inference is the process of generating tokens in response to a
particular input. A model with more parameters will require more
calculations during inference. One way or another, those calcula‐
tions must be run on some hardware that is installed and managed
somewhere and that is consuming electricity, for which someone
will have to pay the bill at the end of the month.
In some cases, companies offering these models as a service may
obfuscate these costs, for example, by subsidizing the cost in order
to gain more customers. This may make an apples-to-apples com‐
parison difficult. We’ll dig into cost considerations in Chapter 3.
Some models function, in essence, as a combination of smaller mod‐
els, each specialized in different tasks. This architecture, known as
Mixture of Experts (MoE), can dramatically reduce the cost of infer‐
ence. One well-known model using an MoE architecture is Mixtral
8x7B from Mistral. Despite being a 46.7-billion parameter model,
only 12.9 billion parameters are used per token.
This approach has led to improvements in inference cost, but makes
the model more challenging to build and to fine-tune.
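The toy sketch below illustrates the routing idea behind MoE: a small gating function selects only a few experts per token, so most of the parameters stay idle. It is a deliberately simplified illustration, not Mixtral's actual implementation:

```python
import numpy as np

def moe_layer(x: np.ndarray, experts: list, gate_w: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Route one token through only the top_k highest-scoring experts."""
    logits = x @ gate_w                                          # one routing score per expert
    top = np.argsort(logits)[-top_k:]                            # indices of the selected experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()    # softmax over the selection
    return sum(w * experts[i](x) for w, i in zip(weights, top))

d = 16
experts = [lambda x, W=np.random.randn(d, d): x @ W for _ in range(8)]   # 8 toy "experts"
gate_w = np.random.randn(d, 8)
token = np.random.randn(d)
out = moe_layer(token, experts, gate_w)   # only 2 of the 8 expert matrices are used
```

The parameters of all eight experts must still be loaded in memory, which is why MoE models reduce inference compute more than they reduce memory footprint.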
All other things being equal, larger models will be more expensive to
use, though technological advancements like MoE mean that these
tradeoffs will become more complex in the future. The benefit that
large models bring to a particular use case may justify their expense
in certain cases, as discussed in the sections and chapters below, but
a wise strategy will use them only where needed.

Inference Speed
While the inference of a larger model will require more calculations,
these calculations can be done more quickly when using larger and
higher-performance hardware. Furthermore, many of these calcula‐
tions can be parallelized, using multiple processing units at the same
time to run all of the necessary calculations. Again, MoE models do
not need to use all parameters for every task.
Standard benchmarks are being established to accurately quantify
and compare the speed of different models, acknowledging that

hardware and network performance will have a significant impact
on the results. The two metrics that are used most commonly are
latency and throughput:

• Latency, often measured as Time To First Token (TTFT), is a
measure of how long the model takes to generate its first
response token to a user’s input. In applications where the end
user is interacting with the model in real time, latency will
influence whether the model “feels” responsive. In applications
where the model’s response is part of a longer chain of interac‐
tions, latency will need to be considered when setting when the
application will time out.
• Throughput, often measured as Tokens Per Second (TPS), meas‐
ures the overall rate at which the model will generate tokens in
response to a given request. Like latency, it will influence if a
model feels fast to an end user. Throughput needs to be taken
into consideration when building applications that depend on
the output of the model.

When comparing the speed of models, pay close attention to the units being used, as different testers are using different methodologies.
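If you want to measure these two metrics yourself, the sketch below shows one simple way to time a token stream. The fake_stream generator is a stand-in for whatever streaming client your provider exposes:

```python
import time

def measure_speed(stream):
    """Measure Time To First Token (latency) and Tokens Per Second (throughput)
    for any iterable that yields tokens as the model generates them."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        if first is None:
            first = time.perf_counter()   # moment the first token arrives
        count += 1
    end = time.perf_counter()
    ttft = (first - start) if first is not None else float("nan")
    tps = count / (end - first) if first is not None and end > first else 0.0
    return ttft, tps

def fake_stream():
    """Simulated token stream: 300 ms before the first token, then 50 tokens."""
    time.sleep(0.3)
    for _ in range(50):
        time.sleep(0.02)
        yield "token"

ttft, tps = measure_speed(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, throughput: {tps:.0f} tokens/sec")
```

Measured against a real endpoint, the same loop captures network latency and server load as well as raw model speed, which is exactly what your end users will experience.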
While the relationship between model size and inference speed is
indirect (because large models can be run more quickly on higher-
performance hardware, and factors like network performance can
influence the time it takes to receive a response), the measured
speed of a deployed model must be taken into account when build‐
ing an LLM-powered application.

Task Coverage and Performance


One of the main functional differences between LLMs and previous
generations of models used for Natural Language Processing (NLP)
is that those earlier models were always task specific. For example,
separate models would be used for sentiment analysis, text summa‐
rization, or language translation.
LLMs can do all of those tasks, and generative LLMs can do some‐
thing that previous models could not: Generate text based on a
prompt. Generally speaking, models with more parameters can per‐
form more tasks, which can be useful in the enterprise when several
tasks need to be performed on the input text.



LLMs have also been shown to gain new abilities as they grow larger
and are trained on more and more data. These emergent abilities are
unpredictable. In other words, researchers cannot predict ahead of
time at what point in its training a model will gain new abilities. [5] It is
possible that future LLMs will be capable of many more tasks or will
see dramatic improvement in their performance of existing tasks as
they grow larger.
In addition to gaining new abilities as they grow, larger LLMs gener‐
ally show better performance on any task that they are capable of
performing as well. Recent research shows that this improvement
in performance is neither linear nor predictable, as with the emergent
abilities mentioned above. [6]

Context Windows
The amount of input text that a model can receive within a single
prompt is known as its context window. Measured in tokens, it
defines how much information a model can work with at a single
time.
For example, a model with a small context window can only be used
to summarize a document that can fit in its context window. You
could break up the document into smaller pieces, but the model
would summarize each separately, without knowledge of the entire
document, potentially resulting in repetitive or incoherent results.
Large context windows, on the other hand, allow for plenty of
space to provide examples of what you want the LLM to produce
(called few-shot learning) and to engage in more complex prompt
engineering techniques.
Generally speaking, larger models have larger context windows, and
some models have been optimized for exceptionally large context
windows. While the original GPT model had a context window of
only 512 tokens (approximately one page of text), Gemini 1.5 from
Google now has a context window of more than 1 million tokens
and has been shown in internal testing to handle up to 10 million
tokens.
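In practice, you often need to check whether a document will fit before sending it. Below is a minimal sketch using the open source tiktoken tokenizer; note that every model family has its own tokenizer, so counts are only approximate across models, and the input file name is hypothetical:

```python
import tiktoken   # OpenAI's open source tokenizer library

def fits_in_context(text: str, context_window: int, reserved_for_output: int = 500) -> bool:
    """Check whether a document fits in a model's context window, leaving
    headroom for the prompt template and the generated answer."""
    enc = tiktoken.get_encoding("cl100k_base")        # encoding used by the GPT-4 family
    n_tokens = len(enc.encode(text))
    print(f"Document is {n_tokens} tokens")
    return n_tokens + reserved_for_output <= context_window

document = open("quarterly_report.txt").read()        # hypothetical input file
print(fits_in_context(document, context_window=8192))
```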

5 https://openreview.net/pdf?id=yzkSU5zdwD
6 https://arxiv.org/abs/2210.14891



Sizing Models to the Task
Reading these previous sections, it is easy to conclude that if cost
and complexity are no barrier, then the largest models are always the
best choice for any application in the enterprise. But, in which enter‐
prise are cost and complexity not a barrier? In fact, these are the two
greatest barriers to the practical use of LLMs in the enterprise!
Given this reality, enterprise users of LLMs will need to choose a
model that strikes the right balance of ability, performance, cost,
and complexity for a specific application. The right choice for one
application may not be the right choice for another application.

General Models vs. Specialized Models


Building on this understanding of the implications of model size,
we will now explore the differences between general models and
specialized models.
General models are those that have been trained to perform at
human level across a wide range of tasks. OpenAI’s GPT-4 (released
in March 2023 [7]) is an excellent example of such a model. It demon‐
strates very high performance across a great number of tasks, cover‐
ing natural languages, programming languages, and a wide variety
of specialized jargon. It can generate, summarize, and translate text,
it can write technical reports, and it can write poetry. Furthermore,
GPT-4 can take image data as input, a capability known as multimo‐
dality.
In contrast to these general models, specialized models have been
trained to perform well on specific tasks, in specific domains, or
have been compressed and optimized for performance at a smaller
model size.
Note that while high-performing, general models tend to be larger
models, specialized models may be larger or smaller.

Types of Specialized Models


Task-specific models are those that are focused on doing specific
tasks very well. Some examples of task-specific models include
M2M100 [8], a model that is designed to translate between any pair
of natural languages, or OpenAI’s Codex [9], an evolution of GPT-3
that is trained specifically to generate code across a wide variety of
programming languages. A common application might be a model
that is specialized in summarization, allowing it to be much smaller
than a general model. Thanks to its small size, it could run locally
on a mobile device and be used for rapidly summarizing content
directly on the phone.
Domain-specific models are those that are trained on the language
of a specific domain. For example, BioMedLM [10] is a 2.7-billion
parameter model trained on biomedical literature and is thus well-
adapted to answering questions about medical topics, while
BloombergGPT [11] is a 50-billion parameter model trained on a very
large dataset of financial documents designed to serve the financial
services industry.
Resource-constrained models are models that have been compressed
through various techniques to maintain good performance in their
desired tasks or across a wide range of tasks, while being less
resource intensive to run. An example is MobileBERT [12], a com‐
pressed version of the popular BERT model designed to be run on
mobile devices.
Embedding models transform text into numerical representations
called embeddings or vectors. These embeddings capture the
semantic meanings of the text and the relationships between the
different parts of the text. A common application is retrieval aug‐
mented generation (RAG) where a corpus of text (e.g., thousands of
documents) is converted into embeddings and stored in a special‐
ized database called a vector store.
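The sketch below shows the basic mechanics with a small open embedding model from the sentence-transformers library; the example corpus and query are invented, and a real RAG pipeline would store the vectors in a vector store rather than comparing them in memory:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# A small open embedding model; any embedding model exposes the same idea:
# text in, fixed-length vector out.
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "Our return policy allows refunds within 30 days.",
    "The data center is located in Frankfurt.",
    "Employees accrue 25 vacation days per year.",
]
query = "How long do customers have to return a product?"

corpus_emb = model.encode(corpus)          # shape: (3, 384)
query_emb = model.encode(query)            # shape: (384,)

# Cosine similarity between the query and each document
scores = corpus_emb @ query_emb / (
    np.linalg.norm(corpus_emb, axis=1) * np.linalg.norm(query_emb)
)
print(corpus[int(np.argmax(scores))])      # the most semantically similar document
```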
Reranking models are used to refine the initial ranking of search
results of an embedding model to make the results more relevant
to the end user. There are LLM-based and non-LLM rerankers,
each presenting tradeoffs in terms of performance and quality of
response.

7 https://openai.com/index/gpt-4-research/
8 https://about.fb.com/news/2020/10/first-multilingual-machine-translation-model/
9 https://openai.com/index/openai-codex/
10 https://arxiv.org/abs/2403.18421
11 https://arxiv.org/abs/2303.17564
12 https://arxiv.org/abs/2004.02984



Choosing a General or Specialized Model
The existence of a diverse and growing ecosystem of both general
and specialized models gives enterprises the opportunity to use
different models for different purposes.
In the enterprise, general models are well-suited to tasks where
the input is going to be highly unpredictable. This could be the
classification of documents into different categories. For example,
if a directory contained a mix of contracts, invoices, and emails, a
first step in the analysis could be to use a general model to sort the
documents into different categories so that the contracts could be
analyzed separately from the invoices.
Specialized models are well adapted for tasks where the input data
is more homogeneous and predictable. Let’s explore what this might
look like across a pharmaceutical company. That company may
wish to build a chatbot to serve its customers (doctors, nurses,
pharmacists, and other healthcare providers) in their interactions
with patients. It would likely choose a domain-specific model like
BioMedLM to ensure higher quality and more relevant results. The
same company may then use a model like ESM [13] from Meta AI
researchers which has been trained on the language of proteins
as part of their molecular research applications. Finally, that same
organization may use a non-LLM computer vision model to watch
their products as they come off of the manufacturing line to quickly
identify any anomalies as part of their quality assurance processes.
General models can be a very good starting point for enterprises as
they experiment and build their first use cases using LLMs. At those
early stages, the simplicity of using a single model for a variety of
tasks and use cases outweighs the benefits of further optimization
using specialized models. But, as an enterprise scales its use of
LLMs across use cases, enterprises will want to optimize their use to
improve performance and reduce costs. In this context, specialized
models become more relevant, and the number of models that an
organization will need to manage and apply will tend to increase.

13 https://github.com/facebookresearch/esm



What Is Fine-Tuning?
A common way of creating a domain-specific model is to fine-tune
an existing base model. For example, this is how BloombergGPT
was built, by fine-tuning the open-access BLOOM model on a
proprietary dataset of financial documents. [14]
Fine-tuning is a type of transfer learning that feeds new — usually
specialized — data into a model to retrain some parts of the model
on this new data. Compared to building a model from scratch, it is
far less complex and compute-intensive.
While fine-tuning is simpler and less expensive than building a
base model, it remains an advanced technique and should be used
only when other, simpler, and less expensive avenues have been
exhausted.
Fine-tuning has often been cited as a way to elicit better perfor‐
mance from base models, allowing enterprises to differentiate their
use of LLMs from their competition. While this is true, it ignores
or downplays the difficulties of fine-tuning and leaves unexplored
the opportunity to generate differentiated results using simpler
techniques like prompt engineering and Retrieval Augmented Gen‐
eration (RAG).
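As a simple illustration of that alternative, the sketch below shows a few-shot prompt for a narrow task; the task, examples, and labels are hypothetical:

```python
# A minimal few-shot prompt for a narrow, domain-specific task (expense
# classification), often a faster and cheaper first step than fine-tuning.
FEW_SHOT_PROMPT = """Classify each expense line into one of: TRAVEL, SOFTWARE, OFFICE.

Line: "Lufthansa flight FRA-JFK, 2 June"   -> TRAVEL
Line: "Annual Jira subscription renewal"   -> SOFTWARE
Line: "Box of 500 A4 paper sheets"         -> OFFICE
Line: "{expense_line}"                     ->"""

prompt = FEW_SHOT_PROMPT.format(expense_line="Taxi from airport to client site")
# `prompt` is then sent to any general-purpose LLM through its completion API;
# the in-context examples steer the output format without any retraining.
print(prompt)
```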

Making Sense of Model Licenses


There has often been a conflation between a model’s license (e.g.,
open source vs. proprietary) and where the model is hosted (e.g.,
provided as a service via API vs. self-hosted or on-premises). It is
important to distinguish between the two dimensions. For example,
hosted services like Amazon Bedrock serve both proprietary and
open models, while providers like Cohere license their proprietary
models for self-hosting in addition to hosting the model themselves.
Hosting options will be covered in the next section, while this sec‐
tion will distinguish between the different license types.

Proprietary Models
Proprietary models are just that: proprietary to their creators. The
creator of a proprietary model retains full control over the intellec‐

14 https://arxiv.org/abs/2303.17564

14 Chapter 1: Using LLMs in the Enterprise


tual property of the model itself. Most often, these models are a
black box. In other words, their training data, the algorithm used
to train the model, any subsequent steps such as reinforcement
learning or fine-tuning, and the weights of the model itself remain
hidden from the end user, unless the developer chooses to disclose
any of this information.
Early in the development of LLMs, there was a trend towards
openness, even among developers of proprietary models. OpenAI
published a technical paper detailing the development process of
GPT-3. [15] The release of subsequent models, such as GPT-4, has not
been accompanied by such detail.
The use of proprietary models is governed by the terms of use that
a customer agrees to when using the model. An enterprise should
ensure a full and detailed legal review of these terms to ensure that
they are appropriate for the intended use. Specific attention should
be given to any rights that the model provider may claim to have
on any data sent to the model for inference. Generally, models that
are licensed for professional use do not retain any customer data
for retraining purposes, though they may retain customer data for
quality assurance purposes.

Open-Weights Models
An open-weights model provides public access to the pre-trained
parameters of the model. This can allow the end user to modify the
weights through fine-tuning or other techniques to adapt the model
to their needs.
Open-weights models typically do not publish their training data,
training algorithms, or other associated information. As such, it can
limit the ability to perform a detailed technical inspection of the
model or to reproduce the model’s performance. These limitations,
however, are most relevant to other researchers and are less relevant
to enterprises that are seeking to simply use a model in the most
efficient and effective way.

14 https://arxiv.org/abs/2303.17564
15 https://arxiv.org/abs/2005.14165



Open Access Models
Open access is a growing category of models that are nearly open,
but have custom terms that cannot be considered fully open source
in the traditional definition of that term. It covers a wide gamut of
licenses with different restrictions, and thus should be the subject
of a detailed legal review to ensure that the license allows for the
intended use.
Some examples include:

• BLOOM, which was released under the OpenRAIL-M license. [16]
Though quite nearly open source, it has requirements for
responsible use of the model, which means that it is not fully
open source.
• LLaMA 2 and 3 from Meta AI, which have been released with
their own custom licenses (called the LLaMA 2 [17] and LLaMA 3 [18]
Community Licenses, respectively) that set limits to the use of
the model. Specifically, the licenses forbid the use of the model
in applications with more than 700 million monthly active users
and for the purpose of building competitive models.

Open Source Models


Open source models are the most open of all, publishing details
of their training data, training algorithms, and model parameters,
allowing for the most permissive use of the model. Common open
source licenses include Apache 2.0 and MIT. Meta AI’s first version
of their LLaMA model was released under a research-only license,
restricting it from commercial use and thus making it not useful for most
enterprise applications. [19]

Choosing a License for Enterprise Use


Even though proprietary models are the most restrictive, they are
often entirely appropriate for use in the enterprise, as with any
other proprietary enterprise software. By charging for access to their
models, providers of proprietary models may be able to more easily

16 https://bigscience.huggingface.co/blog/the-bigscience-rail-license
17 https://llama.meta.com/llama2/license/
18 https://llama.meta.com/llama3/license/
19 https://arxiv.org/abs/2302.13971

provide for services and support for the use of their model. This
may make their use more appropriate for use in the enterprise.
Open-weights, open access, and open source models may be more
useful in applications where an enterprise wants more control over
the model itself and possesses the technical expertise to make any
such modifications or to host the model.

Model Hosting
Enterprises have three main hosting options when looking to access
LLMs:

1. API services from the model developers, such as OpenAI, Anthropic, Cohere, and Mistral.
2. Cloud Service Providers offering hosted LLM services, such as
Azure OpenAI Service, AWS Bedrock, or Google Vertex AI
Model Garden. These services also allow customers to load their
own models, while the underlying hardware is managed by the
cloud provider.
3. Self-managed hosting of models. Many models with different
licensing terms are available for self hosting, including open-
source and open-access models as described above. Cohere also
licenses its proprietary models for self-managed hosting.

In many cases, models hosted by their developer or a cloud service
provider (options 1 and 2, above) are the best choice in the enterprise.
In the same way that cloud computing outsourced the burden
of running data centers, hosted models are a simple continuation of
that trend, offering infrastructure-as-a-service. Given the intensive
compute requirements of LLMs, especially under heavy workloads,
outsourcing this can be a wise choice.
The most common objection to using a hosted service is that it
requires sending corporate data to a third-party service. But, in
many cases, this corporate data is already hosted by a third party
that may also be hosting internal communications and other sen‐
sitive data (e.g., a company that uses Microsoft 365 productivity
and communication tools has its data in Azure). Is using the LLM
service from that same provider any different? It is ultimately a
question that warrants review by your legal and risk teams, but in

most cases, the conclusion is that it is not different in a meaningful
way.
Self-hosting a model requires acquiring the necessary hardware,
configuring it to run the LLM, and then maintaining that stack for
reliable internal use. Typically, this will require a cluster of GPUs
that have been properly configured with the right drivers and pack‐
ages to run the LLM in question. The LLM must then be loaded into
this environment so that it can begin to serve internal requests.
Self-hosting can be an appropriate choice for an enterprise in cases
where an organization needs full control over the model and the
hardware it runs on and cannot use a third-party service for its data.
This may be the case in the most restrictive data environments, or
if the enterprise does not want to rely on a third party to ensure the
performance of the environment, notably in contexts where third-
party providers may need to throttle access to certain customers to
ensure the overall stability and availability of their service.
In the case of both self-hosting and hosted services, applications that
use the LLM will access the model through an API endpoint. The
difference is simply who is hosting and maintaining that endpoint
and whether the data going to and returning from the LLM leaves
the corporate firewall of the enterprise.
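The sketch below illustrates this point using the OpenAI Python client, which many self-hosted serving stacks (for example vLLM) can also emulate; the endpoint URL, API keys, and model names are placeholders:

```python
from openai import OpenAI

# The same application code can target a vendor-hosted endpoint or a
# self-hosted one; only the endpoint and credentials change.
hosted = OpenAI(api_key="sk-...")                              # vendor-hosted service
self_hosted = OpenAI(base_url="http://llm.internal:8000/v1",   # inside the firewall
                     api_key="not-needed")

for client, model in [(hosted, "gpt-4o"), (self_hosted, "llama-3-70b-instruct")]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize our Q2 sales report in one line."}],
    )
    print(response.choices[0].message.content)
```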

Building a Base Model Is Not for Most Organizations


Early on in the popular interest in LLMs, a lot of attention was given
to the expense and complexity of building these models. Billions
of dollars were being spent building these models, and sometimes
training them took many months. A huge amount of this initial
work was amassing the enormous training sets required to build
models of this scale.
Recent advances have brought down the time needed to build new
models, and open source training data repositories now exist. But
the fundamental question for an enterprise that is considering build‐
ing a model remains: Why would you? Given the great diversity
of today’s models, which offer seemingly endless combinations of
performance, specificity, licensing, and hosting options, what would
justify the time and expense needed to build your own model,
especially given that you are uncertain of being successful?



Any company whose core business is not building or serving AI
models should not consider building their own model. There are
more than enough options on the market today. The challenge is
not getting access to a model, but using it safely, securely, efficiently,
and effectively to further your business goals. This is where an LLM
Mesh comes into play.

Bottom Line: Why the LLM Mesh?


As you have read in the previous sections, a great variety of models
exists in an ecosystem that is rapidly evolving. This is ultimately a
very good thing for enterprises: It means that they will be able to
pick and choose the right model for the right applications within
their business. Building applications that are powered by these
LLMs requires combining them with other objects, like retrieval
systems, prompts, and tools. This requires careful attention to many
different factors:

• How the models, services, and associated objects are registered and used within the organization,
• How the data is routed to the model,
• How access to the models and services is controlled,
• How the use of the model is logged and audited,
• How the content generated by the model is moderated,
• How the models can be enriched with proprietary data,
• How can the applications be developed, deployed and main‐
tained efficiently, and
• How can more people become involved in this process?

As more LLM-powered applications are built and used in the enterprise,
the cost and complexity of managing all of these dimensions
risks spiraling out of control. This could force the enterprise to
make compromises, potentially limiting the value that it derives
from AI.
For example, perhaps there is a use case that would benefit from
using a small, specialized model that is self-hosted and to which
access is restricted. This could be a code assistant that is well-versed
in the company’s proprietary code libraries. If the organization lacks
the ability to quickly and efficiently add this model to its mix, it may

not pursue this use case, leaving the potential gains in efficiency on
the table and falling behind its competition.
This would be unfortunate, given that many of the additional capa‐
bilities that are required to use an LLM efficiently and effectively in
an enterprise are common to all models.
This is the power of an LLM Mesh: its ability to reduce the cost of
building an additional LLM-powered application in the enterprise.
With an LLM Mesh, an enterprise is free to develop an optimal
AI strategy without compromising on performance, cost, safety, or
security.
The remaining chapters of this technical guide will go into much
more detail about how implementing an LLM Mesh can be done.



CHAPTER 2
Objects for Building LLM-Powered
Applications

A Note for Early Release Readers


With Early Release ebooks, you get books in their earliest form—
the author’s raw and unedited content as they write—so you can
take advantage of these technologies long before the official release
of these titles.
This will be the 2nd chapter of the final book.
If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out to the editor at jbleiel@oreilly.com.

An LLM Mesh is a new architecture paradigm for building LLM-powered
applications in the enterprise. It enables an organization
to build and maintain more LLM-powered applications, ultimately
getting more value from LLMs.
An LLM Mesh will allow you to:

• Access various LLM-related services through an abstraction layer.
• Provide federated services for control and analysis.
• Provide central discovery and documentation for LLM-related objects via a catalog.

This chapter will describe the many different types of LLM-related
services that are used in building LLM-powered applications, and
how they can be connected with one another. These various LLM-
related services are called “objects” in an LLM Mesh. The final
section of this chapter will describe the importance of the catalog for
the discovery and use of the objects in an LLM Mesh.
We start with an explanation of why using an LLM Mesh to build
LLM-powered applications is increasingly important in today’s com‐
petitive landscape. The bottom line is that you are going to need to
build a lot of custom LLM-powered applications.

The Potential of New LLM-Powered Applications
In Chapter 1, we learned about the many different types of models
available, how they work, and the options available for hosting
them. What can these models be used for in the enterprise? There
have been two main, initial uses of these models in the enterprise.
The first is simply providing a version of the consumer chatbot
experience within a wrapper that meets enterprise security and
auditability requirements. The second is using these models to pro‐
vide software assistants, often called “copilots,” that can accelerate
the use of existing SaaS products.
This first generation of enterprise use has been met with a mixed
reaction. In some cases, notably when used as coding assistants for
software developers, the copilots have proven to be valuable addi‐
tions to the enterprise IT mix. Other feedback has been more mixed,
leading in some cases to disillusionment. It is too early, however, to
discount the potential for LLMs in the enterprise. This is because the
second generation of LLM-powered applications in the enterprise
will be more sophisticated.
These applications will not only use the ability of these models to
generate text but also their ability to solve arbitrary problems when
instructed to do so. For example, an LLM could be provided with
the documentation for an API that looks up the current price of a
stock. With that documentation and its coding ability, the LLM can
write a script to call that API for a given stock price. If allowed to

execute that script, the LLM—without having ever been explicitly
programmed to do so—could then become a tool for the end user
to look up arbitrary stock prices. This ability to accomplish tasks
for which the LLM has not specifically been programmed is called
“generalization,” and it is what allows LLMs to be the engines in a
new class of enterprise applications.
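One common way to wire this up today is to describe the API to the model as a callable tool. The sketch below uses the OpenAI-style function-calling schema as an example; the stock-price API itself is hypothetical:

```python
# A tool definition in the OpenAI function-calling format. The application
# passes it alongside the user's question; if the model decides to call it,
# the application executes the real API request and returns the result to the
# model, which then produces the final answer.
stock_price_tool = {
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Look up the current price of a stock by its ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "Stock ticker, e.g. ACME"},
            },
            "required": ["ticker"],
        },
    },
}
```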
These new applications will provide automation and decision sup‐
port throughout the enterprise. In order to do so in a reliable
and cost effective manner, however, they will need to be carefully
designed, tested, deployed, and monitored. While many of the con‐
straints of traditional enterprise applications will also apply to this
new class of LLM-powered applications, the way in which they are
built, and the components that are used to build them, will be
different.
Given LLMs’ ability to generalize, would it be possible to develop
a single, all-powerful application that can solve any problem and
answer any question in the enterprise? In short, no. While the LLM
itself is capable of generalization, the constraints of the enterprise
will require that the scope of any one application be relatively nar‐
row to ensure consistently good performance and to control access
to data and tools.
For example, this imaginary, all-powerful application sounds conve‐
nient but would require full access to all of the company’s data
and tools, from the most mundane to the most sensitive. Just as
an employee should only have access to the data and the tools that
they need to do their job, so too must the access of any one LLM-
powered application be limited to that which it needs to perform
its function. Furthermore, while LLMs are capable of generalization,
they require quite specific instructions to deliver consistent results,
often with examples of the expected input and output. This also
drives towards a larger number of more narrowly-scoped applica‐
tions.
Concretely, how many such LLM-powered applications might a
large corporation need? Let’s do some order-of-magnitude estima‐
tions. Let’s say that a large corporation has 10 departments and
each department has 5 core functions. Each of these functions
could potentially benefit from 5 such applications. For example,
within a Sales Department, the Sales Operations function could have
one application that researches their target accounts, a second that

checks if the sales process is being respected, a third that continu‐
ously analyzes the health of the sales pipeline, a fourth that summa‐
rizes meetings with prospects and a fifth that assists salespeople with
their follow-ups.
Doing the multiplication of this order of magnitude estimate gives
us 10 x 5 x 5 = 250 such applications in that enterprise. Again, this is
not a precise number, it’s a rough estimate of the order of magnitude
of applications to expect. Expecting several hundred such applica‐
tions in use in a large organization seems like a reasonable estimate.

Build vs. Buy


If an organization would benefit from several hundred novel appli‐
cations, where will they come from? As always, organizations will
face a “build versus buy” decision. On the “buy” side of that balance,
existing software vendors and new startups are already bringing
these applications to market, and organizations will have a lively
marketplace of competitive offers to choose from. On the “build”
side of the balance, more advanced organizations are building their
first production-ready LLM-powered applications. Which approach
is best? Each has its advantages, disadvantages, and appropriate uses,
meaning that most organizations will buy some applications, and
build others. Table 2-1 summarizes these tradeoffs and considera‐
tions.

Table 2-1. Comparing the tradeoffs of building versus buying LLM-powered applications

Buying off-the-shelf LLM-powered applications
• Advantages: Turn-key performance once implemented; developed and maintained by professional software engineers.
• Disadvantages: Same performance as your competitors that use the same solution; complex to integrate with enterprise systems; governance challenges for tracking which models are used by which applications.
• Ideal uses: Non-critical functions where the goal is to gain in efficiency, not necessarily to differentiate from competitors.

Building custom LLM-powered applications
• Advantages: Adapted to specific business context with the potential to build differentiated capabilities; full control and transparency over the application; independence from software, AI, and cloud providers.
• Disadvantages: Skills required to build applications may not be available; complexity of monitoring and maintaining grows with the number of applications.
• Ideal uses: Core and strategic functions where full control and strong competitive differentiation are needed.

LLM-powered applications, whether they are custom-built or bought
off-the-shelf, have the potential to improve the efficiency of
an organization’s operations. But simply improving your efficiency
in lockstep with that of your competitors does not improve your
competitive position in the market. If you are making the same
efficiency gains as your competitors and no more, you are not
becoming more competitive, you are simply keeping up.
Building custom LLM-powered applications allows an organization
to create a capability that its competitors do not possess, and thus
to outperform them in that particular domain. Given the cost and
complexity of building, monitoring, and maintaining these applica‐
tions, organizations will choose to focus their internal development
efforts on the parts of their business that stand to benefit most from
strong competitive differentiation. In most cases, this will be their
core business. For example, it may be R&D and supply chain man‐
agement for a pharmaceutical company, or risk and price modeling
for an insurance company. The needs of non-core functions will be
satisfied with applications bought off the shelf.

The Complexity Threshold


How many custom applications can any given organization develop
and maintain? Each organization is different but every organization
has a maximum number of applications that it is able to develop,
monitor and maintain with its current practices and techniques. We
call this the organization’s “complexity threshold”, and it is illustrated
in Figure 2-1.
As the organization develops and deploys more LLM-powered
applications, the complexity of monitoring and maintaining them

increases until, at some point, the maximum complexity is reached
and no more applications can be developed. Reaching this threshold
means that the organization cannot develop more applications, even
if doing so would benefit its business. If the organization wants
to develop more applications, it must find a way to increase its
complexity threshold. This requires standardizing and structuring
the way that the organization builds these applications.

Figure 2-1. An organization’s complexity threshold for developing and maintaining LLM-powered applications

A new paradigm for building LLM-powered applications


Bringing standardization and structure to the way that applications
are built in the enterprise is a show we’ve seen before. Over the
years, organizations have used different architecture paradigms for
developing applications. Starting with monolithic applications in the
early days of application development, where all components were
tightly integrated into a single codebase, organizations then shifted
to an architecture paradigm with a higher degree of abstraction with
the services-oriented architectures of the late nineties and, now, the
modern standard of microservices has taken that abstraction even
further.



Today, the architecture paradigm for building LLM-powered appli‐
cations is monolithic applications using packages like LangChain.
This approach is appropriate for building your first few POCs and
production applications, but it reflects the relative immaturity of
LLM-powered application design in the enterprise.
A new architecture paradigm is needed for building and maintain‐
ing many LLM-powered applications that can raise an organization’s
complexity threshold. LLM Mesh is that new architecture paradigm.
Now, let’s look at the objects used in building an LLM-powered
application.

LLM Mesh-Related Objects: An Overview


Building an LLM Mesh requires understanding the different types of
objects that must interact with one another within an LLM-powered
application. Chapter 1 covered the LLMs and the various services
that host and serve them. While those models and services are
at the heart of an LLM-powered application, more is needed, espe‐
cially if the developer hopes to build a custom application that will
stand apart from the competition and deliver better and more valua‐
ble performance. This requires integrating the LLMs with various
objects unique to the organization.
An LLM Mesh thus treats objects of a similar type in the same
way, with the LLM Mesh itself providing the translation between the
generic object (e.g. a tool) and the specific service (e.g. a specific
SQL database). In this way, we say that the LLM Mesh provides
“abstraction” between the high-level object and the underlying, spe‐
cific service.
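Below is a minimal sketch of what this abstraction could look like in code, assuming a generic Tool interface and a SQLite database standing in for the specific service; the names and schema are illustrative only:

```python
import sqlite3
from abc import ABC, abstractmethod

class Tool(ABC):
    """The generic object that an agent sees: a name, a description, and a run method."""
    name: str
    description: str

    @abstractmethod
    def run(self, request: str) -> str: ...

class SQLTool(Tool):
    """A specific service behind the abstraction: here, a SQLite database."""
    name = "sales_database"
    description = "Run read-only SQL queries against the sales database."

    def __init__(self, path: str):
        self.conn = sqlite3.connect(path)

    def run(self, request: str) -> str:
        return str(self.conn.execute(request).fetchall())

tool = SQLTool(":memory:")
tool.conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
tool.conn.execute("INSERT INTO sales VALUES ('EMEA', 1200.0)")
print(tool.run("SELECT region, SUM(amount) FROM sales GROUP BY region"))
```

The agent only sees the Tool interface; whether the underlying service is SQLite, a data warehouse, or an external API is a detail the LLM Mesh resolves.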
Figure 2-2 illustrates the objects of an LLM Mesh organized into
different layers that comprise the typical stack of an LLM-powered
application, overlaying the typical stack of a traditional application.



Figure 2-2. The objects of an LLM Mesh in comparison with those of a
traditional application

Note that in Figure 2-2, the objects in the lighter-colored rectangles
are not themselves part of an LLM Mesh, but rather are abstracted
as the higher-level objects in the darker shade. This will be discussed
further in the retrieval services and tools sections below. In contrast,
traditional applications use Data Querying Services and API Serv‐
ices directly, without abstraction as tools. Unstructured Data is not
used directly in traditional applications but is first transformed into
structured data using traditional natural language processing (NLP)
techniques.
Here is an overview of the different objects in Figure 2-2 and how
they relate to one another:
Large Language Models
The base model — the trained neural network comprising the
core mathematical weights — as described in Chapter 1.
Unstructured Data
Enterprise data that is not in tabular form. A common type of
unstructured data is documents, which may be in PDF, DOCX,

or other formats. Unstructured data is abstracted as retrieval
services in an LLM Mesh.
Structured Data
Enterprise data that is in tabular form, typically stored in data‐
bases, data warehouses, and data lakes. Structured data is stored
in data querying services, which are in turn abstracted as tools
in an LLM Mesh.
LLM Services
The services which are comprised of the hardware and software
systems used to deploy and interact with the model in real time.
As described in Chapter 1, these services may be managed by
the model developer, a third party, or internally by the enter‐
prise.
Retrieval Services
A service that allows for the efficient and effective querying of
unstructured data. The retrieval services usually consist of an
LLM used for embedding, storage for the embeddings (which
can be either a dedicated vector store or another type of
database—SQL or search, for example—that has added these
capabilities), and some system for ranking the results to best
respond to the query.1
Data Querying Services
Databases and their associated query languages, like SQL, that
allow for the efficient retrieval of structured data. These systems
are abstracted in an LLM Mesh as tools.
API Services
Any internal or external API services to be integrated with
the LLM-powered application. An external example could be a
weather service to look up a forecast, while an internal example
could be the data catalog to allow for data discovery. These
services can be very diverse and are abstracted in an LLM Mesh
as tools.

1 As retrieval services are relative newcomers to the enterprise architecture landscape


and are themselves powered by LLMs, they are treated as distinct objects from other
tools in an LLM Mesh.



Prompts
The inputs to LLM services. They can be templated and
standardized, and they can use the full range of prompting
techniques (few-shot, chain of thought, etc.).
Agent
An LLM-powered system that seeks to accomplish a certain goal
over multiple iterations within a defined level of autonomy, and
using tools to meet its objective. Note the centrality of agents
for LLM-powered applications in this architecture. They are
the object where the logic and behavior of the application are
defined.
Tool
Any function or resource that an agent can use to accomplish
its task. It can be a programming or querying language, an API
service, or even another agent.
LLM-Powered Applications
An application that provides a user interface and other func‐
tionality on top of the agent. A chatbot is one example of an
application type, but LLM-powered applications could have
many different types of interfaces running the gamut from
dashboards, to mobile apps, to assistants embedded in other
applications, to headless applications running behind the scenes
and alerting users only when needed.

The Objects of an LLM Mesh in Detail


In this section, we will walk through each of the seven types of
objects that are used in building LLM-powered applications in the
enterprise. Each section will first define and describe the object.
At the end of each section, you will find a tip box titled “Thinking
Like an LLM Mesh”. This box describes the expected input and
output of each object. Recall that one of the main benefits of an
LLM Mesh is that it creates an abstraction layer that standardizes the
inputs and outputs of diverse services into standardized objects. The
tip boxes summarize what those standardized inputs and outputs
should be.
Building an LLM-powered application requires integrating several
different objects. For example, a simple chatbot application using a



retrieval augmentation technique could be built using the following
objects:

• An application with a chatbot interface where the end users ask


their questions and receive their responses as well as provide
feedback to the developers.
• An agent, composed of several templated prompts, that defines
how the user’s question will be handled by the LLM.
• An LLM service that receives the question, tokenizes it, and
submits it to the LLM that will generate the response, enriching
it with an answer from a retrieval service.
• A retrieval service that provides access to unstructured data
from documents. The retrieval service is composed of
— an embedding model that converts the text data to vectors
and
— a reranking model that will select the most relevant answer to
the user’s question, providing it back to the LLM service for
inclusion in the reply.

Such a chatbot could, of course, be built without an LLM Mesh


simply by building a monolithic application that calls the various
services, passing the results from one object onto the next. In prac‐
tice, the developer of such an application would be writing many
API calls, each of which is specific to each service. If the developer
would later want to change, for example, from one third-party LLM
service to another, this would require manually updating the code
so that the application calls the new LLM service in the way that is
expected by that service.
In that scenario, an efficient application design would provide a
certain degree of abstraction, defining the interface with the LLM
service as a single function within the application and not specifying
the details of the API call in every instance where the LLM service is
called.
An LLM Mesh takes this abstraction further, completely separating
the service from the application and providing a standard interface
for all objects of the same type for use across all LLM-powered
applications in the enterprise.



LLMs
We covered LLMs in detail in Chapter 1. When we talk about an
LLM, we are talking about a very large file, often measuring in
gigabytes or terabytes. For example, the 405 billion parameters of
Meta’s Llama 3.1 model weigh in at 2.3TB. The majority of the data
volume is taken up by the weights of the model itself. Remember,
as described in Chapter 1, the weights of a model are simply a great
quantity of floating-point numbers.
If an organization is using a managed LLM service, they will never
interact with the model itself, only with the service endpoint. But, if
an organization self-hosts an LLM, then they will need to load the
LLM into their hosting infrastructure.

Thinking Like an LLM Mesh


From the perspective of an LLM Mesh, an LLM is thus an
object that can be interacted with in only two very simple
ways: It can be downloaded and updated in the environ‐
ment where it is hosted. These two actions will generally
be done by interfacing with an API supplied by the model
provider or from an aggregator of models (a model hub)
such as Hugging Face.
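
To make these two interactions concrete, here is a minimal sketch of
the download step, assuming the model is hosted on the Hugging Face
hub and that the huggingface_hub Python package is installed; the
repository name and local path are examples only, and gated models
such as Llama 3.1 also require access approval.

from huggingface_hub import snapshot_download

# Download the model weights to the environment where they will be hosted.
# The repo_id and local_dir are illustrative; gated models require approval.
local_path = snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B",
    local_dir="/models/llama-3.1-8b",
)
print(f"Model weights available at {local_path}")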

LLM Services
An LLM service is a combination of storage resources, compute
resources, and supporting software that allows an LLM to be hosted
and accessed for inference.
The developer of the model may provide LLM services. For exam‐
ple, OpenAI, Anthropic, and Mistral all provide services that run
their proprietary models. In these services, the end user does not
load the model into the service; they simply select the service run‐
ning the model that they prefer.
Alternatively, an organization may choose to build and run its own
LLM service, managing the GPUs and associated technologies.
Finally, cloud service providers (CSP) offer managed LLM services.
In these, the end user may select the model that they wish to run,
but the CSP manages the compute and storage infrastructure.



An LLM service, be it hosted by your organization or by a third
party, is accessed via an API. Generally, most LLM services will
expect similar variables when they are called. Those include:

• Which model version to use


• A system prompt set by the developer to guide the model’s
completion
• The user prompt for the model to complete
• Temperature setting to define the level of randomness in the
response
• Alternatives to temperature, such as top_p or top_k, which use
different sampling methods to determine which token to select
next

In response to such requests, the LLM service will generally reply


with a response that includes:

• An indication of the type of response (e.g., text completion or


streaming chat)
• A unique identifier of the response
• The generated content
• Reasons for why completion may have stopped
• Usage statistics about the number of tokens in the request and
response

Most services have broadly similar expected inputs and outputs.


An LLM Mesh abstracts and standardizes these inputs and outputs
through its abstraction layer, ensuring that the request sent to a
given service is formatted appropriately and uses the correct syntax.
When using an LLM Mesh to build an application that calls an LLM
service, the end user calls the LLM service object in the LLM Mesh,
indicating which service to use, and the LLM Mesh translates that
generic call into the specific call expected by the indicated service.
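
As an illustration, the standardized inputs and outputs listed above
might be captured in simple objects along the following lines; this is
a sketch in Python, and the field names are invented for this example
rather than taken from any particular implementation.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LLMRequest:
    model: str                       # which model version to use
    system_prompt: str               # developer-set guidance for the completion
    user_prompt: str                 # the user prompt for the model to complete
    temperature: float = 0.7         # level of randomness in the response
    top_p: Optional[float] = None    # optional alternative sampling control

@dataclass
class LLMResponse:
    response_id: str                 # unique identifier of the response
    response_type: str               # e.g., text completion or streaming chat
    content: str                     # the generated content
    finish_reason: str               # why the completion stopped
    usage: dict = field(default_factory=dict)  # token counts for request and response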
To better understand the value of providing a standard interface for
all LLM services, let’s compare the expected syntax of two common



providers, OpenAI and Google Gemini, starting with OpenAI. The
OpenAI documentation2 gives the following example:
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
Let’s compare that with the expected request to the Google Gemini
API. Google’s documentation3 gives the following specification:
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/publishers/google/models/${MODEL_ID}:streamGenerateContent" \
  -d '{
    "contents": [{
      "role": "user",
      "parts": [{
        "text": "TEXT"
      }]
    }]
  }'
These short samples from the documentation already show some
differences between the two APIs:

• OpenAI specifies the model in the JSON payload with the


model key-value pair, while Google specifies the model in the
URL path.

2 https://platform.openai.com/docs/api-reference/chat
3 https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/inference



• The array containing the content of all messages is called mes
sages by OpenAI and contents by Google.
• Google nests an additional array, parts, within its contents
array.

An LLM Mesh standardizes these and other differences, allowing


for faster application development and easy switching between LLM
services. As LLM service providers update their services, the LLM
Mesh developer will update the Mesh accordingly, freeing the appli‐
cation developers from the need to do so.
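
A highly simplified sketch of that translation might look like the
following; it mirrors the payload shapes from the documentation
excerpts above, but the function names are invented for this example,
and authentication, endpoints, and the many optional parameters are
omitted.

# Translate one generic request into the two provider-specific payload shapes.
def to_openai_payload(model: str, system_prompt: str, user_prompt: str) -> dict:
    # OpenAI names the model inside the JSON payload and uses a "messages" array.
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }

def to_gemini_payload(user_prompt: str) -> dict:
    # Google names the model in the URL path, so it does not appear in the body;
    # the message text is nested in a "parts" array inside "contents".
    return {
        "contents": [
            {"role": "user", "parts": [{"text": user_prompt}]},
        ],
    }

An LLM Mesh would maintain one such mapping per supported service,
behind a single generic interface.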

Thinking Like an LLM Mesh


As an object in an LLM Mesh, an LLM service expects
a prompt as input and is expected to provide text as its
output.

Retrieval Services
From the perspective of an LLM Mesh, a retrieval service takes a
user’s query as its input and provides a relevant result from unstruc‐
tured data as its output. It is how an LLM-powered application
accesses unstructured data. In this context, the unstructured data
is text data coming from documents, often stored as PDFs, plain
text documents, or other common document formats, like DOCX.
Retrieval services allow LLM-powered applications to make this
data available to the employees of an enterprise by allowing them
to discover it more accurately and rapidly. Importantly, the informa‐
tion in these documents is not only made available to the employees.
It will also be available for LLM-powered applications themselves to
inform their inference on how to solve a problem. Retrieval services
serve this dual purpose: they make the unstructured text data available
to both employees and the LLM-powered applications themselves, leading
in both cases to better decisions.
In retrieval services, like traditional search systems that came before
them, there is usually a tradeoff between the speed of the results and
the quality of the results. While it is possible to have fast results or
good quality results, it is difficult to have both. Retrieval services are
usually made of three separate components to provide the highest
quality results as quickly as possible. These are:



• Embedding models, which will convert the text of the document
base, as well as the text of the query, into dense vector represen‐
tations,
• Data storage, commonly vector stores, for storing the vectors
and performing efficient similarity search, and
• Reranking models, to improve the quality of the search results.

Increasingly, these services are being provided as bundled services


from various providers, as their combined functionality is required
to provide the desired result to the end user: a relevant result from
unstructured data in response to a natural language query. From the
perspective of an LLM Mesh, the mutual dependency of these three
underlying technologies is why they are combined as a single object,
“retrieval services.”
The following sections describe the components of a retrieval ser‐
vice and some of the tradeoffs that the various choices will entail.

Embedding Models
As described in the previous chapter, embedding models trans‐
form text into numerical representations called embeddings, stored
as high-dimensional vectors. For example, the word “banana”
might have the following embedding: [0.534, 0.312, -0.123, 0.874,
-0.567, ...]. Each number represents the value of a particular dimen‐
sion. If the vector had 100 dimensions, there would be 100 numbers
in the list.
These embeddings capture the semantic meanings of the text, such
that “Denver” and “capital of Colorado” will have similar vector
representations, even though they share no keywords, while “kid”
meaning “young goat” will have a different vector representation
than “kid” meaning “young human”.
Different embedding models use different embedding lengths,
meaning more or fewer dimensions for each vector. Put more sim‐
ply, a shorter embedding length means fewer dimensions in each
vector, and thus fewer numbers in the list that represents each
word or part of a word. More dimensions require more storage
and compute resources. New embedding models, like OpenAI’s text-
embedding-3 family of models, allow for the embedding size to be
shortened to a degree specified by the user. Shorter embeddings
can have lower storage and computational costs but may result in



degraded performance. Model developers are working to increase
the performance of embeddings with fewer dimensions.
Embedding models expect input text that has been pre-processed
to a certain degree. Different models have different requirements;
an LLM Mesh provides a standard interface that is mapped to each
model. Pre-processing will generally include extracting the text from
any documents (e.g., PDF or DOCX formats), removing punctua‐
tion, adding special tokens to tell the model about relevant breaks in
the text, and splitting longer documents into smaller “chunks” that
are sized appropriately for the embedding model.
In a retrieval service, embedding models serve the dual purpose of
converting the corpus of documents into vectors and then doing
the same for the query. Converting the corpus of text into vectors
is usually an offline task, done once, while converting the query is
necessarily done at runtime, when the query is received from the
user.
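
As a rough sketch of the offline step, assuming the
sentence-transformers package and a naive fixed-size chunking strategy
(a real pipeline would respect sentence and section boundaries and
handle document parsing):

from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    # Naive fixed-size chunking; shown only for illustration.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# The model name is an example; any embedding model could be used here.
model = SentenceTransformer("all-MiniLM-L6-v2")
documents = ["...extracted text of document 1...", "...extracted text of document 2..."]
chunks = [chunk for doc in documents for chunk in chunk_text(doc)]
embeddings = model.encode(chunks)  # one dense vector per chunk, written to the data store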

Vector Store, or Other Data Storage


The embeddings are written to a data store, often a dedicated vector
store. A vector store is a database specially designed to store and
efficiently query high-dimensional, dense vectors, like those created
by embedding models. Vector stores have built-in retrieval function‐
ality for finding a stored vector most similar to the query vector,
usually using a cosine similarity function.
Traditional data stores — including relational databases like Post‐
greSQL, document databases like MongoDB, search engines like
ElasticSearch, and graph databases such as Neo4J — are all adding
support for dense vector data types. As the use of vector data increa‐
ses in the enterprise thanks to the growth of text embeddings used
in LLM-powered applications, the use of these more traditional data
storage technologies may become increasingly relevant, reducing the
need for dedicated vector stores.
This evolving technology landscape is one more reason why
abstracting these services as “retrieval services” is important in an
LLM Mesh. While the underlying technologies may change, the
function remains the same: Provide relevant results from unstruc‐
tured data to user queries.



Reranking Models
The vector store’s retrieval function will provide a fast result, but
it may not always be the most accurate. More accurate results can
be obtained by using the vector store’s retrieval function to narrow
down the candidates and then applying a reranking model to analyze
the subset more carefully, selecting the best result to return.
In contrast to the retrieval function of the vector store, the reranking
model will take the entire source document plus the input query
for comparison. Given that the source data could contain thousands
or even millions of source documents, it would be too slow and
too costly to run that process across every document. By using the
retrieval function to narrow down the results to the top few (a
number which can be specified), and then running the reranking
model across the subset, you strike the best balance between speed
and quality. This is called two-stage retrieval.
The ranked results will be the output of the reranking model. The
retrieval system will provide the top-ranked result back to the user
in response to their query.
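
Putting the three components together, a two-stage retrieval pass
might be sketched as follows; the embed, vector_store, and reranker
objects are hypothetical stand-ins for whichever embedding model, data
store, and reranking model your retrieval service actually uses.

def retrieve(query: str, embed, vector_store, reranker, top_k: int = 20) -> str:
    query_vector = embed(query)                              # embed the query at runtime
    candidates = vector_store.search(query_vector, k=top_k)  # stage 1: fast similarity search
    ranked = reranker.rank(query, candidates)                 # stage 2: careful reranking of the subset
    return ranked[0]                                          # top-ranked passage returned to the agent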

Thinking Like an LLM Mesh


As an object in an LLM Mesh, a retrieval service expects a
natural language query as its input and is expected to out‐
put the top-ranked result from unstructured data. These
results are then generally passed on to an agent.

Prompts
Since the popular use of LLM-powered chatbots increased dramat‐
ically following the release of ChatGPT and associated products,
many of us are now familiar with the notion of a prompt. A prompt
is the initial input (a question, a command, or instructions) pro‐
vided to the model, prompting its response.
In contrast to the ad hoc prompting often used in consumer appli‐
cations, prompting in the enterprise benefits from a structured,
templated, composable approach. This allows a bank of prompts to
be developed, tested, and then shared for reuse across the organiza‐
tion. There are many different types of such prompts. The following
sections discuss several categories of prompts, with some simple
examples of each.



Role-Based Prompts
These prompts direct the LLM to respond as a specific type of
expert, such as a customer support agent or HR consultant, or guide
the AI on the tone, formality, and style of its responses.
Examples:

• “You are an IT support technician. Assist the user in trouble‐


shooting their software issue.”
• “Respond in a professional and concise manner suitable for
senior management.”

Compliance and Ethical Prompts


These prompts direct the LLM to provide responses that adhere to
specific regulations or legal frameworks, or responses which follow
specific internal guidelines for ethical practices.
Examples:

• “Ensure that no response contains personally identifiable infor‐


mation (PII), such as names, phone numbers, or identifiers like
Social Security numbers.”
• “Generate responses that respect the following internal ethical
AI guidelines [corporate ethical AI guidelines].”

While using such prompt components can decrease the risk of non-
compliance, they cannot guarantee that any result will necessarily be
compliant. As such, human oversight is required.

Customization, Personalization, and Context-Specific Prompts


These prompts customize responses from an LLM based on known
information about a user or customer, a prediction about them, or
other relevant contextual information. The variables in the example
prompts below would be completed based on information in the
enterprise Customer Relationship Management system (CRM), cus‐
tomer support records, or using the result of a predictive model.
Examples:

• “Personalize the marketing message for a [age]-year-old [gen‐


der] living in [postal code].”



• “Recommend to the customer [result from next-best offer pre‐
diction].”
• “Given the customer’s previous request about [subject of the
previous request], provide a relevant response.”
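
As a simple sketch of how such prompts can be templated and filled at
runtime, the variables below would come from the CRM or a predictive
model; the template text and field values are purely illustrative.

# A reusable prompt template, stored once and filled with customer context.
MARKETING_PROMPT = (
    "Personalize the marketing message for a {age}-year-old {gender} "
    "living in {postal_code}. Recommend: {next_best_offer}."
)

prompt = MARKETING_PROMPT.format(
    age=34,                                            # e.g., from the CRM
    gender="woman",
    postal_code="75011",
    next_best_offer="an upgrade to the premium plan",  # e.g., from a predictive model
)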

Multi-Step Process Prompts


These prompts guide the LLM to respond in a multi-step process
by breaking down complex decisions into smaller, more manageable
steps. These multi-step process prompts are the building blocks of
agents.
Examples:

• “Step 1: Gather all financial data from Q1. Step 2: Generate


a financial report. Step 3: Summarize the key findings in a
presentation.”
• “First, evaluate the market demand. Next, assess the cost impli‐
cations. Finally, recommend a go/no-go decision.”

Thinking Like an LLM Mesh


For the purposes of an LLM Mesh, prompts need to be
tested, approved, and published in the catalog, which we
will explain further later in this chapter. Any prompt must
be associated with a specific model and version, as small
changes in the model or prompt may result in dramatically
different prompt performance. These prompts can then
be combined with one another to compose more complex
and sophisticated prompts, themselves part of agents and
LLM-powered applications.

Agents
While various definitions for “agent” exist, from the perspective
of an LLM Mesh, an agent is an LLM-powered system capable
of accomplishing its objective across multiple steps using tools,
without requiring prompting by an end user for each step.
Within an LLM Mesh, an agent is the object where the other objects
interact with one another to form a system that can respond to
users’ needs. They call one or more LLM services, use several
templated prompts, and use one or more tools. As such, agents are



some of the most important objects within an LLM Mesh and are at
the core of building LLM-powered applications in the enterprise.
Like the other objects, they must be built, described, cataloged, and
maintained. As the maturity of an organization increases, it will
begin to develop more agents, and will likely start chaining those
agents together, with one agent using another as a tool. This increas‐
ing complexity can be tamed by the abstraction and modularity that
the LLM Mesh offers.
There are a few important parts to the above definition of an agent,
so let’s look at them one by one.

Objective
An agent’s developer will define its objective by giving it a role-based
prompt, as described in the previous section. For example, an agent
that is part of an application that is designed to generate real-time
sales analytics could include the following role-based prompt tem‐
plate:
You are a Business Intelligence Analyst with access to the
company's sales data across various regions and time periods.
Your role is to assist in retrieving specific data as requested
by the user and to provide additional analysis that highlights
any interesting, unusual, or noteworthy aspects of the data,
just as a human analyst would do.
When the user makes a request:
1. Accurately identify the relevant data source and retrieve
the specific data they are asking for.
2. Perform a detailed analysis on the retrieved data to
uncover any trends, anomalies, or key insights. Consider
aspects such as:
- Comparisons with previous periods or other regions.
- Significant changes or trends in the data.
- Potential reasons behind the observed data patterns.
- Any other insights that might be valuable for the user to know.
Finally, present the data and your analysis in a clear, concise
summary that the user can easily understand.
If the user's request is unclear or requires data from multiple
sources, use your judgment to clarify the request and combine
data sources as needed to provide a comprehensive analysis.
In this example, the objective is clearly described, as is what the
agent should do if the end user asks it to do something outside of its
prescribed scope.



Multiple Steps
Agents will execute multiple steps to meet their objectives. These
individual steps are linked in chains, which define the steps the
agent must take to meet the objective. This differentiates agents
from the simple, direct prompting of an LLM. For example, asking
an LLM to summarize a block of text cannot be considered an agent
because it is a single step.
Take for example an agent that has been built to summarize finan‐
cial reports. The multiple steps might be:

1. Call an API to download the desired report(s)


2. Locate and extract key figures from the report
3. Look up historical values for these figures and compare them
4. Extract key quotes from the report
5. Generate a semi-templated summary that includes the extracted
quotes, generated summary text, and a comparison between
historical and current figures
6. Send the report to the recipient over the specified channel

Each step would include a templated prompt that would be modi‐


fied with either the user input or the LLM output from the preced‐
ing step. The steps are strung together in a chain, which may be
sequential, looping, branching, or parallel. Throughout this
process, multiple calls to the LLM service will occur without any
user involvement.
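
A sequential chain of this kind might be sketched as follows; every
function, tool name, and prompt template shown here is a placeholder
invented for illustration, not part of any particular agent framework,
and llm is assumed to be a callable LLM service object.

# Minimal sketch of a sequential agent chain for summarizing a financial report.
EXTRACT_FIGURES_PROMPT = "Extract the key financial figures from this report:\n{text}"
EXTRACT_QUOTES_PROMPT = "Extract the most important quotes from this report:\n{text}"
SUMMARY_PROMPT = (
    "Write a concise summary using these figures: {figures}, "
    "this historical comparison: {history}, and these quotes: {quotes}."
)

def summarize_report(report_id: str, recipient: str, llm, tools) -> None:
    report = tools["report_api"].download(report_id)           # step 1: fetch the report
    figures = llm(EXTRACT_FIGURES_PROMPT.format(text=report))  # step 2: extract key figures
    history = tools["warehouse"].query_history(figures)        # step 3: compare with historical values
    quotes = llm(EXTRACT_QUOTES_PROMPT.format(text=report))    # step 4: extract key quotes
    summary = llm(SUMMARY_PROMPT.format(                       # step 5: compose the summary
        figures=figures, history=history, quotes=quotes))
    tools["messaging"].send(recipient, summary)                # step 6: deliver over the chosen channel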

Autonomy
An agent is granted some degree of autonomy. Using the analytics-
generating agent as an example, a minimal degree of autonomy may
simply be deciding which Python package or function to use during
the data analysis step. A more significant degree of autonomy may
be choosing the tool that it will use to meet its objective from several
made available to it (e.g., deciding if it should query historical data
from a data warehouse or live data from a CRM to best respond to
the user’s request).
Less autonomy will mean that the agent is less flexible in the type
of problem it can solve, but more likely to give a good result in
that narrower range. More autonomy will mean more flexibility,



but more risk that the results will not be satisfactory. In the enter‐
prise, agents are likely to be quite limited in their autonomy, with
narrowly defined options available to them, especially during the
early stages of their development and use. This may change over
time as models and agent-building techniques evolve and improve.

Tool Use
A defining characteristic of an agent is its use of tools to accomplish
its objectives. These tools are described in more detail in the follow‐
ing section.

Thinking Like an LLM Mesh


As an object in an LLM Mesh, an agent expects some task
as an input and is expected to provide a satisfactory result
as an output. This broad definition reflects the breadth of
what agents can be built to accomplish.

Tools
In an LLM Mesh, a tool is any function or system that an agent is
provided with to accomplish its task. As such, tools cover a very
wide range of potential technologies. This breadth gives agents and
LLM-powered applications their incredible potential: they can auto‐
mate and accelerate tasks, decisions, and operations that otherwise
require manual work across the enterprise and its business systems.
The types of systems that can serve as tools in an LLM Mesh include
but are not limited to:

• Internal data storage and retrieval systems, such as databases,


data warehouses, and data lakes.
• Enterprise software systems, such as CRM, Human Resources
Management System (HRMS), and Enterprise Resource Plan‐
ning (ERP) systems.
• Advanced analytical assets, like predictive machine learning
models.
• Programming and querying languages, like Python and SQL,
along with specific packages or proprietary code.
• External data APIs, such as financial data or weather services.



• Other agents within the LLM Mesh.

For an agent to use a tool, it needs to understand what the tool


is and how to use it. This is accomplished by creating a schema
for each tool. This schema is what allows for some standardized
interaction with the tool, despite the great diversity of tools that may
exist in the LLM Mesh. The schema should include:

• A description of the tool, including examples of the circumstan‐


ces in which it should be used.
• Instructions on how to interact with the tool, including what
input is expected and what output is expected.
• Connection details for accessing the tool.

By ensuring that each tool has a well-described schema, the tools


can be used across different agents, including those with a high
degree of autonomy, as those agents will rely on the descriptions in
the schema to decide which tool to use.
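
For illustration, a schema for a data warehouse querying tool might
look something like the following; the structure, field names, and
connection details are examples rather than a standard.

sales_warehouse_tool = {
    "name": "sales_data_warehouse",
    "description": (
        "Runs read-only SQL queries against the sales data warehouse. "
        "Use this tool when the user asks for sales figures by region or time period."
    ),
    "input": "A valid SQL SELECT statement, as a string.",
    "output": "The query results, returned as a list of rows.",
    "connection": {
        "type": "jdbc",
        "endpoint": "warehouse.internal.example:5432",  # hypothetical connection details
    },
}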

Thinking Like an LLM Mesh


As an object within an LLM Mesh, a tool provides a
schema, making itself available for use by an agent. The
tool expects an input and provides an output as defined
in that schema. Tools are very flexible and their schema is
essential to their use by agents.

Applications
In an LLM Mesh, an application is what makes an agent available
to end users. The agent defines the logic that orchestrates the differ‐
ent objects from the LLM Mesh that are required to accomplish
a specific purpose. The application is the interface and supporting
functions that allow the end users to interact with the agent, to
better understand the results provided by the agent, and to provide
feedback to the developers. The application is also where certain
services providing security, safety, and cost control are enforced.
There are several types of LLM-powered applications, including:

• Chat interfaces where users interact with the agent iteratively.



• Contextual assistants, either as desktop applications or browser
extensions, that provide some additional functionality or assis‐
tance in the context where the user is working at that moment.
• Backend or “headless” applications that run without direct end
user interaction.

LLM-powered applications can have a wide range of interfaces and


functionality. The abstraction and standardization within the LLM
Mesh makes it simpler for the developer to build the application
in a way that clearly communicates to the end user how the agent
underpinning the application is generating its results.
For example, in the case of an application that exposes an analytics-
generating agent to end users, it will be easier for the end user
to understand and trust the results if the application distinguishes
between outputs that come from a query to a retrieval system or
a tool versus outputs that are the result of the LLM’s interpretation
or suggestion. Furthermore, the end user will also be more likely to
trust results if they can verify that the sources used, and the query
that the LLM generated, are appropriate for the objective of the
application.
Feedback mechanisms should also be built into the application to
ensure that when an agent does not behave as expected, end users
can flag this anomaly to the developers so that they can monitor the
agent’s performance and take corrective action if necessary.

Thinking Like an LLM Mesh


Within an LLM Mesh, the application object includes the
application itself, versioning for the deployed application,
and logging of user interactions with the application.

Cataloging LLM-Related Objects


As an organization begins developing more LLM-powered applica‐
tions, the number of different objects it will need to use to build
those applications will grow rapidly. This could become difficult to
manage, with users hunting for different objects, recreating existing
objects, or using unapproved objects.



Overcoming these challenges starts by creating a central catalog for
all of these objects. This catalog is a fundamental component of an
LLM Mesh. The catalog should:

• Account for all LLM-related objects that are available for use in
the enterprise.
• Provide documentation that describes each object and gives
instructions for its use.
• Track the version or other details about the ownership and
development history of the object.
• Assign a unique ID to each object to allow it to be referenced
and tracked unambiguously.
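
For illustration, a single catalog entry meeting these requirements
might be represented in a structured form such as the following; the
fields and values are examples rather than a required schema.

catalog_entry = {
    "id": "llm-svc-00042",                  # unique, unambiguous identifier
    "type": "llm_service",                  # the kind of LLM Mesh object
    "name": "gpt-4o-chat",
    "description": "General-purpose chat completion service approved for internal use.",
    "owner": "platform-ml-team",            # ownership and development history
    "version": "2024-08-01",
    "documentation": "Expects a standardized prompt payload; returns text completions.",
    "access": ["analytics-developers", "support-agents"],  # enforced access controls
}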

This information is stored in a structured format that allows human


and machine discovery of the available objects. Having a central
catalog of the objects provides various benefits for organizations as
they begin building more LLM-powered applications. Those bene‐
fits include:
Standardization
Only approved objects can be added after a vetting process.
Governance and Compliance
You can maintain full transparency and traceability of which
data are used with which LLM for which purposes, enabling
business alignment and regulatory compliance.
Security
The catalog allows access controls to be defined and enforced,
controlling which end users and automated systems have access
to what objects.
Composability
Once registered, objects can be easily added to new applications
where they are combined with other objects, accelerating the
development process.
Efficiency
Less time is spent manually connecting different objects, accel‐
erating application development.
Importantly, this catalog will be useful for both the end users and
the LLM-powered agents that they will be building. The agents will



also rely on the documentation to discover and use the objects, and
they will be subject to the same security model.

Conclusion
In this chapter, we have learned about the various objects of an
LLM Mesh, how they are abstracted, and how they can be integrated
with one another. The final chapter of this guide will go into greater
detail about how a specific LLM-powered application can be built
using an LLM Mesh. Before getting to that, however, the following
chapters will describe the various federated services that an LLM
Mesh must also provide to meet enterprise security, reliability, and
cost requirements for the many LLM-powered applications that will
be built within it. Chapter 3 will start with that most important
of enterprise considerations: cost. How can the overall cost of an
enterprise’s LLM use be optimized? Read on to learn more.

About the Author
As Head of AI Strategy at Dataiku, Kurt Muehmel brings Dataiku’s
vision of Everyday AI to industry analysts and media
worldwide. He advises Dataiku’s C-Suite on market and technology
trends, ensuring that they maintain their position as pioneers. Kurt
is a creative and analytical executive with 15+ years of experience
and foundational expertise in the Enterprise AI space and, more
broadly, B2B SaaS go-to-market strategy and tactics. He’s focused on
building a future where the most powerful technologies serve the
needs of people and businesses.
