The LLM Mesh
Kurt Muehmel
Editors: Jeff Bleiel and Aaron Black
Cover Designer: Susan Brown
Production Editor: Kristen Brown
Illustrator: Kate Dullea
Interior Designer: David Futato
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. The LLM Mesh,
the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the author and do not represent the
publisher’s views. While the publisher and the author have used good faith efforts
to ensure that the information and instructions contained in this work are accurate,
the publisher and the author disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this
work is at your own risk. If any code samples or other technology this work contains
or describes is subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof complies with such
licenses and/or rights.
This work is part of a collaboration between O’Reilly and Dataiku. See our statement
of editorial independence.
978-1-098-17661-7
CHAPTER 1
Using LLMs in the Enterprise
diversity and speed of release create both opportunities and challenges when you are looking to use these technologies in production use cases.
Today, you can build entirely new capabilities that would not have
been possible previously, to improve the lives of your employees and
better serve your customers. But you also have to keep up with rapid
changes in the core technologies and use techniques that have not
been fully proven. We are all now at the cutting edge.
This diversity of options among the technologies and techniques is
truly a great thing. In fact, we are just scratching the surface for
the potential uses of LLMs in the enterprise. It’s easy to imagine
Inference Costs
The most direct impact of a larger model size will be on inference cost. Inference is the process of generating tokens in response to a particular input. A model with more parameters will require more calculations during inference. One way or another, those calculations must be run on some hardware that is installed and managed somewhere and that is consuming electricity, for which someone will have to pay the bill at the end of the month.
In some cases, companies offering these models as a service may obfuscate these costs, for example, by subsidizing the cost in order to gain more customers. This may make an apples-to-apples comparison difficult. We’ll dig into cost considerations in Chapter 3.
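As a back-of-the-envelope illustration, the per-request cost of inference can be estimated from token counts and per-token prices. In the following sketch, the model names and prices are placeholders rather than actual vendor rates:

# Illustrative only: estimate per-request inference cost from token counts.
# Model names and prices are placeholders, not actual vendor pricing.
PRICING_PER_MILLION_TOKENS = {
    "large-model": {"input": 5.00, "output": 15.00},   # USD per 1M tokens
    "small-model": {"input": 0.50, "output": 1.50},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost of one request, in USD."""
    price = PRICING_PER_MILLION_TOKENS[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

for model in PRICING_PER_MILLION_TOKENS:
    print(model, round(request_cost(model, input_tokens=2_000, output_tokens=500), 4))

Even a rough model like this makes it easier to see how quickly a large-model price premium compounds across thousands of daily requests.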
Some models function, in essence, as a combination of smaller models, each specialized in different tasks. This architecture, known as Mixture of Experts (MoE), can dramatically reduce the cost of inference. One well-known model using an MoE architecture is Mixtral 8x7B from Mistral. Although the model has 46.7 billion parameters in total, only 12.9 billion are used for any given token. This approach has led to improvements in inference cost, but it makes the model more challenging to build and to fine-tune.
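To make the routing idea concrete, the following toy sketch shows the principle in Python: a gating network scores the experts for each token and only the top-scoring experts run, so only a fraction of the total parameters is exercised per token. The dimensions and expert count here are illustrative and do not correspond to Mixtral’s actual architecture.

import numpy as np

# Toy Mixture-of-Experts routing. Dimensions and expert count are illustrative.
rng = np.random.default_rng(0)
n_experts, d_model, top_k = 8, 16, 2
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
gate = rng.standard_normal((d_model, n_experts))

def moe_layer(token):
    scores = token @ gate                      # one score per expert
    chosen = np.argsort(scores)[-top_k:]       # keep only the top-k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()                   # softmax over the chosen experts
    # Only the chosen experts' parameters are used for this token.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, chosen))

print(moe_layer(rng.standard_normal(d_model)).shape)   # (16,)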
All other things being equal, larger models will be more expensive to use, though technological advancements like MoE mean that these tradeoffs will become more complex in the future. The benefit that a large model brings to a particular use case may justify its expense, as discussed in the sections and chapters that follow, but a wise strategy will use large models only where needed.
Inference Speed
While the inference of a larger model will require more calculations, these calculations can be completed more quickly on larger, higher-performance hardware. Furthermore, many of these calculations can be parallelized, using multiple processing units at the same time. Again, MoE models do not need to use all of their parameters for every task.
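A common speed metric is the number of output tokens generated per second. The sketch below shows one way to measure it; the generate function is a hypothetical stand-in for whatever call your LLM service exposes.

import time

def tokens_per_second(generate, prompt):
    """Time a single generation call and return output tokens per second.
    `generate` is assumed to return the list of generated tokens."""
    start = time.perf_counter()
    output_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(output_tokens) / elapsed

# Dummy generator so the sketch runs on its own; replace with a real client call.
def fake_generate(prompt):
    time.sleep(0.1)
    return ["token"] * 50

print(f"{tokens_per_second(fake_generate, 'hello'):.1f} tokens/sec")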
Standard benchmarks are being established to accurately quantify
and compare the speed of different models, acknowledging that
Context Windows
The amount of input text that a model can receive within a single prompt is known as its context window. Measured in tokens, it defines how much information a model can work with at one time.
For example, a model with a small context window can only summarize a document that fits within that window. You could break the document into smaller pieces, but the model would summarize each separately, without knowledge of the entire document, potentially producing repetitive or incoherent results.
Large context windows, on the other hand, allow for plenty of
space to provide examples of what you want the LLM to produce
(called few-shot learning) and to engage in more complex prompt
engineering techniques.
Generally speaking, larger models have larger context windows, and some models have been optimized for exceptionally large context windows. While the original GPT model had a context window of only 512 tokens (approximately one page of text), Gemini 1.5 from Google now has a context window of more than 1 million tokens and has been shown in internal testing to handle up to 10 million tokens.
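Because context windows are measured in tokens rather than characters or pages, it is worth counting tokens before sending a long document. The sketch below uses the tiktoken library as one way to do this; the encoding name and the 128,000-token limit are illustrative assumptions, not properties of any particular model.

import tiktoken  # pip install tiktoken

CONTEXT_WINDOW = 128_000          # illustrative limit; check your model's documentation
enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(document: str, reserved_for_output: int = 1_000) -> bool:
    """Check whether a document leaves room in the window for the response."""
    n_tokens = len(enc.encode(document))
    print(f"Document is {n_tokens} tokens")
    return n_tokens + reserved_for_output <= CONTEXT_WINDOW

print(fits_in_context("A very short document."))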
Proprietary Models
Proprietary models are just that: proprietary to their creators. The creator of a proprietary model retains full control over the model’s intellectual property.
Open-Weights Models
An open-weights model provides public access to the pre-trained
parameters of the model. This can allow the end user to modify the
weights through fine-tuning or other techniques to adapt the model
to their needs.
Open-weights models typically do not publish their training data, training algorithms, or other associated information. This can limit the ability to perform a detailed technical inspection of the model or to reproduce the model’s performance. These limitations, however, are most relevant to other researchers and less relevant to enterprises that simply seek to use a model in the most efficient and effective way.
Model Hosting
Enterprises have three main hosting options when looking to access
LLMs:
LLM Services
An LLM service is a combination of storage resources, compute
resources, and supporting software that allows an LLM to be hosted
and accessed for inference.
The developer of the model may provide LLM services. For example, OpenAI, Anthropic, and Mistral all provide services that run their proprietary models. In these services, the end user does not load the model into the service; they simply select the service running the model that they prefer.
Alternatively, an organization may choose to build and run its own
LLM service, managing the GPUs and associated technologies.
Finally, cloud service providers (CSPs) offer managed LLM services. In these, the end user may select the model that they wish to run, but the CSP manages the compute and storage infrastructure.
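In all three cases, the service is typically accessed through a simple SDK or HTTP call. The sketch below uses the OpenAI Python client’s chat completions interface as one example; the model name is a placeholder, and other providers expose similar but not identical APIs.

from openai import OpenAI  # pip install openai; requires OPENAI_API_KEY in the environment

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; substitute the model your service runs
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what an LLM service is in one sentence."},
    ],
)
print(response.choices[0].message.content)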
Retrieval Services
From the perspective of an LLM Mesh, a retrieval service takes a user’s query as its input and provides a relevant result from unstructured data as its output. It is how an LLM-powered application accesses unstructured data. In this context, the unstructured data is text coming from documents, often stored as PDFs, plain text files, or other common document formats, like DOCX. Retrieval services allow LLM-powered applications to make this data available to the employees of an enterprise, letting them discover it more accurately and rapidly. Importantly, the information in these documents is made available not only to employees; it is also available to the LLM-powered applications themselves, informing their inference on how to solve a problem. Retrieval services serve this dual purpose: making unstructured text data available to both employees and LLM-powered applications, which, in both cases, leads to better decisions.
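At the level of an LLM Mesh, the interface to a retrieval service can be quite small: a query goes in, relevant passages come out. The sketch below illustrates such an interface along with a deliberately naive implementation; the class and field names are illustrative, not a standard API.

from dataclasses import dataclass
from typing import Protocol

@dataclass
class Passage:
    text: str          # the retrieved chunk of unstructured text
    source: str        # e.g., the document or file it came from
    score: float       # relevance score assigned by the retrieval service

class RetrievalService(Protocol):
    def retrieve(self, query: str, k: int = 5) -> list[Passage]:
        """Return the k most relevant passages for the query."""
        ...

class KeywordRetrieval:
    """Naive implementation: scores passages by keyword overlap with the query."""
    def __init__(self, passages: list[Passage]):
        self.passages = passages

    def retrieve(self, query: str, k: int = 5) -> list[Passage]:
        words = set(query.lower().split())
        scored = [
            Passage(p.text, p.source, len(words & set(p.text.lower().split())))
            for p in self.passages
        ]
        return sorted(scored, key=lambda p: p.score, reverse=True)[:k]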
In retrieval services, as in the traditional search systems that came before them, there is usually a tradeoff between the speed of the results and their quality. While it is possible to have fast results or high-quality results, it is difficult to have both. Retrieval services are therefore usually made up of three separate components so that they can provide the highest quality results as quickly as possible. These are:
Embedding Models
As described in the previous chapter, embedding models transform text into numerical representations called embeddings, stored as high-dimensional vectors. For example, the word “banana” might have the following embedding: [0.534, 0.312, -0.123, 0.874, -0.567, ...]. Each number represents the value of a particular dimension. If the vector had 100 dimensions, there would be 100 numbers in the list.
These embeddings capture the semantic meanings of the text, such
that “Denver” and “capital of Colorado” will have similar vector
representations, even though they share no keywords, while “kid”
meaning “young goat” will have a different vector representation
than “kid” meaning “young human”.
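Similarity between embeddings is typically measured with cosine similarity. The sketch below shows the calculation on hand-made, low-dimensional vectors; real embeddings would come from an embedding model and have hundreds or thousands of dimensions.

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made, low-dimensional stand-ins for real embeddings.
denver = [0.53, 0.31, -0.12, 0.87]
capital_of_colorado = [0.51, 0.29, -0.10, 0.85]
young_goat = [-0.40, 0.72, 0.35, -0.20]

print(cosine_similarity(denver, capital_of_colorado))  # close to 1.0
print(cosine_similarity(denver, young_goat))           # much lower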
Different embedding models use different embedding lengths, meaning more or fewer dimensions for each vector. Put more simply, a shorter embedding length means fewer dimensions in each vector, and thus fewer numbers in the list that represents each word or part of a word. Longer embeddings require more storage and compute resources. New embedding models, like OpenAI’s text-embedding-3 family, allow the embedding size to be shortened to a degree specified by the user. Shorter embeddings can have lower storage and computational costs but may result in some loss of accuracy.
Prompts
Since the use of LLM-powered chatbots increased dramatically following the release of ChatGPT and associated products, many of us are now familiar with the notion of a prompt. A prompt is the initial input (a question, a command, or instructions) provided to the model, prompting its response.
In contrast to the ad hoc prompting often used in consumer applications, prompting in the enterprise benefits from a structured, templated, composable approach. This allows a bank of prompts to be developed, tested, and then shared for reuse across the organization. There are many different types of such prompts. The following sections discuss several categories of prompts, with some simple examples of each.
While such prompt components can decrease the risk of non-compliance, they cannot guarantee that every result will be compliant. As such, human oversight is required.
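As a simple illustration, a templated prompt component might look like the following sketch, where the field names and wording are hypothetical rather than a prescribed standard:

from string import Template

# A reusable summarization prompt; teams fill in the fields rather than
# writing ad hoc prompts. Field names and wording are illustrative.
SUMMARIZE_TEMPLATE = Template(
    "You are an assistant for the $department team.\n"
    "Summarize the following document in at most $max_words words, "
    "focusing on $focus.\n\n"
    "Document:\n$document"
)

prompt = SUMMARIZE_TEMPLATE.substitute(
    department="Finance",
    max_words=150,
    focus="revenue trends and risks",
    document="(document text goes here)",
)
print(prompt)

Because the template is a single shared object, it can be versioned, tested, and reused across applications rather than rewritten ad hoc by each team.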
Agents
While various definitions for “agent” exist, from the perspective
of an LLM Mesh, an agent is an LLM-powered system capable
of accomplishing its objective across multiple steps using tools,
without requiring prompting by an end user for each step.
Within an LLM Mesh, an agent is the object where the other objects interact with one another to form a system that can respond to users’ needs. An agent calls one or more LLM services, uses templated prompts, and uses one or more tools. As such, agents are
Objective
An agent’s developer will define its objective by giving it a role-based prompt, as described in the previous section. For example, an agent that is part of an application designed to generate real-time sales analytics could include the following role-based prompt template:
You are a Business Intelligence Analyst with access to the
company's sales data across various regions and time periods.
Your role is to assist in retrieving specific data as requested
by the user and to provide additional analysis that highlights
any interesting, unusual, or noteworthy aspects of the data,
just as a human analyst would do.
When the user makes a request:
1. Accurately identify the relevant data source and retrieve
the specific data they are asking for.
2. Perform a detailed analysis on the retrieved data to
uncover any trends, anomalies, or key insights. Consider
aspects such as:
- Comparisons with previous periods or other regions.
- Significant changes or trends in the data.
- Potential reasons behind the observed data patterns.
- Any other insights that might be valuable for the user to know.
Finally, present the data and your analysis in a clear, concise
summary that the user can easily understand.
If the user's request is unclear or requires data from multiple
sources, use your judgment to clarify the request and combine
data sources as needed to provide a comprehensive analysis.
In this example, the objective is clearly described, as is what the
agent should do if the end user asks it to do something outside of its
prescribed scope.
Autonomy
An agent is granted some degree of autonomy. Using the analytics-generating agent as an example, a minimal degree of autonomy may simply be deciding which Python package or function to use during the data analysis step. A more significant degree of autonomy may be choosing, from several tools made available to it, which one to use to meet its objective (e.g., deciding whether it should query historical data from a data warehouse or live data from a CRM to best respond to the user’s request).
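The sketch below illustrates that kind of choice: the agent asks the LLM which of two data-access tools fits the request and then dispatches to it. The llm callable and the tool implementations are hypothetical placeholders, not a specific framework’s API.

# Hypothetical sketch: the agent lets the LLM choose between two data tools.
def query_data_warehouse(request: str) -> str:
    return "historical sales figures..."      # stand-in for a SQL query

def query_crm(request: str) -> str:
    return "live opportunities and deals..."  # stand-in for a CRM API call

TOOLS = {"data_warehouse": query_data_warehouse, "crm": query_crm}

def handle_request(llm, request: str) -> str:
    """`llm` is a placeholder callable that returns the model's text response."""
    choice = llm(
        "Choose the best tool for this request. "
        f"Answer with exactly one of {list(TOOLS)}.\n\nRequest: {request}"
    ).strip()
    tool = TOOLS.get(choice, query_data_warehouse)  # fall back if the answer is unexpected
    return tool(request)

# Example with a stubbed LLM that always picks the CRM tool:
print(handle_request(lambda prompt: "crm", "Which deals closed this week?"))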
Less autonomy will mean that the agent is less flexible in the type
of problem it can solve, but more likely to give a good result in
that narrower range. More autonomy will mean more flexibility,
Tool Use
A defining characteristic of an agent is its use of tools to accomplish
its objectives. These tools are described in more detail in the following section.
Tools
In an LLM Mesh, a tool is any function or system that an agent is
provided with to accomplish its task. As such, tools cover a very
wide range of potential technologies. This breadth gives agents and
LLM-powered applications their incredible potential: they can automate and accelerate tasks, decisions, and operations that otherwise
require manual work across the enterprise and its business systems.
The types of systems that can serve as tools in an LLM Mesh include
but are not limited to:
Applications
In an LLM Mesh, an application is what makes an agent available
to end users. The agent defines the logic that orchestrates the different objects from the LLM Mesh that are required to accomplish
a specific purpose. The application is the interface and supporting
functions that allow the end users to interact with the agent, to
better understand the results provided by the agent, and to provide
feedback to the developers. The application is also where certain
services providing security, safety, and cost control are enforced.
There are several types of LLM-powered applications, including:
• Account for all LLM-related objects that are available for use in
the enterprise.
• Provide documentation that describes and provides instructions
for using each object.
• Track the version or other details about the ownership and
development history of the object.
• Assign a unique ID to each object to allow it to be referenced and tracked unambiguously (a sketch of such an entry follows).
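Taken together, these requirements might translate into a catalog entry along the lines of the following sketch, where the field names are illustrative rather than a prescribed schema.

from dataclasses import dataclass, field
import uuid

@dataclass
class CatalogEntry:
    """One LLM-related object (service, prompt, tool, agent, ...) in the catalog."""
    name: str
    object_type: str                      # e.g., "llm_service", "prompt", "tool", "agent"
    owner: str
    version: str
    documentation: str                    # description and usage instructions
    id: str = field(default_factory=lambda: str(uuid.uuid4()))  # unique, unambiguous reference

entry = CatalogEntry(
    name="sales-analyst-agent",
    object_type="agent",
    owner="analytics-team",
    version="1.2.0",
    documentation="Answers sales questions using the warehouse and CRM tools.",
)
print(entry.id, entry.name)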
Conclusion
In this chapter, we have learned about the various objects of an
LLM Mesh, how they are abstracted, and how they can be integrated
with one another. The final chapter of this guide will go into greater
detail about how a specific LLM-powered application can be built
using an LLM Mesh. Before getting to that, however, the following
chapters will describe the various federated services that an LLM
Mesh must also provide to meet enterprise security, reliability, and
cost requirements for the many LLM-powered applications that will
be built within it. Chapter 3 will start with that most important
of enterprise considerations: cost. How can the overall cost of an
enterprise’s LLM use be optimized? Read on to learn more.
About the Author
As Head of AI Strategy at Dataiku, Kurt Muehmel brings Dataiku’s
vision of Everyday AI to industry analysts and media
worldwide. He advises Dataiku’s C-Suite on market and technology
trends, ensuring that they maintain their position as pioneers. Kurt
is a creative and analytical executive with 15+ years of experience
and foundational expertise in the Enterprise AI space and, more
broadly, B2B SaaS go-to-market strategy and tactics. He’s focused on
building a future where the most powerful technologies serve the
needs of people and businesses.