
The LLM Triangle Principles to Architect Reliable AI Apps

Software design principles for thoughtfully designing reliable, high-performing LLM applications. A framework to bridge the gap between potential and production-grade performance.

Almog Baku
Towards Data Science
16 min read · Jul 16, 2024


Large Language Models (LLMs) hold immense potential, but developing reliable production-grade applications remains challenging. After building dozens of LLM systems, I’ve distilled the formula for success into 3+1 fundamental principles that any team can apply.

“LLM-Native apps are 10% sophisticated model and 90% experimental, data-driven engineering work.”

Building production-ready LLM applications requires careful engineering practices. When users cannot interact directly with the LLM, the prompt must be meticulously composed to cover all nuances, as iterative user feedback may be unavailable.

Introducing the LLM Triangle Principles

The LLM Triangle Principles encapsulate the essential guidelines for building effective LLM-native apps. They provide a solid conceptual framework that guides developers in constructing robust, reliable LLM-native applications and offers direction and support along the way.

An optimal LLM Usage is achieved by optimizing the three prominent principles through the lens of the SOP. (Image by author)

The Key Apices

The LLM Triangle Principles introduce four programming principles to help you design and build LLM-Native apps.

The first principle is the Standard Operating Procedure (SOP). The SOP guides the three apices of our triangle: Model, Engineering Techniques, and Contextual Data.

Optimizing the three apices through the lens of the SOP is the key to ensuring a high-performing LLM-native app.

1. Standard Operating Procedure (SOP)

Standard Operating Procedure (SOP) is a well-known term in the industrial world: a set of step-by-step instructions compiled by large organizations to help their workers carry out routine operations while maintaining high quality and consistent results each time. In practice, writing detailed instructions turns inexperienced or low-skilled workers into experts.

The LLM Triangle Principles borrow the SOP paradigm and encourage you to consider the model as an inexperienced/unskilled worker. We can ensure higher-quality results by “teaching” the model how an expert would perform this task.

The SOP guiding principle. (image by author)

“Without an SOP, even the most powerful LLM will fail to deliver consistently high-quality results.”

When thinking about the SOP guiding principle, we should identify what techniques will help us implement the SOP most effectively.

1.1. Cognitive modeling

To create an SOP, we need to take our best-performing workers (domain experts), model how they think and work to achieve the same results, and write down everything they do.

After editing and formalizing it, we’ll have detailed instructions to help every inexperienced or low-skilled worker succeed and yield excellent work.

As with humans, it’s essential to reduce the cognitive load of the task by simplifying or splitting it. Following a simple step-by-step instruction is more straightforward than following a lengthy, complex procedure.

During this process, we identify the hidden “implicit cognition jumps” — the small, unconscious, often unspoken assumptions or decisions experts make that can substantially affect the final result.

An example of an “implicit cognition jump.” (Image by author)

For example, let’s say we want to model an SQL analyst. We’ll start by interviewing them and asking a few questions, such as:

  • What do you do when you are asked to analyze a business problem?
  • How do you make sure your solution meets the request?
  • <reflecting the process as we understand to the interviewee>
  • Does this accurately capture your process? <getting corrections>
  • Etc.
An example of the cognitive process that the analyst does and how to model it. (Image by author)

The implicit cognition process takes many shapes and forms; a typical example is a “domain-specific definition.” For example, “bestseller” might be a prominent term for our domain expert, but not for everyone else.

Expanding the implicit cognition process in our SQL analyst example. (Image by author)

Eventually, we’ll have a full SOP “recipe” that allows us to emulate our top-performing analyst.

When mapping out these complex processes, it can be helpful to visualize them as a graph. This is especially helpful when the process is nuanced and involves many steps, conditions, and splits.

The “SQL Analyst SOP” includes all the required technical steps, visualized as a graph. (Image by author)

Our final solution should mimic the steps defined in the SOP. In this stage, try to ignore the implementation; later, you can implement it across one or many steps/chains throughout the solution.

Unlike the rest of the principles, cognitive modeling (SOP writing) is the only standalone process. It’s highly recommended that you model your process before writing code. That said, while implementing it, you might go back and revise it based on new insights or understandings you gain.

Now that we understand the importance of creating a well-defined SOP that guides our business understanding of the problem, let’s explore how we can effectively implement it using various engineering techniques.

2. Engineering Techniques

Engineering Techniques help you practically implement your SOP and get the most out of the model. When thinking about the Engineering Techniques principle, we should consider what tools (techniques) in our toolbox can help us implement and shape our SOP and assist the model in communicating well with us.

The Engineering Techniques principle. (Image by author)

Some engineering techniques are only implemented in the prompt layer, while many require a software layer to be effective, and some combine both layers.

Engineering Techniques Layers. (Image by author)

While many small nuances and techniques are discovered daily, I’ll cover two primary techniques: workflow/chains and agents.

2.1. LLM-Native architectures (aka flow engineering or chains)

The LLM-Native Architecture describes the agentic flow your app goes through to yield the task’s result.

Each step in our flow is a standalone process that must occur to achieve our task. Some steps will be performed simply by deterministic code; for some, we will use an LLM (agent).

To do that, we can reflect on the Standard Operating Procedure (SOP) we drew and think:

  1. Which SOP steps should we group together into the same agent? And which steps should we split into different agents?
  2. Which SOP steps should be executed in a standalone manner (though they might be fed with information from previous steps)?
  3. Which SOP steps can we perform with deterministic code?
  4. Etc.
An LLM-Native Architecture example for “Wikipedia writer” based on a given SOP. (Image by author)
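
To make this concrete, here is a minimal Python sketch of such a flow, loosely following the “Wikipedia writer” idea: some steps are plain deterministic code, while others are LLM agent calls. The `call_llm` helper and the step functions are hypothetical placeholders rather than part of any specific framework.

# Hypothetical sketch: a small LLM-native flow mixing deterministic code and agent steps.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM provider of choice")

def plan_outline(topic: str) -> str:
    # Agent step: ask the model to plan before writing (an explicit SOP step).
    return call_llm(f"Write a bullet-point outline for an article about: {topic}")

def outline_is_valid(outline: str) -> bool:
    # Deterministic step: a cheap assertion instead of another LLM call.
    return len([line for line in outline.splitlines() if line.strip()]) >= 3

def write_section(topic: str, bullet: str) -> str:
    # Agent step: each section is a standalone call, fed with the previous step's output.
    return call_llm(f"Topic: {topic}\nWrite a short section covering: {bullet}")

def run_flow(topic: str) -> str:
    outline = plan_outline(topic)
    if not outline_is_valid(outline):
        outline = plan_outline(topic)  # simple retry as a failover fallback
    sections = [write_section(topic, b) for b in outline.splitlines() if b.strip()]
    return "\n\n".join(sections)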

Before navigating to the next step in our architecture/graph, we should define its key properties:

  • Inputs and outputs — What is the signature of this step? What is required before we can take an action? (This can also serve as the output format for an agent.)
  • Quality assurances — What makes the response “good enough”? Are there cases that require a human in the loop? What kinds of assertions can we configure?
  • Autonomy level — How much control do we need over the result’s quality? What range of use cases can this stage handle? In other words, how much can we trust the model to work independently at this point?
  • Triggers — What is the next step? What defines the next step?
  • Non-functional — What’s the required latency? Do we need special business monitoring here?
  • Failover control — What kinds of failures (systematic and agentic) can occur? What are our fallbacks?
  • State management — Do we need a special state-management mechanism? How do we retrieve/save state (define the indexing key)? Do we need persistent storage? What are the different usages of this state (e.g., cache, logging)?
  • Etc.
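
One way to keep these properties explicit in code is a small “step specification” object. The dataclass below is a hypothetical sketch of how the checklist above could be captured per step; it is not part of any particular framework.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class StepSpec:
    name: str
    inputs: list[str]                                        # what must exist before the step runs
    outputs: list[str]                                       # what the step promises to produce
    quality_check: Optional[Callable[[dict], bool]] = None   # "good enough" assertion
    autonomy: str = "constrained"                            # how much we trust the model here
    next_step: Optional[str] = None                          # trigger / routing target
    max_latency_ms: int = 2000                               # non-functional requirement
    fallback: Optional[str] = None                           # failover target on error
    persist_state: bool = False                              # whether this step's state is saved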

2.2. What are agents?

An LLM agent is a standalone component of an LLM-Native architecture that involves calling an LLM.

It’s an instance of LLM usage with the prompt containing the context. Not all agents are equal — Some will use “tools,” some won’t; some might be used “just once” in the flow, while others can be called recursively or multiple times, carrying the previous input and outputs.

2.2.1. Agents with tools

Some LLM agents can use “tools” — predefined functions for tasks like calculations or web searches. The agent outputs instructions specifying the tool and input, which the application executes, returning the result to the agent.

To understand the concept, let’s look at a simple prompt implementation for tool calling. This can work even with models not natively trained to call tools:

You are an assistant with access to these tools:

- calculate(expression: str) -> str - calculate a mathematical expression
- search(query: str) -> str - search for an item in the inventory

Given an input, respond with a YAML object containing the keys `func` (str) and `arguments` (map), or `message` (str).
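
To make this concrete, here is a minimal Python sketch of the application-side loop that could sit around such a prompt. The `call_llm` helper, the `TOOLS` registry, and the example tools are hypothetical; `yaml.safe_load` comes from the PyYAML package.

import yaml  # PyYAML, used to parse the model's YAML reply

TOOL_PROMPT = """You are an assistant with access to these tools:
- calculate(expression: str) -> str - calculate a mathematical expression
- search(query: str) -> str - search for an item in the inventory
Given an input, respond with a YAML object containing the keys `func` (str) and `arguments` (map), or `message` (str)."""

TOOLS = {
    "calculate": lambda expression: str(eval(expression)),  # demo only; never eval untrusted input
    "search": lambda query: f"(inventory results for '{query}')",
}

def run_tool_agent(user_input: str, call_llm) -> str:
    # One round of the loop: ask the model, dispatch the tool it chose, then ask again with the result.
    reply = yaml.safe_load(call_llm(f"{TOOL_PROMPT}\n\nInput: {user_input}"))
    if isinstance(reply, dict) and "func" in reply:
        result = TOOLS[reply["func"]](**reply["arguments"])
        final = yaml.safe_load(
            call_llm(f"{TOOL_PROMPT}\n\nInput: {user_input}\nTool result: {result}")
        )
        return final.get("message", str(final)) if isinstance(final, dict) else str(final)
    return reply.get("message", str(reply)) if isinstance(reply, dict) else str(reply)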

It’s important to distinguish between agents with tools (henceforth, autonomous agents) and agents whose output can lead to performing an action.

“Autonomous agents are agents that have the ability to generate a way to accomplish the task.”

Autonomous agents are given the right to decide whether to act and which action to take. In contrast, a (non-autonomous) agent simply “processes” our request (e.g., classification); based on that output, our deterministic code performs an action, and the model has zero control over it.

An autonomous agent vs. an agent that triggers an action. (Image by author)
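
A minimal sketch of the non-autonomous pattern, assuming a hypothetical `call_llm` helper: the LLM only classifies the request, and deterministic code decides which action to run.

# Non-autonomous agent: the model classifies; our code controls the action.
ACTIONS = {
    "refund": lambda ticket: f"opening a refund flow for ticket {ticket['id']}",
    "bug": lambda ticket: f"filing a bug report for ticket {ticket['id']}",
    "other": lambda ticket: f"routing ticket {ticket['id']} to a human agent",
}

def handle_ticket(ticket: dict, call_llm) -> str:
    label = call_llm(
        "Classify the support ticket as one of: refund, bug, other.\n"
        f"Ticket: {ticket['text']}\nAnswer with a single word."
    ).strip().lower()
    # The model has zero control over what happens next; deterministic code does.
    return ACTIONS.get(label, ACTIONS["other"])(ticket)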

As we increase the agent’s autonomy in planning and executing tasks, we enhance its decision-making capabilities but potentially reduce control over output quality. Although this might look like a magical way to make the app “smarter” or more “advanced,” it comes at the cost of losing control over quality.

The tradeoffs of an autonomous agent. (Image by author)

Beware the allure of fully autonomous agents. While their architecture might look appealing and simpler, using it for everything (or as the initial PoC) can be very misleading compared to real production cases. Autonomous agents are hard to debug and unpredictable (their response quality is unstable), which makes them unusable for production.

Currently, agents (without implicit guidance) are not very good at planning complex processes and usually skip essential steps. For example, in our “Wikipedia writer” use-case, they’ll just start writing and skip the systematic process. This makes agents (and autonomous agents especially) only as good as the model, or more accurately — only as good as the data they were trained on relative to your task.

Instead of giving the agent (or a swarm of agents) the liberty to do everything end-to-end, try to scope their task to the specific region of your flow/SOP that requires this kind of agility or creativity. This can yield higher-quality results because you enjoy the best of both worlds.

An excellent example is AlphaCodium: By combining a structured flow with different agents (including a novel agent that iteratively writes and tests code), they increased GPT-4 accuracy (pass@5) on CodeContests from 19% to 44%.

AlphaCodium’s LLM Architecture. (Image courtesy of Codium.ai)

While engineering techniques lay the groundwork for implementing our SOP and optimizing LLM-native applications, we must also carefully consider another critical component of the LLM Triangle: the model itself.

3. Model

The model we choose is a critical component of our project’s success—a large one (such as GPT-4 or Claude Opus) might yield better results but be quite costly at scale, while a smaller model might be less “smart” but help with the budget. When thinking about the Model principle, we should aim to identify our constraints and goals and what kind of model can help us fulfill them.

The Model principle. (Image by author)

“Not all LLMs are created equal. Match the model to the mission.”

The truth is that we don’t always need the largest model; it depends on the task. To find the right match, we must have an experimental process and try multiple variations of our solution.

It helps to look at our “inexperienced worker” analogy — a very “smart” worker with many academic credentials probably will succeed in some tasks easily. Still, they might be overqualified for the job, and hiring a “cheaper” candidate will be much more cost-effective.

When considering a model, we should define and compare solutions based on the tradeoffs we are willing to take:

  • Task Complexity — Simpler tasks (such as summarization) are easier to complete with smaller models, while reasoning usually requires larger models.
  • Inference infrastructure — Should it run on the cloud or on edge devices? A large model might be prohibitive on a small phone but tolerable for cloud serving.
  • Pricing — What price can we tolerate? Is it cost-effective considering the business impact and predicted usage?
  • Latency — As the model grows larger, the latency grows as well.
  • Labeled data — Do we have data we can use immediately to enrich the model with examples or relevant information it was not trained on?

In many cases, until you have the “in-house expertise,” it helps to pay a little extra for an experienced worker — the same applies to LLMs.

If you don’t have labeled data, start with a stronger (larger) model, collect data, and then use it to improve a smaller model via few-shot examples or fine-tuning.

3.1. Fine-tuning a model

There are a few aspects you must consider before resorting to fine-tuning a model:

  • Privacy — Your data might include pieces of private information that must be kept out of the model. If your data contains private information, you must anonymize it to avoid legal liabilities.
  • Laws, Compliance, and Data Rights — Training a model can raise legal questions. For example, OpenAI’s terms of use prohibit using generated responses to train models that compete with OpenAI. Another typical example is complying with GDPR, which grants a “right to erasure” that lets a user require the company to remove their information from the system; this raises legal questions about whether the model must be retrained.
  • Update latency — The update latency (data cutoff) is much higher when training a model. Unlike embedding new information via the context (see the “4. Contextual Data” section below), which takes effect immediately, training a model is a long process. As a result, models are retrained less often.
  • Development and operations — Implementing a reproducible, scalable, and monitored fine-tuning pipeline, while continuously evaluating the results, is essential. This complex process requires constant maintenance.
  • Cost — Retraining is expensive due to its complexity and the resource-intensive (GPU) requirements of each training run.

The ability of LLMs to act as in-context learners and the fact that the newer models support a much larger context window simplify our implementation dramatically and can provide excellent results even without fine-tuning. Due to the complexity of fine-tuning, using it as a last resort or skipping it entirely is recommended.

Conversely, fine-tuning models for specific tasks (e.g., structured JSON output) or domain-specific language can be highly efficient; a small, task-specific model can be very effective and much cheaper at inference than a large LLM. Choose your solution wisely, and assess all the relevant considerations before escalating to LLM training.

“Even the most powerful model requires relevant and well-structured contextual data to shine.”

4. Contextual Data

LLMs are in-context learners: by providing task-specific information in the prompt, an LLM agent can perform the task without special training or fine-tuning. This enables us to “teach” it new knowledge or skills easily. When thinking about the Contextual Data principle, we should aim to organize and model the available data and decide how to compose it within our prompt.

The Contextual Data principle. (Image by author)

To compose our context, we include the relevant (contextual) information within the prompt we send to the LLM. There are two kinds of contexts we can use:

  • Embedded contexts — embedded information pieces provided as part of the prompt.
You are the helpful assistant of <name>, a <role> at <company>
  • Attachment contexts — A list of information pieces glued to the beginning/end of the prompt
Summarize the provided emails while keeping a friendly tone.
---

<email_0>
<email_1>

Contexts are usually implemented using a “prompt template” (such as jinja2, mustache, or simply native formatted string literals); this way, we can compose them elegantly while keeping the essence of our prompt:

# Embedded context with an attachment context
prompt = f"""
You are the helpful assistant of {name}. {name} is a {role} at {company}.

Help me write a {tone} response to the attached email.
Always sign your email with:
{signature}

---

{email}
"""

4.1. Few-shot learning

Few-shot learning is a powerful way to “teach” LLMs by example without requiring extensive fine-tuning. Providing a few representative examples in the prompt can guide the model in understanding the desired format, style, or task.

For instance, if we want the LLM to generate email responses, we could include a few examples of well-written responses in the prompt. This helps the model learn the preferred structure and tone.

We can use diverse examples to help the model catch different corner cases or nuances and learn from them. Therefore, it’s essential to include a variety of examples that cover a range of scenarios your application might encounter.
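
As an illustration, here is a minimal sketch of composing such a few-shot prompt in Python; the example requests and responses are made up.

# A handful of hand-picked examples showing the desired structure and tone.
EXAMPLES = [
    {"request": "Customer asks for a copy of their invoice.",
     "response": "Hi! Thanks for reaching out - I've attached a copy of your invoice. Best, Support"},
    {"request": "Customer reports they cannot log in.",
     "response": "Hi! Sorry for the trouble - please try resetting your password, and let us know if it persists. Best, Support"},
]

def build_few_shot_prompt(request: str) -> str:
    shots = "\n\n".join(
        f"Request: {ex['request']}\nResponse: {ex['response']}" for ex in EXAMPLES
    )
    return (
        "Write an email response in the same style as the examples.\n\n"
        f"{shots}\n\nRequest: {request}\nResponse:"
    )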

As your application grows, you may consider implementing “Dynamic few-shot,” which involves programmatically selecting the most relevant examples for each input. While it increases your implementation complexity, it ensures the model receives the most appropriate guidance for each case, significantly improving performance across a wide range of tasks without costly fine-tuning.
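
Below is a toy sketch of dynamic few-shot selection, reusing the EXAMPLES list from the sketch above. Production systems typically rank candidates by embedding similarity; plain word overlap keeps this example dependency-free.

def select_examples(request: str, examples: list[dict], k: int = 2) -> list[dict]:
    # Score each stored example by word overlap with the incoming request.
    query_words = set(request.lower().split())
    def overlap(ex: dict) -> int:
        return len(query_words & set(ex["request"].lower().split()))
    return sorted(examples, key=overlap, reverse=True)[:k]

# Only the most relevant examples are then placed into the few-shot prompt.
selected = select_examples("Customer can't log in to the portal", EXAMPLES, k=1)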

4.2. Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) is a technique for retrieving relevant documents for additional context before generating a response. It’s like giving the LLM a quick peek at specific reference material to help inform its answer. This keeps responses current and factual without needing to retrain the model.

For instance, in a support chatbot application, RAG could pull relevant help-desk wiki pages to inform the LLM’s answers.

This approach helps LLMs stay current and reduces hallucinations by grounding responses in retrieved facts. RAG is particularly handy for tasks that require updated or specialized knowledge without retraining the entire model.

For example, suppose we are building a support chat for our product. In that case, we can use RAG to retrieve a relevant document from our helpdesk wiki, provide it to an LLM agent, and ask it to compose an answer based on the question and the provided document.

There are three key pieces to look at while implementing RAG:

  • Retrieval mechanism — While the traditional implementation of RAG involves retrieving a relevant document using a vector similarity search, sometimes it’s better or cheaper to use simpler methods such as keyword-based search (like BM25).
  • Indexed data structure — Indexing the entire document naively, without preprocessing, may limit the effectiveness of the retrieval process. Sometimes we want to add a data-preparation step, such as preparing a list of questions and answers based on the document.
  • Metadata — Storing relevant metadata allows more efficient referencing and filtering of information (e.g., narrowing down wiki pages to only those related to the user’s specific product inquiry). This extra data layer streamlines the retrieval process.
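
Putting these pieces together, here is a bare-bones RAG sketch. The keyword-overlap retrieval is a stand-in for BM25 or vector search, and `call_llm` plus the WIKI_PAGES data are hypothetical.

# Illustrative helpdesk wiki data
WIKI_PAGES = [
    {"title": "Billing FAQ", "text": "How to update a credit card, download invoices, change plans..."},
    {"title": "Login issues", "text": "Resetting passwords, two-factor authentication, locked accounts..."},
]

def retrieve(question: str, pages: list[dict]) -> dict:
    # Toy retrieval: pick the page with the highest word overlap with the question.
    q = set(question.lower().split())
    return max(pages, key=lambda page: len(q & set(page["text"].lower().split())))

def answer_with_rag(question: str, call_llm) -> str:
    page = retrieve(question, WIKI_PAGES)
    prompt = (
        "Answer the user's question using only the reference document.\n"
        f"Document ({page['title']}):\n{page['text']}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return call_llm(prompt)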

4.3. Providing relevant context

The context information relevant to your agent can vary. Although it may seem beneficial, providing the model (like the “unskilled worker”) with too much information can be overwhelming and irrelevant to the task. In effect, this makes the model attend to irrelevant information (or token connections), which can lead to confusion and hallucinations.

When Gemini 1.5 was released and introduced as an LLM that could process up to 10M tokens, some practitioners questioned whether context size was still an issue. While it’s a fantastic accomplishment, especially for some use cases (such as chatting with PDFs), it’s still limited, especially when reasoning over many separate documents.

Compacting the prompt and providing the LLM agent with only relevant information is crucial. This reduces the processing power the model invests in irrelevant tokens, improves the quality, optimizes the latency, and reduces the cost.

There are many tricks to improve the relevancy of the provided context, most of which relate to how you store and catalog your data.
For RAG applications, it’s handy to add a data-preparation step that shapes the stored information (e.g., deriving questions and answers from the document and providing the LLM agent only with the answer, so it receives a shorter, summarized context) and to apply re-ranking algorithms on top of the retrieved documents to refine the results.
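
Here is a small sketch of that data-preparation idea: index question/answer pairs derived from each document and hand the agent only the short answer. The pairs below are illustrative and would typically be generated offline (for example, by an LLM).

QA_INDEX = [
    {"question": "How do I download an invoice?",
     "answer": "Invoices can be downloaded as PDFs from Billing > History."},
    {"question": "How do I reset my password?",
     "answer": "Use the 'Forgot password' link on the login page; a reset email is sent."},
]

def best_answer(user_question: str) -> str:
    # Match against the prepared questions, but return only the compact answer
    # so the LLM agent receives a short, already-shaped context.
    q = set(user_question.lower().split())
    hit = max(QA_INDEX, key=lambda qa: len(q & set(qa["question"].lower().split())))
    return hit["answer"]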

“Data fuels the engine of LLM-native applications. A strategic design of contextual data unlocks their true potential.”

Conclusion and Implications

The LLM Triangle Principles provide a structured approach to developing high-quality LLM-native applications, addressing the gap between LLMs’ enormous potential and real-world implementation challenges. Developers can create more reliable and effective LLM-powered solutions by focusing on 3+1 key principles—the Model, Engineering Techniques, and Contextual Data—all guided by a well-defined SOP.

The LLM Triangle Principles. (Image by author)

Key takeaways

  1. Start with a clear SOP: Model your expert’s cognitive process to create a step-by-step guide for your LLM application. Use it as a guide while thinking of the other principles.
  2. Choose the right model: Balance capabilities with cost, and consider starting with larger models before potentially moving to smaller, fine-tuned ones.
  3. Leverage engineering techniques: Implement LLM-native architectures and use agents strategically to optimize performance and maintain control. Experiment with different prompt techniques to find the most effective prompt for your case.
  4. Provide relevant context: Use in-context learning, including RAG, when appropriate, but be cautious of overwhelming the model with irrelevant information.
  5. Iterate and experiment: Finding the right solution often requires testing and refining your work. I recommend reading and implementing the “Building LLM Apps: A Clear Step-By-Step Guide” tips for a detailed LLM-Native development process guide.

By applying the LLM Triangle Principles, organizations can move beyond a simple proof-of-concept and develop robust, production-ready LLM applications that truly harness the power of this transformative technology.

If you find this whitepaper helpful, please give it a few claps 👏 on Medium and share it with your fellow AI enthusiasts. Your support means the world to me! 🌍

Let’s keep the conversation going — feel free to reach out via email or connect on LinkedIn 🤝

Special thanks to Gal Peretz, Gad Benram, Liron Izhaki Allerhand, Itamar Friedman, Lee Twito, Ofir Ziv, Philip Tannor, Yair Livne and Shai Alon for insights, feedback, and editing notes.


