RouteLLM: Learning to Route LLMs with Preference Data
Abstract
Large language models (LLMs) exhibit impressive capabilities across a wide range
of tasks, yet the choice of which model to use often involves a trade-off between
performance and cost. More powerful models, though effective, come with higher
expenses, while less capable models are more cost-effective. To address this
dilemma, we propose several efficient router models that dynamically select be-
tween a stronger and a weaker LLM during inference, aiming to optimize the
balance between cost and response quality. We develop a training framework
for these routers leveraging human preference data and data augmentation tech-
niques to enhance performance. Our evaluation on widely-recognized benchmarks
shows that our approach significantly reduces costs—by over 2 times in certain
cases—without compromising the quality of responses. Interestingly, our router
models also demonstrate significant transfer learning capabilities, maintaining
their performance even when the strong and weak models are changed at test
time. This highlights the potential of these routers to provide a cost-effective yet
high-performance solution for deploying LLMs.
1 Introduction
Recent advances in large language models (LLMs) have demonstrated remarkable capabilities across
a wide range of natural language tasks. From open-ended conversation and question answering
to text summarization and code generation, LLMs have showcased an impressive level of fluency
and understanding [1, 8]. This rapid progress has been enabled by a combination of architectural
innovations, such as the Transformer architecture [27], as well as improvements in scaling up data
and training infrastructure [7, 23].
However, not all LLMs are created equal—there exists wide variation in the costs and sizes of LLMs,
which can range in size from one billion to hundreds of billions of parameters. LLMs also differ in
terms of the data they are trained on, which in turn leads to variations in the strengths, weaknesses,
and capabilities of different models. Broadly speaking, larger models tend to be more capable but
come at a higher cost, while smaller models tend to be less capable but cheaper to serve.
This heterogeneous landscape presents a dilemma in the practical deployment of LLMs in real-world
applications. While routing all user queries to the largest, most capable model ensures high-quality
results, it is prohibitively expensive. Conversely, routing queries to smaller models can save costs—up
to more than 50x (e.g., Llama-3-70b at $1 vs. GPT-4-0613 at $60, or Claude-3 Haiku at $1.25 vs. Opus at $75, per one million output tokens)—but may result in
lower quality responses, as the smaller model may not be able to handle complex queries effectively.
To this end, LLM routing is a promising solution to this problem, whereby each user query is first
processed by a router model, before deciding which LLM to route the query to. This can potentially
route easier queries to smaller models, and more difficult queries to larger models, optimizing the
quality of model responses while minimizing cost. However, optimal LLM routing—defined as
achieving the highest quality given a cost target or minimizing cost given a quality target—is a
challenging problem. A robust router model needs to infer the intent, complexity, and domain of an
incoming query as well as understand candidate models’ capabilities to route the query to the most
appropriate model. Furthermore, the router model needs to be economical, fast, and adaptive to the
evolving model landscape, where new models with improved capabilities are continually introduced.
In this work, we present a principled framework for query routing between LLMs. Our setup involves
learning to route between a stronger model and a weaker model as seen in Figure 1. Our objective
is to minimize costs while achieving a specific performance target, such as 90% of the stronger
model’s performance, by intelligently routing simpler queries to the weaker model and reserving
more complex queries for the stronger model. We develop a training framework for router systems
utilizing human preference data and data augmentation techniques. We evaluate our router models on
widely recognized benchmarks, such as MMLU [15] and MT Bench [30], and demonstrate that our
framework can significantly reduce costs—by over 2 times—without substantially compromising
response quality.
To summarize, we make the following contributions:
• We formulate the LLM routing problem to explore the trade-off between cost and response quality.
• We propose a router training framework based on human preference data and augmentation
techniques, demonstrating over 2x cost saving on widely used benchmarks.
• We open-source the code and preference data used to train our routers at https://github.com/lm-sys/RouteLLM.
1.1 Related Work
Several recent studies have explored optimizing the cost and performance trade-offs in deploying
large language models (LLMs). LLM-BLENDER [17] employs an ensemble framework that calls
multiple LLMs at inference and uses a router model to select the best response. Frugal-GPT [9]
employs an LLM cascade, sequentially querying LLMs until a reliable response is found. Both
approaches’ inference cost grows with the number of models involved. Our approach, by contrast,
routes each query to a single LLM. A closely related study, Hybrid-LLM [13], shares similarities
with our framework but differs in three key ways: it uses synthetic preference labels derived via
BARTScore [29], relies on a single BERT-based router architecture, and limits evaluation to in-
domain generalization. In contrast, our work leverages human preference labels from Chatbot Arena
[10], explores several router architectures, and demonstrates that augmenting the dataset results
in significant performance improvements across all router architectures. Additionally, we emphasize
out-of-domain generalization by evaluating on multiple public benchmarks.
2 LLM Routing
2.1 Problem Formulation
Our routing setup involves two components: 1) A win prediction model that estimates the probability
P(win_Mstrong | q) that the strong model's response is preferred for a query q. By learning the winning
probability on preference data, we capture the strengths and weaknesses of both model classes on
various kinds of queries. In Section 3.2, we propose several approaches for parameterizing the win
prediction model.
2) A cost threshold α ∈ [0, 1], which converts the winning probability into a routing decision between
Mstrong and Mweak. Given a query q, the routing decision is formulated as:

$$R^{\alpha}_{\text{bin}}(q) = \begin{cases} 0 \ (\text{i.e., } M_{\text{weak}}) & \text{if } P(\text{win}_{M_{\text{strong}}} \mid q) < \alpha, \\ 1 \ (\text{i.e., } M_{\text{strong}}) & \text{otherwise.} \end{cases} \tag{2}$$
The threshold α controls the quality/cost trade-off: a higher threshold imposes a stricter cost constraint,
reducing expenses but potentially compromising quality.
Finally, we denote the router's response to a query q as $M_{R^{\alpha}_{\text{bin}}(q)}(q)$, i.e., the response generated by the model selected by the router.
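For concreteness, the following is a minimal Python sketch of the routing rule in Eq. (2). The win-probability callable and the toy stand-in below are placeholders for the learned routers of Section 3.2, not part of the released implementation.

```python
# Minimal sketch of the threshold-based routing rule in Eq. (2).
# `win_prob` is assumed to be any callable returning P(win_strong | query),
# e.g. one of the routers described in Section 3.2; model names are placeholders.

def route(query: str, win_prob, alpha: float) -> str:
    """Return the model that should answer `query` for cost threshold `alpha`."""
    p_strong = win_prob(query)  # estimated P(win_{M_strong} | q)
    return "strong" if p_strong >= alpha else "weak"

if __name__ == "__main__":
    # Toy stand-in for a real router: longer prompts are treated as "harder".
    dummy_win_prob = lambda q: min(1.0, len(q) / 200)
    print(route("What is 2 + 2?", dummy_win_prob, alpha=0.5))                      # -> "weak"
    print(route("Prove the Cauchy-Schwarz inequality." * 5, dummy_win_prob, 0.5))  # -> "strong"
```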
2.2 Metrics
In this section, we define evaluation metrics that capture the trade-off between cost and quality in
the LLM routing problem. We start with metrics that independently assess the quality and cost
efficiency of a given R^α_bin, then introduce two compounded metrics which we use in our experimental
evaluations.
For cost efficiency, we calculate the percentage of calls to the strong model:

$$c(R^{\alpha}_{\text{bin}}) = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\{R^{\alpha}_{\text{bin}}(q) = 1\}, \tag{3}$$

since Mstrong models incur significantly higher costs than Mweak models.
For quality, we measure the average response quality on an evaluation set Q:

$$r(R^{\alpha}_{\text{bin}}) = \frac{1}{|Q|} \sum_{q \in Q} \delta\!\left(M_{R^{\alpha}_{\text{bin}}(q)}(q)\right), \tag{4}$$

where $\delta(M_{R^{\alpha}_{\text{bin}}(q)}(q))$ represents a numerical score of the router's response to q. This score can be the
result of a predefined metric measuring the correctness of the response on golden-labeled datasets
(e.g., MMLU) or a numerical label (e.g., 1-5 or 1-10) where higher values indicate better quality.
Given that the performance of R^α_bin lies between the weak and strong models' performance, we
quantify the router's performance relative to the performance gap between both models. We define
the overall performance gain of R^α_bin with the performance gap recovered (PGR):

$$PGR(R^{\alpha}_{\text{bin}}) = \frac{r(R^{\alpha}_{\text{bin}}) - r(M_{\text{weak}})}{r(M_{\text{strong}}) - r(M_{\text{weak}})}. \tag{5}$$
Neither of these metrics alone is sufficient, as they do not capture the quality-cost trade-off in routing.
For instance, a trivial router that sends every query to the strong model achieves a perfect PGR = 1
without any cost reduction. Therefore, we compute a call-performance graph for a router R_bin by
varying the threshold values α. We define the average performance gap recovered (APGR) as
an overall measure of how well the router can recover the performance gap under different cost
constraints:

$$APGR(R_{\text{bin}}) = \int_{0}^{1} PGR(R^{\alpha}_{\text{bin}}) \; d\!\left(c(R^{\alpha}_{\text{bin}})\right). \tag{6}$$
In Figure 1-(right), APGR is represented by the area between the router’s performance curve and the
weak model's performance. Empirically, we discretize the percentage-of-calls interval [0%, 100%]
into {c_i}_{i∈[10]}. For each c_i, we determine the cutoff threshold α_i that meets the cost constraint. We
approximate APGR using the following formula:

$$APGR(R_{\text{bin}}) \approx \frac{1}{10} \sum_{i=1}^{10} PGR(R^{\alpha_i}_{\text{bin}}). \tag{7}$$
In many real-world applications, it is important to quantify the cost required to achieve a certain level
of performance. Therefore, we define a second metric which we call the call-performance threshold
(CPT). Given a desired router performance (measured as a PGR of x%), CPT(x%) refers to the
minimum percentage of calls to the strong model required to obtain the desired PGR. In Figure
1-(right), the dotted green line denotes CPT(50%), i.e. the percentage of calls to GPT-4 required to
achieve the desired performance of 50% PGR; here, CPT(50%) ≈ 37%.
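As an illustration of how these metrics are computed in practice, below is a small numpy sketch of PGR (Eq. 5), the discrete APGR approximation (Eq. 7), and CPT, assuming arrays of call fractions c(R^α_bin) and average qualities r(R^α_bin) recorded while sweeping the threshold α. The array names and the choice of taking the best budget-feasible quality per bin are illustrative assumptions, not the paper's exact evaluation code.

```python
import numpy as np

def pgr(r_router, r_weak, r_strong):
    """Performance gap recovered (Eq. 5)."""
    return (r_router - r_weak) / (r_strong - r_weak)

def apgr(call_fracs, r_router, r_weak, r_strong, num_bins=10):
    """Discrete APGR approximation (Eq. 7): average PGR over evenly spaced call budgets."""
    call_fracs, r_router = np.asarray(call_fracs), np.asarray(r_router)
    budgets = np.linspace(0.1, 1.0, num_bins)          # c_1, ..., c_10
    vals = []
    for c in budgets:
        feasible = r_router[call_fracs <= c]           # threshold settings within the call budget
        best = feasible.max() if feasible.size else r_weak
        vals.append(pgr(best, r_weak, r_strong))
    return float(np.mean(vals))

def cpt(target_pgr, call_fracs, r_router, r_weak, r_strong):
    """CPT(x%): minimum fraction of strong-model calls reaching `target_pgr`."""
    pgrs = pgr(np.asarray(r_router), r_weak, r_strong)
    feasible = np.asarray(call_fracs)[pgrs >= target_pgr]
    return float(feasible.min()) if feasible.size else 1.0
```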
3 Methodology
3.1 Preference Data
We start by describing how we can obtain the preference data to train the routing function. We
primarily use 80k battles from the online Chatbot Arena platform [10]. On this platform, users
interact with a chatbot interface and submit prompts of their choice. Upon submission, they receive
responses from two anonymous models and vote for a winning model or a tie. The resulting dataset,
denoted as Darena = {(q, a_i, a_j, l_{i,j}) | q ∈ Q, a_i, a_j ∈ A, l_{i,j} ∈ L}, consists of user queries,
answers from two models M_i, M_j, and a pairwise comparison label based on human judgment.
A major issue with using the raw Chatbot Arena data is label sparsity. For instance, the percentage of
comparison labels between any two models, on average, is less than 0.1%. Therefore, we derive the
preference data for training the router as follows: First, we reduce label sparsity by clustering models
in Darena into 10 different tiers (see Appendix A), using each model’s Elo score on the Chatbot Arena
leaderboard and aiming to minimize the variation within each tier via dynamic programming. We
choose the models on the first and second tiers to represent strong models Mstrong , and the models on
the third tier to represent weak models Mweak . While we primarily train on battles across these tiers,
we also leverage battles involving other model tiers to regularize our learning methods. Crucially,
we omit the actual model responses in Darena , retaining only the model identities, i.e. e ∼ Dpref is
e = (q, Mi , Mj , li,j ). The comparison label li,j still provides insight into the relative capabilities of
the LLMs Mi and Mj on various types and complexity levels of the query q.
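To illustrate the shape of the derived preference data, the following sketch maps raw arena battles to tier-level records, keeping only the query, the model (tier) identities, and the comparison label. The field names, the simple Elo-threshold tiering (the paper clusters tiers via dynamic programming), and the inputs are assumptions for illustration, not the released pipeline.

```python
# Illustrative sketch of deriving preference records from arena battles:
# model responses are dropped; only (query, model identities, label) is kept.
# `battles`, `elo`, and `boundaries` are hypothetical inputs.

def to_tier(model: str, elo: dict, boundaries: list) -> int:
    """Map a model to a tier index by Elo score (boundaries sorted high -> low)."""
    for tier, cutoff in enumerate(boundaries):
        if elo[model] >= cutoff:
            return tier
    return len(boundaries)

def build_pref_data(battles, elo, boundaries):
    records = []
    for b in battles:
        records.append({
            "query": b["prompt"],
            "tier_a": to_tier(b["model_a"], elo, boundaries),
            "tier_b": to_tier(b["model_b"], elo, boundaries),
            "label": b["winner"],   # "model_a", "model_b", or "tie"
        })
    return records
```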
3.2 Routing Approaches
We now detail our methods for learning the win prediction model Pθ(win_Mstrong | q) from preference
data Dpref . We denote a sample (q, Mi , Mj , li,j ) ∼ Dpref as e = (q, Mw , Ml ), where Mw and Ml
refer to the winning and losing model respectively.
Similarity-weighted (SW) ranking We adopt a Bradley-Terry (BT) model [6] similar to [10].
Given a user query q, we compute a weight ω_i = γ^{1+S(q,q_i)} for each query q_i in the train set based
on its similarity to q:

$$S(q, q_i) = \frac{\epsilon \cdot \epsilon_i}{\|\epsilon\| \, \|\epsilon_i\| \cdot \max_{1 \le s \le |\mathcal{D}_{\text{pref}}|} \frac{\epsilon_i \cdot \epsilon_s}{\|\epsilon_i\| \, \|\epsilon_s\|}}, \tag{8}$$

where ε and ε_i denote the embeddings of q and q_i respectively.
We fit per-model Bradley-Terry coefficients ξ on the preference data by minimizing a weighted objective
with sample weights ω_i, where ℓ is a binary cross-entropy loss. The resulting BT coefficients allow us to estimate the win
probability as P(win_Mw | q) = 1 / (1 + e^{ξ_l − ξ_w}). For this router model, there is no training required and the
solving is performed at inference time.
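A numpy sketch of the similarity weighting in Eq. (8) and the resulting win-probability estimate follows. The embedding arrays, the value of γ, and the exclusion of s = i from the max are assumptions; fitting the weighted BT coefficients ξ (e.g., with a weighted logistic regression) is omitted.

```python
import numpy as np

# Hedged sketch of the similarity-weighted ranking router. `train_emb` holds
# embeddings of the training queries, `q_emb` the embedding of the incoming
# query; gamma is a placeholder value. The BT coefficients xi_w, xi_l are
# assumed to come from solving the weighted objective and are not fit here.

def similarity_weights(q_emb: np.ndarray, train_emb: np.ndarray, gamma: float = 10.0):
    """Per-sample weights omega_i = gamma^(1 + S(q, q_i)), with S as in Eq. (8)."""
    unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    q, E = unit(q_emb), unit(train_emb)
    cos_q = E @ q                       # cosine similarity of q to each q_i
    cos_all = E @ E.T                   # pairwise cosines among training queries
    np.fill_diagonal(cos_all, -np.inf)  # assumption: exclude s = i from the max
    S = cos_q / cos_all.max(axis=1)
    return gamma ** (1.0 + S)

def bt_win_probability(xi_w: float, xi_l: float) -> float:
    """P(win of the model with coefficient xi_w) = sigma(xi_w - xi_l)."""
    return 1.0 / (1.0 + np.exp(xi_l - xi_w))
```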
Matrix factorization Inspired by matrix factorization models [18, 25] in recommendation systems
to capture the low-rank structure of user-item interactions, we leverage this approach for training
on preference data. The key is to uncover a hidden scoring function s : M × Q → R. The score
s(Mw , q) should represent the quality of the model Mw ’s answer to the query q, i.e. if a model Mw
is better than Ml on a query q, then s(Mw , q) > s(Ml , q). We enforce this relationship by modeling
the win probability with a BT relationship [6]:
$$P(\text{win}_{M_w} \mid q) = \sigma\!\left(s(M_w, q) - s(M_l, q)\right), \tag{10}$$

which we optimize on preference data. We model the scoring function s as a bilinear function of the
model and the query: we embed the model identity M into a d_m-dimensional vector v_m and the query
into a d_q-dimensional vector v_q, and compute

$$s(M, q) = w_2^{\top}\!\left(v_m \odot (W_1^{\top} v_q + b)\right). \tag{11}$$

Here, ⊙ denotes the Hadamard product, W_1 ∈ R^{d_q × d_m} and b ∈ R^{d_m} form the projection layer that
aligns the dimension of v_q with v_m, and w_2 ∈ R^{d_m} is the linear regression layer that produces the final
scalar. This method essentially learns a matrix factorization of the score matrix on the set Q × M.
We train the model on an 8GB GPU for ≈ 10 epochs, using batch size 64 and the Adam optimizer
with learning rate 3 × 10^{-4} and weight decay 1 × 10^{-5}.
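A minimal PyTorch sketch of the scoring function in Eq. (11) and the BT win probability in Eq. (10) is shown below. The dimensions, class name, and external query embedder (the paper uses text-embedding-3-small) are placeholders.

```python
import torch
import torch.nn as nn

class MFRouter(nn.Module):
    """Matrix-factorization router: bilinear score s(M, q) and BT win probability."""

    def __init__(self, num_models: int, d_model: int = 128, d_query: int = 1536):
        super().__init__()
        self.model_emb = nn.Embedding(num_models, d_model)   # v_m
        self.proj = nn.Linear(d_query, d_model)              # W_1, b
        self.w2 = nn.Linear(d_model, 1, bias=False)          # w_2

    def score(self, model_id: torch.Tensor, q_emb: torch.Tensor) -> torch.Tensor:
        """s(M, q) = w_2^T (v_m ⊙ (W_1^T v_q + b)), Eq. (11)."""
        v_m = self.model_emb(model_id)
        return self.w2(v_m * self.proj(q_emb)).squeeze(-1)

    def forward(self, winner_id, loser_id, q_emb):
        """P(win_{M_w} | q) = sigma(s(M_w, q) - s(M_l, q)), Eq. (10)."""
        return torch.sigmoid(self.score(winner_id, q_emb) - self.score(loser_id, q_emb))

# Training (sketch): since the first model is always the winner, the BCE target is 1:
# loss = torch.nn.functional.binary_cross_entropy(router(w_ids, l_ids, q_embs),
#                                                 torch.ones(len(q_embs)))
```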
BERT classifier Here we explore using a standard text classification method with a higher number
of parameters compared to previous methods. We use a BERT-base architecture [12], to give a
contextualized embedding of the user query, and define win probability as:
$$P_{\theta}(\text{win}_{M_w} \mid q) = \sigma(W h_{\text{CLS}} + b), \tag{12}$$

where h_CLS is the embedding of the special classification token ([CLS]) summarizing the input
query q, W and b are the parameters of a logistic regression head, and σ is the sigmoid activation. We
perform full-parameter fine-tuning on Dpref . We train the model on 2xL4 24GB GPUs for ∼ 2000
steps using a batch size of 16, maximum sequence length of 512, learning rate of 1 × 10−5 and
weight decay of 0.01.
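The sketch below illustrates the architecture of Eq. (12) with Hugging Face Transformers: a sigmoid head on the [CLS] embedding, fine-tuned end-to-end. The checkpoint name and the absence of a training loop are simplifying assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertRouter(nn.Module):
    """BERT-based router: sigmoid head on the [CLS] embedding (Eq. 12)."""

    def __init__(self, name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)  # W, b

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h_cls = out.last_hidden_state[:, 0]                  # [CLS] token embedding
        return torch.sigmoid(self.head(h_cls)).squeeze(-1)   # predicted win probability

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
router = BertRouter()
batch = tokenizer(["Write a haiku about routing."], truncation=True,
                  max_length=512, padding=True, return_tensors="pt")
p_win = router(batch["input_ids"], batch["attention_mask"])  # thresholded against alpha at inference
```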
Causal LLM classifier We finally expand the capacity of our router by parameterizing it with
Llama 3 8B [2]. We use an instruction-following paradigm [28], i.e. we provide as input an instruction
prompt containing the user query, and output the win probability in a next-token prediction fashion –
instead of using a separate classification head. Notably, we append the comparison labels as additional
tokens to the vocabulary, and compute the win probability as softmax over the label classes L. We
train the model on 8xA100 80GB GPUs for ∼ 2000 steps using a batch size of 8, maximum sequence
length of 2048, and a learning rate of 1 × 10−6 .
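The following sketch illustrates this label-token formulation at inference time: the comparison labels are added to the vocabulary and the win probability is a softmax over the label tokens at the next-token position. The prompt template, label-token names, and checkpoint are placeholders, and the model is assumed to already be fine-tuned as described.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder label tokens and checkpoint; not the exact ones used in the paper.
LABELS = ["<|weak_wins|>", "<|strong_wins|>", "<|tie|>"]

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tok.add_tokens(LABELS)                    # append label tokens to the vocabulary
model.resize_token_embeddings(len(tok))   # grow the embedding / output layers
label_ids = tok.convert_tokens_to_ids(LABELS)

@torch.no_grad()
def win_probability(query: str) -> float:
    prompt = f"Which model should answer the following query?\nQuery: {query}\nAnswer:"
    inputs = tok(prompt, return_tensors="pt")
    logits = model(**inputs).logits[0, -1]            # next-token logits
    probs = torch.softmax(logits[label_ids], dim=-1)  # softmax over label classes only
    return probs[LABELS.index("<|strong_wins|>")].item()
```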
4 Experiments
Training data: As mentioned in Sec. 3.1, we primarily use the 80K Chatbot Arena battles for training
our models, but hold out 5k samples for validation. We prune all prompt samples shorter than 16
characters, resulting in 65k pairwise comparisons between 64 different models. These consist of
conversations from over 100 languages, with the bulk of the conversations (81%) in English, followed
by Chinese (3.1%), and Russian (2.2%). We assign models to 10 classes to reduce sparsity of
comparison labels. As discussed in Sec. 3.1.1, we further augment our training data with either:
1) Dgold, golden-labeled data created from the MMLU validation split, or 2) Djudge, GPT-4-as-a-judge
labeled chat data.
Evaluation benchmarks: We evaluate our routers on three widely-used academic benchmarks:
MMLU [15] consisting of 14,042 questions across 57 subjects, MT Bench [30] with 160 open-ended
questions using LLM-as-a-judge, and GSM8K [11] with over 1,000 grade school math problems.
Additionally, we conduct a cross-contamination check between our evaluation and training datasets,
and report uncontaminated results below. We present results on public benchmarks to understand the
out-of-domain generalization of routers.
Routers: For both the matrix factorization router and the similarity-weighted ranking router, we use
OpenAI’s embedding model text-embedding-3-small to embed the input query. We perform full-
parameter finetuning on both BERT and Causal LLM, and use the validation set for model selection.
We opt to use gpt-4-1106-preview [20] as a representative model in Mstrong , and Mixtral 8x7B
[16] as a representative model in Mweak , to concretely evaluate router performance. We also use a
random router that routes queries randomly under a cost constraint as a baseline.
4.1 Results
Table 1 displays our router performance on MT Bench. For routers trained on the Arena dataset, we
observe strong performance for both matrix factorization and similarity-weighted ranking, with both
routers performing significantly better than the random router across all metrics. Notably, matrix
factorization requires half the number of GPT-4 calls as compared to random to achieve a PGR of
50%. However, our BERT and causal LLM classifiers perform close to random when trained on the
Arena dataset, which we attribute to high-capacity approaches performing worse in a low-data regime.
Augmenting the preference data using a GPT-4 judge leads to notable improvements across all routers.
The BERT and causal LLM routers now perform much better than the random baseline, with the
BERT classifier achieving an APGR improvement of over 50% as compared to random. When trained
on this augmented dataset, matrix factorization is the best-performing router as its CPT(80%) is
nearly halved, requiring 50% fewer GPT-4 calls as compared to random.
We also compare the MT Bench performance of our routers against existing routing systems in
Appendix E, demonstrating the substantial improvements that our routers achieve over other available
systems.
Table 2: 5-shot MMLU results for our routers. Note that the score for CPT at 50% (75) is 92% that of
GPT-4 performance (81). Routers trained only on Darena perform poorly due to most questions being
out-of-distribution, but dataset augmentation with Dgold is highly effective, leading to significant
improvement in router performance even with a small number of samples.
On MMLU (Table 2), all routers perform poorly at the level of the random router when trained
only on Arena dataset, which we attribute to most MMLU questions being out-of-distribution (see
Section 4.2). However, augmenting the training dataset with golden-label data from the MMLU
validation split leads to significant performance improvements on MMLU across all routers, with all
routers requiring approximately 20% fewer GPT-4 calls than random for CPT(50%). Importantly, this
is despite the fact that the additional golden-labeled dataset of approximately 1500 samples represents
less than 2% of the overall training data, demonstrating the effectiveness of dataset augmentation
even when the number of samples is small.
Finally, on GSM8K (Table 3), we observe that similar to MMLU, the performance of all routers
trained only on the Arena dataset is close to random. However, training our routers on the dataset
augmented with synthetic data from an LLM judge improves performance substantially, with all
routers going from an APGR worse than random to an APGR greater than random. When trained on
this augmented dataset, the causal LLM classifier performs the best out of all routers, requiring 17%
fewer GPT-4 calls than random to achieve CPT(50%) and CPT(80%).
4.2 Benchmark-Dataset Similarity
We attribute the difference in the performance of routers trained on the same dataset across different
benchmarks to the differing distributions of evaluation data and training data. For each benchmark-
dataset pair, we compute a benchmark-dataset similarity score in Table 4 indicating how well-
represented evaluation data is in the training data, described in detail in Appendix C.
A higher benchmark-dataset similarity score is correlated with stronger performance on that bench-
mark for routers trained using the corresponding dataset, as shown in Section 4.1. Dataset augmenta-
tion, be it using golden-labeled datasets or LLM-judge-labeled datasets, shifts the overall distribution
of the preference data to be more in line with the benchmarks and increases the benchmark-dataset
similarity score, which translates into performance improvements. This similarity score is also useful
for understanding the relative performance of routers across different benchmarks: on the Arena
dataset, the similarity score between MT Bench and all datasets is noticeably greater than for other
benchmarks, which we believe explains the relatively stronger router performance on MT Bench as
compared to GSM8K and MMLU. Benchmark-dataset similarity scores are a promising direction to
systematically improve router performance in real-world use cases given knowledge about the query
distribution.
4.3 Generalizing to Other Model Pairs
We pick gpt-4-1106-preview [20] and Mixtral 8x7B [16] as representative strong and weak
models for the above experiments. However, to demonstrate the generalizability of our framework to
different model pairs, we report in this section our router performance on MT Bench when routed
between Claude 3 Opus [5] and Llama 3 8B [4]. Importantly, we use the same routers without any
retraining, and only replace the strong model and weak model routed to. These two models are also
not present in our training data.
Again, we observe strong results across all existing routers on MT Bench even when the model pair
is replaced. Performance across all routers is comparable to that with the original model pair. Results for
both the new model pair and the original model pair are still significantly stronger than random, with
our routers requiring up to 30% fewer GPT-4 calls than random to achieve CPT(80%). These results
suggest that our routers have learned some common characteristics of problems that can distinguish
between strong and weak models, which generalize to new strong and weak model pairs without
additional training.
4.4 Cost Analysis
We estimate the average cost of using GPT-4 and Mixtral 8x7B to be $24.7 per million tokens and
$0.24 per million tokens respectively (see details in Appendix D). In Table 6, we show results of
quantifying the cost savings achieved by our approach. Because the cost of GPT-4 is the dominant
factor in our analysis, we calculate savings as the inverse of the ratio of GPT-4 calls utilized by our
top-performing router relative to the random baseline. Our routers achieve cost savings of up to 3.66x,
demonstrating that routing can significantly reduce cost while maintaining response quality.
4.5 Routing Overhead
A concern with LLM routing is the overhead of routing as compared to using a single model. There-
fore, we measure and report the overhead of our routers in Table 7 to demonstrate their practicality
using randomly-sampled conversations from Chatbot Arena. For routers that require GPUs, namely
matrix factorization and the classifier methods, we utilize Google Cloud’s g2-standard-4 VM
containing a single NVIDIA L4 GPU. For similarity-weighted ranking, we use Google Cloud’s
CPU-only n2-standard-8 VM. Our GPU-based routers are currently much more efficient than our
CPU-based routers, but we note that there is still much room for improvement in optimizing the
throughput of our routers. However, even SW ranking, our most expensive router, introduces an
additional cost of no more than 0.4% when compared to GPT-4 generation, as detailed in Appendix
D.
5 Conclusion
We demonstrate strong routing performance by our routers across a variety of benchmarks, spanning
open-ended question answering, humanities, and math problems. By intelligently routing queries
between a strong model and weak model, our routers are able to achieve significant cost savings
while maintaining a high response quality.
Our results also highlight the effectiveness of dataset augmentation in improving router performance.
While training routers solely on the Arena dataset results in poor performance on MMLU and
GSM8K, augmenting the training data with an LLM judge or in-domain data enables our routers
to outperform the random baseline across all benchmarks. The largest performance gains occur
when the training data closely resembles the evaluation data, as indicated by the benchmark-dataset
similarity score. We believe that this framework provides a clear and scalable path to enhancing
routing performance for specific use cases.
While our work demonstrates strong results, there are a few limitations. First, although we evaluate
on a diverse set of benchmarks, real-world applications may have distributions that differ substantially
from these benchmarks. To this end, we show that users can collect a small amount of in-domain
data to improve performance for their specific use cases via dataset augmentation. Next, while we
focus on the two-model routing setting in this work, a promising future direction would be to extend
this work to multiple models. Finally, in our experiments, we observe that performance between
different routers trained on the same dataset can vary widely on the same benchmark without a clear
explanation—we leave further investigation into this for future work.
Acknowledgments and Disclosure of Funding
We are grateful to Kourosh Hakhamaneshi, Goku Mohandas, Arman Zharmagambetov and Anastasiia
Razdaibiedina for their valuable discussions and feedback on this work. This work is in part supported
by gifts from Accenture, AMD, Anyscale, Google, IBM, Intel, Microsoft, Mohamed Bin Zayed
University of Artificial Intelligence, Samsung SDS, SAP, Uber, and VMware.
References
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4
technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Meta AI. Introducing meta llama 3: The most capable openly available llm to date, 2024.
Accessed: 2024-05-21.
[3] Unify AI. Unify ai, 2024. Accessed: 2024-06-30.
[4] AI@Meta. Llama 3 model card, 2024. Accessed: 2024-06-26.
[5] Anthropic. Introducing the next generation of Claude, 2024. Accessed: 2024-05-22.
[6] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the
method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
[7] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[8] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece
Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general
intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
[9] Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models
while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023.
[10] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li,
Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica.
Chatbot arena: An open platform for evaluating llms by human preference, 2024.
[11] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to
solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
2018.
[13] Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Rühle, Laks
V. S. Lakshmanan, and Ahmed Hassan Awadallah. Hybrid LLM: Cost-efficient and quality-
aware query routing. In The Twelfth International Conference on Learning Representations,
2024.
[14] Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos
Guestrin, Percy S Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for
methods that learn from human feedback. Advances in Neural Information Processing Systems,
36, 2024.
[15] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. In International
Conference on Learning Representations, 2020.
[16] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris
Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand,
et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
[17] Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. Llm-blender: Ensembling large language
models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561, 2023.
[18] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recom-
mender systems. Computer, 42(8):30–37, 2009.
[19] Martian. Martian router, 2024. Accessed: 2024-06-30.
[20] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[21] OpenAI. Openai pricing, 2024. Accessed: 2024-06-30.
[22] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin,
Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to
follow instructions with human feedback. Advances in neural information processing systems,
35:27730–27744, 2022.
[23] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[24] Together.AI. Together.ai pricing, 2024. Accessed: 2024-06-30.
[25] Andreas Töscher, Michael Jahrer, and Robert M Bell. The bigchaos solution to the netflix grand
prize. Netflix prize documentation, pages 1–52, 2009.
[26] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open
foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
processing systems, 30, 2017.
[28] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan
Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv
preprint arXiv:2109.01652, 2021.
[29] Weizhe Yuan, Graham Neubig, and Pengfei Liu. Bartscore: Evaluating generated text as text
generation. Advances in Neural Information Processing Systems, 34:27263–27277, 2021.
[30] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang,
Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica.
Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on
Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
[31] Banghua Zhu, Evan Frick, Tianhao Wu, Hanlin Zhu, and Jiantao Jiao. Starling-7b: Improving
llm helpfulness & harmlessness with rlaif, November 2023.
A Arena Model Tiers
Tier 0: gpt-4-0125-preview, gpt-4-1106-preview
Tier 1: gpt-4-0314, gpt-4-0613, mistral-medium, claude-1, qwen1.5-72b-chat
Tier 2: claude-2.0, mixtral-8x7b-instruct-v0.1, claude-2.1, gemini-pro-dev-api, gpt-3.5-turbo-0314, gpt-3.5-turbo-0613, gemini-pro, gpt-3.5-turbo-0125, claude-instant-1, yi-34b-chat, starling-lm-7b-alpha, wizardlm-70b, vicuna-33b, tulu-2-dpo-70b, nous-hermes-2-mixtral-8x7b-dpo, llama-2-70b-chat, openchat-3.5
Tier 3: llama2-70b-steerlm-chat, pplx-70b-online, dolphin-2.2.1-mistral-7b, gpt-3.5-turbo-1106, deepseek-llm-67b-chat, openhermes-2.5-mistral-7b, openchat-3.5-0106, wizardlm-13b, mistral-7b-instruct-v0.2, solar-10.7b-instruct-v1.0, zephyr-7b-beta, zephyr-7b-alpha, codellama-34b-instruct, mpt-30b-chat, llama-2-13b-chat, vicuna-13b, qwen1.5-7b-chat, pplx-7b-online, falcon-180b-chat, llama-2-7b-chat, guanaco-33b, qwen-14b-chat
Tier 4: stripedhyena-nous-7b, mistral-7b-instruct, vicuna-7b, qwen1.5-4b-chat, palm-2
Tier 5: koala-13b, chatglm3-6b, gpt4all-13b-snoozy
Tier 6: mpt-7b-chat, RWKV-4-Raven-14B, chatglm2-6b, alpaca-13b, oasst-pythia-12b
Tier 7: fastchat-t5-3b, chatglm-6b
Tier 8: dolly-v2-12b, stablelm-tuned-alpha-7b
Tier 9: llama-13b
B Data Contamination
We check for cross-contamination between our evaluation dataset and the preference data used for
training using embedding similarity search. Embeddings are generated for the evaluation and training
data using OpenAI’s text-embedding-3-small model. For each evaluation example, we perform
a similarity search across all training data with a threshold of 0.95, returning a list of contaminated
examples. We discard these evaluation examples and report results on uncontaminated scores.
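A short numpy sketch of this check follows, assuming the evaluation and training embeddings have been precomputed with text-embedding-3-small; the function and variable names are illustrative.

```python
import numpy as np

def contaminated_indices(eval_emb: np.ndarray, train_emb: np.ndarray,
                         threshold: float = 0.95) -> np.ndarray:
    """Indices of evaluation prompts whose max cosine similarity to any training prompt exceeds the threshold."""
    unit = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit(eval_emb) @ unit(train_emb).T           # (n_eval, n_train) cosine matrix
    return np.where(sims.max(axis=1) >= threshold)[0]   # evaluation rows to discard
```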
C Benchmark-Dataset Similarity
Let ϵB = {b1 , b2 , . . . , bn } be the embeddings of the prompts for a given benchmark B and ϵD =
{d1 , d2 , . . . , dm } be the embeddings of a specific preference dataset Dpref , where n and m are the
total number of evaluation and preference data samples respectively. We define the benchmark-data
similarity score S(B, Dpref ) for each benchmark B as the average maximum similarity for each
evaluation prompt across all dataset samples:
$$S(B, \mathcal{D}_{\text{pref}}) = \frac{1}{n} \sum_{i=1}^{n} \max_{1 \le j \le m} \frac{b_i \cdot d_j}{\|b_i\| \, \|d_j\|} \tag{13}$$
We opt to use only the maximum similarity score because having a small number of samples of
preference data that are very similar to the user’s query is most valuable for efficient query routing, as
opposed to having many samples that are less similar to the user prompt.
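Eq. (13) translates directly into a few lines of numpy, assuming precomputed embedding matrices for the benchmark prompts and the preference-data prompts; the function name is illustrative.

```python
import numpy as np

def benchmark_dataset_similarity(bench_emb: np.ndarray, pref_emb: np.ndarray) -> float:
    """S(B, D_pref): mean over benchmark prompts of the max cosine similarity to any preference prompt (Eq. 13)."""
    unit = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit(bench_emb) @ unit(pref_emb).T   # pairwise cosine similarities
    return float(sims.max(axis=1).mean())
```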
D Cost Calculation
Since our evaluations are performed with the gpt-4-1106 endpoint, we use its pricing ($10 per 1
million input tokens and $30 per 1 million output tokens) in our analysis. For the sake of simplicity,
we assume the routers will mostly be handling short prompts in a single-turn setting. We find the
average input prompt in the training set to be 95 tokens long, and the average output response to be
264 tokens long, i.e. an input/output token ratio of roughly 95/264. Using this information, we estimate
the average cost of using GPT-4 to be

$$\frac{95 \times \$10/10^6 \;+\; 264 \times \$30/10^6}{95 + 264} \times 10^6 \approx \$24.7 \text{ per 1 million tokens.}$$

For Mixtral 8x7B, we assume the same price for both input and output tokens, which
makes the average cost $0.24 per 1 million tokens.
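The blended per-token figure can be reproduced directly from the numbers above (a quick sketch; the prices and token counts are those stated in this appendix):

```python
# Blended GPT-4 cost per token, weighted by the observed input/output token mix.
avg_in, avg_out = 95, 264                   # average input / output tokens
price_in, price_out = 10.0, 30.0            # USD per 1M input / output tokens
blended = (avg_in * price_in + avg_out * price_out) / (avg_in + avg_out)
print(f"~${blended:.1f} per 1M tokens")     # ~$24.7 per 1M tokens
```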
E Independent Benchmarks
Figure 2: Performance of our routers compared to other routing systems on MT Bench. Our routers
achieve stronger performance than existing routing systems at the same cost.
F Additional Plots
We include additional plots for all results presented in Section 4.1.
Figure 4: 5-shot MMLU performance for all routers.
Figure 6: MT Bench performance for all routers when routed to Claude 3 Opus and Llama 3 8B
instead, without any retraining.