S*: Test Time Scaling for Code Generation
Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica
[Figure 2 (diagram): Stage 1: Generation (iterative debugging enhanced hybrid scaling): N parallel samples are each debugged against public tests over Round 1, Round 2, ..., stopping at the max round. The example problem reads: "Given a positive integer num represented as a string, return the integer num without trailing zeros as a string.", with public tests such as Input: num = "51230100", Output: "512301" and Input: num = "123", Output: "123". Stage 2: Selection (adaptive input synthesis based selection): an LLM is prompted to generate inputs that distinguish pairwise samples, the samples are run with the generated inputs, and the best sample is selected based on execution results.]
Figure 2: Overview of S*. Stage 1: Generation—S* enhances parallel samples through iterative debugging. Each sample is tested using public test cases executed via an interpreter, with outputs and/or error messages used to guide the next round of sample generation. Stage 2: Selection—S* selects the best sample by prompting an LLM to generate inputs that differentiate between paired samples, then leveraging actual execution results to inform the LLM to determine the optimal choice.
... programs to obtain precise outputs and error messages, providing a reliable grounding mechanism for improving generation and selection (Chen et al., 2023; Li et al., 2022).

In this paper, we propose S*, the first hybrid test-time scaling framework for code generation, which substantially improves both coverage[1] and selection accuracy. S* pushes the limits of existing parallel scaling strategies by integrating sequential scaling through iterative debugging, while introducing a novel adaptive selection mechanism grounded in execution. The framework operates in two key stages, as shown in Fig. 2.

First, in the generation stage, S* augments parallel sampling (Brown et al., 2024; Li et al., 2022) with sequential scaling via iterative debugging. Each generated sample is executed on public test cases to obtain outputs and/or error messages, which are fed back into the model to iteratively refine the code. Second, in the selection stage, existing methods often rely on generating test inputs indiscriminately, which can fail to effectively differentiate between candidate solutions (Chen et al., 2022; Zeng et al., 2025). To overcome this limitation, S* introduces adaptive input synthesis: for each pair of samples, an LLM is prompted to generate distinguishing test inputs. These inputs are executed, and their outputs are then provided to ground the LLM in selecting the best sample. This adaptive, execution-grounded approach ensures robust identification of correct solutions (§5.4).

S* is a general approach that outperforms zero-shot generation and existing test-time scaling methods. We evaluate S* on 12 models, spanning a wide range of sizes, both open and closed, instruction-based and reasoning models. S* consistently enhances performance across these diverse settings. Notably, S* enables: (1) small models to surpass larger models within the same family: Qwen2.5-7B-Instruct + S* outperforms Qwen2.5-32B-Instruct on LiveCodeBench by 10.7%; (2) instruction-based models to outperform reasoning models: GPT-4o-mini + S* surpasses o1-preview by 3.7%; and (3) open reasoning models to achieve performance competitive with state-of-the-art closed models: DeepSeek-R1-Distill-Qwen-32B + S* achieves 85.7% on LiveCodeBench, approaching the state-of-the-art performance of o1-high at 88.5%. Fig. 3 provides an overview of the performance improvements enabled by our techniques. In summary, our contributions are:

1. We propose S*, the first hybrid test-time scaling framework for code generation, which augments parallel scaling with sequential scaling via iterative debugging and introduces adaptive test input synthesis using LLMs for robust sample selection.

2. We conduct extensive evaluations on LiveCodeBench and CodeContests, demonstrating that S* consistently improves performance across diverse model families and sizes.

3. We will release all software artifacts, model generations, and intermediate results to support and accelerate future research in this area.

[1] The fraction of problems that are solved by any generated sample (Brown et al., 2024).
[Figure 3 (bar chart): Pass@1 on LiveCodeBench, with o1-preview, o1-mini, and o1-high marked for reference. Qwen-Coder-14B: 44.8; + S* generation: 51.5 (+6.7); + S* selection: 64.6 (+13.1). R1-Distill-14B: 63.2; + S* generation: 66.2 (+3.0); + S* selection: 82.8 (+16.6).]

Figure 3: Ablation of S* performance benefits: Qwen2.5-Coder-14B-Instruct (denoted as Qwen-Coder-14B) (Hui et al., 2024) with S* can surpass o1-preview without S*. DeepSeek-R1-Distill-Qwen-14B (denoted as R1-Distill-14B) (Guo et al., 2025) with S* outperforms o1-mini without S*.

2 Related work

Test Time Scaling for LLMs. Existing approaches to increase test-time compute can be broadly categorized into two paradigms: parallel scaling and sequential scaling (Muennighoff et al., 2025). Parallel scaling (i.e., repeated sampling) involves generating multiple solutions simultaneously and selecting the best one, a strategy commonly known as Best-of-N. Coverage—the fraction of problems solved by any of these N samples—continues to improve as N increases (Chollet, 2019; Irvine et al., 2023), even at the scale of 10^4 to 10^6 samples (Brown et al., 2024; Li et al., 2022). However, common selection strategies, such as (weighted) majority voting (Wang et al., 2022) and reward model scoring (Christiano et al., 2017; Lightman et al., 2023; Wang et al., 2024a; Wu et al., 2024; Beeching et al.; Pan et al., 2024), often struggle to select the correct best sample in parallel scaling (Brown et al., 2024; Hassid et al., 2024; Stroebl et al., 2024). In this paper, we propose a novel method that improves selection for coding tasks.

Sequential scaling, on the other hand, encourages the model to refine its reasoning over multiple steps. This includes methods like chain-of-thought (CoT) prompting (Wei et al., 2022; Nye et al., 2021) and iterative rethinking and revision (Madaan et al., 2024; Lee et al., 2025; Hou et al., 2025; Huang et al., 2022; Min et al., 2024; Team, 2025; Muennighoff et al., 2025; Wang et al., 2024b; Li et al., 2025). Notably, OpenAI o1, DeepSeek R1, Qwen QwQ, and Kimi employ in-context long CoT with revision and backtracking to find the best solution (OpenAI, 2024; Guo et al., 2025; Qwen, 2024; Team et al., 2025). In this paper, we leverage iterative debugging with test execution feedback as our form of sequential scaling for code generation (Chen et al., 2023).

Test Time Scaling for Code Generation. Chen et al. (2022); Huang et al. (2023); Jiao et al. (2024) use models to generate code samples and test cases, selecting the final sample in a self-consistency manner (Wang et al., 2022; Zeng et al., 2025). However, these approaches often suffer from model hallucination, where the model fails to accurately predict the output of a test input (Jain et al., 2024; Zeng et al., 2025; Gu et al., 2024). AlphaCode explores large-scale parallel sampling with a trained model to generate test cases for filtering and selection (Li et al., 2022). AlphaCodium uses a series of self-revisions on both public demonstration and model-generated tests to improve solutions (Ridnik et al., 2024). Saad-Falcon et al. (2024) search over various inference techniques and find that parallel sampling with model-generated tests works well for CodeContests problems (Li et al., 2022). Unlike methods relying solely on parallel sampling or sequential scaling, we use a hybrid approach that combines their advantages.

Hybrid Test-Time Scaling. Many works in the math domain study hybrid approaches that combine parallel and sequential scaling, often leveraging reward-model-guided tree search algorithms, such as Monte-Carlo Tree Search (MCTS), to effectively navigate the solution space (Gao et al., 2024; Li et al., 2024b; Silver et al., 2016; Snell et al., 2024; Hendrycks et al., 2021b). s1 (Muennighoff et al., 2025) primarily focuses on sequential scaling but observes diminishing returns and thus incorporates parallel-based approaches like majority voting and tree search to further enhance performance.

In contrast, our work applies hybrid scaling to code generation tasks without relying on tree search methods, as developing a general and effective reward model for the code generation domain remains challenging (Zeng et al., 2025). Instead, S* augments parallel scaling with sequential scaling via execution-grounded iterative debugging to improve coverage, and introduces adaptive input synthesis to enhance selection accuracy.

Concurrent Work. CodeMonkeys is a notable concurrent work to this paper, released on arXiv in January 2025 (Ehrlich et al., 2025). It also generates multiple samples in parallel and revises them. However, CodeMonkeys focuses on the software engineering domain, optimizing performance on SWE-Bench (Chowdhury et al., 2024), which addresses challenges such as identifying the files that need to be edited. In contrast, our work focuses on competition-level code generation, where domain differences influence our algorithm choices. For instance, during sequential scaling, CodeMonkeys requires a model to generate tests over multiple iterations, while we instead incorporate feedback from public tests (ablated variants in §5.3).
3 Method

S* takes as input a coding problem P and a code generation model M. The model M aims to generate a program solution X(·) that maps inputs to outputs according to the problem specification.

We adopt the standard setup widely used in existing coding benchmarks (Chen et al., 2021; Li et al., 2022, 2023; Jain et al., 2024; Hendrycks et al., 2021a; Gulwani et al.). Each coding problem P consists of a natural language description and a set of public and private test cases, each represented as input-output pairs.
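Concretely, a problem instance under this setup can be represented as follows; this is a minimal illustrative sketch, and the field names are ours rather than any benchmark's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class CodingProblem:
    # Natural-language problem statement (included in the model's prompt).
    description: str
    # Input-output pairs shown to the model (typically only a handful).
    public_tests: list[tuple[str, str]] = field(default_factory=list)
    # Held-out input-output pairs used only for final grading (typically many more).
    private_tests: list[tuple[str, str]] = field(default_factory=list)
```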
Private tests evaluate the correctness of X but remain inaccessible to M during code generation. A solution is considered correct if it passes all private tests. In contrast, public tests are provided to clarify the problem's intent and are typically included in the prompt. Public tests are usually far fewer than private tests; for instance, in CodeContests (Li et al., 2022), there are, on average, 2.0 public tests and 202.1 private tests per problem. This contrasts with mathematical reasoning tasks, where evaluation typically relies on exact string matching of the final solution without additional test information (Li et al., 2024a).

3.1 The S* Framework

S* is a two-stage hybrid test-time scaling framework consisting of Generation and Selection stages, as demonstrated in Fig. 2. It extends parallel sampling with sequential sampling via iterative debugging to improve coverage and employs adaptive input synthesis during selection to enhance selection accuracy, leveraging execution results throughout the process. An example of the effect of the different stages can be found in Fig. 3.

Stage 1: Generation. In the generation stage, S* improves coverage by extending parallel scaling with sequential scaling through iterative debugging grounded in execution feedback. Specifically, S* first generates N initial samples independently, leveraging parallel sampling techniques (Chen et al., 2023). Each sample is then refined through up to R rounds of sequential revision, informed by execution results on public test cases. The revision process halts once a sample passes all public tests or reaches the maximum number of revision attempts.
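A minimal sketch of this generation stage, assuming hypothetical helpers llm_generate, llm_revise, and run_public_tests (an LLM client and a sandboxed interpreter; not the released implementation):

```python
def generation_stage(problem, model, n_samples=16, max_rounds=2):
    """Parallel sampling plus up to `max_rounds` of execution-guided revision.
    Defaults follow the configuration reported in Section 4.1."""
    samples = [llm_generate(model, problem.description) for _ in range(n_samples)]
    for i, code in enumerate(samples):
        for _ in range(max_rounds):
            passed, feedback = run_public_tests(code, problem.public_tests)
            if passed:
                break  # halt once all public tests pass
            # Outputs and/or error messages guide the next revision round.
            code = llm_revise(model, problem.description, code, feedback)
        samples[i] = code
    return samples
```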
Stage 2: Selection. After generating N candidate solutions, the next step is to identify the best one. Since the public tests are already used during the generation stage, additional evaluation is needed for reliable selection. We investigate two existing approaches: (1) LLM-as-a-judge (Zheng et al., 2023), which relies on pre-trained knowledge to compare candidate solutions, and (2) generated test cases (Li et al., 2022; Chen et al., 2022), which use synthesized test cases to guide selection.

Unfortunately, we find that LLM-based judging alone often struggles to predict program behavior accurately, while generated tests frequently fail to provide reliable outputs for grounding the selection or to produce high-quality inputs that effectively distinguish samples (see Tab. 3).

Algorithm 1: Best Sample Selection in S*
  Input: Problem description P; candidate samples X
  Output: The best selected sample x*
  1  T <- llm_test_input_gen(P)
  2  O <- sample_execution(X, T)
  3  C <- sample_clustering(O)
  4  Scores <- 0
  5  for each pair (C_i, C_j) in C do
  6      Sample x_i, x_j from C_i, C_j
  7      T_a <- adaptive_input_gen(x_i, x_j)
  8      better_idx <- exec_and_llm_select(x_i, x_j, T_a)
  9      Scores[better_idx] += 1
  10 end
  11 C* <- arg max(Scores)
  12 x* <- random_pick(C*)
  13 return x*

To overcome these limitations, S* introduces adaptive input synthesis, a hybrid selection approach that integrates LLM evaluation with execution-grounded verification, as illustrated in Algorithm 1. First, we prompt an LLM to synthesize a set of test inputs. We execute these inputs and cluster the N samples based on their execution outputs (Line 1 to Line 3) (Li et al., 2022). We then perform pairwise comparisons across clusters: for each comparison, we prompt the LLM to generate distinguishing inputs, execute both samples using these inputs, and select the superior one based on the execution results (Line 7 to Line 9). This adaptive selection process grounds LLM evaluations in concrete execution feedback, resulting in more reliable and accurate sample selection compared to naive LLM judging or generated-tests-based methods (see §4).
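Read as code, Algorithm 1 looks roughly as follows; the helpers named in the pseudocode (llm_test_input_gen, sample_execution, adaptive_input_gen, exec_and_llm_select) are assumed to wrap an LLM and a sandboxed interpreter:

```python
import random
from collections import defaultdict

def selection_stage(problem, samples):
    """Adaptive input synthesis selection (Algorithm 1), as a sketch."""
    # Lines 1-3: synthesize inputs, run every sample, cluster by output behavior.
    test_inputs = llm_test_input_gen(problem.description)
    by_output = defaultdict(list)
    for x in samples:
        by_output[tuple(sample_execution(x, test_inputs))].append(x)
    clusters = list(by_output.values())
    scores = [0] * len(clusters)
    # Lines 5-9: pairwise, execution-grounded comparisons between clusters.
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            xi, xj = random.choice(clusters[i]), random.choice(clusters[j])
            ta = adaptive_input_gen(xi, xj)           # distinguishing inputs
            better = exec_and_llm_select(xi, xj, ta)  # 0 or 1, judged on real outputs
            scores[i if better == 0 else j] += 1
    # Lines 11-13: return a random sample from the highest-scoring cluster.
    return random.choice(clusters[scores.index(max(scores))])
```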
4 Evaluation

In this section, we evaluate S* across a diverse set of instruction-based and reasoning models, spanning various model families, sizes, and access types (open and closed), as well as multiple benchmarks (Jain et al., 2024; Li et al., 2022). Our key findings demonstrate the generality and effectiveness of S*:

1. S* consistently improves model performance across different families, sizes, and types, and generalizes effectively to multiple code generation benchmarks, including LiveCodeBench (§4.2) and CodeContests (§4.4), showcasing its robustness and broad applicability.

2. S* outperforms existing widely-used test-time scaling methods, including self-debugging (Chen et al., 2023) and majority voting (Wang et al., 2022; Li et al., 2022), by enhancing both coverage and selection accuracy (§4.3).

4.1 Experimental Setup

Models. We consider both instruction-based and reasoning-based models. To compare performance across models of different sizes using S*, we select a series of models within the same family. We experiment with 12 models: (1) Instruction-based models: the Qwen2.5-Coder-Instruct series (0.5B, 1.5B, 3B, 7B, 14B, 32B) and GPT-4o mini (0718 version) (Hui et al., 2024; Achiam et al., 2023); (2) Reasoning-based models: QwQ-32B-Preview, the DeepSeek-R1-Distill-Qwen series (7B, 14B, 32B), and o1-mini (Qwen, 2024; Guo et al., 2025; OpenAI, 2024).

Benchmarks. We primarily use LiveCodeBench (MIT License) as our main evaluation benchmark, given its extensive usage by recent reasoning models and its inclusion of difficulty levels, which help analyze the behavior of different techniques (Jain et al., 2024; DeepSeek, 2024; Qwen, 2024). We use its v4 version for development (e.g., selecting hyper-parameters), which contains problems from August 2024 to November 2024. For final evaluation, we use the v2 version, which does not overlap with v4 and contains more problems. LiveCodeBench (v2) contains 511 problems, ranging from easy (182 problems) and medium (206 problems) to hard (123 problems). In addition, we evaluate S* on CodeContests (Li et al., 2022), a collection of 165 challenging coding problems. We use Pass@1 as our primary metric (Chen et al., 2021). Some experiments report Pass@N with N samples (often referred to as the "oracle" setting) (Stroebl et al., 2024; Brown et al., 2024).

Baselines. Our evaluation considers two categories of baselines. First, we assess our method's improvement over the original model (without test-time scaling), using three leading OpenAI reasoning models—o1-preview, o1-high, and o1-mini (OpenAI, 2024)—as performance benchmarks. Second, we evaluate different test-time scaling methods applied to the same models, encompassing both parallel (i.e., majority voting) and sequential (i.e., self-debugging) approaches.

Implementation Details. We configure S* to generate 16 samples in parallel with a temperature of 0.7 (without top-p sampling) and perform 2 rounds of iterative debugging on public tests. We justify our choice of hyper-parameters in §5. Prompts are automatically generated by a prompting framework, DSPy (Khattab et al., 2023); detailed prompts can be found in Appendix A.2. We run code in a sandbox to guard against maliciously generated code, following Chen et al. (2021). Experiments with the largest model (DeepSeek-R1-Distill-Qwen-32B) take one day on 8 H100 GPUs. All experiments are conducted once.
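For reference, these hyper-parameters can be gathered in one place; the dataclass below is our illustration, not part of the released code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SStarConfig:
    n_samples: int = 16       # parallel samples per problem
    temperature: float = 0.7  # sampling temperature (no top-p sampling)
    debug_rounds: int = 2     # rounds of iterative debugging on public tests
```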
4.2 S* Main Results

Fig. 1 presents a performance comparison on LiveCodeBench with and without S*, alongside the o1-series reasoning models for reference. Our results demonstrate that S* consistently enhances model performance. When applied to models within the same family, S* allows small models to surpass large ones. For example, Qwen2.5-7B-Coder-Instruct integrated with S* outperforms Qwen2.5-32B-Coder-Instruct without S* by 10.1%. Additionally, S* enables instruction-based models to surpass reasoning models, as evidenced by GPT-4o mini (0718) with S* outperforming o1-preview. Moreover, S* further improves strong reasoning models: the most capable open-source reasoning model, DeepSeek-R1-Distill-Qwen-32B, when enhanced with S*, surpasses o1-mini and achieves near state-of-the-art results comparable to o1 (high reasoning effort).
Method          | Qwen2.5-Coder-Instruct             | 4o-mini | R1-Distill        | QwQ  | o1-mini
                | 0.5B  1.5B  3B    7B    14B   32B  |         | 7B    14B   32B   |      |
Zero-Shot       | 1.2   7.0   18.4  29.4  44.8  47.4 | 40.9    | 48.4  63.2  69.1  | 62.1 | 76.5
Majority Vote   | 2.5   11.0  25.2  40.5  50.8  55.9 | 46.6    | 58.7  68.1  75.8  | 67.3 | 81.6
Self-Debugging  | 2.4   9.4   27.8  39.9  51.5  59.5 | 51.7    | 58.4  66.2  70.1  | 59.3 | 79.9
S*              | 10.9  27.6  42.7  54.4  64.6  70.1 | 61.3    | 73.2  82.8  85.7  | 76.3 | 85.3

Table 1: Pass@1 of zero-shot, majority voting (Wang et al., 2022; Li et al., 2022), self-debugging (Chen et al., 2023), and S* on LiveCodeBench (v2). Bold text denotes the best performance. "R1-Distill", "QwQ", and "4o-mini" are short for "DeepSeek-R1-Distill-Qwen" (Guo et al., 2025), "QwQ-32B-Preview" (Qwen, 2024), and "GPT-4o-mini" (Achiam et al., 2023), respectively. S* consistently outperforms other baselines.
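For comparison with Table 1, the majority-voting baseline can be approximated by clustering samples by their execution outputs and returning one from the largest cluster; a sketch reusing the assumed sample_execution helper from §3.1:

```python
from collections import defaultdict

def majority_vote(samples, test_inputs):
    """Self-consistency for code: programs 'vote' through agreeing outputs."""
    by_output = defaultdict(list)
    for x in samples:
        by_output[tuple(sample_execution(x, test_inputs))].append(x)
    # Return one representative of the most common behavior.
    return max(by_output.values(), key=len)[0]
```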
... public tests for every iteration works the best.

3. Selection Policy: We assess the performance of different selection policies, comparing our adaptive input synthesis approach with alternative selection strategies (§5.4). We find that our adaptive input synthesis selection method is consistently more reliable than the generated tests and the LLM judge selection methods.

All ablation experiments are conducted on LiveCodeBench (v4).

5.1 Parallel Sampling Hyper-Parameters

[Figure 4 (charts): Effect of Temperature and Effect of Sample Size on Pass@N; plots omitted.]

Notably, Qwen2.5-Coder-7B-Instruct, the weakest performer at N = 1, shows the largest gain, exceeding 35% at N = 64. Similarly, the more capable reasoning model (QwQ-32B-Preview) follows the same trend, though its gains plateau beyond N = 32. Nevertheless, it improves substantially, rising from 50% at N = 1 to 80% at N = 32. These results confirm that increasing the number of parallel samples is a simple yet effective strategy for enhancing performance in both instruction-following and reasoning-based models.
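The Pass@N ("oracle") curves discussed here can be computed with the standard unbiased estimator of Chen et al. (2021), sketched below:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of 1 - C(n-c, k) / C(n, k): the probability that at
    least one of k samples is correct, given c correct out of n generated."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```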
5.2 Impact of In-Context Examples

[Figure 5 (charts): Pass@N for Qwen2.5-Coder-7B-Inst., Qwen2.5-Coder-32B-Inst., and GPT-4o mini under No ICL, ICL (BM25), and ICL (Patterns); plots omitted.]
... Coder-32B-Instruct. In contrast, ICL (Patterns) outperforms the baseline when n >= 8 for Qwen2.5-Coder-7B-Instruct and n >= 4 for Qwen2.5-Coder-32B-Instruct, while showing comparable performance with GPT-4o mini.

These results highlight that selecting high-quality examples is crucial, and naive retrieval methods often fall short. Although ICL itself is promising, its performance is sensitive to example quality and retrieval effectiveness. We regard it as future work to develop robust ICL techniques that can be integrated into S* to further enhance parallel scaling performance.

5.3 Impact of Iterative Debugging Variants

[Figure 6 (charts): Pass@N over debugging rounds for Qwen2.5-Coder-7B-Inst., GPT-4o mini, and QwQ-32B-Preview under Public Tests, + Generated Tests, and Last Round Context; plots omitted.]

Figure 6: Comparison of three iterative debugging approaches: Public Tests, + Generated Tests, and Last Round Context. Results are obtained with N = 8, temperature = 0.7, and up to four rounds of debugging.

We examine the effectiveness of three variants of iterative debugging: (1) Public Tests: the model iteratively debugs using public tests and stops once the sample passes all of them. (2) + Generated Tests: in addition to public tests, the model continues debugging on model-generated tests, following the algorithm in (Ridnik et al., 2024). (3) Last Round Context: the model iteratively debugs using only public tests, but instead of using code samples from all previous rounds for debugging, it uses only the last round's code sample as context. This is motivated by observations that LLMs may perform sub-optimally when handling large context windows (Liu et al., 2024).

Fig. 6 summarizes the results. We find that: (1) even though reasoning models implicitly perform self-reflection and revision, they benefit from explicit debugging through test execution feedback: the performance of the QwQ-32B-Preview model improves from 72.6 to 74.2 with 2 rounds of debugging. (2) Reducing the context window or considering more model-generated tests does not show consistent improvement: while using only the last round of context improves performance for the Qwen2.5-Coder-7B-Instruct model, it results in worse performance for the other two models. Similarly, incorporating additional model-generated tests does not enhance performance for GPT-4o mini. (3) The benefits of iterative debugging tend to plateau, typically after 2–3 rounds; this finding aligns with the observation that the benefit of sequential scaling flattens out (Muennighoff et al., 2025). Motivated by these findings, we choose to use 2 rounds of debugging, only on public tests, for simplicity, and apply iterative debugging even for reasoning models in §4.2.
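The difference between the default variant and Last Round Context comes down to how the revision prompt is assembled; the sketch below is ours, not the paper's exact prompt format:

```python
def build_debug_context(history, last_round_only=False):
    """history: list of (code, feedback) pairs from earlier debugging rounds.
    Last Round Context keeps only the most recent attempt, trading history
    for a shorter context (cf. Liu et al., 2024, on long-context degradation)."""
    rounds = history[-1:] if last_round_only else history
    return "\n\n".join(
        f"[Round {r} generated code]\n{code}\n[Round {r} test feedback]\n{feedback}"
        for r, (code, feedback) in enumerate(rounds)
    )
```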
5.4 Impact of Different Selection Policies

Model            | Public Only | Generated Tests | LLM Judge | Adaptive Input Synthesis (Ours)
Qwen-Coder-7B    | 42.3        | 42.3            | 42.3      | 44.5
Qwen-Coder-32B   | 54.6        | 57.8            | 55.5      | 58.3
GPT-4o mini      | 54.1        | 55.2            | 56.3      | 57.3
QwQ-32B-Preview  | 64.3        | 65.9            | 68.6      | 69.7
Avg.             | 53.8        | 53.1            | 55.6      | 57.5

Table 3: Pass@1 performance comparison between different selection methods on LiveCodeBench (v4). Bold text denotes the best performance for the same model. "Qwen-Coder" and "R1-Distill" are short for "Qwen2.5-Coder-Instruct" and "DeepSeek-R1-Distill-Qwen". Results are obtained with N = 8 and temperature = 0.7. Our Adaptive Input Synthesis method achieves better accuracy than other methods.

We compare different policies for selecting the best sample after iterative debugging. We evaluate four approaches: (1) Public Only: using only public test cases for selection and randomly selecting a sample if it passes all tests; (2) Generated Tests: applying public test filtering followed by additional test case generation using GPT-4o mini, selecting the sample that passes the most test cases; (3) LLM Judge: applying public test filtering and then using LLMs for pairwise selection among code samples; and (4) Adaptive Input Synthesis: applying the selection algorithm described in §3.1 with GPT-4o mini after public test filtering.

Tab. 3 summarizes the results. Notably, the Generated Tests approach does not yield improvements over the Public Only baseline. This is due to errors in model-generated outputs, which, when applied to poorly chosen inputs, introduce significant noise in the selection process, ultimately degrading performance. In contrast, our Adaptive Input Synthesis method enables the LLM to strategically select an input that best differentiates samples while avoiding the need to predict outputs. By leveraging real execution outputs rather than model predictions, the LLM makes more reliable decisions, leading to improved selection accuracy.
6 Conclusion

We propose S*, the first hybrid test-time scaling framework for code generation that substantially improves both coverage and selection accuracy. S* extends the existing parallel scaling paradigm with sequential scaling through iterative debugging, and incorporates adaptive input synthesis, a novel mechanism that synthesizes distinguishing test inputs to differentiate candidates and identify correct solutions via execution results.

S* consistently improves code generation performance across benchmarks, including LiveCodeBench and CodeContests. Notably, S* enables a 3B model to outperform GPT-4o mini, GPT-4o mini to surpass o1-preview by 3.7% on LiveCodeBench, and DeepSeek-R1-Distill-Qwen-32B to achieve 85.7% on LiveCodeBench, approaching o1-high at 88.5%.

7 Limitations

This work focuses primarily on competition-level code generation and does not study software engineering tasks such as SWE-bench (Jimenez et al., 2023). The method primarily targets improving accuracy rather than minimizing cost.

8 Acknowledgment

This work is funded by the Sky Computing Lab at UC Berkeley. We extend our sincere gratitude to Matei Zaharia and Anastasios Nikolas Angelopoulos for their invaluable feedback. We are grateful for the generous compute resource support from Databricks, Lambda Labs, and Anyscale. In particular, we thank Jonathan Frankle (Databricks) and Chuan Li (Lambda Labs) for facilitating access to these resources.

References

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. 2024. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787.

Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2022. CodeT: Code generation with generated tests. arXiv preprint arXiv:2207.10397.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128.

François Chollet. 2019. On the measure of intelligence. arXiv preprint arXiv:1911.01547.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.

DeepSeek. 2024. DeepSeek-R1-Lite-Preview release. https://api-docs.deepseek.com/news/news1120. Accessed: 2024-11-20.

Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Ré, and Azalia Mirhoseini. 2025. CodeMonkeys: Scaling test-time compute for software engineering. arXiv preprint arXiv:2501.14723.

Zitian Gao, Boye Niu, Xuzheng He, Haotian Xu, Hongzhang Liu, Aiwei Liu, Xuming Hu, and Lijie Wen. 2024. Interpretable contrastive Monte Carlo tree search reasoning. arXiv preprint arXiv:2410.01707.

Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. 2024. CRUXEval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065.

Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. Foundations and trends in programming languages. Bd, 4:1–119.

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021a. Measuring coding challenge competence with APPS. NeurIPS.
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021b. Measuring coding challenge competence with APPS. arXiv preprint arXiv:2105.09938.

Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, and Yuxiao Dong. 2025. Advancing language model reasoning through reinforcement learning and inference scaling. arXiv preprint arXiv:2501.11651.

Baizhou Huang, Shuai Lu, Weizhu Chen, Xiaojun Wan, and Nan Duan. 2023. Enhancing large language models in coding through multi-perspective self-consistency. arXiv preprint arXiv:2309.17272.

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2022. Large language models can self-improve. arXiv preprint arXiv:2210.11610.

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, et al. 2024. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186.

Robert Irvine, Douglas Boubert, Vyas Raina, Adian Liusie, Ziyi Zhu, Vineet Mudupalli, Aliaksei Korshuk, Zongyi Liu, Fritz Cremer, Valentin Assassi, et al. 2023. Rewarding chatbots for real-world engagement with millions of users. arXiv preprint arXiv:2303.06135.

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.

Fangkai Jiao, Geyang Guo, Xingxing Zhang, Nancy F Chen, Shafiq Joty, and Furu Wei. 2024. Preference optimization for reasoning with pseudo feedback. arXiv preprint arXiv:2411.16345.

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. 2023. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714.

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. 2024. Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124.

Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, and Xinyun Chen. 2025. Evolving deeper LLM thinking. arXiv preprint arXiv:2501.09891.

Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G Patil, Matei Zaharia, Joseph E Gonzalez, and Ion Stoica. 2025. LLMs can easily learn to reason from demonstrations: Structure, not content, is what matters! arXiv preprint arXiv:2502.07374.

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. 2024a. NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13:9.

Qingyao Li, Wei Xia, Kounianhua Du, Xinyi Dai, Ruiming Tang, Yasheng Wang, Yong Yu, and Weinan Zhang. 2024b. RethinkMCTS: Refining erroneous thoughts in Monte Carlo tree search for code generation. arXiv preprint arXiv:2409.09584.

Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. 2023. TACO: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852.

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097.

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. arXiv preprint arXiv:2305.20050.

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36:21558–21572.

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2024. Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36.

Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, et al. 2024. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems. arXiv preprint arXiv:2412.09413.
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. 2025. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.

OpenAI. 2024. Learning to reason with LLMs. https://openai.com/index/learning-to-reason-with-llms/. Accessed: 2024-11-20.

Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. 2024. Training software engineering agents and verifiers with SWE-Gym. arXiv preprint arXiv:2412.21139.

Qwen. 2024. QwQ: Reflect deeply on the boundaries of the unknown. https://qwenlm.github.io/blog/qwq-32b-preview/.

Tal Ridnik, Dedy Kredo, and Itamar Friedman. 2024. Code generation with AlphaCodium: From prompt engineering to flow engineering. arXiv preprint arXiv:2401.08500.

NovaSky Team. 2025. Sky-T1: Train your own o1 preview model within 450 dollars. https://novasky-ai.github.io/posts/sky-t1. Accessed: 2025-01-09.

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. 2024a. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

Yifei Wang, Yuyang Wu, Zeming Wei, Stefanie Jegelka, and Yisen Wang. 2024b. A theoretical understanding of self-correction through in-context alignment. arXiv preprint arXiv:2405.18634.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
Question

Example 1:
Input: word = "cbaaaabc", forbidden = ["aaa","cb"]
Output: 4
Explanation: There are 11 valid substrings in word: "c", "b", "a", "ba", "aa", "bc", "baa", "aab", "ab", "abc" and "aabc". The length of the longest valid substring is 4. It can be shown that all other substrings contain either "aaa" or "cb" as a substring.

Example 2:
Input: word = "leetcode", forbidden = ["de","le","e"]
Output: 4
Explanation: There are 11 valid substrings in word: "l", "t", "c", "o", "d", "tc", "co", "od", "tco", "cod", and "tcod". The length of the longest valid substring is 4. It can be shown that all other substrings contain either "de", "le", or "e" as a substring.

Constraints:
1 <= word.length <= 10^5
word consists only of lowercase English letters.
1 <= forbidden.length <= 10^5
1 <= forbidden[i].length <= 10
forbidden[i] consists only of lowercase English letters.
System message:
Your input fields are:
1. `prompt` (str)
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## prompt ## ]]
{prompt}
[[ ## reasoning ## ]]
{reasoning}
[[ ## code ## ]]
{code}
[[ ## completed ## ]]
In adhering to this structure, your objective is: Given the fields `prompt`, produce the fields
`code`.
User message:
[[ ## prompt ## ]]
{Question Prompt}
Code:
[Round 0 Reasoning]: {Round 0 Reasoning}
[Round 0 Generated code]: {Round 0 Generated Code}
[Round 0 Test Feedback]: {Round 0 Test Feedback}
Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then
`[[ ## code ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.
System message:
Your input fields are:
1. `prompt` (str)
Your output fields are:
1. `reasoning` (str)
2. `tests` (str): Generate a complete set of potential inputs to test an AI-generated solution to
the coding problem. Cover: (i) Edge cases, such as empty string or arrays, (ii) Complex and
difficult inputs, but do not include very long inputs. (iii) Other ones that can maximize the
chance of catching a bug. Provide the input and output in JSON format as follows: {"input":
<example_input>, "output": <expected_output>} Ensure the input and output match the types
and structure expected for the problem. Do not include any additional text or explanations, just
the JSON object.
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## prompt ## ]] {prompt}
[[ ## reasoning ## ]] {reasoning}
[[ ## tests ## ]] {tests}
[[ ## completed ## ]]
In adhering to this structure, your objective is: Given the fields `prompt`, produce the fields
`tests`.
User message:
[[ ## prompt ## ]] {Question Prompt}
Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then
`[[ ## tests ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.
System message:
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## prompt ## ]]
{prompt}
[[ ## reasoning ## ]]
{reasoning}
[[ ## code ## ]]
{code}
[[ ## completed ## ]]
User message:
[[ ## prompt ## ]]
{Question Prompt}
Code:
Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then
`[[ ## code ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.
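The [[ ## field ## ]] layout in these templates is the message format produced by DSPy (§4.1). A signature along the following lines (our sketch, not the authors' released code) would yield a similar template:

```python
import dspy

class GenerateCode(dspy.Signature):
    """Given the fields `prompt`, produce the fields `code`."""
    prompt: str = dspy.InputField()
    reasoning: str = dspy.OutputField()
    code: str = dspy.OutputField()

generate = dspy.Predict(GenerateCode)
# prediction = generate(prompt=question_prompt)  # exposes .reasoning and .code
```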