S*: Test Time Scaling for Code Generation
Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica
[Figure 2 (diagram): Stage 1: Generation (iterative debugging enhanced hybrid scaling): N parallel samples are each debugged against public tests over Round 1, Round 2, ..., stopping at the max round. The example problem reads: "Given a positive integer num represented as a string, return the integer num without trailing zeros as a string.", with public tests such as Input: num = "51230100", Output: "512301" and Input: num = "123", Output: "123". Stage 2: Selection (adaptive input synthesis based selection): an LLM is prompted to generate inputs that distinguish pairwise samples, the samples are run with the generated inputs, and the best sample is selected based on execution results.]
Figure 2: Overview of S*. Stage 1: Generation—S* enhances parallel samples through iterative debugging. Each sample is tested using public test cases executed via an interpreter, with outputs and/or error messages used to guide the next round of sample generation. Stage 2: Selection—S* selects the best sample by prompting an LLM to generate inputs that differentiate between paired samples, then leveraging actual execution results to inform the LLM to determine the optimal choice.
... programs to obtain precise outputs and error messages, providing a reliable grounding mechanism for improving generation and selection (Chen et al., 2023; Li et al., 2022).

In this paper, we propose S*, the first hybrid test-time scaling framework for code generation, which substantially improves both coverage[1] and selection accuracy. S* pushes the limits of existing parallel scaling strategies by integrating sequential scaling through iterative debugging, while introducing a novel adaptive selection mechanism grounded in execution. The framework operates in two key stages, as shown in Fig. 2.

First, in the generation stage, S* augments parallel sampling (Brown et al., 2024; Li et al., 2022) with sequential scaling via iterative debugging. Each generated sample is executed on public test cases to obtain outputs and/or error messages, which are fed back into the model to iteratively refine the code. Second, in the selection stage, existing methods often rely on generating test inputs indiscriminately, which can fail to effectively differentiate between candidate solutions (Chen et al., 2022; Zeng et al., 2025). To overcome this limitation, S* introduces adaptive input synthesis: for each pair of samples, an LLM is prompted to generate distinguishing test inputs. These inputs are executed, and their outputs are then provided to ground the LLM in selecting the best sample. This adaptive, execution-grounded approach ensures robust identification of correct solutions (§5.4).

S* is a general approach that outperforms zero-shot generation and existing test-time scaling methods. We evaluate S* on 12 models, spanning a wide range of sizes, both open and closed, instruction-based and reasoning models. S* consistently enhances performance across these diverse settings. Notably, S* enables: (1) small models to surpass larger models within the same family: Qwen2.5-7B-Instruct + S* outperforms Qwen2.5-32B-Instruct on LiveCodeBench by 10.7%; (2) instruction-based models to outperform reasoning models: GPT-4o-mini + S* surpasses o1-preview by 3.7%; and (3) open reasoning models to achieve performance competitive with state-of-the-art closed models: DeepSeek-R1-Distill-Qwen-32B + S* achieves 85.7% on LiveCodeBench, approaching the state-of-the-art performance of o1-high at 88.5%. Fig. 3 provides an overview of the performance improvements enabled by our techniques. In summary, our contributions are:

1. We propose S*, the first hybrid test-time scaling framework for code generation, which augments parallel scaling with sequential scaling via iterative debugging and introduces adaptive test input synthesis using LLMs for robust sample selection.

2. We conduct extensive evaluations on LiveCodeBench and CodeContests, demonstrating that S* consistently improves performance across diverse model families and sizes.

3. We will release all software artifacts, model generations, and intermediate results to support and accelerate future research in this area.

[1] The fraction of problems that are solved by any generated sample (Brown et al., 2024).
[Figure 3 (bar chart): Pass@1 on LiveCodeBench, with o1-preview, o1-mini, and o1-high marked for reference. Qwen-Coder-14B: 44.8; + S* generation: 51.5 (+6.7); + S* selection: 64.6 (+13.1). R1-Distill-14B: 63.2; + S* generation: 66.2 (+3.0); + S* selection: 82.8 (+16.6).]

Figure 3: Ablation of S* performance benefits: Qwen2.5-Coder-14B-Instruct (denoted as Qwen-Coder-14B) (Hui et al., 2024) with S* can surpass o1-preview without S*. DeepSeek-R1-Distill-Qwen-14B (denoted as R1-Distill-14B) (Guo et al., 2025) with S* outperforms o1-mini without S*.

2 Related work

Test Time Scaling for LLMs. Existing approaches to increase test-time compute can be broadly categorized into two paradigms: parallel scaling and sequential scaling (Muennighoff et al., 2025). Parallel scaling (i.e., repeated sampling) involves generating multiple solutions simultaneously and selecting the best one, a strategy commonly known as Best-of-N. Coverage—the fraction of problems solved by any of these N samples—continues to improve as N increases (Chollet, 2019; Irvine et al., 2023), even at the scale of 10^4 to 10^6 samples (Brown et al., 2024; Li et al., 2022). However, common selection strategies, such as (weighted) majority voting (Wang et al., 2022) and reward model scoring (Christiano et al., 2017; Lightman et al., 2023; Wang et al., 2024a; Wu et al., 2024; Beeching et al.; Pan et al., 2024), often struggle to select the correct best sample in parallel scaling (Brown et al., 2024; Hassid et al., 2024; Stroebl et al., 2024). In this paper, we propose a novel method that improves selection for coding tasks.

Sequential scaling, on the other hand, encourages the model to refine its reasoning over multiple steps. This includes methods like chain-of-thought (CoT) prompting (Wei et al., 2022; Nye et al., 2021) and iterative rethinking and revision (Madaan et al., 2024; Lee et al., 2025; Hou et al., 2025; Huang et al., 2022; Min et al., 2024; Team, 2025; Muennighoff et al., 2025; Wang et al., 2024b; Li et al., 2025). Notably, OpenAI o1, DeepSeek R1, Qwen QwQ, and Kimi employ in-context long CoT with revision and backtracking to find the best solution (OpenAI, 2024; Guo et al., 2025; Qwen, 2024; Team et al., 2025). In this paper, we leverage iterative debugging with test execution feedback as our form of sequential scaling for code generation (Chen et al., 2023).

Test Time Scaling for Code Generation. Chen et al. (2022); Huang et al. (2023); Jiao et al. (2024) use models to generate code samples and test cases, selecting the final sample in a self-consistency manner (Wang et al., 2022; Zeng et al., 2025). However, these approaches often suffer from model hallucination, where the model fails to accurately predict the output of a test input (Jain et al., 2024; Zeng et al., 2025; Gu et al., 2024). AlphaCode explores large-scale parallel sampling with a trained model to generate test cases for filtering and selection (Li et al., 2022). AlphaCodium uses a series of self-revisions on both public demonstration and model-generated tests to improve solutions (Ridnik et al., 2024). Saad-Falcon et al. (2024) search over various inference techniques and find that parallel sampling with model-generated tests works well for CodeContests problems (Li et al., 2022). Unlike methods relying solely on parallel sampling or sequential scaling, we use a hybrid approach that combines their advantages.

Hybrid Test-Time Scaling. Many works in the math domain study hybrid approaches that combine parallel and sequential scaling, often leveraging reward-model-guided tree search algorithms, such as Monte-Carlo Tree Search (MCTS), to effectively navigate the solution space (Gao et al., 2024; Li et al., 2024b; Silver et al., 2016; Snell et al., 2024; Hendrycks et al., 2021b). s1 (Muennighoff et al., 2025) primarily focuses on sequential scaling but observes diminishing returns and thus incorporates parallel-based approaches like majority voting and tree search to further enhance performance.

In contrast, our work applies hybrid scaling to code generation tasks without relying on tree search methods, as developing a general and effective reward model for the code generation domain remains challenging (Zeng et al., 2025). Instead, S* augments parallel scaling with sequential scaling via execution-grounded iterative debugging to improve coverage, and introduces adaptive input synthesis to enhance selection accuracy.

Concurrent Work. CodeMonkeys is a notable concurrent work to this paper, released on arXiv in January 2025 (Ehrlich et al., 2025). It also generates multiple samples in parallel and revises them. However, CodeMonkeys focuses on the software engineering domain, optimizing performance on SWE-Bench (Chowdhury et al., 2024), which addresses challenges such as identifying the files that need to be edited. In contrast, our work focuses on competition-level code generation, where domain differences influence our algorithm choices. For instance, during sequential scaling, CodeMonkeys requires a model to generate tests over multiple iterations, while we instead incorporate feedback from public tests (ablated variants in §5.3).
3 Method

S* takes as input a coding problem P and a code generation model M. The model M aims to generate a program solution X(·) that maps inputs to outputs according to the problem specification.

We adopt the standard setup widely used in existing coding benchmarks (Chen et al., 2021; Li et al., 2022, 2023; Jain et al., 2024; Hendrycks et al., 2021a; Gulwani et al.). Each coding problem P consists of a natural language description and a set of public and private test cases, each represented as input-output pairs.
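Concretely, a problem instance under this setup can be represented as follows; this is a minimal illustrative sketch, and the field names are ours rather than any benchmark's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class CodingProblem:
    # Natural-language problem statement (included in the model's prompt).
    description: str
    # Input-output pairs shown to the model (typically only a handful).
    public_tests: list[tuple[str, str]] = field(default_factory=list)
    # Held-out input-output pairs used only for final grading (typically many more).
    private_tests: list[tuple[str, str]] = field(default_factory=list)
```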
Private tests evaluate the correctness of X but remain inaccessible to M during code generation. A solution is considered correct if it passes all private tests. In contrast, public tests are provided to clarify the problem's intent and are typically included in the prompt. Public tests are usually far fewer than private tests; for instance, in CodeContests (Li et al., 2022), there are, on average, 2.0 public tests and 202.1 private tests per problem. This contrasts with mathematical reasoning tasks, where evaluation typically relies on exact string matching of the final solution without additional test information (Li et al., 2024a).

3.1 The S* Framework

S* is a two-stage hybrid test-time scaling framework consisting of Generation and Selection stages, as demonstrated in Fig. 2. It extends parallel sampling with sequential sampling via iterative debugging to improve coverage and employs adaptive input synthesis during selection to enhance selection accuracy, leveraging execution results throughout the process. An example of the effect of the different stages can be found in Fig. 3.

Stage 1: Generation. In the generation stage, S* improves coverage by extending parallel scaling with sequential scaling through iterative debugging grounded in execution feedback. Specifically, S* first generates N initial samples independently, leveraging parallel sampling techniques (Chen et al., 2023). Each sample is then refined through up to R rounds of sequential revision, informed by execution results on public test cases. The revision process halts once a sample passes all public tests or reaches the maximum number of revision attempts.
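A minimal sketch of this generation stage, assuming hypothetical helpers llm_generate, llm_revise, and run_public_tests (an LLM client and a sandboxed interpreter; not the released implementation):

```python
def generation_stage(problem, model, n_samples=16, max_rounds=2):
    """Parallel sampling plus up to `max_rounds` of execution-guided revision.
    Defaults follow the configuration reported in Section 4.1."""
    samples = [llm_generate(model, problem.description) for _ in range(n_samples)]
    for i, code in enumerate(samples):
        for _ in range(max_rounds):
            passed, feedback = run_public_tests(code, problem.public_tests)
            if passed:
                break  # halt once all public tests pass
            # Outputs and/or error messages guide the next revision round.
            code = llm_revise(model, problem.description, code, feedback)
        samples[i] = code
    return samples
```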
Stage 2: Selection. After generating N candidate solutions, the next step is to identify the best one. Since the public tests are already used during the generation stage, additional evaluation is needed for reliable selection. We investigate two existing approaches: (1) LLM-as-a-judge (Zheng et al., 2023), which relies on pre-trained knowledge to compare candidate solutions, and (2) generated test cases (Li et al., 2022; Chen et al., 2022), which use synthesized test cases to guide selection.

Unfortunately, we find that LLM-based judging alone often struggles to predict program behavior accurately, while generated tests frequently fail to provide reliable outputs for grounding the selection or to produce high-quality inputs that effectively distinguish samples (see Tab. 3).

Algorithm 1: Best Sample Selection in S*
  Input: Problem description P; candidate samples X
  Output: The best selected sample x*
  1  T <- llm_test_input_gen(P)
  2  O <- sample_execution(X, T)
  3  C <- sample_clustering(O)
  4  Scores <- 0
  5  for each pair (C_i, C_j) in C do
  6      Sample x_i, x_j from C_i, C_j
  7      T_a <- adaptive_input_gen(x_i, x_j)
  8      better_idx <- exec_and_llm_select(x_i, x_j, T_a)
  9      Scores[better_idx] += 1
  10 end
  11 C* <- arg max(Scores)
  12 x* <- random_pick(C*)
  13 return x*

To overcome these limitations, S* introduces adaptive input synthesis, a hybrid selection approach that integrates LLM evaluation with execution-grounded verification, as illustrated in Algorithm 1. First, we prompt an LLM to synthesize a set of test inputs. We execute these inputs and cluster the N samples based on their execution outputs (Line 1 to Line 3) (Li et al., 2022). We then perform pairwise comparisons across clusters: for each comparison, we prompt the LLM to generate distinguishing inputs, execute both samples using these inputs, and select the superior one based on the execution results (Line 7 to Line 9). This adaptive selection process grounds LLM evaluations in concrete execution feedback, resulting in more reliable and accurate sample selection compared to naive LLM judging or generated-tests-based methods (see §4).
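Read as code, Algorithm 1 looks roughly as follows; the helpers named in the pseudocode (llm_test_input_gen, sample_execution, adaptive_input_gen, exec_and_llm_select) are assumed to wrap an LLM and a sandboxed interpreter:

```python
import random
from collections import defaultdict

def selection_stage(problem, samples):
    """Adaptive input synthesis selection (Algorithm 1), as a sketch."""
    # Lines 1-3: synthesize inputs, run every sample, cluster by output behavior.
    test_inputs = llm_test_input_gen(problem.description)
    by_output = defaultdict(list)
    for x in samples:
        by_output[tuple(sample_execution(x, test_inputs))].append(x)
    clusters = list(by_output.values())
    scores = [0] * len(clusters)
    # Lines 5-9: pairwise, execution-grounded comparisons between clusters.
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            xi, xj = random.choice(clusters[i]), random.choice(clusters[j])
            ta = adaptive_input_gen(xi, xj)           # distinguishing inputs
            better = exec_and_llm_select(xi, xj, ta)  # 0 or 1, judged on real outputs
            scores[i if better == 0 else j] += 1
    # Lines 11-13: return a random sample from the highest-scoring cluster.
    return random.choice(clusters[scores.index(max(scores))])
```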
4 Evaluation

In this section, we evaluate S* across a diverse set of instruction-based and reasoning models, spanning various model families, sizes, and access types (open and closed), as well as multiple benchmarks (Jain et al., 2024; Li et al., 2022). Our key findings demonstrate the generality and effectiveness of S*:

1. S* consistently improves model performance across different families, sizes, and types, and generalizes effectively to multiple code generation benchmarks, including LiveCodeBench (§4.2) and CodeContests (§4.4), showcasing its robustness and broad applicability.

2. S* outperforms existing widely-used test-time scaling methods, including self-debugging (Chen et al., 2023) and majority voting (Wang et al., 2022; Li et al., 2022), by enhancing both coverage and selection accuracy (§4.3).

4.1 Experimental Setup

Models. We consider both instruction-based and reasoning-based models. To compare performance across models of different sizes using S*, we select a series of models within the same family. We experiment with 12 models: (1) Instruction-based models: the Qwen2.5-Coder-Instruct series (0.5B, 1.5B, 3B, 7B, 14B, 32B) and GPT-4o mini (0718 version) (Hui et al., 2024; Achiam et al., 2023); (2) Reasoning-based models: QwQ-32B-Preview, the DeepSeek-R1-Distill-Qwen series (7B, 14B, 32B), and o1-mini (Qwen, 2024; Guo et al., 2025; OpenAI, 2024).

Benchmarks. We primarily use LiveCodeBench (MIT License) as our main evaluation benchmark, given its extensive usage by recent reasoning models and its inclusion of difficulty levels, which help analyze the behavior of different techniques (Jain et al., 2024; DeepSeek, 2024; Qwen, 2024). We use its v4 version for development (e.g., selecting hyper-parameters), which contains problems from August 2024 to November 2024. For final evaluation, we use the v2 version, which does not overlap with v4 and contains more problems. LiveCodeBench (v2) contains 511 problems, ranging from easy (182 problems) and medium (206 problems) to hard (123 problems). In addition, we evaluate S* on CodeContests (Li et al., 2022), a collection of 165 challenging coding problems. We use Pass@1 as our primary metric (Chen et al., 2021). Some experiments report Pass@N with N samples (often referred to as the "oracle" setting) (Stroebl et al., 2024; Brown et al., 2024).

Baselines. Our evaluation considers two categories of baselines. First, we assess our method's improvement over the original model (without test-time scaling), using three leading OpenAI reasoning models—o1-preview, o1-high, and o1-mini (OpenAI, 2024)—as performance benchmarks. Second, we evaluate different test-time scaling methods applied to the same models, encompassing both parallel (i.e., majority voting) and sequential (i.e., self-debugging) approaches.

Implementation Details. We configure S* to generate 16 samples in parallel with a temperature of 0.7 (without top-p sampling) and perform 2 rounds of iterative debugging on public tests. We justify our choice of hyper-parameters in §5. Prompts are automatically generated by a prompting framework, DSPy (Khattab et al., 2023); detailed prompts can be found in Appendix A.2. We run code in a sandbox to guard against maliciously generated code, following Chen et al. (2021). Experiments with the largest model (DeepSeek-R1-Distill-Qwen-32B) take one day on 8 H100 GPUs. All experiments are conducted once.
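For reference, these hyper-parameters can be gathered in one place; the dataclass below is our illustration, not part of the released code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SStarConfig:
    n_samples: int = 16       # parallel samples per problem
    temperature: float = 0.7  # sampling temperature (no top-p sampling)
    debug_rounds: int = 2     # rounds of iterative debugging on public tests
```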
4.2 S* Main Results

Fig. 1 presents a performance comparison on LiveCodeBench with and without S*, alongside the o1-series reasoning models for reference. Our results demonstrate that S* consistently enhances model performance. When applied to models within the same family, S* allows small models to surpass large ones. For example, Qwen2.5-7B-Coder-Instruct integrated with S* outperforms Qwen2.5-32B-Coder-Instruct without S* by 10.1%. Additionally, S* enables instruction-based models to surpass reasoning models, as evidenced by GPT-4o mini (0718) with S* outperforming o1-preview. Moreover, S* further improves strong reasoning models: the most capable open-source reasoning model, DeepSeek-R1-Distill-Qwen-32B, when enhanced with S*, surpasses o1-mini and achieves near state-of-the-art results comparable to o1 (high reasoning effort).
Method          | Qwen2.5-Coder-Instruct             | 4o-mini | R1-Distill        | QwQ  | o1-mini
                | 0.5B  1.5B  3B    7B    14B   32B  |         | 7B    14B   32B   |      |
Zero-Shot       | 1.2   7.0   18.4  29.4  44.8  47.4 | 40.9    | 48.4  63.2  69.1  | 62.1 | 76.5
Majority Vote   | 2.5   11.0  25.2  40.5  50.8  55.9 | 46.6    | 58.7  68.1  75.8  | 67.3 | 81.6
Self-Debugging  | 2.4   9.4   27.8  39.9  51.5  59.5 | 51.7    | 58.4  66.2  70.1  | 59.3 | 79.9
S*              | 10.9  27.6  42.7  54.4  64.6  70.1 | 61.3    | 73.2  82.8  85.7  | 76.3 | 85.3

Table 1: Pass@1 of zero-shot, majority voting (Wang et al., 2022; Li et al., 2022), self-debugging (Chen et al., 2023), and S* on LiveCodeBench (v2). Bold text denotes the best performance. "R1-Distill", "QwQ", and "4o-mini" are short for "DeepSeek-R1-Distill-Qwen" (Guo et al., 2025), "QwQ-32B-Preview" (Qwen, 2024), and "GPT-4o-mini" (Achiam et al., 2023), respectively. S* consistently outperforms other baselines.
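For comparison with Table 1, the majority-voting baseline can be approximated by clustering samples by their execution outputs and returning one from the largest cluster; a sketch reusing the assumed sample_execution helper from §3.1:

```python
from collections import defaultdict

def majority_vote(samples, test_inputs):
    """Self-consistency for code: programs 'vote' through agreeing outputs."""
    by_output = defaultdict(list)
    for x in samples:
        by_output[tuple(sample_execution(x, test_inputs))].append(x)
    # Return one representative of the most common behavior.
    return max(by_output.values(), key=len)[0]
```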
... public tests for every iteration works the best.

3. Selection Policy: We assess the performance of different selection policies, comparing our adaptive input synthesis approach with alternative selection strategies (§5.4). We find that our adaptive input synthesis selection method is consistently more reliable than the generated tests and the LLM judge selection methods.

All ablation experiments are conducted on LiveCodeBench (v4).

5.1 Parallel Sampling Hyper-Parameters

[Figure 4 (charts): Effect of Temperature and Effect of Sample Size on Pass@N; plots omitted.]

Notably, Qwen2.5-Coder-7B-Instruct, the weakest performer at N = 1, shows the largest gain, exceeding 35% at N = 64. Similarly, the more capable reasoning model (QwQ-32B-Preview) follows the same trend, though its gains plateau beyond N = 32. Nevertheless, it improves substantially, rising from 50% at N = 1 to 80% at N = 32. These results confirm that increasing the number of parallel samples is a simple yet effective strategy for enhancing performance in both instruction-following and reasoning-based models.
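The Pass@N ("oracle") curves discussed here can be computed with the standard unbiased estimator of Chen et al. (2021), sketched below:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of 1 - C(n-c, k) / C(n, k): the probability that at
    least one of k samples is correct, given c correct out of n generated."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```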
5.2 Impact of In-Context Examples

[Figure 5 (charts): Pass@N for Qwen2.5-Coder-7B-Inst., Qwen2.5-Coder-32B-Inst., and GPT-4o mini under No ICL, ICL (BM25), and ICL (Patterns); plots omitted.]
... Coder-32B-Instruct. In contrast, ICL (Patterns) outperforms the baseline when n >= 8 for Qwen2.5-Coder-7B-Instruct and n >= 4 for Qwen2.5-Coder-32B-Instruct, while showing comparable performance with GPT-4o mini.

These results highlight that selecting high-quality examples is crucial, and naive retrieval methods often fall short. Although ICL itself is promising, its performance is sensitive to example quality and retrieval effectiveness. We regard it as future work to develop robust ICL techniques that can be integrated into S* to further enhance parallel scaling performance.

5.3 Impact of Iterative Debugging Variants

[Figure 6 (charts): Pass@N over debugging rounds for Qwen2.5-Coder-7B-Inst., GPT-4o mini, and QwQ-32B-Preview under Public Tests, + Generated Tests, and Last Round Context; plots omitted.]

Figure 6: Comparison of three iterative debugging approaches: Public Tests, + Generated Tests, and Last Round Context. Results are obtained with N = 8, temperature = 0.7, and up to four rounds of debugging.

We examine the effectiveness of three variants of iterative debugging: (1) Public Tests: the model iteratively debugs using public tests and stops once the sample passes all of them. (2) + Generated Tests: in addition to public tests, the model continues debugging on model-generated tests, following the algorithm in (Ridnik et al., 2024). (3) Last Round Context: the model iteratively debugs using only public tests, but instead of using code samples from all previous rounds for debugging, it uses only the last round's code sample as context. This is motivated by observations that LLMs may perform sub-optimally when handling large context windows (Liu et al., 2024).

Fig. 6 summarizes the results. We find that: (1) even though reasoning models implicitly perform self-reflection and revision, they benefit from explicit debugging through test execution feedback: the performance of the QwQ-32B-Preview model improves from 72.6 to 74.2 with 2 rounds of debugging. (2) Reducing the context window or considering more model-generated tests does not show consistent improvement: while using only the last round of context improves performance for the Qwen2.5-Coder-7B-Instruct model, it results in worse performance for the other two models. Similarly, incorporating additional model-generated tests does not enhance performance for GPT-4o mini. (3) The benefits of iterative debugging tend to plateau, typically after 2–3 rounds; this finding aligns with the observation that the benefit of sequential scaling flattens out (Muennighoff et al., 2025). Motivated by these findings, we choose to use 2 rounds of debugging, only on public tests, for simplicity, and apply iterative debugging even for reasoning models in §4.2.
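The difference between the default variant and Last Round Context comes down to how the revision prompt is assembled; the sketch below is ours, not the paper's exact prompt format:

```python
def build_debug_context(history, last_round_only=False):
    """history: list of (code, feedback) pairs from earlier debugging rounds.
    Last Round Context keeps only the most recent attempt, trading history
    for a shorter context (cf. Liu et al., 2024, on long-context degradation)."""
    rounds = history[-1:] if last_round_only else history
    return "\n\n".join(
        f"[Round {r} generated code]\n{code}\n[Round {r} test feedback]\n{feedback}"
        for r, (code, feedback) in enumerate(rounds)
    )
```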
5.4 Impact of Different Selection Policies

Model            | Public Only | Generated Tests | LLM Judge | Adaptive Input Synthesis (Ours)
Qwen-Coder-7B    | 42.3        | 42.3            | 42.3      | 44.5
Qwen-Coder-32B   | 54.6        | 57.8            | 55.5      | 58.3
GPT-4o mini      | 54.1        | 55.2            | 56.3      | 57.3
QwQ-32B-Preview  | 64.3        | 65.9            | 68.6      | 69.7
Avg.             | 53.8        | 53.1            | 55.6      | 57.5

Table 3: Pass@1 performance comparison between different selection methods on LiveCodeBench (v4). Bold text denotes the best performance for the same model. "Qwen-Coder" and "R1-Distill" are short for "Qwen2.5-Coder-Instruct" and "DeepSeek-R1-Distill-Qwen". Results are obtained with N = 8 and temperature = 0.7. Our Adaptive Input Synthesis method achieves better accuracy than other methods.

We compare different policies for selecting the best sample after iterative debugging. We evaluate four approaches: (1) Public Only: using only public test cases for selection and randomly selecting a sample if it passes all tests; (2) Generated Tests: applying public test filtering followed by additional test case generation using GPT-4o mini, selecting the sample that passes the most test cases; (3) LLM Judge: applying public test filtering and then using LLMs for pairwise selection among code samples; and (4) Adaptive Input Synthesis: applying the selection algorithm described in §3.1 with GPT-4o mini after public test filtering.

Tab. 3 summarizes the results. Notably, the Generated Tests approach does not yield improvements over the Public Only baseline. This is due to errors in model-generated outputs, which, when applied to poorly chosen inputs, introduce significant noise in the selection process, ultimately degrading performance. In contrast, our Adaptive Input Synthesis method enables the LLM to strategically select an input that best differentiates samples while avoiding the need to predict outputs. By leveraging real execution outputs rather than model predictions, the LLM makes more reliable decisions, leading to improved selection accuracy.
6 Conclusion

We propose S*, the first hybrid test-time scaling framework for code generation that substantially improves both coverage and selection accuracy. S* extends the existing parallel scaling paradigm with sequential scaling through iterative debugging, and incorporates adaptive input synthesis, a novel mechanism that synthesizes distinguishing test inputs to differentiate candidates and identify correct solutions via execution results.

S* consistently improves code generation performance across benchmarks, including LiveCodeBench and CodeContests. Notably, S* enables a 3B model to outperform GPT-4o mini, GPT-4o mini to surpass o1-preview by 3.7% on LiveCodeBench, and DeepSeek-R1-Distill-Qwen-32B to achieve 85.7% on LiveCodeBench, approaching o1-high at 88.5%.

7 Limitations

This work focuses primarily on competition-level code generation and does not study software engineering tasks such as SWE-bench (Jimenez et al., 2023). The method primarily targets improving accuracy rather than minimizing cost.

8 Acknowledgment

This work is funded by the Sky Computing Lab at UC Berkeley. We extend our sincere gratitude to Matei Zaharia and Anastasios Nikolas Angelopoulos for their invaluable feedback. We are grateful for the generous compute resource support from Databricks, Lambda Labs, and Anyscale. In particular, we thank Jonathan Frankle (Databricks) and Chuan Li (Lambda Labs) for facilitating access to these resources.

References

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. 2024. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787.

Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2022. CodeT: Code generation with generated tests. arXiv preprint arXiv:2207.10397.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128.

François Chollet. 2019. On the measure of intelligence. arXiv preprint arXiv:1911.01547.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.

DeepSeek. 2024. DeepSeek-R1-Lite-Preview release. https://api-docs.deepseek.com/news/news1120. Accessed: 2024-11-20.

Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Ré, and Azalia Mirhoseini. 2025. CodeMonkeys: Scaling test-time compute for software engineering. arXiv preprint arXiv:2501.14723.

Zitian Gao, Boye Niu, Xuzheng He, Haotian Xu, Hongzhang Liu, Aiwei Liu, Xuming Hu, and Lijie Wen. 2024. Interpretable contrastive Monte Carlo tree search reasoning. arXiv preprint arXiv:2410.01707.

Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. 2024. CRUXEval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065.

Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. Foundations and trends in programming languages. Bd, 4:1–119.

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021a. Measuring coding challenge competence with APPS. NeurIPS.
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021b. Measuring coding challenge competence with APPS. arXiv preprint arXiv:2105.09938.

Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, and Yuxiao Dong. 2025. Advancing language model reasoning through reinforcement learning and inference scaling. arXiv preprint arXiv:2501.11651.

Baizhou Huang, Shuai Lu, Weizhu Chen, Xiaojun Wan, and Nan Duan. 2023. Enhancing large language models in coding through multi-perspective self-consistency. arXiv preprint arXiv:2309.17272.

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2022. Large language models can self-improve. arXiv preprint arXiv:2210.11610.

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, et al. 2024. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186.

Robert Irvine, Douglas Boubert, Vyas Raina, Adian Liusie, Ziyi Zhu, Vineet Mudupalli, Aliaksei Korshuk, Zongyi Liu, Fritz Cremer, Valentin Assassi, et al. 2023. Rewarding chatbots for real-world engagement with millions of users. arXiv preprint arXiv:2303.06135.

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.

Fangkai Jiao, Geyang Guo, Xingxing Zhang, Nancy F Chen, Shafiq Joty, and Furu Wei. 2024. Preference optimization for reasoning with pseudo feedback. arXiv preprint arXiv:2411.16345.

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. 2023. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714.

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. 2024. Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124.

Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, and Xinyun Chen. 2025. Evolving deeper LLM thinking. arXiv preprint arXiv:2501.09891.

Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G Patil, Matei Zaharia, Joseph E Gonzalez, and Ion Stoica. 2025. LLMs can easily learn to reason from demonstrations: Structure, not content, is what matters! arXiv preprint arXiv:2502.07374.

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. 2024a. NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13:9.

Qingyao Li, Wei Xia, Kounianhua Du, Xinyi Dai, Ruiming Tang, Yasheng Wang, Yong Yu, and Weinan Zhang. 2024b. RethinkMCTS: Refining erroneous thoughts in Monte Carlo tree search for code generation. arXiv preprint arXiv:2409.09584.

Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. 2023. TACO: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852.

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097.

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. arXiv preprint arXiv:2305.20050.

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36:21558–21572.

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2024. Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36.

Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, et al. 2024. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems. arXiv preprint arXiv:2412.09413.
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. 2025. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.

OpenAI. 2024. Learning to reason with LLMs. https://openai.com/index/learning-to-reason-with-llms/. Accessed: 2024-11-20.

Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. 2024. Training software engineering agents and verifiers with SWE-Gym. arXiv preprint arXiv:2412.21139.

Qwen. 2024. QwQ: Reflect deeply on the boundaries of the unknown. https://qwenlm.github.io/blog/qwq-32b-preview/.

Tal Ridnik, Dedy Kredo, and Itamar Friedman. 2024. Code generation with AlphaCodium: From prompt engineering to flow engineering. arXiv preprint arXiv:2401.08500.

NovaSky Team. 2025. Sky-T1: Train your own o1 preview model within 450 dollars. https://novasky-ai.github.io/posts/sky-t1. Accessed: 2025-01-09.

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. 2024a. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

Yifei Wang, Yuyang Wu, Zeming Wei, Stefanie Jegelka, and Yisen Wang. 2024b. A theoretical understanding of self-correction through in-context alignment. arXiv preprint arXiv:2405.18634.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
Question

Example 1:
Input: word = "cbaaaabc", forbidden = ["aaa","cb"]
Output: 4
Explanation: There are 11 valid substrings in word: "c", "b", "a", "ba", "aa", "bc", "baa", "aab", "ab", "abc" and "aabc". The length of the longest valid substring is 4. It can be shown that all other substrings contain either "aaa" or "cb" as a substring.

Example 2:
Input: word = "leetcode", forbidden = ["de","le","e"]
Output: 4
Explanation: There are 11 valid substrings in word: "l", "t", "c", "o", "d", "tc", "co", "od", "tco", "cod", and "tcod". The length of the longest valid substring is 4. It can be shown that all other substrings contain either "de", "le", or "e" as a substring.

Constraints:
1 <= word.length <= 10^5
word consists only of lowercase English letters.
1 <= forbidden.length <= 10^5
1 <= forbidden[i].length <= 10
forbidden[i] consists only of lowercase English letters.
System message:
Your input fields are:
1. `prompt` (str)
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## prompt ## ]]
{prompt}
[[ ## reasoning ## ]]
{reasoning}
[[ ## code ## ]]
{code}
[[ ## completed ## ]]
In adhering to this structure, your objective is: Given the fields `prompt`, produce the fields
`code`.
User message:
[[ ## prompt ## ]]
{Question Prompt}
Code:
[Round 0 Reasoning]: {Round 0 Reasoning}
[Round 0 Generated code]: {Round 0 Generated Code}
[Round 0 Test Feedback]: {Round 0 Test Feedback}
Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then
`[[ ## code ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.
System message:
Your input fields are:
1. `prompt` (str)
Your output fields are:
1. `reasoning` (str)
2. `tests` (str): Generate a complete set of potential inputs to test an AI-generated solution to
the coding problem. Cover: (i) Edge cases, such as empty string or arrays, (ii) Complex and
difficult inputs, but do not include very long inputs. (iii) Other ones that can maximize the
chance of catching a bug. Provide the input and output in JSON format as follows: {"input":
<example_input>, "output": <expected_output>} Ensure the input and output match the types
and structure expected for the problem. Do not include any additional text or explanations, just
the JSON object.
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## prompt ## ]] {prompt}
[[ ## reasoning ## ]] {reasoning}
[[ ## tests ## ]] {tests}
[[ ## completed ## ]]
In adhering to this structure, your objective is: Given the fields `prompt`, produce the fields
`tests`.
User message:
[[ ## prompt ## ]] {Question Prompt}
Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then
`[[ ## tests ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.
System message:
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## prompt ## ]]
{prompt}
[[ ## reasoning ## ]]
{reasoning}
[[ ## code ## ]]
{code}
[[ ## completed ## ]]
User message:
[[ ## prompt ## ]]
{Question Prompt}
Code:
Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then
`[[ ## code ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.
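The [[ ## field ## ]] layout in these templates is the message format produced by DSPy (§4.1). A signature along the following lines (our sketch, not the authors' released code) would yield a similar template:

```python
import dspy

class GenerateCode(dspy.Signature):
    """Given the fields `prompt`, produce the fields `code`."""
    prompt: str = dspy.InputField()
    reasoning: str = dspy.OutputField()
    code: str = dspy.OutputField()

generate = dspy.Predict(GenerateCode)
# prediction = generate(prompt=question_prompt)  # exposes .reasoning and .code
```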