ReProbe replaces bulky Process Reward Models with a <10M-parameter probe that reads directly from the LLM's own hidden states — achieving comparable or better step-level verification at a fraction of the cost.
To load a checkpoint, first set up the target LLM following the GitHub instructions, then attach the probe weights as described in the loading guide.
ReProbe training & inference overview. An annotator LLM labels step correctness on CoT traces (left). ReProbe is trained on internal signals from the frozen target LLM (middle). At inference, the tiny probe (<10M params) replaces a 1.5–8B PRM (right).
We train a tiny (<10M param) probe on LLM-internal signals (hidden states, attention, logits) to verify reasoning steps, matching or beating PRMs that are 750–810× larger, with better out-of-domain generalization and up to 25× faster inference.
Test-time scaling improves LLM reasoning by searching over multiple candidate trajectories, but it requires a reliable step-level verifier. Existing Process Reward Models (PRMs) are large, expensive to train, and often fail to generalize beyond their training domain. ReProbe offers a lightweight alternative: instead of training a separate LLM-scale verifier, we attach a small probe directly to the frozen target LLM and read out its internal hidden states as rich features for verification. Trained on labels from a larger annotator LLM or with a fully self-supervised objective, ReProbe achieves competitive step-level correctness prediction across mathematics, planning, and question-answering benchmarks, while being orders of magnitude smaller and faster.
A frozen LLM's internal signals are all you need — no PRM-scale model required.
Target LLM produces multi-step chain-of-thought traces. Its weights remain frozen throughout.
The LLM's internal hidden states are captured per token at each reasoning step boundary.
A tiny transformer encoder aggregates signals and outputs a step correctness probability.
Scores guide Best-of-N selection or Beam Search to steer toward the best final answer.
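A minimal sketch of this pipeline, assuming a Qwen3-8B target, blank-line step delimiters, and illustrative probe hyperparameters (this is not the official ReProbe code; the paper also probes attention and logits, while this sketch uses last-layer hidden states only):

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
llm.eval()  # the target LLM stays frozen throughout

class StepProbe(nn.Module):
    """Tiny (<10M param) transformer encoder over one step's token states."""
    def __init__(self, d_llm=4096, d_probe=256, n_layers=2, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(d_llm, d_probe)
        layer = nn.TransformerEncoderLayer(
            d_probe, n_heads, dim_feedforward=4 * d_probe, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_probe, 1)

    def forward(self, token_states):  # (1, step_len, d_llm)
        h = self.encoder(self.proj(token_states))
        return torch.sigmoid(self.head(h.mean(dim=1)))  # step correctness probability

@torch.no_grad()
def score_steps(trace: str, probe: StepProbe) -> list[float]:
    """Score each blank-line-delimited reasoning step of a complete trace."""
    enc = tok(trace, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    states = llm(**enc, output_hidden_states=True).hidden_states[-1][0]
    scores, pos = [], 0
    for step in trace.split("\n\n"):
        lo, hi = pos, pos + len(step)
        idx = [i for i, (a, b) in enumerate(offsets) if a >= lo and b <= hi and b > a]
        if idx:  # map the step's character span to its token states
            span = states[idx[0] : idx[-1] + 1].unsqueeze(0).float()
            scores.append(probe(span).item())
        pos = hi + 2  # skip the "\n\n" delimiter
    return scores
```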
Two test-time scaling strategies. Best-of-N samples N complete trajectories and selects the highest-scoring one. Beam Search maintains the top-B partial trajectories at each step, pruning with ReProbe scores.
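Given the step scorer above, Best-of-N reduces to a few lines. A sketch, where log-product aggregation of step scores is an assumed choice rather than necessarily the paper's:

```python
import math

def best_of_n(candidates: list[str], probe: StepProbe) -> str:
    """Given N sampled complete trajectories, return the probe's top pick."""
    def traj_score(trace: str) -> float:
        # aggregate per-step probabilities as a log-product for numerical stability
        return sum(math.log(p + 1e-9) for p in score_steps(trace, probe))
    return max(candidates, key=traj_score)
```

Beam Search is analogous: after each generated step, keep only the top-B partial trajectories by probe score and extend those.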
Internal hidden states from the frozen LLM — extracted without any modification to the base model (see the hook sketch below).
Labels come from a larger annotator LLM (DeepSeek-R1) or a fully self-supervised objective — no human annotation needed.
Best-of-N picks the highest-scoring complete trajectory; Beam Search steers generation step by step.
No changes to the base LLM. ReProbe serves as a lightweight PRM — supervise target models without retraining the probe.
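One way to read internal states without touching the base model is a forward hook. A sketch reusing `llm` and `tok` from the pipeline code above; the choice of layer is an assumption and varies by architecture:

```python
captured = {}

def hook(module, inputs, output):
    # decoder layers return a tuple; output[0] holds the (batch, seq, d_llm) states
    captured["states"] = output[0].detach()

handle = llm.model.layers[-4].register_forward_hook(hook)  # assumed layer choice
_ = llm(**tok("Step 1: 2 + 2 = 4.", return_tensors="pt"))
handle.remove()  # the base model is left exactly as it was
```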
ReProbe (hidden states) vs. best competing PRM across three model families and three domains. All values are PR-AUC averaged within each domain group.
Planning avg: Trips + Meetings + Calendar. QA avg: StrQA + SciQA. Math shown as MATH benchmark. Full tables in the paper.
Qwen3-8B accuracy when using ReProbe (hidden states) scores to select answers. Averages across domain groups; full per-dataset breakdowns in the paper.
Best PRM = Qwen2.5-Math-7B-PRM800K for both BoN and Beam Search. SciQA omitted from Beam Search (not evaluated). Full tables in the paper.
Trained only on mathematics, ReProbe transfers to planning and QA — domains where PRMs collapse out-of-distribution.
GSM8K, MATH500, MathBench. Step annotations from PRM800K training set.
Blocksworld & logistics tasks requiring multi-step sequential reasoning — never seen at train time.
ScienceQA, CommonsenseQA — diverse knowledge domains requiring factual and causal reasoning.
ReProbe is robust, composable, and architecturally well-motivated.
ReProbe maintains consistent ROC-AUC across all reasoning chain length bins — including very long chains where external PRMs lose accuracy. Internal hidden states provide a reliable signal regardless of how many steps the model has taken.
Combining ReProbe scores with a PRM (product of scores) consistently outperforms either alone — they capture complementary signals.
| Method | MATH | GSM8K | ProofNet |
|---|---|---|---|
| PRM only | .586 | .613 | .301 |
| ReProbe only | .529 | .594 | .260 |
| ReProbe + PRM | .613 | .674 | .318 |
PR-AUC on Math (ID). ReProbe + PRM = product of scores.
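The combination is a per-step product of the two verifiers' probabilities. A tiny sketch with illustrative numbers:

```python
reprobe_scores = [0.91, 0.78, 0.40]  # per-step probe probabilities (illustrative)
prm_scores = [0.85, 0.60, 0.55]      # per-step PRM probabilities (illustrative)
combined = [r * p for r, p in zip(reprobe_scores, prm_scores)]
# -> [0.7735, 0.468, 0.22]; selection then ranks on the combined scores
```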
Step-level aggregation (aggregating token states within a step) consistently outperforms token-level probing, which in turn beats simple linear probes.
Overall PR-AUC (Qwen3-8B, all datasets). Higher = better step detection.
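A sketch of the three designs being compared, with a shared illustrative linear scorer (shapes and pooling are assumptions):

```python
import torch

d_llm = 4096
token_states = torch.randn(12, d_llm)  # hidden states for one 12-token step
w = torch.randn(d_llm)                 # illustrative probe weights

# (1) simple linear probe: score a single token's state
p_linear = torch.sigmoid(token_states[-1] @ w)
# (2) token-level probing: score every token, then average the scores
p_token = torch.sigmoid(token_states @ w).mean()
# (3) step-level aggregation: pool the token states first, then score the pooled state
p_step = torch.sigmoid(token_states.mean(dim=0) @ w)
```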
Dramatically smaller and faster, with no loss in verification quality.
| Method | Params | Relative Size | Training Signal | OOD Generalization | Speed (Beam Search) |
|---|---|---|---|---|---|
| Skywork-PRM-7B | 7B | 700–800× | Human labels + MC rollouts | Poor | 1× (baseline) |
| Math-Shepherd-8B | 8B | 810× | Monte-Carlo rollouts | Poor | 1× |
| PRM (1.5B) | 1.5B | 150× | Human labels | Limited | ~2.6× |
| ReProbe (Ours) | <10M | 1× | Self-supervised / LLM annotator | Best | 2.6–25× |
@inproceedings{ni2025reprobe,
title = {Efficient Test-Time Scaling of Multi-Step Reasoning
by Probing Internal States of Large Language Models},
author = {Ni, Jingwei and Fadeeva, Ekaterina and Wu, Tianyi and
Akhtar, Mubashara and Zhang, Jiaheng and Ash, Elliott and
Leippold, Markus and Baldwin, Timothy and Ng, See-Kiong and
Shelmanov, Artem and Sachan, Mrinmaya},
booktitle = {Proceedings of the 64th Annual Meeting of the
Association for Computational Linguistics},
year = {2026}
}