ACL 2026 · Long Paper

Efficient Test-Time Scaling of Multi-Step Reasoning
by Probing Internal States of Large Language Models

ReProbe replaces bulky Process Reward Models with a <10M-parameter probe that reads directly from the LLM's own hidden states — achieving comparable or better step-level verification at a fraction of the cost.

Jingwei Ni1† Ekaterina Fadeeva1† Tianyi Wu1† Mubashara Akhtar3 Jiaheng Zhang2 Elliott Ash1 Markus Leippold4 Timothy Baldwin3,5 See-Kiong Ng2 Artem Shelmanov3‡ Mrinmaya Sachan1‡
1 ETH Zürich  ·  2 National University of Singapore  ·  3 MBZUAI  ·  4 University of Zurich  ·  5 University of Melbourne
† Equal contribution  ·  ‡ Co-supervision
[Figure] ReProbe overview: LLM generates CoTs, annotator labels steps, ReProbe trains on frozen LLM internal states, then monitors at inference.

ReProbe training & inference overview. An annotator LLM labels step correctness on CoT traces (left). ReProbe is trained on internal signals from the frozen target LLM (middle). At inference, the tiny probe (<10M params) replaces a 1.5–8B PRM (right).

TL;DR

We train a tiny (<10M param) probe on LLM-internal signals (hidden states, attention, logits) to verify reasoning steps, matching or beating PRMs that are 750–810× larger, with better out-of-domain generalization and up to 25× faster inference.

Test-time scaling improves LLM reasoning by searching over multiple candidate trajectories, but it requires a reliable step-level verifier. Existing Process Reward Models (PRMs) are large, expensive to train, and often fail to generalize beyond their training domain. ReProbe offers a lightweight alternative: instead of training a separate LLM-scale verifier, we attach a small probe directly to the frozen target LLM and read its hidden states as rich verification signals. Trained with labels from a larger annotator LLM or with a fully self-supervised objective, ReProbe achieves competitive step-level correctness prediction across mathematics, planning, and question-answering benchmarks while being orders of magnitude smaller and faster.

Why ReProbe?
<10M parameters  ·  vs. 1.5–8B for competing PRMs
810× smaller  ·  than the largest PRM it outperforms
25× faster  ·  beam search vs. state-of-the-art PRMs
3 domains tested  ·  math, planning, QA (incl. out-of-domain)
How ReProbe Works

A frozen LLM's internal signals are all you need — no PRM-scale model required.

Step 1 · LLM Generates
The target LLM produces multi-step chain-of-thought traces. Its weights remain frozen throughout.

Step 2 · Extract Hidden States
The LLM's internal hidden states are captured for every token and grouped by reasoning-step boundaries.

Step 3 · Probe Predicts
A tiny transformer encoder aggregates these signals and outputs a step-correctness probability.

Step 4 · Scale at Test Time
Probe scores guide Best-of-N selection or Beam Search to steer toward the best final answer.
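
A minimal sketch of Steps 1–3, assuming a Hugging Face causal LM; the model name, the pre-segmented step list, and per-step mean pooling are illustrative assumptions, not the paper's exact configuration:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-8B"  # assumed target LLM; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL)
llm = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
llm.eval()  # Step 1: the base model stays frozen throughout

@torch.no_grad()
def step_vectors(steps: list[str], layer: int = -1) -> torch.Tensor:
    """Step 2: one mean-pooled hidden-state vector per reasoning step."""
    ids_per_step = [tok(s, add_special_tokens=False).input_ids for s in steps]
    flat = torch.tensor([sum(ids_per_step, [])])  # whole trace, one forward pass
    hidden = llm(flat, output_hidden_states=True).hidden_states[layer][0]
    vecs, start = [], 0
    for ids in ids_per_step:  # pool each step's token states (input to Step 3)
        vecs.append(hidden[start:start + len(ids)].mean(dim=0))
        start += len(ids)
    return torch.stack(vecs)  # (num_steps, hidden_dim), fed to the tiny probe

The returned step vectors are what the small transformer-encoder probe consumes; a sketch of that probe appears under Deeper Insights below.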

[Figure] Illustration of Best-of-N and Beam Search test-time scaling methods.

Two test-time scaling strategies. Best-of-N samples N complete trajectories and selects the highest-scoring one. Beam Search maintains the top-B partial trajectories at each step, pruning with ReProbe scores.
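
A hedged sketch of both strategies as plain Python, with the generator and the probe abstracted behind callables; the min-over-steps aggregation for Best-of-N and the last-step pruning score for Beam Search are assumptions for illustration:

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    steps: List[str]   # reasoning steps so far
    done: bool         # whether a final answer has been produced

def best_of_n(sample: Callable[[], Trajectory],
              score_step: Callable[[str], float], n: int = 10) -> Trajectory:
    """Sample N complete trajectories; keep the one whose weakest step scores best."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=lambda t: min(map(score_step, t.steps)))

def beam_search(expand: Callable[[Trajectory], List[Trajectory]],
                score_step: Callable[[str], float],
                beam: int = 5, width: int = 5, max_depth: int = 30) -> Trajectory:
    """Keep the top-B partial trajectories, pruning with per-step probe scores."""
    frontier = [Trajectory([], False)]
    for _ in range(max_depth):
        pool = [t for t in frontier if t.done]
        for t in frontier:
            if not t.done:
                pool.extend(expand(t)[:width])  # propose `width` candidate next steps
        frontier = sorted(pool, reverse=True,
                          key=lambda t: score_step(t.steps[-1]) if t.steps else 0.0)[:beam]
        if all(t.done for t in frontier):
            break
    return frontier[0]

Because the probe only scores steps, either search loop works unchanged with any step-level verifier; swapping a PRM for ReProbe changes only `score_step`.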

Hidden State Features

Internal hidden states from the frozen LLM — extracted without any modification to the base model.

Self-Supervised Training

Labels come from a larger annotator LLM (DeepSeek-R1) or from a fully self-supervised objective; no human annotation is needed.

Two Search Strategies

Best-of-N picks the highest-scoring complete trajectory; Beam Search steers generation step by step.

Plug-and-Play

No changes to the base LLM. ReProbe serves as a lightweight PRM that supervises the target model, even on new domains, without retraining the probe.

Results at a Glance

ReProbe (hidden states) vs. the best competing PRM and the best uncertainty-quantification (UQ) baseline across four target models and three domains. All values are PR-AUC averaged within each domain group.

Qwen3-8B · Self-annotated (ReProbe avg 0.604)
📐 Math ID       ReProbe .558 · Best PRM .586 · Best UQ .257
Planning OOD     ReProbe .801 · Best PRM .734 · Best UQ .556
💬 QA OOD        ReProbe .467 · Best PRM .383 · Best UQ .188

Qwen3-1.7B · Native Thinking · Self-annotated (ReProbe avg 0.495)
📐 Math ID       ReProbe .417 · Best PRM .248
Planning OOD     ReProbe .653 · Best PRM .607
💬 QA OOD        ReProbe .411 · Best PRM .288

Phi-4 · DeepSeek-annotated (ReProbe avg 0.497)
📐 Math ID       ReProbe .442 · Best PRM .474
Planning OOD     ReProbe .686 · Best PRM .664
💬 QA OOD        ReProbe .393 · Best PRM .342

Qwen3-32B · Native Thinking · GPT-OSS-annotated (ReProbe avg 0.585)
📐 Math ID       ReProbe .676 · Best PRM .668
Planning OOD     ReProbe .728 · Best PRM .677
💬 QA OOD        ReProbe .303 · Best PRM .314
Planning avg: Trips + Meetings + Calendar.  QA avg: StrQA + SciQA.  Math shown as MATH benchmark. Full tables in the paper.

Best-of-N & Beam Search

Qwen3-8B accuracy when using ReProbe (hidden states) scores to select answers. Averages across domain groups; full per-dataset breakdowns in the paper.

pass@1 = accuracy with no search.

Best-of-N (N = 10) · Qwen3-8B · Self-annotated · Overall 62.3%
📐 Math ID (avg MATH + GSM8k + ProofNet)         ReProbe 89.7 · Best PRM 89.2 · pass@1 87.4
Planning OOD (avg Trips + Meetings + Calendar)   ReProbe 15.0 · Best PRM 13.1 · pass@1 12.4
💬 QA OOD (avg StrQA + SciQA)                    ReProbe 92.0 · Best PRM 91.6 · pass@1 89.8

Beam Search (B = 5, N = 5) · Qwen3-8B · DeepSeek-annotated · 2.6–25× faster than PRM · Overall 76.6%
📐 Math ID (avg MATH + GSM8k + ProofNet)         ReProbe 93.7 · Best PRM 88.5
Planning OOD (avg Trips + Meetings + Calendar)   ReProbe 52.7 · Best PRM 48.3
💬 QA OOD (StrQA)                                ReProbe 96.7 · Best PRM 91.2
Best PRM = Qwen2.5-Math-7B-PRM800K for both Best-of-N and Beam Search. SciQA omitted from Beam Search (not evaluated). Full tables in the paper.

Domains Evaluated

Trained only on mathematics, ReProbe transfers to planning and QA — domains where PRMs collapse out-of-distribution.

Mathematics · In-Domain
GSM8K, MATH500, MathBench. Step annotations from the PRM800K training set.

Planning · Out-of-Domain
Blocksworld and logistics tasks requiring multi-step sequential reasoning; never seen at training time.

Question Answering · Out-of-Domain
ScienceQA, CommonsenseQA: diverse knowledge domains requiring factual and causal reasoning.

Deeper Insights

ReProbe is robust, composable, and architecturally well-motivated.

Robust Across Reasoning Chain Lengths
[Figure] ROC-AUC across reasoning-chain-length bins for ReProbe and PRM models.

Stable at Any Chain Length

ReProbe maintains consistent ROC-AUC across all reasoning chain length bins — including very long chains where external PRMs lose accuracy. Internal hidden states provide a reliable signal regardless of how many steps the model has taken.

Complementary to PRMs

Combining ReProbe scores with a PRM (product of scores) consistently outperforms either alone — they capture complementary signals.

Method          MATH   GSM8k  ProofNet
PRM only        .586   .613   .301
ReProbe only    .529   .594   .260
ReProbe + PRM   .613   .674   .318

PR-AUC on Math (ID). ReProbe + PRM = product of scores.
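
The combined row above is simply the elementwise product of the two verifiers' per-step probabilities; as a one-line reference (the function name is assumed):

def combine(probe_p: list[float], prm_p: list[float]) -> list[float]:
    """Elementwise product of per-step probabilities from ReProbe and a PRM."""
    return [p * q for p, q in zip(probe_p, prm_p)]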

Step-Level Aggregation Matters

Step-level aggregation (aggregating token states within a step) consistently outperforms token-level probing, which in turn beats simple linear probes.

Step-level    .604
Token-level   .580
Linear Probe  .509

Overall PR-AUC (Qwen3-8B, all datasets). Higher = better step detection.
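
To make the comparison concrete, here is a hedged sketch of the weakest and strongest variants; all dimensions and layer counts are illustrative (roughly 3M parameters at these sizes, well under the <10M budget), not the paper's exact configuration:

import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Baseline: logistic regression on a single hidden-state vector."""
    def __init__(self, llm_dim: int = 4096):
        super().__init__()
        self.w = nn.Linear(llm_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (steps, llm_dim)
        return torch.sigmoid(self.w(x)).squeeze(-1)

class StepProbe(nn.Module):
    """Step-level variant: mean-pool token states within each step, then run
    a small transformer encoder across steps before classifying each one."""
    def __init__(self, llm_dim: int = 4096, d: int = 256,
                 layers: int = 2, heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(llm_dim, d)
        block = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.enc = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(d, 1)

    def forward(self, token_states: list[torch.Tensor]) -> torch.Tensor:
        # token_states[i]: (num_tokens_i, llm_dim) for reasoning step i
        pooled = torch.stack([h.mean(dim=0) for h in token_states])
        z = self.enc(self.proj(pooled).unsqueeze(0))  # contextualize across steps
        return torch.sigmoid(self.head(z)).squeeze(0).squeeze(-1)  # P(correct) per step

A token-level variant would instead score every token state individually and pool the resulting scores, which is the middle setting in the comparison above.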

ReProbe vs. Process Reward Models

Dramatically smaller and faster, with no loss in verification quality.

Method             Params  Relative Size  Training Signal                   OOD Generalization  Speed (Beam Search)
Skywork-PRM-7B     7B      700–800×       Human labels + MC rollouts        Poor                1× (baseline)
Math-Shepherd-8B   8B      810×           Monte-Carlo rollouts              Poor                n/a
PRM (1.5B)         1.5B    150×           Human labels                      Limited             ~2.6×
ReProbe (ours)     <10M    1×             Self-supervised / LLM annotator   Best                2.6–25×
BibTeX
@inproceedings{ni2026reprobe,
  title     = {Efficient Test-Time Scaling of Multi-Step Reasoning
               by Probing Internal States of Large Language Models},
  author    = {Ni, Jingwei and Fadeeva, Ekaterina and Wu, Tianyi and
               Akhtar, Mubashara and Zhang, Jiaheng and Ash, Elliott and
               Leippold, Markus and Baldwin, Timothy and Ng, See-Kiong and
               Shelmanov, Artem and Sachan, Mrinmaya},
  booktitle = {Proceedings of the 64th Annual Meeting of the
               Association for Computational Linguistics},
  year      = {2026}
}