ReProbe replaces bulky Process Reward Models with a <10M-parameter probe that reads directly from the LLM's own hidden states — achieving comparable or better step-level verification at a fraction of the cost.
To load a checkpoint, first set up the target LLM following the GitHub instructions, then attach the probe weights as described in the loading guide.
ReProbe training & inference overview. An annotator LLM labels step correctness on CoT traces (left). ReProbe is trained on internal signals from the frozen target LLM (middle). At inference, the tiny probe (<10M params) replaces a 1.5–8B PRM (right).
We train a tiny (<10M param) probe on LLM-internal signals (hidden states, attention, logits) to verify reasoning steps, matching or beating PRMs that are 750–810× larger, with better out-of-domain generalization and up to 25× faster inference.
Test-time scaling improves LLM reasoning by searching over multiple candidate trajectories, but it requires a reliable step-level verifier. Existing Process Reward Models (PRMs) are large, expensive to train, and often fail to generalize beyond their training domain. ReProbe offers a lightweight alternative: instead of training a separate LLM-scale verifier, we attach a small probe directly to the frozen target LLM and read out its internal hidden states as rich features for verification. Trained on labels from a larger annotator LLM or with a fully self-supervised objective, ReProbe achieves competitive step-level correctness prediction across mathematics, planning, and question-answering benchmarks, while being orders of magnitude smaller and faster.
A frozen LLM's internal signals are all you need — no PRM-scale model required.
Target LLM produces multi-step chain-of-thought traces. Its weights remain frozen throughout.
The LLM's internal hidden states are captured per token at each reasoning step boundary.
A tiny transformer encoder aggregates signals and outputs a step correctness probability.
Scores guide Best-of-N selection or Beam Search to steer toward the best final answer.
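A minimal sketch of this pipeline, assuming a Qwen3-8B target, blank-line step delimiters, and illustrative probe hyperparameters (this is not the official ReProbe code; the paper also probes attention and logits, while this sketch uses last-layer hidden states only):

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
llm.eval()  # the target LLM stays frozen throughout

class StepProbe(nn.Module):
    """Tiny (<10M param) transformer encoder over one step's token states."""
    def __init__(self, d_llm=4096, d_probe=256, n_layers=2, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(d_llm, d_probe)
        layer = nn.TransformerEncoderLayer(
            d_probe, n_heads, dim_feedforward=4 * d_probe, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_probe, 1)

    def forward(self, token_states):  # (1, step_len, d_llm)
        h = self.encoder(self.proj(token_states))
        return torch.sigmoid(self.head(h.mean(dim=1)))  # step correctness probability

@torch.no_grad()
def score_steps(trace: str, probe: StepProbe) -> list[float]:
    """Score each blank-line-delimited reasoning step of a complete trace."""
    enc = tok(trace, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    states = llm(**enc, output_hidden_states=True).hidden_states[-1][0]
    scores, pos = [], 0
    for step in trace.split("\n\n"):
        lo, hi = pos, pos + len(step)
        idx = [i for i, (a, b) in enumerate(offsets) if a >= lo and b <= hi and b > a]
        if idx:  # map the step's character span to its token states
            span = states[idx[0] : idx[-1] + 1].unsqueeze(0).float()
            scores.append(probe(span).item())
        pos = hi + 2  # skip the "\n\n" delimiter
    return scores
```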
Two test-time scaling strategies. Best-of-N samples N complete trajectories and selects the highest-scoring one. Beam Search maintains the top-B partial trajectories at each step, pruning with ReProbe scores.
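Given the step scorer above, Best-of-N reduces to a few lines. A sketch, where log-product aggregation of step scores is an assumed choice rather than necessarily the paper's:

```python
import math

def best_of_n(candidates: list[str], probe: StepProbe) -> str:
    """Given N sampled complete trajectories, return the probe's top pick."""
    def traj_score(trace: str) -> float:
        # aggregate per-step probabilities as a log-product for numerical stability
        return sum(math.log(p + 1e-9) for p in score_steps(trace, probe))
    return max(candidates, key=traj_score)
```

Beam Search is analogous: after each generated step, keep only the top-B partial trajectories by probe score and extend those.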
Internal hidden states from the frozen LLM — extracted without any modification to the base model (see the hook sketch below).
Labels come from a larger annotator LLM (DeepSeek-R1) or a fully self-supervised objective — no human annotation needed.
Best-of-N picks the highest-scoring complete trajectory; Beam Search steers generation step by step.
No changes to the base LLM. ReProbe serves as a lightweight PRM — supervise target models without retraining the probe.
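One way to read internal states without touching the base model is a forward hook. A sketch reusing `llm` and `tok` from the pipeline code above; the choice of layer is an assumption and varies by architecture:

```python
captured = {}

def hook(module, inputs, output):
    # decoder layers return a tuple; output[0] holds the (batch, seq, d_llm) states
    captured["states"] = output[0].detach()

handle = llm.model.layers[-4].register_forward_hook(hook)  # assumed layer choice
_ = llm(**tok("Step 1: 2 + 2 = 4.", return_tensors="pt"))
handle.remove()  # the base model is left exactly as it was
```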
ReProbe (hidden states) vs. best competing PRM across three model families and three domains. All values are PR-AUC averaged within each domain group.
Planning avg: Trips + Meetings + Calendar. QA avg: StrQA + SciQA. Math shown as MATH benchmark. Full tables in the paper.
Qwen3-8B accuracy when using ReProbe (hidden states) scores to select answers. Averages across domain groups; full per-dataset breakdowns in the paper.
Best PRM = Qwen2.5-Math-7B-PRM800K for both BoN and Beam Search. SciQA omitted from Beam Search (not evaluated). Full tables in the paper.
Trained only on mathematics, ReProbe transfers to planning and QA — domains where PRMs collapse out-of-distribution.
GSM8K, MATH500, MathBench. Step annotations from PRM800K training set.
Blocksworld & logistics tasks requiring multi-step sequential reasoning — never seen at train time.
ScienceQA, CommonsenseQA — diverse knowledge domains requiring factual and causal reasoning.
ReProbe is robust, composable, and architecturally well-motivated.
ReProbe maintains consistent ROC-AUC across all reasoning chain length bins — including very long chains where external PRMs lose accuracy. Internal hidden states provide a reliable signal regardless of how many steps the model has taken.
Combining ReProbe scores with a PRM (product of scores) consistently outperforms either alone — they capture complementary signals.
| Method | MATH | GSM8K | ProofNet |
|---|---|---|---|
| PRM only | .586 | .613 | .301 |
| ReProbe only | .529 | .594 | .260 |
| ReProbe + PRM | .613 | .674 | .318 |
PR-AUC on Math (ID). ReProbe + PRM = product of scores.
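The combination is a per-step product of the two verifiers' probabilities. A tiny sketch with illustrative numbers:

```python
reprobe_scores = [0.91, 0.78, 0.40]  # per-step probe probabilities (illustrative)
prm_scores = [0.85, 0.60, 0.55]      # per-step PRM probabilities (illustrative)
combined = [r * p for r, p in zip(reprobe_scores, prm_scores)]
# -> [0.7735, 0.468, 0.22]; selection then ranks on the combined scores
```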
Step-level aggregation (aggregating token states within a step) consistently outperforms token-level probing, which in turn beats simple linear probes.
Overall PR-AUC (Qwen3-8B, all datasets). Higher = better step detection.
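A sketch of the three designs being compared, with a shared illustrative linear scorer (shapes and pooling are assumptions):

```python
import torch

d_llm = 4096
token_states = torch.randn(12, d_llm)  # hidden states for one 12-token step
w = torch.randn(d_llm)                 # illustrative probe weights

# (1) simple linear probe: score a single token's state
p_linear = torch.sigmoid(token_states[-1] @ w)
# (2) token-level probing: score every token, then average the scores
p_token = torch.sigmoid(token_states @ w).mean()
# (3) step-level aggregation: pool the token states first, then score the pooled state
p_step = torch.sigmoid(token_states.mean(dim=0) @ w)
```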
Dramatically smaller and faster, with no loss in verification quality.
| Method | Params | Relative Size | Training Signal | OOD Generalization | Speed (Beam Search) |
|---|---|---|---|---|---|
| Skywork-PRM-7B | 7B | 700–800× | Human labels + MC rollouts | Poor | 1× (baseline) |
| Math-Shepherd-8B | 8B | 810× | Monte-Carlo rollouts | Poor | 1× |
| PRM (1.5B) | 1.5B | 150× | Human labels | Limited | ~2.6× |
| ReProbe (Ours) | <10M | 1× | Self-supervised / LLM annotator | Best | 2.6–25× |
@inproceedings{ni2025reprobe,
title = {Efficient Test-Time Scaling of Multi-Step Reasoning
by Probing Internal States of Large Language Models},
author = {Ni, Jingwei and Fadeeva, Ekaterina and Wu, Tianyi and
Akhtar, Mubashara and Zhang, Jiaheng and Ash, Elliott and
Leippold, Markus and Baldwin, Timothy and Ng, See-Kiong and
Shelmanov, Artem and Sachan, Mrinmaya},
booktitle = {Proceedings of the 64th Annual Meeting of the
Association for Computational Linguistics},
year = {2026}
}