Figure 1. LiteReason alternates between the model's normal discrete sampling and latent reasoning. In discrete mode, the LM Head samples a token and its embedding is fed back as the next input. If the sampled token is an implicit-thought tag such as <bot> (shorthand for our <implicit_thought> tag) with a thought budget, generation switches to latent mode: the Reasoning Projector maps the last hidden state to a continuous embedding, feeds that embedding back for the budgeted number of steps, and then returns to discrete sampling. The model can switch between these modes multiple times before producing the final answer.
LiteReason keeps ordinary LM Head sampling but adds a Reasoning Projector. When the model emits an implicit-thought tag with a step budget, the projector predicts continuous token embeddings directly from the final hidden state before generation returns to text.
We train in three stages: collect useful traces, initialize the projector with SFT, then run RL while treating only discrete token sampling as policy actions. After each RL epoch, we refresh the projector on trajectories from the current policy. See our paper for the full training recipe and inference procedure.
Illustrative example of LiteReason inference on a Flawed Fictions-style input. Discrete tokens are sampled normally; an <implicit_thought>n</implicit_thought> tag switches the model into latent mode for n steps (the pulsing dots) before discrete generation resumes.
Most latent reasoning methods are developed on math and synthetic logic benchmarks, where a single reasoning step is short and formulaic: an equation, or a one-line rule. The narrative tasks we study are different. A reasoning step must track characters, plot constraints, and information spread across thousands of tokens of context, so steps are longer, more varied, and harder to compress into latent tokens.
Example reasoning steps. Steps from common latent-reasoning benchmarks (top) are short and formulaic; steps from our narrative tasks (bottom) are longer, reference long-range context, and vary in structure.
| Dataset | Example Reasoning Step |
|---|---|
| GSM8K-Aug | The helmet costs $15 Γ 2 = $30. |
| ProsQA | Every bompus is a wumpus. |
| ProntoQA | Each vumpus is mean. |
| Flawed Fictions | The continuity error occurs because the story earlier establishes that the little girl was very poor and had no room to live in or bed to sleep in, but later it states that she returned to her small bed in the shelter with her newfound wealth. |
| Next Chapter Prediction | <citation>Source A (Character Sheet: Rose) says Rose is rebellious, disobedient, and has a sarcastic sense of humor.</citation>, therefore <reasoning>Rose will likely continue to challenge authority figures and express her opinions, possibly provoking Miss Wellwood and leading to a confrontation.</reasoning> |
These narrative steps are what the Reasoning Projector must learn to skip: each latent thought stands in for a sentence-level step like the ones above, not a single short equation.
We evaluate on two narrative tasks: Flawed Fictions, a 414-example plot-hole detection benchmark, and Next Chapter Prediction (NCP), a 1,347-example book-planning task. The plots below show the main performance-compute tradeoff: better methods move right, and cheaper methods move down. LiteReason is the only latent-reasoning method that moves close to non-latent RL performance while staying far below it in generated-token cost.
Flawed Fictions asks whether a story contains a plot hole. RL improves Qwen2.5-7B from 57.26% to 88.71% accuracy, but still generates 114.53 tokens on average. LiteReason reaches 87.42% accuracy with 34.01 tokens, while prior latent baselines remain much closer to the base model.
Figure 2. Accuracy vs. generated tokens on Flawed Fictions. Aside from RL-Trained, LiteReason is separated from the next-best method by roughly 30 accuracy points and about 190 generated tokens.
NCP evaluates a generated plan for the next chapter in a book. We define a contrastive version of the VR-CLI objective that measures how much a plan increases the likelihood of the true next chapter while decreasing the likelihood of other chapters, which we call Contrastive Improvement. The RL-trained model obtains the highest score (0.666) but uses 721.33 tokens on average. LiteReason obtains 0.478 with 193.11 tokens, landing on the same performance-cost frontier; CoLaR is similarly short but much weaker at 0.118.
Figure 3. Contrastive Improvement vs. generated tokens on NCP. LiteReason is far cheaper than RL-Trained while substantially outperforming the other latent-reasoning baselines.
With the same RL steps and samples, LiteReason uses about half as many generated training tokens and is faster in wall-clock inference.
RL training tokens. The RL and RL + LiteReason runs use the same RL epochs, steps, and samples; LiteReason generates 52.8% fewer tokens on Flawed Fictions and 49.5% fewer on NCP during training.
| Method | FF Training Tokens β | NCP Training Tokens β |
|---|---|---|
| RL | 61.4M | 108.8M |
| RL + LiteReason | 29.0M (-52.8%) | 54.9M (-49.5%) |
Inference wall-clock time. Mean seconds per example, averaged over three sequential repetitions, for Qwen2.5-7B models using vLLM on one H100. b=1 runs one example at a time; b=all passes the full test set for parallel computation. The Latent column indicates whether LiteReason uses the latent-reasoning prompt at inference.
| Task | Model | Latent | b=1 (s) β | b=all (s) β |
|---|---|---|---|---|
| FF | Base | Γ | 3.10 | 0.13 |
| FF | RL-Trained | Γ | 0.97 | 0.06 |
| FF | LiteReason | Γ | 0.83 | 0.20 |
| FF | LiteReason | β | 0.31 | 0.03 |
| NCP | Base | Γ | 8.02 | 0.43 |
| NCP | RL-Trained | Γ | 6.46 | 0.42 |
| NCP | LiteReason | Γ | 2.21 | 0.26 |
| NCP | LiteReason | β | 1.84 | 0.23 |
Compared with RL-Trained, LiteReason produces traces 70% smaller on Flawed Fictions and 73% smaller on NCP while still achieving 96% and 69% of the respective RL performance gains.
A DAPO-style length penalty improves both standard RL and LiteReason on Flawed Fictions. LiteReason plus the penalty reaches 93.55% accuracy with 6.00 generated tokens on average, suggesting the method can combine cleanly with other RL reward shaping.
| Method | Accuracy (%) β | Tokens β |
|---|---|---|
| RL-Trained | 88.71 | 114.53 |
| RL-Trained + LP | 91.94 | 16.65 |
| LiteReason | 87.42 | 34.01 |
| LiteReason + LP | 93.55 | 6.00 |
On GSM-Hard, AIME25, and MMLU-Redux, LiteReason and the RL-Trained baseline largely retain the abilities of Qwen2.5-7B. LiteReason models also reason more concisely, consistently producing fewer tokens than the base model and RL-Trained variants. The full table in our paper also compares latent-inference modes and non-LiteReason latent baselines.
| Setting | GSM Acc. β | GSM Tokens β | AIME Acc. β | AIME Tokens β | MMLU Acc. β | MMLU Tokens β |
|---|---|---|---|---|---|---|
| No finetuning | 27.3 | 371.4 | 11.3 | 891.1 | 78.7 | 317.0 |
| FF RL-Trained | 27.2 | 359.1 | 10.0 | 857.3 | 79.0 | 300.5 |
| FF LiteReason | 27.3 | 349.7 | 10.7 | 834.7 | 78.8 | 283.4 |
On Flawed Fictions with Qwen3-4B-Instruct-2507, LiteReason improves accuracy from 33.23% to 57.42% while reducing output length from 1709.86 to 995.38 tokens. On GSM-Hard with Gemma-3-1B-IT, LiteReason largely matches the base model while producing about 9% fewer tokens.
| Task / Model | Method | Accuracy (%) β | Tokens β |
|---|---|---|---|
| FF / Qwen3-4B | Base | 33.23 | 1709.86 |
| FF / Qwen3-4B | RL-Trained | 64.84 | 1078.57 |
| FF / Qwen3-4B | LiteReason | 57.42 | 995.38 |
| GSM-Hard / Gemma3-1B | Base | 14.70 | 932.82 |
| GSM-Hard / Gemma3-1B | RL-Trained | 15.91 | 941.13 |
| GSM-Hard / Gemma3-1B | LiteReason | 15.76 | 849.92 |
@article{gurung2026lightweightlatentreasoning,
title = {Lightweight Latent Reasoning for Narrative Tasks},
author = {Alexander Gurung and Esmeralda S. Whitammer and Mirella Lapata},
journal = {Transactions of the Association for Computational Linguistics},
year = {2026},
url = {https://arxiv.org/abs/2512.02240}
}