Lightweight Latent Reasoning for Narrative Tasks

Alexander Gurung¹, Esmeralda S. Whitammer^1,2, Mirella Lapata¹

¹ University of Edinburgh ² CIFAR Fellow

TACL 2026

TL;DR

LiteReason adds a lightweight Reasoning Projector to an LLM, letting RL-trained models interleave normal token generation with continuous latent reasoning. On narrative tasks, it reaches 69-96% of the gains from non-latent RL while cutting final reasoning traces by 70-73% and RL training tokens by about half. The savings come without hurting general capabilities (GSM-Hard, AIME25, MMLU-Redux), and stack with length-penalty reward shaping, producing the shortest and most performant model.

Figure 1. LiteReason alternates between the model's normal discrete sampling and latent reasoning. In discrete mode, the LM Head samples a token and its embedding is fed back as the next input. If the sampled token is an implicit-thought tag such as <bot> (shorthand for our <implicit_thought> tag) with a thought budget, generation switches to latent mode: the Reasoning Projector maps the last hidden state to a continuous embedding, feeds that embedding back for the budgeted number of steps, and then returns to discrete sampling. The model can switch between these modes multiple times before producing the final answer.

The LiteReason Framework

LiteReason keeps ordinary LM Head sampling but adds a Reasoning Projector. When the model emits an implicit-thought tag with a step budget, the projector predicts continuous token embeddings directly from the final hidden state before generation returns to text.

We train in three stages: collect useful traces, initialize the projector with SFT, then run RL while treating only discrete token sampling as policy actions. After each RL epoch, we refresh the projector on trajectories from the current policy. See our paper for the full training recipe and inference procedure.

Generation Preview

Illustrative example of LiteReason inference on a Flawed Fictions-style input. Discrete tokens are sampled normally; an <implicit_thought>n</implicit_thought> tag switches the model into latent mode for n steps (the pulsing dots) before discrete generation resumes.

What Does Narrative Reasoning Look Like?

Most latent reasoning methods are developed on math and synthetic logic benchmarks, where a single reasoning step is short and formulaic: an equation, or a one-line rule. The narrative tasks we study are different. A reasoning step must track characters, plot constraints, and information spread across thousands of tokens of context, so steps are longer, more varied, and harder to compress into latent tokens.

Example reasoning steps. Steps from common latent-reasoning benchmarks (top) are short and formulaic; steps from our narrative tasks (bottom) are longer, reference long-range context, and vary in structure.

Dataset	Example Reasoning Step
GSM8K-Aug	The helmet costs $15 × 2 = $30.
ProsQA	Every bompus is a wumpus.
ProntoQA	Each vumpus is mean.
Flawed Fictions	The continuity error occurs because the story earlier establishes that the little girl was very poor and had no room to live in or bed to sleep in, but later it states that she returned to her small bed in the shelter with her newfound wealth.
Next Chapter Prediction	`<citation>`Source A (Character Sheet: Rose) says Rose is rebellious, disobedient, and has a sarcastic sense of humor.`</citation>`, therefore `<reasoning>`Rose will likely continue to challenge authority figures and express her opinions, possibly provoking Miss Wellwood and leading to a confrontation.`</reasoning>`

These narrative steps are what the Reasoning Projector must learn to skip: each latent thought stands in for a sentence-level step like the ones above, not a single short equation.

How Does LiteReason Perform on Narrative Tasks?

We evaluate on two narrative tasks: Flawed Fictions, a 414-example plot-hole detection benchmark, and Next Chapter Prediction (NCP), a 1,347-example book-planning task. The plots below show the main performance-compute tradeoff: better methods move right, and cheaper methods move down. LiteReason is the only latent-reasoning method that moves close to non-latent RL performance while staying far below it in generated-token cost.

69-96%

of non-latent RL gains

77-92%

fewer inference tokens vs. base

49-53%

fewer RL training tokens

~3×

faster single-example inference vs. RL

Flawed Fictions

Flawed Fictions asks whether a story contains a plot hole. RL improves Qwen2.5-7B from 57.26% to 88.71% accuracy, but still generates 114.53 tokens on average. LiteReason reaches 87.42% accuracy with 34.01 tokens, while prior latent baselines remain much closer to the base model.

Figure 2. Accuracy vs. generated tokens on Flawed Fictions. Aside from RL-Trained, LiteReason is separated from the next-best method by roughly 30 accuracy points and about 190 generated tokens.

Next Chapter Prediction

NCP evaluates a generated plan for the next chapter in a book. We define a contrastive version of the VR-CLI objective that measures how much a plan increases the likelihood of the true next chapter while decreasing the likelihood of other chapters, which we call Contrastive Improvement. The RL-trained model obtains the highest score (0.666) but uses 721.33 tokens on average. LiteReason obtains 0.478 with 193.11 tokens, landing on the same performance-cost frontier; CoLaR is similarly short but much weaker at 0.118.

Figure 3. Contrastive Improvement vs. generated tokens on NCP. LiteReason is far cheaper than RL-Trained while substantially outperforming the other latent-reasoning baselines.

Does LiteReason Improve RL Efficiency?

With the same RL steps and samples, LiteReason uses about half as many generated training tokens and is faster in wall-clock inference.

RL training tokens. The RL and RL + LiteReason runs use the same RL epochs, steps, and samples; LiteReason generates 52.8% fewer tokens on Flawed Fictions and 49.5% fewer on NCP during training.

Method	FF Training Tokens ↓	NCP Training Tokens ↓
RL	61.4M	108.8M
RL + LiteReason	29.0M (-52.8%)	54.9M (-49.5%)

Inference wall-clock time. Mean seconds per example, averaged over three sequential repetitions, for Qwen2.5-7B models using vLLM on one H100. b=1 runs one example at a time; b=all passes the full test set for parallel computation. The Latent column indicates whether LiteReason uses the latent-reasoning prompt at inference.

Task	Model	Latent	b=1 (s) ↓	b=all (s) ↓
FF	Base	×	3.10	0.13
FF	RL-Trained	×	0.97	0.06
FF	LiteReason	×	0.83	0.20
FF	LiteReason	✓	0.31	0.03
NCP	Base	×	8.02	0.43
NCP	RL-Trained	×	6.46	0.42
NCP	LiteReason	×	2.21	0.26
NCP	LiteReason	✓	1.84	0.23

Compared with RL-Trained, LiteReason produces traces 70% smaller on Flawed Fictions and 73% smaller on NCP while still achieving 96% and 69% of the respective RL performance gains.

Is LiteReason Compatible with Length-Based Reward Shaping?

A DAPO-style length penalty improves both standard RL and LiteReason on Flawed Fictions. LiteReason plus the penalty reaches 93.55% accuracy with 6.00 generated tokens on average, suggesting the method can combine cleanly with other RL reward shaping.

Method	Accuracy (%) ↑	Tokens ↓
RL-Trained	88.71	114.53
RL-Trained + LP	91.94	16.65
LiteReason	87.42	34.01
LiteReason + LP	93.55	6.00

Does LiteReason Retain General Model Capabilities?

On GSM-Hard, AIME25, and MMLU-Redux, LiteReason and the RL-Trained baseline largely retain the abilities of Qwen2.5-7B. LiteReason models also reason more concisely, consistently producing fewer tokens than the base model and RL-Trained variants. The full table in our paper also compares latent-inference modes and non-LiteReason latent baselines.

Setting	GSM Acc. ↑	GSM Tokens ↓	AIME Acc. ↑	AIME Tokens ↓	MMLU Acc. ↑	MMLU Tokens ↓
No finetuning	27.3	371.4	11.3	891.1	78.7	317.0
FF RL-Trained	27.2	359.1	10.0	857.3	79.0	300.5
FF LiteReason	27.3	349.7	10.7	834.7	78.8	283.4

Does LiteReason Work Across Model Size and Family?

On Flawed Fictions with Qwen3-4B-Instruct-2507, LiteReason improves accuracy from 33.23% to 57.42% while reducing output length from 1709.86 to 995.38 tokens. On GSM-Hard with Gemma-3-1B-IT, LiteReason largely matches the base model while producing about 9% fewer tokens.

Task / Model	Method	Accuracy (%) ↑	Tokens ↓
FF / Qwen3-4B	Base	33.23	1709.86
FF / Qwen3-4B	RL-Trained	64.84	1078.57
FF / Qwen3-4B	LiteReason	57.42	995.38
GSM-Hard / Gemma3-1B	Base	14.70	932.82
GSM-Hard / Gemma3-1B	RL-Trained	15.91	941.13
GSM-Hard / Gemma3-1B	LiteReason	15.76	849.92

Takeaways

A lightweight Reasoning Projector is enough to bring latent reasoning into RL: the policy decides when to switch into latent mode, and only discrete tokens are treated as policy actions.
On narrative tasks, LiteReason recovers 69-96% of non-latent RL’s gains while producing 70-73% shorter final traces and using about half the RL training tokens.
The savings come without losing general capabilities, and stack with other reward shaping: with a DAPO-style length penalty, LiteReason reaches 93.55% accuracy on Flawed Fictions with just 6 generated tokens.

Citation

@article{gurung2026lightweightlatentreasoning,
  title     = {Lightweight Latent Reasoning for Narrative Tasks},
  author    = {Alexander Gurung and Esmeralda S. Whitammer and Mirella Lapata},
  journal   = {Transactions of the Association for Computational Linguistics},
  year      = {2026},
  url       = {https://arxiv.org/abs/2512.02240}
}