Findings of ACL 2026

Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms

Zeguan Xiao, Yun Chen, Jian Yang, Guanhua Chen, Ke Tang

This page gives an intuition-first overview of POET, our lightweight method for making direct alignment algorithms better reflect the token-by-token dynamics of language generation.


Problem

Direct alignment methods score full responses, but language models actually commit to prefixes first during decoding.

Intervention

POET compares preferred and dispreferred responses at matched length, shifting the training signal toward the prefix, the part of the sequence that drives generation.

Benefit

The method is simple, hyperparameter-free in its default form, and improves alignment quality in every setting in the results table below, covering DPO and SimPO on 7B and 8B base models.

Figure: Illustration of the reward-generation gap between sequence-level rewards and prefix quality

Motivation

Direct alignment algorithms such as DPO and SimPO are popular because they simplify the RLHF pipeline: instead of learning a separate reward model and then optimizing with reinforcement learning, they train directly from preference pairs.

But there is an underappreciated mismatch hiding inside that setup. Training rewards are defined over complete responses, whereas inference unfolds one token at a time from left to right. A model can therefore become better at ranking full sequences under the training objective without becoming equally better at generating the crucial early tokens that steer the rest of the response.

We call this the reward-generation gap. The paper asks a straightforward question: can we reduce that gap without redesigning direct alignment from scratch?

Why Prefixes Matter

Autoregressive generation is path-dependent. Once a model makes a weak choice near the beginning of a response, later tokens are produced under that weaker context. Early mistakes can compound. In practice, prefix tokens often determine the direction, tone, structure, and safety profile of the whole answer.

That is why sequence-level optimization can miss something important. It treats the response as a single object, but the model experiences generation as a sequence of irreversible local commitments. If the preference signal is dominated by later suffix differences, learning may underweight the part of the response that matters most during actual decoding.
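To make the concern concrete, here is a toy sketch. It assumes the sequence-level score of a response is a sum of per-token contributions (for DPO-style objectives these would be log-probability ratios between the policy and a reference model). The function name `split_margin` and all numbers are illustrative assumptions, not data or code from the paper. The sketch splits the preferred-minus-dispreferred margin into the part earned over the shared-length prefix and the part earned by the leftover suffix of the longer response.

```python
from typing import Sequence, Tuple


def split_margin(chosen: Sequence[float], rejected: Sequence[float]) -> Tuple[float, float]:
    """Split a summed per-token margin into shared-prefix and leftover-suffix parts."""
    m = min(len(chosen), len(rejected))  # number of positions both responses have
    prefix_part = sum(chosen[:m]) - sum(rejected[:m])
    suffix_part = sum(chosen[m:]) - sum(rejected[m:])
    return prefix_part, suffix_part


# Made-up per-token scores, picked so the arithmetic is exact in floating point.
chosen_scores = [0.25, 0.25, 0.25, 0.5, 0.5, 0.5]  # 6 tokens
rejected_scores = [0.25, 0.0, 0.25]                # 3 tokens

prefix_part, suffix_part = split_margin(chosen_scores, rejected_scores)
print(prefix_part, suffix_part)  # 0.25 1.5
```

In this toy pair the total margin is 1.75, of which only 0.25 comes from positions where both responses exist. An optimizer that only sees the total has little incentive to improve the early tokens.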

Figure: Prefix quality difference emerging early and saturating with longer prefixes

POET in One Idea

POET, short for Prefix-Oriented Equal-length Training, is intentionally simple. For each preferred and dispreferred pair, we truncate both responses to the length of the shorter one and train the underlying direct alignment algorithm on those matched-length responses.
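The truncation step can be sketched in a few lines. This is a minimal illustration, not the released implementation: it assumes length is measured in response tokens (the prompt is excluded), that tokenization happens elsewhere, and that `poet_truncate` is a hypothetical helper name. How an end-of-sequence token on a truncated response is handled is not specified on this page, so the sketch leaves tokens untouched apart from the cut.

```python
from typing import List, Sequence, Tuple


def poet_truncate(
    chosen_ids: Sequence[int], rejected_ids: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """Cut both responses to the length of the shorter one (token-level assumption)."""
    m = min(len(chosen_ids), len(rejected_ids))
    return list(chosen_ids[:m]), list(rejected_ids[:m])


# Hypothetical token ids for one preference pair.
chosen = [11, 12, 13, 14, 15]
rejected = [21, 22, 23]

short_chosen, short_rejected = poet_truncate(chosen, rejected)
print(short_chosen, short_rejected)  # [11, 12, 13] [21, 22, 23]
```

The truncated pair is then passed to the usual DPO or SimPO training code in place of the original pair.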

The intervention is small, but the intuition is strong. Equal-length comparison prevents later suffixes from dominating the signal and encourages the method to focus on matched prefixes. That makes the learning problem better aligned with how the model will later generate text.

Universal

POET works on top of DPO, SimPO, and other direct alignment pipelines because it modifies the training pairs rather than the objective itself.

Lightweight

No extra reward model, no custom decoding trick, and no mandatory new hyperparameter in the default setup.

Targeted

The method addresses a specific mismatch between training and inference instead of adding general complexity.
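To illustrate why a data-level change composes with different objectives, the sketch below applies equal-length truncation before two standard losses, written in plain Python over per-token log-probabilities. The DPO and SimPO formulas follow their usual published forms; the helper names, the default `beta` and `gamma` values, and the toy numbers are assumptions for illustration only. Because a causal model's log-probability for token t depends only on earlier tokens, slicing the per-token lists gives the same values as rescoring the truncated responses.

```python
import math
from typing import List, Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def truncate_to_shortest(*token_scores: Sequence[float]) -> List[Sequence[float]]:
    """Equal-length step: keep only the positions every sequence has."""
    m = min(len(s) for s in token_scores)
    return [s[:m] for s in token_scores]


def dpo_loss(pol_w, ref_w, pol_l, ref_l, beta: float = 0.1) -> float:
    """Standard DPO loss from per-token log-probs (w = preferred, l = dispreferred)."""
    margin = (sum(pol_w) - sum(ref_w)) - (sum(pol_l) - sum(ref_l))
    return -log_sigmoid(beta * margin)


def simpo_loss(pol_w, pol_l, beta: float = 2.0, gamma: float = 0.5) -> float:
    """Standard SimPO loss: length-normalized log-probs with a target margin gamma."""
    margin = beta * (sum(pol_w) / len(pol_w) - sum(pol_l) / len(pol_l)) - gamma
    return -log_sigmoid(margin)


# Toy per-token log-probs: the preferred response is longer (3 tokens vs 2).
pol_w, ref_w = [-1.0, -1.0, -5.0], [-1.0, -1.0, -1.0]
pol_l, ref_l = [-2.0, -2.0], [-1.0, -1.0]

# The objectives are unchanged; only their inputs are truncated first.
t_pol_w, t_ref_w, t_pol_l, t_ref_l = truncate_to_shortest(pol_w, ref_w, pol_l, ref_l)
print(dpo_loss(pol_w, ref_w, pol_l, ref_l), dpo_loss(t_pol_w, t_ref_w, t_pol_l, t_ref_l))
print(simpo_loss(pol_w, pol_l), simpo_loss(t_pol_w, t_pol_l))
```

Neither loss function knows about the truncation, which is the sense in which the method changes the training pairs rather than the objective.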

Why Equal-Length Training Helps

The natural concern is whether truncation might corrupt the preference label. If the preferred response becomes clearly better only near the end, matching lengths could discard the decisive evidence. Our analysis suggests that many preference datasets do not behave that way. The quality gap often appears early and then grows with diminishing returns.

That means equal-length training can remove an important bias without throwing away most of the useful signal. Later tokens still matter, but they are not always the most informative part of the comparison for learning to generate better answers.
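One way to probe this on your own preference data is to score truncated prefixes of both responses and track the average gap as the prefix grows. The sketch below is not necessarily how the paper's analysis was run: `score` stands in for whatever quality estimate you trust (a reward model or judge, for example), and the word-counting scorer and token lists here are toy placeholders.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[Sequence[str], Sequence[str]]  # (preferred tokens, dispreferred tokens)


def prefix_gap_curve(
    pairs: Sequence[Pair],
    score: Callable[[Sequence[str]], float],
    lengths: Sequence[int],
) -> List[float]:
    """Mean score gap (preferred minus dispreferred) at each prefix length."""
    curve = []
    for k in lengths:
        # Only use pairs where both responses have at least k tokens.
        gaps = [score(c[:k]) - score(r[:k]) for c, r in pairs if min(len(c), len(r)) >= k]
        curve.append(sum(gaps) / len(gaps) if gaps else float("nan"))
    return curve


def toy_score(tokens: Sequence[str]) -> float:
    """Placeholder scorer: counts the token 'good'. Replace with a real quality estimate."""
    return float(sum(t == "good" for t in tokens))


toy_pairs = [
    (["good", "good", "ok", "ok"], ["bad", "ok", "ok", "ok"]),
    (["good", "ok"], ["ok", "ok", "ok"]),
]
print(prefix_gap_curve(toy_pairs, toy_score, lengths=[1, 2, 4]))  # [1.0, 1.5, 2.0]
```

A curve that rises quickly and then flattens would be consistent with the pattern described above; a curve that stays near zero until the final tokens would suggest truncation discards the decisive evidence.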

From that perspective, POET is not a heuristic bolt-on. It is a data-level way to bring the training comparison closer to the structure of generation.

Results Across Settings

The empirical pattern is consistent: POET improves alignment quality across multiple model families and multiple direct alignment objectives, while keeping the implementation burden extremely low.

Setting	Baseline	+ POET	Gain
Mistral-Base (7B) + DPO, AlpacaEval 2 LC	12.9	24.7	+11.8
Llama-3-Base (8B) + DPO, AlpacaEval 2 LC	16.9	28.4	+11.5
Llama-3-Base (8B) + SimPO, AlpacaEval 2 LC	28.0	33.8	+5.8
Mistral-Base (7B) + DPO, safety rate	45.2	82.1	+36.9

Representative results reported in the paper. The strongest gains appear in exactly the settings where getting the early trajectory of a response right matters most.

Figure: Higher-quality generated prefixes after POET training

More context: when POET is likely to help

POET is most compelling when preferred and dispreferred responses already diverge meaningfully in their early prefixes, and when the task depends heavily on getting the beginning of the answer right. That makes the method especially natural for instruction following and safety alignment.

By contrast, if a dataset's preference labels depend mostly on very late tokens, equal-length truncation may help less. The method is therefore best understood as a targeted correction for a specific training-inference mismatch, not as a claim that every preference dataset should always be truncated.

Takeaway

The broader lesson of this paper is that alignment objectives should respect generation dynamics. Direct alignment methods are appealing because they are simple, but sequence-level simplicity can hide a real mismatch with how language models actually decode.

POET shows that a very small intervention can reduce that mismatch. By forcing preferred and dispreferred responses to be compared at the same length, the training signal becomes more sensitive to prefixes and therefore more faithful to the part of the sequence that dominates real generation. That combination of clarity, simplicity, and broad empirical gain is what makes the result exciting to us.

Citation

@inproceedings{xiao2026poet,
  title={Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms},
  author={Xiao, Zeguan and Chen, Yun and Yang, Jian and Chen, Guanhua and Tang, Ke},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2026},
  year={2026}
}