Moving Past RLHF: In 2025 We Will Transition from Preference Tuning to Reward Optimization in Foundation Models

Models like GPT-o3 and Tülu 3 are showing the way.

Created Using Midjourney

A brief note: Given the limited market activity during the holiday season, we will replace our traditional Sunday edition for this week and next week with our popular ‘The Sequence Chat,’ in which we discuss some original ideas about the AI space. Now onto today’s subject:

In a recent essay in this newsletter we explored the shift in emphasis from pretraining to post-training in foundation models. The release of models like GPT-o1, the initial details about GPT-o3, and frameworks such as Tülu 3 provide a glimpse of that trajectory. However, even within the post-training space we are seeing intriguing changes in technique. One of them is the transition from preference tuning, with methods such as the famous RLHF, to reward modeling. Today, I would like to explore how preference tuning paved the way for reward optimization, examine the impact and limitations of RLHF, and discuss the emergence of new reward models that aim to capture complex human values more effectively.

Modern artificial intelligence has reached a pivotal stage with the advent of foundation models—massive neural networks that can be adapted to an array of tasks through minimal fine-tuning. These models, which learn statistical patterns from sprawling corpora of text, possess an extraordinary ability to generate and interpret natural language. However, as they grow more powerful, the need to align their outputs with human goals, values, and preferences becomes both more urgent and more challenging.

Initially, preference tuning served as the de facto approach to alignment, relying on human-annotated preference datasets to guide model behavior. Although preference tuning yields significant benefits in terms of helpfulness and safety, it struggles to capture the full range of human intentions, values, and context-specific nuances. In response, researchers turned to reward optimization, beginning with approaches like Reinforcement Learning from Human Feedback (RLHF), to refine model behavior against explicit reward signals.
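To make the distinction concrete, here is a minimal sketch, in hypothetical PyTorch, of how pairwise preference annotations are typically distilled into an explicit reward signal via a Bradley-Terry loss, which an RLHF-style policy update can then optimize. It is an illustrative toy, not the training code of any model mentioned here; the RewardModel class, the dimensions, and the random stand-in data are assumptions.

```python
# Illustrative sketch only: a toy reward model trained on pairwise human
# preferences with the Bradley-Terry loss. The preference data yields an
# explicit reward signal that an RL step (e.g., PPO in RLHF) can maximize.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a pooled (prompt, response) representation to a scalar reward."""
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, hidden_dim)
        return self.scorer(features).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: push the chosen response's reward above the
    # rejected one's; equivalent to -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy training step on random stand-in features for annotated preference pairs.
reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

chosen_feats = torch.randn(32, 128)    # features of human-preferred responses
rejected_feats = torch.randn(32, 128)  # features of dispreferred responses

loss = preference_loss(reward_model(chosen_feats), reward_model(rejected_feats))
optimizer.zero_grad()
loss.backward()
optimizer.step()

# In RLHF, the frozen reward model then scores policy samples, and the policy
# is updated (typically with PPO plus a KL penalty toward a reference model)
# to maximize that learned reward.
```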

Within this rapidly evolving field, recent projects such as GPT-o3's deliberative alignment and Tülu 3 exemplify the shift from preference-based fine-tuning to more dynamic, reward-focused paradigms.

The Rise of Foundation Models and Preference Tuning