The deployment of large language models (LLMs) in real-world applications has raised fundamental questions about how we ensure these systems behave in ways that are both helpful to users and safe for society. While pretraining on internet text gives models impressive capabilities, it doesn't inherently teach them to follow instructions, avoid harmful outputs, or align with human preferences. This is where Reinforcement Learning from Human Feedback (RLHF) enters the picture—a technique that has become central to training the assistants we interact with today.
In this article, we'll explore the mathematical foundations of RLHF, trace its development through foundational research, and examine how recent advances like Safe RLHF are pushing the boundaries of what's possible in AI alignment.
The Alignment Problem
Pretrained language models learn to predict the next token in a sequence. This objective, while powerful for learning linguistic patterns, doesn't directly optimize for the qualities we want in an AI assistant: following instructions accurately, providing truthful information, refusing harmful requests, and being genuinely helpful.
The core insight of RLHF is elegant: rather than trying to hand-craft rules for good behavior, we can learn a model of human preferences and use it to guide the AI's learning process. This shifts the problem from "define what good behavior looks like" to "collect examples of humans expressing preferences between outputs."
The Three Stages of RLHF
RLHF typically proceeds in three stages:
- Supervised Fine-Tuning (SFT): Start with a pretrained model and fine-tune it on high-quality demonstrations of desired behavior
- Reward Model Training: Train a separate model to predict human preferences between outputs
- Policy Optimization: Use reinforcement learning to optimize the SFT model against the reward model
Let's examine each stage in detail.
Stage 1: Supervised Fine-Tuning
The journey begins with supervised fine-tuning on curated demonstrations. Human contractors write examples of ideal assistant responses to various prompts. The model learns to mimic this style of response through standard language modeling:

$$\mathcal{L}_{\mathrm{SFT}}(\phi) = -\mathbb{E}_{(x, y) \sim \mathcal{D}_{\mathrm{demo}}}\left[\sum_{t} \log \pi_\phi(y_t \mid x, y_{<t})\right]$$

where $\pi_\phi$ is the model being fine-tuned and $\mathcal{D}_{\mathrm{demo}}$ is the set of demonstrations.
This stage transforms a raw language model into something that looks more like an assistant, but it's limited by the quality and coverage of the demonstration data.
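To make the SFT step concrete, here is a minimal sketch in PyTorch using the Hugging Face `transformers` API. The model name, learning rate, and the toy demonstration list are placeholder assumptions, not details from the original work.

```python
# Minimal SFT sketch: fine-tune a causal LM on (prompt, ideal response) pairs
# with the standard next-token cross-entropy loss. All names and
# hyperparameters here are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # any causal LM works
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

demonstrations = [
    ("Explain photosynthesis briefly.",
     "Photosynthesis is the process by which plants convert light into chemical energy."),
]

model.train()
for prompt, ideal_response in demonstrations:
    # Concatenate prompt and demonstration; passing labels=input_ids makes the
    # model compute the language-modeling loss (it shifts labels internally).
    # In practice, prompt tokens are often masked out of the loss.
    ids = tokenizer(prompt + "\n" + ideal_response, return_tensors="pt").input_ids
    loss = model(input_ids=ids, labels=ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```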
Stage 2: Reward Model Training
The heart of RLHF lies in the reward model. Rather than asking humans to rate outputs on an absolute scale (which is noisy and inconsistent), we ask them to compare pairs of outputs and indicate which they prefer. This comparative approach is more reliable and produces a cleaner training signal.
Given a prompt $x$ and two candidate responses $y_1$ and $y_2$, the reward model $r_\theta$ assigns a scalar score $r_\theta(x, y)$ to each response. We model human preferences using the Bradley-Terry model, which gives the probability that response $y_1$ is preferred over $y_2$:

$$P(y_1 \succ y_2 \mid x) = \sigma\big(r_\theta(x, y_1) - r_\theta(x, y_2)\big)$$

where $\sigma$ is the sigmoid function. This formulation has an elegant interpretation: the probability of preferring one response over another depends only on the difference in their reward scores.
The reward model is trained to maximize the likelihood of observed human preferences. Given a dataset $\mathcal{D}$ of comparisons $(x, y_w, y_l)$, where $y_w$ is the preferred (winning) response and $y_l$ is the less preferred (losing) response:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]$$
This loss function pushes the reward model to assign higher scores to preferred responses. Intuitively, we're training a critic that internalizes human judgment about what makes responses good or bad.
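As a concrete sketch, this loss fits in a few lines of PyTorch, assuming a `reward_model` callable that maps prompt and response token ids to one scalar score per example; the batch field names here are hypothetical.

```python
# Bradley-Terry reward-model loss: push the score of the preferred response
# above the score of the rejected one. `reward_model` and the batch keys are
# assumed interfaces, not from a specific library.
import torch.nn.functional as F

def reward_model_loss(reward_model, batch):
    r_w = reward_model(batch["prompt_ids"], batch["chosen_ids"])    # (B,) scores for winners
    r_l = reward_model(batch["prompt_ids"], batch["rejected_ids"])  # (B,) scores for losers

    # Negative log-likelihood of the observed preferences under Bradley-Terry:
    # -log sigma(r_w - r_l); logsigmoid is numerically stabler than log(sigmoid()).
    return -F.logsigmoid(r_w - r_l).mean()
```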
Stage 3: Policy Optimization with PPO
With a trained reward model in hand, we can now optimize the language model policy to generate responses that score highly. However, naively maximizing reward leads to a well-known problem: reward hacking. The model finds ways to exploit the reward model that don't correspond to genuinely better responses.
To prevent this, we add a KL divergence penalty that keeps the optimized policy $\pi_\phi$ close to a reference policy $\pi_{\mathrm{ref}}$ (typically the SFT model). The optimization objective becomes:

$$\max_{\phi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\phi(\cdot \mid x)}\big[r_\theta(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_\phi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)$$

The hyperparameter $\beta$ controls the strength of the KL penalty. Higher values of $\beta$ keep the model closer to its original behavior, while lower values allow more aggressive optimization toward higher rewards.
The KL divergence term serves multiple purposes:
- Prevents reward hacking: Stops the model from finding adversarial inputs to the reward model
- Maintains coherence: Preserves the linguistic capabilities learned during pretraining
- Enables iteration: Allows the reference policy to be updated periodically for continued training
In practice, the KL penalty can be computed efficiently since both policies share the same architecture, and we only need to compute log probabilities of the generated tokens.
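A minimal sketch of that computation follows. It uses the common convention of a per-token KL penalty with the reward-model score added at the last generated token; the function signature is illustrative rather than any particular library's API.

```python
# Per-token rewards used during PPO: a KL penalty at every token, plus the
# reward model's scalar score at the end of the response.
import torch

def kl_penalized_rewards(policy_logprobs: torch.Tensor,   # (batch, seq_len)
                         ref_logprobs: torch.Tensor,      # (batch, seq_len)
                         reward_scores: torch.Tensor,     # (batch,)
                         beta: float) -> torch.Tensor:
    # log(pi_phi / pi_ref) for each generated token; summing over the sequence
    # gives a sample-based estimate of the sequence-level KL divergence.
    kl_per_token = policy_logprobs - ref_logprobs

    rewards = -beta * kl_per_token          # penalize divergence everywhere
    rewards[:, -1] += reward_scores         # reward-model score at the final token
    return rewards
```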
Anthropic's Foundational Work on Helpful and Harmless Assistants
The 2022 paper "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" by Anthropic researchers established much of the methodology used today for training AI assistants. This work introduced several key insights.
First, the paper demonstrated that RLHF could be used not just for helpfulness but also for harmlessness—training models to refuse dangerous requests while remaining maximally helpful for benign ones. This dual objective creates an inherent tension: a maximally helpful model might provide information that could be misused, while a maximally cautious model might refuse legitimate requests.
Second, the authors showed that model size matters for alignment. Larger models were better able to learn the nuanced boundary between helpful and harmful behavior. This finding suggests that alignment techniques scale with capability—a hopeful sign for aligning future, more capable systems.
Third, the paper introduced the practice of collecting red team data—deliberately adversarial prompts designed to elicit harmful behavior. By including human preferences on these challenging cases in the training data, models learned more robust refusal behaviors.
The trained models showed remarkable improvements in both helpfulness (as measured by human preference) and harmlessness (as measured by reduced rates of generating harmful content), demonstrating that these objectives, while in tension, could be jointly optimized.
Safe RLHF: Decoupling Helpfulness and Harmlessness
While the original RLHF formulation treats preference as a single dimension, the 2023 paper "Safe RLHF: Safe Reinforcement Learning from Human Feedback" proposed a more nuanced approach that explicitly decouples helpfulness and harmlessness into separate reward models.
The key insight is that helpfulness and harmlessness are fundamentally different objectives that shouldn't be collapsed into a single preference signal. A response might be highly helpful but somewhat unsafe, or very safe but unhelpful. By training separate reward models for each dimension, Safe RLHF can navigate this trade-off more explicitly.
The Safe RLHF framework trains two reward models:
- $R_{\mathrm{help}}(x, y)$: A reward model for helpfulness
- $R_{\mathrm{safe}}(x, y)$: A reward model for safety/harmlessness
The optimization objective becomes a constrained problem. Rather than simply maximizing a weighted combination of rewards, Safe RLHF treats safety as a constraint:

$$\max_{\phi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\phi(\cdot \mid x)}\big[R_{\mathrm{help}}(x, y)\big] \quad \text{subject to} \quad \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\phi(\cdot \mid x)}\big[R_{\mathrm{safe}}(x, y)\big] \geq d$$

where $d$ is the required safety threshold.
This formulation ensures that safety is maintained above a threshold while helpfulness is maximized within that constraint. This is philosophically different from the weighted-sum approach: safety becomes a hard requirement rather than something to be traded off against helpfulness.
The practical algorithm alternates between:
- Estimating the current safety level of the policy
- Adjusting the Lagrange multiplier $\lambda$ to enforce the safety constraint
- Taking PPO optimization steps with the adjusted objective
This approach has several advantages. It allows practitioners to set explicit safety thresholds appropriate for their deployment context. It prevents the optimization from trading away safety for marginal gains in helpfulness. And it provides interpretable metrics for both dimensions separately.
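As a rough, self-contained sketch of that alternation, the dual (Lagrange multiplier) step and the combined objective might look like the following; the update rule, normalization, and numbers are illustrative assumptions rather than the paper's exact algorithm.

```python
# Schematic Lagrangian updates for Safe RLHF-style training: raise lambda when
# the policy's estimated safety falls below the threshold, then hand a
# lambda-weighted objective to the PPO step. Illustrative only.

def update_lagrange_multiplier(lam: float, mean_safety: float,
                               threshold: float, lr: float = 0.05) -> float:
    """Dual step: increase lambda when the safety constraint is violated
    (mean_safety < threshold), relax it when there is slack."""
    return max(0.0, lam + lr * (threshold - mean_safety))

def combined_reward(r_help: float, r_safe: float, lam: float) -> float:
    """Primal objective passed to PPO: helpfulness plus lambda-weighted safety,
    normalized so the reward scale stays stable as lambda grows."""
    return (r_help + lam * r_safe) / (1.0 + lam)

# Toy usage: the policy currently violates the safety threshold, so lambda
# rises and safety receives more weight at the next PPO step.
lam = update_lagrange_multiplier(lam=1.0, mean_safety=-0.3, threshold=0.0)
print(lam, combined_reward(r_help=2.0, r_safe=-0.3, lam=lam))
```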
Mathematical Intuitions
Let's develop some intuition for why these mathematical formulations work.
Why Bradley-Terry?
The Bradley-Terry model comes from the statistical study of paired comparisons; the same functional form underlies chess rating systems such as Elo. Its key property is transitivity: if $A$ is preferred to $B$ and $B$ is preferred to $C$ (in the sense of higher reward scores), the model assigns a higher probability to $A$ being preferred over $C$ than the reverse. This makes reward scores comparable across the entire output space, not just within a single comparison.
The sigmoid function maps reward differences to probabilities in a sensible way: when two responses have similar reward scores, the preference probability approaches 0.5 (uncertainty); when scores differ greatly, it approaches certainty.
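A quick numerical illustration with made-up reward scores:

```python
# Bradley-Terry preference probability sigma(r_A - r_B) for toy scores.
import math

def preference_probability(r_a: float, r_b: float) -> float:
    """Probability that response A is preferred over response B."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

print(preference_probability(1.0, 0.9))   # ~0.52: similar scores, near coin flip
print(preference_probability(3.0, -1.0))  # ~0.98: large gap, near certainty
```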
Why KL Divergence?
The KL divergence measures how much one probability distribution differs from another. In RLHF, it measures how much the optimized policy has diverged from the reference.
KL divergence has useful properties: it's always non-negative, zero only when the distributions match, and it penalizes more heavily when the optimized policy puts probability mass where the reference policy puts little. This last property is crucial—it strongly discourages the model from generating responses that the reference model considers highly unlikely, which are often the reward-hacking outputs.
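A small numerical check of that last property, using invented three-token distributions:

```python
# KL(p || q) grows sharply when p puts mass where q has very little.
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

reference   = [0.70, 0.29, 0.01]   # reference policy over three tokens
mild_shift  = [0.60, 0.35, 0.05]   # small drift away from the reference
reward_hack = [0.05, 0.05, 0.90]   # mass piled on a token the reference finds unlikely

print(kl_divergence(mild_shift, reference))   # ~0.05 nats
print(kl_divergence(reward_hack, reference))  # ~3.8 nats
```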
The Role of $\beta$
The coefficient $\beta$ in the PPO objective controls a fundamental trade-off. We can rewrite the objective as a single KL-shaped reward:

$$\tilde{r}(x, y) = r_\theta(x, y) - \beta \log \frac{\pi_\phi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$

As $\beta \to 0$, we're purely maximizing reward with no constraint—likely to reward hack. As $\beta \to \infty$, we're keeping the policy identical to the reference—no learning. The art of RLHF lies in finding the sweet spot where meaningful improvement happens without catastrophic deviation.

In practice, $\beta$ is often adapted during training. Some approaches start with a high $\beta$ and anneal it down, allowing larger changes as the reward model's accuracy on the current policy distribution improves.
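One schedule of the kind described here, sketched as a simple linear anneal; the shape and values are illustrative choices, not a recommendation from the literature.

```python
# Linear annealing of the KL coefficient beta from a strong to a weak penalty.
def annealed_beta(step: int, total_steps: int,
                  beta_start: float = 0.5, beta_end: float = 0.05) -> float:
    frac = min(step / total_steps, 1.0)
    return beta_start + frac * (beta_end - beta_start)

print(annealed_beta(0, 10_000))       # 0.5  -> stay close to the reference early on
print(annealed_beta(10_000, 10_000))  # 0.05 -> allow larger deviations late in training
```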
Challenges and Limitations
Despite its success, RLHF has known limitations.
Reward model accuracy degrades off-distribution. The reward model is trained on comparisons of outputs from a particular policy. As the policy changes through optimization, it generates outputs the reward model hasn't seen, potentially leading to overconfident but incorrect reward predictions.
Human preferences are inconsistent. Different annotators have different values and preferences. Aggregating these into a single reward model necessarily loses information and may encode majority biases.
Goodhart's Law applies. "When a measure becomes a target, it ceases to be a good measure." The reward model is a proxy for human values, and optimizing hard against it can lead to high-scoring responses that don't actually reflect what humans want.
Scalable oversight remains unsolved. As AI systems become more capable, humans may struggle to accurately evaluate their outputs. How do we collect reliable preference data for responses that require expert knowledge to assess?
Looking Forward
RLHF represents a paradigm shift in how we think about training AI systems. Rather than specifying behavior through rules or demonstrations alone, we can learn models of human judgment and optimize against them. This approach has enabled the current generation of helpful, harmless AI assistants.
The Safe RLHF framework points toward a future where alignment objectives are explicitly decomposed and individually constrained. We might imagine systems with separate reward models for helpfulness, harmlessness, honesty, and other desirable properties—each with its own threshold and trade-off structure.
As models become more capable, the stakes of alignment grow higher. The mathematical foundations laid by RLHF research—preference learning, constrained optimization, distributional constraints—will likely form the basis for alignment techniques applied to future systems. Understanding these foundations isn't just academic; it's essential knowledge for anyone thinking seriously about the trajectory of AI development.
The cosmos of AI alignment is vast, and RLHF is one of our most powerful tools for navigation. By combining human judgment with mathematical rigor, we're learning to steer these systems toward outcomes that benefit humanity—one preference comparison at a time.
References
- Bai, Y., et al. (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." Anthropic.
- Dai, J., et al. (2023). "Safe RLHF: Safe Reinforcement Learning from Human Feedback."
