The deployment of large language models (LLMs) in real-world applications has raised fundamental questions about how we ensure these systems behave in ways that are both helpful to users and safe for society. While pretraining on internet text gives models impressive capabilities, it doesn't inherently teach them to follow instructions, avoid harmful outputs, or align with human preferences. This is where Reinforcement Learning from Human Feedback (RLHF) enters the picture—a technique that has become central to training the assistants we interact with today.
In this article, we'll explore the mathematical foundations of RLHF, trace its development through foundational research, and examine how recent advances like Safe RLHF are pushing the boundaries of what's possible in AI alignment.
The Alignment Problem
Pretrained language models learn to predict the next token in a sequence. This objective, while powerful for learning linguistic patterns, doesn't directly optimize for the qualities we want in an AI assistant: following instructions accurately, providing truthful information, refusing harmful requests, and being genuinely helpful.
The core insight of RLHF is elegant: rather than trying to hand-craft rules for good behavior, we can learn a model of human preferences and use it to guide the AI's learning process. This shifts the problem from "define what good behavior looks like" to "collect examples of humans expressing preferences between outputs."
The Three Stages of RLHF
RLHF typically proceeds in three stages:
- Supervised Fine-Tuning (SFT): Start with a pretrained model and fine-tune it on high-quality demonstrations of desired behavior
- Reward Model Training: Train a separate model to predict human preferences between outputs
- Policy Optimization: Use reinforcement learning to optimize the SFT model against the reward model
Let's examine each stage in detail.
Stage 1: Supervised Fine-Tuning
The journey begins with supervised fine-tuning on curated demonstrations. Human contractors write examples of ideal assistant responses to various prompts. The model learns to mimic this style of response through standard language modeling:

$$\mathcal{L}_{\mathrm{SFT}}(\phi) = -\mathbb{E}_{(x, y) \sim \mathcal{D}_{\mathrm{demo}}}\left[\sum_{t} \log \pi_\phi(y_t \mid x, y_{<t})\right]$$

where $\pi_\phi$ is the model being fine-tuned and $\mathcal{D}_{\mathrm{demo}}$ is the set of demonstrations.
This stage transforms a raw language model into something that looks more like an assistant, but it's limited by the quality and coverage of the demonstration data.
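To make the SFT step concrete, here is a minimal sketch in PyTorch using the Hugging Face `transformers` API. The model name, learning rate, and the toy demonstration list are placeholder assumptions, not details from the original work.

```python
# Minimal SFT sketch: fine-tune a causal LM on (prompt, ideal response) pairs
# with the standard next-token cross-entropy loss. All names and
# hyperparameters here are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # any causal LM works
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

demonstrations = [
    ("Explain photosynthesis briefly.",
     "Photosynthesis is the process by which plants convert light into chemical energy."),
]

model.train()
for prompt, ideal_response in demonstrations:
    # Concatenate prompt and demonstration; passing labels=input_ids makes the
    # model compute the language-modeling loss (it shifts labels internally).
    # In practice, prompt tokens are often masked out of the loss.
    ids = tokenizer(prompt + "\n" + ideal_response, return_tensors="pt").input_ids
    loss = model(input_ids=ids, labels=ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```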
Stage 2: Reward Model Training
The heart of RLHF lies in the reward model. Rather than asking humans to rate outputs on an absolute scale (which is noisy and inconsistent), we ask them to compare pairs of outputs and indicate which they prefer. This comparative approach is more reliable and produces a cleaner training signal.
Given a prompt $x$ and two candidate responses $y_1$ and $y_2$, the reward model $r_\theta$ assigns a scalar score $r_\theta(x, y)$ to each response. We model human preferences using the Bradley-Terry model, which gives the probability that response $y_1$ is preferred over $y_2$:

$$P(y_1 \succ y_2 \mid x) = \sigma\big(r_\theta(x, y_1) - r_\theta(x, y_2)\big)$$

where $\sigma$ is the sigmoid function. This formulation has an elegant interpretation: the probability of preferring one response over another depends only on the difference in their reward scores.
The reward model is trained to maximize the likelihood of observed human preferences. Given a dataset $\mathcal{D}$ of comparisons $(x, y_w, y_l)$, where $y_w$ is the preferred (winning) response and $y_l$ is the less preferred (losing) response:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]$$
This loss function pushes the reward model to assign higher scores to preferred responses. Intuitively, we're training a critic that internalizes human judgment about what makes responses good or bad.
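As a concrete sketch, this loss fits in a few lines of PyTorch, assuming a `reward_model` callable that maps prompt and response token ids to one scalar score per example; the batch field names here are hypothetical.

```python
# Bradley-Terry reward-model loss: push the score of the preferred response
# above the score of the rejected one. `reward_model` and the batch keys are
# assumed interfaces, not from a specific library.
import torch.nn.functional as F

def reward_model_loss(reward_model, batch):
    r_w = reward_model(batch["prompt_ids"], batch["chosen_ids"])    # (B,) scores for winners
    r_l = reward_model(batch["prompt_ids"], batch["rejected_ids"])  # (B,) scores for losers

    # Negative log-likelihood of the observed preferences under Bradley-Terry:
    # -log sigma(r_w - r_l); logsigmoid is numerically stabler than log(sigmoid()).
    return -F.logsigmoid(r_w - r_l).mean()
```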
Stage 3: Policy Optimization with PPO
With a trained reward model in hand, we can now optimize the language model policy to generate responses that score highly. However, naively maximizing reward leads to a well-known problem: reward hacking. The model finds ways to exploit the reward model that don't correspond to genuinely better responses.
To prevent this, we add a KL divergence penalty that keeps the optimized policy $\pi_\phi$ close to a reference policy $\pi_{\mathrm{ref}}$ (typically the SFT model). The optimization objective becomes:

$$\max_{\phi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\phi(\cdot \mid x)}\big[r_\theta(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_\phi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)$$

The hyperparameter $\beta$ controls the strength of the KL penalty. Higher values of $\beta$ keep the model closer to its original behavior, while lower values allow more aggressive optimization toward higher rewards.
The KL divergence term serves multiple purposes:
- Prevents reward hacking: Stops the model from finding adversarial inputs to the reward model
- Maintains coherence: Preserves the linguistic capabilities learned during pretraining
- Enables iteration: Allows the reference policy to be updated periodically for continued training
In practice, the KL penalty can be computed efficiently since both policies share the same architecture, and we only need to compute log probabilities of the generated tokens.
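A minimal sketch of that computation follows. It uses the common convention of a per-token KL penalty with the reward-model score added at the last generated token; the function signature is illustrative rather than any particular library's API.

```python
# Per-token rewards used during PPO: a KL penalty at every token, plus the
# reward model's scalar score at the end of the response.
import torch

def kl_penalized_rewards(policy_logprobs: torch.Tensor,   # (batch, seq_len)
                         ref_logprobs: torch.Tensor,      # (batch, seq_len)
                         reward_scores: torch.Tensor,     # (batch,)
                         beta: float) -> torch.Tensor:
    # log(pi_phi / pi_ref) for each generated token; summing over the sequence
    # gives a sample-based estimate of the sequence-level KL divergence.
    kl_per_token = policy_logprobs - ref_logprobs

    rewards = -beta * kl_per_token          # penalize divergence everywhere
    rewards[:, -1] += reward_scores         # reward-model score at the final token
    return rewards
```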
Anthropic's Foundational Work on Helpful and Harmless Assistants
The 2022 paper "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" by Anthropic researchers established much of the methodology used today for training AI assistants. This work introduced several key insights.
First, the paper demonstrated that RLHF could be used not just for helpfulness but also for harmlessness—training models to refuse dangerous requests while remaining maximally helpful for benign ones. This dual objective creates an inherent tension: a maximally helpful model might provide information that could be misused, while a maximally cautious model might refuse legitimate requests.
Second, the authors showed that model size matters for alignment. Larger models were better able to learn the nuanced boundary between helpful and harmful behavior. This finding suggests that alignment techniques scale with capability—a hopeful sign for aligning future, more capable systems.
Third, the paper introduced the practice of collecting red team data—deliberately adversarial prompts designed to elicit harmful behavior. By including human preferences on these challenging cases in the training data, models learned more robust refusal behaviors.
The trained models showed remarkable improvements in both helpfulness (as measured by human preference) and harmlessness (as measured by reduced rates of generating harmful content), demonstrating that these objectives, while in tension, could be jointly optimized.
Safe RLHF: Decoupling Helpfulness and Harmlessness
While the original RLHF formulation treats preference as a single dimension, the 2023 paper "Safe RLHF: Safe Reinforcement Learning from Human Feedback" proposed a more nuanced approach that explicitly decouples helpfulness and harmlessness into separate reward models.
The key insight is that helpfulness and harmlessness are fundamentally different objectives that shouldn't be collapsed into a single preference signal. A response might be highly helpful but somewhat unsafe, or very safe but unhelpful. By training separate reward models for each dimension, Safe RLHF can navigate this trade-off more explicitly.
The Safe RLHF framework trains two reward models:
- $R_{\mathrm{help}}(x, y)$: A reward model for helpfulness
- $R_{\mathrm{safe}}(x, y)$: A reward model for safety/harmlessness
The optimization objective becomes a constrained problem. Rather than simply maximizing a weighted combination of rewards, Safe RLHF treats safety as a constraint:

$$\max_{\phi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\phi(\cdot \mid x)}\big[R_{\mathrm{help}}(x, y)\big] \quad \text{subject to} \quad \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\phi(\cdot \mid x)}\big[R_{\mathrm{safe}}(x, y)\big] \geq d$$

where $d$ is the required safety threshold.
This formulation ensures that safety is maintained above a threshold while helpfulness is maximized within that constraint. This is philosophically different from the weighted-sum approach: safety becomes a hard requirement rather than something to be traded off against helpfulness.
The practical algorithm alternates between:
- Estimating the current safety level of the policy
- Adjusting the Lagrange multiplier $\lambda$ to enforce the safety constraint
- Taking PPO optimization steps with the adjusted objective
This approach has several advantages. It allows practitioners to set explicit safety thresholds appropriate for their deployment context. It prevents the optimization from trading away safety for marginal gains in helpfulness. And it provides interpretable metrics for both dimensions separately.
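As a rough, self-contained sketch of that alternation, the dual (Lagrange multiplier) step and the combined objective might look like the following; the update rule, normalization, and numbers are illustrative assumptions rather than the paper's exact algorithm.

```python
# Schematic Lagrangian updates for Safe RLHF-style training: raise lambda when
# the policy's estimated safety falls below the threshold, then hand a
# lambda-weighted objective to the PPO step. Illustrative only.

def update_lagrange_multiplier(lam: float, mean_safety: float,
                               threshold: float, lr: float = 0.05) -> float:
    """Dual step: increase lambda when the safety constraint is violated
    (mean_safety < threshold), relax it when there is slack."""
    return max(0.0, lam + lr * (threshold - mean_safety))

def combined_reward(r_help: float, r_safe: float, lam: float) -> float:
    """Primal objective passed to PPO: helpfulness plus lambda-weighted safety,
    normalized so the reward scale stays stable as lambda grows."""
    return (r_help + lam * r_safe) / (1.0 + lam)

# Toy usage: the policy currently violates the safety threshold, so lambda
# rises and safety receives more weight at the next PPO step.
lam = update_lagrange_multiplier(lam=1.0, mean_safety=-0.3, threshold=0.0)
print(lam, combined_reward(r_help=2.0, r_safe=-0.3, lam=lam))
```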
Mathematical Intuitions
Let's develop some intuition for why these mathematical formulations work.
Why Bradley-Terry?
The Bradley-Terry model comes from the statistical study of paired comparisons; the same functional form underlies chess rating systems such as Elo. Its key property is transitivity: if $A$ is preferred to $B$ and $B$ is preferred to $C$ (in the sense of higher reward scores), the model assigns a higher probability to $A$ being preferred over $C$ than the reverse. This makes reward scores comparable across the entire output space, not just within a single comparison.
The sigmoid function maps reward differences to probabilities in a sensible way: when two responses have similar reward scores, the preference probability approaches 0.5 (uncertainty); when scores differ greatly, it approaches certainty.
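A quick numerical illustration with made-up reward scores:

```python
# Bradley-Terry preference probability sigma(r_A - r_B) for toy scores.
import math

def preference_probability(r_a: float, r_b: float) -> float:
    """Probability that response A is preferred over response B."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

print(preference_probability(1.0, 0.9))   # ~0.52: similar scores, near coin flip
print(preference_probability(3.0, -1.0))  # ~0.98: large gap, near certainty
```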
Why KL Divergence?
The KL divergence measures how much one probability distribution differs from another. In RLHF, it measures how much the optimized policy has diverged from the reference.
KL divergence has useful properties: it's always non-negative, zero only when the distributions match, and it penalizes more heavily when the optimized policy puts probability mass where the reference policy puts little. This last property is crucial—it strongly discourages the model from generating responses that the reference model considers highly unlikely, which are often the reward-hacking outputs.
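A small numerical check of that last property, using invented three-token distributions:

```python
# KL(p || q) grows sharply when p puts mass where q has very little.
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

reference   = [0.70, 0.29, 0.01]   # reference policy over three tokens
mild_shift  = [0.60, 0.35, 0.05]   # small drift away from the reference
reward_hack = [0.05, 0.05, 0.90]   # mass piled on a token the reference finds unlikely

print(kl_divergence(mild_shift, reference))   # ~0.05 nats
print(kl_divergence(reward_hack, reference))  # ~3.8 nats
```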
The Role of $\beta$
The coefficient $\beta$ in the PPO objective controls a fundamental trade-off. We can rewrite the objective as a single KL-shaped reward:

$$\tilde{r}(x, y) = r_\theta(x, y) - \beta \log \frac{\pi_\phi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$

As $\beta \to 0$, we're purely maximizing reward with no constraint—likely to reward hack. As $\beta \to \infty$, we're keeping the policy identical to the reference—no learning. The art of RLHF lies in finding the sweet spot where meaningful improvement happens without catastrophic deviation.

In practice, $\beta$ is often adapted during training. Some approaches start with a high $\beta$ and anneal it down, allowing larger changes as the reward model's accuracy on the current policy distribution improves.
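One schedule of the kind described here, sketched as a simple linear anneal; the shape and values are illustrative choices, not a recommendation from the literature.

```python
# Linear annealing of the KL coefficient beta from a strong to a weak penalty.
def annealed_beta(step: int, total_steps: int,
                  beta_start: float = 0.5, beta_end: float = 0.05) -> float:
    frac = min(step / total_steps, 1.0)
    return beta_start + frac * (beta_end - beta_start)

print(annealed_beta(0, 10_000))       # 0.5  -> stay close to the reference early on
print(annealed_beta(10_000, 10_000))  # 0.05 -> allow larger deviations late in training
```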
Challenges and Limitations
Despite its success, RLHF has known limitations.
Reward model accuracy degrades off-distribution. The reward model is trained on comparisons of outputs from a particular policy. As the policy changes through optimization, it generates outputs the reward model hasn't seen, potentially leading to overconfident but incorrect reward predictions.
Human preferences are inconsistent. Different annotators have different values and preferences. Aggregating these into a single reward model necessarily loses information and may encode majority biases.
Goodhart's Law applies. "When a measure becomes a target, it ceases to be a good measure." The reward model is a proxy for human values, and optimizing hard against it can lead to high-scoring responses that don't actually reflect what humans want.
Scalable oversight remains unsolved. As AI systems become more capable, humans may struggle to accurately evaluate their outputs. How do we collect reliable preference data for responses that require expert knowledge to assess?
Looking Forward
RLHF represents a paradigm shift in how we think about training AI systems. Rather than specifying behavior through rules or demonstrations alone, we can learn models of human judgment and optimize against them. This approach has enabled the current generation of helpful, harmless AI assistants.
The Safe RLHF framework points toward a future where alignment objectives are explicitly decomposed and individually constrained. We might imagine systems with separate reward models for helpfulness, harmlessness, honesty, and other desirable properties—each with its own threshold and trade-off structure.
As models become more capable, the stakes of alignment grow higher. The mathematical foundations laid by RLHF research—preference learning, constrained optimization, distributional constraints—will likely form the basis for alignment techniques applied to future systems. Understanding these foundations isn't just academic; it's essential knowledge for anyone thinking seriously about the trajectory of AI development.
The cosmos of AI alignment is vast, and RLHF is one of our most powerful tools for navigation. By combining human judgment with mathematical rigor, we're learning to steer these systems toward outcomes that benefit humanity—one preference comparison at a time.
References
- Bai, Y., et al. (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." Anthropic.
- Dai, J., et al. (2023). "Safe RLHF: Safe Reinforcement Learning from Human Feedback."
