Reinforcement Learning from Human Feedback (RLHF)

Building AI Systems that Learn from Human Preferences and Values

What is RLHF?

RLHF is a machine learning technique that uses human feedback to improve AI model behavior. Instead of training solely on large datasets, RLHF incorporates human preferences to align AI systems with human values and expectations. This approach has been crucial in making models like ChatGPT more helpful, harmless, and honest.


The Three-Stage Process:

Stage 1: Supervised Fine-tuning

# Supervised Fine-tuning on high-quality demonstrations
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

# Load a pre-trained base model (gpt2 is used here as an open stand-in;
# production RLHF pipelines start from much larger base models)
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Prepare demonstration data
sft_dataset = [
    {"input": "How do I bake a cake?", 
     "output": "Here's a simple cake recipe: 1) Preheat oven to 350°F..."},
    {"input": "Write a professional email", 
     "output": "Subject: [Clear Subject Line]\nDear [Recipient],\n..."}
]

# Train on human demonstrations
training_args = TrainingArguments(output_dir="sft_model", num_train_epochs=3)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=sft_dataset,   # must be tokenized first (see the sketch below)
)
trainer.train()

The base model is fine-tuned on high-quality human-written examples to demonstrate desired behavior patterns.
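
Note that the Trainer expects tokenized examples rather than raw strings. One minimal way to prepare the demonstrations above, assuming the gpt2 tokenizer loaded earlier, is a sketch like this:

# Minimal sketch: tokenize demonstration pairs for causal-LM fine-tuning
def tokenize_demonstration(example, tokenizer, max_length=512):
    # Concatenate prompt and answer into a single training sequence
    text = example["input"] + "\n" + example["output"] + tokenizer.eos_token
    tokens = tokenizer(text, truncation=True, max_length=max_length)
    # For causal LM fine-tuning, the labels are the input IDs themselves
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_sft = [tokenize_demonstration(example, tokenizer) for example in sft_dataset]

In practice the loss is usually masked so it is only computed on the response tokens, but that detail is omitted here.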


Stage 2: Reward Model Training

Human annotators rank several candidate outputs for the same prompt; those rankings are used to train a reward model that predicts which responses humans prefer.

Ranking Process:
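
Concretely, each ranking is expanded into pairwise (chosen, rejected) comparisons, which is the format the training loop below consumes. The data here is illustrative; the prompt and response strings are made up for the sketch.

# Illustrative: expanding one annotator ranking into pairwise comparisons
prompt = "Explain photosynthesis to a ten-year-old."
ranked_responses = [                      # ordered best to worst by the annotator
    "Plants catch sunlight and use it to turn air and water into their food...",
    "Photosynthesis is a biochemical process occurring in chloroplasts...",
    "Plants eat dirt to grow.",
]

comparison_data = []
for i in range(len(ranked_responses)):
    for j in range(i + 1, len(ranked_responses)):
        comparison_data.append({
            "chosen": tokenizer(prompt + ranked_responses[i], return_tensors="pt").input_ids,
            "rejected": tokenizer(prompt + ranked_responses[j], return_tensors="pt").input_ids,
        })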

# Training the Reward Model
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)
    
    def forward(self, input_ids):
        # base_model is the transformer backbone (e.g. AutoModel), whose
        # outputs expose last_hidden_state
        outputs = self.base_model(input_ids)
        # Score the sequence using the final token's hidden state
        last_hidden = outputs.last_hidden_state[:, -1, :]
        reward = self.reward_head(last_hidden)
        return reward

# Comparison loss for ranking
def comparison_loss(rewards_chosen, rewards_rejected):
    return -torch.log(torch.sigmoid(rewards_chosen - rewards_rejected)).mean()

# Training loop over pairwise comparisons
for batch in comparison_data:
    rewards_chosen = reward_model(batch['chosen'])
    rewards_rejected = reward_model(batch['rejected'])
    loss = comparison_loss(rewards_chosen, rewards_rejected)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Key Insight: The reward model learns to approximate human judgment without needing explicit reward functions.


Stage 3: PPO Training

Proximal Policy Optimization (PPO) uses the reward model to further improve the language model through reinforcement learning.

PPO Objectives:
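
At the heart of PPO is a clipped surrogate objective that limits how far each update can move the policy away from the one that generated the data. The function below is a conceptual sketch (names like ratio and clip_eps are illustrative); the trl PPOTrainer used next computes this internally.

# Conceptual sketch of PPO's clipped surrogate objective (handled internally by trl)
import torch

def ppo_clipped_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    # Probability ratio between the updated policy and the data-generating policy
    ratio = torch.exp(logprobs_new - logprobs_old)
    # Clipping keeps each policy update within a trust region
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic bound and negate for gradient descent
    return -torch.min(unclipped, clipped).mean()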

# PPO Training Implementation
from trl import PPOTrainer, PPOConfig

# PPO Configuration
ppo_config = PPOConfig(
    model_name="sft_model",
    learning_rate=1.41e-5,
    batch_size=64,
    mini_batch_size=64,
    gradient_accumulation_steps=1,
    ppo_epochs=4,
    max_grad_norm=0.5,
)

# Initialize PPO trainer
# (in trl, sft_model is typically wrapped as an AutoModelForCausalLMWithValueHead)
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=sft_model,
    tokenizer=tokenizer,
    dataset=dataset
)

# Training loop
for batch in ppo_trainer.dataloader:
    # Generate responses
    query_tensors = batch["query"]
    response_tensors = ppo_trainer.generate(query_tensors)
    
    # Score each query+response pair with the reward model
    # (ppo_trainer.step expects one scalar reward tensor per sample)
    pair_tensors = [torch.cat([q, r]) for q, r in zip(query_tensors, response_tensors)]
    rewards = [reward_model(t.unsqueeze(0)).squeeze() for t in pair_tensors]
    
    # PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    
    # Log training metrics
    ppo_trainer.log_stats(stats, batch, rewards)

Balancing Act: PPO lets the policy explore higher-reward responses while keeping each update small, and in RLHF a KL penalty against the frozen SFT model keeps the policy from drifting into degenerate, reward-hacked text.
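
Conceptually, the reward that PPO optimizes is shaped roughly as below; trl's PPOTrainer applies an equivalent per-token KL penalty internally, so this is illustrative rather than something to add yourself.

# Conceptual reward shaping in RLHF (trl applies an equivalent KL penalty internally)
def shaped_reward(rm_score, policy_logprob, ref_logprob, kl_coef=0.05):
    # Penalize responses whose likelihood drifts far from the reference (SFT) model
    kl_term = policy_logprob - ref_logprob
    return rm_score - kl_coef * kl_term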


Advanced RLHF Techniques

Constitutional AI supplements human feedback with AI feedback: the model critiques its own outputs against a set of written principles (a "constitution") and revises them, which makes oversight more scalable for complex tasks.
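
As a rough illustration, a single Constitutional AI step asks the model to critique its own draft against a principle and then revise it; generate() here is a hypothetical text-generation helper and the principle text is only an example.

# Hedged sketch of a Constitutional AI critique-and-revise step
# (generate() is a hypothetical helper returning model text for a prompt)
principle = "Choose the response that is most helpful while avoiding harmful content."

def constitutional_revision(prompt, generate):
    draft = generate(prompt)
    critique = generate(
        "Critique the following response according to this principle:\n"
        f"{principle}\n\nResponse:\n{draft}"
    )
    revision = generate(
        "Rewrite the response to address the critique.\n"
        f"Critique:\n{critique}\n\nOriginal response:\n{draft}"
    )
    # (prompt, revision) pairs can then serve as AI-generated preference data
    return revision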

Emerging Approaches:

# Direct Preference Optimization (DPO) - Alternative to PPO
# Simpler approach that directly optimizes policy using preferences

def dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs, beta=0.1):
    """
    pi_logps: policy log probabilities
    ref_logps: reference model log probabilities
    yw_idxs: indices of preferred responses
    yl_idxs: indices of less preferred responses
    """
    pi_yw, pi_yl = pi_logps[yw_idxs], pi_logps[yl_idxs]
    ref_yw, ref_yl = ref_logps[yw_idxs], ref_logps[yl_idxs]
    
    pi_diff = pi_yw - pi_yl
    ref_diff = ref_yw - ref_yl
    
    loss = -torch.log(torch.sigmoid(beta * (pi_diff - ref_diff))).mean()
    return loss

# DPO is often more stable and computationally cheaper than PPO:
# no separate reward model is trained, though a frozen reference model is still required
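
To connect the loss above to a training step, here is a usage sketch; sequence_logprob() is a hypothetical helper that returns the summed log probability of each response under a given model, and the stacking convention is chosen just for this example.

# Usage sketch for dpo_loss (sequence_logprob is a hypothetical helper)
chosen, rejected = batch["chosen"], batch["rejected"]

# Stack so indices 0..B-1 hold preferred responses and B..2B-1 the dispreferred ones
pi_logps = torch.cat([sequence_logprob(policy_model, chosen),
                      sequence_logprob(policy_model, rejected)])
with torch.no_grad():  # the reference model stays frozen
    ref_logps = torch.cat([sequence_logprob(reference_model, chosen),
                           sequence_logprob(reference_model, rejected)])

B = chosen.shape[0]
yw_idxs = torch.arange(0, B)        # preferred responses
yl_idxs = torch.arange(B, 2 * B)    # dispreferred responses

loss = dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs, beta=0.1)
optimizer.zero_grad()
loss.backward()
optimizer.step()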

Key Benefits of RLHF:

RLHF lets models learn from human judgment instead of hand-engineered reward functions, improves helpfulness, harmlessness, and honesty, and, through extensions such as Constitutional AI and DPO, scales preference-based training to increasingly complex tasks. More broadly, it represents a shift toward human-centered AI development: building systems that are not just capable, but aligned with human intentions and values.