8.1 When Machines Learn from Our Preferences (RLHF)
Introduction
Once a language model has been trained to predict text, it still needs a final adjustment. This adjustment isn't about correcting factual errors; it's about reinforcing the kinds of responses that users prefer.
Activity
Human Preference Alignment
Human preference alignment allows a model to learn which style, tone, and level of detail are most appropriate, without the need for explicit rules.
How to Explore It
- A realistic question is shown.
- Two plausible responses generated by the model are presented.
- You choose the response you prefer.
- The network adjusts its internal values based on your preference.
- After several choices, responses adapt to your preferred style.
What to watch for:
The model generates multiple possible responses. Human evaluators indicate which one they prefer. With each preference, the model learns to generate responses more aligned with user expectations.
Interactive Demonstration
Human Preference Trainer
[Interactive widget: over 8 rounds, a user question and two generated responses (A and B) are shown. Choosing one adjusts the internal model (from weak to strong) and updates running scores for the reinforced criteria: Clarity, Prudence, Conciseness, and Warmth.]
Core Concepts
How Does Preference Alignment Work?
The model learns from human choices:
- Generation: The model produces several possible responses
- Comparison: Two options are presented to the human evaluator
- Preference: The human indicates which response is better
- Reinforcement: Internal values are adjusted to favor similar responses
- Adaptation: With many preferences, the model converges toward a desired style
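The loop above can be sketched as a tiny pairwise-preference update in the style of a Bradley-Terry reward model. This is an illustrative toy, not the actual training procedure: the feature values and learning rate are invented, and real systems score responses with a learned neural reward model rather than a hand-built feature list.

```python
import math

def update_reward_weights(weights, feats_a, feats_b, preferred, lr=0.1):
    """One pairwise-preference step: nudge the weights so that the
    response the human preferred receives a higher score."""
    score_a = sum(w * f for w, f in zip(weights, feats_a))
    score_b = sum(w * f for w, f in zip(weights, feats_b))
    # Probability the current model assigns to "A is preferred over B".
    p_a = 1.0 / (1.0 + math.exp(score_b - score_a))
    target = 1.0 if preferred == "A" else 0.0
    grad = target - p_a  # logistic-regression gradient on the preference
    return [w + lr * grad * (fa - fb)
            for w, fa, fb in zip(weights, feats_a, feats_b)]

# Hypothetical features: [clarity, prudence, conciseness, warmth].
weights = [0.0, 0.0, 0.0, 0.0]
feats_a = [0.9, 0.2, 0.8, 0.1]   # response A: clear and concise
feats_b = [0.3, 0.7, 0.2, 0.9]   # response B: cautious and warm
for _ in range(20):              # the evaluator keeps choosing A
    weights = update_reward_weights(weights, feats_a, feats_b, "A")
print(weights)  # clarity and conciseness weights grow; warmth shrinks
```

Each human choice moves the weights toward the features of the preferred response, which is exactly the convergence toward a desired style described in the last step.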
Criteria Being Reinforced
Preferences can reinforce different aspects:
- Clarity: Responses that are easy to understand
- Prudence: Responses that acknowledge limitations
- Conciseness: Direct responses without unnecessary elaboration
- Warmth: Friendly and empathetic tone
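The demo's running counters can be mimicked with a simple tally: each preference reinforces whichever criteria the chosen response exemplifies. The criterion tags attached to each response below are hypothetical labels for illustration only.

```python
from collections import Counter

# Hypothetical tags: which criteria each candidate response exemplifies.
responses = {
    "A": ["clarity", "conciseness"],
    "B": ["prudence", "warmth"],
}

def record_preference(tally, chosen):
    """Increment the counters for the criteria exhibited by the
    response the human chose, as in the demo's reinforced-criteria panel."""
    for criterion in responses[chosen]:
        tally[criterion] += 1
    return tally

tally = Counter({"clarity": 0, "prudence": 0, "conciseness": 0, "warmth": 0})
for choice in ["A", "A", "B", "A"]:
    record_preference(tally, choice)
print(tally["clarity"], tally["prudence"])  # 3 1
```

After a few rounds, the tally makes visible which stylistic criteria the evaluator's choices are reinforcing.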
Important
The model does not learn new facts through this process. It learns which types of responses are preferred by users. This is fundamental for aligning the model's behavior with human expectations.