8.1 When Machines Learn from Our Preferences (RLHF)
Introduction
Once a language model has been trained to predict text, it still needs a final adjustment. This adjustment isn't about correcting factual errors; it's about reinforcing the kinds of responses that users prefer.
Activity
Human Preference Alignment
Human preference alignment allows a model to learn which style, tone, and level of detail are most appropriate, without the need for explicit rules.
How to Explore It
- A realistic question is shown.
- Two plausible responses generated by the model are presented.
- You choose the response you prefer.
- The network adjusts its internal values based on your preference.
- After several choices, responses adapt to your preferred style.
What to watch for:
The model generates multiple possible responses. Human evaluators indicate which one they prefer. With each preference, the model learns to generate responses more aligned with user expectations.
Interactive Demonstration
Human Preference Trainer
[Interactive widget: over 8 rounds, a user question and two generated responses (A and B) are shown. Choosing one adjusts the internal model (from weak to strong) and updates running scores for the reinforced criteria: Clarity, Prudence, Conciseness, and Warmth.]
Core Concepts
How Does Preference Alignment Work?
The model learns from human choices:
- Generation: The model produces several possible responses
- Comparison: Two options are presented to the human evaluator
- Preference: The human indicates which response is better
- Reinforcement: Internal values are adjusted to favor similar responses
- Adaptation: With many preferences, the model converges toward a desired style
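The loop above can be sketched as a tiny pairwise-preference update in the style of a Bradley-Terry reward model. This is an illustrative toy, not the actual training procedure: the feature values and learning rate are invented, and real systems score responses with a learned neural reward model rather than a hand-built feature list.

```python
import math

def update_reward_weights(weights, feats_a, feats_b, preferred, lr=0.1):
    """One pairwise-preference step: nudge the weights so that the
    response the human preferred receives a higher score."""
    score_a = sum(w * f for w, f in zip(weights, feats_a))
    score_b = sum(w * f for w, f in zip(weights, feats_b))
    # Probability the current model assigns to "A is preferred over B".
    p_a = 1.0 / (1.0 + math.exp(score_b - score_a))
    target = 1.0 if preferred == "A" else 0.0
    grad = target - p_a  # logistic-regression gradient on the preference
    return [w + lr * grad * (fa - fb)
            for w, fa, fb in zip(weights, feats_a, feats_b)]

# Hypothetical features: [clarity, prudence, conciseness, warmth].
weights = [0.0, 0.0, 0.0, 0.0]
feats_a = [0.9, 0.2, 0.8, 0.1]   # response A: clear and concise
feats_b = [0.3, 0.7, 0.2, 0.9]   # response B: cautious and warm
for _ in range(20):              # the evaluator keeps choosing A
    weights = update_reward_weights(weights, feats_a, feats_b, "A")
print(weights)  # clarity and conciseness weights grow; warmth shrinks
```

Each human choice moves the weights toward the features of the preferred response, which is exactly the convergence toward a desired style described in the last step.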
Criteria Being Reinforced
Preferences can reinforce different aspects:
- Clarity: Responses that are easy to understand
- Prudence: Responses that acknowledge limitations
- Conciseness: Direct responses without unnecessary elaboration
- Warmth: Friendly and empathetic tone
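The demo's running counters can be mimicked with a simple tally: each preference reinforces whichever criteria the chosen response exemplifies. The criterion tags attached to each response below are hypothetical labels for illustration only.

```python
from collections import Counter

# Hypothetical tags: which criteria each candidate response exemplifies.
responses = {
    "A": ["clarity", "conciseness"],
    "B": ["prudence", "warmth"],
}

def record_preference(tally, chosen):
    """Increment the counters for the criteria exhibited by the
    response the human chose, as in the demo's reinforced-criteria panel."""
    for criterion in responses[chosen]:
        tally[criterion] += 1
    return tally

tally = Counter({"clarity": 0, "prudence": 0, "conciseness": 0, "warmth": 0})
for choice in ["A", "A", "B", "A"]:
    record_preference(tally, choice)
print(tally["clarity"], tally["prudence"])  # 3 1
```

After a few rounds, the tally makes visible which stylistic criteria the evaluator's choices are reinforcing.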
Important
The model does not learn new facts through this process. It learns which types of responses are preferred by users. This is fundamental for aligning the model's behavior with human expectations.