8.1 When Machines Learn from Our Preferences (RLHF)

Introduction

Once a language model has been trained to predict text, it still needs a final adjustment. This adjustment isn't about correcting factual errors; it's about reinforcing the types of responses that users prefer.

🏢

Activity

Human Preference Alignment

Human preference alignment allows a model to learn which style, tone, and level of detail are most appropriate, without the need for explicit rules.

How to Explore It

  1. 📋 A realistic question is shown.
  2. 🔄 Two plausible responses generated by the model are presented.
  3. 👆 You choose the response you prefer.
  4. 🧠 The network adjusts its internal values based on your preference.
  5. 📊 After several choices, responses adapt to your preferred style.

What to watch for: the model generates multiple candidate responses, and human evaluators indicate which one they prefer. With each recorded preference, the model learns to generate responses more aligned with user expectations. The toy sketch below mirrors this loop in code.
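
Here is a minimal Python sketch of that update loop. The criterion names mirror the demo's panel, but the response tags and the ±1 scoring are illustrative assumptions, not the demo's actual implementation:

```python
# Hypothetical criteria tracked by the demo; the names mirror the panel below.
CRITERIA = ["clarity", "prudence", "conciseness", "warmth"]

# Internal "values": one weight per criterion, all starting neutral (0).
weights = {c: 0 for c in CRITERIA}

# Each candidate response is tagged with the criteria it exhibits.
# The tags here are invented for illustration, not produced by a real model.
candidates = [
    {"text": "Short, friendly answer.", "criteria": ["conciseness", "warmth"]},
    {"text": "Careful answer that notes its limits.", "criteria": ["clarity", "prudence"]},
]

def record_preference(chosen, rejected):
    """Reinforce the chosen response's criteria and weaken the rejected one's."""
    for c in chosen["criteria"]:
        weights[c] += 1
    for c in rejected["criteria"]:
        weights[c] -= 1

# Simulate one round: the user prefers candidate A over candidate B.
a, b = candidates
record_preference(a, b)
print(weights)  # {'clarity': -1, 'prudence': -1, 'conciseness': 1, 'warmth': 1}
```

After many such rounds, responses tagged with the reinforced criteria are favored, which is what the demo visualizes as its criteria bars moving from weak to strong.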

Interactive Demonstration

Human Preference Trainer

[Interactive trainer, 8 rounds: each round shows a user question (💬) and two generated responses (🤖, labeled A and B); you choose the one you prefer, and an internal-model panel (🧠) shows each reinforced criterion (💡 Clarity, ⚠️ Prudence, ✂️ Conciseness, ❤️ Warmth) moving from weak through neutral to strong.]

Core Concepts

How Does Preference Alignment Work?

The model learns from human choices; the sketch after this list shows the core update:

  1. Generation: The model produces several possible responses
  2. Comparison: Two options are presented to the human evaluator
  3. Preference: The human indicates which response is better
  4. Reinforcement: Internal values are adjusted to favor similar responses
  5. Adaptation: With many preferences, the model converges toward a desired style
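
One common way to implement the reinforcement step is to train a reward model on pairwise comparisons with a Bradley-Terry style loss, -log sigmoid(r_chosen - r_rejected). The sketch below assumes a hypothetical linear reward over hand-picked feature vectors; real systems instead fine-tune a neural network on many human comparisons:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(w, features):
    """Hypothetical linear reward model: score = dot(w, features)."""
    return sum(wi * fi for wi, fi in zip(w, features))

def preference_step(w, chosen, rejected, lr=0.1):
    """One gradient step on the pairwise loss -log sigmoid(r_chosen - r_rejected)."""
    margin = reward(w, chosen) - reward(w, rejected)
    grad_scale = 1.0 - sigmoid(margin)  # shrinks as the model grows confident
    return [wi + lr * grad_scale * (c - r)
            for wi, c, r in zip(w, chosen, rejected)]

# Invented feature vectors (clarity, prudence, conciseness, warmth).
chosen   = [0.9, 0.7, 0.2, 0.8]  # the response the human preferred
rejected = [0.4, 0.1, 0.9, 0.2]

w = [0.0] * 4
for _ in range(100):
    w = preference_step(w, chosen, rejected)
print([round(wi, 2) for wi in w])  # weights drift toward the preferred features
```

Each step raises the reward margin of the preferred response. In full RLHF, the language model is then fine-tuned (for example with PPO) to produce responses that score highly under this learned reward.
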
Criteria Being Reinforced

Preferences can reinforce different aspects (the sketch after this list shows what a signal for each might look like):

  • Clarity: Responses that are easy to understand
  • Prudence: Responses that acknowledge limitations
  • Conciseness: Direct responses without unnecessary elaboration
  • Warmth: Friendly and empathetic tone
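
A trained reward model learns these criteria implicitly from data, but hand-written heuristics can illustrate what a per-criterion signal might look like. The phrase lists and thresholds below are invented purely for illustration:

```python
# Hypothetical heuristics mapping a response to the four criteria above.
HEDGES = ("may", "might", "not sure", "depends", "as far as i know")
WARM_WORDS = ("glad", "happy to help", "thanks", "great question")

def criterion_features(text: str) -> dict:
    lowered = text.lower()
    words = lowered.split()
    return {
        # Clarity: short average word length as a crude readability proxy.
        "clarity": 1.0 if sum(map(len, words)) / max(len(words), 1) < 6 else 0.0,
        # Prudence: hedging language that acknowledges limitations.
        "prudence": 1.0 if any(h in lowered for h in HEDGES) else 0.0,
        # Conciseness: fewer than 30 words.
        "conciseness": 1.0 if len(words) < 30 else 0.0,
        # Warmth: friendly phrasing.
        "warmth": 1.0 if any(w in lowered for w in WARM_WORDS) else 0.0,
    }

print(criterion_features("Happy to help! The answer may depend on your setup."))
# {'clarity': 1.0, 'prudence': 1.0, 'conciseness': 1.0, 'warmth': 1.0}
```
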
Important

The model does not learn new facts through this process; it learns which types of responses users prefer. This distinction is fundamental to aligning the model's behavior with human expectations.

Jan 22, 2024