Explainer
RLHF Explained
How human preferences steer a model's answers.
Reinforcement Learning from Human Feedback (RLHF) is a technique for teaching a pretrained language model to be helpful and well-behaved. Human raters rank competing responses to the same prompt, and those rankings train a reward model: a separate model that predicts which response people would prefer.
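As a rough illustration of how rankings can become a reward model, the sketch below fits a tiny linear scorer to preference pairs using a pairwise logistic (Bradley-Terry style) loss. The two-number feature vectors, the function names, and the linear model are all assumptions made for this example; real reward models are neural networks that score full text.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def reward(weights, features):
    # Hypothetical linear reward model: a weighted sum of response features.
    return sum(w * x for w, x in zip(weights, features))

def pairwise_loss(r_chosen, r_rejected):
    # -log sigmoid(r_chosen - r_rejected): small when the preferred
    # response already scores higher than the rejected one.
    diff = r_chosen - r_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    # Plain gradient descent on the pairwise loss over (chosen, rejected) pairs.
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = reward(weights, chosen) - reward(weights, rejected)
            scale = 1.0 - sigmoid(diff)  # magnitude of the loss gradient
            for i in range(dim):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights

# Toy data (invented): each response is [helpfulness, rudeness],
# and the first item of each pair is the one the rater preferred.
toy_pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.3, 0.6]),
]
learned = train_reward_model(toy_pairs, dim=2)
print("learned weights:", learned)
```

The loss only looks at the difference between two scores, which matches the data: raters say which response is better, not how good each one is on an absolute scale.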
The language model is then fine-tuned with reinforcement learning to produce responses the reward model scores highly. This is the step that turns a capable-but-unruly text predictor into something that reliably answers the question you actually asked.
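The optimization step can be pictured with a deliberately small sketch. Here the "policy" is just a softmax over three fixed candidate responses, and a plain policy-gradient update (a simplified stand-in for the more elaborate optimizers used in practice) shifts probability toward whichever candidate the reward model scores highest. The reward values and function names are invented for illustration.

```python
import math

def softmax(logits):
    # Convert raw preferences into a probability distribution.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def optimize_policy(rewards, lr=0.5, steps=200):
    # Gradient ascent on expected reward. For a softmax policy the exact
    # gradient with respect to logit i is p_i * (r_i - expected_reward).
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        for i in range(len(logits)):
            logits[i] += lr * probs[i] * (rewards[i] - expected)
    return softmax(logits)

# Hypothetical reward-model scores for three candidate responses.
candidate_rewards = [0.1, 1.0, -0.5]
print("policy after optimization:", optimize_policy(candidate_rewards))
```

The policy starts out uniform and ends up concentrating on the highest-scoring response. A real language model has far too many possible outputs to enumerate, so in practice the same idea is applied to sampled responses rather than an exact expectation.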
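RLHF is powerful but imperfect. Because the model is optimized against a learned stand-in for human approval rather than approval itself, it can become overly cautious or prone to telling people what they want to hear, which is why newer methods keep refining the recipe.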