
What is RLHF?

Reinforcement Learning from Human Feedback: training a model based on human preference rankings of outputs.

By Anish · Founder · Vedwix

Definition

RLHF trains a model to align with human preferences. After supervised fine-tuning (SFT), humans rank multiple model outputs for the same prompt, a reward model is trained to predict those rankings, and the LLM is then fine-tuned with reinforcement learning to maximize that reward. RLHF and alternatives like DPO are how frontier models get their helpfulness and safety behavior.
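
To make the reward-model step concrete, here is a minimal sketch (not OpenAI's or anyone's production pipeline) of training a scalar reward head on human preference pairs with the standard pairwise Bradley-Terry loss. The model size, embeddings, and batch are illustrative placeholders.

```python
# Minimal reward-model training step for RLHF (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a pooled sequence embedding to a scalar reward."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, pooled_embedding: torch.Tensor) -> torch.Tensor:
        return self.head(pooled_embedding).squeeze(-1)

def pairwise_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: push the preferred response's reward above the rejected one's.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

model = RewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Stand-ins for encoded (prompt, response) pairs labeled by human rankers.
chosen = torch.randn(8, 768)    # embeddings of preferred responses
rejected = torch.randn(8, 768)  # embeddings of dispreferred responses

loss = pairwise_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
```

The trained reward model then scores new LLM outputs, and the policy is fine-tuned (typically with PPO) to maximize that score while staying close to the SFT model.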

Example

OpenAI's post-training pipeline for GPT-4 uses RLHF extensively to align the model with human preferences.

How Vedwix uses RLHF in client work

RLHF is rare in our client work because it needs scale: large human preference datasets and significant compute. For smaller alignment tasks we occasionally use DPO instead, which learns directly from preference pairs without a separate reward model; a sketch follows below.
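
For reference, a minimal, illustrative sketch of the DPO loss (Rafailov et al., 2023), not our production code. The inputs are per-example sequence log-probabilities under the policy being trained and a frozen reference model; beta and the dummy values are placeholders.

```python
# Direct Preference Optimization loss on a batch of preference pairs (illustrative).
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    # Each response's implicit reward is beta * (policy logp - reference logp);
    # the loss pushes the chosen response's implicit reward above the rejected one's.
    chosen_rewards = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_rewards = beta * (policy_logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```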

Building with RLHF?

We ship this.

If you're building with RLHF in production, we can help — from architecture review to full implementation.

Brief us

Working on an RLHF project?

Brief Vedwix in three sentences or fewer.

Start a project