OpenAI, Microsoft and every other company that makes the smartest models still use RLHF, despite the OSS community preferring DPO for its ease of use.

RLHF seems to work really well for teaching models reasoning, arithmetic and factual grounding, and for avoiding the shortfalls of plain SFT.

DPO really seems like a band-aid. Every single model we know of that is smart enough to be used in production by actual people has used RLHF or RLAIF.

LLaMa-3-70B: RLHF
Wizard-2-8x22B: RLEIF
GPT-4: RLHF and RLAIF
Claude-3: RLHF
Gemini: RLHF
Deepseek V2: RLHF

I think the current problem with OSS models is that we are missing our own RLHF pipeline. The DPO paper and its follow-ups claim DPO is comparable or better, yet the fact remains that every single model with the strongest reasoning has used one form of RL or another.
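
For anyone who hasn't looked at why DPO is so much easier to run: it collapses the whole RLHF pipeline into a single supervised loss on preference pairs, no separate reward model and no PPO rollouts. Rough sketch of the loss (variable names are mine, not from any specific repo):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss on a batch of preference pairs.

    Each argument is the summed log-prob of the chosen/rejected completion
    under the policy being trained or the frozen reference model.
    """
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (chosen margin - rejected margin))
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()
```

Convenient, sure, but it's entirely offline: the model never explores and gets scored on its own generations, which is exactly the part of RLHF the strongest models seem to benefit from.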

Also, it's way easier to make uncensored models with RLHF: you just turn the safety weighting off. No, really, you can literally zero it out in the code for the reward model.
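
To be concrete about what "turn it off" means: reward setups typically score a response on more than one axis (helpfulness, safety, and so on) and combine them with weights. The structure below is a made-up illustration, not any particular lab's code, but the combination step usually comes down to a weighted sum like this:

```python
import torch.nn as nn

class CompositeRewardModel(nn.Module):
    """Toy reward model: shared LM encoder with one scoring head per axis.

    Hypothetical layout for illustration; assumes a HuggingFace-style
    backbone that returns last_hidden_state.
    """
    def __init__(self, encoder, hidden_size, safety_weight=1.0):
        super().__init__()
        self.encoder = encoder
        self.helpfulness_head = nn.Linear(hidden_size, 1)
        self.safety_head = nn.Linear(hidden_size, 1)
        # Set this to 0.0 and the safety head stops shaping the policy at all.
        self.safety_weight = safety_weight

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state[:, -1]
        helpfulness = self.helpfulness_head(h).squeeze(-1)
        safety = self.safety_head(h).squeeze(-1)
        return helpfulness + self.safety_weight * safety
```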

And RLHF doesn't have to lobotomise the model either: turn off the safety term and train the reward model to rank creative or "human like" responses highly. It's literally that simple.
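
And "train the reward model to rank them highly" is just the usual pairwise ranking objective; the only thing that changes is what the preference data rewards. A minimal sketch, assuming batches of chosen/rejected pairs (the key names are mine):

```python
import torch.nn.functional as F

def reward_ranking_loss(reward_model, batch):
    """Bradley-Terry style pairwise loss: push the preferred (more creative,
    more "human like") response above the rejected one."""
    r_chosen = reward_model(batch["chosen_ids"], batch["chosen_mask"])
    r_rejected = reward_model(batch["rejected_ids"], batch["rejected_mask"])
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Swap the preference data from safety labels to people (or an LLM judge) ranking on style, and you get the "human like" reward without touching anything else in the pipeline.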

So unless we take the initiative and start running our own RLHF, we are going to be dependent on corporations to do it for us, and they will just leave OSS behind once they can make a monetizable product. And don't expect them to give us uncensored models either.