
OpenAI researchers trained a model with reinforcement learning on realistic conversations testing traits like truthfulness and corrigibility. A small share of this data mixed into the regular pipeline improved the model on 44 of 53 benchmarks. Adversarial prompts and harmful fine-tuning had far less effect on the trained model.
Tap to vote and see what everyone thinks.
Summary by ByteBrief
AI's Next Battle Is Distribution, Not Creation