
Auriel Wright, a Gemini RL engineer, details how flawed RL training harnesses ruin training runs by introducing noise and incorrect learning signals. She lists four core failures: not reading trajectories, lacking domain experts, missing economic tradeoffs, and poor environment quality. The training harness must be reliable, interactive, and free of race conditions or broken code. Left unaddressed, these issues lead researchers to discard training runs and to build broken software that misleads model development.