
Docker tested 21 models in a real agent loop across 3,570 tests. GPT-4 scored 0.974 and Qwen3 14B scored 0.971, while Llama 3.3 70B scored only 0.607. A 14B model far outscoring a 70B one suggests that a local model's tool-calling ability matters more than its parameter count for agent performance.