The author tested Gemini, Refine, Claude, and ChatGPT Pro on four published economics papers with known errors. ChatGPT Pro performed best, occasionally constructing counterexamples and corrected proofs. No model located a true error without substantial human guidance. The author argues a competent human paired with a frontier model can outperform current peer review.