PaperBench is a benchmark that evaluates whether AI agents can replicate state-of-the-art AI research from scratch. Agents are tasked with reproducing papers, which requires understanding each paper’s contributions, building the codebase, and correctly executing the experiments.
I had the opportunity to contribute to the construction of PaperBench and to work alongside OpenAI during its development.
Thank you to everyone involved for such a rewarding collaboration.
Tim