AI scores a ‘C–’ on its hardest math test yet

The best-yet test of artificial intelligence’s mathematical mettle has released its first official round of results. The verdict is that large language models (LLMs) are emerging as useful—albeit deeply flawed—assistants for math research.

Organized by a team of top mathematicians, the “First Proof” project is a response to AI companies’ growing fixation on using advanced math as a benchmark for their products—regardless of whether those metrics reflect the problems professional mathematicians actually care about. Results of a pilot round in February were mixed, with companies’ opaque, internal efforts vastly outperforming their public models.

This latest batch of tests involves a broader range of math problems and more rigorous protocols for its participants—to which only OpenAI and a trio of academic groups agreed. The results were again mixed, with six to seven of the 10 problems answered essentially correctly by at least one AI. Although peak performance continues to improve, the models also churn out copious amounts of garbage as a by-product, requiring heroic interventions to sift sense from slop.
