When artificial intelligence systems began acing long-standing academic assessments, researchers realized they had a problem: the tests were too easy. Popular evaluations such as the Massive Multitask Language Understanding (MMLU) benchmark, once considered formidable, no longer meaningfully challenge advanced AI systems.

To address this gap, a global consortium of nearly 1,000 researchers, including a Texas A&M University professor, created something different—an exam so broad, so challenging and so deeply rooted in expert human knowledge that current AI systems consistently fail it.

"Humanity's Last Exam" (HLE) introduces a 2,500-question assessment spanning mathematics, humanities, natural sciences, ancient languages and highly specialized subfields. The team's work is outlined in a paper published in Nature with documentation from the project available at lastexam.ai.