For years, artificial intelligence (AI) researchers have dreamed of developing tools that could supercharge science by posing novel questions, designing experiments, and perhaps even carrying them out. In recent months, large language models (LLMs) have made discoveries that some AI developers claim have inched us closer to that future. But how do you test whether an AI model can truly do science?

For answers, researchers turn to benchmarks: standardized sets of questions or tasks that help assess an AI model’s capabilities and compare it against other models. But the complexity of science makes assessing models’ aptitude for it especially challenging. As Hao Peng, a computer scientist at the University of Illinois Urbana-Champaign, puts it: “Models have all this knowledge. Do they know how to use it?”