Large language models (LLMs) are artificial intelligence (AI) algorithms trained on vast amounts of data to learn patterns that enable them to generate human-like responses. Reasoning models are LLMs with the added capability of working through problems step by step before responding, mirroring structured thinking. Such AI systems have performed well on assessments of medical knowledge, but whether they can match physician-level clinical reasoning on authentic diagnostic tasks has remained largely unknown. On page 524 of this issue, Brodeur et al. (1) demonstrate that AI can now seemingly match or exceed physician-level clinical diagnostic reasoning on text-based scenarios, benchmarking model outputs against the performance of human physicians on clinical vignettes and real-world emergency cases. The findings indicate an urgent need to understand how these tools can be safely integrated into clinical workflows and a readiness for prospective evaluation alongside clinicians.