AI may excel at medical multiple-choice tests, but it still struggles with the dynamic, nuanced decision-making required in real-world care. New research shows that large language models are not yet capable of the flexible clinical reasoning clinicians rely on every day.
Neurology resident Liam McCoy (University of Alberta), who also collaborates with MIT and Harvard’s Beth Israel Deaconess Medical Center, evaluated how leading AI models interpret symptoms, integrate new information and adjust diagnostic pathways. He concludes that advanced models frequently misjudge uncertainty, overestimate the relevance of trivial details and fail to update their conclusions when patient information changes.
Paradoxically, some tuning methods designed to make AI “more helpful” have amplified this overconfidence. “We’re nowhere near a point where a patient could walk into a room, switch on a language model and expect a safe clinical assessment,” says McCoy. “There’s far more to medicine than factual recall.” The study was recently published in The New England Journal of Medicine.
Benchmarking AI with tools from medical education
To better measure diagnostic flexibility, McCoy and colleagues developed concor.dance, an AI benchmark inspired by script concordance testing, a method used worldwide to assess how medical trainees navigate clinical ambiguity. The team tested 10 major AI models across scenarios ranging from surgery and pediatrics to psychiatry and emergency medicine, using scripts from multiple countries.
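To make the scoring idea concrete, here is a minimal sketch of aggregate scoring as used in classical script concordance testing, in which partial credit is proportional to the share of an expert panel that chose the same answer. The exact rubric of concor.dance is not described in this article, so the Likert scale and scoring function below are illustrative assumptions rather than the benchmark's actual code.

```python
from collections import Counter

# Likert options typical of script concordance tests:
# -2 = the new finding rules out the hypothesis ... +2 = strongly supports it
LIKERT = (-2, -1, 0, 1, 2)

def sct_item_score(response: int, panel_responses: list[int]) -> float:
    """Aggregate scoring: credit is proportional to the fraction of expert
    panelists who chose the same option, normalized so the modal (most
    popular) answer earns full credit."""
    assert response in LIKERT
    counts = Counter(panel_responses)
    modal_count = max(counts.values())
    return counts.get(response, 0) / modal_count

# Example: 10 panelists judged how a new finding changes a working diagnosis.
panel = [1, 1, 1, 1, 1, 0, 0, 2, 2, -1]
print(sct_item_score(1, panel))   # 1.0 -- matches the modal expert answer
print(sct_item_score(2, panel))   # 0.4 -- partial credit
print(sct_item_score(-2, panel))  # 0.0 -- no panelist chose this
```

Because credit depends on how experts distribute their judgments rather than on a single correct answer, this style of scoring rewards calibrated uncertainty, which is exactly where the tested models fell short.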
The results were consistent: while AI could match the performance of junior medical students, it fell short of senior residents and attending physicians. Models particularly struggled with “red herrings”, irrelevant details that experienced clinicians quickly set aside. Instead of dismissing the noise, the systems attempted to justify it, often producing confident but incorrect reasoning.
A gap between exam skills and real-world judgment
The findings underscore that clinical reasoning is a distinct competence, not easily replicated by pattern-matching algorithms. High scores on standardized tests, whether by students or AI models, do not equate to proficiency in real-time diagnostic decision-making.
McCoy sees this as a call to action rather than a dead end. With AI rapidly entering clinical workflows, from documentation to imaging support, he argues that the medical community must shape its evolution. “We have a responsibility to ensure these systems are effective, equitable and aligned with patient needs.”
As he continues his neurology training, McCoy plans to expand his evaluation framework and collaborate with major tech companies. His objective: ensure that future AI tools enhance clinical reasoning rather than disrupt it. “In medicine, you can’t ‘move fast and break things.’ But you also can’t afford to stand still.”
AI and diagnostic accuracy
The rise of AI in medical diagnostics seems unstoppable, and the technology has proven its added value time and again, although caution is still advised. From day one, doctors, researchers and scientists have maintained that AI tools will never replace doctors, radiologists and other medical professionals, but only support them.
We saw how well that support can work a few months ago with the arrival of Microsoft's AI Diagnostic Orchestrator (MAI-DxO). This innovative technology simulates the thought process of a multidisciplinary team of doctors and has proven capable of making complex medical diagnoses with a high degree of accuracy. According to Microsoft, it is up to four times more accurate than doctors.
MAI-DxO works according to a “chain-of-debate” method: a sequence of reasoning steps in which the AI actively collects information, requests targeted tests and adjusts its hypotheses. To do this, Microsoft uses multiple AI models (such as GPT, Claude, Gemini and Llama) that operate together as a single intelligent agent system. The result is not only greater accuracy, but also transparency, time savings and a reduction in diagnostic costs of approximately 20 per cent.
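MAI-DxO's implementation is not public, so as a rough illustration, the sketch below mimics the orchestration pattern described above: a panel of models takes turns proposing, challenging and ordering tests while a shared case state and a cost budget are updated. The query_model placeholder and the role names are assumptions standing in for real API calls and prompts, not Microsoft's code.

```python
import itertools

MODELS = ["gpt", "claude", "gemini", "llama"]  # the debating panel

def query_model(model: str, role: str, findings: list[str]) -> str:
    """Placeholder for a real API call to one of the panel models."""
    return f"{model} ({role}): reasoned over {len(findings)} findings"

def run_debate(initial_findings: list[str], rounds: int = 3,
               test_budget: int = 5) -> tuple[list[str], list[str]]:
    """Iterative debate: models take turns proposing, challenging and
    ordering tests, updating a shared case state under a cost budget."""
    findings = list(initial_findings)
    roles = itertools.cycle(["propose", "challenge", "order-test"])
    transcript = []
    for _ in range(rounds):
        for model in MODELS:
            role = next(roles)
            transcript.append(query_model(model, role, findings))
            # An ordered test feeds a new finding back into the loop and
            # consumes budget, mirroring how the orchestrator is said to
            # balance diagnostic accuracy against cost.
            if role == "order-test" and test_budget > 0:
                findings.append("result of targeted test")
                test_budget -= 1
    return transcript, findings

transcript, final_findings = run_debate(["fever", "stiff neck"])
print("\n".join(transcript))
```

Each ordered test adds evidence that later turns can build on, which is the loop that lets such a system revise its hypotheses rather than commit to an early answer.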