12 most accurate AI models for medical questions

Mon 23 March 2026
AI
News

Researchers examined which artificial intelligence models can be trusted with health-related questions. Even top LLMs still risk severe harm in up to 22.2% of cases, with 76.6% of errors coming from "errors of omission": failing to order necessary tests or to ask about symptoms.

Patients and doctors use AI, but not always the best one

Every day, about 40 million people ask ChatGPT about health. On Google, such queries reach one billion, with an increasing share of answers generated by artificial intelligence. Doctors also use these tools, most often unofficially. The phenomenon has a name: shadow AI, the use of generative systems outside formal approval, sometimes on personal smartphones. Studies suggest that one in five physicians in the United States does so.

When people seek health information, convenience often dictates which tool they choose. Some rely on Google, while others are shifting toward ChatGPT or Gemini. The quality and precision of medical responses, however, vary significantly across models.

A team of researchers from Stanford University, Harvard, and several other academic institutions examined this question through the NOHARM project, short for Numerous Options Harm Assessment for Risk in Medicine, which evaluates the quality of AI-generated medical advice.

Until now, most assessments of medical AI have focused on whether systems can pass licensing exams. Many newer models do so with ease, scoring above 90 percent. But, as the researchers note, passing an exam is not the same as diagnosing a patient or determining treatment.

To better reflect real-world practice, the team built a dataset of 100 clinical consultations based on questions primary care physicians submitted to specialists through Stanford Health Care’s electronic consultation system. The scenarios included decisions such as ordering tests, referring patients to specialists, or directing them to emergency care. Twenty-nine board-certified physicians reviewed the cases, producing more than 12,000 evaluations of clinical decisions.

A surprising leader emerges

The researchers tested 31 AI tools, including general-purpose systems such as ChatGPT, Gemini, and Copilot, as well as specialized medical platforms.

The top performer was AMBOSS LiSA 1.0, a system grounded in a medical knowledge base. It achieved an accuracy score of 62.3 percent, meaning its recommendations aligned with expert judgment in more than six out of ten cases. The tool is a paid platform for clinicians, used by more than 1 million health professionals in over 180 countries and at more than 50 medical schools. In Europe, it has been adopted by large hospital networks such as HELIOS.

Close behind were general-purpose models: Gemini 2.5 Pro at 59.9 percent, GPT-5 at 58.3 percent, and Claude Sonnet 4.5 at 58.2 percent. The medical model Glass Health 4.0 also performed strongly at 59 percent. Smaller language models trailed, scoring between 42 and 49 percent.

Still, the differences at the top were often marginal, sometimes amounting to fractions of a percentage point. More revealing was how the systems balanced caution and patient safety. In theory, more conservative models should reduce the risk of harmful recommendations. In practice, excessive caution can also pose risks: in 22 percent of cases, patients could have faced serious harm, and in 77 percent of those instances the issue was not incorrect advice but the failure to recommend an action that should have been taken.

Outperforming doctors within limits

The researchers also compared AI performance with that of internists. The top-ranked model outscored the physicians by more than 15 percentage points overall and by more than 10 points in patient safety. This is not the first study to show such results.

That does not mean AI can replace doctors. Clinical care involves far more than processing information. It requires physical examination, context-shaping judgment, and human empathy.

The study also explored multi-agent systems, in which one AI model makes a recommendation, and others review it as a second opinion. These collaborative setups achieved safety scores up to six times higher than those of single-model systems. The strongest results came from combinations of different model types, such as open-source systems paired with commercial language models and medical knowledge bases.
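
As a rough illustration of the idea, and not the study's actual pipeline, such a recommend-and-review loop can be sketched in a few lines of Python. Everything here is a placeholder: `query_model` stands in for whatever chat-completion API a given provider exposes, and the prompts are invented for the example.

```python
def query_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper around a chat-completion API call.
    Replace with a real client (OpenAI, Anthropic, a local model, etc.)."""
    raise NotImplementedError("plug in a real model client here")


def multi_agent_consult(case: str, recommender: str, reviewers: list[str]) -> str:
    # Step 1: one model drafts a clinical recommendation.
    draft = query_model(
        recommender,
        f"Clinical case:\n{case}\n\nRecommend the next steps."
    )

    # Step 2: each reviewer model independently critiques the draft,
    # looking especially for omissions (missed tests, referrals, red flags),
    # since omissions accounted for most serious errors in the study.
    critiques = [
        query_model(
            reviewer,
            f"Case:\n{case}\n\nProposed plan:\n{draft}\n\n"
            "List anything unsafe or missing, or reply APPROVED."
        )
        for reviewer in reviewers
    ]

    # Step 3: the original model revises its plan in light of the feedback.
    feedback = "\n\n".join(critiques)
    return query_model(
        recommender,
        f"Case:\n{case}\n\nDraft plan:\n{draft}\n\n"
        f"Reviewer feedback:\n{feedback}\n\nRevise the plan accordingly."
    )
```

Since the strongest results in the study came from mixing model types, the `reviewers` list would ideally combine, say, an open-source model, a commercial one, and a knowledge-base-grounded medical tool rather than several copies of the same system.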

The findings suggest a clear conclusion: not all AI models are equally suited to answering health questions, and passing an exam is not a reliable proxy for clinical performance. For AI to be used safely in medicine, its use must be formally regulated and based on systems trained on high-quality medical data. Otherwise, reliance on general-purpose tools risks not only lower accuracy but also potential breaches of data security and patient privacy.

Top 12 models for medicine according to the NOHARM study (accuracy):

  1. AMBOSS LiSA 1.0 62.3%
  2. Gemini 2.5 Pro 59.9%
  3. Glass Health 4.0 59.0%
  4. GPT-5 58.3%
  5. Gemini 2.5 Flash 58.2%
  6. Claude Sonnet 4.5 58.2%
  7. DeepSeek R1 58.1%
  8. Grok 4 58.0%
  9. DeepSeek V3.1 57.7%
  10. Claude 3.7 Sonnet 57.6%
  11. Grok 4 Fast 57.2%
  12. GPT-5 mini 57.0%