As AI becomes increasingly embedded in healthcare, concerns remain about one of its most persistent weaknesses: hallucinations. These are responses in which AI systems confidently generate incorrect or entirely fabricated information. In medical settings, such errors can have serious consequences, particularly as patients increasingly turn to chatbots for information about symptoms, medications and treatment options.
Researchers at Binghamton University have now developed a new verification protocol that significantly reduces the risk of medical hallucinations by combining the outputs of multiple AI models. The findings, published in STAR Protocols, suggest that collaborative AI systems could improve the reliability of future healthcare applications.
Persistent AI problem
Large language models (LLMs) have demonstrated impressive capabilities in interpreting medical terminology and answering health-related questions. However, previous research has shown that even highly advanced models can generate inaccurate diagnoses, fabricated references or incorrect biomedical information.
The new study builds on earlier work in which researchers found that AI systems performed well in recognising diseases, medications and genetic concepts, but still produced a substantial number of hallucinations.
To address this challenge, researcher Ahmed Abdeen Hamed and professor Luis M. Rocha developed a multi-model verification approach designed to cross-check medical information before it is accepted as valid. Rather than relying on a single AI model, the system consults several independent models and compares their conclusions before generating a final answer.
Seven models, one decision
The protocol uses seven open-source large language models operating simultaneously. Each model receives the same patient symptom description written in everyday language. Before generating a response, every model is required to use retrieval-augmented generation (RAG), a technique that forces the AI to consult authoritative medical databases rather than relying solely on information embedded in its training data.
The models then independently translate the symptoms into recognised medical concepts and assign corresponding clinical identifiers. Once each model has produced its interpretation, the responses are compared through a voting mechanism.
Across more than 10,000 experiments, the system produced striking results. Approximately 77 percent of outputs were supported by at least four of the seven models, while the remaining outputs were supported by at least two models. According to the researchers, no unsupported medical terms were generated and no hallucinations were observed in the final validated output. The approach effectively turns multiple AI systems into a form of peer-review network, in which agreement between independent models increases confidence in the result.
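The voting step can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the function name `vote`, the tier labels, the model names and the identifiers such as `C001` are hypothetical placeholders, and the two thresholds are assumptions taken from the support levels reported above (at least four of seven models for a majority, at least two for acceptance).

```python
from collections import Counter

# Assumed thresholds, based only on the figures reported in the article.
MAJORITY = 4      # "at least four of the seven models"
MIN_SUPPORT = 2   # lowest support level reported for accepted outputs


def vote(model_outputs, min_support=MIN_SUPPORT, majority=MAJORITY):
    """Tally identifiers proposed by each model and keep the supported ones.

    model_outputs maps a model name to the set of concept identifiers
    that model assigned to the same symptom description.
    Returns a dict: identifier -> (number of supporting models, tier label).
    """
    counts = Counter()
    for identifiers in model_outputs.values():
        # set() ensures each model votes at most once per identifier.
        counts.update(set(identifiers))

    accepted = {}
    for identifier, votes in counts.items():
        if votes >= majority:
            accepted[identifier] = (votes, "majority")
        elif votes >= min_support:
            accepted[identifier] = (votes, "minority")
        # Identifiers proposed by a single model are discarded as unsupported.
    return accepted


# Hypothetical outputs from seven models; the identifiers are placeholders,
# not real clinical codes.
outputs = {
    "model_a": {"C001", "C002"},
    "model_b": {"C001"},
    "model_c": {"C001", "C003"},
    "model_d": {"C001", "C002"},
    "model_e": {"C004"},
    "model_f": {"C001"},
    "model_g": {"C002"},
}

result = vote(outputs)
for identifier in sorted(result):
    print(identifier, result[identifier])
```

In this example `C001` is proposed by five models and `C002` by three, so both are kept, while `C003` and `C004` each come from a single model and are dropped. Discarding single-model proposals is what keeps an isolated hallucination out of the final answer in this sketch.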
Beyond symptom checking
While the study focused on biomedical terminology and disease identification, the researchers believe the framework has much broader potential. The protocol can be applied across multiple layers of healthcare knowledge, including genetics, disease mechanisms, treatment pathways and clinical trial data. It may also help validate information related to adverse drug reactions by comparing evidence from scientific literature, pharmacological databases, clinical studies and other data sources.
According to Rocha, the method could support the development of increasingly sophisticated disease models and digital twins—virtual representations of patients that combine biological, clinical and behavioural data to predict treatment outcomes. The research team has already begun applying similar approaches to multi-layer disease modelling, including studies involving hormone receptor-positive breast cancer.
Trustworthy AI in healthcare
One of the key strengths of the protocol is its scalability. As additional open-source language models become available, researchers can continuously repeat the process using different combinations of AI systems, further strengthening confidence in the results. The implications extend beyond medicine. The same methodology could potentially be used to reduce hallucinations in legal research, academic literature reviews and historical analysis, where fabricated citations and factual inaccuracies remain a significant challenge.
As healthcare organisations increasingly explore generative AI for clinical decision support, patient engagement and knowledge management, the need for reliable verification mechanisms continues to grow. The Binghamton researchers argue that multi-agent AI validation may represent an important step toward safer and more trustworthy use of artificial intelligence in healthcare. By replacing single-model answers with consensus-based verification, the approach could help move AI-assisted medicine closer to the levels of accuracy and accountability required in real-world clinical practice.
Multimodal reasoning capability
Last year, we wrote an in-depth article about the hallucination challenge of using AI chatbots in healthcare. GPT-5 from OpenAI represents a significant advance in the use of generative AI for healthcare. Compared with previous models, it delivers more accurate medical responses, substantially fewer hallucinations, and stronger diagnostic performance. In evaluations using the HealthBench platform, GPT-5 doubled diagnostic accuracy compared with GPT-4o when its extended reasoning mode was enabled. At the same time, hallucination rates for medical questions fell from 15.8% to just 1.6%.
A key innovation is its multimodal reasoning capability. GPT-5 can analyse not only text but also medical images, laboratory results, and other patient data. The model is also more likely to ask relevant follow-up questions and focus on practical next steps, while encouraging users to seek professional medical advice when appropriate. Mental health support has also been improved. GPT-5 is less prone to offering overly optimistic guidance and provides more empathetic, context-aware responses. Additional safeguards help reduce misinformation and tailor advice to the user’s circumstances.
These improvements position GPT-5 as more than an information tool. Researchers see it as a potential AI assistant for both patients and clinicians, capable of supporting diagnosis, patient engagement, and future care-coordination workflows. Its enhanced accuracy, multimodal capabilities, and stronger safety measures mark an important step toward more reliable and responsible AI in healthcare.