AI chatbots still fall short as digital doctors

AI chatbots are increasingly being used by patients to assess symptoms, seek medical advice and better understand health concerns. However, a new study from researchers at Pennsylvania State University suggests that current AI systems are not yet reliable enough to function as standalone medical advisors.

The researchers found that AI-generated responses to health-related questions were accurate in approximately 76 percent of cases, leaving a significant margin for potentially harmful errors. The findings will be presented at the FAccT 2026 conference and are currently available as a preprint on arXiv.

Investigating real-world health queries

While previous studies have examined the performance of large language models (LLMs) in clinical settings, the Penn State researchers focused on a different question: how do ordinary people use AI when seeking health information, and how accurate are the answers they receive?

To explore this, the team organized a “Diagnose-a-thon,” designed to mimic real-world AI usage. Thirty-four participants, including faculty members, staff and students, submitted 212 health-related prompts covering both real and hypothetical medical concerns. Questions were written from both patient and physician perspectives.

Participants were free to use one of four widely available AI models: ChatGPT-4o, ChatGPT-3.5, Gemini 1.5 Pro and Llama 3 8B. The resulting responses were then evaluated by nine board-certified physicians, who assessed both their medical accuracy and their potential to cause harm.

Weaknesses in some areas

Overall, the study found that 76.2 percent of AI-generated responses contained medically accurate information. Performance varied considerably across medical specialties. The highest scores were observed in obstetrics and gynecology as well as otolaryngology, where responses demonstrated relatively high validity and low risk of harm. In contrast, AI systems struggled more with internal medicine, neurology and dermatology, generating less reliable answers and receiving higher harm ratings from physician reviewers.

The researchers also discovered that prompt design influenced performance. More specific questions, particularly those between 60 and 250 characters in length, tended to produce more accurate responses than either very short or overly detailed prompts. These findings suggest that both the medical topic and the way users formulate questions play a significant role in determining AI performance.
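For readers who want to check their own questions against that length band, here is a minimal sketch in Python. The 60 and 250 character bounds come from the study as reported above. The function name, the labels and the decision to treat both bounds as inclusive are assumptions made for illustration, not part of the researchers' method.

```python
# Character bounds reported in the study; treating them as inclusive is an assumption.
MIN_CHARS = 60
MAX_CHARS = 250


def prompt_length_band(prompt: str) -> str:
    """Classify a prompt as 'short', 'in_range' or 'long' by character count.

    Hypothetical helper for illustration only. Leading and trailing
    whitespace is ignored before counting.
    """
    n = len(prompt.strip())
    if n < MIN_CHARS:
        return "short"
    if n > MAX_CHARS:
        return "long"
    return "in_range"


if __name__ == "__main__":
    examples = [
        "Headache?",
        "I have had a dull headache behind my right eye for three days. "
        "It gets worse in bright light. What could be causing it?",
    ]
    for text in examples:
        print(len(text.strip()), prompt_length_band(text))
```

A prompt that falls inside the band is not guaranteed a correct answer. The study reports a tendency across 212 prompts, not a rule for any single question.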

Can additional medical training improve AI performance?

In a second phase of the study, the researchers tested whether giving the models additional medical material would help. They augmented the base models with medical textbooks, clinical guidelines and peer-reviewed scientific literature commonly included in medical education curricula. A panel of physicians, residents and medical students then compared responses from the original models with those generated by the medically augmented versions.

Surprisingly, the additional training did not consistently improve quality. For Gemini and Llama, reviewers actually preferred the responses generated by the original models. For the ChatGPT models, no significant difference was observed. According to the researchers, the findings indicate that simply adding more medical content does not automatically lead to safer or more clinically appropriate outputs.

For clinicians rather than patients?

Despite the relatively high accuracy scores, the researchers emphasize that current AI systems still produce errors in more than 20 percent of responses, roughly twice the estimated error rate of human physicians.

As a result, the team believes that large language models may currently be more valuable as decision-support tools for healthcare professionals than as direct medical advisors for patients. According to co-author Jennifer Kraschnewski, AI has considerable potential to transform healthcare by helping clinicians process information and improve patient care. However, human expertise remains essential for interpreting complex medical situations and ensuring patient safety.

The researchers conclude that AI will likely continue to play a growing role in how people seek health information. Understanding both its strengths and limitations is therefore crucial. While current chatbots can provide useful guidance in many situations, the study highlights that they are not yet reliable enough to replace professional medical advice and should be used with appropriate caution.

Earlier study

An earlier study, conducted in 2025, concluded that while AI models perform well on medical exams, they still struggle with the complex clinical reasoning required in real-world healthcare. The research, led by Liam McCoy, found that leading large language models often misjudge uncertainty, place too much emphasis on irrelevant details and fail to revise their conclusions when new patient information becomes available.

Using a new benchmark called concor.dance, based on the script concordance tests used in medical education, researchers evaluated ten major AI models across a range of clinical specialties. The models performed at a level comparable to junior medical students but fell short of senior residents and experienced physicians.

The findings highlight a gap between factual knowledge and clinical judgment. According to the researchers, AI should currently be viewed as a support tool rather than a replacement for clinicians. They argue that healthcare professionals must actively guide the development of AI to ensure future systems enhance, rather than undermine, clinical decision-making.