Independent study questions safety of ChatGPT Health

Wed 25 February 2026
AI
News

An initial independent safety analysis of ChatGPT Health points to potential risks in how the tool assesses the need for emergency care and triggers suicide-prevention warnings. Researchers at the Icahn School of Medicine at Mount Sinai published their findings in Nature Medicine and call for structured, external evaluation of AI health tools.

ChatGPT Health, launched in January 2026 by OpenAI, is used by approximately 40 million people every day for health information and advice, according to the developer. Among other things, the tool provides recommendations on the urgency of medical care. It was precisely this function that was the focus of the study, which is described as the first independent safety assessment of the large language model (LLM)-based system.

First independent test

Within weeks of its introduction, the use of ChatGPT Health grew explosively, while independent data on its reliability was lacking. This prompted the researchers to conduct a structured safety test.

For the study, 60 clinical scenarios were developed, covering 21 medical specialties. The cases ranged from relatively harmless complaints that can be treated at home to acute medical emergencies. Three independent physicians determined the appropriate level of urgency for each scenario based on guidelines from 56 medical professional associations.

To simulate realistic variation, each scenario was tested under 16 different contextual circumstances, including differences in gender, ethnicity, social factors (such as downplaying symptoms), and barriers to care, such as lack of insurance or transportation. A total of 960 interactions with ChatGPT Health were analyzed and compared with the medical consensus.
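The evaluation design described above, a grid of scenarios crossed with contextual variants, scored against a physician consensus, can be sketched as a simple tally. This is an illustrative sketch only, not the authors' code; the urgency scale, function names, and toy data below are all assumptions for the example:

```python
# Illustrative sketch (not the study's actual code): scoring a grid of
# scenario x context interactions against a physician consensus urgency level.
from itertools import product

# Assumed ordinal urgency scale, lowest to highest.
URGENCY = ["self-care", "routine", "urgent", "emergency"]

def under_triage_rate(consensus, model_advice, scenarios, contexts):
    """Fraction of interactions where the model's urgency falls below consensus.

    consensus: scenario -> urgency; model_advice: (scenario, context) -> urgency.
    """
    cases = list(product(scenarios, contexts))  # full scenario x context grid
    under = sum(
        URGENCY.index(model_advice[(s, c)]) < URGENCY.index(consensus[s])
        for s, c in cases
    )
    return under / len(cases)

# Toy example: 2 scenarios x 2 contexts = 4 interactions.
consensus = {"chest pain": "emergency", "mild rash": "self-care"}
advice = {
    ("chest pain", "insured"): "emergency",
    ("chest pain", "uninsured"): "urgent",   # under-triaged case
    ("mild rash", "insured"): "self-care",
    ("mild rash", "uninsured"): "self-care",
}
rate = under_triage_rate(consensus, advice,
                         ["chest pain", "mild rash"],
                         ["insured", "uninsured"])
print(rate)  # 0.25: one of four interactions under-triaged
```

In the study itself, the same idea scales to 60 scenarios and 16 contexts, yielding the 960 interactions analyzed.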

Under-triage in complex emergencies

The analysis shows that the AI tool generally correctly recognizes clear, “textbook” emergencies, such as a stroke or a severe allergic reaction. However, in more nuanced situations, where clinical judgment is essential, its performance proved to be less reliable.

In more than half of the cases for which the physician panel deemed immediate emergency care necessary, ChatGPT Health did not recommend going to the emergency department right away. Strikingly, in some of these cases the system did mention relevant risk factors in its explanation, yet still gave reassuring advice. According to the researchers, this is particularly worrying in situations where subtle signs point to a potentially serious deterioration.

Inconsistent suicide warnings

A second point of concern is the built-in suicide prevention functionality. ChatGPT Health is designed to refer users to services such as the 988 Suicide & Crisis Lifeline in high-risk situations. In practice, however, this warning was found to be triggered inconsistently.

The researchers found that the warnings sometimes appeared at relatively low risk assessments, yet were absent when users described concrete plans for self-harm. The article describes these warnings as "inverted relative to clinical risk": in other words, their likelihood was inversely related to the actual danger.

High stakes, high responsibility

The social impact of such systems is considerable. When millions of people consult AI to determine whether they need emergency care, an incorrect assessment can have direct consequences for patient safety.

At the same time, the researchers emphasize that their findings do not mean consumers should avoid AI health tools altogether. They do advise, however, that medical help should always be sought immediately for worsening or worrying symptoms, such as chest pain, shortness of breath, severe allergic reactions, or changes in consciousness. Anyone having thoughts of self-harm is advised to contact the 988 Suicide & Crisis Lifeline or an emergency department.

AI as a supplement, not a replacement

The authors emphasize that AI systems are intended to support, not replace, clinical judgment. Because LLM-based systems are frequently updated, their performance may change over time. According to the authors, this underscores the need for ongoing, independent evaluation.

The research group has announced that it will continue to evaluate future versions of ChatGPT Health and other consumer-oriented AI tools, expanding its focus to include pediatrics, medication safety, and use in non-English contexts.

The study makes it clear that AI is rapidly gaining ground in healthcare, but that large-scale deployment must be accompanied by transparency, testing, and continuous quality control. At a time when digital health advice is the first point of contact for millions of people, the researchers believe that independent evaluation is not a luxury, but a necessary condition for safe implementation.