ChatGPT Health Overlooks More Than 50% of Medical Emergencies

Key Takeaways

  • ChatGPT Health under-triaged more than half of the medical emergencies in a recent study, including critical situations like diabetic ketoacidosis.
  • Its suicide-risk alerts were unreliable, appearing in lower-risk scenarios while failing to activate when users described specific self-harm plans.
  • Experts caution against over-reliance on AI for health advice, emphasizing the need for critical evaluation of its output.

Study Findings on ChatGPT Health

A recent independent evaluation of OpenAI’s ChatGPT Health revealed alarming shortcomings in its ability to triage medical emergencies. Launched in January 2026, this AI tool is used daily by around 40 million adults in the U.S. for health advice. The study, published in Nature Medicine on February 23, indicated that ChatGPT Health under-triaged 52% of “gold-standard emergencies.” For instance, it suggested follow-up appointments for patients with diabetic ketoacidosis and impending respiratory failure, rather than directing them to emergency care.

Dr. Ashwin Ramaswamy, the lead author of the study, pointed out that while ChatGPT Health performed well in textbook cases like strokes or severe allergic reactions, it struggled in more complex scenarios where clinical judgment is crucial. Researchers at the Icahn School of Medicine at Mount Sinai constructed 60 different patient scenarios, ranging from mild illnesses to severe emergencies. Each scenario was assessed by three independent doctors to determine the necessary level of care.

The evaluation generated nearly 1,000 responses from the AI under varying conditions, including changes to patient demographics and additional information supplied by family members. In one scenario, the AI directed a woman who was suffocating to a follow-up appointment rather than emergency care in 84% of the runs of that case.

The AI also over-triaged low-risk situations, advising immediate medical attention in 64.8% of cases where it was not needed. The tool is designed to direct users at risk of suicide to crisis lines, yet researchers found that these alerts sometimes appeared in lower-risk scenarios while failing to activate even when users described specific self-harm plans.

“The system’s alerts were inverted relative to clinical risk, appearing more reliably for lower-risk scenarios than for serious cases,” researchers noted. They emphasized that when someone details how they would harm themselves, it typically indicates more immediate danger, not less.

The study also highlighted that input from family or friends often led to a reduction in the perceived urgency of symptoms, impacting triage recommendations. Despite these findings, the research team did not advocate for abandoning AI health tools altogether. Alvira Tyagi, a medical student and co-author, stated that understanding the limitations of such systems is crucial for future healthcare training.

Alex Ruani, an expert at University College London, expressed concern over the dangerous implications of the findings. He warned that the false sense of security created by these systems could have dire consequences for patients. “If someone is told to wait 48 hours during an asthma attack or diabetic crisis, that reassurance could cost them their life,” he remarked.

OpenAI responded to the concerns, indicating that the study does not accurately reflect typical usage or the intended functionality of ChatGPT Health in real-world scenarios. The evaluation serves as a cautionary tale, emphasizing the need for critical analysis of AI outputs in healthcare settings.

