Generative AI in medicine: what works, what doesn’t

Mon 27 October 2025
AI
News

A new review in Nature Medicine finds that generative artificial intelligence is beginning to alter how medicine is practiced, mainly as an assistant. The authors report that AI systems now help write notes, analyze data, and support education, yet still face limits around accuracy, bias, and oversight. They also warn of “performance drift”, the tendency of a deployed model’s accuracy to degrade as clinical data and practice change over time.

The medical co-pilot

Generative AI (GenAI) has moved far beyond the chatbots of early pandemic days. The study, “Generative artificial intelligence in medicine”, published in October 2025, notes that transformer-based models like GPT-5, Gemini 2.5 Pro, and DeepSeek-R1 can now reason through complex clinical problems, retrieve guidelines, and even write code to test their own hypotheses.

These models can be trained on smaller, specialized datasets rather than the billions of general web pages used in earlier AI systems. That shift, the authors write, allows hospitals and research centers to build models that understand their unique patient populations without massive computing costs.
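The review stays at a conceptual level here; purely as an illustration of what such local adaptation might look like, the sketch below fine-tunes a small open language model on a hypothetical file of de-identified note text using the Hugging Face transformers and datasets libraries. The model name, file path, and training settings are placeholders, not anything the study prescribes.

```python
# Minimal sketch: adapting a small open language model to a local,
# de-identified clinical corpus. Model name, data path, and settings
# are placeholders for illustration only.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "distilgpt2"                       # stand-in for any small open model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token  # distilgpt2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Hypothetical JSONL file with one {"text": "..."} record per de-identified note.
dataset = load_dataset("json", data_files="deidentified_notes.jsonl")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="local-notes-model",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```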

GenAI’s most striking potential lies in its collaborative power. Instead of replacing physicians, the researchers envision a “doctor–patient–AI triad,” in which AI provides evidence-based insights while the clinician maintains judgment and empathy. Early trials have shown that human-AI teams outperform either alone in tasks like triage and diabetic retinopathy screening.

From synthetic data to smarter models

A major challenge in healthcare research is the lack of clean, shareable data. The review highlights how GenAI can generate synthetic datasets: realistic patient records, scans, or lab results that contain no identifiable information. This opens the door to model training and medical education without breaching privacy rules.

The study details three main architectures powering these advances: variational autoencoders, generative adversarial networks (GANs), and newer diffusion models that can create lifelike MRI or CT images. Properly trained, these models can simulate radiologic scans for rare diseases or test new diagnostic algorithms.
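The review describes these architectures at a high level rather than providing implementations. As a rough sketch of the first of them, the toy variational autoencoder below (PyTorch, with illustrative feature counts and layer sizes) shows the basic mechanics: encode real records into a latent distribution, then decode random latent samples into new, synthetic ones.

```python
# Toy VAE for tabular patient records (PyTorch). All dimensions are illustrative.
import torch
import torch.nn as nn

class TabularVAE(nn.Module):
    def __init__(self, n_features: int = 20, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, latent_dim)
        self.to_logvar = nn.Linear(64, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_features))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

    @torch.no_grad()
    def sample(self, n: int) -> torch.Tensor:
        """Decode random latent vectors into n synthetic records."""
        z = torch.randn(n, self.to_mu.out_features)
        return self.decoder(z)

def vae_loss(x, recon, mu, logvar):
    # Reconstruction error plus KL divergence to a standard normal prior.
    recon_err = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_err + kl

# Usage sketch: train TabularVAE on real, de-identified records, then call
# model.sample(1000) to obtain synthetic rows with no one-to-one link to any
# real patient (subject to the validation caveats discussed below).
```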

Still, the authors warn that synthetic data comes with risks. Overfitting to artificial patterns can degrade real-world performance, and there’s a danger of inadvertently reproducing identifiable patient traits. The solution, they suggest, lies in strong validation protocols and in pairing synthetic data with real data to balance privacy and accuracy.
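The paper calls for strong validation without spelling out a recipe. One common sanity check from the synthetic-data literature, offered here as an illustration rather than as the authors’ method, is “train on synthetic, test on real”: a model trained only on synthetic records should score close to one trained on real records when both are evaluated on held-out real data. A minimal scikit-learn sketch, with random placeholder arrays standing in for the real and synthetic cohorts:

```python
# "Train on synthetic, test on real" sanity check (scikit-learn).
# X_real/y_real and X_syn/y_syn are random placeholders for real and synthetic cohorts.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_real, y_real = rng.normal(size=(500, 20)), rng.integers(0, 2, 500)
X_syn, y_syn = rng.normal(size=(500, 20)), rng.integers(0, 2, 500)

# Hold out part of the real data as the common test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_real, y_real, test_size=0.3, random_state=0)

auc_real = roc_auc_score(y_test, LogisticRegression(max_iter=1000)
                         .fit(X_train, y_train).predict_proba(X_test)[:, 1])
auc_syn = roc_auc_score(y_test, LogisticRegression(max_iter=1000)
                        .fit(X_syn, y_syn).predict_proba(X_test)[:, 1])

# A large gap between the two scores flags synthetic data that has not
# captured the real-world patterns.
print(f"train-on-real AUC:      {auc_real:.3f}")
print(f"train-on-synthetic AUC: {auc_syn:.3f}")
```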

Where AI already helps

While full autonomy remains distant, GenAI is already easing some of healthcare’s heaviest burdens. The study highlights three areas of immediate, practical impact.

Clinical support. Early research found that large language models can pass medical exams and answer patient queries with surprising empathy. However, their best use so far is as an assistant rather than an autonomous advisor: suggesting diagnoses, drafting clinical notes, or summarizing charts while a physician reviews the results. Hybrid human-AI teams consistently perform best.
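What that assistant pattern looks like in code is not spelled out in the review; the sketch below is one hypothetical arrangement, using an OpenAI-style chat API with a placeholder model name, prompt, and chart text, in which the model only produces a draft that a physician must review before anything reaches the record.

```python
# Assistant-style chart summarization with mandatory physician review.
# Model name, prompt wording, and chart text are placeholders; the pattern,
# not any particular vendor or model, is the point.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def draft_chart_summary(chart_text: str) -> str:
    """Return a draft summary; it stays a draft until a clinician signs off."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Summarize the patient chart for a clinician. "
                        "Flag uncertainty explicitly and do not invent values."},
            {"role": "user", "content": chart_text},
        ],
    )
    return response.choices[0].message.content

draft = draft_chart_summary("De-identified chart text goes here.")
print("DRAFT - pending physician review:\n", draft)
```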

Administrative relief. Documentation and billing consume up to half of a clinician’s workday. GenAI “scribes” that transcribe and summarize consultations have cut note-taking time by more than 70 percent in pilot studies. Yet researchers caution against overreliance: hallucinated details or misclassified billing codes can create legal and financial exposure.

Medical education. GenAI tutors can adapt to a student’s learning level and simulate realistic patient conversations. After just a few sessions, students who received AI feedback outperformed peers who didn’t. Still, the review stresses that AI should complement, not replace, the real patient experience, serving as a safeguard against overconfidence and misinformation.

Caution, validation, and what’s next

Despite its potential, the review insists that GenAI in medicine must be held to the same standards as any clinical intervention. Most current studies are small, retrospective, and use non-clinical endpoints. The authors call for randomized clinical trials with transparent reporting and ongoing monitoring to prevent “performance drift.”

Evaluation frameworks are emerging. Metrics such as accuracy and sensitivity are being joined by “extrinsic” measures like explainability and empathy. Some teams are even testing the concept of “LLM as a judge”, where AI systems evaluate each other’s reasoning and bias at scale.
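The review mentions the idea only at a conceptual level. A bare-bones sketch of the pattern, with placeholder model names and a deliberately simplified rubric, might look like this: one model answers a clinical question, a second model scores the answer for accuracy and empathy, and the scores are then aggregated and spot-checked by humans across many cases.

```python
# "LLM as a judge" sketch: one model answers, another scores the answer
# against a rubric. Model names, rubric, and question are placeholders.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = ('Score the answer from 1-5 for factual accuracy and 1-5 for empathy. '
          'Reply with JSON only: {"accuracy": int, "empathy": int, "rationale": str}')

def answer(question: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder "candidate" model
        messages=[{"role": "user", "content": question}])
    return r.choices[0].message.content

def judge(question: str, candidate_answer: str) -> dict:
    r = client.chat.completions.create(
        model="gpt-4o",  # placeholder "judge" model
        messages=[{"role": "system", "content": RUBRIC},
                  {"role": "user",
                   "content": f"Question: {question}\nAnswer: {candidate_answer}"}])
    # Sketch assumes the judge returns valid JSON; production pipelines
    # enforce structured output and audit a sample of judgments by hand.
    return json.loads(r.choices[0].message.content)

question = "A patient asks whether they can stop their statin. How should a clinician respond?"
print(judge(question, answer(question)))
```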

The authors outline a pragmatic path forward: start with narrow, well-defined tasks; maintain human oversight; and use open benchmarking datasets to ensure fairness across populations. They also emphasize training clinicians to “prompt” AI effectively and understand its limits.

Ultimately, the study’s vision is cautious but optimistic. Properly validated, generative AI could save physicians from paperwork, accelerate biomedical discovery, and expand access to expertise worldwide. But the transformation, the researchers conclude, will depend less on better models than on how wisely humans choose to deploy them.