SYNTHIA: Why Europe Needs Synthetic Data To Boost Health AI

Thu 15 January 2026
Data
Interview

Synthetic data is becoming a promising way to advance data-driven health research in Europe while respecting patient privacy. Pieternella (Ellen) de Waal, Lead for Communication and Dissemination for the EU-funded SYNTHIA project, explains how synthetic data is being tested across six diseases, why it matters for trustworthy health AI, and what to expect from the debate on synthetic data at the ICT&Health World Conference 2026.

Why is synthetic data a game-changer for health AI research in Europe?

Synthetic data is solving one of healthcare's biggest paradoxes. Here is the challenge: developing effective AI for personalized medicine requires vast amounts of diverse patient data. But we are rightfully bound by strict privacy regulations such as GDPR. How do we innovate responsibly while protecting patients?

Synthetic data is the answer, and the SYNTHIA Project is addressing the challenge. We are generating artificial data that statistically mimics real patient populations but contains zero actual patient information. It is like creating a 'digital twin' of healthcare data – it looks and behaves like the real thing for research, but you cannot trace it back to any individual.

First, we break down data silos. European health data is fragmented across member states and institutions. Synthetic data lets us share insights without moving sensitive information across borders.

Second, we innovate without compromising our values. Europe leads in data protection and ethical AI – synthetic data upholds these principles while giving researchers the data volume they desperately need.

Third, we can address bias. Real datasets often underrepresent certain populations. With synthetic data, we can create more balanced datasets, ensuring AI tools work equally well for everyone.

Within the SYNTHIA consortium, we are working across six diseases to demonstrate this approach, spanning solid tumours, blood cancers, Alzheimer's disease, and Type 2 Diabetes. Our aim is for AI models trained on synthetic data to perform comparably to those trained on real data, without the privacy concerns.

What’s the project timeline, and when can we expect the first results?

SYNTHIA is the first IHI project on synthetic data. The consortium kicked off in 2024, and the first year has focused on laying solid groundwork for innovation. SYNTHIA experts have established the core technical infrastructure that underpins the SYNTHIA platform, enabling secure, federated access to health data across Europe.

The SYNTHIA clinical teams have developed comprehensive study protocols that clearly define the clinical variables and research questions for each of the six disease use cases. We have also completed our Data Management Plan and Data Protection Impact Assessment, important milestones that ensure we handle data responsibly and fully comply with European regulations. Most excitingly, we have already initiated synthetic data generation across multiple use cases, moving from theory to practice and demonstrating the real potential of this technology to advance personalized medicine.

Moving into our second year, SYNTHIA is focused on critical priorities that will shape the future of synthetic data in healthcare. We will define "good synthetic data" not only from a technical perspective but also through collaborative consensus across our diverse consortium of researchers, clinicians, and industry partners. We are refining our vision for the SYNTHIA platform's long-term sustainability, ensuring that the tools and datasets we develop remain accessible and valuable to the research community well beyond the project's lifetime. We will also continue to bring together a broad coalition of stakeholders – from academic researchers and industry partners to regulatory bodies and health technology assessment organizations – to align on robust validation standards that will build trust and enable wider adoption of synthetic data across Europe's healthcare ecosystem.

SYNTHIA targets six specific diseases as use cases: Lung Cancer, Breast Cancer, Multiple Myeloma, Diffuse large B-cell non-Hodgkin lymphoma (DLBCL), Alzheimer’s Disease, and Type 2 Diabetes.

These diseases pose common research challenges, including rare patient subgroups, privacy-sensitive data, expensive clinical trials, and long follow-up periods. By tackling these specific cases, SYNTHIA demonstrates that synthetic data can actually solve practical problems researchers face daily - not just fill in missing numbers, but enable studies that would otherwise be impossible or prohibitively expensive.

Starting with six diseases rather than claiming to solve everything is smart. It shows methodical validation - "here's where it works, here's the evidence." For a field as sensitive as healthcare, where bad data could literally harm people, this measured approach demonstrates the utility is real and verifiable, not overhyped.

What do these use cases demonstrate about the utility of synthetic data?

Looking at the SYNTHIA disease focus, what stands out is the proof-of-concept strategy, essentially saying "synthetic data is not just theoretically interesting, it actually works in practice to address real medical conditions". The diversity of diseases demonstrates the broad applicability of synthetic data. By selecting conditions with fundamentally different underlying biology (cancer driven by genetic mutations, diabetes rooted in metabolic dysfunction, or neurological diseases with their own distinct mechanisms), SYNTHIA hopes to prove its approach works broadly across disease types. This is not a method narrowly optimized for a single medical scenario; it is a generalizable tool that adapts to diverse biological contexts.

What are the main components of the SYNTHIA platform – including synthetic data generation tools and evaluation frameworks – and how will they support researchers and developers?

From a scientific perspective, with the focus on generating synthetic data, the SYNTHIA platform is built on three primary components. The first is the Federated Learning module, which provides the essential infrastructure to securely connect data and generate structured synthetic datasets. The other two components constitute the platform's scientific core and focus on data generation and validation. The generation module is designed to select and validate the optimal generative model based on the available data. It operates in close coordination with the validation framework, which assesses synthetic data quality across three main pillars: statistical fidelity, privacy preservation, and clinical utility. Furthermore, this framework evaluates the quality of the original real-world data to determine whether it is sufficiently representative to produce a high-quality synthetic cohort.

How does SYNTHIA ensure that synthetic data is robust and reliable enough to be used in AI model development, particularly for clinical decision support or predictive modeling?

SYNTHIA ensures synthetic data is trustworthy through a comprehensive validation process built into the platform. Rather than applying a one-size-fits-all quality standard, SYNTHIA evaluates synthetic data against three key criteria: how accurately it mirrors real patient data (statistical fidelity), how well it protects patient privacy, and whether it is actually useful for clinical purposes.
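To make the first two criteria more concrete, here is a minimal sketch of how fidelity and privacy checks are commonly computed in the synthetic data field. It is an illustration only, not SYNTHIA's actual validation framework: the choice of metrics (a two-sample Kolmogorov-Smirnov statistic for fidelity, distance to the closest real record for privacy), the function names, and the toy "age" and "lab value" columns are all assumptions made for this example.

```python
import math
import random


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    two empirical CDFs. 0.0 means identical samples, 1.0 means no overlap."""
    real_sorted, syn_sorted = sorted(real), sorted(synthetic)
    n, m = len(real_sorted), len(syn_sorted)
    i = j = 0
    max_gap = 0.0
    while i < n and j < m:
        x = min(real_sorted[i], syn_sorted[j])
        # Advance both pointers past every value <= x, then compare the CDFs.
        while i < n and real_sorted[i] <= x:
            i += 1
        while j < m and syn_sorted[j] <= x:
            j += 1
        max_gap = max(max_gap, abs(i / n - j / m))
    return max_gap


def distance_to_closest_record(synthetic_rows, real_rows):
    """For each synthetic row, the Euclidean distance to its nearest real row.
    A distance of 0.0 means a real record was copied verbatim."""
    return [min(math.dist(s, r) for r in real_rows) for s in synthetic_rows]


# Toy data (hypothetical): two numeric columns standing in for age and a lab value.
# Features are left unscaled to keep the sketch short; standardising them first
# would stop one column from dominating the distance.
rng = random.Random(0)
real_rows = [(rng.gauss(60, 10), rng.gauss(7, 1)) for _ in range(200)]
# Stand-in for a generative model: sample from the same distributions.
synthetic_rows = [(rng.gauss(60, 10), rng.gauss(7, 1)) for _ in range(200)]

age_ks = ks_statistic([r[0] for r in real_rows], [s[0] for s in synthetic_rows])
min_dcr = min(distance_to_closest_record(synthetic_rows, real_rows))
print(f"KS statistic for the first column: {age_ks:.3f}")
print(f"Smallest distance to a real record: {min_dcr:.3f}")
```

A low KS statistic suggests the synthetic column follows the real distribution, while a smallest distance near zero would flag a synthetic record that sits suspiciously close to a real patient. The two numbers pull in opposite directions, which is why they need to be read together.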

Importantly, this evaluation is tailored to the specific research question. The standards for "good enough" differ depending on what one is trying to achieve. Expanding a small dataset, planning a clinical trial, testing an AI algorithm, or sharing data with collaborators all have different requirements and acceptable trade-offs.

By assessing these factors together in context, SYNTHIA confirms two essential points: first, that the synthetic data accurately represents real-world patient populations (not just random numbers that happen to look right), and second, that it is scientifically sound for the specific clinical application it is intended to support. This targeted validation approach enables the use of synthetic data with confidence for AI development in clinical decision-making and predictive modeling.
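One widely used way to test whether synthetic data is useful for AI development is "train on synthetic, test on real": a model is fitted on synthetic records only and then scored on held-out real records, and the result is compared with a model fitted on real data. The sketch below illustrates the idea under loud assumptions: the nearest-centroid classifier, the `make_cohort` toy generator, and all names are hypothetical stand-ins, not SYNTHIA's models or data.

```python
import math
import random


def fit_centroids(rows, labels):
    """Nearest-centroid classifier: store the mean feature vector per class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for k, value in enumerate(row):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()}


def predict(centroids, row):
    """Return the class whose centroid is closest to the row."""
    return min(centroids, key=lambda label: math.dist(centroids[label], row))


def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)


def make_cohort(rng, n):
    """Toy two-class cohort (hypothetical). It stands in both for real data
    and for the output of a well-behaved generative model."""
    rows, labels = [], []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        centre = 2.0 if label else 0.0
        rows.append((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)))
        labels.append(label)
    return rows, labels


rng = random.Random(42)
real_train_x, real_train_y = make_cohort(rng, 500)
real_test_x, real_test_y = make_cohort(rng, 500)
synth_x, synth_y = make_cohort(rng, 500)

# Baseline: train on real, test on real.
trtr = accuracy(fit_centroids(real_train_x, real_train_y), real_test_x, real_test_y)
# Utility check: train on synthetic, test on the same real hold-out set.
tstr = accuracy(fit_centroids(synth_x, synth_y), real_test_x, real_test_y)

print(f"train-real / test-real accuracy:  {trtr:.3f}")
print(f"train-synth / test-real accuracy: {tstr:.3f}")
print(f"utility gap: {abs(trtr - tstr):.3f}")
```

A small gap between the two scores suggests that, for this particular task, the synthetic cohort carries the signal a model needs. What counts as "small enough" depends on the research question, in line with the tailored evaluation described above.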

What role does interdisciplinary collaboration across clinicians, AI developers, legal experts, and industry partners play in achieving SYNTHIA’s objectives?

Given that SYNTHIA is an Innovative Health Initiative (IHI) funded project with this unique constellation of experts, interdisciplinary collaboration is not just helpful; it is essential to making synthetic data work in healthcare.

Creating synthetic data for medical use is fundamentally a multi-dimensional problem that no single discipline can solve alone. AI developers might build brilliant generative models, but without clinicians, they will not know which clinical variables actually matter or how diseases manifest in real patients. Synthetic data can be statistically perfect but clinically meaningless.

Clinicians bring medical validity, ensuring the synthetic patients actually reflect real disease patterns and that the data will be useful for genuine research questions. AI developers provide technical machinery to generate realistic, high-quality data at scale. Legal experts are crucial for navigating the minefield of privacy regulations, data protection laws, and ethical boundaries - they define what is permissible. Industry partners ground everything in practical reality: what do drug developers actually need? What formats work with existing systems? And critically, patient representatives ensure the research remains anchored to what matters to patients - their priorities, concerns about privacy and data use, and the outcomes they care about. Without patient voices, even technically sound, clinically valid synthetic data could miss the mark in addressing real patient needs or cross ethical boundaries that patients would find unacceptable.

Why IHI's structure enables this: The IHI framework brings these stakeholders together with aligned incentives rather than competing interests. This means they can co-design solutions from the start rather than having AI teams build something and then hoping clinicians will use it. Each discipline checks and validates the others - creating synthetic data that's technically sound, clinically meaningful, legally compliant, and practically useful. Without this collaboration, you would end up with synthetic data that fails at the implementation stage.

What are the expected results and long-term impacts of SYNTHIA on health research, personalized medicine, and the broader European AI ecosystem?

With its mission to leverage synthetic data to propel personalized medicine to new heights, the expected impact of SYNTHIA will be on multiple levels:

For health research: Breaking down data silos and privacy barriers that currently slow discovery. Researchers gain access to rich, diverse datasets that would otherwise be impossible to share, accelerating studies across rare diseases, underrepresented populations, and complex conditions.

For personalized medicine: Enabling the development of truly tailored treatments by providing the diverse patient data needed to understand how different individuals respond to therapies, moving beyond one-size-fits-all approaches to precision interventions matched to individual characteristics.

For Europe's AI ecosystem: Establishing European leadership in trustworthy, ethical AI for healthcare. SYNTHIA develops frameworks, standards, and validated methodologies that position Europe as the global reference for responsible synthetic data use, fostering innovation while upholding the highest privacy and ethical standards.

Long-term, this means faster drug development, more targeted therapies, reduced healthcare costs, and ultimately better patient outcomes - all while keeping Europe at the forefront of health AI innovation.

What can we expect from the debate on synthetic data during the ICT&Health World Conference 2026?

Our session (January 27, 11h30) will tackle the critical questions: How do we validate these datasets? What are the regulatory implications? How do we build trust? I believe synthetic data will enable Europe to lead in AI-driven personalized medicine – staying true to our privacy commitments while unleashing the full potential of health data for better patient outcomes.

SYNTHIA (Synthetic Data Generation framework for integrated validation of use cases and AI healthcare applications) is supported by the Innovative Health Initiative Joint Undertaking (IHI JU) under grant agreement No 101172872.

