Europe Goes For Synthetic Data To Lead In Health Innovation

Mon 23 February 2026
Data
News

At the ICT&health World Conference, a debate organized by the IHI-founded SYNTHIA project examined how synthetic data could accelerate AI-driven innovation while safeguarding privacy and trust. Key takeaway: European science is hungry for data, and artificially generated datasets can be literally lifesaving.

Data as Europe’s strategic asset

Progress in artificial intelligence in the life sciences depends on access to high-quality data. Yet in Europe, health data remains fragmented, sensitive, and difficult to share. For many researchers, GDPR is a bureaucratic hurdle that can demotivate data-driven research, requiring them to spend significant time on documentation while the risk of violating the restrictive law remains high. As of early 2026, the EU is pursuing targeted GDPR simplification and selective deregulation to reduce administrative burdens on businesses. At the same time, synthetic data offers a potentially transformative alternative in the context of limited access to health data.

From a European Commission perspective, data is “the oil of innovation” and even “the oil of the economy.” Szymon Bielecki, Head of Sector, Research and Innovation, eHealth, Well-Being and Ageing Unit at DG CONNECT, European Commission, positioned synthetic data within Europe’s broader digital strategy. He admitted that the health sector is “one of the sectors where it is most difficult to access good quality data.” Strict GDPR safeguards, interoperability gaps, and heterogeneous data standards across hospitals create structural barriers.

The European Health Data Space was presented as a mechanism to improve secondary use and interoperability. However, until real-world data becomes more accessible and harmonized, synthetic data may serve as a bridge. It can “create real-life conditions for testing digital solutions,” help train AI models, and support medicine development in privacy-preserving environments. “Artificial intelligence will never work without good quality data,” according to Bielecki.

The strategic ambition of Europe is to strengthen competitiveness in AI and digital health. Synthetic data fits into this broader data and AI policy landscape as an enabling technology rather than a replacement for real-world data infrastructure. The Commission also highlighted the importance of guidance and voluntary certification frameworks to provide structure and trust in synthetic data generation practices.

Validity, utility, privacy

From a research perspective, synthetic data is not valuable by default. It must meet clear scientific criteria.

Leonor Cerda Alberich, Co-Principal Investigator, Biomedical Imaging Research Group, La Fe Health Research Institute (IIS La Fe), emphasized three core pillars: “validity, utility, and privacy.” Synthetic datasets must capture “complex, nonlinear correlations that real-world data have.” Statistical tests, heat maps, and algorithmic validation approaches are required to assess whether synthetic data behaves like real patient data.
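The correlation heat maps she referred to can be boiled down to a simple numerical check: compute the feature-correlation matrix of the real data and of the synthetic data, and measure how far apart they are. The sketch below is purely illustrative (the datasets, noise levels, and the `correlation_gap` helper are invented for this example, not taken from SYNTHIA's validation pipeline):

```python
import numpy as np

def correlation_gap(real: np.ndarray, synth: np.ndarray) -> float:
    """Mean absolute difference between the feature-correlation
    matrices of a real and a synthetic dataset (lower is better)."""
    c_real = np.corrcoef(real, rowvar=False)
    c_synth = np.corrcoef(synth, rowvar=False)
    return float(np.abs(c_real - c_synth).mean())

# Toy data: three features, two of them correlated; the "synthetic"
# set is the real set plus small noise, standing in for a generator.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
real = np.hstack([
    base,
    base * 0.8 + rng.normal(scale=0.5, size=(500, 1)),
    rng.normal(size=(500, 1)),
])
synth = real + rng.normal(scale=0.1, size=real.shape)

gap = correlation_gap(real, synth)  # small gap = correlations preserved
```

In practice, validation suites go far beyond this single statistic (marginal distributions, nonlinear dependence measures, per-subgroup checks), but the principle is the same: quantify whether the synthetic data preserves the structure of the real data.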

Utility is equally critical. As she explained, “if we train an AI model using synthetic data, are we having a model that also works on the real data?” This question goes to the heart of clinical relevance. Models that perform well in synthetic environments but fail in real-world practice offer no value.
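Her utility question is commonly operationalized as a "train on synthetic, test on real" (TSTR) evaluation: fit a model only on synthetic data, then score it on held-out real data. A minimal sketch, using a deliberately simple nearest-centroid classifier and fabricated toy cohorts (none of this reflects the actual models or data used in the project):

```python
import numpy as np

def tstr_accuracy(X_synth, y_synth, X_real, y_real):
    """Train-on-Synthetic, Test-on-Real: fit a nearest-centroid
    classifier on synthetic data and score it on real data."""
    classes = np.unique(y_synth)
    cents = np.stack([X_synth[y_synth == c].mean(axis=0) for c in classes])
    dists = ((X_real[:, None, :] - cents[None, :, :]) ** 2).sum(-1)
    pred = classes[np.argmin(dists, axis=1)]
    return float((pred == y_real).mean())

rng = np.random.default_rng(1)
# Two well-separated toy patient groups as the "real" cohort.
X_real = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
y_real = np.array([0] * 200 + [1] * 200)
# Stand-in for a generator that mimics the real distribution well.
X_synth = X_real + rng.normal(0, 0.3, X_real.shape)
y_synth = y_real.copy()

acc = tstr_accuracy(X_synth, y_synth, X_real, y_real)
```

If the accuracy on real data is close to what a model trained on real data achieves, the synthetic dataset has clinical-ML utility; a large gap signals that the generator captured the statistics but not the decision-relevant structure.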

Privacy protection remains the third pillar. Synthetic data generation must withstand “attack simulation” and adversarial testing to ensure that no individual patient can be re-identified. The promise of synthetic data lies precisely in this privacy-preserving capability.
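One of the crudest attack simulations of this kind checks whether the generator has simply memorized patients: if synthetic records sit suspiciously close to individual training records, re-identification risk is high. The sketch below is a toy version of that idea (the `copy_fraction` helper and threshold are illustrative assumptions, not a real privacy audit, which would also include membership-inference and attribute-inference attacks):

```python
import numpy as np

def copy_fraction(train, synth, threshold=1e-6):
    """Fraction of synthetic records whose nearest real training
    record is closer than `threshold` -- a crude memorization flag."""
    d = (((synth[:, None, :] - train[None, :, :]) ** 2).sum(-1)) ** 0.5
    return float((d.min(axis=1) < threshold).mean())

rng = np.random.default_rng(2)
train = rng.normal(size=(300, 4))            # "real" patient records
good_synth = rng.normal(size=(300, 4))       # freshly sampled, no copying
leaky_synth = train[:50].copy()              # verbatim patient copies

good = copy_fraction(train, good_synth)      # expect ~0: nothing memorized
leaky = copy_fraction(train, leaky_synth)    # expect 1: every record leaked
```

Passing such checks is necessary but not sufficient; the privacy guarantee the panel described rests on a battery of adversarial tests, not any single distance metric.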

The SYNTHIA project addresses these challenges through a federated infrastructure that connects multiple hospitals across Europe. It works across disease areas, including lung cancer, breast cancer, multiple myeloma, Alzheimer’s disease, lymphoma, and type 2 diabetes. Importantly, the project seeks to ensure that synthetic datasets are “clinically relevant, not just make sense from a mathematical point of view.”

This distinction between statistical similarity and clinical usability was a recurring theme throughout the session.

“You can never have sufficient data to train a perfect model”

AI systems are “data hungry,” according to Hongxu Yang, an AI Scientist at GE Healthcare, who highlighted the challenges of collecting, cleaning, and annotating large-scale imaging datasets. Synthetic data offers a potential pathway to enhance robustness and performance across diverse patient populations. Clinical knowledge, he stressed, is indispensable: “If you are developing the model with clinical knowledge guidance, the model can be much easier to develop.” He added that integrating clinical insight allows models to be “lower cost, develop faster, and also be more stable.”

Saverio D’Amico, CEO and Co-Founder of TRAIN and AI Team Leader at Humanitas Research Hospital, described synthetic data as a “technological solution to clinical problems.” He framed it as “a sort of plastic data, so a conditionable data.” Once a generative model is trained on real-world data, it can be asked to produce patient cohorts with specific characteristics, for example, defined comorbidities. This capability opens new possibilities in personalized medicine and rare diseases.

One particularly compelling example concerned pediatric neuroblastoma. In highly sensitive clinical scenarios, building traditional control arms is ethically complex. Synthetic data may help construct external or synthetic control arms when real-world comparators do not match the inclusion criteria. As D’Amico noted, in medicine, one of the central challenges is “the lack of clinical evidence,” and synthetic data can act as an “amplifier” to enhance statistical power and signal detection.

“Synthetic data is like a multi-knife technology that we have got to use,” concluded D’Amico.

Sara Okhuijsen, CTO and Co-Founder of OASYS NOW, addressed the innovation barrier faced by early-stage companies. She described the “cold start problem”: startups need data to validate their models, yet trust and access come only after validation. Synthetic data can help “break the privacy-utility trade-off in a different way” and accelerate product development while respecting privacy constraints. At the same time, she emphasized that synthetic data should be seen as part of a broader toolbox of privacy-preserving technologies, including confidential computing.

Ethical risks if synthetic data is not handled carefully

Despite strong optimism, the debate maintained a critical edge. Davide Cirillo, Head of the Machine Learning for Biomedical Research Unit at the Barcelona Supercomputing Center and Co-founder of OneCareAI, characterized synthetic data as “a double-edged sword.” While it is “ethically attractive” due to privacy preservation and improved access, it is also “ethically risky.”

Bias was a central concern. If the original real-world data used to train generative models is biased, synthetic data may “propagate this bias” and even amplify it. The apparent neutrality of synthetic datasets can create false confidence if validation processes are insufficient.

Accountability also raises complex questions. If synthetic data leads to flawed clinical decisions or discriminatory outcomes, “Who is to blame?” Is responsibility located with the data collectors, the model developers, or the organizations deploying the system? This accountability gap requires governance mechanisms beyond technical validation.

Cirillo argued that synthetic data must be treated as a “socio-technical asset.” That includes impact assessment, responsible design, transparency, and patient awareness. Patients should have the “right to know” if systems influencing their care were trained using synthetic data. Trust depends on openness.

Regulatory clarity remains another open issue. Industry participants pointed out that demonstrating equivalence between synthetic and real data under regulatory frameworks such as the EU AI Act remains challenging. Clear guidance will be essential to scale adoption across Europe.

How SYNTHIA paves the way for synthetic data

With initiatives such as SYNTHIA, stakeholders across Europe hope to gain access to artificial datasets to accelerate research in life sciences and the development of AI-based innovation. Launched in September 2024 as the first synthetic data project under the Innovative Health Initiative (IHI), SYNTHIA aims to build robust infrastructure, validation frameworks, and clinical use cases for privacy-preserving synthetic data in healthcare. The project focuses on developing and validating synthetic data generation methods across multiple data types, including laboratory results, clinical notes, genomics, and imaging. To demonstrate real-world relevance, six disease areas serve as use cases: lung cancer, breast cancer, multiple myeloma, diffuse large B-cell lymphoma, Alzheimer’s disease, and type 2 diabetes.

During the debate, there was broad agreement that synthetic data holds significant potential to accelerate AI-driven science, strengthen Europe’s digital health ecosystem, and support innovation in rare diseases, imaging, and clinical trials. At the same time, participants emphasized that quality assurance, methodological transparency, ethical safeguards, and regulatory clarity are indispensable to ensure credibility and adoption in clinical research and practice.

Synthetic data may help unlock the value of Europe’s fragmented health data landscape by enabling research while protecting patient privacy. Its long-term success, however, will depend not only on technological progress but on maintaining scientific rigor, clinical relevance, and societal trust.