Open-source database bundles complex cancer data

Researchers from the Johns Hopkins Kimmel Cancer Center and Johns Hopkins University have developed a new open-source database structure designed to simplify the analysis of complex cancer datasets. The platform, called AstroID, enables researchers to integrate and study multiple types of cancer-related information, including laboratory results, genetic sequencing and imaging data, within a single framework.

Cancer research often requires combining large volumes of information from different sources, such as clinical records, tissue samples, imaging studies and genomic analyses. However, these datasets are frequently stored in separate systems, making it difficult for researchers to analyze them together.

Linking diverse datasets in oncology

AstroID addresses this challenge by organizing clinical and specimen-related data into a six-tier hierarchical structure. These tiers include patient information (deidentified to protect privacy), diagnosis, clinical events such as treatments or blood draws, collected specimens, laboratory processing steps, and finally individual sample components such as slides or aliquots.
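To make the six tiers concrete, here is a minimal sketch of the hierarchy as nested Python dataclasses. It is an illustration only and is not taken from the AstroID code: every class name, field name and identifier below is an assumption, and the real REDCap-based structure will differ in detail.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SampleComponent:      # tier 6: e.g. a slide or an aliquot (hypothetical fields)
    component_id: str
    kind: str


@dataclass
class ProcessingStep:       # tier 5: a laboratory processing step
    step_id: str
    description: str
    components: List[SampleComponent] = field(default_factory=list)


@dataclass
class Specimen:             # tier 4: a collected specimen
    specimen_id: str
    specimen_type: str
    processing_steps: List[ProcessingStep] = field(default_factory=list)


@dataclass
class ClinicalEvent:        # tier 3: e.g. a treatment or a blood draw
    event_id: str
    event_type: str
    specimens: List[Specimen] = field(default_factory=list)


@dataclass
class Diagnosis:            # tier 2
    diagnosis_id: str
    tumor_type: str
    events: List[ClinicalEvent] = field(default_factory=list)


@dataclass
class Patient:              # tier 1: deidentified patient record
    patient_id: str
    diagnoses: List[Diagnosis] = field(default_factory=list)


def component_paths(patient):
    """Yield one tuple of ids per sample component, from patient down to component."""
    for dx in patient.diagnoses:
        for ev in dx.events:
            for sp in ev.specimens:
                for st in sp.processing_steps:
                    for comp in st.components:
                        yield (patient.patient_id, dx.diagnosis_id, ev.event_id,
                               sp.specimen_id, st.step_id, comp.component_id)


# Made-up example: one patient, one biopsy, two slides cut from the same block.
slides = [SampleComponent("C1", "slide"), SampleComponent("C2", "slide")]
step = ProcessingStep("L1", "sectioning", slides)
specimen = Specimen("S1", "tissue", [step])
event = ClinicalEvent("E1", "biopsy", [specimen])
example = Patient("P001", [Diagnosis("D1", "melanoma", [event])])

paths = list(component_paths(example))
```

Because each tier hangs off the one above it, every slide or aliquot can be traced back through its processing step, specimen and clinical event to a single diagnosis and patient.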

The database structure is built using the web-based data management platform REDCap and can be scaled to include thousands of patients and billions of individual cancer cells characterized through spatial analysis. The system was recently described in the Journal for ImmunoTherapy of Cancer.

Supporting large-scale cancer studies

Researchers at Johns Hopkins Medicine have already implemented the system in studies involving 16 different patient groups with multiple tumor types. Using AstroID, the team has mapped more than one billion cancer cells and linked these data to clinical information collected throughout the patients’ treatment trajectories.

According to Janis M. Taube, director of the Division of Dermatopathology and co-director of the Tumor Microenvironment Laboratory at the Bloomberg~Kimmel Institute for Cancer Immunotherapy, the new structure enables researchers to analyze patient data more holistically.

“What this structure does is allow me to ask questions across all of this data that's already been gathered, and across tumor types, and combine it all together in the context of the longitudinal patient experience,” Taube explains.

Reducing duplication in research

In oncology studies, patients often undergo multiple treatments, tests and follow-up assessments over time. Linking these clinical events to laboratory analyses, such as blood tests, pathology findings, imaging results and genomic data, is essential for identifying potential biomarkers and understanding disease progression.

Previously, researchers often had to manually compile and reorganize such data for each new study. This process could lead to duplicated work across research teams. “Investigators across the whole institution are also trying to tap into these patients and collect this information,” Taube says. “There were really huge inefficiencies across how we were working, and duplicating efforts.”

Scaling research through data infrastructure

The AstroID structure was designed to support larger-scale studies than previously possible. According to Alexander Szalay, director of the Institute for Data Intensive Science at Johns Hopkins, traditional manual data entry limited many cancer studies to relatively small patient cohorts. “What we are trying to do is to scale out so we can handle patients on the order of hundreds or thousands of patients in a study,” Szalay says.

The database architecture was developed by postdoctoral researcher Elizabeth Will and graduate student Benjamin Green, who designed the hierarchical data structure that can be translated into a query-based relational database.
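As a rough illustration of what translating such a hierarchy into a query-based relational database can look like, the sketch below uses Python's built-in sqlite3 module. It is not the AstroID schema: the table names, column names and sample rows are all invented for this example, with each tier stored as a table that points to its parent tier through a foreign key.

```python
import sqlite3

# In-memory database; one table per tier, each referencing its parent tier.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient        (patient_id   TEXT PRIMARY KEY);
CREATE TABLE diagnosis      (diagnosis_id TEXT PRIMARY KEY,
                             patient_id   TEXT REFERENCES patient(patient_id),
                             tumor_type   TEXT);
CREATE TABLE clinical_event (event_id     TEXT PRIMARY KEY,
                             diagnosis_id TEXT REFERENCES diagnosis(diagnosis_id),
                             event_type   TEXT);
CREATE TABLE specimen       (specimen_id  TEXT PRIMARY KEY,
                             event_id     TEXT REFERENCES clinical_event(event_id),
                             specimen_type TEXT);
CREATE TABLE processing     (step_id      TEXT PRIMARY KEY,
                             specimen_id  TEXT REFERENCES specimen(specimen_id),
                             description  TEXT);
CREATE TABLE component      (component_id TEXT PRIMARY KEY,
                             step_id      TEXT REFERENCES processing(step_id),
                             kind         TEXT);
""")

# Made-up rows: two patients with different tumor types.
conn.executemany("INSERT INTO patient VALUES (?)", [("P001",), ("P002",)])
conn.executemany("INSERT INTO diagnosis VALUES (?, ?, ?)",
                 [("D1", "P001", "melanoma"), ("D2", "P002", "lung")])
conn.executemany("INSERT INTO clinical_event VALUES (?, ?, ?)",
                 [("E1", "D1", "biopsy"), ("E2", "D2", "biopsy")])
conn.executemany("INSERT INTO specimen VALUES (?, ?, ?)",
                 [("S1", "E1", "tissue"), ("S2", "E2", "tissue")])
conn.executemany("INSERT INTO processing VALUES (?, ?, ?)",
                 [("L1", "S1", "sectioning"), ("L2", "S2", "sectioning")])
conn.executemany("INSERT INTO component VALUES (?, ?, ?)",
                 [("C1", "L1", "slide"), ("C2", "L1", "slide"),
                  ("C3", "L2", "slide"), ("C4", "L2", "aliquot")])

# One query across tumor types: how many slides exist per tumor type?
rows = conn.execute("""
SELECT d.tumor_type, COUNT(c.component_id)
FROM component c
JOIN processing pr    ON pr.step_id     = c.step_id
JOIN specimen s       ON s.specimen_id  = pr.specimen_id
JOIN clinical_event e ON e.event_id     = s.event_id
JOIN diagnosis d      ON d.diagnosis_id = e.diagnosis_id
WHERE c.kind = 'slide'
GROUP BY d.tumor_type
ORDER BY d.tumor_type
""").fetchall()
```

With the tiers stored this way, a question that spans tumor types becomes a single join across the tables rather than a manual recompilation of spreadsheets for each study.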

Potential applications beyond oncology

Although AstroID is currently being used for cancer research, the underlying data structure could also be applied to other disease areas. The system is capable of organizing longitudinal biospecimen data and linking it to clinical events, making it potentially useful for studying a wide range of conditions.

By making the platform openly available on GitHub, the researchers hope to facilitate broader collaboration and enable investigators worldwide to analyze complex biomedical datasets more efficiently.

Europe has CancerWatch

Last year, the large-scale European partnership CancerWatch was launched. Led by the Norwegian Institute of Public Health, the programme brings together 92 organisations from 29 countries and focuses on the digital transformation of population-based cancer registries (PBCRs) in Europe. The ambition is to provide faster, more complete and more comparable data that will enable policymakers, researchers and healthcare professionals to take more effective action against cancer.

The data feeds directly into, among others, the European Cancer Information System (ECIS) and the European Cancer Inequalities Registry (ECIR). Both play a key role in the European Cancer Plan and form the basis for evidence-based health policy. “Better data leads to better decisions. With CancerWatch, we are building a solid foundation for more effective and fair cancer care in Europe,” said Gijs Geleijnse, scientific coordinator of the project.