October 8, 2025
Quality data is at the heart of any high-performing AI model, but wrangling data into a modeling-ready state can be challenging and time-consuming. This is especially true for clinical data, which is often noisy and poorly standardized, and which frequently requires clinical expertise to assess its quality and relevance. For clinical teams, these bottlenecks can delay trials and slow innovation.
At Unlearn, we’ve built the infrastructure, processes, and AI-based methods to transform this messy, incomplete, heterogeneous data into usable inputs for training and validating AI models.
We’ve collected de-identified, longitudinal patient data from over 1 million patients across more than 30 indications, partnering with research institutes, advocacy groups, academics, and commercial vendors to do so. Through harmonization (organizing, cleaning, and standardizing data), we’ve created a curated, robust dataset covering over 370k patients and more than 1 million patient-provider interactions, fueling 14 Digital Twin Generator (DTG) deployments across neurodegenerative, immunology, cardio-metabolic, and psychiatric indications.
It’s the kind of investment that could take years—and enormous resources—for a pharma company to build internally. By partnering with Unlearn, clinical teams bypass the “build vs. buy” dilemma and gain immediate access to validated, ready-for-use AI models. These models enable us to uncover patterns and answers that no single study or one-off analysis could provide on its own.
The Tools That Make Imperfect Clinical Data Perfect for Clinical Trials
Traditional Extract, Transform, Load (ETL) pipelines are built for consistently formatted data and run on a fixed schedule, which means they can afford to be rigid. But clinical data is anything but uniform. In fact, the sheer variety in structure, format, and quality of the data sources we need to harmonize demands a highly flexible approach to data processing.
Our data processing toolkit is designed for flexibility and scale, centering on three components (sketched in code after this list):
- ETL Pipelines for the 80% Case: Modular, configurable pipelines handle the most common transformations—removing impossible values, standardizing lab units, handling duplicates, pivoting tables, and aggregating questionnaire scores. Each run creates an audit trail, ensuring traceability from raw data to final dataset.
- Clinical Data Scientist Input for the 20% Case: Unique, indication-specific, and study-specific challenges require expert intervention. Our clinical data scientists can inject custom code directly into pipelines, which is then subject to the same quality checks as standardized steps. This allows domain expertise to be encoded without sacrificing consistency or rigor.
- The Data API: Just as an application programming interface defines clear communication between systems, our data API standardizes how datasets are structured, formatted, and released. This allows our ML engineers to make reliable assumptions about data quality before training begins.
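To make this concrete, here is a minimal, self-contained sketch of what a config-driven pipeline with an audit trail and a release contract can look like. It is our own illustration, not Unlearn's production code: the step names, the unit conversion, and the release-contract rules are all assumptions chosen for the example.

```python
# Illustrative sketch of a config-driven harmonization pipeline with an audit
# trail and a "data API" release check. Step names, the config format, and the
# schema rules are assumptions, not Unlearn's actual implementation.
from dataclasses import dataclass, field
from typing import Callable
import pandas as pd

# Registry of reusable "80% case" transforms; study-specific custom steps can
# register through the same interface and pass through the same checks.
TRANSFORMS: dict[str, Callable[[pd.DataFrame], pd.DataFrame]] = {}

def transform(name: str):
    def register(fn):
        TRANSFORMS[name] = fn
        return fn
    return register

@transform("drop_impossible_values")
def drop_impossible_values(df: pd.DataFrame) -> pd.DataFrame:
    # Example rule: ages outside a plausible range are treated as data errors.
    return df[(df["age"] >= 0) & (df["age"] <= 120)]

@transform("standardize_glucose_units")
def standardize_glucose_units(df: pd.DataFrame) -> pd.DataFrame:
    # Convert glucose reported in mmol/L to mg/dL so all rows share one unit.
    df = df.copy()
    mmol = df["glucose_unit"] == "mmol/L"
    df.loc[mmol, "glucose"] = df.loc[mmol, "glucose"] * 18.0
    df.loc[mmol, "glucose_unit"] = "mg/dL"
    return df

@transform("drop_duplicates")
def drop_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop_duplicates(subset=["patient_id", "visit"])

@dataclass
class Pipeline:
    steps: list[str]                      # in practice, declared in a config file
    audit: list[dict] = field(default_factory=list)

    def run(self, df: pd.DataFrame) -> pd.DataFrame:
        for name in self.steps:
            before = len(df)
            df = TRANSFORMS[name](df)
            # Audit trail: record what each step did, for traceability from
            # raw data to the final dataset.
            self.audit.append({"step": name, "rows_in": before, "rows_out": len(df)})
        return df

# "Data API" contract check (assumed schema): a dataset is only released if it
# has the expected columns and standardized units.
EXPECTED_COLUMNS = {"patient_id", "visit", "age", "glucose", "glucose_unit"}

def check_release_contract(df: pd.DataFrame) -> None:
    assert EXPECTED_COLUMNS <= set(df.columns), "missing required columns"
    assert (df["glucose_unit"] == "mg/dL").all(), "unexpected lab units"

raw = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "visit": [1, 1, 1, 1],
    "age": [54, 54, 61, 230],          # 230 is an impossible value
    "glucose": [5.5, 5.5, 110.0, 98.0],
    "glucose_unit": ["mmol/L", "mmol/L", "mg/dL", "mg/dL"],
})

pipe = Pipeline(steps=["drop_impossible_values", "standardize_glucose_units", "drop_duplicates"])
clean = pipe.run(raw)
check_release_contract(clean)
print(clean)
print(pipe.audit)
```

The key design choice the sketch tries to capture: custom, study-specific steps register through the same interface as the standard ones, so they inherit the same audit trail and release checks rather than living outside the pipeline.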
The result: a harmonized, transparent, and high-quality dataset that’s ready for AI model training and validation, and usable for downstream analyses.
When Data is Missing, Our Models Still Deliver
Data from past clinical trials and observational studies, even after harmonization, still contains gaps. This missingness is unavoidable in clinical data, since not every assessment is collected at every timepoint, and some records are incomplete.
Our DTGs are designed to learn from this reality. Instead of relying on naïve imputation methods, our modeling architecture is built to handle missing data natively. During training, DTGs learn patterns from incomplete longitudinal records. When applied in a new trial, the model can probabilistically infer missing values using correlations across biomarkers, clinical features, and patient trajectories.
For example, when a biomarker measurement is missing, the model infers it using correlations with related biomarkers and patient characteristics—without biasing outcomes toward an “idealized” patient.
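To see how inferring a missing value from correlations can work in principle, consider a deliberately simple example: fit a multivariate Gaussian to complete biomarker records, then read off the conditional distribution of the missing biomarker given the observed ones. This is a toy stand-in and not the DTG architecture; the simulated biomarkers and the Gaussian assumption are ours.

```python
# Toy illustration (not Unlearn's DTG): inferring a missing biomarker from its
# correlations with observed measurements, via the conditional distribution of
# a multivariate Gaussian fit to training data.
import numpy as np

rng = np.random.default_rng(0)

# Simulated training data: three correlated biomarkers per patient.
n = 2000
latent = rng.normal(size=(n, 1))
biomarkers = latent @ np.array([[1.0, 0.8, 0.6]]) + 0.5 * rng.normal(size=(n, 3))

mu = biomarkers.mean(axis=0)
cov = np.cov(biomarkers, rowvar=False)

def conditional_gaussian(observed: dict[int, float]):
    """Mean and covariance of the missing dimensions given the observed ones."""
    obs_idx = sorted(observed)
    mis_idx = [i for i in range(len(mu)) if i not in observed]
    x_obs = np.array([observed[i] for i in obs_idx])
    s_oo = cov[np.ix_(obs_idx, obs_idx)]
    s_mo = cov[np.ix_(mis_idx, obs_idx)]
    gain = s_mo @ np.linalg.inv(s_oo)
    cond_mean = mu[mis_idx] + gain @ (x_obs - mu[obs_idx])
    cond_cov = cov[np.ix_(mis_idx, mis_idx)] - gain @ s_mo.T
    return cond_mean, cond_cov

# A new patient is missing biomarker 2; infer it from biomarkers 0 and 1.
mean, var = conditional_gaussian({0: 1.2, 1: 0.9})
draws = rng.normal(mean, np.sqrt(np.diag(var)), size=(5, len(mean)))
print("expected value:", mean, "plausible draws:", draws.ravel())
```

Because the inference is probabilistic, the output is a distribution (an expected value plus uncertainty) rather than a single filled-in number, which preserves realistic patient-to-patient variability.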
The result: while the input datasets may contain missingness, the digital twins generated by our DTGs are complete and realistic representations of patient outcomes over time. This ensures that sponsors don’t face biased or “idealized” predictions, but instead receive digital twins that reflect the true complexity of patient data.
Model Validation That Builds Trust and Proves Reliability
Every DTG undergoes rigorous evaluation, including cross-validation during model training to assess performance across outcomes and endpoints. This validation gives sponsors confidence that digital twin predictions remain accurate and unbiased when applied to new patients and trial settings.
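As a rough illustration of this kind of evaluation, the sketch below runs patient-level k-fold cross-validation with scikit-learn, grouping folds by patient so that no patient contributes to both training and evaluation, and reporting error separately per endpoint. It is a generic example on simulated data, not Unlearn's validation protocol, and the endpoint names and model choice are assumptions.

```python
# Generic sketch of patient-level k-fold cross-validation (illustrative only):
# folds are split by patient, and error is reported per endpoint.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)

# Simulated longitudinal data: baseline features predict two endpoints,
# with several visits per patient.
n_patients, visits = 200, 3
patient_id = np.repeat(np.arange(n_patients), visits)
X = rng.normal(size=(n_patients * visits, 5))
Y = X[:, :2] @ rng.normal(size=(2, 2)) + 0.3 * rng.normal(size=(n_patients * visits, 2))

cv = GroupKFold(n_splits=5)
errors = {"endpoint_A": [], "endpoint_B": []}
for train_idx, test_idx in cv.split(X, Y, groups=patient_id):
    for j, name in enumerate(errors):
        # Fit on training patients only, evaluate on held-out patients.
        model = Ridge().fit(X[train_idx], Y[train_idx, j])
        pred = model.predict(X[test_idx])
        errors[name].append(np.sqrt(np.mean((pred - Y[test_idx, j]) ** 2)))

for name, fold_rmse in errors.items():
    print(f"{name}: mean RMSE {np.mean(fold_rmse):.3f} (+/- {np.std(fold_rmse):.3f})")
```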
For our partners, this translates into immediate access to validated, trial-ready DTGs, without the years of work it would take to build comparable infrastructure internally. It has taken Unlearn years of R&D to design architectures that robustly handle the intricacies of clinical data and to validate them for trial use. By partnering with us, sponsors bypass that burden and accelerate time to value, enabling clinical teams to focus on what matters most: designing smarter studies, making faster decisions, and moving trials forward with confidence.