May 7, 2026
The first three posts in this four-part series examined a problem that runs through oncology trial design: the evidence available to predict outcomes pertains to broader populations and older standards of care, rather than to the specific biomarker-defined cohorts and rapidly evolving treatment contexts that today’s trials require. Published trial results provide reliable estimates, but only as population averages for the cohorts studied. Real-world data provide patient-level granularity but are observational, subject to confounding, and expensive to obtain. As described in Part 1, reconciling these two data sources is tricky, and increasingly so as cohorts narrow and standards of care shift.
In Part 2, we showed how treating outcome prediction as a modeling problem rather than a data-matching problem can address this for trial design, and how precision trial simulation gives teams a way to stress-test assumptions before committing to a protocol.
But there is a related problem that may be even harder, and it affects decisions that happen both before and after a trial is designed. What do you do when the comparison you need has never been made?
Two regimens, no direct trial comparison
FOLFIRINOX and gemcitabine plus nab-paclitaxel are both established first-line treatments for advanced pancreatic cancer. Oncologists regularly choose between them, and the decision affects patients every day.
Both regimens entered practice through trials against gemcitabine monotherapy. The PRODIGE4/ACCORD11 trial showed that FOLFIRINOX improved median overall survival to 11.1 months versus 6.8 months for gemcitabine. The MPACT trial showed that gemcitabine plus nab-paclitaxel improved median overall survival to 8.5 months versus 6.7 months. But no randomized trial has ever directly compared FOLFIRINOX and gemcitabine plus nab-paclitaxel, and given the cost and commercial dynamics involved, one is unlikely to be conducted.
That leaves clinicians and sponsors relying on indirect comparisons, notably network meta-analysis (NMA) and target trial emulation (TTE) based on real-world data (RWD). The PRODIGE4 and MPACT trials enrolled different patient populations (PRODIGE4 patients were younger and healthier), used different eligibility criteria, were conducted in different geographies (PRODIGE4 in France; MPACT internationally across North America, Eastern Europe, Russia, and Australia), and read out in different years. How much of the apparent difference between the two regimens is real, and how much reflects differences in the patients who were studied?
This is not just an academic question. The answer matters for setting treatment guidelines and making formulary decisions, for sponsors positioning new agents against an existing standard of care, and for development teams choosing a comparator arm for a new trial.
Attempts at indirect comparison
To assess the comparative effectiveness of these regimens, researchers have applied Bayesian NMA to this pair of trials. Crucially, the technique relies on an assumption of transitivity: that the patients in each trial’s control arm are exchangeable with respect to any effect-modifying factors. Because PRODIGE4 restricted enrollment to a younger age range and a healthier performance status, that assumption is questionable here. A better comparison would incorporate some means of adjusting for the baseline distributional imbalances.
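For intuition, the simplest fixed-effect form of this indirect comparison (the Bucher method) contrasts each regimen’s effect against the common gemcitabine anchor on the log hazard ratio scale, with the variances adding. The sketch below illustrates only that arithmetic; the hazard ratios and confidence intervals are placeholders rather than the published values, and the validity of the result still rests entirely on the transitivity assumption discussed above.

```python
# Bucher-style anchored indirect comparison: the simplest fixed-effect
# version of what a two-trial NMA computes. Hazard ratios below are
# illustrative placeholders; substitute the published values.
import numpy as np

def log_hr_and_se(hr, ci_low, ci_high):
    """Recover the log-HR and its standard error from a 95% CI."""
    return np.log(hr), (np.log(ci_high) - np.log(ci_low)) / (2 * 1.96)

# Regimen A (FOLFIRINOX) vs gemcitabine and regimen B (gemcitabine plus
# nab-paclitaxel) vs gemcitabine -- placeholder numbers only.
log_a, se_a = log_hr_and_se(0.60, 0.45, 0.80)
log_b, se_b = log_hr_and_se(0.75, 0.62, 0.90)

# A vs B through the common gemcitabine anchor. Valid only under
# transitivity: the two control arms must be exchangeable with respect to
# effect modifiers, which is exactly the assumption in question here.
log_ab = log_a - log_b
se_ab = np.sqrt(se_a**2 + se_b**2)
hr_ab = np.exp(log_ab)
ci_ab = np.exp([log_ab - 1.96 * se_ab, log_ab + 1.96 * se_ab])
print(f"Indirect HR (A vs B): {hr_ab:.2f}, 95% CI ({ci_ab[0]:.2f}, {ci_ab[1]:.2f})")
```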
Along these lines, researchers have applied the target trial emulation (TTE) framework to make head-to-head comparisons using real-world data. A recent TTE using population-level data from Alberta, Canada (Boyne et al., Annals of Epidemiology, 2023), specified a hypothetical target trial protocol and emulated it using linked administrative health records. The study found that initiation of FOLFIRINOX was associated with a median overall survival of 8.3 months, compared with 5.1 months for gemcitabine plus nab-paclitaxel, with a mortality hazard ratio of 0.78. By design, TTE adjusts for measured confounders such as age and performance status, addressing the transitivity concern that limits the use of NMA.
The limitation lies elsewhere. The Boyne study was designed to compare outcomes as observed in routine clinical practice rather than to emulate the eligibility criteria of the original trials, and the cohort was drawn from a single regional healthcare system in Alberta. Of the 1,192 patients identified, 590 were excluded for missing laboratory data alone, and only 407 patients ultimately met the study's eligibility criteria. That is a reasonable design choice for its stated purpose, but it means the resulting estimates are not directly substitutable for a trial-versus-trial comparison. More broadly, even a rigorously executed TTE is time- and resource-intensive and does not incorporate the summary results of the randomized trials whose comparisons the sponsor actually cares about.
A different way to make the comparison
The two approaches described above are limited for complementary reasons. We suggest an alternative that harnesses the strengths of each, built on the trial-calibrated modeling approach introduced earlier in this series.
In the previous posts in this series, we described how a generative model that is calibrated to published trial results can generate patient-level outcome predictions that are sensitive to patient-level prognostic/effect-modifying factors while still respecting gold-standard trial results. We validated this in non-small cell lung cancer and metastatic colorectal cancer, predicting control arm outcomes in settings where direct historical matches were sparse or unavailable in the patient-level training data.
The same mechanism can be applied to simulate a head-to-head clinical trial between two therapies. The model produces a patient-level simulation of each trial’s treatment arm, which means one has a synthetic patient-level dataset for each regimen that recapitulates the aggregate tables in the trial publications, from baseline distribution tables to overall and subgroup outcomes. From there, one can reweight one simulated arm’s baseline distribution to match the other’s and obtain a hypothetical head-to-head comparison, as sketched below.
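Here is a minimal sketch of that adjustment step, assuming the calibrated model has already produced two synthetic patient-level arms with shared baseline columns. The column names, covariates, and toy data generator are illustrative stand-ins, not our model’s output: odds weights from a logistic regression shift one arm’s covariates toward the other’s, and a weighted Cox model then gives the hypothetical head-to-head hazard ratio.

```python
# Illustrative sketch: reweight one synthetic arm's baseline distribution to
# match the other's, then fit a weighted Cox model for the head-to-head HR.
# The toy_arm() generator below is a stand-in for model-generated data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)

def toy_arm(n, age_mean, ecog1_rate, median_os, label):
    """Stand-in for a model-generated synthetic arm (illustrative only)."""
    return pd.DataFrame({
        "age": rng.normal(age_mean, 8, n),
        "ecog_1plus": rng.binomial(1, ecog1_rate, n),
        "os_months": rng.exponential(median_os / np.log(2), n),
        "event": 1,
        "arm": label,
    })

folfirinox = toy_arm(500, 61, 0.6, 11.1, "FOLFIRINOX")  # PRODIGE4-like arm
gem_nabpac = toy_arm(500, 63, 0.8, 8.5, "GnP")          # MPACT-like arm

# Step 1: odds weights that shift the GnP arm's baseline covariates toward
# the FOLFIRINOX arm's distribution.
covs = ["age", "ecog_1plus"]
both = pd.concat([folfirinox, gem_nabpac], ignore_index=True)
member = (both["arm"] == "FOLFIRINOX").astype(int)
ps = LogisticRegression(max_iter=1000).fit(both[covs], member).predict_proba(both[covs])[:, 1]
is_gnp = (both["arm"] == "GnP").to_numpy()
both["w"] = 1.0
both.loc[is_gnp, "w"] = ps[is_gnp] / (1 - ps[is_gnp])

# Step 2: weighted Cox model for the hypothetical head-to-head contrast.
both["treat"] = (both["arm"] == "FOLFIRINOX").astype(int)
cph = CoxPHFitter()
cph.fit(both[["os_months", "event", "treat", "w"]],
        duration_col="os_months", event_col="event", weights_col="w", robust=True)
print(cph.summary[["exp(coef)", "exp(coef) lower 95%", "exp(coef) upper 95%"]])
```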
When this procedure is applied to the PRODIGE4 versus MPACT scenario, we obtain a correction to the naive comparison between FOLFIRINOX and gemcitabine plus nab-paclitaxel in first-line metastatic pancreatic cancer. The correction stems from the key baseline mismatches between the two cohorts: the MPACT cohort is older and spans a prognostically worse performance status range. The effect is a clear softening of the 1-year and 2-year restricted mean survival time (RMST) difference between the two therapies, and a more conservative hazard ratio of 0.853 with a 95% CI of (0.736, 1.073). This compares with a hazard ratio of 0.79 (0.59, 1.05) from a previously conducted network meta-analysis.
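Continuing the sketch above (and reusing the weighted both DataFrame), the 1-year and 2-year RMST contrast can be read off weighted Kaplan-Meier curves; the horizons and column names are the same illustrative choices as before.

```python
# Continues the sketch above, reusing the weighted `both` DataFrame.
# RMST at horizon t is the area under the survival curve up to t; the
# 12- and 24-month horizons correspond to the 1- and 2-year contrasts.
from lifelines import KaplanMeierFitter
from lifelines.utils import restricted_mean_survival_time

for horizon in (12, 24):  # months
    rmst = {}
    for arm, grp in both.groupby("arm"):
        km = KaplanMeierFitter().fit(grp["os_months"], grp["event"], weights=grp["w"])
        rmst[arm] = restricted_mean_survival_time(km, t=horizon)
    diff = rmst["FOLFIRINOX"] - rmst["GnP"]
    print(f"{horizon}-month RMST difference (FOLFIRINOX - GnP): {diff:.2f} months")
```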

The result is not a replacement for a randomized trial, but it is a rigorous comparison between competing therapies that accounts for prognostic/effect-modifying factors and privileges the high-quality summary evidence available. It is also more cost-effective and faster than running a new trial or performing a large-scale RWD-based analysis: it delivers answers by leveraging transfer learning and by using RWD in a manner constrained by RCT results. That speed matters in an indication where standards of care are shifting constantly. In some cases, a regimen’s window of clinical relevance is short enough that the comparison only matters now, and a modeling approach is the only realistic way to inform the decision in time.
Where this matters for development teams
The pancreatic cancer example is a clear illustration, but the underlying problem is common across oncology. Wherever a sponsor needs to understand how two treatments might compare in a trial setting, and no head-to-head trial exists, the same gap appears.
Consider a development team designing a trial for a new therapeutic agent. They need to choose a comparator arm design from among several options for standard-of-care therapies and baseline cohort features. Ultimately, they need to compare expected outcomes under those choices and select the design that maximizes the trial’s chance of success. That is a comparative effectiveness question, and it must be answered before the protocol is finalized, not after.
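As a toy illustration of that comparison, and deliberately much simpler than the calibrated-model workflow described above, the sketch below characterizes each candidate comparator only by an assumed control-arm median overall survival and asks how often a simulated readout would reach significance under a fixed target effect. Every number in it (sample size, target hazard ratio, follow-up, candidate medians) is an illustrative assumption.

```python
# Toy power-by-simulation comparison of candidate comparator-arm designs.
# All parameters are illustrative assumptions, not values from any trial.
import numpy as np
from lifelines.statistics import logrank_test

rng = np.random.default_rng(1)
N_PER_ARM, TARGET_HR, ALPHA, N_SIMS, FOLLOWUP = 200, 0.70, 0.05, 500, 24.0

def simulated_power(control_median_os):
    """Fraction of simulated trials whose logrank test reaches significance."""
    scale_c = control_median_os / np.log(2)   # exponential survival times
    scale_t = scale_c / TARGET_HR             # treatment arm under assumed HR
    hits = 0
    for _ in range(N_SIMS):
        t_c = rng.exponential(scale_c, N_PER_ARM)
        t_t = rng.exponential(scale_t, N_PER_ARM)
        # administrative censoring at the end of follow-up
        obs_c, ev_c = np.minimum(t_c, FOLLOWUP), t_c <= FOLLOWUP
        obs_t, ev_t = np.minimum(t_t, FOLLOWUP), t_t <= FOLLOWUP
        res = logrank_test(obs_c, obs_t, event_observed_A=ev_c, event_observed_B=ev_t)
        hits += res.p_value < ALPHA
    return hits / N_SIMS

# Candidate standard-of-care comparators, summarized here only by the
# control-arm median OS assumed for the planned cohort (illustrative).
for name, median_os in {"SoC option A": 8.5, "SoC option B": 11.1}.items():
    print(f"{name}: simulated power ~ {simulated_power(median_os):.2f}")
```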
Or consider a medical affairs team preparing a health economics submission for a new therapy in a biomarker-defined subgroup. The submission requires a comparison against the current standard of care. But published trial data for that standard of care reflect an unselected population, not the subgroup the new therapy targets. The comparison the team needs has never been made at the resolution they require.
A calibrated modeling approach does not eliminate the need for judgment in either case. But it does provide a structured, reproducible framework for generating those comparisons, rooted in the best available randomized evidence and tailored to the specific patient population in question.
If you are facing a comparative effectiveness question in your program, reach out to our team.
