Synthetic Control Groups Don’t Work, or: How I Learned to Stop Worrying and Love Randomized Trials

Charles K. Fisher

March 7, 2023

Many of the smart people I talk to about clinical trials are making big mistakes in thinking about how artificial intelligence (AI) will be able to improve trials. In particular, many of them believe that AI may soon be able to eliminate control groups from trials, and that digital twins are a method for creating synthetic control groups. Both of these beliefs are wrong. I often wonder, 'Why do so many smart people believe these things that are clearly wrong?'

In fact, I think I understand why some of these incorrect beliefs persist, and even propagate: up until a few years ago, I believed them too.

Let’s start our story with a bit of history to set the stage.

The idea of a controlled trial dates back many centuries, to at least James Lind's studies of scurvy in the mid-1700s. The main idea is simple: in order to know if a treatment works, we need something to compare it to. However, the concept of randomizing treatment assignment wasn't introduced until the early 1920s, following R.A. Fisher's work on agriculture. And the randomized controlled trial (RCT) in medicine as we know it today didn't materialize until a 1948 study of streptomycin for the treatment of tuberculosis.

In my opinion, it's quite remarkable that RCTs for medical research were invented after general relativity, after quantum mechanics, after the jet airplane, and around the same time as the ENIAC computer. RCTs are actually a modern technology.

Assigning participants in a trial to receive either an experimental or a comparative treatment at random was an innovation because it allows one to estimate the relative treatment effect without having to worry, on average, about alternative explanations (called 'confounders'). That is, if we observe a treatment effect in an RCT, we can be fairly confident that it was caused by a difference in the treatments rather than a difference in the people assigned to the treatment and control groups. Most importantly, we don't need any special knowledge to say that alternative explanations for an observed treatment effect are unlikely.

Randomization was an important innovation, but nothing in this world is free; everything has an opportunity cost. RCTs require large numbers of participants in order to provide precise estimates of treatment effects, which makes them very expensive and time consuming. They waste patients’ data, because information learned from previous studies is barely used to inform future studies. They only provide estimates of average treatment effects, and don’t tell us much about individuals. And, patients are often hesitant to participate in RCTs because, while clinical trials are often billed as a care option, nobody wants to be in the group randomly assigned to receive a placebo.

The opportunity costs associated with running RCTs haven’t gone unnoticed. Many researchers have observed that the control group in an RCT typically consists of participants receiving an existing, widely used treatment (i.e., placebo and standard of care). We already have data from lots of patients on widely used treatments from previous trials, disease registries, and even from routine practice—so, why don’t we use it? Maybe we don’t need randomized trials after all?

I’ll refer to the broad class of clinical trials without randomized, concurrent control groups as ‘externally controlled trials’. There are a couple of simple ways to run trials using external controls, and then there are more complicated ways.

Let’s start with a simple way. Run a study with two hospitals and give everyone at the first hospital the experimental treatment and everyone at the second hospital the control, then just compare the average outcomes from the two hospitals. How will you know if the people at the two hospitals are comparable? You won’t. Hasn’t stopped people from trying this anyway.

Here’s another simple way to run an externally controlled trial, which people call a ‘historical control’. If you’ve run an RCT in the same indication in the past, why not just re-use the control group from that first study in your new clinical trial? That is, enroll some participants in your new study and give all of them your experimental treatment, and then compare their outcomes to a control group from a previously completed trial. How will you know that nothing important has changed since you ran the first study? You won’t. But, historical controls are still used for some proof-of-concept studies.

There are obvious problems with these simple ways of running externally controlled trials. And, as we all know, the best way to circumvent problems with a simple approach is to use a more complicated approach!

Alright, let’s talk about synthetic control groups now (and, of course, how they don’t work).

In medicine, the term ‘synthetic control group’ is typically used to refer to a propensity score matched external control group. Let’s say we want to run an externally controlled trial in which we enroll a group of patients and give them all an experimental treatment, and then compare them to data from patients in the control groups of some previously completed trials (like a historical control). However, instead of using the data from all of the patients in our historical dataset, we’ll only select data from the patients who were similar to those in our new single arm trial. Ta-da!

While that sounds simple enough, we’re going to make it more complicated by using a statistical method to select which of our historical patients to include in our new analysis. This is called propensity score matching (or propensity score weighting, if a slightly different technique is used). The propensity score is a way of measuring if two patients look as though they had a similar chance of receiving a treatment; in the context of our study, a patient in the historical dataset ‘looks similar’ to a patient enrolled in our current study if they have similar propensity scores. Therefore, we can create a ‘synthetic control group’ by calculating the propensity scores for all of the patients in our current study and in our historical dataset, and then matching patients in the two cohorts based on their propensity scores.

Propensity score matching has a property that makes it sound really useful. If we know all of the relevant ways that patients could be different, and we account for all those variables when we do our statistical matching procedure, then a trial with a propensity score matched external control group looks just like an RCT. But, ‘How do you know if you have accounted for all of the relevant ways that patients could be different?’ you ask. You don’t!

Synthetic control groups only work if the researcher has some special knowledge. Special knowledge that allows them to rule out alternative explanations for observed differences between the patient populations in their study without having to rely on randomization. Special knowledge that cannot be verified or falsified. Without this special knowledge, synthetic controls don’t work any better than the simple methods I discussed previously.

So far, all of this has been setting the stage; now let’s talk about my mistake.

You may have noticed in reading the last section that there isn’t anything ‘synthetic’ about ‘synthetic control groups’ as the term is typically used in medicine. Wikipedia defines ‘synthetic data’ as:

“Synthetic data is information that's artificially generated rather than produced by real-world events. Typically created using algorithms, synthetic data can be deployed to validate mathematical models and to train machine learning models.

Data generated by a computer simulation can be seen as synthetic data. This encompasses most applications of physical modeling, such as music synthesizers or flight simulators. The output of such systems approximates the real thing, but is fully algorithmically generated.”

But a synthetic control group usually consists of data generated by real-world events. It is real patient data, just taken from a subset of a larger population. There’s nothing synthetic about it! This is incredibly frustrating to me as a computational scientist.

In any case, those of us who are familiar with computer simulation and generative modeling tend to look at this situation and think ‘well then, why don’t we try to actually create a computer simulated control group instead of matching to historical patient data?’.

Don’t just take my word for it. Here’s a quote from “Generative AI: Perspectives from Stanford HAI,” a March 2023 whitepaper from Stanford University:

Generative AI could make clinical trials more efficient by creating “synthetic” control patients (i.e., fake patients) using data from real patients and their underlying attributes (to be compared with the patients who receive the new therapy). It could even generate synthetic outcomes to describe what happens to these patients if they are untreated. Biomedical researchers could then use the outcomes of real patients exposed to a new drug with the synthetic statistical outcomes for the synthetic patients. This could make trials potentially smaller, faster, and less expensive, and thus lead to faster progress in delivering new drugs and diagnostics to clinicians and their patients.

Unfortunately, using AI to create synthetic/simulated control groups won’t work either. You would need to know that the AI model generalizes perfectly from the training population to the study population. And, you can’t know that.

Any trial with an external control group requires the researcher to have special, unverifiable knowledge in order to interpret the results causally. It’s true for historical controls. It’s true for propensity score matched external (i.e., ‘synthetic’) controls. It’s true for AI-generated controls. It’s always true. If the trial isn’t randomized, then you need special, unverifiable knowledge to know if observed differences are actually due to the treatment, or if they could have been caused by some alternative explanation.

Thinking that one can get around this fact with computational methods is a mistake, but one I definitely understand. In fact, this is pretty much what we were working on at Unlearn in 2018 until we figured out a better way.

I’ll get to the better way in a second, but first I’d like to double click on the differences between propensity scores and digital twins.

Propensity scores are weird when you first encounter them. To compute one, you start with a group of patients who received a treatment and a group who did not. Then, you build a statistical model to predict which patients received the treatment and which did not by looking at their pre-treatment (baseline) features. The model assigns a propensity score to a patient, which basically assesses “how much does this patient look like the patients who received the treatment?”. That’s why two groups of patients with similar propensity scores can be compared (if all potentially relevant features have been properly accounted for).
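As a concrete (and heavily simplified) sketch of that procedure, here is what computing propensity scores and matching on them might look like. Everything below — the covariates, sample sizes, and model — is an illustrative assumption on my part, not anyone's actual trial analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative baseline covariates: 50 patients in a new single-arm trial
# (all treated) and 200 patients from a historical control dataset.
X_trial = rng.normal(loc=0.5, size=(50, 3))
X_hist = rng.normal(loc=0.0, size=(200, 3))

X = np.vstack([X_trial, X_hist])
y = np.concatenate([np.ones(50), np.zeros(200)])  # 1 = enrolled in new trial

# Fit a logistic regression of 'received the treatment' on baseline features
# (plain gradient descent, to keep the sketch dependency-free).
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y)) / len(y)
    b -= 0.1 * (p - y).mean()

scores = 1.0 / (1.0 + np.exp(-(X @ w + b)))
trial_scores, hist_scores = scores[:50], scores[50:]

# Match each trial patient to the historical patient with the nearest
# propensity score; the matched patients form the 'synthetic' control group.
matches = np.abs(hist_scores[None, :] - trial_scores[:, None]).argmin(axis=1)
synthetic_control = X_hist[matches]
```

Note what this sketch cannot do: nothing in it can tell you whether those three covariates are the right ones to match on. That is exactly the unverifiable 'special knowledge' described above.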

At Unlearn, we use generative AI to create digital twins of individual patients. A digital twin isn’t a “fake patient”; rather, it is a computer simulation of a specific individual person. As specialists in generative machine learning models, we aim to learn the parameters of this simulator from large sets of historical patient data and then apply that learned simulator to new patients. For example, if Elvis Presley is a participant in our clinical trial and we create his digital twin, then we would train a generative model on data from historical controls, and then use it to simulate how Elvis Presley would likely respond if given the control (e.g., a placebo). Thus, Elvis’s digital twin can be used to compute a prognostic score (actually, a prognostic distribution) describing the likelihood of his future health outcomes.
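To make the distinction concrete, here is a toy version of that idea. This is my own illustrative reduction, not Unlearn's actual model: strip a 'digital twin' down to a conditional model of a patient's control outcome given their baseline data, fit on historical controls:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative historical control-arm data: one baseline feature per patient
# and an observed outcome under control (e.g., placebo).
baseline_hist = rng.normal(size=500)
outcome_hist = 2.0 * baseline_hist + rng.normal(scale=0.5, size=500)

# 'Train' a minimal conditional model: a linear fit plus the residual spread.
coef = np.polyfit(baseline_hist, outcome_hist, 1)
resid_sd = np.std(outcome_hist - np.polyval(coef, baseline_hist))

# A new participant's digital twin is a *distribution* over that person's
# likely control outcome, conditioned on their own baseline data --
# not a new fake patient.
new_patient_baseline = 1.2
twin_mean = np.polyval(coef, new_patient_baseline)
twin_samples = rng.normal(twin_mean, resid_sd, size=10_000)
```

The real models are far richer (longitudinal, multivariate, generative), but the key point survives the simplification: the output is a prognostic forecast for a specific real person.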

Let’s compare the two concepts.

  • Propensity score → probability of patient receiving the experimental treatment.
  • Digital twin → probability of patient’s outcome given control.

These are not remotely the same thing. Digital twins are not synthetic controls.

A patient’s digital twin provides a probabilistic forecast for their health outcomes under some relevant scenarios, such as if they were assigned to the control group in a trial. It has to be a ‘virtual twin’ of something, of a real person! It makes no sense to think of patients’ digital twins as new people, or fake patients, or whatever—patients’ digital twins tell you something about those real patients! That’s it!

But, I digress. The treatment effect for an individual patient is defined as the difference between their outcome on the experimental treatment and their outcome on the control. So, one logical conclusion is that we could estimate a treatment effect by giving a patient the experimental treatment, observing that outcome, and then subtracting their predicted outcome on control. That is, the patient’s digital twin could act as their own individual-level control.
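In code, the tempting idea is literally a subtraction. The numbers here are made up purely for illustration:

```python
# Hypothetical sketch: using a patient's digital twin as their own control.
observed_on_treatment = 7.2      # outcome observed after the experimental drug
twin_predicted_on_control = 5.1  # digital twin's predicted outcome on placebo

# Individual treatment effect estimate = observed minus predicted control.
# This is only a valid causal estimate if the twin model generalizes exactly
# to this patient -- which, as discussed below, you cannot verify.
individual_effect = observed_on_treatment - twin_predicted_on_control
```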

This is a very attractive idea. If this idea worked,

  • Trials would only need half as many patients as they do currently, making them much less expensive and time consuming.
  • All data from previous studies would be used to train the models that generate the digital twins, so that no patient data goes to waste.
  • Trials would provide estimates of individual treatment effects, in addition to population averages.
  • And, patients would be more likely to participate in these trials because they would always get access to a new experimental treatment.

That is, it would eliminate some of the key opportunity costs associated with randomization.

Unfortunately, it doesn’t really work, at least not yet. In order for this idea to work, we would need to know that the model used to create the participants’ digital twins generalizes exactly to the study population. I still believe that AI will eventually be advanced enough that we’ll be able to run most clinical trials in a computer (and for individuals, rather than populations), but I’m also convinced that’s still a long way off.

Fortunately, because participants’ digital twins can be used to predict their outcomes, we can use them to improve RCTs instead of trying to replace RCTs! We’ve written a ton about this, and have even gone through regulatory review via the EMA’s Novel Methodologies Qualification Opinion pathway. So, I’ll keep it brief here. In hand-wavy language, the general framework looks like this:

  1. Train a conditional generative model on historical data so that it can create digital twins of new patients.
  2. Enroll some patients in an RCT and collect pre-treatment data at baseline.
  3. Prompt the conditional generative model with the patients’ pre-treatment data to create their digital twins.
  4. Randomize the patients and collect their observed outcomes.
  5. Analyze the observed outcomes data from the patients while accounting for the predicted control outcomes from their digital twins (e.g., using covariate adjustment as in PROCOVA™).
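A minimal sketch of step 5 might look like the following. I'm assuming a simple linear covariate adjustment on simulated data with a made-up effect size; the actual PROCOVA procedure is specified in far more detail in Unlearn's publications and the EMA qualification materials:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 200
treated = rng.integers(0, 2, size=n)  # randomized treatment assignment
twin_pred = rng.normal(size=n)        # digital twin's predicted control outcome
true_effect = 1.0                     # made-up treatment effect

# Simulated observed outcome: prognosis + treatment effect + noise.
outcome = twin_pred + true_effect * treated + rng.normal(scale=0.5, size=n)

# Unadjusted estimate: plain difference in group means.
unadj = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# Covariate-adjusted estimate: regress outcome on [1, treated, twin_pred];
# the coefficient on `treated` is the adjusted treatment-effect estimate.
X = np.column_stack([np.ones(n), treated, twin_pred])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
adj = beta[1]
```

Because the twin's prediction explains much of the outcome variance, the adjusted estimator is more precise than the unadjusted difference in means, and that extra precision is what lets a trial enroll fewer control patients.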

This idea does work! By applying it,

  • Trials need fewer control patients than they do currently, making them much less expensive and time consuming.
  • All data from previous studies is used to train the models that generate the digital twins, so that no patient data goes to waste.
  • Trials primarily provide population average treatment effects, but individual level treatment effects can still be estimated (though, less accurately).
  • And, patients may be more likely to participate in these trials because they are more likely to get access to a new experimental treatment.

And guess what, no special knowledge is needed to interpret the results of the trial.

I hope at this point it’s clear that synthetic control groups don’t work because they require researchers to have special, unverifiable knowledge. That the methods used to create digital twins of patients are completely different from the methods normally used to construct synthetic control groups in clinical trials. That patients’ digital twins are not new ‘fake people’, but are instead forecasts of those real patients’ potential health outcomes. And that because they are prognostic forecasts, patients’ digital twins can be used to design better RCTs that have none of the downsides of trials with external controls.

So, I’ve come full circle. I started off thinking that we could use AI to replace RCTs, but I soon realized that randomization is here to stay. Rather than replacing RCTs, we can improve them. We can reimagine this 20th century technology for the 21st century.
