How Many People Do We Need to Find an Exact Match?

The goal of this post is to show (by way of a lottery analogy) that if we want to find an exact match for a subject with Alzheimer's, we would need a dataset with millions of patients -- an infeasible requirement for exact matching.

In the previous post, we showed that it is difficult to find exact matches from historical data for complex diseases like Alzheimer's disease (AD). When we try matching subjects using diverse criteria, such as all of the ADAS-Cog component scores, we don't get any exact matches [1].

This result actually makes sense, though, because we used a fairly small dataset of ~6,000 subjects. We mentioned in the previous post that only 19% of subjects in the CODR-AD dataset have an exact match on all ADAS-Cog component scores. That rate is quite low. In an ideal clinical trial, the matching rate would be 100%; in practice it won't be perfect, but what if we could get much closer to 100% than 19% -- say, 90%?

In other words, how many people would we need in the dataset to find an exact match for 90% of subjects? 

To understand this question a bit better, let's use an analogy that asks a similar question. The lottery works well here because it also involves finding exact matches. Imagine a lottery in which there are balls numbered 1-50 and k balls are drawn; the winning ticket must match all k numbers.

So the question is: how many tickets would we need to buy to have a 90% chance of holding the winning ticket, i.e., of having an exact match for all k numbers on one of our tickets? The answer is simple: to have a 90% chance of winning, we need to hold 90% of all possible tickets. We'll call the number of tickets we hold N.

If we plot this number (N) vs. number of balls drawn (k), this is what we get:

As the number of balls drawn (k) increases, the number of tickets we have to buy (N) grows rapidly (NB: the y-axis is in log scale). If we have to match 6 numbers, for example, we would have to buy about 14.3 million tickets [2]!
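This number is easy to check in a few lines of Python. Here is a minimal sketch (the variable names are ours):

```python
from math import comb

n_balls = 50   # balls numbered 1..50
k = 6          # numbers drawn per ticket
p = 0.9        # desired chance of holding the winning ticket

total_tickets = comb(n_balls, k)    # all distinct tickets: 50!/(6! 44!)
tickets_needed = p * total_tickets  # we must hold a fraction p of them

print(f"{total_tickets:,} possible tickets")    # 15,890,700 possible tickets
print(f"{tickets_needed:,.0f} tickets needed")  # 14,301,630 tickets needed
```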

Now let's go back to our original question about matching subjects in an AD dataset. The number of ADAS-Cog components we match on is like the number of balls drawn (k) in the lottery example. And the dataset size we would need so that 90% of subjects find an exact match corresponds to the number of tickets (N).

If we make a similar plot as before, the results look like this:

Here are some observations. To find matches for 90% of subjects with a dataset the size of CODR-AD, we can match on at most 4 of the ADAS-Cog components. If we filter using all 11 ADAS-Cog components, we actually need ~100 million subjects [3]. For reference, this number is larger than the worldwide AD population!

This is clearly an infeasible number of people to work with. As we saw in the first post, finding an exact match based on complex criteria is extremely difficult. We either can't find an exact match for all the components using a small dataset, or we need millions of people to find an exact match. 

Is there a way to work around this problem? Yes. As with many other problems, there is more than one approach we can take to our particular matching problem. In the next post, we will discuss and evaluate propensity score matching (PSM) as an alternative.


*Figures and appendix provided by Jon Walsh

1. In the previous post, we define an exact match as two subjects who have the same values for the covariates relevant to what we want to measure in a particular clinical trial.

2. See appendix for mathematical explanation

3. See appendix for mathematical explanation


Appendix

Let's take a look at the math we used to make the plots shown in the blog post. First, let's think about the lottery example.

If there are n different balls and we draw k of them, there are n!/(k!(n-k)!) distinct tickets.

If we buy distinct tickets (i.e., we do not pick them at random) and want to have a probability p of holding the winning ticket, then we need to hold a fraction p of all tickets, meaning we need N = p * n!/(k!(n-k)!) tickets.

Plugging in numbers (n = 50, p = 90%), we get the plot shown in fig. 1 in the main text.
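The curve in fig. 1 can be reproduced with a short loop; this sketch prints N for the first ten values of k:

```python
from math import comb

n, p = 50, 0.9
for k in range(1, 11):
    N = p * comb(n, k)  # tickets to hold for a 90% chance of winning
    print(f"k={k:2d}  N={N:,.0f}")
# the k=6 row reads N=14,301,630
```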

For the ADAS-Cog case (see fig. 2), there are a couple of differences. The different "tickets," i.e. combinations of ADAS-Cog component scores, are not equally probable: there are over 2 billion possible combinations (across all 11 components), but a small fraction are fairly common and many are very rare. Also, the population of subjects can contain duplicates -- two different subjects can have the same component scores. This changes the probability calculations, but ultimately, to compute the probability of finding a match for a subject, we need to understand the distribution of component scores. To do this, we build a simple model. We assume that the distribution of component scores is Gaussian and that it is approximately constant when a single component score changes by 1. Because there are so many possible scores, we'll regard them as continuous.

$$
\begin{aligned}
P &= 1 - \int dy\, q(y) \left[ 1 - \int_{|x-y|<\rho} dx\, q(x) \right]^{N} \\
&\approx 1 - \int dy\, q(y) \left[ 1 - V(\rho)\, q(y) \right]^{N} \\
&= 1 - \frac{n\, 2^{-n/2}}{\Gamma(1+n/2)} \int_0^1 (-2 \ln z)^{n/2-1} (1 - \alpha z)^{N}\, dz
\end{aligned}
$$

The first line is the exact expression for the probability: the inner integral is the probability that a random subject x matches a given subject y, while the term in square brackets is the probability that none of N subjects is a match. The scale ρ is a distance measure used to define matches. Using the "near constant" approximation of our model, we obtain the second line, where V(ρ) is a volume factor. Finally, under the Gaussian approximation of the distribution we get the last line, expressed in terms of an integral that is numerically tractable. Here n is the number of components and α is a constant depending on ρ and the determinant of the Gaussian's covariance matrix. For each number of components n, we empirically determine the value of α for a set of N < N(CODR-AD), the size of the CODR-AD dataset. We generally find that α varies slowly with N, so we perform a linear fit of ln(α) against ln(N) and use the fit to extrapolate α beyond N(CODR-AD) and predict the probability for larger dataset sizes. Allowing α to change accommodates differences between the data and the model, and we observe that the probabilities predicted from the fit agree well with the empirically measured probabilities for N < N(CODR-AD). The plot shows the estimated number of subjects N required to have a 90% match probability for each number n of components used.
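The last line lends itself to direct numerical evaluation. Below is a minimal Python sketch (function names are ours, and α is set to an illustrative value rather than one fitted to CODR-AD). It evaluates the integral after the substitution z = e^(-u), which avoids the logarithmic singularity at z = 0, and then runs a coarse doubling search for the dataset size N that reaches a 90% match probability:

```python
from math import gamma, exp

def match_prob(n, N, alpha, steps=20_000, u_max=60.0):
    """Probability that at least one of N subjects matches a random subject,
    in the Gaussian model:
        P = 1 - n 2^(-n/2)/Gamma(1+n/2)
              * integral_0^1 (-2 ln z)^(n/2-1) (1 - alpha z)^N dz
    evaluated with z = e^(-u) and a midpoint rule on u in (0, u_max]."""
    pref = n * 2.0 ** (-n / 2) / gamma(1 + n / 2)
    du = u_max / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * du
        z = exp(-u)
        # integrand after substitution: (2u)^(n/2-1) (1 - alpha z)^N e^(-u)
        total += (2 * u) ** (n / 2 - 1) * (1 - alpha * z) ** N * z * du
    return 1 - pref * total

def subjects_needed(n, alpha, target=0.9):
    """Coarse doubling search for the N that reaches the target probability."""
    N = 1
    while match_prob(n, N, alpha) < target:
        N *= 2
    return N
```

As a sanity check, with N = 0 other subjects the integral reduces to the normalization of the Gaussian volume factor, so the match probability is 0; the probability then rises monotonically with N.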
