The Travails of Comparing Generative Models

Recent tech headlines have been dominated by breathless stories about AI bursting into new domains. Brushing aside the hype, there have in fact been astounding achievements in the application of ML techniques to autonomous systems, computer vision, and natural language processing.

Many of these gains have been made in the field of supervised learning - the task of learning a mapping from some input variables to some output variables given a collection of input-output pairs. Indeed, there is a growing viewpoint that the new techniques have rendered several supervised learning domains "solved". It's hard to argue with such an attitude when a computer model can recognize traffic signs or convert images to text at a rate far superior to humans.

But simpler problems always give way to harder ones. Increasingly, the task of machine learning will be to explore the formidable frontier of unsupervised learning. Unsupervised learning is the task of learning a representation inherent in some data - without anything but the data itself. This definition is broad; it includes summarization, clustering, and density estimation. We can restrict the definition to one closer to what Yann LeCun prefers to call predictive learning - the task of learning a model that can predict the values of any subset of variables in a dataset conditioned on the rest. This confines his notion of unsupervised learning roughly to that of joint probability density estimation. Namely, given a dataset, learn a model which represents either explicitly or implicitly a joint probability distribution over all the variables in the dataset. From such a model one can then perform any prediction task by conditioning. Even under this more restrictive definition, LeCun made the following characterization:

'If [artificial] intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don't know how to make the cake.' - Yann LeCun NIPS 2016

Most models that estimate joint probability functions are generative. This means that the model affords a means of generating samples drawn from the modeled distribution. Pursuing this exceedingly useful property follows from what we dub the 'generative thesis' - that a good way to understand the structure of a dataset is to simulate processes that generate the data. The problem of modeling a dataset becomes the problem of building a machine that can generate new data points as if they had come from the original dataset.

The most often employed generative model types in deep unsupervised learning are Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Restricted Boltzmann Machines (RBMs).

Comparing generative models

The quality of a generative model must be judged by how well its distribution matches the distribution of the data. There are a few metrics that are commonly used for comparing different types of generative models.

Visual Inspection: The most common method to assess the performance of a model is to look at the samples it has generated. Many people object to this because it is qualitative. A more serious limitation is that it is inappropriate for data that are not images or natural-language text. For example, one probably cannot form a useful opinion about a generative model for protein sequences just by looking at the sequences it spits out.

Reconstruction Error: VAEs and RBMs can be viewed as methods that can take a data point, embed it into a lower dimensional space, then reverse the embedding to reconstruct the original data point. The error of the reconstruction tends to decrease during training, but one can easily obtain models with low reconstruction error but terrible samples. Moreover, reconstruction error isn't an appropriate metric for GANs (which don't do reconstruction).

Inception Scores: Inception is a neural network trained on ImageNet to perform object recognition. A number of methods for evaluating generative models of images compute scores based on the output of the Inception network. These scores may correlate with human perception of image quality (so this is a type of automated visual inspection). However, Inception scores don't apply to anything that isn't an image, because the Inception network is a model trained on images.

Log Likelihood: Estimating the log likelihood of the data under the model using Parzen window estimates or Annealed Importance Sampling is another common approach for quantifying the performance of a generative model. The log likelihood is an appropriate metric for all models and data types but can be difficult to estimate. Moreover, log likelihoods can be difficult to interpret in high dimensions [1].
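As a rough illustration of the Parzen window approach, the sketch below fits a Gaussian kernel density to samples drawn from a model and reports the mean log density of held-out data points. It is a minimal one-dimensional example in plain Python; the function name `parzen_log_likelihood`, the `bandwidth` value, and the toy Gaussian "models" are assumptions made for this example only.

```python
import math
import random


def parzen_log_likelihood(test_points, model_samples, bandwidth):
    """Mean log density of test_points under a Gaussian Parzen window
    estimate built from model_samples (one-dimensional data)."""
    n = len(model_samples)
    # Log of the normalizing constant 1 / (n * h * sqrt(2 * pi)).
    log_norm = -math.log(n) - math.log(bandwidth) - 0.5 * math.log(2.0 * math.pi)
    total = 0.0
    for x in test_points:
        exponents = [-0.5 * ((x - s) / bandwidth) ** 2 for s in model_samples]
        # Log-sum-exp over the kernels for numerical stability.
        top = max(exponents)
        total += top + math.log(sum(math.exp(e - top) for e in exponents)) + log_norm
    return total / len(test_points)


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(500)]        # stand-in for held-out data
good_model = [rng.gauss(0.0, 1.0) for _ in range(500)]  # samples from a well-matched model
bad_model = [rng.gauss(3.0, 1.0) for _ in range(500)]   # samples from a badly shifted model

ll_good = parzen_log_likelihood(data, good_model, bandwidth=0.3)
ll_bad = parzen_log_likelihood(data, bad_model, bandwidth=0.3)
print(f"well-matched model: {ll_good:.3f}, shifted model: {ll_bad:.3f}")
```

The estimate depends on the chosen bandwidth and on the number of model samples, which is one reason these numbers become hard to trust in high dimensions.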

Measuring the distance between the data and the model

Let p_d(x) be the data distribution density and let p_m(x) denote the model distribution density. What is a good measure of the distance between these distributions?

There are a number of means of quantifying the distance between these distributions. A particularly common set is the f-divergences, in particular the Kullback-Leibler Divergence (KLD), Reverse Kullback-Leibler Divergence (RKLD), and the Jensen-Shannon Divergence (JSD).

In our recent paper on Boltzmann Encoded Adversarial Machines, we implemented an approximation to these divergences to provide real-time insight into training dynamics. Although the estimation technique was presented in [2] back in 2009, to our knowledge no other ML paper uses online estimates of the KL, RKL, and JS divergences.
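To make the idea concrete, here is a minimal sketch of the k-nearest-neighbor KL estimator described in [2], restricted to one-dimensional samples and written in plain Python. It is an illustration under simplifying assumptions (brute-force neighbor search, continuous data with no duplicate points) and not necessarily the implementation used in the paper; the names `kth_nearest_distance` and `knn_kl_estimate` are invented for this example.

```python
import math
import random


def kth_nearest_distance(x, points, k):
    """Distance from x to its k-th nearest neighbor in points (brute force)."""
    return sorted(abs(x - p) for p in points)[k - 1]


def knn_kl_estimate(p_samples, q_samples, k=5, dim=1):
    """k-NN estimate of KL(p || q) from samples of p and q.

    Uses (dim / n) * sum_i log(nu_i / rho_i) + log(m / (n - 1)), where
    rho_i is the distance from x_i to its k-th neighbor among the other
    p samples and nu_i is the distance to its k-th neighbor among the
    q samples. Assumes no duplicate points, so rho_i > 0.
    """
    n, m = len(p_samples), len(q_samples)
    total = 0.0
    for i, x in enumerate(p_samples):
        others = p_samples[:i] + p_samples[i + 1:]  # leave x_i out
        rho = kth_nearest_distance(x, others, k)
        nu = kth_nearest_distance(x, q_samples, k)
        total += math.log(nu / rho)
    return dim * total / n + math.log(m / (n - 1))


rng = random.Random(0)
p_samples = [rng.gauss(0.0, 1.0) for _ in range(500)]
q_samples = [rng.gauss(1.0, 1.0) for _ in range(500)]
p_again = [rng.gauss(0.0, 1.0) for _ in range(500)]

# For unit-variance Gaussians one unit apart the analytic KL is 0.5;
# for two sample sets from the same distribution it is 0.
kl_shifted = knn_kl_estimate(p_samples, q_samples)
kl_same = knn_kl_estimate(p_samples, p_again)
print(f"shifted: {kl_shifted:.3f}, same distribution: {kl_same:.3f}")
```

Swapping the two arguments gives an estimate of the reverse divergence. Because the estimator needs only samples, it can in principle be evaluated on a batch of data and a batch of generated points at any time during training.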

The Kullback-Leibler Divergence (KLD),

$$ KL(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx, $$

is a non-negative quantity taking values on the extended real line (it can be +infinity), and it is zero if and only if p(x) = q(x) for (almost) all x.

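Importantly, this divergence is not symmetric in its two arguments. As a result, it is useful to study both KL(p_d || p_m) and the reverse divergence RKL(p_d || p_m) = KL(p_m || p_d):

$$ KL(p_d \,\|\, p_m) = \int p_d(x) \log \frac{p_d(x)}{p_m(x)} \, dx, \qquad RKL(p_d \,\|\, p_m) = \int p_m(x) \log \frac{p_m(x)}{p_d(x)} \, dx. $$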

Look, for instance, at the following example of a GAN in which the model is learning to represent the mixture of Gaussians shown in orange (Figure 1).

Figure 1. Three f-divergences plotted during GAN training. On the bottom row, the orange density is the ground truth mixture of Gaussians distribution, whereas the blue densities are the model distributions at epochs 15 and 65.

At the start of training the model learns to cover the whole domain with a smoothed distribution, thus achieving a fairly low KLD. However, the central hole in the data contributes to a rather large RKLD until the model learns to push that area down. After epoch 55 or so the model actually starts to forget some of the modes - a phenomenon called mode collapse - which is known to afflict some types of GANs.

In general, the KLD is sensitive to the model's failure to cover all of the modes of the training dataset. Conversely, the RKLD is sensitive to spurious modes in the model distribution.
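This complementary behavior is easy to check numerically. The sketch below is a toy example, not the experiment from Figure 1: the "data" is a hypothetical two-mode Gaussian mixture, one model drops a mode, and another adds a spurious mode in the gap between them. Both divergences are approximated with a simple Riemann sum on a grid; all function names and parameter values are assumptions for illustration.

```python
import math


def gaussian(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture(x, means, std=0.5):
    """Equal-weight mixture of Gaussians with a shared standard deviation."""
    return sum(gaussian(x, m, std) for m in means) / len(means)


def kl_on_grid(p, q, lo=-8.0, hi=8.0, steps=4000):
    """Midpoint Riemann-sum approximation of KL(p || q) on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        px, qx = p(x), q(x)
        if px > 0.0:
            if qx == 0.0:
                return math.inf  # p has mass where q has none
            total += px * math.log(px / qx) * dx
    return total


def data(x):
    return mixture(x, [-3.0, 3.0])       # two true modes


def drops_mode(x):
    return mixture(x, [-3.0])            # model misses the mode at +3


def spurious_mode(x):
    return mixture(x, [-3.0, 0.0, 3.0])  # model adds a mode at 0


kl_drop = kl_on_grid(data, drops_mode)       # KL(p_d || p_m)
rkl_drop = kl_on_grid(drops_mode, data)      # KL(p_m || p_d)
kl_spur = kl_on_grid(data, spurious_mode)
rkl_spur = kl_on_grid(spurious_mode, data)

print(f"dropped mode : KLD = {kl_drop:.2f}, RKLD = {rkl_drop:.2f}")
print(f"spurious mode: KLD = {kl_spur:.2f}, RKLD = {rkl_spur:.2f}")
```

In this toy setting the model that drops a mode is penalized far more heavily by the KLD than by the RKLD, while the model with the spurious mode shows the opposite pattern.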

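One problem with these divergences is that they are overly sensitive - they can blow up to infinity when comparing distributions with disjoint support. A more stable comparison, the Jensen-Shannon Divergence (JSD),

$$ JSD(p_d, p_m) = \frac{1}{2} KL(p_d \,\|\, \bar{p}) + \frac{1}{2} KL(p_m \,\|\, \bar{p}), \qquad \bar{p}(x) = \frac{1}{2} \left( p_d(x) + p_m(x) \right), $$

is symmetric and finite (it is bounded above by log 2). There are theoretical reasons why this should fall during training, and we can see that in Figure 1.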
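A small discrete example shows the difference in stability. In the sketch below, two hypothetical distributions over four outcomes have disjoint support, so the KL divergence is infinite in either direction while the JSD stays at its upper bound of log 2. The helper names `kl` and `jsd` are illustrative and written in plain Python.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total


def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    mid = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


p = [0.5, 0.5, 0.0, 0.0]
q = [0.0, 0.0, 0.5, 0.5]  # disjoint support from p

print("KL(p || q) =", kl(p, q))
print("KL(q || p) =", kl(q, p))
print("JSD(p, q)  =", jsd(p, q), "(log 2 =", math.log(2.0), ")")
```

The midpoint distribution always covers the support of both arguments, which is why neither KL term inside the JSD can diverge.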

These f-divergences are useful in quantifying the quality of a model's fit. But there is much more to be said in this regard. Many recent advancements in generative ML (esp. with models trained via backpropagation) recommend estimating coarser, more stable distributional metrics in order to define a training loss. See WGANs, Cramer-GANs, Sinkhorn AutoDiff, OT-GANs and more.

Moving beyond benchmarks

"All models are wrong, but some are useful." - George Box, Robustness in Statistics, 1979

There is a trend in machine learning research to set up a "battle royale" in which the performance of different algorithms is compared on a number of benchmarks. Example projects include MLPerf and DAWNBench. In doing so, however, the community needs to be careful not to build its own biases into these tests.

First, current benchmark initiatives coming from the machine learning community are strongly biased towards computer vision and natural language processing. There are many applications of machine learning outside of these areas that require a different focus: e.g., handling multimodal data, missing observations, and time series. Most models cannot be applied to all types of problems. Is a new technique better if it enables state-of-the-art performance on a specialized task, or general performance improvements on many tasks?

Second, models can be wrong in different ways. Therefore, different metrics will rank models differently. For example, the KLD punishes generative models that drop modes from the data distribution whereas the RKLD punishes models with spurious modes. Is it worse for a generative model to drop a mode or to have a spurious mode?

Finally, there are many factors that affect the decision about which technique to apply to a real problem. Obviously, we care about the performance of the model. We also usually care about the computational cost of training the model (which is incorporated in recent benchmarks like MLPerf and DAWNBench). However, applied researchers also care about development time, which is hard to measure. After all, many people write code in Python even though it runs much slower than Fortran on benchmarks. Similarly, some types of machine learning models are finicky. Is it better to have a model that performs pretty well after a small amount of hyperparameter tuning, or a model that performs really well but requires a lot of hyperparameter tuning?

We may have to accept that the search for universal principles of intelligence may not lead to a universal algorithm for artificial intelligence. Instead, we may end up with an ecosystem of algorithms - each one filling its niche.

[1] A Note on the Evaluation of Generative Models, Lucas Theis, Aäron van den Oord, and Matthias Bethge. ICLR 2016.

[2] Divergence Estimation for Multidimensional Densities via k-Nearest-Neighbor Distances, Qing Wang, Sanjeev R. Kulkarni, and Sergio Verdú. IEEE Transactions on Information Theory, vol. 55, no. 5, May 2009.
