Biology Publications

Standard machine learning approaches outperform deep representation learning on phenotype prediction from transcriptomics data

Aaron M. Smith, Jonathan R. Walsh, John Long, Craig B. Davis, Peter Henstock, Martin R. Hodge, Mateusz Maciejewski, Xinmeng Jasmine Mu, Stephen Ra, Shanrong Zhao, Daniel Ziemek & Charles K. Fisher

The ability to confidently predict health outcomes from gene expression would catalyze a revolution in molecular diagnostics. Yet, the goal of developing actionable, robust, and reproducible predictive signatures of phenotypes such as clinical outcome has not been attained in almost any disease area. Here, we report a comprehensive analysis spanning prediction tasks in ulcerative colitis, atopic dermatitis, diabetes, and many cancer subtypes, for a total of 24 binary and multiclass prediction problems and 26 survival analysis tasks. We systematically investigate the influence of gene subsets, normalization methods, and prediction algorithms. Crucially, we also explore the novel use of deep representation learning methods on large transcriptomics compendia, such as GTEx and TCGA, to boost the performance of state-of-the-art methods. The resources and findings in this work should serve both as an up-to-date reference on attainable performance and as a benchmarking resource for further research. Approaches that combine large numbers of genes consistently outperformed single-gene methods by a significant margin, but neither unsupervised nor semi-supervised representation learning techniques yielded consistent improvements in out-of-sample performance across datasets. Our findings suggest that l2-regularized regression methods applied to centered log-ratio transformed transcript abundances provide the best predictive analyses overall. Transcriptomics-based phenotype prediction benefits from proper normalization techniques and state-of-the-art regularized regression approaches. In our view, breakthrough performance is likely contingent on factors which are independent of normalization and general modeling techniques; these factors might include reduction of systematic errors in sequencing data, incorporation of other data types such as single-cell sequencing and proteomics, and improved use of prior knowledge.
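For concreteness, the winning recipe described above can be sketched in a few lines: apply a centered log-ratio (CLR) transform to transcript abundances, then fit an l2-regularized classifier. This is a minimal sketch, not the paper's actual pipeline; the pseudocount and the regularization strength C are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def clr(abundances, pseudocount=0.5):
    """Centered log-ratio: per-sample log abundance minus the mean log abundance."""
    logx = np.log(abundances + pseudocount)  # pseudocount guards against log(0)
    return logx - logx.mean(axis=1, keepdims=True)

# l2-regularized logistic regression on CLR-transformed transcript abundances
model = make_pipeline(
    FunctionTransformer(clr),
    StandardScaler(),
    LogisticRegression(penalty="l2", C=1.0, max_iter=5000),
)
# Usage: model.fit(train_counts, train_labels); model.predict_proba(test_counts)
```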

Biology Publications

Generating Digital Twins with Multiple Sclerosis Using Probabilistic Neural Networks

Jonathan R. Walsh, Aaron M. Smith, Yannick Pouliot, David Li-Bland, Anton Loukianov, Charles K. Fisher

Multiple Sclerosis (MS) is a neurodegenerative disorder characterized by a complex set of clinical assessments. We use an unsupervised machine learning model called a Conditional Restricted Boltzmann Machine (CRBM) to learn the relationships between covariates commonly used to characterize subjects and their disease progression in MS clinical trials. A CRBM is capable of generating digital twins, which are simulated subjects having the same baseline data as actual subjects. Digital twins allow for subject-level statistical analyses of disease progression. The CRBM is trained using data from 2395 subjects enrolled in the placebo arms of clinical trials across the three primary subtypes of MS. We discuss how CRBMs are trained and show that digital twins generated by the model are statistically indistinguishable from their actual subject counterparts along a number of measures.
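The generative step can be illustrated with a toy conditional sampler: clamp a subject's baseline covariates and run Gibbs sampling over the visible and hidden units to draw a simulated trajectory. This is a minimal sketch with binary units and random weights standing in for a trained model; the actual CRBM handles mixed variable types and is fit to the clinical trial data.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_cond, n_vis, n_hid = 10, 15, 12        # baseline covariates, future visits, hidden units
W = 0.1 * rng.standard_normal((n_vis, n_hid))   # visible-hidden couplings
U = 0.1 * rng.standard_normal((n_cond, n_hid))  # conditioning-hidden couplings
a, b = np.zeros(n_vis), np.zeros(n_hid)

def digital_twin(baseline, n_steps=200):
    """Sample future-visit variables conditioned on a subject's baseline."""
    v = (rng.random(n_vis) < 0.5).astype(float)
    for _ in range(n_steps):
        ph = sigmoid(b + v @ W + baseline @ U)   # hidden units see visibles AND baseline
        h = (rng.random(n_hid) < ph).astype(float)
        pv = sigmoid(a + h @ W.T)
        v = (rng.random(n_vis) < pv).astype(float)
    return v

baseline = (rng.random(n_cond) < 0.5).astype(float)
twin = digital_twin(baseline)   # one simulated subject sharing this baseline
```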

Conference Posters

Digital Control Subjects for Alzheimer’s Disease Clinical Trials (AMIA 2019)

Charles Fisher, Yannick Pouliot, Aaron Smith, Jonathan Walsh

Objective: To develop a method to model disease progression that simulates detailed clinical data records for subjects in the control arms of Alzheimer's disease clinical trials. Methods: We used a robust data processing framework to build a machine learning dataset from a database of subjects in the control arms of a diverse set of 28 clinical trials on Alzheimer's disease. From this dataset, we selected 1908 subjects with 18-month trajectories of 44 variables and trained a model capable of simulating disease progression in 3-month intervals across all variables. Results: Based on a statistical analysis comparing data from actual and simulated subjects, the model generates accurate subject-level distributions across variables and through time. Focusing on a common clinical trial endpoint for Alzheimer’s disease (ADAS-Cog), we show the model can predict disease progression as accurately as several supervised models. Our model also predicts the outcome of a clinical trial whose data are distinct from the training and test datasets. Conclusion: The ability to simulate dozens of clinical characteristics simultaneously is a powerful tool to model disease progression. Such models have useful applications for clinical trials, from analyzing control groups to supplementing real subject data in control arms.

Conference Posters

Generating Digital Control Subjects using Machine Learning for Alzheimer's Disease Clinical Trials (CTAD 2019)

Charles Fisher, Yannick Pouliot, Aaron Smith, Jonathan Walsh

• Background: Recently, there has been a flurry of attention focused on the benefits of synthetic control patients in clinical trials. The ability to reduce the burden on control subjects in clinical trials for complex diseases like Alzheimer’s Disease would drastically improve the search for beneficial therapies.
• Objective: To demonstrate that a machine learning model is capable of simulating Alzheimer’s Disease progression and generating digital control subjects that are statistically indistinguishable from actual controls.
• Methods: We developed a machine learning model of Alzheimer’s Disease progression trained with data from 4897 subjects from 28 clinical trial control arms involving early or moderate Alzheimer’s Disease. The model is an example of a Conditional Restricted Boltzmann Machine (CRBM), a kind of undirected neural network whose properties are well suited to the task of modeling clinical data progression. The model generates values for 47 variables for each digital control subject at three-month intervals.
• Results: Based on a statistical analysis comparing data from actual and digital control subjects, the model generates accurate subject-level distributions across variables and through time that are statistically indistinguishable from actual data.
• Conclusion: Our work demonstrates the potential for CRBMs to generate digital control subjects that are statistically indistinguishable from actual control subjects, with promising applications for Alzheimer’s Disease clinical trials.

Machine Learning Publications

Boltzmann Encoded Adversarial Machines

Charles K. Fisher, Aaron M. Smith, Jonathan R. Walsh

Restricted Boltzmann Machines (RBMs) are a class of generative neural networks that are typically trained to maximize a log-likelihood objective function. We argue that likelihood-based training strategies may fail because the objective does not sufficiently penalize models that place high probability in regions where the training data distribution has low probability. To overcome this problem, we introduce Boltzmann Encoded Adversarial Machines (BEAMs). A BEAM is an RBM trained against an adversary that uses the hidden layer activations of the RBM to discriminate between the training data and the probability distribution generated by the model. We present experiments demonstrating that BEAMs outperform RBMs and GANs on multiple benchmarks.
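One simplified reading of the mechanism is an adversarially weighted contrastive-divergence update: a critic is fit on hidden-layer activations to separate data from model samples, and fantasy samples the critic finds implausible are upweighted in the negative phase. The toy sketch below illustrates that idea only; it is not the estimator derived in the paper, and the toy data, learning rate, and logistic critic are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_vis, n_hid, lr = 20, 8, 0.05
W = 0.01 * rng.standard_normal((n_vis, n_hid))
a, b = np.zeros(n_vis), np.zeros(n_hid)

# toy data: noisy copies of two binary prototype patterns
protos = (rng.random((2, n_vis)) < 0.5).astype(float)
data = protos[rng.integers(0, 2, 256)]
data = np.where(rng.random(data.shape) < 0.05, 1.0 - data, data)

for epoch in range(50):
    ph0 = sigmoid(data @ W + b)                        # positive-phase hidden probs
    h = (rng.random(ph0.shape) < ph0).astype(float)
    vk = (rng.random(data.shape) < sigmoid(h @ W.T + a)).astype(float)  # CD-1 fantasy
    phk = sigmoid(vk @ W + b)                          # negative-phase hidden probs
    # adversary: discriminate data vs. fantasy using hidden activations
    X = np.vstack([ph0, phk])
    y = np.r_[np.ones(len(ph0)), np.zeros(len(phk))]
    critic = LogisticRegression(max_iter=500).fit(X, y)
    # upweight fantasy samples the critic scores as implausible
    w = 1.0 - critic.predict_proba(phk)[:, 1]
    w = w / w.mean()
    W += lr * (data.T @ ph0 - (vk * w[:, None]).T @ phk) / len(data)
    a += lr * (data.mean(0) - (vk * w[:, None]).mean(0))
    b += lr * (ph0.mean(0) - (phk * w[:, None]).mean(0))
```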

Biology Publications

Machine learning for comprehensive forecasting of Alzheimer’s Disease progression

Charles K. Fisher, Aaron M. Smith, Jonathan R. Walsh, Coalition Against Major Diseases

Most approaches to machine learning from electronic health data can only predict a single endpoint. The ability to simultaneously simulate dozens of patient characteristics is a crucial step towards personalized medicine for Alzheimer’s Disease. Here, we use an unsupervised machine learning model called a Conditional Restricted Boltzmann Machine (CRBM) to simulate detailed patient trajectories. We use data comprising 18-month trajectories of 44 clinical variables from 1909 patients with Mild Cognitive Impairment or Alzheimer’s Disease to train a model for personalized forecasting of disease progression. We simulate synthetic patient data including the evolution of each sub-component of cognitive exams, laboratory tests, and their associations with baseline clinical characteristics. Synthetic patient data generated by the CRBM accurately reflect the means, standard deviations, and correlations of each variable over time to the extent that synthetic data cannot be distinguished from actual data by a logistic regression. Moreover, our unsupervised model predicts changes in total ADAS-Cog scores with the same accuracy as specifically trained supervised models, additionally capturing the correlation structure in the components of ADAS-Cog, and identifies sub-components associated with word recall as predictive of progression.
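The logistic-regression indistinguishability check mentioned above is easy to reproduce in outline: train a classifier to separate actual from synthetic records and look for a cross-validated AUC near 0.5. A minimal sketch, assuming both cohorts are numeric feature matrices of the same shape:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def indistinguishability_auc(real, synth):
    """AUC of a logistic regression separating real from synthetic records.
    A score near 0.5 means the classifier cannot tell them apart."""
    X = np.vstack([real, synth])
    y = np.r_[np.ones(len(real)), np.zeros(len(synth))]
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# Usage: indistinguishability_auc(actual_trajectories, crbm_trajectories)
```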

Biology Publications

Who is this gene and what does it do? A toolkit for munging transcriptomics data in python

Charles K. Fisher, Aaron M. Smith, Jonathan R. Walsh

Transcriptional regulation is extremely complicated. Unfortunately, so is working with transcriptional data. Genes can be referred to using a multitude of different identifiers and are assigned to an ever-increasing number of categories. Gene expression data may be available in a variety of units (e.g., counts, RPKMs, TPMs). Batch effects often dominate the signal, but the metadata needed to correct for them may not be available. Most existing tools are written in R. Here, we introduce a library, genemunge, that makes it easier to work with transcriptional data in python. This includes translating between various types of gene names, accessing Gene Ontology (GO) information, obtaining expression levels of genes in healthy tissue, correcting for batch effects, and using prior knowledge to select sets of genes for further analysis. Code for genemunge is freely available on Github (http://github.com/unlearnai/genemunge).
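To give a flavor of the munging the library automates, here is a hedged sketch of two of the chores it addresses: mapping between identifier systems and converting counts to TPMs. The lookup table and function names below are illustrative stand-ins, not genemunge's API; see the repository for the actual interface.

```python
import numpy as np

# illustrative stand-in for a curated identifier lookup (not genemunge's API)
ENSEMBL_TO_SYMBOL = {"ENSG00000141510": "TP53", "ENSG00000012048": "BRCA1"}

def to_symbols(ensembl_ids, mapping=ENSEMBL_TO_SYMBOL):
    """Translate Ensembl IDs to gene symbols, passing unknown IDs through."""
    return [mapping.get(g, g) for g in ensembl_ids]

def counts_to_tpm(counts, gene_lengths_kb):
    """Standard counts -> TPM conversion: length-normalize, then scale to 1e6."""
    rate = counts / gene_lengths_kb             # reads per kilobase, per gene
    return rate / rate.sum(axis=1, keepdims=True) * 1e6
```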

Machine Learning Publications

A high-bias, low-variance introduction to Machine Learning for physicists

Pankaj Mehta, Marin Bukov, Ching-Hao Wang, Alexandre G.R. Day, Clint Richardson, Charles K. Fisher, David J. Schwab

Machine Learning (ML) is one of the most exciting and dynamic areas of modern research and application. The purpose of this review is to provide an introduction to the core concepts and tools of machine learning in a manner easily understood and intuitive to physicists. The review begins by covering fundamental concepts in ML and modern statistics such as the bias-variance tradeoff, overfitting, regularization, and generalization before moving on to more advanced topics in both supervised and unsupervised learning. Topics covered in the review include ensemble models, deep learning and neural networks, clustering and data visualization, energy-based models (including MaxEnt models and Restricted Boltzmann Machines), and variational methods. Throughout, we emphasize the many natural connections between ML and statistical physics. A notable aspect of the review is the use of Python notebooks to introduce modern ML/statistical packages to readers using physics-inspired datasets (the Ising Model and Monte-Carlo simulations of supersymmetric decays of proton-proton collisions). We conclude with an extended outlook discussing possible uses of machine learning for furthering our understanding of the physical world as well as open problems in ML where physicists may be able to contribute. (Notebooks are available here)
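The bias-variance tradeoff that opens the review is easy to demonstrate numerically: fit polynomials of increasing degree to noisy samples of a smooth function and watch test error fall, then rise. A minimal sketch; the target function, noise level, and degrees are arbitrary choices, not examples from the review's notebooks.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 30))
y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(30)   # noisy training data
x_test = np.linspace(-1, 1, 200)
y_true = np.sin(np.pi * x_test)                          # noiseless target

for deg in (1, 3, 12):   # underfit (high bias), good fit, overfit (high variance)
    coefs = np.polyfit(x, y, deg)
    mse = np.mean((np.polyval(coefs, x_test) - y_true) ** 2)
    print(f"degree {deg:2d}: test MSE = {mse:.3f}")
```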

Biology Publications

Deep Learning of Representations for Transcriptomics-based Phenotype Prediction

Aaron M. Smith, Jonathan R. Walsh, John Long, Craig B. Davis, Peter Henstock, Martin R. Hodge, Mateusz Maciejewski, Xinmeng Jasmine Mu, Stephen Ra, Shanrong Zhao, Daniel Ziemek, Charles K. Fisher

The ability to predict health outcomes from gene expression would catalyze a revolution in molecular diagnostics. This task is complicated because expression data are high dimensional whereas each experiment is usually small (e.g., ~20,000 genes may be measured for ~100 subjects). However, thousands of transcriptomics experiments with hundreds of thousands of samples are available in public repositories. Can representation learning techniques leverage these public data to improve predictive performance on other tasks? Here, we report a comprehensive analysis using different gene sets, normalization schemes, and machine learning methods on a set of 24 binary and multiclass prediction problems and 26 survival analysis tasks. Methods that combine large numbers of genes outperformed single-gene methods, but neither unsupervised nor semi-supervised representation learning techniques yielded consistent improvements in out-of-sample performance across datasets. Our findings suggest that l2-regularized regression methods applied to centered log-ratio transformed transcript abundances provide the best predictive analyses.
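The benchmarking protocol can be sketched as follows: learn an unsupervised embedding on a large public compendium, then compare an l2-penalized classifier on raw features against the same classifier on embedded features. PCA stands in here for the representation learners studied in the paper, and the component count and cross-validation setup are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def compare_representations(X_compendium, X_task, y_task, n_components=50):
    """Cross-validated AUC on raw features vs. features embedded by a model
    fit on the large unlabeled compendium (PCA as a stand-in embedder)."""
    embedder = PCA(n_components=n_components).fit(X_compendium)
    clf = LogisticRegression(penalty="l2", max_iter=5000)
    auc_raw = cross_val_score(clf, X_task, y_task,
                              cv=5, scoring="roc_auc").mean()
    auc_emb = cross_val_score(clf, embedder.transform(X_task), y_task,
                              cv=5, scoring="roc_auc").mean()
    return auc_raw, auc_emb
```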

Conference Posters

Generating Synthetic Control Subjects Using Machine Learning for Clinical Trials in Alzheimer's Disease (DIA 2019)

Charles K. Fisher, Yannick Pouliot, Aaron M. Smith, Jonathan R. Walsh

Objective: To develop a method to model disease progression that simulates detailed patient trajectories. To apply this model to subjects in control arms of Alzheimer's disease clinical trials. Methods: We used a robust data processing framework to build a machine learning dataset from a database of subjects in the control arms of a diverse set of 28 clinical trials on Alzheimer's disease. From this dataset, we selected 1908 subjects with 18-month trajectories of 44 variables and trained 5 cross-validated models capable of simulating disease progression in 3-month intervals across all variables. Results: Based on a statistical analysis comparing data from actual and simulated patients, the model generates accurate patient-level distributions across variables and through time. Focusing on a common clinical trial endpoint for Alzheimer’s disease (ADAS-Cog), we show the model can predict disease progression as accurately as several supervised models. Our model also predicts the outcome of a clinical trial whose data are distinct from the training and test datasets. Conclusion: The ability to simulate dozens of patient characteristics simultaneously is a powerful tool to model disease progression. Such models have useful applications for clinical trials, from analyzing control groups to supplementing real subject data in control arms.