Statistical and Machine Learning Applications
in Biomedical Sciences Schedule

February 29, 2024 & March 1, 2024
Sue Gross Auditorium


Zhenke Wu
Title: Towards Better Policies in Sequential Decision Making: A Robust Test for Stationarity

Abstract: Reinforcement learning (RL) is a powerful technique that allows an autonomous agent to learn an optimal policy to maximize the expected return. The optimality of various RL algorithms relies on the stationarity assumption, which requires time-invariant state transition and reward functions. However, deviations from stationarity over extended periods often occur in real-world applications like robotics control, health care and digital marketing, resulting in sub-optimal policies learned under stationary assumptions. We propose a doubly-robust procedure for testing the stationarity assumption and detecting change points in offline RL settings, e.g., using data obtained from a completed sequentially randomized trial. Our proposed testing procedure is robust to model misspecifications and can effectively control type-I error while achieving high statistical power, especially in high-dimensional settings. I will use an interventional mobile health study, the largest to date in the US, to illustrate the advantages of our method in detecting change points and optimizing long-term rewards in high-dimensional, non-stationary environments.

Julia Adela Palacios
Title: Inference from Single Cell Lineage Tracing Data Generated via Genome Editing

Abstract: Single cell lineage tracing data obtained via genome editing with CRISPR/Cas9 technology enables us to better understand important developmental processes at an unprecedented resolution. I will present a model that allows us to infer cell lineage phylogenies, mutation rates and lineage population size trajectories. We assume an efficient bounded coalescent model on cell phylogenies and propose a mutation model that describes how synthetic CRISPR target arrays generate observed variation after many cell divisions. We apply our method to two different CRISPR technologies and discuss future directions, challenges, and opportunities.

Fengzhu Sun
Title: DeepLINK: Deep learning inference using knockoffs with applications to genomics

Abstract: Although practically attractive with high prediction and classification power, complicated machine learning methods often lack interpretability and reproducibility, limiting their scientific usage. A useful remedy is to select truly important variables contributing to the response of interest. We develop methods for deep learning inference using knockoffs, DeepLINK and DeepLINK-T, to achieve the goal of variable selection with controlled error rate in deep learning models for cross-sectional and time series data. We show that DeepLINK can have high power in variable selection with a broad class of model designs. We apply DeepLINK to real datasets related to the human gut microbiome, murine and human single cell RNA-seq data sets, and marine microbiome time series data. DeepLINK produces statistical inference results that are both reproducible and biologically meaningful, demonstrating its promise for a broad range of scientific applications.
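
As an illustration of what "variable selection with controlled error rate" means in a knockoff framework, here is a minimal sketch of the generic knockoff+ selection step. It is not the DeepLINK implementation. It assumes that an importance statistic W[j] has already been computed for each variable, with large positive values indicating that the original variable looks more important than its knockoff copy. The function name `knockoff_plus_select`, the example statistics, and the target level `q` are hypothetical.

```python
def knockoff_plus_select(W, q):
    """Return indices selected by the knockoff+ rule at target FDR level q.

    W is a list of knockoff importance statistics (assumed precomputed).
    The rule finds the smallest threshold t among the nonzero |W[j]| such that
    (1 + #{j: W[j] <= -t}) / max(1, #{j: W[j] >= t}) <= q,
    and selects every variable with W[j] >= t.
    """
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        negatives = sum(1 for w in W if w <= -t)
        positives = sum(1 for w in W if w >= t)
        if (1 + negatives) / max(1, positives) <= q:
            return [j for j, w in enumerate(W) if w >= t]
    # No threshold satisfies the bound, so nothing is selected.
    return []


# Hypothetical statistics for seven variables.
W = [5.0, 4.0, 3.0, -1.0, 2.0, 0.5, -0.2]
selected = knockoff_plus_select(W, q=0.5)
print(selected)
```

The count of large negative statistics serves as an estimate of the number of false discoveries among the large positive ones, which is why the ratio can be compared directly with the target level.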

Yubai Yuan
Title: Optimal Transport for Latent Integration with An Application to Heterogeneous Neuronal Activity Data

Abstract: Detecting dynamic patterns of task-specific responses shared across heterogeneous datasets is an essential and challenging problem in many scientific applications in medical science and neuroscience. In our motivating example of rodent electrophysiological data, identifying the dynamical patterns in neuronal activity associated with ongoing cognitive demands and behavior is key to uncovering the neural mechanisms of memory. One of the greatest challenges in investigating a cross-subject biological process is that the systematic heterogeneity across individuals could significantly undermine the power of existing machine learning methods to identify the underlying biological dynamics. In addition, many technically challenging neurobiological experiments are conducted on only a handful of subjects where rich longitudinal data are available for each subject. The low sample sizes of such experiments could further reduce the power to detect common dynamic patterns among subjects. In this paper, we propose a novel heterogeneous data integration framework based on optimal transport to extract shared patterns in complex biological processes. A key advantage of the proposed method is that it can increase statistical power in identifying common patterns by aligning the extracted latent spatiotemporal information across subjects, which reduces heterogeneity unrelated to the signal. Our approach is effective even with a small number of subjects, and does not require auxiliary matching information for the alignment. In particular, our method can align longitudinal data across heterogeneous subjects in a common latent space to capture the dynamics of shared patterns while utilizing temporal dependency within subjects. Our numerical studies on both simulation settings and neuronal activity data indicate that the proposed data integration approach improves prediction accuracy compared to existing machine learning methods.

Xiaohui Chen
Title: GeONet: a neural operator for learning the Wasserstein geodesic

Abstract: Optimal transport (OT) offers a versatile framework to compare complex data distributions in a geometrically meaningful way. Traditional methods for computing the Wasserstein distance and geodesic between probability measures require mesh-dependent domain discretization and suffer from the curse of dimensionality. We present GeONet, a mesh-invariant deep neural operator network that learns the non-linear mapping from the input pair of initial and terminal distributions to the Wasserstein geodesic connecting the two endpoint distributions. In the offline training stage, GeONet learns the saddle point optimality conditions for the dynamic formulation of the OT problem in the primal and dual spaces that are characterized by a coupled PDE system. The subsequent inference stage is instantaneous and can be deployed for real-time predictions in the online learning setting. We demonstrate that GeONet achieves testing accuracy comparable to standard OT solvers on a simulation example and the CIFAR-10 dataset while reducing inference-stage computational cost by orders of magnitude.

Andrew J Gentles
Title: Spatial organization and immunotherapy response in solid tumors

Abstract: In cancer, complex ecosystems of interacting cell types play fundamental roles in tumor development, progression, and response to therapy. However, the cellular organization, community structure, and spatially defined microenvironments of human tumors remain poorly understood. With the emergence of new technologies for high-throughput spatial profiling of complex tissue specimens, it is now possible to identify clinically significant spatial features with high granularity. We recently introduced EcoTyper, a machine learning framework for large-scale identification and validation of cell states and multicellular communities from bulk, single-cell, and spatially resolved gene expression data. When applied to 12 major cell lineages across 16 types of human carcinoma, EcoTyper identified 69 transcriptionally defined cell states. Most states were specific to neoplastic tissue, ubiquitous across tumor types, and significantly prognostic. By analyzing cell-state co-occurrence patterns, we discovered ten clinically distinct multicellular communities with unexpectedly strong conservation, including three with myeloid and stromal elements linked to adverse survival, one enriched in normal tissue, and two associated with early cancer development. Two ecotypes represented inflamed immune responses that correlated with response to immunotherapy, influenced by the spatial organization of their cell state. This study elucidates fundamental units of cellular organization in human carcinoma and provides a framework for large-scale profiling of cellular ecosystems in any tissue.

Damla Senturk
Title: Modeling intra-individual inter-trial EEG response variability in autism

Abstract: Autism spectrum disorder (autism) is a prevalent neurodevelopmental condition characterized by early emerging impairments in social behavior and communication. EEG represents a powerful and non-invasive tool for examining functional brain differences in autism. Recent EEG evidence suggests that greater intra-individual trial-to-trial variability across EEG responses in stimulus-related tasks may characterize brain differences in autism. Traditional analysis of EEG data largely focuses on mean trends of the trial-averaged data, where trial-level analysis is rarely performed due to low neural signal to noise ratio. We propose to use nonlinear (shape-invariant) mixed effects (NLME) models to study intra-individual inter-trial EEG response variability using trial-level EEG data. By providing more precise metrics of response variability, this approach could enrich our understanding of neural disparities in autism and potentially aid the identification of objective markers. The proposed multilevel NLME models quantify variability in the signal’s interpretable and widely recognized features (e.g., latency and amplitude) while also regularizing estimation based on noisy trial-level data. Even though NLME models have been studied for more than three decades, existing methods cannot scale up to large data sets. We propose computationally feasible estimation and inference methods via the use of a novel minorization-maximization (MM) algorithm. Extensive simulations are conducted to show the efficacy of the proposed procedures. Applications to data from a large national consortium find that autistic children have larger intra-individual inter-trial variability in P1 latency in a visual evoked potential (VEP) task, compared to their neurotypical peers.

Peter Song
Title: Analyzing the influence of physical activity on biological age using wearable device data

Abstract: We consider a scalar-on-function regression analysis of physical activity data collected from a wearable device, in which the functional predictor is given by a subject’s Occupation-Time curve (OTC), which presents a proportional continuum of time spent at or above varying activity levels. We invoke a mixed integer optimization (MIO) paradigm to formulate a fused estimation method for homogeneity pursuit. This new approach can simultaneously perform changepoint detection and step-function parameter estimation. We show through extensive simulation experiments that the proposed MIO methodology enjoys both estimation accuracy and computational efficiency. Under some mild regularity conditions, we establish a finite error bound for the changepoint selection consistency and parameter estimation consistency. We apply the proposed MIO method to real-world data to assess the influence of physical activity on biological age.
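
To make the Occupation-Time curve concrete, the following is a minimal sketch of how such a curve could be computed from one subject's activity readings, based only on the description above: for each activity level, take the proportion of recorded epochs at or above that level. The function name `occupation_time_curve`, the sample readings, and the grid of levels are hypothetical, and the speaker's actual preprocessing may differ.

```python
def occupation_time_curve(activity, levels):
    """Proportion of epochs with activity at or above each level.

    activity: list of per-epoch activity readings for one subject (assumed format).
    levels:   list of activity thresholds at which to evaluate the curve.
    """
    n = len(activity)
    if n == 0:
        raise ValueError("activity must contain at least one reading")
    return [sum(1 for a in activity if a >= c) / n for c in levels]


# Hypothetical readings over four epochs, evaluated at three levels.
curve = occupation_time_curve([0, 1, 2, 3], [0, 2, 4])
print(curve)
```

By construction the curve is non-increasing in the activity level, which is the kind of structure a step-function coefficient in a scalar-on-function regression would weight.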

Yan Liu
Title: Deciphering Neural Networks through the Lenses of Feature Interactions

Abstract: Interpreting how neural networks work is a crucial and challenging task in machine learning. In this talk, I will discuss a novel framework, namely neural interaction detector (NID), for interpreting complex neural networks by detecting statistical interactions captured by the neural networks. Furthermore, we can construct a more interpretable generalized additive model that achieves similar prediction performance as the original neural networks. Experiment results on several applications demonstrate the effectiveness of NID.

Xiucai Ding
Title: Manifold learning for noisy and high-dimensional datasets: challenges and some solutions

Abstract: Manifold learning theory has garnered considerable attention in the modeling of expansive biomedical datasets, showcasing its ability to capture data essence more effectively than traditional linear methodologies. Nevertheless, prevalent algorithms are primarily designed for low-dimensional and clean datasets, whereas contemporary biomedical datasets tend to be high-dimensional and noisy. This presentation addresses the adaptation of these algorithms to effectively accommodate the challenges posed by high dimensionality and noise in modern datasets.

Andrew Holbrook
Title: A Bayesian Hierarchical Spatially Varying Coefficients Model for Longitudinal Structural Data in Glaucomatous Eyes

Abstract: We model macular thickness measurements over time and location to monitor glaucoma deterioration and prevent vision loss. Data characteristics vary over a 6×6 grid of locations on the retina with additional variability arising from the imaging process at each visit. Currently, physicians estimate slopes using repeated simple linear regression for each subject and location. We develop a novel Bayesian hierarchical model with spatially varying population-level and subject-specific coefficients with visit effects, accounting for both spatial and within-subject correlation, leading to more precision in estimating slopes. We employ correlated spatially varying a) intercepts, b) slopes, and c) residual standard deviations (SD) by treating these parameter fields as multivariate Gaussian processes with flexible Matérn cross-covariance functions. Each marginal process assumes an exponential kernel with its own SD and spatial correlation matrix. We apply our model to data from the Advanced Glaucoma Progression Study, providing insight into the correlations between the spatially varying processes at the population and subject levels.

Louis Ehwerhemuepha
Title: Challenges in pediatric data science – use cases at CHOC

Abstract: In this presentation, we will highlight some practical statistical learning challenges we encounter developing models to predict outcomes of interest in pediatrics. This includes adjusting for bias in predicted probabilities by demographics (to improve equitable use of ML/AI); prediction of very rare but highly morbid conditions; challenges with small sample sizes in histopathology imaging prediction tasks; missing data and mismatches in modality for multimodal unstructured data prediction tasks; and the need for precision medicine in routine clinical care such as determining normal vital signs and labs. We will present these challenges with open discussions on adjustments that will mitigate their impact.

Michele Guindani
Title: Integrative modeling and precision medicine: challenges from a statistical and imaging perspective

Abstract: Precision medicine promises improved health outcomes by accounting for individual variability in genetics, lifestyle, and environment when determining treatments. However, realizing precision medicine relies heavily on integrative modeling – the fusion of diverse data types like genomics, imaging, electronic health records, socioeconomic information, and more to derive insights into patients’ characteristics. This emerging field is full of statistical and computational challenges.

This talk provides an overview of these hurdles and showcases a few case studies. A dynamic brain connectivity analysis demonstrates incorporating concurrent physiological measurements to illuminate transitions between functional states during a motor task. An imaging genetics study reveals how joint modeling can identify imaging biomarkers and risk factors for schizophrenia. Ongoing challenges remain. This talk synthesizes a perspective on the field and outlines key steps to achieving robust, generalizable integrative models that pave the way for actionable precision medicine.