Foundation Models for Single-Cell Genomics

This theme explores applying large foundation models pretrained on massive single-cell datasets to genomics analysis tasks. Example projects include Cell2Sentence, which represents single-cell data as natural language so that sequence models can be trained on millions of cells. By pretraining on a huge corpus of cells, Cell2Sentence gains the ability to generate cells of specified types, interpret new data, translate across species, and perform other tasks via prompting. Key applications are accelerating and improving single-cell analysis pipelines for tasks such as cell type classification, trajectory inference, imputation, and generation of synthetic datasets. Ongoing work is creating even larger single-cell foundation models, combining them with specialized techniques such as graph neural networks, and exploring clinical applications such as modeling tumor heterogeneity.
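
As a rough illustration of the cell-sentence idea, the following minimal sketch (a toy example, not the published Cell2Sentence implementation) rank-orders a cell's genes by expression and emits their names as a space-separated sentence that a language model can consume:

    import numpy as np

    def cell_to_sentence(expression, gene_names, top_k=100):
        # Rank genes by decreasing expression, drop unexpressed genes,
        # and emit the remaining gene names as a "cell sentence".
        order = np.argsort(expression)[::-1]
        order = [i for i in order if expression[i] > 0][:top_k]
        return " ".join(gene_names[i] for i in order)

    # Toy example: three genes measured in one cell.
    genes = ["CD3E", "MS4A1", "NKG7"]
    counts = np.array([5.0, 0.0, 12.0])
    print(cell_to_sentence(counts, genes))  # -> "NKG7 CD3E"

Once cells are serialized this way, standard tokenizers and language-model training pipelines can be applied without modification.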

Large Language Models for Biomedical Data Analysis

This theme involves developing methods to represent and integrate diverse modalities of biomedical data, including genomics, imaging, and clinical data, into large language models. By pretraining on massive corpora of biomedical data formatted as text, we enable these models to gain a rich understanding of biomedical concepts and relationships. Example projects include Cell2Sentence for representing single-cell transcriptomic data as sentences, and BrainLM for pretraining on a large corpus of fMRI brain recordings. Downstream tasks enabled by this approach include generative modeling of biomedical data, transfer learning for prediction tasks, and extraction of biological insights through probing. Ongoing work is expanding the modalities integrated, creating multi-task models, and exploring clinical applications.
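
The pretraining recipe behind models such as BrainLM can be sketched as masked reconstruction over parcellated fMRI time series. The snippet below is a simplified illustration with assumed shapes and hyperparameters, not the published architecture:

    import torch
    import torch.nn as nn

    class MaskedTimeseriesAutoencoder(nn.Module):
        # Mask a fraction of time patches, encode the rest with a small
        # Transformer, and reconstruct the hidden patches (toy sketch).
        def __init__(self, patch_len=20, dim=64, mask_ratio=0.5):
            super().__init__()
            self.embed = nn.Linear(patch_len, dim)
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.decode = nn.Linear(dim, patch_len)
            self.mask_ratio = mask_ratio

        def forward(self, patches):
            # patches: (batch, n_patches, patch_len) chunks of a parcel's signal
            tokens = self.embed(patches)
            mask = torch.rand(patches.shape[:2], device=patches.device) < self.mask_ratio
            tokens = tokens.masked_fill(mask.unsqueeze(-1), 0.0)  # hide masked patches
            recon = self.decode(self.encoder(tokens))
            return ((recon - patches) ** 2)[mask].mean()          # loss on masked patches only

    model = MaskedTimeseriesAutoencoder()
    signal = torch.randn(8, 30, 20)   # 8 recordings, 30 patches of length 20
    loss = model(signal)
    loss.backward()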

Graph Neural Networks for Computational Biology  

This theme focuses on developing graph neural network architectures and self-supervised objectives tailored to computational biology problems involving graphs. Example graphs include protein structures, gene regulatory networks, and spatially resolved transcriptomics data. Projects under this theme have designed new layers that capture specific properties of biomedical graphs, such as AMPNet's cross-attention mechanism for modeling feature interactions. Pretraining strategies leverage domain structure and unlabeled data to learn informative biological representations. Downstream tasks improved by these advances include node classification, link prediction, data generation, and interpretation of learned representations. Ongoing work is expanding the graph domains covered, exploring multi-scale graph modeling, and developing theoretical understanding.
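
To make the feature-interaction idea concrete, the toy layer below lets each feature of a receiving node attend over the features of a sending node; it is loosely inspired by AMPNet's cross-attention but is not the published implementation, and all shapes and names are assumptions:

    import torch
    import torch.nn as nn

    class FeatureCrossAttention(nn.Module):
        # Feature-level cross-attention along an edge: each feature embedding of
        # the receiver queries the sender's feature embeddings.
        def __init__(self, dim=16):
            super().__init__()
            self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=2, batch_first=True)

        def forward(self, receiver_feats, sender_feats):
            # *_feats: (n_edges, n_features, dim), one embedding per feature
            message, weights = self.attn(receiver_feats, sender_feats, sender_feats)
            return message, weights  # weights expose feature-feature interactions

    layer = FeatureCrossAttention(dim=16)
    recv = torch.randn(4, 10, 16)    # 4 edges, 10 features per node
    send = torch.randn(4, 10, 16)
    msg, w = layer(recv, send)       # w: (4, 10, 10) interpretable attention map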

Physics-Informed Deep Learning for Spatiotemporal Data

This theme involves leveraging mathematical principles from physics and dynamical systems to create deep learning models that learn robust representations and generate accurate predictions for spatiotemporal data. Example domains include medical imaging data, neural activity recordings, climate modeling, and simulation of physical systems. Projects under this theme, such as ANIE, NIDE, and CST, inject inductive biases based on theories of dynamics and differential equations into model architectures and training. This makes it possible to uncover governing equations and long-range interactions from data. Ongoing work is expanding the classes of mathematical models integrated, creating hybrid model combinations, and exploring theoretical connections to Koopman operator theory.
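
Schematically, one family of formulations these models build on is the neural integro-differential equation, written here in generic form rather than as any project's exact published equation:

    \frac{dy}{dt} = f_\theta\big(y(t), t\big) + \int_0^t K_\theta\big(t, s, y(s)\big)\, ds

where f_\theta and K_\theta are neural networks: the first term models local, instantaneous dynamics, while the integral term captures long-range, non-local interactions in time.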

Operator Learning for Modeling Complex Biomedical Systems

This theme develops operator learning techniques to model the complex spatiotemporal dynamics of biomedical systems. The key idea is to represent processes as operators, such as neural network operators or graph operators, and to learn them by optimizing solutions to the corresponding operator equations on training data. Integral, integro-differential, and other non-local operators are used to capture long-range interactions beyond what local differential operators can express. Differentiable operator solver layers are designed to enable end-to-end training, with iterative methods and transform techniques used to find operator solutions. Example projects include modeling protein dynamics with operators learned from molecular simulations and neural integral equations for learning brain activity dynamics from fMRI recordings. Key applications are gaining predictive abilities, generalizing to new conditions, and distilling dynamical knowledge from spatiotemporal biomedical data including neural recordings and medical imaging. Ongoing work is expanding the operator classes, combining learned operators with traditional physicochemical models, and developing theoretical understanding grounded in dynamical systems and Koopman operator theory.
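
A minimal sketch of a differentiable operator-solver layer is shown below, assuming a simple linear integral equation solved by Picard (fixed-point) iteration on a fixed time grid; the published solvers are considerably more general, and all hyperparameters here are placeholders:

    import math
    import torch
    import torch.nn as nn

    class NeuralIntegralEquation(nn.Module):
        # Solve y(t) = x(t) + \int_0^1 K_theta(t, s) y(s) ds on a fixed grid by
        # fixed-point iteration; every step is differentiable, so the kernel
        # K_theta can be trained end to end.
        def __init__(self, hidden=32, n_iters=5):
            super().__init__()
            self.kernel = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(), nn.Linear(hidden, 1))
            self.n_iters = n_iters

        def forward(self, x, t):
            # x: (batch, T) signal sampled on grid t: (T,)
            dt = 1.0 / t.shape[0]
            ts, ss = torch.meshgrid(t, t, indexing="ij")
            K = self.kernel(torch.stack([ts, ss], dim=-1)).squeeze(-1)  # learned kernel K_theta(t, s)
            y = x
            for _ in range(self.n_iters):                               # y <- x + (K y) dt
                y = x + (K @ y.unsqueeze(-1)).squeeze(-1) * dt
            return y

    t = torch.linspace(0.0, 1.0, 50)
    x = torch.sin(2 * math.pi * t).unsqueeze(0)  # toy input signal, shape (1, 50)
    y = NeuralIntegralEquation()(x, t)           # solution estimate, shape (1, 50)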

Causality and Counterfactual Inference in Biomedical Data Analysis

This theme develops computational methods to infer causal relationships and perform counterfactual inference from observational biomedical data such as genomics, electronic health records, and medical imaging. Approaches combine causal graph modeling, inverse propensity weighting, and optimal transport to account for confounders when estimating interventional effects. Example projects include CINEMA-OT for causal analysis of single-cell experiments. Key applications are distinguishing correlation from causation in biomedical datasets and enabling in silico trials that predict the outcomes of interventions. Ongoing work is exploring causal discovery algorithms, integration of prior knowledge, and applications in drug repurposing.
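
The counterfactual-matching step can be illustrated with the toy sketch below, which matches treated cells to control cells by a minimum-cost assignment (a special case of optimal transport) and reads off per-cell effects; this is an illustration only, not the CINEMA-OT method itself:

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def counterfactual_effects(control, treated):
        # Match each treated cell to a counterfactual control cell via a
        # one-to-one minimum-cost matching, then take expression differences.
        cost = cdist(treated, control)            # pairwise distances between cells
        rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
        return treated[rows] - control[cols]      # per-cell effect estimates

    rng = np.random.default_rng(0)
    control = rng.normal(size=(100, 20))                 # toy cell embeddings
    treated = rng.normal(loc=0.5, size=(100, 20))
    effects = counterfactual_effects(control, treated)   # (100, 20) effect matrix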

Interpretable Deep Learning for Scientific Discovery

This theme creates deep learning models that produce interpretable representations and enable interactive interrogation. Approaches include imposing structural constraints on models, designing intrinsic interpretability metrics, and enabling humans to query model reasoning through natural language interfaces. Example projects include AMPNet's analysis of feature attention patterns and BrainLM's perturbation analysis to uncover causal influences between brain regions. Key applications are elucidating biological mechanisms from biomedical data and extracting scientific insights from model reasoning. Ongoing work is exploring theoretical connections to algorithmic information theory and integrating scientific knowledge into models.
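
A perturbation analysis of the kind mentioned above can be sketched generically: occlude one input region at a time and measure how much a trained model's output shifts. The snippet below is a generic occlusion probe, not BrainLM's specific procedure, and the model and region layout are placeholders:

    import torch

    def perturbation_importance(model, x, region_slices):
        # Zero out each region's features in turn and record the average change
        # in the model's output; larger shifts suggest stronger influence.
        with torch.no_grad():
            baseline = model(x)
            scores = []
            for sl in region_slices:
                perturbed = x.clone()
                perturbed[:, sl] = 0.0
                scores.append((model(perturbed) - baseline).abs().mean().item())
        return scores  # one influence score per region

    model = torch.nn.Linear(30, 1)                          # placeholder model
    x = torch.randn(16, 30)                                 # 16 samples, 3 regions x 10 features
    regions = [slice(0, 10), slice(10, 20), slice(20, 30)]
    print(perturbation_importance(model, x, regions))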

Multi-Modal and Multi-Task Learning for Biomedicine

This theme develops multi-modal, multi-task models to enable transfer learning, incorporate diverse signals, and model interconnected biomedical processes. Approaches include joint training on data from modalities such as images, text, and genomics, and designing models with shared parameters across related tasks. Example projects include BrainLM's multi-task pretraining encompassing prediction, generation, and simulation tasks across neuroimaging, genetics, and cognitive assessments. Key applications are improving generalization and sample efficiency in biomedical data analysis through transfer learning. Ongoing work is studying theoretically optimal model architectures and expanding the biomedical domains and tasks integrated.
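
A minimal sketch of the shared-encoder, multi-head pattern such models rely on is given below; the task names and shapes are hypothetical placeholders, not the BrainLM task set:

    import torch
    import torch.nn as nn

    class MultiTaskModel(nn.Module):
        # Shared encoder with task-specific heads: trunk parameters are reused
        # across tasks, so signal from one task can improve the others.
        def __init__(self, in_dim=128, hidden=64):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
            self.heads = nn.ModuleDict({
                "age_prediction": nn.Linear(hidden, 1),  # hypothetical regression task
                "diagnosis": nn.Linear(hidden, 2),       # hypothetical classification task
            })

        def forward(self, x, task):
            return self.heads[task](self.encoder(x))

    model = MultiTaskModel()
    x = torch.randn(8, 128)
    age = model(x, "age_prediction")   # (8, 1)
    logits = model(x, "diagnosis")     # (8, 2)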