Foundation Models for Single-Cell Genomics

This theme explores applying large foundation models pretrained on massive single-cell datasets to genomics analysis tasks. Example projects include Cell2Sentence, which represents single-cell data as natural language so that sequence models can be trained on millions of cells. By pretraining on a huge corpus of cells, Cell2Sentence gains the ability to generate cells of specified types, interpret new data, translate across species, and perform other tasks via prompting. Key applications are accelerating and improving single-cell analysis pipelines for tasks such as cell type classification, trajectory inference, imputation, and generation of synthetic datasets. Ongoing work is creating even larger single-cell foundation models, combining them with specialized techniques such as graph neural networks, and exploring clinical applications such as modeling tumor heterogeneity.
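
As a rough illustration of the cell-sentence idea, the following minimal sketch (a toy example, not the published Cell2Sentence implementation) rank-orders a cell's genes by expression and emits their names as a space-separated sentence that a language model can consume:

    import numpy as np

    def cell_to_sentence(expression, gene_names, top_k=100):
        # Rank genes by decreasing expression, drop unexpressed genes,
        # and emit the remaining gene names as a "cell sentence".
        order = np.argsort(expression)[::-1]
        order = [i for i in order if expression[i] > 0][:top_k]
        return " ".join(gene_names[i] for i in order)

    # Toy example: three genes measured in one cell.
    genes = ["CD3E", "MS4A1", "NKG7"]
    counts = np.array([5.0, 0.0, 12.0])
    print(cell_to_sentence(counts, genes))  # -> "NKG7 CD3E"

Once cells are serialized this way, standard tokenizers and language-model training pipelines can be applied without modification.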

Large Language Models for Biomedical Data Analysis

This theme involves developing methods to represent and integrate diverse modalities of biomedical data, including genomics, imaging, and clinical data, into large language models. By pretraining on massive corpora of biomedical data formatted as text, we enable these models to gain a rich understanding of biomedical concepts and relationships. Example projects include Cell2Sentence for representing single-cell transcriptomic data as sentences, and BrainLM for pretraining on a large corpus of fMRI brain recordings. Downstream tasks enabled by this approach include generative modeling of biomedical data, transfer learning for prediction tasks, and extraction of biological insights through probing. Ongoing work is expanding the modalities integrated, creating multi-task models, and exploring clinical applications.
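
The pretraining recipe behind models such as BrainLM can be sketched as masked reconstruction over parcellated fMRI time series. The snippet below is a simplified illustration with assumed shapes and hyperparameters, not the published architecture:

    import torch
    import torch.nn as nn

    class MaskedTimeseriesAutoencoder(nn.Module):
        # Mask a fraction of time patches, encode the rest with a small
        # Transformer, and reconstruct the hidden patches (toy sketch).
        def __init__(self, patch_len=20, dim=64, mask_ratio=0.5):
            super().__init__()
            self.embed = nn.Linear(patch_len, dim)
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.decode = nn.Linear(dim, patch_len)
            self.mask_ratio = mask_ratio

        def forward(self, patches):
            # patches: (batch, n_patches, patch_len) chunks of a parcel's signal
            tokens = self.embed(patches)
            mask = torch.rand(patches.shape[:2], device=patches.device) < self.mask_ratio
            tokens = tokens.masked_fill(mask.unsqueeze(-1), 0.0)  # hide masked patches
            recon = self.decode(self.encoder(tokens))
            return ((recon - patches) ** 2)[mask].mean()          # loss on masked patches only

    model = MaskedTimeseriesAutoencoder()
    signal = torch.randn(8, 30, 20)   # 8 recordings, 30 patches of length 20
    loss = model(signal)
    loss.backward()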

Graph Neural Networks for Computational Biology  

This theme focuses on developing graph neural network architectures and self-supervised objectives tailored to computational biology problems involving graphs. Example graphs include protein structures, gene regulatory networks, and spatially resolved transcriptomics data. Projects under this theme have designed new layers that capture specific properties of biomedical graphs, such as AMPNet's cross-attention mechanism for modeling feature interactions. Pretraining strategies leverage domain structure and unlabeled data to learn informative biological representations. Downstream tasks improved by these advances include node classification, link prediction, data generation, and interpretation of learned representations. Ongoing work is expanding the graph domains covered, exploring multi-scale graph modeling, and developing theoretical understanding.
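
To make the feature-interaction idea concrete, the toy layer below lets each feature of a receiving node attend over the features of a sending node; it is loosely inspired by AMPNet's cross-attention but is not the published implementation, and all shapes and names are assumptions:

    import torch
    import torch.nn as nn

    class FeatureCrossAttention(nn.Module):
        # Feature-level cross-attention along an edge: each feature embedding of
        # the receiver queries the sender's feature embeddings.
        def __init__(self, dim=16):
            super().__init__()
            self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=2, batch_first=True)

        def forward(self, receiver_feats, sender_feats):
            # *_feats: (n_edges, n_features, dim), one embedding per feature
            message, weights = self.attn(receiver_feats, sender_feats, sender_feats)
            return message, weights  # weights expose feature-feature interactions

    layer = FeatureCrossAttention(dim=16)
    recv = torch.randn(4, 10, 16)    # 4 edges, 10 features per node
    send = torch.randn(4, 10, 16)
    msg, w = layer(recv, send)       # w: (4, 10, 10) interpretable attention map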

Physics-Informed Deep Learning for Spatiotemporal Data

This theme involves leveraging mathematical principles from physics and dynamical systems to create deep learning models that learn robust representations and generate accurate predictions for spatiotemporal data. Example domains include medical imaging data, neural activity recordings, climate modeling, and simulation of physical systems. Projects under this theme, such as ANIE, NIDE, and CST, inject inductive biases based on theories of dynamics and differential equations into model architectures and training. This makes it possible to uncover governing equations and long-range interactions from data. Ongoing work is expanding the classes of mathematical models integrated, creating hybrid model combinations, and exploring theoretical connections to Koopman operator theory.
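
Schematically, one family of formulations these models build on is the neural integro-differential equation, written here in generic form rather than as any project's exact published equation:

    \frac{dy}{dt} = f_\theta\big(y(t), t\big) + \int_0^t K_\theta\big(t, s, y(s)\big)\, ds

where f_\theta and K_\theta are neural networks: the first term models local, instantaneous dynamics, while the integral term captures long-range, non-local interactions in time.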

Operator Learning for Modeling Complex Biomedical Systems

This theme develops operator learning techniques to model the complex spatiotemporal dynamics of biomedical systems. The key idea is to represent processes as operators, such as neural network operators or graph operators, and to learn them by optimizing solutions to the corresponding operator equations on training data. Integral, integro-differential, and other non-local operators are used to capture long-range interactions beyond what local differential operators can express. Differentiable operator solver layers are designed to enable end-to-end training, with iterative methods and transform techniques used to find operator solutions. Example projects include modeling protein dynamics with operators learned from molecular simulations and neural integral equations for learning brain activity dynamics from fMRI recordings. Key applications are gaining predictive abilities, generalizing to new conditions, and distilling dynamical knowledge from spatiotemporal biomedical data including neural recordings and medical imaging. Ongoing work is expanding the operator classes, combining learned operators with traditional physicochemical models, and developing theoretical understanding grounded in dynamical systems and Koopman operator theory.
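
A minimal sketch of a differentiable operator-solver layer is shown below, assuming a simple linear integral equation solved by Picard (fixed-point) iteration on a fixed time grid; the published solvers are considerably more general, and all hyperparameters here are placeholders:

    import math
    import torch
    import torch.nn as nn

    class NeuralIntegralEquation(nn.Module):
        # Solve y(t) = x(t) + \int_0^1 K_theta(t, s) y(s) ds on a fixed grid by
        # fixed-point iteration; every step is differentiable, so the kernel
        # K_theta can be trained end to end.
        def __init__(self, hidden=32, n_iters=5):
            super().__init__()
            self.kernel = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(), nn.Linear(hidden, 1))
            self.n_iters = n_iters

        def forward(self, x, t):
            # x: (batch, T) signal sampled on grid t: (T,)
            dt = 1.0 / t.shape[0]
            ts, ss = torch.meshgrid(t, t, indexing="ij")
            K = self.kernel(torch.stack([ts, ss], dim=-1)).squeeze(-1)  # learned kernel K_theta(t, s)
            y = x
            for _ in range(self.n_iters):                               # y <- x + (K y) dt
                y = x + (K @ y.unsqueeze(-1)).squeeze(-1) * dt
            return y

    t = torch.linspace(0.0, 1.0, 50)
    x = torch.sin(2 * math.pi * t).unsqueeze(0)  # toy input signal, shape (1, 50)
    y = NeuralIntegralEquation()(x, t)           # solution estimate, shape (1, 50)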

Causality and Counterfactual Inference in Biomedical Data Analysis

This theme develops computational methods to infer causal relationships and perform counterfactual inference from observational biomedical data such as genomics, electronic health records, and medical imaging. Approaches combine causal graph modeling, inverse propensity weighting, and optimal transport to account for confounders when estimating interventional effects. Example projects include CINEMA-OT for causal analysis of single-cell experiments. Key applications are distinguishing correlation from causation in biomedical datasets and enabling in silico trials that predict the outcomes of interventions. Ongoing work is exploring causal discovery algorithms, integration of prior knowledge, and applications in drug repurposing.
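
The counterfactual-matching step can be illustrated with the toy sketch below, which matches treated cells to control cells by a minimum-cost assignment (a special case of optimal transport) and reads off per-cell effects; this is an illustration only, not the CINEMA-OT method itself:

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def counterfactual_effects(control, treated):
        # Match each treated cell to a counterfactual control cell via a
        # one-to-one minimum-cost matching, then take expression differences.
        cost = cdist(treated, control)            # pairwise distances between cells
        rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
        return treated[rows] - control[cols]      # per-cell effect estimates

    rng = np.random.default_rng(0)
    control = rng.normal(size=(100, 20))                 # toy cell embeddings
    treated = rng.normal(loc=0.5, size=(100, 20))
    effects = counterfactual_effects(control, treated)   # (100, 20) effect matrix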

Interpretable Deep Learning for Scientific Discovery

This theme creates deep learning models that produce interpretable representations and enable interactive interrogation. Approaches include imposing structural constraints on models, designing intrinsic interpretability metrics, and enabling humans to query model reasoning through natural language interfaces. Example projects include AMPNet's analysis of feature attention patterns and BrainLM's perturbation analysis to uncover causal influences between brain regions. Key applications are elucidating biological mechanisms from biomedical data and extracting scientific insights from model reasoning. Ongoing work is exploring theoretical connections to algorithmic information theory and integrating scientific knowledge into models.
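
A perturbation analysis of the kind mentioned above can be sketched generically: occlude one input region at a time and measure how much a trained model's output shifts. The snippet below is a generic occlusion probe, not BrainLM's specific procedure, and the model and region layout are placeholders:

    import torch

    def perturbation_importance(model, x, region_slices):
        # Zero out each region's features in turn and record the average change
        # in the model's output; larger shifts suggest stronger influence.
        with torch.no_grad():
            baseline = model(x)
            scores = []
            for sl in region_slices:
                perturbed = x.clone()
                perturbed[:, sl] = 0.0
                scores.append((model(perturbed) - baseline).abs().mean().item())
        return scores  # one influence score per region

    model = torch.nn.Linear(30, 1)                          # placeholder model
    x = torch.randn(16, 30)                                 # 16 samples, 3 regions x 10 features
    regions = [slice(0, 10), slice(10, 20), slice(20, 30)]
    print(perturbation_importance(model, x, regions))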

Multi-Modal and Multi-Task Learning for Biomedicine

This theme develops multi-modal, multi-task models to enable transfer learning, incorporate diverse signals, and model interconnected biomedical processes. Approaches include joint training on data from modalities such as images, text, and genomics, and designing models with shared parameters across related tasks. Example projects include BrainLM's multi-task pretraining encompassing prediction, generation, and simulation tasks across neuroimaging, genetics, and cognitive assessments. Key applications are improving generalization and sample efficiency in biomedical data analysis through transfer learning. Ongoing work is studying theoretically optimal model architectures and expanding the biomedical domains and tasks integrated.
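
A minimal sketch of the shared-encoder, multi-head pattern such models rely on is given below; the task names and shapes are hypothetical placeholders, not the BrainLM task set:

    import torch
    import torch.nn as nn

    class MultiTaskModel(nn.Module):
        # Shared encoder with task-specific heads: trunk parameters are reused
        # across tasks, so signal from one task can improve the others.
        def __init__(self, in_dim=128, hidden=64):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
            self.heads = nn.ModuleDict({
                "age_prediction": nn.Linear(hidden, 1),  # hypothetical regression task
                "diagnosis": nn.Linear(hidden, 2),       # hypothetical classification task
            })

        def forward(self, x, task):
            return self.heads[task](self.encoder(x))

    model = MultiTaskModel()
    x = torch.randn(8, 128)
    age = model(x, "age_prediction")   # (8, 1)
    logits = model(x, "diagnosis")     # (8, 2)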