Developer Guide

BodyModel

Building a world model for AI through epigenome engineering — from chromatin accessibility to programmable transcriptional memory.

Why Epigenome Engineering?

For a software-heavy entry into biology, epigenome engineering is the right wedge. The field already has:

  • Programmable transcriptional memory with CRISPRoff
  • Durable in vivo silencing with hit-and-run editors
  • Durable multiplexed epigenetic programming in primary human T cells
  • Transient ribonucleoprotein delivery of epigenome editors

At the same time, a 2024 systematic study showed that the same chromatin intervention can behave very differently depending on locus and context. That combination — real editability plus brutal context dependence — is exactly where strong developers have leverage.

The Mental Model

Start with same-cell RNA + ATAC, not methylation-first. 10x Epi Multiome directly measures gene expression and chromatin accessibility from the same cell. WGBS is still essential when your question is explicitly about CpG methylation, but RNA + ATAC gives the tightest first design / debug loop.

Latent state: Chromatin configuration + cell identity + trajectory position
Observations: scRNA-seq, scATAC-seq, histone/TF assays (CUT&Tag), methylation (WGBS)
Actions: Targeted recruitment of repressors, activators, methylation writers, or demethylation effectors
Objective: A persistent, cell-type-specific expression shift with acceptable heterogeneity and delivery cost

CUT&Tag is especially useful as a complementary assay — it profiles chromatin features at high resolution from low-input samples and single cells. ENCODE's ATAC / scATAC / WGBS standards ground your data contracts and QC expectations.

Python-First Tool Stack

Expect a human-in-the-loop workflow: ENCODE explicitly notes that a fully automated end-to-end scATAC pipeline is not yet feasible. Clustering, cell-state annotation, trajectories, and multimodal integration still require subjective decisions.

  • AnnData: core object model
  • Scanpy: standard single-cell workflow
  • Muon: multimodal containers & methods
  • SnapATAC2: scATAC engine
  • scvi-tools / MultiVI: probabilistic models, paired/unpaired multiome
  • SCENIC+: enhancer-driven GRN inference
  • CellOracle: in silico TF perturbation
  • pertpy: perturbation-analysis layer

First 30 Days

Days 1–5

Data model & ingest

Stand up the data model and ingest path. Work through the scverse getting-started tutorials and the EMBL-EBI Python single-cell materials, then load a public PBMC multiome dataset: the official 3k PBMC tutorial dataset for fast iteration, or the 10k PBMC dataset for a realistic pipeline. If you later get raw 10x output, the primary pipeline is Cell Ranger ARC. Deliverable: one clean h5ad / h5mu object plus a dataset card recording provenance, genome build, feature definitions, and sample metadata.
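The dataset card can be a small, serializable record written next to the h5ad / h5mu file. The sketch below uses only the Python standard library. The `DatasetCard` class, its field names, and the example values are illustrative assumptions, not a scverse or 10x convention.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetCard:
    """Hypothetical provenance record stored alongside an h5ad / h5mu file."""

    name: str
    source: str                          # where the data came from
    genome_build: str                    # e.g. "GRCh38"
    modalities: list[str]                # e.g. ["rna", "atac"]
    feature_definitions: dict[str, str]  # how features are defined per modality
    sample_metadata: dict[str, str] = field(default_factory=dict)

    def missing_fields(self) -> list[str]:
        """Return the names of required fields that were left empty."""
        required = ("name", "source", "genome_build", "modalities", "feature_definitions")
        return [key for key in required if not getattr(self, key)]

    def to_json(self) -> str:
        """Serialize the card so it can be committed next to the data object."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Example values are placeholders; fill them in from the actual download page.
card = DatasetCard(
    name="pbmc_multiome_3k",
    source="public PBMC multiome dataset (placeholder description)",
    genome_build="GRCh38",
    modalities=["rna", "atac"],
    feature_definitions={"rna": "gene-level counts", "atac": "peak-level counts"},
    sample_metadata={"tissue": "PBMC"},
)
print(card.to_json())
```

Calling `card.missing_fields()` before writing the file is a cheap way to refuse an ingest that lacks provenance.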

Days 6–10

Serious QC (not notebook theater)

Implement RNA + ATAC QC — not just RNA QC. Track TSS enrichment, FRiP, fragment-size distribution, barcode-level quality, and peak-level quality. 10x warns that ATAC peaks are dynamically inferred rather than fixed like genes, so peak QC matters. SnapATAC2 exposes TSSe and FRiP directly; ENCODE standardizes both ATAC and scATAC processing / QC. Deliverable: a reproducible QC report with thresholds, excluded cells, excluded peaks, and rationale.
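To make one of these metrics concrete, here is a minimal standard-library sketch of FRiP (fraction of fragments overlapping peaks) per barcode. In practice SnapATAC2 computes this for you; the function names, the toy intervals, and the `MIN_FRIP` threshold below are assumptions for illustration, not recommended values. The sketch assumes half-open coordinates and non-overlapping peaks.

```python
from bisect import bisect_right
from collections import defaultdict


def build_peak_index(peaks):
    """Group peaks by chromosome, sorted by start, for binary search.

    Assumes peaks on the same chromosome do not overlap each other.
    """
    index = defaultdict(list)
    for chrom, start, end in sorted(peaks):
        index[chrom].append((start, end))
    return index


def overlaps_peak(index, chrom, start, end):
    """True if the half-open fragment [start, end) overlaps any peak on chrom."""
    intervals = index.get(chrom, [])
    # Locate the last peak whose start lies before the fragment end.
    i = bisect_right(intervals, (end - 1, float("inf"))) - 1
    return i >= 0 and intervals[i][1] > start


def frip_per_barcode(fragments, peaks):
    """Return {barcode: fraction of that barcode's fragments overlapping a peak}."""
    index = build_peak_index(peaks)
    total = defaultdict(int)
    in_peaks = defaultdict(int)
    for chrom, start, end, barcode in fragments:
        total[barcode] += 1
        if overlaps_peak(index, chrom, start, end):
            in_peaks[barcode] += 1
    return {bc: in_peaks[bc] / n for bc, n in total.items()}


# Toy data: (chrom, start, end) peaks and (chrom, start, end, barcode) fragments.
peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
fragments = [
    ("chr1", 150, 250, "AAAC"),  # overlaps the first peak
    ("chr1", 300, 400, "AAAC"),  # falls between peaks
    ("chr1", 590, 700, "AAAC"),  # overlaps the second peak
    ("chr1", 200, 300, "TTTG"),  # starts exactly where a peak ends: no overlap
    ("chr2", 100, 200, "TTTG"),  # no peaks on this chromosome
]

MIN_FRIP = 0.3  # placeholder threshold; record the real one in qc.yaml
frip = frip_per_barcode(fragments, peaks)
kept = {bc for bc, value in frip.items() if value >= MIN_FRIP}
```

Writing the threshold, the excluded barcodes, and the reason for each exclusion into the QC report is what turns this from a notebook cell into a reproducible step.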

Days 11–20

From clusters to control points

Run SCENIC+ to infer enhancer-driven TF → region → gene programs from RNA + ATAC. Then run CellOracle to simulate TF perturbations in silico. The point is to stop thinking in terms of "marker genes" and start thinking in terms of controllable regulatory circuits. If you want a learned latent space across paired and unpaired multiomic data, add MultiVI through scvi-tools. Deliverable: a ranked graph of candidate control points per cell type or trajectory branch.
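One simple way to turn TF → region → gene links into a ranked list is to aggregate edge scores per TF. The sketch below assumes a hypothetical edge format of `(tf, region, gene, score)` tuples; it is not the SCENIC+ or CellOracle output schema, and the names and scores are toy placeholders.

```python
from collections import defaultdict


def rank_control_points(edges):
    """Rank TFs by summed edge score, then by number of distinct target genes.

    Returns a list of (tf, total_score, n_target_genes) tuples, best first.
    """
    total = defaultdict(float)
    genes = defaultdict(set)
    for tf, _region, gene, score in edges:
        total[tf] += score
        genes[tf].add(gene)
    return sorted(
        ((tf, total[tf], len(genes[tf])) for tf in total),
        key=lambda row: (-row[1], -row[2], row[0]),
    )


# Toy edges: placeholder TF, region, and gene names with made-up scores.
edges = [
    ("TF_A", "chr1:100-600", "gene_1", 0.9),
    ("TF_A", "chr1:900-1400", "gene_2", 0.5),
    ("TF_B", "chr2:100-600", "gene_1", 0.75),
    ("TF_C", "chr3:100-600", "gene_3", 0.25),
]

ranking = rank_control_points(edges)
for tf, score, n_genes in ranking:
    print(f"{tf}\tscore={score:.2f}\ttargets={n_genes}")
```

Running this once per cell type or trajectory branch, on edges filtered to that context, gives the per-context ranking the deliverable asks for.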

Days 21–30

Intervention-ranking engine

Anchor design logic to real editor families: CRISPRoff-style memory writing, hit-and-run silencing, durable multiplexed programming in primary T cells, or transient RNP delivery. Rank candidate interventions by predicted direction of effect, expected persistence, cell-state specificity, delivery complexity, and uncertainty. The 2024 context-dependence paper shapes the loss function: optimize for robust effect under chromatin-context variation, not average effect. Deliverable: a table with one row per candidate and columns for locus, effector family, target cell state, predicted effect, persistence score, uncertainty, and evidence.
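The difference between optimizing for average effect and for robust effect shows up even in a toy ranking. The sketch below uses the worst case across contexts as one simple stand-in for robustness; that choice, the `Candidate` class, and the predicted-effect numbers are assumptions for illustration, not results from any of the cited papers.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate intervention with one predicted effect per chromatin context."""

    locus: str
    effector_family: str
    effects_by_context: tuple[float, ...]  # arbitrary units, larger is better

    @property
    def average_effect(self) -> float:
        return mean(self.effects_by_context)

    @property
    def robust_effect(self) -> float:
        # Worst case across contexts: a deliberately conservative score.
        return min(self.effects_by_context)


# Toy values: locus_1 looks best on average but collapses in one context.
candidates = [
    Candidate("locus_1", "CRISPRoff-style", (0.9, 0.8, 0.1)),
    Candidate("locus_2", "hit-and-run", (0.6, 0.5, 0.5)),
]

by_average = max(candidates, key=lambda c: c.average_effect)
by_robust = max(candidates, key=lambda c: c.robust_effect)
print("best by average:", by_average.locus)
print("best by robust score:", by_robust.locus)
```

A lower quantile or a mean-minus-spread score would be less pessimistic alternatives to the minimum; whichever you pick, store it in `ranking.yaml` so the ranking is reproducible.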

Starter Repo Layout

epi-programming/
  data/
    raw/
    interim/
    processed/
  metadata/
    samples.csv
    perturbations.csv
    references.yaml
  configs/
    dataset.yaml
    qc.yaml
    models.yaml
    ranking.yaml
  notebooks/
    00_pbmc_multiome_explore.ipynb
    01_qc_rna_atac.ipynb
    02_annotation.ipynb
    03_grn_scenicplus.ipynb
    04_in_silico_perturbation_celloracle.ipynb
    05_latent_model_multivi.ipynb
    06_candidate_ranking.ipynb
  src/
    io.py
    qc.py
    rna.py
    atac.py
    grn.py
    perturb.py
    latent.py
    ranking.py
    eval.py
  artifacts/
  environment.yml
  README.md

AnnData / MuData Schema

obs: donor, batch, modality, cell_type, trajectory, perturbation, timepoint
var: gene / peak annotation, genomic coords, motif hits
layers: raw counts, normalized counts, imputed activity
obsm: X_lsi, X_umap, X_multivi
uns: QC thresholds, genome build, motif DB versions, model params, provenance
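A schema like this is most useful when it is checked in code. The sketch below validates required keys per slot using a plain stand-in object rather than a real AnnData, so it runs with the standard library alone. The specific key names in `REQUIRED` (for example `counts` and `qc_thresholds`) are hypothetical, and the check assumes that iterating a slot yields its key or column names, which holds for dicts and should be verified against your AnnData version.

```python
from types import SimpleNamespace

# Hypothetical minimal contract: slot name -> keys that must be present.
REQUIRED = {
    "obs": {"donor", "batch", "cell_type"},
    "layers": {"counts"},
    "obsm": {"X_lsi"},
    "uns": {"genome_build", "qc_thresholds"},
}


def schema_violations(adata, required=REQUIRED):
    """Return 'slot.key' strings for every required key missing from the object."""
    missing = []
    for slot, keys in required.items():
        # Iterating the slot is assumed to yield its key / column names.
        present = set(getattr(adata, slot, ()))
        missing.extend(f"{slot}.{key}" for key in sorted(keys - present))
    return missing


# Stand-in object with dict slots; replace with a real AnnData in the pipeline.
toy = SimpleNamespace(
    obs={"donor": [], "batch": [], "cell_type": []},
    layers={"counts": None},
    obsm={},
    uns={"genome_build": "GRCh38"},
)

violations = schema_violations(toy)
print(violations)
```

Running the check at the end of ingest and again after QC catches objects that silently lost an embedding or a provenance field along the way.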

Best First Project

Build a PBMC / T-cell state steering engine. Start with public PBMC multiome data, learn a latent state representation, infer enhancer-driven regulatory programs, run in silico perturbations, and emit a ranked intervention report. After PBMC, graduate to ENCODE for reference epigenomes, HCA for open multiomic references, and GEO for broader public functional-genomics datasets.

What to Avoid

  • Starting with "epigenetic rejuvenation"
  • A giant foundation model
  • A one-click automated platform

Start with one cell system, one intervention objective, and one validation loop. That is the shortest path from "developer interested in epigenome" to "builder of useful epigenome software."

Key References (Read in Order)

  1. Nuñez et al., 2021. CRISPRoff: programmable transcriptional memory
  2. Policarpi et al., 2024. Systematic context dependence of chromatin interventions
  3. Cappelluti et al., 2024. Durable in vivo hit-and-run silencing
  4. Goudy et al., 2025. Durable multiplexed epigenetic programming in primary human T cells
  5. Xu et al., 2025. Transient RNP delivery of epigenome editors
  6. Kaya-Okur et al., 2019. CUT&Tag: practical chromatin-profiling assay
