Developer Guide
BodyModel
Building a world model for AI through epigenome engineering — from chromatin accessibility to programmable transcriptional memory.
Why Epigenome Engineering?
For a software-heavy entry into biology, epigenome engineering is the right wedge. The field already has:
- Programmable transcriptional memory with CRISPRoff
- Durable in vivo silencing with hit-and-run editors
- Durable multiplexed epigenetic programming in primary human T cells
- Transient ribonucleoprotein delivery of epigenome editors
At the same time, a 2024 systematic study (Policarpi et al., 2024) showed that the same chromatin intervention can behave very differently depending on locus and context. That combination of real editability and brutal context dependence is exactly where strong developers have leverage.
The Mental Model
Start with same-cell RNA + ATAC, not methylation-first. 10x Epi Multiome directly measures gene expression and chromatin accessibility from the same cell. WGBS is still essential when your question is explicitly about CpG methylation, but RNA + ATAC gives the tightest first design / debug loop.
CUT&Tag is especially useful as a complementary assay — it profiles chromatin features at high resolution from low-input samples and single cells. ENCODE's ATAC / scATAC / WGBS standards ground your data contracts and QC expectations.
Python-First Tool Stack
Expect a human-in-the-loop workflow: ENCODE explicitly notes that a fully automated end-to-end scATAC pipeline is not yet feasible. Clustering, cell-state annotation, trajectories, and multimodal integration still require subjective decisions.
- AnnData: core object model
- Scanpy: standard single-cell workflow
- Muon: multimodal containers and methods
- SnapATAC2: scATAC engine
- scvi-tools / MultiVI: probabilistic models for paired and unpaired multiome data
- SCENIC+: enhancer-driven GRN inference
- CellOracle: in silico TF perturbation
- pertpy: perturbation-analysis layer
First 30 Days
Days 1–5
Data model & ingest
Stand up the data model and ingest path. Work through the scverse getting-started tutorials and the EMBL-EBI Python single-cell materials, then load a public PBMC multiome dataset: the official 3k PBMC tutorial dataset for fast iteration, or the 10k PBMC dataset for a realistic pipeline. If you later get raw 10x output, the primary pipeline is Cell Ranger ARC. Deliverable: one clean h5ad / h5mu object plus a dataset card recording provenance, genome build, feature definitions, and sample metadata.
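A dataset card can be as small as a typed record serialized next to the h5ad / h5mu file. The following is a minimal sketch using only the standard library; the class name `DatasetCard`, its field names, and the example values (including the `GRCh38` genome build) are illustrative assumptions, not a format prescribed by scverse or 10x.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetCard:
    """Hypothetical dataset card; field names are illustrative assumptions."""
    name: str
    source: str                    # provenance: where the data came from
    genome_build: str              # reference genome the features map to
    feature_definitions: dict      # how each modality's features are defined
    sample_metadata: list = field(default_factory=list)

    def missing_fields(self):
        """Return the names of fields that are still empty."""
        empty = ("", None, {}, [])
        return [key for key, value in asdict(self).items() if value in empty]


# Example values are placeholders, not a description of a specific release.
card = DatasetCard(
    name="pbmc_multiome_demo",
    source="public PBMC multiome tutorial dataset",
    genome_build="GRCh38",
    feature_definitions={"rna": "genes", "atac": "peaks called on this dataset"},
    sample_metadata=[{"sample_id": "pbmc_1", "tissue": "PBMC"}],
)

# Serialize with sorted keys so the card diffs cleanly under version control.
card_json = json.dumps(asdict(card), indent=2, sort_keys=True)
```

Calling `card.missing_fields()` before writing the card is a cheap way to stop an incomplete record from shipping alongside the data object.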
Days 6–10
Serious QC (not notebook theater)
Implement RNA + ATAC QC — not just RNA QC. Track TSS enrichment, FRiP, fragment-size distribution, barcode-level quality, and peak-level quality. 10x warns that ATAC peaks are dynamically inferred rather than fixed like genes, so peak QC matters. SnapATAC2 exposes TSSe and FRiP directly; ENCODE standardizes both ATAC and scATAC processing / QC. Deliverable: a reproducible QC report with thresholds, excluded cells, excluded peaks, and rationale.
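The ATAC side of such a report can be sketched at the barcode level as follows. FRiP is computed here as fragments in peaks divided by total fragments; TSS enrichment is assumed to be precomputed upstream. The threshold values, class name, and reason labels are placeholders for illustration, not recommended cutoffs or any library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BarcodeQC:
    """Per-barcode ATAC metrics; tss_enrichment is assumed precomputed."""
    barcode: str
    n_fragments: int
    n_fragments_in_peaks: int
    tss_enrichment: float

    @property
    def frip(self) -> float:
        # Fraction of fragments in peaks; guard against empty barcodes.
        if self.n_fragments == 0:
            return 0.0
        return self.n_fragments_in_peaks / self.n_fragments


# Placeholder thresholds for illustration only; tune them per dataset.
THRESHOLDS = {"min_fragments": 1000, "min_frip": 0.2, "min_tsse": 5.0}


def exclusion_reasons(cell, thresholds):
    """List every threshold a barcode fails, so the rationale is recorded."""
    reasons = []
    if cell.n_fragments < thresholds["min_fragments"]:
        reasons.append("low_fragments")
    if cell.frip < thresholds["min_frip"]:
        reasons.append("low_frip")
    if cell.tss_enrichment < thresholds["min_tsse"]:
        reasons.append("low_tsse")
    return reasons


def build_qc_report(cells, thresholds):
    """Split barcodes into kept and excluded, keeping the thresholds used."""
    kept, excluded = [], {}
    for cell in cells:
        reasons = exclusion_reasons(cell, thresholds)
        if reasons:
            excluded[cell.barcode] = reasons
        else:
            kept.append(cell.barcode)
    return {"thresholds": dict(thresholds), "kept": kept, "excluded": excluded}


# Toy barcodes with made-up counts.
cells = [
    BarcodeQC("AAAC-1", 8000, 3200, 9.5),
    BarcodeQC("AAAG-1", 400, 200, 7.0),
    BarcodeQC("AACT-1", 5000, 500, 2.0),
]
report = build_qc_report(cells, THRESHOLDS)
```

Recording the failed thresholds per barcode, rather than a single pass/fail flag, is what lets the report carry the rationale the deliverable asks for.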
Days 11–20
From clusters to control points
Run SCENIC+ to infer enhancer-driven TF → region → gene programs from RNA + ATAC. Then run CellOracle to simulate TF perturbations in silico. The point is to stop thinking in terms of "marker genes" and start thinking in terms of controllable regulatory circuits. If you want a learned latent space across paired and unpaired multiomic data, add MultiVI through scvi-tools. Deliverable: a ranked graph of candidate control points per cell type or trajectory branch.
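One simple way to turn inferred TF → region → gene links into a ranked list is to aggregate edge weights per TF. The sketch below assumes the links have already been exported as plain `(tf, region, gene, weight)` tuples for a single cell type; that tuple layout, the toy names, and the sum-of-weights score are assumptions for illustration and do not reproduce the SCENIC+ or CellOracle output formats.

```python
from collections import defaultdict

# Toy edges for one cell type: (TF, region, target gene, edge weight).
edges = [
    ("TF_A", "region_1", "GENE_X", 0.5),
    ("TF_A", "region_2", "GENE_Y", 0.25),
    ("TF_B", "region_3", "GENE_X", 0.5),
    ("TF_C", "region_4", "GENE_Z", 0.125),
    ("TF_A", "region_5", "GENE_X", 0.125),
]


def rank_control_points(edges):
    """Rank TFs by summed edge weight, reporting distinct target genes."""
    total_weight = defaultdict(float)
    target_genes = defaultdict(set)
    for tf, _region, gene, weight in edges:
        total_weight[tf] += weight
        target_genes[tf].add(gene)
    # Sort by descending weight; break ties by TF name for stable output.
    ordered = sorted(total_weight, key=lambda tf: (-total_weight[tf], tf))
    return [(tf, total_weight[tf], len(target_genes[tf])) for tf in ordered]


ranked = rank_control_points(edges)
```

A summed weight favors TFs with many moderately strong links; a different aggregate, such as the maximum edge weight, would favor TFs with a single dominant target.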
Days 21–30
Intervention-ranking engine
Anchor design logic to real editor families: CRISPRoff-style memory writing, hit-and-run silencing, durable multiplexed programming in primary T cells, or transient RNP delivery. Rank candidate interventions by predicted direction of effect, expected persistence, cell-state specificity, delivery complexity, and uncertainty. The 2024 context-dependence paper (Policarpi et al., 2024) shapes the loss function: optimize for robust effect under chromatin-context variation, not average effect. Main artifact: a table with one row per candidate and columns for locus, effector family, target cell state, predicted effect, persistence score, uncertainty, and evidence.
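The difference between ranking by average effect and ranking by robust effect can be made concrete with a toy example. In this sketch, robustness is taken as the worst-case predicted effect across chromatin contexts, and the score combines it with persistence and an uncertainty penalty. The scoring formula, the `uncertainty_weight` parameter, and all values are hypothetical choices for illustration, not a method taken from the cited papers.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Candidate:
    """One row of the intervention table; all values here are made up."""
    locus: str
    effector_family: str
    target_cell_state: str
    effects_by_context: dict   # context -> predicted effect in desired direction
    persistence_score: float
    uncertainty: float
    evidence: str

    @property
    def mean_effect(self) -> float:
        return mean(list(self.effects_by_context.values()))

    @property
    def robust_effect(self) -> float:
        # Worst case across contexts, one simple notion of robustness.
        return min(self.effects_by_context.values())


def robust_score(candidate, uncertainty_weight=1.0):
    """Hypothetical score: worst-case effect x persistence, minus a penalty."""
    return (candidate.robust_effect * candidate.persistence_score
            - uncertainty_weight * candidate.uncertainty)


candidates = [
    Candidate("LOCUS_A", "CRISPRoff-style", "state_1",
              {"open": 0.9, "closed": 0.1}, 0.8, 0.1, "toy"),
    Candidate("LOCUS_B", "CRISPRoff-style", "state_1",
              {"open": 0.5, "closed": 0.4}, 0.8, 0.1, "toy"),
]

by_mean = sorted(candidates, key=lambda c: c.mean_effect, reverse=True)
by_robust = sorted(candidates, key=robust_score, reverse=True)
```

Here `LOCUS_A` wins on average effect but drops below `LOCUS_B` once the worst-case context is what counts, which is the behavior the context-dependence argument calls for.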
Starter Repo Layout
epi-programming/
  data/
    raw/
    interim/
    processed/
  metadata/
    samples.csv
    perturbations.csv
    references.yaml
  configs/
    dataset.yaml
    qc.yaml
    models.yaml
    ranking.yaml
  notebooks/
    00_pbmc_multiome_explore.ipynb
    01_qc_rna_atac.ipynb
    02_annotation.ipynb
    03_grn_scenicplus.ipynb
    04_in_silico_perturbation_celloracle.ipynb
    05_latent_model_multivi.ipynb
    06_candidate_ranking.ipynb
  src/
    io.py
    qc.py
    rna.py
    atac.py
    grn.py
    perturb.py
    latent.py
    ranking.py
    eval.py
  artifacts/
  environment.yml
  README.md
AnnData / MuData Schema
Best First Project
Build a PBMC / T-cell state steering engine. Start with public PBMC multiome data, learn a latent state representation, infer enhancer-driven regulatory programs, run in silico perturbations, and emit a ranked intervention report. After PBMC, graduate to ENCODE for reference epigenomes, the Human Cell Atlas (HCA) for open multiomic references, and the Gene Expression Omnibus (GEO) for broader public functional-genomics datasets.
What to Avoid
- Starting with "epigenetic rejuvenation"
- A giant foundation model
- A one-click automated platform
Start with one cell system, one intervention objective, and one validation loop. That is the shortest path from "developer interested in epigenome" to "builder of useful epigenome software."
Key References (Read in Order)
- Nuñez et al., 2021: CRISPRoff, programmable transcriptional memory
- Policarpi et al., 2024: systematic context dependence of chromatin interventions
- Cappelluti et al., 2024: durable in vivo hit-and-run silencing
- Goudy et al., 2025: durable multiplexed epigenetic programming in primary human T cells
- Xu et al., 2025: transient RNP delivery of epigenome editors
- Kaya-Okur et al., 2019: CUT&Tag, a practical chromatin-profiling assay