DataOps Storage Layout¶
This project uses three storage classes with different retention behavior:
- data/datasets/: input snapshots (shared reference data)
- data/scratch/: heavy transient intermediates (safe to delete)
- output/: compact run bundles (figures, CSV, JSON metadata)
data/ is a local workspace and is not intended for git tracking.
Why this split¶
- Prevents mixing paper-figure drafts with distributed intermediates.
- Makes cleanup safe (scratch can be purged by policy).
- Keeps reproducible artifacts in output/ with provenance metadata.
Path contract¶
Configure all notebook and benchmark runs through:
- notebooks/config.def.toml (committed defaults)
- notebooks/config.user.toml (local override, gitignored)
Key fields:
- paths.data_root -> data/datasets
- paths.scratch_root -> data/scratch
- paths.output_root -> output
Legacy compatibility¶
- distributed/utse_cyto is a legacy symlink to an archive location.
- Keep it for old scripts, but new workflows should rely on data_root and dataset keys.
Benchmark Suite Configuration¶
The distributed benchmark (scripts/benchmark/) has its own config contract:
- scripts/benchmark/config/benchmark.def.toml (committed defaults)
- scripts/benchmark/benchmark.user.toml (local override, gitignored)
Hardware requirements¶
| Stage | GPU model | Notes |
|---|---|---|
| | CPU only | ~115 s/frame on 8 CPUs (ANTs); 80 frames ≈ 2.5 h |
| | A100-PCIE-40GB (40 GB) | V100 (16 GB) OOM with |
| | A100-PCIE-40GB (40 GB) | Same constraint as cellpose |
Key parameters¶
[dataset]
total_frames = 80 # 80 frames fits within 4h SLURM short slot at ~115 s/frame
# Scaling ratios (N=1 vs N=8) are equivalent; absolute time is not the goal
[resources.cellpose]
gres = "gpu:a100-pcie-40gb:1" # V100 OOM with cpsam (SAM ViT attention ~4 GB allocation)
Override in benchmark.user.toml if your cluster has different GPU models.
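The walltime reasoning behind total_frames can be checked with a short calculation. This is a sketch using only the figures quoted above (80 frames, ~115 s/frame, a 4 h slot); the function name `estimated_walltime_hours` is hypothetical, and it assumes frames are processed one after another at a constant rate.

```python
def estimated_walltime_hours(total_frames: int, seconds_per_frame: float) -> float:
    """Serial walltime in hours for processing every frame once."""
    return total_frames * seconds_per_frame / 3600.0


slot_hours = 4.0  # length of the SLURM short slot quoted in the config comment
hours = estimated_walltime_hours(total_frames=80, seconds_per_frame=115)
print(f"{hours:.2f} h of a {slot_hours:.0f} h slot")  # prints: 2.56 h of a 4 h slot
```

Under the same assumptions, the largest frame count that still fits the slot is 4 × 3600 / 115 ≈ 125, so 80 frames leave headroom for slower nodes.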
Output layout¶
output/benchmark/
├── register_denoise/run_<ID>/{log/, tables/}
├── cellpose/run_<ID>/{log/, tables/}
├── trackmate/run_<ID>/{log/, tables/}
├── contact/run_<ID>/{log/, tables/}
└── master/run_<ID>/run_metadata.json
Validation: pixi run python scripts/benchmark/collect_results.py --validate
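For orientation, the layout above can also be spot-checked with a few lines of pathlib. This is an illustrative sketch only and does not describe what collect_results.py --validate actually does; the function name `missing_run_paths` is hypothetical, and it assumes every stage and the master bundle share the same run ID.

```python
from pathlib import Path

# Stage directories taken from the output layout listed above.
STAGES = ("register_denoise", "cellpose", "trackmate", "contact")


def missing_run_paths(run_id: str, root: Path = Path("output/benchmark")) -> list[Path]:
    """List the expected directories and files that do not exist for one run."""
    expected = [
        root / stage / f"run_{run_id}" / sub
        for stage in STAGES
        for sub in ("log", "tables")
    ]
    expected.append(root / "master" / f"run_{run_id}" / "run_metadata.json")
    return [path for path in expected if not path.exists()]
```

An empty return value means every expected path is present; it says nothing about whether the tables or metadata inside are valid.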
Migration checklist¶
- Move (or symlink) input datasets under data/datasets/<dataset_name>/.
- Route distributed-heavy intermediates to data/scratch/<dataset_name>/.
- Keep paper plotting drafts in data/paper_figures/.
- Generate final run artifacts in output/<workflow>/run_<id>/.
- Periodically purge data/scratch/ and old logs.
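The periodic purge in the last item could be scripted along these lines. This is a minimal sketch under assumptions: the document does not specify the purge policy, so the age-based rule, the function name `purge_old_files`, and its parameters are all hypothetical. It defaults to a dry run so nothing is deleted until the list has been reviewed.

```python
import time
from pathlib import Path


def purge_old_files(scratch_root: Path, max_age_days: float, dry_run: bool = True) -> list[Path]:
    """Delete (or, in dry-run mode, only list) files older than `max_age_days`."""
    cutoff = time.time() - max_age_days * 86400  # seconds per day
    stale = [
        path
        for path in scratch_root.rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    ]
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

A call such as `purge_old_files(Path("data/scratch"), max_age_days=30)` only reports candidates; passing `dry_run=False` removes them. Empty directories are left in place.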