DataOps Storage Layout

This project uses three storage classes with different retention behavior:

  • data/datasets/: input snapshots (shared reference data)

  • data/scratch/: heavy transient intermediates (safe to delete)

  • output/: compact run bundles (figures, CSV, JSON metadata)

data/ is a local workspace and is not intended for git tracking.

Why this split

  • Prevents mixing paper-figure drafts with distributed intermediates.

  • Makes cleanup safe (scratch can be purged by policy).

  • Keeps reproducible artifacts in output/ with provenance metadata.

Path contract

Configure all notebook and benchmark runs through:

  • notebooks/config.def.toml (committed defaults)

  • notebooks/config.user.toml (local override, gitignored)

Key fields:

  • paths.data_root -> data/datasets

  • paths.scratch_root -> data/scratch

  • paths.output_root -> output

Legacy compatibility

  • distributed/utse_cyto is a legacy symlink to an archive location.

  • Keep it for old scripts, but new workflows should rely on data_root and dataset keys.
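New code can prefer `data_root` while still tolerating datasets that only exist behind the legacy symlink. A sketch of that fallback order; `resolve_dataset` and its default paths are assumptions for illustration, not an existing project API:

```python
from pathlib import Path

def resolve_dataset(name: str,
                    data_root: Path = Path("data/datasets"),
                    legacy_root: Path = Path("distributed/utse_cyto")) -> Path:
    """Prefer the configured data_root; fall back to the legacy symlink
    only for datasets that have not been migrated yet."""
    primary = data_root / name
    if primary.exists():
        return primary
    legacy = legacy_root / name
    if legacy.exists():
        return legacy
    raise FileNotFoundError(
        f"dataset {name!r} not found under {data_root} or {legacy_root}")
```

Once every dataset has moved under `data_root`, the legacy branch becomes dead code and can be dropped along with the symlink.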

Benchmark Suite Configuration

The distributed benchmark (scripts/benchmark/) has its own config contract:

  • scripts/benchmark/config/benchmark.def.toml — committed defaults

  • scripts/benchmark/benchmark.user.toml — local override (gitignored)

Hardware requirements

Stage            | GPU model              | Notes
-----------------|------------------------|--------------------------------------------------
register_denoise | CPU only               | ~115 s/frame on 8 CPUs (ANTs); 80 frames ≈ 2.5 h
cellpose         | A100-PCIE-40GB (40 GB) | V100 (16 GB) OOM with cpsam on 2400×2400 images
contact          | A100-PCIE-40GB (40 GB) | Same constraint as cellpose

Key parameters

[dataset]
total_frames = 80      # 80 frames fits within 4h SLURM short slot at ~115 s/frame
                       # Scaling ratios (N=1 vs N=8) are equivalent; absolute time is not the goal

[resources.cellpose]
gres = "gpu:a100-pcie-40gb:1"   # V100 OOM with cpsam (SAM ViT attention ~4 GB allocation)

Override in benchmark.user.toml if your cluster has different GPU models.
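The 80-frame choice can be sanity-checked against the 4 h short slot arithmetically. A sketch; `frames_that_fit` is an illustrative helper, not project code:

```python
# At ~115 s/frame, how many register_denoise frames fit in one SLURM slot?
def frames_that_fit(slot_seconds: int, seconds_per_frame: float) -> int:
    return int(slot_seconds // seconds_per_frame)

SHORT_SLOT = 4 * 3600   # 4 h short partition, per this doc
SEC_PER_FRAME = 115     # ANTs register_denoise on 8 CPUs

print(frames_that_fit(SHORT_SLOT, SEC_PER_FRAME))  # 125, so 80 leaves headroom
print(80 * SEC_PER_FRAME / 3600)                   # ~2.56 h for 80 frames
```

80 frames at ~115 s each is ~9200 s (~2.56 h), well inside the 14 400 s slot, which leaves margin for queue startup and per-frame variance.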

Output layout

output/benchmark/
├── register_denoise/run_<ID>/{log/, tables/}
├── cellpose/run_<ID>/{log/, tables/}
├── trackmate/run_<ID>/{log/, tables/}
├── contact/run_<ID>/{log/, tables/}
└── master/run_<ID>/run_metadata.json

Validation: pixi run python scripts/benchmark/collect_results.py --validate
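The per-run layout above can be spot-checked before running `collect_results.py`. A minimal sketch; the stage names come from this doc, but `missing_paths` itself is a hypothetical helper:

```python
from pathlib import Path

STAGES = ("register_denoise", "cellpose", "trackmate", "contact")

def missing_paths(benchmark_root: Path, run_id: str) -> list[Path]:
    """Return the expected per-run paths that are absent under output/benchmark/."""
    expected = [benchmark_root / stage / f"run_{run_id}" / sub
                for stage in STAGES for sub in ("log", "tables")]
    expected.append(benchmark_root / "master" / f"run_{run_id}" / "run_metadata.json")
    return [p for p in expected if not p.exists()]
```

An empty return means all four stage directories and the master metadata file are in place for that run ID.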


Migration checklist

  1. Move (or symlink) input datasets under data/datasets/<dataset_name>/.

  2. Route distributed-heavy intermediates to data/scratch/<dataset_name>/.

  3. Keep paper plotting drafts in data/paper_figures/.

  4. Generate final run artifacts in output/<workflow>/run_<id>/.

  5. Periodically purge data/scratch/ and old logs.
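Step 5 can be automated with an age-based purge. A sketch under stated assumptions: the 14-day cutoff is arbitrary, `purge_scratch` is a hypothetical helper, and it defaults to dry-run so nothing is deleted unless asked:

```python
import shutil
import time
from pathlib import Path

def purge_scratch(scratch_root: Path, max_age_days: float = 14,
                  dry_run: bool = True) -> list[Path]:
    """List (and, when dry_run=False, delete) top-level scratch dirs whose
    mtime is older than max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    stale = [d for d in scratch_root.iterdir()
             if d.is_dir() and d.stat().st_mtime < cutoff]
    if not dry_run:
        for d in stale:
            shutil.rmtree(d)
    return stale
```

Running it with `dry_run=True` first and reviewing the returned list is the safe workflow, since anything under `data/scratch/` is by contract deletable but mtime alone cannot distinguish an abandoned run from a slow one.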