DataOps Storage Layout

This project uses three storage classes with different retention behavior:

  • data/datasets/: input snapshots (shared reference data)

  • data/scratch/: heavy transient intermediates (safe to delete)

  • output/: compact run bundles (figures, CSV, JSON metadata)

data/ is a local workspace and is not intended for git tracking.

Why this split

  • Prevents mixing paper-figure drafts with distributed intermediates.

  • Makes cleanup safe (scratch can be purged by policy).

  • Keeps reproducible artifacts in output/ with provenance metadata.

Path contract

Configure all notebook and benchmark runs through:

  • notebooks/config.def.toml (committed defaults)

  • notebooks/config.user.toml (local override, gitignored)

Key fields:

  • paths.data_root -> data/datasets

  • paths.scratch_root -> data/scratch

  • paths.output_root -> output

Legacy compatibility

  • distributed/utse_cyto is a legacy symlink to an archive location.

  • Keep it for old scripts, but new workflows should rely on data_root and dataset keys.
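New code can prefer `data_root` while still tolerating datasets that only exist behind the legacy symlink. A sketch of that fallback order; `resolve_dataset` and its default paths are assumptions for illustration, not an existing project API:

```python
from pathlib import Path

def resolve_dataset(name: str,
                    data_root: Path = Path("data/datasets"),
                    legacy_root: Path = Path("distributed/utse_cyto")) -> Path:
    """Prefer the configured data_root; fall back to the legacy symlink
    only for datasets that have not been migrated yet."""
    primary = data_root / name
    if primary.exists():
        return primary
    legacy = legacy_root / name
    if legacy.exists():
        return legacy
    raise FileNotFoundError(
        f"dataset {name!r} not found under {data_root} or {legacy_root}")
```

Once every dataset has moved under `data_root`, the legacy branch becomes dead code and can be dropped along with the symlink.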

Benchmark Suite Configuration

The distributed benchmark (scripts/benchmark/) has its own config contract:

  • scripts/benchmark/config/benchmark.def.toml — committed defaults

  • scripts/benchmark/benchmark.user.toml — local override (gitignored)

Hardware requirements

Stage            | GPU model              | Notes
-----------------|------------------------|--------------------------------------------------
register_denoise | CPU only               | ~115 s/frame on 8 CPUs (ANTs); 80 frames ≈ 2.5 h
cellpose         | A100-PCIE-40GB (40 GB) | V100 (16 GB) OOM with cpsam on 2400×2400 images
contact          | A100-PCIE-40GB (40 GB) | Same constraint as cellpose

Key parameters

[dataset]
total_frames = 80      # 80 frames fits within 4h SLURM short slot at ~115 s/frame
                       # Scaling ratios (N=1 vs N=8) are equivalent; absolute time is not the goal

[resources.cellpose]
gres = "gpu:a100-pcie-40gb:1"   # V100 OOM with cpsam (SAM ViT attention ~4 GB allocation)

Override in benchmark.user.toml if your cluster has different GPU models.
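The 80-frame choice can be sanity-checked against the 4 h short slot arithmetically. A sketch; `frames_that_fit` is an illustrative helper, not project code:

```python
# At ~115 s/frame, how many register_denoise frames fit in one SLURM slot?
def frames_that_fit(slot_seconds: int, seconds_per_frame: float) -> int:
    return int(slot_seconds // seconds_per_frame)

SHORT_SLOT = 4 * 3600   # 4 h short partition, per this doc
SEC_PER_FRAME = 115     # ANTs register_denoise on 8 CPUs

print(frames_that_fit(SHORT_SLOT, SEC_PER_FRAME))  # 125, so 80 leaves headroom
print(80 * SEC_PER_FRAME / 3600)                   # ~2.56 h for 80 frames
```

80 frames at ~115 s each is ~9200 s (~2.56 h), well inside the 14 400 s slot, which leaves margin for queue startup and per-frame variance.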

Output layout

output/benchmark/
├── register_denoise/run_<ID>/{log/, tables/}
├── cellpose/run_<ID>/{log/, tables/}
├── trackmate/run_<ID>/{log/, tables/}
├── contact/run_<ID>/{log/, tables/}
└── master/run_<ID>/run_metadata.json

Validation: pixi run python scripts/benchmark/collect_results.py --validate
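The per-run layout above can be spot-checked before running `collect_results.py`. A minimal sketch; the stage names come from this doc, but `missing_paths` itself is a hypothetical helper:

```python
from pathlib import Path

STAGES = ("register_denoise", "cellpose", "trackmate", "contact")

def missing_paths(benchmark_root: Path, run_id: str) -> list[Path]:
    """Return the expected per-run paths that are absent under output/benchmark/."""
    expected = [benchmark_root / stage / f"run_{run_id}" / sub
                for stage in STAGES for sub in ("log", "tables")]
    expected.append(benchmark_root / "master" / f"run_{run_id}" / "run_metadata.json")
    return [p for p in expected if not p.exists()]
```

An empty return means all four stage directories and the master metadata file are in place for that run ID.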


Migration checklist

  1. Move (or symlink) input datasets under data/datasets/<dataset_name>/.

  2. Route distributed-heavy intermediates to data/scratch/<dataset_name>/.

  3. Keep paper plotting drafts in data/paper_figures/.

  4. Generate final run artifacts in output/<workflow>/run_<id>/.

  5. Periodically purge data/scratch/ and old logs.
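Step 5 can be automated with an age-based purge. A sketch under stated assumptions: the 14-day cutoff is arbitrary, `purge_scratch` is a hypothetical helper, and it defaults to dry-run so nothing is deleted unless asked:

```python
import shutil
import time
from pathlib import Path

def purge_scratch(scratch_root: Path, max_age_days: float = 14,
                  dry_run: bool = True) -> list[Path]:
    """List (and, when dry_run=False, delete) top-level scratch dirs whose
    mtime is older than max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    stale = [d for d in scratch_root.iterdir()
             if d.is_dir() and d.stat().st_mtime < cutoff]
    if not dry_run:
        for d in stale:
            shutil.rmtree(d)
    return stale
```

Running it with `dry_run=True` first and reviewing the returned list is the safe workflow, since anything under `data/scratch/` is by contract deletable but mtime alone cannot distinguish an abandoned run from a slow one.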