Operator Guide

This guide covers deploying and operating pyCyto on shared HPC clusters — data staging, SLURM job submission, provenance management, and benchmark validation.


Data Placement Conventions

pyCyto uses three storage classes with distinct roles and retention policies:

| Location | Purpose | Retention |
| --- | --- | --- |
| `data/datasets/<name>/` | Input snapshots: read-only reference data | Permanent |
| `data/scratch/<name>/` | Heavy intermediates (segmentation masks, registered frames) | Purge-safe |
| `output/<workflow>/run_<id>/` | Compact run bundles (CSV, JSON, figures) | Keep per-paper |
Configure roots in notebooks/config.user.toml (gitignored):

```toml
[paths]
data_root   = "/gpfs/your/data/root"
output_root = "/gpfs/your/output/root"
```

Rule: Never write outputs directly to data/datasets/. Never commit large binary files — intermediates belong in data/scratch/.


Data Lifecycle

```mermaid
graph LR
    A[Ingest to data/datasets/] --> B[Scratch Compute in data/scratch/]
    B --> C[Validate with collect_results.py]
    C --> D[Promote to output/workflow/run_id/]

    classDef lifecycleStage fill:#0d7377,color:#fff,stroke:#0a5c60
    class A,B,C,D lifecycleStage
```
  1. Ingest — place dataset snapshot under data/datasets/<dataset_name>/ following the layout in DataOps Storage Layout.

  2. Scratch compute — distributed pipeline writes heavy intermediates (registered frames, per-patch segmentation) to data/scratch/. These are safe to delete after validation.

  3. Validate — run collect_results.py --validate to generate acceptance_report.json and provenance_manifest.json.

  4. Promote — compact results (CSV tables, figures, JSON metadata) are written to output/<workflow>/run_<id>/ and committed or archived.
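The promote step above can be sketched as a small helper that copies only the compact artifacts (CSV tables, JSON metadata, figures) out of scratch, leaving heavy intermediates behind. The function name and glob patterns are illustrative assumptions, not pyCyto API:

```python
import shutil
from pathlib import Path

# Compact run-bundle artifacts only; masks/frames stay in scratch (assumed patterns)
COMPACT_PATTERNS = ("*.csv", "*.json", "*.png")

def promote(scratch_dir: Path, output_root: Path, workflow: str, run_id: str) -> list[Path]:
    """Copy compact results from scratch into output/<workflow>/run_<id>/."""
    run_dir = output_root / workflow / f"run_{run_id}"
    run_dir.mkdir(parents=True, exist_ok=True)
    promoted = []
    for pattern in COMPACT_PATTERNS:
        for src in sorted(scratch_dir.glob(pattern)):
            dst = run_dir / src.name
            shutil.copy2(src, dst)  # preserves timestamps for provenance
            promoted.append(dst)
    return promoted
```

After a successful promote, the matching scratch subtree is safe to delete (see Storage Housekeeping).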


SLURM Job Submission

Partition selection

Check live status before submitting:

```bash
sinfo -h -o "%P|%a|%l|%D|%t"
```

Note: Partition names are cluster-specific. The table below reflects the BMRC (Oxford) cluster. Check `sinfo` output for your cluster's actual partition names and time limits.

| Partition | Max time | Use |
| --- | --- | --- |
| short | 1-06:00:00 | CPU jobs, preprocessing, debugging |
| long | 10-00:00:00 | Full CPU production runs |
| gpu_short | 4:00:00 | GPU segmentation, contact detection |
| gpu_long | 2-12:00:00 | Full GPU production runs |
Always prefer short/gpu_short for development and benchmark validation.
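The pipe-separated `sinfo` format shown above is easy to parse when selecting a partition programmatically. A sketch using only the standard library; the field order follows the `-o` format string, and the function name and sample output are illustrative:

```python
def parse_sinfo(text: str) -> list[dict[str, str]]:
    """Parse `sinfo -h -o "%P|%a|%l|%D|%t"` output into per-partition records."""
    fields = ("partition", "avail", "timelimit", "nodes", "state")
    return [dict(zip(fields, line.split("|"))) for line in text.strip().splitlines()]

# Hypothetical two-line sinfo output ('*' marks the default partition)
sample = "short|up|1-06:00:00|120|idle\ngpu_short*|up|4:00:00|8|mix"
rows = parse_sinfo(sample)
```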

Submitting the benchmark suite

```bash
# Dry-run first (resolves paths, prints sbatch commands, no submission)
pixi run python scripts/benchmark/run_benchmark.py --dry-run

# Full submission
pixi run python scripts/benchmark/run_benchmark.py
```

The orchestrator manages all four stages: it submits the early stages, blocks until the GPU stages complete, then submits the TrackMate and contact stages automatically.

Monitoring

```bash
squeue --me                        # all your jobs
squeue --me -o "%i %j %T %R"       # job ID, name, state, reason
tail -f output/benchmark/register_denoise/run_<ID>/log/<job>.out
```

Provenance and Run Metadata

Every run writes a run_metadata.json under output/<workflow>/run_<id>/:

```json
{
  "run_id": "20260315T111551Z",
  "pipeline": "register_denoise",
  "total_frames": 80,
  "job_counts": [1, 2, 4, 8],
  "git_sha": "abc1234",
  "completed_at": "2026-03-15T14:23:01Z"
}
```
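The `run_id` doubles as a sortable UTC timestamp, so it can be parsed back into a datetime with the standard library alone (the helper name is illustrative):

```python
from datetime import datetime, timezone

def parse_run_id(run_id: str) -> datetime:
    """Parse a run_id like '20260315T111551Z' into an aware UTC datetime."""
    return datetime.strptime(run_id, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)

ts = parse_run_id("20260315T111551Z")
```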

Generate provenance for notebook runs with:

```python
from cyto.utils import write_run_metadata

write_run_metadata(output_dir / "run_metadata.json",
                   pipeline="my_analysis",
                   params={"channel": "ch0", "frames": 100})
```

Benchmark Validation

After all benchmark stages complete:

```bash
pixi run python scripts/benchmark/collect_results.py \
    --run-id 20260315T111551Z \
    --validate
```

Produces:

  • acceptance_report.json — pass/fail per stage against tolerance thresholds

  • provenance_manifest.json — file hashes, git SHA, hardware info

  • output/benchmark/master/run_<ID>/speedup_*.png — scaling plots
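The file hashes recorded in the manifest can be reproduced with a streaming digest; a sketch assuming SHA-256 (the actual algorithm and manifest schema are defined by `collect_results.py`):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large outputs never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        while block := fh.read(chunk):
            h.update(block)
    return h.hexdigest()
```

Re-hashing a promoted output and comparing against the manifest entry is a cheap integrity check before archival.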

Validation thresholds (from benchmark.def.toml):

  • CPU wall-time tolerance: ±20%

  • GPU wall-time tolerance: ±30%

  • Speedup trend must be monotonically non-decreasing across job counts
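The threshold logic above reduces to two checks: each wall-time must fall within a relative tolerance of its reference, and measured speedups must never drop as the job count grows. A minimal sketch; function names and example values are illustrative:

```python
def within_tolerance(measured: float, reference: float, tol: float) -> bool:
    """True if measured wall-time is within ±tol (fractional) of the reference."""
    return abs(measured - reference) <= tol * reference

def monotone_non_decreasing(speedups: list[float]) -> bool:
    """True if the speedup trend never decreases across increasing job counts."""
    return all(b >= a for a, b in zip(speedups, speedups[1:]))

# e.g. CPU stage at ±20% tolerance, speedups for job_counts [1, 2, 4, 8]
ok = within_tolerance(110.0, 100.0, 0.20) and monotone_non_decreasing([1.0, 1.9, 3.6, 6.8])
```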


Storage Housekeeping

```bash
# Check scratch usage
du -sh data/scratch/

# Safe to delete after benchmark validation
rm -rf data/scratch/<dataset>/patching/benchmark/register_denoise/

# Rotate old benchmark logs (keep last 3 runs)
ls -dt output/benchmark/master/run_*/ | tail -n +4 | xargs rm -rf
```

DataOps: URI-Addressable Storage (Planned)

Note

This section describes the planned post-paper URI-based I/O scheme. The design is finalized; implementation begins after paper submission.

The current pipeline uses absolute filesystem paths in YAML configs. The planned dataops-uri-abstraction scheme introduces a uniform URI format for all data types:

| URI scheme | Storage backend | Example |
| --- | --- | --- |
| `file://` | Local filesystem or GPFS | `file:///gpfs/data/utse_cyto/ch0/` |
| `db://` | PostgreSQL database | `db://cyto/runs/20260315T111551Z/tracks` |
| `ceph://` | Ceph object storage | `ceph://bpi-bucket/experiments/utse_cyto/` |
| `s3://` | S3-compatible object store | `s3://bpi-s3/utse_cyto/ch0_registered/` |
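Under this scheme, backend selection reduces to splitting the URI on its scheme; a sketch with the standard library's `urllib.parse` (the backend names mirror the table, and the dispatch function is hypothetical, not the planned implementation):

```python
from urllib.parse import urlsplit

# Scheme -> backend, mirroring the table above
BACKENDS = {"file": "local/GPFS", "db": "PostgreSQL", "ceph": "Ceph", "s3": "S3-compatible"}

def backend_for(uri: str) -> str:
    """Map a data URI onto its storage backend by scheme."""
    scheme = urlsplit(uri).scheme
    try:
        return BACKENDS[scheme]
    except KeyError:
        raise ValueError(f"unsupported URI scheme: {scheme!r}") from None

backend_for("ceph://bpi-bucket/experiments/utse_cyto/")
```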

Data versioning convention

Run outputs will be versioned under:

```
output/<workflow>/run_<run_id>/
```

where run_id follows the YYYYMMDDTHHMMSSZ ISO-8601 timestamp format (e.g., 20260315T111551Z). This matches the current write_run_metadata() convention and will become the canonical provenance key.

Provenance records

Each output (image, table, network) will have a JSON sidecar:

```json
{
  "uri":          "file:///output/cellpose/run_20260315/TCell.tif",
  "data_type":    "Label",
  "run_id":       "20260315T111551Z",
  "produced_by":  "CellPose",
  "params":       {"model_type": "cpsam", "diameter": 20},
  "git_sha":      "94e8d3f",
  "completed_at": "2026-03-15T14:23:01Z"
}
```

Current state (pre-abstraction)

Until the URI scheme is implemented, use notebooks/config.user.toml (gitignored) for all path overrides:

```toml
[paths]
data_root   = "/gpfs/your/data/root"
output_root = "/gpfs/your/output/root"
```

and cyto.utils.load_notebook_config() / load_db_config() to resolve paths in notebooks and scripts.


Cross-References