Operator Guide

This guide covers deploying and operating pyCyto on shared HPC clusters — data staging, SLURM job submission, provenance management, and benchmark validation.


Data Placement Conventions

pyCyto uses three storage classes with distinct roles and retention policies:

| Location | Purpose | Retention |
| --- | --- | --- |
| `data/datasets/<name>/` | Input snapshots: read-only reference data | Permanent |
| `data/scratch/<name>/` | Heavy intermediates (segmentation masks, registered frames) | Purge-safe |
| `output/<workflow>/run_<id>/` | Compact run bundles (CSV, JSON, figures) | Keep per-paper |
Configure roots in notebooks/config.user.toml (gitignored):

```toml
[paths]
data_root   = "/gpfs/your/data/root"
output_root = "/gpfs/your/output/root"
```

Rule: Never write outputs directly to data/datasets/. Never commit large binary files — intermediates belong in data/scratch/.


Data Lifecycle

```mermaid
graph LR
    A[Ingest to data/datasets/] --> B[Scratch Compute in data/scratch/]
    B --> C[Validate with collect_results.py]
    C --> D[Promote to output/workflow/run_id/]

    classDef lifecycleStage fill:#0d7377,color:#fff,stroke:#0a5c60
    class A,B,C,D lifecycleStage
```
  1. Ingest — place dataset snapshot under data/datasets/<dataset_name>/ following the layout in DataOps Storage Layout.

  2. Scratch compute — distributed pipeline writes heavy intermediates (registered frames, per-patch segmentation) to data/scratch/. These are safe to delete after validation.

  3. Validate — run collect_results.py --validate to generate acceptance_report.json and provenance_manifest.json.

  4. Promote — compact results (CSV tables, figures, JSON metadata) are written to output/<workflow>/run_<id>/ and committed or archived.
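The promote step above can be sketched as a small helper that copies only the compact artifacts (CSV tables, JSON metadata, figures) out of scratch, leaving heavy intermediates behind. The function name and glob patterns are illustrative assumptions, not pyCyto API:

```python
import shutil
from pathlib import Path

# Compact run-bundle artifacts only; masks/frames stay in scratch (assumed patterns)
COMPACT_PATTERNS = ("*.csv", "*.json", "*.png")

def promote(scratch_dir: Path, output_root: Path, workflow: str, run_id: str) -> list[Path]:
    """Copy compact results from scratch into output/<workflow>/run_<id>/."""
    run_dir = output_root / workflow / f"run_{run_id}"
    run_dir.mkdir(parents=True, exist_ok=True)
    promoted = []
    for pattern in COMPACT_PATTERNS:
        for src in sorted(scratch_dir.glob(pattern)):
            dst = run_dir / src.name
            shutil.copy2(src, dst)  # preserves timestamps for provenance
            promoted.append(dst)
    return promoted
```

After a successful promote, the matching scratch subtree is safe to delete (see Storage Housekeeping).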


SLURM Job Submission

Partition selection

Check live status before submitting:

```bash
sinfo -h -o "%P|%a|%l|%D|%t"
```

Note: Partition names are cluster-specific. The table below reflects the BMRC (Oxford) cluster. Check `sinfo` output for your cluster's actual partition names and time limits.

| Partition | Max time | Use |
| --- | --- | --- |
| short | 1-06:00:00 | CPU jobs, preprocessing, debugging |
| long | 10-00:00:00 | Full CPU production runs |
| gpu_short | 4:00:00 | GPU segmentation, contact detection |
| gpu_long | 2-12:00:00 | Full GPU production runs |
Always prefer short/gpu_short for development and benchmark validation.
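The pipe-separated `sinfo` format shown above is easy to parse when selecting a partition programmatically. A sketch using only the standard library; the field order follows the `-o` format string, and the function name and sample output are illustrative:

```python
def parse_sinfo(text: str) -> list[dict[str, str]]:
    """Parse `sinfo -h -o "%P|%a|%l|%D|%t"` output into per-partition records."""
    fields = ("partition", "avail", "timelimit", "nodes", "state")
    return [dict(zip(fields, line.split("|"))) for line in text.strip().splitlines()]

# Hypothetical two-line sinfo output ('*' marks the default partition)
sample = "short|up|1-06:00:00|120|idle\ngpu_short*|up|4:00:00|8|mix"
rows = parse_sinfo(sample)
```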

Submitting the benchmark suite

```bash
# Dry-run first (resolves paths, prints sbatch commands, no submission)
pixi run python scripts/benchmark/run_benchmark.py --dry-run

# Full submission
pixi run python scripts/benchmark/run_benchmark.py
```

The orchestrator manages all four stages: it submits the early stages, blocks until the GPU stages complete, then submits the TrackMate and contact stages automatically.

Monitoring

```bash
squeue --me                        # all your jobs
squeue --me -o "%i %j %T %R"       # job ID, name, state, reason
tail -f output/benchmark/register_denoise/run_<ID>/log/<job>.out
```

Provenance and Run Metadata

Every run writes a run_metadata.json under output/<workflow>/run_<id>/:

```json
{
  "run_id": "20260315T111551Z",
  "pipeline": "register_denoise",
  "total_frames": 80,
  "job_counts": [1, 2, 4, 8],
  "git_sha": "abc1234",
  "completed_at": "2026-03-15T14:23:01Z"
}
```
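The `run_id` doubles as a sortable UTC timestamp, so it can be parsed back into a datetime with the standard library alone (the helper name is illustrative):

```python
from datetime import datetime, timezone

def parse_run_id(run_id: str) -> datetime:
    """Parse a run_id like '20260315T111551Z' into an aware UTC datetime."""
    return datetime.strptime(run_id, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)

ts = parse_run_id("20260315T111551Z")
```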

Generate provenance for notebook runs with:

```python
from cyto.utils import write_run_metadata

write_run_metadata(output_dir / "run_metadata.json",
                   pipeline="my_analysis",
                   params={"channel": "ch0", "frames": 100})
```

Benchmark Validation

After all benchmark stages complete:

```bash
pixi run python scripts/benchmark/collect_results.py \
    --run-id 20260315T111551Z \
    --validate
```

Produces:

  • acceptance_report.json — pass/fail per stage against tolerance thresholds

  • provenance_manifest.json — file hashes, git SHA, hardware info

  • output/benchmark/master/run_<ID>/speedup_*.png — scaling plots
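The file hashes recorded in the manifest can be reproduced with a streaming digest; a sketch assuming SHA-256 (the actual algorithm and manifest schema are defined by `collect_results.py`):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large outputs never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        while block := fh.read(chunk):
            h.update(block)
    return h.hexdigest()
```

Re-hashing a promoted output and comparing against the manifest entry is a cheap integrity check before archival.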

Validation thresholds (from benchmark.def.toml):

  • CPU wall-time tolerance: ±20%

  • GPU wall-time tolerance: ±30%

  • Speedup trend must be monotonically non-decreasing across job counts
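The threshold logic above reduces to two checks: each wall-time must fall within a relative tolerance of its reference, and measured speedups must never drop as the job count grows. A minimal sketch; function names and example values are illustrative:

```python
def within_tolerance(measured: float, reference: float, tol: float) -> bool:
    """True if measured wall-time is within ±tol (fractional) of the reference."""
    return abs(measured - reference) <= tol * reference

def monotone_non_decreasing(speedups: list[float]) -> bool:
    """True if the speedup trend never decreases across increasing job counts."""
    return all(b >= a for a, b in zip(speedups, speedups[1:]))

# e.g. CPU stage at ±20% tolerance, speedups for job_counts [1, 2, 4, 8]
ok = within_tolerance(110.0, 100.0, 0.20) and monotone_non_decreasing([1.0, 1.9, 3.6, 6.8])
```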


Storage Housekeeping

```bash
# Check scratch usage
du -sh data/scratch/

# Safe to delete after benchmark validation
rm -rf data/scratch/<dataset>/patching/benchmark/register_denoise/

# Rotate old benchmark logs (keep last 3 runs)
ls -dt output/benchmark/master/run_*/ | tail -n +4 | xargs rm -rf
```

DataOps: URI-Addressable Storage (Planned)

Note

This section describes the planned post-paper URI-based I/O scheme. The design is finalized; implementation begins after paper submission.

The current pipeline uses absolute filesystem paths in YAML configs. The planned dataops-uri-abstraction scheme introduces a uniform URI format for all data types:

| URI scheme | Storage backend | Example |
| --- | --- | --- |
| `file://` | Local filesystem or GPFS | `file:///gpfs/data/utse_cyto/ch0/` |
| `db://` | PostgreSQL database | `db://cyto/runs/20260315T111551Z/tracks` |
| `ceph://` | Ceph object storage | `ceph://bpi-bucket/experiments/utse_cyto/` |
| `s3://` | S3-compatible object store | `s3://bpi-s3/utse_cyto/ch0_registered/` |
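Under this scheme, backend selection reduces to splitting the URI on its scheme; a sketch with the standard library's `urllib.parse` (the backend names mirror the table, and the dispatch function is hypothetical, not the planned implementation):

```python
from urllib.parse import urlsplit

# Scheme -> backend, mirroring the table above
BACKENDS = {"file": "local/GPFS", "db": "PostgreSQL", "ceph": "Ceph", "s3": "S3-compatible"}

def backend_for(uri: str) -> str:
    """Map a data URI onto its storage backend by scheme."""
    scheme = urlsplit(uri).scheme
    try:
        return BACKENDS[scheme]
    except KeyError:
        raise ValueError(f"unsupported URI scheme: {scheme!r}") from None

backend_for("ceph://bpi-bucket/experiments/utse_cyto/")
```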

Data versioning convention

Run outputs will be versioned under:

```
output/<workflow>/run_<run_id>/
```

where run_id follows the YYYYMMDDTHHMMSSZ ISO-8601 timestamp format (e.g., 20260315T111551Z). This matches the current write_run_metadata() convention and will become the canonical provenance key.

Provenance records

Each output (image, table, network) will have a JSON sidecar:

```json
{
  "uri":          "file:///output/cellpose/run_20260315/TCell.tif",
  "data_type":    "Label",
  "run_id":       "20260315T111551Z",
  "produced_by":  "CellPose",
  "params":       {"model_type": "cpsam", "diameter": 20},
  "git_sha":      "94e8d3f",
  "completed_at": "2026-03-15T14:23:01Z"
}
```

Current state (pre-abstraction)

Until the URI scheme is implemented, use notebooks/config.user.toml (gitignored) for all path overrides:

```toml
[paths]
data_root   = "/gpfs/your/data/root"
output_root = "/gpfs/your/output/root"
```

and cyto.utils.load_notebook_config() / load_db_config() to resolve paths in notebooks and scripts.


Cross-References