Operator Guide¶
This guide covers deploying and operating pyCyto on shared HPC clusters — data staging, SLURM job submission, provenance management, and benchmark validation.
Data Placement Conventions¶
pyCyto uses three storage classes with distinct roles and retention policies:
| Location | Purpose | Retention |
|---|---|---|
| `data/datasets/` | Input snapshots — read-only reference data | Permanent |
| `data/scratch/` | Heavy intermediates (segmentation masks, registered frames) | Purge-safe |
| `output/` | Compact run bundles (CSV, JSON, figures) | Keep per-paper |
Configure roots in notebooks/config.user.toml (gitignored):
[paths]
data_root = "/gpfs/your/data/root"
output_root = "/gpfs/your/output/root"
Rule: Never write outputs directly to `data/datasets/`. Never commit large binary files — intermediates belong in `data/scratch/`.
Data Lifecycle¶
graph LR
A[Ingest to data/datasets/] --> B[Scratch Compute in data/scratch/]
B --> C[Validate with collect_results.py]
C --> D[Promote to output/workflow/run_id/]
classDef lifecycleStage fill:#0d7377,color:#fff,stroke:#0a5c60
class A,B,C,D lifecycleStage
1. **Ingest** — place the dataset snapshot under `data/datasets/<dataset_name>/` following the layout in DataOps Storage Layout.
2. **Scratch compute** — the distributed pipeline writes heavy intermediates (registered frames, per-patch segmentation) to `data/scratch/`. These are safe to delete after validation.
3. **Validate** — run `collect_results.py --validate` to generate `acceptance_report.json` and `provenance_manifest.json`.
4. **Promote** — compact results (CSV tables, figures, JSON metadata) are written to `output/<workflow>/run_<id>/` and committed or archived.
SLURM Job Submission¶
Partition selection¶
Check live status before submitting:
sinfo -h -o "%P|%a|%l|%D|%t"
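The `-o "%P|%a|%l|%D|%t"` format prints pipe-separated fields: partition, availability, time limit, node count, node state. A small sketch for filtering that output programmatically — the sample line in the test, including its time limit, is invented for illustration:

```python
from typing import NamedTuple

class PartitionInfo(NamedTuple):
    partition: str   # %P
    avail: str       # %a
    timelimit: str   # %l
    nodes: str       # %D
    state: str       # %t

def parse_sinfo(text):
    """Parse `sinfo -h -o "%P|%a|%l|%D|%t"` output into records."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split("|")
        if len(fields) == 5:
            rows.append(PartitionInfo(*fields))
    return rows

def idle_partitions(rows):
    """Names of partitions that have at least one idle node group."""
    return sorted({r.partition for r in rows if r.state == "idle"})
```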
Note
Partition names are cluster-specific. The table below reflects the BMRC (Oxford) cluster. Check sinfo output for your cluster’s actual partition names and time limits.
| Partition | Max time | Use |
|---|---|---|
| `short` | see `sinfo` | CPU jobs, preprocessing, debugging |
| *(cluster-specific)* | see `sinfo` | Full CPU production runs |
| `gpu_short` | see `sinfo` | GPU segmentation, contact detection |
| *(cluster-specific)* | see `sinfo` | Full GPU production runs |
Always prefer `short`/`gpu_short` for development and benchmark validation.
Submitting the benchmark suite¶
# Dry-run first (resolves paths, prints sbatch commands, no submission)
pixi run python scripts/benchmark/run_benchmark.py --dry-run
# Full submission
pixi run python scripts/benchmark/run_benchmark.py
The orchestrator submits all four stages, blocks until the GPU stages complete, then submits the TrackMate and contact stages automatically.
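One common way to express this ordering in SLURM is the `--dependency=afterok:<jobid>` flag to `sbatch`, with each job ID captured from `sbatch --parsable`. The sketch below only builds the chained command lines; it is a hypothetical illustration of the pattern, not `run_benchmark.py`'s actual implementation:

```python
def chain_sbatch(scripts):
    """Build sbatch command lines where each stage waits for the previous
    stage to finish successfully (afterok). Job IDs are placeholders here;
    a real orchestrator would substitute the IDs that `sbatch --parsable`
    prints for the preceding submissions."""
    commands = []
    prev_jobid = None
    for i, script in enumerate(scripts):
        cmd = ["sbatch", "--parsable"]
        if prev_jobid is not None:
            cmd.append(f"--dependency=afterok:{prev_jobid}")
        cmd.append(script)
        commands.append(cmd)
        prev_jobid = f"$JOB{i}"  # placeholder for the captured job ID
    return commands
```

With `afterok`, a failed upstream stage leaves its dependents pending forever, so a real orchestrator also needs to cancel or resubmit on failure.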
Monitoring¶
squeue --me # all your jobs
squeue --me -o "%i %j %T %R" # job ID, name, state, reason
tail -f output/benchmark/register_denoise/run_<ID>/log/<job>.out
Provenance and Run Metadata¶
Every run writes a run_metadata.json under output/<workflow>/run_<id>/:
{
"run_id": "20260315T111551Z",
"pipeline": "register_denoise",
"total_frames": 80,
"job_counts": [1, 2, 4, 8],
"git_sha": "abc1234",
"completed_at": "2026-03-15T14:23:01Z"
}
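A quick sanity check on such a file before promoting a run might look like the following; the required-key set mirrors the example above, and the helper itself is illustrative rather than part of pyCyto:

```python
import json

REQUIRED_KEYS = {"run_id", "pipeline", "git_sha", "completed_at"}

def check_run_metadata(path):
    """Load run_metadata.json and report any missing required keys."""
    with open(path) as f:
        meta = json.load(f)
    missing = sorted(REQUIRED_KEYS - meta.keys())
    return meta, missing
```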
Generate provenance for notebook runs with:
from cyto.utils import write_run_metadata
write_run_metadata(output_dir / "run_metadata.json",
pipeline="my_analysis",
params={"channel": "ch0", "frames": 100})
Benchmark Validation¶
After all benchmark stages complete:
pixi run python scripts/benchmark/collect_results.py \
--run-id 20260315T111551Z \
--validate
Produces:
- `acceptance_report.json` — pass/fail per stage against tolerance thresholds
- `provenance_manifest.json` — file hashes, git SHA, hardware info
- `output/benchmark/master/run_<ID>/speedup_*.png` — scaling plots
Validation thresholds (from benchmark.def.toml):
- CPU wall-time tolerance: ±20%
- GPU wall-time tolerance: ±30%
- Speedup trend must be monotonically non-decreasing across job counts
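These two checks amount to a pair of pure functions. The sketch below shows the logic with the threshold values from `benchmark.def.toml`; the function names are illustrative, not `collect_results.py`'s internals:

```python
def within_tolerance(measured, reference, tol):
    """True if a measured wall-time is within ±tol (as a fraction,
    e.g. 0.20 for CPU, 0.30 for GPU) of the reference wall-time."""
    return abs(measured - reference) <= tol * reference

def speedup_monotonic(speedups):
    """True if speedup never decreases as the job count grows.
    `speedups` is ordered by job count, e.g. for [1, 2, 4, 8] jobs."""
    return all(b >= a for a, b in zip(speedups, speedups[1:]))
```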
Storage Housekeeping¶
# Check scratch usage
du -sh data/scratch/
# Safe to delete after benchmark validation
rm -rf data/scratch/<dataset>/patching/benchmark/register_denoise/
# Rotate old benchmark logs (keep last 3 runs)
ls -dt output/benchmark/master/run_*/ | tail -n +4 | xargs rm -rf
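The `ls -dt … | tail -n +4 | xargs rm -rf` one-liner keeps the three most recently modified run directories and deletes the rest. An equivalent Python sketch, assuming the `output/benchmark/master/run_*` layout above:

```python
import shutil
from pathlib import Path

def rotate_runs(master_dir, keep=3):
    """Delete all but the `keep` most recently modified run_* directories;
    return the names of the directories that were kept."""
    runs = sorted(
        (p for p in Path(master_dir).glob("run_*") if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first, mirroring `ls -dt`
    )
    for old in runs[keep:]:
        shutil.rmtree(old)
    return [p.name for p in runs[:keep]]
```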
DataOps: URI-Addressable Storage (Planned)¶
Note
This section describes the planned post-paper URI-based I/O scheme. The design is finalized; implementation begins after paper submission.
The current pipeline uses absolute filesystem paths in YAML configs. The planned dataops-uri-abstraction scheme introduces a uniform URI format for all data types:
| URI scheme | Storage backend | Example |
|---|---|---|
| `file://` | Local filesystem or GPFS | `file:///output/cellpose/run_20260315/TCell.tif` |
| *(TBD)* | PostgreSQL database | |
| *(TBD)* | Ceph object storage | |
| `s3://` | S3-compatible object store | |
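Under such a scheme, backend dispatch reduces to inspecting the scheme component of the URI, which the standard library already parses. A minimal sketch; the scheme-to-backend mapping is hypothetical, since the final scheme names are not fixed here:

```python
from urllib.parse import urlparse

# Hypothetical mapping for illustration; only file:// appears in this guide.
BACKENDS = {"file": "filesystem", "s3": "object-store"}

def resolve_backend(uri):
    """Return (backend, path) for a data URI; reject unknown schemes."""
    parsed = urlparse(uri)
    if parsed.scheme not in BACKENDS:
        raise ValueError(f"unsupported URI scheme: {parsed.scheme!r}")
    return BACKENDS[parsed.scheme], parsed.path
```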
Data versioning convention¶
Run outputs will be versioned under:
output/<workflow>/run_<run_id>/
where `run_id` is an ISO 8601 basic-format UTC timestamp, `YYYYMMDDTHHMMSSZ` (e.g., `20260315T111551Z`). This matches the current `write_run_metadata()` convention and will become the canonical provenance key.
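A `run_id` in that format can be generated with the standard library alone; the helper name is illustrative:

```python
from datetime import datetime, timezone

def make_run_id(now=None):
    """UTC timestamp in YYYYMMDDTHHMMSSZ form, usable as a provenance key."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y%m%dT%H%M%SZ")
```

Because the fields are ordered most-significant first, these IDs sort lexicographically in chronological order, which is what makes the `ls -dt`-style rotation above work on run directory names too.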
Provenance records¶
Each output (image, table, network) will have a JSON sidecar:
{
"uri": "file:///output/cellpose/run_20260315/TCell.tif",
"data_type": "Label",
"run_id": "20260315T111551Z",
"produced_by": "CellPose",
"params": {"model_type": "cpsam", "diameter": 20},
"git_sha": "94e8d3f",
"completed_at": "2026-03-15T14:23:01Z"
}
Current state (pre-abstraction)¶
Until the URI scheme is implemented, use notebooks/config.user.toml (gitignored) for all path overrides:
[paths]
data_root = "/gpfs/your/data/root"
output_root = "/gpfs/your/output/root"
and cyto.utils.load_notebook_config() / load_db_config() to resolve paths in notebooks and scripts.
Cross-References¶
DataOps Storage Layout — full path contract and migration checklist
Cluster Jupyter Setup — interactive session setup
Containerized Execution — Apptainer on HPC