Benchmark — Horizontal Scaling Performance

pyCyto includes a reproducible 4-stage benchmark suite that measures wall-time speedup at N=1, 2, 4, and 8 parallel SLURM workers.


Running the Benchmark

# Configure: copy defaults and override dataset root, GPU gres, frame count
cp scripts/benchmark/config/benchmark.def.toml scripts/benchmark/benchmark.user.toml
$EDITOR scripts/benchmark/benchmark.user.toml

# Submit all 4 stages (non-interactive)
pixi run python scripts/benchmark/run_benchmark.py

# Collect results after all jobs finish
pixi run python scripts/benchmark/collect_results.py --run-id <RUN_ID>

Results land in output/benchmark/master/run_<RUN_ID>/:

  • benchmark_results_aggregated.csv — per-stage wall times and speedups

  • benchmark_results.{png,svg,pdf} — timing figure

  • benchmark_speedup.{png,svg,pdf} — speedup figure

  • provenance_manifest.json — reproducibility metadata

  • scaling_analysis.md — human-readable interpretation


Scaling Results (UTSE dataset, 80 frames, A100 40 GB, run 20260317T114224Z)

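| Stage            | N=1 (s) | N=2 (s) | N=4 (s) | N=8 (s) | Speedup @N=8 | Efficiency @N=8 |
|------------------|---------|---------|---------|---------|--------------|-----------------|
| register_denoise | 9703    | 4902    | 2580    | 1508    | 6.44×        | 80%             |
| cellpose (ch0)   | 512     | 313     | 216     | 165     | 3.10×        | 39%             |
| cellpose (ch1)   | 475     | 309     | 221     | 169     | 2.81×        | 35%             |
| contact          | 807     | 625     | 414     | 400     | 2.02×        | 25%             |
| trackmate (ch0)  | 555     | 332     | 346     | 313     | 1.77×        | 22%             |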

Parallel efficiency = speedup / N. Ideal is 100% (linear scaling).
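As a minimal sketch of the definition above, the speedup and efficiency columns can be recomputed from the wall times in the table. The dictionary and function names here are illustrative and are not part of the pyCyto scripts. Because the table shows whole-second times, a recomputed value may differ from the published one in the last digit (for example register_denoise).

```python
# Wall times in seconds, copied from the results table (run 20260317T114224Z).
WALL_TIMES = {
    "register_denoise": {1: 9703, 2: 4902, 4: 2580, 8: 1508},
    "cellpose (ch0)": {1: 512, 2: 313, 4: 216, 8: 165},
    "cellpose (ch1)": {1: 475, 2: 309, 4: 221, 8: 169},
    "contact": {1: 807, 2: 625, 4: 414, 8: 400},
    "trackmate (ch0)": {1: 555, 2: 332, 4: 346, 8: 313},
}


def speedup(times, n):
    """Speedup at n workers relative to the single-worker wall time."""
    return times[1] / times[n]


def efficiency(times, n):
    """Parallel efficiency = speedup / N (1.0 means linear scaling)."""
    return speedup(times, n) / n


for stage, times in WALL_TIMES.items():
    print(f"{stage:18s} speedup@8={speedup(times, 8):.2f}x  "
          f"efficiency@8={efficiency(times, 8):.0%}")
```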

[Figure: Benchmark wall times by stage and worker count]

[Figure: Speedup vs number of workers, all stages]

Both figures from run 20260317T114224Z (UTSE dataset, 80 frames, A100 40 GB GPU).


Key Findings

Registration/Denoising — Near-Linear (80% efficiency)

ANTs image registration is an embarrassingly parallel CPU workload. Each frame is registered independently, so doubling workers nearly halves wall time. The remaining 20% loss is from GPFS I/O overhead (loading/writing large TIFFs) and minor load imbalance.

Practical guidance: use N=8 for production runs. Registration dominates serial wall time at ~2.7 hours (9703 s); N=8 brings this down to about 25 minutes (1508 s).

Cellpose Segmentation — Moderate (39% efficiency)

Cellpose segmentation runs on the GPU (A100 40 GB). At N=1, the GPU is already well utilised processing a full time-lapse batch. Splitting into N=8 workers gives each job a smaller frame batch, which reduces per-batch latency but adds SLURM scheduling and I/O overhead per job. Diminishing returns set in beyond N=4.

Practical guidance: N=4 is the practical optimum (2.36× speedup on ch0, with lower queue time than N=8).

Contact Analysis — Low (25% efficiency)

Contact detection (Delaunay triangulation, nearest-neighbour search) is primarily CPU-bound, with some GPU-accelerated pyclesperanto steps. Scaling is limited by uneven contact density across frame windows: windows with dense cell populations take disproportionately longer, so a single straggler job determines the wall time.

Practical guidance: N=4 (1.95×) delivers most of the available speedup. N=8 offers marginal improvement (+0.07×) at higher queue overhead.
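The diminishing return can be checked directly from the contact-stage wall times. The sketch below (function name is illustrative, not from the pyCyto scripts) computes how much speedup each additional step in worker count contributes, assuming the smallest worker count in the dictionary is the serial baseline.

```python
# Contact-stage wall times in seconds, from the results table.
CONTACT_TIMES = {1: 807, 2: 625, 4: 414, 8: 400}


def marginal_gains(times):
    """Return {N: speedup gained by moving from the previous worker count to N}."""
    counts = sorted(times)
    base = times[counts[0]]  # assumed to be the N=1 (serial) wall time
    gains = {}
    for prev, cur in zip(counts, counts[1:]):
        gains[cur] = base / times[cur] - base / times[prev]
    return gains


for n, gain in marginal_gains(CONTACT_TIMES).items():
    print(f"N={n}: +{gain:.2f}x over the previous worker count")
```

The last step (N=4 to N=8) contributes far less than the step before it, which is the basis for recommending N=4 here.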

TrackMate Tracking — Barely Improves

TrackMate achieves only 1.77× at N=8. This is expected, for three compounding reasons:

  1. Sequential linking constraint: TrackMate links cells across consecutive frames in order within each time window. No intra-window parallelism is possible — only splitting across time windows helps.

  2. JVM startup overhead: each SLURM job starts a fresh Java VM. With N=8 and 80 frames, each job processes 10 frames but still pays full JVM startup cost (~60–90 s). The startup-to-compute ratio worsens as window size shrinks.

  3. Straggler problem: the N=4 wall time (346 s) exceeds the N=2 wall time (332 s). The frame window with the highest cell density and longest tracks takes 346 s to process 20 frames — longer than either N=2 job’s 40 frames (332 s). Splitting concentrates the hard frames in one job rather than averaging them across a larger window.

TrackMate scaling:
  N=2: wall=332s, both jobs: 332s each (std≈0)       ← balanced
  N=4: wall=346s, avg=297s, std=33s                  ← straggler at 346s
  N=8: wall=313s, avg=246s, std=100s                 ← very high variance
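The interaction between fixed startup cost and uneven windows can be illustrated with a toy model. This is a sketch under stated assumptions, not the benchmark's own analysis code: frames are split into contiguous, equal-sized windows, each job pays a fixed startup cost, and the per-frame costs are hypothetical numbers chosen only to show the effect. The model is purely additive, so it is not fitted to the measurements and does not reproduce the N=4 versus N=2 inversion reported above.

```python
from statistics import mean, pstdev


def split_windows(frame_costs, n_jobs):
    """Split per-frame costs into n_jobs contiguous, equal-sized windows.

    Assumes len(frame_costs) is divisible by n_jobs.
    """
    size = len(frame_costs) // n_jobs
    return [frame_costs[i * size:(i + 1) * size] for i in range(n_jobs)]


def job_stats(frame_costs, n_jobs, startup=0.0):
    """Return (wall, mean, std) of per-job durations.

    Wall time is the duration of the slowest job (the straggler).
    """
    durations = [startup + sum(w) for w in split_windows(frame_costs, n_jobs)]
    return max(durations), mean(durations), pstdev(durations)


# HYPOTHETICAL per-frame costs in seconds: 60 sparse frames, then 20 dense frames.
costs = [1.0] * 60 + [5.0] * 20

for n in (1, 2, 4, 8):
    # startup=30.0 is a placeholder for a fixed per-job cost such as JVM startup.
    wall, avg, std = job_stats(costs, n, startup=30.0)
    print(f"N={n}: wall={wall:.0f}s avg={avg:.0f}s std={std:.0f}s")
```

In this toy setup the window holding the dense frames sets the wall time at every N, and the fixed startup cost makes up a growing share of each job as windows shrink.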

Conclusion: TrackMate is the horizontal-scaling bottleneck of the pipeline. N=2 (1.67×) is the practical limit; beyond that, JVM startup overhead and the straggler effect cancel most of the gain, and the improvement is not reliable. GPU-native trackers (e.g. CUDA IoU) could remove this bottleneck.


Stage-Level Guidance

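| Stage            | Recommended N | Rationale                                |
|------------------|---------------|------------------------------------------|
| register_denoise | 8             | Near-linear; dominates serial runtime    |
| cellpose         | 4             | Best speedup-to-queue-time ratio         |
| contact          | 4             | Most speedup before diminishing returns  |
| trackmate        | 2             | JVM overhead + straggler; N>2 unreliable |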
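For a rough sense of what these recommendations mean end to end, the sketch below sums the measured wall times at the recommended worker counts and compares them with the all-serial total. It assumes the stages (and the two Cellpose channels) run back to back and ignores SLURM queue wait, neither of which is stated by the benchmark, so treat the result as an estimate only. TrackMate was measured on ch0 only.

```python
# Serial (N=1) wall times in seconds, from the results table.
SERIAL = {
    "register_denoise": 9703,
    "cellpose (ch0)": 512,
    "cellpose (ch1)": 475,
    "contact": 807,
    "trackmate (ch0)": 555,
}

# (recommended N, measured wall time in seconds at that N), from the results table.
RECOMMENDED = {
    "register_denoise": (8, 1508),
    "cellpose (ch0)": (4, 216),
    "cellpose (ch1)": (4, 221),
    "contact": (4, 414),
    "trackmate (ch0)": (2, 332),
}

serial_total = sum(SERIAL.values())
recommended_total = sum(seconds for _, seconds in RECOMMENDED.values())
overall_speedup = serial_total / recommended_total

print(f"serial total:      {serial_total} s")
print(f"recommended total: {recommended_total} s")
print(f"overall speedup:   {overall_speedup:.2f}x (excluding queue time)")
```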


Benchmark Infrastructure

The benchmark suite lives in scripts/benchmark/:

scripts/benchmark/
├── run_benchmark.py        # orchestrator (submit + poll all 4 stages)
├── collect_results.py      # result collection + validation + plots
├── benchmark_config.py     # config loader + path resolver
├── config/
│   └── benchmark.def.toml  # default config (committed)
├── sbatch/                 # per-stage SLURM submission scripts
└── logs/                   # SLURM output logs

See scripts/benchmark/config/benchmark.def.toml for all configurable parameters (dataset root, GPU gres, frame count, tolerance thresholds).

For detailed troubleshooting see the operator guide.