Compute Node Model¶
A pyCyto compute node is any Python callable that accepts a dict and returns a dict. This simple contract is what makes the pipeline composable, testable in isolation, and runnable without modification across notebooks, CLI, and SLURM.
The Pure Function Model¶
```mermaid
graph LR
    A["input_dict"] --> B["Module(params)"]
    B --> C["output_dict"]
    classDef io fill:#1e293b,color:#94a3b8,stroke:#334155
    classDef mod fill:#0d7377,color:#fff,stroke:#0a5c60
    class A,C io
    class B mod
```
Every node is:

- **Stateless per call**: all algorithm state lives in `__init__` parameters; the `__call__` method is side-effect-free (except for optional file I/O via `output: true`)
- **Type-agnostic**: the dict can carry any data — the node only accesses the keys it needs
- **Scheduler-blind**: the node has no knowledge of SLURM, MPI, or job arrays; the compute spec is sideloaded by the runner
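To make the contract concrete, here is a minimal sketch of a conforming node. The `InvertNode` name and its `offset` parameter are illustrative, not part of pyCyto:

```python
import numpy as np

class InvertNode:
    """Hypothetical node: returns offset - image. All state is fixed in __init__."""
    def __init__(self, offset=1.0):
        self.offset = offset

    def __call__(self, data: dict) -> dict:
        image = data["image"]                  # reads only the key it needs
        return {"image": self.offset - image}  # no side effects; input dict untouched

node = InvertNode()
payload = {"image": np.zeros((2, 3), dtype="float32"), "extra": "ignored"}
out = node(payload)
# Stateless: calling again with the same input gives the same output.
assert np.array_equal(node(payload)["image"], out["image"])
```

Because the node never touches `payload["extra"]`, the same callable can sit anywhere in a pipeline that happens to carry additional keys.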
Distinction between params and compute spec¶
| Concern | Where defined | Example |
|---|---|---|
| Algorithm params | pipeline YAML, under the node's `args:` block | `threshold: 0.4` |
| Compute spec | resource YAML (e.g. `configs/distributed/pipeline-resources.yaml`) | `partition: short`, `mem: 16G` |
The node class never reads the resource config. This allows the same YAML pipeline to run locally (no resource config) or distributed (resource config supplied separately).
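A sketch of this separation, with hypothetical names (`ScaleNode` and the inline dicts stand in for the two YAML files):

```python
class ScaleNode:
    """Hypothetical node: multiplies 'image' by a factor. Knows nothing about SLURM."""
    def __init__(self, factor=2.0):
        self.factor = factor

    def __call__(self, data: dict) -> dict:
        return {"image": data["image"] * self.factor}

algo_params = {"factor": 3.0}                        # from the pipeline YAML args block
compute_spec = {"partition": "short", "mem": "16G"}  # from the resource YAML, if any

node = ScaleNode(**algo_params)  # constructed from algorithm params only
# The runner, not the node, decides what to do with compute_spec:
# locally it is simply ignored; on SLURM it would become sbatch flags.
result = node({"image": 5.0})
```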
Dict I/O Contract¶
Nodes communicate exclusively through a Python dict with standardized keys:
| Key | Type | Shape / format | Direction |
|---|---|---|---|
| `image` | Dask or NumPy array | | in / out |
| `label` | Dask or NumPy array | | in / out |
| | pandas or Dask DataFrame | one row per cell per frame | in / out |
| | NetworkX graph | nodes = cell IDs, edges = contacts | in / out |
Nodes consume only the keys they need and return only the keys they produce. They may pass through keys unchanged, but must not silently modify keys they are not responsible for.
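A sketch of explicit pass-through (the `LabelNode` name is hypothetical): keys the node does not own are copied through unchanged, and only its own key is written.

```python
class LabelNode:
    """Hypothetical node showing explicit key pass-through."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def __call__(self, data: dict) -> dict:
        out = dict(data)                               # pass other keys through unchanged
        out["label"] = data["image"] > self.threshold  # write only its own key
        return out

node = LabelNode()
result = node({"image": 0.9, "metadata": "survives"})
```

Copying into a new dict (rather than mutating `data` in place) keeps the call side-effect-free from the caller's point of view.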
Input/output key table by stage¶
| Stage | Reads | Writes |
|---|---|---|
| Preprocessing | | |
| Segmentation | `image` | `label` |
| Tabulation | | |
| Tracking | | |
| Postprocessing | any combination | any combination |
Minimal End-to-End Example¶
1. Define the node¶
```python
from tqdm import tqdm

class ThresholdSegmentation:
    def __init__(self, threshold=0.5, verbose=True):
        self.name = "ThresholdSegmentation"
        self.threshold = threshold
        self.verbose = verbose

    def __call__(self, data: dict) -> dict:
        image = data["image"]  # read
        if self.verbose:
            tqdm.write(f"[{self.name}] {image.shape}")
        label = (image > self.threshold).astype("uint32")
        return {"label": label}  # write
```
2. Call it from Python / Jupyter¶
```python
import numpy as np
from my_module import ThresholdSegmentation

seg = ThresholdSegmentation(threshold=0.4)
result = seg({"image": np.random.rand(10, 512, 512).astype("float32")})
print(result["label"].shape)  # (10, 512, 512)
```
3. Wire it into a pipeline YAML¶
```yaml
pipeline:
  segmentation:
    - name: ThresholdSegmentation
      tag: Threshold_TCell
      channels: [TCell]
      input_type: image
      args:
        threshold: 0.4
        verbose: true
      output_type: label
      output: true
```
4. Add a resource spec for SLURM¶
```yaml
# configs/distributed/pipeline-resources.yaml
pipeline:
  segmentation:
    Threshold_TCell:
      partition: short
      cpus-per-task: 4
      mem: 16G
      time: "01:00:00"
      batch_size: 100
      dependency: Normalize
      dependency_type: afterok
```
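One way such a spec could translate into `sbatch` flags (an illustrative sketch, not pyCyto's actual runner logic; `batch_size` and `dependency` would need separate handling):

```python
# Hypothetical mapping of a compute spec to sbatch command-line flags.
spec = {"partition": "short", "cpus-per-task": 4, "mem": "16G", "time": "01:00:00"}
flags = [f"--{key}={value}" for key, value in spec.items()]
```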
Dask Arrays¶
Nodes should return Dask arrays (lazy) rather than NumPy arrays (eager) wherever possible. This allows downstream stages to fuse operations and avoids loading the full time-series into memory.
```python
import dask.array as da

# Prefer:
result = da.map_blocks(my_fn, image, dtype="float32")

# Avoid unless necessary:
result = my_fn(image.compute())
```
Call `.compute()` only at the final write step (`output: true`), not inside node logic.
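A sketch of the lazy-until-the-end pattern (the `normalize` helper, array contents, and chunking are illustrative):

```python
import dask.array as da
import numpy as np

def normalize(block):
    # Per-block rescale; runs lazily, one chunk at a time.
    return block / 255.0

image = da.from_array(np.arange(32, dtype="float32").reshape(2, 4, 4),
                      chunks=(1, 4, 4))
scaled = da.map_blocks(normalize, image, dtype="float32")  # lazy
mask = scaled > 0.1                                        # still lazy, fused with above
result = mask.compute()                                    # single compute at the end
```

Both stages fuse into one pass over the chunks, so at most one chunk is materialized in memory at a time.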
Testing a Node in Isolation¶
```python
import numpy as np
from my_module import ThresholdSegmentation

def test_threshold_seg():
    rng = np.random.default_rng(0)
    image = rng.random((5, 64, 64), dtype="float32")
    seg = ThresholdSegmentation(threshold=0.5, verbose=False)
    result = seg({"image": image})
    assert "label" in result
    assert result["label"].shape == image.shape
    assert result["label"].dtype == np.uint32
```
See Boilerplate Templates for a complete test stub.
See Also¶
Module Template — annotated class template with checklist
Plugin Integration — registration patterns for new algorithms
Architecture — the layered suite model and five data types