Developer Guide

pyCyto is built around a single architectural principle: every pipeline stage is a stateless Python callable that maps a dictionary to a dictionary. This page explains how to extend the pipeline, how the compute node model works, and how to package and containerize new modules.

The Compute Node Model

Each pipeline stage is a self-contained unit of computation:

input_dict  ──►  Module(params)  ──►  output_dict

The module receives all inputs through a dictionary, performs computation, and returns outputs through a dictionary. No global state, no file I/O inside the module class itself (I/O is handled by the pipeline orchestrator). This design means the same module runs identically whether it is called:

  • interactively in a notebook

  • via cyto --pipeline my_pipeline.yaml

  • inside a SLURM sbatch job

  • inside a container (Docker or Apptainer)

Dictionary key contract

Stage            Input key(s)             Output key(s)
Preprocessing    "image"                  "image"
Segmentation     "image" / "label"        "label"
Tabulation       "image" / "label"        "dataframe"
Tracking         "dataframe"              "dataframe"
Contact          "label" / "dataframe"    "network", "dataframe"
Postprocessing   any                      arbitrary

Always use these exact string keys. The pipeline YAML router uses them to wire stages together.
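Under this contract, the orchestrator can wire stages together by threading one dictionary through each callable and merging the outputs. A minimal sketch — the `Invert` and `Threshold` classes here are hypothetical stand-ins, not actual pyCyto modules:

```python
import numpy as np

# Hypothetical stages used only to illustrate the key contract.
class Invert:
    def __call__(self, data: dict) -> dict:
        return {"image": 255 - data["image"]}       # Preprocessing: "image" -> "image"

class Threshold:
    def __init__(self, level=128):
        self.level = level

    def __call__(self, data: dict) -> dict:
        # Segmentation: "image" -> "label"
        return {"label": (data["image"] > self.level).astype(np.uint8)}

data = {"image": np.array([[0, 200], [100, 255]], dtype=np.uint8)}
for stage in (Invert(), Threshold(level=128)):
    data.update(stage(data))                        # merge each stage's output keys

print(sorted(data))                                 # → ['image', 'label']
```

Because every stage reads and writes the same standardized keys, the loop above never needs stage-specific glue code.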


Module Template

All pipeline modules follow this class structure:

from tqdm import tqdm

class MyModule:
    def __init__(self, param1=default_value, verbose=True) -> None:
        """
        Short description.

        Args:
            param1: Description.
            verbose (bool): Enable progress logging.
        """
        self.name = "MyModule"
        self.param1 = param1
        self.verbose = verbose

    def __call__(self, data: dict) -> dict:
        """
        Process data.

        Args:
            data (dict): Input with standardized keys.

        Returns:
            dict: Output with standardized keys.
        """
        image = data["image"]

        if self.verbose:
            tqdm.write(f"[{self.name}] processing ...")

        result = _my_algorithm(image, self.param1)
        return {"image": result}

Rules:

  1. Use tqdm.write() for logging — not print() — so progress bars are not broken.

  2. Prefer Dask arrays over NumPy for inputs/outputs so lazy evaluation propagates.

  3. Do not open files inside __call__. File paths belong in __init__ or in the orchestrator.

  4. Raise ValueError (not AssertionError) for invalid inputs.
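Rule 4 in practice — a sketch of explicit input validation at the top of __call__. The `SafeModule` class is illustrative; only the key names come from the contract above:

```python
import numpy as np

class SafeModule:
    name = "SafeModule"

    def __call__(self, data: dict) -> dict:
        # Raise ValueError (not AssertionError): assertions vanish under
        # `python -O`, and ValueError carries a user-facing message.
        if "image" not in data:
            raise ValueError(f"[{self.name}] missing required key 'image'")
        image = data["image"]
        if getattr(image, "ndim", 0) < 2:
            raise ValueError(f"[{self.name}] expected an array with ndim >= 2")
        return {"image": image}
```

Validating up front keeps error messages tied to the stage that received bad data rather than surfacing deep inside the algorithm.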

Adding to the pipeline

  1. Place the module in the appropriate cyto/<stage>/ subpackage.

  2. Add an import in cyto/<stage>/__init__.py.

  3. Add dependencies to pixi.toml ([dependencies] for universal, [feature.<env>.dependencies] for optional).

  4. Add a YAML block in an example pipeline under pipelines/.

  5. For distributed execution, add a corresponding sbatch template in distributed/<stage>/.
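For step 4, a stage block in an example pipeline might look like the following. The stage name, parameter names, and file path are illustrative — match them to your module and to the repository's actual template:

```yaml
# pipelines/example_pipeline.yaml (illustrative fragment)
- stage: preprocessing
  module: MyModule          # class exported from cyto/preprocessing/__init__.py
  params:
    param1: 0.5
    verbose: true
```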


Plugin Integration Guide

Some analysis steps rely on external tools (Fiji/TrackMate, pyclesperanto, ANTs). The integration pattern is the same in each case: wrap the external tool in a module class that satisfies the dictionary contract.

Baremetal integration

The external tool is called directly via Python bindings or subprocess:

# TrackMate via PyImageJ
import imagej

fiji_dir = None  # optional: path to a local Fiji installation
ij = imagej.init(fiji_dir or 'sc.fiji:fiji', headless=True)

Container integration

For tools that have conflicting dependencies, run them inside a container and pass data via temporary files or shared memory:

import subprocess, tempfile
import numpy as np

with tempfile.NamedTemporaryFile(suffix=".npy") as f:
    np.save(f.name, data["image"])  # hand the array to the container via a temp file
    subprocess.run(
        ["apptainer", "exec", "--nv", "tool.sif", "python", "run_tool.py", f.name],
        check=True,  # surface failures from the containerized tool
    )

SLURM job integration

For tools that need their own SLURM job (e.g. multi-GPU), write a batch script template and submit via the distributed/ orchestrator:

#!/bin/bash
# distributed/tracking/batch_trackmate.sbatch
#SBATCH --partition=short
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
pixi run -e imagej python distributed/tracking/batch_trackmate.py "$@"

Docker → Apptainer Migration

HPC clusters typically prohibit Docker (requires root). Use Apptainer (formerly Singularity) instead.

Converting a Docker image to Apptainer SIF

# Build Docker image locally first
docker build -f containers/Dockerfile -t cyto-gpu:latest .

# Convert to SIF (can also pull directly from Docker Hub)
apptainer build containers/images/cyto-gpu.sif docker-daemon://cyto-gpu:latest

# Or build directly from a definition file (no Docker required)
apptainer build containers/images/cyto-gpu.sif containers/apptainer/cyto-gpu.def

Running with Apptainer

# GPU-enabled run
apptainer exec --nv containers/images/cyto-gpu.sif \
    pixi run -e gpu python scripts/benchmark/run_benchmark.py

# Interactive shell
apptainer shell --nv containers/images/cyto-gpu.sif

Build script

# Automated build (from repo root)
bash containers/apptainer/build.sh

The SIF path is configured via gpu_sif in scripts/benchmark/config/benchmark.def.toml. Set it in your benchmark.user.toml once built.


Dependency Management (pixi.toml)

⚠️ Add all new dependencies to pixi.toml, not to a requirements file or env YAML.

# Always-required (in default env)
[dependencies]
scipy = ">=1.10"

# PyPI-only dependency
[pypi-dependencies]
my-package = ">=1.0"

# Optional: only in the cellpose feature env
[feature.cellpose.pypi-dependencies]
cellpose = ">=3.0"

After editing pixi.toml, run pixi install (or pixi install -e <env>) to rebuild. Run pixi run pytest tests/ to verify.


Dask Support

Prefer Dask arrays over NumPy for large datasets. Dask enables lazy evaluation — data is only read/computed when explicitly requested via .compute().

import dask.array as da

# Lazy allocation — no chunk is materialized until .compute()
arr = da.zeros((10000, 2048, 2048), chunks=(1, 2048, 2048))

# Only compute what you need
result = arr[0:10].compute()

Avoid calling .compute() inside module __call__ unless strictly necessary — let the orchestrator decide when to materialize.
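Rule 2 of the module template follows the same idea: if a module transforms chunks with `dask.array.map_blocks` and never calls `.compute()`, laziness propagates through the whole pipeline. A sketch — the `LazyThreshold` class is illustrative, not an actual pyCyto module:

```python
import numpy as np
import dask.array as da

class LazyThreshold:
    """Illustrative module: per-chunk thresholding that stays lazy."""
    def __init__(self, level=0.5):
        self.level = level

    def __call__(self, data: dict) -> dict:
        image = data["image"]
        # map_blocks only extends the task graph; nothing runs here
        label = image.map_blocks(lambda b: (b > self.level).astype(np.uint8))
        return {"label": label}

stack = da.random.random((8, 64, 64), chunks=(1, 64, 64))   # lazy 3D stack
label = LazyThreshold(0.5)({"image": stack})["label"]       # still lazy
frame0 = label[0].compute()                                 # materialize one frame only
```

The orchestrator (or the user) then decides when, and how much, to materialize.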


Testing

All new modules should have unit tests in tests/:

# Run all tests
pixi run pytest tests/

# Run a specific test file
pixi run pytest tests/test_preprocessing.py -v

Tests must not require GPU access unless guarded by a skip:

import pytest

torch = pytest.importorskip("torch")  # skip if PyTorch is not available
requires_gpu = pytest.mark.skipif(not torch.cuda.is_available(), reason="GPU not available")

Documentation

Build the Sphinx docs locally:

cd doc/
pixi run make html
# Open: doc/_build/html/index.html

All new public classes and functions must have NumPy-style docstrings. Sphinx autodoc picks them up automatically.


Documentation Impact Classification

When opening a pull request or committing changes, classify the docs impact so reviewers know what to update:

Classification   Meaning                                                         Required docs action
none             Internal implementation only; no user-visible behavior change   No docs update required
minor            Adds a parameter, changes a default, or fixes a bug             Update the relevant docstring and YAML example
major            New module, new stage type, new interface, or breaking change   Update API reference, pipeline.md stage section, and cross-links

Minimum required updates for major docs-impact changes

  1. Canonical page update — add or update the relevant section in doc/source/ (pipeline stage, setup step, etc.)

  2. API docstring — NumPy-style docstring on the class and all public methods

  3. YAML example — add an annotated YAML snippet to configs/pipelines/pipeline.template.yaml with a comment

  4. Cross-link check — verify all pages that reference the changed module or path still resolve correctly (run make html and check for warnings)

  5. Changelog entry — add a bullet to the relevant OpenSpec change tasks file if the change is tracked there

Cross-change alignment

Before merging, check with team members working on overlapping areas. If two people are editing the same doc page or API module at the same time, coordinate to avoid contradictory guidance or overwritten work.

Pre-merge checklist

Before opening a pull request for any major docs-impact change, verify:

  • [ ] Docs impact classified: is this none / minor / major?

  • [ ] Docstrings added or updated (NumPy style) on any new or changed public API

  • [ ] YAML example updated in configs/pipelines/pipeline.template.yaml if a new module was added

  • [ ] cd doc && pixi run make html passes with 0 new warnings

  • [ ] Cross-links work: no 404s in the rendered HTML for pages you touched