pyCyto Analysis Pipeline

pyCyto processes time-lapse microscopy data through six ordered stages, each defined declaratively in a YAML configuration file. The same YAML runs locally on a single node or, when paired with a resource config that specifies per-stage compute allocation, distributed across SLURM jobs.

Configuration files:

File                                           Purpose
configs/pipelines/pipeline.template.yaml       Algorithm parameters, channel paths, analysis steps
configs/distributed/pipeline-resources.yaml    SLURM resources (memory, partition, GPU, batch size) per stage


Stage Color Scheme

The color coding below is consistent across this documentation and the planned React Flow UI. The palette is defined in configs/pipeline-colors.toml.

  • File I/O
  • Preprocessing
  • Segmentation
  • Tracking
  • Analysis / Postprocessing
  • Output

Pipeline Diagram

    graph TD
    A([Raw Microscopy Images]) --> AA[Spatial Tiling]
    AA --> B([File I/O: Load Channels])

    B --> C[Preprocessing]
    C --> D[Register and Denoise]
    C --> E[Channel Merge]
    C --> F[Intensity Normalization]
    C --> G[Gamma Correction]
    D --> H[Segmentation]
    E --> H
    F --> H
    G --> H

    H --> I[Cellpose]
    H --> J[StarDist]
    I --> K[Tabulation]
    J --> K

    K --> L[Label to Sparse Features]
    L --> M[Tracking]

    M --> N[TrackMate sparse]
    M --> O[trackpy sparse]
    M --> P[Ultrack dense]
    N --> Q[Analysis]
    O --> Q
    P --> Q

    Q --> R[Contact Tracing]
    Q --> S[Kinematics]
    Q --> T[Cell Networks]
    R --> U([Results: Tables, Plots, Networks])
    S --> U
    T --> U

    classDef stageIO      fill:#1e293b,color:#f1f5f9,stroke:#475569
    classDef stagePre     fill:#0d7377,color:#fff,stroke:#0a5c60
    classDef stageSeg     fill:#7c3aed,color:#fff,stroke:#6d28d9
    classDef stageTab     fill:#0369a1,color:#fff,stroke:#075985
    classDef stageTrack   fill:#0369a1,color:#fff,stroke:#075985
    classDef stageAna     fill:#b45309,color:#fff,stroke:#92400e
    classDef stageOut     fill:#166534,color:#fff,stroke:#14532d
    classDef stageOpt     fill:#6b7280,color:#fff,stroke:#4b5563

    class A,B stageIO
    class C,D,E,F stagePre
    class G stageOpt
    class H,I,J stageSeg
    class K,L stageTab
    class M,N,O stageTrack
    class P stageOpt
    class Q,R,S,T stageAna
    class U stageOut
    class AA stageOpt
    

Tip: The diagram scrolls horizontally if your viewport is narrow. Use browser zoom (Ctrl +/-) to resize.


Stage Reference

Each section below maps one pipeline stage to its YAML configuration block. All stages follow the same module contract: output = Module(args)({"image": …, "label": …}).
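The contract can be sketched as a plain callable class. GammaCorrection and its gamma argument here are hypothetical examples, not pyCyto code; the point is the shape output = Module(args)(inputs):

```python
# Hypothetical sketch of the module contract: a module is constructed
# with its YAML `args`, then called with a dict of named inputs.
class GammaCorrection:
    """Toy module illustrating output = Module(args)({"image": ..., "label": ...})."""

    def __init__(self, args):
        self.gamma = args.get("gamma", 1.0)

    def __call__(self, data):
        # Consume the "image" entry; ignore inputs this step does not use.
        image = data["image"]
        corrected = [[px ** self.gamma for px in row] for row in image]
        return {"image": corrected}

# Usage: instantiate with args, call with the input dict.
out = GammaCorrection({"gamma": 2.0})({"image": [[0.5, 1.0]], "label": None})
```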


1. Channel Inputs

Define the input data sources. Each key becomes the channel name used throughout the pipeline.

channels:
  TCell:      /data/datasets/my_experiment/ch0_tcell.tif
  CancerCell: /data/datasets/my_experiment/ch1_cancer.tif
  Dead:       /data/datasets/my_experiment/ch2_pi.tif

image_range:
  x: [0, null, 1]    # [start, stop, step] — null means full extent
  y: [0, null, 1]
  t: [0, 100, 1]     # first 100 frames

spacing: [0.83, 0.83]   # pixel size in microns [y, x]
output_dir: output/my_experiment

Key fields:

  • channels — one path per fluorescence channel; keys are arbitrary names

  • image_range — crop or subsample before processing; saves memory on large datasets

  • spacing — physical pixel size; used for velocity and displacement in microns

  • output_dir — root for all stage outputs
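The assumed semantics of image_range map directly onto Python slicing, with null becoming None (no bound). range_to_slice below is a hypothetical helper, not pyCyto API:

```python
# Sketch (assumed semantics): turn an image_range entry [start, stop, step]
# into a Python slice; null in YAML arrives as None, meaning "full extent".
def range_to_slice(spec):
    start, stop, step = spec
    return slice(start, stop, step)

t_slice = range_to_slice([0, 100, 1])   # first 100 frames
x_slice = range_to_slice([0, None, 1])  # full x extent
```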


2. Preprocessing (Image → Image)

Preprocessing steps run in sequence. Each step writes its output TIFFs to output_dir/preprocessing/<tag>/ when output: true.

pipeline:
  preprocessing:

    - name: RegisterDenoise       # ANTs registration + BM3D denoising
      tag: RegisterDenoise        # ← must match key in pipeline-resources.yaml
      channels: [TCell, CancerCell, Dead]
      args:
        stacks: 12        # frames per registration reference window
        parallel: true    # use all cpus-per-task cores (recommended for SLURM)
        batch_size: 0     # 0 = auto; or set explicit frame count per job
      output: true

    - name: ChannelMerge          # combine two channels into one
      tag: ChannelMerge_cancer
      channels: [[CancerCell, Dead]]   # list of [ch_a, ch_b] pairs to merge
      output_channel_name: CancerCellComposite
      output: true

    - name: PercentileNormalization
      tag: Normalize
      channels: all               # applies to every channel
      args:
        lp: 5                     # clip below 5th percentile → 0
        up: 95                    # clip above 95th percentile → 1
      output: true

Key fields:

  • tag — identifier used in logs, output paths, and as the dependency key in the resource config

  • channels — list of channel names to process; all means every defined channel

  • output: true — persist this step’s output to disk (set false for intermediate-only steps)

  • output_channel_name — for steps that create a new channel (e.g. ChannelMerge)
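What PercentileNormalization computes can be sketched in plain Python. The behaviour is assumed from the lp/up comments above: rescale so the lp-th percentile maps to 0 and the up-th to 1, clipping outside that range:

```python
# Minimal sketch (assumed behaviour) of percentile normalization with
# lp=5, up=95: intensities are rescaled to [0, 1] between the two
# percentiles and clipped outside them.
def percentile_normalize(values, lp=5, up=95):
    srt = sorted(values)

    def pct(p):  # nearest-rank percentile; good enough for a sketch
        idx = min(len(srt) - 1, int(round(p / 100 * (len(srt) - 1))))
        return srt[idx]

    lo, hi = pct(lp), pct(up)
    return [min(1.0, max(0.0, (v - lo) / (hi - lo))) for v in values]

pixels = list(range(101))            # toy intensities 0..100
norm = percentile_normalize(pixels)  # 5th percentile -> 0.0, 95th -> 1.0
```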

Distributed resource config (paired in pipeline-resources.yaml):

pipeline:
  preprocessing:
    RegisterDenoise:          # ← same tag as pipeline YAML
      partition: short
      cpus-per-task: 8
      mem: 32G
      time: "04:00:00"
      batch_size: 50          # frames per SLURM array job
      dependency: singleton   # "singleton" = no duplicate concurrent jobs
    Normalize:
      partition: short
      cpus-per-task: 2
      mem: 8G
      dependency: RegisterDenoise   # wait for RegisterDenoise to finish
      dependency_type: afterok
      batch_size: 100

3. Segmentation (Image → Label)

Detects cells and produces integer label masks. Each detected cell is an integer region; background is 0.

  segmentation:

    - name: CellPose              # Cellpose (GPU)
      tag: Cellpose_TCell
      channels: [TCell]
      input_type: image           # "image" or "label" (to refine an existing mask)
      args:
        model_type: cpsam         # model: cpsam, cyto2, nuclei, cyto3, ...
        diameter: 20              # expected cell diameter in pixels (0 = auto)
        cellprob_thresh: -3.0     # lower = more detections; range ~[-6, 6]
        batch_size: 128           # images per GPU forward pass
      output_type: label
      output: true

    - name: StarDist              # StarDist (GPU, TensorFlow)
      tag: StarDist_Cancer
      channels: [CancerCellComposite]
      input_type: image
      args:
        model_name: 2D_versatile_fluo
        prob_thresh: 0.3          # detection confidence threshold
        nms_thresh: 0.8           # non-maximum suppression overlap threshold
      output_type: label
      output: true

Key fields:

  • model_type / model_name — select the pretrained model; cpsam (Cellpose + SAM ViT) gives the best accuracy but requires an A100 for 2400×2400 images

  • diameter — critical tuning parameter; measure a few cells with Fiji/napari first

  • input_type — use "label" for refinement steps (morphological operations etc.)
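The integer-label convention can be illustrated with a toy 4-connected flood-fill labeller. This is illustrative only; Cellpose and StarDist derive their masks from model predictions, but the output format is the same kind of array:

```python
# Illustration of the label-mask convention (not pyCyto code): each
# connected foreground region gets a distinct positive integer ID;
# background stays 0.
def label_mask(binary):
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    next_id = 0
    for sy in range(h):
        for sx in range(w):
            if binary[sy][sx] and not labels[sy][sx]:
                next_id += 1               # new cell found
                stack = [(sy, sx)]
                while stack:               # 4-connected flood fill
                    y, x = stack.pop()
                    if 0 <= y < h and 0 <= x < w and binary[y][x] and not labels[y][x]:
                        labels[y][x] = next_id
                        stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return labels

mask = label_mask([[1, 1, 0],
                   [0, 0, 0],
                   [0, 1, 1]])   # two separate cells -> labels 1 and 2
```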

Distributed resource config:

  segmentation:
    Cellpose_TCell:
      partition: gpu_short
      gres: gpu:a100-pcie-40gb:1    # cpsam needs ≥40 GB VRAM on large patches
      cpus-per-task: 8
      mem: 32G
      batch_size: 200               # frames per GPU job
      dependency: RegisterDenoise
      dependency_type: afterok

4. Tabulation (Image + Label → Table)

Converts segmentation masks to a sparse pandas DataFrame — one row per cell per frame. This is the "dataframe" type in the pipeline data model.

  label_to_sparse:
    image_label_pair:
      - [TCell,               TCell]               # measure TCell signal with TCell labels
      - [CancerCellComposite, CancerCellComposite]  # measure cancer signal with cancer labels
      - [Dead,                CancerCellComposite]  # measure PI signal using cancer cell outlines
    output: true

Each [image_channel, label_channel] pair extracts region properties (centroid, area, mean intensity, feret radius) from image_channel within each labeled region in label_channel. Multiple pairs can share the same label channel.

Output columns (standard pyCyto DataFrame schema):

Column          Description
label           Cell ID within frame
frame           Time index (0-based)
x, y            Centroid position (pixels)
feret_radius    Estimated cell radius
mean, median    Intensity statistics
area            Cell area (pixels²)
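The tabulation step can be sketched in plain Python. Column semantics are assumed from the schema above; the real step also computes feret_radius and median:

```python
# Sketch of label-to-sparse tabulation: one row per labelled region,
# with centroid, area, and mean intensity taken from the paired
# image channel.
from collections import defaultdict

def tabulate(labels, image, frame=0):
    acc = defaultdict(lambda: {"xs": [], "ys": [], "vals": []})
    for y, row in enumerate(labels):
        for x, lbl in enumerate(row):
            if lbl:  # skip background (label 0)
                acc[lbl]["xs"].append(x)
                acc[lbl]["ys"].append(y)
                acc[lbl]["vals"].append(image[y][x])
    return [
        {"label": lbl, "frame": frame,
         "x": sum(a["xs"]) / len(a["xs"]),
         "y": sum(a["ys"]) / len(a["ys"]),
         "area": len(a["xs"]),
         "mean": sum(a["vals"]) / len(a["vals"])}
        for lbl, a in sorted(acc.items())
    ]

rows = tabulate([[1, 1], [0, 2]], [[10, 20], [0, 40]])
```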

Distributed resource config:

  label_to_sparse:
    core:                          # per-frame tabulation jobs
      partition: short
      cpus-per-task: 8
      mem: 32G
      batch_size: 100
      dependency: singleton
    merge:                         # final merge of all patch outputs
      partition: short
      cpus-per-task: 8
      mem: 128G                    # needs full dataset in memory for merge

5. Tracking (Table → Table)

Links per-frame cell detections into temporal tracks. Adds track_id and kinematic columns to the DataFrame.

  tracking:

    - name: TrackMate             # sparse input: operates on centroid table
      tag: TrackMate_TCell
      channels: [TCell]
      args:
        linking_max_distance: 15      # max pixels to link between frames
        gap_closing_max_distance: 15  # max pixels to close a track gap
        max_frame_gap: 2              # max frames a track may be absent
        verbose: true
      output: true

    # - name: Ultrack             # dense input: operates on label masks
    #   tag: Ultrack_cancer
    #   images: [CancerCellComposite]
    #   labels: [CancerCellComposite]
    #   output: true

Tracker choice guide:

Tracker     Input                      Best for
TrackMate   sparse (centroid table)    Standard 2D tracking, gap closing, Fiji ecosystem
trackpy     sparse (centroid table)    Simple nearest-neighbour; pure Python, no JVM
Ultrack     dense (label masks)        Crowded scenes, cell division, ILP global optimisation

TrackMate note: Requires the imagej pixi environment and either a pre-installed Fiji or an imagej.init('sc.fiji:fiji') auto-download. The TrackMate CSV Importer plugin loads pyCyto’s centroid table directly into TrackMate without re-running detection (see the TrackMate CSV Importer documentation).
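A toy greedy nearest-neighbour linker conveys the spirit of the sparse trackers and of linking_max_distance. This is illustrative only; TrackMate and trackpy also do gap closing and more global cost optimisation:

```python
# Toy frame-to-frame linker (not pyCyto code): each detection in the
# current frame links to the nearest unclaimed detection in the
# previous frame, if one lies within linking_max_distance.
import math

def link_frames(prev, curr, linking_max_distance=15):
    """Return {current index: previous index or None (new track)}."""
    links, taken = {}, set()
    for j, (cx, cy) in enumerate(curr):
        best, best_d = None, linking_max_distance
        for i, (px, py) in enumerate(prev):
            d = math.hypot(cx - px, cy - py)
            if i not in taken and d <= best_d:
                best, best_d = i, d
        links[j] = best
        if best is not None:
            taken.add(best)
    return links

links = link_frames([(0, 0), (50, 50)], [(3, 4), (90, 90)])
# Detection 0 links to previous cell 0 (distance 5); detection 1 is
# farther than linking_max_distance from everything, so it starts a
# new track (None).
```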

Distributed resource config:

  tracking:
    TrackMate_TCell:
      partition: short
      cpus-per-task: 8
      mem: 64G
      time: "04:00:00"
      dependency: [Cellpose_TCell]
      dependency_type: afterok
    Ultrack_cancer:
      dependency: singleton
      dependency_type: afterok
      database: {partition: long, mem: 64G, cpus-per-task: 16, time: "10-00:00:00"}
      segment:  {partition: short, mem: 4G,  cpus-per-task: 2}
      link:     {partition: short, mem: 4G,  cpus-per-task: 2}
      solve:    {partition: long,  mem: 32G, cpus-per-task: 4, time: "10-00:00:00"}
      export:   {partition: short, mem: 16G, cpus-per-task: 2}

6. Postprocessing / Analysis (any → any)

Analysis stages operate on any combination of images, labels, and DataFrames and can produce results tables, network graphs, and figures.

  postprocessing:

    - name: CrossCellContactMeasures   # GPU cell-cell contact detection
      tag: ContactTCellCancer
      channels: [[TCell, CancerCellComposite]]   # [effector, target] pairs
      input_type: [image, label, feature]
      output_type: [image, feature, network]
      args:
        base_image: false
      output_channel_name: [TCellToCancerCell]
      output: true

    - name: CellTriangulation          # Delaunay nearest-neighbour network
      tag: Network
      channels: [CancerCellComposite, TCell]
      input_type: [image, feature]
      output_type: [image, network]
      args:
        base_image: false
      output_channel_name: [CancerNetwork, TCellNetwork]
      output: true

Key fields:

  • channels — for contact analysis, inner lists are [effector, target] pairs

  • input_type / output_type — explicitly declare what this stage consumes and produces

  • output_channel_name — name of the result channel (used in downstream stages and output paths)

Distributed resource config:

  postprocessing:
    ContactTCellCancer:
      partition: gpu_short
      gres: gpu:a100-pcie-40gb:1
      cpus-per-task: 2
      ntasks: 7               # MPI: 1 master + 6 workers
      mem: 90G
      batch_size: 6           # patches per job
      dependency: singleton
      dependency_type: afterok

Distributed Execution: Paired Configuration

pyCyto’s distributed runner (distributed/submit_batch_jobs.py) reads two files together:

configs/pipelines/my_pipeline.yaml        ← WHAT to run (algorithm params)
configs/distributed/my_resources.yaml     ← HOW to run it (SLURM resources)

This separation means you can run the same pipeline YAML locally with cyto --pipeline and on the cluster with submit_batch_jobs.py — the algorithm is unchanged; only the resource envelope differs.

Dependency graph

Jobs are submitted with explicit SLURM dependencies. The dependency field in the resource config controls the submission order:

preprocessing:
  RegisterDenoise:
    dependency: singleton        # one job at a time (no parallel duplicates)

  Normalize:
    dependency: RegisterDenoise  # wait for RegisterDenoise to complete
    dependency_type: afterok     # afterok = only if previous job succeeded
                                 # afterany = regardless of exit code

Multiple dependencies: dependency: [Cellpose_TCell, Cellpose_Cancer]
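These fields translate to standard SLURM --dependency flags at submission time. A sketch of the assumed mapping (dependency_flag is a hypothetical helper, not the orchestrator's actual code; the flag syntax itself is standard sbatch):

```python
# Sketch: build the sbatch dependency flag from a resolved dependency
# type and the SLURM job IDs of the upstream stages.
def dependency_flag(dep_type, job_ids):
    return "--dependency=" + dep_type + ":" + ":".join(str(j) for j in job_ids)

flag = dependency_flag("afterok", [12345, 12346])
# -> "--dependency=afterok:12345:12346"
```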

Patching (large FOV)

Spatial patches are processed as independent SLURM array jobs:

# In pipeline YAML
patching:
  patch_count: [2, 2]   # 2×2 = 4 patches
  patches: all
  overlap: 64           # pixel overlap for boundary continuity

# In resource config
pipeline:
  patching:
    partition: short
    cpus-per-task: 3
    mem: 1G

Each stage then spawns n_patches × n_batches SLURM array elements. The merge step in label_to_sparse collects patch outputs back into a unified DataFrame.
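The fan-out arithmetic above can be written out directly, assuming one array element per (patch, batch) pair:

```python
# Sketch of the n_patches x n_batches fan-out: patch_count comes from
# the patching block, batch_size from the stage's resource config.
import math

def array_elements(patch_count, n_frames, batch_size):
    n_patches = patch_count[0] * patch_count[1]
    n_batches = math.ceil(n_frames / batch_size)
    return n_patches * n_batches

# 2x2 patching, 100 frames, batch_size 50 -> 4 patches x 2 batches = 8
n = array_elements([2, 2], 100, 50)
```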


Configuration Files Reference

All pipeline configs live under configs/:

configs/
├── pipelines/
│   ├── pipeline.template.yaml       ← canonical annotated template (start here)
│   ├── pipeline.yaml                ← minimal example (confocal cytotoxicity)
│   ├── pipeline_UTSE.yaml           ← lightsheet UTSE dataset
│   ├── pipeline_cytox_confocal.yaml ← confocal cytotoxicity
│   ├── pipeline_EVs.yaml            ← extracellular vesicle tracking
│   └── pipeline_tcell.yaml          ← T-cell only
├── distributed/
│   ├── pipeline-resources.yaml      ← default SLURM resource config
│   ├── pipeline-resources-deploy.yaml ← production deployment resources
│   └── pipeline-*.yaml              ← stage-specific resource configs
├── pipeline-colors.toml             ← global stage color palette
└── db.def.toml                      ← PostgreSQL connection defaults

distributed/ in the repo root contains scripts only — batch Python scripts (batch_*.py), SLURM templates (batch_*.sbatch), and the orchestrator (submit_batch_jobs.py). All YAML configurations belong in configs/.

Submitting a distributed run:

pixi run python distributed/submit_batch_jobs.py \
    -p configs/pipelines/pipeline.template.yaml \
    -r configs/distributed/pipeline-resources.yaml \
    -v

# Dry-run: resolve paths and print sbatch commands without submitting
pixi run python distributed/submit_batch_jobs.py \
    -p configs/pipelines/pipeline.template.yaml \
    -r configs/distributed/pipeline-resources.yaml \
    --dry-run -v