pyCyto Analysis Pipeline¶
pyCyto processes time-lapse microscopy data through six ordered stages, each defined declaratively in a YAML configuration file. The same YAML runs locally (single node) or is distributed across SLURM jobs — paired with a resource config that specifies compute allocation per stage.
Configuration files:

| File | Purpose |
|---|---|
| `configs/pipelines/*.yaml` | Algorithm parameters, channel paths, analysis steps |
| `configs/distributed/*.yaml` | SLURM resources (memory, partition, GPU, batch size) per stage |
Stage Color Scheme¶
The colour coding below is consistent across this documentation and the planned React Flow UI. The palette is defined in configs/pipeline-colors.toml.
| File I/O | Preprocessing | Segmentation | Tracking | Analysis / Postprocessing | Output |
|---|---|---|---|---|---|
Pipeline Diagram¶
```mermaid
graph TD
    A([Raw Microscopy Images]) --> AA[Spatial Tiling]
    AA --> B([File I/O: Load Channels])
    B --> C[Preprocessing]
    C --> D[Register and Denoise]
    C --> E[Channel Merge]
    C --> F[Intensity Normalization]
    C --> G[Gamma Correction]
    D --> H[Segmentation]
    E --> H
    F --> H
    G --> H
    H --> I[Cellpose]
    H --> J[StarDist]
    I --> K[Tabulation]
    J --> K
    K --> L[Label to Sparse Features]
    L --> M[Tracking]
    M --> N[TrackMate sparse]
    M --> O[trackpy sparse]
    M --> P[Ultrack dense]
    N --> Q[Analysis]
    O --> Q
    P --> Q
    Q --> R[Contact Tracing]
    Q --> S[Kinematics]
    Q --> T[Cell Networks]
    R --> U([Results: Tables, Plots, Networks])
    S --> U
    T --> U

    classDef stageIO fill:#1e293b,color:#f1f5f9,stroke:#475569
    classDef stagePre fill:#0d7377,color:#fff,stroke:#0a5c60
    classDef stageSeg fill:#7c3aed,color:#fff,stroke:#6d28d9
    classDef stageTab fill:#0369a1,color:#fff,stroke:#075985
    classDef stageTrack fill:#0369a1,color:#fff,stroke:#075985
    classDef stageAna fill:#b45309,color:#fff,stroke:#92400e
    classDef stageOut fill:#166534,color:#fff,stroke:#14532d
    classDef stageOpt fill:#6b7280,color:#fff,stroke:#4b5563

    class A,B stageIO
    class C,D,E,F stagePre
    class G stageOpt
    class H,I,J stageSeg
    class K,L stageTab
    class M,N,O stageTrack
    class P stageOpt
    class Q,R,S,T stageAna
    class U stageOut
    class AA stageOpt
```
Tip: The diagram scrolls horizontally if your viewport is narrow. Use browser zoom (`Ctrl +` / `Ctrl -`) to resize.
Stage Reference¶
Each section below maps one pipeline stage to its YAML configuration block.
All stages follow the same module contract: output = Module(args)({"image": …, "label": …}).
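The contract can be sketched as a configured callable that maps a dict of named arrays to a dict of named arrays. The class below is an illustrative stand-in, not an actual pyCyto module:

```python
import numpy as np

class GammaCorrection:
    """Illustrative stand-in for a pyCyto module: configured once with
    args, then called on a dict of named arrays ("image", "label", ...)."""

    def __init__(self, gamma: float = 0.5):
        self.gamma = gamma

    def __call__(self, data: dict) -> dict:
        image = data["image"].astype(np.float32)
        # rescale to [0, 1], apply gamma, pass other keys through unchanged
        scaled = (image - image.min()) / max(np.ptp(image), 1e-9)
        return {**data, "image": scaled ** self.gamma}

output = GammaCorrection(gamma=0.5)({"image": np.arange(9.0).reshape(3, 3)})
```

Every stage, whatever its algorithm, plugs into the pipeline through this same call shape.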
1. Channel Inputs¶
Define the input data sources. Each key becomes the channel name used throughout the pipeline.
```yaml
channels:
  TCell: /data/datasets/my_experiment/ch0_tcell.tif
  CancerCell: /data/datasets/my_experiment/ch1_cancer.tif
  Dead: /data/datasets/my_experiment/ch2_pi.tif

image_range:
  x: [0, null, 1]   # [start, stop, step] — null means full extent
  y: [0, null, 1]
  t: [0, 100, 1]    # first 100 frames

spacing: [0.83, 0.83]   # pixel size in microns [y, x]
output_dir: output/my_experiment
```
Key fields:
- `channels` — one path per fluorescence channel; keys are arbitrary names
- `image_range` — crop or subsample before processing; saves memory on large datasets
- `spacing` — physical pixel size; used for velocity and displacement in microns
- `output_dir` — root for all stage outputs
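Whatever pyCyto's crop internals look like, the `[start, stop, step]` triples map naturally onto Python slices. A sketch, with an illustrative helper name and a deliberately small stand-in stack:

```python
import numpy as np

def range_to_slices(image_range: dict, order=("t", "y", "x")) -> tuple:
    """Turn [start, stop, step] triples (null -> None = full extent)
    into numpy slices, assuming (t, y, x) axis order."""
    return tuple(slice(*image_range.get(ax, [None, None, None])) for ax in order)

stack = np.zeros((200, 64, 64), dtype=np.uint16)   # small stand-in stack (t, y, x)
image_range = {"x": [0, None, 1], "y": [0, None, 1], "t": [0, 100, 1]}
cropped = stack[range_to_slices(image_range)]
print(cropped.shape)  # (100, 64, 64): first 100 frames, full spatial extent
```

Because the slice is applied before any processing, downstream stages never see the cropped-away data.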
2. Preprocessing (Image → Image)¶
Preprocessing steps run in sequence. Each step writes its output TIFFs to output_dir/preprocessing/<tag>/ when output: true.
```yaml
pipeline:
  preprocessing:
    - name: RegisterDenoise            # ANTs registration + BM3D denoising
      tag: RegisterDenoise             # ← must match key in pipeline-resources.yaml
      channels: [TCell, CancerCell, Dead]
      args:
        stacks: 12                     # frames per registration reference window
        parallel: true                 # use all cpus-per-task cores (recommended for SLURM)
        batch_size: 0                  # 0 = auto; or set explicit frame count per job
      output: true

    - name: ChannelMerge               # combine two channels into one
      tag: ChannelMerge_cancer
      channels: [[CancerCell, Dead]]   # list of [ch_a, ch_b] pairs to merge
      output_channel_name: CancerCellComposite
      output: true

    - name: PercentileNormalization
      tag: Normalize
      channels: all                    # applies to every channel
      args:
        lp: 5                          # clip below 5th percentile → 0
        up: 95                         # clip above 95th percentile → 1
      output: true
```
Key fields:
- `tag` — identifier used in logs, output paths, and as the dependency key in the resource config
- `channels` — list of channel names to process; `all` means every defined channel
- `output: true` — persist this step's output to disk (set `false` for intermediate-only steps)
- `output_channel_name` — for steps that create new channels (e.g. ChannelMerge)
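The `lp`/`up` semantics of PercentileNormalization can be sketched as clipping to the percentile window and rescaling to [0, 1]. This is an assumed reading of the parameters, not pyCyto's actual implementation:

```python
import numpy as np

def percentile_normalize(image: np.ndarray, lp: float = 5, up: float = 95) -> np.ndarray:
    """Clip intensities to the [lp, up] percentile window, then rescale to [0, 1]."""
    lo, hi = np.percentile(image, [lp, up])
    return np.clip((image.astype(np.float32) - lo) / max(hi - lo, 1e-9), 0.0, 1.0)

frame = np.random.default_rng(0).integers(0, 4096, size=(256, 256))
norm = percentile_normalize(frame, lp=5, up=95)
```

Percentile windows make the normalization robust to hot pixels and dim background, which a plain min/max rescale is not.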
Distributed resource config (paired in pipeline-resources.yaml):
```yaml
pipeline:
  preprocessing:
    RegisterDenoise:                  # ← same tag as pipeline YAML
      partition: short
      cpus-per-task: 8
      mem: 32G
      time: "04:00:00"
      batch_size: 50                  # frames per SLURM array job
      dependency: singleton           # "singleton" = no duplicate concurrent jobs

    Normalize:
      partition: short
      cpus-per-task: 2
      mem: 8G
      dependency: RegisterDenoise     # wait for RegisterDenoise to finish
      dependency_type: afterok
      batch_size: 100
```
3. Segmentation (Image → Label)¶
Detects cells and produces integer label masks. Each detected cell is an integer region; background is 0.
```yaml
segmentation:
  - name: CellPose                  # Cellpose (GPU)
    tag: Cellpose_TCell
    channels: [TCell]
    input_type: image               # "image" or "label" (to refine an existing mask)
    args:
      model_type: cpsam             # model: cpsam, cyto2, nuclei, cyto3, ...
      diameter: 20                  # expected cell diameter in pixels (0 = auto)
      cellprob_thresh: -3.0         # lower = more detections; range ~[-6, 6]
      batch_size: 128               # images per GPU forward pass
    output_type: label
    output: true

  - name: StarDist                  # StarDist (GPU, TensorFlow)
    tag: StarDist_Cancer
    channels: [CancerCellComposite]
    input_type: image
    args:
      model_name: 2D_versatile_fluo
      prob_thresh: 0.3              # detection confidence threshold
      nms_thresh: 0.8               # non-maximum suppression overlap threshold
    output_type: label
    output: true
```
Key fields:
- `model_type` / `model_name` — select the pretrained model; `cpsam` (Cellpose + SAM ViT) gives the best accuracy but requires an A100 for 2400×2400 images
- `diameter` — critical tuning parameter; measure a few cells with Fiji/napari first
- `input_type` — use `"label"` for refinement steps (morphological operations etc.)
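If you already have a rough label mask from a first pass, a quick numpy estimate of cell diameter can guide the `diameter` setting before committing to a full run. This is an illustrative helper, not a pyCyto function:

```python
import numpy as np

def equivalent_diameters(labels: np.ndarray) -> np.ndarray:
    """Per-cell equivalent diameter in pixels from a label mask:
    d = 2 * sqrt(area / pi); background label 0 is ignored."""
    _, areas = np.unique(labels[labels > 0], return_counts=True)
    return 2.0 * np.sqrt(areas / np.pi)

mask = np.zeros((64, 64), dtype=np.int32)
mask[10:30, 10:30] = 1   # 400-pixel square cell
mask[40:50, 40:50] = 2   # 100-pixel square cell
print(np.round(equivalent_diameters(mask), 1))  # [22.6 11.3]
```

The median of these values is a reasonable starting point for `diameter`.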
Distributed resource config:
```yaml
segmentation:
  Cellpose_TCell:
    partition: gpu_short
    gres: gpu:a100-pcie-40gb:1    # cpsam needs ≥40 GB VRAM on large patches
    cpus-per-task: 8
    mem: 32G
    batch_size: 200               # frames per GPU job
    dependency: RegisterDenoise
    dependency_type: afterok
```
4. Tabulation (Image + Label → Table)¶
Converts segmentation masks to a sparse pandas DataFrame — one row per cell per frame. This is the "dataframe" type in the pipeline data model.
```yaml
label_to_sparse:
  image_label_pair:
    - [TCell, TCell]                               # measure TCell signal with TCell labels
    - [CancerCellComposite, CancerCellComposite]   # measure cancer signal with cancer labels
    - [Dead, CancerCellComposite]                  # measure PI signal using cancer cell outlines
  output: true
```
Each [image_channel, label_channel] pair extracts region properties (centroid, area, mean intensity, feret radius) from image_channel within each labeled region in label_channel. Multiple pairs can share the same label channel.
Output columns (standard pyCyto DataFrame schema):
Column |
Description |
|---|---|
|
Cell ID within frame |
|
Time index (0-based) |
|
Centroid position (pixels) |
|
Estimated cell radius |
|
Intensity statistics |
|
Cell area (pixels²) |
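The per-pair extraction can be sketched with plain numpy and pandas. Column names below are illustrative; pyCyto's standard schema defines the actual identifiers:

```python
import numpy as np
import pandas as pd

def tabulate(image: np.ndarray, labels: np.ndarray, frame: int) -> pd.DataFrame:
    """One row per labelled region: centroid, area, mean intensity."""
    rows = []
    for cell_id in np.unique(labels[labels > 0]):
        ys, xs = np.nonzero(labels == cell_id)
        rows.append({
            "cell_id": int(cell_id),            # illustrative column names
            "frame": frame,
            "y": ys.mean(), "x": xs.mean(),     # centroid (pixels)
            "area": ys.size,                    # pixels²
            "mean_intensity": image[ys, xs].mean(),
        })
    return pd.DataFrame(rows)

labels = np.zeros((32, 32), dtype=np.int32)
labels[2:6, 2:6] = 1
labels[20:28, 20:28] = 2
image = np.ones_like(labels) * 7
df = tabulate(image, labels, frame=0)
```

Running this per `[image_channel, label_channel]` pair, per frame, and concatenating the results yields the sparse table the tracking stage consumes.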
Distributed resource config:
```yaml
label_to_sparse:
  core:                 # per-frame tabulation jobs
    partition: short
    cpus-per-task: 8
    mem: 32G
    batch_size: 100
    dependency: singleton
  merge:                # final merge of all patch outputs
    partition: short
    cpus-per-task: 8
    mem: 128G           # needs full dataset in memory for merge
```
5. Tracking (Table → Table)¶
Links per-frame cell detections into temporal tracks. Adds track_id and kinematic columns to the DataFrame.
```yaml
tracking:
  - name: TrackMate                    # sparse input: operates on centroid table
    tag: TrackMate_TCell
    channels: [TCell]
    args:
      linking_max_distance: 15         # max pixels to link between frames
      gap_closing_max_distance: 15     # max pixels to close a track gap
      max_frame_gap: 2                 # max frames a track may be absent
      verbose: true
    output: true

  # - name: Ultrack                    # dense input: operates on label masks
  #   tag: Ultrack_cancer
  #   images: [CancerCellComposite]
  #   labels: [CancerCellComposite]
  #   output: true
```
Tracker choice guide:

| Tracker | Input | Best for |
|---|---|---|
| TrackMate | sparse (centroid table) | Standard 2D tracking, gap closing, Fiji ecosystem |
| trackpy | sparse (centroid table) | Simple nearest-neighbour; pure Python, no JVM |
| Ultrack | dense (label masks) | Crowded scenes, cell division, ILP global optimisation |
TrackMate note: Requires the imagej pixi environment and either a pre-installed Fiji or imagej.init('sc.fiji:fiji') auto-download. The TrackMate CSV Importer plugin loads pyCyto’s centroid table directly into TrackMate without re-running detection — see TrackMate CSV Importer.
Distributed resource config:
```yaml
tracking:
  TrackMate_TCell:
    partition: short
    cpus-per-task: 8
    mem: 64G
    time: "04:00:00"
    dependency: [Cellpose_TCell]
    dependency_type: afterok

  Ultrack_cancer:
    dependency: singleton
    dependency_type: afterok
    database: {partition: long, mem: 64G, cpus-per-task: 16, time: "10-00:00:00"}
    segment: {partition: short, mem: 4G, cpus-per-task: 2}
    link: {partition: short, mem: 4G, cpus-per-task: 2}
    solve: {partition: long, mem: 32G, cpus-per-task: 4, time: "10-00:00:00"}
    export: {partition: short, mem: 16G, cpus-per-task: 2}
```
6. Postprocessing / Analysis (any → any)¶
Analysis stages operate on any combination of images, labels, and DataFrames and can produce results tables, network graphs, and figures.
```yaml
postprocessing:
  - name: CrossCellContactMeasures             # GPU cell-cell contact detection
    tag: ContactTCellCancer
    channels: [[TCell, CancerCellComposite]]   # [effector, target] pairs
    input_type: [image, label, feature]
    output_type: [image, feature, network]
    args:
      base_image: false
    output_channel_name: [TCellToCancerCell]
    output: true

  - name: CellTriangulation                    # Delaunay nearest-neighbour network
    tag: Network
    channels: [CancerCellComposite, TCell]
    input_type: [image, feature]
    output_type: [image, network]
    args:
      base_image: false
    output_channel_name: [CancerNetwork, TCellNetwork]
    output: true
```
Key fields:
- `channels` — for contact analysis, inner lists are `[effector, target]` pairs
- `input_type` / `output_type` — explicitly declare what this stage consumes and produces
- `output_channel_name` — name of the result channel (used in downstream stages and output paths)
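The kind of neighbour network CellTriangulation produces can be sketched with scipy's Delaunay triangulation over cell centroids. This is an illustrative reimplementation of the idea, not pyCyto's code:

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_edges(centroids: np.ndarray) -> set:
    """Undirected edge set of the Delaunay triangulation over
    cell centroids (rows are [y, x] positions)."""
    tri = Delaunay(centroids)
    edges = set()
    for simplex in tri.simplices:               # each simplex is a triangle
        for i in range(3):
            a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
            edges.add((int(a), int(b)))
    return edges

# four corner cells plus one in the centre
pts = np.array([[0, 0], [0, 10], [10, 0], [10, 10], [5, 5]], dtype=float)
edges = delaunay_edges(pts)
```

Delaunay edges approximate "natural neighbours" without a hand-tuned distance cutoff, which is why it is a common basis for cell-contact graphs.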
Distributed resource config:
```yaml
postprocessing:
  ContactTCellCancer:
    partition: gpu_short
    gres: gpu:a100-pcie-40gb:1
    cpus-per-task: 2
    ntasks: 7          # MPI: 1 master + 6 workers
    mem: 90G
    batch_size: 6      # patches per job
    dependency: singleton
    dependency_type: afterok
```
Distributed Execution: Paired Configuration¶
pyCyto’s distributed runner (distributed/submit_batch_jobs.py) reads two files together:
```text
configs/pipelines/my_pipeline.yaml      ← WHAT to run (algorithm params)
configs/distributed/my_resources.yaml   ← HOW to run it (SLURM resources)
```
This separation means you can run the same pipeline YAML locally with cyto --pipeline and on the cluster with submit_batch_jobs.py — the algorithm is unchanged; only the resource envelope differs.
Dependency graph¶
Jobs are submitted with explicit SLURM dependencies. The dependency field in the resource config controls the submission order:
```yaml
preprocessing:
  RegisterDenoise:
    dependency: singleton          # one job at a time (no parallel duplicates)
  Normalize:
    dependency: RegisterDenoise    # wait for RegisterDenoise to complete
    dependency_type: afterok       # afterok = only if previous job succeeded
                                   # afterany = regardless of exit code
```
Multiple dependencies: dependency: [Cellpose_TCell, Cellpose_Cancer]
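In spirit, a resource-config dependency entry translates to an sbatch `--dependency` flag roughly like this (illustrative helper, not `submit_batch_jobs.py`'s actual code):

```python
def dependency_flag(dep, dep_type: str, job_ids: dict) -> str:
    """Build the sbatch --dependency flag from a resource-config entry.
    `dep` is "singleton", a tag, or a list of tags; `job_ids` maps
    already-submitted tags to their SLURM job IDs."""
    if dep == "singleton":
        return "--dependency=singleton"
    tags = dep if isinstance(dep, list) else [dep]
    ids = ":".join(str(job_ids[t]) for t in tags)
    return f"--dependency={dep_type}:{ids}"

flag = dependency_flag(["Cellpose_TCell", "Cellpose_Cancer"], "afterok",
                       {"Cellpose_TCell": 1001, "Cellpose_Cancer": 1002})
print(flag)  # --dependency=afterok:1001:1002
```

With `afterok`, a list of tags means the job waits for every listed job to finish successfully.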
Patching (large FOV)¶
Spatial patches are processed as independent SLURM array jobs:
```yaml
# In pipeline YAML
patching:
  patch_count: [2, 2]   # 2×2 = 4 patches
  patches: all
  overlap: 64           # pixel overlap for boundary continuity
```

```yaml
# In resource config
pipeline:
  patching:
    partition: short
    cpus-per-task: 3
    mem: 1G
```
Each stage then spawns n_patches × n_batches SLURM array elements. The merge step in label_to_sparse collects patch outputs back into a unified DataFrame.
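The patch windows implied by `patch_count` and `overlap` can be sketched as follows (illustrative; pyCyto's actual tiling may divide boundaries differently):

```python
def patch_slices(shape, patch_count, overlap):
    """Yield (y_slice, x_slice) windows for a patch_count grid over an
    image of the given shape, expanding each patch by `overlap` pixels
    on interior edges so neighbouring patches share a border region."""
    ny, nx = patch_count
    h, w = shape
    for i in range(ny):
        for j in range(nx):
            y0, y1 = i * h // ny, (i + 1) * h // ny
            x0, x1 = j * w // nx, (j + 1) * w // nx
            yield (slice(max(y0 - overlap, 0), min(y1 + overlap, h)),
                   slice(max(x0 - overlap, 0), min(x1 + overlap, w)))

windows = list(patch_slices((2400, 2400), patch_count=(2, 2), overlap=64))
# 4 windows; interior edges overlap by 128 px (64 contributed by each side)
```

The shared border is what lets the merge step stitch segmentations without cells being cut at patch boundaries.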
Configuration Files Reference¶
All pipeline configs live under configs/:
```text
configs/
├── pipelines/
│   ├── pipeline.template.yaml         ← canonical annotated template (start here)
│   ├── pipeline.yaml                  ← minimal example (confocal cytotoxicity)
│   ├── pipeline_UTSE.yaml             ← lightsheet UTSE dataset
│   ├── pipeline_cytox_confocal.yaml   ← confocal cytotoxicity
│   ├── pipeline_EVs.yaml              ← extracellular vesicle tracking
│   └── pipeline_tcell.yaml            ← T-cell only
├── distributed/
│   ├── pipeline-resources.yaml        ← default SLURM resource config
│   ├── pipeline-resources-deploy.yaml ← production deployment resources
│   └── pipeline-*.yaml                ← stage-specific resource configs
├── pipeline-colors.toml               ← global stage colour palette
└── db.def.toml                        ← PostgreSQL connection defaults
```
distributed/ in the repo root contains scripts only — batch Python scripts (batch_*.py), SLURM templates (batch_*.sbatch), and the orchestrator (submit_batch_jobs.py). All YAML configurations belong in configs/.
Submitting a distributed run:
```bash
pixi run python distributed/submit_batch_jobs.py \
    -p configs/pipelines/pipeline.template.yaml \
    -r configs/distributed/pipeline-resources.yaml \
    -v

# Dry-run: resolve paths and print sbatch commands without submitting
pixi run python distributed/submit_batch_jobs.py \
    -p configs/pipelines/pipeline.template.yaml \
    -r configs/distributed/pipeline-resources.yaml \
    --dry-run -v
```