UTSE Cytotoxicity Archive Audit (2026-03-11)

Scope: /users/kir-fritzsche/oyk357/archive/utse_cyto

Task Status

  • Low-risk removal: completed on 2026-03-11.

  • Removed targets:

    • dataset-level log/ directories across cyto, wt, and cancer_only

    • cyto/2023_11_21_.../patching/benchmark/

  • Verified removed: 11/11 paths.

  • Estimated reclaimed space from pre-cleanup measurements: ~3.8G.
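The "Verified removed" count can be reproduced with a short check. The following is a minimal sketch, not the procedure actually used for this audit; the function name verify_removed and the placeholder target paths are hypothetical, and the real list would hold the 11 removed paths.

```python
import os
from pathlib import Path


def verify_removed(archive_root, relative_paths):
    """Return (still_present, total) for paths that should no longer exist."""
    root = Path(archive_root)
    # lexists() also reports dangling symlinks, which exists() would miss.
    still_present = [p for p in relative_paths if os.path.lexists(root / p)]
    return still_present, len(relative_paths)


if __name__ == "__main__":
    # Placeholder entries (hypothetical); substitute the real removed paths.
    targets = ["cyto/<dataset>/log", "cyto/<dataset>/patching/benchmark"]
    present, total = verify_removed(
        "/users/kir-fritzsche/oyk357/archive/utse_cyto", targets
    )
    print(f"Verified removed: {total - len(present)}/{total} paths.")
```

Any path returned in still_present would need a second look before the removal is reported as complete.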

Executive Findings

  • Archive layout is broadly compatible with the distributed pipeline stage model (preprocessing, segmentation, tracking, postprocessing, log).

  • Storage pressure is concentrated in a few very large intermediate trees, especially under cyto/2023_11_24....

  • Some stage outputs appear duplicated under alternate naming conventions (Cellpose vs Cellpose_Cancer/Cellpose_TCell, Ultrack vs split channel folders).

  • Provenance artifacts expected by the current benchmark/dataops workflow (provenance_manifest.json, acceptance_report.json, run_metadata.json) were not found in this archive.

  • tcell_only/ is empty. It should be retained as a reserved target for planned future batch runs and is not a cleanup target.

Dataset Size Snapshot

Measured with du -sh on 2026-03-11, before the low-risk removal (log/ sizes below are pre-cleanup).

Dataset sizes by cohort

Cohort       Dataset                               Size
cyto         2023_11_21_...flow0p1mlperh           102G
cyto         2023_11_24_...flow_0p1mlperh          1.6T
cyto         2023_12_05_...flow0p1mlperh           377G
cyto         2023_12_05_...flow0p1mlperh_full_fov  728G
wt           2023_11_18_...flow_0p15mlperh         631G
wt           2023_11_23_...flow_0p1mlperh          92G
wt           2023_11_28_...flow0p1mlperh           84G
cancer_only  2023_10_03_...flowrate_0p15mlperh     404G
cancer_only  2023_10_26_...flowrate_0p1mlperh      277G
cancer_only  2023_12_08_...flow0p1mlperh           59G

Largest dataset decomposition (cyto/2023_11_24_...)

Path              Size
preprocessing/    630G
patching/         360G
center_cropped/   279G
label_to_sparse/  192G
tracking/         93G
postprocessing/   43G
segmentation/     15G
log/              1005M
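As a consistency check, the component sizes above can be summed and compared with the 1.6T reported for the whole dataset. This is a sketch under two assumptions: that du -sh reported binary units (K, M, G, T as powers of 1024), and that the already-rounded figures are close enough for a sanity check. The helper parse_du is hypothetical.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_du(size):
    """Convert a du -h style size such as '630G' or '1005M' to bytes."""
    size = size.strip()
    if size and size[-1] in UNITS:
        return float(size[:-1]) * UNITS[size[-1]]
    return float(size)  # no suffix: treat as a plain byte count


# Sizes copied from the decomposition of cyto/2023_11_24_... above.
COMPONENTS = {
    "preprocessing/": "630G",
    "patching/": "360G",
    "center_cropped/": "279G",
    "label_to_sparse/": "192G",
    "tracking/": "93G",
    "postprocessing/": "43G",
    "segmentation/": "15G",
    "log/": "1005M",
}

total_bytes = sum(parse_du(s) for s in COMPONENTS.values())
total_tib = total_bytes / UNITS["T"]
print(f"sum of components: {total_tib:.2f}T")
```

The sum comes to roughly 1.58T, which agrees with the reported 1.6T once rounding is taken into account, so the eight listed paths appear to cover essentially the whole dataset tree.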

Subcomponents showing likely duplication:

Path                          Size
segmentation/Cellpose         5.8G
segmentation/Cellpose_Cancer  4.6G
segmentation/Cellpose_TCell   3.7G
tracking/Ultrack              9.6G
tracking/Ultrack_cancer       48G
tracking/Ultrack_tcell        35G

Alignment With Current Distributed Workflow

Aligned

  • Stage-based output hierarchy exists and is readable by current module-level expectations.

  • Patch-based logs (log/<stage>/<tag>/patch_*) are present and consistent with SLURM array execution style.

  • Archive is exposed in-repo via symlink: distributed/utse_cyto -> /users/kir-fritzsche/oyk357/archive/utse_cyto/.

Misaligned or Legacy

  • Mixed naming for equivalent stage outputs increases ambiguity for downstream scripts and cleanup logic.

  • Benchmark/intermediate payloads are embedded inside dataset trees (for example patching/benchmark/) instead of living in a single, reproducible benchmark bundle.

  • No per-run provenance bundle found in archive outputs.
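The missing-provenance finding can be re-checked per dataset with a recursive search. The following is a minimal sketch: the three file names come from this report, while the function name missing_provenance and the choice to search the whole tree (rather than only the dataset root) are assumptions.

```python
from pathlib import Path

# Artifact names expected by the current benchmark/dataops workflow.
EXPECTED_ARTIFACTS = (
    "provenance_manifest.json",
    "acceptance_report.json",
    "run_metadata.json",
)


def missing_provenance(dataset_root, expected=EXPECTED_ARTIFACTS):
    """Return the expected artifact names not found anywhere under dataset_root."""
    root = Path(dataset_root)
    missing = []
    for name in expected:
        # rglob walks the whole tree; next() stops at the first hit.
        if next(root.rglob(name), None) is None:
            missing.append(name)
    return missing
```

On trees of several hundred GB a full walk can be slow, so restricting the search to the dataset root may be preferable once the manifest location is standardized.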

Cleanup and Retention Recommendations

1) Keep (do not remove)

  • Keep tcell_only/ as a reserved namespace for upcoming batches, to be run when compute and storage allow.

  • Keep canonical stage outputs required for paper reruns and traceability.

2) Low-risk removals (immediate; completed 2026-03-11, see Task Status)

  • Remove stale log/ trees after compressing or exporting the last successful run logs.

    • Reclaim example: ~1G in cyto/2023_11_24....

  • Remove transient benchmark job scratch once benchmark summary/provenance is captured.

    • In cyto/2023_11_21_.../patching/benchmark/, the repeated register_denoise_*jobs folders are each ~462M.

3) Medium-risk removals (after one-to-one validation)

  • Consolidate segmentation naming to one canonical namespace and drop duplicate channel-split trees if byte/metadata equivalent.

    • Potential reclaim in cyto/2023_11_24...: up to ~8.3G.

  • Consolidate tracking naming similarly (Ultrack vs split variants).

    • Potential reclaim in cyto/2023_11_24...: up to ~83G.
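The one-to-one validation called for above could start from a content-hash comparison. This is a sketch under stated assumptions: it tests byte equivalence only (not metadata), it matches files by SHA-256 of their contents regardless of file name or layout, and the function names are hypothetical. It does not establish that the split trees actually duplicate the canonical one; it only reports which files would be lost if they were dropped.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large outputs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def content_hashes(root):
    """Set of content hashes for every regular file under root."""
    return {file_sha256(p) for p in Path(root).rglob("*") if p.is_file()}


def uncovered_files(candidate, canonical):
    """Files under candidate with no byte-identical file anywhere under canonical."""
    canonical_hashes = content_hashes(canonical)
    candidate = Path(candidate)
    return [
        p.relative_to(candidate).as_posix()
        for p in sorted(candidate.rglob("*"))
        if p.is_file() and file_sha256(p) not in canonical_hashes
    ]
```

An empty result from uncovered_files(".../segmentation/Cellpose_Cancer", ".../segmentation/Cellpose") would support treating the split tree as a duplicate; any listed file means the tree holds content that exists nowhere in the canonical namespace.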

4) High-impact policy decisions (require sign-off)

  • Decide whether all intermediate tiers must be retained simultaneously (center_cropped, preprocessing, patching/images, label_to_sparse).

  • If reproducible from upstream snapshots/config, these tiers are prime storage-reduction targets (hundreds of GB each).

Operational Improvements (DataOps)

  • Enforce a per-run manifest in each dataset root:

    • run_metadata.json, config hash, git SHA, environment lock, snapshot ID.

  • Add retention classes per output path: canonical, rebuildable, ephemeral.

  • Standardize stage folder naming and deprecate alternates.

  • Keep benchmark artifacts in a dedicated benchmark bundle root, not mixed into dataset production trees.
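A per-run manifest of the kind proposed above might be written as follows. This is only a sketch: the JSON key names, the function names, and the decision to store hashes of the config and environment lock files (rather than the files themselves) are assumptions, not an existing schema.

```python
import hashlib
import json
import subprocess
from pathlib import Path


def sha256_of(path):
    """SHA-256 of a small text file such as a config or environment lock."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def current_git_sha(repo_dir):
    """Commit SHA of the checked-out pipeline code, via `git rev-parse HEAD`."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def write_run_metadata(dataset_root, config_path, env_lock_path, git_sha, snapshot_id):
    """Write run_metadata.json into the dataset root and return its path."""
    metadata = {
        "config_sha256": sha256_of(config_path),
        "git_sha": git_sha,
        "environment_lock_sha256": sha256_of(env_lock_path),
        "snapshot_id": snapshot_id,
    }
    out = Path(dataset_root) / "run_metadata.json"
    out.write_text(json.dumps(metadata, indent=2, sort_keys=True) + "\n")
    return out
```

Passing git_sha in as an argument (for example from current_git_sha) keeps the writer usable on nodes where the repository is not checked out.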

Suggested Next Actions (Ordered)

  1. Freeze and tag one canonical output naming scheme for segmentation/tracking.

  2. Add a dry-run cleanup script to report reclaimable space by retention class.

  3. Migrate benchmark outputs to provenance bundles and remove legacy job scratch by policy.

  4. Apply cleanup to one pilot dataset (cyto/2023_11_24...) before rolling out archive-wide.
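Action 2 could begin from a report-only script such as the sketch below. The mapping of stage folders to retention classes is hypothetical and would need sign-off (in particular, marking the intermediate tiers as rebuildable assumes they can be regenerated from upstream snapshots and config). Sizes are apparent file sizes, so they may differ somewhat from du output. Nothing is deleted.

```python
import os
from collections import defaultdict
from pathlib import Path

# Hypothetical mapping of top-level stage folders to retention classes.
RETENTION = {
    "log": "ephemeral",
    "preprocessing": "rebuildable",
    "center_cropped": "rebuildable",
    "patching": "rebuildable",
    "label_to_sparse": "rebuildable",
    "segmentation": "canonical",
    "tracking": "canonical",
    "postprocessing": "canonical",
}


def dir_size(path):
    """Sum of apparent file sizes under path, without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            total += os.lstat(os.path.join(dirpath, name)).st_size
    return total


def dry_run_report(dataset_root, retention=RETENTION, default="canonical"):
    """Return bytes per retention class for one dataset root; removes nothing."""
    totals = defaultdict(int)
    for entry in sorted(Path(dataset_root).iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            # Unknown folders fall back to the safest class.
            totals[retention.get(entry.name, default)] += dir_size(entry)
    return dict(totals)
```

Defaulting unrecognized folders to canonical means the report can only under-state reclaimable space, which seems the safer failure mode for a pilot run.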