UTSE Cytotoxicity Archive Audit (2026-03-11)

Scope: /users/kir-fritzsche/oyk357/archive/utse_cyto

Task Status

Low-risk removal: completed on 2026-03-11.
Removed targets:
- dataset-level log/ directories across cyto, wt, and cancer_only
- cyto/2023_11_21_.../patching/benchmark/
Verified removed: 11/11 paths.
Estimated reclaimed space from pre-cleanup measurements: ~3.8G.

Archive layout is broadly compatible with the distributed pipeline stage model (preprocessing, segmentation, tracking, postprocessing, log).
Storage pressure is concentrated in a few very large intermediate trees, especially under cyto/2023_11_24....
Some stage outputs appear duplicated under alternate naming conventions (Cellpose vs Cellpose_Cancer/Cellpose_TCell, Ultrack vs Ultrack_cancer/Ultrack_tcell).
Provenance artifacts expected by the current benchmark/dataops workflow (provenance_manifest.json, acceptance_report.json, run_metadata.json) were not found in this archive.
tcell_only/ is empty; retain it as a reserved target for planned future batch runs rather than treating it as a cleanup target.

Measured with du -sh on 2026-03-11.

Subcomponents showing likely duplication (the split-channel sizes sum to the ~8.3G and ~83G reclaim estimates given for cyto/2023_11_24...):

Path	Size
`segmentation/Cellpose`	5.8G
`segmentation/Cellpose_Cancer`	4.6G
`segmentation/Cellpose_TCell`	3.7G
`tracking/Ultrack`	9.6G
`tracking/Ultrack_cancer`	48G
`tracking/Ultrack_tcell`	35G

A stage-based output hierarchy exists and matches the layout the current modules expect to read.
Patch-based logs (log/<stage>/<tag>/patch_*) are present and consistent with SLURM array execution style.
Archive is exposed in-repo via symlink: distributed/utse_cyto -> /users/kir-fritzsche/oyk357/archive/utse_cyto/.

Mixed naming for equivalent stage outputs increases ambiguity for downstream scripts and cleanup logic.
Benchmark/intermediate payloads are embedded inside dataset trees (for example patching/benchmark/) instead of living in a single, reproducible benchmark bundle.
No per-run provenance bundle found in archive outputs.
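
The provenance gap can be checked mechanically. The following is a minimal sketch, assuming each dataset lives two levels below the archive root (group/dataset, as in cyto/2023_11_24...) and that the three files named earlier in this audit would sit directly in the dataset root. The function names are hypothetical and not part of the pipeline.

```python
from pathlib import Path

# File names come from this audit; their expected location (the dataset
# root) is an assumption made for illustration.
EXPECTED_ARTIFACTS = (
    "provenance_manifest.json",
    "acceptance_report.json",
    "run_metadata.json",
)


def missing_provenance(dataset_root: Path) -> list[str]:
    """Return the expected artifact names that are absent from dataset_root."""
    return [
        name for name in EXPECTED_ARTIFACTS
        if not (dataset_root / name).is_file()
    ]


def audit_archive(archive_root: Path) -> dict[str, list[str]]:
    """Map every group/dataset directory to its list of missing artifacts.

    Empty groups (such as a reserved tcell_only/) contribute no entries.
    """
    report: dict[str, list[str]] = {}
    for group in sorted(p for p in archive_root.iterdir() if p.is_dir()):
        for dataset in sorted(p for p in group.iterdir() if p.is_dir()):
            report[f"{group.name}/{dataset.name}"] = missing_provenance(dataset)
    return report
```

Calling `audit_archive(Path("distributed/utse_cyto"))` would return one entry per dataset; a dataset with all three artifacts maps to an empty list.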

Keep tcell_only/ as a reserved namespace for upcoming batches, to be run when compute and storage allow.
Keep canonical stage outputs required for paper reruns and traceability.

Remove stale log/ trees after compressing or exporting the last successful run logs.
- Reclaim example: ~1G in cyto/2023_11_24....
Remove transient benchmark job scratch once benchmark summary/provenance is captured.
- In 2023_11_21.../patching/benchmark/, repeated register_denoise_*jobs folders are each ~462M.
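
The "compress, then remove" step for log/ trees could look like the sketch below. It only writes and verifies the tarball; deletion is deliberately left as a separate manual step. The function name and the choice of gzip are assumptions, not an existing pipeline convention.

```python
import tarfile
from pathlib import Path


def archive_log_tree(log_dir: Path, dest: Path) -> int:
    """Write log_dir to a gzip tarball at dest and return the archived file count.

    Nothing is deleted here. The caller removes log_dir only after this
    function returns without raising.
    """
    if log_dir.resolve() in dest.resolve().parents:
        raise ValueError("dest must live outside the tree being archived")

    with tarfile.open(str(dest), "w:gz") as tar:
        tar.add(str(log_dir), arcname=log_dir.name)

    # Re-open the tarball and compare regular-file counts as a basic check.
    # Symlinks are not counted as regular files inside the tarball, so a tree
    # containing symlinks to files would fail this check and need manual review.
    with tarfile.open(str(dest), "r:gz") as tar:
        archived = sum(1 for member in tar.getmembers() if member.isfile())
    expected = sum(1 for p in log_dir.rglob("*") if p.is_file())
    if archived != expected:
        raise RuntimeError(f"archived {archived} files, expected {expected}")
    return archived
```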

Consolidate segmentation naming to one canonical namespace and drop duplicate channel-split trees if they are byte- and metadata-equivalent to the canonical tree.
- Potential reclaim in cyto/2023_11_24...: up to ~8.3G.
Consolidate tracking naming similarly (Ultrack vs split variants).
- Potential reclaim in cyto/2023_11_24...: up to ~83G.
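
Before dropping any split tree, byte equivalence can be tested rather than assumed. The sketch below is one conservative way to do that: a candidate file counts as redundant only when the canonical tree holds a file at the same relative path with the same SHA-256 digest. Whether the split trees in this archive are true duplicates is not established here, metadata equivalence is not covered, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large outputs are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_fingerprint(root: Path) -> dict[str, str]:
    """Map each file's path relative to root onto its content digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(canonical: Path, candidate: Path) -> tuple[list[str], list[str]]:
    """Split candidate's files into (redundant, unique) relative paths.

    Redundant means: same relative path and same bytes in the canonical tree.
    """
    reference = tree_fingerprint(canonical)
    redundant, unique = [], []
    for rel, digest in tree_fingerprint(candidate).items():
        (redundant if reference.get(rel) == digest else unique).append(rel)
    return redundant, unique
```

A candidate tree is safe to drop under this check only when its `unique` list is empty. Hashing the tracking trees will take a while at these sizes, so running it on the pilot dataset first is the cheaper test.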

Decide whether all intermediate tiers must be retained simultaneously (center_cropped, preprocessing, patching/images, label_to_sparse).
If reproducible from upstream snapshots/config, these tiers are prime storage-reduction targets (hundreds of GB each).

Enforce a per-run manifest in each dataset root:
- run_metadata.json, config hash, git SHA, environment lock, snapshot ID.
Add retention classes per output path: canonical, rebuildable, ephemeral.
Standardize stage folder naming and deprecate alternates.
Keep benchmark artifacts in a dedicated benchmark bundle root, not mixed into dataset production trees.
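
A per-run manifest with the fields listed above might be produced as in the following sketch. Only the file name run_metadata.json and the list of fields come from this audit; the JSON key names, the use of SHA-256, and the function names are assumptions. The git SHA and snapshot ID are passed in by the caller rather than discovered.

```python
import hashlib
import json
from pathlib import Path


def _sha256_of(path: Path) -> str:
    """Hex SHA-256 of a small text file such as a config or lock file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_run_metadata(
    config_path: Path, git_sha: str, env_lock_path: Path, snapshot_id: str
) -> dict[str, str]:
    """Collect the manifest fields; key names are illustrative only."""
    return {
        "config_sha256": _sha256_of(config_path),
        "git_sha": git_sha,
        "environment_lock_sha256": _sha256_of(env_lock_path),
        "snapshot_id": snapshot_id,
    }


def write_run_metadata(dataset_root: Path, metadata: dict[str, str]) -> Path:
    """Write run_metadata.json into the dataset root and return its path."""
    out = dataset_root / "run_metadata.json"
    out.write_text(json.dumps(metadata, indent=2, sort_keys=True) + "\n")
    return out
```

Storing hashes of the config and lock file, rather than copies, keeps the manifest small while still letting a rerun detect drift.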

Freeze and tag one canonical output naming scheme for segmentation/tracking.
Add a dry-run cleanup script to report reclaimable space by retention class.
Migrate benchmark outputs to provenance bundles and remove legacy job scratch by policy.
Apply cleanup to one pilot dataset (cyto/2023_11_24...) before rolling out archive-wide.
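
The dry-run cleanup report could start from a sketch like the one below. It walks a tree, assigns every file to one of the three retention classes, and sums sizes without deleting anything. The glob rules are a hypothetical policy written for illustration, not an agreed classification, and the totals are apparent file sizes in bytes, so they will not match du -sh exactly.

```python
import os
from fnmatch import fnmatchcase
from pathlib import Path

# Hypothetical policy: first matching pattern wins, anything unmatched is
# treated as canonical so that unknown paths are never proposed for removal.
RETENTION_RULES = [
    ("*/log/*", "ephemeral"),
    ("*/patching/benchmark/*", "ephemeral"),
    ("*/center_cropped/*", "rebuildable"),
    ("*/preprocessing/*", "rebuildable"),
    ("*/patching/images/*", "rebuildable"),
    ("*/label_to_sparse/*", "rebuildable"),
]


def classify(rel_path: str, rules=RETENTION_RULES) -> str:
    """Return the retention class for a POSIX-style path relative to the archive."""
    for pattern, retention_class in rules:
        if fnmatchcase(rel_path, pattern):
            return retention_class
    return "canonical"


def dry_run_report(archive_root: Path, rules=RETENTION_RULES) -> dict[str, int]:
    """Sum file sizes in bytes per retention class. Nothing is modified."""
    totals = {"canonical": 0, "rebuildable": 0, "ephemeral": 0}
    for dirpath, _dirnames, filenames in os.walk(archive_root):
        for name in filenames:
            path = Path(dirpath) / name
            rel = path.relative_to(archive_root).as_posix()
            # lstat() so a symlink counts as the link itself, not its target.
            totals[classify(rel, rules)] += path.lstat().st_size
    return totals
```

Run against the pilot dataset's parent directory, the `ephemeral` and `rebuildable` totals give an upper bound on what a policy-driven cleanup would reclaim.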