UTSE Cytotoxicity Archive Audit (2026-03-11)
Scope: /users/kir-fritzsche/oyk357/archive/utse_cyto
Task Status
Low-risk removal: completed on 2026-03-11.
Removed targets:
dataset-level log/ directories across cyto, wt, and cancer_only
cyto/2023_11_21_.../patching/benchmark/
Verified removed: 11/11 paths.
Estimated reclaimed space from pre-cleanup measurements: ~3.8G.
Executive Findings
Archive layout is broadly compatible with the distributed pipeline stage model (preprocessing, segmentation, tracking, postprocessing, log).
Storage pressure is concentrated in a few very large intermediate trees, especially under cyto/2023_11_24....
Some stage outputs appear duplicated under alternate naming conventions (Cellpose vs Cellpose_Cancer/Cellpose_TCell, Ultrack vs split channel folders).
Provenance artifacts expected by the current benchmark/dataops workflow (provenance_manifest.json, acceptance_report.json, run_metadata.json) were not found in this archive.
tcell_only/ is empty and should be retained as a reserved target for planned future batch runs (not treated as a cleanup target).
Dataset Size Snapshot
Measured with du -sh on 2026-03-11.
Cohort totals
| Cohort | Dataset | Size |
|---|---|---|
| cyto | | 102G |
| cyto | | 1.6T |
| cyto | | 377G |
| cyto | | 728G |
| wt | | 631G |
| wt | | 92G |
| wt | | 84G |
| cancer_only | | 404G |
| cancer_only | | 277G |
| cancer_only | | 59G |
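The du -sh figures above could be cross-checked from Python with a sketch like the following. It sums apparent file sizes, whereas du reports allocated disk blocks by default, so the two can differ (notably for sparse or very small files). The function names tree_size_bytes and human are illustrative and not part of the pipeline.

```python
import os
from pathlib import Path


def tree_size_bytes(root: Path) -> int:
    """Sum apparent file sizes under root without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.lstat(path).st_size
            except OSError:
                # File vanished or is unreadable; skip it rather than abort.
                continue
    return total


def human(n: float) -> str:
    """Format a byte count with binary units, roughly in the style of du -h."""
    for unit in ("B", "K", "M", "G"):
        if n < 1024:
            return f"{n:.1f}{unit}"
        n /= 1024
    return f"{n:.1f}T"
```

Calling human(tree_size_bytes(Path(...))) on each dataset root gives a table comparable to the one above, subject to the apparent-size caveat.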
Largest dataset decomposition (cyto/2023_11_24_...)
| Path | Size |
|---|---|
| | 630G |
| | 360G |
| | 279G |
| | 192G |
| | 93G |
| | 43G |
| | 15G |
| | 1005M |
Subcomponents showing likely duplication:
| Path | Size |
|---|---|
| | 5.8G |
| | 4.6G |
| | 3.7G |
| | 9.6G |
| | 48G |
| | 35G |
Alignment With Current Distributed Workflow
Aligned
Stage-based output hierarchy exists and is readable by current module-level expectations.
Patch-based logs (log/<stage>/<tag>/patch_*) are present and consistent with SLURM array execution style.
Archive is exposed in-repo via symlink: distributed/utse_cyto -> /users/kir-fritzsche/oyk357/archive/utse_cyto/.
Misaligned or Legacy
Mixed naming for equivalent stage outputs increases ambiguity for downstream scripts and cleanup logic.
Benchmark/intermediate payloads are embedded inside dataset trees (for example patching/benchmark/) instead of a single reproducible benchmark bundle contract.
No per-run provenance bundle found in archive outputs.
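The missing-provenance finding could be re-checked per dataset with a small script. The sketch below is only illustrative: the three file names come from this audit, but the assumption that they would sit directly in each dataset root, and the function names, are hypothetical.

```python
from pathlib import Path

# File names as listed in this audit; their expected location (the dataset
# root) is an assumption made for this sketch.
EXPECTED = (
    "provenance_manifest.json",
    "acceptance_report.json",
    "run_metadata.json",
)


def missing_provenance(dataset_root: Path) -> list[str]:
    """Return the expected artifact names that are absent from dataset_root."""
    return [name for name in EXPECTED if not (dataset_root / name).is_file()]


def audit_cohort(cohort_dir: Path) -> dict[str, list[str]]:
    """Map each dataset directory in a cohort to its missing artifacts."""
    report = {}
    for dataset in sorted(p for p in cohort_dir.iterdir() if p.is_dir()):
        missing = missing_provenance(dataset)
        if missing:
            report[dataset.name] = missing
    return report
```

Datasets with a complete set of artifacts are omitted from the report, so an empty dictionary means the cohort is fully covered.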
Cleanup and Retention Recommendations
1) Keep (do not remove)
Keep tcell_only/ as reserved namespace for upcoming batches when compute/storage allows.
Keep canonical stage outputs required for paper reruns and traceability.
2) Low-risk removals (immediate)
Remove stale log/ trees after compressing or exporting the last successful run logs. Reclaim example: ~1G in cyto/2023_11_24....
Remove transient benchmark job scratch once benchmark summary/provenance is captured. In 2023_11_21.../patching/benchmark/, repeated register_denoise_*jobs folders are each ~462M.
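Compressing a log tree before removal, as recommended above, might look like the following sketch. It uses only the standard library; the function name and the verification rule (every regular file must appear in the archive) are assumptions for illustration, not an existing pipeline tool. The archive path must lie outside log_dir.

```python
import shutil
import tarfile
from pathlib import Path


def archive_then_remove(log_dir: Path, archive_path: Path, dry_run: bool = True) -> bool:
    """Compress log_dir into archive_path, verify it, and optionally delete log_dir.

    Returns True only when the archive lists every regular file under log_dir.
    With dry_run=True the archive is still written but nothing is deleted.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(log_dir, arcname=log_dir.name)

    # Regular files on disk, expressed as the names they should have in the archive.
    expected = {
        (Path(log_dir.name) / p.relative_to(log_dir)).as_posix()
        for p in log_dir.rglob("*")
        if p.is_file() and not p.is_symlink()
    }
    with tarfile.open(archive_path, "r:gz") as tar:
        archived = {m.name for m in tar.getmembers() if m.isfile()}

    if archived != expected:
        return False  # Leave the tree in place if anything is unaccounted for.
    if not dry_run:
        shutil.rmtree(log_dir)
    return True
```

Defaulting to dry_run=True means the destructive step has to be requested explicitly, which suits a shared archive.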
3) Medium-risk removals (after one-to-one validation)
Consolidate segmentation naming to one canonical namespace and drop duplicate channel-split trees if byte/metadata equivalent. Potential reclaim in cyto/2023_11_24...: up to ~8.3G (consistent with the 4.6G and 3.7G entries in the duplication table above).
Consolidate tracking naming similarly (Ultrack vs split variants). Potential reclaim in cyto/2023_11_24...: up to ~83G (consistent with the 48G and 35G entries in the duplication table above).
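One way to carry out the one-to-one validation is to compare SHA-256 digests of files at matching relative paths. The sketch below assumes both trees share the same relative layout; a combined tree and its channel-split counterparts may need a path translation step first, which is not shown. All function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large images do not fill memory."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_index(root: Path) -> dict[str, str]:
    """Map each file's path relative to root onto its content digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(canonical: Path, candidate: Path) -> dict[str, list[str]]:
    """Report files unique to either tree and files whose bytes differ."""
    a, b = tree_index(canonical), tree_index(candidate)
    return {
        "only_in_canonical": sorted(a.keys() - b.keys()),
        "only_in_candidate": sorted(b.keys() - a.keys()),
        "content_differs": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

A candidate tree would qualify for removal under this check only when all three lists are empty. Metadata equivalence (timestamps, ownership) is a separate question that this sketch does not address.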
4) High-impact policy decisions (requires sign-off)
Decide whether all intermediate tiers must be retained simultaneously (center_cropped, preprocessing, patching/images, label_to_sparse).
If reproducible from upstream snapshots/config, these tiers are prime storage-reduction targets (hundreds of GB each).
Operational Improvements (DataOps)
Enforce a per-run manifest in each dataset root: run_metadata.json, config hash, git SHA, environment lock, snapshot ID.
Add retention classes per output path: canonical, rebuildable, ephemeral.
Standardize stage folder naming and deprecate alternates.
Keep benchmark artifacts in a dedicated benchmark bundle root, not mixed into dataset production trees.
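A per-run manifest along these lines could be written with a few lines of standard-library Python. This is a sketch under assumptions: the JSON key names, the choice of SHA-256, and the idea of hashing the environment lock file rather than embedding it are all illustrative, and the git SHA and snapshot ID are taken as plain strings supplied by the caller.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of a small text file such as a config or lock file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_run_metadata(
    dataset_root: Path,
    config_path: Path,
    env_lock_path: Path,
    git_sha: str,
    snapshot_id: str,
) -> Path:
    """Write run_metadata.json into dataset_root and return its path."""
    manifest = {
        # Key names are hypothetical; the audit only lists what to record.
        "config_sha256": sha256_of(config_path),
        "environment_lock_sha256": sha256_of(env_lock_path),
        "git_sha": git_sha,
        "snapshot_id": snapshot_id,
    }
    out = dataset_root / "run_metadata.json"
    out.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
    return out
```

Sorting the keys keeps the file byte-stable across runs with identical inputs, which makes manifests easy to diff.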
Suggested Next Actions (Ordered)
Freeze and tag one canonical output naming scheme for segmentation/tracking.
Add a dry-run cleanup script to report reclaimable space by retention class.
Migrate benchmark outputs to provenance bundles and remove legacy job scratch by policy.
Apply cleanup to one pilot dataset (cyto/2023_11_24...) before rolling out archive-wide.
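The dry-run cleanup report suggested above might start from something like the following sketch. It only reads the tree and never deletes anything. The mapping from folder name to retention class is hypothetical and would need to be agreed as part of the retention policy; folders without a rule fall into an "unclassified" bucket so they are not silently treated as removable.

```python
import os
from collections import defaultdict
from pathlib import Path

# Hypothetical rules for illustration only; the real assignment of stage
# folders to retention classes is a policy decision, not stated in this audit.
EXAMPLE_RULES = {
    "log": "ephemeral",
    "patching": "rebuildable",
    "segmentation": "canonical",
}


def dir_bytes(root: Path) -> int:
    """Sum apparent file sizes under root without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # Unreadable or vanished file; skip it.
    return total


def reclaim_report(dataset_root: Path, rules: dict[str, str]) -> dict[str, int]:
    """Total bytes per retention class for the top-level folders of a dataset."""
    totals: dict[str, int] = defaultdict(int)
    for child in sorted(dataset_root.iterdir()):
        if child.is_dir():
            totals[rules.get(child.name, "unclassified")] += dir_bytes(child)
    return dict(totals)
```

Running reclaim_report on the pilot dataset first would show how much space the ephemeral and rebuildable classes hold before any removal is scheduled.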