Switches phase1 default to the paired/train splits so gqa, flickr, and dotav1 get proper val coverage, and resamples dotav1/soda-a val sizes to match their per-source train share. Also reverts patience to 20 and phase2 pretrained back to best.pt.
Set phase1 patience=200 to avoid early stop on slow-drift epochs and load phase2 from last.pt instead of best.pt, matching UNIC/DUNE/EdgeCrafter which train fixed epochs and use the final checkpoint.
Reference docstring lives in callbacks/distill_aug.py:classify_augmentations_distill, not the reverted ultralytics/data/augment.py path. Follow-up to 79dd79181 which fixed the same stale pointer in train_image_encoder.py.
The recipe set warmup_epochs=18 to match DINOv3's 16pct ratio at 114 ep, but
the runner scales warmup by batch/512 so at batch=1024 the effective warmup
became 36 ep (31pct of training). That broke direct comparison with the
existing 7-source runs, which use 2 effective warmup epochs.
Setting warmup_epochs=1 keeps the post-scaling value at 2, matching the
running runs. Other dinov3 axes (lr, wd schedule, augs, grad_clip) unchanged.
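The scaling interaction above, as a minimal sketch (the helper name is hypothetical; the runner's rule is warmup ∝ batch/512):

```python
def effective_warmup(warmup_epochs: float, batch: int, base_batch: int = 512) -> float:
    """Warmup epochs after the runner's linear batch scaling
    (illustrative helper, not the runner's actual function)."""
    return warmup_epochs * batch / base_batch

# recipe warmup_epochs=18 at batch=1024 doubles to 36 effective epochs;
# warmup_epochs=1 keeps the post-scaling value at 2.
```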
Motivation
fastvit-s x adaptor diverges at full scale on 7-source training (final knn
5.9%, chance-level). Forensic smoke runs ruled out norm hot-swap, a beta2 sweep,
fixed-wd changes, and BN running-stat freezes. Two recipe-level mismatches
with DINOv3 / EUPE / UNIC / DUNE distillation papers remained:
* our pipeline still pulls the Ultralytics defaults RandAugment + RandomErasing
0.4 from cfg/default.yaml, while every reference recipe disables both
and instead uses ColorJitter + Grayscale + GaussianBlur + Solarize;
* we use fixed weight_decay 0.02 with ~1pct warmup, while DINOv3 ramps
wd 0.04 -> 0.2 over training and warms up for 16pct of epochs.
What changed
callbacks/distill_aug.py: classify_augmentations_distill, sibling to
ultralytics/data/augment.py:classify_augmentations. Same signature plus
grayscale, gaussian_blur, solarize knobs (default 0.0 = bit-equivalent
to upstream). Order mirrors UNIC main_unic.py:485-521. Kept out of
ultralytics/data/ to avoid touching the upstream cls training pipeline.
callbacks/wd_schedule.py: half-cosine wd ramp matching DINOv3
dinov3/optim/schedulers.py CosineSchedule, registered DDP-safe inside
the trainer __init__ (per utils/dist.py:79 callbacks-on-rank-0 footgun).
ultralytics/cfg/__init__.py: extend allowed_custom_keys with wd_end,
grayscale, gaussian_blur, solarize so DDP arg serialisation passes.
ultralytics/models/yolo/classify/train_image_encoder.py: switch
_build_transforms to classify_augmentations_distill and forward the
three new self.args knobs; register wd_schedule callback when wd_end > 0.
run_enc_distill_phase1.py: new dinov3 recipe (lr0=2e-4, wd 0.04->0.2,
warmup 18 ep, ColorJitter 0.4/0.4/0.2/0.1, grayscale 0.2, blur 0.5,
solarize 0.2, auto_augment off, erasing off) plus override forwarding.
Existing default / eupe / radio / unic recipes untouched.
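The wd ramp in callbacks/wd_schedule.py can be sketched; this assumes the DINOv3-style half-cosine shape from 0.04 to 0.2 (function name and exact parameterisation are illustrative):

```python
import math

def wd_half_cosine(step: int, total_steps: int,
                   wd_start: float = 0.04, wd_end: float = 0.2) -> float:
    """Half-cosine ramp from wd_start to wd_end over training (sketch)."""
    t = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return wd_end - 0.5 * (wd_end - wd_start) * (1.0 + math.cos(math.pi * t))
```

Returns wd_start at step 0 and wd_end at the final step, with the fastest growth mid-training.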
Previous callbacks/vit_modules.py monkey-patched parse_model (162-line
verbatim copy + one extra elif). That broke under DDP because the worker
cwd is USER_CONFIG_DIR/DDP/, so the runner-local callbacks package is
off sys.path. Import the blocks directly in tasks.py and fold them into
the AIFI elif that prepends ch[f].
Runner-side model.add_callback() was silently dropped on DDP workers, so
grad_clip, beta2 and nfs_sync never ran. Register the hooks inside
ImageEncoderTrainer so they run on every rank. Also import vit_modules at the
top of the trainer module so FastViT/SimpleViT YAMLs parse in DDP workers too.
Replace target-param comments in yolo26-{fastvit,simplevit}-cls.yaml with
measured params, ONNX node counts, and TRT fp16 latency from the 2026-04-23
export sweep (all 4 variants <=1.5x the yolo26s-cls conv baseline).
Note PaddlePaddle op-coverage gap and the RKNN torch-downgrade trap so future
sweeps skip them, and clarify that the 1327-node figure in MHSABlock refers
to the AIFI ViT, not these architectures.
Current MLP adaptor + CLS+patch-only supervision yields a 14pp kNN
gain but only +0.24pp COCO100 over the CE baseline (a tie within noise).
Detection reads raw L3/L5/L10 while distill supervises a per-teacher
MLP after the final stage, so the supervised features never reach
the detection path.
distill_path in {adaptor (default), feat_map}: feat_map routes
student L3/L5/L8 to teacher final-block tokens via 1x1 Conv per
scale with MSE, landing gradients on the same layers detection
reads (EdgeCrafter-style path alignment).
adaptor_arch in {mlp (default), linear}: linear replaces the
2-layer Linear-LN-GELU-Linear MLP with a single
Linear(in, out, bias=False). EdgeCrafter argues heavy projections
absorb the student-teacher mismatch instead of forcing it into
the backbone where detection can benefit.
loss_items tensor shape is invariant (3,) across all four combos,
so WandB plots overlay across modes. Both args registered in
allowed_custom_keys (DDP-safe). Resume guard refuses silent
switches of either arg across restart.
Defaults reproduce prior behaviour bit-identical.
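The feat_map path above can be sketched as follows; channel counts and the teacher dim are illustrative, not the actual model's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatMapDistill(nn.Module):
    """Sketch of distill_path=feat_map: one 1x1 Conv per student scale
    projects onto the teacher token dim, and MSE lands gradients on the
    layers detection reads. adaptor_arch=linear is the analogous single
    bias-free projection in place of the 2-layer MLP."""
    def __init__(self, student_chs=(128, 256, 512), teacher_dim=768):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Conv2d(c, teacher_dim, kernel_size=1, bias=False) for c in student_chs
        )

    def forward(self, student_feats, teacher_feats):
        # one MSE term per scale, summed into a single scalar loss
        return sum(
            F.mse_loss(proj(s), t)
            for proj, s, t in zip(self.proj, student_feats, teacher_feats)
        )
```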
Replace unused AIFI student (12.8x slower than conv baseline at
bs=1 fp16, 1327 ONNX nodes). FastViT-S benches 1.07ms / 228 nodes,
actually faster than yolo26s-cls conv baseline (1.83ms / 234).
SimpleViT-S aligns 14x14 tokens with EUPE-ViT-B at 224px, which
lets feat_map distillation with adaptor_arch=linear collapse to
identity + projection.
Custom modules live in ultralytics/nn/modules/vit_blocks.py
(FastViTBlock, MHSABlock). Registration into parse_model goes
through callbacks/vit_modules.py, which copies parse_model
verbatim and adds one elif branch to prepend ch[f] for these
modules; avoids editing ultralytics/nn/tasks.py.
Simple-component constraint only: Conv2d, BatchNorm2d, LayerNorm,
GELU, Linear, F.scaled_dot_product_attention (no nn.MultiheadAttention,
no 2D RoPE) so ONNX/TRT/CoreML/TFLite export cleanly.
Scales yolo26{s,l}-{fastvit,simplevit}-cls: s ~5-7M, l ~15M params.
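A minimal attention block built from only the allowed primitives might look like this (a sketch of the constraint, not the repo's MHSABlock; dims are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleAttention(nn.Module):
    """MHSA from export-friendly primitives only: Linear + SDPA,
    no nn.MultiheadAttention, no 2D RoPE."""
    def __init__(self, dim: int = 384, heads: int = 6):
        super().__init__()
        self.h, self.d = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, N, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).view(B, N, 3, self.h, self.d).permute(2, 0, 3, 1, 4)
        out = F.scaled_dot_product_attention(q, k, v)  # (B, h, N, d)
        return self.proj(out.transpose(1, 2).reshape(B, N, -1))
```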
Lets users train from random init so pretrained-backbone runs can be compared against a no-pretraining control, quantifying the net contribution of the pretraining stage to final downstream accuracy.
Completes OBB coverage for encoder distillation downstream eval alongside coco det/pose and imagenet cls. Params mirror the canonical yolo26s-obb.pt (bs=32, nbs=64, lr0=0.00125, imgsz=1024, degrees=180, MuSGD muon_w=0.5) so baseline runs are directly comparable to the paper's 54.8 mAP reference, using the same --batch/--lr/--nbs linear scaling as coco_det_finetune.
Scales lr/nbs/warmup linearly from canonical bs=128/nbs=64/lr0=0.00038 so wd_eff and lr/sample stay invariant. Adds _COCO_DET_MODES constant and per-mode flag semantics in docstring.
Phase 2c pose runs were blocked because the runner had no pose branch;
adds coco_pose_finetune (data=coco-pose.yaml, MuSGD, pose=24, kobj=4.0)
that infers the -pose yaml from the phase1 cls yaml.
Aligns coco_det_finetune args with the published yolo26s.pt detection
recipe so phase2 coco runs match the official model's training setup.
Previously the branch drifted (missing nbs=64, cos_lr=False,
warmup_momentum/bias_lr, box/cls/dfl weights, randaugment, cutmix,
copy_paste_mode, translate/degrees/shear/hsv/erasing, muon_w=0.4355),
which made backbone comparisons against the 30.18 mAP CE baseline hard
to interpret. sgd_w/cls_w/o2m/detach_epoch from the reference aren't
accepted by this checkout's cfg validator, so only the exposed subset
is applied.
Renames modes with task prefixes so logs and wandb groups are
unambiguous: finetune -> inet_finetune, linear -> inet_linear_probe,
adamw_ft -> inet_adamw_finetune, coco_det(_frozen) ->
coco_det_finetune(_frozen). The muon_w=0.1 callback is now gated to
inet_finetune only; coco det uses muon_w=0.4355 from the published
recipe.
Ultralytics scales wd_eff with batch*accumulate/nbs but never scales lr0, so larger
global batches silently drift from the recipe's intended dynamics. The new flag takes
a per-GPU batch, computes global = per_gpu * world_size, and derives lr0, nbs, and
warmup_epochs from scale = max(1, global / NBS_CANONICAL=512) so wd_eff stays at the
recipe value while per-sample lr and optimizer-step warmup count are invariant.
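The derivation can be sketched as below; the helper name and default values are illustrative, not the flag's actual signature:

```python
def scale_recipe(per_gpu_batch: int, world_size: int,
                 lr0: float = 0.00038, nbs: int = 64, warmup_epochs: float = 3.0,
                 nbs_canonical: int = 512) -> dict:
    """Sketch of the scaling rule. wd_eff ~ batch/nbs stays at the recipe
    value because nbs scales with the global batch; per-sample lr stays
    fixed because lr0 scales linearly; the warmup optimizer-step count
    stays fixed because warmup_epochs grows as steps-per-epoch shrink."""
    global_batch = per_gpu_batch * world_size
    scale = max(1, global_batch / nbs_canonical)
    return dict(lr0=lr0 * scale,
                nbs=round(nbs * scale),
                warmup_epochs=warmup_epochs * scale)
```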
_resolve_paths' flat-dir fallback returned (p, p), which in
multi-source mode duplicated train files into the val
ConcatDataset: on the 7-source mix 844,176 of 899,176 val
samples (93.9%) were just re-enqueued train files, making val
loss meaningless as a held-out signal. Regression introduced
when multi-path support was added in 1aea2f95c.
Resolver now returns (train, None) when no held-out val is
discoverable, and additionally swaps the last `train` path
segment for `val` to auto-rescue deep layouts like
.../images/train → .../images/val (recovers O365 30k, DOTA
5,297 held-out without caller changes). get_dataset filters
None so flat sources (GQA, Flickr, SODA) drop cleanly from
val instead of polluting it.
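The rescue rule amounts to a path-segment swap; a minimal sketch with a hypothetical helper name (the real resolver additionally checks that the candidate exists):

```python
from pathlib import Path
from typing import Optional

def swap_train_for_val(path: Path) -> Optional[Path]:
    """Swap the last 'train' segment for 'val'; None means no held-out
    split is derivable and the source is dropped from val."""
    parts = list(path.parts)
    for i in range(len(parts) - 1, -1, -1):
        if parts[i] == "train":
            return Path(*parts[:i], "val", *parts[i + 1:])
    return None  # flat source: filtered out of the val ConcatDataset
```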
Ultralytics check_resume (trainer.py:841) restores the checkpoint's data path verbatim and does not honor caller overrides, so cross-host resumes where the dataset lives at a different mount point (e.g. the ultra5 NFS outage) previously needed a manual torch.load/save dance to rewrite train_args. Mirrors the existing name/device override branches so one helper call covers all four non-whitelisted fields (project, name, save_dir, data).
Phase2 had hardcoded per-mode lr0 (0.1 for MuSGD finetune, 1e-3 for AdamW), with no way to change it at launch without editing the file. Mirrors phase1's _pop_flag pattern so users can sweep learning rates or drop lr on resume runs that are diverging. CLAUDE.md already documented phase2 as supporting --lr; this makes the doc true.
Add callbacks.paths with run_paths() and patch_resume() helpers so fresh runs land on clean W&B project yolo-next-encoder while save_dir stays absolute local, and resumes auto-patch train_args to survive cross-machine / relocated launches without manual checkpoint edits.
Adds callbacks.wandb_config.fork_and_attach, which pre-creates a forked wandb run (native fork_from or a manual API-replay fallback) and hands off to DDP rank-0 via the WANDB_RUN_ID+WANDB_RESUME env vars. phase1/phase2 gain an explicit module-level assert that LOCAL_PROJECT is absolute under /home/, and a --fork_from <parent_id>:<step> flag that invokes the helper before model.train(). Native fork is currently gated behind a wandb private preview, so the default path is API-replay; smoke-tested end-to-end with subprocess handoff.
Ultralytics check_resume overwrites args.project from the ckpt (only whitelisted keys can override), so resuming a legacy NFS ckpt keeps save_dir on NFS. nfs_sync now warns without raising on an NFS save_dir and wraps the final sync, and phase1/phase2 pin project=LOCAL_PROJECT so fresh runs land on local SSD explicitly.
Writing save_dir to local SSD and rsyncing to NFS every 10min decouples training from NFS availability, avoiding a repeat of the C2-o365-coco-inet crash where a stale NFS mount destroyed the resumed run EMA state.
UNIC trains on ImageNet-1k (main_unic.py:97), DUNE on IN-19k+GLDv2+
Mapillary (data/dino2.py). Comma-separated data paths now supported
for combining ImageFolder datasets via ConcatDataset. Loss args
(cos_weight, l1_weight, cls_l1) passed from trainer to model.
UNIC (unic/modeling/losses.py:54) and DUNE (dune/model/losses.py:62)
apply 0.5cos+0.5L1 to both CLS and patches, vs our EUPE-style
cosine-only CLS. Configurable via train args for ablation testing.
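The combined objective, as a sketch (arg names mirror the cos_weight/l1_weight train args; applying it to CLS and patch tokens alike is the UNIC/DUNE behaviour described above):

```python
import torch
import torch.nn.functional as F

def distill_loss(student: torch.Tensor, teacher: torch.Tensor,
                 cos_weight: float = 0.5, l1_weight: float = 0.5) -> torch.Tensor:
    """0.5*cos + 0.5*L1 over token embeddings (..., N, D) — a sketch,
    not the repo's exact loss module."""
    cos = 1.0 - F.cosine_similarity(student, teacher, dim=-1).mean()
    l1 = F.l1_loss(student, teacher)
    return cos_weight * cos + l1_weight * l1
```

Setting cos_weight=1.0, l1_weight=0.0 recovers the EUPE-style cosine-only objective for ablation.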
kNN eval now runs inside ImageEncoderTrainer.validate() directly.
The callback closure is no longer needed. extract_features and
knn_accuracy remain as utilities for run_knn_eval.py standalone eval.
Remove knn_callback import and model.add_callback call. Pass
knn_eval=/data/shared-datasets/imagenet in train_args instead,
which ImageEncoderTrainer reads in _setup_train (DDP-safe).
Move kNN eval from external callback (lost in DDP subprocess) to trainer
validate() override. Enabled via knn_eval=<imagenet_path> in train_args,
which survives DDP serialization through allowed_custom_keys. Caches
dataloaders across epochs, runs every 5 epochs on rank 0 only.
Remap caused 17.77% mAP vs 28.02% without it. phase2-coco-d5 was
invalid (used remap unintentionally). Revert to standard pretrained=
which transfers backbone layers 0-8 via intersect_dicts.
Load distilled cls weights with C2PSA index remapping (cls model.9 ->
det model.10) in coco_det mode. Tested: remap transfers 228 vs 192
params but produced worse mAP (17.77% vs 28.02%) due to activation
magnitude mismatch. Kept for future investigation with scaling fix.
Takes a run directory, finds weights and model config from args.yaml,
runs kNN evaluation (k=20, T=0.07), and optionally updates the finished
WandB run summary with knn/top1 via --wandb flag.
Add aggregated cls_cos, patch_cos, patch_l1 metrics averaged across
teachers for cross-run comparison in WandB. Define epoch-based x-axis
via wandb.define_metric so backfilled and new runs align.
The else branch was missing, so cls model.9 (C2PSA) BN keys were both
remapped to model.10 AND kept as model.9. intersect_dicts shape-matched
6 C2PSA BN stats into SPPF (det model.9), corrupting initialization.
phase2-coco-d1-remap showed 11pp deficit vs non-remap at ep58.
cls model.9 (C2PSA) maps to det model.10 (C2PSA) due to SPPF insertion
at det model.9. Remaps keys before intersect_dicts so C2PSA weights
transfer correctly (+42 params over standard loading).
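The corrected remap amounts to a key rewrite before intersect_dicts; a minimal sketch with a hypothetical helper name:

```python
def remap_cls_to_det(cls_state: dict) -> dict:
    """cls model.9 (C2PSA) -> det model.10, since SPPF occupies det model.9.
    Keys are moved, never duplicated, so intersect_dicts cannot
    shape-match C2PSA BN stats into SPPF."""
    out = {}
    for k, v in cls_state.items():
        if k.startswith("model.9."):
            out["model.10." + k[len("model.9."):]] = v
        else:  # the originally-missing branch: copy everything else unchanged
            out[k] = v
    return out
```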
RADIO/EUPE protocol (k=20, T=0.07): extract L2-normalized CLS features,
temperature-weighted voting via scatter_add_. Includes callback for
on_fit_epoch_end integration with WandB logging.
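The voting step, as a sketch of the stated protocol (function name and the explicit num_classes arg are illustrative):

```python
import torch
import torch.nn.functional as F

def knn_predict(train_feats, train_labels, test_feats,
                num_classes: int, k: int = 20, T: float = 0.07):
    """RADIO/EUPE-style kNN: L2-normalised features, cosine similarity,
    temperature-weighted class voting accumulated with scatter_add_."""
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)
    sim = test_feats @ train_feats.T              # cosine similarity
    dist, idx = sim.topk(k, dim=1)                # k nearest neighbours
    weights = (dist / T).exp()                    # temperature weighting
    votes = torch.zeros(test_feats.size(0), num_classes)
    votes.scatter_add_(1, train_labels[idx], weights)
    return votes.argmax(dim=1)
```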