Model Training & Experimentation


13) Distributed Training

  • Trigger

    • Automatic: new golden_train/val/test.manifest from Workflow #11, or new auto-mined slice from #12.

    • Manual: engineer launches a training run from the UI/CLI, selecting dataset spec + model recipe.

    • Scheduled: nightly/weekly rebuild on a rolling data window.

  • Inputs

    • Curated dataset manifests + slices.yaml (from #11), plus DVC tag/version.

    • Model recipe (backbone, heads, loss weights, augmentations) and training config (train.yaml).

    • Pretrained weights (optional) for warm-start or continual learning.

    • Code container image (ECR) and commit SHA; W&B project/entity; environment secrets.

    • Hardware profile (e.g., p4d.24xlarge × N nodes) and storage profile (FSx for Lustre).

  • Core Steps

    • Packaging & Staging

      • Build/push Docker image with PyTorch, NCCL, CUDA, AMP, torchvision/OpenMMLab, and your repo.

      • Stage dataset to FSx for Lustre (for I/O throughput) using DVC or AWS DataSync; keep bulk in S3.

      • Pre-generate sharded WebDataset/TFRecord (optional) for faster streaming.
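
        A minimal sketch of this optional pre-sharding step, assuming a WebDataset layout; the iter_samples() helper, shard pattern, and shard size are illustrative, not part of the pipeline.

          # Sketch: pack curated-manifest samples into WebDataset tar shards for fast streaming.
          # Assumes iter_samples() yields (key, jpeg_bytes, label_dict) from the golden manifest.
          import json
          import webdataset as wds

          def write_shards(iter_samples, out_pattern="shards/golden-train-%06d.tar", maxcount=10_000):
              with wds.ShardWriter(out_pattern, maxcount=maxcount) as sink:
                  for key, jpeg_bytes, label in iter_samples():
                      sink.write({
                          "__key__": key,             # unique sample id from the manifest
                          "jpg": jpeg_bytes,          # raw encoded image bytes
                          "json": json.dumps(label),  # detection/lane/trajectory labels
                      })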

    • Orchestration

      • Launch SageMaker distributed training job (preferred) or EKS Job with torchrun (DDP).

      • Configure NCCL and networking (env: NCCL_DEBUG=INFO, NCCL_SOCKET_IFNAME=eth0, NCCL_ASYNC_ERROR_HANDLING=1).
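
        A sketch of what the launch could look like with the SageMaker PyTorch estimator; the image URI, IAM role, instance settings, and the torch_distributed distribution flag are assumptions to adapt to your account and SDK version.

          # Sketch: launch a multi-node torchrun/DDP training job on SageMaker (placeholders throughout).
          from sagemaker.pytorch import PyTorch

          estimator = PyTorch(
              entry_point="train.py",
              source_dir=".",
              role="arn:aws:iam::<account>:role/<training-role>",
              image_uri="<account>.dkr.ecr.<region>.amazonaws.com/trainer:<tag>",
              instance_count=4,
              instance_type="ml.p4d.24xlarge",
              distribution={"torch_distributed": {"enabled": True}},  # torchrun-based DDP launch
              environment={
                  "NCCL_DEBUG": "INFO",
                  "NCCL_SOCKET_IFNAME": "eth0",
                  "NCCL_ASYNC_ERROR_HANDLING": "1",
              },
              hyperparameters={"config": "train.yaml"},
          )
          estimator.fit({"train": "s3://<bucket>/golden/manifests/"}, wait=False)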

    • Training Runtime

      • Initialize DDP with backend=nccl, set torch.set_float32_matmul_precision("high") where supported.

      • Data loading: multi-worker DataLoader + persistent workers, pinned memory, async prefetch.

      • Mixed precision (AMP) + gradient scaling; gradient accumulation for large effective batch sizes.

      • Task balancing for multi-head models (e.g., HydraNet-style): dynamic or curriculum weighting.

      • Checkpointing: local → FSx (fast) every N steps; periodic sync to S3 (s3://…/checkpoints/…).

      • Fault tolerance: resume from the last global step on spot interruption; save RNG states, optimizer state, and GradScaler state (see the sketch below).
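
        A condensed sketch of the runtime loop described above (NCCL/DDP init, AMP with gradient scaling, gradient accumulation, and resumable checkpoints); model construction and the dataloader are elided, and the forward pass is assumed to return a scalar loss.

          # Sketch: DDP + AMP training step with resumable checkpoints (model/dataloader setup elided).
          import os, random
          import numpy as np
          import torch
          import torch.distributed as dist
          from torch.nn.parallel import DistributedDataParallel as DDP

          def setup_ddp():
              dist.init_process_group(backend="nccl")      # torchrun provides RANK/WORLD_SIZE/LOCAL_RANK
              torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
              torch.set_float32_matmul_precision("high")   # TF32 matmuls where supported

          def save_checkpoint(path, step, model, optimizer, scaler):
              torch.save({
                  "step": step,
                  "model": model.module.state_dict(),      # unwrap the DDP wrapper
                  "optimizer": optimizer.state_dict(),
                  "scaler": scaler.state_dict(),
                  "rng": {"torch": torch.get_rng_state(),
                          "cuda": torch.cuda.get_rng_state_all(),
                          "numpy": np.random.get_state(),
                          "python": random.getstate()},
              }, path)

          def train(model, loader, optimizer, accum_steps=4, ckpt_every=1000):
              setup_ddp()
              model = DDP(model.cuda(), device_ids=[int(os.environ["LOCAL_RANK"])])
              scaler = torch.cuda.amp.GradScaler()
              for step, (images, targets) in enumerate(loader):
                  with torch.cuda.amp.autocast():
                      loss = model(images.cuda(non_blocking=True), targets) / accum_steps
                  scaler.scale(loss).backward()
                  if (step + 1) % accum_steps == 0:
                      scaler.step(optimizer)
                      scaler.update()
                      optimizer.zero_grad(set_to_none=True)
                  if dist.get_rank() == 0 and (step + 1) % ckpt_every == 0:
                      save_checkpoint(f"/fsx/ckpts/step_{step + 1}.pt", step + 1, model, optimizer, scaler)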

    • Evaluation & Gating

      • After each epoch: run full offline eval on val + key slices; compute mAP/mIoU, per-class AP, trajectory ADE/FDE, lane F1, etc.

      • Produce slice dashboards (per weather/time/road class) and regression checks against last prod model.

      • Latency/throughput microbenchmarks: TorchScript export + dummy inference to report p50/p95 latency.
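
        A sketch of the latency microbenchmark: trace the model with TorchScript and time dummy-input inference to report p50/p95; the input shape and warmup/iteration counts are assumptions.

          # Sketch: TorchScript export + dummy-input inference reporting p50/p95 latency in ms.
          import time
          import numpy as np
          import torch

          @torch.no_grad()
          def latency_benchmark(model, input_shape=(1, 3, 1080, 1920), warmup=20, iters=200):
              model = model.eval().cuda()
              dummy = torch.randn(*input_shape, device="cuda")
              scripted = torch.jit.trace(model, dummy)     # TorchScript export used for the benchmark
              for _ in range(warmup):                      # warm up kernels before timing
                  scripted(dummy)
              torch.cuda.synchronize()
              times_ms = []
              for _ in range(iters):
                  start = time.perf_counter()
                  scripted(dummy)
                  torch.cuda.synchronize()                 # wait for the GPU before stopping the clock
                  times_ms.append((time.perf_counter() - start) * 1000.0)
              return {"latency_p50_ms": float(np.percentile(times_ms, 50)),
                      "latency_p95_ms": float(np.percentile(times_ms, 95))}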

    • Logging & Metadata

      • Log all metrics/artifacts to Weights & Biases: run config, gradients, losses, images/videos, PR/ROC curves.

      • Register artifacts: dataset version (DVC hash), code SHA, Docker digest, checkpoints, exported models.
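
        A sketch of the W&B logging and artifact-registration calls; the project/entity names, metric keys, and metadata fields are placeholders.

          # Sketch: log run metrics and register the checkpoint as a W&B artifact (placeholder names).
          import wandb

          def log_and_register(metrics, ckpt_path, provenance):
              # provenance: e.g. {"dataset_dvc_hash": ..., "code_sha": ..., "image_digest": ...}
              run = wandb.init(project="perception-train", entity="<team>", config=provenance)
              wandb.log(metrics)                               # e.g. {"val/mAP": ..., "train/loss": ...}
              ckpt = wandb.Artifact("perception-checkpoint", type="model", metadata=provenance)
              ckpt.add_file(ckpt_path)
              run.log_artifact(ckpt)
              run.finish()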

    • Testing inside the run (hard gates)

      • Sanity checks: one forward+backward batch before training; fail fast on NaNs or exploding loss (sketched after this block).

      • Determinism smoke test (seeded) on a small shard; enforce tolerance bands on metrics.

      • Data drift guard: compare batch feature stats vs. training reference (Evidently profile) and warn/abort on severe drift.
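
        A sketch of the sanity gate noted above: one forward+backward pass on a single batch, aborting on non-finite or exploding loss/gradients; the max_loss threshold and the loss-returning forward signature are assumptions.

          # Sketch: fail fast before training if a single forward+backward batch misbehaves.
          import torch

          def sanity_check(model, batch, targets, max_loss=1e4):
              model.train()
              loss = model(batch, targets)                 # assumes forward returns a scalar loss
              if not torch.isfinite(loss):
                  raise RuntimeError("Sanity check failed: non-finite loss")
              if loss.item() > max_loss:
                  raise RuntimeError(f"Sanity check failed: exploding loss {loss.item():.2f}")
              loss.backward()
              for name, p in model.named_parameters():
                  if p.grad is not None and not torch.isfinite(p.grad).all():
                      raise RuntimeError(f"Sanity check failed: non-finite gradient in {name}")
              model.zero_grad(set_to_none=True)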

    • Post-Run Actions

      • Generate model card (data provenance, metrics, caveats) and attach to W&B run & S3.

      • If gates pass, push the exported model (TorchScript/ONNX) to the artifact store and Model Registry (e.g., SageMaker Model Registry or a W&B Artifacts promoted alias); the export step is sketched after this block.

      • Emit event to orchestration (Airflow/Step Functions) to trigger #14 HPO (if configured) or #15 Eval/Sim (next workflow).
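
        A sketch of the export step referenced above (TorchScript trace plus ONNX export); the input shape, opset version, and tensor names are assumptions.

          # Sketch: export the best checkpoint to TorchScript and ONNX ahead of registry promotion.
          import torch

          def export_model(model, out_dir, input_shape=(1, 3, 1080, 1920), opset=17):
              model = model.eval().cpu()
              dummy = torch.randn(*input_shape)
              traced = torch.jit.trace(model, dummy)
              traced.save(f"{out_dir}/model.torchscript.pt")
              torch.onnx.export(model, dummy, f"{out_dir}/model.onnx",
                                opset_version=opset,
                                input_names=["images"],
                                output_names=["outputs"],
                                dynamic_axes={"images": {0: "batch"}})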

  • AWS/Tooling

    • SageMaker Training (DDP), ECR, S3, FSx for Lustre, CloudWatch Logs, IAM, optional EKS.

    • PyTorch (DDP/torchrun), NCCL, torch.cuda.amp, OpenMMLab/MMDetection (optional), Albumentations.

    • W&B for tracking/artifacts; DVC for dataset pinning; Evidently for drift profiles.

  • Outputs

    • Best checkpoint (.pt) plus TorchScript/ONNX exports; quantization-ready export or TensorRT plan (optional).

    • eval_report.json (overall + per-slice metrics, latency), confusion matrices, PR curves.

    • W&B run (summary, artifacts), model card (model_card.md), training logs.

    • Stored in S3 (Gold) under /models/<project>/<semver or run_id>/…, and linked in W&B Artifacts and the Model Registry.


14) Hyper-Parameter Optimization / Sweeps

  • Trigger

    • Auto: the training workflow marks a model as “candidate” that falls below target on one or more slices.

    • Scheduled: weekly sweeps on prioritized tasks (e.g., night/pedestrian detector).

    • Manual: engineer launches targeted sweep (e.g., loss weights for lane vs. detection).

  • Inputs

    • Same dataset manifests as #13 (or focused slice packs for the weak area).

    • Baseline config (train.yaml) with search space: LR/WD, warmup, aug policy, loss weights, NMS/score thresholds, backbone/neck options, EMA on/off, AMP level, batch size/accum steps.

    • Sweep strategy: Bayesian, Hyperband/ASHA, Random, or Population-Based Training (PBT) for long runs.

    • Resource budget: max trials, parallelism, GPU hours, early-stop policy, cost cap.

  • Core Steps

    • Orchestration & Budgeting

      • Create W&B Sweep config (YAML) with objective metric (e.g., val/mAP_weighted or multi-objective with constraints: maximize mAP_vehicle subject to latency_p95 < X ms and regression_Δslice < Y); a Python-API sketch follows this block.

      • Choose executor:

        • Kubernetes Jobs on EKS with the W&B agent (elastic parallelism), or

        • SageMaker multiple training jobs tagged to the sweep (agent inside container).

      • Enforce cost guardrails: kill/truncate low-performers via ASHA; set CloudWatch alarms on spend.
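
        A sketch of the sweep definition expressed through the W&B Python API (equivalent to the YAML form); the metric name, search space, and Hyperband setting are illustrative.

          # Sketch: define and start a W&B sweep (Bayesian search + Hyperband early termination).
          import wandb

          sweep_config = {
              "method": "bayes",
              "metric": {"name": "val/mAP_weighted", "goal": "maximize"},
              "parameters": {
                  "lr":           {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-2},
                  "weight_decay": {"values": [0.01, 0.05, 0.1]},
                  "loss_w_lane":  {"min": 0.5, "max": 2.0},
                  "ema":          {"values": [True, False]},
              },
              "early_terminate": {"type": "hyperband", "min_iter": 3},
          }

          sweep_id = wandb.sweep(sweep_config, project="perception-hpo", entity="<team>")
          # Each agent (EKS pod or SageMaker job) then pulls trials from the same sweep:
          # wandb.agent(sweep_id, function=train_trial, count=10)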

    • Trial Runtime

      • Each trial inherits the DDP setup from #13; hyperparameters are injected via env/CLI overrides (see the sketch after this block).

      • Log full telemetry to W&B: metrics, hyperparams, system stats (GPU util, memory), eval artifacts.

      • Early stopping on plateau or rule-based pruning (e.g., after 3 epochs if val/mAP < quantile Q).
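
        A sketch of the parameter-injection step from the Trial Runtime bullets: the sweep agent populates wandb.config, which is merged over the baseline train.yaml; launch_training() is a hypothetical entry point.

          # Sketch: merge sweep-injected hyperparameters over the baseline training config.
          import wandb
          import yaml

          def train_trial():
              run = wandb.init()                   # inside a sweep, the agent injects run.config
              with open("train.yaml") as f:
                  cfg = yaml.safe_load(f)
              cfg.update(dict(run.config))         # sweep hyperparameters override baseline values
              # launch_training(cfg)               # hypothetical: reuse the DDP runtime from #13
              run.finish()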

    • Validation & Gating (per trial)

      • Same eval battery as #13, including slice metrics and latency microbenchmarks.

      • Fairness/coverage checks: ensure no >K% regression on protected or safety-critical slices (gate sketched after this block).

      • Stability check (optional): re-run best few trials with a different seed on a small shard; require consistent ranking.
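
        A sketch of the slice-regression gate from the fairness/coverage check above, assuming per-slice metric dicts for the trial and the last production model; the 3% threshold is illustrative.

          # Sketch: fail a trial if any safety-critical slice regresses more than K% vs. the prod baseline.
          def slice_regression_gate(trial_metrics, prod_metrics, critical_slices, max_regression_pct=3.0):
              failures = []
              for slice_name in critical_slices:
                  prod, trial = prod_metrics[slice_name], trial_metrics[slice_name]
                  drop_pct = 100.0 * (prod - trial) / max(prod, 1e-9)
                  if drop_pct > max_regression_pct:
                      failures.append((slice_name, round(drop_pct, 2)))
              return len(failures) == 0, failures

          # Example: passed, failures = slice_regression_gate(trial_m, prod_m, ["night_pedestrian"])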

    • Selection & Promotion

      • The W&B sweep dashboard computes the leaderboard; export the Pareto front for multi-objective cases (sketched after this block).

      • Auto-package the Top-K models; re-evaluate on the holdout test set and a simulation smoke set.

      • Generate HPO report: hyperparameter importance analysis (e.g., SHAP or ICE plots of hyperparameter effects), budget used, recommended defaults.

      • Promote winner to Model Registry with tag candidate/<task>/<date>; attach sweep summary and config freeze.
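
        A sketch of the Pareto-front export for the multi-objective case (maximize mAP, minimize p95 latency), pulling finished runs through the W&B public API; the summary metric keys are assumptions.

          # Sketch: keep sweep runs that are Pareto-optimal over (val/mAP_weighted up, latency_p95_ms down).
          import wandb

          def pareto_front(sweep_path):                    # e.g. "<entity>/<project>/<sweep_id>"
              runs = wandb.Api().sweep(sweep_path).runs
              points = [(r.name, r.summary.get("val/mAP_weighted"), r.summary.get("latency_p95_ms"))
                        for r in runs]
              points = [p for p in points if p[1] is not None and p[2] is not None]
              front = []
              for name, m, lat in points:
                  dominated = any(m2 >= m and lat2 <= lat and (m2 > m or lat2 < lat)
                                  for _, m2, lat2 in points)
                  if not dominated:
                      front.append({"run": name, "mAP": m, "latency_p95_ms": lat})
              return front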

    • Learnings Back-Prop

      • Update default training config with new best hyperparams (per task/slice).

      • Persist tuned augmentation policies and loss weights into recipe library.

      • If all trials fail gating on a specific slice, emit a data request back to #8/#12 to mine more examples.

  • AWS/Tooling

    • EKS (Jobs + W&B agent) or SageMaker (parallel training jobs), S3, FSx, CloudWatch, EventBridge for triggers.

    • W&B Sweeps (Bayes/Random/Hyperband/PBT), PyTorch DDP, Ray Tune (optional backend for advanced schedulers).

    • Evidently (sanity drift checks during trials), Numba/TVM/TensorRT (optional latency constraint evaluators).

  • Outputs

    • sweep_report.md + sweep_summary.json (leaderboard, Pareto set, importance).

    • Top-K model artifacts + exports; W&B artifacts with pinned configs and datasets.

    • Registry update (winner promoted) + promotion event to downstream eval/simulation pipeline.