Packaging, Evaluation & Promotion Workflows

15) Packaging and Export
When it runs
Automatically when a training run from #13 or a sweep winner from #14 is marked “candidate”.
On demand when an engineer requests a build for a specific target (cloud GPU, vehicle ECU, edge gateway).
Nightly to refresh performance-optimized builds with the latest compiler/runtime stacks.
Inputs
Best checkpoint from #13/#14 plus its W&B run, dataset DVC tag, and Git SHA.
Export recipe: desired backends and precisions per target, for example:
TorchScript or ONNX (opset 17) for CPU/GPU
TensorRT engines (FP32, FP16, INT8) for NVIDIA targets
Triton ensemble configuration if pre/post-processing is composed as a pipeline
Calibration shard for INT8 (balanced by slice, e.g., night rain pedestrians).
Inference contract template: input shapes, dtypes, normalization, output schema, confidences.
Steps
Repo staging
Pin environment: Docker image digest, CUDA/cuDNN, PyTorch/NCCL versions.
Fetch artifacts from W&B and S3; verify hashes; freeze the exact requirements.lock.
Sanity smoke: load checkpoint, single-batch forward, no NaNs/infs.
Graph export
TorchScript trace or script path with dynamic axes if needed; or export to ONNX with opset/IR version constraints.
Operator coverage report; fail fast if unsupported ops creep in.
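To make the export step concrete, here is a minimal sketch of the ONNX path with dynamic axes plus a crude operator-coverage dump; the helper name, IO names, and example shape are illustrative, not part of the pipeline contract.

```python
import onnx
import torch

def export_onnx(model, example, path="model.onnx"):
    """Export a traced graph to ONNX (opset 17) with dynamic batch/height/width axes."""
    model.eval()
    torch.onnx.export(
        model,
        example,                       # e.g. torch.randn(1, 3, 1080, 1920)
        path,
        opset_version=17,
        input_names=["images"],
        output_names=["boxes", "scores", "labels"],
        dynamic_axes={
            "images": {0: "batch", 2: "height", 3: "width"},
            "boxes": {0: "batch"},
            "scores": {0: "batch"},
            "labels": {0: "batch"},
        },
    )

    # Operator coverage: list every op type in the exported graph so the job can
    # fail fast when it differs from the per-runtime allowlist.
    graph = onnx.load(path).graph
    ops = sorted({node.op_type for node in graph.node})
    print("exported ops:", ops)
    return ops
```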
Runtime optimization
Build TensorRT engines for each target (T4, A10, A100 in the cloud; Orin/Drive for edge) with per-device tactic replay.
Mixed precision plan selection; per-layer precision fallback where numerically sensitive.
INT8: create calibration cache with the curated shard; verify max absolute deviation vs FP32 on a validation micro-suite.
Optional quantization-aware training reuse: if available, prefer QAT checkpoints for INT8.
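A minimal engine-build sketch using the TensorRT 8.x Python API, assuming the ONNX graph from the previous step; the workspace size, flag choices, and the optional calibrator hook are illustrative and would be tuned per target.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, plan_path, fp16=True, int8_calibrator=None):
    """Compile an ONNX graph into a serialized TensorRT engine for one device."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)  # 4 GiB, illustrative
    if fp16 and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    if int8_calibrator is not None:
        # INT8 uses the curated calibration shard via an IInt8Calibrator implementation.
        config.set_flag(trt.BuilderFlag.INT8)
        config.int8_calibrator = int8_calibrator

    serialized = builder.build_serialized_network(network, config)
    with open(plan_path, "wb") as f:
        f.write(serialized)
```

Engines are device- and version-specific, so this build must run on (or be cross-compiled for) the exact target GPU and TensorRT version recorded in the export recipe.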
Model repository assembly
Create the Triton model directory structure: config.pbtxt, versioned subfolders, pre/post-processing as Python or TensorRT backends, and an optional ensemble to fuse the steps.
Generate inference_config.json describing the IO schema, thresholds, NMS settings, class map, and which training-time augmentations are disabled at inference.
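A sketch of the repository-assembly step, assuming a single TensorRT-backed model; the directory names, example thresholds, shapes, and normalization constants are placeholders that the real export recipe would supply, and the config.pbtxt shown is deliberately minimal (no instance groups, dynamic batching, or ensemble).

```python
import json
import shutil
from pathlib import Path

def assemble_triton_repo(root, model_name, version, plan_file, class_map):
    """Lay out <root>/<model_name>/<version>/model.plan plus config and IO contract."""
    model_dir = Path(root) / model_name / str(version)
    model_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(plan_file, model_dir / "model.plan")

    # Minimal config.pbtxt for the tensorrt_plan backend; illustrative only.
    (Path(root) / model_name / "config.pbtxt").write_text(
        f'name: "{model_name}"\nplatform: "tensorrt_plan"\nmax_batch_size: 8\n'
    )

    # Inference contract consumed by serving and by the evaluation jobs in #16.
    contract = {
        "inputs": [{"name": "images", "dtype": "FP32", "shape": [3, 1080, 1920],
                    "normalization": {"mean": [0.485, 0.456, 0.406],
                                      "std": [0.229, 0.224, 0.225]}}],
        "outputs": [{"name": "boxes"}, {"name": "scores"}, {"name": "labels"}],
        "score_threshold": 0.30,
        "nms_iou_threshold": 0.50,
        "class_map": class_map,
        "train_time_augmentations_disabled_at_inference": True,
    }
    (model_dir / "inference_config.json").write_text(json.dumps(contract, indent=2))
```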
Security and compliance
Generate SBOM with Syft; scan image and artifacts with Trivy.
License scan for third-party code; attach report to model card.
Sign artifacts and/or container with cosign or AWS Signer; store signatures in S3 and publish digest in release notes.
Equivalence & performance checks
Numerical equivalence: FP32 PyTorch vs exported engine on 1k randomized inputs per head; require Δ within tolerances (e.g., bbox IoU drift < 0.5% absolute on sample set; logits Δ < 1e-3).
Latency/throughput microbench: run on the target instance type; collect p50/p95 latency, GPU util, memory footprint.
Contract smoke: load model in a minimal Triton/TorchServe container; POST a known request; verify schema and ranges.
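A minimal numerical-equivalence sketch comparing the FP32 PyTorch model against the exported graph through ONNX Runtime; the trial count, tolerance, and single-output assumption are illustrative, and a TensorRT engine would be checked the same way through its own runtime.

```python
import numpy as np
import torch
import onnxruntime as ort

def check_equivalence(model, onnx_path, n_trials=1000, atol=1e-3):
    """Compare FP32 PyTorch outputs against the exported graph on randomized inputs."""
    model.eval()
    sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    worst = 0.0
    for _ in range(n_trials):
        x = torch.randn(1, 3, 1080, 1920)          # shape assumed; use the contract's shapes
        with torch.no_grad():
            ref = model(x)                          # assume a single output tensor for brevity
        out = sess.run(None, {"images": x.numpy()})[0]
        worst = max(worst, float(np.abs(ref.numpy() - out).max()))
    if worst >= atol:
        raise AssertionError(f"max abs deviation {worst:.2e} exceeds tolerance {atol:.0e}")
    return worst
```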
Artifact packaging
Produce: model.ts or model.onnx, model.plan (per device), inference_config.json, calibration.cache, config.pbtxt, SBOM, export_report.json.
Build and push the serving container to ECR, tagged with semver and Git SHA (e.g., adas-detector:1.8.0-abcdef0).
Attach everything to a W&B Artifact and store it in S3 Gold under /models/<task>/<semver or run_id>/…
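A sketch of attaching the pack to a W&B Artifact with lineage metadata; the project name, artifact naming scheme, and the “candidate” alias are assumptions that mirror the conventions above.

```python
import wandb

def publish_model_pack(files, semver, git_sha, dataset_dvc_tag):
    """Log the packaged files as a versioned W&B Artifact with lineage metadata."""
    run = wandb.init(project="adas-detector", job_type="package")
    art = wandb.Artifact(
        name=f"adas-detector-pack-{semver}",
        type="model",
        metadata={
            "git_sha": git_sha,
            "dataset_dvc_tag": dataset_dvc_tag,
            "container": f"adas-detector:{semver}-{git_sha[:7]}",
        },
    )
    for path in files:  # model.onnx, model.plan, inference_config.json, SBOM, export_report.json, ...
        art.add_file(path)
    run.log_artifact(art, aliases=["candidate"])
    run.finish()
```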
AWS/Tooling
ECR, S3, CodeBuild or GitHub Actions, KMS/Signer, Triton Inference Server, TensorRT, ONNX, TorchScript, W&B Artifacts.
Outputs
Versioned, signed, performance-graded model packs per target.
export_report.json with compile flags, precisions, operator sets, and microbenchmarks.
Updated W&B artifact lineage linking back to dataset and code.
16) Evaluation and Robustness
When it runs
Immediately after packaging (#15) for each target precision.
Nightly regression across the full test library.
On request when a slice shows drift or new edge cases arrive from #12.
Inputs
Exported models and serving containers from #15.
golden_train/val/test.manifest, slices.yaml, and extra challenge suites (rare weather, construction, tunnels).
Baseline “current production” metrics for A/B comparison.
Corruption/perturbation suite definitions and OOD probe sets.
Steps
Dataset integrity & leakage guards
Verify manifests conform to schema; run Great Expectations on key fields.
Ensure no overlap of scene IDs across splits; enforce temporal and geographic separation policies.
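A minimal leakage guard, assuming the manifests are readable as Parquet with a scene_id column (an assumption about the manifest schema); the real check would also enforce the temporal and geographic separation policies.

```python
import pandas as pd

def assert_no_split_leakage(train_manifest, val_manifest, test_manifest):
    """Fail the evaluation run if any scene ID appears in more than one split."""
    splits = {
        "train": pd.read_parquet(train_manifest),
        "val": pd.read_parquet(val_manifest),
        "test": pd.read_parquet(test_manifest),
    }
    seen = {}
    for split_name, df in splits.items():
        for scene_id in df["scene_id"].unique():
            if scene_id in seen:
                raise ValueError(
                    f"scene {scene_id} appears in both {seen[scene_id]} and {split_name}")
            seen[scene_id] = split_name
```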
Primary evaluation
Compute task-specific metrics:
2D detection: COCO mAP, AP50/75, small/medium/large splits; per-class PR curves.
3D detection: nuScenes metrics or KITTI AP on BEV and 3D boxes.
Segmentation/lanes: mIoU, F1, boundary IoU.
Prediction: ADE/FDE, miss rate at K.
Slice evaluation for weather, time of day, geography, road type; compute Δ vs previous release.
Calibration: ECE, Brier score, reliability diagrams; tune decision thresholds if needed.
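For reference, a minimal equal-width-bin ECE implementation over per-prediction confidences and correctness flags; the bin count and helper name are illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: confidence-weighted gap between accuracy and mean confidence per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap        # weight by the fraction of samples in the bin
    return ece
```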
Robustness & stress testing
Image/point cloud corruptions: blur, noise, JPEG compression, fog/rain/snow shaders, brightness/contrast; LiDAR dropouts; test at increasing severities; measure mAP/mIoU decay slopes.
Temporal stress: dropped frames, timestamp jitter, out-of-order batches; check tracker continuity and stability.
Sensor faults: zero out a camera or LiDAR for segments; confirm graceful degradation rules.
Quantization sensitivity: compare FP32 vs FP16 vs INT8 across slices.
OOD & uncertainty
OOD probes using max softmax probability or energy scores; compute AUROC/AUPR for OOD vs in-dist.
Uncertainty quality: NLL, coverage vs confidence; verify abstention policies are triggered sensibly.
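A sketch of the energy-score OOD probe with AUROC, assuming classification logits are available for the in-distribution and OOD probe sets; the temperature and scoring convention (higher score = more in-distribution) are choices, not requirements.

```python
import numpy as np
import torch
from sklearn.metrics import roc_auc_score

def energy_score(logits, temperature=1.0):
    """Negative free energy; higher values indicate more in-distribution inputs."""
    return temperature * torch.logsumexp(logits / temperature, dim=-1)

def ood_auroc(in_dist_logits, ood_logits):
    """AUROC for separating in-distribution probes (label 1) from OOD probes (label 0)."""
    scores = torch.cat([energy_score(in_dist_logits), energy_score(ood_logits)])
    labels = np.concatenate([np.ones(len(in_dist_logits)), np.zeros(len(ood_logits))])
    return roc_auc_score(labels, scores.numpy())
```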
Latency and footprint
Measure p50/p95 latency and throughput on target hardware using the packaged engine; cap memory and verify no OOM at peak batch/stream settings.
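Triton Perf Analyzer is the primary tool here; as a minimal stand-in, a latency probe through the tritonclient HTTP API is sketched below, with the input name, shape, datatype, and request count all assumed from the inference contract.

```python
import time
import numpy as np
import tritonclient.http as httpclient

def bench_latency(model_name, shape=(1, 3, 1080, 1920), n_requests=200):
    """Measure p50/p95 end-to-end latency against a locally running Triton server."""
    client = httpclient.InferenceServerClient(url="localhost:8000")
    x = np.random.rand(*shape).astype(np.float32)
    inp = httpclient.InferInput("images", list(shape), "FP32")
    inp.set_data_from_numpy(x)

    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        client.infer(model_name, inputs=[inp])
        latencies_ms.append((time.perf_counter() - start) * 1e3)

    p50, p95 = np.percentile(latencies_ms, [50, 95])
    return {"p50_ms": float(p50), "p95_ms": float(p95)}
```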
Regression gates
Define win conditions, e.g., mAP_weighted +1.5 overall and no critical slice regression > 2%; latency p95 within budget; calibration ECE not worse.
If a gate fails, emit a blocking report and route back to #14 or #8/#12 to mine data for the failing slices.
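A sketch of the gate check, assuming eval_report.json carries the fields referenced above (mAP_weighted, per-slice mAP, p95 latency, ECE); the exact report schema and thresholds are illustrative.

```python
import json

def evaluate_gates(candidate_path, baseline_path,
                   min_map_gain=1.5, max_slice_regression=2.0, p95_budget_ms=35.0):
    """Return (passed, reasons) by comparing the candidate eval report to the baseline."""
    with open(candidate_path) as f:
        cand = json.load(f)
    with open(baseline_path) as f:
        base = json.load(f)

    reasons = []
    if cand["mAP_weighted"] - base["mAP_weighted"] < min_map_gain:
        reasons.append("overall mAP_weighted gain below threshold")
    for slice_name, metrics in cand["slices"].items():
        if base["slices"][slice_name]["mAP"] - metrics["mAP"] > max_slice_regression:
            reasons.append(f"critical slice regression: {slice_name}")
    if cand["latency"]["p95_ms"] > p95_budget_ms:
        reasons.append("p95 latency over budget")
    if cand["calibration"]["ece"] > base["calibration"]["ece"]:
        reasons.append("calibration (ECE) got worse")

    return (len(reasons) == 0, reasons)
```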
Reporting
Create eval_report.json, slice_metrics.parquet, robustness_report.json, latency summaries, confusion matrices, and reliability plots; log them all to W&B.
Generate a human-readable evaluation_summary.md with a “What improved / what regressed / next actions” section.
AWS/Tooling
EKS or SageMaker Processing, Athena/Glue for audit queries, W&B, Evidently for reference vs candidate drift checks, Triton Perf Analyzer or custom profilers.
Outputs
Machine-readable reports and plots; green/red promotion signal with rationale.
Pinned W&B run linking evaluation to the packaged artifact.
17) Drive Replay and Simulation
When it runs
After #16 indicates the candidate is promising but needs closed-loop validation.
As a mandatory gate for any major change touching perception->planning interfaces.
Periodically to re-validate regressions and expand the scenario library.
Inputs
Exported model container and configs from #15.
Log replay bundles: synchronized multi-sensor recordings with ground truth labels.
Scenario library: OpenSCENARIO files and procedurally generated scenes based on real-world events (disengagements, near misses).
Vehicle dynamics and controller configs for realistic closed-loop behavior.
Steps
Open-loop replay
Reproduce sensor timing, distortions, and calibration; feed logs through the candidate model.
Compute frame/segment-level perception metrics against ground truth; analyze time-to-first-detection, track continuity, and ghosting.
Flag segments where the candidate diverges materially from production; surface them for targeted review.
Scenario extraction
Convert flagged real-world intervals to OpenSCENARIO with actors, trajectories, traffic rules, and weather.
Parameterize scenarios (vehicle speed, gap times, actor types) for robust sweeps.
Closed-loop simulation
Run in CARLA or NVIDIA Omniverse/Drive with high-fidelity sensors and physics.
Connect inference to the autonomy stack’s planning/control (or a proxy controller) so the model’s outputs drive the ego vehicle.
Randomize across seeds: weather, lighting, textures, spawn densities; run many permutations per scenario.
Collect safety metrics: collisions per 1k km, off-road incidents, traffic rule violations, TTC minima, and comfort metrics (jerk/acc).
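A sketch of aggregating per-run simulation logs into the headline safety metrics; the column names are assumptions about the sim logging schema, with one row per closed-loop run.

```python
import pandas as pd

def safety_summary(runs: pd.DataFrame) -> dict:
    """Aggregate per-run sim logs (distance_km, collisions, off_road_events,
    rule_violations, min_ttc_s, max_abs_jerk) into fleet-level safety metrics."""
    total_km = runs["distance_km"].sum()
    return {
        "collisions_per_1k_km": 1000.0 * runs["collisions"].sum() / total_km,
        "off_road_per_1k_km": 1000.0 * runs["off_road_events"].sum() / total_km,
        "rule_violations_per_1k_km": 1000.0 * runs["rule_violations"].sum() / total_km,
        "ttc_min_p5_s": float(runs["min_ttc_s"].quantile(0.05)),
        "max_abs_jerk_p95": float(runs["max_abs_jerk"].quantile(0.95)),
    }
```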
Batch orchestration
Distribute thousands of runs on EKS or AWS Batch with GPU nodes; mount scene assets via S3/FSx.
Cache compiled simulation assets to avoid rebuilds; checkpoint long sweeps.
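One way to fan out a sweep is an AWS Batch array job, where each child index maps to one scenario permutation; a minimal sketch follows, with the queue and job-definition names as placeholders.

```python
import boto3

def submit_sim_sweep(scenario_prefix, n_permutations, model_uri):
    """Submit one array job; each child reads AWS_BATCH_JOB_ARRAY_INDEX to pick its permutation."""
    batch = boto3.client("batch")
    return batch.submit_job(
        jobName=f"sim-sweep-{scenario_prefix.replace('/', '-')}",
        jobQueue="sim-gpu-queue",                  # assumed queue name
        jobDefinition="carla-closed-loop:12",      # assumed job definition
        arrayProperties={"size": n_permutations},
        containerOverrides={
            "environment": [
                {"name": "SCENARIO_PREFIX", "value": scenario_prefix},
                {"name": "MODEL_URI", "value": model_uri},
            ]
        },
    )
```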
Review and gating
Aggregate results; compare to production baselines and to #16 offline metrics.
Define pass criteria per scenario category, e.g., zero collisions in NCAP-style scenes; no increase in near-misses for vulnerable road users; bounded comfort regressions.
Produce clips of failures for quick triage; create issues mapped back to #8/#12 for data requests if needed.
AWS/Tooling
EKS/ECS or Batch with GPU, S3/FSx, CloudWatch Logs, Omniverse/Drive Sim or CARLA, ROS/rosbag replayers when needed, OpenSCENARIO toolchain, dashboards in QuickSight.
Outputs
sim_summary.parquet, per-scenario CSV/JSON, video snippets of failures, and heatmaps of violation types.
A “Sim Gate” verdict attached to the model’s W&B artifact and promotion checklist.
18) Registry and Promotion
When it runs
After #16 and #17 return green.
On product manager approval and change-control window availability.
On rollback events (reverse promotion).
Inputs
Candidate model pack(s) and serving container(s) from #15.
Evaluation and simulation reports from #16/#17.
Model card, SBOM, vulnerability/license scans, and signatures.
Release notes, migration notes, and serving configs.
Steps
Registry entry
Create/update entry in the Model Registry with immutable pointers:
W&B Artifact digest, S3 artifact URIs, ECR image digest, commit SHA, dataset DVC tag.
Performance summary, slice table, latency budgets, supported targets, calibration cache version.
Apply semantic versioning and attach a stage: staging, candidate, or production.
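If W&B Artifacts serve as the registry of truth, stage changes can be expressed as alias moves; a minimal sketch is below, with the stage names matching the list above and the artifact reference format as an example.

```python
import wandb

def set_stage(artifact_ref: str, stage: str):
    """Move a registry entry between stages by rewriting its alias,
    e.g. set_stage("org/adas-detector/model-pack:v12", "production")."""
    assert stage in {"staging", "candidate", "production"}
    api = wandb.Api()
    artifact = api.artifact(artifact_ref)
    for old in ("staging", "candidate", "production"):
        if old != stage and old in artifact.aliases:
            artifact.aliases.remove(old)
    if stage not in artifact.aliases:
        artifact.aliases.append(stage)
    artifact.save()
```

Whatever the backing store (SageMaker Model Registry, W&B, or the DynamoDB promotion table), the stage move should be the only mutable operation; the artifact digests themselves stay immutable.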
Governance checks
Validate approvals: technical, safety, security, and product.
Verify signatures and SBOM status; ensure vulnerability gates pass or are waived with justification.
Lock down IAM policies for read-only production consumption.
Promotion plan
Choose rollout strategy: shadow (mirror traffic), canary (1%→5%→25%→100%), or A/B with customer cohorts.
Pre-deploy to staging Triton/TorchServe; run an API contract smoke test and a performance soak (e.g., 30 min).
Define rollback SLOs: if p95 latency, error rate, or safety proxy metrics breach thresholds for N minutes, auto-rollback.
Production push
Deploy canary to EKS/ECS or SageMaker Endpoints with the new container and model; wire CloudWatch alarms and Auto Scaling.
Gradually shift traffic; keep shadow for behavioral diffing (store diff summaries to S3).
Validate real-world proxy KPIs (e.g., false emergency braking rate, perception-planning disagreement rates).
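A sketch of the gradual traffic shift on a SageMaker endpoint with two production variants; the variant names and weight schedule are assumptions, and an EKS/ECS rollout would use its own traffic-splitting mechanism instead.

```python
import boto3

def shift_canary_traffic(endpoint_name: str, canary_weight: float):
    """Move a fraction of live traffic to the canary variant of a SageMaker endpoint."""
    sm = boto3.client("sagemaker")
    sm.update_endpoint_weights_and_capacities(
        EndpointName=endpoint_name,
        DesiredWeightsAndCapacities=[
            {"VariantName": "production", "DesiredWeight": 1.0 - canary_weight},
            {"VariantName": "canary", "DesiredWeight": canary_weight},
        ],
    )

# Illustrative rollout: step through weights 0.01 -> 0.05 -> 0.25 -> 1.0, pausing at each
# step until the CloudWatch alarms defined in the promotion plan stay quiet for the soak window.
```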
Finalize & broadcast
Promote registry stage to production; tag previous model as rollback.
Publish release notes, link to model card, evaluation, simulation, and SBOM.
Notify stakeholders; update dashboards.
Post-promotion hooks
Kick off Offline Mining (#12) with fresh error clusters seeded from shadow/canary telemetry.
Schedule the next weekly evaluation on the full library to guard against late regressions.
Archive heavy intermediate artifacts per retention policy; maintain cost hygiene.
AWS/Tooling
SageMaker Model Registry or W&B Artifacts as registry-of-truth with a lightweight DynamoDB “promotion table” for active aliases.
EKS/ECS or SageMaker Endpoints for serving; CloudWatch, Auto Scaling, EventBridge for rollouts; KMS/Signer/cosign for integrity.
Athena/QuickSight for canary KPIs and shadow diffs.
Outputs
A versioned, auditable production model entry with all lineage and approvals.
Canary/rollout timelines, SLO dashboards, and an automated rollback path.
Triggers fired to feed the next loop of the data engine.