Packaging, Evaluation & Promotion Workflows

15) Packaging and Export

  • When it runs

    • Automatically once a training run from #13 or a sweep winner from #14 is marked “candidate”.

    • On demand when an engineer requests a build for a specific target (cloud GPU, vehicle ECU, edge gateway).

    • Nightly to refresh performance-optimized builds with the latest compiler/runtime stacks.

  • Inputs

    • Best checkpoint from #13/#14 plus its W&B run, dataset DVC tag, and Git SHA.

    • Export recipe: desired backends and precisions per target, for example:

      • TorchScript or ONNX (opset 17) for CPU/GPU

      • TensorRT engines (FP32, FP16, INT8) for NVIDIA targets

      • Triton ensemble configuration if pre/post-processing is composed as a pipeline

    • Calibration shard for INT8 (balanced by slice, e.g., pedestrians at night in rain).

    • Inference contract template: input shapes, dtypes, normalization, output schema, confidences.

  • Steps

    • Repo staging

      • Pin environment: Docker image digest, CUDA/cuDNN, PyTorch/NCCL versions.

      • Fetch artifacts from W&B and S3; verify hashes; freeze the exact requirements.lock.

      • Sanity smoke: load checkpoint, single-batch forward, no NaNs/infs.
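
A minimal sketch of the sanity smoke above, assuming a torch-loadable checkpoint and a caller-supplied model instance and sample batch (all names here are placeholders, not the pipeline's actual helpers):

```python
import torch

def sanity_smoke(model: torch.nn.Module, checkpoint_path: str, sample_batch: torch.Tensor) -> None:
    """Load the candidate checkpoint and verify one forward pass is finite."""
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    state = ckpt.get("model", ckpt)  # tolerate either a raw state_dict or a wrapped checkpoint
    model.load_state_dict(state)
    model.eval()
    with torch.no_grad():
        out = model(sample_batch)
    # Flatten dict/tuple outputs into a list of tensors before checking.
    tensors = list(out.values()) if isinstance(out, dict) else (list(out) if isinstance(out, (list, tuple)) else [out])
    for t in tensors:
        if not torch.isfinite(t).all():
            raise RuntimeError("NaN/Inf detected in forward pass; aborting packaging.")
```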

    • Graph export

      • TorchScript via the trace or script path; or export to ONNX with pinned opset/IR versions and dynamic axes where input shapes vary (see the sketch after this step).

      • Operator coverage report; fail fast if unsupported ops creep in.
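
A hedged sketch of the ONNX export at opset 17 with a dynamic batch axis, plus a simple operator inventory for the coverage report; the input/output names and the single-image detector signature are assumptions:

```python
import onnx
import torch

def export_onnx(model: torch.nn.Module, example_input: torch.Tensor, out_path: str = "model.onnx"):
    model.eval()
    torch.onnx.export(
        model,
        example_input,
        out_path,
        opset_version=17,
        input_names=["images"],
        output_names=["boxes", "scores"],
        dynamic_axes={"images": {0: "batch"}, "boxes": {0: "batch"}, "scores": {0: "batch"}},
    )
    # Operator coverage report: enumerate every op so unsupported ones surface early.
    used_ops = sorted({node.op_type for node in onnx.load(out_path).graph.node})
    return used_ops
```

The returned operator list can then be diffed against the target runtime's supported-operator set to fail fast before any engine build is attempted.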

    • Runtime optimization

      • Build TensorRT engines for each target (T4, A10, A100 in cloud; Orin/Drive for edge) with per-device tactic replay.

      • Mixed-precision plan selection, with per-layer fallback to higher precision for numerically sensitive layers.

      • INT8: create calibration cache with the curated shard; verify max absolute deviation vs FP32 on a validation micro-suite.

      • Optional quantization-aware training reuse: if available, prefer QAT checkpoints for INT8.
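
A sketch of the TensorRT engine build described above, using the TensorRT 8.x-style Python API; the INT8 calibrator (an IInt8EntropyCalibrator2 over the curated shard) is assumed to be supplied by the caller, and the file names are illustrative:

```python
import tensorrt as trt

def build_engine(onnx_path: str, plan_path: str = "model.plan", fp16: bool = True, int8: bool = False, calibrator=None):
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(str(parser.get_error(0)))
    config = builder.create_builder_config()
    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    if int8 and calibrator is not None:
        config.set_flag(trt.BuilderFlag.INT8)
        config.int8_calibrator = calibrator  # calibration cache built from the curated shard
    serialized = builder.build_serialized_network(network, config)
    with open(plan_path, "wb") as f:
        f.write(serialized)
```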

    • Model repository assembly

      • Create the Triton model repository structure: config.pbtxt, versioned subfolders, pre/post-processing as Python or TensorRT backends, and an optional ensemble to fuse the steps.

      • Generate inference_config.json describing the IO schema, thresholds, NMS settings, class map, and which training-time augmentations must be disabled at inference (illustrated below).
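
An illustration of the kind of contract inference_config.json might capture; every field name and value here is a placeholder, not a fixed schema:

```python
import json

inference_config = {
    "inputs": [{"name": "images", "dtype": "float32", "shape": [-1, 3, 1080, 1920],
                "normalization": {"mean": [0.485, 0.456, 0.406], "std": [0.229, 0.224, 0.225]}}],
    "outputs": [{"name": "boxes", "dtype": "float32", "shape": [-1, 100, 4]},
                {"name": "scores", "dtype": "float32", "shape": [-1, 100]}],
    "postprocess": {"score_threshold": 0.35, "nms_iou": 0.5},
    "class_map": {"0": "car", "1": "pedestrian", "2": "cyclist"},
    "augmentations_disabled_at_inference": ["random_flip", "color_jitter"],
}

with open("inference_config.json", "w") as f:
    json.dump(inference_config, f, indent=2)
```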

    • Security and compliance

      • Generate SBOM with Syft; scan image and artifacts with Trivy.

      • License scan for third-party code; attach report to model card.

      • Sign artifacts and/or container with cosign or AWS Signer; store signatures in S3 and publish digest in release notes.

    • Equivalence & performance checks

      • Numerical equivalence: FP32 PyTorch vs exported engine on 1k randomized inputs per head; require Δ within tolerances (e.g., bbox IoU drift < 0.5% absolute on sample set; logits Δ < 1e-3).

      • Latency/throughput microbench: run on the target instance type; collect p50/p95 latency, GPU util, memory footprint.

      • Contract smoke: load model in a minimal Triton/TorchServe container; POST a known request; verify schema and ranges.
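
A minimal sketch of the numerical-equivalence check above, comparing the FP32 PyTorch model against the exported ONNX graph via onnxruntime on randomized inputs; a single-output head is assumed for brevity, and the shape and tolerance are illustrative:

```python
import numpy as np
import onnxruntime as ort
import torch

def check_equivalence(model: torch.nn.Module, onnx_path: str, shape=(1, 3, 384, 640), n_trials=1000, tol=1e-3):
    sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name
    model.eval()
    worst = 0.0
    for _ in range(n_trials):
        x = np.random.rand(*shape).astype(np.float32)
        with torch.no_grad():
            ref = model(torch.from_numpy(x)).numpy()   # FP32 PyTorch reference
        out = sess.run(None, {input_name: x})[0]        # exported graph
        worst = max(worst, float(np.abs(ref - out).max()))
    if worst >= tol:
        raise AssertionError(f"max |delta| = {worst:.2e} exceeds tolerance {tol}")
    return worst
```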

    • Artifact packaging

      • Produce: model.ts or model.onnx, model.plan (per device), inference_config.json, calibration.cache, config.pbtxt, SBOM, export_report.json.

      • Build and push serving container to ECR tagged with semver and Git SHA (e.g., adas-detector:1.8.0-abcdef0).

      • Attach everything to a W&B Artifact and store in S3 Gold under /models/<task>/<semver or run_id>/….
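
A sketch of attaching the pack to a W&B Artifact with lineage metadata and an S3 reference; the project, artifact, bucket, and tag names are placeholders that reuse the example identifiers above:

```python
import wandb

run = wandb.init(project="adas-detector", job_type="packaging")
artifact = wandb.Artifact(
    "adas-detector-pack", type="model",
    metadata={"git_sha": "abcdef0", "dataset_dvc_tag": "golden-v12", "semver": "1.8.0"},
)
for path in ["model.onnx", "model.plan", "inference_config.json",
             "calibration.cache", "config.pbtxt", "export_report.json"]:
    artifact.add_file(path)
# Reference the S3 Gold copy instead of re-uploading large binaries.
artifact.add_reference("s3://gold-bucket/models/detection/1.8.0/", name="s3_gold")
run.log_artifact(artifact)
run.finish()
```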

  • AWS/Tooling

    • ECR, S3, CodeBuild or GitHub Actions, KMS/Signer, Triton Inference Server, TensorRT, ONNX, TorchScript, W&B Artifacts.

  • Outputs

    • Versioned, signed, performance-graded model packs per target.

    • export_report.json with compile flags, precisions, operator sets, and microbenchmarks.

    • Updated W&B artifact lineage linking back to dataset and code.


16) Evaluation and Robustness

  • When it runs

    • Immediately after packaging (#15) for each target precision.

    • Nightly regression across the full test library.

    • On request when a slice shows drift or new edge cases arrive from #12.

  • Inputs

    • Exported models and serving containers from #15.

    • golden_train/val/test.manifest, slices.yaml, and extra challenge suites (rare weather, construction, tunnels).

    • Baseline “current production” metrics for A/B comparison.

    • Corruption/perturbation suite definitions and OOD probe sets.

  • Steps

    • Dataset integrity & leakage guards

      • Verify manifests conform to schema; run Great Expectations on key fields.

      • Ensure no overlap of scene IDs across splits; enforce temporal and geographic separation policies.
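
A minimal sketch of the split-leakage guard, assuming the golden manifests expose a scene_id column and are readable with pandas (the Parquet format and column name are assumptions):

```python
import pandas as pd

def assert_no_split_leakage(train_path: str, val_path: str, test_path: str, key: str = "scene_id") -> None:
    splits = {name: set(pd.read_parquet(path)[key])
              for name, path in [("train", train_path), ("val", val_path), ("test", test_path)]}
    for a in splits:
        for b in splits:
            if a < b:  # check each pair once
                overlap = splits[a] & splits[b]
                if overlap:
                    raise ValueError(f"{len(overlap)} {key}s shared between {a} and {b}: {sorted(overlap)[:5]}")
```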

    • Primary evaluation

      • Compute task-specific metrics:

        • 2D detection: COCO mAP, AP50/75, small/medium/large splits; per-class PR curves.

        • 3D detection: nuScenes metrics or KITTI AP on BEV and 3D boxes.

        • Segmentation/lanes: mIoU, F1, boundary IoU.

        • Prediction: ADE/FDE, miss rate at K.

      • Slice evaluation for weather, time of day, geography, road type; compute Δ vs previous release.

      • Calibration: ECE, Brier score, reliability diagrams; tune decision thresholds if needed.
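
A small sketch of the ECE computation over per-prediction confidences and correctness flags; equal-width binning with 15 bins is the conventional choice here, not a mandated one:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 15) -> float:
    """ECE: bin-weighted gap between empirical accuracy and mean confidence."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece
```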

    • Robustness & stress testing

      • Image/point cloud corruptions: blur, noise, JPEG compression, fog/rain/snow shaders, brightness/contrast; LiDAR dropouts; test at increasing severities; measure mAP/mIoU decay slopes.

      • Temporal stress: dropped frames, timestamp jitter, out-of-order batches; check tracker continuity and stability.

      • Sensor faults: zero out a camera or LiDAR for segments; confirm graceful degradation rules.

      • Quantization sensitivity: compare FP32 vs FP16 vs INT8 across slices.
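
A sketch of a severity sweep for one corruption (Gaussian noise) and its decay slope; evaluate_map is a hypothetical callback that returns mAP on the corrupted copy, and the severity grid is illustrative:

```python
import numpy as np

SEVERITIES = [0.0, 0.02, 0.05, 0.1, 0.2]  # noise std as a fraction of the dynamic range

def gaussian_noise(images: np.ndarray, severity: float) -> np.ndarray:
    noisy = images + np.random.normal(0.0, severity, images.shape)
    return np.clip(noisy, 0.0, 1.0)

def corruption_decay(images, labels, evaluate_map, corruption=gaussian_noise):
    """Return the metric at each severity plus the linear decay slope."""
    scores = [evaluate_map(corruption(images, s), labels) for s in SEVERITIES]
    slope = np.polyfit(SEVERITIES, scores, deg=1)[0]
    return scores, slope
```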

    • OOD & uncertainty

      • OOD probes using max softmax probability or energy scores; compute AUROC/AUPR for OOD vs in-distribution samples.

      • Uncertainty quality: NLL, coverage vs confidence; verify abstention policies are triggered sensibly.
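
A sketch of the two OOD probes named above (max softmax probability and energy score) with AUROC against an OOD probe set; logits_in and logits_out are raw classifier logits for in-distribution and OOD samples:

```python
import numpy as np
from scipy.special import logsumexp, softmax
from sklearn.metrics import roc_auc_score

def ood_scores(logits: np.ndarray):
    msp = softmax(logits, axis=1).max(axis=1)   # higher = more in-distribution
    energy = -logsumexp(logits, axis=1)         # higher = more OOD (temperature 1)
    return msp, energy

def ood_auroc(logits_in: np.ndarray, logits_out: np.ndarray):
    msp_in, energy_in = ood_scores(logits_in)
    msp_out, energy_out = ood_scores(logits_out)
    labels = np.concatenate([np.zeros(len(logits_in)), np.ones(len(logits_out))])  # 1 = OOD
    auroc_msp = roc_auc_score(labels, np.concatenate([-msp_in, -msp_out]))   # low MSP flags OOD
    auroc_energy = roc_auc_score(labels, np.concatenate([energy_in, energy_out]))
    return auroc_msp, auroc_energy
```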

    • Latency and footprint

      • Measure p50/p95 latency and throughput on target hardware using the packaged engine; cap memory and verify no OOM at peak batch/stream settings.

    • Regression gates

      • Define win conditions, e.g., weighted mAP improves by at least +1.5 overall with no critical slice regressing by more than 2%; p95 latency within budget; calibration ECE no worse than baseline.

      • If a gate fails, emit a blocking report and route back to #14 or #8/#12 to mine data for failing slices.
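
A sketch of such a gate as a plain function over candidate/baseline metric dicts; the metric keys and thresholds mirror the example above but are meant to be configured per release, not hard-coded:

```python
def promotion_gate(candidate: dict, baseline: dict, slice_metrics: dict, latency_budget_ms: float = 50.0):
    """Return (passed, reasons). slice_metrics maps slice name -> (candidate, baseline) pairs."""
    reasons = []
    if candidate["map_weighted"] - baseline["map_weighted"] < 1.5:
        reasons.append("overall weighted mAP gain below +1.5")
    for name, (cand, base) in slice_metrics.items():
        if base - cand > 2.0:
            reasons.append(f"critical slice regression > 2: {name}")
    if candidate["latency_p95_ms"] > latency_budget_ms:
        reasons.append("p95 latency over budget")
    if candidate["ece"] > baseline["ece"]:
        reasons.append("calibration (ECE) worse than baseline")
    return (len(reasons) == 0, reasons)
```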

    • Reporting

      • Create eval_report.json, slice_metrics.parquet, robustness_report.json, latency summaries, confusion matrices, and reliability plots; log all to W&B.

      • Generate a human-readable evaluation_summary.md with a “What improved / what regressed / next actions” section.

  • AWS/Tooling

    • EKS or SageMaker Processing, Athena/Glue for audit queries, W&B, Evidently for reference vs candidate drift checks, Triton Perf Analyzer or custom profilers.

  • Outputs

    • Machine-readable reports and plots; green/red promotion signal with rationale.

    • Pinned W&B run linking evaluation to the packaged artifact.


17) Drive Replay and Simulation

  • When it runs

    • After #16 indicates the candidate is promising but needs closed-loop validation.

    • As a mandatory gate for any major change touching perception→planning interfaces.

    • Periodically to re-validate regressions and expand the scenario library.

  • Inputs

    • Exported model container and configs from #15.

    • Log replay bundles: synchronized multi-sensor recordings with ground truth labels.

    • Scenario library: OpenSCENARIO files and procedurally generated scenes based on real-world events (disengagements, near misses).

    • Vehicle dynamics and controller configs for realistic closed-loop behavior.

  • Steps

    • Open-loop replay

      • Reproduce sensor timing, distortions, and calibration; feed logs through the candidate model.

      • Compute frame/segment-level perception metrics against ground truth; analyze time-to-first-detection, track continuity, and ghosting.

      • Flag segments where the candidate diverges materially from production; surface them for targeted review.

    • Scenario extraction

      • Convert flagged real-world intervals to OpenSCENARIO with actors, trajectories, traffic rules, and weather.

      • Parameterize scenarios (vehicle speed, gap times, actor types) for robust sweeps.
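
A small sketch of expanding one extracted scenario into a parameter sweep before batching it into the simulator; the specific ranges for ego speed, gap time, and actor type are illustrative:

```python
from itertools import product

SPEEDS_KPH = [30, 50, 70]
GAP_TIMES_S = [1.0, 1.5, 2.5]
ACTOR_TYPES = ["pedestrian", "cyclist", "vehicle"]

def scenario_grid(base_scenario_id: str):
    """Yield one parameterized variant per combination for closed-loop runs."""
    for speed, gap, actor in product(SPEEDS_KPH, GAP_TIMES_S, ACTOR_TYPES):
        yield {
            "scenario_id": base_scenario_id,
            "ego_speed_kph": speed,
            "gap_time_s": gap,
            "actor_type": actor,
        }
```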

    • Closed-loop simulation

      • Run in CARLA or NVIDIA Omniverse/Drive with high-fidelity sensors and physics.

      • Connect inference to the autonomy stack’s planning/control (or a proxy controller) so the model’s outputs drive the ego vehicle.

      • Randomize across seeds: weather, lighting, textures, spawn densities; run many permutations per scenario.

      • Collect safety metrics: collisions per 1k km, off-road incidents, traffic rule violations, TTC minima, and comfort metrics (jerk, acceleration).
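
A sketch of seed and weather randomization with the CARLA Python client (0.9.x-style API); actor spawning, the planner hook-up, and metric collection are elided:

```python
import random

import carla

def run_randomized(scenario_params: dict, host: str = "localhost", port: int = 2000, n_seeds: int = 20):
    client = carla.Client(host, port)
    client.set_timeout(30.0)
    world = client.get_world()
    for seed in range(n_seeds):
        random.seed(seed)
        weather = carla.WeatherParameters(
            cloudiness=random.uniform(0, 100),
            precipitation=random.uniform(0, 100),
            sun_altitude_angle=random.uniform(-10, 90),  # below 0 approximates night
        )
        world.set_weather(weather)
        # ... spawn actors from scenario_params, feed the candidate model's outputs to the
        # planner or proxy controller, and record collisions, TTC minima, and comfort metrics.
```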

    • Batch orchestration

      • Distribute thousands of runs on EKS or AWS Batch with GPU nodes; mount scene assets via S3/FSx.

      • Cache compiled simulation assets to avoid rebuilds; checkpoint long sweeps.

    • Review and gating

      • Aggregate results; compare to production baselines and to #16 offline metrics.

      • Define pass criteria per scenario category, e.g., zero collisions in NCAP-style scenes; no increase in near-misses for vulnerable road users; bounded comfort regressions.

      • Produce clips of failures for quick triage; create issues mapped back to #8/#12 for data requests if needed.

  • AWS/Tooling

    • EKS/ECS or Batch with GPU, S3/FSx, CloudWatch Logs, Omniverse/Drive Sim or CARLA, ROS/rosbag replayers when needed, OpenSCENARIO toolchain, dashboards in QuickSight.

  • Outputs

    • sim_summary.parquet, per-scenario CSV/JSON, video snippets of failures, heatmaps of violation types.

    • A “Sim Gate” verdict attached to the model’s W&B artifact and promotion checklist.


18) Registry and Promotion

  • When it runs

    • After #16 and #17 return green.

    • On product manager approval and change-control window availability.

    • On rollback events (reverse promotion).

  • Inputs

    • Candidate model pack(s) and serving container(s) from #15.

    • Evaluation and simulation reports from #16/#17.

    • Model card, SBOM, vulnerability/license scans, and signatures.

    • Release notes, migration notes, and serving configs.

  • Steps

    • Registry entry

      • Create/update entry in the Model Registry with immutable pointers:

        • W&B Artifact digest, S3 artifact URIs, ECR image digest, commit SHA, dataset DVC tag.

        • Performance summary, slice table, latency budgets, supported targets, calibration cache version.

      • Apply semantic versioning and attach stage: staging, candidate, production.
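
A sketch of recording the immutable pointers in the lightweight DynamoDB “promotion table” mentioned under AWS/Tooling below; the table name and attribute names are placeholders, reusing the example tag from #15:

```python
import boto3

def record_promotion(stage: str = "candidate") -> None:
    table = boto3.resource("dynamodb").Table("model-promotions")  # hypothetical table name
    table.put_item(Item={
        "model_name": "adas-detector",
        "semver": "1.8.0",
        "stage": stage,  # staging | candidate | production
        "wandb_artifact": "adas-detector-pack:v42",
        "ecr_image_digest": "sha256:abcdef0...",
        "s3_model_uri": "s3://gold-bucket/models/detection/1.8.0/",
        "git_sha": "abcdef0",
        "dataset_dvc_tag": "golden-v12",
    })
```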

    • Governance checks

      • Validate approvals: technical, safety, security, and product.

      • Verify signatures and SBOM status; ensure vulnerability gates pass or are waived with justification.

      • Lock down IAM policies for read-only production consumption.

    • Promotion plan

      • Choose rollout strategy: shadow (mirror traffic), canary (1%→5%→25%→100%), or A/B with customer cohorts.

      • Pre-deploy to staging Triton/TorchServe; run an API contract smoke test and a performance soak (e.g., 30 min).

      • Define rollback SLOs: if p95 latency, error rate, or safety proxy metrics breach thresholds for N minutes, auto-rollback.
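
A sketch of the auto-rollback decision over per-minute canary samples; the thresholds and breach window are illustrative and would come from the SLO definition above:

```python
def should_rollback(samples, p95_budget_ms: float = 50.0, max_error_rate: float = 0.01, breach_minutes: int = 5) -> bool:
    """samples: per-minute dicts with 'latency_p95_ms' and 'error_rate' pulled from monitoring."""
    consecutive = 0
    for s in samples:
        breached = s["latency_p95_ms"] > p95_budget_ms or s["error_rate"] > max_error_rate
        consecutive = consecutive + 1 if breached else 0
        if consecutive >= breach_minutes:
            return True
    return False
```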

    • Production push

      • Deploy canary to EKS/ECS or SageMaker Endpoints with the new container and model; wire CloudWatch alarms and Auto Scaling.

      • Gradually shift traffic; keep shadow for behavioral diffing (store diff summaries to S3).

      • Validate real-world proxy KPIs (e.g., false emergency braking rate, perception-planning disagreement rates).

    • Finalize & broadcast

      • Promote the registry stage to production; tag the previous model as the rollback target.

      • Publish release notes, link to model card, evaluation, simulation, and SBOM.

      • Notify stakeholders; update dashboards.

    • Post-promotion hooks

      • Kick off Offline Mining (#12) with fresh error clusters seeded from shadow/canary telemetry.

      • Schedule the next weekly evaluation on the full library to guard against late regressions.

      • Archive heavy intermediate artifacts per retention policy; maintain cost hygiene.

  • AWS/Tooling

    • SageMaker Model Registry or W&B Artifacts as the registry of truth, with a lightweight DynamoDB “promotion table” for active aliases.

    • EKS/ECS or SageMaker Endpoints for serving; CloudWatch, Auto Scaling, EventBridge for rollouts; KMS/Signer/cosign for integrity.

    • Athena/QuickSight for canary KPIs and shadow diffs.

  • Outputs

    • A versioned, auditable production model entry with all lineage and approvals.

    • Canary/rollout timelines, SLO dashboards, and an automated rollback path.

    • Triggers fired to feed the next loop of the data engine.