# Packaging, Evaluation & Promotion Workflows

### 15) Packaging and Export

* **When it runs**
  * Automatically after a training run in #13 or a sweep winner in #14 is marked “candidate”.
  * On demand when an engineer requests a build for a specific target (cloud GPU, vehicle ECU, edge gateway).
  * Nightly to refresh performance-optimized builds with the latest compiler/runtime stacks.
* **Inputs**
  * Best checkpoint from #13/#14 plus its W\&B run, dataset DVC tag, and Git SHA.
  * Export recipe: desired backends and precisions per target, for example:
    * TorchScript or ONNX (opset 17) for CPU/GPU
    * TensorRT engines (FP32, FP16, INT8) for NVIDIA targets
    * Triton ensemble configuration if pre/post-processing is composed as a pipeline
  * Calibration shard for INT8 (balanced by slice, e.g., night rain pedestrians).
  * Inference contract template: input shapes, dtypes, normalization, output schema, confidences.
* **Steps**
  * **Repo staging**
    * Pin the environment: Docker image digest, CUDA/cuDNN, PyTorch/NCCL versions.
    * Fetch artifacts from W\&B and S3; verify hashes; freeze the exact `requirements.lock`.
    * Sanity smoke: load the checkpoint, run a single-batch forward pass, confirm no NaNs/Infs.
  * **Graph export**
    * TorchScript trace or script path with dynamic axes if needed; or export to ONNX with opset/IR version constraints.
    * Operator coverage report; fail fast if unsupported ops creep in.
  * **Runtime optimization**
    * Build **TensorRT** engines for the target (T4, A10, A100 in the cloud; Orin/Drive at the edge) with per-device tactic replay.
    * Mixed-precision plan selection; per-layer precision fallback where numerically sensitive.
    * **INT8**: create the calibration cache with the curated shard; verify max absolute deviation vs FP32 on a validation micro-suite.
    * Optional **quantization-aware training** reuse: if available, prefer QAT checkpoints for INT8.
  * **Model repository assembly**
    * Create the **Triton** model directory structure: `config.pbtxt`, versioned subfolders, pre/post-processing as Python or TensorRT backends, optional **ensemble** to fuse steps.
    * Generate **inference\_config.json** describing the IO schema, thresholds, NMS settings, class map, and expected augmentations disabled at inference.
  * **Security and compliance**
    * Generate an SBOM with **Syft**; scan the image and artifacts with **Trivy**.
    * License scan for third-party code; attach the report to the model card.
    * Sign artifacts and/or the container with **cosign** or **AWS Signer**; store signatures in S3 and publish the digest in release notes.
  * **Equivalence & performance checks**
    * **Numerical equivalence**: FP32 PyTorch vs the exported engine on 1k randomized inputs per head; require Δ within tolerances (e.g., bbox IoU drift < 0.5% absolute on the sample set; logits Δ < 1e-3). See the sketch after this list.
    * **Latency/throughput microbench**: run on the target instance type; collect p50/p95 latency, GPU util, memory footprint.
    * **Contract smoke**: load the model in a minimal Triton/TorchServe container; POST a known request; verify schema and ranges.
  * **Artifact packaging**
    * Produce: `model.ts` or `model.onnx`, `model.plan` (per device), `inference_config.json`, `calibration.cache`, `config.pbtxt`, SBOM, `export_report.json`.
    * Build and push the serving container to **ECR** tagged with semver and Git SHA (e.g., `adas-detector:1.8.0-abcdef0`).
    * Attach everything to a **W\&B Artifact** and store in **S3 Gold** under `/models///…`.
* **AWS/Tooling**
  * **ECR, S3, CodeBuild or GitHub Actions, KMS/Signer, Triton Inference Server, TensorRT, ONNX, TorchScript, W\&B Artifacts**.
* **Outputs**
  * Versioned, signed, performance-graded model packs per target.
  * `export_report.json` with compile flags, precisions, operator sets, and microbenchmarks.
  * Updated W\&B artifact lineage linking back to dataset and code.
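As a rough illustration of the graph-export and numerical-equivalence steps, the sketch below exports a model to ONNX (opset 17, dynamic batch axis) and compares its outputs against the FP32 PyTorch reference on randomized inputs. This is a minimal sketch, not the pipeline's actual export tool: the single-tensor `logits` output and the 3×640×640 input shape are illustrative assumptions, and the real recipe would repeat the check per head and per target precision.

```python
# Hedged sketch of the "Graph export" and "Numerical equivalence" steps in #15.
# The output/input shapes are assumptions; the caller supplies the candidate model.
import numpy as np
import torch
import onnxruntime as ort


def export_and_verify(model: torch.nn.Module, onnx_path: str = "model.onnx",
                      trials: int = 16, atol: float = 1e-3) -> None:
    model.eval()
    dummy = torch.randn(1, 3, 640, 640)
    torch.onnx.export(
        model, dummy, onnx_path,
        opset_version=17,                                   # matches the export recipe
        input_names=["images"], output_names=["logits"],
        dynamic_axes={"images": {0: "batch"}, "logits": {0: "batch"}},
    )

    sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    for _ in range(trials):
        x = torch.randn(1, 3, 640, 640)
        with torch.no_grad():
            ref = model(x).numpy()                          # FP32 PyTorch reference
        out = sess.run(None, {"images": x.numpy()})[0]      # exported graph output
        delta = float(np.max(np.abs(ref - out)))
        assert delta < atol, f"equivalence gate failed: max abs diff = {delta:.2e}"
```

The TensorRT `model.plan` built from this graph would get the same treatment on the target instance, with the bbox-IoU drift tolerance applied per detection head.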
---

### 16) Evaluation and Robustness

* **When it runs**
  * Immediately after packaging (#15) for each target precision.
  * Nightly regression across the full test library.
  * On request when a slice shows drift or new edge cases arrive from #12.
* **Inputs**
  * Exported models and serving containers from #15.
  * `golden_train/val/test.manifest`, `slices.yaml`, and extra **challenge suites** (rare weather, construction, tunnels).
  * Baseline “current production” metrics for A/B comparison.
  * Corruption/perturbation suite definitions and OOD probe sets.
* **Steps**
  * **Dataset integrity & leakage guards**
    * Verify manifests conform to schema; run **Great Expectations** on key fields.
    * Ensure no overlap of scene IDs across splits; enforce temporal and geographic separation policies.
  * **Primary evaluation**
    * Compute task-specific metrics:
      * 2D detection: COCO mAP, AP50/75, small/medium/large splits; per-class PR curves.
      * 3D detection: nuScenes metrics or KITTI AP on BEV and 3D boxes.
      * Segmentation/lanes: mIoU, F1, boundary IoU.
      * Prediction: ADE/FDE, miss rate at K.
    * **Slice evaluation** for weather, time of day, geography, road type; compute Δ vs previous release.
    * **Calibration**: ECE, Brier score, reliability diagrams; tune decision thresholds if needed.
  * **Robustness & stress testing**
    * **Image/point cloud corruptions**: blur, noise, JPEG compression, fog/rain/snow shaders, brightness/contrast; LiDAR dropouts; test at increasing severities; measure mAP/mIoU decay slopes.
    * **Temporal stress**: dropped frames, timestamp jitter, out-of-order batches; check tracker continuity and stability.
    * **Sensor faults**: zero out a camera or LiDAR for segments; confirm graceful degradation rules.
    * **Quantization sensitivity**: compare FP32 vs FP16 vs INT8 across slices.
  * **OOD & uncertainty**
    * OOD probes using max softmax probability or energy scores; compute AUROC/AUPR for OOD vs in-dist.
    * Uncertainty quality: NLL, coverage vs confidence; verify abstention policies are triggered sensibly.
  * **Latency and footprint**
    * Measure p50/p95 latency and throughput on target hardware using the packaged engine; cap memory and verify no OOM at peak batch/stream settings.
  * **Regression gates**
    * Define win conditions, e.g., `mAP_weighted +1.5` overall and **no** critical slice regression > 2%; latency p95 within budget; calibration ECE not worse (see the gate sketch after this list).
    * If a gate fails, emit a **blocking report** and route back to #14 or #8/#12 to mine data for failing slices.
  * **Reporting**
    * Create `eval_report.json`, `slice_metrics.parquet`, `robustness_report.json`, latency summaries, confusion matrices, and reliability plots; log all to **W\&B**.
    * Generate a human-readable `evaluation_summary.md` with a “What improved / what regressed / next actions” section.
* **AWS/Tooling**
  * **EKS or SageMaker Processing**, **Athena/Glue** for audit queries, **W\&B**, **Evidently** for reference vs candidate drift checks, **Triton Perf Analyzer** or custom profilers.
* **Outputs**
  * Machine-readable reports and plots; green/red promotion signal with rationale.
  * Pinned W\&B run linking evaluation to the packaged artifact.
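To make the regression gate concrete, here is a minimal, hedged sketch that compares candidate slice metrics against the production baseline and emits a blocking verdict. The column names (`slice`, `map`, `critical`) and the `overall` aggregate row are assumptions about the `slice_metrics.parquet` schema; the thresholds mirror the example win conditions above.

```python
# Hedged sketch of the "Regression gates" step in #16; schema names are assumptions.
import json
import pandas as pd


def regression_gate(candidate_parquet: str, baseline_parquet: str,
                    min_overall_gain: float = 1.5,
                    max_critical_drop: float = 2.0) -> dict:
    cand = pd.read_parquet(candidate_parquet).set_index("slice")
    base = pd.read_parquet(baseline_parquet).set_index("slice")
    delta = cand["map"] - base["map"]                      # per-slice mAP delta (points)

    overall_gain = float(delta.loc["overall"])             # assumed aggregate row
    critical_slices = cand.index[cand["critical"].astype(bool)]
    worst_critical_drop = (
        float((-delta.loc[critical_slices]).max()) if len(critical_slices) else 0.0
    )

    verdict = {
        "overall_gain": overall_gain,
        "worst_critical_drop": worst_critical_drop,
        "pass": overall_gain >= min_overall_gain
                and worst_critical_drop <= max_critical_drop,
    }
    with open("regression_gate.json", "w") as fh:
        json.dump(verdict, fh, indent=2)                   # feeds the blocking report
    return verdict
```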
---

### 17) Drive Replay and Simulation

* **When it runs**
  * After #16 indicates the candidate is promising but needs **closed-loop** validation.
  * As a mandatory gate for any major change touching perception→planning interfaces.
  * Periodically to re-validate regressions and expand the scenario library.
* **Inputs**
  * Exported model container and configs from #15.
  * **Log replay** bundles: synchronized multi-sensor recordings with ground truth labels.
  * **Scenario library**: OpenSCENARIO files and procedurally generated scenes based on real-world events (disengagements, near misses).
  * Vehicle dynamics and controller configs for realistic closed-loop behavior.
* **Steps**
  * **Open-loop replay**
    * Reproduce sensor timing, distortions, and calibration; feed logs through the candidate model.
    * Compute frame/segment-level perception metrics against ground truth; analyze time-to-first-detection, track continuity, and ghosting.
    * Flag segments where the candidate diverges materially from production; surface them for targeted review.
  * **Scenario extraction**
    * Convert flagged real-world intervals to **OpenSCENARIO** with actors, trajectories, traffic rules, and weather.
    * Parameterize scenarios (vehicle speed, gap times, actor types) for robust sweeps.
  * **Closed-loop simulation**
    * Run in **CARLA** or **NVIDIA Omniverse/Drive** with high-fidelity sensors and physics.
    * Connect inference to the autonomy stack’s planning/control (or a proxy controller) so the model’s outputs drive the ego vehicle.
    * Randomize across seeds: weather, lighting, textures, spawn densities; run many permutations per scenario.
    * Collect safety metrics: collisions per 1k km, off-road incidents, traffic rule violations, TTC minima, and comfort metrics (jerk/acc).
  * **Batch orchestration**
    * Distribute thousands of runs on **EKS** or **AWS Batch** with GPU nodes; mount scene assets via S3/FSx.
    * Cache compiled simulation assets to avoid rebuilds; checkpoint long sweeps.
  * **Review and gating**
    * Aggregate results; compare to production baselines and to #16 offline metrics (see the aggregation sketch after this list).
    * Define pass criteria per scenario category, e.g., **zero** collisions in NCAP-style scenes; no increase in near-misses for vulnerable road users; bounded comfort regressions.
    * Produce clips of failures for quick triage; create issues mapped back to #8/#12 for data requests if needed.
* **AWS/Tooling**
  * **EKS/ECS or Batch** with GPU, **S3/FSx**, **CloudWatch Logs**, **Omniverse/Drive Sim or CARLA**, **ROS/rosbag** replayers when needed, **OpenSCENARIO** toolchain, dashboards in **QuickSight**.
* **Outputs**
  * `sim_summary.parquet`, per-scenario CSV/JSON, video snippets of failures, heatmaps of violation types.
  * A “Sim Gate” verdict attached to the model’s W\&B artifact and promotion checklist.
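The “Sim Gate” roll-up can be sketched as below: it aggregates `sim_summary.parquet` into the pass criteria listed above (zero NCAP-style collisions, no increase in near-misses for vulnerable road users, a collision rate no worse than the production baseline). Column names such as `scenario_category`, `collisions`, `near_misses_vru`, and `km_driven` are assumptions about the summary schema, not a fixed contract.

```python
# Hedged sketch of the "Review and gating" step in #17; column names are assumptions.
import pandas as pd


def per_1k_km(df: pd.DataFrame) -> float:
    # Collision rate normalized per 1,000 driven kilometres.
    return 1000.0 * df["collisions"].sum() / max(float(df["km_driven"].sum()), 1e-6)


def sim_gate(candidate_parquet: str, baseline_parquet: str) -> dict:
    runs = pd.read_parquet(candidate_parquet)
    base = pd.read_parquet(baseline_parquet)

    # Zero tolerance for collisions in NCAP-style scenes.
    ncap_collisions = int(
        runs.loc[runs["scenario_category"] == "ncap", "collisions"].sum()
    )
    # Near-misses for vulnerable road users must not increase vs production.
    vru_delta = int(runs["near_misses_vru"].sum() - base["near_misses_vru"].sum())

    return {
        "ncap_collisions": ncap_collisions,
        "collisions_per_1k_km": per_1k_km(runs),
        "baseline_collisions_per_1k_km": per_1k_km(base),
        "vru_near_miss_delta": vru_delta,
        "pass": ncap_collisions == 0
                and per_1k_km(runs) <= per_1k_km(base)
                and vru_delta <= 0,
    }
```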
---

### 18) Registry and Promotion

* **When it runs**
  * After #16 and #17 return green.
  * On product manager approval and change-control window availability.
  * On rollback events (reverse promotion).
* **Inputs**
  * Candidate model pack(s) and serving container(s) from #15.
  * Evaluation and simulation reports from #16/#17.
  * Model card, SBOM, vulnerability/license scans, and signatures.
  * Release notes, migration notes, and serving configs.
* **Steps**
  * **Registry entry**
    * Create/update the entry in the **Model Registry** with immutable pointers:
      * W\&B Artifact digest, S3 artifact URIs, ECR image digest, commit SHA, dataset DVC tag.
      * Performance summary, slice table, latency budgets, supported targets, calibration cache version.
    * Apply **semantic versioning** and attach a stage: `staging`, `candidate`, `production`.
  * **Governance checks**
    * Validate approvals: technical, safety, security, and product.
    * Verify signatures and SBOM status; ensure vulnerability gates pass or are waived with justification.
    * Lock down IAM policies for read-only production consumption.
  * **Promotion plan**
    * Choose a rollout strategy: **shadow** (mirror traffic), **canary** (1%→5%→25%→100%), or **A/B** with customer cohorts.
    * Pre-deploy to **staging** Triton/TorchServe; run an **API contract** smoke and a performance soak (e.g., 30 min).
    * Define rollback SLOs: if p95 latency, error rate, or safety proxy metrics breach thresholds for N minutes, auto-rollback.
  * **Production push**
    * Deploy the canary to EKS/ECS or **SageMaker Endpoints** with the new container and model; wire **CloudWatch** alarms and **Auto Scaling**.
    * Gradually shift traffic; keep shadow for behavioral diffing (store diff summaries to S3).
    * Validate real-world **proxy KPIs** (e.g., false emergency braking rate, perception-planning disagreement rates).
  * **Finalize & broadcast**
    * Promote the registry stage to **production**; tag the previous model as **rollback** (see the sketch after this list).
    * Publish release notes and link to the model card, evaluation, simulation, and SBOM.
    * Notify stakeholders; update dashboards.
  * **Post-promotion hooks**
    * Kick off **Offline Mining** (#12) with fresh error clusters seeded from shadow/canary telemetry.
    * Schedule the next **weekly evaluation** on the full library to guard against late regressions.
    * Archive heavy intermediate artifacts per retention policy; maintain cost hygiene.
* **AWS/Tooling**
  * **SageMaker Model Registry** or **W\&B Artifacts** as the registry of truth, with a lightweight **DynamoDB** “promotion table” for active aliases.
  * **EKS/ECS or SageMaker Endpoints** for serving; **CloudWatch**, **Auto Scaling**, **EventBridge** for rollouts; **KMS/Signer/cosign** for integrity.
  * **Athena/QuickSight** for canary KPIs and shadow diffs.
* **Outputs**
  * A versioned, auditable **production** model entry with all lineage and approvals.
  * Canary/rollout timelines, SLO dashboards, and an automated rollback path.
  * Triggers fired to feed the next loop of the data engine.
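Finally, a hedged sketch of the DynamoDB “promotion table” flip: the `production` alias is pointed at the newly approved version, and the previous production entry is re-tagged as `rollback`, preserving the automated rollback path. The table name (`model-promotions`) and its `model`/`alias` key schema are assumptions for illustration.

```python
# Hedged sketch of the "Finalize & broadcast" alias flip in #18.
# Table name and item schema are assumptions, not the pipeline's actual contract.
from datetime import datetime, timezone

import boto3


def promote(model_name: str, new_version: str, ecr_digest: str,
            wandb_artifact: str) -> None:
    table = boto3.resource("dynamodb").Table("model-promotions")

    # The current production entry (if any) becomes the rollback target.
    current = table.get_item(
        Key={"model": model_name, "alias": "production"}
    ).get("Item")
    if current:
        table.put_item(Item={**current, "alias": "rollback"})

    # Point the production alias at the newly approved, signed artifact set.
    table.put_item(Item={
        "model": model_name,
        "alias": "production",
        "version": new_version,
        "ecr_digest": ecr_digest,
        "wandb_artifact": wandb_artifact,
        "promoted_at": datetime.now(timezone.utc).isoformat(),
    })
```

In practice, EventBridge (already listed under AWS/Tooling) could trigger this flip once the canary SLOs have held for the agreed window, with the serving layer resolving the active alias from the same table.

---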