# Deployment & Serving

### 19) Canary / Shadow Deployment

* **When it runs**
  * Immediately after **Registry & Promotion** (#18) marks a candidate as “ready-to-deploy”.
  * Also on demand for hotfixes and security-patch rebuilds of serving containers.
* **Inputs**
  * Versioned, signed model packs and serving container(s) from **Packaging** (#15): TorchScript/ONNX/TensorRT with `config.pbtxt` and `inference_config.json`.
  * Model Registry record (artifact digests, semver, dataset/hash lineage).
  * Rollout policy (canary steps, shadow sampling %, abort thresholds).
* **Steps**
  * **Environment prep**
    * Provision/confirm **multi-AZ** EKS or SageMaker Endpoints; ensure VPC-only networking, VPC endpoints for S3, and TLS everywhere.
    * Warm capacity for both **shadow** and **canary** paths (separate HPA targets to isolate load).
  * **Shadow mode (read-only)**
    * Mirror a configurable % of real production traffic to the **shadow** model while keeping responses dark (not used by callers).
    * Log shadow outputs, latencies, and **behavioral diffs** vs. production to S3; compute summary KPIs (class-level recall/precision deltas, NMS stability, trajectory ADE/FDE deltas).
    * **Validation gates:** cap diff rates (e.g., |Δ recall, pedestrian_night| ≤ 1.5%); watch p95 latency. Auto-stop shadow if anomalies exceed pre-set budgets.
  * **Canary (serve a fraction of live traffic)**
    * Route a small cohort (e.g., 1%) to the candidate via ingress or SageMaker variant routing weights.
    * Enable **gray** logging: store complete requests and responses for the canary cohort, with PII redaction.
    * **Health & SLO checks:** request success rate, p95/p99 latency vs. SLO, GPU memory headroom, error budgets.
    * Increase traffic in steps (1% → 5% → 25% → 50% → 100%) only after each step maintains green KPIs for N minutes/hours (see the sketch at the end of this section).
  * **Abort / rollback path**
    * Instant rollback to the previous production image via **blue/green** swap, or by reverting variant weights.
    * Preserve the failure bundle (requests, traces, metrics) to S3 for **Offline Mining** (#12).
  * **Documentation & sign-off**
    * Append canary/shadow results to the model card and Registry entry.
    * Notify stakeholders with a concise status page (live KPI tiles and a roll-forward/rollback decision log).
* **Core AWS / Tooling**
  * **EKS** (Triton/TorchServe pods), **ALB/NLB** Ingress, **SageMaker Endpoints** (variant weights), **App Mesh/Istio** for traffic shaping, **CloudWatch** alarms, **EventBridge** for step promotions.
  * **OpenTelemetry** for traces, **Prometheus/AMP** + **Grafana** for SLOs, **S3** for shadow logs and diffs, **W&B** for deployment run metadata.
* **Outputs & Storage**
  * Canary/shadow KPI reports, diff summaries, traces; stored in **S3** and linked in the Registry.
  * Updated Registry stage (`candidate → production`) once the canary completes.
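A minimal sketch of the step-up loop, assuming a SageMaker endpoint with two production variants named `production` and `candidate`; the endpoint name, soak time, and the `kpis_green()` gate are placeholders for whatever the rollout policy and the CloudWatch/Prometheus checks actually define.

```python
import time

import boto3

sm = boto3.client("sagemaker")

ENDPOINT = "perception-prod"                 # hypothetical endpoint name
STEPS = [0.01, 0.05, 0.25, 0.50, 1.00]       # canary traffic fractions
SOAK_SECONDS = 30 * 60                       # hold each step for N minutes


def kpis_green() -> bool:
    """Placeholder: check success rate, p95/p99 latency vs. SLO, and GPU
    memory headroom for the canary cohort (CloudWatch / Prometheus)."""
    raise NotImplementedError


def set_canary_weight(weight: float) -> None:
    """Shift traffic between the two variants and wait for the update to land."""
    sm.update_endpoint_weights_and_capacities(
        EndpointName=ENDPOINT,
        DesiredWeightsAndCapacities=[
            {"VariantName": "production", "DesiredWeight": 1.0 - weight},
            {"VariantName": "candidate", "DesiredWeight": weight},
        ],
    )
    sm.get_waiter("endpoint_in_service").wait(EndpointName=ENDPOINT)


for step in STEPS:
    set_canary_weight(step)
    time.sleep(SOAK_SECONDS)                 # soak before the next promotion
    if not kpis_green():
        set_canary_weight(0.0)               # instant rollback to production
        break
```

On EKS the same loop would adjust App Mesh/Istio route weights instead of variant weights; the gate-then-promote structure is unchanged.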
---

### 20) A/B Testing & Feature Flags

* **When it runs**
  * After canary, when we want outcome-level proof (business or safety proxy KPIs).
  * During experiments that tune thresholds, ensemble weights, or post-processing steps without retraining.
* **Inputs**
  * Deployed production and candidate models (or the same model with different **post-processing/threshold configs**).
  * **Experiment Plan**: primary metric(s), success criteria, sample size/power calculation, guardrails (safety, latency).
* **Steps**
  * **Flag & cohort design**
    * Define **treatment arms** (e.g., Threshold_A vs Threshold_B; Model_v1.8 vs v1.7).
    * Assign users/vehicles to cohorts by geography, time window, or fleet slice to minimize interference.
    * Implement with a **config/flag service** (DynamoDB or LaunchDarkly) read at request start; cache locally with a short TTL to avoid tight coupling to the flag server.
  * **Routing & consistency**
    * Sticky assignment per device/vehicle to avoid cross-over contamination (see the sketch at the end of this section).
    * Keep **feature parity** across arms except for the variable under test.
  * **Metrics capture**
    * Online KPIs (success rate, false-positive interventions, latency p95) plus **safety proxies** (e.g., disagreement with the planner, emergency-brake proxy rates).
    * Aggregate with exact timestamps and cohort tags; anonymize IDs at the logger.
  * **Statistical analysis**
    * Sequential testing, or a fixed-horizon test with correction for multiple looks; pre-register the test to avoid p-hacking.
    * Guardrail checks: if any safety guardrail is breached, auto-terminate the test and revert the flags.
  * **Decision & rollout**
    * Promote the winning config/model by flipping flags globally or per slice; persist the final config to **inference_config.json** in the next release cycle.
    * Archive experiment results (effect size, confidence intervals, power achieved) in the Registry.
* **Core AWS / Tooling**
  * **DynamoDB** (flag store) or **LaunchDarkly**, **AppConfig**, **EventBridge** for change broadcasts.
  * **Athena/Glue** + **QuickSight** for analysis; **W&B** to attach experiment metadata to the model version.
* **Outputs & Storage**
  * `ab_summary.json`, dashboards, and the final flag state in **DynamoDB/AppConfig**; linked to the Registry and model card.
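A small sketch of sticky, deterministic arm assignment: hashing the device ID with the experiment name keeps the assignment stable across requests and independent across experiments. The experiment name, arm labels, and weights are illustrative; in practice the weights would come from the flag store above.

```python
import hashlib


def assign_arm(device_id: str, experiment: str,
               arms: list[str], weights: list[float]) -> str:
    """Map a device/vehicle to a treatment arm with a stable hash-based bucket."""
    digest = hashlib.sha256(f"{experiment}:{device_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF      # uniform in [0, 1]
    cumulative = 0.0
    for arm, weight in zip(arms, weights):
        cumulative += weight
        if bucket <= cumulative:
            return arm
    return arms[-1]                                # guard against float round-off


# Example: 50/50 split between two post-processing threshold configs.
arm = assign_arm("VIN-1234", "threshold-exp-07",
                 ["Threshold_A", "Threshold_B"], [0.5, 0.5])
```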
---

### 21) Edge Build & OTA Packaging (Vehicle/Device)

* **When it runs**
  * After cloud serving passes canary and we’re ready to produce **edge-optimized** builds.
  * On periodic runtime refreshes (driver version change, security patches) or new hardware SKU support.
* **Inputs**
  * Model engine(s) per target (TensorRT FP16/INT8) from #15, with calibration cache.
  * Edge runtime constraints: memory/compute budgets, power/thermal envelopes, allowable latency.
  * Device fleet manifest: hardware SKU mapping, minimum supported driver/SDK versions.
* **Steps**
  * **Cross-compile & optimize**
    * Build per-SKU **TensorRT** plans with tactic replay and builder flags aligned to the target (e.g., Orin/Drive).
    * Fuse pre/post operations into CUDA plugins where beneficial; ensure zero-copy tensors across stages.
    * Run **quantization sanity** checks under on-device emulation (QAT-aware if available).
  * **Runtime container/component**
    * Package as a **Greengrass** component or OCI image with a minimal base; pin CUDA/TensorRT versions; bundle `config.pbtxt` and `inference_config.json`.
    * Include a watchdog and **health endpoints**; implement a local batcher and thermal-aware throttling hooks.
  * **Hardware-in-the-loop tests**
    * On a bench rig with the target SoC, run the **smoke suite**: contract tests, p95 latency, memory ceiling, and thermal soak.
    * **Determinism checks** at fixed seeds; performance variance bounds under thermal-throttling scenarios.
  * **Security & compliance**
    * Code-sign artifacts (**AWS Signer** or **cosign**) and produce a per-device **update manifest** with checksums (a manifest-generation sketch follows at the end of #22).
    * SBOM attached; license and IP provenance validated.
  * **Release assembly**
    * Generate the OTA bundle per cohort: artifact URIs, rollout policy, preconditions (battery level, vehicle parked, firmware min version), recovery strategy.
    * Publish metadata to the **OTA job catalog** (IoT Jobs/FleetWise campaign).
* **Core AWS / Tooling**
  * **AWS IoT Greengrass** components, **AWS IoT FleetWise** or **IoT Device Management** for campaigns, **S3** artifact buckets, **Signer/KMS** for signatures.
  * Bench automation: **EKS** runner or on-prem CI hardware with **GitHub Actions/CodeBuild**.
* **Outputs & Storage**
  * Signed edge bundles per SKU, update manifests, and bench reports; stored in **S3** and indexed in a **campaign DB** (DynamoDB/Registry).

---

### 22) OTA Delivery (Fleet Campaigns)

* **When it runs**
  * After edge bundles are ready and approved by safety/security leads.
  * Coordinated with operations windows (time-of-day, depot/garage schedules).
* **Inputs**
  * OTA bundles + manifests from #21.
  * Fleet segmentation (VIN/Device IDs by geography, customer, regulatory domain).
  * Rollout strategy: staged waves, max concurrent updates, stop conditions.
* **Steps**
  * **Campaign creation**
    * Define cohorts and scheduling: wave sizes, blackout periods, and retries.
    * Preconditions: device online, battery ≥ X%, connected to Wi-Fi or certain carriers, parked/ignition state.
  * **Secure distribution**
    * Ship via **IoT Jobs** with signed URIs; devices verify signature and checksum before install (see the install sketch at the end of this section).
    * **Bandwidth shaping**: CDN/S3 Transfer Acceleration; per-region throttles to avoid network saturation.
  * **Install & verify**
    * Atomic swap: install to the inactive **A/B partition** or container tag; upon success, flip the active pointer.
    * **Health probes** post-install: run a local inference self-test; send a success beacon with the version and basic KPIs.
    * On failure, auto-rollback to the previous slot and report error codes.
  * **Monitoring & control**
    * Live campaign dashboard: started/succeeded/failed counts, per-region rates, error categories.
    * Pause/resume and wave-size adjustments in real time; stop the campaign when failure rates exceed thresholds.
  * **Post-deploy soak**
    * Collect **in-field telemetry**: latency/thermals, crash reports, edge-level OOD counters, and lightweight quality proxies (e.g., detection density by condition).
    * Feed anomalies to **Offline Mining** (#12).
* **Core AWS / Tooling**
  * **AWS IoT Jobs / FleetWise**, **IoT Core**, **CloudWatch**, **Athena** for campaign analytics, **QuickSight** dashboards.
  * **KMS** for artifact encryption at rest; **Private CA** for device certificates if needed.
* **Outputs & Storage**
  * Campaign status logs, per-device install receipts, post-install health beacons; all in **S3/DynamoDB**, surfaced in dashboards and linked to the Registry.
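A sketch of the per-device update-manifest generation from #21, assuming the signed engine/runtime artifacts are already staged locally; the field names, precondition values, SKU, and paths are illustrative rather than a fixed schema.

```python
import hashlib
import json
import pathlib


def sha256_of(path: pathlib.Path) -> str:
    """Stream a file through SHA-256 so large engines don't load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(sku: str, version: str, artifact_dir: str) -> dict:
    artifacts = [
        {"name": p.name, "sha256": sha256_of(p), "size_bytes": p.stat().st_size}
        for p in sorted(pathlib.Path(artifact_dir).glob("*")) if p.is_file()
    ]
    return {
        "sku": sku,
        "bundle_version": version,
        "artifacts": artifacts,
        "preconditions": {                     # placeholder rollout preconditions
            "min_battery_pct": 30,
            "vehicle_state": "parked",
            "min_firmware": "4.2.0",
        },
        "recovery": "rollback_to_previous_slot",
    }


manifest = build_manifest("orin-nx", "1.8.0", "./bundle/orin-nx")
pathlib.Path("update_manifest.json").write_text(json.dumps(manifest, indent=2))
```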
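A companion sketch of the device-side install path from #22: verify, install into the inactive A/B slot, self-test, then flip the active pointer. `verify_signature`, `install_to_slot`, `self_test`, and the pointer file are stand-ins for the platform's real updater hooks, not a prescribed implementation.

```python
import hashlib
import pathlib

ACTIVE_POINTER = pathlib.Path("/var/lib/updater/active_slot")   # hypothetical path


def verify_signature(bundle: pathlib.Path) -> bool:
    """Placeholder for the platform signature check (AWS Signer / cosign)."""
    raise NotImplementedError


def install_to_slot(bundle: pathlib.Path, slot: str) -> None:
    """Placeholder: unpack the bundle into the inactive A/B slot."""
    raise NotImplementedError


def self_test(slot: str) -> bool:
    """Placeholder: run the local inference self-test against the new slot."""
    raise NotImplementedError


def apply_update(bundle: pathlib.Path, expected_sha256: str) -> bool:
    """Return True on success; the caller sends the success/failure beacon."""
    if not verify_signature(bundle):
        return False
    if hashlib.sha256(bundle.read_bytes()).hexdigest() != expected_sha256:
        return False                          # checksum mismatch: refuse to install
    current = ACTIVE_POINTER.read_text().strip()
    target = "B" if current == "A" else "A"
    install_to_slot(bundle, target)           # old slot keeps serving meanwhile
    if not self_test(target):
        return False                          # pointer untouched, so old slot stays live
    ACTIVE_POINTER.write_text(target)         # flip only after the self-test passes
    return True
```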
---

### 23) Online Service Operations (Cloud Inference)

* **When it runs**
  * Always-on for cloud inference endpoints (batch and/or online).
  * Scales elastically with traffic; responds to deployments and load events.
* **Inputs**
  * Production model image(s), **inference_config.json**, and feature/metadata service endpoints.
  * SLOs/SLAs: availability, p95/p99 latency, error budgets, cost per 1k inferences.
* **Steps**
  * **Service layout**
    * **Ingress** → **Request validator** (schema, auth) → **Preprocessing** → **Model** → **Post-processing** → **Response**.
    * Optional **Feature Online Store** (Feast with DynamoDB/Redis) for feature joins; aggressive caching + TTLs.
  * **Resilience & scaling**
    * **HPA/KEDA** on GPU/CPU utilization, QPS, and queue depth; minimum pod counts to absorb cold starts.
    * Connection pools, timeouts, **circuit breakers** (Envoy/App Mesh) for downstream calls; backpressure via bounded queues (see the sketch at the end of this section).
    * **Multi-AZ**, pod disruption budgets, surge capacity for rollouts.
  * **Performance engineering**
    * Pin NUMA/GPU affinity; TensorRT/Triton dynamic batching with a careful max delay.
    * Pre-allocate memory pools; enable CUDA graph capture where applicable.
    * Async I/O; zero-copy tensors; avoid per-request allocations.
  * **Security**
    * mTLS in-mesh; OIDC/JWT at the edge; fine-grained IAM for S3/feature-store access.
    * WAF rules for ingress, request size caps, schema enforcement, and PII redaction at the loggers.
  * **Cost controls**
    * Right-size instance types; use spot for batch and on-demand for online; set autoscaling floors/ceilings.
    * Periodic **throughput/latency bin-packing** reviews and **mixed-precision** tuning to reduce GPU ms/inference.
  * **Operational playbooks**
    * Runbooks for incident classes (latency spike, elevated 5xx, GPU OOM, feature-store timeouts).
    * Synthetic probes and golden queries; regular failover/fire-drill practice.
* **Core AWS / Tooling**
  * **EKS** with **Triton/TorchServe**, **ALB/NLB**, **App Mesh/Istio**, **Feast** (DynamoDB/ElastiCache Redis), **CloudWatch**, **SQS/Kinesis** for async/batch, **SageMaker Endpoints** where managed is preferred.
* **Outputs & Storage**
  * Live responses (API), **structured logs**, **metrics**, **traces**, and **inference audit records** (S3 with lifecycle policies).
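A sketch of the bounded-queue backpressure and per-request timeout mentioned under resilience & scaling, assuming an asyncio-based wrapper around the model call; the in-flight limit, timeout budget, and `run_model()` are placeholders to be tuned against the real SLO.

```python
import asyncio

MAX_IN_FLIGHT = 64            # bounded "queue": shed load instead of piling up
REQUEST_TIMEOUT_S = 0.200     # per-request budget aligned with the latency SLO

_inflight = asyncio.Semaphore(MAX_IN_FLIGHT)


class Overloaded(Exception):
    """Mapped to HTTP 429/503 by the ingress layer."""


async def run_model(payload: dict) -> dict:
    """Placeholder for preprocess -> Triton/TorchServe call -> post-process."""
    raise NotImplementedError


async def handle(payload: dict) -> dict:
    if _inflight.locked():                    # all slots taken: reject immediately
        raise Overloaded("in-flight limit reached")
    async with _inflight:
        try:
            return await asyncio.wait_for(run_model(payload), REQUEST_TIMEOUT_S)
        except asyncio.TimeoutError:
            raise Overloaded("request exceeded its latency budget") from None
```

Rejecting early keeps tail latency flat under overload and lets HPA/KEDA scale on queue-depth and rejection metrics rather than on already-degraded requests.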
---

### 24) Observability (Telemetry, Drift, Explainability)

* **When it runs**
  * Continuously, from the moment traffic reaches shadow/canary through long-term production.
  * On scheduled jobs for deeper drift/quality analysis.
* **Inputs**
  * Request/response telemetry, model outputs, confidence histograms, selective ground truth (from human QA or auto-label confirmations), and reference statistics from #16.
* **Steps**
  * **Metrics**
    * **System**: QPS, p50/p95/p99 latency, GPU/CPU/memory utilization, queue depth, error rates (4xx/5xx).
    * **Model**: per-class score distributions, calibration ECE, acceptance/abstention rates, novelty counters (OOD flags).
    * **Data**: feature value histograms, missingness, input schema drift.
  * **Logs**
    * Structured, PII-redacted request/response logs; correlation IDs to join across services.
    * **Failure bundles**: auto-capture the payload + model state for 5xx errors or large diffs; store to S3 with strict retention.
  * **Traces**
    * **OpenTelemetry** spans from ingress through the model to downstream stores; trace sampling biased toward tail latency and errors.
  * **Dashboards & alerts**
    * Grafana/QuickSight boards by **SLO tiers**; CloudWatch alerts on SLO/SLA breaches, drift thresholds, and OOD spikes.
    * PagerDuty/Slack routes with severity mapping; include runbooks and **auto-remediation** hooks (e.g., scale up, switch to the previous model, or a temporary rule override).
  * **Drift & quality analytics**
    * Daily/weekly jobs (**Airflow**) that run **Evidently** against rolling windows: covariate drift, concept drift (where labels are available), PSI/KS tests per feature and per slice (see the PSI sketch at the end of this section).
    * **Canary sentinels**: raise alerts early for historically fragile slices (night + rain + pedestrian).
  * **Explainability**
    * Lightweight **SHAP-on-sample** or gradient-based saliency for a small percentage of requests in staging; store as artifacts for model debugging.
    * Maintain live **model card** sections: slipping data slices, observed biases, mitigations taken.
  * **Feedback loops**
    * Emit curated failure/novelty cohorts to **Offline Mining** (#12) with descriptors and query templates.
    * Track **time-to-mitigation** and **defect escape rate** as MLOps KPIs.
* **Core AWS / Tooling**
  * **OpenTelemetry Collector**, **AMP/Prometheus**, **Grafana**, **CloudWatch** (metrics/logs), **Athena/Glue** for large-scale log queries, **Evidently** for drift, **W&B** for attaching production metrics to model versions.
* **Outputs & Storage**
  * Time-series metrics, traces, and logs in AMP/CloudWatch + the **S3** data lake; drift reports; incident tickets; curated error cohorts for the next training loop.
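A sketch of the Population Stability Index (PSI) arithmetic behind the per-feature drift checks; Evidently runs PSI/KS tests as part of its drift reports, so this only illustrates the math. Bin edges come from the reference window (#16), and the 0.2 alert threshold is a common rule of thumb rather than a value defined by this pipeline.

```python
import numpy as np


def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI of a production window vs. the reference distribution of one feature."""
    edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))
    # Clip both windows into the reference range so out-of-range values land
    # in the outer bins instead of being dropped by np.histogram.
    ref_counts, _ = np.histogram(np.clip(reference, edges[0], edges[-1]), edges)
    cur_counts, _ = np.histogram(np.clip(current, edges[0], edges[-1]), edges)
    eps = 1e-6                                      # avoid log(0) / division by zero
    ref_frac = np.clip(ref_counts / max(len(reference), 1), eps, None)
    cur_frac = np.clip(cur_counts / max(len(current), 1), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


# Example: flag a feature for review when PSI on a rolling window exceeds ~0.2.
rng = np.random.default_rng(0)
reference_window = rng.normal(0.0, 1.0, 50_000)     # stats captured at #16
production_window = rng.normal(0.3, 1.0, 50_000)    # rolling production sample
needs_review = psi(reference_window, production_window) > 0.2
```

---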