Deployment & Serving

19) Canary / Shadow Deployment

  • When it runs

    • Immediately after Registry & Promotion (#18) marks a candidate as “ready-to-deploy”.

    • Also on demand for hotfixes and security-patch rebuilds of serving containers.

  • Inputs

    • Versioned, signed model packs and serving container(s) from Packaging (#15)—TorchScript/ONNX/TensorRT with config.pbtxt and inference_config.json.

    • Model Registry record (artifact digests, semver, dataset/hash lineage).

    • Rollout policy (canary steps, shadow sampling %, abort thresholds).

  • Steps

    • Environment prep

      • Provision/confirm multi-AZ EKS or SageMaker Endpoints; ensure VPC-only networking, VPC endpoints for S3, and TLS everywhere.

      • Warm capacity for both shadow and canary paths (separate HPA targets to isolate load).

    • Shadow mode (read-only)

      • Mirror a configurable % of real production traffic to the shadow model while keeping responses dark (not used by callers).

      • Log shadow outputs, latencies, and behavioral diffs vs. production to S3; compute summary KPIs (class-level recall/precision deltas, NMS stability, trajectory ADE/FDE deltas).

      • Validation gates: cap diff rates (e.g., abs(Δ recall pedestrian_night) ≤ 1.5%); watch p95 latency. Auto-stop shadow if anomalies exceed pre-set budgets.
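
      • A minimal sketch of the gate check, assuming per-slice recall summaries have already been computed from the logged diffs; the slice names, budgets, KPI dictionary layout, and stop hook are illustrative assumptions:

```python
# Shadow gate check (slice names, budgets, and KPI layout are assumptions for illustration).
SLICE_BUDGETS = {"pedestrian_night": 0.015, "cyclist_day": 0.02}  # max allowed |delta recall| per slice
P95_LATENCY_BUDGET_MS = 80.0  # assumed latency budget for the shadow path


def evaluate_shadow_gates(prod_kpis: dict, shadow_kpis: dict, shadow_p95_ms: float) -> list[str]:
    """Return gate violations; an empty list means the shadow run stays green."""
    violations = []
    for slice_name, budget in SLICE_BUDGETS.items():
        delta = abs(shadow_kpis["recall"][slice_name] - prod_kpis["recall"][slice_name])
        if delta > budget:
            violations.append(f"recall delta {delta:.3f} > {budget} on {slice_name}")
    if shadow_p95_ms > P95_LATENCY_BUDGET_MS:
        violations.append(f"shadow p95 {shadow_p95_ms:.1f} ms > {P95_LATENCY_BUDGET_MS} ms")
    return violations


# Auto-stop the shadow deployment when any budget is exceeded, e.g.:
# if evaluate_shadow_gates(prod_kpis, shadow_kpis, p95_ms): stop_shadow_mirroring()
```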

    • Canary (serve a fraction of live traffic)

      • Route a small cohort (e.g., 1%) to the candidate via ingress or SageMaker variant routing weights.

      • Enable gray logging: store complete requests + responses for the canary cohort, with PII redaction.

      • Health & SLO checks: request success rate, p95/p99 latency vs. SLO, GPU memory headroom, error budgets.

      • Increase traffic in steps (1% → 5% → 25% → 50% → 100%) only after each step maintains green KPIs for N minutes/hours.
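
      • A sketch of the stepped ramp using SageMaker production-variant weights; kpis_green is a placeholder health check (success rate, p95/p99, GPU headroom, error budget), and in practice the soak and promotion would be driven by CloudWatch alarms and EventBridge rather than a sleep:

```python
import time
from typing import Callable

import boto3

sm = boto3.client("sagemaker")
CANARY_STEPS = [0.01, 0.05, 0.25, 0.50, 1.00]  # matches the 1% → 5% → 25% → 50% → 100% ramp


def ramp_canary(endpoint_name: str, prod_variant: str, canary_variant: str,
                soak_seconds: int, kpis_green: Callable[[str], bool]) -> None:
    """Shift traffic to the canary variant step by step, holding each step while KPIs stay green."""
    for weight in CANARY_STEPS:
        sm.update_endpoint_weights_and_capacities(
            EndpointName=endpoint_name,
            DesiredWeightsAndCapacities=[
                {"VariantName": canary_variant, "DesiredWeight": weight},
                {"VariantName": prod_variant, "DesiredWeight": 1.0 - weight},
            ],
        )
        time.sleep(soak_seconds)            # hold the step for the agreed soak window
        if not kpis_green(endpoint_name):   # health & SLO checks from the rollout policy
            # Revert all traffic to production and bail out.
            sm.update_endpoint_weights_and_capacities(
                EndpointName=endpoint_name,
                DesiredWeightsAndCapacities=[
                    {"VariantName": canary_variant, "DesiredWeight": 0.0},
                    {"VariantName": prod_variant, "DesiredWeight": 1.0},
                ],
            )
            raise RuntimeError(f"canary aborted at weight {weight}")
```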

    • Abort / rollback path

      • Instant rollback to the previous production image via blue/green swap, or by reverting variant weights.

      • Preserve failure bundle (requests, traces, metrics) to S3 for Offline Mining (#12).

    • Documentation & sign-off

      • Append canary/shadow results to the model card and Registry entry.

      • Notify stakeholders with a concise status page (live KPI tiles and roll-forward/rollback decision log).

  • Core AWS / Tooling

    • EKS (Triton/TorchServe pods), ALB/NLB Ingress, SageMaker Endpoints (variant weights), App Mesh/Istio for traffic shaping, CloudWatch alarms, EventBridge for step promotions.

    • OpenTelemetry for traces, Prometheus/AMP + Grafana for SLOs, S3 for shadow logs and diffs, W&B for deployment run metadata.

  • Outputs & Storage

    • Canary/shadow KPI reports, diff summaries, traces; stored in S3 and linked in Registry.

    • Updated Registry stage (candidate → production) once canary completes.


20) A/B Testing & Feature Flags

  • When it runs

    • After canary when we want outcome-level proof (business or safety proxy KPIs).

    • During experiments that tune thresholds, ensemble weights, or post-processing steps without retraining.

  • Inputs

    • Deployed production and candidate models (or the same model with different post-processing/threshold configs).

    • Experiment Plan: primary metric(s), success criteria, sample size/power calculation, guardrails (safety, latency).

  • Steps

    • Flag & cohort design

      • Define treatment arms (e.g., Threshold_A vs Threshold_B; Model_v1.8 vs v1.7).

      • Cohort users/vehicles by geography, time window, or fleet slice to minimize interference.

      • Implement with a config/flag service (DynamoDB or LaunchDarkly) read at request start; cache locally with a short TTL to avoid tight coupling to the flag server (as sketched below).
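
      • A minimal flag read with a short-TTL local cache, assuming a DynamoDB table named feature_flags keyed by flag_name; the table name, key schema, and flag document shape are illustrative:

```python
import time

import boto3

_TABLE = boto3.resource("dynamodb").Table("feature_flags")  # assumed table name and key schema
_CACHE: dict[str, tuple[float, dict]] = {}
_TTL_SECONDS = 30  # short TTL so a flag flip propagates quickly without a read on every request


def get_flag(flag_name: str) -> dict:
    """Return the flag document, serving from the local cache while it is still fresh."""
    now = time.monotonic()
    hit = _CACHE.get(flag_name)
    if hit and now - hit[0] < _TTL_SECONDS:
        return hit[1]
    item = _TABLE.get_item(Key={"flag_name": flag_name}).get("Item", {})
    _CACHE[flag_name] = (now, item)
    return item


# e.g., get_flag("detector_threshold_experiment") might return
# {"enabled": True, "arms": {"Threshold_A": 0.45, "Threshold_B": 0.55}}
```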

    • Routing & consistency

      • Sticky assignment per device/vehicle to avoid cross-over contamination (see the hashing sketch after this list).

      • Keep feature parity across arms except for the variable under test.
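
      • The sticky assignment can be a deterministic hash of device/vehicle ID plus experiment name, so no assignment state needs to be stored; the identifiers below are illustrative:

```python
import hashlib


def assign_arm(device_id: str, experiment: str, arms: list[str]) -> str:
    """Deterministically bucket a device into an arm: same device + experiment → same arm every time."""
    digest = hashlib.sha256(f"{experiment}:{device_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % len(arms)
    return arms[bucket]


# assign_arm("VIN123", "threshold_ab_w18", ["Threshold_A", "Threshold_B"]) is stable across requests,
# so a vehicle never crosses over between arms mid-experiment.
```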

    • Metrics capture

      • Online KPIs (success rate, false-positive interventions, latency p95) plus safety proxies (e.g., disagreement with planner, emergency brake proxy rates).

      • Aggregate with exact timestamps and cohort tags; anonymize IDs at the logger.

    • Statistical analysis

      • Sequential testing, or a fixed-horizon test with correction for multiple looks; pre-register the test to avoid p-hacking (a worked fixed-horizon example follows this list).

      • Guardrail checks: if any safety guardrail breaches, auto-terminate the test and revert flags.
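
      • For the fixed-horizon case, a hand-rolled two-proportion z-test on a binary primary metric (e.g., request success rate); a sequential design would swap in an alpha-spending boundary instead:

```python
import math


def two_proportion_ztest(success_a: int, n_a: int, success_b: int, n_b: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z statistic, two-sided p-value, normal approximation)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value


# z, p = two_proportion_ztest(success_control, n_control, success_treatment, n_treatment)
# Declare a winner only if p < the pre-registered alpha AND no safety guardrail has breached.
```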

    • Decision & rollout

      • Promote the winning config/model by flipping flags globally or per slice; persist the final config to inference_config.json in the next release cycle.

      • Archive experiment results (effect size, confidence intervals, power achieved) in the Registry.

  • Core AWS / Tooling

    • DynamoDB (flag store) or LaunchDarkly, AppConfig, EventBridge for change broadcasts.

    • Athena/Glue + QuickSight for analysis; W&B to attach experiment metadata to model version.

  • Outputs & Storage

    • ab_summary.json, dashboards, and final flag state in DynamoDB/AppConfig; linked to Registry and model card.


21) Edge Build & OTA Packaging (Vehicle/Device)

  • When it runs

    • After cloud serving passes canary and we’re ready to produce edge-optimized builds.

    • On periodic runtime refreshes (driver version change, security patches) or new hardware SKU support.

  • Inputs

    • Model engine(s) per target (TensorRT FP16/INT8) from #15, with calibration cache.

    • Edge runtime constraints: memory/compute budgets, power/thermal envelopes, allowable latency.

    • Device fleet manifest: hardware SKU mapping, minimum supported driver/SDK versions.

  • Steps

    • Cross-compile & optimize

      • Build per-SKU TensorRT plans with tactic replay and builder flags aligned to the target (e.g., Orin/Drive); see the build sketch after this list.

      • Fuse pre/post operations into CUDA plugins where beneficial; ensure zero-copy tensors across stages.

      • Run quantization sanity checks under on-device emulation (QAT-aware if available).
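
      • A compressed per-SKU build sketch assuming the TensorRT 8.x Python API; tactic replay, target-specific builder flags, and the INT8 calibrator that consumes the calibration cache are intentionally omitted:

```python
import tensorrt as trt  # assumes TensorRT 8.x Python bindings on the build host/emulator

LOGGER = trt.Logger(trt.Logger.WARNING)


def build_plan(onnx_path: str, plan_path: str, fp16: bool = True, int8: bool = False) -> None:
    """Build a serialized TensorRT engine ("plan") for one target SKU from an ONNX export."""
    builder = trt.Builder(LOGGER)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))
    config = builder.create_builder_config()
    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    if int8:
        config.set_flag(trt.BuilderFlag.INT8)  # calibrator wiring (reading the calibration cache) omitted
    plan = builder.build_serialized_network(network, config)
    if plan is None:
        raise RuntimeError("engine build failed")
    with open(plan_path, "wb") as f:
        f.write(plan)
```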

    • Runtime container/component

      • Package as Greengrass component or OCI image with minimal base; pin CUDA/TensorRT versions; bundle config.pbtxt and inference_config.json.

      • Include a watchdog and health endpoints; implement local batcher and thermal-aware throttling hooks.

    • Hardware-in-the-loop tests

      • On a bench rig with target SoC, run smoke suite: contract tests, p95 latency, memory ceiling, and thermal soak.

      • Determinism checks at fixed seeds; performance variance bounds under thermal throttling scenarios.

    • Security & compliance

      • Code-sign artifacts (AWS Signer or cosign) and produce a per-device update manifest with checksums (manifest sketch after this list).

      • SBOM attached; license and IP provenance validated.
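
      • A sketch of the per-SKU update manifest with SHA-256 checksums; the manifest schema is illustrative, and the resulting file is what gets signed and referenced by the OTA job document:

```python
import hashlib
import json
import pathlib


def build_update_manifest(bundle_dir: str, sku: str, version: str) -> dict:
    """Compute a SHA-256 digest per artifact in the bundle and assemble an update manifest."""
    artifacts = []
    for path in sorted(pathlib.Path(bundle_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            artifacts.append({
                "name": str(path.relative_to(bundle_dir)),
                "sha256": digest,
                "bytes": path.stat().st_size,
            })
    return {"sku": sku, "version": version, "artifacts": artifacts}


# Example (paths and version are placeholders):
# manifest = build_update_manifest("bundles/orin_fp16", "orin", "1.8.0")
# json.dump(manifest, open("manifest.json", "w"), indent=2)  # then sign the manifest itself
```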

    • Release assembly

      • Generate OTA bundle per cohort: artifact URIs, rollout policy, preconditions (battery level, vehicle parked, firmware min version), recovery strategy.

      • Publish metadata to the OTA job catalog (IoT Jobs/FleetWise campaign).

  • Core AWS / Tooling

    • AWS IoT Greengrass components, AWS IoT FleetWise or IoT Device Management for campaigns, S3 artifact buckets, Signer/KMS for signatures.

    • Bench automation: EKS runner or on-prem CI hardware with GitHub Actions/CodeBuild.

  • Outputs & Storage

    • Signed edge bundles per SKU, update manifests, and bench reports; stored in S3 and indexed in a campaign DB (DynamoDB/Registry).


22) OTA Delivery (Fleet Campaigns)

  • When it runs

    • After edge bundles are ready and approved by safety/security leads.

    • Coordinated with operations windows (time-of-day, depot/garage schedules).

  • Inputs

    • OTA bundles + manifests from #21.

    • Fleet segmentation (VIN/Device IDs by geography, customer, regulatory domain).

    • Rollout strategy: staged waves, max concurrent updates, stop conditions.

  • Steps

    • Campaign creation

      • Define cohorts and scheduling: wave sizes, blackout periods, and retries (see the campaign sketch after this list).

      • Preconditions: device online, battery ≥ X%, connected to Wi-Fi or certain carriers, parked/ignition state.
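
      • A sketch of staged-wave campaign creation with AWS IoT Jobs via boto3; the thing-group ARN, rates, and abort thresholds are placeholders, and the device-side preconditions (battery, parked state) live in the job document interpreted by the update agent:

```python
import boto3

iot = boto3.client("iot")

iot.create_job(
    jobId="ota-perception-1_8_0-wave1",
    targets=["arn:aws:iot:eu-west-1:123456789012:thinggroup/fleet-wave-1"],  # placeholder cohort
    documentSource="https://example-bucket.s3.amazonaws.com/ota/1.8.0/job-document.json",
    targetSelection="SNAPSHOT",
    jobExecutionsRolloutConfig={
        "exponentialRate": {                 # start slow, accelerate while devices keep succeeding
            "baseRatePerMinute": 5,
            "incrementFactor": 2.0,
            "rateIncreaseCriteria": {"numberOfSucceededThings": 50},
        },
        "maximumPerMinute": 200,             # cap on concurrent update starts
    },
    abortConfig={
        "criteriaList": [{
            "failureType": "FAILED",
            "action": "CANCEL",
            "thresholdPercentage": 5.0,      # stop condition: cancel the wave at >= 5% failures
            "minNumberOfExecutedThings": 100,
        }]
    },
    timeoutConfig={"inProgressTimeoutInMinutes": 60},
)
```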

    • Secure distribution

      • Ship via IoT Jobs with signed URIs; devices verify signature and checksum before install.

      • Bandwidth shaping: CDN/S3 transfer acceleration; per-region throttles to avoid network saturation.

    • Install & verify

      • Atomic swap: install to A/B partition or container tag; upon success, flip active pointer.

      • Health probes post-install: run a local inference self-test; send success beacon with version and basic KPIs.

      • On failure, auto-rollback to previous slot and report error codes.
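
      • A device-side sketch of the verify → install → self-test → flip sequence; the ota-agent CLI and its subcommands are hypothetical stand-ins for the real update agent:

```python
import hashlib
import subprocess


def install_bundle(bundle_path: str, expected_sha256: str) -> bool:
    """Verify the bundle, install to the inactive slot, self-test, then flip; roll back on any failure."""
    digest = hashlib.sha256(open(bundle_path, "rb").read()).hexdigest()
    if digest != expected_sha256:
        return report("failed", "checksum_mismatch")

    inactive = subprocess.run(["ota-agent", "inactive-slot"], capture_output=True, text=True).stdout.strip()
    if subprocess.run(["ota-agent", "install", bundle_path, "--slot", inactive]).returncode != 0:
        return report("failed", "install_error")

    # Local inference self-test against golden inputs before committing the switch.
    if subprocess.run(["ota-agent", "self-test", "--slot", inactive]).returncode != 0:
        subprocess.run(["ota-agent", "rollback"])
        return report("failed", "self_test_failed")

    subprocess.run(["ota-agent", "set-active", inactive])  # atomic pointer flip
    return report("succeeded", "ok")


def report(status: str, detail: str) -> bool:
    print({"status": status, "detail": detail})  # stand-in for the MQTT success/failure beacon
    return status == "succeeded"
```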

    • Monitoring & control

      • Live campaign dashboard: started/succeeded/failed, per-region rates, error categories.

      • Pause/resume and wave-size adjustments in real time; stop the campaign when failure rates cross pre-set thresholds.

    • Post-deploy soak

      • Collect in-field telemetry: latency/thermals, crash reports, edge-level OOD counters, and lightweight quality proxies (e.g., detection density by condition).

      • Feed anomalies to Offline Mining (#12).

  • Core AWS / Tooling

    • AWS IoT Jobs / FleetWise, IoT Core, CloudWatch, Athena for campaign analytics, QuickSight dashboards.

    • KMS for artifact encryption at rest; Private CA for device certificates if needed.

  • Outputs & Storage

    • Campaign status logs, per-device install receipts, post-install health beacons; all in S3/DynamoDB, surfaced in dashboards and linked to Registry.


23) Online Service Operations (Cloud Inference)

  • When it runs

    • Always-on for cloud inference endpoints (batch and/or online).

    • Scales elastically with traffic; responds to deployments and load events.

  • Inputs

    • Production model image(s), inference_config.json, and feature/metadata services endpoints.

    • SLOs/SLAs: availability, p95/p99 latency, error budgets, cost per 1k inferences.

  • Steps

    • Service layout

      • Ingress → Request validator (schema, auth) → Preprocessing → Model → Post-processing → Response (see the skeleton after this list).

      • Optional Feature Online Store (Feast with DynamoDB/Redis) for feature joins; aggressive caching + TTLs.
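
      • A skeleton of that request path assuming a FastAPI front end; the endpoint path, payload schema, and the three placeholder stages stand in for the real preprocessing, Triton/TorchServe call, and post-processing configured by inference_config.json (auth via OIDC/JWT is assumed to sit at the ingress/mesh, per the Security bullets):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class DetectionRequest(BaseModel):   # request validator: schema enforcement before any model work
    frame_id: str
    embedding: list[float]


def preprocess(req: DetectionRequest) -> list[float]:
    return req.embedding                                      # placeholder: normalization, feature joins


async def run_model(features: list[float]) -> dict:
    return {"scores": [0.9], "boxes": [[0, 0, 10, 10]]}       # placeholder: Triton/TorchServe call


def postprocess(raw: dict) -> dict:
    return raw                                                # placeholder: thresholds/NMS from config


@app.post("/v1/detect")
async def detect(req: DetectionRequest) -> dict:
    return postprocess(await run_model(preprocess(req)))
```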

    • Resilience & scaling

      • HPA/KEDA on GPU/CPU utilization, QPS, and queue depth; min pods to absorb cold starts.

      • Connection pools, timeouts, circuit breakers (Envoy/App Mesh) for downstream calls; backpressure via bounded queues (sketched after this list).

      • Multi-AZ, pod disruption budgets, surge capacity for rollouts.
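
      • Backpressure via a bounded queue, in a compressed asyncio sketch; the queue size and error type are illustrative, and the rejection would surface as a 429/503 at the API layer:

```python
import asyncio


class OverloadedError(RuntimeError):
    """Raised when the service is saturated; maps to HTTP 429/503 at the API layer."""


REQUEST_QUEUE: asyncio.Queue = asyncio.Queue(maxsize=256)  # the bound is the backpressure signal


def enqueue_request(payload: dict) -> None:
    """Admit work only while the queue has room; shed load early instead of letting latency grow unbounded."""
    try:
        REQUEST_QUEUE.put_nowait(payload)
    except asyncio.QueueFull:
        raise OverloadedError("inference queue full, retry with backoff")

# Worker tasks drain the queue with `await REQUEST_QUEUE.get()` at the rate the GPU can sustain.
```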

    • Performance engineering

      • Pin NUMA/GPU affinity; use TensorRT/Triton dynamic batching with a carefully tuned maximum queue delay.

      • Pre-allocate memory pools; enable CUDA graph capture where applicable.

      • Async I/O; zero-copy tensors; avoid per-request allocations.

    • Security

      • mTLS in-mesh; OIDC/JWT at edge; fine-grained IAM for S3/feature store access.

      • WAF rules for ingress, request size caps, schema enforcement, and PII redaction at loggers.

    • Cost controls

      • Right-size instance types, spot for batch, on-demand for online; autoscaling floors/ceilings.

      • Periodic throughput/latency bin-packing reviews and mixed-precision tuning to reduce GPU milliseconds per inference.

    • Operational playbooks

      • Runbooks for incident classes (latency spike, elevated 5xx, GPU OOM, feature store timeouts).

      • Synthetic probes and golden queries; regular failover/fire-drill practices.

  • Core AWS / Tooling

    • EKS with Triton/TorchServe, ALB/NLB, App Mesh/Istio, Feast (DynamoDB/ElastiCache Redis), CloudWatch, SQS/Kinesis for async/batch, SageMaker Endpoints where managed is preferred.

  • Outputs & Storage

    • Live responses (API), structured logs, metrics, traces, and inference audit records (S3 with lifecycle policies).


24) Observability (Telemetry, Drift, Explainability)

  • When it runs

    • Continuously, from the moment traffic reaches shadow/canary through long-term production.

    • On scheduled jobs for deeper drift/quality analysis.

  • Inputs

    • Request/response telemetry, model outputs, confidence histograms, selective ground truth (from human QA or auto-label confirmations), and reference statistics from #16.

  • Steps

    • Metrics

      • System: QPS, p50/p95/p99 latency, GPU/CPU/memory utilization, queue depth, error rates (4xx/5xx).

      • Model: per-class score distributions, calibration ECE (see the sketch after this list), acceptance/abstention rates, novelty counters (OOD flags).

      • Data: feature value histograms, missingness, input schema drift.
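
      • A hand-rolled expected calibration error (ECE) over equal-width confidence bins, computed from production confidences and whatever selective ground truth (human QA or auto-label confirmations) is available:

```python
import numpy as np


def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 15) -> float:
    """ECE: traffic-weighted gap between mean confidence and observed accuracy per confidence bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return float(ece)


# confidences: top-1 scores from telemetry; correct: 0/1 agreement with QA or confirmed auto-labels.
```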

    • Logs

      • Structured, PII-redacted request/response logs; correlation IDs to join across services.

      • Failure bundles: auto-capture payload + model state for 5xx or large diffs; store to S3 with strict retention.

    • Traces

      • OpenTelemetry spans from ingress through model to downstream stores; trace sampling biased toward tail latency and errors.

    • Dashboards & alerts

      • Grafana/QuickSight boards by SLO tiers; CloudWatch alerts on SLO/SLA breaches, drift thresholds, and OOD spikes.

      • PagerDuty/Slack routes with severity mapping; include runbooks and auto-remediation hooks (e.g., scale-up, switch to previous model, or temporary rule override).

    • Drift & quality analytics

      • Daily/weekly jobs (Airflow) that run Evidently against rolling windows: covariate drift, concept drift (where labels are available), PSI/KS tests per feature and per slice (a PSI sketch follows this list).

      • Canary sentinels: raise alerts early for historically fragile slices (e.g., night + rain + pedestrian).
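
      • A hand-rolled population stability index (PSI) for a single feature, comparing a reference window (statistics from #16) against a rolling production window; Evidently computes the same family of statistics, this just shows the arithmetic:

```python
import numpy as np


def population_stability_index(reference: np.ndarray, current: np.ndarray, n_bins: int = 10) -> float:
    """PSI between a reference distribution and a current window for one numeric feature."""
    edges = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1))
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    # Clip current values into the reference range so out-of-range values land in the edge bins.
    cur_frac = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)  # avoid log(0) on empty bins
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


# Common rule of thumb in drift runbooks: PSI < 0.1 stable, 0.1–0.25 investigate, > 0.25 significant drift.
```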

    • Explainability

      • Lightweight SHAP-on-sample or gradient-based saliency for a small percentage of requests in staging; store as artifacts for model debugging.

      • Maintain live model-card sections: slipping data slices, observed biases, mitigations taken.

    • Feedback loops

      • Emit curated failure/novelty cohorts to Offline Mining (#12) with descriptors and query templates.

      • Track time-to-mitigation and defect escape rate as MLOps KPIs.

  • Core AWS / Tooling

    • OpenTelemetry Collector, AMP/Prometheus, Grafana, CloudWatch (metrics/logs), Athena/Glue for large-scale log queries, Evidently for drift, W&B for attaching production metrics to model versions.

  • Outputs & Storage

    • Time-series metrics, traces, and logs in AMP/CloudWatch + S3 data lake; drift reports; incident tickets; curated error cohorts for the next training loop.