Deployment & Serving

19) Canary / Shadow Deployment

  • When it runs

    • Immediately after Registry & Promotion (#18) marks a candidate as “ready-to-deploy”.

    • Also on demand for hotfixes and security-patch rebuilds of serving containers.

  • Inputs

    • Versioned, signed model packs and serving container(s) from Packaging (#15)—TorchScript/ONNX/TensorRT with config.pbtxt and inference_config.json.

    • Model Registry record (artifact digests, semver, dataset/hash lineage).

    • Rollout policy (canary steps, shadow sampling %, abort thresholds).

  • Steps

    • Environment prep

      • Provision/confirm multi-AZ EKS or SageMaker Endpoints; ensure VPC-only networking, VPC endpoints for S3, and TLS everywhere.

      • Warm capacity for both shadow and canary paths (separate HPA targets to isolate load).

    • Shadow mode (read-only)

      • Mirror a configurable % of real production traffic to the shadow model while keeping responses dark (not used by callers).

      • Log shadow outputs, latencies, and behavioral diffs vs. production to S3; compute summary KPIs (class-level recall/precision deltas, NMS stability, trajectory ADE/FDE deltas).

      • Validation gates: cap diff rates (e.g., abs(Δ recall pedestrian_night) ≤ 1.5%); watch p95 latency. Auto-stop shadow if anomalies exceed pre-set budgets.
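
      • A minimal sketch of the gate check, assuming per-slice recall summaries have already been computed from the logged diffs; the slice names, budgets, KPI dictionary layout, and stop hook are illustrative assumptions:

```python
# Shadow gate check (slice names, budgets, and KPI layout are assumptions for illustration).
SLICE_BUDGETS = {"pedestrian_night": 0.015, "cyclist_day": 0.02}  # max allowed |delta recall| per slice
P95_LATENCY_BUDGET_MS = 80.0  # assumed latency budget for the shadow path


def evaluate_shadow_gates(prod_kpis: dict, shadow_kpis: dict, shadow_p95_ms: float) -> list[str]:
    """Return gate violations; an empty list means the shadow run stays green."""
    violations = []
    for slice_name, budget in SLICE_BUDGETS.items():
        delta = abs(shadow_kpis["recall"][slice_name] - prod_kpis["recall"][slice_name])
        if delta > budget:
            violations.append(f"recall delta {delta:.3f} > {budget} on {slice_name}")
    if shadow_p95_ms > P95_LATENCY_BUDGET_MS:
        violations.append(f"shadow p95 {shadow_p95_ms:.1f} ms > {P95_LATENCY_BUDGET_MS} ms")
    return violations


# Auto-stop the shadow deployment when any budget is exceeded, e.g.:
# if evaluate_shadow_gates(prod_kpis, shadow_kpis, p95_ms): stop_shadow_mirroring()
```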

    • Canary (serve a fraction of live traffic)

      • Route a small cohort (e.g., 1%) to the candidate via ingress or SageMaker variant routing weights.

      • Enable gray logging: store complete requests + responses for the canary cohort, with PII redaction.

      • Health & SLO checks: request success rate, p95/p99 latency vs. SLO, GPU memory headroom, error budgets.

      • Increase traffic in steps (1% → 5% → 25% → 50% → 100%) only after each step maintains green KPIs for N minutes/hours.
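
      • A sketch of the stepped ramp using SageMaker production-variant weights; kpis_green is a placeholder health check (success rate, p95/p99, GPU headroom, error budget), and in practice the soak and promotion would be driven by CloudWatch alarms and EventBridge rather than a sleep:

```python
import time
from typing import Callable

import boto3

sm = boto3.client("sagemaker")
CANARY_STEPS = [0.01, 0.05, 0.25, 0.50, 1.00]  # matches the 1% → 5% → 25% → 50% → 100% ramp


def ramp_canary(endpoint_name: str, prod_variant: str, canary_variant: str,
                soak_seconds: int, kpis_green: Callable[[str], bool]) -> None:
    """Shift traffic to the canary variant step by step, holding each step while KPIs stay green."""
    for weight in CANARY_STEPS:
        sm.update_endpoint_weights_and_capacities(
            EndpointName=endpoint_name,
            DesiredWeightsAndCapacities=[
                {"VariantName": canary_variant, "DesiredWeight": weight},
                {"VariantName": prod_variant, "DesiredWeight": 1.0 - weight},
            ],
        )
        time.sleep(soak_seconds)            # hold the step for the agreed soak window
        if not kpis_green(endpoint_name):   # health & SLO checks from the rollout policy
            # Revert all traffic to production and bail out.
            sm.update_endpoint_weights_and_capacities(
                EndpointName=endpoint_name,
                DesiredWeightsAndCapacities=[
                    {"VariantName": canary_variant, "DesiredWeight": 0.0},
                    {"VariantName": prod_variant, "DesiredWeight": 1.0},
                ],
            )
            raise RuntimeError(f"canary aborted at weight {weight}")
```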

    • Abort / rollback path

      • Instant rollback to the previous production image via blue/green swap, or by reverting variant weights.

      • Preserve failure bundle (requests, traces, metrics) to S3 for Offline Mining (#12).

    • Documentation & sign-off

      • Append canary/shadow results to the model card and Registry entry.

      • Notify stakeholders with a concise status page (live KPI tiles and roll-forward/rollback decision log).

  • Core AWS / Tooling

    • EKS (Triton/TorchServe pods), ALB/NLB Ingress, SageMaker Endpoints (variant weights), App Mesh/Istio for traffic shaping, CloudWatch alarms, EventBridge for step promotions.

    • OpenTelemetry for traces, Prometheus/AMP + Grafana for SLOs, S3 for shadow logs and diffs, W&B for deployment run metadata.

  • Outputs & Storage

    • Canary/shadow KPI reports, diff summaries, traces; stored in S3 and linked in Registry.

    • Updated Registry stage (candidate → production) once canary completes.


20) A/B Testing & Feature Flags

  • When it runs

    • After canary when we want outcome-level proof (business or safety proxy KPIs).

    • During experiments that tune thresholds, ensemble weights, or post-processing steps without retraining.

  • Inputs

    • Deployed production and candidate models (or the same model with different post-processing/threshold configs).

    • Experiment Plan: primary metric(s), success criteria, sample size/power calculation, guardrails (safety, latency).

  • Steps

    • Flag & cohort design

      • Define treatment arms (e.g., Threshold_A vs Threshold_B; Model_v1.8 vs v1.7).

      • Cohort users/vehicles by geography, time window, or fleet slice to minimize interference.

      • Implement with a config/flag service (DynamoDB or LaunchDarkly) read at request start; cache locally with a short TTL to avoid tight coupling to the flag server (as sketched below).
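
      • A minimal flag read with a short-TTL local cache, assuming a DynamoDB table named feature_flags keyed by flag_name; the table name, key schema, and flag document shape are illustrative:

```python
import time

import boto3

_TABLE = boto3.resource("dynamodb").Table("feature_flags")  # assumed table name and key schema
_CACHE: dict[str, tuple[float, dict]] = {}
_TTL_SECONDS = 30  # short TTL so a flag flip propagates quickly without a read on every request


def get_flag(flag_name: str) -> dict:
    """Return the flag document, serving from the local cache while it is still fresh."""
    now = time.monotonic()
    hit = _CACHE.get(flag_name)
    if hit and now - hit[0] < _TTL_SECONDS:
        return hit[1]
    item = _TABLE.get_item(Key={"flag_name": flag_name}).get("Item", {})
    _CACHE[flag_name] = (now, item)
    return item


# e.g., get_flag("detector_threshold_experiment") might return
# {"enabled": True, "arms": {"Threshold_A": 0.45, "Threshold_B": 0.55}}
```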

    • Routing & consistency

      • Sticky assignment per device/vehicle to avoid cross-over contamination (see the hashing sketch after this list).

      • Keep feature parity across arms except for the variable under test.
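
      • The sticky assignment can be a deterministic hash of device/vehicle ID plus experiment name, so no assignment state needs to be stored; the identifiers below are illustrative:

```python
import hashlib


def assign_arm(device_id: str, experiment: str, arms: list[str]) -> str:
    """Deterministically bucket a device into an arm: same device + experiment → same arm every time."""
    digest = hashlib.sha256(f"{experiment}:{device_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % len(arms)
    return arms[bucket]


# assign_arm("VIN123", "threshold_ab_w18", ["Threshold_A", "Threshold_B"]) is stable across requests,
# so a vehicle never crosses over between arms mid-experiment.
```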

    • Metrics capture

      • Online KPIs (success rate, false-positive interventions, latency p95) plus safety proxies (e.g., disagreement with planner, emergency brake proxy rates).

      • Aggregate with exact timestamps and cohort tags; anonymize IDs at the logger.

    • Statistical analysis

      • Sequential testing, or a fixed-horizon test with correction for multiple looks; pre-register the test to avoid p-hacking (a worked fixed-horizon example follows this list).

      • Guardrail checks: if any safety guardrail breaches, auto-terminate the test and revert flags.
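
      • For the fixed-horizon case, a hand-rolled two-proportion z-test on a binary primary metric (e.g., request success rate); a sequential design would swap in an alpha-spending boundary instead:

```python
import math


def two_proportion_ztest(success_a: int, n_a: int, success_b: int, n_b: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z statistic, two-sided p-value, normal approximation)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value


# z, p = two_proportion_ztest(success_control, n_control, success_treatment, n_treatment)
# Declare a winner only if p < the pre-registered alpha AND no safety guardrail has breached.
```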

    • Decision & rollout

      • Promote the winning config/model by flipping flags globally or per slice; persist the final config to inference_config.json in the next release cycle.

      • Archive experiment results (effect size, confidence intervals, power achieved) in the Registry.

  • Core AWS / Tooling

    • DynamoDB (flag store) or LaunchDarkly, AppConfig, EventBridge for change broadcasts.

    • Athena/Glue + QuickSight for analysis; W&B to attach experiment metadata to model version.

  • Outputs & Storage

    • ab_summary.json, dashboards, and final flag state in DynamoDB/AppConfig; linked to Registry and model card.


21) Edge Build & OTA Packaging (Vehicle/Device)

  • When it runs

    • After cloud serving passes canary and we’re ready to produce edge-optimized builds.

    • On periodic runtime refreshes (driver version change, security patches) or new hardware SKU support.

  • Inputs

    • Model engine(s) per target (TensorRT FP16/INT8) from #15, with calibration cache.

    • Edge runtime constraints: memory/compute budgets, power/thermal envelopes, allowable latency.

    • Device fleet manifest: hardware SKU mapping, minimum supported driver/SDK versions.

  • Steps

    • Cross-compile & optimize

      • Build per-SKU TensorRT plans with tactic replay and builder flags aligned to the target (e.g., Orin/Drive); see the build sketch after this list.

      • Fuse pre/post operations into CUDA plugins where beneficial; ensure zero-copy tensors across stages.

      • Run quantization sanity checks under on-device emulation (QAT-aware if available).
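
      • A compressed per-SKU build sketch assuming the TensorRT 8.x Python API; tactic replay, target-specific builder flags, and the INT8 calibrator that consumes the calibration cache are intentionally omitted:

```python
import tensorrt as trt  # assumes TensorRT 8.x Python bindings on the build host/emulator

LOGGER = trt.Logger(trt.Logger.WARNING)


def build_plan(onnx_path: str, plan_path: str, fp16: bool = True, int8: bool = False) -> None:
    """Build a serialized TensorRT engine ("plan") for one target SKU from an ONNX export."""
    builder = trt.Builder(LOGGER)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))
    config = builder.create_builder_config()
    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    if int8:
        config.set_flag(trt.BuilderFlag.INT8)  # calibrator wiring (reading the calibration cache) omitted
    plan = builder.build_serialized_network(network, config)
    if plan is None:
        raise RuntimeError("engine build failed")
    with open(plan_path, "wb") as f:
        f.write(plan)
```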

    • Runtime container/component

      • Package as Greengrass component or OCI image with minimal base; pin CUDA/TensorRT versions; bundle config.pbtxt and inference_config.json.

      • Include a watchdog and health endpoints; implement local batcher and thermal-aware throttling hooks.

    • Hardware-in-the-loop tests

      • On a bench rig with target SoC, run smoke suite: contract tests, p95 latency, memory ceiling, and thermal soak.

      • Determinism checks at fixed seeds; performance variance bounds under thermal throttling scenarios.

    • Security & compliance

      • Code-sign artifacts (AWS Signer or cosign) and produce a per-device update manifest with checksums (manifest sketch after this list).

      • SBOM attached; license and IP provenance validated.
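
      • A sketch of the per-SKU update manifest with SHA-256 checksums; the manifest schema is illustrative, and the resulting file is what gets signed and referenced by the OTA job document:

```python
import hashlib
import json
import pathlib


def build_update_manifest(bundle_dir: str, sku: str, version: str) -> dict:
    """Compute a SHA-256 digest per artifact in the bundle and assemble an update manifest."""
    artifacts = []
    for path in sorted(pathlib.Path(bundle_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            artifacts.append({
                "name": str(path.relative_to(bundle_dir)),
                "sha256": digest,
                "bytes": path.stat().st_size,
            })
    return {"sku": sku, "version": version, "artifacts": artifacts}


# Example (paths and version are placeholders):
# manifest = build_update_manifest("bundles/orin_fp16", "orin", "1.8.0")
# json.dump(manifest, open("manifest.json", "w"), indent=2)  # then sign the manifest itself
```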

    • Release assembly

      • Generate OTA bundle per cohort: artifact URIs, rollout policy, preconditions (battery level, vehicle parked, firmware min version), recovery strategy.

      • Publish metadata to the OTA job catalog (IoT Jobs/FleetWise campaign).

  • Core AWS / Tooling

    • AWS IoT Greengrass components, AWS IoT FleetWise or IoT Device Management for campaigns, S3 artifact buckets, Signer/KMS for signatures.

    • Bench automation: EKS runner or on-prem CI hardware with GitHub Actions/CodeBuild.

  • Outputs & Storage

    • Signed edge bundles per SKU, update manifests, and bench reports; stored in S3 and indexed in a campaign DB (DynamoDB/Registry).


22) OTA Delivery (Fleet Campaigns)

  • When it runs

    • After edge bundles are ready and approved by safety/security leads.

    • Coordinated with operations windows (time-of-day, depot/garage schedules).

  • Inputs

    • OTA bundles + manifests from #21.

    • Fleet segmentation (VIN/Device IDs by geography, customer, regulatory domain).

    • Rollout strategy: staged waves, max concurrent updates, stop conditions.

  • Steps

    • Campaign creation

      • Define cohorts and scheduling: wave sizes, blackout periods, and retries (see the campaign sketch after this list).

      • Preconditions: device online, battery ≥ X%, connected to Wi-Fi or certain carriers, parked/ignition state.
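
      • A sketch of staged-wave campaign creation with AWS IoT Jobs via boto3; the thing-group ARN, rates, and abort thresholds are placeholders, and the device-side preconditions (battery, parked state) live in the job document interpreted by the update agent:

```python
import boto3

iot = boto3.client("iot")

iot.create_job(
    jobId="ota-perception-1_8_0-wave1",
    targets=["arn:aws:iot:eu-west-1:123456789012:thinggroup/fleet-wave-1"],  # placeholder cohort
    documentSource="https://example-bucket.s3.amazonaws.com/ota/1.8.0/job-document.json",
    targetSelection="SNAPSHOT",
    jobExecutionsRolloutConfig={
        "exponentialRate": {                 # start slow, accelerate while devices keep succeeding
            "baseRatePerMinute": 5,
            "incrementFactor": 2.0,
            "rateIncreaseCriteria": {"numberOfSucceededThings": 50},
        },
        "maximumPerMinute": 200,             # cap on concurrent update starts
    },
    abortConfig={
        "criteriaList": [{
            "failureType": "FAILED",
            "action": "CANCEL",
            "thresholdPercentage": 5.0,      # stop condition: cancel the wave at >= 5% failures
            "minNumberOfExecutedThings": 100,
        }]
    },
    timeoutConfig={"inProgressTimeoutInMinutes": 60},
)
```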

    • Secure distribution

      • Ship via IoT Jobs with signed URIs; devices verify signature and checksum before install.

      • Bandwidth shaping: CDN/S3 transfer acceleration; per-region throttles to avoid network saturation.

    • Install & verify

      • Atomic swap: install to A/B partition or container tag; upon success, flip active pointer.

      • Health probes post-install: run a local inference self-test; send success beacon with version and basic KPIs.

      • On failure, auto-rollback to previous slot and report error codes.
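
      • A device-side sketch of the verify → install → self-test → flip sequence; the ota-agent CLI and its subcommands are hypothetical stand-ins for the real update agent:

```python
import hashlib
import subprocess


def install_bundle(bundle_path: str, expected_sha256: str) -> bool:
    """Verify the bundle, install to the inactive slot, self-test, then flip; roll back on any failure."""
    digest = hashlib.sha256(open(bundle_path, "rb").read()).hexdigest()
    if digest != expected_sha256:
        return report("failed", "checksum_mismatch")

    inactive = subprocess.run(["ota-agent", "inactive-slot"], capture_output=True, text=True).stdout.strip()
    if subprocess.run(["ota-agent", "install", bundle_path, "--slot", inactive]).returncode != 0:
        return report("failed", "install_error")

    # Local inference self-test against golden inputs before committing the switch.
    if subprocess.run(["ota-agent", "self-test", "--slot", inactive]).returncode != 0:
        subprocess.run(["ota-agent", "rollback"])
        return report("failed", "self_test_failed")

    subprocess.run(["ota-agent", "set-active", inactive])  # atomic pointer flip
    return report("succeeded", "ok")


def report(status: str, detail: str) -> bool:
    print({"status": status, "detail": detail})  # stand-in for the MQTT success/failure beacon
    return status == "succeeded"
```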

    • Monitoring & control

      • Live campaign dashboard: started/succeeded/failed, per-region rates, error categories.

      • Pause/resume and wave-size adjustments in real time; stop the campaign when failure rates cross pre-set thresholds.

    • Post-deploy soak

      • Collect in-field telemetry: latency/thermals, crash reports, edge-level OOD counters, and lightweight quality proxies (e.g., detection density by condition).

      • Feed anomalies to Offline Mining (#12).

  • Core AWS / Tooling

    • AWS IoT Jobs / FleetWise, IoT Core, CloudWatch, Athena for campaign analytics, QuickSight dashboards.

    • KMS for artifact encryption at rest; Private CA for device certificates if needed.

  • Outputs & Storage

    • Campaign status logs, per-device install receipts, post-install health beacons; all in S3/DynamoDB, surfaced in dashboards and linked to Registry.


23) Online Service Operations (Cloud Inference)

  • When it runs

    • Always-on for cloud inference endpoints (batch and/or online).

    • Scales elastically with traffic; responds to deployments and load events.

  • Inputs

    • Production model image(s), inference_config.json, and feature/metadata services endpoints.

    • SLOs/SLAs: availability, p95/p99 latency, error budgets, cost per 1k inferences.

  • Steps

    • Service layout

      • Ingress → Request validator (schema, auth) → Preprocessing → Model → Post-processing → Response (see the skeleton after this list).

      • Optional Feature Online Store (Feast with DynamoDB/Redis) for feature joins; aggressive caching + TTLs.
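
      • A skeleton of that request path assuming a FastAPI front end; the endpoint path, payload schema, and the three placeholder stages stand in for the real preprocessing, Triton/TorchServe call, and post-processing configured by inference_config.json (auth via OIDC/JWT is assumed to sit at the ingress/mesh, per the Security bullets):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class DetectionRequest(BaseModel):   # request validator: schema enforcement before any model work
    frame_id: str
    embedding: list[float]


def preprocess(req: DetectionRequest) -> list[float]:
    return req.embedding                                      # placeholder: normalization, feature joins


async def run_model(features: list[float]) -> dict:
    return {"scores": [0.9], "boxes": [[0, 0, 10, 10]]}       # placeholder: Triton/TorchServe call


def postprocess(raw: dict) -> dict:
    return raw                                                # placeholder: thresholds/NMS from config


@app.post("/v1/detect")
async def detect(req: DetectionRequest) -> dict:
    return postprocess(await run_model(preprocess(req)))
```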

    • Resilience & scaling

      • HPA/KEDA on GPU/CPU utilization, QPS, and queue depth; min pods to absorb cold starts.

      • Connection pools, timeouts, circuit breakers (Envoy/App Mesh) for downstream calls; backpressure via bounded queues (sketched after this list).

      • Multi-AZ, pod disruption budgets, surge capacity for rollouts.
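
      • Backpressure via a bounded queue, in a compressed asyncio sketch; the queue size and error type are illustrative, and the rejection would surface as a 429/503 at the API layer:

```python
import asyncio


class OverloadedError(RuntimeError):
    """Raised when the service is saturated; maps to HTTP 429/503 at the API layer."""


REQUEST_QUEUE: asyncio.Queue = asyncio.Queue(maxsize=256)  # the bound is the backpressure signal


def enqueue_request(payload: dict) -> None:
    """Admit work only while the queue has room; shed load early instead of letting latency grow unbounded."""
    try:
        REQUEST_QUEUE.put_nowait(payload)
    except asyncio.QueueFull:
        raise OverloadedError("inference queue full, retry with backoff")

# Worker tasks drain the queue with `await REQUEST_QUEUE.get()` at the rate the GPU can sustain.
```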

    • Performance engineering

      • Pin NUMA/GPU affinity; use TensorRT/Triton dynamic batching with a carefully tuned maximum queue delay.

      • Pre-allocate memory pools; enable CUDA graph capture where applicable.

      • Async I/O; zero-copy tensors; avoid per-request allocations.

    • Security

      • mTLS in-mesh; OIDC/JWT at edge; fine-grained IAM for S3/feature store access.

      • WAF rules for ingress, request size caps, schema enforcement, and PII redaction at loggers.

    • Cost controls

      • Right-size instance types, spot for batch, on-demand for online; autoscaling floors/ceilings.

      • Periodic throughput/latency bin-packing reviews and mixed-precision tuning to reduce GPU milliseconds per inference.

    • Operational playbooks

      • Runbooks for incident classes (latency spike, elevated 5xx, GPU OOM, feature store timeouts).

      • Synthetic probes and golden queries; regular failover/fire-drill practices.

  • Core AWS / Tooling

    • EKS with Triton/TorchServe, ALB/NLB, App Mesh/Istio, Feast (DynamoDB/ElastiCache Redis), CloudWatch, SQS/Kinesis for async/batch, SageMaker Endpoints where managed is preferred.

  • Outputs & Storage

    • Live responses (API), structured logs, metrics, traces, and inference audit records (S3 with lifecycle policies).


24) Observability (Telemetry, Drift, Explainability)

  • When it runs

    • Continuously, from the moment traffic reaches shadow/canary through long-term production.

    • On scheduled jobs for deeper drift/quality analysis.

  • Inputs

    • Request/response telemetry, model outputs, confidence histograms, selective ground truth (from human QA or auto-label confirmations), and reference statistics from #16.

  • Steps

    • Metrics

      • System: QPS, p50/p95/p99 latency, GPU/CPU/memory utilization, queue depth, error rates (4xx/5xx).

      • Model: per-class score distributions, calibration ECE (see the sketch after this list), acceptance/abstention rates, novelty counters (OOD flags).

      • Data: feature value histograms, missingness, input schema drift.
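
      • A hand-rolled expected calibration error (ECE) over equal-width confidence bins, computed from production confidences and whatever selective ground truth (human QA or auto-label confirmations) is available:

```python
import numpy as np


def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 15) -> float:
    """ECE: traffic-weighted gap between mean confidence and observed accuracy per confidence bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return float(ece)


# confidences: top-1 scores from telemetry; correct: 0/1 agreement with QA or confirmed auto-labels.
```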

    • Logs

      • Structured, PII-redacted request/response logs; correlation IDs to join across services.

      • Failure bundles: auto-capture payload + model state for 5xx or large diffs; store to S3 with strict retention.

    • Traces

      • OpenTelemetry spans from ingress through model to downstream stores; trace sampling biased toward tail latency and errors.

    • Dashboards & alerts

      • Grafana/QuickSight boards by SLO tiers; CloudWatch alerts on SLO/SLA breaches, drift thresholds, and OOD spikes.

      • PagerDuty/Slack routes with severity mapping; include runbooks and auto-remediation hooks (e.g., scale-up, switch to previous model, or temporary rule override).

    • Drift & quality analytics

      • Daily/weekly jobs (Airflow) that run Evidently against rolling windows: covariate drift, concept drift (where labels are available), PSI/KS tests per feature and per slice (a PSI sketch follows this list).

      • Canary sentinels: raise alerts early for historically fragile slices (e.g., night + rain + pedestrian).
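
      • A hand-rolled population stability index (PSI) for a single feature, comparing a reference window (statistics from #16) against a rolling production window; Evidently computes the same family of statistics, this just shows the arithmetic:

```python
import numpy as np


def population_stability_index(reference: np.ndarray, current: np.ndarray, n_bins: int = 10) -> float:
    """PSI between a reference distribution and a current window for one numeric feature."""
    edges = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1))
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    # Clip current values into the reference range so out-of-range values land in the edge bins.
    cur_frac = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)  # avoid log(0) on empty bins
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


# Common rule of thumb in drift runbooks: PSI < 0.1 stable, 0.1–0.25 investigate, > 0.25 significant drift.
```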

    • Explainability

      • Lightweight SHAP-on-sample or gradient-based saliency for a small percentage of requests in staging; store as artifacts for model debugging.

      • Maintain live model-card sections: slipping data slices, observed biases, mitigations taken.

    • Feedback loops

      • Emit curated failure/novelty cohorts to Offline Mining (#12) with descriptors and query templates.

      • Track time-to-mitigation and defect escape rate as MLOps KPIs.

  • Core AWS / Tooling

    • OpenTelemetry Collector, AMP/Prometheus, Grafana, CloudWatch (metrics/logs), Athena/Glue for large-scale log queries, Evidently for drift, W&B for attaching production metrics to model versions.

  • Outputs & Storage

    • Time-series metrics, traces, and logs in AMP/CloudWatch + S3 data lake; drift reports; incident tickets; curated error cohorts for the next training loop.