Reliability, Capacity, Maps

33) Incident RCA (Root Cause Analysis) — serving & pipeline reliability

  • Trigger

    • Any of: production SLO/SLA breach (latency, error rate), safety predicate trip, anomaly alert from monitoring/drift (#25), simulation regression from pre-prod, canary rollback, repeated pipeline failures, or on-call/PagerDuty page.

    • Scheduled post-incident review within 72 hours for any SEV-1/SEV-2.

  • Inputs

    • Telemetry & traces: CloudWatch metrics/logs, OpenSearch logs, Prometheus/Grafana dashboards, AWS X-Ray/Jaeger traces, NVML/DCGM GPU telemetry, feature-store freshness metrics.

    • Change context: Deployment events (Git SHA, container digest, config/flag deltas), model registry history (candidate → staging → prod), feature definitions, safety predicate versions.

    • Data signals: Request samples, mispredictions flagged by online validators, user-reported issues, shadow-mode diffs, recent drift reports.

    • Artifacts: Last successful build/run logs, canary analysis reports, A/B analysis, W&B run metadata.

  • Steps (with testing/validation)

    • Immediate triage (T+0 to T+30min)

      • Declare incident, assign IC (incident commander) and scribe; set severity; start timeline.

      • Freeze risky changes (deployment lock) and engage runbooks (safe rollback primitives).

      • Capture context snapshot automatically: last N deploys, feature-flag changes, top error signatures, p99/p99.9 latency delta, affected tenants/regions.
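
      A minimal snapshot sketch, assuming boto3 credentials and CloudWatch ExtendedStatistics for the latency percentile; the application, deployment group, namespace, and metric names are placeholders, not the production snapshotter:

      ```python
      import datetime as dt

      import boto3

      codedeploy = boto3.client("codedeploy")
      cloudwatch = boto3.client("cloudwatch")

      def context_snapshot(app: str, group: str, namespace: str, service: str) -> dict:
          """Collect recent deploys and the p99 latency delta for the incident timeline."""
          now = dt.datetime.utcnow()
          # Last few deployments for the affected application/deployment group.
          deploys = codedeploy.list_deployments(
              applicationName=app, deploymentGroupName=group,
              createTimeRange={"start": now - dt.timedelta(days=2), "end": now},
          )["deployments"][:5]

          def p99(start, end):
              points = cloudwatch.get_metric_statistics(
                  Namespace=namespace, MetricName="Latency",
                  Dimensions=[{"Name": "Service", "Value": service}],
                  StartTime=start, EndTime=end, Period=3600,
                  ExtendedStatistics=["p99"],
              )["Datapoints"]
              return points[0]["ExtendedStatistics"]["p99"] if points else None

          # Compare the last hour against the hour before it.
          current = p99(now - dt.timedelta(hours=1), now)
          baseline = p99(now - dt.timedelta(hours=2), now - dt.timedelta(hours=1))
          return {"recent_deployments": deploys, "p99_now": current, "p99_baseline": baseline}
      ```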

    • Stabilize

      • Execute rollback or traffic shift to last-good model/service (canary controller); validate health via smoke tests and golden synthetic checks (known-request replay must pass).

      • If the feature store or a data pipeline is the culprit: fail over to a degraded mode (fallback features, cached responses, or a heuristic policy).

    • Data capture for forensics

      • Quarantine a redacted sample of failing requests, feature vectors, and predictions (S3 governance/incidents/<id>/samples/); include traces, safety decisions, and model confidences.

      • Preserve relevant logs via export (CloudWatch → S3), pin dashboards.

    • Hypothesis-driven analysis

      • Change correlation: identify the first bad timestamp; align it with any change (code, model, config, data). Use change-point detection on KPIs to narrow the window (see the sketch after this list).

      • Reproduction: re-run failing requests in a deterministic container with the exact model+flags; compare to last-good; run side-by-side diff.

      • Dependency check: upstream (feature freshness, schema drift), downstream (clients, map service).

      • Model-centric probes: calibration curves on failing slice, confusion matrix deltas, SHAP drift vs. baseline, feature importance changes.
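
      For the change-correlation step, a minimal change-point sketch, assuming the ruptures package is available and the KPI is a time-indexed pandas Series (the penalty value and names are illustrative):

      ```python
      import pandas as pd
      import ruptures as rpt  # assumed available; any change-point method works here

      def first_bad_time(kpi: pd.Series, penalty: float = 10.0):
          """Return the timestamp of the first detected shift in a KPI series.

          `kpi` is e.g. a per-minute error rate; PELT returns breakpoint indices
          ending with the series length, so the trailing index is dropped.
          """
          signal = kpi.to_numpy().reshape(-1, 1)
          breakpoints = rpt.Pelt(model="rbf").fit(signal).predict(pen=penalty)[:-1]
          return kpi.index[breakpoints[0]] if breakpoints else None

      # Usage: align the detected shift with the deploy/flag log, e.g.
      # start = first_bad_time(error_rate)
      # suspects = deploys[deploys.ts >= start - pd.Timedelta("10min")]
      ```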

    • Root cause determination

      • Classify: Code defect, Model regression, Config/flag error, Data/feature drift, Infra capacity (noisy neighbor, GPU ECC errors, throttling), Map/trigger policy.

      • Quantify blast radius (requests, segments, geos), cost impact, safety impact.

    • Corrective & Preventive Actions (CAPA)

      • Immediate fix (patch, hotfix, config revert), plus long-term guardrail (test, monitor, lint rule, rollout constraint).

      • Create issue tickets with owners & due dates. Integrate with CI gates (e.g., block deploy if schema version mismatch).

    • Validation

      • Post-fix replay: reproduce pre-incident failing cases → verify pass; run targeted load to confirm capacity headroom.

      • Add new regression tests (golden scenario) to simulation and eval suites; require pass before future promotions.
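
      A sketch of a golden-scenario regression test, assuming a local `predict()` entry point and sample cases exported during forensics; the module path, directory, and tolerance field are placeholders:

      ```python
      import json
      from pathlib import Path

      import pytest

      from serving.model import predict  # hypothetical local entry point

      GOLDEN_DIR = Path("tests/golden/incident_samples")  # placeholder location

      @pytest.mark.parametrize("case_file", sorted(GOLDEN_DIR.glob("*.json")))
      def test_incident_golden_case(case_file):
          """Replay a request that failed during the incident and pin the fixed behavior."""
          case = json.loads(case_file.read_text())
          output = predict(case["request"])
          assert output["label"] == case["expected"]["label"]  # exact categorical match
          assert abs(output["score"] - case["expected"]["score"]) <= case.get("tol", 1e-3)
      ```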

    • Postmortem

      • Write blameless RCA using template (5 Whys, fishbone); include timeline, contributing factors, detection gaps, MTTR/MTTD.

      • Review in weekly reliability review; track action items to closure.

  • Core Tooling/Services

    • PagerDuty/Incident.io, CloudWatch/Logs/Alarms, OpenSearch, Prometheus/Grafana, AWS X-Ray/Jaeger, AWS CodeDeploy events, Feature-store metrics, W&B, Athena/QuickSight for KPI drilldowns, Jupyter for ad-hoc analysis.

  • Outputs & Storage

    • s3://…/governance/incidents/<incident_id>/ (samples, dashboards, reports), RCA document (Markdown/PDF), Jira tickets, updated runbooks & tests, promotion gate updates.


34) Experiment GC (Garbage Collection) — artifacts, indices, and datasets hygiene

  • Trigger

    • Weekly scheduled GC; low free space alert; budget threshold exceeded for storage/egress; repo archival; project sunset tag.

    • Manual quarantine → purge for compromised or incorrect datasets.

  • Inputs

    • Inventory sources: S3 Inventory (per-bucket), Glue/Athena tables, Iceberg snapshots, DVC remotes & tags, W&B runs/artifacts, ECR images/tags, OpenSearch indices, EMR logs, FSx/Lustre volumes.

    • Usage signals: Access logs (S3/Athena), last-read timestamps from index services, registry in-use pointers (current prod/staging models, golden datasets).

    • Policies: retention_policy.yaml (per class: Bronze/Silver/Gold), legal holds, exception lists, minimal-keep (e.g., N best runs per model).

  • Steps (with testing/validation)

    • Discovery & reachability

      • Build lineage graph: artifact → consumers (models, datasets, docs). Anything “unreached” and older than policy horizon becomes a candidate.

      • Join with usage stats (no access in ≥N days) and cost (size × storage-class rate).
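
      A reachability sketch with networkx, assuming an edge list of (artifact, consumer) pairs and per-URI metadata (created_at, last_access, bytes); the field names are illustrative:

      ```python
      import datetime as dt

      import networkx as nx

      def gc_candidates(edges, metadata, protected_roots, horizon_days=90, min_idle_days=60):
          """Artifacts not reachable from any protected root, older than the horizon, and unused."""
          graph = nx.DiGraph(edges)                # artifact -> consumer edges
          reachable = set(protected_roots)
          for root in protected_roots:
              if root in graph:
                  reachable |= nx.ancestors(graph, root)  # everything a protected root depends on
          now = dt.datetime.utcnow()
          candidates = []
          for uri, meta in metadata.items():
              age_days = (now - meta["created_at"]).days
              idle_days = (now - meta["last_access"]).days
              if uri not in reachable and age_days > horizon_days and idle_days >= min_idle_days:
                  candidates.append({"uri": uri, "bytes": meta["bytes"]})
          return candidates
      ```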

    • Protection rules

      • Always protect: models referenced by registry channels (prod, canary), golden datasets, signed model cards/datasheets, compliance snapshots, incident forensics.

      • Legal holds override GC; DSR (data subject request) erasure queues take precedence.

    • Action plan

      • S3: batch-delete candidates; transition keep-but-cold objects to Glacier.

      • Iceberg: expire snapshots older than the horizon; rewrite manifests; vacuum orphan files (see the sketch after this list).

      • W&B: delete old runs/artifacts except top-K per sweep by metric; export summary CSV first.

      • ECR: apply lifecycle policy (keep the last M images per repo and anything tagged stable; delete dangling layers); scan for large base images to dedupe.

      • OpenSearch: apply ISM (Index State Mgmt) to roll over & delete old indices, or shrink & forcemerge if kept.

      • Logs: compress EMR/YARN logs; purge older than horizon.
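
      For the Iceberg step, a maintenance sketch using Spark SQL procedures on EMR; the catalog, table, horizon, and retain count are placeholders, and exact procedure arguments depend on the Iceberg version in use:

      ```python
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("iceberg-gc").getOrCreate()

      CATALOG = "glue_catalog"           # placeholder catalog name
      TABLE = "fleet.events"             # placeholder db.table
      HORIZON = "2024-01-01 00:00:00"    # placeholder retention horizon

      # Expire old snapshots but keep a few for rollback.
      spark.sql(f"""
        CALL {CATALOG}.system.expire_snapshots(
          table => '{TABLE}', older_than => TIMESTAMP '{HORIZON}', retain_last => 5)
      """)

      # Compact metadata, then drop files no snapshot references any more.
      spark.sql(f"CALL {CATALOG}.system.rewrite_manifests(table => '{TABLE}')")
      spark.sql(f"""
        CALL {CATALOG}.system.remove_orphan_files(
          table => '{TABLE}', older_than => TIMESTAMP '{HORIZON}')
      """)
      ```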

    • Safety checks

      • Dry-run report (bytes to free, candidates count) → human approve for destructive steps.

      • Referential integrity check: no model or dataset manifest points at an about-to-delete URI.

      • Restore drill: pick a random 1% of objects transitioned to Glacier and verify they restore within SLA.
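
      A restore-drill sketch with boto3, assuming the sampled objects sit in a Glacier storage class; the bucket, tier, and sampling fraction are illustrative:

      ```python
      import random

      import boto3

      s3 = boto3.client("s3")

      def request_restores(bucket: str, keys: list, sample_frac: float = 0.01) -> list:
          """Request restores for a random sample and return the sampled keys for follow-up."""
          sample = random.sample(keys, max(1, int(len(keys) * sample_frac)))
          for key in sample:
              s3.restore_object(
                  Bucket=bucket, Key=key,
                  RestoreRequest={"Days": 1, "GlacierJobParameters": {"Tier": "Standard"}},
              )
          return sample

      def restore_completed(bucket: str, key: str) -> bool:
          """Poll later: the Restore header reports ongoing-request="false" once the copy is ready."""
          head = s3.head_object(Bucket=bucket, Key=key)
          return 'ongoing-request="false"' in head.get("Restore", "")
      ```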

    • Execution

      • Orchestrate via Airflow/Step Functions with idempotent tasks and checkpointing; track failures & retries.

    • Validation

      • Post-GC audit: Athena reconciliation (sum sizes by class), check that dashboards & registry remain healthy.

      • Alert on “unexpected reference” errors if any job fails due to a missing artifact.

  • Core Tooling/Services

    • S3 Inventory/Batch Ops/Glacier, Glue/Athena, EMR Spark for Iceberg maintenance, W&B API, ECR lifecycle, OpenSearch ISM/Curator, Airflow/Step Functions, Jira for approval workflow.

  • Outputs & Storage

    • s3://…/governance/gc/reports/<date>.json, deletion manifests, restored-object test logs, storage savings dashboard; policy & exception registry in Git.


35) GPU Capacity & Queues — scheduling, reservations, autoscaling, fairshare

  • Trigger

    • Continuous: new training/HPO workloads submitted (Airflow/W&B Sweeps); scheduled capacity planning (weekly); queue-depth or wait-time SLO breach; quota changes; new model roadmap requiring different accelerators.

  • Inputs

    • Historical job metadata (GPU type/count, wall-clock, throughput), cluster utilization (DCGM, Prometheus), job queue stats (length, age), SageMaker job history, instance pricing (on-demand/spot), Savings Plans/RIs, project priorities, SLAs (e.g., “critical sweep completes in ≤48h”), dataset size and required I/O.

    • Node pool definitions (A100/H100 vs T4/L4; CPU-only for preprocessing); storage bandwidth (FSx/Lustre, S3).

  • Steps (with testing/validation)

    • Demand forecasting

      • A time-series model predicts GPU-hours by queue for the next 1–4 weeks; identify peak weeks and confidence intervals (see the sketch after this list).

      • Scenario overlay: planned HPO waves, retrain cadence tied to drift alarms.
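
      A minimal forecasting sketch for this step, assuming a daily GPU-hours series per queue and statsmodels Holt-Winters; the weekly seasonality and crude residual band are illustrative, and any forecaster with proper intervals can replace it:

      ```python
      import pandas as pd
      from statsmodels.tsa.holtwinters import ExponentialSmoothing

      def forecast_gpu_hours(daily_gpu_hours: pd.Series, horizon_days: int = 28) -> pd.DataFrame:
          """Forecast daily GPU-hours with an additive-trend, weekly-seasonal model."""
          fit = ExponentialSmoothing(
              daily_gpu_hours, trend="add", seasonal="add", seasonal_periods=7
          ).fit()
          forecast = fit.forecast(horizon_days)
          resid_std = (daily_gpu_hours - fit.fittedvalues).std()  # crude uncertainty band
          return pd.DataFrame({
              "forecast": forecast,
              "lo": forecast - 1.96 * resid_std,
              "hi": forecast + 1.96 * resid_std,
          })
      ```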

    • Capacity planning

      • Map demand to node pools (labels/taints): gpu=A100, gpu=L4, cpu-prep.

      • Purchase/adjust Savings Plans; set SageMaker managed spot %; reserve high-risk windows (e.g., releases).

      • Validate data-path throughput: if bottlenecked on I/O (S3→FSx), increase FSx/Lustre capacity, pre-stage datasets.

    • Queueing & policy

      • Kueue/Volcano/Slurm or SageMaker priorities: project quotas (GPU-hour budgets), fairshare weights, preemption rules for interruptible sweeps.

      • Admission controller enforces budget tags & approvals for large jobs; it rejects jobs that exceed the per-run budget unless they carry a SpendOverride label.
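
      A sketch of the budget admission check in plain Python; the job fields, per-GPU-hour rates, per-run budgets, and the SpendOverride label are assumptions about how jobs are annotated, not an existing controller API:

      ```python
      from dataclasses import dataclass, field

      @dataclass
      class JobRequest:
          project: str
          gpu_type: str
          gpu_count: int
          est_hours: float
          labels: dict = field(default_factory=dict)

      GPU_RATE = {"a100": 4.10, "h100": 8.00, "l4": 0.80}            # placeholder $/GPU-hour
      PER_RUN_BUDGET = {"perception": 2000.0, "prediction": 1000.0}  # placeholder budgets

      def admit(job: JobRequest) -> tuple:
          """Admit only if the estimated cost fits the per-run budget, unless SpendOverride is approved."""
          est_cost = GPU_RATE[job.gpu_type] * job.gpu_count * job.est_hours
          budget = PER_RUN_BUDGET.get(job.project, 0.0)
          if est_cost <= budget:
              return True, f"ok: est ${est_cost:,.0f} within ${budget:,.0f}"
          if job.labels.get("SpendOverride") == "approved":
              return True, f"override: est ${est_cost:,.0f} exceeds ${budget:,.0f}"
          return False, f"rejected: est ${est_cost:,.0f} exceeds ${budget:,.0f}"
      ```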

    • Autoscaling & bin-packing

      • Karpenter/Cluster Autoscaler spins up node groups on demand; prefer bin-packing (anti-fragmentation) and GPU-topology-aware placement (NVLink domains).

      • For spot, enable checkpointing every N minutes to S3/FSx; preemption handler requeues gracefully.
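
      A sketch of spot-friendly checkpointing in a PyTorch training loop, assuming the node delivers SIGTERM ahead of preemption and checkpoints sync to S3; the paths, interval, and loss are placeholders:

      ```python
      import signal
      import time

      import boto3
      import torch

      CKPT_PATH = "/local/ckpt.pt"                            # placeholder local path
      CKPT_BUCKET, CKPT_KEY = "train-ckpts", "run42/ckpt.pt"  # placeholder S3 location
      CKPT_EVERY_S = 600                                      # checkpoint every N minutes

      _preempted = False

      def _handle_sigterm(signum, frame):
          global _preempted
          _preempted = True

      signal.signal(signal.SIGTERM, _handle_sigterm)

      def save_checkpoint(model, optimizer, step):
          torch.save({"model": model.state_dict(), "optim": optimizer.state_dict(), "step": step},
                     CKPT_PATH)
          boto3.client("s3").upload_file(CKPT_PATH, CKPT_BUCKET, CKPT_KEY)

      def train(model, optimizer, batches):
          last_ckpt = time.monotonic()
          for step, batch in enumerate(batches):
              loss = model(batch).mean()        # stand-in for the real loss
              loss.backward()
              optimizer.step()
              optimizer.zero_grad()
              if _preempted or time.monotonic() - last_ckpt > CKPT_EVERY_S:
                  save_checkpoint(model, optimizer, step)
                  last_ckpt = time.monotonic()
                  if _preempted:                # exit cleanly so the scheduler requeues the job
                      break
      ```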

    • Placement & topology

      • Enable NCCL topology hints; rack-aware placement; multi-NIC configs for multi-node DDP.

      • Node-affinity to place data-heavy jobs close to FSx/Lustre; avoid overloading S3 (throttle prefetchers).

    • SLO management

      • SLOs: median wait time, p95 wait time, train throughput (img/s), GPU util %, cost/GPU-hour. Alert on breach.

      • If queue depth persists: raise the autoscaling upper bound (if budget allows) or shed load (pause low-priority sweeps).

    • Validation

      • Load simulation: synthetic job submissions to validate queue fairness & SLOs.

      • Failover drill: zone outage simulation; ensure jobs reschedule; checkpoints recover.

      • Throughput tests: for each node pool, run standard training microbenchmarks; track regression over time.
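
      A minimal throughput microbenchmark sketch (PyTorch on synthetic data); the reference model, batch size, and input shape are placeholders, and real runs should use the node pool's standard training image:

      ```python
      import time

      import torch
      import torchvision

      def images_per_second(batch_size=64, steps=50, warmup=10, device="cuda"):
          """Forward+backward throughput on synthetic ImageNet-shaped batches."""
          model = torchvision.models.resnet50().to(device)
          optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
          loss_fn = torch.nn.CrossEntropyLoss()
          x = torch.randn(batch_size, 3, 224, 224, device=device)
          y = torch.randint(0, 1000, (batch_size,), device=device)
          for step in range(warmup + steps):
              if step == warmup:
                  torch.cuda.synchronize()
                  start = time.perf_counter()
              optimizer.zero_grad()
              loss_fn(model(x), y).backward()
              optimizer.step()
          torch.cuda.synchronize()
          return batch_size * steps / (time.perf_counter() - start)
      ```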

    • Governance

      • Monthly capacity review: showback/chargeback per team; refresh or retire underused images; deprecate old node pools.

  • Core Tooling/Services

    • EKS (Kubernetes), Karpenter/Cluster Autoscaler, NVIDIA DCGM exporter + Prometheus/Grafana, Kueue/Volcano/Slurm or SageMaker managed training, FSx for Lustre, S3, Ray for distributed sweeps, EventBridge/Lambda for policy actions, Cost Explorer API.

  • Outputs & Storage

    • Capacity plan (capacity_plan_<month>.md), queue configs (ConfigMaps/Slurm partitions), utilization dashboards, Savings Plan decisions, checkpoint manifests, SLA reports; all versioned in Git and logs in S3 governance.


36) Map/Trigger Policy Update — updating HD map layers & fleet trigger definitions

  • Trigger

    • Periodic map refresh (e.g., weekly); policy change from safety team; evidence from scenario mining (#8/#12) showing gaps; external roadway updates (work zones, new speed limits); spike in false positives/negatives for specific trigger definitions.

  • Inputs

    • Map data deltas (internal mapping pipeline outputs, vendor feeds, or OSM diffs), lane topology changes, speed limit updates, construction zones, geofences.

    • Trigger performance: alert rates, precision/recall of triggers (e.g., hard-brake, disengagement proximity), geographic breakdowns.

    • Scenario feedback: mined error clusters, audit results from Human QA (#10), simulation outcomes from drive replay/closed-loop (#17).

    • Constraints: ODD boundaries, regulatory requirements, privacy constraints.

  • Steps (with testing/validation)

    • Propose & author changes

      • Author policy YAML (semver): includes map layer updates (road attributes, closures) and trigger definitions (thresholds, state machines, OOD/uncertainty bounds, geofenced overrides).

      • Attach rationale, expected impact (alerts/day, coverage gain), and risk assessment.

    • Offline evaluation

      • Backtest on the last N weeks of logs: compute precision/recall, alert volume, and regional heatmaps; confirm reduced false alarms or improved recall on target scenarios (see the sketch after this list).

      • Counterfactuals in sim: vary thresholds, verify safety balance (miss rate vs. nuisance rate).

      • Verify lat/long accuracy (map matching via OSRM/Valhalla); ensure no regressions at map tile boundaries.
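
      For the backtest, a sketch over labeled log events, assuming a pandas frame with a trigger signal, a QA ground-truth column, and a day column; the column names, hard-brake example, and threshold grid are illustrative:

      ```python
      import pandas as pd

      def backtest_thresholds(events: pd.DataFrame, thresholds) -> pd.DataFrame:
          """Precision/recall and daily alert volume for candidate hard-brake thresholds.

          `events` needs: decel_g (peak deceleration), is_true_event (QA label), day.
          """
          rows = []
          n_days = events["day"].nunique()
          for thr in thresholds:
              fired = events["decel_g"] >= thr
              tp = (fired & events["is_true_event"]).sum()
              fp = (fired & ~events["is_true_event"]).sum()
              fn = (~fired & events["is_true_event"]).sum()
              rows.append({
                  "threshold": thr,
                  "precision": tp / (tp + fp) if tp + fp else float("nan"),
                  "recall": tp / (tp + fn) if tp + fn else float("nan"),
                  "alerts_per_day": fired.sum() / n_days,
              })
          return pd.DataFrame(rows)

      # Example: backtest_thresholds(events, thresholds=[0.35, 0.40, 0.45, 0.50])
      ```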

    • Schema & consistency checks

      • Validate the policy schema (JSON Schema); enforce allowed ranges; check for conflicting overrides across geofences (see the sketch after this list).

      • Ensure version compatibility with the edge agent and cloud detectors (backward/forward compatible).
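
      A sketch of the schema and consistency checks, assuming the policy is YAML and validated with the jsonschema package; the trimmed schema and field names are illustrative, not the real policy schema:

      ```python
      from pathlib import Path

      import yaml
      from jsonschema import validate  # raises ValidationError on schema violations

      POLICY_SCHEMA = {  # trimmed, illustrative schema
          "type": "object",
          "required": ["version", "triggers"],
          "properties": {
              "version": {"type": "string", "pattern": r"^\d+\.\d+\.\d+$"},  # semver
              "triggers": {
                  "type": "array",
                  "items": {
                      "type": "object",
                      "required": ["name", "threshold"],
                      "properties": {
                          "name": {"type": "string"},
                          "threshold": {"type": "number", "minimum": 0.0, "maximum": 1.0},
                          "geofence": {"type": "string"},
                      },
                  },
              },
          },
      }

      def check_policy(path: str) -> dict:
          policy = yaml.safe_load(Path(path).read_text())
          validate(instance=policy, schema=POLICY_SCHEMA)
          seen = set()
          for trig in policy["triggers"]:                   # conflicting geofence overrides
              key = (trig["name"], trig.get("geofence", "global"))
              if key in seen:
                  raise ValueError(f"conflicting override for {key}")
              seen.add(key)
          return policy
      ```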

    • Security & signing

      • Sign policy bundle with KMS; attach in-toto attestation; generate SBOM for any included logic plugins.
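
      A signing sketch with AWS KMS, assuming an asymmetric signing key; the key alias and algorithm are placeholders, and the in-toto attestation/SBOM steps are omitted:

      ```python
      import hashlib
      from pathlib import Path

      import boto3

      kms = boto3.client("kms")
      KEY_ID = "alias/policy-signing"  # placeholder key alias

      def sign_bundle(bundle_path: str) -> bytes:
          """Sign the SHA-256 digest of the policy bundle; publish the signature beside it."""
          digest = hashlib.sha256(Path(bundle_path).read_bytes()).digest()
          resp = kms.sign(KeyId=KEY_ID, Message=digest, MessageType="DIGEST",
                          SigningAlgorithm="RSASSA_PSS_SHA_256")
          return resp["Signature"]

      def verify_bundle(bundle_path: str, signature: bytes) -> bool:
          digest = hashlib.sha256(Path(bundle_path).read_bytes()).digest()
          resp = kms.verify(KeyId=KEY_ID, Message=digest, MessageType="DIGEST",
                            SigningAlgorithm="RSASSA_PSS_SHA_256", Signature=signature)
          return resp["SignatureValid"]
      ```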

    • Staged rollout

      • Publish to the S3 policy bucket (immutable path per version); create an IoT Jobs/Greengrass deployment targeting a canary cohort (a small % of the fleet or a specific region); see the sketch after this list.

      • Enable feature flags: trigger_policy.version, map_layer.version, with kill-switch.
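
      A canary OTA sketch with AWS IoT Jobs, assuming the signed bundle is already in the policy bucket and a thing group holds the canary cohort; the bucket, thing-group name, job document, and rollout rate are illustrative:

      ```python
      import json

      import boto3

      iot = boto3.client("iot")

      def deploy_policy_canary(version: str, account_id: str, region: str = "us-east-1"):
          """Create an IoT job pointing the canary cohort at a specific signed policy version."""
          document = json.dumps({
              "operation": "update_trigger_policy",
              "policy_s3_uri": f"s3://fleet-policy/trigger_policy/{version}/policy.yaml",
              "version": version,
          })
          return iot.create_job(
              jobId=f"trigger-policy-{version.replace('.', '-')}",
              targets=[f"arn:aws:iot:{region}:{account_id}:thinggroup/canary-cohort"],
              document=document,
              targetSelection="SNAPSHOT",
              jobExecutionsRolloutConfig={"maximumPerMinute": 50},
          )
      ```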

    • Canary monitoring

      • Watch alert rates, map match errors, CPU/mem impact on edge, OTA download success rates, and any safety predicate changes; compare to control cohort.

      • Roll forward if within guardrails; roll back on anomalies (auto if thresholds breached).

    • Full rollout & enforcement

      • Gradually increase cohort; record final adoption; ensure backend parsers accept new tags/fields; update catalog ETL if schema changed.

    • Validation

      • Weekly policy audit: recompute metrics; ensure no drift between edge and cloud policy versions; verify replay on key scenarios passes.

      • Documentation: update policy change log, trigger explanations for labelers/engineers.

  • Core Tooling/Services

    • Geo stack: OSRM/Valhalla, GeoPandas/Shapely; data lake (Athena/OpenSearch) for backtests; AWS Location Service (optional); IoT Jobs/Greengrass for OTA; S3 static policy hosting; KMS for signing; Feature-flag service; QuickSight/Mapbox for heatmaps.

  • Outputs & Storage

    • s3://…/policy/map/<semver>/bundle.tar.gz, trigger_policy/<semver>/policy.yaml (signed), impact report (before/after metrics, maps), rollout dashboard, change-log; links recorded in registry & internal portal.