# Reliability, Capacity, Maps

### 33) Incident RCA (Root Cause Analysis) — serving & pipeline reliability

* **Trigger**
  * Any of: production SLO/SLA breach (latency, error rate), safety predicate trip, anomaly alert from monitoring/drift (#25), simulation regression from pre-prod, canary rollback, repeated pipeline failures, or on-call/PagerDuty page.
  * Scheduled post-incident review within 72 hours for any SEV-1/SEV-2.
* **Inputs**
  * **Telemetry & traces:** CloudWatch metrics/logs, OpenSearch logs, Prometheus/Grafana dashboards, AWS X-Ray/Jaeger traces, NVML/DCGM GPU telemetry, feature-store freshness metrics.
  * **Change context:** Deployment events (Git SHA, container digest, config/flag deltas), model registry history (candidate → staging → prod), feature definitions, safety predicate versions.
  * **Data signals:** Request samples, mispredictions flagged by online validators, user-reported issues, shadow-mode diffs, recent drift reports.
  * **Artifacts:** Last successful build/run logs, canary analysis reports, A/B analysis, W&B run metadata.
* **Steps (with testing/validation)**
  * **Immediate triage (T+0 to T+30 min)**
    * Declare the incident, assign an IC (incident commander) and scribe; set severity; start the timeline.
    * Freeze risky changes (deployment lock) and **engage runbooks** (safe rollback primitives).
    * Capture a **context snapshot** automatically: last N deploys, feature-flag changes, top error signatures, p99/p999 latency delta, affected tenants/regions.
  * **Stabilize**
    * Execute a **rollback or traffic shift** to the last-good model/service (canary controller); validate health via smoke tests and golden synthetic checks (known-request replay must pass).
    * If the feature store or a data pipeline is the culprit: fail over to **degraded mode** (fallback features, cached responses, or a heuristic policy).
  * **Data capture for forensics**
    * Quarantine a redacted sample of failing requests, feature vectors, and predictions (S3 `governance/incidents//samples/`); include traces, safety decisions, and model confidences.
    * Preserve relevant logs via export (CloudWatch → S3); pin dashboards.
  * **Hypothesis-driven analysis**
    * **Change correlation:** identify the first bad time; align it with **any change** (code, model, config, data). Use change-point detection on KPIs to narrow the window (a minimal sketch follows these steps).
    * **Reproduction:** re-run failing requests in a **deterministic container** with the exact model+flags; compare to last-good; run a side-by-side diff.
    * **Dependency check:** upstream (feature freshness, schema drift), downstream (clients, map service).
    * **Model-centric probes:** calibration curves on the failing slice, confusion-matrix deltas, SHAP drift vs. baseline, feature-importance changes.
  * **Root cause determination**
    * Classify: **Code defect**, **Model regression**, **Config/flag error**, **Data/feature drift**, **Infra capacity** (noisy neighbor, GPU ECC, throttling), **Map/trigger policy**.
    * Quantify blast radius (requests, segments, geos), cost impact, safety impact.
  * **Corrective & Preventive Actions (CAPA)**
    * Immediate fix (patch, hotfix, config revert), plus a **long-term guardrail** (test, monitor, lint rule, rollout constraint).
    * Create **issue tickets** with owners & due dates. Integrate with CI gates (e.g., block deploy if schema version mismatch).
  * **Validation**
    * Post-fix **replay**: reproduce the pre-incident failing cases → verify they pass; run targeted load to confirm capacity headroom.
    * Add **new regression tests** (golden scenario) to the simulation and eval suites; require a pass before future promotions.
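To support the change-correlation step above, a rolling robust-baseline check is often enough to find the first bad time in a KPI series before heavier tooling is involved. A minimal sketch, assuming the p99 latency samples and the deploy/flag change events have already been pulled from CloudWatch and the deployment log; the function names, window size, and threshold are illustrative, not an existing internal API.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

import numpy as np


def first_bad_time(
    timestamps: List[datetime],
    p99_ms: List[float],
    baseline_window: int = 60,   # samples used to estimate "normal"
    z_threshold: float = 4.0,    # robust z-score that counts as a change point
) -> Optional[datetime]:
    """Return the first timestamp where p99 latency departs from its rolling baseline."""
    values = np.asarray(p99_ms, dtype=float)
    for i in range(baseline_window, len(values)):
        window = values[i - baseline_window:i]
        median = np.median(window)
        mad = np.median(np.abs(window - median)) or 1e-9   # guard against a zero MAD
        robust_z = (values[i] - median) / (1.4826 * mad)
        if robust_z > z_threshold:
            return timestamps[i]
    return None


def correlate_with_changes(
    bad_at: datetime,
    changes: List[Tuple[datetime, str]],        # (time, "deploy <sha>" / "flag X" / "model vN")
    lookback: timedelta = timedelta(hours=2),
) -> List[Tuple[datetime, str]]:
    """Return changes that landed shortly before the detected change point, newest first."""
    suspects = [(t, desc) for t, desc in changes if bad_at - lookback <= t <= bad_at]
    return sorted(suspects, key=lambda c: c[0], reverse=True)
```

Anything inside the lookback window ahead of the detected change point becomes the first hypothesis to test with deterministic replay.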
* **Postmortem**
  * Write a blameless RCA using the template (5 Whys, fishbone); include timeline, contributing factors, detection gaps, MTTR/MTTD.
  * Review in the weekly reliability review; track action items to closure.
* **Core Tooling/Services**
  * PagerDuty/Incident.io, CloudWatch/Logs/Alarms, OpenSearch, Prometheus/Grafana, AWS X-Ray/Jaeger, AWS CodeDeploy events, feature-store metrics, W&B, Athena/QuickSight for KPI drilldowns, Jupyter for ad-hoc analysis.
* **Outputs & Storage**
  * `s3://…/governance/incidents//` (samples, dashboards, reports), **RCA document** (Markdown/PDF), Jira tickets, updated runbooks & tests, promotion gate updates.

---

### 34) Experiment GC (Garbage Collection) — artifacts, indices, and datasets hygiene

* **Trigger**
  * Weekly scheduled GC; **low free space** alert; **budget threshold** exceeded for storage/egress; repo archival; project sunset tag.
  * Manual **quarantine → purge** for compromised or incorrect datasets.
* **Inputs**
  * **Inventory sources:** S3 Inventory (per bucket), Glue/Athena tables, Iceberg snapshots, DVC remotes & tags, W&B runs/artifacts, ECR images/tags, OpenSearch indices, EMR logs, FSx/Lustre volumes.
  * **Usage signals:** Access logs (S3/Athena), last-read timestamps from index services, registry **in-use** pointers (current prod/staging models, golden datasets).
  * **Policies:** `retention_policy.yaml` (per class: Bronze/Silver/Gold), legal holds, exception lists, minimal-keep rules (e.g., N best runs per model).
* **Steps (with testing/validation)**
  * **Discovery & reachability**
    * Build a **lineage graph**: artifact → consumers (models, datasets, docs). Anything "unreached" and older than the policy horizon becomes a **candidate** (a reachability sketch follows the tooling list below).
    * Join with **usage stats** (no access in ≥ N days) and **cost** (size × storage class).
  * **Protection rules**
    * Always protect: models **referenced by registry channels** (prod, canary), **golden datasets**, signed model cards/datasheets, compliance snapshots, incident forensics.
    * Legal holds override GC; DSR erasure queues take precedence.
  * **Action plan**
    * **S3:** batch-delete candidates; transition keep-but-cold objects to Glacier.
    * **Iceberg:** expire snapshots older than the horizon; **rewrite manifests**; **vacuum** orphan files.
    * **W&B:** delete old runs/artifacts except the top-K per sweep by metric; export a summary CSV first.
    * **ECR:** apply a lifecycle policy (keep the last M images per repo and anything tagged `stable`; delete dangling layers); scan for large base images to dedupe.
    * **OpenSearch:** apply ISM (Index State Management) to roll over & delete old indices, or **shrink** & **forcemerge** indices that are kept.
    * **Logs:** compress EMR/YARN logs; purge anything older than the horizon.
  * **Safety checks**
    * **Dry-run** report (bytes to free, candidate count) → human approval for destructive steps.
    * Referential **integrity check:** no model or dataset manifest points at an about-to-be-deleted URI.
    * **Restore drill:** pick a random 1% of objects tiered to Glacier and ensure restore works within the SLA.
  * **Execution**
    * Orchestrate via Airflow/Step Functions with idempotent tasks and checkpointing; track failures & retries.
  * **Validation**
    * Post-GC audit: **Athena** reconciliation (sum sizes by class); check that dashboards & the registry remain healthy.
    * Alert on "unexpected reference" errors if any job fails due to a missing artifact.
* **Core Tooling/Services**
  * S3 Inventory/Batch Operations/Glacier, Glue/Athena, EMR Spark for Iceberg maintenance, W&B API, ECR lifecycle policies, OpenSearch ISM/Curator, Airflow/Step Functions, Jira for the approval workflow.
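The "unreached and stale" rule from Discovery & reachability is essentially mark-and-sweep over the lineage graph, run as a dry run first. A minimal sketch, assuming the lineage edges, protected roots, and last-access times have already been exported from the registry and S3 access logs; the data shapes and the 90-day horizon are illustrative placeholders.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Set


def reachable_from(roots: Set[str], references: Dict[str, List[str]]) -> Set[str]:
    """Mark phase: keep everything transitively referenced by a protected root."""
    keep, stack = set(roots), list(roots)
    while stack:
        node = stack.pop()
        for dep in references.get(node, []):   # edges: artifact -> artifacts it references
            if dep not in keep:
                keep.add(dep)
                stack.append(dep)
    return keep


def gc_candidates(
    inventory: Dict[str, dict],        # uri -> {"bytes": int, "last_access": datetime (UTC)}
    protected_roots: Set[str],         # prod/canary models, golden datasets, legal holds
    references: Dict[str, List[str]],  # uri -> uris it depends on (datasets, features, docs)
    stale_after: timedelta = timedelta(days=90),
) -> List[dict]:
    """Sweep phase (dry run): unreached artifacts with no recent access, largest first."""
    keep = reachable_from(protected_roots, references)
    now = datetime.now(timezone.utc)
    candidates = [
        {"uri": uri, "bytes": meta["bytes"]}
        for uri, meta in inventory.items()
        if uri not in keep and now - meta["last_access"] >= stale_after
    ]
    return sorted(candidates, key=lambda c: c["bytes"], reverse=True)
```

The dry-run report is then just the candidate count and byte total; deletion manifests are only generated after human approval, and registry-pinned models or legal holds simply go into `protected_roots`.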
* **Outputs & Storage**
  * `s3://…/governance/gc/reports/.json`, deletion manifests, restored-object test logs, storage-savings dashboard; policy & exception registry in Git.

---

### 35) GPU Capacity & Queues — scheduling, reservations, autoscaling, fairshare

* **Trigger**
  * Continuous: new training/HPO workloads submitted (Airflow/W&B Sweeps); scheduled **capacity planning** (weekly); **queue-depth** or **wait-time** SLO breach; quota changes; new model roadmap requiring different accelerators.
* **Inputs**
  * Historical job metadata (GPU type/count, wall-clock, throughput), cluster utilization (DCGM, Prometheus), job queue stats (length, age), **SageMaker** job history, instance pricing (on-demand/spot), Savings Plans/RIs, project priorities, SLAs (e.g., "critical sweep completes in ≤48 h"), dataset size and required I/O.
  * Node-pool definitions (A100/H100 vs. T4/L4; CPU-only for preprocessing); storage bandwidth (FSx/Lustre, S3).
* **Steps (with testing/validation)**
  * **Demand forecasting**
    * A time-series model predicts GPU-hours by queue for the next 1–4 weeks; identify **peak weeks** and confidence intervals.
    * Scenario overlay: planned HPO waves, retrain cadence tied to drift alarms.
  * **Capacity planning**
    * Map demand to **node pools** (labels/taints): `gpu=A100`, `gpu=L4`, `cpu-prep`.
    * Purchase/adjust **Savings Plans**; set the **SageMaker managed spot** percentage; reserve high-risk windows (e.g., releases).
    * Validate **data-path throughput**: if bottlenecked on I/O (S3→FSx), increase FSx/Lustre capacity and pre-stage datasets.
  * **Queueing & policy**
    * **Kueue/Volcano/Slurm** or SageMaker priorities: project quotas (GPU-hour budgets), **fairshare weights**, **preemption** rules for interruptible sweeps.
    * An admission controller enforces **budget tags & approvals** for large jobs; it refuses jobs that exceed the per-run budget unless they carry the `SpendOverride` label (a minimal admission sketch follows the tooling list below).
  * **Autoscaling & bin-packing**
    * **Karpenter/Cluster Autoscaler** spins up node groups on demand; prefer bin-packing (anti-fragmentation) and GPU-topology-aware placement (NVLink domains).
    * For spot, enable **checkpointing** every N minutes to S3/FSx; the preemption handler requeues gracefully.
  * **Placement & topology**
    * Enable **NCCL topology hints**; rack-aware placement; multi-NIC configs for multi-node DDP.
    * Use node affinity to place data-heavy jobs close to **FSx/Lustre**; avoid overloading S3 (throttle prefetchers).
  * **SLO management**
    * SLOs: **median wait time**, **p95 wait time**, **train throughput (img/s)**, **GPU util %**, **cost/GPU-hour**. Alert on breach.
    * If queue depth persists: raise the autoscaling upper bound (if budget allows) or **shed load** (pause low-priority sweeps).
  * **Validation**
    * **Load simulation:** synthetic job submissions to validate queue fairness & SLOs.
    * **Failover drill:** zone-outage simulation; ensure jobs reschedule and checkpoints recover.
    * **Throughput tests:** for each node pool, run standard training microbenchmarks; track regression over time.
  * **Governance**
    * Monthly **capacity review**: showback/chargeback per team; renovate underused images; deprecate old node pools.
* **Core Tooling/Services**
  * EKS (Kubernetes), Karpenter/Cluster Autoscaler, NVIDIA DCGM exporter + Prometheus/Grafana, Kueue/Volcano/Slurm or SageMaker managed training, FSx for Lustre, S3, Ray for distributed sweeps, EventBridge/Lambda for policy actions, Cost Explorer API.
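The budget rule enforced by the admission controller in Queueing & policy reduces to estimating a job's GPU-hour cost and comparing it to the project's remaining budget and a per-run cap. A minimal sketch; the hourly rates, the $500 cap, the handling of the `SpendOverride` label, and the `JobSpec` shape are illustrative assumptions, not an existing controller interface.

```python
from dataclasses import dataclass, field
from typing import Dict

# Illustrative $/GPU-hour figures; real numbers would come from the pricing feed / Cost Explorer.
GPU_HOURLY_RATE_USD = {"A100": 4.10, "H100": 6.98, "L4": 0.80, "T4": 0.53}


@dataclass
class JobSpec:
    project: str
    gpu_type: str                 # must match a node-pool label, e.g. gpu=A100
    gpu_count: int
    est_wall_clock_h: float
    labels: Dict[str, str] = field(default_factory=dict)


def admit(job: JobSpec, remaining_budget_usd: float, per_run_cap_usd: float = 500.0) -> str:
    """Return 'admit', 'deny', or 'needs-approval' for a submitted training/HPO job."""
    est_cost = GPU_HOURLY_RATE_USD[job.gpu_type] * job.gpu_count * job.est_wall_clock_h
    if est_cost > remaining_budget_usd:
        return "deny"                       # project GPU-hour budget already exhausted
    if est_cost > per_run_cap_usd and job.labels.get("SpendOverride") != "true":
        return "needs-approval"             # large run: require the explicit override label
    return "admit"


# Example: an 8xA100 sweep estimated at 12 hours (~$394) fits a $2,000 remaining budget.
print(admit(JobSpec("perception", "A100", 8, 12.0), remaining_budget_usd=2000.0))
```

In a real queue this check would run at submission time and write its decision to the job's annotations, so fairshare and preemption policies can act on the same metadata.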
* **Outputs & Storage**
  * Capacity plan (`capacity_plan_.md`), queue configs (ConfigMaps/Slurm partitions), utilization dashboards, Savings Plan decisions, checkpoint manifests, SLA reports; all versioned in Git, with logs in S3 governance.

---

### 36) Map/Trigger Policy Update — updating HD map layers & fleet trigger definitions

* **Trigger**
  * Periodic map refresh (e.g., weekly); **policy change** from the safety team; evidence from scenario mining (#8/#12) showing gaps; external roadway updates (work zones, new speed limits); spike in false positives/negatives for specific trigger definitions.
* **Inputs**
  * **Map data deltas** (internal mapping pipeline outputs, vendor feeds, or OSM diffs), lane topology changes, speed limit updates, construction zones, geofences.
  * **Trigger performance:** alert rates, precision/recall of triggers (e.g., hard-brake, disengagement proximity), geographic breakdowns.
  * **Scenario feedback:** mined error clusters, audit results from Human QA (#10), simulation outcomes from drive replay/closed-loop (#17).
  * **Constraints:** ODD boundaries, regulatory requirements, privacy constraints.
* **Steps (with testing/validation)**
  * **Propose & author changes**
    * Author the **policy YAML** (semver): it includes **map layer updates** (road attributes, closures) and **trigger definitions** (thresholds, state machines, OOD/uncertainty bounds, geofenced overrides).
    * Attach rationale, expected impact (alerts/day, coverage gain), and a risk assessment.
  * **Offline evaluation**
    * **Backtest** on the last N weeks of logs: compute precision/recall, alert volume, and **regional heatmaps**; confirm reduced false alarms or improved recall on target scenarios.
    * **Counterfactuals** in sim: vary thresholds; verify the **safety balance** (miss rate vs. nuisance rate).
    * Verify **lat/long** accuracy (map matching via OSRM/Valhalla); ensure no regressions at map tile boundaries.
  * **Schema & consistency checks**
    * Validate the policy schema (JSON Schema); enforce allowed ranges; check for **conflicting overrides** across geofences.
    * Ensure **version compatibility** with the edge agent and cloud detectors (backwards/forwards).
  * **Security & signing**
    * Sign the policy bundle with KMS; attach an **in-toto** attestation; generate an SBOM for any included logic plugins.
  * **Staged rollout**
    * Publish to the **S3 policy bucket** (immutable path per version); create an **IoT Jobs**/Greengrass deployment targeting a **canary cohort** (small % of the fleet or a specific region).
    * Enable **feature flags**: `trigger_policy.version`, `map_layer.version`, with a kill-switch.
  * **Canary monitoring**
    * Watch alert rates, map-match errors, CPU/mem impact on the edge, OTA download success rates, and any safety predicate changes; compare to the control cohort.
    * Roll forward if within guardrails; roll back on anomalies (automatically if thresholds are breached; a minimal guardrail sketch follows the tooling list below).
  * **Full rollout & enforcement**
    * Gradually increase the cohort; record final adoption; ensure backend **parsers** accept new tags/fields; update catalog ETL if the schema changed.
  * **Validation**
    * Weekly **policy audit**: recompute metrics; ensure no drift between **edge** and **cloud** policy versions; verify that **replay** on key scenarios passes.
    * **Documentation:** update the policy change log and trigger explanations for labelers/engineers.
* **Core Tooling/Services**
  * Geo stack: OSRM/Valhalla, GeoPandas/Shapely; data lake (Athena/OpenSearch) for backtests; AWS Location Service (optional); IoT Jobs/Greengrass for OTA; S3 static policy hosting; KMS for signing; feature-flag service; QuickSight/Mapbox for heatmaps.
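The canary-monitoring decision above comes down to comparing the canary cohort against the control cohort on a few guardrail metrics and rolling back automatically on a breach. A minimal sketch, assuming cohort aggregates are already computed from the rollout dashboard; the metric set, relative thresholds, and the 98% OTA floor are illustrative, not the actual guardrail configuration.

```python
from dataclasses import dataclass


@dataclass
class CohortStats:
    alerts_per_1k_km: float       # trigger alert rate
    map_match_error_rate: float   # fraction of requests with map-match errors
    edge_cpu_pct: float           # mean CPU load of the edge agent
    ota_success_rate: float       # policy-bundle download success rate


# Illustrative guardrails: maximum allowed relative degradation of canary vs. control.
MAX_RELATIVE_INCREASE = {
    "alerts_per_1k_km": 0.15,       # no more than +15% alert volume
    "map_match_error_rate": 0.10,   # no more than +10% map-match errors
    "edge_cpu_pct": 0.20,           # no more than +20% CPU impact on the edge
}
MIN_OTA_SUCCESS = 0.98


def canary_decision(canary: CohortStats, control: CohortStats) -> str:
    """Return 'roll-forward' when all guardrails hold, otherwise 'roll-back'."""
    if canary.ota_success_rate < MIN_OTA_SUCCESS:
        return "roll-back"
    for metric, max_increase in MAX_RELATIVE_INCREASE.items():
        observed, baseline = getattr(canary, metric), getattr(control, metric)
        if baseline > 0 and (observed - baseline) / baseline > max_increase:
            return "roll-back"
    return "roll-forward"
```

In practice the gate would also require a minimum exposure (vehicle-hours or request count) for the canary cohort before trusting the comparison, holding the rollout while that evidence accumulates.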
* **Outputs & Storage**
  * `s3://…/policy/map//bundle.tar.gz`, `trigger_policy//policy.yaml` (signed), **impact report** (before/after metrics, maps), rollout dashboard, change-log; links recorded in registry & internal portal.

---