Reliability, Capacity, Maps

33) Incident RCA (Root Cause Analysis) — serving & pipeline reliability
Trigger
Any of: production SLO/SLA breach (latency, error rate), safety predicate trip, anomaly alert from monitoring/drift (#25), simulation regression from pre-prod, canary rollback, repeated pipeline failures, or on-call/PagerDuty page.
Scheduled post-incident review within 72 hours for any SEV-1/SEV-2.
Inputs
Telemetry & traces: CloudWatch metrics/logs, OpenSearch logs, Prometheus/Grafana dashboards, AWS X-Ray/Jaeger traces, NVML/DCGM GPU telemetry, feature-store freshness metrics.
Change context: Deployment events (Git SHA, container digest, config/flag deltas), model registry history (candidate → staging → prod), feature definitions, safety predicate versions.
Data signals: Request samples, mispredictions flagged by online validators, user-reported issues, shadow-mode diffs, recent drift reports.
Artifacts: Last successful build/run logs, canary analysis reports, A/B analysis, W&B run metadata.
Steps (with testing/validation)
Immediate triage (T+0 to T+30min)
Declare incident, assign IC (incident commander) and scribe; set severity; start timeline.
Freeze risky changes (deployment lock) and engage runbooks (safe rollback primitives).
Capture context snapshot automatically: last N deploys, feature-flag changes, top error signatures, p99/p999 latency delta, affected tenants/regions.
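A minimal sketch of the automated context snapshot, assuming boto3 access to CodeDeploy, CloudWatch Logs Insights, and S3; the application, deployment group, log group, bucket, and incident ID below are placeholders, and the error-signature query is only one way to surface top signatures.

```python
import json
import time
import boto3

codedeploy = boto3.client("codedeploy")
logs = boto3.client("logs")
s3 = boto3.client("s3")

SNAPSHOT_BUCKET = "example-governance-bucket"   # placeholder bucket
INCIDENT_ID = "INC-1234"                        # placeholder incident id

def recent_deployments(application: str, group: str, limit: int = 5) -> list:
    """Most recent deployment records for the affected service (order as returned by CodeDeploy)."""
    ids = codedeploy.list_deployments(
        applicationName=application,
        deploymentGroupName=group,
    )["deployments"][:limit]
    return [codedeploy.get_deployment(deploymentId=i)["deploymentInfo"] for i in ids]

def top_error_signatures(log_group: str, minutes: int = 30, limit: int = 10) -> list:
    """Top error signatures over the last N minutes via CloudWatch Logs Insights."""
    end = int(time.time())
    query = logs.start_query(
        logGroupName=log_group,
        startTime=end - minutes * 60,
        endTime=end,
        queryString=(
            "fields @message | filter @message like /ERROR/ "
            f"| stats count() as n by @message | sort n desc | limit {limit}"
        ),
    )
    while True:
        res = logs.get_query_results(queryId=query["queryId"])
        if res["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            return res.get("results", [])
        time.sleep(2)

snapshot = {
    "deployments": recent_deployments("serving-app", "prod"),   # placeholder names
    "errors": top_error_signatures("/ecs/serving"),             # placeholder log group
}
s3.put_object(
    Bucket=SNAPSHOT_BUCKET,
    Key=f"governance/incidents/{INCIDENT_ID}/context_snapshot.json",
    Body=json.dumps(snapshot, default=str).encode(),
)
```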
Stabilize
Execute rollback or traffic shift to last-good model/service (canary controller); validate health via smoke tests and golden synthetic checks (known-request replay must pass).
If feature-store or data pipeline is culprit: fail over to degraded mode (fallback features, cached responses, or heuristic policy).
Data capture for forensics
Quarantine a redacted sample of failing requests, feature vectors, and predictions (S3 governance/incidents/<id>/samples/); include traces, safety decisions, and model confidences.
Preserve relevant logs via export (CloudWatch → S3); pin dashboards.
Hypothesis-driven analysis
Change correlation: identify the first bad time; align it with any change (code, model, config, data). Use change-point detection on KPIs to narrow the window (see the sketch after this list).
Reproduction: re-run failing requests in a deterministic container with the exact model+flags; compare to last-good; run side-by-side diff.
Dependency check: upstream (feature freshness, schema drift), downstream (clients, map service).
Model-centric probes: calibration curves on failing slice, confusion matrix deltas, SHAP drift vs. baseline, feature importance changes.
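For the change-correlation probe, a self-contained sketch of single change-point estimation on a KPI series; this is a plain NumPy/pandas least-squares split (a library such as ruptures would serve the same purpose), and the KPI name and resolution are illustrative.

```python
import numpy as np
import pandas as pd

def first_bad_time(kpi: pd.Series, min_seg: int = 10) -> pd.Timestamp:
    """Single change-point estimate: the split that minimises the within-segment
    sum of squared errors of a time-indexed KPI (e.g., p99 latency per minute)."""
    y = kpi.to_numpy(dtype=float)
    best_idx, best_cost = None, np.inf
    for i in range(min_seg, len(y) - min_seg):
        left, right = y[:i], y[i:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_idx, best_cost = i, cost
    return kpi.index[best_idx]

# Hypothetical usage: align the estimated change point with the deploy/flag timeline.
# p99 = pd.Series(latency_values, index=minute_timestamps)
# print("first bad minute:", first_bad_time(p99))
```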
Root cause determination
Classify: Code defect, Model regression, Config/flag error, Data/feature drift, Infra capacity (noisy neighbor, GPU ECC, throttling), Map/trigger policy.
Quantify blast radius (requests, segments, geos), cost impact, safety impact.
Corrective & Preventive Actions (CAPA)
Immediate fix (patch, hotfix, config revert), plus long-term guardrail (test, monitor, lint rule, rollout constraint).
Create issue tickets with owners & due dates. Integrate with CI gates (e.g., block deploy if schema version mismatch).
Validation
Post-fix replay: reproduce pre-incident failing cases → verify pass; run targeted load to confirm capacity headroom.
Add new regression tests (golden scenario) to simulation and eval suites; require pass before future promotions.
Postmortem
Write blameless RCA using template (5 Whys, fishbone); include timeline, contributing factors, detection gaps, MTTR/MTTD.
Review in weekly reliability review; track action items to closure.
Core Tooling/Services
PagerDuty/Incident.io, CloudWatch/Logs/Alarms, OpenSearch, Prometheus/Grafana, AWS X-Ray/Jaeger, AWS CodeDeploy events, Feature-store metrics, W&B, Athena/QuickSight for KPI drilldowns, Jupyter for ad-hoc analysis.
Outputs & Storage
s3://…/governance/incidents/<incident_id>/ (samples, dashboards, reports), RCA document (Markdown/PDF), Jira tickets, updated runbooks & tests, promotion gate updates.
34) Experiment GC (Garbage Collection) — artifact, index, and dataset hygiene
Trigger
Weekly scheduled GC; low free space alert; budget threshold exceeded for storage/egress; repo archival; project sunset tag.
Manual quarantine → purge for compromised or incorrect datasets.
Inputs
Inventory sources: S3 Inventory (per-bucket), Glue/Athena tables, Iceberg snapshots, DVC remotes & tags, W&B runs/artifacts, ECR images/tags, OpenSearch indices, EMR logs, FSx/Lustre volumes.
Usage signals: Access logs (S3/Athena), last-read timestamps from index services, registry in-use pointers (current prod/staging models, golden datasets).
Policies: retention_policy.yaml (per class: Bronze/Silver/Gold), legal holds, exception lists, minimal-keep (e.g., N best runs per model).
Steps (with testing/validation)
Discovery & reachability
Build lineage graph: artifact → consumers (models, datasets, docs). Anything “unreached” and older than policy horizon becomes a candidate.
Join with usage stats (no access in ≥N days) and cost (size × storage class).
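A sketch of the candidate-selection logic under stated assumptions: the inventory has been flattened into a pandas DataFrame, the lineage graph is available as (consumer, dependency) edges, and the thresholds mirror a hypothetical retention_policy.yaml; all column and constant names are illustrative.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

HORIZON_DAYS = 90      # retention horizon (assumed value from retention_policy.yaml)
NO_ACCESS_DAYS = 60    # "unused" threshold (assumed)

def gc_candidates(inventory: pd.DataFrame,
                  edges: list[tuple[str, str]],
                  protected_roots: set[str]) -> pd.DataFrame:
    """inventory: one row per artifact URI with UTC-timestamp columns
    ['uri', 'last_modified', 'last_read', 'bytes'];
    edges: (consumer_uri, dependency_uri) pairs from the lineage graph;
    protected_roots: registry-pinned models, golden datasets, legal holds."""
    # Everything transitively reachable from a protected root is kept.
    deps: dict[str, list[str]] = {}
    for consumer, dependency in edges:
        deps.setdefault(consumer, []).append(dependency)
    reachable, stack = set(), list(protected_roots)
    while stack:
        node = stack.pop()
        if node in reachable:
            continue
        reachable.add(node)
        stack.extend(deps.get(node, []))

    now = datetime.now(timezone.utc)
    old = inventory["last_modified"] < now - timedelta(days=HORIZON_DAYS)
    unused = inventory["last_read"] < now - timedelta(days=NO_ACCESS_DAYS)
    unreached = ~inventory["uri"].isin(reachable)
    out = inventory[unreached & old & unused].copy()
    return out.sort_values("bytes", ascending=False)   # biggest savings first
```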
Protection rules
Always protect: models referenced by registry channels (prod, canary), golden datasets, signed model cards/datasheets, compliance snapshots, incident forensics.
Legal holds override GC; DSR erasure queues take precedence.
Action plan
S3: batch delete candidates; transition to Glacier for keep-but-cold.
Iceberg: expire snapshots ≥ horizon; rewrite manifests; vacuum orphan files.
W&B: delete old runs/artifacts except top-K per sweep by metric; export a summary CSV first (see the sketch after this list).
ECR: apply lifecycle policy (keep last M per repo and anything tagged stable; delete dangling layers); scan for large base images to dedupe.
OpenSearch: apply ISM (Index State Management) to roll over and delete old indices, or shrink and forcemerge indices that are kept.
Logs: compress EMR/YARN logs; purge older than horizon.
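For the W&B step, a minimal retention sketch using the public wandb.Api(); the project path and metric name are placeholders, and the actual deletion stays commented out so the dry-run/approval flow in the safety checks below still applies.

```python
import csv
import wandb

KEEP_TOP_K = 5
METRIC = "val/mAP"                    # assumed summary metric name
PROJECT = "my-entity/perception"      # placeholder entity/project

api = wandb.Api()
runs = list(api.runs(PROJECT))

# 1) Export a summary CSV before any deletion.
with open("wandb_summary_export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["run_id", "sweep_id", "state", METRIC])
    for run in runs:
        sweep_id = run.sweep.id if run.sweep else "no-sweep"
        writer.writerow([run.id, sweep_id, run.state, run.summary.get(METRIC)])

# 2) Keep the top-K finished runs per sweep by metric; everything else is a candidate.
by_sweep: dict[str, list] = {}
for run in runs:
    if run.state != "finished":
        continue
    sweep_id = run.sweep.id if run.sweep else "no-sweep"
    by_sweep.setdefault(sweep_id, []).append(run)

to_delete = []
for sweep_id, sweep_runs in by_sweep.items():
    ranked = sorted(sweep_runs,
                    key=lambda r: r.summary.get(METRIC, float("-inf")),
                    reverse=True)
    to_delete.extend(ranked[KEEP_TOP_K:])

for run in to_delete:
    print("candidate:", run.id, run.name)
    # run.delete(delete_artifacts=True)   # destructive; gated on human approval
```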
Safety checks
Dry-run report (bytes to free, candidate count) → human approval for destructive steps.
Referential integrity check: no model or dataset manifest points at an about-to-delete URI (a query sketch follows this list).
Restore drill: pick a random 1% of objects transitioned to Glacier and verify restore completes within the SLA.
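One possible form of the referential integrity check, assuming the GC candidates and artifact manifests are already registered as Glue/Athena tables and the AWS SDK for pandas (awswrangler) is available; the database and table names are hypothetical.

```python
import awswrangler as wr   # AWS SDK for pandas (assumed available)

DATABASE = "governance"    # hypothetical Glue database
# Hypothetical tables: gc_candidates(uri), artifact_manifests(manifest_uri, referenced_uri)
SQL = """
SELECT m.manifest_uri, m.referenced_uri
FROM artifact_manifests m
JOIN gc_candidates c
  ON m.referenced_uri = c.uri
"""

violations = wr.athena.read_sql_query(SQL, database=DATABASE)
if not violations.empty:
    # Any hit means a live manifest still points at a deletion candidate:
    # drop those URIs from the plan and flag the lineage graph for repair.
    violations.to_csv("referential_integrity_violations.csv", index=False)
    raise SystemExit(f"{len(violations)} candidates are still referenced; aborting GC batch")
```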
Execution
Orchestrate via Airflow/Step Functions with idempotent tasks and checkpointing; track failures & retries.
Validation
Post-GC audit: Athena reconciliation (sum sizes by class), check that dashboards & registry remain healthy.
Alert on “unexpected reference” errors if any job fails due to a missing artifact.
Core Tooling/Services
S3 Inventory/Batch Ops/Glacier, Glue/Athena, EMR Spark for Iceberg maintenance, W&B API, ECR lifecycle, OpenSearch ISM/Curator, Airflow/Step Functions, Jira for approval workflow.
Outputs & Storage
s3://…/governance/gc/reports/<date>.json, deletion manifests, restored-object test logs, storage savings dashboard; policy & exception registry in Git.
36) Map/Trigger Policy Update — updating HD map layers & fleet trigger definitions
Trigger
Periodic map refresh (e.g., weekly); policy change from safety team; evidence from scenario mining (#8/#12) showing gaps; external roadway updates (work zones, new speed limits); spike in false positives/negatives for specific trigger definitions.
Inputs
Map data deltas (internal mapping pipeline outputs, vendor feeds, or OSM diffs), lane topology changes, speed limit updates, construction zones, geofences.
Trigger performance: alert rates, precision/recall of triggers (e.g., hard-brake, disengagement proximity), geographic breakdowns.
Scenario feedback: mined error clusters, audit results from Human QA (#10), simulation outcomes from drive replay/closed-loop (#17).
Constraints: ODD boundaries, regulatory requirements, privacy constraints.
Steps (with testing/validation)
Propose & author changes
Author policy YAML (semver): includes map layer updates (road attributes, closures) and trigger definitions (thresholds, state machines, OOD/uncertainty bounds, geofenced overrides).
Attach rationale, expected impact (alerts/day, coverage gain), and risk assessment.
Offline evaluation
Backtest on the last N weeks of logs: compute precision/recall, alert volume, and regional heatmaps; confirm reduced false alarms or improved recall on target scenarios (see the sketch after this list).
Counterfactuals in sim: vary thresholds, verify safety balance (miss rate vs. nuisance rate).
Verify lat/long accuracy (map matching via OSRM/Valhalla); ensure no regressions at map tile boundaries.
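A small backtest sketch for the precision/recall and alert-volume comparison, assuming the replayed logs have already been reduced to one row per evaluated segment with the candidate policy's trigger decision and a reviewed label; column names are illustrative.

```python
import pandas as pd

def backtest_report(events: pd.DataFrame) -> pd.DataFrame:
    """events: one row per evaluated log segment with boolean columns
    ['region', 'fired', 'label'], where `fired` is the candidate policy's
    trigger decision and `label` is the reviewed ground truth."""
    def summarize(g: pd.DataFrame) -> pd.Series:
        tp = (g.fired & g.label).sum()
        fp = (g.fired & ~g.label).sum()
        fn = (~g.fired & g.label).sum()
        return pd.Series({
            "alerts_per_1k": 1000.0 * g.fired.mean(),
            "precision": tp / (tp + fp) if tp + fp else float("nan"),
            "recall": tp / (tp + fn) if tp + fn else float("nan"),
        })
    return events.groupby("region").apply(summarize)

# Compare candidate vs. current policy on the same N weeks of logs, e.g.:
# delta = backtest_report(replayed_with_candidate) - backtest_report(replayed_with_current)
# The delta should show fewer false alarms and/or better recall on target scenarios.
```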
Schema & consistency checks
Validate the policy against its JSON Schema; enforce allowed ranges; check for conflicting overrides across geofences (see the validation sketch after this list).
Ensure version compatibility with edge agent and cloud detectors (backwards/forwards).
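A sketch of the schema check using the jsonschema and PyYAML libraries; the schema shown here is a hypothetical, trimmed example of what a trigger-policy contract might look like, and the policy path is a placeholder.

```python
import yaml                                  # PyYAML
from jsonschema import Draft202012Validator

# Hypothetical trigger-policy schema; real field names and ranges come from the policy contract.
SCHEMA = {
    "type": "object",
    "required": ["version", "triggers"],
    "properties": {
        "version": {"type": "string", "pattern": r"^\d+\.\d+\.\d+$"},   # semver
        "triggers": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name", "threshold"],
                "properties": {
                    "name": {"type": "string"},
                    "threshold": {"type": "number", "minimum": 0, "maximum": 1},
                    "geofence": {"type": "string"},
                },
            },
        },
    },
}

with open("trigger_policy/policy.yaml") as f:    # placeholder path
    policy = yaml.safe_load(f)

errors = sorted(Draft202012Validator(SCHEMA).iter_errors(policy), key=str)
for err in errors:
    print(f"{list(err.path)}: {err.message}")
if errors:
    raise SystemExit("policy failed schema validation; blocking publish")
```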
Security & signing
Sign policy bundle with KMS; attach in-toto attestation; generate SBOM for any included logic plugins.
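A minimal signing sketch with an asymmetric KMS key via boto3; the key alias, file names, and signing algorithm are assumptions, and the in-toto attestation and SBOM generation are separate steps not shown here.

```python
import hashlib
import boto3

kms = boto3.client("kms")
KEY_ID = "alias/policy-signing"   # hypothetical asymmetric KMS key alias

def sign_bundle(path: str) -> bytes:
    """Sign the SHA-256 digest of the policy bundle; the detached signature
    ships alongside the bundle and is verified on-device before activation."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).digest()
    resp = kms.sign(
        KeyId=KEY_ID,
        Message=digest,
        MessageType="DIGEST",
        SigningAlgorithm="RSASSA_PSS_SHA_256",   # assumed key spec
    )
    return resp["Signature"]

signature = sign_bundle("bundle.tar.gz")
with open("bundle.tar.gz.sig", "wb") as f:
    f.write(signature)
```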
Staged rollout
Publish to the S3 policy bucket (immutable path per version); create an IoT Jobs/Greengrass deployment targeting a canary cohort (a small percentage of the fleet or a specific region); see the deployment sketch after this list.
Enable feature flags: trigger_policy.version, map_layer.version, with a kill-switch.
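A sketch of the canary OTA step with AWS IoT Jobs via boto3; the thing-group ARN, job document fields, bucket paths, and rollout/abort thresholds are all placeholders that would come from the policy bundle and fleet configuration.

```python
import json
import boto3

iot = boto3.client("iot")

POLICY_VERSION = "1.4.0"   # example semver
CANARY_GROUP_ARN = "arn:aws:iot:us-east-1:123456789012:thinggroup/canary-cohort"  # placeholder

job = iot.create_job(
    jobId=f"trigger-policy-{POLICY_VERSION.replace('.', '-')}",
    targets=[CANARY_GROUP_ARN],
    targetSelection="SNAPSHOT",   # fixed canary cohort, not every future device
    document=json.dumps({
        # Document fields are interpreted by the edge agent; names here are hypothetical.
        "operation": "update_trigger_policy",
        "policy_url": f"s3://policy-bucket/trigger_policy/{POLICY_VERSION}/policy.yaml",
        "signature_url": f"s3://policy-bucket/trigger_policy/{POLICY_VERSION}/policy.yaml.sig",
    }),
    jobExecutionsRolloutConfig={"maximumPerMinute": 50},   # throttle the OTA wave
    abortConfig={"criteriaList": [{
        "failureType": "FAILED",
        "action": "CANCEL",
        "thresholdPercentage": 10.0,        # auto-abort if >10% of executions fail
        "minNumberOfExecutedThings": 20,
    }]},
)
print("created canary job:", job["jobId"])
```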
Canary monitoring
Watch alert rates, map match errors, CPU/mem impact on edge, OTA download success rates, and any safety predicate changes; compare to control cohort.
Roll forward if within guardrails; roll back on anomalies (auto if thresholds breached).
Full rollout & enforcement
Gradually increase cohort; record final adoption; ensure backend parsers accept new tags/fields; update catalog ETL if schema changed.
Validation
Weekly policy audit: recompute metrics; ensure no drift between edge and cloud policy versions; verify replay on key scenarios passes.
Documentation: update policy change log, trigger explanations for labelers/engineers.
Core Tooling/Services
Geo stack: OSRM/Valhalla, GeoPandas/Shapely; data lake (Athena/OpenSearch) for backtests; AWS Location Service (optional); IoT Jobs/Greengrass for OTA; S3 static policy hosting; KMS for signing; Feature-flag service; QuickSight/Mapbox for heatmaps.
Outputs & Storage
s3://…/policy/map/<semver>/bundle.tar.gz, trigger_policy/<semver>/policy.yaml (signed), impact report (before/after metrics, maps), rollout dashboard, change log; links recorded in registry & internal portal.