Monitoring & Continual Learning¶
¶
25) Drift Detection (Data, Prediction, Concept, Performance)¶
When it runs
Stream: near-real-time sliding windows (e.g., 15-minute / 1-hour) over online inference topics for early warning.
Batch: scheduled Airflow jobs (hourly/daily/weekly) to produce canonical drift reports and trend lines.
On-demand: after a deployment, during canary, or when SLOs are breached (p95 latency, error rate) or OOD counters spike.
Inputs
Online telemetry: request schemas, feature snapshots (PII-scrubbed), model outputs (scores/boxes/masks/uncertainty), per-request timing + resource metrics.
Reference baselines: per-slice statistics frozen at promotion time from #16 (evaluation) and “healthy” historical windows (seasonal references).
Label trickle: a small, delayed stream of ground truth from #10 (human QA) and #9 (auto-label confirmations) to estimate concept drift where possible.
Steps
Collection & privacy
Tap inference logs via OpenTelemetry exporters; route to Kinesis → S3 (Parquet); strip or hash identifiers; mask PII fields at the edge.
Maintain a feature dictionary (name, type, valid range, unit) in Glue Data Catalog; enforce with validators.
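A minimal sketch of feature-dictionary enforcement, assuming the dictionary has been materialized as a Python mapping (in production it lives in the Glue Data Catalog) and that telemetry arrives as a pandas DataFrame; the field names and ranges below are illustrative, not the real schema.

```python
import pandas as pd

# Hypothetical feature dictionary (name -> type / valid range); in production this
# would be read from the Glue Data Catalog rather than hard-coded.
FEATURE_DICT = {
    "ego_speed_mps":   {"dtype": "float64", "min": 0.0, "max": 70.0},
    "camera_exposure": {"dtype": "float64", "min": 0.0, "max": 1.0},
    "weather_tag":     {"dtype": "object",  "allowed": {"clear", "rain", "fog", "snow"}},
}

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return human-readable violations for one telemetry window."""
    problems = []
    for name, spec in FEATURE_DICT.items():
        if name not in df.columns:
            problems.append(f"missing required field: {name}")
            continue
        col = df[name]
        if str(col.dtype) != spec["dtype"]:
            problems.append(f"{name}: dtype {col.dtype} != {spec['dtype']}")
        if "min" in spec and (col.dropna() < spec["min"]).any():
            problems.append(f"{name}: values below {spec['min']}")
        if "max" in spec and (col.dropna() > spec["max"]).any():
            problems.append(f"{name}: values above {spec['max']}")
        if "allowed" in spec and not set(col.dropna().unique()) <= spec["allowed"]:
            problems.append(f"{name}: unexpected categories")
        # Missingness change is tracked separately against the frozen baseline.
    return problems
```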
Drift computations
Schema drift: required fields present; types/ranges; missingness change (Great Expectations).
Covariate drift (inputs):
Numeric: PSI, KS/AD tests, distributional distance; maintain rolling means/variances and quantiles (a PSI sketch follows this list).
Categorical: population stability, χ² tests; top-k category churn.
Temporal: autocorrelation changes; seasonality break detection (CUSUM on aggregates).
Prediction drift:
Score histograms per class; calibration shift (ECE, Brier); acceptance/abstention rate drift.
Spatial: IoU of drivable-area segmentation vs. the last known stable output; box geometry sanity (aspect ratio, area) distributions.
Temporal: track fragmentation, ID-switch rates, latency correlation with confidence.
Concept drift (where labels available): prequential error rates, sliding-window AUC/mAP/mIoU; DDM/ADWIN style changepoint detectors.
OOD/uncertainty sentinels: ensemble disagreement, Mahalanobis distance in penultimate embeddings; maintain per-slice OOD counters.
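A minimal PSI sketch for the numeric covariate-drift check above, assuming bin edges are derived from the frozen reference window; the Green/Yellow/Red cut-offs in the comment are common rules of thumb, not calibrated per-slice thresholds.

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a frozen reference window and a live window."""
    # Interior cut points from reference quantiles, so each reference bin holds ~equal mass.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_frac = np.bincount(np.searchsorted(edges, reference), minlength=bins) / len(reference) + eps
    cur_frac = np.bincount(np.searchsorted(edges, current), minlength=bins) / len(current) + eps
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Common rule-of-thumb bands (calibrate per slice in practice):
# PSI < 0.1 -> Green, 0.1-0.25 -> Yellow, > 0.25 -> Red.
```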
Attribution & slicing
Always compute by critical slices (weather, time, geography, road type, sensor subset) and cohorts (fleet/customer).
Root-cause heuristics: feature importance on drift indicators (e.g., Shapley on “drift vs no-drift” classification).
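One hedged way to implement the root-cause heuristic: train a classifier to distinguish reference-window rows from current-window rows and rank features by importance. The sketch below assumes numeric/encoded feature snapshots and uses sklearn permutation importance as a stand-in; Shapley values via the shap package are a drop-in refinement.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def rank_drifting_features(reference: pd.DataFrame, current: pd.DataFrame, top_k: int = 3):
    """Train a 'which window is this row from?' classifier; its important features explain the drift."""
    X = pd.concat([reference, current], ignore_index=True)
    y = np.r_[np.zeros(len(reference)), np.ones(len(current))]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    # If the classifier cannot separate the two windows, there is little multivariate drift.
    imp = permutation_importance(clf, X_te, y_te, n_repeats=5, random_state=0)
    ranked = sorted(zip(X.columns, imp.importances_mean), key=lambda t: -t[1])
    return ranked[:top_k]  # Shapley values (shap.TreeExplainer) are a drop-in refinement.
```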
Thresholding & governance
Severity bands: Green (noise), Yellow (monitor), Red (action). Calibrate thresholds per slice to avoid alert fatigue.
Dedup & cool-down: suppress duplicate alerts within a lookback window; escalate if duration exceeds T.
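A minimal sketch of the severity-band and cool-down logic; the thresholds, cool-down length, and slice names are illustrative placeholders.

```python
from datetime import datetime, timedelta

# Illustrative per-slice PSI thresholds (yellow, red); real values are calibrated per slice.
BANDS = {"night_rain": (0.15, 0.30), "default": (0.10, 0.25)}
_last_alert: dict[tuple[str, str], datetime] = {}   # (feature, slice) -> last alert time

def classify(slice_name: str, psi_value: float) -> str:
    yellow, red = BANDS.get(slice_name, BANDS["default"])
    return "Red" if psi_value >= red else "Yellow" if psi_value >= yellow else "Green"

def should_alert(feature: str, slice_name: str, band: str,
                 now: datetime, cooldown_minutes: int = 60) -> bool:
    """Suppress duplicate alerts within the cool-down window; Green never alerts."""
    if band == "Green":
        return False
    key = (feature, slice_name)
    last = _last_alert.get(key)
    if last is not None and now - last < timedelta(minutes=cooldown_minutes):
        return False
    _last_alert[key] = now
    return True
```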
Reporting & alerts
Emit drift_report.json and drift_metrics.parquet; publish Grafana tiles; send PagerDuty alerts with the top 3 implicated features/slices.
File an “Active Drift” ticket with a playbook link (triage, rollback rules, and data-mining recipe).
Core AWS / Tooling
Airflow, Kinesis, S3/Glue/Athena, Evidently, Great Expectations, OpenTelemetry, Prometheus/AMP, Grafana, QuickSight.
Outputs & Storage
Canonical drift artifacts in S3 (/monitoring/drift/YYYY/MM/DD/…), Glue tables for Athena, alert records in the incident tracker.
Event to #26 Continual Learning Trigger (with a compact spec of what drifted and where).
26) Continual Learning Trigger (Triage → Decide → Specify)¶
When it runs
Fired by a Red/Yellow drift from #25, by performance regressions in canary/production, or by business/product requests (e.g., “construction zones increased; improve precision”).
Nightly “gap analysis” against strategic coverage goals (ensuring rare slices don’t fall below their coverage targets).
Inputs
Drift alert payloads: implicated features/slices, magnitude, duration, example URIs.
Error cohorts from #12 (offline mining) and canary/shadow diffs from #19.
Capacity & budget constraints (GPU hours, labeling budget), plus SLA windows.
Steps
Automated triage
Validate alert (guard against noise): re-compute stats on a fresh window; check seasonality/holiday effects.
Safety assessment: does this slice intersect safety predicates (#28)? If yes, bump severity and enforce stricter timelines.
Decisioning
Choose the path: data curation only (expand training set), threshold/logic change (config flip via #20), model fine-tune, or full retrain.
Estimate expected lift vs. cost: consult historical learning curves per slice.
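One way to put a number on expected lift is to fit a saturating power law to each slice’s historical (sample count, metric) points and extrapolate; the sketch below assumes such history exists per slice, and the functional form and example numbers are purely illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def expected_lift(n_hist, metric_hist, n_extra):
    """Fit metric(n) = a - b * n**(-c) to historical points; predict the gain from n_extra new samples."""
    f = lambda n, a, b, c: a - b * np.power(n, -c)
    (a, b, c), _ = curve_fit(f, n_hist, metric_hist, p0=[0.9, 1.0, 0.5], maxfev=10000)
    n_now = max(n_hist)
    return f(n_now + n_extra, a, b, c) - f(n_now, a, b, c)

# Purely illustrative: a slice went 0.52 -> 0.58 -> 0.61 mAP at 2k/4k/8k samples;
# estimate what +5k more labeled samples might buy before committing labeling budget.
lift = expected_lift([2000, 4000, 8000], [0.52, 0.58, 0.61], 5000)
```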
Spec authoring
Produce a structured continual_learning_trigger.yaml including (a sketch of such a spec follows this list):
Query definition for Scenario Mining (#8) and Vector Index (#7) pulls.
Target label types (which heads, which ontology versions), required volume per slice.
Auto-labeler confidence thresholds and human QA sampling rates (#9/#10).
Training strategy knob (fine-tune vs. from-scratch, loss weights, data sampling ratios).
Gating metrics & minimal win conditions for #16 (eval) and #17 (sim).
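A hedged sketch of what such a spec might contain, written as a Python dict dumped with PyYAML; every field name, path, and value is illustrative rather than a fixed schema.

```python
import yaml

trigger_spec = {
    "trigger_id": "drift-construction-zones-001",        # illustrative ID
    "severity": "Red",
    "slices": ["construction_zone", "night_rain"],
    "mining": {                                           # consumed by #8 / #7
        "scenario_query": "construction AND cones AND lane_shift",
        "vector_index_seeds": "s3://example-bucket/seed_clips.jsonl",   # hypothetical path
        "max_candidates": 50000,
    },
    "labeling": {                                         # consumed by #9 / #10
        "label_types": ["boxes_2d", "lane_topology"],
        "ontology_version": "v2.3",
        "min_volume_per_slice": 8000,
        "autolabel_confidence_min": 0.85,
        "human_qa_sample_rate": 0.10,
    },
    "training": {                                         # consumed by #27
        "strategy": "fine_tune",
        "loss_weights": {"detection_head": 1.5},
        "sampling_ratio_drifted_slices": 0.4,
    },
    "gating": {                                           # consumed by #16 / #17
        "must_improve": {"construction_zone/mAP": "+2.0pp"},
        "must_not_regress": {"global/mAP": "-0.5pp"},
    },
}

with open("continual_learning_trigger.yaml", "w") as f:
    yaml.safe_dump(trigger_spec, f, sort_keys=False)
```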
Stakeholder review
Async approvers (product/safety/platform); deadline based on severity and SLAs.
Kickoff
Emit events to #8/#9/#10 pipeline orchestrators; create a W&B Project run group to track the cycle; open budget in labeling platform.
Core AWS / Tooling
AppSync/GraphQL for dataset queries, OpenSearch + FAISS for similarity pulls, DVC for dataset manifests, Labelbox/Ground Truth for QA setup, W&B for lifecycle tracking.
Outputs & Storage
Versioned continual_learning_trigger.yaml, mined candidate lists, label job IDs, and DVC dataset tags for the forthcoming training data.
Event emitted to #27 (Automated Retraining) once the new data crosses the minimum viable volume/quality.
27) Automated Retraining (Data → Train → Gate → Package)¶
When it runs
Upon readiness signals from #26 (data volume/quality met) or on a scheduled cadence (e.g., weekly retrains with incremental data).
On emergency hot-fix retrains (e.g., severe false positives for emergency vehicles).
Inputs
Curated/labeled datasets from #11 (Golden/Slice Builder) updated per trigger spec.
Previous best model (for fine-tuning) and training configs from #13/#14, plus any new loss weights or augmentations.
Compute plan (nodes, GPUs, instance types), training budget, and time window.
Steps
Data assembly
Pull manifests via DVC with exact git/DVC tag; integrity check counts, class balance, per-slice minimums; run Great Expectations on schema/valid ranges.
Apply the sampling strategy from the trigger: overweight drifted slices while maintaining global distribution constraints; snapshot as dataset_manifest.version.json.
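A minimal sketch of the per-slice minimum check and drifted-slice overweighting, assuming the manifest is a pandas DataFrame with a slice column; the overweight factor and minimums are illustrative.

```python
import pandas as pd

def check_and_weight(manifest: pd.DataFrame, minimums: dict[str, int],
                     drifted: set[str], overweight: float = 2.0) -> pd.DataFrame:
    """Fail fast if any slice is under its minimum, then attach sampling weights."""
    counts = manifest["slice"].value_counts()
    short = {s: int(counts.get(s, 0)) for s, m in minimums.items() if counts.get(s, 0) < m}
    if short:
        raise ValueError(f"slices below minimum volume: {short}")
    # Overweight drifted slices; normalizing keeps the weights usable as sampling
    # probabilities, while the capped overweight factor protects the global distribution.
    w = manifest["slice"].map(lambda s: overweight if s in drifted else 1.0)
    return manifest.assign(sample_weight=w / w.sum())
```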
Training job orchestration
Launch distributed training (PyTorch DDP) on SageMaker or EKS; enable AMP (BF16/FP16), gradient accumulation, and gradient checkpointing as needed.
Curriculum/fine-tune options:
Warm start from best checkpoint; freeze low-level backbone for a stage if compute-bound.
Increase loss weights for target heads/slices; introduce targeted augmentations (fog/rain, motion blur).
Online W&B logging for metrics, LR schedules, and confusion matrices per slice; checkpoint best-of-N by main metric.
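A heavily abridged DDP + AMP fine-tune skeleton covering the options above (warm start, frozen backbone stage, gradient accumulation); build_model, train_loader, the checkpoint path, and the model’s backbone attribute are assumptions, and the FP16/GradScaler path is shown (BF16 usually drops the scaler).

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

ACCUM_STEPS = 4                                           # illustrative gradient-accumulation factor

dist.init_process_group("nccl")                           # launched via torchrun / SageMaker DDP
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model()                                     # assumed model factory
model.load_state_dict(torch.load("best_ckpt.pt", map_location="cpu"))   # warm start from best checkpoint
for p in model.backbone.parameters():                     # optional: freeze low-level backbone stage
    p.requires_grad = False
model = DDP(model.cuda(), device_ids=[local_rank])

optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for step, batch in enumerate(train_loader):               # assumed DataLoader with DistributedSampler
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(batch)                               # assumed to return the slice-weighted loss
    scaler.scale(loss / ACCUM_STEPS).backward()           # gradient accumulation
    if (step + 1) % ACCUM_STEPS == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```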
Auto HPO (optional)
Fire a W&B Sweep for a narrow grid/Bayesian search on a few sensitive hyperparams (LR, augment strength, NMS thresholds) with ASHA early-stop.
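A possible W&B sweep configuration for that narrow search; ASHA itself is a Ray Tune scheduler, so this sketch uses W&B’s built-in hyperband early termination as the closest analogue, and train is assumed to be the existing training entry point. The project name and metric key are illustrative.

```python
import wandb

sweep_config = {
    "method": "bayes",                                    # narrow Bayesian search
    "metric": {"name": "val/mAP", "goal": "maximize"},
    "parameters": {
        "lr":               {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-3},
        "augment_strength": {"values": [0.5, 1.0, 1.5]},
        "nms_threshold":    {"values": [0.45, 0.5, 0.55, 0.6]},
    },
    # Hyperband-style early termination of unpromising trials.
    "early_terminate": {"type": "hyperband", "min_iter": 3},
}

sweep_id = wandb.sweep(sweep_config, project="drift-remediation")
wandb.agent(sweep_id, function=train, count=20)           # cap the number of trials
```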
Gating & export
Evaluate on held-out and safety slices (#16); block if performance regresses by more than allowed deltas on guarded slices.
If green, run Packaging/Export (#15) for the winning checkpoint to produce TensorRT/ONNX/TorchScript packs.
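A minimal sketch of the guarded-slice gate, assuming candidate and baseline metrics have been loaded as flat dicts from the #16 evaluation artifacts; the metric names and allowed deltas are illustrative.

```python
# Allowed regression per guarded metric (0.0 means no regression tolerated).
ALLOWED_DELTAS = {"global/mAP": -0.005, "night_rain/mAP": 0.0, "ped_crossing/recall": 0.0}

def gate(candidate: dict[str, float], baseline: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passed, reasons); block promotion if any guarded slice regresses too far."""
    failures = []
    for metric, allowed in ALLOWED_DELTAS.items():
        delta = candidate[metric] - baseline[metric]
        if delta < allowed:
            failures.append(f"{metric}: {delta:+.4f} < allowed {allowed:+.4f}")
    return (len(failures) == 0), failures

ok, reasons = gate(candidate_report, baseline_report)     # dicts loaded from eval artifacts (assumed)
if not ok:
    raise SystemExit("Gate failed:\n" + "\n".join(reasons))
```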
Artifact hygiene
Write training_report.json, checkpoints, and W&B run refs; push model packs to S3 + ECR; update the Model Card (data footprint deltas, known limitations).
Handoff
Notify #19 to begin canary/shadow; attach drift remediation rationale to the promotion ticket (#18).
Core AWS / Tooling
SageMaker Training / EKS, FSx for Lustre (optional staging), W&B (runs & sweeps), DVC, Great Expectations, TensorRT/ONNX, Triton.
Outputs & Storage
Versioned trained models, export packs, and full lineage (dataset tags → code SHA → run ID).
Evaluation artifacts ready for #16/#17; promotion request stub pre-filled.
28) Testing in Production (Safety Predicates & Runtime Guards)¶
When it runs
Always-on in shadow and canary paths (hard gates), and in full production (soft/strict gates depending on predicate).
Updated whenever the predicate library or operating boundaries change.
Inputs
Live model outputs (boxes, masks, tracks, trajectories, confidences), ego and CAN/IMU telemetry, environmental context (weather/map tags), and historical baseline stats from #16/#17 for bounds.
Predicate library: a versioned set of rules/constraints derived from safety analysis, simulation studies, and regulatory requirements.
Steps
Predicate design & encoding
Express predicates as declarative policies (e.g., OPA/Rego or a domain-specific ruleset) with thresholds configurable per slice:
Geometric sanity: boxes within FOV, plausible aspect ratios/areas, non-negative depths.
Temporal consistency: max per-frame change in drivable area; track acceleration and jerk within physical bounds; ID switch caps.
Cross-sensor agreement: camera vs. LiDAR consensus; veto if strong disagreement persists for N frames.
Planner consistency proxies: a large divergence between the perception-based risk map and the planner’s dynamic constraints flags a violation.
Uncertainty guard: abstain or degrade gracefully when confidence falls below a calibrated floor.
Performance guard: if p99 latency > budget, reduce batch size or enable the fallback model.
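The predicates themselves are encoded as OPA/Rego or DSL policies; purely as an illustration of the logic, here is a small in-process Python sketch of two of them (geometric sanity and the drivable-area temporal cap) with illustrative thresholds and image dimensions.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x: float
    y: float
    w: float
    h: float
    depth: float

# Illustrative thresholds; in production these are hot-reloadable policy config per slice.
MAX_ASPECT = 6.0
MAX_DRIVABLE_DELTA = 0.15          # max allowed per-frame change in drivable-area fraction
IMG_W, IMG_H = 1920, 1080

def geometric_sanity(box: Box) -> list[str]:
    """Return the predicate IDs violated by a single detection."""
    violations = []
    if not (0 <= box.x and box.x + box.w <= IMG_W and 0 <= box.y and box.y + box.h <= IMG_H):
        violations.append("box_outside_fov")
    if box.w <= 0 or box.h <= 0 or max(box.w / box.h, box.h / box.w) > MAX_ASPECT:
        violations.append("implausible_aspect_ratio")
    if box.depth < 0:
        violations.append("negative_depth")
    return violations

def temporal_consistency(drivable_frac_prev: float, drivable_frac_now: float) -> list[str]:
    """Flag implausibly large frame-to-frame jumps in drivable area."""
    if abs(drivable_frac_now - drivable_frac_prev) > MAX_DRIVABLE_DELTA:
        return ["drivable_area_jump"]
    return []
```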
Deployment
Run the predicate engine as a sidecar or in-process filter before responses are accepted by downstream consumers.
For edge, keep a tiny, deterministic rules engine with bounded memory/CPU; for cloud, use OPA sidecar with hot-reloadable policies.
Action modes
Shadow: purely record violations; do not affect caller output; route samples to S3 for audit/mining.
Canary/Prod:
Soft-gate: log + emit warning headers; allow response.
Hard-gate: replace response with safe fallback (prior model, rule-based heuristic, or abstention code) and flag the event.
Kill-switch: automatic rollback trigger (revert traffic weights or switch to the previous production model) if violations exceed M occurrences in T minutes in any protected slice.
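A minimal sliding-window sketch of the kill-switch condition, where max_violations and window_s stand in for M and T; the default values and slice names are illustrative.

```python
from collections import deque
import time

class KillSwitch:
    """Trip if more than max_violations occur within window_s seconds for a protected slice."""

    def __init__(self, max_violations: int = 50, window_s: float = 600.0):
        self.max_violations = max_violations
        self.window_s = window_s
        self.events: dict[str, deque] = {}

    def record(self, slice_name: str, now: float | None = None) -> bool:
        """Record one violation; return True when the rollback should fire."""
        now = time.time() if now is None else now
        q = self.events.setdefault(slice_name, deque())
        q.append(now)
        while q and now - q[0] > self.window_s:           # drop events outside the window
            q.popleft()
        return len(q) > self.max_violations               # True -> trigger rollback (e.g., via EventBridge)

# Usage sketch: if ks.record("ped_crossing_night"): revert traffic weights / switch to previous model.
```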
Calibration & audits
Periodically validate predicate hit rates with ground truth from #10; adjust thresholds to minimize false alarms without missing true hazards.
Run “predicate regression tests” from the simulation library (#17) as part of the promotion pipeline.
Forensics & feedback
Each violation stores a forensic bundle: request, outputs, predicate IDs hit, policy version, and traces; redact PII; store in S3 under /safety/violations/YYYY/MM/DD.
Generate safety dashboards (violation types over time/slices) and weekly audit packs; file tickets for systemic issues.
Emit mining specs to #12 for targeted data collection (e.g., “night-rain ped crossings where cross-sensor agreement < τ”).
Core AWS / Tooling
EKS sidecars (OPA), CloudWatch Logs/Alarms, Athena + QuickSight for safety analytics, EventBridge to trigger rollbacks, App Mesh/Istio for circuit-breakers/fallback routing.
Outputs & Storage
Predicate decision logs, violation bundles, policy versions, and automated rollback events.
Feedback artifacts (mining specs) feeding the next data → training loop.