Monitoring & Continual Learning¶
¶
25) Drift Detection (Data, Prediction, Concept, Performance)¶
When it runs
Stream: near-real-time sliding windows (e.g., 15-minute / 1-hour) over online inference topics for early warning.
Batch: scheduled Airflow jobs (hourly/daily/weekly) to produce canonical drift reports and trend lines.
On-demand: after a deployment, during canary, or when SLOs are breached (p95 latency, error rate) or OOD counters spike.
Inputs
Online telemetry: request schemas, feature snapshots (PII-scrubbed), model outputs (scores/boxes/masks/uncertainty), per-request timing + resource metrics.
Reference baselines: per-slice statistics frozen at promotion time from #16 (evaluation) and “healthy” historical windows (seasonal references).
Label trickle: a small, delayed stream of ground truth from #10 (human QA) and #9 (auto-label confirmations) to estimate concept drift where possible.
Steps
Collection & privacy
Tap inference logs via OpenTelemetry exporters; route to Kinesis → S3 (Parquet); strip or hash identifiers; mask PII fields at the edge.
Maintain a feature dictionary (name, type, valid range, unit) in Glue Data Catalog; enforce with validators.
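A minimal sketch of feature-dictionary enforcement, assuming the dictionary has been materialized as a Python mapping (in production it lives in the Glue Data Catalog) and that telemetry arrives as a pandas DataFrame; the field names and ranges below are illustrative, not the real schema.

```python
import pandas as pd

# Hypothetical feature dictionary (name -> type / valid range); in production this
# would be read from the Glue Data Catalog rather than hard-coded.
FEATURE_DICT = {
    "ego_speed_mps":   {"dtype": "float64", "min": 0.0, "max": 70.0},
    "camera_exposure": {"dtype": "float64", "min": 0.0, "max": 1.0},
    "weather_tag":     {"dtype": "object",  "allowed": {"clear", "rain", "fog", "snow"}},
}

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return human-readable violations for one telemetry window."""
    problems = []
    for name, spec in FEATURE_DICT.items():
        if name not in df.columns:
            problems.append(f"missing required field: {name}")
            continue
        col = df[name]
        if str(col.dtype) != spec["dtype"]:
            problems.append(f"{name}: dtype {col.dtype} != {spec['dtype']}")
        if "min" in spec and (col.dropna() < spec["min"]).any():
            problems.append(f"{name}: values below {spec['min']}")
        if "max" in spec and (col.dropna() > spec["max"]).any():
            problems.append(f"{name}: values above {spec['max']}")
        if "allowed" in spec and not set(col.dropna().unique()) <= spec["allowed"]:
            problems.append(f"{name}: unexpected categories")
        # Missingness change is tracked separately against the frozen baseline.
    return problems
```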
Drift computations
Schema drift: required fields present; types/ranges; missingness change (Great Expectations).
Covariate drift (inputs):
Numeric: PSI, KS/AD tests, distributional distance; maintain rolling means/variances and quantiles (a PSI sketch follows this list).
Categorical: population stability, χ² tests; top-k category churn.
Temporal: autocorrelation changes; seasonality break detection (CUSUM on aggregates).
Prediction drift:
Score histograms per class; calibration shift (ECE, Brier); acceptance/abstention rate drift.
Spatial: IoU of drivable-area segmentation vs. the last known stable output; box geometry sanity (aspect ratio, area) distributions.
Temporal: track fragmentation, ID-switch rates, latency correlation with confidence.
Concept drift (where labels available): prequential error rates, sliding-window AUC/mAP/mIoU; DDM/ADWIN style changepoint detectors.
OOD/uncertainty sentinels: ensemble disagreement, Mahalanobis distance in penultimate embeddings; maintain per-slice OOD counters.
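A minimal PSI sketch for the numeric covariate-drift check above, assuming bin edges are derived from the frozen reference window; the Green/Yellow/Red cut-offs in the comment are common rules of thumb, not calibrated per-slice thresholds.

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a frozen reference window and a live window."""
    # Interior cut points from reference quantiles, so each reference bin holds ~equal mass.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_frac = np.bincount(np.searchsorted(edges, reference), minlength=bins) / len(reference) + eps
    cur_frac = np.bincount(np.searchsorted(edges, current), minlength=bins) / len(current) + eps
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Common rule-of-thumb bands (calibrate per slice in practice):
# PSI < 0.1 -> Green, 0.1-0.25 -> Yellow, > 0.25 -> Red.
```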
Attribution & slicing
Always compute by critical slices (weather, time, geography, road type, sensor subset) and cohorts (fleet/customer).
Root-cause heuristics: feature importance on drift indicators (e.g., Shapley on “drift vs no-drift” classification).
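One hedged way to implement the root-cause heuristic: train a classifier to distinguish reference-window rows from current-window rows and rank features by importance. The sketch below assumes numeric/encoded feature snapshots and uses sklearn permutation importance as a stand-in; Shapley values via the shap package are a drop-in refinement.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def rank_drifting_features(reference: pd.DataFrame, current: pd.DataFrame, top_k: int = 3):
    """Train a 'which window is this row from?' classifier; its important features explain the drift."""
    X = pd.concat([reference, current], ignore_index=True)
    y = np.r_[np.zeros(len(reference)), np.ones(len(current))]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    # If the classifier cannot separate the two windows, there is little multivariate drift.
    imp = permutation_importance(clf, X_te, y_te, n_repeats=5, random_state=0)
    ranked = sorted(zip(X.columns, imp.importances_mean), key=lambda t: -t[1])
    return ranked[:top_k]  # Shapley values (shap.TreeExplainer) are a drop-in refinement.
```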
Thresholding & governance
Severity bands: Green (noise), Yellow (monitor), Red (action). Calibrate thresholds per slice to avoid alert fatigue.
Dedup & cool-down: suppress duplicate alerts within a lookback window; escalate if duration exceeds T.
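A minimal sketch of the severity-band and cool-down logic; the thresholds, cool-down length, and slice names are illustrative placeholders.

```python
from datetime import datetime, timedelta

# Illustrative per-slice PSI thresholds (yellow, red); real values are calibrated per slice.
BANDS = {"night_rain": (0.15, 0.30), "default": (0.10, 0.25)}
_last_alert: dict[tuple[str, str], datetime] = {}   # (feature, slice) -> last alert time

def classify(slice_name: str, psi_value: float) -> str:
    yellow, red = BANDS.get(slice_name, BANDS["default"])
    return "Red" if psi_value >= red else "Yellow" if psi_value >= yellow else "Green"

def should_alert(feature: str, slice_name: str, band: str,
                 now: datetime, cooldown_minutes: int = 60) -> bool:
    """Suppress duplicate alerts within the cool-down window; Green never alerts."""
    if band == "Green":
        return False
    key = (feature, slice_name)
    last = _last_alert.get(key)
    if last is not None and now - last < timedelta(minutes=cooldown_minutes):
        return False
    _last_alert[key] = now
    return True
```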
Reporting & alerts
Emit drift_report.json and drift_metrics.parquet; publish Grafana tiles; send PagerDuty alerts with the top 3 implicated features/slices.
File an “Active Drift” ticket with a playbook link (triage, rollback rules, and data-mining recipe).
Core AWS / Tooling
Airflow, Kinesis, S3/Glue/Athena, Evidently, Great Expectations, OpenTelemetry, Prometheus/AMP, Grafana, QuickSight.
Outputs & Storage
Canonical drift artifacts in S3 (/monitoring/drift/YYYY/MM/DD/…), Glue tables for Athena, alert records in the incident tracker.
Event to #26 Continual Learning Trigger (with a compact spec of what drifted and where).
26) Continual Learning Trigger (Triage → Decide → Specify)¶
When it runs
Fired by a Red/Yellow drift from #25, by performance regressions in canary/production, or by business/product requests (e.g., “construction zones increased; improve precision”).
Nightly “gap analysis” against strategic coverage goals (ensuring rare slices don’t fall below their coverage targets).
Inputs
Drift alert payloads: implicated features/slices, magnitude, duration, example URIs.
Error cohorts from #12 (offline mining) and canary/shadow diffs from #19.
Capacity & budget constraints (GPU hours, labeling budget), plus SLA windows.
Steps
Automated triage
Validate alert (guard against noise): re-compute stats on a fresh window; check seasonality/holiday effects.
Safety assessment: does this slice intersect safety predicates (#28)? If yes, bump severity and enforce stricter timelines.
Decisioning
Choose the path: data curation only (expand training set), threshold/logic change (config flip via #20), model fine-tune, or full retrain.
Estimate expected lift vs. cost: consult historical learning curves per slice.
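One way to put a number on expected lift is to fit a saturating power law to each slice’s historical (sample count, metric) points and extrapolate; the sketch below assumes such history exists per slice, and the functional form and example numbers are purely illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def expected_lift(n_hist, metric_hist, n_extra):
    """Fit metric(n) = a - b * n**(-c) to historical points; predict the gain from n_extra new samples."""
    f = lambda n, a, b, c: a - b * np.power(n, -c)
    (a, b, c), _ = curve_fit(f, n_hist, metric_hist, p0=[0.9, 1.0, 0.5], maxfev=10000)
    n_now = max(n_hist)
    return f(n_now + n_extra, a, b, c) - f(n_now, a, b, c)

# Purely illustrative: a slice went 0.52 -> 0.58 -> 0.61 mAP at 2k/4k/8k samples;
# estimate what +5k more labeled samples might buy before committing labeling budget.
lift = expected_lift([2000, 4000, 8000], [0.52, 0.58, 0.61], 5000)
```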
Spec authoring
Produce a structured continual_learning_trigger.yaml including (a sketch of such a spec follows this list):
Query definition for Scenario Mining (#8) and Vector Index (#7) pulls.
Target label types (which heads, which ontology versions), required volume per slice.
Auto-labeler confidence thresholds and human QA sampling rates (#9/#10).
Training strategy knob (fine-tune vs. from-scratch, loss weights, data sampling ratios).
Gating metrics & minimal win conditions for #16 (eval) and #17 (sim).
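A hedged sketch of what such a spec might contain, written as a Python dict dumped with PyYAML; every field name, path, and value is illustrative rather than a fixed schema.

```python
import yaml

trigger_spec = {
    "trigger_id": "drift-construction-zones-001",        # illustrative ID
    "severity": "Red",
    "slices": ["construction_zone", "night_rain"],
    "mining": {                                           # consumed by #8 / #7
        "scenario_query": "construction AND cones AND lane_shift",
        "vector_index_seeds": "s3://example-bucket/seed_clips.jsonl",   # hypothetical path
        "max_candidates": 50000,
    },
    "labeling": {                                         # consumed by #9 / #10
        "label_types": ["boxes_2d", "lane_topology"],
        "ontology_version": "v2.3",
        "min_volume_per_slice": 8000,
        "autolabel_confidence_min": 0.85,
        "human_qa_sample_rate": 0.10,
    },
    "training": {                                         # consumed by #27
        "strategy": "fine_tune",
        "loss_weights": {"detection_head": 1.5},
        "sampling_ratio_drifted_slices": 0.4,
    },
    "gating": {                                           # consumed by #16 / #17
        "must_improve": {"construction_zone/mAP": "+2.0pp"},
        "must_not_regress": {"global/mAP": "-0.5pp"},
    },
}

with open("continual_learning_trigger.yaml", "w") as f:
    yaml.safe_dump(trigger_spec, f, sort_keys=False)
```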
Stakeholder review
Async approvers (product/safety/platform); deadline based on severity and SLAs.
Kickoff
Emit events to #8/#9/#10 pipeline orchestrators; create a W&B Project run group to track the cycle; open budget in labeling platform.
Core AWS / Tooling
AppSync/GraphQL for dataset queries, OpenSearch + FAISS for similarity pulls, DVC for dataset manifests, Labelbox/Ground Truth for QA setup, W&B for lifecycle tracking.
Outputs & Storage
Versioned continual_learning_trigger.yaml, mined candidate lists, label job IDs, and DVC dataset tags for the forthcoming training data.
Event emitted to #27 (Automated Retraining) once the new data crosses the minimum viable volume/quality.
27) Automated Retraining (Data → Train → Gate → Package)¶
When it runs
Upon readiness signals from #26 (data volume/quality met) or on a scheduled cadence (e.g., weekly retrains with incremental data).
On emergency hot-fix retrains (e.g., severe false positives for emergency vehicles).
Inputs
Curated/labeled datasets from #11 (Golden/Slice Builder) updated per trigger spec.
Previous best model (for fine-tuning) and training configs from #13/#14, plus any new loss weights or augmentations.
Compute plan (nodes, GPUs, instance types), training budget, and time window.
Steps
Data assembly
Pull manifests via DVC with exact git/DVC tag; integrity check counts, class balance, per-slice minimums; run Great Expectations on schema/valid ranges.
Apply the sampling strategy from the trigger: overweight drifted slices while maintaining global distribution constraints; snapshot as dataset_manifest.version.json.
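A minimal sketch of the per-slice minimum check and drifted-slice overweighting, assuming the manifest is a pandas DataFrame with a slice column; the overweight factor and minimums are illustrative.

```python
import pandas as pd

def check_and_weight(manifest: pd.DataFrame, minimums: dict[str, int],
                     drifted: set[str], overweight: float = 2.0) -> pd.DataFrame:
    """Fail fast if any slice is under its minimum, then attach sampling weights."""
    counts = manifest["slice"].value_counts()
    short = {s: int(counts.get(s, 0)) for s, m in minimums.items() if counts.get(s, 0) < m}
    if short:
        raise ValueError(f"slices below minimum volume: {short}")
    # Overweight drifted slices; normalizing keeps the weights usable as sampling
    # probabilities, while the capped overweight factor protects the global distribution.
    w = manifest["slice"].map(lambda s: overweight if s in drifted else 1.0)
    return manifest.assign(sample_weight=w / w.sum())
```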
Training job orchestration
Launch distributed training (PyTorch DDP) on SageMaker or EKS; enable AMP (BF16/FP16), gradient accumulation, and gradient checkpointing as needed.
Curriculum/fine-tune options:
Warm start from best checkpoint; freeze low-level backbone for a stage if compute-bound.
Increase loss weights for target heads/slices; introduce targeted augmentations (fog/rain, motion blur).
Online W&B logging for metrics, LR schedules, and confusion matrices per slice; checkpoint best-of-N by main metric.
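A heavily abridged DDP + AMP fine-tune skeleton covering the options above (warm start, frozen backbone stage, gradient accumulation); build_model, train_loader, the checkpoint path, and the model’s backbone attribute are assumptions, and the FP16/GradScaler path is shown (BF16 usually drops the scaler).

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

ACCUM_STEPS = 4                                           # illustrative gradient-accumulation factor

dist.init_process_group("nccl")                           # launched via torchrun / SageMaker DDP
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model()                                     # assumed model factory
model.load_state_dict(torch.load("best_ckpt.pt", map_location="cpu"))   # warm start from best checkpoint
for p in model.backbone.parameters():                     # optional: freeze low-level backbone stage
    p.requires_grad = False
model = DDP(model.cuda(), device_ids=[local_rank])

optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for step, batch in enumerate(train_loader):               # assumed DataLoader with DistributedSampler
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(batch)                               # assumed to return the slice-weighted loss
    scaler.scale(loss / ACCUM_STEPS).backward()           # gradient accumulation
    if (step + 1) % ACCUM_STEPS == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```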
Auto HPO (optional)
Fire a W&B Sweep for a narrow grid/Bayesian search on a few sensitive hyperparams (LR, augment strength, NMS thresholds) with ASHA early-stop.
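A possible W&B sweep configuration for that narrow search; ASHA itself is a Ray Tune scheduler, so this sketch uses W&B’s built-in hyperband early termination as the closest analogue, and train is assumed to be the existing training entry point. The project name and metric key are illustrative.

```python
import wandb

sweep_config = {
    "method": "bayes",                                    # narrow Bayesian search
    "metric": {"name": "val/mAP", "goal": "maximize"},
    "parameters": {
        "lr":               {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-3},
        "augment_strength": {"values": [0.5, 1.0, 1.5]},
        "nms_threshold":    {"values": [0.45, 0.5, 0.55, 0.6]},
    },
    # Hyperband-style early termination of unpromising trials.
    "early_terminate": {"type": "hyperband", "min_iter": 3},
}

sweep_id = wandb.sweep(sweep_config, project="drift-remediation")
wandb.agent(sweep_id, function=train, count=20)           # cap the number of trials
```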
Gating & export
Evaluate on held-out and safety slices (#16); block if performance regresses by more than allowed deltas on guarded slices.
If green, run Packaging/Export (#15) for the winning checkpoint to produce TensorRT/ONNX/TorchScript packs.
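A minimal sketch of the guarded-slice gate, assuming candidate and baseline metrics have been loaded as flat dicts from the #16 evaluation artifacts; the metric names and allowed deltas are illustrative.

```python
# Allowed regression per guarded metric (0.0 means no regression tolerated).
ALLOWED_DELTAS = {"global/mAP": -0.005, "night_rain/mAP": 0.0, "ped_crossing/recall": 0.0}

def gate(candidate: dict[str, float], baseline: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passed, reasons); block promotion if any guarded slice regresses too far."""
    failures = []
    for metric, allowed in ALLOWED_DELTAS.items():
        delta = candidate[metric] - baseline[metric]
        if delta < allowed:
            failures.append(f"{metric}: {delta:+.4f} < allowed {allowed:+.4f}")
    return (len(failures) == 0), failures

ok, reasons = gate(candidate_report, baseline_report)     # dicts loaded from eval artifacts (assumed)
if not ok:
    raise SystemExit("Gate failed:\n" + "\n".join(reasons))
```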
Artifact hygiene
Write training_report.json, checkpoints, and W&B run refs; push model packs to S3 + ECR; update the Model Card (data footprint deltas, known limitations).
Handoff
Notify #19 to begin canary/shadow; attach drift remediation rationale to the promotion ticket (#18).
Core AWS / Tooling
SageMaker Training / EKS, FSx for Lustre (optional staging), W&B (runs & sweeps), DVC, Great Expectations, TensorRT/ONNX, Triton.
Outputs & Storage
Versioned trained models, export packs, and full lineage (dataset tags → code SHA → run ID).
Evaluation artifacts ready for #16/#17; promotion request stub pre-filled.
28) Testing in Production (Safety Predicates & Runtime Guards)¶
When it runs
Always-on in shadow and canary paths (hard gates), and in full production (soft/strict gates depending on predicate).
Updated whenever the predicate library or operating boundaries change.
Inputs
Live model outputs (boxes, masks, tracks, trajectories, confidences), ego and CAN/IMU telemetry, environmental context (weather/map tags), and historical baseline stats from #16/#17 for bounds.
Predicate library: a versioned set of rules/constraints derived from safety analysis, simulation studies, and regulatory requirements.
Steps
Predicate design & encoding
Express predicates as declarative policies (e.g., OPA/Rego or a domain-specific ruleset) with thresholds configurable per slice:
Geometric sanity: boxes within FOV, plausible aspect ratios/areas, non-negative depths.
Temporal consistency: max per-frame change in drivable area; track acceleration and jerk within physical bounds; ID switch caps.
Cross-sensor agreement: camera vs. LiDAR consensus; veto if strong disagreement persists for N frames.
Planner consistency proxies: a large divergence between the perception-based risk map and the planner’s dynamic constraints flags a violation.
Uncertainty guard: abstain or degrade gracefully when confidence falls below a calibrated floor.
Performance guard: if p99 latency > budget, reduce batch size or enable the fallback model.
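The predicates themselves are encoded as OPA/Rego or DSL policies; purely as an illustration of the logic, here is a small in-process Python sketch of two of them (geometric sanity and the drivable-area temporal cap) with illustrative thresholds and image dimensions.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x: float
    y: float
    w: float
    h: float
    depth: float

# Illustrative thresholds; in production these are hot-reloadable policy config per slice.
MAX_ASPECT = 6.0
MAX_DRIVABLE_DELTA = 0.15          # max allowed per-frame change in drivable-area fraction
IMG_W, IMG_H = 1920, 1080

def geometric_sanity(box: Box) -> list[str]:
    """Return the predicate IDs violated by a single detection."""
    violations = []
    if not (0 <= box.x and box.x + box.w <= IMG_W and 0 <= box.y and box.y + box.h <= IMG_H):
        violations.append("box_outside_fov")
    if box.w <= 0 or box.h <= 0 or max(box.w / box.h, box.h / box.w) > MAX_ASPECT:
        violations.append("implausible_aspect_ratio")
    if box.depth < 0:
        violations.append("negative_depth")
    return violations

def temporal_consistency(drivable_frac_prev: float, drivable_frac_now: float) -> list[str]:
    """Flag implausibly large frame-to-frame jumps in drivable area."""
    if abs(drivable_frac_now - drivable_frac_prev) > MAX_DRIVABLE_DELTA:
        return ["drivable_area_jump"]
    return []
```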
Deployment
Run the predicate engine as a sidecar or in-process filter before responses are accepted by downstream consumers.
For edge, keep a tiny, deterministic rules engine with bounded memory/CPU; for cloud, use OPA sidecar with hot-reloadable policies.
Action modes
Shadow: purely record violations; do not affect caller output; route samples to S3 for audit/mining.
Canary/Prod:
Soft-gate: log + emit warning headers; allow response.
Hard-gate: replace response with safe fallback (prior model, rule-based heuristic, or abstention code) and flag the event.
Kill-switch: automatic rollback trigger (revert traffic weights or switch to the previous production model) if violations exceed M occurrences in T minutes in any protected slice.
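A minimal sliding-window sketch of the kill-switch condition, where max_violations and window_s stand in for M and T; the default values and slice names are illustrative.

```python
from collections import deque
import time

class KillSwitch:
    """Trip if more than max_violations occur within window_s seconds for a protected slice."""

    def __init__(self, max_violations: int = 50, window_s: float = 600.0):
        self.max_violations = max_violations
        self.window_s = window_s
        self.events: dict[str, deque] = {}

    def record(self, slice_name: str, now: float | None = None) -> bool:
        """Record one violation; return True when the rollback should fire."""
        now = time.time() if now is None else now
        q = self.events.setdefault(slice_name, deque())
        q.append(now)
        while q and now - q[0] > self.window_s:           # drop events outside the window
            q.popleft()
        return len(q) > self.max_violations               # True -> trigger rollback (e.g., via EventBridge)

# Usage sketch: if ks.record("ped_crossing_night"): revert traffic weights / switch to previous model.
```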
Calibration & audits
Periodically validate predicate hit rates with ground truth from #10; adjust thresholds to minimize false alarms without missing true hazards.
Run “predicate regression tests” from the simulation library (#17) as part of the promotion pipeline.
Forensics & feedback
Each violation stores a forensic bundle: request, outputs, predicate IDs hit, policy version, and traces; redact PII; store in S3 under /safety/violations/YYYY/MM/DD.
Generate safety dashboards (violation types over time/slices) and weekly audit packs; file tickets for systemic issues.
Emit mining specs to #12 for targeted data collection (e.g., “night-rain ped crossings where cross-sensor agreement < τ”).
Core AWS / Tooling
EKS sidecars (OPA), CloudWatch Logs/Alarms, Athena + QuickSight for safety analytics, EventBridge to trigger rollbacks, App Mesh/Istio for circuit-breakers/fallback routing.
Outputs & Storage
Predicate decision logs, violation bundles, policy versions, and automated rollback events.
Feedback artifacts (mining specs) feeding the next data → training loop.