Monitoring & Continual Learning

25) Drift Detection (Data, Prediction, Concept, Performance)

  • When it runs

    • Stream: near-real-time sliding windows (e.g., 15-minute / 1-hour) over online inference topics for early warning.

    • Batch: scheduled Airflow jobs (hourly/daily/weekly) to produce canonical drift reports and trend lines.

    • On-demand: after a deployment, during canary, or when SLOs are breached (p95 latency, error rate) or OOD counters spike.

  • Inputs

    • Online telemetry: request schemas, feature snapshots (PII-scrubbed), model outputs (scores/boxes/masks/uncertainty), per-request timing + resource metrics.

    • Reference baselines: per-slice statistics frozen at promotion time from #16 (evaluation) and “healthy” historical windows (seasonal references).

    • Label trickle: a small, delayed stream of ground truth from #10 (human QA) and #9 (auto-label confirmations) to estimate concept drift where possible.

  • Steps

    • Collection & privacy

      • Tap inference logs via OpenTelemetry exporters; route through Kinesis to S3 (Parquet); strip or hash identifiers; mask PII fields at the edge.

      • Maintain a feature dictionary (name, type, valid range, unit) in Glue Data Catalog; enforce with validators.

    • Drift computations

      • Schema drift: required fields present; types/ranges; missingness change (Great Expectations).

      • Covariate drift (inputs):

        • Numeric: PSI, KS/AD tests, distribution distances (e.g., Wasserstein or Jensen–Shannon); maintain rolling means/variances and quantiles (a PSI sketch follows this list).

        • Categorical: population stability, χ² tests; top-k category churn.

        • Temporal: autocorrelation changes; seasonality break detection (CUSUM on aggregates).

      • Prediction drift:

        • Score histograms per class; calibration shift (ECE, Brier); acceptance/abstention rate drift.

        • Spatial: IoU of drivable-area segmentation against the last known stable baseline; distributions of box-geometry sanity checks (aspect ratio, area).

        • Temporal: track fragmentation, ID-switch rates, latency correlation with confidence.

      • Concept drift (where labels available): prequential error rates, sliding-window AUC/mAP/mIoU; DDM/ADWIN style changepoint detectors.

      • OOD/uncertainty sentinels: ensemble disagreement, Mahalanobis distance in penultimate embeddings; maintain per-slice OOD counters.
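
      As a concrete example of the numeric covariate checks listed above, a minimal PSI + KS computation against a frozen reference window might look like the sketch below (bin count, thresholds, and the synthetic data are illustrative):

      ```python
      import numpy as np
      from scipy import stats

      def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
          """Population Stability Index of a live window vs. a frozen reference window."""
          # Quantile bin edges from the reference distribution are robust to outliers.
          edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))
          ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
          # Clip the live window into the reference range so no mass falls outside the bins.
          cur_frac = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)[0] / len(current)
          # A small floor avoids log(0) for empty bins.
          ref_frac = np.clip(ref_frac, 1e-6, None)
          cur_frac = np.clip(cur_frac, 1e-6, None)
          return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

      # Stand-ins for the promotion-time baseline and a 1-hour online window.
      reference = np.random.normal(0.0, 1.0, 50_000)
      current = np.random.normal(0.3, 1.1, 10_000)
      ks_stat, ks_p = stats.ks_2samp(reference, current)
      print(f"PSI={psi(reference, current):.3f}  KS={ks_stat:.3f} (p={ks_p:.2g})")
      # Common rule of thumb (calibrate per slice): PSI < 0.1 green, 0.1-0.25 yellow, > 0.25 red.
      ```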

    • Attribution & slicing

      • Always compute by critical slices (weather, time, geography, road type, sensor subset) and cohorts (fleet/customer).

      • Root-cause heuristics: feature importance on drift indicators (e.g., Shapley values from a “drift vs. no-drift” classifier).

    • Thresholding & governance

      • Severity bands: Green (noise), Yellow (monitor), Red (action). Calibrate thresholds per slice to avoid alert fatigue.

      • Dedup & cool-down: suppress duplicate alerts within a lookback window; escalate if duration exceeds T.
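
      A sketch of the banding plus dedup/cool-down logic is below; the band thresholds and cool-down window are illustrative and must be calibrated per slice:

      ```python
      import time

      # Illustrative per-slice severity bands; "default" covers slices without their own calibration.
      BANDS = {"night_rain": {"yellow": 0.10, "red": 0.25}, "default": {"yellow": 0.15, "red": 0.30}}
      COOLDOWN_S = 6 * 3600                       # suppress duplicate alerts for 6 hours
      _last_alert: dict[tuple[str, str], float] = {}

      def severity(slice_name: str, drift_score: float) -> str:
          bands = BANDS.get(slice_name, BANDS["default"])
          if drift_score >= bands["red"]:
              return "red"
          if drift_score >= bands["yellow"]:
              return "yellow"
          return "green"

      def should_alert(slice_name: str, feature: str, drift_score: float) -> bool:
          """Only Yellow/Red findings outside the cool-down window raise an alert."""
          if severity(slice_name, drift_score) == "green":
              return False
          key = (slice_name, feature)
          now = time.time()
          if now - _last_alert.get(key, 0.0) < COOLDOWN_S:
              return False                        # duplicate within the lookback window: suppress
          _last_alert[key] = now
          return True
      ```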

    • Reporting & alerts

      • Emit drift_report.json and drift_metrics.parquet; publish Grafana tiles; send PagerDuty alerts with the top 3 implicated features/slices.

      • File an “Active Drift” ticket with a playbook link (triage, rollback rules, and data-mining recipe).
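
      An illustrative shape for the emitted drift_report.json; the keys, URIs, and values are examples rather than a fixed schema:

      ```python
      import json
      from datetime import datetime, timezone

      report = {
          "window": {"start": "2024-06-01T00:00:00Z", "end": "2024-06-01T01:00:00Z"},
          "baseline_ref": "s3://example-bucket/monitoring/baselines/model-v42/",   # hypothetical URI
          "findings": [
              {
                  "feature": "lidar_point_density",
                  "slice": "night_rain",
                  "metric": "psi",
                  "value": 0.31,
                  "severity": "red",
                  "example_uris": ["s3://example-bucket/samples/abc123.parquet"],  # hypothetical
              }
          ],
          "generated_at": datetime.now(timezone.utc).isoformat(),
      }

      with open("drift_report.json", "w") as f:
          json.dump(report, f, indent=2)
      ```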

  • Core AWS / Tooling

    • Airflow, Kinesis, S3/Glue/Athena, Evidently, Great Expectations, OpenTelemetry, Prometheus/AMP, Grafana, QuickSight.

  • Outputs & Storage

    • Canonical drift artifacts in S3 (/monitoring/drift/YYYY/MM/DD/…), Glue tables for Athena, alert records in incident tracker.

    • Event to #26 Continual Learning Trigger (with a compact spec of what and where drifted).


26) Continual Learning Trigger (Triage → Decide → Specify)

  • When it runs

    • Fired by a Red/Yellow drift alert from #25, by performance regressions in canary/production, or by business/product requests (e.g., “construction zones increased; improve precision”).

    • Nightly “gap analysis” against strategic coverage goals (ensuring rare slices do not fall below coverage targets).

  • Inputs

    • Drift alert payloads: implicated features/slices, magnitude, duration, example URIs.

    • Error cohorts from #12 (offline mining) and canary/shadow diffs from #19.

    • Capacity & budget constraints (GPU hours, labeling budget), plus SLA windows.

  • Steps

    • Automated triage

      • Validate alert (guard against noise): re-compute stats on a fresh window; check seasonality/holiday effects.

      • Safety assessment: does this slice intersect safety predicates (#28)? If yes, bump severity and enforce stricter timelines.

    • Decisioning

      • Choose the path: data curation only (expand training set), threshold/logic change (config flip via #20), model fine-tune, or full retrain.

      • Estimate expected lift vs. cost: consult historical learning curves per slice.
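
      One way to ground the lift-vs.-cost estimate is to extrapolate per-slice learning curves from past retrains; the logarithmic fit and every number below are illustrative assumptions:

      ```python
      import numpy as np

      # (labels in slice, resulting slice mAP) pairs from past retrains -- illustrative numbers.
      n_hist = np.array([500, 1_000, 2_000, 4_000])
      map_hist = np.array([0.52, 0.58, 0.63, 0.67])

      # Crude logarithmic learning curve mAP ~= a + b * log(n); often enough for a rough
      # expected-lift estimate when deciding whether the labeling spend is worth it.
      b, a = np.polyfit(np.log(n_hist), map_hist, deg=1)

      added = 3_000                               # proposed labeling budget for the drifted slice
      expected = a + b * np.log(n_hist[-1] + added)
      lift = expected - map_hist[-1]
      cost_per_label = 0.40                       # USD per label, illustrative
      print(f"expected mAP lift ~{lift:+.3f} for ~${added * cost_per_label:,.0f} of labeling")
      ```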

    • Spec authoring

      • Produce a structured continual_learning_trigger.yaml including:

        • Query definition for Scenario Mining (#8) and Vector Index (#7) pulls.

        • Target label types (which heads, which ontology versions), required volume per slice.

        • Auto-labeler confidence thresholds and human QA sampling rates (#9/#10).

        • Training strategy knobs (fine-tune vs. from-scratch, loss weights, data sampling ratios).

        • Gating metrics & minimal win conditions for #16 (eval) and #17 (sim).
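
      A minimal sketch of such a spec, built here as a Python dict and serialized with PyYAML; the keys mirror the bullets above but are illustrative rather than a fixed schema, and the IDs, query, and URIs are hypothetical:

      ```python
      import yaml  # PyYAML

      trigger = {
          "trigger_id": "cl-2024-06-01-night-rain-peds",
          "severity": "red",
          "drift_source": {"alert_id": "drift-8812", "slices": ["night_rain"], "features": ["ped_score_hist"]},
          "mining": {
              "scenario_query": "weather=rain AND time_of_day=night AND class=pedestrian",
              "vector_index_seeds": ["s3://example-bucket/seeds/fn_night_rain/"],
              "target_volume_per_slice": {"night_rain": 5000},
          },
          "labeling": {
              "label_types": ["2d_box", "track_id"],
              "ontology_version": "v7",
              "autolabel_confidence_min": 0.85,
              "human_qa_sampling_rate": 0.10,
          },
          "training": {"strategy": "fine_tune", "loss_weights": {"pedestrian": 2.0}, "drifted_slice_sampling_share": 0.3},
          "gating": {"min_win": {"night_rain/mAP": "+2.0pt"}, "max_regression": {"global/mAP": "-0.5pt"}},
      }

      with open("continual_learning_trigger.yaml", "w") as f:
          yaml.safe_dump(trigger, f, sort_keys=False)
      ```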

    • Stakeholder review

      • Async approvers (product/safety/platform); deadline based on severity and SLAs.

    • Kickoff

      • Emit events to #8/#9/#10 pipeline orchestrators; create a W&B Project run group to track the cycle; open budget in labeling platform.

  • Core AWS / Tooling

    • AppSync/GraphQL for dataset queries, OpenSearch + FAISS for similarity pulls, DVC for dataset manifests, Labelbox/Ground Truth for QA setup, W&B for lifecycle tracking.

  • Outputs & Storage

    • Versioned continual_learning_trigger.yaml, mined candidate lists, label job IDs, DVC dataset tags for the forthcoming training data.

    • Event emitted to #27 (Automated Retraining) once the new data crosses minimum viable volume/quality.


27) Automated Retraining (Data → Train → Gate → Package)

  • When it runs

    • Upon readiness signals from #26 (data volume/quality met) or on a scheduled cadence (e.g., weekly retrains with incremental data).

    • For emergency hot-fix retrains (e.g., severe false positives on emergency vehicles).

  • Inputs

    • Curated/labeled datasets from #11 (Golden/Slice Builder) updated per trigger spec.

    • Previous best model (for fine-tuning) and training configs from #13/#14, plus any new loss weights or augmentations.

    • Compute plan (nodes, GPUs, instance types), training budget, and time window.

  • Steps

    • Data assembly

      • Pull manifests via DVC with exact git/DVC tag; integrity check counts, class balance, per-slice minimums; run Great Expectations on schema/valid ranges.

      • Apply sampling strategy from trigger: overweight drifted slices but maintain global distribution constraints; snapshot as dataset_manifest.version.json.
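
      A sketch of the overweighting step, assuming per-slice counts are already known from the manifest; the overweight factor, cap, and counts are illustrative:

      ```python
      # Per-slice example counts in the assembled manifest and the slices implicated by the trigger.
      slice_counts = {"day_clear": 400_000, "construction": 30_000, "night_rain": 12_000}
      drifted = {"night_rain"}
      OVERWEIGHT = 3.0                 # sample drifted slices at ~3x their natural rate ...
      MAX_DRIFTED_SHARE = 0.20         # ... but cap them so the global distribution is preserved

      total = sum(slice_counts.values())
      natural = {s: c / total for s, c in slice_counts.items()}

      # Drifted slices get a boosted-but-capped share; the rest split the remaining probability
      # mass in proportion to their natural frequencies.
      drifted_share = {s: min(OVERWEIGHT * natural[s], MAX_DRIFTED_SHARE) for s in drifted}
      remaining = 1.0 - sum(drifted_share.values())
      rest_total = sum(natural[s] for s in natural if s not in drifted)
      sampling_probs = {s: drifted_share.get(s, remaining * natural[s] / rest_total) for s in natural}
      print(sampling_probs)            # per-slice probabilities for the data sampler
      ```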

    • Training job orchestration

      • Launch distributed training (PyTorch DDP) on SageMaker or EKS; enable AMP (BF16/FP16), gradient accumulation, and gradient checkpointing as needed.

      • Curriculum/fine-tune options:

        • Warm start from best checkpoint; freeze low-level backbone for a stage if compute-bound.

        • Increase loss weights for target heads/slices; introduce targeted augmentations (fog/rain, motion blur).

      • Online W&B logging for metrics, LR schedules, and confusion matrices per slice; checkpoint best-of-N by main metric.
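
      A runnable toy sketch of the AMP + gradient-accumulation + frozen-backbone recipe; the real job uses the project's model, data loaders, DDP launch, per-head losses, and W&B logging, so ToyModel and all shapes here are stand-ins:

      ```python
      import torch
      import torch.nn as nn
      from torch.cuda.amp import GradScaler, autocast

      class ToyModel(nn.Module):
          """Stand-in for the perception model: a 'backbone' plus one trainable head."""
          def __init__(self):
              super().__init__()
              self.backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
              self.head = nn.Conv2d(16, 8, 1)
          def forward(self, x):
              return self.head(self.backbone(x))

      device = "cuda" if torch.cuda.is_available() else "cpu"
      model = ToyModel().to(device)
      # model.load_state_dict(torch.load("best_prev.ckpt"), strict=False)  # warm start in the real job
      for p in model.backbone.parameters():           # freeze the low-level backbone for this stage
          p.requires_grad = False

      optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
      scaler = GradScaler(enabled=device == "cuda")   # FP16 needs loss scaling; BF16 would not
      ACCUM_STEPS = 4                                 # effective batch = 4x the per-step batch

      model.train()
      for step in range(16):                          # stand-in for iterating the real train_loader
          x = torch.randn(2, 3, 64, 64, device=device)
          target = torch.randn(2, 8, 64, 64, device=device)
          with autocast(enabled=device == "cuda"):    # mixed-precision forward
              loss = nn.functional.mse_loss(model(x), target)
          scaler.scale(loss / ACCUM_STEPS).backward() # gradient accumulation
          if (step + 1) % ACCUM_STEPS == 0:
              scaler.step(optimizer)
              scaler.update()
              optimizer.zero_grad(set_to_none=True)
      ```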

    • Auto HPO (optional)

      • Fire a W&B Sweep for a narrow grid/Bayesian search on a few sensitive hyperparams (LR, augment strength, NMS thresholds) with ASHA early-stop.
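
      A sketch of such a sweep; note that W&B's built-in early termination is Hyperband-style (used here in place of the ASHA early stop), and the project name, metric name, parameter ranges, and the stubbed train_fn are assumptions:

      ```python
      import random
      import wandb

      def train_fn():
          # Stand-in for the real training entry point: hyperparameters arrive via run.config
          # (run.config.lr, etc.); train, evaluate, and log the sweep metric.
          with wandb.init() as run:
              run.log({"val/night_rain_mAP": random.random()})   # placeholder metric

      sweep_config = {
          "method": "bayes",                                     # narrow Bayesian search
          "metric": {"name": "val/night_rain_mAP", "goal": "maximize"},
          "parameters": {
              "lr": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-3},
              "augment_strength": {"values": [0.5, 1.0, 1.5]},
              "nms_iou_threshold": {"min": 0.45, "max": 0.65},
          },
          "early_terminate": {"type": "hyperband", "min_iter": 3},
      }

      sweep_id = wandb.sweep(sweep_config, project="perception-retrain")   # hypothetical project
      wandb.agent(sweep_id, function=train_fn, count=20)
      ```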

    • Gating & export

      • Evaluate on held-out and safety slices (#16); block if performance regresses by more than allowed deltas on guarded slices.

      • If green, run Packaging/Export (#15) for the winning checkpoint to produce TensorRT/ONNX/TorchScript packs.
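
      A sketch of the regression gate itself; the guarded metrics and allowed deltas are illustrative:

      ```python
      # Block promotion if any guarded metric regresses beyond its allowed delta.
      ALLOWED_DELTAS = {"global/mAP": -0.005, "night_rain/mAP": 0.0, "pedestrian/recall": 0.0}

      def gate(baseline: dict[str, float], candidate: dict[str, float]) -> tuple[bool, list[str]]:
          violations = []
          for metric, allowed in ALLOWED_DELTAS.items():
              delta = candidate[metric] - baseline[metric]
              if delta < allowed:                     # more negative than the allowed regression
                  violations.append(f"{metric}: {delta:+.4f} (allowed {allowed:+.4f})")
          return (len(violations) == 0, violations)

      ok, why = gate(
          baseline={"global/mAP": 0.612, "night_rain/mAP": 0.431, "pedestrian/recall": 0.874},
          candidate={"global/mAP": 0.615, "night_rain/mAP": 0.455, "pedestrian/recall": 0.869},
      )
      print("PROMOTE" if ok else f"BLOCKED: {why}")   # blocked here: pedestrian/recall regressed
      ```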

    • Artifact hygiene

      • Write training_report.json, checkpoints, W&B run refs; push model packs to S3 + ECR; update Model Card (data footprint deltas, known limitations).

    • Handoff

      • Notify #19 to begin canary/shadow; attach drift remediation rationale to the promotion ticket (#18).

  • Core AWS / Tooling

    • SageMaker Training / EKS, FSx for Lustre (optional staging), W&B (runs & sweeps), DVC, Great Expectations, TensorRT/ONNX, Triton.

  • Outputs & Storage

    • Versioned trained models, export packs, and full lineage (dataset tags → code SHA → run ID).

    • Evaluation artifacts ready for #16/#17; promotion request stub pre-filled.


28) Testing in Production (Safety Predicates & Runtime Guards)

  • When it runs

    • Always-on: record-only in shadow, hard gates in canary, and soft or strict gates in full production depending on the predicate.

    • Updated whenever the predicate library or operating boundaries change.

  • Inputs

    • Live model outputs (boxes, masks, tracks, trajectories, confidences), ego and CAN/IMU telemetry, environmental context (weather/map tags), and historical baseline stats from #16/#17 for bounds.

    • Predicate library: a versioned set of rules/constraints derived from safety analysis, simulation studies, and regulatory requirements.

  • Steps

    • Predicate design & encoding

      • Express predicates as declarative policies (e.g., OPA/Rego or a domain-specific ruleset) with thresholds configurable per slice:

        • Geometric sanity: boxes within FOV, plausible aspect ratios/areas, non-negative depths.

        • Temporal consistency: max per-frame change in drivable area; per-track acceleration and jerk within physical bounds; ID-switch caps.

        • Cross-sensor agreement: camera vs. LiDAR consensus; veto if strong disagreement persists for N frames.

        • Planner consistency proxies: a large divergence between the perception-based risk map and the planner’s dynamic constraints flags a violation.

        • Uncertainty guard: abstain or degrade gracefully when confidence falls below a calibrated floor.

        • Performance guard: if p99 latency exceeds the budget, reduce batch size or switch to a fallback model.
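
      A tiny in-process sketch of two of the predicates above (a production deployment might encode these as OPA/Rego policies instead); the field names and thresholds are illustrative:

      ```python
      from dataclasses import dataclass

      @dataclass
      class Detection:                 # illustrative subset of a detection record
          x1: float
          y1: float
          x2: float
          y2: float
          depth_m: float
          confidence: float

      def geometric_sanity(det: Detection, img_w: int, img_h: int) -> list[str]:
          violations = []
          w, h = det.x2 - det.x1, det.y2 - det.y1
          if not (0 <= det.x1 < det.x2 <= img_w and 0 <= det.y1 < det.y2 <= img_h):
              violations.append("box_outside_fov")
          if w > 0 and h > 0 and not (0.1 <= w / h <= 10.0):       # implausible aspect ratio
              violations.append("implausible_aspect_ratio")
          if det.depth_m < 0:
              violations.append("negative_depth")
          return violations

      def uncertainty_guard(det: Detection, floor: float = 0.35) -> list[str]:
          # Abstain / degrade gracefully when confidence drops below the calibrated floor.
          return ["below_confidence_floor"] if det.confidence < floor else []

      det = Detection(x1=10, y1=20, x2=410, y2=40, depth_m=12.5, confidence=0.22)
      print(geometric_sanity(det, img_w=1920, img_h=1080) + uncertainty_guard(det))
      # -> ['implausible_aspect_ratio', 'below_confidence_floor']
      ```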

    • Deployment

      • Run the predicate engine as a sidecar or in-process filter before responses are accepted by downstream consumers.

      • For edge, keep a tiny, deterministic rules engine with bounded memory/CPU; for cloud, use OPA sidecar with hot-reloadable policies.

    • Action modes

      • Shadow: purely record violations; do not affect caller output; route samples to S3 for audit/mining.

      • Canary/Prod:

        • Soft-gate: log + emit warning headers; allow response.

        • Hard-gate: replace response with safe fallback (prior model, rule-based heuristic, or abstention code) and flag the event.

        • Kill-switch: automatic rollback trigger (revert traffic weights or switch to previous production model) if violation rate exceeds M occurrences in T minutes in any protected slice.
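
      A sketch of the kill-switch counter; M, T, and the rollback hook (e.g., an EventBridge event that reverts traffic weights) are assumptions:

      ```python
      import time
      from collections import defaultdict, deque

      M_VIOLATIONS = 25                # more than M hard-gate violations ...
      T_SECONDS = 10 * 60              # ... within T minutes triggers rollback (illustrative values)
      _events: dict[str, deque] = defaultdict(deque)

      def record_violation(protected_slice: str, trigger_rollback) -> None:
          now = time.time()
          q = _events[protected_slice]
          q.append(now)
          while q and now - q[0] > T_SECONDS:          # drop events outside the sliding window
              q.popleft()
          if len(q) > M_VIOLATIONS:
              trigger_rollback(protected_slice)        # revert traffic weights / prior model
              q.clear()                                # reset so one incident fires once

      # Example wiring from the hard-gate path:
      # record_violation("pedestrian_night", trigger_rollback=lambda s: print(f"ROLLBACK {s}"))
      ```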

    • Calibration & audits

      • Periodically validate predicate hit rates with ground truth from #10; adjust thresholds to minimize false alarms without missing true hazards.

      • Run “predicate regression tests” from the simulation library (#17) as part of the promotion pipeline.

    • Forensics & feedback

      • Each violation stores a forensic bundle: request, outputs, predicate IDs hit, policy version, traces; redact PII; store in S3 under /safety/violations/YYYY/MM/DD.

      • Generate safety dashboards (violation types over time/slices) and weekly audit packs; file tickets for systemic issues.

      • Emit mining specs to #12 for targeted data collection (e.g., “night-rain ped crossings where cross-sensor agreement < τ”).

  • Core AWS / Tooling

    • EKS sidecars (OPA), CloudWatch Logs/Alarms, Athena + QuickSight for safety analytics, EventBridge to trigger rollbacks, App Mesh/Istio for circuit-breakers/fallback routing.

  • Outputs & Storage

    • Predicate decision logs, violation bundles, policy versions, and automated rollback events.

    • Feedback artifacts (mining specs) feeding the next data → training loop.