# Monitoring & Continual Learning

### 25) Drift Detection (Data, Prediction, Concept, Performance)

* **When it runs**
  * Stream: near-real-time sliding windows (e.g., 15-minute / 1-hour) over online inference topics for early warning.
  * Batch: scheduled **Airflow** jobs (hourly/daily/weekly) to produce canonical drift reports and trend lines.
  * On-demand: after a deployment, during canary, or when SLOs breach (p95 latency, error rate) or OOD counters spike.
* **Inputs**
  * **Online telemetry**: request schemas, feature snapshots (PII-scrubbed), model outputs (scores/boxes/masks/uncertainty), per-request timing + resource metrics.
  * **Reference baselines**: per-slice statistics frozen at promotion time from #16 (evaluation) and “healthy” historical windows (seasonal references).
  * **Label trickle**: a small, delayed stream of ground truth from #10 (human QA) and #9 (auto-label confirmations) to estimate concept drift where possible.
* **Steps**
  * **Collection & privacy**
    * Tap inference logs via **OpenTelemetry** exporters; route to **Kinesis** → **S3 (Parquet)**; strip or hash identifiers; mask PII fields at the edge.
    * Maintain a **feature dictionary** (name, type, valid range, unit) in the Glue Data Catalog; enforce it with validators.
  * **Drift computations**
    * **Schema drift**: required fields present; types/ranges; missingness change (Great Expectations).
    * **Covariate drift** (inputs):
      * Numeric: PSI, KS/AD tests, distributional distances (e.g., Wasserstein); maintain rolling means/variances and quantiles (a PSI sketch follows this section).
      * Categorical: population stability, χ² tests; top-k category churn.
      * Temporal: autocorrelation changes; seasonality break detection (CUSUM on aggregates).
    * **Prediction drift**:
      * Score histograms per class; calibration shift (ECE, Brier); acceptance/abstention rate drift.
      * Spatial: IoU of drivable-area segmentation against the last known stable output; sanity distributions for box geometry (aspect ratio, area).
      * Temporal: track fragmentation, ID-switch rates, correlation of latency with confidence.
    * **Concept drift** (where labels are available): prequential error rates, sliding-window AUC/mAP/mIoU; DDM/ADWIN-style changepoint detectors.
    * **OOD/uncertainty sentinels**: ensemble disagreement, Mahalanobis distance in penultimate embeddings (see the second sketch after this section); maintain per-slice OOD counters.
  * **Attribution & slicing**
    * Always compute by **critical slices** (weather, time, geography, road type, sensor subset) and cohorts (fleet/customer).
    * Root-cause heuristics: feature importance on drift indicators (e.g., Shapley values on a “drift vs no-drift” classifier).
  * **Thresholding & governance**
    * Severity bands: **Green** (noise), **Yellow** (monitor), **Red** (action). Calibrate thresholds per slice to avoid alert fatigue.
    * **Dedup & cool-down**: suppress duplicate alerts within a lookback window; escalate if the condition persists longer than T.
  * **Reporting & alerts**
    * Emit `drift_report.json` and `drift_metrics.parquet`; publish Grafana tiles; send PagerDuty alerts with the top 3 implicated features/slices.
    * File an “Active Drift” ticket with a playbook link (triage, rollback rules, and data-mining recipe).
* **Core AWS / Tooling**
  * **Airflow**, **Kinesis**, **S3/Glue/Athena**, **Evidently**, **Great Expectations**, **OpenTelemetry**, **Prometheus/AMP**, **Grafana**, **QuickSight**.
* **Outputs & Storage**
  * Canonical drift artifacts in **S3** (`/monitoring/drift/YYYY/MM/DD/…`), Glue tables for Athena, alert records in the incident tracker.
  * Event to **#26 Continual Learning Trigger** (with a compact spec of what drifted and where).
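To ground the numeric covariate-drift check, here is a minimal PSI sketch, assuming quantile bin edges are frozen from the reference window so both windows share one binning; the 0.1 / 0.25 cutoffs are the common rule-of-thumb bands, not values prescribed by this pipeline.

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray,
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a frozen reference window
    and the current sliding window of one numeric feature."""
    # Bin edges from reference quantiles; open the ends so out-of-range
    # values in the current window still land in a bin.
    edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Clip to avoid log(0) when a bin is empty in either window.
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Map the score onto the severity bands per feature and slice.
rng = np.random.default_rng(0)
score = psi(rng.normal(0.0, 1.0, 50_000), rng.normal(0.3, 1.2, 5_000))
band = "Red" if score > 0.25 else "Yellow" if score > 0.1 else "Green"
```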
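Likewise, a compact sketch of the Mahalanobis OOD sentinel, assuming penultimate-layer embeddings are logged with each request; `fit_reference`, the shrinkage term, and the 99.5th-percentile threshold are illustrative choices, not fixed by this document.

```python
import numpy as np

def fit_reference(embeddings: np.ndarray, shrinkage: float = 1e-3):
    """Fit mean and a shrinkage-regularized inverse covariance offline
    on 'healthy' reference embeddings."""
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False)
    cov += shrinkage * np.eye(cov.shape[0])  # keep the inverse well-conditioned
    return mu, np.linalg.inv(cov)

def mahalanobis_sq(x: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    d = x - mu
    return float(d @ cov_inv @ d)

# Calibrate the OOD threshold as a high quantile of reference distances,
# then bump the per-slice OOD counter whenever a request exceeds it.
ref = np.random.default_rng(1).normal(size=(10_000, 64))
mu, cov_inv = fit_reference(ref)
dists = [mahalanobis_sq(e, mu, cov_inv) for e in ref[:2_000]]
threshold = float(np.quantile(dists, 0.995))
is_ood = mahalanobis_sq(np.full(64, 3.0), mu, cov_inv) > threshold
```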
---

### 26) Continual Learning Trigger (Triage → Decide → Specify)

* **When it runs**
  * Fired by a **Red/Yellow** drift alert from #25, by performance regressions in canary/production, or by business/product requests (e.g., “construction zones increased; improve precision”).
  * Nightly “gap analysis” against strategic coverage goals, ensuring rare slices stay within coverage targets.
* **Inputs**
  * Drift alert payloads: implicated features/slices, magnitude, duration, example URIs.
  * Error cohorts from #12 (offline mining) and canary/shadow diffs from #19.
  * Capacity & budget constraints (GPU hours, labeling budget), plus SLA windows.
* **Steps**
  * **Automated triage**
    * Validate the alert (guard against noise): re-compute statistics on a fresh window; check for seasonality/holiday effects.
    * Safety assessment: does this slice intersect safety predicates (#28)? If yes, bump severity and enforce stricter timelines.
  * **Decisioning**
    * Choose the path: **data curation only** (expand the training set), **threshold/logic change** (config flip via #20), **model fine-tune**, or **full retrain**.
    * Estimate **expected lift** vs. cost: consult historical learning curves per slice.
  * **Spec authoring** (a sketch follows this section)
    * Produce a structured `continual_learning_trigger.yaml` including:
      * Query definition for **Scenario Mining** (#8) and **Vector Index** (#7) pulls.
      * Target label types (which heads, which ontology versions), required volume per slice.
      * Auto-labeler confidence thresholds and human QA sampling rates (#9/#10).
      * Training strategy knobs (fine-tune vs. from-scratch, loss weights, data sampling ratios).
      * Gating metrics & minimal win conditions for #16 (eval) and #17 (sim).
  * **Stakeholder review**
    * Async approvers (product/safety/platform); deadline set by severity and SLAs.
  * **Kickoff**
    * Emit events to the #8/#9/#10 pipeline orchestrators; create a W&B **Project** run group to track the cycle; open a budget in the labeling platform.
* **Core AWS / Tooling**
  * **AppSync/GraphQL** for dataset queries, **OpenSearch** + **FAISS** for similarity pulls, **DVC** for dataset manifests, **Labelbox/Ground Truth** for QA setup, **W&B** for lifecycle tracking.
* **Outputs & Storage**
  * Versioned `continual_learning_trigger.yaml`, mined candidate lists, label job IDs, DVC dataset tags for the forthcoming training data.
  * Event emitted to #27 (Automated Retraining) once the new data crosses the minimum viable volume/quality bar.
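As a concrete illustration of the spec-authoring step, a minimal sketch that serializes a trigger into `continual_learning_trigger.yaml` via PyYAML; every field name and value here is a hypothetical stand-in for whatever schema the downstream orchestrators actually consume.

```python
from dataclasses import dataclass, field, asdict
import yaml  # PyYAML

@dataclass
class TriggerSpec:
    trigger_id: str
    severity: str                          # Yellow | Red, from the #25 alert
    path: str                              # curate | config_flip | fine_tune | full_retrain
    mining_query: str                      # pull definition for #8 / #7
    target_slices: list = field(default_factory=list)
    min_samples_per_slice: int = 2_000     # required volume per slice
    autolabel_confidence_floor: float = 0.90   # #9 auto-labeler gate
    human_qa_sample_rate: float = 0.05         # #10 QA sampling
    gating_metrics: dict = field(default_factory=dict)  # win conditions for #16/#17

spec = TriggerSpec(
    trigger_id="trig-construction-rain-001",   # hypothetical identifier
    severity="Red",
    path="fine_tune",
    mining_query="scene:construction AND weather:rain",
    target_slices=["construction/night", "construction/rain"],
    gating_metrics={"mAP(construction)": "+2.0pp", "mAP(global)": ">=-0.2pp"},
)
with open("continual_learning_trigger.yaml", "w") as f:
    yaml.safe_dump(asdict(spec), f, sort_keys=False)
```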
---

### 27) Automated Retraining (Data → Train → Gate → Package)

* **When it runs**
  * Upon readiness signals from #26 (data volume/quality met) or on a scheduled cadence (e.g., weekly retrains with incremental data).
  * On emergency hot-fix retrains (e.g., severe false positives for emergency vehicles).
* **Inputs**
  * Curated/labeled datasets from #11 (Golden/Slice Builder), updated per the trigger spec.
  * The previous best model (for fine-tuning) and training configs from #13/#14, plus any new loss weights or augmentations.
  * Compute plan (nodes, GPUs, instance types), training budget, and time window.
* **Steps**
  * **Data assembly**
    * Pull manifests via **DVC** at an exact git/DVC tag; integrity-check counts, class balance, and per-slice minimums; run Great Expectations on schema/valid ranges.
    * Apply the **sampling strategy** from the trigger: overweight drifted slices while maintaining global distribution constraints (see the sampler sketch after this section); snapshot as `dataset_manifest.version.json`.
  * **Training job orchestration**
    * Launch distributed training (**PyTorch DDP**) on **SageMaker** or **EKS**; enable **AMP** (BF16/FP16), gradient accumulation, and gradient checkpointing as needed.
    * **Curriculum/fine-tune** options:
      * Warm start from the best checkpoint; freeze the low-level backbone for a stage if compute-bound.
      * Increase loss weights for target heads/slices; introduce targeted augmentations (fog/rain, motion blur).
    * Online **W&B** logging for metrics, LR schedules, and per-slice confusion matrices; checkpoint the best of N by the primary metric.
  * **Auto HPO (optional)**
    * Fire a **W&B Sweep** for a narrow grid/Bayesian search over a few sensitive hyperparameters (LR, augmentation strength, NMS thresholds) with ASHA early stopping.
  * **Gating & export**
    * Evaluate on **held-out** and **safety** slices (#16); block if performance regresses by more than the allowed deltas on guarded slices (a gate sketch follows this section).
    * If green, run **Packaging/Export** (#15) for the winning checkpoint to produce TensorRT/ONNX/TorchScript packs.
  * **Artifact hygiene**
    * Write `training_report.json`, checkpoints, and W&B run refs; push model packs to **S3 + ECR**; update the **Model Card** (data footprint deltas, known limitations).
  * **Handoff**
    * Notify #19 to begin canary/shadow; attach the drift remediation rationale to the promotion ticket (#18).
* **Core AWS / Tooling**
  * **SageMaker Training** / **EKS**, **FSx for Lustre** (optional staging), **W&B** (runs & sweeps), **DVC**, **Great Expectations**, **TensorRT/ONNX**, **Triton**.
* **Outputs & Storage**
  * Versioned trained models, export packs, and full lineage (dataset tags → code SHA → run ID).
  * Evaluation artifacts ready for #16/#17; promotion request stub pre-filled.
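To show the data-assembly sampling knob concretely, a short sketch using PyTorch's `WeightedRandomSampler` to overweight a drifted slice without excluding anything; the slice labels and the 4x boost are hypothetical values standing in for the trigger spec's ratios.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Slice id per training sample, read from the dataset manifest.
sample_slices = ["clear/day"] * 9_000 + ["construction/rain"] * 1_000
boost = {"construction/rain": 4.0}  # from continual_learning_trigger.yaml

weights = torch.tensor([boost.get(s, 1.0) for s in sample_slices])
# Draw one epoch's worth of indices; the drifted slice is ~4x more likely
# per sample, but every sample keeps nonzero probability.
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
# DataLoader(dataset, sampler=sampler, batch_size=...) consumes this sampler.
```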
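And the guarded-slice gate reduced to its essence: block promotion if any protected slice regresses beyond its allowed delta, regardless of global lift. Metric names and deltas are illustrative, not this pipeline's actual thresholds.

```python
ALLOWED_DELTAS = {                 # max tolerated drop, in percentage points
    "global/mAP": 0.2,
    "pedestrian_night/mAP": 0.0,   # safety slice: no regression allowed
    "construction/mAP": 0.0,
}

def gate(baseline: dict, candidate: dict) -> tuple:
    """Return (passed, violations) comparing candidate vs. baseline metrics."""
    violations = [
        f"{k}: {baseline[k]:.2f} -> {candidate[k]:.2f} (allowed -{d}pp)"
        for k, d in ALLOWED_DELTAS.items()
        if candidate[k] < baseline[k] - d
    ]
    return (not violations), violations

ok, why = gate(
    baseline={"global/mAP": 61.3, "pedestrian_night/mAP": 48.9, "construction/mAP": 41.0},
    candidate={"global/mAP": 61.5, "pedestrian_night/mAP": 48.4, "construction/mAP": 45.2},
)
# ok == False: pedestrian_night regressed 0.5pp against a 0.0pp allowance,
# even though the global and targeted-slice metrics improved.
```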
---

### 28) Testing in Production (Safety Predicates & Runtime Guards)

* **When it runs**
  * Always-on in **shadow** and **canary** paths (hard gates), and in full production (soft/strict gates depending on the predicate).
  * Updated whenever the predicate library or operating boundaries change.
* **Inputs**
  * Live **model outputs** (boxes, masks, tracks, trajectories, confidences), ego and CAN/IMU telemetry, environmental context (weather/map tags), and historical baseline stats from #16/#17 for bounds.
  * **Predicate library**: a versioned set of rules/constraints derived from safety analysis, simulation studies, and regulatory requirements.
* **Steps**
  * **Predicate design & encoding** (a runnable sketch follows this section)
    * Express predicates as declarative policies (e.g., **OPA/Rego** or a domain-specific ruleset) with thresholds configurable per slice:
      * **Geometric sanity**: boxes within the FOV, plausible aspect ratios/areas, non-negative depths.
      * **Temporal consistency**: max per-frame change in drivable area; track acceleration and jerk within physical bounds; ID-switch caps.
      * **Cross-sensor agreement**: camera vs. LiDAR consensus; veto if strong disagreement persists for N frames.
      * **Planner consistency proxies**: a large divergence between the perception-based risk map and the planner’s dynamic constraints flags a violation.
      * **Uncertainty guard**: abstain or degrade gracefully when confidence falls below a calibrated floor.
      * **Performance guard**: if p99 latency exceeds budget, reduce batch size or enable the fallback model.
  * **Deployment**
    * Run the predicate engine as a **sidecar** or in-process filter before responses are accepted by downstream consumers.
    * For the **edge**, keep a tiny, deterministic rules engine with bounded memory/CPU; for the **cloud**, use an **OPA** sidecar with hot-reloadable policies.
  * **Action modes**
    * **Shadow**: purely record violations; do not affect the caller's output; route samples to S3 for audit/mining.
    * **Canary/Prod**:
      * **Soft-gate**: log + emit warning headers; allow the response.
      * **Hard-gate**: replace the response with a **safe fallback** (prior model, rule-based heuristic, or abstention code) and flag the event.
      * **Kill-switch**: automatic rollback trigger (revert traffic weights or switch to the previous production model) if the violation rate exceeds M occurrences in T minutes on any protected slice.
  * **Calibration & audits**
    * Periodically validate predicate hit rates with ground truth from #10; adjust thresholds to minimize false alarms without missing true hazards.
    * Run “predicate regression tests” from the simulation library (#17) as part of the promotion pipeline.
  * **Forensics & feedback**
    * Each violation stores a **forensic bundle**: request, outputs, predicate IDs hit, policy version, traces; redact PII; store in **S3** under `/safety/violations/YYYY/MM/DD`.
    * Generate safety dashboards (violation types over time/slices) and weekly audit packs; file tickets for systemic issues.
    * Emit **mining specs** to #12 for targeted data collection (e.g., “night-rain pedestrian crossings where cross-sensor agreement < τ”).
* **Core AWS / Tooling**
  * **EKS** sidecars (OPA), **CloudWatch Logs/Alarms**, **Athena** + **QuickSight** for safety analytics, **EventBridge** to trigger rollbacks, **App Mesh/Istio** for circuit breakers/fallback routing.
* **Outputs & Storage**
  * Predicate decision logs, violation bundles, policy versions, and automated rollback events.
  * Feedback artifacts (mining specs) feeding the next **data → training** loop.
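To make the runtime-guard mechanics tangible, a minimal in-process sketch: per-response predicate checks feeding a sliding-window kill-switch counter per protected slice. All thresholds, the M/T budget, and the detection fields are illustrative stand-ins for the versioned predicate library, not its actual policies.

```python
import time
from collections import deque

MAX_ASPECT_RATIO = 8.0              # plausible box-geometry bound
MIN_CONFIDENCE = 0.2                # calibrated uncertainty floor
M_VIOLATIONS, T_SECONDS = 50, 600   # kill-switch budget per protected slice

def check_predicates(det: dict) -> list:
    """Return the IDs of violated predicates for one detection."""
    violated = []
    w, h = det["box_wh"]
    if h <= 0 or not (1 / MAX_ASPECT_RATIO <= w / h <= MAX_ASPECT_RATIO):
        violated.append("geom.aspect_ratio")
    if det.get("depth", 0.0) < 0:
        violated.append("geom.negative_depth")
    if det["confidence"] < MIN_CONFIDENCE:
        violated.append("uncertainty.floor")  # abstain / degrade gracefully
    return violated

class KillSwitch:
    """Sliding-window violation counter per protected slice."""
    def __init__(self):
        self.events = {}  # slice_id -> deque of violation timestamps

    def record(self, slice_id: str, now: float) -> bool:
        q = self.events.setdefault(slice_id, deque())
        q.append(now)
        while q and q[0] < now - T_SECONDS:  # evict events outside the window
            q.popleft()
        return len(q) > M_VIOLATIONS         # True => trigger rollback

guard = KillSwitch()
hits = check_predicates({"box_wh": (90.0, 10.0), "confidence": 0.9})
if hits and guard.record("pedestrian_night", time.time()):
    pass  # hard-gate path: swap in the safe fallback, emit the rollback event
```

---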