# Testing Strategy

| Testing Layer                       | Test Type / Purpose                                                                     | Implementation Strategy & Tooling                                                                                                                                                                                    | Lifecycle Stage (Where & When)                                      |
| :---------------------------------- | :-------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------ |
| **Data & Features**                 | **Schema & Quality Validation** — prevent corrupt data from entering the lake/pipelines | **Great Expectations** suites (versioned in Git): sensor presence/format; timestamp monotonicity; null/NaN checks; frame count vs. duration; GPS bounds; CAN ranges. Airflow task hard-fails on critical violations. | **CI/CD & Ops:** Ingestion, sync/convert, feature-engineering DAGs. |
|                                     | **Sensor Sync & Temporal Alignment** — verify multi-sensor alignment                    | Pytests on sample shards; cross-sensor time skew bounds; missing packet rate; per-sensor clock drift alerts.                                                                                                         | **CI & Staging:** On each converter change; **Ops:** daily monitor. |
|                                     | **Camera/LiDAR Calibration Sanity** — catch extrinsic/intrinsic drift                   | Checkboard/self-cal pattern tests; reprojection error thresholds; focal/ principal-point consistency; LiDAR→camera overlay IoU.                                                                                      | **Staging:** After fleet calibration updates; **Ops:** weekly.      |
|                                     | **PII Redaction Verification** — enforce privacy                                        | Automated face/plate detectors pre/post redaction; assert no detected PII after blur; sample human audit.                                                                                                            | **Ops:** Post-ingestion job; alerts on failure.                     |
|                                     | **Map/Weather Join Consistency** — enrichment integrity                                 | Cross-validate road class/speed limits vs. GPS track; weather timestamp tolerance; missing joins reported.                                                                                                           | **Staging & Ops:** Enrichment DAG.                                  |
|                                     | **Deduplication & Coverage** — remove duplicates, ensure variety                        | Perceptual hashing / embedding similarity to flag duplicates; route/time/weather distribution checks.                                                                                                                | **CI/CD:** Dataset build; **Ops:** weekly audit.                    |
|                                     | **Label QA & Agreement** — label quality control                                        | Inter/Intra-annotator agreement (Cohen’s κ); adjudication workflow; spot-check hard slices; leakage checks (train/val/test).                                                                                         | **Labeling Ops & CI/CD:** Before dataset freeze.                    |
|                                     | **Auto-Label vs Human Concordance** — validate auto-labels                              | Measure precision/recall of auto-labels on human-audited subset; set min precision threshold before accepting.                                                                                                       | **CI/CD:** Auto-label release gate.                                 |
|                                     | **Feature Store Contracts (optional)** — temporal/online parity                         | Feast feature views tests: backfill correctness, TTLs, online/offline parity checks, point-in-time joins.                                                                                                            | **CI/CD & Staging:** Feature DAGs.                                  |
| **Code & Pipelines**                | **Unit Tests** — function-level correctness                                             | **pytest**, **hypothesis** (property tests) for transforms, IO, utils; mocks for S3/DB. Pre-commit hooks (black, isort, mypy).                                                                                       | **Dev & CI:** Every push/PR.                                        |
|                                     | **DAG & Pipeline Integration Tests** — E2E on small fixtures                            | Trigger Airflow DAGs against tiny fixtures in S3; assert artifacts, metadata rows, and lineage written.                                                                                                              | **Staging (CD):** Post-deploy.                                      |
|                                     | **Idempotency & Retry Semantics** — safe re-runs                                        | Re-run tasks with same inputs and ensure no duplicates or drift; simulate transient S3/DB failures.                                                                                                                  | **CI & Staging:** Nightly.                                          |
|                                     | **IaC Tests** — infra correctness                                                       | **Terratest** on Terraform modules; security group rules; least-privilege IAM; bucket policies.                                                                                                                      | **CI/CD:** Infra PRs.                                               |
| **Model (Offline)**                 | **Core Metrics (Perception)** — accuracy at task & slice                                | mAP/AP50–95, mIoU, MOT metrics; per-slice eval (night/rain/occlusion/construction/road-works). Log to **W\&B** with artifacts.                                                                                       | **CI/CD:** Training pipeline before registry.                       |
|                                     | **Behavioral Tests** — learned logic sanity                                             | Invariance/equivariance (small rotation/crop); monotonicity (closer obstacle → higher risk); kinematic plausibility; temporal consistency (track ID stability).                                                      | **CI/CD:** Post-train gating in W\&B jobs.                          |
|                                     | **Calibration Tests** — reliable confidence                                             | ECE/Brier, reliability diagrams; threshold tuning for safety predicates.                                                                                                                                             | **CI/CD:** Eval step; store plots to W\&B.                          |
|                                     | **Robustness (Corruptions/Weather)** — degrade gracefully                               | Corruption suite (blur, noise, fog, rain); adversarial occlusions; report per-corruption deltas and floors.                                                                                                          | **CI/CD:** Robustness stage; promotion blocked on regressions.      |
|                                     | **OOD & Uncertainty** — detect unknowns                                                 | OOD detectors (energy score/Mahalanobis) or ensemble disagreement; assert high-uncertainty on unseen domains.                                                                                                        | **CI/CD:** Eval stage; log to W\&B.                                 |
|                                     | **Regression on Golden Scenes** — no surprises                                          | Fixed golden drive clips (failure hall-of-fame); assert non-degradation beyond tolerances.                                                                                                                           | **CI/CD:** Mandatory gate.                                          |
|                                     | **Size/Latency Footprint (Target HW)** — fit & fast                                     | Export TorchScript/ONNX → TensorRT; measure p50/p95 latency, GPU mem, throughput; budget checks.                                                                                                       | **CI/CD (Staging HW):** Packaging step.                             |
| **Model (Pre-Prod / Replay / Sim)** | **Drive Replay** — historical incident re-eval                                          | Re-run model on incident logs; compare vs. baseline predictions; must reduce misses/false alarms.                                                                                                                    | **Staging:** Gate before canary.                                    |
|                                     | **Simulation Scenarios** — safety perturbations                                         | Scenario fuzzing (distance, speed, lighting); evaluate safety KPIs and predicates.                                                                                                                                   | **Staging:** Promotion evidence pack.                               |
| **Infrastructure & Serving**        | **API Contract & Smoke Test** — endpoint health                                         | **pytest** against Triton/FastAPI: `/healthz` 200, sample request/response schema, version headers.                                                                                                                  | **Staging & Prod (CD):** Post-deploy; auto-rollback on fail.        |
|                                     | **Performance & Load** — latency/throughput SLOs                                        | **Locust/k6**: step and spike loads; measure p50/p95/p99 latency, max sustainable QPS; Triton dynamic batching tuning.                                                                                               | **Staging (CD):** Gate before traffic.                              |
|                                     | **Soak & Stability** — long-run resilience                                              | 24–48h soak; monitor memory growth, GPU fragmentation, file descriptor leaks.                                                                                                                                        | **Staging:** Nightly/weekly.                                        |
|                                     | **Chaos & Fault Injection** — degrade gracefully                                        | Pod/node kill, network jitter, S3 throttling; **Toxiproxy**; validate backoff, retries, fallbacks (e.g., single-camera mode).                                                                                        | **Staging:** Monthly game-days.                                     |
|                                     | **Security/Compliance** — supply chain & posture                                        | **Trivy/Grype** scans, SBOM, container signing (cosign), IAM least privilege checks; secrets scan.                                                                                                                   | **CI/CD:** Image/IaC gates.                                         |
| **Monitoring & Post-Deployment**    | **Production Smoke & Canary Checks** — safe rollout                                     | Shadow compare vs. baseline; canary 1–5%; automated abort on SLO or safety predicate breach.                                                                                                                         | **Prod (CD):** Rollout controller.                                  |
|                                     | **Data & Output Drift** — detect change                                                 | **Evidently** daily batch on features/outputs vs. reference; PSI/KS thresholds; slice-level drift dashboards; alert to Slack/PagerDuty.                                                                              | **Prod Monitoring:** Airflow scheduled jobs.                        |
|                                     | **Online Slice Metrics** — safety where it matters                                      | Near-real-time metrics per slice (night/rain/…); raise tickets when slice recall drops >X pts.                                                                                                                       | **Prod Monitoring:** Grafana + W\&B reports.                        |
|                                     | **Continual Learning Triggers** — close the loop                                        | On drift/failure bucket thresholds, open labeling tasks; enqueue retrain DAG with new slices.                                                                                                                        | **Prod → CI/CD:** Automated but approval-gated.                     |
|                                     | **Testing in Production (Safety Predicates)** — guardrails live                         | Real-time predicates (e.g., max tolerated miss rate on pedestrian class, confidence floors); auto-fallback/disable signals.                                                                                          | **Prod:** Always on; audited weekly.                                |
| **OTA & Edge**                      | **Package Integrity & Compatibility** — safe OTA                                        | Signature verification, version/ABI checks, rollback test; disk/compute footprint limits; warm-start latency.                                                                                                        | **Pre-Prod Lab & Staged OTA:** Before phased rollout.               |
|                                     | **Hardware-in-the-Loop (HIL) Sanity** — target realism                                  | Bench tests on representative hardware; latency and throughput under thermal/power constraints.                                                                                                                      | **Pre-Prod:** Release gating.                                       |
| **Governance & Audit**              | **Lineage & Registry Completeness** — who/what/when                                     | **Weights & Biases**: runs, artifacts, datasets, model versions, approvals; dataset hashes (DVC) linked in registry entries.                                                                                         | **CI/CD & Prod:** Every promotion.                                  |
|                                     | **Datasheets & Model Cards** — compliance docs                                          | Auto-generate with metrics, slices, risks, mitigations, known gaps; store alongside registry version.                                                                                                                | **CI/CD:** Promotion step.                                          |
|                                     | **Incident RCA Pack** — learn fast                                                      | Bundle logs, traces, frames, saliency, predicates; publish corrective actions and tracking issues.                                                                                                                   | **On Incident:** Within SLA (e.g., 72h).                            |