# Testing Strategy

| Testing Layer | Test Type / Purpose | Implementation Strategy & Tooling | Lifecycle Stage (Where & When) |
| :--- | :--- | :--- | :--- |
| **Data & Features** | **Schema & Quality Validation**: prevent corrupt data from entering the lake/pipelines | **Great Expectations** suites (versioned in Git): sensor presence/format; timestamp monotonicity; null/NaN checks; frame count vs. duration; GPS bounds; CAN ranges. Airflow task hard-fails on critical violations. | **CI/CD & Ops:** Ingestion, sync/convert, feature-engineering DAGs. |
| | **Sensor Sync & Temporal Alignment**: verify multi-sensor alignment | pytest tests on sample shards; cross-sensor time skew bounds; missing packet rate; per-sensor clock drift alerts. | **CI & Staging:** On each converter change; **Ops:** daily monitor. |
| | **Camera/LiDAR Calibration Sanity**: catch extrinsic/intrinsic drift | Checkerboard/self-cal pattern tests; reprojection error thresholds; focal/principal-point consistency; LiDAR→camera overlay IoU. | **Staging:** After fleet calibration updates; **Ops:** weekly. |
| | **PII Redaction Verification**: enforce privacy | Automated face/plate detectors pre/post redaction; assert no detected PII after blur; sample human audit. | **Ops:** Post-ingestion job; alerts on failure. |
| | **Map/Weather Join Consistency**: enrichment integrity | Cross-validate road class/speed limits vs. GPS track; weather timestamp tolerance; missing joins reported. | **Staging & Ops:** Enrichment DAG. |
| | **Deduplication & Coverage**: remove duplicates, ensure variety | Perceptual hashing / embedding similarity to flag duplicates; route/time/weather distribution checks. | **CI/CD:** Dataset build; **Ops:** weekly audit. |
| | **Label QA & Agreement**: label quality control | Inter-/intra-annotator agreement (Cohen’s κ); adjudication workflow; spot-check hard slices; leakage checks (train/val/test). | **Labeling Ops & CI/CD:** Before dataset freeze. |
| | **Auto-Label vs. Human Concordance**: validate auto-labels | Measure precision/recall of auto-labels on human-audited subset; set a minimum precision threshold before accepting. | **CI/CD:** Auto-label release gate. |
| | **Feature Store Contracts (optional)**: temporal/online parity | Feast feature-view tests: backfill correctness, TTLs, online/offline parity checks, point-in-time joins. | **CI/CD & Staging:** Feature DAGs. |
| **Code & Pipelines** | **Unit Tests**: function-level correctness | **pytest**, **hypothesis** (property tests) for transforms, IO, utils; mocks for S3/DB. Pre-commit hooks (black, isort, mypy). | **Dev & CI:** Every push/PR. |
| | **DAG & Pipeline Integration Tests**: E2E on small fixtures | Trigger Airflow DAGs against tiny fixtures in S3; assert artifacts, metadata rows, and lineage written. | **Staging (CD):** Post-deploy. |
| | **Idempotency & Retry Semantics**: safe re-runs | Re-run tasks with same inputs and ensure no duplicates or drift; simulate transient S3/DB failures. | **CI & Staging:** Nightly. |
| | **IaC Tests**: infra correctness | **Terratest** on Terraform modules; security group rules; least-privilege IAM; bucket policies. | **CI/CD:** Infra PRs. |
| **Model (Offline)** | **Core Metrics (Perception)**: accuracy at task & slice | mAP/AP50–95, mIoU, MOT metrics; per-slice eval (night/rain/occlusion/construction/road-works). Log to **W&B** with artifacts. | **CI/CD:** Training pipeline before registry. |
| | **Behavioral Tests**: learned logic sanity | Invariance/equivariance (small rotation/crop); monotonicity (closer obstacle → higher risk); kinematic plausibility; temporal consistency (track ID stability). | **CI/CD:** Post-train gating in W&B jobs. |
| | **Calibration Tests**: reliable confidence | ECE/Brier, reliability diagrams; threshold tuning for safety predicates. | **CI/CD:** Eval step; store plots to W&B. |
| | **Robustness (Corruptions/Weather)**: degrade gracefully | Corruption suite (blur, noise, fog, rain); adversarial occlusions; report per-corruption deltas and floors. | **CI/CD:** Robustness stage; promotion blocked on regressions. |
| | **OOD & Uncertainty**: detect unknowns | OOD detectors (energy score/Mahalanobis) or ensemble disagreement; assert high uncertainty on unseen domains. | **CI/CD:** Eval stage; log to W&B. |
| | **Regression on Golden Scenes**: no surprises | Fixed golden drive clips (failure hall-of-fame); assert non-degradation beyond tolerances. | **CI/CD:** Mandatory gate. |
| | **Size/Latency Footprint (Target HW)**: fit & fast | Export TorchScript/ONNX → TensorRT; measure p50/p95 latency, GPU memory, throughput; budget checks. | **CI/CD (Staging HW):** Packaging step. |
| **Model (Pre-Prod / Replay / Sim)** | **Drive Replay**: historical incident re-eval | Re-run model on incident logs; compare vs. baseline predictions; must reduce misses/false alarms. | **Staging:** Gate before canary. |
| | **Simulation Scenarios**: safety perturbations | Scenario fuzzing (distance, speed, lighting); evaluate safety KPIs and predicates. | **Staging:** Promotion evidence pack. |
| **Infrastructure & Serving** | **API Contract & Smoke Test**: endpoint health | **pytest** against Triton/FastAPI: `/healthz` 200, sample request/response schema, version headers. | **Staging & Prod (CD):** Post-deploy; auto-rollback on fail. |
| | **Performance & Load**: latency/throughput SLOs | **Locust/k6**: step and spike loads; measure p50/p95/p99 latency, max sustainable QPS; Triton dynamic batching tuning. | **Staging (CD):** Gate before traffic. |
| | **Soak & Stability**: long-run resilience | 24–48h soak; monitor memory growth, GPU fragmentation, file descriptor leaks. | **Staging:** Nightly/weekly. |
| | **Chaos & Fault Injection**: degrade gracefully | Pod/node kill, network jitter, S3 throttling; **Toxiproxy**; validate backoff, retries, fallbacks (e.g., single-camera mode). | **Staging:** Monthly game-days. |
| | **Security/Compliance**: supply chain & posture | **Trivy/Grype** scans, SBOM, container signing (cosign), IAM least-privilege checks; secrets scan. | **CI/CD:** Image/IaC gates. |
| **Monitoring & Post-Deployment** | **Production Smoke & Canary Checks**: safe rollout | Shadow compare vs. baseline; canary 1–5%; automated abort on SLO or safety predicate breach. | **Prod (CD):** Rollout controller. |
| | **Data & Output Drift**: detect change | **Evidently** daily batch on features/outputs vs. reference; PSI/KS thresholds; slice-level drift dashboards; alert to Slack/PagerDuty. | **Prod Monitoring:** Airflow scheduled jobs. |
| | **Online Slice Metrics**: safety where it matters | Near-real-time metrics per slice (night/rain/…); raise tickets when slice recall drops >X pts. | **Prod Monitoring:** Grafana + W&B reports. |
| | **Continual Learning Triggers**: close the loop | On drift/failure bucket thresholds, open labeling tasks; enqueue retrain DAG with new slices. | **Prod → CI/CD:** Automated but approval-gated. |
| | **Testing in Production (Safety Predicates)**: guardrails live | Real-time predicates (e.g., max tolerated miss rate on pedestrian class, confidence floors); auto-fallback/disable signals. | **Prod:** Always on; audited weekly. |
| **OTA & Edge** | **Package Integrity & Compatibility**: safe OTA | Signature verification, version/ABI checks, rollback test; disk/compute footprint limits; warm-start latency. | **Pre-Prod Lab & Staged OTA:** Before phased rollout. |
| | **Hardware-in-the-Loop (HIL) Sanity**: target realism | Bench tests on representative hardware; latency and throughput under thermal/power constraints. | **Pre-Prod:** Release gating. |
| **Governance & Audit** | **Lineage & Registry Completeness**: who/what/when | **Weights & Biases**: runs, artifacts, datasets, model versions, approvals; dataset hashes (DVC) linked in registry entries. | **CI/CD & Prod:** Every promotion. |
| | **Datasheets & Model Cards**: compliance docs | Auto-generate with metrics, slices, risks, mitigations, known gaps; store alongside registry version. | **CI/CD:** Promotion step. |
| | **Incident RCA Pack**: learn fast | Bundle logs, traces, frames, saliency, predicates; publish corrective actions and tracking issues. | **On Incident:** Within SLA (e.g., 72h). |
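
The timestamp monotonicity and cross-sensor skew checks in the Data & Features rows could be prototyped in plain Python before being encoded as Great Expectations suites or pytest cases. The sketch below is illustrative only: the function names are hypothetical, "monotonic" is assumed to mean strictly increasing, and skew is assumed to be the distance from each camera timestamp to its nearest LiDAR timestamp.

```python
import bisect


def non_monotonic_indices(timestamps):
    """Return indices whose timestamp does not strictly exceed the previous one."""
    return [
        i for i in range(1, len(timestamps))
        if timestamps[i] <= timestamps[i - 1]
    ]


def max_cross_sensor_skew(camera_ts, lidar_ts):
    """Worst-case gap between each camera timestamp and its nearest LiDAR timestamp.

    Both inputs are assumed to be sorted lists of seconds.
    """
    if not camera_ts or not lidar_ts:
        raise ValueError("both timestamp lists must be non-empty")
    worst = 0.0
    for t in camera_ts:
        i = bisect.bisect_left(lidar_ts, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = lidar_ts[max(i - 1, 0): i + 1]
        worst = max(worst, min(abs(t - c) for c in candidates))
    return worst
```

A test would then assert that `non_monotonic_indices(...)` is empty and that `max_cross_sensor_skew(...)` stays under a project-chosen bound; the table does not specify that bound.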
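
The Label QA & Agreement row names Cohen's κ as the agreement statistic. As a minimal sketch of what that gate might compute, the function below implements the standard two-annotator formula κ = (p_o − p_e) / (1 − p_e) in pure Python. The function name and the example labels are hypothetical, and a real pipeline might call an existing library implementation instead.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical example: two annotators, four frames.
annotator_1 = ["car", "car", "ped", "ped"]
annotator_2 = ["car", "car", "ped", "car"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here observed agreement is 0.75 and chance agreement is 0.5, so `kappa` is 0.5. The minimum acceptable κ before a dataset freeze is a project decision that the table leaves open.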
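
For the Calibration Tests row, ECE and the Brier score can be computed from per-prediction confidences and correctness flags. The sketch below assumes equal-width confidence bins and binary correctness. The function names and the default of 10 bins are illustrative choices, not something the table prescribes.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |avg confidence - accuracy|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, h in bucket if h) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


def brier_score(confidences, correct):
    """Mean squared difference between confidence and the 0/1 outcome."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((c - float(bool(h))) ** 2
               for c, h in zip(confidences, correct)) / len(confidences)
```

An eval step could compare both values against budgets and fail the stage on regression, alongside the reliability diagrams the row mentions.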
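
The Data & Output Drift row relies on PSI thresholds. The table delegates the computation to Evidently, so the pure-Python sketch below is not that library's API. It only illustrates the usual PSI formula, Σ (cᵢ − rᵢ) · ln(cᵢ / rᵢ), over bins derived from the reference sample. The function name, the bin count, the `eps` floor, and the 0.25 alert level (a common rule of thumb) are all assumptions.

```python
import math


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI of `current` against `reference`, using equal-width reference bins."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has zero range")
    width = (hi - lo) / n_bins

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range are clamped to the edge bins.
            idx = max(0, min(int((v - lo) / width), n_bins - 1))
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


PSI_ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb value, not from the table

reference_sample = [float(v) for v in range(100)]
shifted_sample = [v + 50.0 for v in reference_sample]
drifted = population_stability_index(reference_sample, shifted_sample) > PSI_ALERT_THRESHOLD
```

With identical samples the index is exactly zero. The shifted sample above pushes half of the mass out of the lower bins, so `drifted` is `True`.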