Testing Strategy

Entries below are grouped by testing layer. Each entry gives the test type and its purpose, then the implementation strategy and tooling (Tooling) and the lifecycle stage, i.e., where and when it runs (Lifecycle).

Data & Features

Schema & Quality Validation — prevent corrupt data from entering the lake/pipelines.
Tooling: Great Expectations suites (versioned in Git): sensor presence/format; timestamp monotonicity; null/NaN checks; frame count vs. duration; GPS bounds; CAN ranges. The Airflow task hard-fails on critical violations.
Lifecycle: CI/CD & Ops (ingestion, sync/convert, and feature-engineering DAGs).
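
A minimal sketch of the kinds of checks such a suite encodes, written here as plain pandas assertions rather than the Great Expectations API; the column names (timestamp_ns, gps_lat, gps_lon) and bounds are assumptions:

```python
import pandas as pd

GPS_LAT_BOUNDS = (-90.0, 90.0)    # illustrative bounds; real suites live in the versioned GE config
GPS_LON_BOUNDS = (-180.0, 180.0)

def validate_drive_shard(df: pd.DataFrame) -> list[str]:
    """Return the list of critical violations found in one ingested shard."""
    violations = []
    # Required sensor columns must be present before any other check runs.
    for col in ("timestamp_ns", "gps_lat", "gps_lon"):
        if col not in df.columns:
            return [f"missing column: {col}"]
    # No nulls/NaNs in critical fields.
    if df[["timestamp_ns", "gps_lat", "gps_lon"]].isna().any().any():
        violations.append("null/NaN values in critical columns")
    # Timestamps must be monotonically increasing.
    if not df["timestamp_ns"].is_monotonic_increasing:
        violations.append("timestamps not monotonically increasing")
    # GPS coordinates must fall inside plausible bounds.
    if not df["gps_lat"].between(*GPS_LAT_BOUNDS).all():
        violations.append("gps_lat out of bounds")
    if not df["gps_lon"].between(*GPS_LON_BOUNDS).all():
        violations.append("gps_lon out of bounds")
    return violations
```

The ingestion task would raise (hard-fail) whenever this list is non-empty, mirroring a failing Great Expectations checkpoint.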

Sensor Sync & Temporal Alignment — verify multi-sensor alignment.
Tooling: pytest suites on sample shards; cross-sensor time-skew bounds; missing-packet rate; per-sensor clock-drift alerts.
Lifecycle: CI & Staging on each converter change; Ops via a daily monitor.
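
A sketch of the time-skew bound as a pytest, assuming per-sensor timestamp arrays (in seconds) have already been extracted from a sample shard; the 10 ms bound and the sample_shard fixture are illustrative:

```python
import numpy as np

MAX_SKEW_S = 0.010  # illustrative bound: 10 ms between paired camera frames and LiDAR sweeps

def max_pairwise_skew(camera_ts: np.ndarray, lidar_ts: np.ndarray) -> float:
    """For each camera frame, find the nearest LiDAR sweep and return the worst time gap."""
    idx = np.clip(np.searchsorted(lidar_ts, camera_ts), 1, len(lidar_ts) - 1)
    nearest = np.where(
        np.abs(lidar_ts[idx] - camera_ts) < np.abs(lidar_ts[idx - 1] - camera_ts),
        lidar_ts[idx],
        lidar_ts[idx - 1],
    )
    return float(np.max(np.abs(nearest - camera_ts)))

def test_camera_lidar_skew_within_bounds(sample_shard):
    skew = max_pairwise_skew(sample_shard.camera_ts, sample_shard.lidar_ts)
    assert skew <= MAX_SKEW_S, f"camera/LiDAR skew {skew * 1e3:.1f} ms exceeds bound"
```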

Camera/LiDAR Calibration Sanity — catch extrinsic/intrinsic drift.
Tooling: Checkerboard/self-calibration pattern tests; reprojection-error thresholds; focal/principal-point consistency; LiDAR→camera overlay IoU.
Lifecycle: Staging after fleet calibration updates; Ops weekly.
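
The reprojection-error threshold can be phrased roughly as below with OpenCV, assuming detected checkerboard corners and the current intrinsics/extrinsics are available through a fixture (all names are placeholders):

```python
import cv2
import numpy as np

MAX_MEAN_REPROJ_ERR_PX = 0.5  # illustrative threshold, in pixels

def mean_reprojection_error(object_pts, image_pts, rvec, tvec, K, dist) -> float:
    """Project known 3D checkerboard points and compare against the detected 2D corners."""
    projected, _ = cv2.projectPoints(object_pts, rvec, tvec, K, dist)
    errors = np.linalg.norm(projected.reshape(-1, 2) - image_pts.reshape(-1, 2), axis=1)
    return float(errors.mean())

def test_calibration_reprojection_error(calib_sample):
    err = mean_reprojection_error(
        calib_sample.object_pts, calib_sample.image_pts,
        calib_sample.rvec, calib_sample.tvec,
        calib_sample.K, calib_sample.dist,
    )
    assert err <= MAX_MEAN_REPROJ_ERR_PX, f"mean reprojection error {err:.2f} px too high"
```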

PII Redaction Verification — enforce privacy.
Tooling: Automated face/plate detectors run pre- and post-redaction; assert no detected PII remains after blur; sampled human audit.
Lifecycle: Ops post-ingestion job; alerts on failure.

Map/Weather Join Consistency — enrichment integrity.
Tooling: Cross-validate road class/speed limits against the GPS track; weather timestamp tolerance; missing joins reported.
Lifecycle: Staging & Ops (enrichment DAG).

Deduplication & Coverage — remove duplicates, ensure variety.
Tooling: Perceptual hashing / embedding similarity to flag duplicates; route/time/weather distribution checks.
Lifecycle: CI/CD at dataset build; Ops weekly audit.
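
A small sketch of the perceptual-hashing pass, assuming the imagehash and Pillow packages; the 5-bit Hamming-distance threshold is illustrative:

```python
from pathlib import Path

import imagehash
from PIL import Image

NEAR_DUPLICATE_BITS = 5  # illustrative Hamming-distance threshold on the 64-bit pHash

def find_near_duplicates(image_paths: list[Path]) -> list[tuple[Path, Path]]:
    """Flag frame pairs whose perceptual hashes are within a few bits of each other."""
    hashes = [(p, imagehash.phash(Image.open(p))) for p in image_paths]
    duplicates = []
    for i, (path_a, hash_a) in enumerate(hashes):
        for path_b, hash_b in hashes[i + 1:]:
            # Subtracting two ImageHash objects gives their Hamming distance.
            if hash_a - hash_b <= NEAR_DUPLICATE_BITS:
                duplicates.append((path_a, path_b))
    return duplicates
```

At fleet scale the pairwise loop would give way to bucketing on hash prefixes or an approximate-nearest-neighbour index over embeddings.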

Label QA & Agreement — label quality control.
Tooling: Inter-/intra-annotator agreement (Cohen’s κ); adjudication workflow; spot-checks on hard slices; leakage checks across train/val/test splits.
Lifecycle: Labeling Ops & CI/CD, before dataset freeze.
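
The agreement gate on a doubly annotated subset can lean on scikit-learn; the 0.8 floor and the batch fixture are assumptions:

```python
from sklearn.metrics import cohen_kappa_score

MIN_KAPPA = 0.8  # illustrative agreement floor for releasing a labeled batch

def test_inter_annotator_agreement(doubly_labeled_batch):
    """The fixture is assumed to expose both annotators' class labels for the same items."""
    kappa = cohen_kappa_score(
        doubly_labeled_batch.annotator_a, doubly_labeled_batch.annotator_b
    )
    assert kappa >= MIN_KAPPA, f"inter-annotator agreement κ = {kappa:.2f} below {MIN_KAPPA}"
```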

Auto-Label vs. Human Concordance — validate auto-labels.
Tooling: Measure precision/recall of auto-labels on a human-audited subset; require a minimum precision threshold before accepting.
Lifecycle: CI/CD, as the auto-label release gate.

Feature Store Contracts (optional) — temporal/online parity.
Tooling: Feast feature-view tests: backfill correctness, TTLs, online/offline parity checks, point-in-time joins.
Lifecycle: CI/CD & Staging (feature DAGs).

Code & Pipelines

Unit Tests — function-level correctness.
Tooling: pytest and hypothesis (property tests) for transforms, IO, and utils; mocks for S3/DB. Pre-commit hooks (black, isort, mypy).
Lifecycle: Dev & CI, on every push/PR.
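
A property-based test for a transform might look like the following; normalize_intensity is a made-up example of the kind of pure function these tests target:

```python
import numpy as np
from hypothesis import given, strategies as st
from hypothesis.extra.numpy import arrays

def normalize_intensity(points: np.ndarray) -> np.ndarray:
    """Example transform under test: scale LiDAR intensities into [0, 1]."""
    span = points.max() - points.min()
    return np.zeros_like(points) if span == 0 else (points - points.min()) / span

@given(arrays(np.float64,
              shape=st.integers(1, 512),
              elements=st.floats(0, 255, allow_nan=False)))
def test_normalize_intensity_stays_in_unit_range(points):
    out = normalize_intensity(points)
    assert out.shape == points.shape
    assert np.all((out >= 0.0) & (out <= 1.0))
```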

DAG & Pipeline Integration Tests — end-to-end runs on small fixtures.
Tooling: Trigger Airflow DAGs against tiny fixtures in S3; assert that artifacts, metadata rows, and lineage are written.
Lifecycle: Staging (CD), post-deploy.

Idempotency & Retry Semantics — safe re-runs.
Tooling: Re-run tasks with the same inputs and ensure no duplicates or drift; simulate transient S3/DB failures.
Lifecycle: CI & Staging, nightly.
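
One way to assert idempotency is to fingerprint the task's output after two runs over identical inputs; run_feature_task and the fixture are placeholders for the task under test:

```python
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    """Stable fingerprint of a task's output artifact."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def test_feature_task_is_idempotent(tmp_path, fixture_shard):
    out = tmp_path / "features.parquet"
    run_feature_task(fixture_shard, out)   # hypothetical task under test
    first = digest(out)
    run_feature_task(fixture_shard, out)   # re-run with identical inputs
    assert digest(out) == first, "re-running the task changed its output"
```

If the artifact format embeds timestamps or nondeterministic metadata, comparing canonicalized dataframes is more robust than hashing raw bytes.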

IaC Tests — infra correctness.
Tooling: Terratest on Terraform modules; security-group rules; least-privilege IAM; bucket policies.
Lifecycle: CI/CD, on infra PRs.

Model (Offline)

Core Metrics (Perception) — accuracy at the task and slice level.
Tooling: mAP/AP50–95, mIoU, MOT metrics; per-slice evaluation (night/rain/occlusion/construction/road-works). Log to W&B with artifacts.
Lifecycle: CI/CD, in the training pipeline before registry promotion.

Behavioral Tests — learned-logic sanity.
Tooling: Invariance/equivariance (small rotation/crop); monotonicity (closer obstacle → higher risk); kinematic plausibility; temporal consistency (track-ID stability).
Lifecycle: CI/CD, post-train gating in W&B jobs.
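
The monotonicity check, for instance, can be phrased directly as a test; risk_score and scene_at_distance are hypothetical helpers standing in for the model's risk head and a scenario generator:

```python
def test_risk_increases_as_obstacle_gets_closer(model):
    """Behavioral sanity: a closer obstacle should never be scored as less risky."""
    distances_m = [40.0, 30.0, 20.0, 10.0]  # decreasing range to the obstacle
    risks = [risk_score(model, scene_at_distance(d)) for d in distances_m]
    assert all(a <= b for a, b in zip(risks, risks[1:])), (
        f"risk not monotonic in obstacle distance: {risks}"
    )
```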

Calibration Tests — reliable confidence.
Tooling: ECE/Brier, reliability diagrams; threshold tuning for safety predicates.
Lifecycle: CI/CD eval step; store plots to W&B.
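
Expected calibration error can be computed directly from held-out confidences and correctness flags; a sketch with 15 equal-width bins (a common default, but an assumption here):

```python
import numpy as np

def expected_calibration_error(confidences: np.ndarray,
                               correct: np.ndarray,
                               n_bins: int = 15) -> float:
    """Weighted average gap between mean confidence and accuracy within each confidence bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not np.any(in_bin):
            continue
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.mean() * gap   # weight by the fraction of samples in the bin
    return float(ece)
```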

Robustness (Corruptions/Weather) — degrade gracefully.
Tooling: Corruption suite (blur, noise, fog, rain); adversarial occlusions; report per-corruption deltas and floors.
Lifecycle: CI/CD robustness stage; promotion blocked on regressions.

OOD & Uncertainty — detect unknowns.
Tooling: OOD detectors (energy score/Mahalanobis) or ensemble disagreement; assert high uncertainty on unseen domains.
Lifecycle: CI/CD eval stage; log to W&B.
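
The energy score mentioned above is the negative log-sum-exp of the logits, so OOD inputs should receive higher energy than in-distribution ones; the separation margin and the logit fixtures below are illustrative:

```python
import numpy as np
from scipy.special import logsumexp

def energy_score(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """E(x) = -T * logsumexp(logits / T); higher energy means more OOD-like."""
    return -temperature * logsumexp(logits / temperature, axis=-1)

def test_ood_energy_separation(in_domain_logits, unseen_domain_logits):
    margin = 1.0  # illustrative separation requirement between domains
    assert (energy_score(unseen_domain_logits).mean()
            >= energy_score(in_domain_logits).mean() + margin)
```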

Regression on Golden Scenes — no surprises.
Tooling: Fixed golden drive clips (the failure hall-of-fame); assert non-degradation beyond tolerances.
Lifecycle: CI/CD, as a mandatory gate.
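
The gate itself can be a small comparison against stored baseline scores; the metric layout and the half-point tolerance are placeholders:

```python
import json
from pathlib import Path

TOLERANCE = 0.5  # allowed drop per golden scene, in absolute metric points (illustrative)

def test_no_regression_on_golden_scenes(candidate_metrics: dict, baseline_path: Path):
    """candidate_metrics maps golden-scene id -> score for the candidate model."""
    baseline = json.loads(baseline_path.read_text())
    regressions = {
        scene: (baseline[scene], score)
        for scene, score in candidate_metrics.items()
        if score < baseline[scene] - TOLERANCE
    }
    assert not regressions, f"golden-scene regressions beyond tolerance: {regressions}"
```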

Size/Latency Footprint (Target HW) — fit and fast.
Tooling: Export TorchScript/ONNX → TensorRT; measure p50/p95 latency, GPU memory, and throughput; budget checks.
Lifecycle: CI/CD (staging HW), at the packaging step.
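
A generic timing harness for the p50/p95 budget check; infer stands in for the exported engine's inference call, and the budgets are illustrative:

```python
import time
import numpy as np

P50_BUDGET_MS = 20.0   # illustrative latency budgets for the target hardware
P95_BUDGET_MS = 35.0

def measure_latency_ms(infer, sample, warmup: int = 50, iters: int = 500):
    """Time repeated single-sample inference after a warm-up period."""
    for _ in range(warmup):
        infer(sample)
    timings_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        infer(sample)
        timings_ms.append((time.perf_counter() - start) * 1e3)
    return np.percentile(timings_ms, 50), np.percentile(timings_ms, 95)

def test_latency_budget(infer, sample):
    p50, p95 = measure_latency_ms(infer, sample)
    assert p50 <= P50_BUDGET_MS and p95 <= P95_BUDGET_MS, f"p50={p50:.1f} ms, p95={p95:.1f} ms"
```

For GPU engines the timed call must synchronize (for example by copying outputs back to host) so the timer measures completed work rather than kernel launches.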

Model (Pre-Prod / Replay / Sim)

Drive Replay — re-evaluation on historical incidents.
Tooling: Re-run the model on incident logs; compare against baseline predictions; must reduce misses/false alarms.
Lifecycle: Staging, as a gate before canary.

Simulation Scenarios — safety perturbations.
Tooling: Scenario fuzzing (distance, speed, lighting); evaluate safety KPIs and predicates.
Lifecycle: Staging, as part of the promotion evidence pack.

Infrastructure & Serving

API Contract & Smoke Test — endpoint health.
Tooling: pytest against Triton/FastAPI: /healthz returns 200, sample request/response schema, version headers.
Lifecycle: Staging & Prod (CD), post-deploy; auto-rollback on failure.
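
A post-deploy smoke test can stay very small; the endpoint path and the response fields beyond /healthz are assumptions about the serving contract:

```python
import requests

BASE_URL = "http://model-gateway.staging.internal"  # placeholder endpoint

def test_healthz():
    assert requests.get(f"{BASE_URL}/healthz", timeout=5).status_code == 200

def test_inference_contract():
    payload = {"frame_id": "smoke-0001", "image_b64": "..."}  # tiny canned sample
    resp = requests.post(f"{BASE_URL}/v1/infer", json=payload, timeout=10)
    assert resp.status_code == 200
    body = resp.json()
    # Contract fields clients depend on (names assumed for illustration).
    assert {"detections", "model_version"} <= body.keys()
    assert resp.headers.get("x-model-version") == body["model_version"]
```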

Performance & Load — latency/throughput SLOs.
Tooling: Locust/k6 with step and spike loads; measure p50/p95/p99 latency and max sustainable QPS; tune Triton dynamic batching.
Lifecycle: Staging (CD), as a gate before taking traffic.
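
A minimal Locust user for the step and spike profiles, reusing the same hypothetical /v1/infer endpoint:

```python
from locust import HttpUser, task, between

class InferenceUser(HttpUser):
    """Simulated client issuing inference requests at a modest per-user rate."""
    wait_time = between(0.05, 0.2)  # seconds between requests per simulated user

    @task
    def infer(self):
        payload = {"frame_id": "load-test", "image_b64": "..."}  # canned sample
        with self.client.post("/v1/infer", json=payload, catch_response=True) as resp:
            if resp.status_code != 200:
                resp.failure(f"unexpected status {resp.status_code}")
```

Step and spike shapes can then be driven by Locust's LoadTestShape (or the k6 equivalent) rather than a flat user count.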

Soak & Stability — long-run resilience.
Tooling: 24–48 h soak; monitor memory growth, GPU fragmentation, and file-descriptor leaks.
Lifecycle: Staging, nightly/weekly.

Chaos & Fault Injection — degrade gracefully.
Tooling: Pod/node kills, network jitter, S3 throttling (Toxiproxy); validate backoff, retries, and fallbacks (e.g., single-camera mode).
Lifecycle: Staging, monthly game-days.

Security/Compliance — supply chain and posture.
Tooling: Trivy/Grype scans, SBOM generation, container signing (cosign), least-privilege IAM checks; secrets scanning.
Lifecycle: CI/CD, as image/IaC gates.

Monitoring & Post-Deployment

Production Smoke & Canary Checks — safe rollout.
Tooling: Shadow comparison against the baseline; canary at 1–5% of traffic; automated abort on SLO or safety-predicate breach.
Lifecycle: Prod (CD), via the rollout controller.

Data & Output Drift — detect change.
Tooling: Evidently daily batch on features/outputs vs. a reference window; PSI/KS thresholds; slice-level drift dashboards; alerts to Slack/PagerDuty.
Lifecycle: Prod monitoring, via Airflow-scheduled jobs.
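
The PSI gate can also be reproduced outside Evidently in a few lines of numpy; the 0.2 alert threshold is a common rule of thumb but still an assumption here:

```python
import numpy as np

PSI_ALERT_THRESHOLD = 0.2  # rule of thumb: above ~0.2 suggests significant distribution shift

def population_stability_index(reference: np.ndarray,
                               current: np.ndarray,
                               n_bins: int = 10) -> float:
    """PSI between a reference feature distribution and the current production window."""
    edges = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1))
    # Clip the current window into the reference range so outliers land in the edge bins.
    current = np.clip(current, edges[0], edges[-1])
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    eps = 1e-6  # avoid log(0) for empty bins
    ref_frac, cur_frac = ref_frac + eps, cur_frac + eps
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))
```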

Online Slice Metrics — safety where it matters.
Tooling: Near-real-time metrics per slice (night/rain/…); raise tickets when slice recall drops by more than X points.
Lifecycle: Prod monitoring, via Grafana and W&B reports.

Continual Learning Triggers — close the loop.
Tooling: On drift/failure-bucket thresholds, open labeling tasks and enqueue the retrain DAG with the new slices.
Lifecycle: Prod → CI/CD, automated but approval-gated.

Testing in Production (Safety Predicates) — live guardrails.
Tooling: Real-time predicates (e.g., maximum tolerated miss rate on the pedestrian class, confidence floors); auto-fallback/disable signals.
Lifecycle: Prod, always on; audited weekly.
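
A safety predicate is ultimately a continuously evaluated boolean over a sliding window of online metrics; a sketch of one such predicate, with the window size and miss-rate ceiling as assumptions:

```python
from collections import deque

class PedestrianMissRatePredicate:
    """Trips the fallback signal when the rolling pedestrian miss rate exceeds a ceiling."""

    def __init__(self, max_miss_rate: float = 0.01, window: int = 10_000):
        self.max_miss_rate = max_miss_rate    # illustrative ceiling
        self.outcomes = deque(maxlen=window)  # True = audited event where a pedestrian was missed

    def record(self, missed: bool) -> None:
        self.outcomes.append(missed)

    def healthy(self) -> bool:
        if not self.outcomes:
            return True
        return (sum(self.outcomes) / len(self.outcomes)) <= self.max_miss_rate
```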

OTA & Edge

Package Integrity & Compatibility — safe OTA.
Tooling: Signature verification, version/ABI checks, rollback test; disk/compute footprint limits; warm-start latency.
Lifecycle: Pre-prod lab & staged OTA, before phased rollout.

Hardware-in-the-Loop (HIL) Sanity — target realism.
Tooling: Bench tests on representative hardware; latency and throughput under thermal/power constraints.
Lifecycle: Pre-prod, release gating.

Governance & Audit

Lineage & Registry Completeness — who/what/when.
Tooling: Weights & Biases runs, artifacts, datasets, model versions, and approvals; dataset hashes (DVC) linked in registry entries.
Lifecycle: CI/CD & Prod, on every promotion.

Datasheets & Model Cards — compliance documentation.
Tooling: Auto-generate with metrics, slices, risks, mitigations, and known gaps; store alongside the registry version.
Lifecycle: CI/CD, at the promotion step.

Incident RCA Pack — learn fast.
Tooling: Bundle logs, traces, frames, saliency, and predicates; publish corrective actions and tracking issues.
Lifecycle: On incident, within SLA (e.g., 72 h).