# Testing Strategy

## Data & Features

| Test Type / Purpose | Implementation Strategy & Tooling | Lifecycle Stage (Where & When) |
|---|---|---|
| Schema & Quality Validation — prevent corrupt data from entering the lake/pipelines | Great Expectations suites (versioned in Git): sensor presence/format; timestamp monotonicity; null/NaN checks; frame count vs. duration; GPS bounds; CAN ranges. Airflow task hard-fails on critical violations. | CI/CD & Ops: Ingestion, sync/convert, and feature-engineering DAGs. |
| Sensor Sync & Temporal Alignment — verify multi-sensor alignment | pytest on sample shards; cross-sensor time-skew bounds; missing-packet rate; per-sensor clock-drift alerts. | CI & Staging: On each converter change; Ops: daily monitor. |
| Camera/LiDAR Calibration Sanity — catch extrinsic/intrinsic drift | Checkerboard/self-calibration pattern tests; reprojection-error thresholds; focal/principal-point consistency; LiDAR→camera overlay IoU. | Staging: After fleet calibration updates; Ops: weekly. |
| PII Redaction Verification — enforce privacy | Automated face/plate detectors pre- and post-redaction; assert no detected PII after blur; sampled human audit. | Ops: Post-ingestion job; alerts on failure. |
| Map/Weather Join Consistency — enrichment integrity | Cross-validate road class/speed limits against the GPS track; weather timestamp tolerance; missing joins reported. | Staging & Ops: Enrichment DAG. |
| Deduplication & Coverage — remove duplicates, ensure variety | Perceptual hashing / embedding similarity to flag duplicates; route/time/weather distribution checks. | CI/CD: Dataset build; Ops: weekly audit. |
| Label QA & Agreement — label quality control | Inter-/intra-annotator agreement (Cohen’s κ); adjudication workflow; spot-check hard slices; leakage checks (train/val/test). | Labeling Ops & CI/CD: Before dataset freeze. |
| Auto-Label vs. Human Concordance — validate auto-labels | Measure precision/recall of auto-labels on a human-audited subset; set a minimum precision threshold before accepting. | CI/CD: Auto-label release gate. |
| Feature Store Contracts (optional) — temporal/online parity | Feast feature-view tests: backfill correctness, TTLs, online/offline parity checks, point-in-time joins. | CI/CD & Staging: Feature DAGs. |
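Two of the checks in this layer — timestamp monotonicity and cross-sensor time skew — reduce to simple assertions over timestamp streams. A minimal pytest-style sketch; the function names, index-aligned streams, and the 50 ms skew bound are illustrative assumptions, not project code:

```python
# Sketch of two data-quality checks: timestamp monotonicity and
# cross-sensor time skew. Timestamps are nanoseconds; the 50 ms
# tolerance below is an illustrative assumption.

def is_monotonic_increasing(timestamps_ns: list[int]) -> bool:
    """True when every timestamp is strictly later than the previous one."""
    return all(b > a for a, b in zip(timestamps_ns, timestamps_ns[1:]))

def max_cross_sensor_skew_ns(streams: dict[str, list[int]]) -> int:
    """Worst-case skew between matched frames across sensor streams.

    Assumes streams are already index-aligned (frame i in each stream
    belongs to the same capture instant).
    """
    n = min(len(ts) for ts in streams.values())
    skew = 0
    for i in range(n):
        instants = [ts[i] for ts in streams.values()]
        skew = max(skew, max(instants) - min(instants))
    return skew

# Example: camera vs. LiDAR clocks within a 50 ms tolerance.
camera = [0, 100_000_000, 200_000_000]
lidar = [10_000_000, 110_000_000, 215_000_000]
assert is_monotonic_increasing(camera)
assert max_cross_sensor_skew_ns({"camera": camera, "lidar": lidar}) <= 50_000_000
```

In a real suite these would run over sample shards pulled by the converter-change CI job, with the skew bound set per sensor pair.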
## Code & Pipelines

| Test Type / Purpose | Implementation Strategy & Tooling | Lifecycle Stage (Where & When) |
|---|---|---|
| Unit Tests — function-level correctness | pytest, hypothesis (property tests) for transforms, IO, utils; mocks for S3/DB. Pre-commit hooks (black, isort, mypy). | Dev & CI: Every push/PR. |
| DAG & Pipeline Integration Tests — E2E on small fixtures | Trigger Airflow DAGs against tiny fixtures in S3; assert artifacts, metadata rows, and lineage are written. | Staging (CD): Post-deploy. |
| Idempotency & Retry Semantics — safe re-runs | Re-run tasks with the same inputs and ensure no duplicates or drift; simulate transient S3/DB failures. | CI & Staging: Nightly. |
| IaC Tests — infra correctness | Terratest on Terraform modules; security-group rules; least-privilege IAM; bucket policies. | CI/CD: Infra PRs. |
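The idempotency check in this layer can be sketched with a keyed upsert against an in-memory stand-in for a warehouse table. The `record_id` dedup key and the dict "sink" are illustrative assumptions:

```python
# Sketch of an idempotency test: running the same task twice with the
# same inputs must not duplicate output rows or change their content.

def upsert_batch(sink: dict[str, dict], batch: list[dict],
                 dedup_key: str = "record_id") -> None:
    """Idempotent write: keyed upsert instead of blind append."""
    for row in batch:
        sink[row[dedup_key]] = row

def test_rerun_is_idempotent():
    sink: dict[str, dict] = {}
    batch = [{"record_id": "a", "v": 1}, {"record_id": "b", "v": 2}]
    upsert_batch(sink, batch)
    first = dict(sink)
    upsert_batch(sink, batch)  # simulated retry with identical inputs
    assert sink == first       # no duplicates, no drift

test_rerun_is_idempotent()
```

The same pattern extends to S3 artifacts: write to a deterministic key derived from the inputs, so a retried task overwrites rather than appends.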
## Model (Offline)

| Test Type / Purpose | Implementation Strategy & Tooling | Lifecycle Stage (Where & When) |
|---|---|---|
| Core Metrics (Perception) — accuracy at task & slice | mAP/AP50–95, mIoU, MOT metrics; per-slice eval (night/rain/occlusion/construction/road-works). Log to W&B with artifacts. | CI/CD: Training pipeline before registry. |
| Behavioral Tests — learned logic sanity | Invariance/equivariance (small rotation/crop); monotonicity (closer obstacle → higher risk); kinematic plausibility; temporal consistency (track-ID stability). | CI/CD: Post-train gating in W&B jobs. |
| Calibration Tests — reliable confidence | ECE/Brier, reliability diagrams; threshold tuning for safety predicates. | CI/CD: Eval step; store plots to W&B. |
| Robustness (Corruptions/Weather) — degrade gracefully | Corruption suite (blur, noise, fog, rain); adversarial occlusions; report per-corruption deltas and floors. | CI/CD: Robustness stage; promotion blocked on regressions. |
| OOD & Uncertainty — detect unknowns | OOD detectors (energy score/Mahalanobis) or ensemble disagreement; assert high uncertainty on unseen domains. | CI/CD: Eval stage; log to W&B. |
| Regression on Golden Scenes — no surprises | Fixed golden drive clips (failure hall of fame); assert non-degradation beyond tolerances. | CI/CD: Mandatory gate. |
| Size/Latency Footprint (Target HW) — fit & fast | Export TorchScript/ONNX → TensorRT; measure p50/p95 latency, GPU memory, throughput; budget checks. | CI/CD (Staging HW): Packaging step. |
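The ECE metric behind the calibration gate is a short computation: bucket predictions into equal-width confidence bins, then sum the bin-size-weighted gaps between accuracy and mean confidence. A minimal sketch; the bin count is an illustrative default:

```python
# Expected calibration error (ECE) over equal-width confidence bins.
# confidences: predicted probabilities in [0, 1]; correct: per-sample
# booleans for whether the prediction was right.

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece

# Degenerate, perfectly calibrated case: always right at confidence 1.0.
assert expected_calibration_error([1.0, 1.0], [True, True]) == 0.0
```

A CI gate would assert the result stays below a fixed ceiling (and per safety-critical slice, not just globally) before thresholds feed the safety predicates.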
## Model (Pre-Prod / Replay / Sim)

| Test Type / Purpose | Implementation Strategy & Tooling | Lifecycle Stage (Where & When) |
|---|---|---|
| Drive Replay — historical incident re-eval | Re-run the model on incident logs; compare against baseline predictions; must reduce misses and false alarms. | Staging: Gate before canary. |
| Simulation Scenarios — safety perturbations | Scenario fuzzing (distance, speed, lighting); evaluate safety KPIs and predicates. | Staging: Promotion evidence pack. |
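The replay gate amounts to a non-regression comparison: the candidate must not increase misses or false alarms relative to the baseline on the same incident frames. A sketch, using per-frame boolean detection flags as an illustrative simplification of real replay outputs:

```python
# Replay non-regression gate: compare candidate vs. baseline error
# counts on the same ground-truth incident frames.

def count_errors(predictions: list[bool], ground_truth: list[bool]) -> tuple[int, int]:
    """Returns (misses, false_alarms) for per-frame detection flags."""
    misses = sum(1 for p, g in zip(predictions, ground_truth) if g and not p)
    false_alarms = sum(1 for p, g in zip(predictions, ground_truth) if p and not g)
    return misses, false_alarms

def replay_gate(candidate, baseline, ground_truth) -> bool:
    """Pass only if the candidate is no worse on both error types."""
    cand = count_errors(candidate, ground_truth)
    base = count_errors(baseline, ground_truth)
    return cand[0] <= base[0] and cand[1] <= base[1]

truth     = [True, True, False, True]
baseline  = [True, False, True, True]   # 1 miss, 1 false alarm
candidate = [True, True, False, True]   # 0 misses, 0 false alarms
assert replay_gate(candidate, baseline, truth)
```

The "must reduce" wording in the table suggests a strict inequality on incident logs; the `<=` here is the looser no-regression variant and is an assumption.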
## Infrastructure & Serving

| Test Type / Purpose | Implementation Strategy & Tooling | Lifecycle Stage (Where & When) |
|---|---|---|
| API Contract & Smoke Test — endpoint health | pytest against Triton/FastAPI: health/readiness probes, request/response schema, and smoke inference on canned inputs. | Staging & Prod (CD): Post-deploy; auto-rollback on fail. |
| Performance & Load — latency/throughput SLOs | Locust/k6: step and spike loads; measure p50/p95/p99 latency and max sustainable QPS; Triton dynamic-batching tuning. | Staging (CD): Gate before traffic. |
| Soak & Stability — long-run resilience | 24–48 h soak; monitor memory growth, GPU fragmentation, file-descriptor leaks. | Staging: Nightly/weekly. |
| Chaos & Fault Injection — degrade gracefully | Pod/node kill, network jitter, S3 throttling (Toxiproxy); validate backoff, retries, fallbacks (e.g., single-camera mode). | Staging: Monthly game days. |
| Security/Compliance — supply chain & posture | Trivy/Grype scans, SBOM, container signing (cosign), least-privilege IAM checks; secrets scanning. | CI/CD: Image/IaC gates. |
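The latency-budget check behind the load-test gate can be sketched with a nearest-rank percentile over measured request latencies. The 50 ms / 120 ms budgets below are illustrative assumptions, not real SLOs:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile, pct in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]

def within_budget(latencies_ms: list[float],
                  p50_budget: float = 50.0,
                  p95_budget: float = 120.0) -> bool:
    """SLO gate: both percentile ceilings must hold."""
    return (percentile(latencies_ms, 50) <= p50_budget
            and percentile(latencies_ms, 95) <= p95_budget)

# 95 fast requests plus a few slow outliers still fit the sketch budget.
samples = [20.0] * 95 + [100.0] * 5
assert within_budget(samples)
```

In practice the samples would come from a Locust/k6 results export, with separate budgets per load step, and the gate would fail the pipeline rather than raise an assertion.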
## Monitoring & Post-Deployment

| Test Type / Purpose | Implementation Strategy & Tooling | Lifecycle Stage (Where & When) |
|---|---|---|
| Production Smoke & Canary Checks — safe rollout | Shadow compare vs. baseline; canary at 1–5% traffic; automated abort on SLO or safety-predicate breach. | Prod (CD): Rollout controller. |
| Data & Output Drift — detect change | Evidently daily batch on features/outputs vs. reference; PSI/KS thresholds; slice-level drift dashboards; alerts to Slack/PagerDuty. | Prod Monitoring: Airflow scheduled jobs. |
| Online Slice Metrics — safety where it matters | Near-real-time metrics per slice (night/rain/…); raise tickets when slice recall drops by more than X points. | Prod Monitoring: Grafana + W&B reports. |
| Continual Learning Triggers — close the loop | On drift/failure-bucket thresholds, open labeling tasks; enqueue the retrain DAG with new slices. | Prod → CI/CD: Automated but approval-gated. |
| Testing in Production (Safety Predicates) — live guardrails | Real-time predicates (e.g., max tolerated miss rate on the pedestrian class, confidence floors); auto-fallback/disable signals. | Prod: Always on; audited weekly. |
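The PSI threshold in the drift row is a small formula: over shared histogram bins, sum (production − reference) × ln(production / reference). A sketch, with the common 0.2 alert threshold used here as an illustrative assumption:

```python
import math

def psi(reference: list[float], production: list[float],
        eps: float = 1e-6) -> float:
    """Population Stability Index.

    Both inputs are per-bin proportions computed over the same bin
    edges; eps guards against empty bins in the log ratio.
    """
    total = 0.0
    for p_ref, p_prod in zip(reference, production):
        p = max(p_ref, eps)
        q = max(p_prod, eps)
        total += (q - p) * math.log(q / p)
    return total

identical = [0.25, 0.25, 0.25, 0.25]
assert psi(identical, identical) == 0.0
shifted = [0.10, 0.20, 0.30, 0.40]
assert psi(identical, shifted) > 0.0  # drift registers as positive PSI
```

Each term is non-negative, so PSI only grows as the production distribution moves away from the reference; the daily Airflow job would compare it per feature and per slice against the alert threshold.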
## OTA & Edge

| Test Type / Purpose | Implementation Strategy & Tooling | Lifecycle Stage (Where & When) |
|---|---|---|
| Package Integrity & Compatibility — safe OTA | Signature verification, version/ABI checks, rollback test; disk/compute footprint limits; warm-start latency. | Pre-Prod Lab & Staged OTA: Before phased rollout. |
| Hardware-in-the-Loop (HIL) Sanity — target realism | Bench tests on representative hardware; latency and throughput under thermal/power constraints. | Pre-Prod: Release gating. |
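The package-integrity step can be sketched as a digest-and-size check against a manifest before install. The manifest layout is an assumption, and a real pipeline would additionally verify a cryptographic signature (e.g. with cosign) rather than a bare hash:

```python
import hashlib

def package_digest(payload: bytes) -> str:
    """SHA-256 hex digest of the OTA payload."""
    return hashlib.sha256(payload).hexdigest()

def verify_package(payload: bytes, manifest: dict) -> bool:
    """True only when size and digest both match the manifest."""
    return (len(payload) == manifest["size_bytes"]
            and package_digest(payload) == manifest["sha256"])

payload = b"model-v2.plan"
manifest = {"size_bytes": len(payload), "sha256": package_digest(payload)}
assert verify_package(payload, manifest)
assert not verify_package(payload + b"tampered", manifest)  # refuse install
```

On mismatch the updater keeps the current version, which is also what the rollback test in the table exercises.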
## Governance & Audit

| Test Type / Purpose | Implementation Strategy & Tooling | Lifecycle Stage (Where & When) |
|---|---|---|
| Lineage & Registry Completeness — who/what/when | Weights & Biases: runs, artifacts, datasets, model versions, approvals; dataset hashes (DVC) linked in registry entries. | CI/CD & Prod: Every promotion. |
| Datasheets & Model Cards — compliance docs | Auto-generate with metrics, slices, risks, mitigations, and known gaps; store alongside the registry version. | CI/CD: Promotion step. |
| Incident RCA Pack — learn fast | Bundle logs, traces, frames, saliency, and predicates; publish corrective actions and tracking issues. | On Incident: Within SLA (e.g., 72 h). |
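The model-card auto-generation step could be wired up along these lines, rendering a registry record to markdown stored next to the model version. The record fields and names are illustrative assumptions:

```python
# Sketch: render a registry record into a markdown model card.
# The field names ("metrics", "known_gaps", ...) are assumptions.

def render_model_card(record: dict) -> str:
    lines = [f"# Model Card: {record['name']} {record['version']}", ""]
    lines.append("## Metrics")
    for metric, value in sorted(record["metrics"].items()):
        lines.append(f"- {metric}: {value}")
    lines.append("")
    lines.append("## Known Gaps")
    for gap in record["known_gaps"]:
        lines.append(f"- {gap}")
    return "\n".join(lines)

card = render_model_card({
    "name": "pedestrian-detector",
    "version": "v3.1",
    "metrics": {"mAP": 0.61, "night_recall": 0.54},
    "known_gaps": ["degrades in heavy fog"],
})
assert card.startswith("# Model Card: pedestrian-detector v3.1")
```

Running this in the promotion step keeps the card in lockstep with the registry entry, so the compliance document can never describe a version other than the one being promoted.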