ML Problem Framing

1) Understanding the Business Objective
Objective: deliver safer, more reliable driver-assistance for cars and trucks by increasing perception robustness and reducing intervention events—without exploding cloud costs or iteration time.
Stakeholder alignment:
Product: safety KPIs first, low-latency UX (real-time alerts, <100–150 ms p95 from sensor ingest to decision).
Operations/Program: predictable rollouts, gated by safety thresholds and audits.
Data/ML: fast iteration loops, reproducible experiments, clear failure mining for edge cases.
MLOps/Infra: scalable pipelines, cost controls, observability, and security-by-default.
Compliance/Legal: privacy, traceability, explainability commensurate with safety review.
Explainability baseline: agree that perception models must support post-hoc explanations for incident review (e.g., SHAP for tabular/CAN features, saliency/Grad-CAM for vision). Define reference baselines (e.g., median-frame conditions) to anchor explanations and audits.
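A minimal sketch of the vision-side explanation path, assuming a PyTorch classification-style head; the model interface and target class index are illustrative, not the production explainability tooling:

```python
# Gradient-saliency sketch for incident review (assumes a PyTorch vision model
# that returns per-class logits; model and target_class are placeholders).
import torch

def saliency_map(model: torch.nn.Module, frame: torch.Tensor, target_class: int) -> torch.Tensor:
    """Return a per-pixel |d score / d input| heatmap for one frame of shape (C, H, W)."""
    model.eval()
    x = frame.unsqueeze(0).requires_grad_(True)   # add batch dim, track gradients
    score = model(x)[0, target_class]             # scalar logit for the class under review
    score.backward()                              # gradients w.r.t. the input pixels
    return x.grad.abs().max(dim=1)[0].squeeze(0)  # collapse channels -> (H, W) heatmap
```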
2) Is Machine Learning the Right Approach? (Use-Case Evaluation)
Why ML here: perception across cameras/lidar/radar is a high-dimensional pattern recognition problem; rules alone are brittle. Failure modes evolve (weather, construction, vehicle styles), requiring continual learning.
When ML wins (our case):
Complex, non-linear patterns (multi-sensor fusion).
Scale (millions of frames; long operational horizon).
Evolving environment (drift) → retraining required.
Guardrails & baselines:
Start with robust non-ML heuristics for sanity checks (e.g., speed gates, sensor drop detection).
Gate ML predictions with safety predicates and shadow mode before activation.
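A minimal sketch of the gating idea, assuming per-frame sensor-health and speed signals are available; the predicates, thresholds, and logging hook are illustrative, not the production safety spec:

```python
# Illustrative safety gate: heuristics can veto or "shadow" an ML alert before
# it reaches the driver. Field names and thresholds are placeholders.
from dataclasses import dataclass

@dataclass
class FrameContext:
    ego_speed_mps: float     # from CAN
    camera_ok: bool          # sensor health checks
    radar_ok: bool
    ml_alert: bool           # raw model output
    ml_confidence: float

def log_shadow_decision(ctx: FrameContext, decision: bool) -> None:
    # Placeholder: in practice this would go to fleet telemetry.
    print(f"[shadow] alert={decision} conf={ctx.ml_confidence:.2f}")

def gated_alert(ctx: FrameContext, shadow_mode: bool = True) -> bool:
    """Return True only if the ML alert passes all safety predicates."""
    sensors_healthy = ctx.camera_ok and ctx.radar_ok
    speed_gate = ctx.ego_speed_mps > 2.0          # ignore near-standstill noise
    confident = ctx.ml_confidence >= 0.7
    decision = ctx.ml_alert and sensors_healthy and speed_gate and confident
    if shadow_mode:
        # Log the would-be decision for offline comparison; never act on it.
        log_shadow_decision(ctx, decision)
        return False
    return decision
```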
Data Engine requirement: design a closed loop to collect → curate → label → train → deploy → monitor → retrain, with fleet triggers and targeted data mining to shorten time-to-improvement.
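One way the fleet-trigger side of that loop could be expressed; the signal names and thresholds are assumptions for illustration:

```python
# Sketch of fleet-trigger rules that decide which drive clips enter the
# curate -> label -> train loop. Signal names and thresholds are illustrative.
from typing import Callable, Dict

Signals = Dict[str, float]   # per-clip aggregated signals from the vehicle

TRIGGERS: Dict[str, Callable[[Signals], bool]] = {
    "harsh_brake":       lambda s: s.get("max_decel_mps2", 0.0) > 4.0,
    "driver_takeover":   lambda s: s.get("takeover_count", 0.0) >= 1,
    "low_ml_confidence": lambda s: s.get("min_confidence", 1.0) < 0.4,
    "sensor_dropout":    lambda s: s.get("dropped_frames", 0.0) > 10,
}

def fired_triggers(signals: Signals) -> list[str]:
    """Return the names of all triggers that fire for one clip."""
    return [name for name, rule in TRIGGERS.items() if rule(signals)]

# Example: this clip would be queued for curation and (auto-)labeling.
print(fired_triggers({"max_decel_mps2": 5.1, "min_confidence": 0.35}))
# -> ['harsh_brake', 'low_ml_confidence']
```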
3) Defining the ML Problem
Product ideal outcome: fewer risky situations and fewer driver interventions at the same or better comfort level.
Model goals (decomposed):
Perception (primary): multi-task detection & segmentation (vehicles, pedestrians, cyclists, lanes, drivable area), depth/BEV occupancy, and tracking.
Event understanding (secondary): cut-ins, harsh braking ahead, lane closures, stationary hazards.
Confidence & uncertainty: calibrated scores feeding planners/alerts.
Inputs (multi-modal):
Camera (multi-view video), Radar/Lidar (where available), IMU/GNSS/CAN.
Context features: weather/time, road class, speed limits (if available).
Outputs & task types:
Object detection/instance segmentation (multi-class, multiple instances per frame).
BEV occupancy/semantic map (dense prediction).
Time-to-collision / proximity risk (regression), event flags (classification).
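A compact sketch of how these outputs could be typed at the perception/planning boundary; the field names and shapes are assumptions, not a finalized interface:

```python
# Illustrative output schema for the multi-task perception stack.
# Field names and shapes are assumptions, not a finalized interface.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class PerceptionOutput:
    boxes: np.ndarray           # (N, 4) detected boxes in image or BEV coordinates
    classes: np.ndarray         # (N,) class ids (vehicle, pedestrian, cyclist, ...)
    scores: np.ndarray          # (N,) calibrated confidences in [0, 1]
    instance_masks: np.ndarray  # (N, H, W) instance segmentation masks
    bev_occupancy: np.ndarray   # (H_bev, W_bev) dense occupancy / semantic map
    ttc_seconds: np.ndarray     # (N,) time-to-collision regression per object
    event_flags: dict = field(default_factory=dict)  # e.g. {"cut_in": True}
```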
Key issues to address:
Long-tail & class imbalance: rare but high-impact scenarios (night rain, occlusions, construction workers directing traffic).
Domain shift/drift: seasonal/weather/geographic variation.
Label scarcity: use auto-labeling, weak supervision, and targeted human QA.
Throughput & latency: train at scale; serve under tight budgets and on constrained edge targets where applicable.
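Because long-tail progress is judged on named scenario slices, a minimal per-slice recall sketch follows; the slice tags and sample format are illustrative:

```python
# Per-slice recall sketch for long-tail monitoring. Each eval sample carries
# scenario tags (e.g. "night", "rain"); tag names are illustrative.
from collections import defaultdict

def per_slice_recall(samples: list[dict]) -> dict[str, float]:
    """samples: [{"tags": ["night", "rain"], "tp": 3, "fn": 1}, ...]"""
    tp, fn = defaultdict(int), defaultdict(int)
    for s in samples:
        for tag in s["tags"]:
            tp[tag] += s["tp"]
            fn[tag] += s["fn"]
    return {tag: tp[tag] / (tp[tag] + fn[tag]) for tag in tp if tp[tag] + fn[tag] > 0}

print(per_slice_recall([
    {"tags": ["night", "rain"], "tp": 8, "fn": 4},
    {"tags": ["night"], "tp": 9, "fn": 1},
]))  # -> {'night': 0.77..., 'rain': 0.66...}
```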
4) Assessing Feasibility & Risks
Data readiness
Raw sensor coverage is sufficient, but rare events are under-represented → plan trigger-based mining and similarity search.
Labeling cost is high → use a hybrid strategy (auto-labeling plus human verification on hard slices).
Technical constraints
Latency: real-time perception budget (<30–60 ms per frame for core heads; overall pipeline <100–150 ms p95).
Memory/compute: fit models to deployment targets (quantization/distillation where needed).
Robustness: enforce sensor health checks; degrade gracefully to fewer modalities if a sensor drops.
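A small sketch of the graceful-degradation rule: select the fusion path from whichever modalities pass health checks. The modality sets and fallback order are assumptions:

```python
# Graceful degradation sketch: pick a fusion mode from healthy sensors only.
# The modality sets and fallback order are illustrative assumptions.

FALLBACK_ORDER = [
    ({"camera", "lidar", "radar"}, "full_fusion"),
    ({"camera", "radar"},          "camera_radar"),
    ({"camera"},                   "camera_only_degraded"),
]

def select_mode(healthy: set[str]) -> str:
    for required, mode in FALLBACK_ORDER:
        if required <= healthy:         # all required sensors are healthy
            return mode
    return "safe_stop_request"          # no viable perception path

print(select_mode({"camera", "radar"}))  # -> 'camera_radar'
print(select_mode({"radar"}))            # -> 'safe_stop_request'
```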
Operational risk
Safety gating (shadow mode, A/B rollouts at small traffic percentages, fast rollback).
Strict experiment traceability (datasets, code, hyperparams) with Weights & Biases (WandB) for runs/artifacts, and Git/DVC (or equivalent) for data/model lineage.
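A minimal Weights & Biases logging sketch for run/artifact traceability; the project name, config values, and file path are placeholders:

```python
# Minimal experiment-traceability sketch with Weights & Biases.
# Project name, config values, and file paths are placeholders.
import wandb

run = wandb.init(
    project="adas-perception",
    config={"backbone": "resnet50", "lr": 1e-4, "dataset_version": "v3-hard-slices"},
    tags=["shadow-candidate"],
)

for epoch in range(3):
    wandb.log({"epoch": epoch, "val/mAP_hard": 0.41 + 0.01 * epoch})

# Version the evaluation set used for this run so results stay reproducible.
artifact = wandb.Artifact("hard-slice-eval", type="dataset")
artifact.add_file("eval/night_rain_pedestrians.json")  # placeholder path
run.log_artifact(artifact)
run.finish()
```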
Cost & ROI
Control training/inference cost with spot/managed schedules, data pruning, tiered storage, and on-demand auto-labeling only for “high-value” clips.
Ethics & compliance
PII handling (face/license-plate blurring where required), audit trails, and incident-review packs with explanations.
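For the PII step, a small redaction sketch that blurs detected face/plate regions with OpenCV; the upstream detector that produces the boxes is out of scope, and the kernel size is an arbitrary choice:

```python
# PII redaction sketch: Gaussian-blur detected face / license-plate regions.
# Box coordinates are assumed to come from an upstream detector (not shown).
import cv2
import numpy as np

def blur_regions(frame: np.ndarray, boxes: list[tuple[int, int, int, int]]) -> np.ndarray:
    """boxes: (x1, y1, x2, y2) in pixel coordinates."""
    out = frame.copy()
    for x1, y1, x2, y2 in boxes:
        roi = out[y1:y2, x1:x2]
        if roi.size:  # skip empty or degenerate boxes
            # Kernel must be odd; 31x31 is an illustrative blur strength.
            out[y1:y2, x1:x2] = cv2.GaussianBlur(roi, (31, 31), 0)
    return out
```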
5) Defining Success Metrics (Business, Model, Operational)
| Type | Metric | Definition | How to Measure | Target (initial) |
|---|---|---|---|---|
| Business | Intervention rate ↓ | Driver takeovers per 1000 km in assisted modes | Fleet telemetry and event logs, normalized by km and conditions | −15–20% vs. baseline cohort |
| Business | Safety-critical events ↓ | Near-miss / harsh-brake alerts per 1000 km | On-vehicle triggers (brake pressure, decel, proximity) with post-hoc validation | −15–22% |
| Business | Feature reliability ↑ | % sessions without faulted ADAS disengagement | Session analytics, error codes, watchdog | +10–15% |
| Model (Perception) | Primary mAP / mIoU ↑ | Detection mAP@IoU, seg mIoU on curated “hard” sets | Benchmarks by scenario slices (night/rain/occlusion/construction) | +5–8 pts on hard slices |
| Model (Risk) | Calibration (ECE) ↓ | Expected Calibration Error for confidence outputs | Reliability diagrams on offline eval & shadow data | < 3–5% ECE |
| Model (Long-tail) | Slice recall ↑ | Recall on rare scenarios (e.g., night-rain pedestrians) | Per-slice eval sets maintained in W&B Artifacts | +10–15 pts |
| Operational | TTMU ↓ | Time from failure discovery → safe model in prod | Ticket timestamps and deploy tags | 8–10 weeks → 2–3 weeks |
| Operational | Pipeline SLA | Ingestion → curation → label turnaround time per drive | Orchestrator metrics, queue latencies | < 24 h |
| Operational | Serving latency | End-to-end p95 inference latency | Traces (cloud + edge), p50/p95/p99 | < 100–150 ms p95 |
| Operational | Cost efficiency | $/1000 km processed; $/training experiment | Cost dashboards; W&B sweep cost per win | −20–30% vs. baseline quarter |
| Operational | Observability coverage | % models/pipelines with alerts, dashboards, SLOs | Runbooks + monitoring inventory | > 85% coverage |
Notes on measurement
Business metrics are evaluated on held-out test routes and staggered rollouts to control for route/weather mix.
Model metrics are tracked in WandB (projects for perception/risk; Artifacts for datasets, models, and eval sets).
Operational metrics come from pipeline orchestration, tracing, and cost dashboards; tie each notable change to a W&B run/deployment tag for auditability.
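The calibration target in the table above uses Expected Calibration Error; a minimal binned-ECE sketch follows (the 15-bin choice is an assumption):

```python
# Binned Expected Calibration Error (ECE) sketch for the calibration metric.
# The bin count is an illustrative choice.
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 15) -> float:
    """confidences in [0, 1]; correct is 0/1 per prediction."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight by the fraction of samples in this bin
    return float(ece)
```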
Decision summary: ML is necessary and appropriate for the perception and event-understanding layers, provided it is embedded in a closed-loop data engine with strong safety gating, targeted long-tail mining, reproducible experimentation (WandB), and rigorous operational controls.