Workflows, Team, Roles
Workflows
Pipeline / Workflow | Trigger | Inputs | Key Steps | Outputs |
---|---|---|---|---|
1. Telemetry & Bulk Data Ingestion | Drive docked or upload complete event | SSD offloads, ROS/rosbag, video, lidar, radar, CAN/IMU, small telemetry streams | Copy via Snowball/DataSync → S3 landing; verify checksums; register manifest; notify downstream | Raw data in S3 bronze, drive manifest, ingestion log |
2. Data Integrity & PII Redaction | Post-ingestion event | Raw sensor logs | File integrity checks; sensor presence; time span sanity; blur faces/plates; redact PII; sign results | Cleaned S3 bronze, redaction report |
3. Sensor Sync & Format Conversion | After integrity pass | Rosbags, proprietary binaries | Time-align streams; extract frames/keyframes; convert to Parquet/Zarr; create clip shards | Synchronized Parquet/Zarr shards in S3 silver |
4. Metadata Extraction & Cataloging | On conversion finish | Synchronized shards, manifests | Extract timestamps, GPS, weather if available, route IDs; schema to Glue; upsert DynamoDB; index text to OpenSearch | Glue tables, DynamoDB entries, OpenSearch index |
5. Map & Weather Enrichment | Daily batch | GPS traces, time, road graphs | Join with map tiles/HD refs; fetch historical weather; attach road classes/speed limits | Enriched metadata columns in silver |
6. Scene Detection & Event Triggers | Hourly batch | Enriched clips, CAN signals | Lightweight detectors for cut-ins, harsh brake, stationary hazard, disengagements; write event windows | Event windows table, tags per clip |
7. Similarity & Vector Index Build | New events found | Reference clips, embeddings model | Generate clip embeddings; upsert vector DB; enable “find more like this” | Vector index entries; retrieval API ready |
8. Trigger-based Scenario Mining | On-demand or schedule | Event windows, vector queries | Search long-tail scenarios (night-rain, occlusion, construction); de-dup; rank by novelty/uncertainty | Candidate sets for labeling/mining |
9. Auto-Labeling (Bootstrapped) | Candidate set ready | Pretrained models, heuristics | Run offline inference at scale; propagate pseudo-labels; confidence filtering; weak supervision rules | Auto-labeled datasets with provenance |
10. Human-in-the-Loop Labeling & QA | Auto-labeled set queued | Auto-labels, raw clips | Sampling for manual QA; spot-check hard slices; adjudicate disagreements; finalize labels | Verified labels; label QA metrics |
11. Golden & Slice Dataset Builder | Weekly or on request | Labeled tables, metadata | Build “golden” benchmark sets and slice packs (night, rain, occlusion, construction); freeze with DVC; publish to W&B Artifacts | Versioned datasets with DVC tags and W&B artifacts |
12. Offline Mining via Batch Inference | Nightly | Latest model, large unlabeled pool | Run model across pool on Batch/EKS; capture failures, high-uncertainty, drifted slices | Failure buckets; candidates for re-label |
13. Distributed Training (Perception Multitask) | New dataset version or ticket | Curated datasets, configs | Launch distributed training; mixed precision; checkpointing; gradient accumulation; log to W&B | Trained checkpoints; W&B runs & artifacts |
14. Hyperparameter Sweeps | Model change or perf gap | Training code, sweep config | W&B sweeps; early stopping; budgeted search; capture best by primary metric | Best config bundle; sweep report |
15. Model Packaging & Export | Train job success | Best checkpoint | Export TorchScript/ONNX; TensorRT build; INT8 calibration on repset; embed metadata | Versioned model bundle in S3 + ECR image |
16. Model Evaluation & Robustness Suite | New bundle ready | Golden & slice datasets, model bundle | Compute mAP/mIoU/AP by slice; calibration (ECE); robustness (noise, blur, weather); latency on target; write eval report | Eval JSON, W&B reports, promotion decision signal |
17. Drive Replay & Simulation Validation | Gate before promotion | Model bundle, replay logs/sim scenarios | Re-run model on historical incidents; sim-in-loop perturbations; compare to baselines; safety predicates | Replay KPIs, safety deltas, sign-off artifacts |
18. Model Registry & Promotion Gate | Eval passed | W&B run, artifacts, reports | Create/advance model version in W&B Registry; attach evidence (datasets, evals); request approvals | Staged model with audit trail |
19. Canary/Shadow Deployment | Promotion approved | Container image, serving config | Deploy to EKS Triton; shadow route same traffic for compare; canary small %; watch SLOs | Shadow/canary live; rollout decision inputs |
20. Online A/B and Feature Flag Switchboard | After shadow confidence | Routing config, guardrails | Route by geography/scene type; progressive exposure; automatic pause on SLO breach | Controlled rollout; experiment results |
21. Edge-Compatible Build & OTA Packaging | Edge target release | Model bundle, calibrations | Further quantization/distillation; embed runtime checks; produce OTA package manifest | Edge package ready; manifest signed |
22. Over-The-Air Delivery | Release ticket | OTA package | Stage to distribution; phased fleets; collect post-deploy telemetry hooks | OTA rollout status; feedback telemetry |
23. Online Inference Service Ops | Continuous | Live frames/events | Triton dynamic batching; health probes; autoscale; backpressure; cache hot features | Real-time predictions; health metrics |
24. Monitoring & Observability | Continuous | Metrics/logs/traces | Infra: CPU/GPU/mem; App: p50/p95/p99, QPS, error rate; ML: confidences, slice metrics; dashboards & alerts | Grafana/W&B dashboards; alert incidents |
25. Data/Output Drift Detection | Hourly/daily | Live feature/output dists, baseline | PSI/KS tests; concept drift on outputs; slice drifts; generate tickets if thresholds crossed | Drift reports; retrain triggers |
26. Continual Learning Trigger | Drift or failure quota exceeded | Drift report, failure buckets | Open labeling tasks; schedule mining; enqueue retraining DAG | Approved retraining request |
27. Automated Retraining | Triggered | Updated datasets | Re-run 13→18 sequence; compare to current prod; promote only on net gain | New candidate model version |
28. Testing in Production (Safety Predicates) | Pre/post rollout | Live predictions | Real-time rules: sanity, rate-limit, confidence thresholds, disagreement with baselines; automatic fallback | Predicate logs; auto-disable signals |
29. Cost Telemetry & Optimization | Daily/weekly | AWS billing, job metrics | Attribute cost to datasets/models; spot utilization; right-size; S3 tiering candidates | Cost reports; actions (tiering, instance changes) |
30. Data Lifecycle & Tiering | Weekly | Access stats, retention policy | Move cold data to Glacier/Intelligent-Tiering; compact small files; delete temp | Lower storage cost; lifecycle logs |
31. Security & Compliance Scans | CI and nightly | Docker images, IaC, deps | Trivy/Grype scans; IaC checks; SBOM; sign containers; policy-as-code gates | Security reports; signed artifacts |
32. Governance: Datasheets & Model Cards | On promotion | Datasets, evals, risks | Auto-generate Datasheets/Model Cards with metrics, slices, risks, mitigations | Versioned governance docs |
33. Incident Review & RCA Pack | On alert or incident | Logs, traces, frames | Bundle timeline, inputs/outputs, saliency, SHAP for tabular, predicates fired; propose fixes | RCA doc; backlog items |
34. Experiment Lifecycle & Artifact GC | Weekly | W&B projects, S3 buckets | Auto-archive stale runs; GC tmp artifacts; keep winners and governance sets | Cleaned registry; controlled storage |
35. GPU Capacity & Queue Scheduler | Continuous | Job queue, quotas | Bin-pack training/inference; fairness across teams; preemption for priority | Predictable throughput; SLA adherence |
36. Map/Trigger Policy Update | Monthly or new roadworks | Map deltas, ops inputs | Update road rules, construction zones; refresh trigger heuristics | Updated enrichment; fewer false alarms |
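
Workflow 25 in the table above names PSI (Population Stability Index) as one of the drift tests and says tickets are generated when thresholds are crossed. The following is a minimal sketch of that check, not the pipeline's actual implementation. It assumes equal-width bins derived from the baseline sample, and the names `psi` and `drift_exceeded`, the bin count, the `eps` floor, and the 0.2 threshold are all illustrative assumptions.

```python
import math
from typing import Sequence


def psi(baseline: Sequence[float], live: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `live` against `baseline`.

    Bin edges are equal-width over the baseline range (an assumption;
    quantile bins are another common choice).
    """
    if not baseline or not live:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant baselines

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the baseline range into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each proportion at eps so the log term is always defined.
        return [max(c / len(values), eps) for c in counts]

    expected = proportions(baseline)
    actual = proportions(live)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def drift_exceeded(baseline: Sequence[float], live: Sequence[float],
                   threshold: float = 0.2) -> bool:
    """Return True when PSI crosses the (hypothetical) ticketing threshold."""
    return psi(baseline, live) > threshold
```

A scheduled job could call `drift_exceeded` per feature or per slice and open a ticket when it returns `True`. A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but the right threshold depends on the feature and should be tuned against historical data.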
Team and Roles
Category | Tasks Covered | Primary Owner | Supporting Roles | Notes / Hand-offs |
---|---|---|---|---|
Data Ingestion & Foundations | 1 Ingestion, 2 Integrity & PII, 3 Sync & Convert, 4 Metadata, 5 Map/Weather Enrichment | Data Engineering | Platform Eng, Security, PM | Hand off enriched, validated data and catalogs to ML for mining and training. |
Scene Understanding & Data Mining | 6 Scene Detection & Triggers, 7 Vector Index, 8 Scenario Mining, 9 Auto-Labeling, 10 Human QA, 11 Golden/Slice Builder, 12 Offline Mining | ML/MLOps Engineer | Label Ops, Data Eng, PM | You led trigger design, embedding search, mining strategy, auto-labeling rules, and curated “golden” & slice datasets; Label Ops handled adjudication in 10 with your sampling/QA guidelines. |
Model Training & Experimentation | 13 Distributed Training, 14 HPO/Sweeps | ML/MLOps Engineer | Platform Eng, Data Eng | You owned training pipelines, W&B runs/artifacts, and budgeted sweeps; Platform Eng provisioned GPUs and job templates. |
Packaging, Evaluation & Promotion | 15 Packaging/Export, 16 Eval & Robustness, 17 Drive Replay/Sim, 18 Registry & Promotion | ML Engineer | Platform Eng, Simulation Eng, PM/Safety | ML leads eval design and reports; Simulation validates safety on replays; PM/Safety approves promotion in registry. |
Deployment & Serving | 19 Canary/Shadow, 20 A/B & Flags, 21 Edge Build & OTA, 22 OTA Delivery, 23 Online Service Ops, 24 Observability | Platform Engineering | ML Engineer, SRE, PM | Platform runs Triton/TorchServe on EKS, rollouts with canary/shadow; ML supplies model contracts and latency SLOs; SRE manages on-call. |
Monitoring & Continual Learning | 25 Drift Detection, 26 Continual Learning Trigger, 27 Automated Retraining, 28 Testing in Prod (Safety Predicates) | ML/MLOps Engineer | Platform Eng, Data Eng, PM/Safety | You defined drift metrics, thresholds, and retrain triggers; wired safety predicates and rollback signals; coordinated retrain DAGs back to training gates. |
Cost, Lifecycle, Compliance | 29 Cost Telemetry, 30 Data Lifecycle/Tiering, 31 Security Scans, 32 Datasheets/Model Cards | Platform Engineering | FinOps, Security, ML Engineer, PM | Cost attribution by job/model; lifecycle S3 tiering; SBOM/signing; ML contributes governance artifacts and model cards. |
Reliability, Capacity, Maps | 33 Incident RCA, 34 Experiment GC, 35 GPU Capacity & Queues, 36 Map/Trigger Policy Update | SRE/Platform Engineering | ML Engineer, Map/Ops, PM | SRE drives RCAs; Platform handles capacity/bin-packing; ML provides failure buckets and updates trigger policies with Map/Ops. |