Workflows, Team, Roles


Workflows

Pipeline / Workflow	Trigger	Inputs	Key Steps	Outputs
1. Telemetry & Bulk Data Ingestion	Drive docked or upload complete event	SSD offloads, ROS/rosbag, video, lidar, radar, CAN/IMU, small telemetry streams	Copy via Snowball/DataSync → S3 landing; verify checksums; register manifest; notify downstream	Raw data in S3 bronze, drive manifest, ingestion log
2. Data Integrity & PII Redaction	Post-ingestion event	Raw sensor logs	File integrity checks; sensor presence; time span sanity; blur faces/plates; redact PII; sign results	Cleaned S3 bronze, redaction report
3. Sensor Sync & Format Conversion	After integrity pass	Rosbags, proprietary binaries	Time-align streams; extract frames/keyframes; convert to Parquet/Zarr; create clip shards	Synchronized Parquet/Zarr shards in S3 silver
4. Metadata Extraction & Cataloging	On conversion finish	Synchronized shards, manifests	Extract timestamps, GPS, weather if available, route IDs; schema to Glue; upsert DynamoDB; index text to OpenSearch	Glue tables, DynamoDB entries, OpenSearch index
5. Map & Weather Enrichment	Daily batch	GPS traces, time, road graphs	Join with map tiles/HD map references; fetch historical weather; attach road classes/speed limits	Enriched metadata columns in silver
6. Scene Detection & Event Triggers	Hourly batch	Enriched clips, CAN signals	Lightweight detectors for cut-ins, harsh brake, stationary hazard, disengagements; write event windows	Event windows table, tags per clip
7. Similarity & Vector Index Build	New events found	Reference clips, embeddings model	Generate clip embeddings; upsert vector DB; enable “find more like this”	Vector index entries; retrieval API ready
8. Trigger-based Scenario Mining	On-demand or schedule	Event windows, vector queries	Search long-tail scenarios (night-rain, occlusion, construction); de-dup; rank by novelty/uncertainty	Candidate sets for labeling/mining
9. Auto-Labeling (Bootstrapped)	Candidate set ready	Pretrained models, heuristics	Run offline inference at scale; propagate pseudo-labels; confidence filtering; weak supervision rules	Auto-labeled datasets with provenance
10. Human-in-the-Loop Labeling & QA	Auto-labeled set queued	Auto-labels, raw clips	Sampling for manual QA; spot-check hard slices; adjudicate disagreements; finalize labels	Verified labels; label QA metrics
11. Golden & Slice Dataset Builder	Weekly or on request	Labeled tables, metadata	Build “golden” benchmark sets and slice packs (night, rain, occlusion, construction); freeze with DVC; publish to W&B Artifacts	Versioned datasets with DVC tags and W&B artifacts
12. Offline Mining via Batch Inference	Nightly	Latest model, large unlabeled pool	Run the model across the pool on AWS Batch/EKS; capture failures, high-uncertainty samples, and drifted slices	Failure buckets; candidates for re-labeling
13. Distributed Training (Perception Multitask)	New dataset version or ticket	Curated datasets, configs	Launch distributed training; mixed precision; checkpointing; gradient accumulation; log to W&B	Trained checkpoints; W&B runs & artifacts
14. Hyperparameter Sweeps	Model change or perf gap	Training code, sweep config	W&B sweeps; early stopping; budgeted search; capture best by primary metric	Best config bundle; sweep report
15. Model Packaging & Export	Train job success	Best checkpoint	Export TorchScript/ONNX; TensorRT build; INT8 calibration on a representative dataset; embed metadata	Versioned model bundle in S3 + ECR image
16. Model Evaluation & Robustness Suite	New bundle ready	Golden & slice datasets, model bundle	Compute mAP/mIoU/AP by slice; calibration (ECE); robustness (noise, blur, weather); latency on target; write eval report	Eval JSON, W&B reports, promotion decision signal
17. Drive Replay & Simulation Validation	Gate before promotion	Model bundle, replay logs/sim scenarios	Re-run model on historical incidents; sim-in-loop perturbations; compare to baselines; safety predicates	Replay KPIs, safety deltas, sign-off artifacts
18. Model Registry & Promotion Gate	Eval passed	W&B run, artifacts, reports	Create/advance model version in W&B Registry; attach evidence (datasets, evals); request approvals	Staged model with audit trail
19. Canary/Shadow Deployment	Promotion approved	Container image, serving config	Deploy to EKS Triton; shadow route same traffic for compare; canary small %; watch SLOs	Shadow/canary live; rollout decision inputs
20. Online A/B and Feature Flag Switchboard	After shadow confidence	Routing config, guardrails	Route by geography/scene type; progressive exposure; automatic pause on SLO breach	Controlled rollout; experiment results
21. Edge-Compatible Build & OTA Packaging	Edge target release	Model bundle, calibrations	Further quantization/distillation; embed runtime checks; produce OTA package manifest	Edge package ready; manifest signed
22. Over-The-Air Delivery	Release ticket	OTA package	Stage to distribution; phased fleets; collect post-deploy telemetry hooks	OTA rollout status; feedback telemetry
23. Online Inference Service Ops	Continuous	Live frames/events	Triton dynamic batching; health probes; autoscale; backpressure; cache hot features	Real-time predictions; health metrics
24. Monitoring & Observability	Continuous	Metrics/logs/traces	Infra: CPU/GPU/mem; App: p50/p95/p99, QPS, error rate; ML: confidences, slice metrics; dashboards & alerts	Grafana/W&B dashboards; alert incidents
25. Data/Output Drift Detection	Hourly/daily	Live feature/output dists, baseline	PSI/KS tests; concept drift on outputs; slice drifts; generate tickets if thresholds crossed	Drift reports; retrain triggers
26. Continual Learning Trigger	Drift or failure quota exceeded	Drift report, failure buckets	Open labeling tasks; schedule mining; enqueue retraining DAG	Approved retraining request
27. Automated Retraining	Retraining request approved (workflow 26)	Updated datasets	Re-run workflows 13→18 in sequence; compare to current prod; promote only on net gain	New candidate model version
28. Testing in Production (Safety Predicates)	Pre/post rollout	Live predictions	Real-time rules: sanity, rate-limit, confidence thresholds, disagreement with baselines; automatic fallback	Predicate logs; auto-disable signals
29. Cost Telemetry & Optimization	Daily/weekly	AWS billing, job metrics	Attribute cost to datasets/models; spot utilization; right-size; S3 tiering candidates	Cost reports; actions (tiering, instance changes)
30. Data Lifecycle & Tiering	Weekly	Access stats, retention policy	Move cold data to Glacier/Intelligent-Tiering; compact small files; delete temp	Lower storage cost; lifecycle logs
31. Security & Compliance Scans	CI and nightly	Docker images, IaC, deps	Trivy/Grype scans; IaC checks; SBOM; sign containers; policy-as-code gates	Security reports; signed artifacts
32. Governance: Datasheets & Model Cards	On promotion	Datasets, evals, risks	Auto-generate Datasheets/Model Cards with metrics, slices, risks, mitigations	Versioned governance docs
33. Incident Review & RCA Pack	On alert or incident	Logs, traces, frames	Bundle timeline, inputs/outputs, saliency maps, SHAP for tabular features, predicates fired; propose fixes	RCA doc; backlog items
34. Experiment Lifecycle & Artifact GC	Weekly	W&B projects, S3 buckets	Auto-archive stale runs; GC tmp artifacts; keep winners and governance sets	Cleaned registry; controlled storage
35. GPU Capacity & Queue Scheduler	Continuous	Job queue, quotas	Bin-pack training/inference; fairness across teams; preemption for priority	Predictable throughput; SLA adherence
36. Map/Trigger Policy Update	Monthly or new roadworks	Map deltas, ops inputs	Update road rules, construction zones; refresh trigger heuristics	Updated enrichment; fewer false alarms
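
Workflow 1 in the table above lists "verify checksums; register manifest" as ingestion steps. A minimal sketch of that check, assuming a manifest that maps relative file paths to SHA-256 hex digests (the manifest layout and the names `sha256_of` and `verify_manifest` are hypothetical, not taken from the pipeline itself):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large sensor logs are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: dict[str, str]) -> dict[str, str]:
    """Return {relative_path: reason} for every manifest entry that fails verification.

    `manifest` maps relative paths to expected SHA-256 hex digests (assumed layout).
    An empty result means every listed file is present and matches its digest.
    """
    failures: dict[str, str] = {}
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            failures[rel_path] = "missing"
        elif sha256_of(target) != expected.lower():
            failures[rel_path] = "checksum mismatch"
    return failures
```

In a setup like the one described, a non-empty result would presumably block the "notify downstream" step and be written to the ingestion log instead.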
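
Workflow 16 names calibration (ECE) as one of the evaluation metrics. The sketch below shows one common way to compute expected calibration error with equal-width confidence bins. The function name, the default of 10 bins, and the half-open binning convention are assumptions for illustration; the evaluation suite described in the table may bin differently.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean of |accuracy - mean confidence| over equal-width bins.

    confidences: predicted confidences in [0, 1]
    correct:     booleans (or 0/1), True where the prediction was right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    # Per bin: [sample count, sum of confidences, number of correct predictions]
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Bins are [k/n, (k+1)/n); a confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += 1 if hit else 0

    total = len(confidences)
    ece = 0.0
    for count, conf_sum, hits in bins:
        if count:
            accuracy = hits / count
            mean_conf = conf_sum / count
            ece += (count / total) * abs(accuracy - mean_conf)
    return ece
```

Computing this per slice (night, rain, occlusion, construction) rather than only globally would match the "by slice" reporting the table describes for the other metrics.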
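
Workflow 25 lists PSI tests for drift detection, with tickets generated when thresholds are crossed. A minimal sketch of a population stability index over one numeric feature follows, assuming equal-width bins derived from the baseline sample. The function name, bin count, and the `eps` floor for empty bins are illustrative choices, not the pipeline's actual configuration.

```python
import math


def population_stability_index(baseline, live, n_bins=10, eps=1e-6):
    """PSI = sum over bins of (q - p) * ln(q / p).

    p is the baseline bin fraction and q the live bin fraction.
    Bin edges come from the baseline range; live values outside that
    range are clamped into the first or last bin.
    """
    if not baseline or not live:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline has zero range; cannot build bins")
    width = (hi - lo) / n_bins

    def fractions(values):
        counts = [0] * n_bins
        for value in values:
            idx = int((value - lo) / width)
            counts[max(0, min(idx, n_bins - 1))] += 1
        # Floor empty bins at eps so the logarithm stays finite.
        return [max(count / len(values), eps) for count in counts]

    p = fractions(baseline)
    q = fractions(live)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift. Those cut-offs are conventions rather than values from this document, so the thresholds that open tickets should be tuned per feature and per slice.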

Team and Roles

Category	Tasks Covered	Primary Owner	Supporting Roles	Notes / Hand-offs
Data Ingestion & Foundations	1 Ingestion, 2 Integrity & PII, 3 Sync & Convert, 4 Metadata, 5 Map/Weather Enrichment	Data Engineering	Platform Eng, Security, PM	Hand off enriched, validated data and catalogs to ML for mining and training.
Scene Understanding & Data Mining	6 Scene Detection & Triggers, 7 Vector Index, 8 Scenario Mining, 9 Auto-Labeling, 10 Human QA, 11 Golden/Slice Builder, 12 Offline Mining	ML/MLOps Engineer	Label Ops, Data Eng, PM	You led trigger design, embedding search, mining strategy, auto-labeling rules, and curated “golden” & slice datasets; Label Ops handled adjudication in 10 with your sampling/QA guidelines.
Model Training & Experimentation	13 Distributed Training, 14 HPO/Sweeps	ML/MLOps Engineer	Platform Eng, Data Eng	You owned training pipelines, W&B runs/artifacts, and budgeted sweeps; Platform Eng provisioned GPUs and job templates.
Packaging, Evaluation & Promotion	15 Packaging/Export, 16 Eval & Robustness, 17 Drive Replay/Sim, 18 Registry & Promotion	ML Engineer	Platform Eng, Simulation Eng, PM/Safety	ML leads eval design and reports; Simulation validates safety on replays; PM/Safety approves promotion in registry.
Deployment & Serving	19 Canary/Shadow, 20 A/B & Flags, 21 Edge Build & OTA, 22 OTA Delivery, 23 Online Service Ops, 24 Observability	Platform Engineering	ML Engineer, SRE, PM	Platform runs Triton/TorchServe on EKS and manages canary/shadow rollouts; ML supplies model contracts and latency SLOs; SRE manages on-call.
Monitoring & Continual Learning	25 Drift Detection, 26 Continual Learning Trigger, 27 Automated Retraining, 28 Testing in Prod (Safety Predicates)	ML/MLOps Engineer	Platform Eng, Data Eng, PM/Safety	You defined drift metrics, thresholds, and retrain triggers; wired safety predicates and rollback signals; coordinated retrain DAGs back to training gates.
Cost, Lifecycle, Compliance	29 Cost Telemetry, 30 Data Lifecycle/Tiering, 31 Security Scans, 32 Datasheets/Model Cards	Platform Engineering	FinOps, Security, ML Engineer, PM	Platform attributes cost by job/model, applies S3 lifecycle tiering, and handles SBOM generation and artifact signing; ML contributes governance artifacts and model cards.
Reliability, Capacity, Maps	33 Incident RCA, 34 Experiment GC, 35 GPU Capacity & Queues, 36 Map/Trigger Policy Update	SRE/Platform Engineering	ML Engineer, Map/Ops, PM	SRE drives RCAs; Platform handles capacity/bin-packing; ML provides failure buckets and updates trigger policies with Map/Ops.