Workflows, Team, Roles


Workflows

| Pipeline / Workflow | Trigger | Inputs | Key Steps | Outputs |
| --- | --- | --- | --- | --- |
| 1. Telemetry & Bulk Data Ingestion | Drive docked or upload complete event | SSD offloads, ROS/rosbag, video, lidar, radar, CAN/IMU, small telemetry streams | Copy via Snowball/DataSync → S3 landing; verify checksums; register manifest; notify downstream | Raw data in S3 bronze, drive manifest, ingestion log |
| 2. Data Integrity & PII Redaction | Post-ingestion event | Raw sensor logs | File integrity checks; sensor presence; time span sanity; blur faces/plates; redact PII; sign results | Cleaned S3 bronze, redaction report |
| 3. Sensor Sync & Format Conversion | After integrity pass | Rosbags, proprietary binaries | Time-align streams; extract frames/keyframes; convert to Parquet/Zarr; create clip shards | Synchronized Parquet/Zarr shards in S3 silver |
| 4. Metadata Extraction & Cataloging | On conversion finish | Synchronized shards, manifests | Extract timestamps, GPS, weather if available, route IDs; schema to Glue; upsert DynamoDB; index text to OpenSearch | Glue tables, DynamoDB entries, OpenSearch index |
| 5. Map & Weather Enrichment | Daily batch | GPS traces, time, road graphs | Join with map tiles/HD refs; fetch historical weather; attach road classes/speed limits | Enriched metadata columns in silver |
| 6. Scene Detection & Event Triggers | Hourly batch | Enriched clips, CAN signals | Lightweight detectors for cut-ins, harsh brake, stationary hazard, disengagements; write event windows | Event windows table, tags per clip |
| 7. Similarity & Vector Index Build | New events found | Reference clips, embeddings model | Generate clip embeddings; upsert vector DB; enable “find more like this” | Vector index entries; retrieval API ready |
| 8. Trigger-based Scenario Mining | On-demand or schedule | Event windows, vector queries | Search long-tail scenarios (night-rain, occlusion, construction); de-dup; rank by novelty/uncertainty | Candidate sets for labeling/mining |
| 9. Auto-Labeling (Bootstrapped) | Candidate set ready | Pretrained models, heuristics | Run offline inference at scale; propagate pseudo-labels; confidence filtering; weak supervision rules | Auto-labeled datasets with provenance |
| 10. Human-in-the-Loop Labeling & QA | Auto-labeled set queued | Auto-labels, raw clips | Sampling for manual QA; spot-check hard slices; adjudicate disagreements; finalize labels | Verified labels; label QA metrics |
| 11. Golden & Slice Dataset Builder | Weekly or on request | Labeled tables, metadata | Build “golden” benchmark sets and slice packs (night, rain, occlusion, construction); freeze with DVC; publish to W&B Artifacts | Versioned datasets with DVC tags and W&B artifacts |
| 12. Offline Mining via Batch Inference | Nightly | Latest model, large unlabeled pool | Run model across pool on Batch/EKS; capture failures, high-uncertainty, drifted slices | Failure buckets; candidates for re-label |
| 13. Distributed Training (Perception Multitask) | New dataset version or ticket | Curated datasets, configs | Launch distributed training; mixed precision; checkpointing; gradient accumulation; log to W&B | Trained checkpoints; W&B runs & artifacts |
| 14. Hyperparameter Sweeps | Model change or perf gap | Training code, sweep config | W&B sweeps; early stopping; budgeted search; capture best by primary metric | Best config bundle; sweep report |
| 15. Model Packaging & Export | Train job success | Best checkpoint | Export TorchScript/ONNX; TensorRT build; INT8 calibration on repset; embed metadata | Versioned model bundle in S3 + ECR image |
| 16. Model Evaluation & Robustness Suite | New bundle ready | Golden & slice datasets, model bundle | Compute mAP/mIoU/AP by slice; calibration (ECE); robustness (noise, blur, weather); latency on target; write eval report | Eval JSON, W&B reports, promotion decision signal |
| 17. Drive Replay & Simulation Validation | Gate before promotion | Model bundle, replay logs/sim scenarios | Re-run model on historical incidents; sim-in-loop perturbations; compare to baselines; safety predicates | Replay KPIs, safety deltas, sign-off artifacts |
| 18. Model Registry & Promotion Gate | Eval passed | W&B run, artifacts, reports | Create/advance model version in W&B Registry; attach evidence (datasets, evals); request approvals | Staged model with audit trail |
| 19. Canary/Shadow Deployment | Promotion approved | Container image, serving config | Deploy to EKS Triton; shadow route same traffic for compare; canary small %; watch SLOs | Shadow/canary live; rollout decision inputs |
| 20. Online A/B and Feature Flag Switchboard | After shadow confidence | Routing config, guardrails | Route by geography/scene type; progressive exposure; automatic pause on SLO breach | Controlled rollout; experiment results |
| 21. Edge-Compatible Build & OTA Packaging | Edge target release | Model bundle, calibrations | Further quantization/distillation; embed runtime checks; produce OTA package manifest | Edge package ready; manifest signed |
| 22. Over-The-Air Delivery | Release ticket | OTA package | Stage to distribution; phased fleets; collect post-deploy telemetry hooks | OTA rollout status; feedback telemetry |
| 23. Online Inference Service Ops | Continuous | Live frames/events | Triton dynamic batching; health probes; autoscale; backpressure; cache hot features | Real-time predictions; health metrics |
| 24. Monitoring & Observability | Continuous | Metrics/logs/traces | Infra: CPU/GPU/mem; App: p50/p95/p99, QPS, error rate; ML: confidences, slice metrics; dashboards & alerts | Grafana/W&B dashboards; alert incidents |
| 25. Data/Output Drift Detection | Hourly/daily | Live feature/output dists, baseline | PSI/KS tests; concept drift on outputs; slice drifts; generate tickets if thresholds crossed | Drift reports; retrain triggers |
| 26. Continual Learning Trigger | Drift or failure quota exceeded | Drift report, failure buckets | Open labeling tasks; schedule mining; enqueue retraining DAG | Approved retraining request |
| 27. Automated Retraining | Triggered | Updated datasets | Re-run 13→18 sequence; compare to current prod; promote only on net gain | New candidate model version |
| 28. Testing in Production (Safety Predicates) | Pre/post rollout | Live predictions | Real-time rules: sanity, rate-limit, confidence thresholds, disagreement with baselines; automatic fallback | Predicate logs; auto-disable signals |
| 29. Cost Telemetry & Optimization | Daily/weekly | AWS billing, job metrics | Attribute cost to datasets/models; spot utilization; right-size; S3 tiering candidates | Cost reports; actions (tiering, instance changes) |
| 30. Data Lifecycle & Tiering | Weekly | Access stats, retention policy | Move cold data to Glacier/Intelligent-Tiering; compact small files; delete temp | Lower storage cost; lifecycle logs |
| 31. Security & Compliance Scans | CI and nightly | Docker images, IaC, deps | Trivy/Grype scans; IaC checks; SBOM; sign containers; policy-as-code gates | Security reports; signed artifacts |
| 32. Governance: Datasheets & Model Cards | On promotion | Datasets, evals, risks | Auto-generate Datasheets/Model Cards with metrics, slices, risks, mitigations | Versioned governance docs |
| 33. Incident Review & RCA Pack | On alert or incident | Logs, traces, frames | Bundle timeline, inputs/outputs, saliency, SHAP for tabular, predicates fired; propose fixes | RCA doc; backlog items |
| 34. Experiment Lifecycle & Artifact GC | Weekly | W&B projects, S3 buckets | Auto-archive stale runs; GC tmp artifacts; keep winners and governance sets | Cleaned registry; controlled storage |
| 35. GPU Capacity & Queue Scheduler | Continuous | Job queue, quotas | Bin-pack training/inference; fairness across teams; preemption for priority | Predictable throughput; SLA adherence |
| 36. Map/Trigger Policy Update | Monthly or new roadworks | Map deltas, ops inputs | Update road rules, construction zones; refresh trigger heuristics | Updated enrichment; fewer false alarms |
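
The "verify checksums; register manifest" step in workflow 1 could look roughly like the sketch below. It assumes the drive manifest maps each relative file path to a SHA-256 hex digest; that layout, and the names `sha256_of` and `verify_manifest`, are hypothetical and not taken from the actual pipeline.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large sensor logs are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare every manifest entry against the file found under `root`.

    `manifest` maps relative path -> expected SHA-256 hex digest
    (an assumed layout, used here only for illustration).
    """
    report: Dict[str, List[str]] = {"ok": [], "missing": [], "mismatch": []}
    for rel_path, expected in sorted(manifest.items()):
        target = root / rel_path
        if not target.is_file():
            report["missing"].append(rel_path)
        elif sha256_of(target) != expected.lower():
            report["mismatch"].append(rel_path)
        else:
            report["ok"].append(rel_path)
    return report
```

A drive would only be registered and announced downstream when both `missing` and `mismatch` come back empty; anything else would go to the ingestion log for a re-copy.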

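Workflow 16 lists calibration (ECE) among the evaluation metrics. As an illustration only, the sketch below computes expected calibration error with equal-width confidence bins for classification-style outputs. The bin count and the function name are assumptions, and calibration for detection outputs would need a matching step that is not shown here.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE = sum over bins of (n_b / N) * |accuracy_b - mean_confidence_b|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    conf_sum = [0.0] * n_bins
    hit_sum = [0] * n_bins
    counts = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += int(hit)
        counts[idx] += 1

    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        if counts[b] == 0:
            continue  # empty bins contribute nothing
        mean_conf = conf_sum[b] / counts[b]
        accuracy = hit_sum[b] / counts[b]
        ece += (counts[b] / total) * abs(accuracy - mean_conf)
    return ece
```

Running this per slice (night, rain, occlusion, construction) rather than only on the full set would fit the "by slice" framing of the evaluation suite.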

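For the PSI part of workflow 25, a minimal sketch follows. It bins a baseline sample and a live sample on shared edges and sums `(q - p) * ln(q / p)` over the bins. The bin edges, the `eps` floor for empty bins, and the function names are assumptions; the ticket threshold is left to the caller (0.2 is a commonly quoted rule of thumb, not a value from this pipeline).

```python
import math
from typing import List, Sequence


def histogram_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin; out-of-range values are clamped to the end bins."""
    if not values:
        raise ValueError("need at least one value")
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        # Walk right while the value is at or past the next edge.
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]


def population_stability_index(
    baseline: Sequence[float],
    live: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """PSI = sum_i (q_i - p_i) * ln(q_i / p_i), with p from baseline and q from live."""
    p = histogram_fractions(baseline, edges)
    q = histogram_fractions(live, edges)
    psi = 0.0
    for p_i, q_i in zip(p, q):
        # Floor empty bins so the logarithm stays finite.
        p_i = max(p_i, eps)
        q_i = max(q_i, eps)
        psi += (q_i - p_i) * math.log(q_i / p_i)
    return psi
```

The same function can be applied per feature or per output score, and per slice, which maps onto the "slice drifts" item in the key steps.
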
Team and Roles

| Category | Tasks Covered | Primary Owner | Supporting Roles | Notes / Hand-offs |
| --- | --- | --- | --- | --- |
| Data Ingestion & Foundations | 1 Ingestion, 2 Integrity & PII, 3 Sync & Convert, 4 Metadata, 5 Map/Weather Enrichment | Data Engineering | Platform Eng, Security, PM | Hand off enriched, validated data and catalogs to ML for mining and training. |
| Scene Understanding & Data Mining | 6 Scene Detection & Triggers, 7 Vector Index, 8 Scenario Mining, 9 Auto-Labeling, 10 Human QA, 11 Golden/Slice Builder, 12 Offline Mining | ML/MLOps Engineer | Label Ops, Data Eng, PM | You led trigger design, embedding search, mining strategy, auto-labeling rules, and curated “golden” & slice datasets; Label Ops handled adjudication in 10 with your sampling/QA guidelines. |
| Model Training & Experimentation | 13 Distributed Training, 14 HPO/Sweeps | ML/MLOps Engineer | Platform Eng, Data Eng | You owned training pipelines, W&B runs/artifacts, and budgeted sweeps; Platform Eng provisioned GPUs and job templates. |
| Packaging, Evaluation & Promotion | 15 Packaging/Export, 16 Eval & Robustness, 17 Drive Replay/Sim, 18 Registry & Promotion | ML Engineer | Platform Eng, Simulation Eng, PM/Safety | ML leads eval design and reports; Simulation validates safety on replays; PM/Safety approves promotion in registry. |
| Deployment & Serving | 19 Canary/Shadow, 20 A/B & Flags, 21 Edge Build & OTA, 22 OTA Delivery, 23 Online Service Ops, 24 Observability | Platform Engineering | ML Engineer, SRE, PM | Platform runs Triton/TorchServe on EKS, rollouts with canary/shadow; ML supplies model contracts and latency SLOs; SRE manages on-call. |
| Monitoring & Continual Learning | 25 Drift Detection, 26 Continual Learning Trigger, 27 Automated Retraining, 28 Testing in Prod (Safety Predicates) | ML/MLOps Engineer | Platform Eng, Data Eng, PM/Safety | You defined drift metrics, thresholds, and retrain triggers; wired safety predicates and rollback signals; coordinated retrain DAGs back to training gates. |
| Cost, Lifecycle, Compliance | 29 Cost Telemetry, 30 Data Lifecycle/Tiering, 31 Security Scans, 32 Datasheets/Model Cards | Platform Engineering | FinOps, Security, ML Engineer, PM | Cost attribution by job/model; lifecycle S3 tiering; SBOM/signing; ML contributes governance artifacts and model cards. |
| Reliability, Capacity, Maps | 33 Incident RCA, 34 Experiment GC, 35 GPU Capacity & Queues, 36 Map/Trigger Policy Update | SRE/Platform Engineering | ML Engineer, Map/Ops, PM | SRE drives RCAs; Platform handles capacity/bin-packing; ML provides failure buckets and updates trigger policies with Map/Ops. |
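
Workflow 27 says to "promote only on net gain" without defining the term. One possible reading, sketched below under stated assumptions, is that the candidate must beat production on a primary metric and must not regress any slice by more than a tolerance. The metric keys, `min_gain`, `max_slice_drop`, and the function name are all hypothetical; the real gate would use whatever the eval report from workflow 16 provides.

```python
from typing import Dict, List, Tuple


def net_gain_gate(
    prod: Dict[str, float],
    candidate: Dict[str, float],
    primary: str = "overall",
    min_gain: float = 0.0,
    max_slice_drop: float = 0.01,
) -> Tuple[bool, List[str]]:
    """Return (promote, reasons). Metrics are higher-is-better scores keyed by slice."""
    if primary not in prod:
        raise ValueError("primary metric %r missing from production metrics" % primary)

    reasons: List[str] = []
    missing = sorted(set(prod) - set(candidate))
    if missing:
        reasons.append("candidate is missing slices: " + ", ".join(missing))

    if primary in candidate:
        gain = candidate[primary] - prod[primary]
        if gain <= min_gain:
            reasons.append("no gain on %s (delta %.4f)" % (primary, gain))

    # Guard against a better average that hides a regression on one slice.
    for name in sorted(set(prod) & set(candidate)):
        if name == primary:
            continue
        drop = prod[name] - candidate[name]
        if drop > max_slice_drop:
            reasons.append("slice %s regressed by %.4f" % (name, drop))

    return (not reasons, reasons)
```

Returning the reasons alongside the decision keeps the outcome explainable when it is attached as evidence in the registry step.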