# Workflows, Team, Roles

___

### Workflows

| Pipeline / Workflow | Trigger | Inputs | Key Steps | Outputs |
| --- | --- | --- | --- | --- |
| **1. Telemetry & Bulk Data Ingestion** | Drive docked or upload complete event | SSD offloads, ROS/rosbag, video, lidar, radar, CAN/IMU, small telemetry streams | Copy via Snowball/DataSync → S3 landing; verify checksums; register manifest; notify downstream | Raw data in S3 bronze, drive manifest, ingestion log |
| **2. Data Integrity & PII Redaction** | Post-ingestion event | Raw sensor logs | File integrity checks; sensor presence; time span sanity; blur faces/plates; redact PII; sign results | Cleaned S3 bronze, redaction report |
| **3. Sensor Sync & Format Conversion** | After integrity pass | Rosbags, proprietary binaries | Time-align streams; extract frames/keyframes; convert to Parquet/Zarr; create clip shards | Synchronized Parquet/Zarr shards in S3 silver |
| **4. Metadata Extraction & Cataloging** | On conversion finish | Synchronized shards, manifests | Extract timestamps, GPS, weather if available, route IDs; schema to Glue; upsert DynamoDB; index text to OpenSearch | Glue tables, DynamoDB entries, OpenSearch index |
| **5. Map & Weather Enrichment** | Daily batch | GPS traces, time, road graphs | Join with map tiles/HD refs; fetch historical weather; attach road classes/speed limits | Enriched metadata columns in silver |
| **6. Scene Detection & Event Triggers** | Hourly batch | Enriched clips, CAN signals | Lightweight detectors for cut-ins, harsh brake, stationary hazard, disengagements; write event windows | Event windows table, tags per clip |
| **7. Similarity & Vector Index Build** | New events found | Reference clips, embeddings model | Generate clip embeddings; upsert vector DB; enable “find more like this” | Vector index entries; retrieval API ready |
| **8. Trigger-based Scenario Mining** | On-demand or schedule | Event windows, vector queries | Search long-tail scenarios (night-rain, occlusion, construction); de-dup; rank by novelty/uncertainty | Candidate sets for labeling/mining |
| **9. Auto-Labeling (Bootstrapped)** | Candidate set ready | Pretrained models, heuristics | Run offline inference at scale; propagate pseudo-labels; confidence filtering; weak supervision rules | Auto-labeled datasets with provenance |
| **10. Human-in-the-Loop Labeling & QA** | Auto-labeled set queued | Auto-labels, raw clips | Sampling for manual QA; spot-check hard slices; adjudicate disagreements; finalize labels | Verified labels; label QA metrics |
| **11. Golden & Slice Dataset Builder** | Weekly or on request | Labeled tables, metadata | Build “golden” benchmark sets and slice packs (night, rain, occlusion, construction); freeze with DVC; publish to W&B Artifacts | Versioned datasets with DVC tags and W&B artifacts |
| **12. Offline Mining via Batch Inference** | Nightly | Latest model, large unlabeled pool | Run model across pool on Batch/EKS; capture failures, high-uncertainty, drifted slices | Failure buckets; candidates for re-label |
| **13. Distributed Training (Perception Multitask)** | New dataset version or ticket | Curated datasets, configs | Launch distributed training; mixed precision; checkpointing; gradient accumulation; log to W&B | Trained checkpoints; W&B runs & artifacts |
| **14. Hyperparameter Sweeps** | Model change or perf gap | Training code, sweep config | W&B sweeps; early stopping; budgeted search; capture best by primary metric | Best config bundle; sweep report |
| **15. Model Packaging & Export** | Train job success | Best checkpoint | Export TorchScript/ONNX; TensorRT build; INT8 calibration on repset; embed metadata | Versioned model bundle in S3 + ECR image |
| **16. Model Evaluation & Robustness Suite** | New bundle ready | Golden & slice datasets, model bundle | Compute mAP/mIoU/AP by slice; calibration (ECE); robustness (noise, blur, weather); latency on target; write eval report | Eval JSON, W&B reports, promotion decision signal |
| **17. Drive Replay & Simulation Validation** | Gate before promotion | Model bundle, replay logs/sim scenarios | Re-run model on historical incidents; sim-in-loop perturbations; compare to baselines; safety predicates | Replay KPIs, safety deltas, sign-off artifacts |
| **18. Model Registry & Promotion Gate** | Eval passed | W&B run, artifacts, reports | Create/advance model version in W&B Registry; attach evidence (datasets, evals); request approvals | Staged model with audit trail |
| **19. Canary/Shadow Deployment** | Promotion approved | Container image, serving config | Deploy to EKS Triton; **shadow** route same traffic for compare; canary small %; watch SLOs | Shadow/canary live; rollout decision inputs |
| **20. Online A/B and Feature Flag Switchboard** | After shadow confidence | Routing config, guardrails | Route by geography/scene type; progressive exposure; automatic pause on SLO breach | Controlled rollout; experiment results |
| **21. Edge-Compatible Build & OTA Packaging** | Edge target release | Model bundle, calibrations | Further quantization/distillation; embed runtime checks; produce OTA package manifest | Edge package ready; manifest signed |
| **22. Over-The-Air Delivery** | Release ticket | OTA package | Stage to distribution; phased fleets; collect post-deploy telemetry hooks | OTA rollout status; feedback telemetry |
| **23. Online Inference Service Ops** | Continuous | Live frames/events | Triton dynamic batching; health probes; autoscale; backpressure; cache hot features | Real-time predictions; health metrics |
| **24. Monitoring & Observability** | Continuous | Metrics/logs/traces | Infra: CPU/GPU/mem; App: p50/p95/p99, QPS, error rate; ML: confidences, slice metrics; dashboards & alerts | Grafana/W&B dashboards; alert incidents |
| **25. Data/Output Drift Detection** | Hourly/daily | Live feature/output dists, baseline | PSI/KS tests; concept drift on outputs; slice drifts; generate tickets if thresholds crossed | Drift reports; retrain triggers |
| **26. Continual Learning Trigger** | Drift or failure quota exceeded | Drift report, failure buckets | Open labeling tasks; schedule mining; enqueue retraining DAG | Approved retraining request |
| **27. Automated Retraining** | Triggered | Updated datasets | Re-run 13→18 sequence; compare to current prod; promote only on net gain | New candidate model version |
| **28. Testing in Production (Safety Predicates)** | Pre/post rollout | Live predictions | Real-time rules: sanity, rate-limit, confidence thresholds, disagreement with baselines; automatic fallback | Predicate logs; auto-disable signals |
| **29. Cost Telemetry & Optimization** | Daily/weekly | AWS billing, job metrics | Attribute cost to datasets/models; spot utilization; right-size; S3 tiering candidates | Cost reports; actions (tiering, instance changes) |
| **30. Data Lifecycle & Tiering** | Weekly | Access stats, retention policy | Move cold data to Glacier/Intelligent-Tiering; compact small files; delete temp | Lower storage cost; lifecycle logs |
| **31. Security & Compliance Scans** | CI and nightly | Docker images, IaC, deps | Trivy/Grype scans; IaC checks; SBOM; sign containers; policy-as-code gates | Security reports; signed artifacts |
| **32. Governance: Datasheets & Model Cards** | On promotion | Datasets, evals, risks | Auto-generate Datasheets/Model Cards with metrics, slices, risks, mitigations | Versioned governance docs |
| **33. Incident Review & RCA Pack** | On alert or incident | Logs, traces, frames | Bundle timeline, inputs/outputs, saliency, SHAP for tabular, predicates fired; propose fixes | RCA doc; backlog items |
| **34. Experiment Lifecycle & Artifact GC** | Weekly | W&B projects, S3 buckets | Auto-archive stale runs; GC tmp artifacts; keep winners and governance sets | Cleaned registry; controlled storage |
| **35. GPU Capacity & Queue Scheduler** | Continuous | Job queue, quotas | Bin-pack training/inference; fairness across teams; preemption for priority | Predictable throughput; SLA adherence |
| **36. Map/Trigger Policy Update** | Monthly or new roadworks | Map deltas, ops inputs | Update road rules, construction zones; refresh trigger heuristics | Updated enrichment; fewer false alarms |

___

### Team and Roles

| Category | Tasks Covered | Primary Owner | Supporting Roles | Notes / Hand-offs |
| --- | --- | --- | --- | --- |
| **Data Ingestion & Foundations** | **1** Ingestion, **2** Integrity & PII, **3** Sync & Convert, **4** Metadata, **5** Map/Weather Enrichment | **Data Engineering** | Platform Eng, Security, PM | Hand off enriched, validated data and catalogs to ML for mining and training. |
| **Scene Understanding & Data Mining** | **6** Scene Detection & Triggers, **7** Vector Index, **8** Scenario Mining, **9** Auto-Labeling, **10** Human QA, **11** Golden/Slice Builder, **12** Offline Mining | **ML/MLOps Engineer** | Label Ops, Data Eng, PM | You led trigger design, embedding search, mining strategy, auto-labeling rules, and curated “golden” & slice datasets; Label Ops handled adjudication in 10 with your sampling/QA guidelines. |
| **Model Training & Experimentation** | **13** Distributed Training, **14** HPO/Sweeps | **ML/MLOps Engineer** | Platform Eng, Data Eng | You owned training pipelines, W&B runs/artifacts, and budgeted sweeps; Platform Eng provisioned GPUs and job templates. |
| **Packaging, Evaluation & Promotion** | **15** Packaging/Export, **16** Eval & Robustness, **17** Drive Replay/Sim, **18** Registry & Promotion | ML Engineer | Platform Eng, Simulation Eng, PM/Safety | ML leads eval design and reports; Simulation validates safety on replays; PM/Safety approves promotion in registry. |
| **Deployment & Serving** | **19** Canary/Shadow, **20** A/B & Flags, **21** Edge Build & OTA, **22** OTA Delivery, **23** Online Service Ops, **24** Observability | Platform Engineering | ML Engineer, SRE, PM | Platform runs Triton/TorchServe on EKS, rollouts with canary/shadow; ML supplies model contracts and latency SLOs; SRE manages on-call. |
| **Monitoring & Continual Learning** | **25** Drift Detection, **26** Continual Learning Trigger, **27** Automated Retraining, **28** Testing in Prod (Safety Predicates) | **ML/MLOps Engineer** | Platform Eng, Data Eng, PM/Safety | You defined drift metrics, thresholds, and retrain triggers; wired safety predicates and rollback signals; coordinated retrain DAGs back to training gates. |
| **Cost, Lifecycle, Compliance** | **29** Cost Telemetry, **30** Data Lifecycle/Tiering, **31** Security Scans, **32** Datasheets/Model Cards | Platform Engineering | FinOps, Security, ML Engineer, PM | Cost attribution by job/model; lifecycle S3 tiering; SBOM/signing; ML contributes governance artifacts and model cards. |
| **Reliability, Capacity, Maps** | **33** Incident RCA, **34** Experiment GC, **35** GPU Capacity & Queues, **36** Map/Trigger Policy Update | SRE/Platform Engineering | ML Engineer, Map/Ops, PM | SRE drives RCAs; Platform handles capacity/bin-packing; ML provides failure buckets and updates trigger policies with Map/Ops. |

___
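
Workflow 25 lists PSI tests and ticket generation when thresholds are crossed. As an illustration only, the following is a minimal sketch of such a check in plain Python. The function names (`bin_fractions`, `psi`, `drift_ticket`), the ticket fields, the fixed bin edges, and the 0.25 threshold are all assumptions made for the example and are not taken from the pipeline itself; 0.25 is a commonly quoted rule of thumb, and real thresholds would be tuned per feature and slice.

```python
import math
from bisect import bisect_right


def bin_fractions(values, edges, eps=1e-6):
    """Return the fraction of values in each bin.

    `edges` are sorted interior cut points, giving len(edges) + 1 bins.
    Fractions are floored at `eps` so empty bins never yield log(0);
    this is a simplification, so fractions may sum to slightly over 1.
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = len(values)
    return [max(c / total, eps) for c in counts]


def psi(baseline, live, edges):
    """Population Stability Index: sum((a - e) * ln(a / e)) over bins."""
    expected = bin_fractions(baseline, edges)
    actual = bin_fractions(live, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


def drift_ticket(feature, baseline, live, edges, threshold=0.25):
    """Return a hypothetical ticket dict if PSI exceeds the threshold, else None."""
    score = psi(baseline, live, edges)
    if score > threshold:
        return {"feature": feature, "psi": score, "action": "open_retrain_review"}
    return None


if __name__ == "__main__":
    edges = [0.25, 0.5, 0.75]
    baseline = [i / 100 for i in range(100)]    # spread evenly over [0, 1)
    live = [0.5 + i / 200 for i in range(100)]  # shifted into [0.5, 1)
    print(drift_ticket("speed_norm", baseline, baseline, edges))  # no drift
    print(drift_ticket("speed_norm", baseline, live, edges))      # drift ticket
```

In a setup like the one described, the returned ticket would feed the drift report that workflow 26 consumes; a KS test or per-slice variant could replace `psi` without changing the surrounding shape.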