# Workflows, Team, Roles

___

### Workflows

| Pipeline / Workflow                                 | Trigger                               | Inputs                                                                          | Key Steps                                                                                                                        | Outputs                                              |
| --------------------------------------------------- | ------------------------------------- | ------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------- |
| **1. Telemetry & Bulk Data Ingestion**              | Drive-docked or upload-complete event | SSD offloads, ROS/rosbag, video, lidar, radar, CAN/IMU, small telemetry streams | Copy via Snowball/DataSync → S3 landing; verify checksums; register manifest; notify downstream                                  | Raw data in S3 bronze, drive manifest, ingestion log |
| **2. Data Integrity & PII Redaction**               | Post-ingestion event                  | Raw sensor logs                                                                 | File integrity checks; sensor presence; time span sanity; blur faces/plates; redact PII; sign results                            | Cleaned S3 bronze, redaction report                  |
| **3. Sensor Sync & Format Conversion**              | After integrity pass                  | Rosbags, proprietary binaries                                                   | Time-align streams; extract frames/keyframes; convert to Parquet/Zarr; create clip shards                                        | Synchronized Parquet/Zarr shards in S3 silver        |
| **4. Metadata Extraction & Cataloging**             | On conversion finish                  | Synchronized shards, manifests                                                  | Extract timestamps, GPS, weather if available, route IDs; schema to Glue; upsert DynamoDB; index text to OpenSearch              | Glue tables, DynamoDB entries, OpenSearch index      |
| **5. Map & Weather Enrichment**                     | Daily batch                           | GPS traces, time, road graphs                                                   | Join with map tiles/HD refs; fetch historical weather; attach road classes/speed limits                                          | Enriched metadata columns in silver                  |
| **6. Scene Detection & Event Triggers**             | Hourly batch                          | Enriched clips, CAN signals                                                     | Lightweight detectors for cut-ins, harsh brake, stationary hazard, disengagements; write event windows                           | Event windows table, tags per clip                   |
| **7. Similarity & Vector Index Build**              | New events found                      | Reference clips, embeddings model                                               | Generate clip embeddings; upsert vector DB; enable “find more like this”                                                         | Vector index entries; retrieval API ready            |
| **8. Trigger-based Scenario Mining**                | On-demand or schedule                 | Event windows, vector queries                                                   | Search long-tail scenarios (night-rain, occlusion, construction); de-dup; rank by novelty/uncertainty                            | Candidate sets for labeling/mining                   |
| **9. Auto-Labeling (Bootstrapped)**                 | Candidate set ready                   | Pretrained models, heuristics                                                   | Run offline inference at scale; propagate pseudo-labels; confidence filtering; weak supervision rules                            | Auto-labeled datasets with provenance                |
| **10. Human-in-the-Loop Labeling & QA**             | Auto-labeled set queued               | Auto-labels, raw clips                                                          | Sampling for manual QA; spot-check hard slices; adjudicate disagreements; finalize labels                                        | Verified labels; label QA metrics                    |
| **11. Golden & Slice Dataset Builder**              | Weekly or on request                  | Labeled tables, metadata                                                        | Build “golden” benchmark sets and slice packs (night, rain, occlusion, construction); freeze with DVC; publish to W\&B Artifacts | Versioned datasets with DVC tags and W\&B artifacts  |
| **12. Offline Mining via Batch Inference**          | Nightly                               | Latest model, large unlabeled pool                                              | Run model across pool on Batch/EKS; capture failures, high-uncertainty, drifted slices                                           | Failure buckets; candidates for re-label             |
| **13. Distributed Training (Perception Multitask)** | New dataset version or ticket         | Curated datasets, configs                                                       | Launch distributed training; mixed precision; checkpointing; gradient accumulation; log to W\&B                                  | Trained checkpoints; W\&B runs & artifacts           |
| **14. Hyperparameter Sweeps**                       | Model change or perf gap              | Training code, sweep config                                                     | W\&B sweeps; early stopping; budgeted search; capture best by primary metric                                                     | Best config bundle; sweep report                     |
| **15. Model Packaging & Export**                    | Train job success                     | Best checkpoint                                                                 | Export TorchScript/ONNX; TensorRT build; INT8 calibration on repset; embed metadata                                              | Versioned model bundle in S3 + ECR image             |
| **16. Model Evaluation & Robustness Suite**         | New bundle ready                      | Golden & slice datasets, model bundle                                           | Compute mAP/mIoU/AP by slice; calibration (ECE); robustness (noise, blur, weather); latency on target; write eval report         | Eval JSON, W\&B reports, promotion decision signal   |
| **17. Drive Replay & Simulation Validation**        | Gate before promotion                 | Model bundle, replay logs/sim scenarios                                         | Re-run model on historical incidents; sim-in-loop perturbations; compare to baselines; safety predicates                         | Replay KPIs, safety deltas, sign-off artifacts       |
| **18. Model Registry & Promotion Gate**             | Eval passed                           | W\&B run, artifacts, reports                                                    | Create/advance model version in W\&B Registry; attach evidence (datasets, evals); request approvals                              | Staged model with audit trail                        |
| **19. Canary/Shadow Deployment**                    | Promotion approved                    | Container image, serving config                                                 | Deploy to Triton on EKS; **shadow**-route the same traffic for comparison; canary on a small % of traffic; watch SLOs            | Shadow/canary live; rollout decision inputs          |
| **20. Online A/B and Feature Flag Switchboard**     | After shadow confidence               | Routing config, guardrails                                                      | Route by geography/scene type; progressive exposure; automatic pause on SLO breach                                               | Controlled rollout; experiment results               |
| **21. Edge-Compatible Build & OTA Packaging**       | Edge target release                   | Model bundle, calibrations                                                      | Further quantization/distillation; embed runtime checks; produce OTA package manifest                                            | Edge package ready; manifest signed                  |
| **22. Over-The-Air Delivery**                       | Release ticket                        | OTA package                                                                     | Stage to distribution; phased fleets; collect post-deploy telemetry hooks                                                        | OTA rollout status; feedback telemetry               |
| **23. Online Inference Service Ops**                | Continuous                            | Live frames/events                                                              | Triton dynamic batching; health probes; autoscale; backpressure; cache hot features                                              | Real-time predictions; health metrics                |
| **24. Monitoring & Observability**                  | Continuous                            | Metrics/logs/traces                                                             | Infra: CPU/GPU/mem; App: p50/p95/p99, QPS, error rate; ML: confidences, slice metrics; dashboards & alerts                       | Grafana/W\&B dashboards; alert incidents             |
| **25. Data/Output Drift Detection**                 | Hourly/daily                          | Live feature/output dists, baseline                                             | PSI/KS tests; concept drift on outputs; slice drifts; generate tickets if thresholds crossed                                     | Drift reports; retrain triggers                      |
| **26. Continual Learning Trigger**                  | Drift or failure quota exceeded       | Drift report, failure buckets                                                   | Open labeling tasks; schedule mining; enqueue retraining DAG                                                                     | Approved retraining request                          |
| **27. Automated Retraining**                        | Triggered                             | Updated datasets                                                                | Re-run workflows 13→18 in sequence; compare to current prod; promote only on net gain                                            | New candidate model version                          |
| **28. Testing in Production (Safety Predicates)**   | Pre/post rollout                      | Live predictions                                                                | Real-time rules: sanity, rate-limit, confidence thresholds, disagreement with baselines; automatic fallback                      | Predicate logs; auto-disable signals                 |
| **29. Cost Telemetry & Optimization**               | Daily/weekly                          | AWS billing, job metrics                                                        | Attribute cost to datasets/models; spot utilization; right-size; S3 tiering candidates                                           | Cost reports; actions (tiering, instance changes)    |
| **30. Data Lifecycle & Tiering**                    | Weekly                                | Access stats, retention policy                                                  | Move cold data to Glacier/Intelligent-Tiering; compact small files; delete temp                                                  | Lower storage cost; lifecycle logs                   |
| **31. Security & Compliance Scans**                 | CI and nightly                        | Docker images, IaC, deps                                                        | Trivy/Grype scans; IaC checks; SBOM; sign containers; policy-as-code gates                                                       | Security reports; signed artifacts                   |
| **32. Governance: Datasheets & Model Cards**        | On promotion                          | Datasets, evals, risks                                                          | Auto-generate Datasheets/Model Cards with metrics, slices, risks, mitigations                                                    | Versioned governance docs                            |
| **33. Incident Review & RCA Pack**                  | On alert or incident                  | Logs, traces, frames                                                            | Bundle timeline, inputs/outputs, saliency, SHAP for tabular, predicates fired; propose fixes                                     | RCA doc; backlog items                               |
| **34. Experiment Lifecycle & Artifact GC**          | Weekly                                | W\&B projects, S3 buckets                                                       | Auto-archive stale runs; GC tmp artifacts; keep winners and governance sets                                                      | Cleaned registry; controlled storage                 |
| **35. GPU Capacity & Queue Scheduler**              | Continuous                            | Job queue, quotas                                                               | Bin-pack training/inference; fairness across teams; preemption for priority                                                      | Predictable throughput; SLA adherence                |
| **36. Map/Trigger Policy Update**                   | Monthly or new roadworks              | Map deltas, ops inputs                                                          | Update road rules, construction zones; refresh trigger heuristics                                                                | Updated enrichment; fewer false alarms               |

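Workflows 25 and 26 hinge on a drift score crossing a threshold. The sketch below shows one way a PSI check could be computed and turned into a ticket signal. It is a minimal illustration, not the pipeline's actual code: the function names `psi` and `should_open_ticket`, the equal-width binning, and the 0.2 threshold (a common rule of thumb) are all assumptions.

```python
import math
from typing import Sequence


def psi(baseline: Sequence[float], live: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline sample and a live sample.

    Bins are equal-width over the baseline range (an assumption; quantile
    bins are another common choice).
    """
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline has zero range; cannot build bins")
    width = (hi - lo) / bins

    def fractions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        total = len(values)
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / total, eps) for c in counts]

    p = fractions(baseline)
    q = fractions(live)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def should_open_ticket(score: float, threshold: float = 0.2) -> bool:
    """Hypothetical gate for workflow 26: flag drift above the threshold."""
    return score > threshold


if __name__ == "__main__":
    baseline = [i / 100 for i in range(100)]
    shifted = [v + 0.5 for v in baseline]  # simulated distribution shift
    print(should_open_ticket(psi(baseline, baseline)))  # no drift
    print(should_open_ticket(psi(baseline, shifted)))   # drifted
```

In practice the same check would likely run per slice (night, rain, occlusion, construction) so that a drift in one slice is not averaged away by the rest of the traffic.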
___

### Team and Roles


| Category                              | Tasks Covered                                                                                                                                                         | Primary Owner                 | Supporting Roles                        | Notes / Hand-offs                                                                                                                                                                             |
| ------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------- | --------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Data Ingestion & Foundations**      | **1** Ingestion, **2** Integrity & PII, **3** Sync & Convert, **4** Metadata, **5** Map/Weather Enrichment                                                            | **Data Engineering**          | Platform Eng, Security, PM              | Hand off enriched, validated data and catalogs to ML for mining and training.                                                                                                                 |
| **Scene Understanding & Data Mining** | **6** Scene Detection & Triggers, **7** Vector Index, **8** Scenario Mining, **9** Auto-Labeling, **10** Human QA, **11** Golden/Slice Builder, **12** Offline Mining | **ML/MLOps Engineer** | Label Ops, Data Eng, PM                 | ML/MLOps leads trigger design, embedding search, mining strategy, auto-labeling rules, and curation of the “golden” and slice datasets; Label Ops handles adjudication in workflow 10 using the ML/MLOps sampling and QA guidelines. |
| **Model Training & Experimentation**  | **13** Distributed Training, **14** HPO/Sweeps                                                                                                                        | **ML/MLOps Engineer** | Platform Eng, Data Eng                  | ML/MLOps owns training pipelines, W\&B runs/artifacts, and budgeted sweeps; Platform Eng provisions GPUs and job templates.                                                                   |
| **Packaging, Evaluation & Promotion** | **15** Packaging/Export, **16** Eval & Robustness, **17** Drive Replay/Sim, **18** Registry & Promotion                                                               | ML Engineer                   | Platform Eng, Simulation Eng, PM/Safety | ML leads eval design and reports; Simulation validates safety on replays; PM/Safety approves promotion in registry.                                                                           |
| **Deployment & Serving**              | **19** Canary/Shadow, **20** A/B & Flags, **21** Edge Build & OTA, **22** OTA Delivery, **23** Online Service Ops, **24** Observability                               | Platform Engineering          | ML Engineer, SRE, PM                    | Platform runs Triton/TorchServe on EKS, rollouts with canary/shadow; ML supplies model contracts and latency SLOs; SRE manages on-call.                                                       |
| **Monitoring & Continual Learning**   | **25** Drift Detection, **26** Continual Learning Trigger, **27** Automated Retraining, **28** Testing in Prod (Safety Predicates)                                    | **ML/MLOps Engineer** | Platform Eng, Data Eng, PM/Safety       | ML/MLOps defines drift metrics, thresholds, and retrain triggers; wires safety predicates and rollback signals; coordinates retrain DAGs back to the training gates.                          |
| **Cost, Lifecycle, Compliance**       | **29** Cost Telemetry, **30** Data Lifecycle/Tiering, **31** Security Scans, **32** Datasheets/Model Cards                                                            | Platform Engineering          | FinOps, Security, ML Engineer, PM       | Cost attribution by job/model; lifecycle S3 tiering; SBOM/signing; ML contributes governance artifacts and model cards.                                                                       |
| **Reliability, Capacity, Maps**       | **33** Incident RCA, **34** Experiment GC, **35** GPU Capacity & Queues, **36** Map/Trigger Policy Update                                                             | SRE/Platform Engineering      | ML Engineer, Map/Ops, PM                | SRE drives RCAs; Platform handles capacity/bin-packing; ML provides failure buckets and updates trigger policies with Map/Ops.                                                                |

___