# Planning, Operational Strategy

___


### MLOps Stack Canvas 

| Block (Canvas)                              | Tool / Service                                                                   | Options & Trade-offs (Why chosen)                                                                                                                                                                              |
| ------------------------------------------- | -------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **1. Value Proposition & Ownership**        | **Confluence/Jira**                                                              | Keeps model ownership explicit (product, safety, infra). Decisions captured as ADRs; reduces ambiguity in a safety-critical domain.                                                                            |
| **2. Data Sources & Versioning**            | **S3 (lake), S3 Glacier/Intelligent-Tiering, Lake Formation, IAM/KMS**           | Durable, cheap, fine-grained access control and encryption. Tiering controls cost at PB scale.                                                                                                                 |
|                                             | **Data ingress: Snowball, DataSync, Direct Connect**                             | Hybrid ingest (bulk vs. network). Snowball for big offline drops; DataSync for ongoing transfers.                                                                                                              |
|                                             | **ROS/rosbag tooling**                                                           | Native handling of multi-sensor logs (camera, radar, lidar, CAN).                                                                                                                                              |
|                                             | **DVC + Git LFS**                                                                | Dataset, manifest, and label versioning tied to code; reproducible experiments; artifact lineage.                                                                                                              |
|                                             | **Data governance: Glue Data Catalog**                                           | Central schema/partition catalog for discovery and query; feeds Athena/EMR.                                                                                                                                    |
| **3. Data Analysis & Experiment Mgmt**      | **Python, PyTorch, CUDA/TensorRT**                                               | Dominant stack for CV in the project window; strong ecosystem and production path.                                                                                                                             |
|                                             | **Jupyter/VS Code + remote kernels**                                             | Fast iteration; runs close to data/GPUs.                                                                                                                                                                       |
|                                             | **W\&B (Experiments + Artifacts)**                                               | Single pane for metrics, configs, datasets, and models; artifact lineage; team reports.                                                                                                                        |
| **4. Feature Store & Workflows (optional)** | **Feast (opt-in)**                                                               | Use only if multiple models reuse temporal/tabular features (e.g., telemetry, map joins). Offline: S3/Parquet + Glue/Athena; Online: DynamoDB/Redis. Avoids premature complexity for primarily deep CV models. |
| **5. Foundations (DevOps & Code Mgmt)**     | **GitHub + trunk-based flow**                                                    | Simple, fast releases; short-lived branches.                                                                                                                                                                   |
|                                             | **CI/CD: GitHub Actions**                                                        | Build/test Docker, run unit/integration tests, push images to ECR, deploy via IaC—fully as code.                                                                                                               |
|                                             | **IaC: Terraform**                                                               | Reproducible AWS infra (EKS/ECS, VPC, S3, IAM, ECR, RDS/OpenSearch, etc.).                                                                                                                                     |
|                                             | **Quality: pytest, hypothesis, mypy, black, isort, pre-commit**                  | Enforces correctness and style; catches regressions early.                                                                                                                                                     |
|                                             | **Secrets: AWS Secrets Manager / Parameter Store**                               | Centralized, rotated secrets with IAM.                                                                                                                                                                         |
| **6. CI/Training/Deployment Orchestration** | **Airflow (DAGs)**                                                               | Orchestrates E2E loops: ingest → curate → label → train → evaluate → register → deploy; robust retries, backfills, SLAs.                                                                                       |
|                                             | **Training backends: EKS + GPU nodes / SageMaker Training / AWS Batch**          | Flexibility: EKS for custom containers & Triton eval; SageMaker for managed distributed training; Batch for large offline inference/ETL. Choose per workload.                                                  |
|                                             | **Data/Model tests: Great Expectations, pytest-style model tests**               | Data contracts + schema checks; model acceptance gates (latency/size/metrics).                                                                                                                                 |
| **7. Model Registry & Versioning**          | **W\&B Model Registry + Artifacts**                                              | Semantic versions, stage transitions (Staging/Prod), lineage to code, data, metrics; approvals built into PRs.                                                                                                 |
|                                             | **ECR (containers), S3 (model bundles)**                                         | Immutable containers; decouples model package from runner; S3 for heavyweight bundles.                                                                                                                         |
| **8. Deployment & Serving**                 | **NVIDIA Triton Inference Server on EKS**                                        | High-throughput batch/offline eval and online GPU serving; model ensembles; dynamic batching for CV.                                                                                                           |
|                                             | **TorchServe (alt) / FastAPI microservices**                                     | Simple CPU/GPU services for lighter models or tools; FastAPI for control/metadata APIs.                                                                                                                        |
|                                             | **Traffic mgmt: ALB + EKS, Argo Rollouts (canary/shadow)**                       | Safe releases: canary by slice (scene type, geography); shadow deployments score mirrored traffic for evaluation without affecting served responses.                                                           |
| **9. Monitoring (ML/Data/System)**          | **W\&B dashboards & alerts**                                                     | Live experiment/inference metrics, slices, drift views; team reports.                                                                                                                                          |
|                                             | **Evidently (drift), custom monitors**                                           | Feature/output drift, PSI/KS tests; triggers for retraining tickets/DAGs.                                                                                                                                      |
|                                             | **CloudWatch + Prometheus/Grafana, OpenTelemetry**                               | Infra/app telemetry, golden signals (latency p50/p95/p99, QPS, error rates), trace sampling.                                                                                                                   |
|                                             | **Alerting: SNS/PagerDuty**                                                      | On-call routing with severity and runbooks.                                                                                                                                                                    |
| **10. Metadata Store**                      | **W\&B (run/param/artifact lineage)**                                            | Central ML metadata without extra infra; links code↔data↔model.                                                                                                                                                |
|                                             | **Glue Data Catalog + DynamoDB (scene index) + OpenSearch (text/vector search)** | Hybrid metadata for scenario mining: structured queries, free-text, and similarity search for “find more like this” clips.                                                                                     |
| **11. Overarching: Build vs Buy**           | **AWS-first + best-of-breed OSS (Airflow, Feast, DVC) + W\&B**                   | Managed where it matters (security, scale), OSS where flexible; minimizes platform tax while keeping velocity. Skills focus: PyTorch, K8s, Terraform, Airflow, W\&B.                                           |

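The monitoring block mentions PSI tests that trigger retraining tickets or DAGs. As a rough illustration of what such a check computes, here is a minimal, standard-library-only sketch. It is not the Evidently API: the function names `psi` and `should_open_retraining_ticket`, the equal-width binning over the reference range, and the 0.2 threshold (a common rule of thumb) are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a current sample.

    Assumption: equal-width bins spanning the reference sample's range.
    """
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has zero range")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        n = len(values)
        # Floor each proportion at eps so the log ratio stays finite.
        return [max(c / n, eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    # PSI = sum over bins of (current - reference) * ln(current / reference).
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def should_open_retraining_ticket(score: float, threshold: float = 0.2) -> bool:
    """Hypothetical gate: flag drift when PSI exceeds the assumed threshold."""
    return score > threshold
```

In a setup like the one in the canvas, a scheduled job could compute `psi` per feature or per model output slice against a training-time reference sample and pass the result to the gate; the actual bin strategy and threshold would need tuning per signal.
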
___