# Planning, Operational Strategy

___

### MLOps Stack Canvas

| Block (Canvas) | Tool / Service | Options & Trade-offs (Why chosen) |
| --- | --- | --- |
| **1. Value Proposition & Ownership** | Confluence/Jira | Keeps model ownership explicit (product, safety, infra). Decisions captured as ADRs; reduces ambiguity in a safety-critical domain. |
| **2. Data Sources & Versioning** | **S3 (lake), S3 Glacier/Intelligent-Tiering, Lake Formation, IAM/KMS** | Durable, cheap, fine-grained access control and encryption. Tiering controls cost at PB scale. |
| | **Data ingress: Snowball, DataSync, Direct Connect** | Hybrid ingest (bulk vs. network). Snowball for big offline drops; DataSync for ongoing transfers. |
| | **ROS/rosbag tooling** | Native handling of multi-sensor logs (camera, radar, lidar, CAN). |
| | **DVC + Git LFS** | Dataset, manifest, and label versioning tied to code; reproducible experiments; artifact lineage. |
| | **Data governance: Glue Data Catalog** | Central schema/partition catalog for discovery and query; feeds Athena/EMR. |
| **3. Data Analysis & Experiment Mgmt** | **Python, PyTorch, CUDA/TensorRT** | Dominant stack for CV in the project window; strong ecosystem and production path. |
| | **Jupyter/VS Code + remote kernels** | Fast iteration; runs close to data/GPUs. |
| | **W&B (Experiments + Artifacts)** | Single pane for metrics, configs, datasets, and models; artifact lineage; team reports. |
| **4. Feature Store & Workflows (optional)** | **Feast (opt-in)** | Use only if multiple models reuse temporal/tabular features (e.g., telemetry, map joins). Offline: S3/Parquet + Glue/Athena; Online: DynamoDB/Redis. Avoids premature complexity for primarily deep CV models. |
| **5. Foundations (DevOps & Code Mgmt)** | **GitHub + trunk-based flow** | Simple, fast releases; short-lived branches. |
| | **CI/CD: GitHub Actions** | Build/test Docker, run unit/integration tests, push images to ECR, deploy via IaC; the whole pipeline is defined as code. |
| | **IaC: Terraform** | Reproducible AWS infra (EKS/ECS, VPC, S3, IAM, ECR, RDS/OpenSearch, etc.). |
| | **Quality: pytest, hypothesis, mypy, black, isort, pre-commit** | Enforces correctness and style; catches regressions early. |
| | **Secrets: AWS Secrets Manager / Parameter Store** | Centralized, rotated secrets with IAM. |
| **6. CI/Training/Deployment Orchestration** | **Airflow (DAGs)** | Orchestrates E2E loops: ingest → curate → label → train → evaluate → register → deploy; robust retries, backfills, SLAs. |
| | **Training backends: EKS + GPU nodes / SageMaker Training / AWS Batch** | Flexibility: EKS for custom containers & Triton eval; SageMaker for managed distributed training; Batch for large offline inference/ETL. Choose per workload. |
| | **Data/Model tests: Great Expectations, pytest-style model tests** | Data contracts + schema checks; model acceptance gates (latency/size/metrics). |
| **7. Model Registry & Versioning** | **W&B Model Registry + Artifacts** | Semantic versions, stage transitions (Staging/Prod), lineage to code, data, metrics; approvals built into PRs. |
| | **ECR (containers), S3 (model bundles)** | Immutable containers; decouples model package from runner; S3 for heavyweight bundles. |
| **8. Deployment & Serving** | **NVIDIA Triton Inference Server on EKS** | High-throughput batch/offline eval and online GPU serving; model ensembles; dynamic batching for CV. |
| | **TorchServe (alt) / FastAPI microservices** | Simple CPU/GPU services for lighter models or tools; FastAPI for control/metadata APIs. |
| | **Traffic mgmt: ALB + EKS, Argo Rollouts (canary/shadow)** | Safe releases; canary by slice (scene type, geography), shadow for offline eval. |
| **9. Monitoring (ML/Data/System)** | **W&B dashboards & alerts** | Live experiment/inference metrics, slices, drift views; team reports. |
| | **Evidently (drift), custom monitors** | Feature/output drift, PSI/KS tests; triggers for retraining tickets/DAGs. |
| | **CloudWatch + Prometheus/Grafana, OpenTelemetry** | Infra/app telemetry, golden signals (latency p50/p95/p99, QPS, error rates), trace sampling. |
| | **Alerting: SNS/PagerDuty** | On-call routing with severity and runbooks. |
| **10. Metadata Store** | **W&B (run/param/artifact lineage)** | Central ML metadata without extra infra; links code↔data↔model. |
| | **Glue Data Catalog + DynamoDB (scene index) + OpenSearch (text/vector search)** | Hybrid metadata for scenario mining: structured queries, free-text, and similarity search for “find more like this” clips. |
| **12. Overarching: Build vs Buy** | **AWS-first + best-of-breed OSS (Airflow, Feast, DVC) + W&B** | Managed where it matters (security, scale), OSS where flexible; minimizes platform tax while keeping velocity. Skills focus: PyTorch, K8s, Terraform, Airflow, W&B. |

___
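
The monitoring block lists PSI tests as one of the drift signals that can trigger retraining tickets or DAGs. As a rough illustration of what such a custom monitor might compute, here is a minimal pure-Python sketch of the Population Stability Index. It is not Evidently's implementation. The function names `psi` and `should_retrain`, the equal-width binning over the reference range, the `eps` floor for empty bins, and the 0.25 threshold (a widely quoted rule of thumb) are all assumptions made for this example, not choices stated in the canvas.

```python
import math
from typing import List, Sequence


def psi(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Population Stability Index between a reference and a current sample.

    Assumption: equal-width bins spanning the reference range; current values
    outside that range are clamped into the first or last bin.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has zero range")
    width = (hi - lo) / bins

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        total = len(values)
        # Floor each fraction at eps so the log term is defined for empty bins.
        return [max(c / total, eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    # PSI = sum over bins of (cur - ref) * ln(cur / ref)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


def should_retrain(psi_value: float, threshold: float = 0.25) -> bool:
    """Hypothetical gate: flag drift when PSI exceeds the chosen threshold."""
    return psi_value > threshold


if __name__ == "__main__":
    # Synthetic feature values: the current sample is shifted to the upper half.
    reference_sample = [i / 100 for i in range(100)]
    current_sample = [0.5 + i / 200 for i in range(100)]
    score = psi(reference_sample, current_sample)
    print(f"PSI={score:.3f} retrain={should_retrain(score)}")
```

In a pipeline like the one outlined above, a check of this kind could run per feature or per model output on a schedule, with a positive `should_retrain` result opening a ticket or triggering the retraining DAG. The threshold would normally be tuned per feature rather than fixed globally.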