Planning, Operational Strategy

MLOps Stack Canvas

Block (Canvas)	Tool / Service	Options & Trade-offs (Why chosen)
1. Value Proposition & Ownership	Confluence/Jira	Keeps model ownership explicit (product, safety, infra). Decisions are captured as architecture decision records (ADRs), which reduces ambiguity in a safety-critical domain.
2. Data Sources & Versioning	S3 (lake), S3 Glacier/Intelligent-Tiering, Lake Formation, IAM/KMS	Durable, low-cost storage with fine-grained access control and encryption. Tiering controls cost at PB scale.
	Data ingress: Snowball, DataSync, Direct Connect	Hybrid ingest (bulk vs. network). Snowball for big offline drops; DataSync for ongoing transfers.
	ROS/rosbag tooling	Native handling of multi-sensor logs (camera, radar, lidar, CAN).
	DVC + Git LFS	Dataset, manifest, and label versioning tied to code; reproducible experiments; artifact lineage.
	Data governance: Glue Data Catalog	Central schema/partition catalog for discovery and query; feeds Athena/EMR.
3. Data Analysis & Experiment Mgmt	Python, PyTorch, CUDA/TensorRT	Dominant stack for computer vision (CV) during the project timeframe; strong ecosystem and a clear path to production.
	Jupyter/VS Code + remote kernels	Fast iteration; runs close to data/GPUs.
	W&B (Experiments + Artifacts)	Single pane for metrics, configs, datasets, and models; artifact lineage; team reports.
4. Feature Store & Workflows (optional)	Feast (opt-in)	Use only if multiple models reuse temporal/tabular features (e.g., telemetry, map joins). Offline: S3/Parquet + Glue/Athena; Online: DynamoDB/Redis. Avoids premature complexity for primarily deep CV models.
5. Foundations (DevOps & Code Mgmt)	GitHub + trunk-based flow	Simple, fast releases; short-lived branches.
	CI/CD: GitHub Actions	Build and test Docker images, run unit/integration tests, push images to ECR, and deploy via IaC; the whole pipeline is defined as code.
	IaC: Terraform	Reproducible AWS infra (EKS/ECS, VPC, S3, IAM, ECR, RDS/OpenSearch, etc.).
	Quality: pytest, hypothesis, mypy, black, isort, pre-commit	Enforces correctness and style; catches regressions early.
	Secrets: AWS Secrets Manager / Parameter Store	Centralized, rotated secrets with IAM.
6. CI/Training/Deployment Orchestration	Airflow (DAGs)	Orchestrates E2E loops: ingest → curate → label → train → evaluate → register → deploy; robust retries, backfills, SLAs.
	Training backends: EKS + GPU nodes / SageMaker Training / AWS Batch	Flexibility: EKS for custom containers & Triton eval; SageMaker for managed distributed training; Batch for large offline inference/ETL. Choose per workload.
	Data/Model tests: Great Expectations, pytest-style model tests	Data contracts + schema checks; model acceptance gates (latency/size/metrics).
7. Model Registry & Versioning	W&B Model Registry + Artifacts	Semantic versions, stage transitions (Staging/Prod), lineage to code, data, metrics; approvals built into PRs.
	ECR (containers), S3 (model bundles)	Immutable containers; decouples model package from runner; S3 for heavyweight bundles.
8. Deployment & Serving	NVIDIA Triton Inference Server on EKS	High-throughput batch/offline eval and online GPU serving; model ensembles; dynamic batching for CV.
	TorchServe (alt) / FastAPI microservices	Simple CPU/GPU services for lighter models or tools; FastAPI for control/metadata APIs.
	Traffic mgmt: ALB + EKS, Argo Rollouts (canary/shadow)	Safe releases; canary by slice (scene type, geography), shadow deployments for evaluation without serving responses to users.
9. Monitoring (ML/Data/System)	W&B dashboards & alerts	Live experiment/inference metrics, slices, drift views; team reports.
	Evidently (drift), custom monitors	Feature/output drift via Population Stability Index (PSI) and Kolmogorov-Smirnov (KS) tests; triggers for retraining tickets/DAGs.
	CloudWatch + Prometheus/Grafana, OpenTelemetry	Infra/app telemetry, golden signals (latency p50/p95/p99, QPS, error rates), trace sampling.
	Alerting: SNS/PagerDuty	On-call routing with severity and runbooks.
10. Metadata Store	W&B (run/param/artifact lineage)	Central ML metadata without extra infra; links code↔data↔model.
	Glue Data Catalog + DynamoDB (scene index) + OpenSearch (text/vector search)	Hybrid metadata for scenario mining: structured queries, free-text, and similarity search for “find more like this” clips.
12. Overarching: Build vs Buy	AWS-first + best-of-breed OSS (Airflow, Feast, DVC) + W&B	Managed where it matters (security, scale), OSS where flexible; minimizes platform tax while keeping velocity. Skills focus: PyTorch, K8s, Terraform, Airflow, W&B.
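
Block 9 names PSI tests as one trigger for retraining tickets. As a rough illustration of what such a custom monitor could compute, the sketch below implements the Population Stability Index in plain Python. The function names, the bin edges, and the 0.2 threshold are assumptions made for this example (0.2 is a common rule of thumb, not a value taken from this plan), and the code is not Evidently's implementation.

```python
import math
from typing import List, Sequence

EPS = 1e-6  # floor for empty bins so the logarithm stays defined


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Return the fraction of values in each bin.

    `edges` are ascending interior cut points, so len(edges) + 1 bins result.
    """
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) + 1)
    for v in values:
        # The bin index equals the number of cut points at or below the value.
        idx = sum(1 for e in edges if v >= e)
        counts[idx] += 1
    return [max(c / len(values), EPS) for c in counts]


def psi(reference: Sequence[float], current: Sequence[float],
        edges: Sequence[float]) -> float:
    """Population Stability Index: sum over bins of (cur - ref) * ln(cur / ref)."""
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def needs_retraining_ticket(score: float, threshold: float = 0.2) -> bool:
    """Hypothetical gate: flag drift when PSI exceeds an assumed threshold."""
    return score > threshold


if __name__ == "__main__":
    # Illustrative data only: a uniform reference and a shifted current sample.
    edges = [0.25, 0.5, 0.75]
    reference = [i / 100 for i in range(100)]
    current = [0.5 + i / 200 for i in range(100)]
    score = psi(reference, current, edges)
    print(f"PSI={score:.3f} ticket={needs_retraining_ticket(score)}")
```

In practice the reference sample would come from the training data and the current sample from recent inference logs, with the boolean result feeding an alert or a DAG trigger. Fixing the bin edges from the reference distribution keeps scores comparable across monitoring runs.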