Planning, Operational Strategy

MLOps Stack Canvas
| Block (Canvas) | Tool / Service | Options & Trade-offs (Why chosen) |
|---|---|---|
| 1. Value Proposition & Ownership | Confluence/Jira | Keeps model ownership explicit (product, safety, infra). Decisions captured as ADRs; reduces ambiguity in a safety-critical domain. |
| 2. Data Sources & Versioning | S3 (lake), S3 Glacier/Intelligent-Tiering, Lake Formation, IAM/KMS | Durable, cheap, fine-grained access control and encryption. Tiering controls cost at PB scale. |
| | Data ingress: Snowball, DataSync, Direct Connect | Hybrid ingest (bulk vs. network). Snowball for big offline drops; DataSync for ongoing transfers. |
| | ROS/rosbag tooling | Native handling of multi-sensor logs (camera, radar, lidar, CAN). |
| | DVC + Git LFS | Dataset, manifest, and label versioning tied to code; reproducible experiments; artifact lineage. |
| | Data governance: Glue Data Catalog | Central schema/partition catalog for discovery and query; feeds Athena/EMR. |
| 3. Data Analysis & Experiment Mgmt | Python, PyTorch, CUDA/TensorRT | Dominant stack for CV in the project window; strong ecosystem and production path. |
| | Jupyter/VS Code + remote kernels | Fast iteration; runs close to data/GPUs. |
| | W&B (Experiments + Artifacts) | Single pane for metrics, configs, datasets, and models; artifact lineage; team reports. |
| 4. Feature Store & Workflows (optional) | Feast (opt-in) | Use only if multiple models reuse temporal/tabular features (e.g., telemetry, map joins). Offline: S3/Parquet + Glue/Athena; Online: DynamoDB/Redis. Avoids premature complexity for primarily deep CV models. |
| 5. Foundations (DevOps & Code Mgmt) | GitHub + trunk-based flow | Simple, fast releases; short-lived branches. |
| | CI/CD: GitHub Actions | Build/test Docker, run unit/integration tests, push images to ECR, deploy via IaC, fully as code. |
| | IaC: Terraform | Reproducible AWS infra (EKS/ECS, VPC, S3, IAM, ECR, RDS/OpenSearch, etc.). |
| | Quality: pytest, hypothesis, mypy, black, isort, pre-commit | Enforces correctness and style; catches regressions early. |
| | Secrets: AWS Secrets Manager / Parameter Store | Centralized, rotated secrets with IAM. |
| 6. CI/Training/Deployment Orchestration | Airflow (DAGs) | Orchestrates E2E loops: ingest → curate → label → train → evaluate → register → deploy; robust retries, backfills, SLAs. |
| | Training backends: EKS + GPU nodes / SageMaker Training / AWS Batch | Flexibility: EKS for custom containers & Triton eval; SageMaker for managed distributed training; Batch for large offline inference/ETL. Choose per workload. |
| | Data/Model tests: Great Expectations, pytest-style model tests | Data contracts + schema checks; model acceptance gates (latency/size/metrics). |
| 7. Model Registry & Versioning | W&B Model Registry + Artifacts | Semantic versions, stage transitions (Staging/Prod), lineage to code, data, metrics; approvals built into PRs. |
| | ECR (containers), S3 (model bundles) | Immutable containers; decouples model package from runner; S3 for heavyweight bundles. |
| 8. Deployment & Serving | NVIDIA Triton Inference Server on EKS | High-throughput batch/offline eval and online GPU serving; model ensembles; dynamic batching for CV. |
| | TorchServe (alt) / FastAPI microservices | Simple CPU/GPU services for lighter models or tools; FastAPI for control/metadata APIs. |
| | Traffic mgmt: ALB + EKS, Argo Rollouts (canary/shadow) | Safe releases; canary by slice (scene type, geography), shadow for offline eval. |
| 9. Monitoring (ML/Data/System) | W&B dashboards & alerts | Live experiment/inference metrics, slices, drift views; team reports. |
| | Evidently (drift), custom monitors | Feature/output drift, PSI/KS tests; triggers for retraining tickets/DAGs. |
| | CloudWatch + Prometheus/Grafana, OpenTelemetry | Infra/app telemetry, golden signals (latency p50/p95/p99, QPS, error rates), trace sampling. |
| | Alerting: SNS/PagerDuty | On-call routing with severity and runbooks. |
| 10. Metadata Store | W&B (run/param/artifact lineage) | Central ML metadata without extra infra; links code↔data↔model. |
| | Glue Data Catalog + DynamoDB (scene index) + OpenSearch (text/vector search) | Hybrid metadata for scenario mining: structured queries, free-text, and similarity search for “find more like this” clips. |
| 12. Overarching: Build vs Buy | AWS-first + best-of-breed OSS (Airflow, Feast, DVC) + W&B | Managed where it matters (security, scale), OSS where flexible; minimizes platform tax while keeping velocity. Skills focus: PyTorch, K8s, Terraform, Airflow, W&B. |
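
The monitoring block mentions PSI tests as a drift signal that can trigger retraining tickets or DAGs. As a rough illustration of what a "custom monitor" along those lines might compute, here is a minimal, standard-library-only sketch of the Population Stability Index for a single numeric feature. It is not Evidently's API; the function names `psi` and `needs_retraining_ticket`, the equal-width binning, and the 0.2 threshold (a common rule of thumb) are assumptions made for this example.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between two samples of one feature.

    Bins are equal-width over the reference range (an assumption of this
    sketch); values in `current` outside that range are clamped into the
    first or last bin.
    """
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has zero range")
    width = (hi - lo) / n_bins

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref, cur = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def needs_retraining_ticket(score: float, threshold: float = 0.2) -> bool:
    """Hypothetical gate: flag drift when PSI exceeds the chosen threshold."""
    return score > threshold


if __name__ == "__main__":
    reference = [i / 100 for i in range(100)]      # training-time sample
    shifted = [0.5 + i / 200 for i in range(100)]  # mass moved to upper half
    score = psi(reference, shifted)
    print(f"PSI={score:.3f} ticket={needs_retraining_ticket(score)}")
```

In a setup like the one in the table, a check of this kind would likely run per feature or per model output on a schedule, with the boolean result deciding whether to open a ticket or trigger a retraining DAG; the threshold and binning would need tuning per signal.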