Planning, Operational Strategy


MLOps Stack Canvas

Block (Canvas)

Tool / Service

Options & Trade-offs (Why chosen)

1. Value Proposition & Ownership

Confluence/Jira

Keeps model ownership explicit across product, safety, and infra. Decisions are captured as Architecture Decision Records (ADRs), which reduces ambiguity in a safety-critical domain.

2. Data Sources & Versioning

S3 (lake), S3 Glacier/Intelligent-Tiering, Lake Formation, IAM/KMS

Durable, low-cost storage with fine-grained access control and encryption. Tiering controls cost at PB scale.

Data ingress: Snowball, DataSync, Direct Connect

Hybrid ingest (bulk vs. network). Snowball for big offline drops; DataSync for ongoing transfers.
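
The bulk-versus-network choice can be reasoned about with a back-of-the-envelope transfer-time estimate. The sketch below is illustrative only: the function names, the 80% link utilization, and the 7-day cutoff are assumptions, not values from this plan.

```python
def transfer_days(size_tb: float, link_gbps: float, utilization: float = 0.8) -> float:
    """Estimated days to move size_tb terabytes over a link of link_gbps gigabits/s."""
    if size_tb < 0 or link_gbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("size must be >= 0, link speed > 0, utilization in (0, 1]")
    bits = size_tb * 1e12 * 8                      # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400                        # seconds per day


def pick_ingest_path(size_tb: float, link_gbps: float,
                     max_days: float = 7.0, utilization: float = 0.8) -> str:
    """Hypothetical rule: ship offline when the network transfer would exceed max_days."""
    days = transfer_days(size_tb, link_gbps, utilization)
    return "network (DataSync)" if days <= max_days else "offline (Snowball)"
```

For example, 100 TB over a fully used 1 Gbps link takes a little over nine days, so under the assumed 7-day cutoff this rule would pick the offline path.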

ROS/rosbag tooling

Native handling of multi-sensor logs (camera, radar, lidar, CAN).

DVC + Git LFS

Dataset, manifest, and label versioning tied to code; reproducible experiments; artifact lineage.
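
The idea behind tying dataset versions to code is content addressing: a dataset version is derived from the hashes of its files, so any change to data or labels yields a new version. The sketch below illustrates that idea with the standard library only. It is not DVC's implementation or API, and the names `file_digest` and `build_manifest` are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large sensor logs fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map every file under root to its digest and derive one version hash."""
    files = {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    # The version depends only on relative paths and contents, not on timestamps.
    payload = json.dumps(files, sort_keys=True).encode("utf-8")
    return {"files": files, "version": hashlib.sha256(payload).hexdigest()}
```

Committing such a manifest next to the training code is what lets an experiment be reproduced against the exact data it saw.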

Data governance: Glue Data Catalog

Central schema/partition catalog for discovery and query; feeds Athena/EMR.

3. Data Analysis & Experiment Mgmt

Python, PyTorch, CUDA/TensorRT

Dominant stack for CV in the project window; strong ecosystem and production path.

Jupyter/VS Code + remote kernels

Fast iteration; runs close to data/GPUs.

W&B (Experiments + Artifacts)

Single pane for metrics, configs, datasets, and models; artifact lineage; team reports.

4. Feature Store & Workflows (optional)

Feast (opt-in)

Use only if multiple models reuse temporal/tabular features (e.g., telemetry, map joins). Offline: S3/Parquet + Glue/Athena; Online: DynamoDB/Redis. Avoids premature complexity for primarily deep CV models.
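
What makes temporal features tricky is point-in-time correctness: a training example may only see feature values observed at or before its own timestamp. The sketch below shows that lookup with the standard library. It is a conceptual illustration, not Feast's API, and `point_in_time_value` is a hypothetical name.

```python
from bisect import bisect_right


def point_in_time_value(timestamps, values, event_time):
    """Return the latest feature value observed at or before event_time.

    timestamps must be sorted ascending and aligned with values.
    Returns None when no observation exists yet, which avoids leaking
    future data into a training example.
    """
    idx = bisect_right(timestamps, event_time)
    return values[idx - 1] if idx > 0 else None
```

If only one model needs this kind of join, a helper like this inside its data loader may be enough, which is the case for keeping the feature store opt-in.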

5. Foundations (DevOps & Code Mgmt)

GitHub + trunk-based flow

Simple, fast releases; short-lived branches.

CI/CD: GitHub Actions

Build and test Docker images, run unit/integration tests, push images to ECR, and deploy via IaC, with every step defined as code.

IaC: Terraform

Reproducible AWS infra (EKS/ECS, VPC, S3, IAM, ECR, RDS/OpenSearch, etc.).

Quality: pytest, hypothesis, mypy, black, isort, pre-commit

Enforces correctness and style; catches regressions early.

Secrets: AWS Secrets Manager / Parameter Store

Centralized secrets with rotation and IAM-scoped access.

6. CI/Training/Deployment Orchestration

Airflow (DAGs)

Orchestrates E2E loops: ingest → curate → label → train → evaluate → register → deploy; robust retries, backfills, SLAs.
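
The loop above is a dependency graph executed in order, with retries around each step. The sketch below models that with the standard library's `graphlib` instead of Airflow, so it runs without a scheduler. The names `PIPELINE`, `execution_order`, and `run_with_retries` are illustrative, and a real DAG would use Airflow operators and its own retry settings.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
PIPELINE = {
    "ingest": set(),
    "curate": {"ingest"},
    "label": {"curate"},
    "train": {"label"},
    "evaluate": {"train"},
    "register": {"evaluate"},
    "deploy": {"register"},
}


def execution_order(graph):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


def run_with_retries(task, max_attempts=3):
    """Call task(), retrying on any exception up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
```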

Training backends: EKS + GPU nodes / SageMaker Training / AWS Batch

Flexibility: EKS for custom containers & Triton eval; SageMaker for managed distributed training; Batch for large offline inference/ETL. Choose per workload.

Data/Model tests: Great Expectations, pytest-style model tests

Data contracts + schema checks; model acceptance gates (latency/size/metrics).
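
A model acceptance gate can be as small as a function that compares a candidate's measurements to fixed thresholds and reports which checks failed. The sketch below assumes a candidate is described by three numbers (latency, size, one quality metric). The field names and thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    max_latency_ms: float   # upper bound on inference latency
    max_size_mb: float      # upper bound on model bundle size
    min_metric: float       # lower bound on the chosen quality metric


def failed_checks(candidate: dict, gate: Gate) -> list[str]:
    """Return the names of the checks the candidate fails (empty list means pass)."""
    failures = []
    if candidate["latency_ms"] > gate.max_latency_ms:
        failures.append("latency")
    if candidate["size_mb"] > gate.max_size_mb:
        failures.append("size")
    if candidate["metric"] < gate.min_metric:
        failures.append("metric")
    return failures
```

Wrapped in a pytest-style test that asserts the list is empty, this becomes a hard gate in CI before a model is registered.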

7. Model Registry & Versioning

W&B Model Registry + Artifacts

Semantic versions, stage transitions (Staging/Prod), lineage to code, data, metrics; approvals built into PRs.
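
Semantic versions and approval-gated stage transitions can be sketched as a version bump plus a small state machine. This is not the W&B registry API. The `Archived` stage and the function names are assumptions added for illustration; only Staging and Prod come from the text above.

```python
# Allowed stage transitions; None means "not yet in any stage".
ALLOWED = {
    None: {"Staging"},
    "Staging": {"Prod", "Archived"},
    "Prod": {"Archived"},
    "Archived": set(),
}


def bump(version: str, part: str) -> str:
    """Bump a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(x) for x in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part}")


def promote(current_stage, target_stage, approved: bool):
    """Move a model to target_stage only if approved and the transition is allowed."""
    if not approved:
        raise PermissionError("promotion requires an approved review")
    if target_stage not in ALLOWED[current_stage]:
        raise ValueError(f"cannot move from {current_stage} to {target_stage}")
    return target_stage
```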

ECR (containers), S3 (model bundles)

Immutable containers; decouples model package from runner; S3 for heavyweight bundles.

8. Deployment & Serving

NVIDIA Triton Inference Server on EKS

High-throughput batch/offline eval and online GPU serving; model ensembles; dynamic batching for CV.

TorchServe (alt) / FastAPI microservices

Simple CPU/GPU services for lighter models or tools; FastAPI for control/metadata APIs.

Traffic mgmt: ALB + EKS, Argo Rollouts (canary/shadow)

Safe releases; canary by slice (scene type, geography), shadow for offline eval.
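
Canarying by slice means each slice gets its own canary percentage, and a given request should land on the same side every time. The sketch below shows one way to do that with a stable hash. It is not Argo Rollouts configuration, and the slice names and the `route` function are hypothetical.

```python
import hashlib


def route(request_id: str, slice_name: str, canary_percent: dict[str, float]) -> str:
    """Return "canary" or "stable" for a request.

    canary_percent maps a slice name to a percentage in [0, 100];
    slices not listed receive no canary traffic.
    """
    percent = canary_percent.get(slice_name, 0.0)
    # A stable hash keeps routing deterministic across processes and restarts.
    digest = hashlib.sha256(f"{slice_name}:{request_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # buckets 0..9999
    return "canary" if bucket < percent * 100 else "stable"
```

For instance, `route("req-1", "night_urban", {"night_urban": 10.0})` sends roughly one in ten night-urban requests to the canary while every other slice stays on the stable model.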

9. Monitoring (ML/Data/System)

W&B dashboards & alerts

Live experiment/inference metrics, slices, drift views; team reports.

Evidently (drift), custom monitors

Feature/output drift, PSI/KS tests; triggers for retraining tickets/DAGs.
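
The Population Stability Index (PSI) compares the binned distribution of a reference sample with a current sample: PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the fractions of reference and current values in bin i. The sketch below is a minimal standard-library version and does not use Evidently's API. The equal-width binning, the bin count, and the `eps` floor for empty bins are assumptions.

```python
import math


def psi(expected, actual, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of actual against the expected (reference) sample."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread; cannot build bins")
    width = (hi - lo) / bins

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            idx = min(max(idx, 0), bins - 1)   # clamp out-of-range values into edge bins
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(sample), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical samples give a PSI of zero. A monitor would compare the score against a team-chosen threshold and open a retraining ticket or trigger a DAG when it is exceeded.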

CloudWatch + Prometheus/Grafana, OpenTelemetry

Infra/app telemetry, golden signals (latency p50/p95/p99, QPS, error rates), trace sampling.
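
Latency percentiles such as p50/p95/p99 can be computed from raw samples with the nearest-rank method. The sketch below is a minimal version of that method. Production systems such as Prometheus typically estimate percentiles from histograms instead, so treat this as an illustration of the definition.

```python
import math


def percentile(samples, p: int):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))   # 1-based rank
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Return the p50, p95, and p99 latencies for a batch of measurements."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```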

Alerting: SNS/PagerDuty

On-call routing with severity and runbooks.

10. Metadata Store

W&B (run/param/artifact lineage)

Central ML metadata without extra infra; links code↔data↔model.

Glue Data Catalog + DynamoDB (scene index) + OpenSearch (text/vector search)

Hybrid metadata for scenario mining: structured queries, free-text, and similarity search for “find more like this” clips.
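
The "find more like this" query amounts to ranking clip embeddings by similarity to a query embedding. The sketch below does a brute-force cosine-similarity search over an in-memory dict. It stands in for a vector index such as OpenSearch without using its API, and the clip IDs and function names are hypothetical.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, index: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the IDs of the k clips whose embeddings are closest to the query."""
    ranked = sorted(index.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [clip_id for clip_id, _ in ranked[:k]]
```

Brute force is fine for a sketch; at scene-index scale an approximate nearest-neighbor index would replace the full sort.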

11. Overarching: Build vs Buy

AWS-first + best-of-breed OSS (Airflow, Feast, DVC) + W&B

Managed services where it matters (security, scale), OSS where flexibility is needed; minimizes platform tax while keeping velocity. Skills focus: PyTorch, K8s, Terraform, Airflow, W&B.