# ML Platforms: How to

### The MLOps Lead's Guide to Designing & Operationalizing Machine Learning Platforms

**Preamble: From Bespoke Solutions to Scalable Ecosystems**

The era of siloed, manually managed machine learning projects is rapidly giving way to a need for robust, scalable, and maintainable **Machine Learning Platforms**. As organizations like Zillow, Shopify, Uber, LinkedIn, Monzo, Coveo, Zomato, GreenSteam, and innovative ventures like Didact AI demonstrate, the ability to efficiently develop, deploy, and operate ML models is a significant competitive differentiator. This guide synthesizes their journeys, combined with MLOps best practices (as outlined by Google Cloud and AWS), to provide a thinking framework for Lead MLOps Engineers tasked with building or evolving such platforms. Our focus is on actionable insights, architectural patterns, critical trade-offs, and the "why" behind the "what."

---

**Chapter 1: The Imperative for an ML Platform - Motivations & Core Principles**

1. **The "Why": Addressing Pervasive Challenges**
   * **Fragmentation & Inefficiency:** DS/MLEs using disparate tools, leading to knowledge silos, difficult collaboration, and duplicated effort.
   * **The Prototype-to-Production Chasm:** Significant friction and engineering effort to move models from research/notebooks to reliable production services. This includes code rewriting, dependency management, and infrastructure concerns.
   * **Scalability Bottlenecks:** Training on desktop-sized data, inability to handle production load, manual scaling processes.
   * **Lack of Standardization & Reproducibility:** Inconsistent data pipelines, "it works on my machine" issues, difficulty tracking experiments and model versions.
   * **Operational Blindness:** Poor monitoring of models in production, leading to silent failures or performance degradation (Google MLOps).
   * **Slow Iteration Cycles:** Manual handoffs and lack of automation significantly slow down the ability to update models or deploy new ones (all sources).
   * **"Hidden Technical Debt in ML Systems":** The complexity surrounding ML code (data dependencies, configuration, monitoring, etc.) often outweighs the ML code itself.
2. **Core Principles for ML Platform Design**
   * **Data is King:** Accessible, clean, standardized data provides the biggest marginal gain. The platform must facilitate robust data management and feature engineering.
   * **Empower Data Scientists & MLEs (Autonomy):** Enable end-to-end workflows, from experimentation to deployment, minimizing handoffs.
   * **Flexibility:** Accommodate diverse ML frameworks, libraries, and problem types. Avoid overly prescriptive tooling where possible.
   * **Reuse Over Rebuild:** Leverage existing robust infrastructure (data stacks, microservice platforms, CI/CD) and focus platform efforts on the ML-specific "delta."
   * **PaaS/FaaS is Better than IaaS:** Abstract away infrastructure management. Utilize managed services for compute, storage, and scaling to free up ML teams.
   * **ELT is Better than ETL:** Clear separation of raw data ingestion and transformation promotes reliability and reproducibility.
   * **Standardization & Reproducibility:** Enforce consistent environments, versioning (code, data, models), and workflows.
   * **Scalability & Elasticity:** Design for on-demand resource provisioning and automatic scaling (Shopify Ray on K8s, AWS MLOps, Zillow Knative/KServe).
   * **Modularity & Composability:** Break down the ML lifecycle into reusable components/pipelines.
   * **Keep it Simple:** Don't over-engineer. Focus on current needs but allow for future evolution.

---

**Chapter 2: Anatomy of a Modern ML Platform - Key Components & Capabilities**

This chapter outlines the essential building blocks, drawing parallels across the provided examples.
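
One way to picture how such building blocks compose is a pipeline assembled from small, reusable steps, in the spirit of the Modularity & Composability principle. The sketch below is only an illustration under stated assumptions: `Step`, `Pipeline`, and the toy `ingest`/`validate`/`train` functions are hypothetical names invented here, not the API of any platform discussed in this guide, and the data is a tiny in-memory sample. It shows the same components being reused in both a data-check pipeline and a training pipeline.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

# All names here (Step, Pipeline, ingest, validate, train) are hypothetical
# and exist only for illustration; no platform in this guide exposes this API.
Context = Dict[str, Any]


@dataclass(frozen=True)
class Step:
    """A named, reusable unit of pipeline work."""
    name: str
    run: Callable[[Context], Context]  # reads the context, returns updates


class Pipeline:
    """Runs steps in order, merging each step's output into a shared context."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = list(steps)

    def execute(self, initial: Optional[Context] = None) -> Context:
        ctx: Context = dict(initial or {})
        for step in self.steps:
            ctx.update(step.run(ctx))
            # Record which steps ran, as a stand-in for metadata tracking.
            ctx["lineage"] = ctx.get("lineage", []) + [step.name]
        return ctx


def ingest(ctx: Context) -> Context:
    # Toy in-memory data; a real step would read from a lake or warehouse.
    return {"rows": [{"x": 1.0, "y": 2.0}, {"x": 2.0, "y": 4.0}, {"x": None, "y": 1.0}]}


def validate(ctx: Context) -> Context:
    # Drop rows with missing values and report how many were removed.
    clean = [r for r in ctx["rows"] if all(v is not None for v in r.values())]
    return {"rows": clean, "dropped": len(ctx["rows"]) - len(clean)}


def train(ctx: Context) -> Context:
    # Toy "model": least-squares slope through the origin, y ~ w * x.
    num = sum(r["x"] * r["y"] for r in ctx["rows"])
    den = sum(r["x"] ** 2 for r in ctx["rows"])
    return {"model": {"w": num / den}}


INGEST = Step("ingest", ingest)
VALIDATE = Step("validate", validate)
TRAIN = Step("train", train)

# The same components are reused in two different pipelines.
data_check_pipeline = Pipeline([INGEST, VALIDATE])
training_pipeline = Pipeline([INGEST, VALIDATE, TRAIN])

if __name__ == "__main__":
    result = training_pipeline.execute()
    print(result["lineage"], result["dropped"], result["model"])
```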

| Component/Capability | Description & Key Functions | Examples from Provided Articles |
| :--- | :--- | :--- |
| **1. Data Management & Feature Engineering Layer** | Ingesting, storing, transforming data for ML. Creating, sharing, and serving features for training and inference. | **Uber Michelangelo:** Shared Feature Store (HDFS/Cassandra), DSL for transformations. **Zomato:** Real-time (Flink -> Redis) & Static (Cassandra) Feature Stores. **Shopify Merlin:** Inputs from Data Lake/Pano (feature store). **Didact AI:** DuckDB/Redis feature store, complex multi-source FE. **Monzo:** SQL-based FE in BigQuery (dbt). |
| **2. Experimentation & Development Environment** | Tools for interactive data exploration, model prototyping, and collaborative development. | **LinkedIn DARWIN:** JupyterHub on K8s, multi-language, SQL workbooks. **Shopify Merlin:** Jupyter in Merlin Workspaces (Ray on K8s). **Monzo:** Google Colab (prototyping only). **GreenSteam:** Jupyter in Docker. **AWS MLOps:** SageMaker Studio notebooks. |
| **3. Model Training Orchestration & Execution** | Systems for defining, scheduling, and running model training jobs, often distributed, with hyperparameter tuning. | **Uber Michelangelo:** Distributed training, custom model types. **Shopify Merlin:** Ray Train, Ray Tune on K8s. **Monzo:** Custom containers on Google Cloud AI Platform. **Zomato:** MLflow triggering SageMaker. **GreenSteam:** Argo Workflows. **AWS MLOps:** SageMaker Pipelines, Training Jobs. **Coveo:** Metaflow. |
| **4. Model Registry & Artifact Store** | Centralized repository for versioning, storing, and managing trained models, their metadata, and associated artifacts (e.g., training data snapshots). | **Uber Michelangelo:** Cassandra-based model repo. **Monzo:** Custom Model Registry. **Zomato:** MLflow. **GreenSteam:** Neptune.ai. **Didact AI:** Local disk + S3. **AWS MLOps:** SageMaker Model Registry. **Google MLOps:** Model Registry. |
| **5. Model Evaluation & Validation** | Tools and processes for assessing model performance against metrics, business KPIs, and fairness/bias considerations. | **Uber Michelangelo:** Accuracy reports, tree viz, feature reports. **Google MLOps:** Offline & online validation, data/model validation in pipelines. **GreenSteam:** Human-in-the-loop audit reports. |
| **6. Model Deployment & Serving Layer** | Infrastructure and workflows for deploying models as batch prediction jobs or real-time inference services. | **Uber Michelangelo:** Offline (Spark) & Online (custom serving cluster). **Shopify Merlin:** Batch on Ray (planning online). **Monzo:** Python microservices on AWS (production stack), AI Platform for batch. **Zomato:** SageMaker endpoints, ML Gateway (Go). **GreenSteam:** FastAPI microservices. **Zillow:** KServe/Knative. |
| **7. Monitoring & Observability** | Tracking system health, model performance, data drift, and concept drift in production. Alerting on issues. | **Uber Michelangelo:** Live accuracy monitoring. **Monzo:** Grafana (system), Looker (model perf), dbt-slack (features). **Zomato:** Grafana. **GreenSteam:** Kibana, Sentry. **Google MLOps:** Continuous monitoring. **Didact AI:** Custom Python reports, Feature Explorer. |
| **8. Workflow Orchestration & MLOps Pipelines** | Automating the end-to-end ML lifecycle, including CI/CD for models and pipelines, and CT. | **Google MLOps:** Levels 0, 1, 2 detailing pipeline automation. **AWS MLOps:** SageMaker Projects, Pipelines, CodePipeline. **Shopify Merlin:** Airflow/Oozie. **Monzo:** Airflow. **GreenSteam:** Argo. **Coveo:** Prefect. |
| **9. Metadata & Artifact Tracking** | System for capturing, storing, and querying metadata about all aspects of the ML lifecycle (experiments, data, models, pipeline runs). | **Google MLOps:** ML Metadata & Artifact Repository. **LinkedIn DARWIN:** DataHub for resource metadata. **Neptune.ai / MLflow** are common tools. |
| **10. Governance & Compliance** | Ensuring security, privacy, auditability, and responsible AI practices throughout the platform. | **LinkedIn DARWIN:** Audit trails, fine-grained access control. **Google MLOps:** Handling model fairness, data privacy. **AWS MLOps:** Secure multi-account setup, IAM. |

*[Diagram: illustrative high-level ML platform architecture]*

---

**Chapter 3: MLOps Maturity & Pipeline Automation**

1. **MLOps Level 0: Manual Process**
   * **Characteristics:** Script-driven, interactive, manual handoffs between DS and Ops, infrequent releases, no CI/CD, focus on deploying model as prediction service, lack of active monitoring. (Google MLOps)
   * **Challenges:** Model decay, slow iteration, training-serving skew. (Google MLOps)
   * **GreenSteam's early days** and **Zomato pre-platform** exemplify this.
2. **MLOps Level 1: ML Pipeline Automation (Continuous Training - CT)**
   * **Goal:** Automate the ML pipeline for CT, achieve continuous delivery of model prediction service. (Google MLOps)
   * **Characteristics:** Orchestrated experiment steps, CT in production, experimental-operational symmetry, modularized/containerized code. (Google MLOps)
   * **Additional Components:** Automated Data Validation, Automated Model Validation, Feature Store (optional but beneficial), Metadata Management, Pipeline Triggers (on-demand, schedule, new data, model decay). (Google MLOps)
   * **AWS MLOps Initial & Repeatable Phases:** Focus on experimentation (SageMaker Studio) then automating training workflows (SageMaker Pipelines), model registry.
     Emphasis on multi-account strategy for dev/tooling/data lake.
   * **Shopify Merlin's initial focus on training and batch inference** aligns here.
3. **MLOps Level 2: CI/CD Pipeline Automation**
   * **Goal:** Robust, automated CI/CD system for rapid and reliable updates to ML pipelines themselves. (Google MLOps)
   * **Characteristics & Stages:**
     * Development & Experimentation (source code for pipeline steps).
     * Pipeline Continuous Integration (build, unit/integration tests for pipeline components).
     * Pipeline Continuous Delivery (deploy pipeline artifacts to target env).
     * Automated Triggering (of the deployed pipeline for CT).
     * Model Continuous Delivery (serve trained model as prediction service, progressive delivery - canary, A/B).
     * Monitoring (live data stats, model performance).
   * **AWS MLOps Reliable & Scalable Phases:** Introduces automated testing, pre-production/staging environments, manual approvals for promotion, templatization (SageMaker Projects) for onboarding multiple teams/use cases, advanced analytics governance account.
   * **Zillow's focus on "service as online flow" and automatic deployments** points to this level.

---

**Chapter 4: Designing Your ML Platform - A Lead's Decision Framework**

1. **Understanding Your Context & Constraints ("Reasonable Scale" - Coveo)**
   * **Team Size & Skills:** DS, MLE, Data Engineers, Ops. Autonomy vs. specialized roles.
   * **Data Volume & Velocity:** TBs vs PBs, batch vs real-time.
   * **Number of Models & Use Cases:** Dozens vs. hundreds.
   * **Budget & Resources:** Affects build vs. buy, managed vs. self-hosted.
   * **Existing Infrastructure:** Leverage or rebuild? (Monzo principle)
   * **Time-to-Market Pressure.**
2. **Key Architectural Choices & Trade-offs**
   * **Build vs. Buy vs. Adopt OSS:**
     * **Build:** Full control, custom fit, high initial cost/effort (Uber often builds significantly).
     * **Buy (Commercial MLaaS/Point Solutions):** Faster setup, vendor support, potential lock-in, cost (Coveo advocates PaaS).
     * **Adopt OSS:** Flexibility, community, no license cost, self-management overhead (Shopify, Zomato, GreenSteam, LinkedIn heavily use OSS like Ray, Kubeflow, MLflow, Flink, Argo, Neptune).
   * **Monolithic Platform vs. Best-of-Breed Integration:**
     * **Monolithic (e.g., SageMaker, Vertex AI):** Integrated experience, potentially less flexibility.
     * **Best-of-Breed:** Choose top tools for each component, integration challenge (Monzo, Coveo lean this way).
   * **Degree of Abstraction for Users:**
     * **Low-Code/No-Code vs. Code-First:** Catering to citizen DS vs. expert MLEs (LinkedIn DARWIN aims for both).
     * **Shopify Merlin:** Python-centric, aiming to abstract K8s/Ray complexities.
     * **Zillow:** Pythonic "service as online flow" to abstract web service concepts.
   * **Centralized vs. Decentralized Components:**
     * **Feature Store:** Centralized (Uber) vs. federated.
     * **Model Registry:** Typically centralized.
     * **Compute:** Shared clusters vs. dedicated per-user/project (Shopify Merlin Workspaces).
   * **Data Ingestion & Processing Strategy for ML:**
     * ELT for raw data, then ML-specific transformations.
     * Real-time feature computation (Flink - Zomato, Samza - Uber).
     * Batch feature computation (Spark - Uber, dbt+BigQuery - Monzo).
   * **Serving Strategy:**
     * Online (REST APIs, gRPC) vs. Batch vs. Streaming vs. Embedded.
     * CPU vs. GPU for inference.
     * Serverless (Knative - Zillow) vs. Provisioned.
   * **Environment Management:**
     * Docker/Containers are standard (GreenSteam, Shopify, AWS MLOps).
     * Kubernetes for orchestration (Shopify, LinkedIn DARWIN, Zillow).
     * Dedicated workspaces/sandboxes (Shopify Merlin, LinkedIn DARWIN).
3. **User Experience (UX) and Developer Productivity**
   * **Target Personas:** Who are you building for? (DS, MLE, Analysts).
   * **Seamless Workflow:** Minimize context switching. (LinkedIn DARWIN).
   * **Reproducibility:** Versioning data, code, models, environments. (GreenSteam, Monzo).
   * **Collaboration Features:** Sharing notebooks, features, models.
     (LinkedIn DARWIN).
   * **Ease of Onboarding:** Templates, CLIs, SDKs. (Shopify Merlin Projects, AWS SageMaker Projects).
4. **Iterative Platform Development**
   * Start with core needs (e.g., training, batch inference - Shopify).
   * Phased rollout based on MLOps maturity (AWS).
   * Gather user feedback continuously (LinkedIn DARWIN User Council).

---

**Chapter 5: Lessons Learned from the Trenches**

* **Start Simple, Iterate (GreenSteam YAGNI):** Avoid over-engineering for future unknowns.
* **Embrace Docker/Containers Early (GreenSteam):** Removes a large share of dependency and reproducibility issues.
* **SQL is Powerful for Feature Engineering (Monzo, Coveo):** Leverage the power of data warehouses for transformations before ML-specific steps. dbt is a key enabler.
* **Managed Services are Your Friend (Coveo, Monzo, Zomato):** Reduce operational burden, especially at "reasonable scale."
* **Abstract Complexity from Users (Shopify, Zillow):** Data scientists should focus on ML, not K8s YAML or web server internals. Pythonic SDKs are favored.
* **Testing ML is Hard (GreenSteam):** Unit tests for ML code can be tricky. Smoke tests on full datasets (with fast hyperparameters) can be more effective.
* **Human-in-the-Loop is Often Unavoidable (GreenSteam):** Especially for model auditing and ensuring business alignment, despite automation efforts.
* **Feature Stores are Foundational:** They help reduce training-serving skew and promote feature reuse (Uber, Zomato).
* **MLflow is a Popular Starting Point for Experiment Tracking & Registry (Zomato).**
* **Ray is Gaining Traction for Distributed Python ML (Shopify).**
* **Orchestration is Key:** Airflow, Argo, Prefect, Metaflow, SageMaker/Vertex Pipelines are essential for automating complex workflows.
* **Monitoring is Multi-faceted:** System health, data drift, model performance, business KPIs.
* **Culture of Autonomy and Ownership:** Platforms should empower teams, not create new bottlenecks.
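
The data-drift facet of monitoring listed above can be made concrete with a small statistical check. The sketch below uses the Population Stability Index (PSI), a technique swapped in here for illustration; the source articles do not state which drift statistic they use. The function names, the equal-width binning over the reference sample's range, and the 0.1 / 0.25 thresholds (a commonly quoted rule of thumb) are all assumptions, and a production check would typically run per feature on a schedule and feed an alerting system.

```python
import math
from typing import List, Sequence

# Assumptions for illustration: the function names, the equal-width binning,
# and the 0.1 / 0.25 thresholds (a common rule of thumb) are not taken from
# the source articles.


def _bin_fractions(values: Sequence[float], edges: List[float],
                   eps: float = 1e-6) -> List[float]:
    """Share of values per bin; out-of-range values go to the first or last bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        for i in range(len(edges) - 1):
            if v >= edges[i]:
                idx = i
        counts[idx] += 1
    # Floor each share at eps so the logarithm below is always defined.
    return [max(c / len(values), eps) for c in counts]


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index of `current` measured against `reference`."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if lo == hi:
        raise ValueError("reference sample has no spread to bin")
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    ref = _bin_fractions(reference, edges)
    cur = _bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def drift_status(score: float) -> str:
    """Map a PSI score to a coarse label using rule-of-thumb thresholds."""
    if score < 0.1:
        return "stable"
    if score < 0.25:
        return "moderate"
    return "significant"


if __name__ == "__main__":
    training_sample = [i / 100 for i in range(100)]    # feature values at training time
    live_sample = [0.5 + i / 200 for i in range(100)]  # shifted values seen in production
    score = psi(training_sample, live_sample)
    print(f"PSI={score:.3f} status={drift_status(score)}")
```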

---

**Chapter 6: The Future of ML Platforms**

* **Greater Abstraction & Automation:** Further reduction of boilerplate and infrastructure management.
* **Convergence of Data & ML Stacks:** Tighter integration between data warehouses/lakes and ML training/serving.
* **Rise of Real-time/Online Learning Capabilities:** Platforms need to better support models that adapt continuously.
* **Specialized Hardware Acceleration becoming Mainstream:** Easier access and management of GPUs/TPUs.
* **Enhanced Model Governance & Responsible AI Features:** Built-in tools for fairness, explainability, privacy.
* **Democratization through Low-Code/No-Code Interfaces:** While still providing power for expert users.
* **OSS Continues to Drive Innovation:** With enterprise-grade managed offerings built on top.

---

*[Diagram: ML Platform Design - MLOps Lead's Mind Map (Mermaid)]*

### References

- [Uber: Meet Michelangelo: Uber’s Machine Learning Platform](https://www.uber.com/en-IN/blog/michelangelo-machine-learning-platform/)
- [System Architectures for Personalization and Recommendation](https://netflixtechblog.com/system-architectures-for-personalization-and-recommendation-e081aa94b5d8)
- [Near Real-Time Netflix Recommendations Using Apache Spark Streaming](https://www.slideshare.net/slideshow/near-realtime-netflix-recommendations-using-apache-spark-streaming-with-nitin-sharma-and-elliot-chow/102214667)
- [Shopify: The Magic of Merlin: Shopify's New Machine Learning Platform](https://shopify.engineering/merlin-shopify-machine-learning-platform)
- [Coveo: You Don't Need a Bigger Boat: Recommendations at Reasonable Scale in a (Mostly) Serverless and Open Stack](https://github.com/jacopotagliabue/you-dont-need-a-bigger-boat)
- [Monzo’s machine learning stack](https://monzo.com/blog/2022/04/26/monzos-machine-learning-stack)
- [Real-time Machine Learning Inference Platform at Zomato](https://www.youtube.com/watch?v=0-3ES1vzW14)
- [Didact AI: The anatomy of an ML-powered stock picking engine](https://principiamundi.com/posts/didact-anatomy/)