# Shopify Merlin

**Introduction**

* **Context:** Shopify's ML platform team builds infrastructure and tools to streamline ML workflows for data scientists.
* **Use Cases:**
  * **Internal:** Fraud detection, revenue predictions.
  * **External (Merchant/Buyer Facing):** Product categorization, recommendation systems.
* **Need for Redesign:** Required a platform able to handle diverse requirements, inputs, data types, dependencies, and integrations, enabling use of best-of-breed tools.
* **Focus of Post:** Introduction to Merlin, its architecture, the user workflow, and a product use case.

**The Magic of Merlin**

* **Foundation:** Based on an open-source stack.
* **Objectives:**
  1. **Scalability:** Robust infrastructure for scaling ML workflows.
  2. **Fast Iterations:** Reduce friction and minimize the prototype-to-production gap.
  3. **Flexibility:** Allow users to use any necessary libraries/packages.
* **Initial Focus (First Iteration):** Training and batch inference.

**Merlin Architecture**

* **Data Input:** Uses features and datasets from Shopify's data lake or Pano (feature store), typically pre-processed by tools like Spark.
* **Merlin Workspaces:**
  * Dedicated environments for each use case (tasks, dependencies, resources).
  * Enable distributed computing and scalability.
  * **Underlying Technology:** Short-lived Ray clusters deployed on Shopify's Kubernetes cluster (for batch jobs).
* **Merlin API:** Consolidated service for on-demand creation of Merlin Workspaces.
* **User Interaction:** Users can interact with Merlin Workspaces from Jupyter Notebooks (prototyping) or orchestrate them via Airflow/Oozie (production).
* **Core Component:** Ray.

- [Shopify: The Magic of Merlin: Shopify's New Machine Learning Platform](https://shopify.engineering/merlin-shopify-machine-learning-platform)

**What Is Ray?**

* **Definition:** Open-source framework with a simple, universal API for building distributed systems and tools to parallelize ML workflows.
* **Ecosystem:** Includes distributed versions of scikit-learn, XGBoost, TensorFlow, PyTorch, etc.
* **Functionality:** Provides a cluster to distribute computation across multiple CPUs/machines.
* **`ray.init()`:** Starts a Ray runtime (local, or connects to an existing local/remote cluster). Enables seamless code transition from local to distributed.
* **Ray Client API:** Used to connect to remote Ray clusters.
* **Example (XGBoost on Ray):**
  * Uses the `xgboost_ray` integration.
  * `RayParams` defines the distribution (e.g., `num_actors`, `cpus_per_actor`).
  * `RayDMatrix` provides a distributed data representation.
  * `train()` executes distributed training.

**Ray In Merlin**

* **Rationale for Choosing Ray:**
  * Python-centric development at Shopify.
  * Enables end-to-end Python ML workflows.
  * Integrates with existing ML libraries.
  * Easily distributes/scales with minimal code changes.
* **Usage:** Each ML project in Merlin includes Ray for distributed preprocessing, training, and prediction.
* **Prototype to Production:** Ray facilitates this by allowing code developed locally or in notebooks to run on remote Ray clusters at scale from the earliest stages.
* **Adopted Ray Features:**
  * **Ray Train:** For distributed deep learning (TensorFlow, PyTorch).
  * **Ray Tune:** For experiment execution and hyperparameter tuning.
  * **Ray Kubernetes Operator:** For managing Ray deployments on Kubernetes and autoscaling Ray clusters.

**Building On Merlin (User's Development Journey)**

1. **Creating a new project:** User creates a Merlin Project (code, requirements, packages).
2. **Prototyping:** User creates a Merlin Workspace (sandbox with Jupyter) for distributed, scalable prototyping.
3. **Moving to production:** User updates the Merlin Project with finalized code/requirements.
4. **Automating:** User orchestrates/schedules the workflow (via Airflow DAGs) in production.
5. **Iterating:** User spins up another Merlin Workspace for new experiments.
**Merlin Projects**

* **Purpose:** Dedicated to a specific ML task (e.g., training, batch prediction).
* **Customization:** Users can specify system-level packages or Python libraries.
* **Technical Implementation:** Docker container with a dedicated virtual environment (Conda, pyenv) for code/dependency isolation.
* **Management:** CLI for creating, defining, and using Merlin Projects.
* **`config.yml`:** Specifies dependencies and ML libraries.
* **`src` folder:** Contains use-case-specific code.
* **CI/CD:** Pushing code to a branch triggers a custom Docker image build.

**Merlin Workspaces**

* **Creation:** Via the centralized Merlin API (abstracts infrastructure logic such as K8s Ray cluster deployment, ingress, and service accounts).
* **Resource Definition:** Users can define the resources they need (GPUs, memory, CPUs, machine types).
* **Execution Environment:** Spins up a Ray cluster in a dedicated Kubernetes namespace using the Merlin Project's Docker image.
* **API Payload Example:** Specifies `name`, `min_workers`, `max_workers`, `cpu`, `gpu_count`, `gpu_type`, `memory`, `enable_jupyter`, `image`.
* **Lifecycle:** Can be shut down manually or automatically after job completion, returning resources to the K8s cluster.

**Prototyping From Jupyter Notebooks**

* **Environment:** Users spin up a new ML notebook in Shopify's centrally hosted JupyterHub environment using their Merlin Project's Docker image (which includes all code and dependencies).
* **Remote Connection:** Use the Ray Client API from the notebook to connect remotely to Merlin Workspaces.
* **Distributed Computation:** Run remote Ray Tasks and Ray Actors to parallelize work on the underlying Ray cluster.
* **Benefit:** Minimizes the prototype-to-production gap by providing full Merlin/Ray capabilities early.
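The Merlin API payload fields listed under Merlin Workspaces above might combine into a request body along these lines. This is a sketch with invented values; only the field names come from the post, and the actual schema, GPU types, and image paths are internal to Shopify:

```python
# Illustrative request body for creating a Merlin Workspace via the Merlin API.
# Field names are from the post; every value below is made up.
payload = {
    "name": "product-categorization",
    "min_workers": 2,             # lower bound for Ray cluster autoscaling
    "max_workers": 10,            # upper bound for Ray cluster autoscaling
    "cpu": 8,                     # CPUs per worker
    "gpu_count": 1,               # GPUs per worker
    "gpu_type": "nvidia-tesla-t4",
    "memory": "32Gi",
    "enable_jupyter": True,       # attach a Jupyter server for prototyping
    "image": "gcr.io/example/merlin-project:latest",  # Merlin Project Docker image
}
```

Bounding the cluster with `min_workers`/`max_workers` is what lets the Ray Kubernetes Operator autoscale the Workspace while it runs and return resources once it shuts down.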
**Moving to Production**

* **Code Update:** Push prototyped code to the Merlin Project, triggering a new Docker image build via CI/CD.
* **Orchestration:**
  * Build ML flows using declarative YAML templates or configure Airflow DAGs.
  * Jobs are scheduled periodically and call the Merlin API to spin up Workspaces and run jobs.
* **Monitoring & Observability:**
  * **Datadog:** Dedicated dashboard per Merlin Workspace for job monitoring and resource-usage analysis.
  * **Splunk:** Logs from each Merlin job for debugging.

**Onboarding Shopify's Product Categorization Model to Merlin**

* **Use Case Complexity:** Requires several workflows for training and batch prediction; chosen to validate Merlin because of its large-scale computation and complex logic.
* **Migration:** Training and batch prediction workflows were migrated to Merlin and converted to use Ray.
* **Migrating the training code:**
  * Integrated the TensorFlow training code with **Ray Train**.
  * Minimal code changes: the original TF logic was mostly unchanged, encapsulated in a `train_func`.
  * A `Trainer` object from `ray.train` is configured with a backend (`"tensorflow"`), `num_workers`, and `use_gpu`.
  * `trainer.run(train_func, config=config)` executes distributed training.
* **Migrating inference:**
  * Multi-step process; each step was migrated separately.
  * Used **Ray ActorPool** to distribute the batch inference steps (similar to Python's `multiprocessing.Pool`).
  * **`Predictor` class (Ray Actor):** Contains the logic for loading the model and performing predictions.
  * Actors are created based on available cluster resources (`ray.available_resources()["CPU"]`).
  * `ActorPool.map_unordered()` sends dataset partitions to the actors for prediction.
  * **Future Improvement:** Plan to migrate to **Ray Dataset Pipelines** for more robust data-load distribution and batch inference.

**What's next for Merlin**

* **Aspiration:** A centralized platform that streamlines ML workflows and enables data scientists to innovate.
* **Next Milestones:**
  * **Migration:** Migrate all Shopify ML use cases to Merlin; add a low-code framework for new use cases.
  * **Online inference:** Support real-time model serving at scale.
  * **Model lifecycle management:** Add a model registry and experiment tracking.
  * **Monitoring:** Support ML-specific monitoring.
* **Current Status:** A new platform, but already providing scalability, fast iteration, and flexibility.