# Shopify Merlin
**Introduction**
*   **Context:** Shopify's ML platform team builds infrastructure and tools to streamline ML workflows for data scientists.
*   **Use Cases:**
    *   **Internal:** Fraud detection, revenue predictions.
    *   **External (Merchant/Buyer Facing):** Product categorization, recommendation systems.
*   **Need for Redesign:** Shopify needed a platform that handles diverse requirements, inputs, data types, dependencies, and integrations, and that lets teams use best-of-breed tools.
*   **Focus of Post:** Introduction to Merlin, its architecture, user workflow, and a product use case.
**The Magic of Merlin**
*   **Foundation:** Based on an open-source stack.
*   **Objectives:**
    1.  **Scalability:** Robust infrastructure for scaling ML workflows.
    2.  **Fast Iterations:** Reduce friction, minimize prototype-to-production gap.
    3.  **Flexibility:** Allow users to use any necessary libraries/packages.
*   **Initial Focus (First Iteration):** Training and batch inference.
**Merlin Architecture**
*   **Data Input:** Uses features and datasets from Shopify's data lake or Pano (feature store), typically pre-processed by tools like Spark.
*   **Merlin Workspaces:**
    *   Dedicated environments for each use case (tasks, dependencies, resources).
    *   Enable distributed computing and scalability.
    *   **Underlying Technology:** Short-lived Ray clusters deployed on Shopify's Kubernetes cluster (for batch jobs).
*   **Merlin API:** Consolidated service for on-demand creation of Merlin Workspaces.
*   **User Interaction:** Users can interact with Merlin Workspaces from Jupyter Notebooks (prototyping) or orchestrate via Airflow/Oozie (production).
*   **Core Component:** Ray.
 - [Shopify: The Magic of Merlin: Shopify's New Machine Learning Platform](https://shopify.engineering/merlin-shopify-machine-learning-platform)
**What Is Ray?**
*   **Definition:** Open-source framework with a simple, universal API for building distributed systems and tools to parallelize ML workflows.
*   **Ecosystem:** Includes distributed versions of scikit-learn, XGBoost, TensorFlow, PyTorch, etc.
*   **Functionality:** Provides a cluster to distribute computation across multiple CPUs/machines.
*   **`ray.init()`:** Starts a local Ray runtime or connects to an existing local or remote cluster, so the same code can move from a single machine to a distributed cluster with little change.
*   **Ray Client API:** Used to connect to remote Ray clusters.
*   **Example (XGBoost on Ray):**
    *   Uses `xgboost_ray` integration.
    *   `RayParams` defines how training is distributed (e.g., `num_actors`, `cpus_per_actor`).
    *   `RayDMatrix` for distributed data representation.
    *   `train()` function executes distributed training.
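
The four pieces listed above fit together roughly as follows. This is a minimal sketch modelled on the public `xgboost_ray` usage pattern, not Shopify's code: the dataset (scikit-learn's breast cancer set), the XGBoost parameters, and the wrapper name `train_on_ray` are assumptions chosen for illustration.

```python
def train_on_ray(num_actors: int = 2, cpus_per_actor: int = 1):
    """Train a small XGBoost model across Ray actors (illustrative only)."""
    # Imported inside the function so this module can be loaded on a
    # machine that does not have Ray or xgboost_ray installed.
    from sklearn.datasets import load_breast_cancer
    from xgboost_ray import RayDMatrix, RayParams, train

    features, labels = load_breast_cancer(return_X_y=True)

    # RayDMatrix is the distributed counterpart of xgboost.DMatrix:
    # it shards the data across the training actors.
    train_set = RayDMatrix(features, labels)

    evals_result = {}
    booster = train(
        {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
        train_set,
        evals_result=evals_result,
        evals=[(train_set, "train")],
        verbose_eval=False,
        # RayParams controls how the work is spread over the cluster.
        ray_params=RayParams(
            num_actors=num_actors,
            cpus_per_actor=cpus_per_actor,
        ),
    )
    return booster, evals_result
```

Calling `train_on_ray()` on a laptop starts a local Ray runtime; pointing Ray at a remote cluster first would run the same function at larger scale by raising `num_actors`.
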
**Ray In Merlin**
*   **Rationale for Choosing Ray:**
    *   Python-centric development at Shopify.
    *   Enables end-to-end Python ML workflows.
    *   Integrates with existing ML libraries.
    *   Easily distributes/scales with minimal code changes.
*   **Usage:** Each ML project in Merlin includes Ray for distributed preprocessing, training, and prediction.
*   **Prototype to Production:** Code developed locally or in a notebook can run on a remote Ray cluster at scale from the earliest stages of a project, which shortens the path to production.
*   **Adopted Ray Features:**
    *   **Ray Train:** For distributed deep learning (TensorFlow, PyTorch).
    *   **Ray Tune:** For experiment execution and hyperparameter tuning.
    *   **Ray Kubernetes Operator:** For managing Ray deployments on Kubernetes and autoscaling Ray clusters.
**Building On Merlin (User's Development Journey)**
1.  **Creating a new project:** User creates a Merlin Project (code, requirements, packages).
2.  **Prototyping:** User creates a Merlin Workspace (sandbox with Jupyter) for distributed/scalable prototyping.
3.  **Moving to Production:** User updates Merlin Project with finalized code/requirements.
4.  **Automating:** User orchestrates/schedules the workflow (via Airflow DAGs) in production.
5.  **Iterating:** User spins up another Merlin Workspace for new experiments.
**Merlin Projects**
*   **Purpose:** Dedicated to specific ML tasks (training, batch prediction).
*   **Customization:** Specify system-level packages or Python libraries.
*   **Technical Implementation:** Docker container with a dedicated virtual environment (Conda, pyenv) for code/dependency isolation.
*   **Management:** CLI for creating, defining, and using Merlin Projects.
*   **`config.yml`:** Specifies dependencies and ML libraries.
*   **`src` folder:** Contains use-case-specific code.
*   **CI/CD:** Pushing code to a branch triggers a custom Docker image build.
**Merlin Workspaces**
*   **Creation:** Via centralized Merlin API (abstracts infrastructure logic like K8s Ray cluster deployment, ingress, service accounts).
*   **Resource Definition:** Users can define required resources (GPUs, memory, CPUs, machine types).
*   **Execution Environment:** Spins up a Ray cluster in a dedicated Kubernetes namespace using the Merlin Project's Docker image.
*   **API Payload Example:** Specifies `name`, `min_workers`, `max_workers`, `cpu`, `gpu_count`, `gpu_type`, `memory`, `enable_jupyter`, `image`.
*   **Lifecycle:** Can be shut down manually or automatically after job completion, returning resources to the K8s cluster.
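
To make the payload concrete, here is a hedged sketch of how such a request body could be assembled. Only the nine field names come from the notes above; the field types, the example values (image path, GPU type string, memory format), the `WorkspaceRequest` class, and the worker-count check are all assumptions, since the real Merlin API schema is internal to Shopify.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class WorkspaceRequest:
    """Hypothetical request body for creating a Merlin Workspace."""
    name: str
    min_workers: int
    max_workers: int
    cpu: int
    gpu_count: int
    gpu_type: str
    memory: str
    enable_jupyter: bool
    image: str

    def __post_init__(self):
        # Assumed sanity check: autoscaling bounds must be ordered.
        if not 0 <= self.min_workers <= self.max_workers:
            raise ValueError("expected 0 <= min_workers <= max_workers")

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# All values below are placeholders, not real Shopify settings.
request = WorkspaceRequest(
    name="product-categorization-dev",
    min_workers=1,
    max_workers=20,
    cpu=8,
    gpu_count=1,
    gpu_type="example-gpu-type",
    memory="32Gi",
    enable_jupyter=True,
    image="registry.example.com/merlin/product-categorization:latest",
)
print(request.to_json())
```

The `min_workers` and `max_workers` pair reflects the autoscaling behaviour described earlier: the Ray cluster can grow and shrink between those bounds while the Workspace is alive.
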
**Prototyping From Jupyter Notebooks**
*   **Environment:** Users spin up a new ML notebook in Shopify's centrally hosted JupyterHub environment using their Merlin Project's Docker image (includes all code/dependencies).
*   **Remote Connection:** Use Ray Client API from the notebook to connect remotely to their Merlin Workspaces.
*   **Distributed Computation:** Run remote Ray Tasks and Ray Actors to parallelize work on the underlying Ray cluster.
*   **Benefit:** Minimizes prototype-to-production gap by providing full Merlin/Ray capabilities early.
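
The notebook workflow above can be sketched as follows. This is an illustration under stated assumptions rather than Shopify's code: the function name, the toy task, and the toy actor are made up, and the address you would pass for a real Workspace is not documented in the source. Ray Client addresses use the `ray://host:port` form, with 10001 as the default client server port.

```python
from typing import Optional


def run_on_workspace(address: Optional[str] = None):
    """Run one Ray Task and one Ray Actor, locally or on a remote cluster.

    address=None starts a local Ray runtime; an address of the form
    "ray://<host>:10001" connects through the Ray Client API instead.
    """
    # Imported lazily so the module loads without Ray installed.
    import ray

    ray.init(address=address)

    @ray.remote
    def square(x):
        # A Ray Task: a stateless function executed on the cluster.
        return x * x

    @ray.remote
    class RunningTotal:
        # A Ray Actor: a stateful worker process living on the cluster.
        def __init__(self):
            self.total = 0

        def add(self, value):
            self.total += value
            return self.total

    try:
        squares = ray.get([square.remote(i) for i in range(4)])
        accumulator = RunningTotal.remote()
        totals = ray.get([accumulator.add.remote(s) for s in squares])
    finally:
        ray.shutdown()
    return squares, totals
```

The body of the function is identical in both modes; only the `address` argument changes, which is what keeps the prototype-to-production gap small.
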
**Moving to Production**
*   **Code Update:** Push prototyped code to Merlin Project, triggering a new Docker image build via CI/CD.
*   **Orchestration:**
    *   Build ML flows using declarative YAML templates or configure Airflow DAGs.
    *   Jobs are scheduled to run periodically; each run calls the Merlin API to spin up a Workspace and execute the job.
*   **Monitoring & Observability:**
    *   **Datadog:** Dedicated dashboard per Merlin Workspace for job monitoring and resource usage analysis.
    *   **Splunk:** Logs from each Merlin job for debugging.
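
The orchestration step above implies a create, run, tear-down lifecycle for each scheduled job. The sketch below shows one way to guarantee the tear-down even when a job fails. Everything in it is hypothetical: `FakeMerlinClient` is an in-memory stand-in, and its method names are invented for illustration because the real Merlin API is not public.

```python
from contextlib import contextmanager


class FakeMerlinClient:
    """In-memory stand-in for the Merlin API (all names hypothetical)."""

    def __init__(self):
        self.active = set()

    def create_workspace(self, name):
        self.active.add(name)

    def delete_workspace(self, name):
        self.active.discard(name)


@contextmanager
def workspace(client, name):
    """Create a workspace for the duration of a `with` block."""
    client.create_workspace(name)
    try:
        yield name
    finally:
        # Runs on success and on failure, so resources always
        # return to the cluster.
        client.delete_workspace(name)


def run_scheduled_job(client, name, job):
    """Spin up a workspace, run the job against it, then tear it down."""
    with workspace(client, name) as active_name:
        return job(active_name)
```

A scheduler task (for example, a Python callable invoked from an Airflow DAG) could call `run_scheduled_job` once per run, which matches the short-lived Workspace model described earlier.
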
**Onboarding Shopify’s Product Categorization Model to Merlin**
*   **Use Case Complexity:** Requires several workflows for training and batch prediction; chosen to validate Merlin due to large-scale computation and complex logic.
*   **Migration:** Training and batch prediction workflows migrated to Merlin and converted using Ray.
*   **Migrating the training code:**
    *   Integrated TensorFlow training code with **Ray Train**.
    *   Minimal code changes: original TF logic mostly unchanged, encapsulated in a `train_func`.
    *   `Trainer` object from `ray.train` configured with backend ("tensorflow"), `num_workers`, `use_gpu`.
    *   `trainer.run(train_func, config=config)` executes distributed training.
*   **Migrating inference:**
    *   Multi-step process, each step migrated separately.
    *   Used Ray's **`ActorPool`** (similar to Python's `multiprocessing.Pool`) to distribute the batch inference steps.
    *   **`Predictor` class (Ray Actor):** Contains logic for loading model and performing predictions.
    *   Actors created based on available cluster resources (`ray.available_resources()["CPU"]`).
    *   `ActorPool.map_unordered()` used to send dataset partitions to actors for prediction.
    *   **Future Improvement:** Plan to migrate to **Ray Dataset Pipelines** for more robust data load distribution and batch inference.
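
The inference pattern described above can be sketched as follows. This is a simplified illustration, not Shopify's implementation: the placeholder "model", the `split_into_partitions` helper, the `partitions_per_actor` parameter, and the choice of one actor per available CPU are assumptions. Only `ActorPool`, `map_unordered()`, a `Predictor` actor, and `ray.available_resources()["CPU"]` come from the notes.

```python
from typing import List, Sequence


def split_into_partitions(rows: Sequence, num_partitions: int) -> List[list]:
    """Split rows into at most num_partitions contiguous, near-equal chunks."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    size, remainder = divmod(len(rows), num_partitions)
    partitions, start = [], 0
    for index in range(num_partitions):
        end = start + size + (1 if index < remainder else 0)
        if end > start:  # skip empty chunks when rows are scarce
            partitions.append(list(rows[start:end]))
        start = end
    return partitions


def run_batch_inference(rows: Sequence, partitions_per_actor: int = 4) -> list:
    """Distribute prediction over a pool of Predictor actors."""
    # Imported lazily so the helper above works without Ray installed.
    import ray
    from ray.util import ActorPool

    @ray.remote(num_cpus=1)
    class Predictor:
        def __init__(self):
            # Placeholder "model"; a real Predictor would load weights here.
            self.model = lambda row: len(str(row))

        def predict(self, partition):
            return [self.model(row) for row in partition]

    ray.init(ignore_reinit_error=True)
    # Assumption: one actor per CPU currently available in the cluster.
    num_actors = max(1, int(ray.available_resources().get("CPU", 1)))
    pool = ActorPool([Predictor.remote() for _ in range(num_actors)])

    partitions = split_into_partitions(rows, num_actors * partitions_per_actor)
    predictions = []
    # Results arrive as actors finish, not in submission order.
    for batch in pool.map_unordered(
        lambda actor, part: actor.predict.remote(part), partitions
    ):
        predictions.extend(batch)
    return predictions
```

Because `map_unordered()` yields results in completion order, a real pipeline would carry a row identifier alongside each prediction so outputs can be joined back to their inputs.
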
**What's Next for Merlin**
*   **Aspiration:** Centralized platform streamlining ML workflows, enabling data scientist innovation.
*   **Next Milestones:**
    *   **Migration:** Migrate all Shopify ML use cases to Merlin; add a low-code framework for new use cases.
    *   **Online inference:** Support real-time model serving at scale.
    *   **Model lifecycle management:** Add model registry and experiment tracking.
    *   **Monitoring:** Support ML-specific monitoring.
*   **Current Status:** New platform, already providing scalability, fast iteration, and flexibility.