# Appendix A: The MLOps Workflow: Environments, Branching, CI/CD, and Deployments Explained

The interplay between development environments, branching strategies, CI/CD pipelines, and the distinction between deploying the *pipeline* versus deploying the *model* is fundamental to MLOps but often confusing for newcomers (and even experienced practitioners!). Let's break down how these concepts fit together in a practical, step-by-step workflow, using the choices we've made (GitHub Flow, GitHub Actions, Terraform, Airflow, FastAPI deployment on Serverless Containers, etc.).

---

**Introduction:**

Building and operating ML systems involves more than just writing code and training models. It requires a structured workflow to manage changes, ensure quality, and reliably deploy updates. This appendix clarifies how key concepts – Development Environments (Dev, Staging, Prod), Git Branching Strategies (like GitHub Flow), Continuous Integration/Continuous Delivery (CI/CD) pipelines, and the different types of deployments (ML Pipelines vs. ML Models) – work together in a typical MLOps project.

**Core Concepts:**

1. **Environments (Dev, Staging, Prod):** These are distinct, isolated instances of your infrastructure (compute, storage, network configurations, deployed services).
    * **Dev:** The developer's local machine or a personal cloud workspace. Used for writing code, initial development, and local testing. Highly flexible, but often inconsistent between developers without standardization.
    * **Staging (Pre-Production):** A shared environment designed to mirror Production as closely as possible. Used for thorough testing (integration, end-to-end, load) of code *before* it goes live. Uses realistic (but usually non-production) data. Code is deployed here automatically after it passes CI and is merged.
    * **Production (Prod):** The live environment serving end-users. Receives code only after successful validation in Staging and the necessary approvals. Runs on live data. Has the highest requirements for stability, reliability, and monitoring.
2. **Branching Strategy (GitHub Flow):** A system for managing code changes using Git. We'll use GitHub Flow as our example:
    * `main`: This branch always reflects the latest *production-ready* code. It should always be deployable.
    * `feature/your-feature-name`: Short-lived branches created from `main`. Developers work here. Once the work is complete, a Pull Request (PR) is opened to merge it back into `main`.
3. **CI/CD Pipeline (GitHub Actions):** Automated workflows triggered by Git events (like pushes or merges).
    * **CI (Continuous Integration):** Focuses on *integrating* code changes frequently and *verifying* them automatically. Runs checks (linters, unit tests) on feature branches/PRs. *Goal: catch errors early and ensure code quality before merge.*
    * **CD (Continuous Delivery/Deployment):** Focuses on *automatically releasing* verified code to environments. Triggered by merges to `main`. Deploys to Staging, runs further tests, and often includes a manual gate before deploying to Production. *Goal: make releases reliable, repeatable, and fast.*
4. **Deploying ML Pipelines vs. Deploying ML Models:** This is a key MLOps distinction (see the sketch after this list).
    * **ML Pipeline:** The automated workflow code (e.g., an Airflow DAG definition, a SageMaker Pipeline script, and the associated container images for its steps) that performs tasks like data ingestion, preprocessing, training, and evaluation. You *deploy the pipeline definition* when the *code or logic* of the workflow changes.
    * **ML Model:** The trained artifact (`.pkl`, `.pt`, `.onnx` file, etc.) produced by *running* the training pipeline. You *deploy the model artifact* to a serving endpoint when a *new, validated model version* is available (usually after a successful pipeline run).
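To make the distinction concrete, here is a minimal sketch of a pipeline *definition*: an Airflow DAG whose Python file is what the CD process deploys whenever the workflow logic changes. The DAG id, task names, and schedule are hypothetical, and the sketch assumes Airflow 2.x (2.4+ for the `schedule` argument); the model artifact that a training pipeline eventually produces is a separate object with its own deployment path.

```python
# dags/data_ingestion_pipeline.py -- a minimal, hypothetical sketch.
# Deploying *this file* (plus its container images) updates the pipeline;
# it does not, by itself, deploy any model.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_raw_data(**context):
    """Pull new raw records from the source (placeholder logic)."""
    print("Ingesting raw data...")


def extract_features(**context):
    """Feature extraction step -- the code being changed in this example."""
    print("Extracting features...")


with DAG(
    dag_id="data_ingestion_pipeline",  # hypothetical name
    schedule="@daily",                 # Airflow 2.4+ argument name
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["trending-now"],
) as dag:
    ingest = PythonOperator(task_id="ingest_raw_data", python_callable=ingest_raw_data)
    features = PythonOperator(task_id="extract_features", python_callable=extract_features)

    ingest >> features
```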
**Step-by-Step Workflow Example:**

Let's trace a change, like improving a feature extraction step in our "Trending Now" project:

* **Step 1: Start Development (Dev Environment & Feature Branch)**
    * **Action:** Developer needs to modify the feature extraction script (`feature_extractor.py`).
    * **Branching:** Create a new branch from `main`: `git checkout -b feature/improve-extraction`.
    * **Environment:** Developer works in their **Dev** environment (e.g., VS Code connected to a cloud workspace, or a local machine with Docker). They edit `feature_extractor.py`, write corresponding unit tests (`test_feature_extractor.py`), and test locally using sample data.
    * **Tools:** Git, Python, Pandas, Pytest, IDE.
* **Step 2: Commit & Push Code**
    * **Action:** Developer commits the changes: `git commit -am "Improve feature extraction logic"`.
    * **Action:** Developer pushes the feature branch to GitHub: `git push origin feature/improve-extraction`.
    * **Branching:** The changes exist only on the feature branch in the remote repository.
* **Step 3: Continuous Integration (CI Triggered by Push)**
    * **Action:** GitHub Actions detects the push to the `feature/improve-extraction` branch (or, more commonly, the opening of a PR from this branch – see Step 4).
    * **Pipeline:** The **CI pipeline** (defined in `.github/workflows/ci.yml`) runs automatically.
    * **Environment:** Runs on a temporary GitHub Actions runner (an isolated environment).
    * **CI Steps:**
        * Checks out the code from `feature/improve-extraction`.
        * Sets up the Python environment.
        * Installs dependencies (from `requirements.txt`).
        * Runs linters and formatters (e.g., `flake8`, `black`).
        * Runs unit tests (e.g., `pytest tests/unit/`).
        * (If applicable) Runs static code analysis or security scans.
        * (If applicable) Builds Docker images for modified components (though this may be deferred until merge).
    * **Outcome:** Reports Pass/Fail status.
* **Step 4: Code Review (Pull Request)**
    * **Action:** Developer opens a Pull Request (PR) on GitHub to merge `feature/improve-extraction` into `main`.
    * **Interaction:** The CI results from Step 3 are displayed on the PR. Team members review the code changes, suggest improvements, and eventually approve.
* **Step 5: Merge to `main` (Triggering CD to Staging)**
    * **Action:** The approved PR is merged into the `main` branch.
    * **Branching:** `main` now contains the updated feature extraction logic.
    * **Pipeline:** The merge triggers the **CD pipeline** (defined in `.github/workflows/cd-staging.yml`) targeting the **Staging** environment.
* **Step 6: Continuous Delivery to Staging**
    * **Action:** GitHub Actions runs the CD workflow for Staging.
    * **Environment:** Interacts with the **Staging** AWS account/resources managed by Terraform.
    * **CD Steps (Example):**
        * Checks out code from `main`.
        * Builds necessary artifacts (e.g., updates the Docker container for the data processing step in the Airflow DAG).
        * Packages the Airflow DAG definition.
        * Uses Terraform/AWS CLI to deploy the updated DAG and associated resources (like the new container image to ECR) to the **Staging** Airflow environment.
        * *(Optional but recommended)* Triggers a run of the *Data Ingestion Pipeline* in Staging using sample/staging data.
        * Runs **Integration Tests** (e.g., using `pytest tests/integration/`) against the Staging environment (Does the pipeline run end-to-end? Does the FastAPI endpoint respond correctly, if it was updated?).
        * Runs **Infrastructure Tests** (e.g., using `pytest` + `boto3`, or Terratest) to verify the Terraform deployment was successful. (A sketch of both test types appears after Step 8.)
    * **Outcome:** The updated Data Ingestion Pipeline (or potentially the Training Pipeline, if the change affected it) is running in Staging; test results are reported.
* **Step 7: Validation in Staging & Manual Approval**
    * **Action:** Automated tests (integration, E2E) pass in Staging.
    * **Action:** A QA engineer or ML engineer might manually trigger the pipeline in Staging and verify that the output data or features look correct. They check dashboards (Staging Grafana).
    * **Action:** A stakeholder (e.g., Lead ML Engineer, Product Manager) gives **Manual Approval** for Production deployment (e.g., via a GitHub Environments approval gate).
    * **Environment:** Review happens based on the **Staging** environment's behaviour.
* **Step 8: Continuous Deployment to Production**
    * **Action:** The manual approval triggers the **CD pipeline** for Production (defined in `.github/workflows/cd-prod.yml`).
    * **Environment:** Interacts with the **Production** AWS account/resources managed by Terraform.
    * **CD Steps (Example):**
        * Uses the *exact same* artifacts built and tested for Staging (avoids rebuilding).
        * Uses Terraform/AWS CLI to deploy the updated DAG/container to the **Production** Airflow environment.
        * *(Optional)* Performs automated smoke tests immediately after deployment.
        * *(Crucial)* Monitors Production (CloudWatch, Prometheus/Grafana) closely after deployment. Roll back manually or automatically if critical issues arise.
    * **Outcome:** The updated Data Ingestion Pipeline is now live in Production.
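To ground the Staging checks from Step 6, here is a minimal sketch of one integration test and one infrastructure test using `pytest`, `requests`, and `boto3`. The endpoint URL, environment variable names, and ECR repository name are hypothetical; the sketch assumes the CD job exports these values and that AWS credentials for the Staging account are available on the runner.

```python
# tests/integration/test_staging_smoke.py -- a hypothetical sketch, not the
# project's actual suite. Assumes pytest, requests, and boto3 are installed.
import os

import boto3
import requests

# Hypothetical configuration injected by the CD workflow.
STAGING_API_URL = os.environ.get("STAGING_API_URL", "https://staging.example.com")
ECR_REPOSITORY = os.environ.get("ECR_REPOSITORY", "trending-now/data-processing")
IMAGE_TAG = os.environ.get("IMAGE_TAG", "latest")


def test_staging_api_health():
    """Integration check: the Staging FastAPI service answers its health route."""
    resp = requests.get(f"{STAGING_API_URL}/health", timeout=10)
    assert resp.status_code == 200


def test_processing_image_exists_in_ecr():
    """Infrastructure check: the container image built by CD was pushed to Staging ECR."""
    ecr = boto3.client("ecr")
    response = ecr.describe_images(
        repositoryName=ECR_REPOSITORY,
        imageIds=[{"imageTag": IMAGE_TAG}],
    )
    assert len(response["imageDetails"]) == 1
```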
**How ML Model Deployment Fits In:**

Notice that the workflow above focused on deploying a change to the *code* of a pipeline (the feature extractor). Where does *model* deployment fit?

* **Model Training (CT):** The *Training Pipeline* runs periodically or on a trigger in its designated environment (often Prod or a dedicated Training env).
* **Model Validation (Offline):** As the *last step* of the Production Training Pipeline run, the newly trained model candidate is evaluated on a static holdout dataset (the test set) with known ground truth, measuring its *predictive performance* (accuracy, F1, precision, recall, AUC, etc.). It answers: "Is this model statistically good enough based on historical data?"
* **Model Registration:** If the model candidate passes offline validation, the pipeline registers it (e.g., as `version 1.1-candidate`) in the **Model Registry** (W&B).
* **Approval Gate 1 (Pre-Staging):** A process (manual, or semi-automated based on metric thresholds) approves the registered candidate model for deployment to the **Staging environment**.
* **Model Deployment CD Pipeline (Stage 1 - Deploy to Staging):** Triggered by the pre-staging approval (see the registry sketch after this list).
    * Fetches the specific model artifact (`version 1.1-candidate`) from the registry.
    * Deploys it to the **Staging Serving Environment** (e.g., the Staging FastAPI service on Staging App Runner).
* **Model Testing in Staging (Operational Validation):** Automated tests run against the Staging endpoint:
    * Basic health checks (is the endpoint responsive?).
    * Integration tests (does it work with other staging services?).
    * Load/performance tests (using Locust) to check latency and resource usage under simulated load (a minimal Locust sketch appears at the end of this appendix).
    * Consistency checks (optional).
* **Approval Gate 2 (Pre-Production):** Based on successful Staging tests (automated reports, plus possibly a manual review), the model (`version 1.1-candidate`) is approved for **Production deployment**. The model version in the registry might be updated (e.g., its tag changed from `candidate` to `approved-for-prod`).
* **Model Deployment CD Pipeline (Stage 2 - Deploy/Test in Production):** Triggered by the pre-production approval.
    * Deploys the *same validated model artifact* (`version 1.1-candidate`) to the **Production Serving Environment**, typically using a progressive rollout strategy (canary, shadow, or an A/B test variant).
    * Online experiments run, monitoring business metrics and operational health.
* **Full Production Rollout / Promotion:** If the online experiments are successful, the candidate model (`version 1.1`) is fully rolled out (becomes the new champion), replacing the previous production version. The Model Registry is updated to reflect its `production` status. If the experiments fail, the candidate is rolled back.
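The registry interactions in Stage 1 and at the promotion step can be scripted against the W&B API. The sketch below is a hypothetical CD helper, not the project's actual code: the entity, project, artifact name, and aliases are made up, and it assumes the `wandb` client is installed and `WANDB_API_KEY` is configured on the CD runner.

```python
# scripts/model_cd.py -- hypothetical sketch of the model CD pipeline's
# registry steps; names and aliases are assumptions, not the project's real ones.
import wandb

MODEL_ARTIFACT = "my-team/trending-now/genre-classifier"  # hypothetical path


def fetch_candidate(alias: str = "candidate") -> str:
    """Stage 1: pull the exact model version that passed offline validation."""
    api = wandb.Api()
    artifact = api.artifact(f"{MODEL_ARTIFACT}:{alias}", type="model")
    local_dir = artifact.download()  # files are then packaged into the serving image
    print(f"Fetched {artifact.name} ({artifact.version}) -> {local_dir}")
    return local_dir


def promote_candidate(new_alias: str = "approved-for-prod") -> None:
    """Run only after Approval Gate 2: re-tag the same artifact for the Production deploy."""
    api = wandb.Api()
    artifact = api.artifact(f"{MODEL_ARTIFACT}:candidate", type="model")
    artifact.aliases.append(new_alias)
    artifact.save()


if __name__ == "__main__":
    fetch_candidate()
```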
**Integrated MLOps Workflow: Code, Pipeline, and Model Deployments**

**Conclusion:**

This step-by-step flow illustrates how environments, branching, CI/CD, pipeline deployments, and model deployments work together. Developers work on features in isolated **Dev** environments and branches. **CI** validates code quality automatically before merging. **CD** automates the release of verified code (including pipeline definitions) through **Staging** (for integration testing) and finally to **Production**. Continuous Training pipelines run *within* an environment (often Prod) to produce new *models*, which are then deployed to serving endpoints via their own (often simpler) deployment process triggered by the model registry. This structured approach ensures changes are tested incrementally, deployments are reliable, and the distinct lifecycles of code, pipelines, and models are managed effectively.
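Finally, as a concrete companion to the operational validation step above, here is a minimal Locust sketch of the kind of load test run against the Staging endpoint. The route, payload, and host are hypothetical; it assumes Locust is installed and is launched with something like `locust -f locustfile.py --host https://<staging-endpoint>`.

```python
# locustfile.py -- a hypothetical load-test sketch for the Staging model endpoint.
from locust import HttpUser, task, between


class PredictionUser(HttpUser):
    """Simulated client exercising the model's prediction route."""

    wait_time = between(0.5, 2)  # seconds between requests per simulated user

    @task
    def predict(self):
        # Route and payload shape are assumptions, not the project's real schema.
        self.client.post(
            "/predict",
            json={"title": "Example Show", "description": "A placeholder synopsis."},
        )
```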