ML Experiment Tracking, Data Lineage, Model Registry¶
I. ML Experiment Tracking: The Foundation of Iterative Development¶
(Sources: Neptune “ML Experiment Tracking”, Neptune “ML Experiment Management”, Google MLOps Guide Fig 4)
What is ML Experiment Tracking?
Definition: The systematic process of saving all relevant information (metadata) associated with each machine learning experiment run.
Goal: To enable reproducibility, comparison, debugging, collaboration, and informed decision-making throughout the model development lifecycle.
Why Does It Matter? The MLOps Lead’s Perspective:
Organization & Discoverability: Centralizes scattered experiment results, regardless of where they were run (local, cloud, notebooks). Prevents “lost” work and tribal knowledge.
Reproducibility: Enables re-running experiments by capturing code, data, environment, and parameters. Critical for debugging and validation.
Efficient Comparison & Analysis: Allows side-by-side comparison of metrics, parameters, learning curves, visualizations, and artifacts. Speeds up identification of what works and what doesn’t.
Collaboration & Knowledge Sharing: Provides a single source of truth for the team. Facilitates easy sharing of results and progress with stakeholders via persistent links or dashboards.
Live Monitoring & Resource Management: Allows real-time tracking of running experiments, early stopping of unpromising runs, and monitoring hardware consumption for efficiency.
Debugging: Helps pinpoint issues by comparing a failed run to a successful one, looking at code diffs, environment changes, or data shifts.
What to Track: The MLOps Lead’s Checklist (Neptune “ML Experiment Tracking”, Chip Huyen Ch. 6)
Core Essentials (Must-Haves):
Code Versions: Git commit hashes, script snapshots (especially for uncommitted changes or notebooks). Tools like nbdime and jupytext help diff and version notebooks.
Data Versions: Hashes of datasets/data pointers (e.g., MD5, DVC-tracked files). Crucial to link model performance to the exact data used. (See Section II: Data Lineage & Provenance.)
Hyperparameters: All parameters influencing the experiment (learning rate, batch size, architecture details, feature engineering steps). Log explicitly, avoid “magic numbers”. Config files (YAML, Hydra) are good practice.
Environment: Dependencies (e.g., requirements.txt, conda.yml, Dockerfile). Ensures a consistent runtime.
Evaluation Metrics: Key performance indicators on training, validation, and (sparingly) test sets. Log multiple relevant metrics.
Highly Recommended:
Model Artifacts: Serialized model weights/checkpoints (e.g., .h5, .pth, .pkl), especially the best performing ones.
Learning Curves: Metrics over epochs/steps for both training and validation sets.
Performance Visualizations: Confusion matrices, ROC/PR curves, prediction distributions.
Run Logs: Standard output/error streams.
Hardware Consumption: CPU/GPU/memory usage during training.
Experiment Notes/Tags: Qualitative observations, hypotheses being tested.
Advanced/Context-Specific:
Feature Importance/Explanations: SHAP values, LIME outputs, attention maps.
Sample Predictions: Examples of good/bad predictions, especially for vision or NLP tasks.
Gradient Norms/Weight Distributions: For deep learning debugging.
For LLMs: Prompts, chain configurations, specific metrics (ROUGE, BLEU), inference time.
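A minimal sketch of what logging these essentials can look like in code, using MLflow's tracking API purely as an illustration (Neptune and other trackers expose analogous calls); the experiment name, hyperparameters, hash, and file paths below are placeholders:

```python
import subprocess

import mlflow

# Placeholder hyperparameters; in practice load them from a config file (YAML/Hydra).
params = {"learning_rate": 0.001, "batch_size": 64, "architecture": "resnet18"}

mlflow.set_experiment("churn-prediction")  # hypothetical experiment name
with mlflow.start_run():
    # Code version: current git commit hash
    git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    mlflow.set_tag("git_commit", git_sha)

    # Data version: e.g., an MD5 digest or DVC pointer computed elsewhere (placeholder value)
    mlflow.set_tag("train_data_md5", "d41d8cd98f00b204e9800998ecf8427e")

    # Hyperparameters and environment
    mlflow.log_params(params)
    mlflow.log_artifact("requirements.txt")  # environment snapshot

    # Metrics over epochs (learning curves); values here are dummies
    for epoch, (train_loss, val_auc) in enumerate([(0.52, 0.81), (0.41, 0.86)]):
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_auc", val_auc, step=epoch)

    # Model artifact (best checkpoint), hypothetical path
    mlflow.log_artifact("checkpoints/best_model.pth")
```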
Setting Up Experiment Tracking: Build vs. Buy vs. Self-Host (Neptune “ML Experiment Tracking”)
| Approach | Pros | Cons | MLOps Lead Considerations |
|---|---|---|---|
| Spreadsheets/Naming Conventions | Simple to start. | Error-prone, not scalable, hard to collaborate, no live tracking, poor for complex metadata. | Strongly discourage for any serious project. Only for very small, solo, short-term explorations. |
| Git for Metadata Files | Leverages existing VCS skills. | Not designed for ML artifacts, poor comparison for >2 runs, difficult organization for many experiments. | Better than spreadsheets but quickly hits limitations for ML-specific needs. |
| Build Your Own Tracker | Full control, tailored to specific needs. | High development & maintenance effort, risk of reinventing the wheel, requires diverse engineering skills. | Only if existing tools are truly insufficient AND significant engineering resources are available. Often a distraction from core ML work. |
| Self-Host Open Source Tool | No vendor lock-in, data stays on-premise, customizable. | Maintenance overhead (infra, updates, security), may lack dedicated support. | Suitable if strict data residency is a must or high customization is needed, and the team has infra/ops capabilities. Assess community support. |
| SaaS Experiment Tracker | Fully managed, scalable, expert support, rich features, rapid iteration. | Vendor dependency, data on third-party cloud (usually with strong security/compliance). | Often the most efficient for teams wanting to focus on ML. Evaluate based on features, integrations, security, pricing, and support. Examples: Neptune.ai. |
Key for MLOps Lead: Champion the adoption of a dedicated experiment tracking tool. The productivity gains and risk reduction far outweigh the effort of manual methods.
II. Data Lineage & Provenance: Understanding the “Story Behind the Data”¶
(Source: Neptune “Data Lineage in ML”)
Definitions:
Data Lineage: Tracks data’s journey from origin to consumption, including transformations and processes it underwent. Focuses on metadata (data about the data).
Data Provenance: Broader than lineage. Includes lineage but also tracks systems and processes that influence the data.
Why It’s Crucial for MLOps:
Reproducibility: Essential for reproducing models and debugging. If input data or its processing changes, the model outcome will change.
Impact Analysis: Understand how changes in upstream data sources or processing steps affect downstream models and business outcomes.
Debugging: Trace back data issues (e.g., data quality degradation, schema changes) to their source.
Governance & Compliance: Provides audit trails for data usage, transformations, and model training, critical for regulated industries.
Data Quality & Trust: Helps ensure data integrity by understanding its origins and transformations.
Efficiency: Prevents re-computation or re-engineering of data pipelines if lineage is clear.
Methods of Data Lineage Tracing:
Data Tagging: Relies on transformation tools consistently tagging data. Best for closed systems.
Self-Contained Lineage: Lineage within a controlled data environment (e.g., data lake, data warehouse).
Parsing: Analyzing code (SQL, Python) and transformation logic to infer lineage. Can be complex and language-dependent.
Pattern-Based Lineage: Infers lineage by observing data patterns. Technology-agnostic but can miss code-driven transformations.
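To make the parsing approach concrete, the toy sketch below infers upstream tables from a SQL query with a regular expression; production lineage parsers handle dialects, CTEs, subqueries, and Python code, which this deliberately does not:

```python
import re

def upstream_tables(sql: str) -> set[str]:
    """Naively infer upstream tables from FROM/JOIN clauses of a SQL query."""
    pattern = r"\b(?:FROM|JOIN)\s+([\w.]+)"
    return set(re.findall(pattern, sql, flags=re.IGNORECASE))

query = """
SELECT c.id, SUM(o.amount) AS total
FROM analytics.customers c
JOIN analytics.orders o ON o.customer_id = c.id
GROUP BY c.id
"""
print(sorted(upstream_tables(query)))  # ['analytics.customers', 'analytics.orders']
```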
Data Lineage Across the ML Pipeline:
Data Gathering: Track source systems, ingestion methods, initial validation.
Data Processing: Log all transformations, filters, feature engineering steps, versions of scripts.
Data Storing & Access: Track storage locations, access permissions, data versions.
Data Querying: For training data generation, log the queries and data snapshots used.
Best Practices for Data Lineage (from an MLOps perspective):
Automation: Manual lineage tracking is not scalable. Leverage tools that automatically capture lineage or integrate with lineage systems.
Granularity: Track lineage at a level that is useful for debugging and reproducibility (e.g., dataset version, feature transformation script version).
Integration with Experiment Tracking: Link experiment runs to specific versions of datasets and preprocessing code.
Integration with Feature Stores: Feature stores inherently manage lineage for features.
Metadata Validation: Ensure captured lineage information is accurate and complete.
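A lightweight way to implement the experiment-tracking integration above is to hash the training dataset and attach the digest to the run; a sketch assuming MLflow and a hypothetical file path:

```python
import hashlib

import mlflow

def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute an MD5 digest of a file in chunks to avoid loading it fully into memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

with mlflow.start_run():
    dataset_path = "data/processed/train_v3.2.parquet"  # hypothetical path
    mlflow.set_tag("train_data_path", dataset_path)
    mlflow.set_tag("train_data_md5", file_md5(dataset_path))
```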
Tools for Data Lineage (often overlapping with data cataloging or broader data management):
Talend Data Catalog, IBM DataStage, Datameer
Open-source: Apache Atlas, OpenLineage, Marquez
Experiment trackers (like Neptune.ai) can capture crucial parts of data lineage by versioning data inputs (hashes, paths) and code.
DVC (Data Version Control): While primarily for data versioning, DVC pipelines (dvc.yaml) implicitly define data lineage for each stage.
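As a rough illustration of consuming that lineage programmatically, the sketch below uses DVC's Python API to read a dataset pinned to a specific Git revision (the path and tag are hypothetical), so a training run can record exactly which data version it used:

```python
import dvc.api

# Resolve the remote storage location of a DVC-tracked dataset at a given Git revision.
data_url = dvc.api.get_url("data/processed/train.csv", rev="v3.2")  # hypothetical path/tag
print("Remote storage location:", data_url)

# Stream the pinned version of the file without checking out the whole revision.
with dvc.api.open("data/processed/train.csv", rev="v3.2", mode="r") as f:
    header = f.readline()
    print("First line of the pinned dataset:", header.strip())
```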
```mermaid
graph LR
  subgraph "Data Sources"
    DS1[Source DB]
    DS2[API Feed]
    DS3[File Uploads]
  end
  subgraph "Ingestion & Staging"
    I[Ingest Process]
    S[Staging Area/Lake]
  end
  subgraph "Transformation & Feature Engineering"
    T1[Preprocessing Script V1]
    T2[Feature Engineering Script V2.1]
    FS[Feature Store]
  end
  subgraph "Model Training & Experimentation"
    D_Train[Training Dataset V3.2]
    M_Exp[Experiment Run ID: exp_abc]
    M_Art[Model Artifact: model_v1.2.pkl]
  end
  subgraph "Deployment & Serving"
    Dep[Deployed Model V1.2]
    Pred[Predictions]
  end

  DS1 -- Ingested_by --> I
  DS2 -- Ingested_by --> I
  DS3 -- Ingested_by --> I
  I -- Loads_to --> S
  S -- Input_for --> T1
  T1 -- Output_to --> S_Processed[Processed Data V1]
  S_Processed -- Input_for --> T2
  T2 -- Populates --> FS
  FS -- Source_for --> D_Train
  D_Train -- Used_in --> M_Exp
  M_Exp -- Produces --> M_Art
  M_Art -- Registered_and_Deployed_as --> Dep
  Dep -- Generates --> Pred

  classDef data fill:lightblue,stroke:#333,stroke-width:2px;
  classDef process fill:lightgreen,stroke:#333,stroke-width:2px;
  classDef artifact fill:lightyellow,stroke:#333,stroke-width:2px;
  class DS1,DS2,DS3,S,S_Processed,D_Train,FS data;
  class I,T1,T2,M_Exp,Dep process;
  class M_Art,Pred artifact;
```
III. ML Model Registry: Centralized Governance and Lifecycle Management¶
(Sources: Neptune “ML Model Registry”, Google MLOps Guide Fig 4, Practitioners Guide to MLOps)
What is a Model Registry?
Definition: A centralized system for storing, versioning, managing, and governing trained machine learning models and their associated metadata throughout their lifecycle (from development to production and retirement).
Distinction from Model Repository/Store: A repository might just store model files. A registry adds lifecycle management, versioning, metadata, and governance. A model store is a broader concept, potentially including a registry.
Why a Model Registry is Essential for MLOps:
Centralized Storage & Discoverability: Single source of truth for all trained models, making them easy to find, audit, and reuse.
Version Control for Models: Tracks different versions of a model, allowing rollback and comparison. Essential as models are retrained or improved.
Standardized Hand-off: Bridges the gap between data science (experimentation) and MLOps/engineering (deployment). Provides a clear point for promoting models.
Governance & Compliance: Facilitates review, approval, and auditing of models before deployment. Stores documentation (e.g., model cards) and evidence of validation.
Automation & CI/CD/CT Integration: Enables automated pipelines to register new model versions, trigger deployment workflows, and manage model stages (e.g., staging, production, archived).
Improved Security: Can manage access controls for models, especially those trained on sensitive data.
Key Features and Functionalities of a Model Registry: (Practitioners Guide to MLOps, Neptune “ML Model Registry”)
Model Registration: Ability to “publish” a trained model from an experiment tracking system or training pipeline.
Model Versioning: Automatically assigns and tracks versions for each registered model.
Metadata Storage: Stores comprehensive metadata:
Link to the experiment run that produced it (lineage to code, data, params).
Evaluation metrics (offline and online).
Model artifacts (weights, serialized model file).
Runtime dependencies (e.g., library versions).
Model documentation (model cards, intended use, limitations).
Owner, creation date, stage.
Model Staging & Transitions: Defines and manages model lifecycle stages (e.g., “Development”, “Staging”, “Production”, “Archived”). Supports workflows for promoting/demoting models.
API Access: Programmatic interface for CI/CD systems, monitoring tools, and serving platforms to interact with the registry.
UI for Management: A web interface for browsing, searching, comparing, and managing models and their versions.
Annotation & Tagging: Ability to add custom tags and descriptions.
Access Control: Manages permissions for who can register, approve, or deploy models.
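A hedged sketch of registration and metadata attachment against the MLflow Model Registry (other registries such as Neptune, Vertex AI, or SageMaker expose analogous APIs); the run ID, model name, and metric value are placeholders:

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the model artifact logged under a finished run as a new version.
run_id = "abc123"                        # placeholder run ID from the experiment tracker
model_uri = f"runs:/{run_id}/model"
result = mlflow.register_model(model_uri, name="churn-classifier")

client = MlflowClient()
# Attach metadata to the new version; lineage back to the run is kept by the registry.
client.set_model_version_tag("churn-classifier", result.version, "val_auc", "0.91")
client.update_model_version(
    name="churn-classifier",
    version=result.version,
    description="Trained on dataset v3.2; see the linked run for params and metrics.",
)
```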
Model Registry in the MLOps Workflow: (Google MLOps Levels, Practitioners Guide to MLOps Fig 3, 15)
```mermaid
graph TD
  A[Experimentation & Training Pipeline] -->|Trained Model & Metadata| B(Model Registry);
  B -- Stage: Staging --> C{Validation & QA};
  C -- Approved --> D[CI/CD for Deployment];
  D -- Deploy --> E(Production Serving Environment);
  E -- Feedback/Metrics --> F(Model Monitoring);
  F -- Performance Degradation --> A;
  B -- Discover/Fetch Model --> E;

  subgraph "Model Lifecycle Stages within Registry"
    direction LR
    Dev[Development Models]
    Staging[Staging Models]
    Prod[Production Models]
    Arch[Archived Models]
    Dev --> Staging;
    Staging --> Prod;
    Prod --> Arch;
  end
```
MLOps Level 0 (Manual): Data scientist manually registers the model. Ops team manually pulls for deployment.
MLOps Level 1 (ML Pipeline Automation): Automated training pipeline registers validated models. Deployment might still be manual or semi-automated.
MLOps Level 2 (CI/CD Pipeline Automation): Fully automated CI/CD pipeline interacts with the registry to manage model promotion and deployment based on triggers and approvals.
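At Levels 1 and 2 the promotion step itself is typically scripted. A minimal sketch against the MLflow registry with a hypothetical model name; newer MLflow releases favor version aliases over stages, so treat this as illustrative of the pattern rather than the only API:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
model_name = "churn-classifier"  # hypothetical registered model

# Fetch the latest version currently sitting in Staging (assumes one exists)...
candidate = client.get_latest_versions(model_name, stages=["Staging"])[0]

# ...and, once validation checks pass, promote it to Production.
client.transition_model_version_stage(
    name=model_name,
    version=candidate.version,
    stage="Production",
    archive_existing_versions=True,  # demote the previous Production version
)
```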
Build vs. Maintain vs. Buy for Model Registry: (Similar considerations as Experiment Tracking)
Building: Very complex due to diverse needs (storage, API, UI, versioning, workflow). Rarely justifiable.
Maintaining Open Source (e.g., MLflow Model Registry): Offers good features. Requires infra setup, maintenance, and expertise. Good for teams wanting control and having ops capabilities.
Buying/SaaS (e.g., Verta.ai, Neptune.ai, cloud provider registries like Vertex AI Model Registry, SageMaker Model Registry):
Pros: Fully managed, feature-rich, vendor support, faster to get started.
Cons: Potential vendor lock-in, cost, data residency concerns for some.
MLOps Lead’s Role: Evaluate based on team size, MLOps maturity, existing stack, budget, and specific governance/compliance needs. Integration with existing experiment tracking and deployment tools is key.
IV. Connecting the Dots: The MLOps Lead’s Unified View¶
Experiment tracking, data lineage, and model registries are not isolated components but interconnected pillars of a mature MLOps ecosystem.
Experiment Tracking feeds into the Model Registry: Successful experiments yield candidate models that are registered. The metadata logged during tracking (parameters, data versions, code versions, metrics) becomes crucial for the registry entry, providing lineage and context.
Data Lineage underpins both: Knowing what data (and its transformations) went into an experiment run (tracked) and thus into a registered model is fundamental for reproducibility, debugging, and governance.
Model Registry enables Deployment and Monitoring: It provides a stable, versioned source for deployment systems. Monitoring systems feed performance metrics back to the registry, informing decisions about retraining or rollback.
MLOps Lead’s Strategic Decisions Framework:
Define the “What” and “Why” for Tracking:
What metadata is essential for your team to reproduce, debug, and compare experiments effectively? (Start with the core list, expand as needed).
Why is this specific piece of metadata important for your project’s goals (e.g., compliance, debugging speed, performance improvement)?
Establish Data Handling Protocols:
How will data versions be managed and linked to experiments? (DVC, S3 versioning + hashes, feature store).
How will data lineage be captured or inferred for key datasets used in model training?
Design the Model Lifecycle Flow:
What are the stages a model goes through from experiment to production (and potentially archive)?
Who is responsible for approvals at each stage? What are the criteria?
How will models be promoted through these stages (manual, semi-automated, fully automated via CI/CD)?
Tooling Selection - Holistic View:
Does your chosen experiment tracker integrate well or offer model registry capabilities?
Does your model registry integrate with your deployment and monitoring tools?
Does your data infrastructure support adequate lineage tracking?
Consider the overall MLOps stack and aim for seamless integration rather than siloed tools.
graph LR subgraph "Development & Experimentation Phase" A[Ideation/Hypothesis] --> B(Code Versioning - Git); B --> C(Data Preparation & Versioning - DVC/Feature Store); C --> D(Hyperparameter Configuration - YAML/Hydra); D --> E(Environment Setup - Docker/Conda); E --> F[ML Experiment Run]; F --> G[Experiment Tracking System - Neptune/MLflow]; G -- Log --> H(Code Hash); G -- Log --> I(Data Hash/Path); G -- Log --> J(Parameters); G -- Log --> K(Environment Config); G -- Log --> L(Metrics); G -- Log --> M(Model Artifacts/Checkpoints); G -- Log --> N(Visualizations); end subgraph "Model Governance & Lifecycle Management" O[Model Registry - Neptune/MLflow/Vertex AI]; M -->|Register Model| O; L -- Link to Model Version --> O; I -- Link to Model Version --> O; H -- Link to Model Version --> O; J -- Link to Model Version --> O; O -- Model Stages --> P{Staging}; P -- Validation/QA --> Q{Production}; Q -- Trigger Retraining/Rollback --> F; Q -- Serve for Inference --> R[Deployment Platform]; end subgraph "Data Lineage" S[Source Data Systems] --> T(ETL/Data Pipelines); T -- Transformation Logic --> U(Processed Data for Training); U --> C; classDef MLOpsTool fill:#f9f,stroke:#333,stroke-width:2px; class G,O MLOpsTool; end R --> V(Application / End Users); V -- Feedback / New Data --> S;