Model Development: Lessons from Production Systems
Document Purpose: This guide distills practical wisdom from leading tech companies on developing, deploying, and maintaining machine learning models in production. It’s designed for experienced Lead MLOps Engineers to refine their thinking frameworks and decision-making processes.
Core Philosophy: Production ML is an iterative, data-driven, and business-aligned engineering discipline. It’s more than just training a model; it’s about building robust, scalable, and maintainable systems that deliver continuous value.
I. The MLOps Mindset for Model Development
Business Value First:
Why: ML models exist to solve business problems or create new opportunities. (DoorDash, Booking.com, Airbnb LTV)
How:
Clearly define the problem and success metrics with business stakeholders.
Understand the “Total Addressable Market (TAM)” for the model’s impact. (DoorDash)
Offline model performance (e.g., AUC, R-squared) is a health check, not a guarantee of business impact. (Booking.com, DoorDash)
Translate model improvements to tangible business KPIs (e.g., revenue, efficiency, user experience).
Iterative & Hypothesis-Driven Approach:
Why: Large-scale problems are complex; perfect solutions rarely emerge on the first try. (Booking.com, Walmart, GitHub)
How:
Start simple (heuristic or basic model) to establish a baseline and understand the problem space. (Twitter, DoorDash)
Formulate clear hypotheses for model improvements or feature additions.
Prioritize iterations based on potential impact and effort.
“Prototype -> Scale -> Revisit & Improve” (Airbnb Categories)
Embrace the Full Lifecycle:
Why: A model isn’t “done” when deployed. Monitoring, retraining, and adaptation are crucial. (Instagram, DoorDash)
How: Design for observability, easy retraining, and quick rollback.
Humans are Part of the System:
Why: Purely automated systems often struggle with nuance, cold starts, and evolving data. (Airbnb Categories, DoorDash Item Tagging, Walmart)
How: Strategically integrate human-in-the-loop (HITL) for labeling, validation, exception handling, and rule creation. Treat analysts, crowd workers, and developers as “first-class citizens” in the system. (Walmart - Chimera)
II. Core Pillars of Production ML Model Development
A. Data: The Unyielding Foundation
Data Collection & Understanding:
Challenge: Data can be sparse, noisy, imbalanced, non-stationary, and come from diverse sources. (Instagram, DoorDash Prep Time, Facebook)
Best Practices:
Cataloging: Maintain a metadata catalog for data assets. (Facebook)
Signal Exploration: Use multiple data signals (e.g., text, image, user behavior, persisted vs. online data). (Facebook, Walmart, Airbnb)
Understand Censorship & Bias: Be aware of issues like left-censored data or selection bias and their impact. (DoorDash Prep Time)
Data Coverage: Monitor feature coverage; have fallback strategies for low coverage. (Instagram)
Data Quality & Preprocessing:
Challenge: “Garbage in, garbage out.” OCR errors, inconsistent formats, PII. (DoorDash Menu, Airbnb Messages)
Best Practices:
Denoising/Cleaning: Implement robust preprocessing pipelines (e.g., removing templates, special characters, PII scrubbing). (GitHub, Walmart, Airbnb Messages)
Handling Missing Data:
Impute clinically normal values. (ICLR LSTM for Healthcare)
Introduce indicator variables for missingness; see the sketch after this list. (ICLR LSTM for Healthcare)
Be wary of back-filling if it leaks future information. (ICLR LSTM for Healthcare)
Data Augmentation: Crucial for limited positive training sets, especially for deep learning. (GitHub, DoorDash Item Tagging)
Resampling/Windowing: Carefully consider how resampling (e.g., hourly for time series) affects data variability. (ICLR LSTM for Healthcare)
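To ground the missing-data items above, here is a minimal pandas sketch (column names and "normal" values are hypothetical) that imputes a fixed clinically normal value and adds a missingness indicator per feature:

```python
import pandas as pd

def impute_with_indicators(df: pd.DataFrame, normal_values: dict) -> pd.DataFrame:
    """Impute fixed 'normal' values and add a per-column missingness flag."""
    out = df.copy()
    for col, normal in normal_values.items():
        out[f"{col}_missing"] = out[col].isna().astype(int)  # indicator variable
        out[col] = out[col].fillna(normal)                   # clinically normal default
    return out

# Hypothetical vitals with gaps; the 'normal' values are illustrative only.
vitals = pd.DataFrame({"heart_rate": [72.0, None, 88.0], "spo2": [0.98, 0.95, None]})
clean = impute_with_indicators(vitals, {"heart_rate": 75.0, "spo2": 0.98})
```

The indicator columns let the model learn that missingness itself can be informative, rather than hiding it behind an imputed value.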
Labeling & Ground Truth:
Challenge: High cost, scalability issues for many classes, label ambiguity, multi-intent, cold start. (DoorDash Item Tagging, Airbnb Messages, Airbnb Categories)
Strategies & Solutions:
Unsupervised to Supervised: Use unsupervised methods (e.g., LDA) to discover initial topics/intents, then refine with human labeling. (Airbnb Messages)
Taxonomy Design (for tagging/categorization):
Keep tags mutually exclusive if possible.
Partition by distinct attributes.
Include “other” classes for future expansion.
Make tags objective. (DoorDash Item Tagging)
Human-in-the-Loop (HITL) for Labeling:
Iterative Labeling: Pilot label small samples, refine definitions based on inter-rater agreement. (Airbnb Messages)
Active Learning: Use model uncertainty to select samples for annotation (see the sketch after this list). (DoorDash Item Tagging)
Self-Training: Use high-precision models to generate supplemental training samples. (DoorDash Item Tagging)
Crowdsourcing: Effective for evaluation and flagging, but monitor accuracy (97-98% typical, can dip). Be cautious adding crowd labels directly to training data. (Walmart - Chimera)
Internal Experts: Use for complex tasks, refining labels, and creating “golden datasets.” (Airbnb Messages, DoorDash Item Tagging)
Weak Supervision: Infer labels from heuristics (e.g., PRs from new contributors, small diffs for “good first issues”). (GitHub)
Handling Multi-Intent/Multi-Label: Assign specific intents to corresponding sentences or allow multiple tags. (Airbnb Messages)
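For the active-learning item above, least-confidence sampling is a common minimal starting point. This generic numpy sketch is not DoorDash's actual implementation:

```python
import numpy as np

def select_for_annotation(probs: np.ndarray, k: int) -> np.ndarray:
    """Uncertainty sampling: pick the k samples whose top predicted
    probability is closest to chance (least-confident strategy)."""
    confidence = probs.max(axis=1)        # top-class probability per sample
    return np.argsort(confidence)[:k]     # least confident first

# probs: (n_samples, n_classes) output of a (hopefully calibrated) classifier.
probs = np.array([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]])
print(select_for_annotation(probs, k=1))  # -> [1], the most uncertain sample
```

Calibration matters here (see the evaluation section below): if probabilities are miscalibrated, "uncertain" samples may not be the most informative ones.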
B. Feature Engineering: Crafting Predictive Power
Feature Ideation & Creation:
Sources: User behavior, item metadata (text, image, structured), geo-location, time, reviews, etc. (DoorDash Merchant Selection, Airbnb LTV)
Techniques:
Embeddings: For users, items, n-grams. Learn low-dimensional representations that aid generalization. (Instagram, DoorDash Item Tagging, Airbnb Categories)
Word2Vec, Deep Learning (multi-tower architectures).
Consider embeddings that capture the holistic product experience when individual surface models are myopic.
N-grams: N-grams over quantized dense features and categorical features, learned jointly with the model; useful for capturing higher-order interactions. (Instagram)
Time-based Features: Capture ramp-up time for new entities (e.g., merchants on platform). (DoorDash Merchant Selection)
Historical Features: Past N days of data. (Instagram)
Rule-Based/Heuristic Features: Based on domain knowledge. (Twitter Ad Requests - initial)
Hand-Engineered Features: Inspired by severity-of-illness scores in healthcare (mean, std dev, min/max, slope). (ICLR LSTM for Healthcare)
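As a concrete instance of the hand-engineered aggregates in the item above, a small pandas sketch (the entity/value schema is hypothetical) computing severity-score-style summary statistics per entity:

```python
import numpy as np
import pandas as pd

def summarize_series(g: pd.Series) -> pd.Series:
    """Severity-score-style aggregates over one entity's time series."""
    x = g.to_numpy(dtype=float)
    slope = np.polyfit(np.arange(len(x)), x, 1)[0] if len(x) > 1 else 0.0
    return pd.Series({"mean": x.mean(), "std": x.std(), "min": x.min(),
                      "max": x.max(), "slope": slope})

# Hypothetical long-format measurements: one row per (entity, timestep).
df = pd.DataFrame({"entity": ["a", "a", "a", "b", "b"],
                   "value": [1.0, 2.0, 4.0, 5.0, 3.0]})
features = df.groupby("entity")["value"].apply(summarize_series).unstack()
```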
Feature Management & Scalability:
Feature Stores: Centralized repository for high-quality, vetted, reusable features (e.g., DoorDash Feature Store, Airbnb Zipline).
Benefits: Scalability, fast iteration, consistency, avoids redundant work.
Functionality: Daily feature creation, intelligent key joins, backfilling.
Handling Sparsity: Choose models that handle sparse features well (e.g., XGBoost). (DoorDash Merchant Selection)
Feature Coverage Monitoring: Track how often features are available. (Instagram)
Feature Importance & Interpretability:
Why: Understand model behavior, debug, gain business insights, comply with regulations.
Techniques: Shapley Values (SHAP) show feature contribution and direction. (DoorDash Merchant Selection, DoorDash Menu)
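A minimal sketch of SHAP-based importance for a tree model, assuming the third-party shap and xgboost packages (the dataset here is synthetic):

```python
import numpy as np
import shap                      # pip install shap
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=8, random_state=0)
model = xgb.XGBRegressor(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)      # fast exact SHAP for tree ensembles
shap_values = explainer.shap_values(X)     # (n_samples, n_features) contributions
# Mean |SHAP| per feature ~ global importance; the sign of individual
# values gives the direction of each feature's contribution.
importance = np.abs(shap_values).mean(axis=0)
print(np.argsort(importance)[::-1])        # features ranked by impact
```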
C. Model Selection & Prototyping: Balancing Act
Problem Formulation:
Classification vs. Regression: Choose based on the desired output.
Point-wise vs. Pairwise vs. Listwise (Ranking): (Instagram)
Multilabel Classification: When items can belong to multiple categories. (ICLR LSTM for Healthcare)
Consider the “Uncanny Valley”: Highly accurate predictions can sometimes be unsettling to users if perceived as intrusive. (Booking.com)
Model Choice:
Start Simple: Linear models, tree-based models (Random Forests, XGBoost, LightGBM) are good starting points. (DoorDash, Twitter, GitHub, Google AdWords Churn)
XGBoost often chosen for its performance, ability to handle sparse data, and AFT loss functions for survival analysis. (DoorDash)
Deep Learning:
CNNs: Good for text classification when key phrases matter, fast training/inference. (Airbnb Messages, Walmart Product Categorization)
RNNs/LSTMs: For sequential data, varying length sequences, long-range dependencies. (ICLR LSTM for Healthcare, Walmart Product Categorization - CNN-LSTM, NAVER Item Categorization)
Feedforward Neural Networks: Simple ones can be effective for filtering/prioritization. (Twitter Ad Requests)
Multi-tower architectures: For learning embeddings. (Instagram)
Specialized Models:
Cox Proportional Hazard Models (for censored data, though often less accurate than ML). (DoorDash Prep Time)
Mixture of Experts / Sub-models: For heterogeneous populations (e.g., by country, by merchant segment). (Instagram, DoorDash Merchant Selection)
AutoML: Can speed up model selection and benchmarking. (Airbnb LTV)
Prototyping Frameworks:
Pipelines (Scikit-learn, Spark): Define blueprints for feature transformation and model training; ensures consistency between training and scoring (sketched below). (Airbnb LTV)
Jupyter Notebooks to Production Pipelines: (e.g., Airbnb’s ML Automator for Airflow). Lowers barrier for Data Scientists.
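A minimal scikit-learn pipeline blueprint as referenced above (column names are hypothetical); the same fitted object handles both training-time and scoring-time transformations, removing a common source of train/serve skew:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; the point is one blueprint for fit and score.
numeric = ["bookings_90d", "nights_per_booking"]
categorical = ["market", "device_type"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([("features", preprocess),
                  ("clf", RandomForestClassifier(n_estimators=200))])
# model.fit(X_train, y_train); model.predict(X_new) reuse identical transforms.
```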
Trade-offs:
Interpretability vs. Complexity/Accuracy: (Airbnb LTV, DoorDash Prep Time)
Sparse linear models: interpretable, less complex.
Tree-based (XGBoost): flexible, non-linear, less interpretable.
Deep Learning: highly flexible, often a black box.
Coverage vs. Accuracy (Precision/Recall): Especially for recommendations or filtering. (GitHub Good First Issues - aim for high precision)
Training/Inference Speed vs. Accuracy: Critical for online systems. (Instagram, Twitter)
Consider model size and latency early.
D. Training & Validation: Rigor and Reality
Training Strategy:
Data Splitting:
Train/Validation/Test.
For time-series data, split chronologically to avoid future leaks. (Mozilla Test Selection)
Separate across higher-level entities (e.g., repositories) if leakage is a concern. (GitHub)
Addressing Imbalance: Subsample negative set, weight loss function. (GitHub, DoorDash Item Tagging)
Loss Functions:
Log loss for multilabel. (ICLR LSTM for Healthcare)
Cross-entropy for standard classification.
Focal loss for imbalanced data (see the sketch after this list). (Walmart Semantic Labels)
Custom loss for censored data (e.g., XGBoost AFT). (DoorDash Prep Time)
KL Divergence for matching distributions (e.g., with soft targets). (Walmart Semantic Labels)
Regularization: L2 weight decay, dropout (especially for non-recurrent connections in RNNs). (ICLR LSTM for Healthcare)
Optimization: SGD with momentum, Adagrad (be mindful of learning rate decay). (ICLR LSTM for Healthcare, Instagram)
Target Replication (for sequence models): Provide local error signals at each step to aid learning long dependencies. (ICLR LSTM for Healthcare)
Auxiliary Output Training / Multi-Task Learning: Use additional related labels/tasks to improve generalization and reduce overfitting. (ICLR LSTM for Healthcare, DoorDash Item Tagging)
Curriculum Learning: Start with easier examples or “harden” soft targets over time. (Walmart Semantic Labels)
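For the focal-loss item above, a generic sketch in PyTorch (a standard binary formulation, not necessarily Walmart's exact variant):

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss: down-weights easy examples so errors on the rare class
    dominate the gradient."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()

logits = torch.tensor([2.0, -1.0, 0.1])
targets = torch.tensor([1.0, 0.0, 1.0])
loss = binary_focal_loss(logits, targets)
```

With gamma = 0 and alpha = 0.5 this reduces (up to scale) to ordinary cross-entropy; gamma controls how aggressively easy examples are discounted.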
Evaluation Metrics (Offline):
Standard Metrics: Precision, Recall, F1, AUC (micro/macro), R-squared, MSE.
Business-Specific Metrics:
Weighted Decile CPE, Decile Rank Score (DoorDash Merchant Selection): Tailored for sales lead allocation and ranking quality.
Precision@k (ICLR LSTM for Healthcare - P@10 for diagnoses): matches the clinically plausible workflow of surfacing a shortlist of likely diagnoses; see the sketch after this list.
Segment Performance: Evaluate not just aggregate but also on key segments (e.g., new vs. existing users, chain vs. local merchants). (DoorDash Merchant Selection)
Calibration: Important if probabilities are used for decision-making or active learning. (DoorDash Item Tagging)
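A tiny numpy sketch of the Precision@k metric referenced above:

```python
import numpy as np

def precision_at_k(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of the top-k scored items that are true positives."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(labels[top_k].mean())

scores = np.array([0.9, 0.2, 0.8, 0.4])
labels = np.array([1, 0, 0, 1])
print(precision_at_k(scores, labels, k=2))  # 0.5: one of the top 2 is relevant
```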
Validation & Health Checks:
Offline vs. Online Correlation: Offline gains don’t always translate to online/business impact. (Booking.com - “Offline model performance is just a health check”)
Backtesting: Replay past traffic with control/test treatments to assess model performance and ecosystem consequences. (Instagram)
Golden Datasets: Fixed evaluation sets for model comparison over time. (Instagram)
Response Distribution Analysis (RDA): For binary classifiers, analyze histogram of model outputs to detect pathologies (high bias, feature defects, concept drift) without true labels. (Booking.com)
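The RDA idea above can be approximated with a plain histogram comparison; the distance used here and its interpretation are assumptions, not Booking.com's published procedure:

```python
import numpy as np

def response_distribution(scores: np.ndarray, bins: int = 20):
    """Histogram of classifier outputs. Shape changes hint at pathologies
    (mass piling up near 0.5 -> uninformative features; a mode shifting
    over time -> possible concept drift) without needing true labels."""
    hist, edges = np.histogram(scores, bins=bins, range=(0.0, 1.0), density=True)
    return hist, edges

# Compare today's output distribution to a reference snapshot; a large
# distance is a prompt to investigate, not an automatic alarm.
ref, _ = response_distribution(np.random.beta(2, 5, 10_000))
cur, _ = response_distribution(np.random.beta(2, 2, 10_000))
shift = np.abs(ref - cur).sum()
```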
E. Productionization & Deployment: Crossing the Chasm
Infrastructure & Tooling:
ML Platforms (e.g., Airbnb Bighead, DoorDash internal infra on Databricks/MLflow, Facebook’s system):
Model management (registration, versioning, storage).
Automated training/retraining (weekly/monthly/daily/hourly).
Distributed scoring.
Monitoring dashboards.
Feature Stores: As mentioned, for consistent feature access.
Workflow Orchestration (e.g., Airflow, Argo): Schedule and manage data pipelines and model training/inference jobs. (Airbnb LTV, GitHub)
Deployment Strategies:
Online (Real-time) vs. Offline (Batch) Prediction:
Online: For low-latency requirements (e.g., ad serving, fraud detection). Needs lightweight models, efficient inference. (Twitter, Facebook, Instagram)
Offline/Recurring: Daily/hourly predictions. Less resource-intensive. (Instagram, DoorDash)
Shadow Mode/Shadow Schedulers: Run new models/logic alongside the current production system to evaluate performance before full rollout (see the sketch after this list). (Mozilla Test Selection)
A/B Testing (Randomized Controlled Trials - RCTs): Essential for measuring true business impact. (Booking.com, DoorDash Prep Time)
Switchback Tests: For network effects. (DoorDash Prep Time)
Triggered Analysis: Analyze only users exposed to a change, especially when eligibility criteria depend on model output. (Booking.com)
Controlling for Latency: Design experiments to disentangle model effect from performance degradation. (Booking.com)
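A sketch of the shadow-mode pattern flagged above (function and model names are illustrative): the candidate model is scored on live traffic and logged, but never affects the user-facing response:

```python
import logging
import time

log = logging.getLogger("shadow")

def serve(request, prod_model, shadow_model):
    """Serve the production model; score the candidate in shadow and
    log both outputs for offline comparison."""
    prod_pred = prod_model.predict(request)          # what the user actually sees
    try:
        t0 = time.monotonic()
        shadow_pred = shadow_model.predict(request)  # never affects the response
        log.info("shadow", extra={"prod": prod_pred, "shadow": shadow_pred,
                                  "latency_ms": 1000 * (time.monotonic() - t0)})
    except Exception:
        log.exception("shadow model failed")         # shadow failures must be silent
    return prod_pred
```

In practice the shadow call is usually made asynchronously so it cannot add latency to the serving path.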
Scalability & Efficiency:
Model Size & Latency: Critical constraints. Lightweight models for early pipeline stages (e.g., filtering out low-value requests). (Twitter Ad Requests - <3MB model)
Multi-stage Ranking/Filtering: Simpler model for first pass on many items, then complex model on top N. (Instagram)
Caching: Store frequent predictions. (Booking.com)
Bulking: Group multiple requests for a model into one. (Booking.com)
Precomputation: If feature space is small. (Booking.com)
F. Monitoring & Iteration: The Never-Ending Loop
Performance Monitoring:
Technical Metrics: Latency, error rates, resource usage.
Model Metrics (Offline): Track R-squared, ranking metrics, AUC over time on fixed golden sets and changing test sets. (DoorDash Merchant Selection, Instagram)
Model Metrics (Online):
Business KPIs (conversion, revenue, engagement).
Regression detection rate (Mozilla Test Selection).
Correlation between weights/predictions of model snapshots. (Instagram)
Data Drift & Concept Drift: Monitor input data distributions and model performance to detect when retraining is needed.
If important features change significantly, investigate training data. (DoorDash Merchant Selection)
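One common, simple drift signal is the Population Stability Index. This generic sketch (the thresholds are a rule of thumb, not from the sources above) compares a live feature sample against its training distribution:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference sample (e.g., the
    training distribution of one feature) and a live sample of the same
    feature. Assumes a roughly continuous feature."""
    edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

# Rule-of-thumb thresholds (an assumption; tune per feature):
# < 0.1 stable, 0.1-0.25 worth investigating, > 0.25 likely drift.
```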
Retraining & Model Updates:
Frequency: Daily, weekly, monthly, or continuous/online. (Twitter Ad Requests - hourly, Instagram - daily, DoorDash Merchant Selection - weekly/monthly)
Strategy:
Load past N days of data and continue training from previous snapshot. (Instagram)
Online learning: model fed real-time data. Guarantees freshness but high engineering cost. (Instagram)
Snapshot Vetting: Evaluate against fixed golden set and changing test set before publishing. (Instagram)
Feedback Loops:
Human Annotations: Feed corrected labels back into training data. (Airbnb Categories, DoorDash Item Tagging, Walmart - Chimera)
User Feedback: Implicit (e.g., clicks, conversions) and explicit (e.g., ratings, “not useful” flags).
Business Operations Feedback: Sales teams using lead scores provide valuable ground truth. (DoorDash Merchant Selection)
III. Addressing Common Real-World Challenges
Cold Start Problem:
Issue: Lack of historical data for new users, items, or features. (Instagram, DoorDash Item Tagging, Booking.com)
Solutions:
Fallback Models: Use simpler, high-coverage models if primary model lacks data. (Instagram)
Mixture Embeddings on User Clusters: Full coverage, good substitute for interaction history. (Instagram)
Human-in-the-Loop: For initial labeling and rule creation. (DoorDash Item Tagging, Airbnb Categories)
Content-Based Features: Rely on item/user attributes rather than interactions.
Augment Training Data: Include longer-tenured entities to increase sample size (but be mindful of distribution shifts). (DoorDash Merchant Selection)
Data Sparsity:
Issue: Many features have few non-zero values.
Solutions:
Choose models adept at handling sparsity (XGBoost).
Embeddings can densify representations.
Feature selection/engineering to reduce dimensionality.
Imbalanced Data:
Issue: Some classes/outcomes are much rarer than others. (GitHub Good First Issues, DoorDash Item Tagging)
Solutions:
Resampling: Oversample minority class, undersample majority class (carefully).
Weighted Loss Functions: Penalize errors on minority class more heavily (see the sketch after this list). (GitHub)
Focal Loss: (Walmart Semantic Labels)
Collect more data for rare classes (often via targeted HITL).
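For the weighted-loss option above, scikit-learn can derive balanced class weights directly; a minimal sketch:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 950 + [1] * 50)               # 5% positive class
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
# 'balanced' means n_samples / (n_classes * class_count): rare class weighs more.

clf = LogisticRegression(class_weight={0: weights[0], 1: weights[1]})
# Equivalent shortcut: LogisticRegression(class_weight="balanced")
```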
Scalability (Data & Compute):
Issue: Millions of items, users, high request volume.
Solutions:
Distributed training and inference.
Efficient data storage and retrieval (feature stores, data warehouses).
Model quantization/pruning for smaller, faster models.
Approximate nearest neighbor search (e.g., FAISS for embeddings). (Instagram)
Rule-based systems for initial filtering or handling high-volume, simple cases. (Twitter Ad Requests)
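For the approximate-nearest-neighbour item above, a minimal FAISS sketch (an exact inner-product index for illustration; the data is random). Normalising the vectors makes inner product equal cosine similarity:

```python
import faiss                      # pip install faiss-cpu
import numpy as np

d = 64                                             # embedding dimension
xb = np.random.rand(100_000, d).astype("float32")  # item embeddings (illustrative)
faiss.normalize_L2(xb)                             # unit norm -> IP == cosine

index = faiss.IndexFlatIP(d)                       # exact search; swap in an
index.add(xb)                                      # IVF/HNSW index at larger scale

xq = np.random.rand(5, d).astype("float32")
faiss.normalize_L2(xq)
scores, ids = index.search(xq, 10)                 # top-10 neighbours per query
```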
Noisy/Unreliable Labels:
Issue: Human errors, ambiguous definitions, evolving understanding. (Airbnb Messages, DoorDash Item Tagging)
Solutions:
Consensus Mechanisms: Multiple annotators for critical labels.
Quality Assurance: Reviewer agreement checks, golden datasets. (DoorDash Item Tagging)
“Not Sure” Option for Annotators: Provides insight into task difficulty. (DoorDash Item Tagging)
Model Robustness: Some models are more robust to label noise than others.
IV. Advanced Techniques & Strategies
Embeddings for Semantic Understanding:
Product Categorization: Use CLIP embeddings for category names to construct soft targets for label smoothing. (Walmart Semantic Labels)
Message Intent: Pre-train word embeddings on large domain-specific corpus. (Airbnb Messages)
Listing Similarity: Find similar listings via embedding cosine similarity for candidate expansion. (Airbnb Categories)
Semantic Label Representation / Label Smoothing:
Why: Traditional one-hot encoding penalizes all misclassifications equally. Semantic smoothing makes models less overconfident and errors less severe (e.g., mistaking “cowboy boots” for “rain boots” is better than for “t-shirts”). (Walmart Semantic Labels)
How:
Uniform smoothing: Distribute a small probability α to all other classes.
Similarity-based smoothing: Distribute α based on pre-computed semantic similarity (e.g., from embeddings) between classes.
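A small numpy sketch of the similarity-based variant (the similarity matrix is assumed precomputed, e.g. from class-name embeddings as in the Walmart work; this is a generic construction, not their exact recipe):

```python
import numpy as np

def semantic_soft_targets(labels: np.ndarray, sim: np.ndarray,
                          alpha: float = 0.1) -> np.ndarray:
    """Soft targets: keep 1 - alpha on the true class and spread alpha over
    the remaining classes in proportion to their semantic similarity to it.
    `sim` is a precomputed (n_classes, n_classes) nonnegative similarity
    matrix, e.g. cosine similarities of class-name embeddings."""
    s = sim.astype(float).copy()
    np.fill_diagonal(s, 0.0)                 # never redistribute to the truth
    s /= s.sum(axis=1, keepdims=True)        # normalise each row's leftover mass
    targets = alpha * s[labels]              # similarity-weighted smoothing
    targets[np.arange(len(labels)), labels] += 1.0 - alpha
    return targets                           # train with cross-entropy/KL on these
```

Training against these soft targets pairs naturally with the KL-divergence loss mentioned in the training section above.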
Dealing with Censored Data (e.g., Survival Analysis):
Problem: The true event time is not always observed (e.g., food prep time if Dasher arrives early). (DoorDash Prep Time)
Solution:
Two-stage Adjustment:
Identify “censorship severity” (e.g., observed arrival - prior estimate).
Adjust target value based on severity for censored records.
Train a standard regression model on adjusted data.
This was preferred over complex likelihood methods (Cox models, XGBoost AFT) for transparency and accuracy.
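A heavily simplified sketch of the two-stage idea (column names, the linear correction, and the beta coefficient are all assumptions; the source describes the approach, not exact formulas):

```python
import pandas as pd

def adjust_censored_targets(df: pd.DataFrame, beta: float = 0.5) -> pd.Series:
    """Two-stage adjustment sketch for censored regression targets."""
    # Stage 1: censorship severity, per the definition above
    # (observed arrival - prior estimate; negative when the Dasher is early).
    severity = df["observed_arrival"] - df["prior_estimate"]
    # Stage 2: push the target upward for censored rows, proportionally to
    # how early the arrival was; a standard regressor is then trained on
    # the adjusted target.
    adjustment = beta * (-severity).clip(lower=0) * df["is_censored"].astype(float)
    return df["observed_prep_time"] + adjustment
```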
Multi-Stage Architectures:
Ranking: Simple model for initial candidate generation, complex model for re-ranking top N. (Instagram)
Classification/Tagging: Rule-based pre-filtering, then ML models. (Walmart - Chimera)
Online Learning / Continuous Training:
Why: Adapt to non-stationary data, trends, seasonality. (Instagram)
Challenges: Model stability, computational cost, monitoring.
Bias Mitigation in Ranking:
Issue: Top-ranked items get more exposure and thus more interaction data, creating a feedback loop. (Instagram)
Solutions:
Add position/batch size as features in the last layer.
Co-train with a logistic regression using these bias features.
Inverse Propensity Weighting: Weight training examples by their position.
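A generic sketch of position-based inverse propensity weighting for the last item above (the propensity values are illustrative; in practice they come from a click model or small randomization experiments):

```python
import numpy as np

def position_ipw_weights(positions: np.ndarray, propensity: np.ndarray) -> np.ndarray:
    """Inverse propensity weights for position bias: examples shown at
    high-exposure positions are down-weighted so the model does not just
    relearn the existing ranking. propensity[p] is the estimated
    probability that a user examines position p."""
    w = 1.0 / propensity[positions]
    return w / w.mean()                  # normalise so the loss scale is stable

propensity = np.array([0.9, 0.6, 0.4, 0.25, 0.15])  # positions 0..4 (assumed)
weights = position_ipw_weights(np.array([0, 0, 3, 4]), propensity)
# Pass as sample_weight to the training loss.
```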
V. The Human Element in ML Systems
Human-in-the-Loop (HITL) is often Essential at Scale:
For Labeling/Annotation: Especially for cold start, rare classes, and continuous improvement. (DoorDash Item Tagging, Airbnb Categories, Walmart - Chimera)
For Evaluation & QA: Crowd workers can evaluate model outputs, flag errors. Analysts investigate flagged issues. (Walmart - Chimera)
For Rule Creation & Maintenance: Domain experts (analysts) create and refine rules to handle edge cases, new patterns, or complement ML. (Walmart - Chimera, DoorDash Item Tagging - guardrails)
For Triage & Prioritization: Humans can triage bugs or categorize items when ML is not confident. (Mozilla BugBug, Airbnb Categories)
Designing Effective HITL Tasks:
Clarity: Simple binary or multiple-choice questions are better for crowd workers. (DoorDash Item Tagging)
Efficiency: Break down complex tasks (e.g., using a taxonomy).
Quality Control: Consensus, golden sets, “not sure” option.
Collaboration between Data Scientists, Engineers, Analysts, and Business:
Shared Understanding: Crucial for defining problems and success.
Feedback Loops: Analysts provide insights to DS/ML, business provides impact metrics.
VI. Thinking Frameworks & Decision Tools
A. Decision Framework: Addressing Model Errors/Low Performance
graph TD
A[Model Performance Unsatisfactory] --> B{Is it a Data Problem?};
B -- Yes --> C[Investigate Data Quality/Quantity/Labels/Features];
C --> C1["Improve Labeling (HITL, Active Learning)"];
C --> C2["Feature Engineering (new sources, transformations)"];
C --> C3[Address Data Skew/Sparsity];
B -- No --> D{Is it a Model Problem?};
D -- Yes --> E[Revisit Model Choice/Architecture/Hyperparameters];
E --> E1[Try Simpler/More Complex Model];
E --> E2[Tune Hyperparameters];
E --> E3[Adjust Loss Function];
D -- No --> F{Is it an Evaluation Mismatch?};
F -- Yes --> G["Offline metrics don't reflect online impact?"];
G --> G1["Refine Online Experimentation (A/B test)"];
G --> G2[Develop Business-Aligned Offline Metrics];
F -- No --> H[Consider External Factors/Concept Drift];
H --> H1[Retrain Model More Frequently];
H --> H2[Implement Drift Detection];
C1 --> A;
C2 --> A;
C3 --> A;
E1 --> A;
E2 --> A;
E3 --> A;
G1 --> A;
G2 --> A;
H1 --> A;
H2 --> A;
B. Trade-off Analysis Table (Example)

| Decision Point | Option 1 (e.g., Simpler Model) | Option 2 (e.g., Complex Model) | Factors to Consider | Recommendation Context |
|---|---|---|---|---|
| Model Choice | Linear Regression | Deep Neural Network | Interpretability, data size, compute, dev time, accuracy | Start simple, iterate if baseline not met. |
| Labeling | Crowd Workers | Domain Experts | Cost, scale, nuance, quality | Crowd for broad tasks, experts for complex/critical ones. |
| Deployment | Batch | Real-time | Latency needs, freshness, infra cost | Real-time if immediate predictions critical, else batch. |
| Data Handling | Drop Censored Data | Model Censored Data | Bias, accuracy, model complexity, transparency | Model if possible & transparent, else careful heuristic. |
| Cold Start | Fallback Heuristic | Embeddings from Side Info | Accuracy, coverage, dev effort | Heuristic for MVP, embeddings for better long-term solution. |
| System Complexity | ML Only | ML + Rules + HITL | Accuracy, robustness, maintainability, cost | Hybrid for critical, large-scale, evolving systems. |
C. Questions for an MLOps Lead when starting a new Model Development Project:
Problem & Value:
What specific business problem are we solving? What’s the expected ROI?
How will success be measured (offline and online)? What are the target KPIs?
Who are the stakeholders? Are they aligned on the goals?
Data:
What data is available? What is its quality, quantity, and recency?
Are there known biases, censorship, or privacy concerns?
How will data be sourced, stored, and versioned?
What is the labeling strategy? Budget for labeling?
Modeling:
What’s the simplest baseline we can establish?
What are the latency, throughput, and resource constraints for inference?
How critical is interpretability?
What are the potential failure modes of the model?
Production & Operations:
How will the model be deployed (batch, streaming, service)?
What infrastructure is needed for training, serving, and monitoring?
What’s the retraining strategy and frequency?
How will we monitor for data drift, concept drift, and performance degradation?
What’s the rollback plan if issues occur?
Who is responsible for ongoing maintenance and operations?
Human Loop & Iteration:
Where can human expertise improve the system (labeling, QA, rule-making)?
How will feedback from users and the system be incorporated for continuous improvement?
VII. Conclusion: The Evolving Landscape
The development of ML models for production is a dynamic field. Lessons from these industry leaders highlight a shift from purely algorithmic solutions to holistic system design. Key takeaways include:
Data-centricity: High-quality, well-understood data and robust feature engineering are paramount.
Iterative Pragmatism: Start simple, validate often, and build complexity incrementally.
Hybrid Systems: Combining ML with rules and human oversight often yields the most robust and accurate solutions at scale.
Production-Awareness: Design for scalability, monitoring, and maintainability from day one.
Business Alignment: Continuously tie ML efforts back to tangible business value.
As an MLOps Lead, your role is to champion these principles, foster a culture of rigorous engineering and continuous learning, and build the platforms and processes that enable your teams to deliver impactful ML solutions efficiently and reliably. The journey is complex, but by leveraging these shared industry experiences, we can navigate it more effectively.