# LinkedIn DARWIN
**Introduction**
* **Context:** LinkedIn generates massive data, used by data scientists (DS) and AI engineers (AIE) for various products (job recommendations, personalized feed).
* **Problem:** Historically, DS/AIE used diverse tools for data interaction, EDA, experimentation, and visualization.
* **Solution:** DARWIN (Data Science and Artificial Intelligence Workbench at LinkedIn), a unified "one-stop" data science platform.
* **Scope of DARWIN:** Goes beyond Jupyter notebooks to support the entire DS/AIE workflow.
**Motivation for building a unified data science platform**
* **Pre-DARWIN Productivity Challenges:**
* **Developer Experience/Ease of Use:**
* Context switching across multiple tools.
* Difficult collaboration.
* **Fragmentation/Variation in Tooling:**
* Knowledge fragmentation.
* Lack of easy discoverability of prior work.
* Difficulty sharing results.
* Overhead in making local/varied tools compliant with privacy/security policies.
* **Target Personas:**
* Expert DS and AIEs.
* Data analysts, product managers, business analysts (citizen DS).
* Metrics developers (using LinkedIn's Unified Metrics Platform, UMP).
* Data developers.
* **Workflow Phases & Tools to Support:**
* **Data Exploration/Transformation:** Jupyter notebooks (expert DS), UI-based SQL tools like Alation/Aqua Data Studio (citizen DS, PMs, BAs), Excel.
* **Data Visualization/Evaluation:** Jupyter notebooks, ML libraries (GDMix, XGBoost, TensorFlow), Tableau, internal visualization tools.
* **Productionizing:** Scheduling flows (Azkaban), feature engineering/model deployment frameworks (Frame, Pro-ML), Git integration for code review/check-in.
**Building DARWIN, LinkedIn’s data science platform**
* **Key Requirements for DARWIN:**
1. **Hosted EDA Platform:** Single window for all data engines (analysis, visualization, model dev).
2. **Knowledge Repository & Collaboration:** Share/review work, discover others' work/datasets/insights, data catalog, tagging, versioning.
3. **Code Support:** IDE-like experience, multi-language support, direct Git commit.
4. **Governance, Trust, Safety, Compliance:** Secure, compliant access.
5. **Scheduling, Publishing, Distribution:** Schedule executable resources, generate/publish/distribute results.
6. **Integration:** Leverage and integrate with other ecosystem tools (ML pipelines, metric authoring, data catalog).
7. **Scalable & Performant Hosted Solution:** Horizontally scalable, resource/environment isolation, similar experience to local tools.
8. **Extensibility:** Support for different environments/libraries, multiple languages, various query engines/data sources, custom extensions/kernels, "Bring Your Own Application" (BYOA) for platform democratization.
* **Key Open Source Technologies Leveraged:** JupyterHub, Kubernetes, Docker.
* **High-Level Architecture Components:**
* **Platform Foundations:** Scale, extensibility, governance, concurrent user environment management.
* **DARWIN Resources:** Core concept for knowledge artifacts.
* **Metadata/Storage Isolation:** Enables evolution as a knowledge repository.
* **Access to Data Sources/Compute Engines:** Unified window.
**DARWIN: Unified window to data platforms**
* **Supported Query Engines/Languages:**
* Spark (Python, R, Scala, Spark SQL).
* Trino.
* MySQL.
* Pinot (coming soon).
* **Direct Data Access:** HDFS (useful for TensorFlow).
* **Objective:** Provide access to data irrespective of where it is stored; the sketch below shows what this looks like from a notebook.
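
To make the "unified window" concrete, here is a minimal sketch of reaching each engine from a notebook. The hostnames, catalogs, and table names are placeholders, and the specific clients (PySpark, the open source `trino` package, TensorFlow's HDFS support) are illustrative assumptions, not DARWIN's actual wiring.

```python
# Illustrative only: what "one window to every engine" can look like
# from a notebook. Hosts, catalogs, and tables are placeholders.
from pyspark.sql import SparkSession
import trino
import tensorflow as tf

# Spark (Python / Spark SQL) -- a session is typically pre-provisioned
spark = SparkSession.builder.appName("darwin-eda").getOrCreate()
spark.sql("SELECT title, COUNT(*) AS n FROM jobs GROUP BY title").show(10)

# Trino, via the open source DBAPI client
conn = trino.dbapi.connect(host="trino.example.internal", port=8080,
                           user="alice", catalog="hive", schema="default")
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM member_events")
print(cur.fetchall())

# Direct HDFS access, e.g. feeding TensorFlow training data
# (requires a TensorFlow build with HDFS filesystem support)
ds = tf.data.TFRecordDataset("hdfs://namenode/data/train/part-00000")
```
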
**DARWIN platform foundations**
* **Scale and Isolation using Kubernetes:**
* Achieves horizontal scalability.
* Provides dedicated, isolated environments for users.
* Supports long-running services and security features.
* Leverages off-the-shelf Kubernetes features to focus on DARWIN's differentiating aspects.
* **Extensibility through Docker images:**
* Used to launch user notebook containers on Kubernetes.
* Enables platform democratization: users/teams can extend/build on DARWIN.
* Isolates environments, allowing different libraries/applications.
* Supports "Bring Your Own Application" (BYOA): app developers package code, DARWIN handles scaling, SRE, compliance, discovery, sharing.
* **Partner Team Examples:**
* AIRP team's on-call dashboard (custom front-end).
* Greykite forecasting library support (input visualization, model configuration, cross-validation, and forecast visualization via Jupyter).
* **Mechanism:** Partner teams build custom Docker images on top of base DARWIN images, hosted in an independent Docker registry that acts as an app marketplace (configuration sketch below).
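
A plausible configuration sketch for the Docker/BYOA model, assuming JupyterHub's KubeSpawner: partner images built on a base DARWIN image are surfaced as selectable profiles, and per-user resource limits give the Kubernetes isolation described above. Image names and the registry URL are hypothetical.

```python
# jupyterhub_config.py -- a sketch assuming JupyterHub's KubeSpawner.
# Image names and the registry URL are hypothetical.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# Partner-built images (base DARWIN image + extras) surfaced as a menu:
c.KubeSpawner.profile_list = [
    {
        "display_name": "DARWIN base (Python, R, Scala, SQL)",
        "default": True,
        "kubespawner_override": {"image": "registry.example/darwin/base:latest"},
    },
    {
        "display_name": "Greykite forecasting",
        "kubespawner_override": {"image": "registry.example/darwin/greykite:latest"},
    },
]

# Per-user isolation: each server runs in its own pod with resource limits.
c.KubeSpawner.cpu_limit = 4
c.KubeSpawner.mem_limit = "16G"
```
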
* **Management of concurrent user environments using JupyterHub:**
* Highly customizable, serves multiple environments, pluggable authentication.
* Kubernetes spawner launches independent user servers on K8s (isolated environments).
* Integrates with LinkedIn authentication stack.
* Manages the user server lifecycle (culling inactive servers, explicit logout); see the sketch below.
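
Lifecycle management could look like the following sketch, which uses the open source `jupyterhub-idle-culler` service; the one-hour timeout is made up, and the post does not confirm this exact mechanism.

```python
# jupyterhub_config.py (continued) -- culling inactive servers with the
# open source jupyterhub-idle-culler; the timeout value is made up.
import sys

c.JupyterHub.services = [
    {
        "name": "idle-culler",
        "command": [sys.executable, "-m", "jupyterhub_idle_culler", "--timeout=3600"],
    }
]
# Grant the culler just enough permission to list and stop servers:
c.JupyterHub.load_roles = [
    {
        "name": "idle-culler",
        "services": ["idle-culler"],
        "scopes": ["list:users", "read:users:activity", "read:servers", "delete:servers"],
    }
]
```
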
* **Governance: Safety, trust, and compliance:**
* Audit trail for every operation (an illustrative sketch follows this list).
* Encrypted and securely stored execution results.
* Fine-grained access control for DARWIN resources.
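
The post does not describe how auditing is implemented; purely as illustration, a decorator like the following could record who performed which operation on which resource before any authorization check runs. Every name here is hypothetical.

```python
# Purely illustrative -- the post does not describe DARWIN's audit internals.
import functools, getpass, json, logging, time

audit_log = logging.getLogger("darwin.audit")

def audited(action):
    """Record who performed which action on which resource."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(resource_id, *args, **kwargs):
            audit_log.info(json.dumps({
                "ts": time.time(),
                "user": getpass.getuser(),
                "action": action,
                "resource": resource_id,
            }))
            return fn(resource_id, *args, **kwargs)
        return inner
    return wrap

@audited("share_with_results")
def share_resource(resource_id, grantee):
    # authorization check + actual share would happen here
    ...
```
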
*Platform diagram: see [DARWIN: Data Science and Artificial Intelligence Workbench at LinkedIn](https://www.linkedin.com/blog/engineering/developer-experience-productivity/darwin-data-science-and-artificial-intelligence-workbench-at-li)*
**DARWIN: A knowledge repository**
* **Vision:** One-stop place for all data-related knowledge (accessing, understanding, analyzing, referencing, reporting).
* **Modeling as Resources:**
* Every top-level knowledge artifact (notebooks, SQL workbooks, outputs, markdown, reports, projects) is a "resource."
* Resources can be linked hierarchically.
* Enables seamless addition of new resource types, with common operations (CRUD, storage, collaboration, search, versioning) provided generically; a sketch follows.
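
A sketch of what the "everything is a resource" model might look like; the field names are assumptions inferred from the description, not DARWIN's actual schema. The point is that generic operations can be written once against this single type.

```python
# Sketch of the "everything is a resource" model; field names are
# assumptions based on the description, not DARWIN's actual schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Resource:
    id: str
    type: str                        # "notebook" | "sql_workbook" | "report" | ...
    owner: str
    parent_id: Optional[str] = None  # hierarchical links, e.g. project -> notebook
    version: int = 1
    tags: list = field(default_factory=list)

# Generic operations (CRUD, storage, sharing, search, versioning) can then be
# implemented once against Resource, so a new resource type inherits them all.
```
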
* **DARWIN Resource Metadata and Storage:**
* **Platform Service:**
* Manages DARWIN resource metadata.
* Entry point for DARWIN: handles authentication/authorization (authN/authZ) and launches user containers (via JupyterHub).
* Maps resources to file blobs by interacting with Storage Service.
* Stores resource metadata in [DataHub](https://engineering.linkedin.com/blog/2019/data-hub) for centralized management and entity relationships.
* **Storage Service:**
* Stores backing content for resources as file blobs in a persistent backend.
* Abstracts storage layer choice.
* User content transfer is managed by a client-side DARWIN storage library that plugs into the app's content manager (e.g., Jupyter Notebook's Contents API); a sketch follows.
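
Jupyter's pluggable Contents API is the integration point the post names for the client-side storage library. A minimal sketch of that hook, with an in-memory dict standing in for the real Storage Service blob backend:

```python
# Sketch of plugging storage into Jupyter's pluggable Contents API. The dict
# stands in for the real Storage Service; a full manager would also have to
# implement existence checks, delete, and rename.
from jupyter_server.services.contents.manager import ContentsManager

_BLOBS = {}  # path -> Jupyter content model

class DarwinContentsManager(ContentsManager):
    def get(self, path, content=True, type=None, format=None):
        model = dict(_BLOBS[path])
        if not content:
            model["content"] = None
        return model

    def save(self, model, path=""):
        _BLOBS[path] = model
        return self.get(path, content=False)

# The server would then be pointed at it:
# c.ServerApp.contents_manager_class = DarwinContentsManager
```
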
* **Enabling Collaboration:**
* **Sharing Resources:** Users can share resources (code, analysis) for learning, reuse, and review. By default, a resource is shared "code only" (for privacy); owners can explicitly share "with results" with authorized users (such shares are audited).
* **Search and Discovery:** Metadata search via DataHub.
* **Frontend:**
* Uses React.js heavily for UI (e.g., React-based JupyterLab extensions).
* Provides resource browsing, CRUD operations, execution environment switching.
**Key features provided by the DARWIN platform**
* **Support for Multiple Languages:** Python, SQL, R, Scala (for Spark).
* **Intellisense Capabilities:** Code completion, doc help, function signatures for SQL, Python, R, Scala. SQL autocomplete powered by DataHub metadata.
* **SQL Workbooks:**
* For citizen DS, BAs, SQL-comfortable users.
* SQL editor, tabular results, spreadsheet operations (search, filter, sort, pivot).
* Future: built-in visualizations, report publishing, dataset profiles.
* **Scheduling of Notebooks and Workbooks:**
* Leverages Azkaban.
* Allows parameter specification so an analysis can be repeated on new data (example below).
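
The post does not name the execution library behind scheduled, parameterized runs; papermill is shown here only as a stand-in to illustrate re-running an analysis with new parameters. Notebook names and parameters are hypothetical.

```python
# Stand-in sketch using papermill; the post does not name the executor
# that Azkaban triggers. Notebook names and parameters are hypothetical.
import papermill as pm

pm.execute_notebook(
    "weekly_metrics.ipynb",
    "runs/weekly_metrics_2024-01-08.ipynb",
    parameters={"start_date": "2024-01-01", "end_date": "2024-01-07"},
)
```
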
* **Integration with Other Products and Tools:**
* **Expert DS/AIEs:** Frame (internal feature management), TensorFlow, Pro-ML (ongoing).
* **Metrics Developers:** Internal tools for validation and error checking, metric templates, testing, review, and code submission.
* **Forecasting:** Greykite framework leverages DARWIN.
*Architecture diagram: see [DARWIN: Data Science and Artificial Intelligence Workbench at LinkedIn](https://www.linkedin.com/blog/engineering/developer-experience-productivity/darwin-data-science-and-artificial-intelligence-workbench-at-li)*
**Adoption within LinkedIn**
* **Product User Council:** Formed post-launch, acts as voice of the customer for prioritization and feedback, enabling co-creation.
* **Scale:** Over 1,400 active users across Data Science, AI, SRE, Trust, BAs, and product teams; more than 70% user-base growth in the past year.
**What’s next?**
* **Publishing Dashboards and Apps:** Allow authors to manage views (hide code/outputs). Host always-running apps (Voila, Dash, Shiny, custom).
* **Built-in Visualizations:** Rich code-free viz for citizen DS (like Excel/Sheets).
* **Projects, User Workspaces, Version Control:**
* Projects as namespaces (currently public).
* Plan: Manage projects on Git, enable version control.
* Workspaces: Clone a project, work, commit back to Git; backed by network-attached storage (sketched below).
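
The planned clone/work/commit loop might look like the following GitPython sketch; the repository URL and workspace paths are hypothetical, and the post does not specify the tooling.

```python
# GitPython sketch of the planned workspace loop; URL and paths hypothetical.
from git import Repo

# Clone the project into a user workspace (network-attached storage)
repo = Repo.clone_from("git@git.example:ds/revenue-project.git",
                       "/workspaces/alice/revenue-project")

# ... edit notebooks in the workspace ...

repo.index.add(["analysis.ipynb"])
repo.index.commit("Refresh weekly revenue analysis")
repo.remote(name="origin").push()
```
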
* **Exploratory Data Analysis (EDA):** Leverage DataHub for dataset search/discovery, schema, lineage, relationships within DARWIN.
* **Open Sourcing DARWIN:** Eventual plan.
* **Ultimate Vision:** Support all use cases for various personas, either natively or via integration.
**Conclusion**
* DARWIN is evolving to meet growing/changing user needs, aiming to be the one-stop platform for DS, AIEs, and data analysts at LinkedIn.