# Didact AI

**Introduction**

* **Didact AI Goal:** Build an ML-powered stock picking engine to consistently beat the market.
* **Development:** Worked full-time for 2 years (2019-2021).
* **Performance:** Generated outstanding returns, steadily beating S&P 500 for over a year (Jun 2021 - Jul 2022).
* **Didact Total Return:** +14.2% vs. S&P 500 (SPY) -7.1%.
* **Key Metric:** Low correlation with SPY (0.18 weekly), indicating alpha generation.
* **Downside Protection:** Significantly lower downside deviation (0.39% vs 1.22% for SPY weekly) and better returns when SPY was negative (-0.31% vs -2.03% for SPY weekly).
* **Reason for Shutdown:** Perceived lack of Product-Market Fit (PMF) for the newsletter service.
* **Focus of Post:** Architecture and technology of the engine, covering data ingestion, feature engineering, modeling, and MLOps.

**Markets: Patterns, Regimes, Regime Shifts, and Sentiment**

* **Core Assumptions/Priors:**
    * **Market Patterns:** Chunks of similar price-volume actions repeated across time (e.g., chart patterns, complex multi-market time-series artifacts). Ephemeral, payoffs change as they evolve and speculators adapt.
    * **Market Regimes:** The "context" in which assets operate (e.g., bull/bear market, geopolitical sentiment affecting commodities). Can be fine-grained (hundreds of regimes in Didact) or coarse-grained. Didact models asset and market dynamics together using regimes.
    * **Regime Shifts:** Sudden switches between market regimes, often due to exogenous circumstances, reflected in asset repricing. Regimes typically last 21-35 days (mean/median); shifts can be rapid or gradual.
    * **Market Sentiment:** Aggregated bullish/bearish behavior across assets/classes. Best captured by relative trading volumes weighted by periodic returns (e.g., high negative returns + high volume = negative sentiment). Sentiment often moves assets in lockstep during regime shifts ("correlations tend to 1").
* **Didact's Approach:** Stitches these concepts (patterns contextualized by regimes and sentiment) to power predictions.

**Engine architecture: A 50k ft view**

* **Pipeline Structure:** Raw financial market data -> data cleansing -> feature engineering/filtering -> action-oriented modules (stock picks). Includes a control panel and long-running monitoring.
* **High-Level Components:**
    * Data Sources (External Providers, Internal Data Store)
    * Data Ingestion Pipeline
    * Data Layer (Data Store, Feature Store, Artifact Store)
    * ML Layer (Feature Engineering, *Pythia* Forecasting Model, *Cassandra* Regime Model, *Socrates* Model Explanations)
    * ML Ops / Monitoring Layer (Backup/Restore, Workflow Monitoring, Feature Explorer, Model Diagnostics, Performance Analytics)
    * Output (Weekly Picks via Newsletter)

**Architecture** - [Source: Didact AI: The anatomy of an ML-powered stock picking engine](https://principiamundi.com/posts/didact-anatomy/)

**Design focus**

* **Primary Aim:** Enable rapid experimentation with *feature engineering*, believing it's key to success over model choice.
* **Secondary Aim:** Relentless focus on *execution speed*.
* **Daily Processing:** Text and time-series for >4000 US stocks, ETFs, macro series.
* **Feature Generation:** 1000+ features per stock (cross-sector, industry, index, fund ownership).
* **Quarterly Processing:** Earnings call transcripts, SEC EDGAR filings using deep learning language models (compute-intensive, batched weekly on GPU EC2 instances for cost savings).
* **Performance Improvement:** Raw data to insight timeline reduced from ~2.5 hours to ~14 minutes.

**Tech stack**

* **ML Layer:** Almost exclusively Python. Some SQL for pipeline parts, Pandas alternatives for bottlenecks. Sequential execution with input validation and quality checks.
* **Data Layer:**
    1. **Data Store:**
        * Initially: Postgres (speed issues with many columns as features).
        * Eventually: DuckDB (column-oriented embedded DB, good performance, plays well with Arrow/Parquet).
    2. **Feature Store:**
        * Initially: Redis (persisted to disk) with a custom feature access layer for fine-grained engineering and aggregations.
        * Augmented with DuckDB: Pushed basic feature engineering to SQL scripts (mimicking Pandas window functions) for rapid columnar processing.
    3. **Artifact Store:** Local disk storage, backed up daily to AWS S3 (60-day local retention). Includes serialized models, forecasts, regime metrics, training data snapshots.
* **ML Ops / Monitoring Layer:** Python scripts generating reports (except Feature Explorer).

**Python environment setup**

* **Base:** Python 3.8 Conda environment.
* **Key Packages:**
    * _Data engineering:_ `duckdb`, `numpy`, `pandas`, `pyarrow`, `boto`, `redislite`.
    * _ML Modeling:_ `jupyter`, `pandas`, `numpy`, `polars`, `joblib`, `bottleneck`, `pytorch`, `tsfresh`, `transformers` (Hugging Face), `xgboost`, `scikit-learn`, `scipy`.
    * _ML Ops:_ `dash` (Plotly), `fastapi`.
* **Parallelism:** Judicious use of `multiprocessing` for CPU-intensive tasks not suited for GPUs. (Cloud GPUs were considered too expensive, hence no NVIDIA RAPIDS.)

**Execution platform**

* **Infrastructure:** AWS.
* **Instance:** `r5a.2xlarge` EC2 (8 vCPUs, 64 GB RAM, optimized for in-memory compute) with EBS volume.
* **Self-Hosted:** Redis + Postgres (initially) on the EC2 instance.
* **Backups:** Private, locked-down S3 buckets.

**Pipeline scheduling**

* **Frequency:** Executed daily, 2 hours after market close (allows data feeds to refresh).
* **Data Quality:** Checks on incoming data; re-polls providers if necessary.
* **Production Scheduling (Past):** `cron` jobs executing shell scripts calling Python modules.
* **Current Rewrite:** Moving to Apache Airflow for DAG orchestration.
* **Stock Pick Extraction:** Only from Friday's run (or the last trading day of the week) to reduce cognitive burden for users.
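
The weekly extraction rule above (picks come only from Friday's run, or from the last trading day of a holiday-shortened week) can be sketched in a few lines. This is a minimal illustration, not the engine's actual code: the function name `is_last_trading_day_of_week` and the hard-coded `trading_days` calendar are assumptions made for the example, and a real pipeline would presumably read the exchange calendar from a data provider.

```python
from datetime import date
from typing import Iterable


def is_last_trading_day_of_week(run_date: date, trading_days: Iterable[date]) -> bool:
    """Return True if run_date is the final trading day of its ISO week."""
    days = set(trading_days)
    if run_date not in days:
        return False  # market closed on run_date, so there is no run to extract from
    run_week = tuple(run_date.isocalendar())[:2]  # (ISO year, ISO week number)
    # Extract picks only if no later trading day falls in the same ISO week.
    return not any(
        d > run_date and tuple(d.isocalendar())[:2] == run_week for d in days
    )


# Hypothetical calendar: the week of 2022-04-11 ends on Thursday because
# US markets were closed on Good Friday (2022-04-15).
trading_days = [date(2022, 4, d) for d in (11, 12, 13, 14, 18, 19, 20, 21, 22)]

print(is_last_trading_day_of_week(date(2022, 4, 14), trading_days))  # True (Thursday before Good Friday)
print(is_last_trading_day_of_week(date(2022, 4, 13), trading_days))  # False (mid-week run)
print(is_last_trading_day_of_week(date(2022, 4, 22), trading_days))  # True (regular Friday)
```

Grouping by ISO week (Monday to Sunday) keeps the check independent of which weekday the holiday falls on; the daily run can call it and skip pick extraction whenever it returns `False`.
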

**Interlude: Complexity in financial data**

* **Multi-modal & Complex:** Text, time series, implicit network graphs.
* **Data Feed Examples:**
    * *Options Data:* Bids/asks, volumes, open interest, IV for all contracts, grouped by expiry, strike, type. Contracts expire and are replaced.
* **Corporate Actions:** Need tooling to track name/ticker changes, splits, spinoffs, mergers for time-series consistency (these actions are also mined as features). One-off scripts for major events (e.g., Google/Alphabet).
* **Index Reconstitutions (S&P 500, Russell):** Announced in advance, trigger massive capital flows. Capturing these as features provides marginal alpha.
* **Calendar Events:** Track FOMC meetings, holidays, options/futures expiry, witching days, macro releases (NFP, unemployment).
* **Text Data (Earnings Transcripts, EDGAR Filings):** Use SOTA language models to infer uncertainty from changes in topics, word frequencies, sentiment. Aggregate sector/industry sentiment can be predictive.

**The data ingestion pipeline**

* **Daily Raw Data:** ~0.5 GB, complex, multiple feeds, cross-cutting relationships.
* **Process:**
    1. Grab latest daily data for US stocks into Arrow DataFrames.
    2. Basic quality checks (e.g., missing data for active stocks, bad prints).
        * Bad print detection: Flag unlikely implied returns and compare against corporate actions. Re-fetch if a bad print is likely.
        * (Planned: Use Great Expectations for DQ tests.)
    3. Columnar Compression / Basic Transforms: Normalize denormalized commercial feeds. Column type casting (e.g., symbols as Categoricals), drop columns, map timestamps to `uint16` (date, time) pairs to save space.
    4. Push transformed data from DataFrames into DuckDB.
    5. Backups: ML Ops script saves updates from data store to Parquet files and pushes to AWS S3.

**Interlude: How I think about ML in markets**

Various ways to frame financial market forecasting with ML:

1. _Supervised Learning:_ Regression (predict return) or classification (predict outperformance).
2. _Unsupervised Learning:_ Discern market regimes as clusters with associated forward trajectories.
3. _Self-Supervised Learning:_ Like classical autoregression (ARIMA) or masking/predicting parts of feature vectors.
4. _Learning to Rank:_ Rank stocks by expected forward returns (from Information Retrieval).
5. _Recommender System:_ Stocks as items, market regimes as users; recommend appropriate stocks for a regime (e.g., collaborative filtering).
6. _Reinforcement Learning:_ Trading agents maximizing ROI by buying/selling.

* **Goal:** Blend best ideas from each sub-field.

**Feature Engineering pipeline**

* **Feature Vector Composition (per stock, per day EOD):**
    1. **Raw Features from Feature Groups:** Price/volumes, options chains, text sentiment.
    2. **Basic Intra-Group Engineering:** E.g., (Close-Low)/(High-Low) from price/volume. Scores of these.
    3. **Peer Contextualization/Normalization:** Compare stock's data against peers (sector, industry, index, co-ownership patterns). Compute percentiles. (Critical: maintain as-of-date reference data for peer groups.)
    4. **Historical Contextualization:** Compare stock's data against its own history (e.g., closing price vs. last 1-year close). Percentiles, ratios, statistical measures (std dev, skew, kurtosis) on rolling windows.
    5. **Cross-Group Features:** Combine features across groups (e.g., price-to-sales, price delta / sentiment delta).
    6. **Trading Strategy-Based Features:** Run various strategies (technical, momentum, value, quality, etc.) and capture their predictions as features (idea: some strategies are profitable in certain regimes).
    7. **Market Regime-Based Features:** Current market state (VIX term structure, S&P 500 vs. 200-DMA, yield curve shape, macro release trajectories).
* **Pièce de résistance:** Accuracy gap (latest predictions vs. actual realized values, as percentages) included as a feature.
  Helps model detect regime shifts faster.

**Event time vs chronological time**

* **Chronological Time ("Clock Time"):** Standard financial data aggregation (per-minute, -hour, -day).
* **Event Time:** Measure "time" by event counts (e.g., per-100 trades), leading to "volume bars."
* **Analogy:** Similar to time domain vs. frequency domain in signal theory.
* **Benefit:** Allows design of new ML features capturing inverted time-related info.
* **Approximation:** Since trade-level data isn't available, "invert" 1-minute data to approximate event time features.

**ML Modeling subsystem**

* **Feature Matrix:** ~4000 stocks, features computed for last 250 days (with forward data). ~1M data points. Sufficient for complex gradient boosted trees.
* **Model Choice Rationale:**
    * _Why Gradient Boosting (GBDTs)?_ Better at handling imbalanced data and novel market regimes than random forests.
    * _Why not Deep Learning?_ Dataset size too small. Would require different modeling approach (though ideas for transformer-based architectures exist).
* **_Pythia_: Forecasting Model**
    * **Tool:** `xgboost`.
    * **Target:** Pick stocks with >75% probability of generating forward Sortino ratios >2.0x that of SPY. If >20 stocks, pick top 20 by probability.
    * **Retraining:** Daily (even if picks used weekly) to track mid-week regime shifts.
* **_Cassandra_: Regime Model**
    * Tracks current market regime.
    * Computes forward trajectory projections for S&P 500, Nasdaq, Russell 2000.
    * Maintains forecasts of next likely regime shifts (critical for Pythia's performance).
    * Focuses on subset of feature matrix related to index-level aggregated info.
* **_Socrates_: Model Explanations**
    * Provides narrative explanations for why specific stocks were picked.
    * Uses `shap` to generate explanations from the model, then processes for human comprehensibility.
    * (Future: NLG for succinct sentence explanations).
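
Pythia's selection rule described above (keep stocks whose predicted probability exceeds 75%, then cap the list at the 20 most probable) is simple to express once the model has produced probabilities. The sketch below assumes the classifier output is already available as a ticker-to-probability mapping; `select_picks`, the tickers, and the probability values are all hypothetical, and the tie-break on ticker symbol is an arbitrary choice for determinism rather than anything the source specifies.

```python
from typing import Dict, List, Tuple


def select_picks(
    probabilities: Dict[str, float],
    threshold: float = 0.75,
    max_picks: int = 20,
) -> List[Tuple[str, float]]:
    """Keep tickers with probability above threshold, capped at the top max_picks."""
    qualified = [(t, p) for t, p in probabilities.items() if p > threshold]
    # Highest probability first; ties broken alphabetically by ticker.
    qualified.sort(key=lambda item: (-item[1], item[0]))
    return qualified[:max_picks]


# Hypothetical model output: probability that each stock's forward Sortino
# ratio exceeds 2.0x that of SPY.
probabilities = {"AAA": 0.91, "BBB": 0.74, "CCC": 0.80, "DDD": 0.75}
picks = select_picks(probabilities)
print(picks)  # [('AAA', 0.91), ('CCC', 0.8)]
```

Note that the comparison is strict, so a stock at exactly 0.75 is excluded; whether the original engine treated the boundary this way is not stated.
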

**ML Ops: Forecasting and Monitoring**

* **Control Plane:** Mostly manual invocation (except automated backups).
* **Primary Objective:** Monitor ML pipeline for errors, bugs, potential catastrophic drops in profitability (even if unlikely due to regime shift accounting).
* **Secondary Objective:** Explore individual features (drifts, patterns, correlations) to inspire new feature ideas.
* **Backup/Restore:**
    * Backs up raw data (Parquet), serialized models, predictions, regime forecasts, feature matrices, and training data snapshots to private AWS S3.
    * Nothing is deleted from S3.
    * The restore module works like a simplified `rsync`, maintaining a local time window (e.g., last 60 days).
* **Workflow Monitoring:** Script analyzes ML pipeline logs, sends SUCCESS/FAILURE (with diagnosis) reports via email. (Planned: browser-based admin panel.)
* **Feature Explorer:** Jupyter notebook based tool (early version used Plotly Dash with Bootstrap) for exploring constructed features and rapidly testing new feature ideas.
* **Model Diagnostics:** Traces predictions, compares them to realized returns, and tracks accuracy gap evolution (backtested and walk-forward). Dramatic spikes may indicate fundamental market changes.
* **Performance Analytics:** Tracks model performance (hypothetical portfolio returns vs. SPY) using standard trading metrics (Sortino, Sharpe, volatility, downside risk, win/loss, profit factors).

**Tackling performance bottlenecks**

Changes to speed up pipeline execution:

1. **Postgres -> DuckDB (for data store):** In-memory columnar processing was beneficial, especially for bulk vectorized computations during ingestion.
2. **Pandas -> PyArrow + DuckDB (ingestion pipeline):** PyArrow for reading/cleaning, then zero-copy load into DuckDB.
3. **NumPy -> Bottleneck (feature engineering):** Bottleneck's fast C-based implementations of NumPy functions saved CPU cycles.
4. **Pandas -> Polars (feature engineering):** Rust-based DataFrame library integrated with Arrow, provided speedups (care needed as functions don't map 1-to-1).
5. **Redesigning Pipelines for Multiprocessing:** Using all EC2 cores beyond vectorization.

**Coda: Why I shut down Didact**

* **Market Timing:** Launched newsletter (June 2021) as COVID bull run was ending.
* **Bull Market Challenge:** Hard to stand out on performance alone when "everyone's a genius" (e.g., GME).
* **Unstable Market Challenge:** Easier to stand out (Didact did well against losses), but target market (retail investors) exhibits *loss aversion* and *endowment effect*, preferring to wait for old holdings to recover rather than allocate to a new systematic engine.
* **PMF Trap:** This behavioral bias played a major role in shutting down the newsletter.