# Energy Demand Forecasting in Time Series IoT Data
---
### TL;DR: ML-Powered Energy Demand Forecasting for Smart Buildings
* **Challenge:** The company needed to forecast building-level energy demand 24-72 hours in advance to provide valuable data to energy suppliers for grid balancing and to enable a new resident-facing feature for optimizing solar energy self-consumption. The system had to be accurate, reliable, and capable of handling complex factors like weather and holidays.
* **My Role & Solution:** I led the development and operationalization of the forecasting system, serving as the Data Scientist and MLOps Engineer. My key contributions were:
* **Strategic Approach:** I designed a model development strategy that began with strong baselines (SARIMAX, Prophet) and iteratively progressed to a high-performance XGBoost model. I established that forecasting at the building level was the most effective initial approach, balancing accuracy with computational feasibility.
* **Feature Engineering:** I engineered a rich feature set crucial for forecast accuracy. This involved creating time-series features (lags, rolling windows), incorporating calendar events (holidays), and building a robust pipeline to process and align *future* weather forecast data with historical consumption patterns.
* **MLOps Infrastructure:** Using Terraform, I built a fully automated, serverless MLOps pipeline on AWS. The architecture included a distinct training workflow orchestrated by Step Functions for monthly model retraining, and a separate daily inference workflow that generates and stores forecasts.
* **Production Lifecycle Management:** I implemented the end-to-end system, including a containerized (Docker/ECR) forecasting model, versioning and governance via SageMaker Model Registry, and a CI/CD pipeline in Bitbucket for automated testing and deployment. The solution included a scalable serving layer using Amazon Timestream, API Gateway, and Lambda to deliver forecasts to both B2B and B2C consumers.
* **Impact:** The forecasting system became a key data product for the company, unlocking new value for both business partners and residents.
* Achieved a **<10% Mean Absolute Percentage Error (MAPE)** on 24-hour ahead building-level forecasts, providing reliable data to energy suppliers.
* Enabled the launch of the "Smart Energy Advisor" feature in the resident app, leading to a **15% increase in user engagement** with energy management tools. This, in turn, drove a measured ~10% increase in the building's solar self-consumption rate by empowering residents to align appliance usage with peak solar generation periods.
* **System Architecture:** I designed and implemented the complete AWS solution, from data ingestion to API serving, ensuring scalability and automation.

### Introduction
#### Purpose
This document provides detailed technical information about the Machine Learning (ML) based Energy Demand Forecasting (EDF) system. It serves as a guide for developers, MLOps engineers, and operations teams.
#### Business Goal
The primary goals of the EDF system are to:
* Provide accurate short-term (e.g., 24-72 hours) aggregated energy demand forecasts at the building level to external energy suppliers and Distribution System Operators (DSOs) for improved grid balancing, energy procurement, and network management.
* Empower residents by providing insights into predicted building energy consumption versus predicted solar generation, enabling them to optimize appliance usage for increased solar self-consumption and potential cost savings.
#### Scope
This documentation details the end-to-end pipelines for data processing, model training, model evaluation, model registration, batch inference (forecasting), forecast storage, and the conceptual API serving layer. It assumes the existence of a separate data ingestion pipeline providing the necessary raw data feeds.
#### Key Technologies
* **Cloud Platform:** AWS (Amazon Web Services)
* **Data Lake:** Amazon S3
* **Data Processing/ETL:** AWS Glue (PySpark), AWS Lambda (Python), SageMaker Processing Jobs (PySpark/Python)
* **Feature Management:** Shared Code Libraries (Primary), Amazon SageMaker Feature Store (Optional for historical features)
* **Model Training:** Amazon SageMaker Training Jobs (using custom Docker containers with Prophet, XGBoost, etc.)
* **Model Forecasting:** Amazon SageMaker Processing Jobs (running forecast generation scripts)
* **Model Registry:** Amazon SageMaker Model Registry
* **Orchestration:** AWS Step Functions
* **Scheduling:** Amazon EventBridge Scheduler
* **Forecast Storage:** Amazon Timestream (Primary Example), Amazon RDS, Amazon DynamoDB
* **API Layer:** Amazon API Gateway, AWS Lambda
* **Infrastructure as Code:** Terraform
* **CI/CD:** Bitbucket Pipelines
* **Containerization:** Docker
* **Core Libraries:** PySpark, Pandas, Scikit-learn, Boto3, PyYAML, Prophet, XGBoost, AWS Data Wrangler.
**Table of Contents**
1. [Introduction](#introduction)
    * [Purpose](#purpose)
    * [Business Goal](#business-goal)
    * [Scope](#scope)
    * [Key Technologies](#key-technologies)
2. [Discovery and Scoping](#discovery-and-scoping)
    * [Use Case Evaluation](#use-case-evaluation)
    * [Product Strategies](#product-strategies)
    * [Features](#features)
    * [Product Requirements Document](#product-requirements-document)
    * [Development Stages](#development-stages)
3. [System Architecture](#system-architecture)
    * [Overall Data Flow](#overall-data-flow)
    * [Training Workflow](#training-workflow)
    * [Inference Workflow](#inference-workflow)
    * [Forecast Serving](#forecast-serving)
4. [Model Development & Iteration](#model-development--iteration)
5. [Configuration Management](#configuration-management)
6. [Infrastructure as Code (Terraform)](#infrastructure-as-code-terraform)
7. [CI/CD Pipeline (Bitbucket)](#cicd-pipeline-bitbucket)
8. [Deployment & Execution](#deployment--execution)
    * [Prerequisites](#prerequisites)
    * [Initial Deployment](#initial-deployment)
    * [Running Training](#running-training)
    * [Running Inference](#running-inference)
    * [Model Approval](#model-approval)
9. [Monitoring & Alerting](#monitoring--alerting)
10. [Estimated Monthly Costs](#estimated-monthly-costs)
11. [Challenges and Learnings](#challenges-and-learnings)
12. [Troubleshooting Guide](#troubleshooting-guide)
13. [Security Considerations](#security-considerations)
14. [Roadmap & Future Enhancements](#roadmap--future-enhancements)
15. [Appendices](#appendices)
    * [Configuration File Example](#configuration-file-example)
    * [Data Schemas](#data-schemas)
### Discovery and Scoping
#### Use Case Evaluation

#### Product Strategies

#### Features

#### Product Requirements Document

#### Development Stages
### System Architecture

#### Overall Data Flow
The EDF system utilizes distinct, automated pipelines for training forecasting models and generating daily forecasts. It relies on processed historical data and external weather forecasts. Forecasts are stored in a time-series database and made available via APIs.
1. **Raw Data:** Aggregated Consumption, Solar Generation, Historical Weather, *Future Weather Forecasts*, Calendar/Topology data land in S3 Raw Zone.
2. **Processed Data:** Ingestion pipeline processes *historical* data into `processed_edf_data` in S3 Processed Zone / Glue Catalog.
3. **Features (Training):** Training Feature Engineering step reads `processed_edf_data`, calculates historical time-series features (lags, windows, time components), splits train/eval sets, and writes them to S3.
4. **Features (Inference):** Inference Feature Engineering step reads recent `processed_edf_data` and *future raw weather forecasts*, calculates features for the forecast horizon using shared logic, and writes them to S3.
5. **Model Artifacts:** Training jobs save serialized forecast models (Prophet JSON, XGBoost UBJ) to S3.
6. **Evaluation Reports:** Evaluation jobs save forecast metrics (MAPE, RMSE) to S3.
7. **Model Packages:** Approved forecast models are registered in the EDF Model Package Group.
8. **Raw Forecasts:** Inference forecast generation step writes raw forecast outputs (timestamp, building, yhat, yhat_lower, yhat_upper) to S3.
9. **Stored Forecasts:** Load Forecasts step reads raw forecasts from S3 and writes them into the target database (e.g., Timestream).
10. **Served Forecasts:** API Gateway and Lambda query the forecast database to serve B2B and B2C clients.
**Forecasting Pipeline Description:**
1. **Ingestion:** Aggregated Consumption, Solar Generation, Topology/Calendar, and crucially, *Forecasted* Weather data are ingested into the Raw S3 Zone.
2. **Processing:** A daily Glue ETL job aggregates data to the building level (if needed), joins it with weather forecasts and calendar info, and engineers features (lags, time features, weather interactions). Processed features are stored in S3 (and potentially SageMaker Feature Store for easier reuse/serving). The Glue Data Catalog is updated.
3. **Model Training:** A Step Functions workflow orchestrates training, similar to the Anomaly Detection (AD) system, using appropriate forecasting models (ARIMA, Prophet, Gradient Boosting, etc.). Models are registered.
4. **Batch Inference:** A daily Step Functions workflow prepares features for the forecast horizon (using the latest actuals *and future weather forecasts*), runs a SageMaker Processing Job to generate forecasts (e.g., hourly demand for the next 72h), and stores the results in S3.
5. **Results Storage:** Raw forecasts are stored in S3. A subsequent job loads the forecasts into a database optimized for time-series queries or fast lookups (DynamoDB, RDS, or potentially Amazon Timestream).
6. **Serving:**
* **B2B:** A dedicated API Gateway endpoint uses Lambda to query the Forecast Database, perform necessary aggregation/anonymization, and return forecasts to authorized suppliers/DSOs.
* **B2C:** The existing App/Tablet backend queries the Forecast Database (likely via another API Gateway endpoint) to get simplified data for resident visualization (comparing building demand vs. solar).
* **Analysts:** Use Athena for ad-hoc analysis.
#### Training Workflow
**Summary:** Triggered by Schedule/Manual -> Validates Schema -> Engineers Historical Features (Train/Eval Sets) -> Trains Model (Prophet/XGBoost) -> Evaluates Model (MAPE, RMSE) -> Conditionally Registers Model (Pending Approval).
**Step Function State Machine**
1. **State:** `ValidateSchema` (Optional)
* **Service:** SM Processing / Glue Shell
* **Action:** Validates schema of `processed_edf_data`.
2. **State:** `FeatureEngineeringTrainingEDF`
* **Service:** SM Processing Job (Spark) / Glue ETL
* **Action:** Reads `processed_edf_data` for the training range. Creates time features, lags (consumption, weather), and rolling windows (see the pandas sketch after this list). Splits into train/eval feature sets (Parquet) written to S3.
3. **State:** `ModelTrainingEDF`
* **Service:** SM Training Job
* **Action:** Reads training features from S3. Instantiates/fits chosen model (Prophet/XGBoost) based on input strategy and hyperparameters. Saves model artifact (`prophet_model.json` / `xgboost_model.ubj` + `model_metadata.json`) within `model.tar.gz` to S3.
4. **State:** `ModelEvaluationEDF`
* **Service:** SM Processing Job (Python + libs)
* **Action:** Loads model artifact. Reads evaluation features from S3. Generates forecasts for eval period. Calculates MAPE, RMSE, MAE against actuals (from eval features). Writes `evaluation_report_edf.json` to S3.
5. **State:** `CheckEvaluationEDF` (Choice)
* **Action:** Compares metrics (e.g., MAPE, RMSE) against thresholds from config.
6. **State:** `RegisterModelEDF`
* **Service:** Lambda
* **Action:** Creates Model Package in EDF group (`PendingManualApproval`), embedding metadata and evaluation results.
7. **Terminal States:** `WorkflowSucceeded`, `EvaluationFailedEDF`, `WorkflowFailed`.
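The heart of `FeatureEngineeringTrainingEDF` is the lag/rolling-window computation. Below is a minimal pandas sketch of that logic, assuming hourly data with the column names used in `edf_config.yaml`; the production step runs the equivalent logic in PySpark on SageMaker Processing.
```python
import pandas as pd

def build_time_series_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative lag/rolling-window features for building-level demand.

    Assumes df has columns building_id, timestamp_hour (datetime64, hourly)
    and consumption_kwh, mirroring the edf_feature_eng config section.
    """
    df = df.sort_values(["building_id", "timestamp_hour"]).copy()

    # Time components derived from the timestamp
    df["hour_of_day"] = df["timestamp_hour"].dt.hour
    df["day_of_week"] = df["timestamp_hour"].dt.dayofweek
    df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)

    g = df.groupby("building_id")["consumption_kwh"]

    # Hourly lags (1 day, 2 days, 1 week), computed per building
    for lag in (24, 48, 168):
        df[f"consumption_lag_{lag}h"] = g.shift(lag)

    # Rolling means, shifted one step so the current hour never leaks
    for window in (3, 24, 168):
        df[f"consumption_roll_avg_{window}h"] = g.transform(
            lambda s, w=window: s.shift(1).rolling(w).mean()
        )

    # Mirrors imputation_value: 0.0 from the config
    feature_cols = [c for c in df.columns if "lag_" in c or "roll_" in c]
    df[feature_cols] = df[feature_cols].fillna(0.0)
    return df
```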
#### Inference Workflow
**Summary:** Triggered Daily by Scheduler -> Gets Latest Approved Model -> Creates SM Model Resource -> Engineers Inference Features -> Generates Forecasts (SM Processing) -> Loads Forecasts into DB (Timestream/RDS).
1. **State:** `GetInferenceDate` (Optional Lambda / Step Functions Context)
* **Action:** Determines target forecast start date (e.g., "today" or "tomorrow" relative to execution time) and forecast end date based on horizon.
2. **State:** `GetApprovedEDFModelPackage`
* **Service:** Lambda
* **Action:** Gets the latest `Approved` Model Package ARN from the EDF group (see the first sketch after this list).
3. **State:** `CreateEDFSageMakerModel`
* **Service:** Lambda
* **Action:** Creates SM `Model` resource from the approved package ARN.
4. **State:** `FeatureEngineeringInferenceEDF`
* **Service:** SM Processing Job (Spark) / Glue ETL
* **Action:** Reads recent historical `processed_edf_data` (for lags) AND future raw `weather-forecast` data. Creates feature set covering the forecast horizon (including future dates and weather). Writes inference features (Parquet) to S3.
5. **State:** `GenerateForecastsEDF`
* **Service:** SM Processing Job (Python + libs)
* **Action:** Loads model artifact (mounted via SM Model resource). Reads inference features from S3. Calls model's `predict` method to generate forecasts (`yhat`, `yhat_lower`, `yhat_upper`). Writes forecast results (Parquet/CSV) to S3.
6. **State:** `LoadForecastsToDB`
* **Service:** Lambda / Glue Python Shell
* **Action:** Reads forecast results from S3. Formats data for the target DB (Timestream example). Writes records to the database using batch operations (see the second sketch after this list).
7. **Terminal States:** `WorkflowSucceeded`, `WorkflowFailed`.
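`GetApprovedEDFModelPackage` reduces to a single SageMaker API call. A minimal sketch, with the group name supplied from config:
```python
import boto3

sm = boto3.client("sagemaker")

def get_latest_approved_package(group_name: str) -> str:
    """Return the ARN of the newest Approved package in a model package group."""
    resp = sm.list_model_packages(
        ModelPackageGroupName=group_name,
        ModelApprovalStatus="Approved",
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = resp["ModelPackageSummaryList"]
    if not summaries:
        raise RuntimeError(f"No approved model package found in {group_name}")
    return summaries[0]["ModelPackageArn"]
```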
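`LoadForecastsToDB` is similarly thin. A sketch of the Timestream variant, assuming the forecast rows have already been read from S3 into dicts carrying `building_id`, an epoch-millisecond `timestamp`, and the `yhat` fields; Timestream accepts at most 100 records per `WriteRecords` call:
```python
import boto3

ts_write = boto3.client("timestream-write")

def load_forecasts(rows: list[dict], database: str, table: str) -> None:
    """Batch-write forecast rows as Timestream multi-measure records."""
    records = [
        {
            "Dimensions": [{"Name": "building_id", "Value": r["building_id"]}],
            "MeasureName": "demand_forecast",
            "MeasureValueType": "MULTI",
            "MeasureValues": [
                {"Name": k, "Value": str(r[k]), "Type": "DOUBLE"}
                for k in ("yhat", "yhat_lower", "yhat_upper")
            ],
            "Time": str(r["timestamp"]),  # epoch milliseconds
        }
        for r in rows
    ]
    for i in range(0, len(records), 100):  # WriteRecords batch limit
        ts_write.write_records(
            DatabaseName=database, TableName=table, Records=records[i : i + 100]
        )
```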
#### Forecast Serving
* **B2B API:** API Gateway endpoint proxying to a Lambda function. Lambda queries the Forecast DB (e.g., Timestream) based on requested `building_id` and `time_range`. Requires authentication/authorization (e.g., API Keys, Cognito Authorizer).
* **B2C API:** Separate API Gateway endpoint/Lambda. Queries Forecast DB, potentially performs simple comparison with Solar forecast (if available), returns simplified data structure for UI visualization. Requires app user authentication.
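A serving Lambda is mostly query construction and response flattening. A hedged sketch against the Timestream example above (the database/table names and query-string parameters are illustrative, and `building_id` must be validated before interpolation to prevent query injection):
```python
import json
from datetime import datetime, timedelta, timezone

import boto3

ts_query = boto3.client("timestream-query")

def handler(event, context):
    """Illustrative B2B handler: return the next N hours of forecasts."""
    params = event.get("queryStringParameters") or {}
    building_id = params["building_id"]  # validate/whitelist before use
    hours = min(int(params.get("hours", 24)), 72)

    start = datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    query = f"""
        SELECT building_id, time, yhat, yhat_lower, yhat_upper
        FROM "EDFDatabase"."BuildingDemandForecasts"
        WHERE building_id = '{building_id}'
          AND time BETWEEN from_iso8601_timestamp('{start.isoformat()}')
                       AND from_iso8601_timestamp('{end.isoformat()}')
        ORDER BY time ASC
    """
    result = ts_query.query(QueryString=query)

    # Flatten the Timestream response into simple row dicts
    cols = [c["Name"] for c in result["ColumnInfo"]]
    body = [
        dict(zip(cols, [d.get("ScalarValue") for d in row["Data"]]))
        for row in result["Rows"]
    ]
    return {"statusCode": 200, "body": json.dumps(body)}
```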
### Model Development & Iteration
**Models for Time Series Forecasting**



### Configuration Management
* Uses `config/edf_config.yaml` version-controlled in Git.
* File uploaded to S3 by CI/CD.
* Scripts load config from S3 URI passed via environment variable/argument.
* Includes sections for `feature_engineering`, `training` (with nested model hyperparameters), `evaluation` (thresholds), `inference` (schedule, instance types, DB config).
* Secrets managed via AWS Secrets Manager / SSM Parameter Store.
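A sketch of this config-resolution pattern, assuming the S3 location is passed via a `CONFIG_S3_URI` environment variable (the exact variable or CLI argument name varies by job type):
```python
import os

import boto3
import yaml  # PyYAML, already a core project library

def load_config() -> dict:
    """Fetch and parse edf_config.yaml from the location CI/CD uploaded it to."""
    uri = os.environ["CONFIG_S3_URI"]  # e.g. s3://<bucket>/config/edf_config.yaml
    bucket, key = uri.removeprefix("s3://").split("/", 1)
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return yaml.safe_load(body)

config = load_config()
max_mape = config["edf_evaluation"]["metrics_thresholds"]["max_mape"]
```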
### Infrastructure as Code (Terraform)
* Manages all AWS resources for EDF pipelines.
* **Stacks:**
* `training_edf`: ECR Repo (`edf-training-container`), Model Package Group (`EDFBuildingDemandForecaster`), Lambda (`RegisterEDFModelFunction`), Step Function (`EDFTrainingWorkflow`), associated IAM roles. Optionally Feature Group (`edf-building-features`). Requires outputs from `ingestion`.
* `inference_edf`: Timestream DB/Table (or RDS/DDB), Lambdas (GetModel, CreateModel - potentially reused; LoadForecasts), Step Function (`EDFInferenceWorkflow`), EventBridge Scheduler, API Gateway Endpoints/Lambdas (for serving), associated IAM roles. Requires outputs from `ingestion` and `training_edf`.
* **State:** Remote backend (S3/DynamoDB) configured.
### CI/CD Pipeline (Bitbucket)
* Defined in `bitbucket-pipelines.yml`.
* **CI (Branches/PRs):** Lints, runs ALL unit tests, builds EDF container, pushes to ECR (`edf-training-container` repo), validates ALL Terraform stacks.
* **EDF Training CD (`custom:deploy-and-test-edf-training`):** Manual trigger. Deploys `training_edf` stack. Runs EDF training integration tests.
* **EDF Inference CD (`custom:deploy-and-test-edf-inference`):** Manual trigger. Deploys `inference_edf` stack. Runs EDF inference integration tests (requires approved model).
### Deployment & Execution
**8.1 Prerequisites:** Base AWS setup, Terraform, Docker, Python, Git, Bitbucket config, deployed `ingestion` stack.
**8.2 Initial Deployment:** Deploy Terraform stacks (`training_edf`, `inference_edf`) after `ingestion`. Build/push EDF container. Upload initial `edf_config.yaml`. Ensure processed EDF data exists.
**8.3 Running Training:** Trigger `EDFTrainingWorkflow` Step Function (schedule/manual) with appropriate input JSON (date range, model strategy, hyperparameters, image URI).
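A manual run amounts to one `start_execution` call. The ARN, input field names, and image URI below are placeholders for the values produced by the Terraform stack and the workflow's input contract:
```python
import json

import boto3

sfn = boto3.client("stepfunctions")

response = sfn.start_execution(
    stateMachineArn=(
        "arn:aws:states:eu-central-1:123456789012:"
        "stateMachine:EDFTrainingWorkflow-dev"  # placeholder ARN
    ),
    input=json.dumps({
        "data_start_date": "2024-01-01",  # illustrative input contract
        "data_end_date": "2024-09-30",
        "model_strategy": "XGBoost",
        "hyperparameters": {"xgb_eta": 0.1, "xgb_max_depth": 5},
        "training_image_uri": "<account>.dkr.ecr.eu-central-1.amazonaws.com/edf-training-container:latest",
    }),
)
print(response["executionArn"])
```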
**8.4 Running Inference:** `EDFInferenceWorkflow` runs automatically via EventBridge Scheduler. Ensure prerequisite data (processed history, weather forecasts) is available daily.
**8.5 Model Approval:** Manually review `PendingManualApproval` packages in the `EDFBuildingDemandForecaster` group and promote to `Approved` based on evaluation metrics.
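Approval is normally done in the SageMaker Studio UI, but the equivalent API call (with a placeholder package ARN) looks like this:
```python
import boto3

sm = boto3.client("sagemaker")

sm.update_model_package(
    ModelPackageArn=(
        "arn:aws:sagemaker:eu-central-1:123456789012:"
        "model-package/EDFBuildingDemandForecaster/3"  # placeholder version ARN
    ),
    ModelApprovalStatus="Approved",
    ApprovalDescription="MAPE 8.2% on eval window; within <10% target.",
)
```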
### Monitoring & Alerting
* **CloudWatch Logs:** Monitor logs for EDF Step Functions, Lambdas, SageMaker Processing Jobs.
* **CloudWatch Metrics:** Monitor SFN `ExecutionsFailed`, Lambda `Errors`/`Duration`, Processing Job `CPU/Memory`, Timestream/RDS metrics (if applicable), API Gateway `Latency`/`4XX/5XX Errors`.
* **Forecast Accuracy Tracking:** **CRITICAL:** Implement a separate process (e.g., scheduled Lambda/Glue job) to periodically compare stored forecasts against actual consumption data (loaded later) and calculate ongoing MAPE/RMSE. Log these metrics to CloudWatch or a dedicated monitoring dashboard. A sketch of such a job follows this list.
* **Data Pipeline Monitoring:** Monitor success/failure of ingestion jobs providing raw data and the `process_edf_data` Glue job.
* **Weather Forecast API:** Monitor availability and error rates of the external weather forecast API.
* **CloudWatch Alarms:** Set alarms on: SFN Failures, Lambda Errors, Forecast Accuracy Degradation (MAPE/RMSE exceeding threshold), Weather API Fetch Failures, Target DB Write Errors.
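A minimal sketch of the accuracy-tracking job flagged as critical above, assuming stored forecasts and later-arriving actuals have already been joined into one DataFrame; it publishes MAPE/RMSE as custom CloudWatch metrics that the alarms listed above can target:
```python
import boto3
import numpy as np
import pandas as pd

cloudwatch = boto3.client("cloudwatch")

def publish_accuracy_metrics(df: pd.DataFrame, building_id: str) -> None:
    """Compute ongoing MAPE/RMSE and push them as CloudWatch custom metrics.

    Assumes df has columns yhat (stored forecast) and actual_kwh (actuals);
    zero-consumption hours are excluded from MAPE to avoid division by zero.
    """
    nonzero = df[df["actual_kwh"] != 0]
    mape = float(
        np.abs((nonzero["actual_kwh"] - nonzero["yhat"]) / nonzero["actual_kwh"]).mean() * 100
    )
    rmse = float(np.sqrt(((df["actual_kwh"] - df["yhat"]) ** 2).mean()))

    cloudwatch.put_metric_data(
        Namespace="EDF/ForecastAccuracy",  # illustrative namespace
        MetricData=[
            {"MetricName": "MAPE", "Value": mape, "Unit": "Percent",
             "Dimensions": [{"Name": "BuildingID", "Value": building_id}]},
            {"MetricName": "RMSE", "Value": rmse, "Unit": "None",
             "Dimensions": [{"Name": "BuildingID", "Value": building_id}]},
        ],
    )
```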
### Estimated Monthly Costs
**Assumptions:**
* **Environment:** AWS `eu-central-1` (Frankfurt) region.
* **Buildings:** 120.
* **Data Volume:** Building-level aggregation results in smaller processed data sizes compared to the raw sensor data for AD.
* Daily Processed EDF Data: ~10 MB.
* Total Monthly Processed EDF Data: 10 MB/day * 30 days = **~300 MB**.
* **Model Training Frequency:** 1 time per month.
* **Batch Inference Frequency:** 30 times per month (daily).
* **API Serving Layer:**
* B2B API (Suppliers): Low volume, high value. Assume **10,000 requests/month**.
* B2C API (Residents): Higher volume. Assume 3,000 apartments * 1 request/day * 30 days = **90,000 requests/month**.
* Total API Requests: **~100,000 per month**.
| Pipeline Component | AWS Service(s) | Detailed Cost Calculation & Rationale | Estimated Cost (USD) |
| :--- | :--- | :--- | :--- |
| **Data Ingestion & Processing** | **AWS Glue**<br>**AWS Lambda**<br>**S3** | **AWS Glue ETL (Processing EDF Data):** Daily job to aggregate raw data and join with weather/calendar. Less data than AD raw processing.<br>- 1 job/day * 30 days * 0.15 hours/job * 4 DPUs * ~$0.44/DPU-hr = **~$7.92**<br>**AWS Lambda (Ingesting Raw Data):** Lambdas for fetching weather forecasts, historical weather, etc.<br>- Assume 5 functions, ~100k total invocations/month, avg 10 sec duration, 256 MB. All usage is well within the Lambda free tier. = **~$0.00**<br>**S3 (PUT Requests):** Lower volume than AD due to aggregation.<br>- ~1,000 PUT req/day * 30 days * ~$0.005/1k req = **~$0.15** | **$8 - $15** |
| **Feature Engineering** | **SageMaker Processing** | **SageMaker Processing Jobs:** Priced per instance-second. Covers feature engineering for both the monthly training and daily inference pipelines.<br>- Training (monthly): 1 run/month * 1.0 hour/run * 1 `ml.m5.large` instance * ~$0.11/hr = **~$0.11**<br>- Inference (daily): 30 runs/month * 0.25 hours/run * 1 `ml.m5.large` instance * ~$0.11/hr = **~$0.83** | **$1 - $3** |
| **Model Training** | **SageMaker Training**<br>**Step Functions** | **SageMaker Training Jobs:** Priced per instance-second. Assumes monthly retraining on a standard instance.<br>- 1 run/month * 1.5 hours/run * 1 `ml.m5.large` instance * ~$0.11/hr = **~$0.17**<br>**Step Functions:** The training workflow runs infrequently.<br>- ~10 transitions/run * 1 run/month = 10 transitions. Well within free tier. = **~$0.00** | **$1 - $2** |
| **Model Inference** | **SageMaker Processing**<br>**Step Functions** | **SageMaker Processing Job (Forecast Generation):** A Processing Job is used for flexibility; this is the daily forecasting job.<br>- 30 runs/month * 0.5 hours/run * 1 `ml.m5.large` instance * ~$0.11/hr = **~$1.65**<br>**Step Functions:** The inference workflow runs daily.<br>- ~8 transitions/run * 30 runs/month = 240 transitions. Well within free tier. = **~$0.00** | **$2 - $4** |
| **Forecast Storage & Serving** | **API Gateway**<br>**AWS Lambda**<br>**Amazon Timestream** | **API Gateway:** Priced per million requests.<br>- 100,000 requests/month is well within the 1M free-tier requests (for REST APIs). = **~$0.00**<br>**AWS Lambda (Serving):** Lambdas for serving B2B/B2C requests.<br>- 100k invocations/month * avg 150 ms duration * 256 MB memory. All usage well within free tier. = **~$0.00**<br>**Amazon Timestream:** Priced per GB of ingestion, storage, and queries.<br>- Ingestion: 30 days * 120 bldgs * 72 hr fcst * ~1 KB/record = ~260 MB ingest/month. Free tier is 1 GB. = **~$0.00**<br>- Memory Store: rolling window of recent data. ~1 GB * ~$0.036/GB-hr * 720 hrs = **~$25.92** with default retention, which is high; assuming memory retention is reduced to 1 day = **~$3.60**<br>- Magnetic Store: ~3 GB/year. ~0.3 GB * ~$0.03/GB-month = **~$0.01**<br>- Queries: 100k requests * ~10 MB scanned/query (estimate) = ~1 TB scanned. $0.01/GB * 1024 GB = **~$10.24** | **$15 - $25** |
| **Storage & Logging** | **S3**<br>**ECR**<br>**CloudWatch** | **S3:** Priced per GB-month. Lower processed/feature data volume than AD, but raw data still needs storing.<br>- Assume total storage for Raw, Processed, Features, Artifacts = ~380 GB<br>- 380 GB * ~$0.023/GB-month = **~$8.74**<br>- Add ~$2 for GET/LIST requests.<br>**ECR:** Priced per GB-month. For the EDF container image.<br>- Assume a separate 10 GB of storage for EDF images. = **~$1.00**<br>**CloudWatch:** Logs from all jobs and Lambdas.<br>- Assume ~10 GB log ingestion * ~$0.50/GB = **~$5.00** | **$15 - $25** |
| **Total Estimated Monthly Cost** | - | - | **$42 - $74** |
This detailed breakdown reveals that the highest operational cost for the forecasting system is the **Forecast Storage & Serving** layer, specifically the **Timestream** memory store and query costs. The batch ML compute costs for feature engineering, training, and inference remain very low. This highlights the importance of optimizing the database configuration (e.g., memory vs. magnetic retention) and query patterns for the serving APIs to manage costs effectively.
### Challenges and Learnings
### Troubleshooting Guide
1. **SFN Failures:** Check execution history for failed state, input/output, error message.
2. **Job Failures (Processing/Training):** Check CloudWatch Logs for the specific job run. Look for Python errors, resource exhaustion, S3 access issues.
3. **Lambda Failures:** Check CloudWatch Logs. Verify IAM permissions, input payload structure, environment variables, timeouts, memory limits. Check DLQ if configured.
4. **Forecast Accuracy Issues:**
* Verify quality/availability of input weather forecasts.
* Check feature engineering logic for errors or skew vs. training.
* Analyze residuals from the model evaluation step.
* Check if model drift has occurred (compare recent performance to registry metrics). Trigger retraining if needed.
* Ensure correct model version was loaded by inference pipeline.
5. **Data Loading Issues (Timestream/RDS):** Check `LoadForecastsToDB` Lambda logs for database connection errors, write throttling, data type mismatches, constraint violations. Check DB metrics.
6. **API Serving Issues:** Check API Gateway logs and metrics. Check serving Lambda logs. Verify DB connectivity and query performance.
### Security Considerations
* Apply IAM least privilege to all roles.
* Encrypt data at rest (S3, Timestream/RDS, EBS) and in transit (TLS).
* Use Secrets Manager for any API keys (e.g., weather provider).
* Secure API Gateway endpoints (Authentication - Cognito/IAM/API Keys, Authorization, Throttling).
* Perform regular vulnerability scans on the EDF container image.
* Consider VPC deployment and endpoints for enhanced network security.
### Roadmap & Future Enhancements
* Implement probabilistic forecasting (prediction intervals).
* Incorporate more granular data (e.g., appliance recognition, improved occupancy detection) if available.
* Explore more advanced forecasting models (LSTM, TFT) and benchmark rigorously.
* Implement automated retraining triggers based on monitored forecast accuracy drift.
* Develop more sophisticated XAI for forecasting (feature importance).
* Add A/B testing framework for forecast models.
* Integrate forecasts with building control systems (e.g., HVAC pre-cooling based on forecast).
### Appendices
#### Configuration File Example
```yaml
# config/edf_config.yaml
# --- General Settings ---
project_name: "ml"
aws_region: "eu-central-1"
# env_suffix will likely be passed dynamically or set per deployment environment
# --- Data Paths (Templates - Execution specific paths often constructed) ---
# Base paths defined here, execution IDs/dates appended by scripts/workflows
s3_processed_edf_path: "s3://{processed_bucket}/processed_edf_data/"
s3_raw_weather_fcst_path: "s3://{raw_bucket}/edf-inputs/weather-forecast/"
s3_raw_calendar_path: "s3://{raw_bucket}/edf-inputs/calendar-topology/"
s3_feature_output_base: "s3://{processed_bucket}/features/edf/{workflow_type}/{sfn_name}/{exec_id}/" # workflow_type=training/inference
s3_model_artifact_base: "s3://{processed_bucket}/model-artifacts/{sfn_name}/{exec_id}/"
s3_eval_report_base: "s3://{processed_bucket}/evaluation-output/{sfn_name}/{exec_id}/"
s3_forecast_output_base: "s3://{processed_bucket}/forecast-output/{sfn_name}/{exec_id}/"
# --- AWS Resource Names (Base names - suffix added in Terraform locals) ---
scripts_bucket_base: "ml-glue-scripts" # Base name for script bucket
processed_bucket_base: "ml-processed-data" # Base name for processed bucket
raw_bucket_base: "ml-raw-data" # Base name for raw bucket
edf_feature_group_name_base: "edf-building-features"
ecr_repo_name_edf_base: "edf-training-container"
edf_model_package_group_name_base: "EDFBuildingDemandForecaster"
lambda_register_edf_func_base: "RegisterEDFModelFunction"
lambda_load_forecasts_func_base: "LoadEDFResultsLambda"
lambda_get_model_func_base: "GetApprovedModelLambda" # Shared Lambda
lambda_create_sm_model_func_base: "CreateSageMakerModelLambda" # Shared Lambda
edf_training_sfn_base: "EDFTrainingWorkflow"
edf_inference_sfn_base: "EDFInferenceWorkflow"
edf_scheduler_base: "DailyEDFInferenceTrigger"
forecast_db_base: "EDFDatabase" # Timestream DB base name
forecast_table_name: "BuildingDemandForecasts" # Timestream table name
# --- Feature Engineering (Common & EDF Specific) ---
common_feature_eng:
lookback_days_default: 14 # Default days history needed
edf_feature_eng:
target_column: "consumption_kwh"
timestamp_column: "timestamp_hour"
building_id_column: "building_id"
time_features: # Features derived from timestamp
- "hour_of_day"
- "day_of_week"
- "day_of_month"
- "month_of_year"
- "is_weekend" # Example custom flag
lag_features: # Lag values in hours
consumption_kwh: [24, 48, 168] # 1d, 2d, 1wk
temperature_c: [24, 48, 168]
solar_irradiance_ghi: [24]
rolling_window_features: # Window size in hours, aggregations
consumption_kwh:
windows: [3, 24, 168]
aggs: ["avg", "stddev", "min", "max"]
temperature_c:
windows: [24]
aggs: ["avg"]
imputation_value: 0.0 # Value used for fillna after lags/windows
# --- Training Workflow ---
edf_training:
default_strategy: "Prophet" # Model strategy to use if not specified
instance_type: "ml.m5.xlarge" # Larger instance for potentially heavier training
instance_count: 1
max_runtime_seconds: 7200 # 2 hours
# Base hyperparameters (can be overridden by execution input)
hyperparameters:
Prophet:
prophet_changepoint_prior_scale: 0.05
prophet_seasonality_prior_scale: 10.0
prophet_holidays_prior_scale: 10.0
prophet_daily_seasonality: True
prophet_weekly_seasonality: True
prophet_yearly_seasonality: 'auto'
# prophet_regressors: ["temperature_c", "is_holiday_flag"] # Example if using regressors
XGBoost:
xgb_eta: 0.1
xgb_max_depth: 5
xgb_num_boost_round: 150
xgb_subsample: 0.7
xgb_colsample_bytree: 0.7
# feature_columns must align with feature_engineering output for XGBoost
feature_columns_string: "temperature_c,solar_irradiance_ghi,humidity,is_holiday_flag,hour_of_day,day_of_week,day_of_month,month_of_year,consumption_lag_24h,consumption_lag_168h,consumption_roll_avg_24h"
# --- Evaluation Workflow ---
edf_evaluation:
instance_type: "ml.m5.large"
instance_count: 1
# Metrics thresholds for the 'CheckEvaluation' choice state
metrics_thresholds:
max_mape: 20.0 # Example: Fail if MAPE > 20%
max_rmse: 5.0 # Example: Fail if RMSE > 5 kWh (adjust based on typical consumption)
# Optional: Path to historical labelled data for backtesting
# historical_labels_path: "s3://..."
# --- Inference Workflow ---
edf_inference:
scheduler_expression: "cron(0 5 * * ? *)" # 5 AM UTC Daily
scheduler_timezone: "UTC"
forecast_horizon_hours: 72
# Processing job instance types (can override training defaults if needed)
feature_eng_instance_type: "ml.m5.large"
feature_eng_instance_count: 1
forecast_gen_instance_type: "ml.m5.large" # Needs forecasting libs installed
forecast_gen_instance_count: 1
# Target DB Config
forecast_db_type: "TIMESTREAM" # TIMESTREAM | RDS | DYNAMODB
# Lambda Config
load_forecasts_lambda_memory: 512 # MB
load_forecasts_lambda_timeout: 300 # seconds
# --- Common Lambda Config ---
# Assuming shared Lambdas from AD are used
lambda_shared:
get_model_memory: 128
get_model_timeout: 30
create_sm_model_memory: 128
create_sm_model_timeout: 60
```
#### Data Schemas
This appendix provides the formal schema definitions for the primary data entities used across the Anomaly Detection and Energy Demand Forecasting workflows.
#### 1. Raw Meter Data
This represents the logical structure of data as it arrives from the central database into the S3 Raw Zone for processing. It's often in a semi-structured format like JSON Lines.
| Field Name | Data Type | Description |
| :--- | :--- | :--- |
| `timestamp_str` | String | ISO 8601 formatted timestamp (e.g., "2024-10-27T10:30:05Z") of when the readings were recorded by the tablet. |
| `building_id` | String | Unique identifier for the building (e.g., "bldg_A123"). |
| `apartment_id` | String | Unique identifier for the apartment (e.g., "apt_404"). |
| `readings` | Array[Object] | An array of sensor reading objects from the apartment. |
| `readings.sensor_type` | String | The type of measurement (e.g., `heating_energy_kwh`, `room_temp_c`). |
| `readings.value` | Double/Int | The numerical value of the sensor reading. |
| `readings.room_name` | String | (Optional) The specific room for the reading, if applicable (e.g., "living_room"). |
**Example (JSON Lines):**
```json
{"timestamp_str": "2024-10-27T10:30:00Z", "building_id": "bldg_A123", "apartment_id": "apt_404", "readings": [{"sensor_type": "heating_energy_kwh", "value": 15432.7}, {"sensor_type": "hot_water_litres", "value": 89541.2}, {"sensor_type": "room_temp_c", "room_name": "living_room", "value": 21.5}, {"sensor_type": "setpoint_temp_c", "room_name": "living_room", "value": 22.0}]}
```
---
#### 2. Processed Meter Data (for Anomaly Detection)
This is the output of the initial Ingestion Glue ETL job, stored in the S3 Processed Zone. It's a flattened, structured table optimized for analytical queries and as the source for feature engineering.
**Format:** Apache Parquet
**Partitioned by:** `year`, `month`, `day`
| Column Name | Data Type | Description |
| :--- | :--- | :--- |
| `apartment_id` | String | Unique identifier for the apartment. |
| `building_id` | String | Unique identifier for the building. |
| `event_ts` | Timestamp | The timestamp of the reading, cast to a proper timestamp type. |
| `heating_energy_kwh` | Double | The cumulative heating energy consumption in kilowatt-hours. |
| `hot_water_litres` | Double | The cumulative hot water consumption in litres. |
| `room_temp_c` | Double | The measured temperature in Celsius for a specific room (or average). |
| `setpoint_temp_c` | Double | The user-defined target temperature in Celsius for a specific room. |
| `outdoor_temp_c` | Double | The outdoor temperature at the time of the reading, joined from weather data. |
| **year** | Integer | **Partition Key:** Year derived from `event_ts`. |
| **month** | Integer | **Partition Key:** Month derived from `event_ts`. |
| **day** | Integer | **Partition Key:** Day derived from `event_ts`. |
---
#### 3. Weather Data
**3.1 Raw Weather Forecast Data (from API)**
This is the raw JSON structure ingested from the external weather forecast API into the S3 Raw Zone.
| Field Name | Data Type | Description |
| :--- | :--- | :--- |
| `latitude` | Double | Latitude of the forecast location. |
| `longitude` | Double | Longitude of the forecast location. |
| `generationtime_ms`| Double | Time taken by the API to generate the forecast. |
| `utc_offset_seconds`| Integer | UTC offset for the location. |
| `hourly` | Object | An object containing arrays of hourly forecast values. |
| `hourly.time` | Array[String] | Array of ISO 8601 timestamps for the forecast horizon. |
| `hourly.temperature_2m`| Array[Double] | Array of corresponding forecasted temperatures (°C). |
| `hourly.cloudcover` | Array[Integer] | Array of corresponding forecasted cloud cover (%). |
| `hourly.shortwave_radiation` | Array[Double]| Array of corresponding forecasted solar irradiance (W/m²). |
**3.2 Processed Weather Data (Joined in `processed_edf_data`)**
This represents the clean, hourly weather data after being processed and joined to the consumption data.
| Column Name | Data Type | Description |
| :--- | :--- | :--- |
| `building_id` | String | Unique identifier for the building the weather corresponds to. |
| `timestamp_hour` | Timestamp | The specific hour for which the weather data is valid, truncated. |
| `temperature_c` | Double | The average temperature in Celsius for that hour. |
| `humidity` | Double | The average relative humidity (%) for that hour. |
| `solar_irradiance_ghi`| Double | The average Global Horizontal Irradiance (W/m²) for that hour. |
| `is_holiday_flag` | Integer | A binary flag (1 or 0) indicating if the date is a public holiday. |
---
#### 4. Feature Store Features (Anomaly Detection)
This defines the schema of the `ad-apartment-features` Feature Group in SageMaker Feature Store. These are the inputs to the AD model.
| Feature Name | Data Type | Description |
| :--- | :--- | :--- |
| **apartment_record_id** | String | **Record Identifier:** Unique ID for the record (e.g., `[apartment_id]_[date]`). |
| **event_time** | Fractional | **Event Time:** Timestamp when the features were computed/ingested. |
| `event_date` | String | The specific date (YYYY-MM-DD) these daily features correspond to. |
| `building_id` | String | Identifier for the building. |
| `avg_temp_diff` | Fractional | The average difference between the setpoint and actual room temperature for the day. |
| `daily_energy_kwh` | Fractional | The total heating energy consumed on that day. |
| `hdd` | Fractional | Heating Degree Days, a measure of how cold the day was. |
| `energy_lag_1d` | Fractional | The value of `daily_energy_kwh` from the previous day. |
| `energy_roll_avg_7d` | Fractional | The 7-day rolling average of `daily_energy_kwh`. |
| `temp_diff_roll_std_3d` | Fractional | The 3-day rolling standard deviation of `avg_temp_diff`. |
---
#### 5. Alert Table Schema (DynamoDB)
This defines the structure of the `ad-alerts` table in Amazon DynamoDB, where the inference pipeline stores actionable alerts.
**Table Name:** `hometech-ml-ad-alerts-[env]`
| Attribute Name | Data Type | Key Type / Index | Description |
| :--- | :--- | :--- | :--- |
| **AlertID** | String (S) | **Partition Key (PK)** | Unique identifier for the alert. **Format:** `[ApartmentID]#[EventDate]`. |
| `ApartmentID` | String (S) | GSI-1 Partition Key | The unique identifier of the apartment that triggered the alert. |
| `BuildingID` | String (S) | GSI-2 Partition Key | The unique identifier of the building. |
| `EventDate` | String (S) | - | The date (YYYY-MM-DD) for which the anomaly was detected. |
| `AlertTimestamp` | String (S) | GSI-2 Sort Key | ISO 8601 timestamp of when the alert was created by the pipeline. |
| `AnomalyScore` | Number (N) | - | The raw numerical score from the ML model. Higher means more anomalous. |
| `Threshold` | Number (N) | - | The score threshold that was breached to create this alert. |
| **Status** | String (S) | GSI-1 Sort Key | The current state of the alert. **Values:** `Unseen`, `Investigating`, `Resolved-True Positive`, `Resolved-False Positive`. |
| `ModelVersion` | String (S) | - | Version of the model package that generated the score (for lineage). |
| `FeedbackNotes` | String (S) | - | (Optional) Notes entered by the maintenance technician during review. |
**Global Secondary Indexes (GSIs):**
* **GSI-1 (`ApartmentStatusIndex`):** Allows efficiently querying for alerts of a specific `Status` within a given `ApartmentID`.
* **Partition Key:** `ApartmentID`
* **Sort Key:** `Status`
* **GSI-2 (`BuildingAlertsIndex`):** Allows efficiently querying for all alerts in a `BuildingID`, sorted by time.
* **Partition Key:** `BuildingID`
* **Sort Key:** `AlertTimestamp`