Running Feast with Snowflake / GCP / AWS
This guide outlines the typical workflow for installing, configuring, deploying, and using Feast in cloud environments like Snowflake, GCP, or AWS.
1. Install Feast
Base Installation:
pip install feast
Cloud-Specific Extras: Install additional dependencies based on your chosen cloud provider and online/offline stores.
Snowflake:
pip install 'feast[snowflake]'
GCP (BigQuery, Datastore/Firestore):
pip install 'feast[gcp]'
AWS (Redshift, DynamoDB):
pip install 'feast[aws]'
Redis (e.g., AWS ElastiCache or standalone):
pip install 'feast[redis]'
Purpose: Ensures you have the necessary client libraries to interact with your chosen backend services.
2. Create a Feature Repository
A feature repository is a directory containing your Feast configurations and feature definitions.
Action: Use the feast init command to bootstrap a new repository.
Basic (local defaults):
feast init my_feature_repo    # or plain "feast init" to let Feast generate a project name
Cloud-Specific Templates: Use the -t flag for templates pre-configured for specific providers. These will prompt for the necessary credentials and connection details.
feast init -t snowflake
feast init -t gcp
feast init -t aws
Output: feast init creates:
A project directory (e.g., my_feature_repo/).
feature_store.yaml: the configuration file for your feature store (provider, registry, online/offline stores).
example.py (or similar): a Python file with example feature definitions (Entity, DataSource, FeatureView).
A data/ directory: often contains sample data (e.g., a Parquet file) for the examples.
Next Steps:
cd my_feature_repo
Initialize a Git repository (git init) and commit these files. This is crucial for version control.
Modify example.py with your actual feature definitions (see the sketch below).
Adjust feature_store.yaml to point to your production/staging cloud resources.
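For orientation, here is a rough sketch of what example.py might contain. The driver and driver_hourly_stats names match the examples used later in this guide; the exact classes and arguments vary between Feast versions, so treat this as illustrative rather than canonical.

from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32

# Hypothetical driver entity and feature view matching the names used in this guide.
driver = Entity(name="driver", join_keys=["driver_id"])

driver_stats_source = FileSource(
    path="data/driver_stats.parquet",   # sample Parquet file created by feast init
    timestamp_field="event_timestamp",
)

driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
    ],
    source=driver_stats_source,
)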
3. Deploy a Feature Store
This step registers your feature definitions with Feast and sets up any necessary infrastructure in your cloud environment, based on your feature_store.yaml and your definitions.
Action: Run feast apply from within your feature repository directory.
feast apply
Effect:
Parses your feature definition files (e.g., example.py).
Updates the Feast registry (e.g., creates/updates the registry.pb file in GCS/S3, or writes to a SQL registry).
May create tables or other resources in your configured online/offline stores if they don't exist (behavior depends on the provider and store types).
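The same registration can also be driven from Python (for example, inside a CI script) via the SDK's FeatureStore.apply. A minimal sketch, assuming the objects from the example.py sketch above are importable:

from feast import FeatureStore

# Assumed module/object names from the feature repository.
from example import driver, driver_hourly_stats

fs = FeatureStore(repo_path=".")          # run from within the feature repository
fs.apply([driver, driver_hourly_stats])   # registers the objects, similar to `feast apply`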
Important:
feast apply does not load data into the online store. It only sets up the definitions and metadata.
Cleaning Up (Caution!): feast teardown will attempt to remove infrastructure created by feast apply. This is irreversible and will delete data/tables. Use it with extreme caution, especially in production.
4. Build a Training Dataset (Historical Feature Retrieval)
Feast enables the creation of point-in-time correct datasets for model training.
Prerequisites:
Feature views must be defined and registered (feast apply).
Historical feature data must exist in your offline store (e.g., tables in BigQuery, Snowflake, or files in S3).
Steps:
Define Feature References or Use a Feature Service:
Specify which features you need, typically as a list of strings ("feature_view_name:feature_name") or by referencing a pre-defined FeatureService.

# Using feature references
feature_refs = [
    "driver_hourly_stats:conv_rate",
    "driver_hourly_stats:acc_rate",
]

# Or, using a FeatureService (recommended for production)
# feature_service = fs.get_feature_service("my_model_v1_features")
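For context, a FeatureService such as the my_model_v1_features referenced above is itself a registered object. A minimal sketch of how it might be declared in the feature repository, assuming the driver_hourly_stats view from the earlier example:

from feast import FeatureService

# Assumed module/object name for the FeatureView sketched earlier.
from example import driver_hourly_stats

# Hypothetical feature service grouping the features a model consumes.
my_model_v1_features = FeatureService(
    name="my_model_v1_features",
    features=[driver_hourly_stats],
)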
Create an Entity DataFrame:
This DataFrame tells Feast which entities, and at which points in time, you need features for.
It must contain:
An event_timestamp column (Pandas Timestamps or datetime objects).
Columns for all join keys of the entities involved in the selected features (e.g., driver_id).
Options:
Pandas DataFrame: Create it in your Python script. May require uploading to the offline store for some providers, which can be slow.
import pandas as pd
from datetime import datetime

entity_df = pd.DataFrame({
    "event_timestamp": [pd.Timestamp(datetime.now(), tz="UTC")],
    "driver_id": [1001],
})
SQL Query (String): Provide a SQL query that returns the entity keys and timestamps. This is often more efficient as the data stays within the data warehouse. Only works if all feature views are in the same offline store (e.g., all in BigQuery).
entity_df_sql = "SELECT event_timestamp, driver_id FROM my_project.my_labels_table WHERE ..."
Launch Historical Retrieval:
Instantiate FeatureStore and call get_historical_features().

from feast import FeatureStore

fs = FeatureStore(repo_path="path/to/your/feature_repo/")  # Or programmatic config

training_job = fs.get_historical_features(
    features=feature_refs,  # or features=feature_service
    entity_df=entity_df,    # or entity_df=entity_df_sql
)

training_df = training_job.to_df()  # Convert to a Pandas DataFrame
# training_job.to_remote_storage()  # Or save directly to cloud storage
Outcome: training_df will contain the original columns from your entity_df plus the joined feature values, all point-in-time correct.
5. Load Data into the Online Store (Materialization)
To serve features at low latency for online predictions, you need to load them from your offline store (batch sources) into an online store.
Prerequisites:
Feature views must be defined and registered (feast apply).
An online store must be configured in feature_store.yaml.
Materialization Commands (CLI):
feast materialize <start_date> <end_date>:
Loads the latest feature values within the specified historical time range from batch sources into the online store.
Example: feast materialize 2021-04-07T00:00:00 2021-04-08T00:00:00
You can restrict the run to specific views with --views driver_hourly_stats.
This command is stateless. It's best used with an external scheduler (like Airflow) that manages the time ranges for each run.
feast materialize-incremental <end_date> (alternative):
Loads only new data that has arrived in batch sources up to the specified end_date.
Example: feast materialize-incremental 2021-04-08T00:00:00
This command is stateful. Feast tracks the last materialization timestamp for each feature view in the registry.
On the first run, it materializes from the oldest timestamp in the source up to end_date. Subsequent runs use the previous end_date as the new start time.
Programmatic Materialization (SDK):
store.materialize(start_date, end_date, feature_views=["my_fv"])
store.materialize_incremental(end_date, feature_views=["my_fv"])
Scheduling: Materialization is typically run on a schedule (e.g., daily, hourly) using orchestrators like Airflow.
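As an illustration of such scheduling, a minimal Airflow DAG could wrap the CLI command. The repository path, DAG name, schedule, and date formatting below are placeholder assumptions and will need adjusting to your Airflow version and timezone handling:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical daily job: materialize each day's data interval into the online store.
with DAG(
    dag_id="feast_materialize_daily",
    start_date=datetime(2021, 4, 7),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    materialize = BashOperator(
        task_id="feast_materialize",
        bash_command=(
            "cd /path/to/my_feature_repo && "  # placeholder path to the feature repo
            "feast materialize "
            "{{ data_interval_start.strftime('%Y-%m-%dT%H:%M:%S') }} "
            "{{ data_interval_end.strftime('%Y-%m-%dT%H:%M:%S') }}"
        ),
    )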
6. Read Features from the Online Store (Online Serving)
Once data is in the online store, your models can retrieve the latest feature values for real-time predictions.
Prerequisites:
Features must be materialized into the online store.
Steps (Python SDK):
Define Feature References or Use a Feature Service:
Specify the features needed for prediction. This list often comes from the model training phase and should be packaged with the deployed model.
features_for_prediction = [
    "driver_hourly_stats:conv_rate",
    "driver_hourly_stats:acc_rate",
]

# Or use a FeatureService:
# feature_service = fs.get_feature_service("my_model_v1_features_online")
Provide Entity Rows:
A list of dictionaries, where each dictionary represents an entity (or set of entities for composite keys) for which you need features.
No event_timestamp is needed because you're fetching the latest values.

entity_rows = [
    {"driver_id": 1001},  # For driver 1001
    {"driver_id": 1002},  # For driver 1002
]
Read Online Features:
Instantiate FeatureStore and call get_online_features().

from feast import FeatureStore

fs = FeatureStore(repo_path="path/to/feature/repo/")  # Or programmatic config

online_features_response = fs.get_online_features(
    features=features_for_prediction,  # or features=feature_service
    entity_rows=entity_rows,
)

online_features_dict = online_features_response.to_dict()
# online_features_dict will contain features and their values
Alternative: Use a deployed Feast Feature Server (REST API) for language-agnostic online feature retrieval.
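As a rough sketch of that alternative, a client in any language can call the feature server over HTTP. The URL below assumes a server started locally with feast serve on its default port, and the exact request/response schema should be checked against your Feast version:

import requests

# Assumes a Feast feature server is running, e.g. started with `feast serve` (default port 6566).
response = requests.post(
    "http://localhost:6566/get-online-features",
    json={
        "features": [
            "driver_hourly_stats:conv_rate",
            "driver_hourly_stats:acc_rate",
        ],
        "entities": {"driver_id": [1001, 1002]},
    },
)
print(response.json())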
7. Scaling Feast
As your feature store grows, certain default components might become bottlenecks.
Scaling Feast Registry:
Problem: The default file-based registry can struggle with concurrent writes (e.g., multiple feast apply or materialization jobs) and becomes inefficient because it rewrites the whole file for any change.
Solution: Switch to a SQL-based registry (e.g., backed by PostgreSQL or MySQL). This allows concurrent, transactional, and fine-grained updates.
Scaling Materialization:
Problem: The default in-memory materialization process (local engine) doesn’t scale for large datasets as it runs on a single process.
Solutions (Pluggable Materialization Engines):
Lambda-based engine (AWS): Offloads materialization tasks to AWS Lambda.
Bytewax-based engine: A Kubernetes-native streaming dataflow framework that can be used for scalable materialization.
Snowflake Materialization Engine: If both offline and online are Snowflake.
Custom Engine: Build your own engine using Spark, Ray, Flink, etc., to fit your existing infrastructure.
8. Structuring Feature Repos for Multiple Environments
Managing feature definitions across development, staging, and production environments.
Goal: Test changes in a non-production (staging) environment before promoting to production.
Common Approaches:
Different Git Branches:
Maintain long-lived branches (e.g., staging and main/production).
Changes are made to staging, tested, and then manually merged/copied to main.
CI/CD applies changes from the respective branches to their corresponding environments.
Separate feature_store.yaml and Separate Feast Object Definitions:
Directory structure:

├── staging/
│   ├── driver_features.py
│   └── feature_store.yaml      # Configured for staging resources
├── production/
│   ├── driver_features.py      # Potentially a copy from staging
│   └── feature_store.yaml      # Configured for production resources
└── .github/workflows/          # CI/CD pipelines for staging & prod

Changes are developed in staging/, tested, then copied to production/.
CI/CD uses the feature_store.yaml from the respective directory.
Features can be organized into Python packages within each environment directory for better structure (e.g., production/common/entities.py, production/ranking_model/views.py).
Shared Feast Object Definitions with Separate feature_store.yaml Files:
Directory structure:

├── staging/
│   └── feature_store.yaml      # Configured for staging
├── production/
│   └── feature_store.yaml      # Configured for production
├── driver_features.py          # Shared feature definitions
└── .github/workflows/

CI/CD specifies which feature_store.yaml to use with the -f flag:

feast -f staging/feature_store.yaml apply
feast -f production/feature_store.yaml apply
Advantage: Avoids code duplication for feature definitions, reducing copy-paste errors.
CI/CD Role: CI systems (e.g., GitHub Actions) are crucial. They run feast apply using the appropriate environment's configuration, updating the shared registry (e.g., a registry.db file on GCS/S3, or a SQL database) and configuring infrastructure. The CI system needs write access to production infrastructure, while client SDKs (for training/serving) typically only need read access.
By following these steps and considering the scaling and structuring strategies, teams can effectively deploy and manage Feast in production environments using Snowflake, GCP, AWS, or other cloud platforms.