Feast Concepts
Understanding these fundamental concepts is key to effectively using Feast for managing the lifecycle of your machine learning features.
1. Project: Your Isolated Feature Universe
Definition: The top-level namespace in Feast. A project provides complete isolation for a feature store at the infrastructure level. This is achieved by namespacing resources, such as prefixing table names in the online/offline store with the project name.
Scope: Each project is a distinct universe of entities, features, and data sources. You cannot retrieve features from multiple projects in a single request.
Best Practice: It’s recommended to have a single Feast project per environment (e.g., `dev`, `staging`, `prod`).
Benefits:
Logical Grouping: Organizes related features, views, and services.
Isolation: Prevents interference between different environments or large-scale initiatives.
Collaboration: Defines clear boundaries for teams.
Access Control: Can be a basis for permissioning (though Feast’s `Permission` objects offer more granularity).
2. Data Source & Data Ingestion: Connecting to Your Raw Data
Feast doesn’t manage the raw underlying data itself; instead, it defines how to connect to and interpret this data.
Data Source:
Definition: Represents the raw data systems where your feature data originates (e.g., a table in BigQuery, files in S3, a Kafka topic).
Data Model: Feast uses a time-series data model, expecting feature data to have timestamps indicating when the feature value was observed or generated.
Types:
Batch Data Sources: Typically data warehouses (BigQuery, Snowflake, Redshift) or data lakes (S3, GCS). Feast ingests from these for online serving and queries them for historical retrieval.
Stream Data Sources:
Push Sources: Allow users to directly push feature values into Feast (both offline and online stores). This is a common pattern for real-time updates.
[Alpha] Stream Sources: Allow registration of metadata for Kafka or Kinesis topics. Users are responsible for the ingestion pipeline, though Feast provides some helpers.
(Experimental) Request Data Sources: Data that is only available at the moment of a prediction request (e.g., user input from an HTTP request). Primarily used as input for On-Demand Feature Views.
Data Ingestion:
Offline Use (Training/Batch Scoring from Batch Sources): Feast often doesn’t ingest data in the traditional sense. It queries your existing batch data sources directly, leveraging the compute engine of the offline store (e.g., BigQuery’s query engine).
Online Use (Real-time Serving):
Materialization (from Batch Sources): The process of loading feature values from batch sources into the online store. The `materialize_incremental` command fetches the latest values for entities and ingests them. This is typically scheduled (e.g., via Airflow).
For On-Demand Feature Views with `write_to_online_store=True`, the `transform_on_write` parameter controls whether transformations are applied during this materialization (set to `False` to materialize pre-transformed features).
Pushing (from Stream Sources): Streaming data can be pushed into the online store (and optionally the offline store) via Push Sources or custom stream processing jobs (e.g., using the contrib Spark processor for Kafka/Kinesis).
Schema Inference: If a schema isn’t explicitly defined for a batch data source, Feast attempts to infer it during `feast apply` by inspecting the source table or running a `LIMIT` query.
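The source and ingestion paths above can be tied together in a short sketch. File paths and names here are hypothetical; the APIs shown are Feast’s `FileSource`, `PushSource`, and `FeatureStore.materialize_incremental`:

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore, FileSource, PushSource

# Batch source: a parquet file with an event timestamp column (hypothetical path).
driver_stats_batch = FileSource(
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)

# Push source: lets you push fresh feature values directly, while the batch
# source backs historical retrieval for the same features.
driver_stats_push = PushSource(
    name="driver_stats_push",
    batch_source=driver_stats_batch,
)

# After `feast apply`, in a scheduled job or application code:
store = FeatureStore(repo_path=".")

# Load the latest batch values into the online store (typically scheduled).
store.materialize_incremental(end_date=datetime.utcnow())

# Push a real-time update; the dataframe must match the source schema.
event_df = pd.DataFrame(
    {
        "driver_id": [1001],
        "event_timestamp": [datetime.utcnow()],
        "conv_rate": [0.85],
    }
)
store.push("driver_stats_push", event_df)
```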
3. Entity: The “Subject” of Your Features
Definition: An entity represents a core business object or concept to which features are related (e.g., `customer`, `driver`, `product`). It’s defined with a unique `name` and one or more `join_keys` (the primary keys used to link feature values):

```python
driver = Entity(name='driver', join_keys=['driver_id'])
```
Usage:
Defining & Storing Features:
Feature Views (see below) are associated with zero or more entities. This collection of entities for a feature view is its entity key.
Examples:
Zero entities: `num_daily_global_transactions` (a global feature).
One entity: `user_age` (associated with a `user` entity).
Multiple entities (composite key): `num_user_purchases_in_merchant_category` (associated with `user` and `merchant_category` entities).
Reusing entity definitions across feature views is crucial for discoverability and consistency.
Retrieving Features:
Training Time: Users provide a list of entity keys + timestamps to fetch point-in-time correct features.
Serving Time: Users provide entity key(s) to fetch the latest feature values.
Retrieving All Entities:
Feast supports generating features for a SQL-backed list of entities for batch scoring.
For real-time retrieval, fetching all entities is not an out-of-the-box feature to prevent slow and expensive scan operations on data sources.
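The composite-key case above can be sketched concretely; entity and key names here are hypothetical:

```python
from feast import Entity

# Each entity carries the join key(s) used to link feature values to it.
user = Entity(name="user", join_keys=["user_id"])
merchant_category = Entity(name="merchant_category", join_keys=["merchant_category_id"])

# A feature view declared with entities=[user, merchant_category] is keyed by
# the composite (user_id, merchant_category_id), while a view declared with
# entities=[] holds global features such as num_daily_global_transactions.
```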
4. Feature View: Organizing and Defining Groups of Features
A Feature View is a central concept for declaring and managing features.
Definition: A logical collection of features, typically sourced from a single data source and often associated with one or more entities. It defines how Feast should interpret and access these features.
Online: A stateful collection read via `get_online_features`.
Offline: A stateless collection created via `get_historical_features`.
Key Components:
`name`: Unique identifier within the project.
`entities`: A list of `Entity` objects this view is associated with (can be empty for global features).
`schema`: A list of `Field` objects defining the features in this view (name and data type). Optional, but highly recommended; if omitted, Feast infers it.
`source`: The `DataSource` (batch, stream, or request) from which these features originate.
`ttl` (Time-To-Live): Optional; limits how far back Feast looks for feature values during historical retrieval and can influence online store retention.
`tags`: Optional metadata (e.g., `{'owner': 'fraud_team'}`).
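A minimal Feature View pulling the components above together might look like this (entity, source path, and feature names are hypothetical):

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

driver = Entity(name="driver", join_keys=["driver_id"])

driver_stats_source = FileSource(
    path="data/driver_stats.parquet",  # hypothetical path
    timestamp_field="event_timestamp",
)

driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",          # unique within the project
    entities=[driver],                   # [] would make these global features
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=driver_stats_source,
    ttl=timedelta(days=1),               # bounds historical lookback
    tags={"owner": "fraud_team"},
)
```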
Important Note: Feature views require timestamped data. A workaround for non-timestamped data is to insert dummy timestamps.
Usage:
Generating training datasets.
Defining the schema for loading features into the online store.
Providing schema for retrieving features from the online store.
Feature Inferencing: If `schema` is not provided, Feast infers features from the data source columns (excluding entity join keys and timestamp columns).
Entity Aliasing: Allows joining an `entity_dataframe` (used in `get_historical_features`) to a Feature View when the column names in the `entity_dataframe` don’t match the Feature View’s entity `join_keys`. This is done dynamically using `.with_name("new_fv_name").with_join_key_map({"feature_view_join_key": "entity_df_column_name"})`.
Useful when you don’t control source column names or have multiple specialized entities that are subtypes of a general entity (e.g., “origin_location” and “destination_location” both aliasing a “location” entity).
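A sketch of the aliasing pattern for the location example, using the `with_name` and `with_join_key_map` methods described above (feature view and column names are hypothetical):

```python
# location_stats_fv is a Feature View whose entity has join key "location_id".
origin_stats = location_stats_fv.with_name("origin_stats").with_join_key_map(
    {"location_id": "origin_id"}
)
destination_stats = location_stats_fv.with_name("destination_stats").with_join_key_map(
    {"location_id": "destination_id"}
)

# The entity dataframe can now carry origin_id and destination_id columns, and
# both aliased views join point-in-time correctly against it, e.g. when grouped
# into a FeatureService built from [origin_stats, destination_stats].
```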
Field (Feature):
Definition: An individual, measurable property or characteristic, typically observed on an entity. Defined with a `name` and `dtype` (e.g., `Float32`, `Int64`).
Fields are defined within a Feature View’s `schema`.
Feature names must be unique within a Feature View.
Can have `tags` for additional metadata.
Types of Feature Views:
Standard Feature View: The most common type, typically backed by a batch data source.
[Alpha] On-Demand Feature View (`on_demand_feature_view`):
Allows defining new features by applying Python transformations to:
Existing features from other Feature Views.
Request-time data (via `RequestSource`).
Transformations are executed as Python code (often Pandas DataFrames) during both historical and online retrieval.
Scalability: Fine for online serving (small data). For historical retrieval on large datasets, local Python execution might not scale well.
Use Case: Rapid iteration by data scientists, light-weight transformations, combining diverse data sources at request time.
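A small sketch combining an existing feature with request-time data; the request field, feature names, and `driver_hourly_stats` view are hypothetical (the latter would be defined elsewhere in the repo):

```python
import pandas as pd
from feast import Field, RequestSource
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float64

# Request-time input: data only available at prediction time.
trip_request = RequestSource(
    name="trip_request",
    schema=[Field(name="trip_distance_km", dtype=Float64)],
)

@on_demand_feature_view(
    sources=[driver_hourly_stats, trip_request],  # existing FV + request source
    schema=[Field(name="cost_estimate", dtype=Float64)],
)
def trip_cost_features(inputs: pd.DataFrame) -> pd.DataFrame:
    # Runs as Python/Pandas during both historical and online retrieval.
    df = pd.DataFrame()
    df["cost_estimate"] = inputs["conv_rate"] * inputs["trip_distance_km"]
    return df
```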
[Alpha] Stream Feature View (`stream_feature_view`):
Extends a normal Feature View by having both a stream source (e.g., Kafka, Kinesis) and a batch source (for backfills/historical data).
Designed for features that need to be updated with very low latency from streaming events.
Can include transformations (e.g., Spark transformations if `mode="spark"`).
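A rough sketch of the decorator form, assuming a `driver` entity and a Kafka stream source (with a batch source attached for backfills) defined elsewhere in the repo; names and the pass-through transformation are illustrative only:

```python
from datetime import timedelta

from feast import Field
from feast.stream_feature_view import stream_feature_view
from feast.types import Float64

@stream_feature_view(
    entities=[driver],                   # hypothetical entity, defined elsewhere
    ttl=timedelta(days=1),
    mode="spark",                        # transformation executed by Spark
    schema=[Field(name="avg_speed", dtype=Float64)],
    source=driver_stats_stream_source,   # KafkaSource with a batch_source for backfills
    timestamp_field="event_timestamp",
)
def driver_speed_stream(df):
    # Spark DataFrame transformation applied to incoming stream events;
    # a real view would aggregate or filter here.
    return df
```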
5. Feature Retrieval: Accessing Your Features
Feast provides APIs to get feature values for different ML lifecycle stages.
Core APIs:
`feature_store.get_historical_features(...)`: For training data generation and offline batch scoring. Performs point-in-time correct joins.
`feature_store.get_online_features(...)`: For real-time model predictions from the online store.
Feature Server Endpoints (e.g., `POST /get-online-features`): For language-agnostic online feature retrieval.
Key Inputs for Retrieval:
Feature Specification:
Feature Service (Recommended for production): A logical group of features (potentially from multiple Feature Views) required by a specific model or model version. You define it once and reference it by name.
```python
driver_stats_fs = FeatureService(
    name="driver_activity_v1",
    features=[driver_stats_fv, driver_ratings_fv[["lifetime_rating"]]],
)
features = store.get_online_features(features=driver_stats_fs, ...)
```
Feature References (Good for experimentation): A list of strings in the format `<feature_view_name>:<feature_name>`.

```python
features = store.get_online_features(features=["driver_hourly_stats:conv_rate"], ...)
```
Entity Specification:
For `get_historical_features`: An “entity dataframe” (Pandas DataFrame or SQL query) containing entity join key values and event timestamps for point-in-time correctness.
For `get_online_features`: A list of `entity_rows` (dictionaries of entity join key values). No timestamps needed as it fetches the latest values.
Event Timestamp: The timestamp recorded in your data source indicating when a feature event occurred. Crucial for point-in-time joins.
Dataset (in Retrieval Context): The output of `get_historical_features`. It’s a table (e.g., Pandas DataFrame) containing the requested features joined onto the input entity dataframe.
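The two retrieval paths can be sketched side by side; the feature view, entity keys, and timestamps are hypothetical:

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Training: entity join keys + event timestamps drive point-in-time joins.
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": [datetime(2023, 1, 15, 10), datetime(2023, 1, 15, 11)],
    }
)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],
).to_df()

# Serving: entity keys only; the latest values are returned, no timestamps needed.
online_features = store.get_online_features(
    features=["driver_hourly_stats:conv_rate"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
```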
6. Point-in-Time Joins: Ensuring Temporal Correctness
This is a critical capability of Feast for generating historically accurate training data, preventing data leakage.
How it Works: When you call `get_historical_features`, Feast uses the `event_timestamp` column in your entity dataframe. For each row in this dataframe, it looks up feature values from the specified Feature Views that were valid at or before that row’s `event_timestamp`, but not after.
TTL (Time-To-Live) Role: The `ttl` defined on a Feature View limits how far back in time Feast will search for a feature value from the given `event_timestamp`. If a feature value is older than the `ttl` relative to the `event_timestamp`, it won’t be joined.
Example: If your entity dataframe has an event at `2023-01-15 10:00:00` and a Feature View has a `ttl` of 2 hours, Feast will look for feature values for that entity between `2023-01-15 08:00:00` and `2023-01-15 10:00:00`.
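The join window in the example above is mechanical: a feature row is eligible only if its timestamp falls within `[event_timestamp - ttl, event_timestamp]`. A minimal stdlib sketch of that predicate (boundary handling may differ slightly across Feast stores):

```python
from datetime import datetime, timedelta

def in_ttl_window(feature_ts: datetime, event_ts: datetime, ttl: timedelta) -> bool:
    """True if a feature value observed at feature_ts may be joined to an
    entity row with the given event_ts under the feature view's ttl."""
    return event_ts - ttl <= feature_ts <= event_ts

event = datetime(2023, 1, 15, 10, 0, 0)
ttl = timedelta(hours=2)

in_ttl_window(datetime(2023, 1, 15, 9, 30), event, ttl)   # within the 2h window -> True
in_ttl_window(datetime(2023, 1, 15, 7, 59), event, ttl)   # older than ttl -> False
in_ttl_window(datetime(2023, 1, 15, 10, 5), event, ttl)   # after the event -> False
```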
7. [Alpha] Saved Dataset: Persisting Feature Sets
Purpose: Allows you to save the output of `get_historical_features` (a feature dataset) for later use, such as model training, analysis, or data quality monitoring.
Storage:
Metadata about the Saved Dataset is stored in the Feast registry.
The actual raw data (features, entities, timestamps) is stored in your configured offline store (e.g., a new table in BigQuery).
Creation:
Call `store.get_historical_features(...)` to get a retrieval job.
Pass this job to `store.create_saved_dataset(from_=historical_job, name="my_dataset", storage=...)`. This triggers the job execution and persists the data.
Planned Creation Methods: Logging request/response data during online serving or features during writes to the online store.
Retrieval:
`dataset = store.get_saved_dataset('my_dataset_name')`, then `dataset.to_df()`.
8. Permission: Securing Your Feature Store
Feast provides a model for configuring granular access policies to its resources.
Scope: Permissions are defined and stored in the Feast registry.
Enforcement: Performed by Feast servers (online feature server, offline feature server, registry server) when requests are made through them. No enforcement when using a local provider directly with the SDK.
Core Components:
`Resource`: The Feast object being secured (e.g., `FeatureView`, `DataSource`, `Project`). Assumed to have a `name` and optional `tags`.
`Action`: The operation being performed (e.g., `CREATE`, `DESCRIBE`, `UPDATE`, `DELETE`, `READ_ONLINE`, `WRITE_OFFLINE`). Aliases like `READ`, `WRITE`, `CRUD` simplify definitions.
`Policy`: The rule for authorization (e.g., `RoleBasedPolicy`, which checks user roles).
`Permission` Object: Defines a single permission rule with attributes:
`name`: Name of the permission.
`types`: List of resource types this permission applies to (e.g., `[FeatureView, FeatureService]`). Aliases like `ALL_RESOURCE_TYPES`.
`name_patterns`: List of regex patterns to match resource names.
`required_tags`: Dictionary of tags that must match the resource’s tags.
`actions`: List of actions authorized by this permission.
`policy`: The policy object to apply.
Important: Resources not matching any configured `Permission` are not secured and are accessible by any user.
Configuration: Defined in the `auth` section of `feature_store.yaml`. Feast supports OIDC and Kubernetes RBAC. If `auth` is unspecified, it defaults to `no_auth` (no enforcement).
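A sketch of a single permission rule using the components above; the rule name, role, tag, and name pattern are hypothetical, and import paths may vary by Feast version:

```python
from feast import FeatureView
from feast.permissions.action import AuthzedAction
from feast.permissions.permission import Permission
from feast.permissions.policy import RoleBasedPolicy

# Only users holding the "fraud-reader" role may read, online, feature views
# whose name matches ".*fraud.*" AND that carry the owner=fraud_team tag.
read_fraud_features = Permission(
    name="read-fraud-features",
    types=[FeatureView],
    name_patterns=[".*fraud.*"],
    required_tags={"owner": "fraud_team"},
    actions=[AuthzedAction.READ_ONLINE],
    policy=RoleBasedPolicy(roles=["fraud-reader"]),
)
```

Remember that this rule is enforced only by the Feast servers; resources it does not match remain open to any user.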