# Feast Concepts

Understanding these fundamental concepts is key to effectively using Feast for managing the lifecycle of your machine learning features.

### 1. Project: Your Isolated Feature Universe

* **Definition:** The top-level namespace in Feast. A project provides complete isolation for a feature store at the infrastructure level. This is achieved by namespacing resources, such as prefixing table names in the online/offline store with the project name.
* **Scope:** Each project is a distinct universe of entities, features, and data sources. You cannot retrieve features from multiple projects in a single request.
* **Best Practice:** It's recommended to have a single Feast project per environment (e.g., `dev`, `staging`, `prod`).
* **Benefits:**
    * **Logical Grouping:** Organizes related features, views, and services.
    * **Isolation:** Prevents interference between different environments or large-scale initiatives.
    * **Collaboration:** Defines clear boundaries for teams.
    * **Access Control:** Can be a basis for permissioning (though Feast's `Permission` objects offer more granularity).

### 2. Data Source & Data Ingestion: Connecting to Your Raw Data

Feast doesn't manage the raw underlying data itself; instead, it defines how to connect to and interpret this data.

* **Data Source:**
    * **Definition:** Represents the raw data systems where your feature data originates (e.g., a table in BigQuery, files in S3, a Kafka topic).
    * **Data Model:** Feast uses a time-series data model, expecting feature data to have timestamps indicating when the feature value was observed or generated.
    * **Types:**
        1. **Batch Data Sources:** Typically data warehouses (BigQuery, Snowflake, Redshift) or data lakes (S3, GCS). Feast ingests from these for online serving and queries them for historical retrieval.
        2. **Stream Data Sources:**
            * **Push Sources:** Allow users to directly push feature values into Feast (both offline and online stores). This is a common pattern for real-time updates.
            * **[Alpha] Stream Sources:** Allow registration of metadata for Kafka or Kinesis topics. Users are responsible for the ingestion pipeline, though Feast provides some helpers.
        3. **(Experimental) Request Data Sources:** Data that is only available at the moment of a prediction request (e.g., user input from an HTTP request). Primarily used as input for On-Demand Feature Views.
* **Data Ingestion:**
    * **Offline Use (Training/Batch Scoring from Batch Sources):** Feast often doesn't *ingest* data in the traditional sense. It queries your existing batch data sources directly, leveraging the compute engine of the offline store (e.g., BigQuery's query engine).
    * **Online Use (Real-time Serving):**
        * **Materialization (from Batch Sources):** The process of loading feature values from batch sources into the online store. The `materialize_incremental` method (`feast materialize-incremental` on the CLI) fetches the *latest* values for entities and ingests them. This is typically scheduled (e.g., via Airflow).
            * For On-Demand Feature Views with `write_to_online_store=True`, the `transform_on_write` parameter controls whether transformations are applied during this materialization (set it to `False` to materialize pre-transformed features).
        * **Pushing (from Stream Sources):** Streaming data can be pushed into the online store (and optionally the offline store) via Push Sources or custom stream processing jobs (e.g., using the contrib Spark processor for Kafka/Kinesis).
* **Schema Inference:** If a schema isn't explicitly defined for a batch data source, Feast attempts to infer it during `feast apply` by inspecting the source table or running a `LIMIT` query.

### 3. Entity: The "Subject" of Your Features

* **Definition:** An entity represents a core business object or concept to which features are related (e.g., `customer`, `driver`, `product`). It's defined with a unique `name` and one or more `join_keys` (the primary keys used to link feature values).
    ```python
    from feast import Entity
    driver = Entity(name='driver', join_keys=['driver_id'])
    ```
* **Usage:**
    1. **Defining & Storing Features:**
        * Feature Views (see below) are associated with zero or more entities. This collection of entities for a feature view is its **entity key**.
        * Examples:
            * Zero entities: `num_daily_global_transactions` (a global feature).
            * One entity: `user_age` (associated with a `user` entity).
            * Multiple entities (composite key): `num_user_purchases_in_merchant_category` (associated with `user` and `merchant_category` entities).
        * Reusing entity definitions across feature views is crucial for discoverability and consistency.
    2. **Retrieving Features:**
        * **Training Time:** Users provide a list of _entity keys + timestamps_ to fetch point-in-time correct features.
        * **Serving Time:** Users provide _entity key(s)_ to fetch the latest feature values.
* **Retrieving All Entities:**
    * Feast supports generating features for a SQL-backed list of entities for *batch scoring*.
    * For *real-time retrieval*, fetching all entities is not an out-of-the-box feature to prevent slow and expensive scan operations on data sources.

### 4. Feature View: Organizing and Defining Groups of Features

A Feature View is a central concept for declaring and managing features.

* **Definition:** A logical collection of features, typically sourced from a single data source and often associated with one or more entities. It defines how Feast should interpret and access these features.
    * **Online:** A stateful collection read via `get_online_features`.
    * **Offline:** A stateless collection created via `get_historical_features`.
* **Key Components:**
    * `name`: Unique identifier within the project.
    * `entities`: A list of `Entity` objects this view is associated with (can be empty for global features).
    * `schema`: A list of `Field` objects defining the features in this view (name and data type). Optional, but highly recommended; if omitted, Feast infers it.
    * `source`: The `DataSource` (batch, stream, or request) from which these features originate.
    * `ttl` (Time-To-Live): Optional; limits how far back Feast looks for feature values during historical retrieval and can influence online store retention.
    * `tags`: Optional metadata (e.g., `{'owner': 'fraud_team'}`).
* **Important Note:** Feature views require timestamped data. A workaround for non-timestamped data is to insert dummy timestamps.
* **Usage:**
    * Generating training datasets.
    * Defining the schema for loading features into the online store.
    * Providing schema for retrieving features from the online store.
* **Feature Inferencing:** If `schema` is not provided, Feast infers features from the data source columns (excluding entity join keys and timestamp columns).
* **Entity Aliasing:** Allows joining an `entity_dataframe` (used in `get_historical_features`) to a Feature View when the column names in the `entity_dataframe` don't match the Feature View's entity `join_keys`. This is done dynamically using `.with_name("new_fv_name").with_join_key_map({"feature_view_join_key": "entity_df_column_name"})`.
    * Useful when you don't control source column names or have multiple specialized entities that are subtypes of a general entity (e.g., "origin_location" and "destination_location" both aliasing a "location" entity).
* **Field (Feature):**
    * **Definition:** An individual, measurable property or characteristic, typically observed on an entity. Defined with a `name` and `dtype` (e.g., `Float32`, `Int64`).
    * Fields are defined within a Feature View's `schema`.
    * Feature names must be unique within a Feature View.
    * Can have `tags` for additional metadata.
* **Types of Feature Views:**
    1. **Standard Feature View:** The most common type, typically backed by a batch data source.
    2. **[Alpha] On-Demand Feature View (`on_demand_feature_view`):**
        * Allows defining new features by applying Python transformations to:
            * Existing features from other Feature Views.
            * Request-time data (via `RequestSource`).
        * Transformations are executed as Python code (often Pandas DataFrames) during both historical and online retrieval.
        * **Scalability:** Fine for online serving (small data). For historical retrieval on large datasets, local Python execution might not scale well.
        * **Use Case:** Rapid iteration by data scientists, light-weight transformations, combining diverse data sources at request time.
    3. **[Alpha] Stream Feature View (`stream_feature_view`):**
        * Extends a normal Feature View by having both a stream source (e.g., Kafka, Kinesis) and a batch source (for backfills/historical data).
        * Designed for features that need to be updated with very low latency from streaming events.
        * Can include transformations (e.g., Spark transformations if `mode="spark"`).

### 5. Feature Retrieval: Accessing Your Features

Feast provides APIs to get feature values for different ML lifecycle stages.

* **Core APIs:**
    1. `feature_store.get_historical_features(...)`: For training data generation and offline batch scoring. Performs point-in-time correct joins.
    2. `feature_store.get_online_features(...)`: For real-time model predictions from the online store.
    3. Feature Server Endpoints (e.g., `POST /get-online-features`): For language-agnostic online feature retrieval.
* **Key Inputs for Retrieval:**
    * **Feature Specification:**
        * **Feature Service (Recommended for production):** A logical group of features (potentially from multiple Feature Views) required by a specific model or model version. You define it once and reference it by name.
            ```python
            from feast import FeatureService

            # `store`, `driver_stats_fv`, and `driver_ratings_fv` are assumed to be defined elsewhere.
            driver_stats_fs = FeatureService(
                name="driver_activity_v1",
                features=[driver_stats_fv, driver_ratings_fv[["lifetime_rating"]]]
            )
            # The entity rows are elided here; pass a list of join-key dictionaries.
            features = store.get_online_features(features=driver_stats_fs, entity_rows=...)
            ```
        * **Feature References (Good for experimentation):** A list of strings in the format `<feature_view_name>:<feature_name>`.
            ```python
            # The entity rows are elided here, as in the Feature Service example.
            features = store.get_online_features(features=["driver_hourly_stats:conv_rate"], entity_rows=...)
            ```
    * **Entity Specification:**
        * For `get_historical_features`: An "entity dataframe" (Pandas DataFrame or SQL query) containing entity join key values and **event timestamps** for point-in-time correctness.
        * For `get_online_features`: A list of `entity_rows` (dictionaries of entity join key values). **No timestamps needed** as it fetches the latest values.
* **Event Timestamp:** The timestamp recorded in your data source indicating when a feature event occurred. Crucial for point-in-time joins.
* **Dataset (in Retrieval Context):** The output of `get_historical_features`. It's a table (e.g., Pandas DataFrame) containing the requested features joined onto the input entity dataframe.

### 6. Point-in-Time Joins: Ensuring Temporal Correctness

This is a critical capability of Feast for generating historically accurate training data, preventing data leakage.

* **How it Works:** When you call `get_historical_features`, Feast uses the `event_timestamp` column in your entity dataframe. For each row in this dataframe, it joins the most recent feature values from the specified Feature Views that were recorded *at or before* that row's `event_timestamp`, never after.
* **TTL (Time-To-Live) Role:** The `ttl` defined on a Feature View limits how far back in time Feast will search for a feature value from the given `event_timestamp`. If a feature value is older than the `ttl` relative to the `event_timestamp`, it won't be joined.
* **Example:** If your entity dataframe has an event at `2023-01-15 10:00:00` and a Feature View has a `ttl` of `2 hours`, Feast will look for feature values for that entity between `2023-01-15 08:00:00` and `2023-01-15 10:00:00`.

### 7. [Alpha] Saved Dataset: Persisting Feature Sets

* **Purpose:** Allows you to save the output of `get_historical_features` (a feature dataset) for later use, such as model training, analysis, or data quality monitoring.
* **Storage:**
    * Metadata about the Saved Dataset is stored in the Feast registry.
    * The actual raw data (features, entities, timestamps) is stored in your configured offline store (e.g., a new table in BigQuery).
* **Creation:**
    1. Call `store.get_historical_features(...)` to get a retrieval job.
    2. Pass this job to `store.create_saved_dataset(from_=historical_job, name="my_dataset", storage=...)`. This triggers the job execution and persists the data.
* **Planned Creation Methods:** Logging request/response data during online serving or features during writes to the online store.
* **Retrieval:** `dataset = store.get_saved_dataset('my_dataset_name')`, then `dataset.to_df()`.

### 8. Permission: Securing Your Feature Store

Feast provides a model for configuring granular access policies to its resources.

* **Scope:** Permissions are defined and stored in the Feast registry.
* **Enforcement:** Performed by Feast servers (online feature server, offline feature server, registry server) when requests are made through them. *No enforcement when using a local provider directly with the SDK.*
* **Core Components:**
    * **`Resource`:** The Feast object being secured (e.g., `FeatureView`, `DataSource`, `Project`). Assumed to have a `name` and optional `tags`.
    * **`Action`:** The operation being performed (e.g., `CREATE`, `DESCRIBE`, `UPDATE`, `DELETE`, `READ_ONLINE`, `WRITE_OFFLINE`). Aliases like `READ`, `WRITE`, `CRUD` simplify definitions.
    * **`Policy`:** The rule for authorization (e.g., `RoleBasedPolicy` which checks user roles).
    * **`Permission` Object:** Defines a single permission rule with attributes:
        * `name`: Name of the permission.
        * `types`: List of resource types this permission applies to (e.g., `[FeatureView, FeatureService]`). Aliases like `ALL_RESOURCE_TYPES`.
        * `name_patterns`: List of regex patterns to match resource names.
        * `required_tags`: Dictionary of tags that must match the resource's tags.
        * `actions`: List of actions authorized by this permission.
        * `policy`: The policy object to apply.
* **Important:** Resources not matching any configured `Permission` are *not secured* and are accessible by any user.
* **Configuration:** Defined in the `auth` section of `feature_store.yaml`. Feast supports OIDC and Kubernetes RBAC. If `auth` is unspecified, it defaults to `no_auth` (no enforcement).

### 9. Tags: Adding Metadata to Feast Objects

Tags are key-value pairs used throughout Feast to add arbitrary metadata to various objects.

* **Usage:**
    * **`Field`:** Each feature (field) can have tags.
    * **`FeatureView`:** Can have tags for organizational purposes.
    * **`Permission`:** Tags on resources can be used as a condition (`required_tags`) for applying a permission policy.
    * **`SavedDataset`:** Can have tags.
* **Purpose:**
    * **Organization:** Grouping or categorizing resources (e.g., by team, sensitivity level, status).
    * **Discovery:** Helping users find relevant features or resources.
    * **Policy Enforcement:** As seen in `Permission`, tags can drive access control decisions.
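
How tags can drive access control is easier to see with a small example. The sketch below shows, in plain Python, how the `types`, `name_patterns`, and `required_tags` attributes of a permission might decide whether a resource falls under a rule. It is an illustration only, not Feast's implementation: the `Resource` and `PermissionRule` classes and the `rule_applies` helper are hypothetical names, and the matching semantics (whole-name regex match, exact tag equality, an empty pattern list matching every name) are assumptions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical stand-in for a Feast object: a type label, a name, and tags."""
    kind: str
    name: str
    tags: dict = field(default_factory=dict)


@dataclass
class PermissionRule:
    """Hypothetical, simplified mirror of a few `Permission` attributes."""
    name: str
    types: list
    name_patterns: list = field(default_factory=list)
    required_tags: dict = field(default_factory=dict)


def rule_applies(rule: PermissionRule, resource: Resource) -> bool:
    """Return True when the resource falls under the rule (assumed semantics)."""
    if resource.kind not in rule.types:
        return False
    # Assumption: an empty pattern list matches every name, and a pattern
    # must match the whole resource name.
    if rule.name_patterns and not any(
        re.fullmatch(pattern, resource.name) for pattern in rule.name_patterns
    ):
        return False
    # Assumption: every required tag must be present with an equal value.
    return all(resource.tags.get(k) == v for k, v in rule.required_tags.items())


rule = PermissionRule(
    name="fraud_team_views",
    types=["FeatureView"],
    name_patterns=[r"fraud_.*"],
    required_tags={"owner": "fraud_team"},
)

tagged = Resource("FeatureView", "fraud_txn_stats", {"owner": "fraud_team", "status": "prod"})
untagged = Resource("FeatureView", "fraud_txn_stats")
other_kind = Resource("DataSource", "fraud_source", {"owner": "fraud_team"})

print(rule_applies(rule, tagged))      # True
print(rule_applies(rule, untagged))    # False
print(rule_applies(rule, other_kind))  # False
```

Under the rule stated in the Permission section, a resource such as `untagged` that matches no configured permission would be left unsecured, which is one reason to apply tags consistently when they are used as a permission condition.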