Improve performance of ingestion to SageMaker Feature Store using Feature Processor SDK

Hi, I'm using the SageMaker Feature Store Feature Processor SDK to ingest data into an OfflineStore.

The problem is that ingestion is very slow. Ingesting a test set of 10,000 records takes 18 minutes, so at the same rate the 1M records I need to ingest would take about 30 hours!
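The 30-hour figure is a straight linear extrapolation from the timed sample. A minimal sketch of that arithmetic, assuming throughput stays constant as the data grows (the function name is just illustrative):

```python
def estimate_hours(sample_records, sample_minutes, target_records):
    """Linearly extrapolate total ingestion time (in hours) from a timed sample."""
    records_per_minute = sample_records / sample_minutes
    return target_records / records_per_minute / 60


# 10,000 records in 18 minutes, scaled up to 1M records
print(round(estimate_hours(10_000, 18, 1_000_000), 1))
```

This works out to roughly 555 records per minute, which is the rate I'd like to improve on.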

Is this expected, or is there some way to improve this ingestion performance?

For reference, here is the code I'm using:

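from sagemaker.feature_store.feature_processor import feature_processor

# SnowflakeDataSource is a custom data source, implemented using the code in the custom data sources doc.
# query, sf_options, secret and CRB_FG_ARN are defined elsewhere.
@feature_processor(
    inputs=[SnowflakeDataSource(query, sf_options, secret)],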
    output=CRB_FG_ARN,
    target_stores=["OfflineStore"],
    spark_config={"spark.jars.packages": "net.snowflake:spark-snowflake_2.12:2.12.0-spark_3.3"}
)
def transform(input_df):
    from pyspark.sql.functions import col, unix_timestamp

    transformed_df = (
        input_df.select([col(x).alias(x.lower()) for x in input_df.columns])
        .withColumn("created_at", unix_timestamp("created_at"))
    )

    # This print appears almost immediately, which suggests the latency is not in the Snowflake query.
    print(f"dataframe shape: {(transformed_df.count(), len(transformed_df.columns))}")

    return transformed_df

transform()

EDIT1:

I changed the table format of the feature group to TableFormatEnum.ICEBERG, and that allowed me to ingest the full 1M rows in 13 minutes.
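For anyone wanting to reproduce this, the table format is set when the feature group is created. Below is a minimal sketch that only builds the request for the low-level CreateFeatureGroup API, where the setting appears as TableFormat inside OfflineStoreConfig. The helper name, the feature group name, S3 URI, role ARN, the record identifier "record_id" and the feature types are all hypothetical placeholders; only "created_at" comes from my code above.

```python
def build_offline_iceberg_request(name, s3_uri, role_arn, record_id_feature, event_time_feature):
    """Build CreateFeatureGroup parameters for an offline-only, Iceberg-format group."""
    return {
        "FeatureGroupName": name,
        "RecordIdentifierFeatureName": record_id_feature,
        "EventTimeFeatureName": event_time_feature,
        "FeatureDefinitions": [
            # Feature types here are assumptions for illustration.
            {"FeatureName": record_id_feature, "FeatureType": "String"},
            {"FeatureName": event_time_feature, "FeatureType": "Fractional"},
        ],
        "OfflineStoreConfig": {
            "S3StorageConfig": {"S3Uri": s3_uri},
            "TableFormat": "Iceberg",  # the alternative value is "Glue"
        },
        "RoleArn": role_arn,
    }


request = build_offline_iceberg_request(
    name="example-feature-group",  # hypothetical
    s3_uri="s3://example-bucket/feature-store",  # hypothetical
    role_arn="arn:aws:iam::123456789012:role/example-role",  # hypothetical
    record_id_feature="record_id",  # hypothetical
    event_time_feature="created_at",
)
print(request["OfflineStoreConfig"])
```

Passing this dict to boto3 as `boto3.client("sagemaker").create_feature_group(**request)` should create the group. Leaving out OnlineStoreConfig keeps it offline-only.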

EDIT2:

I re-enabled the write to the OnlineStore when creating the feature group, and the ingestion is very slow again. Looking in S3, I see many files per day, rather than the single file per day I saw when writing only to the OfflineStore:

2024-05-16 17:25:40       4711 feature-store/sandbox/redacted/sagemaker/eu-west-1/offline-store/crb-1715868548/data/dfkchwkp/created_at_trunc=2023-05-07/20230507T164259Z_xuEYFSbUERrnKwQx.parquet
2024-05-16 17:21:10       4711 feature-store/sandbox/redacted/sagemaker/eu-west-1/offline-store/crb-1715868548/data/iyzled4v/created_at_trunc=2023-05-07/20230507T065739Z_yoFAGbgxIFClXSFy.parquet
2024-05-16 17:25:40       4711 feature-store/sandbox/redacted/sagemaker/eu-west-1/offline-store/crb-1715868548/data/qvvzi6tz/created_at_trunc=2023-05-07/20230507T180708Z_icsZiEfeGETlPgFl.parquet
2024-05-16 17:21:09       4711 feature-store/sandbox/redacted/sagemaker/eu-west-1/offline-store/crb-1715868548/data/rmmlxmbw/created_at_trunc=2023-05-07/20230507T181622Z_uyyeaMQWysmxzTez.parquet
2024-05-16 17:21:10       4711 feature-store/sandbox/redacted/sagemaker/eu-west-1/offline-store/crb-1715868548/data/yaiznhwv/created_at_trunc=2023-05-07/20230507T102844Z_kVUEXnFqcRCdaeGB.parquet
2024-05-16 17:25:40       4711 feature-store/sandbox/redacted/sagemaker/eu-west-1/offline-store/crb-1715868548/data/yzk3hxbn/created_at_trunc=2023-05-07/20230507T161916Z_eFLMXfmhPkorCPyJ.parquet
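To put a number on the file fragmentation, here is a rough sketch that counts Parquet files per partition from listing lines like the ones above. It assumes the `created_at_trunc=<date>` partition folder layout shown in my listing; the helper name is just illustrative.

```python
import re
from collections import Counter

# Matches the partition folder that directly contains a .parquet file.
PARTITION_RE = re.compile(r"/(created_at_trunc=[^/]+)/[^/]+\.parquet$")


def files_per_partition(listing_lines):
    """Count .parquet files per partition folder in S3 listing output."""
    counts = Counter()
    for line in listing_lines:
        match = PARTITION_RE.search(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "2024-05-16 17:25:40       4711 feature-store/sandbox/redacted/sagemaker/eu-west-1/offline-store/crb-1715868548/data/dfkchwkp/created_at_trunc=2023-05-07/20230507T164259Z_xuEYFSbUERrnKwQx.parquet",
    "2024-05-16 17:21:10       4711 feature-store/sandbox/redacted/sagemaker/eu-west-1/offline-store/crb-1715868548/data/iyzled4v/created_at_trunc=2023-05-07/20230507T065739Z_yoFAGbgxIFClXSFy.parquet",
]
for partition, count in files_per_partition(sample).items():
    print(partition, count)
```

Run against the full listing, this gives the number of files written for each day, which makes it easy to compare the OfflineStore-only run with the run that also writes to the OnlineStore.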

I can't find any docs about this, and I'm starting to lose a little faith in the production-worthiness of SageMaker Feature Store :/

MarkNS
asked 20 days ago, 285 views
No Answers
