Sagemaker Feature Store Spark connector is duplicating data

0

Hi!

I'm using the Feature Store Spark connector to ingest data into the Sagemaker Feature Store and when we try to ingest data to a Feature Group with the online store enabled, the data is duplicated. In the image bellow, "customer_id" is the ID feature, "date_ref" the event column. All the features are equal for the same ID and EventTime column, except the "api_invocation_time".

Duplicated data

If the feature group doesn't have the online store enabled, we ingest the data directly to the offline store without issues. But when we use the "Ingest by default" option in the connector (not specifying the "target_stores" in the connector, uses the PutRecord API), the data ingested is duplicated:

params = {
    "input_data_frame":dataframe,
    "feature_group_arn": feature_group_arn            
}

if not online_store_enabled:
    params["target_stores"] = ["OfflineStore"]
    logger.info(f"Ingesting data to the offline store")

pyspark_connector.ingest_data(**params)
logger.info("Finished the ingestion!")

failed_records = pyspark_connector.get_failed_stream_ingestion_data_frame()

How can I solve this issue using the connector?

EDIT:

Apparently, the problem is in the "get_failed_stream ingestion data frame" method. This method, instead of just returning a dataframe, ingests the data again before returning the failed records. Removing the method from the ingestion pipeline resolves the issue, although we lose a form of validation.

質問済み 1年前361ビュー
1回答
1
承認された回答

This issue should be patched in 1.1.1. If you upgrade from 1.1.0, get_failed_stream_ingestion_data_frame should no longer trigger any re-computation now.

AWS
Can Sun
回答済み 1年前

ログインしていません。 ログイン 回答を投稿する。

優れた回答とは、質問に明確に答え、建設的なフィードバックを提供し、質問者の専門分野におけるスキルの向上を促すものです。

質問に答えるためのガイドライン

関連するコンテンツ