How to wait for SageMaker asynchronous endpoints in a Step Functions workflow


I'm building a Step Functions workflow in which multiple SageMaker asynchronous endpoints are invoked in parallel (the number of models can change, and I don't want to tie the implementation to it). When invoked, the endpoints immediately return their output locations in S3. I would like to pause the workflow until all the endpoints have successfully written their outputs to S3. I'm thinking of using the callback pattern with a task token, where a Lambda function calls sendTaskSuccess once all the outputs are present in S3.
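For context, the invocation side I have in mind looks roughly like this (a sketch; the table name, endpoint list, and InferenceId scheme are my own placeholders, and the task token is assumed to be passed in via "$$.Task.Token" in the state's parameters):

```python
import os
import uuid

import boto3

sm_runtime = boto3.client("sagemaker-runtime")
table = boto3.resource("dynamodb").Table(os.environ["JOBS_TABLE"])  # placeholder table


def handler(event, context):
    """Fan out to all async endpoints and record their output locations.

    The state machine passes "token.$": "$$.Task.Token" plus the list of
    endpoint names and the S3 input location in the task parameters.
    """
    job_id = str(uuid.uuid4())

    for endpoint in event["endpoints"]:
        resp = sm_runtime.invoke_endpoint_async(
            EndpointName=endpoint,
            InputLocation=event["input_location"],
            # Setting InferenceId makes the output object name predictable,
            # since async inference names the output object after the
            # inference ID. "_" is a safe separator: UUIDs contain hyphens
            # but no underscores, and endpoint names allow no underscores.
            InferenceId=f"{job_id}_{endpoint}",
        )
        # One item per expected output, all sharing the same jobId.
        table.put_item(
            Item={
                "jobId": job_id,
                "outputLocation": resp["OutputLocation"],
                "taskToken": event["token"],
                "status": "PENDING",
            }
        )

    return {"jobId": job_id}
```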

My issue is how to store these output locations properly under a single jobId, so that when an inference is written to S3 a Lambda function can retrieve them and verify whether all the inferences are present. I thought about using DynamoDB, but eventual consistency worries me, because I'd be using a GSI on the S3 output path attribute to retrieve the jobId, then using that jobId to retrieve the remaining output paths.
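To make the worry concrete, this is roughly the lookup I have in mind (a sketch; the table and index names are placeholders):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("InferenceJobs")  # placeholder table


def all_output_paths(output_location):
    # Step 1: GSI lookup from the S3 output path back to the jobId.
    # This is the eventually consistent read I'm worried about: GSI
    # queries do not support ConsistentRead.
    by_path = table.query(
        IndexName="outputLocation-index",  # placeholder GSI name
        KeyConditionExpression=Key("outputLocation").eq(output_location),
    )["Items"]
    job_id = by_path[0]["jobId"]

    # Step 2: fetch every output path registered under that jobId.
    by_job = table.query(
        KeyConditionExpression=Key("jobId").eq(job_id),
    )["Items"]
    return [item["outputLocation"] for item in by_job]
```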

RDS PostgreSQL isn't directly supported as a Step Functions integration, so I'd have to use a Lambda function to run the queries, and cold starts may also be an issue.

2 Answers

You can use DynamoDB and avoid the GSI consistency concern entirely. Use a composite primary key, where the partition key is the jobId and the sort key is the output location, so you can efficiently query and update the status of each output directly on the base table. Base-table queries support strongly consistent reads (GSIs do not), so the critical completeness check can use ConsistentRead. Then implement a Lambda function that triggers on S3 events, marks the corresponding output complete, checks whether all outputs for that jobId are present, and calls sendTaskSuccess.
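A minimal sketch of that completion-check Lambda (table and key names are placeholders; it assumes the jobId can be recovered from the object key, for example because InferenceId embedded it at invocation time, and that the task token was stored on each item when the job was created):

```python
import json
from urllib.parse import unquote_plus

import boto3
from boto3.dynamodb.conditions import Key

sfn = boto3.client("stepfunctions")
table = boto3.resource("dynamodb").Table("InferenceJobs")  # placeholder table


def handler(event, context):
    """Triggered by S3 ObjectCreated events on the async output prefix."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        # Must match the OutputLocation URI stored at invocation time.
        output_location = f"s3://{bucket}/{key}"

        # Assumption: InferenceId was set to "<jobId>_<endpoint>", so the
        # output object name ("<inference-id>.out") starts with the jobId.
        job_id = key.rsplit("/", 1)[-1].split("_", 1)[0]

        # Mark this output as complete ("status" is a DynamoDB reserved
        # word, hence the expression attribute name).
        table.update_item(
            Key={"jobId": job_id, "outputLocation": output_location},
            UpdateExpression="SET #s = :done",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":done": "COMPLETE"},
        )

        # Strongly consistent read of the whole job; supported here because
        # the query runs against the base table, not a GSI.
        items = table.query(
            KeyConditionExpression=Key("jobId").eq(job_id),
            ConsistentRead=True,
        )["Items"]

        if items and all(item["status"] == "COMPLETE" for item in items):
            sfn.send_task_success(
                taskToken=items[0]["taskToken"],
                output=json.dumps({"jobId": job_id}),
            )
```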


Here are a few ways you could approach storing and retrieving the output locations from your asynchronous endpoints:

  • DynamoDB status tracking: Store the job ID, output locations, and a status field. When a Lambda function finishes writing to S3, it updates the status for that output location; once all statuses are complete, your final Lambda can retrieve the full list of outputs for that job ID. You may need to handle eventual consistency, for example with conditional writes to the status field (see the sketch below).
  • S3 metadata object: Have each Lambda write output metadata, including the job ID and output location, to a single S3 object upon completion. Your final Lambda would poll this S3 object until all expected outputs are present, then read the full metadata to retrieve all output locations.
  • SNS notifications: Each Lambda subscribes to a topic and publishes a message, including the job ID and output location, when it finishes writing to S3. Your final Lambda would wait for all expected SNS messages before proceeding.

In all cases, testing and refactoring may be needed to properly handle failures or outlier completion times between endpoints.
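For the conditional-write idea in the first option, a minimal sketch (table and attribute names are illustrative):

```python
import boto3

dynamodb = boto3.client("dynamodb")


def mark_complete(job_id, output_location):
    """Flip one output's status to COMPLETE exactly once.

    The condition makes the update idempotent: replayed S3 events or
    concurrent Lambdas cannot double-count the same output.
    """
    try:
        dynamodb.update_item(
            TableName="InferenceJobs",  # illustrative table name
            Key={
                "jobId": {"S": job_id},
                "outputLocation": {"S": output_location},
            },
            UpdateExpression="SET #s = :done",
            ConditionExpression="#s = :pending",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={
                ":done": {"S": "COMPLETE"},
                ":pending": {"S": "PENDING"},
            },
        )
        return True
    except dynamodb.exceptions.ConditionalCheckFailedException:
        # Another invocation already recorded this output.
        return False
```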
