Storing incomplete data on a SageMaker inference endpoint


I am using an inference endpoint to analyze streaming data. Batches of data that come in may not be complete: e.g., data part 1 may arrive in batch 1 and part 2 in batch 2. I need to join the parts together before running inference. My plan was to keep a global variable in my inference script that stores incomplete data until the remaining parts show up.
However, when I create the endpoint, I see the initialization print statements displayed multiple times in the logs.
Does this mean that the endpoint is actually running my code on multiple separate vCPUs? If this was the case, then my global variable would not be shared across the parallel runs, and the plan wouldn't work.
If this is true, is there a way to force a single instance of the code to run, or to somehow share the variable across all the different vCPUs?
Is there a different reason why the initialization statements would print multiple times?
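To make the plan concrete, this is roughly the buffering logic I had in mind (simplified; the function name, arguments, and two-part framing are illustrative, not the actual SageMaker handler contract):

```python
# Sketch of the planned approach: a module-level dict buffers partial
# payloads until every part for a given stream has arrived.
# (All names here are illustrative.)

PENDING = {}  # stream_id -> list of received parts (None = missing)

def handle_batch(stream_id, part_index, total_parts, payload):
    """Buffer one fragment; return the joined payload once complete."""
    parts = PENDING.setdefault(stream_id, [None] * total_parts)
    parts[part_index] = payload
    if any(p is None for p in parts):
        return None  # still waiting on more fragments
    del PENDING[stream_id]
    return b"".join(parts)  # ready to run inference on
```

This only works if every fragment of a stream reaches the same copy of `PENDING`, which is exactly what I am unsure about.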

asked 4 months ago · 502 views
1 Answer

Thank you for contacting us regarding streaming fragmented data to your SageMaker endpoint. To restate the problem: your invocations arrive fragmented because the payloads are not consolidated upstream, so you planned to consolidate them in a global variable in the inference code before running inference. However, the CloudWatch logs show the inference code being initialized multiple times, which raises two questions:

  1. Is the endpoint running the code on multiple separate vCPUs, and if so, is the global variable shared across the parallel runs?
  2. If so, is there a way to force a single instance of the code to run, or to share the variable across all the different vCPUs?

You are correct that SageMaker endpoints can run inference code in parallel. The model servers used by the prebuilt framework containers typically start one worker process per vCPU, and each worker imports the inference script separately, which is why the initialization statements appear multiple times in the logs. Because these workers are separate processes, a module-level global variable is not shared between them, so the buffering plan as described would not work reliably. Depending on the container, you may be able to reduce the worker count to one (the prebuilt framework containers generally honor the SAGEMAKER_MODEL_SERVER_WORKERS environment variable), but a single worker limits throughput and does not help if the endpoint scales out to multiple instances; in that case the shared state has to live in an external store.
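As background, the behavior in question is easy to reproduce in plain Python outside of SageMaker: a module-level global is per-process, so separate worker processes each mutate their own copy while the parent's copy never changes. A minimal demonstration (using the POSIX-only "fork" start method for simplicity):

```python
# Demonstration that a module-level global is process-local: worker
# processes each mutate a private copy, and the parent's copy stays
# untouched. Uses the "fork" start method (POSIX only).
import multiprocessing as mp

COUNTER = {"n": 0}  # stands in for the inference script's global buffer

def bump(_):
    COUNTER["n"] += 1          # mutates this worker's private copy
    return COUNTER["n"]

def run_demo(tasks=4, workers=4):
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=workers) as pool:
        results = pool.map(bump, range(tasks))
    return results, COUNTER["n"]

results, parent_count = run_demo()
# Every worker started from n == 0, so the results never reflect a
# shared running total, and parent_count is still 0.
```

The same isolation applies to model-server workers: fragments routed to different workers would land in different copies of the buffer.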

Given the complexity around resolving this issue, I would recommend reaching out to AWS Premium Support. They can help troubleshoot your specific endpoint configuration and data pipelines to advise the best approach for sharing state. They can also assist with implementation if needed.

Please open a Premium Support case in the AWS Console or call 24/7 phone support. Reference this re:Post summary when you connect and an engineer can investigate further with you.

answered 4 months ago
