I am using SageMaker batch transform to run inference with my PyTorch model. My container follows the same structure as https://github.com/aws/amazon-sagemaker-examples/tree/main/advanced_functionality/scikit_bring_your_own/container.
The problem is that the job starts multiple times on different workers when I configure multiple workers. With a single worker, it runs again after it finishes.
I suspect a timeout setting is misconfigured. I have tried increasing keepalive_timeout and proxy_read_timeout in the serve file, and I have tried setting SAGEMAKER_MODEL_SERVER_TIMEOUT as an environment variable, but nothing worked. Could someone help? Thanks!
To understand the scenario better, can you share the error message and the code used for the setup?
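While gathering those details, one thing worth checking: in a bring-your-own container, an environment variable only has an effect if the serve script actually reads it. Below is a minimal sketch of that idea, not your actual serve file. The function names, the `MODEL_SERVER_TIMEOUT` fallback name, the 60-second default, the socket path, and the `wsgi:app` target are all assumptions for illustration.

```python
import os


def resolve_timeout(env=None, default=60):
    """Return the model server timeout in seconds, read from the environment.

    SAGEMAKER_MODEL_SERVER_TIMEOUT is the variable named in the question.
    MODEL_SERVER_TIMEOUT and the default of 60 are assumptions.
    """
    env = os.environ if env is None else env
    raw = env.get(
        "SAGEMAKER_MODEL_SERVER_TIMEOUT",
        env.get("MODEL_SERVER_TIMEOUT", default),
    )
    return int(raw)


def gunicorn_command(timeout, workers):
    """Build a gunicorn argument list that applies the resolved timeout.

    The socket path and the wsgi:app target are hypothetical placeholders.
    """
    return [
        "gunicorn",
        "--timeout", str(timeout),   # worker timeout in seconds
        "-k", "sync",                # synchronous worker class
        "-b", "unix:/tmp/gunicorn.sock",
        "-w", str(workers),          # number of worker processes
        "wsgi:app",
    ]


if __name__ == "__main__":
    # Print the command that would be launched with the current environment.
    print(" ".join(gunicorn_command(resolve_timeout(), workers=1)))
```

If the printed command still shows the default timeout after you set the variable, the serve script is not picking it up, which would explain why changing it had no effect.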