SageMaker Batch Transform Job Failure: Timeout Issue and Job Restarted Unexpectedly

I am using SageMaker batch transform to run inference with my PyTorch model. My container follows the same structure as https://github.com/aws/amazon-sagemaker-examples/tree/main/advanced_functionality/scikit_bring_your_own/container. The problem is that when I configure multiple workers, the job starts multiple times on different workers. With a single worker, the job runs again after it finishes.

I suspect the timeout settings are misconfigured. I have tried increasing keepalive_timeout and proxy_read_timeout in the serve file, and I have tried setting SAGEMAKER_MODEL_SERVER_TIMEOUT as an environment variable, but nothing worked. Could someone help? Thanks!
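One thing that may be worth checking, as an assumption rather than a confirmed cause: the timeouts listed above are all container-side, while batch transform also applies its own client-side invocation timeout and retry count through ModelClientConfig in the CreateTransformJob request. If an invocation exceeds that timeout, the request can be sent again, which could look like the job repeating. The sketch below only builds the request dictionary. The helper name build_transform_request, the job and model names, the S3 URIs, and the instance type are all hypothetical placeholders.

```python
def build_transform_request(job_name, model_name, input_s3_uri, output_s3_uri,
                            timeout_seconds=3600, max_retries=0):
    """Build keyword arguments for boto3's SageMaker create_transform_job.

    Hypothetical helper for illustration: it raises the client-side
    invocation timeout and disables retries so that a slow request is
    not re-sent to the container.
    """
    # Documented ranges: timeout 1-3600 seconds, retries 0-3.
    if not 1 <= timeout_seconds <= 3600:
        raise ValueError("timeout_seconds must be between 1 and 3600")
    if not 0 <= max_retries <= 3:
        raise ValueError("max_retries must be between 0 and 3")

    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        # One request in flight per instance, to rule out concurrency effects.
        "MaxConcurrentTransforms": 1,
        "ModelClientConfig": {
            "InvocationsTimeoutInSeconds": timeout_seconds,
            "InvocationsMaxRetries": max_retries,
        },
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": input_s3_uri,
                }
            }
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformResources": {
            "InstanceType": "ml.m5.xlarge",  # placeholder instance type
            "InstanceCount": 1,
        },
    }


# Placeholder names and URIs, not taken from the question.
request = build_transform_request(
    "my-transform-job",
    "my-pytorch-model",
    "s3://my-bucket/input/",
    "s3://my-bucket/output/",
)

# To submit the job (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("sagemaker").create_transform_job(**request)
```

If the repeats stop with retries set to 0 and the job then fails with a timeout instead, that would suggest a single invocation is taking longer than the allowed limit, and splitting the input into smaller payloads may help.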

  • To understand the scenario better, can you share the error message and the code used for the setup?

No answers