AWS SageMaker training jobs randomly and transiently fail in us-west-2


I've been running jobs on SageMaker consistently for the past 2 months in us-west-1. I recently moved everything to us-west-2 because of better instance options. However, I've noticed that jobs fairly often fail immediately after being submitted. This never happened in us-west-1, and the jobs themselves are mostly unchanged. Weirdly enough, I can resubmit the same job (either by script or by cloning the job in the console) and it will often work!

To be clear, here is the setup and what happens:

  • SageMaker training job in us-west-2 with a custom image I provide in ECR
  • Spot training job
  • S3 inputs passed in using FastFile mode
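
For readers who want to reproduce a setup like this, here is a minimal sketch of what the request might look like when built for boto3's `create_training_job`. The function name `build_training_request`, the channel name, the volume size, and the timeout values are assumptions for illustration, not taken from the question; only the instance type, spot training, and FastFile mode come from the description above.

```python
def build_training_request(job_name, image_uri, role_arn, input_s3_uri, output_s3_uri):
    """Build a create_training_job request for a spot job with FastFile input.

    All argument values are supplied by the caller; nothing here is specific
    to the original poster's account.
    """
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,       # custom image stored in ECR
            "TrainingInputMode": "FastFile",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "training",    # assumed channel name
                "InputMode": "FastFile",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": input_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.g5.2xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,             # assumed volume size
        },
        # Spot training: the wait time must be at least the run time.
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": 3600,      # assumed limits
            "MaxWaitTimeInSeconds": 7200,
        },
    }
```

The resulting dictionary could then be passed as `boto3.client("sagemaker", region_name="us-west-2").create_training_job(**request)`.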

Sometimes, when these jobs are submitted, they quit immediately and return the cryptic error: InternalServerError: We encountered an internal error. Please try again.

Cloning a failed job almost immediately afterwards usually works fine. I noticed this happens a lot with g5 instances (specifically, g5.2xlarge). Is it an availability issue? I don't see any messages mentioning low availability of specific instances, but that may be the cause, since these are also spot training jobs.

These are transient issues, but they pop up much more often than I'd expect.
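
Since resubmitting usually succeeds, one possible workaround is to automate the resubmission. The following is only a sketch under stated assumptions: `run_with_resubmit`, `submit_job`, and `get_status` are hypothetical names, and the attempt count and delays are arbitrary. It resubmits under a fresh name only when the failure reason contains `InternalServerError`, and raises on any other failure.

```python
import time
import uuid


def run_with_resubmit(submit_job, get_status, base_name, max_attempts=4,
                      base_delay=30.0, poll_interval=15.0, sleep=time.sleep):
    """Submit a job; resubmit under a fresh name if it fails with InternalServerError.

    submit_job(name) starts a job and get_status(name) returns a
    (status, failure_reason) tuple. Both are hypothetical callables
    supplied by the caller.
    """
    for attempt in range(max_attempts):
        # Job names must be unique, so append a short random suffix.
        name = f"{base_name}-{uuid.uuid4().hex[:8]}"
        submit_job(name)

        # Poll until the job reaches a terminal state.
        while True:
            status, reason = get_status(name)
            if status in ("Completed", "Failed", "Stopped"):
                break
            sleep(poll_interval)

        if status == "Completed":
            return name

        transient = status == "Failed" and "InternalServerError" in (reason or "")
        if not transient:
            # Anything else is treated as a real failure and not retried.
            raise RuntimeError(f"{name} ended as {status}: {reason}")

        # Exponential backoff before the next attempt.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)

    raise RuntimeError(f"{base_name}: still failing after {max_attempts} attempts")
```

With boto3, `submit_job` could wrap `create_training_job`, and `get_status` could read `TrainingJobStatus` and `FailureReason` from the `describe_training_job` response. This only hides the symptom; it does not explain why the error occurs.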

asked 3 months ago · 234 views
1 Answer

For your workload, I would recommend using the real-time inference option with SageMaker endpoints.

With client-side batching as you are currently doing, you can ensure the order of inputs and outputs is maintained since each batch request is synchronous. Creating multiple instances in the endpoint will not change this behavior.
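
As a rough illustration of why order is preserved, here is a minimal sketch of client-side batching. `predict_in_batches` and `invoke` are hypothetical names: `invoke` stands for whatever function sends one batch to the endpoint and returns one output per input. The batch size is an arbitrary assumption.

```python
def predict_in_batches(records, invoke, batch_size=32):
    """Send records in fixed-size batches, one synchronous call at a time.

    invoke(batch) is a hypothetical callable that returns one output per
    input, in the same order as the inputs.
    """
    results = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        outputs = invoke(batch)  # blocks until this batch's response arrives
        if len(outputs) != len(batch):
            raise ValueError("endpoint returned a different number of outputs")
        # Batches are processed in sequence, so appending keeps input order.
        results.extend(outputs)
    return results
```

Because each call returns before the next one is sent, the outputs line up with the inputs regardless of which instance behind the endpoint served a given batch. In practice `invoke` could wrap `invoke_endpoint` on the boto3 `sagemaker-runtime` client, with serialization that depends on the model container.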

Compared with a batch transform job, real-time inference avoids the 4-6 minutes of startup time needed to provision a batch job. Since your jobs arrive infrequently, this startup delay negates any benefit of scaling to multiple instances.

You can provision the endpoint in advance with the required instance type and number of instances based on throughput needs. This way when a prediction job arrives, it will get processed immediately without waiting for resources to be provisioned.

The cost is based on how long the endpoint is in service. You can delete the endpoint when it is not in use to avoid idle costs.

For throughput needs above what a single instance can handle, you can scale out the endpoint to multiple instances as traffic increases.

answered 3 months ago
