I want to troubleshoot errors when I run Amazon SageMaker training jobs.
Resolution
To identify why your SageMaker training job failed, check the failure reason on the SageMaker console or the FailureReason field in the DescribeTrainingJob API response. Then, troubleshoot the job based on the error that you see.
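As a minimal sketch of the API route, the following Python function reads the status and failure reason from DescribeTrainingJob through boto3. The function name get_failure_reason and the optional client parameter are illustrative choices, not part of any AWS API. The sketch assumes that your environment already has AWS credentials and a default Region configured.

```python
def get_failure_reason(job_name, client=None):
    """Return (status, failure_reason) for a SageMaker training job.

    `client` is an optional, hypothetical injection point so that the
    function can be exercised with a stub instead of a real AWS client.
    """
    if client is None:
        import boto3  # imported lazily; only needed for real AWS calls
        client = boto3.client("sagemaker")

    response = client.describe_training_job(TrainingJobName=job_name)
    status = response["TrainingJobStatus"]
    # FailureReason is present only when the job has failed.
    reason = response.get("FailureReason")
    return status, reason
```

For example, get_failure_reason("example-training-job-name") returns a tuple such as ("Failed", "<reason text>"), or a status with None when the job didn't fail.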
If your SageMaker training job failed with an internal server error, first retry the job to rule out a transient issue.
If the job fails again when you retry it, then review the training job logs in Amazon CloudWatch.
Locate the logs in CloudWatch under the log group /aws/sagemaker/TrainingJobs. The log stream name is similar to this example: example-training-job-name/algo-example-instance-number-in-cluster-example-epoch-timestamp.
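The same logs can also be pulled programmatically. The sketch below uses the CloudWatch Logs FilterLogEvents operation through boto3 and selects streams by using the training job name as a prefix. The function name fetch_training_logs, the max_events cap, and the client parameter are hypothetical and exist only for illustration.

```python
LOG_GROUP = "/aws/sagemaker/TrainingJobs"


def fetch_training_logs(job_name, client=None, max_events=200):
    """Return up to `max_events` log messages for one training job."""
    if client is None:
        import boto3  # imported lazily; only needed for real AWS calls
        client = boto3.client("logs")

    # Log streams for a job start with "<training-job-name>/algo-...".
    kwargs = {
        "logGroupName": LOG_GROUP,
        "logStreamNamePrefix": job_name + "/",
    }
    messages = []
    while len(messages) < max_events:
        page = client.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in page.get("events", []))
        token = page.get("nextToken")
        if not token:
            break  # no more pages
        kwargs["nextToken"] = token
    return messages[:max_events]
```

A typical use is to print the last few messages, for example fetch_training_logs("example-training-job-name")[-20:], and look for a stack trace or an out-of-memory message near the end.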
Review job metrics, such as CPUUtilization, MemoryUtilization, and DiskUtilization, to make sure that a resource limitation didn't cause the failure.
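To check these metrics outside the console, one option is the CloudWatch GetMetricStatistics operation. The sketch below reports the peak value of each metric for one instance of the job. It assumes that the metrics are published in the /aws/sagemaker/TrainingJobs namespace with a Host dimension of the form training-job-name/algo-instance-number; verify both against your own account before you rely on the output. The function name peak_utilization and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone

NAMESPACE = "/aws/sagemaker/TrainingJobs"  # assumed metric namespace
METRICS = ("CPUUtilization", "MemoryUtilization", "DiskUtilization")


def peak_utilization(job_name, instance_number=1, hours=24, client=None):
    """Return {metric_name: peak value or None} for one training instance."""
    if client is None:
        import boto3  # imported lazily; only needed for real AWS calls
        client = boto3.client("cloudwatch")

    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    # Assumed dimension format: "<training-job-name>/algo-<instance-number>".
    host = f"{job_name}/algo-{instance_number}"

    peaks = {}
    for metric in METRICS:
        response = client.get_metric_statistics(
            Namespace=NAMESPACE,
            MetricName=metric,
            Dimensions=[{"Name": "Host", "Value": host}],
            StartTime=start,
            EndTime=end,
            Period=300,  # 5-minute buckets
            Statistics=["Maximum"],
        )
        datapoints = response.get("Datapoints", [])
        # None means that CloudWatch returned no data for this metric.
        peaks[metric] = max((p["Maximum"] for p in datapoints), default=None)
    return peaks
```

A MemoryUtilization or DiskUtilization peak that is close to 100 suggests that the job ran out of memory or disk space. A value of None means that no datapoints were found for the time window, so widen the hours argument or check the dimension value.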
To access the training job logs and job metrics, complete the following steps:
- Open the SageMaker console.
- Choose Training jobs.
- Choose the name of the training job that failed.
- In the Monitor section, choose View logs.
- In the Monitor section, review the graphs of instance utilization.
If the graphs show that the job consumes all of the available CPU, memory, or disk space, then switch to a larger instance type or attach a larger storage volume to the instance.
For more information, see Monitoring training job metrics (SageMaker console).
Related information
Monitor and analyze training jobs using Amazon CloudWatch metrics
Logs for built-in algorithms