I want to troubleshoot why my Amazon SageMaker pipeline execution failed.
Resolution
To troubleshoot the failed pipeline execution in SageMaker, do the following:
Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.
1. Run the AWS Command Line Interface (AWS CLI) command list-pipeline-executions.
Note: If the AWS CLI isn't configured on your local machine, use AWS CloudShell instead.
$ aws sagemaker list-pipeline-executions --pipeline-name test-pipeline-p-wzx9cplzrvdk
The command returns a list of executions for your pipeline, similar to the following:
{
"PipelineExecutionSummaries": [
{
"PipelineExecutionArn": "arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/lvejn1jl827b",
"StartTime": "2022-09-27T12:56:44.646000+00:00",
"PipelineExecutionStatus": "Failed",
"PipelineExecutionDisplayName": "execution-1664283404791",
"PipelineExecutionFailureReason": "Step failure: One or multiple steps failed."
},
{
"PipelineExecutionArn": "arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/acvref9y1f47",
"StartTime": "2022-09-27T12:13:28.762000+00:00",
"PipelineExecutionStatus": "Succeeded",
"PipelineExecutionDisplayName": "execution-1664280808943"
}
]
}
2. Run the list-pipeline-execution-steps command with the ARN of the failed execution to view the steps that failed:
$ aws sagemaker list-pipeline-execution-steps --pipeline-execution-arn arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/lvejn1jl827b
The output looks similar to the following:
{
"PipelineExecutionSteps": [
{
"StepName": "TrainAbaloneModel",
"StartTime": "2022-09-27T13:00:49.235000+00:00",
"EndTime": "2022-09-27T13:01:50.056000+00:00",
"StepStatus": "Failed",
"AttemptCount": 0,
"FailureReason": "ClientError: ClientError: Please ensure the security group provided is valid",
"Metadata": {
"TrainingJob": {
"Arn": "arn:aws:sagemaker:eu-west-1:1111222233334444:training-job/pipelines-lvejn1jl827b-trainabalonemodel-u9l9wjassg"
}
}
},
{
"StepName": "PreprocessAbaloneData",
"StartTime": "2022-09-27T12:56:45.595000+00:00",
"EndTime": "2022-09-27T13:00:48.638000+00:00",
"StepStatus": "Succeeded",
"AttemptCount": 0,
"Metadata": {
"ProcessingJob": {
"Arn": "arn:aws:sagemaker:eu-west-1:1111222233334444:processing-job/pipelines-lvejn1jl827b-preprocessabalonedat-6axq0kthyg"
}
}
}
]
}
In this case, the training job step failed because a non-existent security group was specified in the job's VpcConfig object.
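When a pipeline has many executions or steps, it can help to filter the two command outputs programmatically. The following is a minimal Python sketch, assuming the JSON output of each command has already been parsed into a dictionary (for example, with json.load). The helper names latest_failed_execution and failed_steps are illustrative and aren't part of the AWS CLI or any AWS SDK. The sample data is trimmed from the example outputs above.

```python
from datetime import datetime


def latest_failed_execution(executions):
    """Return the most recently started failed execution summary, or None."""
    failed = [
        summary
        for summary in executions.get("PipelineExecutionSummaries", [])
        if summary.get("PipelineExecutionStatus") == "Failed"
    ]
    if not failed:
        return None
    # Assumption: StartTime is an ISO 8601 string, as in the sample output.
    return max(failed, key=lambda s: datetime.fromisoformat(s["StartTime"]))


def failed_steps(steps):
    """Return a (step name, failure reason, job ARN) tuple per failed step."""
    results = []
    for step in steps.get("PipelineExecutionSteps", []):
        if step.get("StepStatus") != "Failed":
            continue
        # Metadata is keyed by job type, for example {"TrainingJob": {"Arn": "..."}}.
        arns = [
            details["Arn"]
            for details in step.get("Metadata", {}).values()
            if isinstance(details, dict) and "Arn" in details
        ]
        results.append(
            (step["StepName"], step.get("FailureReason", ""), arns[0] if arns else None)
        )
    return results


# Sample data trimmed from the example outputs in this article.
EXECUTION_PREFIX = (
    "arn:aws:sagemaker:eu-west-1:1111222233334444:"
    "pipeline/test-pipeline-p-wzx9cplzrvdk/execution/"
)
JOB_PREFIX = "arn:aws:sagemaker:eu-west-1:1111222233334444:"

executions = {
    "PipelineExecutionSummaries": [
        {
            "PipelineExecutionArn": EXECUTION_PREFIX + "lvejn1jl827b",
            "StartTime": "2022-09-27T12:56:44.646000+00:00",
            "PipelineExecutionStatus": "Failed",
        },
        {
            "PipelineExecutionArn": EXECUTION_PREFIX + "acvref9y1f47",
            "StartTime": "2022-09-27T12:13:28.762000+00:00",
            "PipelineExecutionStatus": "Succeeded",
        },
    ]
}

steps = {
    "PipelineExecutionSteps": [
        {
            "StepName": "TrainAbaloneModel",
            "StepStatus": "Failed",
            "FailureReason": "ClientError: ClientError: Please ensure the security group provided is valid",
            "Metadata": {
                "TrainingJob": {
                    "Arn": JOB_PREFIX + "training-job/pipelines-lvejn1jl827b-trainabalonemodel-u9l9wjassg"
                }
            },
        },
        {
            "StepName": "PreprocessAbaloneData",
            "StepStatus": "Succeeded",
            "Metadata": {
                "ProcessingJob": {
                    "Arn": JOB_PREFIX + "processing-job/pipelines-lvejn1jl827b-preprocessabalonedat-6axq0kthyg"
                }
            },
        },
    ]
}

print(latest_failed_execution(executions)["PipelineExecutionArn"])
for name, reason, arn in failed_steps(steps):
    print(name, reason, arn, sep="\n  ")
```

The first helper gives the execution ARN to pass to step 2, and the second reduces the step list to the entries that need attention. If the timestamps in your data aren't ISO 8601 strings, adjust the sort key accordingly.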
If the FailureReason for the failed step isn't clear, check the Amazon CloudWatch logs for the failed SageMaker job or endpoint to troubleshoot further. Logs for training jobs are in the CloudWatch log group /aws/sagemaker/TrainingJobs. The log stream name looks similar to the following:
example-training-job-name/algo-example-instance-number-in-cluster-example-epoch-timestamp
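Because the log stream name starts with the training job name, the job ARN from the failed step is enough to narrow down the log streams. The following Python sketch derives the log group and a log stream prefix from a training job ARN. The function name training_job_log_location is hypothetical, and the sketch assumes the ARN has the same shape as the one in the example output.

```python
TRAINING_LOG_GROUP = "/aws/sagemaker/TrainingJobs"


def training_job_log_location(training_job_arn):
    """Return (log group, log stream prefix) for a SageMaker training job ARN."""
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = training_job_arn.split(":", 5)
    if len(parts) != 6:
        raise ValueError("not a valid ARN: " + training_job_arn)
    resource_type, _, job_name = parts[5].partition("/")
    if resource_type != "training-job" or not job_name:
        raise ValueError("not a training job ARN: " + training_job_arn)
    # Log streams for the job are named "<job-name>/algo-...".
    return TRAINING_LOG_GROUP, job_name + "/"


arn = (
    "arn:aws:sagemaker:eu-west-1:1111222233334444:"
    "training-job/pipelines-lvejn1jl827b-trainabalonemodel-u9l9wjassg"
)
log_group, stream_prefix = training_job_log_location(arn)
print(log_group)
print(stream_prefix)
```

You can then pass the two values to the aws logs describe-log-streams command through its --log-group-name and --log-stream-name-prefix options to list the streams for that job. If the job failed before any container started, as can happen with an invalid VpcConfig, there might be no log streams to find.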
Related information
Log Amazon SageMaker events with Amazon CloudWatch