Why does my Amazon SageMaker pipeline execution fail?

I want to troubleshoot why my Amazon SageMaker pipeline execution failed.

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshoot AWS CLI errors. Also, make sure that you're using the most recent AWS CLI version.

To troubleshoot the failed pipeline execution in SageMaker, complete the following steps:

  1. Run the AWS CLI command list-pipeline-executions.
    Note: If the AWS CLI isn't configured on your local machine, then use AWS CloudShell.

    $ aws sagemaker list-pipeline-executions --pipeline-name test-pipeline-p-wzx9cplzrvdk

    The command returns a list of executions for your pipeline. The output looks similar to the following content:

    {
      "PipelineExecutionSummaries": [
        {
          "PipelineExecutionArn": "arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/lvejn1jl827b",
          "StartTime": "2022-09-27T12:56:44.646000+00:00",
          "PipelineExecutionStatus": "Failed",
          "PipelineExecutionDisplayName": "execution-1664283404791",
          "PipelineExecutionFailureReason": "Step failure: One or multiple steps failed."
        },
        {
          "PipelineExecutionArn": "arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/acvref9y1f47",
          "StartTime": "2022-09-27T12:13:28.762000+00:00",
          "PipelineExecutionStatus": "Succeeded",
          "PipelineExecutionDisplayName": "execution-1664280808943"
        }
      ]
    }
  2. Run the list-pipeline-execution-steps command to view the steps that failed:

    $ aws sagemaker list-pipeline-execution-steps \
    --pipeline-execution-arn arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/lvejn1jl827b \
    --query "PipelineExecutionSteps[?StepStatus=='Failed']"

    The output looks similar to the following content:

    [
      {
        "StepName": "TrainAbaloneModel",
        "StartTime": "2022-09-27T13:00:49.235000+00:00",
        "EndTime": "2022-09-27T13:01:50.056000+00:00",
        "StepStatus": "Failed",
        "AttemptCount": 0,
        "FailureReason": "ClientError: ClientError: Please ensure the security group provided is valid",
        "Metadata": {
          "TrainingJob": {
            "Arn": "arn:aws:sagemaker:eu-west-1:1111222233334444:training-job/pipelines-lvejn1jl827b-trainabalonemodel-u9l9wjassg"
          }
        }
      }
    ]

    In this example, the training job step failed because the job's VpcConfig object specified a nonexistent security group.

    If the FailureReason for the failed step isn't clear, then check the Amazon CloudWatch logs for the failed SageMaker job or endpoint. The logs for the training jobs are in the CloudWatch log group /aws/sagemaker/TrainingJobs. The log stream looks similar to the following example: example-training-job-name/algo-example-instance-number-in-cluster-example-epoch-timestamp.
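
The two steps above, together with the CloudWatch lookup, can be chained in a few shell functions. The following is a minimal sketch, not an official AWS script. The function names (latest_failed_execution, failed_steps, job_name_from_arn, training_log_streams) are hypothetical helpers introduced for illustration, and the sketch assumes that the failed step is a training job and that the first page of results contains the execution that you want.

```shell
# Sketch only: the helper names below are hypothetical, not part of the AWS CLI.

# Print the ARN of the most recent failed execution of pipeline $1.
# --no-paginate limits the lookup to the first page of results.
# Prints "None" if that page contains no failed execution.
latest_failed_execution() {
  aws sagemaker list-pipeline-executions \
    --pipeline-name "$1" \
    --sort-by CreationTime --sort-order Descending \
    --no-paginate \
    --query "PipelineExecutionSummaries[?PipelineExecutionStatus=='Failed'] | [0].PipelineExecutionArn" \
    --output text
}

# Print the name and failure reason of each failed step of execution ARN $1,
# one tab-separated row per step.
failed_steps() {
  aws sagemaker list-pipeline-execution-steps \
    --pipeline-execution-arn "$1" \
    --query "PipelineExecutionSteps[?StepStatus=='Failed'].[StepName, FailureReason]" \
    --output text
}

# Extract the job name from a job ARN (the text after the last "/").
job_name_from_arn() {
  printf '%s\n' "${1##*/}"
}

# List the CloudWatch log streams of training job $1.
# Stream names start with the training job name followed by "/".
training_log_streams() {
  aws logs describe-log-streams \
    --log-group-name /aws/sagemaker/TrainingJobs \
    --log-stream-name-prefix "$1/" \
    --query "logStreams[].logStreamName" \
    --output text
}

# Example usage (uncomment to run against your own account):
#   arn=$(latest_failed_execution test-pipeline-p-wzx9cplzrvdk)
#   failed_steps "$arn"
#   training_log_streams "$(job_name_from_arn "<training job ARN from the step Metadata>")"
```

A job that fails before its container starts, such as the security group error in the example above, might not produce any log streams. In that case, the FailureReason field is the main source of information.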

If the failure messages and CloudWatch logs don't explain the root cause of the failure, then open a support case with AWS Premium Support.

Provide the following information:

  1. Pipeline Amazon Resource Name (ARN)
  2. Execution ARN
  3. Previous successful execution ARNs
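
The three items in this list can be gathered from the AWS CLI before you open the case. The following is a rough sketch under stated assumptions: the function names (pipeline_arn, execution_arns_by_status, support_case_summary) are hypothetical, and the sketch lists every execution with the Succeeded status rather than only those that ran before the failed execution.

```shell
# Sketch only: the helper names below are hypothetical, not part of the AWS CLI.

# Print the ARN of pipeline $1.
pipeline_arn() {
  aws sagemaker describe-pipeline \
    --pipeline-name "$1" \
    --query PipelineArn \
    --output text
}

# Print the ARNs of the executions of pipeline $1 that have status $2
# (for example, Failed or Succeeded).
execution_arns_by_status() {
  aws sagemaker list-pipeline-executions \
    --pipeline-name "$1" \
    --query "PipelineExecutionSummaries[?PipelineExecutionStatus=='$2'].PipelineExecutionArn" \
    --output text
}

# Print a short summary for a support case.
# $1 is the pipeline name and $2 is the ARN of the failed execution.
support_case_summary() {
  printf 'Pipeline ARN: %s\n' "$(pipeline_arn "$1")"
  printf 'Failed execution ARN: %s\n' "$2"
  printf 'Previous successful execution ARNs: %s\n' "$(execution_arns_by_status "$1" Succeeded)"
}

# Example usage (uncomment to run against your own account):
#   support_case_summary test-pipeline-p-wzx9cplzrvdk "<ARN of the failed execution>"
```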

Related information

Log Amazon SageMaker events with Amazon CloudWatch

How do I troubleshoot errors when running Amazon SageMaker training jobs?

AWS OFFICIAL, updated 10 months ago