I want to troubleshoot why my Amazon SageMaker pipeline execution failed.
Resolution
To troubleshoot a failed pipeline execution in SageMaker, complete the following steps:
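**Note:** If you receive errors when you run AWS CLI commands, make sure that you're using the most recent version of the AWS CLI.
1. Run the AWS Command Line Interface (AWS CLI) command list-pipeline-executions.
**Note:** If you haven't configured the AWS CLI on your local machine, use AWS CloudShell in the console.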
$ aws sagemaker list-pipeline-executions --pipeline-name test-pipeline-p-wzx9cplzrvdk
The command returns a list of the executions for your pipeline. The output is similar to the following:
{
  "PipelineExecutionSummaries": [
    {
      "PipelineExecutionArn": "arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/lvejn1jl827b",
      "StartTime": "2022-09-27T12:56:44.646000+00:00",
      "PipelineExecutionStatus": "Failed",
      "PipelineExecutionDisplayName": "execution-1664283404791",
      "PipelineExecutionFailureReason": "Step failure: One or multiple steps failed."
    },
    {
      "PipelineExecutionArn": "arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/acvref9y1f47",
      "StartTime": "2022-09-27T12:13:28.762000+00:00",
      "PipelineExecutionStatus": "Succeeded",
      "PipelineExecutionDisplayName": "execution-1664280808943"
    }
  ]
}
2. Run the list-pipeline-execution-steps command to view the failed step:
$ aws sagemaker list-pipeline-execution-steps --pipeline-execution-arn arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/lvejn1jl827b
The output is similar to the following:
{
  "PipelineExecutionSteps": [
    {
      "StepName": "TrainAbaloneModel",
      "StartTime": "2022-09-27T13:00:49.235000+00:00",
      "EndTime": "2022-09-27T13:01:50.056000+00:00",
      "StepStatus": "Failed",
      "AttemptCount": 0,
      "FailureReason": "ClientError: ClientError: Please ensure the security group provided is valid",
      "Metadata": {
        "TrainingJob": {
          "Arn": "arn:aws:sagemaker:eu-west-1:1111222233334444:training-job/pipelines-lvejn1jl827b-trainabalonemodel-u9l9wjassg"
        }
      }
    },
    {
      "StepName": "PreprocessAbaloneData",
      "StartTime": "2022-09-27T12:56:45.595000+00:00",
      "EndTime": "2022-09-27T13:00:48.638000+00:00",
      "StepStatus": "Succeeded",
      "AttemptCount": 0,
      "Metadata": {
        "ProcessingJob": {
          "Arn": "arn:aws:sagemaker:eu-west-1:1111222233334444:processing-job/pipelines-lvejn1jl827b-preprocessabalonedat-6axq0kthyg"
        }
      }
    }
  ]
}
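In this example, the training job step (TrainAbaloneModel) failed because the security group specified in the job's VpcConfig object doesn't exist.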
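The two steps above can also be scripted. The following is a minimal Python sketch, not an official AWS tool, that works on the parsed JSON that the two commands return. The helper names `latest_failed_execution` and `failed_steps` are hypothetical, and the sample dictionaries are trimmed copies of the output shown above. In practice, you could likely obtain the same structures from the boto3 SageMaker client methods `list_pipeline_executions` and `list_pipeline_execution_steps`, assuming boto3 is installed and credentials are configured.

```python
def latest_failed_execution(response):
    """Return the most recent execution summary with status "Failed", or None."""
    failed = [
        e for e in response.get("PipelineExecutionSummaries", [])
        if e.get("PipelineExecutionStatus") == "Failed"
    ]
    # StartTime is an ISO 8601 string in AWS CLI output. Strings that share
    # the same UTC offset sort chronologically (assumption: they all do).
    return max(failed, key=lambda e: e["StartTime"], default=None)


def failed_steps(response):
    """Return (step name, failure reason, job ARN or None) for each failed step."""
    results = []
    for step in response.get("PipelineExecutionSteps", []):
        if step.get("StepStatus") != "Failed":
            continue
        # Metadata has one key per step type, for example "TrainingJob".
        arns = [
            m["Arn"] for m in step.get("Metadata", {}).values()
            if isinstance(m, dict) and "Arn" in m
        ]
        results.append(
            (step["StepName"], step.get("FailureReason", ""), arns[0] if arns else None)
        )
    return results


# Sample data trimmed from the example output in this article.
executions = {
    "PipelineExecutionSummaries": [
        {
            "PipelineExecutionArn": "arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/lvejn1jl827b",
            "StartTime": "2022-09-27T12:56:44.646000+00:00",
            "PipelineExecutionStatus": "Failed",
        },
        {
            "PipelineExecutionArn": "arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/acvref9y1f47",
            "StartTime": "2022-09-27T12:13:28.762000+00:00",
            "PipelineExecutionStatus": "Succeeded",
        },
    ]
}
steps = {
    "PipelineExecutionSteps": [
        {
            "StepName": "TrainAbaloneModel",
            "StepStatus": "Failed",
            "FailureReason": "ClientError: ClientError: Please ensure the security group provided is valid",
            "Metadata": {
                "TrainingJob": {
                    "Arn": "arn:aws:sagemaker:eu-west-1:1111222233334444:training-job/pipelines-lvejn1jl827b-trainabalonemodel-u9l9wjassg"
                }
            },
        },
        {
            "StepName": "PreprocessAbaloneData",
            "StepStatus": "Succeeded",
            "Metadata": {
                "ProcessingJob": {
                    "Arn": "arn:aws:sagemaker:eu-west-1:1111222233334444:processing-job/pipelines-lvejn1jl827b-preprocessabalonedat-6axq0kthyg"
                }
            },
        },
    ]
}

execution = latest_failed_execution(executions)
print(execution["PipelineExecutionArn"])
for name, reason, arn in failed_steps(steps):
    print(name, "|", reason, "|", arn)
```

The sketch keeps the filtering separate from any API call so that it can be checked against saved CLI output before it is pointed at a live account.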
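If the FailureReason of the failed step isn't clear, then check Amazon CloudWatch Logs for the failed SageMaker job or endpoint to troubleshoot further. You can view the logs for training jobs in the CloudWatch log group /aws/sagemaker/TrainingJobs. The log stream name is similar to the following: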
example-training-job-name/algo-example-instance-number-in-cluster-example-epoch-timestamp
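To find the matching log streams, you need the training job name, which is the last segment of the training job ARN in the failed step's Metadata. The following is a minimal sketch that assumes the log stream naming shown above. `log_stream_prefix` is a hypothetical helper, not part of any AWS SDK.

```python
LOG_GROUP = "/aws/sagemaker/TrainingJobs"


def log_stream_prefix(training_job_arn):
    """Return the log stream name prefix ("<job-name>/") for a training job ARN."""
    # Expected ARN shape: arn:aws:sagemaker:<region>:<account>:training-job/<job-name>
    parts = training_job_arn.split(":", 5)
    if len(parts) != 6:
        raise ValueError(f"not an ARN: {training_job_arn}")
    kind, _, job_name = parts[5].partition("/")
    if kind != "training-job" or not job_name:
        raise ValueError(f"not a training job ARN: {training_job_arn}")
    return job_name + "/"


arn = "arn:aws:sagemaker:eu-west-1:1111222233334444:training-job/pipelines-lvejn1jl827b-trainabalonemodel-u9l9wjassg"
prefix = log_stream_prefix(arn)
print(LOG_GROUP, prefix)
```

You can then pass the prefix to the AWS CLI, for example with `aws logs describe-log-streams --log-group-name /aws/sagemaker/TrainingJobs --log-stream-name-prefix <prefix>`, to list the streams for that job.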
Related information
Log Amazon SageMaker events with Amazon CloudWatch