I want to troubleshoot why my Amazon SageMaker pipeline execution failed.
Resolution
To troubleshoot a failed SageMaker pipeline execution, complete the following steps:
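**Note:** If you receive errors when you run AWS CLI commands, make sure that you're using the most recent version of the AWS CLI.
1. Run the AWS Command Line Interface (AWS CLI) command list-pipeline-executions.
**Note:** If the AWS CLI isn't configured on your local machine, use the AWS CloudShell console.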
$ aws sagemaker list-pipeline-executions --pipeline-name test-pipeline-p-wzx9cplzrvdk
This command returns a list of the executions for the pipeline, similar to the following:
{
"PipelineExecutionSummaries": [
{
"PipelineExecutionArn": "arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/lvejn1jl827b",
"StartTime": "2022-09-27T12:56:44.646000+00:00",
"PipelineExecutionStatus": "Failed",
"PipelineExecutionDisplayName": "execution-1664283404791",
"PipelineExecutionFailureReason": "Step failure: One or multiple steps failed."
},
{
"PipelineExecutionArn": "arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/acvref9y1f47",
"StartTime": "2022-09-27T12:13:28.762000+00:00",
"PipelineExecutionStatus": "Succeeded",
"PipelineExecutionDisplayName": "execution-1664280808943"
}
]
}
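When a pipeline has many executions, it can help to filter the response programmatically. The following is a minimal sketch, assuming that the JSON output of list-pipeline-executions has already been parsed into a Python dictionary with the shape shown above. The function name failed_executions is hypothetical and isn't part of the AWS CLI or any AWS SDK.

```python
def failed_executions(response):
    """Return (ARN, failure reason) pairs for every failed execution.

    `response` is assumed to be the parsed JSON output of
    `aws sagemaker list-pipeline-executions`.
    """
    failed = []
    for summary in response.get("PipelineExecutionSummaries", []):
        if summary.get("PipelineExecutionStatus") != "Failed":
            continue
        failed.append(
            (
                summary["PipelineExecutionArn"],
                # The reason may be absent, so fall back to a placeholder.
                summary.get("PipelineExecutionFailureReason", "(no reason given)"),
            )
        )
    return failed
```

For example, you can save the command output to a file, load it with json.load, and pass the result to failed_executions to get the execution ARN that the next step requires.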
2. Run the list-pipeline-execution-steps command to view the failed steps:
$ aws sagemaker list-pipeline-execution-steps --pipeline-execution-arn arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/lvejn1jl827b
The output is similar to the following:
{
"PipelineExecutionSteps": [
{
"StepName": "TrainAbaloneModel",
"StartTime": "2022-09-27T13:00:49.235000+00:00",
"EndTime": "2022-09-27T13:01:50.056000+00:00",
"StepStatus": "Failed",
"AttemptCount": 0,
"FailureReason": "ClientError: ClientError: Please ensure the security group provided is valid",
"Metadata": {
"TrainingJob": {
"Arn": "arn:aws:sagemaker:eu-west-1:1111222233334444:training-job/pipelines-lvejn1jl827b-trainabalonemodel-u9l9wjassg"
}
}
},
{
"StepName": "PreprocessAbaloneData",
"StartTime": "2022-09-27T12:56:45.595000+00:00",
"EndTime": "2022-09-27T13:00:48.638000+00:00",
"StepStatus": "Succeeded",
"AttemptCount": 0,
"Metadata": {
"ProcessingJob": {
"Arn": "arn:aws:sagemaker:eu-west-1:1111222233334444:processing-job/pipelines-lvejn1jl827b-preprocessabalonedat-6axq0kthyg"
}
}
}
]
}
In this case, the training job step failed because the security group that's specified in the job's VpcConfig object doesn't exist.
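To pull the failed steps out of a longer response, you can use a small helper. The following is a sketch under the assumption that the output of list-pipeline-execution-steps has been parsed into a Python dictionary with the shape shown above. The function name failed_steps and the keys of the returned dictionaries are hypothetical choices for illustration.

```python
def failed_steps(response):
    """Summarize each failed step: its name, failure reason, and job ARN.

    `response` is assumed to be the parsed JSON output of
    `aws sagemaker list-pipeline-execution-steps`.
    """
    results = []
    for step in response.get("PipelineExecutionSteps", []):
        if step.get("StepStatus") != "Failed":
            continue
        # Metadata holds one entry per step type (for example "TrainingJob"
        # or "ProcessingJob"); take the first entry that carries an ARN.
        job_arn = None
        for detail in step.get("Metadata", {}).values():
            if isinstance(detail, dict) and "Arn" in detail:
                job_arn = detail["Arn"]
                break
        results.append(
            {
                "StepName": step.get("StepName"),
                "FailureReason": step.get("FailureReason"),
                "JobArn": job_arn,
            }
        )
    return results
```

With the example output above, the helper returns a single entry for TrainAbaloneModel along with the failure reason and the ARN of the underlying training job.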
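If the failure reason for the failed step is unclear, check Amazon CloudWatch Logs for the failed SageMaker job or endpoint to troubleshoot further. You can view the logs for training jobs in the CloudWatch log group /aws/sagemaker/TrainingJobs. The log stream name looks similar to the following: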
example-training-job-name/algo-example-instance-number-in-cluster-example-epoch-timestamp
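Because the log stream name starts with the training job name, you can derive a stream prefix from the training job ARN that the failed step reports. The following is a minimal sketch. The helper names are hypothetical, and fetch_log_messages assumes that boto3 is installed and that AWS credentials and a Region are configured. A job that fails before it starts might not have any log events.

```python
LOG_GROUP = "/aws/sagemaker/TrainingJobs"


def training_job_name(arn):
    """Extract the job name from a SageMaker training job ARN."""
    # Expected form: arn:aws:sagemaker:<region>:<account>:training-job/<name>
    parts = arn.split(":", 5)
    if len(parts) != 6:
        raise ValueError(f"not a valid ARN: {arn!r}")
    kind, _, name = parts[5].partition("/")
    if kind != "training-job" or not name:
        raise ValueError(f"not a training job ARN: {arn!r}")
    return name


def log_stream_prefix(arn):
    """Build the log stream prefix: the job name followed by a slash."""
    return training_job_name(arn) + "/"


def fetch_log_messages(arn, limit=50):
    """Fetch up to `limit` log messages for the training job (calls AWS)."""
    import boto3  # imported lazily so the helpers above work without it

    logs = boto3.client("logs")
    response = logs.filter_log_events(
        logGroupName=LOG_GROUP,
        logStreamNamePrefix=log_stream_prefix(arn),
        limit=limit,
    )
    return [event["message"] for event in response["events"]]
```

The two pure helpers only manipulate strings, so you can check the derived prefix before you make any call to CloudWatch Logs.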
Related information
Log Amazon SageMaker events with Amazon CloudWatch