How do I troubleshoot scheduled notebook jobs in SageMaker Studio?
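I get an error when I run a scheduled notebook job in Amazon SageMaker Studio.
Short description
Two common errors can prevent scheduled notebook jobs from running in SageMaker Studio:
- AccessDenied errors
- UI errors when you try to update a job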
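Resolution
AccessDenied errors
AccessDenied errors typically involve one of the following:
- AWS Identity and Access Management (IAM) policies
- Virtual private cloud (VPC) endpoint policies
- Resource tag exceptions
IAM policy issues
AccessDenied errors are usually caused by missing or incorrect permissions, so follow the best practices for the IAM role that notebook jobs require. To establish the basic trust relationship, the IAM role needs the following trust policy: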
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "events.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
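As a quick self-check of the trust relationship above, the following minimal Python sketch reports which of the two service principals are missing from a trust policy document. The function name `missing_trusted_services` and the helper `_as_list` are hypothetical names chosen for this illustration. The check only looks at Allow statements for sts:AssumeRole and ignores conditions.

```python
from typing import Any, Dict, List, Set

# Service principals that the trust policy in this article allows.
REQUIRED_SERVICES = {"sagemaker.amazonaws.com", "events.amazonaws.com"}


def _as_list(value: Any) -> List[Any]:
    """IAM accepts a single value or a list in most fields; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def missing_trusted_services(trust_policy: Dict[str, Any]) -> Set[str]:
    """Return the required service principals that may not assume the role."""
    trusted: Set[str] = set()
    for statement in _as_list(trust_policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        if "sts:AssumeRole" not in _as_list(statement.get("Action")):
            continue
        principal = statement.get("Principal", {})
        # A principal can also be the string "*"; only dict principals name services.
        if isinstance(principal, dict):
            trusted.update(_as_list(principal.get("Service")))
    return REQUIRED_SERVICES - trusted
```

An empty result means both services are trusted. If you use boto3, the dictionary under `["Role"]["AssumeRolePolicyDocument"]` in the response of the IAM client's `get_role` call can be passed in as `trust_policy`.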
In addition, verify that your IAM role has the following permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::*:role/*",
      "Condition": {
        "StringLike": {
          "iam:PassedToService": [
            "sagemaker.amazonaws.com",
            "events.amazonaws.com"
          ]
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "events:TagResource",
        "events:DeleteRule",
        "events:PutTargets",
        "events:DescribeRule",
        "events:PutRule",
        "events:RemoveTargets",
        "events:DisableRule",
        "events:EnableRule"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:CreateBucket",
        "s3:PutBucketVersioning",
        "s3:PutEncryptionConfiguration"
      ],
      "Resource": "arn:aws:s3:::sagemaker-automated-execution-*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:ListTags"
      ],
      "Resource": [
        "arn:aws:sagemaker:*:*:user-profile/*",
        "arn:aws:sagemaker:*:*:space/*",
        "arn:aws:sagemaker:*:*:training-job/*",
        "arn:aws:sagemaker:*:*:pipeline/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:AddTags"
      ],
      "Resource": [
        "arn:aws:sagemaker:*:*:training-job/*",
        "arn:aws:sagemaker:*:*:pipeline/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateNetworkInterface",
        "ec2:CreateNetworkInterfacePermission",
        "ec2:CreateVpcEndpoint",
        "ec2:DeleteNetworkInterface",
        "ec2:DeleteNetworkInterfacePermission",
        "ec2:DescribeDhcpOptions",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSubnets",
        "ec2:DescribeVpcEndpoints",
        "ec2:DescribeVpcs",
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetAuthorizationToken",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetEncryptionConfiguration",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:GetObject",
        "sagemaker:DescribeDomain",
        "sagemaker:DescribeUserProfile",
        "sagemaker:DescribeSpace",
        "sagemaker:DescribeStudioLifecycleConfig",
        "sagemaker:DescribeImageVersion",
        "sagemaker:DescribeAppImageConfig",
        "sagemaker:CreateTrainingJob",
        "sagemaker:DescribeTrainingJob",
        "sagemaker:StopTrainingJob",
        "sagemaker:Search",
        "sagemaker:CreatePipeline",
        "sagemaker:DescribePipeline",
        "sagemaker:DeletePipeline",
        "sagemaker:StartPipelineExecution"
      ],
      "Resource": "*"
    }
  ]
}
For more information, see AWS managed policies for SageMaker notebooks.
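To compare a role's policy against the permissions listed above, a rough Python sketch such as the following can flag actions that no Allow statement covers. The names `allowed_action_patterns` and `missing_actions` are hypothetical, and the check is deliberately simplified: it matches action names only (with `*` and `?` wildcards, case-insensitively) and ignores Resource, Condition, NotAction, and Deny statements, so treat the result as a hint rather than an authorization decision.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, Iterable, List


def _as_list(value: Any) -> List[Any]:
    """Normalize a single value or a list to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def allowed_action_patterns(policy: Dict[str, Any]) -> List[str]:
    """Collect lowercased Action patterns from every Allow statement."""
    patterns: List[str] = []
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") == "Allow":
            patterns.extend(a.lower() for a in _as_list(statement.get("Action")))
    return patterns


def missing_actions(policy: Dict[str, Any], required: Iterable[str]) -> List[str]:
    """Return the required actions that no Allow statement pattern matches."""
    patterns = allowed_action_patterns(policy)
    return [
        action
        for action in required
        if not any(fnmatchcase(action.lower(), pattern) for pattern in patterns)
    ]
```

For example, calling `missing_actions(policy, ["iam:PassRole", "events:PutRule", "sagemaker:CreateTrainingJob"])` on a parsed policy document returns whichever of those actions the policy does not allow by name.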
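VPC endpoint issues
If you launch notebook jobs through a VPC endpoint, check the endpoint's configuration and policy. Make sure that you follow the steps and best practices for the relevant service endpoints:
- Amazon Elastic Compute Cloud (Amazon EC2) VPC endpoints
- Amazon EventBridge endpoints
- SageMaker endpoints
- Amazon Simple Storage Service (Amazon S3) endpoints
For Amazon S3 VPC endpoints, the most common error involves an endpoint that is restricted to a single account. For example, the following policy restricts access to the account with the ID 111122223333: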
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowSpecificAccountsPermission",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "s3:*",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "s3:ResourceAccount": "111122223333"
        }
      }
    }
  ]
}
In this case, you must also allow user actions to access the following buckets:
{
  "Action": [
    "s3:*"
  ],
  "Resource": [
    "arn:aws:s3:::sagemakerheadlessexecution-prod-*",
    "arn:aws:s3:::sagemakerheadlessexecution-prod-*/*"
  ],
  "Effect": "Allow",
  "Sid": "SCTASK14554266"
}
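One way to apply the bucket statement above is to merge it into the existing endpoint policy document before you update the endpoint. The following Python sketch shows that merge under a few assumptions: the function name `with_headless_execution_access` is hypothetical, the statement is copied from this article unchanged, and the function only builds a new policy dictionary without calling any AWS API.

```python
import copy
from typing import Any, Dict

# Statement copied from the article; it grants access to the
# sagemakerheadlessexecution-prod-* buckets and their objects.
HEADLESS_EXECUTION_STATEMENT: Dict[str, Any] = {
    "Action": ["s3:*"],
    "Resource": [
        "arn:aws:s3:::sagemakerheadlessexecution-prod-*",
        "arn:aws:s3:::sagemakerheadlessexecution-prod-*/*",
    ],
    "Effect": "Allow",
    "Sid": "SCTASK14554266",
}


def with_headless_execution_access(endpoint_policy: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the endpoint policy that includes the bucket statement."""
    updated = copy.deepcopy(endpoint_policy)
    statements = updated.get("Statement", [])
    if isinstance(statements, dict):  # a single statement is also valid JSON policy
        statements = [statements]
    updated["Statement"] = statements

    wanted = set(HEADLESS_EXECUTION_STATEMENT["Resource"])
    for statement in statements:
        resources = statement.get("Resource", [])
        if not isinstance(resources, list):
            resources = [resources]
        # Skip the append when an Allow statement already lists both resources.
        if statement.get("Effect") == "Allow" and wanted.issubset(resources):
            return updated

    statements.append(copy.deepcopy(HEADLESS_EXECUTION_STATEMENT))
    return updated
```

The function leaves its input untouched and is safe to call more than once, because the statement is appended only when no existing Allow statement already lists both bucket resources.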
Resource tag exceptions
Make sure that your IAM policy has the following permissions:
{
  "Effect": "Allow",
  "Action": [
    "events:TagResource",
    "events:DeleteRule",
    "events:PutTargets",
    "events:DescribeRule",
    "events:PutRule",
    "events:RemoveTargets",
    "events:DisableRule",
    "events:EnableRule"
  ],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
    }
  }
}
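The condition in this statement means the EventBridge actions are allowed only on rules that carry the tag sagemaker:is-scheduling-notebook-job with the value true. The following small Python sketch checks a rule's tag list against that condition. The function name `rule_matches_tag_condition` is hypothetical, and the input is assumed to be a list of Key/Value dictionaries, which is the shape that the EventBridge ListTagsForResource API returns under "Tags".

```python
from typing import Dict, List

# Tag key and value taken from the policy condition in this article.
TAG_KEY = "sagemaker:is-scheduling-notebook-job"
TAG_VALUE = "true"


def rule_matches_tag_condition(tags: List[Dict[str, str]]) -> bool:
    """Return True if the tag list would satisfy the StringEquals condition.

    Assumption: the tag key inside an aws:ResourceTag/... condition key is
    matched case-insensitively, while StringEquals compares the value exactly.
    """
    return any(
        tag.get("Key", "").lower() == TAG_KEY and tag.get("Value") == TAG_VALUE
        for tag in tags
    )
```

A rule that is tagged with the value "True" or that has no such tag fails this check, which is one way that a tag mismatch can surface as an AccessDenied error.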
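UI errors when you try to update a job
You might get a UI error when you try to create, describe, update, stop, or delete a notebook job. The same issue can occur with job definitions (scheduled jobs). To resolve the issue, first note the error message that appears in the UI. The message usually contains instructions or suggestions for resolving the problem.
If you can't resolve the error, complete the following steps:
- Take a screenshot of the error, and then save it as an image file.
- Create an HTTP Archive (HAR) file that captures the network traffic at the time of the UI error.
- Open a terminal on the Jupyter server in SageMaker Studio. Choose File, New, and then Terminal.
- Review the logs in /var/log/apps/app_container.log for exceptions, errors, or warnings that were logged when the UI error occurred.
- Contact AWS Support through the AWS Support Center. In your request, attach the error screenshot, app_container.log, and the HAR file.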
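Before you open a support case, it can help to triage the two artifacts from the steps above yourself. The following Python sketch is one possible approach, not an official tool: `find_problem_lines` flags log lines that mention errors, exceptions, tracebacks, or warnings, and `failed_har_requests` lists requests in a parsed HAR file whose response status is 400 or higher. Both function names are hypothetical, the keyword list is an assumption about what is worth flagging, and the HAR fields used (`log.entries[].request.url` and `log.entries[].response.status`) come from the HAR format.

```python
import re
from typing import Any, Dict, Iterable, List, Tuple

# Case-insensitive substring match, so "AccessDeniedException" is also caught.
PROBLEM_PATTERN = re.compile(r"error|exception|traceback|warn", re.IGNORECASE)


def find_problem_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, text) for every line that matches a problem keyword."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PROBLEM_PATTERN.search(line)
    ]


def failed_har_requests(har: Dict[str, Any]) -> List[Tuple[int, str]]:
    """Return (status, URL) for every HAR entry with an HTTP status of 400 or higher."""
    failures: List[Tuple[int, str]] = []
    for entry in har.get("log", {}).get("entries", []):
        status = entry.get("response", {}).get("status", 0)
        if status >= 400:
            failures.append((status, entry.get("request", {}).get("url", "")))
    return failures
```

In a Studio terminal, you could pass an open file handle for /var/log/apps/app_container.log to `find_problem_lines`, and pass the result of `json.load` on the saved HAR file to `failed_har_requests`. Substring matching can produce false positives, so use the output to narrow down where to look rather than as a diagnosis.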
AWS OFFICIAL, updated 1 year ago