How do I troubleshoot issues with scheduled notebook jobs in SageMaker AI?

I want to troubleshoot issues that occur in my Amazon SageMaker AI scheduled notebook jobs.

Resolution

Your notebook job doesn't run at the scheduled time

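Scheduled notebook jobs use Amazon EventBridge together with the SageMaker AI Training and Pipelines services. If your notebook doesn't run at the scheduled time, then your AWS Identity and Access Management (IAM) role might not have the required permissions.

The IAM role that you use to create and schedule notebook jobs can be either the SageMaker AI Studio domain role or a role that's attached to an individual user profile in the domain.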

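To grant scheduled notebook jobs permission to manage Amazon EventBridge (formerly Amazon CloudWatch Events) rules, add the following policy to the IAM role that you use to create and schedule the notebook jobs: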

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "events:TagResource",
                "events:DeleteRule",
                "events:PutTargets",
                "events:DescribeRule",
                "events:PutRule",
                "events:RemoveTargets",
                "events:DisableRule",
                "events:EnableRule"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
                }
            }
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::*:role/*",
            "Condition": {
                "StringLike": {
                    "iam:PassedToService": "events.amazonaws.com"
                }
            }
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": "sagemaker:ListTags",
            "Resource": "arn:aws:sagemaker:*:*:user-profile/*/*"
        }
    ]
}

**Note:** The preceding policy also allows listing tags on user profiles to identify notebooks that are tagged as scheduled jobs.
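
As a quick sanity check before you attach the policy, the following minimal Python sketch reports which of the EventBridge actions from the preceding policy are absent from a policy document. The names `missing_actions`, `allowed_actions`, and `sample_policy` are illustrative assumptions and not part of any AWS tooling. The check is a plain string comparison that doesn't expand wildcards or evaluate conditions.

```python
import json

# EventBridge actions listed in the preceding policy.
REQUIRED_EVENT_ACTIONS = {
    "events:TagResource",
    "events:DeleteRule",
    "events:PutTargets",
    "events:DescribeRule",
    "events:PutRule",
    "events:RemoveTargets",
    "events:DisableRule",
    "events:EnableRule",
}


def allowed_actions(policy):
    """Collect every action named in an Allow statement (no wildcard expansion)."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    found = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # "Action" may be a string or a list
            actions = [actions]
        found.update(actions)
    return found


def missing_actions(policy, required=REQUIRED_EVENT_ACTIONS):
    """Return the required actions that the policy doesn't name, sorted."""
    return sorted(set(required) - allowed_actions(policy))


# Hypothetical policy that omits two of the rule-management actions.
sample_policy = json.loads("""
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "events:TagResource",
                "events:DeleteRule",
                "events:PutTargets",
                "events:DescribeRule",
                "events:PutRule",
                "events:RemoveTargets"
            ],
            "Resource": "*"
        }
    ]
}
""")

print(missing_actions(sample_policy))
```

An empty list means that every action from the preceding policy is named in at least one Allow statement. It doesn't prove that the role is authorized, because conditions and explicit denies are not evaluated.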

The "Create" button is greyed out when you try to create a scheduled notebook job

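When you create a notebook job, if the Create button is greyed out and Additional options shows an exclamation mark (!), then retrieve a HAR file. Review the network capture from the period when the issue occurred to determine the cause. Check for an Amazon Virtual Private Cloud (Amazon VPC) misconfiguration or a runtime role permissions issue.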
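
A HAR file is JSON, so you can filter it for failed requests instead of reading the whole capture. The following minimal Python sketch lists entries whose response status indicates an error. It assumes the standard HAR layout (`log.entries[].request` and `log.entries[].response`). The function name `failed_requests`, the sample data, and the `example.com` URLs are hypothetical.

```python
import json


def failed_requests(har, min_status=400):
    """Return (status, method, url) for each HAR entry that looks like a failure.

    A status of 0 usually means that the browser never received a response
    (for example, a blocked or timed-out request), so it is reported too.
    """
    results = []
    for entry in har.get("log", {}).get("entries", []):
        status = entry.get("response", {}).get("status", 0)
        if status == 0 or status >= min_status:
            request = entry.get("request", {})
            results.append((status, request.get("method"), request.get("url")))
    return results


# Hypothetical, heavily trimmed HAR content. For a real capture, load the
# file instead, for example:
#     with open("capture.har", encoding="utf-8") as f:
#         har = json.load(f)
sample_har = json.loads("""
{
    "log": {
        "entries": [
            {
                "request": {"method": "GET", "url": "https://example.com/api/list"},
                "response": {"status": 200}
            },
            {
                "request": {"method": "POST", "url": "https://example.com/api/create"},
                "response": {"status": 403}
            }
        ]
    }
}
""")

for status, method, url in failed_requests(sample_har):
    print(status, method, url)
```

Entries with a 403 status tend to point to a permissions issue, while requests that never complete are more consistent with a network or VPC endpoint issue. Treat both as starting points for investigation, not as a diagnosis.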

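If you use Amazon VPC to run notebook jobs, then make sure that you have the required Amazon VPC endpoints.

Specify at least one private subnet and one security group. If you don't use private subnets, then use another configuration option. For more information, see Requirements to use VPC only mode.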

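If you configured the Amazon VPC endpoints, or if you don't use Amazon VPC to run notebook jobs, then configure the IAM permissions and the runtime role permissions.

To configure the IAM permissions, complete the following steps: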

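  1. Open the IAM console.
  2. In the navigation pane, choose Users.
  3. Select the IAM user that's associated with your notebook job.
  4. In the dropdown list, choose Add permissions, and then choose Create inline policy.
  5. Choose the JSON tab, and then add the following policy: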
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "EventBridgeSchedule",
                "Effect": "Allow",
                "Action": [
                    "events:TagResource",
                    "events:DeleteRule",
                    "events:PutTargets",
                    "events:DescribeRule",
                    "events:EnableRule",
                    "events:PutRule",
                    "events:RemoveTargets",
                    "events:DisableRule"
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
                    }
                }
            },
            {
                "Sid": "IAMPassrole",
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": "arn:aws:iam::*:role/*",
                "Condition": {
                    "StringLike": {
                        "iam:PassedToService": [
                            "sagemaker.amazonaws.com",
                            "events.amazonaws.com"
                        ]
                    }
                }
            },
            {
                "Sid": "IAMListRoles",
                "Effect": "Allow",
                "Action": "iam:ListRoles",
                "Resource": "*"
            },
            {
                "Sid": "S3ArtifactsAccess",
                "Effect": "Allow",
                "Action": [
                    "s3:PutEncryptionConfiguration",
                    "s3:CreateBucket",
                    "s3:PutBucketVersioning",
                    "s3:ListBucket",
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:GetEncryptionConfiguration",
                    "s3:DeleteObject",
                    "s3:GetBucketLocation"
                ],
                "Resource": [
                    "arn:aws:s3:::sagemaker-automated-execution-*"
                ]
            },
            {
                "Sid": "S3DriverAccess",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetObject",
                    "s3:GetBucketLocation"
                ],
                "Resource": [
                    "arn:aws:s3:::sagemakerheadlessexecution-*"
                ]
            },
            {
                "Sid": "SagemakerJobs",
                "Effect": "Allow",
                "Action": [
                    "sagemaker:DescribeTrainingJob",
                    "sagemaker:StopTrainingJob",
                    "sagemaker:DescribePipeline",
                    "sagemaker:CreateTrainingJob",
                    "sagemaker:DeletePipeline",
                    "sagemaker:CreatePipeline"
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
                    }
                }
            },
            {
                "Sid": "AllowSearch",
                "Effect": "Allow",
                "Action": "sagemaker:Search",
                "Resource": "*"
            },
            {
                "Sid": "SagemakerTags",
                "Effect": "Allow",
                "Action": [
                    "sagemaker:ListTags",
                    "sagemaker:AddTags"
                ],
                "Resource": [
                    "arn:aws:sagemaker:*:*:pipeline/*",
                    "arn:aws:sagemaker:*:*:space/*",
                    "arn:aws:sagemaker:*:*:training-job/*",
                    "arn:aws:sagemaker:*:*:user-profile/*"
                ]
            },
            {
                "Sid": "ECRImage",
                "Effect": "Allow",
                "Action": [
                    "ecr:GetAuthorizationToken",
                    "ecr:BatchGetImage"
                ],
                "Resource": "*"
            }
        ]
    }

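To configure the runtime role permissions, complete the following steps:

  1. Open the IAM console.
  2. In the navigation pane, choose Roles.
  3. Select the job runtime role that's associated with your notebook job.
  4. Choose the Trust relationships tab.
  5. Choose Edit trust policy, and then add the following policy: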
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": [
                        "sagemaker.amazonaws.com",
                        "events.amazonaws.com"
                    ]
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
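
To confirm that an existing trust policy already names both service principals, a small check such as the following can help. This is a minimal Python sketch. The function names and the sample trust policy are hypothetical, and the check only looks at `Service` principals in Allow statements that include `sts:AssumeRole`.

```python
# Service principals named in the preceding trust policy.
REQUIRED_SERVICES = {"sagemaker.amazonaws.com", "events.amazonaws.com"}


def as_list(value):
    """IAM policy fields may hold a single value or a list of values."""
    return value if isinstance(value, list) else [value]


def trusted_services(trust_policy):
    """Collect service principals that are allowed to call sts:AssumeRole."""
    services = set()
    for statement in as_list(trust_policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        if "sts:AssumeRole" not in as_list(statement.get("Action", [])):
            continue
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):  # for example, "Principal": "*"
            continue
        services.update(as_list(principal.get("Service", [])))
    return services


def missing_trusted_services(trust_policy):
    """Return the required service principals that the policy doesn't trust."""
    return sorted(REQUIRED_SERVICES - trusted_services(trust_policy))


# Hypothetical trust policy that trusts only SageMaker AI.
sample_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(missing_trusted_services(sample_trust_policy))
```

If the result isn't empty, then the listed principals are candidates to add to the `Service` list in the trust policy.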

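To add permissions to access resources, complete the following steps:

  1. Open the IAM console.

  2. In the left navigation pane, choose Roles.

  3. Select the job runtime role that's associated with your notebook job.

  4. In the dropdown list, choose Add permissions, and then choose Create inline policy.

  5. Choose the JSON tab, and then add the following policy: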

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PassroleForJobCreation",
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": "arn:aws:iam::*:role/*",
                "Condition": {
                    "StringLike": {
                        "iam:PassedToService": "sagemaker.amazonaws.com"
                    }
                }
            },
            {
                "Sid": "S3ForStoringArtifacts",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:GetBucketLocation"
                ],
                "Resource": "arn:aws:s3:::sagemaker-automated-execution-*"
            },
            {
                "Sid": "S3DriverAccess",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetObject",
                    "s3:GetBucketLocation"
                ],
                "Resource": [
                    "arn:aws:s3:::sagemakerheadlessexecution-*"
                ]
            },
            {
                "Sid": "SagemakerJobs",
                "Effect": "Allow",
                "Action": [
                    "sagemaker:StartPipelineExecution",
                    "sagemaker:CreateTrainingJob"
                ],
                "Resource": "*"
            },
            {
                "Sid": "ECRImage",
                "Effect": "Allow",
                "Action": [
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage",
                    "ecr:GetAuthorizationToken",
                    "ecr:BatchCheckLayerAvailability"
                ],
                "Resource": "*"
            }
        ]
    }

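    **Note:** In the preceding policy, you can add permissions for other resources that your notebook job must have access to.

  6. Choose Review policy. Then, enter a name for your policy.

  7. Choose Create policy.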
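
The policy in step 5 grants Amazon S3 access only to buckets whose names match two wildcard patterns. To decide whether a bucket that your notebook uses needs an additional statement, you can test its name against those patterns. The following is a minimal Python sketch. The function name and the bucket names are hypothetical, and `fnmatchcase` from the standard library is used as an approximation of IAM wildcard matching for patterns that contain only `*`.

```python
from fnmatch import fnmatchcase

# S3 resource patterns copied from the runtime role policy in step 5.
S3_RESOURCE_PATTERNS = [
    "arn:aws:s3:::sagemaker-automated-execution-*",
    "arn:aws:s3:::sagemakerheadlessexecution-*",
]


def bucket_is_covered(bucket_name, patterns=S3_RESOURCE_PATTERNS):
    """Return True if the bucket's ARN matches at least one resource pattern."""
    arn = f"arn:aws:s3:::{bucket_name}"
    return any(fnmatchcase(arn, pattern) for pattern in patterns)


# Hypothetical bucket names used only for illustration.
for name in (
    "sagemaker-automated-execution-111122223333-us-east-1",
    "my-team-notebook-output",
):
    print(name, bucket_is_covered(name))
```

A bucket that isn't covered is a candidate for an extra statement in the policy, as described in the note under step 5.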

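You receive an "Unable to find metadata for image" error

You receive the following error message: "Unable to find metadata for image arn in region: example-region".

This error occurs when you don't associate or select a user image when you create the notebook job. It can also occur when you try to attach a user image to the notebook job.

To resolve this issue, reschedule your notebook job at a later time. If the error persists, then contact AWS Support.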

Related information

Operationalize your Amazon SageMaker Studio notebooks as scheduled notebook jobs

Install policies and permissions for local Jupyter environments

AWS OFFICIAL Updated 1 year ago