How do I troubleshoot scheduled notebook jobs in SageMaker Studio?

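I get an error when I run a scheduled notebook job in Amazon SageMaker Studio.

Short description

Two common errors can block scheduled notebook jobs in SageMaker Studio:

  • AccessDenied errors
  • UI errors when you try to update a job

Resolution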

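AccessDenied errors

AccessDenied errors usually involve one of the following:

  • AWS Identity and Access Management (IAM) policies
  • Virtual private cloud (VPC) endpoint policies
  • Resource tag issues

IAM policy issues

AccessDenied errors are usually caused by missing or incorrect permissions. Follow the best practices for the IAM roles that notebook jobs require. To establish the basic trust relationship, the IAM role needs the following trust policy: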

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "events.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
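
As a quick local check, the trust policy above can be inspected with a short script. This is only a minimal sketch: the helper name `trusted_services` is hypothetical, and it reads only `Allow` statements for `sts:AssumeRole`, ignoring any conditions.

```python
import json

# The two service principals that the trust policy above allows.
REQUIRED_SERVICES = {"sagemaker.amazonaws.com", "events.amazonaws.com"}


def trusted_services(trust_policy: dict) -> set:
    """Collect service principals that may call sts:AssumeRole (hypothetical helper)."""
    services = set()
    for stmt in trust_policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRole" not in actions:
            continue
        principal = stmt.get("Principal", {})
        if not isinstance(principal, dict):
            continue
        svc = principal.get("Service", [])
        services.update([svc] if isinstance(svc, str) else svc)
    return services


# Same trust policy as shown above, pasted as a JSON string.
trust_policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow",
     "Principal": {"Service": "sagemaker.amazonaws.com"},
     "Action": "sts:AssumeRole"},
    {"Effect": "Allow",
     "Principal": {"Service": "events.amazonaws.com"},
     "Action": "sts:AssumeRole"}
  ]
}
""")

missing = REQUIRED_SERVICES - trusted_services(trust_policy)
print("Missing service principals:", sorted(missing))
```

An empty result means both services appear in the trust policy. Any name that is printed is a principal that still needs to be added to the role's trust relationship.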

Also, verify that your IAM role has the following permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::*:role/*",
      "Condition": {
        "StringLike": {
          "iam:PassedToService": [
            "sagemaker.amazonaws.com",
            "events.amazonaws.com"
          ]
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "events:TagResource",
        "events:DeleteRule",
        "events:PutTargets",
        "events:DescribeRule",
        "events:PutRule",
        "events:RemoveTargets",
        "events:DisableRule",
        "events:EnableRule"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:CreateBucket",
        "s3:PutBucketVersioning",
        "s3:PutEncryptionConfiguration"
      ],
      "Resource": "arn:aws:s3:::sagemaker-automated-execution-*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:ListTags"
      ],
      "Resource": [
        "arn:aws:sagemaker:*:*:user-profile/*",
        "arn:aws:sagemaker:*:*:space/*",
        "arn:aws:sagemaker:*:*:training-job/*",
        "arn:aws:sagemaker:*:*:pipeline/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:AddTags"
      ],
      "Resource": [
        "arn:aws:sagemaker:*:*:training-job/*",
        "arn:aws:sagemaker:*:*:pipeline/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateNetworkInterface",
        "ec2:CreateNetworkInterfacePermission",
        "ec2:CreateVpcEndpoint",
        "ec2:DeleteNetworkInterface",
        "ec2:DeleteNetworkInterfacePermission",
        "ec2:DescribeDhcpOptions",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSubnets",
        "ec2:DescribeVpcEndpoints",
        "ec2:DescribeVpcs",
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetAuthorizationToken",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetEncryptionConfiguration",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:GetObject",
        "sagemaker:DescribeDomain",
        "sagemaker:DescribeUserProfile",
        "sagemaker:DescribeSpace",
        "sagemaker:DescribeStudioLifecycleConfig",
        "sagemaker:DescribeImageVersion",
        "sagemaker:DescribeAppImageConfig",
        "sagemaker:CreateTrainingJob",
        "sagemaker:DescribeTrainingJob",
        "sagemaker:StopTrainingJob",
        "sagemaker:Search",
        "sagemaker:CreatePipeline",
        "sagemaker:DescribePipeline",
        "sagemaker:DeletePipeline",
        "sagemaker:StartPipelineExecution"
      ],
      "Resource": "*"
    }
  ]
}

For more information, see AWS managed policies for SageMaker notebooks.
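
To see which of the actions listed above a given policy document does not grant, a rough comparison can be scripted. The sketch below assumes the policy is available as a Python dictionary. The helper names, the short `required` list, and `example_policy` are illustrative only. It matches action names with wildcards and deliberately ignores `Resource`, `Condition`, and `Deny` statements, so it cannot replace the IAM policy simulator.

```python
from fnmatch import fnmatchcase


def allowed_action_patterns(policy: dict) -> list:
    """Return every Action pattern found in Allow statements (hypothetical helper)."""
    patterns = []
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        patterns.extend([actions] if isinstance(actions, str) else actions)
    return patterns


def missing_actions(policy: dict, required: list) -> list:
    """List required actions that no Allow pattern matches.

    IAM action names are case-insensitive, so both sides are lowercased.
    """
    patterns = [p.lower() for p in allowed_action_patterns(policy)]
    return [
        action for action in required
        if not any(fnmatchcase(action.lower(), p) for p in patterns)
    ]


# A small subset of the actions from the policy above; extend as needed.
required = [
    "iam:PassRole",
    "events:PutRule",
    "sagemaker:CreateTrainingJob",
    "s3:GetObject",
]

# Illustrative policy that is missing the S3 permissions.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["events:*", "sagemaker:CreateTrainingJob"],
         "Resource": "*"},
        {"Effect": "Allow",
         "Action": "iam:PassRole",
         "Resource": "arn:aws:iam::*:role/*"},
    ],
}

print(missing_actions(example_policy, required))
```

For the illustrative policy, only `s3:GetObject` is reported, because `events:*` covers `events:PutRule` and the other two actions are listed explicitly.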

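VPC endpoint issues

If you launch notebook jobs through a VPC endpoint, check the endpoint's configuration and policy. Make sure that you follow the steps and best practices for the relevant service endpoints.

For Amazon S3 VPC endpoints, the most common error involves an endpoint policy that restricts access to a single account. For example, the following policy allows access only to resources that are owned by account 111122223333: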

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowSpecificAccountsPermission",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "s3:*",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "s3:ResourceAccount": "111122223333"
        }
      }
    }
  ]
}

In this case, you must also allow access to the following buckets for user operations:

{
  "Action": [
    "s3:*"
  ],
  "Resource": [
    "arn:aws:s3:::sagemakerheadlessexecution-prod-*",
    "arn:aws:s3:::sagemakerheadlessexecution-prod-*/*"
  ],
  "Effect": "Allow",
  "Sid": "SCTASK14554266"
}
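
The effect of the extra statement can be illustrated with a small script. This is a sketch under several assumptions: the bucket statement is added alongside the account-restricted statement in the same policy, the object ARN is a made-up example, and the helper names are hypothetical. It only checks whether some `Allow` statement that is not tied to `s3:ResourceAccount` has a `Resource` pattern matching the ARN. It does not evaluate `Action`, `Principal`, or `Deny`, and it does not check that the policy is complete.

```python
from fnmatch import fnmatchcase

ACCOUNT_KEY = "s3:resourceaccount"  # condition keys are compared case-insensitively


def _as_list(value):
    return value if isinstance(value, list) else [value]


def is_account_restricted(statement: dict) -> bool:
    """True if the statement carries an s3:ResourceAccount condition."""
    for keys in statement.get("Condition", {}).values():
        if any(k.lower() == ACCOUNT_KEY for k in keys):
            return True
    return False


def unrestricted_allow_covers(policy: dict, arn: str) -> bool:
    """True if an Allow statement without the account condition matches the ARN."""
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow" or is_account_restricted(stmt):
            continue
        if any(fnmatchcase(arn, p) for p in _as_list(stmt.get("Resource", []))):
            return True
    return False


# The account-restricted endpoint policy from the example above.
restricted = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowSpecificAccountsPermission",
        "Effect": "Allow",
        "Principal": {"AWS": "*"},
        "Action": "s3:*",
        "Resource": "*",
        "Condition": {"StringEquals": {"s3:ResourceAccount": "111122223333"}},
    }],
}

# The additional bucket statement from the example above.
bucket_statement = {
    "Action": ["s3:*"],
    "Resource": [
        "arn:aws:s3:::sagemakerheadlessexecution-prod-*",
        "arn:aws:s3:::sagemakerheadlessexecution-prod-*/*",
    ],
    "Effect": "Allow",
    "Sid": "SCTASK14554266",
}

# Hypothetical object ARN, used only to exercise the pattern match.
object_arn = "arn:aws:s3:::sagemakerheadlessexecution-prod-example/notebook.ipynb"

before = unrestricted_allow_covers(restricted, object_arn)
combined = dict(restricted, Statement=restricted["Statement"] + [bucket_statement])
after = unrestricted_allow_covers(combined, object_arn)
print("covered before:", before, "| covered after:", after)
```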

Resource tag issues

Make sure that your IAM policy has the following permissions:

{
  "Effect": "Allow",
  "Action": [
    "events:TagResource",
    "events:DeleteRule",
    "events:PutTargets",
    "events:DescribeRule",
    "events:PutRule",
    "events:RemoveTargets",
    "events:DisableRule",
    "events:EnableRule"
  ],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
    }
  }
}
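
Because of the `StringEquals` condition, the statement above applies only to Amazon EventBridge rules that carry the tag `sagemaker:is-scheduling-notebook-job` with the exact value `true`. The following sketch checks a rule's tags for that pair. It assumes the tags are already available as a list of `Key`/`Value` pairs, the shape that the EventBridge ListTagsForResource API returns. The helper name and the sample tag lists are hypothetical.

```python
TAG_KEY = "sagemaker:is-scheduling-notebook-job"
TAG_VALUE = "true"  # StringEquals is case-sensitive, so "True" would not match


def has_scheduling_tag(tags: list) -> bool:
    """Return True if the tag list contains the exact key/value pair (hypothetical helper)."""
    return any(
        tag.get("Key") == TAG_KEY and tag.get("Value") == TAG_VALUE
        for tag in tags
    )


# Illustrative tag lists, not taken from a real rule.
tagged_rule = [
    {"Key": "team", "Value": "ml-platform"},
    {"Key": TAG_KEY, "Value": "true"},
]
untagged_rule = [{"Key": "team", "Value": "ml-platform"}]

print(has_scheduling_tag(tagged_rule))
print(has_scheduling_tag(untagged_rule))
```

If the check fails for a rule that a notebook job schedule created, the tag might have been removed or changed, and the tag-conditioned permissions no longer apply to that rule.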

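UI errors when you try to update a job

When you try to create, describe, update, stop, or delete a notebook job, you might get a UI error. The same issue can occur with job definitions (scheduled jobs). To resolve the issue, first note the error message that appears in the UI. The message usually includes instructions or suggestions for resolving the problem.

If you can't resolve the error, complete the following steps:

  1. Take a screenshot of the error, and then save it as an image file.
  2. Create an HTTP Archive (HAR) file to capture the network traffic at the time that the UI error occurs.
  3. Go to the Jupyter server terminal in SageMaker Studio. Choose File, New, Terminal.
  4. Review the logs in /var/log/apps/app_container.log for exceptions, errors, or warnings from around the time that the UI error occurred.
  5. Contact AWS Support through the AWS Support Center. In your request, attach the error screenshot, app_container.log, and the HAR file.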
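
Step 4 can be partly automated from the Studio terminal. The following is a minimal sketch that assumes the log is plain text and that problem lines contain one of a few common keywords. The log's actual format is not documented here, so the keyword list, the helper names, and the sample lines are assumptions.

```python
import re

# Case-insensitive keywords that often mark a problem line (an assumption).
PATTERN = re.compile(r"error|exception|warning|traceback", re.IGNORECASE)


def find_suspicious_lines(lines) -> list:
    """Return (line_number, text) pairs for lines that match the keywords."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PATTERN.search(line)
    ]


def scan_log(path: str = "/var/log/apps/app_container.log") -> list:
    """Scan the log file named in step 4; undecodable bytes are replaced."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_suspicious_lines(handle)


# Made-up sample lines so that the sketch runs without the real log file.
sample = [
    "server started\n",
    "WARNING: retrying request\n",
    "request finished\n",
    "botocore.exceptions.ClientError: AccessDenied\n",
]

for number, text in find_suspicious_lines(sample):
    print(f"{number}: {text}")
```

In a Studio terminal, call `scan_log()` instead of using the sample list, and compare the reported lines with the time that the UI error appeared.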
AWS OFFICIAL, updated 1 year ago