How do I troubleshoot errors that I receive when a scheduled notebook job tries to run in SageMaker AI Studio?

I want to troubleshoot errors that I receive when scheduled notebook jobs try to run in Amazon SageMaker AI Studio.

Resolution

Troubleshoot AccessDenied errors

When a scheduled notebook job tries to run, you might receive an "AccessDenied" error for the following reasons:

  • You don't have the required AWS Identity and Access Management (IAM) policies.
  • You don't have the required Amazon Virtual Private Cloud (Amazon VPC) endpoint policies.
  • You have resource tag exceptions.

IAM policy issues

Make sure that the IAM role that your notebook job uses has the following trust policy attached. The policy establishes the base trust relationship that allows SageMaker AI and Amazon EventBridge to assume the role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "events.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
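
As a quick local check, the following Python sketch reports which of the two service principals from the preceding trust policy are missing from a trust policy document that you already have as JSON (for example, the AssumeRolePolicyDocument field that the IAM GetRole API returns). The function name missing_trusted_services and the sample policy are illustrative and aren't part of any AWS SDK. The check looks only at Allow statements for sts:AssumeRole and ignores conditions.

```python
import json

# Service principals taken from the trust policy shown above.
REQUIRED_SERVICES = {"sagemaker.amazonaws.com", "events.amazonaws.com"}


def _as_list(value):
    """IAM JSON allows a single string or object where a list is expected."""
    if value is None:
        return []
    return [value] if isinstance(value, (str, dict)) else list(value)


def missing_trusted_services(trust_policy: dict) -> set:
    """Return the required service principals that can't assume the role."""
    trusted = set()
    for statement in _as_list(trust_policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        if "sts:AssumeRole" not in _as_list(statement.get("Action")):
            continue
        principal = statement.get("Principal")
        if isinstance(principal, dict):
            trusted.update(_as_list(principal.get("Service")))
    return REQUIRED_SERVICES - trusted


# Hypothetical trust policy that trusts only SageMaker AI.
example_policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "sagemaker.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
""")

print(missing_trusted_services(example_policy))  # {'events.amazonaws.com'}
```

An empty set means that both services are trusted. A non-empty set names the principals to add to the role's trust policy.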

Verify that your IAM role has the following permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::*:role/*",
      "Condition": {
        "StringLike": {
          "iam:PassedToService": [
            "sagemaker.amazonaws.com",
            "events.amazonaws.com"
          ]
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "events:TagResource",
        "events:DeleteRule",
        "events:PutTargets",
        "events:DescribeRule",
        "events:PutRule",
        "events:RemoveTargets",
        "events:DisableRule",
        "events:EnableRule"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:CreateBucket",
        "s3:PutBucketVersioning",
        "s3:PutEncryptionConfiguration"
      ],
      "Resource": "arn:aws:s3:::sagemaker-automated-execution-*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:ListTags"
      ],
      "Resource": [
        "arn:aws:sagemaker:*:*:user-profile/*",
        "arn:aws:sagemaker:*:*:space/*",
        "arn:aws:sagemaker:*:*:training-job/*",
        "arn:aws:sagemaker:*:*:pipeline/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:AddTags"
      ],
      "Resource": [
        "arn:aws:sagemaker:*:*:training-job/*",
        "arn:aws:sagemaker:*:*:pipeline/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateNetworkInterface",
        "ec2:CreateNetworkInterfacePermission",
        "ec2:CreateVpcEndpoint",
        "ec2:DeleteNetworkInterface",
        "ec2:DeleteNetworkInterfacePermission",
        "ec2:DescribeDhcpOptions",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSubnets",
        "ec2:DescribeVpcEndpoints",
        "ec2:DescribeVpcs",
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetAuthorizationToken",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetEncryptionConfiguration",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:GetObject",
        "sagemaker:DescribeDomain",
        "sagemaker:DescribeUserProfile",
        "sagemaker:DescribeSpace",
        "sagemaker:DescribeStudioLifecycleConfig",
        "sagemaker:DescribeImageVersion",
        "sagemaker:DescribeAppImageConfig",
        "sagemaker:CreateTrainingJob",
        "sagemaker:DescribeTrainingJob",
        "sagemaker:StopTrainingJob",
        "sagemaker:Search",
        "sagemaker:CreatePipeline",
        "sagemaker:DescribePipeline",
        "sagemaker:DeletePipeline",
        "sagemaker:StartPipelineExecution"
      ],
      "Resource": "*"
    }
  ]
}
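
To compare a role's permissions policy against the preceding list, a rough sketch such as the following can flag actions that no Allow statement mentions. It's a simplified approximation and not how IAM evaluates policies: it matches action names and wildcards only and ignores Resource, Condition, and Deny statements. The names missing_actions and example_policy are hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM JSON allows a single string or object where a list is expected."""
    if value is None:
        return []
    return [value] if isinstance(value, (str, dict)) else list(value)


def missing_actions(policy: dict, required: list) -> list:
    """Return the required actions that no Allow statement in the policy names."""
    allowed_patterns = []
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        # Action names are case-insensitive, so compare in lowercase.
        allowed_patterns.extend(a.lower() for a in _as_list(statement.get("Action")))
    return [
        action
        for action in required
        if not any(fnmatchcase(action.lower(), p) for p in allowed_patterns)
    ]


# Hypothetical policy that covers only part of the list above.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "iam:PassRole", "Resource": "*"},
        {"Effect": "Allow", "Action": ["sagemaker:Describe*"], "Resource": "*"},
    ],
}

# A few of the actions from the policy shown above.
required = [
    "iam:PassRole",
    "sagemaker:DescribeDomain",
    "sagemaker:CreateTrainingJob",
    "events:PutRule",
]

print(missing_actions(example_policy, required))
# ['sagemaker:CreateTrainingJob', 'events:PutRule']
```

Because the sketch skips resources and conditions, an empty result doesn't prove that the job can run. Treat it as a first pass before you review the policy in full.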

For more information, see AWS managed policies for SageMaker AI Notebooks.

VPC endpoint issues

If you start the notebook job through an Amazon VPC endpoint, then check the endpoint's configuration and policy. Make sure that you complete the required steps and follow the best practices for the relevant AWS service endpoint.

For Amazon S3 VPC endpoints, you might receive an error related to an endpoint that's restricted to a single AWS account. For example, the following policy restricts access to an account with the ID 111122223333:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowSpecificAccountsPermission",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "s3:*",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "s3:ResourceAccount": "111122223333"
        }
      }
    }
  ]
}

To resolve this issue, add a statement to the endpoint policy that also allows access to the following S3 buckets:

{
  "Action": [
    "s3:*"
  ],
  "Resource": [
    "arn:aws:s3:::sagemakerheadlessexecution-prod-*",
    "arn:aws:s3:::sagemakerheadlessexecution-prod-*/*"
  ],
  "Effect": "Allow",
  "Sid": "SCTASK14554266"
}
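
The following Python sketch shows one way to merge that statement into an existing endpoint policy document before you apply it. It works only on the JSON in memory and doesn't call AWS. The helper name with_statement is hypothetical, and the Principal element in the added statement is an assumption modeled on the account-restricted example, because the statement above doesn't include one.

```python
import copy

# The statement from above. The "Principal" key is an assumption that mirrors
# the account-restricted example policy; the original fragment omits it.
HEADLESS_EXECUTION_STATEMENT = {
    "Sid": "SCTASK14554266",
    "Effect": "Allow",
    "Principal": {"AWS": "*"},
    "Action": ["s3:*"],
    "Resource": [
        "arn:aws:s3:::sagemakerheadlessexecution-prod-*",
        "arn:aws:s3:::sagemakerheadlessexecution-prod-*/*",
    ],
}


def with_statement(policy: dict, statement: dict) -> dict:
    """Return a copy of the policy that includes the statement exactly once."""
    updated = copy.deepcopy(policy)
    statements = updated.get("Statement", [])
    if isinstance(statements, dict):
        # A policy may hold a single statement object instead of a list.
        statements = [statements]
    if not any(s.get("Sid") == statement["Sid"] for s in statements):
        statements.append(copy.deepcopy(statement))
    updated["Statement"] = statements
    return updated


# The account-restricted endpoint policy from the example above.
restricted_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSpecificAccountsPermission",
            "Effect": "Allow",
            "Principal": {"AWS": "*"},
            "Action": "s3:*",
            "Resource": "*",
            "Condition": {"StringEquals": {"s3:ResourceAccount": "111122223333"}},
        }
    ],
}

updated_policy = with_statement(restricted_policy, HEADLESS_EXECUTION_STATEMENT)
print([s["Sid"] for s in updated_policy["Statement"]])
# ['AllowSpecificAccountsPermission', 'SCTASK14554266']
```

The helper leaves the original policy unchanged and skips the append when a statement with the same Sid is already present, so you can run it more than once on the same policy.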

Resource tag exceptions

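Make sure that your IAM policy has the following permissions. The condition limits these Amazon EventBridge actions to resources that have the tag sagemaker:is-scheduling-notebook-job set to true: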

{
  "Effect": "Allow",
  "Action": [
    "events:TagResource",
    "events:DeleteRule",
    "events:PutTargets",
    "events:DescribeRule",
    "events:PutRule",
    "events:RemoveTargets",
    "events:DisableRule",
    "events:EnableRule"
  ],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
    }
  }
}

Troubleshoot UI errors

When you try to create, describe, update, stop, or delete a notebook job, you might receive a UI error. You might also receive this error when you use job definitions (scheduled jobs). To troubleshoot, review the error message that appears in the UI. This message might contain directions or suggested actions to resolve the issue.

If you can't resolve the error, then complete the following steps:

  1. Take a screenshot of the error and save it as an image file.
  2. Create an HTTP Archive (HAR) file that captures the network traffic when the UI error occurs.
  3. Open the SageMaker AI Studio Jupyter server terminal. Choose File, New, Terminal.
  4. Check the logs in /var/log/apps/app_container.log for exceptions, errors, or warnings at the time of the UI error.
  5. Contact AWS Support. In your request, attach the error screenshot, the app_container.log, and the HAR file.
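
For step 4, a small Python sketch such as the following can pull likely problem lines out of the log from the Studio terminal. The format of app_container.log isn't documented here, so the keyword list is an assumption, and the sample lines are invented for illustration. The function names are hypothetical.

```python
import re

# Assumed keywords; adjust them after you look at the real log format.
SUSPECT = re.compile(r"exception|error|warn|traceback", re.IGNORECASE)


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines that contain a suspect keyword."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if SUSPECT.search(line)
    ]


def scan_log(path="/var/log/apps/app_container.log"):
    """Scan the log file named in step 4 and return its suspect lines."""
    with open(path, encoding="utf-8", errors="replace") as log_file:
        return find_suspect_lines(log_file)


# Invented sample lines that stand in for real log content.
sample_lines = [
    "INFO starting server\n",
    "WARNING token expires soon\n",
    "INFO request ok\n",
    "ERROR AccessDenied when calling CreateTrainingJob\n",
]

for number, text in find_suspect_lines(sample_lines):
    print(number, text)
# 2 WARNING token expires soon
# 4 ERROR AccessDenied when calling CreateTrainingJob
```

In the Studio terminal, call scan_log() to run the same filter against the real file, and then narrow the results to the time of the UI error.
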

AWS OFFICIAL
Updated 4 months ago
3 Comments

Also having issues with custom images in notebook jobs. I get an error running the update-domain call because in the example there is no sample ImageName or AppImageConfigName, can you please clarify what these values should be? Can these be adjusted via console? Do we have to create a new image version for an existing image after applying? Also I'm unable to find the ARN in /opt/.sagemakerinternal/internal-metadata.json

replied 3 years ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
MODERATOR
replied 3 years ago

Hello @AR for specific use case scenarios like a custom image I would recommend reaching out to AWS support for further details and elaborations on this

AWS
SUPPORT ENGINEER
replied 3 years ago