I want to configure an Amazon SageMaker AI notebook instance to use AWS Glue interactive sessions, PySparkProcessor, or Sparkmagic kernels to run big data workloads.
Resolution
To configure a SageMaker AI notebook instance to run Spark and PySpark workloads, complete one of the following resolutions.
Configure AWS Glue interactive sessions for notebook instances
For a serverless option to run Apache Spark and PySpark workloads, configure AWS Glue interactive sessions for your notebook instances. When you start your notebook instance, the interactive session creates a PySpark kernel and a Spark kernel. You can then use one of the installed kernels from the Jupyter or JupyterLab application Launcher tab.
Grant permissions for AWS Glue interactive sessions
Complete the following steps:
- Open the AWS Identity and Access Management (IAM) console.
- In the navigation pane, under Access management, choose Roles.
- Select the execution role that you use for your SageMaker AI notebook instance.
- Create the following inline custom IAM policy in the JSON editor:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "uniqueStatementId",
      "Effect": "Allow",
      "Action": [
        "iam:GetRole",
        "iam:PassRole",
        "sts:GetCallerIdentity"
      ],
      "Resource": "YOUR-IAM-ROLE-ARN"
    }
  ]
}
Note: Replace YOUR-IAM-ROLE-ARN with the Amazon Resource Name (ARN) of your notebook instance's IAM execution role.
- To grant AWS Glue permissions to the IAM role, choose Attach policies from the Add Permissions dropdown menu. Then, search for AwsGlueSessionUserRestrictedServiceRole and choose Attach policies.
- To allow AWS Glue to assume the IAM role, choose the Trust relationships tab, and then add glue.amazonaws.com to the service list. Confirm that your trust policy is similar to the following example:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "sagemaker.amazonaws.com",
          "glue.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
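If you prefer to script the preceding IAM changes instead of using the console, the following minimal boto3 sketch shows one way to apply them. The role name and inline policy name are placeholders, and you should verify the ARN of the AwsGlueSessionUserRestrictedServiceRole managed policy in the IAM console before you run it.
import json
import boto3

iam = boto3.client("iam")

role_name = "YOUR-NOTEBOOK-EXECUTION-ROLE"  # placeholder: your notebook instance's execution role name
role_arn = iam.get_role(RoleName=role_name)["Role"]["Arn"]

# Inline policy that allows the role to get and pass itself to AWS Glue
inline_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "uniqueStatementId",
            "Effect": "Allow",
            "Action": ["iam:GetRole", "iam:PassRole", "sts:GetCallerIdentity"],
            "Resource": role_arn,
        }
    ],
}
iam.put_role_policy(
    RoleName=role_name,
    PolicyName="GlueInteractiveSessionsPassRole",  # placeholder policy name
    PolicyDocument=json.dumps(inline_policy),
)

# Attach the AWS managed policy for AWS Glue interactive sessions (verify the ARN in your account)
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/service-role/AwsGlueSessionUserRestrictedServiceRole",
)

# Trust policy that lets both SageMaker AI and AWS Glue assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": ["sagemaker.amazonaws.com", "glue.amazonaws.com"]},
            "Action": "sts:AssumeRole",
        }
    ],
}
iam.update_assume_role_policy(RoleName=role_name, PolicyDocument=json.dumps(trust_policy))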
Install AWS Glue interactive sessions kernels in notebook instances
Complete the following steps:
- To automatically install the AWS Glue kernels during startup, create the following lifecycle configuration script:
#!/bin/bash
set -e
# Start conda environment
sudo -u ec2-user -i <<'EOF'
# Activate conda default environment
source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
# Install/upgrade packages for boto3 and aws-glue-sessions
pip3 install --upgrade jupyter boto3 aws-glue-sessions
echo "AWS Glue Sessions Installed Successfully"
# Install Glue kernels
install-glue-kernels
echo "Glue Kernels Installed Successfully"
# Deactivate conda environment
conda deactivate
EOF
# Ensure script reports success
echo "Lifecycle configuration complete!"
systemctl restart jupyter-server
sudo touch /home/ec2-user/glue_ready
- Navigate to your notebook instance, and then confirm that the instance isn't in the InService state.
- To attach the lifecycle configuration script, choose Notebook instance settings, and then choose Edit.
- Under Additional configuration, select your lifecycle configuration script from the Lifecycle configuration dropdown list.
- Choose Update notebook instance.
Note: The notebook instance can take several minutes to update.
- Start your notebook instance.
- Open JupyterLab, and then choose the Launcher tab.
- Choose either the AWS Glue Spark or AWS Glue PySpark kernel to run your data workloads, as shown in the example cell after these steps.
Note: After you process your workload, shut down the kernel in JupyterLab so that you don't continue to incur charges in AWS Glue.
For more information about how to configure your AWS Glue interactive session, see Configuring AWS Glue interactive sessions for Jupyter and AWS Glue Studio notebooks.
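After the AWS Glue PySpark kernel starts, a notebook cell similar to the following sketch configures the interactive session with cell magics and runs a small PySpark job. The session settings and the Amazon S3 path are placeholders, not values from this article; adjust them to your own account and data.
%idle_timeout 30
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# The first code cell starts the AWS Glue interactive session
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read a sample dataset and inspect it (placeholder S3 path)
df = spark.read.json("s3://YOUR-BUCKET/YOUR-PREFIX/")
df.printSchema()
df.show(5)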
Configure PySparkProcessor to run SageMaker AI processing jobs
You can use PySparkProcessor to run PySpark scripts as processing jobs. For more information, see PySparkProcessor on the SageMaker Read the Docs website.
Note: The PySparkProcessor uses pre-built SageMaker AI Spark containers. You can configure only the framework_version, py_version, and container_version arguments.
For example notebooks that you can use, see sagemaker-spark-processing.ipynb on the GitHub website.
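As a sketch of how the processor is typically configured, the following example assumes a local PySpark script named preprocess.py, a placeholder execution role ARN, and placeholder Amazon S3 paths. Check the PySparkProcessor documentation for the framework_version values that the pre-built SageMaker AI Spark containers support.
from sagemaker.spark.processing import PySparkProcessor

# Placeholder values: replace the role ARN, script name, and S3 paths with your own
spark_processor = PySparkProcessor(
    base_job_name="spark-preprocess",
    framework_version="3.3",  # assumption: one of the supported Spark versions
    role="YOUR-IAM-ROLE-ARN",
    instance_count=2,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1200,
)

# Run the local PySpark script as a SageMaker AI processing job
spark_processor.run(
    submit_app="preprocess.py",
    arguments=[
        "--input", "s3://YOUR-BUCKET/input/",
        "--output", "s3://YOUR-BUCKET/output/",
    ],
    spark_event_logs_s3_uri="s3://YOUR-BUCKET/spark-event-logs/",  # optional
)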
Configure an Amazon EMR backend cluster for SageMaker AI Sparkmagic kernels
Sparkmagic kernels require a backend Amazon EMR cluster. If you use Sparkmagic kernels without the backend Amazon EMR cluster, then you receive the following error message:
"The code failed because of a fatal error: Error sending http request and maximum retry encountered..."
To set up a Spark cluster that runs on Amazon EMR to connect to your notebook instance, see Build Amazon SageMaker AI notebooks backed by Spark in Amazon EMR.
After you confirm the connection, run the following command to upgrade sagemaker-studio-analytics-extension:
pip install --upgrade sagemaker-studio-analytics-extension
The latest versions of sagemaker-studio-analytics-extension increase the default server session timeout from 60 seconds to 120 seconds. For more information, see Troubleshoot Livy connections hanging or failing.
After you update the extension, launch a Jupyter notebook with a PySpark kernel and test the connection. If the connection is successful, then you see a message that's similar to the following one:
"Starting Spark application … SparkSession available as 'spark' "
After you connect, run your PySpark code from the notebook. The Sparkmagic kernel sends the code to the Amazon EMR cluster for processing.
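For example, a cell similar to the following sketch runs on the Amazon EMR cluster through the PySpark (Sparkmagic) kernel. It uses the spark session that the kernel reports as available after the connection succeeds; the DataFrame logic is only illustrative.
# Runs remotely on the Amazon EMR cluster; 'spark' is the SparkSession
# that the Sparkmagic PySpark kernel creates when the connection succeeds.
from pyspark.sql import functions as F

df = spark.range(1000).withColumn("squared", F.col("id") * F.col("id"))
df.agg(F.avg("squared").alias("avg_squared")).show()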
Related information
Set up IAM permissions for AWS Glue Studio
Building AWS Glue jobs with interactive sessions