
How do I upgrade my Python version on Amazon EMR and configure PySpark jobs to use the upgraded Python version?


I want to upgrade my Python version on Amazon EMR and configure PySpark jobs to use the upgraded Python version.

Short description

Cluster instances and system applications use different Python versions based on the following Amazon EMR release versions:

  • Amazon EMR release versions 4.6.0-5.19.0: Python 3.4 is installed on the cluster instances. Python 2.7 is the system default.
  • Amazon EMR release versions 5.20.0 and later: Python 3.6 is installed on the cluster instances. For Amazon EMR versions 5.20.0-5.29.0, Python 2.7 is the system default. For versions 5.30.0 and later, Python 3 is the system default.
  • Amazon EMR release versions 6.0.0 and later: Python 3.7 is installed on the cluster instances. Python 3 is the system default.
  • Amazon EMR release versions 7.0.0 and later: Python 3.9 is installed on the cluster instances. Python 3 is the system default.

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

To upgrade your Python version, point the PYSPARK_PYTHON environment variable in the spark-env classification to the path where the new Python version is installed. To find that path, run the which command:

which example-python-version

Note: Replace example-python-version with your new Python version.
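For example, the following sketch captures the resulting path in a variable; python3 stands in for your specific version, such as python3.9:

```shell
# Resolve the full path of the interpreter; "python3" is an example name --
# substitute the version that you installed (for example, python3.9).
PYSPARK_PYTHON_PATH="$(command -v python3)"
echo "${PYSPARK_PYTHON_PATH}"
```

Use the printed absolute path as the PYSPARK_PYTHON value in the spark-env classification.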

Upgrade your Python version for Amazon EMR that runs on Amazon EC2

Note: Before you install a new Python and OpenSSL version on your Amazon EMR cluster instances, make sure that you test the following scripts.

To upgrade to Python 3.9 for Amazon EMR version 6.15 that runs on Amazon Elastic Compute Cloud (Amazon EC2), use the following script. You can also use the script to upgrade to Python 3.10 or later on Amazon EMR version 7.0:

sudo yum -y install openssl-devel bzip2-devel libffi-devel xz-devel gcc sqlite-devel
wget https://www.python.org/ftp/python/3.x.x/example-python3-version.tgz
tar xvf example-python3-version.tgz
cd example-python3-version/
./configure --enable-optimizations
sudo make altinstall

Note: Replace example-python3-version with your Python 3 version.

OpenSSL is required to upgrade to Python 3.10 or later on Amazon EMR 6.15 or earlier. Use the following script:

sudo yum -y install openssl-devel bzip2-devel libffi-devel xz-devel gcc sqlite-devel
cd /home/hadoop/
wget https://github.com/openssl/openssl/archive/refs/tags/example-openssl11-version.tar.gz
tar -xzf example-openssl11-version.tar.gz
cd example-openssl11-version/ 
./config --prefix=/usr --openssldir=/etc/ssl --libdir=lib no-shared zlib-dynamic
make
sudo make install
cd /home/hadoop/
wget https://www.python.org/ftp/python/3.x.x/example-python3-version.tgz
tar xvf example-python3-version.tgz
cd example-python3-version/
./configure --enable-optimizations --with-openssl=/usr
sudo make altinstall

Note: Replace example-python3-version with your Python 3 version and example-openssl11-version with your OpenSSL 1.1 version. For more information, see openssl on the GitHub website.

The preceding script installs the new Python version at /usr/local/bin/python3.x. The default Python installation remains at /usr/bin/python3. To use the upgraded version as your Python 3 installation, set /usr/local/bin/python3.x as your new Python location.

Upgrade your Python version on a running cluster

Note: For Amazon EMR versions 5.36.0 and earlier, you can upgrade the Python version to 3.8.

Amazon EMR version 5.21.0 or later

Submit a reconfiguration request with a configuration object that's similar to the following:

[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]
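Before you submit the reconfiguration request, you can save the classification to a file and validate it locally. The following is an optional sketch; the file name is an example, and the check assumes that a local python3 is available:

```shell
# Save the spark-env classification to a file. The file name is an example.
cat > spark-python3-config.json <<'EOF'
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]
EOF

# Sanity-check that the file is valid JSON before you submit it.
python3 -m json.tool spark-python3-config.json
```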

Amazon EMR version 4.6.0-5.20.x

Complete the following steps:

  1. Use SSH to connect to the primary node.
  2. To change the default Python environment, run the following command:
    sudo sed -i -e '$a\export PYSPARK_PYTHON=/usr/bin/python3' /etc/spark/conf/spark-env.sh
    To confirm that PySpark uses the correct Python version, run the following command:
    [hadoop@example-ip-address conf]$ pyspark
    Note: Replace example-ip-address with your IP address.
    Example output:
    Python 3.4.8 (default, Apr 25 2018, 23:50:36)
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
          /_/
    Using Python version 3.4.8 (default, Apr 25 2018 23:50:36)
    SparkSession available as 'spark'.

Note: The new configuration takes effect on the next PySpark job.
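The sed command in step 2 appends an export line to the end of spark-env.sh. The following sketch demonstrates the same edit on a scratch copy, so you can preview the effect without touching /etc/spark/conf/spark-env.sh:

```shell
# Create a scratch stand-in for spark-env.sh with one existing line.
printf 'export SPARK_HOME=/usr/lib/spark\n' > /tmp/spark-env-demo.sh

# Same sed expression as step 2: append the export after the last line ($a).
sed -i -e '$a\export PYSPARK_PYTHON=/usr/bin/python3' /tmp/spark-env-demo.sh

tail -n 1 /tmp/spark-env-demo.sh
# prints: export PYSPARK_PYTHON=/usr/bin/python3
```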

Upgrade your Python version on a new cluster

To upgrade your Python version when you launch a cluster on Amazon EMR, add a bootstrap action to the script that you use.

To upgrade to Python 3.9 for Amazon EMR version 6.15 that runs on Amazon EC2, use the following script. You can also use the script to upgrade to Python 3.10 or later on Amazon EMR version 7.0:

sudo yum -y install openssl-devel bzip2-devel libffi-devel xz-devel gcc sqlite-devel
wget https://www.python.org/ftp/python/3.x.x/example-python3-version.tgz
tar xvf example-python3-version.tgz
cd example-python3-version/
./configure --enable-optimizations
sudo make altinstall

Note: Replace example-python3-version with your Python 3 version.
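To run the preceding commands as a bootstrap action, store them in a script in Amazon S3 and reference the script at cluster launch. The following sketch writes the script to a file; the S3 bucket is a placeholder, and the upload and launch steps are shown as comments:

```shell
# Write the build commands to a bootstrap script file.
cat > install-python3.sh <<'EOF'
#!/bin/bash
set -euxo pipefail
sudo yum -y install openssl-devel bzip2-devel libffi-devel xz-devel gcc sqlite-devel
wget https://www.python.org/ftp/python/3.x.x/example-python3-version.tgz
tar xvf example-python3-version.tgz
cd example-python3-version/
./configure --enable-optimizations
sudo make altinstall
EOF

# Upload the script, then reference it at launch (bucket is a placeholder):
# aws s3 cp install-python3.sh s3://example-bucket/install-python3.sh
# aws emr create-cluster ... --bootstrap-actions Path=s3://example-bucket/install-python3.sh
```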

OpenSSL is required to upgrade to Python 3.10 or later on Amazon EMR 6.15 or earlier. Use the following script:

sudo yum -y install openssl-devel bzip2-devel libffi-devel xz-devel gcc sqlite-devel
cd /home/hadoop/
wget https://github.com/openssl/openssl/archive/refs/tags/example-openssl11-version.tar.gz
tar -xzf example-openssl11-version.tar.gz
cd example-openssl11-version/ 
./config --prefix=/usr --openssldir=/etc/ssl --libdir=lib no-shared zlib-dynamic
make
sudo make install
cd /home/hadoop/
wget https://www.python.org/ftp/python/3.x.x/example-python3-version.tgz
tar xvf example-python3-version.tgz
cd example-python3-version/
./configure --enable-optimizations --with-openssl=/usr
sudo make altinstall

Note: Replace example-python3-version with your Python 3 version and example-openssl11-version with your OpenSSL 1.1 version. For more information, see openssl on the GitHub website.

Then, add a configuration object that's similar to the following:

[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "<example-python-version-path>"
        }
      }
    ]
  }
]
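A launch command that ties the bootstrap action and the configuration object together might look like the following sketch. Every value here (cluster name, release label, instance type and count, S3 path, and configuration file name) is a placeholder, not a recommendation:

```shell
# Sketch: launch a cluster with the Python bootstrap action and the spark-env
# classification. All values are placeholders.
aws emr create-cluster \
  --name "example-cluster" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://example-bucket/install-python3.sh \
  --configurations file://example-configurations.json
```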

Upgrade your Python version on Amazon EMR on Amazon EKS

Note: Amazon Linux 2023 based images contain al2023 in the name. Also, Amazon EMR 6.13.0 and later use Python 3.9.16 by default in images that are based on Amazon Linux 2023. For images that are based on Amazon Linux 2, Python 3.7 is the default version.

To upgrade your Python version for Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS), a Docker image is required. Select a base image URI for your AWS Region, and then build an image that installs the new Python version:

FROM example-base-URI-account-id.dkr.ecr.example-region.amazonaws.com/spark/emr-6.15.0
USER root
RUN yum install -y gcc openssl-devel bzip2-devel libffi-devel tar gzip wget make
RUN wget https://www.python.org/ftp/python/3.x.x/example-python3-version.tgz && \
tar xzf example-python3-version.tgz && cd example-python3-version && \
./configure --enable-optimizations && \
make altinstall
USER hadoop:hadoop

Note: Replace example-base-URI-account-id with the base account ID for Apache Spark images, example-region with your Region, and example-python3-version with your Python 3 version.
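After you write the Dockerfile, build the image and push it to Amazon ECR so that Amazon EMR on EKS can pull it. The following is a sketch; the account ID, Region, repository name, and tag are placeholders:

```shell
# Authenticate Docker to your Amazon ECR registry (placeholders throughout).
aws ecr get-login-password --region example-region | \
  docker login --username AWS --password-stdin example-account-id.dkr.ecr.example-region.amazonaws.com

# Build the custom image from the Dockerfile and push it to the repository.
docker build -t example-account-id.dkr.ecr.example-region.amazonaws.com/example-repository:emr-6.15.0-custom .
docker push example-account-id.dkr.ecr.example-region.amazonaws.com/example-repository:emr-6.15.0-custom
```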

To pass the image when you submit a Spark workload, use application configuration overrides to set the Spark driver and primary pod image:

{
  "classification": "spark-defaults",
  "properties": {
    "spark.kubernetes.container.image": "example-account-id.dkr.ecr.example-region.amazonaws.com/example-repository"
  }
}

Note: Replace example-account-id with the account ID that's storing the created image, example-repository with your repository name that's storing the custom image, and example-region with your Region.
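For example, a job submission with this override might look like the following sketch; the virtual cluster ID, role ARN, entry point, and image URI are all placeholders:

```shell
# Sketch: submit a Spark job on EMR on EKS with the custom image override.
# All values are placeholders.
aws emr-containers start-job-run \
  --virtual-cluster-id example-virtual-cluster-id \
  --name example-job \
  --execution-role-arn example-role-arn \
  --release-label emr-6.15.0-latest \
  --job-driver '{"sparkSubmitJobDriver":{"entryPoint":"s3://example-bucket/job.py"}}' \
  --configuration-overrides '{"applicationConfiguration":[{"classification":"spark-defaults","properties":{"spark.kubernetes.container.image":"example-account-id.dkr.ecr.example-region.amazonaws.com/example-repository"}}]}'
```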

Upgrade your Python version on Amazon EMR Serverless

To upgrade your Python version on an Amazon EMR Serverless application, use a Docker image to install the new Python version:

FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest
USER root
# install python 3
RUN yum install -y gcc openssl-devel bzip2-devel libffi-devel tar gzip wget make
RUN wget https://www.python.org/ftp/python/3.x.x/example-python3-version.tgz && \
tar xzf example-python3-version.tgz && cd example-python3-version && \
./configure --enable-optimizations && \
make altinstall
# EMRS will run the image as hadoop
USER hadoop:hadoop

Note: Replace example-python3-version with your Python 3 version.

When you submit a Spark job to an Amazon EMR Serverless application, pass the following properties to use the new Python version:

--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3.9
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=/usr/local/bin/python3.9
--conf spark.executorEnv.PYSPARK_PYTHON=/usr/local/bin/python3.9
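For example, a job submission that passes these properties might look like the following sketch; the application ID, role ARN, and entry point are placeholders:

```shell
# Sketch: start an EMR Serverless Spark job that uses the upgraded Python.
# The application ID, role ARN, and entry point are placeholders.
aws emr-serverless start-job-run \
  --application-id example-application-id \
  --execution-role-arn example-role-arn \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://example-bucket/job.py",
      "sparkSubmitParameters": "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3.9 --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=/usr/local/bin/python3.9 --conf spark.executorEnv.PYSPARK_PYTHON=/usr/local/bin/python3.9"
    }
  }'
```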

Related information

Configure Spark

Apache Spark

Details for selecting a base image URI

Using custom images with EMR Serverless
