How do I upgrade my Python version on Amazon EMR and configure PySpark jobs to use the upgraded Python version?
I want to upgrade my Python version on Amazon EMR and configure PySpark jobs to use the upgraded Python version.
Short description
Cluster instances and system applications use different Python versions based on the following Amazon EMR release versions:
- Amazon EMR release versions 4.6.0-5.19.0: Python 3.4 is installed on the cluster instances. Python 2.7 is the system default.
- Amazon EMR release versions 5.20.0 and later: Python 3.6 is installed on the cluster instances. For Amazon EMR versions 5.20.0-5.29.0, Python 2.7 is the system default. For versions 5.30.0 and later, Python 3 is the system default.
- Amazon EMR release versions 6.0.0 and later: Python 3.7 is installed on the cluster instances. Python 3 is the system default.
- Amazon EMR release versions 7.0.0 and later: Python 3.9 is installed on the cluster instances. Python 3 is the system default.
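To check which Python version a cluster node currently uses, you can run commands similar to the following on the node. This is a minimal sketch; the versions and paths that the commands return depend on your Amazon EMR release:
# Show the system default Python 3 version and its location
python3 --version
which python3
# List the Python interpreters installed in common locations
ls /usr/bin/python* /usr/local/bin/python* 2>/dev/null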
Resolution
Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.
To upgrade your Python version, point the PYSPARK_PYTHON environment variable for the spark-env classification to the directory where the new Python version is installed. To find that location, run the following command:
which example-python-version
Note: Replace example-python-version with your new Python version.
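For example, with a hypothetical Python 3.11 installation, the command and its typical output look similar to the following. The exact path depends on how the interpreter was installed:
which python3.11
# Typical output:
# /usr/local/bin/python3.11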
Upgrade your Python version for Amazon EMR that runs on Amazon EC2
Note: Before you install a new Python and OpenSSL version on your Amazon EMR cluster instances, make sure that you test the following scripts.
To upgrade to Python 3.9 for Amazon EMR version 6.15 that runs on Amazon Elastic Compute Cloud (Amazon EC2), use the following script. You can also use the script to upgrade to Python 3.10 or later on Amazon EMR version 7.0:
sudo yum -y install openssl-devel bzip2-devel libffi-devel xz-devel gcc sqlite-devel
wget https://www.python.org/ftp/python/3.x.x/example-python3-version.tgz
tar xvf example-python3-version.tgz
cd example-python3-version/
./configure --enable-optimizations
sudo make altinstall
Note: Replace example-python3-version with your Python 3 version.
OpenSSL is required to upgrade to Python 3.10 or later on Amazon EMR 6.15 or earlier. Use the following script:
sudo yum -y install openssl-devel bzip2-devel libffi-devel xz-devel gcc sqlite-devel
cd /home/hadoop/
wget https://github.com/openssl/openssl/archive/refs/tags/example-openssl11-version.tar.gz
tar -xzf example-openssl11-version.tar.gz
cd example-openssl11-version/
./config --prefix=/usr --openssldir=/etc/ssl --libdir=lib no-shared zlib-dynamic
make
sudo make install
cd /home/hadoop/
wget https://www.python.org/ftp/python/3.x.x/example-python3-version.tgz
tar xvf example-python3-version.tgz
cd example-python3-version/
./configure --enable-optimizations --with-openssl=/usr
sudo make altinstall
Note: Replace example-python3-version with your Python 3 version and example-openssl11-version with your OpenSSL 1.1 version. For more information, see openssl on the GitHub website.
To use the upgraded version as your default Python 3 installation, use /usr/local/bin/python3.x as your new Python location. The preceding script installs Python at /usr/local/bin/python3.x, and the default Python installation remains at /usr/bin/python3.
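To confirm the new installation and point PySpark at it, you can run commands similar to the following. This sketch assumes a hypothetical Python 3.9 build that make altinstall placed under /usr/local/bin:
# Verify the newly installed interpreter
/usr/local/bin/python3.9 --version
# Point PySpark at the new interpreter for the current session (hypothetical path)
export PYSPARK_PYTHON=/usr/local/bin/python3.9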
Upgrade your Python version on a running cluster
Note: For Amazon EMR versions 5.36.0 and earlier, you can upgrade the Python version to 3.8.
Amazon EMR version 5.21.0 or earlier
Submit a reconfiguration request with a configuration object that's similar to the following:
[ { "Classification": "spark-env", "Configurations": [ { "Classification": "export", "Properties": { "PYSPARK_PYTHON": "/usr/bin/python3" } } ] } ]
Amazon EMR version 4.6.0-5.20.x
Complete the following steps:
- Use SSH to connect to the primary node.
- To change the default Python environment, run the following command:
sudo sed -i -e '$a\export PYSPARK_PYTHON=/usr/bin/python3' /etc/spark/conf/spark-env.sh
- To confirm that PySpark uses the correct Python version, run the following command:
Note: Replace example-ip-address with your IP address.
[hadoop@example-ip-address conf]$ pyspark
Example output:
Python 3.4.8 (default, Apr 25 2018, 23:50:36)
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 3.4.8 (default, Apr 25 2018 23:50:36)
SparkSession available as 'spark'.
Note: The new configuration takes effect on the next PySpark job.
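As a quick sanity check, you can confirm that the export line was appended to spark-env.sh:
# The last line of spark-env.sh should now export PYSPARK_PYTHON
tail -n 1 /etc/spark/conf/spark-env.sh
# Expected output:
# export PYSPARK_PYTHON=/usr/bin/python3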
Upgrade your Python version on a new cluster
To upgrade your Python version when you launch a cluster on Amazon EMR, add a bootstrap action that runs your upgrade script.
To upgrade to Python 3.9 for Amazon EMR version 6.15 that runs on Amazon EC2, use the following script. You can also use the script to upgrade to Python 3.10 or later on Amazon EMR version 7.0:
sudo yum -y install openssl-devel bzip2-devel libffi-devel xz-devel gcc sqlite-devel
wget https://www.python.org/ftp/python/3.x.x/example-python3-version.tgz
tar xvf example-python3-version.tgz
cd example-python3-version/
./configure --enable-optimizations
sudo make altinstall
Note: Replace example-python3-version with your Python 3 version.
OpenSSL is required to upgrade to Python 3.10 or later on Amazon EMR 6.15 and earlier. Use the following script:
sudo yum -y install openssl-devel bzip2-devel libffi-devel xz-devel gcc sqlite-devel
cd /home/hadoop/
wget https://github.com/openssl/openssl/archive/refs/tags/example-openssl11-version.tar.gz
tar -xzf example-openssl11-version.tar.gz
cd example-openssl11-version/
./config --prefix=/usr --openssldir=/etc/ssl --libdir=lib no-shared zlib-dynamic
make
sudo make install
cd /home/hadoop/
wget https://www.python.org/ftp/python/3.x.x/example-python3-version.tgz
tar xvf example-python3-version.tgz
cd example-python3-version/
./configure --enable-optimizations --with-openssl=/usr
sudo make altinstall
Note: Replace example-python3-version with your Python 3 version and example-openssl11-version with your OpenSSL 11 version. For more information, see openssl on the GitHub website.
Then, add a configuration object that's similar to the following:
[ { "Classification": "spark-env", "Configurations": [ { "Classification": "export", "Properties": { "PYSPARK_PYTHON": "<example-python-version-path>" } } ] } ]
Upgrade your Python version on Amazon EMR on Amazon EKS
Note: Amazon Linux 2023-based images contain al2023 in the name. Also, Amazon EMR 6.13.0 and later use Python 3.9.16 by default in images that are based on Amazon Linux 2023. For images that are based on Amazon Linux 2, Python 3.7 is the default version.
To upgrade your Python version for Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS), a Docker image is required. Select a base image URI for your AWS Region, and then build an image similar to the following:
FROM example-base-URI-account-id.dkr.ecr.example-region.amazonaws.com/spark/emr-6.15.0
USER root
RUN yum install -y gcc openssl-devel bzip2-devel libffi-devel tar gzip wget make
RUN wget https://www.python.org/ftp/python/3.x.x/example-python3-version.tgz && \
    tar xzf example-python3-version.tgz && cd example-python3-version && \
    ./configure --enable-optimizations && \
    make altinstall
USER hadoop:hadoop
Note: Replace example-base-URI-account-id with the base account ID for Apache Spark images, example-region with your AWS Region, and example-python3-version with your Python 3 version.
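After you create the Dockerfile, build the custom image and push it to an Amazon ECR repository that your job can pull from. This is a minimal sketch; the account ID, Region, and repository name are placeholders:
# Authenticate to your Amazon ECR registry
aws ecr get-login-password --region example-region | \
  docker login --username AWS --password-stdin example-account-id.dkr.ecr.example-region.amazonaws.com
# Build the custom image and push it to your repository
docker build -t example-account-id.dkr.ecr.example-region.amazonaws.com/example-repository:latest .
docker push example-account-id.dkr.ecr.example-region.amazonaws.com/example-repository:latest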
To pass the image when you submit a Spark workload, use application configuration overrides to set the Spark driver and primary pod image:
{ "classification": "spark-defaults", "properties": { "spark.kubernetes.container.image": "example-account-id.dkr.ecr.example-region.amazonaws.com/example-repository" } }
Note: Replace example-account-id with the account ID that stores the created image, example-repository with the name of the repository that stores the custom image, and example-region with your AWS Region.
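For example, you can pass the override when you start the job run with the AWS CLI. The following is a hedged sketch; the virtual cluster ID, execution role, entry point, and image URI are placeholders:
aws emr-containers start-job-run \
  --virtual-cluster-id example-virtual-cluster-id \
  --name example-job \
  --execution-role-arn arn:aws:iam::example-account-id:role/example-execution-role \
  --release-label emr-6.15.0-latest \
  --job-driver '{"sparkSubmitJobDriver": {"entryPoint": "s3://example-bucket/example-script.py"}}' \
  --configuration-overrides '{"applicationConfiguration": [{"classification": "spark-defaults", "properties": {"spark.kubernetes.container.image": "example-account-id.dkr.ecr.example-region.amazonaws.com/example-repository"}}]}'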
Upgrade your Python version on Amazon EMR Serverless
To upgrade your Python version on an Amazon EMR Serverless application, use a Docker image to install the new Python version:
FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest
USER root

# install python 3
RUN yum install -y gcc openssl-devel bzip2-devel libffi-devel tar gzip wget make
RUN wget https://www.python.org/ftp/python/3.x.x/example-python3-version.tgz && \
    tar xzf example-python3-version.tgz && cd example-python3-version && \
    ./configure --enable-optimizations && \
    make altinstall

# EMRS will run the image as hadoop
USER hadoop:hadoop
Note: Replace example-python3-version with your Python 3 version.
When you submit a Spark job to an Amazon EMR Serverless application, pass the following path to use the new Python version:
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3.9
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=/usr/local/bin/python3.9
--conf spark.executorEnv.PYSPARK_PYTHON=/usr/local/bin/python3.9
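For example, with the AWS CLI you can pass these --conf values in sparkSubmitParameters when you start the job run. The following is a minimal sketch; the application ID, execution role, and entry point are placeholders:
aws emr-serverless start-job-run \
  --application-id example-application-id \
  --execution-role-arn arn:aws:iam::example-account-id:role/example-execution-role \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://example-bucket/example-script.py",
      "sparkSubmitParameters": "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3.9 --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=/usr/local/bin/python3.9 --conf spark.executorEnv.PYSPARK_PYTHON=/usr/local/bin/python3.9"
    }
  }'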
Relevant content
- asked 2 years agolg...
- asked 3 years agolg...
- Accepted Answerasked a year agolg...
- asked 2 years agolg...
- How do I install and troubleshoot Python libraries in Amazon EMR and Amazon EMR Serverless clusters?AWS OFFICIALUpdated 3 months ago
- AWS OFFICIALUpdated 4 months ago
- AWS OFFICIALUpdated a year ago
- AWS OFFICIALUpdated 4 months ago