How do I use external Python libraries in my AWS Glue ETL job?
I want to use external Python libraries in an AWS Glue extract, transform, and load (ETL) job.
Short description
When you use AWS Glue versions 2.0, 3.0, and 4.0, you can install additional Python modules or different module versions at the job level. To add a new module or change the version of an existing module, use the --additional-python-modules job parameter key. The key's value is a list of comma-separated Python module names. When you use this parameter, your AWS Glue ETL job installs the additional modules through the Python package installer (pip3).
You can also use the --additional-python-modules parameter to install Python libraries that are written in C-based languages.
Resolution
Install or update Python modules
To install an additional Python module for your AWS Glue job, complete the following steps:
- Open the AWS Glue console.
- In the navigation pane, choose Jobs.
- Select the job where you want to add the Python module.
- Choose Actions, and then choose Edit job.
- Expand the Security configuration, script libraries, and job parameters (optional) section.
- Under Job parameters, do the following:
For Key, enter --additional-python-modules.
For Value, enter a comma-separated list of modules that you want to add.
- Choose Save.
For example, suppose that you want to add two new modules, version 1.0.2 of PyMySQL and version 3.6.2 of the Natural Language Toolkit (NLTK). You install the PyMySQL module from the internet and the NLTK module from an Amazon Simple Storage Service (Amazon S3) bucket. In that case, the --additional-python-modules parameter key has the value pymysql==1.0.2, s3://aws-glue-add-modules/nltk-3.6.2-py3-none-any.whl.
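The example value above can also be assembled in code. The following is a minimal sketch, not an official helper: build_module_argument and start_run_with_modules are hypothetical names, the list is joined without spaces as an assumption, and the second function assumes that boto3 is installed, that credentials are configured, and that arguments passed to a job run override the job's default parameters for that run.

```python
from typing import Dict, Iterable

PARAM_KEY = "--additional-python-modules"


def build_module_argument(modules: Iterable[str]) -> Dict[str, str]:
    """Return a job-parameter dict whose value is a comma-separated module list."""
    # Drop surrounding whitespace and empty entries before joining.
    cleaned = [m.strip() for m in modules if m.strip()]
    if not cleaned:
        raise ValueError("at least one module is required")
    return {PARAM_KEY: ",".join(cleaned)}


def start_run_with_modules(job_name: str, modules: Iterable[str]) -> str:
    """Start one run of an existing job with the extra modules.

    Assumption: boto3 is available and AWS credentials are configured.
    """
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=job_name,
        Arguments=build_module_argument(modules),
    )
    return response["JobRunId"]


if __name__ == "__main__":
    print(build_module_argument([
        "pymysql==1.0.2",
        "s3://aws-glue-add-modules/nltk-3.6.2-py3-none-any.whl",
    ]))
```

Keeping the string construction separate from the AWS call makes it possible to check the parameter value locally before any job is started.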
Some modules have dependencies on other modules. If you install or update such a module, then you must also download the other modules that it depends on. This means that you must have internet access to install or update the module. If you don't have internet access, then see Building Python modules from a wheel for Spark ETL workloads using AWS Glue 2.0.
For a list of Python modules that are included in each AWS Glue version by default, see Python modules already provided in AWS Glue.
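To confirm which version of a module a job actually sees, a check like the following could be placed near the top of the job script. It is a sketch under assumptions: installed_version is a hypothetical helper, and the fallback to pkg_resources is there because older Python runtimes don't ship importlib.metadata.

```python
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        from importlib import metadata  # available in Python 3.8 and later
    except ImportError:
        metadata = None

    if metadata is not None:
        try:
            return metadata.version(distribution)
        except metadata.PackageNotFoundError:
            return None

    # Fallback for older interpreters that lack importlib.metadata.
    import pkg_resources

    try:
        return pkg_resources.get_distribution(distribution).version
    except pkg_resources.DistributionNotFound:
        return None


if __name__ == "__main__":
    # Module names are taken from the example earlier in this article.
    for name in ("pymysql", "nltk"):
        print(name, installed_version(name))
```

The printed lines appear in the job's output logs, which makes it easier to tell whether a default module or the requested version was loaded.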
Install C-based Python modules
Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.
AWS Glue also supports libraries and extensions written in C with the --additional-python-modules parameter. However, some Python modules, such as spacy and grpc, require root permissions to install. AWS Glue doesn't provide root access during package installation. To resolve this issue, precompile the binaries into a wheel compatible with AWS Glue and install that wheel.
To compile a library in a C-based language, the compiler must be compatible with the target operating system and processor architecture. If the library is compiled against a different operating system or processor architecture, then the wheel isn't installed in AWS Glue. Because AWS Glue is a managed service, cluster access isn't available to develop these dependencies.
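A wheel's file name records the interpreter, ABI, and platform that it was built for, so it can be inspected before you upload it. The sketch below assumes the standard wheel naming scheme (distribution, version, optional build tag, Python tag, ABI tag, platform tag); parse_wheel_filename and is_pure_python are hypothetical helpers, and the check doesn't replace an actual installation test in AWS Glue.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel file name into its name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    # Five parts normally, six when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel file name: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return WheelTags(parts[0], parts[1], python_tag, abi_tag, platform_tag)


def is_pure_python(tags: WheelTags) -> bool:
    """A pure-Python wheel has no ABI requirement and runs on any platform."""
    return tags.abi_tag == "none" and tags.platform_tag == "any"


if __name__ == "__main__":
    for name in (
        "nltk-3.6.2-py3-none-any.whl",
        "grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl",
    ):
        tags = parse_wheel_filename(name)
        print(name, tags, "pure Python" if is_pure_python(tags) else "compiled")
```

A compiled wheel such as the grpcio example carries a platform tag, which is why it must be built on an operating system and architecture that match the target environment.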
To precompile a C-based Python module that requires root permissions, complete the following steps:
- Launch an Amazon Elastic Compute Cloud (Amazon EC2) Linux instance (Amazon Linux 2 AMI) with enough volume space for your libraries.
- Install Docker on the EC2 instance, switch to the root user so that you can run Docker commands without sudo, and then start Docker. To do so, run the following commands:
Install Docker:
sudo yum install docker -y
Switch to the root user:
sudo su
Start Docker:
sudo service docker start
- Create a Dockerfile for the module. For example, to install the grpcio module, create a file called dockerfile_grpcio and copy the following content into the file:
FROM amazonlinux:2
# Install required repositories and tools
RUN yum update -y
RUN yum install shadow-utils.x86_64 -y
# Install Java 8
RUN yum install -y java-1.8.0-openjdk.x86_64
# Install build tools for Python 3.7
WORKDIR /opt
RUN yum install -y gcc openssl-devel bzip2-devel libffi-devel wget tar make
# Install Python 3.7
RUN wget https://www.python.org/ftp/python/3.7.12/Python-3.7.12.tgz
RUN tar xzf Python-3.7.12.tgz
WORKDIR /opt/Python-3.7.12
RUN ./configure --enable-optimizations
RUN make altinstall
RUN ln -sf /usr/local/bin/python3.7 /usr/bin/python3
RUN ln -sf /usr/local/bin/pip3.7 /usr/bin/pip3
# Verify Python version
RUN python3 --version
RUN pip3 --version
# Install other dependencies
RUN yum install -y doxygen autoconf automake libtool zlib-devel openssl-devel maven wget protobuf-compiler cmake make gcc-c++
RUN yum install -y python3-devel
# Install Python packages
RUN pip3 install --upgrade pip
RUN pip3 install wheel
RUN pip3 install cython numpy scipy
RUN pip3 install cmake scikit-build
# Create wheel directory and install grpcio
WORKDIR /root
RUN mkdir wheel_dir
RUN pip3 install Cython
RUN pip3 install grpcio
RUN pip3 wheel grpcio -w wheel_dir
- Run docker build to build an image from your Dockerfile:
docker build -f dockerfile_grpcio .
- Restart the Docker daemon:
sudo service docker restart
When the docker build command completes, you get a success message that contains your Docker image ID. For example, "Successfully built 1111222233334444". Note the Docker image ID to use in the next step.
- Extract the .whl wheel file from the Docker container. To do so, run the following commands:
Get the Docker image ID:
docker image ls
Run the container. Replace 1111222233334444 with your Docker image ID:
docker run -dit 1111222233334444
Verify the location of the wheel file and retrieve its name. Replace 5555666677778888 with your container ID:
docker exec -t -i 5555666677778888 ls /root/wheel_dir/
Copy the wheel from the Docker container to the EC2 instance:
docker cp 5555666677778888:/root/wheel_dir/doc-example-wheel .
Note: Replace doc-example-wheel with the name of your generated wheel file.
- To upload the wheel to Amazon S3, run a command in the following format:
aws s3 cp doc-example-wheel s3://path/to/wheel/
For example:
aws s3 cp grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl s3://aws-glue-add-modules/grpcio/
Note: Replace grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl with the name of your Python package file.
- Open the AWS Glue console.
- For the AWS Glue ETL job, under Job parameters, enter the following:
For Key, enter --additional-python-modules.
For Value, enter s3://aws-glue-add-modules/grpcio/grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl.
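The upload step and the resulting parameter value can also be scripted. The following sketch is illustrative only: wheel_s3_uri and upload_wheel are hypothetical helpers, the bucket and prefix are the example values from this article, and upload_wheel assumes that boto3 is installed and that your credentials allow writes to the bucket.

```python
import os


def wheel_s3_uri(bucket: str, prefix: str, wheel_filename: str) -> str:
    """Build the S3 URI to use as the --additional-python-modules value."""
    # Normalize the prefix so the result never contains a doubled slash.
    clean_prefix = prefix.strip("/")
    key = f"{clean_prefix}/{wheel_filename}" if clean_prefix else wheel_filename
    return f"s3://{bucket}/{key}"


def upload_wheel(local_path: str, bucket: str, prefix: str) -> str:
    """Upload a local wheel file and return its S3 URI.

    Assumption: boto3 is available and AWS credentials are configured.
    """
    import boto3  # imported lazily so wheel_s3_uri works without boto3

    filename = os.path.basename(local_path)
    uri = wheel_s3_uri(bucket, prefix, filename)
    key = uri[len(f"s3://{bucket}/"):]
    boto3.client("s3").upload_file(local_path, bucket, key)
    return uri


if __name__ == "__main__":
    print(wheel_s3_uri(
        "aws-glue-add-modules",
        "grpcio/",
        "grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl",
    ))
```

The string that upload_wheel returns is the same value that you enter for the --additional-python-modules key in the console.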