How do I use external Python libraries in my AWS Glue 1.0 or 0.9 ETL job?

I want to use an external Python library in my AWS Glue 1.0 or 0.9 extract, transform, and load (ETL) job.

Short description

To use an external library in an Apache Spark ETL job, do the following:

1.    Package the library files in a .zip file (unless the library is contained in a single .py file).

2.    Upload the package to Amazon Simple Storage Service (Amazon S3).

3.    Use the library in a job or JobRun.

Resolution

The following example shows how to use an external library in an AWS Glue 1.0 or 0.9 Spark ETL job.

Important: If you want to use an external library in your AWS Glue 2.0 job, then see How do I use external Python libraries in my AWS Glue 2.0 ETL job? If you want to use an external library in a Python shell job, then follow the steps at Providing your own Python library.

1.    Choose whether to create the library for Python 2 or Python 3. This example uses the Boto3 library. Be sure that the AWS Glue version that you're using supports the Python version that you choose for the library. AWS Glue version 1.0 supports Python 2 and Python 3, and AWS Glue version 0.9 supports only Python 2.

Note: Libraries and extension modules for Spark jobs must be written in Python. Libraries that are written in C, such as pandas, aren't supported in AWS Glue 0.9 or 1.0. If you need to use a library written in C, then upgrade AWS Glue to at least version 2.0 and use the --additional-python-modules option. For more information, see How do I use external Python libraries in my AWS Glue 2.0 ETL job?
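
The pure-Python requirement in the preceding note can be checked before you package a library. The following is a minimal sketch, not an official AWS tool: the function name find_compiled_files and the list of file suffixes are assumptions chosen for illustration, and the check is only a heuristic based on file names.

```python
import os

# Assumption: these suffixes identify compiled (non-pure-Python) files.
# The list is a heuristic and is not exhaustive.
COMPILED_SUFFIXES = (".so", ".pyd", ".dylib", ".dll")


def find_compiled_files(package_dir):
    """Return sorted relative paths of compiled files found under package_dir."""
    found = []
    for root, _dirs, files in os.walk(package_dir):
        for name in files:
            if name.endswith(COMPILED_SUFFIXES):
                full_path = os.path.join(root, name)
                found.append(os.path.relpath(full_path, package_dir))
    return sorted(found)
```

An empty result suggests that the directory contains only Python source files. Any listed file indicates a compiled extension that AWS Glue 0.9 or 1.0 can't load.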

2.    Launch an Amazon Elastic Compute Cloud (Amazon EC2) Linux instance.

3.    Connect to the Linux instance using SSH.

4.    Run the following commands to build Python 3.6.9 from source and install Boto3. For more information, see Quickstart in the Boto3 documentation.

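# Install the build tools and OpenSSL headers that are needed to compile Python
sudo yum -y groupinstall "Development Tools"
sudo yum -y install openssl-devel
# Download and unpack the Python 3.6.9 source
wget https://www.python.org/ftp/python/3.6.9/Python-3.6.9.tgz
tar xvf Python-3.6.9.tgz
cd Python-3.6.9/
# Build and install Python
./configure --enable-optimizations
sudo make install
# Install Boto3
sudo pip install boto3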

5.    Confirm the location of the Python site-packages directory:

python -m site

The output includes a site-packages path similar to the following:

/usr/lib/python3.6/site-packages
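
The same location can also be read from inside Python. The following is a minimal sketch that uses the standard library's sysconfig and site modules. The helper name site_packages_dirs is an assumption for illustration, and the paths that it prints depend on how Python was installed.

```python
import site
import sysconfig


def site_packages_dirs():
    """Return candidate site-packages directories for the running interpreter."""
    # sysconfig reports where pure-Python packages are installed.
    dirs = [sysconfig.get_paths()["purelib"]]
    # site.getsitepackages() can be missing in some older virtualenv setups.
    getter = getattr(site, "getsitepackages", None)
    if getter is not None:
        for path in getter():
            if path not in dirs:
                dirs.append(path)
    return dirs


if __name__ == "__main__":
    for path in site_packages_dirs():
        print(path)
```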

6.    Package the external library files in a .zip file unless the library is contained in a single .py file. The .zip file must include an __init__.py file, and the package directory must be at the root of the archive. The __init__.py file can be empty. For more information, see Python documentation for Packages.

Example:

cd /usr/lib/python3.6/site-packages
sudo zip -r -X "/home/ec2-user/site-packages.zip" *
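
As an alternative to the zip command, the archive can be built and checked with Python's zipfile module. The following is a minimal sketch under stated assumptions: the function names zip_site_packages and packages_at_root are hypothetical, and the target .zip file is assumed to be outside the directory that is being archived.

```python
import os
import zipfile


def zip_site_packages(site_packages_dir, zip_path):
    """Zip the contents of site_packages_dir so each package is at the archive root."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(site_packages_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to site-packages, for example boto3/__init__.py
                archive.write(full_path, os.path.relpath(full_path, site_packages_dir))


def packages_at_root(zip_path):
    """Return the top-level directories in the archive that contain __init__.py."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
    return sorted(
        {n.split("/")[0] for n in names
         if n.count("/") == 1 and n.endswith("/__init__.py")}
    )
```

Calling packages_at_root on the finished archive is a quick way to confirm that the package directory and its __init__.py file are at the root of the archive, as this step requires.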

7.    Upload the package to Amazon S3:

aws s3 cp /home/ec2-user/site-packages.zip s3://awsexamplebucket/
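
The upload can also be done from Python with Boto3. The following is a minimal sketch, assuming that Boto3 is installed and that AWS credentials are configured. The helper names split_s3_uri and upload_package are hypothetical. The Boto3 call used is the S3 client's upload_file method.

```python
import os


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    if not uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI, got %r" % uri)
    bucket, _, key = uri[len("s3://"):].partition("/")
    if not bucket:
        raise ValueError("missing bucket name in %r" % uri)
    return bucket, key


def upload_package(local_path, s3_uri):
    """Upload local_path to s3_uri. Requires Boto3 and AWS credentials."""
    import boto3  # imported here so split_s3_uri works without Boto3 installed

    bucket, key = split_s3_uri(s3_uri)
    if not key or key.endswith("/"):
        # Like `aws s3 cp` to a bucket or prefix: keep the local file name.
        key += os.path.basename(local_path)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return "s3://%s/%s" % (bucket, key)
```

For example, upload_package("/home/ec2-user/site-packages.zip", "s3://awsexamplebucket/") would mirror the preceding command.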

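8.    Use the library in a job or JobRun. To do this, enter the Amazon S3 path of the package for Python library path on the AWS Glue console, or pass the path in the --extra-py-files job parameter.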
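
The last step can be sketched with Boto3 as follows. AWS Glue reads the Python library path from the --extra-py-files special job parameter, which can be set as a default argument on the job or as an argument for a single JobRun. The function names and any job name, role, or script path that you pass in are hypothetical, and create_and_run assumes that Boto3 is installed and that AWS credentials and a Region are configured.

```python
EXTRA_PY_FILES = "--extra-py-files"  # AWS Glue special job parameter


def job_definition(name, role, script_s3_path, library_s3_path):
    """Build keyword arguments for the AWS Glue CreateJob API."""
    return {
        "Name": name,
        "Role": role,
        "GlueVersion": "1.0",
        "Command": {
            "Name": "glueetl",  # Spark ETL job
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        # Applies to every run of the job unless a run overrides it.
        "DefaultArguments": {EXTRA_PY_FILES: library_s3_path},
    }


def job_run_request(name, library_s3_path):
    """Build keyword arguments for StartJobRun that set the library for one run."""
    return {"JobName": name, "Arguments": {EXTRA_PY_FILES: library_s3_path}}


def create_and_run(definition, run_request):
    """Create the job, start one run, and return the JobRun ID."""
    import boto3  # requires AWS credentials and a Region

    glue = boto3.client("glue")
    glue.create_job(**definition)
    return glue.start_job_run(**run_request)["JobRunId"]
```

Keeping the request builders separate from the API calls makes it possible to inspect the arguments before anything is created in your account.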

To use an external library in a development endpoint, do the following:

1.    Package the library and upload the file to Amazon S3, as explained previously.

2.    Create the development endpoint. For Python library path, enter the Amazon S3 path for the package. For more information, see Loading Python libraries in a development endpoint.
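
If you create the development endpoint through the API instead of the console, the Python library path corresponds to the ExtraPythonLibsS3Path parameter of CreateDevEndpoint. The following is a minimal sketch with Boto3. The function names, endpoint name, and role ARN are hypothetical, and the API call assumes that Boto3 is installed and that AWS credentials are configured.

```python
def dev_endpoint_request(endpoint_name, role_arn, library_s3_paths):
    """Build keyword arguments for the AWS Glue CreateDevEndpoint API."""
    if isinstance(library_s3_paths, str):
        library_s3_paths = [library_s3_paths]
    return {
        "EndpointName": endpoint_name,
        "RoleArn": role_arn,
        # Multiple libraries are passed as one comma-separated string.
        "ExtraPythonLibsS3Path": ",".join(library_s3_paths),
    }


def create_dev_endpoint(request):
    """Create the development endpoint. Requires Boto3 and AWS credentials."""
    import boto3

    return boto3.client("glue").create_dev_endpoint(**request)
```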


Related information

Using Python libraries with AWS Glue

AWS OFFICIAL
Updated 2 years ago