How can I troubleshoot issues installing Python libraries on my EMR cluster?


I want to troubleshoot problems installing Python libraries on my Amazon EMR cluster.

Short description

I'm trying to install Python libraries on my EMR cluster, but I'm seeing one of the following issues:

  • I can't install Python libraries on my EMR cluster.
  • The Python package isn't available on Amazon EMR.
  • Installed Python packages aren't available on newly provisioned core or task nodes.

You can install Python libraries on EMR clusters either by using a bootstrap action or by manually logging in to each node. Install Python libraries using bootstrap actions to make sure that the libraries are installed on all nodes automatically during cluster provisioning and cluster resizing.

Resolution

I can't install Python libraries on my EMR cluster or the Python package isn't available on Amazon EMR

Log in to the node where the missing-package error occurred. Then, use the following commands to verify that the Python libraries are installed:

$ sudo pip3 freeze | grep pandas
pandas==1.3.5
$ sudo pip3 freeze | grep numpy
numpy==1.21.6

Or, verify that Python libraries are installed from the Python shell:

$ python
Python 3.7.15 (default, Oct 31 2022, 22:44:31)
[GCC 7.3.1 20180712 (Red Hat 7.3.1-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np

If the preceding commands return an error such as ModuleNotFoundError: No module named 'python_library', then the library isn't installed.
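
The same check can be scripted when several libraries need to be verified at once. The following is a minimal sketch, not an official AWS tool: the helper name missing_modules is hypothetical, and it only reports on the Python interpreter that runs it, so run it with the same interpreter that your jobs use.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that this interpreter cannot import."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # The module names here mirror the pandas and numpy checks above.
    print(missing_modules(["pandas", "numpy"]))
```

An empty list means that every listed module can be found. Note that the import name can differ from the package name used with pip.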

You can install Python libraries on EMR clusters using pip commands, as shown in the following examples:

sudo pip3 install pandas scipy scikit-learn
sudo pip3 install -r requirements.txt

In the preceding example, requirements.txt is a text file that lists the Python packages and libraries that you want to install, one per line.
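
If you prefer to drive the installation from a Python script, the pip invocation can be wrapped as shown in the following sketch. The function names pip_install_command and install are hypothetical helpers, and the sketch assumes that the script runs with enough privileges to install into the target interpreter (for example, through sudo).

```python
import subprocess
import sys


def pip_install_command(packages=None, requirements_file=None):
    """Build the argument list for a pip install run with this interpreter."""
    command = [sys.executable, "-m", "pip", "install"]
    if requirements_file is not None:
        # -r tells pip to read package names from a requirements file.
        command += ["-r", requirements_file]
    command += list(packages or [])
    if len(command) == 4:
        raise ValueError("Specify at least one package or a requirements file")
    return command


def install(packages=None, requirements_file=None):
    """Run pip; raises subprocess.CalledProcessError if the install fails."""
    subprocess.run(pip_install_command(packages, requirements_file), check=True)
```

Calling install(requirements_file="requirements.txt") would then run the equivalent of pip install -r requirements.txt for the interpreter that executes the script.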

Keep the following in mind:

  • To install additional custom libraries, use the pip install command.
  • Python is installed on Amazon EMR by default. However, not all Python libraries are installed. For more information, see Installing and using kernels and libraries.

To view a list of the Python libraries installed on the cluster, use the sudo pip3 freeze command. The following is an example sudo pip3 freeze command and example output:

$ sudo pip3 freeze
aws-cfn-bootstrap==2.0 
beautifulsoup4==4.9.3
boto==2.49.0 
click==8.1.3 
docutils==0.14 
jmespath==1.0.1 
joblib==1.2.0
lockfile==0.11.0
lxml==4.9.1 
mysqlclient==1.4.2 
nltk==3.7 
nose==1.3.4 
numpy==1.20.0 
py-dateutil==2.2 
pystache==0.5.4 
python-daemon==2.2.3 
python37-sagemaker-pyspark==1.4.2 
pytz==2022.6
PyYAML==5.4.1 
regex==2021.11.10 
simplejson==3.2.0 
six==1.13.0 
tqdm==4.64.1 
windmill==1.6
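
To compare this output against a list of required packages, you can parse it. The following is an illustrative sketch only: parse_freeze and find_missing are hypothetical helper names, and the sketch assumes name==version lines, so it skips other entry formats and doesn't normalize hyphens and underscores in package names.

```python
def parse_freeze(freeze_output):
    """Map lowercased package names to versions for 'name==version' lines."""
    versions = {}
    for line in freeze_output.splitlines():
        line = line.strip()
        if "==" not in line:
            continue  # Skip blank lines and entries that aren't pinned with ==.
        name, version = line.split("==", 1)
        versions[name.lower()] = version
    return versions


def find_missing(freeze_output, required):
    """Return the required package names that don't appear in the freeze output."""
    installed = parse_freeze(freeze_output)
    return [name for name in required if name.lower() not in installed]


# A shortened sample taken from the example output above.
sample = """numpy==1.20.0
nltk==3.7
pytz==2022.6
"""
print(find_missing(sample, ["numpy", "pandas"]))  # prints ['pandas']
```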

Python packages aren't available on newly provisioned core or task nodes during cluster scaling

Python packages installed manually on individual nodes might not be available on newly provisioned core or task nodes during cluster scaling.

To make sure that packages exist in newly provisioned nodes, use a bootstrap action to install libraries instead of installing them manually.
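
As a rough illustration of what such a bootstrap action might look like, the following sketch holds the text of a small installation script and builds the matching entry for the BootstrapActions parameter of the boto3 run_job_flow call. The script contents, the action name, the S3 path, and the helper bootstrap_action are all assumptions made for this example. Upload your own script to Amazon S3 and adjust the values accordingly.

```python
# Contents of a hypothetical bootstrap script that you upload to Amazon S3.
BOOTSTRAP_SCRIPT = """#!/bin/bash
set -euo pipefail
sudo pip3 install pandas scipy scikit-learn
"""


def bootstrap_action(name, script_s3_path, args=None):
    """Build one entry for the BootstrapActions list of boto3's run_job_flow."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,
            "Args": list(args or []),
        },
    }


action = bootstrap_action(
    "install-python-libraries",  # hypothetical action name
    "s3://DOC-EXAMPLE-BUCKET/bootstrap/install_libs.sh",  # placeholder path
)
```

Because the script exits on the first failed command, a failed installation surfaces as a bootstrap failure instead of producing a node that silently lacks the library.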

In some cases, the desired package isn't available even though a bootstrap script is configured to install it. In these cases, check the bootstrap script logs to determine what went wrong. To check the bootstrap script logs, do the following:

If the new instance is running:

1.    Connect to the primary node using SSH.

2.    Check the bootstrap logs for errors at the following locations:

  • /var/log/bootstrap-actions/N/stderr
  • /var/log/bootstrap-actions/N/stdout

In the preceding paths, N represents the bootstrap script number (for example, 1, 2, 3, and so on).
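
To avoid opening each file by hand, the stderr files can be collected with a short script. This is a sketch under the assumption that the logs are readable by the user who runs it. The helper name bootstrap_stderr_lines is hypothetical, and stderr can also contain harmless warnings, so review the lines rather than treating every entry as an error.

```python
from pathlib import Path


def bootstrap_stderr_lines(log_root="/var/log/bootstrap-actions"):
    """Return {action number: non-empty stderr lines} for each bootstrap action."""
    results = {}
    root = Path(log_root)
    if not root.is_dir():
        return results
    for action_dir in sorted(root.iterdir()):
        stderr_file = action_dir / "stderr"
        if stderr_file.is_file():
            text = stderr_file.read_text(errors="replace")
            results[action_dir.name] = [ln for ln in text.splitlines() if ln.strip()]
    return results


if __name__ == "__main__":
    for number, lines in bootstrap_stderr_lines().items():
        print(f"bootstrap action {number}: {len(lines)} stderr line(s)")
        for line in lines[-20:]:  # Show only the last 20 lines per action.
            print("   ", line)
```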

If the new instance failed provisioning:

The bootstrap logs are captured in the Amazon Simple Storage Service (Amazon S3) bucket that you configured for Amazon EMR logging. The paths are:

  • s3://DOC-EXAMPLE-LOG-BUCKET/cluster-id/node/instance-id/bootstrap-actions/N/stdout
  • s3://DOC-EXAMPLE-LOG-BUCKET/cluster-id/node/instance-id/bootstrap-actions/N/stderr

In the preceding paths, N represents the bootstrap script number (for example, 1, 2, 3, and so on).
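
The paths can be assembled from their parts as in the following sketch. The helper bootstrap_log_uris and the cluster and instance IDs are placeholders for illustration, and the sketch assumes that the logging location is the bucket root, as in the preceding paths.

```python
def bootstrap_log_uris(log_bucket, cluster_id, instance_id, action_number):
    """Build the S3 URIs of the stdout and stderr logs for one bootstrap action."""
    base = (
        f"s3://{log_bucket}/{cluster_id}/node/{instance_id}"
        f"/bootstrap-actions/{action_number}"
    )
    return {"stdout": f"{base}/stdout", "stderr": f"{base}/stderr"}


# Placeholder identifiers; replace them with your own bucket, cluster, and instance.
uris = bootstrap_log_uris(
    "DOC-EXAMPLE-LOG-BUCKET", "j-EXAMPLECLUSTER", "i-EXAMPLEINSTANCE", 1
)
print(uris["stderr"])
# prints s3://DOC-EXAMPLE-LOG-BUCKET/j-EXAMPLECLUSTER/node/i-EXAMPLEINSTANCE/bootstrap-actions/1/stderr
```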


AWS OFFICIAL
Updated a year ago