I want to troubleshoot Python library installation issues on my Amazon EMR cluster.
Resolution
The Python package is unavailable on Amazon EMR
Note: By default, Amazon EMR installs Python. However, Amazon EMR doesn’t install all Python libraries.
If a Python package is unavailable on Amazon EMR, then log in to the node where the missing-package error occurred. Then, run commands similar to the following examples to check whether the package is installed.
Example command to check whether specific packages are installed:
sudo pip3 freeze | grep pandas
sudo pip3 freeze | grep numpy
Note: Replace pandas and numpy with the Python packages you want to check.
Expected output:
pandas==1.3.5
numpy==1.21.6
Example command to see a list of all Python libraries installed on the cluster:
sudo pip3 freeze
Expected output:
aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.3
boto==2.49.0
click==8.1.3
docutils==0.14
jmespath==1.0.1
numpy==1.21.6
pandas==1.3.5
...
You can also run commands similar to the following examples in the Python shell:
import pandas as pd
import numpy as np
If the preceding commands return an error such as ModuleNotFoundError: No module named 'python_library', then the library isn't installed.
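The same check can be scripted so that it reports every missing library at once instead of stopping at the first failed import. The following is a minimal sketch that uses only the Python standard library. The function name find_missing_modules is hypothetical, and the sketch assumes that you pass top-level import names such as pandas or numpy.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the names from module_names that cannot be located for import."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # pandas and numpy are the example packages used in this article.
    print(find_missing_modules(["pandas", "numpy"]))
```

An empty list means that every listed module can be found by the Python interpreter that runs the script. Run the script with the same interpreter that your job uses, because different interpreters on the node can have different libraries installed.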
To install Python libraries on your Amazon EMR clusters, run the following command:
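sudo pip3 install pandas scipy scikit-learn
Note: Replace pandas scipy scikit-learn with the Python libraries you want to install. Install scikit-learn under the name scikit-learn, because the sklearn package name on PyPI is deprecated.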
Or you can run the following command to read and install Python packages from a .txt file:
sudo pip3 install -r requirements.txt
Note: Replace requirements.txt with a .txt file that contains a list of Python packages and libraries that you want to install.
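After the installation finishes, you might want to confirm that every package named in the file is present on the node. The following is a rough sketch, not a full requirements parser: it assumes Python 3.8 or later (where importlib.metadata is part of the standard library) and a simple file with one package per line. The function names requirement_names and not_installed are hypothetical.

```python
from importlib import metadata
import re


def requirement_names(lines):
    """Extract bare distribution names from simple requirements lines."""
    names = []
    for line in lines:
        # Drop inline comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blank lines and pip options such as -r or -e
        # Keep the leading name; drop version specifiers, extras, and markers.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def not_installed(names):
    """Return the distribution names that have no installed metadata."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    try:
        with open("requirements.txt") as handle:
            print(not_installed(requirement_names(handle)))
    except FileNotFoundError:
        print("requirements.txt not found in the current directory")
```

The sketch checks distribution names (the names that pip uses), which can differ from import names. For example, the scikit-learn distribution is imported as sklearn.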
For more information, see Installing packages on the Python packaging website. Also, see Installing and using kernels and libraries in EMR Studio.
Python packages aren't available on newly provisioned core or task nodes during cluster scaling
Python packages installed manually on individual nodes might be unavailable on newly provisioned core or task nodes during cluster scaling.
To make sure that packages exist in newly provisioned nodes, use a bootstrap action to install libraries instead of installing them manually.
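If you create clusters programmatically, the bootstrap action is declared as part of the cluster request. The following sketch builds one entry in the shape that the BootstrapActions parameter of the boto3 EMR client's run_job_flow call accepts. It assumes that you already uploaded a script containing your pip3 install commands to Amazon S3. The helper name build_bootstrap_action, the action name, and the S3 path are hypothetical placeholders.

```python
def build_bootstrap_action(script_s3_path, name="Install Python libraries", args=None):
    """Build one BootstrapActions entry for an EMR cluster request."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            # S3 location of the script that runs on every node at provisioning.
            "Path": script_s3_path,
            # Optional arguments passed to the script.
            "Args": list(args or []),
        },
    }


if __name__ == "__main__":
    # Hypothetical bucket and script name; replace with your own.
    action = build_bootstrap_action("s3://DOC-EXAMPLE-BUCKET/install_libraries.sh")
    print(action)
```

Because the bootstrap action runs on each node as it is provisioned, nodes added later during scaling receive the same libraries.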
If the package that you want to install is still unavailable after you use a bootstrap script, then check the bootstrap log to determine the issue.
Running instance
To review the bootstrap log on a running instance, first use SSH to connect to the primary node. Then, check the bootstrap logs for errors at /var/log/bootstrap-actions/bootStrapNumber/stderr and /var/log/bootstrap-actions/bootStrapNumber/stdout.
Note: In the preceding path, replace bootStrapNumber with the bootstrap script number.
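When a cluster runs several bootstrap scripts, it can be quicker to collect every stderr file in one pass. The following is a minimal sketch that you could run on the node. The function name read_bootstrap_stderr is hypothetical, and the sketch assumes that the user who runs it has permission to read the log directory (you might need sudo).

```python
from pathlib import Path


def read_bootstrap_stderr(log_root="/var/log/bootstrap-actions"):
    """Map each bootstrap script number to the text of its stderr file."""
    results = {}
    root = Path(log_root)
    if not root.is_dir():
        return results  # no bootstrap logs on this node
    for action_dir in sorted(root.iterdir()):
        stderr_file = action_dir / "stderr"
        if stderr_file.is_file():
            # errors="replace" avoids failures on unexpected bytes in the log.
            results[action_dir.name] = stderr_file.read_text(errors="replace")
    return results


if __name__ == "__main__":
    for number, text in read_bootstrap_stderr().items():
        print(f"--- bootstrap action {number} ---")
        print(text)
```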
New instance
For new instances, the bootstrap logs are captured in the Amazon Simple Storage Service (Amazon S3) bucket that you configured for Amazon EMR logging. Check the bootstrap logs for errors at s3://DOC-EXAMPLE-LOG-BUCKET/cluster-id/node/instance-id/bootstrap-actions/bootStrapNumber/stdout and s3://DOC-EXAMPLE-LOG-BUCKET/cluster-id/node/instance-id/bootstrap-actions/bootStrapNumber/stderr.
Note: In the preceding paths, bootStrapNumber represents the bootstrap script number.
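To avoid typing mistakes when you look up these locations for several nodes, you can generate the paths from their parts. The following sketch only assembles strings in the layout shown in the preceding paths and doesn't call any AWS API. The function name bootstrap_log_uris and the example IDs are hypothetical.

```python
def bootstrap_log_uris(log_bucket, cluster_id, instance_id, bootstrap_number):
    """Build the S3 locations of one bootstrap action's stdout and stderr logs."""
    base = (
        f"s3://{log_bucket}/{cluster_id}/node/{instance_id}"
        f"/bootstrap-actions/{bootstrap_number}"
    )
    return {"stdout": f"{base}/stdout", "stderr": f"{base}/stderr"}


if __name__ == "__main__":
    # Placeholder values; replace with your log bucket, cluster ID, and instance ID.
    uris = bootstrap_log_uris("DOC-EXAMPLE-LOG-BUCKET", "cluster-id", "instance-id", 1)
    print(uris["stdout"])
    print(uris["stderr"])
```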