
How do I troubleshoot Python library installation issues on my EMR cluster?


I want to troubleshoot Python library installation issues on my Amazon EMR cluster.

Resolution

The Python package is unavailable on Amazon EMR

Note: By default, Amazon EMR installs Python. However, Amazon EMR doesn’t install all Python libraries.

If the Python package is unavailable on Amazon EMR, then log in to the node where the missing-package error occurred. Then, run commands similar to the following examples to check whether the Python packages are installed.

Example command to check whether specific packages are installed:

sudo pip3 freeze | grep pandas
sudo pip3 freeze | grep numpy

Note: Replace pandas and numpy with the Python packages you want to check.

Expected output:

pandas==1.3.5
numpy==1.21.6
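
The same check can also be run from Python without calling pip. The following is a minimal sketch, assuming Python 3.8 or later (required for importlib.metadata). The helper name installed_versions is hypothetical. It reports on the interpreter that runs it, which might not be the interpreter that pip3 manages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            # The distribution isn't installed for this interpreter.
            result[name] = None
    return result


if __name__ == "__main__":
    # Replace pandas and numpy with the packages you want to check.
    for name, ver in installed_versions(["pandas", "numpy"]).items():
        print(f"{name}=={ver}" if ver else f"{name} is not installed")
```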

Example command to see a list of all Python libraries installed on the cluster:

sudo pip3 freeze

Expected output:

aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.3
boto==2.49.0
click==8.1.3
docutils==0.14
jmespath==1.0.1
numpy==1.21.6
pandas==1.3.5
...

You can also run commands similar to the following examples in the Python shell: 

import pandas as pd
import numpy as np

If the preceding commands return an error such as ModuleNotFoundError: No module named 'python_library', then the library isn't installed.
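
To test several modules at once without triggering the error, you can ask the import system whether it can locate each one. This is a minimal sketch that uses the standard library's importlib.util.find_spec. The helper name missing_modules is hypothetical, and the sketch assumes that you pass top-level import names (for example, sklearn, not scikit-learn).

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that this interpreter can't locate."""
    # find_spec returns None when no finder can locate a top-level module.
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Replace these names with the modules that your job imports.
    missing = missing_modules(["pandas", "numpy"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All modules found")
```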

To install Python libraries on your Amazon EMR clusters, run the following command:

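sudo pip3 install pandas scipy scikit-learn

Note: Replace pandas scipy scikit-learn with the Python libraries you want to install. Use the package name as published on PyPI, which can differ from the import name. For example, the sklearn module is installed with the scikit-learn package.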

Or you can run the following command to read and install Python packages from a .txt file:

sudo pip3 install -r requirements.txt

Note: Replace requirements.txt with a .txt file that contains a list of Python packages and libraries that you want to install.
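
If a package still can't be imported after pip3 reports a successful installation, pip3 might belong to a different Python interpreter than the one that runs your code. The following sketch, which is an illustration and not an Amazon EMR feature, invokes pip through the current interpreter so that the packages land where that interpreter can import them. The function names are hypothetical, and the command might need elevated permissions on your nodes.

```python
import subprocess
import sys


def pip_install_command(requirements_file="requirements.txt"):
    """Build a pip command that is bound to the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", "-r", requirements_file]


def install_requirements(requirements_file="requirements.txt"):
    """Run the installation and raise CalledProcessError if pip fails."""
    subprocess.run(pip_install_command(requirements_file), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so you can review it first.
    print(" ".join(pip_install_command()))
```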

For more information, see Installing packages on the Python packaging website. Also, see Installing and using kernels and libraries in EMR Studio.

Python packages aren't available on newly provisioned core or task nodes during cluster scaling

Python packages installed manually on individual nodes might be unavailable on newly provisioned core or task nodes during cluster scaling.

To make sure that packages exist in newly provisioned nodes, use a bootstrap action to install libraries instead of installing them manually.
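
If you create clusters programmatically, a bootstrap action is passed as a small structure that points to a script in Amazon S3. The following sketch builds that structure in the shape that the boto3 run_job_flow call accepts for its BootstrapActions parameter. The S3 path, script name, and helper name are hypothetical, and the script itself (for example, one that runs sudo pip3 install) is assumed to already exist in your bucket.

```python
def bootstrap_action(script_s3_path, name="Install Python libraries", args=None):
    """Build one entry for the BootstrapActions list of an EMR cluster request."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,      # S3 location of the install script
            "Args": list(args or []),    # arguments passed to the script
        },
    }


if __name__ == "__main__":
    # Hypothetical bucket and script name; replace with your own.
    action = bootstrap_action("s3://DOC-EXAMPLE-BUCKET/install-libs.sh")
    print(action)
    # With boto3 installed, the entry would be passed like this
    # (other required cluster parameters omitted):
    # boto3.client("emr").run_job_flow(..., BootstrapActions=[action])
```

Because the bootstrap action runs on every node as it's provisioned, nodes added during scaling get the same libraries as the original nodes.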

If the package you want to install is unavailable even through a bootstrap script, then check the bootstrap log to determine the issue.

Running instance

To review the bootstrap log on a running instance, first use SSH to connect to the primary node. Then, check the bootstrap logs for errors at /var/log/bootstrap-actions/bootStrapNumber/stderr and /var/log/bootstrap-actions/bootStrapNumber/stdout.

Note: In the preceding path, replace bootStrapNumber with the bootstrap script number.
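
When a cluster runs several bootstrap actions, it can be quicker to scan every stderr file at once. The following is a minimal sketch that assumes the log layout described above. The function name and the keyword list are hypothetical, so adjust the keywords to the messages that you expect. You might need elevated permissions to read the log files.

```python
from pathlib import Path


def find_bootstrap_errors(log_root="/var/log/bootstrap-actions",
                          keywords=("error", "failed", "no matching distribution")):
    """Return (file, line number, line) for each stderr line that contains a keyword."""
    root = Path(log_root)
    if not root.is_dir():
        return []
    findings = []
    # One subdirectory per bootstrap action, each with its own stderr file.
    for stderr_file in sorted(root.glob("*/stderr")):
        text = stderr_file.read_text(errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            lowered = line.lower()
            if any(keyword in lowered for keyword in keywords):
                findings.append((str(stderr_file), line_number, line.strip()))
    return findings


if __name__ == "__main__":
    for path, number, line in find_bootstrap_errors():
        print(f"{path}:{number}: {line}")
```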

New instance

For new instances, the bootstrap logs are captured in the Amazon Simple Storage Service (Amazon S3) bucket that you configured for Amazon EMR logging. Check the bootstrap logs for errors at s3://DOC-EXAMPLE-LOG-BUCKET/cluster-id/node/instance-id/bootstrap-actions/bootStrapNumber/stdout and s3://DOC-EXAMPLE-LOG-BUCKET/cluster-id/node/instance-id/bootstrap-actions/bootStrapNumber/stderr.

Note: In the preceding paths, replace DOC-EXAMPLE-LOG-BUCKET, cluster-id, and instance-id with your own values. bootStrapNumber represents the bootstrap script number.
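
To avoid typing these paths by hand, you can assemble them from their parts. The following sketch only formats strings according to the path pattern shown above and doesn't call any AWS API. The function name and the sample IDs are hypothetical.

```python
def bootstrap_log_uris(log_bucket, cluster_id, instance_id, bootstrap_number):
    """Build the S3 locations of the stdout and stderr logs for one bootstrap action."""
    base = (
        f"s3://{log_bucket}/{cluster_id}/node/{instance_id}"
        f"/bootstrap-actions/{bootstrap_number}"
    )
    return {"stdout": f"{base}/stdout", "stderr": f"{base}/stderr"}


if __name__ == "__main__":
    # Placeholder values; replace with your bucket, cluster ID, and instance ID.
    uris = bootstrap_log_uris("DOC-EXAMPLE-LOG-BUCKET", "cluster-id", "instance-id", 1)
    print(uris["stdout"])
    print(uris["stderr"])
```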

AWS OFFICIAL. Updated 3 months ago