I have a Python package saved in CodeCommit and I need it to run in the notebook linked to an EMR cluster.

I have a Python package stored in CodeCommit and need to use it in the notebook attached to my EMR cluster workspace. The package already installs successfully through a bootstrap action: in my .sh file, I configure git to access CodeCommit and then run pip install git+https://my_package.foo. This part works fine.

However, in the PySpark notebook already attached to the cluster, if I try to install using sc.install_pypi_package("my_package"), it recognizes the package and proceeds with the installation. It even appears in the sc.list_packages() listing. But when I try to import it, I receive the error:

An error was encountered:
No module named 'notebook.notebookapp'
Traceback (most recent call last):
  File "/mnt2/yarn/usercache/livy/appcache/application_1710938070423_0006/container_1710938070423_0006_01_000001/tmp/spark-436a127f-1c84-491c-ad55-d60e244939d5/lib/python3.9/site-packages/my_package/__init__.py", line 31, in <module>
    from notebook.notebookapp import NotebookApp
ModuleNotFoundError: No module named 'notebook.notebookapp'

Any help is welcome, including other installation methods.

1 Answer

The Python environment that the EMR notebook uses is /emr/notebook-env/bin/python, which is different from the default /usr/bin/python. This is also why you may see different results from pip list and !pip list when you run them from an EMR notebook.
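To confirm which environment actually has the package, something like the following could be run on a cluster node. It is only a minimal sketch: `check_module` is a hypothetical helper name, not an EMR or AWS tool, and it simply asks a given interpreter to import a module.

```shell
#!/bin/bash
# Report whether a module can be imported by a specific Python interpreter.
# check_module is a hypothetical helper written for this illustration.
# Usage: check_module <python_binary> <module_name>
check_module() {
  local python_bin="$1" module="$2"
  # Return 2 if the interpreter path does not exist or is not executable.
  if [ ! -x "$python_bin" ]; then
    echo "$python_bin: interpreter not found"
    return 2
  fi
  # Ask the interpreter to import the module; discard its own output.
  if "$python_bin" -c "import $module" >/dev/null 2>&1; then
    echo "$python_bin: $module is importable"
  else
    echo "$python_bin: $module is NOT importable"
    return 1
  fi
}
```

For example, `check_module /usr/bin/python my_package` followed by `check_module /emr/notebook-env/bin/python my_package` would show whether the two environments named above disagree.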

As a next step, you have two options:

  • Install the Python dependency manually from the EMR notebook. This has to be repeated for each cluster.

  • To automate the installation on the EMR cluster, use the script below (replace the placeholder <packages> with the packages you need) so that they are installed in both Python environments:

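#!/bin/bash
# Install into the default system Python environment
sudo pip3 install <packages>
# Install into the environment used by the EMR notebook
sudo /emr/notebook-env/bin/pip install <packages>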

The catch is that this has to run as a delayed bootstrap action, so that the script runs only after the EMR cluster reaches the WAITING state; see https://repost.aws/knowledge-center/emr-update-all-nodes-bootstrap. The delay is needed because a regular bootstrap action runs before the /emr/notebook-env path exists, so the install fails, the bootstrap action fails, and the cluster terminates.

As you may already know, bootstrap actions by default run before the application provisioning phase of the EMR cluster.
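The idea of delaying the install can be sketched roughly as follows. This is not the script from the linked article: it assumes that polling for the notebook pip binary is an acceptable readiness signal, and the function names, the 1800-second timeout, and the 15-second interval are all hypothetical choices made for illustration.

```shell
#!/bin/bash
# Sketch of a delayed install: wait for the notebook environment, then install.
# NOTEBOOK_PIP comes from the answer; the timeout and poll interval are
# arbitrary assumptions, not values recommended by AWS.
NOTEBOOK_PIP="/emr/notebook-env/bin/pip"

# wait_for_path <path> <timeout_seconds> [interval_seconds]
# Returns 0 once <path> exists, or 1 if it has not appeared within the timeout.
wait_for_path() {
  local path="$1" timeout="$2" interval="${3:-15}" waited=0
  while [ ! -e "$path" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# install_when_ready <package>...
# Waits up to 30 minutes for the notebook pip, then installs the packages.
install_when_ready() {
  if ! wait_for_path "$NOTEBOOK_PIP" 1800; then
    echo "timed out waiting for $NOTEBOOK_PIP" >&2
    return 1
  fi
  sudo "$NOTEBOOK_PIP" install "$@"
}
```

A bootstrap script built on this would end with a background call such as `install_when_ready <packages> &`, so that the bootstrap action itself returns right away and provisioning can continue. Whether a background process started this way survives the end of the bootstrap action should be verified against the linked article, which describes the approach AWS documents.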

AWS
answered 8 days ago
