Bug: Glue 4.0 | Python 3.10 | pandas library

I have been experimenting with Glue 4.0, which supports Python 3.10 and pandas.

I am adding pandas as a zipped library through the --extra-py-files option for a glueetl job.

When I run the job, importing pandas (version 1.4.3, via import pandas as pd) fails with the following traceback, copied from the CloudWatch logs:

2022-12-06 16:49:09,450 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(77)): Error from Python:Traceback (most recent call last):
  File "/tmp/database_monitoring.py", line 2, in <module>
    import pandas as pd
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/__init__.py", line 48, in <module>
    from pandas.core.api import (
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/api.py", line 47, in <module>
    from pandas.core.groupby import (
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/groupby/__init__.py", line 1, in <module>
    from pandas.core.groupby.generic import (
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 76, in <module>
    from pandas.core.frame import DataFrame
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 170, in <module>
    from pandas.core.generic import NDFrame
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/generic.py", line 147, in <module>
    from pandas.core.describe import describe_ndframe
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/describe.py", line 45, in <module>
    from pandas.io.formats.format import format_percentiles
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/io/formats/format.py", line 105, in <module>
    from pandas.io.common import (
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/io/common.py", line 8, in <module>
    import bz2
  File "/usr/local/lib/python3.10/bz2.py", line 17, in <module>
    from _bz2 import BZ2Compressor, BZ2Decompressor
ModuleNotFoundError: No module named '_bz2'

I believe this is a bug in AWS Glue 4.0 as opposed to a user issue. Is anyone able to advise or confirm? And if so, is there a bug fix planned for this?

JDay
asked a year ago, 719 views
2 Answers
0

Do you need a specific or newer version than the one Glue already includes? If not, there is no need to provide a zip at all. I can't and won't test Glue 4.0, since the documentation (e.g. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html) and other service functionality, such as the aws_glue_interactive_sessions module, have not been updated. So much for "general availability".

answered a year ago
  • I did attempt this but found it was more trouble than it was worth. I am using libraries that also depend on pandas, so I would need to add custom logic to ignore the pandas dependency when installing those libraries. And even then, that assumes the pandas version AWS offers is compatible.

    This would add significant overhead to something which is supposed to be an out-of-the-box solution. Hence, I just won't use Glue 4.0 and will think of an alternative unless this is resolved.

  • I guess some context was missing, and neither of us actually googled the error message ;)

0

The pandas version bundled with Glue hits the same error, so this has to be fixed at the system level; see e.g. https://stackoverflow.com/questions/50335503/no-module-named-bz2-in-python3.
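
To confirm that the interpreter itself is missing the compiled compression modules (the traceback ends inside /usr/local/lib/python3.10/bz2.py, not inside pandas), a small check like the one below could be run at the top of the job script before importing pandas. This is only a sketch: the helper name missing_extension_modules and the list of modules it probes are my own choices for illustration, not something Glue provides.

```python
import importlib


def missing_extension_modules(names=("_bz2", "_lzma", "zlib")):
    """Return the module names from `names` that this interpreter cannot import.

    The default names are compression extension modules that CPython only
    builds when the matching system libraries are present at build time.
    """
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # ModuleNotFoundError is a subclass of ImportError, so this also
            # covers the "No module named '_bz2'" case from the traceback.
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Print the result so it shows up in the job's logs.
    print("Missing extension modules:", missing_extension_modules())
```

If _bz2 shows up in the output, the problem lies in how the Python runtime was built rather than in the pandas package you supplied, which would match the error appearing with the bundled pandas as well.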

answered a year ago
