Can't install pyarrow on AWS Glue python shell

I want to import pyarrow in a Python shell Glue script because I need to export a dataframe as parquet (i.e. with DataFrame.to_parquet()).

The way to add custom dependencies suggested in the AWS docs is to use .egg or .whl files (https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html#create-python-extra-library).

The library pyarrow has numpy and six as dependencies:

  • numpy is already pre-installed on Glue, at version 1.16.2, as I checked with a simple print(numpy.version.version).
  • six is not pre-installed, so I downloaded six-1.14.0-py2.py3-none-any.whl from PyPI and uploaded it to S3.
  • pyarrow is not pre-installed, so I downloaded the wheel file pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl from PyPI and uploaded it to S3.
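
The version check used for numpy above can be extended to all three packages. The following is a minimal sketch, assuming it is run as the body of the Python shell job; the helper name report_versions is hypothetical and it relies only on the standard library.

```python
import importlib


def report_versions(module_names):
    """Return {module name: version string, or None if it cannot be imported}."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The module is not available in this environment.
            versions[name] = None
            continue
        # Not every module exposes __version__, so fall back to a placeholder.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in report_versions(["numpy", "six", "pyarrow"]).items():
        print(name, version if version is not None else "not installed")
```

Running this before adding any wheel files shows which of the dependencies the job environment already provides.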

The minimal script is this:

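import pandas as pd
import six
import numpy
import pyarrow  # parquet engine used by DataFrame.to_parquet()

# Small sample frame; dtype=float is dropped because 'Name' holds strings.
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
df.to_parquet('test.parquet')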

When I run the script with the six and pyarrow wheel files added as libraries, I get the following message:

Processing ./glue-python-libs-f8nyy9el/six-1.14.0-py2.py3-none-any.whl
Installing collected packages: six
Successfully installed six-1.14.0
Processing ./glue-python-libs-f8nyy9el/pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl

and the following error:

WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f7d78c96e10>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/numpy/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f7d78c96c88>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/numpy/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f7d78c96dd8>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/numpy/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f7d78c969b0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/numpy/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f7d78c96898>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/numpy/
ERROR: Could not find a version that satisfies the requirement numpy>=1.14 (from pyarrow==0.16.0) (from versions: none)
ERROR: No matching distribution found for numpy>=1.14 (from pyarrow==0.16.0)
Traceback (most recent call last):
  File "/tmp/runscript.py", line 112, in <module>
    download_and_install(args.extra_py_files)
  File "/tmp/runscript.py", line 62, in download_and_install
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--target={}".format(install_path), local_file_path])
  File "/usr/local/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/local/bin/python', '-m', 'pip', 'install', '--target=/glue/lib/installation', '/tmp/glue-python-libs-f8nyy9el/pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl']' returned non-zero exit status 1.

So at first, it seems that six is installed correctly, but then it looks like the job does not realize that numpy is already present with a compatible version.
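
Judging from the traceback, the job runner seems to invoke pip once per uploaded file, so the dependencies of pyarrow would be resolved inside that file's own pip call. The sketch below only rebuilds those command lines for inspection and does not run them. The function name build_pip_commands and the per-file loop are assumptions; only the argument list is taken from the traceback, and the six path is inferred from the log output.

```python
import sys


def build_pip_commands(install_path, local_file_paths):
    """Rebuild one pip command per file, mirroring the call shown in the traceback."""
    commands = []
    for local_file_path in local_file_paths:
        commands.append([
            sys.executable, "-m", "pip", "install",
            "--target={}".format(install_path),
            local_file_path,
        ])
    return commands


if __name__ == "__main__":
    # Paths as they appear (or are inferred) from the job log above.
    wheels = [
        "/tmp/glue-python-libs-f8nyy9el/six-1.14.0-py2.py3-none-any.whl",
        "/tmp/glue-python-libs-f8nyy9el/pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl",
    ]
    for command in build_pip_commands("/glue/lib/installation", wheels):
        print(" ".join(command))
```

Each printed line is a separate pip process, which would explain why the pyarrow step goes looking for its requirements on its own.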

Then I also uploaded to S3 the numpy wheel file that I downloaded from PyPI, as s3://risultati-navigazione-wt-ga/libs/numpy-1.18.2-cp36-cp36m-manylinux1_x86_64.whl. In this case I get the message:

Processing ./glue-python-libs-xzfdvgzd/numpy-1.18.2-cp36-cp36m-manylinux1_x86_64.whl
Installing collected packages: numpy
Successfully installed numpy-1.18.2
Processing ./glue-python-libs-xzfdvgzd/six-1.14.0-py2.py3-none-any.whl
Installing collected packages: six
Successfully installed six-1.14.0
Processing ./glue-python-libs-xzfdvgzd/pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl

and the error:

WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fca02861cc0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fca02861cc0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fca02861cc0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fca02861cc0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fca02861cc0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
ERROR: Could not find a version that satisfies the requirement six>=1.0.0 (from pyarrow==0.16.0) (from versions: none)
ERROR: No matching distribution found for six>=1.0.0 (from pyarrow==0.16.0)
Traceback (most recent call last):
  File "/tmp/runscript.py", line 112, in <module>
    download_and_install(args.extra_py_files)
  File "/tmp/runscript.py", line 62, in download_and_install
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--target={}".format(install_path), local_file_path])
  File "/usr/local/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/local/bin/python', '-m', 'pip', 'install', '--target=/glue/lib/installation', '/tmp/glue-python-libs-xzfdvgzd/pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl']' returned non-zero exit status 1.

So this time numpy is recognized during the installation of pyarrow but, as far as I understand, although six is installed correctly, for some reason pyarrow can't find it during installation, and pip tries to download it from the Internet instead (it hangs for a few minutes during that operation).

Can anybody help me? Thanks!

ecanovi
asked 4 years ago, 1477 views
1 Answer
It seems that you attached a Glue connection to your Python shell job so that it can reach the DBs inside your VPC, and that the VPC has no internet connectivity. That is what caused the connection timeouts to pypi.org.

If you can give your Python shell job internet connectivity, it will work:

  • Add a NAT gateway to your public subnet so that you can access the internet from your private subnet
  • Add an internet gateway
  • Remove the Glue connection if you do not need access to your DBs

See details here: https://aws.amazon.com/premiumsupport/knowledge-center/ec2-internet-connectivity/?nc1=h_ls
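
To confirm that missing connectivity is the cause, a quick check can be run at the top of the job. This is a minimal sketch using only the standard library; the helper name can_reach and the 5-second timeout are assumptions chosen for illustration.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        connection = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers timeouts, refused connections, and DNS failures.
        return False
    connection.close()
    return True


# Example call inside the job: print("pypi.org reachable:", can_reach("pypi.org"))
```

If the call returns False from inside the job, pip will not be able to download missing dependencies from pypi.org either.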

If you do not want to have internet connectivity, please update this thread.

Edited by: NoritakaS-AWS on Mar 30, 2020 1:30 AM

AWS
answered 4 years ago
