Hello,
I understand you wish to use python-oracledb in your Glue PySpark ETL job. This can be done with either of the following approaches:
- If your Glue job runs in a VPC subnet with public Internet access (a NAT gateway is required, since Glue workers don't have public IP addresses [1]), you can specify the job parameter like this:
Key: --additional-python-modules
Value: oracledb
- If your Glue job runs in a VPC without internet access, you must create a Python repository on Amazon S3 by following this documentation [2] and include oracledb in your "modules_to_install.txt" file. Then you should be able to install the package from your own Python repository on S3 using the following parameters (make sure to replace MY-BUCKET with your real bucket name):
"--additional-python-modules" : "oracledb",
"--python-modules-installer-option" : "--no-index --find-links=http://MY-BUCKET.s3-website-us-east-1.amazonaws.com/wheelhouse --trusted-host MY-BUCKET.s3-website-us-east-1.amazonaws.com"
- As you are facing the "CommandFailedException: Library file doesn't exist" error, please also check the IAM permissions for Glue and for the S3 object; a quick check is sketched below.
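To rule out a missing object versus a permissions problem, you could run a check like the following with the same credentials (or assumed role) that the Glue job uses; the bucket and key are placeholders for your library location:

```python
import boto3
from botocore.exceptions import ClientError

# Sketch only: bucket and key are placeholders for your library location.
s3 = boto3.client("s3")
try:
    s3.head_object(Bucket="MY-BUCKET", Key="etl_jobs/my_etl_job.zip")
    print("Object exists and is readable by this role.")
except ClientError as err:
    # A 404 means the key doesn't exist; a 403 points at IAM or bucket policy.
    print("head_object failed:", err.response["Error"]["Code"])
```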
- Unless a library is contained in a single .py file, it should be packaged in a .zip archive [3]. Please try creating zip files and use Python 3.9. To use extra Python files, set the job parameter as follows:
Key: --extra-py-files
Value: s3://<bucket_name>/etl_jobs/my_etl_job.zip
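As an illustration, a pure-Python package can be bundled for --extra-py-files with the standard zipfile module. The package name my_etl_job below is a placeholder; note this approach only works for pure-Python code, which is why it fails for a library with compiled extensions like python-oracledb:

```python
import zipfile
from pathlib import Path

# Sketch: bundle a pure-Python package directory for --extra-py-files.
# "my_etl_job" is a placeholder; the package directory must sit at the
# root of the archive so Glue can import it by name.
package_dir = Path("my_etl_job")

with zipfile.ZipFile("my_etl_job.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for path in package_dir.rglob("*.py"):
        # Relative paths like "my_etl_job/module.py" become the arcnames.
        zf.write(path, path.as_posix())
```

Upload the resulting zip to S3 and point --extra-py-files at it.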
- For Glue Python Shell jobs, you can add Python libraries (not Spark); the method to do so is described here: https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html#create-python-extra-library
- For Glue ETL (PySpark) jobs, you can find information on how to add additional libraries here: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html
- Please refer to articles [4], [5], and [6] below for a more detailed explanation of the above.
References:
[1] https://aws.amazon.com/premiumsupport/knowledge-center/nat-gateway-vpc-private-subnet/
[2] https://aws.amazon.com/blogs/big-data/building-python-modules-from-a-wheel-for-spark-etl-workloads-using-aws-glue-2-0/
[3] https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html#aws-glue-programming-python-libraries-zipping
[4] https://aws.amazon.com/premiumsupport/knowledge-center/glue-version2-external-python-libraries/
[5] https://stackoverflow.com/questions/61217834/how-to-use-extra-files-for-aws-glue-job
[6] https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html
In order for me to troubleshoot further by looking at the logs in the backend, please feel free to open a support case with AWS, attaching the sanitized script, the job run details, and any additional dependency you are trying to import, and we would be happy to help.
For example: 1) I am not finding the --additional-python-modules key in the AWS console. Has its name been changed?
Key: --extra-py-files
Value: s3://<bucket_name>/etl_jobs/my_etl_job.zip
I also tried adding a zip file for oracledb in the S3 bucket, but it is giving ModuleNotFoundError: No module named 'oracledb'. After adding this file, do I need to change something in my script so that it reads from this file?
When adding a wheel file instead, I get this error:
ImportError: cannot import name 'base_impl' from partially initialized module 'oracledb' (most likely due to a circular import) (/glue/lib/installation/oracledb/__init__.py)
You can use --additional-python-modules even if it's not offered in the console. The reason you get that 'base_impl' import error is that the library cannot find the native .so file; the wheel needs to be precompiled for the platform, which is why it's better to just install it from pip.
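Once pip has installed the module via --additional-python-modules, a minimal connection test inside the job script could look like the sketch below. The host, port, service name, and credentials are placeholders; python-oracledb's thin mode needs no Oracle Client libraries, which suits Glue workers:

```python
import oracledb

# Sketch: thin mode (the default) needs no Oracle Client libraries.
# Host, port, service name, and credentials are placeholders.
connection = oracledb.connect(
    user="MY_USER",
    password="MY_PASSWORD",
    dsn="db.example.com:1521/MY_SERVICE",
)

with connection.cursor() as cursor:
    cursor.execute("SELECT sysdate FROM dual")
    print(cursor.fetchone())

connection.close()
```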