Include additional python module in Glue job

0

I am trying to include python-oracledb in my job. I have followed the instructions from here, saving various versions of the relevant .whl files from PyPI. I have set the Glue job parameter --additional-python-modules as the key and the S3 URI as the value.

When I run my job I still get `ModuleNotFoundError: No module named 'oracledb'`.

Please help.

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import boto3
    import oracledb

Job params

  • Just found the error in the log: it is not supported. Any idea if it will ever be supported?

asked a year ago · 1093 views
2 Answers
0

Hi,

I understand you wish to use python-oracledb in your Glue PySpark ETL job. I ran some tests in my environment and can confirm this works with either of the following approaches:

  1. If your Glue job runs in a VPC subnet with public internet access (a NAT gateway is required, since Glue workers do not have public IP addresses [1]), you can specify the job parameter like this:
Key:  --additional-python-modules
Value:  oracledb
  2. If your Glue job runs in a VPC without internet access, you must create a Python repository on Amazon S3 by following this documentation [2] and include oracledb in your "modules_to_install.txt" file. You should then be able to install the package from your own Python repository on S3 by using the following parameters (make sure to replace MY-BUCKET with the real bucket name for your use case):
"--additional-python-modules" : "oracledb",
"--python-modules-installer-option" : "--no-index --find-links=http://MY-BUCKET.s3-website-us-east-1.amazonaws.com/wheelhouse --trusted-host MY-BUCKET.s3-website-us-east-1.amazonaws.com"
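The two approaches above can be sketched as a small helper that builds the job's DefaultArguments, for example when updating the job with boto3. This is a minimal sketch, not the only way to set the parameters; the bucket name, job name, role, and script location are placeholders, and the actual boto3 call is shown commented out since it requires AWS credentials and an existing job.

```python
BUCKET = "MY-BUCKET"  # placeholder: your wheelhouse bucket (static-website hosting enabled)
REPO_HOST = f"{BUCKET}.s3-website-us-east-1.amazonaws.com"

def build_default_arguments(vpc_has_internet: bool) -> dict:
    """Return Glue DefaultArguments that install oracledb at job start."""
    args = {"--additional-python-modules": "oracledb"}
    if not vpc_has_internet:
        # No internet: point pip at the private S3-hosted wheel repository
        # instead of PyPI (approach 2 above).
        args["--python-modules-installer-option"] = (
            f"--no-index --find-links=http://{REPO_HOST}/wheelhouse "
            f"--trusted-host {REPO_HOST}"
        )
    return args

# Applying the arguments to an existing job (uncomment to run for real):
# import boto3
# glue = boto3.client("glue")
# glue.update_job(
#     JobName="my-oracle-job",  # placeholder job name
#     JobUpdate={
#         "Role": "arn:aws:iam::123456789012:role/GlueJobRole",   # placeholder
#         "Command": {"Name": "glueetl",
#                     "ScriptLocation": "s3://MY-BUCKET/scripts/job.py"},
#         "DefaultArguments": build_default_arguments(vpc_has_internet=False),
#     },
# )
```

The same key/value pairs can of course be entered directly under "Job parameters" in the Glue console, as shown above.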

Ref:

[1] https://aws.amazon.com/premiumsupport/knowledge-center/nat-gateway-vpc-private-subnet/

[2] https://aws.amazon.com/blogs/big-data/building-python-modules-from-a-wheel-for-spark-etl-workloads-using-aws-glue-2-0/

AWS
Ethan_H
answered a year ago
0

Hello,

Thank you for your question. My name is Yvonne, from RDS team.

From your question I understand that you experienced the error "ModuleNotFoundError: No module named 'oracledb'" and also noticed the error "it is not supported" in the log while trying to include python-oracledb in your Glue job, and you want to know when it will be supported.

Unfortunately, I am not able to provide timelines, as our development team has its own schedule; however, we announce all new features when we release them in the blogs below [1] [2].

Please note that --additional-python-modules applies to Spark Glue jobs with Glue version 2.0 and 3.0. You can include the external Python library as described in the link [3].

For supported versions please refer to the below documentation:

[+] https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html

In case you require further assistance or have any queries, feel free to respond back to the case and I will be happy to assist you.

References:

[1] https://aws.amazon.com/new/
[2] https://aws.amazon.com/blogs/aws/
[3] https://docs.aws.amazon.com/glue/latest/dg/reduced-start-times-spark-etl-jobs.html#reduced-start-times-limitations

AWS
answered a year ago
