AWS Glue : how to use data connections with all the VPC settings to read data from a postgres table from outside

0

Hi, I have created a JDBC connection (setup VPC, subnet and security group) called 'my_test_connection that connects to an external postgres database. The connection test passed. I was also able crawl the required table and create a meta data in data catalog but how do I store the data as a spark data frame/parquet using a Glue pyspark script and the connection. I think if I use :

connection_options = glueContext.getResolvedOptions( connection_name, ['URL', 'USERNAME', 'PASSWORD'] ) dataframe = spark.read
.format("jdbc")
.option("url", connection_options['URL']) \ and so on .... I will not get the VPC details and etc. Please suggest a good method to easily access the connectio nvia script.

1 Antwort
0

The network details are applied on job start not in that part of the code. To apply the VPC you would need a NETWORK connection type added to the job.

However, adding url/user/password as parameters are you are doing is not a good practice, they will be stored on the job run history and logs.
Instead you can create a JDBC connection (where you put the VPC so you don't need a NETWORK connection) and then if you want to use the SparkSession API, get the connection details from the Glue connection using extract_jdbc_conf, see: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-extract_jdbc_conf

profile pictureAWS
EXPERTE
beantwortet vor 10 Monaten
  • I tried it out :

    db_connection = glueContext.extract_jdbc_conf(connection_name='pegasys_orion_postgres') print(db_connection) #Gave me a dictionary containing {'fullUrl': '<url>', 'enforceSSL': 'false', 'skipCustomJDBCCertValidation': 'false', 'url': ',<url>', 'customJDBCCertString': '', 'user': 'user', 'customJDBCCert': '', 'password': 'pw', 'vendor': 'postgresql'} Then I added : connection_details = { "dbTable": "table_name", "connectionName" : 'my_connection_postgres', "useConnectionProperties": True, "url": "<url>", }

    dataSource = glueContext.create_dynamic_frame_from_options(
    connection_type="postgresql", connection_options=connection_details, format=None, format_options={}, transformation_ctx = "" )

    But I get the following error : dataSource = glueContext.create_dynamic_frame_from_options( File "/opt/amazon/lib/python3.7/site-packages/awsglue/context.py", line 237, in create_dynamic_frame_from_options return source.getFrame(**kwargs) File "/opt/amazon/lib/python3.7/site-packages/awsglue/data_source.py", line 37, in getFrame jframe = self._jsource.getDynamicFrame() File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in call return_value = get_return_value( File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco return f(*a, **kw) File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return

  • File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value raise Py4JJavaError( py4j.protocol.Py4JJavaError: An error occurred while calling o97.getDynamicFrame. : java.lang.ArrayIndexOutOfBoundsException: 1 at com.amazonaws.services.glue.util.JDBCWrapper$.apply(JDBCUtils.scala:1000) at com.amazonaws.services.glue.util.JDBCWrapper$.apply(JDBCUtils.scala:996) at com.amazonaws.services.glue.JDBCDataSource.getDynamicFrame(DataSource.scala:1011) at com.amazonaws.services.glue.DataSource.getDynamicFrame(DataSource.scala:102) at com.amazonaws.services.glue.DataSource.getDynamicFrame$(DataSource.scala:102) at com.amazonaws.services.glue.AbstractSparkSQLDataSource.getDynamicFrame(DataSource.scala:726) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.j

  • That library is not receiving the url parameter, make sure you print what you are passing, you should be using the fullUrl from the connection

  • I was using "fullurl" but still kept getting the same error. But my workaround was to use create_dynamic_frame_from_catalog and it got me the dataframe without any issues.

Du bist nicht angemeldet. Anmelden um eine Antwort zu veröffentlichen.

Eine gute Antwort beantwortet die Frage klar, gibt konstruktives Feedback und fördert die berufliche Weiterentwicklung des Fragenstellers.

Richtlinien für die Beantwortung von Fragen