AWS Glue: how to use a data connection (with its VPC settings) to read data from an external Postgres table


Hi, I have created a JDBC connection (with a VPC, subnet and security group) called 'my_test_connection' that connects to an external Postgres database. The connection test passed. I was also able to crawl the required table and create metadata in the Data Catalog, but how do I store the data as a Spark DataFrame / Parquet using a Glue PySpark script and the connection? I think if I use:

    connection_options = glueContext.getResolvedOptions(connection_name, ['URL', 'USERNAME', 'PASSWORD'])
    dataframe = spark.read \
        .format("jdbc") \
        .option("url", connection_options['URL']) \

and so on, I will not get the VPC details etc. Please suggest a good method to easily access the connection via the script.

1 Answer

The network details are applied at job start, not in that part of the code. To apply the VPC you would need a NETWORK connection type added to the job.

However, passing the URL/user/password as job parameters, as you are doing, is not good practice: they will be stored in the job run history and logs.
Instead, you can create a JDBC connection (which carries the VPC settings, so you don't need a NETWORK connection) and then, if you want to use the SparkSession API, get the connection details from the Glue connection using extract_jdbc_conf, see: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-extract_jdbc_conf
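A minimal sketch of that approach (the connection name, table name and S3 path below are placeholders, and the 'fullUrl'/'user'/'password' keys are assumed to be present in the dictionary returned by extract_jdbc_conf):

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    # The JDBC connection must be attached to the job, so its VPC/subnet/security
    # group settings are applied when the job starts.
    glueContext = GlueContext(SparkContext.getOrCreate())
    spark = glueContext.spark_session

    jdbc_conf = glueContext.extract_jdbc_conf("my_test_connection")

    dataframe = (
        spark.read.format("jdbc")
        .option("url", jdbc_conf["fullUrl"])       # full JDBC URL including the database
        .option("dbtable", "public.my_table")      # placeholder table name
        .option("user", jdbc_conf["user"])
        .option("password", jdbc_conf["password"])
        .option("driver", "org.postgresql.Driver")
        .load()
    )

    # Write the result out as Parquet (placeholder S3 path).
    dataframe.write.mode("overwrite").parquet("s3://my-bucket/my-table-parquet/")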

AWS EXPERT · answered 9 months ago
  • I tried it out:

        db_connection = glueContext.extract_jdbc_conf(connection_name='pegasys_orion_postgres')
        print(db_connection)
        # Gave me a dictionary containing:
        # {'fullUrl': '<url>', 'enforceSSL': 'false', 'skipCustomJDBCCertValidation': 'false',
        #  'url': '<url>', 'customJDBCCertString': '', 'user': 'user', 'customJDBCCert': '',
        #  'password': 'pw', 'vendor': 'postgresql'}

    Then I added:

        connection_details = {
            "dbTable": "table_name",
            "connectionName": 'my_connection_postgres',
            "useConnectionProperties": True,
            "url": "<url>",
        }

        dataSource = glueContext.create_dynamic_frame_from_options(
            connection_type="postgresql",
            connection_options=connection_details,
            format=None,
            format_options={},
            transformation_ctx="",
        )

    But I get the following error:

        dataSource = glueContext.create_dynamic_frame_from_options(
          File "/opt/amazon/lib/python3.7/site-packages/awsglue/context.py", line 237, in create_dynamic_frame_from_options
            return source.getFrame(**kwargs)
          File "/opt/amazon/lib/python3.7/site-packages/awsglue/data_source.py", line 37, in getFrame
            jframe = self._jsource.getDynamicFrame()
          File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
            return_value = get_return_value(
          File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
            return f(*a, **kw)
          File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
            raise Py4JJavaError(
        py4j.protocol.Py4JJavaError: An error occurred while calling o97.getDynamicFrame.
        : java.lang.ArrayIndexOutOfBoundsException: 1
            at com.amazonaws.services.glue.util.JDBCWrapper$.apply(JDBCUtils.scala:1000)
            at com.amazonaws.services.glue.util.JDBCWrapper$.apply(JDBCUtils.scala:996)
            at com.amazonaws.services.glue.JDBCDataSource.getDynamicFrame(DataSource.scala:1011)
            at com.amazonaws.services.glue.DataSource.getDynamicFrame(DataSource.scala:102)
            at com.amazonaws.services.glue.DataSource.getDynamicFrame$(DataSource.scala:102)
            at com.amazonaws.services.glue.AbstractSparkSQLDataSource.getDynamicFrame(DataSource.scala:726)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
            at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
            at py4j.Gateway.invoke(Gateway.java:282)
            at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
            at py4j.commands.CallCommand.execute(CallCommand.java:79)
            at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
            at py4j.ClientServerConnection.run(ClientServerConnection.j

  • That library is not receiving the url parameter; make sure you print what you are passing. You should be using the fullUrl from the connection.
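    For example, the options might be built roughly like this (a sketch, not from the original thread: it reuses the db_connection dictionary from the comment above, and the table name is a placeholder):

        connection_details = {
            "url": db_connection["fullUrl"],   # full JDBC URL, including the database
            "dbtable": "table_name",           # placeholder: the table to read
            "user": db_connection["user"],
            "password": db_connection["password"],
        }

        dataSource = glueContext.create_dynamic_frame_from_options(
            connection_type="postgresql",
            connection_options=connection_details,
        )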

  • I was using "fullUrl" but still kept getting the same error. My workaround was to use create_dynamic_frame_from_catalog, and it got me the dataframe without any issues.
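    For reference, that workaround might look roughly like this (a sketch; the database, table and S3 names are placeholders, and the crawled catalog table plus the attached JDBC connection supply the actual connection details):

        from awsglue.context import GlueContext
        from pyspark.context import SparkContext

        glueContext = GlueContext(SparkContext.getOrCreate())

        # Read via the Data Catalog table created by the crawler.
        dyf = glueContext.create_dynamic_frame_from_catalog(
            database="my_catalog_database",
            table_name="my_crawled_table",
            transformation_ctx="dyf",
        )

        df = dyf.toDF()  # convert to a Spark DataFrame if needed
        df.write.mode("overwrite").parquet("s3://my-bucket/my-table-parquet/")  # placeholder S3 path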
