AWS Glue problems reading from PostgreSQL DB that has uppercased table and column names

0

I have an RDS PostgreSQL database that has table names and column names with upper cased characters. I have created a glue crawler that connects to the database via jdbc and populates the glue data catalog with the database schemas but in that translation the upper case characters are converted to lower case characters when stored in the data catalog tables. When I run a glue job to query I get this error:

An error occurred while calling o94.getDynamicFrame. ERROR: column "id" does not exist

I made a copy of this table and changed the table names and column names to have all lower case characters and the same glue jobs and queries run successfully.

Changing the table names and column names to lower cased characters in our production environment is just not an option due to the extensive work it would require.

I found the 'Edit Schema' option in the Glue UI where you can change column names and data types and thought for a moment the solution had been found. However, when you change a character to upper case and then select "Save" it is reverted to lower case as it is saved.

I have edited the pyspark script directly and worked with the glueContext.create_dynamic_frame.from_catalog method using the additional_options parameter to build my select statement using upper and lower case characters but that still fails with the error message noted above.

Script generated for node PostgreSQL table

PostgreSQLtable_node1 = glueContext.create_dynamic_frame.from_catalog( database="mydatabase", table_name="mytable", additional_options={"query":"SELECT id from mytable;"}, transformation_ctx="PostgreSQLtable_node1" I believe the failure is because the schema as it is stored in the data catalog contains lower characters while the actual schema in the database is upper characters so when Glue tries to work with the table it is looking for "id" while the actual is "ID" and so "not found" is returned.

I have read about the CaseSensitive option and looking in that direction next for a solution.

I have not seen any recent (less than couple years old) posts about this issue so not sure if I'm missing something.

Any assistance would be greatly appreciated.

2 Answers
0
Accepted Answer

Following is feedback from AWS support. I used this feedback and was able to successfully solve the issue:

I believe the issue here is not the job bookmarks but the query being passed to the PG instance. Using the from_catalog method to query the DB will not work here. When we use the from_catalog method to create the DynamicFrame, Glue will first consult the catalog for the column and then check with the PG instance. Glue will fold the column name to lowercase and pass it to the PG instance. PG, being case sensitive, will try to search for a column by name id, which it will not find and so the error is thrown.

Instead, we need to read directly from the PG instance using the from_options method. I request you to add the query, but escape the case-sensitive column from it to use the exact column name so that Glue will not fold it to lowercase and the PG engine will be able to find the required column in the table. Below is how you can do it.

Sample code snippet:

=====================

###Read from PG connection_pg_options = { "url": "jdbc:postgresql://<jdbc-host-name>:5432/db", "dbtable": "public.test", "user": "admin", "password": "pwd", "sampleQuery": "select "ID" from public.test", }

PGNode = glueContext.create_dynamic_frame.from_options(connection_type="postgresql", connection_options=connection_pg_options)

PGNode.show()

=====================

I have tested this at my end with a PG schema like below.

===================== create table with_case ("ID" int PRIMARY KEY, "firstName" varchar(10), "lastName" varchar(10));

Glue code snippet to read:

connection_pg_options = { "url": "jdbc:postgresql://.mydatabase.us-west-2.rds.amazonaws.com:5432/pilot", "dbtable": "public.with_case", "user": "postgres", "password": "mypassword", "sampleQuery": "select "ID" from public.with_case" }

PGNode = glueContext.create_dynamic_frame.from_options(connection_type="postgresql", connection_options=connection_pg_options)

PGNode.show()

Output:

{ "ID": 3 }

{ "ID": 2 }

{ "ID": 1 }

answered 2 years ago
0

Hello,

Glue catalog only allows lowercase column/database names. This is the reason behind observing lower case column names although they are having camel case column names in source.

Even if you try to edit the column names manually in the glue table schema inside Glue catalog to Camelcase, when you save it you will observe it saving in lower case.

As postgresql is case sensitive, so you can bypass the reading by putting the source table name and column name in quotes. I have used below code to read the data from Postgresql table which has columns and table name in uppercase and it i working fine.

connection_psql_options = {
    "url": "jdbc:postgresql://database-us-east-1.rds.amazonaws.com:5432/testdb",
    "dbtable": "public.\"COMPANY1\"",
    "user": "postgres",
    "password": "Postgres123",
    "sampleQuery" : "select \"ID\" from public.\"COMPANY1\"",
    }
    
dyf_psql = glueContext.create_dynamic_frame.from_options(connection_type="postgresql",
                                                          connection_options=connection_psql_options)

dyf_psql.toDF().show(5)
AWS
answered 2 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions