Issue with Redshift Spectrum/Glue Crawler - Unintended Column Splitting on strings

0

Hey,

We have a Glue crawler crawling a series of CSVs in S3 and capturing this in a database. This is surfaced in Redshift via Spectrum Schema.

The problem we have is that in Redshift, the delimiter/quote is not being respected for commas within quotes causing unintended splits in a column of text: Enter image description here

So far, we have attempted to add a classifier to the crawler as such Enter image description here

which did not resolve the problem, the table property here is currently quoteclassifer true as I manually set it, but, the crawler runs and overrides this value back to false. Enter image description here

We also tried changing the datatypes from string to varchar(1000) in the JSON Schema but this did not seem to work either.

This is not an issue when you open the CSV in Excel/Notepad/VSCode.

OmniFlo
asked 5 months ago152 views
1 Answer
0

Greetings from AWS! I understand that Glue crawler reset "areColumnsQuoted" parameter each time it runs, such caused Redshift Spectrum cannot split your csv data correctly. To fix this issue, after manually changed the table parameter in your Redshift Spectrum table, you can try editing your glue crawler --> "Set output and scheduling" --> "Output configuration" --> "Advanced options" --> select "Add new columns only" and enable "Update all new and existing partitions with metadata from the table" --> Save the changes. By applying this configuration, the crawler will inherit existing table parameters from your exiting table and will not reset/override them.

Ref: https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-configure-changes-console

AWS
Ethan_H
answered 5 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions