Issue with Redshift Spectrum/Glue Crawler - Unintended Column Splitting on strings

Question

Hey,

We have a Glue crawler that crawls a series of CSVs in S3 and captures them in a database. This is surfaced in Redshift via a Spectrum schema.

The problem is that in Redshift the quote character is not being respected: commas inside quoted values are treated as delimiters, causing unintended splits in a column of text:
![Enter image description here](/media/postImages/original/IMlU_8APx8SzKqGNKvZVn4Ww)

So far, we have attempted to add a classifier to the crawler as follows:
![Enter image description here](/media/postImages/original/IMow09095NSC21bPyy0AgF7w)

This did not resolve the problem. The table property here is currently quoteclassifer true because I set it manually, but each time the crawler runs it overrides the value back to false.
![Enter image description here](/media/postImages/original/IM7QGoZUeZTKO7O27uCDJ1Bg)

We also tried changing the data types from string to varchar(1000) in the JSON schema, but this did not work either.

This is not an issue when you open the CSV in Excel/Notepad/VSCode.
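
To illustrate the symptom, here is a minimal sketch of the difference between a quote-aware CSV parse (what Excel and similar tools do) and a naive split on every comma (which matches the behaviour we see in Redshift). The sample row and the function names are made up for illustration and are not taken from our actual data.

```python
import csv
import io

# Hypothetical sample data: the second field contains a comma inside quotes.
sample = 'id,comment\n1,"Hello, world"\n'


def naive_split(text):
    """Split every line on commas, ignoring quote characters."""
    return [line.split(",") for line in text.splitlines()]


def quote_aware_split(text):
    """Parse with the csv module, which respects the quote character."""
    return list(csv.reader(io.StringIO(text), delimiter=",", quotechar='"'))


naive_rows = naive_split(sample)
quoted_rows = quote_aware_split(sample)

print(len(naive_rows[1]), naive_rows[1])    # the quoted text is broken into two fields
print(len(quoted_rows[1]), quoted_rows[1])  # the quoted text stays in one field
```

The naive version yields three fields for the data row, while the quote-aware version yields the expected two.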

Answer

Greetings from AWS! I understand that the Glue crawler resets the "areColumnsQuoted" parameter each time it runs, which prevents Redshift Spectrum from splitting your CSV data correctly. To fix this, after manually changing the table parameter on your Redshift Spectrum table, try editing your Glue crawler: go to "Set output and scheduling" --> "Output configuration" --> "Advanced options", select "Add new columns only", enable "Update all new and existing partitions with metadata from the table", and save the changes. With this configuration, the crawler inherits the existing table parameters from your existing table and does not reset or override them.

Ref: https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-configure-changes-console
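
If you prefer to apply the same settings through the API rather than the console, the sketch below shows one way it might look with boto3. It assumes that the two console options correspond to the `MergeNewColumns` and `InheritFromTable` values of the crawler `Configuration` JSON described on the page linked above. The function names and the crawler name are hypothetical, so please verify against your own setup before relying on it.

```python
import json


def build_crawler_configuration():
    """Build the crawler Configuration JSON string.

    Assumption: "Add new columns only" maps to MergeNewColumns, and
    "Update all new and existing partitions with metadata from the table"
    maps to InheritFromTable.
    """
    config = {
        "Version": 1.0,
        "CrawlerOutput": {
            "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"},
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
        },
    }
    return json.dumps(config)


def apply_to_crawler(crawler_name):
    """Apply the configuration to an existing crawler (needs AWS credentials)."""
    import boto3  # imported here so the JSON can be built without boto3 installed

    glue = boto3.client("glue")
    glue.update_crawler(
        Name=crawler_name,
        Configuration=build_crawler_configuration(),
    )


print(build_crawler_configuration())
# apply_to_crawler("my-csv-crawler")  # hypothetical crawler name
```

As with the console steps, set the table parameter manually first, then apply the configuration so that later crawler runs keep it.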
