Issue with Redshift Spectrum/Glue Crawler - Unintended Column Splitting on strings

0

Hey,

We have a Glue crawler crawling a series of CSVs in S3 and capturing this in a database. This is surfaced in Redshift via Spectrum Schema.

The problem we have is that in Redshift, the delimiter/quote is not being respected for commas within quotes causing unintended splits in a column of text: Enter image description here

So far, we have attempted to add a classifier to the crawler as such Enter image description here

which did not resolve the problem, the table property here is currently quoteclassifer true as I manually set it, but, the crawler runs and overrides this value back to false. Enter image description here

We also tried changing the datatypes from string to varchar(1000) in the JSON Schema but this did not seem to work either.

This is not an issue when you open the CSV in Excel/Notepad/VSCode.

OmniFlo
已提问 5 个月前167 查看次数
1 回答
0

Greetings from AWS! I understand that Glue crawler reset "areColumnsQuoted" parameter each time it runs, such caused Redshift Spectrum cannot split your csv data correctly. To fix this issue, after manually changed the table parameter in your Redshift Spectrum table, you can try editing your glue crawler --> "Set output and scheduling" --> "Output configuration" --> "Advanced options" --> select "Add new columns only" and enable "Update all new and existing partitions with metadata from the table" --> Save the changes. By applying this configuration, the crawler will inherit existing table parameters from your exiting table and will not reset/override them.

Ref: https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-configure-changes-console

AWS
Ethan_H
已回答 5 个月前

您未登录。 登录 发布回答。

一个好的回答可以清楚地解答问题和提供建设性反馈,并能促进提问者的职业发展。

回答问题的准则

相关内容