Issue with Redshift Spectrum/Glue Crawler - Unintended Column Splitting on strings

0

Hey,

We have a Glue crawler crawling a series of CSVs in S3 and capturing this in a database. This is surfaced in Redshift via Spectrum Schema.

The problem we have is that in Redshift, the delimiter/quote is not being respected for commas within quotes causing unintended splits in a column of text: Enter image description here

So far, we have attempted to add a classifier to the crawler as such Enter image description here

which did not resolve the problem, the table property here is currently quoteclassifer true as I manually set it, but, the crawler runs and overrides this value back to false. Enter image description here

We also tried changing the datatypes from string to varchar(1000) in the JSON Schema but this did not seem to work either.

This is not an issue when you open the CSV in Excel/Notepad/VSCode.

OmniFlo
질문됨 5달 전167회 조회
1개 답변
0

Greetings from AWS! I understand that Glue crawler reset "areColumnsQuoted" parameter each time it runs, such caused Redshift Spectrum cannot split your csv data correctly. To fix this issue, after manually changed the table parameter in your Redshift Spectrum table, you can try editing your glue crawler --> "Set output and scheduling" --> "Output configuration" --> "Advanced options" --> select "Add new columns only" and enable "Update all new and existing partitions with metadata from the table" --> Save the changes. By applying this configuration, the crawler will inherit existing table parameters from your exiting table and will not reset/override them.

Ref: https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-configure-changes-console

AWS
Ethan_H
답변함 5달 전

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠