Skip to content

Not able to crawl multiple csvs with same specs to exclude double quotes within each field

0

I have 4 csvs that have same columns and I am able to crawl them as 1 data table. the issue I am facing is even after adding areColumnsQuoted = true I am seeing each column value enclosed with double quotes. I have tried adding a classifier and trying all types like lazy and open csv serde options and keeping quote symbol as double quotes also. For some reason I am still seeing double quotes within each data table value like : taking country as a column name and US as value within that column for a row. correct country = US , what I am getting country = "US"

these are the table properties

asked a year ago294 views
1 Answer
2
Accepted Answer
  1. Check Classifier Settings: Ensure that areColumnsQuoted = true is correctly set.
  2. Use OpenCSV SerDe: Explicitly define the quote character in the SerDe settings if you are using a custom table definition.
  3. ETL Post-processing: Consider stripping quotes within your Glue ETL job if they persist.
  4. Preprocess CSVs: As a last resort, preprocess the CSVs to remove quotes before crawling.
AWS
EXPERT
answered a year ago
EXPERT
reviewed a year ago
  • Update:It worked for me.

    How did this work ? Referred this resource: https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html#classifier-built-in

    content: The built-in CSV classifier creates tables referencing the LazySimpleSerDe as the serialization library, which is a good choice for type inference. However, if the CSV data contains quoted strings, edit the table definition and change the SerDe library to OpenCSVSerDe. Adjust any inferred types to STRING, set the SchemaChangePolicy to LOG, and set the partitions output configuration to InheritFromTable for future crawler runs. For more information about SerDe libraries, see SerDe Reference in the Amazon Athena User Guide.

    I had to add the right type of csv verde to be used even if I dont want to add a specific classifier for this crawler. the right one was : org.apache.hadoop.hive.serde2.OpenCSVSerde

    -^this removes the double quotes I was getting earlier. Serde serialization lib: org.apache.hadoop.hive.serde2.OpenCSVSerde

    and it worked.....

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.