Partitioning creates a duplicate column


Hello Team,

We are doing archival.

We are streaming data from Oracle to S3 via Kafka. We have a source connector (Debezium) and a sink connector (S3 Sink), and the data is stored in S3 using the field partitioner on a field in the Kafka record called template_name.

In AWS Glue, we created a crawler to create tables from the data stored in S3. The crawler creates the table with a partition column named 'template_name'. The table now has two columns named 'template_name', so we cannot query it and we get a duplicate column error.

asked 9 months ago (464 views)
1 Answer

Hello,

This is a known issue with the Kafka S3 sink connector. It occurs when you choose an existing field in the record for partitioning through the connector's partition.field.name property. The connector writes your files into S3 partitions named after that field, and the same field is also present as a column in the output data files.

So when a Glue crawler crawls the partitioned S3 output location, the resulting table has duplicate columns and cannot be queried from Athena or Hive. Please refer to the links below for more information.
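To make the collision concrete, here is a minimal Python sketch that compares the Hive-style partition names in an S3 object key with the field names in the data file. The object key, the field list, and both helper functions are hypothetical and only for illustration; they are not part of the connector or of Glue.

```python
from typing import List


def partition_keys_from_s3_key(key: str) -> List[str]:
    """Return the partition names (the text before '=') found in the
    folder segments of an S3 object key; the file name is ignored."""
    folders = key.split("/")[:-1]
    return [seg.split("=", 1)[0] for seg in folders if "=" in seg]


def colliding_columns(data_columns: List[str], key: str) -> List[str]:
    """Return the data columns whose names match a partition name
    (compared case-insensitively)."""
    partitions = {p.lower() for p in partition_keys_from_s3_key(key)}
    return [c for c in data_columns if c.lower() in partitions]


# Hypothetical object key and record fields, for illustration only.
key = "topics/orders/template_name=invoice/orders+0+0000000000.json"
columns = ["id", "template_name", "created_at"]

print(colliding_columns(columns, key))  # ['template_name']
```

Any name this check returns would appear twice in the crawled table: once as a regular column and once as a partition column.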

Known Issues with the S3 sink connector

https://github.com/confluentinc/kafka-connect-hdfs/issues/221

https://github.com/confluentinc/kafka-connect-hdfs/issues/238

https://github.com/confluentinc/kafka-connect-storage-cloud/issues/387

Can you try the steps below on the AWS Glue crawler side?

  1. Delete the duplicate column from the Glue table: AWS Glue console -> Data Catalog tables -> choose your table -> Edit Schema -> delete the duplicate column from the schema.

  2. Update your crawler properties as shown below.

[Image: crawler properties]

With the steps above, we fix the table schema and force the crawler not to update the schema of your Glue Data Catalog table. Please refer to this doc.
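If you prefer to script the two steps, the following Python sketch prepares the arguments for the boto3 Glue calls get_table, update_table and update_crawler. It rests on several assumptions: the column to drop is the regular column whose name matches a partition key, and the crawler settings in the screenshot correspond to a schema change policy of LOG plus partitions inheriting metadata from the table. The helper functions and the database, table and crawler names are hypothetical.

```python
import copy
import json

# Subset of the keys returned by get_table that update_table accepts in
# TableInput (assumption: these are enough for a crawler-created table).
TABLE_INPUT_FIELDS = (
    "Name", "Description", "Owner", "Retention",
    "StorageDescriptor", "PartitionKeys", "TableType", "Parameters",
)


def table_input_without_duplicates(table: dict) -> dict:
    """Build a TableInput in which regular columns that share a name
    with a partition key are removed (step 1)."""
    table_input = {
        k: copy.deepcopy(v) for k, v in table.items() if k in TABLE_INPUT_FIELDS
    }
    partition_names = {
        p["Name"].lower() for p in table_input.get("PartitionKeys", [])
    }
    descriptor = table_input["StorageDescriptor"]
    descriptor["Columns"] = [
        c for c in descriptor["Columns"]
        if c["Name"].lower() not in partition_names
    ]
    return table_input


def crawler_update_args(crawler_name: str) -> dict:
    """Arguments for update_crawler so that the crawler logs schema
    changes instead of applying them (step 2, assumed settings)."""
    return {
        "Name": crawler_name,
        "SchemaChangePolicy": {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
        "Configuration": json.dumps({
            "Version": 1.0,
            "CrawlerOutput": {
                "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
            },
        }),
    }


# Usage sketch (needs boto3 and AWS credentials; names are placeholders):
# import boto3
# glue = boto3.client("glue")
# table = glue.get_table(DatabaseName="my_db", Name="my_table")["Table"]
# glue.update_table(DatabaseName="my_db",
#                   TableInput=table_input_without_duplicates(table))
# glue.update_crawler(**crawler_update_args("my_crawler"))
```

The helpers only build dictionaries, so you can inspect the result before sending anything to Glue.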

AWS
SUPPORT ENGINEER
answered 9 months ago
