Hello,
This is a known issue with the Kafka S3 sink connector: when you choose an existing column in the record for partitioning via the partition.field.name property, the connector writes your files under S3 partition prefixes, but the same partition column also remains inside the output data files.
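For context, here is a minimal sketch of a sink connector configuration that can reproduce this behavior. The connector name, topic, bucket, Connect URL, and partition field ("region") are placeholders for illustration, not values from your setup:

```python
import json
import requests  # assumes a reachable Kafka Connect REST endpoint

# Hypothetical S3 sink connector config for illustration only.
connector = {
    "name": "my-s3-sink",  # placeholder connector name
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "orders",                     # placeholder topic
        "s3.bucket.name": "my-example-bucket",  # placeholder bucket
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
        "partitioner.class": "io.confluent.connect.storage.partitioner.FieldPartitioner",
        # The record field used to build the S3 path. The connector does
        # NOT drop it from the records, so "region" shows up both as an
        # S3 partition prefix and as a column inside every data file.
        "partition.field.name": "region",
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",  # assumed Connect worker URL
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```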
So, when a Glue crawler crawls the partitioned output S3 location, the resulting table ends up with duplicate columns (one from the partition path, one from the data files), which cannot be queried from Athena or Hive; Athena typically rejects such a table with a HIVE_INVALID_METADATA error about duplicate columns. Please refer to the below links for more info:
Known issues with the S3 sink connector:
- https://github.com/confluentinc/kafka-connect-hdfs/issues/221
- https://github.com/confluentinc/kafka-connect-hdfs/issues/238
- https://github.com/confluentinc/kafka-connect-storage-cloud/issues/387
Can you try the below steps on the AWS Glue crawler side?
- Delete the duplicate column from the Glue table:
AWS Glue console -> Data Catalog tables -> Choose your table -> Edit schema -> Delete the duplicate column from the schema (a programmatic alternative is sketched after these steps)
- Update your crawler properties so that the crawler no longer updates the table schema:
In the crawler's configuration options, choose "Ignore the change and don't update the table in the data catalog" for schema changes (the second sketch after these steps shows the equivalent API call)
In the above process we fix the table schema once and force the crawler not to overwrite the schema of your Glue Data Catalog table on subsequent runs. Please refer to this doc.
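If you prefer to fix the table programmatically instead of through the console, a minimal boto3 sketch of the first step might look like this; the database, table, and column names are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Placeholder names for illustration.
DATABASE, TABLE, DUP_COLUMN = "my_db", "my_table", "region"

table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]

# Remove the duplicate data column; the partition key of the same
# name remains under PartitionKeys, which is what Athena expects.
table["StorageDescriptor"]["Columns"] = [
    c for c in table["StorageDescriptor"]["Columns"]
    if c["Name"] != DUP_COLUMN
]

# update_table only accepts TableInput fields, so strip the read-only
# attributes that get_table returns before sending the schema back.
for key in ("DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
            "IsRegisteredWithLakeFormation", "CatalogId", "VersionId"):
    table.pop(key, None)

glue.update_table(DatabaseName=DATABASE, TableInput=table)
```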
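And for the second step, a sketch of the equivalent boto3 call; the crawler name is a placeholder, and UpdateBehavior=LOG is the API counterpart of the console option "Ignore the change and don't update the table in the data catalog":

```python
import json
import boto3

glue = boto3.client("glue")

glue.update_crawler(
    Name="my-crawler",  # placeholder crawler name
    # LOG records detected schema changes without applying them, so the
    # hand-fixed table schema is left untouched on future crawler runs.
    SchemaChangePolicy={
        "UpdateBehavior": "LOG",
        "DeleteBehavior": "LOG",
    },
    # Optionally have new partitions inherit the (fixed) table schema
    # instead of re-deriving their own from the data files.
    Configuration=json.dumps({
        "Version": 1.0,
        "CrawlerOutput": {
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
        },
    }),
)
```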