My Amazon Athena query fails with the error "HIVE_INVALID_METADATA: Hive metadata for table sample_table is invalid: Table descriptor contains duplicate columns".
Short description
This error happens when the AWS Glue table schema contains duplicate column names or columns with the same name as the partition columns.
Resolution
Check for duplicate column names
To identify the duplicate column, run SHOW CREATE TABLE to retrieve the table schema. Or, view the table schema in the AWS Glue console. In the following example, there are two columns named "column1."
CREATE EXTERNAL TABLE `athenatestingduplicatecolumn_athenatesting`( `column1` bigint,
`column2` bigint,
`column3` bigint,
`column1` bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://doc-example-bucket/athenatesting/'
TBLPROPERTIES (
'classification'='csv',
'skip.header.line.count'='1'
)
To resolve the error, run CREATE TABLE to recreate the Athena table with unique column names. Or, use the AWS Glue console to rename the duplicate columns:
- Open the AWS Glue console.
- Choose the table name from the list, and then choose Edit schema.
- Choose the column name, enter a new name, and then choose Save.
Check for table columns and partition columns that have the same name
To check for duplicate partition column and table column names, view the table schema in the AWS Glue console. In the following example, the partition column and table column are both named "id".
Example data:
{ "id" : 50, "name":"John" }{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }
Table definition:
| | |
---|
Column name | Data type | Partition key |
id | int | |
name | string | |
id | string | Partition (0) |
Use one of the following options to resolve the issue:
- Rename the partition column in the Amazon Simple Storage Service (Amazon S3) path.
- Rename the column name in the data and in the AWS Glue table definition.
- If the source data is JSON, use the mapping function instead of AWS Glue crawler to manually recreate the table. Then, add partitions in Athena. In the following example, the AWS Glue table points to s3://doc-example-bucket/athenajsontesting/, which has this partition structure: s3://doc-example-bucket/athenajsontesting/id=12. After you create the table, use MSCK REPAIR TABLE to load the partition.
CREATE EXTERNAL TABLE `athenajsontest_athenajsontesting2`( `id1` int COMMENT 'from deserializer',
`name` string COMMENT 'from deserializer')
PARTITIONED BY (
`id` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.id1'='id')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://doc-example-bucket/athenajsontesting'
TBLPROPERTIES (
'classification'='json',
'transient_lastDdlTime'='1578399113')
Check the JSON source data for duplicate keys
If the JSON source data contains column names that differ only by case, for example, {"Username": "bob1234", "username": "bob" }, then see Why do I get errors when I try to read JSON data in Amazon Athena? Athena is case insensitive by default. Athena treats "Username" and "username" as duplicate keys, unless you use OpenX SerDe and set the case.insensitive property to false.
Check the Parquet source data for duplicate keys
If the data source is in Parquet format and the table has a duplicate column, then change one of the column names. Set the parquet.column.index.access SerDe property to true. Parquet accesses columns by name and ORC by index.
For more information, see Handling schema updates.
Related information
JSON SerDe libraries