After resolving my first issue with getting a resource data sync set up, I've now run into another issue with the same folder.
When a resource data sync is created, it creates 13 top-level folders, each following a structure like:
s3://resource-data-sync-bucket/AWS:*/accountid=*/regions=*/resourcetype=*/instance.json
When running the Glue crawler over this, a schema is created in which each subpath containing an = becomes a partition.
This works fine for most of the data, except for the path starting with AWS:InstanceInformation. The instance information JSON files also contain a "ResourceType" field (which becomes "resourcetype" in the crawled schema), as can be seen here:
{"PlatformName":"Microsoft Windows Server 2019 Datacenter","PlatformVersion":"10.0.17763","AgentType":"amazon-ssm-agent","AgentVersion":"3.1.1260.0","InstanceId":"i","InstanceStatus":"Active","ComputerName":"computer.name","IpAddress":"10.0.0.0","ResourceType":"EC2Instance","PlatformType":"Windows","resourceId":"i-0a6dfb4f042d465b2","captureTime":"2022-04-22T19:27:27Z","schemaVersion":"1.0"}
As a result, there are now two "resourcetype" columns in the "aws_instanceinformation" table schema: one from the partition path and one from the JSON field. Attempts to query that table fail with the error:
HIVE_INVALID_METADATA: Hive metadata for table is invalid: Table descriptor contains duplicate columns
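To make the collision concrete, here is a minimal sketch that compares the partition keys implied by an object path with the field names in a record. It assumes column names are compared case-insensitively (lowercased), which matches the duplicate seen in the table schema. The function names are made up for illustration, and the account ID, region, and resourcetype values in the sample path are placeholders.

```python
import json


def partition_keys_from_path(s3_key: str) -> list[str]:
    """Return the lowercased key names of Hive-style 'name=value' path segments."""
    return [seg.split("=", 1)[0].lower() for seg in s3_key.split("/") if "=" in seg]


def colliding_columns(s3_key: str, record_json: str) -> set[str]:
    """Return field names that clash with partition keys once lowercased."""
    fields = {name.lower() for name in json.loads(record_json)}
    return fields & set(partition_keys_from_path(s3_key))


# Placeholder path values; only the 'name=' parts matter for the comparison.
key = (
    "AWS:InstanceInformation/accountid=111122223333/"
    "regions=us-east-1/resourcetype=placeholder/instance.json"
)
record = '{"InstanceId": "i", "ResourceType": "EC2Instance", "schemaVersion": "1.0"}'

print(colliding_columns(key, record))  # {'resourcetype'}
```

Running the same check against a sample file from each of the other folders should show whether AWS:InstanceInformation is the only one affected.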
I've worked around this issue by removing the offending field from the table schema and setting the crawler to ignore schema updates, but this doesn't seem like a great long-term solution, since any changes AWS makes to the schema will be ignored.
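For anyone wanting to script the same workaround, this is a rough sketch using boto3, assuming the duplicate was removed from the table's column list (keeping the partition key) and the crawler was set to log schema changes instead of applying them. The helper names and the database, table, and crawler arguments are hypothetical. Only a subset of the fields accepted by TableInput is copied over.

```python
import copy

# Subset of the keys that glue.update_table accepts in TableInput.
TABLE_INPUT_KEYS = (
    "Name", "Description", "Owner", "Retention",
    "StorageDescriptor", "PartitionKeys", "TableType", "Parameters",
)


def build_deduplicated_table_input(table: dict) -> dict:
    """Copy a Glue table definition, dropping columns that duplicate a partition key."""
    table_input = {k: copy.deepcopy(table[k]) for k in TABLE_INPUT_KEYS if k in table}
    partition_names = {p["Name"].lower() for p in table_input.get("PartitionKeys", [])}
    if "StorageDescriptor" in table_input:
        sd = table_input["StorageDescriptor"]
        sd["Columns"] = [
            c for c in sd.get("Columns", []) if c["Name"].lower() not in partition_names
        ]
    return table_input


def apply_workaround(database: str, table_name: str, crawler_name: str) -> None:
    """Rewrite the table without the duplicate column and stop crawler schema updates."""
    import boto3  # third-party; imported here so the helper above stays dependency-free

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(
        DatabaseName=database,
        TableInput=build_deduplicated_table_input(table),
    )
    # "LOG" records schema changes without modifying the table.
    glue.update_crawler(
        Name=crawler_name,
        SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    )
```

This has the same drawback described above: with the update behavior set to LOG, new fields added by AWS will not show up in the table until it is edited by hand.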
Is this a known issue with using this solution? Are there any plans to change how the AWS:InstanceInformation documents are structured so that duplicate columns aren't created?