AWS Glue not properly crawling S3 bucket populated by "Resource Data Sync" -- specifically, "AWS:InstanceInformation" is not made into a table


I set up an S3 bucket that collects inventory data from multiple AWS accounts using the Systems Manager "Resource Data Sync" feature. I was able to set up the data syncs to feed into the single bucket without issue, and the Glue crawler was created automatically.

Now that I'm trying to query the data in Athena, I noticed a problem with how the crawler parses the data in the bucket. The folder "AWS:InstanceInformation" is not being turned into a table. Instead, the crawler is turning each of the "region=us-east-1/" and "test.json" sub-items into its own table, and those tables are not queryable.

To illustrate further, each of the following paths is being turned into its own table:

  • s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=12345679012/region=us-east-1
  • s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=12345679012/test.json
  • s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=23456790123/region=us-east-1
  • s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=23456790123/test.json
  • s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=34567901234/region=us-east-1
  • s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=34567901234/test.json

This is ONLY happening with the "AWS:InstanceInformation" folder. All of the other folders (e.g. "AWS:DetailedInstanceInformation") are being properly turned into tables.

Since all of this data was populated automatically, I'm assuming that we are dealing with a bug? Is there anything I can do to fix this?

1 Answer

After testing more, I've determined the issue is being caused by that "test.json" file which is being added at the time of Resource Sync creation.
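For anyone hitting the same thing, here is a minimal sketch of how the stray files could be located and removed. It assumes the key layout shown in the question (a "test.json" object sitting directly under an "accountid=.../" folder) and assumes that the file is only a placeholder that is safe to delete; the function names and the `dry_run` flag are hypothetical, and only standard boto3 S3 calls (`list_objects_v2`, `delete_object`) are used.

```python
import re

# Assumption: the stray objects are named "test.json" and sit directly under
# an accountid=<digits>/ folder, alongside the region=<name>/ partitions.
STRAY_KEY = re.compile(r"(^|/)accountid=\d+/test\.json$")


def find_stray_keys(keys):
    """Return the keys that look like placeholder test.json objects."""
    return [key for key in keys if STRAY_KEY.search(key)]


def delete_stray_objects(bucket, dry_run=True):
    """List the bucket, report stray test.json objects, and optionally delete them."""
    import boto3  # imported lazily so find_stray_keys works without boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    stray = find_stray_keys(keys)
    for key in stray:
        action = "would delete" if dry_run else "deleting"
        print(f"{action} s3://{bucket}/{key}")
        if not dry_run:
            s3.delete_object(Bucket=bucket, Key=key)
    return stray
```

Calling `delete_stray_objects("resource-data-sync-bucket")` with the default `dry_run=True` only prints what would be removed. After the files are gone, the wrongly created tables would presumably need to be dropped and the crawler re-run. If deleting objects is not an option, a Glue crawler exclude pattern on the S3 target may be another way to skip the file.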

answered 2 years ago
