AWS Glue Crawler not scanning all the S3 buckets with partitions

0

Hello Team, is there a limit to the number of tables which can be scanned using the Glue Crawler? I have a crawler which scans S3 buckets from a single source for data from January 2021 until December 2022. I have partitions for year and month. The crawler is not updating the data for November and December 2022. I am using this data to query in Athena and eventually in QuickSight. Can anyone suggest what could be wrong?

2 Answers
1

There is no specific limit to the number of tables that can be scanned by a Glue Crawler. However, there are a few things that could be causing your Crawler to not update the data for November and December 2022:

The Crawler's schedule: The Crawler might not be running as frequently as you expect. Verify that the Crawler schedule is set correctly.

S3 bucket permissions: The Crawler needs permissions to access the S3 bucket where the data is stored. Verify that the IAM role associated with the Crawler has the necessary permissions to access the S3 bucket.

Partitioning: Verify that the partitioning scheme you have set up is correct, and that the Crawler is looking for the partitions in the correct location.

Data format: Make sure that the data in the S3 bucket is in a format that the Crawler can understand.

Data size: The Crawler has a maximum amount of data it can crawl. If the data size is too large, the Crawler might not be able to process it all.

Glue Crawler's configuration: You can check the Crawler's properties and see if there are any configurations that need to be changed.

Athena Partitions: Verify that the partitions are visible on Athena and that the data is visible on Athena.
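To make the partition checks above concrete, here is a minimal sketch that lists which year/month combinations are missing from the Glue Data Catalog. It assumes the table's first two partition keys are year and month, that both are numeric, and that `glue_client` is a boto3 Glue client; the database and table names in the usage comment are placeholders.

```python
from itertools import product


def missing_partitions(glue_client, database, table, years, months):
    """Return the (year, month) pairs that have no partition in the catalog."""
    found = set()
    paginator = glue_client.get_paginator("get_partitions")
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        for partition in page["Partitions"]:
            # Values follow the order of the table's partition keys.
            # Assumption: year and month are the first two keys.
            year, month = partition["Values"][:2]
            found.add((str(year), str(month).zfill(2)))
    expected = {(str(y), str(m).zfill(2)) for y, m in product(years, months)}
    return sorted(expected - found)


# Usage (requires boto3 and AWS credentials; names are placeholders):
# import boto3
# glue = boto3.client("glue")
# print(missing_partitions(glue, "my_database", "my_table", [2021, 2022], range(1, 13)))
```

If November and December 2022 show up in the result, the crawler never registered those partitions, which narrows the problem down to the crawl itself rather than Athena or QuickSight.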

You can check the Glue Crawler's logs in CloudWatch to get more information about any error. If the problem persists, you might want to try creating a new Crawler, or refer to the Glue Crawler documentation or AWS Support for further assistance.

answered a year ago
  • I just noticed that my team added a new partition starting in November 2022. Do you know if there is a way I can detect the tables under a new partition using the same crawler? If yes, what should the configuration be? For example, until October 2022 I had partitions for Year and Month (the structure was /Year/Month/.csv), and now we have an added partition (the current structure is /Year/Month/group/.csv). How can I accommodate this change?

0

Can you please check when the crawler last ran? My guess is that it last ran in October.

Run the crawler once more and check. The latest months' partitions should get created if the previous ones worked fine. If you need to run it every month, schedule it accordingly.
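As a rough sketch of how to check the last run from code rather than the console, the helper below reads the crawler's schedule and last crawl details through the Glue `get_crawler` call. The crawler name in the usage comment is a placeholder, and `glue_client` is assumed to be a boto3 Glue client.

```python
def last_crawl_summary(glue_client, crawler_name):
    """Collect the fields that show when the crawler last ran and how it ended."""
    crawler = glue_client.get_crawler(Name=crawler_name)["Crawler"]
    last = crawler.get("LastCrawl", {})
    return {
        "state": crawler.get("State"),
        "schedule": crawler.get("Schedule", {}).get("ScheduleExpression"),
        "last_status": last.get("Status"),
        "last_start": last.get("StartTime"),
        "error": last.get("ErrorMessage"),
        # Where to look in CloudWatch Logs for the details of that run.
        "log_group": last.get("LogGroup"),
        "log_stream": last.get("LogStream"),
    }


# Usage (requires boto3 and AWS credentials; the name is a placeholder):
# import boto3
# glue = boto3.client("glue")
# print(last_crawl_summary(glue, "my_crawler"))
# glue.start_crawler(Name="my_crawler")  # run it once more on demand
```

A missing schedule together with a last start time in October would confirm the guess above.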

answered a year ago
  • I observed that the schema was different starting in Nov '22, i.e. a new partition, group, was added inside month. Do you know if there is a way I can detect the tables under a new partition using the same crawler? If yes, what should the configuration be? For example, until October 2022 I had partitions for Year and Month (the structure was /Year/Month/.csv), and now we have an added partition (the current structure is /Year/Month/group/.csv). How can I accommodate this change?

  • "TableLevelConfiguration" should do the trick. Set it to 3 for the crawler and check whether that works. My hunch is that the crawler will then expect all the data to be at level 3. If that turns out to be the problem, move the existing data (through October) into a default group folder so that the crawler finds all the data at the same level. This can be done from the console or through a simple CLI call.

    Refer: https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html
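For reference, a minimal sketch of setting that option from Python. The JSON shape (`Version` plus `Grouping.TableLevelConfiguration`) follows the linked configuration page; the helper names and the crawler name are made up for illustration, and `glue_client` is assumed to be a boto3 Glue client. Check the linked page for how levels are counted in your bucket layout before settling on 3.

```python
import json


def table_level_configuration(level):
    """Build the crawler Configuration JSON string for a given table level."""
    if level < 1:
        raise ValueError("table level must be a positive integer")
    return json.dumps(
        {"Version": 1.0, "Grouping": {"TableLevelConfiguration": level}}
    )


def apply_table_level(glue_client, crawler_name, level):
    """Send the configuration to an existing crawler via update_crawler."""
    # Assumption: the crawler has no other custom Configuration settings that
    # need to be preserved, since this passes a whole new configuration string.
    glue_client.update_crawler(
        Name=crawler_name,
        Configuration=table_level_configuration(level),
    )


# Usage (requires boto3 and AWS credentials; the name is a placeholder):
# import boto3
# apply_table_level(boto3.client("glue"), "my_crawler", 3)
```

After the update, rerun the crawler and check whether the November and December partitions appear.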

    Alternatively:

    Add a second crawler with level 3 and have it catalog the data into a new table. You can then create a view in Athena over the two tables (Table 1: data through October, Table 2: data from November) and use it in QuickSight.
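A possible shape for that view, sketched as a small Python helper that builds the Athena statement. Everything named here is hypothetical: the view, table, and column names are placeholders, and the partition columns are assumed to be `partition_0`, `partition_1`, `partition_2`, which is how the crawler names them when the S3 folders are not in `key=value` form. The older table has no group partition, so the view fills in a constant for it.

```python
def union_view_sql(view, old_table, new_table, columns, default_group="default"):
    """Build a CREATE VIEW statement that unions the two cataloged tables."""
    cols = ", ".join(f'"{c}"' for c in columns)
    return (
        f'CREATE OR REPLACE VIEW "{view}" AS\n'
        # Old layout: year and month only, so supply a constant group value.
        f"SELECT {cols}, partition_0, partition_1, '{default_group}' AS partition_2\n"
        f'FROM "{old_table}"\n'
        "UNION ALL\n"
        # New layout: year, month, and group partitions.
        f"SELECT {cols}, partition_0, partition_1, partition_2\n"
        f'FROM "{new_table}"'
    )


# Usage: print the statement and run it in the Athena query editor.
# print(union_view_sql("all_data", "data_until_oct", "data_from_nov", ["id", "amount"]))
```

Both SELECTs must return the same columns in the same order with compatible types, so adjust the column list to whatever the two crawlers actually inferred.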
