AWS Glue Crawler Scalability for a Large Number of Delta Tables

Question: We currently have approximately 100 tables in Delta format, partitioned by year, month, day, hour, and minute (yyyy, mm, dd, hh, mm). Our current process reads these Delta tables with a crawler, catalogs them, and uses Redshift Spectrum tables to build business logic.

However, we are running into a scalability limit: each crawler accepts a maximum of 10 tables. As we keep adding tables, creating and managing additional crawlers becomes cumbersome. The data volume on some of these tables is also substantial, at up to 500k records per hour.

Given these constraints, what would be the best approach to reading the Delta tables in parallel via the crawler? Can the crawler be configured to use an RDS database for improved scalability? Any insights or best practices would be appreciated.

Hamzah Chaudhry EXPERT
a month ago
Could you share how you're creating these Delta tables? Where is the source data coming from for these tables?
pkgp-aws
a month ago
We create the Delta tables with Glue ETL jobs. The source data comes from an API.
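
One way to make the 10-tables-per-crawler limit described in the question less cumbersome is to generate the crawlers from a single list of table paths instead of maintaining them by hand. The following is a minimal sketch, not a confirmed solution from this thread: the limit of 10 is taken from the question and should be checked against current Glue quotas, and the bucket paths, role ARN, database name, and crawler name prefix are hypothetical placeholders. It assumes manifest-based tables (`WriteManifest`) are wanted for Spectrum, which may not match your setup.

```python
from typing import Dict, List

# Limit as stated in the question; verify against the current AWS Glue quotas.
MAX_DELTA_TABLES_PER_CRAWLER = 10


def chunk(paths: List[str], size: int) -> List[List[str]]:
    """Split a list of paths into consecutive batches of at most `size` items."""
    return [paths[i:i + size] for i in range(0, len(paths), size)]


def build_crawler_definitions(
    table_paths: List[str],
    role_arn: str,
    database: str,
    prefix: str = "delta-crawler",  # hypothetical naming scheme
) -> List[Dict]:
    """Build one create_crawler argument dict per batch of Delta table paths."""
    definitions = []
    batches = chunk(sorted(table_paths), MAX_DELTA_TABLES_PER_CRAWLER)
    for index, batch in enumerate(batches):
        definitions.append({
            "Name": f"{prefix}-{index:03d}",
            "Role": role_arn,
            "DatabaseName": database,
            "Targets": {
                "DeltaTargets": [{
                    "DeltaTables": batch,
                    # Assumption: manifest files are wanted for Spectrum.
                    "WriteManifest": True,
                }]
            },
        })
    return definitions


def create_crawlers(definitions: List[Dict]) -> None:
    """Create the crawlers in Glue (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the planning functions work without it

    glue = boto3.client("glue")
    for definition in definitions:
        glue.create_crawler(**definition)


if __name__ == "__main__":
    # Hypothetical S3 locations for 25 Delta tables.
    paths = [f"s3://example-bucket/delta/table_{n:03d}/" for n in range(25)]
    plan = build_crawler_definitions(
        paths,
        role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
        database="example_db",
    )
    for definition in plan:
        tables = definition["Targets"]["DeltaTargets"][0]["DeltaTables"]
        print(definition["Name"], len(tables))
```

Keeping the planning step (`build_crawler_definitions`) separate from the API call (`create_crawlers`) lets you review the batches before anything is created. The resulting crawlers can then be started independently so that they run in parallel, subject to the account's concurrency limits.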
