AWS Glue Crawler Scalability for Large Number of Delta Tables


Question: We currently have approximately 100 tables in Delta format, partitioned by year, month, day, hour, and minute (yyyy, mm, dd, hh, mm). Our current process reads these Delta tables via a Glue crawler, catalogs them, and uses Redshift Spectrum tables to build business logic.

However, we are running into scalability limits because of the maximum of 10 tables per crawler. As we keep adding tables, creating and maintaining additional crawlers becomes cumbersome. The data volume on some of these tables is also substantial, with up to 500k records per hour.

Given these constraints, what would be the best approach to read the Delta tables in parallel via the crawler? Can we configure the crawler to use an RDS database for improved scalability? Any insights or best practices would be appreciated.
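
To make the "10 tables per crawler" constraint concrete, here is a minimal sketch of how the crawlers could be generated from a list of table paths rather than maintained by hand. It is not a confirmed solution to the question, only an illustration. The limit of 10 is taken from the question, and the function names, crawler name prefix, role ARN, database name, and S3 paths are all hypothetical. The only AWS calls used are boto3's Glue `create_crawler` and `start_crawler`, with a `DeltaTargets` target.

```python
from typing import Any, Dict, List

# Limit as reported in the question; adjust if your account behaves differently.
MAX_TABLES_PER_CRAWLER = 10


def build_crawler_configs(
    table_paths: List[str],
    role_arn: str,
    database: str,
    name_prefix: str = "delta-crawler",  # hypothetical naming scheme
    batch_size: int = MAX_TABLES_PER_CRAWLER,
    write_manifest: bool = False,
) -> List[Dict[str, Any]]:
    """Split Delta table paths into batches and build one create_crawler
    argument dict per batch. Pure function: makes no AWS calls."""
    configs = []
    for start in range(0, len(table_paths), batch_size):
        batch = table_paths[start:start + batch_size]
        configs.append({
            "Name": f"{name_prefix}-{start // batch_size:03d}",
            "Role": role_arn,
            "DatabaseName": database,
            "Targets": {
                "DeltaTargets": [
                    {"DeltaTables": batch, "WriteManifest": write_manifest}
                ]
            },
        })
    return configs


def create_and_start(configs: List[Dict[str, Any]]) -> None:
    """Create and start each crawler. Requires boto3 and AWS credentials.
    create_crawler fails if a crawler with the same name already exists."""
    import boto3  # imported here so the batching logic runs without boto3

    glue = boto3.client("glue")
    for cfg in configs:
        glue.create_crawler(**cfg)
        glue.start_crawler(Name=cfg["Name"])
```

With roughly 100 tables this yields 10 crawler definitions from a single list, so adding a table means appending one path and re-running the script. The crawlers run independently once started, which gives parallelism across batches, though account-level limits on concurrent crawlers may still apply.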

  • Could you share how you're creating these Delta tables? Where is the source data coming from for these tables?

  • We are creating the Delta tables via Glue ETL. The source data comes from an API.

Asked a month ago · 356 views
No answers
