How can I determine an appropriate number of workers for Glue Data Quality Ruleset runs on large datasets?


I want to check data quality on multiple SQL tables, some of which have up to 180 million entries. The tables are loaded into Glue using Crawlers, which appears to work fine. Each table has a relatively complex ruleset to check, including some custom SQL. On one of the larger tables, I attempted to run the ruleset consisting of 19 rules with the default 5 workers, but had to quit the run after it ran for ~20 hours. I have since tried to find out how I can scale this work so that these runs are more efficient and ideally less cost-intensive.

It appears the only setting I can change is the number of workers. How can I find out how many workers I need? Neither CloudWatch nor the API (via boto3) showed me any resource-usage statistics for the stopped run or for any of the successful data quality runs on other tables.

strupp1
Asked 8 months ago, 157 views
1 Answer
Accepted Answer

There is no single right number, since it depends on the data, the rules, and how long you are willing to wait. Sometimes adding workers won't help at all if there is a bottleneck. In general, try to avoid custom SQL rules, since each one has to run independently and goes through the data on a separate read.
If you apply the same rules in a Glue job, you can use the Spark UI to inspect the execution and possibly find where the bottleneck is.
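
Since the right worker count has to be found empirically, one option is to time the same ruleset at a few different worker counts, ideally on a smaller table first. The following is only a rough sketch of that idea, not an official sizing method. It assumes a boto3 Glue client is passed in as `glue`, that the database, table, ruleset, and role names are placeholders you replace, and that cost scales roughly with workers multiplied by runtime (check current Glue pricing before relying on that).

```python
import time


def time_dq_run(glue, database, table, ruleset_name, role_arn, workers,
                poll_seconds=60):
    """Start one Data Quality evaluation run and return its timing once it ends.

    `glue` is expected to be a boto3 Glue client, e.g. boto3.client("glue").
    All names passed in are placeholders for your own resources.
    """
    run_id = glue.start_data_quality_ruleset_evaluation_run(
        DataSource={"GlueTable": {"DatabaseName": database, "TableName": table}},
        Role=role_arn,
        NumberOfWorkers=workers,
        RulesetNames=[ruleset_name],
    )["RunId"]

    # Poll until the run reaches a terminal state.
    while True:
        run = glue.get_data_quality_ruleset_evaluation_run(RunId=run_id)
        if run["Status"] in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
            break
        time.sleep(poll_seconds)

    seconds = run.get("ExecutionTime", 0)  # runtime in seconds, as reported by Glue
    return {
        "workers": workers,
        "status": run["Status"],
        "seconds": seconds,
        # Rough cost proxy (assumption): workers x hours.
        "worker_hours": workers * seconds / 3600,
    }


# Example usage (placeholders, not run here):
# import boto3
# glue = boto3.client("glue")
# for n in (5, 10, 20):
#     print(time_dq_run(glue, "my_db", "my_table", "my_ruleset", "my-role-arn", n))
```

If doubling the workers does not come close to halving `seconds`, that hints at a bottleneck that more workers will not fix, and `worker_hours` will rise accordingly.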

AWS
Expert
Answered 8 months ago
  • Thank you, this helps already. Unfortunately, we need a bunch of CustomSql rules because we have many columns that can be NULL or must have a specific format. Would you say as a rule of thumb, we should have one worker for each CustomSql rule and some extra for the other rules?

  • The general idea is that column rules are cheaper than table rules, so try to do that format check with a column rule. And no, you cannot directly relate the number of rules to the number of workers: the volume of data has a bigger impact, so the correlation is not linear.
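
To illustrate the suggestion of replacing CustomSql checks with column rules, here is a minimal sketch that generates a DQDL ruleset using the built-in `IsComplete` (no NULLs) and `ColumnValues ... matches` (regex format) rule types. The helper `build_ruleset`, the column names, and the regex are made up for illustration. How `ColumnValues` treats NULLs in columns that are allowed to be empty is something to verify against the DQDL documentation before relying on it.

```python
def build_ruleset(required_columns, format_patterns):
    """Build a DQDL ruleset string from column-level rules only.

    required_columns: column names that must not contain NULLs.
    format_patterns:  mapping of column name -> regex the values must match.
    """
    rules = [f'IsComplete "{col}"' for col in required_columns]
    rules += [
        f'ColumnValues "{col}" matches "{pattern}"'
        for col, pattern in format_patterns.items()
    ]
    return "Rules = [\n    " + ",\n    ".join(rules) + "\n]"


# Hypothetical columns: customer_id must be present, order_date must look like YYYY-MM-DD.
print(build_ruleset(
    ["customer_id"],
    {"order_date": "[0-9]{4}-[0-9]{2}-[0-9]{2}"},
))
```

The printed text can be pasted into the ruleset editor in place of the equivalent CustomSql rules.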
