AWS Glue Studio node takes a long time to run data preview


Hi, I am using AWS Glue Studio to read from a DDB table over a direct DDB connection. So far my visual diagram has two nodes:

  1. Source DDB table node -> Here the preview takes 5 minutes for a dataset of only 2 rows, but it does at least show a result
  2. Transform - SelectFields -> Here the session runs for a long time (>20 minutes) and then fails with a 'session not ready' error

My DDB table is 691 bytes, with provisioned capacity of 5 RCU and 5 WCU. The Glue job details have the following config:

  1. Glue version -> 4.0
  2. Language -> Python 3
  3. Worker type -> G.1X (automatic scaling of the number of workers is enabled)
  4. Max number of workers -> 11
  5. Job timeout -> 2880

Considering this is such a small dataset, can you please let me know why it takes so long to run, or where to look for related insights? I am hoping to use this as part of my production data pipeline, which will transform and move data to Redshift for DW purposes. Unfortunately, there isn't much information available on Glue Studio.

1 Answer

First of all, I would suggest using on-demand mode in DynamoDB, at least until you get it working correctly. When the table has 5 RCU, Glue takes that number as a limit and rate-limits its requests so as not to exceed it. But I suspect you may have other issues.
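If you stay on provisioned capacity, the share of RCU that Glue is allowed to consume can be set through the DynamoDB connection options in a script-based job. The snippet below is only a minimal sketch: `ddb_read_options` is a hypothetical helper and `my_table` is a placeholder name, while the option keys are the ones the Glue DynamoDB connector documents (read percent defaults to 0.5, with 0.1 to 1.5 accepted). Whether the Glue Studio visual preview honors these settings is an assumption I have not verified.

```python
def ddb_read_options(table_name: str, read_percent: float = 0.5, splits: int = 1) -> dict:
    """Build connection options for Glue's DynamoDB reader (hypothetical helper)."""
    # Glue documents 0.1 to 1.5 as the accepted range for the read percentage.
    if not 0.1 <= read_percent <= 1.5:
        raise ValueError("read_percent must be between 0.1 and 1.5")
    if splits < 1:
        raise ValueError("splits must be at least 1")
    # Glue expects the option values as strings.
    return {
        "dynamodb.input.tableName": table_name,
        "dynamodb.throughput.read.percent": str(read_percent),
        "dynamodb.splits": str(splits),
    }


# "my_table" is a placeholder; substitute your own table name.
options = ddb_read_options("my_table", read_percent=1.0)

# Inside a Glue job script, the options would be passed along these lines:
# dyf = glueContext.create_dynamic_frame.from_options(
#     connection_type="dynamodb",
#     connection_options=options,
# )
```

For a table of 691 bytes, raising the read percentage is unlikely to account for a multi-minute preview on its own, so treat this as one thing to rule out rather than a fix.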

Moreover, DynamoDB is releasing zero-ETL integration with Redshift, which is now in private preview, so it may be advisable not to spend too much time reinventing the wheel.

AWS
answered 3 months ago
  • Hi Leeroy, thanks for the prompt response and for pointing me to the zero-ETL with Redshift blog. While we wait for our account to be allow-listed for the preview, can you please let me know what other parts of the config I should look at to speed up the preview of the sample dataset? I have changed the DDB table to on-demand mode, but it hasn't really sped things up yet.
