Glue Studio visual job for processing a large DynamoDB table


Use Case: Build a data warehouse in Redshift using DynamoDB data.

Background: I am creating 6 different Redshift tables from a single DynamoDB table using Glue visual ETL. There are currently 6 different 'Custom Transform' nodes that perform all the transformations, followed by a few other nodes ('Change Schema', etc.) in each workflow, and finally 6 Redshift target nodes, one per transformation, which complete the workflow.

Issue: Everything was working well with the G.1X worker type on the test dataset (which is, of course, smaller than the prod dataset), but the moment the entire workflow was pointed at the prod dataset (a 2 GB DynamoDB table) things fell apart. I am currently trying to figure out a job config that works for me, but with the config below the job run takes ~17 hours just to FAIL.

Glue version: 4.0
Language: Python 3
Worker type: G.8X
Max number of workers: 5

Some advanced parameters set based on research:

    spark.driver.memory=20g
    spark.executor.memory=40g
    spark.executor.cores=8
    spark.executor.instances=2
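(For reference, these overrides are typically supplied as a Glue job parameter rather than set inside the script. One commonly used, though unofficial, pattern is a single --conf job parameter whose value chains the settings; the values below are simply carried over from above:)

    Key:   --conf
    Value: spark.driver.memory=20g --conf spark.executor.memory=40g --conf spark.executor.cores=8 --conf spark.executor.instances=2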

In each custom transform, the first step is to convert the DynamicFrame to a DataFrame. I assume this happens 6 times, and that is where the bottleneck starts, leading to memory-intensive operations and long job run times. On top of that, many transformations are involved, which also contributes to the heavy memory use.

Ask/help from AWS re:Post community & users:

  1. Is there a way I can create the DataFrame from the DynamicFrame just once, maybe cache it, and use it for all the independent custom transforms? (See the sketch after this list.) Since the Glue custom transform node accepts a DynamicFrameCollection and returns a DynamicFrameCollection as well, I am not sure whether the DataFrame can be cached.

  2. Can you please suggest any optimizations for the pipeline? The Spark UI starts giving errors half an hour into the job run, so I am unable to monitor anything from that end. The CPU load and memory profile metrics stay below 50% usage (at least with the above-mentioned config).
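(Sketch for question 1: in a script-based job, or by editing the script Glue Studio generates, the DynamicFrame-to-DataFrame conversion can be done once and cached before fanning out to the independent transforms. This is only a rough illustration; the table name, read options, and transform function names are placeholders, not the real job's names.)

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glueContext = GlueContext(SparkContext.getOrCreate())

    # Read the DynamoDB table once
    dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="dynamodb",
        connection_options={
            "dynamodb.input.tableName": "teacher",        # placeholder table name
            "dynamodb.throughput.read.percent": "0.5",
            "dynamodb.splits": "100",
        },
    )

    # Convert to a Spark DataFrame a single time and cache it
    df = dyf.toDF().cache()
    df.count()  # materialize the cache before fanning out

    # Each of the 6 transforms then reuses the cached DataFrame
    df_teacher_student = build_teacher_student(df)  # hypothetical transform function
    df_teacher_course = build_teacher_course(df)    # hypothetical transform function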

Why the COPY command is not an option: The nature of our source data is messy, so unfortunately I cannot take the Redshift COPY command approach. For example, in our source DDB data we have differently prefixed attributes for a given PK. Ideally such prefixed attributes should end up in one column when they are loaded into Redshift, and it is not feasible to hardcode those attributes when defining the Redshift target table. To illustrate with an analogy: in the DynamoDB 'teacher' table, a given teacher A (PK) has students S1, S2, S3, S4, etc. (many of them). When this is loaded into the Redshift table 'teacher_student', for teacher code A the student column should contain S1, S2, S3, S4 as values (in normalized form). It is not possible to hardcode S1, S2, S3, S4 as column names in the Redshift target.
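(To make the normalization concrete: once the DynamoDB items land in a DataFrame with one sparse column per prefixed attribute, the reshaping is essentially an unpivot. A rough sketch, assuming the prefixed columns follow a student_* naming pattern and the PK column is teacher_code; both are placeholder names, not the real schema:)

    from pyspark.sql import functions as F

    # Sparse columns such as student_S1, student_S2, ... created by the prefixed attributes
    student_cols = [c for c in df.columns if c.startswith("student_")]

    # Unpivot the sparse columns into (teacher_code, student) rows
    df_teacher_student = (
        df.select(
            "teacher_code",
            F.explode(F.array(*[F.col(c) for c in student_cols])).alias("student"),
        )
        .where(F.col("student").isNotNull())
    )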

1 Answer

The transformation between DynamicFrame and DataFrame is very expensive; why must you do it multiple times?

If you can't do your transformations directly on a DynamicFrame, then why read into one at all? If you are using a DataFrame to transform the data, I suggest reading into a DataFrame, doing your transformations, and writing to Redshift.
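A rough sketch of that flow for one of the six outputs, converting back to a DynamicFrame only at write time (the Glue connection name, database, table, and temp bucket are placeholders):

    from awsglue.dynamicframe import DynamicFrame

    # df_teacher_student is a transformed Spark DataFrame produced earlier in the script
    out_dyf = DynamicFrame.fromDF(df_teacher_student, glueContext, "teacher_student")

    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=out_dyf,
        catalog_connection="my-redshift-connection",               # placeholder connection
        connection_options={"dbtable": "teacher_student", "database": "dev"},
        redshift_tmp_dir="s3://my-temp-bucket/redshift-staging/",  # placeholder bucket
    )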

AWS EXPERT, answered 4 months ago
  • Thanks for the reply. How can I read directly into a DataFrame? The only way I know of to apply custom logic is to use the 'Custom Transform' node in visual ETL. This node takes a DynamicFrameCollection (dfc) by default, so there is no option but to get a DynamicFrame from the dfc and convert it to a DataFrame, then use that DataFrame for the transformations.

    def MyTransform6(glueContext, dfc) -> DynamicFrameCollection:
        # first step of every custom transform: pick the DynamicFrame out of
        # the collection and convert it to a Spark DataFrame
        df = dfc.select(list(dfc.keys())[0]).toDF()

    Is there any simpler way that I am missing to directly access a DataFrame?

  • Sorry, I'm not sure if that capability is available in visual ETL. I usually write my PySpark manually, where you can just use native Spark/pandas DataFrames.

  • Thanks, I'll try this approach. May I ask which connector/JAR file you use while developing this? Unfortunately I didn't find any solid information about this in my research, and I am hesitant to use the 'audienceproject' JAR file mentioned in many of the SO links.
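(For what it's worth: when the table is read through Glue's built-in DynamoDB source, as in the sketch under question 1 above, no external JAR such as the audienceproject connector should be needed on Glue 4.0. A minimal read looks roughly like this, with the table name as a placeholder:)

    dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="dynamodb",
        connection_options={"dynamodb.input.tableName": "teacher"},  # placeholder
    )
    df = dyf.toDF()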
