Optimizing AWS Glue Job for Faster S3 Writes of Large Datasets


I have an AWS Glue job that transfers data from a PostgreSQL database to Amazon S3. The job functioned efficiently until the size of the data increased. Now, when attempting to save approximately 2-3 GB of data to S3, it takes over an hour, which is significantly slower than desired.

The bottleneck appears to be in the following section of the code, specifically during the write operation to S3:

glueContext.write_dynamic_frame.from_options(
    frame=frame.repartition(1),
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": s3_path},
    format_options={"compression": "snappy"},
)

I've isolated the problem to this block through print-based debugging, where timestamps indicated a significant delay during the execution of these lines.

I'm looking for suggestions on how to optimize this part of the code to reduce the time taken to write large datasets (2-3 GB) to S3. Are there best practices or specific strategies I can employ to enhance the efficiency of this data transfer process in AWS Glue?

Any advice or insights would be greatly appreciated, especially from those who have tackled similar challenges with AWS Glue and large data transfers to S3.

itamar
asked 3 months ago · 398 views
2 Answers

Bear in mind that you are creating the bottleneck yourself with repartition(1), which forces a single core to do all the writing.
Still, that doesn't justify an hour to write 2-3 GB of Parquet; most likely the read from PostgreSQL is slow, or some transformation in the middle, which you are probably also running on a single thread.
To parallelize everything from the read through to the write, have a look at this: https://docs.aws.amazon.com/glue/latest/dg/run-jdbc-parallel-read-job.html (the alternative is to use the Spark API and set partitionColumn, lowerBound, and upperBound).
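For example, a minimal sketch of a parallelized read; the column name "id", the partition counts, the table name, and the connection details are placeholders, not values from the original job:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Option A: Glue DynamicFrame reader, splitting the JDBC read across 8 parallel
# connections hashed on an indexed numeric column ("id" is an assumed column name).
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="pg-to-s3",
    table_name="stats_prod_mytable",          # placeholder table name
    additional_options={
        "hashfield": "id",
        "hashpartitions": "8",
    },
)

# Option B: the same idea with the plain Spark JDBC reader.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://<host>:5432/<db>")   # placeholder endpoint
      .option("dbtable", "stats_prod_mytable")
      .option("user", "<user>")
      .option("password", "<password>")
      .option("partitionColumn", "id")        # numeric/date column, ideally indexed
      .option("lowerBound", "1")
      .option("upperBound", "1000000")        # approximate min/max of that column
      .option("numPartitions", "8")
      .load())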

AWS
EXPERT
answered 3 months ago

Following the advice from a previous response, I've adjusted my AWS Glue job code to include a partition predicate when creating a dynamic frame from the catalog. Here's the updated code segment:

dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="pg-to-s3",
    table_name=f"stats_prod_{table_name}",
    additional_options={"partitionPredicate": f"'{partition_name}'"},
)

After incorporating this change, I proceed with writing the dynamic frame to S3 as follows:

frame = frame.repartition(1)
glueContext.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": s3_path},
    format_options={"compression": "snappy"},
)

This write operation seems to be the most time-consuming part of my job. I'm trying to understand whether this points to an inefficiency in the SQL read performed by create_dynamic_frame.from_catalog, or whether the time is mostly spent on the write to S3 itself. Furthermore, my job is configured with these settings:

Worker type: G.1X (4 vCPU, 16 GB RAM per worker)
Maximum workers: 6

Considering these details, does it make sense to adjust the repartitioning in the write operation to frame=frame.repartition(6), matching the maximum number of workers? I'm looking for advice on how to optimize this AWS Glue job, focusing on whether there is a significant benefit to aligning the repartitioning strategy with the worker configuration, or whether there are other optimization strategies I should consider to reduce execution time.
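One rough way to tell the read cost from the write cost (not from the original job, just a sketch reusing frame, glueContext, and s3_path from the snippet above): Spark evaluates lazily, so the JDBC read only runs when an action fires. Caching the data and forcing it with a count before the write lets you time the two phases separately:

import time
from awsglue.dynamicframe import DynamicFrame

df = frame.toDF().cache()   # cache so the write below doesn't re-run the JDBC read
t0 = time.time()
rows = df.count()           # action: triggers the read plus any transformations
print(f"read/transform: {time.time() - t0:.0f}s for {rows} rows")

t1 = time.time()
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(df, glueContext, "cached_frame"),
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": s3_path},
    format_options={"compression": "snappy"},
)
print(f"write to S3: {time.time() - t1:.0f}s")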

itamar
answered 2 months ago
  • to optimize the compute you want to have as many partitions as total executor cores, assuming you are OK with that number of output files; for example:
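A minimal sketch of that idea, deriving the partition count from the cluster instead of hard-coding it (defaultParallelism roughly reflects the total executor cores available to the job; with 6 G.1X workers, one of which hosts the driver, that comes out to about 20):

# Repartition to the total executor core count; this also becomes the number
# of output files written to S3.
num_partitions = glueContext.spark_session.sparkContext.defaultParallelism
frame = frame.repartition(num_partitions)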
