Optimizing AWS Glue Job for Faster S3 Writes of Large Datasets


I have an AWS Glue job that transfers data from a PostgreSQL database to Amazon S3. The job functioned efficiently until the size of the data increased. Now, when attempting to save approximately 2-3 GB of data to S3, it takes over an hour, which is significantly slower than desired.

The bottleneck appears to be the write to S3. The job first reads the partition from the Glue Data Catalog table:

dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="pg-to-s3",
    table_name=f"stats_prod_{table_name}",
    additional_options={"partitionPredicate": f"'{partition_name}'"},
)

After that, I repartition the frame and write it to S3 as follows:

frame = frame.repartition(24)
glueContext.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": s3_path},
    format_options={"compression": "snappy"},
)

I've isolated the problem to this block through print-based debugging, where timestamps indicated a significant delay during the execution of these lines.

I'm looking for suggestions on how to optimize this part of the code to reduce the time taken to write large datasets (2-3 GB) to S3. Are there best practices or specific strategies I can employ to enhance the efficiency of this data transfer process in AWS Glue?

Any advice or insights would be greatly appreciated, especially from those who have tackled similar challenges with AWS Glue and large data transfers to S3.

itamar
asked a month ago, 178 views
1 Answer

Don't be misled by which line appears slow. Spark evaluates lazily, so the write is the step that triggers the whole pipeline, including the read from Postgres. A slow write call doesn't mean that writing the files is slow.
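One way to check this is to force the read as its own Spark action and time it separately from the write. The following is only a sketch: `timed` and `profile_read_and_write` are hypothetical helper names, `write_fn` is assumed to be a callable you supply that wraps your existing S3 write, and caching assumes the partition fits in executor memory and disk.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent in the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def profile_read_and_write(dynamic_frame, write_fn):
    """Time the JDBC read and the S3 write as two separate Spark actions.

    `dynamic_frame` is the frame returned by from_catalog.
    `write_fn` (hypothetical) takes the cached Spark DataFrame and writes it.
    """
    timings = {}
    # Cache so the rows pulled by count() are reused by the write
    # instead of being read from Postgres a second time.
    df = dynamic_frame.toDF().cache()
    with timed("read_from_postgres", timings):
        row_count = df.count()  # action: forces the JDBC read to run now
    with timed("write_to_s3", timings):
        write_fn(df)
    df.unpersist()
    return row_count, timings
```

If `read_from_postgres` accounts for most of the hour, the S3 write is not the problem. Inside `write_fn` you could convert back with `DynamicFrame.fromDF(df, glueContext, "frame")` and reuse the `write_dynamic_frame.from_options` call from the question unchanged.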
In your case the bottleneck is probably the read from Postgres; see: https://docs.aws.amazon.com/glue/latest/dg/run-jdbc-parallel-read-job.html
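As a rough sketch of what that page describes, you can add `hashfield` and `hashpartitions` to the `additional_options` of the catalog read so that Glue splits the JDBC read into several parallel queries. The helper names below are hypothetical, and the column `id` and the partition count of 8 are assumptions; pick a column whose values are evenly distributed in your table.

```python
def parallel_read_options(base_options, hash_field, hash_partitions):
    """Return a copy of `base_options` with Glue's parallel JDBC read keys added."""
    if hash_partitions < 1:
        raise ValueError("hash_partitions must be at least 1")
    options = dict(base_options)  # leave the caller's dict untouched
    options["hashfield"] = hash_field  # column used to split the read
    options["hashpartitions"] = str(hash_partitions)  # number of parallel reads
    return options


def read_in_parallel(glue_context, table_name, base_options,
                     hash_field="id", hash_partitions=8):
    """Read the catalog table from the question using parallel JDBC reads.

    `hash_field="id"` and `hash_partitions=8` are placeholder assumptions.
    """
    return glue_context.create_dynamic_frame.from_catalog(
        database="pg-to-s3",
        table_name=f"stats_prod_{table_name}",
        additional_options=parallel_read_options(
            base_options, hash_field, hash_partitions
        ),
    )
```

Here `base_options` would be the dictionary you already pass, for example `{"partitionPredicate": f"'{partition_name}'"}`. More parallel reads also means more concurrent connections to Postgres, so it is worth starting with a small number and raising it gradually.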

AWS Expert, answered a month ago
Expert, reviewed a month ago
  • But I don't use a complex query at all. I just take the oldest partition, write it to S3, and drop it. I'm pretty sure it doesn't take that much time.
