Bear in mind that you are creating a bottleneck with repartition(1), which forces a single core to do all the writing.
Still, that doesn't justify an hour to write 2-3 GB of Parquet; more likely the read from Postgres is slow, or some transformation in the middle, which you are probably also running in a single thread.
To parallelize everything from the read through to the write, have a look at this: https://docs.aws.amazon.com/glue/latest/dg/run-jdbc-parallel-read-job.html (the other way is to use the Spark API and set partitionColumn, lowerBound, and upperBound).
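A minimal sketch of that Spark API approach, assuming a Glue script where glueContext already exists; the host, credentials, table name, split column, and bounds below are all placeholders for your own schema:

spark = glueContext.spark_session

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://your-host:5432/your_db")
    .option("dbtable", "your_table")
    .option("user", "your_user")
    .option("password", "your_password")
    # Spark issues numPartitions parallel queries, each scanning a
    # slice of partitionColumn between lowerBound and upperBound.
    .option("partitionColumn", "id")   # must be numeric, date, or timestamp
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "20")
    .load()
)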
Following the advice from a previous response, I've adjusted my AWS Glue job code to include a partition predicate when creating a dynamic frame from the catalog. Here's the updated code segment:
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="pg-to-s3",
    table_name=f"stats_prod_{table_name}",
    # The predicate must be a boolean expression over the partition column;
    # "partition_col" stands in for the table's actual partition key.
    additional_options={"catalogPartitionPredicate": f"partition_col = '{partition_name}'"},
)
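For reference, the Glue docs also expose partition pruning as the push_down_predicate parameter, which filters catalog partitions before any data is read; a minimal sketch, where partition_col is again a placeholder for the real partition key:

dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="pg-to-s3",
    table_name=f"stats_prod_{table_name}",
    # Filters the catalog partitions client-side before reading data.
    push_down_predicate=f"partition_col = '{partition_name}'",
)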
After incorporating this change, I proceed with writing the dynamic frame to S3 as follows:
dynamic_frame = dynamic_frame.repartition(1)
glueContext.write_dynamic_frame.from_options(
    frame=dynamic_frame,
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": s3_path},
    format_options={"compression": "snappy"},
)
This write operation seems to be the most time-consuming part of my job. I'm trying to understand whether this points to an inefficiency in the SQL read performed by create_dynamic_frame.from_catalog, or whether the duration is mainly attributable to writing to S3. Furthermore, my job is configured with these settings:
- Worker type: G.1X (4 vCPU, 16 GB RAM)
- Maximum workers: 6

Considering these details, does it make sense to change the repartitioning in the write step to dynamic_frame.repartition(6), matching the maximum number of workers? I'm looking for advice on how to optimize this AWS Glue job, focusing on whether there are significant benefits to aligning the repartitioning strategy with the worker configuration, or whether there are other optimization strategies I should consider to reduce execution time.
To optimize the compute, you want to have as many partitions as total executor cores, assuming you are OK with that number of output files.
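A minimal sketch of that rule of thumb, assuming defaultParallelism reflects the total executor cores available to the job:

# Match the partition count to the total executor cores so every core
# has work during the write; each partition becomes one output file.
num_partitions = glueContext.spark_session.sparkContext.defaultParallelism
dynamic_frame = dynamic_frame.repartition(num_partitions)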