Parallelize writing to Iceberg tables in Glue


I am creating my Iceberg table and inserting full DataFrames into it using the instructions under https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout. I observe that during the long write phase (starting at ~13:30), only one executor remains active:

Screenshot from Glue Job Run Metrics

Is there a way to parallelize the writing in order to speed it up, so that it does not take longer than the rest of the Glue job?
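
For reference, a minimal sketch of the setup, following the linked docs (this is not the exact job code; the catalog, database, table, and column names are placeholders):

    # Minimal sketch of the setup, following the linked Iceberg docs.
    # "glue_catalog", "my_db", "my_table", and the columns are placeholders;
    # "spark" and "df" come from the Glue job context.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS glue_catalog.my_db.my_table (
            id         bigint,
            payload    string,
            load_date  date)
        USING iceberg
        PARTITIONED BY (load_date)
        TBLPROPERTIES ('write.object-storage.enabled' = 'true')
    """)

    # Append the full DataFrame in one write
    df.writeTo("glue_catalog.my_db.my_table").append()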

  • Check in the Spark UI what operation is running; my impression is that the writing is done by 13:30 and that it is then performing some table maintenance.

  • @Gonzalo Herreros, no, I checked this. The table was empty before writing, and at 13:30 (and even at 14:30) I did not yet see any data on S3.

Asked 8 months ago · Viewed 525 times

1 Answer

Hi,

Have you tried to repartition your DataFrame before writing? I have seen this in the past, and it was more a matter of Spark and the number of partitions than an Iceberg issue.

Spark repartition docs
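
For illustration, a minimal sketch of what I mean (the partition count of 32 and the table identifier are placeholders; tune the count to your executor capacity):

    # Sketch only: spread the rows across more Spark partitions before the
    # write so that several executors produce data files in parallel.
    # The partition count (32) and the table identifier are placeholders.
    df = df.repartition(32)
    df.writeTo("glue_catalog.my_db.my_table").append()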

Best regards

Answered 8 months ago by AWS
  • While that is the solution for plain tables, Iceberg performs many operations of its own: it decides how to split the output based on the files it needs to update and on the target file size configuration, so one Spark partition no longer maps to one file.

  • Actually, this can be true. I plan to extend my table once per day, and it is partitioned by the column indicating when a row was inserted/modified. Thus, I get only one partition per day, i.e. per run of the Glue job. This is a client's requirement in my case. Is there anything else I can improve here?

  • It seems you do not have many options... The repartition route is not going to help unless you change the distribution-mode parameter, and that brings its own set of problems (see the sketch below).

    https://github.com/apache/iceberg/issues/7406
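
For illustration, a hedged sketch of what changing the distribution mode could look like (the table identifier and partition count are placeholders, and this is untested against the setup above):

    # Sketch (assumption, not verified here): with 'write.distribution-mode'
    # set to 'none', Spark does not shuffle all rows of one table partition
    # into a single task, so a preceding repartition(n) can keep up to n
    # write tasks busy even when every row lands in the same date partition.
    # Trade-off: more, smaller data files per table partition.
    spark.sql("""
        ALTER TABLE glue_catalog.my_db.my_table
        SET TBLPROPERTIES ('write.distribution-mode' = 'none')
    """)

    df.repartition(32).writeTo("glue_catalog.my_db.my_table").append()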
