Parallelize writing to Iceberg tables in Glue


I am creating my Iceberg table and inserting full DataFrames into it using the instructions under https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout. I observe that during the long write phase (starting at ~13:30), only one executor remains active:

Screenshot from Glue Job Run Metrics

Is there a way to parallelize the writing in order to speed it up, so that it does not take longer than the rest of the Glue job?
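
For reference, a minimal sketch of the setup, following the linked docs (this is not the exact job code; the catalog, database, table, and column names are placeholders):

    # Minimal sketch of the setup, following the linked Iceberg docs.
    # "glue_catalog", "my_db", "my_table", and the columns are placeholders;
    # "spark" and "df" come from the Glue job context.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS glue_catalog.my_db.my_table (
            id         bigint,
            payload    string,
            load_date  date)
        USING iceberg
        PARTITIONED BY (load_date)
        TBLPROPERTIES ('write.object-storage.enabled' = 'true')
    """)

    # Append the full DataFrame in one write
    df.writeTo("glue_catalog.my_db.my_table").append()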

  • Check in the Spark UI what operation is running; my impression is that the writing is done by 13:30 and that it is then performing some table maintenance.

  • @Gonzalo Herreros, no, I checked this. The table was empty before writing, and at 13:30 (and even at 14:30) I did not yet see any data on S3.

Asked 8 months ago · Viewed 525 times

1 Answer

Hi,

Have you tried to repartition your DataFrame before writing? I have seen this in the past, and it was more a matter of Spark and the number of partitions than an Iceberg issue.

Spark repartition docs
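
For illustration, a minimal sketch of what I mean (the partition count of 32 and the table identifier are placeholders; tune the count to your executor capacity):

    # Sketch only: spread the rows across more Spark partitions before the
    # write so that several executors produce data files in parallel.
    # The partition count (32) and the table identifier are placeholders.
    df = df.repartition(32)
    df.writeTo("glue_catalog.my_db.my_table").append()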

Best regards

Answered 8 months ago by AWS
  • While that is the solution for plain tables, Iceberg performs many operations of its own: it decides how to split the output based on the files it needs to update and on the target file size configuration, so one Spark partition no longer maps to one file.

  • Actually, this can be true. I plan to extend my table once per day, and it is partitioned by the column indicating when a row was inserted/modified. Thus, I get only one partition per day, i.e. per run of the Glue job. This is a client's requirement in my case. Is there anything else I can improve here?

  • It seems you do not have many options... The repartition route is not going to help unless you change the distribution-mode parameter, and that brings its own set of problems (see the sketch below).

    https://github.com/apache/iceberg/issues/7406
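
For illustration, a hedged sketch of what changing the distribution mode could look like (the table identifier and partition count are placeholders, and this is untested against the setup above):

    # Sketch (assumption, not verified here): with 'write.distribution-mode'
    # set to 'none', Spark does not shuffle all rows of one table partition
    # into a single task, so a preceding repartition(n) can keep up to n
    # write tasks busy even when every row lands in the same date partition.
    # Trade-off: more, smaller data files per table partition.
    spark.sql("""
        ALTER TABLE glue_catalog.my_db.my_table
        SET TBLPROPERTIES ('write.distribution-mode' = 'none')
    """)

    df.repartition(32).writeTo("glue_catalog.my_db.my_table").append()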
