Parallelize writing to Iceberg tables in Glue

I am creating my Iceberg table and inserting full DataFrames into it, following the instructions at https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout. I observe that during the long write phase (starting at ~13:30), only one executor remains active:

Screenshot from Glue Job Run Metrics

Is there a way to parallelize the write so that it runs faster and does not take longer than the rest of the Glue job?

  • Check in the Spark UI which operation is running. My impression is that the write is done by 13:30 and that the job is then doing some table maintenance.

  • @Gonzalo Herreros, no, I checked this. The table was empty before the write, and at 13:30 (and even at 14:30) I still do not see any data on S3.

asked 8 months ago · 529 views
1 answer

Hi,

Have you tried repartitioning your DataFrame before writing? I have seen this in the past, and it had more to do with Spark and the number of partitions than with Iceberg.

Spark repartition docs
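
To make the suggestion concrete, here is a minimal sketch of repartitioning before an append. The helper name `append_repartitioned`, the table identifier and the partition count are illustrative assumptions, not taken from the question; it relies only on the PySpark calls `DataFrame.repartition`, `DataFrame.writeTo` and `DataFrameWriterV2.append`. Whether this actually spreads the write across executors depends on the Iceberg table's distribution settings.

```python
def append_repartitioned(df, table_name, num_partitions):
    """Shuffle `df` into `num_partitions` Spark partitions, then append it to a table.

    `df` is expected to be a pyspark.sql.DataFrame; `table_name` and
    `num_partitions` are hypothetical values chosen by the caller.
    """
    # repartition(n) performs a full shuffle into n roughly equal partitions,
    # so up to n tasks can run in the stage that follows.
    repartitioned = df.repartition(num_partitions)
    # writeTo(...).append() adds the rows to an existing table.
    repartitioned.writeTo(table_name).append()
    return repartitioned
```

In a Glue job this could be called as `append_repartitioned(df, "glue_catalog.my_db.my_table", 32)`, where the catalog, database, table name and the value 32 are placeholders to adapt.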

Best

AWS
answered 8 months ago
  • While that is the solution for plain tables, Iceberg does more work on write: it has to decide how to lay out the output based on the files it needs to update and the file size configuration, so one partition no longer maps to one file.

  • Actually, this may be true. I plan to extend my table once per day, and it is partitioned by the column indicating when a row was inserted or modified. Thus, I get only one partition per day, or per run of the Glue job. In my case this is a client's requirement. Is there anything else I can improve in this case?

  • It seems that you do not have many options. The repartition route is not going to help unless you change the distribution mode parameter, and changing it brings its own set of problems.

    https://github.com/apache/iceberg/issues/7406
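
For reference, the distribution mode mentioned above is controlled by the Iceberg table property `write.distribution-mode`, which accepts `none`, `hash` or `range`. The sketch below only builds the `ALTER TABLE` statement; the helper name `distribution_mode_sql` and the table identifier are illustrative assumptions. As the comment warns, setting the mode to `none` has trade-offs (for example, it can lead to more, smaller files), so treat this as something to test rather than a recommended setting.

```python
# Values accepted by the Iceberg table property `write.distribution-mode`.
VALID_MODES = ("none", "hash", "range")


def distribution_mode_sql(table_name, mode):
    """Build the Spark SQL statement that sets the write distribution mode.

    `table_name` is a hypothetical identifier supplied by the caller.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return (
        f"ALTER TABLE {table_name} "
        f"SET TBLPROPERTIES ('write.distribution-mode' = '{mode}')"
    )
```

With an active session the statement could be applied as `spark.sql(distribution_mode_sql("glue_catalog.my_db.my_table", "none"))`, again with placeholder names. With `none`, Iceberg should not add its own shuffle before the write, so a preceding `repartition` on the DataFrame would determine how many tasks write in parallel.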
