Use GlueContext.getSink for writing Apache Iceberg tables to an S3 bucket and the Data Catalog


Is there a way to use GlueContext.getSink().writeFrame(...) to write Apache Iceberg tables? So far, the only variant I have found working is GlueContext.write_dynamic_frame.from_options(...), which is documented at the bottom of https://aws.amazon.com/de/blogs/big-data/implement-a-cdc-based-upsert-in-a-data-lake-using-apache-iceberg-and-aws-glue/. That variant does not seem to provide an option to update the Data Catalog at the same time.
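For context, the getSink() pattern the question refers to looks like the following for a natively supported format such as glueparquet. This is a minimal sketch; the bucket, database, and table names are placeholders, and `dynamic_frame` is assumed to be an existing DynamicFrame:

```python
# Sketch of GlueContext.getSink with Data Catalog updates enabled, as
# documented for natively supported formats such as glueparquet.
# Bucket, database, and table names are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

sink = glue_context.getSink(
    connection_type="s3",
    path="s3://my-bucket/output/",        # placeholder output path
    enableUpdateCatalog=True,             # update the Data Catalog on write
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
)
sink.setCatalogInfo(catalogDatabase="my_db", catalogTableName="my_table")
sink.setFormat("glueparquet")             # Iceberg is not a supported format here
sink.writeFrame(dynamic_frame)            # dynamic_frame: an existing DynamicFrame
```

This is the catalog-updating behavior the question is asking about for Iceberg output.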

Asked 2 years ago · 2,470 views
1 Answer

Hello,

As per the documentation, there are only two ways to update the schema automatically from an AWS Glue job: 1. getSink() and 2. from_catalog(), and your job needs to use the Iceberg connection or the Iceberg JARs.

getSink() does not support Marketplace connections. Reference

from_catalog() needs to read metadata such as the classification or connection from the existing Iceberg table. However, if you are creating Iceberg tables from Athena as shown here, this method does not work either. Reference

So the only way I can see is to use the from_options() method and Spark DataFrames to write to your Iceberg table.

Schema evolution for Iceberg tables is documented here, and for Athena here.

AWS
Support Engineer
Answered 2 years ago
