How can I use Apache Iceberg with a cross-account AWS Glue Data Catalog in Spark?

I want to use Spark with Amazon EMR or AWS Glue to interact with Apache Iceberg from a cross-account AWS Glue Data Catalog.

Resolution

To use Spark to work with Apache Iceberg tables in a cross-account AWS Glue Data Catalog, set the following Spark configuration parameters. The catalog name (glue_catalog here, dev in the later examples) is a name that you choose; it becomes the prefix that you use to reference tables in Spark SQL.


--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \  
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog \  
--conf spark.sql.catalog.glue_catalog.glue.id=<CROSS_ACCOUNT_ID> \  
--conf spark.sql.catalog.glue_catalog.warehouse=s3://<your-warehouse-dir>/ \  
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \  
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO

You can set these parameters in a number of ways, depending on whether you use an AWS Glue job or an Amazon EMR cluster.

For AWS Glue jobs, use job parameters. For example:

Key:  --conf  
Value: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.dev=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.dev.glue.id=<CROSS_ACCOUNT_ID> --conf spark.sql.catalog.dev.warehouse=s3://<WAREHOUSE_DIR>/ --conf spark.sql.catalog.dev.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.dev.io-impl=org.apache.iceberg.aws.s3.S3FileIO
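With these job parameters in place, the job script references cross-account tables through the dev catalog prefix. The following PySpark script is a minimal sketch; the database and table names (reporting, sales) are placeholders, not values from this article:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Standard AWS Glue job boilerplate: reuse the Spark session that the job
# parameters above already configured with the Iceberg catalog named "dev".
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read an Iceberg table from the cross-account Data Catalog.
# "reporting" and "sales" are placeholder database and table names.
df = spark.sql("SELECT * FROM dev.reporting.sales LIMIT 10")
df.show()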

For an Amazon EMR cluster that runs version 6.5 or later, set the parameters when you submit the job. Or, use the Spark default configuration (/etc/spark/conf/spark-defaults.conf). For more information, see Use an Iceberg cluster with Spark.
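If you prefer to set the parameters in the application itself, a PySpark script can pass the same properties when it builds the Spark session. The following is a minimal sketch, assuming that the Iceberg runtime is already available on the cluster (for example, through the iceberg-defaults classification shown later in this article); <CROSS_ACCOUNT_ID>, <WAREHOUSE_DIR>, and the database and table names are placeholders:

from pyspark.sql import SparkSession

# Minimal sketch: the same Iceberg catalog properties, set on the
# SparkSession builder instead of in spark-defaults or spark-submit flags.
# Assumes the Iceberg runtime JAR is already on the cluster classpath.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.dev", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.dev.glue.id", "<CROSS_ACCOUNT_ID>")
    .config("spark.sql.catalog.dev.warehouse", "s3://<WAREHOUSE_DIR>/")
    .config("spark.sql.catalog.dev.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.dev.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

# Placeholder database and table names.
spark.sql("SELECT * FROM dev.reporting.sales LIMIT 10").show()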

Note: For cross-account scenarios, you must set the glue.id property to the ID of the cross-account AWS Glue Data Catalog (the AWS account ID of the account that owns the catalog).

If you use Amazon EMR version 6.5 or later, then you can apply the following configuration classifications when you create the cluster:

[
    {
        "classification": "iceberg-defaults",
        "properties": {
            "iceberg.enabled": "true"
        },
        "configurations": []
    },
    {
        "classification": "spark-defaults",
        "properties": {
            "spark.jars": "/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar",
            "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "spark.sql.catalog.dev": "org.apache.iceberg.spark.SparkCatalog",
            "spark.sql.catalog.dev.glue.id": "<CROSS_ACCOUNT_ID>",
            "spark.sql.catalog.dev.warehouse": "s3://<WAREHOUSE_DIR>/",
            "spark.sql.catalog.dev.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
            "spark.sql.catalog.dev.io-impl": "org.apache.iceberg.aws.s3.S3FileIO"
        },
        "configurations": []
    }
]
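To apply these classifications when you create a cluster programmatically, you can pass them in the Configurations parameter. The following boto3 sketch is illustrative only; the Region, cluster name, release label, instance settings, and IAM role names are assumptions, not values from this article:

import boto3

# Minimal sketch (not a complete cluster definition): pass the same
# classification JSON shown above when creating the cluster with boto3.
emr = boto3.client("emr", region_name="<REGION>")

response = emr.run_job_flow(
    Name="iceberg-cross-account-demo",          # placeholder name
    ReleaseLabel="emr-6.15.0",                   # any release of 6.5.0 or later
    Applications=[{"Name": "Spark"}],
    Configurations=[
        {
            "Classification": "iceberg-defaults",
            "Properties": {"iceberg.enabled": "true"},
        },
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
                "spark.sql.catalog.dev": "org.apache.iceberg.spark.SparkCatalog",
                "spark.sql.catalog.dev.glue.id": "<CROSS_ACCOUNT_ID>",
                "spark.sql.catalog.dev.warehouse": "s3://<WAREHOUSE_DIR>/",
                "spark.sql.catalog.dev.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
                "spark.sql.catalog.dev.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
            },
        },
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1}
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",           # placeholder instance profile
    ServiceRole="EMR_DefaultRole",               # placeholder service role
)
print(response["JobFlowId"])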

Note: The Amazon EMR or AWS Glue job must have sufficient AWS Identity and Access Management (IAM) permissions to access the cross-account AWS Glue Data Catalog. For more information, see Making a cross-account API call.
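One common way to grant that access is for the account that owns the Data Catalog to attach a resource policy that allows principals in the other account to call AWS Glue APIs. The following boto3 sketch is an illustrative outline only, not a complete or recommended policy; the account IDs, Region, and the exact set of actions that you need depend on your workload, so follow the linked documentation for your case:

import json
import boto3

# Run this in the account that owns the Data Catalog (the producer account).
# <REGION>, <PRODUCER_ACCOUNT_ID>, and <CONSUMER_ACCOUNT_ID> are placeholders.
glue = boto3.client("glue", region_name="<REGION>")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::<CONSUMER_ACCOUNT_ID>:root"},
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions"
            ],
            "Resource": [
                "arn:aws:glue:<REGION>:<PRODUCER_ACCOUNT_ID>:catalog",
                "arn:aws:glue:<REGION>:<PRODUCER_ACCOUNT_ID>:database/*",
                "arn:aws:glue:<REGION>:<PRODUCER_ACCOUNT_ID>:table/*/*"
            ]
        }
    ]
}

# Attach the resource policy to the producer account's Data Catalog.
glue.put_resource_policy(PolicyInJson=json.dumps(policy))

The job's IAM role in the calling account also needs permission to call AWS Glue and to read the table's data files in Amazon S3.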

Related information

Read an Iceberg table from Amazon S3 using Spark
