
How do I use Apache Iceberg with a cross-account AWS Glue Data Catalog in Spark?


I want to use Apache Spark with Amazon EMR or AWS Glue to interact with Apache Iceberg from an AWS Glue Data Catalog in another AWS account.

Resolution

To use Spark with Apache Iceberg tables from the AWS Glue Data Catalog, set parameters in your AWS Glue job or your Amazon EMR cluster.

The Amazon EMR or AWS Glue job must have AWS Identity and Access Management (IAM) permissions to access the cross-account AWS Glue Data Catalog. For more information, see Methods for granting cross-account access in AWS Glue.

You must set the catalog ID property to the ID of the AWS account that the AWS Glue Data Catalog is in. For more information, see Making a cross-account API call.
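One common way to grant that cross-account access is a Data Catalog resource policy attached in the catalog-owning account. The following is a minimal sketch in Python; the account IDs, Region, and policy scope are placeholders, and you should narrow the actions and resources to your own security requirements:

```python
import json

# Placeholder account IDs -- replace with your own.
CATALOG_ACCOUNT_ID = "111122223333"   # account that owns the Data Catalog
CONSUMER_ACCOUNT_ID = "444455556666"  # account that runs the Spark job

# Resource policy that lets the consumer account read the catalog.
# Scope the actions and resources down for production use.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{CONSUMER_ACCOUNT_ID}:root"},
            "Action": ["glue:Get*", "glue:BatchGet*"],
            "Resource": f"arn:aws:glue:us-east-1:{CATALOG_ACCOUNT_ID}:*",
        }
    ],
}

# To attach the policy, run this in the catalog-owning account:
# import boto3
# boto3.client("glue").put_resource_policy(PolicyInJson=json.dumps(policy))
print(json.dumps(policy, indent=2))
```

The consumer account's job role still needs its own IAM policy that allows the matching `glue:*` actions on the cross-account catalog ARNs.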

Set parameters in AWS Glue

For AWS Glue jobs, set the job parameters.

Example job parameters:

Key: --conf
Value (entered as a single value; each additional setting is chained with --conf):

spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.dev=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.dev.glue.id=CROSS_ACCOUNT_ID
--conf spark.sql.catalog.dev.warehouse=s3://amzn-s3-demo-bucket/
--conf spark.sql.catalog.dev.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.dev.io-impl=org.apache.iceberg.aws.s3.S3FileIO

Note: Replace CROSS_ACCOUNT_ID with the ID of the account that contains the Data Catalog, and amzn-s3-demo-bucket with your S3 bucket.
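The long --conf value above is easy to mistype. As a sketch, a small helper can assemble it from your own values; render_iceberg_conf is a hypothetical function, not part of any AWS SDK:

```python
def render_iceberg_conf(cross_account_id: str, warehouse: str, catalog: str = "dev") -> str:
    """Build the AWS Glue job-parameter value for the --conf key.

    AWS Glue accepts several Spark settings in one --conf value, with
    each additional setting chained behind another '--conf'.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    settings = [
        "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        f"{prefix}=org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.glue.id={cross_account_id}",
        f"{prefix}.warehouse={warehouse}",
        f"{prefix}.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl=org.apache.iceberg.aws.s3.S3FileIO",
    ]
    return " --conf ".join(settings)

# Example: print the value to paste into the Glue job parameter.
print(render_iceberg_conf("111122223333", "s3://amzn-s3-demo-bucket/"))
```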

Set parameters in Amazon EMR

For an Amazon EMR cluster that runs version 6.5 or later, set the parameters when you submit the job. Or, use the Spark default configuration, /etc/spark/conf/spark-defaults.conf. For more information, see Use an Iceberg cluster with Spark.

To set the parameters, run the following spark-submit command:

spark-submit \
--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.warehouse=s3://amzn-s3-demo-bucket/prefix \
--conf spark.sql.catalog.my_catalog.type=glue \
--conf spark.sql.catalog.my_catalog.glue.id=CROSS_ACCOUNT_ID \
--conf spark.sql.defaultCatalog=my_catalog \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Note: Replace CROSS_ACCOUNT_ID with the ID of the account that contains the Data Catalog, amzn-s3-demo-bucket/prefix with your S3 bucket and prefix, and my_catalog with your catalog name.

-or-

Use the following spark-defaults configuration:

[
    {
        "classification": "iceberg-defaults",
        "properties": {
            "iceberg.enabled": "true"
        },
        "configurations": []
    },
    {
        "classification": "spark-defaults",
        "properties": {
            "spark.jars": "/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar",
            "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "spark.sql.catalog.dev": "org.apache.iceberg.spark.SparkCatalog",
            "spark.sql.catalog.dev.glue.id": "CROSS_ACCOUNT_ID",
            "spark.sql.catalog.dev.warehouse": "s3://amzn-s3-demo-bucket/",
            "spark.sql.catalog.dev.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
            "spark.sql.catalog.dev.io-impl": "org.apache.iceberg.aws.s3.S3FileIO"
        },
        "configurations": []
    }
]

Note: Replace CROSS_ACCOUNT_ID with the ID of the account that contains the Data Catalog, and amzn-s3-demo-bucket with your S3 bucket.
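If you create the cluster programmatically, the same classifications can be passed as the Configurations parameter of the Amazon EMR RunJobFlow API (which uses capitalized Classification/Properties keys). A sketch that builds that structure in Python; the account ID and bucket in the example call are placeholders:

```python
import json

def emr_iceberg_configurations(cross_account_id: str, warehouse: str, catalog: str = "dev") -> list:
    """Build the EMR 'Configurations' list for a cross-account Iceberg catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    return [
        {
            "Classification": "iceberg-defaults",
            "Properties": {"iceberg.enabled": "true"},
        },
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.jars": "/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar",
                "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
                prefix: "org.apache.iceberg.spark.SparkCatalog",
                f"{prefix}.glue.id": cross_account_id,
                f"{prefix}.warehouse": warehouse,
                f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
                f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
            },
        },
    ]

# Example: print the Configurations payload for a RunJobFlow request.
print(json.dumps(emr_iceberg_configurations("111122223333", "s3://amzn-s3-demo-bucket/"), indent=2))
```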

Related information

Read an Iceberg table from Amazon S3 using Spark

AWS OFFICIAL · Updated 5 months ago
2 Comments

Need separate job configuration for normal db and iceberg db

For each type of database, the job configuration is different.

For a normal (Hive) database:

"properties": {
    "spark.sql.catalogImplementation": "hive",
    "spark.hadoop.hive.metastore.glue.catalogid": "PROD_ACCT_ID"
}

For Iceberg tables:

"properties": {
    "spark.sql.catalog.bubble_iceberg_catalog.glue.id": "PROD_ACCT_ID",
    "spark.sql.catalog.bubble_iceberg_catalog.warehouse": "s3://sat-tech-platform-prod/data-dir/data/processed/databases/",
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.defaultCatalog": "bubble_iceberg_catalog",
    "spark.sql.catalog.bubble_iceberg_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.bubble_iceberg_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.bubble_iceberg_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.hadoop.aws.glue.catalog.separator": "/",
    "spark.jars": "/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar"
}

replied 8 months ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS MODERATOR · replied 8 months ago