I want to use Apache Spark with Amazon EMR or AWS Glue to interact with Apache Iceberg from an AWS Glue Data Catalog in another AWS account.
Resolution
To use Spark with Apache Iceberg tables from the AWS Glue Data Catalog, set parameters in your AWS Glue job or your Amazon EMR cluster.
The Amazon EMR or AWS Glue job must have AWS Identity and Access Management (IAM) permissions to access the cross-account AWS Glue Data Catalog. For more information, see Methods for granting cross-account access in AWS Glue.
You must set the glue.id catalog property to the ID of the account that the AWS Glue Data Catalog is in. For more information, see Making a cross-account API call.
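As an illustration, the account that owns the Data Catalog can grant the job's account access with a Data Catalog resource policy. The following sketch is an assumption, not a complete policy: the account IDs, Region, and action list are placeholders that you must adapt to your own setup.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::JOB_ACCOUNT_ID:root" },
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables"
      ],
      "Resource": [
        "arn:aws:glue:us-east-1:CROSS_ACCOUNT_ID:catalog",
        "arn:aws:glue:us-east-1:CROSS_ACCOUNT_ID:database/*",
        "arn:aws:glue:us-east-1:CROSS_ACCOUNT_ID:table/*/*"
      ]
    }
  ]
}
```

The IAM role that your AWS Glue job or Amazon EMR cluster uses must also allow the matching glue: actions on the cross-account catalog resources.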
Set parameters in AWS Glue
For AWS Glue jobs, set the job parameters.
Example job parameters:
Key: --conf Value: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.dev=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.dev.glue.id=CROSS_ACCOUNT_ID --conf spark.sql.catalog.dev.warehouse=s3://amzn-s3-demo-bucket/ --conf spark.sql.catalog.dev.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.dev.io-impl=org.apache.iceberg.aws.s3.S3FileIO
Note: Replace CROSS_ACCOUNT_ID with the ID of the account that the Data Catalog is in, and amzn-s3-demo-bucket with your S3 bucket.
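Because AWS Glue passes the entire --conf value as a single string, it can help to build that string programmatically, for example before creating the job with the AWS SDK. The following is a minimal sketch; the dev catalog name, account ID, and bucket are the placeholders from the example above.

```python
# Build the single --conf job-parameter value for an AWS Glue job.
# The first setting is the bare value; each later one is chained with "--conf".
def build_conf_value(settings: dict) -> str:
    pairs = [f"{key}={value}" for key, value in settings.items()]
    return " --conf ".join(pairs)

settings = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.dev": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.dev.glue.id": "CROSS_ACCOUNT_ID",
    "spark.sql.catalog.dev.warehouse": "s3://amzn-s3-demo-bucket/",
    "spark.sql.catalog.dev.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.dev.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
}

conf_value = build_conf_value(settings)
print(conf_value)
```

Pass the resulting string as the value of the --conf job parameter, as shown in the example job parameters above.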
Set parameters in Amazon EMR
For an Amazon EMR cluster that runs version 6.5 or later, set the parameters when you submit the job. Or, use the Spark default configuration, /etc/spark/conf/spark-defaults.conf. For more information, see Use an Iceberg cluster with Spark.
To set the parameters, run the following spark-submit command:
spark-submit \
--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.warehouse=s3://amzn-s3-demo-bucket/prefix \
--conf spark.sql.catalog.my_catalog.type=glue \
--conf spark.sql.catalog.my_catalog.glue.id=CROSS_ACCOUNT_ID \
--conf spark.sql.defaultCatalog=my_catalog \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
Note: Replace CROSS_ACCOUNT_ID with your cross-account ID, amzn-s3-demo-bucket/prefix with your S3 bucket location and prefix, and my_catalog with your catalog name.
-or-
Use the following spark-defaults configuration:
[
  {
    "classification": "iceberg-defaults",
    "properties": {
      "iceberg.enabled": "true"
    },
    "configurations": []
  },
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.jars": "/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar",
      "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
      "spark.sql.catalog.dev": "org.apache.iceberg.spark.SparkCatalog",
      "spark.sql.catalog.dev.glue.id": "CROSS_ACCOUNT_ID",
      "spark.sql.catalog.dev.warehouse": "s3://amzn-s3-demo-bucket/",
      "spark.sql.catalog.dev.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
      "spark.sql.catalog.dev.io-impl": "org.apache.iceberg.aws.s3.S3FileIO"
    },
    "configurations": []
  }
]
Note: Replace CROSS_ACCOUNT_ID with the ID of the account that the Data Catalog is in, and amzn-s3-demo-bucket with your S3 bucket.
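To apply this configuration at cluster launch, you can save the JSON as a file and pass it to the AWS CLI. The following is a sketch only: the cluster name, release label, instance type, and instance count are placeholders that you must adjust for your environment.

```shell
# Save the classification JSON above as iceberg-config.json, then launch the
# cluster with it. All names and sizes below are illustrative placeholders.
aws emr create-cluster \
  --name "iceberg-cross-account" \
  --release-label emr-6.5.0 \
  --applications Name=Spark \
  --configurations file://iceberg-config.json \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles
```

You can also attach the same classifications to an existing configuration in the Amazon EMR console when you create the cluster.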
Related information
Read an Iceberg table from Amazon S3 using Spark