1 Answer
To access Iceberg tables from a production AWS Glue Data Catalog using an EMR Serverless application in a non-production environment, you need to properly configure cross-account access. Based on your current configuration, there are a few adjustments needed.
For your EMR Serverless application to access Iceberg tables in a production account, you need to ensure:
- The execution role for your EMR Serverless job has the necessary cross-account permissions to:
  - Access the production AWS Glue Data Catalog
  - Read data from the production S3 bucket
- Your Spark configuration needs some adjustments:
{
  "runtimeConfiguration": [
    {
      "classification": "spark-defaults",
      "properties": {
        "spark.sql.catalog.hubble_iceberg_catalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.hubble_iceberg_catalog.type": "glue",
        "spark.sql.catalog.hubble_iceberg_catalog.glue.id": "Production_ACCOUNT_ID",
        "spark.sql.catalog.hubble_iceberg_catalog.warehouse": "s3://sat-tech-platform-prod/data-dir/data/processed/databases",
        "spark.sql.catalog.hubble_iceberg_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "spark.sql.catalog.hubble_iceberg_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.defaultCatalog": "hubble_iceberg_catalog"
      }
    }
  ]
}
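The key changes are:
- Adding "spark.sql.catalog.hubble_iceberg_catalog.type": "glue" to specify the catalog type
- Adding "spark.sql.defaultCatalog": "hubble_iceberg_catalog" to set this as the default catalog
- Ensuring the region is correctly set if your production account is in a different region
One caveat: Iceberg expects either "type" or "catalog-impl" on a catalog, not both. If the job fails with an error saying that both are set, keep only one of the two properties.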
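The configuration above can also be generated rather than hand-edited, which helps keep the catalog name consistent across every property key. The following is a minimal sketch, not an official AWS helper: the function name `build_spark_defaults`, the `use_type_shortcut` flag, and the account ID `111122223333` are illustrative assumptions. It also assumes that Iceberg accepts only one of `type` or `catalog-impl` per catalog, so it emits one or the other.

```python
import json


def build_spark_defaults(catalog, glue_account_id, warehouse, use_type_shortcut=False):
    """Build an EMR Serverless runtimeConfiguration list for a Glue-backed Iceberg catalog.

    All names here are hypothetical helpers for illustration only.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    props = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # glue.id points the catalog at another account's Glue Data Catalog.
        f"{prefix}.glue.id": glue_account_id,
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.defaultCatalog": catalog,
    }
    # Assumption: set exactly one of "type" or "catalog-impl", never both.
    if use_type_shortcut:
        props[f"{prefix}.type"] = "glue"
    else:
        props[f"{prefix}.catalog-impl"] = "org.apache.iceberg.aws.glue.GlueCatalog"
    return [{"classification": "spark-defaults", "properties": props}]


runtime_configuration = build_spark_defaults(
    catalog="hubble_iceberg_catalog",
    glue_account_id="111122223333",  # placeholder for the production account ID
    warehouse="s3://sat-tech-platform-prod/data-dir/data/processed/databases",
)
print(json.dumps({"runtimeConfiguration": runtime_configuration}, indent=2))
```

The printed JSON has the same shape as the configuration shown above and can be pasted into the application or job-run configuration.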
When querying the tables, you would use:
SELECT * FROM hubble_iceberg_catalog.database_name.table_name
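If the query is issued from a PySpark job, the three-part name can be assembled in code. This is a small sketch under assumptions: `qualified_table` is a hypothetical helper, `database_name` and `table_name` are the placeholders from the query above, and the `LIMIT 10` is added only to keep a first test run cheap.

```python
def qualified_table(catalog, database, table):
    """Return a backtick-quoted catalog.database.table identifier for Spark SQL."""
    def quote(part):
        # Spark SQL escapes a backtick inside a quoted identifier by doubling it.
        return "`" + part.replace("`", "``") + "`"

    return ".".join(quote(part) for part in (catalog, database, table))


table_ref = qualified_table("hubble_iceberg_catalog", "database_name", "table_name")
query = f"SELECT * FROM {table_ref} LIMIT 10"
print(query)

# Inside a PySpark job with an active SparkSession named `spark`,
# the query string would be run with: spark.sql(query).show()
```

Quoting each part separately matters mainly when a database or table name contains characters such as hyphens, which Spark SQL does not accept in unquoted identifiers.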
If you're still encountering issues, check that:
- The execution role has the necessary IAM permissions, including lakeformation:GetDataAccess if Lake Formation is used
- Cross-account access policies are properly set up in both accounts
- The S3 bucket policy allows access from your non-production account
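As a starting point for the first check, here is a rough sketch of a read-only identity policy for the EMR Serverless execution role in the non-production account. It is an assumption-laden example, not a verified policy for your setup: the function name `build_read_policy`, the region `us-east-1`, the account ID `111122223333`, and the database name `database_name` are placeholders. It covers only the role side; the production account still needs a matching Glue resource policy and S3 bucket policy.

```python
import json


def build_read_policy(region, prod_account_id, database, bucket):
    """Build a read-only IAM policy document for cross-account Glue and S3 access.

    Hypothetical helper for illustration; adjust actions and resources to your needs.
    """
    glue_arn = f"arn:aws:glue:{region}:{prod_account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadProdGlueCatalog",
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:GetPartitions",
                ],
                "Resource": [
                    f"{glue_arn}:catalog",
                    f"{glue_arn}:database/{database}",
                    f"{glue_arn}:table/{database}/*",
                ],
            },
            {
                "Sid": "ReadProdWarehouseObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
            {
                "Sid": "ListProdWarehouseBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Only relevant when Lake Formation governs the tables.
                "Sid": "LakeFormationDataAccess",
                "Effect": "Allow",
                "Action": ["lakeformation:GetDataAccess"],
                "Resource": ["*"],
            },
        ],
    }


policy = build_read_policy(
    region="us-east-1",              # assumed region
    prod_account_id="111122223333",  # placeholder production account ID
    database="database_name",
    bucket="sat-tech-platform-prod",
)
print(json.dumps(policy, indent=2))
```

Comparing the printed document with the role's actual policy is a quick way to spot a missing action or a resource ARN that still points at the non-production account.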
Sources
Working with AWS Glue multi-catalog hierarchy on EMR Serverless - Amazon EMR
Using Apache Iceberg with EMR Serverless - Amazon EMR
Working with a multi-catalog hierarchy in AWS Glue Data Catalog with Spark on Amazon EMR - Amazon EMR
answered 5 months ago
