Access production Glue Iceberg table data from a non-prod EMR Serverless job

Hi, I have a use case where I want to read my prod Glue table data from an EMR Serverless application job.

I am running a job on a non-prod EMR Serverless application in Studio, and it needs to access a prod Glue Iceberg table. I am able to read normal tables from prod, but I am not able to read Iceberg tables. I am not sure what configuration I need to provide in my code or in the Studio application configuration. Currently I have this as the application configuration:

{
  "runtimeConfiguration": [
    {
      "classification": "spark-defaults",
      "configurations": null,
      "properties": {
        "spark.sql.catalog.hubble_iceberg_catalog.warehouse": "s3://sat-tech-platform-prod/data-dir/data/processed/databases",
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalogImplementation": "hive",
        "spark.sql.catalog.hubble_iceberg_catalog.glue.catalog-id": "Production_ACCOUNT_ID",
        "spark.sql.catalog.hubble_iceberg_catalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.hubble_iceberg_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "spark.sql.catalog.hubble_iceberg_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.hadoop.hive.metastore.glue.catalogid": "Production_ACCOUNT_ID"
      }
    }
  ]
}

1 Answer

To access Iceberg tables from a production AWS Glue Data Catalog using an EMR Serverless application in a non-production environment, you need to properly configure cross-account access. Based on your current configuration, there are a few adjustments needed.

For your EMR Serverless application to access Iceberg tables in a production account, you need to ensure:

  1. The execution role for your EMR Serverless job has the necessary cross-account permissions to:
  • Access the production AWS Glue Data Catalog
  • Read data from the production S3 bucket
  2. Your Spark configuration needs some adjustments:
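{
  "runtimeConfiguration": [
    {
      "classification": "spark-defaults",
      "properties": {
        "spark.sql.catalog.hubble_iceberg_catalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.hubble_iceberg_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "spark.sql.catalog.hubble_iceberg_catalog.glue.id": "Production_ACCOUNT_ID",
        "spark.sql.catalog.hubble_iceberg_catalog.warehouse": "s3://sat-tech-platform-prod/data-dir/data/processed/databases",
        "spark.sql.catalog.hubble_iceberg_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.defaultCatalog": "hubble_iceberg_catalog"
      }
    }
  ]
}

The key changes are:

  • Replacing "spark.sql.catalog.hubble_iceberg_catalog.glue.catalog-id" with "spark.sql.catalog.hubble_iceberg_catalog.glue.id", which is the property Iceberg's GlueCatalog reads for the Glue catalog ID (the production account ID)
  • Keeping "catalog-impl" as the only catalog selector. Do not also set "spark.sql.catalog.hubble_iceberg_catalog.type", because Iceberg refuses to create a catalog that has both "type" and "catalog-impl" set
  • Adding "spark.sql.defaultCatalog": "hubble_iceberg_catalog" to set this as the default catalog
  • Ensuring the region is correctly set if your production account is in a different region (the configuration above does not set it)

The Hive-related settings from your original configuration ("spark.sql.catalogImplementation" and "spark.hadoop.hive.metastore.glue.catalogid") are omitted here because they do not affect the Iceberg catalog. Keep them if you still rely on them to read non-Iceberg tables from prod.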
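
As a rough sketch of how these properties fit together, the Python helper below builds the spark-defaults properties for a cross-account Iceberg catalog and renders them either as the application's runtimeConfiguration JSON or as a `--conf` string for a job's sparkSubmitParameters. The function names are hypothetical and the account ID is a placeholder. The sketch sets only catalog-impl, since Iceberg rejects a catalog that sets both type and catalog-impl.

```python
import json


def iceberg_glue_properties(catalog, glue_account_id, warehouse, set_default=False):
    """Build Spark properties for an Iceberg catalog backed by another account's Glue Data Catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    props = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Only catalog-impl is set; "type" is deliberately left out.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        # glue.id is the Glue catalog ID, i.e. the account that owns the tables.
        f"{prefix}.glue.id": glue_account_id,
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    }
    if set_default:
        # Optional: unqualified table names then resolve against this catalog.
        props["spark.sql.defaultCatalog"] = catalog
    return props


def runtime_configuration(properties):
    """Wrap the properties in the application-level runtimeConfiguration shape."""
    return {"runtimeConfiguration": [{"classification": "spark-defaults", "properties": properties}]}


def spark_submit_parameters(properties):
    """Render the same properties as --conf flags for a single job run."""
    return " ".join(f"--conf {key}={value}" for key, value in properties.items())


props = iceberg_glue_properties(
    catalog="hubble_iceberg_catalog",
    glue_account_id="Production_ACCOUNT_ID",  # placeholder, as in the question
    warehouse="s3://sat-tech-platform-prod/data-dir/data/processed/databases",
)
print(json.dumps(runtime_configuration(props), indent=2))
```

Generating the properties from one function keeps the catalog name consistent across all keys, which is an easy thing to get wrong when editing the JSON by hand.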

When querying the tables, you would use:

SELECT * FROM hubble_iceberg_catalog.database_name.table_name
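
From PySpark code, the same three-part name can be passed to `spark.table`. The sketch below assumes a `spark` session is already available (as it is in a notebook or job driver); the helper names and the database and table names are hypothetical.

```python
def qualified_table(catalog, database, table):
    """Join catalog, database and table into a three-part Spark identifier."""
    parts = (catalog, database, table)
    for part in parts:
        # Reject empty parts and parts that would change the identifier's shape.
        if not part or "." in part or "`" in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    return ".".join(parts)


def read_iceberg_table(spark, catalog, database, table):
    """Return a DataFrame for the table, resolved through the named catalog."""
    return spark.table(qualified_table(catalog, database, table))


# Usage inside a job or notebook where `spark` exists:
# df = read_iceberg_table(spark, "hubble_iceberg_catalog", "database_name", "table_name")
# df.show(5)
```

Always naming the catalog explicitly avoids depending on spark.sql.defaultCatalog, which matters if the same session also reads non-Iceberg tables.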

If you're still encountering issues, check that:

  1. The execution role has the necessary IAM permissions including lakeformation:GetDataAccess if Lake Formation is used
  2. Cross-account access policies are properly set up in both accounts
  3. The S3 bucket policy allows access from your non-production account
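
For the first check, a read-only identity policy for the non-prod execution role might look like the sketch below. The action list, the region, and the function name are assumptions for illustration, and your setup may need more (for example Lake Formation or KMS permissions). The production account must still grant access on its side through the Glue resource policy and the S3 bucket policy.

```python
import json


def execution_role_policy(prod_account_id, region, bucket):
    """Build a read-only IAM policy document for cross-account Glue and S3 access."""
    glue_arn = f"arn:aws:glue:{region}:{prod_account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadProdGlueCatalog",
                "Effect": "Allow",
                # Assumed minimum for reading table metadata.
                "Action": ["glue:GetDatabase", "glue:GetDatabases", "glue:GetTable", "glue:GetTables"],
                "Resource": [f"{glue_arn}:catalog", f"{glue_arn}:database/*", f"{glue_arn}:table/*/*"],
            },
            {
                "Sid": "ListProdBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                "Sid": "ReadProdObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


policy = execution_role_policy(
    prod_account_id="Production_ACCOUNT_ID",  # placeholder, as in the question
    region="us-east-1",                       # assumed region
    bucket="sat-tech-platform-prod",
)
print(json.dumps(policy, indent=2))
```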

Sources
  • Working with AWS Glue multi-catalog hierarchy on EMR Serverless - Amazon EMR
  • Using Apache Iceberg with EMR Serverless - Amazon EMR
  • Working with a multi-catalog hierarchy in AWS Glue Data Catalog with Spark on Amazon EMR - Amazon EMR

answered 5 months ago
