Hi!
I am trying to use DataZone to share an Iceberg table from the Glue Data Catalog with another AWS account.
I have created the table with Athena in the source account like this:
CREATE TABLE iceberg_table (
  id int,
  data string,
  category string)
PARTITIONED BY (category, bucket(16, id))
LOCATION 's3://************/dzd_ceozi0qzepfll7/datazone/409ty6lk11tpqj/'
TBLPROPERTIES (
  'table_type'='ICEBERG',
  'format'='parquet',
  'write_compression'='snappy',
  'optimize_rewrite_delete_file_threshold'='10'
);

INSERT INTO "iceberg_table" ("id", "data", "category")
VALUES
  (1, 'my data', '100'),
  (2, 'hello', '200'),
  (3, 'this', '100'),
  (4, 'is', '200'),
  (5, 'a test', '300');
Then I registered it as a data asset in DataZone and shared it with a different project that has an environment in another AWS account. So far, so good: I was able to query the shared table in the other account with Athena. Then I thought it wouldn't be a big deal to read the table from EMR (or in a Glue notebook).
I started a Glue notebook with an IAM role that has the necessary Glue, S3, and Lake Formation permissions. This is the code of the notebook:
%idle_timeout 60
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2
%%configure
{
  "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
  "--datalake-formats": "iceberg"
}
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession
catalog_nm = "glue_catalog"
s3_bucket = "s3://************/dzd_ceozi0qzepfll7/datazone/4qaoomp2oefmkr/"
spark = SparkSession.builder \
.config("spark.sql.defaultCatalog", catalog_nm) \
.config(f"spark.sql.catalog.{catalog_nm}",
"org.apache.iceberg.spark.SparkCatalog") \
.config(f"spark.sql.catalog.{catalog_nm}.warehouse", s3_bucket) \
.config(f"spark.sql.catalog.{catalog_nm}.catalog-impl",
"org.apache.iceberg.aws.glue.GlueCatalog") \
.config(f"spark.sql.catalog.{catalog_nm}.io-impl",
"org.apache.iceberg.aws.s3.S3FileIO") \
.getOrCreate()
sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)
%%sql
show databases
%%sql
show tables in consumerdatalake_sub_db
%%sql
select * from glue_catalog.consumerdatalake_sub_db.iceberg_table limit 10
Both the "show databases" and the "show tables in ..." statements work fine, but the select statement results in the following error:
Py4JJavaError: An error occurred while calling o77.sql.
: org.apache.iceberg.exceptions.ValidationException: Input Glue table is not an iceberg table: glue_catalog.consumerdatalake_sub_db.iceberg_table (type=null)
at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:49)
at org.apache.iceberg.aws.glue.GlueToIcebergConverter.validateTable(GlueToIcebergConverter.java:48)
at org.apache.iceberg.aws.glue.GlueTableOperations.doRefresh(GlueTableOperations.java:116)
at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:95)
at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:78)
at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:43)
.....
It seems like Glue cannot infer that the table type is actually "iceberg". As a test, I created the same Iceberg table in the default Glue catalog of the subscriber account and I was able to query that table without any issues. I compared the Glue tables and both say "Table Type: ICEBERG". Are there any restrictions when trying to read a shared Iceberg table? Any idea what could be missing?
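The property that Iceberg's validation reads can also be inspected directly on the consumer side. A sketch with the AWS CLI (database and table names as above; the --query expression just narrows the output):

```bash
# Inspect the Glue table entry that Iceberg's GlueToIcebergConverter validates.
# The check appears to look at the table's Parameters (table_type=ICEBERG);
# for a resource link, Glue may return the link entry itself, whose
# Parameters map is empty -- which would explain "type=null".
aws glue get-table \
  --database-name consumerdatalake_sub_db \
  --name iceberg_table \
  --query 'Table.{Type:TableType,Params:Parameters,Target:TargetTable}'
```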
When I do the same for a CSV table instead of ICEBERG, I get a similar error:
df = glueContext.create_data_frame.from_catalog(database='consumerdatalake_sub_db', table_name='customers')
df.show()
Py4JJavaError: An error occurred while calling o82.getCatalogSource.
: java.lang.Error: No classification or connection in consumerdatalake_sub_db.customers
In the Glue tables overview I can see that there are no "classifications" set for the shared tables.
It seems that this has to do with the Lake Formation resource links that are created for the shared tables. I was able to query both the CSV and the Iceberg data by granting Lake Formation permissions on the tables in the producer data catalog:
producerdatalake_pub_db.customers and
producerdatalake_pub_db.iceberg_table
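For reference, a grant of that shape can be expressed with the AWS CLI roughly like this (a sketch only; the account IDs and the role name are placeholders, not values from my setup):

```bash
# Grant the consumer-side notebook role SELECT/DESCRIBE on the
# producer's table (222233334444 = producer catalog, placeholders).
aws lakeformation grant-permissions \
  --principal DataLakePrincipalArn=arn:aws:iam::111122223333:role/GlueNotebookRole \
  --permissions SELECT DESCRIBE \
  --resource '{"Table": {"CatalogId": "222233334444", "DatabaseName": "producerdatalake_pub_db", "Name": "iceberg_table"}}'
```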
Here I read that EMR and Glue can access shared tables directly: https://aws.github.io/aws-lakeformation-best-practices/data-sharing/general-data-sharing/#resource-links
But the AWS docs say that it should also be possible via resource links: https://docs.aws.amazon.com/lake-formation/latest/dg/resource-links-about.html
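One more detail that may be relevant: the Iceberg GlueCatalog supports a glue.id catalog property that makes it talk to a specific account's Glue catalog directly, instead of resolving the table through the resource link. A sketch of the notebook configuration with that option added (222233334444 is a placeholder for the producer account ID; Glue notebooks accept multiple settings in one "--conf" value separated by " --conf "):

```
%%configure
{
  "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog.glue.id=222233334444",
  "--datalake-formats": "iceberg"
}
```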