Hi!
I am trying to use DataZone to share an Iceberg table from the Glue Data Catalog with another AWS account.
I have created the table with Athena in the source account like this:
CREATE TABLE iceberg_table (
  id int,
  data string,
  category string)
PARTITIONED BY (category, bucket(16, id))
LOCATION 's3://************/dzd_ceozi0qzepfll7/datazone/409ty6lk11tpqj/'
TBLPROPERTIES (
  'table_type'='ICEBERG',
  'format'='parquet',
  'write_compression'='snappy',
  'optimize_rewrite_delete_file_threshold'='10'
);

INSERT INTO "iceberg_table" ("id", "data", "category")
VALUES
  (1, 'my data', '100'),
  (2, 'hello', '200'),
  (3, 'this', '100'),
  (4, 'is', '200'),
  (5, 'a test', '300');
Then I registered it as a data asset in DataZone and shared it with a different project that has an environment in another AWS account. So far, so good: I was able to query the shared table in the other account with Athena. Then I thought it wouldn't be a big deal to read the table from EMR (or in a Glue notebook).
I started a Glue notebook with an IAM role that has the necessary Glue, S3, and Lake Formation permissions. This is the code of the notebook:
%idle_timeout 60
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2
%%configure
{
  "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
  "--datalake-formats": "iceberg"
}
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession
catalog_nm = "glue_catalog"
s3_bucket = "s3://************/dzd_ceozi0qzepfll7/datazone/4qaoomp2oefmkr/"
spark = SparkSession.builder \
.config("spark.sql.defaultCatalog", catalog_nm) \
.config(f"spark.sql.catalog.{catalog_nm}",
"org.apache.iceberg.spark.SparkCatalog") \
.config(f"spark.sql.catalog.{catalog_nm}.warehouse", s3_bucket) \
.config(f"spark.sql.catalog.{catalog_nm}.catalog-impl",
"org.apache.iceberg.aws.glue.GlueCatalog") \
.config(f"spark.sql.catalog.{catalog_nm}.io-impl",
"org.apache.iceberg.aws.s3.S3FileIO") \
.getOrCreate()
sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)
%%sql
show databases
%%sql
show tables in consumerdatalake_sub_db
%%sql
select * from glue_catalog.consumerdatalake_sub_db.iceberg_table limit 10
Both the "show databases" and the "show tables in ..." statements work fine, but the select statement results in the following error:
Py4JJavaError: An error occurred while calling o77.sql.
: org.apache.iceberg.exceptions.ValidationException: Input Glue table is not an iceberg table: glue_catalog.consumerdatalake_sub_db.iceberg_table (type=null)
at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:49)
at org.apache.iceberg.aws.glue.GlueToIcebergConverter.validateTable(GlueToIcebergConverter.java:48)
at org.apache.iceberg.aws.glue.GlueTableOperations.doRefresh(GlueTableOperations.java:116)
at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:95)
at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:78)
at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:43)
.....
It seems like Glue cannot infer that the table type is actually "iceberg". As a test, I created the same Iceberg table in the default Glue catalog of the subscriber account and I was able to query that table without any issues. I compared the Glue tables and both say "Table Type: ICEBERG". Are there any restrictions when trying to read a shared Iceberg table? Any idea what could be missing?
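The property that Iceberg's validation reads can also be inspected directly on the consumer side. A sketch with the AWS CLI (database and table names as above; the --query expression just narrows the output):

```bash
# Inspect the Glue table entry that Iceberg's GlueToIcebergConverter validates.
# The check appears to look at the table's Parameters (table_type=ICEBERG);
# for a resource link, Glue may return the link entry itself, whose
# Parameters map is empty -- which would explain "type=null".
aws glue get-table \
  --database-name consumerdatalake_sub_db \
  --name iceberg_table \
  --query 'Table.{Type:TableType,Params:Parameters,Target:TargetTable}'
```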
When I do the same for a CSV table instead of ICEBERG, I get a similar error:
df = glueContext.create_data_frame.from_catalog(database='consumerdatalake_sub_db', table_name='customers')
df.show()
Py4JJavaError: An error occurred while calling o82.getCatalogSource.
: java.lang.Error: No classification or connection in consumerdatalake_sub_db.customers
In the Glue tables overview I can see that there are no "classifications" set for the shared tables.
It seems that this has to do with the Lake Formation resource links that are created for the shared tables. I was able to query both the CSV and the Iceberg data by granting Lake Formation permissions on the tables in the producer data catalog:
producerdatalake_pub_db.customers and
producerdatalake_pub_db.iceberg_table
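For reference, a grant of that shape can be expressed with the AWS CLI roughly like this (a sketch only; the account IDs and the role name are placeholders, not values from my setup):

```bash
# Grant the consumer-side notebook role SELECT/DESCRIBE on the
# producer's table (222233334444 = producer catalog, placeholders).
aws lakeformation grant-permissions \
  --principal DataLakePrincipalArn=arn:aws:iam::111122223333:role/GlueNotebookRole \
  --permissions SELECT DESCRIBE \
  --resource '{"Table": {"CatalogId": "222233334444", "DatabaseName": "producerdatalake_pub_db", "Name": "iceberg_table"}}'
```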
Here I read that EMR and Glue can access shared tables directly: https://aws.github.io/aws-lakeformation-best-practices/data-sharing/general-data-sharing/#resource-links
But the AWS docs say that it should also be possible via resource links: https://docs.aws.amazon.com/lake-formation/latest/dg/resource-links-about.html
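One more detail that may be relevant: the Iceberg GlueCatalog supports a glue.id catalog property that makes it talk to a specific account's Glue catalog directly, instead of resolving the table through the resource link. A sketch of the notebook configuration with that option added (222233334444 is a placeholder for the producer account ID; Glue notebooks accept multiple settings in one "--conf" value separated by " --conf "):

```
%%configure
{
  "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog.glue.id=222233334444",
  "--datalake-formats": "iceberg"
}
```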