Glue 4.0 Iceberg issues

Hello,

I have two issues:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://co-raw-sales-dev")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .enableHiveSupport()
    .getOrCreate()
)

df.writeTo("glue_catalog.co_raw_sales_dev.new_test").using("iceberg").create()

DDL of the created table:

CREATE TABLE co_raw_sales_dev.new_test (
  id bigint,
  name string,
  points bigint)
LOCATION 's3://co-raw-sales-dev//new_test'
TBLPROPERTIES (
  'table_type'='iceberg'
);

The problem I am having is that there is a double slash (//) in the S3 location, between the bucket name and the table name.
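For illustration, a naive string join of the warehouse URI and the table name reproduces the double slash (pure Python, the variable names are mine, not Iceberg internals):

```python
# Pure-Python illustration (not Iceberg internals): if the warehouse value
# already ends with "/", a naive join produces the double slash seen above.
warehouse = "s3://co-raw-sales-dev/"   # hypothetical trailing slash
table = "new_test"

naive = warehouse + "/" + table
print(naive)   # s3://co-raw-sales-dev//new_test

# Stripping the trailing slash before joining avoids it:
fixed = warehouse.rstrip("/") + "/" + table
print(fixed)   # s3://co-raw-sales-dev/new_test
```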

This one works: df.writeTo("glue_catalog.co_raw_sales_dev.new_test2").using("iceberg").create()

but if I remove "glue_catalog" like: df.writeTo("co_raw_sales_dev.new_test2").using("iceberg").create()

I get the error: An error occurred while calling o339.create. Table implementation does not support writes: co_raw_sales_dev.new_test2

Am I missing a parameter in the SparkSession config?

Thank you, Adas.

asked a year ago · 1108 views

1 Answer

Accepted Answer
  1. I doubt you can make it work correctly. S3 allows a double slash, but for a filesystem it means you have a directory with an empty name; I would move the files and avoid problems in the future (even if you can work around it now).
  2. You need to specify "glue_catalog" so Spark knows the table is in the Iceberg catalog; otherwise it will treat it as a regular table.
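If typing the "glue_catalog" prefix everywhere is a nuisance, Spark 3 also has a `spark.sql.defaultCatalog` setting that routes unqualified table names to a chosen catalog. A sketch (not tested against Glue here; the warehouse value is taken from the question):

```python
from pyspark.sql import SparkSession

# Sketch only: make glue_catalog the default catalog so unqualified names
# such as "co_raw_sales_dev.new_test2" resolve to the Iceberg catalog.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://co-raw-sales-dev")
    .config("spark.sql.defaultCatalog", "glue_catalog")  # route unqualified names here
    .getOrCreate()
)
```

This is a config fragment that needs a Glue/Spark runtime with the Iceberg jars on the classpath to actually run.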
AWS EXPERT
answered a year ago
  • Thank you, Gonzalo, for the explanation.

    One more question: I am running the query below from Glue:

    query = f"""
    CREATE TABLE IF NOT EXISTS glue_catalog.{std_database}.{std_table}
    USING iceberg
    LOCATION 's3://{std_bucket}/{std_table}'
    PARTITIONED BY (id)
    TBLPROPERTIES (
      'format'='parquet',
      'write_compression'='snappy'
    )
    AS SELECT * FROM source_df
    """
    spark.sql(query)
    

    I inserted data and I can query it; everything seems fine. But when running "SHOW CREATE TABLE {std_database}.{std_table}" in Athena, I get the error: CREATE TABLE statement cannot be generated because table has unsupported properties.

    Both properties I added are described in https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-creating-tables.html. What might be wrong?

  • Maybe instead of "using", use the table property 'table_type'='ICEBERG'; otherwise it works for me.
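Concretely, the CTAS from the question could be rewritten along these lines (a sketch of the suggestion above; the placeholder values are mine, standing in for the {std_*} variables in the question):

```python
# Sketch: declare the Iceberg format via the 'table_type' table property and
# drop the 'format'/'write_compression' properties that Athena rejected.
# Placeholder values below are illustrative, not from a real environment.
std_database, std_table, std_bucket = "co_raw_sales_dev", "new_test", "co-raw-sales-dev"

query = f"""
CREATE TABLE IF NOT EXISTS glue_catalog.{std_database}.{std_table}
USING iceberg
LOCATION 's3://{std_bucket}/{std_table}'
PARTITIONED BY (id)
TBLPROPERTIES ('table_type'='ICEBERG')
AS SELECT * FROM source_df
"""
# spark.sql(query)  # requires a live SparkSession with the Iceberg catalog
```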

  • Hi Gonzalo,

    I tested various scenarios, and the only way "SHOW CREATE TABLE" works in Athena is when I create the table in PySpark without any TBLPROPERTIES.

    Also, the "SHOW CREATE TABLE" output differs between Athena and PySpark:

    Athena:

    CREATE TABLE co_raw_sales_dev.test1(
      id bigint,
      name string,
      points bigint,
      created string,
      updated string)
    PARTITIONED BY (`id`)
    LOCATION 's3://co-raw-sales-dev/test1'
    TBLPROPERTIES (
      'table_type'='iceberg'
    );
    

    PySpark:

    CREATE TABLE glue_catalog.co_raw_sales_dev.test1(
    id BIGINT,
    name STRING,
    points BIGINT,
    created STRING,
    updated STRING)
    USING iceberg
    PARTITIONED BY (id)
    LOCATION 's3://co-raw-sales-dev/test1'
    TBLPROPERTIES (
    	'current-snapshot-id' = '5704046200302329156',
    	'format' = 'iceberg/parquet',
    	'format-version' = '1'
    )
    

    I think the problem here is that Glue 4.0 creates Iceberg tables with format-version 1 while Athena uses format-version 2.

  • Spark defaults to format-version 1, but it should work with 2
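If the format version is the concern, Iceberg lets you pin it at create time with the 'format-version' table property. A sketch building the DDL string (database and table names reuse those from the thread; the exact behavior on Glue is not verified here):

```python
# Sketch: request Iceberg spec version 2 explicitly when creating the table.
database, table = "co_raw_sales_dev", "test1"

ddl = f"""
CREATE TABLE IF NOT EXISTS glue_catalog.{database}.{table} (
  id BIGINT, name STRING, points BIGINT)
USING iceberg
TBLPROPERTIES ('format-version'='2')
"""
# spark.sql(ddl)  # requires a live SparkSession with the Iceberg catalog
#
# The DataFrameWriterV2 equivalent (untested here) would be:
# df.writeTo(f"glue_catalog.{database}.{table}") \
#   .tableProperty("format-version", "2").using("iceberg").create()
```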
