Hello
I understand that you want to know whether there is a way to build Iceberg and Hive tables on the same S3 bucket and keep them in sync as the data changes.
Iceberg uses the Hive storage layout by default, but it can be switched to use the ObjectStoreLocationProvider. With the ObjectStoreLocationProvider, a deterministic hash is generated for each stored file, and the hash is appended directly after the write.data.path.
In Athena, the write.object-storage.enabled property is set to "true" by default, which generates a hash and inserts it into the object key before the partition path. This spreads objects across key prefixes and improves the performance of S3 calls. Note that changing this property is currently not supported in Athena.
For more details, please refer to the following documentation:
- https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout
- https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-creating-tables.html#querying-iceberg-table-properties
For instance, the S3 layout for such an Iceberg table looks like this:

data/
  0fc25534/a.parquet
  2a13ect9/b.parquet
  667bfdv6/c.parquet
metadata/
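The idea behind that layout can be sketched in a few lines of Python. This is only an illustration, not Iceberg's actual hashing algorithm (which differs in hash function and encoding); the bucket name, data path, and `object_store_path` helper are made up for the example:

```python
import hashlib


def object_store_path(data_path: str, partition: str, filename: str) -> str:
    """Illustrative sketch: derive a deterministic hash from the file name
    and insert it into the key after the data path, ahead of the partition
    directories, so objects fan out across many S3 key prefixes."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()[:8]
    return f"{data_path}/{digest}/{partition}/{filename}"


# Two files in the same partition land under different hash prefixes,
# spreading S3 request load across prefixes:
print(object_store_path("s3://my-bucket/my-table/data", "dt=2023-01-01", "a.parquet"))
print(object_store_path("s3://my-bucket/my-table/data", "dt=2023-01-01", "b.parquet"))
```

Because the hash is deterministic, the same file always maps to the same key, while different files scatter across prefixes.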
In addition to that, an Athena Iceberg DELETE writes Iceberg position delete files to the table, which is known as a merge-on-read delete. When Athena reads Iceberg data, it merges the position delete files with the data files to produce the latest view of the table.
In contrast, when you create a database and table in Athena for Hive-style data, you are simply describing the schema and the Amazon S3 location of the table data for read-time querying; Athena reads all data stored in the S3 folder you specify. So I don't think there is a direct way of achieving your use case.
I have to imagine this has come up before, with data protection laws like GDPR and CCPA becoming relevant, so is there an AWS-recommended solution for handling this use case? I have a bunch of partitioned data in Parquet files sitting in S3 that are used to build Hive tables in Athena, but I need to be able to delete certain rows from that Parquet data.