Can I keep Hive tables in sync with data changes made in Iceberg tables?


I would like to adopt Iceberg tables for easier data management in the data lake but we have a dependency on Redshift Spectrum and it seems that Amazon is not planning on supporting Iceberg in Redshift Spectrum any time soon. Is there a way to build Iceberg and Hive tables off the same S3 bucket and keep them in sync with changes. For example, if I do a delete in Iceberg I would like that to be reflected in the Hive tables as well.

1 Answer

Hello

I understand that you want to know if there is a way to build Iceberg and Hive tables off the same S3 bucket and keep them in sync with changes.

Iceberg by default uses the Hive storage layout, but it can be switched to use the ObjectStoreLocationProvider. With the ObjectStoreLocationProvider, a deterministic hash is generated for each stored file, and the hash is inserted into the path directly after the value of write.data.path.

In Athena, the property write.object-storage.enabled is set to "true" by default, which generates a hash and inserts it before the partition path. This is done to improve the performance of S3 calls by spreading requests across key prefixes. Note that changing this property is currently not supported in Athena.

For more details, please refer to the following documentation:
  • https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout
  • https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-creating-tables.html#querying-iceberg-table-properties

For instance:

The S3 layout for such an Iceberg table looks like this:

  data/
    0fc25534/a.parquet
    2a13ect9/b.parquet
    667bfdv6/c.parquet
  metadata/
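The hash-prefixed layout above can be sketched in Python. This is an illustration only: the real ObjectStoreLocationProvider uses Iceberg's own hash function rather than MD5, and the function name here is hypothetical, but it shows how a deterministic hash spreads files across distinct S3 key prefixes.

```python
import hashlib

def object_store_path(data_path: str, partition: str, filename: str) -> str:
    """Illustrative sketch: prepend a short deterministic hash so that
    files land under different S3 key prefixes, which spreads request
    load. (Not the actual Iceberg implementation or API.)"""
    digest = hashlib.md5(f"{partition}/{filename}".encode()).hexdigest()[:8]
    return f"{data_path}/{digest}/{partition}/{filename}"

print(object_store_path("s3://bucket/table/data", "dt=2023-01-01", "a.parquet"))
```

Because the hash is deterministic, the same partition and filename always map to the same prefix, so the layout stays stable across rewrites of the same file.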

In addition, an Athena Iceberg DELETE writes Iceberg position delete files to the table; this is known as a merge-on-read delete. When Athena reads Iceberg data, it merges the position delete files with the data files to produce the latest view of the table.
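The merge-on-read behavior can be illustrated with a small sketch. The record shape and function name here are hypothetical, not the Iceberg API; they only show the idea that a position delete record names a data file and a row ordinal, and the reader drops those rows while scanning.

```python
def merge_on_read(data_files: dict, position_deletes: list) -> list:
    """Hedged illustration of applying Iceberg-style position deletes:
    each delete record identifies a data file path and the ordinal of
    the row to drop within that file."""
    deleted = {(d["file_path"], d["pos"]) for d in position_deletes}
    rows = []
    for path, file_rows in data_files.items():
        for pos, row in enumerate(file_rows):
            if (path, pos) not in deleted:
                rows.append(row)
    return rows

data = {"s3://bucket/table/data/a.parquet": ["row0", "row1", "row2"]}
deletes = [{"file_path": "s3://bucket/table/data/a.parquet", "pos": 1}]
print(merge_on_read(data, deletes))
```

Note that the underlying Parquet data file is untouched; only the Iceberg reader, which understands delete files, sees the row as gone. A Hive table pointed at the same S3 prefix would still read the deleted row, which is the crux of the sync problem in the question.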

In contrast, when you create a database and table in Athena, you are simply describing the schema and the Amazon S3 location of the table data for read-time querying; Athena then reads all data stored in the S3 folder you specify. So I think there is no direct way of achieving the use case that you want.

AWS
Ankur_J
answered 9 months ago
  • I have to imagine this has come up before, with data protection laws like GDPR and CCPA becoming relevant, so is there an AWS-recommended solution for handling this use case? I have a bunch of partitioned data in Parquet files sitting in S3 that is used to build Hive tables in Athena, but I need to be able to delete certain rows from that Parquet.
