Hello
I understand that you want to know whether there is a way to build Iceberg and Hive tables on the same S3 bucket and keep them in sync as the data changes.
Iceberg uses the Hive storage layout by default, but it can be switched to use the ObjectStoreLocationProvider. With the ObjectStoreLocationProvider, a deterministic hash is generated for each stored file, and the hash is appended directly after the write.data.path.
In Athena, the write.object-storage.enabled property is set to "true" by default, which generates a hash and inserts it into the object key before the partition path. This spreads objects across key prefixes and improves the performance of S3 calls. Note that changing this property is currently not supported in Athena.
For more details, please refer to the following documentation:
- https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout
- https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-creating-tables.html#querying-iceberg-table-properties
For instance, the S3 layout for such an Iceberg table looks like this:

data/
  0fc25534/a.parquet
  2a13ect9/b.parquet
  667bfdv6/c.parquet
metadata/
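The idea behind that layout can be sketched in a few lines of Python. This is only an illustration, not Iceberg's actual hashing algorithm (which differs in hash function and encoding); the bucket name, data path, and `object_store_path` helper are made up for the example:

```python
import hashlib


def object_store_path(data_path: str, partition: str, filename: str) -> str:
    """Illustrative sketch: derive a deterministic hash from the file name
    and insert it into the key after the data path, ahead of the partition
    directories, so objects fan out across many S3 key prefixes."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()[:8]
    return f"{data_path}/{digest}/{partition}/{filename}"


# Two files in the same partition land under different hash prefixes,
# spreading S3 request load across prefixes:
print(object_store_path("s3://my-bucket/my-table/data", "dt=2023-01-01", "a.parquet"))
print(object_store_path("s3://my-bucket/my-table/data", "dt=2023-01-01", "b.parquet"))
```

Because the hash is deterministic, the same file always maps to the same key, while different files scatter across prefixes.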
In addition to that, an Athena Iceberg DELETE writes Iceberg position delete files to the table, which is known as a merge-on-read delete. When Athena reads Iceberg data, it merges the position delete files with the data files to produce the latest view of the table.
In contrast, when you create a database and table in Athena for Hive-style data, you are simply describing the schema and the Amazon S3 location of the table data for read-time querying; Athena reads all data stored in the S3 folder you specify. So I don't think there is a direct way of achieving your use case.
I have to imagine this has come up before, with data protection laws like GDPR and CCPA becoming relevant, so is there an AWS-recommended solution for handling this use case? I have a bunch of partitioned data in Parquet files sitting in S3 that are used to build Hive tables in Athena, but I need to be able to delete certain rows from that Parquet data.