Athena Iceberg table grows indefinitely while holding the same number of rows.


I have an Iceberg table which holds a constant number of rows. Every hour I update some of these rows using a MERGE command. After each MERGE I execute a VACUUM command. The table is created with the following properties.

TBLPROPERTIES (
  'table_type'='iceberg',
  'vacuum_max_snapshot_age_seconds'='60'
);
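The hourly cycle described above can be sketched as follows. This is a minimal illustration, not the author's actual statements; the table, source, and column names (my_table, updates, id, counter) are hypothetical placeholders:

```sql
-- Hourly update cycle (hypothetical table and column names)
MERGE INTO my_table t
USING updates u
  ON t.id = u.id
WHEN MATCHED THEN UPDATE SET counter = u.counter;

-- Expire snapshots older than vacuum_max_snapshot_age_seconds
-- and remove the files they reference
VACUUM my_table;
```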

Initially the table is about 1 GB. Within a week or so it grows to 50+ GB.

What am I doing wrong? How can I avoid this growth and delete the unnecessary garbage?

UPDATE

It looks like VACUUM does not remove orphaned files. I have the following settings:

TBLPROPERTIES (
  'table_type'='iceberg',
  'vacuum_max_snapshot_age_seconds'='60',
  'vacuum_max_metadata_files_to_keep'='1',
  'optimize_rewrite_delete_file_threshold'='2',
  'optimize_rewrite_data_file_threshold'='2',
  'vacuum_min_snapshots_to_keep'='2'
)

I run VACUUM hourly, right after each update.
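Note that the optimize_rewrite_* thresholds in the table properties above only take effect when an OPTIMIZE statement is actually run; VACUUM alone expires snapshots but does not compact accumulated delete files. A hedged sketch of adding compaction to the hourly cycle (table name is a placeholder):

```sql
-- Compact small data files and merge accumulated position-delete
-- files produced by MERGE, then expire the superseded snapshots
OPTIMIZE my_table REWRITE DATA USING BIN_PACK;
VACUUM my_table;
```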

My table has the following partitioning:

PARTITIONED BY (`partition`, `pos`, `pcc`, hour(`requeststarted`)) 

And despite the fact that the following query

select min(requeststarted), max(requeststarted) 
from estr_iceberg_latest_warehouse

gives the following result

#	_col0	_col1
1	2024-07-22 11:04:55.347000	2024-07-30 10:57:38.610000

I can still see an endless amount of data on S3 in directories like 'requeststarted_hour=2024-05-31-02/', which is far outside the range of the partition values.

Should I remove the orphaned files from S3 manually?

  • Updated with additional information.

Smotrov
asked 6 months ago · 860 views
1 Answer
2

Can you:

  1. Verify that the VACUUM command is executing successfully after each MERGE operation?
  2. Verify that it's actually deleting old snapshots and associated data files?
  3. Look at the size and nature of the updates you're making every hour. If each update adds a significant amount of data, that could explain the rapid growth of the table.
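Points 1 and 2 can be checked with Athena's Iceberg metadata tables. A minimal sketch, assuming a table named my_table:

```sql
-- Retained snapshots; this list should shrink after a successful VACUUM
SELECT snapshot_id, committed_at, operation
FROM "my_table$snapshots"
ORDER BY committed_at DESC;

-- Data files referenced by the current snapshot
SELECT file_path, record_count
FROM "my_table$files";
```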
EXPERT
answered 6 months ago
  • Thank you very much for your suggestion.

    1. Yep. It is successful every time.
    2. SELECT * FROM "my_table$snapshots" returns a single row with the following content.
    #	committed_at	snapshot_id	parent_id	operation	manifest_list	summary
    1	2024-03-29 13:01:39.498 UTC	8024726289693964937	4447953933496454465	overwrite	s3://my_backet/my_table/metadata/snap-8024726289693964937-1-977f4853-ae88-429a-9189-10fd07a538c3.avro	{added-position-deletes=4, total-equality-deletes=0, trino_query_id=20240329_130135_00187_c22yh, added-position-delete-files=1, added-delete-files=1, total-records=1369786, changed-partition-count=1, total-position-deletes=2023891, added-files-size=1627, total-delete-files=2210, total-files-size=6475502851, total-data-files=308}
    

    SELECT * FROM "my_table$files"; returns 308 rows. Unfortunately I don't know how to check whether any of these files are associated with an old snapshot.

    3. It is not the case. To put it simply, each merge changes the value of an integer column, keeping the total number of rows the same.
  • Could you verify whether versioning is enabled on your bucket? If so, this might be contributing to the increase in size. See the documentation on Versioning workflows.

  • Bucket Versioning is disabled.
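Following up on the exchange above, one way to check whether S3 holds files no longer referenced by the table is to compare the file_path values from the $files metadata table against an S3 listing. A minimal sketch, with all bucket/table names hypothetical; in practice the inputs would come from a "my_table$files" query and an S3 listing (e.g. boto3 list_objects_v2 over the table's data/ prefix):

```python
# Sketch: find S3 keys that the table's current snapshot does not
# reference. All paths below are hypothetical example values.

def find_orphans(s3_keys, referenced_paths):
    """Return S3 keys absent from the set of referenced file paths."""
    # The $files table returns full s3:// URIs, while an S3 listing
    # returns bare keys, so strip the scheme and bucket prefix first.
    referenced = {p.split("/", 3)[-1] for p in referenced_paths}
    return sorted(k for k in s3_keys if k not in referenced)

# Hypothetical example inputs:
referenced = [
    "s3://my_bucket/my_table/data/part-00000.parquet",
]
listed = [
    "my_table/data/part-00000.parquet",   # still referenced
    "my_table/data/part-99999.parquet",   # candidate orphan
]
print(find_orphans(listed, referenced))
# → ['my_table/data/part-99999.parquet']
```

Any key flagged this way should still be treated with care: files may belong to in-flight commits, so deleting them manually is a last resort.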
