I've been trying to test out Iceberg tables with Amazon Redshift Spectrum and have come across a major issue.
Here is my setup:
- I create an iceberg table via spark (emr 7.0) and insert data across multiple partitions
- The table uses the Glue catalog
- I add the glue database to Spectrum
Here is an example table DDL:
CREATE TABLE iceberg_test (
id string,
year string,
last_name string,
first_name string,
age double,
nested struct<nested_field: string>)
PARTITIONED BY (`year`, `last_name`)
LOCATION 's3://data/iceberg/main.db/iceberg_test'
TBLPROPERTIES (
'table_type'='iceberg',
'write_compression'='zstd'
);
With this table, i have two partition columns, year and last_name. I insert data across multiple partitions. For example:
year = 2022, last_name = smith
year = 2021, last_name = bob
Whenever i issue a query against this table via spectrum, Spectrum will do a full table scan rather than read just the partitions specified in the query.
An example query would be
select * from main.iceberg_test where year = 2022 and last_name = 'smith';
I've confirmed the scan behaviour by looking at my S3 access logs for the table.
The issue appears to be caused by the presence of the nested
field in the table. Any table with a nested struct
field will cause Spectrum to ignore the partitions and do a full table scan.
If i remove this field from the table, I can verify that spectrum only accesses the folders for the partitions in the query.
Is there some work around for this or some setting that i'm missing? I don't see any indication on the Redshift Iceberg docs about this significant limitation.