I've been trying to read some zstd data in Athena. The files are multi-frame, which is a feature that ZSTD format fully supports. In fact, a multi-frame file is simply a file created by concatenating individual ZSTD files. This approach works like a dream with GZ files, and Athena reads them without any hitches.
However, when I try the same with concatenated ZSTD files, I keep running into an error message that says, "GENERIC_INTERNAL_ERROR: Unknown frame descriptor."
These files are perfectly readable when I use a command-line tool.
smotrov@MacBook-Pro test_data % zstd -dvl default_US_YYY_bZfVhjMmdecpQew2wey8W.zst
*** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***
default_US_YYY_bZfVhjMmdecpQew2wey8W.zst
# Zstandard Frames: 20
DictID: 0
Window Size: 4.00 MiB (4194304 B)
Compressed Size: 43.8 MiB (45957238 B)
Check: None
Has anyone encountered this issue before, and if so, any ideas on how to fix it?
It turned out that there were such known bug in Hadoop. Fix Version/s: 3.4.0, 3.2.3, 3.3.2 https://issues.apache.org/jira/browse/HDFS-14099
It is interesting when this fix will be implemented in Athena? And on which level.