Multi-frame ZSTD data in Athena

I've been trying to read some ZSTD data in Athena. The files are multi-frame, which the ZSTD format fully supports. In fact, a multi-frame file is simply what you get by concatenating individual ZSTD files. The same approach works like a dream with GZ files, and Athena reads those without a hitch.
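
To make the concatenation property concrete, here is a minimal Python sketch using only the standard-library `gzip` module. The payload bytes are made up for illustration; the point is only that two independently compressed members, joined byte for byte, still decompress as one stream.

```python
import gzip

# Two independently compressed gzip members (stand-ins for separate .gz files).
part_a = gzip.compress(b"row1\nrow2\n")
part_b = gzip.compress(b"row3\n")

# Concatenating the raw bytes yields a multi-member gzip stream.
combined = part_a + part_b

# gzip.decompress reads every member and returns the joined payload.
restored = gzip.decompress(combined)
print(restored)
```

A multi-frame ZSTD file is built the same way, with frames in place of gzip members.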

However, when I try the same with concatenated ZSTD files, I keep running into an error message that says, "GENERIC_INTERNAL_ERROR: Unknown frame descriptor."

These files are perfectly readable when I use a command-line tool.

smotrov@MacBook-Pro test_data % zstd -dvl default_US_YYY_bZfVhjMmdecpQew2wey8W.zst
*** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***
default_US_YYY_bZfVhjMmdecpQew2wey8W.zst 
# Zstandard Frames: 20
DictID: 0
Window Size: 4.00 MiB (4194304 B)
Compressed Size: 43.8 MiB (45957238 B)
Check: None
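
For readers who want to see what "multi-frame" means at the byte level, below is a rough Python sketch that counts frames by walking the frame and block headers, following my reading of the Zstandard frame format. The names `count_zstd_frames` and `raw_frame` are hypothetical helpers, not part of any library, and the sample frames are hand-built with a single uncompressed block each. It does not decompress anything, does only basic truncation checks, and says nothing about how Athena itself parses the file.

```python
import struct

ZSTD_MAGIC = 0xFD2FB528        # magic number of a regular Zstandard frame
SKIPPABLE_MAGIC = 0x184D2A50   # 0x184D2A50..0x184D2A5F mark skippable frames


def count_zstd_frames(data: bytes) -> int:
    """Count regular Zstandard frames by walking frame and block headers."""
    pos, frames = 0, 0
    while pos < len(data):
        if len(data) - pos < 4:
            raise ValueError("truncated magic number")
        (magic,) = struct.unpack_from("<I", data, pos)
        pos += 4
        if (magic & 0xFFFFFFF0) == SKIPPABLE_MAGIC:
            # Skippable frame: 4-byte size, then that many bytes of user data.
            if len(data) - pos < 4:
                raise ValueError("truncated skippable frame")
            (size,) = struct.unpack_from("<I", data, pos)
            pos += 4 + size
            continue
        if magic != ZSTD_MAGIC:
            raise ValueError(f"unknown magic {magic:#010x} at offset {pos - 4}")
        if pos >= len(data):
            raise ValueError("truncated frame header")
        descriptor = data[pos]
        pos += 1
        fcs_flag = descriptor >> 6
        single_segment = (descriptor >> 5) & 1
        has_checksum = (descriptor >> 2) & 1
        dict_id_flag = descriptor & 3
        if not single_segment:
            pos += 1                                 # Window_Descriptor
        pos += (0, 1, 2, 4)[dict_id_flag]            # Dictionary_ID
        fcs_size = (0, 2, 4, 8)[fcs_flag]
        if fcs_flag == 0 and single_segment:
            fcs_size = 1
        pos += fcs_size                              # Frame_Content_Size
        while True:
            if len(data) - pos < 3:
                raise ValueError("truncated block header")
            header = int.from_bytes(data[pos:pos + 3], "little")
            pos += 3
            last_block = header & 1
            block_type = (header >> 1) & 3
            block_size = header >> 3
            if block_type == 3:
                raise ValueError("reserved block type")
            # An RLE block stores one byte; raw/compressed store block_size bytes.
            pos += 1 if block_type == 1 else block_size
            if last_block:
                break
        if has_checksum:
            pos += 4
        if pos > len(data):
            raise ValueError("truncated frame")
        frames += 1
    return frames


def raw_frame(payload: bytes) -> bytes:
    """Hand-build a minimal frame with one raw block (demo data only)."""
    assert len(payload) < 256
    # Descriptor 0x20: Single_Segment_flag set, 1-byte Frame_Content_Size.
    header = struct.pack("<IBB", ZSTD_MAGIC, 0x20, len(payload))
    block_header = ((len(payload) << 3) | 1).to_bytes(3, "little")
    return header + block_header + payload


multi = raw_frame(b"row1\n") + raw_frame(b"row2\n")
print(count_zstd_frames(multi))
```

On a real file you would pass the file's bytes instead of `multi`; the count should correspond to the "Zstandard Frames" line in the CLI output above, assuming the file contains no skippable frames.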

Has anyone encountered this issue before, and if so, any ideas on how to fix it?

Smotrov
Asked a year ago · 53 views
1 Answer

This is a well-documented bug. You can work around it by storing the ZSTD-compressed data in a format that Athena supports natively, such as Parquet or ORC; Athena can read ZSTD-compressed data in these formats.

When creating the table in Athena, specify the ZSTD compression in the table properties.

Example Athena CREATE TABLE statement for a Parquet table with ZSTD compression:

CREATE EXTERNAL TABLE my_table (
  col1 INT,
  col2 STRING
)
STORED AS PARQUET
LOCATION 's3://my-bucket/my-data/'
TBLPROPERTIES (
  'parquet.compression' = 'ZSTD',
  'compression_level' = '5'
);
Expert
Answered 19 hours ago
  • Thank you for your comment. I do understand that Athena supports ZSTD; in fact, I use it in production. My question, as the title says, is about multi-frame ZSTD. Unfortunately, when your data files are multi-frame ZSTD, this workaround does not help at all.
