Athena version 3 - Timestamp column causes serialization error for parquet data format

We ran into a serialization error on a timestamp field after switching to Athena engine version 3 with Parquet data files. With Athena engine version 2, the same Parquet data is read without errors. Before opening this issue, we confirmed that our data follows all the recommendations in the Athena engine version 3 guide.

Error: SERIALIZATION_ERROR: Could not serialize column 'timestamp' of type 'timestamp(3)' at position 1:1

Below is the suggested solution to the problem from the documentation:

Precision mismatch in Timestamp columns causes serialization error

Error message: SERIALIZATION_ERROR: Could not serialize column 'COLUMNZ' of type 'timestamp(3)' at position X:Y

Cause: Athena engine version 3 checks to make sure that the precision of timestamps in the data is the same as the precision specified for the column data type in the table specification. Currently, this precision is always 3. If the data has a precision greater than this, queries fail with the error noted.

Suggested solution: Check your data to make sure that your timestamps have millisecond precision.

In our parquet data, we use a timestamp field with Unix time format, which has millisecond precision: "timestamp": 1688479202968
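
As a rough sanity check of the claim above, the following Python sketch guesses the unit of a Unix epoch integer from its digit count and converts a millisecond value to a UTC datetime. The helper names `guess_epoch_unit` and `millis_to_utc` are hypothetical, and the heuristic assumes the value represents a date between roughly 2001 and 2286. It only inspects the integer itself; it does not read the unit recorded in the Parquet column's timestamp logical type (MILLIS, MICROS, or NANOS), which may be what the engine's precision check actually looks at.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def guess_epoch_unit(value: int) -> str:
    """Hypothetical helper: guess the unit of a Unix epoch integer.

    Assumes the value represents a date between roughly 2001 and 2286,
    where seconds have 10 digits, milliseconds 13, microseconds 16,
    and nanoseconds 19.
    """
    digits = len(str(abs(value)))
    if digits <= 10:
        return "seconds"
    if digits <= 13:
        return "milliseconds"
    if digits <= 16:
        return "microseconds"
    return "nanoseconds"


def millis_to_utc(value_ms: int) -> datetime:
    """Convert epoch milliseconds to an aware UTC datetime.

    Adding a timedelta keeps the arithmetic in integers, so the
    millisecond part is not affected by float rounding.
    """
    return EPOCH + timedelta(milliseconds=value_ms)


sample = 1688479202968  # the "timestamp" value from the sample record
print(guess_epoch_unit(sample))  # milliseconds
print(millis_to_utc(sample).isoformat(timespec="milliseconds"))
# 2023-07-04T14:00:02.968+00:00
```

If the value passes this check but the query still fails, the unit declared in the file's column metadata would be the next thing to inspect.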

Example of our parquet data and DDL statement for creating Athena table:

{
  "id": "23020733",
  "timestamp": 1688479202968,
  "receiveTimestamp": 1688479203118
   ...
}

CREATE EXTERNAL TABLE `sample_table`(
  `id` string, 
  `timestamp` timestamp, 
  `receivetimestamp` timestamp 
   )
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://...'

It is critical for us that Athena queries keep working without changes to the existing queries or the Parquet files.

For now, we have switched Athena to engine version 2 for all our services and are still using it.

How should this field be declared in the Athena table? If this is a bug on the AWS side, how can I report it to them?

Asked 9 months ago · 891 views
1 Answer
This appears to be expected behavior. Please refer to the Athena Version History and search for "Precision"; you will see all the changes related to timestamp precision, which should give you an idea of how the behavior differs between version 2 and version 3.

However, if you still think this is a bug, the first point of contact is AWS Support. I suggest opening a case under the "Technical" category that describes the situation; they will help and provide additional context and guidance. Note that you can only open such a case if your support plan allows it; the Basic Support plan does not include that ability.

Hope this helps.

Comment here if you have additional questions, happy to help.

Abhishek

AWS
Expert
Answered 9 months ago
  • Thanks Abhishek! The thing is, the release notes say that this was fixed and that Athena 3 is at parity with Athena 2, but it still doesn't work for us. We also tested a Parquet file written with a different set of encodings, and it works:

    -- correct
    "encodings" : [ "RLE", "PLAIN", "BIT_PACKED" ],
    
    -- incorrect
    "encodings" : [ "RLE", "PLAIN" ], 
    

    But the problem cannot be in the Parquet files, as Athena 2 processes them without issues. I have already opened a support case in the AWS portal and hope they will answer, but since we are not entitled to technical support, I don't know whether this will be resolved.
