Athena HIVE_CURSOR_ERROR when running a very large query but not when running a small query over the same data


Hi,

In the Athena console, I am getting a HIVE_CURSOR_ERROR when running the following query:

(Queries, file paths, etc. are anonymized.)

SELECT DISTINCT file_path
FROM (
		SELECT xid,
			"$path" as file_path
		FROM "xxxx"."yyyyyyyyyy"
		WHERE partcountry = 'XY'
			AND partyear in ('2023')
			AND partmonth in ('2')
			AND xid IN ('638odhjdk20dk')
	)
  • Note: the query has to open every single file and check whether 638odhjdk20dk is present in the column xid, since xid is not a partition column. The column xid is of type string.

  • The error reads: HIVE_CURSOR_ERROR: Failed to read Parquet file: s3://xxxxxx/subfolderid=xyz/partyear=2023/partmonth=2/partday=3/part-00704-bd419050-hec9-42df-8093-7342djsl80d5b33.c000.snappy.parquet

  • This has happened multiple times, always with a different Parquet file, but only for country XY, which has a lot of data.

  • It has not happened with other countries, which have much less data (queries there always scan less than 3 TB).

  • For country XY, the error sometimes occurs after the query has scanned 5 TB, sometimes after 10 TB.

  • It only takes about 3 to 7 minutes to reach the error.

  • The workgroup I am using has no data scan limit.

  • I have used this article to try to resolve it: https://repost.aws/knowledge-center/athena-hive-cursor-error

  • I have downloaded the file locally and opened it with pandas in Python (see the sketch after this list). There was no problem, so the file is not corrupted.

  • My hypothesis is that this has to do with Athena running very quickly through a very large number of files and crashing somewhere along the way, perhaps because one worker holds a lock on a file while another worker tries to open it.

  • Support for this hypothesis comes from the following query, which focuses on just one day (partday=3) and one subfolderid (subfolderid=xyz) within partmonth=2; this day and subfolderid are where the "unreadable" file is located (see above).

SELECT DISTINCT file_path
FROM (
		SELECT xid,
			"$path" as file_path
		FROM "xxxx"."yyyyyyyyyy"
		WHERE partcountry = 'XY'
			AND partyear in ('2023')
			AND partmonth in ('2')
			AND partday in ('3')
			AND subfolderid='xyz'
			AND xid IN ('638odhjdk20dk')
	)
  • This query scans only about 306 GB and completes without any problems.
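
Regarding the local pandas check mentioned in the list above, this is roughly what such a validation looks like (a minimal sketch; the local file name is taken from the error message, and reading Parquet with pandas assumes pyarrow or fastparquet is installed):

import pandas as pd

# Local copy of the Parquet file named in the error message (the path is an assumption).
local_path = "part-00704-bd419050-hec9-42df-8093-7342djsl80d5b33.c000.snappy.parquet"

# read_parquet raises an error if the file is truncated or has a corrupt footer;
# a clean read suggests the file itself is intact.
df = pd.read_parquet(local_path)
print(df.shape)
print(df.dtypes)

# Optionally confirm that the value the Athena query filters on is present in xid.
print((df["xid"] == "638odhjdk20dk").any())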

Can anyone help here? We need Athena to be reliable for such large queries as well.

asked 10 months ago · 333 views
2 Answers

Additional info: this involves a lot of files, and the files are also small.

There seems to be an issue with sending too many requests to S3 ("Please reduce your request rate"):

https://stackoverflow.com/questions/59686651/hive-cursor-error-please-reduce-your-request-rate
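
To check whether the "many small files" situation applies, one option is to count the objects and their sizes under the affected prefix with boto3 (a minimal sketch; the bucket and prefix are placeholders based on the anonymized path in the question):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Placeholders based on the anonymized S3 path in the error message.
bucket = "xxxxxx"
prefix = "subfolderid=xyz/partyear=2023/partmonth=2/"

count = 0
total_bytes = 0
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        count += 1
        total_bytes += obj["Size"]

avg_mb = (total_bytes / count / 1e6) if count else 0
print(f"{count} objects, {total_bytes / 1e9:.1f} GB total, {avg_mb:.1f} MB average per object")

A very large number of objects, each far smaller than the roughly 128 MB file size Athena generally recommends, would be consistent with Athena issuing an enormous number of GET requests, which is the throttling scenario described in the linked Stack Overflow thread.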

answered 10 months ago
