Athena HIVE_CURSOR_ERROR when running very large query but not when running a small query over same data

0

Hi,

In Athena, in the Console, I am getting a HIVE_CURSOR_ERROR when running:

(Queries, file paths etc. are anonymized)

SELECT DISTINCT file_path
FROM (
		SELECT xid,
			"$path" as file_path
		FROM "xxxx"."yyyyyyyyyy"
		WHERE partcountry = 'XY'
			AND partyear in ('2023')
			AND partmonth in ('2')
			AND xid IN ('638odhjdk20dk')
	)
  • Note: it has to open every single file and check the column xid if 638odhjdk20dk is in it. The column xid is of type String.

  • The error reads: HIVE_CURSOR_ERROR: Failed to read Parquet file: s3://xxxxxx/subfolderid=xyz/partyear=2023/partmonth=2/partday=3/part-00704-bd419050-hec9-42df-8093-7342djsl80d5b33.c000.snappy.parquet

  • This happened multiple times and always with different parquet files. But only for this country XY which has a lot of data.

  • With other countries it did not happen. Those have much less data (always scanning less than 3TB for other countries).

  • In country XY, it happens sometimes after it has scanned 5TB, sometimes after 10TB.

  • It only takes about 3 to 7 minutes to get to the error.

  • The workgroup I am using has no data scan limit.

  • I have used this article: https://repost.aws/knowledge-center/athena-hive-cursor-error to resolve it;

  • I have downloaded the file locally and opened it with pandas in python. There was no problem. So the file is not corrupted.

  • My hypothesis is that this has to do with Athena running very fast through many many many files and somewhere there is crash, perhaps because there is a lock on a file from one worker while another wants to open it.

  • Support for this hypothesis comes from this query, that focuses just on one day (partday=3), and one subfolderid (subfolderid=xyz) within partmonth=2; This is the day and subfolderid are where the "unreadable" file is located (see above).

SELECT DISTINCT file_path
FROM (
		SELECT xid,
			"$path" as file_path
		FROM "xxxx"."yyyyyyyyyy"
		WHERE partcountry = 'XY'
			AND partyear in ('2023')
			AND partmonth in ('2')
			AND partday in ('3')
			AND subfolderid='xyz'
			AND xid IN ('638odhjdk20dk')
	)
  • This query just runs through about 306GB and completes without any problems.

Can anyone help here? We need Athena to be reliable with also such large queries.

demandé il y a 10 mois342 vues
2 réponses
0
répondu il y a 10 mois
0

additional info: this is a lot of files and also small files

there seems to be an issue with running too many requests against S3:

https://stackoverflow.com/questions/59686651/hive-cursor-error-please-reduce-your-request-rate

répondu il y a 10 mois

Vous n'êtes pas connecté. Se connecter pour publier une réponse.

Une bonne réponse répond clairement à la question, contient des commentaires constructifs et encourage le développement professionnel de la personne qui pose la question.

Instructions pour répondre aux questions