I am seeing Athena queries over a bucket containing ORC files fail with the error message 'HIVE_CURSOR_ERROR: Failed to read ORC file'. Any query over the entirety of the data in the bucket fails. A specific example query is SELECT * FROM reachcounts_outbound WHERE calculation='a8d9458d-83e2-4e94-b272-3dbcd91296a0', where calculation is set up as a partition in the reachcounts_outbound table (which is backed by the S3 bucket unscoreit-reachcounts-outbound).
I've validated that the file referenced by the error message is a valid ORC file by downloading it and running orc-tools data on it; the contents are what I'd expect. I've also downloaded other ORC files in the bucket and compared them. They all have the same schema, and it matches the schema I've defined for the table.
I've tried deleting the individual file referenced when the error message first appeared. However, the query continues to fail with the same message, naming a different file in the bucket. If I add a LIMIT clause of any number under 1597894 to the query above, though, it succeeds.
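For anyone trying to reproduce this, the LIMIT cutoff can be located by bisection rather than by hand. The following is only a minimal sketch: `query_succeeds` is a hypothetical callable (not part of any Athena API) that would run the query above with the given LIMIT and report whether it succeeded, and the search assumes that every LIMIT below the cutoff succeeds while every LIMIT at or above it fails.

```python
from typing import Callable


def smallest_failing_limit(query_succeeds: Callable[[int], bool], upper: int) -> int:
    """Return the smallest LIMIT in [1, upper] for which the query fails.

    Assumptions (not verified here): success is monotonic in LIMIT, and
    the query is already known to fail at LIMIT == upper.
    """
    lo, hi = 1, upper  # invariant: the smallest failing LIMIT lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if query_succeeds(mid):
            lo = mid + 1  # mid succeeded, so the cutoff is above mid
        else:
            hi = mid      # mid failed, so the cutoff is mid or lower
    return lo


if __name__ == "__main__":
    # Stand-in for a real Athena call: pretend the cutoff is the value
    # reported in this post.
    fake_query = lambda limit: limit < 1597894
    print(smallest_failing_limit(fake_query, 10_000_000))  # prints 1597894
```

Each probe is one full query run, so the number of Athena queries grows only logarithmically with `upper`.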
I've tried running MSCK REPAIR TABLE on the reachcounts_outbound table. This did not change anything.
The query ID of a request that caused a failure is 54480f27-1992-40f7-8240-17cc622f91db.
Thanks!
Update: The ORC files that are rejected all appear to have exactly 10,000 rows, which is the stride size for these files.
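One way to test this observation more systematically is to compare per-file row counts against the set of files named in error messages. The sketch below assumes the row counts have already been collected by some other means (for example, by inspecting each downloaded file); `row_counts`, `rejected`, and the file names in the usage example are hypothetical placeholders, not real objects from the bucket.

```python
from typing import Dict, Iterable, List


def check_stride_hypothesis(
    row_counts: Dict[str, int],
    rejected: Iterable[str],
    stride: int = 10_000,
) -> Dict[str, List[str]]:
    """Compare rejected files against files whose row count equals the stride.

    Returns two sorted lists of counterexamples. If both are empty, the
    sample is consistent with "a file is rejected exactly when it has
    `stride` rows". Rejected files missing from `row_counts` are reported
    as not at the stride.
    """
    rejected_set = set(rejected)
    at_stride = {name for name, rows in row_counts.items() if rows == stride}
    return {
        # Rejected, but the row count is not exactly the stride.
        "rejected_not_at_stride": sorted(rejected_set - at_stride),
        # Exactly the stride, but never named in an error.
        "at_stride_not_rejected": sorted(at_stride - rejected_set),
    }


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    counts = {"part-0.orc": 10_000, "part-1.orc": 7_312, "part-2.orc": 10_000}
    print(check_stride_hypothesis(counts, ["part-0.orc"]))
```

A non-empty `at_stride_not_rejected` list does not by itself refute the observation, since a failing query reports only one file at a time.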