Valid single-line JSON files in S3 getting "HIVE_CURSOR_ERROR: Row is not a valid JSON Object" in Athena query

0

Hey everyone I am trying to query a set of JSON files in S3 with Athena and I am getting the Hive cursor error for invalid JSON even though the files in question are valid single-line JSON. Is there a line-length limitation for Athena, or must a JSON record end with a newline character even if there is only one record in the file?

It looked like the Glue crawler got the schema right I have a bunch of these files all same structure all same keys. JSON validates correctly in my JSON editor but Athena thinks it is invalid JSON.

This is an example of the contents of one of my JSON files:

{"all_keys": ["frequently_bought_together.products[0].asin", "frequently_bought_together.products[0].image", "frequently_bought_together.products[0].link", "frequently_bought_together.products[0].price.currency", "frequently_bought_together.products[0].price.raw", "frequently_bought_together.products[0].price.symbol", "frequently_bought_together.products[0].price.value", "frequently_bought_together.products[0].title", "frequently_bought_together.products[1].asin", "frequently_bought_together.products[1].image", "frequently_bought_together.products[1].link", "frequently_bought_together.products[1].price.currency", "frequently_bought_together.products[1].price.raw", "frequently_bought_together.products[1].price.symbol", "frequently_bought_together.products[1].price.value", "frequently_bought_together.products[1].title", "frequently_bought_together.products[2].asin", "frequently_bought_together.products[2].image", "frequently_bought_together.products[2].link", "frequently_bought_together.products[2].price.currency", "frequently_bought_together.products[2].price.raw", "frequently_bought_together.products[2].price.symbol", "frequently_bought_together.products[2].price.value", "frequently_bought_together.products[2].title", "frequently_bought_together.total_price.currency", "frequently_bought_together.total_price.raw", "frequently_bought_together.total_price.symbol", "frequently_bought_together.total_price.value", "product.asin", "product.attributes[0].name", "product.attributes[0].value", "product.attributes[1].name", "product.attributes[1].value", "product.attributes[2].name", "product.attributes[2].value", "product.attributes[3].name", "product.attributes[3].value", "product.attributes[4].name", "product.attributes[4].value", "product.bestsellers_rank[0].category", "product.bestsellers_rank[0].link", "product.bestsellers_rank[0].rank", "product.bestsellers_rank[1].category", "product.bestsellers_rank[1].link", "product.bestsellers_rank[1].rank", "product.bestsellers_rank[2].category", "product.bestsellers_rank[2].link", "product.bestsellers_rank[2].rank", "product.bestsellers_rank_flat", "product.brand", "product.buybox_winner.availability.dispatch_days", "product.buybox_winner.availability.raw", "product.buybox_winner.availability.stock_level", "product.buybox_winner.availability.type", "product.buybox_winner.condition.is_new", "product.buybox_winner.fulfillment.is_fulfilled_by_amazon", "product.buybox_winner.fulfillment.is_fulfilled_by_third_party", "product.buybox_winner.fulfillment.is_sold_by_amazon", "product.buybox_winner.fulfillment.is_sold_by_third_party", "product.buybox_winner.fulfillment.standard_delivery.date", "product.buybox_winner.fulfillment.standard_delivery.name", "product.buybox_winner.fulfillment.third_party_seller.id", "product.buybox_winner.fulfillment.third_party_seller.link", "product.buybox_winner.fulfillment.third_party_seller.name", "product.buybox_winner.fulfillment.type", "product.buybox_winner.is_amazon_fresh", "product.buybox_winner.is_prime", "product.buybox_winner.is_prime_exclusive_deal", "product.buybox_winner.price.currency", "product.buybox_winner.price.raw", "product.buybox_winner.price.symbol", "product.buybox_winner.price.value", "product.buybox_winner.shipping.currency", "product.buybox_winner.shipping.raw", "product.buybox_winner.shipping.symbol", "product.buybox_winner.shipping.value", "product.categories[0].category_id", "product.categories[0].link", "product.categories[0].name", "product.categories[1].category_id", "product.categories[1].link", "product.categories[1].name", "product.categories[2].category_id", "product.categories[2].link", "product.categories[2].name", "product.categories[3].category_id", "product.categories[3].link", "product.categories[3].name", "product.categories[4].category_id", "product.categories[4].link", "product.categories[4].name", "product.categories_flat", "product.color", "product.description", "product.dimensions", "product.feature_bullets", "product.feature_bullets_count", "product.feature_bullets_flat", "product.first_available.raw", "product.first_available.utc", "product.has_size_guide", "product.images[0].link", "product.images[0].variant", "product.images[1].link", "product.images[1].variant", "product.images[2].link", "product.images[2].variant", "product.images[3].link", "product.images[3].variant", "product.images[4].link", "product.images[4].variant", "product.images[5].link", "product.images[5].variant", "product.images[6].link", "product.images[6].variant", "product.images[7].link", "product.images[7].variant", "product.images[8].link", "product.images[8].variant", "product.images_count", "product.images_flat", "product.is_bundle", "product.keywords", "product.keywords_list", "product.link", "product.main_image.link", "product.material", "product.rating", "product.rating_breakdown.five_star.count", "product.rating_breakdown.five_star.percentage", "product.rating_breakdown.four_star.count", "product.rating_breakdown.four_star.percentage", "product.rating_breakdown.one_star.count", "product.rating_breakdown.one_star.percentage", "product.rating_breakdown.three_star.count", "product.rating_breakdown.three_star.percentage", "product.rating_breakdown.two_star.count", "product.rating_breakdown.two_star.percentage", "product.ratings_total", "product.search_alias.title", "product.search_alias.value", "product.specifications[0].name", "product.specifications[0].value", "product.specifications[10].name", "product.specifications[10].value", "product.specifications[11].name", "product.specifications[11].value", "product.specifications[12].name", "product.specifications[12].value", "product.specifications[13].name", "product.specifications[13].value", "product.specifications[14].name", "product.specifications[14].value", "product.specifications[15].name", "product.specifications[15].value", "product.specifications[16].name", "product.specifications[16].value", "product.specifications[17].name", "product.specifications[17].value", "product.specifications[1].name", "product.specifications[1].value", "product.specifications[2].name", "product.specifications[2].value", "product.specifications[3].name", "product.specifications[3].value", "product.specifications[4].name", "product.specifications[4].value", "product.specifications[5].name", "product.specifications[5].value", "product.specifications[6].name", "product.specifications[6].value", "product.specifications[7].name", "product.specifications[7].value", "product.specifications[8].name", "product.specifications[8].value", "product.specifications[9].name", "product.specifications[9].value", "product.specifications_flat", "product.sub_title.link", "product.sub_title.text", "product.title", "product.videos_additional[0].closed_captions", "product.videos_additional[0].creator_type", "product.videos_additional[0].duration", "product.videos_additional[0].id", "product.videos_additional[0].parent_asin", "product.videos_additional[0].product_asin", "product.videos_additional[0].profile_image_url", "product.videos_additional[0].profile_link", "product.videos_additional[0].public_name", "product.videos_additional[0].related_products", "product.videos_additional[0].title", "product.videos_additional[0].type", "product.videos_additional[0].vendor_code", "product.videos_additional[0].vendor_name", "product.videos_additional[0].vendor_tracking_id", "product.videos_additional[0].video_image_extension", "product.videos_additional[0].video_image_height", "product.videos_additional[0].video_image_id", "product.videos_additional[0].video_image_url", "product.videos_additional[0].video_image_url_unchanged", "product.videos_additional[0].video_image_width", "product.videos_additional[0].video_mime_type", "product.videos_additional[0].video_previews", "product.videos_additional[0].video_url", "product.weight", "request_info.credits_remaining", "request_info.credits_reset_at", "request_info.credits_used", "request_info.credits_used_this_request", "request_info.success", "request_metadata.amazon_url", "request_metadata.created_at", "request_metadata.processed_at", "request_metadata.total_time_taken", "request_parameters.amazon_domain", "request_parameters.asin", "request_parameters.type", "sponsored_products[0].asin", "sponsored_products[0].image", "sponsored_products[0].is_prime", "sponsored_products[0].link", "sponsored_products[0].price.currency", "sponsored_products[0].price.raw", "sponsored_products[0].price.symbol", "sponsored_products[0].price.value", "sponsored_products[0].rating", "sponsored_products[0].title", "sponsored_products[1].asin", "sponsored_products[1].image", "sponsored_products[1].is_prime", "sponsored_products[1].link", "sponsored_products[1].price.currency", "sponsored_products[1].price.raw", "sponsored_products[1].price.symbol", "sponsored_products[1].price.value", "sponsored_products[1].rating", "sponsored_products[1].title", "sponsored_products[2].asin", "sponsored_products[2].image", "sponsored_products[2].is_prime", "sponsored_products[2].link", "sponsored_products[2].price.currency", "sponsored_products[2].price.raw", "sponsored_products[2].price.symbol", "sponsored_products[2].price.value", "sponsored_products[2].rating", "sponsored_products[2].title", "sponsored_products[3].asin", "sponsored_products[3].image", "sponsored_products[3].is_prime", "sponsored_products[3].link", "sponsored_products[3].price.currency", "sponsored_products[3].price.raw", "sponsored_products[3].price.symbol", "sponsored_products[3].price.value", "sponsored_products[3].rating", "sponsored_products[3].title", "sponsored_products[4].asin", "sponsored_products[4].image", "sponsored_products[4].is_prime", "sponsored_products[4].link", "sponsored_products[4].price.currency", "sponsored_products[4].price.raw", "sponsored_products[4].price.symbol", "sponsored_products[4].price.value", "sponsored_products[4].rating", "sponsored_products[4].title", "sponsored_products[5].asin", "sponsored_products[5].image", "sponsored_products[5].link", "sponsored_products[5].price.currency", "sponsored_products[5].price.raw", "sponsored_products[5].price.symbol", "sponsored_products[5].price.value", "sponsored_products[5].rating", "sponsored_products[5].title", "sponsored_products[6].asin", "sponsored_products[6].image", "sponsored_products[6].is_prime", "sponsored_products[6].link", "sponsored_products[6].price.currency", "sponsored_products[6].price.raw", "sponsored_products[6].price.symbol", "sponsored_products[6].price.value", "sponsored_products[6].rating", "sponsored_products[6].title", "sponsored_products[7].asin", "sponsored_products[7].image", "sponsored_products[7].is_prime", "sponsored_products[7].link", "sponsored_products[7].price.currency", "sponsored_products[7].price.raw", "sponsored_products[7].price.symbol", "sponsored_products[7].price.value", "sponsored_products[7].rating", "sponsored_products[7].title", "sponsored_products[8].asin", "sponsored_products[8].image", "sponsored_products[8].is_prime", "sponsored_products[8].link", "sponsored_products[8].price.currency", "sponsored_products[8].price.raw", "sponsored_products[8].price.symbol", "sponsored_products[8].price.value", "sponsored_products[8].rating", "sponsored_products[8].title", "sponsored_products[9].asin", "sponsored_products[9].image", "sponsored_products[9].is_prime", "sponsored_products[9].link", "sponsored_products[9].price.currency", "sponsored_products[9].price.raw", "sponsored_products[9].price.symbol", "sponsored_products[9].price.value", "sponsored_products[9].rating", "sponsored_products[9].title"], "elpased_time": "14.403965", "missing_keys": "[]", "product_id": "B0C2QPPVM6", "retailer": "Amazon", "scraper": "rainforest", "status_code": 200, "time_of_api_call": "2024-06-01T00:53:51"}

Mike
asked 4 months ago348 views
2 Answers
1
Accepted Answer

TLDR; make sure you set up your Glue data source to be a folder with only the JSON files for your desired table in it, no subfolders with foreign files etc as an Athena query will parse ALL files --recursively-- in your data source folder and will choke on files that do not match the Glue schema!

The problem here turns out to be that the Athena query engine does not respect any "exclude" filters you set in Glue for your crawler, and I thought it did. I had a folder in S3 which contained JSON files with the format shown above and also a couple of config files which were specifically excluded in the data source for the Glue crawler that built my database. Athena, however, is unaware of those exclusions so when you run a query it will recursively query every file in the data source directory (including subfolders) without any regard for exclusions you specified when you created the data source for your Glue crawler.

What this means is that you should put all the JSON files for a given table in their own folder with no subfolders and no other extraneous files and run your Glue crawler on that folder. Then when Athena is running a query it will not find a JSON file that is different from the schema discovered by the Glue crawler and throw the "invalid JSON" error. The error reported here is very misleading because there was no invalid JSON present, but there was JSON that did not match the schema reported by Glue. A "schema mismatch" error would have been much more useful.

I discovered what was going on by using the following pair of queries in the Athena query editor:

  1. CREATE EXTERNAL TABLE IF NOT EXISTS json_validator (jsonrow string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '%' location 's3://my-bucket-name/'
  2. WITH testdataset AS (SELECT "$path" s3path,jsonrow,try(json_parse(jsonrow)) isjson FROM json_validator) SELECT * FROM testdataset WHERE ISJSON IS NULL

When you do this you will immediately see what Athena is tripping over and be able to fix it. I found this technique at https://repost.aws/knowledge-center/error-json-athena

Mike
answered 4 months ago
profile picture
EXPERT
reviewed 4 months ago
profile pictureAWS
EXPERT
reviewed 4 months ago
0

The error might occur if the JSON files lack proper newline delimiters between records, as Athena expects each JSON object to be on a separate line. Ensure each JSON record is separated by a newline character, even if there is only one record per file. Additionally, verify the schema inferred by the Glue crawler to ensure it matches the structure of your JSON files.

Also check this How do I resolve "HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key" when reading files from AWS Config in Athena?

profile picture
EXPERT
answered 4 months ago
  • hey Osvaldo thanks for your answer, this turned out to be more complex than I thought so I will explain in a subsequent answer

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions