JSON Table in Glue with array only shows first entry


Running a Glue Crawler on JSON files with this structure:

[ { "SoldTo": "647218", "ContainerId": 1691340, "Container": "Per Åges Badelandcontainer 2", "ContainerReference": "true", "Article": "7431629", "Min": "15", "Max": "99", "Date": "2024-11-29" }, { "SoldTo": "647218", "ContainerId": 1691340, "Container": "Per Åges Badelandcontainer 2", "ContainerReference": "true", "Article": "72251347", "Quantity": "0", "Min": "15", "Max": "99", "Date": "2024-11-29" } ] Gives me a table with an array column with this schema: { "array": [ { "SoldTo": "string", "ContainerId": "int", "Container": "string", "ContainerReference": "string", "Article": "string", "Quantity": "string", "Min": "string", "Max": "string", "Date": "string" } ] }

This is fine and I am able to query the table in Athena. The problem is that I only get the first entry from every JSON file. I have tried using a custom classifier in the Crawler, with JSON path $[*]. That gave me the array's columns directly in the table schema, but it did not solve the issue of only getting the first record in each file. Even when I create the table from scratch with a Crawler, the objectCount table property says 12, which is the number of files, not the number of records in the files.

asked 13 days ago
1 Answer

The issue you're experiencing is related to how AWS Glue processes JSON files containing arrays. By default, Glue may only read the first entry in each JSON array, which explains why you're only seeing the first record from each file.

To resolve this and process all records in your JSON arrays, you can try the following approaches:

  1. Use a JSON SerDe: When creating your table, specify a JSON SerDe (Serializer/Deserializer) that can handle JSON arrays. The OpenX JSON SerDe or the Hive JSON SerDe can be used for this purpose. You'll need to set this in the table's SerDe definition.

  2. Use AWS Glue ETL jobs: Instead of relying solely on the Crawler, you can create an AWS Glue ETL job to read and process your JSON files. In the job, you can use the format_options parameter with the multiLine option set to "true" when reading the JSON data. This allows Glue to properly handle multi-line JSON records.

  3. Modify your JSON structure: If possible, consider converting your JSON to a newline-delimited format, where each JSON object is on a separate line. This format is often easier for Glue and other big data tools to process (see the conversion sketch after this list).

  4. Use AWS Glue DataBrew: AWS Glue DataBrew can be an alternative for preparing your JSON data. It has built-in transformations that can help flatten nested JSON structures.

  5. Custom ETL script: Write a custom ETL script (in PySpark or Scala) that explicitly handles the JSON array structure and flattens it into individual records (see the PySpark sketch after this list).
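
For option 3, here is a minimal sketch of a one-off conversion to newline-delimited JSON; the local file names are placeholders for your own files or S3 workflow:

import json

# Load the whole file; each input file holds one top-level JSON array
with open("input.json", encoding="utf-8") as f:
    records = json.load(f)

# Write one compact JSON object per line (newline-delimited JSON)
with open("output.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")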
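
For option 5, a minimal PySpark sketch under the same assumption of one top-level array per file; the S3 paths are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With multiLine=true, Spark parses each file as a single JSON document,
# and a top-level array becomes one row per array element
df = spark.read.option("multiLine", "true").json("s3://your-bucket/your-path/")

# Spark's JSON writer emits newline-delimited JSON, one record per line,
# which crawlers and Athena then count record by record
df.write.mode("overwrite").json("s3://your-bucket/your-flattened-path/")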

If you decide to use an AWS Glue ETL job, you can use the JSON format options to properly read your data. For example:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the JSON files; "multiLine" lets a single record span multiple lines
datasource0 = glueContext.create_dynamic_frame.from_options(
    format_options={"multiLine": "true"},
    connection_type="s3",
    format="json",
    connection_options={"paths": ["s3://your-bucket/your-path"], "recurse": True},
    transformation_ctx="datasource0",
)

This approach should allow you to process all records in your JSON arrays, not just the first entry in each file.
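
If each file is a single top-level array, you can also add the jsonPath format option (covered in the "Using the JSON format in AWS Glue" page below) so that each array element becomes its own record; as the comment below confirms, this variation worked:

# Variation: unnest each element of a top-level array into its own record
format_options = {"jsonPath": "$[*]", "multiLine": "true"}
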
Sources
Using the JSON format in AWS Glue - AWS Glue
Parsing a string column containing JSON data - AWS Glue

answered 13 days ago
EXPERT reviewed 12 days ago
  • Option 2 worked for me: running a Glue job read the source data correctly, and adding "jsonPath": "$[*]" to the format options allowed me to extract the array into columns in a new table in the same operation.
