Only first record from JSON array in s3 is being retrieved from Athena

0

I'm having the same issue. Data is stored in below format in s3 as JSON array with partitions S3 path - s3://fleet-fuelcard-data-import-dev/lambda/fuelsoft-morgan/660306/2024/Apr/03-Apr-2024.json.

{ "_1": [ { "EntityMapping": null, "AccountId": "xx", "CardNumber": "xxxx", "CardType": "xxxxx", "CheckDigit": "5", "CostCentre": "xx Amb Station", "Currency": "EUR", "DriverName": "", "EmployeeId": "", "ExtractedDate": "03-Apr-2024 11:38:28", "Goods": 44.64 }, { "EntityMapping": null, "AccountId": "660306", "CardNumber": "700047660306000053", "CardType": "Morgan", "CheckDigit": "5", "CostCentre": "Thurles Amb Station", "Currency": "EUR", "DriverName": "", "EmployeeId": "", "ExtractedDate": "03-Apr-2024 11:37:52", "Goods": 67.88 } ]

I used glue crawler to create the table in Athena. Below the respective DDL for the same and when I try to access the records using select, only the first record is being selected.

CREATE EXTERNAL TABLE fi_fuelsoft_morgan( array array<structentitymapping:string,accountid:string,cardnumber:string,cardtype:string,checkdigit:string,costcentre:string,currency:string,drivername:string,employeeid:string,extracteddate:string,goods:double,isonumber:string,id:string,invoiceno:int,invoiceddate:string,mileage:int,productdescription:string,producttype:string,receiptnumber:string,registrationnumber:string,shortcardnumber:string,sitedescription:string,sitenumber:string,total:double,transactiondate:string,unitprice:double,vat:double,volume:double,cf_account_name:string> COMMENT 'from deserializer') PARTITIONED BY ( partition_0 string, partition_1 string, partition_2 string) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' WITH SERDEPROPERTIES ( 'paths'='array') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://fleet-fuelcard-data-import-dev/lambda/fuelsoft-morgan/' TBLPROPERTIES ( 'CrawlerSchemaDeserializerVersion'='1.0', 'CrawlerSchemaSerializerVersion'='1.0', 'UPDATED_BY_CRAWLER'='fleet-fuelcard-integration', 'averageRecordSize'='22644', 'classification'='json', 'compressionType'='none', 'objectCount'='1393', 'partition_filtering.enabled'='true', 'recordCount'='1393', 'sizeKey'='31560393', 'typeOfData'='file')

I'm stuck with this , Can anyone help on resolving this issue ? on how to get all the records in JSON array? Do I need to change the structure of array ? to have one single record in one line? as per the document https://github.com/rcongiu/Hive-JSON-Serde The following example will not work.

{ "key" : 10 } { "key" : 20 }

1 Answer
0

Hi,

My first guess would be classifiers:

https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html

Haven't tried it myself but have a look at this:

For the given JSON structure, you would likely want to create a custom JSON classifier in AWS Glue. Here's the configuration you'd use:

  • Classification: json (This is the default for JSON classifiers)
  • JSON Path: $._1[*]

Explanation:

  • $: Represents the root of the JSON document.
  • _1: Refers to the top-level key in your JSON.
  • [*]: The wildcard indicates that the classifier should process all elements within the array under the _1 key.

Why a custom classifier?

  • The built-in JSON classifier might not correctly infer the schema due to the underscore in the top-level key (_1).
  • The custom classifier with the specified JSON path allows you to explicitly target the data you want to extract.

Additional Considerations:

  • Data Types: By default, the crawler will infer the data types for your columns (e.g., AccountId as string, Goods as double). If you want to enforce specific types, you can modify the schema in the AWS Glue Data Catalog after the initial crawl.
  • Nested Structures: If your JSON had more complex nested structures, you would need to adjust the JSON path accordingly.

How to create the custom classifier:

  1. Go to the AWS Glue console.
  2. Navigate to "Classifiers" and click "Add classifier".
  3. Select "JSON" as the classifier type.
  4. Provide a name for your classifier (e.g., "MyJsonClassifier").
  5. Enter the JSON path as described above.
  6. Click "Create classifier".

How to use the classifier in your crawler:

  1. In your crawler configuration, under "Classifiers", add your newly created custom classifier (e.g., "MyJsonClassifier").
  2. Ensure that your custom classifier is listed before the built-in JSON classifier. This way, the crawler will use your custom logic first.
Rene
answered 14 days ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions