Glue job to transform JSON to Parquet, but the first JSON object gets lost!

Hello,

I have an AWS Glue job that reads JSON files from S3, transforms the JSON to Parquet, and saves the result to S3 and a Glue table.

Each JSON file contains an array of JSON objects, e.g. the following file:

[
  {
    "Listen Date": "2023-08-27",
    "Show Title": "Title of Show Travel",
    "Episode Title": "Episode Nordirland",
    "Listener Country": "Germany",
    "Listener OS": "iOS",
    "Listener User Agent": "Podcasts/4022.700.8 CFNetwork/1410.0.3 Darwin/22.6.0",
    "Episode ID": "64e713c4de5a4f0011114111",
    "Show ID": "64905e9ec4784300116c1cb0",
    "Listens": "2"
  },
  {
    "Listen Date": "2023-08-27",
    "Show Title": "Title of Show Story",
    "Episode Title": "Episode Däumelinchen",
    "Listener Country": "Germany",
    "Listener OS": "iOS",
    "Listener User Agent": "Podcasts/4022.700.8 CFNetwork/1410.0.3 Darwin/22.6.0",
    "Episode ID": "64e5d556aae05200110c4d3b",
    "Show ID": "6492d63c895f9d0011789a66",
    "Listens": "1"
  }
]

The data is transformed and saved to the correct S3 folder and Glue table, but the first JSON object (in this case the object with Show ID 64905e9ec4784300116c1cb0) gets lost!

If I have more than 2 objects in the array, all objects get transformed except the first one, which is skipped. If I reorder the objects in the array, it is again the first one that gets lost. So I assume that it's not a problem with the object/data.

My Glue job is a standard visual Glue job with 3 steps:
Step 1: Data source (S3 bucket)
Step 2: Transformation. I figured out that I have to rename my properties and remove the spaces (e.g. "Listen Date" => "listendate"), otherwise the values of all objects get lost.
Step 3: Data target (S3 bucket)
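The renaming in Step 2 can be sketched in Python as follows. This is only an illustration, assuming every column is a string (as in the sample file) and that the target names are simply the lower-cased source names with spaces removed. The helper names `normalize` and `build_mappings` are hypothetical and not part of the job from the question.

```python
# Column names as they appear in the sample JSON file.
SOURCE_COLUMNS = [
    "Listen Date",
    "Show Title",
    "Episode Title",
    "Listener Country",
    "Listener OS",
    "Listener User Agent",
    "Episode ID",
    "Show ID",
    "Listens",
]


def normalize(name: str) -> str:
    """Drop spaces and lower-case, e.g. "Listen Date" -> "listendate"."""
    return name.replace(" ", "").lower()


def build_mappings(columns, col_type="string"):
    """Build (source, source_type, target, target_type) tuples.

    Assumption: all columns are strings, as in the sample file.
    """
    return [(col, col_type, normalize(col), col_type) for col in columns]


mappings = build_mappings(SOURCE_COLUMNS)

for source, _, target, _ in mappings:
    print(f"{source!r} => {target!r}")
```

In a script-based Glue job, a list of this shape is what the `ApplyMapping.apply(frame=..., mappings=...)` transform takes; in the visual editor the same renaming is done in the transformation node.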

Does anyone have an idea how to fix this issue?

Stefan
asked 8 months ago, 368 views
2 Answers
Accepted Answer

Unfortunately, the Multiline option did not help.

The Solution was to set the JsonPath to: $[*]
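As a rough illustration of what the `$[*]` path selects, here is a plain-Python sketch, not the Glue implementation. The sample data is shortened from the question, and the S3 path is a hypothetical placeholder. The `source_kwargs` dictionary shows how the same setting might look in a script-based job, assuming the JSON reader's `jsonPath` format option.

```python
import json

# Shortened version of the file from the question: a top-level array of objects.
raw = """
[
  {"Show ID": "64905e9ec4784300116c1cb0", "Listens": "2"},
  {"Show ID": "6492d63c895f9d0011789a66", "Listens": "1"}
]
"""

# "$" is the document root (here: the array) and "[*]" selects each of its
# elements, so every object in the array becomes one record.
records = json.loads(raw)
for index, record in enumerate(records):
    print(index, record["Show ID"], record["Listens"])

# Assumed script equivalent of the JsonPath field in the visual source node.
# The bucket path is a hypothetical placeholder.
source_kwargs = {
    "connection_type": "s3",
    "connection_options": {"paths": ["s3://example-bucket/input/"]},
    "format": "json",
    "format_options": {"jsonPath": "$[*]"},
}
# Inside a Glue job this would be passed as:
#   glueContext.create_dynamic_frame.from_options(**source_kwargs)
```

With this path, both objects of the sample are expected to appear as rows, including the first one.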

It's still a little bit confusing, because without a JSON path the Infer schema button returns a deep (nested) schema.

I spent a lot of time trying to turn this deep schema into a flat list. Then I realized that at runtime the schema already accounts for the array, so I don't need to handle the deep structure at all; the job just lost the first entry.

With the JSON path set, the array is again handled correctly, and the first entry is no longer lost. In my opinion, without a JSON path the runtime schema should be deep (as inferred). Otherwise it's confusing and inconsistent.

Stefan
answered 8 months ago

Try marking the "Multiline" checkbox in the source, otherwise it sounds like a bug.

AWS
EXPERT
answered 8 months ago
