Best Glue Catalog table column type to store variable JSON docs

1

I'm creating a table in the AWS Glue Catalog, I need to store an array of JSON objects for a column in that table, via a kinesis data stream lambda function

I tried the following definition for the column, and it works ok:

array<structname:string,id:bigint>

The problem is: the JSON inside the array does not always stick to the schema "structname:string,id:bigint", and it may have more properties in the future, so we don't want to keep updating the schema everytime there is a change to it

Is there a way I can tell the column to just expect JSON, without having to spell out each and every property of it?

I tried: array<struct<>>

but I got errors while parsing the data:

"The schema is invalid. A group type can not be empty. Parquet does not support empty group without leaves. Empty group: array_element",

I also tried the following: array array<> array<struc>

but still get errors: Error parsing the schema: Error: < expected at the end of 'array'

it kind of feels wrong, but should I just save JSON as a string? and worry about deserializing it into JSON later when we query the data?

is there a way to just save it as a schema-less doc?

  • I don't think so. It requires that you define the inner schema of a struct or array.

ramiro
asked 8 months ago1007 views
3 Answers
0

@ramiro Have you managed to find a workaround? I'm also facing the same thing!

Bisha
answered 6 months ago
  • Hello Bisha, unfortunately no. What we end up doing was to store the variable JSON as a string, that way we don't have to validate against any schema, as the struc forced us to do.

    The catch is that you will have to deserialize the json strings at query time, but so far we found there is not a big penalty on the performance of the queries to have that extra convertion layer. That may depend on your data size though.

  • @ramiro Cool! I've also done the same thing, but it is not intuitive to have to remember in your SQL queries when to use json_extract_scalar or dot notation queries

  • I agree, if there was the equivalent of redshift's SUPER as a column type in athena, with all the flexibility it brings, that would be great

0

Hi,

Glue supports JSON natively: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-json-home.html

The native JSON support offers several interesting options: withSchema and jsonPath are very useful for example to save you lots of code to validate and retrieve your data.

Best,

Duduer

profile pictureAWS
EXPERT
answered 8 months ago
0

Thanks for the comment Duduer

Indeed Glue does, but what is the data type to pick for those kind of columns, where I am to store it in the Glue Data Catalog table?

Check the attached file for the options here (when defining Glue Data Catalog column type), there is no "json" option:

Enter image description here

The only one that seem to fit storing JSON is struct<your_json_obj_properties_here>

My problem is that struct force you to strictly define the "your_json_obj_properties_here" part, and for our data, that could be different across the different records we are planning to store.

For example: sometimes that column value could look like this:

{ prop1: "value1" }

Sometimes it may have more than one property:

{ prop1: "value1", moreComplex: { dataStructures: "here" } }

Etc. etc...

In other words I wish there was a way to just say: "this column will be for ANY JSON, don't ask me to define each and every of the properties when I create the column type, because I don't know them beforehand, and they may be different on each row! Just take any valid JSON structure I send instead."

In redshift, that is called a SUPER column type:

https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html

Is there an equivalent in the Glue catalog tables, for column types?

ramiro
answered 8 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions