I want to process JSON files in my AWS Glue ETL job.
Resolution
Convert JSON files to another format
To convert a JSON file to another format, use Visual ETL to create an AWS Glue ETL job. For Data source, choose JSON. For Data target, choose the new file format. Choose S3 as the Node type for the Data source and the Data target.
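A script-based job can perform the same conversion. The following is a minimal sketch, assuming a Glue Spark job in which a GlueContext already exists. The function name convert_json, the S3 paths, and Parquet as the default target format are illustrative assumptions, not values from this article.

```python
def convert_json(glue_context, source_path, target_path, target_format="parquet"):
    """Read JSON from Amazon S3 and write it back to Amazon S3 in another format.

    glue_context is the awsglue.context.GlueContext that the job creates.
    The paths and the default target format are placeholders (assumptions).
    """
    # Read the JSON objects under source_path into a DynamicFrame.
    frame = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [source_path]},
        format="json",
    )
    # Write the same records to target_path in the new format.
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": target_path},
        format=target_format,
    )
    return frame
```

In a job script, you might call convert_json(glue_context, "s3://amzn-s3-demo-bucket/input/", "s3://amzn-s3-demo-bucket/output/"), where the bucket name is a placeholder.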
Create a JSON classifier to read nested JSON data
If your AWS Glue crawler must read nested columns, then create a custom classifier that's defined as a JSON classifier. Then, create a new AWS Glue crawler. Add the custom JSON classifier to your new AWS Glue crawler's list of classifiers.
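If you prefer to create the classifier and crawler programmatically instead of in the console, the steps above can be sketched with the AWS SDK for Python (Boto3). This is a sketch under assumptions: the function name and all of its arguments (names, role, database, S3 path, and the JSON path itself) are hypothetical placeholders that you supply.

```python
def create_json_classifier_and_crawler(glue_client, classifier_name, json_path,
                                       crawler_name, role_arn, database_name,
                                       s3_path):
    """Create a custom JSON classifier, then a crawler that uses it.

    glue_client is a Boto3 Glue client, for example boto3.client("glue").
    Every argument value is supplied by the caller (placeholders here).
    """
    # Define the custom classifier as a JSON classifier.
    glue_client.create_classifier(
        JsonClassifier={"Name": classifier_name, "JsonPath": json_path}
    )
    # Create a new crawler and add the classifier to its list of classifiers.
    glue_client.create_crawler(
        Name=crawler_name,
        Role=role_arn,
        DatabaseName=database_name,
        Targets={"S3Targets": [{"Path": s3_path}]},
        Classifiers=[classifier_name],
    )
    return crawler_name
```

The crawler is only created here. To run it, call start_crawler on the same client with the crawler's name.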
In your AWS Glue ETL job, use the relationalize transform to flatten nested JSON columns into top-level columns. You can also use the jsonPath option in your AWS Glue ETL job configuration's format option values. For code examples, see Example: Read JSON files or folders from Amazon Simple Storage Service (Amazon S3).
Use the unnest transform to convert nested fields into top-level objects.
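The following sketch shows one way that these options might fit together in a job script. It assumes that a GlueContext already exists. The function names, the S3 paths, and the JSON path value are hypothetical and are passed in by the caller.

```python
def read_and_unnest(glue_context, s3_path, json_path=None):
    """Read JSON from Amazon S3, optionally with jsonPath, then unnest it."""
    # Pass jsonPath only when the caller provides one.
    format_options = {"jsonPath": json_path} if json_path else {}
    frame = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [s3_path]},
        format="json",
        format_options=format_options,
    )
    # unnest() promotes nested fields to top-level fields.
    return frame.unnest()


def relationalize_frame(frame, staging_path, root_name="root"):
    """Flatten nested columns and pivot array columns into separate frames.

    staging_path is an S3 location for intermediate output (placeholder).
    The result is a DynamicFrameCollection keyed by table name.
    """
    return frame.relationalize(root_name, staging_path)
```

Because relationalize returns a collection of frames, you select the one that you need by name, for example with the select method and the root name that you passed in.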
Use an AWS Glue crawler to parse JSON arrays
By default, the AWS Glue crawler treats data as a single array. To create a schema that's based on each record in a JSON array, create a JSON custom classifier. For JSON path, enter $[*].
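To see what the $[*] path selects, the following plain-Python sketch may help. It is an illustration only and isn't crawler code. The sample document and the function name are made up for this example.

```python
import json


def records_selected_by_wildcard(document_text):
    """Return the elements that the JSON path $[*] selects from a top-level array."""
    document = json.loads(document_text)
    if not isinstance(document, list):
        raise ValueError("expected a top-level JSON array")
    # $[*] matches every element of the array, so each element is one record.
    return document


# Made-up sample: one array that contains two records.
sample = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b", "tags": ["x"]}]'
records = records_selected_by_wildcard(sample)

# A per-record schema draws on the fields of every element, not on the array.
field_names = sorted({key for record in records for key in record})
```

Here the array is one JSON value, but $[*] yields two records whose combined fields are id, name, and tags.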
When you use an AWS Glue ETL job to read a JSON array, use the explode function in Apache Spark to convert arrays into rows. For more information, see pyspark.sql.functions.explode on the Spark website.
You can also use the to_json function in Spark to convert arrays to strings. For more information, see pyspark.sql.functions.to_json on the Spark website.
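The two Spark functions can be sketched as follows. The helper names and the column names are hypothetical. The Spark imports sit inside the functions so that the block loads even where PySpark isn't installed. The last helper, explode_plain, isn't Spark code: it's a plain-Python mirror of what explode does to each row, included for intuition only.

```python
def explode_array_column(df, array_col, element_col):
    """Return a Spark DataFrame with one row per element of array_col."""
    from pyspark.sql.functions import col, explode  # provided in Glue Spark jobs
    return df.withColumn(element_col, explode(col(array_col)))


def array_column_to_json_string(df, array_col, string_col):
    """Return a Spark DataFrame with array_col serialized to a JSON string."""
    from pyspark.sql.functions import col, to_json  # provided in Glue Spark jobs
    return df.withColumn(string_col, to_json(col(array_col)))


def explode_plain(rows, array_col, element_col):
    """Plain-Python mirror of explode on a list of dictionaries (intuition only).

    Like explode, it emits no row when the array is empty or missing.
    """
    return [
        {**row, element_col: element}
        for row in rows
        for element in (row.get(array_col) or [])
    ]
```

To apply the Spark helpers to a DynamicFrame, convert it with toDF() first. If later steps need a DynamicFrame, convert the result back with DynamicFrame.fromDF.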
Troubleshoot DynamicFrame counts that don't match the number of records in a data source
If your DynamicFrame count doesn't match the number of records in your JSON data source, then the data contains malformed records. Run the following errorsAsDynamicFrame command to locate malformed records in your dataset:
# View the error fields and error data of the first malformed record
error_record = dynamicframe_df.errorsAsDynamicFrame().toDF().head()
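Because head() returns only the first row, you might want to look at several error records at once. The following is a minimal sketch with a hypothetical helper name and a hypothetical default limit. It assumes that dynamic_frame is the DynamicFrame that your job read from the JSON source.

```python
def sample_error_records(dynamic_frame, limit=10):
    """Return the error count and up to `limit` error records as dictionaries."""
    # errorsCount() reports how many error records the DynamicFrame holds.
    error_count = dynamic_frame.errorsCount()
    # Convert the error records to a Spark DataFrame and collect a small sample.
    error_rows = (
        dynamic_frame.errorsAsDynamicFrame().toDF().limit(limit).collect()
    )
    return error_count, [row.asDict(recursive=True) for row in error_rows]
```

Keep the limit small. collect() brings the sampled rows to the driver.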
Set the multiline value to read JSON records that contain multiple lines
If your JSON record spans multiple lines, then set the multiline value in your AWS Glue ETL job configuration's format options to true. By default, the multiline value is set to false.
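In a job script, the multiline value goes in the format options of the read call. The following is a minimal sketch, assuming that a GlueContext already exists. The function name and the S3 path are placeholders.

```python
def read_multiline_json(glue_context, s3_path):
    """Read JSON from Amazon S3 when a single record can span several lines."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [s3_path]},
        format="json",
        # Override the default (false) so that records can cross line breaks.
        format_options={"multiline": True},
    )
```

A pretty-printed JSON file, where each object's fields sit on separate lines, is a typical case that needs this setting.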