AWS Glue DynamicFrame: where to get corrupt records?

Hi, this question is about corrupt or malformed records in Glue ETL. Spark DataFrames have an option to designate a column (_corrupt_record by default) into which the entire raw record is dumped when parsing fails. The documentation says DynamicFrames have an errorsAsDynamicFrame() method that is supposedly built for corrupt-record handling, but it returns only a stack trace and the original record is nowhere to be found. So how do I get the original malformed or corrupt record data to save, view, or output to S3? Does DynamicFrame store the original corrupt record anywhere, and if not, what facility is there to do something with it, such as saving it to a file in S3? The documentation is very lacking.

Ric
asked 4 months ago, 192 views
1 Answer

You're right, the handling of malformed/corrupt records in AWS Glue DynamicFrames is not as transparent or easy to access as in Spark DataFrames.

DynamicFrames are not built on top of Spark DataFrames; they are a separate abstraction in which each record is self-describing, and Glue does not expose an easy way to get at the raw text of records that failed to parse.

Here are some options to deal with corrupt records in Glue:

  1. Use a DynamicFrame filter transformation to filter out the corrupt records into a separate DynamicFrame. You can check for null values or empty strings in required columns to find corrupt records.

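  2. Read the source with Spark's own DataFrame reader (spark.read in PERMISSIVE mode, with a _corrupt_record column in the schema) instead of creating a DynamicFrame first. Calling toDF() on an existing DynamicFrame will not produce a _corrupt_record column, because that column is filled in by Spark's reader at parse time.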

  3. Handle exceptions from transformations using errorsAsDynamicFrame() as you mentioned, but convert the error DynamicFrame to a DataFrame to get the corrupt record details.

  4. Write a custom Glue transform that converts the DynamicFrame to a Spark DataFrame using toDF() and applies your own validation rules there to separate out bad records.

  5. As a last resort, read the Glue job logs to try to find details of corrupt records. But this is messy.
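Option 3 can be sketched roughly as follows. This is a minimal sketch, assuming `dyf` is an existing Glue DynamicFrame; `summarize_errors` and `max_rows` are hypothetical names chosen for illustration. It only uses `errorsCount()`, `errorsAsDynamicFrame()` and `toDF()`. Note that the rows it returns describe the error; they are not expected to contain the raw input text.

```python
def summarize_errors(dyf, max_rows=20):
    """Return (error_count, rows) for the error records of a DynamicFrame.

    `dyf` is assumed to be an awsglue DynamicFrame. Each returned row is a
    plain dict describing one error record, not the raw source line.
    """
    count = dyf.errorsCount()
    if count == 0:
        return 0, []

    # Convert the error frame to a Spark DataFrame so it can be inspected
    # with ordinary DataFrame methods.
    errors_df = dyf.errorsAsDynamicFrame().toDF()

    # Collect only a bounded sample to the driver; error frames can be large.
    rows = [row.asDict(recursive=True) for row in errors_df.limit(max_rows).collect()]
    return count, rows
```

In a job you might call `count, rows = summarize_errors(dyf)` right after creating the DynamicFrame and log `rows` when `count` is greater than zero.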

The best option is generally to filter out corrupt records as a separate DynamicFrame, then write that out to S3 or a reject folder for further processing/debugging.
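If the raw text of each bad record is needed in the reject folder, one way to get it is to do the initial read with Spark's DataFrame reader rather than a DynamicFrame. The following is a sketch under stated assumptions: the input is line-delimited JSON, `spark` is the job's SparkSession (in Glue, `glueContext.spark_session`), and the function name, schema string and S3 paths are placeholders rather than anything from the question.

```python
CORRUPT_COL = "_corrupt_record"  # Spark's default name for the raw-text column


def read_json_keeping_rejects(spark, path, data_schema_ddl):
    """Read JSON so that unparseable lines keep their raw text.

    Returns (good, bad): `good` has the parsed columns, `bad` has a single
    string column holding the original text of each malformed line.
    """
    # The corrupt-record column must be declared in the schema,
    # otherwise Spark has nowhere to put the raw text.
    schema = f"{data_schema_ddl}, {CORRUPT_COL} STRING"
    df = (
        spark.read
        .schema(schema)
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", CORRUPT_COL)
        .json(path)
    )
    # Spark rejects queries on raw JSON/CSV input that reference only the
    # corrupt-record column, so cache the parsed result before splitting.
    df = df.cache()
    good = df.filter(f"{CORRUPT_COL} IS NULL").drop(CORRUPT_COL)
    bad = df.filter(f"{CORRUPT_COL} IS NOT NULL").select(CORRUPT_COL)
    return good, bad


# Hypothetical usage inside a Glue job (paths and schema are placeholders):
# good, bad = read_json_keeping_rejects(spark, "s3://my-bucket/input/", "id INT, name STRING")
# bad.write.mode("append").text("s3://my-bucket/rejects/")
# dyf = DynamicFrame.fromDF(good, glueContext, "good_records")
```

The good rows can then be handed back to Glue with `DynamicFrame.fromDF()` so the rest of the job keeps using DynamicFrame transforms.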

It's an area that could be improved in Glue. But with some work

AWS
Saad
answered 4 months ago
  • Regarding filtering out corrupt records in a DynamicFrame: the problem is that the DynamicFrame has already dropped the corrupt records by the time it is created. The corrupt records are nowhere to be found, and the only trace left is errorsAsDynamicFrame(), a separate nested frame that is of little help in pinpointing a corrupt record, especially when there are many of them. The DynamicFrame record inside errorsAsDynamicFrame() is not the raw record.
