Glue Job Unexpected Behavior: Schema Transformation Skipping Blocks


Hi everyone,

I've been running a Glue job smoothly for the past year, processing a collection of individual data blocks from Location A and transferring them to Location B. Each block shares the same schema and contains similar data.

However, today I encountered an issue where one specific block didn't appear in Location B despite being present in Location A. Upon investigation, I found that the Glue job's ApplyMapping transform skipped this particular block during schema transformation: all other blocks were processed correctly, but this one was left out.

To provide some context: I'm fetching data from a Glue table, converting it into a DynamicFrame, applying the schema transformation with ApplyMapping, consolidating the data, and then writing it to S3. A simplified sketch of the flow is below.
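(This is a simplified sketch, not my exact job: the database, table, and path names are placeholders, and the numeric target type is assumed to be long.)

    from awsglue.context import GlueContext
    from awsglue.transforms import ApplyMapping
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # 1. Fetch the Glue table as a DynamicFrame (placeholder names)
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="my_db", table_name="my_table"
    )

    # 2. Apply the schema transformation; column3 is intentionally not
    #    mapped, so it is dropped from the output
    mapped = ApplyMapping.apply(
        frame=dyf,
        mappings=[
            ("block_number", "string", "block_number", "long"),
            ("column1", "string", "column1", "long"),
            ("column2", "string", "column2", "string"),
        ],
    )

    # 3. Consolidate into a single output file and write to S3 (Location B)
    glue_context.write_dynamic_frame.from_options(
        frame=mapped.repartition(1),
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/location-b/"},
        format="parquet",
    )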

This has never happened in the past year of running this job. How can I ensure the schema transformation is applied consistently to all blocks? Thank you.

EDIT TO ADD: Below are my sample input data, the expected output data, and the output data I actually received.

Sample input data

    block_number:string column1:string column2:string column3:string

Expected sample output data (column3 is dropped because it was not specified in ApplyMapping)

    block_number:number column1:number column2:string

Received sample output data (for the one problematic block/partition)

    block_number:string column1:string column2:string column3:string

Received sample output data (for all other partitions)

    block_number:number column1:number column2:string
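If it helps with diagnosis, the schema difference can be confirmed by filtering the DynamicFrame for the affected block before and after the mapping. This sketch reuses `dyf` and `mapped` from the job outline above, with a placeholder block number:

    # Placeholder block number "42"; `dyf` and `mapped` come from the sketch above
    bad_before = dyf.filter(f=lambda row: row["block_number"] == "42")
    bad_before.printSchema()   # schema Glue reads for the skipped block at Location A
    print(bad_before.count())

    bad_after = mapped.filter(f=lambda row: row["block_number"] == "42")
    bad_after.printSchema()    # shows whether ApplyMapping was actually applied here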

Also, for what it's worth, I was overwriting this particular partition's data (at Location A) multiple times using awswrangler, with the following code:

import awswrangler as wr

# Overwrite only the partitions present in `dataframe`; `block` identifies the block
wr.s3.to_parquet(
    df=dataframe, path=destination_path, compression='snappy',
    mode='overwrite_partitions', partition_cols=["col1", "col2"],
    filename_prefix=f'{block}_', index=False,
    table=table, database=database, dtype=dtype, dataset=True,
)

On further inspection, I noticed that the data for this single partition (at Location A) contains duplicates: there is a single parquet file in the location with the same rows written to it over and over. This should not be happening, because the 'overwrite_partitions' mode is supposed to replace the partition's data on every write.
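A quick way to verify the duplicates (placeholder bucket and partition path, not my real ones):

    import awswrangler as wr

    # List the files in the affected partition and check for duplicated rows
    partition_path = "s3://my-bucket/location-a/col1=x/col2=y/"
    print(wr.s3.list_objects(partition_path))

    df = wr.s3.read_parquet(partition_path)
    print(len(df), df.duplicated().sum())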

  • Not clear what you mean by "block". If you print the DynamicFrame count before and after the mapping, do you see a gap?

  • Can you provide an example input, the result you get, and the expected output?

  • @Gonzalo Herreros Yes, I do see a gap if I print the count before and after. However, if I filter my mapped DynamicFrame by this partition (block), the block's data is there, but the schema mapping has not been applied to it; for all my other partitions (blocks), the mapping is applied.

  • @Arun A K I have added an example input, the result I get, and the expected output.

muthu
asked 2 months ago · 88 views
No Answers
