Glue Job Unexpected Behavior: Schema Transformation Skipping Blocks


Hi everyone,

I've been running a Glue job smoothly for the past year, processing a collection of individual data blocks from Location A and transferring them to Location B. Each block shares the same schema and contains similar data.

However, today I encountered an issue where one specific block didn't appear in Location B despite being present in Location A. Upon investigation, I found that the Glue job's ApplyMapping transform skipped this particular block during schema transformation: all other blocks were processed correctly, but this one was left out.

To provide some context: I'm fetching data from a Glue table, converting it into a DynamicFrame, applying the schema transformation with ApplyMapping, consolidating the data, and then writing it to S3. A simplified sketch of the flow is below.
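(This is a simplified sketch, not my exact job: the database, table, and path names are placeholders, and the numeric target type is assumed to be long.)

    from awsglue.context import GlueContext
    from awsglue.transforms import ApplyMapping
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # 1. Fetch the Glue table as a DynamicFrame (placeholder names)
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="my_db", table_name="my_table"
    )

    # 2. Apply the schema transformation; column3 is intentionally not
    #    mapped, so it is dropped from the output
    mapped = ApplyMapping.apply(
        frame=dyf,
        mappings=[
            ("block_number", "string", "block_number", "long"),
            ("column1", "string", "column1", "long"),
            ("column2", "string", "column2", "string"),
        ],
    )

    # 3. Consolidate into a single output file and write to S3 (Location B)
    glue_context.write_dynamic_frame.from_options(
        frame=mapped.repartition(1),
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/location-b/"},
        format="parquet",
    )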

This has never happened in the past year of running this job. How can I ensure the schema transformation is applied consistently to all blocks? Thank you.

EDIT TO ADD: Below are my sample input data, the expected output data, and the output data I actually received.

Sample input data

    block_number:string column1:string column2:string column3:string

Expected sample output data (column3 is dropped because it was not specified in ApplyMapping)

    block_number:number column1:number column2:string

Received sample output data (for the one problematic block/partition)

    block_number:string column1:string column2:string column3:string

Received sample output data (for all other partitions)

    block_number:number column1:number column2:string
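If it helps with diagnosis, the schema difference can be confirmed by filtering the DynamicFrame for the affected block before and after the mapping. This sketch reuses `dyf` and `mapped` from the job outline above, with a placeholder block number:

    # Placeholder block number "42"; `dyf` and `mapped` come from the sketch above
    bad_before = dyf.filter(f=lambda row: row["block_number"] == "42")
    bad_before.printSchema()   # schema Glue reads for the skipped block at Location A
    print(bad_before.count())

    bad_after = mapped.filter(f=lambda row: row["block_number"] == "42")
    bad_after.printSchema()    # shows whether ApplyMapping was actually applied here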

Also, for what it's worth, I was overwriting this particular partition's data (at Location A) multiple times using awswrangler, with the following code:

import awswrangler as wr

# Overwrite only the partitions present in `dataframe`; `block` identifies the block
wr.s3.to_parquet(
    df=dataframe, path=destination_path, compression='snappy',
    mode='overwrite_partitions', partition_cols=["col1", "col2"],
    filename_prefix=f'{block}_', index=False,
    table=table, database=database, dtype=dtype, dataset=True,
)

On further inspection, I noticed that the data for this single partition (at Location A) contains duplicates: there is a single parquet file in the location with the same rows written to it over and over. This should not be happening, because the 'overwrite_partitions' mode is supposed to replace the partition's data on every write.
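A quick way to verify the duplicates (placeholder bucket and partition path, not my real ones):

    import awswrangler as wr

    # List the files in the affected partition and check for duplicated rows
    partition_path = "s3://my-bucket/location-a/col1=x/col2=y/"
    print(wr.s3.list_objects(partition_path))

    df = wr.s3.read_parquet(partition_path)
    print(len(df), df.duplicated().sum())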

  • Not clear what you mean by "block". If you print the DynamicFrame count before and after the mapping, do you see a gap?

  • Can you provide an example input, the result you get, and the expected output?

  • @Gonzalo Herreros Yes, I do see a gap if I print the count before and after. However, if I filter my mapped DynamicFrame by this partition (block), the block's data is there, but the schema mapping has not been applied to it; for all my other partitions (blocks), the mapping is applied.

  • @Arun A K I have added an example input, the result I get, and the expected output.

muthu
asked 2 months ago · 88 views
No Answers
