Why does my AWS Glue ETL job reprocess data even when job bookmarks are turned on?

I turned on job bookmarks for my AWS Glue job, but the job still reprocesses my data.

Resolution

The following are common reasons why an extract, transform, and load (ETL) job reprocesses data even though you turned on job bookmarks:

  • You run multiple concurrent jobs that use job bookmarks, and the job's max concurrency isn't set to 1.

  • The job.init() call is missing or doesn't run at the start of the AWS Glue ETL script:

    job.init(args['JOB_NAME'], args)

  • The job.commit() call is missing or doesn't run at the end of the script:

    job.commit()

  • The transformation_ctx parameter is missing or isn't unique for each ETL operator instance:

    datasource0 = glueContext.create_dynamic_frame.from_catalog(database="db_name", table_name="table_name", transformation_ctx="datasource0")

  • The table's primary keys aren't in sequential order (JDBC connections only).

  • The source data was modified after your last job run.

  • The job uses a Spark DataFrame, but job bookmarks don't support Spark DataFrames.

For more information about these issues, see Error: A job is reprocessing data when job bookmarks are turned on.
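
The script-level causes in the list above (a missing job.init() or job.commit() call, and a missing or duplicated transformation_ctx) can be checked by scanning the script's source text. The following is a minimal sketch, not an AWS tool. The helper check_bookmark_calls and the sample script are hypothetical, and the sketch assumes that the Job object is stored in a variable named job. The sample is only parsed, never run, so the awsglue libraries aren't needed.

```python
import ast
from collections import Counter

# Sample Glue script that combines the three snippets from the list above.
# It is parsed as text only; it is not executed here.
SAMPLE_SCRIPT = '''
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="db_name",
    table_name="table_name",
    transformation_ctx="datasource0",
)

job.commit()
'''


def check_bookmark_calls(source, job_var="job"):
    """Return a list of bookmark-related problems found in a script's source.

    Hypothetical helper. Assumes the Job object is bound to `job_var`.
    """
    job_methods = set()   # methods called on the job variable
    contexts = []         # literal transformation_ctx values
    problems = []

    for node in ast.walk(ast.parse(source)):
        # Only method-style calls such as job.init(...) are of interest.
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        func = node.func
        if isinstance(func.value, ast.Name) and func.value.id == job_var:
            job_methods.add(func.attr)

        ctx = [kw.value for kw in node.keywords if kw.arg == "transformation_ctx"]
        if ctx:
            if isinstance(ctx[0], ast.Constant) and isinstance(ctx[0].value, str):
                contexts.append(ctx[0].value)
        elif func.attr in ("from_catalog", "from_options"):
            problems.append(f"line {node.lineno}: {func.attr}() has no transformation_ctx")

    for method in ("init", "commit"):
        if method not in job_methods:
            problems.append(f"{job_var}.{method}() is never called")

    for name, count in Counter(contexts).items():
        if count > 1:
            problems.append(f"transformation_ctx {name!r} is used {count} times")

    return problems


print(check_bookmark_calls(SAMPLE_SCRIPT))  # prints []
```

This sketch only confirms that the calls exist and that literal transformation_ctx values are unique. It doesn't verify that job.init() runs first or that job.commit() runs last, and it can't detect the other causes in the list, such as concurrent runs or modified source data.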

Related information

Tracking processed data using job bookmarks

AWS OFFICIAL
Updated a month ago