I turned on job bookmarks for my AWS Glue job, but the job still reprocesses my data.
Resolution
The following are common reasons why an extract, transform, and load (ETL) job reprocesses data even though you turned on job bookmarks:
- You run multiple concurrent jobs with job bookmarks, and the job's maximum concurrency isn't set to 1.
- The job.init() call is missing or isn't made at the start of the AWS Glue ETL script:
  job.init(args['JOB_NAME'], args)
- The job.commit() call is missing or isn't made at the end of the script:
  job.commit()
- The transformation_ctx parameter is missing or isn't unique for each ETL operator instance:
  datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db_name", table_name = "table_name", transformation_ctx = "datasource0")
- The table's primary keys aren't in sequential order (JDBC connections only).
- The source data was modified after your last job run.
- The job reads data with a Spark DataFrame, but job bookmarks aren't supported for Spark DataFrames.
For more information about these issues, see Error: A job is reprocessing data when job bookmarks are turned on.
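The script-level items in the preceding list (job.init(), job.commit(), and transformation_ctx) can be checked before you run the job. The following is a minimal sketch, not an AWS tool: check_bookmark_script is a hypothetical helper that scans a script's text with simple string and regex heuristics, and SAMPLE_SCRIPT is an illustrative Glue script whose database, table, and Amazon S3 path are placeholders. The helper can't detect concurrency settings, non-sequential JDBC keys, or modified source data.

```python
import re
from collections import Counter

# Illustrative Glue ETL script that follows the checklist.
# db_name, table_name, and the S3 path are placeholders (assumptions).
SAMPLE_SCRIPT = '''
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)  # call at the start of the script

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="db_name", table_name="table_name",
    transformation_ctx="datasource0")

glueContext.write_dynamic_frame.from_options(
    frame=datasource0, connection_type="s3",
    connection_options={"path": "s3://example-bucket/output/"},
    format="parquet", transformation_ctx="datasink1")

job.commit()  # call at the end of the script
'''


def check_bookmark_script(source):
    """Return a list of likely bookmark problems found in a script's text."""
    problems = []

    # job.init() and job.commit() must both be present, in that order.
    init_pos = source.find("job.init(")
    commit_pos = source.rfind("job.commit(")
    if init_pos == -1:
        problems.append("job.init() is never called")
    if commit_pos == -1:
        problems.append("job.commit() is never called")
    if init_pos != -1 and commit_pos != -1 and commit_pos < init_pos:
        problems.append("job.commit() appears before job.init()")

    # Every transformation_ctx string literal should be unique.
    ctx_values = re.findall(
        r"transformation_ctx\s*=\s*[\"']([^\"']*)[\"']", source)
    if not ctx_values:
        problems.append("no transformation_ctx parameter found")
    for name, count in Counter(ctx_values).items():
        if count > 1:
            problems.append(
                f"transformation_ctx '{name}' is used {count} times")

    # Heuristic: reads through spark.read bypass job bookmarks.
    if re.search(r"\bspark\.read\b", source):
        problems.append("spark.read is used; those reads aren't bookmarked")

    return problems


if __name__ == "__main__":
    print(check_bookmark_script(SAMPLE_SCRIPT))
```

For the sample script, the helper returns an empty list. Because it matches text only, it misses a transformation_ctx value that is built from a variable, and it counts a call that appears inside a comment. Treat its output as a hint to review the script, not as proof that bookmarks will work.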
Related information
Tracking processed data using job bookmarks