Hello - I have a Glue job that reads data from a Glue Data Catalog table and writes it back to S3 in Delta format.
The IAM role of the Glue job has s3:PutObject, list, and describe permissions, along with everything else needed to read from and write to S3.
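For reference, the S3 statement attached to the role looks roughly like this (shown as a Python dict; the bucket name is a placeholder and the action list is an approximation of "all other permissions"):

# Approximate S3 statement on the Glue job's IAM role; the bucket name is a
# placeholder and the exact action list may differ slightly from the real policy.
s3_access_statement = {
    "Effect": "Allow",
    "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket",
        "s3:GetBucketLocation",
    ],
    "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*",
    ],
}

Despite this, I keep running into the following error: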
2022-12-14 13:48:09,274 ERROR [Thread-9] output.FileOutputCommitter (FileOutputCommitter.java:setupJob(360)): Mkdirs failed to create glue-d-xxx-data-catalog-t-<dataset-name>-m-w://<s3-prefix>/_temporary/0
2022-12-14 13:48:13,875 WARN [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(73)): Lost task 5.0 in stage 1.0 (TID 6) (172.34.113.239 executor 2): java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: HG7ST1B44A6G30JC; S3 Extended Request ID: tR1CgoC1RcXZHEEZZ1DuDOvwIAmqC0+flXRd1ccdsY3C8PyjkEpS4wHDaosFoKpRskfH1Del/NA=; Proxy: null)
The path it fails to create is the Hadoop FileOutputCommitter's _temporary staging directory under the target prefix. The error disappears when I open up the bucket with a wildcard principal ("Principal": "*") in the S3 bucket policy, but the job fails again if I restrict the principal to the very role the Glue job runs with.
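Concretely, the two bucket-policy statements I tried look roughly like this (shown as Python dicts; account ID, bucket name, and role name are placeholders, not the real values):

# Variant 1: wildcard principal - the job succeeds with this in the bucket policy.
allow_everyone = {
    "Effect": "Allow",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*",
    ],
}

# Variant 2: principal restricted to the Glue job's role - the job fails with this.
allow_glue_job_role = {
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::<account-id>:role/<glue-job-role>"},
    "Action": "s3:*",
    "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*",
    ],
}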
Now, my question is: does AWS Glue assume a different identity to run the job? The IAM role associated with the job has all the permissions needed to interact with S3, yet the job throws the AccessDenied exception above (failed to create directory) and only succeeds with a wildcard (*) principal in the bucket policy.
Just to add some more context - this error does not happen when I read, process, and persist data to S3 with native Glue constructs (dynamic frames, Spark data frames); a write like the one sketched below succeeds with the same role. The failure only shows up with the Delta format.
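A rough sketch of the kind of native write that works (the S3 path and output format are placeholders for what the job actually uses):

# Native Glue sink write that succeeds with the same IAM role; the path and
# format here are placeholders, not the job's real values.
glueContext.write_dynamic_frame.from_options(
    frame=src_dyf,
    connection_type="s3",
    connection_options={"path": "s3://<bucket-name>/<prefix>/"},
    format="parquet",
)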
Below is the sample code for the failing Delta write:
src_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="<db_name>", table_name="<table_name_glue_catalog>"
)
dset_df = src_dyf.toDF()  # dynamic frame to data frame conversion

# additional_options maps the key "path" to the target S3 prefix
additional_options = {"path": "s3://<bucket-name>/<prefix>/"}

# write the data frame into the S3 prefix in Delta format
glueContext.write_data_frame.from_catalog(
    frame=dset_df,
    database="xxx_data_catalog",
    table_name="<table_name>",
    additional_options=additional_options,
)