Glue Spark job failing when writing to the bucket


I have a Glue job that performs a column mapping (the subject of a different question!). The job fails at the final stage, where it is time to persist the results back to Parquet:

gluecontext.write_dynamic_frame.from_options(
    frame=person_frame,
    connection_type='s3',
    connection_options={
        "path": f"s3://bucket/message_type",
        "partitionKeys": ["yyyy", "mm", "dd"]
    },
    format='parquet')
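For context, here is a minimal sketch of how such a job is typically wired up; everything beyond the write call above is standard Glue PySpark boilerplate, and the catalog database/table names are placeholders, not my actual script:

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    sc = SparkContext()
    gluecontext = GlueContext(sc)
    job = Job(gluecontext)
    job.init(args['JOB_NAME'], args)

    # read the source table created by the crawler (names are hypothetical)
    person_frame = gluecontext.create_dynamic_frame.from_catalog(
        database="my_db",
        table_name="person")

    # (column mapping applied here - omitted)

    # write the results back as partitioned parquet
    gluecontext.write_dynamic_frame.from_options(
        frame=person_frame,
        connection_type='s3',
        connection_options={"path": "s3://bucket/message_type",
                            "partitionKeys": ["yyyy", "mm", "dd"]},
        format='parquet')
    job.commit()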

The Glue job fails at this stage. The job is given a role, and that role contains a set of specific Glue permissions (as per AWSGlueServiceRole) and specific S3 permissions (Get/Put/DeleteObject on s3://bucket). Additionally, because various articles made the point that if the content is written encrypted I will need kms:Encrypt, kms:Decrypt and kms:GenerateDataKey on the specific aws/s3 key, those permissions were temporarily added, more for purposes of elimination: while I am 99% sure the operation doesn't write the objects encrypted, I couldn't prove that it didn't.
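As an aside, one way to settle the encryption question is to inspect the metadata of an object the job has already written: head_object reports ServerSideEncryption (and SSEKMSKeyId if a KMS key was used). A minimal sketch, with a made-up bucket and key:

    import boto3

    s3 = boto3.client("s3")
    # bucket and key are placeholders for an object previously written by the job
    resp = s3.head_object(
        Bucket="bucket",
        Key="message_type/yyyy=2023/mm=09/dd=28/part-00000.parquet")
    print(resp.get("ServerSideEncryption"))  # e.g. None, 'AES256' or 'aws:kms'
    print(resp.get("SSEKMSKeyId"))           # populated only for SSE-KMS objects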

However, the job fails with the following messages, indicating it is the last part of the job - writing the results back - that fails:

"An error occurred while calling pyWriteDynamicFrame." and "com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied;"

I don't understand why this is happening; the write permission (s3:PutObject) is in the same policy as the read permission (s3:GetObject), so if that policy weren't being applied, or were set up incorrectly, the first part of the job, where it actually reads the data, should fail too. Since the read passes without any problem, that (at least to me) demonstrates the policy is active and the role given to the Glue job is correct. If that is the case, and Put/DeleteObject are part of the same policy, how can the job fail for not having permission to write to that S3 bucket?

If anyone could help me with what I am missing it would be hugely appreciated.

For brevity, the role has essentially 2 policies:

GlueEssentials: [
    Glue:/*,
    s3:ListObjects,
    ...
]  // basically the same Glue requirements as needed when creating a Glue crawler -> AWSGlueServiceRole

BucketSpecifics: [
    s3:PutObject,
    s3:GetObject,
    s3:DeleteObject
]  for: "s3://bucket*"

  • have you tried with the "s3://bucket/*" resource? I have always done it like that; I'm not sure the wildcard attached directly to the bucket name covers every prefix
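For reference, object-level actions (GetObject/PutObject/DeleteObject) are scoped to arn:aws:s3:::bucket/*, while bucket-level actions such as s3:ListBucket are scoped to the bare bucket ARN. A minimal sketch of such a policy document (the bucket name is a placeholder, and creating it via boto3 is just one way to apply it):

    import json
    import boto3

    # object actions need the /* resource; bucket-level actions need the bucket ARN itself
    policy_doc = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": "arn:aws:s3:::bucket/*"
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": "arn:aws:s3:::bucket"
            }
        ]
    }

    iam = boto3.client("iam")
    iam.create_policy(PolicyName="BucketSpecifics",
                      PolicyDocument=json.dumps(policy_doc))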

3 Answers

Hello.

Could you please attach the entire policy you are using?
Also, does the job succeed if you temporarily attach S3FullAccess?
If it succeeds with S3FullAccess, the problem is with the S3 access privileges in your current policy.
The following documents may be helpful.
https://repost.aws/knowledge-center/glue-403-access-denied-error

EXPERT
answered 7 months ago

OK - so I tried adding S3FullAccess to the list of policies and I get the same behaviour: an error while writing rows, due to an AWS S3 403 error. That suggests the piece performing the writing is NOT the Glue worker, which is supposed to run with the configured service role.

So if the piece that is writing the rows is NOT running under the configured IAM role, it begs the question: what is writing the rows, if not a worker executing with the configured role? :/ Given that S3FullAccess isn't working, me attaching the full set of policies is unlikely to help much. My mistake was in thinking the writing of the dynamic frame would be executed under the role given to the job.

Any suggestions as to which user / role / set of permissions it is actually executing as? I will try to butcher my script with a set of boto3 commands which log the user / role(s) etc., but I'm pretty stumped if it is not running as the configured role as per "IAM Role" in the "Job Details" tab :/
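A minimal sketch of the kind of check I mean - a plain STS call dropped into the job script just before the write, so no extra assumptions about the job:

    import boto3

    # quick check of which principal the Glue script is actually running as
    sts = boto3.client("sts")
    identity = sts.get_caller_identity()
    print("Account:", identity["Account"])
    print("Caller ARN:", identity["Arn"])  # should be the assumed-role ARN of the job's IAM role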

pete
answered 7 months ago

@Riku_Kobayashi

Great idea re: S3 logs - by all accounts the process seems to have problems with the partitions that have been set up:

ed972da9ad376f018cd13ea47ea3527296a57d1a6e03d455c86cedc1ec558fa5 cognius-messaging-staging [03/Oct/2023:14:38:18 +0000] 18.212.199.80 arn:aws:sts::438298068074:assumed-role/AWSGlueServiceRole-messaging/GlueJobRunnerSession KV8R86NG6X4AKA6Y REST.HEAD.OBJECT person/year%253D2023/month%253D09/day%253D28 "HEAD /person/year%3D2023/month%3D09/day%3D28 HTTP/1.1" 404 NoSuchKey 303 - 8 - "-" "ElasticMapReduce/1.0.0 emrfs/s3n user:spark,groups:[root], aws-internal/3 aws-sdk-java/1.12.331 Linux/4.14.238-125.422.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.382-b05 java/1.8.0_382 scala/2.12.15 groovy/2.4.4 vendor/Amazon.com_Inc. cfg/retry-mode/standard" - eHncuUZDJ4BBK49tqG4HMOPWiYJ8SpaFrQby3lWsY1+Y2EOtCweDooAqtdJyzOdpIKaQkgI+m14= SigV4 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader cognius-messaging-staging.s3.amazonaws.com TLSv1.2 - -

ed972da9ad376f018cd13ea47ea3527296a57d1a6e03d455c86cedc1ec558fa5 cognius-messaging-staging [03/Oct/2023:14:27:19 +0000] 34.204.78.144 arn:aws:sts::438298068074:assumed-role/AWSGlueServiceRole-messaging/GlueJobRunnerSession VTATVJ3V07SHEVT2 REST.HEAD.OBJECT person/year%253D2023/month%253D09/day%253D28_%2524folder%2524 "HEAD /person/year%3D2023/month%3D09/day%3D28_%24folder%24 HTTP/1.1" 404 NoSuchKey 312 - 8 - "-" "ElasticMapReduce/1.0.0 emrfs/s3n user:spark,groups:[root], aws-internal/3 aws-sdk-java/1.12.331 Linux/4.14.238-125.422.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.382-b05 java/1.8.0_382 scala/2.12.15 groovy/2.4.4 vendor/Amazon.com_Inc. cfg/retry-mode/standard" - G06qLT2FisbJTTIF+7iE640eJ4KnT9g+kxHbhkjzPZR+ZSBcejzfNXONO22T2Xgba8CQ8p5GjTQ= SigV4 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader cognius-messaging-staging.s3.amazonaws.com TLSv1.2 - -

For further explanation - we use a Glue crawler to automatically create our tables, columns etc. The downside to this process is that it automatically names the partitions "Partition 0", "Partition 1" etc. when using a yyyy/mm/dd structure such as:

"person/2023/10/04" - so to get around that, we name the directories "person/year=2023/month=10/day=04", which ensures the partitions are named correctly when the table and partitions are created. If the URL is fully unescaped, the path does exist - there are partitions for person/2023/09/28. So if it's coming up with 404, I wonder whether the engine doesn't know how to work with URL-escaped paths.

Although irritatingly - from the glue logs:

Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: W2S801ETG7WGY7SX; S3 Extended Request ID: ZDuhL26URJL9Lu+lcj+Pm1NrYOSAUpjTpgozDcHHUApnbJF64fWfsI6AnmfIhkQUlmOtpxgqvjtqaZaNqPwTmFikiAY/nlLkb4bjNykaGxM=; Proxy: null)

ls . | xargs grep W2S801ETG7WGY7SX
(none)

So it seems these 403 occasions are not being written to the access logs - presumably it's more likely that the 403 relates to a bucket other than the target (logged) bucket than that it simply isn't being logged?

pete
answered 7 months ago
