Read data from a Redshift table and write to another Redshift table using AWS Glue PySpark


I am trying to read data from Redshift schema_a and write its output into another Redshift table in schema_b. Below is the code I am using to read from Redshift:

# Standard Glue job setup (assumed; not shown in the original snippet)
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

tmp_dir = "s3://bjs-digital-dods-data-lake-processed/temp/rt_fact_payment/"
sample_query = "select * from abc.table_a where condition_a > 10"

# Read from Redshift through the Glue connection; Redshift unloads the result to tmp_dir first
RedshiftCluster_node3 = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "sampleQuery": sample_query,
        "redshiftTmpDir": tmp_dir,
        "useConnectionProperties": "true",
        "connectionName": "dodsqa_redshift",
    },
    transformation_ctx="RedshiftCluster_node3",
)
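For context, the write step I plan to run after this is roughly the sketch below; schema_b.table_b and the target transformation_ctx are placeholders:

# Sketch of the planned write step (schema_b.table_b is a placeholder target table)
glueContext.write_dynamic_frame.from_options(
    frame=RedshiftCluster_node3,
    connection_type="redshift",
    connection_options={
        "redshiftTmpDir": tmp_dir,
        "useConnectionProperties": "true",
        "dbtable": "schema_b.table_b",
        "connectionName": "dodsqa_redshift",
    },
    transformation_ctx="RedshiftTarget_node4",
)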

When I run the above code, I get the error below:

caused by: com.amazon.redshift.util.RedshiftException: ERROR: S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid RERN2PVXA8V6RXJS,ExtRid fsd/1fDHyRPYoXYEc1I3o+LqUkOL6vaQm5gkqnVev6BFadXHqMP3aTLdtv7OB8nWLrOl8FouziY=,CanRetry 1 Detail: error: S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid RERN2PVXA8V6RXJS,ExtRid fsd/1fDHyRPYoXYEc1I3o+LqUkOL6vaQm5gkqnVev6BFadXHqMP3aTLdtv7OB8nWLrOl8FouziY=,CanRetry 1 code: 8001 context: Failed to initialize S3 output stream. S3 path: s3://bjs-digital-dods-data-lake-processed/temp/rt_fact_payment/62131958-fd04-4459-96a2-f15cdfe7c84a/0001_part_00.parquet query: 2496495 location: s3_parquet_unloader.cpp:419 process: query0_87_2496495 [pid=31904]

What am I missing here? Can I write more than one query in "sampleQuery"? Any help would be much appreciated, as this is urgent.

Joe
Asked 8 months ago · 686 views
3 Answers

The Glue role needs permission to read (GetObject) from the temporary dir; it sounds like you only have list permission.
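For illustration, a policy document along these lines would grant that read access; the bucket and prefix are taken from the temp dir in the question, and everything else is illustrative:

# Sketch of an IAM policy document granting read access to the Redshift temp dir
temp_dir_read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::bjs-digital-dods-data-lake-processed/temp/rt_fact_payment/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::bjs-digital-dods-data-lake-processed",
        },
    ],
}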

AWS Expert · Answered 8 months ago
  • But the Glue job has the S3 policy attached to it, and that policy includes READ, WRITE, and LIST.


Hi,

According to the error message, it looks like the Glue role doesn't have the right permissions to read from the temporary directory.

To fix this, you might have to update the permissions for the Glue role and give it access to GetObject in the temporary directory. You can do this by heading to the IAM console, looking for the Glue role, and giving it the required permissions.

Basically, you just need to add the "GetObject" action to the role. After you update the permissions, give your query another shot. If you're still having problems, just give me a heads up and I'll be glad to assist you further.
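If you'd rather script it than use the console, a sketch like the following attaches an inline policy with that action; the role name and policy name are assumptions, so use the role your Glue job actually runs as:

import json
import boto3

iam = boto3.client("iam")

# "MyGlueJobRole" and "AllowRedshiftTempDirRead" are assumed names
iam.put_role_policy(
    RoleName="MyGlueJobRole",
    PolicyName="AllowRedshiftTempDirRead",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::bjs-digital-dods-data-lake-processed/temp/rt_fact_payment/*",
        }],
    }),
)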

Answered 8 months ago
  • Is it not Glue, or does Redshift need a role to access the S3 bucket? As per my actual logic, I am trying to read data from Redshift and write it back to another Redshift table. So in the backend, Redshift writes into S3 and then the data is read from S3 back into Redshift. But yes, this process happens inside Glue. So now I'm confused.

Accepted Answer

When I changed the tmp_dir location, this piece of code worked fine. The new tmp_dir location (S3 bucket) was the default one provided by Glue itself. I have no idea why that is. In the Glue Studio visual editor, I added a Redshift source node to pull data from my Redshift DB, and then I was able to view the underlying code generated for that process. In that code, a default "tmp_dir" was used, and it worked fine. If I change that location to my own, the code fails with the error mentioned above.

If any of you knows the reason for that, please share your comments.

Joe
Answered 8 months ago
  • The only possible reasons I can think of are that Redshift is in a different region (but then you should get a clearer error) or that the role in Redshift doesn't have permission to write to that bucket (see the sketch below for passing that role explicitly).
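To check that last point: the UNLOAD into the temp dir runs under an IAM role on the Redshift side, and you can pass that role explicitly through the connection options. A sketch, with a placeholder role ARN:

RedshiftCluster_node3 = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "sampleQuery": sample_query,
        "redshiftTmpDir": tmp_dir,
        "useConnectionProperties": "true",
        "connectionName": "dodsqa_redshift",
        # Role attached to the Redshift cluster with write access to the temp bucket;
        # the ARN below is a placeholder
        "aws_iam_role": "arn:aws:iam::123456789012:role/redshift-s3-access-role",
    },
    transformation_ctx="RedshiftCluster_node3",
)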
