Read data from a Redshift table and write to another Redshift table using AWS Glue PySpark


I am trying to read data from Redshift schema_a and write its output into another Redshift table in schema_b. Below is the code I am using to read from Redshift:

# GlueContext/SparkContext setup added for completeness; in the actual job these
# come from the standard Glue job boilerplate.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# S3 location that Redshift uses to UNLOAD the query result before Glue reads it
tmp_dir = "s3://bjs-digital-dods-data-lake-processed/temp/rt_fact_payment/"
sample_query = "select * from abc.table_a where condition_a > 10"

RedshiftCluster_node3 = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "sampleQuery": sample_query,
        "redshiftTmpDir": tmp_dir,
        "useConnectionProperties": "true",
        "connectionName": "dodsqa_redshift",
    },
    transformation_ctx="RedshiftCluster_node3",
)

When I run the above code, I get the error below:

caused by: com.amazon.redshift.util.RedshiftException: ERROR: S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid RERN2PVXA8V6RXJS,ExtRid fsd/1fDHyRPYoXYEc1I3o+LqUkOL6vaQm5gkqnVev6BFadXHqMP3aTLdtv7OB8nWLrOl8FouziY=,CanRetry 1 Detail: error: S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid RERN2PVXA8V6RXJS,ExtRid fsd/1fDHyRPYoXYEc1I3o+LqUkOL6vaQm5gkqnVev6BFadXHqMP3aTLdtv7OB8nWLrOl8FouziY=,CanRetry 1 code: 8001 context: Failed to initialize S3 output stream. S3 path: s3://bjs-digital-dods-data-lake-processed/temp/rt_fact_payment/62131958-fd04-4459-96a2-f15cdfe7c84a/0001_part_00.parquet query: 2496495 location: s3_parquet_unloader.cpp:419 process: query0_87_2496495 [pid=31904]

What am I missing here? Can I write more than one query in "sample_query"? Any help would be much appreciated, as this is urgent.
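For context, the write back into schema_b would look roughly like the sketch below; the target table, database, and preactions SQL are placeholders for illustration, not my real names.

# Sketch only: write the DynamicFrame back to a table in schema_b.
# "schema_b.table_b", "dev", and the preactions SQL are placeholders.
glueContext.write_dynamic_frame.from_options(
    frame=RedshiftCluster_node3,
    connection_type="redshift",
    connection_options={
        "database": "dev",
        "dbtable": "schema_b.table_b",
        "redshiftTmpDir": tmp_dir,
        "useConnectionProperties": "true",
        "connectionName": "dodsqa_redshift",
        "preactions": "create table if not exists schema_b.table_b (id int);",
    },
    transformation_ctx="RedshiftSink_node4",
)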

Joe
Asked 8 months ago · 686 views
3 answers

The Glue role needs permission to read (GetObject) from the temporary directory; it sounds like you only have list permission.

AWS EXPERT · answered 8 months ago
  • But the Glue job had the S3 policy attached to it, and that policy included READ, WRITE, and LIST.


Hi,

According to the error message, it looks like the Glue role doesn't have the right permissions to read from the temporary directory.

To fix this, update the permissions for the Glue role so it has GetObject access on the temporary directory. You can do this from the IAM console: find the Glue role and add the required permissions.

Basically, you just need to add the "GetObject" action for the temporary directory to the role; a sketch of doing this with boto3 is below. After you update the permissions, run the job again. If you're still having problems, let me know and I'll be glad to assist you further.
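If it helps, here is a rough sketch of adding that permission programmatically with boto3. The role name and policy name are placeholders, and the bucket/prefix is taken from the tmp_dir in your question; adjust everything to your setup.

import json
import boto3

iam = boto3.client("iam")

# Placeholder role name: replace with the role your Glue job actually runs as.
role_name = "my-glue-job-role"

# Allow object reads/writes on the Redshift temp prefix, plus bucket listing.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::bjs-digital-dods-data-lake-processed",
                "arn:aws:s3:::bjs-digital-dods-data-lake-processed/temp/rt_fact_payment/*",
            ],
        }
    ],
}

# Attach the permissions to the role as an inline policy.
iam.put_role_policy(
    RoleName=role_name,
    PolicyName="glue-redshift-tmpdir-access",
    PolicyDocument=json.dumps(policy),
)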

Answered 8 months ago
  • Is it the Glue role, or does Redshift need a role to access the S3 bucket? As per my actual logic, I am trying to read data from Redshift and write it back into another Redshift table. So in the backend, Redshift writes into S3 and the data is then read from S3 back into Redshift. But yes, this process happens inside Glue, so I am a bit confused.

Accepted answer

When I changed the tmp_dir location, this piece of code worked fine. The new tmp_dir location (S3 bucket) was the default one suggested by Glue itself. I have no idea why that is. In the Glue Studio visual editor, I added a Redshift source node to pull data from my Redshift DB, and then I was able to view the underlying generated code for the above process. In that code a tmp_dir was already set, and it worked fine. If I change that location to my own, the code fails with the above-mentioned error.

If any of you knows the reason for that, please share your comments.

Joe
Answered 8 months ago
  • The only possible reasons I can think of are that Redshift is in a different region (but then you should get a clearer error) or that the role used by Redshift doesn't have permission on your bucket; see the sketch below.
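If it is the permission issue, one thing to try is to pass an explicit IAM role for Redshift to use when it unloads to your bucket. This is only a sketch: I'm assuming the connector's optional "aws_iam_role" connection option applies here, and the role ARN is a placeholder; the role must be associated with your Redshift cluster and allowed to write to the tmp_dir prefix.

# Sketch: same read as in the question, but with an explicit role for the UNLOAD to S3.
# The role ARN below is a placeholder; the real role must be attached to the Redshift cluster.
RedshiftCluster_node3 = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "sampleQuery": sample_query,
        "redshiftTmpDir": tmp_dir,
        "useConnectionProperties": "true",
        "connectionName": "dodsqa_redshift",
        "aws_iam_role": "arn:aws:iam::123456789012:role/my-redshift-s3-role",
    },
    transformation_ctx="RedshiftCluster_node3",
)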
