Read data from a Redshift table and write to another Redshift table using AWS Glue PySpark

0

I am trying to read data from a Redshift table in schema_a and write the output into another Redshift table in schema_b. Below is the code I am using to read from Redshift:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Temporary S3 directory that Redshift unloads the query results into
tmp_dir = "s3://bjs-digital-dods-data-lake-processed/temp/rt_fact_payment/"
sample_query = "select * from abc.table_a where condition_a > 10"

# Read the query results through the Glue catalog connection "dodsqa_redshift"
RedshiftCluster_node3 = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "sampleQuery": sample_query,
        "redshiftTmpDir": tmp_dir,
        "useConnectionProperties": "true",
        "connectionName": "dodsqa_redshift",
    },
    transformation_ctx="RedshiftCluster_node3",
)

When I run the above code, I get the error below:

caused by: com.amazon.redshift.util.RedshiftException: ERROR: S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid RERN2PVXA8V6RXJS,ExtRid fsd/1fDHyRPYoXYEc1I3o+LqUkOL6vaQm5gkqnVev6BFadXHqMP3aTLdtv7OB8nWLrOl8FouziY=,CanRetry 1 Detail: error: S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid RERN2PVXA8V6RXJS,ExtRid fsd/1fDHyRPYoXYEc1I3o+LqUkOL6vaQm5gkqnVev6BFadXHqMP3aTLdtv7OB8nWLrOl8FouziY=,CanRetry 1 code: 8001 context: Failed to initialize S3 output stream. S3 path: s3://bjs-digital-dods-data-lake-processed/temp/rt_fact_payment/62131958-fd04-4459-96a2-f15cdfe7c84a/0001_part_00.parquet query: 2496495 location: s3_parquet_unloader.cpp:419 process: query0_87_2496495 [pid=31904]

What am I missing here? Can I put more than one query in "sample_query"? Any help would be much appreciated, as this is urgent.
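
For reference, the write step I plan to run after the read looks roughly like this; "schema_b.table_b" is only a placeholder for the real target table, and it reuses the same connection and temp dir:

# Write the frame read above into the target Redshift table.
# "schema_b.table_b" is a placeholder; connection and temp dir are reused from the read step.
RedshiftTarget_node4 = glueContext.write_dynamic_frame.from_options(
    frame=RedshiftCluster_node3,
    connection_type="redshift",
    connection_options={
        "dbtable": "schema_b.table_b",
        "redshiftTmpDir": tmp_dir,
        "useConnectionProperties": "true",
        "connectionName": "dodsqa_redshift",
    },
    transformation_ctx="RedshiftTarget_node4",
)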

Joe
asked 8 months ago, 687 views
3 Answers
1

The Glue role needs permission to read (GetObject) from the temporary directory; it sounds like you only have List permission.

AWS
EXPERT
answered 8 months ago
  • But the Glue job has an S3 policy attached to it, and that policy includes READ, WRITE, and LIST permissions.

1

Hi,

According to the error message, it looks like the Glue role doesn't have the right permissions to read from the temporary directory.

To fix this, update the permissions for the Glue role so that it has s3:GetObject access on the temporary directory. You can do this in the IAM console: find the Glue role and add the required permissions.

Once you've added the "GetObject" action and updated the role, give your query another shot. If you're still having problems, let me know and I'll be glad to assist further.
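
If it helps, here is a minimal sketch of making that change with boto3 instead of the console. The role name, policy name, and bucket/prefix below are placeholders; substitute the ones from your account:

import json
import boto3

iam = boto3.client("iam")

# Allow reading and writing objects under the Redshift temp prefix,
# plus listing the bucket. Bucket and prefix are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::bjs-digital-dods-data-lake-processed/temp/rt_fact_payment/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::bjs-digital-dods-data-lake-processed",
        },
    ],
}

iam.put_role_policy(
    RoleName="my-glue-job-role",        # placeholder: your Glue job role
    PolicyName="glue-redshift-tmpdir",  # placeholder: any policy name
    PolicyDocument=json.dumps(policy),
)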

answered 8 months ago
  • Is it Glue, or does Redshift need a role to access the S3 bucket? As per my actual logic, I am reading data from Redshift and writing it back into another Redshift table. So in the backend, Redshift writes into S3 and the data is then read from S3 back into Redshift. Yes, this process happens inside Glue, so I am a bit confused.

0
Accepted Answer

When I changed the tmp_dir location, this piece of code worked fine. The new tmp_dir location (S3 bucket) was the default one provided by Glue itself; I have no idea why that makes a difference. In AWS Glue Studio, I added a Redshift source node to pull data from my Redshift DB and then looked at the generated code for that process. That code came with its own "tmp_dir", and it worked fine. If I change that location to my own bucket, the code fails with the error mentioned above.

If any of you knows the reason for that, please share it in the comments.

Joe
answered 8 months ago
  • The only possible reasons I can think of are that Redshift is in a different region (though that should give a clearer error) or that the IAM role attached to Redshift doesn't have permission on your bucket.
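  • If it is the Redshift role, explicitly passing that role in the connection options is worth a try; a rough sketch, where the role ARN is a placeholder and the rest matches your original read:

# Rough sketch: pass the IAM role attached to the Redshift cluster explicitly,
# so the UNLOAD behind the read can write to the custom temp bucket.
RedshiftCluster_node3 = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "sampleQuery": sample_query,
        "redshiftTmpDir": tmp_dir,
        "useConnectionProperties": "true",
        "connectionName": "dodsqa_redshift",
        "aws_iam_role": "arn:aws:iam::123456789012:role/my-redshift-s3-role",  # placeholder ARN
    },
    transformation_ctx="RedshiftCluster_node3",
)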
