AWS Glue job cannot see Serverless Redshift DB as target

0

I am trying to create a process that will ingest .csv files dropped into an S3 bucket daily and add those records to an existing table in a serverless Redshift database. To accomplish this, I'm trying to build a Glue job that will scan the S3 bucket for new files and update/add those records to the Redshift table accordingly. I have no Spark or Python skills, so I am using Glue Studio. I selected the 'Visual with a source and target' option with Amazon S3 as my source and Amazon Redshift as my target. I click 'Create' and am taken to the canvas. I can find my S3 bucket and apply the transform mappings, but when I click the Redshift cluster node, I do not see my serverless Redshift database. I understand there is a difference between a serverless Redshift database and a Redshift cluster, but there is no option to select Serverless Redshift as the target. My question seems similar to the one posted here, but the answer provided did not really help me: https://repost.aws/questions/QU33GP1YziR5OXIn-vehIxwA/aws-glue-studio-source-s-3-target-redshift-job-wants-to-select-gdc-not-redshift-table-in-target

I followed the steps outlined here, https://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html (though this specifically applies to Redshift clusters), to address the inbound/outbound rules of the VPC security group in my Serverless Redshift workgroup, and everything looks to be set up correctly. I also reviewed the associated IAM roles, and they look to be set up correctly as well.

My question is: how do I set up a Glue job to read from an S3 bucket and populate a table in a Serverless Redshift database? What steps am I missing? I can't tell whether I'm running into permission issues (i.e. Redshift won't allow Glue access) or whether there is something else I'm missing. I cannot find any documentation that specifically addresses Glue and Serverless Redshift databases. I am open to any suggestions.

  • Hello! I'm going through a similar scenario -- were you able to get this resolved?

  • @UICVA I never was able to get this to work. We ended up going a different direction away from Redshift and instead using RDS. We are now working on building Lambda jobs to perform our ETL.

  • Thank you so much for responding! Will explore other options!

Asked 1 year ago · 2,581 views

2 Answers
1

You actually can use a Redshift Serverless table as your target in AWS Glue Studio jobs. The catch is that you need to add the target Redshift Serverless table to your Data Catalog first, either manually or using a Glue crawler.

This is noted here: https://docs.aws.amazon.com/glue/latest/ug/data-target-nodes.html.

"For all data sources except Amazon S3 and connectors, a table must exist in the AWS Glue Data Catalog for the target type that you choose. AWS Glue Studio does not create the Data Catalog table." There is also fine print in the Glue Studio job itself. Under "Amazon RedShift" is says "AWS Glue Data Catalog table with RedShift as the data target".

Once you have added it, you can then use it as a source or target in the Glue Studio job. Use the drop-down to select "Amazon Redshift" as the target. Note that the database and table names will be based on the Glue Catalog table name, not whatever you named them in Redshift itself.
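For reference, the script Glue Studio generates for such a job looks roughly like the sketch below. All names here (bucket, catalog database, catalog table, column mappings) are hypothetical placeholders, not values from the question:

```
# Minimal sketch of a Glue job: read CSVs from S3, write to a Redshift
# table registered in the Data Catalog. All names are hypothetical.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: CSV files dropped into S3
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    format_options={"withHeader": True, "separator": ","},
    connection_options={"paths": ["s3://my-ingest-bucket/daily/"]},
)

# Map source columns onto the target table's columns
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("id", "string", "id", "int"),
              ("name", "string", "name", "string")],
)

# Target: the Redshift Serverless table as cataloged by the crawler
glue_context.write_dynamic_frame.from_catalog(
    frame=mapped,
    database="redshift_catalog_db",
    table_name="redshift_public_mytable",
    redshift_tmp_dir=args["TempDir"],
)

job.commit()
```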

To use it with a crawler, you need to first add a connection in Glue -- Redshift Serverless will be a JDBC connection. Then, when creating the crawler, select "JDBC" as your data source and choose the Redshift connection that you created.
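If you'd rather script that setup than click through the console, a rough boto3 sketch follows. The endpoint, credentials, subnet, role, and names are all hypothetical placeholders you'd replace with your own:

```
# Sketch: register a JDBC connection to Redshift Serverless, then a
# crawler that catalogs its tables. All values are hypothetical.
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "redshift-serverless-jdbc",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            # Workgroup endpoint from the Redshift Serverless console
            "JDBC_CONNECTION_URL": "jdbc:redshift://my-workgroup.123456789012.us-east-1.redshift-serverless.amazonaws.com:5439/dev",
            "USERNAME": "awsuser",
            "PASSWORD": "replace-me",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0abc123",
            "SecurityGroupIdList": ["sg-0abc123"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)

glue.create_crawler(
    Name="redshift-serverless-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="redshift_catalog_db",
    Targets={"JdbcTargets": [{
        "ConnectionName": "redshift-serverless-jdbc",
        "Path": "dev/public/%",  # database/schema/tables to include
    }]},
)
glue.start_crawler(Name="redshift-serverless-crawler")
```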

Hope that helps!

AWS
Answered 1 year ago
  • Thank you for your response. I will look into this closer. I appreciate your guidance.

  • The other option is to not use a Data Catalog table at all and just specify the connection details when you make the call to glueContext.create_dynamic_frame_from_options("redshift", connection_options) -- see the sketch below.
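A minimal sketch of that options-based pattern on the write side (the same connection_options shape works for reads via create_dynamic_frame_from_options). The endpoint, credentials, and table name are hypothetical, and glue_context, mapped, and args refer to the job sketch in the answer above:

```
# Sketch: write straight to Redshift Serverless without a Data Catalog
# table, by passing connection options. All values are hypothetical.
connection_options = {
    "url": "jdbc:redshift://my-workgroup.123456789012.us-east-1.redshift-serverless.amazonaws.com:5439/dev",
    "dbtable": "public.mytable",
    "user": "awsuser",
    "password": "replace-me",
    "redshiftTmpDir": args["TempDir"],  # S3 staging dir Glue uses for COPY
}

glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="redshift",
    connection_options=connection_options,
)
```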

1

Unfortunately, Glue Studio does not currently support Serverless Redshift as a target. You'll need to use another approach to transfer data from your S3 bucket to your Redshift database. Some options include:

  • Use a traditional Redshift cluster as the target in Glue Studio, then use Amazon Redshift Spectrum to query your Serverless Redshift database from the Redshift cluster.

  • Use a different AWS service, such as AWS Data Pipeline, AWS Lambda, or AWS Step Functions, to transfer the data from S3 to Serverless Redshift.

  • Use a Spark job with the Redshift JDBC driver to transfer the data.

  • Use the Redshift COPY command to transfer the data directly from S3 to Serverless Redshift (see the sketch after this list).
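For the COPY route, a minimal sketch using the Redshift Data API against a Serverless workgroup is below. The workgroup, database, table, bucket, and IAM role are hypothetical placeholders:

```
# Sketch: trigger a Redshift COPY from S3 via the Redshift Data API.
# Workgroup, database, table, bucket, and IAM role are hypothetical.
import boto3

client = boto3.client("redshift-data")

resp = client.execute_statement(
    WorkgroupName="my-workgroup",  # Serverless workgroup, not a cluster ID
    Database="dev",
    Sql="""
        COPY public.mytable
        FROM 's3://my-ingest-bucket/daily/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV
        IGNOREHEADER 1;
    """,
)
print(resp["Id"])  # statement id; poll describe_statement(Id=...) for status
```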

Answered 1 year ago
  • Thank you for the quick response. I'll investigate your suggestions.
