Refresh Redshift materialized view using a Glue Job script

0

I'm trying to refresh a materialized view from a Glue job, connecting to a Redshift cluster with boto3 and authenticating with a database username. The execution times out with no errors in CloudWatch. I'm fairly sure the problem is in the connection, but I can't see what I'm doing wrong. I also configured a connection in AWS Glue, which I use successfully with other Visual ETL jobs as well as with this one.

I know I could schedule the refresh with the Query Editor v2 query scheduler, but I had too many issues configuring its security and no success: the schedule executes but I see no runs in the history. That would be another post, though. Here's my script so far:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
import boto3

# Initialize Glue context
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Redshift connection details
redshift_cluster_id = 'my-warehouse'
db_name = 'my-db'
db_user = 'my-user'

# SQL to refresh materialized views
refresh_sql_1 = "REFRESH MATERIALIZED VIEW mv_name1"
refresh_sql_2 = "REFRESH MATERIALIZED VIEW mv_name2"

def refresh_materialized_view(sql):
    client = boto3.client('redshift-data')
    response = client.execute_statement(
        ClusterIdentifier=redshift_cluster_id,
        Database=db_name,
        DbUser=db_user,
        Sql=sql,
        WithEvent=True
    )
    return response

# Refresh the materialized views and commit
refresh_materialized_view(refresh_sql_1)
refresh_materialized_view(refresh_sql_2)

job.commit()
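Note that execute_statement only submits the SQL asynchronously and returns right away, so the script above never checks whether a refresh actually succeeded. A minimal sketch of waiting for the result with describe_statement (assuming the Data API call gets through at all) could look like this:

import time

def wait_for_statement(client, statement_id, timeout_seconds=300):
    # Poll the Data API until the statement finishes, fails, or we give up
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        desc = client.describe_statement(Id=statement_id)
        status = desc['Status']
        if status == 'FINISHED':
            return desc
        if status in ('FAILED', 'ABORTED'):
            raise RuntimeError(f"Statement {statement_id} ended as {status}: {desc.get('Error')}")
        time.sleep(5)
    raise TimeoutError(f"Statement {statement_id} did not finish within {timeout_seconds}s")

# Example usage:
# resp = refresh_materialized_view(refresh_sql_1)
# wait_for_statement(boto3.client('redshift-data'), resp['Id'])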

During execution, the job loops emitting these log messages until it hits the 15-minute timeout:

  • INFO LogPusher: uploading /tmp/spark-event-logs/ to s3://aws-glue-assets-etc..../sparkHistoryLogs/
  • INFO ExecutorTaskManagement: polling for executor task status
  • INFO JESSchedulerBackend: polling for JES task status
  • Why can’t you use a scheduled query?

asked 3 months ago · 251 views
3 Answers
0
Accepted Answer

I did it with the Query Editor v2 scheduler, carefully following the documentation: https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-schedule-query.html

The key, in my case, was to authenticate using Temporary credentials, allowing the IAM Role to access database credentials of a specific dbuser:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "UseTemporaryCredentialsForAllDbUsers",
            "Effect": "Allow",
            "Action": "redshift:GetClusterCredentials",
            "Resource": [
                "arn:aws:redshift:*:*:dbuser:*/awsuser",
                "arn:aws:redshift:*:*:dbuser:*/myuser"
            ]
        }
    ]
}
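For context, redshift:GetClusterCredentials is the API that temporary-credential authentication relies on, so the dbuser ARNs above have to match the database user you schedule with. A minimal sketch of that call, reusing the placeholder cluster, database, and user names from this thread:

import boto3

redshift = boto3.client('redshift')
# Returns a short-lived DbUser/DbPassword pair for the given database user
creds = redshift.get_cluster_credentials(
    ClusterIdentifier='my-warehouse',
    DbName='my-db',
    DbUser='myuser',
    DurationSeconds=900,
    AutoCreate=False
)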
answered 2 months ago
0

Hi

You could use the Visual ETL approach to make sure your connection works, then copy the generated script into a new job and modify the ETL logic accordingly.

Also, an important point when it comes to Redshift: the Glue job must run in the correct VPC, otherwise the database won't be reachable and API calls will simply hang. For further details, check Redshift connections
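A symptom of a wrong or missing VPC configuration is exactly what the question describes: the Data API call hangs until Glue kills the job. A minimal sketch (region and timeout values are placeholders) that at least makes such a connectivity problem fail fast instead of hanging:

import boto3
from botocore.config import Config

# Placeholder region; tight timeouts so an unreachable endpoint errors out quickly
data_api_config = Config(
    region_name='us-east-1',
    connect_timeout=10,
    read_timeout=30,
    retries={'max_attempts': 2}
)
client = boto3.client('redshift-data', config=data_api_config)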

Thanks, Rama

AWS
EXPERT
answered 3 months ago
  • That's a good approach if the job is actually meant to perform ETL operations, since a Visual ETL job needs at least a source and a target. That's not the case here; I just want to run a REFRESH MATERIALIZED VIEW command. After carefully reading the documentation, I managed to get scheduled queries in Query Editor v2 to work, so I'm leaving the Glue job on hold.

0

I added code to the script to try to replicate what Redshift Query Editor v2 scheduled queries do, using temporary credentials for a database user:

# Assume IAM Role
sts_client = boto3.client('sts')
assumed_role = sts_client.assume_role(
    RoleArn='arn:aws:iam::000000:role/service-role/AmazonRedshift-CommandsAccessRole',
    RoleSessionName='RedshiftSession'
)

# Get temporary credentials
credentials = assumed_role['Credentials']

# Create redshift-data API client using temporary credentials
redshift_client = boto3.client(
    'redshift-data',
    aws_access_key_id=credentials['AccessKeyId'],
    aws_secret_access_key=credentials['SecretAccessKey'],
    aws_session_token=credentials['SessionToken']
)

# Create function

Unfortunately, I got this error (which makes sense): Connect timeout on endpoint URL: "https://sts.amazonaws.com/"
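The timeout means the global STS endpoint (sts.amazonaws.com) isn't reachable from the job's subnet. One possible workaround, assuming there is an STS interface VPC endpoint (or a NAT route) in that subnet, is to pin the client to the regional endpoint; a minimal sketch with a placeholder region:

import boto3

# Regional STS endpoint so the call can be served by an interface VPC endpoint
sts_client = boto3.client(
    'sts',
    region_name='us-east-1',
    endpoint_url='https://sts.us-east-1.amazonaws.com'
)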

answered 3 months ago
