How do I specify a %connection when running a Glue job locally?


I have been attempting to follow the documentation for developing Glue jobs locally.

When I run the recommended command for running a Glue job in a Docker container, I get the following error:

$ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_spark_submit amazon/aws-glue-libs:glue_libs_4.0.0_image_01 spark-submit /home/glue_user/workspace/src/$SCRIPT_FILE_NAME --region us-west-2
(...)
py4j.protocol.Py4JJavaError: An error occurred while calling o63.getDynamicFrame.
: java.sql.SQLException: The connection attempt failed.
(...)
Caused by: java.net.NoRouteToHostException: No route to host (Host unreachable)

In the Jupyter Notebook version of the script, I use a magic to attach a Glue connection, which gives the session access to Redshift:

%number_of_workers 10
%connections connection-name

How would I modify the docker run command to specify that connection so that the script will work as intended?

2 Answers
Accepted Answer

You can't. A notebook runs on a real cluster, called an interactive session, whereas the Docker container has many limitations.
You can still reference the connection inside the code, but you have to provide any JARs it supplies yourself, and the container cannot be placed inside a VPC if the connection specifies one, as seems to be your case.
You would need to run the Docker container on an instance that is already inside the VPC, which is probably not worth the hassle and not practical for development.
Either use a local database for testing or move to interactive sessions with notebooks.
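The NoRouteToHostException in the question points to a network problem rather than a problem in the script. A quick way to confirm this from inside the container is to test plain TCP reachability of the Redshift endpoint before launching spark-submit. The following is a minimal sketch, not part of the Glue tooling: the function name `can_reach` is hypothetical, and the default port 5439 assumes the cluster uses the Redshift default.

```python
import socket


def can_reach(host: str, port: int = 5439, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, timeouts,
        # and "no route to host".
        return False
```

If `can_reach` returns False for the host taken from your JDBC URL, the container has no network path to the cluster, and no change to the docker run command or the script will fix that on its own.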

AWS EXPERT
answered 7 months ago

Hi, we have a similar use case. We have a Redshift cluster in a private VPC. Our Glue jobs (Glue 4.0) read from the Redshift database using an IAM-based URL, and the Glue role has permission to retrieve temporary database credentials from Redshift for authentication, so there is no need to configure a user name or password. Additionally, we need to attach a Glue connection (NETWORK type, with the Redshift subnet and security group information) to the Glue job. Glue then sets up an ENI in the VPC, assigns it the security groups specified in the Glue connection, and enables connectivity. That way, the read via the dynamic frame method below works for Glue jobs with IAM-based authentication.

df = glue_context.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "url": "jdbc:redshift:iam://redshift-cluster:region/db",
        "query": query_string,
        "redshiftTmpDir": redshift_dir,
        "aws_iam_role": redshift_role,
        "DbUser": "db_user_name",
    },
)
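To keep the Glue job and a local run on exactly the same options, the dictionary above could be built by a small helper. This is only a sketch: the function name `redshift_iam_options` and its parameter names are hypothetical, and the URL layout simply mirrors the `jdbc:redshift:iam://cluster:region/db` pattern used in the snippet above.

```python
def redshift_iam_options(cluster, region, database, db_user,
                         iam_role, tmp_dir, query):
    """Build connection_options for an IAM-authenticated Redshift read."""
    return {
        # IAM-based JDBC URL: no user name or password embedded.
        "url": f"jdbc:redshift:iam://{cluster}:{region}/{database}",
        "query": query,
        # S3 location used for temporary data during the read.
        "redshiftTmpDir": tmp_dir,
        "aws_iam_role": iam_role,
        "DbUser": db_user,
    }
```

The result would be passed as `connection_options` to `glue_context.create_dynamic_frame.from_options(connection_type="redshift", ...)`. The helper only assembles the options; whether the read succeeds still depends on the caller having a network path to the cluster.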

We are trying to achieve a similar IAM-based Redshift connection in the local Glue Docker setup that we use for development. That is, we want to read from the Redshift database locally using the IAM-based URL (Glue 4.0), giving the role behind the Docker profile permission to retrieve temporary database credentials from Redshift for authentication. As you mentioned, it is not possible to use the Glue connection from the local Docker container, so the missing piece for the local setup is network access. My question: if we set up a security group for our local Docker environment using a network access client running locally, and allow inbound traffic to Redshift from that security group (thereby handling network access without a Glue network connection), would the connection via the Glue dynamic frame above still work locally? Thanks.

Onur
answered 5 months ago
