Unable to use Glue Data catalog tables crawled from Snowflake JDBC


I've been setting up an AWS Glue test environment in which business users can use tables available in the Glue Data Catalog as data sources for their Glue jobs. The tables in the catalog come from data marts in Snowflake, which are crawled with a Glue crawler via a JDBC connection. The crawler runs successfully with the provided connection, creating all the expected tables from the database in the Glue catalog. My issue is that when I try to use the tables from the catalog, the job throws the following error:

An error occurred while calling o87.getCatalogSource. : java.io.InvalidObjectException: JDBC connection URL jdbc:snowflake://<ORG>-<ACC>.snowflakecomputing.com/?user=<USER>&db=<DB>&role=<ROLE>&warehouse=<VW> is not supported. Check the Developer Guide for the list of supported data stores / URL formatting.

As for the connection details, it was created via the management console, using the JDBC URL shown in the error log. Since AWS Glue detects the snowflake protocol in the URL, I let it handle the driver and related configuration. The same failure occurs whether I use Secrets Manager or store the credentials directly in the connection. The network configuration was also provided correctly.

As for other aspects worth mentioning, the IAM role provided has all the policies necessary to run the jobs, as suggested in the docs. The Snowflake credentials stored in Secrets Manager have all the rights necessary to access the data. The job uses Glue 4.0.

The generated script is as simple as this:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node Data Catalog table
DataCatalogtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="northwind",
    table_name="northwind_database_public_categories",
    transformation_ctx="DataCatalogtable_node1",
)

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=DataCatalogtable_node1,
    mappings=[
        ("description", "string", "description", "string"),
        ("picture", "binary", "picture", "boolean"),
        ("category_name", "string", "category_name", "string"),
        ("category_id", "long", "category_id", "bigint"),
    ],
    transformation_ctx="ApplyMapping_node2",
)

# Script generated for node Snowflake Connector 2.11.1 for AWS Glue 4.0
SnowflakeConnector2111forAWSGlue40_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="marketplace.spark",
    connection_options={
        "sfWarehouse": "VW",
        "dbtable": "test",
        "sfDatabase": "GLUE_TEST",
        "sfSchema": "PUBLIC",
        "connectionName": "marketplace_conn",
    },
    transformation_ctx="SnowflakeConnector2111forAWSGlue40_node3",
)


job.commit()

and the complete exception raised:

ERROR ProcessLauncher: Error from Python:Traceback (most recent call last):
  File "/tmp/catalog_test.py", line 16, in <module>
    DataCatalogtable_node1 = glueContext.create_dynamic_frame.from_catalog(
  File "/opt/amazon/lib/python3.7/site-packages/awsglue/dynamicframe.py", line 629, in from_catalog
    return self._glue_context.create_dynamic_frame_from_catalog(db, table_name, redshift_tmp_dir, transformation_ctx, push_down_predicate, additional_options, catalog_id, **kwargs)
  File "/opt/amazon/lib/python3.7/site-packages/awsglue/context.py", line 184, in create_dynamic_frame_from_catalog
    source = DataSource(self._ssql_ctx.getCatalogSource(db, table_name, redshift_tmp_dir, transformation_ctx,
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o87.getCatalogSource.
: java.io.InvalidObjectException: JDBC connection URL jdbc:snowflake://<ORG>-<ACC>.snowflakecomputing.com/?user=<USER>&db=<DB>&role=<ROLE>&warehouse=<VW> is not supported. Check the Developer Guide for the list of supported data stores / URL formatting.
	at com.amazonaws.glue.jdbc.commons.JdbcUrl.getMatcherForUrl(JdbcUrl.java:181)
	at com.amazonaws.services.glue.util.DataCatalogWrapper.$anonfun$getJDBCConf$1(DataCatalogWrapper.scala:263)
	at scala.util.Try$.apply(Try.scala:209)
	at com.amazonaws.services.glue.util.DataCatalogWrapper.getJDBCConf(DataCatalogWrapper.scala:223)
	at com.amazonaws.services.glue.GlueContext.getGlueNativeJDBCSource(GlueContext.scala:494)
	at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:327)
	at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:198)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)

I've managed to use catalog tables crawled from native sources, such as S3, but couldn't find a way to meet the requirements described above. Thanks in advance for the support.

2 Answers

Unfortunately that API is hiding the root cause. I believe the regex matcher fails on something in the part of the URL that you have redacted. Run the following and see whether the new error is more helpful and shows where it originates:

snowFlakeJdbcUrl = "put here the url you have configured"
jdbcUrlObj = spark._jvm.com.amazonaws.glue.jdbc.commons.JdbcUrl.fromUrl(snowFlakeJdbcUrl)
jdbcUrlObj.getConnectionUrl(False, False, None)
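If that JVM call still masks the cause, it can also help to inspect the URL from plain Python first, to eyeball the hostname and the parameter order the matcher sees. A minimal sketch with standard-library parsing only (the URL below is a placeholder; substitute the one stored in your Glue connection):

```python
from urllib.parse import urlsplit, parse_qsl

# Placeholder: substitute the JDBC URL configured in your Glue connection.
jdbc_url = (
    "jdbc:snowflake://myorg-myacc.snowflakecomputing.com/"
    "?user=glue_user&db=NORTHWIND&role=GLUE_ROLE&warehouse=GLUE_WH"
)

# Strip the "jdbc:" prefix so urlsplit can parse the rest as an ordinary URL.
parts = urlsplit(jdbc_url[len("jdbc:"):])
params = dict(parse_qsl(parts.query))

print("host:", parts.hostname)
print("params in order:", list(params))
```

Anything unexpected in the hostname (e.g. a stray character the regex cannot match) or in the parameter order should stand out here.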
AWS
EXPERT
answered a year ago

Check your URL:

To connect to a Snowflake instance of the sample database, specify the endpoint for the Snowflake instance, the user, the database name, and the role name. You can optionally add the warehouse parameter.

jdbc:snowflake://account_name.snowflakecomputing.com/?user=user_name&db=sample&role=role_name&warehouse=warehouse_name

Important: For Snowflake connections over JDBC, the order of parameters in the URL is enforced and must be user, db, role, and warehouse.

Source: JDBC URL
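As a quick illustration of that ordering rule, here is a small helper sketch that rebuilds the query string in the enforced user, db, role, warehouse order (the example URL and values are made up; whether reordering alone fixes your case depends on what the matcher actually rejects):

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Order the Glue JDBC URL matcher reportedly enforces for Snowflake.
ENFORCED_ORDER = ["user", "db", "role", "warehouse"]

def reorder_snowflake_jdbc_url(url: str) -> str:
    """Rebuild a jdbc:snowflake:// URL with its query parameters in the enforced order."""
    assert url.startswith("jdbc:")
    parts = urlsplit(url[len("jdbc:"):])
    params = dict(parse_qsl(parts.query))
    ordered = [(k, params[k]) for k in ENFORCED_ORDER if k in params]
    # Keep any extra parameters after the four expected ones.
    ordered += [(k, v) for k, v in params.items() if k not in ENFORCED_ORDER]
    return f"jdbc:{parts.scheme}://{parts.netloc}{parts.path}?{urlencode(ordered)}"

# Example with parameters deliberately out of order:
print(reorder_snowflake_jdbc_url(
    "jdbc:snowflake://myorg-myacc.snowflakecomputing.com/"
    "?db=sample&user=user_name&warehouse=wh&role=role_name"
))
```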

twood
answered 10 months ago
