AWS Glue Docker ignoring customJdbcDriverClassName and customJdbcDriverS3Path

While attempting to run a very trivial Glue script locally via Docker, I can't seem to connect to a MySQL database that is also running in Docker.

My Docker Compose setup is:

version: '3.7'
services:
  glue:
    container_name: "dev-glue"
    image: amazon/aws-glue-libs:glue_libs_3.0.0_image_01-arm64
    ports:
      - "4040:4040"
      - "18080:18080"
    volumes:
      - ~/.aws:/home/glue_user/.aws
      - /workspace:/home/glue_user/workspace/
    environment:
      - "AWS_PROFILE=$AWS_PROFILE"
      - "AWS_REGION=us-west-2"
      - "AWS_DEFAULT_REGION=us-west-2"
      - "DISABLE_SSL=true"
    stdin_open: true

  mysql:
    image: mysql:8.0
    container_name: 'dev-mysql'
    command: --default-authentication-plugin=mysql_native_password
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: password
    volumes:
      - mysql-db:/var/lib/mysql
    ports:
      - '3306:3306'

volumes:
  mysql-db:

According to the documentation found here, the following should work without a problem:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())

def df_mysql(glue_context: GlueContext, schema: str, table: str):
 
    connection = {
        "url": f"jdbc:mysql://dev-mysql/{schema}",
        "dbtable": table,
        "user": "root",
        "password": "password",
        "customJdbcDriverS3Path": "s3://my-bucket/mysql-connector-java-8.0.17.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"
    }

    data_frame: DynamicFrame = glue_context.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection)

    data_frame.printSchema()


df_mysql(glueContext, "my_schema", "my_table")

However, this fails with:

Traceback (most recent call last):
  File "/home/glue_user/workspace/local/local_mysql.py", line 25, in <module>
    df_mysql(glueContext, "my_schema", "my_table")
  File "/home/glue_user/workspace/local/local_mysql.py", line 19, in df_mysql
    data_frame: DynamicFrame = glue_context.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection)
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/dynamicframe.py", line 608, in from_options
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/context.py", line 228, in create_dynamic_frame_from_options
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/data_source.py", line 36, in getFrame
  File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o45.getDynamicFrame.
: java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
        at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:46)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:102)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:102)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:102)
        at org.apache.spark.sql.jdbc.glue.GlueJDBCOptions.<init>(GlueJDBCOptions.scala:14)
        at org.apache.spark.sql.jdbc.glue.GlueJDBCOptions.<init>(GlueJDBCOptions.scala:17)
        at org.apache.spark.sql.jdbc.glue.GlueJDBCSource$.createRelation(GlueJDBCSource.scala:29)
        at com.amazonaws.services.glue.util.JDBCWrapper.tableDF(JDBCUtils.scala:878)
        at com.amazonaws.services.glue.util.NoCondition$.tableDF(JDBCUtils.scala:86)
        at com.amazonaws.services.glue.util.NoJDBCPartitioner$.tableDF(JDBCUtils.scala:172)
        at com.amazonaws.services.glue.JDBCDataSource.getDynamicFrame(DataSource.scala:967)
        at com.amazonaws.services.glue.DataSource.getDynamicFrame(DataSource.scala:99)
        at com.amazonaws.services.glue.DataSource.getDynamicFrame$(DataSource.scala:99)
        at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:714)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:750)

What's interesting here is that it's failing with java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver, but if I change the config to use a different driver class, it fails with the same error for that class instead.

Is there some other environment variable or configuration I'm supposed to set in order for this to work?

asked a year ago · 270 views
1 Answer

Drivers with restrictive licenses (e.g. GPL) are not distributed in the Docker container. However, you can download the driver yourself from https://dev.mysql.com/downloads/connector/j/ and copy the jar (e.g. mysql-connector-j-8.0.32.jar) into the container's /home/glue_user/spark/jars directory. For the copy you can use "docker cp", or, probably easier, open a shell in the container with "docker exec" and download it directly into that directory. You could also mount the jar through the Compose file, as sketched below.
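For example, building on the Compose file from the question, the downloaded jar can be mounted straight into Spark's jars directory so it survives container recreation. This is just a sketch; the ./jars host path is an assumption, so adjust it to wherever you saved the connector:

services:
  glue:
    # ...rest of the glue service as in the question...
    volumes:
      - ~/.aws:/home/glue_user/.aws
      - /workspace:/home/glue_user/workspace/
      # Assumed host path: a ./jars folder holding the downloaded connector jar.
      - ./jars/mysql-connector-j-8.0.32.jar:/home/glue_user/spark/jars/mysql-connector-j-8.0.32.jar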

answered a year ago (AWS Expert)
  • Yeah, that's what I ended up doing, but a couple of things to note on that. Why does the documented "customJdbcDriverS3Path" not work inside of Docker, and why isn't this documented anywhere in the mess that is the Glue documentation for local development?

  • The customJdbcDriverS3Path parameter updates the executor classloader and imports the JDBC driver class. However, this will not work in local development environments, because the Spark driver acts as the executor there and the driver's classpath cannot be updated. So the solution is to add the jar to the $SPARK_HOME/jars directory, or to pass the S3 path via the --jars parameter, e.g. spark-submit --jars s3://S3BUCKET/jars/mysql-connector-j-8.0.32.jar SPARK_SCRIPT.py (see the sketch after these comments).

  • Actually, I wasn't expecting customJdbcDriverS3Path to work in Docker, but it does: I tested the sample code on the Glue 3 Docker image, and it asked me to enter credentials to download the jar from S3. Once I provided them, it found the driver and tried to connect.

  • Subramanya is right: it tries to connect using the driver, but when I run it against an actual DB, for some strange reason it no longer finds the driver when running the query, like in the exception above.
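For reference, here is what the --jars/extraClassPath workaround mentioned above can look like from inside the script itself, for the case where you launch it with plain python rather than spark-submit. A minimal sketch; the jar path is an assumption and must point to wherever the connector was copied inside the container:

from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Assumed path: adjust to wherever the connector jar lives inside the container.
driver_jar = "/home/glue_user/workspace/jars/mysql-connector-j-8.0.32.jar"

conf = (
    SparkConf()
    .set("spark.jars", driver_jar)                   # ships the jar to executors
    .set("spark.driver.extraClassPath", driver_jar)  # local mode: driver acts as the executor
)

# PySpark launches the JVM when the first context is created, so the
# conf must be set before SparkContext.getOrCreate() runs.
glueContext = GlueContext(SparkContext.getOrCreate(conf=conf))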
