AWS Glue Docker ignoring customJdbcDriverClassName and customJdbcDriverS3Path

Attempting to run a very trivial Glue script locally via Docker, I can't seem to connect to a MySQL database that is also running in Docker.

My Docker setup is:

version: '3.7'
services:
  glue:
    container_name: "dev-glue"
    image: amazon/aws-glue-libs:glue_libs_3.0.0_image_01-arm64
    ports:
      - "4040:4040"
      - "18080:18080"
    volumes:
      - ~/.aws:/home/glue_user/.aws
      - /workspace:/home/glue_user/workspace/
    environment:
      - "AWS_PROFILE=$AWS_PROFILE"
      - "AWS_REGION=us-west-2"
      - "AWS_DEFAULT_REGION=us-west-2"
      - "DISABLE_SSL=true"
    stdin_open: true

  mysql:
    image: mysql:8.0
    container_name: 'dev-mysql'
    command: --default-authentication-plugin=mysql_native_password
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: password
    volumes:
      - mysql-db:/var/lib/mysql
    ports:
      - '3306:3306'

volumes:
  mysql-db:

According to the documentation found here, the following should work without a problem:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())

def df_mysql(glue_context: GlueContext, schema: str, table: str):
    # JDBC connection options, including the custom driver settings
    # documented for Glue JDBC connections.
    connection = {
        "url": f"jdbc:mysql://dev-mysql/{schema}",
        "dbtable": table,
        "user": "root",
        "password": "password",
        "customJdbcDriverS3Path": "s3://my-bucket/mysql-connector-java-8.0.17.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"
    }

    # Read the table into a DynamicFrame using the MySQL connection type.
    data_frame: DynamicFrame = glue_context.create_dynamic_frame.from_options(
        connection_type="mysql", connection_options=connection
    )

    data_frame.printSchema()


df_mysql(glueContext, "my_schema", "my_table")

However, this fails with:

Traceback (most recent call last):
  File "/home/glue_user/workspace/local/local_mysql.py", line 25, in <module>
    df_mysql(glueContext, "my_schema", "my_table")
  File "/home/glue_user/workspace/local/local_mysql.py", line 19, in df_mysql
    data_frame: DynamicFrame = glue_context.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection)
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/dynamicframe.py", line 608, in from_options
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/context.py", line 228, in create_dynamic_frame_from_options
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/data_source.py", line 36, in getFrame
  File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o45.getDynamicFrame.
: java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
        at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:46)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:102)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:102)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:102)
        at org.apache.spark.sql.jdbc.glue.GlueJDBCOptions.<init>(GlueJDBCOptions.scala:14)
        at org.apache.spark.sql.jdbc.glue.GlueJDBCOptions.<init>(GlueJDBCOptions.scala:17)
        at org.apache.spark.sql.jdbc.glue.GlueJDBCSource$.createRelation(GlueJDBCSource.scala:29)
        at com.amazonaws.services.glue.util.JDBCWrapper.tableDF(JDBCUtils.scala:878)
        at com.amazonaws.services.glue.util.NoCondition$.tableDF(JDBCUtils.scala:86)
        at com.amazonaws.services.glue.util.NoJDBCPartitioner$.tableDF(JDBCUtils.scala:172)
        at com.amazonaws.services.glue.JDBCDataSource.getDynamicFrame(DataSource.scala:967)
        at com.amazonaws.services.glue.DataSource.getDynamicFrame(DataSource.scala:99)
        at com.amazonaws.services.glue.DataSource.getDynamicFrame$(DataSource.scala:99)
        at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:714)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:750)

What's interesting here is that it fails with java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver, but if I change the config to use a different driver class, it fails with the same error message.

Is there some other environment variable or configuration I'm supposed to set in order for this to work?

Asked 1 year ago · Viewed 277 times

1 Answer

Drivers with restrictive licenses (e.g. GPL) are not distributed in the Docker container. However, you can download the driver yourself from https://dev.mysql.com/downloads/connector/j/ and copy the jar (e.g. mysql-connector-j-8.0.32.jar) into the container's /home/glue_user/spark/jars directory. To do that copy you can use "docker cp", or, probably easier, open a shell in the container, download the jar there, and put it under that directory.
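
A minimal sketch of that copy, assuming the dev-glue container name from the compose file above (the jar version and the Maven Central download URL are illustrative):

# Download Connector/J on the host.
curl -L -o mysql-connector-j-8.0.32.jar \
    https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/8.0.32/mysql-connector-j-8.0.32.jar

# Copy it into the running Glue container's Spark jars directory.
docker cp mysql-connector-j-8.0.32.jar dev-glue:/home/glue_user/spark/jars/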

AWS · Expert · Answered 1 year ago
  • Yeah, that's what I ended up doing, but a couple of things to note on that: why does the documented "customJdbcDriverS3Path" not work inside of Docker, and why isn't this documented anywhere in the mess that is the Glue documentation for local development?

  • The customJdbcDriverS3Path parameter updates the executor classloader and imports the JDBC driver class. However, this will not work in local development environments, because the Spark driver acts as the executor there and the driver's class path cannot be updated. So the solution would be to add the jar to the $SPARK_HOME/jars directory, or to pass the S3 path to the --jars parameter, e.g. spark-submit --jars s3://S3BUCKET/jars/mysql-connector-j-8.0.32.jar SPARK_SCRIPT.py (see the sketch after these comments).

  • Actually, I wasn't expecting customJdbcDriverS3Path to work in Docker, but it does work. I tested the sample code there on the Glue 3 Docker image, and it did ask me to enter credentials to download the jar from S3; once I added them, it found the driver and tried to connect.

  • Subramanya is right: it tries to connect using the driver, but if I run it against an actual DB, for some strange reason it no longer finds the driver when running the query, like in the exception you have there.
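
A minimal sketch of the --jars approach from the comment above, assuming the dev-glue container from the compose file, that spark-submit is on the container's PATH, and reusing the S3 jar path and script path from the question (both illustrative):

# Run the script inside the Glue container, passing the driver jar via --jars.
docker exec dev-glue spark-submit \
    --jars s3://my-bucket/mysql-connector-java-8.0.17.jar \
    /home/glue_user/workspace/local/local_mysql.py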
