Attempting to run a very trivial Glue script locally via Docker I can't seem to connect to a mysql database running also in docker.
My docker setup is:
version: '3.7'
services:
glue:
container_name: "dev-glue"
image: amazon/aws-glue-libs:glue_libs_3.0.0_image_01-arm64
ports:
- "4040:4040"
- "18080:18080"
volumes:
- ~/.aws:/home/glue_user/.aws
- /workspace:/home/glue_user/workspace/
environment:
- "AWS_PROFILE=$AWS_PROFILE"
- "AWS_REGION=us-west-2"
- "AWS_DEFAULT_REGION=us-west-2"
- "DISABLE_SSL=true"
stdin_open: true
mysql:
image: mysql:8.0
container_name: 'dev-mysql'
command: --default-authentication-plugin=mysql_native_password
restart: always
environment:
MYSQL_ROOT_PASSWORD: password
volumes:
- mysql-db:/var/lib/mysql
ports:
- '3306:3306'
volumes:
mysql-db:
And according to the documentation found here the following should work without a problem
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
glueContext = GlueContext(SparkContext.getOrCreate())
def df_mysql(glue_context: GlueContext, schema: str, table: str):
connection = {
"url": f"jdbc:mysql://dev-mysql/{schema}",
"dbtable": table,
"user": "root",
"password": "password",
"customJdbcDriverS3Path": "s3://my-bucket/mysql-connector-java-8.0.17.jar",
"customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"
}
data_frame: DynamicFrame = glue_context.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection)
data_frame.printSchema()
df_mysql(glueContext, "my_schema", "my_table")
However this fails with
Traceback (most recent call last):
File "/home/glue_user/workspace/local/local_mysql.py", line 25, in <module>
df_mysql(glueContext, "my_schema", "my_table")
File "/home/glue_user/workspace/local/local_mysql.py", line 19, in df_mysql
data_frame: DynamicFrame = glue_context.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection)
File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/dynamicframe.py", line 608, in from_options
File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/context.py", line 228, in create_dynamic_frame_from_options
File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/data_source.py", line 36, in getFrame
File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o45.getDynamicFrame.
: java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:46)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:102)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:102)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:102)
at org.apache.spark.sql.jdbc.glue.GlueJDBCOptions.<init>(GlueJDBCOptions.scala:14)
at org.apache.spark.sql.jdbc.glue.GlueJDBCOptions.<init>(GlueJDBCOptions.scala:17)
at org.apache.spark.sql.jdbc.glue.GlueJDBCSource$.createRelation(GlueJDBCSource.scala:29)
at com.amazonaws.services.glue.util.JDBCWrapper.tableDF(JDBCUtils.scala:878)
at com.amazonaws.services.glue.util.NoCondition$.tableDF(JDBCUtils.scala:86)
at com.amazonaws.services.glue.util.NoJDBCPartitioner$.tableDF(JDBCUtils.scala:172)
at com.amazonaws.services.glue.JDBCDataSource.getDynamicFrame(DataSource.scala:967)
at com.amazonaws.services.glue.DataSource.getDynamicFrame(DataSource.scala:99)
at com.amazonaws.services.glue.DataSource.getDynamicFrame$(DataSource.scala:99)
at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:714)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
What's interesting here is its failing due to java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
but if I change the config to use something else, it will fail with the same error message.
Is there some other environment variable or configuration I'm supposed to set in order for this to work?
Yeah thats what I ended up doing, but a couple of things to note on that. Why does the documented "customJdbcDriverS3Path" not work inside of docker, and why isn't this documented anywhere in the mess that is the Glue documentation for local development?
The
customJdbcDriverS3Path
parameter updates the executor classLoader and imports the JDBC driver class. However, this will not work in local development environments because Spark driver acts as the executor in the local development environment and driver class paths cannot be updated. So, the solution would be to add the jar to the$SPARK_HOME/jars
directory. or you can also pass the S3 path to--jars
parameter. ex:spark-submit --jars s3://S3BUCKET/jars/mysql-connector-j-8.0.32.jar SPARK_SCRIPT.py
Actually I wasn't expecting that customJdbcDriverS3Path to work in docker but it does work. I tested the sample code there on the Glue 3 docker and it did ask me to enter credentials to download it from S3, when I added it it did find the driver and tried to connect
Subramanya is right, it tries to connect using the driver but if I run it in an actual DB, for some strange reason, when running the query it no longer finds the driver, like in the exception you have there