ClassNotFoundException EmrFileSystem error while setting Spark --driver-class-path config in EMR Serverless


I am running a Scala Spark job on EMR Serverless and am trying to pass a PostgreSQL JDBC connector jar through Spark's --driver-class-path config. This is how my spark-submit configs look:

 --class main.scala.DataJob \
--driver-class-path s3://s3-bucket/postgresql-42.7.3.jar \
--jars s3://s3-bucket/postgresql-42.7.3.jar,s3://s3-bucket/aws-advanced-jdbc-wrapper-2.3.5.jar \
--conf spark.sql.hive.metastore.sharedPrefixes=software.amazon.jdbc.Driver \
--conf spark.hadoop.javax.jdo.option.ConnectionDriverName=software.amazon.jdbc.Driver \
--conf spark.hadoop.javax.jdo.option.ConnectionUserName=<username> \
--conf spark.hadoop.javax.jdo.option.ConnectionPassword=<password> \
--conf spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:aws-wrapper:postgresql://<rds-endpoint>.rds.amazonaws.com:5432/db_name

It looks like setting --driver-class-path is overriding the existing classpath, which already contains the path to EMRFS.
Also, I observed that the jars passed through the --jars config are added to the classpath only after the driver node is up. Complete error stack trace -

Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2693)
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3628)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3663)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:173)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3767)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3718)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:564)
	at org.apache.spark.util.DependencyUtils$.resolveGlobPath(DependencyUtils.scala:317)
	at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2(DependencyUtils.scala:273)
	at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2$adapted(DependencyUtils.scala:271)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
	at org.apache.spark.util.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:271)
	at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$4(SparkSubmit.scala:395)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:395)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1010)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1167)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1176)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2597)
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2691)
	... 27 more

Apologies if any of my observations are incorrect. Please let me know how I can resolve this issue.

Pranava
asked 16 days ago · 60 views
2 Answers

Hello,

Basically, if you would like to add custom jars to the classpath, it is recommended to use the spark.jars property with a comma-separated list of jar files, which will be picked up at runtime.

--conf spark.jars=s3://<S3-bucket-name>/xxxx/postgresql-42.3.6.jar

spark.jars - Additional jars to add to the runtime classpath of the driver and executors.
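Applied to both jars from your submit command, that would look like this (same bucket placeholder as in your question):

--conf spark.jars=s3://s3-bucket/postgresql-42.7.3.jar,s3://s3-bucket/aws-advanced-jdbc-wrapper-2.3.5.jar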

On the other hand, you can also try adding the extra jars with the properties below instead of --driver-class-path: the --driver-class-path value replaces the existing classpath, whereas extraClassPath entries are added to it.

spark.driver.extraClassPath <other existing jar locations>:/home/hadoop/extrajars/*
spark.executor.extraClassPath <other existing jar locations>:/home/hadoop/extrajars/*
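For example, assuming the jars passed via --jars are staged under /home/hadoop on EMR Serverless (an assumption; check your driver logs for the actual staging location), the submit parameters could look roughly like this:

--jars s3://s3-bucket/postgresql-42.7.3.jar,s3://s3-bucket/aws-advanced-jdbc-wrapper-2.3.5.jar \
--conf spark.driver.extraClassPath=/home/hadoop/postgresql-42.7.3.jar:/home/hadoop/aws-advanced-jdbc-wrapper-2.3.5.jar \
--conf spark.executor.extraClassPath=/home/hadoop/postgresql-42.7.3.jar:/home/hadoop/aws-advanced-jdbc-wrapper-2.3.5.jar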

Thirdly, if you have several jar files, you can create a custom Docker image to package all your dependencies. More details are available in the EMR Serverless documentation on custom images.
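A minimal sketch of such an image (the base image tag is only an example; match it to the EMR release your application uses, and the jars must be present in the Docker build context):

FROM public.ecr.aws/emr-serverless/spark/emr-7.0.0:latest
USER root
# place the JDBC jars on the default Spark classpath
COPY postgresql-42.7.3.jar aws-advanced-jdbc-wrapper-2.3.5.jar /usr/lib/spark/jars/
# EMR Serverless images must end with the hadoop user
USER hadoop:hadoop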

AWS
SUPPORT ENGINEER
answered 16 days ago

Thanks for the reply, Yokesh. The main reason I am trying to pass the PostgreSQL connector jar through the --driver-class-path config is that I keep seeing the error below:

Caused by: java.sql.SQLException: No suitable driver found for jdbc:aws-wrapper:postgresql....

I tried passing both s3://s3-bucket/postgresql-42.7.3.jar and s3://s3-bucket/aws-advanced-jdbc-wrapper-2.3.5.jar through --conf spark.jars as you mentioned, and I still see the above error.

I analysed the Spark logs below and found that the jars are added to the classpath only on the executor nodes, not on the driver node. Apparently the JDBC driver needs to be available on the driver's classpath during cluster initialization.

Spark logs -

Files s3://<s3-bucket>/postgresql-42.7.3.jar from /tmp/spark-af1d2a45-d6ba-497c-9b55-342345/postgresql-42.7.3.jar to /home/hadoop/postgresql-42.7.3.jar
Files s3://<s3-bucket>/aws-advanced-jdbc-wrapper-2.3.5.jar from /tmp/spark-af1d2a45-d6ba-497c-9b55-436523/aws-advanced-jdbc-wrapper-2.3.5.jar to /home/hadoop/aws-advanced-jdbc-wrapper-2.3.5.jar
24/04/29 03:26:14 INFO HiveConf: Found configuration file file:/etc/spark/conf/hive-site.xml
24/04/29 03:26:14 INFO SparkContext: Running Spark version 3.5.0-amzn-1
24/04/29 03:26:14 INFO SparkContext: OS info Linux, 5.10.213-201.855.amzn2.x86_64, amd64
24/04/29 03:26:14 INFO SparkContext: Java version 17.0.10
24/04/29 03:26:14 INFO ResourceUtils: ==============================================================
24/04/29 03:26:14 INFO ResourceUtils: No custom resources configured for spark.driver.
24/04/29 03:26:14 INFO ResourceUtils: ==============================================================
24/04/29 03:26:14 INFO SparkContext: Submitted application: IcebergConnectorApp
24/04/29 03:26:14 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 4, script: , vendor: , memory -> name: memory, amount: 14336, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
24/04/29 03:26:14 INFO ResourceProfile: Limiting resource is cpus at 4 tasks per executor
24/04/29 03:26:14 INFO ResourceProfileManager: Added ResourceProfile id: 0
24/04/29 03:26:14 INFO SecurityManager: Changing view acls to: hadoop
24/04/29 03:26:14 INFO SecurityManager: Changing modify acls to: hadoop
24/04/29 03:26:14 INFO SecurityManager: Changing view acls groups to: 
24/04/29 03:26:14 INFO SecurityManager: Changing modify acls groups to: 
24/04/29 03:26:14 INFO SecurityManager: SecurityManager: authentication enabled; ui acls disabled; users with view permissions: hadoop; groups with view permissions: EMPTY; users with modify permissions: hadoop; groups with modify permissions: EMPTY
24/04/29 03:26:14 INFO Utils: Successfully started service 'sparkDriver' on port 35799.
24/04/29 03:26:14 INFO SparkEnv: Registering MapOutputTracker
24/04/29 03:26:14 INFO SparkEnv: Registering BlockManagerMaster
24/04/29 03:26:14 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
24/04/29 03:26:14 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
24/04/29 03:26:14 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
24/04/29 03:26:14 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-668b0750-7120-4854-8b5e-b8a5af8fe469
24/04/29 03:26:14 INFO MemoryStore: MemoryStore started with capacity 8.2 GiB
24/04/29 03:26:14 INFO SparkEnv: Registering OutputCommitCoordinator
24/04/29 03:26:14 INFO SubResultCacheManager: Sub-result caches are disabled.
24/04/29 03:26:14 INFO JettyUtils: Start Jetty 0.0.0.0:4040 for SparkUI
24/04/29 03:26:14 INFO Utils: Successfully started service 'SparkUI' on port 4040.
24/04/29 03:26:14 INFO SparkContext: Added JAR s3://<s3-bucket>/postgresql-42.7.3.jar at s3://<s3-bucket>/postgresql-42.7.3.jar with timestamp 1714361174254
24/04/29 03:26:15 INFO Executor: Starting executor ID driver on host ip-10-0-84-60.us-west-2.compute.internal
24/04/29 03:26:15 INFO Executor: OS info Linux, 5.10.213-201.855.amzn2.x86_64, amd64
24/04/29 03:26:15 INFO Executor: Java version 17.0.10
24/04/29 03:26:15 INFO Executor: Starting executor with user classpath (userClassPathFirst = false): 'file:/usr/lib/hadoop-lzo/lib/*,file:/usr/lib/hadoop/hadoop-aws.jar,file:/usr/share/aws/aws-java-sdk/*,file:/usr/share/aws/emr/emrfs/conf/,file:/usr/share/aws/emr/emrfs/lib/*,file:/usr/share/aws/emr/emrfs/auxlib/*,file:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar,file:/usr/share/aws/emr/goodies/lib/emr-serverless-spark-goodies.jar,file:/usr/share/aws/emr/security/conf,file:/usr/share/aws/emr/security/lib/*,file:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar,file:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar,file:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar,file:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar,file:/docker/usr/lib/hadoop-lzo/lib/*,file:/docker/usr/lib/hadoop/hadoop-aws.jar,file:/docker/usr/share/aws/aws-java-sdk/*,file:/docker/usr/share/aws/emr/emrfs/conf,file:/docker/usr/share/aws/emr/emrfs/lib/*,file:/docker/usr/share/aws/emr/emrfs/auxlib/*,file:/docker/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar,file:/docker/usr/share/aws/emr/security/conf,file:/docker/usr/share/aws/emr/security/lib/*,file:/docker/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar,file:/docker/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar,file:/docker/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar,file:/docker/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar,file:/usr/share/aws/redshift/jdbc/RedshiftJDBC.jar,file:/usr/share/aws/redshift/spark-redshift/lib/*,file:/usr/share/aws/iceberg/lib/iceberg-emr-common.jar,file:/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar,file:/home/hadoop/iceberg-emr-common.jar,file:/home/hadoop/conf,file:/home/hadoop/emr-serverless-spark-goodies.jar,file:/home/hadoop/emr-spark-goodies.jar,file:/home/hadoop/iceberg-spark3-runtime.jar,file:/home/hadoop/*,file:/home/hadoop/aws-glue-datacatalog-spark-client.jar,file:/home/hadoop/hive-openx-serde.jar,file:/home/hadoop/sagemaker-spark-sdk.jar,file:/home/hadoop/hadoop-aws.jar,file:/home/hadoop/RedshiftJDBC.jar,file:/home/hadoop/emr-s3-select-spark-connector.jar'
24/04/29 03:26:15 INFO Executor: Created or updated repl class loader org.apache.spark.util.MutableURLClassLoader@4247093b for default.
24/04/29 03:26:15 INFO Executor: Fetching s3://<s3-bucket>/postgresql-42.7.3.jar with timestamp 1714361174254
24/04/29 03:26:15 INFO S3NativeFileSystem: Opening 's3://<s3-bucket>/postgresql-42.7.3.jar' for reading
Pranava
answered 15 days ago
  • Jars mentioned in spark.driver.extraClassPath will be added at runtime, not during the initialization steps. If you would like to add them via a custom image, place the jars either in the default jar location (/usr/lib/spark/jars) or in a custom location referenced by extraClassPath, which should work for you.
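  • For the "No suitable driver found" error specifically, you can also try setting the driver class explicitly on the JDBC reader, so Spark does not rely on DriverManager discovering the driver from the classpath. A minimal sketch (table name and credentials are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DataJob").getOrCreate()

    // Naming the driver class explicitly lets Spark load it itself instead of
    // depending on DriverManager, which cannot see jars added after startup.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:aws-wrapper:postgresql://<rds-endpoint>.rds.amazonaws.com:5432/db_name")
      .option("driver", "software.amazon.jdbc.Driver")
      .option("dbtable", "my_table") // placeholder table name
      .option("user", "<username>")
      .option("password", "<password>")
      .load()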
