ClassNotFoundException EmrFileSystem error while setting Spark --driver-class-path config in EMR Serverless


I am running a Scala Spark job on EMR Serverless and am trying to pass a PostgreSQL JDBC connector jar through Spark's --driver-class-path config. This is how my spark-submit configs look:

 --class main.scala.DataJob \
--driver-class-path s3://s3-bucket/postgresql-42.7.3.jar \
--jars s3://s3-bucket/postgresql-42.7.3.jar,s3://s3-bucket/aws-advanced-jdbc-wrapper-2.3.5.jar \
--conf spark.sql.hive.metastore.sharedPrefixes=software.amazon.jdbc.Driver \
--conf spark.hadoop.javax.jdo.option.ConnectionDriverName=software.amazon.jdbc.Driver \
--conf spark.hadoop.javax.jdo.option.ConnectionUserName=<username> \
--conf spark.hadoop.javax.jdo.option.ConnectionPassword=<password> \
--conf spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:aws-wrapper:postgresql://<rds-endpoint>.rds.amazonaws.com:5432/db_name

It looks like setting --driver-class-path is overriding the existing classpath, which already contains the path to EMRFS.
Also, I observed that the jars passed through the --jars config are added to the classpath only after the driver node is up. Complete error stack trace -

Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2693)
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3628)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3663)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:173)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3767)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3718)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:564)
	at org.apache.spark.util.DependencyUtils$.resolveGlobPath(DependencyUtils.scala:317)
	at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2(DependencyUtils.scala:273)
	at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2$adapted(DependencyUtils.scala:271)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
	at org.apache.spark.util.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:271)
	at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$4(SparkSubmit.scala:395)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:395)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1010)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1167)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1176)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2597)
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2691)
	... 27 more

Apologies if any of my observations are incorrect. Please let me know how I can resolve this issue.

Pranava
asked 16 days ago · 60 views
2 Answers

Hello,

Basically, if you would like to add custom jars to the classpath, it is recommended to use the spark.jars property with a comma-separated list of jar files, which will be picked up at runtime.

--conf spark.jars=s3://<S3-bucket-name>/xxxx/postgresql-42.3.6.jar

spark.jars - Additional jars to add to the runtime classpath of the driver and executors.
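Applied to both jars from your submit command, that would look like this (same bucket placeholder as in your question):

--conf spark.jars=s3://s3-bucket/postgresql-42.7.3.jar,s3://s3-bucket/aws-advanced-jdbc-wrapper-2.3.5.jar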

On the other hand, you can also try adding the extra jars with the properties below instead of --driver-class-path: the --driver-class-path value replaces the existing classpath, whereas extraClassPath entries are added to it.

spark.driver.extraClassPath <other existing jar locations>:/home/hadoop/extrajars/*
spark.executor.extraClassPath <other existing jar locations>:/home/hadoop/extrajars/*
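For example, assuming the jars passed via --jars are staged under /home/hadoop on EMR Serverless (an assumption; check your driver logs for the actual staging location), the submit parameters could look roughly like this:

--jars s3://s3-bucket/postgresql-42.7.3.jar,s3://s3-bucket/aws-advanced-jdbc-wrapper-2.3.5.jar \
--conf spark.driver.extraClassPath=/home/hadoop/postgresql-42.7.3.jar:/home/hadoop/aws-advanced-jdbc-wrapper-2.3.5.jar \
--conf spark.executor.extraClassPath=/home/hadoop/postgresql-42.7.3.jar:/home/hadoop/aws-advanced-jdbc-wrapper-2.3.5.jar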

Thirdly, if you have several jar files, you can create a custom Docker image to package all your dependencies. More details are available in the EMR Serverless documentation on custom images.
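A minimal sketch of such an image (the base image tag is only an example; match it to the EMR release your application uses, and the jars must be present in the Docker build context):

FROM public.ecr.aws/emr-serverless/spark/emr-7.0.0:latest
USER root
# place the JDBC jars on the default Spark classpath
COPY postgresql-42.7.3.jar aws-advanced-jdbc-wrapper-2.3.5.jar /usr/lib/spark/jars/
# EMR Serverless images must end with the hadoop user
USER hadoop:hadoop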

AWS
SUPPORT ENGINEER
answered 16 days ago

Thanks for the reply, Yokesh. The main reason I am trying to pass the PostgreSQL connector jar through the --driver-class-path config is that I keep seeing the error below:

Caused by: java.sql.SQLException: No suitable driver found for jdbc:aws-wrapper:postgresql....

I tried passing both s3://s3-bucket/postgresql-42.7.3.jar and s3://s3-bucket/aws-advanced-jdbc-wrapper-2.3.5.jar through --conf spark.jars as you mentioned, and I still see the above error.

I analysed the Spark logs below and found that the jars are added to the classpath only on the executor nodes, not on the driver node. Apparently the JDBC driver needs to be available on the driver's classpath during cluster initialization.

Spark logs -

Files s3://<s3-bucket>/postgresql-42.7.3.jar from /tmp/spark-af1d2a45-d6ba-497c-9b55-342345/postgresql-42.7.3.jar to /home/hadoop/postgresql-42.7.3.jar
Files s3://<s3-bucket>/aws-advanced-jdbc-wrapper-2.3.5.jar from /tmp/spark-af1d2a45-d6ba-497c-9b55-436523/aws-advanced-jdbc-wrapper-2.3.5.jar to /home/hadoop/aws-advanced-jdbc-wrapper-2.3.5.jar
24/04/29 03:26:14 INFO HiveConf: Found configuration file file:/etc/spark/conf/hive-site.xml
24/04/29 03:26:14 INFO SparkContext: Running Spark version 3.5.0-amzn-1
24/04/29 03:26:14 INFO SparkContext: OS info Linux, 5.10.213-201.855.amzn2.x86_64, amd64
24/04/29 03:26:14 INFO SparkContext: Java version 17.0.10
24/04/29 03:26:14 INFO ResourceUtils: ==============================================================
24/04/29 03:26:14 INFO ResourceUtils: No custom resources configured for spark.driver.
24/04/29 03:26:14 INFO ResourceUtils: ==============================================================
24/04/29 03:26:14 INFO SparkContext: Submitted application: IcebergConnectorApp
24/04/29 03:26:14 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 4, script: , vendor: , memory -> name: memory, amount: 14336, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
24/04/29 03:26:14 INFO ResourceProfile: Limiting resource is cpus at 4 tasks per executor
24/04/29 03:26:14 INFO ResourceProfileManager: Added ResourceProfile id: 0
24/04/29 03:26:14 INFO SecurityManager: Changing view acls to: hadoop
24/04/29 03:26:14 INFO SecurityManager: Changing modify acls to: hadoop
24/04/29 03:26:14 INFO SecurityManager: Changing view acls groups to: 
24/04/29 03:26:14 INFO SecurityManager: Changing modify acls groups to: 
24/04/29 03:26:14 INFO SecurityManager: SecurityManager: authentication enabled; ui acls disabled; users with view permissions: hadoop; groups with view permissions: EMPTY; users with modify permissions: hadoop; groups with modify permissions: EMPTY
24/04/29 03:26:14 INFO Utils: Successfully started service 'sparkDriver' on port 35799.
24/04/29 03:26:14 INFO SparkEnv: Registering MapOutputTracker
24/04/29 03:26:14 INFO SparkEnv: Registering BlockManagerMaster
24/04/29 03:26:14 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
24/04/29 03:26:14 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
24/04/29 03:26:14 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
24/04/29 03:26:14 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-668b0750-7120-4854-8b5e-b8a5af8fe469
24/04/29 03:26:14 INFO MemoryStore: MemoryStore started with capacity 8.2 GiB
24/04/29 03:26:14 INFO SparkEnv: Registering OutputCommitCoordinator
24/04/29 03:26:14 INFO SubResultCacheManager: Sub-result caches are disabled.
24/04/29 03:26:14 INFO JettyUtils: Start Jetty 0.0.0.0:4040 for SparkUI
24/04/29 03:26:14 INFO Utils: Successfully started service 'SparkUI' on port 4040.
24/04/29 03:26:14 INFO SparkContext: Added JAR s3://<s3-bucket>/postgresql-42.7.3.jar at s3://<s3-bucket>/postgresql-42.7.3.jar with timestamp 1714361174254
24/04/29 03:26:15 INFO Executor: Starting executor ID driver on host ip-10-0-84-60.us-west-2.compute.internal
24/04/29 03:26:15 INFO Executor: OS info Linux, 5.10.213-201.855.amzn2.x86_64, amd64
24/04/29 03:26:15 INFO Executor: Java version 17.0.10
24/04/29 03:26:15 INFO Executor: Starting executor with user classpath (userClassPathFirst = false): 'file:/usr/lib/hadoop-lzo/lib/*,file:/usr/lib/hadoop/hadoop-aws.jar,file:/usr/share/aws/aws-java-sdk/*,file:/usr/share/aws/emr/emrfs/conf/,file:/usr/share/aws/emr/emrfs/lib/*,file:/usr/share/aws/emr/emrfs/auxlib/*,file:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar,file:/usr/share/aws/emr/goodies/lib/emr-serverless-spark-goodies.jar,file:/usr/share/aws/emr/security/conf,file:/usr/share/aws/emr/security/lib/*,file:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar,file:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar,file:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar,file:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar,file:/docker/usr/lib/hadoop-lzo/lib/*,file:/docker/usr/lib/hadoop/hadoop-aws.jar,file:/docker/usr/share/aws/aws-java-sdk/*,file:/docker/usr/share/aws/emr/emrfs/conf,file:/docker/usr/share/aws/emr/emrfs/lib/*,file:/docker/usr/share/aws/emr/emrfs/auxlib/*,file:/docker/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar,file:/docker/usr/share/aws/emr/security/conf,file:/docker/usr/share/aws/emr/security/lib/*,file:/docker/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar,file:/docker/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar,file:/docker/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar,file:/docker/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar,file:/usr/share/aws/redshift/jdbc/RedshiftJDBC.jar,file:/usr/share/aws/redshift/spark-redshift/lib/*,file:/usr/share/aws/iceberg/lib/iceberg-emr-common.jar,file:/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar,file:/home/hadoop/iceberg-emr-common.jar,file:/home/hadoop/conf,file:/home/hadoop/emr-serverless-spark-goodies.jar,file:/home/hadoop/emr-spark-goodies.jar,file:/home/hadoop/iceberg-spark3-runtime.jar,file:/home/hadoop/*,file:/home/hadoop/aws-glue-datacatalog-spark-client.jar,file:/home/hadoop/hive-openx-serde.jar,file:/home/hadoop/sagemaker-spark-sdk.jar,file:/home/hadoop/hadoop-aws.jar,file:/home/hadoop/RedshiftJDBC.jar,file:/home/hadoop/emr-s3-select-spark-connector.jar'
24/04/29 03:26:15 INFO Executor: Created or updated repl class loader org.apache.spark.util.MutableURLClassLoader@4247093b for default.
24/04/29 03:26:15 INFO Executor: Fetching s3://<s3-bucket>/postgresql-42.7.3.jar with timestamp 1714361174254
24/04/29 03:26:15 INFO S3NativeFileSystem: Opening 's3://<s3-bucket>/postgresql-42.7.3.jar' for reading
Pranava
answered 15 days ago
  • Jars mentioned in spark.driver.extraClassPath will be added at runtime, not during the initialization steps. If you would like to add them via a custom image, place the jars either in the default jar location (/usr/lib/spark/jars) or in a custom location referenced by extraClassPath, which should work for you.
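  • For the "No suitable driver found" error specifically, you can also try setting the driver class explicitly on the JDBC reader, so Spark does not rely on DriverManager discovering the driver from the classpath. A minimal sketch (table name and credentials are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DataJob").getOrCreate()

    // Naming the driver class explicitly lets Spark load it itself instead of
    // depending on DriverManager, which cannot see jars added after startup.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:aws-wrapper:postgresql://<rds-endpoint>.rds.amazonaws.com:5432/db_name")
      .option("driver", "software.amazon.jdbc.Driver")
      .option("dbtable", "my_table") // placeholder table name
      .option("user", "<username>")
      .option("password", "<password>")
      .load()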
