How do I resolve the "java.lang.ClassNotFoundException" error in Spark on Amazon EMR?
I want to resolve the "java.lang.ClassNotFoundException" error in Apache Spark on Amazon EMR.
Short description
The java.lang.ClassNotFoundException error in Spark occurs for one of the following reasons:

- The spark-submit job can't find the relevant files in the class path.
- A bootstrap action or custom configuration overrides the class path. As a result, the class loader picks up only the JAR files that exist in the location that you specified in your configuration.
Resolution
To resolve the java.lang.ClassNotFoundException error, check the stack trace for the name of the missing class. Then, add the path of the custom JAR that contains the missing class to the Spark class path. You can add the custom JAR path on a running cluster, on a new cluster, or when you submit a job.
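Before you edit the class path, it can help to confirm which of your JAR files actually contains the missing class. The following sketch scans a directory of JARs for a class entry; the class name and directory in the example call are placeholders to replace with the name from your stack trace and your own JAR location:

```shell
#!/bin/bash
# find_class_in_jars: print every JAR in a directory that contains the
# given class file entry. JAR files are ZIP archives, so `unzip -l`
# lists the .class entries inside each one.
find_class_in_jars() {
  local class_path="$1"   # for example: com/example/util/JsonHelper.class
  local jar_dir="$2"      # for example: /home/hadoop/extrajars
  local jar
  for jar in "$jar_dir"/*.jar; do
    if unzip -l "$jar" 2>/dev/null | grep -q "$class_path"; then
      echo "$jar"
    fi
  done
}

# Example call (placeholder class name and directory):
find_class_in_jars "com/example/util/JsonHelper.class" "/home/hadoop/extrajars"
```

The class path argument uses slashes rather than dots because that is how class files are stored inside the archive.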
Add your custom JAR path on a running cluster
In /etc/spark/conf/spark-defaults.conf, append your custom JAR path to the spark.driver.extraClassPath and spark.executor.extraClassPath properties so that it covers the class that's named in the error stack trace.
Example:

```shell
sudo vim /etc/spark/conf/spark-defaults.conf

spark.driver.extraClassPath <other existing jar locations>:example-custom-jar-path
spark.executor.extraClassPath <other existing jar locations>:example-custom-jar-path
```
**Note:** Replace example-custom-jar-path with your custom JAR path.
Add your custom JAR path on a new cluster
To add your custom JAR path to the existing class paths in /etc/spark/conf/spark-defaults.conf, supply a configuration object when you create a new cluster. To supply a configuration object, create the new cluster with Amazon EMR release 5.14.0 or later.
For Amazon EMR 5.14.0 to Amazon EMR 5.17.0, include the following:
```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*"
    }
  }
]
```
For Amazon EMR 5.17.0 to Amazon EMR 5.18.0, include /usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar as an additional JAR path:
```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*"
    }
  }
]
```
For Amazon EMR 5.19.0 to Amazon EMR 5.32.0, update the JAR path as follows:
```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*"
    }
  }
]
```
For Amazon EMR 5.33.0 to Amazon EMR 5.36.0, update the JAR path as follows:
```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*"
    }
  }
]
```
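One common way to supply such a configuration object is to save it to a file and pass that file to `aws emr create-cluster` with the `--configurations` option. This sketch writes a shortened example (the class path is truncated for readability; use the full list for your release) and validates that the file is well-formed JSON. The file name is a placeholder:

```shell
# Write the configuration object to a file
# (class path shortened for readability -- use the full list for your release)
cat > /tmp/spark-classpath-config.json <<'EOF'
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/home/hadoop/extrajars/*"
    }
  }
]
EOF

# Validate the JSON before you use it
python3 -m json.tool /tmp/spark-classpath-config.json >/dev/null && echo "config OK"

# Then reference the file at cluster creation, for example:
#   aws emr create-cluster ... \
#       --applications Name=Spark \
#       --configurations file:///tmp/spark-classpath-config.json
```

The commented `aws emr create-cluster` call is a sketch; fill in your own instance, role, and release options.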
For Amazon EMR 6.0.0 and later, you can't use a configuration object to update the JAR path because the conf file contains multiple JAR paths. Additionally, each property configuration that you update can't exceed 1,024 characters in length. To pass the custom JAR location to spark-defaults.conf, add a bootstrap action instead. For more information, see How do I update all Amazon EMR nodes after the bootstrap phase?
To add a bootstrap action, see Add custom bootstrap actions, and then take the following actions:
- Replace s3://example-bucket/Bootstraps/script_b.sh with your Amazon Simple Storage Service (Amazon S3) path.
- Replace /home/hadoop/extrajars/* with your custom JAR file path.
- Confirm that the Amazon EMR runtime role has the permissions that are required to access the Amazon S3 bucket.
**Note:** When you add a bootstrap script, the script applies to the cluster's Spark configuration, not to a specific job.
Example script that changes /etc/spark/conf/spark-defaults.conf:

```shell
#!/bin/bash
#
# This is an example of script_b.sh for changing /etc/spark/conf/spark-defaults.conf
#
while [ ! -f /etc/spark/conf/spark-defaults.conf ]
do
  sleep 1
done
#
# Now the file is available, do your work here
#
sudo sed -i '/spark.*.extraClassPath/s/$/:\/home\/hadoop\/extrajars\/\*/' /etc/spark/conf/spark-defaults.conf
exit 0
```

Launch the EMR cluster, and add a bootstrap action similar to the following:

```shell
#!/bin/bash
pwd
aws s3 cp s3://example-bucket/Bootstraps/script_b.sh .
chmod +x script_b.sh
nohup ./script_b.sh &
```
**Note:** Replace example-bucket with your Amazon S3 bucket.
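Before you attach script_b.sh as a bootstrap action, you can sanity-check its sed substitution locally. This sketch applies the same edit to a throwaway two-line stand-in for spark-defaults.conf and prints the result:

```shell
# Minimal stand-in for /etc/spark/conf/spark-defaults.conf
cat > /tmp/spark-defaults.conf <<'EOF'
spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/*
spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/*
EOF

# Same substitution that script_b.sh uses: append the custom JAR
# directory to the end of every extraClassPath line
sed -i '/spark.*.extraClassPath/s/$/:\/home\/hadoop\/extrajars\/\*/' /tmp/spark-defaults.conf

cat /tmp/spark-defaults.conf
```

After the edit, both extraClassPath lines end in `:/home/hadoop/extrajars/*`, which is what the bootstrap action produces on the real file.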
Add your custom JAR path when you submit a job
To add a custom JAR path when you submit a job, run the spark-submit command with the --jars option. For more information, see Launching applications with spark-submit on the Apache Spark website.
```shell
spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi --master yarn --jars example-custom-jar-path spark-examples.jar 100
```
**Note:** Replace example-custom-jar-path with your custom JAR path, and pass --jars before the application JAR; spark-submit treats anything after the application JAR as application arguments. To prevent class conflicts, don't include standard JARs when you use the --jars option. For example, don't include spark-core.jar because it already exists on the cluster. For more information, see Configure Spark.
Related information

Spark Configuration on the Apache Spark website