
How do I resolve the "java.lang.ClassNotFoundException" error in Spark on Amazon EMR?


I want to resolve the "java.lang.ClassNotFoundException" error in Apache Spark on Amazon EMR.

Short description

The java.lang.ClassNotFoundException error occurs in Spark for the following reasons:

  • The spark-submit job can't find the relevant files in the class path.
  • A bootstrap action or custom configuration overrides the class paths. As a result, the class loader picks up only the JAR files that exist in the location that you specified in your configuration.

Resolution

To resolve the java.lang.ClassNotFoundException error, check the stack trace to find the name of the missing class. Then, add the path of the custom JAR that contains the missing class to the Spark class path. You can add the custom JAR path on a running cluster, on a new cluster, or when you submit a job.
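If you aren't sure which JAR contains the missing class, you can search the JARs on a node for the class entry. The following is a sketch: the class name com.example.MissingClass and the search root /usr/lib are placeholders, and the helper falls back to Python's zipfile module when unzip isn't installed.

```shell
#!/bin/bash
# Sketch: locate which JAR contains the class named in the stack trace.
# A ClassNotFoundException for com.example.MissingClass corresponds to
# the entry com/example/MissingClass.class inside a JAR.

# List a JAR's entries; fall back to Python's zipfile module if unzip
# isn't available on the node.
list_jar() {
  if command -v unzip >/dev/null 2>&1; then
    unzip -l "$1" 2>/dev/null
  else
    python3 -m zipfile -l "$1" 2>/dev/null
  fi
}

# Usage: find_class_jar <search-root> <class-entry>
find_class_jar() {
  local root="$1" entry="$2"
  find "$root" -name '*.jar' 2>/dev/null | while read -r jar; do
    if list_jar "$jar" | grep -q "$entry"; then
      echo "$jar"
    fi
  done
}

# Example on an EMR node (placeholders):
# find_class_jar /usr/lib "com/example/MissingClass.class"
```

Each printed path is a candidate JAR to add to the Spark class path with one of the methods that follow.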

Add your custom JAR path on a running cluster

In /etc/spark/conf/spark-defaults.conf, add your custom JAR path for the class name that's specified in the error stack trace.

Example:

sudo vim /etc/spark/conf/spark-defaults.conf
spark.driver.extraClassPath <other existing jar locations>:example-custom-jar-path
spark.executor.extraClassPath <other existing jar locations>:example-custom-jar-path

**Note:** Replace example-custom-jar-path with your custom JAR path.
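The same edit can be scripted instead of made in an editor. The following is a minimal sketch that appends a path to both extraClassPath entries; the conf file and JAR path arguments are placeholders, and on a real cluster node you would run it with sudo against /etc/spark/conf/spark-defaults.conf.

```shell
#!/bin/bash
# Sketch: append a custom JAR path to the driver and executor
# extraClassPath entries in a spark-defaults.conf file.
# Usage: append_classpath <conf-file> <jar-path>
append_classpath() {
  local conf="$1" jar_path="$2"
  sed -i "/^spark\.driver\.extraClassPath/s|\$|:${jar_path}|" "$conf"
  sed -i "/^spark\.executor\.extraClassPath/s|\$|:${jar_path}|" "$conf"
}

# Example on an EMR node (placeholder path; prefix with sudo):
# append_classpath /etc/spark/conf/spark-defaults.conf /home/hadoop/extrajars/my-custom.jar
```

The change applies only to jobs that start after the file is updated; running applications keep the class path they launched with.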

Add your custom JAR path on a new cluster

To add the custom JAR path to the existing class paths in /etc/spark/conf/spark-defaults.conf, provide a configuration object when you create a new cluster. Use Amazon EMR release 5.14.0 or later to create the cluster.

For Amazon EMR 5.14.0 through Amazon EMR 5.17.0, include the following:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*"
    }
  }
]

For Amazon EMR 5.17.0 through Amazon EMR 5.18.0, include /usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar as an additional JAR path:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*"
    }
  }
]

For Amazon EMR 5.19.0 through Amazon EMR 5.32.0, update the JAR path as follows:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*"
    }
  }
]

For Amazon EMR 5.33.0 through Amazon EMR 5.36.0, update the JAR path as follows:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*"
    }
  }
]
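One way to supply the configuration object is through the AWS CLI at cluster creation time. The following sketch writes the object to a file and validates it; the `<default-classpath>` token stands for the full release-specific class path string shown above, and the release label, instance type, and count are illustrative placeholders.

```shell
#!/bin/bash
# Sketch: write the configuration object to a file and pass it to
# "aws emr create-cluster". Replace <default-classpath> with the full
# class path string for your EMR release.
cat > configurations.json <<'EOF'
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "<default-classpath>:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "<default-classpath>:/home/hadoop/extrajars/*"
    }
  }
]
EOF

# Check that the JSON is well formed before using it:
python3 -m json.tool configurations.json > /dev/null && echo "configurations.json is valid"

# Then create the cluster (commented out; placeholders throughout):
# aws emr create-cluster \
#   --release-label emr-5.36.0 \
#   --applications Name=Spark \
#   --configurations file://configurations.json \
#   --instance-type m5.xlarge \
#   --instance-count 3 \
#   --use-default-roles
```

You can also paste the same JSON into the Software settings field in the Amazon EMR console when you create the cluster.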

For Amazon EMR 6.0.0 and later releases, you can't use the configuration object to update the JAR path because the conf file contains multiple JAR paths. Also, each property configuration that you update can't be longer than 1,024 characters. To pass the custom JAR location to spark-defaults.conf, add a bootstrap action. For more information, see How do I update all Amazon EMR nodes after the bootstrap phase?

To add a bootstrap action, see Add custom bootstrap actions, and then complete the following steps:

  • Replace s3://example-bucket/Bootstraps/script_b.sh with your Amazon Simple Storage Service (Amazon S3) path.
  • Replace /home/hadoop/extrajars/* with your custom JAR file path.
  • Confirm that the Amazon EMR runtime role has the permissions required to access the Amazon S3 bucket.
    **Note:** When you add a bootstrap script, the script applies to the cluster's Spark configuration, not to a specific job.

Example script that changes /etc/spark/conf/spark-defaults.conf:

#!/bin/bash
#
# This is an example of script_b.sh for changing /etc/spark/conf/spark-defaults.conf
#
while [ ! -f /etc/spark/conf/spark-defaults.conf ]
do
  sleep 1
done
#
# Now the file is available, do your work here
#
sudo sed -i '/spark.*.extraClassPath/s/$/:\/home\/hadoop\/extrajars\/\*/' /etc/spark/conf/spark-defaults.conf
exit 0

Launch the EMR cluster, and add a bootstrap action similar to the following:

#!/bin/bash
pwd
aws s3 cp s3://example-bucket/Bootstraps/script_b.sh .
chmod +x script_b.sh
nohup ./script_b.sh &

**Note:** Replace example-bucket with your Amazon S3 bucket.
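After the cluster launches, you can confirm on the primary node that the bootstrap script appended the custom path to both class path entries. A quick check, sketched as a helper (the conf path and the /home/hadoop/extrajars/* path match the example script above):

```shell
#!/bin/bash
# Sketch: verify that both extraClassPath entries in a spark-defaults.conf
# file end with the custom JAR path appended by the bootstrap script.
# Usage: check_classpath <conf-file>
check_classpath() {
  local conf="$1"
  [ "$(grep -c 'extraClassPath.*:/home/hadoop/extrajars/\*' "$conf")" -eq 2 ]
}

# On the primary node:
# check_classpath /etc/spark/conf/spark-defaults.conf && echo "custom path present"
```

If the check fails, review the bootstrap action logs in the cluster's S3 log location.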

Add your custom JAR path when you submit a job

To add a custom JAR path when you submit a job, run the spark-submit command with the --jars option. For more information, see Launching applications with spark-submit on the Apache Spark website.

spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi --master yarn --jars example-custom-jar-path spark-examples.jar 100

**Note:** Replace example-custom-jar-path with your custom JAR path. To prevent class conflicts, don't include standard JARs when you use the --jars option. For example, don't include spark-core.jar because it already exists on the cluster. For more information, see Configure Spark.

Related information

Spark Configuration on the Apache Spark website

How do I resolve the "Container killed by YARN for exceeding memory limits" error in Spark on Amazon EMR?

AWS OFFICIAL
Updated 1 month ago