EMR Serverless jar configuration


I'm using EMR Serverless to validate data stored in S3 with the deequ library, but the job fails with the following error:

Traceback (most recent call last):
  File "/tmp/spark-f2b9d6f8-9bb9-4879-a398-1f67f9ec5e70/app3.py", line 179, in <module>
    .addConstraintRule(UniqueIfApproximatelyUniqueRule())
  File "/home/hadoop/environment/lib64/python3.7/site-packages/pydeequ/suggestions.py", line 81, in run
    result = self._ConstraintSuggestionRunBuilder.run()
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o177.run.
: com.amazon.deequ.analyzers.runners.MetricCalculationRuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 19) ([2600:1f18:2d85:5e03:ba20:a78b:a26c:61ab] executor 1): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.sql.execution.aggregate.HashAggregateExec.aggregateExpressions of type scala.collection.Seq in instance of org.apache.spark.sql.execution.aggregate.HashAggregateExec

How can I resolve this error?

Asked 1 year ago, 811 views
1 Answer

Hello,

Thank you for raising this question on re:Post.

The stack trace you shared shows a ClassCastException during serialization, which points to incompatible classes in the Spark application. However, this alone is not enough to identify the root cause. Please provide the following so that we can assist you further:

  1. Are you able to run this job successfully on an EMR on EC2 cluster?
  2. Are you able to run a test job without deequ, to confirm that the EMR Serverless job works as expected without this dependency?
  3. Please share how you are adding the deequ libraries to the EMR Serverless runtime environment. Your start-job-run command should contain these details.
  4. Please share reproduction steps, including a link to download the deequ library if it is publicly available.
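
For point 3, it can help to write out the job submission explicitly so that the way the deequ JAR reaches the driver and executors is easy to review. The following is a minimal sketch in Python, assuming the deequ JAR has been uploaded to S3 and is passed with `--jars`. The helper name `build_job_run_request`, the S3 paths, the application ID, and the role ARN are hypothetical placeholders, not values from this question. A ClassCastException like the one above is often associated with the driver and executors seeing different JARs, or with a JAR built for a different Spark/Scala version, but that is not confirmed for this case.

```python
import json


def build_job_run_request(application_id, execution_role_arn, entry_point, deequ_jar_uri):
    """Build keyword arguments for the EMR Serverless StartJobRun call.

    All argument values are supplied by the caller; nothing here is taken
    from the original question.
    """
    # Assumption: the deequ JAR is shipped to the job with --jars.
    spark_submit_parameters = f"--jars {deequ_jar_uri}"
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": spark_submit_parameters,
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical placeholder values, for illustration only.
    request = build_job_run_request(
        application_id="<application-id>",
        execution_role_arn="<execution-role-arn>",
        entry_point="s3://<bucket>/scripts/app3.py",
        deequ_jar_uri="s3://<bucket>/jars/<deequ-jar>.jar",
    )
    print(json.dumps(request, indent=2))
```

With boto3 installed and credentials configured, the resulting dictionary could be passed as `boto3.client("emr-serverless").start_job_run(**request)`. Sharing the printed request (with account details redacted) would answer point 3 directly.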
AWS
Support Engineer
Answered 1 year ago
