EMR Serverless jar configuration


I'm using EMR Serverless to validate data stored in S3 with the Deequ library, but the job fails with this error:

Traceback (most recent call last):
  File "/tmp/spark-f2b9d6f8-9bb9-4879-a398-1f67f9ec5e70/app3.py", line 179, in <module>
    .addConstraintRule(UniqueIfApproximatelyUniqueRule())
  File "/home/hadoop/environment/lib64/python3.7/site-packages/pydeequ/suggestions.py", line 81, in run
    result = self._ConstraintSuggestionRunBuilder.run()
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o177.run.
: com.amazon.deequ.analyzers.runners.MetricCalculationRuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 19) ([2600:1f18:2d85:5e03:ba20:a78b:a26c:61ab] executor 1): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.sql.execution.aggregate.HashAggregateExec.aggregateExpressions of type scala.collection.Seq in instance of org.apache.spark.sql.execution.aggregate.HashAggregateExec

How can I resolve this error?

Asked 1 year ago, viewed 811 times
1 Answer

Hello,

Thank you for raising this question on re:Post.

From the stack trace you shared, I can see a ClassCastException during serialization, which points to an incompatibility between the classes being serialized in the Spark application. However, this alone is not enough to identify the root cause. Please help us with the following so that we can assist you further:

  1. Are you able to run this successfully on an EMR on EC2 cluster?
  2. Are you able to run a test without Deequ to confirm that the EMR Serverless job works as expected without this dependency?
  3. Please share how you are adding the additional deequ libraries to the runtime serverless environment. Your start-job-run command should have the details on this.
  4. Please share reproduction steps, including a link to download the deequ library if it is publicly available.
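
For item 3, the sketch below shows one way the Deequ jar might be attached to a job, and how a Spark version mismatch in the jar name could be caught before submission. It is only an illustration under stated assumptions: the bucket, the jar file name, and the helper functions `spark_minor_from_jar` and `build_job_driver` are hypothetical, and it assumes the jar name carries a `spark-X.Y` suffix as the published Deequ artifacts do. The `spark.jars` setting is a standard Spark configuration; whether a version mismatch is the actual cause of the error above has not been confirmed.

```python
import re


def spark_minor_from_jar(jar_uri):
    """Return the 'X.Y' Spark version encoded in a jar name, or None if absent."""
    match = re.search(r"spark-(\d+\.\d+)", jar_uri)
    return match.group(1) if match else None


def build_job_driver(entry_point, deequ_jar_uri, spark_version):
    """Build a jobDriver dict for an EMR Serverless Spark job (hypothetical helper).

    Raises ValueError when the jar name advertises a different Spark
    major.minor version than the runtime version passed in.
    """
    jar_spark = spark_minor_from_jar(deequ_jar_uri)
    runtime_spark = ".".join(spark_version.split(".")[:2])
    if jar_spark is not None and jar_spark != runtime_spark:
        raise ValueError(
            f"Jar targets Spark {jar_spark}, but the runtime is Spark {runtime_spark}"
        )
    return {
        "sparkSubmit": {
            "entryPoint": entry_point,
            # spark.jars distributes the jar to the driver and the executors.
            "sparkSubmitParameters": f"--conf spark.jars={deequ_jar_uri}",
        }
    }


# Hypothetical S3 locations, for illustration only.
driver = build_job_driver(
    "s3://example-bucket/scripts/app3.py",
    "s3://example-bucket/jars/deequ-2.0.3-spark-3.3.jar",
    "3.3.0",
)
print(driver["sparkSubmit"]["sparkSubmitParameters"])
```

The resulting dictionary could then be passed as the `jobDriver` argument of `start_job_run` on the boto3 `emr-serverless` client, together with your own `applicationId` and `executionRoleArn`. Sharing the equivalent values from your own command would help narrow down the problem.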
AWS
Support Engineer
Answered 1 year ago
