EMR Serverless jar configuration


I'm using EMR Serverless to validate data stored in S3 with the deequ library, but the job fails with the following error:

Traceback (most recent call last):
  File "/tmp/spark-f2b9d6f8-9bb9-4879-a398-1f67f9ec5e70/app3.py", line 179, in <module>
    .addConstraintRule(UniqueIfApproximatelyUniqueRule())
  File "/home/hadoop/environment/lib64/python3.7/site-packages/pydeequ/suggestions.py", line 81, in run
    result = self._ConstraintSuggestionRunBuilder.run()
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o177.run.
: com.amazon.deequ.analyzers.runners.MetricCalculationRuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 19) ([2600:1f18:2d85:5e03:ba20:a78b:a26c:61ab] executor 1): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.sql.execution.aggregate.HashAggregateExec.aggregateExpressions of type scala.collection.Seq in instance of org.apache.spark.sql.execution.aggregate.HashAggregateExec

How can I resolve this error?

Asked a year ago, 811 views
1 answer

Hello,

Thank you for raising this question on re:Post.

From the stack trace you shared, I can see a ClassCastException during serialization, which points to an incompatibility between classes used for serialization in the Spark application. However, this alone is not enough to clearly identify the root cause. Please help us with the following so that we can assist you further:

  1. Are you able to run this successfully on an EMR on EC2 cluster?
  2. Are you able to run a test without deequ to confirm that the EMR Serverless job works as expected without this dependency?
  3. Please share how you are adding the additional deequ libraries to the runtime serverless environment. Your start-job-run command should have the details on this.
  4. Please share reproduction steps, including a link to download the deequ library if it is publicly available.
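
For point 3, it may help to show exactly what is passed as the job driver. The following is only a sketch of one common way to attach a jar, assuming it is supplied through the `spark.jars` setting in `sparkSubmitParameters`. The helper name `build_job_driver`, the bucket, and the jar file name are hypothetical placeholders, not values taken from this question; the jar build should match the Spark and Scala version of the EMR release in use.

```python
from typing import Dict, List, Optional


def build_job_driver(entry_point: str,
                     jar_uris: List[str],
                     extra_conf: Optional[Dict[str, str]] = None) -> dict:
    """Build a jobDriver dictionary for an EMR Serverless Spark job run.

    All jars are listed under spark.jars so that the driver and the
    executors are given the same set of additional jars.
    """
    conf = {"spark.jars": ",".join(jar_uris)}
    conf.update(extra_conf or {})
    # Render each setting as a "--conf key=value" spark-submit argument.
    params = " ".join(f"--conf {key}={value}" for key, value in conf.items())
    return {
        "sparkSubmit": {
            "entryPoint": entry_point,
            "sparkSubmitParameters": params,
        }
    }


# Hypothetical S3 locations, for illustration only.
job_driver = build_job_driver(
    "s3://my-bucket/scripts/app3.py",
    ["s3://my-bucket/jars/deequ-2.0.3-spark-3.3.jar"],
)
print(job_driver["sparkSubmit"]["sparkSubmitParameters"])
# prints: --conf spark.jars=s3://my-bucket/jars/deequ-2.0.3-spark-3.3.jar
```

A dictionary of this shape can be passed as the `jobDriver` argument of `start_job_run` on the boto3 `emr-serverless` client, together with your own `applicationId` and `executionRoleArn`. Sharing the resulting parameters makes it easier to see whether the executors receive the same jar as the driver.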
AWS
Support Engineer
Answered a year ago
