EMR Serverless jar configuration


I'm using EMR Serverless to validate data stored in S3 with the deequ library (through PyDeequ), but the job fails with the following error:

Traceback (most recent call last):
  File "/tmp/spark-f2b9d6f8-9bb9-4879-a398-1f67f9ec5e70/app3.py", line 179, in <module>
    .addConstraintRule(UniqueIfApproximatelyUniqueRule())
  File "/home/hadoop/environment/lib64/python3.7/site-packages/pydeequ/suggestions.py", line 81, in run
    result = self._ConstraintSuggestionRunBuilder.run()
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o177.run.
: com.amazon.deequ.analyzers.runners.MetricCalculationRuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 19) ([2600:1f18:2d85:5e03:ba20:a78b:a26c:61ab] executor 1): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.sql.execution.aggregate.HashAggregateExec.aggregateExpressions of type scala.collection.Seq in instance of org.apache.spark.sql.execution.aggregate.HashAggregateExec

How can I resolve this error?

asked a year ago · 811 views
1 Answer

Hello,

Thank you for raising this question on re:Post.

From the stack trace you shared, I can see that the failure is a ClassCastException raised while a serialized Spark plan object (HashAggregateExec) is being read back on an executor, which points to a class incompatibility in the Spark application. However, the trace alone is not enough to identify the root cause. Please provide the following so that we can assist you further:

  1. Are you able to run this job successfully on an EMR on EC2 cluster?
  2. Are you able to run a test job without deequ, to confirm that the EMR Serverless job works as expected without this dependency?
  3. Please share how you are adding the deequ libraries to the EMR Serverless runtime environment. Your start-job-run command should contain these details.
  4. Please share reproduction steps, including a link to download the deequ library if it is publicly available.
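
For item 3, the detail that matters is how the deequ jar reaches the Spark driver and executors. As a rough illustration only, a start-job-run request that passes the jar through `spark.jars` could be assembled as in the sketch below. The helper names `build_start_job_run_request` and `submit`, and every ID, ARN, and S3 path in the usage comment, are hypothetical placeholders and are not taken from your job. The sketch assumes the job is submitted with boto3's `emr-serverless` client.

```python
def build_start_job_run_request(application_id, execution_role_arn,
                                entry_point, jar_uris, extra_conf=None):
    """Build keyword arguments for the EMR Serverless StartJobRun API.

    Hypothetical helper: every value is supplied by the caller.
    """
    # spark.jars takes a comma-separated list of jars to place on the
    # driver and executor classpaths.
    params = ["--conf", "spark.jars=" + ",".join(jar_uris)]
    # Any further Spark settings are appended in a stable (sorted) order.
    for key, value in sorted((extra_conf or {}).items()):
        params += ["--conf", f"{key}={value}"]
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": " ".join(params),
            }
        },
    }


def submit(request):
    """Send the request with boto3 and return the job run ID."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("emr-serverless")
    return client.start_job_run(**request)["jobRunId"]


# Example usage with placeholder values (replace with your own):
# request = build_start_job_run_request(
#     application_id="<application-id>",
#     execution_role_arn="<execution-role-arn>",
#     entry_point="s3://<bucket>/app3.py",
#     jar_uris=["s3://<bucket>/jars/<deequ-jar>"],
# )
# print(submit(request))
```

Sharing the equivalent of the `sparkSubmitParameters` string from your own command, together with the exact deequ jar version, would let us check whether the jar was built for the same Spark and Scala versions as the EMR release your application uses.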
AWS
SUPPORT ENGINEER
answered a year ago
