TypeError: 'JavaPackage' object is not callable - jvm.io.delta.tables.DeltaTable.isDeltaTable


Hi,

I finally got my PySpark code mostly working on an EMR Serverless application, but according to the logs there seems to be a problem with this particular line:

    if (DeltaTable.isDeltaTable(spark, targetDeltaTableURI)):

The error message from the log is listed below. You can see at the top of the log that the dataframes were created, but it errors out when I check whether the Delta table exists:

|135123   |1670038    |13558561-12 |2018-03-19         |36.01    |2023-03-22 16:51:05.35525|
|668217   |1361372    |13563506-22 |2018-03-22         |27.19    |2023-03-22 16:51:05.35525|
+---------+-----------+------------+-------------------+---------+-------------------------+
only showing top 20 rows

Traceback (most recent call last):
  File "/tmp/spark-9a694201-d321-475f-9855-7d4ada8da0e5/main.py", line 337, in <module>
    main()
  File "/tmp/spark-9a694201-d321-475f-9855-7d4ada8da0e5/main.py", line 135, in main
    processRawData(spark, s3BucketName, objectName, schema, 'bmi')
  File "/tmp/spark-9a694201-d321-475f-9855-7d4ada8da0e5/main.py", line 317, in processRawData
    if (DeltaTable.isDeltaTable(spark, targetDeltaTableURI)):
  File "/home/hadoop/environment/lib64/python3.7/site-packages/delta/tables.py", line 562, in isDeltaTable
    return jvm.io.delta.tables.DeltaTable.isDeltaTable(jsparkSession, identifier)
TypeError: 'JavaPackage' object is not callable

Here is my submit command

aws emr-serverless start-job-run --profile 000000_SD \
    --name nw_raw_data1 \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
         "sparkSubmit": {
             "entryPoint": "s3://'${S3_BUCKET}'/scripts/main.py", 
             "sparkSubmitParameters": "--jars s3://'${S3_BUCKET}'/scripts/pyspark_nw.tar.gz --py-files s3://'${S3_BUCKET}'/scripts/variables.ini --conf spark.driver.cores=1 --conf spark.driver.memory=2g --conf spark.executor.cores=4 --conf spark.executor.memory=4g --conf spark.executor.instances=2 --conf spark.archives=s3://'${S3_BUCKET}'/scripts/pyspark_nw.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
         }
     }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://'${S3_BUCKET}'/logs/"
            }
        }
    }'

And below are the pip packages used to build the Virtual Env

boto3==1.26.74
botocore==1.29.74
delta-spark
fhir.resources==6.5.0
pyspark==3.3.0

Thanks for your help

asked a year ago · 1072 views
2 Answers
Accepted Answer

Hi there - I think the last missing piece is getting the right --jars setting. The one you provided looks to be your packaged PySpark environment(?) rather than the actual delta-core jar file.

Depending on the version of EMR you're running, you have a few options.

  • For EMR 6.9.0, Delta Lake 2.1.0 is included on the EMR Serverless image. If you're using that release, you can specify spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar as a --conf item in your sparkSubmitParameters. You may need to add a few additional items there too (like spark.sql.extensions), as mentioned in the quick start and shown in the sketch after this list.
  • Prior to EMR 6.9.0, you have to use the --packages flag to specify your Java dependencies or upload the delta-core jar to S3. You can find more details on that approach in the EMR Serverless docs on Delta Lake.
  • If the version of Delta Lake you're using doesn't match what's installed on EMR Serverless, you can also use the --packages flag or upload the delta-core jar as mentioned above. The --packages flag would be part of the sparkSubmitParameters: --packages io.delta:delta-core_2.12:2.2.0.
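
For reference, here's a minimal sketch of the session setup once the jars are on the classpath. The app name is just a placeholder, and the jar paths / Maven coordinate in the comment are the ones from the options above; they belong in sparkSubmitParameters rather than in the builder:

    from pyspark.sql import SparkSession

    # The Delta jars themselves should be supplied via sparkSubmitParameters, e.g.
    #   --conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar
    # or --packages io.delta:delta-core_2.12:2.2.0
    spark = (
        SparkSession.builder
        .appName("nw_raw_data1")  # placeholder
        # Delta Lake quick-start settings: enable Delta's SQL extension and catalog.
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

Once delta-core and delta-storage are actually loaded, jvm.io.delta.tables.DeltaTable resolves to a real class and the 'JavaPackage' object is not callable error goes away.
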
AWS
dacort
answered a year ago
  • Thanks dacort. Now I ran into a new issue: I tried the first and second options, and both give me this new error:

    :: retrieving :: org.apache.spark#spark-submit-parent-a8946b71-ba44-400b-b227-f7ffe4290c90
        confs: [default]
        4 artifacts copied, 0 already retrieved (3759kB/11ms)
    Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'null'. Please specify one with --class.
        at org.apache.spark.deploy.SparkSubmit.error(SparkSubmit.scala:1023)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:491)
        at

  • Hm, that's odd - makes it seem like your entrypoint is a jar file. :\ What's your entire start-job-run command? Your jobDriver should look something like this:

    {
        "sparkSubmit": {
            "entryPoint": "s3://bucket/prefix/main.py",
            "sparkSubmitParameters": "--conf spark.archives=s3://bucket/prefix/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar"
        }
    }
    

Hello,

The error clearly indicates that the Delta Java JAR is not loaded in the JVM behind your Python session. In order to make Delta Lake work with PySpark, you must include the required JARs and Python modules. For example, your Python code should provide the correct jar as below; for reference, check [1] and [2].

.config("spark.jars.packages", "XXXXXXXXXXX")

[1] https://wind010.hashnode.dev/problem-with-pyspark-and-delta-lake-tables-unit-tests
[2] https://github.com/JohnSnowLabs/spark-nlp/issues/232
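
As a concrete illustration, here is a minimal sketch assuming the delta-spark pip package is installed; its configure_spark_with_delta_pip helper fills in the matching io.delta:delta-core coordinate for spark.jars.packages (the app name is just a placeholder):

    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    builder = (
        SparkSession.builder
        .appName("delta-example")  # placeholder
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )

    # configure_spark_with_delta_pip adds the io.delta:delta-core Maven coordinate that
    # matches the installed delta-spark release to spark.jars.packages, so the JVM-side
    # DeltaTable class is available when the session starts.
    spark = configure_spark_with_delta_pip(builder).getOrCreate()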

AWS
answered a year ago
