TypeError: 'JavaPackage' object is not callable - jvm.io.delta.tables.DeltaTable.isDeltaTable


Hi,

I finally got my PySpark code mostly working on an EMR Serverless application, but according to the logs there seems to be a problem with this particular line:

    if (DeltaTable.isDeltaTable(spark, targetDeltaTableURI)):

The error message from the log is listed below. You can see at the top of the log that the dataframes were created, but it errors out when I check whether the Delta table exists:

|135123   |1670038    |13558561-12 |2018-03-19         |36.01    |2023-03-22 16:51:05.35525|
|668217   |1361372    |13563506-22 |2018-03-22         |27.19    |2023-03-22 16:51:05.35525|
+---------+-----------+------------+-------------------+---------+-------------------------+
only showing top 20 rows

Traceback (most recent call last):
  File "/tmp/spark-9a694201-d321-475f-9855-7d4ada8da0e5/main.py", line 337, in <module>
    main()
  File "/tmp/spark-9a694201-d321-475f-9855-7d4ada8da0e5/main.py", line 135, in main
    processRawData(spark, s3BucketName, objectName, schema, 'bmi')
  File "/tmp/spark-9a694201-d321-475f-9855-7d4ada8da0e5/main.py", line 317, in processRawData
    if (DeltaTable.isDeltaTable(spark, targetDeltaTableURI)):
  File "/home/hadoop/environment/lib64/python3.7/site-packages/delta/tables.py", line 562, in isDeltaTable
    return jvm.io.delta.tables.DeltaTable.isDeltaTable(jsparkSession, identifier)
TypeError: 'JavaPackage' object is not callable

Here is my submit command

aws emr-serverless start-job-run --profile 000000_SD \
    --name nw_raw_data1 \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
         "sparkSubmit": {
             "entryPoint": "s3://'${S3_BUCKET}'/scripts/main.py", 
             "sparkSubmitParameters": "--jars s3://'${S3_BUCKET}'/scripts/pyspark_nw.tar.gz --py-files s3://'${S3_BUCKET}'/scripts/variables.ini --conf spark.driver.cores=1 --conf spark.driver.memory=2g --conf spark.executor.cores=4 --conf spark.executor.memory=4g --conf spark.executor.instances=2 --conf spark.archives=s3://'${S3_BUCKET}'/scripts/pyspark_nw.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
         }
     }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://'${S3_BUCKET}'/logs/"
            }
        }
    }'

And below are the pip packages used to build the Virtual Env

boto3==1.26.74
botocore==1.29.74
delta-spark
fhir.resources==6.5.0
pyspark==3.3.0

Thanks for your help

asked a year ago · 1072 views
2 Answers
Accepted Answer

Hi there - I think the last missing piece is getting the right --jars setting. The one you provided looks to be your packaged PySpark environment(?) rather than the actual delta-core jar file.

Depending on the version of EMR you're running, you have a few options.

  • For EMR 6.9.0, Delta Lake 2.1.0 is included on the EMR Serverless image. If you're using that release, you can specify spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar as a --conf item in your sparkSubmitParameters. You may need to add a few additional items there too (like spark.sql.extensions), as mentioned in the quick start and shown in the sketch after this list.
  • Prior to EMR 6.9.0, you have to use the --packages flag to specify your Java dependencies or upload the delta-core jar to S3. You can find more details on that approach in the EMR Serverless docs on Delta Lake.
  • If the version of Delta Lake you're using doesn't match what's installed on EMR Serverless, you can also use the --packages flag or upload the delta-core jar as mentioned above. The --packages flag would be part of the sparkSubmitParameters: --packages io.delta:delta-core_2.12:2.2.0.
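
For reference, here's a minimal sketch of the session setup once the jars are on the classpath. The app name is just a placeholder, and the jar paths / Maven coordinate in the comment are the ones from the options above; they belong in sparkSubmitParameters rather than in the builder:

    from pyspark.sql import SparkSession

    # The Delta jars themselves should be supplied via sparkSubmitParameters, e.g.
    #   --conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar
    # or --packages io.delta:delta-core_2.12:2.2.0
    spark = (
        SparkSession.builder
        .appName("nw_raw_data1")  # placeholder
        # Delta Lake quick-start settings: enable Delta's SQL extension and catalog.
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

Once delta-core and delta-storage are actually loaded, jvm.io.delta.tables.DeltaTable resolves to a real class and the 'JavaPackage' object is not callable error goes away.
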
AWS
dacort
answered a year ago
  • Thanks dacort. Now I ran into a new issue: I tried the first and second options, and both give me this new error:

    :: retrieving :: org.apache.spark#spark-submit-parent-a8946b71-ba44-400b-b227-f7ffe4290c90
        confs: [default]
        4 artifacts copied, 0 already retrieved (3759kB/11ms)
    Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'null'. Please specify one with --class.
        at org.apache.spark.deploy.SparkSubmit.error(SparkSubmit.scala:1023)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:491)
        at

  • Hm, that's odd - makes it seem like your entrypoint is a jar file. :\ What's your entire start-job-run command? Your jobDriver should look something like this:

    {
        "sparkSubmit": {
            "entryPoint": "s3://bucket/prefix/main.py",
            "sparkSubmitParameters": "--conf spark.archives=s3://bucket/prefix/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar"
        }
    }
    

Hello,

The error clearly indicates that the Delta Java JAR is not loaded in the JVM behind your Python session. In order to make Delta Lake work with PySpark, you must include the required JARs and Python modules. For example, your Python code should provide the correct jar as below; for reference, check [1] and [2].

.config("spark.jars.packages", "XXXXXXXXXXX")

[1] https://wind010.hashnode.dev/problem-with-pyspark-and-delta-lake-tables-unit-tests
[2] https://github.com/JohnSnowLabs/spark-nlp/issues/232
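
As a concrete illustration, here is a minimal sketch assuming the delta-spark pip package is installed; its configure_spark_with_delta_pip helper fills in the matching io.delta:delta-core coordinate for spark.jars.packages (the app name is just a placeholder):

    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    builder = (
        SparkSession.builder
        .appName("delta-example")  # placeholder
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )

    # configure_spark_with_delta_pip adds the io.delta:delta-core Maven coordinate that
    # matches the installed delta-spark release to spark.jars.packages, so the JVM-side
    # DeltaTable class is available when the session starts.
    spark = configure_spark_with_delta_pip(builder).getOrCreate()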

AWS
answered a year ago
