2 Answers
Hi there - I think the last bit is getting the right `--jars` setting. The one you provided looks to be the packaged PySpark environment(?) and not the actual `delta-core` jar file.
Depending on the version of EMR you're running, you have a few options.

- For EMR 6.9.0, Delta Lake 2.1.0 is included on the EMR Serverless image. If you're using the same version, you can specify `spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar` as a `--conf` item in your `sparkSubmitParameters`. You may need to add additional items there too (like `spark.sql.extensions`) as mentioned in the quick start.
- Prior to EMR 6.9.0, you have to use the `--packages` flag to specify your Java dependencies or upload the `delta-core` jar to S3. You can find more details on that approach in the EMR Serverless docs on Delta Lake.
- If the version of Delta Lake you're using doesn't match what's installed on EMR Serverless, you can also use the `--packages` flag or upload the `delta-core` jar as mentioned above. The `--packages` flag would be part of the `sparkSubmitParameters`: `--packages io.delta:delta-core_2.12:2.2.0`.
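To make the `--conf` option concrete, here's a small sketch of assembling a `sparkSubmitParameters` string from the conf keys mentioned above. The helper function is my own illustration, not part of the EMR API; the `spark.sql.extensions` value is the standard Delta Lake extension class from the Delta quick start.

```python
# Hedged sketch (illustrative helper, not an EMR Serverless API):
# render a dict of Spark confs into the --conf flags that go into
# the sparkSubmitParameters field of a start-job-run request.
DELTA_CONFS = {
    "spark.jars": "/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar",
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
}

def build_spark_submit_params(confs):
    """Render a dict of Spark confs as a space-separated --conf flag string."""
    return " ".join(f"--conf {key}={value}" for key, value in sorted(confs.items()))
```

For example, `build_spark_submit_params(DELTA_CONFS)` yields a single string you can paste into `sparkSubmitParameters`.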
answered a year ago
Hello,

The error clearly indicates that the Java jar is not loaded in Python. To make Delta Lake work with PySpark, you must include the required jars and Python modules. For example, your Python code should provide the correct jar as below; for reference, check [1] and [2].
.config("spark.jars.packages", "XXXXXXXXXXX")
[1] https://wind010.hashnode.dev/problem-with-pyspark-and-delta-lake-tables-unit-tests [2] https://github.com/JohnSnowLabs/spark-nlp/issues/232
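As a small illustration of what goes into that placeholder, here is a sketch (my own helper, not from either answer) of building the Maven coordinate passed to `spark.jars.packages`. The default versions below are examples (Delta 2.2.0 built for Scala 2.12, taken from the other answer); pick the pair that matches your Spark build.

```python
# Hedged sketch: build the Maven coordinate for spark.jars.packages.
# Default versions are examples only; they must match your Spark/Scala build.
def delta_package(scala_version="2.12", delta_version="2.2.0"):
    """Return the io.delta:delta-core coordinate for spark.jars.packages."""
    return f"io.delta:delta-core_{scala_version}:{delta_version}"
```

You would then pass the result to `.config("spark.jars.packages", delta_package())` when building the SparkSession.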
answered a year ago
Thanks dacort. Now I ran into a new issue. I tried the first and second options. Both give me this new error:

```
:: retrieving :: org.apache.spark#spark-submit-parent-a8946b71-ba44-400b-b227-f7ffe4290c90
    confs: [default]
    4 artifacts copied, 0 already retrieved (3759kB/11ms)
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'null'. Please specify one with --class.
    at org.apache.spark.deploy.SparkSubmit.error(SparkSubmit.scala:1023)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:491)
    at
```
Hm, that's odd - makes it seem like your entrypoint is a jar file. :\ What's your entire start-job-run command? Your jobDriver should look something like this:
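For reference, a minimal sketch of the `jobDriver` shape for an EMR Serverless `start-job-run` call is below. The bucket name and script path are placeholders you'd replace with your own; the `sparkSubmitParameters` value reuses the confs from the answer above.

```json
{
  "sparkSubmit": {
    "entryPoint": "s3://your-bucket/scripts/your_script.py",
    "sparkSubmitParameters": "--conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"
  }
}
```

Note that `entryPoint` should be your Python script, not a jar - an error like "Failed to get main class in JAR" usually means a jar ended up in the entrypoint position.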