This error is more likely to happen when large datasets move back and forth between Spark tasks and pure Python operations, because the data has to be serialized between Spark's JVM and the Python processes. My suggestion is to consider processing your datasets in separate batches; in other words, process less data per job run so that the Spark-to-Python serialization doesn't take too long or fail.
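The batching idea above can be sketched in plain Python: split the set of partition keys into smaller groups, and have each job run filter to one group. The date values and batch size here are hypothetical, just to show the chunking.

```python
def make_batches(keys, batch_size):
    """Split a list of partition keys into fixed-size batches so each
    Glue job run serializes less data between the JVM and Python."""
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]

# hypothetical example: ten daily partitions processed four at a time
partition_dates = [f"2023-01-{d:02d}" for d in range(1, 11)]
batches = make_batches(partition_dates, 4)
# each batch would then drive a filter like df.filter(col("dt").isin(batch))
```

Each run then only pulls one batch's worth of rows into Python, keeping the serialized payload small.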
I also understand that your Glue job fails while the same code works in a Glue notebook. In general, there is no difference in how a Spark session behaves between a Glue job and a Glue notebook; to compare, you can run your Glue notebook as a Glue job. To get more understanding of this behavior, I would suggest opening a support case with AWS.
Further, you can try the workaround below for df.toPandas() by setting the following Spark configuration in the Glue job. You can pass it as a key/value pair in the Glue job parameters.
Key : --conf
Value : spark.sql.execution.arrow.pyspark.enabled=true, --conf spark.driver.maxResultSize=0
Thanks, that worked.
Bear in mind that's not optimal: you are still bringing all the data into driver memory, and setting spark.driver.maxResultSize to 0 disables the memory safety mechanism.
Very likely you are running out of memory in the toPandas() conversion. Why don't you just save the CSV using the DataFrame API? Even if you coalesce it to generate a single file (so it's single-threaded processing), it won't run out of memory.
Tried that, it did not work either. I can try various other options, but I'm puzzled how the same code works in a Glue notebook without adding any extra capacity.
Excellent response. I was able to get around the issue by adding the spark configuration/Glue job parameter --conf mentioned. Thanks a lot.
Good to hear. Happy to help you.
I also tried coalesce(1); it still results in the same error in the Glue job.