Command failed with exit code 10


Code works in a Glue notebook but fails in a Glue job (tried both Glue 3.0 and 4.0). The line where it fails is:

df.toPandas().to_csv(<s3_path>,index=False)

There is no detailed message in the Glue logs:

2023-05-19 19:08:37,793 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Unknown error from Python: Error Traceback is not available.

The data frame is admittedly large, about 500 MB, but the same code succeeds in a Glue notebook. Wondering if there is any subtle internal difference between a Glue notebook and a Glue job that is not obvious, or some kind of bug. P.S.: writing to S3 using Databricks technology works too.

  • Also tried coalesce(1); it still results in the same error in the Glue job.

Asked 1 year ago · Viewed 3,294 times

3 Answers
Accepted Answer

This error most likely happens when large datasets move back and forth between Spark tasks and pure Python operations, because the data has to be serialized between Spark's JVM and the Python processes. My suggestion in this regard is to process your dataset in separate batches. In other words, process less data per job run so that the Spark-to-Python serialization doesn't take too long or fail.
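For illustration, here is a minimal sketch of that batching idea in PySpark. The bucket column name, the number of batches, and the output prefix are assumptions made for the example (not part of any Glue API), and writing CSVs from pandas directly to s3:// paths assumes s3fs is available on the driver, as it appears to be since the original line works in the notebook.

from pyspark.sql import functions as F

# Illustrative only: split the DataFrame into smaller buckets so each
# toPandas() call serializes far less data from the JVM to Python.
num_batches = 10
bucketed = df.withColumn("_bucket", F.monotonically_increasing_id() % num_batches)

for i in range(num_batches):
    batch_pdf = (bucketed
                 .filter(F.col("_bucket") == i)
                 .drop("_bucket")
                 .toPandas())
    # Hypothetical output location; pandas needs s3fs to write to s3:// URLs.
    batch_pdf.to_csv(f"s3://<s3_output_prefix>/part_{i:03d}.csv", index=False)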

I understand that your Glue job fails while the same code works in a Glue notebook, but in general there is no such difference when using the Spark session in a Glue job versus a Glue notebook. To compare, you can run your Glue notebook as a Glue job. To get a better understanding of this behavior, I would suggest opening a support case with AWS using the link here.

Further, you can try the workaround below for df.toPandas() by setting the following Spark configuration in the Glue job. You can pass it as a key-value pair in the Glue job parameters.

Key: --conf

Value: spark.sql.execution.arrow.pyspark.enabled=true --conf spark.driver.maxResultSize=0
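If you prefer, a similar effect can be sketched from inside the Glue job script when the SparkContext is created. This is only an illustration under that assumption; the job-parameter approach above is the one confirmed to work in this thread.

from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Sketch: apply the same two settings programmatically at context creation.
conf = (SparkConf()
        .set("spark.sql.execution.arrow.pyspark.enabled", "true")  # Arrow-based toPandas()
        .set("spark.driver.maxResultSize", "0"))                   # 0 = no limit on collected results

sc = SparkContext.getOrCreate(conf)
glue_context = GlueContext(sc)
spark = glue_context.spark_session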

AWS
Answered 1 year ago
  • Thanks, that worked.

  • Bear in mind that this is not optimal: you are still bringing all the data into driver memory and disabling the memory safety mechanism by setting spark.driver.maxResultSize to 0.


Very likely you are running out of memory by converting with toPandas(). Why not just save the CSV using the DataFrame API? Even if you coalesce it to generate a single file (so the write is single-threaded), it won't run out of memory.
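For example, something along these lines keeps the write inside Spark instead of collecting everything to the driver; the output prefix and options are placeholders:

# Sketch: write the CSV with the Spark DataFrame writer instead of toPandas().
# coalesce(1) produces a single part file, at the cost of a single-threaded write.
(df.coalesce(1)
   .write
   .mode("overwrite")
   .option("header", "true")
   .csv("s3://<s3_output_prefix>/"))

Note that Spark writes a directory containing a part-*.csv file rather than a single file with an exact name.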

AWS
Expert
Answered 1 year ago
  • Tried that; it did not work either. Well, I can try various other options, but I'm puzzled how the same code works in a Glue notebook without adding any extra capacity.


Excellent response. I was able to get around the issue by adding the Spark configuration / --conf Glue job parameter mentioned above. Thanks a lot.

Answered 1 year ago
