Command failed with exit code 10


Code works in a Glue notebook but fails in a Glue job (tried both Glue 3.0 and 4.0). The line where it fails is:

df.toPandas().to_csv(<s3_path>,index=False)

There is no detailed message in the Glue logs, only:

2023-05-19 19:08:37,793 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Unknown error from Python: Error Traceback is not available.

The DataFrame is admittedly large, about 500 MB, but the same code succeeds in a Glue notebook. Wondering if there is some subtle, non-obvious internal difference between a Glue notebook and a Glue job, or if this is some kind of bug. P.S.: writing to S3 using Databricks technology works too.

  • Also tried coalesce(1); it still results in the same error in the Glue job.

asked a year ago · 3,129 views
3 Answers
Accepted Answer

This error is most likely to happen when large datasets move back and forth between Spark tasks and pure Python operations: the data has to be serialized between Spark's JVMs and the Python process. My suggestion is therefore to process your dataset in separate batches. In other words, process less data per job run so that the Spark-to-Python serialization doesn't take too long or fail.
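As a rough sketch of what batching could look like (the column year_month and the S3 output path are hypothetical placeholders; substitute whatever splits your data into manageable chunks):

# Hedged sketch: collect to Pandas one batch at a time so each
# Spark-to-Python serialization step stays small.
batch_keys = [row[0] for row in df.select("year_month").distinct().collect()]

for key in batch_keys:
    pdf = df.filter(df["year_month"] == key).toPandas()
    # Writing straight to S3 with pandas relies on s3fs, just like the
    # original to_csv(<s3_path>) call in the question.
    pdf.to_csv(f"s3://my-bucket/output/part-{key}.csv", index=False)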

I also understand that your Glue job fails while the same code works in a Glue notebook. In general, though, there is no difference in the Spark session between a Glue job and a Glue notebook. To compare, you can run your Glue notebook as a Glue job. To get a better understanding of this behavior, I would suggest opening a support case with AWS.

Further, you can try the workaround below for df.toPandas() by setting the following Spark configuration in the Glue job. You can pass it as a key-value pair in the Glue job parameters:

Key : --conf

Value : spark.sql.execution.arrow.pyspark.enabled=true --conf spark.driver.maxResultSize=0
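For reference, here is a minimal sketch of setting the same two options from inside the Glue job script rather than through the --conf job parameter (only standard GlueContext boilerplate is assumed):

from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Same options as the --conf job parameter above.
conf = SparkConf()
conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")  # Arrow-based toPandas()
conf.set("spark.driver.maxResultSize", "0")  # no limit on results collected to the driver (see caveat below)

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session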

AWS
answered a year ago
  • Thanks, that worked.

  • Bear in mind that's not optimal: you are still bringing all the data into driver memory, and setting spark.driver.maxResultSize to 0 disables the memory safety mechanism.


Very likely you are running out of memory by converting with toPandas(). Why don't you just save the CSV using the DataFrame API? Even if you coalesce it to generate a single file (so the write is single-threaded), it won't run out of memory.
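A minimal sketch of that approach (the S3 output path is a placeholder):

# Write the CSV with the Spark DataFrame API instead of collecting to Pandas.
# coalesce(1) produces a single part file, so the write itself is
# single-threaded, but nothing is pulled back into the driver's memory.
(
    df.coalesce(1)
    .write
    .mode("overwrite")
    .option("header", "true")
    .csv("s3://my-bucket/output/")
)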

AWS
EXPERT
answered a year ago
  • Tried that; it did not work either. Well, I can try various other options, but I'm puzzled how the same code works in a Glue notebook without adding any extra capacity.


Excellent response. I was able to get around the issue by adding the Spark configuration / --conf Glue job parameter mentioned above. Thanks a lot.

answered a year ago
