Command failed with exit code 10


The code works in a Glue notebook but fails in a Glue job (tried both Glue 3.0 and 4.0). The line where it fails is:

df.toPandas().to_csv(<s3_path>,index=False)

There is no detailed message in the Glue logs:

2023-05-19 19:08:37,793 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Unknown error from Python: Error Traceback is not available.

The data frame is admittedly large, about 500 MB specifically, but the same conversion succeeds in a Glue notebook. I'm wondering whether there is some subtle internal difference between a Glue notebook and a Glue job that is not obvious, or whether this is some kind of bug. P.S.: writing to S3 using Databricks works too.

  • Also tried coalesce(1); it still results in the same error in the Glue job.

Asked a year ago · 3274 views
3 Answers
Accepted Answer

This error most likely happens when large datasets move back and forth between Spark tasks and pure Python operations: the data has to be serialized between Spark's JVM and the Python process. My suggestion in this regard is to process your dataset in separate batches, in other words, process less data per job run so that the Spark-to-Python serialization doesn't take too long or fail.
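
As an illustration only, here is a minimal sketch of batching within a single run. It assumes a hypothetical low-cardinality column batch_id to split on and a placeholder S3 prefix, and, as in the original code, it assumes pandas can write directly to an s3:// path (s3fs/fsspec available on the workers' driver):

batch_keys = [row["batch_id"] for row in df.select("batch_id").distinct().collect()]

for key in batch_keys:
    batch_df = df.filter(df["batch_id"] == key)
    # Only one batch at a time is serialized from the Spark JVM to the Python driver.
    batch_df.toPandas().to_csv(f"s3://your-bucket/output/batch_{key}.csv", index=False)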

I can also understand that your Glue job fails while the same code works in a Glue notebook. In general, though, there is no such difference in the Spark session between a Glue job and a Glue notebook. To compare, you can run your Glue notebook as a Glue job. To get a better understanding of this behavior, I would suggest opening a support case with AWS using the link here.

Further, you can try the workaround below for df.toPandas() by using the following Spark configuration in the Glue job. You can pass it as a key-value pair in the Glue job parameters.

Key : --conf

Value : spark.sql.execution.arrow.pyspark.enabled=true --conf spark.driver.maxResultSize=0
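
If you prefer to set this programmatically when launching a run, here is a sketch; the job name is a placeholder, and the Arguments value simply mirrors the key/value pair above:

import boto3

glue = boto3.client("glue")

# Placeholder job name; --conf is passed as a Glue job argument.
glue.start_job_run(
    JobName="my-glue-job",
    Arguments={
        "--conf": "spark.sql.execution.arrow.pyspark.enabled=true "
                  "--conf spark.driver.maxResultSize=0"
    },
)

Enabling Arrow makes the toPandas() transfer columnar and much cheaper, while maxResultSize=0 removes the cap on the result size collected to the driver.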

AWS
Answered a year ago
  • Thanks, that worked.

  • Bear in mind that's not optimal: you are still bringing all the data into driver memory and disabling the memory safety mechanism by setting maxResultSize to 0.


Very likely you are running out of memory by converting with toPandas(). Why don't you just save the CSV using the DataFrame API? Even if you coalesce it to generate a single file (so it's single-threaded processing), it won't run out of memory.
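
For example, a minimal sketch of writing directly with the DataFrame API; the bucket/prefix and options are placeholders, and note that Spark writes a directory containing a part file rather than a single named CSV:

(
    df.coalesce(1)                       # one partition -> a single part file
      .write
      .mode("overwrite")
      .option("header", "true")
      .csv("s3://your-bucket/output/")   # placeholder output prefix
)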

AWS
EXPERT
Answered a year ago
  • Tried that; it did not work either. Well, I can try various other options, but I'm puzzled how the same code works in a Glue notebook without adding any extra capacity.


Excellent response. I was able to get around the issue by adding the Spark configuration / --conf Glue job parameter mentioned above. Thanks a lot.

Answered a year ago
