S3 shuffle storage for EMR Serverless


We are running into "No space left on device" errors in EMR Serverless on big jobs, even with the driver and executor disk size set to the maximum of 200 GB.

I tried to get the S3 shuffle storage plug-in for Glue working in EMR Serverless, using a custom image and the appropriate Spark configuration.

https://docs.aws.amazon.com/glue/latest/dg/cloud-shuffle-storage-plugin.html
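For reference, a minimal sketch of the kind of configuration involved, written as a small Python helper that assembles the `--conf` flags for a job's Spark submit parameters. The plug-in class name and the `spark.shuffle.storage.path` key are taken from the linked Glue documentation as I understand it; the helper name and the bucket path are hypothetical placeholders, and whether EMR Serverless honors these settings is exactly what is in question here.

```python
def build_spark_submit_parameters(shuffle_path: str, disk_size: str = "200G") -> str:
    """Return a string of --conf flags for an EMR Serverless Spark job.

    Hypothetical helper for illustration; it only formats text and does not
    talk to AWS.
    """
    conf = {
        # Local disk per worker; 200G is the maximum mentioned in the question.
        "spark.emr-serverless.driver.disk": disk_size,
        "spark.emr-serverless.executor.disk": disk_size,
        # Plug-in class and shuffle location, as described in the Glue docs.
        "spark.shuffle.sort.io.plugin.class": (
            "com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin"
        ),
        "spark.shuffle.storage.path": shuffle_path,
    }
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


if __name__ == "__main__":
    # "s3://my-shuffle-bucket/prefix/" is a placeholder, not a real bucket.
    print(build_spark_submit_parameters("s3://my-shuffle-bucket/prefix/"))
```

The plug-in jar itself still has to be on the driver and executor classpath, which is what the custom image is for.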

But it fails with this error:

Receiver class com.amazonaws.spark.shuffle.io.cloud.CloudShuffleMapOutputWriter does not define or inherit an implementation of the resolved method 'abstract org.apache.spark.shuffle.api.metadata.MapOutputCommitMessage commitAllPartitions(long[])' of interface org.apache.spark.shuffle.api.ShuffleMapOutputWriter.

It looks like the plug-in does not implement a required method. But this method has been in Spark since version 3.0, and the plug-in is supposedly compatible with Spark 3.3. We are using Spark 3.4 via EMR Serverless 6.15, though, as that's the minimum release with interactive endpoints in EMR Serverless, and we want to use this from notebooks.
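One rough way to check what the jar actually contains is to look inside the compiled class for the method name and the JVM descriptor that the error message refers to. The sketch below does this with Python's standard `zipfile` module; the function name is hypothetical and the class path is derived from the class named in the error. It is only a heuristic, since a class file can mention a descriptor without defining the method, so `javap -p` on the extracted class is the more reliable check. One possible explanation, which I have not verified, is that the jar was compiled against a Spark version whose `ShuffleMapOutputWriter` declared `commitAllPartitions` with a different signature.

```python
import zipfile

# Path of the class inside the jar, derived from the class named in the error.
CLASS_ENTRY = "com/amazonaws/spark/shuffle/io/cloud/CloudShuffleMapOutputWriter.class"
METHOD_NAME = b"commitAllPartitions"
# JVM descriptor for: MapOutputCommitMessage commitAllPartitions(long[])
DESCRIPTOR = b"([J)Lorg/apache/spark/shuffle/api/metadata/MapOutputCommitMessage;"


def mentions_commit_method(jar, class_entry: str = CLASS_ENTRY) -> bool:
    """Return True if the class bytes contain both the name and the descriptor.

    `jar` may be a file path or a binary file-like object. Raises KeyError if
    the class is not present in the archive.
    """
    with zipfile.ZipFile(jar) as archive:
        class_bytes = archive.read(class_entry)
    return METHOD_NAME in class_bytes and DESCRIPTOR in class_bytes
```

A `False` result for the plug-in jar would be consistent with the error above, while `True` would suggest the problem lies elsewhere, for example in which jar ends up on the classpath.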

Does anybody know if there is a way to get S3 shuffle storage working in EMR Serverless, or if the plug-in is going to be updated to implement the missing method(s)?

tomups
asked 2 months ago · 187 views
1 Answer

It looks like you're having trouble with the S3 shuffle storage plug-in for Glue in your EMR Serverless setup with Spark 3.4. The plug-in may not be fully compatible with Spark 3.4, even though it is described as compatible with Spark 3.3.

You may want to try these suggestions to see if they help resolve the issue:

  1. If possible, consider downgrading to Spark 3.3 to ensure compatibility with the plugin. However, this might not be feasible if you require features exclusive to Spark 3.4 or if you need interactive endpoints available in EMR Serverless.
  2. If downgrading is not an option, you might need to implement a custom version of the plugin that is compatible with Spark 3.4. This could involve modifying the source code of the plugin to add the missing method implementation.

I can't guarantee that either option will work, but exploring them might help you find a way to get S3 shuffle storage working in your EMR Serverless environment.

Expert
answered 2 months ago
  • Unfortunately, the plug-in is made by Amazon and is closed source, so there is no way for me to fix it myself.

    I am not even sure it will work with Spark 3.3, as 3.3 has the same method that fails. It has existed since Spark 3.0. I think there must be some difference between the Spark running in Glue and the Spark in EMR Serverless that makes the plug-in work for Glue only.
