How do I resolve "no space left on device" stage failures in Spark on Amazon EMR?

I submitted an Apache Spark application to an Amazon EMR cluster. The application fails with a "no space left on device" stage failure similar to the following:

Job aborted due to stage failure: Task 31 in stage 8.0 failed 4 times, most recent failure: Lost task 31.3 in stage 8.0 (TID 2036, ip-xxx-xxx-xx-xxx.compute.internal, executor 139): org.apache.spark.memory.SparkOutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@1a698b89 : No space left on device

Short description

Spark uses local disks on the core and task nodes to store intermediate data. If the disks run out of space, then the job fails with a "no space left on device" error. Use one of the following methods to resolve this error:

  • Add more Amazon Elastic Block Store (Amazon EBS) capacity.
  • Add more Spark partitions.
  • Use a bootstrap action to dynamically scale up storage on the core and task nodes. For more information and an example bootstrap action script, see Dynamically scale up storage on Amazon EMR clusters.
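
For example, the following is a minimal sketch of attaching a storage-scaling bootstrap action when you create a cluster with the AWS CLI. The release label, instance settings, and S3 path are placeholders; use your own copy of the script from the linked article:

# Hypothetical values throughout; replace the S3 path with your own copy of the scaling script.
aws emr create-cluster \
  --name "spark-cluster" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://amzn-s3-demo-bucket/bootstrap/scale_up_storage.sh,Name="ScaleUpStorage"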

Resolution

Add more EBS capacity

For new clusters: use larger EBS volumes

Launch an Amazon EMR cluster and choose an Amazon Elastic Compute Cloud (Amazon EC2) instance type with larger EBS volumes. For more information about the amount of storage and number of volumes allocated for each instance type, see Default Amazon EBS storage for instances.
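
For example, the following is a minimal sketch of requesting larger EBS volumes for the core instance group at cluster creation. The release label, instance types, volume type, sizes, and counts are placeholder values to adjust for your workload:

# Hypothetical example: request two 200 GiB gp3 volumes per core node.
aws emr create-cluster \
  --name "spark-cluster" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-groups '[
    {"InstanceGroupType":"MASTER","InstanceType":"m5.xlarge","InstanceCount":1},
    {"InstanceGroupType":"CORE","InstanceType":"m5.xlarge","InstanceCount":2,
     "EbsConfiguration":{"EbsBlockDeviceConfigs":[
       {"VolumeSpecification":{"VolumeType":"gp3","SizeInGB":200},"VolumesPerInstance":2}]}}]'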

For running clusters: add more EBS volumes

1.    If larger EBS volumes don't resolve the problem, attach more EBS volumes to the core and task nodes.
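
For example, the following is a minimal sketch of creating and attaching an additional volume with the AWS CLI. The Availability Zone, size, volume ID, instance ID, and device name are placeholders:

# Hypothetical example: create a 200 GiB gp3 volume in the node's Availability Zone.
aws ec2 create-volume \
  --availability-zone us-east-1a \
  --size 200 \
  --volume-type gp3

# Attach the new volume to the core or task node's EC2 instance.
aws ec2 attach-volume \
  --volume-id vol-0123456789abcdef0 \
  --instance-id i-0123456789abcdef0 \
  --device /dev/sdf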

2.    Connect to the node using SSH.

3.    Format and mount the attached volumes. Be sure to mount each new volume at the correct disk number (for example, /mnt1 or /mnt2 instead of /data).
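
The following is a minimal sketch of formatting and mounting a newly attached volume on the node. The device name /dev/nvme1n1 is an assumption; run lsblk to find the actual device, and note that formatting destroys any existing data on the volume:

lsblk                           # identify the new device (for example, /dev/nvme1n1)
sudo mkfs -t xfs /dev/nvme1n1   # format the new volume
sudo mkdir -p /mnt2             # create the mount point for the next disk number
sudo mount /dev/nvme1n1 /mnt2   # mount the volume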

4.    Create a /mnt2/yarn directory, and then set ownership of the directory to the YARN user:

sudo mkdir /mnt2/yarn
sudo chown yarn:yarn /mnt2/yarn

5.    Add the /mnt2/yarn directory to the yarn.nodemanager.local-dirs property in /etc/hadoop/conf/yarn-site.xml. Example:

<property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/mnt/yarn,/mnt1/yarn,/mnt2/yarn</value>
</property>

6.    Restart the NodeManager service:

sudo stop hadoop-yarn-nodemanager
sudo start hadoop-yarn-nodemanager
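
If the cluster runs an Amazon EMR release version that manages services with systemd (5.30.0 and later), use systemctl instead:

sudo systemctl stop hadoop-yarn-nodemanager
sudo systemctl start hadoop-yarn-nodemanager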

Add more Spark partitions

Depending on how many core and task nodes are in the cluster, consider increasing the number of Spark partitions. More partitions mean smaller partitions, so each task writes less intermediate shuffle and spill data to local disk. Use the following Scala code to add more Spark partitions:

val numPartitions = 500                    // tune for your data volume and cluster size
val newDF = df.repartition(numPartitions)  // redistribute the data across more, smaller partitions
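
As an alternative to changing code, you can raise the partition counts that Spark uses for shuffles at submit time. The following is a minimal sketch; the class name, JAR path, and values of 500 are placeholders to tune for your job:

# Hypothetical example: increase shuffle and default parallelism for one job run.
spark-submit \
  --class com.example.MySparkJob \
  --conf spark.sql.shuffle.partitions=500 \
  --conf spark.default.parallelism=500 \
  s3://amzn-s3-demo-bucket/jars/my-spark-job.jar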

Related information

How can I troubleshoot stage failures in Spark jobs on Amazon EMR?

AWS OFFICIAL
Updated 3 years ago
3 Comments

Strange, but I get the same error on EMR Serverless, and when I try to increase repartitioning I still get the same error

replied 2 years ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS EXPERT
replied 2 years ago

"Strange, but I get the same error on EMR serverless and when I try to increase repartitioning still get the same error"

You can use the "spark.emr-serverless.driver.disk" and "spark.emr-serverless.executor.disk" job parameters to avoid this error. See https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/jobs-spark.html
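
For example, a rough sketch of passing those parameters with the AWS CLI; the application ID, role ARN, entry point, and disk sizes are placeholders:

# Hypothetical example: raise driver and executor disk for an EMR Serverless Spark job.
aws emr-serverless start-job-run \
  --application-id 00abc123def456789 \
  --execution-role-arn arn:aws:iam::111122223333:role/EMRServerlessJobRole \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://amzn-s3-demo-bucket/scripts/my_job.py",
      "sparkSubmitParameters": "--conf spark.emr-serverless.driver.disk=100g --conf spark.emr-serverless.executor.disk=100g"
    }
  }'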

replied 2 years ago