How do I stop a Hadoop or Spark job's user cache so that the cache doesn't use too much disk space in Amazon EMR?


The user cache for my Apache Hadoop or Apache Spark job uses all the disk space on the partition. The Amazon EMR job fails or the HDFS NameNode service is in safe mode.

Short description

On an Amazon EMR cluster, YARN is configured to allow jobs to write cache data to /mnt/yarn/usercache. When you process a large amount of data or run multiple concurrent jobs, the /mnt file system can fill up. This causes NodeManager to fail on some nodes, and the job freezes or fails.

Use one of these methods to resolve this issue:

  • If you don't have long-running or streaming jobs, then adjust the user cache retention settings for YARN NodeManager.
  • If you have long-running or streaming jobs, then scale up the Amazon Elastic Block Store (Amazon EBS) volumes.


Adjust the user cache retention settings for NodeManager

The following attributes define the cache cleanup settings:

  • yarn.nodemanager.localizer.cache.cleanup.interval-ms: This is the cache cleanup interval. The default value is 600,000 milliseconds. After this interval, if the cache size exceeds the value that's set in yarn.nodemanager.localizer.cache.target-size-mb, then files that running containers don't use are deleted.
  • yarn.nodemanager.localizer.cache.target-size-mb: This is the maximum disk space that's allowed for the cache. The default value is 10,240 MB. When the cache disk size exceeds this value, files that running containers don't use are deleted on the interval that's set in yarn.nodemanager.localizer.cache.cleanup.interval-ms.

To set the cleanup interval and maximum disk space size on your cluster, complete the following steps:

  1. Open /etc/hadoop/conf/yarn-site.xml on each core and task node. Reduce the values for yarn.nodemanager.localizer.cache.cleanup.interval-ms and yarn.nodemanager.localizer.cache.target-size-mb. For example:

    sudo vim /etc/hadoop/conf/yarn-site.xml

    <property>
      <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
      <value>400000</value>
    </property>
    <property>
      <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
      <value>5120</value>
    </property>
  2. Run the following commands on each core and task node to restart NodeManager:

    sudo stop hadoop-yarn-nodemanager
    sudo start hadoop-yarn-nodemanager

    Note: In Amazon EMR versions 5.21.0 and later, you can also use a configuration object to override the cluster configuration or specify additional configuration classifications. For more information, see Reconfigure an instance group in a running cluster.

  3. To set the cleanup interval and maximum disk space size on a new cluster at launch, add a configuration object similar to the following one:

          "Classification": "yarn-site",
         "Properties": {
           "yarn.nodemanager.localizer.cache.cleanup.interval-ms": "400000",
           "": "5120"

    The deletion service doesn't delete files that running containers use. This means that even after you adjust the user cache retention settings, data might still spill to the following path and fill the file system:

    {yarn.nodemanager.local-dirs}/usercache/<user>/appcache/<application_id>
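As a sketch of the reconfiguration approach that the note in step 2 mentions, you can apply the same yarn-site classification to a running instance group with the AWS CLI (Amazon EMR versions 5.21.0 and later). The cluster ID and instance group ID below are placeholders, and the interval and size values are the same example values used above:

```shell
# Reconfigure a running core or task instance group (EMR 5.21.0+).
# j-XXXXXXXXXXXXX and ig-XXXXXXXXXXXXX are placeholders; replace them
# with your cluster ID and instance group ID.
aws emr modify-instance-groups \
  --cluster-id j-XXXXXXXXXXXXX \
  --instance-groups '[
    {
      "InstanceGroupId": "ig-XXXXXXXXXXXXX",
      "Configurations": [
        {
          "Classification": "yarn-site",
          "Properties": {
            "yarn.nodemanager.localizer.cache.cleanup.interval-ms": "400000",
            "yarn.nodemanager.localizer.cache.target-size-mb": "5120"
          }
        }
      ]
    }
  ]'
```

With this approach, Amazon EMR applies the new configuration and restarts the affected services for you, so you don't need to edit yarn-site.xml or restart NodeManager on each node manually.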

Scale up the EBS volumes on the EMR cluster nodes

To scale up storage on a running EMR cluster, see Dynamically scale up storage on Amazon EMR clusters.

To scale up storage on a new EMR cluster, specify a larger volume size when you create the EMR cluster. You can also do this when you add nodes to an existing cluster:

  • Amazon EMR versions 5.22.0 and later: The default amount of EBS storage increases based on the size of the Amazon Elastic Compute Cloud (Amazon EC2) instance. For more information about the default amount of storage and number of volumes for each instance type, see Default Amazon EBS storage for instances.
  • Amazon EMR versions 5.21 and earlier: The default EBS volume size is 32 GB, and 27 GB is reserved for the /mnt partition. HDFS, YARN, the user cache, and all applications use the /mnt partition. Increase the size of your EBS volume as needed, such as 100-500 GB. You can also specify multiple EBS volumes that are mounted as /mnt1, /mnt2, and so on.
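As a minimal sketch of specifying a larger volume at launch, the following AWS CLI command attaches one larger EBS volume to each core node. The cluster name, release label, instance types, counts, and the 200 GiB size are example values, not recommendations:

```shell
# Create an EMR cluster whose core nodes each get one 200 GiB gp3 EBS
# volume instead of the default size. All names and sizes are examples.
aws emr create-cluster \
  --name "example-cluster" \
  --release-label emr-6.15.0 \
  --applications Name=Hadoop Name=Spark \
  --use-default-roles \
  --instance-groups '[
    {
      "InstanceGroupType": "MASTER",
      "InstanceType": "m5.xlarge",
      "InstanceCount": 1
    },
    {
      "InstanceGroupType": "CORE",
      "InstanceType": "m5.xlarge",
      "InstanceCount": 2,
      "EbsConfiguration": {
        "EbsBlockDeviceConfigs": [
          {
            "VolumeSpecification": { "VolumeType": "gp3", "SizeInGB": 200 },
            "VolumesPerInstance": 1
          }
        ]
      }
    }
  ]'
```

To mount multiple volumes (/mnt1, /mnt2, and so on), add more entries to EbsBlockDeviceConfigs or increase VolumesPerInstance.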

For Spark streaming jobs, you can also call RDD.unpersist() after you no longer need the data. Or, explicitly call System.gc() in Scala or sc._jvm.System.gc() in Python to start JVM garbage collection and remove the intermediate shuffle files.
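The streaming pattern above can be sketched in PySpark with the DStream API. This assumes a StreamingContext and an input stream already exist elsewhere in your job; the should_trigger_gc helper and the every-10-batches cadence are illustrative choices, not part of Spark's API:

```python
# Illustrative sketch only: unpersist RDDs when you no longer need them,
# and periodically trigger JVM garbage collection from a DStream callback.
# `should_trigger_gc` and the every-10-batches cadence are assumptions.

def should_trigger_gc(batch_index, every=10):
    """Return True on every `every`-th batch (1-based index)."""
    return batch_index % every == 0

def process_batch(rdd, _state={"batch": 0}):
    _state["batch"] += 1
    rdd.persist()
    try:
        # Example work: count the records in this micro-batch.
        print("records in batch:", rdd.count())
    finally:
        # Release cached blocks as soon as the data is no longer needed.
        rdd.unpersist()
    if should_trigger_gc(_state["batch"]):
        # Ask the driver JVM to collect garbage, which lets Spark's
        # cleanup of stale intermediate shuffle files proceed.
        rdd.context._jvm.System.gc()

# Usage in a streaming job, given an existing DStream `stream`:
# stream.foreachRDD(process_batch)
```

Triggering System.gc() too often can hurt throughput, so tie it to a batch cadence or a disk-usage check rather than calling it on every micro-batch.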

AWS OFFICIAL. Updated 5 months ago.
