Why is the core node in my Amazon EMR cluster running out of disk space?


I'm running Apache Spark jobs on an Amazon EMR cluster, and the core node is almost out of disk space.


Determine which core nodes are unhealthy

Nodes that have at least one Amazon Elastic Block Store (Amazon EBS) volume attached are considered unhealthy if they hit more than 90% disk utilization. To determine which nodes might have reached 90% disk utilization, do the following:

1.    Check the Amazon CloudWatch metric MRUnhealthyNodes. This metric reports the number of unhealthy nodes in an EMR cluster.

Note: You can create a CloudWatch alarm to monitor the MRUnhealthyNodes metric.

2.    Connect to the primary node and access the instance controller log at /emr/instance-controller/log/instance-controller.log. In the instance controller log, search for InstanceJointStatusMap to identify which nodes are unhealthy.

For more information, see High disk utilization in How do I resolve ExecutorLostFailure "Slave lost" errors in Spark on Amazon EMR?
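The search can be scripted. The following is a minimal sketch; the sample line is a hypothetical stand-in, because the exact log format varies by EMR release (on a real primary node, you would grep the log file itself):

```shell
# On a real primary node you would run:
#   grep -i InstanceJointStatusMap /emr/instance-controller/log/instance-controller.log | tail -n 1
# The line below is a hypothetical stand-in so the pipeline can be shown end to end.
sample='2024-05-01 12:00:00 INFO InstanceJointStatusMap contains 3 entries (F:1 R:2)'
echo "$sample" | grep -o 'InstanceJointStatusMap'
```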

3.    Log in to the core nodes and then run the following command to determine if a mount has high utilization:

df -h
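To flag file systems above the 90% unhealthy threshold at a glance, you can filter the df output. The following is a minimal sketch, shown against a canned sample because real output comes from your node:

```shell
# Hypothetical df -h output; on a node you would pipe the real command:
#   df -h | awk 'NR>1 && $5+0 > 90 {print $6, $5}'
sample='Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       10G  4.2G  5.8G  42% /
/dev/xvdb1      100G   95G  5.0G  95% /mnt'

# Print the mount point and utilization for anything over the 90% threshold
echo "$sample" | awk 'NR>1 && $5+0 > 90 {print $6, $5}'
```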

Remove unnecessary local and temporary Spark application files

When you run Spark jobs, Spark applications create local files that can consume the remaining disk space on the core node. If the df -h command shows that a mount such as /mnt is using more than 90% of its disk space, check which directories or files have high utilization.

Run the following command on the core node to see the top 10 directories that are using the most disk space:

cd /mnt
sudo du -hsx * | sort -rh | head -10

If the /mnt/hdfs directory has high utilization, check HDFS usage and remove any unnecessary files, such as log files. Reducing the retention period cleans log files from HDFS automatically. The following commands report overall HDFS usage and the size of a specific directory:

hdfs dfsadmin -report
hadoop fs -du -s -h /path/to/dir

Reduce the retention period for Spark event and YARN container logs

A common cause of HDFS usage is the /var/log directory. The /var/log directory is where log files such as Spark event logs and YARN container logs are stored. You can change the retention period for these files to save space.

The following example command displays the usage of the /var/log/spark directory.

Note: /var/log/spark is the default Spark event log directory.

hadoop fs -du -s -h /var/log/spark

Reduce the default retention period for Spark job history files

Spark job history files are located in /var/log/spark/apps by default. When the file system history cleaner runs, Spark deletes job history files older than seven days. To reduce the default retention period, do the following:

On a running cluster:

1.    Connect to the primary node using SSH.

2.    Add or update the following values in /etc/spark/conf/spark-defaults.conf. The following configuration runs the cleaner every 12 hours and deletes files that are more than one day old. You can customize this time period for your use case in the spark.history.fs.cleaner.interval and spark.history.fs.cleaner.maxAge parameters.

spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 12h
spark.history.fs.cleaner.maxAge 1d

3.    Restart the Spark History Server. For example, on Amazon Linux 2-based EMR releases, run sudo systemctl stop spark-history-server and then sudo systemctl start spark-history-server.

During cluster launch:

Use the following configuration. You can customize the time period for your use case in the spark.history.fs.cleaner.interval and spark.history.fs.cleaner.maxAge parameters.

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.history.fs.cleaner.enabled": "true",
      "spark.history.fs.cleaner.interval": "12h",
      "spark.history.fs.cleaner.maxAge": "1d"
    }
  }
]

For more information on these parameters, see Monitoring and instrumentation in the Spark documentation.

Reduce the default retention period of YARN container logs

Spark application logs, which are the YARN container logs for your Spark jobs, are located in /var/log/hadoop-yarn/apps on the core node. Spark moves these logs to HDFS when the application is finished running. By default, YARN keeps application logs on HDFS for 48 hours. To reduce the retention period:

1.    Connect to the primary, core, or task nodes using SSH.

2.    Open the /etc/hadoop/conf/yarn-site.xml file on each node in your Amazon EMR cluster (primary, core, and task nodes).

3.    Reduce the value of the yarn.log-aggregation.retain-seconds property on all nodes.

4.    Restart the ResourceManager daemon. For more information, see Viewing and restarting Amazon EMR and application processes.
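The yarn.log-aggregation.retain-seconds property takes a value in seconds. As a quick sanity check, the 48-hour default and a reduced 12-hour example convert as follows:

```shell
# yarn.log-aggregation.retain-seconds is specified in seconds
echo $((48 * 60 * 60))   # the 48-hour default
echo $((12 * 60 * 60))   # a reduced 12-hour example value
```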

You can also reduce the retention period by reconfiguring the cluster. For more information, see Reconfigure an instance group in a running cluster.
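A minimal reconfiguration sketch for that approach, in the same classification format shown earlier for spark-defaults; the 43200-second (12-hour) value is an example, not a recommendation:

```json
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.log-aggregation.retain-seconds": "43200"
    }
  }
]
```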

Reduce /mnt/yarn usage

If the /mnt/yarn directory is highly utilized, adjust the user cache retention or scale the EBS volumes on the node. For more information, see How can I prevent a Hadoop or Spark job's user cache from consuming too much disk space in Amazon EMR?

Resize the cluster or scale Amazon EMR

Add more core nodes to mitigate HDFS space issues. Add core or task nodes if directories other than the HDFS directories are filling up. For more information, see Scaling cluster resources.

You can also extend the EBS volumes on existing nodes or use a dynamic scaling script to add storage automatically.

Related information

Configure cluster hardware and networking

HDFS configuration

Work with storage and file systems

AWS OFFICIAL | Updated a year ago