I want to resolve the excessive disk pressure on my Amazon Elastic Kubernetes Service (Amazon EKS) worker nodes.
Short description
Disk pressure on Kubernetes worker nodes occurs when the available disk space drops to low levels. To mitigate disk pressure, increase the EBS volume size, or add new worker nodes to the node group to provide more disk capacity. You can also adjust Kubelet image garbage collection thresholds, configure container runtime log rotation, and manually clean up unused images to effectively manage disk usage.
It's important to monitor and manage the ephemeral storage to prevent disk pressure issues and to make sure your Kubernetes clusters correctly run.
If you have excessive disk pressure, then you might receive an error message.
For the node, you might receive an error message similar to the following:
"Warning DiskPressure ... kubelet, worker-node DiskPressure on node"
For Kubelet, you might receive an error message similar to the following:
"Disk usage on image filesystem is over the high threshold, trying to free bytes down to the low threshold"
To confirm the issue, run the df -h command to check the disk usage on the worker node.
Example output:
/dev/nvme0n1p1 20G 18G 2.2G 90% /
........
Note: In the preceding example, the root filesystem /dev/nvme0n1p1 is at 90% usage. This high usage causes disk pressure and image garbage collection failures.
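To check usage in a repeatable way, you can parse the df output on the node. The following is a minimal sketch; the 85% threshold is an illustrative value, not a kubelet default.

```shell
# Check root filesystem usage on the worker node and warn when it
# exceeds a threshold. Run this on the node (for example, over SSH
# or AWS Systems Manager Session Manager).
THRESHOLD=85   # illustrative value; adjust for your environment
usage=$(df -h / | awk 'NR==2 {print $5}' | tr -d '%')
if [ "$usage" -ge "$THRESHOLD" ]; then
  echo "WARNING: root filesystem at ${usage}% - disk pressure likely"
else
  echo "OK: root filesystem at ${usage}%"
fi
```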
Resolution
Increase the EBS volume size
To increase the available disk space, increase the size of the EBS volume that's attached to the worker node. For more information, see How do I increase or decrease the size of my EBS volume?
Note: Rather than resize your ephemeral volumes, it's a best practice to provision new instances with the desired disk size. Use the new instances to replace the old instances. Then, the new instances have the correct disk size and align with the immutable infrastructure mindset.
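The resize flow can be sketched with the following commands. The volume ID, the 100-GiB target size, and the NVMe device names are placeholder values for your environment.

```shell
# 1) Resize the EBS volume (can be done while the volume is in use).
#    vol-0abcd1234ef567890 and --size 100 are placeholder values.
aws ec2 modify-volume --volume-id vol-0abcd1234ef567890 --size 100

# 2) On the worker node, grow the partition to use the new space.
sudo growpart /dev/nvme0n1 1

# 3) Grow the filesystem. Use xfs_growfs for XFS (the default on
#    Amazon Linux 2) or resize2fs for ext4.
sudo xfs_growfs -d /
```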
Add a new worker node to the node group
Add a new node to the node group or pool to increase the cluster's total disk capacity and distribute the workload across multiple nodes. This alleviates the disk pressure on any single node.
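If you use eksctl with managed node groups, scaling can be done with one command. The cluster name, node group name, and node count below are placeholder values.

```shell
# Scale a managed node group to 4 nodes. my-cluster and my-nodegroup
# are placeholder names for your environment.
eksctl scale nodegroup --cluster my-cluster --name my-nodegroup --nodes 4
```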
Set up custom thresholds for the Kubelet garbage collector
Configure the Kubelet to initiate image garbage collection at different disk usage thresholds. To do this, set the --image-gc-high-threshold and --image-gc-low-threshold arguments to allow for more buffer. For example, set the high threshold to 70% and the low threshold to 60% to maintain a larger buffer of free disk space. This configuration allows the Kubelet to perform image garbage collection before the disk becomes critically full.
For more information on how to configure these arguments on your worker nodes, see How do I configure Amazon EKS worker nodes to clean up the image cache at a specified percent of disk usage?
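The same thresholds can also be set in the kubelet configuration file instead of as command-line arguments. The following fragment uses the KubeletConfiguration field names; the configuration file path (for example, /etc/kubernetes/kubelet/kubelet-config.json on EKS optimized AMIs) depends on your AMI and bootstrap settings.

```json
{
  "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "imageGCHighThresholdPercent": 70,
  "imageGCLowThresholdPercent": 60
}
```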
Check and adjust container runtime log rotation
Containerized applications write logs to stdout and stderr, and log files are managed by the container runtime. It's a best practice to adjust the log rotation settings for the container runtime on your worker nodes.
To configure your log rotation when containerd is the container runtime, note that the kubelet manages container log rotation. Set the containerLogMaxFiles and containerLogMaxSize parameters in the kubelet configuration to control the maximum number of rotated log files and the maximum size of each log file.
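The following fragment is a minimal sketch of these settings in the kubelet configuration file. The 10Mi size and the file count of 5 are illustrative values; the configuration file path depends on your AMI and bootstrap settings.

```json
{
  "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "containerLogMaxSize": "10Mi",
  "containerLogMaxFiles": 5
}
```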
Clean up unused container images
If the Kubelet can't perform an image garbage collection automatically, then manually clean up unused container images on the worker node. To remove dangling or unused images and free up significant disk space, use the crictl rmi --prune command.
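The cleanup can be sketched as follows. Run these commands on the worker node as root; the --prune flag requires a recent crictl version.

```shell
# List the images currently cached by the container runtime.
sudo crictl images

# Remove all images not referenced by a running container.
sudo crictl rmi --prune
```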
Manage your ephemeral storage
For long-term stability and smooth operation of your applications, it's a best practice to properly manage ephemeral storage. If you don't set limits on ephemeral storage, then a pod might consume the entire disk space on the node it runs on.
To mitigate this risk, set appropriate ephemeral storage requests and limits for your pods. Take the following actions to determine the limits: