How do I troubleshoot an EC2 Linux instance that fails a status check because of resource over-usage?
My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance failed its instance status check because it no longer has available resources.
Short description
Your instance might fail its status health check because of resource usage for the following reasons:
- Your instance's CPU utilization is close to 100%, and the instance doesn't have enough compute capacity left for the kernel to run.
- The root device is 100% full and doesn't allow other processes to complete or begin.
- The processes that run on the instance use all of its memory and don't allow the kernel to run.
Resolution
Important: Before you stop and start your instance, take the following actions.
- Create a snapshot of your Amazon Elastic Block Store (Amazon EBS) volume.
  Note: If your instance is instance store-backed or has instance store volumes that contain data, then Amazon EC2 deletes the data when you stop the instance.
- Temporarily remove the instance from its Amazon EC2 Auto Scaling group.
  Note: If you stop an instance that's in an Amazon EC2 Auto Scaling group, then Amazon EC2 Auto Scaling might terminate the instance based on your scale-in protection settings. Instances that you launch with Amazon EMR, AWS CloudFormation, or AWS Elastic Beanstalk might be in an Auto Scaling group.
- Set the instance shutdown behavior to Stop to make sure that the instance doesn't terminate when you stop it.
Stop and start your instance to force the kernel to stop running processes. This is a temporary solution to return resources to the operating system (OS). To fix the over-usage issues, take the following actions to address the root cause.
Note: When you stop and start an instance, the instance's public IP address changes. It's a best practice to use an Elastic IP address to route external traffic to your instance instead of a public IP address.
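The preparation and stop-start sequence above can be sketched with the AWS CLI. This is a minimal sketch, not a definitive procedure: it assumes that the AWS CLI is installed and configured, that the instance has a single EBS volume to back up, and that you already removed the instance from any Auto Scaling group. The function name safe_stop_start and the IDs in the usage comment are hypothetical placeholders.

```shell
# Sketch: back up the volume, protect against termination, then stop and start.
# Assumes a configured AWS CLI. The function name and IDs are placeholders.
safe_stop_start() {
  local instance_id="$1" volume_id="$2"

  # Start a snapshot of the EBS volume before touching the instance.
  aws ec2 create-snapshot --volume-id "$volume_id" \
    --description "Backup before stop-start of $instance_id" &&

  # Make an OS-initiated shutdown stop the instance instead of terminating it.
  aws ec2 modify-instance-attribute --instance-id "$instance_id" \
    --attribute instanceInitiatedShutdownBehavior --value stop &&

  # Stop the instance, wait until it is fully stopped, then start it again.
  aws ec2 stop-instances --instance-ids "$instance_id" &&
  aws ec2 wait instance-stopped --instance-ids "$instance_id" &&
  aws ec2 start-instances --instance-ids "$instance_id" &&
  aws ec2 wait instance-running --instance-ids "$instance_id"
}

# Usage (placeholder IDs):
#   safe_stop_start i-0123456789abcdef0 vol-0123456789abcdef0
```

The commands are chained with && so that a failed snapshot request prevents the stop. The snapshot call returns before the snapshot completes, so check the snapshot's state separately if you need a finished backup before you continue.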
Check the instance's CloudWatch CPU metrics
Check whether the instance's Amazon CloudWatch CPUUtilization metric is at or near 100%. If it is, then reboot your instance to return the instance to a healthy status. If the problem still occurs after reboot, then the instance's CPU requirements are higher than what your instance type offers.
To resolve this issue, change your instance type to one with higher CPU availability.
If your instance is a burstable performance instance, such as T2, T3, or T3a, then check its CPUCreditBalance metric. If the credit balance is close to zero, then Amazon EC2 throttles the instance CPU. If you set the instance's Credit specification to Standard, then change the specification to Unlimited.
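If you prefer the command line to the CloudWatch console, then the following sketch shows one way to run the same checks. It assumes a configured AWS CLI. The function names cpu_peak and set_unlimited_credits, the 5-minute period, and the IDs and timestamps in the usage comments are illustrative choices, not values from this article.

```shell
# Sketch: print the peak CPUUtilization for each 5-minute period in a time window.
# Timestamps are ISO 8601 in UTC, for example 2024-05-01T12:00:00Z.
cpu_peak() {
  local instance_id="$1" start_time="$2" end_time="$3"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=$instance_id" \
    --start-time "$start_time" --end-time "$end_time" \
    --period 300 --statistics Maximum \
    --query 'sort_by(Datapoints,&Timestamp)[].[Timestamp,Maximum]' \
    --output text
}

# Sketch: switch a burstable instance from Standard to Unlimited credits.
set_unlimited_credits() {
  local instance_id="$1"
  aws ec2 modify-instance-credit-specification \
    --instance-credit-specifications "InstanceId=$instance_id,CpuCredits=unlimited"
}

# Usage (placeholder values):
#   cpu_peak i-0123456789abcdef0 2024-05-01T12:00:00Z 2024-05-01T13:00:00Z
#   set_unlimited_credits i-0123456789abcdef0
```

To check credits instead of utilization, replace CPUUtilization with CPUCreditBalance in the first function.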
Check the instance's system log for errors
Check the system log for errors such as "No space left on device" or "Out of memory."
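As a rough illustration, you can also pull the system log with the AWS CLI and filter it for these messages. The sketch assumes a configured AWS CLI. The function name scan_system_log is hypothetical, and the pattern list is only a starting point that you can extend.

```shell
# Sketch: read a system log on standard input and print matching lines
# with their line numbers. Returns a nonzero status when nothing matches.
scan_system_log() {
  grep -i -n -E 'no space left on device|out of memory|page allocation failure'
}

# Usage (placeholder instance ID):
#   aws ec2 get-console-output --instance-id i-0123456789abcdef0 \
#     --query Output --output text | scan_system_log
```

The filter reads from standard input, so you can also run it against a log that you copied from the console.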
Troubleshoot the "No space left on device" error
If the file system that contains a listed folder is full, then you receive an error similar to the following example:
"OSError: [Errno 28] No space left on device '/var/lib/'"
In the preceding example, /var/lib is full.
To free up space, use the EC2 Serial Console to connect to supported Nitro-based instance types and supported bare metal instances. Then, delete the unneeded files. The EC2 Serial Console doesn't require a working network connection to your instance.
If you haven't used the EC2 Serial Console before, then make sure that you adhere to the prerequisites. If your instance is unreachable and you haven't already configured access to the serial console, then you can't use the EC2 Serial Console.
If you can't use the EC2 Serial Console, then complete the following steps to launch a rescue instance and remove unneeded files:
1. Launch a new rescue instance in your virtual private cloud (VPC). Use the same Amazon Machine Image (AMI) and the same Availability Zone as the instance that failed its status check.
   Note: You can also use an existing instance that's in the same Availability Zone and uses the same AMI as the original instance.
2. Detach the Amazon EBS root volume, such as /dev/xvda or /dev/sda1, from the original instance. Note the device name of your root volume.
3. Attach the volume as a /dev/sdf secondary device to the rescue instance.
4. To create a mount point directory for the volume that you attached to the rescue instance, run the following command:
   sudo mkdir /rescue
   Note: Replace /rescue with your mount point directory name.
5. As the root user, run the following commands to identify the correct device name:
   sudo -i
   lsblk
   Note: The device that's attached to the rescue instance might have a different device name.
6. To mount the volume to the new directory, run the following command:
   sudo mount /dev/xvdf1 /rescue
   Note: Replace /dev/xvdf1 with your root volume's device name and /rescue with your mount point directory name. If you receive an error when you run the preceding command, then see Why can't I mount my Amazon EBS volume?
7. To identify the files that take up the most space, run the following command:
   du -shcm /rescue/var/lib/* | sort -n
8. To free up space, delete large files that you don't need.
9. To unmount the secondary device from your rescue instance, run the following command:
   sudo umount /rescue
   Note: Replace /rescue with your mount point directory name. If the unmount operation fails, then stop or reboot the rescue instance. Then, rerun the preceding command.
10. Detach the secondary volume from the rescue instance.
11. Attach the volume to the original instance as the /dev/xvda or /dev/sda1 root volume.
12. Start the instance, and then verify that the instance is responsive.
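The detach, attach, and disk usage steps above can be sketched as two small helpers. This is a sketch under stated assumptions: the AWS CLI is configured, the original instance is already stopped, and the volume is mounted at /rescue on the rescue instance. The function names move_volume and largest_entries and the IDs in the usage comments are hypothetical.

```shell
# Sketch: detach a volume from its current instance and attach it to
# another instance under the given device name.
move_volume() {
  local volume_id="$1" target_instance="$2" device="$3"
  aws ec2 detach-volume --volume-id "$volume_id" &&
  aws ec2 wait volume-available --volume-ids "$volume_id" &&
  aws ec2 attach-volume --volume-id "$volume_id" \
    --instance-id "$target_instance" --device "$device" &&
  aws ec2 wait volume-in-use --volume-ids "$volume_id"
}

# Sketch: list the largest entries directly under a directory,
# biggest first, with sizes in MiB. Defaults to the top 10 entries.
largest_entries() {
  local dir="$1" count="${2:-10}"
  du -sm "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Usage (placeholder IDs):
#   move_volume vol-0123456789abcdef0 i-0aaaaaaaaaaaaaaaa /dev/sdf   # to the rescue instance
#   largest_entries /rescue/var/lib 5
#   move_volume vol-0123456789abcdef0 i-0bbbbbbbbbbbbbbbb /dev/xvda  # back to the original
```

Unmount the volume on the rescue instance before you move it back. The wait commands keep the script from attaching a volume that is still detaching.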
If you still don't have enough storage available, then complete the following steps to resize the root EBS volume:
- Request an EBS volume size modification.
- Follow steps 1-8 of the preceding section to launch a rescue instance.
- Extend the Linux file system.
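The resize can be sketched as follows. The sketch assumes a configured AWS CLI, a root partition such as /dev/xvda1, and an XFS or ext file system. The function names resize_volume and grow_fs_command are hypothetical, and the second function only prints the command so that you can review it before you run it.

```shell
# Sketch: request a larger size (in GiB) for an EBS volume.
resize_volume() {
  local volume_id="$1" size_gib="$2"
  aws ec2 modify-volume --volume-id "$volume_id" --size "$size_gib"
}

# After the modification takes effect, extend the partition, for example:
#   sudo growpart /dev/xvda 1

# Sketch: print (without running) the command that grows the file system.
# XFS grows by mount point. The ext family grows by partition device.
grow_fs_command() {
  local fstype="$1" partition="$2" mount_point="$3"
  case "$fstype" in
    xfs)            echo "sudo xfs_growfs -d $mount_point" ;;
    ext2|ext3|ext4) echo "sudo resize2fs $partition" ;;
    *)              echo "unsupported file system: $fstype" >&2; return 1 ;;
  esac
}

# Usage (placeholder values):
#   resize_volume vol-0123456789abcdef0 50
#   grow_fs_command xfs /dev/xvda1 /
```

To find the file system type of a mounted partition, you can run df -hT or lsblk -f.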
Troubleshoot the "Out of memory" error
If your instance is out of memory, then you receive the following error:
"Out of memory: kill process"
When the instance is out of memory, the kernel doesn't have enough memory to run, so it ends other processes to free memory. For troubleshooting steps, see Out of memory: kill process.
To check your memory logs, follow steps 1-8 of the preceding section to launch a rescue instance. Then, run the following command based on your Linux distribution to search the logs for "out of memory" messages:
Amazon Linux 2 (AL2):
sudo grep -i -r 'out of memory' /var/log/
Amazon Linux 2023 (AL2023):
sudo journalctl -p err | grep -i "out of memory"
-or-
sudo dmesg | grep -i "out of memory"
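When you work from a rescue instance, commands that read /var/log/, the journal, or dmesg report on the rescue instance itself unless you point them at the mounted volume. The following sketch searches the original instance's logs instead. It assumes that the volume is mounted at /rescue. The function name oom_events is hypothetical.

```shell
# Sketch: search text logs under a directory for out-of-memory messages.
# -I skips binary files and -h omits file names from the output.
# Returns a nonzero status when nothing matches.
oom_events() {
  local log_dir="${1:-/rescue/var/log}"
  grep -r -i -I -h -E 'out of memory|oom-killer' "$log_dir" 2>/dev/null
}

# Usage:
#   sudo sh -c "$(declare -f oom_events 2>/dev/null); oom_events /rescue/var/log"
#   or, as root: oom_events /rescue/var/log
#
# If the original instance kept a persistent journal (an assumption),
# read it from the mounted volume:
#   sudo journalctl -D /rescue/var/log/journal -p err | grep -i "out of memory"
```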
Troubleshoot page allocation failures
You receive a "page allocation failure" error when the kernel memory allocator fails the allocation request.
To troubleshoot this issue, it's a best practice to upgrade the instance to a larger instance type. Or, for instances that use ephemeral instance store volumes, use a swap file or hard drive partition to add swap storage to the instance and alleviate the memory pressure.
Note: The instance uses swap space when the RAM is full. You can use swap space for instances that have a small amount of RAM. However, swap space isn't a replacement for more RAM. Because swap space is on the instance's hard drive, performance is slower compared to RAM. For more or faster memory, you must increase the instance size.
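As an illustration of the swap file option, the following sketch creates and turns on a swap file. The function name create_swap_file, the default path /swapfile, and the default size of 4096 MiB are assumptions for this example. Choose a size that fits your free disk space.

```shell
# Sketch: create a swap file of the given size in MiB and turn it on.
create_swap_file() {
  local path="${1:-/swapfile}" size_mb="${2:-4096}"
  # Allocate the file, restrict access to root, format it, and activate it.
  sudo dd if=/dev/zero of="$path" bs=1M count="$size_mb" &&
  sudo chmod 600 "$path" &&
  sudo mkswap "$path" &&
  sudo swapon "$path"
}

# Usage:
#   create_swap_file /swapfile 2048
#   sudo swapon -s    # verify that the swap file is active
```

The swap file doesn't persist across reboots unless you also add it to /etc/fstab.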
Use atop to investigate and prevent resource issues
Use the atop monitoring tool to identify resource usage patterns and processes that might cause issues before they cause system failures. The atop tool helps you monitor and troubleshoot the system on an ongoing basis.
Note: The atop tool logs data only after you install it. You can't retrieve historical performance data from before you installed the atop tool.
Use the atop tool to take the following actions:
- Check historical data in /var/log/atop/ to analyze your system and processes when the failure occurred.
- Look for processes that consume a large amount of resources.
- Check the memory, CPU, and disk usage patterns that led up to the failure.
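To replay the period around a failure, you can open that day's atop log and limit the time window. The sketch below assumes that atop writes daily logs named atop_YYYYMMDD in /var/log/atop/, which is a common default but can differ by distribution and configuration. The function name atop_log_path and the date and times in the usage comment are hypothetical.

```shell
# Sketch: build the daily atop log path from a date given as YYYY-MM-DD.
# Assumes the atop_YYYYMMDD naming scheme in /var/log/atop/.
atop_log_path() {
  local day
  day=$(printf '%s' "$1" | tr -d '-')
  printf '/var/log/atop/atop_%s\n' "$day"
}

# Usage: replay the log between 14:00 and 14:30 on the day of the failure.
#   atop -r "$(atop_log_path 2024-05-01)" -b 14:00 -e 14:30
```

In the replay, press t to step forward through samples, and press m, d, or c to switch between the memory, disk, and command-line views of processes.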
Related information
How do I troubleshoot status check failures for my EC2 Linux instance?
What steps do I need to take before I change the instance type of my EC2 Linux instance?