Why is my EC2 Linux instance unreachable and failing its status checks?


My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance is unreachable and fails its status checks.

Short description

Amazon EC2 uses three status checks to monitor the health of EC2 instances:

System status check

The system status check detects issues with an instance's underlying hardware. If the underlying hardware is unresponsive or unreachable because of network, hardware, or software issues, then the system status check fails.

Instance status check

An instance status check failure shows that the instance is unreachable. The following issues can cause an instance status check failure:

  • Failure to boot the operating system (OS).
  • Failure to correctly mount the volumes.
  • Exhausted CPU and memory.
  • Kernel panic.
  • Network failure.

Warning: Some of the following resolutions require an instance stop and start. Before you stop and start your instance, note these conditions:

  • Data that's stored in instance store volumes is lost when the instance is stopped. Before you stop the instance, make sure that you back up the data. Unlike Amazon Elastic Block Store (Amazon EBS) backed volumes, instance store volumes don't support data persistence.
  • The public IPv4 address that Amazon EC2 automatically assigned to the instance on launch or start changes after the stop and start. To retain a public IPv4 address that doesn't change when the instance is stopped, use an Elastic IP address.

For more information, see Stop and start Amazon EC2 instances.

Attached EBS status checks

The Attached EBS status checks monitor that the Amazon EBS volumes attached to an instance are reachable and able to complete I/O operations. For more information, see Attached EBS status checks.

Resolution

To see whether the instance status check or system status check failed, see the instance's status check metrics.
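You can also read the two status checks with the AWS CLI. The following is a minimal sketch, assuming that the AWS CLI is installed and configured with credentials. The function names and the instance ID in the usage comment are hypothetical. If both checks are impaired, the sketch reports the system status check first.

```shell
# Print "<system status> <instance status>" for one instance.
# Assumes the AWS CLI is installed and configured with credentials.
get_status_checks() {
  aws ec2 describe-instance-status \
    --instance-ids "$1" \
    --include-all-instances \
    --query 'InstanceStatuses[0].[SystemStatus.Status, InstanceStatus.Status]' \
    --output text
}

# Map the two status values to the check that needs troubleshooting.
which_check_failed() {
  system_status=$1
  instance_status=$2
  if [ "$system_status" = "impaired" ]; then
    echo "system status check failed"
  elif [ "$instance_status" = "impaired" ]; then
    echo "instance status check failed"
  else
    echo "no failed status check reported"
  fi
}

# Example usage (hypothetical instance ID):
#   which_check_failed $(get_status_checks i-1234567890abcdef0)
```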

If the system status check failed, then see Why did my EC2 Linux instance fail a system status check?

If the instance status check failed, then check the instance's system logs to see the cause of the failure. Then, use one of the following possible resolutions to resolve the issue.
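To narrow down the cause quickly, you can filter the system log for the failure messages that this article shows. The following is a minimal sketch, assuming that the AWS CLI is installed and configured. The function names are hypothetical, and the --latest option might be supported only on Nitro-based instances.

```shell
# Fetch the console output (system log) for an instance.
# Assumes the AWS CLI is installed and configured with credentials.
get_system_log() {
  aws ec2 get-console-output --instance-id "$1" --latest \
    --query Output --output text
}

# Read a system log on stdin and print only the lines that match the
# failure messages shown in this article. Prints nothing if none match.
scan_system_log() {
  grep -E 'Failed to mount|Dependency failed|Out of memory|Killed process|No space left on device|Kernel panic' || true
}

# Example usage (hypothetical instance ID):
#   get_system_log i-1234567890abcdef0 | scan_system_log
```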

Failure to boot the OS

If the system logs contain boot errors, then see How do I troubleshoot an EC2 Linux instance that failed the instance status check due to operating system issues?

Failure to correctly mount the volumes

A mount point failure can cause the instance status check to fail. Example mount point failure message:

[FAILED] Failed to mount /
See 'systemctl status mnt-nvme0n1p1.mount' for details.
[DEPEND] Dependency failed for Local File Systems.

For more information, see the following AWS Knowledge Center articles:

When you change an instance type from a Xen to a Nitro-based instance, the volume mount can fail. Mount failure occurs because Amazon EBS volumes are exposed as NVMe block devices on Nitro-based instances. For example, the device names are /dev/nvme0n1 and /dev/nvme1n1. Device names that you specify in a block device mapping are renamed to NVMe device names (/dev/nvme[0-26]n1).

Note: The block device driver might assign the NVMe device names in a different order than the order that you specified in the block device mapping. To avoid mount failure on Nitro-based instances, it's a best practice to use either a label or UUID for device names. For more information, see Make an Amazon EBS volume available for use.
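As an illustration of the UUID approach, the following sketch builds an /etc/fstab entry that doesn't depend on the device name. The function name, the device name, and the mount point are hypothetical, and the defaults,nofail mount options are an assumption that you can adjust.

```shell
# Print an /etc/fstab entry that identifies the file system by UUID
# instead of by device name.
# Arguments: UUID, mount point, file system type.
# The nofail option (an assumption here) lets the instance boot even
# if the volume is missing.
fstab_entry_by_uuid() {
  printf 'UUID=%s  %s  %s  defaults,nofail  0  2\n' "$1" "$2" "$3"
}

# Example usage (hypothetical device name and mount point):
#   uuid=$(sudo blkid -s UUID -o value /dev/nvme1n1p1)
#   fstype=$(sudo blkid -s TYPE -o value /dev/nvme1n1p1)
#   fstab_entry_by_uuid "$uuid" /data "$fstype" | sudo tee -a /etc/fstab
#   sudo mount -a   # confirm that the new entry mounts without errors
```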

Exhausted CPU and memory

High CPU utilization

If the CPUUtilization metric is at or near 100%, then the instance doesn't have enough compute capacity to run the kernel.

For T2 or T3 instances, check the Amazon CloudWatch CPU credit metrics to see if the CPU credits are at or near zero. If the CPU credits are at zero, then the CPUUtilization metric shows a saturation plateau at the baseline performance for the instance. For example, the baseline performance might be 20% or 40%. CPU utilization at or near 100%, or a T2 or T3 instance that has reached a saturation plateau, shows that the status check failed because of resource overutilization.

To troubleshoot this issue, see How do I troubleshoot an EC2 Linux instance that fails a status check due to over-utilization of resources?

Block device errors, software bugs, or a kernel panic can cause an unusual spike in CPU usage. If CPU utilization is at 100%, first check the system logs for block device errors, memory errors, or other unusual system errors. Then, reboot or stop and start the instance.

Out of memory

High memory pressure can cause an instance status check failure. In the following example log extract, the operating system is out of memory and the oom-killer is started. To resolve this error, stop the process that consumes the most memory.

[115879.769795] Out of memory: kill process 20273 (httpd) score 1285879 or a child
[115879.769795] Killed process 1917 (php-cgi) vsz:467184kB, anon-rss:101196kB, file-rss:204kB
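To see which processes the oom-killer ended, you can extract the process IDs and names from log lines like the preceding ones. The following is a minimal sketch; the function name is hypothetical, and the pattern assumes the "Killed process <pid> (<name>)" wording from the example.

```shell
# Read kernel log lines on stdin and print "<pid> <name>" for each
# process that the oom-killer ended.
list_oom_kills() {
  sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Example usage on a running instance:
#   sudo dmesg | list_oom_kills
# To list the processes that currently use the most resident memory:
#   ps -eo pid,comm,rss --sort=-rss | head -n 6
```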

By default, EC2 instance memory and disk metrics aren't sent to Amazon CloudWatch. For more information, see Collect metrics, logs, and traces with the CloudWatch agent.

To troubleshoot and resolve the out of memory issue, upgrade the instance to a larger instance type. Or, add swap storage to the instance to alleviate the memory pressure. For more information, see the following AWS Knowledge Center articles:
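The swap option can be sketched as follows. This is an illustration under stated assumptions, not a definitive procedure: the function names are hypothetical, the /swapfile path and the 128 MiB block size are assumptions, and the commands must run in a root shell on a volume with enough free space.

```shell
# Number of 128 MiB blocks that dd must write for a swap file of N GiB.
swap_block_count() {
  echo $(( $1 * 1024 / 128 ))
}

# Create and activate a swap file of the given size in GiB, and add it
# to /etc/fstab so that it persists across reboots.
# Run in a root shell. The /swapfile path is an assumption.
create_swap_file() {
  size_gib=$1
  dd if=/dev/zero of=/swapfile bs=128M count="$(swap_block_count "$size_gib")" &&
    chmod 600 /swapfile &&
    mkswap /swapfile &&
    swapon /swapfile &&
    echo '/swapfile swap swap defaults 0 0' >> /etc/fstab
}

# Example usage (in a root shell): create a 4 GiB swap file.
#   create_swap_file 4
```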

Disk full errors

If the system logs contain disk full errors, then the instance is in emergency mode because of a full root device.

Example system log:

$: sudo service apache2 restart
Error: No space left on device
 
$: sudo /etc/init.d/mysql restart
[....] Restarting mysql (via systemctl):
mysql.serviceError: No space left on device
         
$: df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       7.7G  7.7G     0 100% /
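To find full file systems from df output like the preceding example, you can filter on the usage column. The following is a minimal sketch, assuming POSIX-format output from df -P; the function name and the default threshold of 90 percent are hypothetical choices.

```shell
# Read `df -P` output on stdin and print the mount point and usage of
# each file system at or above the threshold (percent, default 90).
full_filesystems() {
  awk -v limit="${1:-90}" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)           # strip the percent sign
      if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Example usage:
#   df -P | full_filesystems 95
# To see which top-level directories on the root volume use the most space
# (GNU du and sort assumed):
#   sudo du -xh / --max-depth=1 2>/dev/null | sort -h | tail -n 10
```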

For detailed instructions on how to troubleshoot and resolve disk full errors, see the following Knowledge Center articles:

Kernel panic

Kernel panic occurs when the kernel detects an internal fatal error during operation. If the error occurs during the operating system boot, then the kernel didn't load properly. This failure to load the kernel causes an instance boot failure.

Example kernel panic error message:

Linux version 2.6.16-xenU (builder@xenbat.amazonsa) (gcc version 4.0.1 20050727 (Red Hat4.0.1-5)) #1 SMP Mon May 28 03:41:49 SAST 2007
Kernel command line:  root=/dev/sda1 ro 4
Registering block device major 8
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,1)

For information on how to troubleshoot and resolve a kernel panic error, see the following Knowledge Center articles:

Network failure

Your network can fail for the following reasons.

The cloud-init package isn't installed on the instance

The cloud-init package is used to update network configurations at launch.

To correct this error and install the cloud-init package on your instance, run the following command:

For Amazon Linux, Amazon Linux 2, Amazon Linux 2023, or Red Hat operating systems:

$ sudo yum install cloud-init -y

For Ubuntu or Debian operating systems:

$ sudo apt install cloud-init -y

MAC address is hardcoded in a configuration file

Hardcoded MAC addresses are in the Linux configuration files and the udev configuration files. You can find these files in the following locations:

  • /etc/udev/rules.d/
  • /etc/udev/rules.d/70-persistent-net.rules
  • /etc/udev/rules.d/80-net-name-slot.rules

To resolve network issues that a hardcoded MAC address causes, remove the entries from the configuration files, or move the files out of /etc/udev/rules.d/. For example, to move the 70-persistent-net.rules file, run the following command:

$ sudo mv /etc/udev/rules.d/70-persistent-net.rules /root/

After the configuration file is moved, restart the network service to make sure that a new MAC address is received.
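To locate hardcoded MAC addresses before you remove them, you can search the configuration directories for anything that looks like a MAC address. The following is a minimal sketch; the function name is hypothetical, and the /etc/sysconfig/network-scripts/ path in the usage comment is an assumption that applies only to some distributions.

```shell
# Print every line under the given files or directories that contains
# a MAC address (six colon-separated hexadecimal pairs).
# Prints nothing if no match is found.
find_hardcoded_macs() {
  grep -rnE '([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}' "$@" 2>/dev/null || true
}

# Example usage (the second path is an assumption):
#   find_hardcoded_macs /etc/udev/rules.d/ /etc/sysconfig/network-scripts/
```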

The IP address is hardcoded in a network configuration file

When you create an Amazon Machine Image (AMI) from an instance with a statically configured IP address, the configuration file contains a hardcoded IP address. To correct this error, set your network interface to use DHCP.

Note: You can't update AMIs that already exist. You must set the network interface to use DHCP before you create a new AMI.

The ENA or Intel-enhanced network drivers are missing

For more information on missing Elastic Network Adapters (ENAs) or Intel-enhanced network drivers, see Enhanced networking on Amazon EC2 instances.

The network interface is automatically renamed at startup

To deactivate predictable network interface renaming, add net.ifnames=0 to the kernel command line. To use the parameter, you must activate enhanced networking with the ENA and rebuild or update the GRUB configuration file.
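One way to add the parameter is to append it to the GRUB_CMDLINE_LINUX line in /etc/default/grub and then regenerate the GRUB configuration. The following is a minimal sketch; the function name is hypothetical, and the sketch assumes that the line exists and that its value is enclosed in double quotation marks. If the line is missing, the file is left unchanged.

```shell
# Append a kernel parameter to the GRUB_CMDLINE_LINUX line of a GRUB
# defaults file, unless the parameter is already on that line.
# Arguments: file path, parameter (for example, net.ifnames=0).
add_kernel_param() {
  file=$1
  param=$2
  # Do nothing if the parameter is already present.
  if grep '^GRUB_CMDLINE_LINUX=' "$file" | grep -qF -- "$param"; then
    return 0
  fi
  tmp=$(mktemp) || return 1
  # Rewrite through a temporary file so that the original keeps its
  # ownership and permissions.
  sed "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"|GRUB_CMDLINE_LINUX=\"\1 ${param}\"|" "$file" > "$tmp" &&
    cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return $status
}

# Example usage (in a root shell), followed by a configuration rebuild:
#   add_kernel_param /etc/default/grub net.ifnames=0
#   grub2-mkconfig -o /boot/grub2/grub.cfg   # RHEL-style systems
#   update-grub                              # Debian or Ubuntu
```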

Related information

Troubleshoot Amazon EC2 Linux instances with failed status checks

Why is my EC2 Windows instance down with a system status check failure or status check 0/2?

Why is my EC2 Windows instance down with an instance status check failure?

Types of status checks

AWS OFFICIAL
Updated 7 months ago