How do I troubleshoot an EC2 Linux instance that failed the instance status check because of OS issues?

My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance failed the instance status check because of operating system (OS) issues. Now, it doesn't boot successfully.

Short description

Your Linux instance might fail the instance status check for the following reasons:

  • You updated the kernel and the new kernel didn't boot.
  • You received a "Kernel panic" error.
  • The file system entries in /etc/fstab are incorrect or couldn't mount.
  • The file system is corrupted.
  • There are incorrect network configurations on the instance.
  • There's no available CPU or memory on the instance.

To troubleshoot these issues, use the EC2 Serial Console or EC2Rescue to diagnose and troubleshoot boot issues. Or, create a rescue instance and then manually correct the errors.

Note: The following resolution steps are based on Amazon Linux 2023 (AL2023). However, you can use the resolution for Linux distributions in general. If you use a Linux distribution that isn't AL2023, then the commands, paths, and outputs might vary.

Resolution

Important: Before you stop and start your instance, take the following actions.

Note: When you stop and start an instance, the instance's public IP address changes. It's a best practice to use an Elastic IP address to route external traffic to your instance instead of a public IP address. For more information, see What happens when you stop an instance.

Use the EC2 Serial Console

Important: You don't need to stop and start the instance when you use EC2 Serial Console.

Use the EC2 Serial Console to troubleshoot supported Nitro-based instance types and supported bare metal instances. The serial console doesn't require a working network connection to your instance. Connect to the EC2 Serial Console, and then check for boot issues and for network and SSH configuration issues.

If you haven't used the EC2 Serial Console before, then make sure that you adhere to the prerequisites. If your instance is unreachable and you haven't already configured access to the serial console, then use EC2Rescue or a rescue instance to troubleshoot.

Run EC2Rescue

Run the EC2Rescue tool to diagnose and troubleshoot the OS on unreachable instances.

Use a rescue instance to manually correct errors

To launch a rescue instance, complete the following steps:

  1. Launch a new instance in your virtual private cloud (VPC). Use the same Amazon Machine Image (AMI) and the same Availability Zone as the instance that failed its status check.
    Note: You can also use an existing instance with the same AMI and Availability Zone as the instance that failed its status check.

  2. Stop the original instance.

  3. Detach the root EBS volume, such as /dev/xvda or /dev/sda1, from the original instance. Note the device name of your root volume.

  4. Attach the volume as a secondary device (/dev/sdf) to the rescue instance.

  5. Use SSH to connect to your rescue instance.

  6. To create a mount point directory for the volume that you attached to the rescue instance, run the following command:

    sudo mkdir /rescue

    Note: Replace /rescue with your mount point directory name.

  7. To become the root user, run the following command:

    sudo -i

  8. As the root user, run the following command to identify the correct device name:

    lsblk

    Note: The device that's attached to the rescue instance might have a different device name.

  9. To mount the volume to the new directory, run the following command:

    sudo mount /dev/xvdf1 /rescue

    Note: Replace /dev/xvdf1 with your root volume's device name and /rescue with your mount point directory name. If you receive an error when you run the preceding command, then see Why can't I mount my Amazon EBS volume?

  10. Retrieve the instance's system log to find out the root cause of the issue.
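
Steps 6 through 9 can be sketched as a small shell helper. This is a minimal sketch, not part of the official procedure: the function names rescue_mount_opts and rescue_mount are hypothetical, and the sketch assumes that lsblk reports the file system type of the attached partition. It passes the nouuid option for XFS because a rescue instance that you launch from the same AMI can have a root file system with the same UUID as the attached volume, and XFS refuses to mount a duplicate UUID.

```shell
#!/bin/sh
# Print the mount options to use for a given file system type.
# XFS needs nouuid when the rescue instance's own root volume has the same UUID.
rescue_mount_opts() {
    case "$1" in
        xfs) printf '%s\n' 'nouuid' ;;
        *)   printf '%s\n' 'defaults' ;;
    esac
}

# Mount DEVICE at MOUNTPOINT with options that match its file system type.
# Run as root. DEVICE and MOUNTPOINT are placeholders, such as /dev/xvdf1 and /rescue.
rescue_mount() (
    device=$1
    mountpoint=$2
    # Ask lsblk for the file system type of the attached partition.
    fstype=$(lsblk -no FSTYPE "$device")
    [ -n "$fstype" ] || { echo "cannot detect file system on $device" >&2; exit 1; }
    mkdir -p "$mountpoint"
    mount -t "$fstype" -o "$(rescue_mount_opts "$fstype")" "$device" "$mountpoint"
)
```

For example, as the root user, run rescue_mount /dev/xvdf1 /rescue after you confirm the device name with lsblk.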

To troubleshoot OS errors that cause instance status check failures, take the following troubleshooting actions based on the error that you receive. After you correct the errors, restart your original instance.

Troubleshoot "Kernel panic"

If you receive a "Kernel panic" error message, then the kernel might not have the vmlinuz or initramfs files that it needs to boot. To troubleshoot this issue, complete the following steps:

  1. To check for the vmlinuz and initramfs files, run the following commands:
    cd /rescue/boot
    ls -l
    In the output, verify that the vmlinuz and initramfs files correspond to the kernel version that you want to boot.
    Example output:
    uname -r
    4.14.165-131.185.amzn2.x86_64
    
    cd /boot; ls -l
    total 39960
    -rw-r--r-- 1 root root      119960 Jan 15 14:34 config-4.14.165-131.185.amzn2.x86_64
    drwxr-xr-x 3 root root     17 Feb 12 04:06 efi
    drwx------ 5 root root       79 Feb 12 04:08 grub2
    -rw------- 1 root root 31336757 Feb 12 04:08 initramfs-4.14.165-131.185.amzn2.x86_64.img
    -rw-r--r-- 1 root root    669087 Feb 12 04:08 initrd-plymouth.img
    -rw-r--r-- 1 root root    235041 Jan 15 14:34 symvers-4.14.165-131.185.amzn2.x86_64.gz
    -rw------- 1 root root   2823838 Jan 15 14:34 System.map-4.14.165-131.185.amzn2.x86_64
    -rwxr-xr-x 1 root root   5718992 Jan 15 14:34 vmlinuz-4.14.165-131.185.amzn2.x86_64
    Note: The preceding example shows an Amazon Linux 2 (AL2) instance with kernel version 4.14.165-131.185.amzn2.x86_64. The /boot directory has the required initramfs-4.14.165-131.185.amzn2.x86_64.img and vmlinuz-4.14.165-131.185.amzn2.x86_64 files.
  2. If the initramfs and the vmlinuz files aren't present, then boot the instance with a previous kernel that has both of the files.

For more information about how to resolve kernel panic errors, see How do I resolve the "Kernel panic - not syncing" error in my EC2 instance?
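
The check in step 1 can also be scripted. The following is a minimal sketch that assumes the Amazon Linux naming shown in the example output (vmlinuz-<version> and initramfs-<version>.img). Other distributions use different names, so adjust the patterns as needed. The function name check_boot_files is hypothetical.

```shell
#!/bin/sh
# Report whether each vmlinuz-<version> in a boot directory has a matching
# initramfs-<version>.img. Returns a nonzero status if any file is missing.
check_boot_files() (
    boot_dir=${1:-/rescue/boot}   # assumes the volume is mounted at /rescue
    status=0
    for kernel in "$boot_dir"/vmlinuz-*; do
        # If the glob matched nothing, the loop sees the literal pattern.
        [ -e "$kernel" ] || { echo "no vmlinuz files in $boot_dir"; exit 1; }
        version=${kernel##*/vmlinuz-}
        if [ -e "$boot_dir/initramfs-$version.img" ]; then
            echo "ok: $version"
        else
            echo "missing initramfs: $version"
            status=1
        fi
    done
    exit "$status"
)
```

Run check_boot_files /rescue/boot on the rescue instance. A kernel version that's reported as missing its initramfs file is a candidate cause of the boot failure.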

Troubleshoot "Failed to mount" or "Dependency failed"

If the /etc/fstab file has incorrect mount point entries, then you receive a "Failed to mount" or "Dependency failed" error message. To troubleshoot this issue, complete the following steps:

  1. Verify that the mount point entries in the /etc/fstab are correct.
  2. To correct file system inconsistencies, it's a best practice to run the fsck or xfs_repair tool. Before you run the tool, create a backup of the volume, such as an Amazon Elastic Block Store (Amazon EBS) snapshot. Then, run the following command to unmount the volume:
    sudo umount /rescue
    Note: Replace /rescue with your mount point directory name.
  3. Run the fsck or xfs_repair tool, based on your file system. In the following commands, replace /dev/sdf with the device name that lsblk shows for your volume.
    For ext4 file systems, run the following command:
    sudo fsck /dev/sdf
    Example output:
    fsck from util-linux 2.30.2
    e2fsck 1.42.9 (28-Dec-2013)
    /dev/sdf: clean, 11/6553600 files, 459544/26214400 blocks
    For XFS file systems, run the following command:
    sudo xfs_repair /dev/sdf
    Example output:
    xfs_repair /dev/xvdf
    Phase 1 - find and verify superblock...
    Phase 2 - using internal log
            - zero log...
            - scan filesystem freespace and inode maps...
            - found root inode chunk
    Phase 3 - for each AG...
            - scan and clear agi unlinked lists...
            - process known inodes and perform inode discovery...
            - agno = 0
            - agno = 1
            - agno = 2
            - agno = 3
            - process newly discovered inodes...
    Phase 4 - check for duplicate blocks...
            - setting up duplicate extent list...
            - check for inodes claiming duplicate blocks...
            - agno = 0
            - agno = 1
            - agno = 2
            - agno = 3
    Phase 5 - rebuild AG headers and trees...
            - reset superblock...
    Phase 6 - check inode connectivity...
            - resetting contents of realtime bitmap and summary inodes
            - traversing filesystem ...
            - traversal finished ...
            - moving disconnected inodes to lost+found ...
    Phase 7 - verify and correct link counts...
    done
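
The fstab review in step 1 and the tool choice in step 3 can be sketched as follows. This is an illustrative sketch only: the function names check_fstab and repair_tool_for are hypothetical, and the field-count test is a basic sanity check. It doesn't confirm that a listed device or UUID exists.

```shell
#!/bin/sh
# Print each active fstab entry and flag entries with fewer than four fields
# (device, mount point, file system type, options).
check_fstab() (
    fstab=${1:-/rescue/etc/fstab}   # assumes the volume is mounted at /rescue
    awk '
        BEGIN { bad = 0 }
        /^[ \t]*(#|$)/ { next }     # skip comments and blank lines
        NF < 4 {
            printf "line %d: expected at least 4 fields, found %d\n", NR, NF
            bad = 1
            next
        }
        { printf "line %d: %s on %s (%s, options %s)\n", NR, $1, $2, $3, $4 }
        END { exit bad }
    ' "$fstab"
)

# Print the repair tool that matches a file system type.
repair_tool_for() {
    case "$1" in
        ext2|ext3|ext4) echo "fsck" ;;
        xfs)            echo "xfs_repair" ;;
        *)              echo "unsupported file system type: $1" >&2; return 1 ;;
    esac
}
```

Run check_fstab while the volume is still mounted, and then unmount the volume before you run the repair tool.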

Troubleshoot "interface eth0: failed"

If you receive the "interface eth0: failed" error message, then verify that the ifcfg-eth0 file has the correct network entries. You can find the network configuration file that corresponds to the eth0 primary interface at /etc/sysconfig/network-scripts/ifcfg-eth0. If the device name of your primary interface isn't eth0, then the file name starts with ifcfg followed by the device name. The file is in the /etc/sysconfig/network-scripts directory on the instance.

To check and correct the ifcfg-eth0 file, complete the following steps:

  1. Run the following command to view the network configuration file for the eth0 primary interface:
    sudo cat /etc/sysconfig/network-scripts/ifcfg-eth0
    Note: If needed, replace eth0 in the preceding command with the name of your primary interface.
    The following example output contains the correct entries for the network configuration file located in /etc/sysconfig/network-scripts/ifcfg-eth0:
    $ sudo cat /etc/sysconfig/network-scripts/ifcfg-eth0
    DEVICE=eth0
    BOOTPROTO=dhcp
    ONBOOT=yes
    TYPE=Ethernet
    USERCTL=yes
    PEERDNS=yes
    DHCPV6C=yes
    DHCPV6C_OPTIONS=-nw
    PERSISTENT_DHCLIENT=yes
    RES_OPTIONS="timeout:2 attempts:5"
    DHCP_ARP_CHECK=no
    If ONBOOT isn't set to yes, then you didn't configure your primary network interface to come up at boot.
  2. To change the ONBOOT value, run the following command to open the file:
    sudo vi /etc/sysconfig/network-scripts/ifcfg-eth0
    Note: The preceding command uses the vi editor to edit the file.
  3. To enter insert mode, press i.
  4. Move the cursor to the ONBOOT entry, and then change the value to yes.
  5. Press Esc, and then enter :wq! to save and exit the file.
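
If you prefer a non-interactive edit, then the change in steps 2 through 5 can be sketched with sed. This is a sketch under assumptions: the function names get_onboot and set_onboot_yes are hypothetical. Also, when you work from a rescue instance, the original instance's file is under your mount point, such as /rescue/etc/sysconfig/network-scripts/ifcfg-eth0.

```shell
#!/bin/sh
# Print the ONBOOT value from an ifcfg file, or nothing if the key is absent.
get_onboot() {
    sed -n 's/^ONBOOT=//p' "$1" | tr -d '"' | tail -n 1
}

# Set ONBOOT=yes and keep the previous version of the file as <file>.bak.
set_onboot_yes() (
    file=$1
    cp -p "$file" "$file.bak"
    if grep -q '^ONBOOT=' "$file.bak"; then
        # Rewrite the existing entry from the backup copy into the file.
        sed 's/^ONBOOT=.*/ONBOOT=yes/' "$file.bak" > "$file"
    else
        # The key is missing, so append it.
        echo 'ONBOOT=yes' >> "$file"
    fi
)
```

Run get_onboot on the file first, and run set_onboot_yes as the root user only if the value isn't yes.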

For more errors and resolution steps, see Troubleshoot system log errors for Linux instances.

Restart the original instance

Complete the following steps:

  1. To unmount the secondary device from your rescue instance, run the following command:
    sudo umount /rescue
    Note: Replace /rescue with your mount point directory name.
    If the unmount operation fails, then stop or reboot the rescue instance. Then, rerun the preceding command.
  2. Detach the secondary volume from the rescue instance.
  3. Attach the volume to the original instance as the /dev/xvda or /dev/sda1 root volume.
  4. Start the instance, and then verify that the instance is responsive.
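
Steps 2 through 4 can also be run with the AWS CLI from a machine that has credentials for your account. The following is a minimal sketch, assuming that you already unmounted the volume in step 1. The function names run and reattach_and_start are hypothetical, and the volume ID, instance ID, and device name are placeholders that you supply. Set DRY_RUN=1 to print the commands without running them.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Move the repaired volume back to the original instance and start the instance.
# Arguments: volume ID, original instance ID, root device name (default /dev/xvda).
reattach_and_start() (
    volume_id=$1
    instance_id=$2
    device=${3:-/dev/xvda}
    run aws ec2 detach-volume --volume-id "$volume_id" &&
    run aws ec2 wait volume-available --volume-ids "$volume_id" &&
    run aws ec2 attach-volume --volume-id "$volume_id" \
        --instance-id "$instance_id" --device "$device" &&
    run aws ec2 wait volume-in-use --volume-ids "$volume_id" &&
    run aws ec2 start-instances --instance-ids "$instance_id" &&
    # Wait until the instance passes its status checks.
    run aws ec2 wait instance-status-ok --instance-ids "$instance_id"
)
```

For example, DRY_RUN=1 reattach_and_start vol-EXAMPLE i-EXAMPLE /dev/xvda prints the six commands in order so that you can review them first.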

Related information

How do I troubleshoot status check failures for my EC2 Linux instance?

Troubleshoot Amazon EC2 Linux instances with failed status checks

Why doesn't my Linux instance boot after I changed it to a Nitro-based instance?

AWS OFFICIAL Updated 9 months ago
3 Comments

Thanks for the very detailed and well structured article.

replied 3 years ago

In "Method 3: Manually correct errors using a rescue instance", when you try to mount the disk in step 7 for a problematic boot volume, you might get an error stating "Wrong Fs type or UUID duplicate, Superblock is missing or badblock found". This error occurs because the UUID of the boot volume conflicts with the UUID of the rescue server's boot volume. Use the "blkid" command to check the UUIDs of the disks. If the file system is XFS, then mount the volume with the command "mount -t xfs -o nouuid /dev/vg/lv /mnt". For reference, see https://access.redhat.com/solutions/5494781

AWS
replied 3 years ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
MODERATOR
replied 3 years ago