How do I revert to a known stable kernel after an update blocks my Amazon EC2 instance reboot?

An update prevented my Amazon Elastic Compute Cloud (Amazon EC2) instance reboot. I want to revert to a stable kernel.

Short description

If you made a kernel update to your EC2 Linux instance but the kernel is now corrupt, then the instance can't reboot. You also can't use SSH to connect to the affected instance.

To troubleshoot this issue, use the EC2 Serial Console to access your root volume. Or, create a temporary rescue instance, and then remount your Amazon Elastic Block Store (Amazon EBS) volume on the rescue instance. Configure your GNU GRUB to use the previous kernel, and then reboot the instance.

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

Access the instance's root volume

To access the root volume, use the EC2 Serial Console or a rescue instance.

Use the EC2 Serial Console

Prerequisites: You must have already configured access to the EC2 Serial Console. If your instance is unreachable and you didn't already configure access, then you must use a rescue instance to access the root volume. Also, make sure that you adhere to the serial console prerequisites.

If you activated EC2 Serial Console for Linux, then use it for Nitro-based instance types to troubleshoot boot, network configuration, and SSH configuration issues.

You can use the serial console to connect to your instance without a working network connection.

Before you use the serial console, grant it access at the AWS account level. Then, create AWS Identity and Access Management (IAM) policies that grant access to your IAM users. Every instance that uses the serial console must include at least one password-based user.

Use a rescue instance

Important: Don't perform this procedure on an instance store-backed instance. The recovery procedure requires a stop and start of your instance, so you will lose the instance's data.

To use a rescue instance to access the root volume, complete the following steps:

  1. Create an Amazon EBS snapshot of the root volume.

  2. Stop the affected instance.

  3. Detach the Amazon EBS root volume (/dev/xvda or /dev/sda1) from the affected instance. Note the device name of your root volume.
    Note: To help identify the EBS volume in later steps, tag the volume before you detach it. The root device differs by Amazon Machine Image (AMI). For example, Amazon Linux 2 (AL2) and Amazon Linux 2023 (AL2023) use /dev/xvda. However, Ubuntu 14, 16, and 18, CentOS 7, and Red Hat Enterprise Linux (RHEL) 7.5 use /dev/sda1.

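  4. Launch a rescue EC2 instance in the same Availability Zone as the root volume that you detached. You can attach an EBS volume only to an instance in the volume's own Availability Zone.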
    Note: Check your instance product code. Some product codes require you to launch an EC2 instance in the same operating system (OS) type. For example, if the affected instance is a paid RHEL AMI, then you must launch an AMI with the same product code. If you have an AL2 instance, then you must create an AL2 rescue instance to avoid errors.

  5. Attach the volume as a secondary device (/dev/sdf) to the rescue instance.

  6. Use SSH to connect to the rescue instance.

  7. To view your available disk devices, run the following command:

    lsblk

    Example output:

    NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    xvda    202:0    0   15G  0 disk
    └─xvda1 202:1    0   15G  0 part /
    xvdf    202:80   0   15G  0 disk
    └─xvdf1 202:81   0   15G  0 part

    Note: Nitro-based instances show EBS volumes as NVMe block devices with the nvme[0-26]n1 disk name. Example output on a Nitro-based instance:

    NAME           MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    nvme0n1        259:0    0    8G  0 disk
    ├─nvme0n1p1    259:1    0    8G  0 part /
    └─nvme0n1p128  259:2    0    1M  0 part
    nvme1n1        259:3    0  100G  0 disk
    └─nvme1n1p1    259:4    0  100G  0 part

  8. To become the root user, run the following command:

    sudo -i

  9. To mount the root partition of the attached volume to /mnt, run the following command:

    mount -o nouuid /dev/xvdf1 /mnt

    Note: Replace /dev/xvdf1 with the root partition of your volume. The nouuid mount option applies to XFS file systems only. If you experience issues when you run the preceding command, then run the following command to identify your file system type:

    lsblk -f

    Then, run the following command based on your file system to mount the partition.
    VFAT/FAT32 file systems:

    mount /dev/xvdf1 /mnt

    ext4 file systems:

    mount /dev/xvdf1 /mnt

    Note: Replace /dev/xvdf1 with the root partition of your volume.
    If /mnt doesn't exist on your configuration, then run the following commands to create a mount directory, and then mount the root partition to the new directory:

    mkdir /mnt 
    mount -o nouuid /dev/xvdf1 /mnt

    Note: If you receive an error when you run the preceding mount command, then run the following command instead:

    mount /dev/xvdf1 /mnt

    Next, use the mount directory to access the affected instance's data.

  10. To mount /dev, /run, /proc, and /sys of the rescue instance to the same paths as the mounted volume, run the following command:

    for m in dev proc run sys; do mount -o bind {,/mnt}/$m; done

  11. If you have a separate /boot partition, then mount it to /mnt/boot.

  12. To change the root directory to the mount directory, run the following command:

    chroot /mnt
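
Step 9 above picks the mount options by file system type: nouuid for XFS, and a plain mount for ext4 or VFAT. That choice can be sketched as a small helper. This is only an illustration: mount_opts_for_fstype is a hypothetical function name, and /dev/xvdf1 is a placeholder for your root partition.

```shell
# mount_opts_for_fstype is a hypothetical helper, not part of any AWS tooling.
# It prints the mount options to use for a given file system type.
mount_opts_for_fstype() {
  case "$1" in
    xfs) echo "-o nouuid" ;;  # nouuid applies to XFS only
    *)   echo "" ;;           # ext4, vfat, and others: plain mount
  esac
}

# Example, run as root on the rescue instance (before chroot):
#   fstype=$(lsblk -no FSTYPE /dev/xvdf1)
#   mount $(mount_opts_for_fstype "$fstype") /dev/xvdf1 /mnt
```

XFS refuses to mount a file system whose UUID matches one that is already mounted, which can happen when the rescue instance and the affected instance come from the same AMI. Other file systems don't accept the nouuid option, so the helper returns no options for them.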

Update the default kernel in the GRUB bootloader

You can find the corrupt kernel in position 0 in the list, and the last stable kernel in position 1. To replace the corrupt kernel with the stable kernel, complete the following steps based on your distribution.

GRUB1 (Legacy GRUB) for Red Hat 6

To replace the corrupt kernel with the stable kernel in the /boot/grub/grub.conf file, run the following command:

sed -i '/^default/ s/0/1/' /boot/grub/grub.conf
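
To preview what the sed expression changes before you edit the real file, you can run it without -i on a throwaway copy. The following is a minimal sketch. The sample file contents are made up for illustration and aren't a complete grub.conf.

```shell
# Build a small sample file (hypothetical contents, not a real grub.conf).
sample=$(mktemp)
printf 'default=0\ntimeout=0\n' > "$sample"

# Same expression as above, but without -i, so the result is printed
# instead of written back to the file.
preview=$(sed '/^default/ s/0/1/' "$sample")
echo "$preview"

rm -f "$sample"
```

The /^default/ address limits the substitution to lines that begin with default, so the timeout=0 line in the sample stays unchanged.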

GRUB2 for Ubuntu 14.04 LTS, 16.04, and 18.04

Complete the following steps:

  1. To replace the corrupt GRUB_DEFAULT=0 default menu entry with the stable GRUB_DEFAULT=saved value in the /etc/default/grub file, run the following command:

    sed -i 's/GRUB_DEFAULT=0/GRUB_DEFAULT=saved/g' /etc/default/grub

  2. To make sure that GRUB recognizes the change, run the following command:

    update-grub

    Note: You might receive the "device-mapper: reload ioctl on osprober-linux-xvdaX failed: Device or resource busy Command failed" error when you rebuild the grub configuration file. To resolve this issue, add the GRUB_DISABLE_OS_PROBER=true parameter to the /etc/default/grub file, and then rerun the preceding command.

  3. To make sure that Amazon EC2 loads the stable kernel at the next reboot, run the following command:

    grub-set-default 1
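
The note in step 2 says to add GRUB_DISABLE_OS_PROBER=true to /etc/default/grub, but doesn't show how. One possible way is sketched below. ensure_os_prober_disabled is a hypothetical helper name, and the sketch assumes GNU sed (for the -i option), which the distributions in this article ship.

```shell
# ensure_os_prober_disabled is a hypothetical helper name.
# It sets GRUB_DISABLE_OS_PROBER=true in the given file and avoids
# adding a duplicate line if the parameter is already present.
ensure_os_prober_disabled() {
  file="$1"
  if grep -q '^GRUB_DISABLE_OS_PROBER=' "$file"; then
    sed -i 's/^GRUB_DISABLE_OS_PROBER=.*/GRUB_DISABLE_OS_PROBER=true/' "$file"
  else
    echo 'GRUB_DISABLE_OS_PROBER=true' >> "$file"
  fi
}

# Example, inside the chroot:
#   ensure_os_prober_disabled /etc/default/grub && update-grub
```

Because the helper rewrites an existing line instead of appending a second one, you can run it more than once with the same result.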

GRUB2 for RHEL 7 and AL2

Complete the following steps:

  1. To replace the corrupt GRUB_DEFAULT=0 default menu entry with the stable GRUB_DEFAULT=saved value in the /etc/default/grub file, run the following command:

    sed -i 's/GRUB_DEFAULT=0/GRUB_DEFAULT=saved/g' /etc/default/grub

  2. To update GRUB to regenerate the /boot/grub2/grub.cfg file, run the following command:

    grub2-mkconfig -o /boot/grub2/grub.cfg

    Note: You might receive the "device-mapper: reload ioctl on osprober-linux-xvdaX failed: Device or resource busy Command failed" error when you rebuild the grub configuration file. To resolve this issue, add the GRUB_DISABLE_OS_PROBER=true parameter to the /etc/default/grub file, and then rerun the preceding command.

  3. To make sure that Amazon EC2 loads the stable kernel at the next reboot, run the following command:

    grub2-set-default 1
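
The sed command in step 1 matches only the literal text GRUB_DEFAULT=0, so it changes nothing if the file holds a different value. The entry that grub2-set-default saves is used only when GRUB_DEFAULT is saved, so it can help to confirm the setting. A minimal check is sketched below. grub_default_is_saved is a hypothetical helper name.

```shell
# grub_default_is_saved is a hypothetical helper name.
# It succeeds only when the file contains GRUB_DEFAULT=saved
# (with or without double quotes around the value).
grub_default_is_saved() {
  grep -Eq '^GRUB_DEFAULT="?saved"?[[:space:]]*$' "$1"
}

# Example, inside the chroot:
#   grub_default_is_saved /etc/default/grub \
#     || echo "GRUB_DEFAULT is not set to saved; review /etc/default/grub"
```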

GRUB2 for RHEL 8, CentOS 8, and AL2023

On these distributions, GRUB2 uses blscfg files and entries in /boot/loader for the boot configuration, instead of the previous grub.cfg format. It's a best practice to use the grubby tool to manage the blscfg files and retrieve information from the /boot/loader/entries/ directory. If the blscfg files are missing or corrupted, then grubby doesn't show any results, and you must regenerate the files to recover functionality.

To update the default kernel in GRUB2, complete the following steps:

  1. To see the current default kernel, run the following command:

    grubby --default-kernel

  2. To see all available kernels and their indexes, run the following command:

    grubby --info=ALL

    Example output:

    [root@ip-172-31-29-221 /]# grubby --info=ALL
    index=0
    kernel="/boot/vmlinuz-4.18.0-305.el8.x86_64"
    args="ro console=ttyS0,115200n8 console=tty0 net.ifnames=0 rd.blacklist=nouveau nvme_core.io_timeout=4294967295 crashkernel=auto $tuned_params"
    root="UUID=d35fe619-1d06-4ace-9fe3-169baad3e421"
    initrd="/boot/initramfs-4.18.0-305.el8.x86_64.img $tuned_initrd"
    title="Red Hat Enterprise Linux (4.18.0-305.el8.x86_64) 8.4 (Ootpa)"
    id="0c75beb2b6ca4d78b335e92f0002b619-4.18.0-305.el8.x86_64"
    index=1
    kernel="/boot/vmlinuz-0-rescue-0c75beb2b6ca4d78b335e92f0002b619"
    args="ro console=ttyS0,115200n8 console=tty0 net.ifnames=0 rd.blacklist=nouveau nvme_core.io_timeout=4294967295 crashkernel=auto"
    root="UUID=d35fe619-1d06-4ace-9fe3-169baad3e421"
    initrd="/boot/initramfs-0-rescue-0c75beb2b6ca4d78b335e92f0002b619.img"
    title="Red Hat Enterprise Linux (0-rescue-0c75beb2b6ca4d78b335e92f0002b619) 8.4 (Ootpa)"
    id="0c75beb2b6ca4d78b335e92f0002b619-0-rescue"
    index=2
    kernel="/boot/vmlinuz-4.18.0-305.3.1.el8_4.x86_64"
    args="ro console=ttyS0,115200n8 console=tty0 net.ifnames=0 rd.blacklist=nouveau nvme_core.io_timeout=4294967295 crashkernel=auto $tuned_params"
    root="UUID=d35fe619-1d06-4ace-9fe3-169baad3e421"
    initrd="/boot/initramfs-4.18.0-305.3.1.el8_4.x86_64.img $tuned_initrd"
    title="Red Hat Enterprise Linux (4.18.0-305.3.1.el8_4.x86_64) 8.4 (Ootpa)"
    id="ec2fa869f66b627b3c98f33dfa6bc44d-4.18.0-305.3.1.el8_4.x86_64"

    Note the path of the kernel that you want to set as the default for your instance. In the preceding example, the kernel path at index 2 is /boot/vmlinuz-4.18.0-305.3.1.el8_4.x86_64.

  3. To change the default kernel of the instance, run the following command:

    grubby --set-default=/boot/vmlinuz-4.18.0-305.3.1.el8_4.x86_64

    Note: Replace 4.18.0-305.3.1.el8_4.x86_64 with your kernel's version number.

  4. To verify that you correctly configured the default kernel, run the following command:

    grubby --default-kernel
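
The grubby --info=ALL output in step 2 is long, so it can be easier to reduce it to index and kernel path pairs before you choose a default. The following awk filter is one way to do that. list_kernels is a hypothetical helper name, and the sketch assumes the index= and kernel= line format shown in the example output.

```shell
# list_kernels is a hypothetical helper name. It reads `grubby --info=ALL`
# output on stdin and prints one "index kernel-path" pair per line.
list_kernels() {
  awk -F= '
    /^index=/  { idx = $2 }
    /^kernel=/ { path = $2; gsub(/"/, "", path); print idx, path }
  '
}

# Example, inside the chroot:
#   grubby --info=ALL | list_kernels
```

You can then pass the path that you choose to grubby --set-default, as in step 3.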

Reboot the instance

If you used the EC2 Serial Console, then Amazon EC2 now loads the stable kernel. You can reboot the instance.

If you used a rescue instance to access the root volume, then complete the following steps:

  1. To exit from chroot and unmount /dev, /run, /proc, /sys, and then /mnt itself, run the following commands:

    exit
    umount /mnt/{dev,proc,run,sys,}

  2. Stop the rescue instance.

  3. Detach the root volume from the rescue instance.

  4. Attach the root volume to the original instance as the /dev/xvda or /dev/sda1 root volume.

  5. Start the original instance.

  6. Amazon EC2 now loads the stable kernel. You can reboot the instance.
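
Steps 2 through 5 above can also be run with the AWS CLI from a workstation, not from the rescue instance. The following is a hedged sketch, not a definitive script. The instance IDs, the volume ID, and the helper names run and reattach_root_volume are all hypothetical. The sketch prints the commands by default and runs them only when DRY_RUN=0.

```shell
# Placeholders: replace with your own IDs. All values here are hypothetical.
RESCUE_ID="i-0rescue0example"
ORIGINAL_ID="i-0original0example"
VOLUME_ID="vol-0example"
ROOT_DEVICE="/dev/xvda"  # or /dev/sda1, as noted when you detached the volume

# With DRY_RUN=1 (the default here), each command is printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

reattach_root_volume() {
  run aws ec2 stop-instances --instance-ids "$RESCUE_ID"
  run aws ec2 wait instance-stopped --instance-ids "$RESCUE_ID"
  run aws ec2 detach-volume --volume-id "$VOLUME_ID"
  run aws ec2 wait volume-available --volume-ids "$VOLUME_ID"
  run aws ec2 attach-volume --volume-id "$VOLUME_ID" \
    --instance-id "$ORIGINAL_ID" --device "$ROOT_DEVICE"
  run aws ec2 start-instances --instance-ids "$ORIGINAL_ID"
}

# Example: review the printed commands first, then execute them.
#   reattach_root_volume
#   DRY_RUN=0 reattach_root_volume
```

The two wait commands pause until the rescue instance is stopped and the volume is available, so that the attach call doesn't run against a volume that is still in use.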

AWS OFFICIAL
Updated 5 months ago
5 Comments

Stuck at step #14, my system throws an error when I try to use the mount -o nouuid option. I can, however, just do a normal mount and that seems to work fine ("mount /dev/nvme1n1p1 /mnt"), but I'm not sure if there are other implications with NOT using the -o nouuid option.

Also, if I push ahead, I get an error trying to "chroot /mnt": chroot: failed to run command ‘/bin/bash’: No such file or directory

replied 2 years ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
MODERATOR
replied 2 years ago

There is a syntax error under "To exit from chroot and unmount /dev, /run, /proc, and /sys, run the following command":

exitumount /mnt/{dev,proc,run,sys,}

It should be

exit
umount /mnt/{dev,proc,run,sys,}

AWS
replied a year ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
MODERATOR
replied a year ago

Re: Scott, who said their system throws an error when they try to use the mount -o nouuid option, I have seen this occur when the system is using a vfat filesystem instead of the usual XFS or EXT4. vfat doesn't offer this nouuid option, so you don't need to (or can't) use it.

vfat has its own shorter volume serial number as part of its filesystem structure. Its GUID (partition UUID) is usually managed by the partition table (e.g. GPT) rather than by the filesystem itself.

AWS
replied 5 months ago