Bare Metal Instances Slow to Reboot.


This problem only occurs with Bare Metal instances.

I launched a new c5d.metal instance in N. Virginia (us-east-1) with the Amazon Linux 2 AMI (HVM) - Kernel 5.10.

After the instance was ready, I logged into it using SSH and restarted it by typing ‘sudo reboot’ without doing anything else.

The instance failed to come back from the restart.

The status checks reported:

System status checks: System reachability check failed
Instance status checks: Instance reachability check failed

The error cleared after 10-15 minutes, and I was then able to access the instance.

However, whenever I reboot the instance, the same error occurs again and I have to wait 10 - 15 minutes to access it.

I tried to terminate the instance and create a new one – same problem.

If I launch equivalent but non-metal instances, everything works fine.

The problem only occurs with Bare Metal instances, even if the instances are 'out of the box', i.e. the only thing I do after creating the instance is restarting it.

Any idea what's wrong? Thanks in advance.

asked 2 years ago, 1816 views
4 Answers

Doesn't sound to me like anything is wrong per se -- bare metal instances take a long time to (re)boot because the Nitro system has to do a lot of verification before it hands control over to the instance. (More than is necessary on regular instances, because being "bare metal" gives the customer code far more low-level control over the system than it gets on regular instances.)

answered 2 years ago

As per Colin's answer: yes, we are aware that bare metal instances are slower to boot (or reboot).

A quick note on what you identify as an "error": it's perfectly normal for system status checks to fail during a reboot. Many customers rely on them to track an instance's real status (though deep health checks, e.g. from an ALB/NLB, are generally better).
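To illustrate tracking status checks, here is a minimal sketch of a helper that decides whether both checks currently pass. It assumes the response dictionary has the shape returned by boto3's EC2 `describe_instance_status` call; the function name `checks_passed` is made up for this example.

```python
def checks_passed(response):
    """Return True only if every listed instance reports 'ok' for both checks.

    `response` is assumed to look like the dict returned by boto3's
    ec2.describe_instance_status(...). An empty list counts as "not passed",
    since no status information is available yet.
    """
    statuses = response.get("InstanceStatuses", [])
    if not statuses:
        return False
    return all(
        s.get("SystemStatus", {}).get("Status") == "ok"
        and s.get("InstanceStatus", {}).get("Status") == "ok"
        for s in statuses
    )
```

During a bare metal reboot this would be expected to return False for a while and then flip back to True once the instance is reachable again.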

Is this just an annoyance, or is it causing problems? Does your use case rely on frequently rebooting instances? And do you have a hard dependency on bare metal (e.g. nested virtualization), or can you do the same with virtual instances?

answered 2 years ago

Thanks for your response.

Yes, my use case relies on being able to restart the server quickly. Within 2-3 minutes is bearable, but over 10 minutes makes bare metal instances unusable for my case.

I don’t need to restart the server very often, but when I do, I can’t wait over 10 minutes.

If there is any way to shorten the restart time please let me know. Otherwise, I will probably revert to the non-metal equivalent.

As to your question, I don’t know; this is the first time I am trying a bare metal instance. I wanted to check whether my application gains any performance advantage from using c5d.metal instead of c5d.24xlarge.

answered 2 years ago

Pretty late to the party but just wanted to add this in case others with similar issues end up here:

Whenever possible, it's best to reboot bare metal instances with SSM Agent. Per the EC2 status checks documentation, "If you perform a restart from the operating system on a bare metal instance, the system status check might temporarily return a fail status. When the instance becomes available, the system status check should return a pass status."

This won't resolve the long reboot time, but it at least explains why you're seeing failed system status checks.
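If you want to measure how long the reboot actually takes, something like the sketch below might help. Note that it swaps in the EC2 API (`reboot_instances` and `describe_instance_status`) rather than SSM Agent, and it assumes you pass in a boto3 EC2 client. The function name `time_reboot`, the default polling interval, and the default timeout are all assumptions made for illustration.

```python
import time


def _both_ok(status):
    # True when the system and instance status checks both report "ok".
    return (status.get("SystemStatus", {}).get("Status") == "ok"
            and status.get("InstanceStatus", {}).get("Status") == "ok")


def time_reboot(ec2, instance_id, poll_seconds=15, timeout_seconds=1800,
                sleep=time.sleep, clock=time.monotonic):
    """Reboot an instance and return seconds until its checks pass again.

    `ec2` is assumed to be a boto3 EC2 client (or anything exposing the same
    two methods). The checks must be seen failing once before a pass counts,
    so a reboot too quick to register as a failure ends in TimeoutError.
    """
    start = clock()
    ec2.reboot_instances(InstanceIds=[instance_id])
    seen_failing = False
    while clock() - start < timeout_seconds:
        response = ec2.describe_instance_status(
            InstanceIds=[instance_id], IncludeAllInstances=True)
        statuses = response.get("InstanceStatuses", [])
        ok = bool(statuses) and all(_both_ok(s) for s in statuses)
        if not ok:
            seen_failing = True
        elif seen_failing:
            return clock() - start
        sleep(poll_seconds)
    raise TimeoutError(
        f"{instance_id}: no fail-then-pass transition within {timeout_seconds}s")
```

With boto3 installed and credentials configured, usage would look roughly like `time_reboot(boto3.client("ec2", region_name="us-east-1"), "<your-instance-id>")`. Because status checks are only evaluated periodically, the result is a coarse estimate rather than an exact boot time.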

answered a year ago
