Instance unavailable for 4 minutes


Hi
Our instance (ID: i-06ac0afc4c2c59618) stopped responding on 20 January 2019 between 16:46 and 16:50 UTC. We were unable to log in to the server. After a while, it came back online with an uptime of a few minutes. This is an extremely high-priority server, and downtime like this can be disastrous for us. Please share all information about the reason for the instance's unavailability as soon as you can.

Thanks

asked 5 years ago · 230 views
2 Answers

Hello ashutoshshah,

I am sorry to hear about the issue with your instance i-06ac0afc4c2c59618.

I have checked the instance and can see that the underlying physical host on which your instance was running experienced hardware-related issues during the times mentioned above. This caused your instance to reboot.

Please note that in the future you can check whether an instance was affected by a hardware-related event by checking its 'System Status Checks' [1]. The history of these checks can also be viewed in Amazon CloudWatch by looking at the StatusCheckFailed_System metric [2,3].
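As a rough illustration, the history of that metric can also be queried programmatically. The sketch below assumes Python with the boto3 SDK installed and AWS credentials already configured; the helper names (status_check_query, failed_timestamps, fetch_failures) and the 16:30 to 17:00 UTC query window are illustrative choices, not part of any AWS API. The region is not stated in this thread, so it is left as a parameter.

```python
from datetime import datetime, timezone


def status_check_query(instance_id, start, end, period_seconds=60):
    """Build keyword arguments for CloudWatch GetMetricStatistics."""
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": start,
        "EndTime": end,
        "Period": period_seconds,
        # Maximum > 0 in a period means the system check failed at least once.
        "Statistics": ["Maximum"],
    }


def failed_timestamps(response):
    """Return the sorted timestamps of datapoints where the check failed."""
    return sorted(
        point["Timestamp"]
        for point in response.get("Datapoints", [])
        if point.get("Maximum", 0) > 0
    )


def fetch_failures(instance_id, start, end, region_name):
    """Query CloudWatch; needs boto3, credentials, and the instance's region."""
    import boto3  # third-party SDK, imported lazily so the helpers work without it

    cloudwatch = boto3.client("cloudwatch", region_name=region_name)
    response = cloudwatch.get_metric_statistics(
        **status_check_query(instance_id, start, end)
    )
    return failed_timestamps(response)


# Illustrative window around the outage reported in the question (UTC).
QUERY = status_check_query(
    "i-06ac0afc4c2c59618",
    datetime(2019, 1, 20, 16, 30, tzinfo=timezone.utc),
    datetime(2019, 1, 20, 17, 0, tzinfo=timezone.utc),
)
```

Calling fetch_failures with the instance ID, the two datetimes above, and the instance's region would return the minutes in which the system status check reported a failure, provided CloudWatch still retains data at that resolution for the period.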

Please accept our apologies for the above issue and for any inconvenience caused by it. I have now checked the instance and I can see that it is back up and running again.

I would suggest that you take a look at the Auto Recovery feature for Amazon EC2. You can create an Amazon CloudWatch alarm that monitors an Amazon EC2 instance and automatically recovers the instance if it becomes impaired due to an underlying hardware failure or a problem that requires AWS involvement to repair. In short, you use CloudWatch to set up an alarm that triggers when the System Status check fails. This alarm can in turn trigger an EC2 action such as "Recover this instance" [4,5].
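For reference, such an alarm can also be created through the API. The following is a minimal sketch, assuming Python with boto3 installed and credentials configured. The helper names, the alarm name, and the choice of two consecutive one-minute periods are assumptions made for illustration; adjust them to your needs. The region must be the one your instance runs in, which is not stated in this thread.

```python
def recover_alarm_params(instance_id, region, evaluation_periods=2):
    """Build keyword arguments for CloudWatch PutMetricAlarm with a recover action."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # hypothetical naming scheme
        "AlarmDescription": "Recover the instance when the system status check fails",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 60,  # seconds per evaluation period
        "EvaluationPeriods": evaluation_periods,
        "Threshold": 0.0,
        # Fires when the check has failed (value 1) for every evaluated period.
        "ComparisonOperator": "GreaterThanThreshold",
        # EC2 "Recover this instance" action for the given region.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


def create_recover_alarm(instance_id, region):
    """Create the alarm; needs boto3 and permission to call PutMetricAlarm."""
    import boto3  # third-party SDK, imported lazily so the helper above works without it

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**recover_alarm_params(instance_id, region))
```

Calling create_recover_alarm("i-06ac0afc4c2c59618", "<your-region>") would register the alarm. Note that the recover action is only supported on certain instance types and configurations, so please check the documentation in [4] before relying on it.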

We also advise our customers to design their applications so that there is no single point of failure in their environment. Please refer to our white paper on Building Fault-Tolerant Applications in the AWS Cloud [6] for more information.

Please let us know if you need any further help.

Links:
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html#types-of-instance-status-checks
[2] https://aws.amazon.com/blogs/aws/ec2-instance-status-metrics/
[3] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ec2-metricscollected.html#ec2-metrics
[4] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UsingAlarmActions.html#AddingRecoverActions
[5] https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/
[6] https://aws.amazon.com/whitepapers/designing-fault-tolerant-applications/

Regards,
awstomas

AWS
answered 5 years ago

Thanks.

answered 5 years ago
