EC2 instance becomes unreachable after some days


Hi all. One of my EC2 instances becomes unreachable after a few days. During this time, CPU utilization increases to around 50-80% (but I still feel that should not make the instance unreachable). Can someone tell me how to debug this and find out which process might be causing the issue? Also, is the increased CPU utilization the reason the instance becomes unreachable?

PS: I am using an On-Demand Linux t2.xlarge instance and have enough root storage. My questions:

  • How can I find out which process is the culprit?
  • Will system logs help me here, and if so, how can I access them (current logs and logs from the previous 5-7 days)?
  • There is no clear pattern as such, but it mostly happens every 5-7 days. Could AWS be doing something behind the scenes that causes this?
Prakhar
asked 10 months ago · 482 views
5 Answers

How can I find out which process is the culprit?

To check process load, it may be useful to run commands such as "ps aux", or to install the CloudWatch agent to monitor per-process metrics.
Besides CPU, it is also possible that memory pressure or some other resource exhaustion is causing the inaccessibility.
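For example, a quick way to see the heaviest processes (GNU procps `ps`, as shipped on Amazon Linux; the `--sort` flag may differ on other systems):

```shell
# Top 10 processes by CPU, then by resident memory (header line plus 10 rows)
ps aux --sort=-%cpu | head -n 11
ps aux --sort=-%mem | head -n 11
```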

Will system logs help me here, and if so, how can I access them (current logs and logs from the previous 5-7 days)?

This varies depending on the operating system, but system logs are generally stored under the "/var/log/" directory.
On Amazon Linux 2, the main system log is written to "/var/log/messages".
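For example, to check whether the kernel's OOM killer terminated processes shortly before a hang (the log path below is for Amazon Linux 2; Debian/Ubuntu use /var/log/syslog, and `journalctl` works on any systemd distro):

```shell
# Amazon Linux 2: search the system log for out-of-memory kills
sudo grep -iE "out of memory|oom-killer" /var/log/messages | tail -n 20

# systemd distros: kernel messages from the previous boot (persistent journal required);
# "|| true" keeps the command from failing when nothing matches
journalctl -k -b -1 --no-pager | grep -i oom || true
```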

There is no clear pattern as such, but it mostly happens every 5-7 days. Could AWS be doing something behind the scenes that causes this?

For scheduled maintenance, AWS notifies you in advance, so I don't think that is relevant in this case.

EXPERT
answered 10 months ago
  • If SSH or other access is available after rebooting, it would be best to SSH in and check the system logs. Adding more custom metrics via the CloudWatch agent does cost money, but I think it is a necessary cost to track down the cause of the problem.
    Also, since you are using a t2.xlarge (a burstable instance), check the "CPUCreditBalance" metric for the time period when the instance became unreachable. If this metric is zero, the instance is throttled to its baseline performance, which could explain the unresponsiveness. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances-monitoring-cpu-credits.html

    In some cases it may be worth changing to a fixed-performance instance type such as the m6 family.
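To check the credit metrics from the command line, something like the following works (a sketch; it assumes the AWS CLI is installed and configured, and `i-0123456789abcdef0` is a placeholder instance ID):

```shell
# CPU credit balance for the last 6 hours, in 5-minute buckets
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time "$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average \
  || echo "aws call failed (CLI not configured?)"
```

A sustained average near zero around the time of each hang would point at CPU-credit exhaustion rather than a runaway process.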


@Riku_Kobayashi `ps aux` will not help, because by the time the instance becomes unreachable I have already lost the ability to connect/SSH to it, so I cannot run the command at all. Is there a workaround? I can try the CloudWatch agent, it's just that there is a cost associated with it.

Prakhar
answered 10 months ago
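One common workaround for "can't run commands once it's unreachable" (a sketch, not from this thread): record process snapshots to disk on a schedule, so that after a reboot you can inspect what was running just before the hang. Assumes cron and GNU tools; `/usr/local/bin/ps-snapshot.sh` is a hypothetical path:

```shell
# Snapshot the busiest processes once a minute so the data survives a hang.
# Install via crontab -e:
#   * * * * * /usr/local/bin/ps-snapshot.sh

# ps-snapshot.sh:
LOG=/tmp/ps-snapshot.log   # use /var/log plus logrotate in practice
{
  date -u +%Y-%m-%dT%H:%M:%SZ
  ps aux --sort=-%cpu | head -n 15   # top CPU consumers
  free -m                            # memory usage, in MiB
  echo "---"
} >> "$LOG"
```

After the next reboot, the tail of the log shows the last snapshots taken before the instance stopped responding.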

What kind of application are you running on this instance? The best place to debug is your application or application-server logs. The cause could be the requests you are receiving, or a performance issue in the application itself. Start by debugging the deployed apps.

answered 10 months ago

Where are you seeing the 50-80% figure for CPU usage? Is that in the AWS Console? Does it go higher than that when the host becomes unresponsive, or does it just stop recording a figure?

It would help to set up the CloudWatch agent to collect more detailed system metrics and logs: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html. This may show that your root cause is exhaustion of some system resource.
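A minimal agent configuration that adds memory, disk, and per-process metrics might look like this (a sketch; the `procstat` pattern `.*` is illustrative and collects metrics for all processes, which can get costly at scale):

```json
{
  "metrics": {
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["/"] },
      "procstat": [
        { "pattern": ".*", "measurement": ["cpu_usage", "memory_rss"] }
      ]
    }
  }
}
```

Save it as the agent's configuration file and load it with `amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:<path> -s`.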

What software are you running on the t2.xlarge?

EXPERT
Steve_M
answered 10 months ago
