I have an EC2 instance running AL2023 and set up as a web server with two attached EBS volumes and an Elastic IP.
The volumes are set to automount on boot via fstab entries.
The instance runs fine for about a month; then there appears to be a CPU and network spike, and the instance becomes unreachable both from a web browser and over SSH.
Rebooting the instance temporarily fixes the issue and allows access again.
I have looked at the system log and the SSM Agent log, but I can't see any errors.
Is there another log I should look into or something else to check?
My searches of the docs and of this site mostly return results about instances that remain unreachable after rebooting, whereas mine becomes reachable again after a reboot.
21st May 2024
Another reachability failure occurred on 16/05/2024 16:54
Looking in /var/log/messages, I found some failures and errors leading up to that time. I don't know whether they are relevant. Examples below.
May 16 15:03:35 ip-172-31-25-82 systemd-networkd[1994]: enX0: Could not set DHCPv4 address: Connection timed out
May 16 15:20:00 ip-172-31-25-82 systemd-networkd[1994]: enX0: Failed
May 16 15:59:26 ip-172-31-25-82 systemd-networkd-wait-online[256722]: Timeout occurred while waiting for network connectivity.
May 16 16:16:47 ip-172-31-25-82 audit[15006]: AVC avc: denied { read write } for pid=15006 comm="mariadbd" name="wp_options.MYD" dev="xvdf" ino=33603458 scontext=system_u:system_r:mysqld_t:s0 tcontext=unconfined_u:object_r:unlabeled_t:s0 tclass=file permissive=1
May 16 16:22:17 ip-172-31-25-82 audit[15006]: AVC avc: denied { open } for pid=15006 comm="mariadbd" path="/vol/data/mysql/studioof_wp/wp_options.MYD" dev="xvdf" ino=33603458 scontext=system_u:system_r:mysqld_t:s0 tcontext=unconfined_u:object_r:unlabeled_t:s0 tclass=file permissive=1
May 16 16:27:32 ip-172-31-25-82 chronyd[2500]: Can't synchronise: no selectable sources
May 16 16:36:44 ip-172-31-25-82 systemd[1]: refresh-policy-routes@enX0.service: Main process exited, code=exited, status=1/FAILURE
May 16 16:36:45 ip-172-31-25-82 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=refresh-policy-routes@enX0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 16 16:36:45 ip-172-31-25-82 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:sy
May 16 16:36:45 ip-172-31-25-82 amazon-ssm-agent[118643]: 2024-05-16 16:36:45 WARN EC2RoleProvider Failed to connect to Systems Manager with instance profile role credentials. Err: retrieved credentials failed to report to ssm. Error: EC2RoleRequestError: no EC2 instance role found
May 16 16:36:45 ip-172-31-25-82 amazon-ssm-agent[118643]: caused by: RequestError: send request failed
May 16 16:36:45 ip-172-31-25-82 amazon-ssm-agent[118643]: caused by: Get "http://169.254.169.254/latest/meta-data/iam/security-credentials/": dial tcp 169.254.169.254:80: connect: network is unreachable
May 16 16:36:45 ip-172-31-25-82 amazon-ssm-agent[118643]: 2024-05-16 16:36:45 ERROR [TokenRequestService] failed to retrieve instance identity role. Error: EC2MetadataError: failed to get IMDSv2 token and fallback to IMDSv1 is disabled
May 16 16:36:45 ip-172-31-25-82 amazon-ssm-agent[118643]: caused by: :
May 16 16:36:45 ip-172-31-25-82 amazon-ssm-agent[118643]: #011status code: 0, request id:
May 16 16:36:45 ip-172-31-25-82 amazon-ssm-agent[118643]: caused by: RequestError: send request failed
May 16 16:36:45 ip-172-31-25-82 amazon-ssm-agent[118643]: caused by: Put "http://169.254.169.254/latest/api/token": dial tcp 169.254.169.254:80: connect: network is unreachable
May 16 16:36:46 ip-172-31-25-82 amazon-ssm-agent[118643]: 2024-05-16 16:36:45 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: unable to build RSA signature. No Authorization header in request
May 16 16:36:46 ip-172-31-25-82 amazon-ssm-agent[118643]: 2024-05-16 16:36:45 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity. Default Host Management Err: error calling RequestManagedInstanceRoleToken: unable to build RSA signature. No Authorization header in request
May 16 16:36:46 ip-172-31-25-82 amazon-ssm-agent[118643]: 2024-05-16 16:36:45 INFO [CredentialRefresher] Sleeping for 5m0s before retrying retrieve credentials
May 16 16:37:54 ip-172-31-25-82 systemd[1]: Starting refresh-policy-routes@enX0.service - Refresh policy routes for enX0...
May 16 16:37:54 ip-172-31-25-82 ec2net[256814]: Starting configuration for enX0
May 16 16:39:54 ip-172-31-25-82 systemd-networkd-wait-online[256816]: Timeout occurred while waiting for network connectivity.
May 16 16:39:55 ip-172-31-25-82 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=refresh-policy-routes@enX0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 16 16:39:55 ip-172-31-25-82 systemd[1]: refresh-policy-routes@enX0.service: Main process exited, code=exited, status=1/FAILURE
May 16 16:39:55 ip-172-31-25-82 systemd[1]: refresh-policy-routes@enX0.service: Failed with result 'exit-code'.
It appears to start with access errors on the MariaDB data files, which sit on an attached volume.
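To check that ordering against the full log rather than a hand-picked excerpt, something like the following might help. It is only a rough sketch: it assumes the classic syslog line layout shown above ("Mon DD HH:MM:SS host process[pid]: message"), the keyword list is my own guess at what counts as a problem, and the function name `first_problem_per_process` is made up for illustration. It reports the earliest matching line for each process name.

```python
import re
from datetime import datetime

# Classic syslog layout, as in the excerpt above:
#   "May 16 15:03:35 host process[pid]: message"
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<proc>[^\[:\s]+)(?:\[\d+\])?: (?P<msg>.*)$"
)

# Assumed keywords that mark a line as a "problem"; adjust as needed.
KEYWORDS = ("denied", "fail", "timed out", "timeout", "unreachable", "can't")


def first_problem_per_process(lines, year=2024):
    """Return {process: (timestamp, message)} for the earliest matching line.

    Syslog timestamps carry no year, so one is supplied by the caller.
    """
    first = {}
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue  # skip lines that are not in the expected layout
        msg = m.group("msg")
        if not any(k in msg.lower() for k in KEYWORDS):
            continue
        ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        proc = m.group("proc")
        if proc not in first or ts < first[proc][0]:
            first[proc] = (ts, msg)
    return first


def print_report(lines, year=2024):
    """Print one line per process, ordered by when it first logged a problem."""
    result = first_problem_per_process(lines, year)
    for proc, (ts, msg) in sorted(result.items(), key=lambda kv: kv[1][0]):
        print(f"{ts:%b %d %H:%M:%S}  {proc:<32} {msg[:100]}")
```

It could be run with `with open("/var/log/messages", errors="replace") as fh: print_report(fh)` (reading that file usually needs root). The process at the top of the report is the first one that logged a matching line, which may or may not be the root cause.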
I have faced this issue myself. Can you try changing your IP address once? It might solve the problem.
Please accept the answer if it was useful.