
AL2023 - instance going permanently offline / network crashing


Hi There,

We're having some issues on an EC2 instance. The instance runs a piece of software called "Good Sync" to back up various cloud file stores across different providers (365, Dropbox, S3, etc.).

The software runs fine for about 24 to 48 hours, but then the EC2 instance goes offline. It can only be restored by performing a "Force Stop" from the AWS console.

I've been through it with their support and checked the logs, and their software itself doesn't seem to be crashing or freezing. The server doesn't seem to be crashing or freezing either, but the networking subsystem / service is failing and never seems to recover. The journal shows the following when it goes offline (newest entries first):

Feb 07 05:58:49 gs.lennox-it.uk systemd-networkd[1993]: enX0: Failed
Feb 07 05:58:47 gs.lennox-it.uk systemd-networkd[1993]: enX0: Could not set DHCPv4 route: Connection timed out
Feb 07 05:50:36 gs.lennox-it.uk audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect co>
Feb 07 05:50:35 gs.lennox-it.uk audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect c>

I've looked into the error and from what I've read online it can occur when the server is under very heavy load, which I think is what is happening (though only for a short amount of time). The software doesn't crash but it does spike the CPU at 100% for a while.

The main problem is that the server never seems to recover the network again and remains offline until it is power cycled.

Has anyone seen anything like this before and/or know a solution? I've tried to install cpulimit to try and throttle the software but this doesn't seem to be available through yum.

Full logs from the error up until when it reboots are below

Any advice would be much appreciated

Olly

Feb 07 09:09:01 localhost kernel: Linux version 6.1.119-129.201.amzn2023.x86_64 (mockbuild@ip-10-0-49-203) (gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2), GNU l>
-- Boot e2fb220bddd741f692bec78198ab28b3 --
Feb 07 09:04:30 gs.lennox-it.uk audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect co>
Feb 07 09:04:30 gs.lennox-it.uk audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect c>
Feb 07 09:04:16 gs.lennox-it.uk systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Feb 07 09:04:12 gs.lennox-it.uk systemd[1]: sysstat-collect.service: Deactivated successfully.
Feb 07 09:03:51 gs.lennox-it.uk systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Feb 07 09:03:46 gs.lennox-it.uk ec2net[36932]: [get_meta] Querying IMDS for network/interfaces/macs/(mac)/device-number
Feb 07 09:03:40 gs.lennox-it.uk setup-policy-routes[36895]: /usr/share/amazon-ec2-net-utils/lib.sh: line 89: imds_endpoint: unbound variable
Feb 07 09:03:06 gs.lennox-it.uk CROND[36909]: (root) CMDEND (run-parts /etc/cron.hourly)
Feb 07 09:03:03 gs.lennox-it.uk run-parts[36924]: (/etc/cron.hourly) finished 0anacron
Feb 07 09:02:18 gs.lennox-it.uk run-parts[36917]: (/etc/cron.hourly) starting 0anacron
Feb 07 09:01:05 gs.lennox-it.uk CROND[36910]: (root) CMD (run-parts /etc/cron.hourly)
Feb 07 08:57:39 gs.lennox-it.uk ec2net[36897]: [get_meta] Querying IMDS for network/interfaces/macs/(mac)/device-number
Feb 07 08:57:33 gs.lennox-it.uk setup-policy-routes[36871]: /usr/share/amazon-ec2-net-utils/lib.sh: line 89: imds_endpoint: unbound variable
Feb 07 08:51:50 gs.lennox-it.uk ec2net[36872]: [get_meta] Querying IMDS for network/interfaces/macs/(mac)/device-number
Feb 07 08:51:41 gs.lennox-it.uk setup-policy-routes[36856]: /usr/share/amazon-ec2-net-utils/lib.sh: line 89: imds_endpoint: unbound variable
Feb 07 08:51:30 gs.lennox-it.uk audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect co>
Feb 07 08:51:30 gs.lennox-it.uk audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect c>
Feb 07 08:51:14 gs.lennox-it.uk systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Feb 07 08:51:09 gs.lennox-it.uk systemd[1]: sysstat-collect.service: Deactivated successfully.
Feb 07 08:50:28 gs.lennox-it.uk systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Feb 07 08:49:17 gs.lennox-it.uk ec2net[36857]: [get_meta] Querying IMDS for network/interfaces/macs/(mac)/device-number
Feb 07 08:49:09 gs.lennox-it.uk setup-policy-routes[36836]: /usr/share/amazon-ec2-net-utils/lib.sh: line 89: imds_endpoint: unbound variable
Feb 07 08:44:40 gs.lennox-it.uk audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect co>
Feb 07 08:44:40 gs.lennox-it.uk audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect c>
Feb 07 08:44:33 gs.lennox-it.uk systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Feb 07 08:44:30 gs.lennox-it.uk systemd[1]: sysstat-collect.service: Deactivated successfully.
Feb 07 08:44:00 gs.lennox-it.uk systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Feb 07 08:43:48 gs.lennox-it.uk ec2net[36837]: [get_meta] Querying IMDS for network/interfaces/macs/(mac)/device-number
Feb 07 08:43:42 gs.lennox-it.uk setup-policy-routes[36816]: /usr/share/amazon-ec2-net-utils/lib.sh: line 89: imds_endpoint: unbound variable
Feb 07 08:39:05 gs.lennox-it.uk ec2net[36817]: [get_meta] Querying IMDS for network/interfaces/macs/(mac)/device-number
Feb 07 08:38:57 gs.lennox-it.uk setup-policy-routes[36796]: /usr/share/amazon-ec2-net-utils/lib.sh: line 89: imds_endpoint: unbound variable
Feb 07 08:35:05 gs.lennox-it.uk ec2net[36797]: [get_meta] Querying IMDS for network/interfaces/macs/(mac)/device-number
Feb 07 08:34:58 gs.lennox-it.uk setup-policy-routes[36774]: /usr/share/amazon-ec2-net-utils/lib.sh: line 89: imds_endpoint: unbound variable
Feb 07 08:31:27 gs.lennox-it.uk audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect co>
Feb 07 08:31:26 gs.lennox-it.uk audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect c>
Feb 07 08:31:16 gs.lennox-it.uk systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Feb 07 08:31:12 gs.lennox-it.uk systemd[1]: sysstat-collect.service: Deactivated successfully.
Feb 07 08:30:21 gs.lennox-it.uk systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Feb 07 08:29:14 gs.lennox-it.uk ec2net[36775]: [get_meta] Querying IMDS for network/interfaces/macs/(mac)/device-number
Feb 07 08:29:07 gs.lennox-it.uk setup-policy-routes[36757]: /usr/share/amazon-ec2-net-utils/lib.sh: line 89: imds_endpoint: unbound variable
Feb 07 08:24:27 gs.lennox-it.uk audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect co>
Feb 07 08:24:27 gs.lennox-it.uk audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect c>
Feb 07 08:24:20 gs.lennox-it.uk systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Feb 07 08:24:17 gs.lennox-it.uk systemd[1]: sysstat-collect.service: Deactivated successfully.
Feb 07 08:24:17 gs.lennox-it.uk ec2net[36758]: [get_meta] Querying IMDS for network/interfaces/macs/(mac)/device-number
Feb 07 08:24:12 gs.lennox-it.uk setup-policy-routes[36732]: /usr/share/amazon-ec2-net-utils/lib.sh: line 89: imds_endpoint: unbound variable
Feb 07 08:23:35 gs.lennox-it.uk systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Feb 07 08:22:59 gs.lennox-it.uk audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect co>
Feb 07 08:22:59 gs.lennox-it.uk audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect c>
Feb 07 08:22:51 gs.lennox-it.uk systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Feb 07 08:22:48 gs.lennox-it.uk systemd[1]: sysstat-collect.service: Deactivated successfully.
Feb 07 08:22:18 gs.lennox-it.uk systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Feb 07 08:18:43 gs.lennox-it.uk ec2net[36733]: [get_meta] Querying IMDS for network/interfaces/macs/(mac)/device-number
Feb 07 08:18:35 gs.lennox-it.uk setup-policy-routes[36705]: /usr/share/amazon-ec2-net-utils/lib.sh: line 89: imds_endpoint: unbound variable
Feb 07 08:07:31 gs.lennox-it.uk ec2net[36706]: [get_meta] Querying IMDS for network/interfaces/macs/(mac)/device-number
Feb 07 08:07:19 gs.lennox-it.uk setup-policy-routes[36657]: /usr/share/amazon-ec2-net-utils/lib.sh: line 89: imds_endpoint: unbound variable
Feb 07 08:04:21 gs.lennox-it.uk CROND[36676]: (root) CMDEND (run-parts /etc/cron.hourly)
Feb 07 08:04:17 gs.lennox-it.uk run-parts[36694]: (/etc/cron.hourly) finished 0anacron
Feb 07 08:03:02 gs.lennox-it.uk run-parts[36685]: (/etc/cron.hourly) starting 0anacron
Feb 07 08:01:05 gs.lennox-it.uk CROND[36677]: (root) CMD (run-parts /etc/cron.hourly)
Feb 07 07:50:37 gs.lennox-it.uk ec2net[36658]: [get_meta] Querying IMDS for network/interfaces/macs/(mac)/device-number
Feb 07 07:50:29 gs.lennox-it.uk setup-policy-routes[36625]: /usr/share/amazon-ec2-net-utils/lib.sh: line 89: imds_endpoint: unbound variable
Feb 07 07:36:35 gs.lennox-it.uk ec2net[36626]: [get_meta] Querying IMDS for network/interfaces/macs/(mac)/device-number
Feb 07 07:03:40 gs.lennox-it.uk CROND[36558]: (root) CMDEND (run-parts /etc/cron.hourly)
Feb 07 07:03:37 gs.lennox-it.uk run-parts[36572]: (/etc/cron.hourly) finished 0anacron
Feb 07 07:02:08 gs.lennox-it.uk run-parts[36565]: (/etc/cron.hourly) starting 0anacron
Feb 07 07:01:05 gs.lennox-it.uk CROND[36559]: (root) CMD (run-parts /etc/cron.hourly)
Feb 07 06:15:08 gs.lennox-it.uk CROND[36482]: (root) CMDEND (run-parts /etc/cron.hourly)
Feb 07 06:15:00 gs.lennox-it.uk run-parts[36499]: (/etc/cron.hourly) finished 0anacron
Feb 07 06:10:46 gs.lennox-it.uk run-parts[36490]: (/etc/cron.hourly) starting 0anacron
Feb 07 06:01:09 gs.lennox-it.uk CROND[36483]: (root) CMD (run-parts /etc/cron.hourly)
Feb 07 05:58:49 gs.lennox-it.uk systemd-networkd[1993]: enX0: Failed
Feb 07 05:58:47 gs.lennox-it.uk systemd-networkd[1993]: enX0: Could not set DHCPv4 route: Connection timed out
Feb 07 05:50:36 gs.lennox-it.uk audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect co>
Feb 07 05:50:35 gs.lennox-it.uk audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect c>
Feb 07 05:50:32 gs.lennox-it.uk systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Feb 07 05:50:30 gs.lennox-it.uk systemd[1]: sysstat-collect.service: Deactivated successfully.
Feb 07 05:50:13 gs.lennox-it.uk systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Feb 07 05:50:02 gs.lennox-it.uk ec2net[36462]: Got IMDSv2 token from http://169.254.169.254/latest
Feb 07 05:48:01 gs.lennox-it.uk start-amazon-cloudwatch-agent[2008]: request expired, resigning
Feb 07 05:47:39 gs.lennox-it.uk ec2net[36455]: [get_meta] Querying IMDS for network/interfaces/macs/(mac)/device-number
Feb 07 05:46:09 gs.lennox-it.uk start-amazon-cloudwatch-agent[2008]: request expired, resigning
Feb 07 05:41:12 gs.lennox-it.uk start-amazon-cloudwatch-agent[2008]: request expired, resigning
Feb 07 05:40:52 gs.lennox-it.uk audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect co>
Feb 07 05:40:52 gs.lennox-it.uk audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect c>
Feb 07 05:40:49 gs.lennox-it.uk systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Feb 07 05:40:49 gs.lennox-it.uk systemd[1]: sysstat-collect.service: Deactivated successfully.
Feb 07 05:40:39 gs.lennox-it.uk systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Feb 07 05:38:21 gs.lennox-it.uk amazon-ssm-agent[2173]: 2025-02-07 05:38:10.5509 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role crede>
Feb 07 05:38:10 gs.lennox-it.uk amazon-ssm-agent[2173]: caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exc>
Feb 07 05:38:10 gs.lennox-it.uk amazon-ssm-agent[2173]: caused by: RequestError: send request failed
Feb 07 05:38:10 gs.lennox-it.uk amazon-ssm-agent[2173]: status code: 0, request id:
Feb 07 05:38:10 gs.lennox-it.uk amazon-ssm-agent[2173]: caused by: :
Feb 07 05:38:10 gs.lennox-it.uk amazon-ssm-agent[2173]: 2025-02-07 05:37:57.6154 ERROR [TokenRequestService] failed to retrieve instance identity role. Error: >
Feb 07 05:37:21 gs.lennox-it.uk amazon-ssm-agent[2173]: 2025-02-07 05:37:19.1422 WARN EC2RoleProvider Failed to connect to Systems Manager with instance profil>
Feb 07 05:31:43 gs.lennox-it.uk start-amazon-cloudwatch-agent[2008]: request expired, resigning
Feb 07 05:31:14 gs.lennox-it.uk audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect co>
Feb 07 05:31:14 gs.lennox-it.uk audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sysstat-collect c>
Feb 07 05:31:11 gs.lennox-it.uk systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Feb 07 05:31:11 gs.lennox-it.uk systemd[1]: sysstat-collect.service: Deactivated successfully.
Feb 07 05:30:56 gs.lennox-it.uk systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Feb 07 05:30:19 gs.lennox-it.uk ec2net[36392]: Got IMDSv2 token from http://169.254.169.254/latest
Feb 07 05:27:15 gs.lennox-it.uk ec2net[36383]: [get_meta] Querying IMDS for network/interfaces/macs/(mac)/device-number
Feb 07 05:27:11 gs.lennox-it.uk setup-policy-routes[36366]: /usr/share/amazon-ec2-net-utils/lib.sh: line 89: imds_endpoint: unbound variable

asked a year ago · 929 views
2 Answers

I'm having the same issue. Every day or two the instance becomes unreachable and I have to reboot. I see this in the logs:

Querying IMDS for network/interfaces/macs/xx:xx:xx:xx:xx:xx/device-number
/usr/share/amazon-ec2-net-utils/lib.sh: line 89: imds_endpoint: unbound variable

answered 7 months ago

The issue you're experiencing with your EC2 instance running Amazon Linux 2023 (AL2023) appears to be related to network failures that are not recovering automatically. This problem is likely caused by a combination of factors, including high CPU usage from your "Good Sync" software and potential issues with the network configuration or drivers in AL2023.

Here are some potential causes and solutions to consider:

  1. Network Interface Naming: AL2023 may use different network interface naming conventions compared to older versions. The logs show "enX0" instead of "eth0". Ensure your network configuration files are updated to use the correct interface names.

  2. DHCP Issues: The logs indicate DHCP-related failures. This could be due to the high CPU usage preventing timely DHCP renewals. You might want to configure a static IP address temporarily to see if it resolves the issue.

  3. EC2 Networking Utilities: There seem to be errors related to Amazon EC2 networking utilities. Ensure you have the latest version of the amazon-ec2-net-utils package installed and properly configured.

  4. Resource Exhaustion: As you mentioned, the high CPU usage might be causing resource exhaustion. While cpulimit isn't available through yum, you could try using the 'nice' command to lower the priority of the Good Sync process, or use cgroups to limit its resource usage.

  5. Kernel Parameters: You might need to adjust some kernel networking parameters to be more tolerant of high-load situations. For example, increasing the ARP cache size or adjusting TCP keepalive settings.

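  6. Network Service: The journal shows systemd-networkd managing enX0, which is the default network stack on AL2023 (NetworkManager is not used by default). Check that systemd-networkd is running and healthy, for example with 'networkctl status enX0' and 'systemctl status systemd-networkd'.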

  7. Driver Issues: There might be compatibility issues with network drivers in AL2023. Check for any available updates or known issues with your instance type and AL2023.
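
As a rough illustration of point 4, the sketch below builds a command line that runs a process in a transient systemd scope with a CPU cap and a lower scheduling priority. It assumes systemd-run is available (AL2023 is systemd-based) and that the resulting command is run as root. The function name and default values are my own choices for illustration, not something Good Sync or AWS provides.

```python
def build_limited_command(command, cpu_quota_percent=50, niceness=10):
    """Wrap `command` (a list of argv strings) with a CPU cap and a nice value.

    Assumption: systemd-run and nice are on PATH and the caller has
    sufficient privileges to create a transient scope.
    """
    if not command:
        raise ValueError("command must contain at least the program to run")
    if cpu_quota_percent <= 0:
        raise ValueError("cpu_quota_percent must be positive")
    if not -20 <= niceness <= 19:
        raise ValueError("niceness must be between -20 and 19")

    return [
        "systemd-run", "--scope",               # run inside a transient scope unit
        "-p", f"CPUQuota={cpu_quota_percent}%",  # cgroup CPU limit for that scope
        "nice", "-n", str(niceness),             # lower the scheduling priority
        *command,
    ]
```

The returned list can be passed to subprocess.run, for example build_limited_command(["/path/to/goodsync"]) where the path is a placeholder for wherever Good Sync is installed. CPUQuota is expressed relative to a single CPU, so 50% means half of one core.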

To troubleshoot further:

  • Monitor resource usage more closely using tools like 'top', 'iotop', and 'netstat'.
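  • Review the systemd journal with 'journalctl' (for example 'journalctl -b -1' for the previous boot) for more detailed error messages. AL2023 logs to the journal by default, so /var/log/messages may not exist unless rsyslog is installed, and /var/log/syslog is a Debian/Ubuntu path.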
  • Consider setting up a CloudWatch alarm to notify you of high CPU usage or network issues before they cause a complete outage.
  • If possible, test the Good Sync software on a different EC2 instance type or a different Linux distribution to isolate whether the issue is specific to AL2023 or your current instance configuration.
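
To make the log review less manual, here is a minimal sketch that scans journal lines for the failure signatures visible in the question's excerpt. The pattern names and the helper functions are hypothetical. The journalctl call assumes the journal is persistent, so that the previous boot is still available after a forced stop.

```python
import re
import subprocess

# Signatures copied from the journal excerpt in the question.
FAILURE_PATTERNS = {
    "networkd_failed": re.compile(r"systemd-networkd\[\d+\]: \S+: Failed"),
    "dhcp_route_timeout": re.compile(r"Could not set DHCPv4 route"),
    "imds_unbound_variable": re.compile(r"imds_endpoint: unbound variable"),
    "imds_token_timeout": re.compile(
        r"169\.254\.169\.254/latest/api/token.*context deadline exceeded"
    ),
}


def find_network_failures(lines):
    """Return (pattern_name, line) pairs for lines matching a known signature."""
    hits = []
    for line in lines:
        for name, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                hits.append((name, line.strip()))
                break  # report each line once, under the first matching name
    return hits


def scan_previous_boot():
    """Scan the previous boot's journal (needs journalctl and read access)."""
    result = subprocess.run(
        ["journalctl", "-b", "-1", "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return find_network_failures(result.stdout.splitlines())
```

Calling scan_previous_boot() after a forced stop and restart should list the networkd and IMDS errors from the boot that lost connectivity, which makes it easier to see which one appears first.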

If these steps don't resolve the issue, you may need to consider either downgrading to a previous, stable version of Amazon Linux or working with AWS support to identify any potential AL2023-specific networking issues.

answered a year ago
