How do I use SSM Agent logs to troubleshoot issues with SSM Agent in my managed instance?

I want to use my AWS Systems Manager Agent logs to troubleshoot issues with Systems Manager Agent (SSM Agent).

Short description

SSM Agent runs on your managed Amazon Elastic Compute Cloud (Amazon EC2) instance and processes requests from the AWS Systems Manager service. SSM Agent requires that the following conditions are met:

  • SSM Agent must connect to the required service endpoints.
  • SSM Agent must have AWS Identity and Access Management (IAM) permissions to call the Systems Manager API.
  • Amazon EC2 must assume valid credentials from the IAM instance profile.

If any of these conditions aren't met, then SSM Agent fails to run.

To identify the root cause of the SSM Agent failure, review SSM Agent logs in the following locations:

Linux

/var/log/amazon/ssm/amazon-ssm-agent.log

/var/log/amazon/ssm/errors.log

Windows

%PROGRAMDATA%\Amazon\SSM\Logs\amazon-ssm-agent.log

%PROGRAMDATA%\Amazon\SSM\Logs\errors.log
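As a small convenience, the default locations above can be resolved per platform in a script. The following is a minimal sketch; the function name ssm_log_paths and the fallback value C:\ProgramData are assumptions for illustration, not part of SSM Agent.

```python
import ntpath
import posixpath

# File names listed in this article for both platforms.
LOG_FILES = ("amazon-ssm-agent.log", "errors.log")


def ssm_log_paths(system, programdata=r"C:\ProgramData"):
    """Return the default SSM Agent log paths for 'Linux' or 'Windows'.

    `programdata` stands in for the %PROGRAMDATA% environment variable;
    the default used here is an assumption and might differ on your host.
    """
    if system == "Windows":
        base = ntpath.join(programdata, "Amazon", "SSM", "Logs")
        return [ntpath.join(base, name) for name in LOG_FILES]
    if system == "Linux":
        return [posixpath.join("/var/log/amazon/ssm", name) for name in LOG_FILES]
    raise ValueError(f"unsupported platform: {system!r}")
```

On a live host, you might call it as ssm_log_paths(platform.system(), os.environ.get("PROGRAMDATA", r"C:\ProgramData")).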

Note: Because SSM Agent is frequently updated with new capabilities, it's a best practice to configure automated updates for SSM Agent.

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshoot AWS CLI errors. Also, make sure that you're using the most recent AWS CLI version.

To use SSM Agent logs to troubleshoot issues, run the ssm-cli command. Then, follow the troubleshooting steps for your issue.
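To decide which troubleshooting section applies, it can help to scan the log for the error signatures that this article quotes. The following is a rough sketch, not an official tool: the substrings come from the sample messages in this article, the cause labels and function names are illustrative, and real log messages might differ between SSM Agent versions.

```python
# Substrings taken from the sample log messages in this article, paired with
# a short label for the likely cause. Order matters: more specific
# signatures come before the generic "send request failed".
SIGNATURES = [
    ("Failed to fetch instance ID", "metadata service unreachable"),
    ("AccessDeniedException", "missing IAM permissions"),
    ("NoCredentialProviders", "no IAM credentials available"),
    ("ThrottlingException", "API call throttling"),
    ("no valid credentials could be retrieved", "EC2 can't assume the IAM role"),
    ("send request failed", "service endpoint unreachable"),
]


def classify(line):
    """Return the likely cause for one log line, or None if nothing matches."""
    for needle, cause in SIGNATURES:
        if needle in line:
            return cause
    return None


def summarize(lines):
    """Count how often each likely cause appears in an iterable of log lines."""
    counts = {}
    for line in lines:
        cause = classify(line)
        if cause is not None:
            counts[cause] = counts.get(cause, 0) + 1
    return counts
```

For example, summarize(open("/var/log/amazon/ssm/errors.log")) returns a dictionary of causes and counts that points to the section to read first.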

SSM Agent can't talk to the required endpoints

Based on your use case, complete the following tasks.

SSM Agent can't reach the metadata service

When SSM Agent can't reach the metadata service, it also can't locate the AWS Region information, IAM role, or instance ID from that service. In this case, you see an error message in the SSM Agent logs that's similar to the following:

"INFO- Failed to fetch instance ID. Data from vault is empty. RequestError: send request failed caused by: Get http://169.254.169.254/latest/meta-data/instance-id"

This error commonly occurs when you use a proxy for outbound internet connections from your instance before you configure SSM Agent for a proxy. To resolve this issue, configure SSM Agent to use a proxy.

On Windows instances, this error might also occur from a misconfigured persistent network route when you use a custom AMI to launch your instance. You must verify that the route for the metadata service IP points to the correct default gateway. For more information, see Why does my Amazon EC2 Windows instance generate a "Waiting for the metadata service" error?

To verify whether metadata is activated for your instance, run the following command in the AWS CLI:

aws ec2 describe-instances --instance-ids i-1234567898abcdef0 --query 'Reservations[*].Instances[*].MetadataOptions'

Note: Replace i-1234567898abcdef0 with your instance ID.

You receive an output that's similar to the following example:

[
  [{
    "State": "applied",
    "HttpTokens": "optional",
    "HttpPutResponseHopLimit": 1,
    "HttpEndpoint": "enabled",
    "HttpProtocolIpv6": "disabled",
    "InstanceMetadataTags": "disabled"
  }]
]

In this output, "HttpEndpoint": "enabled" indicates that metadata is activated for your instance.
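If you check many instances, you can parse the command output instead of reading it by eye. The following sketch assumes the nested list shape that the --query expression above produces; the function name is hypothetical.

```python
import json


def metadata_endpoint_states(describe_output):
    """Return the HttpEndpoint value for each instance in the query output.

    Expects the JSON text printed by the describe-instances command with
    --query 'Reservations[*].Instances[*].MetadataOptions', which is a list
    (reservations) of lists (instances) of MetadataOptions objects.
    """
    reservations = json.loads(describe_output)
    return [
        options.get("HttpEndpoint")
        for instances in reservations
        for options in instances
    ]
```

Any value other than "enabled" in the returned list suggests that the metadata endpoint is turned off for that instance.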

If metadata isn't activated, then you can turn metadata on with the aws ec2 modify-instance-metadata-options command. For more information, see Modify instance metadata options for existing instances.

SSM Agent can't reach Systems Manager service endpoints

If SSM Agent can't connect with the service endpoints, then SSM Agent fails. SSM Agent must make outbound connections on port 443 to the Systems Manager service endpoints, including ssm.REGION.amazonaws.com.

Note: SSM Agent uses the Region information that the instance metadata service retrieves to replace the REGION value in these endpoints.

When SSM Agent can't connect with the Systems Manager endpoints, you see error messages similar to the following in the SSM Agent logs:

"ERROR [HealthCheck] error when calling AWS APIs. error details - RequestError: send request failed caused by: Post https://ssm.ap-southeast-2.amazonaws.com/: dial tcp 172.31.24.65:443: i/o timeout"

"DEBUG [MessagingDeliveryService] RequestError: send request failed caused by: Post https://ec2messages.ap-southeast-2.amazonaws.com/: net/http: request cancelled while waiting for connection (Client.Timeout exceeded while awaiting headers)"

The following are common reasons why SSM Agent can't connect with the Systems Manager API endpoints on port 443:

  • Instance egress security group rules don't allow outgoing connections on port 443.
  • Virtual private cloud (VPC) endpoint ingress and egress security group rules don't allow incoming and outgoing connections to the VPC interface endpoint on port 443.
  • When the instance lives in a public subnet, routing table rules aren't configured to direct traffic using an internet gateway.
  • When the instance lives in a private subnet, routing table rules aren't configured to direct traffic using a NAT gateway or VPC endpoint.
  • If routing table rules are configured to use a proxy for all outgoing connections, then SSM Agent isn't configured to use a proxy.
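To narrow down which of the preceding causes applies, you can test a direct TCP connection to each endpoint from the instance. The following is a minimal sketch under some assumptions: the endpoint list includes ssm and ec2messages (both appear in the log samples above) plus ssmmessages, which is the third endpoint in Systems Manager's usual endpoint pattern; the function names are illustrative. A direct socket test doesn't go through a proxy, so a failure here is expected when all outbound traffic must use one.

```python
import socket


def ssm_endpoints(region):
    """Build the Systems Manager endpoint host names for a Region."""
    services = ("ssm", "ec2messages", "ssmmessages")
    return [f"{service}.{region}.amazonaws.com" for service in services]


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, {host: can_connect(host) for host in ssm_endpoints("ap-southeast-2")} shows which endpoints are reachable. A timeout usually points to security group or routing rules, and a DNS failure points to name resolution in the VPC.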

SSM Agent doesn't have permissions to call the required Systems Manager API calls

If SSM Agent isn't authorized to make UpdateInstanceInformation API calls to the service, then SSM Agent fails to register itself as online in Systems Manager.

SSM Agent uses the UpdateInstanceInformation API call to maintain a connection with the Systems Manager service so that the service knows that SSM Agent is functioning as expected. SSM Agent calls the Systems Manager service every five minutes to provide health check information. If SSM Agent doesn't have the correct IAM permissions, then it writes an error message to the SSM Agent logs.

If SSM Agent uses the incorrect IAM permissions, then you see an error that's similar to the following:

"ERROR [instanceID=i-XXXXX] [HealthCheck] error when calling AWS APIs. error details - AccessDeniedException: User: arn:aws:sts::XXX:assumed-role/XXX /i-XXXXXX is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:ap-southeast-2:XXXXXXX:instance/i-XXXXXX
status code: 400, request id: XXXXXXXX-XXXX-XXXXXXX
INFO [instanceID=i-XXXX] [HealthCheck] increasing error count by 1"

If SSM Agent doesn't have any IAM permissions, then you see an error that's similar to the following:

"ERROR [instanceID=i-XXXXXXX] [HealthCheck] error when calling AWS APIs. error details - NoCredentialProviders: no valid providers in chain. Deprecated. For verbose messaging see aws.Config.CredentialsChainVerboseErrors
2018-05-08 10:58:39 INFO [instanceID=i-XXXXXXX] [HealthCheck] increasing error count by 1"

Verify that the IAM role that's attached to the instance contains the required permissions to allow an instance to use Systems Manager service core functionality. Or, if an instance profile role isn't already attached, then attach an instance profile role and include AmazonSSMManagedInstanceCore permissions.

For more information about the required IAM permissions for Systems Manager, see Additional policy considerations for managed instances.
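When the log contains AccessDeniedException entries, the denied action and resource are part of the message. The following sketch pulls them out so that you can compare them with the role's policies. The regular expression is based only on the sample message in this article, so treat it as an assumption about the message format.

```python
import re

# Pattern derived from the AccessDeniedException sample in this article.
DENIED = re.compile(
    r"is not authorized to perform: (?P<action>\S+) on resource: (?P<resource>\S+)"
)


def denied_calls(lines):
    """Return the sorted, de-duplicated (action, resource) pairs found in log lines."""
    found = set()
    for line in lines:
        match = DENIED.search(line)
        if match:
            found.add((match.group("action"), match.group("resource")))
    return sorted(found)
```

If the only denied action is ssm:UpdateInstanceInformation, then the role is likely missing the core Systems Manager permissions described above.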

Systems Manager API call throttling

If a high volume of managed instances that run SSM Agent make concurrent UpdateInstanceInformation API calls, then those calls might be throttled.

If the UpdateInstanceInformation API call for your instance is throttled, then you see error messages in the SSM Agent logs similar to the following:

"INFO [HealthCheck] HealthCheck reporting agent health.
ERROR [HealthCheck] error when calling AWS APIs. error details - ThrottlingException: Rate exceeded
status code: 400, request id: XXXXX-XXXXX-XXXX
INFO [HealthCheck] increasing error count by 1"

Use the following troubleshooting steps to prevent ThrottlingException errors:

  • Reduce the frequency of API calls.
  • Implement error retries and exponential backoffs when you make API calls.
  • Stagger the intervals of API calls so that they don't all run at the same time.
  • Request a throttling limit increase for UpdateInstanceInformation API calls.
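For API calls that your own scripts make, retries with exponential backoff can be sketched as follows. This is an illustration under stated assumptions, not SSM Agent's own retry logic: ThrottledError is a hypothetical stand-in for the throttling error that your API client raises, and the base delay, cap, and attempt count are arbitrary defaults. The random jitter also staggers callers so that they don't all retry at the same time.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error raised by an API client."""


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full-jitter delay: a random fraction of min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(call, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Run call(); on ThrottledError, wait with exponential backoff and retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            sleep(backoff_delay(attempt, rng=rng))
```

The sleep and rng parameters exist so that the behavior can be tested without real waiting or randomness.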

Amazon EC2 can't assume valid credentials from the IAM instance profile

If Amazon EC2 can't assume the IAM role, then you see a message that's similar to the following example in the SSM Agent logs:

2023-01-25 09:56:19 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity
2023-01-25 09:56:19 INFO [CredentialRefresher] Sleeping for 1s before retrying retrieve credentials
2023-01-25 09:56:20 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity
2023-01-25 09:56:20 INFO [CredentialRefresher] Sleeping for 2s before retrying retrieve credentials
2023-01-25 09:56:22 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity
2023-01-25 09:56:22 INFO [CredentialRefresher] Sleeping for 4s before retrying retrieve credentials
2023-01-25 09:56:26 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity
2023-01-25 09:56:26 INFO [CredentialRefresher] Sleeping for 9s before retrying retrieve credentials
2023-01-25 09:56:35 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity
2023-01-25 09:56:35 INFO [CredentialRefresher] Sleeping for 17s before retrying retrieve credentials
2023-01-25 09:56:52 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity
2023-01-25 09:56:52 INFO [CredentialRefresher] Sleeping for 37s before retrying retrieve credentials

If you use IMDSv1 to retrieve metadata from the EC2 instance, then you also see an error that's similar to the following example:

# curl http://169.254.169.254/latest/meta-data/iam/security-credentials/profile-name
{
  "Code" : "AssumeRoleUnauthorizedAccess",
  "Message" : "EC2 cannot assume the role profile-name. Please see documentation at https://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_iam-ec2.html#troubleshoot_iam-ec2_errors-info-doc.",
  "LastUpdated" : "2023-01-25T09:57:56Z"
}

Note: In the preceding example, profile-name is the name of the instance profile. If you use IMDSv2, then the preceding command doesn't work. For more information on retrieving metadata, see Retrieve instance metadata for Linux and Windows.
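With IMDSv2, the same document is retrieved in two steps: first request a session token, then send that token with the metadata request. The following sketch uses only the Python standard library. The function names and the default TTL and timeout values are assumptions for illustration. Because urllib honors proxy environment variables, you might need to exclude 169.254.169.254 from the proxy when you run this on an instance that uses one.

```python
import urllib.request

IMDS = "http://169.254.169.254"


def token_request(ttl_seconds=21600):
    """Build the PUT request that asks IMDSv2 for a session token."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def credentials_request(token, role_name):
    """Build the GET request for the role's security-credentials document."""
    return urllib.request.Request(
        f"{IMDS}/latest/meta-data/iam/security-credentials/{role_name}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_role_document(role_name, timeout=2.0):
    """Fetch the document over IMDSv2. This works only on an EC2 instance."""
    with urllib.request.urlopen(token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(credentials_request(token, role_name), timeout=timeout) as resp:
        return resp.read().decode()
```

On the instance, print(fetch_role_document("profile-name")) shows the same JSON document as the preceding curl example, where profile-name is a placeholder.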

To troubleshoot this error, check the trust policy that's attached to the IAM role. In the policy, you must specify Amazon EC2 as a service that's allowed to assume the IAM role. Use the UpdateAssumeRolePolicy API to update the trust policy so that it appears similar to the following example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": ["ec2.amazonaws.com"]
      },
      "Action": ["sts:AssumeRole"]
    }
  ]
}
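Before you update the role, you can check a trust policy document locally for the statement shown above. The following is a simplified sketch with a hypothetical function name. It looks only for an Allow statement that names ec2.amazonaws.com and sts:AssumeRole, and it doesn't evaluate conditions, wildcards, or Deny statements, so it isn't a full policy evaluation.

```python
import json


def _as_list(value):
    """IAM policy fields may be a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def allows_ec2_assume_role(policy_json):
    """Return True if any Allow statement lets ec2.amazonaws.com call sts:AssumeRole."""
    policy = json.loads(policy_json)
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):
            continue  # For example "*"; not handled by this sketch.
        services = _as_list(principal.get("Service", []))
        actions = _as_list(statement.get("Action", []))
        if "ec2.amazonaws.com" in services and "sts:AssumeRole" in actions:
            return True
    return False
```

A result of False for the role's current trust policy is consistent with the AssumeRoleUnauthorizedAccess error described above.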

For more information, see The iam/security-credentials/[role-name] document indicates "Code":"AssumeRoleUnauthorizedAccess".

AWS OFFICIAL
Updated a month ago