How can I troubleshoot unhealthy Route 53 health checks?


The Amazon Route 53 health checks that I created are reporting as unhealthy.

Resolution

Note: If you receive errors when running AWS Command Line Interface (AWS CLI) commands, make sure that you’re using the most recent AWS CLI version.

First, determine the reason for the last health check failure using the AWS Management Console. Or, use the get-health-check-last-failure-reason command in the AWS CLI.
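The AWS CLI call can be wrapped in a small helper, sketched below. The function name last_failure_reasons and the health check ID in the usage comment are placeholders for illustration. The sketch assumes that the AWS CLI is installed and configured with permission to read Route 53 health checks.

```shell
# Hypothetical helper: print the Region, IP address, and last failure reason
# reported by each Route 53 health checker.
# $1 = health check ID
last_failure_reasons() {
  aws route53 get-health-check-last-failure-reason \
    --health-check-id "$1" \
    --query 'HealthCheckObservations[].[Region,IPAddress,StatusReport.Status]' \
    --output text
}

# Usage (placeholder ID):
# last_failure_reasons abcdef11-2222-3333-4444-555555fedcba
```

Each output line corresponds to one health checker, so you can see whether the failure reason is the same from every Region.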

Then, complete the corresponding troubleshooting steps in the following section to identify and fix the issue.

Note: Regardless of the health check type, check the Invert health check status option. If this option is turned on, then Route 53 reverses the reported status: a health check that would otherwise be healthy is considered unhealthy, and the reverse.

Troubleshoot a health check that monitors an endpoint

Error: The health checker couldn't establish a connection within the timeout limit

The preceding error occurs when the health checkers' attempts to connect to the configured endpoint time out. The maximum times allowed to establish a connection are as follows:

  • For TCP health checks, the TCP connection between the health checkers and the endpoint must happen within ten seconds.
  • For HTTP and HTTPS health checks, the TCP connection between the health checkers and the endpoint must happen within four seconds. The endpoint must respond with a 2xx or 3xx HTTP status code within two seconds after establishing a connection.

For more information, see How Amazon Route 53 determines whether a health check is healthy.

To avoid the timeout error, complete the following steps:

1.    In the health check configuration, note the Domain name or IP address of the endpoint.

2.    Access the endpoint. Confirm that the firewall or server allows connections from the Route 53 public IP addresses for the AWS Regions designated in the health check configuration. See IP ranges and search for service: ROUTE53_HEALTHCHECKS. For endpoint resources that are hosted on AWS, configure security groups and network access control lists to allow the Route 53 health checkers' IP addresses.
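As an alternative to searching the IP ranges file, the AWS CLI can list the health checker ranges directly. The following is a minimal sketch: checker_ip_ranges is a hypothetical helper name, and the sketch assumes a configured AWS CLI. Note that this command returns the ranges for all Regions together. To see ranges for each Region, use the IP ranges file.

```shell
# Hypothetical helper: list the CIDR ranges that Route 53 health checkers use,
# one range per line, so that they can be compared with firewall rules.
checker_ip_ranges() {
  aws route53 get-checker-ip-ranges \
    --query 'CheckerIpRanges' \
    --output text | tr '\t' '\n'
}

# Usage:
# checker_ip_ranges
```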

3.    Use the following tools to test connectivity with the configured endpoint over the internet. In the following example commands, replace the placeholders with the values for your use case.

TCP test

$ telnet <domain name / IP address> <port>

HTTP/HTTPS test

$ curl -Ik -w "HTTPCode=%{http_code} TotalTime=%{time_total}\n" <http/https>://<domain-name/ip address>:<port>/<path> -so /dev/null

Compare the output of the preceding tests with the timeout values for the health checks. Then, confirm that your application responds within the respective time limits.

For example, if you run the following test:

$ curl -Ik -w "HTTPCode=%{http_code} TotalTime=%{time_total}\n" https://example.com -so /dev/null

Then the output is:

HTTPCode=200 TotalTime=0.001963

In this example, the total time to get responses with the HTTP status code 200 is 0.001963 seconds.

For HTTP and HTTPS health checks, the TCP connection must be established within four seconds, and the endpoint must respond with an HTTP status code within two seconds after connecting. Together, these limits allow at most six seconds. A total time higher than six seconds indicates that the endpoint is too slow to respond, so the health check fails. In this case, check your endpoint to make sure that it responds within the timeout period.
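To compare the two limits separately instead of only the total, the timing rule can be sketched as a small check. The function name check_http_timing is hypothetical. Mapping curl's time_connect and time_starttransfer values onto the two limits is an approximation, because time_connect also includes DNS lookup time and time_starttransfer includes any TLS handshake.

```shell
# Hypothetical helper: compare curl timings (in seconds) with the health check limits.
# $1 = time until the TCP connection is established (curl time_connect)
# $2 = time until the first response byte arrives (curl time_starttransfer)
check_http_timing() {
  awk -v tcp="$1" -v ttfb="$2" 'BEGIN {
    if (tcp > 4) {
      print "FAIL: TCP connection took longer than 4 seconds"; exit 1
    }
    if (ttfb - tcp > 2) {
      print "FAIL: response took longer than 2 seconds after connecting"; exit 1
    }
    print "OK: within the health check limits"
  }'
}

# Usage against a placeholder URL:
# check_http_timing $(curl -so /dev/null -w '%{time_connect} %{time_starttransfer}' https://example.com)
```

The function returns a nonzero exit status when either limit is exceeded, so it can also be used in scripts.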

If the output from the test commands shows an HTTP status code other than 2xx or 3xx, then check the following configurations:

  • Firewall rules
  • Security groups
  • Network access control lists

When checking the preceding configurations, confirm that your endpoint allows connections from Route 53 public IP addresses.

4.    If the latency graph option is turned on in the health check configuration, then check the metrics graph for the following:

  • TCP connection time
  • Time to first byte
  • Time to complete SSL handshake

For more information, see Monitoring the latency between health checkers and your endpoint.

Note:

  • You can't turn on the Latency graph for an existing health check. If the option isn't turned on, then you must create a new health check.
  • If the Elastic IP address of the endpoint that you're monitoring is released or updated, then the health check might fail.

Error: SSL alert handshake_failure

The handshake failure error indicates that SSL or TLS negotiation with the endpoint failed. When you turn on SNI (HTTPS only), Route 53 sends the hostname in the "client_hello" message to the endpoint during TLS negotiation. This action allows the endpoint to respond to the HTTPS request with the applicable SSL or TLS certificate.

If your monitored hostname isn't part of the common name in the endpoint's SSL or TLS certificate, then you receive an "SSL alert handshake_failure" error.

Note: To turn on SNI, the monitored endpoint must support SNI.
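To reproduce the negotiation from a client, you can attempt a TLS handshake that sends the monitored hostname through SNI. The following is a minimal sketch that assumes the openssl command line tool is installed. The function name tls_handshake_with_sni and the values in the usage comment are placeholders.

```shell
# Hypothetical helper: attempt a TLS handshake with the endpoint and send the
# monitored hostname in the SNI extension.
# $1 = endpoint domain name or IP address
# $2 = hostname to send through SNI
# $3 = port (defaults to 443)
tls_handshake_with_sni() {
  # Redirecting stdin from /dev/null makes s_client exit after the handshake.
  openssl s_client -connect "$1:${3:-443}" -servername "$2" </dev/null
}

# Usage (placeholder values):
# tls_handshake_with_sni 192.0.2.10 www.example.com
```

If the handshake succeeds, then the output includes the certificate that the endpoint presented, and you can compare its subject with the monitored hostname. If the handshake fails, then the output typically shows the alert that the endpoint returned.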

Troubleshoot health checks with the string match condition

The endpoint server returns "200 OK", but Route 53 marks the health check as unhealthy

Health checkers must establish a TCP connection with the endpoint within four seconds. Health checkers must then receive an HTTP status code of 2xx or 3xx in the next two seconds. Then, the configured string must appear in the first 5,120 bytes of the response body within the next two seconds. If the string isn't present in the first 5,120 bytes, then Route 53 marks the health check as unhealthy.

To verify that the string appears in the first 5,120 bytes of the response body, use the following command. Replace the placeholders with your values.

$ curl -sL <http/https>://<domain-name>:<port> | head -c 5120 | grep -F "<search-string>"
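The same check can be written as a reusable function that reports the result through its exit status. This is a minimal sketch: string_in_first_5120_bytes is a hypothetical name, and the URL and search string in the usage comment are placeholders. Because grep matches line by line, a search string that spans a line break isn't detected.

```shell
# Hypothetical helper: succeed only if the literal string in $1 occurs
# within the first 5,120 bytes read from standard input.
string_in_first_5120_bytes() {
  head -c 5120 | grep -qF -- "$1"
}

# Usage (placeholder URL and string):
# curl -s https://example.com/health | string_in_first_5120_bytes 'status: OK' \
#   && echo "string found" || echo "string not found in the first 5,120 bytes"
```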

Troubleshoot a health check that monitors a CloudWatch alarm

Route 53 doesn't wait for the Amazon CloudWatch alarm to go into the ALARM state.

The preceding behavior occurs because Route 53 monitors the metric data stream instead of the state of the CloudWatch alarm.

To troubleshoot this behavior, complete the following steps:

1.    Verify the configuration of the health check that's in the INSUFFICIENT DATA state. If the metric data stream provides insufficient information to determine the state of the alarm, then the health check status depends on the InsufficientDataHealthStatus configuration. The status options for the InsufficientDataHealthStatus setting are healthy, unhealthy, or last known status.
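To read the current setting from the AWS CLI, you can query the health check configuration. The following is a minimal sketch that assumes a configured AWS CLI. The function name insufficient_data_status and the health check ID in the usage comment are placeholders.

```shell
# Hypothetical helper: print the InsufficientDataHealthStatus setting of a health check.
# $1 = health check ID
insufficient_data_status() {
  aws route53 get-health-check \
    --health-check-id "$1" \
    --query 'HealthCheck.HealthCheckConfig.InsufficientDataHealthStatus' \
    --output text
}

# Usage (placeholder ID):
# insufficient_data_status abcdef11-2222-3333-4444-555555fedcba
```

The API reports the setting as Healthy, Unhealthy, or LastKnownStatus.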

2.    When you update the configuration of a CloudWatch alarm, the new settings don't automatically appear in the associated health check. To synchronize the health check configuration with the updated CloudWatch alarm's configuration:

  • In the Route 53 console, choose Health Checks.
  • Select the health check, and then choose Synchronize configuration.
AWS OFFICIAL, updated a year ago