How do I troubleshoot high latency on my Application Load Balancer in Elastic Load Balancing?

I want to troubleshoot high latency on my Application Load Balancer in Elastic Load Balancing (ELB).

Short description

The following are causes of high latency on an Application Load Balancer:

  • Network issues such as connectivity problems, packet loss, or DNS resolution delays between the client and the Application Load Balancer
  • Application Load Balancer or infrastructure misconfiguration, such as idle timeout settings, insufficient subnet capacity, missing routes, or restrictive security group or network access control list (network ACL) rules
  • AWS WAF overhead to process POST requests
  • SSL/TLS handshake overhead or certificate issues
  • Health check failures that cause traffic overload on remaining healthy targets
  • Backend issues, such as high CPU or memory usage, incorrect web server configurations, or slow application dependencies

Resolution

To troubleshoot high latency on your Application Load Balancer, take the following actions.

Identify the phase where the latency is occurring

To identify the phase where the latency is occurring, check your Application Load Balancer access logs for the values of the following fields:

  • request_processing_time
  • target_processing_time
  • response_processing_time

Example access log entry:

loadbalancer/50dc6c495c0c9188  
 192.168.131.39:2817 10.0.0.1:80 0.001 12.401  0.001 200 200 34 366  
 "GET http://www.example.com:80/ HTTP/1.1" "curl/7.46.0" - -  
 arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/example-targets/73e2d6bc24d8a067  
 "Root=1-58337262-36d228ad5d99923122bbe354"  "-" "-"  
 0 2024-04-01T22:22:48.364000Z  "forward" "-" "-" "10.0.0.1:80"  "200" "-" "-"

In this example, request_processing_time is 0.001, target_processing_time is 12.401, and response_processing_time is 0.001. The high target_processing_time indicates the backend target caused the latency. For more information, see Access log entries. Compare the client's observed total latency with the sum of the three timing fields. A large difference indicates network issues between the client and the Application Load Balancer.
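
To scan many log entries at once, you can filter on the timing fields with awk. The following is a minimal sketch, not an official tool. It assumes the complete access log entry format, where the timestamp is the second space-separated field, the target is the fifth, and the three timing fields are the sixth through eighth. The function name slow_targets, the default threshold, and the file name in the usage comment are hypothetical.

```shell
# Hypothetical helper: read ALB access log entries on stdin and print the
# entries whose target_processing_time exceeds a threshold (in seconds).
# Assumes the full entry format: field 2 = time, field 5 = target:port,
# fields 6-8 = request, target, and response processing times.
slow_targets() {
  threshold="${1:-1}"
  awk -v t="$threshold" '
    ($7 + 0) > (t + 0) {
      printf "%s target=%s request=%s target_time=%s response=%s\n", $2, $5, $6, $7, $8
    }'
}

# Example usage (placeholder file name; access log files are gzip-compressed):
#   gunzip -c example-access-log.log.gz | slow_targets 5
```

Entries that this filter prints point to slow backend targets. Entries with low values in all three fields, combined with slow client-side measurements, point to the network path instead.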

Check your Amazon CloudWatch metrics across Availability Zones to determine whether the latency is specific to one Availability Zone. If a specific Availability Zone has high latency, then investigate target health in that Availability Zone and any cross-Availability Zone dependencies.
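
One way to make the per-Availability Zone comparison is with the AWS CLI. The following is a sketch under stated assumptions: the choice of the TargetResponseTime metric is an assumption, because the text above doesn't name specific metrics, and the function name, the 300-second period, and the argument values in the usage comment are hypothetical.

```shell
# Hypothetical helper: fetch the average TargetResponseTime for one
# Availability Zone of an Application Load Balancer.
# Arguments:
#   $1  LoadBalancer dimension value (the app/name/id part of the ARN)
#   $2  Availability Zone, for example us-east-1a
#   $3  start time (ISO 8601, UTC)
#   $4  end time (ISO 8601, UTC)
az_target_response_time() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name TargetResponseTime \
    --dimensions "Name=LoadBalancer,Value=$1" "Name=AvailabilityZone,Value=$2" \
    --start-time "$3" \
    --end-time "$4" \
    --period 300 \
    --statistics Average \
    --query 'sort_by(Datapoints, &Timestamp)[].[Timestamp, Average]' \
    --output text
}

# Example usage (placeholder values): run once for each Availability Zone and compare.
#   az_target_response_time app/example-alb/50dc6c495c0c9188 us-east-1a \
#     2024-04-01T22:00:00Z 2024-04-01T23:00:00Z
```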

Check for network and DNS issues

Network connectivity problems between the client and the Application Load Balancer can cause high latency. To check for network and DNS issues, complete the following steps:

  • Verify basic connectivity to your Application Load Balancer. For more information, see Troubleshoot your Application Load Balancers.

  • Use My Traceroute (MTR) to check for packet loss and latency between the client and the Application Load Balancer. Run the following command:

    mtr -rw example-alb.us-east-1.elb.amazonaws.com --report-cycles 100

    Note: Replace example-alb.us-east-1.elb.amazonaws.com with your Application Load Balancer DNS name.

    Example output:

    HOST: ip-10-0-1-50                                 Loss%   Snt  Last   Avg  Best  Wrst StDev
      1.|-- 10.0.1.1                                    0.0%   100   0.5   0.6   0.4   1.2   0.1
      2.|-- 52.93.1.100                                 0.0%   100   1.2   1.3   1.0   3.5   0.3
      3.|-- 54.239.110.5                                2.0%   100   1.5   1.8   1.2   5.1   0.5
      4.|-- example-alb.us-east-1.elb.amazonaws.com     0.0%   100   2.1   2.3   1.8   4.2   0.4

    Each numbered row corresponds to a network device that traffic passes through between the client and the Application Load Balancer.

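  • Look for network devices with high Loss% values or with Wrst latency values that are much higher than their Avg values. Packet loss that begins at an intermediate network device and continues through the remaining rows suggests network congestion at that device. Loss that appears at only one intermediate device, as in row 3 of the example output, is often caused by that device rate-limiting its replies to probe packets and isn't necessarily a problem. If packet loss occurs at the final network device, then the issue is likely with the Application Load Balancer subnet configuration or network path.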

  • Run MTR from both the client side and from an Amazon EC2 instance in the same Amazon Virtual Private Cloud (Amazon VPC) as the Application Load Balancer. Compare results to identify whether the issue is external or internal to AWS. For more information, see Diagnosing packet loss and latency with MTR.

  • To measure the time that each phase of a request takes, run the following curl command:

    curl -kso /dev/null -w "\n===============\n\
    | DNS lookup: %{time_namelookup}\n\
    | Connect: %{time_connect}\n\
    | App connect: %{time_appconnect}\n\
    | Pre-transfer: %{time_pretransfer}\n\
    | Start transfer: %{time_starttransfer}\n\
    | Total: %{time_total}\n\
    | HTTP Code: %{http_code}\n===============\n" https://example.com/

    Note: Replace example.com with your DNS name.

    Example output:

    ===============
    | DNS lookup: 0.002346
    | Connect: 0.003080
    | App connect: 0.008422
    | Pre-transfer: 0.008587
    | Start transfer: 0.050238
    | Total: 0.057486
    | HTTP Code: 200
    ===============

  • To isolate what's causing latency, run the preceding command against your Application Load Balancer, and then run it again directly against a target to bypass your Application Load Balancer. curl reports each timer as cumulative time from the start of the request, so the TLS handshake duration is the App connect time minus the Connect time. If that difference is significantly larger than the Connect time (TCP handshake), then the TLS negotiation is slow. For more information, see curl man page on the curl website.
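
Because the curl timers are cumulative, it can help to convert them into per-phase durations before you compare runs. The following is a minimal sketch that does the subtraction with awk. The function name curl_phases and the phase labels are hypothetical, and the sample call uses the values from the example output above.

```shell
# Hypothetical helper: convert curl's cumulative timers into per-phase durations.
# Arguments, in order:
#   time_namelookup time_connect time_appconnect
#   time_pretransfer time_starttransfer time_total
curl_phases() {
  awk -v dns="$1" -v con="$2" -v app="$3" -v pre="$4" -v ttfb="$5" -v tot="$6" 'BEGIN {
    # time_appconnect is 0 for plain HTTP, so there is no TLS phase to report.
    tls = (app > 0) ? app - con : 0
    printf "DNS lookup:     %.6f\n", dns
    printf "TCP handshake:  %.6f\n", con - dns
    printf "TLS handshake:  %.6f\n", tls
    printf "Wait for reply: %.6f\n", ttfb - pre
    printf "Transfer:       %.6f\n", tot - ttfb
  }'
}

# Values taken from the example output above.
curl_phases 0.002346 0.003080 0.008422 0.008587 0.050238 0.057486
```

For the example values, the TCP handshake takes 0.000734 seconds, the TLS handshake takes 0.005342 seconds, and the wait for the first byte of the reply takes 0.041651 seconds. The wait includes time in the Application Load Balancer and on the target, so compare it between the run through the load balancer and the run that bypasses it.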

Check Application Load Balancer and infrastructure configuration

Idle timeout

  • If the target_processing_time in access logs equals the idle timeout value exactly, increase the Application Load Balancer idle timeout or optimize the application response time. To modify the idle timeout, see Edit attributes for your Application Load Balancer. The default idle timeout is 60 seconds. To check the current idle timeout value, run the following command:

    aws elbv2 describe-load-balancer-attributes --load-balancer-arn example-alb-arn --query "Attributes[?Key=='idle_timeout.timeout_seconds']"

    Note: Replace example-alb-arn with your Application Load Balancer Amazon Resource Name (ARN).

  • Make sure the backend web server's keep-alive timeout is higher than the Application Load Balancer idle timeout to avoid HTTP 502 errors.

  • Check for HTTP 460 errors in your access logs, which indicate the client closed the connection before the idle timeout elapsed.
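
To apply the two access log checks above across many entries, you can count the matching entries with awk. The following is a sketch, not an official tool. It assumes the complete access log entry format, where target_processing_time is the seventh space-separated field and elb_status_code is the ninth. The function name idle_timeout_report and the output labels are hypothetical.

```shell
# Hypothetical helper: read ALB access log entries on stdin and count
#   (a) entries with elb_status_code 460, and
#   (b) entries whose target_processing_time reached the idle timeout.
# Assumes the full entry format: field 7 = target_processing_time,
# field 9 = elb_status_code. The default of 60 matches the default idle timeout.
idle_timeout_report() {
  timeout="${1:-60}"
  awk -v t="$timeout" '
    $9 == 460 { closed++ }
    ($7 + 0) >= (t + 0) { at_limit++ }
    END {
      printf "http_460=%d target_time_at_or_over_%ss=%d\n", closed, t, at_limit
    }'
}

# Example usage (placeholder file name):
#   gunzip -c example-access-log.log.gz | idle_timeout_report 60
```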

Subnet capacity

Each Application Load Balancer subnet requires at least a /27 Classless Inter-Domain Routing (CIDR) block with at least 8 free IP addresses. During traffic spikes, the Application Load Balancer scales and requires additional IP addresses. If the subnet runs out of free IP addresses, then the Application Load Balancer can't scale, which causes latency or HTTP 5xx errors. For expected surges, use load balancer capacity unit (LCU) reservations to pre-provision capacity.
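
To check the free IP address count for your load balancer subnets, you can query Amazon EC2 with the AWS CLI. The following is a minimal sketch. The function name subnets_low_on_ips and the subnet IDs in the usage comment are hypothetical, and the threshold of 8 comes from the requirement stated above.

```shell
# Hypothetical helper: list subnets that have fewer than 8 free IP addresses.
# Arguments: one or more subnet IDs that the Application Load Balancer uses.
subnets_low_on_ips() {
  aws ec2 describe-subnets \
    --subnet-ids "$@" \
    --query 'Subnets[].[SubnetId, CidrBlock, AvailableIpAddressCount]' \
    --output text |
    awk '($3 + 0) < 8 { printf "%s (%s): only %d free IP addresses\n", $1, $2, $3 }'
}

# Example usage (placeholder subnet IDs):
#   subnets_low_on_ips subnet-0123456789abcdef0 subnet-0fedcba9876543210
```

If the function prints nothing, then every listed subnet has at least 8 free IP addresses.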

AWS WAF

If AWS WAF is associated with your Application Load Balancer, then AWS WAF rule evaluation adds to the request_processing_time field. This increase is largest for POST requests with large request bodies, because AWS WAF inspects the body before the request is forwarded to targets. For more information, see Load balancer shows elevated processing times.

SSL/TLS configuration

Verify that the Application Load Balancer uses a current security policy, such as one that supports TLS 1.3, for faster handshakes. For custom certificates, make sure that the full certificate chain is configured. A missing intermediate certificate can add latency to the TLS handshake.
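
To see which security policy each HTTPS listener uses, you can query the listeners with the AWS CLI. The following is a sketch under stated assumptions: the function name https_listener_policies is hypothetical, and example-alb-arn is a placeholder for your Application Load Balancer ARN.

```shell
# Hypothetical helper: print the port and security policy of each HTTPS listener.
# Argument: the Application Load Balancer ARN.
https_listener_policies() {
  aws elbv2 describe-listeners \
    --load-balancer-arn "$1" \
    --query "Listeners[?Protocol=='HTTPS'].[Port, SslPolicy]" \
    --output text
}

# Example usage (placeholder ARN):
#   https_listener_policies example-alb-arn
#
# To view the certificate chain that the load balancer presents, one option is:
#   openssl s_client -connect example.com:443 -servername example.com -showcerts </dev/null
```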

Network infrastructure

For internet-facing Application Load Balancers, verify that the subnets have a route to an internet gateway. For internal Application Load Balancers, verify that the subnets have the appropriate internal routes.

Note: For internal Application Load Balancers, subnets don't require an Internet Gateway route. If backend targets need to reach external services, make sure subnets have a route to a NAT Gateway. You can also use VPC Reachability Analyzer to diagnose connectivity issues between Application Load Balancer Elastic Network Interfaces (ENIs) and targets.

Check backend servers and dependencies

To check backend servers and dependencies, take the following actions:

  • To identify high CPU utilization, check the CloudWatch CPUUtilization metric of your backend instances. To resolve high CPU utilization, upgrade your instance to a larger instance type.

  • To review the processes that are running on your backend and check for memory issues, run the following Linux command.

    watch -n 1 "echo -n 'Apache Processes:' && ps -C httpd --no-headers | wc -l && free -m"

    Note: Replace httpd with the process name for your server.

    Example output:

    Every 1.0s: echo -n 'Apache Processes:' && ps -C httpd --no-headers | wc -l && free -m

    Apache Processes:6
                  total        used        free      shared  buff/cache   available
    Mem:            961         332          78           3         550         483
    Swap:             0           0           0

  • To check how many web server processes are currently running, run the following Linux command:

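    ps aux | grep '[h]ttpd' | wc -l

    Note: Replace httpd with the process name for your server. The brackets around the first letter keep grep from matching its own process, which would otherwise inflate the count by one.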

    Example output:

    7

  • Check the maximum concurrent connection or worker limits for your web server. Each web server has a configuration setting that controls the maximum number of simultaneous requests that it can handle. If this limit is reached, then new requests from the Application Load Balancer queue up, which increases latency. Check the relevant setting for your web server type, for example MaxRequestWorkers for Apache or worker_connections for NGINX.

  • Check for dependencies on your backend instances that might cause latency issues. Example dependencies include Amazon Simple Storage Service (Amazon S3) buckets, network address translation (NAT) instances or proxy servers, remote web services, and databases.

Use Linux tools to identify performance bottlenecks

To identify performance bottlenecks on the server, use the following Linux commands:

  • Run the uptime command to check for high system load averages because of resource contention.
  • Run the mpstat -P ALL 1 command to view a breakdown of CPU usage for each core and to determine imbalanced usage.
  • Run the pidstat 1 command to identify resource-heavy processes and patterns over time.
  • Run the dmesg | tail command to identify recent system-level events that might affect performance. The output shows the last 10 system messages.
  • Run the iostat -xz 1 command to identify high read or write operations or slow disk performance.
  • Run the free -m command to view available system memory. If available memory is low, then the system might rely on swap, which increases disk I/O and latency.
  • Run the sar -n DEV 1 command to identify bandwidth saturation. The output shows network interface throughput.
  • Run sar -n TCP,ETCP 1 to view TCP metrics. The retrans/s metric shows TCP retransmissions per second. When this metric is high, you might be experiencing packet loss or congestion.
  • To identify what's using the most bandwidth on the network, run the iftop command. The output shows active bandwidth usage for each connection in real time.
  • The niftop command is a variant of iftop that might be available through third-party repositories on Red Hat Enterprise Linux (RHEL) and Debian-based distributions. If iftop isn't preinstalled on your system, then use niftop.

Note: Depending on your Linux distribution, you might need to manually install the preceding commands.
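
As a starting point for reading the uptime output, you can compare the 1-minute load average with the number of CPU cores. The following is a minimal sketch. The function name load_per_core is hypothetical, and the threshold of 1.0 per core is a common rule of thumb rather than an AWS recommendation.

```shell
# Hypothetical helper: compare the 1-minute load average with the CPU count.
# With no arguments, it reads /proc/loadavg and nproc (Linux only).
# Optional arguments: $1 = load average, $2 = number of cores.
load_per_core() {
  load="${1:-$(cut -d ' ' -f 1 /proc/loadavg)}"
  cores="${2:-$(nproc)}"
  awk -v l="$load" -v c="$cores" 'BEGIN {
    ratio = l / c
    # A sustained ratio above 1 means more runnable work than cores.
    printf "load=%s cores=%s per_core=%.2f %s\n", l, c, ratio, (ratio > 1 ? "(possible contention)" : "(ok)")
  }'
}

# Example usage:
#   load_per_core          # use live values from this host
#   load_per_core 8.00 4   # evaluate explicit values
```

If the ratio stays above 1, then use mpstat, pidstat, and iostat as described above to find whether CPU, a specific process, or disk I/O is the cause.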

AWS OFFICIAL Updated 21 days ago