Why am I experiencing high consumer lag in my Amazon MSK cluster?


I want to troubleshoot consumer lag in my Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster.



To determine consumer lag, monitor Amazon CloudWatch metrics such as MaxOffsetLag and SumOffsetLag, or run the following command and check the LAG column in its output:

./kafka-consumer-groups.sh --bootstrap-server <broker endpoints> --group <groupid> --describe --command-config <properties>

Note: Replace <broker endpoints> with your broker endpoints, <groupid> with your consumer group ID, and <properties> with the path to your client properties file.
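To compare lag across partitions, you can save the command output and parse it. The following is a minimal sketch, assuming the column header that recent Apache Kafka releases print (GROUP, TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG, and so on). The function names and the sample rows are hypothetical and for illustration only.

```python
from collections import defaultdict


def parse_lag(describe_output):
    """Return {(topic, partition): lag} parsed from --describe output text."""
    lags = {}
    columns = None
    for line in describe_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if {"TOPIC", "PARTITION", "LAG"} <= set(fields):
            # Header row: remember the position of each column.
            columns = {name: i for i, name in enumerate(fields)}
            continue
        if columns is None or len(fields) <= columns["LAG"]:
            continue  # warnings or other lines that are not data rows
        lag = fields[columns["LAG"]]
        if lag.isdigit():  # "-" means the group has no committed offset yet
            topic = fields[columns["TOPIC"]]
            partition = int(fields[columns["PARTITION"]])
            lags[(topic, partition)] = int(lag)
    return lags


def total_lag_by_topic(lags):
    """Sum per-partition lag into one number per topic."""
    totals = defaultdict(int)
    for (topic, _partition), lag in lags.items():
        totals[topic] += lag
    return dict(totals)


# Hypothetical sample output, not taken from a real cluster.
SAMPLE = """\
GROUP   TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST       CLIENT-ID
orders  events  0          100             120             20   c1-abc       /10.0.0.1  c1
orders  events  1          50              500             450  c2-def       /10.0.0.2  c2
orders  events  2          -               30              -    -            -          -
"""

if __name__ == "__main__":
    print(total_lag_by_topic(parse_lag(SAMPLE)))
```

For the sample rows, the total lag for the events topic is 470. Partition 2 is skipped because it has no committed offset.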

Troubleshoot common issues related to consumer lag

Verify consumer-to-partition ratio: Each consumer in a consumer group reads from a subset of partitions based on the number of consumers available in the group. If each consumer reads from many partitions, then it might have to process more data than it can keep up with, which causes lag.

Where possible, keep the consumer-to-partition ratio close to 1:1. If the lag persists, then increase the number of partitions and consumers.
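To see how a given number of consumers shares the partitions, the following sketch assumes an even split. The actual assignment depends on the partition assignor that your consumers use, and the function name is hypothetical. Because a partition is read by at most one consumer in a group, consumers beyond the partition count receive nothing.

```python
def partitions_per_consumer(num_partitions, num_consumers):
    """Return the partition count for each consumer under an even split."""
    if num_partitions < 0 or num_consumers <= 0:
        raise ValueError("need num_partitions >= 0 and num_consumers > 0")
    base, extra = divmod(num_partitions, num_consumers)
    # The first `extra` consumers take one additional partition each.
    return [base + 1] * extra + [base] * (num_consumers - extra)


if __name__ == "__main__":
    print(partitions_per_consumer(12, 4))  # 1:3 ratio, every consumer is busy
    print(partitions_per_consumer(4, 6))   # two consumers sit idle
```

A result that contains zeros means that you have more consumers than partitions, so adding consumers alone doesn't help until you also add partitions.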

Identify the outlier: A single outlier might cause most of the consumer lag. Check whether one partition or one consumer contributes significantly more lag than the others in the consumer group. Identify the source of the problem so that you can apply the appropriate solution. As a final option, restart the consumer application.
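One simple way to spot an outlier is to compare each partition's lag with the median lag of the group. The following sketch is one possible heuristic, not an official method. The factor of 3 and the function name are assumptions that you should tune for your workload.

```python
from statistics import median


def lag_outliers(lag_by_partition, factor=3.0):
    """Return partitions whose lag exceeds `factor` times the median lag."""
    if not lag_by_partition:
        return []
    typical = median(lag_by_partition.values())
    # Use a floor of 1 so that a median of zero still gives a usable threshold.
    threshold = factor * max(typical, 1)
    return sorted(p for p, lag in lag_by_partition.items() if lag > threshold)


if __name__ == "__main__":
    # Hypothetical per-partition lag values.
    print(lag_outliers({0: 20, 1: 25, 2: 900, 3: 18}))
```

With the hypothetical values above, only partition 2 is reported. From there, check which consumer owns that partition and whether its keys or message sizes differ from the rest.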

Check resource usage on the consumer host: Monitor CPU, memory, and network usage on the hosts that run your consumer applications to see if there's any resource starvation. Slow consumers process messages slowly, which results in consumer lag.

Check for consumer group rebalancing: During a consumer group rebalance, all the consumer partition assignments are revoked. As a result, consumers stop reading from the topic until the rebalance completes, and lag increases. For more information, see How do I troubleshoot the continuous rebalancing of my consumer group?

Evaluate the consumer configuration: Consumer lag might occur when the producer writes faster than the consumer can read. To read the data as soon as it's produced, adjust fetch.min.bytes and fetch.max.wait.ms in the consumer configuration, and use max.partition.fetch.bytes to control how much data the consumer fetches from each partition. The max.poll.records property limits how many records a single poll returns, and max.poll.interval.ms sets the maximum time allowed between polls before the consumer is considered failed and the group rebalances. Adjust these settings to help reduce consumer lag.
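When you tune max.poll.records, it can help to check that a full batch still finishes within max.poll.interval.ms. The following sketch assumes that records are processed one at a time at a roughly constant cost. The default of 300000 ms matches the Apache Kafka default for max.poll.interval.ms, and the 50% headroom and the function names are assumptions.

```python
def worst_case_batch_ms(max_poll_records, per_record_ms):
    """Time to process one full poll batch, in milliseconds."""
    return max_poll_records * per_record_ms


def largest_safe_max_poll_records(per_record_ms,
                                  max_poll_interval_ms=300_000,
                                  headroom=0.5):
    """Largest batch size that uses at most `headroom` of the poll interval."""
    if per_record_ms <= 0:
        raise ValueError("per_record_ms must be positive")
    budget_ms = max_poll_interval_ms * headroom
    return max(1, int(budget_ms // per_record_ms))


if __name__ == "__main__":
    # 500 records at 1 second each takes longer than the 5 minute interval.
    print(worst_case_batch_ms(500, 1000) > 300_000)
    print(largest_safe_max_poll_records(1000))
```

If the worst-case batch time exceeds the poll interval, then the consumer can be removed from the group mid-batch, which triggers a rebalance and adds to the lag.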

Manage message size: Large message sizes can cause consumer lag, especially if your consumer application is processing messages slowly. Increase the number of consumer instances to handle the workload.

Review your application design: Your consumer application’s design can impact consumer lag. Check if you designed your application to handle the volume of messages that you're processing. Scale up your application, or optimize your processing logic.
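To judge whether your application can handle the volume, compare the produce rate with what your consumers can process. The following back-of-the-envelope sketch assumes steady rates measured in messages per second. The function names and the example numbers are hypothetical.

```python
import math


def consumers_needed(produce_rate, per_consumer_rate):
    """Minimum consumers required to keep up with the produce rate."""
    if per_consumer_rate <= 0:
        raise ValueError("per_consumer_rate must be positive")
    return math.ceil(produce_rate / per_consumer_rate)


def seconds_to_drain(lag, produce_rate, consume_rate):
    """Time to clear existing lag, or infinity if consumers can't catch up."""
    if consume_rate <= produce_rate:
        return math.inf
    return lag / (consume_rate - produce_rate)


if __name__ == "__main__":
    print(consumers_needed(10_000, 3_000))
    print(seconds_to_drain(60_000, 10_000, 12_000))
```

If the drain time is infinite, then the group consumes no faster than producers write, and lag doesn't shrink until you scale out or speed up processing.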

Monitor broker resource usage: Monitor CPU usage on the brokers to check whether they are overloaded, which can increase lag. For more information on troubleshooting high CPU usage, see How can I troubleshoot high CPU usage on one or more brokers in an Amazon MSK cluster?

Optimize the cluster for the workload: Check that the Kafka brokers in your MSK cluster are configured and optimized for your workload. Make sure that your topic partitions are evenly distributed across your brokers. Confirm that your replication factor is appropriately set.

Determine network latency: High network latency between a consumer and the MSK cluster can result in high consumer lag. Check the network connection between your consumer application and your MSK cluster. If the connection is slow, then move your consumer application closer to the Kafka brokers or optimize your network configuration.

AWS OFFICIAL Updated a year ago