How do I troubleshoot the continuous rebalancing of my Amazon MSK consumer group?

My Amazon Managed Streaming for Apache Kafka (Amazon MSK) consumer group is rebalancing continuously. I want to troubleshoot why this is happening.

Resolution

Apache Kafka consumers are typically part of a consumer group. When the following conditions are true, each consumer in the consumer group receives messages from a different subset of the topic's partitions:

  • Multiple consumers subscribe to a topic.
  • These consumers belong to the same consumer group.

When a consumer can't reach the cluster, the group coordinator removes the consumer from the consumer group. This process initiates a rebalancing event that includes these actions:

  • The remaining consumers give up their current partitions.
  • The group coordinator redistributes the topic's partitions to the remaining consumers.
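
The redistribution step can be illustrated with a small sketch. It uses a simplified round-robin assignment, not the logic of any of Kafka's built-in assignors, and the function name assign_round_robin and the consumer names c1, c2, and c3 are hypothetical.

```python
def assign_round_robin(partitions, consumers):
    """Spread partitions across consumers in round-robin order (simplified model)."""
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    ordered = sorted(consumers)
    for index, partition in enumerate(sorted(partitions)):
        assignment[ordered[index % len(ordered)]].append(partition)
    return assignment


partitions = list(range(6))  # a topic with six partitions

# Three reachable consumers share the topic.
before = assign_round_robin(partitions, ["c1", "c2", "c3"])

# c3 can't reach the cluster, so its partitions go to the remaining consumers.
after = assign_round_robin(partitions, ["c1", "c2"])

print(before)
print(after)
```

In this model, every partition that c3 owned ends up with c1 or c2 after the rebalance, and no partition is left unassigned.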

Your consumer group might rebalance under the following conditions:

  • You moved the partition ownership from one consumer to another in a consumer group.
  • You added a new consumer to the consumer group.
  • A consumer shuts down, crashes, or leaves the consumer group.
  • You modified the topics, and a partition realignment happens.
  • There's a client configuration issue in the consumer group's subscriptions. This issue results from a mismatch between the topics that the group subscribes to and the topics that are assigned to each individual consumer in the group.

With the default partition assignment behavior, the consumers in a consumer group can't continue to consume data until the rebalancing event is complete. To avoid this pause, change the partition assignment strategy to CooperativeStickyAssignor. With this strategy, consumers keep consuming from the partitions that don't move during the rebalance.
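
To see why a cooperative strategy is less disruptive, the following sketch compares which partitions each consumer gives up during a rebalance under the two approaches. It is an illustrative model only, and the function names and sample assignments are hypothetical. In the Java client (Apache Kafka 2.4 or later), you select the strategy with partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor.

```python
def revoked_eager(old_assignment):
    """Eager rebalance: every consumer gives up all of its partitions first."""
    return {consumer: set(parts) for consumer, parts in old_assignment.items()}


def revoked_cooperative(old_assignment, new_assignment):
    """Cooperative rebalance: a consumer gives up only partitions that move away."""
    return {
        consumer: set(parts) - set(new_assignment.get(consumer, ()))
        for consumer, parts in old_assignment.items()
    }


old = {"c1": [0, 1, 2], "c2": [3, 4, 5]}
# A third consumer joins; assume the assignor moves as few partitions as possible.
new = {"c1": [0, 1], "c2": [3, 4], "c3": [2, 5]}

eager = revoked_eager(old)
cooperative = revoked_cooperative(old, new)

print(eager)
print(cooperative)
```

In this example, the eager model pauses all six partitions, while the cooperative model pauses only the two partitions that move to the new consumer.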

To prevent your consumer group from continuously rebalancing, try the following:

  • Either lower the max.partition.fetch.bytes value or increase the session timeout (session.timeout.ms) value in the consumer configuration. The consumer must call poll() frequently enough to avoid a session timeout and the rebalance that follows. If a single poll() returns a large amount of data, then the consumer might take a long time to process it. As a result, the consumer doesn't reach the next iteration of the poll loop in time.
    Note: A higher session.timeout.ms value reduces the possibility of an accidental rebalance. However, it might take longer to detect a real failure. This parameter is related to heartbeat.interval.ms. The heartbeat.interval.ms parameter controls how often the consumer sends a heartbeat to the group coordinator. The session.timeout.ms parameter controls how long a consumer can go without sending a heartbeat.
    For example, suppose that you're running Apache Kafka 0.10.1 or later and handling records that take a long time to process. In these versions, the consumer sends heartbeats from a background thread, and max.poll.interval.ms separately limits the time between poll() calls. In this case, tune max.poll.interval.ms to allow longer delays between polls for new records.
  • Be sure that the session.timeout.ms value in the consumer configuration is no higher than group.max.session.timeout.ms and no lower than group.min.session.timeout.ms in the broker configuration.
  • max.poll.interval.ms places an upper bound on the amount of time that the consumer can go between calls to poll() before the consumer is considered failed. By default, this value is set to 5 minutes (300000 ms). If this value is set to less than 5 minutes, increase it to reduce the possibility of rebalancing. You can also decrease max.poll.records so that each poll() returns fewer records and the consumer finishes processing them within max.poll.interval.ms.
  • heartbeat.interval.ms is the expected time between heartbeats to the group coordinator when you're using Kafka's group management facilities. Heartbeats make sure that the consumer's session stays active, and they facilitate rebalancing when consumers join or leave the group. This value must be lower than session.timeout.ms. Typically, set it to no more than one-third of session.timeout.ms. You can lower heartbeat.interval.ms further to control the expected time for normal rebalances.
  • If you performed a partition reassignment recently that involves changes to partitions in one of the consumer group's subscribed topics, then the consumer group might rebalance. This is because the partitions involved are moved around or altered. In this case, refrain from restarting the group coordinator or other Kafka brokers. You must wait for partition reassignment to complete before trying to stop the consumer group from rebalancing. It's a best practice to do partition reassignments during low traffic times.
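
The timing relationships in the preceding list can be checked with a short script before you deploy a consumer. The following is a minimal sketch under stated assumptions: the function name check_consumer_config is hypothetical, the configuration values are samples rather than recommendations, and seconds_per_record is your own estimate of the worst-case processing time for one record. Check your cluster's broker configuration for the actual group.min.session.timeout.ms and group.max.session.timeout.ms values.

```python
def check_consumer_config(consumer, broker, seconds_per_record):
    """Return warnings for timing settings that commonly lead to rebalances."""
    warnings = []
    session = consumer["session.timeout.ms"]
    heartbeat = consumer["heartbeat.interval.ms"]
    poll_interval = consumer["max.poll.interval.ms"]
    max_records = consumer["max.poll.records"]

    low = broker["group.min.session.timeout.ms"]
    high = broker["group.max.session.timeout.ms"]
    if not low <= session <= high:
        warnings.append("session.timeout.ms is outside the range the broker allows")

    if heartbeat * 3 > session:
        warnings.append("heartbeat.interval.ms is more than one-third of session.timeout.ms")

    # Worst case: poll() returns a full batch and every record is slow.
    batch_ms = max_records * seconds_per_record * 1000
    if batch_ms >= poll_interval:
        warnings.append("a full batch can take longer than max.poll.interval.ms")
    return warnings


# Sample values for illustration only.
broker = {"group.min.session.timeout.ms": 6000, "group.max.session.timeout.ms": 1800000}
consumer = {
    "session.timeout.ms": 45000,
    "heartbeat.interval.ms": 3000,
    "max.poll.interval.ms": 300000,
    "max.poll.records": 500,
}

# 500 records at 1 second each is 500,000 ms, which exceeds the 300,000 ms limit.
slow = check_consumer_config(consumer, broker, seconds_per_record=1.0)

# Lowering max.poll.records to 100 brings the worst case down to 100,000 ms.
tuned = check_consumer_config({**consumer, "max.poll.records": 100}, broker, 1.0)

print(slow)
print(tuned)
```

The check treats the worst case as a full batch of slow records. If your processing time varies widely, leave extra headroom between the estimated batch time and max.poll.interval.ms.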

In some cases, you might see the following information in Amazon MSK broker logs:

[2023-03-01 01:23:45,678] INFO [GroupCoordinator 1]: Preparing to rebalance group amazon.msk.canary.group.broker-1 in state PreparingRebalance with old generation 382660 (__consumer_offsets-21) (reason: Adding new member consumer-amazon.msk.canary.group.broker-1-xxxx-xxxx-xxxx-xxxx-xxxx-xxxx with group instance id None) (kafka.coordinator.group.GroupCoordinator)

This message indicates that amazon.msk.canary.group.broker-N is in PreparingRebalance state.

amazon.msk.canary.group.broker-N groups are internal consumer groups that are added or removed regularly to check cluster health and diagnostic metrics. These groups are negligible in size and can't be deleted. You can ignore this message.
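
When you review broker logs, it can help to separate these internal groups from your own consumer groups. The following is a minimal sketch that parses a rebalance message in the format shown above and flags groups whose names start with amazon.msk.canary.group. The pattern and the function name parse_rebalance_line are assumptions based on this one sample line; other Kafka versions might format the message or the reason text differently.

```python
import re

# Matches the group name, state, and reason in a "Preparing to rebalance" message.
REBALANCE_PATTERN = re.compile(
    r"Preparing to rebalance group (?P<group>\S+) in state (?P<state>\S+)"
    r".*?\(reason: (?P<reason>.*?)\)"
)

CANARY_PREFIX = "amazon.msk.canary.group."


def parse_rebalance_line(line):
    """Return the rebalance details from a log line, or None if it doesn't match."""
    match = REBALANCE_PATTERN.search(line)
    if match is None:
        return None
    event = match.groupdict()
    # Internal Amazon MSK health-check groups can be ignored.
    event["internal"] = event["group"].startswith(CANARY_PREFIX)
    return event


sample = (
    "[2023-03-01 01:23:45,678] INFO [GroupCoordinator 1]: Preparing to rebalance "
    "group amazon.msk.canary.group.broker-1 in state PreparingRebalance with old "
    "generation 382660 (__consumer_offsets-21) (reason: Adding new member "
    "consumer-amazon.msk.canary.group.broker-1-xxxx-xxxx-xxxx-xxxx-xxxx-xxxx "
    "with group instance id None) (kafka.coordinator.group.GroupCoordinator)"
)

event = parse_rebalance_line(sample)
print(event["group"], event["internal"])
```

Lines where internal is False belong to your own consumer groups and are the ones to investigate.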

Related information

Consumer group stuck in PreparingRebalance state

AWS OFFICIAL
Updated a year ago