Hello,
I would like to inform you that MSK performs broker update and patching workflows using a rolling restart, which keeps the cluster available and durable, so no downtime is required on the MSK side. However, it can still affect your producers and consumers: each broker is briefly unavailable while it restarts, and partition leadership moves from one broker to another as brokers are restarted.
Hence, the error below is not expected during MSK maintenance, since in a rolling restart the other brokers remain available while one goes offline. That said, this error appears to be caused by poll latency, which is expected behavior during security patching because the client can see connection timeouts while a broker restarts.
[Consumer clientId=consumer-aws.fct.entityupdate.webhooklistener.consumer-14, groupId=aws.fct.entityupdate.webhooklistener.consumer] consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
Having said that, I would recommend increasing the max.poll.interval.ms value, since this latency is causing it to be exceeded. Please note that these settings need to be updated on the consumer side and are not configurable at the cluster level; you need to update client.properties (or your consumer configuration) to change these parameters.
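To illustrate, here is a minimal sketch of how these two properties could be set on a plain Java consumer. The bootstrap servers, topic, and group ID are placeholders, and the values shown are examples to be tuned for your workload, not recommendations:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PollTuningExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "b-1.example.kafka.us-east-1.amazonaws.com:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-consumer-group");                                   // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Give the poll loop more headroom so a temporarily slow or restarting broker
        // does not trip the poll-interval check (10 minutes is an illustrative value).
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000");
        // Smaller batches keep each poll() cycle short; tune to your processing time.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic"));                                   // placeholder
            while (true) {
                consumer.poll(Duration.ofSeconds(1))
                        .forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
            }
        }
    }
}
```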
For a deeper analysis of the issue and insights tailored to your cluster and client configurations, I would request that you reach out to the AWS Premium Support team via a support case.
Please find the best-practices documentation here: https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html#ensure-high-availability
I hope the above information is helpful.
Thanks for replying.
Our max.poll.records is already set to 1. Even if max.poll.interval.ms is breached on the client side, the consumer is expected to be kicked out of the group and a new member brought in after a rebalance, so there should be no stalling of processing on the consumer side. We are okay with a bit of latency during the maintenance period, but what we are observing is that the consumers completely stop consuming messages from the topic and lag starts to build up on the broker side. The only workaround in this case is to restart the entire service. We have tried to simulate the scenario by doing multiple restarts of the MSK brokers on our side, but we are not able to reproduce the issue; it happens only when actual AWS maintenance is being done. It happened during this month's maintenance and also during last month's maintenance.
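For reference, a minimal sketch (placeholder broker endpoint, group, and topic names, not our actual service code) of how a ConsumerRebalanceListener could be attached to log every revoke/assign event, to confirm whether the group actually rebalances during the maintenance window or whether the consumer silently stalls:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RebalanceLogger {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "b-1.example.kafka.us-east-1.amazonaws.com:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-consumer-group");                                   // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic"), new ConsumerRebalanceListener() {  // placeholder topic
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // Fires when this consumer loses partitions, e.g. after a poll-interval
                    // breach or when the group rebalances during a broker restart.
                    System.out.println(Instant.now() + " partitions revoked: " + partitions);
                }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Fires when partitions are (re)assigned after a rebalance completes.
                    System.out.println(Instant.now() + " partitions assigned: " + partitions);
                }
            });
            while (true) {
                consumer.poll(Duration.ofSeconds(1))
                        .forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
            }
        }
    }
}
```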
Some more queries:
1. We are using the AWS-recommended 2.8.1 broker version with Kafka client 3.5.1. Does anything need to be changed here? Are you aware of any issues on the broker side with 2.8.1 that may cause this and have been fixed in more recent Kafka versions?
2. We have enabled broker logs, but they do not contain much information. Can you let us know how we can increase the verbosity level of the broker logs so that we can see whether there are errors on the broker side when we run into this issue?
3. Is there a way for us to easily simulate AWS MSK maintenance on our end? We already tried rebooting the brokers on