How do I make sure that my client I/O isn't disrupted because of security patches?

I want to know some best practices for maintaining high availability in MSK clusters during security patching.

Short description

Amazon Managed Streaming for Apache Kafka (Amazon MSK) uses rolling updates to maintain high availability and support cluster I/O during patching. During this process, brokers are rebooted one at a time, and the next broker isn't rebooted until the partitions on the broker that was just rebooted fully catch up (are in sync). It's normal to see transient disconnect errors on your clients during this update process.

To prevent clients from experiencing downtime during security patching, use the following best practices to make your clusters highly available.

Resolution

Set up a three-AZ cluster

A three-AZ cluster guards against downtime if a single Availability Zone fails.

Amazon MSK sets the broker.rack broker property to achieve a rack-aware replication assignment for fault tolerance at the Availability Zone level. This means that when you use a three-AZ cluster with a replication factor (RF) of three, each of the three partition replicas is in a separate Availability Zone.

Note: A two-AZ cluster with an RF of three doesn't allow each of the three partition replicas to be in a separate Availability Zone. Amazon MSK doesn't allow you to create a cluster in a single Availability Zone.
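
If you want to confirm the Availability Zone placement of your brokers, one way is to read each broker's rack value from the cluster metadata, because Amazon MSK sets broker.rack at the Availability Zone level. The following is a minimal sketch that uses the Kafka Java AdminClient; the bootstrap address is a placeholder for your own cluster's bootstrap brokers.

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class DescribeBrokerRacks {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; use your own cluster's bootstrap brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "b-1.example.kafka.us-east-1.amazonaws.com:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Each broker's rack reflects the value that Amazon MSK assigned through broker.rack.
            for (Node node : admin.describeCluster().nodes().get()) {
                System.out.printf("Broker %d -> rack %s%n", node.id(), node.rack());
            }
        }
    }
}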

Make sure that the replication factor matches the Availability Zone count

When you restart a broker during security patching, the leader partitions on that broker become unavailable. As a result, one of the follower replicas is elected as the new leader so that the cluster can continue to serve clients.

An RF of one can lead to unavailable partitions during a rolling update because the cluster doesn't have any replicas to promote as the new leader. An RF of two, with a minimum in-sync replicas (minISR) value of one, might result in data loss, even when producer acknowledgement (acks) is set to "all." With a minISR of one, a write is successful as soon as the leader alone acknowledges it. If the leader replica's broker goes down immediately after the acknowledgement but before the follower replica catches up, then data loss occurs. For more information about min.insync.replicas, see the Apache Kafka Documentation.
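
As an illustration, the following sketch creates a topic with an RF of three and a minISR of two through the Kafka Java AdminClient. The topic name, partition count, and bootstrap address are placeholders, not values from this article.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "b-1.example.kafka.us-east-1.amazonaws.com:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // An RF of 3 places one replica in each Availability Zone of a three-AZ cluster.
            // min.insync.replicas of 2 tolerates one broker restart without data loss when acks=all.
            NewTopic topic = new NewTopic("example-topic", 6, (short) 3)
                    .configs(Map.of(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}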

Set minISR to at most RF - 1

Setting minISR equal to the RF can result in producer failures when one broker is out of service because of a rolling update. If fewer replicas than the minISR acknowledge a write, then the producer raises an exception. For example, in a three-AZ cluster with an RF of three and a minISR of three, the producer waits for all three partition replicas (including the leader) to acknowledge the messages. When one of the brokers is out of service, only two of the three replicas can return acknowledgements, which results in producer exceptions.

This scenario assumes that the producer acks configuration is set to "all." When you set producer acks to "all", the record isn't lost as long as at least one in-sync replica remains alive. For more details about producer acks, see the Apache Kafka Documentation.
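
For example, a producer configured with acks set to "all" against a topic that has an RF of three and a minISR of two can keep writing while one broker is restarted. The following is a minimal sketch; the topic name and bootstrap address are placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksAllProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "b-1.example.kafka.us-east-1.amazonaws.com:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Wait for all in-sync replicas to acknowledge each write.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("example-topic", "key", "value"));
            producer.flush();
        }
    }
}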

Include at least one broker from each AZ in the client connection string

The client uses a single broker's endpoint to bootstrap a connection to the cluster. During the initial client connection, the broker sends metadata with information about the brokers that the client must access.

If that broker becomes unavailable, then the connection fails. For example, if you have only one broker in a client's connection string and that broker is restarted during patching, then the client can't establish an initial connection with the cluster.

If you instead have multiple brokers in the client connection string, then the client can fail over when the broker that's used to establish the connection goes offline. For more information on how to set up a connection string with multiple brokers, see Getting the bootstrap brokers for an Amazon MSK cluster.
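
As a sketch, the following consumer configuration lists one broker from each of the three Availability Zones in bootstrap.servers, so the client can still bootstrap if any single broker is being restarted. The broker addresses, group ID, and topic name are placeholders; use the actual connection string that Amazon MSK provides for your cluster.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MultiBrokerBootstrapConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // One broker from each Availability Zone, so bootstrapping survives any single broker restart.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "b-1.example.kafka.us-east-1.amazonaws.com:9092,"
              + "b-2.example.kafka.us-east-1.amazonaws.com:9092,"
              + "b-3.example.kafka.us-east-1.amazonaws.com:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic"));
            consumer.poll(Duration.ofSeconds(1));
        }
    }
}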

Allow retries

When you reboot a broker, leader partitions on that broker become unavailable. As a result, Apache Kafka promotes replica partitions on other brokers to be the new leaders. The client then requests a metadata update to locate the new leader partitions. During this change, it's normal for your client to experience transient errors.

By default, clients have retries built in to handle these types of transient errors. Confirm that your client is configured for retries. For more information on configuring retries, see the Apache Kafka Documentation.
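
In the Java producer, for example, retry behavior is governed by the retries, retry.backoff.ms, and delivery.timeout.ms settings, and recent client versions already default retries to a very large value that's bounded by delivery.timeout.ms. The following sketch shows these settings explicitly; the bootstrap address and the timeout values are illustrative only, not recommendations from this article.

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class RetryAwareProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "b-1.example.kafka.us-east-1.amazonaws.com:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Keep retries enabled so transient leader changes during a rolling update are absorbed by the client.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        // Upper bound on the total time spent sending a record, including retries (illustrative value).
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000);
        // Wait between retry attempts (illustrative value).
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 500);
        // Prevent duplicates and preserve ordering when retries happen.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        return props;
    }
}

Enabling idempotence in this sketch is a design choice that keeps retried writes from producing duplicates or reordering records within a partition.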
