How do I troubleshoot issues when connecting to my Amazon MSK cluster?


I'm experiencing issues when I try to connect to my Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster.

Resolution

When you try to connect to an Amazon MSK cluster, you might get the following types of errors:

  • Errors that are not specific to the authentication type of the cluster
  • Errors that are specific to TLS client authentication
  • Errors that are specific to AWS Identity and Access Management (IAM) client authentication
  • Errors that are specific to Simple Authentication and Security Layer/Salted Challenge Response Mechanism (SASL/SCRAM) client authentication

Errors that are not related to a specific authentication type

When you try to connect to your Amazon MSK cluster, you might get one of the following errors irrespective of the authentication type enabled for your cluster.

java.lang.OutOfMemoryError: Java heap space

You get this error when you don't specify the client properties file when you run a command for cluster operations, regardless of the authentication type.

For example, you get the OutOfMemoryError when you run the following command against the IAM authentication port:

./kafka-topics.sh --create --bootstrap-server $BOOTSTRAP:9098 --replication-factor 3 --partitions 1 --topic TestTopic

However, the following command runs successfully against the IAM authentication port because it passes the client properties file:

./kafka-topics.sh --create --bootstrap-server $BOOTSTRAP:9098  --command-config client.properties --replication-factor 3 --partitions 1 --topic TestTopic

To resolve this error, be sure to include appropriate properties based on the type of authentication in the client.properties file.
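For IAM access control, the client.properties file typically looks like the following sketch. These properties assume that the aws-msk-iam-auth library JAR is on the client's classpath:

```properties
# client.properties for IAM access control (illustrative sketch)
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
```

For TLS or SASL/SCRAM authentication, the file instead carries the SSL or SCRAM properties that are shown later in this article.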

org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: createTopics

You typically get this error when there is a network misconfiguration between the client application and the Amazon MSK cluster.

To troubleshoot this issue, check the network connectivity by performing the following connectivity test.

Run the command from the client machine.

telnet bootstrap-broker port-number

Be sure to do the following:

  • Replace bootstrap-broker with one of the broker addresses from your Amazon MSK cluster.
  • Replace port-number with the appropriate port value based on the authentication that's turned on for your cluster.

If the client machine is able to access the brokers, then there are no connectivity issues. If not, review the network connectivity, especially the inbound and outbound rules for the security group.

org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [test_topic]

You get this error when you're using IAM authentication and your access policy blocks topic operations, such as WriteData and ReadData.

Note that permission boundaries and service control policies can also block a user who attempts to connect to the cluster without the required authorization.
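As an illustration, an IAM policy that allows a client to produce to and consume from a topic might look like the following sketch. The Region, account ID, cluster name, and topic name are placeholders, not values from your cluster:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kafka-cluster:Connect",
        "kafka-cluster:DescribeTopic",
        "kafka-cluster:WriteData",
        "kafka-cluster:ReadData"
      ],
      "Resource": [
        "arn:aws:kafka:us-east-1:111122223333:cluster/testcluster/*",
        "arn:aws:kafka:us-east-1:111122223333:topic/testcluster/*/test_topic"
      ]
    }
  ]
}
```

Consumers that use consumer groups also need group-level actions (for example, kafka-cluster:AlterGroup and kafka-cluster:DescribeGroup) on the matching group resource.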

If you're using non-IAM authentication, then check if you added topic level access control lists (ACLs) that block operations.

Run the following command to list the ACLs that are applied on a topic:

bin/kafka-acls.sh --bootstrap-server $BOOTSTRAP:PORT --command-config adminclient-configs.conf --list --topic testtopic

Connection to node -1 (b-1-testcluster.abc123.c7.kafka.us-east-1.amazonaws.com/3.11.111.123:9098) failed authentication due to: Client SASL mechanism 'SCRAM-SHA-512' not enabled in the server, enabled mechanisms are [AWS_MSK_IAM]

-or-

Connection to node -1 (b-1-testcluster.abc123.c7.kafka.us-east-1.amazonaws.com/3.11.111.123:9096) failed authentication due to: Client SASL mechanism 'AWS_MSK_IAM' not enabled in the server, enabled mechanisms are [SCRAM-SHA-512]

You get these errors when you're using a port number that doesn't match the SASL mechanism or protocol in the client properties file. This is the properties file that you used in the command to run cluster operations.

  • To communicate with brokers in a cluster that's set up to use SASL/SCRAM, use the following ports: 9096 for access from within AWS and 9196 for public access
  • To communicate with brokers in a cluster that's set up to use IAM access control, use the following ports: 9098 for access from within AWS and 9198 for public access
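For example, a client.properties file for SASL/SCRAM access over port 9096 might look like the following sketch. The truststore path, user name, and password are placeholders:

```properties
# client.properties for SASL/SCRAM (illustrative sketch)
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
ssl.truststore.location=/home/ec2-user/kafka.client.truststore.jks
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="msk-user" \
  password="msk-password";
```

To use IAM access control instead, the file must set sasl.mechanism=AWS_MSK_IAM and the command must target port 9098 (or 9198 for public access).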

Timed out waiting for connection while in state: CONNECTING

You might get this error when the client tries to connect to the cluster through the Apache ZooKeeper string, and the connection can't be established. This error might also result when the Apache ZooKeeper string is wrong.

You get the following error when you use the incorrect Apache ZooKeeper string to connect to the cluster:

./kafka-topics.sh --zookeeper z-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:2181,z-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:2181,z-3.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:2181 --list
[2020-04-10 23:58:47,963] WARN Client session timed out, have not heard from server in 10756ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)
[2020-04-10 23:58:58,581] WARN Client session timed out, have not heard from server in 10508ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)
[2020-04-10 23:59:08,689] WARN Client session timed out, have not heard from server in 10004ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)
Exception in thread "main" kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:259)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:253)
at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:255)
at kafka.zookeeper.ZooKeeperClient.<init>(ZooKeeperClient.scala:113)
at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1858)
at kafka.admin.TopicCommand$ZookeeperTopicService$.apply(TopicCommand.scala:321)
at kafka.admin.TopicCommand$.main(TopicCommand.scala:54)
at kafka.admin.TopicCommand.main(TopicCommand.scala)

To resolve this error, do the following:

  • Verify that the Apache ZooKeeper string used is correct.
  • Be sure that the security group for your Amazon MSK cluster allows inbound traffic from the client's security group on the Apache ZooKeeper ports.

Topic 'topicName' not present in metadata after 60000 ms. or Connection to node -<node-id> (<broker-host>/<broker-ip>:<port>) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

You might get this error under either of the following conditions:

  • The producer or consumer is unable to connect to the broker host and port.
  • The broker string is not valid.

If you get this error even though connectivity between the client and broker was working initially, then the broker might be down.

You get the following error when you try to access the cluster from outside the virtual private cloud (VPC) using the broker string for producing data:

./kafka-console-producer.sh --broker-list b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9092,b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9092 --topic test
[2020-04-10 23:51:57,668] ERROR Error when sending message to topic test with key: null, value: 1 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Topic test not present in metadata after 60000 ms.

You get the following error when you try to access the cluster from outside the VPC for consuming data using broker string:

./kafka-console-consumer.sh --bootstrap-server b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9092,b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9092 --topic test
[2020-04-11 00:03:21,157] WARN [Consumer clientId=consumer-console-consumer-88994-1, groupId=console-consumer-88994] Connection to node -1 (b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com/172.31.6.19:9092) could not be established. Broker may not
be available. (org.apache.kafka.clients.NetworkClient)
[2020-04-11 00:04:36,818] WARN [Consumer clientId=consumer-console-consumer-88994-1, groupId=console-consumer-88994] Connection to node -2 (b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com/172.31.44.252:9092) could not be established. Broker may
not be available. (org.apache.kafka.clients.NetworkClient)
[2020-04-11 00:05:53,228] WARN [Consumer clientId=consumer-console-consumer-88994-1, groupId=console-consumer-88994] Connection to node -1 (b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com/172.31.6.19:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

To troubleshoot these errors, do the following:

  • Be sure that the correct broker string and port are used.
  • If the error is caused by the broker being down, check the Amazon CloudWatch metric ActiveControllerCount to verify that the controller was active throughout the period. The value of this metric must be 1. Any other value might indicate that one of the brokers in the cluster is unavailable. Also, check the ZooKeeperSessionState metric to confirm that the brokers were constantly communicating with the Apache ZooKeeper nodes. To understand why a broker failed, view the KafkaDataLogsDiskUsed metric and check whether the broker ran out of storage space. For more information on Amazon MSK metrics and the expected values, see Amazon MSK metrics for monitoring with CloudWatch.
  • Be sure that the error is not caused by the network configuration. Amazon MSK resources are provisioned within the VPC. Therefore, by default, clients are expected to connect to the Amazon MSK cluster or produce and consume from the cluster over a private network in the same VPC. If you access the cluster from outside the VPC, then you might get these errors. For information on troubleshooting errors when the client is in the same VPC as the cluster, see Unable to access cluster from within AWS: networking issues. For information on accessing the cluster from outside the VPC, see How do I connect to my Amazon MSK cluster outside of the VPC?

Errors that are specific to TLS client authentication

You might get the following errors when you try to connect to a cluster that has TLS client authentication enabled. These errors might be caused by issues with the SSL-related configuration.

Bootstrap broker <broker-host>:9094 (id: -<broker-id> rack: null) disconnected

You might get this error when the producer or consumer tries to connect to a TLS-encrypted cluster over TLS port 9094 without passing the SSL configuration.

You might get the following error when the producer tries to connect to the cluster:

./kafka-console-producer.sh --broker-list b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094,b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 --topic test
[2020-04-10 18:57:58,019] WARN [Producer clientId=console-producer] Bootstrap broker b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -2 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)
[2020-04-10 18:57:58,342] WARN [Producer clientId=console-producer] Bootstrap broker b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -2 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)
[2020-04-10 18:57:58,666] WARN [Producer clientId=console-producer] Bootstrap broker b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -1 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)

You might get the following error when the consumer tries to connect to the cluster:

./kafka-console-consumer.sh --bootstrap-server b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094,b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 --topic test
[2020-04-10 19:09:03,277] WARN [Consumer clientId=consumer-console-consumer-79102-1, groupId=console-consumer-79102] Bootstrap broker b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -1 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)
[2020-04-10 19:09:03,596] WARN [Consumer clientId=consumer-console-consumer-79102-1, groupId=console-consumer-79102] Bootstrap broker b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -1 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)
[2020-04-10 19:09:03,918] WARN [Consumer clientId=consumer-console-consumer-79102-1, groupId=console-consumer-79102] Bootstrap broker b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -2 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)

To resolve this error, set up the SSL configuration. For more information, see How do I get started with encryption?

If client authentication is enabled for your cluster, then you must add additional parameters related to your ACM Private CA certificate. For more information, see Mutual TLS authentication.
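A client.properties (or ssl.config) file for mutual TLS authentication might look like the following sketch. The file paths and passwords are placeholders:

```properties
# SSL configuration for mutual TLS (illustrative sketch)
security.protocol=SSL
ssl.truststore.location=/home/ec2-user/certs/kafka.client.truststore.jks
ssl.keystore.location=/home/ec2-user/certs/kafka.client.keystore.jks
ssl.keystore.password=your-keystore-password
ssl.key.password=your-key-password
```

If your cluster uses TLS encryption without client authentication, then only security.protocol and the truststore properties are required.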

ERROR Modification time of key store could not be obtained: <configure-path-to-truststore>

-or-

Failed to load keystore

If there is an issue with the truststore configuration, then this error can occur when the truststore files are loaded for the producer and consumer. You might see information similar to the following in the logs:

./kafka-console-consumer --bootstrap-server b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094,b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 --topic test --consumer.config /home/ec2-user/ssl.config
[2020-04-11 10:39:12,194] ERROR Modification time of key store could not be obtained: /home/ec2-ser/certs/kafka.client.truststore.jks (org.apache.kafka.common.security.ssl.SslEngineBuilder)
java.nio.file.NoSuchFileException: /home/ec2-ser/certs/kafka.client.truststore.jks
[2020-04-11 10:39:12,253] ERROR Unknown error when running consumer: (kafka.tools.ConsoleConsumer$)
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: Failed to load SSL keystore /home/ec2-ser/certs/kafka.client.truststore.jks of type JKS

In this case, the logs indicate an issue with loading the truststore file because the path to the truststore file is incorrectly configured in the SSL configuration. To resolve this error, provide the correct path to the truststore file in the SSL configuration.

This error might also occur due to the following conditions:

  • Your truststore or key store file is corrupted.
  • The password of the truststore file is incorrect.

Error when sending message to topic test with key: null, value: 0 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)

org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed

-or-

Connection to node -<broker-id> (<broker-hostname>/<broker-hostname>:9094) failed authentication due to: SSL handshake failed (org.apache.kafka.clients.NetworkClient)

You might get the following error when there is an issue with the key store configuration of the producer leading to the authentication failure:

./kafka-console-producer --broker-list b-2.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094,b-1.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094,b-4.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094 --topic example --producer.config /home/ec2-user/ssl.config
[2020-04-11 11:13:19,286] ERROR [Producer clientId=console-producer] Connection to node -3 (b-4.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com/172.31.6.195:9094) failed authentication due to: SSL handshake failed (org.apache.kafka.clients.NetworkClient)

You might get the following error when there is an issue with the key store configuration of the consumer leading to the authentication failure:

./kafka-console-consumer --bootstrap-server b-2.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094,b-1.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094,b-4.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094 --topic example --consumer.config /home/ec2-user/ssl.config
[2020-04-11 11:14:46,958] ERROR [Consumer clientId=consumer-1, groupId=console-consumer-46876] Connection to node -1 (b-2.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com/172.31.15.140:9094) failed authentication due to: SSL handshake failed (org.apache.kafka.clients.NetworkClient)
[2020-04-11 11:14:46,961] ERROR Error processing message, terminating consumer process: (kafka.tools.ConsoleConsumer$)
org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed

To resolve this error, be sure that the key store configuration is correct.

java.io.IOException: keystore password was incorrect

You might get this error when the password for the key store or truststore is incorrect.

To troubleshoot this error, do the following:

Check whether the key store or truststore password is correct by running the following command:

keytool -list -keystore kafka.client.keystore.jks
Enter keystore password:
Keystore type: PKCS12
Keystore provider: SUN
Your keystore contains 1 entry
schema-reg, Jan 15, 2020, PrivateKeyEntry,
Certificate fingerprint (SHA1): 4A:F3:2C:6A:5D:50:87:3A:37:6C:94:5E:05:22:5A:1A:D5:8B:95:ED

If the password for the key store or truststore is incorrect, then you might see the following error:

keytool -list -keystore kafka.client.keystore.jks
Enter keystore password:
keytool error: java.io.IOException: keystore password was incorrect

You can view the verbose output of the preceding command by adding the -v flag:

keytool -list -v -keystore kafka.client.keystore.jks

You can also use these commands to check if the key store is corrupted.

You might also get this error when the secret key associated with the alias is incorrectly configured in the SSL configuration of the producer and consumer. To verify this root cause, run the following command:

keytool -keypasswd -alias schema-reg -keystore kafka.client.keystore.jks
Enter keystore password:
Enter key password for <schema-reg>
New key password for <schema-reg>:
Re-enter new key password for <schema-reg>:

If your password for the secret key of the alias (for example, schema-reg) is correct, then the command asks you to enter a new password for the secret key. Otherwise, the command fails with the following message:

keytool -keypasswd -alias schema-reg -keystore kafka.client.keystore.jks
Enter keystore password:
Enter key password for <schema-reg>
keytool error: java.security.UnrecoverableKeyException: Get Key failed: Given final block not properly padded. Such issues can arise if a bad key is used during decryption.

You can also verify if a particular alias is part of the key store by running the following command:

keytool -list -keystore kafka.client.keystore.jks -alias schema-reg
Enter keystore password:
schema-reg, Jan 15, 2020, PrivateKeyEntry,
Certificate fingerprint (SHA1): 4A:F3:2C:6A:5D:50:87:3A:37:6C:94:5E:05:22:5A:1A:D5:8B:95:ED

Errors that are specific to IAM client authentication

Connection to node -1 (b-1.testcluster.abc123.c2.kafka.us-east-1.amazonaws.com/10.11.111.123:9098) failed authentication due to: Access denied

-or-

org.apache.kafka.common.errors.SaslAuthenticationException: Access denied

To resolve these errors, be sure that the IAM role that accesses the Amazon MSK cluster allows the cluster operations that are mentioned in IAM access control.

In addition to access policies, permission boundaries and service control policies can also block a user who attempts to connect to the cluster without the required authorization.

org.apache.kafka.common.errors.SaslAuthenticationException: Too many connects

-or-

org.apache.kafka.common.errors.SaslAuthenticationException: Internal error

You get these errors when your cluster is running on the kafka.t3.small broker type with IAM access control and you exceed the connection limit. The kafka.t3.small instance type accepts only one TCP connection per broker per second. When this connection limit is exceeded, connection creation fails and you get one of these errors, which can be misread as an invalid-credentials problem. For more information, see How Amazon MSK works with IAM.

To resolve this error, consider doing the following:

  • Switch to a larger broker type, which isn't subject to this connection limit.
  • Increase the reconnect.backoff.ms parameter in the client configuration so that the client doesn't attempt more than one connection per broker per second.
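As an illustrative sketch, the client-side backoff can be raised in the client properties file. The value of 1000 ms is an assumption sized to the one-connection-per-second limit, not a tuned recommendation:

```properties
# Delay between reconnection attempts, in milliseconds (illustrative sketch)
reconnect.backoff.ms=1000
```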

Errors that are specific to SASL/SCRAM client authentication

Connection to node -1 (b-3.testcluster.abc123.c2.kafka.us-east-1.amazonaws.com/10.11.111.123:9096) failed authentication due to: Authentication failed during authentication due to invalid credentials with SASL mechanism SCRAM-SHA-512

  • Be sure that you stored the user credentials in AWS Secrets Manager and associated these credentials with the Amazon MSK cluster.
  • When you access the cluster over port 9096, be sure that the user name and password in AWS Secrets Manager are the same as those in the client properties file.
  • When you retrieve the secrets using the get-secret-value API, be sure that the password in AWS Secrets Manager doesn't contain any special characters, such as (/]).
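The secret that's associated with the cluster is expected to be a plain key/value pair similar to the following sketch. The user name and password are placeholders, and the secret's name must begin with the AmazonMSK_ prefix:

```json
{
  "username": "msk-user",
  "password": "msk-password"
}
```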

org.apache.kafka.common.errors.ClusterAuthorizationException: Request Request(processor=11, connectionId=INTERNAL_IP-INTERNAL_IP-0, session=Session(User:ANONYMOUS,/INTERNAL_IP), listenerName=ListenerName(REPLICATION_SECURE), securityProtocol=SSL, buffer=null) is not authorized

You get this error when both the following conditions are true:

  • You turned on SASL/SCRAM authentication for your Amazon MSK cluster.
  • You've set resourceType=CLUSTER and operation=CLUSTER_ACTION in the ACLs for your cluster.

The Amazon MSK cluster doesn't support this setting because it prevents internal Apache Kafka replication. With this setting, the identity of the brokers appears as ANONYMOUS for inter-broker communication. If you need your cluster to support these ACLs while using SASL/SCRAM authentication, then you must grant permissions for ALL operations to the ANONYMOUS user so that replication between the brokers isn't restricted.

Run the following command to grant this permission to the ANONYMOUS user:

./kafka-acls.sh --authorizer-properties zookeeper.connect=example-ZookeeperConnectString --add --allow-principal User:ANONYMOUS --operation ALL --cluster

Related information

Connecting to an Amazon MSK cluster

How do I troubleshoot common issues when using my Amazon MSK cluster with SASL/SCRAM authentication?

AWS OFFICIAL · Updated a year ago

Comments

Looks like "org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: createTopics" is a bit more general than just network connectivity.
After getting errors because I was using a SASL mechanism of "PLAIN" (which proves I had network line of sight), I changed to "SCRAM-SHA-512" and then got the Timeout exception. Turns out I was still using "org.apache.kafka.common.security.plain.PlainLoginModule" left over from a previous cluster connection. Changing to "org.apache.kafka.common.security.scram.ScramLoginModule" fixed it all immediately.

replied 9 months ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS MODERATOR · replied 9 months ago