Regular 'unable to complete the operation against any hosts' errors


We are using a Python 3 Lambda function to write fairly high volumes of data to Amazon Keyspaces (roughly 10,000 writes/s).

I have tried Keyspaces in both provisioned mode (10,000 reads/writes provisioned) and on-demand mode. Each time, after about 5 minutes of running, I very regularly get the error:
'unable to complete the operation against any hosts', <error from server: code=0000 [server error] message="internal server error">

The Lambda function is triggered by SQS. Since we need very high concurrency in Lambda, these constant errors mean we are unable to scale up to the volumes needed.

The behavior is very strange: the errors are not constant. As mentioned, when it starts I get about 5 minutes with no errors, then it starts failing everything, then some requests succeed, then it may go back to no errors for a few more minutes.

I am confident we are not being throttled, as throttling would show up in the CloudWatch metrics. Another post mentioned looking at CloudTrail, but that doesn't seem to log UPDATE events.

Is there any way to debug this further, as I don't believe there is any server-side logging available?

Edited by: thewire24717 on Mar 15, 2021 9:53 AM

asked 3 years ago · 796 views
2 Answers

Our problem was that failed Cassandra sessions were persisted between Lambda invocations, causing cascading failures. Adding exception handling to clear the session on a failure has helped with this.
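For anyone hitting the same thing, the pattern roughly looks like this. It is only a minimal sketch assuming a module-level session that is cached across warm invocations; the endpoint, credentials, keyspace/table names and the get_session()/reset_session() helpers are illustrative, not from the original post.

# Sketch: cache one Keyspaces session across warm Lambda invocations and
# discard it when a request fails, so the next invocation reconnects cleanly.
from ssl import SSLContext, PROTOCOL_TLSv1_2, CERT_REQUIRED
from cassandra import ConsistencyLevel
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

_session = None  # lives as long as the Lambda execution environment stays warm

def get_session():
    """Create the session once and reuse it on later invocations."""
    global _session
    if _session is None:
        ssl_context = SSLContext(PROTOCOL_TLSv1_2)
        ssl_context.load_verify_locations('sf-class2-root.crt')
        ssl_context.verify_mode = CERT_REQUIRED
        profile = ExecutionProfile(consistency_level=ConsistencyLevel.LOCAL_QUORUM)
        cluster = Cluster(['cassandra.us-east-2.amazonaws.com'], port=9142,
                          ssl_context=ssl_context,
                          auth_provider=PlainTextAuthProvider(username='xxxx', password='xxxx'),
                          execution_profiles={EXEC_PROFILE_DEFAULT: profile})
        _session = cluster.connect()
    return _session

def reset_session():
    """Throw away a (possibly broken) cached session so the next call rebuilds it."""
    global _session
    if _session is not None:
        _session.cluster.shutdown()
        _session = None

def lambda_handler(event, context):
    session = get_session()
    try:
        for record in event['Records']:  # SQS batch
            # my_keyspace.my_table and its columns are placeholders
            session.execute(
                'INSERT INTO my_keyspace.my_table (id, payload) VALUES (%s, %s)',
                (record['messageId'], record['body']))
    except Exception:
        # Don't let a broken session leak into the next invocation;
        # clear it and re-raise so SQS retries the batch.
        reset_session()
        raise

The key point is the except branch: without it, a warm container keeps reusing a session whose connections have already failed, which is what produced the cascading 'unable to complete the operation against any hosts' errors for us.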

answered 3 years ago

In Keyspaces, getting a ServerError usually indicates a transient service error.
https://docs.aws.amazon.com/keyspaces/latest/devguide/metrics-dimensions.html

You can set up Keyspace & Table Metrics for Amazon Keyspaces using https://github.com/aws-samples/amazon-keyspaces-cloudwatch-cloudformation-templates

Some of the metrics include consumed and provisioned capacity per second, number of CQL requests per second, average latency per second, user errors, system errors, and current account quotas. These statistics are kept for 15 months, so you can access historical information and gain a better perspective on how your application or service is performing.
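If you just want a quick look at whether system errors line up with the failures you are seeing, you can also pull the metric directly. Here is a minimal sketch with boto3, assuming the AWS/Cassandra namespace and the Keyspace/TableName dimensions that Keyspaces publishes; the keyspace and table names are placeholders.

# Sketch: sum the SystemErrors metric for one Keyspaces table over the last hour.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch')
now = datetime.now(timezone.utc)

response = cloudwatch.get_metric_statistics(
    Namespace='AWS/Cassandra',            # Amazon Keyspaces metric namespace
    MetricName='SystemErrors',
    Dimensions=[
        {'Name': 'Keyspace', 'Value': 'my_keyspace'},   # placeholder
        {'Name': 'TableName', 'Value': 'my_table'},     # placeholder
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=60,                            # 1-minute buckets
    Statistics=['Sum'],
)

for point in sorted(response['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], int(point['Sum']))

Comparing that against the throttling and user-error metrics over the same window helps confirm whether the failures really are server-side rather than throttling.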

In distributed systems it's common to see transient failures. The driver's default retry policy will try the "next host"; with Keyspaces it's best to retry against the same host. Here is a sample retry policy that should help:

 
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from ssl import SSLContext, PROTOCOL_TLSv1_2, CERT_REQUIRED
from cassandra.auth import PlainTextAuthProvider
from cassandra import ConsistencyLevel
from cassandra.policies import RetryPolicy

import logging

logging.basicConfig(format='%(asctime)s - %(message)s', datefmt='%d-%b-%y %H:%M:%S', level=logging.INFO)


class KeyspacesRetryPolicy(RetryPolicy):
    """Retry against the same coordinator instead of moving on to the next host."""

    def __init__(self, RETRY_MAX_ATTEMPTS=3):
        self.RETRY_MAX_ATTEMPTS = RETRY_MAX_ATTEMPTS

    def on_read_timeout(self, query, consistency, required_responses, received_responses, data_retrieved, retry_num):
        if retry_num <= self.RETRY_MAX_ATTEMPTS:
            return self.RETRY, consistency
        else:
            return self.RETHROW, None

    def on_write_timeout(self, query, consistency, write_type, required_responses, received_responses, retry_num):
        if retry_num <= self.RETRY_MAX_ATTEMPTS:
            return self.RETRY, consistency
        else:
            return self.RETHROW, None

    def on_unavailable(self, query, consistency, required_replicas, alive_replicas, retry_num):
        if retry_num <= self.RETRY_MAX_ATTEMPTS:
            return self.RETRY, consistency
        else:
            return self.RETHROW, None

    def on_request_error(self, query, consistency, error, retry_num):
        if retry_num <= self.RETRY_MAX_ATTEMPTS:
            return self.RETRY, consistency
        else:
            return self.RETHROW, None


# Amazon Keyspaces requires TLS; the Starfield root certificate must be available locally.
ssl_context = SSLContext(PROTOCOL_TLSv1_2)
ssl_context.load_verify_locations('sf-class2-root.crt')
ssl_context.verify_mode = CERT_REQUIRED

# Service-specific credentials for Amazon Keyspaces
auth_provider = PlainTextAuthProvider(username='keyspace_user+', password='xxxxx')
hosts = ['cassandra.us-east-2.amazonaws.com']

profile = ExecutionProfile(
    # load_balancing_policy=WhiteListRoundRobinPolicy(['cassandra.us-east-2.amazonaws.com']),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,  # Keyspaces writes require LOCAL_QUORUM
    retry_policy=KeyspacesRetryPolicy(RETRY_MAX_ATTEMPTS=5)
)

cluster = Cluster(hosts, ssl_context=ssl_context, auth_provider=auth_provider, port=9142,
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()
r = session.execute('select * from system_schema.keyspaces')
print(r.current_rows)
answered 3 years ago
