Keyspaces: Why do I get WriteThrottleEvents when there is plenty of write capacity available?

When importing data into Amazon Keyspaces with 40,000 provisioned WCUs, I can only write a few thousand records per second. The graphs show that only a small fraction of the available capacity is being used, yet WriteThrottleEvents are through the roof. What causes writes to be throttled when the provisioned capacity is barely used? Every error returned to the client is a WriteTimeout.

asked 2 years ago, 1451 views
2 Answers

Hello,

Thank you for using Amazon Keyspaces (for Apache Cassandra).

In Amazon Keyspaces (for Apache Cassandra), a throttling exception on the table is translated into a timeout exception on the client side. As in DynamoDB, data in Keyspaces is distributed across multiple partitions at the backend, and each partition has a throughput limit of 3,000 RCU or 1,000 WCU per second. Throttling on a Keyspaces table occurs for one of the following reasons:

  1. For a table in provisioned capacity mode: exceeding the provisioned capacity allocated to the table, or exceeding the partition-level throughput limit.
  2. For a table in on-demand capacity mode: consuming more than double the table's previous traffic peak, or exceeding the partition-level throughput limit.
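
To see why a table with 40,000 provisioned WCUs can still throttle, a rough back-of-the-envelope model may help. The sketch below is a simplification, not the service's actual accounting: it assumes each write consumes 1 WCU and that the load is spread evenly over however many storage partitions are being written to at that moment. The function name and its parameters are hypothetical.

```python
PER_PARTITION_WCU = 1000  # per-partition write limit quoted above


def write_ceiling(provisioned_wcu: int, active_partitions: int,
                  per_partition_wcu: int = PER_PARTITION_WCU) -> int:
    """Rough upper bound on writes per second before throttling starts.

    Simplified model (an assumption): 1 WCU per write, and the load is
    spread evenly over `active_partitions` storage partitions.
    """
    if provisioned_wcu < 0 or active_partitions < 0:
        raise ValueError("capacity and partition count must be non-negative")
    return min(provisioned_wcu, active_partitions * per_partition_wcu)


# 40,000 WCU provisioned, but the import only touches 3 partitions at a time.
print(write_ceiling(40_000, 3))   # 3000
# The same table with writes spread over 50 partitions.
print(write_ceiling(40_000, 50))  # 40000
```

Under this model, an import that only reaches a handful of partitions at a time tops out at a few thousand writes per second regardless of the table-level setting, which matches the symptom described in the question.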

Resolution:

  1. If the throttling is due to exceeding the table’s provisioned throughput, increase the provisioned capacity to match your table’s traffic, or enable autoscaling on the table.
  2. If the throttling is due to exceeding the partition-level throughput limit, then:
    • Increasing the provisioned capacity won’t help.
    • Review your access patterns and distribute read and write operations evenly across the backend partitions of your table, so that no partition exceeds the partition-level limit of 3,000 RCU or 1,000 WCU per second.
    • Implement retries with exponential backoff. If you are not using the AWS SDK, you will have to implement a retry strategy yourself that retries throttled requests multiple times (10 times) with exponential backoff.
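
The retry advice above can be sketched roughly as follows. This is a minimal illustration, not the AWS SDK's or any Cassandra driver's retry logic: `retry_with_backoff` and its parameters are hypothetical names, Python's built-in `TimeoutError` stands in for whatever write-timeout exception your driver raises, and the "full jitter" randomization is one common variant of exponential backoff chosen here as an assumption.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=10, base_delay=0.05,
                       max_delay=5.0, sleep=time.sleep, rng=random.random):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    Between attempts, waits a random time between 0 and
    min(max_delay, base_delay * 2**attempt) seconds.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last timeout
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)


# Usage: a stand-in write that times out twice before succeeding.
calls = {"n": 0}


def flaky_write():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise TimeoutError("simulated throttled write")
    return "ok"


# sleep is stubbed out so the example finishes instantly.
result = retry_with_backoff(flaky_write, sleep=lambda seconds: None)
print(result, "after", calls["n"], "attempts")
```

Passing `sleep` and `rng` as parameters keeps the helper easy to test. Retries only smooth over short bursts; they do not raise the per-partition limit.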

If you have further concerns, we recommend opening a case with Premium Support, because we would need details that are non-public information. Please open a support case with AWS using the following link

Thank you,
Kartik_R

AWS
SUPPORT ENGINEER
answered 2 years ago

It looks like you have a hot key access pattern. This shows up as the "Storage Partition Throughput exception" in CloudWatch. I recommend the following template to help visualize capacity and errors for Amazon Keyspaces.

The service passes these events back to the client as timeout exceptions. You can retry them in your application or with a retry policy in the Cassandra driver. You can use the following policies to help implement exponential backoff.

First, determine whether the hot key is a single row or a single partition. Data exported from Apache Cassandra is often ordered by partition key, so it is best to randomize it before loading. This spreads write access evenly across all of the table's resources. For importing or exporting data, I recommend AWS Glue. The following example bulk loads data from Amazon S3; note that the script shuffles the data.

import org.apache.spark.sql.functions.rand

// Read the exported data from Amazon S3.
val orderedData = sparkSession.read.format(backupFormat).load(s3bucketBackupsLocation)

// Randomize the row order so that writes are spread across partitions.
val shuffledData = orderedData.orderBy(rand())

// Append the shuffled rows to the Keyspaces table.
shuffledData.write.format("org.apache.spark.sql.cassandra").mode("append").option("keyspace", keyspaceName).option("table", tableName).save()
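
To illustrate why the shuffle matters, here is a small standalone simulation in plain Python (no Spark). It is only a sketch under stated assumptions: the data set is made up (20 partitions of 500 rows each), one block of 500 consecutive writes stands in for one second of load, and `peak_writes_per_partition` is a hypothetical helper, not part of any library.

```python
import random
from collections import Counter


def peak_writes_per_partition(partition_keys, window):
    """Largest number of writes that hit a single partition key inside
    any of the non-overlapping blocks of `window` consecutive writes."""
    peak = 0
    for start in range(0, len(partition_keys), window):
        counts = Counter(partition_keys[start:start + window])
        peak = max(peak, max(counts.values()))
    return peak


# Hypothetical export: 20 partitions with 500 rows each, sorted by key.
ordered = [pk for pk in range(20) for _ in range(500)]

shuffled = ordered[:]
random.Random(42).shuffle(shuffled)  # fixed seed so the run is repeatable

print("ordered :", peak_writes_per_partition(ordered, window=500))
print("shuffled:", peak_writes_per_partition(shuffled, window=500))
```

With the ordered input, every block of 500 writes lands on a single partition key, so the peak is 500. After shuffling, each block is spread over all 20 keys and the peak per key is far lower, which is the effect the `orderBy(rand())` step aims for.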

In Keyspaces, rows are distributed over many "physical storage partitions". Rows that belong to the same logical partition are likely to be grouped together on the same physical storage partition. Each physical storage partition can serve up to 1,000 write request units per second or 3,000 read request units per second. Over time, the service spreads rows over more physical partitions in a process called adaptive capacity: the rows on one physical partition are split at the key range where access is highest and moved to two separate physical partitions. As a result, access to the same key range gets twice the throughput.

answered a year ago
