What metrics can I use to monitor and troubleshoot Kinesis Data Streams issues?

I want to monitor incoming and outgoing data for Amazon Kinesis Data Streams.

Resolution

Use stream-level metrics

You can use Amazon CloudWatch metrics to continuously monitor the performance of your Amazon Kinesis data stream and its throughput. The following metrics can help you monitor producer and consumer issues.

GetRecords.IteratorAgeMilliseconds
GetRecords.IteratorAgeMilliseconds measures the age in milliseconds of the last record in the stream for all GetRecords requests. A value of zero indicates that the records being read are current within the stream, so a lower value is preferred. A rising value means that consumers are falling behind. To reduce the delay in processing records, increase the number of consumers for your stream or optimize your application code so that the data is processed more quickly.

ReadProvisionedThroughputExceeded
ReadProvisionedThroughputExceeded measures the count of GetRecords calls that are throttled during a given period because they exceed the service or shard limits for Kinesis Data Streams. A value of zero indicates that the data consumers aren't exceeding service quotas. Any other value indicates that the throughput limit is exceeded and that the stream might require additional shards. Use this metric to confirm that consumers make no more than five reads/second/shard and read no more than 2 MB/second/shard. You can turn on enhanced monitoring to check for hot shards in the stream.
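As a rough sizing aid, the per-shard read quotas above can be turned into a minimum shard count. The following is a minimal sketch, not an official formula: the function name and the sample workload are hypothetical, reads are assumed to be spread evenly across shards, and 1 MB is assumed to mean 1024 * 1024 bytes.

```python
import math

# Per-shard read quotas quoted in this article.
MAX_READS_PER_SECOND_PER_SHARD = 5
# Assumption: 1 MB is treated as 1024 * 1024 bytes.
MAX_READ_BYTES_PER_SECOND_PER_SHARD = 2 * 1024 * 1024


def min_shards_for_reads(reads_per_second, read_bytes_per_second):
    """Smallest shard count that keeps evenly spread reads within both quotas."""
    by_calls = math.ceil(reads_per_second / MAX_READS_PER_SECOND_PER_SHARD)
    by_bytes = math.ceil(read_bytes_per_second / MAX_READ_BYTES_PER_SECOND_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_calls, by_bytes)


if __name__ == "__main__":
    # Hypothetical workload: 12 GetRecords calls/second reading 3 MB/second in total.
    print(min_shards_for_reads(12, 3 * 1024 * 1024))
```

If the result is larger than the current shard count, sustained throttling on reads is likely even when the load is perfectly balanced.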

WriteProvisionedThroughputExceeded
WriteProvisionedThroughputExceeded is the write-side counterpart of ReadProvisionedThroughputExceeded. It counts the records that are rejected because of throttling when data producers write to the stream. A value other than zero indicates that PUT requests exceed the service quotas for writing into a shard. Be sure that the PUT requests don't exceed 1 MB/second/shard or 1,000 records/second/shard. Be sure that partition keys are evenly distributed, and turn on enhanced monitoring to check for hot shards in the stream. Depending on shard saturation, update the shard count in the stream to allow for increased throughput.
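To get a feel for whether a set of partition keys is evenly distributed, you can simulate how keys map to shards offline. Kinesis Data Streams maps each partition key to a 128-bit hash key with MD5. The sketch below assumes that the shards split the hash key space into equal ranges, which might not hold after resharding; the function names and sample keys are hypothetical.

```python
import hashlib
from collections import Counter

HASH_KEY_SPACE = 2 ** 128  # MD5 yields a 128-bit hash key


def shard_index_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming equal-sized hash key ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_KEY_SPACE


def key_distribution(partition_keys, shard_count):
    """Return a list with the number of keys that land on each simulated shard."""
    counts = Counter(shard_index_for_key(key, shard_count) for key in partition_keys)
    return [counts.get(index, 0) for index in range(shard_count)]


if __name__ == "__main__":
    # Hypothetical keys: many distinct device IDs versus one constant key.
    print(key_distribution([f"device-{i}" for i in range(1000)], 4))
    print(key_distribution(["constant-key"] * 1000, 4))
```

A constant or low-cardinality key sends every record to the same shard in this simulation, which is the pattern that shows up as a hot shard in shard-level metrics.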

PutRecord.Success and PutRecords.Success
PutRecord.Success and PutRecords.Success measure the count of successful PutRecord and PutRecords requests that data producers make to the stream over a given period. Use these metrics to confirm that your retry logic handles failed records effectively.
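A PutRecords call can partially fail: the response reports FailedRecordCount, and each entry in its Records list lines up, in order, with the record at the same position in the request. Failed entries carry an ErrorCode instead of a SequenceNumber. The following sketch shows one way to pick out the records to resend. The helper name and the sample request and response are hypothetical, hand-written data, not output from a real stream.

```python
def records_to_retry(request_records, response):
    """Return the request entries whose matching response entry has an ErrorCode."""
    if response.get("FailedRecordCount", 0) == 0:
        return []
    return [
        record
        for record, result in zip(request_records, response.get("Records", []))
        if "ErrorCode" in result
    ]


if __name__ == "__main__":
    # Hand-written sample data shaped like a PutRecords request and response.
    sample_request = [
        {"Data": b"a", "PartitionKey": "k1"},
        {"Data": b"b", "PartitionKey": "k2"},
        {"Data": b"c", "PartitionKey": "k3"},
    ]
    sample_response = {
        "FailedRecordCount": 1,
        "Records": [
            {"SequenceNumber": "1", "ShardId": "shardId-000000000000"},
            {"ErrorCode": "ProvisionedThroughputExceededException",
             "ErrorMessage": "Rate exceeded"},
            {"SequenceNumber": "2", "ShardId": "shardId-000000000001"},
        ],
    }
    print(records_to_retry(sample_request, sample_response))
```

Resending only the failed entries, ideally with a delay between attempts, avoids writing duplicates of records that already succeeded.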

GetRecords.Success
GetRecords.Success measures the count of successful GetRecords requests for a given time period in the stream. Use it to confirm that your retry logic handles failed requests effectively.

GetRecords.Latency
GetRecords.Latency measures the time taken for each GetRecords operation on the stream over a specified time period. If the value is high, confirm that your consumers have sufficient physical resources and that the record processing logic can keep up with increased stream throughput. You can also process larger batches of data to reduce network and other downstream latencies in your application. For the Kinesis Client Library (KCL), investigate the ProcessTask.Time metric to monitor the processing time of an application that is falling behind. Also confirm that the IDLE_TIME_BETWEEN_READS_IN_MILLIS setting is low enough for the application to keep up with stream processing.

PutRecords.Latency
PutRecords.Latency measures the time taken for each PutRecords operation on the stream over a specified time period. If the PutRecords.Latency value is high, aggregate records into a larger file to put batch data into the Kinesis data stream. You can also use multiple threads to write data. Throttling and retry logic on the PutRecords API can also increase the time taken for each PutRecords operation on the stream.

For the metrics listed above, use the Average statistic to monitor the performance and throughput of the stream.

Note: For GetRecords.IteratorAgeMilliseconds, use the Maximum statistic instead to reduce the risk of data loss for consumers that lag behind. Configure a CloudWatch alarm on the metric so that you're notified when its data points breach a threshold that you choose. For more information about CloudWatch alarms, see Using Amazon CloudWatch alarms.
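The alarm described in the note can be sketched as a set of parameters for the CloudWatch PutMetricAlarm API. This is a minimal example under stated assumptions: the function name, alarm name pattern, stream name, threshold, period, and evaluation period count are all hypothetical values to adapt to your stream's retention period and latency tolerance.

```python
def iterator_age_alarm_params(stream_name, threshold_ms,
                              period_seconds=60, evaluation_periods=5):
    """Build PutMetricAlarm parameters for a stream-level iterator age alarm."""
    return {
        # Hypothetical naming convention for the alarm.
        "AlarmName": f"{stream_name}-iterator-age-high",
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        # Maximum, not Average, so that the most delayed reads trigger the alarm.
        "Statistic": "Maximum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
    }


if __name__ == "__main__":
    # Hypothetical stream name and a one-hour threshold.
    params = iterator_age_alarm_params("example-stream", 60 * 60 * 1000)
    print(params)
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. Add an AlarmActions entry if you want the alarm to notify an Amazon SNS topic.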

If you use the enhanced fan-out feature, use the following metrics to monitor Kinesis Data Streams:

SubscribeToShard.RateExceeded: Emitted when a subscription attempt fails because the number of calls per second that are allowed for the operation is exceeded, or because an active subscription already exists.

SubscribeToShard.Success: Verifies whether the SubscribeToShard operation succeeds.

SubscribeToShardEvent.Success: Verifies the successful publication of an event for an active subscription.

SubscribeToShardEvent.Bytes: Measures the number of bytes received in the shards over the specified time period.

SubscribeToShardEvent.Records: Measures the number of records received in the shards over the specified time period.

SubscribeToShardEvent.MillisBehindLatest: Measures the difference between the current time and the time when the last record of the SubscribeToShard event was written to the stream.

Turn on enhanced shard-level metrics

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshoot AWS CLI errors. Also, make sure that you're using the most recent AWS CLI version.

Turn on shard-level metrics in CloudWatch to monitor specific tasks and to troubleshoot data producers and consumers. For example, turn on shard-level metrics to help you identify issues such as uneven workload distributions.

Note: You can also use the EnableEnhancedMonitoring API request or the enable-enhanced-monitoring AWS CLI command.

To turn on enhanced monitoring in the console, complete the following steps:

  1. Open the Kinesis console.
  2. Choose a specific Region.
  3. From the navigation pane, choose Data Streams.
  4. Under Data Stream Name, select your Kinesis data stream.
  5. Choose Configuration.
  6. Under Enhanced (shard-level) metrics, choose Edit.
  7. From the dropdown menu, select your metrics for enhanced monitoring.
  8. Choose Save Changes.
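As an alternative to the console steps above, the EnableEnhancedMonitoring request can be prepared programmatically. The following sketch only builds and validates the request parameters; the function name and stream name are hypothetical, and the set of accepted metric names reflects the shard-level metrics that the API documents, so verify it against the current API reference.

```python
# Shard-level metric names accepted by EnableEnhancedMonitoring.
VALID_SHARD_LEVEL_METRICS = {
    "IncomingBytes",
    "IncomingRecords",
    "OutgoingBytes",
    "OutgoingRecords",
    "WriteProvisionedThroughputExceeded",
    "ReadProvisionedThroughputExceeded",
    "IteratorAgeMilliseconds",
    "ALL",
}


def enhanced_monitoring_params(stream_name, metrics):
    """Validate metric names and build EnableEnhancedMonitoring parameters."""
    unknown = sorted(set(metrics) - VALID_SHARD_LEVEL_METRICS)
    if unknown:
        raise ValueError(f"Unknown shard-level metrics: {unknown}")
    return {"StreamName": stream_name, "ShardLevelMetrics": list(metrics)}


if __name__ == "__main__":
    # Hypothetical stream name; these two metrics help to spot hot shards.
    print(enhanced_monitoring_params(
        "example-stream",
        ["WriteProvisionedThroughputExceeded", "ReadProvisionedThroughputExceeded"],
    ))
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("kinesis").enable_enhanced_monitoring(**params)`. Shard-level metrics incur additional CloudWatch charges, so select only the metrics that you need.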

Additional troubleshooting with API calls

The following Kinesis Data Streams API calls are subject to these request limits. Calls that exceed a limit are throttled:

  • CreateStream: Limit of five transactions per second per account.
  • DeleteStream: Limit of five transactions per second per account.
  • ListStreams: Limit of five transactions per second per account.
  • GetShardIterator: Limit of five transactions per second per account per open shard.
  • MergeShards: Limit of five transactions per second per account.
  • DescribeStream: Limit of ten transactions per second per account.
  • DescribeStreamSummary: Limit of twenty transactions per second per account.

When you use these API calls, you can monitor any throttling in the AWS CloudTrail logs. For more information about Kinesis Data Streams API calls and CloudTrail, see Logging Amazon Kinesis Data Streams API calls with AWS CloudTrail.
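To stay under the limits listed above, a client can pace its own calls and back off when it is throttled. The following is an illustrative sketch only: the helper names and the backoff parameters are hypothetical, and the limits are copied from the list in this article rather than fetched from the service.

```python
# Transactions per second per account, as listed in this article.
# GetShardIterator is additionally scoped per open shard.
API_LIMITS_TPS = {
    "CreateStream": 5,
    "DeleteStream": 5,
    "ListStreams": 5,
    "GetShardIterator": 5,
    "MergeShards": 5,
    "DescribeStream": 10,
    "DescribeStreamSummary": 20,
}


def min_interval_seconds(api_name):
    """Shortest spacing between calls that stays within the listed limit."""
    return 1.0 / API_LIMITS_TPS[api_name]


def backoff_delays(base_seconds, max_seconds, attempts):
    """Exponential backoff schedule (no jitter), capped at max_seconds."""
    return [min(max_seconds, base_seconds * (2 ** n)) for n in range(attempts)]


if __name__ == "__main__":
    print(min_interval_seconds("DescribeStream"))
    # Start retries at the pacing interval for a 5 TPS call, capped at 2 seconds.
    print(backoff_delays(min_interval_seconds("ListStreams"), 2.0, 5))
```

In practice, adding random jitter to each delay helps to keep several clients in the same account from retrying in lockstep.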

Related information

Amazon CloudWatch pricing

Monitoring the Amazon Kinesis Data Streams service with Amazon CloudWatch

AWS OFFICIAL
Updated 10 months ago