IoT Core Rules error actions - no way to access converted body

0

Hello, I've been doing a datalake solution for a customer heavily relying on AWS infrastructure for about 3-4 years now. We noticed at some point that there are a lot of errors from Kinesis for various reasons: unfortunately a lot of them are Kinesis Internal Errors which make the rule to fail and not automatically retry (this is what I would expect from it). I found this ticket: https://repost.aws/knowledge-center/kinesis-data-stream-500-error that basically suggests to implement a retry mechanism. Initially I have just saved the failed messages to S3 but this is not maintainable as we process massive amounts of data, and I can't afford reprocessing them from my local machine. So I figured I should use a SQS + Lambda that would try to re-send the message to Kinesis. And here, the problems started. I have noticed that the base64OriginalPayload is the original message arriving at the IoT Topic. Now, we have custom SQL queries doing some simple (or not) operations on incoming data, and there seems to be no way of getting a correctly parsed message. It's usually a problem of Kinesis, not IoT Core SQL for our issues. I can't also resend the message to IoT Topic, as we have other teams subscribing to those topics doing their own dashboarding and analytics and resending the same message would corrupt the state of the data (duplicates).

What is the recommended way of doing such retries in a configurable and robust way? Best Regards

2 Answers
0
  1. Retry Mechanism: Utilize SQS and Lambda for reprocessing failed Kinesis records with exponential backoff and DLQ for handling persistent failures.
  2. Payload Handling: Decode and reprocess the original payload within the Lambda function.
  3. Monitoring: Set up comprehensive monitoring and alerting to handle failures and retries effectively.
  4. Scalability: Consider AWS Step Functions for complex retry logic or the circuit breaker pattern for resilient systems.
profile pictureAWS
EXPERT
Deeksha
answered a month ago
  • That's exactly what I started with, but this is not maintainable. In many places we have a simply query just adding some clientid() or topic() information, but there are others that have dedicated SQL queries. You suggest I do dedicated lambdas for each case? This seems like so much unnecessary work. If I had access to the processed SQL result that would be so much easier!

0

Hi. AFAIK you can't access the original SQL transformation in the error action, but you can use SQL functions and substitution templates in the error action. For example, in the key field when you use the S3 action: https://docs.aws.amazon.com/iot/latest/developerguide/rule-error-handling.html#rule-error-example. This could help you format the data in the bucket in such a way that a single Lambda could more easily process all errors from all topics.

I can't afford reprocessing them from my local machine

You could have a Lambda do it.

For the data landing in Kinesis Data Streams, are you processing it in real-time? Or are you just taking it from KDS and landing it into a data lake? If the latter, then I wonder if it might be a better option to swap to Amazon Data Firehose? This might have the happy side effect of allowing you to sidestep the problem.

https://aws.amazon.com/blogs/iot/unlocking-scalable-iot-analytics-on-aws/

https://iotatlas.net/en/best_practices/aws/data_ingest/#save-ingested-data-to-durable-storage-before-processing

Some other relevant links:

https://docs.aws.amazon.com/wellarchitected/latest/iot-lens/failure-management.html

https://iotatlas.net/en/best_practices/aws/data_ingest/#use-error-actions-for-your-iot-rules

profile pictureAWS
EXPERT
Greg_B
answered a month ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions