What is the best way to sync my DynamoDB tables to S3, so that I can perform serverless 'big data' queries using Athena? The data must be kept in sync without any intervention. The frequency of sync would depend on the cost, ideally daily but perhaps weekly.
I have had this question a long time. I will cover what I have considered, and why I don't like the options.
-
AWS Glue Elastic Views. Sounds like this will do the job with no code, but it was announced 18 months ago and there have been no updates since. Its not generally available, and there is not information on when it might be.
-
Use dynamodb native backup following this blog https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/. I actually already use this method for 'one-off' data transfers that I kick-off manually and then configure in Athena. I have two issues with this option. The first is that, to my knowledge, the export cannot be scheduled natively. The blog suggests using the CLI to kick off exports, and I assume the writer intends that the CLI would need scheduling on a cron job somewhere. I don't run any servers for this. I imagine I could do it via a scheduled Lambda with an SDK. The second issue is that the export path in S3 always includes a unique export ID. This means I can't configure the Athena table to point to a static location for the data and just switch over the new data after a scheduled export. Perhaps I could write another lambda to move the data around to a static location after the export has finished, but it seems a shame to have to do so much work and I've not seen that covered anywhere before.
-
I can use data pipeline as described in https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html. This post is more about backing data up than making it accessible to Athena.
I feel like this use case must be so common, and yet none of the ideas I've seen online are really complete. I was wondering if anyone had any ideas or experiences that would be useful here?