Incremental Data Capture from DynamoDB to S3


I have a use case where I would like to replicate data from DynamoDB to S3 (for backup and, later, for analytical processing). I don't need real-time data updates or any notifications on DDB item changes.

The "regular" design with DDB Streams, Lambda, Kinesis Firehouse and S3 destination is possible, but I am concerned about high cost, complexity and effort for initial setup.

I have implemented the following design: https://aws.amazon.com/blogs/database/simplify-amazon-dynamodb-data-extraction-and-analysis-by-using-aws-glue-and-amazon-athena/

(Export data from DDB to S3 via a Glue ETL job.)
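In rough terms, the job follows the pattern below (a minimal PySpark sketch of that design; the table name, S3 path and throughput settings are placeholders, not my actual values):

```python
# Minimal Glue ETL sketch: read a DynamoDB table and write it to S3 as Parquet.
# Table name, bucket path and throughput settings are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Full-table read via the Glue DynamoDB connector; the throughput percentage
# caps how much of the table's read capacity the job may consume.
ddb_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "my-ddb-table",   # placeholder
        "dynamodb.throughput.read.percent": "0.25",   # stay below ~25% of RCU
        "dynamodb.splits": "8",                       # parallel scan segments
    },
)

# Write compressed, query-friendly Parquet for Athena.
glue_context.write_dynamic_frame.from_options(
    frame=ddb_frame,
    connection_type="s3",
    connection_options={"path": "s3://my-backup-bucket/ddb-export/"},  # placeholder
    format="parquet",
)

job.commit()
```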

That works fine, but I have the following additional questions for anybody with more experience with this design:

  1. Is there any way to perform an incremental data load, i.e. not load the full DDB table via the Glue job each time?

  2. Does anybody have practical experience with this design in terms of performance? My main concern is that, for large DDB tables (hundreds of millions of items), a full data load will A) consume too much read capacity and B) take hours or days to complete.

  3. Are there any other practical design approaches for working with large DDB tables?

AWS
asked 4 years ago · 3195 views
2 Answers
Accepted Answer

It is not possible to perform an incremental, 'bookmarked' load from a DynamoDB table without data modeling designed for it (i.e. a sharded GSI that allows time-based queries across the entire data set), and that would then require a custom reader, because Glue doesn't support GSI queries.
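For illustration only, a custom reader over such a sharded GSI might look like the sketch below (the shard count, index name and attribute names are hypothetical design choices, not anything DynamoDB provides out of the box):

```python
# Hypothetical custom reader for a sharded GSI: each item is assumed to carry
# a synthetic numeric partition key "gsi_shard" (0..N-1) and a sort key
# "updated_at" (ISO-8601 timestamp). All names here are illustrative.
import boto3
from boto3.dynamodb.conditions import Key

NUM_SHARDS = 16  # assumption: the write path spreads items across 16 shards
table = boto3.resource("dynamodb").Table("my-ddb-table")  # placeholder table name

def read_changes_since(bookmark_iso: str):
    """Yield every item updated after the bookmark timestamp, shard by shard."""
    for shard in range(NUM_SHARDS):
        kwargs = {
            "IndexName": "by-shard-and-updated-at",  # hypothetical GSI name
            "KeyConditionExpression": (
                Key("gsi_shard").eq(shard) & Key("updated_at").gt(bookmark_iso)
            ),
        }
        while True:
            page = table.query(**kwargs)
            yield from page["Items"]
            if "LastEvaluatedKey" not in page:
                break
            kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```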

Using Streams --> Lambda --> Firehose is currently the most 'managed' and cost-effective way to deliver incremental changes from a DynamoDB table to S3.

Reading DynamoDB Streams only has the computational cost of Lambda associated with it, and the Lambdas can read hundreds of items in a single invocation. Having Firehose buffer, package and store these changes as compressed/partitioned/queryable data on S3 is simple and cost effective.
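As a rough sketch (the delivery stream name and the record shape are placeholders, and it assumes the stream is configured with new images), the Lambda in the middle can be as small as:

```python
# Streams -> Lambda -> Firehose hop: the Lambda receives a batch of DynamoDB
# stream records and forwards them to a Firehose delivery stream, which
# buffers and lands them on S3.
import json
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "ddb-changes-to-s3"  # placeholder delivery stream name

def handler(event, context):
    # Each stream record carries the change type and the item images
    # (NewImage is present when the stream view type includes new images).
    batch = [
        {"Data": (json.dumps({
            "eventName": r["eventName"],             # INSERT / MODIFY / REMOVE
            "keys": r["dynamodb"].get("Keys"),
            "newImage": r["dynamodb"].get("NewImage"),
        }) + "\n").encode("utf-8")}
        for r in event["Records"]
    ]
    # PutRecordBatch accepts up to 500 records per call; stream batches are
    # typically well under that, so a single call suffices for this sketch.
    firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=batch)
```

Firehose then handles the buffering and writes objects to S3 on a size or time interval, so the Lambda itself stays stateless and cheap.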

If you are concerned about cost, it could be worth opening a specreq to have a specialist take a look at the analysis - these configurations are both common and generally cost effective (the cost is not relative to the size of the table, but rather to the velocity and size of the writes, which will often be more efficient than a custom reader/loader).

answered 4 years ago

The Export to S3 feature has been enhanced to support incremental exports: https://aws.amazon.com/blogs/database/introducing-incremental-export-from-amazon-dynamodb-to-amazon-s3/
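For reference, a rough boto3 sketch of starting an incremental export (the table ARN, bucket and time window are placeholders; point-in-time recovery must be enabled on the table):

```python
# Kick off an incremental export via the ExportTableToPointInTime API.
from datetime import datetime, timezone
import boto3

ddb = boto3.client("dynamodb")

ddb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:eu-west-1:123456789012:table/my-ddb-table",  # placeholder
    S3Bucket="my-backup-bucket",                                            # placeholder
    S3Prefix="ddb-incremental/",
    ExportFormat="DYNAMODB_JSON",
    ExportType="INCREMENTAL_EXPORT",
    IncrementalExportSpecification={
        # Export only the changes made within this window.
        "ExportFromTime": datetime(2024, 1, 1, tzinfo=timezone.utc),
        "ExportToTime": datetime(2024, 1, 2, tzinfo=timezone.utc),
        "ExportViewType": "NEW_AND_OLD_IMAGES",
    },
)
```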

AWS
answered 7 months ago
