The issue is that DynamoDB cannot auto-scale fast enough to keep up with the AWS Glue write speed. To avoid a DynamoDB ThrottlingException on write, use the "Provisioned" capacity mode with auto scaling enabled for both read and write. Choose the min/max capacity wisely, and set the target utilization to allow DynamoDB sufficient time to scale. As an example, I ran this blog successfully with the number of workers in the Glue job set to 25 and the table in provisioned capacity mode with auto scaling configured as:
- Read: 5 (min) / 40K (max), target utilization 70%
- Write: 5K / 20K (min), 40K (max), target utilization 35%
- Total records / items: 14,092,383
Essentially, this provides enough time for DynamoDB to auto scale.
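The auto-scaling settings above can be configured programmatically through the Application Auto Scaling API. Below is a minimal boto3 sketch that builds the request parameters for the write dimension; the table name is a placeholder, and the numbers mirror the example settings above. The actual calls are left commented out since they require AWS credentials.

```python
def write_scaling_params(table_name, min_wcu, max_wcu, target_pct):
    """Build kwargs for register_scalable_target and put_scaling_policy."""
    resource_id = f"table/{table_name}"
    target = {
        "ServiceNamespace": "dynamodb",
        "ResourceId": resource_id,
        "ScalableDimension": "dynamodb:table:WriteCapacityUnits",
        "MinCapacity": min_wcu,
        "MaxCapacity": max_wcu,
    }
    policy = {
        "ServiceNamespace": "dynamodb",
        "ResourceId": resource_id,
        "ScalableDimension": "dynamodb:table:WriteCapacityUnits",
        "PolicyName": f"{table_name}-write-scaling",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Low target utilization leaves headroom while scaling catches up
            "TargetValue": target_pct,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
            },
        },
    }
    return target, policy

# Placeholder table name; 5K min / 40K max / 35% target as in the example above
target, policy = write_scaling_params("my-glue-target-table", 5000, 40000, 35.0)

# With credentials in place, you would apply these with:
#   aas = boto3.client("application-autoscaling")
#   aas.register_scalable_target(**target)
#   aas.put_scaling_policy(**policy)
print(target["MaxCapacity"])  # 40000
```

The same pair of calls with `ScalableDimension="dynamodb:table:ReadCapacityUnits"` and `PredefinedMetricType="DynamoDBReadCapacityUtilization"` covers the read side.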
While using an on-demand table is recommended, it too must scale to meet your throughput needs. DynamoDB on-demand tables provide 4K WCU out of the box, so your ETL job will initially be capped at 4K WCU, then throttle until DynamoDB makes the changes necessary to accommodate your request throughput. For best results, you must first "pre-warm" your table; this is easier on new tables, but will also work for existing ones.
- Create the table in provisioned mode with no auto scaling, and assign 40K WCU and 40 RCU in the capacity settings.
- Wait for the table to go from Creating/Updating to Active.
- Once Active, change the capacity mode to on-demand.
- You can now achieve approximately 40K WCU out of the box, and your ETL job will have the capacity it needs to run without issue.
*Note that 40K WCU is the soft limit for max table capacity, and it can be increased with a request through Service Quotas.
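The pre-warm steps above can be sketched with boto3. This builds the request parameters for a hypothetical single-key table ("pk" and the table name are placeholders); the actual AWS calls are shown in comments since they need credentials and would create a real table.

```python
def prewarm_steps(table_name):
    """Return kwargs for the create-then-switch pre-warm sequence."""
    create = {
        "TableName": table_name,
        "AttributeDefinitions": [{"AttributeName": "pk", "AttributeType": "S"}],
        "KeySchema": [{"AttributeName": "pk", "KeyType": "HASH"}],
        "BillingMode": "PROVISIONED",
        # Step 1: high WCU, minimal RCU, no auto scaling
        "ProvisionedThroughput": {"ReadCapacityUnits": 40,
                                  "WriteCapacityUnits": 40000},
    }
    # Step 3: flip the warmed table to on-demand
    switch = {"TableName": table_name, "BillingMode": "PAY_PER_REQUEST"}
    return create, switch

create, switch = prewarm_steps("my-glue-target-table")

# With credentials in place:
#   ddb = boto3.client("dynamodb")
#   ddb.create_table(**create)
#   ddb.get_waiter("table_exists").wait(TableName="my-glue-target-table")  # step 2: wait for Active
#   ddb.update_table(**switch)
print(switch["BillingMode"])  # PAY_PER_REQUEST
```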
Leeroy, thanks for the additional information. I figured out the capacity part by trial and error: the Glue job started writing at the provisioned capacity.
However, I had to change the composite key.
Before: table with 3 million items
- Partition key: AAABBB (a currency pair, USDCAD for example)
- Sort key: yyyymmdd (date)
- Item: just a number
- Avg item size: 28 bytes; table size 500 MB
There were around 4,000 unique partition keys for these 3 million items. Each had about 1,000 items, with around 600 sort keys (a 2-year period) per partition key.
With this schema, DDB was able to write at only 1.3K WCU, even with provisioned capacity at 5-10K WCU.
Changing the schema to abcdez#yyyy with sort key mmdd allowed writes at half of the provisioned WCU, and only the scheme abcdez#yyyymm with sort key dd allowed writes at full provisioned capacity.
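The three key schemes can be illustrated with a small helper. For a hypothetical record (pair "USDCAD", date 20230715), moving date digits into the partition key multiplies the number of distinct partition keys, so writes spread over more partitions:

```python
def make_keys(pair, yyyymmdd, scheme):
    """Return (partition_key, sort_key) for a record under the given scheme."""
    y, m, d = yyyymmdd[:4], yyyymmdd[4:6], yyyymmdd[6:8]
    if scheme == "pair":          # original schema: throttled around 1.3K WCU
        return pair, yyyymmdd
    if scheme == "pair#yyyy":     # half of provisioned WCU
        return f"{pair}#{y}", m + d
    if scheme == "pair#yyyymm":   # full provisioned WCU
        return f"{pair}#{y}{m}", d
    raise ValueError(scheme)

print(make_keys("USDCAD", "20230715", "pair#yyyymm"))  # ('USDCAD#202307', '15')
```

With ~4,000 pairs and a 2-year range, the "pair" scheme yields ~4,000 partition keys, "pair#yyyy" roughly 8,000, and "pair#yyyymm" roughly 96,000, which matches the observed throughput progression.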
Why couldn't DDB write at the provisioned WCU with this partition key? I didn't see anything in the docs like "if a partition key has too many sort keys, the writes will be slow".
It's not that it has too many sort keys; it's the fact that a partition has a hard limit of 1,000 WCU. Your items are partitioned based on a hashed version of your partition key, meaning each key can only sustain 1,000 WCU. DynamoDB does have mechanics to split a key's items across multiple partitions, but you will get throttled before that happens.
My suggestion is to try shuffling your data before writing. Glue basically provides an abstraction over the BatchWriteItem API, wherein you provide DynamoDB a batch of items with the same partition key, which in turn directs traffic to a single partition and hits the 1,000 WCU cap. Because Glue is distributed in nature, it will not be obvious that this is happening, but in essence you have multiple hot partitions. Hence, I suggest shuffling your data to allow a more even distribution.
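A toy illustration of why the shuffle helps, using made-up currency-pair data: when items arrive grouped by partition key, a 25-item batch targets a single partition; after a shuffle, each batch mixes keys. In a real Glue job the equivalent would be repartitioning the dataset (e.g. Spark's `df.repartition(n)`) before the DynamoDB write.

```python
import random

random.seed(0)  # deterministic for the example

# 4 partition keys x 100 items each, initially grouped by key
items = [{"pk": pk, "value": i}
         for pk in ("USDCAD", "EURUSD", "GBPJPY", "AUDNZD")
         for i in range(100)]

def distinct_keys_in_first_batch(data, batch_size=25):
    """How many partition keys a single BatchWriteItem-sized batch touches."""
    return len({item["pk"] for item in data[:batch_size]})

print(distinct_keys_in_first_batch(items))  # 1 -> whole batch hits one partition
random.shuffle(items)
print(distinct_keys_in_first_batch(items))  # >1 -> traffic spread across partitions
```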
The problem is called a "hot partition". Consider changing the partition key :-) A GUID usually works fine. I think you can still make an index on the AAABBB key; the index will eventually catch up.
That's interesting, thank you. I left the table in "on-demand" mode as recommended here https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-dynamodb
> When the DynamoDB table is in on-demand mode, AWS Glue handles the write capacity of the table as 40000. For importing a large table, we recommend switching your DynamoDB table to on-demand mode.
Isn't DDB supposed to handle 40K WCU pretty effortlessly in on-demand mode?
@catenary-dot-cloud yes, I know that is recommended in our docs, and I tried running the above blog with DDB set to on-demand; not surprisingly, that failed. I guess with the release of AWS Glue 3.0 that recommendation needs a re-look. BTW, this load to DDB via AWS Glue will be added to upcoming AWS Glue workshops.