DynamoDB modeling


Hey, we are an ad tech company with software that tracks impressions, clicks, and conversions. We are currently using Postgres but plan on moving to DynamoDB. The most common query is by source and clickId (a UUID), so we're considering using the source as the partition key and the clickId as the sort key.

For the DW pipeline, we might need to query by date, so that will be the first global secondary index (GSI), and another ID that we use will be the second GSI.

We'll reach the 10 GB partition limit for a few sources in less than a year, as they drive millions of events daily. I've done some reading and understand that this is a soft limit, meaning that DynamoDB will split the partition into two or more partitions. What are the performance side effects of the split, in both the short and long term?

Also, some sources drive more traffic than others, which leads me to the hot-key problem. Is there a way to avoid it?

HaimB
asked a year ago · 351 views
3 Answers

First, I want to clear up that DynamoDB does not have any data size limit. The 10 GB per-partition limit applies only when you use a Local Secondary Index, which I would recommend avoiding unless you first consult with a specialist. Otherwise, DynamoDB automatically splits partitions as needed based on your storage and throughput needs.

If you are using source as the partition key, be aware that you are limited to 1000 WCU / 3000 RCU per source. If your use case allows, you should combine both source and UUID to give you a more even distribution across the keyspace:

pk                   sk                     other
source001#uuid002    2023-04-10T00:00:00Z   Data
source001#uuid925    2023-04-11T00:00:00Z   Data
source002#uuid034    2023-04-02T00:00:00Z   Data
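As a rough sketch, building such a composite key could look like the following (the `make_item` helper and the attribute layout are illustrative assumptions, not something from this thread):

```python
# Sketch: combine source and click UUID into one partition key so items
# spread across the keyspace instead of piling up under a single source.
import uuid
from datetime import datetime, timezone

def make_item(source: str, click_id: str, payload: dict) -> dict:
    """Build a DynamoDB item with a composite source#uuid partition key."""
    return {
        "pk": f"{source}#{click_id}",  # e.g. "source001#uuid002"
        "sk": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        **payload,
    }

item = make_item("source001", str(uuid.uuid4()), {"event": "click"})
```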

Likewise, for your GSI you need to ensure you are not exceeding 1000 WCU / 3000 RCU per partition. If your use case is to read by date, I assume you mean you want, for example, all the clicks in the last 24 hours. Take care to apply write sharding so you don't create a hot partition on your index.
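Write sharding is a design pattern rather than a switch you turn on. A minimal sketch, assuming an arbitrary fixed shard count of 10, could look like:

```python
# Sketch of write sharding for a date-keyed GSI: append a random shard
# suffix so one calendar day spans N partitions instead of a single one.
import random

NUM_SHARDS = 10  # assumption: sized so each shard stays under the per-partition limits

def sharded_gsi_pk(day: str) -> str:
    """Pick a random shard for a write, e.g. '2023-04-10#7'."""
    return f"{day}#{random.randrange(NUM_SHARDS)}"

def all_shard_keys(day: str) -> list[str]:
    """To read a whole day back, query every shard key and merge the results."""
    return [f"{day}#{i}" for i in range(NUM_SHARDS)]
```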

AWS
EXPERT
answered a year ago
  • Thank you Leeroy,

    If I use UUID or source+UUID as the PK, that will lead to one row per partition key: millions of partition keys, each holding a single item. Is that a valid use of DynamoDB? If so, it might actually solve my other problem of another unique attribute we use as a short identifier; I can add it as a GSI and all my problems are gone.

  • You don't need to think about partitions when you work with DynamoDB. You can have as many unique partition keys as you like; in fact, the more the better. Just be aware that using a combination of source+uuid as the partition key will prevent you from running SELECT * FROM table WHERE source = 'abc' unless you also create a GSI on source.
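    Such a GSI on source alone might be declared like this (boto3-style `create_table` parameters; the index and attribute names are assumptions for illustration, not from the thread):

    ```python
    # Hypothetical table definition: composite pk/sk as primary key, plus a
    # GSI keyed on source so queries by source alone still work.
    gsi_params = {
        "AttributeDefinitions": [
            {"AttributeName": "pk", "AttributeType": "S"},
            {"AttributeName": "sk", "AttributeType": "S"},
            {"AttributeName": "source", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "pk", "KeyType": "HASH"},
            {"AttributeName": "sk", "KeyType": "RANGE"},
        ],
        "GlobalSecondaryIndexes": [
            {
                "IndexName": "source-index",
                "KeySchema": [{"AttributeName": "source", "KeyType": "HASH"}],
                "Projection": {"ProjectionType": "ALL"},
            }
        ],
    }
    ```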


You need the partition key to be as well distributed as possible so that no partition gets overloaded; this might require some creativity. For example, you can avoid unbalanced hot keys by sharding/aliasing popular keys: create hash.1, hash.2, hash.3 just to spread the storage out. Then you can query each shard and pull it all in at the same time.
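The shard-and-merge read described above can be sketched as follows. Here `query_partition` is a stand-in for a real per-key DynamoDB Query call, and the shard count of 3 is an arbitrary assumption:

```python
# Sketch: fan out one query per alias of a hot key (hash.1, hash.2, ...)
# in parallel, then merge the per-shard results into a single list.
from concurrent.futures import ThreadPoolExecutor

SHARDS = 3  # assumption: number of aliases created for the hot key

def read_hot_key(base_key, query_partition):
    """Query every alias of a sharded hot key and merge the results."""
    keys = [f"{base_key}.{i}" for i in range(1, SHARDS + 1)]
    with ThreadPoolExecutor(max_workers=SHARDS) as pool:
        results = pool.map(query_partition, keys)  # preserves key order
    return [item for batch in results for item in batch]

# In-memory stand-in for the table, just to show the merge:
fake_table = {"hash.1": ["a"], "hash.2": ["b"], "hash.3": ["c"]}
items = read_hot_key("hash", lambda k: fake_table.get(k, []))
```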

Given that a good partition key is essential for full performance, distributing requests evenly across partitions, what makes a good key?

  • The attribute should have as many distinct values as possible.
  • Access should be uniform across all key values, not just in total but at any point in time.
  • If no attribute fits the above criteria, consider a synthetic/created/hybrid value.
  • Do not mix hot and cold key values in a table.
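If no natural attribute qualifies, a synthetic/hybrid value like the one mentioned above can be derived deterministically, for example by hashing the natural key. This is only a sketch; the shard count of 8 is an arbitrary assumption:

```python
# Sketch of a synthetic partition key: derive a fixed shard suffix from a
# hash of the natural key, so distribution is even and repeatable.
import hashlib

def synthetic_pk(natural_key: str, shards: int = 8) -> str:
    """Return a key like 'abc.5', with the suffix stable for a given input."""
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    return f"{natural_key}.{int(digest, 16) % shards}"
```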
EXPERT
answered a year ago

Thank you for your answer, but I'm not sure I follow.

The only way to match an event is by the ID (UUID). Using the ID as a partition key will lead to a partition key per row. I should use a partition key in my queries to avoid a table scan, so the second most common field is the source, as the UUID is always attached to a source.

The user flow is :

  1. A user clicks on an ad -> a UUID is generated, and the user is redirected using this UUID.
  2. A server-to-server event, with source and UUID, is sent at any time after the click. It can be 5 seconds or 5 months later.

If I understand correctly, you suggest creating alias groups (group1, group2, etc.) and splitting the sources across groups. Is that right? In that case, I would need to assign a source to an alias before it even starts, which makes no sense, as I have no idea what its volume will be at that stage.

I had an idea of using a **date** (day/week/month) key as the partition key and the **source** as the sort key, and using a global secondary index with the source and the UUID. This way the partitions will be split evenly, but I have no idea if a global index can be a composite of attributes that are not part of the original primary key.

Thanks, Haim

HaimB
answered a year ago
