S3 key structure for optimal prefix partitioning

We have a key scheme like the example below; the end of the image name is a timestamp. We perform distributed processing over the images in each right/ folder (1k+ workers over 10k+ images under .../right/) and hit 503 Slow Down responses above a certain job size:

s3://mybucket/sensor/project_id/images/right/SN1234_172114123131.jpg

We do handle retries and ramp up request rates as described in the docs and troubleshooting guide, but we would like to update our key scheme to best align with how S3 partitions keys in the first place.
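For reference, our retry handling is roughly the sketch below. It relies on boto3's built-in adaptive retry mode, which backs off on 503s and throttles the client-side request rate; the max_attempts value is just our choice, and the bucket/key are from the example above:

```python
import boto3
from botocore.config import Config

# "adaptive" retry mode retries 503 Slow Down responses with backoff
# and also ramps the client-side request rate up and down.
config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
s3 = boto3.client("s3", config=config)

obj = s3.get_object(
    Bucket="mybucket",
    Key="sensor/project_id/images/right/SN1234_172114123131.jpg",
)
```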

However, it is unclear what an effective prefix scheme would be. The docs explain that partitions do not necessarily fall along delimiter boundaries (the / characters that define logical folders), but the recommendations always explain prefixes in terms of folders. If the prefix doesn't need to fall along delimiters, wouldn't the fact that our image keys end in a timestamp already allow spreading across partitions?

Assuming we need to add folder prefixes, does it matter whether the keys diverge at the beginning or end of the prefix? That is, which of the schemes below should we implement for the highest throughput "out of the box", before sending significant request volume to the keys?

s3://mybucket/<hash_or_timestamp>/sensor/project_id/images/right/SN1234_172114123131.jpg

or

s3://mybucket/sensor/project_id/images/right/<hash_or_timestamp>/SN1234_172114123131.jpg
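For concreteness, the two candidates would be built roughly like this (a sketch; the 4-character MD5-based shard is an arbitrary illustration of ours, not something prescribed by AWS):

```python
import hashlib

name = "SN1234_172114123131.jpg"
# Hypothetical shard: first 4 hex chars of an MD5 of the image name.
shard = hashlib.md5(name.encode()).hexdigest()[:4]

key_a = f"{shard}/sensor/project_id/images/right/{name}"  # diverges at the start
key_b = f"sensor/project_id/images/right/{shard}/{name}"  # diverges at the end
```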

We understand that S3 uses a lot of heuristics and usage patterns to optimize partitions, but we are specifically looking for a clear recommendation of how to optimize our key scheme out of the box. Thank you!

asked a month ago · 68 views
2 Answers

AWS has extensive deep dives into the specifics in presentations available on YouTube. I believe it's been stated explicitly many times that, for partitioning purposes, S3 doesn't care about or in any way prioritise any specific characters, such as slashes. It simply looks at the distribution of data and activity across prefixes of different lengths, attempting to find an optimal spot along the full key at which to split a given data set into two partitions. This partitioning can continue in a nested manner, with each newly created partition getting split into two more, or the process can be reversed by combining partitions created earlier.
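To illustrate with a toy example (mine, not the actual S3 algorithm): the longest shared prefix of two keys, and hence a potential split point, can fall mid-filename, nowhere near a slash:

```python
import os

keys = [
    "sensor/project_id/images/right/SN1234_172114123131.jpg",
    "sensor/project_id/images/right/SN1234_172114999999.jpg",
]
# The common prefix ends inside the filename, not at a delimiter;
# a partition boundary could legitimately sit at any such point.
print(os.path.commonprefix(keys))
# sensor/project_id/images/right/SN1234_172114
```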

The core idea for avoiding your API limits is to ensure that objects being accessed in parallel are distributed across prefixes that have previously been arranged into partitions. For example, splitting your objects into a folder-like structure by date, such as .../YYYY/MM/DD/HH/mm/ss/filename.jpeg, or by an incrementally increasing ID number, would often not help performance, because a typical application might access data from the past day or the past 10-15 minutes as the "hot spot", all of which would be located in the same partition. Data in partitions created earlier would be mostly inactive.

Convenient solutions vary by the structure of your data, but a hash-based naming structure would generally work better than a timestamp-based one, because hashes naturally have a uniform distribution, which the prefixes would then follow.

For example, the structure below would tend to focus I/O on the newest objects, which all share the same prefix and only get partitioned later, long after the high I/O pressure is over. All recently created objects would land in a partition created earlier, such as /images-to-process/2024/09/1 (which would also cover files from the 10th), as distinct from a hypothetical previous partition /images-to-process/2024/09/0:

/images-to-process/2024/09/11/184502_5579cb0b97c990f7.jpeg
/images-to-process/2024/09/11/184502_5cd386a19afce9af.jpeg
/images-to-process/2024/09/11/184502_a900fdd2064e3073.jpeg
/images-to-process/2024/09/11/184502_acbe3954657842dd.jpeg

By comparison, the structure below could split the partitions at /images-to-process/, with the 5 and a (and presumably all the other intermediate characters) immediately underneath belonging to the same 16 hexadecimal prefixes that were used yesterday, the day before, and every day before that, making it likely that those prefixes would already have been partitioned. Within each day and second, as well as across days and months, the hash values would be distributed uniformly and would match the hash prefixes from past periods. If the partition /images-to-process/5 needed to be split further, it could become /images-to-process/55, /images-to-process/5c, and so on, for another level of 16 hexadecimal branches (see the generation sketch after the listing):

/images-to-process/5579cb0b97c990f7/2024/09/11/184502_5579cb0b97c990f7.jpeg
/images-to-process/5cd386a19afce9af/2024/09/11/184502_5cd386a19afce9af.jpeg
/images-to-process/a900fdd2064e3073/2024/09/11/184502_a900fdd2064e3073.jpeg
/images-to-process/acbe3954657842dd/2024/09/11/184502_acbe3954657842dd.jpeg
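A sketch of how keys shaped like the ones above could be generated; using a content hash as the shard source is my own choice here, and any stable, uniformly distributed value would do the same job:

```python
import hashlib
import time

def make_key(image_bytes: bytes, ext: str = "jpeg") -> str:
    # 16 hex chars of a content hash lead the key, so new uploads land
    # uniformly across hex prefixes that already existed (and may have
    # been partitioned) long before today's burst of activity.
    shard = hashlib.sha256(image_bytes).hexdigest()[:16]
    stamp = time.strftime("%Y/%m/%d/%H%M%S", time.gmtime())
    return f"images-to-process/{shard}/{stamp}_{shard}.{ext}"
```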

The first core consideration would be leveraging the partitioning that has been established by S3 previously, before objects start to get accessed at a high rate by your application.

Secondly, you should ensure that the momentary peaks of object operations from your application are spread, at every moment, as evenly as possible across different pre-existing prefixes. This typically happens naturally if hashes or random values are generated to serve as prefixes. Timestamps in either forward or reverse order usually don't work for partitioning, because they form a momentary hotspot that just keeps moving from one partition to the next instead of targeting object operations across multiple partitions simultaneously.
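As an illustrative sketch (assuming the hash-prefixed layout from the listing above), work can be dealt out by the leading shard character so that the fleet is always touching many prefixes at once:

```python
def keys_for_worker(keys: list[str], worker_id: int, workers: int = 16) -> list[str]:
    # Assumes keys shaped like images-to-process/<hex-shard>/... as above.
    # Handing each worker its own shard keeps the fleet's momentary load
    # spread across all 16 top-level hex prefixes instead of letting one
    # moving hotspot drag every worker through the same partition.
    def shard(key: str) -> int:
        return int(key.lstrip("/").split("/")[1][0], 16)
    return [k for k in keys if shard(k) % workers == worker_id]
```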

This idea is explained, along with helpful diagrams, in this Re:Invent presentation from last year: https://youtu.be/sYDJYqvNeXU?si=zIb_FbQFixQD7jkv&t=1125 The partitioning discussion starts at 19:00 with the general concept of prefixes, and at 20:00 there's an example of hypothetical content for Re:Invent itself being distributed across "day 1", "day 2", and so on, where the bulk of activity would hit the current or latest day, with the previous days receiving much less traffic and not benefitting from partitioning. A summary of suggested best practices starts at 21:00.

Leo K
answered a month ago

"it's been stated explicitly many times that for partitioning purposes, S3 doesn't care about or in any way prioritise any specific characters, such as slashes" Would appreciate a citation or a link here.... The video you linked seems to indicate otherwise- it seems like a prefix is the S3 def of prefix- basically the "full folder path" between an object filename and the bucket name. There is a critical hand wave moment in that video that talks about adding a new prefix and the subsequent "new load spreading" that has to happen. If this is anticipated from the beginning and prefix splitting is pre-determined (say by a bunch of batch workers, each one gets a unique hashed prefix, and each worker will self limit to the IOPS limit) seems like there would not be a problem to me. I think our org is going to possibly run some tests here but would love an answer / citation if there is an up to date article that confirms this.

pjcyvl
answered a month ago
  • If you look at the slide at 19:10 in the video, it shows as clearly as can be that an object key like reinvent-bucket/prefix1/data/otherstuff also shares the prefixes reinvent-bucket/p and reinvent-bucket/prefix, neither of which ends at a forward slash. There are many deep-dive videos on the topic, this being only one of them, and some state more explicitly than others that S3 doesn't care about the slash character any more than it does about any other character. The APIs also allow specifying any character as the delimiter, despite / surely being a popular choice.

  • In the documentation, optimisation is discussed here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html, which links to https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html, which states explicitly what I wrote earlier: "There is nothing unique about the slash (/) character, but it is a very common prefix delimiter."

  • While the slide you are referencing shows a list of prefixes, it is not made clear by the presenter OR by the text on the slide that a SHARED prefix is what counts toward IOPS limits. In fact, it seems to indicate the opposite: that you DO get some benefit from further splitting (it even uses slashes; see the slide at 20:27). Pointing to a slide that says "what is a prefix" and lists some prefixes, and then arguing from that alone that they have the SAME prefix (again, I think this is untrue; they merely share some level of common prefix, the letter "p"), is a huge leap in logic in my opinion. I understand the idea that the slash is not special, and I'm with you there, but there are very rarely specific mentions of how the S3 service-level optimizations work.

    The bottom line is that there seems to be good consensus in the docs and in the presentations that we want items with really high request rates to have lots of diversity at the left of their prefixes, but there does not seem to be a formula for how many characters are needed, or what proportion of the characters in the total prefix leads to getting the full "per prefix" rate. Does that make sense? There are best practices to follow, but little way to evaluate whether your current practice will incur scaling issues, or how long, or what request volume, will even trigger the partitioning that allows for the full IOPS limit.

  • Slashes mean nothing special to S3. That is stated clearly in the documentation and in every presentation that discusses the topic. They're commonly used in examples because they're equally commonly used in real life to arrange data hierarchically. There have been other talks where it's stated more explicitly that partition splits can occur anywhere along the object key. It's also shown clearly in this video, starting at 10:45 (https://youtu.be/FJJxcwSfWYg?si=ChDOfzM6uQcAQbcP&t=645), that the split isn't necessarily binary: the slide shows a four-way split, and not at a slash-separated boundary.

  • It's entirely possible that some automated machine learning or statistical analysis is being done in the background to anticipate opportunities for partitioning. Those kinds of internal optimisations are likely to keep changing over time, and revealing them would often be a disservice to customers, because many would start relying on technical implementation details liable to change at any time.
