Query for S3 objects requested on a specific day - Athena?


Hi,

We sometimes experience a large increase in GET requests to our production S3 bucket. This bucket contains customer assets that are frequently requested. Usually these assets are requested via CloudFront but, in this instance, that doesn't look likely, as the GET request rate via CloudFront doesn't correlate with the number of GET requests hitting the bucket directly. When this happens, I want to query for the objects requested on that specific day so I can find the remoteip making the requests.

The anatomy of an object key is: public/asset/raw/<accountId>/<hash>/<timestamp>/<assetId>. The bucket was created back in 2017 & thus contains over 40 TB of data. I was previously using the Athena console to get this information. An example query, looking for requests from 18/07/2022, would be:

SELECT remoteip,
    requestdatetime,
    requester,
    requestid,
    operation,
    key,
    request_uri
FROM "s3_access_logs_db"."mybucket_logs"
WHERE operation='REST.GET.OBJECT'
AND key LIKE 'public/asset/raw/<accountId>/%'
AND bucket_name = '<bucketName>'
AND bucketowner = '<bucketOwner>'
AND httpstatus = '200'
AND date(parse_datetime(requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z')) = date '2022-07-18'
ORDER BY requestdatetime DESC

However, because the bucket is so large & because of how Athena reads the log objects, I always get a "Please reduce your request rate" error message. I can't find a way to implement exponential backoff in Athena. I thought about adding another condition to only query objects created in a specific period, but the issue with that is you may miss some object requests. For example, someone might still be requesting today an object that was created in 2018, & if I had a condition whereby I only got requests for objects created from 2019-2022, that request would be missed. Even if I set the key condition to be an exact match (=) instead of a LIKE, the query still fails because the request rate is too high.
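One idea I'm considering as a workaround (an untested sketch; the output location & the new table name are placeholders) is to compact the raw access logs into Parquet once with a CTAS statement, so that day-level queries read a handful of large columnar files instead of millions of tiny log objects:

-- One-time compaction of the raw access logs into Parquet.
-- Fewer, larger files mean far fewer S3 GETs per Athena query.
CREATE TABLE s3_access_logs_db.mybucket_logs_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-query-output/mybucket-logs-parquet/', -- placeholder
    partitioned_by = ARRAY['request_date']
) AS
SELECT remoteip,
    requestdatetime,
    requester,
    requestid,
    operation,
    key,
    request_uri,
    httpstatus,
    -- the partition column must be the last column in the SELECT list
    date(parse_datetime(requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z')) AS request_date
FROM "s3_access_logs_db"."mybucket_logs";

The initial CTAS would still scan the whole log set, & Athena writes at most 100 partitions per CTAS, so presumably it would have to be run in date-bounded batches (CTAS for the first range, INSERT INTO for the rest). After that, the daily query only touches one partition:

SELECT remoteip, requestdatetime, key, request_uri
FROM s3_access_logs_db.mybucket_logs_parquet
WHERE operation = 'REST.GET.OBJECT'
    AND httpstatus = '200'
    AND request_date = date '2022-07-18'
ORDER BY requestdatetime DESC;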

Is there a better tool for this than Athena? Or, if Athena is my only option, what is a better way to go about it?

asked 2 years ago · 979 views
1 Answer

Ideally, partitions would be used to query only the information you are interested in, saving time and cost by limiting the data scanned. With S3 access logs this is made more difficult because no date-based prefixes are used (possible solutions are discussed at https://repost.aws/questions/QUjQHGAO_nQd-PsIKsvfH3HA/partitioning-s-3-access-logs-to-optimize-athena-queries). Alternatively, you could enable data events for this bucket in CloudTrail, export them to S3, and query them with Athena; CloudTrail logs have better support for partitions, so it is possible to scan a larger data set, although this is more expensive because CloudTrail is involved. Information about setting that up is here: https://docs.aws.amazon.com/awscloudtrail/latest/userguide/logging-data-events-with-cloudtrail.html#logging-data-events-examples
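For reference, a sketch of what the CloudTrail side could look like once data events are delivered to S3, using Athena partition projection so each day's query only reads that day's prefix (bucket name, account ID, region & the projection start date below are placeholders):

CREATE EXTERNAL TABLE cloudtrail_data_events (
    eventtime STRING,
    eventsource STRING,
    eventname STRING,
    sourceipaddress STRING,
    useragent STRING,
    requestparameters STRING
)
PARTITIONED BY (`timestamp` STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-cloudtrail-bucket/AWSLogs/111122223333/CloudTrail/eu-west-1/'
TBLPROPERTIES (
    'projection.enabled' = 'true',
    'projection.timestamp.type' = 'date',
    'projection.timestamp.format' = 'yyyy/MM/dd',
    'projection.timestamp.interval' = '1',
    'projection.timestamp.interval.unit' = 'DAYS',
    'projection.timestamp.range' = '2022/01/01,NOW',
    'storage.location.template' = 's3://my-cloudtrail-bucket/AWSLogs/111122223333/CloudTrail/eu-west-1/${timestamp}'
);

-- Who issued GetObject calls on 18/07/2022? requestparameters is a JSON
-- string containing the bucket name & object key.
SELECT sourceipaddress, eventtime, requestparameters
FROM cloudtrail_data_events
WHERE eventname = 'GetObject'
    AND eventsource = 's3.amazonaws.com'
    AND "timestamp" = '2022/07/18';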

If the bucket is public only for the purpose of sitting behind a CloudFront distribution, access to the bucket can be set up so that only CloudFront can fetch the objects, preventing any direct access that bypasses CloudFront: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/private-content-restricting-access-to-s3.html. If you have any applications that also access the bucket, the policy can be configured so that their access is still allowed.
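As a rough sketch (the bucket name & distribution ARN are placeholders), a bucket policy that restricts reads to a CloudFront distribution via origin access control looks like:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCloudFrontServicePrincipalReadOnly",
            "Effect": "Allow",
            "Principal": { "Service": "cloudfront.amazonaws.com" },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::mybucket/*",
            "Condition": {
                "StringEquals": {
                    "AWS:SourceArn": "arn:aws:cloudfront::111122223333:distribution/EDFDVBD6EXAMPLE"
                }
            }
        }
    ]
}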

answered 2 years ago
