Event-driven Glue trigger using EventBridge


Currently, the crawler and Glue job are scheduled to run every 15 minutes. But they run even when there is no new data, hence the higher cost. To streamline this, one option is to use an event-driven approach as below: S3 -> EventBridge -> Glue workflow (which even gives the option to define batching in the Glue trigger within the workflow).
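For illustration, here is a rough boto3 sketch of that wiring, assuming a CloudTrail trail is already logging data events for the bucket; the bucket, workflow, and role names are placeholders, not my actual setup:

```python
import json
import boto3

events = boto3.client("events")

# Placeholder names -- substitute your own bucket, workflow, and role.
BUCKET = "my-landing-bucket"
WORKFLOW_ARN = "arn:aws:glue:us-east-1:123456789012:workflow/my-etl-workflow"
ROLE_ARN = "arn:aws:iam::123456789012:role/eventbridge-start-glue-workflow"

# Rule on the default bus matching PutObject calls logged by CloudTrail.
events.put_rule(
    Name="s3-put-to-glue-workflow",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["s3.amazonaws.com"],
            "eventName": ["PutObject", "CompleteMultipartUpload"],
            "requestParameters": {"bucketName": [BUCKET]},
        },
    }),
    State="ENABLED",
)

# Target the Glue workflow; the workflow needs an EVENT trigger to start.
events.put_targets(
    Rule="s3-put-to-glue-workflow",
    Targets=[{"Id": "glue-workflow", "Arn": WORKFLOW_ARN, "RoleArn": ROLE_ARN}],
)
```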

This flow works fine. However, for EventBridge to capture S3 PUT events this way, a CloudTrail trail needs to be created that logs data events for the S3 bucket. That trail needs to run continuously and also needs a bucket to log the events to, hence the added cost. I might have a high number of buckets, so having a trail for each seems like an additional ongoing cost; what we're saving by not running the Glue job on a schedule might be spent here.

Also, the requirement is: once a file lands in S3, I don't want to trigger the Glue job or workflow immediately. I want a batch window of at least 15 minutes and only then trigger the workflow/job. The job uses bookmarks, so it can resume processing from where it left off last time.
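For reference, the batching I mean can be expressed on the Glue event trigger itself via its batching condition; a minimal sketch, with placeholder workflow and job names:

```python
import boto3

glue = boto3.client("glue")

# Placeholder workflow and job names.
glue.create_trigger(
    Name="batched-event-trigger",
    WorkflowName="my-etl-workflow",
    Type="EVENT",
    # Fire after 100 events or after 900 seconds (15 min), whichever
    # comes first; 900 seconds is the maximum batch window.
    EventBatchingCondition={
        "BatchSize": 100,
        "BatchWindow": 900,
    },
    Actions=[{"JobName": "my-etl-job"}],
)
```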

Could anyone please suggest/advise whether the approach above can be enhanced, or whether there is an entirely better approach?

1 Answer

A simple approach could be to configure the S3 buckets to send event notifications directly to EventBridge. A rule on the default bus for the bucket's region would match PutObject events for the appropriate bucket (or several) and send them to a lightweight Step Functions workflow or Lambda function that simply updates a per-bucket "last update" timestamp in a DynamoDB table.
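As a sketch of that first piece, the Lambda function could be as small as the following, assuming native S3-to-EventBridge notifications (so the bucket name sits at detail.bucket.name) and a DynamoDB table keyed on the bucket name; the table name is a placeholder:

```python
import time
import boto3

dynamodb = boto3.client("dynamodb")

TABLE = "bucket-activity"  # placeholder table, partition key "bucket"

def handler(event, context):
    """Invoked by the EventBridge rule for each object-created event.
    Records the time of the most recent upload per bucket."""
    bucket = event["detail"]["bucket"]["name"]
    dynamodb.update_item(
        TableName=TABLE,
        Key={"bucket": {"S": bucket}},
        UpdateExpression="SET last_update = :now",
        ExpressionAttributeValues={":now": {"N": str(int(time.time()))}},
    )
```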

Secondly, configure a scheduled rule similar to the one you have currently, but triggering another lightweight Step Functions workflow. That workflow would first check whether the bucket's last-update timestamp is later than the last time the crawler job was executed. If no changes have taken place, the workflow simply does nothing, at no significant cost; otherwise it updates a "last executed" timestamp in the DynamoDB table and triggers the crawler job just as your current schedule does.
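The same check-and-trigger logic, written here as a Lambda sketch for brevity rather than as Step Functions states; the table, bucket, and crawler names are placeholders:

```python
import time
import boto3

dynamodb = boto3.client("dynamodb")
glue = boto3.client("glue")

TABLE = "bucket-activity"       # same placeholder table as above
BUCKET = "my-landing-bucket"    # placeholder bucket name
CRAWLER = "my-crawler"          # placeholder crawler name

def handler(event, context):
    """Runs on the existing 15-minute schedule. Starts the crawler only
    if the bucket has changed since the last execution."""
    item = dynamodb.get_item(
        TableName=TABLE, Key={"bucket": {"S": BUCKET}}
    ).get("Item", {})
    last_update = int(item.get("last_update", {}).get("N", "0"))
    last_executed = int(item.get("last_executed", {}).get("N", "0"))

    if last_update <= last_executed:
        return  # nothing new has landed; exit at negligible cost

    glue.start_crawler(Name=CRAWLER)
    dynamodb.update_item(
        TableName=TABLE,
        Key={"bucket": {"S": BUCKET}},
        UpdateExpression="SET last_executed = :now",
        ExpressionAttributeValues={":now": {"N": str(int(time.time()))}},
    )
```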

EXPERT
answered a month ago
