How many DPUs do Glue crawlers use? Is a recrawl policy available for catalog targets? And how can Firehose-to-Athena ingestion be optimised?

  1. I am trying to estimate the pricing of Glue crawlers for data ingested by Amazon Kinesis Data Firehose with the maximum possible buffer setting, where the crawlers are configured to run on S3 events. When looking at the crawler information for each run, the DPU counts are fractional, and the run time and DPUs used do not line up when I try to estimate the cost.

To my understanding:

  • Crawlers would be launched every 15 minutes.
  • Each crawler run takes a couple of minutes, most probably less than 10, but I am charged as if it ran for the full 10-minute minimum.
  • With 4 runs per hour, that is 40 billed minutes per hour: 40/60 × $0.44 per DPU-hour ≈ $0.30 per hour, which works out to 0.30 × 24 × 31 ≈ $223 per month for just a single crawler (see the sketch after this list).
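
A minimal sketch of that arithmetic. The $0.44 DPU-hour rate and the 10-minute billing minimum are my reading of the Glue pricing page, so verify against current pricing; rounding $0.293 up to $0.30 before multiplying is what produces the $223 figure above.

```python
# Back-of-the-envelope estimate for an event-driven crawler firing every
# 15 minutes, each run billed at the 10-minute minimum.
DPU_HOUR_RATE_USD = 0.44   # $ per DPU-hour (per Glue pricing page)
MIN_BILLED_MINUTES = 10    # minimum billed duration per crawler run
RUNS_PER_HOUR = 4          # one run every 15 minutes

billed_minutes_per_hour = RUNS_PER_HOUR * MIN_BILLED_MINUTES      # 40
cost_per_hour = billed_minutes_per_hour / 60 * DPU_HOUR_RATE_USD  # ~0.293
cost_per_month = cost_per_hour * 24 * 31                          # ~218

print(f"~${cost_per_hour:.2f}/hour, ~${cost_per_month:.0f}/month")
```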

The alternative at the moment is to run the crawler against the source once per day on a cron schedule, as in the sketch below.
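
Something like the following is what I mean by the daily alternative; the crawler name is a placeholder:

```python
import boto3

glue = boto3.client("glue")

# Switch an existing crawler (name is a placeholder) from S3-event
# triggering to a once-per-day schedule at midnight UTC.
glue.update_crawler(
    Name="firehose-landing-crawler",
    Schedule="cron(0 0 * * ? *)",
)
```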

  2. The source for the crawler is a Glue table (a Data Catalog target), so there seems to be no bookmark or recrawl policy I can enable to make the crawler visit only new partitions, as is possible for an S3 source. I want to keep the source as a Glue table for schema-enforcement purposes. A sketch of this setup follows.
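
For context, this is roughly how the crawler is defined; all names and the role ARN are placeholders. My understanding is that an S3 target would accept a RecrawlPolicy such as CRAWL_NEW_FOLDERS_ONLY, but I do not see an equivalent for CatalogTargets.

```python
import boto3

glue = boto3.client("glue")

# Crawler whose source is an existing Data Catalog table. Names and the
# role ARN are placeholders.
glue.create_crawler(
    Name="catalog-target-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    Targets={
        "CatalogTargets": [
            {"DatabaseName": "analytics", "Tables": ["firehose_events"]}
        ]
    },
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",  # reportedly required for catalog targets
    },
)
```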

  3. Are there any pointers on optimising Firehose delivery to S3 and making the data efficiently queryable in Athena? A sketch of what I have in mind follows.
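
What I have in mind is something like the following: buffer as much as possible so fewer, larger objects land in S3, and convert records to Parquet so Athena scans less data. Stream, bucket, role, table names, and the region are placeholders.

```python
import boto3

firehose = boto3.client("firehose")

# Delivery stream tuned for Athena: maximum buffering plus record format
# conversion to Parquet using the schema of the existing Glue table.
firehose.create_delivery_stream(
    DeliveryStreamName="events-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseRole",
        "BucketARN": "arn:aws:s3:::my-analytics-bucket",
        # Max buffering: flush at 128 MiB or 900 s, whichever comes first.
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 900},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {
                "Deserializer": {"OpenXJsonSerDe": {}}
            },
            "OutputFormatConfiguration": {
                "Serializer": {"ParquetSerDe": {}}
            },
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/FirehoseRole",
                "DatabaseName": "analytics",
                "TableName": "firehose_events",
                "Region": "us-east-1",
            },
        },
    },
)
```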

hRed, asked 6 months ago