Can Glue crawler be configured to include only the most recent partition in a table?

I'm brand new to Athena and Glue. We have several data sets in S3 in partitioned format, e.g. /year=yyyy/month=mm/day=dd. Some of these data sets are incremental (as you would expect for partitions), but some are simply complete snapshots of all the data we're interested in. When creating a Glue crawler, things work really nicely out of the box for the naturally incremental/partitioned data sets, and seamless tables are created automatically. For the data sets that are complete snapshots, however, we end up with lots of duplicate/old data in the tables because all the "partitions" are included in the table. For these snapshot data sets, is there some way to configure the Glue crawler to include only the most recent partition?

Asked 2 years ago, 284 views
1 Answer

Hello,

Unfortunately, as of now, the Glue crawler has no feature for crawling only the most recent partition. The closest you can get is to specify exclusion/inclusion patterns, but these are simple wildcards like * and are not sophisticated enough to express something like the current date.

However, you can try something like the following:

  1. Create a Glue table manually on the path of the current partition, e.g. /year=2022/month=06/day=01
  2. Create a Glue crawler with the above table as its source
  3. Run the crawler
  4. The next day, when the new partition day=02 arrives, run a short script like the one below, which updates the table's path/location and starts the crawler programmatically
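import boto3

glue = boto3.client('glue')

# Point the table at the newest partition. Caution: update_table overwrites the
# table definition with TableInput, so real code should start from get_table output.
glue.update_table(DatabaseName='db', TableInput={'Name': 'tbl', 'StorageDescriptor': {'Location': '<S3_Bucket>/year=2022/month=06/day=02'}})
# Re-crawl so the catalog reflects the new location.
glue.start_crawler(Name='mycrawler')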
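
A slightly fuller sketch of step 4, in case it helps. It assumes the /year=yyyy/month=mm/day=dd layout from the question and that the table already exists in the catalog. The helper names (partition_location, build_table_input, repoint_and_crawl) and the example bucket are hypothetical, and the list of keys copied into TableInput is an assumption to check against the Glue TableInput documentation. The idea is to read the current definition with get_table, change only the Location, and send the rest back unchanged, so that columns and SerDe settings are not lost.

```python
import datetime

# Keys from the get_table response that are copied into TableInput.
# Assumption: this is a subset of the documented TableInput fields; read-only
# fields such as DatabaseName, CreateTime and UpdateTime are left out.
TABLE_INPUT_KEYS = (
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
)


def partition_location(base, day):
    """Build '<base>/year=YYYY/month=MM/day=DD' for the given date."""
    return f"{base.rstrip('/')}/year={day:%Y}/month={day:%m}/day={day:%d}"


def build_table_input(table, location):
    """Copy the updatable fields of a get_table result and swap the Location."""
    table_input = {k: v for k, v in table.items() if k in TABLE_INPUT_KEYS}
    storage = dict(table_input.get("StorageDescriptor", {}))  # do not mutate the input
    storage["Location"] = location
    table_input["StorageDescriptor"] = storage
    return table_input


def repoint_and_crawl(database, table_name, crawler_name, base, day):
    """Point the table at the partition for `day`, then start the crawler."""
    import boto3  # imported here so the helpers above can be used without AWS access

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    new_location = partition_location(base, day)
    glue.update_table(
        DatabaseName=database,
        TableInput=build_table_input(table, new_location),
    )
    glue.start_crawler(Name=crawler_name)


if __name__ == "__main__":
    # Hypothetical bucket and prefix; replace with your own.
    print(partition_location("s3://example-bucket/snapshots", datetime.date(2022, 6, 2)))
```

Something like repoint_and_crawl('db', 'tbl', 'mycrawler', 's3://example-bucket/snapshots', datetime.date.today()) could then be run once a day from a scheduled job, after the new snapshot has landed.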
AWS
Support Engineer
Answered 2 years ago
