Glue crawler to exclude all files except the ones that match a pattern


My include path is s3://my-datalake/projects/. The projects folder contains these folders: daily-2022-11-05, daily-2022-11-06, incremental_123456, and incremental_234567. Each of these folders contains a parquet file. When the crawler runs, I want it to exclude everything whose name starts with incremental_.

I tried the exclude pattern incremental_**/**. It works for one crawler but not for another: when I run the second crawler, it either does not update the table or fails.

Asked 1 year ago, 729 views
1 Answer

I've tested a crawler using the same folder structure in S3 as mentioned.

Specified include path as: s3://my-datalake/projects/

Exclude pattern as: incremental_**/**
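To see which objects a pattern like this would cover before running the crawler, a rough local check can help. The Python sketch below translates only a small subset of glob syntax (`**`, `*`, `?`) into a regular expression and matches it against the key relative to the include path. It is an approximation, not Glue's own matcher; the function names and the `part-0000.parquet` file names are hypothetical.

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a small glob subset (`**`, `*`, `?`) into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # `**` may cross folder boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # `*` stays inside one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")     # `?` is one character, not a slash
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(key: str, include_prefix: str, pattern: str) -> bool:
    """Match the pattern against the key relative to the include path."""
    relative = key[len(include_prefix):] if key.startswith(include_prefix) else key
    return glob_to_regex(pattern).fullmatch(relative) is not None


if __name__ == "__main__":
    prefix = "projects/"  # key prefix of s3://my-datalake/projects/
    keys = [
        "projects/daily-2022-11-05/part-0000.parquet",        # hypothetical file name
        "projects/incremental_123456/part-0000.parquet",      # hypothetical file name
    ]
    for key in keys:
        print(key, is_excluded(key, prefix, "incremental_**/**"))
    # projects/daily-2022-11-05/part-0000.parquet False
    # projects/incremental_123456/part-0000.parquet True
```

Because the pattern is evaluated relative to the include path, the `projects/` prefix is stripped before matching.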

The exclude pattern incremental_**/** ignores all files under folders whose names start with 'incremental_'. One additional possibility is that the existing crawler has "UpdateBehavior" set to "LOG", so tables that were already created are not being dropped. You could try changing it to "UPDATE_IN_DATABASE", which will recreate the tables.
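For an existing crawler, both settings can be changed in one call. The following is a minimal Python sketch using boto3's Glue client and its `update_crawler` operation; the crawler name `my-crawler` and the helper function names are hypothetical, and the call assumes AWS credentials and a default region are already configured.

```python
def build_update_kwargs(crawler_name: str) -> dict:
    """Build the arguments for glue.update_crawler (a plain dict, no AWS call)."""
    return {
        "Name": crawler_name,
        "Targets": {
            "S3Targets": [
                {
                    "Path": "s3://my-datalake/projects/",
                    # Exclude patterns are relative to the include path.
                    "Exclusions": ["incremental_**/**"],
                }
            ]
        },
        # Switch from LOG to UPDATE_IN_DATABASE, as suggested in the answer.
        "SchemaChangePolicy": {"UpdateBehavior": "UPDATE_IN_DATABASE"},
    }


def apply_update(crawler_name: str) -> None:
    """Send the update to AWS Glue; requires boto3 and valid credentials."""
    import boto3  # imported here so the dict builder works without boto3

    glue = boto3.client("glue")
    glue.update_crawler(**build_update_kwargs(crawler_name))


# Example (hypothetical crawler name); uncomment to run against your account:
# apply_update("my-crawler")
```

Building the arguments separately from the API call makes it easy to inspect or print the configuration before anything is changed in the account.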

Reference - https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html#crawler-data-stores-exclude

AWS
Answered 1 year ago
