Glue crawler to exclude all files except the ones that match a pattern


My include path is s3://my-datalake/projects/. The projects folder contains these folders: daily-2022-11-05, daily-2022-11-06, incremental_123456, and incremental_234567. Each of these folders contains a parquet file. When the crawler runs, I want it to exclude everything whose name starts with incremental_.

I tried the exclude pattern incremental_**/**. It works for one crawler but not for another: when I run the second crawler, it either does not update the table or fails.

asked a year ago, 794 views
1 Answer

I've tested a crawler using the same folder structure in S3 as mentioned.

Specified include path as: s3://my-datalake/projects/

Exclude pattern as: incremental_**/**
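To see which objects such a pattern would skip, here is a minimal sketch that approximates glob matching in plain Python. It assumes the usual glob semantics (`**` crosses folder boundaries, `*` and `?` do not) and that the pattern is matched against the key relative to the include path. The helper names and the `part-0000.parquet` file names are hypothetical, and this is not the crawler's actual matcher.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified glob into a regex.

    Assumed semantics: '**' matches across '/', while '*' and '?' stay
    within a single path segment. Bracket and brace groups are not handled.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # may cross folder boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # stays inside one segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")     # exactly one non-slash character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(relative_key: str, pattern: str) -> bool:
    """True if the key (relative to the include path) matches the pattern."""
    return glob_to_regex(pattern).fullmatch(relative_key) is not None


# Keys relative to s3://my-datalake/projects/ (file names are made up).
keys = [
    "daily-2022-11-05/part-0000.parquet",
    "daily-2022-11-06/part-0000.parquet",
    "incremental_123456/part-0000.parquet",
    "incremental_234567/part-0000.parquet",
]

kept = [k for k in keys if not is_excluded(k, "incremental_**/**")]
print(kept)
```

Under these assumptions only the two daily-* keys remain in `kept`.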

The exclude pattern incremental_**/** is evaluated relative to the include path and ignores all files under folders whose names start with 'incremental_'. One more thing to check: the existing crawlers may have "UpdateBehavior" set to "LOG", in which case schema changes are only logged and the already created tables are left as they are. You could try changing it to "UPDATE_IN_DATABASE" so that the crawler updates the tables in the Data Catalog.
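For the second crawler, both settings can be applied in one call. The following is a rough sketch using boto3's Glue client and its `update_crawler` operation. The crawler name `my-projects-crawler`, the helper functions, and the `DeleteBehavior` value are assumptions for illustration and do not come from the question. Note that passing `Targets` replaces the crawler's existing targets, so any other paths it crawls would need to be listed as well.

```python
def build_crawler_update(name, include_path, exclusions, delete_behavior="LOG"):
    """Build the keyword arguments for Glue's update_crawler call.

    delete_behavior is an assumption here; valid values are LOG,
    DELETE_FROM_DATABASE and DEPRECATE_IN_DATABASE.
    """
    return {
        "Name": name,
        "Targets": {
            "S3Targets": [
                {"Path": include_path, "Exclusions": list(exclusions)}
            ]
        },
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": delete_behavior,
        },
    }


def apply_crawler_update(params):
    """Send the update to AWS Glue (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so building params works without boto3

    glue = boto3.client("glue")
    glue.update_crawler(**params)


# Hypothetical crawler name; include path and pattern are from the question.
params = build_crawler_update(
    name="my-projects-crawler",
    include_path="s3://my-datalake/projects/",
    exclusions=["incremental_**/**"],
)
print(params)
# apply_crawler_update(params)  # uncomment to apply against a real account
```

Building the parameters separately from the API call makes it easy to print and compare the settings of the working and the failing crawler before changing anything.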

Reference - https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html#crawler-data-stores-exclude

AWS
answered a year ago
