Glue crawler to exclude all files except the ones that match a pattern


My include path is s3://my-datalake/projects/. The projects folder contains four subfolders: daily-2022-11-05, daily-2022-11-06, incremental_123456, and incremental_234567. Each of these folders contains a Parquet file. When the crawler runs, I want it to exclude every folder whose name starts with incremental_.

I tried the exclude pattern incremental_**/**. It works for one crawler but not for another: when I run the second crawler, it either does not update the table or fails.

asked a year ago, 781 views
1 answer

I tested a crawler against the same S3 folder structure you described.

Include path: s3://my-datalake/projects/

Exclude pattern: incremental_**/**
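To illustrate which objects this pattern should skip, here is a rough Python sketch. It is not Glue's own matcher: it assumes simplified glob rules in which `**` crosses folder boundaries, `*` stays inside one path segment, and the pattern is matched against the key relative to the include path. The file name data.parquet is hypothetical.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified glob into a regex (an approximation, not Glue's code)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # '**' may cross folder boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # '*' stays within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")    # '?' matches one non-slash character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(relative_key: str, pattern: str) -> bool:
    """Return True if the key (relative to the include path) matches the pattern."""
    return glob_to_regex(pattern).fullmatch(relative_key) is not None


EXCLUDE = "incremental_**/**"

# Keys relative to s3://my-datalake/projects/ (file names are hypothetical).
keys = [
    "daily-2022-11-05/data.parquet",
    "daily-2022-11-06/data.parquet",
    "incremental_123456/data.parquet",
    "incremental_234567/data.parquet",
]

kept = [k for k in keys if not is_excluded(k, EXCLUDE)]
print(kept)  # ['daily-2022-11-05/data.parquet', 'daily-2022-11-06/data.parquet']
```

Under these assumptions only the two daily folders remain, so if a crawler still misbehaves with this pattern, the cause is likely elsewhere in its configuration.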

The exclude pattern incremental_**/** ignores all files under folders whose names start with 'incremental_'. One more thing to check: the existing crawler may have "UpdateBehavior" set to "LOG", in which case it only logs schema changes and leaves the already created tables untouched. Try changing it to "UPDATE_IN_DATABASE" so that the crawler updates the tables in the Data Catalog.
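As a minimal sketch of applying both settings with boto3, the snippet below builds the arguments for `update_crawler`. The crawler name my-crawler is hypothetical, and the call assumes AWS credentials and a region are already configured. Note that `update_crawler` replaces the crawler's targets with whatever is passed, and that what happens to tables whose data is no longer found is governed by the separate "DeleteBehavior" setting, which this sketch leaves unchanged.

```python
def build_crawler_update(name: str) -> dict:
    """Build keyword arguments for glue.update_crawler (crawler name is an assumption)."""
    return {
        "Name": name,
        "Targets": {
            "S3Targets": [
                {
                    "Path": "s3://my-datalake/projects/",
                    # Exclude patterns are relative to the include path.
                    "Exclusions": ["incremental_**/**"],
                }
            ]
        },
        "SchemaChangePolicy": {
            # Update existing tables instead of only logging the change.
            "UpdateBehavior": "UPDATE_IN_DATABASE",
        },
    }


def apply_update(name: str) -> None:
    """Send the update to AWS Glue; requires boto3 and valid credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue")
    glue.update_crawler(**build_crawler_update(name))


params = build_crawler_update("my-crawler")  # hypothetical crawler name
print(params["SchemaChangePolicy"])
```

Calling apply_update("my-crawler") and then rerunning the crawler would be the next step; comparing `get_crawler` output for the working and non-working crawlers may also show where their settings differ.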

Reference: https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html#crawler-data-stores-exclude

AWS
answered a year ago
