What should be the correct Exclude Pattern and Table level when dealing with folders with different names?

Hello,

I have an S3 bucket with the following path: "s3://a/b/c"

Inside this 'c' folder I have one folder per table, and inside each table folder I have one folder per version. Each version is a database snapshot produced weekly by a workflow. To clarify, the structure inside 'c' looks like this:

  1. products
    1. /version_0
      1. _temporary
        1. 0_$folder$
      2. part-00000-c5... ...c000.snappy.parquet
    2. /version_1
      1. _temporary
        1. 0_$folder$
      2. part-00000-c5... ...c000.snappy.parquet
  2. locations
    1. /version_0
      1. _temporary
        1. 0_$folder$
      2. part-00000-c5... ...c000.snappy.parquet
    2. /version_1
      1. _temporary
        1. 0_$folder$
      2. part-00000-c5... ...c000.snappy.parquet

I have created a crawler (Include Path set to the same path mentioned above, "s3://a/b/c") with the intention of merging all versions into one table per table folder (products, locations). The schema and the structure are always the same across the different partitions.

The _temporary folder is something automatically generated by the workflow.

What is the correct Exclude pattern to ignore everything in the _temporary folders, and do I need to set a Table Level, so that the crawler creates only ONE table per table folder (products, locations) with all versions merged?

In summary I should have 2 tables:

  1. products (containing version_0 and version_1 rows)
  2. locations (containing version_0 and version_1 rows)

I really have no way of testing the exclude patterns. Is there any Sandbox where we can actually test the glob exclude patterns? I have found one online but it doesn't seem to be similar to what AWS is using. I have tried with these exclude patterns but none worked (it still created a table for each table & each version):

  1. version*/_temporary**
  2. /**/version*/_temporary**
1 Answer

Accepted Answer

Hi. I created two crawlers for each of the include paths s3://a/b/c/products and s3://a/b/c/locations and used the exclude pattern below:

version*/_temporary**

The generated tables are correctly partitioned by version_*. In general, the Glue crawler generates multiple tables when the data schemas under the include path differ. You can enable the "Create a single schema for each S3 path" option (crawler grouping) to group compatible schemas into a single table.
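For anyone who prefers to script this, here is a minimal sketch of how the two crawlers could be described as arguments for boto3's Glue `create_crawler` call. The helper name `crawler_args`, the crawler naming scheme, the role ARN, and the database name are all hypothetical placeholders. The assumption here is that the `CombineCompatibleSchemas` grouping policy is the API counterpart of the "Create a single schema for each S3 path" console option.

```python
import json


def crawler_args(table: str, role_arn: str, database: str) -> dict:
    """Build keyword arguments for glue.create_crawler for one table folder.

    `table` is the folder name under s3://a/b/c (e.g. "products").
    `role_arn` and `database` are placeholders supplied by the caller.
    """
    return {
        "Name": f"{table}-crawler",  # hypothetical naming scheme
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [
                {
                    # One include path per table folder, as in the answer.
                    "Path": f"s3://a/b/c/{table}",
                    # Exclude pattern is written relative to the include path.
                    "Exclusions": ["version*/_temporary**"],
                }
            ]
        },
        # Crawler configuration is passed as a JSON string.
        "Configuration": json.dumps(
            {
                "Version": 1.0,
                "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
            }
        ),
    }
```

With boto3 installed and credentials configured, the result could be passed along as `boto3.client("glue").create_crawler(**crawler_args("products", role_arn, database))`, once for each of `products` and `locations`.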

There is currently no sandbox environment to test the patterns.
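In the absence of a sandbox, a rough local approximation can at least catch obvious mismatches. The sketch below is not the matcher Glue uses; it only handles `*` (stays within one folder level), `**` (crosses folder boundaries) and `?`, and it assumes that a pattern is matched against the object key relative to the crawler's include path. The function names and the parquet file name are made up for illustration.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified glob into a regex.

    Only '*', '**' and '?' are supported; brackets and braces are
    treated as literal characters in this sketch.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # '**' may cross '/' boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # '*' stays inside one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")    # '?' is exactly one non-'/' character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(relative_key: str, pattern: str) -> bool:
    """Return True if the whole key (relative to the include path) matches."""
    return glob_to_regex(pattern).fullmatch(relative_key) is not None


if __name__ == "__main__":
    pattern = "version*/_temporary**"
    for key in (
        "version_0/_temporary/0_$folder$",           # relative to s3://a/b/c/products
        "version_0/part-00000.snappy.parquet",       # hypothetical data file name
        "products/version_0/_temporary/0_$folder$",  # relative to s3://a/b/c
    ):
        print(key, is_excluded(key, pattern))
```

Under these assumptions, `version*/_temporary**` matches the `_temporary` keys only when the include path is the table folder itself. Seen from `s3://a/b/c` the key starts with `products/`, so the pattern does not match, which is consistent with the original crawler not excluding anything. A pattern such as `*/version*/_temporary**` matches in this sketch, but that has not been verified against Glue itself.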

AWS
answered 2 years ago
