What should be the correct Exclude Pattern and Table level when dealing with folders with different names?

Hello,

I have an S3 bucket with the following path: "s3://a/b/c"

Inside this 'c' folder I have one folder for each table. Then for each of these table folders I have a folder for each version. Each version is a database snapshot obtained on a weekly basis, which is run by a workflow. To clarify, the structure inside 'c' is like this:

  1. products
    1. /version_0
      1. _temporary
        1. 0_$folder$
      2. part-00000-c5... ...c000.snappy.parquet
    2. /version_1
      1. _temporary
        1. 0_$folder$
      2. part-00000-c5... ...c000.snappy.parquet
  2. locations
    1. /version_0
      1. _temporary
        1. 0_$folder$
      2. part-00000-c5... ...c000.snappy.parquet
    2. /version_1
      1. _temporary
        1. 0_$folder$
      2. part-00000-c5... ...c000.snappy.parquet

I have created a crawler (the Include Path is set to the same path mentioned above, "s3://a/b/c") with the intention of merging all the versions together into one table per table folder (products, locations). The schemas of the different partitions are always the same. The structure of the different partitions is also always the same.

The _temporary folder is something automatically generated by the workflow.

What is the correct Exclude pattern (to ignore everything in the _temporary folders), and do I need to set a Table Level, so that the crawler creates only ONE table per table folder (products, locations), merging all versions together?

In summary I should have 2 tables:

  1. products (containing version_0 and version_1 rows)
  2. locations (containing version_0 and version_1 rows)

I have no way of testing the exclude patterns. Is there a sandbox where we can test the glob exclude patterns? I found one online, but it doesn't seem to behave like the one AWS uses. I tried the exclude patterns below, but neither worked (the crawler still created a table for each table and each version):

  1. version*/_temporary**
  2. /**/version*/_temporary**

1 Answer
Accepted Answer

Hi. I created two crawlers, one for each of the include paths s3://a/b/c/products and s3://a/b/c/locations, and used the exclude pattern below:

version*/_temporary**

The generated tables are correctly partitioned by version_*. Generally, the Glue crawler generates multiple tables if the data schemas under the include path differ. You can check the "Create a single schema for each S3 path" option to group compatible schemas (see the crawler grouping documentation).
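
For anyone who prefers to script this setup, here is a minimal sketch of the two crawlers using boto3. The crawler names, role ARN, and database name are hypothetical placeholders, and the `Grouping` settings reflect my understanding of how the console option maps to the crawler configuration JSON, so treat it as a starting point rather than a verified setup:

```python
import json


def build_crawler_request(name, role_arn, database, include_path,
                          exclusions, table_level=None):
    """Build keyword arguments for the Glue create_crawler call."""
    # "CombineCompatibleSchemas" is assumed to correspond to the console
    # option "Create a single schema for each S3 path".
    grouping = {"TableGroupingPolicy": "CombineCompatibleSchemas"}
    if table_level is not None:
        # Optional table level; which level is right depends on your path
        # depth, so it is left unset in the example below.
        grouping["TableLevelConfiguration"] = table_level
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [
                # Exclude patterns are relative to the include path.
                {"Path": include_path, "Exclusions": list(exclusions)}
            ]
        },
        "Configuration": json.dumps({"Version": 1.0, "Grouping": grouping}),
    }


# One crawler per table folder, as described in the answer.
crawler_requests = [
    build_crawler_request(
        name=f"{table}-crawler",                                    # hypothetical
        role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",  # hypothetical
        database="example_db",                                      # hypothetical
        include_path=f"s3://a/b/c/{table}",
        exclusions=["version*/_temporary**"],
    )
    for table in ("products", "locations")
]


def create_crawlers(requests_to_send):
    """Send the requests to AWS; needs boto3 and valid credentials."""
    import boto3

    glue = boto3.client("glue")
    for request in requests_to_send:
        glue.create_crawler(**request)
```

Calling `create_crawlers(crawler_requests)` would create both crawlers; nothing is sent to AWS until that function is called.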

There is currently no sandbox environment to test the patterns.
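
As a rough local substitute, the sketch below approximates the documented glob rules in Python: `*` and `?` stay within one folder name, `**` crosses folder boundaries, and the pattern is matched against the object key relative to the include path. It is not the matcher AWS uses, it ignores brackets, braces, and backslash escapes, and the part file name is a hypothetical stand-in, so it is only a sanity check before running the real crawler:

```python
import re


def glob_to_regex(pattern):
    """Translate a subset of glob syntax (*, **, ?) into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # '**' may cross folder boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # '*' stays inside one folder name
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")     # '?' is exactly one non-slash character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(relative_key, patterns):
    """True if the key (relative to the include path) matches any pattern."""
    return any(glob_to_regex(p).fullmatch(relative_key) for p in patterns)


if __name__ == "__main__":
    pattern = ["version*/_temporary**"]
    # Keys relative to the include path s3://a/b/c/products
    print(is_excluded("version_0/_temporary/0_$folder$", pattern))
    print(is_excluded("version_0/part-0.snappy.parquet", pattern))  # hypothetical file
    # The same temporary object, relative to the include path s3://a/b/c
    print(is_excluded("products/version_0/_temporary/0_$folder$", pattern))
```

Under this approximation the pattern matches the temporary object only when the key is relative to the per-table include path, which is consistent with the pattern working for the two per-table crawlers but not for a single crawler on s3://a/b/c.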

AWS
answered 2 years ago
