
AWS Glue crawler creating multiple tables

0

I have the following S3 structure and want to crawl the Parquet files:

bucket/basefolder
    subfolder1
        logfolder
            log1.json
        file1.parquet
    subfolder2
        logfolder
            log2.json
        file2.parquet
        file3.parquet

I used the exclude pattern below to skip the unwanted files:

**/logfolder/**

These are the tables the crawler created:

file1.parquet
file2.parquet
file3.parquet

How can I get just one table? All of these Parquet files have exactly the same schema, and there is no partitioning. I have also ticked the checkbox "Create a single schema for each S3 path" in the crawler settings, but the result is the same.
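For reference, the checkbox in question maps to the crawler's Configuration JSON, which can also be set programmatically. A minimal sketch using boto3; the crawler name, role ARN, and database name are placeholders, and the `create_crawler` call is commented out so the snippet runs without AWS credentials:

```python
import json

# Crawler configuration equivalent to ticking
# "Create a single schema for each S3 path".
configuration = json.dumps({
    "Version": 1.0,
    "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
})

# All names below are placeholders -- substitute your own.
crawler_params = {
    "Name": "basefolder-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "my_database",
    "Targets": {
        "S3Targets": [
            {
                "Path": "s3://bucket/basefolder/",
                # Same exclude pattern as in the question.
                "Exclusions": ["**/logfolder/**"],
            }
        ]
    },
    "Configuration": configuration,
}

# import boto3
# boto3.client("glue").create_crawler(**crawler_params)
print(configuration)
```

Note that `CombineCompatibleSchemas` only merges schemas the crawler already considers compatible, which is why the answers below focus on data compatibility.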

Asked 3 years ago · 4,671 views
3 answers
0

This might be due to a data compatibility issue. By default, when a crawler defines tables for data stored in Amazon S3, it considers both data compatibility and schema similarity. Since you have already selected the option "Create a single schema for each S3 path", schema similarity is ignored in this case, but the crawler still checks for data compatibility. Please check here for more information: https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-grouping-policy

If the crawler finds that the data is incompatible, even within a single file, it will create a table for each file. Please open a support case with the Glue team and provide the crawler name, region, and sample data (if possible) so we can troubleshoot further.

AWS
Answered 3 years ago
  • I have tried both checking and unchecking that option (Create a single schema for each S3 path); the result is the same.

0

You are essentially creating the same schema twice, since you have already selected the single-schema option. Crawlers take both data compatibility and schema similarity into consideration; because your data is compatible, the crawler will not create another table. If, however, the data were not compatible, it would create a table for each file.

AWS
Answered 3 years ago
  • I have tried both checking and unchecking that option (Create a single schema for each S3 path); the result is the same.

0

You may want to set up a workflow with a Python shell job that restructures your files so that the logs live under bucket/basefolder/logfolder, and then have a crawler in the same workflow crawl the bucket. The confusion comes from the crawler not knowing how to skip between directories laid out like that. You can keep the subfolders as partitions, but that is probably not required. You might also need to set a table level for the crawler, as shown at the bottom of this doc page: https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html
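The table-level setting mentioned above also lives in the crawler's Configuration JSON. A hedged sketch; the level value assumes the layout from the question (the bucket counts as level 1, so basefolder is level 2), and the crawler name in the commented-out call is a placeholder:

```python
import json

# Table level is the absolute path depth at which the crawler creates
# tables: the bucket itself is level 1, so "bucket/basefolder" is level 2.
# Pinning the level to 2 asks for a single table rooted at basefolder.
configuration = json.dumps({
    "Version": 1.0,
    "Grouping": {"TableLevelConfiguration": 2},
})

# Applying it to an existing crawler (name is a placeholder):
# import boto3
# boto3.client("glue").update_crawler(
#     Name="basefolder-crawler",
#     Configuration=configuration,
# )
print(configuration)
```

Combined with the exclude pattern for logfolder, this should group all three Parquet files under one table, provided their data is compatible.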

Answered 3 years ago
