How do I configure the AWS Glue crawler to manage schema changes?


I want to configure the AWS Glue crawler to manage schema changes.

Resolution

To configure the crawler to manage schema changes, use either the AWS Glue console or the AWS Command Line Interface (AWS CLI).

Note: If you receive errors when you run AWS CLI commands, then see Troubleshoot AWS CLI errors. Also, make sure that you're using the most recent AWS CLI version.

Use the AWS Glue console

Complete the following steps:

  1. Open the AWS Glue console.
  2. Choose Set output and scheduling.
  3. Under Advanced options, choose one of the following configurations for the schema change that the crawler must manage:
    Update the metadata table: Update the table definition in the data catalog
    Add new columns, but not overwrite updates that you make: Add new columns only
    Not update table metadata: Ignore the change and don't update the table in the data catalog
  4. To keep the partition's metadata the same as the table metadata, turn on Update all new and existing partitions with metadata from the table.

Use the AWS CLI

Note: In the following commands, replace all example_ values with your values.

Open the AWS CLI, and run the create-crawler command for your configuration.

Have the crawler update the metadata table:

aws glue create-crawler --name "example_name" --role "example_role_arn" --database-name "example_database" --targets '{"S3Targets": [{"Path": "s3://example_bucketname/example_foldername/","Exclusions": []}],"JdbcTargets": [],"DynamoDBTargets": [],"CatalogTargets": []}' --schema-change-policy '{"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"}' --configuration '{"Version": 1, "CrawlerOutput": {"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}' --recrawl-policy '{"RecrawlBehavior": "CRAWL_EVERYTHING"}'

Have the crawler add new columns, but not overwrite updates that you make:

aws glue create-crawler --name "example_name" --role "example_role_arn" --database-name "example_database" --targets '{"S3Targets": [{"Path": "s3://example_bucketname/example_foldername/","Exclusions": []}],"JdbcTargets": [],"DynamoDBTargets": [],"CatalogTargets": []}' --schema-change-policy '{"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"}' --configuration '{"Version": 1, "CrawlerOutput": {"Tables":{"AddOrUpdateBehavior": "MergeNewColumns"},"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}' --recrawl-policy '{"RecrawlBehavior": "CRAWL_EVERYTHING"}'

Prevent the crawler from updating table metadata:

aws glue create-crawler --name "example_name" --role "example_role_arn" --database-name "example_database" --targets '{"S3Targets": [{"Path": "s3://example_bucketname/example_foldername/","Exclusions": []}],"JdbcTargets": [],"DynamoDBTargets": [],"CatalogTargets": []}' --schema-change-policy '{"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"}' --configuration '{"Version": 1, "CrawlerOutput": {"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}' --recrawl-policy '{"RecrawlBehavior": "CRAWL_EVERYTHING"}'
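The three commands above differ only in their --schema-change-policy and --configuration values. As a minimal sketch, the following helper maps a mode name to those two JSON values and applies them to an existing crawler with the update-crawler command. The mode names (update, merge, ignore) and the function names are hypothetical labels chosen for this example, not AWS terms. The sketch assumes that the crawler already exists and isn't running when you update it.

```shell
#!/usr/bin/env bash
# Sketch only: mode names and function names are illustrative assumptions.

# Print the --schema-change-policy JSON for a mode.
schema_change_policy() {
  case "$1" in
    update)       echo '{"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"}' ;;
    merge|ignore) echo '{"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"}' ;;
    *)            echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Print the --configuration JSON for a mode.
crawler_configuration() {
  case "$1" in
    merge)         echo '{"Version": 1, "CrawlerOutput": {"Tables":{"AddOrUpdateBehavior": "MergeNewColumns"},"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}' ;;
    update|ignore) echo '{"Version": 1, "CrawlerOutput": {"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}' ;;
    *)             echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Apply a mode to an existing crawler.
apply_schema_mode() {
  local crawler_name="$1" mode="$2" policy config
  policy="$(schema_change_policy "$mode")" || return 1
  config="$(crawler_configuration "$mode")" || return 1
  aws glue update-crawler \
    --name "$crawler_name" \
    --schema-change-policy "$policy" \
    --configuration "$config"
}

# Example call (not run here):
#   apply_schema_mode "example_name" merge
```

Keeping the JSON in one place makes it easier to see that the "merge" and "ignore" modes share the same schema change policy and differ only in the Tables entry of the configuration.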

To verify the crawler's schema change configuration, run the get-crawler command:

aws glue get-crawler --name "example_name"

If the SchemaChangePolicy and Configuration values in the output match your settings, then the configuration was applied.
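The full get-crawler output includes many fields. As a sketch, the following wrapper uses the AWS CLI --query option to show only the two fields that control schema changes. The function name show_schema_settings is a hypothetical name for this example.

```shell
#!/usr/bin/env bash
# Sketch only: show_schema_settings is an illustrative wrapper name.

# Print only the schema change settings of a crawler.
show_schema_settings() {
  local crawler_name="$1"
  aws glue get-crawler \
    --name "$crawler_name" \
    --query 'Crawler.{SchemaChangePolicy: SchemaChangePolicy, Configuration: Configuration}' \
    --output json
}

# Example call (not run here):
#   show_schema_settings "example_name"
```

The Configuration field is returned as a JSON string, so its contents appear escaped inside the output.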

Related information

Configuring a crawler

Preventing a crawler from changing an existing schema

Scheduling incremental crawls for adding new partitions

AWS OFFICIAL Updated 3 months ago