I want to configure an AWS Glue crawler to manage schema changes.
Resolution
To configure the crawler to manage schema changes, use either the AWS Glue console or the AWS Command Line Interface (AWS CLI).
Note: If you receive errors when you run AWS CLI commands, then see Troubleshoot AWS CLI errors. Also, make sure that you're using the most recent AWS CLI version.
Use the AWS Glue console
Complete the following steps:
- Open the AWS Glue console.
- Choose Set output and scheduling.
- Under Advanced options, choose how you want the crawler to manage schema changes:
To update the table's schema in the Data Catalog, choose Update the table definition in the data catalog.
To add new columns but not overwrite updates that you make, choose Add new columns only.
To leave the table metadata unchanged, choose Ignore the change and don't update the table in the data catalog.
- To keep the partition's metadata the same as the table metadata, turn on Update all new and existing partitions with metadata from the table.
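If you later want to make the same change to an existing crawler from the command line, you can also run the update-crawler command. The following is a minimal sketch that sets the Update the table definition in the data catalog behavior; the crawler name is an example value, and you can adjust the policy values for the other options:
aws glue update-crawler --name "example_name" --schema-change-policy '{"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"}' --configuration '{"Version": 1, "CrawlerOutput": {"Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}}}'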
Use the AWS CLI
Note: In the following commands, replace all example_ values with your values.
Open the AWS CLI, and run the create-crawler command for your configuration.
Have the crawler update the table metadata:
aws glue create-crawler --name "example_name" --role "example_role_arn" --database-name "example_database" --targets '{"S3Targets": [{"Path": "s3://example_bucketname/example_foldername/","Exclusions": []}],"JdbcTargets": [],"DynamoDBTargets": [],"CatalogTargets": []}' --schema-change-policy '{"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"}' --configuration '{"Version": 1, "CrawlerOutput": {"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}' --recrawl-policy '{"RecrawlBehavior": "CRAWL_EVERYTHING"}'
Have the crawler add new columns, but not overwrite updates that you make:
aws glue create-crawler --name "example_name" --role "example_role_arn" --database-name "example_database" --targets '{"S3Targets": [{"Path": "s3://example_bucketname/example_foldername/","Exclusions": []}],"JdbcTargets": [],"DynamoDBTargets": [],"CatalogTargets": []}' --schema-change-policy '{"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"}' --configuration '{"Version": 1, "CrawlerOutput": {"Tables":{"AddOrUpdateBehavior": "MergeNewColumns"},"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}' --recrawl-policy '{"RecrawlBehavior": "CRAWL_EVERYTHING"}'
Have the crawler not update the table metadata:
aws glue create-crawler --name "example_name" --role "example_role_arn" --database-name "example_database" --targets '{"S3Targets": [{"Path": "s3://example_bucketname/example_foldername/","Exclusions": []}],"JdbcTargets": [],"DynamoDBTargets": [],"CatalogTargets": []}' --schema-change-policy '{"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"}' --configuration '{"Version": 1, "CrawlerOutput": {"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}' --recrawl-policy '{"RecrawlBehavior": "CRAWL_EVERYTHING"}'
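The preceding create-crawler commands only create the crawler. The schema change policy is applied when the crawler runs, which you can start with the start-crawler command. The crawler name is an example value:
aws glue start-crawler --name "example_name"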
To verify the crawler's schema change configuration, run the get-crawler command:
aws glue get-crawler --name "example_name"
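Optionally, to view only the schema change settings, filter the output with the AWS CLI --query option (JMESPath). The following filter is an example and isn't required:
aws glue get-crawler --name "example_name" --query 'Crawler.{SchemaChangePolicy: SchemaChangePolicy, Configuration: Configuration}'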
If the SchemaChangePolicy and Configuration values in the output match the settings that you specified, then the crawler is configured to manage schema changes as you intended.
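For example, for a crawler that's configured to add new columns only, the relevant part of the output looks similar to the following. This snippet is illustrative and abbreviated; the actual output contains additional fields, and Configuration is returned as a JSON string:
"SchemaChangePolicy": {
    "UpdateBehavior": "LOG",
    "DeleteBehavior": "LOG"
},
"Configuration": "{\"Version\":1,\"CrawlerOutput\":{\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"},\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"}}}"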
Related information
Configuring a crawler
Preventing a crawler from changing an existing schema
Scheduling incremental crawls for adding new partitions