You need to set ScanAll to true. I agree it is not well documented, but it seems to be the correct behavior judging by the core API.
Resources:
  GlueCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: myCrawler
      DatabaseName: myGlueDatabase
      TablePrefix: myTable
      Targets:
        Type: DynamoDB
        Path: myDynamoDBTable
      SchemaChangePolicy:
        UpdateBehavior: UpdateInPlace
        DeleteBehavior: DeleteFromMetadata
      Configuration:
        ScanAll: true
Currently, the data sample can only be set to scanAll in the AWS console or through the CLI, so you would not be able to do this from CloudFormation. Scanning every record in the table can take a very long time depending on the table's size, and is generally not recommended because it can also exhaust the table's read capacity units (RCUs).
If your reason for scanning the whole table is to account for DynamoDB's non-conformant schema, then a better approach would be to export your table to S3 using the export to S3 feature. Since the table content is dumped to an external system, the crawl will not affect your table, and you will have more control over performance (you can control the reads without worrying about table limits or partition limits).
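As a rough sketch of the export step, assuming boto3 is available and point-in-time recovery is enabled on the table: the helper names, the bucket name, and the `ddb-export/` prefix are placeholders made up for illustration, not anything from this thread.

```python
def build_export_request(table_arn, bucket, prefix="ddb-export/"):
    """Build keyword arguments for DynamoDB's ExportTableToPointInTime call."""
    return {
        "TableArn": table_arn,
        "S3Bucket": bucket,          # hypothetical bucket name supplied by the caller
        "S3Prefix": prefix,          # hypothetical key prefix for the export files
        "ExportFormat": "DYNAMODB_JSON",
    }


def start_export(table_arn, bucket, prefix="ddb-export/"):
    """Start the export and return its ARN. Requires AWS credentials."""
    import boto3  # imported lazily so the builder above works without AWS access

    client = boto3.client("dynamodb")
    response = client.export_table_to_point_in_time(
        **build_export_request(table_arn, bucket, prefix)
    )
    return response["ExportDescription"]["ExportArn"]
```

The export reads from the table's backup data rather than through the table's provisioned read capacity, which is the reason it is suggested here over a full scan.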
I tried running a crawler on an S3 bucket containing a direct export from DynamoDB, but it just ran and created nothing. The crawler didn't fail, but it didn't create a data catalog. Could you clarify how you might configure a crawler to run on a DDB export like you mentioned?
I spoke with AWS support, and they said the feature isn't currently implemented. As of now, you can only provide a 'Path' value when creating crawler resources via YAML/JSON in CloudFormation. The only workaround is to run an 'update-crawler' CLI command from a script or pipeline after deploying the resource.
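For anyone scripting that post-deploy step, here is a minimal sketch using boto3's `update_crawler`, which is the SDK counterpart of the `update-crawler` CLI command. The helper function names are made up for illustration, and the crawler and table names simply reuse the ones from the template above. Note the lowercase `scanAll` key in the Glue API's DynamoDB target.

```python
def build_update_crawler_request(crawler_name, table_name, scan_all=True):
    """Build keyword arguments for Glue's UpdateCrawler call."""
    return {
        "Name": crawler_name,
        "Targets": {
            "DynamoDBTargets": [
                # 'scanAll' is spelled with a lowercase 's' in the Glue API
                {"Path": table_name, "scanAll": scan_all},
            ]
        },
    }


def enable_scan_all(crawler_name="myCrawler", table_name="myDynamoDBTable"):
    """Apply the update after the stack has deployed. Requires AWS credentials."""
    import boto3  # imported lazily so the builder above works without AWS access

    glue = boto3.client("glue")
    glue.update_crawler(**build_update_crawler_request(crawler_name, table_name))
```

Because CloudFormation does not know about this change, a later stack update that touches the crawler's targets may revert it, so the script probably needs to run after every deploy.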
Interesting... Take a look at the docs for providing a crawler configuration: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-glue-crawler.html#aws-resource-glue-crawler--examples
What would that look like for ScanAll? We need to specify Configuration -> something -> ScanAll: true
Configuration is a string: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-glue-crawler.html#cfn-glue-crawler-configuration
Yes, but what 'category' would ScanAll fall under in Configuration (e.g. table, partition, grouping, etc.)? Also, I opened an official case with AWS support, and they said that the 'ScanAll' feature simply isn't implemented for CloudFormation resources created via YAML/JSON.
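To illustrate the point about Configuration being a string, here is a small sketch that builds one as JSON. The keys are the partition and grouping options as I understand them from the Glue crawler configuration docs, so treat them as an assumption and check them against the linked page. The function name is made up. As far as I can tell, none of the documented categories has a ScanAll entry.

```python
import json


def build_crawler_configuration():
    """Return a JSON string suitable for the crawler's Configuration property."""
    config = {
        "Version": 1.0,
        "CrawlerOutput": {
            # New partitions inherit their schema from the table
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
        },
        # Combine compatible schemas into a single table
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }
    return json.dumps(config)


if __name__ == "__main__":
    print(build_crawler_configuration())
```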