How to specify 'ScanAll' procedure for AWS::Glue::Crawler DynamoDBTarget

0

I was looking at the Glue Crawler resource creation docs and came across the DynamoDB Target object: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-crawler-dynamodbtarget.html

The only allowed parameter for a DynamoDB Target of an AWS Glue Crawler resource is 'Path'. Interestingly, when I deployed my crawler, I noticed that the 'data sampling' setting was automatically enabled for my DDB data source. This is NOT the setting I want, so I am looking for a way to specify that the crawler should scan the entire data source (the DDB table).

asked a year ago · 300 views
3 Answers
1

You need to set ScanAll to true. I agree it is not well documented, but this appears to be the correct behavior based on the underlying Glue API.

Resources:
  GlueCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: myCrawler
      DatabaseName: myGlueDatabase
      TablePrefix: myTable
      Targets:
        DynamoDBTargets:
          - Path: myDynamoDBTable
      SchemaChangePolicy:
        UpdateBehavior: UPDATE_IN_DATABASE
        DeleteBehavior: DELETE_FROM_DATABASE
      Configuration:
        ScanAll: true
AWS
EXPERT
answered a year ago
0

Currently, the data sampling setting can only be changed to scanAll through the AWS console or CLI, so you would not be able to do this from CloudFormation. Scanning all the records in the table can take a very long time depending on the table's size, and it is generally not recommended because it can also exhaust the table's read capacity units (RCUs).

If your intention in scanning the whole table is to account for DynamoDB's non-conformant schema, then a better approach would be to export your table to S3 using the Export to S3 feature. Since the table contents are dumped to an external system, the export will not affect your table, and you will have more control over performance (you can control the reads without worrying about table limits or partition limits).
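For reference, here is a minimal sketch of kicking off such an export with boto3. The table ARN, bucket name, and prefix below are placeholder values, and the table must have point-in-time recovery (PITR) enabled for this call to succeed.

import boto3

dynamodb = boto3.client("dynamodb")

# Export the table's point-in-time state to S3 (requires PITR on the table).
# TableArn, S3Bucket, and S3Prefix are example values; substitute your own.
response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/myDynamoDBTable",
    S3Bucket="my-export-bucket",
    S3Prefix="ddb-exports/",
    ExportFormat="DYNAMODB_JSON",
)

# The export runs asynchronously; track its progress via the returned ARN.
print(response["ExportDescription"]["ExportArn"])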

AWS
odwa_y
answered a year ago
  • I tried running a crawler on an S3 bucket containing a direct export from DynamoDB, but it just ran and created nothing. The crawler didn't fail, but it didn't create anything in the Data Catalog. Could you clarify how you would configure a crawler to run on a DDB export as you mentioned?

0

Spoke with AWS support; they said the feature isn't currently implemented. As of now, you can only provide a 'Path' value when creating crawler resources via YAML/JSON in CloudFormation. The only workaround is to run an 'update-crawler' CLI command via a script or pipeline after deploying the resource.
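As a rough sketch of that post-deployment step, here is the equivalent call using boto3 instead of the raw CLI (the crawler and table names are the example values from the answer above; the Glue UpdateCrawler API accepts a scanAll flag on each DynamoDB target):

import boto3

glue = boto3.client("glue")

# Re-point the already-deployed crawler at the same table, but with
# scanAll enabled so the whole table is crawled instead of a sample.
glue.update_crawler(
    Name="myCrawler",
    Targets={
        "DynamoDBTargets": [
            {"Path": "myDynamoDBTable", "scanAll": True}
        ]
    },
)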

answered a year ago
