Data ingestion using DMS or Glue?


A customer is setting up an IoT big data analytics platform on AWS. In Phase 1 of their design, an on-prem SQL Server data warehouse will send data to AWS in near real time. Once the data is in AWS, it is processed, analyzed, and visualized.

The customer's questions are as follows:

  1. What is the best way to send this data into AWS in near real time?
  • Use DMS (CDC) to land the data in a staging S3 bucket, then have Glue catalog it and run ETL on it, OR
  • Have Glue consume the data directly from the source using a crawler and an ETL job? Note that the customer does not have Direct Connect and currently uses a VPN.

  2. Which is better to use, and why: CDC or triggers? (I know this is a database/application-level question, but they wanted our opinion on it.)

  3. Are there any best practices that customers follow when working with Glue (ETL, crawlers, jobs, etc.)? Links to documentation are more than welcome!

1 Answer
Accepted Answer

Glue does not support true CDC, but it can bring in new rows from a database table using Glue job bookmarks. See: https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
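To make the bookmark behavior concrete, here is a minimal pure-Python sketch that only imitates the idea of a bookmark on a monotonically increasing key. It does not call Glue; the function `read_new_rows`, the `id` key, and the sample rows are all hypothetical and exist only for illustration.

```python
# Illustrative simulation only: this is NOT the Glue API.
# It mimics an incremental read that remembers the highest key seen so far.

def read_new_rows(table, bookmark, key="id"):
    """Return copies of rows whose key is above the bookmark, plus the new bookmark."""
    new_rows = [dict(row) for row in table if row[key] > bookmark]
    new_bookmark = max((row[key] for row in new_rows), default=bookmark)
    return new_rows, new_bookmark


# First run: both rows are new.
source = [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]
first_batch, bookmark = read_new_rows(source, bookmark=0)

# Between runs: row 1 is updated, row 2 is deleted, row 3 is inserted.
source = [{"id": 1, "status": "shipped"}, {"id": 3, "status": "new"}]
second_batch, bookmark = read_new_rows(source, bookmark)

print(second_batch)
```

In this sketch the second run returns only the inserted row. The update to row 1 and the deletion of row 2 never reach the target, which is the gap described here.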

So if rows in the source database table are updated or deleted and you need to capture those changes in the S3 data lake, then DMS is the only option here.
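As a rough sketch of what handling CDC output involves downstream: when DMS writes change data to S3, each record carries an operation indicator (I, U, or D). The pure-Python example below merges such records into a snapshot. The column name `Op`, the `id` key, the helper `apply_changes`, and the sample data are assumptions for illustration; a real pipeline would read the files from the staging bucket and would typically do this merge in a Glue or Spark job.

```python
# Illustrative merge of CDC-style change records into a snapshot.
# Assumption: each change is a dict with an "Op" field of "I", "U", or "D".

def apply_changes(snapshot, changes, key="id"):
    """Apply inserts, updates, and deletes in order; return rows sorted by key."""
    current = {row[key]: dict(row) for row in snapshot}
    for change in changes:
        op = change["Op"]
        row = {k: v for k, v in change.items() if k != "Op"}
        if op in ("I", "U"):
            current[row[key]] = row          # insert or overwrite the row
        elif op == "D":
            current.pop(row[key], None)      # drop the row if it exists
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return sorted(current.values(), key=lambda r: r[key])


snapshot = [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]
changes = [
    {"Op": "U", "id": 1, "status": "shipped"},
    {"Op": "D", "id": 2, "status": "new"},
    {"Op": "I", "id": 3, "status": "new"},
]
merged = apply_changes(snapshot, changes)
print(merged)
```

Changes are applied in the order given, so the change files need to be processed in commit order for the result to match the source.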

AWS
answered 4 years ago
  • Is it still true today that Glue does not support updates and deletes?

  • Same question as Ryane. The answer still seems to be that capturing state updates to the dataset requires DMS CDC. Hoping that's not the case; dealing with stateful vs. stateless data seems like a common problem.
