Data ingestion using DMS or Glue?

A customer is setting up an IoT / big data analytics platform on AWS. In phase 1 of their design, they have an on-premises SQL Server data warehouse that will send data to AWS in near real time. Once the data is in AWS, it is processed, analyzed, and visualized.

The customer's questions are as follows:

  1. What is the best way to send this data into AWS in near real time?
  • Use DMS (CDC) to store the data in a staging bucket, then have Glue catalog and ETL it, OR
  • Have Glue consume the data directly from the source, using a crawler and an ETL job?

  Note that the customer does not have Direct Connect and currently uses a VPN.

  2. Which is better to use, and why: CDC or triggers? (I know this is a database/application-level question, but they wanted our opinion on it.)

  3. Are there any best practices that customers follow when working with Glue, for ETL, crawlers, jobs, and so on? (Links to documentation are more than welcome!)

1 Answer
Accepted Answer

Glue does not support true CDC, but it can bring in new rows from a database table using Glue job bookmarks. See: https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html

So if rows in the source database table are updated or deleted and you need to capture those changes in the S3 data lake, then DMS is the only one of these two options that will work.
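To make the difference concrete, here is a minimal sketch in plain Python (no AWS calls) that contrasts a bookmark-style incremental read with replaying CDC records. It is only an illustration under assumptions: `incremental_pull` roughly models a bookmark as a high-water mark on an increasing key, and `apply_cdc` assumes change records that carry an `Op` field with `I`, `U`, or `D`, which is the operation indicator DMS writes in its CDC output to S3. The `id` and `temp` columns and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, int]


def incremental_pull(source: List[Row], last_seen_id: int) -> Tuple[List[Row], int]:
    """Bookmark-style read: return only rows whose key is above the saved high-water mark."""
    new_rows = [r for r in source if r["id"] > last_seen_id]
    new_mark = max([last_seen_id] + [r["id"] for r in new_rows])
    return new_rows, new_mark


def apply_cdc(snapshot: Dict[int, Row], changes: Iterable[dict]) -> Dict[int, Row]:
    """Replay change records in order; 'Op' is I (insert), U (update), or D (delete)."""
    result = dict(snapshot)
    for change in changes:
        row = {k: v for k, v in change.items() if k != "Op"}
        if change["Op"] == "D":
            result.pop(row["id"], None)
        else:  # I and U both upsert the row by key
            result[row["id"]] = row
    return result


# Initial load: both approaches start from the same copy of the table.
source = [{"id": 1, "temp": 20}, {"id": 2, "temp": 21}]
rows, mark = incremental_pull(source, last_seen_id=0)
lake_by_id = {r["id"]: r for r in rows}

# The source then changes: row 1 is updated, row 2 is deleted, row 3 is inserted.
source = [{"id": 1, "temp": 25}, {"id": 3, "temp": 19}]

# The bookmark-style pull only sees row 3, so the stale rows 1 and 2 remain.
new_rows, mark = incremental_pull(source, mark)
bookmark_view = dict(lake_by_id)
bookmark_view.update({r["id"]: r for r in new_rows})

# The CDC stream carries all three changes explicitly.
cdc_changes = [
    {"Op": "U", "id": 1, "temp": 25},
    {"Op": "D", "id": 2, "temp": 21},
    {"Op": "I", "id": 3, "temp": 19},
]
cdc_view = apply_cdc(lake_by_id, cdc_changes)
```

After this runs, `bookmark_view` still holds the old value for row 1 and the deleted row 2, while `cdc_view` matches the source. In a real pipeline the replay step would typically run as a Glue or Spark job over the files DMS lands in the staging bucket.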

AWS
Answered 4 years ago
  • Is it still true today that Glue does not support updates and deletes?

  • Same question as Ryane. The answer still seems to be that capturing state updates to the dataset requires DMS CDC. I am hoping that's not the case; dealing with stateful vs. stateless data seems like a common problem.
