Data ingestion using DMS or Glue?


A customer is setting up an IoT/Big Data analytics platform on AWS. In Phase 1 of their design, an on-premises SQL Server data warehouse will send data to AWS in near real time. Once the data is in AWS, it is processed, analyzed, and visualized.

The customer's questions are as follows:

  1. What is the best way to send this data into AWS in near real time?
  • Use DMS (CDC) to land the data in a staging bucket, then have Glue catalog and ETL it, OR

  • Have Glue consume it directly using a crawler and ETL it? Note that the customer does not have Direct Connect and currently uses a VPN.

  2. Which is better to use and why: CDC or triggers? (I know this is a database/application-level question, but they wanted our opinion on it.)

  3. Are there any best practices that customers follow when working with Glue, for ETL, crawlers, jobs, etc.? (Links to documentation are more than welcome!)

1 Answer
Accepted Answer

Glue does not support true CDC, but it can bring in new rows from a database table using Glue job bookmarks. See: https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
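
To illustrate why a bookmark is not the same as CDC, here is a minimal plain-Python sketch. It does not use the Glue API; `BookmarkReader` is a hypothetical stand-in that assumes the bookmark simply remembers the highest key already read, and the table contents are made up.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BookmarkReader:
    """Toy stand-in for a job bookmark: remembers the highest key already read.

    Hypothetical helper for illustration only; this is not the Glue API.
    """
    last_key: int = 0

    def read_new_rows(self, table: Dict[int, str]) -> Dict[int, str]:
        # Only rows whose key is beyond the bookmark are returned.
        new_rows = {k: v for k, v in table.items() if k > self.last_key}
        if new_rows:
            self.last_key = max(new_rows)
        return new_rows


# Made-up source table keyed by an increasing id.
source = {1: "sensor-a", 2: "sensor-b"}
reader = BookmarkReader()

# First run copies rows 1 and 2 into the "lake".
lake = dict(reader.read_new_rows(source))

# Changes on the source between runs.
source[3] = "sensor-c"       # insert
source[1] = "sensor-a-v2"    # update to an already-read row
del source[2]                # delete of an already-read row

# Second run only sees the newly inserted row.
second_run = reader.read_new_rows(source)
lake.update(second_run)
```

After the second run, the lake copy has picked up the new row 3, but it still holds the old value for row 1 and still contains the deleted row 2.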

So if rows in the source table are updated or deleted and you need to capture those changes in the S3 data lake, DMS is the only option here.
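
As a rough sketch of what a downstream job could do with DMS CDC output: DMS CDC files written to S3 carry an operation flag (`I`, `U`, or `D`) as the first field of each record. The example below assumes CSV records with a hypothetical `op,id,value` layout and uses made-up data; `apply_cdc` is an illustrative helper, not part of DMS or Glue.

```python
import csv
import io
from typing import Dict


def apply_cdc(snapshot: Dict[str, str], cdc_csv: str) -> Dict[str, str]:
    """Replay CDC records (op, id, value) in order on top of a full-load snapshot."""
    state = dict(snapshot)
    for op, key, value in csv.reader(io.StringIO(cdc_csv)):
        if op in ("I", "U"):
            # Inserts and updates both set the latest value for the key.
            state[key] = value
        elif op == "D":
            # Deletes remove the key if it is present.
            state.pop(key, None)
    return state


# Made-up full-load snapshot and a small batch of change records.
full_load = {"1": "sensor-a", "2": "sensor-b"}
changes = "I,3,sensor-c\nU,1,sensor-a-v2\nD,2,sensor-b\n"

current = apply_cdc(full_load, changes)
```

Because the change records include updates and deletes, the replayed state matches the source table, which a new-rows-only read cannot guarantee.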

AWS
answered 4 years ago
  • Is this still true today about Glue not supporting updates and deletes?

  • Same question as Ryane. The answer still seems to be that capturing state updates to the dataset requires DMS CDC. Hoping that's not the case; dealing with stateful vs. stateless data seems like a common problem?
