ETL Workflow Orchestration: Step Functions and/or Glue Workflows?


I have a customer who is doing a low-level design for their data lake. They want to use AWS-native services wherever possible, and they have a question about ETL orchestration best practices on AWS. They were looking at Step Functions, but since Glue Workflows have been available since Jun 2019, they are wondering which to use, or whether to combine the two. Of course, they are looking for the easy button. Here are their primary requirements.

  1. ETL orchestration - Step Functions vs. Glue Workflows
    1. ~150 sources, all sending files at various times
    2. Source systems have concurrency limits that the scheduling tool must support
    3. Example: a maximum of 10 concurrent jobs for the ACME source. The scheduling tool should poll and submit jobs so that 10 jobs stay active, but never more than 10
    4. ETL jobs should be built from a parameterized template: they pass in parameters such as source, table name, and date, and the job is built automatically, rather than maintaining a library of jobs/scripts per source/table. They want this to be built dynamically
    5. Alerts on ETL processing
      1. CloudWatch alerts to SNS topics for the ETL teams on failures
      2. CloudWatch alerts to SNS for business users (loads complete)
      3. etc.
    6. Support downstream jobs/ETL. Example: once load file A and load file B have completed for the day, load file C should launch, and so on
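
To make requirements 3 and 4 concrete, here is a minimal sketch of what "one parameterized template plus a per-source concurrency cap" could look like. It is only an illustration under stated assumptions, not a recommendation of Step Functions or Glue Workflows: the parameter names (`--source`, `--table_name`, `--load_date`) and the `submit` callback are hypothetical. In a real setup `submit` might wrap the Glue `start_job_run` call with these arguments, and the running count might come from `get_job_runs`.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


def build_job_arguments(source: str, table_name: str, load_date: str) -> Dict[str, str]:
    """Build the argument map for one run of a single shared job template.

    Glue job arguments use "--name" keys. The key names here are assumptions
    chosen for illustration, not names required by any AWS service.
    """
    return {
        "--source": source,
        "--table_name": table_name,
        "--load_date": load_date,
    }


def fill_slots(
    pending: Deque[Dict[str, str]],
    running_count: int,
    max_concurrent: int,
    submit: Callable[[Dict[str, str]], str],
) -> List[str]:
    """Submit pending loads until the source reaches its concurrency cap.

    `submit` is a hypothetical callback that starts one job run and returns
    its run id. Loads that do not fit stay in `pending` for the next poll.
    """
    free_slots = max(0, max_concurrent - running_count)
    started: List[str] = []
    while free_slots > 0 and pending:
        started.append(submit(pending.popleft()))
        free_slots -= 1
    return started


if __name__ == "__main__":
    # 15 files waiting for the ACME source, 7 jobs already running, cap of 10.
    queue = deque(
        build_job_arguments("ACME", f"table_{i}", "2019-06-30") for i in range(15)
    )
    run_ids = fill_slots(
        queue,
        running_count=7,
        max_concurrent=10,
        submit=lambda args: "run-" + args["--table_name"],  # stand-in for a real job start
    )
    print(run_ids, len(queue))
```

A scheduler would call `fill_slots` once per poll for each source, so the cap is a property of the source rather than of any individual job script.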
1 Answer

Accepted Answer

Hey Dave,

It sounds like a perfect use case for Glue, especially because the quantities and the concurrency are not too massive.

Good luck! Ido

Answered 5 years ago
  • Hi Dave, Thank you for your response. I still have some doubts regarding the parameters you mentioned in the first point, and I believe you can assist me with that. Currently, I am using a Glue Workflow to process my data, which is triggered by an Event using EventBridge. This workflow starts with a Glue Job that fetches the last file uploaded to the S3 bucket being monitored, which triggers the event. However, I'm unsure how to adapt my Glue Job to get the exact file that triggered the workflow. I am concerned about how the workflow will handle multiple files being uploaded simultaneously.
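
One possible way to tie each workflow run to the exact file that triggered it is sketched below. Note that this is a different design from the setup described in the comment: instead of letting EventBridge start the workflow directly, it assumes a small function (for example a Lambda target on the rule) that reads the bucket and key from the event and starts the workflow with them as run properties. The property names `source_bucket` and `source_key` are hypothetical, and `glue_client` is assumed to be a boto3 Glue client (`boto3.client("glue")`) passed in by the caller. Because every uploaded file starts its own run with its own properties, simultaneous uploads should not be confused with each other.

```python
from typing import Any, Dict, Tuple


def extract_s3_object(event: Dict[str, Any]) -> Tuple[str, str]:
    """Return (bucket, key) from an EventBridge event for an S3 upload.

    Handles two shapes: the direct S3 "Object Created" notification
    (detail.bucket.name / detail.object.key) and an S3 API call recorded
    via CloudTrail (detail.requestParameters.bucketName / .key).
    """
    detail = event["detail"]
    if "bucket" in detail and "object" in detail:
        return detail["bucket"]["name"], detail["object"]["key"]
    params = detail["requestParameters"]
    return params["bucketName"], params["key"]


def start_run_for_file(glue_client: Any, workflow_name: str, event: Dict[str, Any]) -> str:
    """Start one workflow run per uploaded file, recording which file it is for.

    The run property names are assumptions made for this sketch.
    """
    bucket, key = extract_s3_object(event)
    response = glue_client.start_workflow_run(
        Name=workflow_name,
        RunProperties={"source_bucket": bucket, "source_key": key},
    )
    return response["RunId"]


def read_triggering_file(glue_client: Any, workflow_name: str, run_id: str) -> Tuple[str, str]:
    """Inside the Glue job: look up the file recorded for this workflow run.

    In a job started by a workflow, the workflow name and run id are usually
    read from the WORKFLOW_NAME and WORKFLOW_RUN_ID job arguments.
    """
    props = glue_client.get_workflow_run_properties(
        Name=workflow_name, RunId=run_id
    )["RunProperties"]
    return props["source_bucket"], props["source_key"]
```

With this arrangement the first job no longer needs to guess by fetching the last file uploaded to the bucket; it processes the key stored on its own run.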
